Output quality: common evaluation framework & materials (SUS)
February 7, 2016 at 01:23 #2458
When evaluating the output quality of a system using human listeners:
1) If each TTS engineer chooses the evaluation metric and test design arbitrarily, then there is no common ‘rule of conduct’ for experimental design. Benoît et al. (Benoît, C., Grice, M. & Hazan, V. (1996). “The SUS test: a method for the assessment of text-to-speech synthesis intelligibility using Semantically Unpredictable Sentences”. Speech Communication, vol. 18, pp. 381-392) emphasise the importance of the test material collection for the outcome of an evaluation, when describing their Semantically Unpredictable Sentences (SUS) test and recommending the optimal way to use it. Is SUS, or any other common framework, actually used by researchers, so that comparisons between different systems’ performance are easily accessible (without having to design a new test each time the systems of interest are compared)?
2) Semantic predictability
I wonder whether there is any point in conducting an evaluation with semantically unpredictable sentences. Yes, I suppose it is more important to actually hear the right word than to be able to guess it from semantic context, but in the SUS case I think the task becomes more difficult cognitively (rather than acoustically), especially for non-native listeners. However, acontextual sentences are a poor fit for how TTS is used in most commercial applications – which is, ultimately, the reason TTS exists. Are there intelligibility tests that use situations mimicking the intended applications? Thanks!
February 7, 2016 at 12:00 #2546
In the field of speech coding there are standardised tests and methodologies for evaluating codecs. This standardisation is driven by commercial interests: both those who invent new codecs and those who use them (e.g., telecoms or broadcasters).
But in speech synthesis there appears to be no commercial demand for equivalent standardised tests. Commercial producers of speech synthesisers never reveal the evaluation results for their products (the same is true of automatic speech recognition).
There are, however, conventions and accepted methods for evaluation that are widely used in research and development. The SUS test is one such convention and is fairly widely used (although Word Error Rate, rather than Sentence Error Rate, is usually reported).
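As an aside, here is a minimal sketch of how Word Error Rate is typically computed from listeners’ typed responses to SUS stimuli, using a standard edit-distance alignment over words. The example sentences and the function name are illustrative assumptions, not part of any particular test protocol.

```python
# Minimal sketch: Word Error Rate (WER) over a listener's typed response to a SUS stimulus.
# The edit-distance alignment is standard; the data format here is an illustrative assumption.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one SUS-style prompt and one listener's transcription.
prompt = "the table walked through the blue truth"
typed  = "the table walked through blue tooth"
print(f"WER = {word_error_rate(prompt, typed):.2f}")   # 2 errors / 7 words, about 0.29
```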
The Blizzard Challenge is the only substantial effort to make fair comparisons across multiple systems. The listening test design in the Blizzard Challenge is straightforward (it includes a section of SUS) and is widely used by others. The materials (speech databases + text of the test sentences) are publicly available and are also quite widely used. This is a kind of de facto standardisation.
February 7, 2016 at 12:13 #2547
The reason for using SUS is, of course, to avoid a ceiling effect in intelligibility. But you are not the first person* to suggest that SUS are highly unnatural and to wonder how much a SUS test actually tells us about real-world intelligibility.
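For readers who have not seen SUS before, the sketch below illustrates the general idea behind generating such materials: a fixed syntactic frame whose slots are filled with words chosen at random, so that the sentence is grammatical but offers no semantic context to guess from. The template and word lists here are invented for illustration; they are not the controlled frames and lexicons specified by Benoît et al.

```python
# Rough illustrative sketch of SUS-style material generation: fill a fixed syntactic
# frame with randomly chosen words. Real SUS tests use several carefully designed
# frames and frequency-controlled word lists; these are invented for illustration.
import random

TEMPLATE = "The {adj} {noun1} {verb} the {noun2}."
WORDS = {
    "adj":   ["green", "sudden", "hollow", "quiet"],
    "noun1": ["table", "reason", "harbour", "pencil"],
    "verb":  ["drinks", "climbs", "paints", "forgets"],
    "noun2": ["cloud", "window", "spoon", "theory"],
}

def make_sus(rng: random.Random) -> str:
    return TEMPLATE.format(
        adj=rng.choice(WORDS["adj"]),
        noun1=rng.choice(WORDS["noun1"]),
        verb=rng.choice(WORDS["verb"]),
        noun2=rng.choice(WORDS["noun2"]),
    )

rng = random.Random(0)
for _ in range(3):
    print(make_sus(rng))   # e.g. "The hollow reason paints the spoon."
```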
A slightly more ecologically valid example would be evaluating the intelligibility of synthetic speech in noise, where SUS would be too difficult and ‘more normal’ sentences could be used instead. But such tests are still generally done in the lab, with artificially-added noise. They could hardly be called ecologically valid.
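For concreteness, here is a minimal sketch of what ‘artificially-added noise’ usually amounts to: scaling a noise waveform so that, when added to the synthetic speech, the mixture has a chosen signal-to-noise ratio. The array names and toy signals are assumptions for illustration only, not a description of any particular test setup.

```python
# Minimal sketch: mix a synthetic-speech waveform with noise at a target SNR (in dB).
# Array names and sampling details are assumptions for illustration only.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add."""
    noise = noise[: len(speech)]                       # trim noise to the speech length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled_noise

# Toy example with synthetic signals (in a real test these would be loaded waveforms).
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)   # stand-in "speech"
noise = rng.standard_normal(16000)                            # white noise
noisy = mix_at_snr(speech, noise, snr_db=0.0)                 # 0 dB SNR: equal power
```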
You ask whether there are “intelligibility tests using situations that mimic the desired applications”. This would certainly be desirable, and commercial companies might do this as part of usability testing. Unfortunately, mimicking the end application is a lot of work, which makes the test slow and expensive. And once we start evaluating the synthetic speech as part of a final application, it becomes harder to separate out the underlying causes of users’ responses. At this point we reach the limit of my expertise; it would be better to ask an expert, such as Maria Wolters.
* Paul Taylor always told me he was very sceptical of SUS intelligibility testing. He asserted that all commercial systems were already at ceiling intelligibility in real-world conditions, so there was no point measuring it; researchers should focus on naturalness instead. I agree with him as far as listening in quiet conditions is concerned, but synthetic speech is certainly not at ceiling intelligibility when heard in noise.