Forum Replies Created
Could you clarify the question a little? I’m not sure what you mean by “a dimension corresponding to naturalness, and a second principal dimension strongly corresponding to prosodic naturalness”.
If you want to try an objective measure (perhaps to see if it correlates with your listeners’ judgements), here’s a Python implementation of Mel Cepstral Distortion (MCD) by Matt Shannon.
This requires some skill in compiling code (if you do this, please post information here); it is entirely optional and certainly not required for this assignment.
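For reference, the core calculation is quite simple. Here is a minimal sketch of the standard MCD formula (this is not Matt Shannon’s implementation), assuming you already have two time-aligned matrices of mel-cepstral coefficients with the 0th (energy) coefficient excluded; in practice the sequences are usually aligned first, e.g. with dynamic time warping.

```python
import numpy as np

def mel_cepstral_distortion(ref, synth):
    """Frame-averaged MCD in dB between two aligned mel-cepstral sequences.

    ref, synth : arrays of shape (num_frames, num_coeffs), already time-aligned,
                 with the 0th (energy) coefficient excluded.
    """
    assert ref.shape == synth.shape, "sequences must be time-aligned"
    diff = ref - synth
    # Standard MCD formula: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over frames
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# Illustrative usage with random stand-in data; in real use you would extract
# mel-cepstra from the natural reference and the synthetic speech, then align them.
ref = np.random.randn(200, 24)
synth = ref + 0.1 * np.random.randn(200, 24)
print(f"MCD: {mel_cepstral_distortion(ref, synth):.2f} dB")
```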
We will look at how to calibrate the materials in the lecture.
We’ll spell this out in the lecture.
We’ll look at various ways of calibrating listeners’ responses in the lecture.
Including a control trial sounds like the right thing to do. In almost all types of evaluation, we need something to compare to:
- We explicitly ask listeners to make a comparison and tell us their judgement
- We compare the responses of listeners across several conditions
The second case applies here. If we cannot control for unknown factors (e.g., working memory) then we have to make sure those factors have the same value across all conditions that we wish to compare. In your example, we would have the same subjects perform the same task, once on the synthetic speech, and once on natural speech, then we would quantify the difference in their responses (e.g., accuracy in a comprehension test).
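As a concrete illustration, here is a minimal sketch of how one might compare per-listener comprehension accuracy across the two conditions with a paired test. The accuracy scores are invented, and scipy is assumed to be available.

```python
import numpy as np
from scipy import stats

# Hypothetical per-listener comprehension accuracy (proportion correct),
# each listener tested once on natural and once on synthetic speech.
natural   = np.array([0.92, 0.88, 0.95, 0.90, 0.85, 0.93, 0.89, 0.91])
synthetic = np.array([0.84, 0.80, 0.90, 0.83, 0.79, 0.88, 0.82, 0.86])

# Paired test, because the same listeners did the same task in both conditions:
# this holds listener-specific factors (e.g., working memory) constant.
t_stat, p_value = stats.ttest_rel(natural, synthetic)
print(f"mean difference = {np.mean(natural - synthetic):.3f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```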
You are correct in saying that it’s very hard to trace a problem in the output speech back to a specific component in the synthesiser, especially when the text processing has the typical pipeline architecture.
Where possible, we will perform component-level testing and, if we are lucky, that can be done objectively (without humans) by comparing output to a gold-standard reference.
Otherwise, what we have to do is make a hypothesis, then create an experiment to test (or refute) that hypothesis. In general, this is going to require us to create two or more systems that differ in a specific way (e.g., the pronunciation dictionary is or is not carefully tuned to the speaker), then compare them in a listening test.
Unit tests (actually, component tests – as discussed in this thread) are used during system development. For example, we might iteratively test and improve the letter-to-sound module in the front end until we can no longer improve its accuracy on a test set of out-of-vocabulary words. This component testing provides more useful information than system testing, because we know precisely which component is causing the errors we observe, so we know where improvements are needed.
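As a sketch of what such a component test could look like in practice, here is a minimal word-accuracy check against a gold-standard lexicon. The predict_pronunciation argument and the tiny test set are hypothetical, standing in for your front end’s letter-to-sound module and a held-out list of out-of-vocabulary words.

```python
def word_accuracy(lexicon, predict_pronunciation):
    """Proportion of test words whose predicted pronunciation exactly
    matches the gold-standard pronunciation.

    lexicon : dict mapping out-of-vocabulary test words to reference
              pronunciations (tuples of phones)
    predict_pronunciation : hypothetical letter-to-sound function,
              word -> sequence of phones
    """
    correct = sum(
        1 for word, gold in lexicon.items()
        if tuple(predict_pronunciation(word)) == tuple(gold)
    )
    return correct / len(lexicon)

# Illustrative usage with a tiny hand-made test set and a deliberately
# naive predictor (both made up, just to show the mechanics)
test_lexicon = {
    "zyxt":  ("z", "ih", "k", "s", "t"),
    "quorn": ("k", "w", "ao", "n"),
}
naive_predictor = lambda word: tuple(word)   # one 'phone' per letter: obviously wrong
print(word_accuracy(test_lexicon, naive_predictor))  # 0.0
# In a real test: word_accuracy(test_lexicon, my_front_end.letter_to_sound)
```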
Once our system is complete, and we think we’ve got decent performance in each component, we then perform end-to-end system testing.
Your insight that testing and improving individual components might not lead to the best end-to-end performance certainly has some truth in it. Some components contribute far more than others to overall system performance. Putting that another way, end-to-end testing might sometimes help us identify (using our engineering intuition) which component needs most improvement: where a given amount of work would have the largest effect.
I think you are also suggesting that components should be optimised jointly. This is another good insight. For example, if errors in one component can be effectively corrected later in the pipeline, then there is no need to improve that earlier component. Unfortunately, the machine-learning techniques used in modules in a typical system (e.g., Festival’s text processing front end) do not lend themselves to this kind of joint optimisation.
If we don’t want to perform a listening test after every minor change to the system, then we need to rely on either
- Our own listening judgement
- Objective measures
We’ll cover objective measures in the lecture.
Objective measures are widely used in statistical parametric synthesis. In fact, the statistical model is essentially trained to minimise a kind of objective error on the training data. We can then measure the same error with respect to some held-out data.
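As a minimal sketch of that last step, this computes a simple objective error (RMSE) between acoustic parameters predicted by a model and parameters extracted from held-out natural speech. The matrices here are random stand-ins, not real data.

```python
import numpy as np

def rmse(predicted, reference):
    """Root-mean-square error between predicted and reference acoustic
    parameters for held-out utterances (both of shape (frames, dims))."""
    assert predicted.shape == reference.shape
    return float(np.sqrt(np.mean((predicted - reference) ** 2)))

# Hypothetical held-out data: parameters predicted by the trained model vs.
# parameters extracted from the corresponding natural recordings.
reference = np.random.randn(500, 40)
predicted = reference + 0.3 * np.random.randn(500, 40)
print(f"held-out RMSE: {rmse(predicted, reference):.3f}")
```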
Multiple rounds of evaluation would be normal when developing a large system over a period of time. For the “Build your own unit selection voice” exercise, that would probably take too much time though.
In general, it’s difficult to evaluate a single system in isolation: this is because most types of evaluation provide a relative judgement compared to one or more other systems or references. Even in the case of intelligibility testing, where evaluating a single system sounds reasonable, we still need to interpret the result: for example, is a Word Error Rate of 15% good or bad? One way to know would be to measure the intelligibility of natural speech under the same conditions.
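For reference, Word Error Rate is normally computed by Levenshtein alignment between what the listener typed and the reference sentence; here is a minimal sketch, with invented example sentences.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed by Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance table
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical listener response to a reference sentence
print(word_error_rate("the chair ate a green idea", "the chair hate green idea"))  # 0.333...
```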
When comparing multiple systems, we would normally use the same speaker and in fact the exact same database (unless we were investigating the effect of database size or content). Trying to compare two systems built from different speakers’ data would not enable us to separate the effects of speaker from those of the system.
February 7, 2016 at 12:13 in reply to: Output quality: common evaluation framework & materials (SUS) #2547
The reason for using SUS is, of course, to avoid a ceiling effect in intelligibility. But you are not the first person(*) to suggest that SUS are highly unnatural and to wonder how much a SUS test actually tells us about real-world intelligibility.
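For anyone who hasn’t met SUS before: each sentence is generated by filling a fixed syntactic frame with words drawn at random from per-category lists, which is exactly what makes it grammatical but impossible to predict from context. A minimal sketch of the idea (the word lists and template below are made up, not the published SUS materials):

```python
import random

# Hypothetical word lists, one per syntactic category
words = {
    "DET":  ["the", "a"],
    "ADJ":  ["green", "loud", "narrow", "sudden"],
    "NOUN": ["table", "river", "window", "thought"],
    "VERB": ["eats", "paints", "follows", "breaks"],
}

# One simple syntactic template: DET ADJ NOUN VERB DET NOUN
template = ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"]

def make_sus():
    """Fill the template with randomly chosen words: grammatical, but
    semantically unpredictable, so listeners cannot guess words from context."""
    return " ".join(random.choice(words[cat]) for cat in template)

for _ in range(3):
    print(make_sus())
```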
A slightly more ecologically valid example would be evaluating the intelligibility of synthetic speech in noise, where SUS would be too difficult and ‘more normal’ sentences could be used instead. But such tests are still generally done in the lab, with artificially-added noise. They could hardly be called ecologically valid.
You ask whether “there [are] intelligibility tests using situations that mimic the desired applications?” This would certainly be desirable, and commercial companies might do this as part of usability testing. Unfortunately, mimicking the end application is a lot of work, and so makes the test slow and expensive. Once we start evaluating the synthetic speech as part of a final application, it will get harder to separate out the underlying causes for users’ responses. At this point, we reach the limit of my expertise, and would be better asking an expert, such as Maria Wolters.
* Paul Taylor always told me he was very sceptical of SUS intelligibility testing. He asserted that all commercial systems were already at ceiling intelligibility in real-world conditions, so there was no point measuring it; researchers should focus on naturalness instead. I agree with him as far as listening in quiet conditions is concerned, but synthetic speech is certainly not at ceiling intelligibility when heard in noise.
February 7, 2016 at 12:00 in reply to: Output quality: common evaluation framework & materials (SUS) #2546
In the field of speech coding there are standardised tests and methodologies for evaluating codecs. This standardisation is driven by commercial interests, both from those who invent new codecs and from those who use them (e.g., telecoms or broadcasting).
But in speech synthesis there appears to be no commercial demand for equivalent standardised tests. Commercial producers of speech synthesisers never reveal the evaluation results for their products (the same is true of automatic speech recognition).
There are, however, conventions and accepted methods for evaluation that are widely used in research and development. SUS is one such method and is fairly widely used (although Word Error Rate is usually reported, rather than Sentence Error Rate).
The Blizzard Challenge is the only substantial effort to make fair comparisons across multiple systems. The listening test design in the Blizzard Challenge is straightforward (it includes a section of SUS) and is widely used by others. The materials (speech databases + text of the test sentences) are publicly available and are also quite widely used. This is a kind of de facto standardisation.
There are some examples for the Speech Communication paper.
This is potentially confusing – and we don’t want to get hung up on terminology. I’ve added some clarification.
The reasons for avoiding very long sentences in the prompts for recording a unit selection database are
- they are hard to read out without the speaker making a mistake
- the proportion of phrase-initial and phrase-final diphones is low
Short sentences might be avoided because they have unusual prosody, and so units from short phrases (e.g., “Hi!”) may not be very suitable for synthesising ‘ordinary’ sentences.
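If you are scripting the text selection for your own voice, a simple first pass is just to filter candidate prompts by length before doing any diphone-coverage selection. A minimal sketch, where the length limits are purely illustrative rather than prescribed values:

```python
def filter_prompts(sentences, min_words=5, max_words=18):
    """Drop very short sentences (unusual prosody) and very long ones
    (hard to read aloud without mistakes). Limits are illustrative only."""
    kept = []
    for s in sentences:
        n = len(s.split())
        if min_words <= n <= max_words:
            kept.append(s)
    return kept

candidates = [
    "Hi!",
    "The museum opens at nine on weekdays.",
    "Author Jack Hughes will be signing copies of his latest novel at the "
    "bookshop on the corner of the high street next Saturday afternoon at three.",
]
print(filter_prompts(candidates))  # keeps only the middle sentence
```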