Forum Replies Created
1. Some tags need to be supported by the voice.
Why will unit selection voices generally not support tags that modify pitch, duration, emphasis, etc.?
This might be because you did a cut-and-paste from this webpage, which picked up HTML versions of some characters?
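If that is what happened, one quick way to clean the text before passing it to Festival might be something like this (just a sketch; the filenames are examples):

# Sketch: undo HTML entities and typographic quotes that a cut-and-paste
# can introduce, before passing the text to Festival.
import html

with open("input.txt", encoding="utf-8") as f:
    text = f.read()

text = html.unescape(text)  # e.g. &amp; -> &, &quot; -> "
for fancy, plain in {"“": '"', "”": '"', "‘": "'", "’": "'"}.items():
    text = text.replace(fancy, plain)

with open("input_clean.txt", "w", encoding="utf-8") as f:
    f.write(text)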
Yes, this will change between voices. The format of the name of a voice is the same as you would use within Festival, minus the “voice_” prefix. Try creating a file called
test.sable
(make sure the suffix is .sable and that your editor doesn’t add another suffix) containing SABLE markup that changes speaker part-way through the text (a sketch is given below), and run it through Festival like this:
bash$ festival --tts test.sable
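A minimal SABLE file along those lines might look like the sketch below. The voice name in the SPEAKER tag is just an illustrative placeholder; substitute a voice you actually have installed (without the voice_ prefix):

<?xml version="1.0"?>
<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN" "Sable.v0_2.dtd" []>
<SABLE>
Changes of speaker may appear in the text.
<SPEAKER NAME="cmu_us_awb_arctic_clunits">
Using one speaker
</SPEAKER>
Eventually returning to the original default speaker.
</SABLE>

Festival should speak the middle sentence in the named voice and return to the default voice afterwards.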
Note that SABLE was a putative standard developed a long time ago by us in Edinburgh with a few companies. It has been superseded. See also SSML, the later standard that superseded it, and the related standard for interactive systems, VoiceXML.
There was an error in a path inside the text2wave script. Try again and report back. Remember to source the setup.sh first – this sets your PATH.
(Also, post the complete error message including the full command line you are running so I can replicate the error.)
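For reference, the sequence of commands is along these lines (treat the filenames as examples, and use the paths given in the assignment instructions):

bash$ source setup.sh
bash$ text2wave test.txt -o test.wav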
The usual form of surprising results is that listeners didn’t hear an improvement that the designers thought they had made, or that some other aspect of the synthetic speech masked the possible improvement (e.g., the speech did sound more prosodically natural, but the waveform quality was lower, and so listeners preferred the baseline).
I’m struggling to think of any genuine positive surprises, but will keep thinking…
Yes, that would seem a reasonable conclusion. Your hypothetical MDS test has found that listeners only use prosodic naturalness to distinguish between stimuli. Either they do not hear segmental problems, or there are none (it doesn’t matter which).
Could you clarify the question a little bit? I’m not sure about “a dimension corresponding to naturalness, and a second principal dimension strongly corresponding to prosodic naturalness”.
If you want to try an objective measure (perhaps to see if it correlates with your listeners’ judgements), here’s a Python implementation of Mel Cepstral Distortion (MCD) by Matt Shannon.
This requires skills in compiling code (if you do this, please post information here); it is entirely optional and certainly not required for this assignment.
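For reference, MCD is usually defined as a scaled Euclidean distance between the mel cepstral coefficients of time-aligned synthetic and reference frames, something like

$\mathrm{MCD} = \frac{10}{\ln 10}\sqrt{2\sum_{d=1}^{D}\left(c_d - \hat{c}_d\right)^2}$

where $c_d$ and $\hat{c}_d$ are the d-th mel cepstral coefficients of the reference and synthetic frames respectively (the 0th coefficient, which is essentially energy, is usually excluded), averaged over all frames of an utterance.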
We will look at how to calibrate the materials, in the lecture.
We’ll spell this out in the lecture.
We’ll look at various ways of calibrating listeners’ responses, in the lecture.
Including a control trial sounds like the right thing to do. In almost all types of evaluation, we need something to compare to:
- We explicitly ask listeners to make a comparison and tell us their judgement
- We compare the responses of listeners across several conditions
The second case applies here. If we cannot control for unknown factors (e.g., working memory) then we have to make sure those factors have the same value across all conditions that we wish to compare. In your example, we would have the same subjects perform the same task, once on the synthetic speech, and once on natural speech, then we would quantify the difference in their responses (e.g., accuracy in a comprehension test).
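If it helps, here is a sketch of that kind of analysis in Python. The numbers and names are invented purely for illustration; the point is that the comparison is paired, because the same listeners did both conditions:

# Sketch: compare comprehension accuracy for the same listeners in two conditions.
# The accuracy values below are invented for illustration.
from statistics import mean
from scipy import stats

natural   = [0.95, 0.90, 0.85, 0.92, 0.88, 0.91]   # per-listener accuracy, natural speech
synthetic = [0.80, 0.78, 0.84, 0.75, 0.82, 0.79]   # per-listener accuracy, synthetic speech

differences = [n - s for n, s in zip(natural, synthetic)]
print("mean accuracy (natural):  ", mean(natural))
print("mean accuracy (synthetic):", mean(synthetic))
print("mean difference:          ", mean(differences))

# a paired test, because the same listeners did both conditions
t, p = stats.ttest_rel(natural, synthetic)
print(f"paired t-test: t = {t:.2f}, p = {p:.3f}")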
You are correct in saying that it’s very hard to trace back a problem in the output speech to a specific component in the synthesiser, especially so when the text-processing has the typical pipeline architecture.
Where possible, we will perform component-level testing and, if we are lucky, that can be done objectively (without humans) by comparing output to a gold-standard reference.
Otherwise, what we have to do is make a hypothesis, then create an experiment to test (or refute) that hypothesis. In general, this is going to require us to create two or more systems that differ in a specific way (e.g., the pronunciation dictionary is or is not carefully tuned to the speaker), then compare them in a listening test.
Unit tests (actually, component tests – as discussed in this thread) are used during system development. For example, we might iteratively test and improve the letter-to-sound module in the front end until we can no longer improve its accuracy on a test set of out-of-vocabulary words. This component testing provides more useful information than system testing, because we know precisely which component is causing the errors we observe, so we know where improvements are needed.
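As a sketch of what such a component test might look like for letter-to-sound (the file format, filename and the predict_pronunciation function are assumptions for illustration, not part of Festival):

# Sketch: word-level accuracy of a letter-to-sound module on held-out words.
# Assumes a tab-separated file of word <TAB> reference pronunciation, and a
# hypothetical predict_pronunciation() wrapping the module under test.

def predict_pronunciation(word: str) -> str:
    # placeholder for the letter-to-sound module being evaluated
    raise NotImplementedError

def word_accuracy(test_file: str) -> float:
    correct = total = 0
    with open(test_file) as f:
        for line in f:
            word, reference = line.rstrip("\n").split("\t")
            if predict_pronunciation(word) == reference:
                correct += 1
            total += 1
    return correct / total

# print(word_accuracy("oov_test_set.tsv"))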
Once our system is complete, and we think we’ve got decent performance in each component, we then perform end-to-end system testing.
Your insight that testing and improving individual components might not lead to best end-to-end performance certainly has some truth in it. Some components contribute far more than others to overall system performance. Putting that another way, end-to-end testing might sometimes help us identify (using our engineering intuition) which component needs most improvement: where a given amount of work would have the largest effect.
I think you are also suggesting that components should be optimised jointly. This is another good insight. For example, if errors in one component can be effectively corrected later in the pipeline, then there is no need to improve that earlier component. Unfortunately, the machine-learning techniques used in modules in a typical system (e.g., Festival’s text processing front end) do not lend themselves to this kind of joint optimisation.
If we don’t want to perform a listening test after every minor change to the system, then we need to rely on either
- Our own listening judgement
- Objective measures
We’ll cover objective measures in the lecture.
Objective measures are widely used in statistical parametric synthesis. In fact, the statistical model is essentially trained to minimise a kind of objective error with the training data. We can then measure the error with respect to some held-out data.
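Here is a sketch of that held-out measurement idea in Python. The arrays stand in for predicted and reference acoustic parameters of time-aligned frames; in practice you would use a perceptually motivated distance such as MCD, as above:

# Sketch: a simple objective error between predicted and reference acoustic
# parameters for a held-out utterance, assuming frames are already time-aligned.
import numpy as np

def rmse(predicted: np.ndarray, reference: np.ndarray) -> float:
    # predicted, reference: (num_frames, num_params) arrays for one utterance
    return float(np.sqrt(np.mean((predicted - reference) ** 2)))

# invented example data, for illustration only
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 40))                      # e.g. 40 coefficients per frame
predicted = reference + rng.normal(scale=0.1, size=reference.shape)
print(rmse(predicted, reference))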