The filter coefficients are not cross-faded. Remember that they are specified frame-by-frame (not sample-by-sample, like the residual). We just concatenate the sequences of frames of filter coefficients for all the candidates – this gives us a complete sequence of filter coefficients for the full utterance.
You need to do some more detective work to find out how the pitchmarks are found. For example, try omitting the pitchmarking step and see what happens as you build the voice.
What is the pipeline for concatenation and RELP waveform generation?
A complete residual signal (which is just a waveform) for the whole utterance is constructed by concatenating the residuals of the selected candidates. Overlap-and-add (i.e., crossfade) is performed at the joins, over a duration of one pitch period. A corresponding sequence of LPC filter coefficients is also constructed.
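Here is a rough sketch of that concatenation step in Python – not the actual Multisyn code, and the function and argument names (concatenate_candidates, residuals, coef_frames, pitch_period) are my own:

import numpy as np

def concatenate_candidates(residuals, coef_frames, pitch_period):
    """Join the selected candidates into one utterance-length residual plus coefficient sequence.

    residuals   : list of 1-D numpy arrays, one residual waveform per selected candidate
    coef_frames : list of 2-D numpy arrays (frames x LPC order), one per candidate
    pitch_period: crossfade length in samples (roughly one pitch period at the join)
    """
    out = residuals[0].astype(float)
    for res in residuals[1:]:
        res = res.astype(float)
        n = min(pitch_period, len(out), len(res))
        if n > 0:
            fade = np.linspace(1.0, 0.0, n)
            # overlap-and-add: fade out the end of what we have, fade in the start of the next unit
            out[-n:] = out[-n:] * fade + res[:n] * (1.0 - fade)
        out = np.concatenate([out, res[n:]])

    # the filter coefficients are NOT crossfaded: the frames are simply concatenated
    coefs = np.vstack(coef_frames)
    return out, coefs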
The function lpc_filter_fast in .../speech_tools/sigpr/filter.cc then does the waveform generation. The inputs are the utterance-length residual waveform and the sequence of LPC filter coefficients. I’ve just realised that I wrote that code nearly 20 years ago…

/*************************************************************************/
/* Author : Simon King                                                   */
/* Date   : October 1996                                                 */
/*-----------------------------------------------------------------------*/
/* Filter functions                                                      */
/*                                                                       */
/*=======================================================================*/
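To make the idea concrete, here is a very rough Python sketch of what an LPC synthesis filter driven by a residual does: the residual is the excitation, and an all-pole filter (whose coefficients are switched frame-by-frame) reconstructs the speech. This is not a transcription of lpc_filter_fast – the names are mine, and the sign convention and frame handling in the real code may differ:

import numpy as np

def lpc_resynthesis(residual, coefs, frame_shift):
    """Reconstruct speech from a residual and per-frame LPC coefficients.

    residual    : 1-D array, utterance-length residual (the excitation)
    coefs       : 2-D array (num_frames x order), LPC predictor coefficients a[1..p]
    frame_shift : number of samples per frame
    """
    order = coefs.shape[1]
    out = np.zeros(len(residual))
    for n in range(len(residual)):
        # pick the coefficient frame covering this sample
        frame = min(n // frame_shift, coefs.shape[0] - 1)
        a = coefs[frame]
        # all-pole filter: predict from past output samples, add the residual as excitation
        pred = 0.0
        for k in range(1, order + 1):
            if n - k >= 0:
                pred += a[k - 1] * out[n - k]
        out[n] = residual[n] + pred
    return out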
Why does Festival use Residual Excited LP (RELP)?
The released version of Festival uses RELP for two reasons. The first reason is practical – TD-PSOLA is patented:
Method and apparatus for speech synthesis by wave form overlapping and adding EP0363233 (A1)
The patent was filed by the French state via its research centre CNET, which later became France Telecom, now known as Orange.
The second reason is that RELP allows pitch/time/spectral envelope modification, as you mention. In the older diphone engine, RELP is indeed used for time- and pitch-modification. In Multisyn, no modification or join smoothing is performed, although in principle it would be possible to add this to the implementation.
Yes, you need to allow for homophones, and I recommend allowing for spelling mistakes as well: you are testing your systems, not testing the listeners.
In the Blizzard Challenge, and almost everywhere else, Word Error Rate (WER) is used. I have rarely, if ever, seen anyone following the recommendation from the original paper to score entire sentences.
With WER, there is no need to normalise for sentence length. Just use the same formula that is used in automatic speech recognition.
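Concretely, WER = (S + D + I) / N, where S, D and I are the substitutions, deletions and insertions in a minimum-edit-distance alignment, and N is the number of words in the reference. A minimal sketch (assuming you have already dealt with case, punctuation and homophones before comparing):

def wer(reference, hypothesis):
    """Word Error Rate = (substitutions + deletions + insertions) / reference length,
    computed from a standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution (or match)
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / len(ref)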
The “bad f0” penalty is part of the target cost. It obtains F0 at the concatenation points of the unit from the “stripped” join cost features (coef2). In fact, only the voicing status is used, and not the actual value of F0.
Normally, we do not calculate a score for an individual listener at all; we just pool all responses. But you are right that the Latin Square design would prevent us from computing per-listener scores.
You seem to be suggesting that scores should be somehow rescaled, or made relative within each listener. That seems reasonable for an intelligibility test, but it is not something that is normally done.
But, this might not be the perfect arrangement, so feel free to try something different.
Your suggestion to include natural speech is good, but don’t expect listeners to transcribe it perfectly: they will still make errors (see the Blizzard Challenge summary papers for typical WERs on naturally-spoken SUS – they are not 0%), so you can’t necessarily use this as a means to exclude listeners who didn’t follow the instructions carefully.
I’m not sure that’s a convincing explanation. You are arguing that units with “bad pitchmarks” won’t get used. Fine, but then an increase in the proportion of units with bad pitchmarks effectively reduces the inventory size, which should lead to worse quality.
Does using the ‘male’ setting actually lead to more ‘bad pitchmarking’ warnings? These warnings relate to units without any pitchmarks at all, and this then results in a penalty.
If the ‘male’ setting just results in fewer pitchmarks overall (say, about half as many), that’s a different situation – think about what the consequences of that would be.
For the domain-specific voice, obviously you will want to try domain-specific test sentences (that’s the whole point). You are right to consider how to control the difficulty of those sentences, when measuring intelligibility. Here are some options:
1. Try ‘normal’ domain-specific sentences and see whether you do indeed have a ceiling effect on intelligibility (just make that judgement informally, although there are formal statistical tests for this kind of thing).
2. If ‘normal’ domain-specific sentences are too easy, then you may decide to try one of your two suggestions: domain-specific SUS, or adding noise.
I’m not entirely sure about domain-specific SUS: although the words will be well-covered by the database, the sequences of words will not, and so there will still be lots of joins. But try it and see!
Adding noise is a good option, but you will need to choose an appropriate Signal-to-Noise Ratio (SNR) carefully. See this reply, and the paper I mention there, for some clues. Don’t aim to replicate that paper (!) but you might do something similar on an informal basis (there is a short sketch of mixing noise at a chosen SNR after this list).
3. Or, simply don’t measure intelligibility, and only measure naturalness, in this experiment.
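If you do try adding noise, mixing at a chosen SNR is simple to do yourself. A minimal sketch (my own helper function, assuming speech and noise are floating-point arrays at the same sample rate, with the noise at least as long as the speech):

import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into a speech signal at a chosen global SNR (in dB)."""
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise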
The x-axis is the experimental variable that you control (for you: the difficulty of the text) and the y-axis is the response (for you: Word Error Rate (WER), or Accuracy).
For this assignment there is no need to actually plot out a full psychometric curve: that would consume a lot of listening test time. Instead, you can informally try a few different types of text, and choose a type that gives you a WER in about the right range (let’s say 30-50%).
An example curve is Figure 1 in this paper (there, the x-axis is Signal-to-Noise Ratio, but you could imagine that it is “text difficulty” ranging from “very hard” on the left to “really easy” on the right).
A value of exactly 0 for the beam widths is actually a special value that turns pruning off altogether. If you want to try a lot of pruning, use a small value, but something greater than 0.
For your specific example, don’t try to fix it, but you could report it as part of your qualitative analysis. Try to suggest how it could be fixed though.
For testing intelligibility
Now things get more complicated. You correctly state two things:
1. all systems must synthesise exactly the same set of sentences, in case some are easier or more difficult than others.
2. listeners cannot be presented with the same sentence more than once, because they will probably remember it and so make fewer errors on the second presentation.
So, we need an experimental design that satisfies the above points. The standard way to do that is with a Latin Square. Let’s do a really simple 2×2 design, for comparing two systems (call them A and B) using, say, 10 sentences (1 to 10).
The trick is to split the listeners into groups:
Listener group 1 hears:
System A synthesising sentences 1-5
System B synthesising sentences 6-10

Listener group 2 hears:
System B synthesising sentences 1-5
System A synthesising sentences 6-10

and we pool all responses. What is happening is that a pair of listeners (one from each group) form a “virtual listener” who hears all combinations.
The number of listeners needs to be a multiple of the number of systems (2 in this example) so that each listener group is the same size. Imagine we recruit 20 listeners – we put 10 of them into each group.
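If it helps, here is one way you might generate that assignment programmatically. This is just a sketch – the function name, the placeholder listener IDs and the block-rotation scheme are my own, not part of any standard tool:

def latin_square_assignment(systems, sentences, listeners):
    """Assign each listener to a group; each group hears every sentence exactly once,
    with the system/sentence pairing rotated between groups (a simple Latin Square)."""
    n_groups = len(systems)
    assert len(listeners) % n_groups == 0, "number of listeners must be a multiple of the number of systems"
    # split the sentences into as many blocks as there are systems
    block_size = len(sentences) // n_groups
    blocks = [sentences[i * block_size:(i + 1) * block_size] for i in range(n_groups)]

    plan = {}
    for idx, listener in enumerate(listeners):
        group = idx % n_groups
        plan[listener] = []
        for b, block in enumerate(blocks):
            system = systems[(b + group) % n_groups]   # rotate systems across groups
            plan[listener].extend((system, sent) for sent in block)
    return plan

# e.g. latin_square_assignment(["A", "B"], list(range(1, 11)), ["listener%d" % i for i in range(20)])
# reproduces the 2x2 design above: group 1 hears A on sentences 1-5 and B on 6-10, group 2 the reverse.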
For testing naturalness
In a standard MOS test, listeners rate a single stimulus at a time (one particular system saying one sentence). Across the whole test, it is important that all systems synthesise the same set of sentences. You will probably either pseudo-randomise the order of presentation, or use a fancier design such as a Latin Square.
So, listeners will not be aware of how many different systems there are. They will notice the same sentence occurring more than once, but the pseudo-randomisation (or other method) can be made to ensure that they never hear the same sentence twice in a row.
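One simple way to enforce that last constraint is to reshuffle until no sentence is adjacent to itself – a sketch only, with my own function name:

import random

def shuffle_no_adjacent_repeats(stimuli, max_tries=1000):
    """Shuffle (system, sentence) stimuli so the same sentence never appears twice in a row."""
    order = list(stimuli)
    for _ in range(max_tries):
        random.shuffle(order)
        if all(order[i][1] != order[i + 1][1] for i in range(len(order) - 1)):
            return order
    raise RuntimeError("could not find an order without adjacent repeats")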