Forum Replies Created
BeaqleJS looks like a general framework that is worth investigating – no idea how easy it is to use. It can do preference tests and MUSHRA.
That is indeed the function that is doing the vowel reduction at synthesis time. You could try modifying (redefining) it, to prevent any vowel reduction at all, but you would then almost certainly make a lot of other test sentences worse.
The problem you have found is a mismatch between the vowel reduction rules used at synthesis time, and the method for identifying reduced vowels in the database. It is impossible(*) for these to perfectly match, because the speaker may not reduce exactly the vowels that the front-end predicts should be reduced.
Editing the labels on the database sounds like the solution in this case, assuming that the corrected label is a closer match to the speech. Of course, whilst you might improve this particular test sentence, you may make others worse.
(*) unless you can get the speaker to read out phonetic transcriptions of the sentences?
It’s actually quite hard to specify how many listeners you need for a reliable result (e.g., to find statistically significant differences, or to get stable results that would not change if you added more listeners) because it depends to some extent on the magnitude of the difference between the systems you are comparing and the number of systems. This paper gives some guidelines:
Mirjam Wester, Cassia Valentini-Botinhao, and Gustav Eje Henter. Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations. In Proc. Interspeech, pages 3476-3480, Dresden, September 2015.
and a critique of published papers that include listening test results. The paper suggests that 30 listeners are needed for a typical MOS “naturalness” test (although 20 listeners gets you quite close: see Figure 1).
However, for this assignment, 30 listeners may be more than you can easily recruit, so it’s OK to use fewer.
One thing you must not do is to analyse your results, then add a few more listeners, and keep going until you reach statistical significance. That would be a fishing expedition.
You actually asked how many listeners are needed per stimulus; for that, look at Figure 4 in the paper, where a datapoint corresponds to one listener giving a response to a single stimulus.
The analysis in this paper is for one year of the Blizzard Challenge (where 11 systems are being compared), so the numbers of listeners and datapoints might not transfer directly to your listening test. But, the checklist in Section 2 is definitely applicable.
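If it helps to make “statistically significant differences” concrete: once you have the raw responses (one score per listener per stimulus), a non-parametric test such as Mann-Whitney U is one common choice for comparing two systems. This is only an illustrative sketch in Python with made-up scores, not an analysis recipe taken from the paper:

import numpy as np
from scipy.stats import mannwhitneyu

# Made-up MOS responses (1-5); in a real test there is one response
# per listener per stimulus, pooled here for a two-system comparison.
system_a = np.array([4, 3, 5, 4, 4, 3, 5, 4, 3, 4])
system_b = np.array([3, 3, 4, 2, 3, 4, 3, 3, 2, 3])

# Mann-Whitney U is non-parametric, which suits ordinal MOS responses.
stat, p = mannwhitneyu(system_a, system_b, alternative="two-sided")
print("U = {}, p = {:.4f}".format(stat, p))

(And, per the warning above, decide on the number of listeners before running this, not after.)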
Are you sure you are increasing the beam width for HVite, and not HERest? The beam for HERest is specified as three numbers (as per your other post in this topic), but for HVite it is just a single number.
Other possible causes:
- the speech is not a perfect match to the transcription (e.g., speaker errors) in too many places
- there just isn’t enough data
If you can’t get this working, then move on to another variation instead (e.g., train on all the data, and just exclude data the simple way at synthesis time).
It looks like you have moved all your ss directories into a directory called voices. That’s fine, but now the paths to the MFCC files (which are stored as absolute paths in train.scp) are wrong. Delete train.scp and re-run the make_mfcc_list script in this step to rebuild the train.scp file.

It’s easier to create your own script in Scheme and execute that in Festival. Create a script that contains the sequence of commands you want to run (load the voice, synthesise a sentence, save that sentence).
Tip: use Festival in interactive mode first, to work out the sequence of commands that you need.
If you place that script in a file called myscript.scm, then you can run it like this:

$ festival myscript.scm

You may want to create myscript.scm using a shell script or a simple Python program, if you need to synthesise a long list of sentences and save each one to a file. See also this post.
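As a sketch of that last suggestion: a few lines of Python can generate myscript.scm from a list of sentences, using the standard Festival commands Utterance, utt.synth and utt.save.wave. The voice-loading command and the filenames below are placeholders; substitute whatever loads your own voice.

# Sketch: write a Festival Scheme script that loads a voice, synthesises
# each sentence, and saves each waveform to a numbered wav file.

sentences = [
    "Hello world.",
    "This is a test sentence.",
]

with open("myscript.scm", "w") as f:
    # placeholder: replace with the command that loads your voice
    f.write("(voice_localdir_multisyn-rpx)\n")
    for i, text in enumerate(sentences):
        wav = "sentence_{:03d}.wav".format(i + 1)
        # build an utterance from text, synthesise it, then save the waveform
        f.write("(utt.save.wave (utt.synth (Utterance Text \"{}\")) \"{}\" 'riff)\n"
                .format(text, wav))

Then run festival myscript.scm as above.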
This error normally indicates that the flat-start training has failed to create good models. One possible cause is excessively long silences at the start or end of the recordings. Try endpointing the data and then run the alignment again. Report your findings here.
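If it helps, “endpointing” here just means trimming leading and trailing silence from each wav file. Below is a minimal, hypothetical energy-based sketch in Python (using numpy and the soundfile package, assuming mono 16 kHz recordings); the 40 dB threshold and the filenames are arbitrary examples to adjust by listening, and a dedicated tool would do the job equally well.

import numpy as np
import soundfile as sf

def endpoint(wav_in, wav_out, frame_len=400, hop=160, threshold_db=40.0):
    # Trim leading/trailing silence using short-time log energy.
    # 400 / 160 samples = 25 ms / 10 ms frames at 16 kHz.
    audio, sr = sf.read(wav_in)
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len, hop)]
    energy = np.array([10 * np.log10(np.sum(f ** 2) + 1e-10) for f in frames])
    # keep the span of frames within threshold_db of the loudest frame
    active = np.where(energy > energy.max() - threshold_db)[0]
    start = active[0] * hop
    end = min(len(audio), active[-1] * hop + frame_len)
    sf.write(wav_out, audio[start:end], sr)

endpoint("arctic_a0001.wav", "arctic_a0001_trimmed.wav")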
I think it is fine for them to have included the “neutral” category: it’s informative to know what percentage of listeners had no preference. I don’t think it weakens their claim that the DNN is preferred over the HMM.
They could have used a two-way forced choice instead, but then we would expect to get quite large error bars on those responses, where listeners had to make an arbitrary choice (because “no preference” was not an available response).
Of course, if you include “no preference” in the test, it must also be reported! Otherwise we move into the territory of cat food marketing where we see phrases like “8 out of 10 cats” being used, but with “(of those that expressed a preference)” in a footnote. Maybe only 10 cats out of 1000 actually express a preference, the other 990 being happy to eat anything you put in front of them…
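To make that concrete with made-up numbers in Python: report the full breakdown, and the contrast with the “of those that expressed a preference” framing becomes obvious.

# Made-up counts from a hypothetical A/B/no-preference test
counts = {"DNN": 8, "HMM": 2, "no preference": 990}
total = sum(counts.values())

# Honest reporting: percentage of all responses in each category
for label, n in counts.items():
    print("{}: {:.1f}% of all responses".format(label, 100.0 * n / total))

# The "8 out of 10 cats" framing: only among those expressing a preference
expressed = counts["DNN"] + counts["HMM"]
print("DNN preferred by {:.0f}% of those expressing a preference"
      .format(100.0 * counts["DNN"] / expressed))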
Yes, thinking of eigenvoices as “standardized voice ingredients” is reasonable.
One problem with trying to listen to these voices is that the models are constructed in a normalised space, and so it doesn’t actually make sense to synthesise from the underlying models. The same problem would occur when trying to listen to the eigenvoices: they may not make sense on their own.
Here are some slides and examples from Mark Gales that give an overview of the main ideas of “Controllable and Adaptable Speech Synthesis”.
Informally, think of eigenvoices as being a set of “axes” in some abstract “speaker space”. We can create a voice for any new speaker (i.e., we can do speaker adaptation) as a weighted combination of these eigenvoices. The only parameters we need to learn are the weights. Because the number of weights will be very small (compared to the number of model parameters), we can learn them from a very small amount of data.
When you first try to understand this concept, it’s OK to imagine that the eigenvoices correspond to the actual real speakers in the training set.
In fact we can do better than that, by finding a set of basis vectors that is as small as possible (smaller than the number of training speakers) whilst still being able to represent all the different “axes” of variation across speakers.
(To get into more depth, this topic would need more than just this written forum answer. I can consider including it in lecture 10 if you wish.)
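As a very rough numerical sketch of the weighted-combination idea (a toy in numpy, not how any particular toolkit implements eigenvoices): treat each training speaker as one long “supervector” of model parameters, take the principal directions of variation across speakers as the eigenvoices, and then only a handful of weights needs to be estimated for a new speaker. In practice those weights would be estimated from a small amount of adaptation data by maximum likelihood, rather than by projecting a known supervector as done here.

import numpy as np

# Toy data: 20 training speakers, each a 1000-dimensional "supervector"
# of model parameters (e.g., stacked Gaussian mean vectors).
rng = np.random.default_rng(0)
speakers = rng.normal(size=(20, 1000))

# Eigenvoices: principal directions of variation across speakers.
mean_voice = speakers.mean(axis=0)
centred = speakers - mean_voice
_, _, vt = np.linalg.svd(centred, full_matrices=False)
eigenvoices = vt[:5]          # keep 5 basis vectors, fewer than 20 speakers

# A new speaker is then represented by just 5 weights, not 1000 parameters.
new_speaker = rng.normal(size=1000)
weights = eigenvoices @ (new_speaker - mean_voice)   # projection onto the basis
adapted_voice = mean_voice + weights @ eigenvoices

print("number of weights learned:", weights.shape[0])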
I think Zen has been “borrowing” text from Wikipedia!
XOR (which means “exclusive OR”) is a logic function and is often used as an example of something that is non-trivial to learn. For a decision tree to compute XOR, the tree will have duplicated parts, which is inefficient. Here’s a video that explains:
To compute XOR with a neural network, at least two layers are needed.
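For instance, a tiny network with one hidden layer of two units (two layers of weights) and hard-threshold activations computes XOR exactly; the weights below are just one hand-picked solution, in a quick numpy sketch:

import numpy as np

def step(x):
    # hard-threshold non-linearity
    return (x > 0).astype(int)

# Hidden layer computes OR and AND of the inputs; the output unit then
# computes "OR and not AND", which is XOR.
W1 = np.array([[1, 1],
               [1, 1]])
b1 = np.array([-0.5, -1.5])   # thresholds that turn the two rows into OR and AND
W2 = np.array([1, -1])
b2 = -0.5

def xor(x):
    h = step(W1 @ x + b1)      # h = [OR(x), AND(x)]
    return step(W2 @ h + b2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor(np.array(x)))   # prints 0, 1, 1, 0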
More generally, the divide-and-conquer approach of decision trees is inefficient for considering combinations of predictors that “behave like XOR”: the tree gets deep, and the lower parts will not be well trained because only a subset of the data is used.
It’s hard to say what XOR, d-bit parity functions, or multiplexer problems have got to do with speech synthesis though (we should ask Zen!), other than that they are also non-trivial to compute.
So, all that Zen is really saying is that neural networks are more powerful models than decision trees. Whether neural networks actually work better than decision trees for speech synthesis remains a purely empirical question though: try them both and see which sounds best!
I’ve added some information on that to the instructions.
Choices about sizes and numbers of hidden layers are generally made empirically, to minimise the error on a development set. In the quote you give above, that is what Zen is saying: he tried different options and chose the one that worked best.
It is computationally expensive to explore all possible architectures, so in practice these things are roughly optimised and then left fixed (e.g., 5 hidden layers of 1024 units each).
The transformation from input linguistic features to output vocoder features is highly non-linear. The only essential requirement in a neural network is that the units in the hidden layers have non-linear activation functions (if all activations were linear, the network would be a linear function regardless of the number of layers: it would be a sequence of matrix multiplies and additions of biases).
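A quick numerical check of that last point (a sketch in numpy): composing two purely linear layers collapses to a single linear layer, so without non-linear activations the extra depth adds nothing.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)

# Two layers with linear (identity) activations
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
two_layers = W2 @ (W1 @ x + b1) + b2

# Exactly the same function, written as a single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True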
There is some variation in the terminology used to refer to the weights that connect one layer to the next. Because the number of weights between two layers is equal to the product of the numbers of units in the two layers, it is natural to think of the weights as being a rectangular matrix: hence “weight matrix”.
However, many authors conceptualise all the trainable parameters of the network (several weight matrices and all the individual biases) as one variable, and they will place them all together into a vector: hence “parameter vector” or “weight vector”. This is a notational convenience, so we can write a single equation for the derivative of the error with respect to the weight vector as a whole.
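For example (a notational sketch in numpy, with hypothetical layer sizes): for a network with two weight matrices and two bias vectors, the “parameter vector” is simply all of them flattened and concatenated into one long vector.

import numpy as np

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)   # layer 1: weights and biases
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)   # layer 2: weights and biases

# All trainable parameters gathered into a single "parameter vector"
theta = np.concatenate([W1.ravel(), b1, W2.ravel(), b2])
print(theta.shape)   # (23,) = 12 + 3 + 6 + 2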