Forum Replies Created
It’s easier to create your own script in Scheme and execute that in Festival. Create a script that contains the sequence of commands you want to run (load the voice, synthesise a sentence, save that sentence).
Tip: use Festival in interactive mode first, to work out the sequence of commands that you need.
If you place that script in a file called myscript.scm, then you can run it like this:
$ festival myscript.scm
You may want to create myscript.scm using a shell script or a simple Python program, if you need to synthesise a long list of sentences and save each to a file. See also this post.
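For example, a minimal myscript.scm might look like this (a sketch only; the voice name is the one used later in this thread and the output filename is arbitrary, so substitute your own):
; myscript.scm -- load a voice, synthesise one sentence, save it to a file
(voice_localdir_multisyn-rpx)                                 ; load your voice
(set! utt (utt.synth (Utterance Text "My first sentence.")))  ; synthesise without playing audio
(utt.save.wave utt "sentence1.wav" 'riff)                     ; save the waveform as a RIFF wav file
(quit)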
This error normally indicates that the flat-start training has failed to create good models. One possible cause is excessively long silences at the start or end of the recordings. Try endpointing the data and then run the alignment again. Report your findings here.
I think it is fine for them to have included the “neutral” category: it’s informative to know what percentage of listeners had no preference. I don’t think it weakens their claim that the DNN is preferred over the HMM.
They could have used a two-way forced choice instead, but then we would expect to get quite large error bars on those responses, where listeners had to make an arbitrary choice (because “no preference” was not an available response).
Of course, if you include “no preference” in the test, it must also be reported! Otherwise we move into the territory of cat food marketing where we see phrases like “8 out of 10 cats” being used, but with “(of those that expressed a preference)” in a footnote. Maybe only 10 cats out of 1000 actually express a preference, the other 990 being happy to eat anything you put in front of them…
Yes, thinking of eigenvoices as “standardized voice ingredients” is reasonable.
One problem with trying to listen to these voices is that the models are constructed in a normalised space, and so it doesn’t actually make sense to synthesise from the underlying models. The same problem would occur when trying to listen to the eigenvoices: they may not make sense on their own.
Here are some slides and examples from Mark Gales that give an overview of the main ideas of “Controllable and Adaptable Speech Synthesis”.
Informally, think of eigenvoices as being a set of “axes” in some abstract “speaker space”. We can create a voice for any new speaker (i.e., we can do speaker adaptation) as a weighted combination of these eigenvoices. The only parameters we need to learn are the weights. Because the number of weights will be very small (compared to the number of model parameters), we can learn them from a very small amount of data.
When you first try to understand this concept, it’s OK to imagine that the eigenvoices correspond to the actual real speakers in the training set.
In fact we can do better than that, by finding a set of basis vectors that is as small as possible (smaller than the number of training speakers) whilst still being able to represent all the different “axes” of variation across speakers.
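In symbols (generic notation, not tied to any particular paper): the adapted model mean for a new speaker is built as
$\hat{\mu} = \bar{\mu} + \sum_{k=1}^{K} w_k e_k$
where the $e_k$ are the eigenvoices (the basis vectors), $\bar{\mu}$ is an average-voice mean, and only the $K$ weights $w_k$ have to be estimated from the adaptation data, which is why so little data is needed.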
(To get into more depth, this topic would need more than just this written forum answer. I can consider including it in lecture 10 if you wish.)
I think Zen has been “borrowing” text from Wikipedia!
XOR (which means “exclusive OR”) is a logic function and is often used as an example of something that is non-trivial to learn. For a decision tree to compute XOR, the tree will have duplicated parts, which is inefficient. Here’s a video that explains:
To compute XOR with a neural network, at least two layers are needed.
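One way to see why (a standard textbook decomposition, not something specific to Zen’s paper): XOR can be written as
$\mathrm{XOR}(a,b) = (a \lor b) \land \lnot (a \land b)$
The inner OR and AND can each be computed by a single unit in the first layer, and the outer combination by a unit in the second layer; a single layer cannot do it, because XOR is not linearly separable.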
More generally, the divide-and-conquer approach of decision trees is inefficient for considering combinations of predictors that “behave like XOR”: the tree gets deep, and the lower parts will not be well trained because only a subset of the data is used.
It’s hard to say what XOR, d-bit parity functions, or multiplex problems have got to do with speech synthesis though (we should ask Zen!), other than that they are also non-trivial to compute.
So, all that Zen is really saying is that neural networks are more powerful models than decision trees. Whether neural networks actually work better than decision trees for speech synthesis remains a purely empirical question though: try them both and see which sounds best!
I’ve added some information on that to the instructions.
Choices about the sizes and numbers of hidden layers are generally made empirically, to minimise the error on a development set. In the quote you give above, that is what Zen is saying: he tried different options and chose the one that worked best.
It is computationally expensive to explore all possible architectures, so in practice these things are roughly optimised and then left fixed (e.g., 5 hidden layers of 1024 units each).
The transformation from input linguistic features to output vocoder features is highly non-linear. The only essential requirement in a neural network is that the units in the hidden layers have non-linear activation functions (if all activations were linear, the network would be a linear function regardless of the number of layers: it would be a sequence of matrix multiplies and additions of biases).
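To make the point in that parenthetical concrete (a standard identity, with $W$ denoting a weight matrix and $b$ a bias vector): stacking two purely linear layers gives
$W_2 (W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2)$
which is just another single linear layer, with weight matrix $W_2 W_1$ and bias $W_2 b_1 + b_2$. Without non-linear activations in the hidden units, extra layers add no modelling power.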
There is some variation in the terminology used to refer to the weights that connect one layer to the next. Because the number of weights between two layers is equal to the product of the numbers of units in the two layers, it is natural to think of the weights as being a rectangular matrix: hence “weight matrix”.
However, many authors conceptualise all the trainable parameters of the network (several weight matrices and all the individual biases) as one variable, and they will place them all together into a vector: hence “parameter vector” or “weight vector”. This is a notational convenience, so we can write a single equation for the derivative of the error with respect to the weight vector as a whole.
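As a made-up numerical example: a layer of 600 units fully connected to a layer of 1024 units contributes a 600 × 1024 weight matrix (614,400 weights) plus 1024 biases; concatenating all such matrices and bias terms across the network gives the single “weight vector” that the training equations operate on.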
Don’t use phrases like “the frames are trained” or “inputs are trained” – you will confuse yourself. The trainable parameters in a neural network are the weights that connect one layer to the next layer, plus the bias values in each unit (neuron).
In the simplest form of training, the training input vectors are presented to the network one at a time, and a forward pass is made. The error between the network’s outputs and the correct (target) outputs is measured. After all training vectors have been presented, the average error at each output unit is calculated, and this signal is back-propagated: the weights (and biases) are thus updated by a small amount. This whole procedure is called gradient descent. It is repeated for a number of epochs (i.e., passes through the complete training set), until the average error – summed across all the outputs – stops decreasing (i.e., converges to a stable value).
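In symbols (generic notation, not from any particular paper): each weight $w$ is nudged in the direction that reduces the error,
$w \leftarrow w - \eta \frac{\partial E}{\partial w}$
where $E$ is the error averaged over the training data (or, below, over a mini-batch) and $\eta$ is a small learning rate.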
In practice, the above procedure is very slow. So, the weights are usually updated after processing smaller subsets (called “mini-batches”) of the training set. The training procedure is then called stochastic gradient descent. The term stochastic is used to indicate that the procedure is now approximate, because we are measuring the error on only a subset of the training data. It is usual to randomly shuffle all the frames in the training set before dividing it into mini-batches, so that each mini-batch contains a nice variety of input and output values. In general, this form of training is much faster: it takes fewer epochs to converge.
Each individual input unit (“neuron”) in a neural network takes just one number as its input (typically scaled to between -1 and 1, or to between 0 and 1).
The set of input units therefore takes a vector as input: this is a single frame of linguistic specification.
In the simplest configuration (a feed-forward network), the neural network takes a single frame of linguistic specification as input, and predicts a single frame of vocoder parameters as output. Each frame is thus processed independently of all the other frames.
I’ve also added more information on changing the relative weight between the target cost and the join cost.
Maybe you’d like to fix the permissions on all files and directories below the current directory:
find . -type f -exec chmod u+rw,go+r {} \;
find . -type d -exec chmod u+rwx,go+rx {} \;
You can time Festival, just like any other program. But, to get meaningful results you may not want to run it in interactive mode. Create a script for Festival to execute (load your voice, synthesise a test set, exit) and time the execution of that:
$ time festival myscript.scm
where myscript.scm might contain something like
(voice_localdir_multisyn-rpx)
(SayText "My first sentence.")
(SayText "My second sentence.")
(SayText "...and so on...")
(quit)
Make the test set large enough so that the time spent loading the voice is not a significant portion of the execution time.
You should probably disable audio output too, which can be done by changing the playback method to use a command, and setting that command to be empty – add these lines to the top of the script
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Command "")
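Putting the pieces together, the whole timing script might then look like this (a sketch; substitute your own voice name and a much longer list of test sentences):
(Parameter.set 'Audio_Method 'Audio_Command)  ; play audio via an external command...
(Parameter.set 'Audio_Command "")             ; ...which is empty, so nothing is actually played
(voice_localdir_multisyn-rpx)                 ; load the voice being timed
(SayText "My first sentence.")
(SayText "My second sentence.")
(quit)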
When you only have two systems to compare, then a pairwise test is a good choice.
Offering a “neither” option is a design choice on your part and I don’t think there is a right or wrong answer.
Simple forced-choice (2-way response):
- Pros: the simplest possible task for listeners (easy to explain to them, hard for them to do it incorrectly); no danger that they will choose “neither” all the time
- Cons: you will be forcing listeners to guess, or provide an arbitrary response, for pairs where they really don’t have a preference
Including a third “neither” option (3-way response):
- Pros: might be a more natural task for listeners; the number of “neither” responses is itself informative
- Cons: listeners might choose “neither” too much of the time, and not bother listening for small differences (might make the test less sensitive)
Remember that, during research and development, we often need to measure quite small improvements, so we need to use sensitive tests. In other words, we are trying to measure an improvement that, by itself, may not significantly improve real-world performance in a final application. This is because research typically proceeds in small steps (with only infrequent “breakthroughs”), but over time those small incremental improvements do accumulate and do indeed lead to measurable real-world improvements.
What do we do with the “neither” responses? Well, we can report them as in Table 1 of this paper, in which only selected pairs from a set of systems were compared, or graphically as in Figure 1 of this paper.
Without the “neither” option, we would report the results as in Figure 7 of this paper.
Pairwise comparisons scale very badly with the number of systems we are comparing. The reason for using multi-level or continuous-valued responses (e.g., MOS or MUSHRA) is so that we can compare many different systems at once, without having to do a quadratic number of pairwise tests.
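To see how quickly this grows (simple counting, not taken from the papers above): comparing $N$ systems pairwise needs $N(N-1)/2$ tests, which is 6 tests for 4 systems but already 45 for 10 systems, whereas a single MOS or MUSHRA test can rate all $N$ systems in one listening test.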