Forum Replies Created
Don’t use phrases like “the frames are trained” or “inputs are trained” – you will confuse yourself. The trainable parameters in a neural network are the weights that connect one layer to the next layer, plus the bias values in each unit (neuron).
In the simplest form of training, the training input vectors are presented to the network one at a time, and a forward pass is made. The error between the network’s outputs and the correct (target) outputs is measured. After all training vectors have been presented, the average error at each output unit is calculated, and this error signal is back-propagated: the weights (and biases) are then updated by a small amount. This whole procedure is called gradient descent. It is repeated for a number of epochs (i.e., passes through the complete training set), until the average error – summed across all the outputs – stops decreasing (i.e., converges to a stable value).
In practice, the above procedure is very slow. So, the weights are usually updated after processing smaller subsets (called “mini-batches”) of the training set. The training procedure is then called stochastic gradient descent. The term stochastic is used to indicate that the procedure is now approximate, because we are measuring the error on only a subset of the training data. It is usual to randomly shuffle all the frames in the training set before dividing it into mini-batches, so that each mini-batch contains a good variety of input and output values. In general, this form of training is much faster: it takes fewer epochs to converge.
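For concreteness, here is a minimal NumPy sketch of that shuffle / mini-batch / epoch loop. The linear model, the synthetic data and the learning rate are just placeholders for illustration (this is not course code):

# Minimal sketch of mini-batch stochastic gradient descent (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# A synthetic "training set": 1000 input frames of dimension 20, target frames of dimension 5
X = rng.standard_normal((1000, 20))
true_W = rng.standard_normal((20, 5))
Y = X @ true_W + 0.1 * rng.standard_normal((1000, 5))

W = np.zeros((20, 5))            # trainable weights
b = np.zeros(5)                  # trainable biases
learning_rate = 0.01
batch_size = 32

for epoch in range(20):                            # one epoch = one pass through the training set
    order = rng.permutation(len(X))                # shuffle the frames at the start of each epoch
    for start in range(0, len(X), batch_size):     # process one mini-batch at a time
        idx = order[start:start + batch_size]
        x, y = X[idx], Y[idx]
        pred = x @ W + b                           # forward pass
        err = pred - y                             # error at the output units
        grad_W = x.T @ err / len(idx)              # gradient, averaged over the mini-batch
        grad_b = err.mean(axis=0)
        W -= learning_rate * grad_W                # update the weights by a small amount
        b -= learning_rate * grad_b                # ...and the biases
    mse = np.mean((X @ W + b - Y) ** 2)
    print(f"epoch {epoch}: mean squared error {mse:.4f}")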
Each individual input unit (“neuron”) in a neural network takes just one number as its input (typically scaled to between -1 and 1, or to between 0 and 1).
The set of input units therefore takes a vector as input: this is a single frame of linguistic specification.
In the simplest configuration (a feed-forward network), the neural network takes a single frame of linguistic specification as input, and predicts a single frame of vocoder parameters as output. Each frame is thus processed independently of all the other frames.
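As a sketch of that frame-by-frame mapping, here is a minimal NumPy example. The layer sizes (300 linguistic features in, two hidden layers, 80 vocoder parameters out) are invented for illustration and are not those of any particular system; the point is that the trainable parameters are just one weight matrix and one bias vector per layer, and that each frame is processed on its own:

# Minimal sketch of a feed-forward network applied frame by frame (illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [300, 256, 256, 80]

# The trainable parameters: one weight matrix and one bias vector per layer
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(frame):
    """Map one frame of linguistic features to one frame of vocoder parameters."""
    h = frame
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)                 # hidden layers, with a squashing non-linearity
    return h @ weights[-1] + biases[-1]        # linear output layer

# Each frame is processed independently of all the other frames
linguistic_frames = rng.uniform(-1, 1, size=(500, 300))   # e.g., 500 frames of one utterance
vocoder_frames = np.array([forward(f) for f in linguistic_frames])
print(vocoder_frames.shape)                    # (500, 80)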
I’ve also added more information on changing the relative weight between the target cost and the join cost.
Maybe you’d like to fix the permissions on all files and directories below the current directory:
find . -type f -exec chmod u+rw,go+r {} \;
find . -type d -exec chmod u+rwx,go+rx {} \;
You can time Festival, just like any other program. But, to get meaningful results you may not want to run it in interactive mode. Create a script for Festival to execute (load your voice, synthesise a test set, exit) and time the execution of that:
$ time festival myscript.scm
where myscript.scm might contain something like
(voice_localdir_multisyn-rpx)
(SayText "My first sentence.")
(SayText "My second sentence.")
(SayText "...and so on...")
(quit)
Make the test set large enough so that the time spent loading the voice is not a significant portion of the execution time.
You should probably disable audio output too, which can be done by changing the playback method to use a command, and setting that command to be empty – add these lines to the top of the script
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Command "")
When you only have two systems to compare, then a pairwise test is a good choice.
Offering a “neither” option is a design choice on your part and I don’t think there is a right or wrong answer.
Simple forced-choice (2-way response):
- Pros: the simplest possible task for listeners (easy to explain to them, hard for them to do it incorrectly); no danger that they will choose “neither” all the time
- Cons: you will be forcing listeners to guess, or provide an arbitrary response, for pairs where they really don’t have a preference
Including a third “neither” option (3-way response):
- Pros: might be a more natural task for listeners; the number of “neither” responses is itself informative
- Cons: listeners might choose “neither” too much of the time, and not bother listening for small differences (which might make the test less sensitive)
Remember that, during research and development, we often need to measure quite small improvements, so we need to use sensitive tests. In other words, we are trying to measure an improvement that, by itself, may not significantly improve real-world performance in a final application. This is because research typically proceeds in small steps (with only infrequent “breakthroughs”), but over time those small incremental improvements do accumulate and do indeed lead to measurable real-world improvements.
What do we do with the “neither” responses? Well, we can report them as in Table 1 of this paper, in which only selected pairs from a set of systems were compared, or graphically as in Figure 1 of this paper.
Without the “neither” option, we would report the results as in Figure 7 of this paper.
Pairwise comparisons scale very badly with the number of systems we are comparing. The reason for using multi-level or continuous-valued responses (e.g., MOS or MUSHRA) is so that we can compare many different systems at once, without having to do a quadratic number of pairwise tests.
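To see just how badly pairwise tests scale: comparing N systems pairwise needs N(N-1)/2 tests, which you can check with a couple of lines (illustrative numbers only):

# Number of pairwise comparisons needed for N systems: N * (N - 1) / 2
for n in (2, 5, 10, 20):
    print(n, "systems need", n * (n - 1) // 2, "pairwise comparisons")
# 2 -> 1, 5 -> 10, 10 -> 45, 20 -> 190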
I’ll add more detail in the next lecture about how the linguistic specification from the front end is converted to a vector of numerical (mostly binary) features, ready for input to the DNN.
Good – you have correctly understood that the DNN is not replacing the HMM, but rather it is replacing the regression tree that is used to cluster (tie) the HMM parameters. In current DNN systems, we still need a separate model of the sequence (and in particular, something that divides phones into sub-phonetic units and models the durations of those sub-phone units): that is why there is still an HMM involved, and the HMM state is the sub-phone unit.
In the next lecture I will spell out the relationship between the two approaches in more detail.
All good questions – I’ll cover these points in the next lecture.
You need to report more than just the result of the listening test, of course. Explain what you were testing (i.e., what was your hypothesis), and how that led to your chosen design of listening test.
You should report enough detail about your listening test to tell the reader what you did and how you did it. A screenshot is one possible way to explain what the interface looked like, and exactly what question you presented to the listeners.
Not quite right, no.
Let’s separate out the three stages
1. preparing the data
deltas are computed from the so-called ‘static’ parameters, as explained above (e.g., simple difference between consecutive frames) – this is a simple deterministic process
2. training the model
the ‘static’ and delta parameters are now components of the same observation vector of the HMMs, which is modelled with multivariate Gaussians; the fact that one part of the observation vector contains the deltas of another part is not taken into consideration (*)
3. generation
MLPG finds the most likely trajectory, given the statics and deltas – think of the deltas as constraints on how fast the trajectory moves from the static of one state to the static of the next state
(*) there are more advanced training algorithms that respect the relationship between statics and deltas – we don’t really need to know about that here
Deltas (of any parameter, including F0) are always computed using more than one frame. There is no way to compute them from a single frame, because there is only a single value (of F0, say) to work from.
Minimally, we need the current frame and one adjacent frame (previous or next) to compute the delta – in this case, it would simply be the difference between the two frames (the value in one frame minus the value in the other frame). It is actually more common to compute the deltas across several frames, centred on the current frame.
Adding deltas is a way to compensate for the frame-wise independence assumption that is made by the HMM. However, in synthesis, we also need them as a constraint on trajectory generation at synthesis time.
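Here is a minimal sketch of both versions of the computation, for a one-dimensional parameter track such as F0. The several-frame window coefficients below (half the difference between the next and previous frames) are just one common choice, not necessarily what your toolkit uses:

# Minimal sketch of computing deltas from a sequence of 'static' parameter values.
import numpy as np

def simple_delta(static):
    """Delta as the difference between the current frame and the previous frame."""
    d = np.zeros_like(static)
    d[1:] = static[1:] - static[:-1]
    return d

def windowed_delta(static, window=(-0.5, 0.0, 0.5)):
    """Delta computed over several frames, centred on the current frame."""
    padded = np.concatenate([static[:1], static, static[-1:]])   # repeat the edge frames
    return np.correlate(padded, window, mode="valid")

f0 = np.array([100.0, 102.0, 104.0, 103.0, 101.0])   # an invented F0 track, one value per frame
print(simple_delta(f0))      # [ 0.  2.  2. -1. -2.]
print(windowed_delta(f0))    # [ 1.   2.   0.5 -1.5 -1. ]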
Just open the label file in Aquamacs and remove all lines that have an sp label of zero duration (i.e., whose end time is the same as the end time of the preceding label). You only need to do this for the few files under investigation, not the whole database.
For example – before:
2.046 26 i ; score -199.268387 ;
2.192 26 ng ; score -4145.471191 ;
2.192 26 sp ; score -0.229073 ;
2.232 26 dh ; score -1160.089966 ;
2.304 26 @ ; score -2054.851562 ;
and after:
2.046 26 i ; score -199.268387 ;
2.192 26 ng ; score -4145.471191 ;
2.232 26 dh ; score -1160.089966 ;
2.304 26 @ ; score -2054.851562 ;
You need to
bash$ source setup.sh
first, so that the shell variable $MBDIR is defined.
To load the pitchmarks, first convert them to label files, then load one of those labels files into a transcription pane in Wavesurfer (after loading the corresponding waveform).
Yes, that’s the correct idea.
In fact we don’t actually average the model parameters, but instead we pool all the training data associated with those models and use that to train a new, single model. Averaging the model parameters would be incorrect because it wouldn’t account for the fact that each was trained on a different amount of data.
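A tiny numerical sketch (one-dimensional data, made-up numbers) of why pooling the data is not the same as averaging the parameters:

# Two clusters of training data of very different sizes (invented values).
import numpy as np

a = np.array([1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.05, 0.95])   # 8 frames, mean 1.0
b = np.array([5.0, 5.2])                                    # only 2 frames, mean 5.1

# Pooling the data and training one model on all of it:
pooled_mean = np.concatenate([a, b]).mean()       # about 1.82, weighted towards the larger cluster

# Naively averaging the two models' means:
averaged_mean = (a.mean() + b.mean()) / 2         # 3.05, which ignores how much data each model had

print(pooled_mean, averaged_mean)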