Forum Replies Created
I’ll add more detail in the next lecture about how the linguistic specification from the front end is converted to a vector of numerical (mostly binary) features, ready for input to the DNN.
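To give a flavour of what that means, here is a minimal sketch (in Python, with an invented toy feature set, not the actual front-end code) of how a categorical linguistic specification can be turned into a vector of mostly binary values:

PHONES = ["sil", "a", "b", "d", "dh", "i", "ng"]   # toy phone inventory (an assumption)
STRESS = [0, 1]

def encode(phone, stress, frac_through_state, state_index):
    vec  = [1.0 if phone == p else 0.0 for p in PHONES]   # one-hot (binary) phone identity
    vec += [1.0 if stress == s else 0.0 for s in STRESS]  # one-hot stress value
    vec += [frac_through_state, state_index]              # numeric positional features
    return vec

print(encode("dh", 0, 0.25, 2))

A real system encodes many more features (phonetic context, syllable and phrase position, and so on), but they all end up as numbers in one long vector per frame.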
Good – you have correctly understood that the DNN is not replacing the HMM, but rather it is replacing the regression tree that is used to cluster (tie) the HMM parameters. In current DNN systems, we still need a separate model of the sequence (and in particular, something that divides phones into sub-phonetic units and models the durations of those sub-phone units): that is why there is still an HMM involved, and the HMM state is the sub-phone unit.
In the next lecture I will spell out the relationship between the two approaches in more detail.
All good questions – I’ll cover these points in the next lecture.
You need to report more than just the result of the listening test, of course. Explain what you were testing (i.e., what was your hypothesis), and how that led to your chosen design of listening test.
You should report enough detail about your listening test to tell the reader what you did and how you did it. A screenshot is one possible way to explain what the interface looked like, and exactly what question you presented to the listeners.
Not quite right, no.
Let’s separate out the three stages:
1. preparing the data
deltas are computed from the so-called ‘static’ parameters, as explained above (e.g., simple difference between consecutive frames) – this is a simple deterministic process
2. training the model
the ‘static’ and delta parameters are now components of the same observation vector of the HMMs, which is modelled with multivariate Gaussians; the fact that one part of the observation vector contains the deltas of another part is not taken into consideration (*)
3. generation
MLPG finds the most likely trajectory, given the statics and deltas – think of the deltas as constraints on how fast the trajectory moves from the static of one state to the static of the next state (there is a small numerical sketch of this below)
(*) there are more advanced training algorithms that respect the relationship between statics and deltas – we don’t really need to know about that here
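To make stage 3 concrete, here is a small numerical sketch of MLPG for a single one-dimensional parameter stream. The means and variances are invented, and I have assumed the delta of frame t is 0.5*(c[t+1] - c[t-1]); a real system reads these statistics from the chosen HMM states.

import numpy as np

T = 6
mu_static  = np.array([0., 0., 1., 1., 2., 2.])   # per-frame static means (from the states)
var_static = np.full(T, 0.1)
mu_delta   = np.zeros(T)                          # per-frame delta means
var_delta  = np.full(T, 0.05)

# Stack the static and delta 'observations' as o = W c, where c is the trajectory
W = np.zeros((2 * T, T))
W[:T, :] = np.eye(T)                              # static rows: o_t = c_t
for t in range(1, T - 1):                         # delta rows: 0.5 * (c[t+1] - c[t-1])
    W[T + t, t - 1], W[T + t, t + 1] = -0.5, 0.5

mu   = np.concatenate([mu_static, mu_delta])
prec = np.diag(np.concatenate([1.0 / var_static, 1.0 / var_delta]))

# The most likely trajectory solves (W' P W) c = W' P mu
c = np.linalg.solve(W.T @ prec @ W, W.T @ prec @ mu)
print(c)

The delta terms are what stop the solution from simply jumping between the static means: they smooth the trajectory across state boundaries.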
Deltas (of any parameter, including F0) are always computed using more than one frame. There is no way to compute them from a single frame, because there is only a single value (of F0, say) to work from.
Minimally, we need the current frame and one adjacent frame (previous or next) to compute the delta – in this case, it would simply be the difference between the two frames (the value in one frame minus the value in the other frame). It is actually more common to compute the deltas across several frames, centred on the current frame.
Adding deltas is a way to compensate for the frame-wise independence assumption that is made by the HMM. However, in synthesis, we also need them as a constraint on trajectory generation at synthesis time.
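If it helps, here is a minimal sketch of computing deltas from a sequence of static values (such as per-frame F0), using a window centred on the current frame. This is my own illustration, not the exact formula any particular toolkit uses:

import numpy as np

def deltas(static, half_window=1):
    static = np.asarray(static, dtype=float)
    d = np.zeros_like(static)
    for t in range(len(static)):
        lo = max(t - half_window, 0)
        hi = min(t + half_window, len(static) - 1)
        d[t] = (static[hi] - static[lo]) / max(hi - lo, 1)   # slope across the window
    return d

print(deltas([100.0, 102.0, 105.0, 110.0, 110.0]))   # toy F0 values in Hz

With half_window=1 and a frame in the middle of the sequence, this is just the next value minus the previous one, divided by two.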
Just open the label file in Aquamacs and remove all lines that have an sp label of zero duration (i.e., whose end time is the same as the end time of the preceding label). You only need to do this for the few files under investigation, not the whole database.
For example – before:
2.046 26 i ; score -199.268387 ;
2.192 26 ng ; score -4145.471191 ;
2.192 26 sp ; score -0.229073 ;
2.232 26 dh ; score -1160.089966 ;
2.304 26 @ ; score -2054.851562 ;
after:
2.046 26 i ; score -199.268387 ;
2.192 26 ng ; score -4145.471191 ;
2.232 26 dh ; score -1160.089966 ;
2.304 26 @ ; score -2054.851562 ;
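If you would rather not edit by hand, a small script along these lines would do it. This is a sketch only: the file name is hypothetical, and you may need to adjust the parsing if your label files have header lines.

def strip_zero_duration_sp(lines):
    out, prev_end = [], None
    for line in lines:
        fields = line.split()
        end, label = float(fields[0]), fields[2]
        if label == "sp" and end == prev_end:
            continue                        # zero-duration sp: drop this line
        out.append(line)
        prev_end = end
    return out

with open("utt001.lab") as f:               # hypothetical file name
    print("".join(strip_zero_duration_sp(f.readlines())), end="")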
You need to run
bash$ source setup.sh
first, so that the shell variable $MBDIR is defined.
To load the pitchmarks, first convert them to label files, then load one of those label files into a transcription pane in Wavesurfer (after loading the corresponding waveform).
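As a rough illustration only: assuming your pitchmark files are ASCII EST_Track files (if they are binary you will first need to convert them to ASCII, e.g. with the Edinburgh Speech Tools ch_track program), something like the following would write a simple label file with one label per pitchmark. The file names are hypothetical, and you may need to adjust the output format for what Wavesurfer’s transcription pane expects.

def pm_to_lab(pm_path, lab_path):
    with open(pm_path) as f:
        lines = f.read().splitlines()
    # data lines follow the EST header
    start = next(i for i, l in enumerate(lines) if l.startswith("EST_Header_End")) + 1
    with open(lab_path, "w") as out:
        out.write("#\n")
        for line in lines[start:]:
            if line.strip():
                time = line.split()[0]            # first field is the pitchmark time
                out.write("%s 26 pm\n" % time)

pm_to_lab("utt001.pm", "utt001_pm.lab")           # hypothetical file names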
Yes, that’s the correct idea.
In fact we don’t actually average the model parameters, but instead we pool all the training data associated with those models and use that to train a new, single model. Averaging the model parameters would be incorrect because it wouldn’t account for the fact that each was trained on a different amount of data.
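A toy numerical illustration of the difference (invented numbers):

import numpy as np

data_a = np.ones(8)                      # model A was trained on 8 frames, all 1.0
data_b = np.full(2, 5.0)                 # model B was trained on only 2 frames, all 5.0

average_of_means = (data_a.mean() + data_b.mean()) / 2       # 3.0 - ignores the data counts
pooled_mean      = np.concatenate([data_a, data_b]).mean()   # 1.8 - what retraining on the pooled data gives
print(average_of_means, pooled_mean)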
Correct – the value is a beam width (wider = less pruning) and takes values between 0 and 1. The default values are set in the file …/festival/lib/multisyn/multisyn.scm
I’ve added more information on pruning to the exercise.
I’ve realised there is indeed a run-time interface to all of the various join and target cost weights and beam widths, etc. I had originally thought that these were deprecated and the values were compiled into the code, but I was wrong.
See the full list of functions – look for those whose names start with “du_” (which means “diphone unit”).
This should be simpler than what you’re doing above.
We’ll look at this in the lecture.
Yes, there is some pruning of the candidates before search commences, then more pruning during the Viterbi search.
Some of the relevant functions within Festival are as follows:
festival> (du_voice.set_tc_rescoring_beam currentMultiSynVoice 0.5)
festival> (du_voice.set_tc_rescoring_weight currentMultiSynVoice 3.0)
festival> (du_voice.set_ob_pruning_beam currentMultiSynVoice 0.3)
festival> (du_voice.set_pruning_beam currentMultiSynVoice 0.3)
which you execute after loading a multisyn voice. Note that you use them literally as above, with the “currentMultiSynVoice” argument exactly as written (i.e., don’t replace that with the name of your voice).
See the full list of functions – look for those whose names start with “du_” (which means “diphone unit”).
As you make the beam sizes smaller, the speech will gradually get worse. With very small values you may, in some cases, prevent any candidate sequence from being found, and get the error message “No best candidate sequence found”.
Yes, I think that would work. Changing the normalisation of the join cost coefficients (not just MFCCs – also F0 and energy) effectively changes the relative weight between join cost and target cost.
Try making the join cost coeffs very small – you should get more joins (fewer contiguous sequences of candidates), and therefore presumably more bad joins.
Try making them rather large, and you should get more contiguous sequences of candidates, but which match the target context less well.
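To see why, here is a toy example (invented numbers, not Festival’s actual cost computation): the search minimises the sum of target and join costs, so scaling the join cost features up or down acts just like a relative weight between the two.

def total_cost(target_costs, join_costs, join_scale=1.0):
    return sum(target_costs) + join_scale * sum(join_costs)

print(total_cost([0.2, 0.5, 0.1], [1.0, 0.8], join_scale=1.0))   # joins count for a lot
print(total_cost([0.2, 0.5, 0.1], [1.0, 0.8], join_scale=0.1))   # joins are now cheap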
Festival doesn’t do anything to bias against joins in [r] etc – but commercial systems certainly do.
The join cost for naturally-contiguous units is simply defined to be zero and isn’t even calculated.
Festival computes join cost entirely locally, just from the frames either side of the join.
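As a minimal sketch of that idea (my own illustration – the feature layout, distance measure and weighting are assumptions, not Festival’s exact code):

import numpy as np

def join_cost(left_frame, right_frame, contiguous):
    if contiguous:
        return 0.0                                   # naturally-contiguous units: defined as zero
    left, right = np.asarray(left_frame), np.asarray(right_frame)
    return float(np.linalg.norm(left - right))       # distance between the two boundary frames

# e.g. each frame could be [MFCCs..., F0, energy] taken either side of the join
print(join_cost([1.0, 0.5, 120.0, 0.7], [1.1, 0.4, 118.0, 0.6], contiguous=False))
print(join_cost([1.0, 0.5, 120.0, 0.7], [1.1, 0.4, 118.0, 0.6], contiguous=True))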