Forum Replies Created
You can time Festival, just like any other program. But to get meaningful results, you may not want to run it in interactive mode. Create a script for Festival to execute (load your voice, synthesise a test set, exit) and time the execution of that:
$ time festival myscript.scm
where myscript.scm might contain something like:
(voice_localdir_multisyn-rpx)
(SayText "My first sentence.")
(SayText "My second sentence.")
(SayText "...and so on...")
(quit)
Make the test set large enough so that the time spent loading the voice is not a significant portion of the execution time.
You should probably disable audio output too, which can be done by changing the playback method to use a command and setting that command to be empty. Add these lines to the top of the script:
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Command "")
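Putting those pieces together, a complete myscript.scm for timing might look like this (this just combines the lines above; in practice you would use a much larger set of test sentences):
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Command "")
(voice_localdir_multisyn-rpx)
(SayText "My first sentence.")
(SayText "My second sentence.")
(SayText "...and so on...")
(quit)
and then time it with
$ time festival myscript.scm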
When you only have two systems to compare, then a pairwise test is a good choice.
Offering a “neither” option is a design choice on your part and I don’t think there is a right or wrong answer.
Simple forced-choice (2-way response):
- Pros: the simplest possible task for listeners (easy to explain to them, hard for them to do it incorrectly); no danger that they will choose “neither” all the time
- Cons: you will be forcing listeners to guess, or provide an arbitrary response, for pairs where they really don’t have a preference
Including a third “neither” option (3-way response):
- Pros: might be a more natural task for listeners; the number of “neither” responses is itself informative
- Cons: listeners might choose “neither” too much of the time, and not bother listening for small differences (which might make the test less sensitive)
Remember that, during research and development, we often need to measure quite small improvements, so need to use sensitive tests. In other words, we are trying to measure an improvement that, by itself, may not significantly improve real-world performance in a final application. This is because research typically proceeds in small steps (with only infrequent “breakthroughs”), but over time those small incremental improvements do accumulate and do indeed lead to measurable real-world improvements.
What do we do with the “neither” responses? Well, we can report them as in Table 1 of this paper, in which only selected pairs from a set of systems were compared, or graphically as in Figure 1 of this paper.
Without the “neither” option, we would report the results as in Figure 7 of this paper.
Pairwise comparisons scale very badly with the number of systems we are comparing. The reason for using multi-level or continuous-valued responses (e.g., MOS or MUSHRA) is so that we can compare many different systems at once, without having to do a quadratic number of pairwise tests.
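To put some numbers on that: comparing N systems exhaustively needs N(N-1)/2 pairwise tests, so even 10 systems would require 45 separate comparisons, whereas a single MOS or MUSHRA test can present all 10 systems to the same set of listeners.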
I’ll add more detail in the next lecture about how the linguistic specification from the front end is converted to a vector of numerical (mostly binary) features, ready for input to the DNN.
Good – you have correctly understood that the DNN is not replacing the HMM, but rather it is replacing the regression tree that is used to cluster (tie) the HMM parameters. In current DNN systems, we still need a separate model of the sequence (and in particular, something that divides phones into sub-phonetic units and models the durations of those sub-phone units): that is why there is still an HMM involved, and the HMM state is the sub-phone unit.
In the next lecture I will spell out the relationship between the two approaches in more detail.
All good questions – I’ll cover these points in the next lecture.
You need to report more than just the result of the listening test, of course. Explain what you were testing (i.e., what was your hypothesis), and how that led to your chosen design of listening test.
You should report enough detail about your listening test to tell the reader what you did and how you did it. A screenshot is one possible way to explain what the interface looked like, and exactly what question you presented to the listeners.
Not quite right, no.
Let’s separate out the three stages:
1. preparing the data
deltas are computed from the so-called ‘static’ parameters, as explained above (e.g., simple difference between consecutive frames) – this is a simple deterministic process
2. training the model
the ‘static’ and delta parameters are now components of the same observation vector of the HMMs, which is modelled with multivariate Gaussians; the fact that one part of the observation vector contains the deltas of the other part is not taken into consideration (*)
3. generation
MLPG finds the most likely trajectory, given the statics and deltas – think of the deltas as constraints on how fast the trajectory moves from the static of one state to the static of the next state (a sketch of the standard formulation follows below this list)
(*) there are more advanced training algorithms that respect the relationship between statics and deltas – we don’t really need to know about that here
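In case a formula helps, here is a sketch of the standard closed-form solution behind MLPG (the notation here is mine, just for illustration). Stack the static parameters of a whole utterance into a vector $c$, and let $W$ be the matrix that appends the deltas, so the full observation sequence is $o = Wc$. Given the state-level means $\mu$ and covariances $\Sigma$ that the HMM predicts for $o$, the most likely static trajectory is
$\hat{c} = (W^{\top}\Sigma^{-1}W)^{-1} W^{\top}\Sigma^{-1}\mu$
which is exactly “the most likely trajectory, given the statics and deltas”.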
Deltas (of any parameter, including F0) are always computed using more than one frame. There is no way to compute them from a single frame, because there is only a single value (of F0, say) to work from.
Minimally, we need the current frame and one adjacent frame (previous or next) to compute the delta – in this case, it would simply be the difference between the two frames (the value in one frame minus the value in the other frame). It is actually more common to compute the deltas across several frames, centred on the current frame.
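One common formulation (used, for example, in HTK-style tools) computes the delta as a weighted regression over a window of $\pm\Theta$ frames centred on frame $t$:
$\Delta c_t = \frac{\sum_{\theta=1}^{\Theta}\theta\,(c_{t+\theta} - c_{t-\theta})}{2\sum_{\theta=1}^{\Theta}\theta^{2}}$
With $\Theta = 1$ this is just half the difference between the next and previous frames, matching the minimal case described above.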
Adding deltas is a way to compensate for the frame-wise independence assumption that is made by the HMM. However, in synthesis, we also need them as a constraint on trajectory generation at synthesis time.
Just open the label file in Aquamacs and remove all lines that have an sp label of zero duration (i.e., whose end time is the same as the end time of the preceding label). You only need to do this for the few files under investigation, not the whole database.
For example – before:
2.046 26 i ; score -199.268387 ;
2.192 26 ng ; score -4145.471191 ;
2.192 26 sp ; score -0.229073 ;
2.232 26 dh ; score -1160.089966 ;
2.304 26 @ ; score -2054.851562 ;
after:
2.046 26 i ; score -199.268387 ;
2.192 26 ng ; score -4145.471191 ;
2.232 26 dh ; score -1160.089966 ;
2.304 26 @ ; score -2054.851562 ;
You need to
bash$ source setup.sh
first, so that the shell variable $MBDIR is defined.
To load the pitchmarks, first convert them to label files, then load one of those label files into a transcription pane in Wavesurfer (after loading the corresponding waveform).
Yes, that’s the correct idea.
In fact we don’t actually average the model parameters, but instead we pool all the training data associated with those models and use that to train a new, single model. Averaging the model parameters would be incorrect because it wouldn’t account for the fact that each was trained on a different amount of data.
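As a made-up numerical illustration of why pooling (rather than averaging) is correct: suppose one model’s mean was estimated from 100 frames with mean 5.0, and another’s from 900 frames with mean 6.0. Training a single model on the pooled data gives a mean of
(100 x 5.0 + 900 x 6.0) / 1000 = 5.9
whereas naively averaging the two model means would give 5.5, ignoring the fact that the second model was estimated from nine times as much data.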
Correct – the value is a beam width (wider = less pruning) and takes values between 0 and 1. The default values are set in the file …/festival/lib/multisyn/multisyn.scm
I’ve added more information on pruning to the exercise.
I’ve realised there is indeed a run-time interface to all of the various join and target cost weights and beam widths, etc. I had originally thought that these were deprecated and the values were compiled into the code, but I was wrong.
See the full list of functions – look for those that start with “du_” (which means “diphone unit”).
This should be simpler than what you’re doing above.
We’ll look at this in the lecture.
Yes, there is some pruning of the candidates before search commences, then more pruning during the Viterbi search.
Some of the relevant functions within Festival are as follows:
festival> (du_voice.set_tc_rescoring_beam currentMultiSynVoice 0.5)
festival> (du_voice.set_tc_rescoring_weight currentMultiSynVoice 3.0)
festival> (du_voice.set_ob_pruning_beam currentMultiSynVoice 0.3)
festival> (du_voice.set_pruning_beam currentMultiSynVoice 0.3)
which you execute after loading a multisyn voice. Note that you use them literally as above, with the “currentMultiSynVoice” argument exactly as written (i.e., don’t replace that with the name of your voice).
See the full list of functions – look for those that start with “du_” (which means “diphone unit”).
As you make the beam sizes smaller, the speech will gradually get worse. For very small numbers in some cases, you may prevent any sequence being found, and get the error message “No best candidate sequence found”.
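For example, after loading a multisyn voice you could try something like this (the value 0.05 is just an illustrative, deliberately aggressive setting):
festival> (du_voice.set_pruning_beam currentMultiSynVoice 0.05)
festival> (SayText "My first sentence.")
and listen to how the output changes; with very small values you may eventually hit the “No best candidate sequence found” error.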