Forum Replies Created
You can time Festival, just like any other program. But to get meaningful results, you may not want to run it in interactive mode. Create a script for Festival to execute (load your voice, synthesise a test set, exit) and time the execution of that:
$ time festival myscript.scm
where myscript.scm might contain something like:
(voice_localdir_multisyn-rpx)
(SayText "My first sentence.")
(SayText "My second sentence.")
(SayText "...and so on...")
(quit)
Make the test set large enough so that the time spent loading the voice is not a significant portion of the execution time.
You should probably disable audio output too, which can be done by changing the playback method to use a command and setting that command to be empty. Add these lines to the top of the script:
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Command "")
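Putting those pieces together, a complete myscript.scm for timing might look like this (this just combines the lines above; in practice you would use a much larger set of test sentences):
(Parameter.set 'Audio_Method 'Audio_Command)
(Parameter.set 'Audio_Command "")
(voice_localdir_multisyn-rpx)
(SayText "My first sentence.")
(SayText "My second sentence.")
(SayText "...and so on...")
(quit)
and then time it with
$ time festival myscript.scm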
When you only have two systems to compare, then a pairwise test is a good choice.
Offering a “neither” option is a design choice on your part and I don’t think there is a right or wrong answer.
Simple forced-choice (2-way response):
- Pros: the simplest possible task for listeners (easy to explain to them, hard for them to do it incorrectly); no danger that they will choose “neither” all the time
- Cons: you will be forcing listeners to guess, or provide an arbitrary response, for pairs where they really don’t have a preference
Including a third “neither” option (3-way response):
- Pros: might be a more natural task for listeners; the number of “neither” responses is itself informative
- Cons: listeners might choose “neither” too much of the time, and not bother listening for small differences (which might make the test less sensitive)
Remember that, during research and development, we often need to measure quite small improvements, so need to use sensitive tests. In other words, we are trying to measure an improvement that, by itself, may not significantly improve real-world performance in a final application. This is because research typically proceeds in small steps (with only infrequent “breakthroughs”), but over time those small incremental improvements do accumulate and do indeed lead to measurable real-world improvements.
What do we do with the “neither” responses? Well, we can report them as in Table 1 of this paper, in which only selected pairs from a set of systems were compared, or graphically as in Figure 1 of this paper.
Without the “neither” option, we would report the results as in Figure 7 of this paper.
Pairwise comparisons scale very badly with the number of systems we are comparing. The reason for using multi-level or continuous-valued responses (e.g., MOS or MUSHRA) is so that we can compare many different systems at once, without having to do a quadratic number of pairwise tests.
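To put some numbers on that: comparing N systems exhaustively needs N(N-1)/2 pairwise tests, so even 10 systems would require 45 separate comparisons, whereas a single MOS or MUSHRA test can present all 10 systems to the same set of listeners.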
I’ll add more detail in the next lecture about how the linguistic specification from the front end is converted to a vector of numerical (mostly binary) features, ready for input to the DNN.
Good – you have correctly understood that the DNN is not replacing the HMM, but rather it is replacing the regression tree that is used to cluster (tie) the HMM parameters. In current DNN systems, we still need a separate model of the sequence (and in particular, something that divides phones into sub-phonetic units and models the durations of those sub-phone units): that is why there is still an HMM involved, and the HMM state is the sub-phone unit.
In the next lecture I will spell out the relationship between the two approaches in more detail.
All good questions – I’ll cover these points in the next lecture.
You need to report more than just the result of the listening test, of course. Explain what you were testing (i.e., what was your hypothesis), and how that led to your chosen design of listening test.
You should report enough detail about your listening test to tell the reader what you did and how you did it. A screenshot is one possible way to explain what the interface looked like, and exactly what question you presented to the listeners.
Not quite right, no.
Let’s separate out the three stages:
1. preparing the data
deltas are computed from the so-called ‘static’ parameters, as explained above (e.g., simple difference between consecutive frames) – this is a simple deterministic process
2. training the model
the ‘static’ and delta parameters are now components of the same observation vector of the HMMs, which is modelled with multivariate Gaussians; the fact that one part of the observation vector contains the deltas of the other part is not taken into consideration (*)
3. generation
MLPG finds the most likely trajectory, given the statics and deltas – think of the deltas as constraints on how fast the trajectory moves from the static of one state to the static of the next state (a sketch of the standard formulation follows below this list)
(*) there are more advanced training algorithms that respect the relationship between statics and deltas – we don’t really need to know about that here
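In case a formula helps, here is a sketch of the standard closed-form solution behind MLPG (the notation here is mine, just for illustration). Stack the static parameters of a whole utterance into a vector $c$, and let $W$ be the matrix that appends the deltas, so the full observation sequence is $o = Wc$. Given the state-level means $\mu$ and covariances $\Sigma$ that the HMM predicts for $o$, the most likely static trajectory is
$\hat{c} = (W^{\top}\Sigma^{-1}W)^{-1} W^{\top}\Sigma^{-1}\mu$
which is exactly “the most likely trajectory, given the statics and deltas”.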
Deltas (of any parameter, including F0) are always computed using more than one frame. There is no way to compute them from a single frame, because there is only a single value (of F0, say) to work from.
Minimally, we need the current frame and one adjacent frame (previous or next) to compute the delta – in this case, it would simply be the difference between the two frames (the value in one frame minus the value in the other frame). It is actually more common to compute the deltas across several frames, centred on the current frame.
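One common formulation (used, for example, in HTK-style tools) computes the delta as a weighted regression over a window of $\pm\Theta$ frames centred on frame $t$:
$\Delta c_t = \frac{\sum_{\theta=1}^{\Theta}\theta\,(c_{t+\theta} - c_{t-\theta})}{2\sum_{\theta=1}^{\Theta}\theta^{2}}$
With $\Theta = 1$ this is just half the difference between the next and previous frames, matching the minimal case described above.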
Adding deltas is a way to compensate for the frame-wise independence assumption that is made by the HMM. However, in synthesis, we also need them as a constraint on trajectory generation at synthesis time.
Just open the label file in Aquamacs and remove all lines that have an sp label of zero duration (i.e., whose end time is the same as the end time of the preceding label). You only need to do this for the few files under investigation, not the whole database.
For example – before:
2.046 26 i ; score -199.268387 ;
2.192 26 ng ; score -4145.471191 ;
2.192 26 sp ; score -0.229073 ;
2.232 26 dh ; score -1160.089966 ;
2.304 26 @ ; score -2054.851562 ;
after:
2.046 26 i ; score -199.268387 ;
2.192 26 ng ; score -4145.471191 ;
2.232 26 dh ; score -1160.089966 ;
2.304 26 @ ; score -2054.851562 ;
You need to
bash$ source setup.sh
first, so that the shell variable $MBDIR is defined.
To load the pitchmarks, first convert them to label files, then load one of those label files into a transcription pane in Wavesurfer (after loading the corresponding waveform).
Yes, that’s the correct idea.
In fact we don’t actually average the model parameters, but instead we pool all the training data associated with those models and use that to train a new, single model. Averaging the model parameters would be incorrect because it wouldn’t account for the fact that each was trained on a different amount of data.
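As a made-up numerical illustration of why pooling (rather than averaging) is correct: suppose one model’s mean was estimated from 100 frames with mean 5.0, and another’s from 900 frames with mean 6.0. Training a single model on the pooled data gives a mean of
(100 x 5.0 + 900 x 6.0) / 1000 = 5.9
whereas naively averaging the two model means would give 5.5, ignoring the fact that the second model was estimated from nine times as much data.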
Correct – the value is a beam width (wider = less pruning) and takes values between 0 and 1. The default values are set in the file …/festival/lib/multisyn/multisyn.scm
I’ve added more information on pruning to the exercise.
I’ve realised there is indeed a run-time interface to all of the various join and target cost weights and beam widths, etc. I had originally thought that these were deprecated and the values were compiled into the code, but I was wrong.
See the full list of functions – look for those that start with “du_” (which means “diphone unit”).
This should be simpler than what you’re doing above.
We’ll look at this in the lecture.
Yes, there is some pruning of the candidates before search commences, then more pruning during the Viterbi search.
Some of the relevant functions within Festival are as follows:
festival> (du_voice.set_tc_rescoring_beam currentMultiSynVoice 0.5)
festival> (du_voice.set_tc_rescoring_weight currentMultiSynVoice 3.0)
festival> (du_voice.set_ob_pruning_beam currentMultiSynVoice 0.3)
festival> (du_voice.set_pruning_beam currentMultiSynVoice 0.3)
which you execute after loading a multisyn voice. Note that you use them literally as above, with the “currentMultiSynVoice” argument exactly as written (i.e., don’t replace that with the name of your voice).
See the full list of functions – look for those that start with “du_” (which means “diphone unit”).
As you make the beam sizes smaller, the speech will gradually get worse. For very small numbers in some cases, you may prevent any sequence being found, and get the error message “No best candidate sequence found”.
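For example, after loading a multisyn voice you could try something like this (the value 0.05 is just an illustrative, deliberately aggressive setting):
festival> (du_voice.set_pruning_beam currentMultiSynVoice 0.05)
festival> (SayText "My first sentence.")
and listen to how the output changes; with very small values you may eventually hit the “No best candidate sequence found” error.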