Forum Replies Created
The run time might be dominated by the time taken to load the voice. There are several ways to control for that:
- Use a machine with no other users logged in
- Put a copy of the voice on the local disk of the machine you are using (e.g., copy your ss folder to /tmp and change to that folder before starting Festival) so that the loading time is fast and consistent
- Likewise, if saving the waveform output, make sure to write to local disk, or better to the system “black hole” file /dev/null, or even better not to write any output at all
- Synthesise a large enough set of sentences to give a run time of a minute or more, thus making the few seconds of voice loading time irrelevant
Remember also to report the “user” time (which is the time used by the process) and not “real” (which is wall-clock time).
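For example, a complete timing run from the shell might look like the following. The script name and output format are illustrative; the key points are running from the local copy, discarding output, and reading the “user” line:

```bash
# Run Festival in batch mode from the local copy of the voice.
# "synth_test_set.scm" is a placeholder: it should load the voice and
# synthesise the test sentences without saving any waveforms.
cd /tmp/ss
time festival -b synth_test_set.scm > /dev/null

# bash prints something like:
#   real    1m12.345s   <- wall-clock time: ignore
#   user    1m05.210s   <- CPU time used by the process: report this
#   sys     0m02.031s
```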
There are interactions between “Observation pruning” and “Beam pruning” and you will want to disable one when experimenting with the other. With small databases where there is only one candidate for some diphones, pruning will have no effect in those parts of the search: that one candidate will always have to be used, no matter what. So, do these experiments with the largest possible database (e.g., ARCTIC slt).
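As a sketch of how such an experiment might be scripted: the Scheme function names and the voice name below are assumptions from memory (check the multisyn documentation for the real API), but the structure, disabling observation pruning while sweeping the beam width, is the point:

```bash
# Sweep beam pruning with observation pruning effectively disabled.
# du_voice.set_pruning_beam / du_voice.set_ob_pruning_beam and the voice
# name are assumed, not verified -- adapt to your own setup.
for beam in 0.1 0.25 0.5 1.0; do
  cat > /tmp/prune_test.scm <<EOF
(voice_localdir_multisyn-rpx)                                ; placeholder voice name
(du_voice.set_ob_pruning_beam currentMultiSynVoice 1000000)  ; observation pruning off (assumed API)
(du_voice.set_pruning_beam currentMultiSynVoice $beam)       ; vary beam pruning (assumed API)
(utt.synth (Utterance Text "A stitch in time saves nine."))
EOF
  echo "beam = $beam"
  time festival -b /tmp/prune_test.scm > /dev/null
done
```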
You need to deduce which subsequent steps depend on pitch marking – try drawing a flowchart of the steps in the voice building process showing the flow of information between them.
For example, pitch marks determine potential join locations, and so the computation of join cost coefficients will be affected. Therefore any steps related to join cost will need to be re-run.
(If in doubt, run all subsequent steps anyway, to be on the safe side.)
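As a starting point, here is a rough sketch of just the branch discussed above (comments only, not the complete flowchart):

```bash
# waveforms --> pitch marks --> potential join locations
#                     \
#                      --> join cost coefficients --> join cost in the search
#
# Everything to the right of "pitch marks" must be re-run after changing them.
```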
The target cost function collapses all levels of stress (1,2,3) into a single level (1 = “stressed”).
Please could you post details of the exact problem and the solution, for future reference.
You can easily check that endpointing has worked by inspecting the endpointed wav files – there should be a small (but non-zero) amount of silence at the start and end of every file. I’m not sure that’s the cause of your error, but it’s something you should check anyway.
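One way to check this in bulk, assuming sox is installed (the 0.01 s and 1% silence thresholds are illustrative and may need tuning):

```bash
# Compare each file's total duration with its duration after sox trims
# leading and trailing silence. A difference of zero means no silence
# survived endpointing; a very large difference means too much did.
for f in wav/*.wav; do
  sox "$f" /tmp/trimmed.wav silence 1 0.01 1% reverse silence 1 0.01 1% reverse
  echo "$f: total $(soxi -D "$f")s, after trimming $(soxi -D /tmp/trimmed.wav)s"
done
```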
Modifying train.scp reduces the amount of training data for the models, but alignment will still be performed on all the data. You only want to be doing this for an experiment to measure the effect of less well-trained models (and the resulting accuracy of alignment) independently of the amount of data in the unit selection database.

Does the removal of any utterance lead to the error, or only specific ones? If the latter, could it be an utterance containing the only remaining example of a particular phoneme within the utterances listed in train.scp? That would lead to an untrained model for that phoneme, and this model will cause problems during alignment.

In general, you need at least one training example per phoneme, and ideally three. Check for warnings from HERest.
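A quick way to check coverage is to count phone occurrences across just the utterances listed in train.scp. The paths, file extensions, and label format below are assumptions; adapt them to your directory layout:

```bash
# Assumes train.scp lists feature files whose basenames match the label
# files in lab/, and that the phone is the last field on each label line.
while read utt; do
  base=$(basename "$utt" .mfcc)
  cat "lab/${base}.lab"
done < train.scp | awk 'NF >= 2 {print $NF}' | sort | uniq -c | sort -n
# Any phone appearing fewer than three times (or missing entirely from the
# list) is a likely culprit.
```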
To close this topic: there is no exam this year. But you should still think about how to integrate the content of modules 6-9 and the state of the art in your coursework report.
Yes, that would be fine. For a higher mark, you could complement that with other forms of evaluation for other hypotheses.
The intention was to respond to this type of question in lab sessions, for each individual student. Since that’s not possible now, I’ll provide a generic answer here.
First, remember that a formal listening test is not the only option for every experiment. There are at least two other options for testing a hypothesis: expert listening by the author, or an objective measure.
Second, remember that not every hypothesis is worth testing formally. For example, if you – the expert listener – cannot discern any difference between two conditions, then there is little point asking whether other listeners can hear one.
Once you have decided that a formal listening test is what you need, then – as you correctly point out – you will have to be selective about which hypotheses are worth testing in this relatively expensive way.
I suggest testing a handful of hypotheses in total, of which maybe just a couple would have a formal listening test.
The target and join cost values reported by Festival have already been multiplied by their respective weights.
A low target cost weight will bias the search towards finding good joins (those with lower cost), at the expense of selecting candidates that match their targets less well, i.e., candidates with a high target cost. Note that, because the reported target cost has been multiplied by that low weight, it will appear smaller than the unweighted value.
The consequence is that it is only valid to compare absolute values of join and target costs for a fixed setting of the target cost weight (e.g., comparing across different input sentences, or a fixed sentence synthesised with different unit databases). Changing the weight changes the absolute values.
An added complication in inspecting the total join cost across an utterance, as you vary the target cost weight, is that the proportion of zero-cost joins will vary – so you will get sudden ‘jumps’ in the values.
In summary – you are doing the right thing in inspecting values closely for individual sentences, but the absolute values of the costs are not very helpful. Try inspecting the ratio between them instead. If you’re looking for something objective to measure, then the number of zero-cost joins is a good option.
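If you log the per-join costs to a text file, counting the zero-cost joins is then a one-liner. The file name and column position here are pure assumptions about your logging format:

```bash
# Assumes one line per join with the weighted join cost in column 2.
awk '$2 == 0 { zero++ } { total++ } END { printf "%d of %d joins are zero-cost\n", zero, total }' join_costs.txt
```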
Yes, using a “between subjects” design for naturalness would be fine – it’s what the Blizzard Challenge does. It is not essential though, and a “within subjects” design is acceptable.
The “festival_mac” in the PATH is the clue. It’s a curious bug. See this topic.
Yes, that diagram has many steps! p287 says
“After we generate the Mandarin training sentences for the monolingual English speaker, his HMM based TTS in Mandarin can be trained via the standard HMM training procedure.”
so what they are doing is using trajectory tiling (with the waveform being created using concatenation) to construct a training set in the target language, for a speaker who doesn’t speak that language.
That data is then used to train a conventional HMM-based system that drives a vocoder.
All the synthesisers compared in Fig 12 are conventional HMM-plus-vocoder systems. Trajectory tiling is used to create the training data for TSMT.
j and m are both indexing the samples in the entire waveform under analysis. Remember that we are doing short-term analysis, which involves analysing short frames taken from that waveform:
- m is the first sample in the current analysis frame (the i’th frame)
- j counts through the samples in the current analysis frame
So j=m is the lower limit of the summation (the first sample in the current frame), and j increments up to j=m+n-k-1, which ensures that the second term in each product, s(j+k), never runs past the last sample in the current frame.
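Putting those pieces together, the summation under discussion has this shape (n is the frame length in samples and k is the lag; the symbol on the left is a label for illustration, so check the paper for its exact notation):

```latex
\phi_{i,k} \;=\; \sum_{j=m}^{m+n-k-1} s_{j}\, s_{j+k}
```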
Qualtrics might convert them to mp3 silently (certainly some platforms do) – check in a browser by taking your completed test as a subject.
The main problems with using wav files on the web are
- They are larger than mp3 – not a problem here: we care about quality, not size
- Some browsers, notably Safari, will not play wav files
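Before uploading, and again after downloading a stimulus from your own completed test, you can check what a file actually contains with the standard file command (the stimuli/ path is a placeholder):

```bash
# A genuine wav file is reported as "RIFF (little-endian) data, WAVE audio ...";
# an mp3 shows up as "MPEG ADTS, layer III" or "Audio file with ID3",
# which would reveal a silent conversion.
file stimuli/*.wav
```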