Forum Replies Created
The target cost function collapses all levels of stress (1,2,3) into a single level (1 = “stressed”).
Please could you post details of the exact problem and the solution, for future reference.
You can easily check that endpointing has worked by inspecting the endpointed wav files – there should be a small (but non-zero) amount of silence at the start and end of every file. I'm not sure that's the cause of your error, but it's something you should check anyway.
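If you want to check that programmatically rather than by listening, here is a minimal sketch, assuming mono wav files; the filename and silence threshold are illustrative, not part of the standard tools:

```python
import numpy as np
from scipy.io import wavfile

# Minimal sketch: measure the near-silence at the start and end of one
# endpointed wav file. Filename and threshold are assumptions - adjust both.
rate, samples = wavfile.read("endpointed/utt001.wav")
amplitude = np.abs(samples.astype(float))
threshold = 0.01 * amplitude.max()          # crude silence threshold

above = np.where(amplitude > threshold)[0]  # indices of non-silent samples
leading_ms = 1000 * above[0] / rate
trailing_ms = 1000 * (len(samples) - 1 - above[-1]) / rate
print(f"leading silence: {leading_ms:.0f} ms, trailing: {trailing_ms:.0f} ms")
```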
Modifying train.scp reduces the amount of training data for the models, but alignment will still be performed on all the data. You only want to be doing this for an experiment to measure the effect of less well-trained models (and the resulting accuracy of alignment) independently of the amount of data in the unit selection database.
Does the removal of any utterance lead to the error, or only specific ones? If the latter, could it be an utterance containing the only remaining example of a particular phoneme within the utterances listed in train.scp? That would lead to an untrained model for that phoneme, and this model will cause problems during alignment.
In general, you need at least one training example per phoneme, and ideally three. Check for warnings from HERest.
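To check phoneme coverage before training, a minimal sketch along these lines might help. It assumes one HTK-style label file per utterance; the lab/ directory, the .lab extension and the file layout are assumptions about your setup, not something train.scp itself guarantees.

```python
from collections import Counter
from pathlib import Path

# Count how many training examples of each phone remain in the utterances
# listed in train.scp (assumed: one label file per utterance in lab/<utt>.lab,
# HTK format with the phone label as the last field on each line).
counts = Counter()
for utt in Path("train.scp").read_text().split():
    utt_name = Path(utt).stem
    for line in Path("lab", utt_name + ".lab").read_text().splitlines():
        fields = line.split()
        if fields:
            counts[fields[-1]] += 1

# Flag phones with too few training examples (ideally at least three)
for phone, n in sorted(counts.items()):
    if n < 3:
        print(f"WARNING: only {n} example(s) of {phone}")
```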
To close this topic: there is no exam this year. But you should still think about how to integrate the content of modules 6-9 and the state-of-the-art in your coursework report.
Yes, that would be fine. For a higher mark, you could complement that with other forms of evaluation for other hypotheses.
The intention was to respond to this type of question in lab sessions, for each individual student. Since that’s not possible now, I’ll provide a generic answer here.
First, remember that a formal listening test is not the only option for every experiment. There are at least two other options for testing a hypothesis: expert listening by the author, or an objective measure.
Second, remember that not every hypothesis is worth testing formally. For example, if you – the expert listener – cannot discern any difference between two conditions, then there is little point asking whether other listeners can hear one.
Once you have decided that a formal listening test is what you need, then – as you correctly point out – you will have to be selective about which hypotheses are worth testing in this relatively expensive way.
I suggest testing a handful of hypotheses in total, of which maybe just a couple would have a formal listening test.
The target and join cost values reported by Festival have already been multiplied by their respective weights.
A low target cost weight will bias the search towards finding good joins (those with lower join cost), at the expense of selecting candidates that are a poorer match to their target, i.e., candidates with a high target cost. Bear in mind that the reported target cost has already been multiplied by that low weight.
The consequence is that it is only valid to compare absolute values of join and target costs for a fixed setting of the target cost weight (e.g., comparing across different input sentences, or a fixed sentence synthesised with different unit databases). Changing the weight changes the absolute values.
An added complication in inspecting the total join cost across an utterance, as you vary the target cost weight, is that the proportion of zero-cost joins will vary – so you will get sudden ‘jumps’ in the values.
In summary – you are doing the right thing in inspecting values closely for individual sentences, but the absolute values of the costs are not very helpful. Try inspecting the ratio between them instead. If you’re looking for something objective to measure, then the number of zero-cost joins is a good option.
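For example, if you extract the reported (already weighted) costs for the units selected in one utterance, a minimal sketch of the kind of summary worth looking at is below; the values are made up purely for illustration:

```python
# Per-unit costs as reported by Festival, i.e. already multiplied by their
# respective weights (example values, one per selected unit / join).
target_costs = [4.2, 3.1, 0.0, 5.6]
join_costs = [0.0, 2.4, 0.0, 1.8]

total_target = sum(target_costs)
total_join = sum(join_costs)

# The ratio is more informative than absolute values when the weight changes
ratio = total_target / total_join if total_join else float("inf")
print("target/join ratio:", ratio)

# Number of zero-cost joins: a simple objective measure
print("zero-cost joins:", sum(1 for c in join_costs if c == 0.0))
```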
Yes, using a “between subjects” design for naturalness would be fine – it’s what the Blizzard Challenge does. It is not essential though, and a “within subjects” design is acceptable.
The “festival_mac” in the PATH is the clue. It’s a curious bug. See this topic.
Yes, that diagram has many steps! p287 says
“After we generate the Mandarin training sentences for the monolingual English speaker, his HMM based TTS in Mandarin can be trained via the standard HMM training procedure.”
so what they are doing is using trajectory tiling (with the waveform being created using concatenation) to construct a training set in the target language, for a speaker who doesn’t speak that language.
That data is then used to train a conventional HMM-based system that drives a vocoder.
All the synthesisers compared in Fig 12 are conventional HMM-plus-vocoder systems. Trajectory tiling is used to create the training data for TSMT.
In reply to: Talkin, "A robust algorithm for pitch tracking" – autocorrelation equation:
j and m are both indexing the samples in the entire waveform under analysis. Remember that we are doing short-term analysis, which involves analysing short frames taken from that waveform.
m is the first sample in the current analysis frame (the i’th frame)
j is counting through the samples in the current analysis frame
So j = m is the lower limit of the summation (the first sample in the current frame), and j then increments up to j = m+n-k-1, the largest value for which the lagged sample s[j+k] is still within the current frame (at that point j+k = m+n-1, the frame's last sample).
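To make the indexing concrete, here is a minimal Python sketch of a short-term autocorrelation with those summation limits, assuming s is a 1-D array of waveform samples; it illustrates the indexing only, not Talkin's full normalised cross-correlation.

```python
def short_term_autocorrelation(s, m, n, k):
    """Autocorrelation at lag k for the n-sample frame starting at sample m.

    j runs from m (first sample of the frame) up to m + n - k - 1, so the
    lagged sample s[j + k] never goes beyond the last sample of the frame.
    """
    return sum(s[j] * s[j + k] for j in range(m, m + n - k))
```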
Qualtrics might convert them to mp3 silently (certainly some platforms do) – check in a browser by taking your completed test as a subject.
The main problems with using wav files on the web are
- They are larger than mp3 – not a problem here: we care about quality not size
- Some browsers, notably Safari, will not play wav files
The listener is potentially a significant factor that could affect the results. Language background is one important aspect of this, and for that reason, the work reported in many scientific papers only uses native listeners.
Within speech synthesis evaluation, there is not a lot of work exploring this. One paper worth skimming is
Wester, Valentini-Botinhao and Henter, "Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations", in Proc. INTERSPEECH 2015, pp. 3476-3480,
although it is focussed on the number of listeners more than on their properties.
From the Blizzard Challenge we know that non-natives have systematically higher average (and higher standard deviation) WER than native speakers when transcribing speech. We sometimes also find that people with high exposure to TTS (“Speech Experts”) do not rank systems in exactly the same order as the general population of listeners.
If we think that listener properties could affect results, then there are several possible approaches, including:
1. use a listener pool that is as homogeneous as possible, typically “normal-hearing native speakers” – this is what we do in Edinburgh most of the time
2. use a large listener pool and collect information about, for example, language background or previous exposure to TTS, so that the results can be analysed – this is what Blizzard does
Neither is perfect: approach 1 limits the available pool of subjects, while approach 2 results in unbalanced sub-groups, which complicates statistical analysis.
For the assignment, I do not recommend attempting to restrict your listeners to only native speakers of English, or to native speakers of your own first language (where that’s not English) – just get as many listeners as possible. So, take approach 2.
Approach 2 involves collecting information about each individual listener. Be very careful to collect only what is essential for testing your hypothesis (e.g., language background) and not to ask for intrusive personal information that you don’t need (e.g., gender, age, ethnicity).
But, for the assignment, investigation of listener factors is optional and it would be fine to omit this and to analyse system properties instead. You choose!
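If you do collect listener metadata (approach 2), a minimal sketch of the kind of sub-group analysis it enables is below. The results table, column names and values are hypothetical, just to show the shape of the analysis:

```python
import pandas as pd

# Hypothetical results: one row per rating, with listener metadata attached
results = pd.DataFrame({
    "system": ["A", "A", "B", "B", "A", "B"],
    "native": [True, False, True, False, True, False],
    "mos":    [4, 3, 2, 3, 5, 2],
})

# Compare mean MOS per system within each listener sub-group; note that the
# sub-groups will usually be unbalanced, which complicates formal testing.
print(results.groupby(["native", "system"])["mos"].agg(["mean", "count"]))
```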
Festival is showing its age: it doesn’t support UTF-8. It only supports ASCII.
Terminology in this reply:
- sentence = the text being synthesised
- utterance = the synthetic speech for a sentence
You’re right to worry about listener boredom or fatigue.
But you also need to think about the effect of the sentence, which can be large (or at least unknown): some sentences are just harder to synthesise than others. In general, we therefore prioritise using the same sentences across all systems we are comparing.
If you place utterances side-by-side for direct comparison (e.g., ranking or MUSHRA) then you would always use the same sentence. Listeners would indeed have to listen to the same sentence uttered multiple times.
If you present utterances one at a time (e.g., MOS) then you can (pseudo)randomise the order so that listeners do not get the same sentence several times in a row, although they will still hear multiple utterances saying the same sentence across the test (or that section) as a whole.
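For the one-at-a-time case, here is a minimal sketch of a constrained (pseudo)randomisation, assuming each stimulus is a (system, sentence) pair; the only constraint imposed is that the same sentence never occurs twice in a row:

```python
import random

def interleaved_order(stimuli, max_tries=1000):
    """Shuffle (system, sentence) stimuli so that the same sentence never
    occurs twice in a row. Sketch: reshuffle until the constraint holds,
    which is fine for typical listening-test sizes."""
    for _ in range(max_tries):
        order = random.sample(stimuli, len(stimuli))
        if all(a[1] != b[1] for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("could not satisfy the constraint; relax it or retry")

# Example: 3 systems x 4 sentences, each stimulus a (system, sentence) pair
stimuli = [(sys, sent) for sys in "ABC" for sent in range(4)]
print(interleaved_order(stimuli))
```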