Forum Replies Created
We can compute the target cost of all candidates before the search commences. As you say, this would be necessary if we want to do some pruning based only on the target cost.
We could pre-compute all the join costs too, after the candidate lists are pruned but before the search commences. That might be wasteful, because if we use pruning during the search, then some joins may never be considered. So, computing join costs during the search would be more sensible.
My names for the processes that happen before search commences:
pre-selection: the process for retrieving an initial list of candidates (per target position) from the inventory; in Festival, this means retrieving all units that match in diphone type
pruning: reducing the number of candidates (per target position), perhaps on the basis of their target costs
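To make that concrete, here is a minimal sketch (in Python, purely for illustration – the function names and the fixed beam width are mine, not Festival’s actual implementation) of pruning each candidate list on target cost alone, before the search begins:

def prune_candidates(candidate_lists, target_cost, targets, beam=50):
    """Keep only the 'beam' cheapest candidates (by target cost) per target position."""
    pruned = []
    for target, candidates in zip(targets, candidate_lists):
        # Target costs depend only on the target and the candidate, so they can
        # all be computed before the search commences.
        scored = sorted(candidates, key=lambda c: target_cost(target, c))
        pruned.append(scored[:beam])
    return pruned

# Join costs would then be computed lazily, inside the search itself, so that
# joins between candidates that are never reached are never costed.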
It’s possible that training on only a quarter of the data gives good enough models to align all the data just as well as training the models on the whole data, but you should do some careful checking to make sure you have things set up correctly.
In do_alignment, you do indeed only want to change the list of MFCC files used for training the models (HCompV and HERest) to the smaller data set, but run the alignment (HVite) on the full dataset.
The error you report needs to be fixed – it indicates that your utts.mlf file does not contain labels for that file, yet it is listed in your list of training MFCC files. You need to rebuild utts.mlf using make_initial_phone_labs.
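As a purely illustrative sketch (the quarter is taken by random sampling here, and the name train_quarter.scp is made up – this is not one of the course scripts), creating a reduced MFCC list for HCompV and HERest, whilst leaving the full list for HVite, could look like this:

import random

# Read the full list of MFCC files (still used for alignment with HVite).
with open("train.scp") as f:
    all_files = [line.strip() for line in f if line.strip()]

# Take a reproducible random quarter for training HCompV and HERest.
random.seed(0)
subset = random.sample(all_files, len(all_files) // 4)

with open("train_quarter.scp", "w") as f:
    f.write("\n".join(subset) + "\n")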
Yes, reporting the CI computed in the way specified in the book, plus a narrative description of any non-homogeneities, sounds fine.
In many fields, it is common to report an overall statistic but to fail to report any interesting patterns, or distributions. For example, in ASR, people generally just report the overall WER, without breaking it down per speaker.
In industry, I know that the WER distribution across speakers is more important than the overall average WER: they want to bring down the WER for the “hardest” speakers since they currently have the worst user experience.
We are starting to see reports of the WER distribution across speakers, or subsets of the test data, in some research papers now – for example Figure 10 in http://arxiv.org/abs/1601.02828
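As a toy illustration of why the single pooled figure can hide so much (the error counts below are invented), compare the overall WER with the per-speaker breakdown:

# Invented per-speaker results: speaker -> (word errors, reference words)
results = {
    "spk1": (50, 1000),
    "spk2": (80, 1000),
    "spk3": (300, 1000),   # the "hardest" speaker
}

errors = sum(e for e, _ in results.values())
words = sum(w for _, w in results.values())
print(f"overall WER: {100 * errors / words:.1f}%")   # one pooled number

# The distribution, worst speakers first.
for spk, (e, w) in sorted(results.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True):
    print(f"  {spk}: {100 * e / w:.1f}%")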
So, I think this might be the best source I can find for explaining how to actually compute the Confidence Interval for an A/B test.
Jeff Sauro and James R. Lewis, Quantifying the User Experience, Morgan Kaufmann, Boston, 2012. ISBN 9780123849687. DOI: 10.1016/B978-0-12-384968-7.00003-5
Chapter 3 is what you need, and I think the little worked example on page 21 (“7 out of 10 users”) could be mapped directly onto your situation (“675 out of 900 responses”).
It’s also worth pointing out that you need to know how to interpret error bars or Confidence Intervals, and what can (and cannot) be concluded from them.
If you plot the Confidence Interval, and it does not overlap the 50% mark (in an A/B test) then you can be confident that the mean preference (75% for system A in your example above) is indeed greater than 50% and you can then state that your listeners significantly preferred system A.
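For example, here is a minimal sketch of that calculation using the adjusted-Wald interval (a standard choice for binomial data of this kind – do check Chapter 3 for the exact method the book recommends):

import math

def adjusted_wald_ci(successes, n, z=1.96):
    """Approximate 95% confidence interval (z = 1.96) for a proportion."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - margin, p_adj + margin

# "675 out of 900 responses" preferred system A.
lower, upper = adjusted_wald_ci(675, 900)
print(f"preference for A: 75.0%, 95% CI [{100 * lower:.1f}%, {100 * upper:.1f}%]")
# If the whole interval lies above 50%, the preference for A is statistically significant.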
Your calculation of the percentages is correct – it’s simply the percent of responses pooled over all presentations of stimuli to listeners.
Confidence Intervals and error bars are not necessarily the same thing. Both illustrate the variability in the measure, but in different units.
It is probably most common for error bars to show +/- one standard deviation ([latex]\sigma[/latex]) around the mean. You can easily compute that quantity from your data (I hope!). If we assume your data have a Normal distribution, then the range of +/- one standard deviation about the mean would include approximately 68% of the data points.
(Aside: the more you read on this topic, the more confused you may become. You will come across a quantity called the Standard Error of the Mean (SEM), which is not the same as the standard deviation. If you want to illustrate variability in your data, then plotting the standard deviation is the right thing to do and you don’t need to worry about the SEM.)
Confidence Intervals plot the same type of information, but the units are now the percentage of data points (again, assuming a Normal distribution) that would lie within the Confidence Interval (centred on the mean). The most popular Confidence Interval is 95%, which is approximately +/- two standard deviations.
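As a minimal sketch of those two ways of expressing the same variability (the listener scores below are made up, purely for illustration):

import math

scores = [4, 5, 3, 4, 4, 2, 5, 4, 3, 4, 5, 3, 4, 4, 3, 5, 4, 2, 4, 4]

n = len(scores)
mean = sum(scores) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))   # sample standard deviation
print(f"mean = {mean:.2f}, standard deviation = {sd:.2f}")

# Error bars of +/- 1 standard deviation cover roughly 68% of the data points;
# +/- 2 standard deviations cover roughly 95% (assuming a Normal distribution).
for k in (1, 2):
    lo, hi = mean - k * sd, mean + k * sd
    inside = sum(lo <= x <= hi for x in scores)
    print(f"+/- {k} SD: [{lo:.2f}, {hi:.2f}] contains {inside}/{n} data points")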
I’ve attached some examples here – unit selection (look for usel in the filename), HMM (hts) and DNN (dnn).
Yes, that’s correct.
After manually correcting a small number (possibly just one) of misaligned labels, you need to create a new voice which has a very small database, containing only the utterances needed to synthesise the one test sentence you are trying to improve.
After fixing the labels, you need to rebuild the utterance structures to incorporate the new, corrected label timestamps – run both the build_utts and the add_duration_info_utts steps. You also need to recalculate the join cost coefficients, since the potential join locations have changed (it’s just the strip_join_cost_coefs part that needs to be re-run).
Some sources of SUS material:
1. Follow the templates in the paper
Christian Benoît, Martine Grice, Valérie Hazan, The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using Semantically Unpredictable Sentences, Speech Communication, Volume 18, Issue 4, June 1996, Pages 381-392, ISSN 0167-6393, DOI: 10.1016/0167-6393(96)00026-X
and populate them with words of your own choice (for this assignment, it is OK to use your own judgement regarding words of appropriate frequency)
2. Download a Blizzard Challenge set of test materials (just use the text, obviously, not the wav files)
You need to decide what type of sentences are most appropriate for testing naturalness (SUS are probably not very appropriate!). The Blizzard Challenge is a reasonable reference point, where newspaper sentences are commonly used, as are sentences taken from novels.
BeaqleJS looks like a general framework that is worth investigating – no idea how easy it is to use. It can do preference tests and MUSHRA.
That is indeed the function that is doing the vowel reduction at synthesis time. You could try modifying (redefining) it, to prevent any vowel reduction at all, but you would then almost certainly make a lot of other test sentences worse.
The problem you have found is a mismatch between the vowel reduction rules used at synthesis time, and the method for identifying reduced vowels in the database. It is impossible(*) for these to perfectly match, because the speaker may not reduce exactly the vowels that the front-end predicts should be reduced.
Editing the labels on the database sounds like the solution in this case, assuming that the corrected label is a closer match to the speech. Of course, whilst you might improve this particular test sentence, you may make others worse.
(*) unless you can get the speaker to read out phonetic transcriptions of the sentences?
It’s actually quite hard to specify how many listeners you need for a reliable result (e.g., to find statistically significant differences, or to get stable results that would not change if you added more listeners) because it depends to some extent on the magnitude of the difference between the systems you are comparing and the number of systems. This paper gives some guidelines:
Mirjam Wester, Cassia Valentini-Botinhao, and Gustav Eje Henter. Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations. In Proc. Interspeech, pages 3476-3480, Dresden, September 2015.
and a critique of published papers that include listening test results. The paper suggests that 30 listeners are needed for a typical MOS “naturalness” test (although 20 listeners gets you quite close: see Figure 1).
However, for this assignment, 30 listeners may be more than you can easily recruit, so it’s OK to use fewer.
One thing you must not do is to analyse your results, then add a few more listeners, and keep going until you reach statistical significance. That would be a fishing expedition.
You actually asked how many listeners per stimulus, and in the paper you then need to look at Figure 4, where a datapoint corresponds to one listener giving a response to a single stimulus.
The analysis in this paper is for one year of the Blizzard Challenge (where 11 systems are being compared), so the numbers of listeners and datapoints might not transfer directly to your listening test. But, the checklist in Section 2 is definitely applicable.
Are you sure you are increasing the beam width for HVite, and not HERest? The beam for HERest is specified as three numbers (as per your other post in this topic), but for HVite it is just a single number.
Other possible causes:
- the speech is not a perfect match to the transcription (e.g., speaker errors) in too many places
- there just isn’t enough data
If you can’t get this working, then move on to another variation instead (e.g., train on all the data, and just exclude data the simple way at synthesis time).
It looks like you have moved all your ss directories into a directory called voices. That’s fine, but now the paths to the MFCC files (which are stored as absolute paths in train.scp) are wrong. Delete train.scp and re-run the make_mfcc_list script in this step to rebuild the train.scp file.
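If you want to confirm that stale absolute paths are indeed the problem before rebuilding, a quick check along these lines (just an illustrative sketch) will list the entries in train.scp that can no longer be found:

import os

with open("train.scp") as f:
    paths = [line.strip() for line in f if line.strip()]

missing = [p for p in paths if not os.path.exists(p)]
print(f"{len(missing)} of {len(paths)} MFCC files listed in train.scp cannot be found")
for p in missing:
    print(" ", p)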