Forum Replies Created
What do you get when you do the following:
$ festival $MBDIR/scm/build_unitsel.scm
festival> (setup_phoneset_and_lexicon 'unilex-rpx)
#
festival> (lex.lookup 'Cabernet)
("Cabernet" nil (((k a b) 0) ((@ n) 0) ((e t) 0)))
and are you sure the problem is with the word “Cabernet” and not “Shiraz”?
festival> (lex.lookup 'Shiraz)
("Shiraz" nil (((sh ? r a z) 0)))
where “?” is a glottal stop. That symbol is mapped to “Q” for the forced alignment step, because HTK cannot handle a phoneme called “?” (it conflicts with the regular-expression-like patterns it uses).
So, it looks like “Q” and “Q_cl” are simply missing from $MBDIR/resources/phone_list.unilex-rpx, and that this error only surfaces on rare occasions because the LTS decision tree predicts a glottal stop only in very infrequent contexts.
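If you want to check for this kind of mismatch systematically, you could compare the phones that actually appear in your labels against the phone list. A minimal sketch in Python (the file names and the simple whitespace-separated formats are assumptions about a typical Multisyn setup, so adjust them to match your own files):

# Minimal sketch: find phones that appear in the label files but are
# missing from the phone list. Paths and formats are assumptions about
# a typical Multisyn setup - adjust to match your own files.
from pathlib import Path

def load_phone_set(path):
    """One phone per line (first whitespace-separated field)."""
    phones = set()
    for line in Path(path).read_text().splitlines():
        fields = line.split()
        if fields:
            phones.add(fields[0])
    return phones

phone_list = load_phone_set("phone_list.unilex-rpx")

# Collect every phone used in the labels (assumed: HTK-style .lab files,
# one "start end phone" triple per line).
used = set()
for lab in Path("lab").glob("*.lab"):
    for line in lab.read_text().splitlines():
        fields = line.split()
        if len(fields) >= 3:
            used.add(fields[2])

print("Phones used but not in the phone list:", sorted(used - phone_list))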
You correctly note that “systems may perform very differently on some sentences but similarly on others”. The simple way to mitigate this is to use a larger number of sentences – there is no hard-and-fast rule, but I would suggest a minimum of 10 different sentences, and more than that (e.g., 20 or 30) if you can fit that into your listening test.
Although, in principle, the likelihoods of the segments could be used as a simple form of confidence measure, this is not actually done in Multisyn.
Errors about unknown phones are almost always due to mixing up more than one dictionary. Start again from the step where you choose the dictionary, then re-do every subsequent step that involves the dictionary (e.g., creating the initial labels).
In theory, the method you suggest would work: the lowest harmonic is of course at a value of F0. But, in practice, this doesn’t work as well as autocorrelation-based approaches because the FFT spectrum has a resolution determined by the analysis window length: for example, if there were 1024 FFT bins spanning 8000 Hz, each bin would be about 8 Hz wide – this limits the resolution of the F0 estimate. To get good resolution, a rather long analysis window is needed (try it yourself in Wavesurfer).
The resolution of the autocorrelation function depends only on the sample rate of the waveform being analysed.
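To make the resolution argument concrete, here is a small worked sketch in Python (the sample rate and the 120 Hz example F0 are illustrative values):

# Worked example of the resolution argument above (values are illustrative).
bin_width = 8000 / 1024           # as in the example above: ~7.8 Hz per bin
print(f"FFT bin width: {bin_width:.2f} Hz")
# If the lowest harmonic falls in bin k, F0 is only known to within one bin
# width, e.g. a true F0 of 120 Hz could be reported anywhere within ~8 Hz.

# Autocorrelation works on lags measured in whole samples: a peak at lag T
# gives F0 = fs / T, so the granularity near 120 Hz at 16 kHz is much finer.
fs = 16000
lag = round(fs / 120)             # ~133 samples
print(f"Adjacent lag F0 values: {fs/lag:.2f} vs {fs/(lag+1):.2f} Hz")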
Yes, we can say that one step in the synthesis process is to compute the target cost for every candidate at every target position.
After that, Festival performs “observation pruning”, which is pruning of the candidate lists based on their target costs.
It is desirable to make the candidate lists shorter before the search commences, because this dramatically reduces the number of join costs that need to be computed (which is proportional to the square of the average number of candidates per target position). Halving the average number of candidates thus cuts the number of join costs to be computed by 75%.
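As a quick illustration of that arithmetic (with made-up numbers):

# Illustration of why pruning candidates pays off (numbers are made up).
n_targets = 20        # target positions in the utterance
candidates = 100      # average candidates per target position

# One join cost per pair of consecutive candidates:
joins_before = (n_targets - 1) * candidates ** 2
joins_after = (n_targets - 1) * (candidates // 2) ** 2

print(joins_before)   # 190000
print(joins_after)    # 47500 -> a 75% reduction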
[Aside: the reason that the term ‘observation’ is used is as follows: if we conceive of the search as equivalent to an HMM, then the target cost takes the place of the observation probability, and the join cost takes the place of the transition probability.]
We can compute the target cost of all candidates before the search commences. As you say, this would be necessary if we want to do some pruning based only on the target cost.
We could pre-compute all the join costs too, after the candidate lists are pruned but before the search commences. That might be wasteful: if we use pruning during the search, some joins may never be considered. So, computing join costs during the search is more sensible.
My names for the processes that happen before search commences:
pre-selection: the process for retrieving an initial list of candidates (per target position) from the inventory; in Festival, this means retrieving all units that match in diphone type
pruning: reducing the number of candidates (per target position), perhaps on the basis of their target costs
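In sketch form, those two steps might look something like this (the data structures, function names and the cutoff of 25 are my own inventions for illustration, not Festival’s actual internals):

# Minimal sketch of pre-selection followed by observation pruning.
# The inventory structure, target_cost() and the cutoff are illustrative
# inventions - this is not Festival's actual code.

def preselect(inventory, target):
    """Retrieve all candidate units matching the target's diphone type."""
    return inventory.get(target["diphone"], [])

def prune(candidates, target, target_cost, keep=25):
    """Keep only the `keep` candidates with the lowest target cost."""
    scored = sorted(candidates, key=lambda c: target_cost(target, c))
    return scored[:keep]

def build_candidate_lists(inventory, targets, target_cost, keep=25):
    return [prune(preselect(inventory, t), t, target_cost, keep)
            for t in targets]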
It’s possible that training on only a quarter of the data gives models good enough to align all the data just as well as models trained on the whole dataset, but you should do some careful checking to make sure you have things set up correctly.
In do_alignment, you do indeed only want to change the list of MFCC files used for training the models (HCompV and HERest) to the smaller data set, but run the alignment (HVite) on the full dataset.
The error you report needs to be fixed – it indicates that your utts.mlf file does not contain labels for that file, yet it is listed in your list of training MFCC files. You need to rebuild utts.mlf using make_initial_phone_labs.
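Going back to the reduced training list: if it helps, here is one way to generate it while leaving the alignment list untouched (a minimal sketch; the .scp file names are my assumptions about your setup):

# Sketch: write a training list containing every 4th MFCC file, keeping the
# full list for alignment. The file names are assumptions about your setup.
with open("train_full.scp") as f:
    files = [line.strip() for line in f if line.strip()]

subset = files[::4]   # roughly a quarter of the data

with open("train_subset.scp", "w") as f:
    f.write("\n".join(subset) + "\n")

print(f"{len(subset)} of {len(files)} files in the training subset")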
Yes, reporting the CI computed in the way specified in the book, plus a narrative description of any non-homogeneities, sounds fine.
In many fields, it is common to report an overall statistic but to fail to report any interesting patterns or distributions. For example, in ASR, people generally just report the overall WER, without breaking it down per speaker.
In industry, I know that the WER distribution across speakers is more important than the overall average WER: they want to bring down the WER for the “hardest” speakers since they currently have the worst user experience.
We are starting to see reports of the WER distribution across speakers, or subsets of the test data, in some research papers now – for example Figure 10 in http://arxiv.org/abs/1601.02828
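To make the pooled-versus-per-speaker point concrete, here is a small sketch (the per-utterance error and word counts are made up; in practice they would come from your scoring tool):

# Sketch: overall WER vs the per-speaker WER distribution.
# Each record is (speaker, word_errors, reference_word_count); the numbers
# are invented and would really come from your scoring tool.
from collections import defaultdict

results = [
    ("spk1", 2, 50), ("spk1", 1, 40),
    ("spk2", 12, 45), ("spk2", 15, 55),
    ("spk3", 3, 60),
]

total_err = sum(e for _, e, _ in results)
total_words = sum(w for _, _, w in results)
print(f"Overall WER: {100 * total_err / total_words:.1f}%")   # 13.2%

per_speaker = defaultdict(lambda: [0, 0])
for spk, err, words in results:
    per_speaker[spk][0] += err
    per_speaker[spk][1] += words

for spk, (err, words) in sorted(per_speaker.items()):
    print(f"{spk}: {100 * err / words:.1f}%")
# The overall figure hides the fact that spk2 (27%) is far worse than
# the other speakers (3-5%).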
So, I think this might be the best source I can find for explaining how to actually compute the Confidence Interval for an A/B test.
Jeff Sauro and James R. Lewis, Quantifying the User Experience, Morgan Kaufmann, Boston, 2012. ISBN 9780123849687, DOI: 10.1016/B978-0-12-384968-7.00003-5
Chapter 3 is what you need, and I think the little worked example on page 21 (“7 out of 10 users”) could be mapped directly onto your situation (“675 out of 900 responses”).
It’s also worth pointing out that you need to know how to interpret error bars or Confidence Intervals, and what can (and cannot) be concluded from them.
If you plot the Confidence Interval and it does not overlap the 50% mark (in an A/B test), then you can be confident that the mean preference (75% for system A in your example above) is indeed greater than 50%, and you can then state that your listeners significantly preferred system A.
Your calculation of the percentages is correct – it’s simply the percent of responses pooled over all presentations of stimuli to listeners.
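For the numbers above, the basic calculation is a confidence interval for a binomial proportion. Here is a sketch of the plain Wald version (note that Sauro & Lewis recommend an adjusted-Wald interval, especially for small samples, so treat this as illustrative):

# Sketch: 95% Wald confidence interval for a binomial proportion,
# using the "675 out of 900 responses" example above.
import math

successes, n = 675, 900
p = successes / n                      # 0.75
se = math.sqrt(p * (1 - p) / n)        # standard error of the proportion
z = 1.96                               # 95% confidence
lo, hi = p - z * se, p + z * se

print(f"p = {p:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
# -> roughly [0.722, 0.778]; the interval excludes 0.5, so the
#    preference for system A is statistically significant.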
Confidence Intervals and error bars are not necessarily the same thing. Both illustrate the variability in the measure, but in different units.
It is probably most common for error bars to show +/- one standard deviation ([latex]\sigma[/latex]) around the mean. You can easily compute that quantity from your data (I hope!). If we assume your data have a Normal distribution, then the range of +/- one standard deviation about the mean would include approximately 68% of the data points.
(Aside: the more you read on this topic, the more confused you may become. You will come across a quantity called the Standard Error of the Mean (SEM), which is not the same as the standard deviation. If you want to illustrate variability in your data, then plotting the standard deviation is the right thing to do and you don’t need to worry about the SEM.)
Confidence Intervals plot the same type of information, but the units are now the percentage of data points (again, assuming a Normal distribution) that would lie within the Confidence Interval (centred on the mean). The most popular Confidence Interval is 95%, which is approximately +/- two standard deviations.
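As a quick sketch of those quantities (the listener scores are made up):

# Sketch: mean, standard deviation, and the +/- 1 and +/- 2 sd ranges
# discussed above (the listener scores are made up for illustration).
import statistics

scores = [3, 4, 4, 5, 3, 4, 2, 5, 4, 4, 3, 4]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)        # sample standard deviation

print(f"mean = {mean:.2f}, sd = {sd:.2f}")
print(f"+/- 1 sd: [{mean - sd:.2f}, {mean + sd:.2f}]  (~68% of the data)")
print(f"+/- 2 sd: [{mean - 2*sd:.2f}, {mean + 2*sd:.2f}]  (~95% of the data)")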
I’ve attached some examples here – unit selection (look for usel in the filename), HMM (hts) and DNN (dnn).
Yes, that’s correct.
After manually correcting a small number (possibly just one) of misaligned labels, you need to create a new voice which has a very small database, containing only the utterances needed to synthesise the one test sentence you are trying to improve.
After fixing the labels, you need to rebuild the utterance structures to incorporate the new, corrected label timestamps – run both the build_utts and the add_duration_info_utts steps.
You also need to recalculate the join cost coefficients, since the potential join locations have changed (it’s just the strip_join_cost_coefs part that needs to be re-run).
Some sources of SUS material:
1. Follow the templates in the paper
Christian Benoît, Martine Grice, Valérie Hazan, The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using Semantically Unpredictable Sentences, Speech Communication, Volume 18, Issue 4, June 1996, Pages 381-392, ISSN 0167-6393, DOI: 10.1016/0167-6393(96)00026-X
and populate them with words of your own choice (for this assignment, it is OK to use your own judgement regarding words of appropriate frequency) – see the sketch below.
2. Download a Blizzard Challenge set of test materials (just use the text, obviously, not the wav files)
You need to decide what type of sentences are most appropriate for testing naturalness (SUS are probably not very appropriate!). The Blizzard Challenge is a reasonable reference point, where newspaper sentences are commonly used, as are sentences taken from novels.
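Here is a minimal sketch of option 1 (the template and word lists are invented placeholders, not those from the Benoît et al. paper; substitute words of appropriate frequency for a real test):

# Minimal sketch of filling a SUS-style template with random words.
# The template and word lists are invented placeholders; for a real test,
# take the templates from Benoît et al. (1996) and choose words of
# appropriate frequency yourself.
import random

nouns = ["table", "cloud", "engine", "letter", "garden"]
verbs = ["paints", "carries", "follows", "breaks", "watches"]
adjectives = ["quiet", "heavy", "bright", "narrow", "sudden"]

def make_sus():
    """One sentence from a simple 'The ADJ NOUN VERB the NOUN.' template."""
    return (f"The {random.choice(adjectives)} {random.choice(nouns)} "
            f"{random.choice(verbs)} the {random.choice(nouns)}.")

for _ in range(5):
    print(make_sus())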