Forum Replies Created
BeaqleJS looks like a general framework that is worth investigating – no idea how easy it is to use. It can do preference tests and MUSHRA.
That is indeed the function that is doing the vowel reduction at synthesis time. You could try modifying (redefining) it, to prevent any vowel reduction at all, but you would then almost certainly make a lot of other test sentences worse.
The problem you have found is a mismatch between the vowel reduction rules used at synthesis time, and the method for identifying reduced vowels in the database. It is impossible(*) for these to perfectly match, because the speaker may not reduce exactly the vowels that the front-end predicts should be reduced.
Editing the labels on the database sounds like the solution in this case, assuming that the corrected label is a closer match to the speech. Of course, whilst you might improve this particular test sentence, you may make others worse.
(*) unless you can get the speaker to read out phonetic transcriptions of the sentences?
It’s actually quite hard to specify how many listeners you need for a reliable result (e.g., to find statistically significant differences, or to get stable results that would not change if you added more listeners) because it depends to some extent on the magnitude of the difference between the systems you are comparing and the number of systems. This paper gives some guidelines:
Mirjam Wester, Cassia Valentini-Botinhao, and Gustav Eje Henter. Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations. In Proc. Interspeech, pages 3476-3480, Dresden, September 2015.
and a critique of published papers that include listening test results. The paper suggests that 30 listeners are needed for a typical MOS “naturalness” test (although 20 listeners gets you quite close: see Figure 1).
However, for this assignment, 30 listeners may be more than you can easily recruit, so it’s OK to use fewer.
One thing you must not do is to analyse your results, then add a few more listeners, and keep going until you reach statistical significance. That would be a fishing expedition.
You actually asked how many listeners are needed per stimulus; for that, look at Figure 4 in the paper, where a datapoint corresponds to one listener giving a response to a single stimulus.
The analysis in this paper is for one year of the Blizzard Challenge (where 11 systems are being compared), so the numbers of listeners and datapoints might not transfer directly to your listening test. But, the checklist in Section 2 is definitely applicable.
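If it helps to make “statistically significant differences” concrete: once you have the raw responses (one score per listener per stimulus), a non-parametric test such as Mann-Whitney U is one common choice for comparing two systems. This is only an illustrative sketch in Python with made-up scores, not an analysis recipe taken from the paper:

import numpy as np
from scipy.stats import mannwhitneyu

# Made-up MOS responses (1-5); in a real test there is one response
# per listener per stimulus, pooled here for a two-system comparison.
system_a = np.array([4, 3, 5, 4, 4, 3, 5, 4, 3, 4])
system_b = np.array([3, 3, 4, 2, 3, 4, 3, 3, 2, 3])

# Mann-Whitney U is non-parametric, which suits ordinal MOS responses.
stat, p = mannwhitneyu(system_a, system_b, alternative="two-sided")
print("U = {}, p = {:.4f}".format(stat, p))

(And, per the warning above, decide on the number of listeners before running this, not after.)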
Are you sure you are increasing the beam width for HVite, and not HERest? The beam for HERest is specified as three numbers (as per your other post in this topic), but for HVite it is just a single number.
Other possible causes:
- the speech is not a perfect match to the transcription (e.g., speaker errors) in too many places
- there just isn’t enough data
If you can’t get this working, then move on to another variation instead (e.g., train on all the data, and just exclude data the simple way at synthesis time).
It looks like you have moved all your ss directories into a directory called voices. That’s fine, but now the paths to the MFCC files (which are stored as absolute paths in train.scp) are wrong. Delete train.scp and re-run the make_mfcc_list script in this step to rebuild the train.scp file.

It’s easier to create your own script in Scheme and execute that in Festival. Create a script that contains the sequence of commands you want to run (load the voice, synthesise a sentence, save that sentence).
Tip: use Festival in interactive mode first, to work out the sequence of commands that you need.
If you place that script in a file called myscript.scm, then you can run it like this:

$ festival myscript.scm

You may want to create myscript.scm using a shell script or a simple Python program, if you need to synthesise a long list of sentences and save each one to a file. See also this post.
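As a sketch of that last suggestion: a few lines of Python can generate myscript.scm from a list of sentences, using the standard Festival commands Utterance, utt.synth and utt.save.wave. The voice-loading command and the filenames below are placeholders; substitute whatever loads your own voice.

# Sketch: write a Festival Scheme script that loads a voice, synthesises
# each sentence, and saves each waveform to a numbered wav file.

sentences = [
    "Hello world.",
    "This is a test sentence.",
]

with open("myscript.scm", "w") as f:
    # placeholder: replace with the command that loads your voice
    f.write("(voice_localdir_multisyn-rpx)\n")
    for i, text in enumerate(sentences):
        wav = "sentence_{:03d}.wav".format(i + 1)
        # build an utterance from text, synthesise it, then save the waveform
        f.write("(utt.save.wave (utt.synth (Utterance Text \"{}\")) \"{}\" 'riff)\n"
                .format(text, wav))

Then run festival myscript.scm as above.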
This error normally indicates that the flat-start training has failed to create good models. One possible cause is excessively long silences at the start or end of the recordings. Try endpointing the data and then run the alignment again. Report your findings here.
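If it helps, “endpointing” here just means trimming leading and trailing silence from each wav file. Below is a minimal, hypothetical energy-based sketch in Python (using numpy and the soundfile package, assuming mono 16 kHz recordings); the 40 dB threshold and the filenames are arbitrary examples to adjust by listening, and a dedicated tool would do the job equally well.

import numpy as np
import soundfile as sf

def endpoint(wav_in, wav_out, frame_len=400, hop=160, threshold_db=40.0):
    # Trim leading/trailing silence using short-time log energy.
    # 400 / 160 samples = 25 ms / 10 ms frames at 16 kHz.
    audio, sr = sf.read(wav_in)
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len, hop)]
    energy = np.array([10 * np.log10(np.sum(f ** 2) + 1e-10) for f in frames])
    # keep the span of frames within threshold_db of the loudest frame
    active = np.where(energy > energy.max() - threshold_db)[0]
    start = active[0] * hop
    end = min(len(audio), active[-1] * hop + frame_len)
    sf.write(wav_out, audio[start:end], sr)

endpoint("arctic_a0001.wav", "arctic_a0001_trimmed.wav")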
I think it is fine for them to have included the “neutral” category: it’s informative to know what percentage of listeners had no preference. I don’t think it weakens their claim that the DNN is preferred over the HMM.
They could have used a two-way forced choice instead, but then we would expect to get quite large error bars on those responses, where listeners had to make an arbitrary choice (because “no preference” was not an available response).
Of course, if you include “no preference” in the test, it must also be reported! Otherwise we move into the territory of cat food marketing where we see phrases like “8 out of 10 cats” being used, but with “(of those that expressed a preference)” in a footnote. Maybe only 10 cats out of 1000 actually express a preference, the other 990 being happy to eat anything you put in front of them…
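To make that concrete with made-up numbers in Python: report the full breakdown, and the contrast with the “of those that expressed a preference” framing becomes obvious.

# Made-up counts from a hypothetical A/B/no-preference test
counts = {"DNN": 8, "HMM": 2, "no preference": 990}
total = sum(counts.values())

# Honest reporting: percentage of all responses in each category
for label, n in counts.items():
    print("{}: {:.1f}% of all responses".format(label, 100.0 * n / total))

# The "8 out of 10 cats" framing: only among those expressing a preference
expressed = counts["DNN"] + counts["HMM"]
print("DNN preferred by {:.0f}% of those expressing a preference"
      .format(100.0 * counts["DNN"] / expressed))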
Yes, thinking of eigenvoices as “standardized voice ingredients” is reasonable.
One problem with trying to listen to these voices is that the models are constructed in a normalised space, and so it doesn’t actually make sense to synthesise from the underlying models. The same problem would occur when trying to listen to the eigenvoices: they may not make sense on their own.
Here are some slides and examples from Mark Gales that give an overview of the main ideas of “Controllable and Adaptable Speech Synthesis”.
Informally, think of eigenvoices as being a set of “axes” in some abstract “speaker space”. We can create a voice for any new speaker (i.e., we can do speaker adaptation) as a weighted combination of these eigenvoices. The only parameters we need to learn are the weights. Because the number of weights will be very small (compared to the number of model parameters), we can learn them from a very small amount of data.
When you first try to understand this concept, it’s OK to imagine that the eigenvoices correspond to the actual real speakers in the training set.
In fact we can do better than that, by finding a set of basis vectors that is as small as possible (smaller than the number of training speakers) whilst still being able to represent all the different “axes” of variation across speakers.
(To get into more depth, this topic would need more than just this written forum answer. I can consider including it in lecture 10 if you wish.)
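As a very rough numerical sketch of the weighted-combination idea (a toy in numpy, not how any particular toolkit implements eigenvoices): treat each training speaker as one long “supervector” of model parameters, take the principal directions of variation across speakers as the eigenvoices, and then only a handful of weights needs to be estimated for a new speaker. In practice those weights would be estimated from a small amount of adaptation data by maximum likelihood, rather than by projecting a known supervector as done here.

import numpy as np

# Toy data: 20 training speakers, each a 1000-dimensional "supervector"
# of model parameters (e.g., stacked Gaussian mean vectors).
rng = np.random.default_rng(0)
speakers = rng.normal(size=(20, 1000))

# Eigenvoices: principal directions of variation across speakers.
mean_voice = speakers.mean(axis=0)
centred = speakers - mean_voice
_, _, vt = np.linalg.svd(centred, full_matrices=False)
eigenvoices = vt[:5]          # keep 5 basis vectors, fewer than 20 speakers

# A new speaker is then represented by just 5 weights, not 1000 parameters.
new_speaker = rng.normal(size=1000)
weights = eigenvoices @ (new_speaker - mean_voice)   # projection onto the basis
adapted_voice = mean_voice + weights @ eigenvoices

print("number of weights learned:", weights.shape[0])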
I think Zen has been “borrowing” text from Wikipedia!
XOR (which means “exclusive OR”) is a logic function and is often used as an example of something that is non-trivial to learn. For a decision tree to compute XOR, the tree will have duplicated parts, which is inefficient. Here’s a video that explains:
To compute XOR with a neural network, at least two layers are needed.
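For instance, a tiny network with one hidden layer of two units (two layers of weights) and hard-threshold activations computes XOR exactly; the weights below are just one hand-picked solution, in a quick numpy sketch:

import numpy as np

def step(x):
    # hard-threshold non-linearity
    return (x > 0).astype(int)

# Hidden layer computes OR and AND of the inputs; the output unit then
# computes "OR and not AND", which is XOR.
W1 = np.array([[1, 1],
               [1, 1]])
b1 = np.array([-0.5, -1.5])   # thresholds that turn the two rows into OR and AND
W2 = np.array([1, -1])
b2 = -0.5

def xor(x):
    h = step(W1 @ x + b1)      # h = [OR(x), AND(x)]
    return step(W2 @ h + b2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor(np.array(x)))   # prints 0, 1, 1, 0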
More generally, the divide-and-conquer approach of decision trees is inefficient for considering combinations of predictors that “behave like XOR”: the tree gets deep, and the lower parts will not be well trained because only a subset of the data is used.
It’s hard to say what XOR, d-bit parity functions, or multiplexer problems have got to do with speech synthesis though (we should ask Zen!), other than that they are also non-trivial to compute.
So, all that Zen is really saying is that neural networks are more powerful models than decision trees. Whether neural networks actually work better than decision trees for speech synthesis remains a purely empirical question though: try them both and see which sounds best!
I’ve added some information on that to the instructions.
Choices about sizes and numbers of hidden layers are generally made empirically, to minimise the error on a development set. In the quote you give above, that is what Zen is saying: he tried different options and chose the one that worked best.
It is computationally expensive to explore all possible architectures, so in practice these things are roughly optimised and then left fixed (e.g., 5 hidden layers of 1024 units each).
The transformation from input linguistic features to output vocoder features is highly non-linear. The only essential requirement in a neural network is that the units in the hidden layers have non-linear activation functions (if all activations were linear, the network would be a linear function regardless of the number of layers: it would be a sequence of matrix multiplies and additions of biases).
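A quick numerical check of that last point (a sketch in numpy): composing two purely linear layers collapses to a single linear layer, so without non-linear activations the extra depth adds nothing.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)

# Two layers with linear (identity) activations
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
two_layers = W2 @ (W1 @ x + b1) + b2

# Exactly the same function, written as a single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True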
There is some variation in the terminology used to refer to the weights that connect one layer to the next. Because the number of weights between two layers is equal to the product of the numbers of units in the two layers, it is natural to think of the weights as being a rectangular matrix: hence “weight matrix”.
However, many authors conceptualise all the trainable parameters of the network (several weight matrices and all the individual biases) as one variable, and they will place them all together into a vector: hence “parameter vector” or “weight vector”. This is a notational convenience, so we can write a single equation for the derivative of the error with respect to the weight vector as a whole.
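For example (a notational sketch in numpy, with hypothetical layer sizes): for a network with two weight matrices and two bias vectors, the “parameter vector” is simply all of them flattened and concatenated into one long vector.

import numpy as np

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)   # layer 1: weights and biases
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)   # layer 2: weights and biases

# All trainable parameters gathered into a single "parameter vector"
theta = np.concatenate([W1.ravel(), b1, W2.ravel(), b2])
print(theta.shape)   # (23,) = 12 + 3 + 6 + 2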