Forum Replies Created
Yes, you need to allow for homophones, and I also recommend allowing for spelling mistakes: you are testing your systems, not the listeners.
In the Blizzard Challenge, and almost everywhere else, Word Error Rate (WER) is used. I have rarely, if ever, seen anyone following the recommendation from the original paper to score entire sentences.
With WER, there is no need to normalise for sentence length. Just use the same formula that is used in automatic speech recognition.
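If it helps, here is a minimal sketch of that calculation (plain Python, purely illustrative and not part of any toolkit used in the assignment): WER = (S + D + I) / N from a word-level Levenshtein alignment, with a hypothetical normalisation step standing in for whatever homophone and spelling handling you decide on.

def normalise(text):
    # Hypothetical normalisation: lowercase, plus a mapping of homophones/misspellings you accept.
    accept = {"bored": "board", "there": "their"}   # example mapping only
    return [accept.get(w, w) for w in text.lower().split()]

def wer(reference, hypothesis):
    # (S + D + I) / N via word-level Levenshtein distance, as in ASR scoring.
    ref, hyp = normalise(reference), normalise(hypothesis)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + sub)    # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

# To pool over all responses, sum the edit counts and divide by the total number of
# reference words, rather than averaging per-sentence WERs.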
The “bad f0” penalty is part of the target cost. It obtains F0 at the concatenation points of the unit from the “stripped” join cost features (coef2). In fact, only the voicing status is used, and not the actual value of F0.
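To make the idea concrete, here is a simplified sketch of what a penalty of this kind does (illustrative only, with made-up names and weight, not the actual Multisyn code): a frame is treated as voiced if its F0 is non-zero, and a candidate unit is penalised when the voicing at its edge contradicts what the target phone requires.

BAD_F0_PENALTY = 10.0   # made-up weight, for illustration only

def bad_f0_penalty(target_phone_is_voiced, edge_f0):
    # Only the voicing status matters: F0 > 0 at the concatenation point means "voiced".
    edge_is_voiced = edge_f0 > 0
    return BAD_F0_PENALTY if target_phone_is_voiced != edge_is_voiced else 0.0

# e.g. a voiced phone whose edge frame has F0 = 0 is penalised:
print(bad_f0_penalty(True, 0.0))     # 10.0
print(bad_f0_penalty(True, 115.3))   # 0.0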
Normally, we do not calculate a score for an individual listener at all, we just pool all responses. But, you are right that the Latin Square design would prevent us from computing per-listener scores.
You seem to be suggesting that scores should be somehow rescaled, or made relative within each listener. That seems reasonable for an intelligibility test, but in fact is not something that is normally done.
But, this might not be the perfect arrangement, so feel free to try something different.
Your suggestion to include natural speech is good, but don’t expect listeners to transcribe that perfectly: they will still make errors (see Blizzard Challenge summary papers for typical WERs on naturally-spoken SUS – they are not 0%) and so you can’t necessarily use this as a means to exclude listeners who didn’t follow the instructions carefully.
I’m not sure that’s a convincing explanation. You are arguing that units with “bad pitchmarks” won’t get used. Fine, but then an increase in the proportion of units with bad pitchmarks effectively reduces the inventory size, which should lead to worse quality.
Does using the ‘male’ setting actually lead to more ‘bad pitchmarking’ warnings? These warnings relate to units without any pitchmarks at all, and this then results in a penalty.
If the ‘male’ setting just results in fewer pitchmarks overall (say, about half as many), that’s a different situation – think about what the consequences of that would be.
For the domain-specific voice, obviously you will want to try domain-specific test sentences (that’s the whole point). You are right to consider how to control the difficulty of those sentences, when measuring intelligibility. Here are some options:
1. Try ‘normal’ domain-specific sentences and see whether you do indeed have a ceiling effect on intelligibility (just make that judgement informally, although there are formal statistical tests for this kind of thing).
2. If ‘normal’ domain-specific sentences are too easy, then you may decide to try one of your two suggestions: domain-specific SUS, or adding noise.
I’m not entirely sure about domain-specific SUS: although the words will be well-covered by the database, the sequences of words will not, and so there will still be lots of joins. But try it and see!
Adding noise is a good option, but you will need to choose an appropriate Signal-to-Noise Ratio (SNR) carefully. See this reply, and the paper I mention there, for some clues. Don’t aim to replicate that paper (!) but you might do something similar on an informal basis; there is a sketch of how the mixing works after this list.
3. Or, simply don’t measure intelligibility, and only measure naturalness, in this experiment.
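For the noise option, here is a minimal sketch of the mixing itself (assuming mono speech and noise arrays at the same sample rate, with the noise at least as long as the speech; the SNR value is something you would choose by informal piloting):

import numpy as np

def add_noise(speech, noise, snr_db):
    # Scale the noise so the speech-to-noise power ratio equals snr_db, then mix.
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

# e.g. noisy = add_noise(speech, babble, snr_db=5.0)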
The x-axis is the experimental variable that you control (for you: the difficulty of the text) and the y-axis is the response (for you: Word Error Rate (WER), or Accuracy).
For this assignment there is no need to actually plot out a full psychometric curve: that would consume a lot of listening test time. Instead, you can informally try a few different types of text, and choose a type that gives you a WER in about the right range (let’s say 30-50%).
An example curve is Figure 1 in this paper (there, the x-axis is Signal-to-Noise Ratio, but you could imagine that it is “text difficulty” ranging from “very hard” on the left to “really easy” on the right).
A value of exactly 0 for the beam widths is actually a special value that turns pruning off altogether. If you want to try a lot of pruning, use a small value, but something greater than 0.
For your specific example, don’t try to fix it, but you could report it as part of your qualitative analysis. Try to suggest how it could be fixed though.
For testing intelligibility
Now things get more complicated. You correctly state two things:
1. all systems must synthesise exactly the same set of sentences, in case some are easier or more difficult than others.
2. listeners cannot be presented with the same sentence more than once, because they will probably remember it and so make fewer errors on the second presentation.
So, we need an experimental design that satisfies the above points. The standard way to do that is with a Latin Square. Let’s do a really simple 2×2 design, for comparing two systems (call them A and B) using, say, 10 sentences (1 to 10).
The trick is to split the listeners into groups:
Listener group 1 hears:
System A synthesising sentences 1-5
System B synthesising sentences 6-10

Listener group 2 hears:
System B synthesising sentences 1-5
System A synthesising sentences 6-10

and we pool all responses. What is happening is that a pair of listeners (one from each group) form a “virtual listener” who hears all combinations.
The number of listeners needs to be a multiple of the number of systems (2 in this example) so that each listener group is the same size. Imagine we recruit 20 listeners – we put 10 of them into each group.
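If it helps to see the bookkeeping, here is a sketch of how that assignment generalises to any number of systems (illustrative only; the names are my own): each listener group hears every system, but on a different block of sentences, rotated so that every system/sentence pairing is covered once across the groups.

def latin_square_design(systems, sentences):
    # One listener group per system; group g hears system s on sentence block (s + g) mod n.
    n = len(systems)
    block = len(sentences) // n
    blocks = [sentences[i * block:(i + 1) * block] for i in range(n)]
    return [{systems[s]: blocks[(s + g) % n] for s in range(n)}
            for g in range(n)]

# The 2x2 example above: group 1 gets A on 1-5 and B on 6-10, group 2 the reverse.
for g, assignment in enumerate(latin_square_design(["A", "B"], list(range(1, 11)))):
    print("Listener group", g + 1, assignment)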
For testing naturalness
In a standard MOS test, listeners rate a single stimulus at a time (one particular system saying one sentence). Across the whole test, it is important that all systems synthesise the same set of sentences. You will probably either pseudo-randomise the order of presentation, or use a fancier design such as a Latin Square.
So, listeners will not be aware of how many different systems there are. They will notice the same sentence occurring more than once but the pseudo-randomisation (or other method) can be made to ensure that they never hear the same sentence two times in a row.
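One easy way to get such an ordering (a sketch only; rejection sampling is crude but perfectly adequate for a test of this size) is to reshuffle until no sentence appears twice in a row:

import random

def shuffle_no_adjacent_repeats(stimuli, max_tries=1000):
    # stimuli is a list of (system, sentence) pairs; never play the same sentence back-to-back.
    order = list(stimuli)
    for _ in range(max_tries):
        random.shuffle(order)
        if all(a[1] != b[1] for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("no valid order found; are there too few distinct sentences?")

# e.g. playlist = shuffle_no_adjacent_repeats([("A", 1), ("B", 1), ("A", 2), ("B", 2)])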
What do you get when you do the following:
$ festival $MBDIR/scm/build_unitsel.scm
festival> (setup_phoneset_and_lexicon 'unilex-rpx) #
festival> (lex.lookup 'Cabernet)
("Cabernet" nil (((k a b) 0) ((@ n) 0) ((e t) 0)))

and are you sure the problem is with the word “Cabernet” and not “Shiraz”?

festival> (lex.lookup 'Shiraz)
("Shiraz" nil (((sh ? r a z) 0)))
where “?” is a glottal stop. That symbol is mapped to “Q” for the forced alignment step, because HTK cannot handle a phoneme called “?” (it conflicts with the regular-expression-like patterns it uses).
So, it looks like “Q” and “Q_cl” are simply missing from $MBDIR/resources/phone_list.unilex-rpx but that this error only surfaces on rare occasions, because the LTS decision tree only predicts a glottal stop in very infrequent contexts.
You correctly note that “systems may perform very differently on some sentences but similarly on others”. The simple way to mitigate this is to use a larger number of sentences – there is no hard-and-fast rule, but I would suggest a minimum of 10 different sentences, and more than that (e.g., 20 or 30) if you can fit that into your listening test.
Although, in principle, the likelihoods of the segments could be used as a simple form of confidence measure, this is not actually done in Multisyn.
Errors about unknown phones are almost always due to mixing up more than one dictionary. Start back from the step where you choose the dictionary and after that re-do anything that involves the dictionary in subsequent steps (e.g., creating the initial labels).