Forum Replies Created
Yes, you need to allow for homophones, and I also recommend allowing for spelling mistakes: you are testing your systems, not the listeners.
In the Blizzard Challenge, and almost everywhere else, Word Error Rate (WER) is used. I have rarely, if ever, seen anyone following the recommendation from the original paper to score entire sentences.
With WER, there is no need to normalise for sentence length. Just use the same formula that is used in automatic speech recognition.
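If it helps, here is a minimal sketch of that calculation (plain Python, purely illustrative and not part of any toolkit used in the assignment): WER = (S + D + I) / N from a word-level Levenshtein alignment, with a hypothetical normalisation step standing in for whatever homophone and spelling handling you decide on.

def normalise(text):
    # Hypothetical normalisation: lowercase, plus a mapping of homophones/misspellings you accept.
    accept = {"bored": "board", "there": "their"}   # example mapping only
    return [accept.get(w, w) for w in text.lower().split()]

def wer(reference, hypothesis):
    # (S + D + I) / N via word-level Levenshtein distance, as in ASR scoring.
    ref, hyp = normalise(reference), normalise(hypothesis)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + sub)    # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

# To pool over all responses, sum the edit counts and divide by the total number of
# reference words, rather than averaging per-sentence WERs.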
The “bad f0” penalty is part of the target cost. It obtains F0 at the concatenation points of the unit from the “stripped” join cost features (coef2). In fact, only the voicing status is used, and not the actual value of F0.
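To make the idea concrete, here is a simplified sketch of what a penalty of this kind does (illustrative only, with made-up names and weight, not the actual Multisyn code): a frame is treated as voiced if its F0 is non-zero, and a candidate unit is penalised when the voicing at its edge contradicts what the target phone requires.

BAD_F0_PENALTY = 10.0   # made-up weight, for illustration only

def bad_f0_penalty(target_phone_is_voiced, edge_f0):
    # Only the voicing status matters: F0 > 0 at the concatenation point means "voiced".
    edge_is_voiced = edge_f0 > 0
    return BAD_F0_PENALTY if target_phone_is_voiced != edge_is_voiced else 0.0

# e.g. a voiced phone whose edge frame has F0 = 0 is penalised:
print(bad_f0_penalty(True, 0.0))     # 10.0
print(bad_f0_penalty(True, 115.3))   # 0.0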
Normally, we do not calculate a score for an individual listener at all, we just pool all responses. But, you are right that the Latin Square design would prevent us from computing per-listener scores.
You seem to be suggesting that scores should be somehow rescaled, or made relative within each listener. That seems reasonable for an intelligibility test, but in fact is not something that is normally done.
But, this might not be the perfect arrangement, so feel free to try something different.
Your suggestion to include natural speech is good, but don’t expect listeners to transcribe that perfectly: they will still make errors (see Blizzard Challenge summary papers for typical WERs on naturally-spoken SUS – they are not 0%) and so you can’t necessarily use this as a means to exclude listeners who didn’t follow the instructions carefully.
I’m not sure that’s a convincing explanation. You are arguing that units with “bad pitchmarks” won’t get used. Fine, but then an increase in the proportion of units with bad pitchmarks effectively reduces the inventory size, which should lead to worse quality.
Does using the ‘male’ setting actually lead to more ‘bad pitchmarking’ warnings? These warnings relate to units without any pitchmarks at all, and this then results in a penalty.
If the ‘male’ setting just results in fewer pitchmarks overall (say, about half as many), that’s a different situation – think about what the consequences of that would be.
For the domain-specific voice, obviously you will want to try domain-specific test sentences (that’s the whole point). You are right to consider how to control the difficulty of those sentences, when measuring intelligibility. Here are some options:
1. Try ‘normal’ domain-specific sentences and see whether you do indeed have a ceiling effect on intelligibility (just make that judgement informally, although there are formal statistical tests for this kind of thing).
2. If ‘normal’ domain-specific sentences are too easy, then you may decide to try one of your two suggestions: domain-specific SUS, or adding noise.
I’m not entirely sure about domain-specific SUS: although the words will be well-covered by the database, the sequences of words will not, and so there will still be lots of joins. But try it and see!
Adding noise is a good option, but you will need to choose an appropriate Signal-to-Noise Ratio (SNR) carefully. See this reply, and the paper I mention there, for some clues. Don’t aim to replicate that paper (!) but you might do something similar on an informal basis; there is a sketch of how the mixing works after this list.
3. Or, simply don’t measure intelligibility, and only measure naturalness, in this experiment.
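For the noise option, here is a minimal sketch of the mixing itself (assuming mono speech and noise arrays at the same sample rate, with the noise at least as long as the speech; the SNR value is something you would choose by informal piloting):

import numpy as np

def add_noise(speech, noise, snr_db):
    # Scale the noise so the speech-to-noise power ratio equals snr_db, then mix.
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

# e.g. noisy = add_noise(speech, babble, snr_db=5.0)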
The x-axis is the experimental variable that you control (for you: the difficulty of the text) and the y-axis is the response (for you: Word Error Rate (WER), or Accuracy).
For this assignment there is no need to actually plot out a full psychometric curve: that would consume a lot of listening test time. Instead, you can informally try a few different types of text, and choose a type that gives you a WER in about the right range (let’s say 30-50%).
An example curve is Figure 1 in this paper (there, the x-axis is Signal-to-Noise Ratio, but you could imagine that it is “text difficulty” ranging from “very hard” on the left to “really easy” on the right).
A value of exactly 0 for the beam widths is actually a special value that turns pruning off altogether. If you want to try a lot of pruning, use a small value, but something greater than 0.
For your specific example, don’t try to fix it, but you could report it as part of your qualitative analysis. Try to suggest how it could be fixed though.
For testing intelligibility
Now things get more complicated. You correctly state two things:
1. all systems must synthesise exactly the same set of sentences, in case some are easier or more difficult than others.
2. listeners cannot be presented with the same sentence more than once, because they will probably remember it and so make fewer errors on the second presentation.
So, we need an experimental design that satisfies the above points. The standard way to do that is with a Latin Square. Let’s do a really simple 2×2 design, for comparing two systems (call them A and B) using, say, 10 sentences (1 to 10).
The trick is to split the listeners into groups:
Listener group 1 hears:
System A synthesising sentences 1-5
System B synthesising sentences 6-10

Listener group 2 hears:
System B synthesising sentences 1-5
System A synthesising sentences 6-10

and we pool all responses. What is happening is that a pair of listeners (one from each group) form a “virtual listener” who hears all combinations.
The number of listeners needs to be a multiple of the number of systems (2 in this example) so that each listener group is the same size. Imagine we recruit 20 listeners – we put 10 of them into each group.
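If it helps to see the bookkeeping, here is a sketch of how that assignment generalises to any number of systems (illustrative only; the names are my own): each listener group hears every system, but on a different block of sentences, rotated so that every system/sentence pairing is covered once across the groups.

def latin_square_design(systems, sentences):
    # One listener group per system; group g hears system s on sentence block (s + g) mod n.
    n = len(systems)
    block = len(sentences) // n
    blocks = [sentences[i * block:(i + 1) * block] for i in range(n)]
    return [{systems[s]: blocks[(s + g) % n] for s in range(n)}
            for g in range(n)]

# The 2x2 example above: group 1 gets A on 1-5 and B on 6-10, group 2 the reverse.
for g, assignment in enumerate(latin_square_design(["A", "B"], list(range(1, 11)))):
    print("Listener group", g + 1, assignment)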
For testing naturalness
In a standard MOS test, listeners rate a single stimulus at a time (one particular system saying one sentence). Across the whole test, it is important that all systems synthesise the same set of sentences. You will probably either pseudo-randomise the order of presentation, or use a fancier design such as a Latin Square.
So, listeners will not be aware of how many different systems there are. They will notice the same sentence occurring more than once but the pseudo-randomisation (or other method) can be made to ensure that they never hear the same sentence two times in a row.
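One easy way to get such an ordering (a sketch only; rejection sampling is crude but perfectly adequate for a test of this size) is to reshuffle until no sentence appears twice in a row:

import random

def shuffle_no_adjacent_repeats(stimuli, max_tries=1000):
    # stimuli is a list of (system, sentence) pairs; never play the same sentence back-to-back.
    order = list(stimuli)
    for _ in range(max_tries):
        random.shuffle(order)
        if all(a[1] != b[1] for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("no valid order found; are there too few distinct sentences?")

# e.g. playlist = shuffle_no_adjacent_repeats([("A", 1), ("B", 1), ("A", 2), ("B", 2)])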
What do you get when you do the following:
$ festival $MBDIR/scm/build_unitsel.scm
festival> (setup_phoneset_and_lexicon 'unilex-rpx) #
festival> (lex.lookup 'Cabernet)
("Cabernet" nil (((k a b) 0) ((@ n) 0) ((e t) 0)))

and are you sure the problem is with the word “Cabernet” and not “Shiraz”?

festival> (lex.lookup 'Shiraz)
("Shiraz" nil (((sh ? r a z) 0)))
where “?” is a glottal stop. That symbol is mapped to “Q” for the forced alignment step, because HTK cannot handle a phoneme called “?” (it conflicts with the regular-expression-like patterns it uses).
So, it looks like “Q” and “Q_cl” are simply missing from $MBDIR/resources/phone_list.unilex-rpx but that this error only surfaces on rare occasions, because the LTS decision tree only predicts a glottal stop in very infrequent contexts.
You correctly note that “systems may perform very differently on some sentences but similarly on others”. The simple way to mitigate this is to use a larger number of sentences – there is no hard-and-fast rule, but I would suggest a minimum of 10 different sentences, and more than that (e.g., 20 or 30) if you can fit that into your listening test.
Although, in principle, the likelihoods of the segments could be used as a simple form of confidence measure, this is not actually done in Multisyn.
Errors about unknown phones are almost always due to mixing up more than one dictionary. Start back from the step where you choose the dictionary and after that re-do anything that involves the dictionary in subsequent steps (e.g., creating the initial labels).