Forum Replies Created
In-class “dissection” of selected essential readings
This is very popular (much more than I expected) and so we will definitely do more of this.
In-class group/pair activities
The majority of you really like these and find them useful, but a few do not; the ratio of like to dislike is about 2:1.
So, we’ll keep doing this, but not to excess.
Website is much improved
Thanks – it does work much better now. I will restructure Speech Processing for future years, along the same lines as Speech Synthesis.
Yes, that is correct. The particular time alignment (within each pitch period) between the detected epochs and the speech waveform will vary between algorithms. In the video, my very simple algorithm does not include any special steps to get a “good” alignment.
Both your hypotheses are reasonable.
Hypothesis 1 simply states that an ASF target cost function is superior to an IFF one. That will be true if our predictions of speech parameters are sufficiently accurate. The reason that measuring the target-to-candidate-unit distance in acoustic space is better than in linguistic feature space is sparsity. See Figure 16.6 in Taylor’s book, or the video on ASF target cost functions.
Hypothesis 2 is currently true much of the time, although improvements are being made steadily. It is just now becoming possible to construct commercial-quality TTS systems that use a vocoder, rather than waveform concatenation.
It’s worth reminding ourselves that an ASF target cost function does not need to use vocoder parameters as such, because we do not need to be able to reconstruct the waveform from them. We could choose to use a simpler parameterisation of speech (e.g., ASR-style MFCCs derived using a filterbank) if we wished.
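Purely as an illustration of the idea (this is not Festival’s implementation, and the choice of parameters and names below is made up), an ASF target cost can be as simple as a weighted Euclidean distance in acoustic space between the parameters predicted for the target and those measured for a candidate unit:

import numpy as np

def asf_target_cost(predicted_params, candidate_params, weights=None):
    # Illustrative ASF target cost: a weighted Euclidean distance in acoustic
    # space (e.g., per-unit average MFCC vectors) between the parameters
    # predicted for the target and those measured for a candidate unit.
    p = np.asarray(predicted_params, dtype=float)
    c = np.asarray(candidate_params, dtype=float)
    w = np.ones_like(p) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(w * (p - c) ** 2)))

# Hypothetical example with 12-dimensional MFCC vectors:
rng = np.random.default_rng(0)
print(asf_target_cost(rng.normal(size=12), rng.normal(size=12)))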
Vocoder: how we can evaluate the performance of a vocoder
If you think that WORLD gives perfect quality when vocoding (which is often called “copy synthesis”) then you either need to listen more closely, or use a better pair of headphones! You should be able to discern some artefacts of vocoding.
To evaluate copy synthesis, a MUSHRA listening test would be most appropriate. The hidden reference would be the original waveform and the lower anchor could be low-pass filtered speech.
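If you want to prepare that lower anchor yourself, here is a rough sketch using scipy (the 3.5 kHz cut-off is the value commonly used for the MUSHRA low anchor; the function name, filter order and filenames are just my choices):

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def make_lowpass_anchor(reference_wav, anchor_wav, cutoff_hz=3500):
    # Create a MUSHRA lower anchor by low-pass filtering the reference
    # waveform. Assumes a mono 16-bit PCM file.
    fs, x = wavfile.read(reference_wav)
    sos = butter(8, cutoff_hz, btype="low", fs=fs, output="sos")
    y = sosfiltfilt(sos, x.astype(np.float64))
    wavfile.write(anchor_wav, fs, y.astype(np.int16))

make_lowpass_anchor("reference.wav", "anchor.wav")  # hypothetical filenames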
Yes, that’s basically right. The filter is ‘swept’ across a range of frequencies – i.e., hypothesised values for F0. The notches are at multiples of the hypothesised value of F0, and so when the filter is set for the correct value of F0, they will align with the harmonics and remove the maximum amount of energy.
This is an old-fashioned method, and was included in the video to illustrate that there are often several possible solutions to any problem in signal processing.
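Here is a minimal numpy sketch of that idea (not the exact filter from the video): the comb filter y[n] = x[n] - x[n-T] has notches at multiples of fs/T, so we sweep T over hypothesised pitch periods and keep the F0 whose filter removes the most energy.

import numpy as np

def estimate_f0_by_notch_sweep(frame, fs, f0_min=80.0, f0_max=300.0, step=1.0):
    # Sweep hypothesised F0 values. For each one, apply the comb filter
    # y[n] = x[n] - x[n-T] with T = round(fs / F0); its notches sit at
    # multiples of the hypothesised F0, so the correct hypothesis removes
    # the most harmonic energy, i.e. minimises the output energy.
    best_f0, best_energy = None, np.inf
    for f0 in np.arange(f0_min, f0_max, step):
        T = int(round(fs / f0))
        y = frame[T:] - frame[:-T]
        energy = np.mean(y ** 2)
        if energy < best_energy:
            best_f0, best_energy = f0, energy
    return best_f0

# Quick check on a synthetic harmonic signal with F0 = 120 Hz:
fs = 16000
t = np.arange(0, 0.04, 1 / fs)
frame = sum(np.sin(2 * np.pi * 120 * k * t) for k in range(1, 6))
print(estimate_f0_by_notch_sweep(frame, fs))  # should be close to 120 Hz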
“Why not use an ASR system as an objective evaluation method?”
This idea is often proposed, and occasionally tried. In essence, it uses ASR as a model of a listener, and assumes that its accuracy correlates with listener performance.
However, an ASR system is unlikely to be a very good model of human perception and language processing, and so you would need to find out whether this is true before proceeding.
“I think we really should focus on the effectiveness of communication itself.”
Yes, that would be a good idea. The challenge in doing that would be creating an evaluation that truly measured communication success. It’s possible, but likely to be more complicated than a simple listening + transcribing task.
Thinking more generally, you are touching on the idea of “ecological validity” in evaluation, which is an evaluation that accurately measures “real world” performance. It is not easy to make an evaluation ecologically valid, and is very likely to be much more complex to set up than a simple listening test. There is also a significant risk of introducing confounding or uncontrolled external factors into the experimental situation. These would make the results harder to interpret.
The reason that researchers very rarely worry about ecological validity in synthetic speech evaluation is a mixture of laziness, cost, and the problem of external confounding factors.
“if we speak in a very un-natural way, this can add more difficulty in comprehending the meaning. At least, our brain will spend more time to understand it correctly”
That seems a sensible hypothesis, and I believe it is true.
What we do know (from the Blizzard Challenge) is that the most intelligible system is definitely not always the most natural. A very basic statistical parametric system can be highly intelligible without sounding very natural.
What we don’t know as much about is whether “our brain will spend more time” processing such speech. We might formally call that “cognitive load”. We have a hypothesis that some or all forms of synthetic speech impose a higher cognitive load on the listener than natural speech does. Testing this hypothesis is ongoing research.
Hunt & Black use a combination of IFF and ASF target costs – see Section 2.2 of the paper.
All units are stored in the same way: they are part of whole utterances, as recorded by the speaker, in the database. Candidate units are extracted from the database utterances on-the-fly during synthesis.
You are correct that we can “extract” sub-parts of multi-phone units, yes. This is in fact no different to extracting any other units — whether single diphones or sequences of diphones — from the database utterances.
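To make that concrete, here is a toy sketch (the data structures are made up, not Festival’s): extracting a candidate just means slicing the stored utterance between the relevant diphone boundaries, whether that is one diphone or several in a row.

def extract_candidate(utt_waveform, diphone_boundaries, first, last):
    # utt_waveform       : the samples of a whole recorded database utterance
    # diphone_boundaries : sample index where each diphone starts, plus one
    #                      final index marking the end of the last diphone
    # first, last        : indices of the first and last diphone to extract
    #                      (first == last gives a single diphone; first < last
    #                      gives a longer multi-diphone candidate)
    return utt_waveform[diphone_boundaries[first]:diphone_boundaries[last + 1]]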
Usually, as you suggest, there will be a relative weighting between the two costs. You can experiment with that for yourself in Festival.
The join cost and target cost might be on quite different scales, since they are measuring different things entirely. Internally, the costs may perform some normalisation to partially address this. The cost functions themselves generally also involve summing up sub-costs, and again there will be weights involved in that sum: you can also experiment with this for yourself in Festival’s join cost function.
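Schematically (the numbers and weights below are invented, and this is not Festival’s code), the overall cost of a candidate sequence is a weighted combination of per-unit target costs and per-boundary join costs, each of which is itself a weighted sum of sub-costs:

import numpy as np

def weighted_sum(sub_costs, weights):
    # Both the target cost and the join cost are typically weighted sums of
    # sub-costs (e.g., individual linguistic features or acoustic distances).
    return float(np.dot(weights, sub_costs))

def sequence_cost(target_costs, join_costs, w_target=1.0, w_join=1.0):
    # Combine per-unit target costs and per-boundary join costs for one
    # candidate unit sequence, with a relative weighting between the two.
    return w_target * sum(target_costs) + w_join * sum(join_costs)

# Invented example: a 3-unit candidate sequence has 3 target costs and 2 join costs.
t_costs = [weighted_sum([0.2, 0.5], [1.0, 0.5]) for _ in range(3)]
j_costs = [weighted_sum([0.1, 0.3], [2.0, 1.0]) for _ in range(2)]
print(sequence_cost(t_costs, j_costs, w_target=1.0, w_join=2.5))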
It’s written in JavaScript, from scratch. It’s much simpler than you might imagine…
The terminology is potentially confusing.
A triphone is a model of one whole phone. It’s not a model of a fraction of a phone, and it’s not a model of a sequence of three phones.
It’s called triphone because it takes a context of three phones into account (previous, current and next). So, it’s a context-dependent model of a phone. Depending on the amount of context we take into account, we use different terms to describe a model of a phone:
- monophone – context-independent
- biphone – depends on either the left or the right context
- triphone – depends on left and right context
- quinphone – depends on 2 phones to the left and 2 to the right
(Note that a biphone model is not a model of a diphone!)
Jurafsky and Martin’s explanations may not be the clearest, especially when they start talking about “sub-phones”. By that, they simply mean the states within an HMM.
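If it helps, here is a small sketch of that naming scheme using the common HTK-style left-phone+right labels (the phone set in the example is just for illustration):

def to_context_dependent_labels(phones):
    # Expand a phone sequence into HTK-style context-dependent labels:
    # triphones (left-phone+right) where both contexts exist, falling back
    # to biphones or a monophone at the utterance boundaries.
    labels = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        if left and right:
            labels.append(f"{left}-{p}+{right}")  # triphone
        elif right:
            labels.append(f"{p}+{right}")         # biphone (right context only)
        elif left:
            labels.append(f"{left}-{p}")          # biphone (left context only)
        else:
            labels.append(p)                      # monophone
    return labels

print(to_context_dependent_labels(["sil", "k", "ae", "t", "sil"]))
# ['sil+k', 'sil-k+ae', 'k-ae+t', 'ae-t+sil', 't-sil']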