Forum Replies Created
Both your hypotheses are reasonable.
Hypothesis 1 simply states that an ASF target cost function is superior to an IFF one. That will be true if our predictions of speech parameters are sufficiently accurate. The reason that measuring the target-to-candidate-unit distance in acoustic space is better than in linguistic feature space is sparsity. See Figure 16.6 in Taylor’s book, or the video on ASF target cost functions.
Hypothesis 2 is currently true much of the time, although improvements are being made steadily. It is just now becoming possible to construct commercial-quality TTS systems that use a vocoder, rather than waveform concatenation.
It’s worth reminding ourselves that an ASF target cost function does not need to use vocoder parameters as such, because we do not need to be able to reconstruct the waveform from them. We could choose to use a simpler parameterisation of speech (e.g., ASR-style MFCCs derived using a filterbank) if we wished.
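To make that concrete, here is a minimal sketch (in Python, with made-up feature vectors) of what an ASF-style target cost might look like: a weighted Euclidean distance in acoustic space between the features predicted for the target and those measured for a candidate unit. The choice of features (12 MFCCs plus log F0) and the weights are illustrative assumptions, not any particular system's implementation.

```python
import numpy as np

def asf_target_cost(predicted, candidate, weights=None):
    """Weighted Euclidean distance between the acoustic features predicted
    for a target and those measured for a candidate unit."""
    predicted = np.asarray(predicted, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    if weights is None:
        weights = np.ones_like(predicted)   # equal weighting by default
    diff = predicted - candidate
    return float(np.sqrt(np.sum(weights * diff ** 2)))

# Hypothetical example: 12 MFCCs followed by log F0, averaged over the unit
predicted = np.concatenate([np.random.randn(12), [np.log(120.0)]])
candidate = np.concatenate([np.random.randn(12), [np.log(110.0)]])
print(asf_target_cost(predicted, candidate))
```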
February 25, 2017 at 16:51, in reply to: Vocoder: how we can evaluate the performance of a vocoder #6788

If you think that WORLD gives perfect quality when vocoding (which is often called “copy synthesis”) then you either need to listen more closely, or use a better pair of headphones! You should be able to discern some artefacts of vocoding.
To evaluate copy synthesis, a MUSHRA listening test would be most appropriate. The hidden reference would be the original waveform and the lower anchor could be low-pass filtered speech.
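For concreteness, here is a minimal sketch of how such a low-pass-filtered lower anchor might be created with SciPy. The 3.5 kHz cutoff follows common MUSHRA practice, and the filenames are placeholders.

```python
import soundfile as sf
from scipy.signal import butter, filtfilt

def make_lowpass_anchor(in_wav, out_wav, cutoff_hz=3500.0, order=8):
    """Create a MUSHRA-style lower anchor by low-pass filtering the reference."""
    x, fs = sf.read(in_wav)
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="low")
    y = filtfilt(b, a, x, axis=0)   # zero-phase, so no extra delay is introduced
    sf.write(out_wav, y, fs)

make_lowpass_anchor("reference.wav", "anchor_lowpass.wav")  # hypothetical filenames
```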
Yes, that’s basically right. The filter is ‘swept’ across a range of frequencies – i.e., hypothesised values for F0. The notches are at multiples of the hypothesised value of F0, and so when the filter is set for the correct value of F0, they will align with the harmonics and remove the maximum amount of energy.
This is an old-fashioned method, and was included in the video to illustrate that there are often several possible solutions to any problem in signal processing.
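If you want to see the idea in action, here is a rough frequency-domain sketch of it in Python (not the original time-domain filter): for each hypothesised F0, measure the energy falling in narrow notches at its multiples, and keep the hypothesis that would remove the most. The frame length, search range and notch width are illustrative assumptions, and a real implementation would need extra care to avoid sub-harmonic (octave) errors.

```python
import numpy as np

def estimate_f0_by_notch_sweep(frame, fs, f0_min=80.0, f0_max=300.0, step=1.0,
                               notch_halfwidth_hz=10.0):
    """Sweep hypothesised F0 values; for each, measure how much energy falls
    inside narrow notches placed at every multiple of that F0."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    best_f0, best_removed = None, -1.0
    for f0 in np.arange(f0_min, f0_max + step, step):
        removed = 0.0
        for harmonic in np.arange(f0, fs / 2.0, f0):
            in_notch = np.abs(freqs - harmonic) <= notch_halfwidth_hz
            removed += spectrum[in_notch].sum()
        if removed > best_removed:
            best_f0, best_removed = f0, removed
    return best_f0

# A synthetic voiced-like frame with harmonics of 120 Hz, just for testing
fs = 16000
t = np.arange(1024) / fs
frame = sum(np.sin(2 * np.pi * 120.0 * k * t) / k for k in range(1, 8))
print(estimate_f0_by_notch_sweep(frame, fs))   # should come out close to 120
```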
“Why not use an ASR system as an objective evaluation method?”
This idea is often proposed, and occasionally tried. In essence, it uses ASR as a model of a listener, and assumes that its accuracy correlates with listener performance.
However, an ASR system is unlikely to be a very good model of human perception and language processing, and so you would need to find out whether this is true before proceeding.
“I think we really should focus on the effectiveness of communication itself.”
Yes, that would be a good idea. The challenge in doing that would be creating an evaluation that truly measured communication success. It’s possible, but likely to be more complicated than a simple listening + transcribing task.
Thinking more generally, you are touching on the idea of “ecological validity” in evaluation – that is, an evaluation that accurately measures “real world” performance. It is not easy to make an evaluation ecologically valid, and doing so is very likely to be much more complex than setting up a simple listening test. There is also a significant risk of introducing confounding or uncontrolled external factors into the experimental situation, which would make the results harder to interpret.
The reason that researchers very rarely worry about ecological validity in synthetic speech evaluation is a mixture of laziness, cost, and the problem of external confounding factors.
“if we speak in a very un-natural way, this can add more difficulty in comprehending the meaning. At least, our brain will spend more time to understand it correctly”
That seems a sensible hypothesis, and I believe it is true.
What we do know (from the Blizzard Challenge) is that the most intelligible system is definitely not always the most natural. A very basic statistical parametric system can be highly intelligible without sounding very natural.
What we don’t know as much about is whether “our brain will spend more time” processing such speech. We might formally call that “cognitive load”. We have a hypothesis that some or all forms of synthetic speech impose a higher cognitive load on the listener than natural speech does. Testing this hypothesis is ongoing research.
Hunt & Black use a combination of IFF and ASF target costs – see Section 2.2 of the paper.
All units are stored in the same way: they are part of whole utterances, as recorded by the speaker, in the database. Candidate units are extracted from the database utterances on-the-fly during synthesis.
You are correct that we can “extract” sub-parts of multi-phone units, yes. This is in fact no different to extracting any other units — whether single diphones or sequences of diphones — from the database utterances.
Usually, as you suggest, there will be a relative weighting between the two costs. You can experiment with that for yourself in Festival.
The join cost and target cost might be on quite different scales, since they are measuring different things entirely. Internally, the costs may perform some normalisation to partially address this. The cost functions themselves generally also involve summing up sub-costs, and again there will be weights involved in that sum: you can also experiment with this for yourself in Festival’s join cost function.
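To make the weighting concrete, here is a minimal Python sketch of a Viterbi-style unit-selection search in which the total path cost is a weighted sum of target and join costs. The cost functions are passed in as placeholders, and the names w_target and w_join are my own, for illustration – this is not Festival's actual code.

```python
def select_units(targets, candidates, target_cost, join_cost,
                 w_target=1.0, w_join=1.0):
    """Viterbi search over candidate units.

    targets:    list of target specifications, one per position in the utterance
    candidates: list (same length) of lists of candidate units
    Returns the candidate sequence with the lowest total weighted cost."""
    # best[i][j] = (cost of best path ending in candidate j at position i, backpointer)
    best = [[(w_target * target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + w_join * join_cost(p, c), j)
                for j, p in enumerate(candidates[i - 1]))
            row.append((prev_cost + w_target * target_cost(targets[i], c), prev_j))
        best.append(row)
    # Trace back the lowest-cost path
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

Raising w_join relative to w_target makes the search prefer smoother joins at the expense of target match, and vice versa.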
It’s written in Javascript, from scratch. It’s much simpler than you might imagine…
The terminology is potentially confusing.
A triphone is a model of one whole phone. It’s not a model of a fraction of a phone, and it’s not a model of a sequence of three phones.
It’s called triphone because it takes a context of three phones into account (previous, current and next). So, it’s a context-dependent model of a phone. Depending on the amount of context we take into account, we use different terms to describe a model of a phone:
- monophone – context-independent
- biphone – depends on either left or right context
- triphone – depends on left and right context
- quinphone – depends on 2 phones to the left and 2 to the right
(Note that a biphone model is not a model of a diphone!)
Jurafsky and Martin’s explanations may not be the clearest, especially when they start talking about “sub-phones”. By that, they simply mean the states within an HMM.
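As an illustration of the naming only, here is a small sketch that builds context-dependent labels from a phone sequence, using the HTK-style “l-c+r” notation (that notation is my own choice here, just as an example):

```python
def context_dependent_labels(phones):
    """Build HTK-style triphone labels l-c+r, falling back to biphones
    (or a monophone) where the left or right context is missing."""
    labels = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        if left and right:
            labels.append(f"{left}-{p}+{right}")   # triphone
        elif right:
            labels.append(f"{p}+{right}")          # right biphone
        elif left:
            labels.append(f"{left}-{p}")           # left biphone
        else:
            labels.append(p)                       # monophone
    return labels

print(context_dependent_labels(["sil", "k", "ae", "t", "sil"]))
# ['sil+k', 'sil-k+ae', 'k-ae+t', 'ae-t+sil', 't-sil']
```

Note that every label still names a model of a single (centre) phone; only the amount of context encoded in the name changes.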
It seems that grad student has left and their home page has gone. I can’t find a good replacement – let me know if you come across anything.
Harmonic spacing: the interval (“distance, in frequency”) between two adjacent harmonics, in voiced speech. This interval will be equal to F0 since there is a harmonic at every integer multiple of F0.
Effective filter bandwidth: the cochlea can be thought of as a filterbank – a set of bandpass filters. The centre frequencies of the filters are not evenly spaced on a linear frequency scale: they get more widely spaced at higher frequencies. The filters also have a larger bandwidth (“width”) at higher frequencies.
For the filters at higher frequencies, this bandwidth is greater than the harmonic spacing. Therefore, the cochlea cannot resolve the harmonics at higher frequencies. Rather, the cochlea only captures the overall spectral envelope.
These facts about human hearing are the inspiration for the Mel-scale triangular filterbank commonly used as part of the sequence of processes for extracting MFCCs from speech.
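Here is a small numerical sketch of that point: placing filter edges evenly on the mel scale and converting back to Hertz shows the spacing (and hence bandwidth) growing with frequency, soon exceeding a typical harmonic spacing of, say, F0 = 120 Hz. The mel formula is the common 2595·log10(1 + f/700) variant; the filter count and the F0 value are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Edges of 26 triangular filters, evenly spaced on the mel scale from 0 to 8 kHz
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 26 + 2))
centres = edges_hz[1:-1]
spacing = np.diff(edges_hz)   # grows with frequency

f0 = 120.0   # an illustrative harmonic spacing
for c, s in zip(centres[::5], spacing[::5]):
    relation = "wider" if s > f0 else "narrower"
    print(f"centre {c:7.1f} Hz   spacing {s:6.1f} Hz   ({relation} than the harmonic spacing)")
```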
The cepstrum is invertible because each operation is invertible (e.g., the Fourier transform). However, MFCCs are not exactly the same as the cepstrum:
Normally, we discard phase and only retain the magnitude spectrum, from which we compute the (magnitude) cepstrum.
If we use a filterbank to warp the spectrum to the Mel scale (which is the normal method in ASR), this is not invertible: information is lost when we sum the energy within each filter.
For ASR, we truncate the cepstrum, retaining only the first (say) 12 coefficients. This is a loss of information.
Questions to think about:
- why do we discard phase?
- why do we use a filterbank?
- why do we truncate the cepstrum?
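To make the lossy steps explicit, here is a minimal sketch of the standard MFCC computation for a single frame, with comments marking exactly where information is discarded. The frame length, number of filters and number of cepstral coefficients are typical values, not the only possible ones.

```python
import numpy as np
from scipy.fft import dct

def mfcc_frame(frame, fs, n_filters=26, n_ceps=12):
    """Compute MFCCs for one windowed frame, marking each lossy step."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    power = np.abs(spectrum) ** 2               # lossy: phase is discarded here
    # Triangular mel filterbank: summing the energy within each filter is lossy
    mel_edges = np.linspace(0.0, 2595.0 * np.log10(1.0 + (fs / 2.0) / 700.0),
                            n_filters + 2)
    hz_edges = 700.0 * (10.0 ** (mel_edges / 2595.0) - 1.0)
    bins = np.floor((len(frame) + 1) * hz_edges / fs).astype(int)
    fbank = np.zeros(n_filters)
    for m in range(1, n_filters + 1):
        lo, centre, hi = bins[m - 1], bins[m], bins[m + 1]
        up = (np.arange(lo, centre) - lo) / max(centre - lo, 1)
        down = (hi - np.arange(centre, hi)) / max(hi - centre, 1)
        fbank[m - 1] = np.dot(np.concatenate([up, down]), power[lo:hi])
    log_fbank = np.log(fbank + 1e-10)
    cepstrum = dct(log_fbank, type=2, norm="ortho")
    return cepstrum[:n_ceps]                    # lossy: the cepstrum is truncated

# A hypothetical 25 ms frame of noise at 16 kHz, just to run the function
print(mfcc_frame(np.random.randn(400), fs=16000))
```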
In this one, very special case, where P(W) is a constant (i.e., it does not depend on W), we could indeed omit it. But in the more general case, where P(W) varies for different W, we must of course compute it.
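In symbols (assuming this refers to the usual Bayes-rule decoding, with O the observation sequence and W a candidate word sequence):

\hat{W} = \arg\max_W P(W \mid O)
        = \arg\max_W \frac{P(O \mid W)\, P(W)}{P(O)}
        = \arg\max_W P(O \mid W)\, P(W)

P(O) can always be dropped because it does not depend on W. P(W) can additionally be dropped only in that special case where it takes the same value for every W (a uniform prior), since multiplying every hypothesis by the same constant does not change which one wins the argmax.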
Simon