Hybrid model: where does the improvement come from?
In the hybrid approach to TTS, we use speech parameters (vocoder parameters) to select the candidate units. But I do not understand where the improvement comes from.
I have two hypotheses:
1. Speech parameters are better than linguistic features, leading to a better target cost function.
2. Vocoders are still not good enough to reconstruct very natural speech.
I tend towards the first view. I tried the WORLD vocoder, and it reconstructed my voice perfectly: I could not hear any difference between the original waveform and the reconstructed one.
Both your hypotheses are reasonable.
Hypothesis 1 simply states that an Acoustic Space Formulation (ASF) target cost function is superior to an Independent Feature Formulation (IFF) one. That will be true if our predictions of speech parameters are sufficiently accurate. The reason that measuring the target-to-candidate-unit distance in acoustic space is better than in linguistic feature space is sparsity: linguistic features are high-dimensional and mostly categorical, so most target feature combinations have few or no exact matches in the database, and mismatch-counting costs discriminate poorly between candidates. Acoustic distances are continuous and graded, so every candidate gets an informative cost. See Figure 16.6 in Taylor’s book, or the video on ASF target cost functions.
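To make the contrast concrete, here is a minimal sketch of the two kinds of target cost. All names, dimensions, and values are illustrative assumptions, not any particular system's implementation: the ASF cost is a plain Euclidean distance between a predicted acoustic vector and each candidate's vector, while the IFF cost counts mismatched categorical linguistic features.

```python
import numpy as np

# Hypothetical 5-dimensional acoustic vectors (a few vocoder-style
# parameters for one frame); purely illustrative values.
predicted_target = np.array([1.0, 0.5, -0.2, 0.0, 0.3])
candidates = {
    "unit_a": np.array([1.1, 0.4, -0.1, 0.1, 0.2]),
    "unit_b": np.array([3.0, -1.0, 0.8, 0.9, -0.5]),
}

def asf_target_cost(target, candidate):
    """Distance in acoustic space: continuous and graded, so every
    candidate receives an informative, discriminative cost."""
    return float(np.linalg.norm(target - candidate))

def iff_target_cost(target_feats, candidate_feats):
    """Mismatch count over categorical linguistic features: with many
    features, most candidates mismatch somewhere (sparsity), so the
    cost often fails to separate good candidates from bad ones."""
    return sum(t != c for t, c in zip(target_feats, candidate_feats))

costs = {name: asf_target_cost(predicted_target, c)
         for name, c in candidates.items()}
best = min(costs, key=costs.get)
print(best)  # unit_a: acoustically closer to the predicted target

# Under IFF, two quite different candidates can tie on mismatch count:
target_feats = ("vowel", "stressed", "phrase_final")
print(iff_target_cost(target_feats, ("vowel", "unstressed", "phrase_medial")))
print(iff_target_cost(target_feats, ("nasal", "stressed", "phrase_medial")))
```

Note the last two calls both return 2, even though one candidate is at least the right phone class: this is the kind of ranking information a binary mismatch cost throws away.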
Hypothesis 2 is currently true much of the time, although improvements are being made steadily. It is just now becoming possible to construct commercial-quality TTS systems that use a vocoder, rather than waveform concatenation.
It’s worth reminding ourselves that an ASF target cost function does not need to use vocoder parameters as such, because we do not need to be able to reconstruct the waveform from them. We could choose to use a simpler parameterisation of speech (e.g., ASR-style MFCCs derived using a filterbank) if we wished.
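As a sketch of such a simpler parameterisation: the snippet below computes log mel-filterbank energies for one frame using only numpy (the filterbank construction and all parameter choices here are my own illustrative assumptions, not taken from any specific toolkit). Taking a DCT of these log energies would give ASR-style MFCCs. The point is that this mapping discards phase and fine spectral detail, so the waveform cannot be reconstructed from it, which is fine for a target cost.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

sr, n_fft = 16000, 512
# One windowed frame of a 200 Hz sinusoid stands in for real speech.
frame = np.hanning(n_fft) * np.sin(2 * np.pi * 200 * np.arange(n_fft) / sr)
power = np.abs(np.fft.rfft(frame)) ** 2
log_mel = np.log(mel_filterbank(24, n_fft, sr) @ power + 1e-10)
print(log_mel.shape)  # (24,): 512 samples reduced to 24 numbers
```

The reduction from 512 samples to 24 energies is exactly why this representation is non-invertible, and exactly why that does not matter for measuring target-to-candidate distances.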