linear prediction
October 24, 2016 at 21:03 #5586
In the process of pitch-synchronous synthesis, at the estimation stage, we could figure out one coefficient for each pitch period. Another way is to update the coefficients at a fixed frame rate, so that the artefact could hide behind F0. How is this process implemented?
What is the excitation signal?
As for the inverse-filtering part, the residual is just a signal that goes with the speech signal; that is, the residual and filter go together. So we should not manipulate the filter, because it might deviate from the perfect match. BUT we have to continuously change the filter coefficients so as to get the residuals? Are these two requirements contradictory?
October 25, 2016 at 13:09 #5588
Let’s clarify your understanding. The most important thing to say first is:
we must separate the description of the analysis and synthesis parts
You said that we calculate “one coefficient for each pitch period” – no, we calculate the complete set of filter coefficients (there might be 16 of them, say) for each analysis frame.
We have choices about the analysis frame: it might be a pitch period, but that would require pitch marking the signal, which is error-prone. So, let’s just consider the simple case where the analysis frame is of fixed duration (25ms, say) and the frames are spaced at fixed times (every 10ms, say).
After calculating the filter coefficients, we inverse filter the frame of speech signal. This gives us the residual signal for the current analysis frame. The residual is a waveform. If we use this residual signal to excite the filter (i.e., as the excitation signal), we will get near-perfect reconstruction of the frame of speech being analysed.
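To make this concrete, here is a minimal Python sketch of analysis, inverse filtering and reconstruction for a single frame, assuming numpy, librosa and scipy are available; the frame below is just random noise standing in for a real 25ms frame of speech.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

fs = 16000
order = 16                                    # e.g. 16 filter coefficients, as above
# Stand-in for one 25ms analysis frame of real speech
frame = np.random.randn(int(0.025 * fs))

# Analysis: estimate the complete set of LP coefficients for this frame
a = librosa.lpc(frame, order=order)           # a[0] == 1.0

# Inverse filtering: pass the speech through A(z) to obtain the residual waveform
residual = lfilter(a, [1.0], frame)

# Resynthesis check: exciting the all-pole filter 1/A(z) with this residual
# reconstructs the original frame (near-)perfectly
reconstructed = lfilter([1.0], a, residual)
print(np.max(np.abs(frame - reconstructed)))  # effectively zero
```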
We store the filter coefficients and the residual waveform together. They are “matched”: only the combination of residual and filter from the same analysis frame will give near-perfect speech output. If we “mix and match” across different analysis frames, we will not get such good reconstruction.
You have correctly understood that we “should not manipulate the filter, because it might deviate from the perfect match”. That is true. So, we will only manipulate the filter by small amounts (for join smoothing), to avoid too much mismatch. We may also manipulate the residual using overlap-and-add (to modify F0) – this will also create some amount of mismatch. So, again, we will limit the amount of manipulation to limit the severity of the mismatch.
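A small sketch of why the pairing matters, under the same assumptions (random-noise stand-ins for two analysis frames): driving frame A’s filter with frame A’s residual reconstructs frame A almost exactly, whereas driving a different frame’s filter with that same residual does not. With real speech, the mismatch is far more severe than with noise.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

fs, order = 16000, 16
# Two stand-in "analysis frames"; in practice these would be frames of real speech
frame_a = np.random.randn(int(0.025 * fs))
frame_b = np.random.randn(int(0.025 * fs))

a_coeffs = librosa.lpc(frame_a, order=order)
b_coeffs = librosa.lpc(frame_b, order=order)
residual_a = lfilter(a_coeffs, [1.0], frame_a)    # inverse filter frame A

# Matched pair: frame A's residual exciting frame A's filter -> near-perfect
matched = lfilter([1.0], a_coeffs, residual_a)
# Mismatched pair: the same residual exciting frame B's filter -> reconstruction error
mismatched = lfilter([1.0], b_coeffs, residual_a)

print("matched error    :", np.max(np.abs(frame_a - matched)))
print("mismatched error :", np.max(np.abs(frame_a - mismatched)))
```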
Now on to the synthesis stage, which happens every time we use the TTS system to say a sentence…
Here, we have choices about the resynthesis frame. It could be as simple as the fixed analysis frame from above. This will work, but because the filter coefficients are updated at a fixed rate (every 10ms, which is 100 times per second) we may hear an artefact: a constant 100Hz “buzz”.
We can’t avoid updating the filter, but we can be clever about the rate at which we do it. If we update not every 10ms but once every pitch period, then we will create an artefact not at 100Hz but at F0. Since the listener will perceive F0 anyway (in voiced speech), we can “hide” the artefact “behind” the natural F0.
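As a sketch of that idea (the function name and its inputs are illustrative assumptions: a stored residual, one row of coefficients per fixed-rate analysis frame, and pitch-mark positions; they are not part of any particular toolkit), the coefficients are switched only at pitch marks rather than every 10ms. A real implementation would need to handle the filter state across those switches more carefully than this does.

```python
import numpy as np
from scipy.signal import lfilter

def resynthesise(residual, coeff_frames, epochs, fs=16000, hop_ms=10):
    """Excite the LP filter with the stored residual, switching the
    coefficients once per pitch period (at the given epochs) instead of
    at the fixed 10ms analysis rate, so the update artefact sits at F0.

    residual     : 1-D array, the stored residual waveform
    coeff_frames : array of shape (n_frames, order + 1), one row of LP
                   coefficients per fixed-rate analysis frame
    epochs       : sample indices of the pitch marks
    """
    hop = int(hop_ms * fs / 1000)
    out = np.zeros_like(residual, dtype=float)
    zi = np.zeros(coeff_frames.shape[1] - 1)          # filter state carried over
    boundaries = np.unique(np.concatenate(([0], np.asarray(epochs), [len(residual)])))
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        # Use the coefficients of the analysis frame covering this pitch period
        a = coeff_frames[min(start // hop, len(coeff_frames) - 1)]
        out[start:end], zi = lfilter([1.0], a, residual[start:end], zi=zi)
    return out
```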