Taylor – Chapter 12
October 30, 2016 at 09:41 #5693
Analysis of speech signals
November 7, 2018 at 12:09 #9567
I did the Jurafsky and Martin reading (section 9.3) before this one, and it reproduces a diagram from this chapter (page 334, Figure 9.14).
However, in this reading the magnitude spectrum and log magnitude spectrum appear to be swapped (page 354, Figure 12.11; compare 12.11a and 12.11b with 9.14a and 9.14b from the previous reading).
So which one is correct and which one is wrong? Which one is the plain magnitude spectrum and which is its logged version?
Thank you!
November 7, 2018 at 18:33 #9572
The diagram in Taylor is correct.
You can work this out yourself from first principles: taking the log will compress the vertical range of the spectrum, bringing the very low amplitude components up so we can see them, and bringing the high amplitudes (the harmonics, in this case) down.
J&M messed up when they quoted it – a lesson in not quoting something unless you really understand it, perhaps!? Or maybe a printer’s error.
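If it helps to see the effect numerically, here is a minimal sketch (Python with NumPy; the synthetic frame, sampling rate and F0 are made up for illustration, not taken from either book) showing how taking the log compresses the range of a magnitude spectrum:

import numpy as np

fs = 16000                         # assumed sampling rate
t = np.arange(0, 0.032, 1.0 / fs)  # one 32 ms analysis frame
f0 = 120.0                         # assumed fundamental frequency
# crude "voiced" frame: a few harmonics of decreasing amplitude, plus a little noise
x = sum((0.5 ** k) * np.sin(2 * np.pi * k * f0 * t) for k in range(1, 6))
x = x + 0.001 * np.random.randn(len(t))

X = np.fft.rfft(x * np.hamming(len(x)))
magnitude = np.abs(X)                             # linear magnitude spectrum
log_magnitude = 20 * np.log10(magnitude + 1e-12)  # log magnitude spectrum, in dB

# On the linear scale the strongest harmonics dominate and the weak components
# are invisible; on the log scale the range is compressed and they become visible.
print(magnitude.max() / (magnitude.mean() + 1e-12))    # large ratio on the linear scale
print(log_magnitude.max() - log_magnitude.mean())      # much smaller spread in dB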
November 9, 2019 at 14:47 #10175
On p.355, Taylor mentions r[n], which refers to radiation. In Chapter 11 he describes this as “radiation impedance from the lips”. Why is this separated out, rather than considered part of the vocal-tract filter?
November 9, 2019 at 17:29 #10184
Taylor is separating it out here because he is trying to show how the equations align with the physics of sound propagation in the vocal tract.
Lip radiation can be assumed to be a constant effect: effectively, a filter that boosts high frequencies. This filtering effect is independent of the configuration of the articulators (Taylor, 2009, equation 11.29).
Furthermore, the constant high-pass filtering effect of lip radiation is more than cancelled out by another constant effect of low-pass filtering at the sound source:
It is this, combined with the radiation effect, that gives all speech spectra their characteristic spectral slope.
(Taylor, 2009, page 332)
So, we don’t need any learnable model parameters for these effects. We can account for them either by absorbing this constant effect into the vocal tract filter (which might be modelled using linear prediction) or by pre-emphasising the signal in the time domain (Taylor, 2009, page 375) to make its spectrum flatter, before any subsequent modelling, processing or feature extraction.
Pre-emphasis is standard practice in most speech processing – can you find where this is done in the digit recogniser?
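For anyone who wants to see what pre-emphasis actually does, here is a minimal sketch (Python with NumPy) of the usual first-order high-pass filter y[n] = x[n] - alpha * x[n-1]; the coefficient 0.97 is just a commonly used value, not necessarily what the digit recogniser uses:

import numpy as np

def pre_emphasise(x, alpha=0.97):
    # first-order high-pass filter: y[n] = x[n] - alpha * x[n-1]
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - alpha * x[:-1]
    return y

# usage: apply to the waveform before windowing and feature extraction
fs = 16000
t = np.arange(0, 0.1, 1.0 / fs)
# a strong low-frequency component plus a weak high-frequency one
x = np.sin(2 * np.pi * 100 * t) + 0.1 * np.sin(2 * np.pi * 4000 * t)
y = pre_emphasise(x)

# after pre-emphasis the spectral slope is flatter: the 100 Hz peak is
# attenuated relative to the 4000 Hz peak
X = np.abs(np.fft.rfft(x))
Y = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
print(X[np.argmin(np.abs(freqs - 100))] / X[np.argmin(np.abs(freqs - 4000))])
print(Y[np.argmin(np.abs(freqs - 100))] / Y[np.argmin(np.abs(freqs - 4000))])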