Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech

Written for a non-technical audience, this gently introduces some key concepts in speech signal processing.

In William J. Hardcastle, John Laver, Fiona E. Gibbon (editors), “The Handbook of Phonetic Sciences”, in the series Blackwell Handbooks in Linguistics, 2010, Wiley-Blackwell, Chichester (England), second edition, ISBN 1405145900, 9781405145909

The key sections to read are ‘Resonance’ and ‘Sinusoids’.

The author has made a preprint available online, although its section numbering differs from the book. If you post questions on the forum about this reading, please refer to the section numbering from the published book version. The Handbook of Phonetic Sciences is available as an ebook from the University of Edinburgh library.

Forum for discussing this reading

- October 5, 2016 at 15:34 #5184
  Simon King
  Professor
  Discuss it here
- September 27, 2018 at 17:47 #9381
  Riqiang
  Student
  Hello everyone! We probably don’t need to know the details of the Fourier transform, but I’m still curious how the scale constant can be calculated by “taking the inner product between the waveform and the harmonic”. How is it done? It seems that the inner product of two waveforms changes according to their relative phase, so how can the scale constant be calculated before the phase is known?
- October 4, 2018 at 12:05 #9397
  Simon King
  Professor
  The inner product between two signals is calculated by multiplying the corresponding samples (one from each signal) and summing up those values.
  
  Intuitively, think of this as a measure of how similar the two signals are. If they are similar, then the inner product will have a high value. If they are very different, it will have a low value. So, we can understand the Fourier transform in this intuitive way:
  
  Take the signal we want to analyse.
  
  Create a sine wave of a particular frequency, and take the inner product between this and our signal. The resulting value is “how much of that frequency is present in our signal”. Plot that result, as a dot on a chart with frequency along the horizontal axis, and “how much” (i.e., magnitude) on the vertical axis.
  
  Repeat for a range of frequencies. Join the dots. The final plot is the spectrum.
  
  Fourier theory will tell us exactly what frequencies of sine waves we need to use, in order to perfectly characterise the signal (i.e., for the spectrum and the signal to contain exactly the same information, and thus to be able to make one from the other, in either direction).
  
  Now on to phase: this is the relative offset (i.e., shift in time) between the sine wave and the signal, before we take the inner product. You correctly state that phase is important. Luckily, Fourier analysis not only computes the magnitudes of frequency components present in our signal, it also computes the phase that each sine wave needs to be at, so that when we sum those sine waves together we reconstruct our signal exactly.
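The steps above can be sketched in Python. This is a minimal illustration, not the book's procedure: the sampling rate, frame length, test signal and the function name `analyse` are all assumptions chosen for the example. It uses one standard way of handling phase, which is to take two inner products per frequency, one with a cosine and one with a sine. Together they give a magnitude that does not depend on the phase, and the phase itself.

```python
import numpy as np

fs = 16000   # sampling rate in Hz (assumed value)
N = 400      # samples in the analysis frame (assumed value; 25 ms at 16 kHz)
t = np.arange(N) / fs

# A made-up test signal: a 200 Hz component of amplitude 0.8, shifted in
# phase by 0.6 radians, plus a weaker 1000 Hz component.
signal = (0.8 * np.cos(2 * np.pi * 200 * t + 0.6)
          + 0.3 * np.cos(2 * np.pi * 1000 * t))


def analyse(signal, freq, fs):
    """Return (magnitude, phase) of one frequency using two inner products."""
    n = np.arange(len(signal))
    cos_wave = np.cos(2 * np.pi * freq * n / fs)
    sin_wave = np.sin(2 * np.pi * freq * n / fs)
    # Inner product: multiply corresponding samples, then sum.
    c = np.sum(signal * cos_wave)
    s = np.sum(signal * sin_wave)
    # Combining both inner products removes the dependence on phase.
    magnitude = 2 * np.hypot(c, s) / len(signal)
    phase = np.arctan2(-s, c)
    return magnitude, phase


# "Repeat for a range of frequencies": here, multiples of fs / N = 40 Hz,
# so that every analysis frequency fits a whole number of cycles in the frame.
freqs = np.arange(40, 2000, 40)
spectrum = [analyse(signal, f, fs)[0] for f in freqs]

mag_200, phase_200 = analyse(signal, 200, fs)
print(mag_200, phase_200)
```

The choice of analysis frequencies matters: in this sketch they are whole multiples of `fs / N`, which is the set of frequencies that Fourier theory prescribes for a frame of `N` samples.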
- October 4, 2018 at 12:07 #9398
  Riqiang
  Student
  When talking about the limits of Linear Prediction, Ellis talked about poles and zeros. What do they mean in relation to speech?
- October 4, 2018 at 12:19 #9400
  Simon King
  Professor
  Poles and zeros are properties of a filter. They correspond to the physical properties of resonance and anti-resonance.
  
  It is common to model the vocal tract as an all-pole filter: something with only resonances. The most common all-pole filter used is Linear Prediction.
  
  The relationship between our model (i.e., filter) parameters and the vocal tract shape is not trivial, because our model is such a simplistic approximation of the true vocal tract. So, for example, we wouldn’t normally use pole frequencies as features for Automatic Speech Recognition (although in the early days, features like that were widely used).
  
  But, for conceptual understanding, we can say that the poles of a Linear Prediction filter correspond to resonant frequencies of the vocal tract, which we call formants. (Poles occur in pairs, and there will be two poles per formant).
  
  To do formant tracking, we could fit an all-pole filter to a speech signal, and use the poles to identify the formants.
  
  [This level of detail is beyond the scope of Speech Processing. These concepts are still important, and will become more relevant in Speech Synthesis.]
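To make the link between poles and formants concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the sampling rate, the two resonance frequencies and bandwidths, and the helper name `pole_pair`. It builds a hypothetical all-pole filter by hand rather than fitting one to real speech, then recovers the resonant frequencies from the angles of the poles.

```python
import numpy as np

fs = 16000  # sampling rate in Hz (assumed value)


def pole_pair(freq, bandwidth, fs):
    """Coefficients [1, a1, a2] of a filter denominator with one resonance.

    One resonance corresponds to a complex-conjugate pair of poles.
    """
    r = np.exp(-np.pi * bandwidth / fs)   # pole radius, sets the bandwidth
    theta = 2 * np.pi * freq / fs         # pole angle, sets the frequency
    return np.array([1.0, -2 * r * np.cos(theta), r * r])


# Hypothetical all-pole filter with resonances at 500 Hz and 1500 Hz.
# Multiplying the two second-order polynomials gives a fourth-order filter.
denominator = np.convolve(pole_pair(500, 80, fs), pole_pair(1500, 120, fs))

# The poles are the roots of the denominator polynomial.
poles = np.roots(denominator)

# Poles occur in conjugate pairs; keep the one with positive angle from each.
upper = poles[np.imag(poles) > 0]
formants = np.sort(np.angle(upper) * fs / (2 * np.pi))

print(formants)  # estimated resonant frequencies in Hz
```

With real speech, the denominator coefficients would come from a linear prediction analysis of a short frame, and some of the resulting poles would not correspond to formants, so a practical formant tracker needs further selection steps that are not shown here.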
- October 12, 2018 at 19:43 #9414
  Alexandra T
  Student
  In revising this reading, I’m struggling to comprehend Linear Prediction and how it works. As I understand it, LP modeling allows us to generate artificial copies of spectra. Is this correct? Has anyone come across any alternate resources and/or visualizations that describe the process?
- October 13, 2018 at 18:41 #9419
  Simon King
  Professor
  Linear Prediction refers to a specific form of filter used in a source-filter model. A linear predictive filter is very simple: it predicts each speech sample (the filter’s output) as a weighted sum of the previous few speech (i.e., filter output) samples. The weights are called the filter coefficients.
  
  Such a filter has only resonances (technically called “poles”), and no anti-resonances (“zeros”). It can be used as a simple model of the vocal tract. We need to excite the filter with an input signal, such as a pulse train. The output generated will be a synthetic speech waveform.
  
  The frequency response of the filter corresponds to the spectral envelope of the generated speech.
  
  It’s important to realise that, when we model speech with a simple source-filter model, such as linear prediction, we are only really modelling properties of the signal. We are not directly modelling the vocal tract in any realistic sense.
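The description above can be sketched in a few lines of Python. This is an illustration under assumed values, not a fitted model: the two filter coefficients are hypothetical and chosen to give a single resonance near 500 Hz, and the function name `lp_synthesise` is made up for the example. The filter is excited with a pulse train, and each output sample is the current input plus a weighted sum of previous output samples.

```python
import numpy as np


def lp_synthesise(excitation, coeffs):
    """Run an excitation signal through a linear predictive (all-pole) filter."""
    order = len(coeffs)
    output = np.zeros(len(excitation))
    for n in range(len(excitation)):
        # Weighted sum of the previous few output samples.
        prediction = 0.0
        for k in range(1, order + 1):
            if n - k >= 0:
                prediction += coeffs[k - 1] * output[n - k]
        # The excitation supplies whatever the prediction cannot.
        output[n] = excitation[n] + prediction
    return output


fs = 16000  # sampling rate in Hz (assumed value)

# Hypothetical filter coefficients: one resonance near 500 Hz.
r, theta = 0.98, 2 * np.pi * 500 / fs
coeffs = [2 * r * np.cos(theta), -r * r]

# Source: a pulse train with one pulse every 10 ms (a 100 Hz fundamental).
excitation = np.zeros(fs // 10)      # 0.1 seconds of signal
excitation[::fs // 100] = 1.0

speech_like = lp_synthesise(excitation, coeffs)
```

A real vocal tract model would use more coefficients (giving several resonances) estimated from recorded speech. Changing the spacing of the pulses changes the pitch of the output without changing the filter, which is the separation that the source-filter model relies on.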
- October 1, 2020 at 10:54 #12158
  Nicole M
  Student
  Hi,
  
  in the Section on Sinusoids on page 761, it is explained that “resonant behavior always involves such an exchange between two energy forms”.
  So there are two major aspects of resonant behavior – periodic transfer of energy between two forms & exponential decay. In the domain of speech, exponential decay makes sense to me because it is also illustrated in the book in Fig. 20.4.
  
  However, I don’t understand which two energy forms are exchanged in the vocal tract as a resonance system?
  
  I just want to better understand the analogy between “swinging on a swing and the corresponding exchange between kinetic and potential energy” and producing speech.
- October 2, 2020 at 12:34 #12169
  Simon King
  Professor
  The two forms of energy exchanged are the same as in the swing.
  
  A swing has maximum potential energy at its highest point, when it is momentarily stationary (= no kinetic energy). It has maximum kinetic energy at its lowest point, when it is moving fastest.
  
  In air resonating in a tube, the same two forms of energy are exchanged. Potential energy is air under increased pressure, and kinetic energy is air moving at maximum velocity.
  
  Picture the air molecules in the tube: they alternate between being “all bunched together and stationary” (maximum potential energy) and being “evenly spaced and moving quickly” (maximum kinetic energy).
  
  To understand why pressure is potential energy, imagine a cylinder storing gas: it contains high pressure gas. Open the valve and this is converted into kinetic energy as the gas rushes out at high speed. Potential energy has been converted to (= exchanged for) kinetic energy. The total amount of energy is conserved.
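The exchange can be illustrated numerically with an idealised, lossless oscillator. This is a sketch under loud assumptions: a mass on a spring stands in for the swing or the air in the tube, the mass and stiffness values are arbitrary, and there is no damping, so the exponential decay discussed in the chapter is deliberately left out.

```python
import numpy as np

mass = 1.0        # arbitrary units (assumed value)
stiffness = 4.0   # arbitrary units (assumed value)
omega = np.sqrt(stiffness / mass)   # natural frequency in radians per second

# Nine time points covering exactly one cycle of the oscillation.
t = np.linspace(0, 2 * np.pi / omega, 9)
displacement = np.cos(omega * t)
velocity = -omega * np.sin(omega * t)

potential = 0.5 * stiffness * displacement ** 2   # largest when stationary
kinetic = 0.5 * mass * velocity ** 2              # largest when moving fastest
total = potential + kinetic                       # conserved without losses

for ti, pe, ke in zip(t, potential, kinetic):
    print(f"t={ti:.3f}  potential={pe:.3f}  kinetic={ke:.3f}")
```

At the start of the cycle all the energy is potential, a quarter of a cycle later it is all kinetic, and the sum stays the same throughout.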