in Dan Jurafsky and James H. Martin “Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition”, 2009, Pearson Prentice Hall, Upper Saddle River, N.J., Second edition, ISBN 0135041961
Scanned copy of chapter 9 (University of Edinburgh only)
Ignore the material in 9.4.1 on vector quantisation – it’s rarely used these days.
Forum for discussing this reading
-
October 25, 2016 at 16:08 #5600
Speech recognition
-
November 12, 2017 at 14:18 #8343
Regarding the 13th feature in MFCC, JM2 9.3.6 says that “the energy in a frame is the sum over time of the power of the samples in the frame”.
My understanding of this is that, in the time-domain waveform, the sample amplitudes can be positive or negative, so to stop them cancelling each other out, we take the power.
Is this on the right track? Is this the most common way of representing sound energy?
Also, I have seen some different versions of calculating the energy in question on the Internet. One of them is to take the log of the energy and add it as the 13th feature of the MFCCs (link). Does it matter which way we choose to represent the energy?
-
November 14, 2017 at 22:58 #8346
You’re on the right lines. We couldn’t just average the amplitudes of the speech samples in a frame – as you say, this would come out to a value of about zero. We need to make them all positive first, so we square them. Then we average them (sum and divide by the number of samples). To get back to the original units we then take the square root.
This procedure is so common that it gets a special name: RMS, or Root Mean Square. We’ll then often take the log, to compress the dynamic range.
The variants you are coming across might differ in whether they take the square root or not. That might seem like a major difference, but it’s not. If we’re going to take the log, then taking the square root first doesn’t do anything useful: it will just become a constant multiplier of 0.5.
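If it helps, here is a minimal numpy sketch of that calculation (the frame values are made up for illustration):

import numpy as np

# A made-up frame of waveform samples (amplitudes between -1 and 1)
frame = np.array([0.01, -0.03, 0.12, -0.20, 0.15, -0.05])

power = frame ** 2              # square, so positive and negative samples cannot cancel
mean_power = np.mean(power)     # average over the frame
rms = np.sqrt(mean_power)       # Root Mean Square: back to the original units

# Taking the log compresses the dynamic range; the square root only
# contributes a constant multiplier of 0.5 in the log domain.
assert np.isclose(np.log(rms), 0.5 * np.log(mean_power))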
-
November 8, 2018 at 12:06 #9573
Figure 9.14 in J&M shows a classical cepstrum. Why do we use the Mel-frequency cepstrum rather than this classical cepstrum? We want to isolate the source signal, and it seems to me that the classical cepstrum shown in 9.14 can do just that (the harmonics become a single spike corresponding to the fundamental frequency). Although the Mel filterbank also smooths out the harmonics, doesn’t it make the relationship between the harmonics (which are at equal intervals on the Hertz scale) less straightforward?
-
November 8, 2018 at 16:46 #9575
Let’s separate out several different processes:
The Mel scale is applied to the spectrum (by warping the x axis), before taking the cepstrum. This is effectively just a resampling of the spectrum, providing higher resolution (more parameters) in the lower frequencies and lower resolution at higher frequencies.
Taking the cepstrum of this warped spectrum doesn’t change the cepstral transform’s abilities to separate source and filter. But it does result in using more parameters to specify the shape of the lower, more important, frequency range.
The filterbank does several things all at once, which can be confusing. It is a way of smoothing out the harmonics, leaving only the filter information. By spacing the filter centre frequencies along a Mel scale we can also use it to warp the frequency axis. Finally, it also reduces the dimensionality from the raw FFT (which has 100s or 1000s of dimensions) to maybe 20-30 filters in the filter bank.
Note: J&M’s figure 9.14 is taken from Taylor, and they made a mistake in the caption. See this topic for the correction.
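For those who want to see the filterbank concretely, here is a rough sketch (not the exact HTK or J&M recipe; the number of filters, FFT length and sample rate are arbitrary choices for illustration):

import numpy as np

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=512, sample_rate=16000):
    """Triangular filters with centre frequencies evenly spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = np.linspace(low_mel, high_mel, n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

# Usage: multiply the power spectrum of a frame by each filter and sum,
# reducing hundreds of FFT bins to n_filters smoothed energies.
# filterbank_energies = mel_filterbank() @ power_spectrum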
-
November 11, 2018 at 00:58 #9580
J&M 2nd Edition, in 9.3.3 (page 299), mentions that the fast Fourier transform (FFT) is very efficient but only works for values of N that are powers of 2.
Could you please explain why it only works for values of N that are powers of 2?
-
November 11, 2018 at 19:33 #9586
The Fast Fourier Transform (FFT) is an algorithm that efficiently implements the Discrete Fourier Transform (DFT). Because it is so efficient, we pretty much only ever use the FFT and that’s why you hear me say “FFT” in class when I could use the more general term “DFT”.
The FFT is a divide-and-conquer algorithm. It divides the signal into two parts, solves for the FFT of each part, then joins the solutions together. This is recursive, so each part is then itself divided into two, and so on. Therefore, the length of the signal being analysed has to be a power of 2, to make it evenly divide into two parts recursively.
The details of this algorithm are beyond the scope of the course, but it is a beautiful example of how an elegant algorithm can make computation very fast.
If you really want to learn this algorithm, then refer to the classic textbook:
Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck, Discrete-Time Signal Processing, 2nd edition (Upper Saddle River, NJ: Prentice Hall, 1999)
We would rarely implement an FFT because there are excellent implementations available in standard libraries for all major programming languages. But, you can’t call yourself a proper speech processing engineer until you have implemented the FFT, so add it to your bucket list (after Dynamic Programming and before Expectation-Maximisation)!
You can see my not-fast-enough FFT implementation here followed by a much faster implementation from someone else (which is less easy to understand).
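To give a flavour of the divide-and-conquer idea, here is a minimal recursive radix-2 FFT sketch (not the implementation linked above; it assumes the input length is a power of 2 and will be far slower than a library routine such as numpy.fft.fft):

import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return x
    # Divide: split into even- and odd-indexed samples, solve each half
    even = fft(x[0::2])
    odd = fft(x[1::2])
    # Conquer: combine the two half-length spectra using the twiddle factors
    twiddles = [cmath.exp(-2j * cmath.pi * k / N) * odd[k] for k in range(N // 2)]
    return [even[k] + twiddles[k] for k in range(N // 2)] + \
           [even[k] - twiddles[k] for k in range(N // 2)]

# Example: compare against a library implementation
# import numpy as np; print(np.allclose(fft([1, 2, 3, 4, 5, 6, 7, 8]), np.fft.fft([1, 2, 3, 4, 5, 6, 7, 8])))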
-
-
November 11, 2018 at 01:05 #9581
J&M 2nd Edition, 9.3.4, equation (9.14), says that the mel frequency m can be computed from the raw acoustic frequency as follows:
mel(f) = 11271n(1+ f/700)
Could you please explain what the f and n stand for, respectively?
Also, where do 11271 and 700 come from? Thanks!
-
November 11, 2018 at 19:36 #9587
I’m not sure about the 1127 or 700 specifically, but I think the equation is actually the following:
mel(f) = 1127 ln(1 + f/700)
Where ln is the natural logarithm (log base e), and f is the frequency in Hertz. So it shows us how to convert from “raw frequencies” (frequencies in Hertz which are of units 1/s or s^-1) to mel frequencies.
You’re right that the font of this textbook makes the letter “l” and the digit “1” look exactly the same though!
-
November 11, 2018 at 19:57 #9592
The 1127 and 700 values are determined empirically, to match human hearing.
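As a quick sanity check of the corrected formula (a tiny sketch; the test frequencies are arbitrary):

import math

def mel(f_hz):
    # mel(f) = 1127 ln(1 + f/700), with ln the natural logarithm
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

print(mel(1000))   # roughly 1000 mel: the scale is close to linear below 1 kHz
print(mel(8000))   # roughly 2840 mel: it grows much more slowly than Hertz above that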
-
-
-
November 11, 2018 at 01:13 #9582
J&M 2nd Edition, 9.3.4, equation (9.11)
Could you please elaborate a bit more on the Hamming window? Where do 0.54 and 0.46 come from? Are they fixed numbers, or could they vary depending on design/preference? Thanks!
-
November 11, 2018 at 19:55 #9591
There are many possible window shapes available, of which the Hamming window is perhaps the most popular in digital audio signal processing. To understand why there are so many, we need to understand why there is no such thing as the perfect window: they all involve a compromise.
Let’s start with the simplest window: rectangular. This does not taper at the edges, so the signal will have discontinuities which will lead to artefacts – for example, after taking the FFT. On the plus side, the signal inside the window is exactly equal to the original signal from which it was extracted.
A better option is a tapered window that eliminates the discontinuity problem of the rectangular window, by effectively fading the signal in, and then out again. The problem is that this fading (i.e., changing the amplitude) also changes the frequency content of the signal subtly. To see that, consider fading a sine wave in and out. The result is not a sine wave anymore (e.g., J&M Figure 9.11). Therefore, a windowed sine wave is no longer a pure tone: it must have some other frequencies that were introduced by the fading in and out operation.
So, tapered windows introduce additional frequencies into the signal. Exactly what is introduced will depend on the shape of the window, and hence different people prefer different windows for different applications. But, we are not going to get hung up on the details – it doesn’t make much difference for our applications.
For the spectrum of a voiced speech sound, the main artefact of a tapered window is that the harmonics are not perfect vertical lines, but rather peaks with some width. The diagrams on the Wikipedia page for Window Function may help you understand this. That page also correctly explains where the 0.54 value comes from and why it’s not exactly 0.5 (which would be a simple raised cosine, called the Hann window). Again, these details really don’t matter much for our purposes and are well beyond what is examinable.
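If you want to see the numbers in action, here is a small sketch of a Hamming (and Hann) window for a single frame (the frame length is just an example):

import numpy as np

N = 400                      # e.g. a 25 ms frame at a 16 kHz sample rate
n = np.arange(N)

# Hamming window: a raised cosine using the 0.54 / 0.46 coefficients discussed above
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

# Hann window: the same shape with 0.5 / 0.5, tapering fully to zero at the edges
hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / (N - 1))

# Windowing is just pointwise multiplication with the frame of samples:
# windowed_frame = frame * hamming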
-
-
November 17, 2018 at 10:21 #9598
I’m confused on what is meant by “penalty” in J&M 9.6 (page 315). Does it refer to the value of the LMSF, the value of P(W), or some other concept that is not mathematically represented in equation 9.48?
Furthermore, how can both an LM probability decrease and an LM probability increase cause a larger penalty, as quoted below?
“…if … the language model probability decreases (causing a larger penalty) … if the language model probability increases (larger penalty)…” (315)
And finally, J&M seem to say that both LM probability increases and decreases result in a preference for shorter words:
“…if … the language model probability decreases … the decoder will prefer fewer longer words. If the language model probability increases … the decoder will prefer more shorter words” (315)
Am I misinterpreting this? In other words, are they saying that there is a difference between “fewer longer words” and “more shorter words” that I’m not getting? Or is their point indeed that both increases and decreases in LM probability result in the same thing?
-
November 17, 2018 at 19:51 #9601
J&M don’t do a great job of explaining either the language model scaling factor (LMSF) or the word insertion penalty (WIP), so I’ll explain both.
Let’s start with the LMSF. The real reason that we need to scale the language model probability before combining it with the acoustic model likelihood is much simpler than J&M’s explanation:
- the language model probability really is a probability
- the acoustic model likelihood is not a probability because it’s computed by probability density functions
Remember that a Gaussian probability density function cannot assign a probability to an observation, but only a probability density. If we insisted on getting a true probability, this would have to be for an interval of observation values (J&M figure 9.18). We might describe density as being “proportional” to probability – i.e., a scaled probability.
So, the language model probability and the acoustic model likelihood are on different scales. Simply multiplying them (in practice, adding them in the log domain) assumes they are on the same scale. So, doing some rescaling is perfectly reasonable, and the convention is to multiply the language model log probability by a constant: the LMSF. This value is chosen empirically – for example, to minimise WER on a development set.
Right, on to the Word Insertion Penalty (WIP). J&M attempt a theoretical justification of this, which relies on their explanation of why the LMSF is needed. I’ll go instead for a pragmatic justification:
An automatic speech recognition system makes three types of errors: substitutions, insertions and deletions. All of them affect the WER. We try to minimise substitution errors by training the best acoustic and language models possible. But there is no direct control via either of those models over insertions and deletions. We might find that our system makes a lot of insertion errors, and that will increase WER (potentially above 100%!).
So, we would like to have a control over the insertions and deletions. I’ll explain this control in the Token Passing framework. We subtract a constant amount from a token’s log probability every time it leaves a word end. This amount is the WIP (J&M equation 9.49, except they don’t describe it in the log prob domain). Varying the WIP will trade off between insertions and deletions. You will need to adjust this penalty if you attempt the connected digits part of the digit recogniser exercise because you may find that, without it, your system makes so many insertion errors that WER is indeed greater than 100%.
Finally, to actually answer your question, I think there is a typo in “if the language model probability increases (larger penalty)” where surely they meant “(smaller penalty)”. But to be honest, I find their way of explaining this quite confusing, and it’s not really how ASR system builders think about LMSF or WIP. Rather, these are just a couple of really useful additional system tuning parameters to be found empirically, on development data.
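To make all that concrete, here is a toy sketch (in the spirit of J&M equation 9.49, but in the log domain; the function name and all the numbers are invented for illustration) of how the two tuning parameters enter a hypothesis score:

import math

def hypothesis_score(acoustic_log_likelihood, lm_log_prob, n_words,
                     lmsf=10.0, wip_log=-5.0):
    """Combine acoustic and language model scores in the log domain.

    lmsf scales the LM log probability onto the same range as the
    acoustic log density; wip_log is applied once per word (here as a
    negative value), penalising paths with many words.
    """
    return acoustic_log_likelihood + lmsf * lm_log_prob + n_words * wip_log

# Invented numbers: a path through many short words vs a few long words
many_short = hypothesis_score(-1200.0, math.log(1e-6), n_words=8)
few_long   = hypothesis_score(-1210.0, math.log(1e-5), n_words=4)
print(many_short, few_long)   # tune lmsf and wip_log on development data to minimise WER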
-
November 13, 2019 at 09:45 #10212
Hi,
I have a question regarding how the first 12 MFCC coefficients are extracted from the cepstrum. J&M say in Ch. 9.3.5 that “we generally just take the first 12 cepstral values”. Does that mean we take the y-axis value for the first 12 values on the x-axis, or are we somehow extracting some features from the cepstrum?
Many thanks,
Yichao
-
November 13, 2019 at 10:04 #10213
Yes, it’s as simple as taking the first 12 values starting from the left on the horizontal axis. We call this operation “truncation”.
They are ordered (the lower ones capture broad spectral shape, higher ones capture finer and finer details), so this is theoretically well-motivated.
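As a minimal sketch (assuming scipy is available; the filterbank energies are random numbers standing in for one frame):

import numpy as np
from scipy.fftpack import dct

# Stand-in log mel filterbank energies for one frame (e.g. 26 filters)
log_mel_energies = np.random.randn(26)

# The cepstrum here is a DCT of the log filterbank energies; "truncation" just
# means keeping the first 12 coefficients, which describe the broad spectral shape
cepstrum = dct(log_mel_energies, type=2, norm='ortho')
mfcc = cepstrum[:12]
print(mfcc.shape)   # (12,)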
-
December 5, 2023 at 21:32 #17325
Hi,
I have 2 questions regarding the language model scaling factor (LMSF) and the word insertion penalty (WIP).
1.
In J&M 9.6, page 315, the authors mention: “This factor (LMSF) is an exponent on the language model probability P(W). Because P(W) is less than one and the LMSF is greater than one (between 5 and 15, in many systems), this has the effect of decreasing the value of the LM probability.”
So what I understand is that common values of the LMSF (5-15) decrease the LM probability. However, on the same page, the authors mention:
“Thus if (on average) the language model probability decreases (causing a larger penalty), the decoder will prefer fewer, longer words…Thus our use of a LMSF to balance the acoustic model has the side-effect of decreasing the word insertion penalty.”
I’m confused by these 2 quotations: in the first, the LMSF is used to decrease the LM probability, but the second first says that the decreased LM probability causes a larger penalty, and then that the LMSF has the side-effect of decreasing the penalty. Isn’t this a contradiction?
Still, for the second quotation, I understand why the decoder prefers fewer words, but I can’t see why it prefers longer words.
2.
In the HTK manual (version 3.4), page 43, it says that for HVite, “The options -p and -s set the word insertion penalty and the grammar scale factor, respectively.” Does -s (grammar scale factor) refer to the LMSF in J&M 9.6?
Does -p refer exactly to “WIP” in J&M 9.6, page 316, formula (9.50), or does it actually mean “N×logWIP”?
Thank you for your help.
-
December 6, 2023 at 09:59 #17327
When J&M say
Thus, if (on average) the language model probability decreases…
they are talking about the probability decreasing as the sentence length increases, since more and more word probabilities will be multiplied together.
Their explanation of the LMSF is rather long-winded. There is a much simpler and better explanation for why we need to scale the language model probability when combining it with the acoustic model likelihood. In equation 9.48, P(O|W) implies that the acoustic model calculates a probability mass. It generally does not!
If the acoustic model uses Gaussian probability density functions, it cannot compute probability mass. It can only compute a probability density. Density is proportional to the probability mass in a small region around the observation O. The constant of proportionality is unknown.
Since we always work in the log probability domain, equation 9.48 involves a sum of two log probabilities.
The acoustic model will compute quantities on a different scale to the language model. We need to account for the unknown constant of proportionality by scaling one or other of them in this sum. The convention is to scale the language model log probability, hence the LMSF. We typically find a good value for the LMSF empirically (e.g., by minimising the Word Error Rate on some held-out data).
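Here is a tiny illustration (made-up numbers) of why a density is not a probability – its value can even exceed 1:

import math

def gaussian_pdf(x, mean, stdev):
    return math.exp(-0.5 * ((x - mean) / stdev) ** 2) / (stdev * math.sqrt(2.0 * math.pi))

# With a small standard deviation, the density at the mean is well above 1,
# which would be impossible for a probability mass. This is why the acoustic
# model score is on a different (unknown) scale to the language model probability.
print(gaussian_pdf(0.0, 0.0, 0.1))   # about 3.99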
-
December 6, 2023 at 11:43 #17330
Thank you, but I still can’t get past the apparent contradiction between the 2 quotations:
in the 1st quotation: the LMSF lowers the LM probability (which means a larger penalty)
in the 2nd quotation: the LMSF has the side-effect of decreasing the penalty.
So how can the LMSF both increase and decrease the penalty?
Additionally, I understand why the decoder prefers fewer words when LM probability is low, but I can’t see why it prefers longer words.
-
December 6, 2023 at 12:27 #17335
The number of observations in the observation sequence is fixed, and they all have to be generated by the model (i.e., the compiled-together language model and acoustic model).
There are many possible paths through the model that could generate this observation sequence. Some paths will pass through mostly short words, each of which generates a short sequence of observations (because short words tend to have short durations when spoken). Other paths pass through long words, each of which will typically generate a longer sequence of observations.
So, to generate the fixed-length observation sequence, the model might take a path through many short words, or through a few long words, or something in-between.
Paths through many short words are likely to contain insertion errors. Paths through a few long words are likely to contain deletion errors. The path with the lowest WER is likely to be a compromise between the two: we need some way to control that, which is what the WIP provides.
Again, J&M’s explanation of the LMSF is not the best, so don’t get lost in their explanations of the interaction between LMSF and WIP.
In summary:
- The LMSF is required because the language model computes probability mass, whilst the acoustic model computes probability density.
- The WIP enables us to trade off insertion errors against deletion errors.
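To see that trade-off concretely, here is a toy calculation (all numbers invented) showing how sweeping the word insertion penalty shifts the decoder’s preference from a many-short-words path towards a few-long-words path:

# Invented log scores for two competing paths over the same observation sequence
paths = {
    "many short words": {"log_score": -1330.0, "n_words": 9},
    "few long words":   {"log_score": -1345.0, "n_words": 4},
}

for wip_log in (0.0, -2.0, -5.0):   # more negative = heavier penalty per word
    best = max(paths, key=lambda p: paths[p]["log_score"] + paths[p]["n_words"] * wip_log)
    print(f"wip_log = {wip_log}: decoder prefers the path through {best}")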
-
-
-
December 6, 2023 at 10:08 #17328
The HTK manual says
The grammar scale factor is the amount by which the language model probability is scaled before being added to each token as it transits from the end of one word to the start of the next
but of course, they mean “language model log probability”, and when they say
The word insertion penalty is a fixed value added to each token
they mean “added to the log probability of each token” (the same applies to the previous point too).
The HTK term “penalty” is potentially misleading, since in their implementation the value is added, not subtracted. Conceptually there is no difference and it doesn’t really matter: we can just experiment with positive and negative values to find a value that minimises the WER on some held-out data.
The implementation in HTK is consistent with J&M equation 9.50.
-
December 6, 2023 at 11:41 #17329
The word insertion penalty is a fixed value added to each token
For “fixed”, does that mean that “-p” is equal to “logWIP” in formula 9.50?
For “each token”, is the number of tokens equal to N (the N in formula 9.50, i.e. the number of words in the sentence)? And is that why formula 9.50 uses N×logWIP, because the penalty for each token (word) is added up over the whole word sequence?
-
-
December 6, 2023 at 12:13 #17332
“Fixed” just means that it is a constant value.
The word insertion penalty, which is a log probability, is “logWIP” in J&M equation 9.50. It is added to the partial path log probability (e.g., the token log probability in a token passing implementation) once for each word in that partial path, which is why it is multiplied by N in the equation.
-
-