Forum Replies Created
The language model is not quite the same as “all the HMMs connected together”.
The language model, on its own, is a generative model that generates (or if you prefer emits) words.
The language model and the acoustic models (the HMMs of words) are combined – we usually say compiled – into a single network. Some arcs in this recognition network come from the language model, others come from the acoustic models.
We can only compile the language model and acoustic models if they are finite state. Any N-gram language model can be written as a finite state network. That’s the main reason that we use N-grams (rather than some other sort of language model).
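As a tiny illustration (the words and probabilities here are made up, and the variable names are just for this sketch), a bigram language model can be turned into a finite-state network with one state per single-word history and one arc per bigram:

import math

# Hypothetical bigram probabilities, just for illustration
bigram_probs = {
    ("<s>", "the"): 0.6, ("<s>", "a"): 0.4,
    ("the", "cat"): 0.5, ("the", "dog"): 0.5,
    ("a", "cat"): 0.7, ("a", "dog"): 0.3,
    ("cat", "</s>"): 1.0, ("dog", "</s>"): 1.0,
}

# Finite-state network: one state per single-word history;
# each arc leaving a state is labelled with the next word and its log probability
network = {}
for (history, word), p in bigram_probs.items():
    network.setdefault(history, []).append((word, math.log(p)))

print(network["the"])   # the arcs leaving the state for the history "the"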
The missing component in your explanation is the language model. This is what connects the individual word models into a single network (like the one in the “Token Passing game” we played in class).
The language model and all the individual HMMs of words are compiled together into a single network. This recognition network is also an HMM, just with a more complicated topology than the individual word HMMs.
Because the recognition network is just an HMM, we can perform Token Passing to find the most likely path through it that generated the given observation sequence.
The tokens will each keep a record of which word models they pass through. Then, we can read this record from the winning token to find the most likely word sequence.
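To make that concrete, here is a minimal Token Passing sketch in Python (an illustration only, not HTK’s implementation; the names token_passing, arcs and emission_loglik are made up). It assumes the compiled recognition network is given as a list of arcs, with a word label on each arc that leaves a word model (those arcs came from the language model), and it treats word-boundary arcs like any other arc, which is a simplification.

from dataclasses import dataclass

@dataclass
class Token:
    log_prob: float
    word_history: tuple = ()   # record of the word models passed through

def token_passing(arcs, emission_loglik, observations, start_state=0):
    # arcs: list of (from_state, to_state, log_transition_prob, word)
    #   word is None for arcs inside a word model, and the word's label for
    #   arcs that leave a word model (these arcs came from the language model)
    # emission_loglik(state, obs): log likelihood of state emitting obs
    tokens = {start_state: Token(0.0)}
    for obs in observations:
        new_tokens = {}
        for (i, j, log_a, word) in arcs:
            t = tokens.get(i)
            if t is None:
                continue
            score = t.log_prob + log_a + emission_loglik(j, obs)
            history = t.word_history + (word,) if word else t.word_history
            # keep only the best token arriving in each state (the Viterbi criterion)
            if j not in new_tokens or score > new_tokens[j].log_prob:
                new_tokens[j] = Token(score, history)
        tokens = new_tokens
    winner = max(tokens.values(), key=lambda t: t.log_prob)
    return winner.word_history, winner.log_prob

Each token carries its log probability and its word history; reading the history off the winning token gives the most likely word sequence.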
Put the above code in a file called child_script and try calling it from another script, to demonstrate that child_script correctly returns an exit status:

#!/usr/local/bin/bash

echo "This is the parent script"

# let's call the child script
./child_script

# at this point, the special shell variable $? contains the exit status of child_script
# for readability, let's put that status into a variable with a nicer name
STATUS=$?

if [ ${STATUS} = 0 ]
then
    echo "The child script ran OK"
else
    echo "In parent script: child script returned error code "${STATUS}
    # exiting with any non-zero value indicates an error
    exit 1
fi
#!/usr/local/bin/bash

echo "This is the child script"

# let's run an HTK command, but with a deliberate mistake to cause an error
HInit -T 1 this_file_does_not_exist

# at this point, the special shell variable $? contains the exit status of HInit
# for readability, let's put that status into a variable with a nicer name
STATUS=$?

# test whether the return status is 0 (indicates success)
if [ ${STATUS} = 0 ]
then
    echo "HInit ran without error"
else
    echo "Something went wrong in HInit, which exited with code "${STATUS}
    # a good idea is to exit this script with the same error code
    # (or some other non-zero value, if you prefer)
    # so that anything calling this script can also detect the error
    exit ${STATUS}
fi
Don’t think in terms of decimal places, but in terms of significant figures.
1.3968 written to 3 significant figures would be 1.40
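If you want to compute that in code rather than by hand, here is one possible sketch (not the only way) in Python; the function name is made up:

from math import floor, log10

def round_to_sig_figs(x, n):
    # number of decimal places needed so that n significant figures survive
    decimals = n - 1 - floor(log10(abs(x)))
    return round(x, decimals)

print(round_to_sig_figs(1.3968, 3))   # prints 1.4, i.e., 1.40 to 3 significant figures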
You’re on the right lines. We couldn’t just average the amplitudes of the speech samples in a frame – as you say, this would come out to a value of about zero. We need to make them all positive first, so we square them. Then we average them (sum and divide by the number of samples). To get back to the original units we then take the square root.
This procedure is so common that it gets a special name: RMS, or Root Mean Square. We’ll then often take the log, to compress the dynamic range.
The variants you are coming across might differ in whether they take the square root or not. That might seem like a major difference, but it’s not. If we’re going to take the log, then taking the square root first doesn’t do anything useful: because log √x = 0.5 log x, it just becomes a constant multiplier of 0.5.
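As a sketch in Python with NumPy (the function name is just for illustration; frame is assumed to be one frame of waveform samples):

import numpy as np

def log_rms_energy(frame, eps=1e-10):
    frame = np.asarray(frame, dtype=float)
    # square the samples (makes them all positive), take the mean,
    # then the square root: Root Mean Square
    rms = np.sqrt(np.mean(frame ** 2))
    # take the log to compress the dynamic range (eps avoids log(0) for a silent frame)
    return np.log(rms + eps)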
Your summary of how the cepstrum separates source and filter is good.
Omitting the phase of the speech signal is only a small part of the story – this happens right in the first step after windowing, when we retain only the magnitude spectrum.
The key ideas to understand are:
The magnitude spectrum of speech is equal to the product of the magnitude spectrum of the source and the magnitude spectrum of the filter.
The log magnitude spectrum of speech is equal to the sum of the log magnitude spectrum of the source and the log magnitude spectrum of the filter.
We perform a series expansion of the log magnitude spectrum of the speech. Whether we use the DFT, inverse DFT or DCT (Discrete Cosine Transform) isn’t important conceptually.
This series expansion expresses the log magnitude spectrum of speech as a sum of simple components (e.g., cosines). Some of those simple components represent the filter (the low-order ones), and one or two of the higher-order components represent the source. They are additive in the log spectral domain.
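Written out (just a sketch of the idea, writing S for speech, E for the source and H for the filter, and using a DCT as the series expansion):

\[
|S(f)| = |E(f)|\,|H(f)|
\]
\[
\log|S(f)| = \log|E(f)| + \log|H(f)|
\]
\[
c_n = \mathrm{DCT}_n\!\left[\log|S(f)|\right] = \mathrm{DCT}_n\!\left[\log|E(f)|\right] + \mathrm{DCT}_n\!\left[\log|H(f)|\right]
\]

because the DCT (like any series expansion we might choose) is linear. The low-order coefficients mainly capture the filter, and the source shows up higher in the series.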
I don’t find Holmes & Holmes’ argument about transmission channels very convincing either.
Their point is that machines should not be able to “hear” something that humans cannot, and that might turn out to be a good idea when it comes to privacy and security of voice-enabled devices. Here’s one reason:
and here’s another more extreme form of attack on ASR systems.
An excellent question. Yes, there are many ways to represent and parameterise the vocal tract frequency response, or more generally the spectral envelope.
Let’s break the answer down into two parts:
1) comparing MFCCs with vocal tract filter coefficients
There are many choices of vocal tract filter. The most common is a linear predictive filter. We could use the coefficients of such a filter as features, and in older papers (e.g., where DTW was the method for pattern matching) we will find that this was relatively common. A linear predictive filter is “all pole” – that means it can only model resonances. That’s a limitation. When we fit the filter to a real speech signal, it will give an accurate representation of the formant peaks, but be less accurate at representing (for example) nasal zeros. In contrast, the cepstrum places equal importance on the entire spectral envelope, not just the peaks.
2) comparing MFCCs with filterbank outputs
It is true that MFCCs cannot contain any more information than filterbank outputs, given that they are derived from them.
There must be another reason for preferring MFCCs in certain situations. The reason is that there is less covariance (i.e., correlation) between MFCC coefficients than between filterbank outputs. That’s important when we want to fit a Gaussian probability density function to our data, without needing a full covariance matrix.
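A minimal sketch of that last step (assuming fbank_energies holds the Mel filterbank energies for one frame; the function name and the choice of 12 coefficients are just illustrative), using SciPy’s DCT:

import numpy as np
from scipy.fftpack import dct

def mfcc_from_filterbank(fbank_energies, num_ceps=12, eps=1e-10):
    # take the log of the filterbank energies
    log_fbank = np.log(np.asarray(fbank_energies, dtype=float) + eps)
    # the DCT decorrelates the coefficients; keep only the first few
    return dct(log_fbank, type=2, norm="ortho")[:num_ceps]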
You also make a good point that we can seek inspiration from either speech production or speech perception. In fact, we could use ideas from both in a single feature set – an example of that would be Perceptual Linear Prediction (PLP) coefficients. This is beyond the scope of Speech Processing, where we’ll limit ourselves to filterbank outputs and MFCCs.
The best route to understanding this is first to understand Bayes’ rule.
If W is a word sequence and O is the observed speech signal:
The language model represents our prior beliefs about what sequences of words are more or less likely. We say “prior” because this is knowledge that we have before we even hear (or “observe”) any speech signal. The language model computes P(W). Notice that O is not involved.
The acoustic model – a generative model such as an HMM – computes the probability of the observed speech signal, given a possible word sequence. This is called the likelihood and is written P(O|W).
Neither of those quantities is what we actually need if we are trying to decide what was said. We actually want the probability of every possible word sequence given the speech signal (so we can choose the most probable one). This quantity is called the posterior, because we can only know its value after observing the speech, and is written P(W|O).
Bayes’ rule tells us how we can combine the prior and the likelihood to calculate the posterior – or at least something proportional to it, which is good enough for our purposes of choosing the value of W that maximises P(W|O).
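Written out, using the same symbols as above:

\[
P(W|O) = \frac{P(O|W)\,P(W)}{P(O)}
\]

Since P(O) does not depend on W, we can choose the most probable word sequence using only the numerator:

\[
\hat{W} = \arg\max_W P(W|O) = \arg\max_W P(O|W)\,P(W)
\]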
You might think this is rather abstract and conceptually hard. You’d be right. Developing both an intuitive and formal understanding of probabilistic modelling takes some time.
In a filterbank, there are a set of bandpass filters (perhaps 20 to 30 of them). Each one selects a range (or a “band”) of frequencies from the signal.
The filters in a filterbank are fixed and do not vary. We, as the system designer, choose the frequency bands – for example, we might space them evenly on a Mel scale, taking inspiration from human hearing.
The feature vector produced by the filterbank is a vector containing, in each element, the amount of energy captured by the corresponding filter’s frequency band.
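As a sketch (the names are hypothetical, and constructing the Mel-spaced triangular filters themselves is left out), assuming magnitude_spectrum is the magnitude spectrum of one frame and filters holds one filter’s frequency response per row:

import numpy as np

def filterbank_features(magnitude_spectrum, filters):
    # filters: array of shape (num_filters, num_frequency_bins), e.g. 20-30
    # triangular bandpass filter responses spaced evenly on a Mel scale
    power_spectrum = np.asarray(magnitude_spectrum, dtype=float) ** 2
    # each element of the result is the energy captured by one filter's band
    return filters @ power_spectrum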
The filter in a source-filter model is a more complex filter than the ones in a filterbank, in two ways:
- it’s not just a simple bandpass filter, but has a more complex frequency response, in order to model the vocal tract transfer function
- it varies over time (it can be fitted to an individual frame of speech waveform)
This filter is inspired not by human hearing, but by speech production.
The simplest type of feature vector derived from the filter in a source-filter model would be a vector containing, in each element, one of the filter’s coefficients. Together, the set of filter coefficients captures the vocal tract transfer function (or, more abstractly, the spectral envelope of the speech signal).
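For illustration only, here is one way such a filter could be fitted to a single frame and its coefficients used as a feature vector, using the autocorrelation method (Levinson–Durbin recursion); the function name and the filter order are just placeholders:

import numpy as np

def lpc_coefficients(frame, order=12):
    frame = np.asarray(frame, dtype=float)
    # autocorrelation of the (windowed) frame, lags 0..order
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break   # degenerate frame (e.g., all zeros)
        # reflection coefficient for this order
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / error
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        error *= (1.0 - k * k)
    # a[1:] are the all-pole filter coefficients: the feature vector for this frame
    return a[1:]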
I’m aware of the third edition currently under construction. As with previous editions, Jurafsky & Martin make this freely available until it goes off to the publishers (at which point they will presumably withdraw the draft version). As you say, the speech material is not yet updated, so we are staying with the second edition for now.
In these slides, I am temporarily imagining that the Fourier co-efficients (i.e., the magnitude spectrum) would be a good representation for Automatic Speech Recognition. Whilst we could use them, we can do better by performing some feature engineering – this is covered a little later on.
Slide 10: these co-efficients are the amount of energy at each frequency – they are the Fourier co-efficients (think of them as weights that multiply each sine wave). If we plot them, we get the spectrum of the signal.
Slide 11: the number of Fourier co-efficients depends on the duration of the signal being analysed. But remember that we don’t analyse the whole signal at once: we divide it into short frames and perform the analysis on each frame in turn. The frames all have the same, fixed duration (e.g., 25ms).
The number of frames is equal to the total duration of the speech signal (e.g., an utterance) divided by the frame shift (e.g., 10ms).
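As a concrete example in Python, using the typical values above (the 2-second utterance duration is made up, and this ignores any partial frame at the end of the signal):

sample_rate = 16000            # samples per second
frame_duration_ms = 25         # each analysis frame is 25 ms long
frame_shift_ms = 10            # the frames start 10 ms apart
utterance_duration_ms = 2000   # say, a 2 second utterance

samples_per_frame = sample_rate * frame_duration_ms // 1000   # 400 samples
num_frames = utterance_duration_ms // frame_shift_ms          # 200 frames
print(samples_per_frame, num_frames)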