Forum Replies Created
Jurafsky & Martin (J&M) include a figure showing the “classical” cepstrum, and this is what is confusing you. As you say, they fail to make a clear connection between this and MFCCs.
To clear this up, we need to distinguish between the classical cepstrum, and what actually happens in creating MFCCs.
Let’s start with the classical cepstrum, as in J&M’s Figure 9.14 (borrowed from Taylor, who gives a better explanation – read that if you can).
WARNING: in J&M’s Figure 9.14, the plots for (a) and (b) need to be swapped in order for the caption to be correct! My explanation below assumes you’ve corrected this figure.
The three subfigures illustrate the key stages of
(a) obtaining the spectrum from the waveform, using an FFT – in this domain, the source and filter are multiplied together
(b) taking the log of the spectrum, which makes the source and filter additive
(c) performing a series expansion (e.g., DCT) which “lays out” the different components of the log spectrum along an axis, such that the source components and filter components are in different places along that axis and can easily be separated. In J&M’s Figure 9.14(c) we can see the fundamental period as a small peak around the middle of the cepstrum.
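The three stages can be sketched in a few lines of numpy. The 125 Hz harmonic signal below is a hypothetical stand-in for voiced speech, chosen so that one pitch period is exactly 128 samples at a 16 kHz sample rate:

```python
import numpy as np

def real_cepstrum(frame):
    """Classical (real) cepstrum: waveform -> |FFT| -> log -> inverse FFT."""
    magnitude = np.abs(np.fft.rfft(frame))       # (a) source and filter are multiplied here
    log_magnitude = np.log(magnitude + 1e-10)    # (b) taking the log makes them additive
    return np.fft.irfft(log_magnitude)           # (c) series expansion of the log spectrum

# hypothetical test signal: harmonics of 125 Hz at fs = 16 kHz (pitch period = 128 samples)
fs, f0, n = 16000, 125.0, 1024
t = np.arange(n) / fs
frame = sum(np.cos(2 * np.pi * f0 * m * t) for m in range(1, 60))

cep = real_cepstrum(frame)
peak = 100 + np.argmax(cep[100:200])   # search away from the low-quefrency (filter) region
print(peak)  # 128
```

The peak at quefrency 128 samples (one pitch period) corresponds to the small peak around the middle of the cepstrum in Figure 9.14(c).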
There is no filterbank in the classical cepstrum, and no Mel-scaling of the frequency axis.
Mel Frequency Cepstral Coefficients are inspired by the classical cepstrum and use the same key processing steps, plus one extra stage: a Mel-scaled filterbank. This is applied after 9.14(a), so 9.14(b) becomes a smooth spectral envelope (no harmonics) on a Mel scale, and 9.14(c) would no longer have the small peak corresponding to the fundamental period.
The filterbank serves two purposes. First, it’s an easy way to warp the frequency scale from linear (Hertz) to a Mel scale, simply by spacing the filters’ centre frequencies evenly apart on a Mel scale. Second, it’s an opportunity to smooth the spectrum and reduce the prominence of the harmonics – in other words, to produce a spectrum that contains less information about the source.
To summarise: J&M’s Figure 9.14(c) is the classical cepstrum and is not one of the stages on the way to MFCCs.
An n-gram language model would be learned from a large text corpus. The simplest method is just to count how often each word follows other words, and then normalise the counts to probabilities.
In general, we don’t train the language model only on the transcripts of the speech we are using to train the HMMs. We usually need a lot more data than this, and so train on text-only data. This is beyond the scope of the Speech Processing course, where we don’t actually cover the training of n-gram language models.
We just need to know how to write them in a finite state form, and then use them for recognition.
In the digit recogniser assignment, the language model is even simpler than an n-gram and so we write it by hand.
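The count-and-normalise idea can be sketched as follows – a toy bigram model over an invented two-sentence corpus, purely for illustration:

```python
from collections import Counter, defaultdict

def bigram_lm(sentences):
    """Count how often each word follows another, then normalise to probabilities."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]   # sentence start/end markers
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    # normalise: probabilities of all words following a given word sum to 1
    return {prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for prev, ctr in counts.items()}

lm = bigram_lm(["the cat sat", "the cat ran"])
print(lm["cat"])  # {'sat': 0.5, 'ran': 0.5}
```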
The language model computes the prior, P(W). If you like, we might say that the language model is the prior. It’s called the prior because we can calculate it before observing O.
In the isolated digit recogniser, P(W) is never actually made explicit, because it’s a uniform distribution. But you can think of having P(W=w) = 0.1 for all words w.
The acoustic model computes the likelihood, P(O|W).
We combine them, using Bayes’ rule, to obtain the posterior P(W|O); we ignore the constant scaling factor of P(O).
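In code, the combination is just a product per word, with P(O) dropped – the likelihood values below are made up for illustration:

```python
# hypothetical acoustic model likelihoods P(O|W) for one observation sequence
likelihoods = {"one": 1e-42, "two": 5e-43, "three": 2e-43}

prior = 0.1   # uniform P(W) over ten digits; a constant, so it cannot change the winner

# Bayes' rule with the constant scaling factor P(O) ignored:
# the posterior P(W|O) is proportional to P(O|W) * P(W)
best = max(likelihoods, key=lambda w: likelihoods[w] * prior)
print(best)  # one
```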
Now, to incorporate alternative pronunciation probabilities, we’d need to introduce a new random variable to our equations, and decide how to compute it. Try for yourself…
Yes, the transition matrix is also updated – you can verify this for yourself by inspecting it in the prototype models, the intermediate models after HInit, and the final models after HRest.
Conceptually, training the transition probabilities is straightforward: we just count how often each transition is used, and then normalise the counts to probabilities (so that the probabilities across all transitions leaving any given state sum to 1). This counting is very easy for Viterbi training – we literally count how often each transition was used in the single best alignment for each training example, and sum across all training examples. For Baum-Welch it’s conceptually the same, but we use “soft counting”, summing across all alignments and all training examples.
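The Viterbi-style update can be sketched directly – the state sequences below are invented best alignments, one per training example:

```python
from collections import Counter, defaultdict

def reestimate_transitions(alignments):
    """Count transitions used in the best alignments, then normalise per source state."""
    counts = defaultdict(Counter)
    for states in alignments:
        for src, dst in zip(states, states[1:]):
            counts[src][dst] += 1
    # normalise so the probabilities leaving each state sum to 1
    return {src: {dst: c / sum(ctr.values()) for dst, c in ctr.items()}
            for src, ctr in counts.items()}

A = reestimate_transitions([[1, 1, 2, 2, 3], [1, 2, 2, 3, 3]])
print(A[2])  # {2: 0.5, 3: 0.5} - state 2 self-looped twice and exited twice
```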
To see for yourself how much contribution the transition matrix makes to the model, you could even do an experiment (optional!), such as manually editing the transition matrices of the final models to reset them to the values from the prototype model, but leaving the Gaussians untouched.
But we don’t know which word our sequence of feature vectors corresponds to. This is what we are trying to find out.
So, we can only try generating it with every possible model (or every possible sequence of models, in the case of connected speech), and search for the one that generates it with the highest probability.
Because we are using Gaussian pdfs, any model can generate any observation sequence. The model of “cat” can generate an observation sequence that corresponds to “car”. But, if we have trained our models correctly, then it will do so with a lower probability than the model of “car”.
A Gaussian pdf assigns a non-zero probability to any possible observation – the long “tails” of the distribution never quite go down to zero. The probability of observations far away from the mean becomes very small, but never zero.
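You can see this by evaluating a Gaussian log density far from its mean – the result is tiny, but it is never log(0):

```python
import math

def log_gaussian(x, mean, var):
    """Log of the univariate Gaussian pdf."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# ten standard deviations from the mean: an extremely unlikely observation,
# but the log probability density is still finite (about -50.9), not minus infinity
print(log_gaussian(10.0, 0.0, 1.0))
```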
The language model is not quite the same as “all the HMMs connected together”.
The language model, on its own, is a generative model that generates (or if you prefer emits) words.
The language model and the acoustic models (the HMMs of words) are combined – we usually say compiled – into a single network. Some arcs in this recognition network come from the language model, others come from the acoustic models.
We can only compile the language model and acoustic models if they are finite state. Any N-gram language model can be written as a finite state network. That’s the main reason that we use N-grams (rather than some other sort of language model).
The missing component in your explanation is the language model. This is what connects the individual word models into a single network (like the one in the “Token Passing game” we played in class).
The language model and all the individual HMMs of words are compiled together into a single network. This recognition network is also an HMM, just with a more complicated topology than the individual word HMMs.
Because the recognition network is just an HMM, we can perform Token Passing to find the most likely path through it that generated the given observation sequence.
The tokens will each keep a record of which word models they pass through. Then, we can read this record from the winning token to find the most likely word sequence.
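Here is a drastically simplified sketch of that record-keeping, for isolated words only and with invented per-frame log likelihoods – real Token Passing propagates tokens through every state of the network, but the idea of reading the word record from the winning token is the same:

```python
from dataclasses import dataclass

@dataclass
class Token:
    logp: float
    words: tuple = ()

# hypothetical per-frame log likelihoods from two word models, for a 3-frame observation
frame_scores = {"yes": [-1.0, -1.2, -0.9], "no": [-2.0, -0.5, -1.8]}

# one token per word model; each accumulates its scores and records the word it passed through
tokens = [Token(sum(scores), (word,)) for word, scores in frame_scores.items()]
winner = max(tokens, key=lambda t: t.logp)
print(winner.words)  # ('yes',) - read the word record from the winning token
```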
Put the above code in a file called child_script and try calling it from another script, to demonstrate that child_script correctly returns an exit status.

The parent script:

```shell
#!/usr/local/bin/bash
echo "This is the parent script"
# let's call the child script
./child_script
# at this point, the special shell variable $? contains the exit status of child_script
# for readability, let's put that status into a variable with a nicer name
STATUS=$?
if [ ${STATUS} = 0 ]
then
  echo "The child script ran OK"
else
  echo "In parent script: child script returned error code "${STATUS}
  # exiting with any non-zero value indicates an error
  exit 1
fi
```

The child script (saved as child_script):

```shell
#!/usr/local/bin/bash
echo "This is the child script"
# let's run an HTK command, but with a deliberate mistake to cause an error
HInit -T 1 this_file_does_not_exist
# at this point, the special shell variable $? contains the exit status of HInit
# for readability, let's put that status into a variable with a nicer name
STATUS=$?
# test whether the return status is 0 (indicates success)
if [ ${STATUS} = 0 ]
then
  echo "HInit ran without error"
else
  echo "Something went wrong in HInit, which exited with code "${STATUS}
  # a good idea is to exit this script with the same error code
  # (or some other non-zero value, if you prefer)
  # so that anything calling this script can also detect the error
  exit ${STATUS}
fi
```

Check your understanding with some quizzes:
Don’t think in terms of decimal places, but in terms of significant figures.
1.3968 written to 3 significant figures would be 1.40
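A small helper makes the distinction concrete – it rounds to significant figures rather than decimal places:

```python
import math

def to_sig_figs(x, n):
    """Round x to n significant figures (not n decimal places)."""
    if x == 0:
        return 0.0
    decimals = n - 1 - math.floor(math.log10(abs(x)))
    return round(x, decimals)

print(to_sig_figs(1.3968, 3))     # 1.4  (written out as 1.40 to show 3 significant figures)
print(to_sig_figs(0.0013968, 3))  # 0.0014  (3 significant figures, but 4 decimal places)
```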
You’re on the right lines. We couldn’t just average the amplitudes of the speech samples in a frame – as you say, this would come out to a value of about zero. We need to make them all positive first, so we square them. Then we average them (sum and divide by the number of samples). To get back to the original units we then take the square root.
This procedure is so common that it gets a special name: RMS, or Root Mean Square. We’ll then often take the log, to compress the dynamic range.
The variants you are coming across might differ in whether they take the square root or not. That might seem like a major difference, but it’s not. If we’re going to take the log, then taking the square root first doesn’t do anything useful: it will just become a constant multiplier of 0.5.
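The whole procedure fits in a few lines – the four-sample frame below is purely for illustration:

```python
import math

frame = [0.5, -0.5, 0.5, -0.5]      # the plain average of these samples is zero

mean_square = sum(s * s for s in frame) / len(frame)   # square, then average
rms = math.sqrt(mean_square)                           # root: back to the original units
print(rms)  # 0.5

# if we go on to take the log, the square root just becomes a constant factor:
# log(sqrt(m)) = 0.5 * log(m)
print(math.log(rms), 0.5 * math.log(mean_square))  # the two values are the same
```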
Your summary of how the cepstrum separates source and filter is good.
Omitting the phase of the speech signal is only a small part of the story – this happens right in the first step after windowing, when we retain only the magnitude spectrum.
The key ideas to understand are:
The magnitude spectrum of speech is equal to the product of the magnitude spectrum of the source and the magnitude spectrum of the filter.
The log magnitude spectrum of speech is equal to the sum of the log magnitude spectrum of the source and the log magnitude spectrum of the filter.
We perform a series expansion of the log magnitude spectrum of the speech. Whether we use the DFT, inverse DFT or DCT (Discrete Cosine Transform) isn’t important conceptually.
This series expansion expresses the log magnitude spectrum of speech as a sum of simple components (e.g., cosines). Some of those simple components represent the filter (the low-order ones), and one or two of the higher-order components represent the source. They are additive in the log spectral domain.
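The first two points above can be checked numerically in a line or two – the random vectors here are hypothetical stand-ins for the true source and filter magnitude spectra:

```python
import numpy as np

rng = np.random.default_rng(0)
source = rng.uniform(0.1, 1.0, 256)    # hypothetical source magnitude spectrum
filt = rng.uniform(0.1, 1.0, 256)      # hypothetical filter magnitude response

speech = source * filt                 # multiplicative in the magnitude spectral domain
log_ok = np.allclose(np.log(speech), np.log(source) + np.log(filt))
print(log_ok)  # True - additive in the log magnitude spectral domain
```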
I don’t find Holmes & Holmes’ argument about transmission channels very convincing either.
Their point is that machines should not be able to “hear” something that humans cannot, and that might turn out to be a good idea when it comes to privacy and security of voice-enabled devices. Here’s one reason:
and here’s another more extreme form of attack on ASR systems.
An excellent question. Yes, there are many ways to represent and parameterise the vocal tract frequency response, or more generally the spectral envelope.
Let’s break the answer down into two parts
1) comparing MFCCs with vocal tract filter coefficients
There are many choices of vocal tract filter. The most common is a linear predictive filter. We could use the coefficients of such a filter as features, and in older papers (e.g., where DTW was the method for pattern matching) we will find that this was relatively common. A linear predictive filter is “all pole” – that means it can only model resonances. That’s a limitation. When we fit the filter to a real speech signal, it will give an accurate representation of the formant peaks, but be less accurate at representing (for example) nasal zeros. In contrast, the cepstrum places equal importance on the entire spectral envelope, not just the peaks.
2) comparing MFCCs with filterbank outputs
It is true that MFCCs cannot contain any more information than filterbank outputs, given that they are derived from them.
There must be another reason for preferring MFCCs in certain situations. The reason is that there is less covariance (i.e., correlation) between MFCC coefficients than between filterbank outputs. That’s important when we want to fit a Gaussian probability density function to our data, without needing a full covariance matrix.
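A quick numerical illustration of that decorrelation – the smoothed noise below is a crude stand-in for correlated log filterbank outputs, and the cosine basis is the DCT-II transform used to obtain cepstral coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_chan = 2000, 20

# smoothing white noise across channels makes neighbouring "filterbank outputs" correlated,
# as neighbouring filters are when they overlap on the frequency axis
raw = rng.normal(size=(n_frames, n_chan + 4))
fbank = sum(raw[:, i:i + n_chan] for i in range(5)) / 5

# DCT-II basis: project each frame of filterbank outputs onto cosines
k = np.arange(n_chan)
basis = np.cos(np.pi * np.outer(k, 2 * np.arange(n_chan) + 1) / (2 * n_chan))
cep = fbank @ basis.T

def mean_abs_offdiag(x):
    """Average absolute off-diagonal correlation between dimensions of x."""
    c = np.abs(np.corrcoef(x, rowvar=False))
    return (c.sum() - np.trace(c)) / (c.size - len(c))

print(mean_abs_offdiag(fbank) > mean_abs_offdiag(cep))  # True: coefficients are less correlated
```

Lower correlation between dimensions is exactly what a diagonal-covariance Gaussian assumes.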
You also make a good point that we can seek inspiration from either speech production or speech perception. In fact, we could use ideas from both in a single feature set – an example of that would be Perceptual Linear Prediction (PLP) coefficients. This is beyond the scope of Speech Processing, where we’ll limit ourselves to filterbank outputs and MFCCs.