Forum Replies Created
This slide is talking about the most basic form of waveform concatenation synthesis, in which we store only one example of each unit type:
inventory = the set of stored waveform units
- using phonemes as the type would require only around 45 stored waveform units
- diphones would require ~2000 stored waveform units
- and so on … larger unit sizes generally have more types
But similar arguments apply to unit selection:
inventory = the set of unique unit types
database = the stored waveforms (usually complete natural utterances) from which units are selected; multiple instances of each unit type are available
Unit selection involves searching all possible candidate unit sequences to find the best-sounding sequence. Even using dynamic programming, this will involve significant computation.
As the database of speech to draw candidates from increases in size, the number of available candidates increases in proportion, but the number of possible sequences increases exponentially.
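To make that scaling concrete, here is a toy sketch (not the code of any real synthesiser) comparing exhaustive enumeration of candidate sequences with a dynamic-programming (Viterbi-style) search over the same lattice. The targets and candidates are plain numbers and the cost functions are invented placeholders.

```python
# A toy comparison (not real synthesiser code) of exhaustive search versus
# dynamic programming over a lattice of candidate units. Targets and
# candidates are plain numbers; the cost functions are placeholders.

import itertools


def target_cost(target, candidate):
    # Placeholder: how well a candidate unit matches the target specification.
    return abs(target - candidate)


def join_cost(prev_candidate, candidate):
    # Placeholder: how well two adjacent candidates concatenate.
    return abs(prev_candidate - candidate)


def exhaustive_search(targets, candidates_per_target):
    """Score every possible candidate sequence: C**N sequences for N targets."""
    best_cost, best_seq = float("inf"), None
    for seq in itertools.product(*candidates_per_target):
        cost = sum(target_cost(t, c) for t, c in zip(targets, seq))
        cost += sum(join_cost(a, b) for a, b in zip(seq, seq[1:]))
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_cost, best_seq


def viterbi_search(targets, candidates_per_target):
    """Dynamic programming: roughly N * C**2 local cost computations."""
    # best[c] = lowest total cost of any partial sequence ending in candidate c
    best = {c: target_cost(targets[0], c) for c in candidates_per_target[0]}
    for target, candidates in zip(targets[1:], candidates_per_target[1:]):
        best = {
            c: target_cost(target, c)
            + min(prev + join_cost(p, c) for p, prev in best.items())
            for c in candidates
        }
    return min(best.values())  # backpointers would recover the sequence itself
```

Both functions find the same minimum total cost, but exhaustive_search scores C**N complete sequences while viterbi_search only computes about N * C**2 local costs. Either way, a bigger database means more candidates C per target, so the search gets more expensive as the database grows.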
Longer tubes have lower resonant frequencies than shorter tubes.
It’s vocal tract, not “vocal track”.
Vocal tract length is determined mainly by anatomy; speakers can only vary it a little (e.g., by protruding the lips).
Likewise, a speaker’s F0 range is determined by anatomy and can only be varied within that range.
F0 and formants are independent things, so “adjust their F0 so that the speech they produce has the same formants” is incorrect.
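To put rough numbers on that, here is a sketch using the standard uniform-tube (quarter-wavelength resonator) approximation of the vocal tract, closed at the glottis and open at the lips. The tube lengths and speed of sound are just illustrative round numbers, and note that F0 appears nowhere in this calculation because it is set by the source, not the tube.

```python
# Uniform-tube approximation of the vocal tract (closed at the glottis, open
# at the lips): resonances fall at F_n = (2n - 1) * c / (4 * L).
# The lengths and speed of sound below are illustrative round numbers only.

SPEED_OF_SOUND = 350.0  # m/s, a typical textbook value for warm, humid air


def tube_resonances(length_m, n_resonances=3):
    return [(2 * n - 1) * SPEED_OF_SOUND / (4 * length_m)
            for n in range(1, n_resonances + 1)]


for length_cm in (17.5, 14.0):  # a longer and a shorter vocal tract
    freqs = [round(f) for f in tube_resonances(length_cm / 100.0)]
    print(f"{length_cm} cm tube: {freqs} Hz")

# 17.5 cm tube: [500, 1500, 2500] Hz
# 14.0 cm tube: [625, 1875, 3125] Hz   <- shorter tube, higher resonances
```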
This was a bug – thanks for spotting it. Now fixed.
J&M 8.3 was only missing from the reading list displayed within the module 4 page.
You are right that this is a typo: “decreases” should read “increases”.
Does this change your answer?
You are right that using a filterbank with suitably wide filters should eliminate all evidence of F0 in the resulting set of filter outputs. That’s the theory…
…however, even with wide filter bandwidths, there will still be some correlation between a filter’s output and F0: the amount of energy in a filter’s output will vary a little up/down as more/fewer harmonics happen to fall within its pass band.
In this case, the motivation for using the cepstrum remains obtaining a decorrelated representation, and we should still truncate the cepstrum to discard the higher cepstral coefficients, which will be those that correlate the most with F0 (and to reduce dimensionality).
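As a concrete illustration of that truncation step, here is a minimal sketch in which the filterbank outputs are random stand-ins rather than a real analysis frame; the filterbank size and number of coefficients are typical values, not prescriptions.

```python
# Minimal sketch: take the cepstrum (a DCT) of the log filterbank outputs and
# truncate it, keeping only the lower coefficients. The "energies" here are
# random placeholders; in practice they come from a mel filterbank applied to
# the magnitude spectrum of one analysis frame.

import numpy as np
from scipy.fft import dct

num_filters = 26   # a typical mel filterbank size
num_ceps = 13      # a typical number of cepstral coefficients to keep

log_energies = np.log(np.random.rand(num_filters) + 1e-6)  # placeholder frame

cepstrum = dct(log_energies, type=2, norm="ortho")  # decorrelating transform
mfccs = cepstrum[:num_ceps]   # truncate: discard the higher coefficients,
                              # which carry most of the remaining F0-related
                              # variation (and reduce dimensionality)
print(mfccs.shape)  # (13,)
```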
This topic about using a remote machine is for MSc dissertation projects. For Speech Processing, we do not have a remote machine that you can log in to.
That’s a great question. If we read outdated books like Holmes & Holmes we will find discussion of LPC features for ASR, and other features derived from a source-filter analysis such as PLP (Perceptual Linear Prediction). PLP was popular for quite a while.
LPC analysis makes a strong assumption about the shape of the spectral envelope: that it can be modelled as an all-pole filter. MFCCs use a more general-purpose approach of series expansion that does not make this assumption.
LPC analysis requires solving for the filter coefficients, and there are multiple ways to do that. They all have limitations, and the process can be error-prone, especially when the speech is not clean.
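To make the all-pole assumption concrete, here is a minimal sketch of the autocorrelation method for a single windowed frame. The frame is random noise standing in for speech, and this is only one of the several ways of solving for the coefficients mentioned above (real implementations usually use the Levinson-Durbin recursion with extra safeguards).

```python
# Minimal sketch of autocorrelation-method LPC for one windowed frame.
# The frame below is random noise standing in for speech; the order is a
# typical value, not a prescription.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz


def lpc(frame, order=12):
    # Autocorrelation of the frame
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Solve the Yule-Walker (Toeplitz) equations for the predictor coefficients
    a = solve_toeplitz(r[:order], r[1:order + 1])
    # Prediction-error filter A(z) = 1 - sum_k a_k z^{-k}
    return np.concatenate(([1.0], -a))


frame = np.hamming(400) * np.random.randn(400)   # placeholder "speech" frame
a = lpc(frame, order=12)

# The spectral envelope is |1 / A(e^{jw})|: an all-pole shape by construction,
# which is exactly the assumption MFCCs avoid making.
w, h = freqz([1.0], a, worN=256)
envelope = np.abs(h)
```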
The labels are for different things in training and testing.
For the training data, we need to know the start and end time of each training example because HRest requires this: it trains one model at a time.
For the test data, we simply need the correct label for each test .mfcc file, to use as a reference when computing the WER.
You should reference the manual for whatever version you are using (you can find that by running any HTK program with just the -V flag).
Both algorithms find alignments between states and observations.
Both algorithms express this alignment as the probability of an observation aligning with a state, which is the same thing as the state sequence. In Viterbi, we can think of those probabilities as being “hard”, or 1s and 0s, because just one state sequence is considered. In Baum-Welch, they will be “soft”, because all state sequences are considered.
These probabilities are then used as the weights in a weighted sum of observations to re-estimate the means of the Gaussians. The weights will not sum to one, so this weighted sum must be normalised by dividing by the sum of the weights.
Remember that in this course, we’re not deriving the equations to express the above. So bear in mind that whilst the concept of “hard” and “soft” alignment is perfectly correct, the exact computations of the weights might be slightly more complex.
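Here is a minimal numerical sketch of just that mean update; the observations and alignment probabilities are made up for illustration, and, as the previous paragraph warns, the exact computations in the full re-estimation formulae are more involved than this.

```python
# Minimal sketch of the mean update described above. gamma[t, s] is the (hard
# or soft) probability of observation t aligning with state s; the numbers are
# made up for illustration, not taken from a real model.

import numpy as np

observations = np.array([[1.0], [2.0], [4.0], [5.0]])   # T=4 frames, 1-dim features

# "Hard" alignment (Viterbi): each frame belongs to exactly one state.
gamma_hard = np.array([[1.0, 0.0],
                       [1.0, 0.0],
                       [0.0, 1.0],
                       [0.0, 1.0]])

# "Soft" alignment (Baum-Welch): each frame is shared between the states.
gamma_soft = np.array([[0.9, 0.1],
                       [0.7, 0.3],
                       [0.2, 0.8],
                       [0.1, 0.9]])


def update_means(gamma, obs):
    # Weighted sum of observations, normalised by the sum of the weights
    # (the weights for a state do not sum to one across frames).
    weighted_sums = gamma.T @ obs                  # shape (num_states, dim)
    totals = gamma.sum(axis=0)[:, np.newaxis]      # shape (num_states, 1)
    return weighted_sums / totals


print(update_means(gamma_hard, observations))  # [[1.5], [4.5]]
print(update_means(gamma_soft, observations))  # means pulled slightly towards
                                               # each other by the shared frames
```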
In training, W is constant for each training sample, so no language model is needed.
The objective function of both algorithms when used for training is to maximise the probability of the observations given the model: P(O|W). This is called “maximum likelihood training” and is the simplest and most obvious thing to do.
(However, this simple objective function does not directly relate to the final task, which is one of classification. So, in advanced courses on ASR, other objective functions would be developed which more directly relate to minimising the recognition error rate. Those are much more complex.)
The states in a whole word model do not correspond to phonemes. Figure 9.4 in Jurafsky & Martin (2nd edition) implies that this is the case, but what they are actually doing is constructing a word model from sub-word (phoneme) models, and their phoneme models have a single state (which is not common – normally we use 3 states). The figure is misleading.
The number of emitting states in a model is a design choice we need to make. As you correctly say, more states means we will need more training data, because the model will have more parameters.
In the digit recogniser assignment, there are a variety of “prototype” models that have varying numbers of states, for you to experiment with. It’s certainly worth doing an experiment to investigate this; make sure it’s one using large training and test sets, not just a single speaker.
You could try using a different number of states for each digit in the vocabulary, but that’s probably not the most fruitful line of experiments.
First, you can and should only run make_mfccs for your own data. (The only exception would be an advanced experiment varying the parameterisation, and you should talk that over with me in a lab session before attempting it.)
There are ongoing permissions problems on the server, and so I’ve reset them again. Please try again and report back.
Each model is of one digit. There are always 10 models (look in the models directory to see them). This is the same, regardless of what data you train the models on.
You need to run HInit (and then HRest) once for each model you want to train. The basic scripts do this already for you, using a for loop, and you’ll keep that structure for speaker-independent experiments too.