Forum Replies Created
The term “degrading” is somewhat informal, but what I mean is that the low-level signal quality is made worse.
This can be contrasted to other types of degradations – for example, that we might get from unit selection: perceptible joins, incorrect co-articulation or bad prosody.
The features (used by the target cost function) of candidates do indeed have to be independent of the features of other candidates. If there was a dependency, this would violate the conditional independence assumption that the search makes (recall that it is equivalent to a Markov model – it is memoryless).
Now, the features of target units will of course depend on the features of other target units. That’s not a problem – we are not searching over different sequences of target units (they are fixed, and depend only on the input text).
Also, the features of candidate units do depend on their neighbours within the sentence that they were extracted from. Again, that is constant and not something that we are searching over.
Adding deltas effectively brings in information from neighbouring frames. You are right that this will still be over a fairly small region of the signal.
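If it helps to make that concrete, here is a minimal sketch (my own, not taken from any particular toolkit) of one simple delta scheme, taking half the difference between the following and preceding frames:

```python
import numpy as np

def simple_deltas(features):
    # features: array of shape (num_frames, num_coefficients), e.g. MFCCs
    # a simple delta: half the difference between the next and previous frame,
    # repeating the first and last frames at the edges
    padded = np.pad(features, ((1, 1), (0, 0)), mode='edge')
    return (padded[2:] - padded[:-2]) / 2.0
```

Real toolkits typically use a regression over a slightly wider window, but the principle is the same: the delta at frame t only looks a few frames either side.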
The join cost has to be local to the two candidate units being considered for a join. If it depended on the properties of other candidate units, then this would dramatically increase the complexity of the search problem.
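To see why that locality matters, here is a minimal sketch (not any real unit selection system; the two cost arrays are assumed to be given) of the dynamic programming search. Each step only ever combines one candidate's target cost with the join cost to its immediate predecessor, and that is exactly what keeps the search tractable:

```python
def viterbi_unit_selection(target_costs, join_costs):
    # target_costs[t][i]  : target cost of candidate i at target position t
    # join_costs[t][i][j] : join cost between candidate i at position t-1 and candidate j at t
    T = len(target_costs)
    best = list(target_costs[0])              # best total cost of a path ending in each candidate
    back = [[None] * len(target_costs[0])]
    for t in range(1, T):
        new_best, new_back = [], []
        for j, tc in enumerate(target_costs[t]):
            # only the immediate predecessor matters, because the join cost is local
            costs = [best[i] + join_costs[t][i][j] for i in range(len(best))]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min] + tc)
            new_back.append(i_min)
        best, back = new_best, back + [new_back]
    # trace back the lowest-cost sequence of candidate indices
    j = min(range(len(best)), key=best.__getitem__)
    path = [j]
    for t in range(T - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    return list(reversed(path)), min(best)
```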
In section 16.3.4, Taylor is talking about the specific problem of how to set the target cost weights. If unit selection was a generative model (e.g., an HMM), we could use the usual objective of maximising likelihood.
The problem is that there is no explicit “model” as such – the unit selection system actually contains the training data, rather than abstracting away (i.e., generalising) from it by fitting a model.
Because it has “memorised” the training data exactly, it is perfectly fitted to the training data (we would say “over-fitted” if it was a generative model). This means that changing the target cost weights has absolutely no effect (*) on the output when we generate sentences from the training data.
(*) a weighted sum of zero terms is always zero, regardless of the weights
It’s actually the same calendar as Speech Processing, but I’ve now added a subscription link for it on this page.
J&M don’t do a great job of explaining either the language model scaling factor (LMSF) or the word insertion penalty (WIP), so I’ll explain both.
Let’s start with the LMSF. The real reason that we need to scale the language model probability before combining it with the acoustic model likelihood is much simpler than J&M’s explanation:
- the language model probability really is a probability
- the acoustic model likelihood is not a probability because it’s computed by probability density functions
Remember that a Gaussian probability density function cannot assign a probability to an observation, but only a probability density. If we insisted on getting a true probability, this would have to be for an interval of observation values (J&M figure 9.18). We might describe density as being “proportional” to probability – i.e., a scaled probability.
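If you want to convince yourself that a density really isn't a probability, note that it can easily be greater than 1:

```python
from math import pi, sqrt

# the value of a Gaussian density at its mean is 1 / (sigma * sqrt(2 * pi)),
# which exceeds 1 whenever sigma is small enough
sigma = 0.1
density_at_mean = 1.0 / (sigma * sqrt(2 * pi))   # about 3.99 - clearly not a probability
```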
So, the language model probability and the acoustic model likelihood are on different scales. Simply multiplying them (in practice, adding them in the log domain) assumes they are on the same scale. So, doing some rescaling is perfectly reasonable, and the convention is to multiply the language model log probability by a constant: the LMSF. This value is chosen empirically – for example, to minimise WER on a development set.
Right, on to the Word Insertion Penalty (WIP). J&M attempt a theoretical justification of this, which relies on their explanation of why the LMSF is needed. I’ll go instead for a pragmatic justification:
An automatic speech recognition system makes three types of errors: substitutions, insertions and deletions. All of them affect the WER. We try to minimise substitution errors by training the best acoustic and language models possible. But there is no direct control via either of those models over insertions and deletions. We might find that our system makes a lot of insertion errors, and that will increase WER (potentially above 100%!).
So, we would like to have a control over the insertions and deletions. I’ll explain this control in the Token Passing framework. We subtract a constant amount from a token’s log probability every time it leaves a word end. This amount is the WIP (J&M equation 9.49, except they don’t describe it in the log prob domain). Varying the WIP will trade off between insertions and deletions. You will need to adjust this penalty if you attempt the connected digits part of the digit recogniser exercise because you may find that, without it, your system makes so many insertion errors that WER is indeed greater than 100%.
Finally, to actually answer your question, I think there is a typo in “if the language model probability increases (larger penalty)” where surely they meant “(smaller penalty)”. But to be honest, I find their way of explaining this quite confusing, and it’s not really how ASR system builders think about LMSF or WIP. Rather, these are just a couple of really useful additional system tuning parameters to be found empirically, on development data.
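In case it helps to see where these two parameters enter the computation, here is a minimal sketch (my own, with made-up default values) of how a complete hypothesis might be scored in the log domain:

```python
def hypothesis_log_score(acoustic_log_likelihood, lm_log_prob, num_words,
                         lmsf=15.0, wip=10.0):
    # acoustic_log_likelihood : summed log density from the acoustic model (not a true log probability)
    # lm_log_prob             : log probability of the word sequence from the language model
    # lmsf                    : language model scaling factor, tuned on development data
    # wip                     : word insertion penalty, subtracted once per word
    return acoustic_log_likelihood + lmsf * lm_log_prob - wip * num_words
```

Increasing the WIP penalises hypotheses containing more words (fewer insertions, more deletions); decreasing it does the opposite.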
The 1127 and 700 values are determined empirically, to match human hearing.
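For reference, those constants appear in one common form of the formula for converting frequency in Hz to Mel:

```python
import math

def hz_to_mel(f_hz):
    # 1127 and 700 were fitted empirically to perceptual data;
    # an equivalent form is 2595 * log10(1 + f/700)
    return 1127.0 * math.log(1.0 + f_hz / 700.0)
```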
There are many possible window shapes available, of which the Hamming window is perhaps the most popular in digital audio signal processing. To understand why there are so many, we need to understand why there is no such thing as the perfect window: they all involve a compromise.
Let’s start with the simplest window: rectangular. This does not taper at the edges, so the signal will have discontinuities which will lead to artefacts – for example, after taking the FFT. On the plus side, the signal inside the window is exactly equal to the original signal from which it was extracted.
A better option is a tapered window that eliminates the discontinuity problem of the rectangular window, by effectively fading the signal in, and then out again. The problem is that this fading (i.e., changing the amplitude) also changes the frequency content of the signal subtly. To see that, consider fading a sine wave in and out. The result is not a sine wave anymore (e.g., J&M Figure 9.11). Therefore, a windowed sine wave is no longer a pure tone: it must have some other frequencies that were introduced by the fading in and out operation.
So, tapered windows introduce additional frequencies into the signal. Exactly what is introduced will depend on the shape of the window, and hence different people prefer different windows for different applications. But, we are not going to get hung up on the details – it doesn’t make much difference for our applications.
For the spectrum of a voiced speech sound, the main artefact of a tapered window is that the harmonics are not perfect vertical lines, but rather peaks with some width. The diagrams on the Wikipedia page for Window Function may help you understand this. That page also correctly explains where the 0.54 value comes from and why it’s not exactly 0.5 (which would be a simple raised cosine, called the Hann window). Again, these details really don’t matter much for our purposes and are well beyond what is examinable.
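Purely for the record, here are the two window formulae being compared (N is the window length); NumPy's np.hamming and np.hanning compute the same things:

```python
import numpy as np

def hamming(N):
    n = np.arange(N)
    # Hamming window: does not taper quite to zero at the edges (it reaches 0.08);
    # the 0.54 is chosen to suppress the nearest sidelobe
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

def hann(N):
    n = np.arange(N)
    # Hann window: a simple raised cosine, tapering all the way to zero at the edges
    return 0.5 - 0.5 * np.cos(2 * np.pi * n / (N - 1))
```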
The Fast Fourier Transform (FFT) is an algorithm that efficiently implements the Discrete Fourier Transform (DFT). Because it is so efficient, we pretty much only ever use the FFT and that’s why you hear me say “FFT” in class when I could use the more general term “DFT”.
The FFT is a divide-and-conquer algorithm. It divides the signal into two parts, solves for the FFT of each part, then joins the solutions together. This is recursive, so each part is then itself divided into two, and so on. Therefore, the length of the signal being analysed has to be a power of 2, to make it evenly divide into two parts recursively.
The details of this algorithm are beyond the scope of the course, but it is a beautiful example of how an elegant algorithm can make computation very fast.
If you really want to learn this algorithm, then refer to the classic textbook:
Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck, Discrete-Time Signal Processing, 2nd edition (Upper Saddle River, NJ: Prentice Hall, 1999)
We would rarely implement an FFT because there are excellent implementations available in standard libraries for all major programming languages. But, you can’t call yourself a proper speech processing engineer until you have implemented the FFT, so add it to your bucket list (after Dynamic Programming and before Expectation-Maximisation)!
You can see my not-fast-enough FFT implementation here, followed by a much faster implementation from someone else (which is less easy to understand).
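If you do want to try it, here is a minimal recursive radix-2 sketch of the idea described above (far slower than any library FFT, but it shows the divide-and-conquer structure):

```python
import numpy as np

def fft_recursive(x):
    # x: a sequence of samples whose length is a power of 2
    N = len(x)
    if N == 1:
        return np.asarray(x, dtype=complex)
    even = fft_recursive(x[0::2])   # solve the two half-length problems...
    odd = fft_recursive(x[1::2])
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    # ...then combine their solutions
    return np.concatenate([even + twiddle * odd,
                           even - twiddle * odd])

# sanity check against the library implementation
x = np.random.randn(8)
assert np.allclose(fft_recursive(x), np.fft.fft(x))
```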
Let’s separate out several different processes:
The Mel scale is applied to the spectrum (by warping the x axis), before taking the cepstrum. This is effectively just a resampling of the spectrum, providing higher resolution (more parameters) in the lower frequencies and lower resolution at higher frequencies.
Taking the cepstrum of this warped spectrum doesn’t change the cepstral transform’s ability to separate source and filter. But it does result in using more parameters to specify the shape of the lower, more important, frequency range.
The filterbank does several things all at once, which can be confusing. It is a way of smoothing out the harmonics, leaving only the filter information. By spacing the filter centre frequencies along a Mel scale we can also use it to warp the frequency axis. Finally, it also reduces the dimensionality from the raw FFT (which has 100s or 1000s of dimensions) to maybe 20-30 filters in the filter bank.
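Here is a minimal sketch of one common way to construct such a filterbank (triangular filters whose centres are evenly spaced on the Mel scale); real toolkits differ in details such as normalisation and the exact Mel formula:

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    def hz_to_mel(f): return 1127.0 * np.log(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (np.exp(m / 1127.0) - 1.0)

    # filter centre frequencies: evenly spaced on the Mel scale, from 0 Hz up to Nyquist
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    # each row is one triangular filter over the n_fft//2 + 1 FFT magnitude bins
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

# usage: if power_spectrum has n_fft//2 + 1 points, then
# filterbank_energies = mel_filterbank() @ power_spectrum   # 26 values instead of 257
```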
Note: J&M’s figure 9.14 is taken from Taylor, and they made a mistake in the caption. See this topic for the correction.
The diagram in Taylor is correct.
You can work this out yourself from first principles: taking the log will compress the vertical range of the spectrum, bringing the very low amplitude components up so we can see them, and bringing the high amplitudes (the harmonics, in this case) down.
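A quick numerical example, using the dB convention (20 log10):

```python
import numpy as np

# three magnitudes spanning a factor of a million...
magnitudes = np.array([0.001, 1.0, 1000.0])
# ...span only 120 dB after taking the log, so they all fit on one readable plot
print(20 * np.log10(magnitudes))   # [-60.   0.  60.]
```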
J&M messed up when they quoted it – a lesson in not quoting something unless you really understand it, perhaps!? Or maybe a printer’s error.
There is an error in the caption of your version (is this the ebook?). The caption in my hardcopy correctly lists the paths, which are:
T1 – T3 – T3 – T3
T1 – T3 – T3 – T2
T1 – T3 – T1 – T2
The take home message from this reading is that working with these data structures is not necessarily the easiest way to understand (or even to practically implement) dynamic programming. We will shortly see a much more elegant approach called Token Passing.
Again, it is not helpful of J&M to include an example from Allen et al. without explaining the notation. I also would not know how to read that rule without reading the Allen paper in full. I think J&M are just making the point that stress assignment is complex, and showing us an esoteric rule as evidence of this.
J&M shouldn’t really have included examples from UNISYN without explaining the notation, which is much more sophisticated than other dictionaries. You don’t really need to know this level of detail, but if you are interested then the notation is explained in the UNISYN manual:
Curly brackets {} surround free morphemes, left angle brackets << are used to enclose prefixes, right angle brackets >> are used for suffixes, and equals signs == join bound morphemes or relatively unproductive affixes, for example ‘ade’ in ‘blockade’
git might not be installed (it comes with Xcode, which is not installed by default).