Forum Replies Created
Lab tasks for each week could be clearer / labs could be more structured
We will provide more class-wide instructions during the remaining lab sessions, whilst still leaving plenty of time for individual help.
Positive comments
Number of people mentioning each point is given in parentheses.
Group work / interactive classes (13)
The videos (8) and subtitles / transcripts (3)
Flipped classroom format (7)
Labs (6) and specifically the tutor (2)
Milestones for the assignment (4)
speech.zone in general, including content, navigation (4)
February 8, 2019 at 08:49 in reply to: ASF – translating linguistic features to acoustic representation #9686
Predicting acoustic features from linguistic features is a regression problem. We already have the necessary labelled training data: the speech database that will be used for unit selection.
One way to do the regression would be to train a regression tree (a CART). This is the method used in so-called “HMM-based speech synthesis” that we will cover in the second half of the course. But in HMM synthesis, the predicted acoustic features are used as input to a vocoder to create a waveform, rather than in an ASF target cost function.
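To make that idea concrete, here is a rough sketch of this kind of regression using scikit-learn’s CART implementation. The feature encoding and all the numbers are invented purely for illustration – this is not how Festival or any real system represents its features.

```python
# A minimal sketch of regression from linguistic features to acoustic features.
# The numeric encoding of the linguistic context is invented for illustration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Each row: an encoded linguistic context for one unit,
# e.g. [phone identity, stress, position in syllable]
X_train = np.array([[3, 1, 0],
                    [3, 0, 1],
                    [7, 1, 2],
                    [7, 0, 0]])

# Each row: acoustic features for that unit, e.g. [duration (s), mean F0 (Hz)]
y_train = np.array([[0.12, 180.0],
                    [0.09, 150.0],
                    [0.20, 210.0],
                    [0.15, 160.0]])

# Fit a regression tree (a CART), then predict acoustic features
# for an unseen linguistic context.
tree = DecisionTreeRegressor(max_depth=2).fit(X_train, y_train)
print(tree.predict(np.array([[3, 1, 1]])))   # predicted [duration, F0]
```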
We might then replace the tree with a better regression model: a neural network. We’ll cover this method after HMM synthesis.
Once we know about HMM and neural network speech synthesis (both using vocoders rather than unit selection + waveform concatenation), we can then come back to the ASF formulation of unit selection. We will find that this is usually called “hybrid speech synthesis” and is covered towards the end of the course.
Your analogy with programming languages is along the right lines. In this context:
“high level” means “further away from the waveform”, “more abstract” and “changing at a slower rate”
“low level” means “closer to the waveform”, “more concrete (e.g., specified more precisely using more parameters)” and “changing more rapidly”
The term “degrading” is somewhat informal; what I mean is that the low-level signal quality is made worse.
This can be contrasted with other types of degradation – for example, those we might get from unit selection: perceptible joins, incorrect co-articulation or bad prosody.
The features (used by the target cost function) of candidates do indeed have to be independent of the features of other candidates. If there was a dependency, this would violate the conditional independence assumption that the search makes (recall that it is equivalent to a Markov model – it is memoryless).
Now, the features of target units will of course depend on the features of other target units. That’s not a problem – we are not searching over different sequences of target units (they are fixed, and depend only on the input text).
Also, the features of candidate units do depend on their neighbours within the sentence that they were extracted from. Again, that is constant and not something that we are searching over.
Adding deltas effectively brings in information from neighbouring frames. You are right that this will still be over a fairly small region of the signal.
The join cost has to be local to the two candidate units being considered for a join. If it depended on the properties of other candidate units, then this would dramatically increase the complexity of the search problem.
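Here is a minimal sketch (in Python, with placeholder cost functions) of the dynamic programming search, just to make the locality point concrete: the join cost only ever looks at the two adjacent candidates, which is exactly what keeps the search tractable.

```python
# Sketch of a Viterbi-style search over a lattice of candidate units.
# target_costs[t][i] is the target cost of candidate i at position t;
# join_cost(a, b) depends only on the two adjacent candidates.
def search(target_costs, candidates, join_cost):
    # best[i] = lowest total cost of any path ending in candidate i at position t
    best = list(target_costs[0])
    back = [[None] * len(c) for c in candidates]
    for t in range(1, len(candidates)):
        new_best = []
        for i, cand in enumerate(candidates[t]):
            # only the immediately preceding candidate matters: Markov / memoryless
            scores = [best[j] + join_cost(prev, cand)
                      for j, prev in enumerate(candidates[t - 1])]
            j = min(range(len(scores)), key=scores.__getitem__)
            back[t][i] = j
            new_best.append(scores[j] + target_costs[t][i])
        best = new_best
    # trace back the lowest-cost path
    i = min(range(len(best)), key=best.__getitem__)
    path = [i]
    for t in range(len(candidates) - 1, 0, -1):
        i = back[t][i]
        path.append(i)
    return list(reversed(path)), min(best)
```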
In section 16.3.4, Taylor is talking about the specific problem of how to set the target cost weights. If unit selection was a generative model (e.g., an HMM), we could use the usual objective of maximising likelihood.
The problem is that there is no explicit “model” as such – the unit selection system actually contains the training data, rather than abstracting away (i.e., generalising) from it by fitting a model.
Because it has “memorised” the training data exactly, it is perfectly fitted to the training data (we would say “over-fitted” if it was a generative model). This means that changing the target cost weights has absolutely no effect (*) on the output when we generate sentences from the training data.
(*) a weighted sum of zero terms is always zero, regardless of the weights
It’s actually the same calendar as Speech Processing, but I’ve now added a subscription link for it on this page.
J&M don’t do a great job of explaining either the language model scaling factor (LMSF) or the word insertion penalty (WIP), so I’ll explain both.
Let’s start with the LMSF. The real reason that we need to scale the language model probability before combining it with the acoustic model likelihood is much simpler than J&M’s explanation:
- the language model probability really is a probability
- the acoustic model likelihood is not a probability because it’s computed by probability density functions
Remember that a Gaussian probability density function cannot assign a probability to an observation, but only a probability density. If we insisted on getting a true probability, this would have to be for an interval of observation values (J&M figure 9.18). We might describe density as being “proportional” to probability – i.e., a scaled probability.
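A quick way to convince yourself of this (using scipy, purely for illustration):

```python
from scipy.stats import norm

# A narrow Gaussian: the density at the mean is far greater than 1,
# so it cannot be a probability.
print(norm.pdf(0.0, loc=0.0, scale=0.01))   # ~39.9

# A true probability requires an interval of observation values:
print(norm.cdf(0.005, loc=0.0, scale=0.01)
      - norm.cdf(-0.005, loc=0.0, scale=0.01))   # ~0.38
```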
So, the language model probability and the acoustic model likelihood are on different scales. Simply multiplying them (in practice, adding them in the log domain) assumes they are on the same scale. So, doing some rescaling is perfectly reasonable, and the convention is to multiply the language model log probability by a constant: the LMSF. This value is chosen empirically – for example, to minimise WER on a development set.
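In the log domain, the combination looks like the following sketch; the numbers are invented, and in a real system these scores come from the acoustic and language models.

```python
import math

# Hypothetical scores for one candidate word sequence (values invented):
log_acoustic_likelihood = -4500.0      # sum of log densities from the acoustic model
log_lm_probability = math.log(1e-8)    # a true (log) probability from the language model

LMSF = 15.0   # language model scale factor, tuned empirically on a development set

combined_score = log_acoustic_likelihood + LMSF * log_lm_probability
```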
Right, on to the Word Insertion Penalty (WIP). J&M attempt a theoretical justification of this, which relies on their explanation of why the LMSF is needed. I’ll go instead for a pragmatic justification:
An automatic speech recognition system makes three types of errors: substitutions, insertions and deletions. All of them affect the WER. We try to minimise substitution errors by training the best acoustic and language models possible. But there is no direct control via either of those models over insertions and deletions. We might find that our system makes a lot of insertion errors, and that will increase WER (potentially above 100%!).
So, we would like to have a control over the insertions and deletions. I’ll explain this control in the Token Passing framework. We subtract a constant amount from a token’s log probability every time it leaves a word end. This amount is the WIP (J&M equation 9.49, except they don’t describe it in the log prob domain). Varying the WIP will trade off between insertions and deletions. You will need to adjust this penalty if you attempt the connected digits part of the digit recogniser exercise because you may find that, without it, your system makes so many insertion errors that WER is indeed greater than 100%.
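As a sketch of that token update (my own variable names, not HTK’s):

```python
WIP = 20.0   # word insertion penalty, tuned empirically on development data

def leave_word_end(token_log_prob, scaled_lm_log_prob):
    # Every time a token leaves a word end, subtract a constant penalty.
    # A larger WIP discourages hypotheses containing many words, so it
    # reduces insertion errors (and, if made too large, causes deletions).
    return token_log_prob + scaled_lm_log_prob - WIP
```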
Finally, to actually answer your question, I think there is a typo in “if the language model probability increases (larger penalty)” where surely they meant “(smaller penalty)”. But to be honest, I find their way of explaining this quite confusing, and it’s not really how ASR system builders think about LMSF or WIP. Rather, these are just a couple of really useful additional system tuning parameters to be found empirically, on development data.
The 1127 and 700 values are determined empirically, to match human hearing.
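For reference, they appear in one common form of the Hz-to-mel conversion (other variants of the formula exist):

```python
import math

def hz_to_mel(f_hz):
    # One common form of the mel scale: 1127 and 700 are empirical constants
    # chosen so that the curve approximates human pitch perception.
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))   # roughly 1000 mels, by construction
```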
There are many possible window shapes available, of which the Hamming window is perhaps the most popular in digital audio signal processing. To understand why there are so many, we need to understand why there is no such thing as the perfect window: they all involve a compromise.
Let’s start with the simplest window: rectangular. This does not taper at the edges, so the signal will have discontinuities which will lead to artefacts – for example, after taking the FFT. On the plus side, the signal inside the window is exactly equal to the original signal from which it was extracted.
A better option is a tapered window that eliminates the discontinuity problem of the rectangular window, by effectively fading the signal in, and then out again. The problem is that this fading (i.e., changing the amplitude) also changes the frequency content of the signal subtly. To see that, consider fading a sine wave in and out. The result is not a sine wave anymore (e.g., J&M Figure 9.11). Therefore, a windowed sine wave is no longer a pure tone: it must have some other frequencies that were introduced by the fading in and out operation.
So, tapered windows introduce additional frequencies into the signal. Exactly what is introduced will depend on the shape of the window, and hence different people prefer different windows for different applications. But, we are not going to get hung up on the details – it doesn’t make much difference for our applications.
For the spectrum of a voiced speech sound, the main artefact of a tapered window is that the harmonics are not perfect vertical lines, but rather peaks with some width. The diagrams on the Wikipedia page for Window Function may help you understand this. That page also correctly explains where the 0.54 value comes from and why it’s not exactly 0.5 (which would be a simple raised cosine, called the Hann window). Again, these details really don’t matter much for our purposes and are well beyond what is examinable.
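If you want to see the trade-off for yourself, here is a small sketch comparing a rectangular and a Hamming window applied to a pure tone (using numpy; the sample rate, tone frequency and window length are arbitrary choices):

```python
import numpy as np

fs = 16000                           # sample rate in Hz (arbitrary)
n = 1024
t = np.arange(n) / fs
tone = np.sin(2 * np.pi * 1007 * t)  # a pure tone that does not line up with the window edges

rect_spectrum = np.abs(np.fft.rfft(tone))                  # rectangular window
hamm_spectrum = np.abs(np.fft.rfft(tone * np.hamming(n)))  # Hamming window

# Plot both (e.g. with matplotlib): the rectangular window spreads energy far
# away from 1007 Hz because of the edge discontinuities, while the Hamming
# window keeps that spurious energy much lower, at the price of a wider peak.
```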
The Fast Fourier Transform (FFT) is an algorithm that efficiently implements the Discrete Fourier Transform (DFT). Because it is so efficient, we pretty much only ever use the FFT and that’s why you hear me say “FFT” in class when I could use the more general term “DFT”.
The FFT is a divide-and-conquer algorithm. It divides the signal into two parts, solves for the FFT of each part, then joins the solutions together. This is recursive, so each part is then itself divided into two, and so on. Therefore, the length of the signal being analysed has to be a power of 2, to make it evenly divide into two parts recursively.
The details of this algorithm are beyond the scope of the course, but it is a beautiful example of how an elegant algorithm can make computation very fast.
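Just to give a flavour of that elegance, a radix-2 version can be written in a few lines (a sketch only, and certainly not production quality):

```python
import cmath

def fft(x):
    # Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of 2.
    n = len(x)
    if n == 1:
        return x
    even = fft(x[0::2])   # solve the two half-length problems...
    odd = fft(x[1::2])
    twiddles = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    # ...then combine the solutions
    return [even[k] + twiddles[k] for k in range(n // 2)] + \
           [even[k] - twiddles[k] for k in range(n // 2)]
```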
If you really want to learn this algorithm, then refer to the classic textbook:
Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck, Discrete-Time Signal Processing, 2nd edition (Upper Saddle River, NJ: Prentice Hall, 1999)
We would rarely implement an FFT because there are excellent implementations available in standard libraries for all major programming languages. But, you can’t call yourself a proper speech processing engineer until you have implemented the FFT, so add it to your bucket list (after Dynamic Programming and before Expectation-Maximisation)!
You can see my not-fast-enough FFT implementation here, followed by a much faster implementation from someone else (which is less easy to understand).
Let’s separate out several different processes:
The Mel scale is applied to the spectrum (by warping the x axis), before taking the cepstrum. This is effectively just a resampling of the spectrum, providing higher resolution (more parameters) in the lower frequencies and lower resolution at higher frequencies.
Taking the cepstrum of this warped spectrum doesn’t change the cepstral transform’s abilities to separate source and filter. But it does result in using more parameters to specify the shape of the lower, more important, frequency range.
The filterbank does several things all at once, which can be confusing. It is a way of smoothing out the harmonics, leaving only the filter information. By spacing the filter centre frequencies along a Mel scale we can also use it to warp the frequency axis. Finally, it also reduces the dimensionality from the raw FFT (which has 100s or 1000s of dimensions) to maybe 20-30 filters in the filter bank.
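Here is a rough sketch of how a triangular mel filterbank can be built and applied to an FFT power spectrum; the number of filters, FFT length and sample rate are typical but arbitrary choices.

```python
import numpy as np

def hz_to_mel(f):
    # one common form of the mel scale
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=512, fs=16000):
    # Filter centre frequencies are equally spaced on the mel scale,
    # then converted back to Hz and finally to FFT bin indices.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):      # rising edge of the triangle
            fbank[i, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):     # falling edge of the triangle
            fbank[i, k] = (right - k) / max(right - centre, 1)
    return fbank

# Applying it: the power spectrum has n_fft//2 + 1 values from the FFT;
# the filterbank reduces these to n_filters values, smoothing away the
# harmonics and warping the frequency axis at the same time.
# filterbank_energies = mel_filterbank() @ power_spectrum
```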
Note: J&M’s figure 9.14 is taken from Taylor, and they made a mistake in the caption. See this topic for the correction.
The diagram in Taylor is correct.
You can work this out yourself from first principles: taking the log will compress the vertical range of the spectrum, bringing the very low amplitude components up so we can see them, and bringing the high amplitudes (the harmonics, in this case) down.
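A quick numerical check of that (values invented): amplitudes spanning six orders of magnitude become a modest range on a log (dB) scale.

```python
import math

for amplitude in [0.001, 1.0, 1000.0]:
    print(amplitude, 20 * math.log10(amplitude), "dB")
# 0.001 -> -60 dB, 1.0 -> 0 dB, 1000.0 -> +60 dB: a factor of a million
# squeezed into a range you can comfortably plot on one axis.
```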
J&M messed up when they quoted it – a lesson in not quoting something unless you really understand it, perhaps!? Or maybe a printer’s error.