Covers filterbank, MFCC features. The material on linear prediction is out of scope.
Sharon Goldwater: Basic probability theory
An essential primer on this topic. You should consider this reading ESSENTIAL if you haven’t studied probability before or it’s been a while. We’re adding this the readings in Module 7 to give you some time to look at it before we really need it in Module 9 – mostly we need the concepts of conditional probability and conditional independence.
Sharon Goldwater: Vectors and their uses
A nice, self-contained introduction to vectors and why they are a useful mathematical concept. You should consider this reading ESSENTIAL if you haven’t studied vectors before (or it’s been a while).
Jurafsky & Martin – Section 9.4 – Acoustic Likelihood Computation
To perform speech recognition with HMMs involves calculating the likelihood that each model emitted the observed speech. You can skip 9.4.1 Vector Quantization.
Jurafsky & Martin – Section 9.3 – Feature Extraction: MFCCs
Mel-frequency Cepstral Co-efficients are a widely-used feature with HMM acoustic models. They are a classic example of feature engineering: manipulating the extracted features to suit the properties and limitations of the statistical model.
Jurafsky & Martin – Section 9.2 – The HMM Applied to Speech
Introduces some notation and the basic concepts of HMMs.
Jurafsky & Martin – Section 9.1 – Speech Recognition Architecture
Most modern methods of ASR can be described as a combination of two models: the acoustic model, and the language model. They are combined simply by multiplying probabilities.
Jurafsky & Martin – Chapter 9 introduction
The difficulty of ASR depends on factors including vocabulary size, within- and across-speaker variability (including speaking style), and channel and environmental noise.
Holmes & Holmes – Chapter 8 – Template matching and dynamic time warping
Read up to the end of 8.5 carefully. Try to read 8.6 as part of Module 7, but rest assured we will go over the concept of dynamic programming again in Module 9. We recommend you should skim 8.7 and 8.8 because the same general concepts carry forward into Hidden Markov Models (again, we’ll come back to this in Module 9). You don’t need to read 8.9 onwards. Methods like DTW are rarely used now in state of the art systems, but are a good way to start understanding some core ideas.
Taylor – Chapter 5 – Text decoding
Complementary to Jurafsky & Martin, Section 8.1.
Taylor – Chapter 6 – Prosody prediction from text
Predicting phrasing, prominence, intonation and tune, from text input.
Jurafsky & Martin – Section 8.4 – Diphone Waveform Synthesis
A simple way to generate a waveform is by concatenating speech units from a pre-recorded database. The database contains one recording of each required speech unit.