In connected speech, three types of error are possible: substitutions, insertions, and deletions of words. It is usual to combine them into a single measure, the Word Error Rate: the total number of errors divided by the number of words in the reference transcript.
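Word Error Rate can be computed with a standard Levenshtein (minimum edit distance) alignment between the reference and hypothesis word sequences. A minimal sketch (the example sentences are invented):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    found via minimum edit distance over words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting i words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution (or match)
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# one deleted word out of six reference words gives WER = 1/6
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```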
Jurafsky & Martin – Section 9.7 – Embedded training
Embedded training means that the data are transcribed, but that we don’t know the time alignment at the model or state levels.
Young et al: Token Passing
My favourite way of understanding how the Viterbi algorithm is applied to HMMs. Can also be helpful in understanding search for unit selection speech synthesis.
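The token passing idea can be sketched in a few lines: each state holds one token (a log probability plus a state history), and at every frame each token is passed along every outgoing transition, with each state keeping only the best arriving token (the Viterbi max). This toy version assumes observation log-likelihoods and transition log probabilities are simply given as lists, which is not how a real recogniser organises them:

```python
import math

def token_passing(obs_loglikes, log_trans):
    """Toy token passing.
    obs_loglikes[t][s]: log-likelihood of frame t in state s (assumed given).
    log_trans[s][s2]: log transition probability s -> s2 (-inf if no arc)."""
    n_states = len(log_trans)
    # one token per state; the start token sits in state 0 with log prob 0
    tokens = [(0.0, []) if s == 0 else (-math.inf, []) for s in range(n_states)]
    for frame in obs_loglikes:
        new_tokens = []
        for s2 in range(n_states):
            # pass every token along its transition into s2; keep only the best
            logp, path = max(
                (tokens[s][0] + log_trans[s][s2], tokens[s][1] + [s2])
                for s in range(n_states)
            )
            new_tokens.append((logp + frame[s2], path))
        tokens = new_tokens
    # the winning token carries the Viterbi log probability and state sequence
    return max(tokens)
```

The appeal of this formulation is that extending it to connected-word recognition only requires letting tokens propagate between models.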
Jurafsky & Martin – Section 9.5 – The lexicon and language model
Simply mentions the lexicon and language model and refers the reader to other chapters.
Taylor – Section 12.3 – The cepstrum
By using the logarithm to convert a multiplication into a sum, the cepstrum separates the source and filter components of speech.
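The key step can be illustrated numerically. In the spectral domain, speech is (approximately) the product of a source spectrum and a filter response; after taking logs, that product becomes a sum. The spectra below are invented toy values, just to show the identity:

```python
import math

# Toy 4-bin magnitude spectra (invented values, for illustration only):
source = [1.0, 0.8, 0.6, 0.4]   # excitation source |S(f)|
filt   = [2.0, 1.5, 1.0, 0.5]   # vocal tract filter |H(f)|

# In the spectral domain, speech is the product of source and filter:
speech = [s * h for s, h in zip(source, filt)]

# The logarithm turns that product into a sum:
log_speech = [math.log(x) for x in speech]
log_sum = [math.log(s) + math.log(h) for s, h in zip(source, filt)]
# log_speech equals log_sum (up to floating point), so the two components
# are now additive; an inverse DFT of the log spectrum (the cepstrum) then
# separates them, because the slowly-varying filter and rapidly-varying
# source end up at different quefrencies.
```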
Holmes & Holmes – Chapter 10 – Front-end analysis for ASR
Covers filterbank and MFCC features. The material on linear prediction is out of scope.
Sharon Goldwater: Basic probability theory
An essential primer on this topic. You should consider this reading ESSENTIAL if you haven’t studied probability before or it’s been a while. We’re adding this to the readings in Module 7 to give you some time to look at it before we really need it in Module 9 – mostly we need the concepts of conditional probability and conditional independence.
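If it has been a while, the definition of conditional probability, P(A|B) = P(A,B) / P(B), is quickly refreshed with a toy table of joint counts (the counts below are invented):

```python
# Toy joint counts (invented for illustration): how often each
# (weather, activity) pair was observed.
counts = {
    ("rain", "indoor"): 30, ("rain", "outdoor"): 10,
    ("sun",  "indoor"): 20, ("sun",  "outdoor"): 40,
}
total = sum(counts.values())

p_rain = sum(c for (w, _), c in counts.items() if w == "rain") / total  # P(rain) = 0.4
p_joint = counts[("rain", "indoor")] / total                            # P(rain, indoor) = 0.3
p_indoor_given_rain = p_joint / p_rain                                  # P(indoor | rain) = 0.75
```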
Sharon Goldwater: Vectors and their uses
A nice, self-contained introduction to vectors and why they are a useful mathematical concept. You should consider this reading ESSENTIAL if you haven’t studied vectors before (or it’s been a while).
Jurafsky & Martin – Section 9.4 – Acoustic Likelihood Computation
Performing speech recognition with HMMs involves calculating the likelihood that each model emitted the observed speech. You can skip 9.4.1 Vector Quantization.
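The total likelihood of an observation sequence under one HMM is computed by the forward algorithm, summing over all state sequences. A minimal sketch, working in the probability domain (a real system works in log probabilities, and the numbers in the test are invented):

```python
def forward_likelihood(obs_probs, trans, initial):
    """Forward algorithm: P(O | model), summed over all state paths.
    obs_probs[t][s]: P(o_t | state s); trans[s][s2]: transition probability;
    initial[s]: initial state probability."""
    n_states = len(initial)
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = [initial[s] * obs_probs[0][s] for s in range(n_states)]
    for t in range(1, len(obs_probs)):
        alpha = [
            sum(alpha[s] * trans[s][s2] for s in range(n_states)) * obs_probs[t][s2]
            for s2 in range(n_states)
        ]
    return sum(alpha)
```

Replacing the inner `sum` with `max` would turn this into the Viterbi approximation, which keeps only the single best state sequence.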
Jurafsky & Martin – Section 9.3 – Feature Extraction: MFCCs
Mel-frequency Cepstral Coefficients are a widely used feature with HMM acoustic models. They are a classic example of feature engineering: manipulating the extracted features to suit the properties and limitations of the statistical model.
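One concrete piece of that engineering is the mel scale itself, which warps frequency to match perception. The standard formula (the constant 2595 version commonly used for MFCC filterbanks) is easy to try out; the 8000 Hz upper limit and the choice of 10 filters below are just illustrative values:

```python
import math

def hz_to_mel(f):
    """Common mel-scale formula used for MFCC filterbanks."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Filterbank centre frequencies are spaced equally on the mel scale,
# which spaces them increasingly far apart in Hz:
lo, hi = hz_to_mel(0.0), hz_to_mel(8000.0)
centres = [mel_to_hz(lo + (hi - lo) * i / 11) for i in range(1, 11)]
```

Note that 1000 Hz maps to roughly 1000 mel; the scale is near-linear below that and logarithmic above.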
Jurafsky & Martin – Section 9.2 – The HMM Applied to Speech
Introduces some notation and the basic concepts of HMMs.
Jurafsky & Martin – Section 9.1 – Speech Recognition Architecture
Most modern methods of ASR can be described as a combination of two models: an acoustic model and a language model. They are combined simply by multiplying their probabilities.
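That combination is just Bayes' rule: the recogniser picks the word sequence W maximising P(O|W) P(W), i.e. acoustic likelihood times language model probability. In practice this is done by adding log probabilities. A toy sketch, with invented probabilities for two candidate transcriptions of the same audio:

```python
import math

# Invented log probabilities, for illustration only:
candidates = {
    "recognise speech":   {"acoustic": math.log(1e-5), "lm": math.log(1e-3)},
    "wreck a nice beach": {"acoustic": math.log(2e-5), "lm": math.log(1e-6)},
}

# Multiplying probabilities = adding log probabilities:
best = max(candidates, key=lambda w: candidates[w]["acoustic"] + candidates[w]["lm"])
```

Here the second candidate fits the audio slightly better, but the language model makes the first far more probable overall, so it wins.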