Ladefoged (Elements) – Chapter 6 – Hearing

Some understanding of human hearing will be helpful for engineering suitable features to extract from the waveform for automatic speech recognition.

Ladefoged (Elements) – Chapter 5 – Resonance

Now we turn to speech production. Resonance is how the vocal tract shapes the quality of sound, for example to produce different vowels.

Ladefoged (Elements) – Chapter 4 – Wave analysis

The waveform is rarely the best representation for analysing sound. This chapter is an intuitive introduction to why.

Ladefoged (Elements) – Chapter 3 – Quality

Quality is a general term for aspects of sound other than loudness and pitch. Different vowels, for example, differ in sound quality.

Ladefoged (Elements) – Chapter 2 – Loudness and pitch

These are perceptual phenomena that relate to physical properties of sound.

Ladefoged (Elements) – Chapter 1 – Sound waves

Very brief, but a reasonable place to start if you have no idea what sound is.

Jurafsky & Martin (3rd ed) – Hidden Markov models

An overview of Hidden Markov Models, the Viterbi algorithm, and the Baum-Welch algorithm.

Jurafsky & Martin (2nd ed) – Section 8.3 – Prosodic Analysis

Beyond getting the phones right, we also need to consider other aspects of speech such as intonation and pausing.

Jurafsky & Martin (2nd ed) – Section 8.2 – Phonetic Analysis

Each word in the normalised text needs a pronunciation. Most words will be found in the dictionary, but for the remainder we must predict pronunciation from spelling.

Jurafsky & Martin (2nd ed) – Section 8.1 – Text Normalisation

We need to normalise the input text so that it contains a sequence of pronounceable words.

Jurafsky & Martin – Section 9.8 – Evaluation

In connected speech, three types of error are possible: substitutions, insertions, or deletions of words. It is usual to combine them into a single measure: Word Error Rate.
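As a rough illustration of how the three error types combine, the sketch below computes Word Error Rate as (substitutions + deletions + insertions) divided by the number of words in the reference, using a standard minimum edit distance over words. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for this example, not something taken from the reading.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word Error Rate via minimum edit distance over words.

    Assumes words are separated by whitespace (an illustrative choice).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # d[i][j] = minimum number of substitutions, deletions and insertions
    # needed to turn the first i reference words into the first j
    # hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)

    return d[len(ref)][len(hyp)] / len(ref)


# One substitution (sat -> sit) and one deletion (the) against 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions are counted against the length of the reference, this measure can exceed 1 (that is, 100%) when the recogniser outputs many extra words.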

Jurafsky & Martin – Section 9.7 – Embedded training

Embedded training means that the data are transcribed, but that we don’t know the time alignment at the model or state levels.