Finish

This module introduced a conceptually hard algorithm: dynamic time warping. Sometimes, the only way to truly understand an algorithm is to implement it. The following exercise involves programming in Python, and so is beyond the scope of this course. But, if you can program, then you may find this quite helpful in developing your understanding of dynamic programming:

The material in this module on feature extraction was left incomplete. That’s because we don’t yet fully understand what properties we would like our feature vector to have. That will depend on what we are going to do with those feature vectors. The approach of measuring distance to exemplars in feature space is too simple because it fails to account for variability in at least two ways. First, all dimensions of the feature vector are treated as equally important, even though they almost certainly are not. Second, and more importantly, the single stored exemplar fails to capture the natural variation found in speech.
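The first limitation can be made concrete with a minimal Python sketch. The function names, labels, and feature values below are hypothetical, chosen only for illustration. Each class is represented by a single stored exemplar, and an unknown vector is given the label of the nearest exemplar under Euclidean distance.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimensionality")
    # Every dimension contributes with the same weight.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_exemplar(vector, exemplars):
    """Return the label whose single stored exemplar is closest to `vector`."""
    return min(exemplars, key=lambda label: euclidean_distance(vector, exemplars[label]))

# Hypothetical two-dimensional exemplars: the second dimension has a much
# larger numerical range than the first.
exemplars = {
    "yes": [1.0, 200.0],
    "no": [3.0, 210.0],
}

# The first dimension is very close to "yes", but the second dimension
# dominates the distance, so the nearest exemplar is "no".
label = nearest_exemplar([1.1, 209.0], exemplars)
```

Because all dimensions are weighted equally, whichever dimension happens to have the largest numerical range decides the outcome, whether or not it is the most informative one.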

We will solve this problem by replacing the distance measure with a statistical model: the Gaussian probability density function. This will make us rethink the feature vector and do some feature engineering to give it the right properties.
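As a rough preview, here is a sketch of the log of a univariate Gaussian probability density function. The function name and the numbers are assumptions made for illustration, not the formulation used later in the course. The point is that the variance controls how strongly a deviation from the mean is penalised, which a plain distance to a single exemplar cannot do.

```python
import math

def gaussian_log_pdf(x, mean, variance):
    """Log of the univariate Gaussian density N(x; mean, variance)."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)

# The same observation, 2 units from the mean, under two hypothetical models.
narrow = gaussian_log_pdf(2.0, mean=0.0, variance=1.0)  # little variation expected
wide = gaussian_log_pdf(2.0, mean=0.0, variance=4.0)    # more variation expected

# The wide model assigns the observation a higher log density, because a
# deviation of this size is less surprising when more variation is expected.
more_plausible_under_wide = wide > narrow
```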

What you should know

  • What is the basic problem we are trying to solve in Automatic Speech Recognition (ASR)?
    • What is the ASR objective, i.e. finding the most probable transcript given a specific audio recording?
    • How is this framed mathematically?
    • How does this relate to Bayes’ rule?
    • What does the acoustic model P(O|W) represent?
    • What does the language model P(W) represent?
  • Cochlea, mel scale, filterbank:
    • Non-linear human perception of sounds: mel scale, semitones
    • What are filterbanks? (more in module 8)
  • Feature vectors, sequences, sequences of feature vectors
    • How do we represent audio as a sequence of feature vectors?
  • Exemplars, distances
    • How do we measure the (Euclidean) distance between a pair of vectors?
    • Local versus global distance between sequences
    • What is a frame (i.e. vector) level alignment?
  • Dynamic time warping:
    • How can you calculate the distance between two sequences of feature vectors of different lengths?
    • What is the goal of Dynamic Time Warping? What are its inputs and outputs?
    • Why is DTW a dynamic programming algorithm?
    • Why is the alignment produced by DTW generally considered better than a linear alignment for measuring sequence similarity?
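
For readers who can program, the dynamic time warping questions above can be explored with a minimal Python sketch. It is one possible formulation under stated assumptions, not necessarily the exact one used in this module: the local distance is Euclidean, the allowed moves are one step along either sequence or both, and every move has the same weight. The names dtw, local_distance and backpointer are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Local distance between two feature vectors (frames)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(seq_a, seq_b, local_distance=euclidean_distance):
    """Return (global distance, frame-level alignment) between two sequences.

    The alignment is a list of (i, j) index pairs: frame i of seq_a is
    aligned with frame j of seq_b.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")

    # cost[i][j] is the lowest accumulated distance of any path ending at (i, j).
    cost = [[math.inf] * m for _ in range(n)]
    backpointer = [[None] * m for _ in range(n)]

    for i in range(n):
        for j in range(m):
            d = local_distance(seq_a[i], seq_b[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            # Dynamic programming step: the best path to (i, j) must extend
            # the best path to one of its allowed predecessors.
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            best_cost, best_prev = min(candidates, key=lambda c: c[0])
            cost[i][j] = d + best_cost
            backpointer[i][j] = best_prev

    # Trace back from the final cell to recover the alignment.
    path = []
    cell = (n - 1, m - 1)
    while cell is not None:
        path.append(cell)
        cell = backpointer[cell[0]][cell[1]]
    path.reverse()
    return cost[n - 1][m - 1], path

# Hypothetical one-dimensional "feature vectors": the second sequence is a
# slower version of the first, so the two have different lengths.
template = [[0.0], [1.0], [2.0]]
utterance = [[0.0], [0.0], [1.0], [2.0], [2.0]]
global_distance, alignment = dtw(template, utterance)
```

In this toy example the two sequences differ only in timing, so the global distance is zero and the alignment stretches the first and last frames of the shorter sequence. A linear alignment would instead pair frames that do not correspond and report a non-zero distance.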

Key Terms

  • transcription
  • objective
  • Bayes’ rule
  • acoustic model
  • language model
  • log scale
  • non-linear
  • mel scale
  • semitones
  • cochlea
  • filterbank
  • feature vector
  • sequence
  • distance
  • frame-level alignment
  • dynamic time warping
  • dynamic programming
  • cost function