Finish

This module focused on putting together the remaining puzzle pieces for building an automatic speech recognition system with Hidden Markov Models (HMMs). From this you should have an understanding of how we combine information from acoustic models and a language model to estimate the likelihood of seeing a specific word (or word sequence) given a sequence of acoustic observations.
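
In symbols (with O standing for the sequence of acoustic observations and W for a candidate word sequence), this combination is usually written as

$$
\hat{W} = \operatorname*{argmax}_{W} P(W \mid O) = \operatorname*{argmax}_{W} \; p(O \mid W)\, P(W)
$$

where $p(O \mid W)$ comes from the acoustic model (the HMMs) and $P(W)$ from the language model.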

We also looked at the Baum-Welch algorithm for training HMMs (in this case, for acoustic modelling). This looks a lot like Viterbi training, but it takes into account the fact that we can only “guess” which HMM states emitted specific observed acoustic feature vectors (the “hidden” part of HMMs). We can use state occupancy probabilities (i.e., a soft alignment) to weight the contributions of specific feature vectors when updating state emission and transition parameters. This allows us to improve the models further after using Viterbi training for initialisation.
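
To make the idea of a soft alignment concrete, here is a minimal numpy sketch (not course code) of one Baum-Welch-style mean update, assuming each state has a single 1-D Gaussian emission with fixed variance; the toy transition matrix, means and observations are made up for illustration.

```python
# A minimal sketch of the idea behind Baum-Welch: compute state occupancy
# probabilities (a soft alignment) with the forward-backward algorithm, then
# use them to weight each observation's contribution to the updated means.
import numpy as np

def gaussian_likelihoods(obs, means, var=1.0):
    """b[j, t] = likelihood of observation obs[t] under state j's 1-D Gaussian."""
    diff = obs[None, :] - means[:, None]            # shape (num_states, T)
    return np.exp(-0.5 * diff**2 / var) / np.sqrt(2 * np.pi * var)

def state_occupancy(obs, A, pi, means):
    """Return gamma[j, t] = P(state j at time t | whole observation sequence)."""
    T, N = len(obs), len(pi)
    b = gaussian_likelihoods(obs, means)

    alpha = np.zeros((N, T))                        # forward probabilities
    alpha[:, 0] = pi * b[:, 0]
    for t in range(1, T):
        alpha[:, t] = (alpha[:, t - 1] @ A) * b[:, t]

    beta = np.zeros((N, T))                         # backward probabilities
    beta[:, -1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[:, t] = A @ (b[:, t + 1] * beta[:, t + 1])

    gamma = alpha * beta
    return gamma / gamma.sum(axis=0, keepdims=True) # normalise per time step

# Toy example: 2 emitting states, 6 observations (all values are made up).
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])        # left-to-right transitions
pi = np.array([1.0, 0.0])         # always start in state 1
means = np.array([0.0, 3.0])      # initial means (e.g. from Viterbi training)
obs = np.array([0.1, -0.2, 0.3, 2.8, 3.1, 2.9])

gamma = state_occupancy(obs, A, pi, means)
# Soft-alignment update: every observation contributes to every state's mean,
# weighted by how likely that state was to have emitted it.
new_means = (gamma @ obs) / gamma.sum(axis=1)
print(new_means)
```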

Hopefully, you can now see how we might extend the simple whole-word models we’ve been using for the ASR assignment to model (context-dependent) phones instead, and how this allows us to build a speech recogniser for a larger vocabulary.
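
As a small illustration of this kind of hierarchical composition, here is a sketch in which a word HMM is built by concatenating phone HMMs according to a pronunciation dictionary. The dictionary entries, phone labels and the 3-states-per-phone assumption are all illustrative, not taken from the assignment.

```python
# A minimal sketch of hierarchical composition: word models are built by
# concatenating phone HMMs according to a pronunciation dictionary, so new
# words can be added without recording new acoustic training data for them.

PRONUNCIATIONS = {
    "yes": ["y", "eh", "s"],   # illustrative pronunciations only
    "no":  ["n", "ow"],
}

# Assume each phone is modelled by a 3-state left-to-right HMM; here we just
# represent a phone model by its list of state names.
def phone_hmm(phone):
    return [f"{phone}_{i}" for i in range(1, 4)]

def word_hmm(word):
    """Concatenate the phone HMMs given by the pronunciation dictionary."""
    states = []
    for phone in PRONUNCIATIONS[word]:
        states.extend(phone_hmm(phone))
    return states

print(word_hmm("yes"))
# ['y_1', 'y_2', 'y_3', 'eh_1', 'eh_2', 'eh_3', 's_1', 's_2', 's_3']
```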

We went through these concepts quite quickly and at a relatively high level. The important thing for this course is to get a conceptual understanding. But to understand what is actually happening, you’ll need to look more closely at the maths and at how the various ASR components are put together for large-vocabulary speech recognition. There’s certainly a lot more to ASR than we were able to cover in this course. If you want to go further, we recommend you take the Automatic Speech Recognition course in semester 2!

What you should know

  • Hierarchy of models:
    • How can we put different word (and/or subword) HMMs (e.g. phone HMMs) together to produce an HMM representing a multi-word utterance?
    • Where do the language model probabilities appear in this?
    • Why do ASR systems often do pruning/beam search?
  • Conditional Independence and the forward algorithm
    • What does the forward algorithm calculate?
    • How does this differ from the Viterbi algorithm? (i.e. max versus sum; see the sketch after this list)
  • HMM training
    • What happens during Viterbi Training?
      • What is the goal of Viterbi training? (i.e. updating the HMM parameters so as to maximise the probability of the training data given the HMM)
      • What are the 2 main steps that are repeated on each iteration of Viterbi training? (i.e.
        alignment, parameter estimation)
    • In general terms, what is the main difference between Viterbi training (i.e. HInit in HTK) and
      Baum-Welch training (i.e. HRest in HTK)?
      • What’s a “hard alignment” between states and observations?
      • What’s a “soft alignment” between states and observations?
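
To illustrate the “max versus sum” point from the list above, here is a minimal sketch (with made-up numbers) showing that the forward and Viterbi recursions have exactly the same structure and differ only in whether we sum or maximise over the predecessor states.

```python
# Toy comparison of the forward and Viterbi recursions: same dynamic
# programme, but one sums over previous states and the other takes the max.
import numpy as np

def forward_prob(obs_lik, A, pi):
    """P(O | model): total probability summed over all state sequences."""
    alpha = pi * obs_lik[:, 0]
    for t in range(1, obs_lik.shape[1]):
        alpha = (alpha @ A) * obs_lik[:, t]        # sum over previous states
    return alpha.sum()

def viterbi_prob(obs_lik, A, pi):
    """Probability of the single best (most likely) state sequence."""
    delta = pi * obs_lik[:, 0]
    for t in range(1, obs_lik.shape[1]):
        delta = (delta[:, None] * A).max(axis=0) * obs_lik[:, t]  # max, not sum
    return delta.max()

# obs_lik[j, t] = likelihood of observation t under state j (made-up numbers).
obs_lik = np.array([[0.8, 0.6, 0.1],
                    [0.2, 0.4, 0.9]])
A = np.array([[0.7, 0.3],
              [0.1, 0.9]])
pi = np.array([0.9, 0.1])

print(forward_prob(obs_lik, A, pi))   # >= Viterbi result: it includes every path
print(viterbi_prob(obs_lik, A, pi))
```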

Key Terms

  • Uniform Segmentation
  • Viterbi Training
  • Baum-Welch algorithm
  • Forward probability
  • The Forward algorithm
  • The Backward algorithm
  • Soft alignment
  • Hard alignment
  • State occupancy probability
  • Beam search
  • Pruning
  • Composition/compilation of finite state models
  • Language model
  • Acoustic model
  • Hierarchical composition
  • Embedded training
  • Subword unit

That’s the end of the course!