E-step in EM algorithm
November 28, 2020 at 12:17 #13282
The definition of the E-step in the slides is a little confusing to me. Each iteration of EM updates the means and variances. Does the E-step correspond to:
1) the computation of the occupancy probability only, i.e. the probability of being in state s at frame i (so that the M-step covers both the computation of the mean and variance AND the assignment), OR
2) the computation of the mean and variance, i.e. the sum of the observations weighted by the occupancy probability (so that the M-step corresponds to the assignment only)?
It looks like in Jurafsky’s book, the E-step corresponds to 1).
Thank you.
November 29, 2020 at 11:27 #13284
The Expectation (E) step computes all the statistics necessary, such as state occupancy probabilities. A simple way to think of this is averaging. This is where all the computation happens because we are considering all possible state sequences.
The Maximisation (M) step updates the model parameters, using those statistics, to maximise the probability of the training data. This is a very simple equation requiring very little computation.
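To make "the statistics" concrete: in standard Baum-Welch notation (which may not match the slides' notation exactly), the occupancy probability of being in state j at frame t is computed from the forward probabilities and backward probabilities, which between them account for all possible state sequences:

\[
\gamma_t(j) \;=\; P(q_t = j \mid O, \lambda) \;=\; \frac{\alpha_t(j)\,\beta_t(j)}{\sum_{k=1}^{N} \alpha_t(k)\,\beta_t(k)}
\]

where \(\alpha_t(j)\) and \(\beta_t(j)\) are the forward and backward probabilities for state j at frame t, and N is the number of states.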
In theoretical explanations of Expectation-Maximisation, which for HMMs is called the Baum-Welch algorithm, the E step is typically defined as computing the statistics. For example, we compute the state occupancy probabilities so that the M step can use them as the weights in a weighted sum of the observations (i.e., the training data).
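For example, the updated mean of the Gaussian for state j has the general form of exactly such a weighted sum, normalised by the total occupancy (same illustrative notation as above, where \(o_t\) is the observation vector at frame t):

\[
\hat{\mu}_j \;=\; \frac{\sum_{t=1}^{T} \gamma_t(j)\, o_t}{\sum_{t=1}^{T} \gamma_t(j)}
\]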
In a practical implementation of the E step, we actually compute the weighted sum as we go. We create a simple data structure called an accumulator which has two parts: one to sum up the numerator and the other to sum up the denominator of each M step equation (e.g., for the mean of a particular Gaussian, Jurafsky and Martin 2nd edition equation 9.35 in Section 9.4.2). There will be one accumulator for each model parameter. The M step is then simply to divide the numerator by the denominator and update the model parameter (and to do that for each model parameter).
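To make the accumulator idea concrete, here is a minimal sketch in Python for the mean of one Gaussian. The class and variable names are invented purely for illustration (this is not code from any particular toolkit), and the toy data stands in for real feature frames and forward-backward occupancies:

```python
import numpy as np

class Accumulator:
    """Accumulates the numerator and denominator of one M-step update
    (here, the mean of a single Gaussian)."""

    def __init__(self, dim):
        self.numerator = np.zeros(dim)  # running sum of gamma * observation
        self.denominator = 0.0          # running sum of gamma (total occupancy)

    def add(self, gamma, observation):
        """E step: add one frame's statistics, weighted by its occupancy probability."""
        self.numerator += gamma * observation
        self.denominator += gamma

    def new_mean(self):
        """M step: a single division gives the updated mean."""
        return self.numerator / self.denominator


# Toy data standing in for real feature frames and their occupancy probabilities
rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 39))   # 100 frames of 39-dimensional features
occupancies = rng.random(100)             # gamma_t(j) for one state j, per frame

acc = Accumulator(dim=39)
for gamma, obs in zip(occupancies, frames):
    acc.add(gamma, obs)                   # E step: accumulate weighted statistics

mean_j = acc.new_mean()                   # M step: divide numerator by denominator
```

The point of the sketch is that the expensive part (the loop over frames, weighted by occupancy) all happens in the E step, while the M step is just the final division.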
For Speech Processing, aim for a conceptual understanding of Expectation-Maximisation. You would not need to reproduce these equations under exam conditions.
Now that I’ve mentioned accumulators, here is something right at the edge of the course coverage:
For large-vocabulary connected speech recognition, we model phonemes. If there is enough data, we can make these models context-dependent and thus get a lower WER. The context is typically the left and right phoneme, and such models are called triphones. We need to share parameters across clusters of similar triphones because there won't be enough (or sometimes any) training examples for some triphones. This is called "tying". It turns out to be very easy to implement training for tied triphones: we just make a single accumulator for each cluster of shared model parameters.
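As a hypothetical continuation of the sketch above (reusing the Accumulator class, numpy and rng from that block): the triphone-to-cluster map here is hard-coded purely for illustration, whereas in practice it would come from decision-tree clustering.

```python
# Hypothetical tying: several triphone states share one set of Gaussian parameters.
# The cluster assignments below are invented for illustration only.
cluster_of = {
    "t-ih+n":  "cluster_1",
    "d-ih+n":  "cluster_1",   # tied with t-ih+n: shares the same parameters
    "s-ih+ng": "cluster_2",
}

# One accumulator per cluster of tied parameters, not one per triphone
accumulators = {c: Accumulator(dim=39) for c in set(cluster_of.values())}

def accumulate(triphone, gamma, observation):
    """E step: a frame aligned to any triphone in a cluster contributes to
    that cluster's single shared accumulator."""
    accumulators[cluster_of[triphone]].add(gamma, observation)

# Toy accumulation: a few frames, each aligned to some triphone state
for triphone in ["t-ih+n", "d-ih+n", "s-ih+ng", "t-ih+n"]:
    accumulate(triphone, gamma=rng.random(), observation=rng.standard_normal(39))

# M step: one shared mean per cluster, used by every triphone tied to it
tied_means = {c: acc.new_mean() for c, acc in accumulators.items()}
```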