› Forums › Automatic speech recognition › Hidden Markov Models (HMMs) › HMM training
- This topic has 5 replies, 3 voices, and was last updated 3 years, 8 months ago by Simon.
-
November 19, 2015 at 14:46 #685
Hi,
I don’t quite understand why the state sequence is hidden when training an HMM. After all, we know which model generated each observation in the training set. Can we not simply hand-label each observation in the training data with its correct state sequence, according to changes in the feature vectors?
For example, in the training data of the second assignment, we already know which digit model generated each observation (digit). We could then divide each observation into several states according to changes in the feature vectors. For example, we could divide an observation of “two” into two states, “[t]” and “[u:]”.
Thanks!
-
November 19, 2015 at 15:08 #691
We could certainly consider hand-labelling the data with the correct state sequence. In other words, for every frame in the training data, we would annotate it with the state of the model it aligns with.
But that would be very hard, for two reasons:
- How would we know what the correct state sequence is anyway?
- There are 100 frames per second, and we might have hours of data. It would take rather a long time to do this hand-labelling.
Your suggestion to divide the word models into sub-word units (in fact, phonemes) is a good idea, but we still wouldn’t want to hand-label the phones in a large speech corpus (phonetic transcription and alignment takes about 100 times real time: 100 hours of work, per hour of speech data).
But what if we wanted to have more than one state per phoneme, which we normally would do (3 emitting states per phoneme model is the usual arrangement in most systems)? How would we then hand-align the observations with the three sub-phonetic states?
We will see that the model can itself find alignments with the training data, so labelling at the phoneme level is not needed. In fact, we don’t even need to align the word boundaries; we just need to know the model sequence for each training utterance. The Baum-Welch algorithm takes care of everything else.
-
November 19, 2015 at 22:47 #694
Thanks! It is all clear to me now. Just a follow-up question:
Can the model itself find the most appropriate number of states? Or is it by convention predetermined to be three states for each phoneme?
Intuitively, I guess there is a limit on the number of states in each model, i.e. no larger than the number of frames in the observation sequence generated by the model. Is that correct?
Thanks!
-
November 20, 2015 at 08:47 #695
Let’s be clear about terminology: the model itself can “do” nothing more than randomly generate observation sequences. If we want to do something else, then we need an algorithm to do that.
For example, the Baum-Welch algorithm finds the most likely values of the Gaussian parameters and transition probabilities, given some training data.
Typically, the number of states is set by the system builder (e.g., you!). Three emitting states per phoneme is the most common arrangement when using phoneme models. For whole-word models, the right number of states is less obvious.
You make a very good point about the number of states in the model: if the model has a simple left-to-right topology (determined by the transitions), then the minimum length of observation sequence that it can emit is equal to the number of states. If the number of states is large, this will be a problem: the model will assign zero probability (either in training or testing) to any sequence shorter than this.
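To make that last point concrete, here is a toy sketch (illustrative numbers, not any particular system) of a simple left-to-right transition matrix. Because each state can only self-loop or advance to the next state, the shortest path must spend at least one frame in every state, so any observation sequence shorter than the number of states gets zero probability:

```python
import numpy as np

# Hypothetical 4-state left-to-right HMM transition matrix.
# Row i gives the transition probabilities out of state i:
# each state can only stay (self-loop) or advance to the next state.
N = 4
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i] = 0.6      # self-loop
    A[i, i + 1] = 0.4  # advance to next state
A[N - 1, N - 1] = 1.0  # final emitting state self-loops

def min_path_prob(T):
    """Total probability of T-frame state paths that start in state 0
    and end in state N-1 (emissions ignored; toy illustration only)."""
    p = np.zeros(N)
    p[0] = 1.0                 # must start in the first state
    for _ in range(T - 1):     # T frames means T-1 transitions
        p = p @ A
    return p[N - 1]

print(min_path_prob(3))  # 0.0 -- fewer frames than states: impossible
print(min_path_prob(4))  # > 0 -- one frame per state is possible
```

Any observation sequence with fewer than N frames therefore receives zero probability from this model, whatever the emission distributions are.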
-
November 30, 2020 at 10:37 #13303
I don’t understand why we need to keep iteratively re-aligning the data with the Viterbi algorithm after we have updated the model parameters.
-
November 30, 2020 at 10:43 #13304
Before commencing Viterbi training, the model must have some parameters. These could come from uniform segmentation, for example. These parameters will not be optimal: they will not be the parameters that maximise the probability of the training data.
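As a concrete sketch of what uniform segmentation means (toy 1-D features and a hypothetical 3-state model, not any particular toolkit's implementation): we simply chop each training utterance into as many equal parts as the model has emitting states, and initialise each state from its part.

```python
import numpy as np

# Illustrative uniform segmentation: before any alignment exists, split the
# frames of an utterance into N equal contiguous chunks, one per state, and
# use each chunk to initialise that state's mean (1-D features for brevity).
obs = np.array([0.1, -0.2, 0.0, 1.9, 2.1, 2.0, 5.2, 4.8, 5.0])
N = 3
segments = np.array_split(obs, N)           # N contiguous chunks of frames
init_means = [seg.mean() for seg in segments]
print(init_means)  # roughly [-0.03, 2.0, 5.0]
```

These initial parameters are only a starting point; training then improves on them.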
For that initial model, we use the Viterbi algorithm to find an alignment between model and training data. This alignment is the best possible one given the current model’s parameters; but because those parameters are not yet optimal, the alignment will not necessarily be the best one overall.
This alignment is used to update the model parameters. The model is now better: it will generate the training data with a higher probability than with the previous model parameters.
Because the model is better, it will now be able to find a more probable alignment with the training data than the previous model. So we re-align the data, then use this new alignment to update the model parameters.
This can be repeated (iterated) a number of times. We stop when the model no longer improves, as measured by the probability of it generating the training data.
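The align / update / repeat loop described above can be sketched in a few lines of Python. Everything here is a toy of my own construction: 1-D observations, unit-variance Gaussian emissions, fixed 0.5/0.5 transition probabilities, and a 3-state left-to-right model; only the means are re-estimated. The structure, though, is the same: align with Viterbi, update parameters from the alignment, and stop when the probability of the training data stops improving.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 3
# Synthetic "utterance": 10 frames each from three regimes (true means 0, 2, 5).
obs = np.concatenate([rng.normal(m, 0.3, 10) for m in (0.0, 2.0, 5.0)])

log_self = log_next = np.log(0.5)  # fixed transition probabilities

def log_emit(x, means):
    # log N(x; mean, 1) up to a constant, for every (frame, state) pair
    return -0.5 * (x[:, None] - means[None, :]) ** 2

def viterbi_align(obs, means):
    """Best left-to-right state path and its log probability."""
    T = len(obs)
    e = log_emit(obs, means)
    delta = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    delta[0, 0] = e[0, 0]                      # must start in state 0
    for t in range(1, T):
        for j in range(N):
            stay = delta[t - 1, j] + log_self
            move = delta[t - 1, j - 1] + log_next if j > 0 else -np.inf
            if stay >= move:
                delta[t, j], back[t, j] = stay + e[t, j], j
            else:
                delta[t, j], back[t, j] = move + e[t, j], j - 1
    path = np.zeros(T, dtype=int)
    path[-1] = N - 1                           # must end in the last state
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path, delta[-1, N - 1]

# Initialise by uniform segmentation: N equal chunks, one per state.
segs = np.array_split(np.arange(len(obs)), N)
means = np.array([obs[s].mean() for s in segs])

prev_ll = -np.inf
for it in range(20):
    path, ll = viterbi_align(obs, means)                          # align
    means = np.array([obs[path == j].mean() for j in range(N)])   # update
    if ll - prev_ll < 1e-6:    # stop when the probability stops improving
        break
    prev_ll = ll

print(np.round(means, 2))  # means settle near the true values 0, 2, 5
```

Each iteration can only increase the probability of the best path, which is why the loop is guaranteed to converge.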
-