monophone model
I’m not sure I understand the HTK Book correctly: is a phone-level transcription needed when creating a monophone model? If so, is it essential? Why can’t we uniformly segment a word’s feature vectors into (number of phones × number of HMM states per phone) parts, just as we did for the word-level models?
For each training utterance, there will be a word-level transcription. A phone-level transcription is needed to determine which phone models to join together to make the utterance model. It can be created simply by looking each word up in the dictionary and replacing it with its phone sequence.
These transcriptions do not need to have any timing information though – they are just sequences of words or phones.
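As a rough illustration of the dictionary lookup described above, here is a minimal Python sketch. The dictionary entries, the phone symbols, and the function name `words_to_phones` are all made up for this example; they are not taken from HTK or from the course materials.

```python
# Toy pronunciation dictionary (hypothetical entries, for illustration only).
PRON_DICT = {
    "one": ["w", "ah", "n"],
    "two": ["t", "uw"],
    "three": ["th", "r", "iy"],
}


def words_to_phones(words, pron_dict):
    """Expand a word-level transcription into a phone-level one.

    No timing information is involved: the input is a sequence of words
    and the output is a sequence of phones.
    """
    phones = []
    for word in words:
        key = word.lower()
        if key not in pron_dict:
            raise KeyError(f"word not in dictionary: {word!r}")
        # Replace the word with its phone sequence.
        phones.extend(pron_dict[key])
    return phones


if __name__ == "__main__":
    print(words_to_phones(["one", "two", "three"], PRON_DICT))
```

The resulting phone sequence tells us which phone models to concatenate, in order, to form the model of the whole utterance.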
I guess I misunderstood what “transcription” means. So for a monophone model, we only need word-level labels, rather than phone-level labels?
In general, we only need transcriptions without time alignments to train HMMs, including for monophone models. The method for training models in such a situation is known as “embedded training” but this is slightly beyond the scope of the course.
But in the Digit Recogniser assignment, and in the theory part of the Speech Processing course, we are using a simpler method for training HMMs, which does require the data to be labelled with the start and end times of each model (the models are of words in the Digit Recogniser assignment).
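To make the uniform segmentation mentioned in the question concrete, here is a small sketch that divides the frames between a labelled start and end time equally among a model's emitting states. It only illustrates the idea, not how HTK implements it; the function name `uniform_segmentation` and the frame and state counts are assumptions chosen for the example.

```python
def uniform_segmentation(num_frames, num_states):
    """Split frame indices 0..num_frames-1 into num_states contiguous,
    near-equal segments, one per emitting state."""
    if num_states <= 0:
        raise ValueError("num_states must be positive")
    if num_frames < num_states:
        raise ValueError("need at least one frame per state")
    # Integer boundaries; with num_frames >= num_states no segment is empty.
    bounds = [i * num_frames // num_states for i in range(num_states + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(num_states)]


if __name__ == "__main__":
    # Hypothetical example: a word labelled as spanning 30 frames,
    # modelled by 3 phones with 3 emitting states each (9 states in total).
    for state, frames in enumerate(uniform_segmentation(30, 3 * 3)):
        print(f"state {state}: frames {frames.start}..{frames.stop - 1}")
```

Each state's Gaussian parameters could then be initialised from the frames assigned to it. This only works because the start and end of the segment are known from the labels.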