monophone model

This topic has 3 replies, 2 voices, and was last updated 4 years, 5 months ago by Simon.

- December 7, 2020 at 00:10 #13474
  Alice W
  Student
  I’m not sure whether I have understood the HTK Book correctly: does creating a monophone model require a phone-level transcription? If so, is it strictly necessary? Why can’t we uniformly segment a word’s feature vectors into (number of phones × number of HMM states per phone) parts, just as we did for the word-level models?
- December 7, 2020 at 12:49 #13484
  Simon
  Professor
  For each training utterance, there will be a word-level transcription. A phone-level transcription is needed in order to determine the phone models to join together to make an utterance model. The phone transcription might be created simply by looking each word up in the dictionary and replacing it with its phone sequence.
  
  These transcriptions do not need to have any timing information though – they are just sequences of words or phones.
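
To illustrate the dictionary lookup described above, here is a minimal sketch in Python. The dictionary contents, the phone symbols and the function name `words_to_phones` are all made up for illustration; this is not HTK's own tooling, and it assumes each word has exactly one pronunciation.

```python
from typing import Dict, List

# Hypothetical toy pronunciation dictionary (assumption: one pronunciation per word).
# A real dictionary would be loaded from a file and may list several variants.
TOY_DICT: Dict[str, List[str]] = {
    "one": ["w", "ah", "n"],
    "two": ["t", "uw"],
    "three": ["th", "r", "iy"],
}


def words_to_phones(words: List[str], lexicon: Dict[str, List[str]]) -> List[str]:
    """Replace each word with its phone sequence; no timing information is involved."""
    phones: List[str] = []
    for word in words:
        key = word.lower()
        if key not in lexicon:
            raise KeyError(f"word not in dictionary: {word!r}")
        phones.extend(lexicon[key])
    return phones


if __name__ == "__main__":
    # Word-level transcription of one utterance, as a plain sequence of words.
    print(words_to_phones(["one", "two", "three"], TOY_DICT))
```

The result is just a sequence of phone labels, which is enough to decide which phone models to join together into an utterance model.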
- December 7, 2020 at 13:36 #13485
  Alice W
  Student
  I guess I misunderstood what a transcription is. So for a monophone model, we only need word-level labels rather than phone-level labels?
- December 7, 2020 at 18:09 #13494
  Simon
  Professor
  In general, we only need transcriptions without time alignments to train HMMs, including for monophone models. The method for training models in such a situation is known as “embedded training” but this is slightly beyond the scope of the course.
  
  But in the Digit Recogniser assignment, and in the theory part of the Speech Processing course, we are using a simpler method for training HMMs, which does require the data to be labelled with the start and end times of each model (the models are of words in the Digit Recogniser assignment).
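
As a rough illustration of the uniform segmentation raised in the original question, the sketch below divides the frames of one labelled segment (known start and end times) evenly among a model's emitting states. The function name and the frame-index convention are assumptions made for this example; it only shows the arithmetic of an initial, equal-sized split and is not a description of what any particular HTK tool does internally.

```python
from typing import List, Tuple


def uniform_segmentation(
    start_frame: int, end_frame: int, num_states: int
) -> List[Tuple[int, int]]:
    """Split frames [start_frame, end_frame) into num_states contiguous,
    near-equal parts, one (start, end) pair per emitting state."""
    num_frames = end_frame - start_frame
    if num_states <= 0:
        raise ValueError("num_states must be positive")
    if num_frames < num_states:
        raise ValueError("need at least one frame per state")
    # Integer boundaries spread as evenly as possible across the segment.
    boundaries = [
        start_frame + (i * num_frames) // num_states
        for i in range(num_states + 1)
    ]
    return [(boundaries[i], boundaries[i + 1]) for i in range(num_states)]


if __name__ == "__main__":
    # A labelled segment spanning frames 20..49 and a model with 5 emitting states.
    for state, (begin, end) in enumerate(uniform_segmentation(20, 50, 5), start=1):
        print(f"state {state}: frames {begin}..{end - 1}")
```

For a word built from several phone models, `num_states` would be the number of phones multiplied by the number of emitting states per phone, which is the split proposed in the question.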