This is a bit harder than isolated digits, but not much. The key is to realise that there may be silence between the words, so you will need an acoustic model (i.e., an HMM) for that. Hint: you labelled the silence earlier as junk.
You’ll also need to write a new language model in the form of an HTK grammar (hint: see the HTK manual) and then convert it to the finite-state format that HVite needs, like this:
$ HParse resources/digit_sequence_grammar resources/digit_sequence_grammar_as_network
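As a rough illustration only, a grammar file for digit sequences might be created as sketched below. The word names (zero to nine, and junk) are assumptions: they must match whatever labels and dictionary entries you are already using, and the exact placement of the optional junk is a design choice you should check against the HTK manual rather than take from here. In HTK grammar notation, `|` separates alternatives, `[ ]` marks an optional item and `< >` means one or more repetitions.

```shell
# Sketch: write an HTK grammar file for digit sequences.
# ASSUMPTION: the word names below match your own dictionary and labels.
mkdir -p resources

# The quoted 'EOF' stops the shell from expanding $digit inside the grammar.
cat > resources/digit_sequence_grammar <<'EOF'
$digit = zero | one | two | three | four | five | six | seven | eight | nine;
( [junk] < $digit [junk] > )
EOF

# Show the grammar that was written.
cat resources/digit_sequence_grammar
```

The HParse command above would then convert this file into the network that HVite reads.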
Evaluating the output of the recogniser is no longer so easy – it might insert or delete digits, as well as substitute incorrect digits. You can use your existing training data, but you’ll need to make a different test set, containing digit sequences, and a corresponding label file.
How to write the grammar and convert it with HParse is described in the HTK manual (3.4.1), Chapter 12: Networks, Dictionaries and Language Models. Specifically, have a look at section 12.3.
You can also look at the HTK manual for HResults – there are some useful options for showing more detail of the scoring procedure, such as the flag -t.
Some helpful forum posts:
- https://speech.zone/forums/topic/finding-users-with-digit-sequences
- https://speech.zone/forums/topic/how-do-i-get-the-results-for-multiple-speakers-at-once/
- https://speech.zone/forums/topic/recognising-digit-sequences/
Remember you don’t want to count junk labels when scoring! (See the last forum post above.)
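One possible way to put the scoring advice together is sketched here. All three file names are hypothetical placeholders, not the ones used in this assignment, and you should check the options against the HResults section of the HTK manual before relying on them. The sketch assumes that -e maps a label onto another label, and that the special name ??? is HResults’ null class, so that labels mapped onto it are ignored.

```shell
# Sketch: score digit-sequence output while ignoring junk labels.
# ASSUMPTION: these paths are placeholders; substitute your own files.
REFERENCE_MLF="reference_labels.mlf"   # correct labels for the test set
WORD_LIST="resources/word_list"        # list of words/models being scored
RECOGNISED_MLF="recognised.mlf"        # recogniser output from HVite

# -e "???" junk  maps junk onto the null class, so it is not counted
# -t             prints an aligned transcription for files with errors
# "???" is quoted so the shell does not treat it as a filename pattern.
if command -v HResults >/dev/null 2>&1 && [ -f "$RECOGNISED_MLF" ]; then
    HResults -t -e "???" junk -I "$REFERENCE_MLF" "$WORD_LIST" "$RECOGNISED_MLF"
else
    echo "HResults or its input files not found; skipping scoring"
fi
```

With -t, the per-file alignments make it easier to see which errors are insertions, deletions or substitutions.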