Forum Replies Created
Put this line
set -x
somewhere near the start of your script. It will cause the complete HVite command line to be printed just before it is executed. That will help you see what is wrong with the arguments you are passing to HVite.
The shell is trying to execute “resources/word_list_seq”, which obviously should never happen – perhaps you have a space after one of the “\” line continuations, or a blank line or comment in the middle of the HVite command.
Next time I have a wet Sunday afternoon with nothing better to do, I may indeed do a longer animation for connected word token passing. Don’t hold your breath though.
Yes – you seem to have got it: tokens that have “looped back” from the final state of one word model to the initial state of another (or indeed the same) word model will simply compete with whatever tokens they encounter there. The mechanism is exactly the same as tokens meeting within words.
Tokens will need to remember which words they have been through, so that we can read the most likely word sequence off the winning token. Appending a word label to a token’s internal word link record every time it leaves a word model is the way to do that, yes.
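Here is a minimal sketch in Python of that idea (not HTK's actual code – all the names are made up): a token carries a word link record, and a word label is appended each time the token leaves a word model.

from dataclasses import dataclass
from typing import Optional

@dataclass
class WordLink:
    word: str                       # label of the word model the token has just left
    previous: Optional["WordLink"]  # link back to the rest of the history

@dataclass
class Token:
    log_prob: float                 # accumulated log probability of this token's path
    history: Optional[WordLink] = None

def leave_word_model(token: Token, word: str) -> Token:
    # Append a word label to the token's word link record as it exits a word model.
    return Token(token.log_prob, WordLink(word, token.history))

def read_word_sequence(token: Token) -> list:
    # Trace back the winning token's word link record to recover the word sequence.
    words = []
    link = token.history
    while link is not None:
        words.append(link.word)
        link = link.previous
    return list(reversed(words))

# A token that has passed through "yes" and then "no":
t = leave_word_model(leave_word_model(Token(0.0), "yes"), "no")
print(read_word_sequence(t))   # ['yes', 'no']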
Don’t think of lots of HMMs running in parallel. Think of the compiled network as one big HMM. That’s because it really is just one big HMM.
It just happens to have a more interesting topology than the individual left-to-right word models, but that makes no difference to token passing. There’s still a start state, some emitting states, some arcs, and an end state. Turn the handle on token passing, and it works exactly as it would for an isolated left-to-right model.
No, there are not 1100 HMMs – the topology of the compiled network is fixed and isn’t changed just because tokens are flowing around it.
Watch the token passing animation – the network there has been compiled from a simple language model which generates sentences from the set “Yes” or “No”, and we have a 2-state whole-word-model HMM for each of those words. As the tokens flow around it, the network itself (i.e., the “one big HMM”) never changes.
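To make that concrete, here is a rough Python sketch of what that compiled Yes/No network looks like as a data structure (the probability values are made up, and this is an illustration, not HTK's internal representation). The structure is built once and never changes while tokens flow around it.

# Emitting states (from the two 2-state word HMMs) plus non-emitting START and END.
states = ["START", "yes_1", "yes_2", "no_1", "no_2", "END"]

# Each arc: (from_state, to_state, probability, origin)
arcs = [
    # arcs taken from the language model (here, an assumed P(yes) = P(no) = 0.5)
    ("START", "yes_1", 0.5, "language model"),
    ("START", "no_1",  0.5, "language model"),
    # arcs taken from the word HMMs (self-loops and forward transitions)
    ("yes_1", "yes_1", 0.6, "HMM"),
    ("yes_1", "yes_2", 0.4, "HMM"),
    ("yes_2", "yes_2", 0.7, "HMM"),
    ("yes_2", "END",   0.3, "HMM"),
    ("no_1",  "no_1",  0.6, "HMM"),
    ("no_1",  "no_2",  0.4, "HMM"),
    ("no_2",  "no_2",  0.7, "HMM"),
    ("no_2",  "END",   0.3, "HMM"),
]

# After compilation, an arc is just an arc: token passing treats them all alike.
print("states:", states)
for frm, to, p, origin in arcs:
    print(f"{frm:>6} -> {to:<6} p={p}  (arc from the {origin})")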
1. Yes, that is all correct.
2. Yes, also all correct.
It’s easiest to think about the language model and HMMs having been “compiled together” into a network, and performing token passing around that. That’s exactly how HTK’s HVite is implemented.
In this compiled network, there are emitting states from the HMMs, and arcs (i.e., transitions). Some of the arcs are from the HMMs, and others are from the language model. But after compilation, that doesn’t really matter and tokens can move along any transition to reach another emitting state.
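Here is a sketch of the token-passing update itself over such a network, in Python, under heavy simplification (one token per state, no pruning, made-up numbers). Notice that the update does not care where an arc originally came from.

import math

NEG_INF = float("-inf")

# A toy compiled network: arcs as (from_state, to_state, transition_prob).
arcs = [
    ("a", "a", 0.6), ("a", "b", 0.4),
    ("b", "b", 0.7), ("b", "a", 0.3),   # e.g. a "loop back" arc contributed by the language model
]
states = ["a", "b"]

def step(tokens, log_emission):
    # One frame of token passing: copy tokens along every arc, add the log
    # transition and log emission probabilities, keep only the best arrival.
    new_tokens = {s: NEG_INF for s in states}
    for frm, to, p in arcs:
        candidate = tokens[frm] + math.log(p) + log_emission[to]
        if candidate > new_tokens[to]:
            new_tokens[to] = candidate
    return new_tokens

# Start with all probability in state "a"; made-up log emission values for two frames.
tokens = {"a": 0.0, "b": NEG_INF}
for log_emission in [{"a": -1.0, "b": -2.0}, {"a": -1.5, "b": -0.5}]:
    tokens = step(tokens, log_emission)
print(tokens)

In the full system, each surviving token would also keep the word link record of whichever predecessor won, so the word history survives the max operation.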
The latter: simply the y-values of the first 12 points on the quefrency plot.
It might help if you forget for a moment about plotting the cepstrum (which is a slightly odd thing anyway) and just think about the coefficients as a series expansion of the spectral envelope (from a filterbank). They are “shape coefficients”.
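If you want to see those 12 numbers appear, here is a small Python sketch (using a synthetic frame of signal and the plain real cepstrum rather than MFCCs – all the settings are arbitrary):

import numpy as np

fs = 16000                                              # sample rate (Hz)
t = np.arange(400) / fs                                 # one 25 ms frame
frame = np.sin(2 * np.pi * 120 * t) * np.hanning(400)   # a fake voiced frame

log_spectrum = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
cepstrum = np.fft.irfft(log_spectrum)                   # the "quefrency plot" values

coefficients = cepstrum[:12]                            # the first 12 y-values
print(coefficients)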
In the specific case of a left-to-right HMM topology, you are right that the first and last observations in any observation sequence must indeed be emitted from the first and last states, respectively.
But, we still only know about some part of the state sequence, and the complete state sequence remains unknown to us: it is still a hidden random variable. It’s just that the distribution of this hidden random variable is ever so slightly restricted.
In the general case of an arbitrary model topology and an observation sequence that is longer than the shortest path through that model, this is not the case. But, even in this general case, we still know something about possible values of the hidden state sequence. Any state sequences that are shorter or longer than the observation sequence have zero probability, and non-zero-probability values of the state sequence are restricted to those of exactly the right length.
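Here is a tiny brute-force check of that in Python, for a strictly left-to-right model with 3 emitting states (entry only via the first state, exit only from the last, self-loops or next-state moves only) and an observation sequence of length 5:

from itertools import product

T = 5                                                # length of the observation sequence
allowed = {(1, 1), (1, 2), (2, 2), (2, 3), (3, 3)}   # left-to-right transitions only

valid = [
    seq for seq in product([1, 2, 3], repeat=T)
    if seq[0] == 1                                   # can only enter via state 1
    and seq[-1] == 3                                 # can only leave from state 3
    and all(pair in allowed for pair in zip(seq, seq[1:]))
]
for seq in valid:
    print(seq)

Every sequence printed starts in the first state and ends in the last one, and none of any other length is possible, because a non-zero-probability state sequence must be exactly as long as the observation sequence.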
Let’s separate out a few different aspects of this.
The storage space required for a covariance matrix is [latex]O(D^2)[/latex], where D is the dimensionality of the observation vector.
The computational cost can be worked out by looking at the vector-matrix-vector multiplication in the formula for a multivariate Gaussian – can you work that out?
But the real issue is the large number of model parameters — which is also [latex]O(D^2)[/latex] — that need to be estimated from data.
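For reference, this is the density in question (a standard result, with D-dimensional observation vector [latex]\mathbf{o}[/latex], mean [latex]\boldsymbol{\mu}[/latex] and covariance matrix [latex]\boldsymbol{\Sigma}[/latex]); the vector-matrix-vector term inside the exponential is the one to count:
[latex]
\mathcal{N}(\mathbf{o};\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{o}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{o}-\boldsymbol{\mu}) \right)
[/latex]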
No, in general Automatic Speech Recognition (ASR) systems have a fixed vocabulary and therefore, as you correctly state, any out-of-vocabulary (OOV) words would be recognised as similar-sounding in-vocabulary words (or, more likely, as a sequence of short words).
It is possible to build an open-vocabulary system, but this is somewhat unusual. The vast majority of ASR systems that you will read about in the literature have a fixed vocabulary.
The non-emitting (also known as ‘dummy’) start and end states are there to avoid having to separately specify two sets of parameters: the probability of starting in each state, and the probability of ending in each state. Using the non-emitting states allows us to write those parameters on transitions out of the starting non-emitting state, and into the ending non-emitting state. Some textbooks do not use the non-emitting states, and so the parameters of the model must also include these starting and ending state distributions. That’s messy, and easily avoided by doing things ‘the HTK way’.
In a left-to-right model, there will typically be only one transition from the starting non-emitting state: it goes to the first emitting state, with probability 1. But we could have additional transitions if we wished: this would allow the model to emit the first observation from one of the other states.
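As a made-up example in the HTK style, here is the transition matrix for a left-to-right model with 3 emitting states plus the two non-emitting states (so it is 5 x 5). The first row holds what would otherwise be a separate initial-state distribution (here, probability 1 of starting in the first emitting state), and the last column holds the probabilities of moving into the ending non-emitting state:
[latex]
A = \begin{pmatrix}
0 & 1.0 & 0 & 0 & 0 \\
0 & 0.6 & 0.4 & 0 & 0 \\
0 & 0 & 0.7 & 0.3 & 0 \\
0 & 0 & 0 & 0.8 & 0.2 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix}
[/latex]
Putting a non-zero value elsewhere in the first row is exactly the “additional transitions” mentioned above: it would allow the model to emit the first observation from a different state.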
All those dummy states are still there in the compiled model (language model + acoustic models) – they are joined up with arcs, and on those arcs are the language model probabilities.
The language model on its own is not an HMM – it’s just a finite state machine.
Your question is mainly about what the word models look like. Yes, we could indeed use a single emitting state per word. Then, we would be modelling just the average spectral envelope across the entire word duration. That may well be enough to distinguish words in a very small vocabulary system (e.g., isolated digits), but is a rather naive model.
Using more states per word gives us a finer temporal granularity. For example, 3 emitting states per word allows the modelling of (roughly speaking) the beginning, middle and end sounds of that word. Such a model is probably a better generative model of that word, and conversely a worse generative model of other words, so should lead to more accurate recognition.
In larger systems, we use one model per sub-word unit (e.g., phone-sized units) and then of course we will have multiple states per word.
Try it for yourself by experiment – it’s easy to vary the number of emitting states in a whole-word system. You’ll probably want to do such an experiment on a reasonably large multi-speaker dataset in order to get reliable results.
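As a sketch of how little has to change when you vary the number of emitting states, here is a small Python function (illustrative only – the 0.6/0.4 split is an arbitrary starting point that training would re-estimate anyway) that builds a left-to-right transition matrix, including the non-emitting entry and exit states, for any number of emitting states:

import numpy as np

def left_to_right_transp(num_emitting, self_loop=0.6):
    n = num_emitting + 2                  # add the non-emitting entry and exit states
    A = np.zeros((n, n))
    A[0, 1] = 1.0                         # entry state -> first emitting state
    for i in range(1, n - 1):
        A[i, i] = self_loop               # stay in the current state
        A[i, i + 1] = 1.0 - self_loop     # or move on to the next state (or the exit)
    return A                              # the exit state's row is all zeros

for num_states in (1, 3, 5):              # e.g. compare 1, 3 and 5 emitting states
    print(num_states, "emitting state(s):")
    print(left_to_right_transp(num_states))

Each extra emitting state adds one more row and column – in other words, one more region of the word to be modelled.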
That versus which
Examples:
1. The bicycle that I saw yesterday was red.
2. The bicycle, which I saw yesterday, was red.
This one is simple. If the part starting with ‘that’ or ‘which’ can be deleted and you still have a sentence that means the same thing, then it’s optional and you should use ‘which’. If you can’t delete it, then it’s obligatory and you should use ‘that’.
In Example 1, I am distinguishing between the bicycle that I saw yesterday and some other possible bicycles (perhaps one that I saw today). In Example 2, there is only one possible bicycle that I could be talking about. I could have not told you about seeing it yesterday and would still have communicated the same meaning: that it is red.
Another way to make the distinction is that clauses with ‘that’ are restrictive: they narrow down the scope of what you are talking about. Clauses with ‘which’ just add optional extra information without doing that: they are ‘nonrestrictive’.
Because the ‘which’ version is optional information, you will usually want to put some commas around it, as in the second example above.
As usual, Grammar Girl explains it well.
Less versus fewer
The traditional answer is that fewer is for countable things (sheep, people, days, apples,…) and less is for things you can’t count (water, excitement, pain, work, …). That is still my default answer and is the safe choice.
Grammar Girl is usually a good source for this sort of writing information, and there you’ll find some useful exceptions to the ‘is it countable?’ rule.
In reality, the distinction between less and fewer is not quite so clear and you can argue either way.
A general rule
When writing scientifically, the last thing you want is your reader fixating on your English usage instead of the actual content that you are trying to communicate. So, play it safe and follow conventions. If in doubt, find another construction that avoids tricky word choices that you might get wrong.
What is a stochastic model?
The term stochastic is the opposite of deterministic. We could also use words like ‘random’ or ‘probabilistic’ instead of ‘stochastic’.
An HMM is stochastic because the state sequence is a random variable: we simply don’t know what value it took, when the model generated a particular observation sequence.
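A quick illustration of that in Python (all the numbers are invented): generate from the same tiny model a few times, and the hidden state sequence comes out different almost every time.

import random

transitions = {1: [(1, 0.6), (2, 0.4)], 2: [(2, 0.7), ("end", 0.3)]}
means = {1: 0.0, 2: 3.0}

def generate():
    # Generate one observation sequence; the state sequence is chosen at random.
    state, states, observations = 1, [], []
    while state != "end":
        states.append(state)
        observations.append(random.gauss(means[state], 1.0))
        r = random.random()
        for nxt, p in transitions[state]:
            r -= p
            if r <= 0:
                break
        state = nxt
    return states, observations

for _ in range(3):
    states, observations = generate()
    print("state sequence:", states)   # different on (almost) every run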
Summing over all values of Q versus taking only the single most likely value
What we are summing up are the individual probabilities of the observation sequence and one particular state sequence (Q takes on one of its many possible values), given the model W, which is p(O,Q|W). That’s not the value we need. We want p(O|W). We don’t care about Q – it’s a hidden random variable.
The quantity that we are computing by summing over (i.e., integrating away) Q is the total conditional probability of the observation sequence O given the model W, that is p(O|W). This is also known as the likelihood. Here’s a mathematical statement of integrating away Q that shows why we need to sum over all possible state sequences to get the value we really want:
[latex]
p(O|W) = \sum_Q p(O,Q|W)
[/latex]
and note that I’ve been using small ‘p’ everywhere because we are using Gaussians and so things are not actually probabilities, but probability densities.
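Here is a brute-force check of that identity in Python on a tiny 2-state model with 1-D Gaussian outputs (so these are densities – hence the small ‘p’). All the parameter values are invented; the point is only the difference between summing over every Q and keeping just the biggest single term:

import math
from itertools import product

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Model W: initial, transition and output (Gaussian) parameters, all made up.
initial = {1: 1.0, 2: 0.0}
trans = {(1, 1): 0.6, (1, 2): 0.4, (2, 1): 0.0, (2, 2): 1.0}
means, var = {1: 0.0, 2: 3.0}, 1.0

O = [0.2, 1.5, 2.8]                      # a short observation sequence

total, best = 0.0, 0.0
for Q in product([1, 2], repeat=len(O)):
    p = initial[Q[0]] * gaussian(O[0], means[Q[0]], var)
    for t in range(1, len(O)):
        p *= trans[(Q[t - 1], Q[t])] * gaussian(O[t], means[Q[t]], var)
    total += p                           # summing: p(O|W) = sum over Q of p(O,Q|W)
    best = max(best, p)                  # taking only the single most likely Q

print("p(O|W) summed over all Q :", total)
print("largest single p(O,Q|W)  :", best)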