Forum Replies Created
In the specific case of a left-to-right HMM topology, you are right that the first and last observations in any observation sequence must indeed be emitted from the first and last states, respectively.
But, we still only know about some part of the state sequence, and the complete state sequence remains unknown to us: it is still a hidden random variable. It’s just that the distribution of this hidden random variable is ever so slightly restricted.
In the general case of an arbitrary model topology and an observation sequence that is longer than the shortest path through that model, this is not the case. But, even in this general case, we still know something about possible values of the hidden state sequence. Any state sequences that are shorter or longer than the observation sequence have zero probability, and non-zero-probability values of the state sequence are restricted to those of exactly the right length.
Let’s separate out a few different aspects of this.
The storage space required for a covariance matrix is [latex]O(D^2)[/latex], where D is the dimensionality of the observation vector.
The computational cost can be worked out by looking at the vector-matrix-vector multiplication in the formula for a multivariate Gaussian – can you work that out?
But the real issue is the large number of model parameters — which is also [latex]O(D^2)[/latex] — that need to be estimated from data.
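For reference, here is the multivariate Gaussian density (writing [latex]o[/latex] for the observation vector): the vector-matrix-vector multiplication in question is the quadratic form [latex](o-\mu)^\top \Sigma^{-1} (o-\mu)[/latex] inside the exponent, so counting the multiplications it needs will answer the question above.
[latex]
\mathcal{N}(o; \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(o-\mu)^\top \Sigma^{-1} (o-\mu)\right)
[/latex]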
No, in general Automatic Speech Recognition (ASR) systems have a fixed vocabulary and therefore, as you correctly state, any out-of-vocabulary (OOV) words would be recognised as similar-sounding in-vocabulary words (or, more likely, as a sequence of short words).
It is possible to build an open-vocabulary system, but this is somewhat unusual. The vast majority of ASR systems that you will read about in the literature have a fixed vocabulary.
The non-emitting (also known as ‘dummy’) start and end states are there to avoid having to separately specify two sets of parameters: the probabilities of starting in each state, and of ending in each state. Using the non-emitting states allows us to write those parameters on transitions out of the starting non-emitting state, and into the ending non-emitting state. Some textbooks do not use the non-emitting states, and therefore the parameters of the model must also include these starting and ending state distributions. That’s messy, and easily avoided by doing things ‘the HTK way’.
In a left-to-right model, there will typically be only one transition from the starting non-emitting state: it goes to the first emitting state, with probability 1. But we could have additional transitions if we wished: this would allow the model to emit the first observation from one of the other states.
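To make that concrete, here is an illustrative transition matrix (my own toy numbers) for a 5-state left-to-right model in which states 1 and 5 are the non-emitting start and end states. Row 1 encodes the ‘probability of starting in each state’ (here, probability 1 of entering the first emitting state, state 2), and the last column encodes the ‘probability of ending’ from each emitting state:
[latex]
A = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0.6 & 0.4 & 0 & 0 \\ 0 & 0 & 0.6 & 0.4 & 0 \\ 0 & 0 & 0 & 0.7 & 0.3 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}
[/latex]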
All those dummy states are still there in the compiled model (language model + acoustic models) – they are joined up with arcs, and on those arcs are the language model probabilities.
The language model on its own is not an HMM – it’s just a finite state machine.
Your question is mainly about what the word models look like. Yes, we could indeed use a single emitting state per word. Then, we would be modelling just the average spectral envelope across the entire word duration. That may well be enough to distinguish words in a very small vocabulary system (e.g., isolated digits), but is a rather naive model.
Using more states per word gives us a finer temporal granularity. For example, 3 emitting states per word allows the modelling of (roughly speaking) the beginning, middle and end sounds of that word. Such a model is probably a better generative model of that word, and conversely a worse generative model of other words, so should lead to more accurate recognition.
In larger systems, we use one model per sub-word unit (e.g., phone-sized units) and then of course we will have multiple states per word.
Try it for yourself by experiment – it’s easy to vary the number of emitting states in a whole-word system. You’ll probably want to do such an experiment on a reasonably large multi-speaker dataset in order to get reliable results.
That versus which
Examples:
1. The bicycle that I saw yesterday was red.
2. The bicycle, which I saw yesterday, was red.
This one is simple. If the part starting with ‘that’ or ‘which’ can be deleted and you still have a sentence that means the same thing, then it’s optional and you should use ‘which’. If you can’t delete it, then it’s obligatory and you should use ‘that’.
In Example 1, I am distinguishing between the bicycle that I saw yesterday and some other possible bicycles (perhaps one that I saw today). In Example 2, there is only one possible bicycle that I could be talking about. I could have not told you about seeing it yesterday and would still have communicated the same meaning: that it is red.
Another way to make the distinction is that clauses with ‘that’ are restrictive: they narrow down the scope of what you are talking about. Clauses with ‘which’ just add optional extra information without doing that: they are ‘nonrestrictive’.
Because the ‘which’ version is optional information, you will usually want to put some commas around it, as in the second example above.
As usual, Grammar Girl explains it well.
Less versus fewer
The traditional answer is that fewer is for countable things (sheep, people, days, apples,…) and less is for things you can’t count (water, excitement, pain, work, …). That is still my default answer and is the safe choice.
Grammar Girl is usually a good source for this sort of writing information, and there you’ll find some useful exceptions to the ‘is it countable?’ rule.
In reality, the distinction between less and fewer is not quite so clear and you can argue either way.
A general rule
When writing scientifically, the last thing you want is your reader fixating on your English usage instead of the actual content that you are trying to communicate. So, play it safe and follow conventions. If in doubt, find another construction that avoids tricky word choices that you might get wrong.
What is a stochastic model?
The term stochastic is the opposite of deterministic. We could also use words like ‘random’ or ‘probabilistic’ instead of ‘stochastic’.
An HMM is stochastic because the state sequence is a random variable: we simply don’t know what value it took, when the model generated a particular observation sequence.
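Here is a toy sketch (entirely my own illustration, with made-up parameters) of what that means in practice: generating from an HMM uses randomness twice per time step, once for the emission and once for the transition, so the state sequence really is a random variable.

import numpy as np

rng = np.random.default_rng()
a = np.array([[0.6, 0.4, 0.0],    # a simple left-to-right transition matrix
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])   # the last state loops, to keep the sketch short
means = [0.0, 5.0, 10.0]          # one 1-D Gaussian (unit variance) per state

state, states, observations = 0, [], []
for _ in range(10):
    states.append(state)
    observations.append(rng.normal(means[state], 1.0))  # random emission
    state = rng.choice(3, p=a[state])                    # random transition

print(states)  # the hidden state sequence: typically different on every run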
Summing over all values of Q versus taking only the single most likely value
What we are summing up are the individual probabilities of the observation sequence and one particular state sequence (Q takes on one of its many possible values), given the model W, which is p(O,Q|W). That’s not the value we need. We want p(O|W). We don’t care about Q – it’s a hidden random variable.
The quantity that we are computing by summing over (i.e., integrating away) Q is the total conditional probability of the observation sequence O given the model W, that is p(O|W). This is also known as the likelihood. Here’s a mathematical statement of integrating away Q that shows why we need to sum over all possible state sequences to get the value we really want:
[latex]
p(O|W) = \sum_Q p(O,Q|W)
[/latex]
Note that I’ve been using a small ‘p’ everywhere because we are using Gaussians, and so these are not actually probabilities, but probability densities.
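To make the sum concrete, here is a minimal sketch of the Forward algorithm (my own illustration, not HTK code; all variable names are assumptions), which computes this sum efficiently without enumerating every state sequence:

import numpy as np

# a[i, j] is the transition probability from state i to state j,
# b[j, t] is the probability (or density) of emitting observation o_t
# from state j, and pi is the initial state distribution.
def forward(a, b, pi):
    J, T = b.shape
    alpha = np.zeros((J, T))
    alpha[:, 0] = pi * b[:, 0]                # initialise at t = 0
    for t in range(1, T):
        # sum over all predecessor states, not just the best one
        alpha[:, t] = (alpha[:, t - 1] @ a) * b[:, t]
    return alpha[:, -1].sum()                 # p(O|W), if any state may be final

In practice this is computed in the log domain, to avoid numerical underflow.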
HMMs actually compute the likelihood p(O|W) where O is the observation sequence and W is the HMM (or a sequence of HMMs). Note the small “p” – it’s a probability density, because we are using Gaussians, and not a true probability, although it is in some loose sense “proportional” to probability.
So, the likelihood computed by the HMM, and the probability from the language model, P(W), are on different scales. They have a quite different range of values. You can see that for yourself in the output from HTK when it prints acoustic log likelihoods, which tend to be large negative numbers like -1352.3.
Therefore, some scaling of these two values is necessary before combining them. We do that scaling in the log domain: multiply the language model log probability by a constant before adding it to the acoustic model log likelihood. If we didn’t do this, the acoustic model log likelihood would dominate and the language model would have little influence.
We call the language model scaling factor (sometimes called the language model weight) a hyperparameter because it is something separate from the actual HMM and language model parameters.
Separately from this, there is another hyperparameter, called the word insertion penalty. This allows us to trade off insertion errors versus deletion errors in order to minimise Word Error Rate.
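Putting the two hyperparameters together, the combined score used during recognition looks like this (my own notation, not HTK’s: [latex]s[/latex] is the language model scaling factor, [latex]w[/latex] is the word insertion penalty, and [latex]N(W)[/latex] is the number of words in [latex]W[/latex]):
[latex]
\hat{W} = \operatorname{argmax}_W \left[ \log p(O|W) + s \log P(W) + w\,N(W) \right]
[/latex]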
Note that hyperparameters must never be set using the test data. We should reserve some of the training data for this (and not use that part for training the HMMs or language model). This set is typically called a development set.
Generally, both the language model scaling factor and the word insertion penalty are tuned empirically (by trial and error) on a development set.
Let’s drop the term “best state sequence” because that’s probably misleading and not helpful.
The HMM is a stochastic generative model, and when it generates an observation sequence it does so using a randomly-chosen state sequence. We can never know the “true” sequence because it’s a hidden random variable (let’s call it Q). What we have to do is consider all possible state sequences (so Q is best described as a probability distribution). We have to think in a Bayesian way: we must say that Q is simultaneously taking on all possible values, some more likely than others.
The correct way to compute the probability of an observation sequence having been generated by a model, is to sum over all possible values of Q (technically, we can call this operation ‘integrating away’ Q because we only care about the probability and not about the value of Q). The quantity we obtain is called the Forward probability and can be computed using the Forward algorithm.
But sometimes we need the fastest possible computation and are willing to make an approximation to the Forward probability. So, instead of summing over all values of Q, we just pick the single most likely value and compute the probability of the observation sequence given that one value of Q. This is what we do during recognition.
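Continuing the illustrative notation from the Forward sketch above, the Viterbi approximation simply replaces each sum over predecessor states with a max:

import numpy as np

def viterbi(a, b, pi):
    J, T = b.shape
    delta = np.zeros((J, T))
    delta[:, 0] = pi * b[:, 0]
    for t in range(1, T):
        # keep only the best predecessor, instead of summing over all of them
        delta[:, t] = (delta[:, t - 1][:, None] * a).max(axis=0) * b[:, t]
    return delta[:, -1].max()                 # probability of the single best path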
Now let’s attempt to answer the question “Why does training with the Baum-Welch algorithm lead to a better model than training with the Viterbi algorithm?”
The key is indeed that Baum-Welch computes exact state occupancies. That means that it exactly computes how much each observation vector in the observation sequence should contribute to the updated mean and variance of each state. This gives a more robust estimate of those parameters than the Viterbi algorithm, because each state receives weighted contributions from many different observations.
What do I mean by “more robust”? Here, I mean that the Baum-Welch algorithm is less sensitive to misalignments between states and observations that might happen in the Viterbi algorithm.
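In symbols (using the standard forward probability [latex]\alpha_t(i)[/latex] and backward probability [latex]\beta_t(i)[/latex]), the state occupancy and the resulting update of a state’s mean are:
[latex]
\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{p(O|W)} \qquad \hat{\mu}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\,o_t}{\sum_{t=1}^{T} \gamma_t(i)}
[/latex]
Viterbi training replaces [latex]\gamma_t(i)[/latex] with a hard 0-or-1 alignment, which is exactly the approximation being discussed.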
How much difference does it actually make? Design an experiment and find out for yourself!
This paper might offer some insights:
L. J. Rodriguez-Fuentes and M. I. Torres, “Comparative Study of the Baum-Welch and Viterbi Training Algorithms Applied to Read and Spontaneous Speech Recognition” in Pattern Recognition and Image Analysis, First Iberian Conference, IbPRIA 2003, Puerto de Andratx, Mallorca, Spain, June 4-6, 2003. DOI: 10.1007/978-3-540-44871-6_98
Your understanding is quite good – certainly the main points are along the right lines.
To answer the questions:
1. The Viterbi algorithm operates left-to-right (although it could of course operate right-to-left if we really wanted) and so is “sequential” in that sense. The calculation of the Forward probability is very similar, except that all paths are summed, rather than just considering the most probable one at each state and time.
2. The Viterbi algorithm and the Forward-Backward (i.e., Baum-Welch) algorithm are computing different things. The Viterbi algorithm only finds the single most likely path, and its corresponding probability (which can then be used as a good approximation of the total Forward probability that the model generated the given observation sequence). The Baum-Welch algorithm computes more than this: it does not consider just one path but all possible paths; and, it computes the probability of being in any given state at any given time (called the “state occupancy”), summed across all possible paths that pass through that state at the given time.
3. We can indeed use either algorithm for training the model, but the Baum-Welch algorithm computes exact state occupancies whereas the Viterbi algorithm only computes an approximation. So, the Baum-Welch is more accurate and we expect it would therefore lead to better estimates of the model’s parameters (which in fact it does – how could you demonstrate that empirically?).
4. The Viterbi algorithm can be used to train the model, as we’ve stated above, but will result in slightly inferior (i.e., less likely) estimates for the model parameters. Because it is much faster than Baum-Welch, it is often used to quickly get the model parameters to approximately the most likely values, before “fine tuning” them with Baum-Welch.
The Inverse Fourier Transform (or alternatively a Discrete Cosine Transform – either can be used) doesn’t create any more information, but you are right to say that it does “unpack” the information and spread it out along the quefrency axis. That’s very much like the process by which the Fourier transform “unpacks” the frequencies in a waveform and lays them out along the frequency axis to make the spectrum. We can then perform some operations more conveniently in the frequency domain (e.g., finding the formants, or performing filtering).
After the “unpacking” in cepstral analysis, it’s possible to ignore information that is not wanted: when extracting MFCCs this means ignoring all the higher quefrency components and only retaining the first few coefficients (typically, 12). That is done simply by truncating the cepstrum (effectively setting everything above the 12th coefficient to zero).
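Here is a minimal sketch of that truncation step (assumed sizes and stand-in data, not HTK’s actual implementation), using the DCT route:

import numpy as np
from scipy.fftpack import dct

# stand-in for the log outputs of 26 mel filterbank channels for one frame
log_mel = np.log(np.random.rand(26) + 1e-8)

cepstrum = dct(log_mel, type=2, norm='ortho')  # 'unpack' onto the quefrency axis
mfccs = cepstrum[:12]                          # keep only the first 12 coefficients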
Is it any clearer now? If not – keep asking!
You could use a loop to construct a string (e.g., in order to pass it as a command line argument):
S="" for X in `cat myfile.scp` do S=${S}" -I "${X} done echo $S # ... now do something useful with ${S}
where myfile.scp is a plain text file with one value per line.
Let’s be clear about terminology: the model itself can “do” nothing more than randomly generate observation sequences. If we want to do something else, then we need an algorithm to do that.
For example, the Baum-Welch algorithm finds the most likely values of the Gaussian parameters and transition probabilities, given some training data.
Typically, the number of states is set by the system builder (e.g., you!). Three emitting states per phoneme is the most common arrangement when using phoneme models. For whole-word models, the right number of states is less obvious.
You make a very good point about the number of states in the model: if the model has a simple left-to-right topology (determined by the transitions), then the minimum length of observation sequence that it can emit is equal to the number of emitting states, because every path must pass through every state and emit at least one observation from each. If the number of states is large, this will be a problem: the model will assign zero probability (either in training or testing) to any sequence shorter than this.
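You can see this effect numerically with a toy forward computation (my own example, with made-up numbers): a strict left-to-right model with 5 emitting states and no skips assigns exactly zero probability to a 3-observation sequence, because no path can reach the final state in time.

import numpy as np

a = np.array([[0.5, 0.5, 0.0, 0.0, 0.0],   # strict left-to-right: no skips
              [0.0, 0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # must start in the first state
b = np.full((5, 3), 0.1)                   # 5 states, but only T = 3 observations

J, T = b.shape
alpha = np.zeros((J, T))
alpha[:, 0] = pi * b[:, 0]
for t in range(1, T):
    alpha[:, t] = (alpha[:, t - 1] @ a) * b[:, t]
print(alpha[-1, -1])                       # must end in the last state: prints 0.0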