HMM speech synthesis, viewed as regression

Continuing our view of the task as one of regression, we see how that can be solved by combining HMMs (for sequence modelling) with regression trees (to provide the parameters of the HMMs).

This video just has a plain transcript, not time-aligned to the video.
We've now got a description of text-to-speech as a sequence-to-sequence regression task.
That was a rather abstract description, deliberately so, because there are many possible solutions to this problem.
What we're going to do now is look at the first solution, the one that has been dominant in statistical parametric speech synthesis until quite recently.
That's the so-called hidden Markov model approach, or HMM-based speech synthesis.
That's the standard name, so we're going to use it.
But it's a little bit misleading, because the work isn't really being done by the hidden Markov model.
As we'll see in a moment, it's mostly going to be done by a regression tree.
In order to understand why people call it hidden Markov model based speech synthesis, but why we here are going to prefer to call it regression tree plus hidden Markov model speech synthesis, I'm going to give you two complementary explanations.
These two explanations will be completely consistent with each other.
They're both true, and they're both perfectly good ways to understand it, so if one works better for you than the other, go with that one.
But it's important that you see two points of view, two different ways of thinking about the problem.
What I'm going to do is defer the conventional explanation, in terms of context-dependent hidden Markov models, until second.
First, I'm going to give the slightly more abstract but more general way of explaining this, in terms of regression.
When those two explanations are complete, we'll need to mop up a few details.
The practical implementation of context-dependent models will immediately raise the question of how we do duration modelling, so we'll touch on that very briefly.
Once we've got our model, we'll need to generate speech from it, and I'm going to deal with that extremely briefly.
This is another good point at which to remind you that these videos really are the bare bones.
They're just the skeleton that you need to start understanding, and you need to flesh that out with the readings.
That will be particularly true for generation.
I am not going to go into great detail about the generation algorithm.
One reason for that is that this generation algorithm, as it stands, is somewhat specific to the hidden Markov model approach to speech synthesis, which is rapidly being superseded by neural network approaches, so I don't want to dwell on it too long.
We need to understand what the problem is and that there is a solution, but we're not going to look at the solution in detail; we'll just describe what it does, which is really fairly simple anyway.
So what are these two complementary explanations of hidden Markov model based speech synthesis?
The first is to stay with our abstract idea of regression: a sequence-to-sequence regression problem.
In the coming slides, I'm going to tag that for you in this colour, writing "regression" on those slides, just so you can see that that's that part of the explanation.
The other, complementary view is to think very practically about how you would actually build a system to do this: how would you build a regression tree plus HMM system to do sequence-to-sequence modelling?
That view actually starts from hidden Markov models.
It says: we'd like a hidden Markov model of every linguistic unit type (let's say the phone) in every possible linguistic context (let's say quinphone plus prosodic context).
Then we immediately realise we're in trouble, because that's a very large set of models, and for almost all of them there's no training data, even in a big data set.
We have to fix that problem of not being able to train many of our models because they're unseen in training, and we do that by sharing parameters amongst models.
What we're going to see is that this regression and this sharing are the same thing.
We'll start, then, with the regression view.
We have, as I said, two tasks to accomplish.
We have to solve the sequencing problem: we have to take a walk through the sequence of phones.
Each one will have its context attached to it in this sort of flattened structure, so it will look more like this.
That walk is at a linguistic timescale: phone to phone to phone.
For each of those phones we have to, if you like, expand it out into a physical duration.
Each phone will have a different duration from the others, so we need some sort of model of duration to expand that out.
Then, for each phone, we have to generate a sequence of speech parameters to describe the sound of that phone in that context.
So we need a sequencing component to our solution.
But sequencing isn't too hard: it's just counting from left to right and deciding how long to spend in each phone.
Perhaps more difficult is the problem of predicting the speech parameters once we know which phone we're in and how far through it we are: what's the sound?
That's the second part of the problem.
Throughout this module, we're not going to go all the way to the speech waveform; we'll do that at the end with a vocoder.
We're going to predict a sequence of speech parameters, and we'll put those in vectors.
We've seen what those look like: they're the output feature vectors.
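Just to pin down the shapes involved, here is a minimal sketch in Python of the task's inputs and outputs. The names and dimensions are illustrative assumptions, not anything from the course; the rest of this video is about replacing the placeholder body with an HMM (for sequencing) plus a regression tree (for regression).

```python
from typing import List
import numpy as np

FEATURE_DIM = 3   # illustrative; real vocoder parameter vectors are much larger

def synthesise_parameters(phones_in_context: List[dict],
                          frames_per_phone: List[int]) -> np.ndarray:
    """Placeholder: for each phone (with its linguistic context attached),
    emit the requested number of frames of speech parameters."""
    total_frames = sum(frames_per_phone)
    return np.zeros((total_frames, FEATURE_DIM))  # (num_frames, feature_dim) output feature vectors
```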
So we need to choose a model of sequences, and we need to choose a model for regression.
Let's just choose the models that we already know about.
For sequencing, the most obvious choice is the hidden Markov model.
Why? Well, it's the simplest model that we know of that can generate sequences, so let's choose that.
For regression, there are lots and lots of different ways of doing regression.
But we can certainly say this is a hard regression problem, because the mapping from the linguistic specification to speech parameters (say, the spectrum) is for sure non-linear, and might even be non-continuous.
So it's a really tough regression problem, and we'd better pick a really general-purpose regression model: something that can learn arbitrary functions, arbitrary mappings from inputs to outputs, from predictors to predictee.
The one model that we're pretty comfortable with, because we've used it many times, is a regression tree.
So let's pick that.
Those chosen models might not be the best, but we know about them and we know how to use them, and that's just as important as them being good models: for example, we know how to train them from data.
Here's a hidden Markov model.
Hidden Markov models are generative models, although their main use in our field is for automatic speech recognition, which is a classification problem.
We can create classifiers by having multiple generative models and having them compete to generate the observed data; whichever can generate it with the highest probability, we assign that class to the data.
Here, we're just going to generate from them.
So here's a hidden Markov model; it's generative, so it can generate a sequence of observations.
These are the speech parameters, the vocoder parameters.
Now, in my abstract picture of a hidden Markov model, I've drawn little Gaussians in the states.
Of course, the dimensionality of each Gaussian is going to be the same as the dimensionality of the thing we're trying to generate, so they're multivariate Gaussians.
I've just drawn simple one-dimensional ones here to indicate that there are Gaussians in those states.
So how much work is this hidden Markov model doing? Well, not a lot, really.
It's saying: first do this (and you can do it for as long as you choose), then do this, and so on.
So it's really just saying that things happen in this particular order.
That's appropriate for speech, because speech sounds happen in a particular order, whether that's within a phone or within the sentence we're trying to say: we don't want to have things reordered.
The model I've drawn here is a hidden Markov model.
It has these self-transitions, and that's the model of duration for the moment.
We'll revisit that a bit later, because it's a really rubbish model of duration: self-transitions give a geometric distribution over state durations, which is not a good fit to real phone durations.
It's not going to be good enough for synthesis, but it's good enough to start our understanding, so we'll leave it as it is for now.
That's a generative model, and it's a probabilistic generative model, so we could take a random walk through the model.
We could toss biased coins to choose whether to take self-transitions or to go to the next state, and from each Gaussian we could somehow generate an observation vector at each time step.
We could take what would amount to a random walk through this model, controlled by the model's parameters, and somehow generate output.
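Here is a minimal sketch of that random walk, with assumed parameter names (nothing here is the course's actual code): toss a biased coin to decide whether to take the self-transition, and sample an observation vector from each state's Gaussian. Real synthesis does not sample randomly like this; the actual generation algorithm is only touched on at the end of the video.

```python
import numpy as np

def random_walk(state_means, state_vars, self_transition_prob=0.7, rng=None):
    """state_means, state_vars: one mean/variance vector per emitting state."""
    rng = rng or np.random.default_rng()
    frames = []
    for mean, var in zip(state_means, state_vars):        # visit states strictly left to right
        while True:
            frames.append(rng.normal(mean, np.sqrt(var))) # emit from this state's Gaussian
            if rng.random() >= self_transition_prob:      # biased coin: leave this state?
                break
    return np.stack(frames)                               # shape: (num_frames, feature_dim)

# e.g. a 5-state model with 3-dimensional (purely illustrative) speech parameters:
means = np.random.randn(5, 3)
observations = random_walk(means, np.ones((5, 3)))
```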
We'll come back at the end to say a little bit more about how that generation is actually done.
But before we can generate from the model, of course, we'll have to train it on some data.
Now, in this very naive and simplified picture, I've drawn a Gaussian in each state, and this is a model of, say, one particular sound in one particular linguistic context: a model of a very specific sound in a very specific context.
It has five emitting states; that's normal in speech synthesis, to get a bit more temporal resolution than we would need in speech recognition.
So I have many, many, many models; each of the models has five states, and I've said each state needs its own multivariate Gaussian so it can do this generation.
Just do the multiplication in your head: how many phones are there? How many quinphone contexts can each of them be in? For each of those, how many prosodic contexts can it be in? Then multiply by five, and you'll see that's a very, very large number of Gaussians.
So large, in fact, that there's no chance we could ever train them all on any finite data set in this naive set-up.
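Here is a back-of-the-envelope version of "do the multiplication in your head". All the counts below are illustrative assumptions, not figures from the video, but the conclusion is the same whatever reasonable numbers you plug in.

```python
num_phones = 45
quinphone_contexts = num_phones ** 4      # two phones of context on each side of the centre phone
prosodic_contexts = 1_000                 # stress, position in syllable / word / phrase, ...
states_per_model = 5

total_gaussians = num_phones * quinphone_contexts * prosodic_contexts * states_per_model
print(f"{total_gaussians:.1e} Gaussians") # ~9.2e+11: no finite data set could train them all
```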
So we need to provide the parameters of the model in some other way.
The models can't have their own parameters.
They're going to have to be given parameters.
So what is the model doing? What is this generation? Well, that's the regression step.
It says: you're in the second emitting state of a five-state model for a particular speech sound; given that (in other words, those are your predictors), please predict, or regress to, an acoustic feature vector.
In other words, the parameters of the Gaussians are the product of the regression part of the problem.
If you remember this, which I hope you do, this is a classification and regression tree; remember, it's really a classification or regression tree, because it's operating in one mode or the other at any one time.
We spent a lot more time talking about classification than about regression, but the ideas are exactly the same.
We're going to use this machinery to provide the parameters of the hidden Markov model, because we know lots of things: we know the phone and its context.
In other words, we know the name of the model, and we can ask questions about the name of the model: yes/no questions.
Given a sequence of questions and their answers, we can descend this tree and arrive at a leaf, in which we find the values of the parameters of that state: in other words, the mean and variance of the Gaussian.
What it amounts to, then, is treating the hidden Markov model simply as a model of a sequence (it says: do things in this order, and spend about this long doing each of them), and a regression tree to provide the parameters of the states (in other words, the means and variances of the Gaussians), which is the regression onto acoustic properties, onto speech parameters.
Let's spell that out in a little more detail, because it's a somewhat complex and potentially confusing idea.
Here's a linguistic specification.
For every sound in every context, we have a hidden Markov model.
Let's stick with the schwa in the word "the", in this phonetic context and this prosodic context.
In our huge set of hidden Markov models, we have a model especially for that sound in that context, and there it is.
It's got five emitting states.
But for all the reasons we just explained, that model does not own its own parameters.
Perhaps we never saw this sound in this context in the training data, so we couldn't train a model just for that sound in that context.
This model is simply not trainable from the training data, so we're going to provide it with its parameters by regression.
In other words, there will be some regression tree.
It will have a root, it will be binary-branching, each node will ask a question, and at the leaves we'll arrive at the parameters of the model.
Let's imagine that the first question in the tree is "Is the centre phone a vowel?" Yes, it is; maybe this is the "yes" branch.
Then we might ask another question here, such as "Is the phone to the right a stop?" Yes, it is.
Maybe this is a very small tree and we've arrived at a leaf.
At that leaf we've got a mean and a variance of a Gaussian, and that's the prediction of the parameters of one state of this model, so they go into that state.
Now, the tree is going to be a lot bigger than that.
It's going to have to ask about a lot more than just the centre phone and the right phone.
At the very least, it would also have to ask about state position.
But that's fine: we know that can just be another predictor (is the state position 1, 2, 3, 4 or 5?), and it should probably ask about various other features too.
Now that tree is going to be learned from data in the usual way.
We're not going to draw that by hand.
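To make the descent concrete, here is a minimal sketch of fetching the Gaussian for one HMM state from a regression tree. The data structures, question names and leaf values are invented for illustration; they are not the toolkit's real representation. In a real system the trees are much bigger and (typically) there are separate trees per state position and per parameter stream, all learned from data as just described.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Node:
    question: Optional[Callable[[dict], bool]] = None  # internal node: a yes/no question about the context
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    mean: Optional[List[float]] = None                  # leaf: mean of the Gaussian
    variance: Optional[List[float]] = None              # leaf: variance of the Gaussian

def lookup(node: Node, context: dict):
    """Answer questions until we reach a leaf; return that leaf's mean and variance."""
    while node.question is not None:
        node = node.yes if node.question(context) else node.no
    return node.mean, node.variance

# The tiny hand-drawn tree from the video (leaf values are made up):
leaf = Node(mean=[0.3, 1.2], variance=[0.1, 0.4])
other = Node(mean=[0.0, 0.0], variance=[1.0, 1.0])
tree = Node(question=lambda c: c["centre_phone_is_vowel"],
            yes=Node(question=lambda c: c["right_phone_is_stop"], yes=leaf, no=other),
            no=other)

# One state of the model of the schwa in "the" (state position would be another predictor):
context = {"centre_phone_is_vowel": True, "right_phone_is_stop": True, "state_position": 2}
print(lookup(tree, context))   # ([0.3, 1.2], [0.1, 0.4])
```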
