Now we have a basic understanding of what a neural network is. It's non-linear regression. Perhaps we could think of it as a little bit of a black box. For now, let's think about how to do text-to-speech with it.

That's just going to be a matter of taking the linguistic specification from the front end, doing various reformatting operations, and adding a little bit more information (specifically, durational information) to get it into a format that's suitable for input to the neural network. That format is going to be a sequence of vectors, and the sequence is a time sequence, because speech synthesis is a sequence-to-sequence regression problem.

So we'll start off very much like we did in HMM-based speech synthesis, by flattening the rich linguistic structures (because we don't know what else to do with them) into a linear sequence, and that is a sequence of context-dependent phones, as in the HMMs. In this module I'm going to use this HTK-style notation; in fact it's from the HTS toolkit, which is commonly used for HMM-based synthesis. Just to remind you: that's the phone, and these punctuation symbols separate it from its context. Here we can see we've got two phones to the right, two phones to the left, and some machine-friendly but human-unfriendly encoding of supra-segmental context, such as syllable structure or position in phrase. That's what I like to call flattening the linguistic structure.

But although this structure has time as its axis, time is measured in linguistic units, and in order to turn that into speech we need to know the duration of each of those linguistic units. So we're going to expand this description out into a sequence of vectors at a fixed timescale. In other words, it's going to be at the frame rate, maybe every 5 milliseconds; that would be a 200 Hz frame rate, as opposed to what we see on the left, which is at a phonetic rate. When we construct that sequence of vectors, there are going to be 200 of them per second. Obviously, to do that, we need to add duration information, so we're going to need a duration model that helps us get from the linguistic timescale to the fixed frame rate.

I'm going to first show you the complete picture of doing speech synthesis, and then we'll backtrack and see how this input is actually prepared. Let's just take it as read that this is some representation of the linguistic information, enhanced with durational information: expanded out to the frame rate, with some sort of duration features added to it. Maybe that's what these numbers here mean. We'll take that fixed-frame-rate representation and synthesise with it.

So, before we know exactly what's in it, and before we know how to train the neural network, I think it's helpful just to see the process of synthesis. We'll take that representation and, for each frame in turn, push it through the neural network and generate a little bit of speech. So I'll take the first frame and push it through the network: make what's called a forward pass. Let's just be clear how simple that really is. It's a question of writing the vector onto the input layer as its activations, then doing this matrix multiplication to get the inputs to the first hidden layer. Each unit sums all of the activations that arrive down those weighted connections, applies the non-linear function, and produces an output. We then do another matrix multiplication to get the inputs to the next layer, apply the non-linear function, and get its activations or outputs. Another matrix multiplication into the output layer, which receives all of those activations down another set of weights, produces some output on the output layer. So it's just a sequence of matrix multiplies, plus non-linear operations inside the hidden layers, that gives us our set of speech parameters.
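To make that forward pass concrete, here is a minimal sketch in Python with NumPy. The layer sizes, the tanh non-linearity, the random weights, and the choice of 40 output parameters are illustrative assumptions, not the particular network shown in the video.

```python
import numpy as np

def forward_pass(x, weights, biases):
    """One forward pass: a matrix multiply plus non-linearity per hidden layer,
    then a linear output layer producing the speech parameters."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)               # weighted sum into the layer, then non-linearity
    return h @ weights[-1] + biases[-1]      # output layer: the vocoder parameters for this frame

# Illustrative sizes: 425 linguistic + positional inputs, two hidden layers,
# and (say) 40 vocoder parameters out per frame.
rng = np.random.default_rng(0)
sizes = [425, 256, 256, 40]
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

frame_input = rng.random(sizes[0])           # one frame of the input representation
speech_params = forward_pass(frame_input, weights, biases)
print(speech_params.shape)                   # (40,): one vector of speech parameters
```

In a real system the weights would of course come from training, not from a random number generator.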
And again, just to be clear, my network's only got two outputs; of course, we'd need a lot more than that for a full vocoder specification. That set of speech parameters is then the input to our vocoder (that's this thing here), and that will produce for us a little bit of speech corresponding to this frame of input. We just push those features through the vocoder, which produces a waveform, and we'll save that for later. By putting subsequent frames through (this one, then this one, then this one) we get a whole sequence of waveform fragments, which we just join together and play back as the speech signal. We might do some overlap-and-add, for example, to join them together.
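Joining those per-frame waveform fragments can be sketched as follows; the 16 kHz sample rate, 5 ms hop, 10 ms fragment length, and Hann cross-fade window are illustrative assumptions, and the random fragments simply stand in for vocoder output.

```python
import numpy as np

def overlap_add(fragments, hop):
    """Join per-frame waveform fragments by placing them at a fixed hop and
    summing, with a window so overlapping regions cross-fade smoothly."""
    frag_len = len(fragments[0])
    window = np.hanning(frag_len)
    out = np.zeros(hop * (len(fragments) - 1) + frag_len)
    for i, frag in enumerate(fragments):
        out[i * hop : i * hop + frag_len] += window * frag
    return out

# Illustrative numbers: 16 kHz audio, 5 ms frame shift (80 samples),
# each vocoder call producing a 10 ms (160-sample) fragment.
sr, hop, frag_len = 16000, 80, 160
fragments = [0.1 * np.random.randn(frag_len) for _ in range(200)]   # stand-ins for vocoder output
waveform = overlap_add(fragments, hop)
print(len(waveform) / sr, "seconds of audio")    # 200 frames at 5 ms is about 1 second
```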
That's the end-to-end process, and what we need to explain now is how to make this stuff here.

So let's go step by step through the process we're going to follow to construct this representation in front of us. This is a sequence of frames, so that axis is time measured in frames: this is frame one, this is frame two, and so on, for a sentence that we're trying to say. So the duration of speech that we will get out is equal to the number of frames times the time interval between frames, which here is 5 milliseconds. Just by glancing across these features, we can see that there's some part that's binary (lots of zeros with a few ones sprinkled in) and some part that has continuously-valued numerical features. We need to explain what these are and where they come from.

The complete process for creating this input to the neural network is to take our linguistic specification from our front end in the usual way, then flatten that to get our sequence of context-dependent phones (this thing here), which we could then go off and do HMM synthesis with. But for neural network synthesis we need to do a little bit more: specifically, we have to expand it out to a time sequence at the fixed frame rate. So we go from the linguistic timescale of the linguistic specification and the context-dependent phones to a fixed frame rate, and the way we do that is by adding duration information, which has to be predicted, so we need a duration model. After applying the duration model, time is measured at the fixed frame rate. We can then expand these context-dependent phone names into a sequence of binary feature vectors, and that's the frame sequence ready for input to the neural network. To improve performance, we might add further fine-grained positional information. The duration information expands each context-dependent phone into some sequence of frames, but at that point every individual frame would have the exact same specification. It would seem useful to know how far through an individual phone we are while synthesising it, so we can add some fine-grained position information below the phone level: some sub-phonetic information. We'll add that to the frame sequence and then input that to the neural network.

Let's take all of that a bit more slowly and see the individual steps. We shouldn't need to describe how the front end works; we've seen that picture a hundred times. We'll then rewrite that (and this is just a question of changing format) as a sequence of context-dependent phones. To get from that linguistic structure (the thing that in Festival is an utterance structure) to this representation is simply a reformatting. We just walk down the phonetic sequence and attach the relevant context to the name of the phone, to arrive at something that looks a bit like this. There are several parts to this: there's this quinphone (the centre phone plus or minus two), and then there's some supra-segmental stuff to do with where we are in syllables, part of speech, all sorts of things. These are just examples of things we could use, and you could add anything else you think might be useful: just encode it in the name of the model. These are all linguistic things, so the positional features are the positions of linguistic objects in linguistic structures: position of phone in syllable, position of syllable in word, position of word in phrase, those sorts of things.
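As a rough illustration of that flattening, here is a sketch that builds an HTS-style context-dependent name from the quinphone plus one positional field. The separators follow the convention shown above, but the exact field layout of a real label format is much richer, and the toy phone sequence and syllable positions are invented; a real front end would supply them.

```python
def make_context_name(phones, i, pos_in_syl, syl_len):
    """Build a context-dependent model name for phone i: quinphone context
    (ll^l-c+r=rr) plus an illustrative forwards/backwards syllable position."""
    pad = ["sil", "sil"] + phones + ["sil", "sil"]   # pad so edge phones still have context
    ll, l, c, r, rr = pad[i : i + 5]
    return f"{ll}^{l}-{c}+{r}={rr}@{pos_in_syl}_{syl_len - pos_in_syl + 1}"

# "hello" as a toy phone sequence; the syllable positions are hard-coded
# here purely for illustration.
phones = ["hh", "ax", "l", "ow"]
positions = [(1, 2), (2, 2), (1, 2), (2, 2)]
for i, (pos, length) in enumerate(positions):
    print(make_context_name(phones, i, pos, length))
# e.g. the third line is "hh^ax-l+ow=sil@1_2": an L with ax to its left and ow to its right
```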
What we're going to do now is, step by step, expand that into a sequence of frames at a fixed frame rate. The first thing that's usually done is to expand each phone into a set of sub-phones. The reason for this is really a carry-over from HMM synthesis. So, instead of working in terms of phones, we're going to divide them into sub-phones; here I've divided this one into five. This is still time, and it's still measured in linguistic units: these are phones, and this is the sequence of sub-phones of this particular one. I've expanded the phone out into five sub-phones, so there's the beginning, there's the end, and there's some stuff in the middle, and it would seem reasonable that knowing where we are in that sequence will be helpful for determining the final sound of this speech unit. The reason for using five is just copied over from HMM synthesis. In many of the first papers of the new wave of neural-network-based speech synthesis, a lot of machinery was carried over from HMM synthesis; these systems were basically built on top of HMM systems. We'll see that now in the way that we divide a phone into sub-phones: a sub-phone is simply an HMM state. So this is an HMM of one phone in context (that's its name), so it's a model of an L in this context, and it's got five states. We just break the phone into five sub-phone units, and then we're going to predict the duration at the sub-phonetic level, in other words a duration per state, and that prediction is going to be in units of frames.

So what is that duration model? Well, in really early work, the duration model was just the regression tree from an HMM system, simply borrowed from that system. We already know that regression trees are not the smartest model for this problem, and so we could easily replace that. For example, we could have a separate neural network whose job is to predict durations, or we could borrow durations from natural speech if we wanted to cheat, or we could get them from any other model we like: some external duration model. The duration model provides the duration of each state, in frames. We know the name of the model, and we know the index of the state; that will help us follow the tree down to a leaf. The prediction at that leaf here would be two frames, and we'd write that there, and we just do that for every state in our sequence. So this is the bridge between the linguistic timescale here (sub-phones) and the fixed frame rate here.

So let's add those durations to our picture. That's the sequence of phones; I've broken it down into sub-phones, and I'm now going to write durations on them. Those are state indices, in HTK style, so they start from 2. We can now put durations on, and a good way to represent duration is as a start time and an end time within the whole sentence. We now have what looks very much like an aligned label file, which is exactly what it is. The next step is to rewrite this as binary vectors, and then we'll finally expand that out into frames.

I'm going to describe this in a way that's rather specific to a particular set of tools, in fact to the HTS toolkit. That's a widely used tool, so it's reasonable to use its formats, which are just borrowed from HTK. Of course, if you're not using that tool, or if you've written your own tools, you might do this in a rather different way. However, conceptually we're always going to need to do something equivalent to the following. We're going to turn each of these context-dependent model names (there's one of them) into a feature vector. It's going to be mostly zeros with a few ones. In other words, for each of these categorical features, and all of the rest, we're going to convert it into a one-hot coding. And in fact we're just going to do the processing for the entire model name all at once: we're going to turn the whole thing into a concatenation of one-hot codings, so effectively it's almost all zeros but with a few ones in it, and those ones capture the information in this model name.

Here's how this particular tool does it. We write down a set of patterns which either match or don't match the model name. Matching results in a feature value of one; not matching results in a feature value of zero. These patterns are in fact the same thing that we would use as the possible questions when building a regression tree in an HMM system. We're going to have a very, very long list of possible questions, because we're going to write down everything we could possibly ask about this model name. Most of the questions will not match and will result in a value of zero, but a few will match.

Let's do that. Here's the top of my list of questions, and I'm just going to scroll through it. For each of them, I'm going to ask the question about the model name. They're a little bit like regular expressions: I'm going to see if the model name contains this little bit of pattern. This pattern is just asking: is the centre phone equal to that value? Most of the time the answer will be no, and I'll write a zero into my feature vector. Whenever a question matches, whenever I find the answer is yes, I'll write a one into my feature vector. So let's do that. We start scrolling through the list of questions, writing 0, 0, 0, 0, 0, and eventually we come to something that does match. This question here is asking: does the centre phone equal L? And here, of course, it does, so I write a one into my binary feature vector and keep on scrolling through the questions. Again, most of them don't match and I write a zero, and eventually I'll come to something else that matches. Here's the question that asks: is there a P to the left? That pattern matches this P, which gives me another yes, so I write another one into my feature vector. So we can see that this feature vector is mostly zeros with a few ones, and those ones are the one-hot codings: this first part is the one-hot coding of the centre phone, and then there's a one-hot coding of the left phone. Then we'll carry on: we'll do the right phone, the left-left phone, the right-right phone, and all of the other stuff. It's one very long list of questions that does that.
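Here is a sketch of that question-matching step, using Python's fnmatch for glob-style patterns as a stand-in for the toolkit's own matching. The handful of questions and the example model name are invented for illustration; a real question set runs to many hundreds of entries, covering every phone identity in every context position plus all the supra-segmental features.

```python
from fnmatch import fnmatch

# A tiny illustrative subset of the (very long) question list.
questions = [
    ("C==aa", "*-aa+*"),    # is the centre phone 'aa'?
    ("C==l",  "*-l+*"),     # is the centre phone 'l'?
    ("C==p",  "*-p+*"),
    ("L==aa", "*^aa-*"),    # is the phone to the left 'aa'?
    ("L==p",  "*^p-*"),     # is the phone to the left 'p'?
    ("R==iy", "*+iy=*"),    # is the phone to the right 'iy'?
]

def encode_binary(model_name, questions):
    """Ask every question of the model name: a match writes a 1 into the
    feature vector, a non-match writes a 0."""
    return [1 if fnmatch(model_name, pattern) else 0 for _, pattern in questions]

name = "ax^p-l+iy=s@1_2"                 # centre phone l, with p to its left and iy to its right
print(encode_binary(name, questions))    # [0, 1, 0, 0, 1, 1]: mostly zeros, a few ones
```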
We're still operating at the linguistic timescale. What we have is this binary vector, which is very sparse: mostly zeros with just a few ones. We now need to write it out at the fixed frame rate, and that's just a simple matter of duplicating it the right number of times for each HMM state. So the first step is to obtain this vector for each HMM state. We can add some simple position information immediately, and that is which state we're in; this sequence here is at the linguistic clock rate, counting states 2, 3, 4, 5, 6. And we now know how long we would like to spend in each state, from the duration model, so we just duplicate each line that number of times; this one will just be duplicated twice. So we write them out like that, and since that white space doesn't really mean anything, we'll just rewrite it like this. That's now a sequence of feature vectors at a fixed frame rate. This is time in frames, one line per frame. Within a single HMM state, most of the vector is constant, and then we've got some position information here. Here I've added a very simple feature, which is position within the state: it says first we're halfway through, and then we're all the way through. Let's look at another state, this last state here. We need to be in it for three frames, so there are three frames here, and we're a third of the way through, then two thirds of the way through, then all the way through. This very simple positional feature is a little counter that counts our way through the state. What this part here, the sequence of frames, represents is: this is an L (that's encoded somehow in one of these one-hot codings), in such-and-such a context (that's encoded in all the other one-hot stuff), with a duration of 10 frames (you can count the lines; there are 10 of them). And we've got little counters, little clocks, ticking away at different rates: this one's going around quite rapidly, this one's counting through states, and we could add any other counters we like. We'll see in the real example that we might have other counters that count backwards, for example.
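Here is a sketch of that final expansion: duplicate each state's sparse vector for its predicted duration in frames, and append two simple counters, the state index and the fraction of the way through the state. The vector contents and the durations are made-up examples; real systems add several more counters, including ones that count backwards.

```python
import numpy as np

def expand_to_frames(state_vectors, durations):
    """Duplicate each state's binary vector once per frame of its duration and
    append positional features: the state index (HTK numbering, starting at 2)
    and the fraction of the way through the state."""
    frames = []
    for state_index, (vec, dur) in enumerate(zip(state_vectors, durations), start=2):
        for frame in range(1, dur + 1):
            frames.append(np.concatenate([vec,
                                          [state_index],    # which state we are in
                                          [frame / dur]]))  # little counter ticking through the state
    return np.array(frames)

# One phone with five states: the same sparse vector for every state of the phone,
# and durations (in frames) as predicted by the duration model.
binary = np.tile([0, 1, 0, 0, 1, 1], (5, 1))
durations = [2, 2, 2, 1, 3]              # 10 frames in total for this phone
X = expand_to_frames(binary, durations)
print(X.shape)                           # (10, 8): 10 frames of 6 binary + 2 positional features
```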
Doing Text-to-Speech
To make things more concrete, we examine the step-by-step process for performing Text-to-Speech.