So I'll just finish with a rather informal description of how to train a neural network, just to show that it's fairly straightforward. It's a very simple algorithm. Now, like all simple algorithms, we can dress it up with all sorts of other enhancements, and you'll find when you start reading the literature that that's what happens. But in principle it's pretty simple. So we're going to give an impressionistic, hand-waving and non-mathematical introduction to how you train a neural network.

We're going to use supervised machine learning. In other words, we know the inputs and the outputs, and they're going to be aligned. So we're going to have two sequences: a sequence of inputs and a sequence of outputs. They're both at the same frame rate, and there's a correspondence between one frame of input and one frame of output. Because the neural network in my pictures is rather small and simple, my inputs have got three dimensions and my outputs have got two dimensions. Of course, in real speech synthesis the outputs are to do with vocoder parameters and the inputs are to do with linguistic features. We saw in the previous section how to prepare the input: how to get our sequence of vectors, which are mostly binary ones and zeroes with a few numerical positional features attached to them, little counters that count through states or count through phones. We could be counting through any other unit we liked: syllables, words, phrases. And the outputs are just the vocoder parameters, just speech features.

Now, the two sequences have to be aligned. How do you get that alignment? Well, you actually know the answer to that: we can do it with forced alignment, in precisely the same way that you have prepared data for a unit selection synthesiser. In other words, we prepare aligned label files with duration information attached to them, where that duration information didn't come from a model that predicts duration: it came from natural speech, from the training data, through forced alignment. So we can get aligned pairs of input and output vectors.

Here's an example of what the data might look like to train a neural network. We've got some input and some output. This is time, in frames, and these things are aligned. So we know that when we put this into the network, we would like to get that out of the network. That's the job of the network: to learn that regression. These features are all from recorded natural speech: these are the speech features extracted from natural speech, and these are the linguistic features for the transcription of that speech, which have been time-aligned through forced alignment.

So we've prepared our data, and now we'll train our network. I'll use my really simple, small network for the example, so our inputs are going to be of dimensionality three and the outputs of dimensionality two. Now, this is going to be rather impressionistic: we're not going to show even a single equation. The job of the network is to learn that when I show it this input, it should give me this output. The target output is an example from natural speech, so this is an aligned pair of input and output feature vectors. Typically, we'll initialise the neural network by just setting all the weights to random values. When we input this feature vector, it goes through this weight matrix, which to start with is just some random linear projection; the hidden layer applies some nonlinear function to it, which here was a sigmoid.
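To make that concrete, here is a minimal sketch in Python of the forward pass just described, using the toy dimensions from the example (three inputs, two outputs). The hidden-layer size, the initialisation scale and the example input values are arbitrary choices for illustration, not anything taken from the video.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions from the example: 3 linguistic-feature inputs, 2 speech-parameter outputs.
n_in, n_hidden, n_out = 3, 4, 2

# Initialise all the weights (and biases) to small random values.
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))
b2 = np.zeros(n_out)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x):
    """One frame of input: a linear projection, a sigmoid hidden layer,
    then another linear projection to the output."""
    h = sigmoid(x @ W1 + b1)   # hidden layer activations
    y = h @ W2 + b2            # network output (predicted speech features)
    return h, y

x = np.array([1.0, 0.0, 0.73])   # e.g. binary features plus a positional counter
h, y = forward(x)
print(y)   # the weights are random, so this will be nowhere near the target yet
```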
We get another projection, and another projection, and we get some output, which is just the projection of the input through those layers. So let's see: here maybe we get the following value, and here we might get this. Now, we wouldn't expect the network to get the right answer, because its weights are random. The learning algorithm is then just going to gradually adjust the weights to make the output as close as possible to the target.

That learning algorithm is called back-propagation and, impressionistically, it does the following. We look at the output we would like to get, we look at the output that we got, and we compute some error between those two. So here the output was higher than it should have been: the error here is about 0.3. So we need to adjust the weights in a way that makes this output a little bit smaller. We take the error and we send it back through the network, and it sends little messages down all the weights. The messages here are going to say: "You gave me an output, but it was too big; I want an output a little bit smaller than that, so could you just scale down your weight a little bit?" So all those weights will get reduced a little bit. There will be an error also propagated back through this unit, through these weights, and there are little messages saying how they should be adjusted as well. The back-propagation algorithm can then take these errors that all arrive at the outputs of these units: we sum the errors that arrive here, send that backwards through the activation function, and send messages down all of these weights as well, saying whether they should be increased a little bit or decreased a little bit. We'll have computed how much the weights need to be changed, and then we'll update the weights, and we'll get a network which next time should give us something closer to the right output.

So maybe we update our weights and put the vector back into the network, and now maybe we'll get the following outputs: a little bit closer to the correct values. We'll just do the same again: we'll compute the errors between what the network gave us and what we wanted, and we'll use that error as a signal to send back through the network, to back-propagate, and it will tell the weights whether they need to increase or decrease.

Now, of course, we won't just do that for a single input and a single target; we'll do it for the entire training set. So we'll have thousands or millions of pairs of inputs and outputs. For each input, we'll put it through the network and compute the network output and the error. We'll do that across the entire training set to find the average error, and then we'll back-propagate through the weights to do an update. And then we'll iterate that: we'll make many, many passes through the data, make many, many updates of the weights, and gradually converge on an ideal set of weights that gives us outputs as close as they're going to be to the targets.

Now, that training algorithm, as I've described it, is extremely naive. It suggests that we need to make a pass through the entire training data before we can update the weights even once. In practice, we don't do that: we'll do it in what are called mini-batches, so we'll make weight updates a lot more frequently and we'll be able to get the network trained a lot quicker. There are lots and lots of other tricks to get this network to converge on the best possible set of weights, and I'm deliberately not going to cover all of those here, because that's an ever-changing field and you need to read the literature to find out what people are doing today when they're training neural networks for speech synthesis.
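Continuing the same toy sketch (it reuses `sigmoid`, NumPy and the weight matrices defined above), this is roughly what such a training loop could look like: mini-batch gradient descent with a squared-error loss and hand-written back-propagation. The learning rate, batch size and number of epochs are illustrative guesses; a real system would add the extra tricks mentioned above.

```python
def train(X, Y, W1, b1, W2, b2, lr=0.1, batch_size=32, epochs=10):
    """Mini-batch gradient descent with back-propagation for the toy
    one-hidden-layer network sketched above (squared-error loss).
    X and Y are the aligned input and target frames, one row per frame."""
    n = len(X)
    rng = np.random.default_rng(1)
    for epoch in range(epochs):
        order = rng.permutation(n)                 # visit the frames in a random order
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            x, t = X[idx], Y[idx]                  # one mini-batch of aligned pairs
            # Forward pass.
            h = sigmoid(x @ W1 + b1)
            y = h @ W2 + b2
            # Error at the output: how far each output is from its target.
            err = y - t
            # Back-propagate: "messages" sent back through the weights.
            dW2 = h.T @ err / len(idx)
            db2 = err.mean(axis=0)
            d_h = (err @ W2.T) * h * (1.0 - h)     # back through the sigmoid
            dW1 = x.T @ d_h / len(idx)
            db1 = d_h.mean(axis=0)
            # Nudge every weight a little in the direction that reduces the error.
            W1 -= lr * dW1; b1 -= lr * db1
            W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2
```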
Likewise, this very simple feed-forward architecture is about the simplest neural network we could draw, and people use far more sophisticated architectures. With those, they're expressing something that they believe about the problem. Just to give you one example of the sort of thing you could do to make a more complicated architecture: you could choose not to fully connect two layers, but to have a very specific pattern of connections that expresses something you think you know about the problem.

That's all I would like to say about training neural networks, because the rest is much better done from the readings. Let's finish off by seeing where we are in the big picture, and what's coming up.

So, I've only told you about the very simplest sort of neural network: feed-forward architectures. And the way I've described it is as just a straight swap for the regression tree, and we saw that basically a hidden Markov model is still involved. For example, it's there for forced alignment of the training data, and when we do synthesis it's there as a sort of sub-phonetic clock. It says: phones are too coarse in granularity; we need to divide them into five little pieces, and for each of those five pieces we need to attach duration information and some positional features, which say how far through the phone we are in terms of states. We then use that to construct the input of the neural network.

In early work on neural network speech synthesis, in this rediscovery of neural nets for speech synthesis, a lot of things were borrowed from HMM systems, including the duration model, even though that is a rather poor regression-tree-based model. Later work tends to use a much better duration model. I wonder what that might be? Well, it's just another neural network. So a typical neural speech synthesiser will have one neural network that predicts duration, and then we'll use those durations and feed them into a second neural network, which does the regression onto speech parameters.

Well, what's coming next in neural network speech synthesis depends entirely on when you're watching this video, because the literature is changing very quickly on this topic. And so, from this point onwards, I'm not even going to attempt to give you videos about the state of the art in neural network speech synthesis. You're going to go and read the current literature to find that out.

We can make use of our neural network speech synthesiser in a unit selection system, and so we'll come full circle and finish off in the next module by looking at hybrid synthesis. I'll describe that as unit selection using an acoustic space formulation of the target cost function. It's worth pointing out that the better the regression model we get from neural network speech synthesis, the better we expect hybrid synthesis to work. So any developments that we learn about from the literature on neural-network-based speech synthesis in a statistical parametric framework, in other words driving a vocoder, we can take and put into a hybrid system, using it to drive a unit selection system, and we expect those improvements to carry through to better synthesis in the hybrid framework.
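As a final sketch of how the two-network pipeline described above might fit together at synthesis time, the following Python assumes two hypothetical, already-trained regression models, `duration_net` and `acoustic_net`, each with a `predict` method. Those names, the 5 ms frame shift and the feature handling are all assumptions for illustration, not any particular toolkit's API.

```python
import numpy as np

def synthesise(unit_level_features, duration_net, acoustic_net, frame_shift_ms=5):
    """Two-stage neural pipeline: predict a duration for each unit (e.g. each
    state of each phone), upsample the linguistic features to one vector per
    frame with a positional counter attached, then regress every frame onto
    vocoder parameters with a second network."""
    frames = []
    for feats in unit_level_features:                 # one linguistic feature vector per unit
        duration_ms = duration_net.predict(feats)     # hypothetical duration model
        n_frames = max(1, round(duration_ms / frame_shift_ms))
        for i in range(n_frames):
            position = i / n_frames                   # "how far through this unit" feature
            frames.append(np.append(feats, position))
    frames = np.stack(frames)
    return acoustic_net.predict(frames)               # vocoder parameters, one row per frame
```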
Training a Neural Network
Just a very informal look at how this can be done, to give you a starting point before reading about this in more detail.