What is a Neural Network?

We said earlier that the neural network is a replacement for the regression tree, so here we describe it in those terms.

This module is an introduction to how we can do speech synthesis using neural networks, and I'll stress that word: introduction.
This is a fast-moving field, and so I'm not even going to attempt to describe the state of the art.
What I'm going to do is give you an idea of what a neural network is and how it works, and then we're going to look in a little bit more detail at how we actually do text-to-speech with a neural network.
That's really a matter of getting the inputs and the outputs in the right form, so that we can use the neural network to perform regression from linguistic features to speech parameters.
We'll need to think about things like duration.
I'll finish with a very impressionistic and hand-wavy description of how we might train a neural network, using an algorithm called back-propagation.
But I won't be going into any mathematical detail; I'm leaving that to the readings. Let's do the usual thing of checking what we need to know before we can proceed.
You obviously need to know quite a bit about speech synthesis before you get to this point, and in particular you need to know how text processing works in the front end, and what linguistic features are available.
You'll need to have completed the module on HMM synthesis, where we talked about flattening the rich linguistic specification into just a phonetic sequence, where each of the symbols in that sequence is a context-dependent phone, and all the context we might need has to be attached to that phone: that includes left and right phonetic context and suprasegmental things such as prosody, as well as basic positional features, maybe position in the phrase.
Implicit in the HMM method, but becoming explicit in the neural net method, is that these features can be further processed and treated as binary: either true or false.
That will become even clearer as we work through the example of how to prepare the inputs to a neural network, because the inputs will have to be numerical.
In HMM-based speech synthesis, the questions in our regression tree queried the linguistic features, and although it was done implicitly, that's really treating those features as binary: they're either true or false.
And of course, we need to know something about the typical speech parameters used by a vocoder, because for our neural networks in this module those will be the output: we will still be driving a vocoder.
This idea of representing a categorical linguistic feature as a binary vector is very important.
The way to do that is something called a one-hot encoding, which we've already mentioned and which will come up again later in this module.
It has various names that people use: sometimes people say one-of-K, or one-of-N, or one-of-some-other-letter.
I quite like the phrase "one-hot" because it tells me that we've got a binary vector which is entirely full of zeros, except for a single position where there's a one, which is equal to "on", or "true", or "hot".
And that's telling me which category, out of a set of possible categories, the current sound belongs to.
For example, it could represent the current phone, with a one-out-of-45 encoding, say.
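Here's a minimal sketch of that one-hot encoding in Python; the inventory size of 45 comes from the example above, but the phone names are invented placeholders.

```python
import numpy as np

# Hypothetical inventory of 45 phones; the names are placeholders.
phone_set = ["phone%02d" % i for i in range(45)]

def one_hot(phone, inventory):
    """Binary vector: all zeros, except a single 1 ('hot') at the
    position of this phone in the inventory."""
    vec = np.zeros(len(inventory))
    vec[inventory.index(phone)] = 1.0
    return vec

v = one_hot("phone07", phone_set)
print(int(v.sum()), int(np.argmax(v)))  # 1 7 : exactly one position is 'on'
```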
So what is a neural network? To make the connection back to hidden Markov model based speech synthesis, let's say that a neural network is a regression model, like a regression tree.
It's very general purpose.
We can fit almost any problem into this framework.
It's just a matter of getting the input and output into the correct format.
In the case of a regression tree, the way we need to represent the input is as something we can query with yes/no questions.
And that's exactly like saying we need to turn the input into a set of binary features that are either true or false, one or zero.
A neural network is very similar.
We need to get the input in the right form.
It's going to have to be numerical, a vector of numbers.
They don't have to be binary.
Just a vector of numbers; and the output has to be another vector of numbers.
Throughout this module, I'm going to use a very simple kind of neural network: the feed-forward architecture.
Once you start reading the literature, you'll find that there are many, many other possible architectures, and that's a place where you can put some knowledge of the problem into the model, by playing with the architecture to reflect, for example, your intuitions about how the outputs depend on the inputs.
However, here let's just use the simple feed-forward neural network.
Let's define some of the terms we're going to need when we talk about neural networks.
The building block of any neural network is the unit.
It's sometimes called the neuron, and it contains something called an activation function.
The activation function relates the output to the input: the output equals some function of the input.
The activation of a unit is passed down some connections and goes into the inputs of subsequent units, so we need to look at these connections.
There are a number of them. They're directed, so the information flows along the direction of the arrow.
And connections have a parameter: a weight. The weight just multiplies the activation of the unit and feeds it to the next unit, so those weights are the parameters of the model.
My simple feed-forward network is what's called fully connected: every unit in one layer is connected to all the units in the subsequent layer.
So you can see that those weights are going to fit into a simple data structure; in fact, it's just a matrix. The set of weights that connects one layer to the next layer forms a matrix, and the way that the activations of one layer are fed to the inputs of the next layer is just a simple matrix multiplication.
We can see that the units are arranged on this diagram in a particular way: they're arranged into what we call layers. Some of the layers are inside the model, and they're called hidden layers; other layers take the inputs and the outputs. So information flows through the network in a particular way: from the input layer, through the hidden layers, to the output layer.
So, to summarise: we represent the input as a numerical vector, and that's placed on the input layer. That's then propagated through the network by a sequence of matrix multiplications and activation functions, and eventually arrives at the output layer.
And the activations of the output layer are the output of the network.
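As a minimal sketch of that forward propagation, assuming made-up layer sizes and random weights (the real sizes depend on the task):

```python
import numpy as np

rng = np.random.default_rng(0)

# One weight matrix per pair of adjacent layers (fully connected):
# input (45 units) -> hidden (32) -> hidden (32) -> output (60).
W1 = rng.normal(size=(45, 32))
W2 = rng.normal(size=(32, 32))
W3 = rng.normal(size=(32, 60))

x = np.zeros(45)
x[7] = 1.0                # e.g. a one-hot vector placed on the input layer

h1 = np.tanh(x @ W1)      # matrix multiplication, then activation function
h2 = np.tanh(h1 @ W2)
y = h2 @ W3               # activations of the output layer = the output
print(y.shape)            # (60,)
```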
I said that these units, sometimes called neurons, have an activation function in them.
So what is a unit, and what does it do? A key idea in neural networks is that we can build up very complex models, which might have very large numbers of parameters (say, large numbers of weights), from very, very simple building blocks.
Very simple little operations; and the unit is that building block.
I'm just going to tell you about a very simple sort of unit.
There are more complex forms of unit, and you'll need to read the literature to find out what they are.
Here's a very simple unit, and it just does the following.
It receives inputs from the preceding layer.
Those inputs are simply the activations of units in the previous layer, multiplied by the weights on the connections they've come down. They all arrive together at the input to this unit, and they simply get summed up: the input is a weighted sum of the activations of the previous layer.
A weighted sum is just a linear operation, and it can be computed by the matrix multiplication that I talked about before.
Importantly, inside each unit is an activation function, and that function must be non-linear.
If the function were linear, then the network would simply be a sequence of linear operations, and a product of linear operations is itself just another linear operation.
So the network would be doing nothing more than, essentially, one big matrix multiply.
It would just be a very simple linear regression model: not very powerful.
We want a non-linear regression model, so we need to put a non-linearity inside the units.
Again, there are many, many possible choices of non-linear function.
You'll need to do the readings to discover what sorts of non-linear activation functions we might use. In these units it's very often some sort of squashing function: some sort of S-shaped curve, perhaps like a sigmoid or a tanh.
But there are many other possibilities.
That's another place where you need to make some design decisions when you're building a neural network: which activation functions are most appropriate for the problem you're trying to solve.
And the output simply goes out of the unit.
Quite often, that output is called the activation.
So, to summarise this part: there are many, many choices of activation function, but they need to be non-linear.
Otherwise, the entire network just reduces to one big linear operation, and there's no point in having all of those layers.
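Here's a minimal sketch of one such unit, with the sigmoid chosen as one possible squashing function; the incoming activations and weights are invented values.

```python
import numpy as np

def sigmoid(a):
    # An S-shaped squashing function: one common non-linearity.
    return 1.0 / (1.0 + np.exp(-a))

def unit(incoming_activations, weights):
    a = np.dot(weights, incoming_activations)  # weighted sum: linear
    return sigmoid(a)                          # activation: non-linear

# Three activations arriving from units in the previous layer.
print(unit(np.array([0.2, 0.9, 0.1]), np.array([1.5, -0.4, 0.3])))
```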
So what are all those layers about? Why would you have multiple layers? Well, there are lots of ways to describe what a neural network is doing, but here we're talking about regression: regression from some input representation (linguistic features, expressed as a vector of numbers) to some output representation (the speech parameters for our vocoder).
We can think of the network as doing this regression as a sequence of simpler regressions.
So each layer of weights, plus the units applying non-linear operations, is a little regression model, and we're doing the very complicated regression from inputs to outputs in a number of intermediate steps.
Now, if you read the fundamental literature on this subject, you'll find that there's a proof that a single hidden layer is enough: that a neural network with a single hidden layer can approximate any function.
While that's true in theory, there are always differences between theory and practice, and what works well empirically is not always what the theory tells us.
What we find empirically, in other words by experimentation, is that having multiple hidden layers is good: it works better.
So a neural network is a non-linear regression model.
We put some representation of the input into the network.
We get some representation of the output on the output layer: the activations of the output layer.
What's happening in the hidden layers is some sort of learned intermediate representation.
So we're trying to bridge this big gap between inputs and outputs by bridging lots of smaller gaps, and these intermediate representations are learned as part of training the model: we do not need to decide what they are, or what they mean.
In other words, the model is not only performing regression, it's learning how to break a complicated regression problem down into a sequence of rather simpler regressions that, when stacked together, perform the end-to-end regression.
So one way to describe what this network is doing is as a sequence of non-linear projections, or regressions, from one space to another space, to another space, eventually getting us from this linguistic space to this acoustic space.
And these in-between things? Well, they're some other spaces: some intermediate representations.
We don't know what they are; the network is going to learn how best to represent the problem internally, as part of training.
You might compare that to the pipeline architecture that we've seen in our text-to-speech front end, where we break down the very complicated problem of moving from text to linguistic features into a sequence of processes, such as normalisation, or part-of-speech tagging, or looking things up in a dictionary.
And there are lots of intermediate representations in that pipeline, such as the phonetic string, or syllables, or symbolic prosody.
But those representations are handcrafted: we've had to decide what they are, through expert knowledge.
And then we've built separate models to jump from one representation to the next. The neural network is a little bit like that, in a very general way, in that we're breaking a complex problem down into a sequence of simpler problems.
But here we do not need to decide what the simpler problems are.
We just choose how many steps there are.
So there are a bunch of design parameters of a neural network that we need to choose: things like the number of hidden layers, the number of units in each hidden layer (which could vary from layer to layer), and the activation function in those hidden layers. The sizes of the input and output are decided by, in our case, how many linguistic features we can extract from the text, and what parameters our vocoder needs.
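To gather those design parameters in one place, here's a minimal sketch; every size and value below is an invented example, not a recommendation from the course.

```python
import numpy as np

# Design choices we make:
hidden_layer_sizes = [256, 256, 256]  # how many hidden layers, and how wide
activation = np.tanh                  # activation function for hidden layers

# Decided for us:
n_inputs = 600    # by how many linguistic features the front end provides
n_outputs = 187   # by what parameters our vocoder needs

sizes = [n_inputs] + hidden_layer_sizes + [n_outputs]
rng = np.random.default_rng(1)
weights = [rng.normal(scale=0.1, size=(m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    for W in weights[:-1]:
        x = activation(x @ W)   # hidden layer: linear step + non-linearity
    return x @ weights[-1]      # output layer: the speech parameters

print(forward(rng.normal(size=n_inputs)).shape)  # (187,)
```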
So that's a neural net, in very general terms.
Now, in the next part, we're going to see how we use that to do text-to-speech.
That's just going to be a matter of getting the input representation right.
