Module 8 – Deep Neural Networks

The use of neural networks is motivated by replacing the regression trees, which were used in the HMM approach, with a more powerful regression model.

Module status: ready (including the updated slides after the class)

We will now use a Neural Network to replace the regression tree in HMM synthesis, and will keep an HMM-like mechanism to take care of the sequencing part of the problem.

Download the slides for the module 8 videos

Total video to watch in this section: 40 minutes

We said earlier that the Neural Network is a replacement for the regression tree, so here we describe it in those terms.

This video just has a plain transcript, not time-aligned to the video.
This module is an introduction to how we can do speech synthesis using neural networks, and I'll stress that word: introduction.
This is a fast moving field, and so I'm not even going to attempt to describe the state of the art.
What I'm going to do is give you an idea of what a neural network is and how it works, and then we're going to look in a little bit more detail at how we actually do text-to-speech with a neural network.
That's really a matter of getting the inputs and the outputs in the right form, so that we can use the neural network to perform regression from linguistic features to speech parameters.
We'll need to think about things like duration.
I'll finish with a very impressionistic and hand-wavy description of how we might train a neural network, using an algorithm called back-propagation.
But I won't be going into any mathematical detail; I'm leaving that to the readings. Let's do the usual thing of checking what we need to know before we can proceed.
You obviously need to know quite a bit about speech synthesis before you get to this point, and in particular you need to know how text processing works in the front end and what linguistic features are available.
You'll need to have completed the module on HMM synthesis, where we talked about flattening the rich linguistic specification into just a phonetic sequence, where each of the symbols in that sequence is a context-dependent phone, and all the context that we might need is attached to the phone: that includes left and right phonetic context and supra-segmental things such as prosody, as well as basic positional features, maybe position in phrase.
Implicit in the HMM method, but becoming explicit in the neural network method, is that these features can be further processed and treated as binary: either true or false.
That will become even clearer as we work through the example of how to prepare the inputs to a neural network, because the inputs will have to be numerical.
In HMM speech synthesis, the questions in our regression tree queried the linguistic features, and, although it was done implicitly, that was really treating those features as binary: either true or false.
And of course, we need to know something about the typical speech parameters used by a vocoder, because for our neural networks in this module that will be the output: we will still be driving a vocoder.
This idea of representing a categorical linguistic feature as a binary vector is very important.
The way to do that is something called a one-hot coding, which we've already mentioned and which will become clear again later in this module. It has various names that people use.
Sometimes people say 1-of-K, or 1-of-N, or one-of-some-other-letter.
I quite like the phrase "one-hot" because it tells me that we've got a binary vector which is entirely full of zeros, except for a single position where there's a one, which is equal to on, or true, or "hot".
And that's telling me which category out of a set of possible categories the current sound belongs to.
For example, it could represent the current phone as a 1-of-45 encoding, say. So what is a neural network? To make the connection back to hidden Markov model based speech synthesis, let's say that a neural network is a regression model, like a regression tree.
It's very general purpose.
We can fit almost any problem into this framework.
It's just a matter of getting the input and output into the correct format.
In the case of the regression tree, the way that we need to represent the input is as something that we can query with yes/no questions.
And that's exactly like saying we need to turn the input into a set of binary features that are either true or false, one or zero.
A neural network is very similar.
We need to get the input in the right form.
It's going to have to be numerical, a vector of numbers.
They don't have to be binary.
Just a vector of numbers; and the output has to be another vector of numbers.
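As a concrete illustration of how a categorical linguistic feature becomes such a vector of numbers, here is a minimal one-hot coding sketch in Python. The tiny phone inventory is invented for illustration; a real system would use its own phone set of perhaps 40 to 50 phones.

```python
import numpy as np

# A hypothetical, tiny phone inventory -- a real one might have ~45 entries.
PHONES = ["sil", "p", "l", "ih", "n", "s"]

def one_hot(category, inventory):
    """Return a binary vector that is all zeros except for a single one
    ('hot') at the position of the given category."""
    vector = np.zeros(len(inventory))
    vector[inventory.index(category)] = 1.0
    return vector

print(one_hot("l", PHONES))   # [0. 0. 1. 0. 0. 0.]
```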
Throughout this module, I'm going to take a very simple kind of neural network: the feed-forward architecture.
Once you start reading the literature, you'll find that there are many, many other possible architectures, and that's a place where you can put some knowledge of the problem into the model, by playing with that architecture to reflect, for example, your intuitions about how the outputs depend on the inputs.
However, here let's just use the simple feed-forward neural network.
Let's define some terms that we're going to need when we talk about neural networks.
The building block of any neural network is the unit.
It's sometimes called the neuron, and it contains something called an activation function.
The activation function relates the output to the input.
So some function says that the output equals some function of the input.
The activation of a unit is passed down some connections and goes into the inputs of subsequent units, so we need to look at these connections.
Here's one of them. They're directed, so the information flows along the direction of the arrow, and connections have a parameter, which is a weight: the weight just multiplies the activation of the unit and feeds it to the next unit. So those weights are the parameters of the model.
My simple feed-forward network is what's called fully connected.
Every unit in one layer is connected to all the units in the subsequent layer, so you can see that those weights are going to fit into some simple data structure.
In fact, it's just a matrix: the set of weights that connects one layer to the next layer forms a matrix, and the way that the activations of one layer are fed to the inputs of the next layer is just a simple matrix multiplication.
We can see that the units are arranged on this diagram in a particular way: they're arranged into what we call layers. Some of the layers are inside the model, and they're called hidden layers; other layers take the inputs and the outputs. So information flows through the network in a particular way: from the input layer, through the hidden layers, to the output layer.
So, to summarise: we represent the input as a numerical vector, and that's placed on the input layer. It's then propagated through the network by a sequence of matrix multiplications and activation functions, and eventually arrives at the output layer.
The activations of the output layer are the output of the network.
I said that these units, sometimes called neurons, have an activation function in them.
So what is a unit, and what does it do? A key idea in neural networks is that we can build up very complex models, which might have very large numbers of parameters (that is, large numbers of weights), from very, very simple building blocks: very simple little operations. The unit is that building block.
I'm just going to tell you about a very simple sort of unit.
There are more complex forms of unit, and you need to read the literature to find out what they are.
Here's a very simple unit, and it just does the following.
It receives inputs from the preceding layer.
Those inputs are simply the activations of units in the previous layer, multiplied by the weights on the connections they've come down. They all arrive together at the input to this unit, and they simply get summed up, so that input is a weighted sum of the activations of the previous layer.
A weighted sum is just a linear operation, and it can be computed by the matrix multiplication that I talked about before.
Importantly, inside each unit is an activation function, and that function must be non-linear.
If the function were linear, then the network would simply be a sequence of linear operations, and a product of linear operations is itself just another linear operation.
So the network would be doing nothing more than, essentially, a big matrix multiply.
It would just be a very simple linear regression model: not very powerful.
We want a non-linear regression model, so we need to put a non-linearity inside the units.
Again, there are many, many possible choices of non-linear function.
You need to do the readings to discover what sorts of non-linear activation functions we might use in these units. It's very often some sort of squashing function: some sort of S-shaped curve, perhaps something like a sigmoid or a tanh.
But there are many other possibilities.
That's another place where you need to make some design decisions when you're building a neural network: which activation function is most appropriate for the problem that you're trying to solve?
The output simply goes out of the unit.
Quite often, that output is called the activation. So, to summarise this part: there are many, many choices of activation function, but it needs to be non-linear.
Otherwise, the entire network just reduces to a big linear operation.
There's no point having all of those layers.
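To make that concrete, here is a minimal numpy sketch of a fully-connected feed-forward network: each hidden layer is a weighted sum (a matrix multiplication) followed by a non-linear activation, here tanh, and the output layer is left linear for regression. The layer sizes and random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b, activation=np.tanh):
    """One fully-connected layer: a weighted sum of the previous layer's
    activations, followed by a non-linear activation function."""
    return activation(x @ W + b)

# Illustrative sizes: 3 inputs, two hidden layers of 4 units, 2 outputs.
W1, b1 = rng.standard_normal((3, 4)), np.zeros(4)
W2, b2 = rng.standard_normal((4, 4)), np.zeros(4)
W3, b3 = rng.standard_normal((4, 2)), np.zeros(2)

x = np.array([0.0, 1.0, 0.5])                    # input vector (linguistic features)
h1 = layer(x, W1, b1)                            # hidden layer 1
h2 = layer(h1, W2, b2)                           # hidden layer 2
y = layer(h2, W3, b3, activation=lambda a: a)    # linear output layer (speech parameters)
```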
So what are all those layers about? Why would you have multiple layers? Well, there are lots of ways to describe what a neural network is doing, but here we're talking about regression: regression from some input representation (linguistic features, expressed as a vector of numbers) to some output representation (the speech parameters for our vocoder).
We can think of the network as doing this regression as a sequence of simpler regressions.
So each layer of weights, plus units applying non-linear operations, is a little regression model, and we're doing the very complicated regression from inputs to outputs in a number of intermediate steps.
Now, if you read the fundamental literature on this subject, you'll find that there's a proof that a single hidden layer is enough: that a single-hidden-layer neural network can approximate any function.
While that's true in theory, there are always differences between theory and practice, and what works well empirically is not always the same thing that the theory tells us.
What we find empirically, in other words by experimentation, is that having multiple hidden layers is good.
It works better. So a neural network is a non-linear regression model.
We put some representation of the input into the network.
We get some representation of the output on the output layer.
The activations of the output layer are that output, and what's happening in the hidden layers is some sort of learned intermediate representation.
So we're trying to bridge this big gap between inputs and outputs by bridging lots of smaller gaps, and these intermediate representations are learned as part of training the model; we do not need to decide what they are or what they mean.
In other words, the model is not only performing regression, it's learning how to break a complicated regression problem down into a sequence of rather simpler regressions that, when stacked together, perform the end-to-end regression problem.
So one way to describe what this network is doing is as a sequence of non-linear projections, or regressions, from one space to another space to another space, which eventually gets us from this linguistic space to this acoustic space.
These in-between things are some other spaces, some intermediate representations; we don't know what they are, and the network is going to learn how best to represent the problem internally as part of training.
You might compare that to the pipeline architecture that we've seen in our text-to-speech front end, where we break down the very complicated problem of moving from text to linguistic features into a sequence of processes, such as normalisation, or part-of-speech tagging, or looking things up in a dictionary.
And there are lots of intermediate representations in that pipeline, such as the phonetic string, or syllables, or symbolic prosody.
But those representations are handcrafted.
We've had to decide what they are through expert knowledge.
And then we've built separate models to jump from one representation to the next. The neural network is a little bit like that, in the very general sense that we're breaking a complex problem down into a sequence of simpler problems.
But here we do not need to decide what the simpler problems are.
We just choose how many steps there are. So there are a bunch of design parameters for a neural network that we need to choose, and they're things like the number of hidden layers, the number of units in each hidden layer (which could vary from layer to layer), and the activation function in those hidden layers, with the sizes of the input and output layers being decided by, in our case, how many linguistic features we can extract from the text and what parameters our vocoder needs.
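Collected together, those design decisions might look something like the configuration below; all the numbers are invented for illustration, and note that the input and output sizes are fixed by the linguistic feature set and the vocoder rather than being free choices.

```python
# Illustrative design choices for a simple feed-forward synthesis network.
network_config = {
    "input_dim": 420,                  # number of linguistic input features (set by the front end)
    "hidden_layers": [512, 512, 512],  # number of units in each hidden layer
    "activation": "tanh",              # non-linearity inside each hidden unit
    "output_dim": 82,                  # number of vocoder parameters per frame
}
```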
So that's a neural network, in very general terms.
Now, in the next part, we're going to go and see how we use that to do text-to-speech.
That's just going to be a matter of getting the input representation right.

To make things more concrete, we examine the step-by-step process for performing Text-to-Speech.

This video just has a plain transcript, not time-aligned to the video.
Now we have a basic understanding of what a neural network is.
It's non-linear regression; perhaps we can think of it as a little bit of a black box for now.
Let's think about how to do text-to-speech with it.
That's just going to be a matter of taking the linguistic specification from the front end, doing various reformatting operations, and adding a little bit more information, specifically durational information, to get it into a format that's suitable for input to a neural network. That format is going to be a sequence of vectors.
The sequence is going to be a time sequence, because speech synthesis is a sequence-to-sequence regression problem.
So we'll start off very much like we did in HMM-based speech synthesis, by flattening the rich linguistic structures (because we don't know what to do with them) into a sequence: a linear sequence, which is a sequence of context-dependent phones, as in the HMMs.
In this module I'm going to use this HTK-style notation; in fact it's from the HTS toolkit, which is commonly used for HMM synthesis. Just to remind you: that's the phone, and these punctuation symbols separate it from its context.
And here we can see we've got two phones to the right, two phones to the left, and some machine-friendly but human-unfriendly encoding of supra-segmental context, such as syllable structure or position in phrase.
So that's what I like to call flattening the linguistic structure.
But in this structure here, although it has time as this axis, time is measured in linguistic units, and in order to turn that into speech we need to know the duration of each of those linguistic units.
So we're going to expand this description out into a sequence of vectors which is at a fixed timescale.
In other words, it's going to be at the frame rate, maybe every five milliseconds.
In other words, a 200 Hz frame rate, as opposed to what we see on the left, which is at a phonetic rate.
We're going to construct, somehow, some sequence of vectors, and there are going to be 200 of them per second.
Obviously, to do that, we need to add duration information, so we're going to have to have a duration model that helps us get from the linguistic timescale out to the fixed frame rate.
I'm going to first show you the complete picture of doing speech synthesis, and then we'll backtrack and see how this input is actually prepared.
Let's just take it as read that this is some representation of the linguistic information here, enhanced with durational information, so expanded out to a frame rate, and with some sort of duration features added.
Maybe that's what these numbers here mean.
We'll take that fixed-frame-rate representation and we'll synthesise with it.
So before we know exactly what's in it, and before we know how to train the neural network, I think it's helpful just to see the process of synthesis.
So we'll take that, and for each frame in turn we'll push it through the neural network and generate a little bit of speech.
So we'll take the first frame and push it through the network: we make what's called a forward pass.
Let's just be clear how simple that really is.
It's a matter of writing the vector onto the input layer as its activations, then doing a matrix multiplication; at each of the inputs to this hidden layer we sum all of the activations that arrive down those weighted connections, apply the non-linear function, and produce an output. We'll do another matrix multiplication to get the inputs to the next layer, apply the non-linear function and get the activations (or outputs), then apply another matrix multiplication, and the output layer will receive all of those activations down another set of weights and produce some output on the output layer.
It's just a sequence of matrix multiplies, with non-linear operations inside the hidden layers, and that will give us our set of speech parameters.
And again, just to be clear, my network has only got two outputs; of course, we'd need a lot more than that for a full vocoder specification.
That set of speech parameters is then the input to a vocoder: that's this thing here, and it will produce for us a little bit of speech corresponding to this frame of input.
We just push those features through the vocoder, which produces a waveform, and we'll save that for later.
By putting subsequent frames through (this one, then this one, and this one) we get a whole sequence of fragments of waveform, which we just join together and play back as the speech signal.
We might do some overlap-add, for example, to join them together.
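That frame-by-frame process could be sketched as follows; `forward_pass` and `vocoder_synthesis` are placeholders for whatever trained network and vocoder are being used, so this is just the shape of the loop, not a particular implementation.

```python
import numpy as np

def synthesise(input_frames, forward_pass, vocoder_synthesis):
    """input_frames: array of shape (num_frames, num_linguistic_features).
    forward_pass maps one input frame to one frame of vocoder parameters;
    vocoder_synthesis turns the whole parameter sequence into a waveform
    (internally it might join fragments by overlap-add)."""
    speech_params = np.stack([forward_pass(frame) for frame in input_frames])
    return vocoder_synthesis(speech_params)
```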
That's the end-to-end process.
What we need to explain now is how to make this stuff here.
So let's go step by step through the process we're going to follow to construct this representation in front of us.
This is a sequence of frames, so that's time, measured in frames.
So this is frame one, this is frame two, and so on, for the sentence that we're trying to say.
And so the duration of the speech that we will get out is equal to the number of frames times the time interval between frames, which is five milliseconds.
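For example, with a hypothetical 1200 frames at a 5 ms frame shift:

```python
num_frames = 1200                             # illustrative value
frame_shift = 0.005                           # 5 ms, i.e. a 200 Hz frame rate
duration_seconds = num_frames * frame_shift   # 6.0 seconds of speech
```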
Just by glancing across these features, we can see that there's some part that's binary (some zeros with a few ones sprinkled in) and some part that has continuously-valued numerical features.
So we need to explain what these are and where they come from.
So the complete process for creating this input to the neural network is to take our linguistic specification from our front end in the usual way, and flatten that to get our sequence of context-dependent phones: this thing here, which we could then go off and do HMM synthesis with.
But for neural network synthesis we need to do a little bit more; specifically, we have to expand it out to a time sequence at the fixed frame rate.
So we go from the linguistic timescale of the linguistic specification and the context-dependent phones to a fixed frame rate, and the way we do that is by adding duration information, which will have to be predicted.
So we need a duration model.
After applying the duration model, time is measured at a fixed frame rate.
We can then expand out these context-dependent phone names into a sequence of binary feature vectors, and that's the frame sequence, ready for input to the neural network.
To improve performance, we might add further fine-grained positional information.
The duration information expands each context-dependent phone out into some sequence of frames, but at that point every individual frame would have the exact same specification.
It would seem useful to know how far through an individual phone we are while synthesising it, so we can add some fine-grained position information that's below the phone level: some sub-phonetic information.
So we'll add that to the frame sequence, and then input that to the neural network.
Let's take all of that a bit slower, to see the individual steps.
We shouldn't need to describe how the front end works; we've seen that picture 100 times.
We'll then rewrite its output (and this is just a question of changing format) as a sequence of context-dependent phones.
So getting from that linguistic structure (the thing that, in Festival, is an utterance structure) to this representation is simply a reformatting.
We just take a walk down the phonetic sequence and attach the relevant context to the name of the phone, to arrive at something that looks a bit like this.
So there are several parts to this.
There's the quinphone: the centre phone, plus or minus two. And then there's some supra-segmental stuff to do with where we are in syllables, part of speech, all sorts of things.
These are just examples of things we could use, and you could add anything else you think might be useful: just encode it in the name of the model.
These are all linguistic things.
So the positional features are the positions of linguistic objects in linguistic structures: position of phone in syllable, position of syllable in word, position of word in phrase, those sorts of things.
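Here is a minimal sketch of that reformatting step, composing a context-dependent model name from a flattened phone sequence plus a couple of positional features. The delimiter characters and the feature set are simplified placeholders; real HTS-style full-context labels use a particular, much longer format.

```python
def full_context_label(phones, i, pos_in_syllable, pos_in_word):
    """Build a context-dependent name for phones[i]: the quinphone (the centre
    phone plus two phones of context on each side) plus positional features."""
    padded = ["sil", "sil"] + phones + ["sil", "sil"]
    ll, l, c, r, rr = padded[i:i + 5]      # centre phone is padded[i + 2] == phones[i]
    quinphone = f"{ll}^{l}-{c}+{r}={rr}"
    return f"{quinphone}@{pos_in_syllable}_{pos_in_word}"

phones = ["h", "e", "l", "ou"]
print(full_context_label(phones, 2, pos_in_syllable=1, pos_in_word=2))
# h^e-l+ou=sil@1_2
```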
What we're going to do now is, step by step, expand that into a sequence of frames at a fixed frame rate.
The first thing that's usually done is to expand each phone into a set of sub-phones.
The reason for this is really a carry-over from HMM synthesis.
So, instead of working in terms of phones, we're going to divide them into sub-phones; here I have divided this one into five.
This is still time, still measured in linguistic units; it's just that these are phones, and this is a sequence of sub-phones of this particular one: I've expanded the phone out into five sub-phones.
So there's the beginning, there's the end, and there's some stuff in the middle, and it seems reasonable that knowing where we are in that sequence will be helpful for determining the final sound of this speech unit.
The reason for using five is just copied over from HMM synthesis.
In many of the first papers of the new wave of neural-network-based speech synthesis, a lot of machinery was carried over from HMM synthesis, so these systems were basically built on top of HMM systems.
We'll see that now in the way that we divide a phone into sub-phones: a sub-phone is simply an HMM state.
So this is an HMM of one phone in context; that's its name.
It's a model of an [l] in this context, and it's got five states, and so we just break the phone into five sub-phone units, and then we're going to predict the duration at the sub-phonetic level.
In other words, we're going to predict a duration per state, and that prediction is going to be in units of frames.
So what is that duration model? Well, in really early work, the duration model was just the regression tree from an HMM system, simply borrowed from that system.
We already know that regression trees are not the smartest model for this problem, and so we could easily replace that.
For example, we could have a separate neural network whose job is to predict durations, or we could borrow durations from natural speech if we wanted to cheat, or we could get them from any other model we like: some external duration model.
So the duration model provides the duration of each state, in frames.
We know the name of the model and we know the index of the state; that will help us follow the tree down to a leaf.
The prediction at that leaf here would be two frames, and we'd write that down; we just do that for every state in our sequence.
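A sketch of that step, with `duration_model` standing in for whichever predictor is used (a regression tree borrowed from an HMM system, a separate neural network, or durations copied from natural speech):

```python
NUM_STATES = 5   # sub-phones per phone, carried over from HMM synthesis

def predict_state_durations(label, duration_model):
    """Predict the duration, in frames, of each of the five sub-phone states
    of one context-dependent phone. HTK-style state indices run from 2 to 6."""
    return [duration_model(label, state) for state in range(2, 2 + NUM_STATES)]
```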
So this is the bridge between the linguistic timescale here (sub-phones) and the fixed frame rate here.
So let's add those durations to our picture.
That's the sequence of phones; I broke it down into sub-phones, and I'm now going to write durations on them. Those are state indices, in HTK style, so they start from two.
We can now put durations on, and a good way to represent duration will be as a start time and an end time within the whole sentence.
We now have what looks very much like an aligned label file, which is exactly what it is.
The next step is to rewrite this as binary vectors, and then finally we're going to expand that out into frames.
I'm going to describe this in a way that's rather specific to a particular set of tools, in fact to the HTS toolkit; that's a widely used tool, and so it's reasonable to use its formats, which are just borrowed from HTK.
Of course, if you're not using that tool, or if you have written your own tools, you might do this in a rather different way.
However, conceptually we're always going to need to do something equivalent to the following.
We're going to turn each of these context-dependent model names (there's one of them) into a feature vector.
It's going to be mostly zeros with a few ones.
In other words, for each of these categorical features, and all of the rest, we're going to convert it into a one-hot coding.
And in fact, we're going to do the processing for the entire model name all at once: we're going to turn the whole thing into a vector that is, in effect, almost all zeros but with a few ones in it, and those ones capture the information in this model name.
Here's how this particular tool does it.
We write down a set of patterns which either match or don't match the model name: matching will result in a feature value of one, and not matching will result in a feature value of zero.
These patterns are, in fact, just the same things that we would use as possible questions when building a regression tree in an HMM system.
We're going to have a very, very long list of possible questions, because we're going to write down everything we could possibly ask about this model name; most of the questions will not match, and will result in a value of zero, but a few will match.
Let's do that.
Here's the top of my list of questions, and I'm just going to scroll through it.
And for each of them, I'm going to ask the question about the model name.
They're a little bit like regular expressions.
I'm going to see if the model name contains this little bit of pattern.
This pattern is just saying: is the centre phone equal to that value? Most of the time the answer will be no, and I'll write a zero into my feature vector.
Whenever a question matches, that is whenever I find the answer is yes, I'll write a one into my feature vector.
So let's do that.
We start scrolling through the list of questions, writing 0, 0, 0, 0, 0, and we'll eventually come to something that does match.
This question here is saying: does the centre phone equal [l]? And here, of course, it does.
And so I write a one into my binary feature vector, and I keep on scrolling through the questions. Again, most of them don't match and I write a zero, and eventually I'll come to something else that matches.
There's the question that says: do you have a [p] to the left? That question matches this [p], which gives me another yes, and I write a one into my feature vector.
So we can see that this feature vector here is mostly zeros with a few ones.
Those ones are the one-hot codings.
This first part is the one-hot coding of the centre phone, and then there's a one-hot coding of the left phone.
And then we'll carry on: we'll do the right phone, the left-left phone, the right-right phone, and all of the other stuff.
There's one very long list of questions to do all of that.
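A minimal sketch of that question-matching process, using simple substring patterns rather than the exact HTS question syntax (the label and question set below are invented for illustration):

```python
def label_to_binary_vector(label, questions):
    """Each question is a pattern that either matches the full-context model
    name (giving a 1) or does not match it (giving a 0)."""
    return [1 if pattern in label else 0 for pattern in questions]

# A tiny illustrative question set; a real list has thousands of questions,
# covering every phone in every context position, and much more besides.
questions = ["-ae+", "-l+", "-ih+",   # "is the centre phone ... ?"
             "p^", "t^", "sil^"]      # "is the left-left phone ... ?"

print(label_to_binary_vector("p^ax-l+ih=n@...", questions))
# [0, 1, 0, 1, 0, 0]  -> mostly zeros, with a one wherever a question matched
```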
We're still operating at the linguistic timescale.
What we have is this binary vector, which is very sparse: mostly zeros, with just a few ones.
We now need to write it out at the fixed frame rate, and that's just a simple matter of duplicating it the right number of times for each HMM state.
So the first step is to obtain this vector for each HMM state.
Now we can immediately add some simple position information, namely which state we are in. This sequence here is at the linguistic clock rate, counting states 2, 3, 4, 5, 6. We now know how long we would like to spend in each state, from the duration model, so we just duplicate the lines that number of times; this one will just be duplicated twice.
So we write them out like that; that white space doesn't really mean anything, so we'll just rewrite it like that.
That's now a sequence of feature vectors at a fixed frame rate.
This is time in frames: one line per frame.
Within a single HMM state, most of the vector is constant.
And then there's some position information that we might add: here I have added a very simple feature, which is position within the state.
It says: first we're halfway through, and then we're all the way through.
Let's look at another state, this last state here: we need to be in it for three time frames, and those are the three frames here.
So we're a third of the way through, two thirds of the way through, then all the way through. This very simple positional feature is a little counter that counts our way through the state.
What this part here represents, then, is a sequence of frames.
It says: this is an [l] (that's encoded somehow in one of these one-hot things), in such-and-such a context (that's encoded in all the other one-hot stuff), with a duration of 10 frames (you can count the lines: there are 10 of them).
And we've got little counters, little clocks, ticking away at different rates: this one's going around quite rapidly, and this one's counting through states.
We could add any other counters we like; we'll see in the real example that we might have other counters that count backwards, for example.
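A sketch of that expansion from state-level vectors to frame-level vectors, appending a state index and a fractional within-state position counter; a real system might add several more counters, including ones that count backwards (everything here is illustrative):

```python
def expand_to_frames(binary_vector, state_durations):
    """Duplicate one phone's binary feature vector once per frame, appending
    simple positional features: the state index and a counter giving the
    fraction of the way through that state."""
    frames = []
    for state_index, num_frames in enumerate(state_durations, start=2):
        for frame in range(1, num_frames + 1):
            frames.append(binary_vector + [state_index, frame / num_frames])
    return frames

# e.g. a (tiny) binary vector and state durations of 2, 3 and 1 frames
for row in expand_to_frames([0, 1, 0], [2, 3, 1]):
    print(row)   # each row: binary vector + [state index, fraction through state]
```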

Let's examine an actual input file for Neural Network speech synthesis.

Here are the input features for one sentence for a frame-by-frame model (in spreadsheet format for convenience).

Just a very informal look at how this can be done, to give you a starting point before reading about this in more detail.

This video just has a plain transcript, not time-aligned to the video.
I'll finish with a rather informal description of how to train a neural network, just to show that it's fairly straightforward.
It's a very simple algorithm. Now, like all simple algorithms, we can dress it up with all sorts of other enhancements, and you'll find when you start reading the literature that that's what happens; but in principle it's pretty simple.
So we're going to give an impressionistic, hand-waving, non-mathematical introduction to how you train a neural network.
We're going to use supervised machine learning. In other words, we know the inputs and the outputs, and they're going to be aligned.
So we're going to have two sequences: a sequence of inputs and a sequence of outputs.
They're both at the same frame rate, and there's a correspondence between one frame of input and one frame of output.
Because the neural network in my pictures is rather small and simple, my inputs have got three dimensions and my outputs have got two dimensions; of course, in a real speech synthesiser, this one is to do with vocoder parameters and this one is to do with linguistic features.
We saw in the previous section how to prepare the input: how to get our sequence of vectors, which are mostly binary ones and zeros, with a few numerical positional features attached to them, little counters that count through states or count through phones; we could add counters for any other unit we liked: syllables, words, phrases.
The outputs are just the vocoder parameters: just speech features.
Now, these two sequences have to be aligned. How do we get that alignment? Well, you actually know the answer to that.
We can do it with forced alignment, in precisely the same way that you prepared data for a unit selection synthesiser.
In other words, we prepare aligned label files with duration information attached to them, where that duration information didn't come from a model that predicts duration: it came from natural speech, from the training data, through forced alignment.
So we can get aligned pairs of input and output vectors. Here's an example of what the data might look like for training a neural network.
We've got some input and some output.
This is time, in frames, and these things are aligned.
So we know that when we put this into the network, we would like to get that out of the network.
So that's the job of the network: to learn that regression.
These features are all from recorded natural speech: these are the speech features extracted from natural speech, and these are the linguistic features for the transcription of that speech, which has been time-aligned through forced alignment.
So we've prepared our data, and now we'll train our network; I'll use my really simple small network for the example.
So our inputs are going to be of dimensionality three, and the outputs of dimensionality two.
As I warned, this is going to be rather impressionistic: we're not even going to show a single equation.
The job of the network is to learn that when I show it this input, it should give me this output.
So the target output is an example from natural speech: this is an aligned pair of input and output feature vectors.
Typically, we'll initialise a neural network by just setting all the weights to random values.
When we input this feature vector, it goes through this weight matrix, which to start with is just some random linear projection. The hidden layer applies some non-linear function to it, perhaps a sigmoid; then we get another projection, and another projection, and we get some output, which is just the projection of the input through those layers.
So let's say here we might get the following value, and here we might get this one.
Now, we wouldn't expect the network to get the right answer, because its weights are random. The learning algorithm is just going to gradually adjust the weights to make the output as close as possible to the target.
That learning algorithm is called back-propagation and, impressionistically, it does the following.
We look at the output we would like to get, we look at the output that we got, and we compute some error between those two.
So here the output was higher than it should have been: there's some error here, and the error is about 0.3.
So we need to adjust the weights in a way that makes this output a little bit smaller.
We take the error and we send it back through the network, sending little messages down all the weights, and these messages are going to say: you gave me an output, but it was too big; I want an output that's a little bit smaller than that, so could you just scale down your weight a little bit? All those weights will get reduced a little bit.
The error will also be propagated back through this unit, through these weights, and there will be little messages saying how they should be adjusted as well.
The back-propagation algorithm can then take these errors that arrive at the outputs of these units: we sum the errors that arrive here, send them backwards through the activation function, and send messages down all of these weights as well, saying whether they should be increased a little bit or decreased a little bit.
We'll have computed how much each weight needs to be changed, and then we'll update the weights, and we'll get a network which next time should give us something closer to the right output.
So maybe we update our weights and put the vector back through the network, and now maybe we'll get the following outputs: a little bit closer to the correct values.
We'll just do the same again: we'll compute the errors between what the network gave us and what we wanted, and we'll use that error as a signal to send back through the network (to back-propagate), and it will tell the weights whether they need to increase or decrease.
Now, of course, we won't just do that for a single input and a single target; we'll do it for the entire training set. So we'll have thousands or millions of pairs of inputs and outputs; for each input, we'll put it through the network and compute the network output and the error.
We'll do that across the entire training set to find the average error, and then we'll back-propagate through the weights to do an update.
And then we'll iterate: we'll make many, many passes through the data, make many, many updates of the weights, and we'll gradually converge on a set of weights that gives us outputs that are as close as they are going to get to the targets.
Now, that training algorithm, as I've described it, is extremely naive.
It suggests that we need to make a pass through the entire training data before we can update the weights even once. In practice, we don't do that.
We'll do it in what are called mini-batches, so we'll make weight updates a lot more frequently and we'll be able to get the network trained a lot quicker. And there are lots and lots of other tricks to get this network to converge on the best possible set of weights.
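As a concrete illustration of the basic loop (before any of those tricks), here is a minimal numpy sketch of gradient-descent training for a network with one hidden layer: a forward pass, a mean-squared error between output and target, the error propagated backwards through the weights, and a small weight update per mini-batch. All sizes, data and the learning rate are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_HIDDEN, D_OUT = 3, 4, 2                       # illustrative sizes
W1 = 0.1 * rng.standard_normal((D_IN, D_HIDDEN))
b1 = np.zeros(D_HIDDEN)
W2 = 0.1 * rng.standard_normal((D_HIDDEN, D_OUT))
b2 = np.zeros(D_OUT)
learning_rate = 0.01

def train_step(X, T):
    """One mini-batch update. X: inputs (batch, D_IN); T: targets (batch, D_OUT)."""
    global W1, b1, W2, b2
    # Forward pass.
    H = np.tanh(X @ W1 + b1)              # hidden activations
    Y = H @ W2 + b2                       # linear output layer
    # Error between what the network gave us and what we wanted.
    dY = 2 * (Y - T) / Y.size             # gradient of the mean squared error
    # Backward pass: propagate the error back through the weights.
    dW2, db2 = H.T @ dY, dY.sum(axis=0)
    dH = (dY @ W2.T) * (1 - H ** 2)       # back through the tanh activation
    dW1, db1 = X.T @ dH, dH.sum(axis=0)
    # Adjust every weight a little, in the direction that reduces the error.
    W1 -= learning_rate * dW1; b1 -= learning_rate * db1
    W2 -= learning_rate * dW2; b2 -= learning_rate * db2
    return np.mean((Y - T) ** 2)

# Many passes over (stand-in) aligned input/output pairs, one mini-batch at a time.
X_batch = rng.standard_normal((8, D_IN))              # linguistic feature frames
T_batch = rng.standard_normal((8, D_OUT))             # target vocoder parameters
for epoch in range(100):
    loss = train_step(X_batch, T_batch)
```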
I'm deliberately not going to cover all of those here, because that's an ever-changing field; you need to read the literature to find out what people are doing today when they're training neural networks for speech synthesis.
Likewise, this very simple feed-forward architecture is about the simplest neural network we could draw, and people use far more sophisticated architectures: they're expressing something that they believe about the problem.
Just to give you one example of the sorts of things you could do to make a more complicated architecture: you could choose not to fully connect two layers, but to have a very specific pattern of connections that expresses something you think you know about the problem.
That's all I would like to say about training neural networks, because the rest is much better done from the readings.
Let's finish off by seeing where we are in the big picture, and what's coming up.
So I've only told you about the very simplest sort of neural network: feed-forward architectures. And the way I've described them is as a straight swap for the regression tree.
We saw that, basically, a hidden Markov model is still involved: for example, it's there for forced alignment of the training data, and when we do synthesis it's there as a sort of sub-phonetic clock.
It says: phones are too coarse in granularity, so we divide them into five little pieces; and for each of those five pieces we attach duration information and some positional features, which say how far through the phone we are, in terms of states.
We then use that to construct the input to the neural network.
In early work on neural network speech synthesis, in this rediscovery of neural nets for speech synthesis, a lot of things were borrowed from HMM systems, including the duration model, even though it is a rather poor regression-tree-based model.
Later work tends to use a much better duration model.
I wonder what that might be? Well, it's just another neural network.
So a typical neural network speech synthesiser will have one neural network that predicts durations, and then we use those durations and feed them into a second neural network, which does the regression onto speech parameters.
Well, what's coming next in neural network speech synthesis depends entirely on when you're watching this video, because the literature is changing very quickly on this topic.
And so, from this point onwards, I'm not even going to attempt to give you videos about the state of the art in neural network speech synthesis: you're going to go and read the current literature to find that out.
We can make use of our neural network speech synthesiser in a unit selection system, and so we'll come full circle and finish off, in the next module, by looking at hybrid synthesis.
I'll describe that as unit selection using an acoustic space formulation of the target cost function.
And it's worth pointing out that the better the regression model we get from neural network speech synthesis, the better we expect hybrid synthesis to work.
So, for any developments that we learn about from the literature on neural-network-based speech synthesis in a statistical parametric framework (in other words, driving a vocoder), we can take those developments, put them into a hybrid system, and use them to drive a unit selection system.
And we expect those improvements to carry through to better synthesis in the hybrid framework.


The essential readings are concerned with speech synthesis. If you first need some help understanding the basic ideas of Neural Networks, try one or other of the recommended readings. Both of those are complete, but short, books. Use your skim-reading skills to locate the most important parts.

Reading

Zen et al: Statistical parametric speech synthesis using deep neural networks

The first paper that re-introduced the use of (Deep) Neural Networks in speech synthesis.

Wu et al: Deep neural networks employing Multi-Task Learning…

Some straightforward, but effective techniques to improve the performance of speech synthesis using simple feedforward networks.

Watts et al: From HMMs to DNNs: where do the improvements come from?

Measures the relative contributions of the key differences in the regression model, state vs. frame predictions, and separate vs. combined stream predictions.

Download the slides for the class on 2025-03-04

These slides are now the updated version, after class. There is some homework for you in them!


As a warm-up for the next module, here is an optional additional video to watch.

Download the slides for this video

Total video to watch in this section: 64 minutes

A talk given to the University of Lancaster's student societies in Linguistics and Computer Science in December 2019. You can fast-forward over some of the background content in the middle part of the talk, if you wish.

00:0002:59 [Automatic subtitles] There's been a massive paradigm shift in the way that speech synthesis works and we'll define what speech synthesis is in a minute. So, what I'm going to try and convey to you is some understanding of why that happened and what that makes possible and why end to end is in scare quotes in this title and what people really mean by end to end. So if you're tempted to find out what was happening in speech synthesis, with a bit of searching the literature, you'd very quickly come across a huge volume of papers in the last few years making claims that they were going truly end to end. This one says end to end. They mean going from text to speech, from raw text like the text on that slide, to a waveform that could come out of your loudspeaker with a single piece of technology that is learned from pairs of text and audio, with anything inside that being fully learned. No supervision of the internal representations, completely end to end. This is one of the first papers that tried to do that. They actually did it by gluing lots of things together. It doesn't really end to end at all. And then there's a whole sequence of papers and you'll see these papers are actually coming more from industry than academia and one reason for that is that these models are computationally super expensive to work with and they're very data hungry and that is pricing some people out of the game at the moment. We can talk a bit more about that later, why that's a problem. This model from Google called, all the models have ridiculous names by the way, this is called Tacotron. You can ask Google why by typing into Google. And on and on these papers go, all sorts of things and you could try and read them and eventually you'd find that they try and explain how their systems work by drawing pictures like this, which right at this moment in this talk will be an impenetrable diagram of coloured blocks joined together with some lines that you won't understand. You might have a clue what a spectrogram is hopefully, but other than that this will be a bit mysterious. We're going to come back to this at the end of the talk and you will understand this picture by the end of the talk and understand that it's actually doing something that's not that far from some traditional methods in text to speech. So our goal is to try and understand that picture, which is a very important seminal paper from Google on the second version of this Tacotron system that can sound extremely good, even though it's computationally and data wise a little bit hungry. So what I thought I would do is to help you understand that picture, we'll do the following. We'll have a little bit of a tutorial for the first part of the talk on how text to speech is done today, working up to the state of the art as it's used commercially. So all that synthesis that you're hearing, say coming out of Amazon's Alexa, how is that currently made today and how did we get to that point? So you'll need to understand that. And we'll see that universally all deployed systems actually don't operate end to end. They fall into three very traditional blocks that we'll understand. We do a bunch of very tricky, messy linguistic stuff with text.
02:5903:56 That's called the front end. We then do some straightforward kind of machine learning stuff in the middle to get from text-like things to speech-like things. And then we make a final leap from the speech-like thing like a spectrogram to a waveform. So we generate waveforms. And there's lots of different options in these different blocks. And we'll talk about what the state of the art is, what sounds best at the moment. And that will lead us then to current research, which is moving all the time, of which Tacotron 2 is one example. And then we'll go backwards through those blocks again and see what people are doing to generate waveforms, what people are doing to bridge from written form to spoken form, and if there's any new work in text processing, which is often the forgotten part, the most important part sometimes of speech synthesis. And that will give us a clue about the things we can now do with these models that just weren't possible with some of the historical techniques that we'll see in the tutorial. And then at the end I will foolishly suggest what might happen next. And we'll keep that really brief because I will be wrong.
03:5604:21 So a tutorial then. So we need to talk about the paradigm. What is text-to-speech? It's from arbitrary text. And that text could be containing words, like this little fragment of a sentence here, but also things that are not words. So we call those non-standard words in the field. Currency amounts, dates, times, punctuation. Things that are not going to be literally read out loud need some processing. So we need some normalisation of the text.
04:2105:18 We want to get from that text to something we can listen to, and that's the only thing there is a waveform. So a waveform we can play out of the speakers. And that problem is called text-to-speech. That's what we're getting from text to speech. And the pipeline doesn't actually look like one big box. It typically looks like several boxes. We need to do quite a lot of things to text to extract something that we're going to call a linguistic specification. And we could think of that as instructions on how to say this thing out loud. So it's going to have obvious things in, like, phonemes, or maybe syllable structure, maybe prosody, stuff like that. So instructions on how to say it, which then some little bit of machine learning in the middle can do regression on. It can use as input and produce as out put something that's acoustic-like. And we tend not to go all the way to the waveform in modern techniques. We go to something that's this sequence of vectors picture I've drawn here, very abstract. If you have no idea what that is, just look at it and see a spectrogram.
05:1805:30 So a spectrogram is just a sequence of vectors. So we predict something like a spectrogram, and then it's not too hard, but still non-trivial to get from a spectrogram back to a waveform.
05:3006:00 And then we generate a waveform that we can listen to. So we can talk about those three boxes, what happens in those three boxes. So we'll start with the messy one, and that's the text processing box. And that module is called the front end, because the front of the system is the thing that receives the raw input. And its output needs to look something like this, very pretty, rich, linguistically meaningful, structured information. It might be this, syllable-structured phonemes, whose parents are words that have part of speech tags, and any number of things that you think might be useful as the how-to-say-it instructions.
06:0006:23 We can put anything we want in there, there's choices, design choices when we build the system. And we need to build some machinery that can take the text and produce that thing from the text. And it's pretty clear that involves bringing more information to the table than just there in the naked raw text. We need to bring some external sources of knowledge. For example, how do you predict the pronunciation of something from its spelling?
06:2307:55 You need some external knowledge to do that. So that thing, in terms of machine learning, which is the paradigm everything's happening in these days, we can think of as extracting features from the text that are going to be useful as input to this regression problem, this prediction of acoustics from linguistics. And you could extract all sorts of features, they might be useful, they might be less useful, and the next step of machine learning will decide whether to use them or not. So we could call that feature extraction, but traditionally it happens in a box that we call the front end. And the front end's a horrible, messy thing that, in big, mature systems, involves lots of people maintaining it and trying not to break it. So it rarely gets radically improved, it just gets tweaked because it's containing lots and lots of individual steps. So this messy, messy box called the front end has got to do things such as break the input text into tokens, which might be words, might not be words. And when they're not words, we have to normalise them. For all words, we can find their part of speech tag. For example, function word, content word is going to be extremely useful for predicting prosody, perhaps. For words that aren't in our dictionary, we're going to have to predict their pronunciation, that's called letter to sound. And we might, if we're adventurous, actually predict some sort of prosody. We might predict where to put phrase breaks, which is more than just where the punctuation is. There's a whole sequence of boxes in there, so have a little look inside those in just a shallow way, because to do all of that would be a very long course. Let's just get an idea of what's easy and what's hard in this pipeline.
07:5508:08 So tokenising is pretty straightforward for English, because English is a nice language in that it uses whitespace and punctuation. Not all languages are so well behaved. So there's one thing, and probably only one thing about English that's easy, and that's tokenisation.
08:0808:31 You can just do that with rules on punctuation and whitespace. You don't need to be particularly clever. But languages that don't use whitespace might need some serious engineering or knowledge sources to tokenise into word-like units. And for some languages you can debate what the words are, even. So that's straightforward. We need to then normalise those, so in these sentences that we really have to deal with, they're full of things that aren't words.
08:3108:49 And when I say aren't words, I mean they're things that even the Oxford English Dictionary in its massive 26-volume edition would never ever contain. No one would ever write pound sign 100 in a dictionary, and then pound sign 101, 102. We just wouldn't enumerate those in a dictionary. That would be stupid. So we can't look that up in a dictionary ever.
08:4910:01 We need to turn that into some words that we could look up in a dictionary. So we need to detect things non-standard, and that could be done with rules, rules looking at character classes like currency classes. Or it could be done with machine learning, by annotating lots of data with things that are and are not words, and training some piece of machinery to learn that. We then need to decide what kind of a not a word is it, and there's a set of standard categories, like it's a number that's a year, it's a money amount, it's a thing that you should say as a word, like IKEA. Just pretend it's a word and pronounce it as if it's a spelling of a real word. Plain numbers, letter sequences that you read as DVD, and so on and so forth. Once you've done the hard part, the expansion's pretty straightforward, but it involves human knowledge. It involves humans taking the knowledge of how you pronounce those things and writing rules using that knowledge. And that expression of your knowledge as rules seems a very simple and trivial thing to do, and we'll see that that might be a really hard thing to learn from data, because you need to see an awful lot of examples of DVD and someone saying it out loud to learn that it was a letter sequence. So these things are still done in very old-fashioned traditional ways in all the synthesis that you're hearing.
10:0111:19 All this normalisation, when you hear a mistake, it's because somebody's rules were not comprehensive enough to include that case. But that's old technology, this thing has been around for 20 years and hardly changed, because it basically is a solved problem. We also might want to annotate some richer bits of linguistic information on them, starting really quite shallow, something like a part of speech, whether things are nouns and verbs, we might use rather fine-grained categories coming from natural language processing. And to do that, we'd have to get some big corpus of text, we'd have to pay some poorly paid annotators to annotate these millions of words of text, and from that we could learn a model to tag new text we've never seen before with its parts of speech. Many words are unambiguous, one spelling only has one part of speech, but many are ambiguous, and it's their part of speech that will tell us, for example, which pronunciation to choose in the dictionary, or where to break the phrase breaks and so on. So we have this part of speech tagging. Part of speech tagging is a solved problem in NLP, if we have data. So for English, don't do a PhD on part of speech tagging, there's no wins left, it's been done. Given a big enough data set, you can part of speech text with extremely high accuracy. The problem is to do that for languages where you don't have that data, and that's unsolved. You then want to look up pronunciations of things.
11:1911:44 English is badly behaved in its spelling, its spelling is messy because it's coming from one and a half different language families, with loads of other borrowings, and of course in many languages spellings are archaic, so spellings stay fixed, or pronunciation varies away from them, or vice versa, and so we need knowledge, and for English the answer to that is to look it up in a big look-up table called a dictionary.
11:4412:29 So pronunciation, we're going to come back to pronunciation a bit later on, and see that learning pronunciations just from spoken examples might be significantly harder than writing a dictionary, because we might not see the diversity of words in a speech corpus that we would see in a dictionary, because a human dictionary expert, a lexicographer, would by design get very large numbers of word types. So writing dictionaries though is super expensive, and so the people that thought, let's go end to end, thought, we don't like dictionaries, those are expert things, we need to pay skilled people to make them, and so let's try not to do that, let's try not to get people to write these long, long, long lists of words and their pronunciations, because that's really painful.
12:2912:52 Nevertheless, all commercial systems that you ever hear deployed, whether it's Alexa, or Google Home, or Siri, have an enormous dictionary in them, and somewhere in the company there is a team of people who maintain the dictionary for each of the languages by adding words to it, so that, for example, pop singers' names are said correctly, because they won't be in the dictionary, because they're changing all the time.
12:5213:32 So at the end of all of that horrible, messy stuff that we just took a whistle-stop tour of, we just have this linguistic structure, this specification, from which we're going to now go and do the things to get from this specification of how to say it, to the acoustic description of what it sounds like, and that's where we need to do something that's called regression. So you might have come across this word regression before, if you've taken a statistics course, you might have looked at regression, models that try and fit functions to data. It's just a very generic term for predicting something that's got a continuous value from something that's an input that could be discrete or continuous, and that's just a generic problem called regression.
13:3214:41 So back to that end-to-end problem, we're trying to do this, until very, very recently nobody thought that that was a sensible problem to even try and solve. Everybody retreated a little bit from that problem, at both ends, they shrank the problem down to something that comes out of a front end, to something that's not quite a waveform, but from which we can get straightforwardly to a waveform, and this is a problem that we really think we can solve with machine learning, and even the end-to-end systems are going to do something a bit like this. So this problem is regression, because the output are these continuous values, these scalars, for example, the values of a spectrogram or the time frequency values in a spectrogram. The input is this rich linguistic thing, but we're going to have to do something to that, to make it available as the input to our chosen regression model, whatever that might be. And so we bolt onto that the front end that we just made, this thing here, that does all that messy, nasty stuff, but we contain it in a box, make it look neat, call it the front end, and we're going to need some other thing that we haven't got yet, called a waveform generator that will take our spectrogram or acoustic features and give us some sound that we can listen to.
14:4115:19 And the right way to do regression is not to try and handcraft some rules that says if this phoneme is this, then the formant value is equal to that, that's 1960s technology, that's very, very hard to generalise, for example, to make a new voice is extremely expensive, very hard to do that, and it's very hard to learn that from data, that's the wrong answer. The right answer is to use statistical modelling, or to use the fancy modern term, machine learning. So we're going to learn this model from pairs of linguistic features and acoustic features, which have in turn been extracted from a corpus of text and audio. So we're going to learn this model from a big corpus of transcribed speech.
15:1915:53 And we can think of this thing here as a feature extraction, because raw text is too horrible to deal with, it's too hard for our regression model, so we're going to make the problem easier by getting something a little bit closer to acoustics, so phonemes are closer than letters, so getting a bit closer to acoustics with some feature extraction. Waveforms are really horrible things to try and predict, we'll see later that the end-to-end systems attempted at first to go all the way to the speech waveform, to predict one-by-one the samples in a waveform, given the letters of the input. That's a horribly, horribly difficult problem, and we'll see why waveforms are such a nasty thing to try and predict directly.
15:5316:00 So we back off away from those, and we back off to something a bit like a spectrogram.
16:0018:13 So how do we do this thing here, this thing in the middle? What does this statistical model look like that does this task, which I'm going to call regression? And it sits in between the two things, the front end that we've done already, and the waveform generator that is still to come. There's lots and lots of regression models out there. If you wanted your model to be interpretable, for example if you were fitting a model to some linguistic data or some psychological data, and you wanted to point at parts of the model and say how much they explain the data, you would have to use something very explicit, some modern fashionable things like mixed effects models or something like that. But we don't care about explainability directly here, we just care about performance. We want a model that fits the data as well as possible. That is, it predicts the acoustics with the least amount of error given the linguistic input across our whole corpus, and then generalises to linguistic inputs that we never saw before. So there's generalisation, so we can say new things. So we want the very, very best regression model we can. So what we want is the most general purpose, generic, one-size-fits-all regression model out there. And there is such a thing, there's a very, very general purpose machine that does regression, and that's called a neural network. So we're not going to do a course in neural networks, but we can understand that neural networks, like this very trivial baby network here that's tiny and won't do anything very useful, are general purpose machines that can be trained to do all sorts of tasks. And our task here is regression, because the output's going to be some values of spectrogram bins. And we can train these models from data, given pairs of inputs and outputs. So let's see if we can just understand, in very broad terms, just to get some intuition of why this is a generic regression model. What's it made of, this funny picture of circles and lines? So each of these circles is called a unit or a neuron, and the people who invented these things thought they were modelling the brain, so they called them neurons. These are not models of the brain, these are just general pieces of machine learning. In no sense does your brain look like that. It's got a lot more neurons, for one thing, and a lot more connections. But these neurons are somehow representations of information.
18:1419:08 And there are connections between these neurons, they've got little arrows on, and they're called connections, and each have weights on them. And the weights are just numbers, and they're the learnable part of the model, they're the coefficients of the model. And these weights are arranged into blocks that link one layer to another, and you can see that's a 3 by 4 matrix, that's called a weight matrix. And as you can see, even this very small network has got quite a large number of weights. It's got 12 plus 16 plus 8 weights in this tiny model, so there's a large number of weights. And in a real neural network, there might be a million weights or 10 million weights or more, because we're going to make much bigger ones. These are the parameters of the model, and this is why it's the most general purpose regression model out there, because you have a very large number of parameters, and we have very straightforward machine learning algorithms that, given pairs of inputs and outputs, will find the best values for these weights. It will train these models on data.
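To make that weight-counting arithmetic concrete, here is a minimal numpy sketch of a feedforward network with layer sizes 3, 4, 4 and 2. Those sizes are an assumption, chosen so that the three weight matrices contain exactly 12, 16 and 8 weights as mentioned above; the missing biases and the choice of tanh are also just for illustration.

```python
import numpy as np

# A toy feedforward network with layer sizes 3 -> 4 -> 4 -> 2 (an assumption,
# chosen so the weight matrices have 3x4=12, 4x4=16 and 4x2=8 weights).
# Biases are omitted to keep the arithmetic identical to the count above.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 4))   # 12 weights: input layer -> hidden layer 1
W2 = rng.standard_normal((4, 4))   # 16 weights: hidden layer 1 -> hidden layer 2
W3 = rng.standard_normal((4, 2))   #  8 weights: hidden layer 2 -> output layer

def forward(x):
    """Push an input vector through the network: a weighted sum at each layer,
    with a simple nonlinearity (tanh) at the hidden layers."""
    h1 = np.tanh(x @ W1)
    h2 = np.tanh(h1 @ W2)
    return h2 @ W3                 # linear output, as is usual for regression

x = np.array([1.0, 0.0, 0.5])      # e.g. two binary linguistic features and one numerical one
print(forward(x))                  # two predicted acoustic values
print("total weights:", W1.size + W2.size + W3.size)   # 36
```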
19:1322:56 Inside the model, there are these layers of weights, and we often have many layers. And when we have many, many hidden layers, the model gets to be called deep. It's not clear when it becomes deep, whether it's two layers or three layers or four layers, but we'll see later on, and we already hinted at it in these pictures of Tacotron, that modern neural networks are extremely deep, have tens or hundreds of layers. So that's why the field is often now called deep learning, because these neural networks are deep, they have many layers, many more than this one, and each layer is much bigger. So you can think of it as a model that takes inputs and produces outputs, it predicts the output, and it can be trained from labelled inputs and outputs, so it's supervised, we need that data. And we can think of it as flowing information from inputs to outputs, transforming the representation on the input, which is going to be that linguistic thing, slowly through some intermediate representations that we don't understand, because the model learns them, but they are slowly stepping towards the representation on the output. By making the model deep enough, we can go quite a long distance from input to output. We can go from something like linguistic symbols to something like a spectrogram, which are quite far apart. By stacking enough of these layers up, and with enough layers we'll have many weights, and therefore need a lot of data, but given the data we can learn this general purpose regression. So we put some sort of input on the input, and we push that through this model, and it gives us a prediction on the output. And we train that on some labelled data, and then for a new input with an unknown output, it will tell us the output, so it will do linguistic specification to spectrogram. So you've got to put this thing that we already made onto the input of this general purpose regression model. And it doesn't seem very obvious how you would take this beautiful tree of syllables and phonemes, all linguistically rich and meaningful, and squish it into the input layer of this machine here. And you can't. So these models don't accept structured inputs. They accept numbers. Flat arrays of numbers. So we have to come up with some way of squashing that thing on the left, and putting it into this input layer here. And this is where there's a big limitation in current models: even if we're able to predict linguistically rich things, not just syllable and word structure, but phrase structure, prosodic elements, and we can explain how all these things belong together in structured relationships, that's squashed and almost lost when we put it into these regression models. And the way that we put them in is by simply querying the structure with a great big long list of questions whose answers are yes or no, so we probe the structure with lots of questions, and then encode the answers to those questions as ones and zeros on the inputs. So we query some bits of the input, put it in, make a prediction, and get the acoustic feature on the output. So for example, we might ask the question, was that third phoneme voiced? No, it's a zero. And we have hundreds and thousands of such questions, and we query some part of the linguistic structure, put those inputs through, and make some prediction of some slice of spectrogram on the output, and then we move forward a bit in time and make another prediction of the next slice and so on.
So we slide through the linguistic structure from left to right, and we print out a spectrogram from left to right. And the network won't be a little thing like that, it will have thousands of inputs and thousands of outputs, and many millions of weights. But that's learnable given pairs of inputs and outputs, so that's the problem of regression solved, and the game that people are playing now in end-to-end is to play with the shape and size of this network in very complicated ways to try and get really good regression performance from these inputs to these outputs, and we'll come back to that in the current research section.
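Here is a toy sketch of that flattening step: a structured linguistic context is probed with a list of yes/no questions, and the answers (plus a numerical positional feature) become the flat input vector. The particular questions and feature names are invented for illustration.

```python
# A sketch of flattening a linguistic specification into a flat numerical
# vector by asking yes/no questions of it. The feature names and questions
# are invented for illustration; real systems use hundreds of such questions
# plus numerical features (positions, frame index within the phone, etc.).

# One time step's worth of linguistic context, as a front end might provide it:
context = {
    "current_phone": "ae",
    "left_phone": "k",
    "right_phone": "t",
    "syllable_stressed": True,
    "position_in_phrase": 0.4,   # a numerical feature, already between 0 and 1
}

VOICED = {"ae", "b", "d", "g", "m", "n", "l"}   # tiny illustrative set

# Each question maps the structured context to 1.0 or 0.0:
questions = [
    lambda c: c["current_phone"] in VOICED,
    lambda c: c["left_phone"] in VOICED,
    lambda c: c["right_phone"] in VOICED,
    lambda c: c["current_phone"] == "ae",        # a one-hot style identity question
    lambda c: c["syllable_stressed"],
]

input_vector = [float(q(context)) for q in questions] + [context["position_in_phrase"]]
print(input_vector)   # e.g. [1.0, 0.0, 0.0, 1.0, 1.0, 0.4], fed to the network's input layer
```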
22:5623:52 So if we can print out a spectrogram from a linguistic structure, shouldn't it be pretty easy then just to turn that back into audio and listen to it? To understand that we've got to this point, we need to do a tiny bit of history, we need to look at how waveforms have been generated over the ages, and we won't go back too far in history, we'll just go back to the start of modern speech synthesis, modern data-driven speech synthesis in around 1990, and we'll see that the first attempt at doing speech synthesis from data was actually to concatenate bits of recordings, and we'll see how that fits into this paradigm of regression and waveform generation, and then there were various evolutions of that which will bring us eventually to the state-of-the-art of neural speech synthesis. But it's worth understanding how we got there, and that we've actually made some steps forward and some steps backwards along the way, and we still haven't quite got back everything that we had in 1990.
23:5224:38 So back in the 90s, and in some commercial products until recently, so until about a year ago, everything that you heard coming out of Alexa was concatenations of recorded waveforms, that's changed, but it was, it wasn't first generation unit selection, it was what we're going to do in a minute, second generation, but until recently what you heard was re-played audio recordings. And that works by having a big database of recorded sentences, and for everything we want to say, carefully and cleverly choosing fragments from that, sequencing them back together again, doing a little bit of signal processing to try and hide the fact that we've pasted audio together so the listeners don't notice, and then play that back. If you do that well, that can work pretty well actually.
24:3825:22 Because I've told you it's made from things stitched together, you might be able to sense there are some glitches in there; they're not too obvious, but there's some wobbliness in the audio, it's not perfect. But what is clear is that it sounds like real human beings, real individuals, because it is: it's just their speech played back. So it's worth understanding how that works, and to see how that might connect to the current state of the art. So let's actually do some speech synthesis, let's try to say my name from a database of recorded speech in which this word does not exist, but the parts of it do, the fragments do.
25:2225:58 So we'll go to the database and find all the fragments of audio that we would need to sequence together to say my name. These fragments are of the same size as phones, but they're called diphones, they're the second half of a phone, the first half of the next one, so they're units of co-articulation. Because co-articulation is hard, it's something we're not very good at modelling, and as phoneticians we know that co-articulation is a tricky thing, so we actually record the units of co-articulation and stitch those together to avoid having to model it. And the name of the game is to pick one thing from each column, and play them back, and try all of the permutations until you find the one that sounds the best.
25:5826:58 So you could do that like this, pick one, and not get very good results, and you could keep going, and there are many, many permutations, and this is a tiny, tiny database, real databases are much bigger. At some point there's one that's going to be plausible. And if we find the units that are appropriate for the context in which we're using them, and that join smoothly to each other, we can get away with this and convince people this sounds like recorded audio of a whole word, when it was made from little fragments joined together. So we'll draw a more general picture of doing that, and I'll do it by drawing pictures of phonemes because it's easier than these diphones. The things in blue are what we'd like to say, and this is a machine-readable version of the phonetic alphabet, so that says 'the cat sat', and the red things are candidates in the database. So the blue things are just predictions, that's what we'd like to say, and that is something that's only specified linguistically. We only know its phonemes, we don't know what it sounds like, we don't have any audio of it.
26:5827:13 The red things are the other, they are actual recorded fragments from a database, but we also know the linguistic specification, and the game is to pick one thing from each of the column of red things that will sound the best, and will say the target, the blue thing.
27:1327:44 So in first generation unit selection, it uses the same pipeline, it has a front-end that produces this very rich linguistic specification, but it actually combines the regression and waveform generation steps into a single step. So we never explicitly write out acoustic features, and the reason was in the 1990s, we weren't very good at neural networks, our neural networks were kind of small, our databases were quite small, and we couldn't do regression very well. So we didn't attempt to do it explicitly, we did it implicitly by choosing fragments.
27:4428:14 So the regression actually happens as part of waveform generation, and so we have linguistic features on the thing we want to say, such as, it's this phoneme in a stressed syllable near the end of a question, and we have the same information for everything in the database that's audio, and we just match up and try and find the closest match. We'll never find an exact match in the general case, we try and find the closest match. So we make comparisons between what we want to say, and the available candidate units, in just linguistic space.
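For intuition, here is a minimal sketch of the search that this implies: pick one candidate per target position so that the summed target cost (the linguistic mismatch being described here, and elaborated in the next paragraph) plus join cost is as small as possible. The cost functions are trivial stand-ins; real systems combine many weighted sub-costs.

```python
# A minimal sketch of the unit-selection search: pick one candidate per target
# position so that the sum of target costs (linguistic mismatch) and join costs
# (how badly adjacent candidates concatenate) is minimised. The cost functions
# here are toy stand-ins.

def target_cost(target_spec, candidate):
    # e.g. count mismatched linguistic features (stress, phonetic context, ...)
    return sum(target_spec[k] != candidate["spec"][k] for k in target_spec)

def join_cost(cand_a, cand_b):
    # e.g. spectral/pitch discontinuity at the join; here just a stored toy value
    return abs(cand_a["end_pitch"] - cand_b["start_pitch"])

def select(targets, candidates):
    """Viterbi-style search; candidates[i] is the list of candidates for position i."""
    # best[i][j] = (cost of best path ending in candidate j at position i, backpointer)
    best = [{j: (target_cost(targets[0], c), None) for j, c in enumerate(candidates[0])}]
    for i in range(1, len(targets)):
        layer = {}
        for j, c in enumerate(candidates[i]):
            tc = target_cost(targets[i], c)
            prev_j, prev_cost = min(
                ((pj, pcost + join_cost(candidates[i - 1][pj], c))
                 for pj, (pcost, _) in best[i - 1].items()),
                key=lambda x: x[1],
            )
            layer[j] = (prev_cost + tc, prev_j)
        best.append(layer)
    # trace back the lowest-cost path
    j = min(best[-1], key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    return list(reversed(path))

# Tiny demo: two target positions, two candidates each.
targets = [{"stress": 1, "left": "k"}, {"stress": 0, "left": "ae"}]
candidates = [
    [{"spec": {"stress": 1, "left": "k"}, "start_pitch": 120, "end_pitch": 118},
     {"spec": {"stress": 0, "left": "t"}, "start_pitch": 140, "end_pitch": 139}],
    [{"spec": {"stress": 0, "left": "ae"}, "start_pitch": 117, "end_pitch": 110},
     {"spec": {"stress": 0, "left": "k"}, "start_pitch": 150, "end_pitch": 145}],
]
print(select(targets, candidates))   # index of the chosen candidate at each position
```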
28:1429:25 How different was the left phonetic context, and how much does that matter? How much should we penalise that candidate for having a voiced thing on the left when we really want it to appear in a situation where there's an unvoiced thing on the left? We put costs on those, and we add up the costs, and we try and minimise this mismatch, as well as making things join smoothly. And that minimising of mismatch in linguistic space is implicitly predicting what the blue thing sounds like, by saying, well, it sounds like the best selected red thing. So there's implicit regression as part of waveform generation. That was fine, that was the state of the art until about ten years ago, but in parallel to that, in the background, was something that never became commercial state of the art, because it never sounded good enough, and you might think people would have given up on that. They would have stuck with the thing that worked commercially, and not bothered with this thing. But people persevered, and so while in industry people were doing this first generation unit selection, and fine-tuning it, with lots and lots of engineering, in the background, mostly in academia, people were looking back at doing things explicitly, and doing explicit prediction of the acoustics, and that's a technique that was known at the time as statistical parametric speech synthesis, a bit of a mouthful.
29:2531:05 And this worked, not by selecting recorded waveforms, but by actually making predictions of the acoustics, not with a neural network, because at the time we weren't very good at it, with much simpler models, and then taking those specifications of the acoustics, which are things like spectrograms, and trying to make speech signals from them, with signal processing, with very traditional signal processing. And they used things called vocoders, which you may come across if you ever did a phonetics project, and you wanted to manipulate speech in some way, for example to change its pitch, or to extend its duration, or even modify its formants, you might use a vocoder to do that manipulation. And a traditional vocoder is a very heavy piece of engineering, and only a few people are really good enough to make in detail, and they decomposed speech signals into things like the spectral envelope, the formants, the pitch, the fundamental frequency, and this non-periodic energy, the noise part, and write out explicit representations of that, which we could predict with our acoustic model, with our regression, and then the really hard part is to take those representations, and from them make a really convincing speech signal that sounds as good as the original natural speech. And that's really, really hard, and no one ever could do that quite perfectly with traditional signal processing, which is why these models were rarely, if ever, deployed commercially. Because to take a spectrogram, or some spectral envelope information, some pitch information, and try and make speech, you always got a lot of artefacts, it always sounded quite artificial. But people persevered, because they believed that this eventually would be the right paradigm, and they will turn out to be true. They will turn out to be right, these people, because they are the people that led to the current paradigm.
31:0532:43 And so this thing in the middle is those features that our regression is going to produce on the output, these vectors of acoustic features. But the path wasn't quite smooth. People had got this first generation unit selection, it was okay, but it was impossible to make it any better. It didn't matter how much you tuned your weights and your functions, it just wouldn't get any better. The statistical parametric speech synthesis got better and better, but never quite sounded natural because of the vocoder. So people thought, well, what's the obvious thing to do? Let's try and combine these two paradigms. Let's try and predict spectrograms with this parametric method, but not turn those into speech; instead, use them to choose waveform fragments. And that's a method that I'm going to call second generation, or it's called hybrid. And from ten years ago until one year ago, that's what you were hearing from all the state-of-the-art stuff. Everything on phones, everything on these smart speakers and personal assistants, whether it's Siri or Alexa, whoever else you're listening to, was doing something like this. They had the traditional front end, a big, horrible, messy thing that had been around for 20 years. They had some regression, which was by now a small neural network, to predict some sort of acoustics from which we didn't know how to go to a waveform with machine learning, but we did know how to go to a database and find bits of waveform that sounded like that. And, if you like, wallpaper over the spectrogram with real speech and play that back. And so we explicitly predict acoustics now. And so now, instead of comparing the linguistic specifications of these things, we take what we'd like to say and we predict what it should sound like in, say, the spectrogram domain.
32:4333:40 So we do some regression. Our predictions won't be perfect. And even if they were, our signal processing would ruin it if we tried to use a vocoder. But our predictions will be good enough that we can then compare the acoustic features we just predicted with the actual acoustics of the things in the database, which we know because they're speech, and go and find the same-sounding things. And the nicest paper of all on this, and I'll put these slides online if you really want to follow up on these papers, is this paper here, which has got the best title because it says everything. Imagine that you've predicted a spectrogram that's a bit fuzzy and not very good, and if you turned it into speech it would sound a bit rubbish. But you can go and find in your database bits of speech that sound a lot like that spectrogram and paper over the cracks with little tiles. They call it tiling. I think that's wallpapering. They paper over this nasty spectrogram with pristine, sharp-sounding real audio and play that back instead of the original spectrogram.
33:4035:28 And that really, really works. So this is one important paper on that, where they predict not actually a spectrogram, they predict something a bit more like formants, called line spectral pairs, so some sort of frequency domain representation of the formants, and then they take little tiles of audio and paper over it so you don't actually ever see that. And that's a really nice paper. And that was deployed for a very long time and can sound really good, but it suffers from the same limitations as first generation unit selection, in that you're stuck with the database you recorded. You can't, for example, make new voices easily; it's very expensive. So it turned out those people that spent the best part of two decades pushing the statistical parametric paradigm were right. It's just that when they were doing it, they didn't have very good regression models, they didn't have deep neural networks, they had rubbish old models called regression trees, which are not as powerful, and they didn't have a good way of generating a waveform from their output. They had vocoders, which ruined everything. So everything was great and then everything sounded vocoded. But by replacing both of those things with neural networks, everything sounds fantastic. So it was just a question of waiting until we were better at machine learning, for more powerful regression models, and dropping them into the same paradigm. And that's the latest thing. So that really brings us on to what's now current research. We can go through now and understand, finally, that method that we saw at the beginning. These people are pretending to go end to end, and we're going to discover they're not. They're going to do it in three blocks and we're going to do the same three blocks. But since we just talked about waveforms, we'll start there and work our way backwards. So how might you generate a waveform in a way that's better than a vocoder? And why would that even be hard? So if you've done any phonetics course, hopefully you've seen something like that. That's a piece of software called Praat.
35:2838:57 There are plenty of other ones out there. The thing on the top is the waveform, it's just a sound pressure wave that this microphone here is recording. And the thing on the bottom is a spectrogram, it's just a time-frequency map of the content of that signal. And hopefully you understand that getting from the top to the bottom is easy-peasy. That's well-defined, that's the thing called the Fourier transform, it's fast, it's deterministic, it always gives you the same answer, and it draws this picture. What you might not know is that the picture on the bottom is not the whole story, it's only half of the information in the waveform. It's the amount of energy at all the different frequencies. But it's not how those sine waves, at all those energies, actually line up in time. So we're dropping the half of the information that's less meaningful, because we don't know what to make of it as humans, and it's called the phase. So to get back to the waveform, you need to invent this thing called phase. You've got the magnitude, that's the amount of energy you need to mix together at all the different frequencies to make that speech. But to make all the waveforms line up correctly, for example to make those stop bursts be nice and sharp, we need to get the right phase. And that turns out to be not that easy. So to get you to understand why that's not that easy, let's try going from the audio to the spectrogram, and then back again, and get the phase wrong. So phase is something that just doesn't come up in phonetics courses. It might come up in a course on hearing, where at some point someone would say, phase is not important because we can't hear it. It's true that we can't hear it in natural speech, because there it's correct. We can hear it when it's wrong. So we'll play some original audio. That's a nice, reasonable recording. If we go to the spectrogram and back again and mess up the phase, do a bad job of guessing what the phase should be, it's got this horrible phase-y artefact. It sounds like some sort of effects pedal has been applied to it. So that's the hard problem that we need to solve, and that these deep learning people have found very good solutions to. There are quite a few papers on this. This is another thing that I would say is basically nearly a solved problem. This would be a very bad choice for a PhD topic, because you will not beat these guys. For example, these guys that we work with at Amazon. They would like to build a machine that, given any spectrogram of speech (they're only interested in speech) from a single speaker, could produce a really high-quality waveform from anyone, without having to have seen that person while building the model. So arbitrary for new speakers. So a universal model. And they put the word towards in the title. The pre-review version of the paper didn't have the word towards in the title, but they haven't quite got there yet, so they had to put that word back in. And they're going to do that with a neural network. It's going to be a bit more complicated than my one here, but the idea is going to be the same. Instead of going from linguistic features to acoustic features like a spectrogram, they're going to put the spectrogram on the input, and they're going to query values in the spectrogram. Is there any energy there, or is there energy there? Put that into their neural network, and the neural network is going to print out the samples of the waveform. So just regression again.
So it's not necessarily any harder than any other regression problem, but because it's got a very good specification on the input for the magnitude, all it's got to do is come up with a reasonable phase that sounds okay. So actually quite a well-defined problem. So it's going to print out a waveform sample by sample. So that input's actually just a spectrogram. Now their model doesn't look like that because that's just a toy neural network with a very small number of units.
38:5739:34 Of course, their paper's got the kind of crazy flow diagram in it that waves its hands and says, this is what our neural network looks like. But if I tell you that each of those orange blobs there is just some neural architecture with layers and weights connected in some particular way, that's all they are, then we understand this. It's just a more complicated version of my neural network. It's not doing anything particularly different. It's just doing regression. So I'll play two audio samples now. We have some original audio, then audio that's been converted to the spectrogram and had the phase thrown away, so we're not allowed to see the original anymore, and then this model will go back towards the original audio. It will guess the phase, and it'll do a much better job than the previous one.
39:3439:39 No wonder she searches out some wild desert to find a peaceful home.
39:3939:44 No wonder she searches out some wild desert to find a peaceful home.
39:4439:51 Anyone hear any difference between those two? On this kind of speaker system, you're not going to hear any of the artefacts. You might on good headphones if you listen carefully.
39:5140:17 So this is a really well-solved problem, really. And all people are trying to do now is to do this really fast, because these models are still too slow. So these are now just being deployed commercially, but very few people have got them actually running on your device, because it would just drain your battery every time they spoke. They're running on big servers in the cloud. So this is now what you're hearing when you listen to Google's synthesis. You're listening to this sort of architecture.
40:2240:54 So, for what I'm playing you, the question is why we need to do this at all. We don't have the original audio at synthesis time: as you're going to see in these end-to-end systems, we train the system on pairs of text and audio, but we would like to say arbitrary new things for which we only have text. I'm playing you the original audio just to show you there's very little degradation in this round trip. So this isn't speech synthesis, this is just a spectrogram-and-back round trip, to show that that part is essentially solved. That bit's solved. The front-end text processing bit, we've got some good solutions, and all the action's going to be in the middle, as we'll see in a minute.
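If you want to try that round trip yourself, here is a hedged sketch using librosa: keep only the magnitude spectrogram, then resynthesise once with no attempt at phase and once with Griffin-Lim, a classical iterative phase estimator standing in for the neural vocoders described above. The file name and analysis settings are assumptions.

```python
import numpy as np
import librosa
import soundfile as sf

# A sketch of the round trip described above: keep only the magnitude
# spectrogram (throwing the phase away), then try to get a waveform back by
# estimating the phase. Griffin-Lim is used here as a classical, pre-neural
# phase estimator; neural vocoders replace this step. The file path and
# analysis settings are assumptions.
y, sr = librosa.load("speech.wav", sr=None)

S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))   # magnitude only: phase discarded

# Naive reconstruction: pretend the phase is zero everywhere, giving audible "phasey" artefacts
y_zero_phase = librosa.istft(S.astype(complex), hop_length=256)

# Iterative phase estimation (Griffin-Lim): much closer to the original
y_griffin = librosa.griffinlim(S, n_iter=60, hop_length=256, n_fft=1024)

sf.write("zero_phase.wav", y_zero_phase, sr)
sf.write("griffinlim.wav", y_griffin, sr)
```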
40:5440:58 So let's say we can now make waveforms much better than we could with those old vocoders.
40:5841:06 We can throw vocoders away and just use these things if we've got the compute power. So now let's look back into the middle. So we've got this regression problem in the middle.
41:0641:37 That was our flowchart before, and I'm going to draw you, in uncannily closely-matching colours, Google's Tacotron model. And we're going to see that this Tacotron model has got the exact same architecture as all the traditional synthesizers anyway, even though it was claiming to go a bit end-to-end. It's got something that's doing the job of the front-end. In the dream version of this model, it takes textual input. But in the commercially-deployed version, that makes too many mistakes, so it takes phonetic input.
41:3741:56 So it's got a traditional front-end before it. It's got something that's doing something like a front-end that's extracting interesting features from the input, whether it's graphemes or phonemes. Either's possible. And that's this blue box. It's got a thing in the middle that takes whatever that thing has been extracted, and regresses it up to a spectrogram.
41:5642:04 And then it's got a thing that's very much like Amazon's thing, that takes a spectrogram and makes a waveform from it. A waveform generator.
42:0442:22 We renamed these boxes because people are changing the names of them. The front-end is not a front-end anymore, because its output is no longer interpretable. It doesn't mean anything to us humans, it's internal to the model. And so it's encoding the input into some hidden, abstract, embedded representation inside the model.
42:2242:26 And that's good and bad. It's good because it can be optimal for the task. It can be learned.
42:2642:31 It's bad because we have no idea what it is. We can't do anything useful with it.
42:3142:39 We then decode that. In other words, we regress from that mysterious internal representation to a spectrogram, and then do the obvious thing of vocoding it.
42:3942:51 And all of those other papers that we flashed at the beginning, they've got equally complicated-looking flow diagrams, but we don't really need to understand them because we can just draw colour boxes around them and see that they've all got the same architecture.
42:5143:10 So all of the state-of-the-art systems I showed you in all those previous papers, this is just one of them, do something that's a neural network that encodes input into something internal, something that takes this internal thing and decodes it into an audio representation, a time-frequency plane, so a spectrogram, and then a little vocoder that makes a waveform.
43:1143:35 So we'd better have a little bit of an understanding of this encoder-decoder architecture, because it is the paradigm shift. So the thing that really made things work, to go from statistical parametric speech synthesis to the fully neural one, where we're going close to end-to-end (I've put characters on the input here, but it would work better with phonemes), is this idea of encoding sequences of inputs into something, and then decoding them out into a spectrogram.
43:3543:39 And so we're now regressing from sequences to sequences, and that's where things got exciting.
43:3943:42 That's what actually made everything work.
43:4243:50 But this internal thing is entirely mysterious, and I can't draw you, well, I could draw you pictures of it, but they wouldn't mean anything. They'd just be numbers in matrices.
43:5043:53 They'd be utterly uninterpretable, and therefore no point visualising.
43:5344:07 We don't really know what they are. They're learned by the model, because the model is trained simply by seeing pairs of input and output, and it learns what to represent internally to do the best possible job of regression with the least amount of error.
44:0744:16 But these models need to do something that the previous generation of models didn't do, and that's because they're going from a sequence of inputs to a sequence of outputs.
44:1644:20 They need to map between two sequences, and that's not trivial.
44:2044:23 And it's not trivial because one of the sequences is a linguistic thing.
44:2344:27 It's on linguistic time. The clock that ticks through it is a clock that's in phones.
44:2744:30 It ticks through pronunciations.
44:3044:34 But the thing on the output, the horizontal axis on the spectrogram, is a thing in time.
44:3444:37 And so we have to get between two sequences that are of different lengths.
44:3744:40 Typically the acoustic one is going to be longer than the linguistic one.
44:4044:44 And we don't know exactly how they align, because we don't know that information.
44:4444:47 We don't supervise the model with the durations of the phones.
44:4744:51 It learns that. So it needs to do the sequence-to-sequence model.
44:5144:57 And so part of the model is actually doing that alignment between linguistic things and acoustic things.
44:5745:02 It learns which linguistic inputs to look at when it's trying to predict certain acoustic outputs.
45:0345:07 And that's why this mechanism is often called attention, attending to or looking to.
45:0745:11 And it scans across the input and writes out the output.
45:1145:14 So when we're doing synthesis, that's the duration model.
45:1445:20 That's the thing that says how long each linguistic input should last in the output spectrogram.
45:2045:23 So these models have got built-in models of duration.
45:2345:28 So they are complete in that sense.
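Here is a stripped-down numpy sketch of that attention idea, using plain dot-product attention purely for illustration (Tacotron-style systems use more elaborate, location-sensitive mechanisms, and the decoder states are learned rather than random): for each output frame, weights over the input positions are computed, and summing those weights over time gives an implicit duration for each input.

```python
import numpy as np

# A stripped-down sketch of attention in these sequence-to-sequence models:
# for every output (acoustic) time step, the decoder computes a set of weights
# over the input (linguistic) time steps and attends to a weighted sum of them.
# Everything here is random, just to show the mechanics and the shapes.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((6, 16))    # 6 linguistic input positions, 16-dim each

attention_matrix = []
for t in range(40):                              # 40 output frames (the acoustic clock)
    decoder_query = rng.standard_normal(16)      # stand-in for the decoder state at frame t
    scores = encoder_states @ decoder_query      # one score per input position
    weights = softmax(scores)                    # where to "look" for this frame
    context = weights @ encoder_states           # what the decoder actually conditions on
    attention_matrix.append(weights)

# Summing each input position's weights over all output frames gives an implicit
# duration: roughly how many acoustic frames each input accounted for.
durations = np.array(attention_matrix).sum(axis=0)
print(durations)   # 6 numbers that sum to 40
```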
45:2845:32 So these sequence-to-sequence models of regression, we've only got a very high level understanding, don't worry.
45:3245:36 We don't need to get into the nitty-gritty of all the different architectures, because this changes every week.
45:3645:39 This concept seems to be very well established. This just works.
45:3945:50 And if we can now encode either text or phonemes into something and then decode it, we can start doing some kind of exciting things with these models.
45:5045:59 And the most exciting thing is to accept that text-to-speech is actually a very ill-posed problem, because text does not tell you how to say something.
45:5946:02 There are many, many different ways of saying any given text.
46:0246:04 They're all valid.
46:0446:12 And in the data that you learn the system from, you just get to observe one of them, one possible way of saying a sentence, but there are many, many other ways of doing it.
46:1246:20 So imagine that you had a database where someone had read out some books, and they had changed their speaking style as they went through this database.
46:2046:23 Maybe the character speech was in the voice of a character.
46:2346:25 Maybe there was happy speech or sad speech.
46:2546:28 The style is varying as we go through the database.
46:2846:33 But the text, the bare text, doesn't quite explain that variation in style.
46:3346:35 It's not fully specified.
46:3546:38 We have to bring some more information than just the text.
46:3846:40 This is one of many papers that's doing that.
46:4046:42 It's called Style Tokens.
46:4246:48 It's got another horrible diagram, which I'm about to simplify, because who knows what it's doing? They don't really know.
46:4846:51 It's got an encoder-decoder here that we vaguely understand.
46:5146:54 And that stuff at the top, I'll just simplify that.
46:5446:59 And we'll say it's a text-to-speech model that adds more information than just the text.
46:5947:01 It adds some new information.
47:0147:03 And that's when things get exciting.
47:0347:05 They're calling it a style embedding.
47:0547:07 And this model's doing something a bit peculiar.
47:0747:19 It takes as input text and some reference audio, which is speech, of some other sentence, not the sentence you're trying to say, but in the style in which you'd like to say it.
47:1947:26 So if you'd like this sentence to come out sad, you put in text, and you just give some speech in a sad style.
47:2647:33 And this model learns to embed this sad speech into some representation internal to the model called this style embedding.
47:3347:38 And it uses that to influence the regression in the encoder-decoder model.
47:3847:41 So an embedding is just some internal representation.
47:4147:48 We've already seen one, the mysterious thing the model learns to bridge from linguistic space to acoustic space.
47:4847:52 But we can add others if we have more information than just the text.
47:5247:53 And maybe we do.
47:5347:57 Maybe we've got a corpus in which we've labelled every sentence with a label.
47:5747:59 Happy, sad, whatever.
47:5948:03 Or that we've learned those labels in some way.
48:0348:05 This one learns the labels.
48:0548:07 It doesn't require you to label the corpus.
48:0748:11 It just requires you to have these reference audios.
48:1148:16 In other words, speech in the style that you would like it to come out in.
48:1648:19 That model on the bottom, that's just our text-to-speech model.
48:1948:24 And that thing on the top, that's just new information that wasn't explicitly in the text.
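In shape terms, the conditioning can be as simple as the following numpy sketch: squash the reference audio down to one fixed-size style vector and attach it to every text encoder state before decoding. All of the "encoders" here are random stand-ins, just to show the shapes involved; the real model learns them.

```python
import numpy as np

# A sketch of how a style embedding can condition the text-to-speech model:
# the reference audio is squashed to a single fixed-size vector, which is then
# attached to every text encoder state before decoding. The projections here
# are random stand-ins purely to show the shapes.
rng = np.random.default_rng(0)

text_states = rng.standard_normal((20, 128))        # 20 phoneme/character positions
reference_frames = rng.standard_normal((300, 80))   # mel frames of some *other* utterance

# One 32-dimensional style vector summarising the whole reference utterance:
style_embedding = reference_frames.mean(axis=0) @ rng.standard_normal((80, 32))

# Broadcast the single style vector onto every text position:
conditioned = np.concatenate(
    [text_states, np.tile(style_embedding, (text_states.shape[0], 1))], axis=1
)
print(conditioned.shape)   # (20, 160): this is what the decoder now attends over
```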
48:2448:27 So this is a more interesting model than text-to-speech.
48:2748:31 It's finally text-to-speech people admitting that we can't really do text-to-speech.
48:3148:34 It doesn't mean anything to say, do text-to-speech.
48:3448:41 You've got to decide who's going to say it, in what style are they going to say it, what accent, and so on and so forth.
48:4148:46 And in the past, that was done by changing the data on which you built the system.
48:4648:56 If you wanted your first or second generation unit selection system to sound sad, it was back to the recording studio and say, could you please sound sad for the next ten hours?
48:5648:58 And let's just record ten hours of sad speech.
48:5849:04 And then if we want to speak in a sad style, we chose our waveform fragments from the sad bit of the database.
49:0449:10 That worked, but it's very, very expensive, and it doesn't really scale to continuously varying speaking styles.
49:1049:13 So imagine we have this additional information, and now it can be anything you want.
49:1349:22 It could be something you can label on the data, or something that you discover varying in the data that is not explained by the text.
49:2249:31 So it could be a speaking style label, it could be a voice quality label: hoarse voice, modal voice, whatever.
49:3149:33 It could be anything, anything that you want.
49:3349:38 Whatever the text is missing, the leftover stuff.
49:3849:45 One of the interesting things people have been doing is to derive that from another audio sample in the required speaking style.
49:4549:47 But not audio of the text you're trying to say, because you wouldn't have that.
49:4749:50 It just has to be some reference audio.
49:5049:53 So the Google paper does that with what they're calling a prosody embedding.
49:5349:55 It's not really prosody.
49:5549:57 It's really just the leftovers.
49:5749:59 So this model is learned in the same way as the text-to-speech model.
49:5950:04 You present it with pairs of text and speech, and it learns to do the regression.
50:0450:10 And in doing that, you can also measure what would be missing in the text to perfectly regress to the speech.
50:1050:15 And that leftover, that missing bit of information, is what they're calling prosody.
50:1550:17 But it could be many, many things, as we'll see.
50:1750:19 It's a lot more than just prosody.
50:1950:21 So we can now play interesting games.
50:2150:23 We can change that prosody embedding.
50:2350:25 We put a random value there and see what happens.
50:2550:31 Or we could put an embedding that's derived from different speech styles.
50:3150:37 So we can roll the dice and change the prosody embedding, and then change the speech.
50:3750:38 So...
50:3850:43 United Airlines 563 from Los Angeles to New Orleans has landed.
50:4350:44 Ridiculous sentence.
50:4450:46 Inappropriate for this voice.
50:4650:51 But by changing the embedding, we can change the style.
50:5151:01 United Airlines 563 from Los Angeles to New Orleans has landed.
51:0151:07 United Airlines 563 from Los Angeles to New Orleans has landed.
51:0751:08 And so on.
51:0851:15 United Airlines 563 from Los Angeles to New Orleans has landed.
51:1551:23 It's not just prosody, it's everything that's not in the text.
51:2351:26 It's speaking style, emotion, prosody, whatever you want to call it.
51:2651:35 And that's a huge gap in the terminology of the field, is what to call the stuff that is under-specified in the text, and how to actually properly factor that out.
51:3551:45 So one bit really is the prosody, and one bit really is the speaker identity, and one bit really is the speaking style, and one bit really is the voice quality, or whatever you think those things might be.
51:4551:52 And that's incredibly hard, and nobody has a solution to factoring those things out and giving separate control, because they're very much tangled up together.
51:5251:59 The speaker's identity and their speaking style are not independent things, even for the most talented voice actor.
51:5952:05 So disentangling these representations, giving independent control over different things, nobody has that yet.
52:0552:07 This model doesn't have it.
52:0752:12 It's not prosody embedding, it's just mimicking some audio, some reference audio that you give to the system.
52:1252:19 So whatever you give it, it will mimic that speaking style, but say the text, an arbitrary new text.
52:1952:23 So we'll finish off by getting all the way back to the beginning of text processing.
52:2352:33 We'll remind ourselves about the messy, traditional way of doing things that works, but it's really expensive to maintain, especially the dictionary, that's very hard to maintain.
52:3352:43 And if you wanted to start building systems across different accents of a language, you might well have to make very substantial changes to your entire dictionary.
52:4352:45 That's going to be also extremely expensive.
52:4552:51 And then moving to a new language is going to take a skilled lexicographer a year to even write your first-pass dictionary.
52:5152:54 So lots of reasons to want to not do the traditional thing.
52:5452:56 So what happens if you don't?
52:5652:58 What happens if you try and throw these things away?
52:5853:14 So the traditional way of doing it is to write a great big pronunciation dictionary, like this, and then still find that whenever you synthesise, many sentences you want to say have at least one word in that was not in your dictionary, because language is like that, it's productive.
53:1453:21 And so we need to extrapolate from the dictionary with this thing called a letter-to-sound model, which will make mistakes.
53:2153:25 It will get it right about 70% of the time, if we're lucky, maybe 80 or 90% for state-of-the-art.
53:2553:29 So that's the best we could do in the traditional mode of doing things.
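Here is a toy sketch of that traditional arrangement: dictionary lookup first, with a deliberately naive letter-to-sound fallback for out-of-vocabulary words. The lexicon entries and the fallback "rules" are invented for illustration, not a real lexicon or a real grapheme-to-phoneme model.

```python
# A sketch of the traditional pronunciation pipeline: dictionary lookup first,
# letter-to-sound only as a fallback for out-of-vocabulary words. The entries
# and the naive fallback below are toy examples.

LEXICON = {
    "cat": ["k", "ae", "t"],
    "decorum": ["d", "ih", "k", "ao", "r", "ah", "m"],
    "merlot": ["m", "er", "l", "ow"],        # the dictionary just *tells* us about the silent t
}

def letter_to_sound(word):
    """A deliberately naive fallback: one phone per letter. Real systems train
    a sequence model here, and still make mistakes."""
    naive = {"a": "ae", "c": "k", "e": "eh", "o": "aa", "t": "t", "r": "r",
             "m": "m", "l": "l", "d": "d", "u": "ah"}
    return [naive.get(ch, ch) for ch in word.lower()]

def pronounce(word):
    word = word.lower()
    if word in LEXICON:                       # the fixable path: just add missing words here
        return LEXICON[word]
    return letter_to_sound(word)              # the error-prone path

print(pronounce("merlot"))   # from the dictionary: ['m', 'er', 'l', 'ow']
print(pronounce("blog"))     # falls back to letter-to-sound, and is probably wrong
```

The point being made above is that when an end-to-end model gets a word wrong, there is no equivalent of that first, fixable path.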
53:2953:40 So the motivation for the people who really want to go end-to-end, by actually taking text input, not phonemic input, was to learn that dictionary as part of this encoder.
53:4053:49 So we'd have to still normalise, because none of these models can really handle currency symbols and things, they take normalised text and they embed it, so they're doing something like a pronunciation representation.
53:5353:57 But these models that do that make stupid mistakes.
53:5754:03 They make mistakes that are so bad you can never ever deploy one of these commercially, because your customers would laugh at it.
54:0354:05 They make embarrassing, silly mistakes.
54:0554:15 So all of these papers or blog posts, which they often are, they're often not proper papers from these guys, say, what a fantastic model, oh, but here's some hilarious outtakes that the model does.
54:1554:26 And they say in this one, this particular system, Tacotron 2, which is state-of-the-art, in a blog post from only a year and a half ago or so, that it has difficulty with these complex words.
54:2654:34 Now I don't know if you think these two words, decorum and merlot, are complex, I don't think so, I had a glass of merlot the other day and I didn't find it that difficult.
54:3554:37 Oh, it's going to shut down again.
54:3854:41 There's an easy answer to how you say those things.
54:4154:43 How do you learn how to say them?
54:4354:46 Well, you ask somebody, and if you don't know, you look it up in a dictionary.
54:4654:51 So there's a very easy solution to these complex words, is to look in a dictionary.
54:5154:57 And these end-to-end systems don't have dictionaries, and so when they make these mistakes, there is nowhere in the system to go and fix it.
54:5855:03 So these are systems that are not fixable, which is why they'd never be deployed.
55:0355:06 They would only be deployed with a phonetic front-end.
55:0855:17 One reason that letter-to-sound is hard, is that we need to know something, perhaps, about the way the word came about.
55:1755:18 What's the word made of?
55:1855:20 Maybe the morphology.
55:2255:32 So, we're finding now, that putting a little bit of the right amount of linguistic information back into these rather naive end-to-end models makes huge differences, really big differences.
55:3255:41 This is a student in my group that is looking at what it would take to do a better job of letter-to-sound in these end-to-end systems.
55:4555:52 And to understand how that might be possible, we could look at a graph here.
55:5255:54 The horizontal axis, which is rather small, says hours of recorded speech, going up to 600 hours.
55:5456:01 On the vertical axis, it's got the number of word types that you get, the number of unique word types.
56:0156:09 And you can see that will keep going up and up and up, but it will take a very long time before it even gets anywhere near a dictionary, which are all those lines at the top.
56:0956:14 So you would need hundreds of thousands of hours of speech to see all the words that you would see in a typical dictionary.
56:1456:19 And we don't normally work with 600-hour databases for speech synthesis, because we don't have ones of good enough quality that are that big.
56:1956:30 We're normally working down this tiny bottom left corner here, where we have databases that are tens of hours long, and in them there are tens of thousands of word types, whereas in our dictionaries there are hundreds of thousands of word types.
56:3056:39 So what the student has done is find that by adding just the simplest amount of morphology, you can effectively reduce the number of unique types of things in your database.
56:3956:41 That's the point of morphology, right?
56:4156:50 So word forms are almost infinite in their variety, but many new words are formed productively by combining morphs that already exist in other words.
56:5056:59 So one way to make your vocabulary look a lot smaller is to think of a vocabulary of morphs, morphemes, and not of words, of surface word forms.
56:5957:01 And that gets you big wins.
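Here is a toy illustration of that vocabulary-shrinking effect: counting unique word types against unique morph types after a deliberately crude affix-stripping step. The affix list, the regular expression and the word list are all invented; real work would use a proper morphological analyser over a real corpus, where the reduction is far larger.

```python
import re

# A toy illustration of the vocabulary-shrinking argument: count unique word
# types versus unique "morph" types after a deliberately crude segmentation
# that strips a couple of prefixes and suffixes. Everything here is invented
# for illustration.
AFFIXES = re.compile(r"^(un|re)?(.*?)(ing|ed|ers|er|s)?$")

def crude_morphs(word):
    prefix, stem, suffix = AFFIXES.match(word).groups()
    return [m for m in (prefix, stem, suffix) if m]

words = "hang hangs hanged hanging hanger hangers unhang rehang rehanging unhanged".split()
word_types = set(words)
morph_types = {m for w in words for m in crude_morphs(w)}

print(len(word_types), "word types ", sorted(word_types))
print(len(morph_types), "morph types", sorted(morph_types))   # fewer types, shared across words
```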
57:0157:19 So one of his particular test cases is these cases where, across a morphological boundary, there is a very high frequency letter sequence, th, which is not pronounced as the usual single phoneme, because it spans a morph boundary, and it should be two separate phonemes.
57:1957:22 These end-to-end systems get these ones wrong.
57:2357:28 So let's see if we can find that example.
57:2857:30 So I'll just play you a couple.
57:3057:40 I'll play you one of these end-to-end models, trying to say a word where there's a high frequency sound that straddles a morpheme boundary, but the system doesn't know about morphology.
57:4057:42 So we can say this word, coat hanger.
57:4257:46 So listen carefully for the sound that corresponds to the letter th.
57:4757:49 Coathanger.
57:4957:56 It says 'coathanger' with a single th sound, because the letters th are way, way more frequent as that single sound than as separate t and h sounds.
57:5657:59 But if you know about morphology, then the model gets it right.
57:5958:02 Coat hanger.
58:0258:16 This one here, ph, is very commonly going to be a single f-like sound, but there's a morph boundary here, so it's not going to be pronounced that way: it's going to be 'up-held'.
58:1858:20 Right, so let's wrap up.
58:2058:24 So hopefully you've got a little bit of intuition now about what the state of the art is.
58:2458:31 Although it's these big, heavy neural networks, and we might not know really how neural networks work, it doesn't really matter, because they're just building blocks for making regression.
58:3158:36 What's interesting is what you put into them, and what you get out of them.
58:3658:44 You can play the game of messing around with their architectures, but you're never going to win that game, because you don't have the data or the compute power.
58:4458:49 Other people have got that, they're going to win that game, they can burn the electricity, and they'll come up with something that works.
58:4958:56 What's much, much more interesting is what you put in, and what you get out, especially knowing that text alone is not enough.
58:5658:58 We need more.
58:5859:01 For example, morphology really helps.
59:0359:09 So here's a guess about what will happen next, maybe just in the very short term.
59:0959:12 Linguistics is more than phoneme sequences.
59:1259:14 We have very rich representations.
59:1459:16 Here's syllable structure.
59:1659:18 In that previous example, it was morphology.
59:1859:26 There's lots and lots of other things you could imagine putting there, so morphology, that works, and if we can infer it from text, we can do a really good job.
59:2659:29 People have tried putting syntactic structure in.
59:2959:35 Syntactic structure doesn't always directly map to how something is said, to the acoustics; the prosody doesn't simply follow from the syntax.
59:3559:38 It's going to wake up again in a minute.
59:3859:48 You could think of all sorts of other structured linguistic information that is helpful for predicting acoustics, that is being lost at the moment.
59:5059:52 Syntax is one of them.
59:5259:54 More obviously, perhaps, might be meaning.
59:5459:59 So none of the systems at the moment make any attempt to guess the meaning of a sentence.
59:5900:04 They just go for very shallow syntactic information, like content word, function word.
00:0400:06 They don't really get into semantics.
00:0600:17 There's other work, which I couldn't fit into the talk, where the data is labelled with discourse relations, so relations between spans of words: whether something is an elaboration of something else, or something is a contrast with something else.
00:1700:20 That can make really good improvements to prosody.
00:2000:25 You're probably a much better linguist than I am, so you can probably come up with other things you could add here.
00:2500:30 You could put richer things in and retain the structure, and not just squash it flat.
00:3100:34 This model has lots of representations all along the way.
00:3400:45 It's got this linguistic thing on the input, so you've got some choices about what you put in there, and what you choose will make the regression easier or more difficult, and you want to make it as easy as possible, to make it as accurate as possible.
00:4600:48 You've got a waveform on the output.
00:4801:03 We already saw that this thing, phase, is a horrible thing, and it's one reason that the end-to-end problem, going all the way to the waveform, is a bit silly, actually, and that cutting the problem in half and putting a spectrogram in the middle is way more sensible.
01:0301:04 So that's what everyone's doing.
01:0401:09 They're putting a spectrogram here, and if Amazon are right, the vocoder problem is solved.
01:0901:15 We just have a black box, vocoder, any spectrogram, you get the waveform back, so we can just tick that off as done.
01:1501:26 So that representation is a spectrogram, but it doesn't have to be a spectrogram, because when you talk to Alexa, she doesn't show you a spectrogram saying, what do you think of my spectrogram? Does it look good?
01:2601:28 Who cares? It's internal.
01:2801:32 So there's lots of ideas you could have about things that are maybe better than spectrograms.
01:3201:34 Maybe they're perceptually more relevant.
01:3401:37 The only thing we do perceptually at the moment is put a mel scale on it.
01:3701:42 So we use a nonlinear frequency scale, but you could imagine doing an awful lot more in that representation.
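For reference, that mel warping is essentially a one-liner in librosa; in this hedged sketch the file name and parameter values are assumptions, with 80 mel bands and a roughly 12 ms hop being common choices for neural TTS.

```python
import librosa

# The one perceptual step mentioned above: warping the frequency axis of the
# spectrogram onto a mel scale. The file path and parameter values are
# assumptions; 80 mel bands is a common choice for neural TTS.
y, sr = librosa.load("speech.wav", sr=22050)

mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=1024, hop_length=256,   # analysis window and hop
    n_mels=80,                    # 80 perceptually-spaced frequency bands
)
log_mel = librosa.power_to_db(mel)   # compress the dynamic range, also roughly perceptual

print(log_mel.shape)   # (80, number_of_frames): the kind of "spectrogram" the decoder predicts
```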
01:4201:51 And then there's the entirely mysterious relationship between the thing that encodes grapheme or phoneme input into something, and then that gets decoded out into the audio.
01:5101:54 We don't know what that is, and the model's optimising it.
01:5401:57 Maybe it should stay hidden from view and just be learned.
01:5702:04 Maybe we should be intervening there and making that interpretable or controllable, so we can actually go in there and adjust things.
02:0602:16 So the old statistical parametric speech synthesis, the reason people persevered with that was that you had lots and lots of control, which you never had in unit selection.
02:1602:24 You could do all sorts of nice tricks, like changing the speaker identity from very small amounts of data.
02:2402:27 You could even repair people's speech problems.
02:2802:43 So we have a company just spinning out, commercialising that technology that could take speech from someone with motor neurone disease who's already got articulation problems and essentially repair it in a statistical parametric domain and produce speech that sounded like they used to sound when they can no longer speak at all.
02:4302:47 We could never do that with unit selection, because it just plays back their impaired speech.
02:4702:51 We could morph emotions, we could do all sorts of things in that statistical parametric domain.
02:5102:56 We had lots and lots of control, which you've kind of lost again in this end-to-end paradigm.
02:5603:01 So another thing that would happen quite soon is to put control back in.
03:0103:09 If you've got a fully specified mel spectrogram with formants and harmonics, you're just always going to get the same waveform from it, because it's pretty fully specified.
03:0903:13 You just need to guess phase, and you just need to get phase right.
03:1303:16 You can't control the phase to change the speaking style.
03:1603:19 So this representation means you've got no control.
03:1903:24 So in this model, it's actually quite hard to even do simple things.
03:2403:26 Say you just want to increase the pitch.
03:2603:29 There's no knob for that in this model.
03:2903:33 If you just don't like the voice, you say, oh, could you just pitch it down a bit?
03:3303:36 This model doesn't actually have a knob for that.
03:3603:38 You could just say, just speak a bit slower.
03:3803:41 This model really doesn't have a knob for that either.
03:4103:44 Statistical parametric synthesis had explicit knobs.
03:4403:50 It had numbers and parameters you could easily change that would do all of those things trivially, because they were there in the representation.
03:5003:52 So we've lost that.
03:5304:00 Then maybe we can control the things that we never could in any of the paradigms, things like voice quality, making things like creaky voice.
04:0004:05 No one could ever do a really good job of creaky voice with signal processing or in the neural case.
04:0504:08 So I'll leave it there, because I don't want to make too many predictions.
04:0804:11 You can only be so wrong if you only make a few predictions, I think.
04:1104:16 We'll call it a day, but I will leave you with two websites if you want to find out more.
04:1604:24 One is my research group's website, and the other one is where I'll put these slides maybe over the weekend, which is my teaching website.
04:2404:27 If you really want, there are complete courses on speech synthesis and other things there as well.
04:2704:29 Thank you.


That was just a rather lightweight introduction. We deliberately kept things really simple and used the most basic type of neural network. The type of models covered in this module are already dated, but that’s OK! You need to understand the fundamentals before attempting to understand more advanced methods.

For a more practically-oriented description of the frame-by-frame approach, there is an Interspeech tutorial from 2017.