What is “end-to-end” speech synthesis?
Presented at Lancaster University in 2019. PDF slides.
There's been a massive paradigm shift in the way that speech synthesis works, and we'll define what speech synthesis is in a minute. So what I'm going to try and convey to you is some understanding of why that happened, what it makes possible, why "end-to-end" is in scare quotes in this title, and what people really mean by end-to-end. If you tried to find out what was happening in speech synthesis, with a bit of searching of the literature you'd very quickly come across a huge volume of papers in the last few years claiming that they were going truly end-to-end. This one says end-to-end. They mean going from text to speech, from raw text like the text on that slide, to a waveform that could come out of your loudspeaker, with a single piece of technology that is learned from pairs of text and audio, with everything inside fully learned. No supervision of the internal representations: completely end-to-end. This is one of the first papers that tried to do that. They actually did it by gluing lots of things together; it isn't really end-to-end at all. And then there's a whole sequence of papers, and you'll see these papers are actually coming more from industry than academia. One reason for that is that these models are computationally super expensive to work with and they're very data hungry, and that is pricing some people out of the game at the moment. We can talk a bit more about that later, why that's a problem. This model from Google is called — all the models have ridiculous names, by the way — this one is called Tacotron. You can ask Google why, by typing it into Google. And on and on these papers go, all sorts of things, and you could try and read them, and eventually you'd find that they try and explain how their systems work by drawing pictures like this, which right at this moment in the talk will be an impenetrable diagram of coloured blocks joined together with some lines that you won't understand. You might have a clue what a spectrogram is, hopefully, but other than that this will be a bit mysterious. We're going to come back to this at the end of the talk, and by then you will understand this picture, and understand that it's actually doing something that's not that far from some traditional methods in text-to-speech. So our goal is to try and understand that picture, which comes from a very important, seminal paper from Google on the second version of this Tacotron system, a system that can sound extremely good, even though it's a little bit hungry computationally and in terms of data. So, to help you understand that picture, we'll do the following. We'll have a little bit of a tutorial for the first part of the talk on how text-to-speech is done today, working up to the state of the art as it's used commercially. So all that synthesis that you're hearing, say coming out of Amazon's Alexa: how is that currently made, and how did we get to that point? You'll need to understand that. And we'll see that, universally, deployed systems don't actually operate end-to-end. They fall into three very traditional blocks that we'll understand. We do a bunch of very tricky, messy linguistic stuff with text; that's called the front end. We then do some straightforward kind of machine-learning stuff in the middle to get from text-like things to speech-like things.
And then we make a final leap from the speech-like thing, like a spectrogram, to a waveform. So we generate waveforms. And there are lots of different options in these different blocks. And we'll talk about what the state of the art is, what sounds best at the moment. And that will lead us to current research, which is moving all the time, of which Tacotron 2 is one example. And then we'll go backwards through those blocks again and see what people are doing to generate waveforms, what people are doing to bridge from written form to spoken form, and whether there's any new work in text processing, which is often the forgotten part, and sometimes the most important part, of speech synthesis. And that will give us a clue about the things we can now do with these models that just weren't possible with some of the historical techniques that we'll see in the tutorial. And then at the end I will foolishly suggest what might happen next. And we'll keep that really brief, because I will be wrong. So, a tutorial then. So we need to talk about the paradigm. What is text-to-speech? It's going from arbitrary text. And that text could contain words, like this little fragment of a sentence here, but also things that are not words. We call those non-standard words in the field. Currency amounts, dates, times, punctuation. Things that are not going to be literally read out loud need some processing, so we need some normalisation of the text. We want to get from that text to something we can listen to, and the only such thing is a waveform. So, a waveform we can play out of the speakers. And that problem is called text-to-speech. That's what we're doing: getting from text to speech. And the pipeline doesn't actually look like one big box. It typically looks like several boxes. We need to do quite a lot of things to text to extract something that we're going to call a linguistic specification. And we could think of that as instructions on how to say this thing out loud. So it's going to have obvious things in it, like phonemes, or maybe syllable structure, maybe prosody, stuff like that. So: instructions on how to say it, which then some little bit of machine learning in the middle can do regression on. It can use them as input and produce as output something that's acoustic-like. And we tend not to go all the way to the waveform in modern techniques. We go to something like this sequence-of-vectors picture I've drawn here, very abstract. If you have no idea what that is, just look at it and think "spectrogram". A spectrogram is just a sequence of vectors. So we predict something like a spectrogram, and then it's not too hard, but still non-trivial, to get from a spectrogram back to a waveform. And then we generate a waveform that we can listen to. So we can talk about those three boxes, what happens in those three boxes. We'll start with the messy one, and that's the text processing box. And that module is called the front end, because the front of the system is the thing that receives the raw input. And its output needs to look something like this: very pretty, rich, linguistically meaningful, structured information. It might be this, syllable-structured phonemes, whose parents are words that have part-of-speech tags, and any number of other things that you think might be useful as the how-to-say-it instructions. We can put anything we want in there; there are choices, design choices, when we build the system. And we need to build some machinery that can take the text and produce that thing from the text.
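Before we go further into the front end, here is the three-box pipeline just described as a minimal Python sketch. Everything in it is a toy placeholder invented for illustration — the function names and behaviour are not from any real TTS toolkit — but it shows how the stages compose.

    import numpy as np

    def front_end(text):
        """Text -> linguistic specification (here: just lower-cased tokens)."""
        return [tok.lower() for tok in text.split()]

    def acoustic_model(linguistic_spec):
        """Linguistic specification -> acoustic features (here: a fake 'spectrogram')."""
        n_frames = 10 * len(linguistic_spec)      # pretend each token lasts 10 frames
        return np.zeros((n_frames, 80))           # 80 'frequency bins' per frame

    def waveform_generator(acoustic_features):
        """Acoustic features -> waveform samples (here: silence of the right length)."""
        hop = 256                                 # samples per frame
        return np.zeros(acoustic_features.shape[0] * hop)

    def text_to_speech(text):
        return waveform_generator(acoustic_model(front_end(text)))

    waveform = text_to_speech("The cat sat")
    print(waveform.shape)                         # (7680,) samples of 'speech'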
And it's pretty clear that involves bringing more information to the table than just there in the naked raw text. We need to bring some external sources of knowledge. For example, how do you predict the pronunciation of something from its spelling? You need some external knowledge to do that. So that thing, in terms of machine learning, which is the paradigm everything's happening in these days, we can think of as extracting features from the text that are going to be useful as input to this regression problem, this prediction of acoustics from linguistics. And you could extract all sorts of features, they might be useful, they might be less useful, and the next step of machine learning will decide whether to use them or not. So we could call that feature extraction, but traditionally it happens in a box that we call the front end. And the front end's a horrible, messy thing that, in big, mature systems, involves lots of people maintaining it and trying not to break it. So it rarely gets radically improved, it just gets tweaked because it's containing lots and lots of individual steps. So this messy, messy box called the front end has got to do things such as break the input text into tokens, which might be words, might not be words. And when they're not words, we have to normalise them. For all words, we can find their part of speech tag. For example, function word, content word is going to be extremely useful for predicting prosody, perhaps. For words that aren't in our dictionary, we're going to have to predict their pronunciation, that's called letter to sound. And we might, if we're adventurous, actually predict some sort of prosody. We might predict where to put phrase breaks, which is more than just where the punctuation is. There's a whole sequence of boxes in there, so have a little look inside those in just a shallow way, because to do all of that would be a very long course. Let's just get an idea of what's easy and what's hard in this pipeline. So tokenising is pretty straightforward for English, because English is a nice language in that it uses whitespace and punctuation. Not all languages are so well behaved. So there's one thing, and probably only one thing about English that's easy, and that's tokenisation. You can just do that with rules on punctuation and whitespace. You don't need to be particularly clever. But languages that don't use whitespace might need some serious engineering or knowledge sources to tokenise into word-like units. And for some languages you can debate what the words are, even. So that's straightforward. We need to then normalise those, so in these sentences that we really have to deal with, they're full of things that aren't words. And when I say aren't words, I mean they're things that even the Oxford English Dictionary in its massive 26-volume edition would never ever contain. No one would ever write pound sign 100 in a dictionary, and then pound sign 101, 102. We just wouldn't enumerate those in a dictionary. That would be stupid. So we can't look that up in a dictionary ever. We need to turn that into some words that we could look up in a dictionary. So we need to detect things non-standard, and that could be done with rules, rules looking at character classes like currency classes. Or it could be done with machine learning, by annotating lots of data with things that are and are not words, and training some piece of machinery to learn that. 
We then need to decide what kind of not-a-word it is, and there's a set of standard categories: it's a number that's a year, it's a money amount, it's a thing that you should say as a word, like IKEA — just pretend it's a word and pronounce it as if it were the spelling of a real word. Plain numbers, letter sequences that you read out letter by letter, like DVD, and so on and so forth. Once you've done the hard part, the expansion's pretty straightforward, but it involves human knowledge. It involves humans taking the knowledge of how you pronounce those things and writing rules using that knowledge. And that expression of your knowledge as rules seems a very simple and trivial thing to do, and we'll see that it might be a really hard thing to learn from data, because you'd need to see an awful lot of examples of DVD and someone saying it out loud to learn that it was a letter sequence. So these things are still done in very old-fashioned, traditional ways in all the synthesis that you're hearing. All this normalisation: when you hear a mistake, it's because somebody's rules were not comprehensive enough to include that case. But that's old technology; this thing has been around for 20 years and hardly changed, because it basically is a solved problem. We also might want to annotate some richer bits of linguistic information on the words, starting really quite shallow, something like a part of speech, whether things are nouns and verbs; we might use rather fine-grained categories coming from natural language processing. And to do that, we'd have to get some big corpus of text, we'd have to pay some poorly paid annotators to annotate those millions of words of text, and from that we could learn a model to tag new text we've never seen before with its parts of speech. Many words are unambiguous, one spelling only has one part of speech, but many are ambiguous, and it's their part of speech that will tell us, for example, which pronunciation to choose in the dictionary, or where to put phrase breaks, and so on. So we have this part-of-speech tagging. Part-of-speech tagging is a solved problem in NLP, if we have data. So for English, don't do a PhD on part-of-speech tagging: there are no wins left, it's been done. Given a big enough data set, you can part-of-speech tag text with extremely high accuracy. The problem is to do that for languages where you don't have that data, and that's unsolved. You then want to look up pronunciations of things. English is badly behaved in its spelling; its spelling is messy because it's coming from one and a half different language families, with loads of other borrowings, and of course in many languages spellings are archaic, so spellings stay fixed while pronunciation drifts away from them, or vice versa. So we need knowledge, and for English the answer to that is to look it up in a big look-up table called a dictionary. We're going to come back to pronunciation a bit later on, and see that learning pronunciations just from spoken examples might be significantly harder than writing a dictionary, because we might not see the diversity of words in a speech corpus that we would see in a dictionary, because a human dictionary expert, a lexicographer, would by design cover very large numbers of word types.
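Going back to normalisation for a moment: as a concrete illustration of that kind of rule-writing, here is a toy Python sketch of non-standard-word detection and expansion. The categories, regular expressions and number expansion are deliberately minimal and invented, nothing like a production front end.

    import re

    UNITS = "zero one two three four five six seven eight nine".split()

    def number_to_words(n):
        # Toy expansion: digit by digit only; a real system spells out full number names.
        return " ".join(UNITS[int(d)] for d in str(n))

    def classify_token(tok):
        if re.fullmatch(r"£\d+", tok):
            return "money"
        if re.fullmatch(r"(19|20)\d\d", tok):
            return "year"
        if re.fullmatch(r"[A-Z]{2,4}", tok):
            return "letter_sequence"      # e.g. DVD, read out letter by letter
        return "ordinary_word"

    def expand(tok):
        kind = classify_token(tok)
        if kind == "money":
            return number_to_words(int(tok[1:])) + " pounds"
        if kind == "year":
            return number_to_words(int(tok))
        if kind == "letter_sequence":
            return " ".join(tok)          # "DVD" -> "D V D"
        return tok

    print([expand(t) for t in "I paid £100 for a DVD in 2019".split()])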
So writing dictionaries though is super expensive, and so the people that thought, let's go end to end, thought, we don't like dictionaries, those are expert things, we need to pay skilled people to make them, and so let's try not to do that, let's try not to get people to write these long, long, long lists of words and their pronunciations, because that's really painful. Nevertheless, all commercial systems that you ever hear deployed, whether it's Alexa, or Google Home, or Siri, have an enormous dictionary in them, and somewhere in the company there is a team of people who maintain the dictionary for each of the languages by adding words to it, so that, for example, pop singers' names are said correctly, because they won't be in the dictionary, because they're changing all the time. So at the end of all of that horrible, messy stuff that we just took a whistle-stop tour of, we just have this linguistic structure, this specification, from which we're going to now go and do the things to get from this specification of how to say it, to the acoustic description of what it sounds like, and that's where we need to do something that's called regression. So you might have come across this word regression before, if you've taken a statistics course, you might have looked at regression, models that try and fit functions to data. It's just a very generic term for predicting something that's got a continuous value from something that's an input that could be discrete or continuous, and that's just a generic problem called regression. So back to that end-to-end problem, we're trying to do this, until very, very recently nobody thought that that was a sensible problem to even try and solve. Everybody retreated a little bit from that problem, at both ends, they shrank the problem down to something that comes out of a front end, to something that's not quite a waveform, but from which we can get straightforwardly to a waveform, and this is a problem that we really think we can solve with machine learning, and even the end-to-end systems are going to do something a bit like this. So this problem is regression, because the output are these continuous values, these scalars, for example, the values of a spectrogram or the time frequency values in a spectrogram. The input is this rich linguistic thing, but we're going to have to do something to that, to make it available as the input to our chosen regression model, whatever that might be. And so we bolt onto that the front end that we just made, this thing here, that does all that messy, nasty stuff, but we contain it in a box, make it look neat, call it the front end, and we're going to need some other thing that we haven't got yet, called a waveform generator that will take our spectrogram or acoustic features and give us some sound that we can listen to. And the right way to do regression is not to try and handcraft some rules that says if this phoneme is this, then the formant value is equal to that, that's 1960s technology, that's very, very hard to generalise, for example, to make a new voice is extremely expensive, very hard to do that, and it's very hard to learn that from data, that's the wrong answer. The right answer is to use statistical modelling, or to use the fancy modern term, machine learning. So we're going to learn this model from pairs of linguistic features and acoustic features, which have in turn been extracted from a corpus of text and audio. So we're going to learn this model from a big corpus of transcribed speech. 
And we can think of this thing here as a feature extraction, because raw text is too horrible to deal with, it's too hard for our regression model, so we're going to make the problem easier by getting something a little bit closer to acoustics, so phonemes are closer than letters, so getting a bit closer to acoustics with some feature extraction. Waveforms are really horrible things to try and predict, we'll see later that the end-to-end systems attempted at first to go all the way to the speech waveform, to predict one-by-one the samples in a waveform, given the letters of the input. That's a horribly, horribly difficult problem, and we'll see why waveforms are such a nasty thing to try and predict directly. So we back off away from those, and we back off to something a bit like a spectrogram. So how do we do this thing here, this thing in the middle? What does this statistical model look like to do this task which I'm going to call regression? And it sits in between the two things, the front end that we've done already, and the waveform generator that is still to come. There's lots and lots of regression models out there. If you wanted your model to be interpretable, for example if you were fitting a model to some linguistic data or some psychological data, and you wanted to point at parts of the model and say how much they explain the data, you would have to use something very explicit, some modern fashionable things like mixed effects models or something like that. But we don't care about explainability directly here, we just care about performance. We want a model that fits the data as well as possible. That is that it predicts the acoustics with the least amount of error given the linguistic input across our whole corpus, and then generalises to linguistic inputs that we never saw before. So there's generalisation, so we can say new things. So we want the very, very best regression model we can. So what we want is the most general purpose, generic, fits all sizes regression model out there. And there is such a thing, there's a very, very general purpose machine that does regression, and that's called a neural network. So we're not going to do a course in neural networks, but we can understand that neural networks, like this very trivial baby network here that's tiny and won't do anything very useful, are general purpose machines that can be trained to do all sorts of tasks. And our task here is regression, because the output's going to be some values of spectrogram bins. And we can train these models from data, given pairs of inputs and outputs. So let's see if we can just understand, in very broad terms, just to get some intuition of why this is of a generic regression model. What's it made of, this funny picture of circles and lines? So each of these circles is called a unit or a neuron, and the people who invented these things thought they were modelling the brain, so they called them neurons. These are not models of the brain, these are just general pieces of machine learning. In no sense does your brain look like that. It's got a lot more neurons, for one thing, and a lot more connections. But this is neurons that are somehow representations of information. And there are connections between these neurons, they've got little arrows on, and they're called connections, and each have weights on them. And the weights are just numbers, and they're the learnable part of the model, they're the coefficients of the model. 
And these weights are arranged into blocks that link one layer to another, and you can see that's a 3 by 4 matrix; that's called a weight matrix. And as you can see, even this very small network has quite a large number of weights: 12 plus 16 plus 8 in this tiny model. And in a real neural network, there might be a million weights or 10 million weights or more, because we're going to make much bigger ones. These are the parameters of the model, and this is why it's the most general-purpose regression model out there: because you have a very large number of parameters, and we have very straightforward machine learning algorithms that, given pairs of inputs and outputs, will find the best values for these weights. So we can train these models on data. Inside the model, there are these layers of weights, and we often have many layers. And when we have many, many hidden layers, the model gets to be called deep. It's not clear when it becomes deep, whether it's two layers or three layers or four layers, but we'll see later on — and we already hinted at it in those pictures of Tacotron — that modern neural networks are extremely deep, with tens or hundreds of layers. That's why the field is often now called deep learning: because these neural networks are deep, they have many layers, many more than this one, and each layer is much bigger. So you can think of it as a model that takes inputs and produces outputs, that predicts the output, and that can be trained from labelled inputs and outputs; so it's supervised, we need that data. And we can think of it as flowing information from inputs to outputs, transforming the representation on the input, which is going to be that linguistic thing, slowly through some intermediate representations that we don't understand, because the model learns them, but which are slowly stepping towards the representation on the output. By making the model deep enough, we can go quite a long distance from input to output. We can go from something like linguistic symbols to something like a spectrogram, which are quite far apart. By stacking enough of these layers up — and with enough layers we'll have many weights, and therefore need a lot of data — given the data, we can learn this general-purpose regression. So we put some sort of input on the input, and we push that through this model, and it gives us a prediction on the output. And we train that on some labelled data, and then for a new input with an unknown output, it will tell us the output; so it will do linguistic specification to spectrogram. So, onto the input of this general-purpose regression model, you've got to put this thing that we already made. And it doesn't seem very obvious how you would take this beautiful tree of syllables and phonemes, all linguistically rich and meaningful, and squish it into the input layer of this machine here. And you can't. These models don't accept structured inputs. They accept numbers. Flat arrays of numbers. So we have to come up with some way of squashing that thing on the left and putting it into this input layer here. And this is where there's a big limitation in current models: even if we're able to predict linguistically rich things, not just syllable and word structure, but phrase structure, prosodic elements, and we can explain how all these things belong together in structured relationships, that's squashed and almost lost when we put it into these regression models.
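To make that count of weights concrete, here is one plausible reading of the toy network on the slide (3 inputs, two hidden layers of 4 units, 2 outputs, giving 12 + 16 + 8 weights) in a few lines of NumPy, fed with the kind of yes/no linguistic answers described next. Everything here — the questions, the sizes, the random weights — is invented purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((3, 4))   # 12 weights
    W2 = rng.standard_normal((4, 4))   # 16 weights
    W3 = rng.standard_normal((4, 2))   # 8 weights

    def forward(x):
        h1 = np.tanh(x @ W1)           # first hidden layer
        h2 = np.tanh(h1 @ W2)          # second hidden layer
        return h2 @ W3                 # linear output: two predicted acoustic values

    # The input is answers to yes/no questions about the linguistic structure,
    # encoded as ones and zeros; real systems use hundreds of such questions and
    # predict whole spectrogram slices on the output.
    questions = {
        "is the current phone a vowel?": 1.0,
        "is the previous phone voiced?": 0.0,
        "is the syllable stressed?": 1.0,
    }
    x = np.array(list(questions.values()))
    print(forward(x))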
And the way that we put them in is by simply querying the structure with a great big long list of questions that reduce the answers yes and no, so probe the structure with lots of questions, and then encode the answers to those questions as ones and zeros on the inputs. So we query some bits of the input, put it in, make a prediction, and get the acoustic feature on the output. So for example, we might ask the question, was that third phoneme voiced? No, it's a zero. And we have hundreds and thousands of such questions, and we query some part of the linguistic structure, put those inputs through, and make some prediction of some slice of spectrogram on the output, and then we move forward a bit in time and make another prediction of the next slice and so on. So we slide through the linguistic structure from left to right, and we print out a spectrogram from left to right. And the network won't be a little thing like that, it will have thousands of inputs and thousands of outputs, and many millions of weights. But that's learnable given pairs of inputs and outputs, so that's the problem of regression solved, and the game that people are playing now in end-to-end is to play with the shape and size of this network in very complicated ways to try and get really good regression performance from these inputs to these outputs, and we'll come back to that in the current research section. So if we can print out a spectrogram from a linguistic structure, shouldn't it be pretty easy then just to turn that back into audio and listen to it? To understand that we've got to this point, we need to do a tiny bit of history, we need to look at how waveforms have been generated over the ages, and we won't go back too far in history, we'll just go back to the start of modern speech synthesis, modern data-driven speech synthesis in around 1990, and we'll see that the first attempt at doing speech synthesis from data was actually to concatenate bits of recordings, and we'll see how that fits into this paradigm of regression and waveform generation, and then there were various evolutions of that which will bring us eventually to the state-of-the-art of neural speech synthesis. But it's worth understanding how we got there, and that we've actually made some steps forward and some steps backwards along the way, and we still haven't quite got back everything that we had in 1990. So back in the 90s, and in some commercial products until recently, so until about a year ago, everything that you heard coming out of Alexa was concatenations of recorded waveforms, that's changed, but it was, it wasn't first generation unit selection, it was what we're going to do in a minute, second generation, but until recently what you heard was re-played audio recordings. And that works by having a big database of recorded sentences, and for everything we want to say, carefully and cleverly choosing fragments from that, sequencing them back together again, doing a little bit of signal processing to try and hide the fact that we've pasted audio together so the listeners don't notice, and then play that back. If you do that well, that can work pretty well actually. Because I've told you it's made from things stitched together, you might be able to sense there are some glitches in there, they're not too obvious, but there's some wobbliness in the audio, it's not perfect. But what is clear, it sounds like real human beings, real individuals because it is, it's just their speech played back. 
So it's worth understanding how that works, and to see how that might connect to the current state of the art. So let's actually do some speech synthesis, let's try say my name from a database of recorded speech in which this word does not exist, but the parts of it do, the fragments do. So we'll go to the database and find all the fragments of audio that we would need to sequence together to say my name. These fragments are of the same size as phones, but they're called diphones, they're the second half of a phone, the first half of the next one, so they're units of co-articulation. Because co-articulation is hard, it's something we're not very good at modelling, and as phoneticians we know that co-articulation is a tricky thing, so we actually record the units of co-articulation and stitch those together to avoid having to model it. And the name of the game is to pick one thing from each column, and play them back, and try all of the permutations until you find the one that sounds the best. So you could do that like this, pick one, and not get very good results, and you could keep going, and there's many, many permutations, and this is a tiny, tiny database, real databases are much bigger. At some point there's one that's going to be plausible. And if we find the units that are appropriate for the context in which we're using them, and that join smoothly to each other, we can get away with this and convince people this sounds like recorded audio of a whole word, when it was made from little fragments, joined together. So we'll draw a more general picture of doing that, so I'll do it by drawing pictures of phonemes because it's easier than these diphones. These things in blue is what we'd like to say, and these is a machine-readable version of the phonetic alphabet, so that says the cat sat, and the red things are candidates in the database. So blue things are just predictions, that's what we'd like to say, and that is something that's only specified linguistically. We only know its phonemes, we don't know what it sounds like, we don't have any audio of it. The red things are the other, they are actual recorded fragments from a database, but we also know the linguistic specification, and the game is to pick one thing from each of the column of red things that will sound the best, and will say the target, the blue thing. So in first generation unit selection, it uses the same pipeline, it has a front-end that produces this very rich linguistic specification, but it actually combines the regression and waveform generation steps into a single step. So we never explicitly write out acoustic features, and the reason was in the 1990s, we weren't very good at neural networks, our neural networks were kind of small, our databases were quite small, and we couldn't do regression very well. So we didn't attempt to do it explicitly, we did it implicitly by choosing fragments. So the regression actually happens as part of waveform generation, and so we have linguistic features on the thing we want to say, such as, it's this phoneme in a stressed syllable near the end of a question, and we have the same information for everything in the database that's audio, and we just match up and try and find the closest match. We'll never find an exact match in the general case, we try and find the closest match. So we make comparisons between what we want to say, and the available candidate units, in just linguistic space. How different was the left phonetic context, and how much does that matter? 
How much should we penalise that candidate for having a voiced thing on the left when we really want it to appear in a situation where there's an unvoiced thing on the left? We put costs on those, we add up the costs, and we try and minimise this mismatch, as well as making things join smoothly. And that minimising of mismatch in linguistic space is implicitly predicting what the blue thing sounds like, by saying: well, it sounds like the best selected red thing. So there's implicit regression as part of waveform generation. That was fine, that was the state of the art until about ten years ago, but in parallel to that, in the background, was something that never became the commercial state of the art, because it never sounded good enough, and you might think people would have given up on it. They would have stuck with the thing that worked commercially, and not bothered with this other thing. But people kept at it, and so while in industry people were doing this first-generation unit selection and fine-tuning it with lots and lots of engineering, in the background, mostly in academia, people were looking back at doing things explicitly, doing explicit prediction of the acoustics, and that's a technique that was known at the time as statistical parametric speech synthesis — a bit of a mouthful. And this worked not by selecting recorded waveforms, but by actually making predictions of the acoustics, not with a neural network, because at the time we weren't very good at those, but with much simpler models, and then taking those specifications of the acoustics, which are things like spectrograms, and trying to make speech signals from them with signal processing, with very traditional signal processing. And they used things called vocoders, which you may have come across if you ever did a phonetics project and you wanted to manipulate speech in some way, for example to change its pitch, or to extend its duration, or even modify its formants: you might use a vocoder to do that manipulation. And a traditional vocoder is a very heavy piece of engineering that only a few people are really good enough to build in detail. They decomposed speech signals into things like the spectral envelope, the formants, the pitch, the fundamental frequency, and the non-periodic energy, the noise part, and wrote out explicit representations of those, which we could predict with our acoustic model, with our regression. And then the really hard part is to take those representations and from them make a really convincing speech signal that sounds as good as the original natural speech. That's really, really hard, and no one ever could do it quite perfectly with traditional signal processing, which is why these models were rarely, if ever, deployed commercially. Because to take a spectrogram, or some spectral envelope information, some pitch information, and try and make speech, you always got a lot of artefacts; it always sounded quite artificial. But people persevered, because they believed that this eventually would be the right paradigm, and they turned out to be right, these people, because they are the ones who led to the current paradigm. And so this thing in the middle is those features that our regression is going to produce on the output, these vectors of acoustic features. But the path wasn't quite smooth. People had got this first-generation unit selection; it was okay, but it was impossible to make it any better.
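That "pick one from each column, minimising target cost plus join cost" search is usually done with dynamic programming. Here is a small, self-contained sketch of that search; the feature names, candidates and cost functions are invented toy examples, not a real unit-selection engine.

    def select_units(targets, candidates, target_cost, join_cost):
        """Pick one candidate per target position, minimising total target + join cost."""
        n = len(targets)
        # best[i][j] = (cheapest cost of any path ending in candidate j at position i,
        #               index of the best predecessor at position i-1)
        best = [dict() for _ in range(n)]
        for j, c in enumerate(candidates[0]):
            best[0][j] = (target_cost(targets[0], c), None)
        for i in range(1, n):
            for j, c in enumerate(candidates[i]):
                tc = target_cost(targets[i], c)
                cost, prev = min(
                    (best[i - 1][k][0] + join_cost(candidates[i - 1][k], c) + tc, k)
                    for k in best[i - 1]
                )
                best[i][j] = (cost, prev)
        # Trace back the cheapest path from the last position.
        j = min(best[-1], key=lambda k: best[-1][k][0])
        path = []
        for i in range(n - 1, -1, -1):
            path.append(candidates[i][j])
            j = best[i][j][1]
        return list(reversed(path))

    # Toy example: two target slots, two database candidates for each.
    targets = [{"phone": "s", "stressed": 1}, {"phone": "ay", "stressed": 1}]
    candidates = [
        [{"phone": "s", "stressed": 0, "join": 3}, {"phone": "s", "stressed": 1, "join": 7}],
        [{"phone": "ay", "stressed": 1, "join": 7}, {"phone": "ay", "stressed": 0, "join": 2}],
    ]
    target_cost = lambda t, c: sum(t[f] != c[f] for f in t)   # linguistic mismatch
    join_cost = lambda a, b: abs(a["join"] - b["join"])       # stand-in for join smoothness
    print(select_units(targets, candidates, target_cost, join_cost))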
It didn't matter how much you tuned your weights and your cost functions; first-generation unit selection just wouldn't get any better. Statistical parametric speech synthesis, on the other hand, got better and better, but never quite sounded natural because of the vocoder. So people thought, well, what's the obvious thing to do? Let's try and combine these two paradigms. Let's predict spectrograms with the parametric method, but not turn those into speech: use them to choose waveform fragments instead. And that's a method I'm going to call second generation; it's also called hybrid. And from ten years ago until about one year ago, that's what you were hearing from all the state-of-the-art stuff. Everything on phones, everything on these smart speakers and personal assistants, whether it's Siri or Alexa or whoever else you're listening to, was doing something like this. They had the traditional front end, a big, horrible, messy thing that had been around for 20 years. They had some regression, which was by now a small neural network, to predict some sort of acoustics from which we didn't know how to get to a waveform with machine learning — but we did know how to go to a database and find bits of waveform that sounded like that, and, if you like, wallpaper over the spectrogram with real speech and play that back. So we explicitly predict acoustics now. And now, instead of comparing the linguistic specifications of these things, we take what we'd like to say and we predict what it should sound like in, say, the spectrogram domain. So we do some regression. Our predictions won't be perfect. And even if they were, our signal processing would ruin it if we tried to use a vocoder. But our predictions will be good enough that we can then compare the acoustic features we just predicted with the actual acoustics of the things in the database, which we know because it's recorded speech, and go and find the same-sounding things. And the nicest paper of all on this — and I'll put these slides online if you really want to follow up on these papers — is this paper here, which has got the best title, because it says everything. Imagine that you've predicted a spectrogram that's a bit fuzzy and not very good, and if you turned it into speech it would sound a bit rubbish. But you can go and find in your database bits of speech that sound a lot like that spectrogram and paper over the cracks with little tiles. They call it tiling; I think of it as wallpapering. They paper over this nasty spectrogram with real, pristine, sharp-sounding audio and play that back instead of the original spectrogram. And that really, really works. So this is one important paper on that, where they predict not actually a spectrogram but something a bit more like formants, called line spectral pairs, so some sort of frequency-domain representation of the formants, and then they take little tiles of audio and paper over it so you don't actually ever hear that prediction directly. And that's a really nice paper. And that was deployed for a very long time and can sound really good, but it suffers from the same limitation as first-generation unit selection, in that you're stuck with the database you recorded. You can't, for example, make new voices easily; it's very expensive. So it turned out those people that spent the best part of two decades pushing the statistical parametric paradigm were right.
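In sketch form, the hybrid "predict, then select" idea from a moment ago is just nearest-neighbour search in acoustic space: predict features for each target segment, then pick the database unit whose real acoustics are closest. The numbers below are random stand-ins, and join costs and the actual tiling are omitted for brevity.

    import numpy as np

    predicted = np.random.default_rng(2).standard_normal((10, 40))   # 10 target segments x 40 features
    database = np.random.default_rng(3).standard_normal((500, 40))   # 500 candidate units with known acoustics

    distances = np.linalg.norm(predicted[:, None, :] - database[None, :, :], axis=-1)
    chosen = distances.argmin(axis=1)     # index of the best-matching unit per segment
    print(chosen)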
It's just that when those people were doing it, they didn't have very good regression models — they didn't have deep neural networks, they had rubbish old models called regression trees, which are not as powerful — and they didn't have a good way of generating a waveform from their output. They had vocoders, which ruined everything. So everything was great, and then everything sounded vocoded. But by replacing both of those things with neural networks, everything sounds fantastic. So it was just a question of waiting until we were better at machine learning, with more powerful regression models, and dropping them into the same paradigm. And that's the latest thing. So that really brings us on to what's now current research. We can go through now and understand, finally, that method that we saw at the beginning. These are the people who claim to be going end-to-end, and we're going to discover they're not: they're going to do it in three blocks, and we're going to go through the same three blocks. But since we just talked about waveforms, we'll start there and work our way backwards. So how might you generate a waveform in a way that's better than a vocoder? And why would that even be hard? If you've done any phonetics course, hopefully you've seen something like this. That's a piece of software called Praat; there are plenty of other ones out there. The thing on the top is the waveform: it's just a sound pressure wave, like the one this microphone here is recording. And the thing on the bottom is a spectrogram: it's just a time-frequency map of the content of that signal. And hopefully you understand that getting from the top to the bottom is easy-peasy. That's well defined, that's the thing called the Fourier transform; it's fast, it's deterministic, it always gives you the same answer, and it draws this picture. What you might not know is that the picture on the bottom is not the whole story; it's only half of the information in the waveform. It's the amount of energy at all the different frequencies. But it doesn't tell you how the sine waves at all those frequencies actually line up in time. So we're dropping the half of the information that's less meaningful, because we don't know what to make of it as humans, and it's called the phase. So to get back the other way, you need to invent this thing called phase. You've got the magnitude: that's how much energy at each of the different frequencies you need to mix together to make that speech. But to make all the waveforms line up correctly, for example to make those stop bursts nice and sharp, we need to get the right phase. And that turns out to be not that easy. So to get you to understand why it's not that easy, let's try going from the audio to the spectrogram, and then back again, and get the phase wrong. Phase is something that just doesn't come up in phonetics courses. It might come up in a course on hearing, where at some point someone would say: phase is not important, because we can't hear it. It's true that we can't hear it in natural speech, because there it's correct. We can hear it when it's wrong. So we'll play some original audio. That's a nice, reasonable recording. If we go to the spectrogram and back again and mess up the phase, do a bad job of guessing what the phase should be, it's got this horrible phase-y artefact. It sounds like some sort of effects pedal has been applied to it. So that's the hard problem that we need to solve, and that these deep learning people have found very good solutions to. There are quite a few papers on this.
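A classical way to experience exactly this round trip is the Griffin-Lim algorithm, which iteratively guesses a phase to go with a magnitude spectrogram. The sketch below uses librosa and its bundled example audio rather than speech, and it is not the neural approach discussed next — just the easiest way to hear the problem.

    import numpy as np
    import librosa

    # Load some audio, keep only the magnitude spectrogram, then invent a phase.
    y, sr = librosa.load(librosa.example("trumpet"), sr=None)   # any recording will do
    S = np.abs(librosa.stft(y))                 # magnitude only: the phase is thrown away

    y_rough = librosa.griffinlim(S, n_iter=1)   # barely any phase estimation: audible artefacts
    y_better = librosa.griffinlim(S, n_iter=60) # a much more plausible phase estimate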
This is another thing that I would say is basically nearly a solved problem. This would be a very bad choice for a PhD topic, because you will not beat these guys. For example, these guys that we work with in Amazon. They would like to build a machine that, given any spectrogram of speech, they're only interested in speech, of a single speaker, you could produce a really high-quality waveform from anyone without having to have seen that person during building the model. So arbitrary for new speakers. So a universal model. And they put the word towards in the title. The pre-review paper didn't have the word towards in the title, but they haven't quite got there yet, so they had to put that word back in. And they're going to do that with a neural network. It's going to be a bit more complicated than my one here, but the idea is going to be the same. Instead of going from linguistic features to acoustic features like a spectrogram, they're going to put the spectrogram on the input, and they're going to query values in the spectrogram. Is there any energy there, or is there energy there? Put that into their neural network, and the neural network is going to print out the samples of the waveform. So just regression again. So it's not necessarily any harder than any other regression problem, but because it's got a very good specification on the input for the magnitude, all it's got to do is come up with a reasonable phase that sounds okay. So actually quite a well-defined problem. So it's going to print out a waveform sample by sample. So that input's actually just a spectrogram. Now their model doesn't look like that because that's just a toy neural network with a very small number of units. Of course, their paper's got the kind of crazy flow diagram in that waves its hands and says this is what our neural network looks like. But if I tell you that each of those orange blobs there is just some neural architecture with layers and weights connected in some particular way, that's all they are, then we understand this. It's just a more complicated version of my neural network. It's not doing anything particularly different. It's just doing regression. So play two audio samples now. We'd have some original audio, then audio that's been converted to the spectrogram and had the phase thrown away, and we're not allowed to see that anymore. And then this model will go back to the original audio. It will guess the phase, and it'll do a much better job than the previous one. No wonder she searches out some wild desert to find a peaceful home. No wonder she searches out some wild desert to find a peaceful home. Anyone hear any difference between those two? On this kind of speaker system, you're not going to hear any of the artefacts. You might on good headphones if you listen carefully. So this is a really well-solved problem, really. And all people are trying to do now is to do this really fast, because these models are still too slow. So these are now just being deployed commercially, but very few people have got them actually running on your device, because it would just drain your battery every time they spoke. They're running on big servers in the cloud. So this is now what you're hearing when you listen to Google's synthesis. You're listening to this sort of architecture. So what I'm playing you, the question is why we need to do this. 
We don't have the original audio, because, as you're going to see, in these end-to-end systems we train the system on pairs of text and audio, but we would like to say arbitrary new things for which we only have text. I'm playing you the original audio just to show you there's very little degradation in this round trip. So this isn't speech synthesis, this is just a spectrogram inverted back to audio, to show that that bit is essentially solved. That bit's solved. The front-end text processing bit, we've got some good solutions for, and all the action's going to be in the middle, as we'll see in a minute. So let's say we can now make waveforms much better than we could with those old vocoders. We can throw vocoders away and just use these things, if we've got the compute power. So now let's look back into the middle. We've got this regression problem in the middle. That was our flowchart before, and I'm going to draw you, in uncannily closely-matching colours, Google's Tacotron model. And we're going to see that this Tacotron model has got the exact same architecture as all the traditional synthesisers anyway, even though it was claiming to go a bit end-to-end. It's got something that's doing the job of the front end. In the dream version of this model, it takes raw text input. But in the commercially deployed version, that makes too many mistakes, so it takes phonetic input — so it's got a traditional front end before it. It's got something that's doing something like a front end, extracting interesting features from the input, whether that's graphemes or phonemes; either is possible. And that's this blue box. It's got a thing in the middle that takes whatever has been extracted and regresses it up to a spectrogram. And then it's got a thing that's very much like Amazon's thing, that takes a spectrogram and makes a waveform from it: a waveform generator. We rename these boxes, because people have changed what they call them. The front end is not a front end anymore, because its output is no longer interpretable. It doesn't mean anything to us humans; it's internal to the model. And so it's encoding the input into some hidden, abstract, embedded representation inside the model. And that's good and bad. It's good because it can be optimal for the task — it can be learned. It's bad because we have no idea what it is; we can't do anything useful with it. We then decode that: in other words, we regress from that mysterious internal representation to a spectrogram, and then do the obvious thing of vocoding it. And all of those other papers that we flashed up at the beginning have got equally complicated-looking flow diagrams, but we don't really need to understand them, because we can just draw coloured boxes around them and see that they've all got the same architecture. So all of the state-of-the-art systems I showed you in all those previous papers — this is just one of them — do something that's a neural network that encodes the input into something internal, something that takes this internal thing and decodes it into an audio representation, a time-frequency plane, so a spectrogram, and then a little vocoder that makes a waveform. So we'd better have a little bit of an understanding of this encoder-decoder architecture, because it is the paradigm shift.
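Purely as a structural sketch of those three renamed boxes — and emphatically not the real Tacotron 2 architecture — here is how an encoder, an attention-based decoder and a vocoder might be composed in PyTorch. Every size and module choice is a placeholder, the weights are untrained, and the decoder is not autoregressive as real ones are; it also includes the attention/alignment mechanism that the next part of the talk explains.

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, n_symbols, dim=256):
            super().__init__()
            self.embed = nn.Embedding(n_symbols, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        def forward(self, symbols):                  # (batch, n_input_symbols)
            h, _ = self.rnn(self.embed(symbols))
            return h                                 # hidden "encoding" of the text/phonemes

    class Decoder(nn.Module):
        def __init__(self, dim=512, n_mels=80):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
            self.out = nn.Linear(dim, n_mels)
        def forward(self, encodings, n_frames):
            # Random stand-in queries; a real decoder is autoregressive.
            queries = torch.randn(encodings.size(0), n_frames, encodings.size(-1))
            context, alignment = self.attn(queries, encodings, encodings)
            return self.out(context), alignment      # spectrogram frames + soft alignment

    class Vocoder(nn.Module):                        # stand-in for a neural vocoder
        def __init__(self, n_mels=80, hop=256):
            super().__init__()
            self.upsample = nn.Linear(n_mels, hop)
        def forward(self, mel):                      # (batch, n_frames, n_mels)
            return self.upsample(mel).flatten(1)     # (batch, n_frames * hop) samples

    enc, dec, voc = Encoder(n_symbols=100), Decoder(), Vocoder()
    mel, align = dec(enc(torch.randint(0, 100, (1, 12))), n_frames=40)
    wave = voc(mel)
    # 'align' is (1, 40, 12): one attention distribution over the 12 input symbols
    # per output frame; summing over frames behaves like a soft duration per symbol.
    print(align.sum(dim=1))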
So the thing that really made things work, to go from statistical parametric speech synthesis to the fully neural approach, where we're getting close to end-to-end — I've put characters on the input here, but it would work better with phonemes — is this idea of encoding sequences of inputs into something, and then decoding them out into a spectrogram. So we're now regressing from sequences to sequences, and that's where things got exciting. That's what actually made everything work. But this internal thing is entirely mysterious, and I can't really draw you pictures of it — well, I could draw you pictures of it, but they wouldn't mean anything. They'd just be numbers in matrices. They'd be utterly uninterpretable, and therefore there's no point visualising them. We don't really know what they are. They're learned by the model, because the model is trained simply by seeing pairs of input and output, and it learns what to represent internally to do the best possible job of regression with the least amount of error. But these models need to do something that the previous generation of models didn't do, because they're going from a sequence of inputs to a sequence of outputs. They need to map between two sequences, and that's not trivial. And it's not trivial because one of the sequences is a linguistic thing. It's on linguistic time: the clock that ticks through it is a clock that's in phones; it ticks through pronunciations. But the thing on the output, the horizontal axis of the spectrogram, is in actual time. So we have to get between two sequences that are of different lengths — typically the acoustic one is going to be longer than the linguistic one — and we don't know exactly how they align, because we don't have that information. We don't supervise the model with the durations of the phones; it learns that. So it needs to be a sequence-to-sequence model. And part of the model is actually doing that alignment between linguistic things and acoustic things. It learns which linguistic inputs to look at when it's trying to predict certain acoustic outputs. That's why this mechanism is often called attention: attending to, or looking at. And it scans across the input and writes out the output. So when we're doing synthesis, that's the duration model. That's the thing that says how long each linguistic input should last in the output spectrogram. So these models have got built-in models of duration; they are complete in that sense. Of these sequence-to-sequence regression models we've only got a very high-level understanding, and don't worry, that's enough. We don't need to get into the nitty-gritty of all the different architectures, because that changes every week. The concept seems to be very well established. It just works. And if we can now encode either text or phonemes into something and then decode it, we can start doing some kind of exciting things with these models. And the most exciting thing is to accept that text-to-speech is actually a very ill-posed problem, because text does not tell you how to say something. There are many, many different ways of saying any given text, and they're all valid. And in the data that you learn the system from, you just get to observe one of them, one possible way of saying a sentence, but there are many, many other ways of doing it. So imagine that you had a database where someone had read out some books, and they had changed their speaking style as they went through the database. Maybe the character speech was in the voice of a character.
Maybe there was happy speech or sad speech. The style is varying as we go through the database. But the text, the bare text, doesn't quite explain that variation in style. It's not fully specified. We have to bring some more information than just the text. This is one of many papers doing that; it's called Style Tokens. It's got another horrible diagram, which I'm about to simplify, because who knows what it's doing? They don't really know. It's got an encoder-decoder here, which we vaguely understand, and that stuff at the top, which I'll just simplify. And we'll say it's a text-to-speech model that adds more information than just the text. It adds some new information. And that's when things get exciting. They're calling it a style embedding. And this model is doing something a bit peculiar. It takes as input text and some reference audio, which is speech of some other sentence — not the sentence you're trying to say — but in the style in which you'd like to say it. So if you'd like this sentence to come out sad, you put in the text, and you just give some speech in a sad style. And this model learns to embed that sad speech into some representation internal to the model, called the style embedding, and it uses that to influence the regression in the encoder-decoder model. So an embedding is just some internal representation. We've already seen one: the mysterious thing the model learns to bridge from linguistic space to acoustic space. But we can add others if we have more information than just the text. And maybe we do. Maybe we've got a corpus in which we've labelled every sentence with a label — happy, sad, whatever — or we've learned those labels in some way. This one learns the labels. It doesn't require you to label the corpus; it just requires you to have these reference audios, in other words, speech in the style that you would like the output to come out in. The model on the bottom, that's just our text-to-speech model. And the thing on the top, that's just new information that wasn't explicitly in the text. So this is a more interesting model than text-to-speech. It's finally text-to-speech people admitting that we can't really do text-to-speech: it doesn't mean anything to say "do text-to-speech". You've got to decide who's going to say it, in what style they're going to say it, in what accent, and so on and so forth. And in the past, that was done by changing the data on which you built the system. If you wanted your first- or second-generation unit selection system to sound sad, it was back to the recording studio to say: could you please sound sad for the next ten hours? And let's just record ten hours of sad speech. And then if we want to speak in a sad style, we choose our waveform fragments from the sad bit of the database. That worked, but it's very, very expensive, and it doesn't really scale to continuously varying speaking styles. So imagine we have this additional information, and now it can be anything you want. It could be something you can label on the data, or something that you discover varying in the data that is not explained by the text. It could be a speaking style label, it could be a voice quality label — hoarse voice, modal voice, whatever. It could be anything, anything that you want: whatever the text is missing, the leftover stuff. One of the interesting things people have been doing is to derive that from another audio sample in the required speaking style — but not of the text you're actually trying to say, because you wouldn't have that; it just has to be some reference audio.
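A minimal sketch of that conditioning, assuming invented module names and sizes (this is the general shape of a reference/style encoder, not the actual Style Tokens model): embed the reference audio into a single vector, then concatenate that vector onto every position of the text encoding, so the decoder sees "what to say" plus "how to say it".

    import torch
    import torch.nn as nn

    class ReferenceEncoder(nn.Module):
        def __init__(self, n_mels=80, style_dim=16):
            super().__init__()
            self.rnn = nn.GRU(n_mels, style_dim, batch_first=True)
        def forward(self, reference_mel):            # (batch, frames, n_mels)
            _, h = self.rnn(reference_mel)
            return h[-1]                             # (batch, style_dim) style embedding

    def condition(text_encodings, style):
        # Broadcast the single style vector across every input position and concatenate.
        positions = text_encodings.size(1)
        style = style.unsqueeze(1).expand(-1, positions, -1)
        return torch.cat([text_encodings, style], dim=-1)

    ref_enc = ReferenceEncoder()
    style = ref_enc(torch.randn(1, 200, 80))                   # style of some *other* sad utterance
    conditioned = condition(torch.randn(1, 12, 512), style)    # (1, 12, 528): text + style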
So the Google paper does that with what they're calling a prosody embedding. It's not really prosody; it's really just the leftovers. This model is learned in the same way as the text-to-speech model: you present it with pairs of text and speech, and it learns to do the regression. And in doing that, you can also measure what would be missing from the text to perfectly regress to the speech. And that leftover, that missing bit of information, is what they're calling prosody. But it could be many, many things, as we'll see; it's a lot more than just prosody. So we can now play interesting games. We can change that prosody embedding. We can put a random value there and see what happens. Or we could put in an embedding that's derived from a different speech style. So we can roll the dice, change the prosody embedding, and change the speech. So... United Airlines 563 from Los Angeles to New Orleans has landed. Ridiculous sentence, inappropriate for this voice. But by changing the embedding, we can change the style. United Airlines 563 from Los Angeles to New Orleans has landed. United Airlines 563 from Los Angeles to New Orleans has landed. And so on. United Airlines 563 from Los Angeles to New Orleans has landed. It's not just prosody, it's everything that's in the speech but not in the text. It's speaking style, emotion, prosody, whatever you want to call it. And there's a huge gap in the terminology of the field: what to call the stuff that is under-specified in the text, and how to actually properly factor it out, so that one bit really is the prosody, and one bit really is the speaker identity, and one bit really is the speaking style, and one bit really is the voice quality, or whatever you think those things might be. And that's incredibly hard, and nobody has a solution to factoring those things out and giving separate control, because they're very much tangled up together. The speaker's identity and their speaking style are not independent things, even for the most talented voice actor. So disentangling these representations, giving independent control over different things: nobody has that yet. This model doesn't have it. It's not a prosody embedding; it's just mimicking some audio, some reference audio that you give to the system. Whatever you give it, it will mimic that speaking style, but say the text, an arbitrary new text. So we'll finish off by getting all the way back to the beginning, to text processing. We'll remind ourselves about the messy, traditional way of doing things, which works, but is really expensive to maintain — especially the dictionary, which is very hard to maintain. And if you wanted to start building systems across different accents of a language, you might well have to make very substantial changes to your entire dictionary. That's also going to be extremely expensive. And then moving to a new language is going to take a skilled lexicographer a year to even write your first-pass dictionary. So there are lots of reasons to want to not do the traditional thing. So what happens if you don't? What happens if you try and throw these things away? The traditional way of doing it is to write a great big pronunciation dictionary, like this, and then still find that whenever you synthesise, many sentences you want to say have at least one word in them that was not in your dictionary, because language is like that: it's productive. And so we need to extrapolate from the dictionary with this thing called a letter-to-sound model, which will make mistakes.
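The traditional arrangement being described here is simply a lookup with a fallback. In sketch form, with toy dictionary entries and a deliberately naive stand-in for a trained letter-to-sound model:

    # Toy lexicon entries (ARPAbet-ish, stress omitted); real dictionaries have
    # hundreds of thousands of entries maintained by hand.
    LEXICON = {
        "merlot": ["m", "er", "l", "ow"],
        "decorum": ["d", "ih", "k", "ao", "r", "ah", "m"],
    }

    def letter_to_sound(word):
        # Stand-in for a trained grapheme-to-phoneme model; naively one letter = one phone.
        return list(word)

    def pronounce(word):
        return LEXICON.get(word.lower(), letter_to_sound(word.lower()))

    print(pronounce("merlot"))      # found in the dictionary: ['m', 'er', 'l', 'ow']
    print(pronounce("coathanger"))  # falls back to letter-to-sound, probably wrongly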
It will get it right about 70% of the time, if we're lucky, maybe 80 or 90% for the state of the art. So that's the best we could do in the traditional mode of doing things. So the motivation for the people who really want to go end-to-end, by actually taking text input, not phonemic input, was to learn that dictionary as part of this encoder. So we'd still have to normalise, because none of these models can really handle currency symbols and things; they take normalised text and they embed it, so they're learning something like a pronunciation representation. But these models that do that make stupid mistakes. They make mistakes that are so bad you can never ever deploy one of these commercially, because your customers would laugh at it. They make embarrassing, silly mistakes. So all of these papers or blog posts, and they're often blog posts rather than proper papers from these companies, say, what a fantastic model, oh, but here are some hilarious outtakes that the model produces. And in this one, for this particular system, Tacotron 2, which is state-of-the-art, this is a blog post from only a year and a half ago or so, it says it has difficulty with these complex words. Now I don't know if you think these two words, decorum and merlot, are complex. I don't think so. I had a glass of merlot the other day and I didn't find it that difficult. Oh, it's going to shut down again. There's an easy answer to how you say those things. How do you learn how to say them? Well, you ask somebody, and if you don't know, you look it up in a dictionary. So there's a very easy solution to these complex words: look them up in a dictionary. And these end-to-end systems don't have dictionaries, and so when they make these mistakes, there is nowhere in the system to go and fix it. So these are systems that are not fixable, which is why they'd never be deployed. They would only be deployed with a phonetic front-end. One reason that letter-to-sound is hard is that we need to know something, perhaps, about the way the word came about. What's the word made of? Maybe the morphology. So we're finding now that putting a little bit of the right kind of linguistic information back into these rather naive end-to-end models makes huge differences, really big differences. This is from a student in my group who is looking at what it would take to do a better job of letter-to-sound in these end-to-end systems. And to understand how that might be possible, we could look at a graph here. On the horizontal axis, which covers a rather small range, it has hours of recorded speech, going up to 600 hours. On the vertical axis, it's got the number of word types that you get, the number of unique word types. And you can see that will keep going up and up and up, but it will take a very long time before it even gets anywhere near a dictionary, which is all those lines at the top. So you would need hundreds of thousands of hours of speech to see all the words that you would see in a typical dictionary. And we don't normally work with 600-hour databases for speech synthesis, because we don't have ones of good enough quality that are that big. We're normally working down in this tiny bottom left corner here, where we have databases that are tens of hours long, and in them there are tens of thousands of word types, whereas in our dictionaries there are hundreds of thousands of word types. So what the student has found is that by adding just the simplest amount of morphology, you can effectively reduce the number of unique types of things in your database.
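Here is a toy illustration of that type-counting point. The word list and the morph segmentations are hand-written for the example, not taken from the student's work.

```python
# Toy illustration: segmenting surface word forms into morphs (segmentations
# hand-written here, purely for the example) shrinks the set of unique units
# the model has to learn pronunciations for.
words = ["coat", "hanger", "coathanger", "coats", "hangers",
         "hang", "hanging", "uphill", "hill", "up"]

morph_segmentation = {            # hypothetical analysis, not a real analyser
    "coathanger": ["coat", "hang", "er"],
    "coats":      ["coat", "s"],
    "hanger":     ["hang", "er"],
    "hangers":    ["hang", "er", "s"],
    "hanging":    ["hang", "ing"],
    "uphill":     ["up", "hill"],
}

word_types = set(words)
morph_types = set()
for w in words:
    morph_types.update(morph_segmentation.get(w, [w]))

print(len(word_types), "surface word types")   # 10
print(len(morph_types), "morph types")         # 7: coat, hang, er, s, ing, up, hill
print(sorted(morph_types))
```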
That's the point of morphology, right? So word forms are almost infinite in their variety, but many new words are formed productively by combining morphs that already exist in other words. So one way to make your vocabulary look a lot smaller is to think of a vocabulary of morphs, of morphemes, and not of words, of surface word forms. And that gets you big wins. So one of his particular test cases is words where, across a morphological boundary, there is a letter sequence, t-h in this case, that would normally map to a single, very high frequency phoneme, but which is not pronounced as 'th' here, because it spans a morph boundary, and it should be two separate phonemes. These end-to-end systems get these ones wrong. So let's see if we can find that example. So I'll just play you a couple. I'll play you one of these end-to-end models trying to say a word where there's a high frequency sound that straddles a morpheme boundary, but the system doesn't know about morphology. So we can say this word, coat hanger. So listen carefully for the sound that corresponds to the letters t-h. Coathanger. It comes out as 'coa-thanger', because the letters t-h are way, way more frequently a single 'th' sound than separate 't' and 'h' sounds. But if you know about morphology, then the model gets it right. Coat hanger. This one here, p-h, is very commonly going to be an 'f' sound, but there's a morph boundary here, so it's not going to be pronounced with an 'f', it's going to be 'up-held'. Right, so let's wrap up. So hopefully you've got a little bit of intuition now about what the state of the art is. Although it's these big, heavy neural networks, and we might not really know how neural networks work, it doesn't really matter, because they're just building blocks for doing regression. What's interesting is what you put into them, and what you get out of them. You can play the game of messing around with their architectures, but you're never going to win that game, because you don't have the data or the compute power. Other people have got that, they're going to win that game, they can burn the electricity, and they'll come up with something that works. What's much, much more interesting is what you put in, and what you get out, especially knowing that text alone is not enough. We need more. For example, morphology really helps. So here's a guess about what will happen next, maybe just in the very short term. Linguistics is more than phoneme sequences. We have very rich representations. Here's syllable structure. In that previous example, it was morphology. There's lots and lots of other things you could imagine putting there. So morphology, that works, and if we can infer it from text, we can do a really good job. People have tried putting syntactic structure in. Syntactic structure doesn't always map directly onto how something is said, onto the acoustics; the prosody doesn't simply follow from the syntax. It's going to wake up again in a minute. You could think of all sorts of other structured linguistic information that is helpful for predicting acoustics, and that is being lost at the moment. Syntax is one of them. More obvious, perhaps, might be meaning. So none of the systems at the moment make any attempt to guess the meaning of a sentence. They just go for very shallow syntactic information, like content word, function word. They don't really get into semantics. There's other work, which I couldn't fit into the talk, where we label the data with discourse relations, so relations between spans of words: whether something is an elaboration of something else, or a contrast with something else.
That can make really good improvements to prosody. You're probably a much better linguist than I am, so you can probably come up with other things you could add here. You could put richer things in and retain the structure, and not just squash it flat. This model has lots of representations all along the way. It's got this linguistic thing on the input, so you've got some choices about what you put in there, and what you choose will make the regression easier or more difficult, and you want to make it as easy as possible, to make it as accurate as possible. You've got a waveform on the output. We already saw that this thing, phase, is a horrible thing, and it's one reason that the end-to-end problem, going all the way to the waveform, is a bit silly, actually, and that cutting the problem in half and putting a spectrogram in the middle is way more sensible. So that's what everyone's doing. They're putting a spectrogram here, and if Amazon are right, the vocoder problem is solved. We just have a black box vocoder: give it any spectrogram and you get the waveform back, so we can just tick that off as done. So that representation is a spectrogram, but it doesn't have to be a spectrogram, because when you talk to Alexa, she doesn't show you a spectrogram saying, what do you think of my spectrogram? Does it look good? Who cares? It's internal. So there are lots of ideas you could have about things that are maybe better than spectrograms. Maybe they're perceptually more relevant. The only perceptual thing we do at the moment is put a mel scale on it. So we use a nonlinear frequency scale, but you could imagine doing an awful lot more in that representation. And then there's the entirely mysterious representation in between, where the grapheme or phoneme input gets encoded into something, and that then gets decoded out into the audio. We don't know what that is, and the model's optimising it. Maybe it should stay hidden from view and just be learned. Maybe we should be intervening there and making it interpretable or controllable, so we can actually go in there and adjust things. So the old statistical parametric speech synthesis, the reason people persevered with that was that you had lots and lots of control, which you never had in unit selection. You could do all sorts of nice tricks, like changing the speaker identity from very small amounts of data. You could even repair people's speech problems. So we have a company just spinning out, commercialising that technology, which can take speech from someone with motor neurone disease who's already got articulation problems, essentially repair it in the statistical parametric domain, and produce speech that sounds the way they used to sound, for when they can no longer speak at all. We could never do that with unit selection, because it just plays back their impaired speech. We could morph emotions, we could do all sorts of things in that statistical parametric domain. We had lots and lots of control, which you've kind of lost again in this end-to-end paradigm. So another thing that will happen quite soon is putting control back in. If you've got a fully specified mel spectrogram with formants and harmonics, you're just always going to get the same waveform from it, because it's pretty fully specified. You just need to guess phase, and you just need to get phase right. You can't control the phase to change the speaking style. So this representation means you've got no control. So in this model, it's actually quite hard to even do simple things.
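As a side note, here is roughly how that mel-scaled intermediate representation gets computed. This sketch uses the librosa library with common default settings, not the exact settings of any particular system discussed in the talk.

```python
# A sketch of computing a mel-scaled spectrogram of the kind used as the
# intermediate representation; frame and filterbank settings are common
# defaults, chosen for illustration. Requires the librosa package.
import librosa

sr = 22050
y = librosa.tone(220.0, sr=sr, duration=1.0)      # stand-in for real speech

# Short-time magnitude spectrum -> mel filterbank -> log compression.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel)

print(log_mel.shape)   # (80 mel bands, number of frames)
# Note what has been thrown away: phase. That is exactly why a separate
# vocoder is needed to get from this representation back to a waveform.
```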
Say you just want to increase the pitch. There's no knob for that in this model. If you just don't like the voice, you say, oh, could you just pitch it down a bit? This model doesn't actually have a knob for that. Or you just say, could you speak a bit slower? This model really doesn't have a knob for that either. Statistical parametric synthesis had explicit knobs. It had numbers and parameters you could easily change that would do all of those things trivially, because they were there in the representation. So we've lost that. And then maybe we can control the things that we never could in any of the paradigms, things like voice quality, making things like creaky voice. No one could ever do a really good job of creaky voice with signal processing or in the neural case. So I'll leave it there, because I don't want to make too many predictions. You can only be so wrong if you only make a few predictions, I think. We'll call it a day, but I will leave you with two websites if you want to find out more. One is my research group's website, and the other one is where I'll put these slides, maybe over the weekend, which is my teaching website. If you really want, there are complete courses on speech synthesis and other things there as well. Thank you.