What is AI speech generation currently capable of?
Presented at the ACM Conference on Conversational User Interfaces in 2024. PDF slides – the links to all the demo pages are clickable.
(large video file: if it doesn’t load correctly, try Chrome)
Whilst the video is playing, click on a line in the transcript to play the video from that point. [Automatic subtitles] Well, good morning. It's early in the morning, so I'm going to keep this kind of non-technical, and I thought I would give you an idea of what speech synthesis can currently do, or as we're now calling it for marketing purposes, AI speech generation. Speech synthesis has got really quite good, and I'd like to give you an idea of where that comes from, where that quality comes from, and what's possible, what's not possible, what would you need to do to make speech synthesis, I don't know, for a conversational agent. So let's just start with some teasers. These are some demos of fairly recent work, and the whole talk has got a huge recency bias. It's not a literature survey. Just to give you an idea of where we are with the quality of synthetic speech. Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do. Once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it. And what is the use of a book, thought Alice, without pictures or conversations? So she was considering in her own mind, as well as she could, for the hot day made her feel very sleepy and stupid. Whether the pleasure of making a daisy chain would be worth the trouble of getting up and picking the daisies. When suddenly a white rabbit with pink eyes ran close by her. There was nothing so very remarkable in that, nor did Alice think it so very much. Okay, so that was all synthetic speech. How about another one? Did you hear about Google's paper on SoundStorm? Um, no, I must have missed it. What's it about? Well, it's a parallel decoder for efficient audio generation, so it can even be used to generate dialogues. Oh, interesting. Again, all synthetic. This one, the beginning is natural. It's a video to be dubbed. And the second one is the synthetic one. Some minutes later, Ryan Christie's penalty sparks a mob of cherished players in front of those supporters who get their reward for the long, long Tuesday night trip. And a few minutes later, Ryan Christie's penalty sparks a mob of cherished players in front of those... Okay, so it's pretty good. It's pretty good in certain domains. Maybe in potentially all domains. Let's see. So what I'm going to talk about. I'm going to talk about very, very recent methods. There's huge recency bias. I'm not going to survey the literature way back. We'll just touch on a couple of old models to remind ourselves that they sound worse. The real theme is what speech generation is capable of. I'll define capable in a moment, really with reference to the state of the art. I'll try and finish with thinking about what the opportunities are for conversational interfaces that have speech output. Not going to talk at all about a few things. Not about history. No tutorial. I'm not going to talk about evaluation, which is huge. Evaluation in synthesis is something I've worked on an awful lot with moderate success, but there's still lots of really bad evaluations out there. I'm not going to talk at all about ethics. That'll be a whole other talk. Only to say that all of this is using data, and the data is recordings of real people, and we should be very careful about who those people are, where that data came from. Delighted to talk about that after the talk. Couldn't possibly fit it in here. And we're certainly not going to talk about accidentally sounding like famous people. 
Most of the demos you'll find, I've got a link at the bottom of the page, rather than a citation. That's where you can find things. I'll put the slides on my website sometime after the conference, maybe with a recording of the talk. So the samples are from those companies or academic groups who are willing to put demos online. There are two major groups that don't do that, but do great work, and they don't get any demos today, and that's Apple and Amazon. So this started with me thinking about the literature and becoming overwhelmed by reading our own literature. And I started collecting the names of models that have come out in the last year or two. And it's overwhelming. These are just the names of some text-to-speech models in the last two years. And they're only the ones that have catchy names. There's many more that don't have catchy names. We're going to run out of names pretty soon. And for someone like yourselves, perhaps thinking maybe as users of speech synthesis, you must have no idea how to make any sense of this. Going shopping for a model, how do you pick a model? Catchiest name? The best demo page? I'm going to help you think about how you might choose the model and the technology to do what you want to do. So the structure is extremely simple. Talk about three things that lead to capabilities. Data, models, and the labels on the data. Labels might seem a bit of an incidental thing, but they're going to turn out to be the most important thing. We'll then finish off by saying what the state of the art is, and so on. So data, models, and labels. And these three things each separately give us some capabilities, things we can do. What do I mean by the capabilities? So is the model able to generate speech that is what you want, with the desired properties, whatever they are? Do you know what they are? Is some user of the model, that's more likely a system designer than an end user, able to cause this model to do what they want it to do? By either manipulating the data, or swapping out the model, or putting some different labels on that data, whether that's data used for training the model, or data used whilst we're doing synthesis. So we're going to cook up these three ingredients, and we'll start by thinking about the data. So I thought it might be useful just to actually listen to the kind of data that we're using to train the state of the art models, and what kind of limitations that has. Because if it's not in this data, the model can't do it. There are no models out there that can extrapolate significantly beyond the data that they've seen in training. They can interpolate really well, but they generally can't extrapolate. So if you want the model to do something that's not in this data, good luck. So we can think about where that data comes from. The classical thing to do is to go in a studio and record. That's what we do in the company I work with, Papercup, but increasingly that's nowhere near enough data. We might elicit it in some more sophisticated way than just asking some voice talent to read stuff. We might get them to play a game, play a scenario, a dialogue for example, or we might just go find it on the internet. So let's just listen to kind of what you would get if you got your data in those three ways. A teacher would have approved. Military action is the only option we have on the table today. So: speakers, disinterested, not very professional, reading random sentences in a studio. That was the backbone of speech synthesis until recently. 
And that's why it sounds like what it sounds like. Or we could get voice actors to play a little dialogue game. Hello, I was looking for a Korean restaurant, please. Sure, where were you looking? Around East Village in New York. Yep, so there's Thursday Kitchen? Yeah, let's go for that. Or we could just go find it, I don't know, on something like YouTube. Being careful about the license and the permissions. We're going to have a lot of fun. The first night was a really big challenge for me when I came in. Did you know you were the first one? Like you were going to walk into the house. So the last one is, it's real. It's real people saying things in real ways. Obviously, it's going to be harder to work with data like that than the first two. We could also think about how we get our speakers to produce this speech. So we could just ask them to read stuff. Here's the most overused database in all of speech synthesis recently. A lady called Linda Johnson, who's recorded vast numbers of audiobooks and put them on LibriVox. Chapter seven, Lee Harvey Oswald, background and possible motives, part one. So we've just got lots and lots of hours of her reading like that. And you can imagine if you make a voice, it's going to sound exactly like that. We could get people who have been a bit more natural in the way they speak, but thought about it in advance. That should have been the title of the chapter. Honestly, I've got it wrong at certain points in my life, but I'm pretty sure he'll be 65. Much more interesting than reading text out. Or we could just, again, go find data that was just produced without any particular plan. The typical YouTube video or blog. A lot of it's been the marketing our site. Really, it's just getting people to watch it. That's been the biggest part of all for growing. There we go. I don't know why on Google Maps tonight, they have Where's Waldo. So that's kind of cool. That was surprising. Did you know anything about this? So increasingly natural, increasingly hard to work with. So think about which of these might be a good kind of basis for building a voice for an application you're interested in. Increasingly, we need very large amounts of data. We're going to see models later on that are trained on between 1,000 and 100,000 hours of speech. That kind of amount, no company has ever recorded in a studio. They may have approached 1,000 hours, but never 100,000 hours. So you just got to go find it. Then you got to worry about who's in it. So if you did it yourself in the studio, you know who the speaker is. And if you've got a professional actor, you're going to get a voice that sounds really spookily familiar. I'm too busy for romance. George Washington was the first president of the United States. The now stereotypical voice assistant voice. Those are natural. That's what she really sounds like. Or we could go find people reading audiobooks. We know who they are. They put their names on it. So we can attribute them if we wanted to. Little did I expect, however, the spectacle which awaited us when we reached the peninsula of Sneffles, where agglomerations of nature's ruins form a kind of terrible chaos. Utterly random text full of all sorts of words that may or may not be relevant to our domain. They were like many a pair of twins and seemed to have but one life divided between them. And typically unprofessional speakers. So they might be good. They might not be good. How do you know? Or again, you could just go find data and you have no idea who is in this data. 
It's not annotated in any way with that. I would go even a little bit farther than that. And focus didn't move. I turned the key in the lock. And you're going to get far more speaker diversity that way. You're going to find all sorts of voices, all sorts of demographics, biased only by the bias that's there on the internet rather than by, for example, who's got spare time to read audiobooks. How do you find the people in there that suit your application? But much more interesting is what kind of style they're speaking in. And style is going to come up again and again in the talk. And I'm not going to give you any good definition of it because the field has failed to come up with one. It gets conflated with emotion, unfortunately. The typical thing to do is record neutral data. This is from the company I work with. This is very old data. We would not record such boring data anymore. After a pause, Lord Henry pulled out his watch. Very professional, very clear speech, not very expressive. See if you can name these emotions. Given the circumstances, isn't this a little unorthodox? Given the circumstances, isn't this a little unorthodox? Given the circumstances, isn't this a little unorthodox? I don't know. If I told you what they were in advance, you would have got them. But after the fact, I think not. Or in the found data, people have real speaking styles, not stereotypical portrayed emotions. But they're just speaking because they want to speak. They've got something to say. This feels really good. What can we do for you today? But when we walked around the corner and we saw kind of like the camp set up and the tent. It's very challenging to put names on those styles. It's very challenging to draw a picture of what that space of styles is. And that's a big challenge in the field at the moment. We can draw spaces of speakers. We don't really know what style is, other than something unsatisfactory like emotion. So if you just have data, any model, just straightforward, swapping in and out the data, what could you get out of your system? Well, if you've just got one speaker. The boxes were impounded, opened and found to contain many of O'Connor's effects. That's fine. You get that speaker. If you've got multiple speakers, you can build a model which can switch out its voice for any of the speakers it's seen in training. The teacher would have approved. The rainbow is a division of white light into many beautiful colors. There was great support all around the road. And from now on, these samples and all the samples we hear, unless I tell you otherwise, they're all synthetic. So they're all text-to-speech. And they're really good. Over this audio system, you'd be hard pressed to tell the difference in some cases. But it just sounds like the data. Sounds like the speaker, sounds like the recordings. We can get the model to vary its speaking style. There's many, many ways of doing that, but they all rely on having data in the style that you want. So you'd be able to pull it out of the data and say, say it like this style, whether that's by putting a label on it or finding an example of it. The modern way is just to find an example of it and ask the model to mimic that. And here's how that would go. We have to reduce the number of plastic bags. We have to reduce the number of plastic bags. We have to reduce the number of plastic bags. You're supposed to label these as you're going along. I'll give you the answers to these ones. We have to reduce the number of plastic bags. 
Apparently, these are the answers. But we can vary the style. If we can find an example that we like, we can get the model to mimic that style. What about just throwing lots of data at the model? I'm going to Istanbul for the Champions League final. That's awesome. Who are you supporting? Liverpool. I've always been a big fan. Liverpool is a great team, but I think it will be a close match. Yeah, I can't wait. You know, I'm super excited to be going there. Yeah, I can imagine. Are you coming as well? No, unfortunately I can't. So when this sample came out quite recently, this was kind of terrifyingly good. It's doing all of those things. There's even a little bit of backchannelling, a few filled pauses. There's all sorts of things going on in there. We'll listen to that one again later and see if you still like it. So why not just use lots of data to solve your problem? 100,000 hours, no problem. You get the engineers, you scrape it, you label it. Some recent papers are doing a million hours. The most recent paper I read last week, they claim to do two orders of magnitude bigger than everyone else, which came up to 10 million hours of speech. Once you've got that kind of amount of speech, it's really impossible to know what's in it. If you've got something like 50,000 hours of speech, it takes something like five years to listen to it. So you just can't listen to this data. You can't even listen to a representative sample of this data. So that's not going to be the solution. Current models need very large amounts of data. It's usually found data, usually with a bit of high quality data in there to make sure we get that quality back. But at the moment, we've got way too much data because it's just redundant. There's just vast amounts of stuff that's useless or is all the same. But also, it's quite biased. And in fact, it's not specific to any particular use case. It's kind of everything, general purpose. So let's listen to that dialogue again. I'm going to Istanbul for the Champions League final. That's awesome. Who are you supporting? Liverpool. I've always... Okay, so if she's a Liverpool fan, so am I. I mean, it's the least enthusiastic football fan I've ever heard. It's awesome quality. It sounds really natural, but it's just completely wrong. It's fine for whatever podcast that data was found from, that the model was trained on. It's not fine for talking about being excited to go to a football match overseas. The real problem with data is labelling what's in there. And in particular, what can you and can't you label? So you can label the text. You can label the speaker. Maybe you can label style, some other things. But there's lots of things you can't label. Random variation. Speech does randomly vary. And there's another problem with labelling data: you can't always disentangle things. So these emotion databases where actors portray emotions, if they're all portraying each emotion in their own unique idiosyncratic way, they have nothing in common. So these labels are meaningless. And at the moment, all models will sound like a speaker in their training data. The best they can do is sound like a mixture of speakers in their training data. There's a whole world of voice privacy, which just attempts to make pseudo speakers. And all they are are interpolations between real speakers in the space. And we'll hear a bit later on that it's really easy for the model to fall into one of its training speakers accidentally, even when you didn't notice initially. So that's where you get with data. 
You can get quite far. You can do very impressive dialogues, but only in the domain of your massive amount of dialogue data. What could you do if you're willing to get your hands a bit dirtier and start messing around with the model? So now I'm going to have to tell you how models are made and what they're made of. They're made of very simple components. A typical speech synthesis model takes inputs: either text or a pronunciation representation, a phonetic representation, or a mixture. The difference doesn't matter for the purposes of this talk. And encodes that in some way, extracts features about it that specify how it's to be said. That's the thing called an encoder. That goes through something which then turns that into a specification of what it sounds like. We call that the decoder. And then we put that through something that turns that into something we can listen to, a speech waveform, and that's called a vocoder. We can supervise the training of this model in lots of different ways. For example, we can say how the text aligns with the speech. Minimally, that's utterance by utterance. We need to cut everything up into little utterances. Maximally, it might be time-aligned phonetic transcriptions. We might annotate acoustic things like the fundamental frequency, which we hear as pitch, or the spectrogram. Different models do different things. We can supervise them in different ways. If we supervise them, we maybe get more control over them. But this model, given some input and given some speech target in its training data, needs to explain why that speech sounds that way from that text. And it can't, because text is underspecified. So there's not enough information in text. So you need to solve that problem, the problem that it's ill-posed. Text-to-speech was just always an ill-posed problem. We're trying to do something that's essentially impossible. And there's two classical solutions to that. One is to keep adding more inputs until you fully explain how the speech sounds: what the speaker is, what the style is, whatever that long list of things is. The other is to have models which randomly generate from the distribution of possible ways of saying that text. And that's called a generative model. And generative AI is all the rage. And everybody thinks it's a brand new thing. It's nothing of the sort. What we've always tried to do is model probability distributions and draw samples from them that are plausible. It's just that our probability models in the past were rubbish. And the only good sample was the average. All the ones around the edge didn't sound so good. But now all the samples sound good. So generative models work because the models are bigger and more powerful and there's more data, not because they're new sorts of models. So what might an additional input look like? Well, here's a really kind of weird idea that works really well. For every utterance you're training the model on, just give it some unique label. Therefore, that fully explains how to say it. And from that unique label, try to distill that down to some space, some abstract space of variation that's not the text. So text plus this abstract thing, probably some embedding in some vector space, explains everything. And then you can perhaps sample in that space during generation. That's an older way of doing it. A more modern way of doing it is saying, OK, a good way to explain how the speech sounds is just to show the model the speech at the input. 
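As a rough, purely illustrative sketch of that encoder, decoder, vocoder pipeline (the function bodies below are placeholders rather than any particular published model, and the optional style embedding stands in for the "keep adding more inputs" idea):

```python
from typing import Optional

import numpy as np

def encoder(text: str) -> np.ndarray:
    """Turn text (or phonemes) into features specifying HOW it should be said."""
    # Placeholder: one random vector per character; a real encoder is a neural network.
    return np.random.randn(len(text), 256)

def decoder(linguistic_features: np.ndarray,
            style_embedding: Optional[np.ndarray] = None) -> np.ndarray:
    """Turn that specification into acoustics, e.g. a mel spectrogram.
    The optional style/speaker embedding is one of the 'extra inputs' that try to
    compensate for text being under-specified (unused in this placeholder)."""
    n_frames = linguistic_features.shape[0] * 5     # crude stand-in for a duration model
    return np.random.randn(n_frames, 80)            # 80-bin mel spectrogram

def vocoder(spectrogram: np.ndarray) -> np.ndarray:
    """Turn acoustics into a waveform we can actually listen to."""
    return np.random.randn(spectrogram.shape[0] * 256)

def tts(text: str, style_embedding: Optional[np.ndarray] = None) -> np.ndarray:
    return vocoder(decoder(encoder(text), style_embedding))

waveform = tts("What is AI speech generation currently capable of?")
```

That style embedding could come from a hand-made label, from a learned space of variation, or it could be computed from an example utterance, which is exactly the idea of showing the model real speech at its input.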
That's obviously not going to work when you do text-to-speech. So we train a model that is allowed to see the speech. And then when we do synthesis with it, we just show it some speech and say, say it like this. Copy this speaker, copy this prosody, copy this style. And by manipulating the richness of the representation, we hope that the model copies that, but still says the text we want it to say. And this is the paradigm of the moment. Generative models don't in general need to take any of these measures. They're just going to model the full distribution. So in the training data for each text, there might be many different ways of saying it. And the model's now probabilistic. We can sample. We just toss the coin or roll the dice or however you want to imagine it and generate speech. And each time we do it, we'll get different speech. Just like if you use a generative language model like ChatGPT, each time you ask it something, you're going to get a different answer. It's going to sample from its space. OK, so it's really hard to reproduce anything with these models, but you can get natural variation from them. So if you're willing to play these games with models, play around with models, what does that get you? Let's just work through some models of increasing sophistication and power just to see that journey that we've been on just over the last few years. This is a model that's now really, really old, but it's still a benchmark. It's from Google, and it was a very important model. It's still deployed commercially. And this model, given good data, will just sound great. We heard the training data for that one earlier. This sounds exactly like that training data. So you want to deploy Google Assistant. You've got lots of recordings of that lady. You're good to go. This model will just work, and it's rock solid. You don't get any control. It'll just always sound like that. What if you want a bit of control? What if you want to get in there and, I don't know, manipulate the pitch contour or the durations? Then you need a model which knew about those things when you trained it. Here's a model called FastPitch that knows that. Here's a tiny little video from their demo showing interactive manipulation of the intonation contour. Down on the bottom left is the F0 value per input letter. What I cannot create, I do not understand. So, for example, we'll amplify everything a bit. What I cannot create, I do not understand. That's all very nice. I'm not sure any of us wants to go in there and start turning F0 pitch values up and down per input letter. Not a very handy interface. So it's a cool trick, but it's really hard to deploy such a model for real. Here's a model which does the trick of learning what the space of variation is. So it essentially learns a label to put on every utterance. That label has no name. It's just a set of weights. But then you can go in and play with them and see what you get and try and label things after the fact. Here you go, a link for Beyondo Racing products and other related pages. Here you go, a link for Beyondo Racing products and other related pages. Here you go, a link for Beyondo Racing products and other related pages. A single model trained on a single set of data that had unknown variation in its speaking style, and it learned the space. And then we just randomly say, let's have a style with these numbers. Let's try another style with these numbers. And that's kind of interesting. But again, these labels have no names. 
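To make that "let's have a style with these numbers" idea concrete, here is a tiny hypothetical sketch (the dimensions and distributions are invented for illustration): the model has assigned every training utterance a point in an unnamed style space, and at synthesis time we simply draw new points and listen to what we get.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the model has learned, for each of 10,000 training utterances, a point
# in a 16-dimensional "style" space whose axes have no names.
training_style_embeddings = rng.normal(size=(10_000, 16))

# Summarise that space with a mean and covariance...
mean = training_style_embeddings.mean(axis=0)
cov = np.cov(training_style_embeddings, rowvar=False)

# ...then "let's have a style with these numbers": draw a few random points and
# synthesise the same sentence with each one. We only find out what each point
# sounds like by listening to it afterwards.
text = "Here you go, a link for racing products and other related pages."
for _ in range(3):
    style = rng.multivariate_normal(mean, cov)
    # waveform = tts(text, style_embedding=style)   # tts() as sketched above
```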
And it's really hard to kind of deploy such a model for real. Here's a model with a daft name. Here's a model that does the most typical thing of current models: if you want to get a certain output from your model, you just go and find a sample of speech that sounds like what you want, potentially from a different person. This model claims it's only going to copy the prosody. So that's the rhythm and the intonation and not the speaker identity. But it's really hard to disentangle those two things, as we already said. So here you're going to hear pairs: a natural reference, where we're supposed to hear the prosody and ignore the content, and then that prosody in a different voice with different content. Had this been common practice? That's the question. Breaking the threads gently, one by one. So the text isn't the question, but it will make it have the intonation of a question. You are not so important after all, Powama, he said. That's the reference. On the mossy trunk of the fallen tree. So we can impose the prosody of one on another, but there's clearly lots of leakage of other things going on there. The speaker's leaking through, the recording conditions are leaking through. It's essentially really hard to disentangle those factors in the data. Here's a model that's generative. It does the tossing a coin trick. There's a little part in there. It's called flow. Don't worry what it is. It's just something that's probabilistic. It's a probability distribution. And we can sample from it. And the samples are all reasonable. How much variation is there? How much variation is there? How much variation is there? How much variation is there? They're all reasonable. I don't know in what context each of them will be good. And I don't know how to choose that. And I don't know how to ask the model to do it again. But what I can do is generate lots and lots and lots of examples and pick the one I really like. That might work in some applications where we're pre-generating the speech, or we could automatically pick them. And all modern models are now able to do this trick of generating lots of versions. And here's a model I'm not going to talk about. It's just to give you an idea of how ridiculously complicated models got just before large language models took over. So we were trying to solve these problems of disentanglement, of control, of being generative with extremely complicated bespoke neural network architectures. These are unreproducible. They're impossible to train. There's so many settings and hyperparameters. Unless authors literally release their exact code, it's hopeless to actually try and implement a model like that. You'd be spending months and months. But it's OK, because much simpler models came out that work much better. And I'll talk a bit more about these ones later on. This is a model from Microsoft, which is a prototypical kind of language model-based speech synthesizer. I'll explain later why synthesis can be just seen as language modeling. It's easy. So here's some different examples. We're going to first hear that it can just do nice variation. I must do something about it. I must do something about it. Same speaker, two different ways of saying it. Random, but both plausible. He has not been named. He has not been named. Again, randomly putting emphasis on different words. We can't control it, but it is very natural. Now we're going to show the model some reference speech and say, please sound like this person. This is going to be a test of your accent perception. 
The first thing you hear is going to be natural speech. It's got a very specific accent from a particular English-speaking part of the world. And the second one is supposed to sound like the same person. I had decided to quit the show. It's a female voice. You may or may not hear it. It's totally unfair. She's from Scotland, where I live. I could hardly move for the next couple of days. She's definitely coming out American for me. Sounds like the same person, if only she'd been born in America. Throughout the centuries, people have explained the rainbow in various ways. There's a male, Scottish, from a particular part of Scotland. So what is the campaign about? He's gone Canadian. Why? Well, because in the training data for the model, there were lots of people with lots of accents and lots of speaking styles, but there probably weren't very many Scottish people. It's a very small country, very underrepresented in the training data. So when we ask this model to sound like this particular person, it can't. There isn't anyone in its training data like that. So it'll just find the closest person in its training data and sound like that person. So we're going to accidentally sound like some Canadian guy who was in the training data of this model, and I don't know who that was. So it's really hard to manufacture new speakers. We can't really sound like that person. We can fake it. So the models are limited. They're very powerful now. We can do lots of cool tricks, but there are limitations. So if you wanted to build a conversational user interface, you wanted a voice with very specific demographics and speaking style, whatever it was, a model might be your answer. Picking the right model is important, but it's not the only solution. You need to do more than that. So the model has limitations. And don't go inventing models. Models are invented by trial and error, to be honest. You can tell that from the fact that the papers have 20 or 30 authors. And they'll all become obsolete. VALL-E is already in about its fifth iteration. There's VALL-E X or VALL-E 2 or whatever's coming out next week. They'll all be obsolete. So you won't get famous inventing a model. You'll be famous for 15 minutes, if that's what you want. But much less obvious and rather more serious than that, is that all these models that are coming out, if you go and read the papers and think, which model should I choose for my application? None of them are evaluated on any useful applications at all. They're evaluated on things that are very poorly defined, such as prosody transfer. What's prosody transfer? Can you do that without transferring something about the speaker style? Can you make me speak with Norbert's prosody? What would that even be? I don't know. We have different accents. Style transfer? Well, I don't know what style is. It's not emotion. So that's ill-defined. And the canonical task of the moment, because everybody wants to beat everyone else and everyone else wants to claim we're the first to human parity on this task, is a thing called zero-shot TTS. And that's just showing it a speaker it's never seen before. For example, a Scottish male, and saying, sound like this person. And people evaluate these things, either with listeners that can't hear Scottish accents or speaker ID systems that don't care about accent. So on to what is actually the most important ingredient of all. There's no point having loads of data, and there's no point having a super fancy model if you can't label the data with something of interest. 
I'm going to try and convince you of that now. So what would you label? Well, you've got to label your data. Your data is normally lots of utterances chopped up into pieces with transcriptions. So you might label who's speaking. If you know the genre, for example, you know the podcast it came from and you know the podcast is about sports, you could label it sports podcast as a genre. If you know people are doing a meditation video, you could label the style as being calm or something if you know what those labels are. You could label those on every utterance. You could even get adventurous and go down to the words and say this word is emphasised. You could even go below that and say this word is pronounced in this particular way. If you are very brave, you might label something about prosody. Maybe you'll label things that have question intonation, things that don't, or maybe something more sophisticated than that. So you don't just choose what things to label. You've also got to choose when to label. It's not very obvious, this, but you could label before you've even recorded. That's what we used to do in the studio. We used to write a script and give it to the speaker and say, read that. So we don't have to label the speech because they've got the labels. Label first, speech later. That's great. It was very limiting. It was limited by the capability of that actor's performance. We could record first, for example, get them to improvise a dialogue and then we'll have to transcribe it. That used to be really expensive, but it's okay because ASR has solved that problem. So just run it through Whisper and it just works. That's not a flippant remark. If you run it through ASR and it doesn't transcribe it, it's probably too hard for the TTS model because we're behind the curve. They're better than we are. So we can just use speech recognition and throw it away if it didn't work. We can label actually whilst we're modelling the speech. We saw models that try and put their own labels onto speech. They don't know what the names of those labels are, but they can give numerical labels or embeddings or vector space embeddings, saying this utterance lives here in the space. So being here plus the text explains everything. This utterance is over here in the space. Or the most modern paradigm of all, and that's just to synthesise and then see what you got. We've always done that in applications where there's a human in the loop. We can listen to it. Didn't like it? Let's try again. And again. A dirty secret about all the generative models, VALL-E and everything else you've already heard: they all get put through speech recognition immediately after synthesis to check they said the right thing. That they said all the words and no extra words. And if that's not the case, we just quietly try again and don't admit it and don't tell anyone. Because they all have problems with either stopping speaking or hallucinating or going off script, just extemporising. Why do they do that? It's probably because it's in the data. The data's probably got speech in which the words weren't all transcribed. So it knows it should just sometimes say extra stuff or other things. It's all from the data. They're only generating what's in the data. They're not doing anything new. What to label is way more difficult, and I don't have good answers to that. Some things are just kind of true to some extent. We know what the words are. That becomes less and less true for spontaneous speech. But we roughly can say what the words are. 
We often have metadata. So who the speaker was, where we got the data from, is it an audio book and so on. We can infer a lot of things. We'll see in a moment just how we might do that. We might automatically or analytically label things such as recording conditions. Or we might get people to label what they heard. That's the ultimate form of labelling. If you want to label style, the only thing that matters is what people hear. So if you can get people to label what they hear, then you can do that. We'll see a model in a moment that does that. And then, of course, the model could even learn its own labels. But they're hard to interpret. So labels are the key. If you can label it, the model will probably be able to do it. So there's lots of classical examples. This is going all the way back. We used to label pronunciations and then we could change the pronunciation by changing the phones. We could get words emphasised by capitalising them in the script. That's just borrowed from Lewis Carroll's novels. Get the voice talent to do that in the studio and then you get that for free. You could label prosody if you were mad. These are all classical. I won't bother with examples. We'll go straight to the newer stuff. Back to this one. This is the only bit of my own work in the whole talk. I'll just remind you what this one sounds like. Alice was beginning to get very tired of sitting by her sister on the bank and of having nothing to do. Once or twice she had peeped into the book her sister was reading but it had no pictures or conversations in it. And what is the use of a book, thought Alice, without pictures or conversations? So she was considering in her own mind... Okay, so given a description, expressed in natural language, of what the speaker sounds like and what the recording conditions sound like, this model will do it. And it will say an arbitrary text in that way. This model is trained on about 40 or 50,000 hours of speech, which therefore has to be automatically annotated. Let me just give you an idea of what I mean by automatically annotate. Let's take a very simple measure that we can measure from any signal and that's how noisy it is. It's the ratio between the signal, the speech, and the noise. So high is very clean and low is very noisy. Use only the white inner stalks. So that's a very clean one and we can automatically measure that it's clean with signal processing or machine learning and we can label it with a label. And all work had ceased with him. And that's got a bit of background noise in it so we'll say that's a bit noisy, fairly noisy. So given such a label and lots of other labels, we can label every utterance in some very large database automatically with all sorts of things. If you had a gender classifier, if you had an accent classifier, if you had a recording-conditions or reverberation measure, if you could measure some statistics on the prosody, the intonation, speaking rate, you could label it with all of those things. This model has a little bit of a gimmick. Instead of using them as labels, it actually assembles them into natural language captions. This is just to mimic the world of images. So we've all seen those stability.ai image generations from a prompt. How are those done? Well, they're done from a huge database of images with natural language descriptions. And those are just those accessibility captions on the internet. So on any good website, every image should have an alt text for people who can't see the image. 
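In that spirit, here is a hypothetical sketch of how automatic per-utterance measurements might be binned into words and assembled into a caption; the thresholds and wording are invented for illustration and are not those of any particular paper.

```python
# Assume the numbers below come from whatever classifiers and signal-processing
# tools you trust (a gender classifier, an accent classifier, an SNR estimator,
# prosody statistics, and so on). The job here is just binning and templating.

def describe_snr(snr_db: float) -> str:
    if snr_db > 40:
        return "very clean, studio-quality recording"
    if snr_db > 20:
        return "fairly clean recording"
    return "noisy recording"

def describe_rate(syllables_per_second: float) -> str:
    if syllables_per_second > 5.5:
        return "speaking quickly"
    if syllables_per_second < 3.5:
        return "speaking slowly"
    return "speaking at a moderate pace"

def caption(utterance: dict) -> str:
    return (f"A {utterance['gender']} speaker with a {utterance['accent']} accent, "
            f"{describe_rate(utterance['syllables_per_second'])}, "
            f"in a {describe_snr(utterance['snr_db'])}.")

example = {"gender": "female", "accent": "Scottish",
           "syllables_per_second": 4.2, "snr_db": 35.0}
print(caption(example))
# A female speaker with a Scottish accent, speaking at a moderate pace,
# in a fairly clean recording.
```

Images, of course, get such descriptions for free as alt text.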
And so you've got free labels on your data saying, this is a cat playing with some wool or something. Unfortunately, speech doesn't have that on the internet. And if it did, it would just be a transcript of the words. It doesn't say, and it's said by an American lady with a high voice. If it did, then we could build such a model. So this was to mimic: what if we had audio captions? There are huge audio caption datasets for general audio, but not so much for speech. And so given that natural language description, we can just put that into some representation space. Here, we just stick it through some big natural language processing type model that we borrowed from the natural language processing people. Put it into a vector space and say, that's what it means to be this accent and this speaking style. And this model can generalize pretty well in that it can make new combinations of things. So it can interpolate. So if we didn't have any American females in background noise, then this model will do it because it will have seen some American males and some Pakistani females and so on and so on. OK, so it can make new combinations of things. It can't do things it never saw at all. So you can't say, with dance music in the background, because there was no such data. It wouldn't do that. OK, so it's interpolating really, really well in the data. There's other models that do similar things. And in a moment, I'm just going to tell you how they all work. Let's just look at their capabilities. Here's a model which can edit audio by editing the text. So we'll listen to original audio and then audio that's been manipulated just through the interface of text. Fast cars that had the nice clothes, that had the money. They was criminals. Natural speech. Fast cars that had the nice clothes, that had expensive gold watches, that had the money. They was criminals. The middle part's synthetic. The end parts are natural. It's been manipulated through the medium of text. Of course, this model can also do that zero shot thing. So we'll hear a natural speech sample and then the model saying some new text in that speaker's voice. Like there's this one sketch we did that was about this pixie that would appear whenever racist things happens. Whenever someone make you feel like... That's natural. Now the text to speech. I don't want to make it as a thing where I'm absolving myself of any responsibility. Fairly plausible with a longer thing. OK, lots more models do it. I mentioned before that the ultimate way to label your speech would be to get people to listen to it. Here's a paper that's a bit like the work of ours that I showed. But now they got people to listen and intuitively give some description of the speaker. The machinery was taken up in pieces on the backs of mules from the foot of the mountain. These are the sort of adjectives that people apparently used when they heard the speakers, and that were then used to label them. It just shows what a wide range of labels people would use if they're unconstrained. I can't pray to have the things I want, he said slowly, and I won't pray not to have them. Not if I'm damned for it. Of course, it can be very expensive to get people to label a lot of data that way. So why can't you just label everything? You might think it's because it's too expensive. And I don't think that is the limitation at all. With the Lyth and King paper, we labelled 45,000 hours. We didn't do any human labeling at all. It was all automatic. The trick is really simple. 
Take a little bit of data and label it with the thing of interest. Speaker, style, recording conditions. Whatever that thing might be. And then build an automatic labeler from that data. And then just use that to label all the rest of the data. If that doesn't work, then probably your labels are wrong. Because that's just classical supervised machine learning. And that should just work. Whether it's signal to noise ratio, recording conditions, accent, gender. If you can't do that, then probably go and rethink your labels, rather than blame the machine learning. That just works. And there's lots of models out there that do that. What's really hard, and here's a question for you that I'm going to ask you over the next two days because I don't know the answer to it: if I wanted to build a conversational voice, what should I label my speech with? So there's lots of off-the-shelf models for automatic labeling. There isn't really a good one for style. There are some for emotion, but they're trained on awful acted emotion databases. And they're frankly useless. So I'll just wrap up then with what the state of the art is. In the distant past, I don't know, 10 or more years ago (I've been doing it for 30 years, so I'm allowed to say that's a long time ago), we just went in the studio. We couldn't handle just any data. We had to label it by giving a script to an actor in a studio. More recently, we went off and found lots of data, but the quality of the data wasn't great. It was audiobooks or YouTube or something. And now it's just everything. What we saw in that Lyth and King paper is that only a very small amount of that data is of very high quality. Just a few hundred hours out of a few tens of thousands of hours was actually studio-like quality. But the model can routinely produce any speaker in any accent as if it was studio quality, because it can make new combinations. It can interpolate. There's good news on the models, as long as you've got a little bit of compute. In the past, we had very generic models, and they just kind of worked, but they didn't sound amazing. More recently, we have horrific architectures that are unreproducible. Don't even try. They're unstable. They're temperamental. They're hard to train. They don't do the same thing twice. But now we're into the world again of generic models, just language models. They're expensive to train the first time. So the Lyth and King model, it costs somewhere between $5,000 and $10,000 worth of GPU compute to train it the first time. But the good news is once these models are trained, you can teach them new tricks by fine-tuning them afterwards. So what does the state-of-the-art look like? Well, it looks like the old models, just like that old model there. It's text-to-speech with some reference speech to give it some hints about how to say it or what to sound like. We just redraw the diagram a little bit and relabel things. And now it's just language modelling. So what's a language model? A language model is just a thing that, given some context, predicts the next thing in the sequence. That's all language models do. It's all GPT does. We can just rearrange the things to make it look like it's doing cool tricks. So what's the context? Well, the context is, here's some speech turned into tokens. Think of them as symbols. Here's some text turned into tokens. And given those n minus 1 previous things, just fit a language model to that. 
Hopefully we've all done n-grams in some course at some point in the past. All we're doing now is that n is just really, really big. And we're not just counting stuff. We're building very strong models of distributions that we can really sample from. Not just take means from, but sample from. And given n minus 1 previous things, sample from the distribution of current things. Put that in the history and just proceed. It's autoregression. And don't worry that the blue things and the green things are different colours. One's audio and one's text. We can just put them into the same space, the same symbol vocabulary of the model. That's actually mind-bending, but easy. So all you've got to do is decide what the n minus 1 things are and what you would like to generate. The n minus 1 things could be arbitrary combinations of audio and text or anything else. And the next things could be anything else as well. So, labelling. Labelling's everything. If you can label it, then you can put it in the context of your large language model. If you can label it, you've already expressed knowledge about what you're trying to achieve. So just specify what you want to do. Specify what success sounds like. And then label the data with those things. And it will just work, given enough data. As long as it's in your data. And it probably is in the 1 million or 10 million hours of scraped data. Because if it's not, then no one has ever said it like that before, and now you're trying to do something that humans have never done. So, to summarise the opportunities: I don't know. You're going to tell me; I'm here for two days. I think the opportunities are that you don't need to worry about going shopping for models. All the big language models are great. They all sound good. You don't need to worry about data. It's been scraped. It's sitting there. So the Lyth and King model, although it's proprietary to Stability AI, has been reproduced by HuggingFace. It's out there. It's pre-trained on 10,000 hours. And even the dataset is out there on HuggingFace. You can just pull it in one line of code. So there's no excuse for not playing with it. It just works. You can fine-tune it on one GPU in a couple of hours to a new speaker or new style. But what do you want it to do? I don't know. And I'm going to ask you about that over the next two days. Thanks very much.
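To make the "it's just language modelling" picture from the end of the talk concrete, here is a minimal, purely illustrative sketch: text tokens and audio codec tokens share one vocabulary, and generation is just repeated next-token sampling. The probability table is a stand-in; a real system uses a neural audio codec and a large transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

TEXT_VOCAB = 1000           # token ids 0..999 stand for text
AUDIO_VOCAB = 1024          # token ids 1000..2023 stand for audio codec tokens
VOCAB = TEXT_VOCAB + AUDIO_VOCAB

def next_token_distribution(context: list) -> np.ndarray:
    """Stand-in for the language model: p(next token | context)."""
    logits = rng.normal(size=VOCAB)        # a real model computes these from the context
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def generate(prompt: list, n_new: int) -> list:
    context = list(prompt)
    for _ in range(n_new):
        probs = next_token_distribution(context)
        token = int(rng.choice(VOCAB, p=probs))    # sample; don't take the mean
        context.append(token)
    return context

# The prompt is whatever you can label and tokenise: some reference-audio tokens
# (to say "sound like this"), followed by the text tokens you want spoken.
reference_audio_tokens = [1000 + int(t) for t in rng.integers(0, AUDIO_VOCAB, 50)]
text_tokens = [int(t) for t in rng.integers(0, TEXT_VOCAB, 20)]
output = generate(reference_audio_tokens + text_tokens, n_new=200)
# The newly generated audio tokens would then go through a codec decoder (the
# modern vocoder) to become a waveform. Because we sample, every run differs.
```

Everything interesting lives in what goes into that prompt, which is to say, in what you were able to label.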