Download the slides for the module 4 videos
Total video time to watch in this module: 58 minutes
All modern methods of speech synthesis - including unit selection, which we've already covered - rely on a fairly substantial database of recorded natural speech. So now we need to think about what's going to go into that database. What should we record?

Before proceeding, let's do the usual check. What should you already know? You should certainly know how the front end works, and in particular that it produces a thing called a linguistic specification. You need to know what's in that linguistic specification. In other words, what features the front-end is able to provide: phonetic features, such as the current sound, preceding sound and following sound; prosodic features; and also what we might be able to easily derive from those things, such as position - where we are in the current prosodic phrase, for example. Given that linguistic specification, you should also understand that we're going to select units at synthesis time from the most similar context we can find to the target. We're going to combine that similarity measure with a join cost to enable smooth concatenation. It's the job of the target cost to measure similarity (or distance) between units in the database - which we call candidates - and the target specification. Our target cost function is going to rank units from the database: it's going to rank candidates.

The database is going to be fairly large, and we're going to need to label it. At the very least, we need to know where every unit starts and finishes. So we need some time alignment. Because the database is large, we might not want to do that by hand. There might be some other good reasons not to do it by hand. We're going to borrow some techniques from Automatic Speech Recognition. So, you need to know just the basics of speech recognition: using simple Hidden Markov Models, for example context-independent models of phonemes; very simple language modelling using finite state models; and how decoding works, in other words, how we choose the best path through our network of language model and acoustic model.

Before starting the discussion of what's going into the database, let's just establish a few key concepts to make sure we have our terminology right. The first key concept is going to be that of the "base unit type", such as the diphone. The second key concept is that these base units (for example, diphones) occur in very varied linguistic contexts. We might want to cover as many of those as possible in the database. That leads us into the third concept, which is coverage: how many of these unit-types-in-linguistic-context we could reasonably get into a database of finite size.

Looking at each of those key concepts in a little bit of detail then: the base unit type is the thing that our unit selection engine uses. Most commonly that type is going to be the diphone. It could also be the half-phone. In the modules about unit selection, we also talked about using heterogeneous unit types: things of variable linguistic size. We also said that we don't really need to think about variable-size units. We can use fixed-size units (such as the diphone) and the zero join cost trick, to effectively get larger units at runtime. So, from now on, let's just assume that our base unit type is the diphone. There's going to be a relatively small number of types of base unit.
It's certainly going to be finite - a closed set - and it's maybe going to be of the order of thousands: one or two thousand types.

In our unit selection engine, when we retrieve candidates from the database before choosing amongst them using the search procedure, we look for some match between the target and the candidate. At that retrieval stage, the only thing we do is to strictly match the base unit type. So, if the target is of one particular diphone we only go and get candidates of that exact type of diphone, from all the different linguistic contexts. The only exception to that would be if we've made a mistake designing our database. Then we might have to go and find some similar types of diphones, if we have no examples at all of a particular diphone. The consequence of insisting on this strict match is that our target cost does not need to query the base unit type: they all exactly match. The candidates for a particular target position are all of exactly the matching diphone type. All the target cost needs to do is to query the context in which each of the candidates occurs, and measure the mismatch between that and the target specification.

Given the base unit type then, the second key concept is that these base units occur in a natural context. They're in sentences, typically. Now, the context is potentially unbounded: it certainly spans the sentence in which we find the unit, and we may even want to consider features beyond the sentence. The number of linguistic features that we consider to be part of the context specification is also unlimited. It's whatever the front-end might produce and whatever we might derive from that, such as these positional things. So the context certainly includes the phonetic context, the prosodic environment, these derived positional features, and anything else that we think might be interesting to derive from our front-end's linguistic specification. The exact specification of context depends entirely on what our front-end can deliver and what our target cost is going to take into account, whether it's an Independent Feature Formulation or an Acoustic Space Formulation. For the purposes of designing our database, we're going to keep things a little bit simple and we're just going to consider linguistic features. We're just going to assume that our target cost is of the simplest IFF type when designing our database.

Although the context in which a speech unit occurs is essentially unbounded (there are an infinite number of possible contexts because there are an infinite number of things that a person might say), in practice it will be finite because we will limit the set of linguistic features that we consider. We'll probably stick to features that are within the sentence. Nevertheless, the number of possible contexts is still very, very large and that's going to be a problem. Just think about the number of permutations of values in that linguistic specification: it's just very, very large. If we would like to build a database of speech which literally contains every possible speech base unit type (for example, each diphone - maybe there are one to two thousand different diphone types), each occurring in every possible linguistic context, that list will be very, very long. Even if we limit the scope of context to just the preceding sound, the following sound, and some basic prosodic features, and positional features, this list will still be very, very long.
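To make "very, very long" concrete, here's a back-of-the-envelope calculation. This is just a sketch: the phone set size of around 45 and the choice of context features are illustrative assumptions.

    # Rough size of the "wish list" of units-in-context.
    # Assume roughly 45 phonemes plus silence, which is treated like a phone.
    phones = 45 + 1
    diphone_types = phones * phones    # every ordered pair: an upper bound on base unit types

    # Now add a very modest amount of context to each diphone:
    # preceding phone, following phone, and a stressed/unstressed flag.
    units_in_context = diphone_types * phones * phones * 2
    print(diphone_types)      # 2116 - "one or two thousand types"
    print(units_in_context)   # 8954912 - roughly nine million, far more than we could ever record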
We have to ask ourselves: would it even be possible to record one example of every unit-in-context? Let's just point out a little shorthand in the language I'm using here. We've got "base unit types" such as diphones. We've got "context", which is the combination of linguistic features in the environment of this unit. So we should be talking about "base unit types in linguistic contexts". That's a rather cumbersome phrase! I am going to shorten that. I'm going to say "unit-in-context" for the remainder of this module. When I say "unit-in-context" I'm talking about all the different diphones in all the different linguistic contexts.

Natural language has a very interesting property that is one of the reasons it's difficult to cover all of these contexts: types are distributed very unevenly. Whatever linguistic unit we think of - whether it's the phoneme or indeed the letter or the word - very few types are very frequent (and for the purposes of building the database, the types we care about are units-in-context). Think about words: you know that some words - such as this one here - are very, very frequent. Other words are very infrequent. That's true about almost any linguistic unit type. It's certainly going to be true about our units-in-context. The flipside is that there are many, many types that are individually very, very rare, but there's a very large number of such types. Taken together, they are frequent. So we get this interesting property: that rare events are very large in number. In other words, in any one particular sentence that we might have to synthesize at runtime, there's a very high chance that we'll need at least one rare type. We've already come across that problem when building the front end. We know that we keep coming across new words all the time. These new words are the rare events, but taken together they're very frequent: they happen all the time.

Let's have a very simple practical demonstration of this distribution of types being very uneven. Here's an exercise for you. Go and do this on your own. Maybe you could write it in Python. Use any data you want. I'm going to do it on the shell, because I'm old fashioned. Let's convince ourselves that linguistic units of various types have got this Zipf-like distribution. Let's take some random text. I've downloaded something from the British National Corpus. I'm not sure what it is, because it doesn't matter! Let's just have a look at that: it's just some random text document that I found. I'm going to use the letter as the unit. I'm going to plot the distribution of letter frequencies. In other words, I'm going to count how many times each letter occurs in this document and then I'm going to sort them by their frequency of occurrence. We'll see that those numbers have a Zipf-like distribution.

I'll take my document, and the first thing I'm going to do is downcase everything, so I don't care about case. Here's an old-fashioned way of doing that: "translate" ('tr') it. We take all the uppercase characters and translate them individually to lowercase. Let's check that bit of the pipeline works. Everything there is lowercase - it's all become downcased, you can see. I'm now going to pull out individual characters. I'm only going to count the characters a-to-z. I'm going to ignore numbers and punctuation for this exercise. So we'll grep, and we'll print only the matching part of the pattern, and we'll grep for the pattern "any individual letter in the range a-to-z lowercase".
Let's check that bit of the pipeline works. That's just printing the document out letter by letter. I'm now going to count how often each letter occurs. There's a nice way of doing that sort of thing on the command line. First we sort them into order, and then we'll put them through a tool called 'uniq'. uniq finds consecutive lines that are identical and just counts how many times they occur. On its own, it will just print out one copy of each set of duplicate lines. We can also ask it to print the count out. Let's see if that works ... it just takes a moment to run because we're going through this big document. So there we now have each letter and the number of times it occurs. There's our distribution. It's a little bit hard to read like that because it's ordered by letter and not by frequency, so let's sort it by frequency. I'm going to 'sort', and sort will just operate on the leftmost field, which is conveniently here the number, and we'll sort it numerically, not alpha-numerically. I'm going to reverse the sort, so I get the most frequent thing at the top.

And there is our kind-of classical Zipf-like distribution. We can see that there are a few letters up here that are accounting for a lot of the frequency: in other words, much of the document. There's a long tail of letters down here that's rather low in frequency: much, much lower; an order of magnitude lower than those more frequent ones. If we were to plot those numbers (and I'll let you do that for yourself, maybe in a spreadsheet) we'd see that it has this Zipf-like decaying distribution. So the Zipf-like distribution holds even for letters: even though there are only 26 types, we still see that decaying distribution. If we do this for linguistic objects with more and more types, we'll get longer and longer tails, until we end up looking at open-class types such as words, where we'll get a very, very long tail of things that happen just once.

Let's do one more example. Let's do it with speech this time: transcribed speech. We'll look at the distribution of speech sounds. I've got a directory full of label files of transcribed speech. It doesn't really matter where it's come from at this point. Let's look at one of those: they're sequences of phonemes, labelling the phones in a spoken utterance, with timestamps and so on. I'm going to pull out the phoneme label and I'm going to do the same thing that I did with the letters. So again I'm going to be old-school: just do this directly on the command line. If you're not comfortable with that, do it in Python or whatever your favourite tool is! There are many different ways to do this kind of thing.

The first thing I'm going to do is pull out the labels. I know that these labels are always one or two characters long. So let's 'grep'. I use "extended grep" ('egrep') - it's a bit more powerful. I don't want to print out the filenames that are matching. Again, I just want to print out the part of the file that matches this expression. I'm going to look for lowercase letters and I know that they should occur once or twice: so, single letters or pairs of letters; these are what the phoneme labels look like. I also know that they happen at the end of a line. I'm going to do that for all of my labelled speech files. Let's just make sure that bit of the pattern works. Yes, that's pulling out all of those. We'll do the same thing that we did for the letters: sort it, 'uniq -c' it, and order it by frequency in reverse order. Let's run that.
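(If you'd rather do this exercise in Python, as suggested earlier, here's a minimal sketch of the letter-counting version; the filename is just a placeholder for whatever text document you use.)

    from collections import Counter
    import re

    # Count how often each lowercase letter a-z occurs in a text document, then
    # print the counts from most to least frequent - roughly the same job as the
    # shell pipeline: tr 'A-Z' 'a-z' | grep -o '[a-z]' | sort | uniq -c | sort -rn
    with open("some_document.txt") as f:
        text = f.read().lower()                 # downcase everything
    letters = re.findall(r"[a-z]", text)        # keep only individual letters a-to-z
    counts = Counter(letters)
    for letter, count in counts.most_common():  # most frequent type first
        print(count, letter)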
Again we see the same sort of pattern we saw with letters. If we did this with words, or with any other unit, we'd get the same sort of pattern. There are a few types up here that are very frequent. There's a long tail of types down here that are much less frequent; again, at least an order of magnitude less frequent, and possibly more than that. Because this is a closed set, we don't get a very long tail. You should go and try this for yourself with a much bigger set of types. I suggest doing it with words or with linguistic unit-types-in-context: perhaps something like triphones. But, even for just context-independent phonemes, there are a few that are very low frequency. I'm using over a thousand sentences of transcribed speech here, and in those thousand sentences there are a couple of phonemes that occurred fewer than a hundred times, regardless of the context. That's going to be one of the main challenges in creating our database.

If we plot those distributions of frequencies-of-types - ordering the types from the most frequent to the least frequent, and plotting each type's frequency on the other axis - we always tend to get this sort of shape, this decaying curve. Now, you'll often see this curve called a Zipf distribution. That should have a particular exact equation: it's a particular sort of distribution. Of course, real data doesn't exactly obey these distributions. It's just somewhat similar, and has the same sort of properties. In particular, it has this Large Number of Rare Events. So really we should be talking about a Zipf-like distribution, not exactly a Zipf distribution.
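For reference, the ideal Zipf distribution that the curve is named after says that a type's frequency is inversely proportional to its rank:

    f(r) ∝ 1 / r^s

where r is the rank of a type (1 = most frequent) and the exponent s is close to 1 for many kinds of linguistic data. Real data only follows this approximately, hence "Zipf-like".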
We've established the key concepts of base unit type (let's just assume that's a diphone) and of the context in which those base unit types occur (that context applies to the sentences in the database and applies equally to the sentences we're going to synthesize). Because base units can occur in many possible contexts, there's a very large number of possible types that we'd like to record in our database. We've seen this Zipf-like distribution of linguistic units, which makes that very hard to achieve.

If we just randomly chose text to record, we would get that Zipf-like distribution, whatever our unit type is, whatever the context is. The only thing we could do to improve coverage would be to increase the size of that database: to record more and more sentences. As we did that, the number of tokens - the number of recorded tokens - of frequent types would just increase steadily. The number of infrequent types would grow very slowly, because that's the long tail in the Zipf distribution. In other words, as we increase database size, we don't change the shape of this curve. All we do is move it upwards, so frequent types become ever more frequent and the rare types slowly become more frequent, because we're just scaling things up, potentially linearly with the database size. So it would take a very long time to cover the things in the tail. In fact, we would find that even for very large databases, almost all types (in other words, almost all units-in-context) will have no tokens at all. We know that - however large the database grows - we'll never cover all of those, because of this Zipf-like distribution. In practice, it's going to be impossible to find a database (composed of a set of sentences) that includes at least one token of every unit-in-context type.

So what can we do then? Well, all we can do is to try and design a script that's better than random selection, in terms of coverage and maybe some other properties. At first glance, it would appear that the main design goal of our script is to improve coverage. In other words, to cover as many units-in-context as possible. What's that going to achieve? Well, it would appear to increase the chance of finding an exact match at synthesis time. In other words, finding exactly the candidate that has the properties of the target: with all of the linguistic specification being the same. Now, we certainly would increase that, but it would remain very unlikely. In fact, the point of unit selection is that we don't need an exact match between the specification of the candidate and the specification of the target. The target cost will quantify how bad the mismatch is, but the mismatch does not have to be zero. The target cost will rank and help us choose between various degrees of mismatch in the various linguistic features. Just as important, then, is that - for each target position - we have a list of candidates from which the join cost can choose. The longer that list, the more chance there is of finding sequences that concatenate well together. So what we're really looking for is not to cover every possible context (because that's impossible), it's to cover a wide variety of contexts so that the target cost and the join cost can between them choose units that don't mismatch too badly in linguistic features, and join together well. We want to achieve that with the smallest possible database.
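If you want to convince yourself how slowly randomly-chosen text covers the tail, here's a small simulation sketch. The number of types and the Zipf-like exponent are arbitrary assumptions, purely for illustration.

    import numpy as np

    # Draw tokens from a Zipf-like distribution over 2000 imaginary "unit-in-context"
    # types, and watch how slowly the number of distinct types covered grows as the
    # amount of recorded data increases.
    rng = np.random.default_rng(0)
    num_types = 2000
    ranks = np.arange(1, num_types + 1)
    probs = 1.0 / ranks**1.1          # Zipf-like: frequency falls off with rank
    probs /= probs.sum()

    for database_size in [1_000, 10_000, 100_000, 1_000_000]:
        tokens = rng.choice(num_types, size=database_size, p=probs)
        covered = len(np.unique(tokens))
        print(database_size, "tokens ->", covered, "of", num_types, "types covered")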
There are a few good reasons for keeping the database size small. One is obvious: it will take time to record this thing. A side effect of taking a long time to record something is that it's harder and harder to maintain a very consistent speaking style over longer recording periods. Amateur speakers (like me, and probably you) find that very hard. In particular, when the recording sessions are split over days or weeks or maybe even months, it's very hard to have the same speaking effort, voice quality, and so on. Professional speakers are much better at that; that's one reason to use them. The second reason to keep the database size small is that, at least in unit selection, the runtime system will include an actual copy of that entire database: the waveforms. So if we've got to store the database in the runtime system (which we might be shipping to our customer) we don't want it to be too many gigabytes.

There might be other goals that we'd like to include in our script design, not just coverage in the smallest possible script. We won't consider these in any great depth here. That's because you can explore them for yourself in the "Build your own unit selection voice" exercise. The sort of things that you might include in your script design might be: choosing sentences that are easy to read out loud, so that your speaker can say them fluently without too many mistakes; you might want to cover low-frequency linguistic forms such as questions; you might want to boost the coverage of phrase-final and phrase-initial units, because in long sentences they're very rare; and you might want to include some domain-specific material, if you think your final system is going to be more frequently used in a particular domain (for example, reading out the news, or reading out emails, or something really simple like telling the time). You're going to explore some of those things for yourself in the exercise, so we won't cover them much more here.

What we will do is look at a typical, simple approach to script design. It's going to be an algorithm: in other words, it's automatic. It's going to start from a very large text corpus and make a script from that. In other words, we're going to choose - one at a time - sentences from a very large corpus, and accumulate them in our script. We'll stop at some point: perhaps when the script is long enough, or when we think coverage is good enough. Now, there's as much art as there is science in script design! I'm not presenting this as the only solution. Every company that makes speech synthesis has their own script design method and probably has their own script that's evolved through many voice builds. We're going to use a simple greedy algorithm. In other words, we're going to make decisions and never backtrack. The algorithm goes like this: we take our large text corpus, and for every sentence in that corpus we give it a score. The score is about how good it would be if it was added to the script. The sort of thing we might use to score will be, perhaps, finding sentences with the largest number of types that are not yet represented in our script. "Types" here would be units-in-context. We add the best sentence from the large corpus to the script, and then we iterate. This is about the simplest possible algorithm. At the end, we'll look at a few little modifications we might make to that. The algorithm requires the large corpus. That needs to come from somewhere. You might want to be very careful about where that text comes from.
If you care about copyright, then you might want to get text for which either there is no copyright (for example, out-of-copyright material), or you might want to obtain permission from the copyright holder. Regardless of where you get the text from, it's most likely that this is written language and was never meant to be read out loud. That has consequences for readability. It might be hard to read newspaper sentences out loud. It might be hard to read sentences from old novels out loud. You're going to find that out for yourself in the exercise! Because written text was not usually intended to be read aloud, its readability might be poor. Also, the sort of variation we get in it might be unlike what we normally get in spoken language. For example, the prosodic variation might be limited. There might be very simple differences, like the fact that there are far fewer questions in newspaper text than in everyday speech. We've already mentioned that phrase-initial and phrase-final segments might be rather low frequency in text because the sentences are much longer than in spoken language. We might want to correct for that in our text selection algorithm. If you're looking for a source of text to use in your own algorithm, the obvious choice is to go to Project Gutenberg, where there's a repository of out-of-copyright novels which you can scrape and use as a source of material. But be very careful, because historical novels might have language that's archaic and might be very difficult for your speaker to read out loud.

To get a better understanding of the issues in text selection, let's work through a rather simplified example. We'll assume we have some large corpus of text to start from. Maybe we've taken lots of novels from Project Gutenberg. The first step would be to clean up the corpus. In that text there are likely to be a lot of words for which we don't have reliable pronunciation information: for example, they're not in our dictionary, and because they might be proper names, we don't anticipate that our letter-to-sound rules will work very well. We're therefore going to need to clean up the text a bit. We might throw away all the sentences that have words that are not in our dictionary, because we don't trust letter-to-sound rules. We might throw away all sentences that are rather long: they're hard to read out loud with reasonable prosody, and we're also much more likely to make mistakes. We might also throw away very short sentences because their prosody is atypical. Depending on who your speaker is, you might also want to throw away sentences that you deem "hard to read". There are readability measures (beyond the scope of this course) that you could use, or we could base that on vocabulary: we could throw away sentences with very low-frequency or polysyllabic words that we think our speaker will find difficult.

The goal of our text selection algorithm is simply going to be coverage of unit-types-in-context. So we need to know - for all the sentences in this large text corpus - what units they contain. Now, we don't have a speaker reading these sentences out yet. We only have the text, so all we can really do is put that text through the front end of our text-to-speech system and obtain - for each sentence - the base unit sequence (for example, the sequence of diphones) and the linguistic context attached to each of those. One way to think about the algorithm is to imagine a "wish list" that enumerates every possible type that we're looking for.
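Before we get to the wish list, here's a minimal sketch of that clean-up stage, assuming we have the corpus as a list of sentence strings and a set of words our front end knows how to pronounce; the length limits are arbitrary illustrative values.

    # Filter a large raw corpus down to sentences we'd be happy to ask a speaker to read.
    def clean_corpus(sentences, dictionary, min_words=5, max_words=20):
        kept = []
        for sentence in sentences:
            words = sentence.lower().split()
            if not (min_words <= len(words) <= max_words):
                continue    # too short (atypical prosody) or too long (hard to read fluently)
            if any(w not in dictionary for w in words):
                continue    # contains a word we'd have to guess with letter-to-sound rules
            kept.append(sentence)
        return kept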
If we were just looking for diphones (ignoring context for a moment), we would just write out the list of all possible diphones: from this one...to this one. (Remembering that silence would also be treated as a phoneme in this system.) If we start adding linguistic context, that wish list will grow in size exponentially as we add more and more context features. Imagine just asking for versions of each diphone in stressed and unstressed prosodic environments. Imagine then adding all possible linguistic features and enumerating a very long wish list in this way. If you implement an algorithm, you may or may not actually construct this list, because it's going to be huge. You might not actually store it in memory. For the worked example, I'm going to just ignore linguistic context and try and get coverage just of diphones. The method will work the same for diphones-in-context, of course.

Here's part of my very large text corpus, which I've put through my text-to-speech front end. It's given me a sequence of phonemes, from which I've got the sequence of diphones. This is just the first few sentences of this very large corpus. I'm going to write down my wish list. Here's my wish list: it's just the diphones. Again, in reality we might attach context to those. What I'm going to do is score each sentence for "richness". Richness is going to be defined as the number of diphones that I don't yet have in my script. When we're doing that scoring, if a sentence contains two copies of a particular unit-in-context (here, just the diphone) then only one of them will count. So we can get rid of these duplicates. What we need is a function that assigns a score to each of these sentences. We won't write that out. We'll just do this impressionistically. If we were just to count the number of types, then we would pretty much always choose long sentences, because they're more likely to have more types in them. So at the very least, we're going to control for length. Perhaps we'll normalize "number of types" by "the length of the sentence".

Under my scoring function it turns out that this sentence here scores the highest. It's got the most new types in the shortest sentence. So I will select that sentence, and I'll put it in my script for my speaker to read out in the recording studio later. I'll remove it from the large corpus. Then I'll repeat that procedure. But before I carry on, I'm going to cross off all the diphones that I got with that sentence. So I'll scan through the sentence and I'll look at all the diphones that occurred there and I'll cross them off my wish list. So: these diphones...and various other diphones have gone. Our wish list has got smaller, and sentences will no longer score any points for containing those diphones, because we've already got them. That sentence has been selected: it disappears from the large corpus. Then we'll score the remaining sentences. The scores will be different from the first time round, because any diphones that have already gone into the script no longer score points. Remember, that was these ones...and some other ones. The result of this algorithm is going to be a list of sentences in a specific order: in order of descending richness. So, when we record that script, we probably want to record it in that order too, so we maximize richness. And we can stop at any point: we could record a hundred, or a thousand, or ten thousand sentences, depending how much data we wanted. The order of the sentences will be defined by the selection algorithm.
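Putting that worked example into code, a sketch of the greedy loop might look like this. Scoring by "new types per word" is just one reasonable choice, and the data structures here are illustrative: each candidate sentence maps to the set of units (here, just diphone types) the front end predicts for it.

    def greedy_select(corpus_units, num_to_select):
        """corpus_units: dict mapping each candidate sentence (a string) to the set
        of unit types the front end predicts for it. Returns the selected script,
        in order of decreasing richness."""
        covered = set()                      # wish-list items already crossed off
        script = []
        remaining = dict(corpus_units)
        for _ in range(min(num_to_select, len(corpus_units))):
            # Score = number of not-yet-covered types, normalised by sentence length,
            # so we don't simply favour the longest sentences.
            def richness(sentence):
                return len(remaining[sentence] - covered) / len(sentence.split())
            best = max(remaining, key=richness)
            script.append(best)              # add the richest sentence to the script
            covered |= remaining.pop(best)   # cross off its units; remove it from the corpus
        return script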
That's neat, because we can just run our algorithm and select a very large script, then go into the studio and record for as long as (for example) we have money. I've already said there's as much art as there is science to script design, so it's perfectly reasonable to try and craft some extra selection criteria into your text selection. One sensible thing - which our algorithm doesn't currently do - would be to guarantee at least one token of every base unit type. That will mean we'll never have to back off at synthesis time. If our base unit is the diphone, then our first N sentences (our first few hundred sentences) will be designed just to get one of each diphone. After that we'll switch to looking for diphones-in-context. When we start looking for diphones-in-context, we might actually want to go looking for the hardest-to-find units first: the rare ones first. In other words, give sentences a higher score for containing low-frequency diphones-in-context. That's because we'll get the high-frequency ones along "for free" anyway.

Another thing we might do is try to include some domain-specific material for one or more domains. This is to make sure we have particularly good coverage in these smaller domains, so our voice will perform particularly well in those domains. These might be decided by, for example, the particular product that our synthesizer is going into. If we just selected a script of strictly in-domain sentences, we'd expect very good performance in that domain, but we might be missing a lot of units that we need for more general-purpose synthesis. So a standard approach to getting a domain-specific script is to first use your in-domain sentences (they might be manually written, or from some language-generation system); measure the coverage of that; then fill in the gaps with a standard text selection algorithm, from some other large corpus. Then we can get the best of both worlds: particularly good coverage in a domain, and general-purpose coverage so we can also say anything else.

All that remains then is to go into the recording studio! Find a voice talent who can read sentences fluently with a pleasant voice, and ask that person to read each of our sentences. Then we're going to have a recorded database. The next step will be to label that recorded database.
We'll finish off this module on the database by looking at how we're going to label it ("annotate" it). But let's just orient ourselves in the bigger picture before we continue. What we've got so far is a script, which is composed of individual sentences. That script will have been designed, probably by a text selection algorithm that we've written. It will aim to: cover the units-in-context; be readable; provide each base unit in a wide variety of different linguistic contexts; and possibly some other things as well, such as specific domains. With that script, we've gone into a recording studio and asked a speaker (sometimes called the "voice talent") to record those sentences. They'll generally be recorded as isolated sentences. Our text selection algorithm will very likely have provided us with a script in order of decreasing richness. So we'll record the script in that same order, meaning that we can stop at any point and maximize the coverage for a given amount of material.

What remains to be done is to segment the speech: to put phonetic labels on it, so we know where each (for example) diphone starts and ends. On top of that phonetic labelling (or "segmentation"), we need to annotate the speech with all of the supra-segmental linguistic information: all the other things that the target cost might need to query. We'll start by looking at the time-aligned phonetic transcription. We're going to use a technique borrowed from speech recognition. Then we'll see how we attach the supra-segmental information (in other words, the stuff above the phonetic level) to that phonetic transcription that's been time-aligned.

Let's think about two extremes of ways that we might label the speech, to understand why hand labelling might not be the right answer. If we think that we need a transcription of the speech that's exactly faithful to how the speaker said the text - one that gets every vowel reduction, every slight mispronunciation, every pause, everything exactly faithful to the way the speaker said it - then we might think we want to hand label from scratch. In other words, from a blank starting sheet without any prior information. We'll be down at this end of the continuum. But think about what we're going to do with this data: we're going to do unit selection speech synthesis. We're going to retrieve units, and that retrieval will try to match a target sequence. The target sequence will have gone through our text-to-speech front-end. The front-end is not perfect. It might make mistakes, but it will at least be consistent. It'll always give you the same sequence for the same input text. But that sequence might not be the same as the way our speaker said the particular sentence in the database. So another extreme would be to label the database in a way that's entirely consistent with what the front-end is going to do at synthesis time. We can call that the "canonical phone sequence". In other words, the sequence that is exactly what comes out of the front end. If we had to choose between these two things, we'd actually want to choose this end, because we want consistency between the database and the runtime speech synthesis. Consider the example of trying to say a sentence that exists in its entirety in the database. We would obviously want to pull out the entire waveform and play it back and get perfect speech.
The only way we could do that is if the database had exactly the same labels on it that our front end predicts at the time we try to synthesize that sentence, regardless of how the speaker said it. Now, there are some points in between these two extremes, and we're going to take one of those as our basis for labelling. We're going to slightly modify the sequence that comes out of the front end. We're going to move it a little bit closer to what the speaker actually said. We'll see exactly how we make those modifications, and why, as we go through the next few sections.

To summarize that situation: we have some text that our speaker reads out; we could put that through the text-to-speech front-end and get a phonetic sequence from that. That's the canonical sequence. We can then make some time alignment between that canonical sequence and what the speaker actually said (their waveform). That will be very consistent, but there might be things our speaker did that are radically different to what's in that canonical sequence. A good example of that might be that the speaker put a pause between two words that our front end did not predict, because maybe our pausing model in the front end is not perfect. We could start from what the speaker said and we could hand transcribe and get a phonetic sequence. That phonetic sequence will be very faithful to what the speaker said, but it might be rather hard to match that up with what the front end does at synthesis time. There might be systematic mismatches there. Those mismatches will mean that - when we try to say this whole sentence, or perhaps just fragments of it, at synthesis time - we won't retrieve long contiguous sequences of units from the database. In other words, we'll make more joins than necessary. Joins are bad! Joins are what listeners notice. Consistency will help us pull out longer contiguous units, by getting an exact match between the labels on the database and what the front end does at synthesis time.

Our preference is going to be to start from the text that we asked the speaker to read, get the phonetic sequence, and then make some further small modifications to adjust it so it's a slightly closer fit to what the speaker said in (for example) pausing. The sort of labelling we're doing, Taylor calls "analytical labelling". Do the readings to understand precisely what he means by that. We're going to prefer this to be done automatically. Yes, that's faster and cheaper, and that's a very important reason for doing it. But an equally important reason is that it's more consistent between what's in the database and what happens when we synthesize an unseen sentence. A good way to understand that is to think about the labels on the database as not being a close phonetic transcription of the speech, but being just an index: a way of retrieving appropriate units for our search to choose amongst. Having consistent names for those units in that index is more important than being very faithful to what the speech actually contains. A natural question is whether we could automatically label the speech and then, by hand, make some small changes to match what the speaker actually said. Of course that is possible, and it's actually standard practice in some companies. Those corrections are not going to be small changes to alignments ('microscopic changes'). They're really going to be looking for gross errors, such as bad misalignments, or the speaker saying something that really doesn't match the text.
We're not going to consider this idea of manual correction here: it's too time-consuming and too expensive. We're going to consider only a way of doing this fully automatically. In other words, if the speaker deviated from the text in some way - such as inserting a pause where the front end didn't predict a pause - we're going to discover that completely automatically. The way that we're going to do that is basically to do automatic speech recognition to transcribe the speech. But this is much easier than full-blown speech recognition, because we know the word sequence. Knowing the word sequence is basically like having a really, really good language model: very highly constrained. In automatic speech recognition, we normally only want to recover the word sequence, because that's all we want to output. But if you go back to look at the material on token passing, you'll realize that we can ask the tokens to remember anything at all while they're passing through the HMM states, not just the ends of words. They could also remember the times (the frames) at which they left each model: in other words, each phoneme. Or we could ask them to remember when they left each state. We could get model- or state-alignments trivially, just by asking the tokens to record these things as they make their way around the recognition network.

So the ingredients for building a forced aligner are basically exactly the same as the ingredients for automatic speech recognition. We need acoustic models, that is, models of the units that we want to align. They're going to be phone-sized models. We need a pronunciation model that links the word level to the phone level. That's just going to be our pronunciation dictionary: the same one we already have for synthesis. We might extend it in ways that we don't normally do for speech recognition, such as putting in pronunciation variation. We're going to see in a moment that some rule-based variation, and specifically vowel reduction, is often built in. And we need a language model. That doesn't need to be a full-blown N-gram. We don't need coverage. What we need is just a model of the current word sequence for the current sentence. That's a very simple language model. In fact, the language model will be different for every sentence: we'll switch the language model in and out as we're aligning each sentence. One thing we might do is add optional silences between the words.

We'll come back to exactly how to train the acoustic models in a moment. Let's assume we have a set of fully-trained phone models for now, and see what the language model looks like. Let's write the simplest language model we can think of. Here's the sentence we asked the speaker to say. So that's what we're going to force-align to the speech that they produced. It's a finite state language model. The states are these words, and we just join them up with arcs, with an end state and a start state. That's a finite state language model. We're going to compile that together with the acoustic model and pronunciation model to make our recognition network, do token passing, and ask the tokens to record when they left every single state or phone, depending on what alignment we want. That will get forced alignment for us. That was the language model. The next ingredient is a pronunciation model that maps the words in the language model to phones, of which we have HMMs. Our pronunciation model is basically a dictionary. It maps words - such as this word - to pronunciations, such as this.
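As a sketch, both of those last two ingredients are very simple objects: the per-sentence language model is just a linear chain over the known words, and the pronunciation model is just a lookup table. The phone symbols and the representation below are illustrative, not any particular toolkit's format.

    # Per-sentence "language model" for forced alignment: a linear finite-state
    # network, start -> word_1 -> word_2 -> ... -> word_N -> end.
    def sentence_network(words):
        states = ["<start>"] + list(words) + ["<end>"]
        arcs = [(states[i], states[i + 1]) for i in range(len(states) - 1)]
        return states, arcs

    # Pronunciation model: the same dictionary we already use for synthesis.
    lexicon = {
        "what": ["w", "oh", "t"],
        "can":  ["k", "ae", "n"],
        "it":   ["ih", "t"],
        "do":   ["d", "uw"],
    }

    states, arcs = sentence_network("what can it do".split())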
We're going to add a little rule-based pronunciation variation: we're going to allow every vowel to be reduced to schwa. We'll write out our finite state pronunciation model of the word "can", add arcs, and optionally, instead of the full vowel, we can generate a reduced vowel. So here's our finite state network model of the pronunciation of the word /k ae n/, but it can also be reduced to /k ax n/. I can say "What can [k ae n] it do for..." or "What can [k ax n] it do for..." This recognition network can align either of those variants.

The third and final ingredient is the acoustic model: that's what's actually going to emit observations. We could borrow fully-trained models from an existing speech recognition system, for example speaker-independent models, although in practice we actually tend to get better results with rather simpler models which we can make speaker-dependent, because we can train them on the same data that we're aligning. We might have thousands or tens of thousands of sentences, which is plenty of data to train context-independent phone models. Now, you might be shouting out at this point that training the models on the same data we're aligning is cheating! That's not true. We're not really doing recognition: we're doing alignment. The product is not the word sequence, it's just the timestamps. So there's no concept of a split between training and testing here. We've just got data. We train models on the data and then find the alignment between the models and the data. It's that alignment that we want, not the word sequence.

Those were our ingredients for forced alignment: a language model, a pronunciation model and an acoustic model. We saw how the language model is just derived from the sentences that we ask the speaker to read. We saw how the pronunciation model was simply the dictionary from our speech synthesizer, with some rule-based vowel reduction. If our dictionary was more sophisticated and had full pronunciation variation capabilities, that could be expressed in a finite state form and would become part of the alignment network. The other remaining ingredient to build is the acoustic model. So how can we train our acoustic models on the recorded speech data? Well, it's no different to building any other speech recognition system. We know the word transcriptions of all the data and we have an alignment at sentence boundaries between the transcriptions and the speech. If all you know about automatic speech recognition is how to use whole word models, then you might think that we need to align the data at the word level before we can train the system. But think again: when we trained whole word models, such as in the "Build your own digit recognizer" exercise, those word models had many states and we did not need to align the states to the speech. So, building a speech recognition system never needs state-level alignments. Those would be very tedious to try and do by hand; I've no idea how you would even do that. We can generalize the idea of not needing state-level alignments to not needing model- or word-level alignments. That's easy, in fact. We just take our sub-word models (say, phone models) and we concatenate them together to get models of words, and then we concatenate word models to get a model of a sentence. We get a great, big, long HMM. We know that the beginning of the HMM aligns with the beginning of the audio, and the end of the HMM aligns with the end of the audio. Training that big, long HMM is no different to training a whole word model on segmented data.
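(A quick aside before we give that training procedure its name: here is a sketch of the rule-based vowel reduction described above, expanding one dictionary pronunciation into its reduced variants. The vowel list and phone names are illustrative assumptions; in a real aligner this optionality lives as extra arcs in the network rather than an enumerated list.)

    from itertools import product

    VOWELS = {"aa", "ae", "ah", "ao", "eh", "ey", "ih", "iy", "oh", "ow", "uh", "uw"}  # illustrative subset

    def reduction_variants(pron):
        """Expand one pronunciation (a list of phones) into every variant in which
        each vowel is either kept or reduced to schwa ('ax')."""
        options = [[p, "ax"] if p in VOWELS else [p] for p in pron]
        return [list(v) for v in product(*options)]

    # e.g. "can" /k ae n/ -> [['k', 'ae', 'n'], ['k', 'ax', 'n']]
    print(reduction_variants(["k", "ae", "n"]))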
Using data where we just have a word transcription that's only aligned at the sentence level is so important - and is the standard way of training an automatic speech recognition system - that it comes with a special name. It's called "flat start training". Let's see how flat start training is just a generalization of what we already know about speech recognition. Let's pretend for a moment that these HMMs here are whole word models. They're models of digits. In the exercise to "Build your own digit recognizer", we needed to know where the beginning of this model aligned with the speech and where the end of the model aligned with the speech. Then, given this set of observations, we could train the model of "one", and the same for all the other digits. So we essentially had isolated digit training data. We just generalise that idea. This HMM now is an HMM of this little phrase. We know the start aligns with the start of the audio, and the end aligns with the end of the audio. We just do exactly the same sort of training to train this long model from this long observation sequence. That extends out to a whole sentence.

Right, we've got all the ingredients then: a language model constructed from the sentences we know; a pronunciation model from the dictionary, plus rules; acoustic models created with this thing called flat start training. Let's just make our language model a little bit more sophisticated, to accommodate variations that the speaker might make that our front end doesn't predict, and that is inserting pauses between words. This speaker has inserted a pause between these two words. Perhaps our front end didn't predict a pause in that situation. The way that we do that is to insert an additional acoustic model at every word juncture: a model of optional silence. This model's just an HMM. So it's got a state, it's got a self-transition, and it's got a start and an end non-emitting state (because we're using HTK). That's just a 1-state HMM that can emit observations. As it stands there, it must emit at least one observation to get from the start to the end. We'd like it to be able to emit no observations, so that we have optionality: in other words, it can have zero duration. We can do that just by adding this extra transition, like this. This skip transition has given us a model of optional silence. This state here is going to contain a Gaussian or Gaussian mixture model that emits observations with an appropriate distribution for silence.

So we need to train this model. It's not obvious how to train it, because we don't know where these optional silences are. But what we do know is that there's silence at the beginning and the end of the sentence, and we're going to have a 'long silence' model to deal with those. So we're always going to have our traditional (perhaps 3-state) silence model. This 3-state silence model we'll typically call 'sil', or something like that. This model here we're going to call 'sp', and there's an easy way to initialize the parameters of its single state: just tie them to the centre state of the 'sil' model, which will be easier to train because we'll always have a few hundred milliseconds of silence at the beginning and end of every sentence that we recorded in the studio. So that's a little trick to get us an optional short pause model and to train it easily by tying it to our main silence model.
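As a sketch of what that optional ('tee') short pause model looks like numerically, here's a plausible transition matrix laid out HTK-style, with non-emitting entry and exit states; the probability values are made up purely for illustration.

    import numpy as np

    # 3-state 'sp' model: state 1 = entry (non-emitting), state 2 = the single emitting
    # state (tied to the centre state of the main 'sil' model), state 3 = exit (non-emitting).
    # Entry [i, j] is the probability of moving from state i+1 to state j+1.
    sp_transp = np.array([
        [0.0, 0.7, 0.3],   # entry -> emitting state, or skip straight to the exit (zero duration)
        [0.0, 0.6, 0.4],   # self-loop (emit more frames of silence) or leave
        [0.0, 0.0, 0.0],   # exit state: non-emitting, no outgoing transitions within this model
    ])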
Right, let's do some speech recognition! We're going to do it in the only way we know how: the way that HTK does it in its simple HVite tool. That is, to compile together the language model, pronunciation model, and acoustic model into a single finite state network, and then do token passing. Let's start with the language model. There's a language model: that's finite state. Let's just remind ourselves: it's got arcs and states like this. We'll compile this model with the pronunciation model, which essentially means replacing each word with its sub-word units. That's this model here. Again, remember these are all finite state and there are arcs joining all of those together. I won't draw them in, just to keep things neat. We can now enhance that with those two little tricks. One was to allow every vowel to be reduced, so it's optional vowel reduction. The other was to put this special short pause model at every word juncture, in case the speaker put pauses where our front-end didn't predict them in the phone sequence. So we add some more nodes in the network for that, and now we can draw the arcs in to see how this finite state network looks.

Let's draw the arcs on the network. I'll omit the arrows (they all flow from left to right, of course). So each word could have its canonical pronunciation, like this. Or it has, optionally, vowel reduction like this. Between words we always go through the short pause model, which can emit zero or more frames of silence. The word can have its canonical full pronunciation, or the vowel can be reduced. We go through the short pause model. That's already a schwa, so it can't be reduced any further. Again the canonical pronunciation, or reduce the vowel. The optional short pause model at the word juncture. The canonical pronunciation or the vowel reduction. Then we end. On this network, we just do token passing. Let's imagine a route that the token might take. We might say "There..." - maybe we left a pause, maybe we didn't; the number of frames we spend here will tell us if there was any silence... we maybe reduce this vowel... maybe we say this with its full pronunciation... and this with its full pronunciation. This token would remember its state sequence and its model sequence and would let us recover the timestamps. In other words, we'd know the time at which we left each of these phone models. We'll repeat that for every sentence in the database with our fully-trained models, which have been acquired using flat start training.

What we now have is a phonetic sequence with timestamps. We know the start and end times of every segment. That's not enough: we need to attach supra-segmental information to that. Here's what our speaker said. Here's the phonetic sequence from forced alignment. We're going to attach to that all the supra-segmental information. We're just going to get the supra-segmental information from the text, using the front-end. We're not going to do any alignment or modifications to try and make it match what the speaker said. We'll just take the predictions from the front end. That's a little bit simplistic: we might imagine we could do better than that by, for example, hand-labelling prosody (if we think we can do that accurately...). But the simple method actually works quite well, and it's the standard way of building voices when we use Festival, for example. This level was the force-aligned phone sequence. It's got timestamps. This was the canonical sequence from the front end. You just take the timestamps off one and transfer them to the other.
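One simple way to make that transfer automatically is to align the two label sequences and copy the timestamps across wherever the labels match. This is just an illustrative sketch using Python's difflib, not the method any particular toolkit uses.

    from difflib import SequenceMatcher

    def transfer_times(aligned, canonical):
        """aligned:   list of (phone, start, end) from forced alignment
        canonical: list of phone labels predicted by the front end
        Returns the canonical sequence as [phone, start, end] with times copied
        onto phones that match; unmatched phones keep start = end = None."""
        a_labels = [p for p, _, _ in aligned]
        result = [[p, None, None] for p in canonical]
        matcher = SequenceMatcher(a=a_labels, b=canonical, autojunk=False)
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                _, start, end = aligned[block.a + k]
                result[block.b + k][1] = start
                result[block.b + k][2] = end
        return result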
Those sequences are not identical, but they're very similar, so we can make that transfer, and in doing so we can also add in the short pauses where they existed, and we can mark which vowels got reduced by the speaker when he or she said this sentence. If we looked inside the Festival utterance structure after all of this process, we'd see that some timestamps had appeared on it. That concludes our look at databases: what to put in the recording script, and how to annotate that speech once we've recorded it in the studio. Unit selection, then, is just retrieval of candidates from that labelled database, followed by a search. Now we have a full working speech synthesizer.

Hopefully you're doing the exercise at the same time. You've built that voice and listened to it, and so you're asking yourself now, "How good is that synthetic voice?" We need to evaluate that, so the topic of the next module will be evaluation. We'll think about what would be fair means of evaluating it. Of course we should listen to it ourselves: that's going to be helpful! We probably want to ask other people to listen to it, so we can get multiple listeners for a bit more reliability. Maybe we can measure some objective properties of the speech to decide how good it is? But, in general, we need to have a good think about what we want to measure and why exactly we want to do evaluation. So the questions we're going to answer in the next module are: Can we judge how good a synthetic voice is in isolation? Or is it only possible to do that by comparing to some other system that's better or worse than it? We'll answer the question of who should be listening to the speech: whether it should be us or other listeners, or indeed some algorithms (some objective measures). All of those have advantages and disadvantages, and we'll consider those. We'll think in detail about exactly what aspects of the speech we want to measure and put quantities on: Is it naturalness? Is it how intelligible the speech is? Or is there something else we can measure?