We've established the key concepts of base unit type (let's just assume that's a diphone) and of context in which those base unit types occur (that context applies to the sentences in the database and applies equally to the sentences we're going to synthesize). Because base units can occur in many possible contexts, there's a very large number of possible types that we'd like to record in our database. We've seen this Zipf-like distribution of linguistic units, which makes that very hard to achieve. If we just randomly chose text to record, we would get that Zipf-like distribution, whatever our unit type is, whatever the context is. The only thing we could do to improve coverage would be to increase the size of that database: to record more and more sentences. As we did that, the number of tokens, the number of recorded tokens, of frequent types would just increase steadily. The number of infrequent types would grow very slowly, because that's the long tail in the Zipf distribution. In other words, as we increase database size, we don't change the shape of this curve. All we do is move it upwards, so frequent types become ever more frequent and the rare types slowly become more frequent, because we're just scaling things up, potentially linearly with the database size. So it would take a very long time to cover these things in the tail. In fact, we would find that even for very large databases, almost all types (in other words, almost all units-in-context) will have no tokens at all. We know that - however large the database grows - we'll never cover all of those, because of this Zipf-like distribution. In practice, it's going to be impossible to find a database (composed of a set of sentences) that includes at least one token of every unit-in-context type. So what can we do then? Well, all we can do is try to design a script that's better than random selection, in terms of coverage and maybe some other properties. At first glance, it would appear the main design goal of our script is to improve coverage: in other words, to cover as many units-in-context as possible. What's that going to achieve? Well, it would appear to increase the chance of finding an exact match at synthesis time. In other words, finding exactly the candidate that has the properties of the target: with all of the linguistic specification being the same. Now, we certainly would increase that chance, but it would remain very unlikely. In fact, the point of unit selection is that we don't need an exact match between the specification of the candidate and the specification of the target. The target cost will quantify how bad the mismatch is, but the mismatch does not have to be zero. The target cost will rank and help us choose between various degrees of mismatch in the various linguistic features. Just as important, then, is that - for each target position - we have a list of candidates from which the join cost can choose. The longer that list, the more chance there is of finding sequences that concatenate well together. So what we're really looking for is not to cover every possible context (because that's impossible); it's to cover a wide variety of contexts, so that the target cost and the join cost can between them choose units that don't mismatch too badly in linguistic features, and join together well. We want to achieve that with the smallest possible database.
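As a rough illustration of that claim (not from the lecture), here is a small simulation, assuming the unit-in-context types follow a Zipf-like distribution: even as the number of recorded tokens grows by orders of magnitude, most types in the long tail still have zero tokens.

```python
# A hedged sketch: sample tokens from a Zipf-like distribution and count how
# many of the possible types are covered at least once. The numbers here
# (100k types, exponent 1.5) are illustrative assumptions, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
n_types = 100_000                                   # imagined inventory of units-in-context

for n_tokens in (10_000, 100_000, 1_000_000):
    tokens = rng.zipf(a=1.5, size=n_tokens)         # Zipf-distributed type indices (1, 2, ...)
    covered = np.unique(tokens[tokens <= n_types]).size
    print(f"{n_tokens:>9} tokens -> {covered / n_types:6.1%} of types have at least one token")
```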
There are a few good reasons for keeping the database size small. One is obvious: it will take time to record this thing. A side effect of taking a long time to record something is that it's harder and harder to maintain a very consistent speaking style over longer recording periods. Amateur speakers (like me, and probably you) find that very hard. In particular, when the recording sessions are split over days or weeks or maybe even months, it's very hard to maintain the same speaking effort, voice quality, and so on. Professional speakers are much better at that; that's one reason to use them. The second reason to keep the database size small is that, at least in unit selection, the runtime system will include an actual copy of that entire database: the waveforms. So if we've got to store the database in the runtime system (that we might be shipping to our customer), we don't want it to be too many gigabytes. There might be other goals that we'd like to include in our script design, not just coverage in the smallest possible script. We won't consider these in any great depth here. That's because you can explore them for yourself in the "Build your own unit selection voice" exercise. The sort of things you might include in your script design are: choosing sentences that are easy to read out loud, so that your speaker can say them fluently without too many mistakes; you might want to cover low-frequency linguistic forms such as questions; you might want to boost the coverage of phrase-final and phrase-initial units, because in long sentences they're very rare; and you might want to include some domain-specific material, if you think your final system is going to be more frequently used in a particular domain (for example, reading out the news, or reading out emails, or something really simple like telling the time). You're going to explore some of those things for yourself in the exercise, so we won't cover them much more here. What we will do is look at a typical, simple approach to script design. It's going to be an algorithm: in other words, it's automatic. It's going to start from a very large text corpus and make a script from that. In other words, we're going to choose - one at a time - sentences from a very large corpus, and accumulate them in our script. We'll stop at some point: perhaps when the script is long enough, or when we think coverage is good enough. Now, there's as much art as there is science in script design! I'm not presenting this as the only solution. Every company that makes speech synthesizers has its own script design method and probably has its own script that's evolved through many voice builds. We're going to use a simple greedy algorithm. In other words, we're going to make decisions and never backtrack. The algorithm goes like this: we take our large text corpus, and for every sentence in that corpus we give it a score. The score measures how good that sentence would be if it were added to the script. The sort of thing we might score on is, perhaps, the number of types the sentence contains that are not yet represented in our script. "Types" here would be units-in-context. We add the best sentence from the large corpus to the script, and then we iterate. This is about the simplest possible algorithm. At the end, we'll look at a few little modifications we might make to it. The algorithm requires the large corpus. That needs to come from somewhere. You might want to be very careful about where that text comes from.
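Here is a minimal sketch of that greedy loop, under some assumptions: `get_units` is a hypothetical helper (in practice it would come from the TTS front end) that returns the unit-in-context types of a sentence, and `score` is whatever scoring function we settle on.

```python
# A minimal sketch of the greedy, never-backtracking selection loop described
# above. `get_units` and `score` are hypothetical placeholders.

def greedy_select(corpus, n_sentences, get_units, score):
    """Greedily build a script of up to n_sentences from corpus."""
    script = []
    covered = set()                  # unit-in-context types already in the script
    remaining = list(corpus)
    while remaining and len(script) < n_sentences:
        # score every remaining sentence against what the script already covers
        best = max(remaining, key=lambda s: score(get_units(s), covered))
        script.append(best)          # greedy: once chosen, never reconsidered
        covered.update(get_units(best))
        remaining.remove(best)
    return script
```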
If you care about copyright, then you might want to get text for which either there is no copyright (for example, out-of-copyright material), or you might want to obtain permission from the copyright holder. Regardless of where you get the text from, it's most likely that this is written language that was never meant to be read out loud. That has consequences for readability. It might be hard to read newspaper sentences out loud. It might be hard to read sentences from old novels out loud. You're going to find that out for yourself in the exercise! Because written text was not usually intended to be read aloud, its readability might be poor. Also, the sort of variation we get in it might be unlike what we normally get in spoken language. For example, the prosodic variation might be limited. There might be very simple differences, like there being far fewer questions in newspaper text than there are in everyday speech. We've already mentioned that phrase-initial and phrase-final segments might be rather low frequency in text, because the sentences are much longer than in spoken language. We might want to correct for that in our text selection algorithm. If you're looking for a source of text to use in your own algorithm, the obvious choice is to go to Project Gutenberg, where there's a repository of out-of-copyright novels which you can scrape and use as a source of material. But be very careful, because old novels might have language that's archaic and might be very difficult for your speaker to read out loud. To get a better understanding of the issues in text selection, let's work through a rather simplified example. We'll assume we have some large corpus of text to start from. Maybe we've taken lots of novels from Project Gutenberg. The first step would be to clean up the corpus. In that text there are likely to be a lot of words for which we don't have reliable pronunciation information: for example, they're not in our dictionary and - because they might be proper names - we don't anticipate that our letter-to-sound rules will work very well. We're therefore going to need to clean up the text a bit. We might throw away all the sentences that contain words that are not in our dictionary, because we don't trust letter-to-sound rules. We might throw away all sentences that are rather long: they're hard to read out loud with reasonable prosody, and we're also much more likely to make mistakes. We might also throw away very short sentences, because their prosody is atypical. Depending on who your speaker is, you might also want to throw away sentences that you deem "hard to read". There are readability measures (beyond the scope of this course) that you could use, or we could base that on vocabulary. We could throw away sentences with very low-frequency or polysyllabic words that we think our speaker will find difficult. The goal of our text selection algorithm is simply going to be coverage of unit types-in-context. So we need to know - for all the sentences in this large text corpus - what units they contain. Now, we don't have a speaker reading these sentences out yet. We only have the text, so all we can really do is put that text through the front end of our text-to-speech system and obtain - for each sentence - the base unit sequence (for example, the sequence of diphones) and the linguistic context attached to each of those units. One way to think about the algorithm is to imagine a "wish list" that enumerates every possible type that we're looking for.
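A simple clean-up pass along those lines might look like the sketch below; the word-count thresholds are illustrative, and `lexicon` is assumed to be the set of words in your pronunciation dictionary.

```python
# A hedged sketch of the clean-up step: drop sentences containing
# out-of-dictionary words, and sentences that are very long or very short.
# Thresholds and tokenisation are illustrative assumptions only.

def clean_corpus(sentences, lexicon, min_words=5, max_words=25):
    kept = []
    for sentence in sentences:
        words = sentence.lower().split()                 # crude tokenisation for illustration
        if not (min_words <= len(words) <= max_words):
            continue                                     # too short (atypical prosody) or too long
        if any(w.strip('.,!?;:"') not in lexicon for w in words):
            continue                                     # out-of-vocabulary: don't trust letter-to-sound
        kept.append(sentence)
    return kept
```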
If we were just looking for diphones (ignoring context for a moment), we would just write out the list of all possible diphones: from this one...to this one. (Remember that silence is also treated as a phoneme in this system.) If we start adding linguistic context, that wish list will grow exponentially in size as we add more and more context features. Imagine just asking for versions of each diphone in stressed and unstressed prosodic environments. Imagine then adding all possible linguistic features and enumerating a very long wish list in this way. If you implement this algorithm, you may or may not actually construct this list, because it's going to be huge: you might not actually store it in memory. For the worked example, I'm just going to ignore linguistic context and try to get coverage of diphones alone. The method works the same for diphones-in-context, of course. Here's part of my very large text corpus, which I've put through my text-to-speech front end. It's given me a sequence of phonemes, from which I've got the sequence of diphones. This is just the first few sentences of this very large corpus. I'm going to write down my wish list. Here's my wish list: it's just the diphones. Again, in reality we might attach context to those. What I'm going to do is score each sentence for "richness". Richness is going to be defined as the number of diphones that I don't yet have in my script. When we're doing that scoring, if a sentence contains two copies of a particular unit-in-context (here, just the diphone), then only one of them will count. So we can get rid of these duplicates. What we need is a function that assigns a score to each of these sentences. We won't write that out; we'll just do this impressionistically. If we were just to count the number of types, then we would pretty much always choose long sentences, because they're more likely to contain more types. So, at the very least, we're going to control for length. Perhaps we'll normalize the number of types by the length of the sentence. Under my scoring function, it turns out that this sentence here scores the highest: it's got the most new types in the shortest sentence. So I will select that sentence and put it in my script for my speaker to read out in the recording studio later, and I'll remove it from the large corpus. Then I'll repeat that procedure. But before I carry on, I'm going to cross off all the diphones that I got with that sentence. So I'll scan through the sentence, look at all the diphones that occurred there, and cross them off my wish list. So: these diphones...and various other diphones have gone. Our wish list has got smaller, and sentences will no longer score any points for containing those diphones, because we've already got them. That sentence has been selected: it disappears from the large corpus. Then we'll score the remaining sentences. The scores will be different from the first time round, because any diphones that have already gone into the script no longer score points. Remember, that was these ones...and some other ones. The result of this algorithm is going to be a list of sentences in a specific order: in order of descending richness. So, when we record that script, we probably want to record it in that order too, so we maximize richness. And we can stop at any point: we could record a hundred, or a thousand, or ten thousand sentences, depending on how much data we wanted. The order of the sentences will be defined by the selection algorithm.
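One way to write down that richness score, assuming each sentence is represented by its list of diphones and the wish list is kept as a set, is sketched below; after selecting a sentence we cross its diphones off the wish list so they stop scoring points.

```python
# A sketch of the "richness" score from the worked example: count the new
# (not-yet-covered) diphone types in a sentence, duplicates counting once,
# and normalise by sentence length so long sentences aren't always preferred.

def richness(diphones, wish_list):
    """diphones: list of diphones in one sentence; wish_list: set of types still needed."""
    new_types = set(diphones) & wish_list        # duplicates collapse into the set
    return len(new_types) / len(diphones) if diphones else 0.0

# After selecting the highest-scoring sentence, cross its diphones off:
#   wish_list -= set(diphones_of_selected_sentence)
```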
That's neat, because we can just run our algorithm, select a very large script, and go into the studio and record for as long as (for example) we have money. I've already said there's as much art as there is science to script design, so it's perfectly reasonable to craft some extra selection criteria into your text selection algorithm. One sensible thing - that our algorithm doesn't currently do - would be to guarantee at least one token of every base unit type. That will mean we never have to back off at synthesis time. If our base unit is the diphone, then our first N sentences (our first few hundred sentences) will be designed just to get one of each diphone. After that, we'll switch to looking for diphones-in-context. When we start looking for diphones-in-context, we might actually want to go looking for the hardest-to-find units first: the rare ones. In other words, give sentences a higher score for containing low-frequency diphones-in-context. That's because we'll get the high-frequency ones along "for free" anyway. Another thing we might do is try to include some domain-specific material for one or more domains. This is to make sure we have particularly good coverage in these smaller domains, so our voice will perform particularly well on them. These domains might be decided by, for example, the particular product that our synthesizer is going into. If we just selected a script of strictly in-domain sentences, we'd expect very good performance in that domain, but we might be missing a lot of units that we need for more general-purpose synthesis. So a standard approach to getting a domain-specific script is to first use your in-domain sentences (they might be manually written, or come from some language-generation system); measure the coverage of that; then fill in the gaps with a standard text selection algorithm, run on some other large corpus. That way we get the best of both worlds: particularly good coverage in a domain, plus general-purpose coverage, so we can also say anything else. All that remains then is to go into the recording studio! Find a voice talent who can read sentences fluently with a pleasant voice, and ask that person to read each of our sentences. Then we'll have a recorded database. The next step will be to label that recorded database.
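A sketch of that "in-domain first, then fill the gaps" idea is below, under the same assumptions as the earlier sketches (`get_units` is a hypothetical front-end helper returning the unit types of a sentence).

```python
# A hedged sketch: start from the in-domain sentences, measure what they
# already cover, then greedily add general-corpus sentences that fill the gaps.

def domain_plus_general(in_domain, general_corpus, get_units, n_extra):
    script = list(in_domain)                              # manually written or generated
    covered = {u for s in script for u in get_units(s)}   # coverage of the in-domain part
    remaining = list(general_corpus)
    for _ in range(n_extra):
        if not remaining:
            break
        # pick the sentence that adds the most not-yet-covered unit types
        best = max(remaining, key=lambda s: len(set(get_units(s)) - covered))
        script.append(best)
        covered.update(get_units(best))
        remaining.remove(best)
    return script
```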
Script design
We can design the recording script in a way that should be better than randomly selected text, in terms of coverage and other desirable properties.