Key concepts

The base units (e.g., diphones) can occur in many different contexts. This makes it difficult to record a database that covers all possible units-in-context.

00:05-00:14 All modern methods of speech synthesis - including unit selection that we've already covered - rely on a fairly substantial database of recorded natural speech.
00:14-00:17 So now we need to think about what's going to go into that database.
00:17-00:25 What should we record? Before proceeding, let's do the usual check.
00:25-00:35 What should you already know? You should certainly know how the front end works, and in particular that it produces a thing called a linguistic specification.
00:35-00:37 You need to know what's in that linguistic specification.
00:37-01:29 In other words, what features the front-end is able to provide: Phonetic features, such as the current sound, preceding sound, following sound; Prosodic features; and also what we might be able to easily derive from those things, such as position - where we are in the current prosodic phrase, for example. Given that linguistic specification, you should also understand that we're going to select units at synthesis time from the most similar context we can find to the target. We're going to combine that similarity measure with a join cost to enable smooth concatenation. It's the job of the target cost to measure similarity (or distance) between units in the database - which we call candidates - and the target specification. Our target cost function is going to rank units from the database: it's going to rank candidates.
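To make those ideas concrete, here is a minimal sketch of an Independent Feature Formulation style target cost: a linguistic specification represented as a set of features, and candidates ranked by a weighted sum of feature mismatches. The feature names, values and weights are invented purely for illustration; a real front-end defines its own feature set.

```python
# Minimal sketch of an IFF-style target cost (illustrative feature names,
# values and weights only - not any particular system's feature set).

# The target specification for one unit position, as the front-end might provide it.
target = {
    "left_phone": "k",           # preceding sound
    "right_phone": "t",          # following sound
    "stressed": True,            # a prosodic feature
    "phrase_position": "final",  # a positional feature derived from the phrase
}

# Candidates retrieved from the database: same base unit type, different contexts.
candidates = [
    {"left_phone": "k", "right_phone": "d", "stressed": True,  "phrase_position": "medial"},
    {"left_phone": "g", "right_phone": "t", "stressed": False, "phrase_position": "final"},
]

# One weight per feature, saying how much a mismatch in that feature matters.
weights = {"left_phone": 1.0, "right_phone": 1.0, "stressed": 0.5, "phrase_position": 0.5}

def target_cost(target, candidate):
    """Weighted sum of 0/1 mismatches between target and candidate contexts."""
    return sum(w * (target[f] != candidate[f]) for f, w in weights.items())

# Rank the candidates: the lower the cost, the more similar the context.
for candidate in sorted(candidates, key=lambda c: target_cost(target, c)):
    print(target_cost(target, candidate), candidate)
```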
01:29-01:34 The database is going to be fairly large, and we're going to need to label it.
01:34-01:38 At the very least, we need to know where every unit starts and finishes.
01:38-02:14 So we need some time alignment. Because the database is large, we might not want to do that by hand. There might be some other good reasons not to do it by hand. We're going to borrow some techniques from Automatic Speech Recognition. So, you need to know just the basics of speech recognition: using simple Hidden Markov Models, for example context-independent models of phonemes; very simple language modelling using finite state models; and how decoding works, in other words, how we choose the best path through our network of language model and acoustic model.
02:14-02:24 Before starting the discussion of what's going into the database, let's just establish a few key concepts to make sure we have our terminology right.
02:24-02:28 The first key concept is going to be that of the "base unit type" such as the diphone.
02:28-03:07 The second key concept is that these base units (for example, diphones) occur in very varied linguistic contexts. We might want to cover as many of those as possible in the database. That leads us into the third concept, which is coverage: how many of these unit-types-in-linguistic-context we could reasonably get into a database of finite size. Looking at each of those key concepts in a little bit of detail then: The base unit type is the thing that our unit selection engine uses. Most commonly that type is going to be the diphone. It could also be the half-phone.
03:07-03:26 In the modules about unit selection, we also talked about using heterogeneous unit types: things of variable linguistic size. We also said that we don't really need to think about variable-size units. We can use fixed-size units (such as the diphone) and the zero join cost trick, to effectively get larger units at runtime.
03:26-03:30 So, from now on, let's just assume that our base unit type is the diphone.
03:30-03:36 There's going to be a relatively small number of types of base unit.
03:36-03:59 It's certainly going to be finite - a closed set - and it's maybe going to be of the order of thousands: one or two thousand types. In our unit selection engine, when we retrieve candidates from the database before choosing amongst them using the search procedure, we look for some match between the target and the candidate.
03:59-04:06 At that retrieval stage, the only thing we do is to strictly match the base unit type.
04:06-04:15 So, if the target is of one particular diphone we only go and get candidates of that exact type of diphone, from all the different linguistic contexts.
04:15-04:20 The only exception to that would be if we've made a mistake designing our database.
04:20-04:42 Then we might have to go and find some similar types of diphones, if we have no examples at all of a particular diphone. The consequence of insisting on this strict match is that our target cost does not need to query the base unit type: they all exactly match. The candidates for a particular target position are all of exactly matching diphone type.
04:42-04:52 All the target cost needs to do is to query the context in which each of the candidates occurs, and measure the mismatch between that and the target specification.
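As a sketch of what that strict match means in practice - with an invented data layout, just for illustration - the database can be indexed by base unit type, so retrieval is an exact lookup and the target cost only ever sees each candidate's context:

```python
# Sketch: a database indexed by diphone type, so retrieval is an exact lookup.
# The diphone names, contexts and waveform references are invented placeholders.
from collections import defaultdict

# Each entry: the linguistic context the unit was recorded in, plus a reference
# to where its waveform lives in the recorded speech.
database = defaultdict(list)
database["k-ae"].append(({"left_phone": "s", "stressed": True},  "utt0031 @ 1.245s"))
database["k-ae"].append(({"left_phone": "#", "stressed": False}, "utt0788 @ 0.112s"))

def retrieve(diphone_type):
    """Strictly match the base unit type: return every candidate of exactly this
    diphone type, from all the different linguistic contexts it occurs in."""
    candidates = database[diphone_type]
    if not candidates:
        # Only happens if the database was badly designed: we would then have to
        # back off to similar diphone types (not implemented in this sketch).
        raise LookupError(f"no candidates at all for {diphone_type}")
    return candidates

for context, location in retrieve("k-ae"):
    print(location, context)  # only 'context' needs to be scored by the target cost
```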
04:52-05:01 Given the base unit type then, the second key concept is that these base units occur in a natural context. They're in sentences, typically.
05:01-05:11 Now, the context is potentially unbounded: it certainly spans the sentence in which we find the unit and we may even want to consider features beyond the sentence.
05:11-05:23 The number of linguistic features that we consider to be part of the context specification is also unlimited. It's whatever the front-end might produce and whatever we might derive from that, such as these positional things.
05:23-05:54 So the context certainly includes the phonetic context, the prosodic environment, and these derived positional features, and anything else that we think might be interesting to derive from our front-end's linguistic specification. The exact specification of context depends entirely on what our front-end can deliver and what our target cost is going to take into account, whether it's an Independent Feature Formulation or an Acoustic Space Formulation.
05:54-06:01 For the purposes of designing our database, we're going to keep things a little bit simple and we're just going to consider linguistic features.
06:01-06:26 We're just going to assume that our target cost is of the most simple IFF type when designing our database. Although the context in which a speech unit occurs is essentially unbounded (there are an infinite number of possible contexts because there are an infinite number of things that a person might say), in practice it will be finite because we will limit the set of linguistic features that we consider.
06:26-06:30 We'll probably stick to features that are within the sentence.
06:30-06:47 Nevertheless, the number of possible contexts is still very, very large and that's going to be a problem. Just think about the number of permutations of values of those linguistic features: it's just very, very large.
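To get a feel for the scale, here is a back-of-the-envelope count; the inventory sizes are invented purely for illustration, not taken from any particular system.

```python
# Back-of-the-envelope count of unit-in-context types, with made-up inventory sizes.
diphone_types    = 1500  # "one or two thousand" base unit types
left_phones      = 45    # possible preceding sounds
right_phones    = 45     # possible following sounds
stress_values    = 2     # stressed / unstressed
phrase_positions = 3     # initial / medial / final

units_in_context = (diphone_types * left_phones * right_phones
                    * stress_values * phrase_positions)
print(f"{units_in_context:,}")  # roughly 18 million types, before adding any
                                # further prosodic or positional features
```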
06:47-07:04 If we would like to build a database of speech which literally contains every possible speech base unit type (for example each diphone - maybe there are one to two thousand different diphone types) each occurring in every possible linguistic context, that list will be very, very long.
07:04-07:30 Even if we limit the scope of context to just the preceding sound, the following sound, and some basic prosodic and positional features, this list will still be very, very long. We have to ask ourselves: Would it even be possible to record one example of every unit-in-context? Let's just point out a little shorthand in the language I'm using here. We've got "base unit types" such as diphones.
07:30-07:43 We've got "context" which is the combination of linguistic features in the environment of this unit. So we should be talking about "base unit types in linguistic contexts". That's a rather cumbersome phrase!
07:43-07:56 I am going to shorten that. I'm going to say "unit-in-context" for the remainder of this module. When I say "unit-in-context" I'm talking about all the different diphones in all the different linguistic contexts.
07:56-08:09 Natural language has a very interesting property that is one of the reasons it's difficult to cover all of these contexts: the types are distributed very unevenly.
08:09-08:29 Whatever linguistic unit we think of - whether it's the phoneme, or indeed the letter or the word - very few types are very frequent (and for the purposes of building the database, those types are units-in-context).
08:29-08:36 Think about words: you know that some words - such as this one here - are very, very frequent.
08:36-08:57 Other words are very infrequent. That's true about almost any linguistic unit type. It's certainly going to be true about our units-in-context. The flipside of that is that there are many, many types that are individually very, very rare, but there's a very large number of such types. Taken together, they are frequent.
08:57-09:04 So we get this interesting property: that rare events are very large in number.
09:04-09:14 In other words, in any one particular sentence that we might have to synthesize at runtime, there's a very high chance that we'll need at least one rare type.
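A quick calculation shows why; the numbers below are made up purely to illustrate the point.

```python
# Illustrative only: suppose "rare" types collectively account for 10% of all
# unit tokens, and a sentence to be synthesised contains 40 diphone units.
p_rare_per_unit = 0.10
units_per_sentence = 40

# Probability that at least one unit in the sentence is of a rare type.
p_at_least_one_rare = 1 - (1 - p_rare_per_unit) ** units_per_sentence
print(f"{p_at_least_one_rare:.3f}")  # about 0.985 - nearly every sentence
```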
09:14-09:17 We've already come across that problem when building the front end.
09:17-09:20 We know that we keep coming across new words all the time.
09:20-09:40 These new words are the rare events, but taken together they're very frequent: they happen all the time. Let's have a very simple practical demonstration of this distribution of types being very uneven. Here's an exercise for you.
09:40-09:44 Go and do this on your own. Maybe you could write it in Python.
09:44-09:59 Use any data you want. I'm going to do it on the shell, because I'm old fashioned. Let's convince ourselves that linguistic units of various types have got this Zipf-like distribution. Let's take some random text.
09:59-10:02 I've downloaded something from the British National Corpus.
10:02-10:06 I'm not sure what it is, because it doesn't matter!
10:06-10:15 Let's just have a look at that: it's just some random text document that I found.
10:15-10:42 I'm going to use the letter as the unit. I'm going to plot the distribution of letter frequencies. In other words, I'm going to count how many times each letter occurs in this document and then I'm going to sort them by their frequency of occurrence. We'll see that those numbers have a Zipf-like distribution. I'll take my document, and the first thing I'm going to do is downcase everything, so I don't care about case.
10:42-10:46 Here's an old-fashioned way of doing that: "translate" ('tr') it.
10:46-10:52 We take all the uppercase characters and translate them individually to lowercase.
10:52-11:14 Let's check that bit of the pipeline works. Everything there is lowercase - it has all been downcased, as you can see. I'm now going to pull out individual characters.
11:14-11:34 I'm only going to count the characters a-to-z. I'm going to ignore numbers and punctuation for this exercise. So we'll grep, and we'll print only the matching part of the pattern, and we'll grep for the pattern "any individual letter in the range a-to-z lowercase". Let's check that bit of the pipeline works.
11:34-11:40 That's just printing the document out letter by letter.
11:40-11:43 I'm now going to count how often each letter occurs.
11:43-11:47 There's a nice way of doing that sort of thing on the command line.
11:47-11:53 First we sort them into order, and then we'll put them through a tool called 'uniq'.
11:53-11:58 uniq finds consecutive lines that are identical and just counts how many times they occur.
11:58-12:03 On its own, it will just print out one copy of each set of duplicate lines.
12:03-12:14 We can also ask it to print the count out. Let's see if that works ... it just takes a moment to run because we're going through this big document.
12:14-12:17 So there we now have each letter and the number of times it occurs.
12:17-12:28 There's our distribution. It's a little bit hard to read like that because it's ordered by letter and not by frequency, so let's sort it by frequency.
12:28-12:40 I'm going to 'sort', and sort will just operate on the leftmost field, which here is conveniently the number, and we'll sort it numerically, not alphabetically.
12:40-12:46 I'm going to reverse the sort, so I get the most frequent thing at the top.
12:46-12:49 And there is our kind-of classical Zipf-like distribution.
12:49-13:29 We can see that there are a few letters up here that are accounting for a lot of the frequency: in other words, much of the document. There's a long tail of letters down here that are rather low in frequency: much, much lower; an order of magnitude lower than those more frequent ones. If we were to plot those numbers (and I'll let you do that for yourself, maybe in a spreadsheet) we'd see this Zipf-like decaying distribution. So the Zipf-like distribution holds even for letters: even though there are only 26 types, we still see that decaying shape.
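If you'd rather follow the earlier suggestion and do this in Python instead of on the shell, here is a minimal sketch; document.txt is just a placeholder for whatever text you downloaded, and the plot at the end needs matplotlib.

```python
# Count letter frequencies in a text document and look at their distribution.
import re
from collections import Counter

with open("document.txt", encoding="utf-8") as f:
    text = f.read().lower()           # downcase everything, like 'tr A-Z a-z'

letters = re.findall(r"[a-z]", text)  # keep only the letters a-z, like 'grep -o'
counts = Counter(letters)

# Most frequent letter first, like 'sort | uniq -c | sort -rn'.
for letter, count in counts.most_common():
    print(count, letter)

# Plot the sorted counts to see the decaying, Zipf-like shape (needs matplotlib).
import matplotlib.pyplot as plt
plt.plot(sorted(counts.values(), reverse=True), marker="o")
plt.xlabel("rank")
plt.ylabel("frequency")
plt.show()
```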
13:29-13:42 If we do this for linguistic objects with more and more types, we'll get longer and longer tails, until we end up looking at open-class types such as words, where we'll get a very, very long tail of things that happen just once.
13:42-13:50 Let's do one more example. Let's do it with speech this time: transcribed speech. We'll look at the distribution of speech sounds.
13:50-13:53 I've got a directory full of label files of transcribed speech.
13:53-13:56 It doesn't really matter where it's come from at this point.
13:56-14:12 Let's look at one of those: they're sequences of phonemes, labelling the phones in a spoken utterance, with timestamps and so on. I'm going to pull out the phoneme labels and do the same thing that I did with the letters.
14:12-14:16 So again I'm going to be old-school: just do this directly on the command line.
14:16-14:32 If you're not comfortable with that, do it in Python or whatever your favourite tool is! There are many different ways to do this kind of thing. The first thing I'm going to do is pull out the labels. I know that these labels are always one or two characters long. So let's 'grep'.
14:32-14:36 I use "extended grep" ('egrep') - it's a bit more powerful.
14:36-14:39 I don't want to print out the filenames that are matching.
14:39-14:43 Again, I just want to print out the part of the file that matches this expression.
14:43-14:58 I'm going to look for lowercase letters and I know that they should occur once or twice: single letters or pairs of letters are what the phoneme labels look like.
14:58-15:02 I also know that they happen at the end of a line.
15:02-15:05 I'm going to do that for all of my labelled speech files.
15:05-15:11 Let's just make sure that bit of the pattern works. Yes, that's pulling out all of those.
15:11-15:32 We'll do the same thing that we did for the letters: sort it, 'uniq -c' it, and order it by frequency in reverse order. Let's run that.
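The same count in Python, for anyone following along that way; the directory name, the .lab extension and the one-or-two-letter label format are assumptions about how the label files are laid out.

```python
# Count phoneme label frequencies across a directory of label files.
# Directory name, extension and label format are assumptions about the data.
import glob
import re
from collections import Counter

counts = Counter()
for path in glob.glob("labels/*.lab"):
    with open(path) as f:
        for line in f:
            # A one- or two-letter lowercase label at the end of the line,
            # mirroring the egrep pattern used in the video.
            match = re.search(r"[a-z]{1,2}$", line.strip())
            if match:
                counts[match.group()] += 1

# Most frequent phoneme first: the same decaying pattern as the letters.
for phone, count in counts.most_common():
    print(count, phone)
```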
15:32-15:35 Again we see the same sort of pattern we saw with letters.
15:35-15:40 If we did this with words, or with any other unit, we'd get the same sort of pattern.
15:40-15:42 There are a few types up here that are very frequent.
15:42-15:51 There's a long tail of types down here that are much less frequent; again, at least an order of magnitude less frequent, and possibly more than that.
15:51-15:54 Because this is a closed set, we don't get a very long tail.
15:54-15:57 You should go and try this for yourself with a much bigger set of types.
15:57-16:26 I suggest doing it with words or with linguistic unit-types-in-context: perhaps something like triphones. But, even for just context-independent phonemes, there are a few that are very low frequency. I'm using over a thousand sentences of transcribed speech here, and in those thousand sentences there are a couple of phonemes that occurred fewer than a hundred times, regardless of the context.
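Following that suggestion, here is one way to count a very simple kind of unit-in-context - each phoneme together with its immediate neighbours, a triphone type - reusing the label-reading idea above; again, the file layout is an assumption.

```python
# Count phoneme-in-context (triphone) types: each phoneme with its left and
# right neighbour. Expect a much longer tail than for phonemes on their own.
import glob
import re
from collections import Counter

triphones = Counter()
for path in glob.glob("labels/*.lab"):
    phones = []
    with open(path) as f:
        for line in f:
            match = re.search(r"[a-z]{1,2}$", line.strip())
            if match:
                phones.append(match.group())
    for left, centre, right in zip(phones, phones[1:], phones[2:]):
        triphones[f"{left}-{centre}+{right}"] += 1

print(len(triphones), "distinct triphone types")
print(sum(1 for c in triphones.values() if c == 1), "of them occur exactly once")
```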
16:26-16:31 That's going to be one of the main challenges in creating our database.
16:31-16:51 Plotting those distributions of frequencies-of-types - ordering the types from most frequent to least frequent and plotting each type's frequency on the vertical axis - we always tend to get this sort of shape: a decaying curve.
16:51-16:56 Now, you'll often see this curve called a Zipf distribution.
16:56-17:01 Strictly, a Zipf distribution has a particular exact equation - the frequency of the r-th most frequent type is proportional to 1/r (or, more generally, to 1/r^s for some exponent s close to 1) - so it's a particular sort of distribution.
17:01-17:05 Of course, real data doesn't exactly obey these distributions.
17:05-17:07 It is just somewhat similar, and has the same sort of properties.
17:07-17:11 In particular, it has this Large Number of Rare Events.
17:11-17:18 So really we should be talking about a Zipf-like distribution, not exactly a Zipf distribution.
