Target and candidate units

We use the linguistic specification from the front end to define a target unit sequence. Then, we find all potential candidate units in the database.

00:04-00:30 Now that we have those key concepts established, let's work our way up to a complete description of unit selection. The first thing we're going to construct is a target unit sequence. That's a sequence of the ideal units that we would like to be able to find in the database. These target units are going to be abstract linguistic things. They don't have waveforms.
00:30-00:52 When we talk about the search a bit later on, we'll find that we want the information about the target, about the candidates from the database, and also about the way that they join, to all be stored locally. That's going to reduce the complexity of the search problem dramatically. So, let's do that straight away.
00:52-00:58 What the front-end gives us is this structured linguistic representation.
00:58-01:03 It has connections between the different tiers. It has structure.
01:03-01:09 For example, it might have some tree shapes, or some bracketing like this structure.
01:09-01:14 We're going to attach all the information to the phoneme tier.
01:14-01:25 We're going to produce a flat representation, where all of that higher-level structure - such as part of speech - is attached down on to the pronunciation.
01:25-01:42 What we've got, effectively, is a string of segments (that's just a fancy word for phones) with contextual information attached. So: context-dependent phones.
01:42-01:46 One of them might be like that, and it's part of a sequence.
01:46-01:50 In this example, my base unit type is the phoneme.
01:50-02:06 A real system probably wouldn't do that. I'm just using it to make the diagrams simpler to draw and simpler for you to understand. I'm going to move that to the top of the page, because we need some room to put the candidates on this diagram later.
02:06-02:44 What we have in this target unit sequence is a linear string of base unit types, and each of those is annotated with all of its linguistic context: everything that we think might be important for affecting the way that this particular phoneme is pronounced, in this particular environment. It's essential to understand that this information is stored inside the target unit specification. We do not need to refer to its context to read off that linguistic specification. It's local.
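To make that flattening step concrete, here is a minimal sketch in Python. The names and the toy structured input are invented for illustration (they don't come from any particular toolkit); the point is that higher-level features such as part of speech get copied down, so each target unit stores its linguistic context locally and never needs to consult its neighbours.

```python
# A minimal sketch (invented names, toy data) of flattening a structured
# linguistic specification onto the phone tier.
from dataclasses import dataclass, field

@dataclass
class TargetUnit:
    phone: str                                     # base unit type (phoneme here, for simplicity)
    features: dict = field(default_factory=dict)   # flattened linguistic context

def flatten(words):
    """Turn a (very simplified) structured specification into a flat
    sequence of TargetUnits, one per phone."""
    targets = []
    for word in words:
        for syllable in word["syllables"]:
            for phone in syllable["phones"]:
                targets.append(TargetUnit(
                    phone=phone,
                    features={
                        "word": word["word"],
                        "pos": word["pos"],        # part of speech, copied down
                        "stress": syllable["stress"],
                    },
                ))
    return targets

# Toy structured input: two words, each with a part of speech and syllable structure.
utterance = [
    {"word": "the", "pos": "DET",
     "syllables": [{"stress": 0, "phones": ["dh", "@"]}]},
    {"word": "cat", "pos": "NOUN",
     "syllables": [{"stress": 1, "phones": ["k", "a", "t"]}]},
]

for target in flatten(utterance):
    print(target.phone, target.features)
```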
02:44-03:00 I'm just going to repeat again that the base unit type in this diagram is the phoneme - just for simplicity. We could build a system like that, though we wouldn't expect it to work that well. In reality we'll probably use diphones.
03:00-03:03 That is going to make the diagram look a bit messy.
03:03-03:20 So we'll just pretend that the whole phone is the acoustic unit: so the base unit type is the phoneme. There's our target unit sequence, and what we'd like to do now is go and find candidate waveform fragments to render that utterance.
03:20-03:26 We're going to get those candidate units from a pre-recorded database.
03:26-03:29 The full details of the database are not yet clear to us.
03:29-03:40 That's for a good reason: we don't know exactly what we need to put in that database yet, because that all depends on how we're going to select units from it.
03:40-03:51 For every target unit position - such as this one - we're going to go to the database and retrieve all the candidates that match the base unit type.
03:51-04:07 So we'll pull out all of the waveform fragments that have been labelled, in this case, with the phoneme /@/. Here's one candidate (the first one we found), and remember that the candidates are waveform fragments.
04:07-04:13 They're also going to be annotated with the same linguistic specification as the target.
04:13-04:19 That waveform's what we're going to concatenate eventually to produce speech.
04:19-04:48 In general (if we've designed our database well), we'll have multiple candidates for each target position. So we can go off and fetch more from the database: we'll get all of them in fact. I've only got a tiny database in this toy example, so I just found 5. In general, in a big system, we might find hundreds or thousands for some of the more common unit types.
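Here is a toy sketch of that retrieval step, again with invented names and a hand-made in-memory database: the database is indexed by base unit type, and every matching unit comes back with a reference to its waveform fragment and its stored linguistic features.

```python
# A toy sketch of candidate retrieval: index the recorded database by base
# unit type (the phone label here), then pull back every matching unit.
from collections import defaultdict

def build_index(database_units):
    """Group database units by their phone label."""
    index = defaultdict(list)
    for unit in database_units:
        index[unit["phone"]].append(unit)
    return index

def candidates_for(phone, index):
    """All database units whose base unit type matches the target's exactly."""
    return index[phone]

# Hypothetical database entries: each carries a pointer to its waveform
# fragment plus the linguistic features inherited from its source utterance.
database_units = [
    {"phone": "@", "waveform": "utt003_0.210-0.265.wav", "features": {"pos": "DET", "stress": 0}},
    {"phone": "@", "waveform": "utt417_1.030-1.090.wav", "features": {"pos": "NOUN", "stress": 0}},
    {"phone": "k", "waveform": "utt052_0.500-0.570.wav", "features": {"pos": "NOUN", "stress": 1}},
]

index = build_index(database_units)
print(len(candidates_for("@", index)))   # 2 candidates for an /@/ target position
```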
04:48-04:51 We're going to repeat that for all of the target positions.
04:51-05:05 Let's do the next one. Now, it seems a bit odd to treat silence as a recorded unit here, but remember in the case of diphones we're just going to treat silence as if it was another segment type - another phoneme.
05:05-05:11 So we can have silence-to-speech and speech-to-silence diphones, just like any other diphone.
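A small illustrative sketch (not from the video) of turning a phone string into diphone names shows how silence simply becomes another segment, so the silence-to-speech and speech-to-silence diphones fall out automatically:

```python
# Treating silence ("sil") as just another segment: pair up adjacent phones.
def phones_to_diphones(phones):
    """Turn a phone sequence into a sequence of diphone names."""
    return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]

print(phones_to_diphones(["sil", "dh", "@", "k", "a", "t", "sil"]))
# ['sil-dh', 'dh-@', '@-k', 'k-a', 'a-t', 't-sil']
```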
05:11-05:19 We'll go off now and get candidates for all of the other target positions.
05:19-05:46 At this stage, I haven't applied any selection criteria at all, except that we're insisting on an exact match between the base unit type of each target and the candidates that are available to synthesize that part of the utterance. That implies that our database needs to contain, at a minimum, one recording of every base unit type.
05:46-05:50 That should be pretty easy to design into any reasonable size database.
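A quick sketch of checking that minimal requirement, assuming we have the base unit inventory and the database unit labels as simple lists of strings (both hypothetical here):

```python
# Check that every base unit type in the inventory has at least one
# recorded candidate in the database.
def missing_unit_types(inventory, database_labels):
    """Return the base unit types with no recording at all."""
    return sorted(set(inventory) - set(database_labels))

inventory = ["sil", "dh", "@", "k", "a", "t"]
database_labels = ["sil", "dh", "@", "k", "a"]      # toy example: "t" was never recorded
print(missing_unit_types(inventory, database_labels))   # ['t']
```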
05:50-06:09 Right, where are we at this point? Let's orient ourselves in the bigger picture again. We have run the front end and got a linguistic specification of the complete utterance. We've attached the linguistic specification down on to the segment level - on to the pronunciation level.
06:09-06:14 That's given us a linear string: a sequence of target units.
06:14-06:18 Those target units are each annotated with linguistic features.
06:18-06:34 They do not yet have waveforms. We're going to find - for each target - a single candidate from the database. So far, all we've managed to do is retrieve all possible candidates from the database, and we've just matched on the base unit type.
06:34-06:50 So what remains to be done is to choose amongst the multiple candidates for each target position, so that we end up with a sequence of candidates that we can concatenate to produce output speech. We're going to need some principle on which to select from all the different possible sequences of candidates.
06:50-06:54 Of course, what we want is the one that sounds the best.
06:54-07:07 We're going to have to formalize and define what we mean by "best sounding", quantify that, and then come up with an algorithm to find the best sequence.
07:07-07:29 It's important to remember at all times that the linguistic features are locally attached to each target and each candidate unit. Specifically, we don't need to look at the neighbours. That's going to be particularly important for the candidates, because for different sequences of candidate units their neighbours might change. That will not change their linguistic features.
07:29-07:49 The linguistic features are determined by the source utterance in the recorded database where that candidate came from. The next steps are to come up with some function that quantifies "best sounding", and then to search for the sequence of candidates that optimizes that function.
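As a rough preview of where that is heading (not covered in this video), one way to set up that search is as dynamic programming over the candidate lists, using only the locally stored information about each target, each candidate, and each potential join. The cost functions below are deliberately empty placeholders: defining them properly is exactly the "quantify best sounding" step still to come.

```python
# A sketch of searching the candidate lists for the lowest-cost sequence.
# target_cost and join_cost are placeholders, not real definitions.
def target_cost(target, candidate):
    return 0.0   # placeholder: mismatch between target and candidate features

def join_cost(candidate_a, candidate_b):
    return 0.0   # placeholder: how badly two waveform fragments concatenate

def best_sequence(targets, candidates):
    """Dynamic programming over one candidate list per target position."""
    # best[i][j] = lowest total cost of any path ending in candidates[i][j]
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptrs = [], []
        for c in candidates[i]:
            costs = [best[i - 1][k] + join_cost(prev, c)
                     for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + target_cost(targets[i], c))
            ptrs.append(k_best)
        best.append(row)
        back.append(ptrs)
    # Trace back the cheapest complete path, one candidate per target position.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```

With real target and join costs plugged in, this returns one candidate per target position, ready to concatenate; with the placeholders above it simply picks an arbitrary zero-cost path.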
