Module 3 – unit selection target cost functions

The target cost is critical to choosing an appropriate unit sequence. Several different forms are possible, using linguistic features, or acoustic properties, or a combination of both.

This module continues the topic of unit selection speech synthesis, and here we add more detail about the target cost. Before starting this module, make sure you have a good general understanding of the motivations behind unit selection speech synthesis, provided in the previous module.

You should also already have a reasonable understanding of:

  • at least one form of regression model that can predict a continuous value given discrete and/or continuous input features: for example, a Regression Tree or a Neural Network
  • how to compare two vectors: for example, Euclidean distance

Download the slides for the module 3 videos

Total video to watch in this module: 40 minutes

This is the simplest form of target cost function, because no prediction of any acoustic properties is involved.

00:0400:46 Now we've got a complete picture of how unit selection works, we can start to look in more detail at some of the most important components. The first thing we're going to look at is the target cost function. We're going to use Taylor's terminology here, to be consistent with his book. We're going to first look at the simplest way we could configure the target cost function. That's to calculate the cost as a weighted sum of mismatches in linguistic features. When Taylor says Independent Feature Formulation, he doesn't mean that the linguistic features are completely independent of each other in a statistical sense. What he's saying is that - in the target cost computation - the features are all considered independently.
00:4600:52 A mismatch in one feature doesn't interact with mismatches in other features.
00:5200:57 That makes the calculation really simple, but it is a weakness.
00:5701:11 The source of that weakness is the sparsity of these linguistic features, due to the extremely large number of permutations of possible values. Before carrying on, make sure that you understand the general principles of unit selection from the previous videos.
01:1101:40 You need to know that unit selection is basically about selecting waveform fragments from a database of pre-recorded natural speech. Obviously, that speech is going to have to be annotated so we can find those units. We haven't said much about that yet, because it's going to come later. The selection of candidates is based on two costs: a target cost function, which we're going to talk a lot more about now, and a join cost function, which calculates the mismatch across concatenation points (across the joins).
01:4001:48 Because of the join cost, the selection of one candidate depends on the preceding and following candidates, all the way to the ends of the utterance.
01:4801:51 Therefore, to minimize the total cost, we need to conduct a search.
01:5102:03 We've already talked about that. So let's get into those details about the target cost: It's measuring mismatch between a target and a candidate for that target position.
02:0302:06 We need to decide how that mismatch could be measured.
02:0602:10 We'll start with measuring that mismatch in the simplest possible way.
02:1002:20 We can call it simple because it's going to use things we already have from our front end. What we already have, of course, is the linguistic specification of the targets, and that comprises a set of linguistic features.
02:2002:34 We've already described how we can essentially flatten those on to the segment (on to the pronunciation level). So, what we're dealing with then is a sequence of pronunciation units: phonemes, or maybe diphones.
02:3403:02 Each of them has a specification attached to it, so it knows the context in which it appears. That's true in both the target sequence and for each individual candidate, because the candidates came from real recorded sentences, where we also knew the full linguistic specification. So we know the same things for the target and for each of the candidates. The features will be the same because they'll be produced in the same way. It's a simple count of how many don't match.
03:0203:10 The motivation for that should be obvious. Ideally - although it doesn't happen very often - we would like to find exactly-matching candidates.
03:1003:31 Those exactly-matching candidates will have a cost of zero: there'll be no mismatch (a sum of zeros). The more mismatched the context is between the candidate and target, the higher the cost. If the mismatch is in terms of linguistic features, we can just sum up the individual mismatches.
03:3103:53 Always remember that the target cost function (like the join cost function) is computing a cost, and that cost is only a prediction of how bad this candidate might sound if we were to use it in this target position. The advantage of this Independent Feature Formulation type of target cost function is that it works with things we already know.
03:5304:01 For every target, the front end text processor has provided us with a linguistic specification.
04:0104:12 For every candidate that we are considering for that target position, we also know the same linguistic specification. Now, precisely how we know that, we'll cover in the module on the database, a little bit later.
04:1204:17 So, we know the same things for every target and for every candidate.
04:1704:19 We can just make a direct comparison between them.
04:1904:29 So, let's make that completely clear. Let's focus in on one particular target position, one particular candidate that we're considering for that target.
04:2904:34 We'll make a direct comparison between the two in terms of their linguistic specification.
04:3404:38 That's going to include things like their phonetic context.
04:3804:49 So we know the context in which this appears - it's actually this left and right phonetic context - although remember it can be attached locally, because this sequence is constant.
04:4904:51 For this candidate, we also know the same things.
04:5104:57 We know the phonetic context which it was extracted from in its source sentence.
04:5705:05 It's not this context necessarily, it's the context of the natural sentence it came from.
05:0505:16 That context is described as a set of separate linguistic features: phonetic context, perhaps stress, position-in-syllable, position-in-word, position-in-phrase, ...
05:1605:21 anything that our front-end text processor can generate for us at synthesis time.
05:2105:40 For example, imagine that the candidate that we're considering here (we're measuring the target cost for) actually occurred in the natural sentence "A car." So we know, for example, that it was phrase-initial and the target here is also phrase-initial.
05:4005:44 We know that it was word-final - there's a word boundary here.
05:4406:10 We also know that the target position is word-final. And of course we know the phonetic context: this candidate came after a silence, whereas for the target position we want something that comes after "the": there's a mismatch. We know that this candidate came from before a [k], and we also know that the target comes before a [k]: that's a match.
06:1006:15 So: left-phonetic-context mismatches, right-phonetic-context matches.
06:1506:18 We're just going to sum up penalties for all of those mismatches.
06:1806:29 We should know enough phonetics to know that some linguistic contexts have a bigger effect on sound - and more importantly on perception of that sound - than others.
06:2906:35 So, we need to capture that difference in importance between the different features.
06:3506:46 The simplest form of the Independent Feature Formulation target cost considers all the features - it considers them to be independent - and it just sums up the number of mismatches.
06:4607:09 The only way of weighting one feature against another is to apply these weights as we sum up those mismatches. So, for example, in Festival's multisyn unit selection module, these are the weights. We can see, for example, that a mismatch in left-phonetic-context incurs a slightly higher penalty than a mismatch in right-phonetic-context.
07:0907:21 That's capturing our knowledge of co-articulation: that left context has a stronger effect on the current sound than the right context. Where do these weights come from?
07:2107:27 Well they're set by hand, by listening to a lot of synthetic speech and tuning the weights.
07:2707:31 That's quite hard to do; that's obviously a very skilled thing.
07:3107:35 But currently that's the best method for picking those weights.
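As a compact way of writing down the weighted sum of mismatches described above, here is a sketch in notation that is assumed purely for illustration (the symbols $T$, $t$, $c$, $w_j$, $f_j$ and $\delta_j$ do not come from the video):

```latex
T(t, c) = \sum_{j=1}^{J} w_j \, \delta_j(t, c),
\qquad
\delta_j(t, c) =
\begin{cases}
0 & \text{if } f_j(t) = f_j(c) \\
1 & \text{otherwise}
\end{cases}
```

Here $t$ is the target, $c$ is a candidate, $f_j$ is the value of the $j$-th linguistic feature and $w_j$ is its hand-tuned weight. An exactly-matching candidate gives $T(t, c) = 0$.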
07:3507:44 Festival has a couple of special things in its target cost that aren't really part of the target cost itself: it's just a convenient way of implementing something.
07:4407:47 They're there to detect problems with the database.
07:4708:00 We're going to come back to that when we talk about the database, and we can see where these pseudo-features come from. They're to do with the automatic labelling of the database, in fact. Don't worry about these for now.
08:0008:04 Concentrate on these features: phonetic context and the prosodic context.
08:0408:11 Those are the ones produced by the front end and those are the ones used to choose between competing candidates from different linguistic contexts.
08:1108:14 Let's work through an example to make that crystal clear.
08:1408:23 Let's just take the main features that are produced by the front end and forget these special values that Festival uses to detect problems in the database.
08:2308:34 So here they are, and their weights. I'm going to consider a single target position in the sentence I'd like to say. That's its linguistic specification.
08:3408:40 I've got two competing candidates, each with their linguistic specifications.
08:4008:49 We're going to look at the match / mismatch between each candidate in turn and that target specification, and compute the target cost for each of them.
08:4908:55 It's just a simple process of deciding if there's a mismatch and noting that.
08:5509:15 Let's do candidate 1 first. For candidate 1: stress matches, syllable position mismatches, word position matches, Part Of Speech matches, phrase position matches, left-context matches, but right-phonetic-context mismatches.
09:1509:52 I will do the same for candidate 2 separately: stress mismatches, syllable position matches, word position matches, Part Of Speech mismatches, phrase position matches, left-phonetic-context mismatches, right-phonetic-context matches. Candidate 1 has two mismatches, but we need to do a weighted sum to take into account the relative importance of those mismatches.
09:5210:08 The syllable position mismatch incurs a penalty of 5 and the right-phonetic-context mismatch incurs a penalty of 3, giving us a total of 8.
10:0810:27 Separately for candidate 2: that stress mismatch incurs a penalty of 10, the Part Of Speech mismatch costs 6, and the left-phonetic-context mismatch costs 4, giving us a total of 20.
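The arithmetic of this worked example can be reproduced with a short sketch. The five penalties quoted above (10, 6, 5, 4, 3) are taken from the example; the weights for word position and phrase position are not stated in the transcript, so the values used here are placeholders, and all the feature values except the [b] and [v] left contexts are invented for illustration.

```python
# Weights for the Independent Feature Formulation worked example.
# "word_position" and "phrase_position" are ASSUMED placeholder values; they do
# not affect the result because both candidates match the target on them.
WEIGHTS = {
    "stress": 10,
    "syllable_position": 5,
    "word_position": 5,           # assumed
    "part_of_speech": 6,
    "phrase_position": 7,         # assumed
    "left_phonetic_context": 4,
    "right_phonetic_context": 3,
}

def target_cost(target, candidate, weights=WEIGHTS):
    """Weighted sum of binary mismatches between target and candidate."""
    return sum(w for name, w in weights.items() if target[name] != candidate[name])

# Hypothetical linguistic specifications (feature values are illustrative only).
target = {
    "stress": True, "syllable_position": "initial", "word_position": "initial",
    "part_of_speech": "noun", "phrase_position": "medial",
    "left_phonetic_context": "b", "right_phonetic_context": "t",
}
# Candidate 1 mismatches on syllable position and right phonetic context.
candidate_1 = dict(target, syllable_position="final", right_phonetic_context="d")
# Candidate 2 mismatches on stress, Part Of Speech and left phonetic context.
candidate_2 = dict(target, stress=False, part_of_speech="verb",
                   left_phonetic_context="v")

cost_1 = target_cost(target, candidate_1)  # 5 + 3 = 8
cost_2 = target_cost(target, candidate_2)  # 10 + 6 + 4 = 20
print(cost_1, cost_2)
```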
10:2710:55 Now remember, we don't simply use these two values to choose between these two candidates, because we don't yet know how their waveforms will concatenate with the candidates left and right of them in the lattice. Those costs (those target costs) just go into the lattice and become part of the total cost of all the different paths passing through each of these candidates. As with most of the examples in this part of the course, I'm drawing my lattice in terms of whole phones because it's neater.
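To see why the target costs alone don't decide the outcome, here is a minimal sketch of a dynamic-programming search through a tiny lattice. All the numbers, the `best_path` function and the join-cost table are hypothetical; the point is only that a candidate with the higher target cost (20) can still lie on the cheapest path if it joins well to its neighbours.

```python
def best_path(target_costs, join_cost):
    """Find the cheapest path through the lattice.

    target_costs[i][k]: target cost of candidate k at position i.
    join_cost(i, j, k): cost of joining candidate j at position i-1
                        to candidate k at position i.
    Returns (list of chosen candidate indices, total cost).
    """
    best = list(target_costs[0])   # cheapest cost of reaching each candidate
    backpointers = []
    for i in range(1, len(target_costs)):
        new_best, pointers = [], []
        for k, tc in enumerate(target_costs[i]):
            # Best predecessor for candidate k at position i.
            j = min(range(len(best)), key=lambda j: best[j] + join_cost(i, j, k))
            new_best.append(best[j] + join_cost(i, j, k) + tc)
            pointers.append(j)
        best = new_best
        backpointers.append(pointers)
    # Trace back from the cheapest final candidate.
    k = min(range(len(best)), key=best.__getitem__)
    total = best[k]
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path, total

# Hypothetical lattice: three positions, two candidates each.
target_costs = [[0.0, 2.0], [8.0, 20.0], [1.0, 1.0]]
# joins[i-1][j][k]: invented join costs. Candidate 0 at position 1 joins badly
# on both sides (15); candidate 1 at position 1 joins perfectly (0).
joins = [
    [[15.0, 0.0], [15.0, 0.0]],
    [[15.0, 15.0], [0.0, 0.0]],
]
path, total = best_path(target_costs, lambda i, j, k: joins[i - 1][j][k])
print(path, total)
```

In this toy lattice the path through the candidate with target cost 20 has total cost 21, whereas any path through the candidate with target cost 8 costs at least 39.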
10:5511:10 Let's draw a picture of what it would be like for diphone units, just so we see that it can be done. In diphone units, I'll run the front end in the same way. I've rewritten segments as diphones.
11:1011:33 So that's now my target sequence, and I'm going to go and retrieve diphone candidates from the database. Each of these candidates has a waveform, and of course also has a linguistic specification. In the Independent Feature Formulation, it's only the linguistic specification that we're going to use for comparison.
11:3311:39 Let's again focus in on one particular target position: we'd like to say this diphone.
11:3912:02 We have two available candidates. We know the recorded utterances that each of those candidates came from: "They saw each other for the first time in Boston" So the top candidate there came from that utterance. "They ran the canoe in and climbed the high earth bank" The other candidate came from that utterance.
12:0212:27 For each of those utterances (these are in the database) they've got natural recorded speech plus a complete linguistic specification. We can see that we're always matching on the base unit type: that's always an exact match; that's how we retrieve the candidates, just by looking at that. Then we can look at other features around them. Now the calculation of target cost for diphones is just a little bit messier because we do it separately for the left and the right halves.
12:2712:44 That's because some features might actually differ going through the diphone: it might cross (for example) a syllable or word boundary. The left half of the diphone might be in a different Part Of Speech to the right half. We just calculate the target cost as two sub-costs: the left and right halves, and then add those together.
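A sketch of the two-sub-cost idea for diphones, assuming each half of the diphone carries its own small feature set. The data layout and feature values are hypothetical; the two weights are the ones quoted in the earlier worked example.

```python
# Subset of the weights from the worked example (stress = 10, Part Of Speech = 6).
WEIGHTS = {"stress": 10, "part_of_speech": 6}

def half_cost(target_half, candidate_half, weights=WEIGHTS):
    """Weighted mismatch count for one half of a diphone."""
    return sum(w for name, w in weights.items()
               if target_half[name] != candidate_half[name])

def diphone_target_cost(target, candidate):
    """Target cost of a diphone: left-half sub-cost plus right-half sub-cost."""
    return (half_cost(target["left"], candidate["left"])
            + half_cost(target["right"], candidate["right"]))

# Hypothetical diphone that crosses a word boundary: the left half belongs to
# an unstressed determiner, the right half to a stressed noun.
target = {
    "left": {"stress": False, "part_of_speech": "det"},
    "right": {"stress": True, "part_of_speech": "noun"},
}
# The candidate differs only in the Part Of Speech of its left half.
candidate = {
    "left": {"stress": False, "part_of_speech": "verb"},
    "right": {"stress": True, "part_of_speech": "noun"},
}
cost = diphone_target_cost(target, candidate)  # 6 (left half) + 0 (right half)
print(cost)
```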
12:4412:48 An Independent Feature Formulation target cost is really rather simple.
12:4813:24 In fact, it's a bit too simple. If we return to this example that we just worked through, we can see that there's a problem with the Independent Feature Formulation type of target cost. It's a bit too simplistic - it's too naive - and the simplicity is because we've treated the features as independent for the purposes of calculating the target cost. The target cost function doesn't consider two rather important things. One thing that it fails to consider is combinations of features. For example, there might be interactions between the stress status of a syllable and whether it's phrase final or not.
13:2413:40 Both of those are competing to affect F0, but this function just considers them independently, and just accumulates the penalties. The other oversimplification of this function is that things strictly match or mismatch: it's a binary distinction.
13:4014:14 There's no concept of a "near match". So there's no distance: things are either exactly the same (incurring zero penalty) or different (incurring the maximum penalty: the weight in that column). There's an example in the table of where a "near match" might be OK. Candidate 2 came from a left-phonetic-context of [v]. This has got a fairly similar place of articulation to the desired (the target) left-phonetic-context of [b].
14:1414:27 They're both also voiced. So we might prefer to take candidates from [v] left-phonetic-contexts than radically different ones, like a liquid.
14:2714:32 However, this still incurred the maximum penalty of 4 here.
14:3214:41 It would be better if we could soften that somewhat and say that that's a "near match" and maybe there should be a lower penalty in that case.
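One possible way to soften the penalty, shown only as a sketch: scale the weight by how many coarse articulatory properties differ between the two contexts. This is not something the Independent Feature Formulation or Festival does; the `ARTICULATION` table and the scaling rule are assumptions made up for illustration.

```python
LEFT_CONTEXT_WEIGHT = 4  # the maximum penalty from the worked example

# Hypothetical, deliberately coarse articulatory descriptions.
ARTICULATION = {
    "b": {"voicing": "voiced", "place": "labial", "manner": "plosive"},
    "v": {"voicing": "voiced", "place": "labial", "manner": "fricative"},
    "l": {"voicing": "voiced", "place": "alveolar", "manner": "liquid"},
}

def soft_context_penalty(target_phone, candidate_phone, weight=LEFT_CONTEXT_WEIGHT):
    """Penalty proportional to the fraction of articulatory properties that differ."""
    a = ARTICULATION[target_phone]
    b = ARTICULATION[candidate_phone]
    differing = sum(1 for key in a if a[key] != b[key])
    return weight * differing / len(a)

exact = soft_context_penalty("b", "b")   # no difference, so no penalty
near = soft_context_penalty("b", "v")    # only manner differs
far = soft_context_penalty("b", "l")     # place and manner differ
print(exact, near, far)
```

With this scheme the [v] context is penalised less than the liquid, rather than both receiving the full penalty of 4.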
14:4115:00 This function is unable to do that. So, that's pretty much all there is to the Independent Feature Formulation. We're working with features that have already been produced by our front-end. That's super-convenient and is also going to be computationally quick. We've already had to do all of that front-end processing, so those features are things we already have: they come "for free".
15:0015:07 Those are calculations we had to do to disambiguate pronunciation (for example).
15:0715:17 So we're just deriving simple symbolic features from - in Festival's case - the existing utterance structure, or more generally the linguistic specification.
15:1715:25 So computation of this is going to be cheap. Of course a weighted sum of mismatches is very cheap to compute. So this target cost function will be fast.
15:2515:30 That's good! It makes some dramatic simplifications though.
15:3015:34 Nevertheless, it will work. This is pretty much what Festival does.
15:3415:38 It's almost a pure Independent Feature Formulation target cost function in Festival.
15:3815:46 Now, we didn't make any acoustic predictions at all in computing this target cost.
15:4616:08 The function simply worked with symbolic features. The symbolic features could optionally include symbolic prosodic features. So, if the front-end can predict them with sufficient accuracy - for example we might attempt to predict ToBI accents and boundary tones - we will have these symbolic features that capture prosody.
16:0816:12 Those can be taken into account when selecting candidates from the database.
16:1216:15 Of course, we'll have to annotate the database with the same features.
16:1516:25 But what if we don't have that? What if we don't explicitly mark up prosody even symbolically on either the target or the candidates in the database?
16:2516:31 How on earth will we get any prosody at all? How is prosody created using such a cost function?
16:3116:49 Well, very simply by choosing candidates from an appropriate context - for example, phrase final - we'll get appropriate prosody automatically. That's the same principle that we use to get the correct phonetic co-articulation or the correct syllable stress.
16:4917:12 We'll get prosody simply by choosing candidates essentially from the right position in the prosodic phrase. Therefore, all we really need to do to get prosody is to make sure that the linguistic features from our front end capture sufficient contextual information relevant to prosody. An awful lot of that rests simply on position-within-prosodic-constituents: where the syllable is within the word, within the phrase,...
17:1217:31 That will get us prosody. Optionally - and I say optionally because predicting prosody even symbolically is very error-prone - optionally, we could attempt to predict prosody and then use that as part of the cost function as just another linguistic feature. It would have to have its own weight.
17:3117:47 I just stated that an Independent Feature Formulation makes no attempt to make any acoustic predictions whatsoever about the target. It simply gets the candidates, and whatever acoustic properties they have, that's what the synthetic speech has.
17:4717:54 But thinking about the system as a whole, of course we are making predictions about the acoustics, because we're generating synthetic speech.
17:5418:26 It's just implicit in the procedure. Whilst the cost function itself only deals with symbolic features, the output of the system is synthetic speech and that of course has acoustic properties. Taken as a whole, the database, the target cost function, the search for the best candidate sequence: that whole complex system is making acoustic predictions. It's a complicated sort of regression from the linguistic specification to a speech waveform. However, it's completely implicit.
18:2618:36 There's an advantage to being completely implicit. We don't need to make explicit acoustic predictions, so we don't need complicated models for that, which would themselves make mistakes.
18:3618:50 We just get natural output. There's also a weakness: we can't really inspect the system. We can't really see how it's making this acoustic prediction. All we can do is indirectly control that by, for example, changing the weights in the target cost.
18:5019:06 So, what we're going to move on to now is we're going to look at a different formulation of the target cost function. Something that does make some acoustic predictions - explicit predictions of actual acoustic properties - and then measures the difference between target and candidate in that acoustic space.
19:0619:20 That's the Acoustic Space Formulation. That's going to help get us out of a sparsity problem. But also we can then observe those acoustic predictions: measure their accuracy in an objective sense.
19:2019:26 That might help us improve the system in a way that's rather opaque in the Independent Feature Formulation.

The IFF suffers from sparsity problems, which we can try to overcome by making the comparisons between targets and candidates in terms of acoustic properties.

00:0400:20 I think this is a good time to orient ourselves again, to check that we understand where we are in the bigger picture of unit selection. What we should understand so far is that a unit selection speech synthesizer has the same front end as any other synthesizer.
00:2000:41 So we run that text processor. From the linguistic specification, we construct a target sequence, essentially flattening down the specification on to the individual targets. For each target unit, we retrieve all possible candidates from the database. We just go for an exact match on the base unit type and we hope for variety in all of the other linguistic features.
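The retrieval step can be sketched as a simple lookup keyed on the base unit type. The database layout, the unit names and the utterance identifiers below are all hypothetical.

```python
from collections import defaultdict

def build_index(database_units):
    """Group every unit in the database by its base unit type."""
    index = defaultdict(list)
    for unit in database_units:
        index[unit["type"]].append(unit)
    return index

def retrieve_candidates(index, target_sequence):
    """For each target, return all database units with exactly the same type."""
    return [index.get(target["type"], []) for target in target_sequence]

# Hypothetical database: diphone types with some linguistic context attached.
database_units = [
    {"type": "k-aa", "utterance": "utt001", "stress": True},
    {"type": "k-aa", "utterance": "utt007", "stress": False},
    {"type": "aa-sil", "utterance": "utt001", "stress": True},
]
index = build_index(database_units)

target_sequence = [{"type": "k-aa"}, {"type": "aa-sil"}]
candidates = retrieve_candidates(index, target_sequence)
print([len(c) for c in candidates])
```

Only the unit type is used for retrieval; the other features of each candidate are left for the target cost to compare.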
00:4100:50 For each candidate, we compute a target cost. So far, we understand the Independent Feature Formulation style of target cost. It's just a weighted sum.
00:5000:54 The weights are the penalties for each mismatched linguistic feature.
00:5401:10 We compute join costs and perform a search. We're going to look at a more sophisticated form of target cost now, where we predict some acoustic properties for the targets and compare those with actual acoustic properties of candidates.
01:1001:14 The motivation for that is the weakness of the Independent Feature Formulation.
01:1401:21 That compares only linguistic features: symbolic things produced by the front-end.
01:2101:27 That's computationally efficient, but it creates an artificial form of sparsity.
01:2701:44 We could summarize this weakness by thinking about a single candidate for a particular target position. The target and the candidate may have differing (in other words, mismatched) linguistic features. Yet, the candidate could be ideal for that position. It could sound very similar to the ideal target.
01:4401:50 It's very hard to get round that when we're only looking at these linguistic features.
01:5001:56 It's very hard to detect which combinations of features lead to the same sound.
01:5602:00 What we need to do is to compare how the units sound.
02:0002:09 We want to compare how a candidate actually sounds (because we have its waveform) with how we think - or how we predict - a target should sound.
02:0902:16 That's going to involve making a prediction of the acoustic properties of the target.
02:1602:21 Taylor tries to summarize this situation in this one diagram.
02:2102:23 Let's see if we can understand what's going on in this diagram.
02:2302:55 For now, let's just think about candidates that have both linguistic specifications and actual acoustic properties. What Taylor is saying with this diagram is that it's possible that there are two different speech units that have very different linguistic features: this one and this one. These are maximally-different linguistic features: they mismatch in both stress and phrase finality. (We're just considering those two dimensions in this picture.) Yet it's possible that they sound very similar.
02:5503:06 The axes of this space are acoustic properties, and these two units lie very close to each other in acoustic space. This is completely possible.
03:0603:12 It's possible for two things to have different linguistic specifications but sound very similar.
03:1203:18 To fully understand the implications of this, we need to also think about the target units.
03:1803:29 At synthesis time our target units do not have acoustic properties because they're just abstract linguistic structures. We're trying to predict the acoustic properties.
03:2903:35 There is some ideal acoustic property of each target and so the same situation could hold.
03:3504:15 It could be the case that we are looking for a target that has this linguistic specification and - using the Independent Feature Formulation - this potential candidate here would be very far away. It would incur a high target cost: it mismatches twice (both features). These other possible candidates would appear to be closer in linguistic feature space. But if it's the case that [stress -] and [phrase-finality +] happen to sound very similar to [stress +] and [phrase-finality -] then we shouldn't consider these two things to be far apart at all.
04:1504:23 But the only way to discover that is actually to go into acoustic space and measure the distance in acoustic space: measure this distance between these two things.
04:2304:32 Because, in linguistic feature space, we won't be able to detect that they would have sounded similar. Unfortunately Taylor fails to label his axes.
04:3204:35 It's probably deliberate because he's trying to say this is an abstract acoustic space.
04:3504:41 c1 and c2 could be any dimensions of acoustic space that you want.
04:4104:45 It might be that this one's duration and this is some other acoustic property.
04:4504:49 But it might be something else: maybe the other way around, or maybe something else.
04:4904:58 It doesn't really matter. The point is that in acoustic space things might be close together, but in linguistic space they're far apart.
04:5805:02 They're apparently linguistically different, but they are acoustically interchangeable.
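A small numerical sketch of that situation. The two acoustic dimensions are left abstract (like c1 and c2 in the diagram) and every value below is invented: two units mismatch on both linguistic features yet sit close together in acoustic space, while a third unit with only one mismatch is acoustically far away.

```python
import math

# Hypothetical units: two binary linguistic features and a point in an
# abstract two-dimensional acoustic space (c1, c2).
unit_a = {"features": {"stress": True, "phrase_final": False}, "acoustics": (0.42, 0.31)}
unit_b = {"features": {"stress": False, "phrase_final": True}, "acoustics": (0.45, 0.29)}
unit_c = {"features": {"stress": True, "phrase_final": True}, "acoustics": (0.90, 0.75)}

def linguistic_mismatches(u, v):
    """Number of linguistic features on which the two units differ."""
    return sum(1 for name in u["features"] if u["features"][name] != v["features"][name])

def acoustic_distance(u, v):
    """Euclidean distance between the two units in acoustic space."""
    return math.dist(u["acoustics"], v["acoustics"])

for name, other in (("b", unit_b), ("c", unit_c)):
    print(name, linguistic_mismatches(unit_a, other), round(acoustic_distance(unit_a, other), 3))
```

Judged by mismatch count, unit b looks like the worse substitute for unit a; judged by acoustic distance, it is by far the better one.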
05:0205:23 It's that interchangeability that's the foundation of unit selection: that's what we're trying to discover. Now, for our target units to move closer to the candidates (which are acoustic things) we need to predict some acoustic properties for the targets. We don't necessarily need to predict a speech waveform because we're not going to play back these predicted acoustic properties.
05:2305:39 We're only going to use them to choose candidates. So we really don't need a waveform and neither do we need to predict every acoustic property. We just need to predict sufficient properties to enable a comparison with candidate units. Let's try to make this clearer with a picture.
05:3906:02 Back to this diagram again. Again, just for the purpose of explanation, our units are phone-sized. These candidates here are fully-specified acoustic recordings of speech. We have waveforms from which we could estimate any acoustic properties. We can measure duration; we could estimate F0; we could look at the spectral envelope or formants if we wanted.
06:0206:08 The targets are abstract linguistic specifications only, with no acoustics.
06:0806:13 So far, we only know how to compare them in terms of linguistic features, which both have.
06:1306:24 What we're going to do now: we're going to try and move target units closer to the space in which the candidates live. We're going to give them some acoustic properties.
06:2406:33 Let's just think of one: let's imagine adding a value for F0 to all of the target units.
06:3306:44 We also know F0 for all the candidates. It will be then easy to make a comparison between that predicted F0 and the true F0 of a candidate.
06:4406:53 We would compare these things, and we could do that for any acoustic properties we liked.
06:5306:57 Now, what acoustic features are we going to try and add to our targets?
06:5707:03 Well, we have a choice. We could do anything we like.
07:0307:06 We could predict simple acoustic things such as F0.
07:0607:10 In other words, have a model of prosody that predicts values of F0.
07:1007:16 Equally, we could predict values for duration or energy (all correlates of prosody).
07:1607:31 So: we'd need a predictive model of prosody. We'd have to build it, train it, put it inside the front-end, run it at synthesis time. It would produce values for these things which you could compare to the true acoustic values of the candidates.
07:3107:45 We could go further. We could predict some much more detailed specification: maybe even the full spectral envelope. Typically we're going to encode the envelope in some compact way that makes it easy to compare to the candidates.
07:4507:47 Cepstral coefficients would be a good choice there.
07:4707:56 It would seem that the more and more detail that we can predict, the better, because we can make more accurate comparisons to the candidates.
07:5608:11 That's true in principle. However, these are all predicted values and the predictions will have errors. The more detailed the predictions need to be - for example the full spectral envelope - the less certain we are that they're correct.
08:1108:14 It's getting harder and harder to predict those things.
08:1408:18 So all of this is only going to work if we can rather accurately predict these properties.
08:1808:35 If we don't think we can accurately predict them, we're better off with the Independent Feature Formulation. Indeed that's why the earlier systems had the Independent Feature Formulation, because we didn't have sufficiently powerful statistical models or good enough data to build accurate predictors of anything else.
08:3508:47 But we've got better at that. We have better models, and so we could indeed today envisage predicting a complete acoustic specification - in fact, all the way to the waveform if you wanted. How would we do that?
08:4708:54 Well it's a regression problem! We've got inputs: the linguistic features that we already have from the front-end for our targets.
08:5409:31 We have a thing we're trying to predict: it could be F0, duration, energy, MFCCs, ... anything that you like. So you just need to pick your favourite regression model. Here's one you know about: the fantastic Classification And Regression Tree. We'll run it in regression mode, because we're going to predict continuously-valued things. For example, the leaves of this tree might have actual values for F0. We'll write the values in here and these would be the predicted values for things with appropriate linguistic features.
09:3109:35 It's not the greatest model in the world, but it's one we know how to use.
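To make the regression idea concrete, here is a toy regression tree written out by hand, followed by its use as an acoustic sub-cost. The questions, the F0 values at the leaves and the candidate F0 values are all made up; a real tree would be learned from data rather than written by hand.

```python
def predict_f0(features):
    """Hand-built regression tree: each leaf holds a predicted F0 in Hz.

    The questions and the leaf values are hypothetical.
    """
    if features["stressed"]:
        if features["phrase_final"]:
            return 110.0
        return 140.0
    if features["phrase_final"]:
        return 90.0
    return 120.0

def f0_sub_cost(target_features, candidate_f0):
    """Absolute difference between predicted target F0 and measured candidate F0."""
    return abs(predict_f0(target_features) - candidate_f0)

# The target has only linguistic features; the candidates have measured F0.
target_features = {"stressed": True, "phrase_final": False}
predicted = predict_f0(target_features)           # the leaf value for this target
cost_a = f0_sub_cost(target_features, 135.0)      # candidate close to the prediction
cost_b = f0_sub_cost(target_features, 100.0)      # candidate far from the prediction
print(predicted, cost_a, cost_b)
```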
09:3509:38 If you don't like that one, pick any other model you like.
09:3809:57 Maybe you could have a neural network. That would work fine as well, or any other statistical model that can perform regression. We're actually going to stop talking now about Acoustic Space Formulation, because we're getting very close to statistical parametric synthesis. That's coming later in the course.
09:5710:12 Once we have fully understood statistical parametric synthesis - which will use models such as trees or neural networks - we can then come full circle and use that same statistical model to drive the target cost function and therefore to do unit selection.
10:1210:32 We call that a "hybrid method". Let's wrap up our discussion of the target cost function. We have initially made a hard distinction between two different sorts of target cost function: the Independent Feature Formulation, strictly based on linguistic features; the Acoustic Space Formulation, strictly based on comparing acoustic properties.
10:3210:36 Of course, we don't need to have that artificial separation.
10:3610:39 We could have both sorts of features in a single target cost function.
10:3911:05 There's no problem with that. We could sum up the differences in linguistic features plus the absolute difference in F0 plus the absolute difference in duration, ... and so on, all with weights accordingly. There's absolutely no problem then to build a mixed-style target cost function where we have a whole set of sub-costs; some of them using linguistic features, some using acoustic features, and we have to set weights.
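A sketch of such a mixed-style target cost, with linguistic mismatch sub-costs plus weighted absolute differences in F0 and duration. The function name, the data layout, the acoustic weights and all the values are assumptions for illustration; only the two linguistic weights (10 and 4) come from the earlier worked example.

```python
def mixed_target_cost(target, candidate, feature_weights, f0_weight, duration_weight):
    """Sum of linguistic mismatch penalties and weighted acoustic differences."""
    linguistic = sum(
        w for name, w in feature_weights.items()
        if target["features"][name] != candidate["features"][name]
    )
    f0 = f0_weight * abs(target["f0"] - candidate["f0"])
    duration = duration_weight * abs(target["duration"] - candidate["duration"])
    return linguistic + f0 + duration

feature_weights = {"stress": 10, "left_phonetic_context": 4}

# The target's F0 (Hz) and duration (seconds) would be predicted by some
# regression model; the candidate's values would be measured from its waveform.
target = {
    "features": {"stress": True, "left_phonetic_context": "b"},
    "f0": 140.0, "duration": 0.08,
}
candidate = {
    "features": {"stress": True, "left_phonetic_context": "v"},
    "f0": 130.0, "duration": 0.10,
}

# Hypothetical weights that bring the sub-costs on to comparable scales.
cost = mixed_target_cost(target, candidate, feature_weights,
                         f0_weight=0.5, duration_weight=100.0)
print(cost)  # 4 (left context) + 5 (F0) + about 2 (duration)
```

The acoustic weights also have to absorb the different units (Hz, seconds), which is one reason why setting them is harder than in the purely symbolic case.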
11:0511:37 It's going to be even more difficult to set the weights than in the Independent Feature Formulation case, but we'd have to do it somehow. Then we can combine the advantages of both types of sub-cost. The Independent Feature Formulation inherently suffers from extreme sparsity and so predicting some acoustic features can escape some of those sparsity problems inherent in that formulation. However, we don't know how to predict every single acoustic property. For example, things happen phrase-finally other than changes in F0 and duration and amplitude.
11:3711:49 Some of those things, such as creaky voice, aren't easily captured even by the spectral envelope. It would be very difficult to go all the way to a full prediction of that and then go and find candidates with that property.
11:4912:02 We'd probably be a lot better off using linguistic features, such as "Is it phrase final?" and pulling candidates that are phrase final and would automatically have creaky voice where appropriate. So, using features still has a place.
12:0212:06 We'd probably use them alongside acoustic properties as well.
12:0612:11 Finally, of course, we should always remember that all of these features have errors in them.
12:1112:16 Even the linguistic features from the front-end will occasionally be wrong.
12:1612:19 The acoustic predictions will have an intrinsic error in them.
12:1912:25 The more detailed those predictions, the harder the task is in fact, so the greater the error.
12:2512:30 So everything has errors and we need to take that into account.
12:3012:43 To summarize: the Independent Feature Formulation uses rather more robust, perhaps slightly less error-prone features from the front-end, but it suffers from extreme sparsity problems.
12:4312:51 The Acoustic Space Formulation gets us over some of the sparsity problems, but we run into problems of accuracy of predicting acoustic properties.
12:5113:01 So, many real systems use some combination of these two things: a target cost function that combines linguistic features and acoustic properties as well.

We'll wrap up with a summary of the various design choices you have to make when building a new unit selection speech synthesiser, and a final look forward to hybrid speech synthesis.

00:0500:20 I'm going to conclude that discussion of unit selection speech synthesis, including the different forms that the target cost function could take, with a summary of the different design choices that you have when you're going to build a new system.
00:2000:25 The first choice is: what kind of unit you're going to use.
00:2500:31 All my diagrams were using whole phones, although that's not really a sensible choice in practice.
00:3100:37 Far more commonly, we'll find systems using diphones or half-phones.
00:3700:47 In either case - but especially the half-phone case - the "zero join cost trick" is very important to effectively get larger units. Those might actually be much larger units.
00:4701:04 That's really easy to implement. You just need to remember which candidates were contiguous in the database and define their join cost to be zero, and not have them calculated by the join cost function. For example, maybe these units were contiguous, and we'll just write zero join cost between them.
01:0401:09 The lattice will be formed as usual, with all the paths...
01:0901:19 joining everything together. On some of those paths would be zero join costs. On others the join cost function will compute the cost. It doesn't matter to the search.
01:1901:25 The search will just find the best overall path.
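As a minimal sketch of the "zero join cost trick", assume each candidate remembers which recorded utterance it came from and its position within that utterance. The `Candidate` class, its fields, and the stand-in `acoustic_join_cost` function are hypothetical; a real join cost would compare more than F0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate unit; field names are illustrative."""
    utt_id: str      # which database utterance the unit was cut from
    index: int       # position of the unit within that utterance
    f0_start: float  # F0 at the left edge of the unit (Hz)
    f0_end: float    # F0 at the right edge of the unit (Hz)


def acoustic_join_cost(left, right):
    """Stand-in for a real join cost function: F0 mismatch at the join only."""
    return abs(left.f0_end - right.f0_start)


def join_cost(left, right):
    """Zero if the two units were contiguous in the database, else computed."""
    if left.utt_id == right.utt_id and right.index == left.index + 1:
        return 0.0  # contiguous: never call the join cost function
    return acoustic_join_cost(left, right)


a = Candidate("utt1", 3, f0_start=100.0, f0_end=110.0)
b = Candidate("utt1", 4, f0_start=130.0, f0_end=125.0)  # follows a in utt1
c = Candidate("utt2", 9, f0_start=115.0, f0_end=120.0)
print(join_cost(a, b), join_cost(a, c))
```

The search itself needs no changes: it simply sees a cost of zero on some arcs of the lattice.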
01:2501:28 You need to choose what kind of target cost you're going to use.
01:2801:33 Festival is almost a pure Independent Feature Formulation style target cost.
01:3301:37 It's just got a couple of little bits of acoustic information in there.
01:3702:18 Or we could use a purely Acoustic Space Formulation, doing sufficient partial synthesis so that we only make comparisons in acoustic properties. You've then got to decide which acoustic properties (which acoustic features) to predict so that that comparison is meaningful and will find you the right candidates. But most common of all probably is to do both of those things: to use features, because we have them from the front end - they're good in some situations. For example, features like phrase-final are really good at getting all of the different acoustic properties that correlate with phrase finality: lengthening, F0 falling, voice quality changes such as creakiness.
02:1802:32 Those things aren't all easy to predict. Better just to get units from the right context in the database. But almost always we'll have some acoustic prediction in there. So we might have a prosody model that's "sketching out" an F0 contour that we'd like to try and meet.
02:3202:38 Or a duration model that tells us what duration candidates to prefer.
02:3803:15 The join cost then makes sure that we only choose sequences of candidates that will concatenate smoothly and imperceptibly. We didn't say anything about any further signal processing, but in many systems (although not in Festival) a little bit of further signal processing is done. For example, to manipulate F0 in the locality of a join to make it more continuous. We're not going to deviate very much from the original natural units: that would degrade quality and also it will get us further away from this implicit or explicit prediction that we got from the unit selection process.
03:1503:19 The search is straightforward dynamic programming. It can be done very efficiently.
03:1903:23 It can be formulated on a lattice to make it look like this picture here.
03:2303:27 We could implement it in any way we like: for example, Token Passing.
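One possible way to write the dynamic programming search is sketched below. It is not Token Passing and not the code of any particular system; the function name `viterbi_search`, the data layout (one list of candidates per target position, with target costs already computed) and the toy join cost are assumptions made for illustration.

```python
def viterbi_search(candidates, target_costs, join_cost):
    """Find the lowest-cost path through a lattice of candidates.

    candidates[t] is the list of candidates for target position t;
    target_costs[t][i] is the target cost of candidates[t][i];
    join_cost(left, right) returns the cost of concatenating two candidates.
    Returns (total_cost, list_of_chosen_candidates).
    """
    # best[t][j]: lowest cost of any partial path ending at candidates[t][j]
    best = [list(target_costs[0])]
    back = [[None] * len(candidates[0])]
    for t in range(1, len(candidates)):
        best_t, back_t = [], []
        for j, cand in enumerate(candidates[t]):
            # Cost of arriving at this candidate from each predecessor.
            arrive = [best[t - 1][i] + join_cost(prev, cand)
                      for i, prev in enumerate(candidates[t - 1])]
            i_best = min(range(len(arrive)), key=arrive.__getitem__)
            best_t.append(arrive[i_best] + target_costs[t][j])
            back_t.append(i_best)
        best.append(best_t)
        back.append(back_t)
    # Trace back from the best final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][j])
        j = back[t][j]
    path.reverse()
    return total, path


# Toy lattice: candidates are (utterance, position) pairs with made-up costs.
def toy_join_cost(left, right):
    contiguous = left[0] == right[0] and right[1] == left[1] + 1
    return 0.0 if contiguous else 1.0


cands = [[("u1", 0), ("u2", 5)], [("u1", 1), ("u3", 2)], [("u1", 2), ("u2", 7)]]
tcosts = [[0.5, 0.25], [0.5, 0.125], [0.25, 0.25]]
print(viterbi_search(cands, tcosts, toy_join_cost))
```

In this toy lattice the contiguous sequence wins even though other candidates have lower target costs, because its join costs are zero.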
03:2703:32 In a real system, the length of these lists of candidates will be much, much longer.
03:3203:35 There might be hundreds or thousands of common diphones.
03:3503:41 With such long candidate lists, the number of paths through the lattice becomes unmanageable.
03:4103:44 It's too large and the system will be too slow.
03:4403:48 It's therefore normal to do some pruning, just as in Automatic Speech Recognition.
03:4804:13 There are many forms of pruning. The two most common would be: firstly, to limit the number of candidates for any target position - so that will be based only on target cost, it will be computed locally and we just keep a few hundred candidates perhaps for each position (the ones with the lowest target cost); the second most common form of pruning is during the search.
04:1304:34 During the dynamic programming, as paths explore this grid, we'll just apply beam search (just as in Automatic Speech Recognition), comparing all of the paths at any moment in time during the search to the current best path. Those that are worse than that path by some margin - in other words, those whose cost exceeds the best path's cost by more than the beam - are discarded.
04:3404:38 As in Automatic Speech Recognition, pruning is an approximation.
04:3804:44 We're no longer guaranteed to find the lowest-cost candidate sequence.
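The two forms of pruning can be sketched as two small functions. The names `preselect` and `beam_prune`, and the representation of partial paths as a dictionary from a path identifier to its accumulated cost, are assumptions for illustration only.

```python
def preselect(candidates, target_costs, max_candidates):
    """Keep only the max_candidates candidates with the lowest target cost.

    This is computed locally for one target position, before the search.
    """
    order = sorted(range(len(candidates)), key=lambda i: target_costs[i])
    keep = order[:max_candidates]
    return [candidates[i] for i in keep], [target_costs[i] for i in keep]


def beam_prune(path_costs, beam):
    """Discard partial paths whose cost exceeds the current best by more than beam.

    path_costs maps a path identifier to its accumulated cost so far.
    """
    best = min(path_costs.values())
    return {path: cost for path, cost in path_costs.items()
            if cost <= best + beam}


print(preselect(["a", "b", "c", "d"], [3.0, 1.0, 2.0, 4.0], max_candidates=2))
print(beam_prune({"p1": 1.0, "p2": 1.5, "p3": 3.0}, beam=1.0))
```

A discarded candidate or partial path might have turned out to be part of the overall best sequence, which is why pruning is only an approximation.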
04:4404:55 The payoff is speed. The final design choice - the thing that we're going to cover next - is what to put in our database.
04:5504:59 So let's finish by looking forward to what's coming up.
04:5905:03 We need to know a lot more about this database. It's got to have natural speech in it.
05:0305:11 It's going to need to be from a single speaker for obvious reasons: we're going to concatenate small fragments of it. But what exactly should we record?
05:1105:15 How should we record it? Do we need to be very careful about that?
05:1505:30 And how do we annotate it? We need to know where all of (say) the diphones start and finish, and annotate each of them with their linguistic properties for use in either an Independent Feature Formulation (IFF) or an Acoustic Space Formulation (ASF) type target cost. That's coming next.
05:3005:45 After we've built the database, we can then move on to a more powerful form of speech synthesis, which is to use a statistical parametric model that will generate the waveform entirely with a model. There'll be no concatenation of waveforms.
05:4505:51 Nevertheless, it will still need the database to learn that model.
05:5106:02 When we come on to talk about the database, it will be important to fully understand our target cost: what features it requires, for example.
06:0206:42 Because that will help us decide how to cover all of the permutations of features in the database. When we think about how to annotate the database, we'll probably want to do that automatically because the database is probably going to be very large. Finally, we'll come full circle to this thing called "hybrid synthesis", which is probably best described as unit selection driven by a statistical model. Here's classical unit selection. A hybrid method would take the target sequence of units, replace it with predicted acoustic parameters, and use those to go and match candidates from the database.
06:4206:53 The target cost will be in this acoustic space. So we would replace those targets with parameters: for example, F0 or some parametrization of the spectral envelope.
06:5306:59 Here it's something called Line Spectral Pairs. We'd have the candidates as usual from the database. We'd form them into a lattice.
06:5907:03 Here it's called a "sausage" but it's really a lattice.
07:0307:16 We would choose the best path through that. The target cost function would be making comparisons between these units here and the predicted acoustic properties from a powerful statistical model.
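A target cost in acoustic space can be as simple as a distance between the predicted parameter vector and the one measured from each candidate. The sketch below uses a weighted Euclidean distance; the function name, the optional weights and the toy vectors are assumptions, and it takes for granted that the parameters have already been normalised to comparable scales.

```python
import math


def acoustic_target_cost(predicted, candidate, weights=None):
    """Weighted Euclidean distance between two acoustic parameter vectors.

    predicted: parameters from the statistical model for one target position.
    candidate: the same parameters measured from one database unit.
    """
    if len(predicted) != len(candidate):
        raise ValueError("parameter vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(predicted)
    return math.sqrt(sum(w * (p - c) ** 2
                         for w, p, c in zip(weights, predicted, candidate)))


# Toy example: pick the candidate closest to the prediction (made-up values).
predicted = [3.0, 0.0]
candidate_vectors = {"unit_a": [0.0, 4.0], "unit_b": [3.0, 1.0]}
costs = {name: acoustic_target_cost(predicted, vec)
         for name, vec in candidate_vectors.items()}
print(min(costs, key=costs.get), costs)
```

In a full system these costs would feed into the lattice search alongside the join costs, rather than being used to pick each candidate in isolation.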


Prioritise the Hunt & Black paper, because we’ll be discussing this in the class for Module 3.

Reading

Taylor – Chapter 16 – Unit-selection synthesis

A substantial chapter covering target cost, join cost and search.

Hunt & Black: Unit selection in a concatenative speech synthesis system using a large speech database

The classic description of unit selection, described as a search through a network.

Clark et al: Multisyn: Open-domain unit selection for the Festival speech synthesis system

A description of the implementation and evaluation of Festival's unit selection engine, called Multisyn.

King et al: Speech synthesis using non-uniform units in the Verbmobil project

Of purely historical interest, this is an example of a system using a heterogeneous unit type inventory, developed shortly before Hunt & Black published their influential paper.

If you need to watch the videos again, or refer to the readings whilst answering the questions, that’s allowed.

For this class, you need to bring a copy of the essential Hunt & Black reading either in hardcopy or on a device. As with all essential readings, you must read this paper before class.

Download the slides for the class on 2025-01-21 15:10-16:00

That concludes our discussion of how unit selection speech synthesis works. We still haven’t fully specified what’s in the pre-recorded speech database, but that is coming up in the next module.