Now that we've got a complete picture of how unit selection works, we can start to look in more detail at some of the most important components. The first thing we're going to look at is the target cost function. We're going to use Taylor's terminology here, to be consistent with his book. We're going to first look at the simplest way we could configure the target cost function. That's to calculate the cost as a weighted sum of mismatches in linguistic features. When Taylor says Independent Feature Formulation, he doesn't mean that the linguistic features are completely independent of each other in a statistical sense. What he's saying is that - in the target cost computation - the features are all considered independently. A mismatch in one feature doesn't interact with mismatches in other features. That makes the calculation really simple, but it is a weakness. The source of that weakness is the sparsity of these linguistic features, due to the extremely large number of permutations of possible values.

Before carrying on, make sure that you understand the general principles of unit selection from the previous videos. You need to know that unit selection is basically about selecting waveform fragments from a database of pre-recorded natural speech. Obviously, that speech is going to have to be annotated so we can find those units. We haven't said much about that yet, because it's going to come later. The selection of candidates is based on two costs: a target cost function, which we're going to talk a lot more about now, and a join cost function, which calculates the mismatch across concatenation points (across the joins). Because of the join cost, the selection of one candidate depends on the preceding and following candidates, all the way to the ends of the utterance. Therefore, to minimize the total cost, we need to conduct a search. We've already talked about that.

So let's get into those details about the target cost: it's measuring mismatch between a target and a candidate for that target position. We need to decide how that mismatch could be measured. We'll start with measuring that mismatch in the simplest possible way. We can call it simple because it's going to use things we already have from our front end. What we already have, of course, is the linguistic specification of the targets, and that comprises a set of linguistic features. We've already described how we can essentially flatten those on to the segment (on to the pronunciation level). So, what we're dealing with then is a sequence of pronunciation units: phonemes, or maybe diphones. Each of them has a specification attached to it, so it knows the context in which it appears. That's true both for the target sequence and for each individual candidate, because the candidates came from real recorded sentences, where we also knew the full linguistic specification. So we know the same things for the target and for each of the candidates. The features will be the same because they'll be produced in the same way. The target cost is then a simple count of how many of those features don't match. The motivation for that should be obvious. Ideally - although it doesn't happen very often - we would like to find exactly-matching candidates. Those exactly-matching candidates will have a cost of zero: there'll be no mismatch (a sum of zeros). The more mismatched the context is between the candidate and target, the higher the cost.
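To make that concrete, here is a minimal sketch of the idea in Python. The feature names and values are made up for illustration; the point is just that the cost is a count of mismatching context features, so an exactly-matching candidate scores zero.

```python
# A minimal sketch of the basic idea: compare the linguistic specification of
# a target position with that of a candidate, feature by feature, and count
# how many features don't match. (Feature names and values are illustrative.)

def mismatch_count(target: dict, candidate: dict) -> int:
    """Unweighted target cost: the number of mismatching linguistic features."""
    return sum(1 for feature, value in target.items()
               if candidate.get(feature) != value)

target    = {"stress": "stressed", "word_position": "final", "left_context": "dh"}
candidate = {"stress": "stressed", "word_position": "final", "left_context": "sil"}

print(mismatch_count(target, candidate))  # 1 - only the left context differs
```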
If the mismatch is in terms of linguistic features, we can just sum up the individual mismatches. Always remember that the target cost function (like the join cost function) is computing a cost, and that cost is only a prediction of how bad this candidate might sound if we were to use it in this target position. The advantage of this Independent Feature Formulation type of target cost function is that it works with things we already know. For every target, the front end text processor has provided us with a linguistic specification. For every candidate that we are considering for that target position, we also know the same linguistic specification. Now, precisely how we know that, we'll cover in the module on the database, a little bit later. So, we know the same things for every target and for every candidate. We can just make a direct comparison between them.

So, let's make that completely clear. Let's focus in on one particular target position, one particular candidate that we're considering for that target. We'll make a direct comparison between the two in terms of their linguistic specification. That's going to include things like their phonetic context. So we know the context in which this appears - it's actually this left and right phonetic context - although remember it can be attached locally, because this sequence is constant. For this candidate, we also know the same things. We know the phonetic context from which it was extracted in its source sentence. It's not this context necessarily, it's the context of the natural sentence it came from. That context is described as a set of separate linguistic features: phonetic context, perhaps stress, position-in-syllable, position-in-word, position-in-phrase, ... anything that our front-end text processor can generate for us at synthesis time. For example, imagine that the candidate we're considering here (the one we're measuring the target cost for) actually occurred in the natural sentence "A car." So we know, for example, that it was phrase-initial, and the target here is also phrase-initial. We know that it was word-final - there's a word boundary here. We also know that the target position is word-final. And of course we know the phonetic context: this candidate came after a silence; however, for the target position we want something that comes after "the": there's a mismatch. We know that this candidate came from before a [k] and we also know that the target comes before a [k]: that's a match. So: left-phonetic-context mismatches, right-phonetic-context matches. We're just going to sum up penalties for all of those mismatches.

We should know enough phonetics to know that some linguistic contexts have a bigger effect on sound - and more importantly on perception of that sound - than others. So, we need to capture that difference in importance between the different features. The simplest form of the Independent Feature Formulation target cost considers all the features - it considers them to be independent - and it just sums up the number of mismatches. The only way of weighting one against another is to apply these weights as we sum up those mismatches. So, for example, in Festival's multisyn unit selection module, these are the weights. We can see, for example, that a mismatch in left-phonetic-context incurs a slightly higher penalty than a mismatch in right-phonetic-context. That's capturing our knowledge of co-articulation: that left context has a stronger effect on the current sound than the right context. Where do these weights come from?
Well they're set by hand, by listening to a lot of synthetic speech and tuning the weights. That's quite hard to do; that's obviously a very skilled thing. But currently that's the best method for picking those weights. Festival has a couple of special things in its target cost that aren't really part of the target cost itself: it's just a convenient way of implementing something. They're there to detect problems with the database. We're going to come back to that when we talk about the database, and we can see where these pseudo-features come from. They're to do with the automatic labelling of the database, in fact. Don't worry about these for now. Concentrate on these features: phonetic context and the prosodic context. Those are the ones produced by the front end and those are the ones used to choose between competing candidates from different linguistic contexts.

Let's work through an example to make that crystal clear. Let's just take the main features that are produced by the front end and forget these special values that Festival uses to detect problems in the database. So here they are, and their weights. I'm going to consider a single target position in the sentence I'd like to say. That's its linguistic specification. I've got two competing candidates, each with their linguistic specifications. We're going to look at the match / mismatch between each candidate in turn and that target specification, and compute the target cost for each of them. It's just a simple process of deciding if there's a mismatch and noting that. Let's do candidate 1 first. For candidate 1: stress matches, syllable position mismatches, word position matches, Part Of Speech matches, phrase position matches, left-context matches, but right-phonetic-context mismatches. I will do the same for candidate 2 separately: stress mismatches, syllable position matches, word position matches, Part Of Speech mismatches, phrase position matches, left-phonetic-context mismatches, right-phonetic-context matches. Candidate 1 has two mismatches, but we need to do a weighted sum to take into account the relative importance of those mismatches. The syllable position mismatch incurs a penalty of 5 and the right-phonetic-context mismatch incurs a penalty of 3, giving us a total of 8. Separately for candidate 2: that stress mismatch incurs a penalty of 10, the Part Of Speech mismatch costs 6, and the left-phonetic-context mismatch costs 4, giving us a total of 20 (this calculation is written out as a short code sketch a little further down).

Now remember, we don't simply use these two values to choose between these two candidates, because we don't yet know how their waveforms will concatenate with the candidates left and right of them in the lattice. Those costs (those target costs) just go into the lattice and become part of the total cost of all the different paths passing through each of these candidates. As with most of the examples in this part of the course, I'm drawing my lattice in terms of whole phones because it's neater. Let's draw a picture of what it would be like for diphone units, just so we see that it can be done. For diphone units, I'll run the front end in the same way. I've rewritten the segments as diphones. So that's now my target sequence, and I'm going to go and retrieve diphone candidates from the database. Each of these candidates has a waveform, and of course also has a linguistic specification. In the Independent Feature Formulation, it's only the linguistic specification that we're going to use for comparison.
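Here is the whole-phone calculation referred to above, written out as a small weighted-sum sketch in Python. This is not Festival's actual multisyn code: the feature names are stand-ins, and the weights are simply the ones quoted in the example (word position and phrase position matched for both candidates, so their values don't matter here and are assumed).

```python
# The Independent Feature Formulation target cost as a weighted sum of binary
# mismatches, checked against the worked example: candidate 1 should cost 8,
# candidate 2 should cost 20. (Sketch only; names and data are illustrative.)

WEIGHTS = {
    "stress": 10,
    "syllable_position": 5,
    "word_position": 5,      # assumed value; both candidates matched on this
    "part_of_speech": 6,
    "phrase_position": 5,    # assumed value; both candidates matched on this
    "left_context": 4,       # left context penalised slightly more than right,
    "right_context": 3,      # reflecting co-articulation
}

def target_cost(target: dict, candidate: dict, weights: dict = WEIGHTS) -> int:
    """Sum the weight of every mismatching feature; each feature either matches
    exactly (penalty 0) or mismatches (full penalty) - there is no near match."""
    return sum(w for feature, w in weights.items()
               if target.get(feature) != candidate.get(feature))

# For brevity, encode only whether each feature matches the target (True) or not.
target = {f: True for f in WEIGHTS}
candidate_1 = dict(target, syllable_position=False, right_context=False)
candidate_2 = dict(target, stress=False, part_of_speech=False, left_context=False)

print(target_cost(target, candidate_1))  # 5 + 3 = 8
print(target_cost(target, candidate_2))  # 10 + 6 + 4 = 20
```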
Let's again focus in on one particular target position: we'd like to say this diphone. We have two available candidates. We know the recorded utterances that each of those candidates came from: "They saw each other for the first time in Boston." So the top candidate there came from that utterance. "They ran the canoe in and climbed the high earth bank." The other candidate came from that utterance. For each of those utterances (these are in the database) we've got natural recorded speech plus a complete linguistic specification. We can see that we're always matching on the base unit type: that's always an exact match; that's how we retrieve the candidates, just by looking at that. Then we can look at other features around them. Now the calculation of target cost for diphones is just a little bit messier because we do it separately for the left and the right halves. That's because some features might actually differ going through the diphone: it might cross (for example) a syllable or word boundary. The left half of the diphone might be in a different Part Of Speech to the right half. We just calculate the target cost as two sub-costs: the left and right halves, and then add those together (there's a short code sketch of this below).

An Independent Feature Formulation target cost is really rather simple. In fact, it's a bit too simple. If we return to this example that we just worked through, we can see that there's a problem with the Independent Feature Formulation type of target cost. It's a bit too simplistic - it's too naive - and the simplicity is because we've treated the features as independent for the purposes of calculating the target cost. The target cost function doesn't consider two rather important things. One thing that it fails to consider is combinations of features. For example, there might be interactions between the stress status of a syllable and whether it's phrase final or not. Both of those are competing to affect F0, but this function just considers them independently, and just accumulates the penalties. The other oversimplification of this function is that things strictly match or mismatch: it's a binary distinction. There's no concept of a "near match". So there's no distance: things are either exactly the same (incurring zero penalty) or different (incurring the maximum penalty: the weight in that column). There's an example in the table of where a "near match" might be OK. Candidate 2 came from a left-phonetic-context of [v]. This has got a fairly similar place of articulation to the desired (the target) left-phonetic-context of [b]. They're both also voiced. So we might prefer to take candidates from [v] left-phonetic-contexts rather than from radically different ones, like a liquid. However, this still incurred the maximum penalty of 4 here. It would be better if we could soften that somewhat and say that that's a "near match" and maybe there should be a lower penalty in that case. This function is unable to do that.

So, that's pretty much all there is to the Independent Feature Formulation. We're working with features that have already been produced by our front-end. That's super-convenient and is also going to be computationally quick. We've already had to do all of that front-end processing, so those features are things we already have: they come "for free". Those are calculations we had to do to disambiguate pronunciation (for example). So we're just deriving simple symbolic features from - in Festival's case - the existing utterance structure, or more generally the linguistic specification.
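As mentioned above, the diphone version only changes the bookkeeping: the comparison is made separately for the left and right half-phones (whose features can differ, for example across a word boundary) and the two sub-costs are added. A minimal sketch of that bookkeeping, with assumed field names rather than multisyn's actual data structures:

```python
# Sketch: score each half of the diphone against the corresponding half of the
# target, using the same weighted mismatch sum as before, then add the sub-costs.

def half_cost(target_feats: dict, candidate_feats: dict, weights: dict) -> int:
    return sum(w for feature, w in weights.items()
               if target_feats.get(feature) != candidate_feats.get(feature))

def diphone_target_cost(target: dict, candidate: dict, weights: dict) -> int:
    """Target cost for a diphone unit = left-half sub-cost + right-half sub-cost."""
    return (half_cost(target["left_half"], candidate["left_half"], weights)
            + half_cost(target["right_half"], candidate["right_half"], weights))
```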
So computation of this is going to be cheap. Of course a weighted sum of mismatches is very cheap to compute. So this target cost function will be fast. That's good! It makes some dramatic simplifications though. Nevertheless, it will work. This is pretty much what Festival does. It's almost a pure Independent Feature Formulation target cost function in Festival.

Now, we didn't make any acoustic predictions at all in computing this target cost. The function simply worked with symbolic features. The symbolic features could optionally include symbolic prosodic features. So, if the front-end can predict them with sufficient accuracy - for example we might attempt to predict ToBI accents and boundary tones - we will have these symbolic features that capture prosody. Those can be taken into account when selecting candidates from the database. Of course, we'll have to annotate the database with the same features.

But what if we don't have that? What if we don't explicitly mark up prosody even symbolically on either the target or the candidates in the database? How on earth will we get any prosody at all? How is prosody created using such a cost function? Well, very simply by choosing candidates from an appropriate context - for example, phrase final - we'll get appropriate prosody automatically. That's the same principle that we use to get the correct phonetic co-articulation or the correct syllable stress. We'll get prosody simply by choosing candidates essentially from the right position in the prosodic phrase. Therefore, all we really need to do to get prosody is to make sure that the linguistic features from our front end capture sufficient contextual information relevant to prosody. An awful lot of that rests simply on position-within-prosodic-constituents: where the syllable is within the word, within the phrase, ... That will get us prosody. Optionally - and I say optionally because predicting prosody even symbolically is very error-prone - optionally, we could attempt to predict prosody and then use that as part of the cost function as just another linguistic feature. It would have to have its own weight.

I just stated that an Independent Feature Formulation makes no attempt to make any acoustic predictions whatsoever about the target. It simply gets the candidates, and whatever acoustic properties they have, that's what the synthetic speech has. But thinking about the system as a whole, of course we are making predictions about the acoustics, because we're generating synthetic speech. It's just implicit in the procedure. Whilst the cost function itself only deals with symbolic features, the output of the system is synthetic speech and that of course has acoustic properties. Taken as a whole, the database, the target cost function, the search for the best candidate sequence: that whole complex system is making acoustic predictions. It's a complicated sort of regression from the linguistic specification to a speech waveform. However, it's completely implicit. There's an advantage to being completely implicit. We don't need to make explicit acoustic predictions, so we don't need complicated models for that - models that will make mistakes. We just get natural output. There's also a weakness: we can't really inspect the system. We can't really see how it's making this acoustic prediction. All we can do is indirectly control that by, for example, changing the weights in the target cost.
So, what we're going to move on to now is a different formulation of the target cost function: something that does make some acoustic predictions - explicit predictions of actual acoustic properties - and then measures the difference between target and candidate in that acoustic space. That's the Acoustic Space Formulation. That's going to help get us out of a sparsity problem. But also we can then observe those acoustic predictions: measure their accuracy in an objective sense. That might help us improve the system in a way that's rather opaque in the Independent Feature Formulation.
Independent Feature Formulation
This is the simplest form of target cost function, because no prediction of any acoustic properties is involved.