I think this is a good time to orient ourselves again, to check that we understand where we are in the bigger picture of unit selection. What we should understand so far is that a unit selection speech synthesizer has the same front end as any other synthesizer. So we run that text processor. From the linguistic specification, we construct a target sequence, essentially flattening down the specification onto the individual targets. For each target unit, we retrieve all possible candidates from the database. We just go for an exact match on the base unit type, and we hope for variety in all of the other linguistic features. For each candidate, we compute a target cost. So far, we understand the Independent Feature Formulation style of target cost. It's just a weighted sum. The weights are the penalties for each mismatched linguistic feature. We compute join costs and perform a search.

We're going to look at a more sophisticated form of target cost now, where we predict some acoustic properties for the targets and compare those with the actual acoustic properties of candidates. The motivation for that is the weakness of the Independent Feature Formulation. That compares only linguistic features: symbolic things produced by the front end. That's computationally efficient, but it creates an artificial form of sparsity. We could summarize this weakness by thinking about a single candidate for a particular target position. The target and the candidate may have differing (in other words, mismatched) linguistic features. Yet the candidate could be ideal for that position. It could sound very similar to the ideal target. It's very hard to get round that when we're only looking at these linguistic features. It's very hard to detect which combinations of features lead to the same sound. What we need to do is to compare how the units sound. We want to compare how a candidate actually sounds (because we have its waveform) with how we think - or how we predict - a target should sound. That's going to involve making a prediction of the acoustic properties of the target.

Taylor tries to summarize this situation in one diagram. Let's see if we can understand what's going on in this diagram. For now, let's just think about the candidates, which have both linguistic specifications and actual acoustic properties. What Taylor is saying with this diagram is that it's possible for two different speech units to have very different linguistic features: this one and this one. These have maximally-different linguistic features: they mismatch in both stress and phrase finality. (We're just considering those two dimensions in this picture.) Yet it's possible that they sound very similar. The axes of this space are acoustic properties, and these two units lie very close to each other in acoustic space. This is completely possible: two things can have different linguistic specifications but sound very similar. To fully understand the implications of this, we also need to think about the target units. At synthesis time, our target units do not have acoustic properties, because they're just abstract linguistic structures. We're trying to predict the acoustic properties. There is some ideal acoustic property of each target, and so the same situation could hold.
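Before we go any further with the diagram, here is a minimal sketch of that weighted-sum target cost. The feature names and weight values below are invented purely for illustration, not taken from any particular system; a real system would use many more features, with carefully tuned weights.

```python
# Minimal sketch of an Independent Feature Formulation target cost:
# a weighted sum of penalties, one per mismatched linguistic feature.

def iff_target_cost(target_features, candidate_features, weights):
    """Weighted sum of penalties for mismatched linguistic features."""
    cost = 0.0
    for feature, weight in weights.items():
        if target_features.get(feature) != candidate_features.get(feature):
            cost += weight  # penalty only when the symbolic features mismatch
    return cost

# Hypothetical example: the candidate mismatches the target in stress and in
# the identity of the previous phone, but matches in phrase finality.
target    = {"stress": "+", "phrase_final": "-", "prev_phone": "k"}
candidate = {"stress": "-", "phrase_final": "-", "prev_phone": "t"}
weights   = {"stress": 1.0, "phrase_final": 2.0, "prev_phone": 0.5}

print(iff_target_cost(target, candidate, weights))  # 1.0 + 0.5 = 1.5
```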
It could be the case that we are looking for a target that has this linguistic specification, and - using the Independent Feature Formulation - this potential candidate here would be very far away. It would incur a high target cost: it mismatches twice (both features). These other possible candidates would appear to be closer in linguistic feature space. But if it's the case that [stress -] and [phrase-finality +] happens to sound very similar to [stress +] and [phrase-finality -], then we shouldn't consider these two things here to be far apart at all. The only way to discover that is actually to go into acoustic space and measure the distance there: measure this distance between these two things. Because, in linguistic feature space, we won't be able to detect that they would have sounded similar.

Unfortunately, Taylor fails to label his axes. That's probably deliberate, because he's trying to say this is an abstract acoustic space. c1 and c2 could be any dimensions of acoustic space that you want. It might be that this one is duration and this is some other acoustic property, or maybe the other way round; it doesn't really matter. The point is that, in acoustic space, things might be close together even though, in linguistic space, they're far apart. They're apparently linguistically different, but they are acoustically interchangeable. That interchangeability is the foundation of unit selection: it's what we're trying to discover.

Now, for our target units to move closer to the candidates (which are acoustic things), we need to predict some acoustic properties for the targets. We don't necessarily need to predict a speech waveform, because we're not going to play back these predicted acoustic properties. We're only going to use them to choose candidates. So we really don't need a waveform, and neither do we need to predict every acoustic property. We just need to predict sufficient properties to enable a comparison with candidate units.

Let's try to make this clearer with a picture. Back to this diagram again. Again, just for the purpose of explanation, our units are phone-sized. These candidates here are fully-specified acoustic recordings of speech. We have waveforms, from which we could estimate any acoustic properties: we can measure duration; we could estimate F0; we could look at the spectral envelope or formants if we wanted. The targets are abstract linguistic specifications only, with no acoustics. So far, we only know how to compare them in terms of linguistic features, which both have. What we're going to do now is try to move the target units closer to the space in which the candidates live: we're going to give them some acoustic properties. Let's just think of one: let's imagine adding a value for F0 to all of the target units. We also know F0 for all the candidates. It will then be easy to make a comparison between that predicted F0 and the true F0 of a candidate. We would compare these things, and we could do that for any acoustic properties we liked.

Now, what acoustic features are we going to try and add to our targets? Well, we have a choice. We could do anything we like. We could predict simple acoustic things such as F0: in other words, have a model of prosody that predicts values of F0. Equally, we could predict values for duration or energy (all correlates of prosody). So: we'd need a predictive model of prosody.
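Here is a minimal sketch of the kind of predictive model we have in mind, using a small regression tree (a model type we come back to in a moment). The training data is tiny and invented; a real system would train on linguistic features and F0 values measured from the recorded database.

```python
# A toy predictive model of prosody: predict F0 for a target from its
# linguistic features.
from sklearn.tree import DecisionTreeRegressor

# Each row encodes [stressed?, phrase-final?, vowel?] as 0/1 (made up).
X_train = [
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]
y_train = [190.0, 160.0, 140.0, 120.0]  # F0 in Hz (made up)

model = DecisionTreeRegressor(max_depth=2).fit(X_train, y_train)

# At synthesis time: predict F0 for a target from its linguistic features,
# ready to be compared with the measured F0 of each candidate.
print(model.predict([[1, 0, 1]]))  # e.g. a stressed, non-final vowel
```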
We'd have to build that model, train it, put it inside the front end, and run it at synthesis time. It would produce values for these things, which we could compare to the true acoustic values of the candidates. We could go further. We could predict a much more detailed specification: maybe even the full spectral envelope. Typically, we're going to encode the envelope in some compact way that makes it easy to compare to the candidates; cepstral coefficients would be a good choice there.

It would seem that the more detail we can predict, the better, because we can make more accurate comparisons to the candidates. That's true in principle. However, these are all predicted values, and the predictions will have errors. The more detailed the predictions need to be - for example, the full spectral envelope - the less certain we are that they're correct. It gets harder and harder to predict those things. So all of this is only going to work if we can rather accurately predict these properties. If we don't think we can accurately predict them, we're better off with the Independent Feature Formulation. Indeed, that's why the earlier systems used the Independent Feature Formulation: we didn't have sufficiently powerful statistical models or good enough data to build accurate predictors of anything else. But we've got better at that. We have better models, and so today we could indeed envisage predicting a complete acoustic specification - in fact, all the way to the waveform if you wanted.

How would we do that? Well, it's a regression problem! We've got inputs: the linguistic features that we already have from the front end for our targets. We have a thing we're trying to predict: it could be F0, duration, energy, MFCCs, ... anything that you like. So you just need to pick your favourite regression model. Here's one you know about: the fantastic Classification And Regression Tree. We'll run it in regression mode, because we're going to predict continuously-valued things. For example, the leaves of this tree might have actual values for F0. We'll write the values in here, and these would be the predicted values for targets with the corresponding linguistic features. It's not the greatest model in the world, but it's one we know how to use. If you don't like that one, pick any other model you like. Maybe you could have a neural network. That would work fine as well, or any other statistical model that can perform regression.

We're actually going to stop talking about the Acoustic Space Formulation now, because we're getting very close to statistical parametric synthesis. That's coming later in the course. When we've fully understood statistical parametric synthesis - which uses models such as trees or neural networks - we can then come full circle and use that same statistical model to drive the target cost function, and therefore to do unit selection. We call that a "hybrid method".

Let's wrap up our discussion of the target cost function. We initially made a hard distinction between two different sorts of target cost function: the Independent Feature Formulation, based strictly on linguistic features, and the Acoustic Space Formulation, based strictly on comparing acoustic properties. Of course, we don't need that artificial separation. We could have both sorts of features in a single target cost function. There's no problem with that. We could sum up the differences in linguistic features, plus the absolute difference in F0, plus the absolute difference in duration, and so on, all weighted accordingly.
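For example, such a combined cost might look like the following sketch; again, the feature names, values and weights are invented purely for illustration.

```python
# Sketch of a mixed-style target cost: penalties for mismatched linguistic
# features, plus weighted absolute differences between acoustic properties
# predicted for the target and measured on the candidate.

def mixed_target_cost(target_ling, cand_ling, ling_weights,
                      target_acoustic, cand_acoustic, acoustic_weights):
    cost = 0.0
    for feature, w in ling_weights.items():            # linguistic sub-costs
        if target_ling.get(feature) != cand_ling.get(feature):
            cost += w
    for prop, w in acoustic_weights.items():           # acoustic sub-costs
        cost += w * abs(target_acoustic[prop] - cand_acoustic[prop])
    return cost

cost = mixed_target_cost(
    {"stress": "+", "phrase_final": "-"}, {"stress": "-", "phrase_final": "-"},
    {"stress": 1.0, "phrase_final": 2.0},
    {"f0_hz": 180.0}, {"f0_hz": 165.0},
    {"f0_hz": 0.02},
)
print(cost)  # 1.0 (stress mismatch) + 0.3 (F0 difference) = 1.3
```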
There's absolutely no problem, then, in building a mixed-style target cost function where we have a whole set of sub-costs: some of them using linguistic features, some using acoustic features; and we have to set the weights. It's going to be even more difficult to set the weights than in the Independent Feature Formulation case, but we'd have to do it somehow. Then we can combine the advantages of both types of sub-cost. The Independent Feature Formulation inherently suffers from extreme sparsity, and so predicting some acoustic features lets us escape some of the sparsity problems inherent in that formulation. However, we don't know how to predict every single acoustic property. For example, things happen phrase-finally other than changes in F0, duration and amplitude, and some of them - creaky voice, for example - aren't easily captured even by the spectral envelope. It would be very difficult to go all the way to a full prediction of that and then find candidates with that property. We'd probably be a lot better off using linguistic features, such as "Is it phrase-final?", and pulling candidates that are phrase-final and so would automatically have creaky voice where appropriate. So, using linguistic features still has a place. We'd probably use them alongside acoustic properties.

Finally, of course, we should always remember that all of these features have errors in them. Even the linguistic features from the front end will occasionally be wrong. The acoustic predictions will have an intrinsic error in them: the more detailed those predictions, the harder the task, so the greater the error. Everything has errors, and we need to take that into account.

To summarize: the Independent Feature Formulation uses rather more robust, perhaps slightly less error-prone, features from the front end, but it suffers from extreme sparsity problems. The Acoustic Space Formulation gets us over some of the sparsity problems, but we run into problems with the accuracy of predicting acoustic properties. So, many real systems use some combination of these two things: a target cost function that combines linguistic features and acoustic properties.
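As a final illustration of why that combination helps, here is a sketch of ranking two candidates for one target position with such a combined cost: the candidate that mismatches linguistically but has a similar F0 beats the one that matches linguistically but sounds quite different. All values and weights are, once again, made up, and a real system would do this inside the search, together with join costs.

```python
# Sketch: choosing the best candidate for one target position by combining
# linguistic mismatch penalties with an acoustic (F0) sub-cost.

def combined_cost(target, cand):
    cost = 0.0
    if target["stress"] != cand["stress"]:
        cost += 1.0                                      # linguistic penalty
    if target["phrase_final"] != cand["phrase_final"]:
        cost += 2.0                                      # linguistic penalty
    cost += 0.02 * abs(target["pred_f0"] - cand["f0"])   # acoustic sub-cost
    return cost

target = {"stress": "+", "phrase_final": "-", "pred_f0": 180.0}
candidates = [
    {"stress": "-", "phrase_final": "-", "f0": 178.0},  # mismatched, but sounds right
    {"stress": "+", "phrase_final": "-", "f0": 120.0},  # matched, but sounds wrong
]
best = min(candidates, key=lambda c: combined_cost(target, c))
print(best)  # the linguistically mismatched candidate wins: cost 1.04 vs 1.20
```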
Acoustic Space Formulation
The IFF suffers from sparsity problems, which we can try to overcome by making the comparisons between targets and candidates in terms of acoustic properties.