Target cost and join cost

To choose between the many possible sequences of candidate units, we need to quantify how good each possible sequence will sound.

We've retrieved from the database a number of possible candidate waveform fragments to use in each target position. The task now is to choose amongst them. There are many, many possible sequences of candidates, even for this very small example here. Let's just pick one of them for illustration... and you can imagine how many more there are. We want to measure how well each of those will sound. We want to quantify it: put a number on it. Then we're going to pick the one that we predict will sound the best.

So, what do we need to take into account when selecting from amongst those many, many possible candidate sequences? Perhaps the most obvious one is that, when we're choosing a candidate - let's say for this position - from these available candidates, we could consider the linguistic context of the target: in other words, its linguistic environment in this target sentence. We could consider the linguistic environment of each individual candidate, and measure how close they are. We're going to look at the similarity between a candidate and a target in terms of their linguistic contexts.

The motivation for that is pretty obvious. If we could find candidates from identical linguistic contexts to those in the target unit sequence, we'd effectively be pulling out the entire target sentence from the database. Now, that's not possible in general, because there's an infinite number of sentences that our system will have to synthesize. So we're not (in general) going to find exactly-matched candidate units, measured in terms of their linguistic context. We're going to have to use candidate units from mismatched, non-identical linguistic contexts. So we need a function to measure this mismatch. We need to quantify it.

We're going to do that with a function. The function is going to return a cost (we might call that a distance). The function is called the target cost function. A target cost of zero means that the linguistic context - measured using whatever features are available to us - was identical between target and candidate. That's rarely (if ever) going to be the case, so we're going to try and look for ones that have low target costs.

The way I've just described that is in terms of linguistic features: effectively counting how many linguistic features (for example left phonetic context, or syllable stress, or position in phrase) match and how many mismatch. The number of mismatches will lead us to a cost. Taylor, in his book, proposes two possible formulations of the target cost function. One of them is what we've just described: it basically counts up the number of mismatched linguistic features. He calls that the "Independent Feature Formulation" because a mismatch in one feature and a mismatch in another feature both count independently towards the total cost. The function won't do anything clever about particular combinations of mismatch.

Another way to think about measuring the mismatch between a candidate and a target is in terms of their acoustic features, but we can't do that directly, because the targets don't have any acoustic features. We're trying to synthesize them. They're just abstract linguistic specifications at this point.
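Before turning to acoustic measures, here's a minimal sketch of the Independent Feature Formulation just described, in Python. It is not Festival's actual implementation: the particular features and weights are invented for illustration; the key point is that each mismatched feature contributes to the cost independently.

```python
# Hypothetical linguistic context features; a real system uses many more.
FEATURES = ["left_phone", "right_phone", "syllable_stress", "position_in_phrase"]

# Illustrative weights: how much a mismatch in each feature is assumed to matter.
WEIGHTS = {"left_phone": 1.0, "right_phone": 1.0,
           "syllable_stress": 0.5, "position_in_phrase": 0.25}

def target_cost(target, candidate):
    """Independent Feature Formulation: each mismatched linguistic feature
    adds its own weight to the cost, independently of the other features."""
    cost = 0.0
    for f in FEATURES:
        if target[f] != candidate[f]:
            cost += WEIGHTS[f]
    return cost  # 0.0 means an identical linguistic context

# A candidate from an identical context costs 0; one differing only in
# syllable stress costs 0.5.
target = {"left_phone": "k", "right_phone": "t",
          "syllable_stress": 1, "position_in_phrase": "initial"}
candidate = dict(target, syllable_stress=0)
print(target_cost(target, candidate))  # 0.5
```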
So, if we wanted to measure the difference between a target and a candidate acoustically (which really is what we want to do: we want to know if they're going to sound the same or not), we would have to make a prediction about the acoustic properties of the target units. The target cost is a very important part of unit selection, so we're going to devote a later part of the course to that, and not go into the details just at this moment. All we need at this point is to know that we can have a function that measures how close a candidate is to the target. That closeness could be measured in terms of whether they have similar linguistic environments, or whether they "sound the same". That measure of "sounding the same" involves an extra step of making some prediction of the acoustic properties of those target units.

Measuring similarity between an individual candidate and its target position is only part of the story. What are we going to do with those candidates after we've selected them? We're going to concatenate their waveforms, play that back, and hope a listener doesn't notice that we've made a new utterance by concatenating fragments of other utterances. Often, the most perceptible artefacts we get in unit selection synthesis are those concatenation points, or "joins". Therefore, we're going to have to quantify how good each join is, and take that into account when choosing the sequence of candidates. So the second part of quantifying the best-sounding candidate sequence is to measure this concatenation quality.

Let's focus on this target position, and let's imagine we've decided that this candidate has got the lowest overall target cost. It's tempting just to choose that - because we'll make an instant local decision - and then repeat that for each target position, choosing its candidate with the lowest target cost. However, that fails to take into account whether this candidate will concatenate well with the candidates either side. So, before choosing this particular candidate, we need to quantify how well it will concatenate with each of the things it needs to join to. The same will be true to the left as well. We can see now that the choice of candidate in this position depends (i.e., it's going to change, potentially) on the choice of candidate in the neighbouring positions. So, in general, we're going to have to measure the join cost - the potential quality of the concatenation - between every possible pair of units... and so on for all the other positions. So we have to compute all of these costs, and they have to be taken into account when deciding which overall sequence of candidates is best.

Our join cost function has to make a prediction about how perceptible the join will be. Will a listener notice there's a join? Why would a listener notice there's been a join in some speech? Well, that's because there'll be a mismatch in the acoustic properties around the join. That mismatch - that change in acoustic properties - will be larger than is normal in natural connected speech. For example, sudden discontinuities in F0 don't happen in natural speech. So, if they do happen in synthetic speech, they are likely to be heard by listeners. Our join cost function is going to measure the sorts of things that we think listeners can hear.
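Before looking at what the join cost actually measures, it may help to make the bookkeeping explicit. This is only a sketch: `target_cost` and `join_cost` stand for whichever functions we end up using, and the exhaustive enumeration of sequences is purely for illustration - a real system would search the lattice of candidates far more efficiently than trying every sequence.

```python
from itertools import product

def sequence_cost(targets, candidates, target_cost, join_cost):
    """Total cost of one candidate sequence: each candidate's target cost,
    plus the join cost between every adjacent pair of chosen candidates."""
    total = sum(target_cost(t, c) for t, c in zip(targets, candidates))
    total += sum(join_cost(left, right)
                 for left, right in zip(candidates, candidates[1:]))
    return total

def best_sequence(targets, candidate_lists, target_cost, join_cost):
    """Pick the sequence with the lowest total cost by trying every
    combination - fine for a toy example, hopeless at scale."""
    return min(product(*candidate_lists),
               key=lambda seq: sequence_cost(targets, seq, target_cost, join_cost))
```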
The obvious things for the join cost to measure are the pitch (or the underlying physical property: fundamental frequency, F0), the energy - if speech suddenly gets louder or quieter in an unnatural way, we will notice that - and, more generally, the overall spectral characteristics. Underlying all of this there's an assumption: that measuring acoustic mismatch is a prediction of the perceived discontinuity that a listener will experience when listening to this speech. If we're going to use multiple acoustic properties in the join cost function, then we have to combine those mismatches in some way. A typical way is the way that Festival's Multisyn unit selection engine works: measure the mismatch in each of those three properties separately and then sum them together. Since some might be more important than others, there'll be some weights. So, it'll be a weighted sum of mismatches.

It's also quite common to inject a little bit of phonetic knowledge into the join cost. We know that listeners are much more sensitive to some sorts of discontinuities than others. A simple way of expressing that is to say that they are much more likely to notice a join in some segment types than in other segment types. For example, making joins in unvoiced fricatives is fairly straightforward: the spectral envelope doesn't have much detail, and there's no pitch to have a mismatch in. So we can quite easily splice those things together. Whereas perhaps in a more complex sound, such as a liquid or a diphthong, with a complex and changing spectral envelope, it's more difficult to hide the joins. So, very commonly, join costs will also include some rules which express phonetic knowledge about where the joins are best placed.

Here's a graphical representation of what the join cost is doing. We have a diphone on the left and a diphone on the right (or, in our simple example, just whole phones). We have their waveforms, because these are candidates from the database. Because we have their waveforms, we can extract any acoustic properties that we like. In this example, we've extracted fundamental frequency, energy and the spectral envelope, plotted here as a spectrogram. We could parameterize that spectral envelope any way we like. This picture is using formants to make things obvious. More generally, we wouldn't use formants: they're rather hard to track automatically. We'd use a more generalized representation like the cepstrum.

We're going to measure the mismatch in each of these properties. For example, the F0 is slightly discontinuous, so that's going to contribute something to the cost. The energy is continuous here, so there's very low mismatch (so, low cost) in the energy. We're similarly going to quantify the difference in the spectral envelope just before the join and just after the join. We're going to sum up those mismatches with some weights that express their relative perceptual importance.

That's a really simple join cost. It's going to work perfectly well, although it's a little bit simple. Its main limitation is that it's extremely local. We just took the last frame (maybe 20ms) of one diphone and the first frame (maybe the first 20ms) of the next diphone (the next candidate that we're considering concatenating), and we're just measuring the very local mismatch between those. That will fail to capture things like sudden changes of direction: maybe F0 has no discontinuity, but in the left diphone it was increasing and in the right diphone it was decreasing.
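Here's a minimal sketch of that simple, local join cost. It is in the spirit of what was just described rather than a copy of Multisyn's code: the weights, the use of a Euclidean cepstral distance, and the crude fricative rule are all illustrative assumptions, as is the data layout of the candidates.

```python
import numpy as np

# Illustrative weights expressing the assumed perceptual importance of each mismatch.
W_F0, W_ENERGY, W_SPECTRAL = 1.0, 0.5, 2.0

# A crude bit of phonetic knowledge: joins inside unvoiced fricatives are easy to hide.
UNVOICED_FRICATIVES = {"f", "s", "sh", "th", "h"}

def join_cost(left, right):
    """Weighted sum of acoustic mismatches measured very locally:
    the last frame of the left candidate vs the first frame of the right one.
    Each candidate is assumed to be a dict of per-frame acoustic features,
    plus the phone at the join (a made-up layout for this sketch)."""
    f0_mismatch = abs(left["f0"][-1] - right["f0"][0])
    energy_mismatch = abs(left["energy"][-1] - right["energy"][0])
    # Euclidean distance between cepstral (e.g. MFCC) frames stands in for
    # spectral envelope mismatch.
    spectral_mismatch = np.linalg.norm(np.asarray(left["cepstrum"][-1]) -
                                       np.asarray(right["cepstrum"][0]))

    cost = (W_F0 * f0_mismatch
            + W_ENERGY * energy_mismatch
            + W_SPECTRAL * spectral_mismatch)

    # Discount the cost where listeners are unlikely to notice the join.
    if left["join_phone"] in UNVOICED_FRICATIVES:
        cost *= 0.5
    return cost
```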
A sudden change from increasing to decreasing like that will also be unnatural: listeners might notice. So we could improve the join cost: we could take several frames around the join and measure the cost across multiple frames. We could look at the rate of change (the deltas). Or we could generalize that much further and build some probabilistic model of what trajectories of natural speech parameters normally look like, compare that model's prediction to the concatenated diphones, and measure how natural they are under this model.

Eventually we are going to go there: we're going to have a statistical model that does that for us. But we're not ready for that yet, because we don't know about statistical models, so we're going to defer it for later. Once we've understood statistical models and how they can be used to synthesize speech themselves, we'll come back to unit selection and see how a statistical model can help us compute the join cost, and in fact also the target cost. When we use a statistical model underlying our unit selection system, we call that "hybrid synthesis". But that's for later: we'll come back to that.
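Before leaving this topic, here's a small sketch of the multi-frame idea from above, applied just to F0: compare not only the values either side of the join but also their rate of change (the deltas), so that a sudden change of direction is penalised even when the values themselves line up. The weighting is, again, made up for the example.

```python
import numpy as np

def multiframe_f0_join_cost(left_f0, right_f0, w_delta=1.0):
    """Mismatch in F0 at the join, plus mismatch in its local slope (delta),
    using a few frames either side of the join."""
    l = np.asarray(left_f0, dtype=float)   # last few F0 frames of the left candidate
    r = np.asarray(right_f0, dtype=float)  # first few F0 frames of the right candidate

    value_mismatch = abs(l[-1] - r[0])                     # discontinuity in F0 itself
    delta_mismatch = abs((l[-1] - l[-2]) - (r[1] - r[0]))  # change of direction / slope

    return value_mismatch + w_delta * delta_mismatch

# F0 rising on the left, falling on the right: no discontinuity in value,
# but a clear slope mismatch, which a single-frame measure would miss.
print(multiframe_f0_join_cost([100, 105, 110], [110, 105, 100]))  # 10.0
```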