I'm going to conclude this discussion of unit selection speech synthesis, including the different forms that the target cost function can take, with a summary of the design choices you have when building a new system.

The first choice is what kind of unit to use. All my diagrams used whole phones, although that's not really a sensible choice in practice. Far more commonly, we'll find systems using diphones or half-phones. In either case - but especially the half-phone case - the "zero join cost trick" is very important: it effectively gives us larger units, sometimes much larger units. It's really easy to implement. You just remember which candidates were contiguous in the database and define their join cost to be zero, rather than having it calculated by the join cost function. For example, if these units were contiguous, we just write a zero join cost between them. The lattice is formed as usual, with all the paths joining everything together. On some of those paths there will be zero join costs; on others the join cost function will compute the cost. It doesn't matter to the search: the search just finds the best overall path.

You also need to choose what kind of target cost you're going to use. Festival uses an almost pure Independent Feature Formulation style target cost; it just has a couple of small pieces of acoustic information in there. Or we could use a purely Acoustic Space Formulation, doing enough partial synthesis that we only make comparisons between acoustic properties. You then have to decide which acoustic properties (which acoustic features) to predict, so that the comparison is meaningful and finds the right candidates. Most common of all, probably, is to do both: to use features, because we have them from the front end and they work well in some situations. For example, a feature like phrase-final is really good at capturing all of the acoustic properties that correlate with phrase finality: lengthening, F0 falling, voice quality changes such as creakiness. Those things aren't all easy to predict; it's better just to take units from the right context in the database. But almost always we'll have some acoustic prediction in there too. We might have a prosody model that "sketches out" an F0 contour we'd like to meet, or a duration model that tells us which candidate durations to prefer.

The join cost then makes sure that we only choose sequences of candidates that will concatenate smoothly and imperceptibly. We didn't say anything about further signal processing, but in many systems (although not in Festival) a little further signal processing is done: for example, manipulating F0 in the locality of a join to make it more continuous. We won't deviate very much from the original natural units: that would degrade quality, and it would also take us further away from the implicit or explicit prediction that we got from the unit selection process.

The search is straightforward dynamic programming and can be done very efficiently. It can be formulated on a lattice, to make it look like this picture here, and we can implement it in any way we like: for example, Token Passing. In a real system, these lists of candidates will be much, much longer: there might be hundreds or thousands of candidates for common diphones.
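To make that concrete, here is a minimal sketch (in Python) of the dynamic programming search over the candidate lattice, including the zero join cost trick. This is not Festival's actual code: the Candidate fields, the Euclidean boundary distance, and the function names are illustrative assumptions.

```python
from dataclasses import dataclass
from math import dist

@dataclass
class Candidate:
    utterance_id: str      # which database utterance the unit was taken from
    index: int             # the unit's position within that utterance
    edge_features: tuple   # e.g. spectral/F0 features at the unit's edges

def join_cost(left: Candidate, right: Candidate) -> float:
    # The "zero join cost trick": units that were contiguous in the database
    # concatenate perfectly, so skip the join cost function entirely.
    if left.utterance_id == right.utterance_id and right.index == left.index + 1:
        return 0.0
    # Otherwise measure acoustic mismatch at the boundary (illustrative: Euclidean distance).
    return dist(left.edge_features, right.edge_features)

def search(targets, candidates, target_cost):
    """Dynamic programming (Viterbi) over the candidate lattice.

    targets[i] is the i-th target unit; candidates[i] is its list of Candidates.
    Returns the candidate sequence with the lowest total target + join cost.
    """
    best = [target_cost(targets[0], c) for c in candidates[0]]
    backpointers = []
    for i in range(1, len(targets)):
        new_best, back = [], []
        for c in candidates[i]:
            # Extend every path from the previous position and keep the cheapest.
            costs = [best[j] + join_cost(p, c) for j, p in enumerate(candidates[i - 1])]
            j_best = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_best] + target_cost(targets[i], c))
            back.append(j_best)
        best, backpointers = new_best, backpointers + [back]
    # Trace back the lowest-cost path.
    j = min(range(len(best)), key=best.__getitem__)
    path = [candidates[-1][j]]
    for i in range(len(targets) - 2, -1, -1):
        j = backpointers[i][j]
        path.append(candidates[i][j])
    return list(reversed(path))
```

Because the target cost function is passed in, the same search works whether that function makes an IFF-style comparison of linguistic features or an ASF-style comparison of acoustic properties.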
With such long candidate lists, the number of paths through the lattice becomes unmanageable: it's too large, and the system will be too slow. It's therefore normal to do some pruning, just as in Automatic Speech Recognition. There are many forms of pruning; the two most common are these. The first is to limit the number of candidates for any target position. That is based only on target cost, computed locally, and we keep perhaps a few hundred candidates for each position (the ones with the lowest target cost). The second is pruning during the search: during the dynamic programming, as paths explore this grid, we apply beam search (just as in Automatic Speech Recognition), comparing all of the paths at any moment to the current best path. Those that are worse than that path - in other words, those whose cost exceeds it by some margin, called the beam - are discarded. As in Automatic Speech Recognition, pruning is an approximation: we're no longer guaranteed to find the lowest-cost candidate sequence. The payoff is speed.

The final design choice - the thing that we're going to cover next - is what to put in our database. So let's finish by looking forward to what's coming up. We need to know a lot more about this database. It has to contain natural speech, and it needs to be from a single speaker, for obvious reasons: we're going to concatenate small fragments of it. But what exactly should we record? How should we record it? Do we need to be very careful about that? And how do we annotate it? We need to know where all of (say) the diphones start and finish, and annotate each of them with their linguistic properties, for use in either an IFF or an ASF type target cost. That's coming next.

After we've built the database, we can move on to a more powerful form of speech synthesis: using a statistical parametric model that generates the waveform entirely with a model. There'll be no concatenation of waveforms; nevertheless, it will still need the database to learn that model. When we come to talk about the database, it will be important to fully understand our target cost - what features it requires, for example - because that will help us decide how to cover all of the permutations of features in the database. When we think about how to annotate the database, we'll probably want to do that automatically, because the database is probably going to be very large.

Finally, we'll come full circle to something called "hybrid synthesis", which is probably best described as unit selection driven by a statistical model. Here's classical unit selection; a hybrid method would take the target sequence of units, replace it with predicted acoustic parameters, and use those to match candidates from the database. The target cost is then in this acoustic space. So we would replace those targets with parameters: for example, F0, or some parametrization of the spectral envelope - here it's something called Line Spectral Pairs. We'd have the candidates as usual from the database, form them into a lattice (here it's called a "sausage", but it's really a lattice), and choose the best path through it. The target cost function would make comparisons between these candidate units and the acoustic properties predicted by a powerful statistical model.
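As a small illustration of the two pruning strategies described above, here is a hedged sketch. The limits (200 candidates, a beam of 50.0) and the function names are arbitrary example values, not taken from any particular system.

```python
def prune_candidates(target, candidates, target_cost, max_candidates=200):
    """Keep only the candidates with the lowest target cost for this position."""
    return sorted(candidates, key=lambda c: target_cost(target, c))[:max_candidates]

def beam_prune(path_costs, beam=50.0):
    """Discard partial paths whose cost exceeds the current best path's cost
    by more than the beam; path_costs maps each active candidate to the cost
    of the best partial path ending in it."""
    best = min(path_costs.values())
    return {cand: cost for cand, cost in path_costs.items() if cost <= best + beam}
```

In the search sketched earlier, prune_candidates would be applied to each candidate list before the dynamic programming starts, and beam_prune inside the loop over target positions.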
Design choices
We'll wrap up with a summary of the various design choices you have to make when building a new unit selection speech synthesiser, and a final look forward to hybrid speech synthesis.