In this module we're going to cover unit selection. We're going to look at the entire system and the key concepts involved: the idea of selecting from a large database of relatively natural recorded speech, and of using some cost functions to choose amongst the many possible sequences of units available in that database. We're going to look at the target cost function and the join cost function that capture these costs. We're not going to look in complete detail at the target cost function, because that will come a little later, but we'll finish off by seeing how a search is conducted efficiently to find the optimal unit sequence.

Before starting on that, you need to make sure you already know the following things.

The first area of importance is a little bit of phonetics. What do we need to know about phonetics? We need to know what a phoneme is: a category of sound. A phone is a realization of that: someone speaking that sound. For the consonants, we need to know that we can describe them using distinctive features. The first feature is place, and that's where in the mouth the sound is made - in particular, for a consonant, where some sort of constriction is formed. In the IPA chart, the places are arranged along the horizontal axis of the table, going from the lips right back to the vocal folds: this dimension is called place. At each place, several different manners of making the sound are possible, and those are arranged on the vertical axis, not in any particular order. For example, we could make a complete closure and then let the sound pressure release in an explosive fashion: that's a plosive. Finally, for any combination of place and manner, we can also vary the activity of the vocal folds: that's voicing. So it's possible to make a further contrast, for example between /p/ and /b/, by vibrating the vocal folds in the case of /b/.

We can describe vowels in a similar way. We can talk about where in the mouth the tongue is positioned. For example: is it near the roof of the mouth, or is it down near the bottom? In other words, how open or closed is the vocal tract made by the tongue? And we can talk about where that positioning takes place: near the front of the mouth or the back of the mouth. These two dimensions, sometimes called height and front-back, characterize the vowel sounds. Here too, there's a third dimension. We can make an additional contrast by slightly extending the length of the vocal tract, by protruding the lips: that's called rounding. So again, there's a set of distinctive features that we can use to describe the vowel sounds: height, front-back, and rounding. These will become useful later on in unit selection, where we have to decide how similar two sounds are.
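As a very rough illustration of that idea, here is a minimal Python sketch. The feature values in the table are the standard phonetic descriptions, but the small phoneme set, the function name, and the simple feature-counting similarity are invented here purely for illustration; the actual target cost function is covered later in the course.

```python
# Toy illustration: describe a few consonants by (place, manner, voicing) and
# count how many features two phonemes share. The feature values are standard
# phonetic descriptions; the similarity measure itself is just an invented
# example, not the target cost used in any real system.

FEATURES = {
    "p": ("bilabial", "plosive", "voiceless"),
    "b": ("bilabial", "plosive", "voiced"),
    "t": ("alveolar", "plosive", "voiceless"),
    "s": ("alveolar", "fricative", "voiceless"),
    "z": ("alveolar", "fricative", "voiced"),
}

def shared_features(a: str, b: str) -> int:
    """Number of matching features between two phonemes (0..3)."""
    return sum(x == y for x, y in zip(FEATURES[a], FEATURES[b]))

print(shared_features("p", "b"))  # 2: same place and manner, differ in voicing
print(shared_features("p", "z"))  # 0: differ in place, manner and voicing
```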
The next thing you need to know about is the source-filter model. This was covered in the course Speech Processing. In a source-filter model, there's a source of sound: that's either the vibration of the vocal folds, or some sound made in the vocal tract, such as the release of a closure, or frication. It's possible, of course, for both of those things to happen at the same time: we might make voiced fricatives, like [v] or [z]. The source of sound goes through a filter: that's the vocal tract. That filter imposes a vocal tract frequency response. One way of describing that frequency response is in terms of the peaks - the resonant frequencies, called formants. Another way is just to talk more generally about the overall spectral envelope. So you need to know something about the source-filter model.

You should also already know about the front end, what's sometimes called the text processor. The front end takes the text input and adds lots of information to it, based on rules, statistical models, or sources of knowledge such as the pronunciation dictionary. This picture captures a lot of the work that's done in the front end. There's the input text, and the front end adds things such as part-of-speech tags, possibly some structural information, possibly some prosodic information, and always some pronunciation information. As we progress, we're going to see that we might find it convenient to attach all of that information to the segment - to the phoneme. So we might end up with structures that look rather flatter, that are essentially context-dependent phonemes. That will become clearer as we go through unit selection. But what you need to know at this stage is where all of that information comes from - where the linguistic specification is made - and that's in the front end.

We've also already covered a more basic form of speech synthesis, called diphone speech synthesis, in which we record one of each type of speech sound. That unit is a rather special speech sound: the second half of one phone and the first half of the consecutive phone, said in connected speech. At synthesis time we perform quite a lot of signal processing to impose, for example, the prosody that we require on that sequence of waveforms.

The final thing that you need to know about, which we will have covered already in automatic speech recognition for example, is dynamic programming. In unit selection, we're going to have to make a choice between many possible sequences of waveform fragments. In other words, we're going to search amongst many possibilities. The search space could be extremely large, so we need an efficient way of doing that search, and we're going to formulate the problem in a way that allows us to use dynamic programming to make that search very efficient.
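To make concrete how target costs, join costs, and dynamic programming fit together, here is a minimal sketch in Python. It is not the implementation used in any particular system: the Candidate fields and the two cost functions are placeholders invented for illustration; only the lattice-of-candidates structure and the Viterbi-style search over it reflect the approach described above.

```python
# Minimal sketch of unit-selection search: for each target position there is a
# list of candidate units from the database; we pick the sequence minimizing
# the sum of target costs and join costs using dynamic programming (Viterbi).
# The features and cost functions here are illustrative placeholders only.

from dataclasses import dataclass

@dataclass
class Candidate:
    unit_id: int       # index into the recorded-speech database (made up here)
    left_f0: float     # hypothetical acoustic features at the unit edges,
    right_f0: float    # used by the placeholder join cost below

def target_cost(target_spec, cand: Candidate) -> float:
    # Placeholder: in reality this compares the linguistic specification of
    # the target (context-dependent phoneme) with that of the candidate.
    return abs(target_spec - (cand.unit_id % 10))

def join_cost(prev: Candidate, cand: Candidate) -> float:
    # Placeholder: in reality this measures acoustic mismatch (e.g. spectral
    # envelope and F0 discontinuity) across the concatenation point.
    return abs(prev.right_f0 - cand.left_f0)

def viterbi_select(targets, candidates_per_target):
    """targets: one target specification per position;
    candidates_per_target: a list of Candidate lists, same length."""
    # best[i][j] = lowest total cost of any sequence ending in candidate j at position i
    best = [[target_cost(targets[0], c) for c in candidates_per_target[0]]]
    back = [[None] * len(candidates_per_target[0])]
    for i in range(1, len(targets)):
        row_cost, row_back = [], []
        for c in candidates_per_target[i]:
            tc = target_cost(targets[i], c)
            # choose the best predecessor: its accumulated cost plus the join cost
            j, total = min(
                ((j, best[i - 1][j] + join_cost(p, c))
                 for j, p in enumerate(candidates_per_target[i - 1])),
                key=lambda x: x[1],
            )
            row_cost.append(total + tc)
            row_back.append(j)
        best.append(row_cost)
        back.append(row_back)
    # trace back the optimal sequence of candidates
    j = min(range(len(best[-1])), key=lambda j: best[-1][j])
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates_per_target[i][j] for i, j in enumerate(path)], min(best[-1])

# Tiny usage example with made-up candidates:
targets = [3, 7, 2]
candidates = [
    [Candidate(31, 110.0, 118.0), Candidate(12, 95.0, 100.0)],
    [Candidate(47, 119.0, 121.0), Candidate(7, 180.0, 160.0)],
    [Candidate(22, 122.0, 115.0)],
]
units, total = viterbi_select(targets, candidates)
print([u.unit_id for u in units], total)
```

A real system would use much richer linguistic and acoustic features in the two cost functions, and would typically prune the candidate lists, but the overall formulation - a sum of target and join costs minimized by dynamic programming - is the one developed in the rest of this module.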
What you should already know
Before continuing, you should check that you have the right background by watching this video.