Annotating the database

There are several reasons for avoiding manual annotation of the database. Instead, we will borrow methods from Automatic Speech Recognition.

We'll finish off this module on the database by looking at how we're going to label it ("annotate" it). But let's just orient ourselves in the bigger picture before we continue. What we've got so far is a script, which is composed of individual sentences. That script will have been designed, probably by a text selection algorithm that we've written. It will aim to cover the units-in-context; to be readable; to provide each base unit in a wide variety of different linguistic contexts; and possibly some other things as well, such as specific domains. With that script, we've gone into a recording studio and asked a speaker (sometimes called the "voice talent") to record those sentences. They'll generally be recorded as isolated sentences. Our text selection algorithm will very likely have provided us with a script in order of decreasing richness. So we'll record the script in that same order, meaning that we can stop at any point and maximize the coverage for that given amount of material.

What remains to be done is to segment the speech: to put phonetic labels on it, so we know where each (for example) diphone starts and ends. On top of that phonetic labelling (or "segmentation"), we need to annotate the speech with all of the supra-segmental linguistic information: all the other things that the target cost might need to query. We'll start by looking at the time-aligned phonetic transcription, using a technique borrowed from speech recognition. Then we'll see how we attach the supra-segmental information (in other words, the stuff above the phonetic level) to that phonetic transcription that's been time-aligned.

Let's think about two extremes of ways that we might label the speech, to understand why hand labelling might not be the right answer. If we think that we need a transcription of the speech that's exactly faithful to how the speaker said the text (one that gets every vowel reduction, every slight mispronunciation, every pause, everything exactly faithful to the way the speaker said the text), then we might think we want to hand label from scratch: in other words, from a blank starting sheet without any prior information. We'll be down at this end of the continuum. But let's think about what we're going to do with this data. We're going to do unit selection speech synthesis. We're going to retrieve units, and that retrieval will try to match a target sequence. The target sequence will have gone through our text-to-speech front end. The front end is not perfect. It might make mistakes, but it will at least be consistent: it'll always give you the same sequence for the same input text. But that sequence might not be the same as the way our speaker said the particular sentence in the database. So another extreme would be to label the database in a way that's entirely consistent with what the front end is going to do at synthesis time. We can call that the "canonical phone sequence": in other words, the sequence that is exactly what comes out of the front end.

If we had to choose between these two things, we'd actually want to choose this end, because we want consistency between the database and the runtime speech synthesis. Consider the example of trying to say a sentence that exists in its entirety in the database. We would obviously want to pull out the entire waveform and play it back and get perfect speech.
The only way we could do that is if the database had exactly the same labels on it that our front end predicts at the time we try and synthesize that sentence, regardless of how the speaker said it. Now, there are some points in between these two extremes, and we're going to take one of those as our basis for labelling. We're going to slightly modify the sequence that comes out of the front end, moving it a little bit closer to what the speaker actually said. We'll see exactly how we make those modifications, and why, as we go through the next few sections.

To summarize that situation: we have some text that our speaker reads out; we could put that through the text-to-speech front end and get a phonetic sequence from that. That's the canonical sequence. We can then make some time alignment between that canonical sequence and what the speaker actually said (their waveform). That will be very consistent, but there might be things our speaker did that are radically different to what's in that canonical sequence. A good example might be that the speaker put a pause between two words that our front end did not predict, because maybe our pausing model in the front end is not perfect. Alternatively, we could start from what the speaker said and hand transcribe it to get a phonetic sequence. That phonetic sequence will be very faithful to what the speaker said, but it might be rather hard to match it up with what the front end does at synthesis time. There might be systematic mismatches there. Those mismatches will mean that - when we try and say this whole sentence, or perhaps just fragments of it - at synthesis time, we won't retrieve long contiguous sequences of units from the database. In other words, we'll make more joins than necessary. Joins are bad! Joins are what listeners notice. Consistency will help us pull out longer contiguous units by getting an exact match between the labels on the database and what the front end does at synthesis time.

So our preference is going to be to start from the text that we asked the speaker to read, get the phonetic sequence, and then make some further small modifications to adjust it so it's a slightly closer fit to what the speaker said in (for example) pausing. The sort of labelling we're doing, Taylor calls "analytical labelling". Do the readings to understand precisely what he means by that.

We're going to prefer this to be done automatically. Yes, that's faster and cheaper, and that's a very important reason for doing it. But an equally important reason is that it's more consistent between what's in the database and what happens when we synthesize an unseen sentence. A good way to understand that is to think of the labels on the database not as a close phonetic transcription of the speech, but just as an index: a way of retrieving appropriate units for our search to choose amongst. Having consistent names for those units in that index is more important than being very faithful to what the speech actually says.

A natural question is whether we could automatically label the speech and then, by hand, make some small changes to match what the speaker actually said. Of course, that is possible, and it's actually standard practice in some companies. Those corrections are not going to be small changes to alignments ('microscopic changes'). They're really going to be looking for gross errors such as bad misalignments, or speakers saying something that really doesn't match the text.
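One way to picture the "labels as an index" idea is as a simple lookup table from unit names (for example, diphone labels) to the places in the recorded waveforms where candidate units can be found. Here is a minimal Python sketch of that idea; the field names, utterance identifiers and times are invented for illustration and are not taken from any particular toolkit.

```python
from collections import defaultdict

# A hypothetical index: diphone label -> list of candidate units.
# Each candidate records which utterance it came from and its start/end
# times in seconds, as produced by forced alignment.
index = defaultdict(list)

def add_unit(label, utterance_id, start, end):
    """Register one candidate unit under its (consistent) label."""
    index[label].append({"utt": utterance_id, "start": start, "end": end})

# Example entries (all values made up for illustration).
add_unit("k-ae", "sentence_0001", 1.230, 1.310)
add_unit("k-ae", "sentence_0417", 0.845, 0.920)
add_unit("ae-t", "sentence_0001", 1.310, 1.395)

# At synthesis time, the front end predicts a target label sequence and we
# simply retrieve whatever the index holds under exactly those names.
candidates = index["k-ae"]
print(f"{len(candidates)} candidates for diphone 'k-ae'")
```

The point of the sketch is that retrieval only works if the names in the index match the names the front end predicts, which is why consistency matters more than phonetic faithfulness.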
We're not going to consider this idea of manual correction here: it's too time-consuming and too expensive. We're going to consider only a way of doing this fully automatically. In other words, if the speaker deviated from the text in some way - such as inserting a pause where the front end didn't predict a pause - we're going to discover that completely automatically. The way that we're going to do that is basically to do automatic speech recognition to transcribe the speech. But this is much easier than full-blown speech recognition, because we know the word sequence. Knowing the word sequence is basically like having a really, really good language model: very highly constrained.

In automatic speech recognition, we normally only want to recover the word sequence, because that's all we want to output. But if you go back to look at the material on token passing, you'll realize that we can ask the tokens to remember anything at all while they're passing through the HMM states, not just the ends of words. They could also remember the times (the frames) at which they left each model: in other words, each phoneme. Or we could ask them to remember when they left each state. We could get model- or state-level alignments trivially, just by asking the tokens to record these things as they make their way around the recognition network.

So the ingredients for building a forced aligner are basically exactly the same ingredients as for automatic speech recognition. We need acoustic models: that is, models of the units that we want to align. They're going to be phone-sized models. We need a pronunciation model that links the word level to the phone level. That's just going to be our pronunciation dictionary: the same one we already have for synthesis. We might extend it in ways that we don't normally do for speech recognition, such as putting in pronunciation variation. We're going to see in a moment that some rule-based variation, specifically vowel reduction, is often built in. And we need a language model. That doesn't need to be a full-blown N-gram. We don't need coverage. What we need is just a model of the current word sequence for the current sentence. That's a very simple language model. In fact, the language model will be different for every sentence: we'll switch the language model in and out as we're aligning each sentence. One thing we might do is add optional silences between the words.

We'll come back to exactly how to train the acoustic models in a moment. Let's assume we have a set of fully-trained phone models for now, and see what the language model looks like. Let's write the simplest language model we can think of. Here's the sentence we asked the speaker to say. So that's what we're going to force-align to the speech that they produced. It's a finite state language model: the states are these words, and we just join them up with arcs, with a start state and an end state. We're going to compile that together with the acoustic model and pronunciation model to make our recognition network, do token passing, and ask the tokens to record when they left every single state or phone, depending on what alignment we want. That will get forced alignment for us.

That was the language model. The next ingredient is a pronunciation model that maps the words in the language model to phones, of which we have HMMs. Our pronunciation model is basically a dictionary. It maps words - such as this word - to pronunciations, such as this.
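As a rough sketch of what such a per-sentence language model might look like as a data structure, here is some Python that builds a simple left-to-right chain of word states from a start state to an end state, with a short-pause node at every word juncture. This is only an illustration of the idea, not the grammar format of any particular recognizer (such as HTK).

```python
def sentence_network(words):
    """Build a linear finite-state network for one sentence:
    <start> -> w1 -> sp -> w2 -> sp -> ... -> wN -> <end>.
    Each 'sp' node stands for the optional short-pause model; its ability
    to emit zero frames will come from a skip transition inside the HMM itself."""
    nodes = ["<start>"]
    for i, word in enumerate(words):
        if i > 0:
            nodes.append(f"sp_{i}")   # one short-pause node at each word juncture
        nodes.append(word)
    nodes.append("<end>")
    arcs = list(zip(nodes[:-1], nodes[1:]))   # simple left-to-right chain of arcs
    return nodes, arcs

# A different network is built for every sentence in the recording script.
nodes, arcs = sentence_network("the cat sat on the mat".split())
for left, right in arcs:
    print(f"{left} -> {right}")
```

Because the network is just the known word sequence, the "recognizer" has no freedom over which words were said, only over where their boundaries fall: that is exactly what forced alignment means.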
We're going to add a little rule-based pronunciation variation: we're going to allow every vowel to be reduced to schwa. We'll write out our finite state pronunciation model of the word "can" and add arcs so that, optionally, instead of the full vowel, we could generate a reduced vowel. So here's our finite state network model of the pronunciation of the word: /k ae n/, but it can also be reduced to /k ax n/. I can say "What can [k ae n] it do for..." or "What can [k ax n] it do for...". This recognition network can align either of those variants.

The third and final ingredient is the acoustic model: that's what's actually going to emit observations. We could borrow fully-trained models from an existing speech recognition system (for example, speaker-independent models), although in practice we actually tend to get better results with rather simpler models which we can make speaker-dependent, because we can train them on the same data that we're aligning. We might have thousands or tens of thousands of sentences, which is plenty of data to train context-independent phone models. Now, you might be shouting out at this point that training the models on the same data we're aligning is cheating! That's not true. We're not really doing recognition: we're doing alignment. The product is not the word sequence, it's just the timestamps. So there's no concept of a split between training and testing here. We've just got data. We train models on the data and then find the alignment between the models and the data. It's that alignment that we want, not the word sequence.

Those were our ingredients for forced alignment: a language model, a pronunciation model and an acoustic model. We saw how the language model is just derived from the sentences that we asked the speaker to read. We saw how the pronunciation model was simply the dictionary from our speech synthesizer with some rule-based vowel reduction. If our dictionary were more sophisticated and had full pronunciation variation capabilities, that could be expressed in finite state form and would become part of the alignment network. The remaining ingredient to build is the acoustic model.

So how can we train our acoustic models on the recorded speech data? Well, it's no different to building any other speech recognition system. We know the word transcriptions of all the data and we have an alignment at sentence boundaries between the transcriptions and the speech. If all you know about automatic speech recognition is how to use whole-word models, then you might think that we need to align the data at the word level before we can train the system. But think again: when we train whole-word models, such as in the "Build your own digit recognizer" exercise, those word models have many states and we did not need to align the states to the speech. So building a speech recognition system never needs state-level alignments. Those would be very tedious to try and do by hand; I've no idea how you would even do that. We can generalize the idea of not needing state-level alignments to not needing model- or word-level alignments either. That's easy, in fact. We just take our sub-word models (say, phone models) and concatenate them together to get models of words, and then we concatenate word models to get a model of a sentence. We get a great, big, long HMM. We know that the beginning of the HMM aligns with the beginning of the audio, and the end of the HMM aligns with the end of the audio. Training that big, long HMM is no different to training a whole-word model on segmented data.
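To make the rule-based vowel reduction concrete, here is a small Python sketch that expands a dictionary pronunciation into all the variants in which any vowel may be replaced by schwa ("ax"). The vowel inventory and phone symbols below are illustrative assumptions; a real system would use whatever symbol set its own dictionary uses.

```python
from itertools import product

# Illustrative (incomplete) vowel set; a real system uses its dictionary's own symbols.
VOWELS = {"aa", "ae", "ah", "ao", "eh", "ey", "ih", "iy", "ow", "uh", "uw"}
SCHWA = "ax"

def reduction_variants(pron):
    """Yield every pronunciation obtained by optionally reducing each vowel to schwa."""
    choices = [(p, SCHWA) if p in VOWELS else (p,) for p in pron]
    for variant in product(*choices):
        yield list(variant)

# The word "can": canonical /k ae n/, reducible to /k ax n/.
for variant in reduction_variants(["k", "ae", "n"]):
    print(" ".join(variant))
```

In the alignment network these variants become alternative arcs through the word, so the aligner simply picks whichever one best matches what the speaker actually said.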
Using data where we just have a word transcription, aligned only at the sentence level, is so important - and is the standard way of training an automatic speech recognition system - that it comes with a special name: it's called "flat start training". Let's see how flat start training is just a generalization of what we already know about speech recognition.

Let's pretend for a moment that these HMMs here are whole-word models. They're models of digits. In the exercise to "Build your own digit recognizer", we needed to know where the beginning of this model aligned with the speech and where the end of the model aligned with the speech. Then, given this set of observations, we could train the model of "one", and the same for all the other digits. So we essentially had isolated-digit training data. We just generalize that idea. This HMM now is an HMM of this little phrase. We know the start aligns with the start of the audio and the end aligns with the end of the audio. We just do exactly the same sort of training to train this long model from this long observation sequence. That extends out to a whole sentence.

Right, we've got all the ingredients then: a language model constructed from the sentences we know; a pronunciation model from the dictionary, plus rules; and acoustic models created with this thing called flat start training. Let's just make our language model a little bit more sophisticated, to accommodate variations that the speaker might make that our front end doesn't predict: that is, inserting pauses between words. This speaker has inserted a pause between these two words. Perhaps our front end didn't predict a pause in that situation. The way that we do that is to insert an additional acoustic model at every word juncture: a model of optional silence.

This model's just an HMM. It's got a state, it's got a self-transition, and it's got a start and an end non-emitting state (because we're using HTK). That's just a 1-state HMM that can emit observations. As it stands, it must emit at least one observation to get from the start to the end. We'd like it to be able to emit no observations, so that we have optionality: in other words, it can have zero duration. We can do that just by adding this extra transition, like this. This skip transition has given us a model of optional silence. This state here is going to contain a Gaussian or Gaussian mixture model that emits observations with an appropriate distribution for silence.

So we need to train this model. It's not obvious how, because we don't know where these optional silences are. But what we do know is that there's silence at the beginning and the end of the sentence, and we're going to have a 'long silence' model to deal with those. So we're always going to have our traditional (perhaps 3-state) silence model. This 3-state silence model we'll typically call 'sil', or something like that. This model here we're going to call 'sp', and there's an easy way to initialize the parameters of its state: we just tie them to the centre state of the 'sil' model, which will be easier to train because we'll always have a few hundred milliseconds of silence at the beginning and end of every sentence that we recorded in the studio. So that's a little trick to get us an optional short pause model and to train it easily by tying it to our main silence model.

Right, let's do some speech recognition! We're going to do it in the only way we know how: the way that HTK does it in its simple HVite tool.
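Here is a very schematic Python sketch of the short-pause trick: a 1-state 'sp' model whose single emitting state is literally shared with the centre state of the 3-state 'sil' model, plus a skip transition so 'sp' can emit zero observations. This is only a picture of the idea in plain data structures, not HTK's actual model file format.

```python
# Schematic sketch (not HTK's real format): a 3-state 'sil' model and a
# 1-state 'sp' model whose emitting state is tied to sil's centre state.

centre_state = {"name": "sil_centre", "gmm": "shared silence GMM parameters"}

sil = {
    "name": "sil",
    "states": [{"name": "sil_1", "gmm": "..."},
               centre_state,
               {"name": "sil_3", "gmm": "..."}],
}

sp = {
    "name": "sp",
    # Tying = both models point at the *same* state object, so any training
    # data assigned to either model updates the same output distribution.
    "states": [centre_state],
    # Entry -> state -> exit, plus a direct entry -> exit "skip" transition
    # so the model can emit zero observations (zero duration).
    "transitions": [("entry", "sil_centre"), ("sil_centre", "sil_centre"),
                    ("sil_centre", "exit"), ("entry", "exit")],
}

assert sp["states"][0] is sil["states"][1]   # literally the same tied state
```

The tying is what makes 'sp' easy to train: 'sil' always gets plenty of data from the silence at the start and end of every studio recording, and 'sp' simply inherits those parameters.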
That's to compile the language model, pronunciation model, and acoustic model together into a single finite state network, and then do token passing. Let's start with the language model. There's the language model: that's finite state. Let's just remind ourselves: it's got arcs and states like this. We'll compile this model with the pronunciation model, which essentially means replacing each word with its sub-word units. That's this model here. Again, remember these are all finite state and there are arcs joining all of those together. I won't draw them in, just to keep things neat. We can now enhance that with those two little tricks. One was to allow every vowel to be reduced: optional vowel reduction. The other was to put this special short pause model at every word juncture, in case the speaker put pauses where our front end didn't predict them in the phone sequence. So we add some more nodes in the network for that, and now we can draw the arcs in to see how this finite state network looks.

Let's draw the arcs on the network. I'll omit the arrows (they all flow from left to right, of course). So each word could have its canonical pronunciation, like this. Or, optionally, it has vowel reduction, like this. Between words we always go through the short pause model, which can emit zero or more frames of silence. The word can have its canonical full pronunciation, or the vowel can be reduced. We go through the short pause model. That's already a schwa, so it can't be reduced any further. Again the canonical pronunciation, or reduce the vowel. The optional short pause model at the word juncture. The canonical pronunciation or the vowel reduction. Then we end.

On this network, we just do token passing. Let's imagine a route that the token might take. We might say "There..." - maybe we left a pause, maybe we didn't; the number of frames we spend here will tell us if there was any silence - ...we maybe reduce this vowel... maybe we say this with its full pronunciation... and this with its full pronunciation. This token would remember its state sequence and its model sequence and would let us recover the timestamps. In other words, we'd know the time at which we left each of these phone models. We'll repeat that for every sentence in the database with our fully-trained models, that have been acquired using flat start training.

What we now have, then, is a phonetic sequence with timestamps. We know the start and end times of every segment. That's not enough: we need to attach supra-segmental information to it. Here's what our speaker said. Here's the phonetic sequence from forced alignment. We're going to attach all the supra-segmental information to that. We're just going to get the supra-segmental information from the text, using the front end. We're not going to do any alignment or modifications to try and make it match what the speaker said. We'll just take the predictions from the front end. That's a little bit simplistic: we might imagine we could do better than that by, for example, hand-labelling prosody (if we think we can do that accurately...). But the simple method actually works quite well, and it's the standard way of building voices when we use Festival, for example.

This level was the force-aligned phone sequence. It's got timestamps. This was the canonical sequence from the front end. We just take the timestamps off one and transfer them to the other.
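To see what that transfer might look like in practice, here is a toy Python sketch. It assumes, purely for illustration, that the force-aligned and canonical sequences differ only in inserted short pauses and reduced vowels, so the two line up one-to-one once pauses are skipped; all phone labels and times below are invented, and a real implementation (such as Festival's own voice-building tools) would need to be more careful.

```python
# Force-aligned phones with (start, end) times in seconds; 'sp' marks an
# inserted short pause, 'ax' a vowel the speaker reduced. Values are made up.
aligned = [("dh", 0.00, 0.05), ("ax", 0.05, 0.09), ("k", 0.09, 0.16),
           ("ae", 0.16, 0.28), ("t", 0.28, 0.35), ("sp", 0.35, 0.50),
           ("s", 0.50, 0.58), ("ax", 0.58, 0.63), ("t", 0.63, 0.72)]

# Canonical sequence from the front end for the same text: no pause predicted,
# and a full vowel /ae/ in the second word.
canonical = ["dh", "ax", "k", "ae", "t", "s", "ae", "t"]

def transfer_times(canonical, aligned):
    """Copy timestamps onto the canonical phones, noting pauses and reductions."""
    labelled, i = [], 0
    for phone, start, end in aligned:
        if phone == "sp":
            labelled.append(("<pause>", start, end))   # pause the front end missed
            continue
        target = canonical[i]
        reduced = (phone == "ax" and target != "ax")   # speaker reduced this vowel
        labelled.append((target, start, end, "reduced" if reduced else "full"))
        i += 1
    return labelled

for entry in transfer_times(canonical, aligned):
    print(entry)
```

The output keeps the front end's (consistent) labels but carries the aligner's timestamps, plus markers for the inserted pause and the reduced vowel, which is exactly the compromise between the two extremes described earlier.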
Those sequences are not identical, but they're very similar, so we can make that transfer, and in doing so we can also add in the short pauses where they existed, and we can mark which vowels got reduced by the speaker when he or she said this sentence. If we looked inside the Festival utterance structure after all of this processing, we'd see that some timestamps had appeared on it.

That concludes our look at databases: what to put in the recording script, and how to annotate that speech once we've recorded it in the studio. Unit selection, then, is just retrieval of candidates from that labelled database, followed by a search. Now we have a full working speech synthesizer. Hopefully you're doing the exercise at the same time. You've built that voice and listened to it, and so you're asking yourself now, "How good is that synthetic voice?" We need to evaluate that, so the topic of the next module will be evaluation.

We'll think about what would be fair ways to evaluate it. Of course we should listen to it ourselves: that's going to be helpful! We probably want to ask other people to listen to it as well, so we can get multiple listeners for a bit more reliability. Maybe we can measure some objective properties of the speech to decide how good it is? But, in general, we need to have a good think about what we want to measure and why exactly we want to do evaluation. So the questions we're going to answer in the next module are: Can we judge how good a synthetic voice is in isolation? Or is it only possible to do that by comparing to some other system that's better or worse than it? We'll answer the question of who should be listening to the speech: whether it should be us or other listeners, or indeed some algorithms (some objective measures). All of those have advantages and disadvantages, and we'll consider them. We'll think in detail about exactly what aspects of the speech we want to measure and quantify: Is it naturalness? Is it how intelligible the speech is? Or is there something else we can measure?