Having considered speech signal analysis - epoch detection (which is really just for signal processing in the time domain), then F0 estimation (which is useful for all sorts of things in both unit selection and statistical parametric speech synthesis), and estimating the smooth spectral envelope - it is now time to think about representing those speech parameters. What we have so far is just analysis. It takes speech signals and converts them, or extracts from them, various pieces of useful information: epochs, F0, spectral envelope. The thing we still haven't covered in detail is the aperiodic energy. That's coming up pretty soon. What we're going to do now is model these things. Or, more specifically, we're going to get ready for modelling. We're going to get the representations suitable for statistical modelling. Everything that's going to happen here we can really just describe as "feature engineering". To motivate our choice of which speech parameters are important and what their representation should be, we'll just have a quick look forward to statistical parametric speech synthesis. Here's the big block diagram that says it all. We're going to take our standard front end, which is going to extract linguistic features from the text. We're going to perform some big, complicated and definitely non-linear regression. The eventual output is going to be the waveform. We already know from what we've done in Speech Processing that the waveform isn't always the best choice of representation of speech signals. It's often better to parametrize it, which is what we're working our way up to. So, we're going to assume that the waveform is not a suitable output for this regression function. We're going to regress on to speech parameters. Our choice of parameters is going to be motivated by a couple of things. One is that we can reconstruct the waveform from them: the parametrization must be complete. We think a smooth spectral envelope, the fundamental frequency, and this thing that we still have to cover - aperiodic energy - will be enough to completely reconstruct a speech waveform. The second thing is that they have to be suitable for modelling. We might want to massage them: to transform them in certain ways to make them amenable to certain sorts of statistical model. So, if we've established that the parameters are spectral envelope + F0 + aperiodic energy, how would we represent them in a way that's suitable for modelling? Remember that this thing that we're constructing - that can analyse and then reconstruct speech - we can call a "vocoder" (voice coder): it has an excitation signal driving some filter, and the filter has a response which is the spectral envelope. In voiced speech, the excitation signal will be periodic at F0. Additionally, we need to know whether there is a value of F0, so typically we'll have F0 plus a binary flag indicating whether the signal is voiced or unvoiced. How would we represent that for modelling? Well, we could just use the raw value of F0, but if we plot it we'll realise it has a very non-Gaussian distribution. We might want to do something to make it look a bit more Gaussian. The common thing to do is to take the log. So we might take the log of F0 as that representation, plus this binary flag.
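As a concrete illustration of that representation, here is a minimal sketch (not part of STRAIGHT or any particular toolkit; the convention of marking unvoiced frames with 0 Hz is an assumption made purely for the example) of converting a raw F0 track into log-F0 plus a binary voiced/unvoiced flag:

```python
import numpy as np

def encode_f0(f0_hz):
    """Convert a raw F0 track (assumed to use 0.0 for unvoiced frames)
    into log-F0 plus a binary voiced/unvoiced flag, per frame."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz > 0.0                     # binary V/UV flag
    log_f0 = np.zeros_like(f0_hz)
    log_f0[voiced] = np.log(f0_hz[voiced])   # log compresses the skewed F0 distribution
    return log_f0, voiced.astype(int)

# Example: five frames, the middle one unvoiced
log_f0, vuv = encode_f0([120.0, 125.0, 0.0, 130.0, 128.0])
print(log_f0, vuv)
```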
The spectral envelope needs a little bit more thought. For the moment we have a smooth spectral envelope that's hopefully already independent of the source. That's good, but we're going to find out in a moment that it's still very high dimensional and strongly correlated, and those aren't good properties for some sorts of statistical model. Then we'd better finally tackle this problem of what to do about the other sort of energy in speech, which is involved in - for example - fricatives. Let's write a wish-list. What do we want our parameters to be like? Well, we're going to use machine learning: statistical modelling. It's always convenient if the number of parameters is the same over time. It doesn't vary - for example - with the type of speech segment. We don't want a different number of parameters for vowels and consonants. That would be really messy in machine learning. So we want a fixed number of parameters (fixed dimensionality) and we'd like it to be low-dimensional. There's no reason to have 2000 parameters per frame if 100 will do. For engineering reasons, it's much nicer to work at a fixed frame rate - say, every 5 ms for speech synthesis or every 10 ms for speech recognition - than at a variable frame rate such as - for example - pitch-synchronous signal processing. So we're just going to go for a fixed frame rate here because it's easier to deal with. Of course we want what we've been aiming at all the time, which is to separate out different aspects of speech, so we can model and manipulate them separately. There are some other important properties of this parametrization, and I'm going to group them under the rather informal term of being "well-behaved". What I mean by that is that when we perturb the parameters - when we add little errors to their values, which is going to happen whenever we average them with others or whenever we model them and then reconstruct them, whether that's across consecutive frames or across frames pooled from similar sounds to train a single hidden Markov model, say - we would like them to still reconstruct valid speech waveforms and not be unstable. So we want them to do the "right thing" when we average them, smooth them, or introduce errors to them. Finally, depending on our choice of statistical model, we might need to do some other processing to make the parameters have the correct statistical properties. Specifically, if we're going to use Gaussian distributions to model them, and we would like to avoid modelling covariance because that adds a lot of extra parameters, we'd like statistically uncorrelated parameters. That's probably not necessary for neural networks, but it's quite necessary for Gaussians, which we're going to use in hidden Markov models. We've talked about STRAIGHT, and there's a reading to help you fill in all the details about that. Let's just clarify precisely what we get out of STRAIGHT and whether it's actually suitable for modelling. It gives us the spectral envelope, which is smooth and free from the effects of F0; good, we need that! It gives us a value for F0. Now, we could use any external F0 estimator or the one inside the STRAIGHT vocoder. That doesn't matter: it can be an external thing. It also gives us the non-periodic energy, which we'll look at and parametrize in a moment. The smooth spectral envelope is of the same resolution as the original FFT that we computed it from. Remember that when we draw diagrams like this rather colourful spectrogram in 3D, the underlying data is of course discrete. Just because we join things up with smooth lines doesn't mean it's not discrete. So this spectral envelope here is a set of discrete bins. It's the same as the FFT bins; it's just been smoothed. Also, consecutive values - if we zoom in on this bit here - consecutive values (i.e., consecutive FFT bins) will be highly correlated. They'll go up and down together, in the same way as the outputs of a filterbank are highly correlated. That high resolution and that high correlation make this representation less than ideal for modelling with Gaussians. We need to do something about that. We need to improve the representation of the spectral envelope.
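To get a feel for just how correlated neighbouring bins of a smooth envelope are, here is a purely illustrative check using synthetic data (random log-spectra smoothed across frequency; none of this is real STRAIGHT output, and the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake "smooth spectral envelopes": random values smoothed across frequency,
# standing in for 513-bin envelopes over 500 frames.
n_frames, n_bins = 500, 513
raw = rng.normal(size=(n_frames, n_bins))
kernel = np.hanning(31)
kernel /= kernel.sum()
smooth = np.apply_along_axis(lambda x: np.convolve(x, kernel, mode="same"), 1, raw)

# Correlation between each bin and its neighbour, averaged over all bins:
corrs = [np.corrcoef(smooth[:, k], smooth[:, k + 1])[0, 1] for k in range(n_bins - 1)]
print("mean adjacent-bin correlation:", np.mean(corrs))   # very close to 1.0
```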
While we are doing that, we might as well also warp the frequency scale, because we know that perceptual scales are normally a better way of representing the spectrum for speech processing. We'll warp it onto the Mel scale. We'll decorrelate, and we're going to do that using a standard technique: the cepstrum. We're then going to reduce the dimensionality of that representation simply by truncating the cepstrum. What we will end up with is something called the Mel cepstrum. That sounds very similar to MFCCs and it's motivated by all the same things, but it's calculated in a different way. That's because we need to be able to reconstruct the speech signal, which we don't need to do in speech recognition. In speech recognition, we warp the frequency scale with a filterbank: a triangular filterbank spaced on a Mel scale; that loses a lot of information. Here, we're not going to do that. We're going to work with a continuous warping function rather than the discrete filterbank. We'll omit the details of that because they're not important. Once we're on that warped scale (probably the Mel scale, but you could choose some other perceptual frequency scale, which would also be fine), we're going to decorrelate. We'll do that by converting the spectrum to the cepstrum. The cepstrum is just another representation of the spectral envelope, as a sum of cosine basis functions. Then we can reduce the dimensionality by keeping only the first N coefficients. The more we keep, the more detail we retain in that spectral envelope representation, so the choice of number is empirical. In speech recognition we kept very few, just 12: a very coarse spectral envelope, but that's good enough for pattern recognition. It will give very poor reconstruction, though. So, in synthesis, we're going to keep many more: perhaps 40 to 60 cepstral coefficients. So that finalizes the representation of the spectral envelope. We use an F0-adaptive window to get the smoothest envelope we can, take an FFT, and do a little additional smoothing, as described in the STRAIGHT paper. Then we warp onto a Mel scale, convert to the cepstrum, and truncate. That gives us a set of relatively uncorrelated parameters, reasonably small in number, from which we can reconstruct the speech waveform.
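As a simplified stand-in for that pipeline (STRAIGHT's own mel-cepstral analysis uses a continuous frequency warping and is more careful than this), the sketch below warps a linear-frequency envelope onto an approximate Mel scale by interpolation, takes the log, converts to a cepstrum with a DCT, and truncates. The sample rate, the warping formula, and the number of coefficients are illustrative assumptions, not STRAIGHT's exact recipe:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_cepstrum(envelope, sample_rate=16000, n_coefs=40):
    """Approximate mel-cepstral representation of a smooth spectral envelope.

    envelope: magnitude spectral envelope, linearly spaced from 0 Hz to Nyquist.
    Returns the first n_coefs cepstral coefficients (a truncated, lossy summary).
    """
    n_bins = len(envelope)
    lin_hz = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # Resample the envelope onto points equally spaced on the mel scale.
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_bins)
    warped = np.interp(mel_points, hz_to_mel(lin_hz), envelope)
    # Log, then DCT: cosine-basis coefficients of the warped log envelope.
    log_warped = np.log(np.maximum(warped, 1e-10))
    cepstrum = dct(log_warped, type=2, norm="ortho")
    return cepstrum[:n_coefs]   # keep only the first coefficients (truncation)

# Example with a synthetic envelope of 513 bins
env = 1.0 + np.abs(np.sin(np.linspace(0, 3 * np.pi, 513)))
print(mel_cepstrum(env).shape)   # (40,)
```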
So let's finally crack the mystery of the aperiodic energy! What is it? How do we get it out of the spectrum? Let's go back to our favourite spectrum, of this particular sound here. The assumption is that this spectrum contains both periodic and aperiodic energy: what we're seeing is the complete spectrum of the speech signal. In general, speech signals have both periodic and non-periodic energy. Even vowel sounds have some non-periodic energy: maybe turbulence at the vocal folds. So we'll assume that this spectrum is made up of a perfectly-voiced part, which, if we drew the idealized spectrum, would be a perfect line spectrum, plus some aperiodic energy which also has a spectral shape but no harmonic structure (no line spectrum): some shaped noise. These two things have been added together in what we see in this spectrum here. So the assumption STRAIGHT makes is that the difference between the peaks, which are the purely periodic part, and the troughs, which are being - if you like - "filled in" by this aperiodic spectrum sitting behind them, tells us how much aperiodic energy there is in this spectrum. So we're just going to measure that difference. The way STRAIGHT does that is to fit one envelope to the periodic energy - that's the tips of all the harmonics - and another envelope to the troughs in between. In between two harmonics, we assume that all the energy present at that point (at that frequency) is non-periodic, because it's not at a multiple of F0. Then we're just going to look at the ratio between these two things. If the red and blue lines are very close together, there's a lot of aperiodic energy relative to the amount of periodic (i.e., voiced) energy. The key point about STRAIGHT is that we're essentially looking at the difference between these upper and lower envelopes of the spectrum: the ratio between them. That's telling us something about the ratio between periodic and aperiodic energy. That's another parameter that we'll need to estimate from our speech signals and store so that we can reconstruct it: so we can add back in an appropriate amount of aperiodic energy at each frequency when we resynthesise. Because all of this is done at the same resolution as the original FFT spectrum, it's all very high resolution. That's a bad thing: we need to fix that. Again, for the same reasons as always, the parameters are highly correlated, because neighbouring bins will often have the same value. So we also need to improve the representation of the aperiodic energy. We don't need a very high-resolution representation of aperiodic energy. We're not perceptually very sensitive to its fine structure, so we can use a "broad-brush" representation. The standard way to do that is to divide the spectrum into broad frequency bands and just average the amount of aperiodic energy in each of those bands, at each moment in time (at each frame, say every 5 ms). If we did that on a linear frequency scale we might - for example - divide it into these bands. Then, for each time window - let's take a particular time window - we just average the energy and use that as the representation. Because it's always better to do things on a perceptual scale, our bands might look more like this: getting wider as we go up in frequency. We'll do the same thing. The number of bands is a parameter we can choose. In older papers you'll often see just 5 bands used, and newer papers (perhaps with higher-bandwidth speech) use more bands - maybe 25 bands - but we don't really need any more than that. That's relatively low-resolution compared to the envelope capturing the periodic energy.
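Assuming we already have those upper and lower envelopes from somewhere (estimating them is the part STRAIGHT itself does), a rough sketch of the band-averaging step might look like this; the band count, band edges and mel-style spacing are illustrative choices, not STRAIGHT's exact ones:

```python
import numpy as np

def band_aperiodicity(upper_env, lower_env, sample_rate=16000, n_bands=5):
    """Average the aperiodic-to-total energy ratio into broad frequency bands.

    upper_env: envelope through the harmonic peaks (periodic + aperiodic energy)
    lower_env: envelope through the troughs between harmonics (aperiodic energy)
    Both are magnitude spectra, linearly spaced from 0 Hz to Nyquist.
    """
    n_bins = len(upper_env)
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    ratio = np.minimum(lower_env / np.maximum(upper_env, 1e-10), 1.0)

    # Band edges equally spaced on a mel-like scale (an illustrative choice).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    edges_mel = np.linspace(0.0, mel(sample_rate / 2.0), n_bands + 1)
    edges_hz = 700.0 * (10.0 ** (edges_mel / 2595.0) - 1.0)

    bands = np.empty(n_bands)
    for b in range(n_bands):
        mask = (freqs >= edges_hz[b]) & (freqs < edges_hz[b + 1])
        bands[b] = ratio[mask].mean() if mask.any() else 0.0
    return bands   # one aperiodicity value per band, per frame

# Example: a flat aperiodic floor 10% of the harmonic peaks
upper = np.ones(513)
lower = 0.1 * np.ones(513)
print(band_aperiodicity(upper, lower))   # roughly 0.1 in every band
```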
Let's finish off by looking, at a relatively high level, at how we actually reconstruct the speech waveform. It's pretty straightforward, because it's really just a source-filter model again. The source and filter are not the true physical source and filter: they're the excitation and the spectral envelope that we've estimated from the waveform. So they're a signal model. What we've covered up to this point is all of the analysis phase. The synthesis phase is pretty straightforward. We take the value of F0 and we create a pulse train at that frequency. We take the non-periodic (i.e., aperiodic) energy in the various bands and we just create some shaped noise: we have a random number generator and put a different amount of energy into the various frequency bands according to that aperiodicity ratio. For the spectral envelope (possibly collapsed down into the Mel cepstrum and then inverted back up to the full spectrum), we just need to create a filter that has the same frequency response. We take the aperiodic energy and mix it with the periodic energy - so, mix these two things together - and the ratio (the "band aperiodicity ratio") tells us how to do that. We excite the filter with that mixed excitation and get our output signal. In this course, we're not going to go into the deep details of exactly how you make a filter that has a particular frequency response. We're just going to state without proof that it's possible, and that it can be done from those Mel cepstral coefficients. So STRAIGHT, as sophisticated as it is, still uses a pulse train to simulate voiced energy. That's something that's just going to have a simple line spectrum, and we already know that might sound quite "buzzy": it's a rather artificial source. STRAIGHT does do something a little bit better than just a pulse train. Instead of pure pulses, it performs a little bit of phase manipulation, and the pulses become like this. That's just smearing the phase. Those two signals have the same magnitude spectrum but different phase spectra. This is one situation where moving from the pure pulse to this phase-manipulated pulse actually is perceived as better by listeners. The other thing that STRAIGHT does better than our old source-filter model, as we knew it before, is that it can mix together periodic and non-periodic energy. We can see here that there's non-periodic energy mixed in with these pulses. Good: we've decomposed speech into an appropriate set of speech parameters that's complete (that we can reconstruct from). It's got the fundamental frequency, plus a flag that's a binary number telling us whether there is an F0 or not (i.e., whether the frame is voiced or unvoiced). We have a smooth spectral envelope, which we've parametrized as the Mel cepstrum because that decorrelates and reduces dimensionality. Aperiodic energy is represented as essentially a shaped noise spectrum, where the shaping is just a set of broad frequency bands. We've seen, in broad terms, how to reconstruct the waveform. So, there's an analysis phase, and that produces these speech parameters. Then there's a synthesis phase that reconstructs a waveform. What we're going to do now is split apart the analysis and synthesis phases. We're going to put something in the middle, and that thing is going to be a statistical model. We're going to need a model because our input signals will be our training data (perhaps a thousand sentences from someone recorded in the studio) and our output signals will be different sentences: the things we want to create at text-to-speech time. This model needs to generalize from the signals it has seen (represented as vocoder parameters) to unseen signals.
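To make the synthesis phase described above concrete, here is a deliberately simplified, frame-based sketch: a pulse train (or noise, for unvoiced frames) is mixed with noise according to the aperiodicity, then shaped by the spectral envelope in the frequency domain with overlap-add. This is a toy under stated assumptions - it uses a single full-band mix derived from the mean band aperiodicity, and it skips STRAIGHT's phase manipulation and proper filter design from mel-cepstral coefficients - not STRAIGHT's actual synthesis:

```python
import numpy as np

def synthesise(log_f0, vuv, envelopes, band_ap, sample_rate=16000, hop=80):
    """Toy frame-based reconstruction (not STRAIGHT's real synthesis).

    log_f0:    (n_frames,) log-F0 per frame
    vuv:       (n_frames,) binary voicing flag per frame
    envelopes: (n_frames, n_bins) magnitude spectral envelopes, 0 Hz .. Nyquist
    band_ap:   (n_frames, n_bands) per-band aperiodicity ratios in [0, 1]
    """
    n_frames, n_bins = envelopes.shape
    frame_len = 2 * (n_bins - 1)          # FFT size implied by the envelope
    out = np.zeros(n_frames * hop + frame_len)
    window = np.hanning(frame_len)
    rng = np.random.default_rng(0)
    phase = 0.0                           # running position of the next pulse

    for i in range(n_frames):
        # --- excitation: pulses for voiced frames, mixed with noise ---
        noise = rng.normal(size=frame_len)
        if vuv[i]:
            f0 = np.exp(log_f0[i])
            period = sample_rate / f0
            pulses = np.zeros(frame_len)
            t = phase
            while t < frame_len:
                pulses[int(t)] = np.sqrt(period)   # roughly unit-power pulse train
                t += period
            phase = t - frame_len
            # Crude full-band mix using the mean aperiodicity (a simplification;
            # the real thing would shape the noise per frequency band).
            ap = float(np.mean(band_ap[i]))
            excitation = np.sqrt(1.0 - ap) * pulses + np.sqrt(ap) * noise
        else:
            excitation = noise

        # --- filter: impose the spectral envelope in the frequency domain ---
        spec = np.fft.rfft(excitation * window)
        frame = np.fft.irfft(spec * envelopes[i], n=frame_len)
        out[i * hop: i * hop + frame_len] += frame * window   # overlap-add

    return out
```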
Speech signal modelling
After we parameterise a speech signal, we need to decide how best to represent those parameters for use in statistical modelling, and eventually how to reconstruct the waveform from them.