Download the slides for the module 6 videos
Total video to watch in this module: 81 minutes
This video just has a plain transcript, not time-aligned to the video

In this module, we're going to cover the signal processing techniques that are necessary to get us onto statistical modelling for speech synthesis: something called Statistical Parametric Speech Synthesis. Specifically, we're going to develop those parameters. We call them speech parameters.

There are two parts to this module. In the first part, we're going to consider analysis of speech signals. We're going to generalise what we already know about the source-filter model of speech production, and we're going to think strictly in terms of the signal. So, instead of the source and filter, we will move on to thinking more generally about an excitation signal going through a spectral envelope. When we've developed that model - that idea of thinking about the signal, and modelling the signal, and no longer really worrying about the true aspects of speech production - we can then think about getting the speech parameters into a form that's suitable for statistical modelling.

Specifically, then, we're going to analyse the following aspects of speech signals. We're going to think about the difference between epochs (which are moments in time) and fundamental frequency (which is an average rate of vibration of the vocal folds over some small period of time). We're going to think about the spectral envelope, and that's going to be our generalisation of the filter in the source-filter model. Now, always remember to put axes on your diagrams! This bottom thing is a spectrum, so the axes will be frequency (the units might be Hertz) and magnitude. In the top diagram, the axes are of course time and amplitude, perhaps.

Just check as usual, before going on, that you have a good knowledge of the following things. We need to know something about speech signals. We need to know something about the source-filter model. And we also need to know about Fourier analysis, to get ourselves between the time domain and the frequency domain.

What do you need to know about speech signals? Well, if we take this spectrum, we need to be able to identify F0 (the fundamental): that's here. We need to be able to identify the harmonics, and that includes the fundamental: that's the fine structure. We need to understand where this overall shape comes from, and so far we understand that it is coming from the vocal tract frequency response. We might think of it as a set of resonances - a set of formants - or more generally as just the shape of this envelope. So that's what we need to know about speech signals.

We've got a conceptual model in mind already, called the source-filter model. That's really to help us understand speech production and, in a more abstract way, to understand what we see when we look at speech signals, particularly in the frequency domain: in that spectrum. Our understanding of the source-filter model at the moment is that there is a filter. The filter could take various forms, such as linear predictive, which is essentially a set of resonances. We know that if we excite this - if we put some input into it - we will get a speech signal out. The sorts of inputs we have been considering have been, for example, a pulse train, which would give us voiced speech out, or random noise, which will give us unvoiced speech out. So that takes care of the source-filter model.

Finally, we just need to remind ourselves what Fourier analysis can achieve. That can take any signal, such as the time-domain speech waveform, and express it as a sum of basis functions: a sum of sinusoids. In doing so, we go from the time domain to the frequency domain.
In that frequency-domain representation, we almost always just plot the magnitude of those sinusoids. That's what the spectrum is. But there is correspondingly also the phase of those components: that's the phase spectrum. So, strictly speaking, Fourier analysis gives us a magnitude spectrum, which is what we always want, and a phase spectrum (which we often don't inspect). To exactly reconstruct a speech signal, we need the correct phase spectrum as well.

So, in the first part of the module we'll consider what exactly it is we need to analyse about speech signals. Where we're going is a decomposition of a speech signal into some separate speech parameters. For example, we might want to manipulate them and then reconstruct the speech waveform. Or we might want to take those speech parameters and make a statistical model of them, that can predict them from text, and use that for speech synthesis.

The first thing I'm going to do is drop the idea of a source-filter model mapping onto speech production, because we don't really need the true source and filter. In other words, we don't normally need to extract exactly what the vocal folds were doing during vowel production. We don't really need to know exactly what the filter was like: for example, where the tongue was. So, rather than thinking about the physics of speech production, we're just going to think much more pragmatically about the signal. Because all we really need to do is to take a signal and measure something about it. For example, track the value of F0 so we can put it into the join cost function of a unit selection synthesiser. We might want to decompose the signal into its separate parts so that we can separately modify each of those parts. For example, the spectral envelope relates to phonetic identity and the fundamental frequency relates to prosody. Or we might want to do some manipulations without actually decomposing the signal - in other words, by staying in the time domain. For example, we might want to do very simple manipulations such as smoothly joining two candidate units from the database in unit selection speech synthesis.

So here's a nice spectrum of a voiced sound. You can identify the fundamental frequency, the harmonics, the formants and the spectral envelope. We're going to model this signal as we see it in front of us. We're not going to attempt to recover the true physics of speech production, so we're going to be completely pragmatic. Our speech parameters are going to be things that relate directly to the signal. We're not going to worry whether they do or don't map back onto the physical speech production process that made this signal. So, for example, there is an envelope of this spectrum. Call it a "spectral envelope". Just draw it roughly. Now, that clearly must be heavily influenced by the frequency response of the vocal tract. But we can't say for sure that that's the only thing that affects it. For example, the vocal fold activity doesn't have a flat spectrum. It's not a nice perfect line spectrum, because the vocal folds don't quite make a pulse train. So the spectral envelope is also influenced by something about the vocal folds: about the source. We're not going to try and uncover the true vocal tract frequency response. We're just going to try and extract this envelope around the harmonics.

Before we move on to the details of each of these speech parameters that we would like to analyse or extract from speech signals, let's just clear up one potential for misunderstanding: the difference between epochs and fundamental frequency.
In a moment we'll look at each of them separately, but it's very important to make clear that these are two different things. Obviously, they're related, because they come from the same physical part of speech production. But we extract them differently, and we use them for different purposes.

Epoch detection is perhaps more commonly known as pitch marking, but I'm going to say "epoch detection" so we don't confuse ourselves with terminology. It's sometimes also called glottal closure instant detection, or GCI detection. This is needed for signal processing algorithms. Most obviously, if we're going to do pitch-synchronous signal processing, we need to know where the pitch periods are. For example, in TD-PSOLA, we need to find the pitch periods, and epoch detection is therefore a necessary first step in TD-PSOLA. More simply, even if we're just overlap-and-adding units together, we might do that pitch-synchronously - so that's also a kind of TD-PSOLA, but without modifying duration or F0. Again, we need to know these pitch marks, or epochs, for that. A few vocoders might need pitch marks because they operate pitch-synchronously. So that's epoch detection.

F0 estimation is perhaps more often called "pitch tracking", and again, I'm going to try and consistently say "F0 estimation" to avoid the confusion between "pitch marking" and "pitch tracking", which sound a bit similar. F0 estimation involves finding the rate of vibration of the vocal folds. It's going to be a local measure, because the rate of vibration changes over time. It's not going to be as local as epochs. In other words, we're going to estimate it over a short window of time. F0 is needed as a component in the join cost: all unit selection systems are going to use that. We might also use it in the target cost. If we've got an ASF-style target cost function, we will need to know the true F0 of candidates so we can compare it to the predicted F0 of the targets, and put that measure in as a component of the ASF target cost. Almost all vocoders need F0 as one of their speech parameters.

So, just to make that completely clear, because it's a very common confusion: epoch detection is about finding one point in each pitch period, for example the biggest peak. That looks trivial on this waveform. But in general it's not trivial, because waveforms don't always look as nice as this example. If we thought we could do this perfectly, then it would be a great way to estimate F0, because we could just take some region of time - some window - and count how many epochs per second there were. And that would be the value of F0. But epoch detection is a bit error-prone. We might miss the occasional period. So when we estimate F0, we don't tend to do it directly from the epochs.

So, separately from epoch detection, F0 estimation is the process of finding, for some local region (or window) of the speech signal, the average rate of vibration of the vocal folds. Of course, that window needs to be small enough that we think the rate is constant over the window. Hopefully the intuition should already be obvious: because F0 estimation can consider multiple periods of the signal, we should be able to do it more robustly than finding any individual epoch.
This video just has a plain transcript, not time-aligned to the video

Let's start by detecting epochs: most commonly called "pitch marking". Here's the most obvious use for pitch marks: it's to do some pitch-synchronous, overlap-and-add signal processing. Here, I've got a couple of candidate units I've chosen during unit selection. I would like to concatenate these waveforms. I would like to do that in a way that is least likely to be noticed by the listener. So, take the two waveforms, and we're going to try and find an alignment between them - by sliding one of them backwards and forwards - such that when we crossfade them (in other words, overlap-and-add them) it will look and sound as natural as possible. If we move one of the waveforms side to side and observe its similarity to the top one, there will be a point where it looks very similar. The very easiest way to find that is to place pitch marks on both signals. So if we draw pitch marks on them - those are the pitch marks for the top signal, and the pitch marks for the bottom signal - we can see that simply by lining up the pitch marks, we'll get a very good way of crossfading the two waveforms. It will be pitch-synchronous. Our overlap-and-add procedure will do the following. It will apply a fade-out to the top waveform - so apply some amplitude control to it, where it's at full volume here, then turn the volume down - and at the same time fade in the bottom waveform: turn its volume up. Then we just add the two waveforms together. That's pitch-synchronous overlap-and-add. And if we've got pitch marks, it's very simple to implement that.
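To make that concrete, here is a minimal sketch of such a pitch-synchronous crossfade in Python with NumPy, assuming the waveforms are arrays of samples and the pitch marks are sample indices. The function name, the linear fades and the fade length are illustrative choices, not part of any particular toolkit.

```python
import numpy as np

def psola_join(left, right, left_mark, right_mark, fade_len):
    """Join two waveforms by overlap-and-add, aligning one pitch mark from each.
    left, right : 1-D arrays of samples
    left_mark   : index of a pitch mark near the end of `left`
    right_mark  : index of a pitch mark near the start of `right`
    fade_len    : length of the linear crossfade, in samples
    """
    half = fade_len // 2
    # Region of `left` to keep, ending just after its pitch mark
    a = left[:left_mark + half].astype(float)
    # Region of `right` to keep, starting just before its pitch mark
    b = right[right_mark - half:].astype(float)
    a[-fade_len:] *= np.linspace(1.0, 0.0, fade_len)   # fade the top waveform out
    b[:fade_len]  *= np.linspace(0.0, 1.0, fade_len)   # fade the bottom waveform in
    # Overlap-and-add: the faded regions sum together
    out = np.zeros(len(a) + len(b) - fade_len)
    out[:len(a)] += a
    out[len(a) - fade_len:] += b
    return out
```

The point of centring the fades on the pitch marks is that the two marks land on the same output sample, so the crossfade happens at the same point in each signal's cycle.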
So where do these pitch marks come from? Well, one tempting thing we might do is to actually try and record the vocal fold activity directly. That's what we used to do. Before we got good at doing this from the waveform, we might put a device called the laryngograph onto the person speaking, and record in parallel, on a separate audio channel, the activity of the vocal folds. That's this signal, called Lx. That signal's obviously much simpler than the speech signal. It's closer to our idealised pulse train, and it's really fairly straightforward to find the epochs from this signal (not completely trivial, but fairly straightforward). However, it's very inconvenient to place this device on speakers, especially when recording large speech databases. For some speakers, it's hard to position: we don't get a very good recording of the Lx signal. So we're not going to consider doing it from that signal. We're going to do it directly from the speech waveform.

Let's develop an algorithm for epoch detection. I'm going to keep this as simple as possible, but I'm going to point to the sorts of things that, in practice, you would need to do to make it really good. Our goal, then, is to find a single, consistent location within each pitch period of the speech waveform. The key here is: consistent. The reason for that should be obvious from our pitch-synchronous overlap-and-add example: we want to know that we're overlapping waveforms at the same point in their cycle; for example, the instant of glottal closure. The plan for developing this algorithm, then, is: we'll actually try and make the signal a bit more like the Lx signal. In other words, we'll make the problem simpler by making the signal simpler. We'll try and throw away everything except the fundamental. We'll try and turn the signal into a very simple-looking signal. Then we'll attempt to find the main peak in each period of that signal. It will turn out that peak picking is actually a bit too hard to do reliably, so we'll flip the problem into one of finding zero crossings, and that's much easier.

So, to make that clear in this example waveform here: we're looking for a single consistent point within each pitch period. Here, the obvious example would be this main peak. We're looking for these points here in the time domain, and the algorithm is going to work in the time domain. We're not going to transform this signal. We're just going to filter it and then do peak picking, via zero crossings.

To understand how to simplify the signal to make it easier to do peak picking on, we could examine it in the frequency domain. Here's the spectrum of a vowel sound. If we zoom in, we'll see the harmonics very clearly. This is the fundamental. That's what we are looking for, but we're not looking for its frequency: we're looking for the locations of the individual epochs in the time domain. The reason the waveform looks complicated is that there's energy at all of these other frequencies as well, mixed in, and they're weighted by the spectral envelope (or the formants). All of this energy here is making the waveform more complex and is making it harder to find the pitch periods. So we'll simplify the signal, and we'll try and throw away everything except the fundamental. If we do that, it will look a bit like a sine wave: a bit like a pure tone.

How do we throw away everything except the fundamental? Well, just apply a low-pass filter: a filter whose response looks something like this. Its response is 1 up to some frequency; in other words, it just lets all those frequencies through, multiplied by 1. Then it has some cutoff, and then it cuts away all of these frequencies. This part here is passed through the filter: it's called the pass band. All of this stuff is rejected by the filter; in other words, its amplitude is reduced down to zero. I should point out that a perfect-looking filter like this is impossible in reality. Real filters might look a bit more like this: they have some slope. Nevertheless, we can apply a low-pass filter to get rid of almost all the energy except for the fundamental.

So if we low-pass filter speech, we'll get a signal that looks a little bit like this. It's almost a sine wave, except it varies in amplitude. The frequency of this sine wave is F0, and it's now looking much easier to find the main peaks: these peaks here. Direct peak picking is still a little bit hard, though. Let's think about a naive way that we might do it. We might set some threshold like this, and every time the signal goes above it, we find a peak. But if the signal's amplitude drops a lot, we might not hit that threshold. So, we might miss some peaks. If we set the threshold very low, we might start picking up crossings of the threshold where there's just a bit of noise in the signal. This is a bit of unvoiced speech where there happened to be some low frequencies around F0 that got through, but it's not the peaks we're looking for. Direct peak picking is hard. So what we're going to do is turn the problem into one of finding zero crossings. The top waveform is the low-pass filtered speech and the bottom waveform is just its derivative: I have differentiated the waveform. What does that mean?
That means just taking its local slope. So at, for example, this point, the local slope is positive; at this point, the local slope is negative; and, importantly, on the peaks the local slope is about zero (although that's true of these negative peaks as well). So the waveform on the bottom is the differentiated signal, or derivative. We might just write that as "delta". We're now going to find these points, because they correspond to zero crossings in the bottom waveform. To find just the top peaks, we're looking for where the slope changes from positive to negative. So we're looking for crossings from positive to negative. For example, this peak can easily be identified by where the signal crosses the zero line here.

So what we've done so far is low-pass filtered the speech waveform and then taken the derivative (or differential). That can be done as simply as taking the difference between consecutive samples in the waveform. The result of this very simple algorithm is this. We have the original speech waveform, which got low-pass filtered, then differentiated. We found all the zero crossings that were going from positive to negative, and that's what these red lines are indicating. And this gives us a consistent mark within each pitch period. Now, these marks aren't exactly aligned with the main peaks. We would need some sort of post-processing to make that alignment, but we've done pretty well with such a very simple algorithm. However, we do see some problems. For example, we're getting some spurious pitch marks where there's no voicing, just because this unvoiced speech happens to have some energy around F0 by chance, and that happened to lead to some zero crossings.

So let's just summarise that very simple algorithm. It's typical of many signal processing algorithms in that it has three steps. The first step is to pre-process the signal: make the signal simpler - in this case, remove everything except F0. So the pre-processing here was simply a low-pass filter. There is then the main part of the algorithm, which is to do peak picking on that simplified signal. Peak picking was too hard, but we could differentiate and do zero-crossing detection, which is easy. Then we put some improvements into that algorithm to get rid of some of the spurious zero crossings: for example, the ones that happened in unvoiced speech. We might run a smoothing filter across the signal that retains the main shape of the signal but gets rid of those little fluctuations where we got the spurious zero crossings in unvoiced speech. Then find the crossings and put pitch marks on each of them. And finally, like almost all signal processing algorithms, not only does it have pre-processing, it also has some post-processing, and - in the case of pitch marking - that might be to then align the pitch marks with the main peaks in the waveform. So, in this diagram, that might mean applying some offset or time-shift correction to try and line them up with the main peak.

So that's epoch detection, or "pitch marking". Those pitch marks are just timestamps. They'd be stored in the same way we might store a label. They're just the list of times at which we think there's a pitch period in voiced speech. They will be used, for example, in pitch-synchronous overlap-and-add signal processing: something as simple as concatenating two candidates in unit selection.
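Here is a minimal sketch of that three-step recipe in Python with NumPy and SciPy. It is only an illustration of the idea, not a production epoch detector: the cutoff frequency is an assumed value you would tune, and the post-processing described above (removing spurious marks in unvoiced speech, shifting marks onto the nearest big peak) is left as a placeholder comment.

```python
import numpy as np
from scipy.signal import butter, lfilter

def detect_epochs(x, fs, cutoff_hz=150):
    """Very simple epoch detector: low-pass filter, differentiate,
    then find positive-to-negative zero crossings of the derivative."""
    # 1. Pre-process: low-pass filter to keep little more than the fundamental
    b, a = butter(4, cutoff_hz / (fs / 2), btype='low')
    low = lfilter(b, a, x)
    # 2. Core: differentiate (difference of consecutive samples) ...
    d = np.diff(low)
    # ... and find where the slope goes from positive to negative (the peaks)
    crossings = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    # 3. Post-process (placeholder): remove spurious marks in unvoiced regions
    # (e.g. with an energy or voicing check) and shift each mark onto the
    # nearest large peak of the original waveform.
    return crossings  # sample indices, roughly one per detected pitch period
```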
This video just has a plain transcript, not time-aligned to the video

So, moving on to estimating F0. Let's just remind ourselves of the difference between epoch detection and F0 estimation. Epoch detection is finding something in the time domain - instants in time - which perhaps relate to when the vocal folds snap shut (the glottal closure instants). That has uses in signal processing, specifically in pitch-synchronous signal processing. We're now going to move on to F0 estimation, which is a related problem. F0 is defined as the local rate of vibration of the vocal folds, expressed as a frequency in Hertz (cycles per second). That's a parameter of speech. It might be useful for the join cost in a unit selection synthesiser and, as we're going to see eventually, it's an important parameter in a vocoder: for analysing speech into a parametric representation, which you might either manipulate or model, and then reconstructing the waveform from that representation.

Both epoch detection and F0 estimation have many different names in the literature. I'm attempting to be consistent in my use of: epoch detection for "pitch marking" or "GCI detection", and F0 estimation for "pitch tracking", "F0 tracking", and so on.

Now, since these are obviously very related (because they measure the same physical property of the signal, which comes from the same physical part of speech production: the vocal folds), surely it would make sense to do epoch detection and then just look at the duration between epochs. That's the fundamental period. Take 1 over that, and we get the fundamental frequency. So it's a perfectly reasonable question to ask: can we estimate F0 after doing epoch detection? Let's understand why that might not be the best thing to do. Here are the epochs I detected with my very simple algorithm. It doesn't matter that they're not lined up with the main peak in the signal. Let's try and estimate F0 from these epochs. We could do it very locally. We could look at a pair of epochs and measure the duration (the time) that elapsed between them. This is called the fundamental period, or T0, and it's just equal to 1/F0. T0 is in seconds (s) and F0 is in Hertz (Hz). That would be correct; however, it would be very prone to lots of local errors. For example, imagine our epoch detector missed out this epoch. It made a mistake. We would then get an octave error here in F0: the fundamental period has doubled, so we get an F0 halving error. So we'd get lots of local errors - lots of noise - in our estimate of F0.
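As a small worked example (with made-up numbers) of that halving error:

```latex
% True values (hypothetical example):
T_0 = 5\,\mathrm{ms} \quad\Rightarrow\quad F_0 = \frac{1}{T_0} = \frac{1}{0.005\,\mathrm{s}} = 200\,\mathrm{Hz}
% With one missed epoch, the measured period doubles:
\hat{T}_0 = 10\,\mathrm{ms} \quad\Rightarrow\quad \hat{F}_0 = 100\,\mathrm{Hz} \quad\text{(an octave error)}
```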
A smarter way to estimate F0, given that we know it changes slowly, is to estimate its value over rather longer stretches of time than single epochs. Then we can be more robust to errors in epoch detection. The method we're about to develop does essentially that, but it doesn't need to do epoch detection at all. It works directly with the speech signal - avoiding the step of epoch detection, and the errors that it might make - and looks, over some window of time, for repeating patterns in the speech waveform. So the key is that when we estimate F0, we can do so by looking at rather longer time windows than one pitch period: multiple pitch periods. Then we should be more robust, because the more signal we see, the more robust our estimate of F0 should be. Our choice of window size will be dictated by how fast we think F0 can change, and how fast it can turn on and off at boundaries between voiced and unvoiced speech.

So, as I just said, we're going to attempt to find the periods - the fundamental periods - in this waveform without having to explicitly mark them as epochs. The way we're going to do that is just to look for repeating patterns in the waveform. Visually, there's a very obvious repeating pattern in this waveform. The pattern repeats (approximately), and it slowly changes. We're going to try and look for that. Specifically, we're going to look for it repeating from one pitch period to the next. We're going to try and measure the similarity between two consecutive pitch periods. The method isn't going to do that for just one pair of pitch periods - for example, this one and this one - it's going to do it for this one and the next one, and this one and the next one, over some window that we can choose. So we're just going to look for general self-similarity between the waveform and itself, shifted in time. The shift will be one period. Now, we don't know the period, so we can't just shift the waveform by exactly a period and measure the similarity. We're going to have to try shifting the waveform by many different amounts and find the place where it's most self-similar.

Let's do that. I'll need a copy of the waveform, and I'm going to put the waveforms on top of each other, like that. All I'm going to do is take one of the copies and slowly shift it to one side: I'm going to time-shift it. The shift has a special name in this form of signal processing: it's called the lag. So let's just let one of them drift slowly to one side and see what happens. From time to time we see little glimpses of self-similarity between the waveforms. There was one there - it's fairly self-similar. A little bit of one there. There's one coming up: that's a bit self-similar. But right at this moment they were very similar to each other. Let's just look at it in slow motion, to be sure we spotted the exact moment where there's a lot of self-similarity between the waveforms. So at that moment, the shift between one waveform and the other is exactly one pitch period. They're really very similar to each other; not identical, but very similar. So how are we going to measure that similarity? There's actually a very simple way of measuring the similarity between two signals.
They could be two different signals, or a signal and a shifted version of itself. The method is just to multiply them together, sample by sample. That's technically known as the inner product (the name doesn't matter). Sample by sample, we'll multiply them together. If you think about it, that will give the biggest possible value when the signals are maximally aligned, because all the positive samples will be multiplied by positive samples and the negative ones will be multiplied by negative ones. A negative number times a negative number is always positive, and all of that will give us the biggest possible value.

We can write that in this very simple equation. Don't be put off by the number of different letters in this equation! We're going to deconstruct it in a moment. This is known as cross-correlation. Sometimes it's called autocorrelation, or "modified autocorrelation". The differences between those are not really important to us; they're just to do with how big a window we use as we're calculating the self-similarity. We're going to stick with cross-correlation, which is defined like this. This r is the cross-correlation value, or function. It has two parameters. It should be fairly obvious that it varies depending on where in the signal we calculate it: at a different point in the utterance, t will be different. And it's also going to vary - this is the key parameter - as we shift one signal with respect to the other; that's the lag, τ. What we'll do is vary τ and try to find the value that gives the highest value of the cross-correlation (or what I've called here autocorrelation) function. That value of τ will be the fundamental period: the value that gives the biggest self-similarity. We'll see that in a moment.

And how is this defined? We're going to sum speech samples, so x is the value of a sample in the waveform. It's just a number stored in that raw waveform file, and j indexes time. So x_j is the j'th sample in this utterance; let's say there might be 16,000 samples per second. And we're going to multiply this sample by a shifted sample: a sample that's also at time j, plus τ (plus the shift). So it's going to be one speech sample (from one of the waveforms) multiplied by another speech sample from the other waveform.

So what are all these other letters on the right-hand side? j indexes time, and in this summation it's going to start at one particular time and count up to another particular time. It's going to do that over a window. The size of the window is W. So W is the window size, in samples.

Let's look at how that really works for one particular time and one particular lag, and calculate this autocorrelation (or rather, modified autocorrelation or cross-correlation function - don't worry about the differences in those names!). Here's a speech waveform. I'll need a copy of it. This is the one that I'm going to shift. So the top one will give me those x_j samples, and the bottom one will get shifted: it'll have a lag. That'll give me x_{j+τ}, where τ is the lag. I'll do this over a fixed window, and the size of the window in samples is W. We'll need to choose that value. We're going to shift one of the waveforms by some lag. Let's just do that. I'll pick some arbitrary lag for now, and just do it for one value of τ. So that waveform has just been shifted by the lag, and that shift is denoted by τ. We're now looking a bit further into the future of this waveform: it slid to the left, so we're looking at samples that were originally to the right. Let's just write all the various parameters from that equation onto this diagram.
We've already seen that the window size is W. The window is where we multiply corresponding samples of the two waveforms and sum them up. The start time of the window is t+1 and the end time is t+W. It looks a bit silly to use t+1 and not t, but the reason is so that there are exactly W samples inside this window. We could have done t to t+W-1; it doesn't really matter. What are the two samples that we're multiplying and adding together? Well, we'll take one sample here - that will be x_j - and we'll take a sample down here - that will be x_{j+τ}, because we shifted this one. Just to be completely clear, it's just the value of the sample: how high above the zero axis it is. We'll take those two things, multiply them together, and do that for every corresponding pair of samples. So we'll take this sample times this sample, plus the next sample times the next sample, plus the next sample times the next sample, and so on. We'll do that W times and then sum them all up.

Let's look again at the equation while we're talking about that. In the top waveform, we were pulling out the x_j samples. In the bottom waveform, we were pulling out the shifted versions: looking forwards in time by τ, so that's x_{j+τ}. Then we're simply multiplying them together. So x_j x_{j+τ} means multiply those two things together. And then we do that for all pairs of samples across some window W, and sum the results together.

Let's make sure we fully understand the left-hand side. The most important parameter - and it's in parentheses to show it's the argument of the function - is τ. What we're going to do in a moment is calculate this function for several different values of τ, starting at 0, and plot it. We're going to look for peaks in that function. There's obviously going to be a really big peak at τ=0, because that's just the signal multiplied sample-by-sample with itself, with no lag: it's maximally self-similar. We'll get the maximum value there. But, at a shift of one pitch period, we're hoping we'll also get another maximum of this function. So the key parameter there is τ. Of course, the function also depends on where exactly in the waveform we placed the window: where's the start time of the window? So if we wanted to track F0 across a whole utterance, we would have to calculate this function, and do this peak-picking operation (which we're going to do in a moment), at several different values of t. Maybe we would like a value of F0 every 10 milliseconds. So we'd have to vary t from 0 up to the end of the utterance, in steps of 10 ms, and calculate F0 at each of those points through the procedure we're about to look at.

So here's a plot of the function, with the lag in samples. This is for a waveform that has actually been downsampled to a rather low sample rate. If you want to investigate this for yourself, there's a blog post: go and find that - there's a spreadsheet and there are waveforms, and you can calculate this exact function for yourself, step by step. I said the key parameter of the autocorrelation (or cross-correlation) function is τ, the lag. At a lag of 0, the waveforms at the top and the bottom are not shifted with respect to each other, so we get the maximum possible value of the function. Then we set τ=1 and calculate the whole thing again: move the waveform one sample to the left, multiply and sum all the samples together, and get the value of the function. The waveforms will be a little bit less similar, so the value will go down a bit.
Then we just plot that function for various different values of τ. Now, over what range should we calculate it? Well, we have an idea about what sort of values the pitch period - the fundamental period - might take. We vary τ from, let's say, zero up to well past the maximum value of the pitch period that we expect for this particular speaker. I've done that here. I've varied τ from zero up to a shift of 100 samples in this downsampled waveform (the sampling rate is much lower than 16 kHz here). Look what happens. At another lag - this lag here - we get a really big peak: the waveforms are very self-similar, even though one is shifted with respect to the other. That's the pitch period. If we keep on shifting the waveform, eventually, at a shift of 2 pitch periods, we'll get another peak in the self-similarity. Hopefully, that will be smaller than the first peak. So this autocorrelation function is a function of the lag, τ.

Let's just look again at the animation to reinforce that. We took our waveform. We made a copy of the waveform, not shifted at all: this is a lag of 0. We placed a window over which we're going to calculate this function. Note that the window spans several epochs and has many speech samples in it. So we're estimating F0 from a lot of information here: a lot more than a few error-prone individual epoch marks. Then we calculate the autocorrelation (strictly speaking, the cross-correlation) function between these two waveforms for the current value of τ, which is zero. So we take this sample, multiply it by this sample, and add it to this times this, plus this times this, and so on, all the way through the window. Because every sample has essentially been multiplied by itself, we get a sum of squared numbers, which are all positive, and we get a nice big value. All we do then is move the waveform one sample to the left and calculate that whole function again, and then another sample to the left, and calculate it again. So, as this waveform slides to the left, at every shift of one sample we recalculate the autocorrelation function across this whole window. So here τ is increasing: it goes up and up as we slide to the left.

It's clear that this cross-correlation function is going to take a little bit of computation, because every individual point on it - which is for a particular value of τ; this one's for a lag of τ=40 samples - involves a sum over the W samples within the window. So we have to do W multiplies and add them all together to calculate that one value. And then we have to repeat that for as many values of τ as we think we need, so that we'll find this big peak. So now you understand why estimating F0 from a speech signal might be a bit slower than you were expecting: to calculate each value of this plot we've got W multiplies and then a summation; we've got to do that for every value of τ; and we've still got to actually identify F0, which means finding this big peak. That gives us just one number: the value of F0 at this frame. We then move the window (the frame) along in time, maybe forwards by 10 ms, and do it all again. We'll repeat that every 10 ms for the whole utterance.

So the general form of this algorithm for F0 estimation is looking a little bit like the algorithm for epoch detection. There's a core to the algorithm - that's what we're currently talking about. The core of this algorithm is to calculate this correlation function, make a plot of it, and then find the biggest peak in that plot.
So we're looking for the biggest peak at a non-zero lag, because the peak at zero lag is trivial: that's where the waveform is just equal to itself. We want some peak after that. The way to locate it is to set a search range of expected lags - in other words, we know in advance what the range of F0 of our speaker is, and from that range we can set a range of pitch periods, from a minimum to a maximum. We'll search over that range in the autocorrelation function and find the biggest peak in there, and whatever the biggest peak is, that lag equals the pitch period T0 (sometimes called the fundamental period), and 1 over that equals F0.

Now, we already said for epoch detection that peak picking is hard. That's still true, so we expect to make errors when we're peak picking in the autocorrelation function. Other things that make it non-trivial are that real signals are not perfectly periodic. So, as we shift one waveform with respect to the other, the pitch periods won't all line up on top of each other perfectly. F0 might be changing slowly, or there might be local effects such as some jitter in F0: the periods getting longer and shorter locally. And, as we saw when we shifted the waveforms past each other, there are some moments of self-similarity which happen at times other than exact multiples of the pitch period, and that's because of the formant structure of the waveforms. So some of the peaks in the autocorrelation function - these ones here - are actually due to the structure within a pitch period, giving us some spurious self-similarity that's not the pitch period. That's because of the formant structure of speech.

So, just like epoch detection, we'll take the core idea, which is a good one - here, the core idea is cross-correlation (or modified autocorrelation) and peak picking - and, because these straightforward things have limitations, we're going to have to dress them up with some pre- and post-processing. So, like most signal processing algorithms, we might consider doing some pre-processing or post-processing. The pre-processing will try to make the problem simpler before we hand over to the main part of the algorithm, which here is cross-correlation and peak picking, and the post-processing will try to fix up any errors that that core algorithm makes.
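Putting the core together - and leaving out all pre- and post-processing, including any voicing decision - a minimal sketch in Python might look like this. It computes the cross-correlation r_t(τ) = Σ_{j=t+1}^{t+W} x_j x_{j+τ} over a window of W samples, then picks the biggest peak within an assumed F0 search range. The window length, search range and frame shift are illustrative values you would tune to the speaker.

```python
import numpy as np

def estimate_f0(x, fs, t, window_ms=40, f0_min=60, f0_max=400):
    """Estimate F0 at sample index t by cross-correlation peak picking.
    Assumes the frame at t is voiced; no voicing decision is made."""
    W = int(fs * window_ms / 1000)            # window size in samples
    tau_min = int(fs / f0_max)                # shortest plausible period
    tau_max = int(fs / f0_min)                # longest plausible period
    frame = x[t:t + W].astype(float)
    r = np.zeros(tau_max + 1)
    for tau in range(tau_max + 1):
        shifted = x[t + tau:t + tau + W].astype(float)
        r[tau] = np.dot(frame, shifted)       # sum_j x_j * x_{j+tau}
    # Find the biggest peak at a non-zero lag, within the search range
    tau_peak = tau_min + np.argmax(r[tau_min:tau_max + 1])
    T0 = tau_peak / fs                        # fundamental period, in seconds
    return 1.0 / T0                           # F0 in Hz

# e.g. one value every 10 ms across an utterance (keeping clear of the end):
# f0_track = [estimate_f0(x, fs, t)
#             for t in range(0, len(x) - 2 * int(0.04 * fs), int(0.01 * fs))]
```

Note how the cost matches the discussion above: each lag needs W multiply-adds, the whole lag sweep is repeated for every frame, and the result is one F0 value per frame.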
This video just has a plain transcript, not time-aligned to the video

Let's consider whether any pre-processing would actually help. It's tempting to do what we did in epoch detection, and that is to low-pass filter the speech waveform to throw away everything except the fundamental. That's what's happened here: we've low-pass filtered the waveform. But think about what cross-correlation is really doing. It's looking for similarity between one pitch period and the next, and that similarity is measured sample by sample. It doesn't matter whether the waveform is complicated or simple: we could measure self-similarity in either case. In fact, low-pass filtering is throwing away information. Think about what it does in the frequency domain. Let's just zoom in to make things clearer. We're trying to estimate the frequency of the fundamental, which is this thing here. One way to do that would be to throw away everything except the fundamental and measure it. However, there's additional evidence about its value in all these harmonics, which sit at multiples of the fundamental, so keeping all that information in the waveform might actually aid us in recovering the self-similarity - the cross-correlation - between the waveform and the lagged version of that waveform. So it's not so clear that low-pass filtering would be a good idea. Some F0 estimation algorithms do it, and some don't.

Another temptation would be to try and get a signal that's as close as possible to the source: to what was coming out of the vocal folds. One way to do that is called inverse filtering. Let's understand inverse filtering in the frequency domain. We would try and make this spectral envelope, which currently looks like this, flat. So we flatten the envelope by putting the signal through a filter which has the inverse response of the spectral envelope: boost all the low-energy parts and suppress the high-energy parts. That will give us a signal which is a bit more like a pulse train. But inverse filtering has made assumptions - specifically, in this case, that the vocal tract is a linear predictive filter - and doing this inverse filtering might introduce distortions into the signal. So again, some F0 estimation algorithms do it, and some don't.

Let's now take a look at perhaps the most famous F0 estimation method of them all. It's called RAPT: "a robust algorithm for pitch tracking". Note that pitch tracking is the most common term used in the literature. This does autocorrelation, dressed up with some pre- and post-processing. Let's just understand what we're looking at in this picture. This is not a spectrogram. The axis along here is time, but the vertical axis is lag. So this is a correlogram. If we take our cross-correlation plot from before, it's now along the vertical axis. If this were a spectrogram, we'd have a spectrum along the vertical axis; here, we have a cross-correlation function along the vertical axis, and we're using dark pixels to denote peaks. You can see that peaks correspond to dark areas. (The lag range here is different between my example and the one from the paper.) This is a great visual representation of the problem we're faced with. With peak picking, we have to find this peak and we have to track it over time. We have to know when it stops - when the speech stops being voiced - and see when it starts again, and track it through time again. That track is the fundamental period, from which, of course, we can recover the fundamental frequency.
We can see why this is hard if we just do some naive picking of peaks. We'll track F0 perfectly well here, but then we'll switch up to this peak here, because it looks stronger, and maybe switch down to this peak here, and we might make errors here as well. So we're going to get errors in F0 tracking. This would be an octave error, jumping between the different peaks. We might also accidentally pick up some of the in-between peaks, and get non-octave errors as well. To recover from these errors in peak picking, it's normal to do some post-processing. What we will use this diagram for is to obtain candidate values of F0, not the final value. So we would say there are some candidate values here, some candidate values here and here - lots of candidate values, all of them possible evidence of F0 - and then we'll try and join up the dots. The standard way of doing that is some dynamic programming. This diagram is from a different method, and the axis is a bit different: the vertical axis has been transformed from lag into frequency (just one over the lag). These dots are the candidates: they're peaks from the cross-correlation function. We can use dynamic programming to join up the dots. The dynamic programming will have a cost to do with how good each candidate is - how high its peak was - and a cost to do with how likely it is to join to the next dot: for example, how much F0 moves.

So pretty much all signal processing algorithms do some pre-processing, have a core which is doing all the work, and do some post-processing. And in all three of those areas - the pre-processing, the autocorrelation itself, and the post-processing - there are lots of parameters we need to choose. We've already seen some: for example, the window size. Here are the tunable parameters of the RAPT algorithm. There are quite a few of them. We need to set them through experimentation, through intuition, and through our knowledge of speech: for example, the range of minimum and maximum values of F0. To get the very best performance from algorithms like this, we would want to tune some or all of these parameters to the individual speaker that we're dealing with. The F0 range is the most obvious one: making that as narrow as possible will give as few errors as possible.

Whilst autocorrelation - or modified autocorrelation, or cross-correlation - is by far the most common way of extracting F0, there are alternatives out there. Here's one that you should already understand, and that's to use the cepstrum. Look at a speech spectrum, of which this diagram on top is a rather abstract picture: that's frequency and that's magnitude; it's just a spectrum, rather idealised; these are the harmonics; this is the spectral envelope. When we fit the cepstrum to this spectrum, we're extracting shape parameters. The lower-order ones capture the gross shape, like the spectral envelope, but eventually one of the higher-order cepstral coefficients will fit nicely onto these harmonics. It will be the component at that particular quefrency, and we'll get a large value for that cepstral coefficient. If we plot the cepstral coefficients along this axis and their magnitude along this axis, we'll find that eventually one of them has a large value, and that quefrency - which is a time, in seconds - is again the fundamental period. That's a less common method, but it will work.
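A minimal sketch of that cepstral idea, assuming a single voiced frame of samples and, again, leaving out any voicing decision or tracking; the search range is an assumption you would set from the speaker's F0 range.

```python
import numpy as np

def cepstral_f0(frame, fs, f0_min=60, f0_max=400):
    """Estimate F0 from one (voiced) frame via the cepstrum: a strong peak
    at quefrency T0 corresponds to the regular spacing of the harmonics."""
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)          # quefrency axis, in samples
    q_min = int(fs / f0_max)                  # quefrency search range,
    q_max = int(fs / f0_min)                  # i.e. the expected T0 range
    q_peak = q_min + np.argmax(cepstrum[q_min:q_max])
    return fs / q_peak                        # F0 in Hz
```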
There are various other methods out there which we're not going to go into in any detail. One would be the following: construct a filter called a comb filter, which has notches at multiples of a hypothesised value of F0. We put the speech signal through this filter and see how much energy it removes. Then we vary the F0 value of the filter - moving the notches up and down - and the value that removes the maximum amount of energy is taken as the F0 of the signal.

And, of course, we can always throw machine learning at the problem. We could use something like a neural network and ask it to learn how to extract F0 - obviously, from some labelled data. So we would need data to do that: it would need ground truth. Note that the RAPT algorithm and the like don't need any ground truth; their parameters are basically set by hand. And this machine learning method is not magic, because it still needs some feature engineering: it still has this autocorrelation idea at its core, and it still needs the dynamic programming post-processing. So it isn't really very different from the other algorithms.

Let's finish off by considering how you would evaluate an F0 estimation algorithm. To do that, you need some ground truth with which to compare, and there are various ways of getting that. One is to do what we said we don't like to do when we're actually recording data for speech synthesis, but which we would be willing to do to get some ground truth for algorithm development: use a laryngograph to physically measure vocal fold activity and, from that, get a pretty reliable estimate of F0 to compare against our algorithm's estimate from the waveform. We might even hand-correct the contours that we extract with the laryngograph, to get true values for comparison. This, of course, is a bit tedious and a bit expensive, and so we would try not to do it ourselves. We'd try and find somebody else's data, and there are various public databases available, such as this one here, if you wanted to compare the performance of your F0 estimation algorithm against somebody else's.

F0 estimation algorithms can make various sorts of error. A big error would be failing to detect voicing correctly: that would be a voicing status error - a voiced-versus-unvoiced error. We could express that as the percentage of time where we got it right or wrong. The second sort of error would be getting the value of F0 incorrect. Often, when we evaluate algorithms for F0 estimation, we'll decompose those errors into gross errors, such as octave errors (where we're off by a factor of two), and fine-grained errors, where we're off by a small amount.
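Frame by frame, those error measures might be computed something like this, given reference and estimated F0 tracks aligned to the same frame times, with 0 meaning unvoiced. The 20% relative threshold used here to separate gross from fine errors is a common convention, not something specified in the video.

```python
import numpy as np

def f0_errors(f0_ref, f0_est, gross_threshold=0.2):
    """f0_ref, f0_est: arrays with one F0 value per frame, 0 where unvoiced."""
    ref_v, est_v = f0_ref > 0, f0_est > 0
    # Voicing status error: % of frames where the voiced/unvoiced decision is wrong
    voicing_error = 100.0 * np.mean(ref_v != est_v)
    both = ref_v & est_v                       # frames both tracks call voiced
    diff = np.abs(f0_est[both] - f0_ref[both])
    rel = diff / f0_ref[both]
    # Gross errors (e.g. octave errors) vs. small, fine-grained errors
    gross_error = 100.0 * np.mean(rel > gross_threshold)
    fine_error_hz = np.mean(diff[rel <= gross_threshold])
    return voicing_error, gross_error, fine_error_hz
```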
Finally, let's ask ourselves whether F0 estimation is equally difficult for all different types of speech - perhaps even for all speakers. F0 estimation algorithms normally assume that speech is perfectly periodic: that's just an assumption we've made in all of this signal processing. And when speech is not perfectly periodic, they will perform poorly: for example, on creaky voice. We don't expect to get very good estimates of F0 for creaky voice; it's not really well defined what the value of F0 even is for irregular vibrations of the vocal folds. Epoch detection algorithms, which are looking for moments in time, should still be able to detect the individual epochs in creaky voice, but we don't expect them to perform as well as they do on modal (regular) voice. Overall, then, as we move forward into modelling these signals and eventually getting on to statistical parametric speech synthesis, we can predict that it's going to be harder to vocode some voice qualities. Creaky voice is one of them; perhaps breathy voice is another. It's harder to vocode them, and it will therefore be harder to do statistical parametric synthesis of these different voice qualities.
This video just has a plain transcript, not time-aligned to the video

The other important aspect of the speech signal to extract, once we've got F0, is the spectral envelope. Let's just remind ourselves again: we are moving on from the idea of an explicit source-filter model - where we imagine the source is the vocal folds and the filter is the vocal tract - to a more abstract model. It's still the idea of a source and a filter, but now strictly fitted to the signal as we observe it. We don't care so much whether we get the true source or the true filter, so long as we can achieve our aims of manipulation, or modification, or whatever they might be.

So we're now going to try and get at this spectral envelope. It might be something that joins up the tops of all the harmonics, and we need to worry about how to get that without accidentally also fitting our envelope to the detail - the harmonics - which is a source feature that we don't want. We can already see the problem here in the FFT spectrum. This is just computed with a Fourier transform, and we get both source and filter features: we get both the envelope and the detail. What we want to do now is get rid of the detail - get rid of these harmonics - and keep only this envelope.

Some techniques for doing that make an assumption about the form of the filter, such as linear prediction, which assumes it's an all-pole filter. We fit that to the signal and, by controlling the complexity of the filter - that is, by not giving it too many coefficients - we make sure we fit only the envelope and not the detail. We're not going to do that here. We're going to do something more directly signal-based. We're going to follow along with a famous vocoder that's widely used in parametric synthesis, called STRAIGHT, from Kawahara.

In their paper, they state that it's important to pick the correct analysis window size. They say the following: if the time window that we use for spectral analysis - and remember that computing a spectrum involves taking a frame of speech, choosing the duration of that frame, and putting it through the Fourier transform - is comparable in duration to the fundamental period, T0 (which is 1 over F0), then the power spectrum varies in the time domain. But if the time window for analysis spans many pitch periods, we get periodic variation in the frequency domain. That's just stating formally something we actually already know.

So go and get a speech signal and open it up in your favourite speech editor - Praat or, in this case, I've used Wavesurfer - and try changing the size of the analysis window whilst you're calculating a spectrogram. In Wavesurfer, we do that with these controls. If we set the analysis window size relatively small, so that it becomes comparable in size to the fundamental period - one or two fundamental periods - then what we see in the time domain of the spectrogram, going along the time axis, is variation: vertical striations in the power. These are the individual pitch pulses. That's because the analysis window is so small that, as it slides forward across the waveform, the power of the waveform inside the window goes up and down, period by period. Sometimes it has a lot of energy, sometimes less, and that's what we're seeing in the spectrogram as those fluctuations in darkness. So a very short window - we already knew this - gives us very good time resolution but relatively poor frequency resolution.
Kawahara et al. also remind us that when we use a very long analysis window - a long window like this, with many pitch periods inside it - we no longer see that fluctuation in the time domain, because now the average amount of energy falling inside the window is pretty much constant, or nearly constant. It doesn't go up and down with individual pitch periods falling inside or outside the window. With a long analysis window, we get very good frequency resolution - that's this axis - and so now we don't see those vertical striations; we see these horizontal striations instead. That's what we get with long analysis windows, and that is what we're seeing in this view here. So this FFT spectrum was clearly calculated with a relatively long analysis window, because it's resolving the individual harmonics; viewed on a spectrogram, those are the horizontal striations.

STRAIGHT uses this insight to do something pretty clever. STRAIGHT sets its analysis window - the thing applied just before the Fourier transform - to a size that is adaptive to the fundamental period: adaptive to F0. By varying the size of the analysis window with F0 - for example, making it exactly two pitch periods in duration - we ensure that the amount of energy falling inside the window is fairly constant as the window slides across the waveform at a fixed frame rate. And if we make the analysis window what Kawahara calls "comparable to the fundamental period" - for example, twice the fundamental period - we won't get the high frequency resolution which resolves the harmonics, which we don't want. So STRAIGHT basically does this clever trick of an F0-adaptive window, which minimises the interference between the harmonics and the final extracted spectral envelope. After that, it's just standard FFT analysis, and then STRAIGHT does some smoothing to remove any remaining interference from the source.
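The F0-adaptive windowing idea can be sketched like this. To be clear, this is not the STRAIGHT algorithm itself (which adds its own smoothing and much more); it just illustrates choosing the FFT window length from the local F0, so that only about two pitch periods fall under the window. The function name and the Hann taper are illustrative choices.

```python
import numpy as np

def adaptive_spectrum(x, fs, centre, f0, periods=2, n_fft=1024):
    """FFT magnitude spectrum using a window of `periods` pitch periods,
    centred at sample index `centre`, given the local F0 in Hz.
    Assumes the window fits entirely inside x."""
    win_len = int(periods * fs / f0)          # e.g. 2 * T0, in samples
    half = win_len // 2
    frame = x[centre - half:centre - half + win_len].astype(float)
    frame = frame * np.hanning(win_len)       # tapered window before the FFT
    return np.abs(np.fft.rfft(frame, n=n_fft))
```

With a window this short, the harmonics are not resolved, so the magnitude spectrum is already close to the envelope; a longer window (more periods) would bring the harmonic ripple back.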
As we'll see a little bit later, there is more to the speech parameters than F0 and the spectral envelope, because we've been neglecting the non-periodic energy. What do we do with fricatives, for example? STRAIGHT parameterises that too, of course, and it does so by estimating a ratio between periodic and aperiodic energy. We'll come on to that in a moment. When we've done all of that, we'll have a complete parameterisation of the speech signal - and by complete I mean that we can reconstruct the waveform from it - so we'll just need to see how to do that in the synthesis phase. Like many vocoders, we can break STRAIGHT into an analysis phase and a synthesis phase. If we run one straight after the other, that's sometimes called copy synthesis, or just analysis-synthesis, and it will give us a vocoded speech signal. As we're going to see, we're instead going to break it into two parts: do the analysis, then statistical modelling, and then use the synthesis phase to regenerate a waveform when we're synthesising unseen sentences. But that's coming later.

Let's just check that Kawahara is right - which, of course, he is. If we change the analysis window size - again, I did this in Wavesurfer, and you can try it for yourself - we'll see the effect of reducing the interference from the source. Here I've got my time-domain waveform - so this axis is time, and the waveform is sampled at 16-point-something kHz - and I've taken an analysis window, which I've drawn on the waveform, of about two pitch periods. Remember that a tapered window is applied before analysis, which is why we don't make it just one pitch period: this thing will be faded in and faded out, and then it will move forward in fixed steps, say every 5 milliseconds. Analysing with that window of about two pitch periods gives me this result, and we can see that there's little or no evidence of the harmonics: we've got essentially the envelope. It's not particularly smooth, but we could easily fix that with some simple moving-average or median smoothing, or something like that. If we took a longer analysis window - say one that's four times longer, perhaps something like this - then I would get this result, and now I am resolving the harmonics. So I've got interference from the source, and there isn't a smooth spectral envelope. So what STRAIGHT does is choose window sizes of about two pitch periods, gets this sort of result from the fast Fourier transform, and then does a little bit of extra smoothing to get the smooth spectral envelope out.

In the next step, we're going to see that we need to parameterise this smooth spectral envelope, for various reasons, including reducing its dimensionality. The standard way to do that is with the cepstrum, on a mel scale: mel cepstral analysis. This figure demonstrates what happens if we do that directly on the FFT spectrum, versus doing it on the STRAIGHT smooth spectral envelope. If we take the FFT spectrum, warp the frequency scale to the mel scale, and do the cepstral analysis - and we're using a rather large number of cepstral coefficients - we will fit in detail to all of the harmonics, as we can see down here. The reason it fits better at low frequencies than at high frequencies is that we're doing this on a warped frequency scale, so our resolution is reduced up at these high frequencies. However, if we first use STRAIGHT to get a spectral envelope that's free from interference from the harmonics, and then apply the mel cepstral analysis, we get this red envelope, and we can see that it is relatively independent of the harmonics. It might have some other problems - maybe it's a little bit too smooth; it doesn't capture every little bit of detail - but it will be independent, relatively speaking, in a statistical sense, from the value of F0.
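As a simple illustration of the general idea of fitting a cepstral representation to a spectrum and keeping only the low-order coefficients, here is plain, linear-frequency cepstral smoothing in NumPy. Note that this is not the mel cepstral analysis mentioned above (there is no frequency warping here), and not STRAIGHT; it just shows how truncating the cepstrum yields a smooth envelope, with `order` controlling how much harmonic detail survives.

```python
import numpy as np

def cepstral_envelope(mag_spectrum, order=30):
    """Smooth spectral envelope by cepstral liftering: keep only the first
    `order` cepstral coefficients and transform back to the spectrum."""
    log_mag = np.log(mag_spectrum + 1e-10)
    c = np.fft.irfft(log_mag)                 # real cepstrum (symmetric)
    c[order:len(c) - order] = 0.0             # discard higher quefrencies
    return np.exp(np.fft.rfft(c).real)        # back to a smoothed magnitude spectrum
```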
zero.And that's really what we want because we want a representation of speech parameter realisation in which we decompose the signal into those things that relate to property such as F zero.Those things that relate to filleting identity, such as the special envelope on so we could model them separately on weaken independently, manipulate them if you want one reason for motivating independent parameters for Prodi on forfeiting identity, think back to unit selection on the Spar City issue.Often independent feature formulation.Target cost.If we moved to an acoustic space formulation, we can escape a lot of that's Par City.And if in that acoustic space, we can treat independently fundamental frequency and spectral envelope, weaken further combat capacity because we can re combine them in different ways.In other words, if we can separate these things and put them back together in different combinations, we don't need in our database every speak sounded every context at every possible value.F zero.But the main reason for looking at this straight spectral envelope on the way with parameter rise if modelling, is to move forward on to statistical Parametric speech synthesis and so the next step is to take what we've learned in speak parameter isation and think about how we're going to represent those parameters before we try and model them.So what's coming next, then, is to think about representation z off the speech parameters that are suitable for statistical modelling on DH.Whilst we're going through that will have in our mind hidden Markov model with a Gaussian Probability density function as our key statistical model.
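As a rough illustration of the window-length effect demonstrated above, here is a small sketch in Python/NumPy. This is not the STRAIGHT algorithm itself; the synthetic signal, sample rate, F0 and window lengths are all just made-up values for illustration.

import numpy as np

fs = 16000                       # sample rate in Hz (assumed for this sketch)
f0 = 120.0                       # F0 of a synthetic "voiced" signal
t = np.arange(fs) / fs
# crude voiced signal: harmonics of f0 with gently decaying amplitudes
x = sum((0.8 ** k) * np.sin(2 * np.pi * k * f0 * t) for k in range(1, 40))

def log_magnitude_spectrum(signal, win_len, n_fft=4096):
    frame = signal[:win_len] * np.hanning(win_len)   # tapered analysis window
    return 20 * np.log10(np.abs(np.fft.rfft(frame, n_fft)) + 1e-9)

period = int(round(fs / f0))
short_window = log_magnitude_spectrum(x, 2 * period)  # ~2 pitch periods
long_window = log_magnitude_spectrum(x, 8 * period)   # ~8 pitch periods

# Plotting short_window shows almost no harmonic ripple - roughly the envelope,
# which could be smoothed further with a moving average - while long_window
# clearly resolves the individual harmonics, i.e. interference from the source.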
Having considered speech signal analysis - epoch detection (which is really just for signal processing in the time domain), then F0 estimation (which is useful for all sorts of things both in unit selection and statistical parametric speech synthesis), and estimating the smooth spectral envelope - it is now time to think about representing those speech parameters. What we have so far is just analysis. It takes speech signals and converts them, or extracts from them various pieces of useful information: epochs, F0, spectral envelope. The thing we still haven't covered in detail is the aperiodic energy. That's coming up pretty soon. What we're going to do now is model these things. Or, more specifically, we're going to get ready for modelling. We're going to get the representations suitable for statistical modelling. Everything that's going to happen here we can really just describe as "feature engineering". To motivate our choice of which speech parameters are important and what their representation should be, we'll just have a quick look forward to statistical parametric speech synthesis. Here's the big block diagram that says it all. We're going to take our standard front end, which is going to extract linguistic features from the text. We're going to perform some big, complicated and definitely non-linear regression. The eventual output is going to be the waveform. We already know from what we've done in Speech Processing that the waveform isn't always the best choice of representation of speech signals. It's often better to parametrize it, which is what we're working our way up to. So, we're going to assume that the waveform is not a suitable output for this regression function. We're going to regress on to speech parameters. Our choice of parameters is going to be motivated by a couple of things. One is that we can reconstruct the waveform from them. So the parametrization must be complete. We think a smooth spectral envelope, the fundamental frequency, and this thing that we still have to cover - aperiodic energy - will be enough to completely reconstruct a speech waveform. The second thing is that they have to be suitable for modelling. We might want to massage them: to transform them in certain ways to make them amenable to certain sorts of statistical model. So, if we've established that the parameters are spectral envelope + F0 + aperiodic energy, how would we represent them in a way that's suitable for modelling? Remember that this thing that we're constructing - that can analyse and then reconstruct speech - we can call that a "vocoder" (voice coder) - it has an excitation signal driving some filter, and the filter has a response which is the spectral envelope. In voiced speech, the excitation signal will be F0. Additionally, we need to know if there is a value of F0, so typically we'll have F0 and a binary flag indicating whether the signal is voiced or unvoiced. How would we represent that for modelling? Well, we could just use the raw value of F0. But if we plot it we'll realise it has very much a non-Gaussian distribution. We might want to do something to make it look a bit more Gaussian. The common thing to do will be to take the log. So we might take the log of F0 as that representation, plus this binary flag. The spectral envelope needs a little bit more thought. For the moment we have a smooth spectral envelope that's hopefully already independent of the source.
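To make that F0 representation concrete, here is a minimal sketch in NumPy. The frame values are invented; the only point is the log transform plus the binary voicing flag.

import numpy as np

# hypothetical F0 values, one per 5 ms frame; 0 marks an unvoiced frame
f0 = np.array([0.0, 0.0, 118.2, 121.5, 124.0, 0.0, 0.0, 96.3])

vuv = (f0 > 0).astype(np.float32)                            # binary voiced/unvoiced flag
lf0 = np.where(f0 > 0, np.log(np.maximum(f0, 1e-10)), 0.0)   # log F0 on voiced frames

# (lf0, vuv) is what we would hand to the statistical model; at synthesis time
# we recover F0 with np.exp(lf0) wherever vuv indicates a voiced frame.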
That independence from the source is good, but we're going to find out in a moment that the envelope is still very high-dimensional and strongly correlated, and those aren't good properties for some sorts of statistical model. Then we'd better finally tackle this problem of what to do about the other sort of energy in speech, which is involved in - for example - fricatives. Let's write a wish-list. What do we want our parameters to be like? Well, we're going to use machine learning: statistical modelling. It's always convenient if the number of parameters is the same over time. It doesn't vary - for example - with the type of speech segment. We don't want a different number of parameters for vowels and consonants. That would be really messy in machine learning. So we want a fixed number of parameters (fixed dimensionality) and we'd like it to be low-dimensional. There's no reason to have 2000 parameters per frame if 100 will do. For engineering reasons, it's much nicer to work at fixed frame rates - say, every 5ms for speech synthesis or every 10ms for speech recognition - than at a variable frame rate such as - for example - pitch-synchronous signal processing. So we're just going to go for a fixed frame rate here because it's easier to deal with. Of course we want what we've been aiming at all the time, which is to separate out different aspects of speech, so we can model and manipulate them separately. There are some other important properties of this parametrization, and I'm going to group them under the rather informal term of being "well-behaved". What I mean by that is: when we perturb them - for example, if we add little errors to each of their values, which is going to happen whenever we average them with others or when we just model them and reconstruct them, whether that's consecutive frames or frames pooled across similar sounds to train a single hidden Markov model, say - we would like them to still reconstruct valid speech waveforms and not be unstable. So we want them to do the "right thing" when we average them / smooth them / introduce errors to them. Finally, depending on our choice of statistical model, we might need to do some other processing to make the parameters have the correct statistical properties. Specifically, if we're going to use Gaussian distributions to model them, and we would like to avoid modelling covariance because that adds a lot of extra parameters, we'd like statistically uncorrelated parameters. That's probably not necessary for neural networks, but it's quite necessary for Gaussians, which we're going to use in hidden Markov models. We've talked about STRAIGHT and there's a reading to help you fill in all the details about that. Let's just clarify precisely what we get out of STRAIGHT and whether it's actually suitable for modelling. It gives us the spectral envelope, which is smooth and free from the effects of F0; good, we need that! It gives us a value for F0. Now, we could also use any external F0 estimator or the one inside the STRAIGHT vocoder. That doesn't matter: that can be an external thing. It also gives us the non-periodic energy, which we'll look at and parametrize in a moment. The smooth spectral envelope is of the same resolution as the original FFT that we computed it from. Remember that when we draw diagrams like this rather colourful spectrogram in 3D, the underlying data is of course discrete. Just because we join things up with smooth lines doesn't mean that it's not discrete.
So this spectral envelope here is a set of discrete bins. It's the same as the FFT bins. It's just been smoothed. Also, because consecutive values ... if we zoom in on this bit here... consecutive values (i.e., consecutive FFT bins) will be highly correlated. They'll go up and down together, in the same way as the outputs of a filterbank are highly correlated. That high resolution and that high correlation make this representation less than ideal for modelling with Gaussians. We need to do something about that. We need to improve the representation of the spectral envelope. While we are doing that, we might as well also warp the frequency scale, because we know that perceptual scales are normally a better way of representing the spectrum for speech processing. We'll warp it onto the Mel scale. We'll decorrelate, and we're going to do that using a standard technique: the cepstrum. We're then going to reduce the dimensionality of that representation simply by truncating the cepstrum. What we will end up with is something called the Mel cepstrum. That sounds very similar to MFCCs and it's motivated by all the same things, but it's calculated in a different way. That's because we need to be able to reconstruct the speech signal, which we don't need to do in speech recognition. In speech recognition, we warp the frequency scale with a filterbank: a triangular filterbank spaced on a Mel scale; that loses a lot of information. Here, we're not going to do that. We're going to work with a continuous function rather than the discrete filterbank. We'll omit the details of that because it's not important. Once we're on that warped scale (probably the Mel scale, but you could choose some other perceptual frequency scale, which would also be fine), we're going to decorrelate. We'll first do that by converting the spectrum to the cepstrum. The cepstrum is just another representation of the spectral envelope, as a sum of cosine basis functions. Then we can reduce the dimensionality of that by keeping only the first N coefficients. The more we keep, the more detail we'll keep in that spectral envelope representation, so the choice of number is empirical. In speech recognition we kept very few, just 12: a very coarse spectral envelope, but that's good enough for pattern recognition. It will give very poor reconstruction though. So, in synthesis, we're going to keep many more: perhaps 40 to 60 cepstral coefficients. So that finalizes the representation of the spectral envelope. We use an F0-adaptive window to get the smoothest envelope we can, we do an FFT, and do a little additional smoothing, as described in the STRAIGHT paper. Then we warp onto a Mel scale, convert to the cepstrum, and truncate. That gives us a set of relatively uncorrelated parameters, reasonably small in number, from which we can reconstruct the speech waveform. So let's finally crack the mystery of the aperiodic energy! What is it? How do we get it out of the spectrum? Let's go back to our favourite spectrum, of this particular sound here. The assumption is that this spectrum contains both voiced and unvoiced energy. So what we're seeing is the complete spectrum of the speech signal. In general, speech signals have both periodic and non-periodic energy. Even vowel sounds have some non-periodic energy: maybe turbulence at the vocal folds. So we'll assume that this spectrum is made up of a perfectly-voiced part which, if we drew the idealized spectrum, would be a perfect line spectrum...
...plus some aperiodic energy, which also has a spectral shape but has no structure (no line spectrum): so, some shaped noise. These two things have been added together in what we see on this spectrum here. So the assumption STRAIGHT makes is that it's the difference between the peaks, which are the perfect periodic part, and the troughs, which are being - if you like - "filled in" by this aperiodic spectrum sitting behind them, that tells us how much aperiodic energy there is in this spectrum. So we're just going to measure the difference. The way STRAIGHT does that is to fit one envelope to the periodic energy - that's the tips of all the harmonics. It would fit another envelope to the troughs in-between. In-between two harmonics, we assume that all energy present at that point (at that frequency) is non-periodic, because it's not at a multiple of F0. Then we're just going to look at the ratio between these two things. If the red and blue lines are very close together, there's a lot of aperiodic energy relative to the amount of periodic (i.e., voiced) energy. The key point here about STRAIGHT is that we're essentially looking at the difference between these upper and lower envelopes of the spectrum: the ratio between these things. That's telling us something about the ratio between periodic and aperiodic energy. That's another parameter that we'll need to estimate from our speech signals, and store so that we can reconstruct it: so we can add back in an appropriate amount of aperiodic energy at each frequency when we resynthesise. Because all of this is done at the same resolution as the original FFT spectrum, it's all very high resolution. That's a bad thing: we need to fix that. Again, just for the same reasons as always, the parameters are highly correlated, because neighbouring bins will often have the same value. So we also need to improve the representation of the aperiodic energy. We don't need a very high-resolution representation of aperiodic energy. We're not perceptually very sensitive to its fine structure. So we'll just have a "broad-brush" representation. The standard way to do that would be to divide the spectrum into broad frequency bands and just average the amount of energy in each of those bands, at each moment in time (at each frame, say every 5ms). If we did that on a linear frequency scale we might - for example - divide it into these bands. Then, for each time window ...let's take a particular time window... we just average the energy and use that as the representation. Because it's always better to do things on a perceptual scale, our bands might look more like this: getting wider as we go up in frequency. We'll do the same thing. The number of bands is a parameter we can choose. In older papers you'll often see just 5 bands used, and newer papers (perhaps with higher-bandwidth speech) use more bands - maybe 25 bands - but we don't really need any more than that. That's relatively low-resolution compared to the envelope capturing the periodic energy. Let's finish off by looking, at a relatively high level, at how we actually reconstruct the speech waveform. It's pretty straightforward because it's really just a source-filter model again. The source and filter are not the true physical source and filter. They're the excitation and the spectral envelope that we've estimated from the waveform. So they're a signal model. What we've covered up to this point is all of this analysis phase. The synthesis phase is pretty straightforward.
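Before walking through the synthesis phase, here is a rough sketch of the idea behind the cepstral parametrization of the smooth envelope described above. This is not the actual mel-cepstral analysis used with STRAIGHT (which uses an all-pass frequency warping and an iterative fit, as in tools like SPTK); it only illustrates the three steps of warping, taking a cosine transform, and truncating.

import numpy as np

def hz_to_mel(hz):
    # O'Shaughnessy-style mel formula
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_warped_cepstrum(log_envelope, fs, order=60):
    # Warp a smooth log spectral envelope (one frame, n_bins points from 0 to
    # fs/2 on a linear frequency axis) onto an equally-spaced mel axis by simple
    # interpolation, then take a DCT-II and keep only the first `order` terms.
    n_bins = len(log_envelope)
    hz = np.linspace(0.0, fs / 2.0, n_bins)
    mel_points = np.linspace(0.0, hz_to_mel(fs / 2.0), n_bins)
    warped = np.interp(mel_points, hz_to_mel(hz), log_envelope)
    k = np.arange(n_bins)
    basis = np.cos(np.pi * np.outer(np.arange(order), (k + 0.5) / n_bins))
    return (2.0 / n_bins) * (basis @ warped)      # truncated cepstrum

def reconstruct_warped_envelope(cepstrum, n_bins):
    # Invert the truncated cosine series: a smooth log envelope comes back,
    # still on the warped (mel) frequency axis.
    k = np.arange(n_bins)
    basis = np.cos(np.pi * np.outer(np.arange(len(cepstrum)), (k + 0.5) / n_bins))
    weights = cepstrum.copy()
    weights[0] *= 0.5                             # DCT-III convention for term 0
    return weights @ basis

# e.g. cep = mel_warped_cepstrum(smooth_log_envelope, 16000, order=60)
#      approx_env = reconstruct_warped_envelope(cep, len(smooth_log_envelope))
# Keeping more coefficients keeps more envelope detail, exactly as described above.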
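And, in the same spirit, a sketch of reducing a full-resolution aperiodicity spectrum to a handful of band-averaged values; the band edges here are invented for illustration.

import numpy as np

def band_aperiodicity(aperiodicity, fs, band_edges_hz):
    # Average a full-resolution aperiodicity spectrum (one value per FFT bin,
    # for a single frame) within broad frequency bands. `band_edges_hz` lists
    # the band boundaries, e.g. [0, 1000, 2000, 4000, 6000, 8000] for 5 bands.
    n_bins = len(aperiodicity)
    freqs = np.linspace(0.0, fs / 2.0, n_bins)
    bap = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        in_band = (freqs >= lo) & (freqs < hi)
        bap.append(aperiodicity[in_band].mean())
    return np.array(bap)    # one "band aperiodicity" value per band

# A perceptual version simply spaces the edges more densely at low frequencies
# (e.g. equally spaced on the mel scale) rather than linearly, and the number
# of bands (5, 25, ...) is the empirical choice discussed above.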
In the synthesis phase, we take the value of F0 and we create a pulse train at that frequency. We take the non-periodic (i.e., aperiodic) energy in its various bands and we just create some shaped noise. We just have a random number generator and put a different amount of energy into the various frequency bands according to that aperiodicity ratio. For the spectral envelope (possibly collapsed down into the Mel cepstrum and then inverted back up to the full spectrum), we just need to create a filter that has the same frequency response as that. We take the aperiodic energy and mix it with the periodic energy - so, mix these two things together - and the ratio (the "band aperiodicity ratio") tells us how to do that. We excite the filter with that mixture and get our output signal. In this course, we're not going to go into the deep details of exactly how you make a filter that has a particular frequency response. We're just going to state without proof that it's possible, and that it can be done from those Mel cepstral coefficients. So STRAIGHT, as sophisticated as it is, still uses a pulse train to simulate voiced energy. That's something that's just going to have a simple line spectrum. We already know that that might sound quite "buzzy": that's a rather artificial source. STRAIGHT is doing something a little bit better than just a pulse train. Instead of pure pulses, it performs a little bit of phase manipulation, and those pulses become like this. That's just smearing the phase. Those two signals both have the same magnitude spectrum but different phase spectra. This is one situation where moving from the pure pulse to this phase-manipulated pulse actually is perceived as better by listeners. The other thing that STRAIGHT does better than our old source-filter model, as we knew it before, is that it can mix together periodic and non-periodic energy. We can see here that there's non-periodic energy mixed in with these pulses. Good: we've decomposed speech into an appropriate set of speech parameters that's complete (that we can reconstruct from). It's got the fundamental frequency, plus a flag that's a binary number telling us if there is a fundamental frequency or not (i.e., whether the frame is voiced or unvoiced). We have a smooth spectral envelope, which we've parametrized as the Mel cepstrum because that decorrelates and reduces dimensionality. Aperiodic energy is represented as essentially a shaped noise spectrum, and the shaping is just a set of broad frequency bands. We've seen just in broad terms how to reconstruct the waveform. So, there's an analysis phase, and that produces these speech parameters. Then there's a synthesis phase that reconstructs a waveform. What we're going to do now is split apart the analysis and synthesis phases. We're going to put something in the middle, and that thing is going to be a statistical model. We're going to need to use the model because our input signals will be our training data (perhaps a thousand sentences of someone in the studio) and our output signal will be different sentences: the things we want to create at text-to-speech time. This model needs to generalize from the signals it has seen (represented as vocoder parameters) to unseen signals.
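To make the synthesis side a little more concrete, here is a very simplified sketch of building a mixed excitation for a single voiced frame: a pulse train plus noise, weighted per band by the aperiodicity ratio. Real STRAIGHT also applies the phase manipulation described above, and the filtering itself (for example an MLSA-style filter driven by the Mel cepstrum) is omitted; all names and values here are illustrative.

import numpy as np

def mixed_excitation_frame(f0, bap, band_edges_hz, fs=16000, frame_len=400):
    # One frame of excitation: a pulse train at f0, mixed with white noise,
    # where the mixing weight in each frequency band is the band aperiodicity
    # (0 = fully periodic, 1 = fully aperiodic). Purely illustrative.
    period = int(round(fs / f0))
    pulses = np.zeros(frame_len)
    pulses[::period] = 1.0                        # simple pulse train at F0
    noise = np.random.randn(frame_len)

    freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
    P = np.fft.rfft(pulses)
    N = np.fft.rfft(noise)
    mix = np.zeros_like(P)
    for (lo, hi), ap in zip(zip(band_edges_hz[:-1], band_edges_hz[1:]), bap):
        in_band = (freqs >= lo) & (freqs < hi)
        mix[in_band] = (1.0 - ap) * P[in_band] + ap * N[in_band]
    return np.fft.irfft(mix, frame_len)           # time-domain excitation

# e.g. e = mixed_excitation_frame(120.0, [0.1, 0.2, 0.4, 0.7, 0.9],
#                                 [0, 1000, 2000, 4000, 6000, 8000])
# This excitation would then be passed through a filter whose frequency
# response is the (reconstructed) smooth spectral envelope.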