[This video just has a plain transcript, not time-aligned to the video. This is an automatic transcript with light corrections to the technical terms.]

So we have to parameterise the speech signal. In all of automatic speech recognition, we immediately get rid of waveforms and replace them with features. The features are going to be vectors, one per frame: a sequence of vectors. So let's draw some pictures that represent that, so we don't have to use too much heavy-duty notation.

There's the diagram for the previous slide. We've got a frame of speech over some short period of time; we assume that the statistical properties are constant (static) over the sort of duration we can get away with. That will be something like 25 milliseconds, and for that little fragment of speech we'll pull out the waveform. We'll do some analysis. We don't know what this analysis is yet; this analysis will yield a vector of numbers. This vector here. We don't know quite how many numbers to put in this vector just yet. And then we'll do that at fixed intervals throughout the word. We'll move on to the next frame, get the next vector, and do that all the way through the waveform.

So we'll do that for fragments of speech that are 25 milliseconds in duration, and we'll space them about 10 milliseconds apart (100 times per second): we'll take a little snapshot of the spectral envelope and store it in a vector. We get a sequence of these vectors, so we can draw nice pictures of them. These things here: we're going to have a sequence of vectors, and each vector corresponds to one frame of speech. We're going to stack those things into some other data structure, and that's what this thing here is. Inside each of these cells is a vector, and the vector is just a set of numbers that characterise the spectral envelope of the speech. This is just to have some nice compact graphical notation so we can draw pictures of words and show how they're going to be compared to each other. Okay, we're comfortable with this way of notating things. So what is going to go into this feature vector?
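To make the framing idea concrete, here is a minimal sketch in Python/NumPy (the language choice is mine, not something from the video). The 25 ms frame and 10 ms shift are the values mentioned above; using raw FFT magnitudes as the per-frame vector is purely a placeholder, since we haven't yet decided what the analysis should be.

```python
import numpy as np

def frame_and_analyse(waveform, sample_rate, frame_ms=25, shift_ms=10):
    """Cut the waveform into overlapping frames and compute one vector
    of numbers per frame (here: FFT magnitudes, as a placeholder for
    whatever analysis we eventually settle on)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples at 16 kHz
    shift_len = int(sample_rate * shift_ms / 1000)   # e.g. 160 samples = 100 frames per second
    vectors = []
    for start in range(0, len(waveform) - frame_len + 1, shift_len):
        frame = waveform[start:start + frame_len]
        vectors.append(np.abs(np.fft.rfft(frame)))   # placeholder analysis of this frame
    return np.stack(vectors)                         # one row per frame, one column per feature

# usage: one second of (fake) speech at 16 kHz gives roughly 100 frames
features = frame_and_analyse(np.random.randn(16000), 16000)
print(features.shape)   # (98, 201)
```

The stacked array is exactly the "sequence of vectors" data structure in the pictures: one cell per frame, each cell holding the numbers produced by the per-frame analysis.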
What do we need to know about the speech signal to do pattern recognition? As we've already seen with classification and regression trees, machine learning is kind of magic, but it doesn't do everything for us, and specifically it doesn't tell us what features to extract in the first place. It might be able to select the most important ones from a big bag of candidate features; that's what a decision tree will claim to do. But we, as the engineers, have to think of the features first. So there's some feature engineering to do: machine learning doesn't solve this problem for us. We need to use our knowledge of the problem. If this wasn't speech recognition (if it was image recognition, or gene sequencing, or natural language processing) we'd use completely different features, but we might use the same form of machine learning. So the feature engineering is where our knowledge of the problem goes. The machine is going to tell us the parameters of the model; we've got to choose which model, and what features to operate on. So we need to use our knowledge of speech to decide on these features.

Another feature of machine learning is that it's usually best if the features are compact: only the features that are really useful for recognising the patterns we want to recognise, without loads of useless, noisy features. Even decision trees, which claim to be able to sort out the useful features from the useless ones, have limited power in that regard, so if we can engineer the features to be as useful as possible first, we would expect better performance.

We need to capture, obviously, the important information. That's the spectral envelope: that's sufficient for capturing the speech segment type. If it was a vowel, getting the first two, maybe the first three, formants would be a pretty good start for identifying the vowel. We also want the features to be somehow invariant to things that we don't care about. So, for example, given two recordings of the same vowel said at different fundamental frequencies, we'd like the extracted features to be essentially the same in both cases, at least for English. For other languages we might want some features that capture the pitch, because that might be segmental information, but for many languages it's not. We'd like a representation that doesn't vary when F0 varies, and the spectral envelope is a good candidate for that: it's essentially independent of it. So we want to get rid of the fundamental frequency most of the time. Even in languages such as the Chinese languages, and lots of other Asian languages, where pitch is a feature, we might extract it separately and use it as a separate feature; we would still want the spectral envelope. We want to get rid of fundamental frequency, and if possible we'd also like to get rid of things like speaker identity, because we'd like a recording of me saying "one" and a recording of you saying "one" to be matched against each other, so we can do speaker-independent recognition.

[The idea of extracting the spectral envelope using a filterbank will be fully developed in the next module. For now, just pretend the feature vector is the FFT coefficients themselves, or formants, or whatever you are comfortable with.]
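As a toy illustration of why the spectral envelope is a reasonable candidate (it keeps the overall spectral shape while smoothing away the fine harmonic structure caused by F0), here is a very crude sketch: average the FFT magnitudes within a few broad frequency bands. The 20-band choice and the equal-width bands are arbitrary assumptions of mine; the proper filterbank design comes in the next module.

```python
import numpy as np

def toy_envelope(frame, num_bands=20):
    """Very crude spectral envelope: average the FFT magnitudes within
    a few broad, equal-width frequency bands. Averaging smooths away
    the harmonic peaks due to F0 but keeps the overall spectral shape."""
    magnitudes = np.abs(np.fft.rfft(frame))
    bands = np.array_split(magnitudes, num_bands)       # split spectrum into broad bands
    return np.array([band.mean() for band in bands])    # one average magnitude per band

# the same vowel said at two different F0s should give similar 20-number vectors;
# here we just run the function on a random 25 ms frame (400 samples at 16 kHz)
print(toy_envelope(np.random.randn(400)).shape)   # (20,)
```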
Feature vectors
We will make a first attempt at parameterising each frame, but we'll need to revisit this after learning more about the probabilistic model that will be used.