Forum Replies Created
To “force” the model to generate one particular sequence, all we can really do is evaluate the probability that it generated that sequence. This is exactly what the decoding algorithm must do: it “decodes” how the model generated the given observation sequence. Typically, we will make an approximation, such as only decoding the single most likely state sequence.
I think perhaps you are being tempted to think of ASR as a pipeline of processes (the dreaded “flowchart” view)? That view leads us into thinking that certain things happen in certain “modules” and other things in other “modules”. Let’s try a different view, in which we only make the standard machine learning split into two phases: “training” and “testing (or recognition)”. Your question is about the second one, and assumes we have a fully-trained model.
We have a single, generative model of spoken utterances. The model can randomly generate any and all possible observation sequences. When the model generates a particular observation sequence, we can compute quantities such as the likelihood (just call that “probability” for now), the most likely state sequence, most likely word sequence, and so on.
Given a particular observation sequence to be recognised, we force our model to generate that particular sequence. We record the most likely word sequence and announce that as the result.
So, all we need is
a) the generative model – this will be a combination of the acoustic model and language model
b) one or more algorithms for computing quantities we are interested in – one of these algorithms will be called the decoding algorithm
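To make (b) a little more concrete, here is a minimal sketch of Viterbi decoding for a toy HMM with discrete observations, written in Python. Everything here (the two states, the symbols, the probabilities) is invented for illustration; a real recogniser such as HTK also searches over word sequences and uses continuous observations, but the core idea of finding the single most likely state sequence is the same.
import numpy as np

# Toy HMM with invented parameters, just to illustrate what a decoding algorithm computes.
# Two states; observations are discrete symbols 0, 1, 2.
A = np.array([[0.7, 0.3],       # transition probabilities between states
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],  # emission probabilities of each symbol, per state
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])       # initial state probabilities

def viterbi(obs):
    # delta[t, j] = probability of the best state sequence that ends in state j at time t
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            back[t, j] = np.argmax(scores)
            delta[t, j] = scores[back[t, j]] * B[j, obs[t]]
    # trace back the single most likely state sequence
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta[-1].max())

states, prob = viterbi([0, 1, 2, 2])
print(states, prob)   # most likely state sequence, and the probability along that path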
Time to develop your debugging skills…
Two possibilities to get you started:
1. you are not correctly loading multiple MLFs – post your full command line here and I’ll check it
2. there is a formatting error in one of the MLFs – how might you efficiently figure out which one that is? More generally, how might you sanity check an individual user’s data, before deciding to include it in one of your experiments?
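On possibility 2, one approach is a small script that checks every user’s MLF for the basics before you include it. Here is a rough sketch in Python; the directory layout and filename pattern are assumptions, so adapt them to wherever your per-user MLFs actually live. It relies on two properties of HTK MLFs: the file should start with a #!MLF!# header, and each label entry should end with a line containing only a full stop.
import glob

# Rough sanity check for HTK Master Label Files (MLFs).
# The path pattern below is an assumption - change it to match your own layout.
for mlf in sorted(glob.glob("data/*/labels.mlf")):
    with open(mlf) as f:
        lines = [l.rstrip("\n") for l in f]
    problems = []
    if not lines or lines[0] != "#!MLF!#":
        problems.append("missing #!MLF!# header")
    # Each entry starts with a quoted filename pattern and ends with a lone full stop.
    n_entries = sum(1 for l in lines if l.startswith('"'))
    n_terminators = sum(1 for l in lines if l == ".")
    if n_entries != n_terminators:
        problems.append(f"{n_entries} entries but {n_terminators} terminating '.' lines")
    print(mlf, "OK" if not problems else "; ".join(problems))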
Tip: add this command to a shell script to make the shell print out each command, with all variables replaced by their actual values, just before executing it – this can be helpful in debugging:
set -x
You can turn that behaviour off again with
set +x
So to debug just part of a shell script, wrap it like this:
... script does stuff here
set -x
HResults ....
set +x
..... script continues here
Indeed, textbooks often suggest that you imagine the frequency axis to be time, then treat the FFT spectrum as a waveform. That’s fine, but we are smart people and know that the Fourier transform doesn’t only apply to time-domain signals: the horizontal axis can be labelled with anything we like.
You are worried that the cepstrum will fail to accurately capture high peaks in the spectrum. That’s a legitimate concern. First, we can state that the cepstrum derived from the log magnitude spectrum will faithfully capture every detail, if we use enough cepstral coefficients.
Your concern becomes relevant when we use (say) only the first 12 coefficients. When we do this (i.e., truncate the cepstrum), we are making an assumption about the shape of the spectral envelope. The fewer coefficients we use, the “smoother” we assume the envelope is.
The solution is empirical: try different numbers of cepstral coefficients and choose the number that works best (e.g., gives lowest WER in our speech recogniser).
For ASR, 12 coefficients is just right.
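If you want to see what truncation does before running a full experiment, here is a rough numpy sketch of cepstral smoothing of a single frame. The frame itself is a toy signal, and taking the inverse FFT of the log magnitude spectrum is a simplification of the real MFCC recipe (there is no mel filterbank here), so treat the numbers as illustrative only: keeping just 12 coefficients should give a smoother, less exact fit to the log spectrum than keeping 60.
import numpy as np

# Illustration only: cepstral smoothing of one 25 ms frame of a toy signal.
fs = 16000
t = np.arange(int(0.025 * fs)) / fs
frame = np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.randn(t.size)   # invented "speech-like" frame

log_mag = np.log(np.abs(np.fft.rfft(frame * np.hamming(frame.size))) + 1e-10)
cepstrum = np.fft.irfft(log_mag)   # cepstrum = inverse Fourier transform of the log magnitude spectrum

def smoothed_envelope(cepstrum, n_coeffs):
    # Truncate (lifter) the cepstrum, then transform back to the log spectral domain.
    liftered = np.zeros_like(cepstrum)
    liftered[:n_coeffs] = cepstrum[:n_coeffs]
    liftered[-(n_coeffs - 1):] = cepstrum[-(n_coeffs - 1):]   # keep the mirror-image half too
    return np.fft.rfft(liftered).real

rough = smoothed_envelope(cepstrum, 12)     # very smooth envelope
detailed = smoothed_envelope(cepstrum, 60)  # much closer to the original log spectrum
print(np.mean((rough - log_mag) ** 2), np.mean((detailed - log_mag) ** 2))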
You could experiment with this number in the digit recogniser exercise. Just be careful not to store anything in the shared directory (everything there must use the original parameterisation) and to do everything in your own workspace. This will involve modifying the
make_mfccs
script as well as the
CONFIG_for_coding
file. If you do this experiment, talk to the tutor first. Do it for a speaker-independent system with nice large training and testing sets.
Regrettably, remote access is difficult to provide. The machines are frequently switched between Mac OS and a virtual Windows installation, which makes remote login impractical.
Although we do not have the resources to support you, the Build your own digit recogniser exercise should be relatively easy to set up on your own machine, especially on Mac or Linux. You would need to take a copy of the data from the shared folder. It is OK to copy the labels, the MFCC files, and the
info.txt
file only; do not copy the waveforms (they contain personal information). After the course is over, you must delete the data.
Correct, the local distance in DTW is the geometric distance between the pair of feature vectors at a given point in the grid.
We hope that the total distance (usually denoted D), which is the sum of local distances, will be lowest for the template that actually corresponds to what was said in the unknown word.
For a single, given unknown word, DTW is repeated once for every template. In each case, DTW finds the best path that aligns the unknown word with the current template being tried. This results in a separate value for D for each template. We then compare all those D values and pick the lowest.
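Here is a minimal sketch of that procedure in Python. The “feature vectors” are random numbers standing in for real features, and the step pattern (arriving from the left, from below, or diagonally) is just one common choice, so treat this as an illustration of the bookkeeping rather than a reference implementation.
import numpy as np

def dtw_distance(unknown, template):
    # unknown and template are arrays of shape (number of frames, number of features)
    T, U = len(unknown), len(template)
    # local[i, j] = distance between frame i of the unknown and frame j of the template
    local = np.linalg.norm(unknown[:, None, :] - template[None, :, :], axis=-1)
    D = np.full((T, U), np.inf)   # D[i, j] = total distance of the best path reaching (i, j)
    D[0, 0] = local[0, 0]
    for i in range(T):
        for j in range(U):
            if i == 0 and j == 0:
                continue
            # one common step pattern: arrive from the left, from below, or diagonally
            best_prev = min(D[i - 1, j] if i > 0 else np.inf,
                            D[i, j - 1] if j > 0 else np.inf,
                            D[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            D[i, j] = local[i, j] + best_prev
    return D[-1, -1]   # total distance D for the best alignment

# Toy example: run DTW once per template and pick the template with the lowest D.
rng = np.random.default_rng(0)
unknown = rng.normal(size=(50, 12))   # stand-in feature vectors for the unknown word
templates = {word: rng.normal(size=(45 + 5 * k, 12)) for k, word in enumerate(["one", "two", "three"])}
distances = {word: dtw_distance(unknown, t) for word, t in templates.items()}
print(min(distances, key=distances.get), distances)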
At this point in the course, it is indeed a little mysterious what is in the feature vectors. There’s a good reason for keeping you all in suspense: we need to know more about the generative model before making a final decision about the feature vectors.
In other words, we cannot do our feature engineering correctly until we know exactly what properties the generative model has. Specifically, we will need to know its limitations (what it can not model).
So, for now, let us pretend that the feature vector contains one of these possible sets of features:
- the FFT coefficients, or
- the formant frequencies, or
- the energy in different frequency bands (a “filterbank”)
The mystery will be solved within a few lectures, when we will learn about MFCCs.
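As a taster, the third option in the list above is easy to compute yourself. This sketch uses equally-spaced rectangular bands purely for simplicity; it is not the mel-spaced triangular filterbank that MFCCs are built on.
import numpy as np

def filterbank_energies(frame, num_bands=20):
    # Energy in equally-spaced frequency bands of one windowed frame (illustration only).
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    bands = np.array_split(power, num_bands)
    return np.array([band.sum() for band in bands])

frame = np.random.randn(400)        # stand-in for one 25 ms frame at 16 kHz
print(filterbank_energies(frame))   # one 20-dimensional feature vector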
Custom shortcode for video player to allow changing away from video.js in the future and/or adding speed control.
Add a “show me all unread posts” feature to the forum.
Fix formatting of search results (e.g., http://www.speech.zone/?s=speech) to correctly align images, titles and excerpts. Use smaller (or no) images. Just requires CSS tweaks.
Formatting should probably be similar to that of archive pages.
Yes, in effect, weighting each dimension when measuring the local distance is what we will do when we move from measuring distance in a vector space to using generative models.
In fact, we can do a lot more than just weighting each dimension. We can perform feature engineering to transform the feature space, such that our problem (e.g., classification) becomes easier.
In foundation lecture 6, we will take a first look at these ideas. We will then continue this topic in main lecture 7 when we engineer features (MFCCs) that work well with our chosen generative model (a Gaussian probability density function).
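As a simple illustration of what weighting each dimension might look like in practice, here is a sketch where each dimension is weighted by the inverse of its variance, estimated from some (randomly generated) training vectors. This particular weighting is my choice for illustration rather than the method used later in the course, although it is closely related to what a Gaussian with a diagonal covariance matrix does.
import numpy as np

def weighted_euclidean(x, y, weights):
    # Distance where each dimension contributes according to its weight.
    return np.sqrt(np.sum(weights * (x - y) ** 2))

# Estimate weights from some training feature vectors:
# dimensions with large variance get a smaller weight.
rng = np.random.default_rng(1)
training = rng.normal(scale=[1.0, 5.0, 0.2], size=(1000, 3))
weights = 1.0 / training.var(axis=0)

x, y = training[0], training[1]
print(np.linalg.norm(x - y))              # plain Euclidean distance
print(weighted_euclidean(x, y, weights))  # variance-weighted distance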
Festival alone cannot automatically label files. All it can do is process text through the front end to get a linguistic specification, which includes the sequence of phones.
The alignment is generally done using HMMs, often with the HTK toolkit.
For a language not supported by Festival, you need to use a TTS front-end, or be able to convert text into a string of phones some other way (e.g., by dictionary lookup). After that, the alignment step is the same as for English.
Results of the poll: of students who expressed a preference, 73% prefer the current room with its arrangement around group tables.
You could imagine doing speech recognition by measuring the cosine similarity between feature vectors. But this is not the usual way.
We typically use a generative model (the Gaussian, or Normal, probability density function) of feature vectors, within a generative model of sequences (a Hidden Markov Model).
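For a first taste of what that means in practice, here is a minimal sketch of evaluating the log probability density of a feature vector under a Gaussian with a diagonal covariance matrix. The dimensionality and the parameters are invented; in a real recogniser there is (at least) one such density per HMM state, with parameters learned from training data.
import numpy as np

def log_gaussian(x, mean, var):
    # Log density of a multivariate Gaussian with a diagonal covariance matrix.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Invented parameters for a 12-dimensional feature vector.
mean = np.zeros(12)
var = np.ones(12)
x = np.random.default_rng(2).normal(size=12)
print(log_gaussian(x, mean, var))   # larger value = this feature vector is more probable under the model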
Great question! Fourier analysis decomposes any signal into a sum of simple signals (called basis functions): sine waves, each with a frequency, magnitude and phase.
Since sine waves are periodic, Fourier analysis can surely only be applied to periodic signals, can’t it? Correct. At least, only to signals that we assume are periodic.
Short-term analysis
For a signal such as speech, where the spectral envelope changes over time, we must always use short-term analysis techniques. That means taking a frame of the signal (typically 25ms) and making some assumptions about the signal within that frame.
We will assume that the spectrum doesn’t change at all within the frame: the signal is “stationary”.
Assumption that the signal is periodic
To apply Fourier analysis, we make another assumption: the signal is periodic. In the case of short-term analysis, the Fourier analysis effectively assumes that the frame of signal is repeated over and over, before and after the frame.
Even for aperiodic sounds like fricatives, we effectively turn them into signals that repeat with a period of one frame. Since the frequency resolution of the Fourier transform is limited by the duration of the frame, we don’t actually see this “assumed periodicity” in the resulting spectrum: it’s at a frequency lower than we can resolve.
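The following sketch shows the mechanics of short-term analysis described above. The 25 ms frame, 10 ms shift and Hamming window are typical choices rather than the only possible ones; the tapered window reduces the artefacts caused by the implicit assumption that each frame repeats.
import numpy as np

def short_term_spectra(signal, fs=16000, frame_ms=25, shift_ms=10):
    # Cut the signal into overlapping frames, window each one, and take its magnitude spectrum.
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    window = np.hamming(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))
    return np.array(spectra)   # shape: (number of frames, number of frequency bins)

signal = np.random.randn(16000)   # one second of a stand-in "signal"
print(short_term_spectra(signal).shape)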