Forum Replies Created
Are you sure it is actually suspending and not just locking the screen?
Try running this simple command in the terminal; it will let you see whether the screen is locked (and still running) or actually suspended:
while true; do date; sleep 5; done
Wait 5 minutes to let the screen lock. Unlock it. If the machine had suspended, you will see a gap in the times printed. For me, this shows that the machine keeps running when the screen is locked.
You can change the time before the screen locks, if you like.
There is simply less correlation between cepstral coefficients than between filterbank outputs. So much less that we assume there is none (or at least, none worth modelling)!
In reality, there is of course some remaining correlation. In an advanced Automatic Speech Recognition course we would look at ways to model covariance without having a full covariance matrix for each and every Gaussian (because that would be too many parameters). We might share covariances between Gaussians, or do clever things with the diagonal vs. off-diagonal entries. But all of that is well beyond the scope of the Speech Processing course.
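To see why full covariance matrices would be too many parameters, here is a little sketch. The 13-dimensional feature vector is just an illustrative choice, and the function is a textbook diagonal-covariance Gaussian, not code from the course:

```python
import numpy as np

dim = 13  # e.g., 13 cepstral coefficients per frame (illustrative choice)

# A full covariance matrix is symmetric, so it has dim*(dim+1)/2 free
# parameters; a diagonal covariance has just dim (one variance per
# coefficient). And that is per Gaussian!
full_params = dim * (dim + 1) // 2
diag_params = dim
print(full_params, diag_params)  # 91 vs 13

# With a diagonal covariance, the multivariate Gaussian factorises into a
# product of univariate Gaussians; cheap to evaluate and cheap to train.
def diag_gaussian_logpdf(x, mean, var):
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

print(diag_gaussian_logpdf(np.zeros(dim), np.zeros(dim), np.ones(dim)))
```

Multiply that parameter count by the number of Gaussians in a large system and the saving is substantial.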
Before commencing Viterbi training, the model must have some parameters. These could come from uniform segmentation, for example. These parameters will not be optimal: they will not be the parameters that maximise the probability of the training data.
For that initial model, we use the Viterbi algorithm to find an alignment between model and training data. This alignment is the best possible one, given the current model’s parameters (which, remember, are not optimal at this stage). Because the model is not yet optimal, this alignment will not necessarily be the best either.
This alignment is used to update the model parameters. The model is now better: it will generate the training data with a higher probability than with the previous model parameters.
Because the model is better, it will now be able to find a more probable alignment with the training data than the previous model. So we re-align the data, then use this new alignment to update the model parameters.
This can be repeated (iterated) a number of times. We stop when the model no longer improves, as measured by the probability of it generating the training data.
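The whole loop can be sketched in a few lines. This is a deliberately simplified stand-in, with made-up toy data: it assigns frames to their nearest state mean and ignores transition probabilities, so it is closer to segmental k-means than to real Viterbi training, but the align / update / repeat structure is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy training data: a sequence that moves through three "sounds".
data = np.concatenate([rng.normal(m, 0.3, 50) for m in (0.0, 2.0, 4.0)])

# Initialise by uniform segmentation: split the data into equal thirds.
n_states = 3
means = np.array([seg.mean() for seg in np.array_split(data, n_states)])

prev_score = -np.inf
for iteration in range(20):
    # "Alignment": assign each frame to its closest state mean. (A real
    # Viterbi alignment also uses transition probabilities and enforces
    # left-to-right order; this nearest-state version keeps the sketch short.)
    align = np.argmin(np.abs(data[:, None] - means[None, :]), axis=1)
    # Update: re-estimate each state's mean from the frames aligned to it.
    means = np.array([data[align == s].mean() for s in range(n_states)])
    # Score the model; stop when it no longer improves.
    score = -np.sum((data - means[align]) ** 2)
    if score <= prev_score:
        break
    prev_score = score

print(np.round(means, 2))  # should approach the true means 0, 2, 4
```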
The vertical axis on the cepstrum is the value of the cepstral coefficients. It could be labelled “amplitude” or “magnitude”. Sometimes we use the absolute magnitude for the purposes of plotting so that everything is positive.
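If you want to see this for yourself, here is a minimal numpy sketch (the signal and all its parameters are made up for illustration):

```python
import numpy as np

fs = 16000
t = np.arange(1024) / fs
# Toy "voiced" signal: a 200 Hz fundamental plus a couple of harmonics.
signal = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in (1, 2, 3))

# Cepstrum: inverse DFT of the log magnitude spectrum.
spectrum = np.fft.rfft(signal * np.hanning(len(signal)))
log_magnitude = np.log(np.abs(spectrum) + 1e-10)  # avoid log(0)
cepstrum = np.fft.irfft(log_magnitude)

# Cepstral coefficients can be negative, so for plotting we often take
# the absolute magnitude so that everything is positive:
cepstrum_for_plotting = np.abs(cepstrum)
```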
We need to clearly separate the model from the algorithm.
The HMM is the model. It has no memory other than the state, which has to encapsulate all information required to do computation (e.g., to generate an observation).
There are various algorithms available to perform computations with this model. These all take advantage of the memoryless nature of the model in order to simplify that computation.
The Viterbi algorithm computes the single most likely state sequence (= path through the model) to have generated a given observation sequence. The key step in the Viterbi algorithm is to compare all the paths arriving at a particular state at a particular time and keep only the most probable. This is possible because we do not need to know anything more about those paths other than the fact they are all in the same state.
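Here is a minimal sketch of that key step on a made-up two-state HMM (all the probabilities are invented for illustration, and log probabilities are used so we can add rather than multiply):

```python
import numpy as np

# Toy 2-state HMM: log transition and log emission probabilities.
log_trans = np.log(np.array([[0.7, 0.3],
                             [0.4, 0.6]]))
log_emit = np.log(np.array([[0.9, 0.1],    # state 0 prefers symbol "a"
                            [0.2, 0.8]]))  # state 1 prefers symbol "b"
log_init = np.log(np.array([0.5, 0.5]))
obs = [0, 0, 1, 1]  # "a a b b" encoded as symbol indices

n_states = 2
delta = log_init + log_emit[:, obs[0]]  # best log prob ending in each state
backpointer = np.zeros((len(obs), n_states), dtype=int)

for t in range(1, len(obs)):
    # The key Viterbi step: of all the paths arriving at state j at time t,
    # keep only the most probable and remember where it came from.
    scores = delta[:, None] + log_trans  # scores[i, j]: come from i, go to j
    backpointer[t] = np.argmax(scores, axis=0)
    delta = np.max(scores, axis=0) + log_emit[:, obs[t]]

# Trace back the single most likely state sequence.
path = [int(np.argmax(delta))]
for t in range(len(obs) - 1, 0, -1):
    path.append(int(backpointer[t, path[-1]]))
path.reverse()
print(path)  # [0, 0, 1, 1]
```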
The Expectation (E) step computes all the statistics necessary, such as state occupancy probabilities. A simple way to think of this is averaging. This is where all the computation happens because we are considering all possible state sequences.
The Maximisation (M) step updates the model parameters, using those statistics, to maximise the probability of the training data. This is a very simple equation requiring very little computation.
In theoretical explanations of Expectation-Maximisation, which for HMMs is called the Baum-Welch algorithm, the E step is typically defined as computing the statistics. For example, we compute the state occupancy probabilities so that the M step can use them as the weights in a weighted sum of the observations (i.e., the training data).
In a practical implementation of the E step, we actually compute the weighted sum as we go. We create a simple data structure called an accumulator which has two parts: one to sum up the numerator and the other to sum up the denominator, of each M step equation (e.g., for the mean of a particular Gaussian, Jurafsky and Martin 2nd edition equation 9.35 in Section 9.4.2). There will be one accumulator for each model parameter. The M step is then simply to divide numerator by denominator and update the model parameter (and to do that for each model parameter).
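Here is a sketch of such an accumulator, for re-estimating one Gaussian mean. The occupancy probabilities are made up; in a real system they would come from the forward-backward computation:

```python
import numpy as np

# One accumulator per model parameter: a numerator sum and a denominator
# sum, filled during the E step, then divided in the M step.
class Accumulator:
    def __init__(self, dim):
        self.numerator = np.zeros(dim)  # sum over frames of gamma_t * o_t
        self.denominator = 0.0          # sum over frames of gamma_t

    def accumulate(self, gamma_t, o_t):
        self.numerator += gamma_t * o_t
        self.denominator += gamma_t

    def m_step(self):
        # Weighted average of the observations: the updated mean.
        return self.numerator / self.denominator

# E step (sketch): these state occupancy probabilities are invented;
# really they would come from forward-backward.
observations = np.array([[1.0], [2.0], [3.0]])
gammas = [0.9, 0.5, 0.1]  # occupancy of one particular state at each frame

acc = Accumulator(dim=1)
for gamma_t, o_t in zip(gammas, observations):
    acc.accumulate(gamma_t, o_t)

new_mean = acc.m_step()
print(new_mean)  # (0.9*1 + 0.5*2 + 0.1*3) / (0.9 + 0.5 + 0.1)
```

There would be one such accumulator per model parameter, and the M step is just that final division, done for each of them.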
For Speech Processing, aim for a conceptual understanding of Expectation-Maximisation. You would not need to reproduce these equations under exam conditions.
Now that I’ve mentioned accumulators, here is something right at the edge of the course coverage:
For large-vocabulary connected speech recognition, we model phonemes. If there is enough data, we can make these models context-dependent and thus get a lower WER. The context is typically the left and right phoneme, and such models are called triphones. We need to share parameters between clusters of model parameters because there won’t be enough (or sometimes any) training examples for certain triphones. This is called “tying”. It turns out to be very easy to implement training for tied triphones: we just make a single accumulator for each cluster of shared model parameters.
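As a sketch (the triphone names, clusters, and statistics are all invented for illustration):

```python
# Hypothetical tying: these two triphones of /eh/ share one set of model
# parameters, so they share one accumulator.
tying = {
    "b-eh+d": "eh_cluster_1",
    "p-eh+t": "eh_cluster_1",  # too rare to train alone; tied with the above
    "s-eh+n": "eh_cluster_2",
}

# One accumulator [numerator sum, denominator sum] per *cluster*,
# not per triphone.
accumulators = {c: [0.0, 0.0] for c in set(tying.values())}

# During the E step, statistics from every tied triphone flow into the
# single shared accumulator for its cluster.
training_stats = [("b-eh+d", 0.8, 1.0), ("p-eh+t", 0.6, 2.0), ("s-eh+n", 0.9, 3.0)]
for triphone, gamma, obs in training_stats:
    acc = accumulators[tying[triphone]]
    acc[0] += gamma * obs
    acc[1] += gamma

# M step: one shared parameter (here, a mean) per cluster.
means = {c: num / den for c, (num, den) in accumulators.items()}
print(means)
```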
You are correct: P(O|W) is computed by the HMM. The probability is the product of the emission and transition probabilities.
If we compute that by summing over all state sequences (which is the correct thing to do, by definition – we marginalise or “sum away” the state sequence) then we call that the “forward probability”.
But, normally during recognition, we approximate that sum by only considering the most likely state sequence.
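A toy comparison of the two (all numbers invented): the forward probability sums over every state sequence, so it is always at least as large as the probability of the single best one.

```python
import numpy as np

# Toy 2-state HMM with illustrative probabilities.
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
init = np.array([0.5, 0.5])
obs = [0, 1]  # a short observation sequence

# Forward probability: marginalise ("sum away") the state sequence.
alpha = init * emit[:, obs[0]]
for t in range(1, len(obs)):
    alpha = (alpha @ trans) * emit[:, obs[t]]
forward_prob = alpha.sum()

# Viterbi approximation: keep only the most likely state sequence.
delta = init * emit[:, obs[0]]
for t in range(1, len(obs)):
    delta = (delta[:, None] * trans).max(axis=0) * emit[:, obs[t]]
viterbi_prob = delta.max()

print(forward_prob, viterbi_prob)  # forward >= Viterbi, always
```

For well-trained models the most likely state sequence usually dominates the sum, which is why the approximation works well in practice.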
The VM image appears to have been corrupted. You probably need to download a fresh copy of the VM (this means you will lose any scripts stored inside the VM, but hopefully you have a backup of those somewhere).
Ask Matt Spike for help in configuring the keyboard mapping. For now, a workaround is to copy-paste an existing # symbol.
On my Apple keyboard, hash is Alt-3, even though the keyboard doesn’t show a # symbol on that key.
Tutorials and tutorial groups
This is the first time we have run the course with small tutorial groups. Our normal format is to divide the class into two large groups of around 40 students, and to run 1 hour 50 minute computing lab sessions with the lecturer and two tutors in attendance.
We decided to replace that with two sessions per group per week. Inevitably, a 50-minute tutorial will always feel short and will run out of time, but we were not able to resource double-length tutorials for the number of groups required.
We carefully designed each group to contain students from a variety of backgrounds, in the hope that you would help each other with some of the basic background skills. This was only partially successful.
There was a wide variation in how well tutorial groups worked. Many did a great job of meeting in advance of tutorials to study together and prepare questions. Other groups didn’t manage this. We don’t fully understand why this is and would appreciate additional feedback from all students.
First assignment
Improve the documentation of Festival and/or the assignment.
It is not necessary to become a Festival expert to do the assignment. Some of you worried that you should be explaining precisely how Festival works, but the assignment didn’t require that. You were allowed to assume that Festival works as per the theory taught in class. We will make this clearer in future.
Walk-through examples of how to analyse some errors
Normally this would have happened during in-person lab sessions. We were hoping that tutorials and the forums would replace this, but that didn’t work as well as expected. In future, if we teach the course without in-person lab sessions, we will add walk-throughs, possibly as videos.
The SIGNALS material (especially the notebooks) is too hard
As noted in the response to the numerical scores, this was a result of our experimentation with this new material. We don’t apologise for conducting experiments on you! We need to do that every year to keep improving the course.
Several of you made excellent suggestions and we will incorporate some of these in future:
- Divide material into levels of difficulty, or prioritise into essential/recommended/extra like the readings
- Add walk-through videos or live classes
The PHON material is too hard
In line with the numerical score distribution, around a third of respondents found the PHON material very challenging. We will adjust the delivery of this material in future to make it more accessible to people without any background.
For this year, the PHON material is being assessed in both items of coursework and in the exam. Our expectations of what students will master are realistic. Knowing the basics will be enough to do well on this course.
Summary of the numerical responses, with my responses
There is enough ongoing technical support (for the VM, Festival etc): 4.4
This seems to be working fine.
The tutorial sessions are useful: 3.8
The tutorials appear to be mostly useful, but we accept that there is room for improvement. In the second half of the course we will shift slightly towards covering the core material rather than problem sets / Python notebooks. Tutorial groups vary widely in how well they are working – see response to written comments below.
I find speech.zone easy to use and navigate: 4.5
Thank you. If there is anything about the site in general you would like us to improve, post on the Requests forum.
I find the forums on speech.zone useful: 4.4
Also good – please keep using them!
The lecturers (Simon, Catherine, and Rebekkah) are helpful: 4.6 (no ratings below 3; 5 responses of 3; the rest 4 or 5)
Thank you for the positive feedback. We very much appreciate that.
The tutors (Jason, and Jason) are helpful: 4.1
Thank you from the tutors.
The difficulty of the SIGNALS tutorial material is appropriate (1 – too easy, 3 – just right, 5 – too hard): 3.8 (with only one response below 3 and all others 3, 4, or 5)
The best way for us to improve the course is to experiment a little each year. This material (new for this year) was designed to be challenging. It ended up a little more challenging than intended. We will be keeping this material in the course, but we will divide each notebook into levels of difficulty to make it clearer what is essential and what is optional.
The difficulty of the PHON tutorial material is appropriate (1 – too easy, 3 – just right, 5 – too hard): 3.0 (with a spread of scores right across the range).
This is an area where a few students have a strong background and others no background at all. The wide spread of scores surprised us. In future, we will try to make this material more accessible for students without any background.
I would like a weekly timetabled lecture for the remaining modules (1 – strongly disagree, 5 – strongly agree): 3.4 (with 8 people rating below 3 and all others 3, 4, or 5)
There is a general positive feeling towards this, but it is not overwhelming. We have responded by providing a pre-recorded class video at the start of each week, then going over that same material in a live class on Tuesdays. The live class will be recorded, so attendance is not required for those who cannot make it.