Forum Replies Created
The number of observations in the observation sequence is fixed, and they all have to be generated by the model (i.e., the compiled-together language model and acoustic model).
There are many possible paths through the model that could generate this observation sequence. Some paths will pass through mostly short words, each of which generates a short sequence of observations (because short words tend to have short durations when spoken). Other paths pass through long words, each of which will typically generate a longer sequence of observations.
So, to generate the fixed-length observation sequence, the model might take a path through many short words, or through a few long words, or something in-between.
Paths through many short words are likely to contain insertion errors. Paths through a few long words are likely to contain deletion errors. The path with the lowest WER is likely to be a compromise between the two: we need some way to control that, which is what the WIP provides.
Again, J&M’s explanation of the LMSF is not the best, so don’t get lost in their explanations of the interaction between LMSF and WIP.
In summary:
- The LMSF is required because the language model computes probability mass, whilst the acoustic model computes probability density.
- The WIP enables us to trade off insertion errors against deletion errors.
“fixed” just means that it is a constant value.
The word insertion penalty, which is a log probability, is “logWIP” in J&M equation 9.50. It is summed to the partial path log probability (e.g., the token log probability in a token passing implementation) once for each word in that partial path, which is why it is multiplied by N in the equation.
The HTK manual says
The grammar scale factor is the amount by which the language model probability is scaled before being added to each token as it transits from the end of one word to the start of the next
but of course, they mean “language model log probability”, and when they say
The word insertion penalty is a fixed value added to each token
they mean “added to the log probability of each token” (the same applies to the previous point too).
The HTK term “penalty” is potentially misleading, since in their implementation the value is added, not subtracted. Conceptually there is no difference and it doesn’t really matter: we can just experiment with positive and negative values to find a value that minimises the WER on some held-out data.
The implementation in HTK is consistent with J&M equation 9.50.
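As a concrete illustration (a sketch only – the variable names and the example LMSF/WIP values are made up, not taken from HTK or J&M), this is how the three quantities in equation 9.50 combine in the log domain:

import math

def combined_log_score(acoustic_log_likelihood, lm_log_prob, num_words,
                       lmsf=15.0,   # language model scale factor (found empirically)
                       wip=0.5):    # word insertion penalty, as a probability-like value
    # score = log p(O|W) + LMSF * log P(W) + N * log(WIP)
    return acoustic_log_likelihood + lmsf * lm_log_prob + num_words * math.log(wip)

# e.g. a 7-word hypothesis
print(combined_log_score(-3521.8, -14.2, 7))

With this parametrisation, a WIP below 1 gives a negative log value, penalising paths through many short words (insertions), and a WIP above 1 does the opposite – which is why experimenting with values on both sides is sensible.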
When J&M say
Thus, if (on average) the language model probability decreases…
they are talking about the probability decreasing as the sentence length increases, since more and more word probabilities will be multiplied together.
Their explanation of the LMSF is rather long-winded. There is a much simpler and better explanation for why we need to scale the language model probability when combining it with the acoustic model likelihood. In equation 9.48, P(O|W) implies that the acoustic model calculates a probability mass. It generally does not!
If the acoustic model uses Gaussian probability density functions, it cannot compute probability mass. It can only compute a probability density. Density is proportional to the probability mass in a small region around the observation O. The constant of proportionality is unknown.
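A quick way to see this (an illustrative snippet, not from the original post): a Gaussian density evaluated at a point can be greater than 1, which a probability mass never could be.

import math

def gaussian_pdf(x, mean, stdev):
    # probability *density* of a univariate Gaussian
    return math.exp(-0.5 * ((x - mean) / stdev) ** 2) / (stdev * math.sqrt(2 * math.pi))

print(gaussian_pdf(0.0, 0.0, 0.1))   # about 3.99 – a density, not a probability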
Since we always work in the log probability domain, equation 9.48 involves a sum of two log probabilities.
The acoustic model will compute quantities on a different scale to the language model. We need to account for the unknown constant of proportionality by scaling one or other of them in this sum. The convention is to scale the language model log probability, hence the LMSF. We typically find a good value for the LMSF empirically (e.g., by minimising the Word Error Rate on some held-out data).
The error

Unable to open label file rec/panagiot_test.2.lab

tells us that HResults cannot find the .lab file (which contains the correct label) corresponding to the recognition result stored in the file rec/panagiot_test.2.rec. HResults is looking for that .lab file in the rec directory. But this is because it did not find it within the MLF (Master Label File) panagiot_test.mlf – perhaps this file is missing from that user’s data?

The error

(standard_in) 2: syntax error

occurs later – so debug that next…

Your error is occurring with HResults, not HVite:

FATAL ERROR - Terminating program HResults
I apologise that the Remote Desktop is unreliable for sound playback. Please can all students submit support requests to the IS Helpline about this.
Unfortunately, at the weekend, you cannot access the lab and will have to use the Remote Desktop.
1. refer to the Formatting instructions for what is included and excluded from the word count
2. no, use any reasonable format that you like – keep your reader in mind: what format will be easiest for them to read?
3. your goal is to demonstrate your understanding, so you will very likely need to say something brief about what Text Normalisation is, so that you can explain the error and why it occurred; if you propose a possible solution, you will of course need to say something about how Text Normalisation is performed (what are the sub-steps? how is each done? rules? machine learning?).
4. If you insist, yes, but it will be included in the word count, so I cannot see any good reason to include one. An Appendix is for optional material that the reader does not have to look at unless they wish to – so you would be using up word count on something that may not be read…
5. To get a good mark, yes! Refer to the Bibliography section of Report Write-up.
That’s correct – actually, car is a core Scheme function, which returns the first item from a list.

Moving on to Module 6 and the video Pitch period, we are now looking at how to extract the vocal tract’s impulse response from a natural speech waveform.
If the impulse response actually did decay all the way to zero before the next glottal pulse, this would be easy for the reason stated above: one pitch period of the speech waveform would be exactly the impulse response we want.
Unfortunately, in natural speech, things are not that simple: the impulse responses overlap. So all we can do is deal in terms of pitch periods. We extract overlapping frames from the waveform so that we can reconstruct the waveform later using overlap-and-add. Since the analysis frames overlap, they will contain more than one pitch period. A good choice is an analysis frame capturing two pitch periods.
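To make that concrete, here is a minimal sketch (assuming we already have the waveform as a NumPy array and a list of pitch-mark sample indices – neither is provided in the original post) of extracting two-pitch-period frames, each centred on a pitch mark and tapered so they can later be overlap-added:

import numpy as np

def extract_frames(waveform, pitch_marks):
    frames = []
    for i in range(1, len(pitch_marks) - 1):
        start = pitch_marks[i - 1]
        end = pitch_marks[i + 1]                # previous mark to next mark = two pitch periods
        frame = waveform[start:end].astype(float)
        frame = frame * np.hanning(len(frame))  # taper for smooth overlap-and-add
        frames.append(frame)
    return frames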
A pitch period is the period in between two glottal pulses. Its duration is denoted T0 (measured in seconds). The term ‘pitch period’ is used to refer both to the duration (T0) and to the corresponding stretch of the speech waveform itself.
The term ‘fundamental period’ is another way of saying ‘pitch period’ (and is more technically correct, of course, because ‘fundamental frequency’ is more correct than ‘pitch’ when talking about a speech waveform).
A ‘pitch mark’ is a label we might place on a speech waveform to mark the position of a glottal pulse. Assuming the pitch marks are accurate, then the duration between two consecutive pitch marks is T0, by definition.
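For example (with made-up pitch-mark times, purely to illustrate the definition):

pitch_mark_times = [0.500, 0.508, 0.516, 0.525]   # seconds, hypothetical values

for t_prev, t_next in zip(pitch_mark_times, pitch_mark_times[1:]):
    T0 = t_next - t_prev
    print(f"T0 = {T0 * 1000:.1f} ms, F0 = {1.0 / T0:.1f} Hz")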
If the glottal pulses are sufficiently far apart in time (a large T0), then the impulse response of the vocal tract will decay away to zero before the next glottal pulse. In this case, each pitch period is equal to one impulse response of the vocal tract. This is almost the case in the (synthetic) speech waveform in the video Impulse Response where the waveform has decayed almost to zero before the next period starts.
So, a simple way to understand voiced speech is as a sequence of impulse responses of the vocal tract. This is a useful and helpful simplification for developing our understanding of speech signals. The video Source-filter model also makes this simplifying assumption (and all the speech signals used as examples are synthetic, to make things clearer).
However, in natural speech, the waveform generally does not decay all the way to zero before the next glottal pulse. Therefore, the impulse responses overlap (and we can assume they are simply summed, using our simplified model of the vocal tract).
I’ve been pushing hard for longer access hours, but unfortunately have been refused on the grounds of “Health and Safety”. The instructions I have been given by the PPLS Head of Information Services are that students should use other spaces on campus (which I think includes Informatics rooms in the same building) for group study, and to access the PPLS AT 4.02 lab machines remotely from there.
I’m sorry about this – I have been unable to get a satisfactory explanation of why it’s unsafe to use AT 4.02 at the weekend, whilst it is safe to use other rooms.
How much disk space is available?
$ df -h .
If the Use% column is showing close to 100%, that means the disk is nearly full.
If you are using a disk that is shared with other people (as is the case in the PPLS lab), then the amount of available space is the total for everyone sharing that disk (it doesn’t belong to you individually). The available space reported by df will fluctuate up and down as other users create or delete files.

How much disk am I using? Change to your home directory, then measure the size of all the items there:
$ cd
$ du -sh *
That may take a minute or two to run and may produce a lot of output. It will be more convenient to sort the output by size:
$ du -sh * | sort -h
Now that you know which directory is the largest, you could cd into it and repeat the above, drilling down to find what is using the most space.

Or, get clever and find all directories at once and measure their size, reporting this in a sorted list (this will take some time, so be patient):
$ find . -type d -exec du -sh {} \; | sort -h
One example would be a convolutional layer. This has a very specific pattern of connections that expresses the operation of convolution between the activations output by a layer and a “kernel” (the kernel weights are shared across all positions, i.e., weight sharing).
We might use a convolutional layer when we wish to apply the same operation to all parts of some representation (potentially of varying size). They are very commonly used in image processing, but have their uses in speech processing too. For example, we might use them to create a learnable feature extractor for waveform-input ASR.
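To make the idea of weight sharing concrete, here is a small sketch (illustrative only – the kernel is random here, whereas in a real network it would be learned) of a single 1-D convolution applied to a waveform, producing one feature value per 10 ms frame:

import numpy as np

def conv1d(waveform, kernel, stride=1):
    k = len(kernel)
    out = []
    for start in range(0, len(waveform) - k + 1, stride):
        # the *same* kernel (shared weights) is applied at every position
        out.append(np.dot(waveform[start:start + k], kernel))
    return np.array(out)

waveform = np.random.randn(16000)      # stand-in for 1 second of audio at 16 kHz
kernel = np.random.randn(400) * 0.01   # 25 ms kernel; learned in a real network
features = conv1d(waveform, kernel, stride=160)   # 160 samples = 10 ms hop
print(features.shape)                  # (98,) – one value per frame

A real convolutional layer would have many such kernels (one per output channel), followed by a non-linearity, but the pattern of connections is exactly this.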
bash$ sox recordings/arctic_a0001.wav -b16 -r 16k wav/arctic_a0001.wav remix 1
works as expected for me on your file.
Use soxi to inspect your output file: does it have the expected sampling rate, bit depth and duration?

One explanation for the large size of your output file could be that you accidentally combined multiple files, which would happen if you did this:
bash$ sox recordings/*.wav -b16 -r 16k wav/arctic_a0001.wav remix 1