Forum Replies Created
Run
soxi recordings/arctic_a0001.wav
to see information about that file format, and post the output here. If you wish, attach one file, such as recordings/arctic_a0001.wav, to your post so I can investigate.

-r indicates the sampling rate of the output file. sox will automatically determine the sampling rate of the input.
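For example, a command along these lines creates a 16 kHz version of one recording (the output filename here is just a placeholder):
sox recordings/arctic_a0001.wav -r 16000 arctic_a0001_16k.wav
The -r 16000 option sets the sampling rate of the output; recent versions of sox will resample automatically when the output rate differs from the input rate, while older versions may need the rate effect added explicitly.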
Here is a screenshot for another example aggregate device, this time combining an external USB microphone with the built-in headphone port of a laptop.
The problem is that, on newer Macs, the microphone and the headphones/speakers appear as separate audio devices. So there is no single device with both inputs and outputs.
Here’s a possible solution:
try creating an aggregate device, using Audio MIDI Setup (which you’ll find in /Applications/Utilities). Press the small “+” in the lower left corner to create a new device. The attached screenshot shows you what to do.
Then, select this as your device in SpeechRecorder.
Warning! If you use the built-in microphone of your laptop at the same time as the built-in speakers, you will get audio feedback! Use headphones (being careful about the volume in case of feedback), or mute the speakers whilst recording and turn the microphone volume to zero for playback.
Correct! Can you explain how they contribute to modelling duration?
Please state the word count on the first page of your assignment, and also include it in the name of the submission (as the instructions state). Using the word count from Overleaf is perfectly acceptable. If that word count is within the limit, you will not be penalised.
You are right that HMMs have “a set of probability distributions”, but there are two different types of probability distribution in an HMM. One type is the emission probability density function in each emitting state: a multivariate Gaussian that generates the observations (MFCCs).
What is the other type?
In the video “Cepstral Analysis, Mel-Filterbanks, MFCCs” we first had a recap of filterbank features. These would be great features, except that their coefficients co-vary (they are correlated with one another).
We then reminded ourselves of how the source and filter combine in the time domain using convolution, or in the frequency domain using multiplication. We made them additive by taking the log and devised a way to deconvolve the source and filter. This video only explained the classical cepstrum – there was no Mel scale or filterbank.
Finally, in the video “From MFCCs, towards a generative model using HMMs” we developed MFCCs by using our filterbank features as a starting point, then applying the same crucial steps we used with the classical cepstrum to obtain the filter without the source: take the log (to make source and filter additive), apply a series expansion (to separate source and filter along the quefrency axis), and truncate (to discard the source).
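To make those steps concrete, here is a minimal Python sketch (the random filterbank values, the 26 filters, and the 13 retained coefficients are illustrative assumptions, not the course recipe):

import numpy as np
from scipy.fftpack import dct

# One hypothetical frame of Mel filterbank energies (made-up values).
fbank_energies = np.abs(np.random.randn(26)) + 1e-6

log_fbank = np.log(fbank_energies)       # take the log: source and filter become additive
cepstrum = dct(log_fbank, norm='ortho')  # series expansion (here a DCT): source and filter separate along the quefrency axis
mfccs = cepstrum[:13]                    # truncate: keep the low quefrencies (the filter), discard the source

A real front end computes the filterbank from windowed frames of the waveform and usually appends deltas, but the log, series expansion, and truncation above are the steps described in the video.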
If, after reading the questions again carefully, you are certain they are duplicates, please email the screenshots to me.
First, remember that pitch is the perceptual correlate of F0. We can only measure F0 from a speech waveform. Pitch only exists in the mind of the listener.
When we say F0 is not lexically contrastive in ASR, we mean that it is not useful for telling two words apart. The output of ASR is the text, so we do not need to distinguish “preSENT” from “PREsent”, for example, we simply need to output the written form “present”.
Duration is lexically contrastive because there are pairs of words in the language that differ in their vowel length.
Hidden Markov Models do model duration. Can you explain how they do that?
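As a hint of what to look at, consider a single emitting state with a hypothetical self-transition probability a_ii = 0.9: the probability of occupying that state for exactly d frames follows a simple distribution, sketched below.

a_ii = 0.9  # hypothetical self-transition probability of one emitting state

# P(occupy the state for exactly d frames) = a_ii**(d-1) * (1 - a_ii), a geometric distribution
for d in range(1, 6):
    print(d, round(a_ii ** (d - 1) * (1 - a_ii), 4))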
It’s preferred to copy-paste text into your forum post rather than use screenshots (which are not searchable, and which we cannot quote in our replies).
You are trying to use an MFCC file which does not exist – that’s what the HTK error “Cannot open Parm File” means.
Take a look in
/Volumes/Network/courses/sp/data/mfcc/
to see how the data are organised into train and test partitions, and what the filenames are.

The train and test data are arranged differently:
Because we have labels for the train data, we can keep all of the training examples from each speaker in a single MFCC file, where the corresponding label file specifies not just the labels (e.g., “one” or “seven”) but also the start and end times.
The test data are cut into individual digits, ready to be recognised.
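As an illustration of what such a label file contains, HTK label files list one segment per line as start time, end time, and label, with times in units of 100 nanoseconds; a made-up fragment might look like:

0 3500000 one
3500000 8200000 seven
8200000 12600000 three

so 3500000 here corresponds to 0.35 seconds (the times and words are invented for illustration).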
November 10, 2022 at 21:39 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16489
Only you can answer that: you need to give enough background for your reader to understand the points you will make later in the report. For example, if your explanation for an audible concatenation refers to source and filter properties, you should have provided enough background about that for your explanation to be understood by your reader.
It’s good practice to specify the language (and accent, or other properties, when relevant) you are working with: you would be amazed at how many published papers forget to do that! Likewise, it is good practice to be clear about what data are used, where they come from, etc.
The data here include both the data in the unit selection voice and the sentences you use to illustrate mistakes.
November 7, 2022 at 22:09 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16466
Correct! (Also in the pronunciation dictionary, of course.)
Actually, the symbol set is not exactly phonemes – it includes allophones, for example. What is the difference?
November 7, 2022 at 20:58 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16464
You correctly state that diphones are used because they capture co-articulation.
But are you sure phonemes are not used anywhere in the system?