Forum Replies Created
How much disk space is available?
$ df -h .

If the Use% column is showing close to 100%, that means the disk is nearly full.
If you are using a disk that is shared with other people (as is the case in the PPLS lab), then the amount of available space is the total for everyone sharing that disk (it doesn’t belong to you individually). The number reported by df will fluctuate up and down as other users create or delete files.

How much disk am I using?

Change to your home directory, then measure the size of all the items there:
$ cd
$ du -sh *

That may take a minute or two to run and may produce a lot of output. It will be more convenient to sort the output by size:
$ du -sh * | sort -h

Now you know which directory is the largest, you could cd into it and repeat the above, drilling down to find what is using the most space. Or, get clever and find all directories at once and measure their size, reporting this in a sorted list (this will take some time, so be patient):
$ find . -type d -exec du -sh {} \; | sort -h

One example would be a convolutional layer. This has a very specific pattern of connections that express the operation of convolution between the activations output by a layer and a “kernel” (which is expressed by weight sharing).
We might use a convolutional layer when we wish to apply the same operation to all parts of some representation (potentially of varying size). They are very commonly used in image processing, but have their uses in speech processing too. For example, we might use them to create a learnable feature extractor for waveform-input ASR.
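To make the weight-sharing idea concrete, here is a minimal numpy sketch of a 1-D convolutional layer (my own illustration, not course code; the input and kernel values are arbitrary). The same small set of weights is applied at every position of the input:

```python
import numpy as np

def conv1d(x, kernel):
    """Valid (no-padding) 1-D convolution: slide one shared kernel along x.

    Note: like most deep learning toolkits, this actually computes
    cross-correlation, i.e. convolution with an un-flipped kernel.
    """
    k = len(kernel)
    # Every output value uses the SAME weights (weight sharing),
    # just applied to a different local window of the input.
    return np.array([np.dot(kernel, x[i:i + k])
                     for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # e.g. activations from a previous layer
kernel = np.array([0.5, 0.5])             # the shared weights (the "kernel")

print(conv1d(x, kernel))                  # a 2-point moving average: [1.5 2.5 3.5 4.5]
```

Because the kernel is shared, the number of parameters is independent of the input length, which is one reason the same layer can handle representations of varying size.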
The command

$ sox recordings/arctic_a0001.wav -b16 -r 16k wav/arctic_a0001.wav remix 1

works as expected for me on your file.
Use soxi to inspect your output file: does it have the expected sampling rate, bit depth and duration? One explanation for the large size of your output file could be that you accidentally combined multiple files, which would happen if you did this:

$ sox recordings/*.wav -b16 -r 16k wav/arctic_a0001.wav remix 1

Run

$ soxi recordings/arctic_a0001.wav

to see information about that file format, and post the output here. If you wish, attach one file, such as recordings/arctic_a0001.wav, to your post so I can investigate.

The -r option specifies the sampling rate of the output file. sox will automatically determine the sampling rate of the input.

Here is a screenshot of another example aggregate device, this time combining an external USB microphone with the built-in headphone port of a laptop.
The problem is that, on newer Macs, the microphone and the headphones/speakers appear as separate audio devices. So there is no single device with both inputs and outputs.
Here’s a possible solution:
try creating an aggregate device, using Audio MIDI Setup (which you’ll find in /Applications/Utilities). Press the small “+” in the lower left corner to create a new device. The attached screenshot shows you what to do.
Then, select this as your device in SpeechRecorder.
Warning! If you use the built-in microphone of your laptop at the same time as the built-in speakers, you will get audio feedback! Use headphones (being careful about the volume in case of feedback), or mute the speakers whilst recording and turn the microphone volume down to zero for playback.
Correct! Can you explain how they contribute to modelling duration?
Please state the word count on the first page of your assignment, and also include it in the name of the submission (as the instructions state). Using the word count from Overleaf is perfectly acceptable. If that word count is within the limit, you will not be penalised.
You are right that HMMs have “a set of probability distributions” but there are two different types of probability distribution in an HMM. One type is the emission probability density functions: the multivariate Gaussian in each emitting state that generates observations (MFCCs).
What is the other type?
In the video Cepstral Analysis, Mel-Filterbanks, MFCCs we first had a recap of filterbank features. These would be great features, except that they exhibit covariance: neighbouring filterbank coefficients are strongly correlated with one another.
We then reminded ourselves of how the source and filter combine in the time domain using convolution, or in the frequency domain using multiplication. We made them additive by taking the log and devised a way to deconvolve the source and filter. This video only explained the classical cepstrum – there was no Mel scale or filterbank.
Finally, in the video From MFCCs, towards a generative model using HMMs we developed MFCCs, by using our filterbank features as a starting point, then applying the same crucial steps as we would for using the cepstrum to obtain the filter without the source: take the log (make source and filter additive), series expansion (separate source and filter along the quefrency axis), truncate (discard the source).
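As a compact summary of those steps (the notation here is mine: e[n] for the source, h[n] for the filter, x[n] for the speech), the key move is that taking the log turns the multiplicative combination of source and filter spectra into an additive one:

```latex
x[n] = e[n] * h[n]
\;\Longrightarrow\;
|X(f)| = |E(f)|\,|H(f)|
\;\Longrightarrow\;
\log|X(f)| = \log|E(f)| + \log|H(f)|
```

The series expansion then lays the smoothly-varying filter and the rapidly-varying source out at different positions along the quefrency axis, so that truncation keeps the low-quefrency part (the filter) and discards the high-quefrency part (the source).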
If, after reading the questions again carefully, you are certain they are duplicates, please email the screenshots to me.
First, remember that pitch is the perceptual correlate of F0. We can only measure F0 from a speech waveform. Pitch only exists in the mind of the listener.
When we say F0 is not lexically contrastive in ASR, we mean that it is not useful for telling two words apart. The output of ASR is the text, so we do not need to distinguish “preSENT” from “PREsent”, for example, we simply need to output the written form “present”.
Duration is lexically contrastive because there are pairs of words in the language that differ in their vowel length.
Hidden Markov Models do model duration. Can you explain how they do that?
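As a hint, in standard HMM notation (not from the original post: a_{ii} is the self-transition probability of state i), remaining in a state for exactly d frames means taking the self-transition d-1 times and then leaving, which implies a geometric duration distribution:

```latex
P(d) = a_{ii}^{\,d-1}\,(1 - a_{ii}),
\qquad
\mathbb{E}[d] = \frac{1}{1 - a_{ii}}
```

This is a rather crude model of duration: the most probable duration under a geometric distribution is always a single frame.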
It’s preferred to copy-paste text into your forum post rather than use screenshots (which are not searchable, and which we cannot quote in our replies).
You are trying to use an MFCC file which does not exist – that’s what the HTK error “Cannot open Parm File” means.
Take a look in /Volumes/Network/courses/sp/data/mfcc/ to see how the data are organised into train and test partitions, and what the filenames are.

The train and test data are arranged differently:
Because we have labels for the train data, we can keep all of the training examples from each speaker in a single MFCC file, where the corresponding label file specifies not just the labels (e.g., “one” or “seven”) but also the start and end times.
The test data are cut into individual digits, ready to be recognised.
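For illustration only, a label file for a training utterance might look like this in HTK's plain-text label format (the words and times here are invented; HTK expresses times in units of 100 ns):

```
      0  3500000 one
3500000  7200000 seven
7200000 11000000 three
```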
November 10, 2022 at 21:39, in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16489

Only you can answer that: you need to give enough background for your reader to understand the points you will make later in the report. For example, if your explanation for an audible concatenation refers to source and filter properties, you should have provided enough background about that for your explanation to be understood by your reader.