Forum Replies Created
Run
soxi recordings/arctic_a0001.wav
to see information about that file format, and post the output here. If you wish, attach one file, such as recordings/arctic_a0001.wav, to your post so I can investigate.

-r indicates the sampling rate of the output file. sox will automatically determine the sampling rate of the input.
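For example, a command along these lines creates a 16 kHz version of one recording (the output filename here is just a placeholder):
sox recordings/arctic_a0001.wav -r 16000 arctic_a0001_16k.wav
The -r 16000 option sets the sampling rate of the output; recent versions of sox will resample automatically when the output rate differs from the input rate, while older versions may need the rate effect added explicitly.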
Here is a screenshot for another example aggregate device, this time combining an external USB microphone with the built-in headphone port of a laptop.
The problem is that, on newer Macs, the microphone and the headphones/speakers appear as separate audio devices. So there is no single device with both inputs and outputs.
Here’s a possible solution:
try creating an aggregate device, using Audio MIDI Setup (which you’ll find in /Applications/Utilities). Press the small “+” in the lower left corner to create a new device. The attached screenshot shows you what to do.
Then, select this as your device in SpeechRecorder.
Warning! If you use the built-in microphone of your laptop at the same time as the built-in speakers, you will get audio feedback! Use headphones (being careful about the volume in case of feedback), or mute the speakers whilst recording and turn the microphone volume to zero for playback.
Correct! Can you explain how they contribute to modelling duration?
Please state the word count on the first page of your assignment, and also include it in the name of the submission (as the instructions state). Using the word count from Overleaf is perfectly acceptable. If that word count is within the limit, you will not be penalised.
You are right that HMMs have “a set of probability distributions”, but there are two different types of probability distribution in an HMM. One type is the emission probability density function in each emitting state: a multivariate Gaussian that generates the observations (MFCCs).
What is the other type?
In the video “Cepstral Analysis, Mel-Filterbanks, MFCCs” we first had a recap of filterbank features. These would be great features, except that their coefficients co-vary (they are correlated with one another).
We then reminded ourselves of how the source and filter combine in the time domain using convolution, or in the frequency domain using multiplication. We made them additive by taking the log and devised a way to deconvolve the source and filter. This video only explained the classical cepstrum – there was no Mel scale or filterbank.
Finally, in the video “From MFCCs, towards a generative model using HMMs” we developed MFCCs by using our filterbank features as a starting point, then applying the same crucial steps we used with the classical cepstrum to obtain the filter without the source: take the log (to make source and filter additive), apply a series expansion (to separate source and filter along the quefrency axis), and truncate (to discard the source).
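To make those steps concrete, here is a minimal Python sketch (the random filterbank values, the 26 filters, and the 13 retained coefficients are illustrative assumptions, not the course recipe):

import numpy as np
from scipy.fftpack import dct

# One hypothetical frame of Mel filterbank energies (made-up values).
fbank_energies = np.abs(np.random.randn(26)) + 1e-6

log_fbank = np.log(fbank_energies)       # take the log: source and filter become additive
cepstrum = dct(log_fbank, norm='ortho')  # series expansion (here a DCT): source and filter separate along the quefrency axis
mfccs = cepstrum[:13]                    # truncate: keep the low quefrencies (the filter), discard the source

A real front end computes the filterbank from windowed frames of the waveform and usually appends deltas, but the log, series expansion, and truncation above are the steps described in the video.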
If, after reading the questions again carefully, you are certain they are duplicates, please email the screenshots to me.
First, remember that pitch is the perceptual correlate of F0. We can only measure F0 from a speech waveform. Pitch only exists in the mind of the listener.
When we say F0 is not lexically contrastive in ASR, we mean that it is not useful for telling two words apart. The output of ASR is the text, so we do not need to distinguish “preSENT” from “PREsent”, for example, we simply need to output the written form “present”.
Duration is lexically contrastive because there are pairs of words in the language that differ in their vowel length.
Hidden Markov Models do model duration. Can you explain how they do that?
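As a hint of what to look at, consider a single emitting state with a hypothetical self-transition probability a_ii = 0.9: the probability of occupying that state for exactly d frames follows a simple distribution, sketched below.

a_ii = 0.9  # hypothetical self-transition probability of one emitting state

# P(occupy the state for exactly d frames) = a_ii**(d-1) * (1 - a_ii), a geometric distribution
for d in range(1, 6):
    print(d, round(a_ii ** (d - 1) * (1 - a_ii), 4))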
It’s preferred to copy-paste text into your forum post rather than use screenshots (which are not searchable, and which we cannot quote in our replies).
You are trying to use an MFCC file which does not exist – that’s what the HTK error “Cannot open Parm File” means.
Take a look in
/Volumes/Network/courses/sp/data/mfcc/
to see how the data are organised into train and test partitions, and what the filenames are.

The train and test data are arranged differently:
Because we have labels for the train data, we can keep all of the training examples from each speaker in a single MFCC file, where the corresponding label file specifies not just the labels (e.g., “one” or “seven”) but also the start and end times.
The test data are cut into individual digits, ready to be recognised.
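As an illustration of what such a label file contains, HTK label files list one segment per line as start time, end time, and label, with times in units of 100 nanoseconds; a made-up fragment might look like:

0 3500000 one
3500000 8200000 seven
8200000 12600000 three

so 3500000 here corresponds to 0.35 seconds (the times and words are invented for illustration).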
November 10, 2022 at 21:39 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16489
Only you can answer that: you need to give enough background for your reader to understand the points you will make later in the report. For example, if your explanation for an audible concatenation refers to source and filter properties, you should have provided enough background about that for your explanation to be understood by your reader.
It’s good practice to specify the language (and accent, or other properties, when relevant) you are working with: you would be amazed at how many published papers forget to do that! Likewise, it is good practice to be clear about what data are used, where they come from, etc.
The data here include both the data in the unit selection voice and the sentences you use to illustrate mistakes.
November 7, 2022 at 22:09 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16466
Correct! (Also in the pronunciation dictionary, of course.)
Actually, the symbol set is not exactly phonemes – it includes allophones, for example. What is the difference?
November 7, 2022 at 20:58 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16464
You correctly state that diphones are used because they capture co-articulation.
But are you sure phonemes are not used anywhere in the system?