Forum Replies Created

Author

Posts
February 8, 2023 at 13:13 in reply to: Downsampling with sox #16769
Simon
Professor
-r indicates the sampling rate of the output file. sox will automatically determine the sampling rate of the input.
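As a rough illustration, here is a minimal Python sketch that builds such a sox command. The file names and the 16000 Hz target rate are placeholders chosen for this example, not values from the original question. The point it shows is that -r sits immediately before the output file name, and no rate is given for the input.

```python
import shlex


def sox_resample_command(input_path, output_path, output_rate_hz):
    """Build a sox command line that writes output_path at output_rate_hz."""
    # -r comes immediately before the output file, so it applies to the output.
    # No rate is given for the input: sox determines that automatically.
    return ["sox", input_path, "-r", str(output_rate_hz), output_path]


# Placeholder file names and rate, for illustration only.
command = sox_resample_command("input.wav", "output.wav", 16000)
print(shlex.join(command))  # prints: sox input.wav -r 16000 output.wav

# To actually run the command (requires sox to be installed):
# import subprocess
# subprocess.run(command, check=True)
```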
February 4, 2023 at 09:02 in reply to: Playback on Mac SpeechRecorder not working #16763
Simon
Professor
Here is a screenshot for another example aggregate device, this time combining an external USB microphone with the built-in headphone port of a laptop.

February 3, 2023 at 17:33 in reply to: Playback on Mac SpeechRecorder not working #16761
Simon
Professor
The problem is that, on newer Macs, the microphone and the headphones/speakers appear as separate audio devices. So there is no single device with both inputs and outputs.

Here’s a possible solution:

Try creating an aggregate device, using Audio MIDI Setup (which you’ll find in /Applications/Utilities). Press the small “+” in the lower left corner to create a new device. The attached screenshot shows you what to do.

Then, select this as your device in SpeechRecorder.

Warning! If you use the built-in microphone of your laptop at the same time as the built-in speakers, you will get audio feedback! Use headphones (being careful about the volume in case of feedback), or mute the speakers whilst recording / turn the microphone volume to zero for playback.

December 10, 2022 at 10:54 in reply to: confusion about lexically contrastive #16655
Simon
Professor
Correct! Can you explain how they contribute to modelling duration?
December 7, 2022 at 19:55 in reply to: Different word count #16649
Simon
Professor
Please state the word count on the first page of your assignment, and also include it in the name of the submission (as the instructions state). Using the word count from Overleaf is perfectly acceptable. If that word count is within the limit, you will not be penalised.
December 7, 2022 at 08:17 in reply to: confusion about lexically contrastive #16635
Simon
Professor
You are right that HMMs have “a set of probability distributions” but there are two different types of probability distribution in an HMM. One type is the emission probability density functions: the multivariate Gaussian in each emitting state that generates observations (MFCCs).

What is the other type?
December 6, 2022 at 18:04 in reply to: MFCC #16628
Simon
Professor
In the video Cepstral Analysis, Mel-Filterbanks, MFCCs we first had a recap of filterbank features. These would be great features, except that they exhibit covariance (the outputs of neighbouring filters are correlated).

We then reminded ourselves of how the source and filter combine in the time domain using convolution, or in the frequency domain using multiplication. We made them additive by taking the log and devised a way to deconvolve the source and filter. This video only explained the classical cepstrum – there was no Mel scale or filterbank.

Finally, in the video From MFCCs, towards a generative model using HMMs we developed MFCCs, by using our filterbank features as a starting point, then applying the same crucial steps as we would for using the cepstrum to obtain the filter without the source: take the log (make source and filter additive), series expansion (separate source and filter along the quefrency axis), truncate (discard the source).
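The three steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the exact recipe from the videos or from any particular toolkit: it assumes the filterbank energies for one frame are already available as a list of positive numbers, uses an unnormalised DCT-II as the series expansion, and keeps the first 12 coefficients (including the zeroth), which is an arbitrary choice for the example.

```python
import math


def mfcc_from_filterbank(filterbank_energies, num_coefficients=12):
    """Turn one frame of (positive) filterbank energies into cepstral coefficients."""
    # Step 1: take the log, which makes source and filter additive.
    log_energies = [math.log(e) for e in filterbank_energies]
    n = len(log_energies)

    # Step 2: series expansion, here an unnormalised DCT-II (an assumption of
    # this sketch), which spreads source and filter along the quefrency axis.
    cepstrum = [
        sum(log_energies[m] * math.cos(math.pi * k * (m + 0.5) / n) for m in range(n))
        for k in range(n)
    ]

    # Step 3: truncate, keeping only the low-quefrency coefficients (the filter)
    # and discarding the rest (the source).
    return cepstrum[:num_coefficients]


# Made-up energies for a 26-channel filterbank, for illustration only.
example_frame = [1.0 + channel for channel in range(26)]
print(len(mfcc_from_filterbank(example_frame)))  # prints: 12
```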
December 1, 2022 at 08:30 in reply to: Duplicate questions in quiz? #16577
Simon
Professor
If, after reading the questions again carefully, you are certain they are duplicates, please email the screenshots to me.
November 28, 2022 at 17:24 in reply to: confusion about lexically contrastive #16562
Simon
Professor
First, remember that pitch is the perceptual correlate of F0. We can only measure F0 from a speech waveform. Pitch only exists in the mind of the listener.

When we say F0 is not lexically contrastive in ASR, we mean that it is not useful for telling two words apart. The output of ASR is the text, so we do not need to distinguish “preSENT” from “PREsent”, for example; we simply need to output the written form “present”.

Duration is lexically contrastive because there are pairs of words in the language that differ in their vowel length.

Hidden Markov Models do model duration. Can you explain how they do that?
November 28, 2022 at 17:19 in reply to: Recognise Test Data for multiple users #16560
Simon
Professor
It’s preferred to copy-paste text into your forum post rather than use screenshots (which are not searchable, and which we cannot quote in our replies).

You are trying to use an MFCC file which does not exist – that’s what the HTK error “Cannot open Parm File” means.

Take a look in /Volumes/Network/courses/sp/data/mfcc/ to see how the data are organised into train and test partitions, and what the filenames are.

The train and test data are arranged differently:

Because we have labels for the train data, we can keep all of the training examples from each speaker in a single MFCC file, where the corresponding label file specifies not just the labels (e.g., “one” or “seven”) but also the start and end times.

The test data are cut into individual digits, ready to be recognised.
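One way to track down a “Cannot open Parm File” error is to check, before running HTK, that every MFCC file you intend to pass to it actually exists. Here is a minimal sketch in Python; the function name and the example file names are hypothetical, so substitute the real names you find in the directory above.

```python
from pathlib import Path


def find_missing_files(paths):
    """Return the paths from the given list that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


# Hypothetical file names, for illustration only: replace them with the
# MFCC files you are actually trying to recognise.
candidate_files = [
    "example_speaker_test01.mfcc",
    "example_speaker_test02.mfcc",
]

for missing in find_missing_files(candidate_files):
    print("Missing MFCC file:", missing)
```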
November 10, 2022 at 21:39 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16489
Simon
Professor
Only you can answer that: you need to give enough background for your reader to understand the points you will make later in the report. For example, if your explanation for an audible concatenation refers to source and filter properties, you should have provided enough background about that for your explanation to be understood by your reader.
November 10, 2022 at 21:35 in reply to: Intro in Lab Report #16488
Simon
Professor
It’s good practice to specify the language (and accent, or other properties, when relevant) you are working with: you would be amazed at how many published papers forget to do that! Likewise, it is good practice to be clear about what data are used, where they come from, etc.

The data here include both the data in the unit selection voice and the sentences you use to illustrate mistakes.
November 7, 2022 at 22:09 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16466
Simon
Professor
Correct! (Also in the pronunciation dictionary, of course.)

Actually, the symbol set is not exactly phonemes – it includes allophones, for example. What is the difference?
November 7, 2022 at 20:58 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16464
Simon
Professor
You correctly state that diphones are used because they capture co-articulation.

But are you sure phonemes are not used anywhere in the system?
November 7, 2022 at 18:27 in reply to: Clarification on ‘What in the speech signal differentiates different phones?’ #16457
Simon
Professor
There are lots of connections. Some hints:

Do we use phonemes in TTS?

Speech sounds are affected by the surrounding speech sounds through the process of co-articulation (which occurs both within and between words).

The source and filter each have different consequences for the acoustic properties of a speech sound: how is that knowledge used in TTS?