Forum Replies Created
How is ‘bad pitch marking’ determined by Festival? This might be a stupid question, but if Festival knows the pitch marks are bad, couldn’t it just make them better? Also, I assume this is determined within the ‘build_utts’ command(?)
Thanks Maria! Hm, that would mean it is like a Gaussian without its variance? I don’t see why that would be useful – it would just be a bad model, but a model nonetheless (and I think the whole point of using Euclidean distance is to go model-free, without making parametric assumptions). I know there is a classification method that uses reference examples instead of a parametric model; it is based on computing the Euclidean distance to every single example and is very inefficient. That is not the case in DTW, though, which uses only one reference example.
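To make the “Gaussian without its variance” point concrete, here is a tiny sketch of my own (made-up vectors, nothing from the course code): the squared Euclidean distance to a template differs from the negative log-likelihood under a unit-variance Gaussian centred on that template only by constants, so ranking candidates by distance is the same as ranking them by that degenerate likelihood.

```python
import numpy as np

# Hypothetical toy frames: one stored template vector and one observed vector (12-dim).
template = np.random.randn(12)
obs = np.random.randn(12)

# Squared Euclidean distance between observation and template.
d2 = np.sum((obs - template) ** 2)

# Negative log-likelihood of the observation under a Gaussian whose mean is the
# template and whose covariance is fixed to the identity, i.e. no variance is
# actually estimated from data.
nll = 0.5 * d2 + 0.5 * len(obs) * np.log(2 * np.pi)

# nll and 0.5 * d2 differ only by a constant, so ordering candidates by Euclidean
# distance gives exactly the same ordering as this degenerate likelihood.
print(d2, nll)
```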
So I think the question is just ill-phrased; it should say “why is using an HMM with Gaussian PDFs a better approach than DTW with Euclidean distance?” – that would cover all the other differences implied by the two approaches, such as using different amounts of training data. In that case, I would agree that they are equivalent in terms of computational cost during testing (for the HMM, training would be more computationally expensive).
Yes, but we do sum up over the sub-paths that belong to the SAME (sub-)path leading into that node, right? Otherwise, how do we get the total path cost? And wouldn’t it be too greedy?
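Here is a minimal DTW sketch (my own, with random toy frames) of how I understand the accumulation: at each node the local distance is added to the cost of the single best incoming sub-path (a min over predecessors, never a sum over competing alternatives), so the value at the final node is already the total cost of one complete path.

```python
import numpy as np

def dtw_cost(x, y):
    """Minimal DTW sketch: x and y are sequences of feature vectors (2-D arrays)."""
    n, m = len(x), len(y)
    # Local (Euclidean) distances between every pair of frames.
    d = np.array([[np.linalg.norm(xi - yj) for yj in y] for xi in x])
    # D[i, j] = cost of the single best path from (0, 0) to (i, j).
    D = np.full((n, m), np.inf)
    D[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # Take the MIN over the allowed predecessors: the local distance is
            # added only to the best incoming sub-path, not summed over all of them.
            best_prev = min(
                D[i - 1, j] if i > 0 else np.inf,
                D[i, j - 1] if j > 0 else np.inf,
                D[i - 1, j - 1] if (i > 0 and j > 0) else np.inf,
            )
            D[i, j] = d[i, j] + best_prev
    return D[-1, -1]  # total cost of the best complete alignment path

# Toy usage with random "frames" (purely illustrative).
x = np.random.randn(5, 12)
y = np.random.randn(7, 12)
print(dtw_cost(x, y))
```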
ii. “to disguise the joins by ‘lightly smoothing’ F0 and the spectral envelope in the local region around each join” (cf. slides on waveform generation)
iii. unit selection: there is not much on it in the course material, but the basic idea is that you select units (from a larger database) which match best, in terms of matching the desired outcome (from the linguistic specification) AND matching EACH OTHER –> fewer/less audible joins. (See the sketch below for how those two kinds of matching combine.)
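A rough sketch of what I mean by the two kinds of matching (hypothetical costs and candidate counts, not Festival’s actual implementation): each candidate unit has a target cost against the linguistic specification, consecutive candidates have a join cost against each other, and a Viterbi-style search picks the cheapest sequence overall.

```python
import numpy as np

def select_units(target_costs, join_costs):
    """Hypothetical unit-selection sketch.

    target_costs: list of length T; target_costs[t][j] = cost of candidate j at
                  position t against the linguistic specification.
    join_costs:   list of length T-1; join_costs[t][i][j] = cost of joining
                  candidate i at position t to candidate j at position t+1.
    Returns the cheapest candidate sequence (indices) via Viterbi search.
    """
    T = len(target_costs)
    best = np.asarray(target_costs[0], dtype=float)
    backptr = []
    for t in range(1, T):
        # Cost of reaching candidate j at position t from candidate i at t-1.
        total = best[:, None] + np.asarray(join_costs[t - 1]) \
                + np.asarray(target_costs[t])[None, :]
        backptr.append(total.argmin(axis=0))
        best = total.min(axis=0)
    # Trace back the cheapest sequence of units.
    seq = [int(best.argmin())]
    for bp in reversed(backptr):
        seq.append(int(bp[seq[-1]]))
    return list(reversed(seq))

# Toy example: 3 target positions, 2 candidate units each.
tc = [[0.1, 0.5], [0.3, 0.2], [0.4, 0.1]]
jc = [[[0.0, 0.9], [0.9, 0.0]], [[0.0, 0.9], [0.9, 0.0]]]
print(select_units(tc, jc))
```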
Agreed. But reading the question:
A probability distribution is generally superior to using a Euclidean distance measure. This is because the probability distribution
i. is less computationally expensive
…
iii. accounts for variance

I think it is fair to assume that the two are used ON THE SAME DATA SET. Since iii is true (I know that from another question), it is also fair to assume that the data set contains multiple training examples per frame/word (and for the purpose of running-time analysis, I would also assume that the data set is fairly LARGE). If there were a single training example per word/frame, then yes, I’d agree that they have about the same cost, but then there would be no variance and generally not much difference between the two methods. In the case of multiple training examples, using a Gaussian PDF is faster, because only one value has to be computed per candidate word/frame, whereas the Euclidean distance has to be computed to all of the training examples in the database (cf. the k-nearest-neighbours method, which is known to have high complexity at test time).
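To illustrate that cost argument with made-up numbers (my own sketch, assuming N stored examples for one candidate and a single incoming test frame):

```python
import numpy as np

# Hypothetical setup: N stored training frames for one candidate word/state,
# each a 12-dim feature vector, plus one incoming test frame.
N, dim = 10000, 12
train = np.random.randn(N, dim)
test = np.random.randn(dim)

# Exemplar-based scoring: a distance to EVERY stored example (k-NN style),
# i.e. O(N) work per test frame.
dists = np.linalg.norm(train - test, axis=1)
knn_score = dists.min()

# Gaussian scoring: the N examples are summarised once, at training time, by a
# mean and a diagonal variance; at test time a single log-density is evaluated,
# i.e. O(1) work per test frame regardless of N.
mu, var = train.mean(axis=0), train.var(axis=0)
log_lik = -0.5 * np.sum((test - mu) ** 2 / var + np.log(2 * np.pi * var))

print(knn_score, log_lik)
```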
So what is the language model computed from? Another corpus? I thought the transitions between words were also learned during training…
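If the language model really is an n-gram over words (an assumption on my part – in the digit recogniser it might just be a fixed grammar), then it would be estimated from counts over some text corpus, roughly like this toy sketch:

```python
from collections import Counter

# Toy stand-in for a text corpus (purely illustrative).
corpus = "one two three one two one three two two one".split()

# Maximum-likelihood bigram estimates: P(w2 | w1) = count(w1, w2) / count(w1).
context_counts = Counter(corpus[:-1])
bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))
p = {(w1, w2): c / context_counts[w1] for (w1, w2), c in bigram_counts.items()}

print(p[("one", "two")])  # estimated word-to-word transition probability
```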
Where does the prior come in for ASR? So far, I thought we were just comparing likelihoods, which I assume are proportional to the posteriors if all the priors are the same (e.g., each word is equally probable in the digit recognizer).
– I guess it is not hard to integrate priors into the language model, based on word frequency.
– For the word model, if there are alternate pronunciations, the prior may be all we can go by?
– But how do we include priors in the phone/acoustic model? Or do they not play a role (since each phone or sub-phone has just one model), and we assume likelihood = posterior? (See the Bayes sketch further down.)

Ok, thank you. I thought 0 does not count as a significant figure…
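Coming back to the question about priors above, this is how I picture the standard Bayes decomposition (my own summary, not quoted from the course notes):

```latex
% Bayes decomposition for ASR decoding (O = observed acoustics, W = word sequence).
\begin{align*}
\hat{W} &= \arg\max_W \, P(W \mid O)
         = \arg\max_W \, \frac{P(O \mid W)\,P(W)}{P(O)}
         = \arg\max_W \, \underbrace{P(O \mid W)}_{\text{acoustic model}}
                       \, \underbrace{P(W)}_{\text{language model (prior)}}
\end{align*}
% If P(W) is uniform (e.g. every digit equally likely), maximising the likelihood
% alone picks the same word as maximising the posterior, which is why just
% comparing likelihoods works in the digit recognizer.
```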
Cool, thank you, that is a super interesting topic!
Just another follow-up to this: Holmes and Holmes (2001, 159) write: “it seems desirable not to use features of the acoustic signal that are not used by human listeners, even if they are reliably present in human productions”. Why this limitation? If the machine can “hear” and interpret it (as in, use it to get a better classification accuracy), why does it matter whether humans can?
They give this reason: “because they may be distorted by the acoustic environment or electrical transmission path without causing the perceived speech quality to be impaired”. Not a very convincing argument to me – the same would apply to features we ARE using, I would say.
I understand why “the human model” is good for inspiration, but why limit ourselves? Humans can use a lot of information a machine can’t; maybe there are some benefits in exploiting the specific strengths of the machine (e.g., greater sensitivity to different frequencies, ability to measure phase, doesn’t get distracted…) to make up for that?
But why are MFCCs better features than filter coefficients? Shouldn’t they both ultimately model the same underlying thing – the shape of the vocal tract at production? I do not see why speech recognition could not equally well build a model from the speaker’s or the listener’s point of view (even human listening is sometimes hypothesised to be based on our own model of/experience with production…)
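For reference, here is a heavily simplified single-frame MFCC sketch as I understand the textbook recipe (my own code; no pre-emphasis, liftering or deltas, and the filterbank details are approximate, not any particular toolkit) – the mel spacing and log compression are the explicitly “listener-motivated” steps, in contrast to filter/LPC coefficients, which come from a production-style source–filter fit:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """Very simplified single-frame MFCC sketch."""
    # 1. Power spectrum of the windowed frame.
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # 2. Mel-spaced triangular filterbank: the "listener's point of view",
    #    i.e. coarser frequency resolution higher up.
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((len(frame) + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[i - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    # 3. Log filterbank energies (compressive, again perceptually motivated).
    log_energies = np.log(fbank @ power + 1e-10)
    # 4. DCT to decorrelate; keep the first few cepstral coefficients.
    return dct(log_energies, norm='ortho')[:n_ceps]

# Toy usage: one 25 ms frame of noise at 16 kHz.
print(mfcc_frame(np.random.randn(400), sr=16000))
```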
Thank you for the hint about the speed control – that is so useful! As for textbooks, I have one comment to add, which will probably solve itself in later years: in ANLP we are currently using a pre-print version of Jurafsky and Martin, edition 3 (provided by the lecturers, hence legal), which I find MUCH better than edition 2 (it has a completely new structure). http://web.stanford.edu/~jurafsky/slp3/ (unfortunately, the speech synthesis and speech recognition chapters are still to be written).