Forum Replies Created
How is ‘bad pitch marking’ determined by Festival? This might be a stupid question, but if Festival knows the pitch marks are bad, couldn’t it just make them better? Also, I assume this is determined within the ‘build_utts’ command(?)
Thanks Maria! Hm, that would mean it is like a Gaussian without its variance? I don’t see why that would be useful – it would just be a bad model, but a model nonetheless (and I think the whole point of using Euclidean distance is to go model-free, without making parametric assumptions). I know there is a classification method that uses reference examples instead of a parametric model; it is based on computing the Euclidean distance to every single example and is very inefficient. That is not the case in DTW, though, which uses only one reference example.
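To make the “Gaussian without its variance” point concrete, here is a tiny sketch of my own (made-up vectors, nothing from the course code): the squared Euclidean distance to a template differs from the negative log-likelihood under a unit-variance Gaussian centred on that template only by constants, so ranking candidates by distance is the same as ranking them by that degenerate likelihood.

```python
import numpy as np

# Hypothetical toy frames: one stored template vector and one observed vector (12-dim).
template = np.random.randn(12)
obs = np.random.randn(12)

# Squared Euclidean distance between observation and template.
d2 = np.sum((obs - template) ** 2)

# Negative log-likelihood of the observation under a Gaussian whose mean is the
# template and whose covariance is fixed to the identity, i.e. no variance is
# actually estimated from data.
nll = 0.5 * d2 + 0.5 * len(obs) * np.log(2 * np.pi)

# nll and 0.5 * d2 differ only by a constant, so ordering candidates by Euclidean
# distance gives exactly the same ordering as this degenerate likelihood.
print(d2, nll)
```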
So I think the question is just ill-phrased; it should say “why is using an HMM with Gaussian PDFs a better approach than DTW with Euclidean distance?” – that would cover all the other differences implied by the two approaches, such as using different amounts of training data. In that case, I would agree that they are equivalent in terms of computational cost during testing (for the HMM, training would be more computationally expensive).
Yes, but we do sum up over the sub-paths that belong to the SAME (sub-)path leading into that node, right? Otherwise, how do we get the total path cost? And wouldn’t it be too greedy?
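Here is a minimal DTW sketch (my own, with random toy frames) of how I understand the accumulation: at each node the local distance is added to the cost of the single best incoming sub-path (a min over predecessors, never a sum over competing alternatives), so the value at the final node is already the total cost of one complete path.

```python
import numpy as np

def dtw_cost(x, y):
    """Minimal DTW sketch: x and y are sequences of feature vectors (2-D arrays)."""
    n, m = len(x), len(y)
    # Local (Euclidean) distances between every pair of frames.
    d = np.array([[np.linalg.norm(xi - yj) for yj in y] for xi in x])
    # D[i, j] = cost of the single best path from (0, 0) to (i, j).
    D = np.full((n, m), np.inf)
    D[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # Take the MIN over the allowed predecessors: the local distance is
            # added only to the best incoming sub-path, not summed over all of them.
            best_prev = min(
                D[i - 1, j] if i > 0 else np.inf,
                D[i, j - 1] if j > 0 else np.inf,
                D[i - 1, j - 1] if (i > 0 and j > 0) else np.inf,
            )
            D[i, j] = d[i, j] + best_prev
    return D[-1, -1]  # total cost of the best complete alignment path

# Toy usage with random "frames" (purely illustrative).
x = np.random.randn(5, 12)
y = np.random.randn(7, 12)
print(dtw_cost(x, y))
```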
ii. “to disguise the joins by ‘lightly smoothing’ F0 and the spectral envelope in the local region around each join” (cf. slides on waveform generation)
iii. unit selection: there is not much on it in the course material, but the basic idea is that you select units (from a larger database) which match best, in terms of matching the desired outcome (from the linguistic specification) AND matching EACH OTHER –> fewer/less audible joins. (See the sketch below for how those two kinds of matching combine.)
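A rough sketch of what I mean by the two kinds of matching (hypothetical costs and candidate counts, not Festival’s actual implementation): each candidate unit has a target cost against the linguistic specification, consecutive candidates have a join cost against each other, and a Viterbi-style search picks the cheapest sequence overall.

```python
import numpy as np

def select_units(target_costs, join_costs):
    """Hypothetical unit-selection sketch.

    target_costs: list of length T; target_costs[t][j] = cost of candidate j at
                  position t against the linguistic specification.
    join_costs:   list of length T-1; join_costs[t][i][j] = cost of joining
                  candidate i at position t to candidate j at position t+1.
    Returns the cheapest candidate sequence (indices) via Viterbi search.
    """
    T = len(target_costs)
    best = np.asarray(target_costs[0], dtype=float)
    backptr = []
    for t in range(1, T):
        # Cost of reaching candidate j at position t from candidate i at t-1.
        total = best[:, None] + np.asarray(join_costs[t - 1]) \
                + np.asarray(target_costs[t])[None, :]
        backptr.append(total.argmin(axis=0))
        best = total.min(axis=0)
    # Trace back the cheapest sequence of units.
    seq = [int(best.argmin())]
    for bp in reversed(backptr):
        seq.append(int(bp[seq[-1]]))
    return list(reversed(seq))

# Toy example: 3 target positions, 2 candidate units each.
tc = [[0.1, 0.5], [0.3, 0.2], [0.4, 0.1]]
jc = [[[0.0, 0.9], [0.9, 0.0]], [[0.0, 0.9], [0.9, 0.0]]]
print(select_units(tc, jc))
```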
Agreed. But reading the question:
A probability distribution is generally superior to using a Euclidean distance measure. This is because the probability distribution
i. is less computationally expensive
…
iii. accounts for variance

I think it is fair to assume that the two are used ON THE SAME DATA SET. Since iii is true (I know that from another question), it is also fair to assume that the data set contains multiple training examples per frame/word (and for the purpose of running-time analysis, I would also assume that the data set is fairly LARGE). If there were a single training example per word/frame, then yes, I’d agree that they have about the same cost, but then there would be no variance and generally not much difference between the two methods. In the case of multiple training examples, using a Gaussian PDF is faster, because only one value has to be computed per candidate word/frame, whereas the Euclidean distance has to be computed to all of the training examples in the database (cf. the k-nearest-neighbours method, which is known to have high complexity at test time).
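To illustrate that cost argument with made-up numbers (my own sketch, assuming N stored examples for one candidate and a single incoming test frame):

```python
import numpy as np

# Hypothetical setup: N stored training frames for one candidate word/state,
# each a 12-dim feature vector, plus one incoming test frame.
N, dim = 10000, 12
train = np.random.randn(N, dim)
test = np.random.randn(dim)

# Exemplar-based scoring: a distance to EVERY stored example (k-NN style),
# i.e. O(N) work per test frame.
dists = np.linalg.norm(train - test, axis=1)
knn_score = dists.min()

# Gaussian scoring: the N examples are summarised once, at training time, by a
# mean and a diagonal variance; at test time a single log-density is evaluated,
# i.e. O(1) work per test frame regardless of N.
mu, var = train.mean(axis=0), train.var(axis=0)
log_lik = -0.5 * np.sum((test - mu) ** 2 / var + np.log(2 * np.pi * var))

print(knn_score, log_lik)
```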
So what is the language model computed from? Another corpus? I thought the transitions between words were also learned during training…
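If the language model really is an n-gram over words (an assumption on my part – in the digit recogniser it might just be a fixed grammar), then it would be estimated from counts over some text corpus, roughly like this toy sketch:

```python
from collections import Counter

# Toy stand-in for a text corpus (purely illustrative).
corpus = "one two three one two one three two two one".split()

# Maximum-likelihood bigram estimates: P(w2 | w1) = count(w1, w2) / count(w1).
context_counts = Counter(corpus[:-1])
bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))
p = {(w1, w2): c / context_counts[w1] for (w1, w2), c in bigram_counts.items()}

print(p[("one", "two")])  # estimated word-to-word transition probability
```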
Where does the prior come in for ASR? So far, I thought we were just comparing likelihoods, which I assume are proportional to the posteriors if all the priors are the same (e.g., each word is equally probable in the digit recognizer).
– I guess it is not hard to integrate priors into the language model, based on word frequency.
– For the word model, if there are alternate pronunciations, the prior may be all we can go by?
– But how do we include priors in the phone/acoustic model? Or do they not play a role (since each phone or sub-phone has just one model), and we assume likelihood = posterior? (See the Bayes sketch further down.)

Ok, thank you. I thought 0 does not count as a significant figure…
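Coming back to the question about priors above, this is how I picture the standard Bayes decomposition (my own summary, not quoted from the course notes):

```latex
% Bayes decomposition for ASR decoding (O = observed acoustics, W = word sequence).
\begin{align*}
\hat{W} &= \arg\max_W \, P(W \mid O)
         = \arg\max_W \, \frac{P(O \mid W)\,P(W)}{P(O)}
         = \arg\max_W \, \underbrace{P(O \mid W)}_{\text{acoustic model}}
                       \, \underbrace{P(W)}_{\text{language model (prior)}}
\end{align*}
% If P(W) is uniform (e.g. every digit equally likely), maximising the likelihood
% alone picks the same word as maximising the posterior, which is why just
% comparing likelihoods works in the digit recognizer.
```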
Cool, thank you, that is a super interesting topic!
Just another follow-up to this: Holmes and Holmes (2001, 159) write: “it seems desirable not to use features of the acoustic signal that are not used by human listeners, even if they are reliably present in human productions”. Why this limitation? If the machine can “hear” and interpret it (as in, use it to get a better classification accuracy), why does it matter whether humans can?
They give this reason: “because they may be distorted by the acoustic environment or electrical transmission path without causing the perceived speech quality to be impaired”. Not a very convincing argument to me – the same would apply to features we ARE using, I would say.
I understand why “the human model” is good for inspiration, but why limit ourselves? Humans can use a lot of information a machine can’t; maybe there are some benefits in exploiting the specific strengths of the machine (e.g., greater sensitivity to different frequencies, ability to measure phase, doesn’t get distracted…) to make up for that?
But why are MFCCs better features than filter coefficients? Shouldn’t they both ultimately model the same underlying thing – the shape of the vocal tract at production? I do not see why speech recognition could not equally well build a model from the speaker’s or the listener’s point of view (even human listening is sometimes hypothesised to be based on our own model of/experience with production…)
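For reference, here is a heavily simplified single-frame MFCC sketch as I understand the textbook recipe (my own code; no pre-emphasis, liftering or deltas, and the filterbank details are approximate, not any particular toolkit) – the mel spacing and log compression are the explicitly “listener-motivated” steps, in contrast to filter/LPC coefficients, which come from a production-style source–filter fit:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """Very simplified single-frame MFCC sketch."""
    # 1. Power spectrum of the windowed frame.
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # 2. Mel-spaced triangular filterbank: the "listener's point of view",
    #    i.e. coarser frequency resolution higher up.
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((len(frame) + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[i - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    # 3. Log filterbank energies (compressive, again perceptually motivated).
    log_energies = np.log(fbank @ power + 1e-10)
    # 4. DCT to decorrelate; keep the first few cepstral coefficients.
    return dct(log_energies, norm='ortho')[:n_ceps]

# Toy usage: one 25 ms frame of noise at 16 kHz.
print(mfcc_frame(np.random.randn(400), sr=16000))
```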
Thank you for the hint about the speed control – that is so useful! As for textbooks, I have one comment to add, which will probably solve itself in later years: in ANLP we are currently using a pre-print version of Jurafsky and Martin, edition 3 (provided by the lecturers, hence legal), which I find MUCH better than edition 2 (it has a completely new structure). http://web.stanford.edu/~jurafsky/slp3/ (unfortunately, the speech synthesis and speech recognition chapters are still to be written).