Forum Replies Created
Upon further reading of the HTK manual, I would simplify and rephrase the previous post as: can you please explain the model training process of going from Uniform Segmentation to Viterbi Segmentation (then iterating) during HInit, and then the further refinement using Forward/Backward (and iterating) during HRest? It’s clear from looking at my own data that both HInit and HRest calculate and revise both the transition probabilities for the states AND the Gaussians at each state. There also appears to be a relationship between the transition probs and the Gaussian probs, but that exact relationship is unclear to me. Could you show a worked example, from start to finish, of a single piece of training data as it moves through HInit and HRest, and how the HMM model evolves? It could be a hypothetical word 100 ms long, with 10 frames.
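To make the question concrete, here is my own sketch of the recipe as I understand it from the manual — not HTK's code, just invented numbers: 10 one-dimensional frames for the hypothetical 100 ms word, a 3-state left-to-right model, uniform segmentation, per-state Gaussian estimation, self-loop transition probabilities derived from state occupancy, then one Viterbi realignment and re-estimation:

```python
# Minimal HInit-style sketch (invented data, not HTK itself):
# 10 one-dimensional frames for a hypothetical 100 ms word,
# 3 emitting states in a left-to-right topology.
import math

frames = [1.0, 1.2, 0.9, 4.8, 5.1, 5.0, 4.9, 9.0, 9.2, 8.8]
N = 3  # emitting states

def estimate(segmentation):
    """Re-estimate each state's Gaussian (mean, variance) and self-loop
    probability from a hard frame-to-state alignment.  The link between
    transition probs and Gaussians: BOTH are read off the same alignment;
    self-loop prob = (d - 1) / d for a state occupied for d frames."""
    means, varis, trans = [], [], []
    for s in range(N):
        obs = [f for f, st in zip(frames, segmentation) if st == s]
        m = sum(obs) / len(obs)
        v = sum((o - m) ** 2 for o in obs) / len(obs)
        means.append(m)
        varis.append(max(v, 1e-3))           # variance floor
        d = len(obs)
        trans.append((d - 1) / d)            # self-loop probability
    return means, varis, trans

def log_gauss(x, m, v):
    return -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)

def viterbi(means, varis, trans):
    """Best left-to-right alignment of frames to states under the
    current Gaussians and transition probabilities."""
    T, NEG = len(frames), float("-inf")
    delta = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    delta[0][0] = log_gauss(frames[0], means[0], varis[0])
    for t in range(1, T):
        for s in range(N):
            best = delta[t - 1][s] + math.log(max(trans[s], 1e-10))
            arg = s
            if s > 0:
                move = delta[t - 1][s - 1] + math.log(max(1 - trans[s - 1], 1e-10))
                if move > best:
                    best, arg = move, s - 1
            delta[t][s] = best + log_gauss(frames[t], means[s], varis[s])
            back[t][s] = arg
    seg = [N - 1]                            # must end in the last state
    for t in range(T - 1, 0, -1):
        seg.append(back[t][seg[-1]])
    return seg[::-1]

# 1) uniform segmentation: split the frames evenly across the states
uniform = [min(t * N // len(frames), N - 1) for t in range(len(frames))]
means, varis, trans = estimate(uniform)      # first parameter estimates
# 2) Viterbi realignment with those parameters, then re-estimate;
#    HInit iterates these two steps until the alignment stops changing.
realigned = viterbi(means, varis, trans)
means, varis, trans = estimate(realigned)
```

In this toy run, uniform segmentation gives frame 3 (value 4.8) to state 0; the Viterbi pass moves it to state 1, which changes both state 0's Gaussian and its self-loop probability. My understanding is that HRest does the same kind of update but uses Forward/Backward, so each frame contributes fractionally to every state instead of being hard-assigned — is that right?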
Is there a command to query Festival as to which ‘Phrase_Method’ it is using?
Per the manual:
“There are two methods for predicting phrase breaks in Festival, one simple and one sophisticated. These two methods are selected through the parameter Phrase_Method…”

Based on my testing, it seems pretty clear that it’s using the ‘simple’ method: the CART that decides based on punctuation existence/type. And the manual implies that it must select one method or the other. But if I could query the Phrase_Method parameter, I could know for sure. How does one query a parameter?
Following up on this: In Festival, it appears that when it can’t find a diphone, it will back off to the next-best diphone. EXCEPT when the missing diphone is an ‘Interword’, in which case it inserts silence. Which, as one would guess, sounds bad when the utterance is played. Here is an example from Festival after issuing the Wave_Synth command on my utterance:
Missing diphone: @_dh
Interword so inseting silence.
Missing diphone: @_hw
Interword so inseting silence.
Missing diphone: jh_iii
diphone still missing, backing off: jh_iii
backed off: jh_iii -> jh_ii
Missing diphone: ch_z
Interword so inseting silence.

The first two diphones that it can’t find don’t strike me as particularly rare. The actual word sequences are “to the” for @_dh and “the white” for @_hw.
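To check my reading of the log, here is the selection behaviour as I understand it, as a sketch — hypothetical names and tables, not Festival's actual implementation:

```python
# Sketch of the behaviour shown in the log above: exact match first;
# a missing interword diphone becomes silence (no backoff attempted);
# a missing within-word diphone backs off to a next-best alternative.
# BACKOFF and DATABASE are hypothetical stand-ins for Festival's data.
BACKOFF = {"jh_iii": "jh_ii"}        # hypothetical next-best table
DATABASE = {"jh_ii", "t_@", "@_t"}   # hypothetical diphone inventory

def select_diphone(diphone, interword):
    if diphone in DATABASE:
        return diphone
    if interword:
        return "SILENCE"             # "Interword so inseting silence."
    alt = BACKOFF.get(diphone)
    if alt in DATABASE:
        return alt                   # "backed off: jh_iii -> jh_ii"
    raise KeyError("diphone still missing: %s" % diphone)

assert select_diphone("jh_iii", interword=False) == "jh_ii"
assert select_diphone("@_dh", interword=True) == "SILENCE"
```

Is that the right model of what it is doing, i.e. interword misses never get a backoff attempt at all?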
Why would such (seemingly) common diphones not be in the database? Does the diphone set for this voice contain ONLY diphones that were recorded within words, or does it have SOME interword diphones (diphones NOT derived from within words), just not all possible ones?

Following up on Radiance Factor:
Q1: Since vowels are essentially ‘low-pass filtered’ by the effects of the resonances in the vocal tract, is it fair to say that the Radiance Factor is more pronounced/noticeable on consonants, especially unvoiced ones?
Q2: Also, is this effect localized, in the sense that it dissipates with distance? If I measure the spectrum of recorded speech from further away from the speaker’s mouth, will the high freq boost go away (as a function of distance), or is it that once the air outside the mouth is perturbed, the spectrum is essentially fixed? The latter would be my intuition, but if radiance factor is caused by some pressure differential between the air in the vocal tract and the air outside the lips, maybe it could change as the pressure normalizes to some steady state as the sound waves propagate further outward. Kind of like how a rock in a pond makes a big splash but then the ripples become normalized as they move away from the splash point.
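For reference, my mental model here is the standard source-filter approximation that treats radiance (lip radiation) as a first-difference filter R(z) = 1 − z⁻¹, a property baked into the radiated signal itself. A quick sketch of its high-frequency tilt (the function name and the 16 kHz sample rate are my own choices):

```python
# The lip-radiation ("radiance") term is commonly approximated by a
# first-order differentiator R(z) = 1 - z^-1, whose magnitude response
# |R(e^{jw})| = 2*sin(w/2) rises with frequency (about +6 dB/octave).
import math

def radiation_gain_db(freq_hz, fs=16000):
    w = 2 * math.pi * freq_hz / fs   # frequency in radians/sample
    return 20 * math.log10(2 * math.sin(w / 2))

# the boost grows monotonically with frequency...
assert radiation_gain_db(4000) > radiation_gain_db(500)
# ...and doubling the frequency adds roughly 6 dB at low frequencies
assert abs((radiation_gain_db(1000) - radiation_gain_db(500)) - 6.0) < 0.1
```

If that model is right, the tilt travels with the wave rather than dissipating, and Q2 reduces to whether anything beyond geometric spreading attenuates the highs with distance.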