Forum Replies Created
Now I see. It turns out that I missed the step of building the join cost. Just to clarify: pitch marking is used to find the join locations, which presumably lie at the pitch marks? And when joining two diphone units, we join the two windowed pitch periods at the join point, right? Thanks!
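To check my understanding, here is a rough sketch of the kind of pitch-synchronous join I have in mind (not Festival's actual MultiSyn code; all names here are invented, and a real system would overlap-add windowed pitch periods rather than do a single cross-fade):

```python
import numpy as np

def join_at_pitch_marks(left, right, left_pm, right_pm, period):
    """Cross-fade `left` into `right`, with the overlap aligned on a pitch
    mark in each unit so the periodic structure lines up across the join."""
    n = min(period, len(left) - left_pm, right_pm)    # overlap in samples
    win = np.hanning(2 * n)
    blended = (left[left_pm:left_pm + n] * win[n:]        # fade out left
               + right[right_pm - n:right_pm] * win[:n])  # fade in right
    return np.concatenate([left[:left_pm], blended, right[right_pm:]])

# toy usage: two synthetic 120 Hz "units", joined at sample 8000
sr = 16000
t = np.arange(sr) / sr
a = np.sin(2 * np.pi * 120 * t)
b = np.sin(2 * np.pi * 120 * t)
out = join_at_pitch_marks(a, b, left_pm=8000, right_pm=8000, period=sr // 120)
```

The point, as I understand it, is just that the overlap is anchored on a pitch mark in each unit, so the waveforms stay phase-aligned across the join.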
1. I also measured the number of joins. It turns out that the system with the male pm setting also makes fewer joins (453) than the system with the female pm setting (467). I thought changing the pm setting wouldn't result in such a different choice of units. Is it because the target cost penalizes badly pitch-marked candidates?
2. If I turn off the target cost, all the target costs will be zero when I check the utterance relations, right? Should I then compare the join costs instead?
Thank you!
Can I ask a follow-up question? I calculated the mean target cost and join cost that my system with the male pm setting produces over 30 sentences. It turns out that this system gives much lower target and join costs, yet it produces twice as many pitch-marking errors as my system with the female pm setting, and it doesn't sound any better than my standard one. I'm just wondering whether there is a reason for these odd results?
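For reference, this is roughly how I computed those numbers (just a sketch; it assumes the per-unit target and join costs have already been dumped from each utterance's Unit relation into plain Python lists, and it treats a zero join cost as a join between units that were contiguous in the database):

```python
def summarise(utterances):
    """utterances: one list per sentence of (target_cost, join_cost) pairs,
    one pair per selected unit."""
    t = [tc for utt in utterances for tc, _ in utt]
    j = [jc for utt in utterances for _, jc in utt]
    joins = [jc for jc in j if jc > 0.0]   # zero cost: units were contiguous
    return {"mean_target_cost": sum(t) / len(t),
            "mean_join_cost": sum(joins) / len(joins),
            "num_joins": len(joins)}

# toy example with two short "sentences"
toy = [[(0.0, 0.0), (1.2, 0.8)], [(0.4, 0.0), (0.9, 1.1)]]
print(summarise(toy))
```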
Now I see!
So the reason the deltas are produced from a Gaussian distribution is that we take into account all the frames produced by the state and estimate the mean delta. Right? (I assume that we also pool examples from all the states clustered under this leaf node. Is that correct?)
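Something like this is what I picture (a minimal sketch, not any toolkit's actual code: delta() uses a plain central difference rather than, say, HTK's regression window, and frames_by_state / fit_leaf_gaussian are names I made up):

```python
import numpy as np

def delta(c):
    """First-order dynamics via a central difference over neighbouring
    frames (real toolkits typically use a wider regression window)."""
    d = np.empty_like(c)
    d[1:-1] = (c[2:] - c[:-2]) / 2.0
    d[0], d[-1] = d[1], d[-2]          # copy values at the edges
    return d

def fit_leaf_gaussian(frames_by_state, leaf_states):
    """Pool the delta frames aligned to every state clustered under one
    leaf node, then fit a single diagonal Gaussian (mean, variance)."""
    pooled = np.vstack([frames_by_state[s] for s in leaf_states])
    return pooled.mean(axis=0), pooled.var(axis=0)

# toy example: two states share a leaf; 13-dim cepstral frames
frames_by_state = {"s2_a": delta(np.random.randn(40, 13)),
                   "s2_b": delta(np.random.randn(25, 13))}
mu, var = fit_leaf_gaussian(frames_by_state, ["s2_a", "s2_b"])
```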
Thank you!
Thanks! It is all clear to me now. Just a follow-up question:
Can the model itself find the most appropriate number of states? Or is it predetermined, by convention, to be three states per phoneme?
Intuitively, I guess there is a limit on the number of states in each model, i.e. no larger than the number of frames in the observation sequence generated by the model. Is that correct?
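To illustrate my intuition (a toy check only, assuming a strictly left-to-right topology with no skip transitions; the function name is made up):

```python
def can_generate(num_emitting_states, num_frames):
    # A strictly left-to-right model with no skip transitions must spend at
    # least one frame in each emitting state, so it needs T >= N frames.
    return num_frames >= num_emitting_states

assert can_generate(3, 10)      # a 10-frame phone fits a 3-state model
assert not can_generate(3, 2)   # a 2-frame phone cannot visit 3 states
```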
Thanks!
This is now clear to me. Thank you so much. I can see the maths in IE but not in Chrome or Firefox.
If the training set is already labelled with pronunciations, I would assume that every letter is already aligned with its correct phone in each word, so why do we bother implementing this algorithm to realign each letter with its phone?
Are the words in the training set already hand-labelled with their pronunciations before the algorithm runs?
If not, how can we find a single good alignment for each word in the training set? If we are to use unigram probabilities, say we count all the possible realisations of "c" over its allowable list (/k/, /s/, …) and conclude that P(/k/|"c") is the highest in the list. With that probability alone, how are we able to align "c" with /s/ in the case of "cistern"?
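To make my question concrete, here is a sketch of the kind of alignment I mean (not the actual course algorithm; align, the "_" epsilon symbol, and the probability table p are all invented, and each letter is assumed to emit at most one phone). Because the dynamic programme must consume every phone of the word in order, the globally best path can still map "c" to /s/ even though P(/k/|"c") is higher locally:

```python
import math

def align(letters, phones, p):
    """Best monotonic letter-to-phone alignment by dynamic programming."""
    INF = float("inf")
    n, m = len(letters), len(phones)
    # cost[i][j] = best negative log-probability of aligning
    # letters[:i] with phones[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:      # letter i emits phone j
                c = cost[i][j] - math.log(p.get((letters[i], phones[j]), 1e-10))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1] = c
                    back[i + 1][j + 1] = (i, j, phones[j])
            if i < n:                # letter i emits nothing ("_" = epsilon)
                c = cost[i][j] - math.log(p.get((letters[i], "_"), 1e-10))
                if c < cost[i + 1][j]:
                    cost[i + 1][j] = c
                    back[i + 1][j] = (i, j, "_")
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):          # trace the best path back
        i, j, ph = back[i][j]
        pairs.append((letters[i], ph))
    return list(reversed(pairs))

# made-up unigram probabilities; everything unlisted gets a tiny floor
p = {("c", "k"): 0.7, ("c", "s"): 0.3, ("i", "ih"): 0.9, ("s", "s"): 0.9,
     ("t", "t"): 0.9, ("e", "er"): 0.5, ("e", "_"): 0.3,
     ("r", "er"): 0.5, ("r", "_"): 0.4, ("n", "n"): 0.9}
print(align("cistern", ["s", "ih", "s", "t", "er", "n"], p))
# [('c', 's'), ('i', 'ih'), ('s', 's'), ('t', 't'), ('e', 'er'), ('r', '_'), ('n', 'n')]
```

Here "c"→/k/ is impossible to reconcile with the phone string /s ih s t er n/, so the path through "c"→/s/ wins despite its lower unigram score, which I suspect is the whole point of the realignment step.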