Letter-to-sound: Alignment method for the training set
- This topic has 5 replies, 2 voices, and was last updated 9 years, 3 months ago by Simon.
-
October 11, 2015 at 12:53 #283
I don’t quite understand the alignment method for the training set described on p. 294 (Section 8.2.3) of Jurafsky & Martin’s textbook.
It seems to me that the training only aligns each letter to its most probable pronunciation in the allowables list, without considering history or context. That is, if “c” is mostly realised as /k/ in English, will “c” always be aligned to /k/ in every word in the training set?
If that is the case, I find it difficult to see how a machine learning classifier could learn anything else from this training set, since there would only ever be a one-to-one mapping between each letter and its phone.
-
October 12, 2015 at 09:39 #285
This algorithm is for preparing the training set for a letter-to-sound model (e.g., a classification tree). The end result of the algorithm is a single alignment between letters and phonemes, for each word in the training set (i.e., a pre-existing pronunciation dictionary).
It’s important to realise that, across the whole training set, a particular letter (e.g., “c”) might align with different phonemes (sometimes /k/, sometimes /ch/, etc.) in different words. It won’t necessarily align with the same phoneme every time.
So, how do we get to that single alignment? We use a simple unigram model of the probability of each letter aligning with each phoneme. Most of the probabilities in this model will be zero, and the only non-zero probabilities are for those letter-phoneme pairs given in the allowables list.
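As a sketch of what such a unigram model could look like (the notation is introduced here for illustration and is not taken from the textbook): write $c(l, p)$ for the number of times letter $l$ is aligned with phoneme $p$ in the current alignments, and $A(l)$ for the allowables list of $l$.

```latex
P(p \mid l) =
\begin{cases}
  \dfrac{c(l, p)}{\sum_{p' \in A(l)} c(l, p')} & \text{if } p \in A(l) \\[2ex]
  0 & \text{otherwise}
\end{cases}
\qquad
\text{score}(\text{alignment}) = \prod_{i} P(p_i \mid l_i)
```

Under this reading, the score of one candidate alignment of a word is the product over its letter-phoneme pairs, and the best alignment for a word is the candidate with the highest score. Because the score covers the whole word, a letter can still take a less probable phoneme when that is the only way to account for the word's pronunciation.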
The key machine learning concept to understand in this algorithm is that of first initialising this unigram model and then iteratively improving the model.
To initialise, and then to improve the model, we need an alignment for all words in the training set, so that we can count how many times each phoneme aligns with each letter. The allowables lists are used to find the first alignment. The model is then updated, and then this improved model is used to find a better alignment.
If the allowables list for a particular letter only contained a single phoneme, then that letter would always have to align with that phoneme. But in general, the allowables lists will have many phonemes for each letter.
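The initialise-then-improve loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the textbook's or any toolkit's implementation: the allowables, the toy lexicon, the phoneme symbols and all function names are invented for the example, each letter is assumed to align with exactly one phoneme or with nothing (`EPS`), and candidate alignments are enumerated by brute force where a real system would use dynamic programming.

```python
import math
from collections import defaultdict

EPS = "_"  # assumed symbol for "this letter produces no phoneme"

# Hypothetical allowables lists and toy lexicon, invented for illustration.
ALLOWABLES = {
    "a": {"ae"},
    "b": {"b"},
    "c": {"k", "s", "ch", EPS},
    "h": {"hh", EPS},
    "i": {"ih"},
    "k": {"k", EPS},
    "t": {"t"},
    "y": {"iy"},
}
LEXICON = [
    ("cat", ["k", "ae", "t"]),
    ("city", ["s", "ih", "t", "iy"]),
    ("chat", ["ch", "ae", "t"]),
    ("back", ["b", "ae", "k"]),
]


def alignments(letters, phones):
    """Yield every alignment permitted by ALLOWABLES as (letter, phoneme) pairs."""
    if not letters:
        if not phones:
            yield ()
        return
    head, rest = letters[0], letters[1:]
    allowed = ALLOWABLES.get(head, set())
    if phones and phones[0] in allowed:  # letter consumes the next phoneme
        for tail in alignments(rest, phones[1:]):
            yield ((head, phones[0]),) + tail
    if EPS in allowed:  # letter is silent
        for tail in alignments(rest, phones):
            yield ((head, EPS),) + tail


def estimate(alignment_list):
    """Turn letter-phoneme counts into unigram probabilities P(phoneme | letter)."""
    counts = defaultdict(lambda: defaultdict(float))
    for alignment in alignment_list:
        for letter, phone in alignment:
            counts[letter][phone] += 1
    return {
        letter: {p: c / sum(by_phone.values()) for p, c in by_phone.items()}
        for letter, by_phone in counts.items()
    }


def best_alignment(letters, phones, probs):
    """Pick the allowable alignment with the highest product of probabilities.

    Raises ValueError if the allowables permit no alignment at all.
    """
    def score(alignment):
        return math.prod(probs.get(l, {}).get(p, 0.0) for l, p in alignment)
    return max(alignments(letters, phones), key=score)


def train(lexicon, iterations=5):
    # Initialise the model by counting over all allowable alignments.
    probs = estimate([a for w, ph in lexicon for a in alignments(w, tuple(ph))])
    best = {}
    for _ in range(iterations):
        # Re-align every word with the current model, then update the model.
        best = {w: best_alignment(w, tuple(ph), probs) for w, ph in lexicon}
        probs = estimate(best.values())
    return best, probs


if __name__ == "__main__":
    best, probs = train(LEXICON)
    for word, alignment in best.items():
        print(word, alignment)
```

In this toy run, “c” ends up aligned with /k/ in “cat”, /s/ in “city” and /ch/ in “chat”, because in each word only one phoneme from its allowables list is consistent with the dictionary pronunciation. The model only has to decide the genuinely ambiguous case, “back”, where either “c” or “k” could be the silent letter.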
-
October 13, 2015 at 11:04 #288
Are the words in the training set already hand-labelled with their pronunciation before the algorithm is run?
If not, how can we find a single good alignment for each word in the training set? If we are to use unigram probabilities, say we count all the possible realisations of “c” in its allowables list (/k/, /s/, …) and conclude that P(/k/ | “c”) is the highest in the list. With this probability, how are we able to align “c” with /s/ in the case of “cistern”?
-
October 13, 2015 at 12:44 #289
Yes, the words in the training set are hand-labelled with the pronunciation: this is just a dictionary. See this topic.
At synthesis time, the dictionary will be used in preference to the letter-to-sound model for all words in the dictionary. The letter-to-sound model will only be used for words not in the dictionary.
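The lookup order described above can be sketched as follows. This is only an illustration: `pronounce`, `toy_lexicon` and `naive_letter_to_sound` are hypothetical names, and the stand-in model simply emits one made-up symbol per letter so that the example runs without a trained classification tree.

```python
def pronounce(word, lexicon, letter_to_sound):
    """Use the dictionary entry if there is one, otherwise the letter-to-sound model."""
    entry = lexicon.get(word.lower())
    if entry is not None:
        return entry
    return letter_to_sound(word.lower())


# Hypothetical toy dictionary; a real one would be a full pronunciation lexicon.
toy_lexicon = {"cat": ["k", "ae", "t"]}


def naive_letter_to_sound(word):
    # Stand-in for a trained model (assumption): one symbol per letter.
    return [f"<{letter}>" for letter in word]


print(pronounce("cat", toy_lexicon, naive_letter_to_sound))  # dictionary hit
print(pronounce("tac", toy_lexicon, naive_letter_to_sound))  # falls back to the model
```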
-
October 14, 2015 at 18:04 #319
Given that the training set is already labelled with pronunciations, I assume that every letter is already aligned with its correct phone in each word, so why do we need this algorithm to realign each letter with its phone?
-
October 14, 2015 at 18:19 #320
The pronunciation dictionary (written by hand) does not specify an alignment between letters and phonemes. See this topic for an extract from cmulex, showing what is contained in the dictionary.
We need to use this algorithm to find the alignment, before going on to train a classification tree.
-