Forum Replies Created
Every time a new Item is created, it just gets assigned the next numerical id in sequence. So, the ids do not carry any human-friendly information and are best ignored. Sometimes an Item may be deleted, so not every numerical id will necessarily be present.
If you wanted to see all Items currently present in an Utterance, then you could save the utterance to a file using the utt.save command, and then open that file in a text editor such as Aquamacs.
festival> (utt.save myutt 'my_file_name.utt)
The file my_file_name.utt will be saved in whatever directory you were in when you started Festival. Note that the waveform is not saved within the utterance file.
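If you later want to load a saved utterance back in, Festival has a corresponding utt.load function. A minimal sketch (the variable and file names here are just examples):
festival> (set! myutt (utt.load nil "my_file_name.utt"))
From the Scheme prompt, (utt.relation.items myutt 'Segment) will then give you the Items in the Segment relation.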
Festival can handle utf-8 and utf-16 characters, but not via the interactive command-line interface. This is a limitation of the input method. You would need to input such text from a file.
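For example, you could put the text in a file and synthesise it with the tts function (the file name here is just a placeholder):
festival> (tts "mytext.txt" nil)
or, from the shell:
$ festival --tts mytext.txt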
Hmm – that’s a good point, and not something I’d spotted before. It is almost certainly a typo.
Anyway, for the purposes of understanding, it’s fine to assume that Festival performs POS tagging using precisely the method described in Jurafsky & Martin.
The weights are simply the fractions of data points in each side of the split. So, we compute entropy as usual for each side (“yes” vs “no”) and then when we sum these two values, we weight each of them by the fraction of the data that went down that branch (e.g., if 1/3 of the data points had “yes” as the answer to the question under consideration, then we would weight the “yes” side’s entropy by 1/3 and the “no” side’s by 2/3, then add them together).
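As a toy worked example (with made-up entropy values): suppose the entropy of the “yes” side comes out as 1.0 bits and the entropy of the “no” side as 0.5 bits, with the 1/3 vs 2/3 split above. The weighted entropy after the split is then
(1/3 × 1.0) + (2/3 × 0.5) = 0.33 + 0.33 = 0.67 bits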
To query a variable in Scheme, just type its name at the Festival prompt, without any parentheses. If you get “unbound variable”, that means the variable is not set, so the method will be the built-in default (in this case, the hand-crafted CART).
It’s tertiary stress, which is marked up in the Unisyn lexicon – see Section 3.4.3 of the Unisyn manual. Tertiary stress is not there to show that a syllable might receive a pitch accent; rather, it blocks certain post-lexical rules, such as vowel reduction.
So, the second syllable in “upset” should never be reduced, in any context. I think Unisyn would regard “upset” as a compound word “up + set”, which is why the tertiary stress is marked up.
This section of the Festival manual gives you some clues about what happens in the Postlex module, including vowel reduction and possessive “s”.
For example, compare the Segments produced for these two sentences:
- Simon’s bike.
- Matt’s bike.
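One way to make that comparison, assuming you have a voice loaded (the variable name myutt is just an example), is:
festival> (set! myutt (SayText "Simon's bike."))
festival> (utt.relation.print myutt 'Segment)
and then the same for “Matt’s bike.”, listening (and looking) for the final /z/ vs /s/.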
The default is the hand-crafted CART. You can inspect this classification tree thus:
festival> phrase_cart_tree
which should give something like:
((lisp_token_end_punc in ("?" "." ":")) ((BB)) ((lisp_token_end_punc in ("'" "\"" "," ";")) ((B)) ((n.name is 0) ((BB)) ((NB)))))
and if you draw that as a tree, you’ll see that the punctuation symbols
? . :
all lead to a Big Break (BB), and that the symbols
' " , ;
all lead to a Break (B) and otherwise there is No Break (NB) unless we reach the end of the input text, in which case a BB is placed even if there is no sentence-final punctuation.
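For reference, here is my own rendering of that same tree drawn out:
lisp_token_end_punc in ("?" "." ":") ?
  yes -> BB
  no  -> lisp_token_end_punc in ("'" "\"" "," ";") ?
           yes -> B
           no  -> n.name is 0 ?   ; i.e., there is no next token
                    yes -> BB
                    no  -> NB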
Both types of error will result in a problem that we can hear in the synthetic speech. But yes, they happen at different points in the pipeline.
We can’t really say that they will be “realised similarly” though.
This topic is about finding pronunciation errors.
This topic is about finding waveform generation errors.
That’s right. The rules operate on the phonetic string for a complete sentence, as output by the letter-to-sound module (which includes both the dictionary and the letter-to-sound “rules”, typically a classification tree).
The post-lexical rules rewrite this string to account for contextual effects that only apply when a word is said in context, not in isolation (“citation form”).
Because there are relatively few such effects (at least, only a few that can easily be described in terms of changing the phonetic string), post-lexical rules are usually written by hand.
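To make that concrete, here is a toy hand-written rule in Scheme for the possessive “s” case. This is just an illustrative sketch, not Festival’s actual implementation (the function name and phone sets are my own invention):
(define (possessive-s final-phone)
  ;; choose the surface form of possessive "s" from the word-final phone
  (cond
    ((member final-phone '(s z sh zh ch jh)) '(ax z)) ; sibilant: insert a schwa, e.g. "Rose's"
    ((member final-phone '(p t k f th)) '(s))         ; voiceless: /s/, e.g. "Matt's"
    (t '(z))))                                        ; otherwise voiced: /z/, e.g. "Simon's"
For example, (possessive-s 'k) returns (s) and (possessive-s 'n) returns (z).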
In Festival, and many other systems, duration is predicted at the segmental (i.e., phone) level. Festival uses a regression tree, because duration is a continuous value.
The tree could directly predict duration in ms or s. But it’s often better to predict what is called a z-score (the figure in that article is helpful). This is the duration expressed as the difference (in numbers of standard deviations) from the mean duration for that phoneme. Here’s what z-score means for duration:
- large positive numbers: duration is a lot longer than average
- small positive numbers: duration is a bit longer than average
- zero: duration is exactly equal to the average
- small negative numbers: duration is a bit shorter than average
- large negative numbers: duration is a lot shorter than average
We would expect z-scores in a relatively narrow range about the mean (+/- 2 would cover roughly 95% of all cases).
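As a toy worked example (made-up numbers): if the mean duration of [eh] in the training data is 70 ms with a standard deviation of 20 ms, then a predicted z-score of 1.5 corresponds to
duration = 70 + (1.5 × 20) = 100 ms
and a z-score of -0.5 would give 70 + (-0.5 × 20) = 60 ms.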
POS is needed in the lexicon to disambiguate homographs. Because POS is the only way to choose the correct pronunciation for words such as “present”, we need to run a POS tagger before trying to look the word up in the lexicon.
In Festival, the lex.lookup_all function will retrieve all matching words and show you their POS tags, for example (for a voice based on the CMU lexicon):
festival> (voice_cmu_us_slt_arctic_hts)
cmu_us_slt_arctic_hts
festival> (lex.lookup_all 'present)
(("present" n (((p r eh z) 1) ((ax n t) 0))) ("present" v (((p r iy z) 0) ((eh n t) 1))))
Later in the processing pipeline, the POS tags will also be used to predict phrase breaks.
You have correctly found that this voice does indeed have many missing diphones. A larger or more carefully designed recording script would not have this problem.
The reason this happens so frequently for this voice is that the diphone coverage was determined using one dictionary (CMUlex) but the voice has been built with a different dictionary (Unisyn). Normally, we wouldn’t do that, but it’s useful for the purposes of this assignment to show what happens when diphones are missing.
The database comprises sentences of connected speech, so does have both within- and across-word diphones.
The database is the awb speaker (i.e., Alan Black himself) from the ARCTIC set of corpora.
Footnote: the voice is called voice_cstr_edi_awb_arctic_multisyn, which means “built in CSTR / Edinburgh accent / speaker ‘awb’ / ARCTIC corpus / multisyn unit selection engine”.
See this topic.
First: how do we come up with the list of possible questions in the first place?
We use our own knowledge of the problem to design the questions, and indeed to select which predictors to ask questions about. It’s not important to choose only good questions because the CART training procedure will automatically find the best ones and ignore less useful ones. So, we try to think of every possible question that we might ask.
Second: during training, how does the algorithm choose the best question to split the data at the current node?
It tries every possible question and, for each one, makes a note of the reduction in entropy (the information gain). It chooses the question that gives the best information gain and puts that into the tree, as in the sketch below.
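Here is a sketch of that greedy selection in Scheme. It is purely illustrative (entropy, split-data and weighted-entropy are hypothetical helper functions, and this is not Festival’s actual training code, which lives in the wagon tool):
(define (best-question questions data)
  ;; try every question; keep the one with the largest information gain
  (let ((best nil)
        (best-gain 0))
    (mapcar
      (lambda (q)
        (let ((gain (- (entropy data)
                       (weighted-entropy (split-data data q)))))
          (if (> gain best-gain)
              (begin
                (set! best q)
                (set! best-gain gain)))))
      questions)
    best))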
Third: what happens if the training algorithm puts a “not so effective” question into the tree?
This will never happen. If the best available question does not give a large enough information gain, then we terminate and do not split that node any further (although the tree can keep growing the other branches).
There is no backtracking: that would massively increase the computational complexity of the training. So, we call this a “greedy” algorithm.