Forum Replies Created
January 24, 2016 at 17:30 in reply to: Labelling the diphones (not the features, just the phonemes) #2325
We’ll look at this in detail in the lecture.
Yes, this would be pretty straightforward to do. As you say, you could treat it as a special kind of word (presumably pronounced as a special new phone).
In fact, you might find that – at least in unit selection – you will get these in-breaths ‘for free’ because phrase-initial silence diphones will be chosen to synthesise silence in phrase-initial positions.
Building synthetic voices from spontaneous speech is an area of active research.
Although we might be able to gather a lot of spontaneous speech, one barrier is that we then have to manually transcribe it. The second barrier is that it is hard to align the phonetic sequence with the speech; this is for many of the same reasons that Automatic Speech Recognition of such speech is hard (you list some of them: disfluencies, co-articulations, deletions,…).
The hypothesised advantage of using spontaneous speech, over read text, is that the voice would sound more natural.
You put your finger on the core theoretical problem though: without a good model of the variation in spontaneous speech (including a predictive model of that variation given only text input), it is indeed just unwanted noise in the database.
If we want a voice that speaks in a single speaking style (or emotion), then we can simply record data in that style and build the voice as usual. That will work very well, but will not scale to producing many different styles / emotions / etc.
Can you each try to specify more precisely what you mean by ‘expression’ before I continue my answer? Is it a property of whole utterances, or parts of utterances, for example?
In many systems, I think a mixture of explicit and implicit labels is used.
Read what Taylor has to say about intuitive versus analytic labelling (section 17.1.4 to 17.1.6).
When you mention hand-labelling POS tags, I think a better idea is to hand-label some training data, then train a POS tagger, and use that to label your actual database. This is what is done in practice.
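To make that concrete, here is a hedged little sketch (not from the course materials) using NLTK: train a very simple tagger on a few hand-labelled sentences, then use it to tag a database prompt. The training sentences and the prompt are invented, and a real system would use a much larger hand-labelled set and a more powerful tagger.

# Hedged sketch: hand-label a little data, train a tagger, label the rest.
from nltk.tag import UnigramTagger, DefaultTagger

# Hand-labelled training data: each sentence is a list of (word, tag) pairs.
train_sents = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VBD")],
    [("author", "NN"), ("of", "IN"), ("the", "DT"), ("danger", "NN"), ("trail", "NN")],
]

# Back off to a default tag for words never seen in the hand-labelled data.
tagger = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))

# Now label an actual database prompt automatically.
prompt = ["the", "author", "of", "the", "danger", "trail"]
print(tagger.tag(prompt))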
Can you suggest which explicit labels you would like to place on the database?
Labelling prosody on the database is one of those topics that has a long history of research, but no really good solutions: it’s a very hard problem.
We are fairly sure that highly-accurate ToBI labels are helpful, provided that we have them for both the database utterances and test sentences. So, even if we hand-label the database, we still have the hard problem of accurately predicting from text at synthesis time, in a way that is consistent with the database labels.
Yes, many people have looked at simpler systems than ToBI. Festival reduces the number of boundary strength levels, for example. Your suggestion to train a model on a hand-labelled subset of the data and use that to label the rest of the database is excellent: this is indeed what people do. But there remains the “predicting from text” problem at synthesis time.
One simpler approach is to think just about prominences and boundaries.
Perhaps a more promising approach these days is to label the database with more sophisticated linguistic information than plain old POS tags, such as shallow syntactic and semantic structure.
We certainly do need to train models of full and reduced vowels before we can use forced alignment to choose which one is the most likely label to place on a particular segment in the database.
We’ll look at how to train those models in the lecture.
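To make the idea concrete, here is a hedged, purely conceptual sketch: the lexicon offers both a full and a reduced pronunciation, and we keep whichever one the trained models score higher under forced alignment. The aligner_log_likelihood argument is a stand-in for a real aligner, not a Festival or HTK call.

# Hedged conceptual sketch: choosing between a full and a reduced vowel label.
def choose_vowel_label(acoustics, aligner_log_likelihood):
    # Candidate pronunciations of "the": full vowel /iy/ versus reduced schwa.
    candidates = {
        "full":    ["dh", "iy"],
        "reduced": ["dh", "ax"],
    }
    # Score each candidate phone sequence against the audio with the trained
    # models (forced alignment), then keep the more likely label.
    scores = {label: aligner_log_likelihood(phones, acoustics)
              for label, phones in candidates.items()}
    return max(scores, key=scores.get)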
It’s certainly the case in unit selection that there are many versions that will sound as good as the one chosen via the target and join costs. Actually, there will very probably be many that sound better, but were not the lowest cost sequence in the search (why is that?).
It’s easy in principle to generate an n-best list during a Viterbi search (although this is not implemented in Festival).
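Here is a hedged sketch of one simple way: at each step keep the n cheapest partial paths instead of only the single best (strictly a beam-style approximation to exact n-best, but it conveys the idea). The target and join costs below are invented toy values, not Festival’s.

import heapq

# Hedged sketch: n-best search over a unit-selection lattice.
# target_costs[t][u] is the target cost of candidate unit u at position t;
# join_cost(prev_u, u) is the cost of concatenating two candidate units.
def nbest_unit_sequences(target_costs, join_cost, n=5):
    # Each hypothesis is (total cost so far, list of unit indices chosen).
    hyps = heapq.nsmallest(n, [(cost, [u]) for u, cost in enumerate(target_costs[0])])
    for costs_t in target_costs[1:]:
        extended = [(cost + join_cost(path[-1], u) + tcost, path + [u])
                    for cost, path in hyps
                    for u, tcost in enumerate(costs_t)]
        # Keep only the n cheapest partial paths; the single cheapest is the Viterbi path.
        hyps = heapq.nsmallest(n, extended)
    return hyps   # n lowest-cost unit sequences, best first

# Toy example: 3 target positions, 2 candidate units each, join cost 0 or 1.
print(nbest_unit_sequences([[1.0, 2.0], [0.5, 0.5], [1.0, 0.2]],
                           lambda a, b: 0.0 if a == b else 1.0, n=3))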
Here’s an idea for how you might generate variants from your own unit selection voice without modifying any code (a small sketch of the file-editing step follows the list):
- Synthesise the sentence, and examine the utterance structure to see which prompts from the database were used
- Remove one or more (maybe all) of those prompts from utts.data
- Restart Festival
- Synthesise the sentence again: different units will be chosen
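If you prefer not to edit utts.data by hand, here is a hedged sketch of that file-editing step in Python. The utterance IDs are invented, and you should check the exact format of your own utts.data (each line normally looks like ( utt_id "text" )).

# Hedged sketch: write a copy of utts.data with the chosen utterances removed.
to_remove = {"arctic_a0001", "arctic_a0042"}   # invented IDs: the chosen units' source prompts

with open("utts.data") as infile, open("utts.data.reduced", "w") as outfile:
    for line in infile:
        fields = line.split()
        utt_id = fields[1] if len(fields) > 1 else ""
        if utt_id not in to_remove:
            outfile.write(line)

# Back up utts.data, replace it with utts.data.reduced, restart Festival,
# and synthesise the sentence again.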
Sure – that would be fine.
In general, I don’t think people use Natural Language Generation (NLG) for this, mainly because NLG systems are typically limited domain, and so will only generate a closed set of sentences (or at least, from a closed vocabulary).
The vast majority of missing diphones will be cross-word (why is that?). So, all you would really need to do is find word pairs that contain the required diphone. However, you would want these to occur in a reasonably natural sentence, so that they can be used in the same way as the other prompts (i.e., recorded and used in their entirety).
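As a hedged sketch of that search (the lexicon and sentences are invented toy data, and a real script would work from the full pronunciation lexicon used to build the voice):

# Hedged sketch: find word pairs whose boundary gives a wanted cross-word diphone.
lexicon = {
    "big": ["b", "ih", "g"],
    "zoo": ["z", "uw"],
    "the": ["dh", "ax"],
}
wanted = "g-z"   # a missing cross-word diphone, written as "left-right"

def cross_word_diphone(word1, word2):
    # The diphone spanning the word boundary: last phone of word1 + first phone of word2.
    return lexicon[word1][-1] + "-" + lexicon[word2][0]

sentences = [["the", "big", "zoo"], ["the", "zoo"]]
for sentence in sentences:
    for w1, w2 in zip(sentence, sentence[1:]):
        if cross_word_diphone(w1, w2) == wanted:
            print("covers", wanted, ":", " ".join(sentence))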
You might need to cut a string on a separator, keeping only some parts of it. There are lots of ways to do that. The built-in cut command is one way (and you can pass it files too, in which case it will perform the same operation on every line). The pipe “|” sends the output of one process to the input of the next.
$ # -c cuts using character positions
$ echo some_file.txt | cut -c6-9
file
$ # -d cuts using the delimiter you specify
$ echo some_file.txt | cut -d"_" -f1
some
$ # and -f specifies which field(s) you want to keep
$ echo some_file.txt | cut -d"_" -f2
file.txt
$ echo a_long_file_name.txt | cut -d"_" -f2-4
long_file_name.txt
I’ve clarified my response: removing the question sentences entirely seems to be preferable to keeping them but removing their question marks.
We’ll look at a more detailed example of greedy text selection in the lecture.
Your suggestion to normalise for the length of the sentence is a good idea, otherwise we might just select the longest sentences (because they contain more diphones than shorter sentences).
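Here is a hedged sketch of that greedy loop with the length normalisation built in. The get_diphones function is a simplified stand-in: a real implementation would work from the phone transcription produced by the front end, and would handle cross-word and pause diphones properly.

# Hedged sketch of greedy text selection, scoring each candidate sentence by
# how many not-yet-covered diphones it adds per phone, so that long sentences
# do not win simply by being long.
def get_diphones(phones):
    return {a + "-" + b for a, b in zip(phones, phones[1:])}

def greedy_select(candidates, n_wanted):
    covered, chosen = set(), []
    for _ in range(n_wanted):
        if not candidates:
            break
        def score(phones):
            new = get_diphones(phones) - covered
            return len(new) / len(phones)          # normalise for sentence length
        best = max(candidates, key=score)
        if score(best) == 0:                        # nothing new left to cover
            break
        chosen.append(best)
        covered |= get_diphones(best)
        candidates = [s for s in candidates if s is not best]
    return chosen, covered

# Toy corpus of three "sentences" given as phone sequences.
corpus = [["a", "b", "c"], ["b", "c", "d", "e"], ["a", "b"]]
print(greedy_select(corpus, 2))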
You make a good point about final total coverage: 100% might be impossible simply because there are no occurrences of certain very rare diphones in our large corpus. The ARCTIC corpus covers around 75-80% of all possible diphones. The initial large corpus contained at least one example of about 90% of all possible diphones (reducing to around 80% when discarding sentences that are not “nice”), so that would be a ceiling on the possible coverage that could ever be obtained.
A training algorithm is used to train a model on some data. Give me more context to your question and I’ll provide a more specific answer.
We’ll do a more detailed example in the lecture.
I agree that this is a somewhat strange design decision in the ARCTIC corpora. In the tech report, the authors don’t justify this decision, but I assume it is because questions are too sparse to attempt coverage of them, and because the features used in their text selection algorithm don’t capture the differences between statements and questions.
Your suggestion to remove sentences that are questions from the corpus entirely, rather than keep them without a question mark, seems sensible to me.