Forum Replies Created
Building synthetic voices from spontaneous speech is an area of active research.
Although we might be able to gather a lot of spontaneous speech, one barrier is that we then have to manually transcribe it. The second barrier is that it is hard to align the phonetic sequence with the speech; this is for many of the same reasons that Automatic Speech Recognition of such speech is hard (you list some of them: disfluencies, co-articulations, deletions,…).
The hypothesised advantage of using spontaneous speech, over read text, is that the voice would sound more natural.
You put your finger on the core theoretical problem though: without a good model of the variation in spontaneous speech (including a predictive model of that variation given only text input), it is indeed just unwanted noise in the database.
If we want a voice that speaks in a single speaking style (or emotion), then we can simply record data in that style and build the voice as usual. That will work very well, but will not scale to producing many different styles / emotions / etc.
Can you each try to specify more precisely what you mean by ‘expression’ before I continue my answer? Is it a property of whole utterances, or parts of utterances, for example?
In many systems, I think a mixture of explicit and implicit labels is used.
Read what Taylor has to say about intuitive versus analytic labelling (sections 17.1.4 to 17.1.6).
When you mention hand-labelling POS tags, I think a better idea is to hand-label some training data, then train a POS tagger, and use that to label your actual database. This is what is done in practice.
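If it helps to see that bootstrapping idea as code, here is a minimal sketch (the data structures are my assumptions, and a real tagger would of course use context rather than just per-word counts): hand-label a small set of sentences, train on them, then automatically label the rest of the database.

from collections import Counter, defaultdict

# hand_labelled: a small set of sentences you tagged by hand,
# e.g. [[("the", "DET"), ("cat", "NOUN"), ("sat", "VERB")], ...]
def train_most_frequent_tag(hand_labelled):
    counts = defaultdict(Counter)
    for sentence in hand_labelled:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    # for each word seen in training, remember its most frequent tag
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_sentence(words, model, default="NOUN"):
    # back off to a default tag for words never seen in the hand-labelled data
    return [(word, model.get(word.lower(), default)) for word in words]

# usage: automatically label the (much larger) unlabelled database
# model = train_most_frequent_tag(hand_labelled)
# tagged_database = [tag_sentence(words, model) for words in unlabelled_database]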
Can you suggest which explicit labels you would like to place on the database?
Labelling prosody on the database is one of those topics that has a long history of research, but no really good solutions: it’s a very hard problem.
We are fairly sure that highly-accurate ToBI labels are helpful, provided that we have them for both the database utterances and test sentences. So, even if we hand-label the database, we still have the hard problem of accurately predicting from text at synthesis time, in a way that is consistent with the database labels.
Yes, many people have looked at simpler systems than ToBI. Festival reduces the number of boundary strength levels, for example. Your suggestion to train a model on a hand-labelled subset of the data and use that to label the rest of the database is excellent: this is indeed what people do. But there remains the “predicting from text” problem at synthesis time.
One simpler approach is to think just about prominences and boundaries.
Perhaps a more promising approach these days is to label the database with more sophisticated linguistic information than plain old POS tags, such as shallow syntactic and semantic structure.
We certainly do need to train models of full and reduced vowels before we can use forced alignment to choose which one is the most likely label to place on a particular segment in the database.
We’ll look at how to train those models in the lecture.
It’s certainly the case in unit selection that there are many versions that will sound as good as the one chosen via the target and join costs. Actually, there will very probably be many that sound better, but were not the lowest cost sequence in the search (why is that?).
It’s easy in principle to generate an n-best list during a Viterbi search (although this is not implemented in Festival).
Here’s an idea for how you might generate variants from your own unit selection voice without modifying any code:
- Synthesise the sentence, and examine the utterance structure to see which prompts from the database were used
- Remove one or more (maybe all) of those prompts from utts.data (a sketch of this step follows the list)
- Restart Festival
- Synthesise the sentence again: different units will be chosen
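For the second step, here is a minimal Python sketch of filtering utts.data, assuming you have already collected the IDs of the prompts that were used; the file names, and the assumption that each line of utts.data has the form ( utterance_id "prompt text" ), are mine.

# hypothetical IDs of the prompts used for this sentence,
# found by examining the utterance structure
used = {"arctic_a0123", "arctic_b0456"}

with open("utts.data") as infile, open("utts.data.reduced", "w") as outfile:
    for line in infile:
        fields = line.split()
        # each line looks like: ( arctic_a0123 "some prompt text" )
        utt_id = fields[1] if len(fields) > 1 else ""
        if utt_id not in used:
            outfile.write(line)

Then point the voice at utts.data.reduced (or back up utts.data and overwrite it), restart Festival, and synthesise the sentence again.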
Sure – that would be fine.
In general, I don’t think people use Natural Language Generation (NLG) for this, mainly because NLG systems are typically limited domain, and so will only generate a closed set of sentences (or at least, from a closed vocabulary).
The vast majority of missing diphones will be cross-word (why is that?). So, all you would really need to do is find word pairs that contain the required diphone. However, you would want these to occur in a reasonably natural sentence, so that they can be used in the same way as the other prompts (i.e., recorded and used in their entirety).
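As a toy illustration of the cross-word point, here is a sketch that finds word pairs whose junction produces a required diphone; the miniature lexicon is purely hypothetical, and in practice you would use your full pronunciation lexicon and then look for (or write) natural sentences containing those pairs.

# toy pronunciation lexicon: word -> phone sequence (purely illustrative)
lexicon = {
    "big":    ["b", "ih", "g"],
    "yellow": ["y", "eh", "l", "ow"],
    "zebra":  ["z", "iy", "b", "r", "ax"],
    "vogue":  ["v", "ow", "g"],
}

missing_diphone = ("g", "y")   # suppose the g-y diphone is missing from the database

# the cross-word diphone of a word pair is
# (last phone of the first word, first phone of the second word)
pairs = [
    (w1, w2)
    for w1, phones1 in lexicon.items()
    for w2, phones2 in lexicon.items()
    if w1 != w2 and (phones1[-1], phones2[0]) == missing_diphone
]
print(pairs)   # [('big', 'yellow'), ('vogue', 'yellow')]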
You might need to cut a string on a separator, keeping only some parts of it. There are lots of ways to do that. The built-in cut command is one way (you can also pass it a file, in which case it will perform the same operation on every line). The pipe “|” sends the output of one process to the input of the next.
$ # -c cuts using character positions
$ echo some_file.txt | cut -c6-9
file
$ # -d cuts using the delimiter you specify
$ echo some_file.txt | cut -d"_" -f1
some
$ # and -f specifies which field(s) you want to keep
$ echo some_file.txt | cut -d"_" -f2
file.txt
$ echo a_long_file_name.txt | cut -d"_" -f2-4
long_file_name.txt
I’ve clarified my response: removing the question sentences entirely seems to be preferable to keeping them but removing their question marks.
We’ll look at a more detailed example of greedy text selection in the lecture.
Your suggestion to normalise for the length of the sentence is a good idea, otherwise we might just select the longest sentences (because they contain more diphones than shorter sentences).
You make a good point about final total coverage: 100% might be impossible simply because there are no occurrences of certain very rare diphones in our large corpus. The ARCTIC corpus covers around 75-80% of all possible diphones. The initial large corpus contained at least one example of about 90% of all possible diphones (reducing to around 80% when discarding sentences that are not “nice”), so that would be a ceiling on the possible coverage that could ever be obtained.
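Here is a minimal sketch of greedy selection with that length normalisation, assuming each candidate sentence has already been converted to its diphone sequence; the data structures and stopping criterion are just illustrative.

def greedy_select(sentences, wanted_coverage=0.99):
    """sentences: list of (text, diphone_list) pairs; returns selected texts and coverage."""
    # the set of diphone types that occur anywhere in the corpus is the ceiling
    # on coverage: very rare diphones may simply never occur at all
    all_types = {d for _, diphones in sentences for d in diphones}
    covered, selected = set(), []
    remaining = list(sentences)
    while remaining and len(covered) < wanted_coverage * len(all_types):
        # score = number of *new* diphone types, normalised by sentence length,
        # so that we do not just keep picking the longest sentences
        def score(item):
            _, diphones = item
            return len(set(diphones) - covered) / len(diphones)
        best = max(remaining, key=score)
        if score(best) == 0:
            break                      # nothing left adds any new coverage
        selected.append(best[0])
        covered |= set(best[1])
        remaining.remove(best)
    return selected, len(covered) / len(all_types)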
A training algorithm is used to train a model on some data. Give me more context to your question and I’ll provide a more specific answer.
We’ll do a more detailed example in the lecture.
I agree that this is a somewhat strange design decision in the ARCTIC corpora. In the tech report, the authors don’t justify this decision, but I assume it is because questions are too sparse to attempt coverage of them, and because the features used in their text selection algorithm don’t capture the differences between statements and questions.
Your suggestion to remove sentences that are questions from the corpus entirely, rather than keep them without a question mark, seems sensible to me.
Your descriptions of IFF and ASF are correct. You are also right to say that the acoustic features in ASF are predicted from the same linguistic features used in IFF.
The key point to understand is that many different combinations of linguistic features can give rise to (nearly) the same acoustic features. So, sparsity might be less of a problem in the ASF case.
In other words, we don’t really need to find a candidate unit that has the same linguistic features as the target; we just need it to sound like it has the same linguistic features.
However, for an ASF target cost to work well, we need to
- predict the acoustic features accurately from the linguistic features
- measure distances in acoustic space in a way that correlates with perception
Neither of those is trivial. Your phrase “direct mappings” suggests these mappings are easy to learn: they are not.
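To make the shape of this concrete, here is a sketch of an ASF target cost: predict acoustic features for the target from its linguistic features, then measure a weighted distance to each candidate’s actual acoustic features. The particular features, weights and predictor are assumptions for illustration; point 1 is deliberately left unimplemented, because learning that mapping is the hard part, and point 2 is exactly the question of whether this distance correlates with perception.

import math

# hypothetical per-unit acoustic features: mean F0 (Hz), duration (s), energy
FEATURE_WEIGHTS = {"f0": 1.0, "duration": 2.0, "energy": 0.5}

def predict_acoustic_features(linguistic_features):
    # stand-in for a trained regression model (e.g. a regression tree or neural
    # network) mapping linguistic features to acoustic features
    raise NotImplementedError("this mapping has to be learned, and that is not easy")

def asf_target_cost(predicted, candidate):
    # weighted Euclidean distance between predicted and candidate acoustic features
    return math.sqrt(sum(
        w * (predicted[k] - candidate[k]) ** 2
        for k, w in FEATURE_WEIGHTS.items()
    ))

# usage sketch:
# predicted = predict_acoustic_features(target_linguistic_features)
# costs = [asf_target_cost(predicted, c) for c in candidate_acoustic_features]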
I think it’s one of those terms that linguists use so frequently, they forget to define it carefully. First we need to know what a phrase is. In the context of speech, we mean the prosodic phrase. This short sentence has a single prosodic phrase when spoken:
“The cat sat on the mat.”
and this one has two:
“The cat sat on the mat, and the dog ran round the tree.”
Phrase-final means the last word, syllable or phone in a prosodic phrase. It’s important because special things happen in phrase-final position: syllables become longer, and F0 often lowers (in statements), for example.