Forum Replies Created
Sebastian Andersson, Kallirroi Georgila, David Traum, Matthew Aylett, and Robert Clark. Prediction and realisation of conversational characteristics by utilising spontaneous speech for unit selection. In Proc. Speech Prosody, Chicago, USA, May 2010.
Sebastian Andersson, Junichi Yamagishi, and Robert A.J. Clark. Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis. Speech Communication, 54(2):175-188, 2012. DOI: 10.1016/j.specom.2011.08.001
Using spontaneous speech as the basis for a speech synthesiser is an attractive idea, but is rather hard in practice, for several reasons. Here are some of them:
Word-level transcription: spontaneous speech is harder to transcribe, even at the word level, than read speech, because it is not entirely made up of words (as found in a lexicon); ASR could be tried, as could hand transcription, but both would have difficulty with this. Remember that commercial ASR is designed for careful, planned speech such as dictation, and will not work very well for unplanned speech.
Phonetic transcription: even harder than word-level transcription, because the pronunciations deviate considerably from those found in the lexicon (due to co-articulation, assimilation, deletion,…)
Phonetic alignment: the idea that speech is a linear string of phones (“beads on a string”) was never quite true even for read speech, but is even more problematic for spontaneous speech.
Here’s an experiment to try:
- record a spontaneous utterance
- transcribe the words
- record a read-text version of that
- compare the spontaneous and read-text versions side by side
- listen
- examine waveforms and spectrograms (a small plotting sketch follows this list)
- try to hand-label word and phone boundaries
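If you want a quick way to do the visual comparison, here is a minimal Python sketch using scipy and matplotlib; the filenames spontaneous.wav and read.wav are just placeholders for your own (mono) recordings.

# Sketch: plot waveform and spectrogram of the spontaneous and read
# versions side by side; filenames are placeholders for your own mono recordings.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

fig, axes = plt.subplots(2, 2, figsize=(12, 6))
for col, name in enumerate(["spontaneous.wav", "read.wav"]):
    rate, samples = wavfile.read(name)
    t = np.arange(len(samples)) / rate
    axes[0, col].plot(t, samples)            # waveform
    axes[0, col].set_title(name)
    axes[1, col].specgram(samples, Fs=rate)  # spectrogram
    axes[1, col].set_xlabel("time (s)")
plt.tight_layout()
plt.show()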
That function is calculating the midpoints, yes. The code you’re showing is used for stripping the join cost coefficients during voice building, but it’s performing the same calculation that is done during synthesis.
In lectures, we did indeed gloss over a couple of special cases:
Diphthongs: the 50% point is a poor choice, since the spectrum may be changing rapidly there, so we make the join 75% of the way through the segment, where the spectrum is generally a little more stable.
Stops: the end of the closure (stored in cl_end) will have been found during forced alignment (how?) and so we use that as the join point; picking the 50% point in a stop (=closure+burst) might sometimes be before the burst, and other times in the middle of the burst, so would be a bad place to make a join (e.g., we might end up with two bursts in the synthetic speech).
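As a rough sketch of that logic in Python (this is not Festival’s actual code, and the phone sets below are placeholders), assuming each segment gives us its start and end times and, for stops, a cl_end value:

# Rough sketch of the join-point logic described above (not Festival's code).
# Times are in seconds; DIPHTHONGS and STOPS are placeholder phone sets.
DIPHTHONGS = {"ai", "au", "oi", "ei", "ou"}
STOPS = {"p", "t", "k", "b", "d", "g"}

def join_point(phone, start, end, cl_end=None):
    """Return the time at which to place the join inside this phone."""
    if phone in STOPS and cl_end is not None:
        return cl_end                         # end of closure, found by forced alignment
    if phone in DIPHTHONGS:
        return start + 0.75 * (end - start)   # 75% point: spectrum a little more stable
    return start + 0.5 * (end - start)        # default: the midpoint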
Diphone boundaries are generally just the midpoint between phone boundaries. So, there is no need to store this information in the .utt files because it’s very fast to compute on the fly (e.g., as the file is loaded).
Likewise, it’s easy to construct an index of all available diphones on the fly, as the .utt files are loaded, and store it in memory.
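For example, something along these lines would do it; the (phone, start, end) segment format and the way utterances are loaded are just assumptions for illustration:

# Sketch: build an in-memory diphone index as the .utt files are loaded.
# Assumes each utterance has already been turned into a list of
# (phone, start, end) tuples; the loader itself is not shown.
from collections import defaultdict

def index_diphones(utterances):
    index = defaultdict(list)   # "a-b" -> list of candidate diphone tokens
    for utt_id, segments in utterances.items():
        for (p1, s1, e1), (p2, s2, e2) in zip(segments, segments[1:]):
            # diphone boundaries: the midpoint of each phone, computed on the fly
            start = s1 + 0.5 * (e1 - s1)
            end = s2 + 0.5 * (e2 - s2)
            index[p1 + "-" + p2].append((utt_id, start, end))
    return index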
We’ll look at this in detail in the lecture.
Yes, this would be pretty straightforward to do. As you say, you could treat it as a special kind of word (presumably pronounced as a special new phone).
In fact, you might find that – at least in unit selection – you will get these in-breaths ‘for free’ because phrase-initial silence diphones will be chosen to synthesise silence in phrase-initial positions.
Building synthetic voices from spontaneous speech is an area of active research.
Although we might be able to gather a lot of spontaneous speech, one barrier is that we then have to manually transcribe it. The second barrier is that it is hard to align the phonetic sequence with the speech; this is for many of the same reasons that Automatic Speech Recognition of such speech is hard (you list some of them: disfluencies, co-articulations, deletions,…).
The hypothesised advantage of using spontaneous speech, over read text, is that the voice would sound more natural.
You put your finger on the core theoretical problem though: without a good model of the variation in spontaneous speech (including a predictive model of that variation given only text input), it is indeed just unwanted noise in the database.
If we want a voice that speaks in a single speaking style (or emotion), then we can simply record data in that style and build the voice as usual. That will work very well, but will not scale to producing many different styles / emotions / etc.
Can you each try to specify more precisely what you mean by ‘expression’ before I continue my answer? Is it a property of whole utterances, or parts of utterances, for example?
In many systems, I think a mixture of explicit and implicit labels is used.
Read what Taylor has to say about intuitive versus analytic labelling (section 17.1.4 to 17.1.6).
When you mention hand-labelling POS tags, I think a better idea is to hand-label some training data, then train a POS tagger, and use that to label your actual database. This is what is done in practice.
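As a toy illustration of that workflow (a most-frequent-tag baseline stands in for a real HMM or neural tagger, and the data format and tags are made up):

# Toy illustration: train a tagger on a small hand-labelled set,
# then use it to tag the rest of the database.
from collections import Counter, defaultdict

def train_tagger(hand_labelled):
    """hand_labelled: list of sentences, each a list of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for sentence in hand_labelled:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    # most-frequent-tag model; a real tagger would also use context
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, words, default="nn"):
    return [(w, model.get(w.lower(), default)) for w in words]

model = train_tagger([[("the", "dt"), ("cat", "nn"), ("sat", "vbd")]])
print(tag(model, ["the", "dog", "sat"]))   # unseen words fall back to the default tag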
Can you suggest which explicit labels you would like to place on the database?
Labelling prosody on the database is one of those topics that has a long history of research, but no really good solutions: it’s a very hard problem.
We are fairly sure that highly-accurate ToBI labels are helpful, provided that we have them for both the database utterances and test sentences. So, even if we hand-label the database, we still have the hard problem of accurately predicting from text at synthesis time, in a way that is consistent with the database labels.
Yes, many people have looked at simpler systems than ToBI. Festival reduces the number of boundary strength levels, for example. Your suggestion to train a model on a hand-labelled subset of data and use that to label the rest of the database is excellent: this is indeed what people do. But there remains the “predicting from text” problem at synthesis time.
One simpler approach is to think just about prominences and boundaries.
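Purely as an illustration of where you might start with that (the tag set below is a placeholder, and real systems do considerably better): treat content words as prominent, and punctuation as marking a phrase boundary.

# Crude baseline: content words are prominent, punctuation marks a boundary.
FUNCTION_TAGS = {"dt", "in", "cc", "to", "md", "prp"}   # placeholder tag set

def predict_prosody(tagged_words):
    """tagged_words: list of (word, pos) pairs; returns per-word labels."""
    labels = []
    for word, pos in tagged_words:
        prominent = word.isalpha() and pos.lower() not in FUNCTION_TAGS
        boundary = word in {".", ",", "?", "!", ";"}
        labels.append({"word": word, "prominent": prominent, "boundary": boundary})
    return labels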
Perhaps a more promising approach these days is to label the database with more sophisticated linguistic information than plain old POS tags, such as shallow syntactic and semantic structure.
We certainly do need to train models of full and reduced vowels before we can use forced alignment to choose which one is the most likely label to place on a particular segment in the database.
We’ll look at how to train those models in the lecture.
It’s certainly the case in unit selection that there are many versions that will sound as good as the one chosen via the target and join costs. Actually, there will very probably be many that sound better, but were not the lowest cost sequence in the search (why is that?).
It’s easy in principle to generate an n-best list during a Viterbi search (although this is not implemented in Festival).
Here’s an idea for how you might generate variants from your own unit selection voice without modifying any code:
- Synthesise the sentence, and examine the utterance structure to see which prompts from the database were used
- Remove one or more (maybe all) of those prompts from utts.data
- Restart Festival
- Synthesise the sentence again: different units will be chosen
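Step 2 can be scripted; here is a minimal sketch assuming the usual utts.data layout of ( prompt_id "prompt text" ), with some example prompt IDs standing in for the ones you read off the utterance structure:

# Sketch for step 2: write a filtered copy of utts.data, dropping the prompts
# whose units were used for the test sentence.
# Assumes the usual layout: ( prompt_id "prompt text" )
used = {"arctic_a0012", "arctic_b0339"}   # example IDs, read off the utterance structure

with open("utts.data") as src, open("utts.data.filtered", "w") as dst:
    for line in src:
        parts = line.split()
        prompt_id = parts[1] if len(parts) > 1 and parts[0] == "(" else None
        if prompt_id not in used:
            dst.write(line)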
Sure – that would be fine.
In general, I don’t think people use Natural Language Generation (NLG) for this, mainly because NLG systems are typically limited domain, and so will only generate a closed set of sentences (or at least, from a closed vocabulary).
The vast majority of missing diphones will be cross-word (why is that?). So, all you would really need to do is find word pairs that contain the required diphone. However, you would want these to occur in a reasonably natural sentence, so that they can be used in the same way as the other prompts (i.e., recorded and used in their entirety).
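As a sketch of how you might search for candidate word pairs, assuming you have a pronunciation lexicon mapping each word to its phone sequence (the lexicon, phone names and word list below are just placeholders):

# Sketch: find word pairs whose cross-word junction gives a missing diphone.
# The lexicon is a placeholder dict of word -> phone sequence.
lexicon = {
    "catch": ["k", "ae", "ch"],
    "zebras": ["z", "iy", "b", "r", "ax", "z"],
}

def word_pairs_with_diphone(missing, words, lexicon):
    """missing: a diphone such as ("ch", "z"); returns candidate word pairs."""
    pairs = []
    for w1 in words:
        for w2 in words:
            if (lexicon[w1][-1], lexicon[w2][0]) == missing:
                pairs.append((w1, w2))
    return pairs

print(word_pairs_with_diphone(("ch", "z"), ["catch", "zebras"], lexicon))
# -> [('catch', 'zebras')]; now write a natural sentence containing that pair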
You might need to cut a string on a separator, keeping only some parts of it. There are lots of ways to do that. The built-in cut command is one way (and you can pass it files too, in which case it will perform the same operation on every line). The pipe “|” sends the output of one process to the input of the next.
$ # -c cuts using character positions
$ echo some_file.txt | cut -c6-9
file
$ # -d cuts using the delimiter you specify
$ echo some_file.txt | cut -d"_" -f1
some
$ # and -f specifies which field(s) you want to keep
$ echo some_file.txt | cut -d"_" -f2
file.txt
$ echo a_long_file_name.txt | cut -d"_" -f2-4
long_file_name.txt
I’ve clarified my response: removing the question sentences entirely seems to be preferable to keeping them but removing their question marks.