Forums › Speech Synthesis › Unit selection › Disfluencies and intelligibility
This topic has 3 replies, 2 voices, and was last updated 8 years, 7 months ago by Simon.
January 15, 2016 at 20:56 #2084
In the readings (so far) I haven’t come across much mention of disfluencies or filled pauses. My Amazon Echo, for example, now uses the word ‘Hmmmm…’ when it needs to ‘punt’ on an answer, as in: ‘Hmmmmm, I can’t seem to find what you are looking for.’ This seems like an obvious way to add more ‘human-ness’ to a TTS system – although not necessarily naturalness (if the fillers are inserted in inappropriate places in the speech waveform), and not necessarily intelligibility (if the disfluency is sufficiently distracting).

But this brings up an interesting point: maybe the stated goal of intelligibility is not as necessary as we assume, or can be thought of as ‘end-use’ dependent. After all, real human speech is often not perfectly intelligible. We often ask a speaker to repeat; even more often, we fill in what we didn’t hear using context; and sometimes we just don’t completely know what was said and leave it at that (either hoping to fill in our understanding from future speech, or accepting that we didn’t catch the whole meaning). So if the end use is reading an audiobook, giving turn-by-turn GPS directions, or a phone banking system, then intelligibility is paramount. But for a dialogue system, a character in an interactive game, a companion robot, or any type of interaction where perfect understanding of the underlying ‘signal’ is not strictly necessary, but where ‘human-ness’ is of high importance, a certain LACK of intelligibility might even be desirable.

So I guess I have two questions:
1) Are disfluencies and filled pauses something we can explore when we build our own voice? Can you point to some literature on that, or make some basic suggestions?
2) What do you make of this idea that intelligibility, in certain situations, might not be absolutely necessary, or even 100% desirable?
January 17, 2016 at 10:16 #2157
1) A simple way to do that would be to add the fillers (e.g., “Hmm”) as words in the dictionary. You can then make sure there are some example recordings of that word in your database. Try it and see if it works…
2) We’ll discuss intelligibility etc. in the lecture on evaluation, so please ask this question again at that point.
January 17, 2016 at 11:08 #2167
OK, let’s leave aside 2) for now (I have an entire catalog of evaluation questions, so I’ll wait until later in the course to ask those).
Focusing on disfluencies and what I am calling ‘human-ness’: yes, I see how it would be easy enough to add specific word-like fillers to your recordings, then add them to your dictionary, and then add them to the text you want to synthesize, just like any other word. Hmm, umm, uh, err and ahh all come to mind.
1) How about something that wouldn’t have a phonetic spelling and isn’t composed of diphones, like a sigh, or a deep intake of breath? Could we use a symbol in the text, like #, tell Festival to consider this a word, and then map this symbol directly to a particular wav file?
2) What if I wanted unexpectedness (in an attempt to increase ‘human-ness’)? Let’s say I don’t want to write ‘hmm’ and ‘uh’ in my text to be synthesized: I want the system to insert filled pauses for me, not at random, but based on some concept of where a human might actually insert them in natural speech. Could this be considered a kind of post-lexical rule, but with a probability function added to it? For example, a rule that says ‘when the word “well” is followed by the word “I”, insert between them one of the filler words “hmm”, “uh” and “err”, or nothing, but with different likelihoods, such that about 71% of the time nothing is inserted, about 17% of the time “hmm” is inserted, about 9% of the time “uh” is inserted, and about 3% of the time “err” is inserted.’
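As a sketch, the probabilistic rule described in 2) amounts to a weighted random choice at a trigger context. The function name, the "well … I" trigger, and the 71/17/9/3 split below are just the hypothetical figures from the post, not anything implemented in Festival:

```python
import random

# Hypothetical filler-insertion rule from the example above: between
# "well" and "I", insert "hmm", "uh", "err", or nothing, with the
# stated probabilities. All names and numbers are illustrative only.
FILLERS = [None, "hmm", "uh", "err"]
WEIGHTS = [0.71, 0.17, 0.09, 0.03]

def insert_fillers(words, rng=random):
    """Return a copy of `words` with fillers probabilistically inserted."""
    out = []
    for i, w in enumerate(words):
        out.append(w)
        if w.lower() == "well" and i + 1 < len(words) and words[i + 1] == "I":
            filler = rng.choices(FILLERS, weights=WEIGHTS, k=1)[0]
            if filler is not None:
                out.append(filler)
    return out
```

A real system would presumably condition on richer context than a two-word pattern (part of speech, phrase boundaries, speaking style), but the mechanism – a post-lexical rewrite with a probability distribution over insertions – would look much like this.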
January 17, 2016 at 11:29 #2169
1) Theoretically there is no problem at all doing that, but it is not implemented in Festival. If you wanted to evaluate this kind of thing, you might manually edit the synthetic speech to insert those effects – that would be a perfectly acceptable experimental technique.
2) That’s part of the PhD topic of Rasmus Dall.
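Manually splicing a pre-recorded effect (a sigh, an intake of breath) into a synthesised waveform, as suggested in 1), needs nothing more than Python’s standard wave module. This is only a sketch: it assumes both files share the same channel count, sample width, and sample rate, and it cuts at a raw frame boundary with no crossfading, so an audible click at the join is possible:

```python
import wave

def splice_wav(main_path, insert_path, out_path, insert_at_sec):
    """Insert the audio of insert_path into main_path at a time offset (seconds)."""
    with wave.open(main_path, "rb") as m:
        params = m.getparams()
        main_frames = m.readframes(m.getnframes())
    with wave.open(insert_path, "rb") as s:
        # Channel count, sample width, and rate must match for a raw splice.
        assert s.getparams()[:3] == params[:3], "audio formats must match"
        insert_frames = s.readframes(s.getnframes())
    frame_size = params.nchannels * params.sampwidth
    cut = int(insert_at_sec * params.framerate) * frame_size
    with wave.open(out_path, "wb") as out:
        out.setparams(params)  # nframes is corrected automatically on close
        out.writeframes(main_frames[:cut] + insert_frames + main_frames[cut:])
```

For an actual listening test you would want to splice at a pause and perhaps apply a short fade at each edge, but for quick experimentation this kind of cut-and-insert is enough.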