Spontaneous Speech Transcription Strategy
- This topic has 4 replies, 2 voices, and was last updated 8 years, 10 months ago by Simon.
February 4, 2016 at 14:51 #2416
Would it be possible to record spontaneous speech, in the studio, without a script, and then use some form of ASR (commercial grade, therefore hopefully robust) to generate a post-facto script? There would be errors, of course, but hopefully relatively few, and these could be hand-fixed. This would give an accurate script of the spontaneous speech without needing to hand-transcribe the entire recording. The script would then be used to generate phone strings for forced alignment, etc.
Has this been done?
-
February 4, 2016 at 15:59 #2418
Using spontaneous speech as the basis for a speech synthesiser is an attractive idea, but is rather hard in practice, for several reasons. Here are some of them:
Word-level transcription: spontaneous speech is harder to transcribe than read speech, even at the word level, because it is not entirely made of words (as found in a lexicon). ASR could be tried, as could hand-transcription, but both would have difficulty with this. Remember that commercial ASR is designed for careful, planned speech such as dictation and will not work very well for unplanned speech.
Phonetic transcription: even harder than word-level transcription, because the pronunciations deviate considerably from those found in the lexicon (due to co-articulation, assimilation, deletion, …).
Phonetic alignment: the idea that speech is a linear string of phones (“beads on a string”) was never quite true even for read speech, but is even more problematic for spontaneous speech.
Here’s an experiment to try:
- record a spontaneous utterance
- transcribe the words
- record a read-text version of that
- compare the spontaneous and read-text versions side by side
- listen
- examine waveforms and spectrograms
- try to hand-label word and phone boundaries
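To put a number on how much hand-fixing an ASR first pass would need, one option is to hand-correct a short sample and compute the word error rate (WER) of the ASR output against it. The sketch below is only illustrative: the function name `word_error_rate` and the two example transcripts are made up, and it uses plain whitespace tokenisation with no text normalisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    reference:  the hand-corrected transcript (treated as ground truth)
    hypothesis: the raw ASR output for the same recording
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from ASR output
            insertion = curr[j - 1] + 1  # extra word in ASR output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Invented example: the ASR output drops a filler and two repetitions.
    hand_corrected = "um so i i was going to the the shop"
    asr_output = "so i was going to the shop"
    print(f"WER: {word_error_rate(hand_corrected, asr_output):.2f}")
```

A high WER on a small sample would suggest that correcting ASR output saves little over transcribing by hand. Note that fillers, repetitions and word fragments, which are exactly what makes the speech spontaneous, are the tokens an ASR system is most likely to drop.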
-
February 4, 2016 at 16:02 #2420
Sebastian Andersson, Kallirroi Georgila, David Traum, Matthew Aylett, and Robert Clark. Prediction and realisation of conversational characteristics by utilising spontaneous speech for unit selection. In Proc. Speech Prosody, Chicago, USA, May 2010.
Sebastian Andersson, Junichi Yamagishi, and Robert A.J. Clark. Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis. Speech Communication, 54(2):175-188, 2012. DOI: 10.1016/j.specom.2011.08.001
-
February 5, 2016 at 11:23 #2428
Do the audio examples from these two papers still exist somewhere? Can I listen to them?
-
February 7, 2016 at 10:37 #2497
There are some examples for the Speech Communication paper.