Finish

That’s the end of the Text-To-Speech part of the course. The last video of this module was a pointer forward into the Automatic Speech Recognition part of the course. It made it clear that all of our knowledge about speech signals, and in particular about separating the source and filter, will continue to be very useful.

What you should know

Note: in the videos, Simon says that we don’t cover unit selection in this course. That is true of the videos, but unit selection is covered in the lectures, readings and assignment.

Diphone: Why use diphones? How does this relate to coarticulation? What goes into a diphone database?
Waveform concatenation, Overlap-add, Pitch period:
- What are the potential issues when concatenating waveforms, i.e. when do we get ‘glitches’ and ‘pops’?
- Why are discontinuities at joins a problem?
- How do overlap-add and pitch-synchronous concatenation help?
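
To make the overlap-add idea concrete, here is a minimal Python sketch (not the course's implementation). The function names `fade_ramps` and `overlap_add_join` and the toy waveforms are assumptions made for illustration. It cross-fades the end of one unit into the start of the next, so the abrupt jump of a hard cut is spread over several samples. It does not do pitch-synchronous alignment: with real speech, the overlap region would also be lined up on pitch marks so that the pitch periods of the two units coincide.

```python
import math

def fade_ramps(n):
    """Return (fade_out, fade_in) ramps of length n that sum to 1 at every sample."""
    fade_in = [0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / n) for i in range(n)]
    fade_out = [1.0 - w for w in fade_in]
    return fade_out, fade_in

def overlap_add_join(left, right, overlap):
    """Join two waveforms by cross-fading the last `overlap` samples of `left`
    with the first `overlap` samples of `right` (overlap must be > 0)."""
    fade_out, fade_in = fade_ramps(overlap)
    start = len(left) - overlap
    mixed = [left[start + i] * fade_out[i] + right[i] * fade_in[i]
             for i in range(overlap)]
    return left[:start] + mixed + right[overlap:]

# Toy units with a mismatch at the join: a hard cut would jump from 1.0 to -1.0.
left = [1.0] * 8
right = [-1.0] * 8

hard_cut = left + right
joined = overlap_add_join(left, right, overlap=4)

def largest_step(x):
    """Largest sample-to-sample change, a crude measure of discontinuity."""
    return max(abs(b - a) for a, b in zip(x, x[1:]))

print("hard cut:", largest_step(hard_cut))
print("overlap-add:", largest_step(joined))
```

The largest sample-to-sample step is much smaller for the overlap-add join than for the hard cut, which is why the join is less likely to be heard as a click.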
TD-PSOLA
- What can you manipulate with TD-PSOLA?
- How does TD-PSOLA increase/decrease F0?
- How does TD-PSOLA increase/decrease duration?
- How does this relate to impulse responses? i.e. why doesn’t it change the actual phone/spectral envelope?
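
The questions above can be explored with a toy Python sketch of the TD-PSOLA idea. It is a simplification under stated assumptions, not a full implementation: the pitch period is constant, the pitch marks are already known, and the names `td_psola`, `f0_scale` and `dur_scale` are hypothetical. Each pitch mark gives one windowed, two-period segment (roughly one impulse response of the filter). Placing the segments closer together or further apart changes F0, and repeating or skipping segments changes duration. The segments themselves are not stretched or squashed, which is why the spectral envelope is largely left alone.

```python
import math

def hann(n):
    """Symmetric Hann window of odd length n (peak of about 1.0 at the centre)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def td_psola(signal, marks, period, f0_scale=1.0, dur_scale=1.0):
    """Toy TD-PSOLA for a signal with a constant pitch period (in samples)."""
    win = hann(2 * period + 1)
    # Analysis: one windowed, two-period segment centred on each usable pitch mark.
    usable = [m for m in marks if m - period >= 0 and m + period < len(signal)]
    segments = [[signal[m - period + i] * win[i] for i in range(2 * period + 1)]
                for m in usable]
    # Synthesis: new pitch-mark spacing sets F0; output length sets duration.
    new_period = max(1, round(period / f0_scale))
    out = [0.0] * round(len(signal) * dur_scale)
    t = period
    while t + period < len(out):
        source_time = t / dur_scale  # where this point falls in the original
        k = min(range(len(usable)), key=lambda j: abs(usable[j] - source_time))
        for i, v in enumerate(segments[k]):
            out[t - period + i] += v  # overlap-add the chosen segment
        t += new_period
    return out

def peak_positions(x, threshold=0.9):
    """Indices of samples above threshold (the start of each pitch period here)."""
    return [i for i, v in enumerate(x) if v > threshold]

# Toy "voiced speech": a short decaying pulse repeated every 20 samples.
period = 20
pulse = [1.0, 0.6, 0.3]
signal = [0.0] * 200
marks = list(range(0, 200, period))
for m in marks:
    for i, v in enumerate(pulse):
        signal[m + i] += v

higher = td_psola(signal, marks, period, f0_scale=2.0)   # double F0
longer = td_psola(signal, marks, period, dur_scale=2.0)  # double duration

print(peak_positions(higher)[:4])
print(len(signal), len(longer), len(peak_positions(longer)))
```

With `f0_scale=2.0` the pulses end up 10 samples apart instead of 20, but each pulse keeps its shape. With `dur_scale=2.0` the spacing stays at 20 samples and segments are reused, so the output is longer at the same F0.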
Unit selection: Target and join costs (lecture and J&M 8.5) – we haven’t covered the Viterbi algorithm in Module 6, but it will come up again in the ASR modules for this course.
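
As an illustration of how target and join costs interact, here is a small Python sketch. The features (`stress`, `f0_start`, `f0_end`), the cost definitions and the candidate units are all invented for this example; real systems use richer features and weights. Because the Viterbi algorithm has not been covered yet, the sketch simply tries every candidate sequence, which only works for tiny examples.

```python
from itertools import product

def target_cost(target, candidate):
    """Number of target features that the candidate fails to match."""
    return sum(target[key] != candidate[key] for key in target)

def join_cost(left, right):
    """Mismatch at the join: absolute F0 difference in Hz, scaled by 1/10."""
    return abs(left["f0_end"] - right["f0_start"]) / 10.0

def total_cost(targets, units):
    """Sum of all target costs plus all join costs for one candidate sequence."""
    cost = sum(target_cost(t, u) for t, u in zip(targets, units))
    cost += sum(join_cost(a, b) for a, b in zip(units, units[1:]))
    return cost

# Target specification for two consecutive units (hypothetical feature: stress).
targets = [{"stress": 1}, {"stress": 0}]

# Candidate units from the database for each target position (made-up values).
candidates = [
    [{"id": "a1", "stress": 1, "f0_start": 120, "f0_end": 130},
     {"id": "a2", "stress": 0, "f0_start": 118, "f0_end": 100}],
    [{"id": "b1", "stress": 0, "f0_start": 180, "f0_end": 170},
     {"id": "b2", "stress": 1, "f0_start": 128, "f0_end": 125}],
]

# Exhaustive search over all sequences (Viterbi does this efficiently).
best = min(product(*candidates), key=lambda units: total_cost(targets, units))
best_ids = [u["id"] for u in best]
best_cost = total_cost(targets, best)
print(best_ids, best_cost)
```

In this toy data, the sequence with perfect target matches (`a1`, `b1`) has a large F0 jump at the join, so the search prefers `a1`, `b2`: a slightly worse target match that joins smoothly.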
Convolution: convolution in the time domain = multiplication in the frequency domain (see the application of filters in the frequency domain in Module 4, e.g. low-, band- and high-pass filters). You should aim to understand this at a conceptual level.
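
For anyone who wants to check the convolution statement numerically, here is a short Python sketch using only the standard library. The signals `x` and `h` are arbitrary toy values and the helper names are assumptions. It convolves the two sequences in the time domain, then compares the DFT of the result with the product of the two individual DFTs (zero-padded to the length of the output, so that the comparison is with ordinary rather than circular convolution).

```python
import cmath

def convolve(x, h):
    """Direct time-domain convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def dft(x, n):
    """Naive n-point DFT of x, zero-padded to length n."""
    padded = list(x) + [0.0] * (n - len(x))
    return [sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, v in enumerate(padded))
            for k in range(n)]

x = [1.0, 2.0, 0.0, -1.0]  # toy "source" signal
h = [0.5, 0.5]             # toy filter impulse response (a two-point average)

y = convolve(x, h)
n = len(y)

spectrum_of_output = dft(y, n)
product_of_spectra = [a * b for a, b in zip(dft(x, n), dft(h, n))]

max_error = max(abs(a - b) for a, b in zip(spectrum_of_output, product_of_spectra))
print(y)
print(max_error)
```

The two spectra agree up to floating-point rounding, which is the numerical counterpart of "filtering in the time domain is multiplying spectra in the frequency domain".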
Connected speech/citation speech:
- Identify examples of connected speech processes (assimilation, lenition, deletion, vowel reduction), as discussed in the lectures and videos in relation to rules that could help us generate correct pronunciations.

Key Terms

diphone, diphone database
concatenation
concatenative synthesis
waveform, waveform generation
diphone synthesis
unit selection
coarticulation
overlap-add
pitch period
TD-PSOLA
discontinuity
join, join cost
target, target cost
convolution
connected speech
assimilation
lenition
deletion
vowel reduction