Using our knowledge of speech production, we predict that joins made at mid-phone positions will sound better than joins made at phone boundaries.
First, remind yourself of the architecture of a text-to-speech system (a front-end linguistic processor followed by a waveform generator), and then watch the video.
23 minutes 7 seconds
Reading
Jurafsky & Martin – Section 8.4 – Diphone Waveform Synthesis
A simple way to generate a waveform is by concatenating speech units from a pre-recorded database. The database contains one recording of each required speech unit.
Ladefoged (Elements) – Chapter 10 – Fourier analysis
An attempt to explain Fourier analysis. Although chapters 1-9 are great, I actually do not recommend chapter 10.
Ladefoged (Elements) – Chapter 11 – Digital filters and LPC analysis
A brave attempt to spell out in longhand how LPC analysis works, but not recommended reading.
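To make the idea of concatenating units with mid-phone joins concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular synthesiser does it. The function names, the "sil" padding symbol and the toy database are all hypothetical, and the "recordings" are short lists of made-up sample values rather than real speech.

```python
def phones_to_diphones(phones):
    """Turn a phone sequence into diphone names.

    Each diphone runs from the middle of one phone to the middle of the
    next, so every join between units falls at a mid-phone position.
    Padding with "sil" (silence) at both ends is an assumption made here.
    """
    padded = ["sil"] + list(phones) + ["sil"]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def concatenate(diphones, database):
    """Join the single stored recording of each diphone, end to end."""
    waveform = []
    for name in diphones:
        # Raises KeyError if a required unit is missing from the database.
        waveform.extend(database[name])
    return waveform


# Toy database: one "recording" per diphone (hypothetical sample values).
toy_database = {
    "sil-k": [0, 1],
    "k-ae": [2, 3, 4],
    "ae-t": [5, 6],
    "t-sil": [7, 0],
}

diphones = phones_to_diphones(["k", "ae", "t"])
waveform = concatenate(diphones, toy_database)
print(diphones)
print(waveform)
```

A real system would also need to smooth each join and modify pitch and duration; this sketch only shows the lookup-and-concatenate step.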