Diphones

Using our knowledge of speech production, we predict that joins made at mid-phone positions will sound better than joins made at phone boundaries.
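To make the idea concrete, here is a minimal sketch (not taken from the course materials) of how a phone sequence maps onto diphone units. Each diphone runs from the middle of one phone to the middle of the next, so every join falls mid-phone. The function name `phones_to_diphones`, the `sil` padding symbol and the `left-right` naming scheme are all illustrative assumptions.

```python
def phones_to_diphones(phones, silence="sil"):
    """Return the diphone names needed to say a phone sequence.

    The sequence is padded with silence at both ends (an assumption of
    this sketch) so that the first and last phones are fully covered.
    """
    padded = [silence] + list(phones) + [silence]
    # A diphone spans the second half of one phone and the first half of
    # the next, so it is named by a pair of adjacent phones.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(phones_to_diphones(["k", "ae", "t"]))
# ['sil-k', 'k-ae', 'ae-t', 't-sil']
```

Note that a sequence of N phones needs N+1 diphones once the silence padding is included.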

First, remind yourself about the architecture of a text-to-speech system: a front-end linguistic processor, followed by a waveform generator.

CUI 2024 video available

Pipeline architecture for TTS

Then watch the video (23 minutes 7 seconds).

Reading

Jurafsky & Martin – Section 8.4 – Diphone Waveform Synthesis

A simple way to generate a waveform is by concatenating speech units from a pre-recorded database. The database contains one recording of each required speech unit.
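As a rough illustration of that description, the sketch below looks up one stored recording per unit and joins the recordings end to end. It is a toy under stated assumptions, not the method from the reading: the function `concatenate_units`, the unit names and the sample values are all hypothetical, and waveforms are plain lists of numbers rather than real audio.

```python
def concatenate_units(unit_names, database):
    """Join the stored waveform of each unit, in order, into one waveform."""
    waveform = []
    for name in unit_names:
        if name not in database:
            # The database must contain every unit the utterance requires.
            raise KeyError(f"no recording for unit {name!r}")
        waveform.extend(database[name])
    return waveform


# Toy database: exactly one "recording" (a short list of samples) per unit.
toy_database = {
    "sil-k": [0.0, 0.1],
    "k-ae": [0.2, 0.3, 0.4],
    "ae-t": [0.3, 0.1],
    "t-sil": [0.0],
}

speech = concatenate_units(["sil-k", "k-ae", "ae-t", "t-sil"], toy_database)
print(len(speech))  # 8 samples in total
```

Plain concatenation like this leaves the duration and pitch of each unit exactly as recorded. A practical synthesiser would also need some signal processing to modify prosody and smooth the joins.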

Ladefoged (Elements) – Chapter 10 – Fourier analysis

An attempt to explain Fourier analysis. Although chapters 1-9 are great, I actually do not recommend chapter 10.

Ladefoged (Elements) – Chapter 11 – Digital filters and LPC analysis

A brave attempt to use 'long hand' to spell out how LPC analysis works, but not recommended reading.

Note: the readings above from Ladefoged may be useful for students with little or no mathematical background, although in general I don’t think they are very good. Use the forums to tell me what you think.