First, remind yourself about the architecture of a text-to-speech system: a front-end linguistic processor, followed by a waveform generator.
The first method we’ll use for waveform generation is Time-Domain Pitch Synchronous Overlap and Add (TD-PSOLA), so let’s see first how that works in practice.
If we wish to modify the spectral envelope, then we need a more powerful technique than TD-PSOLA, and so we’ll then look at linear prediction, which can manipulate source and filter separately.