Another way to combine waveform concatenation and SPSS is to alternate between waveform fragments and vocoder-generated waveforms.
Qian et al.: A Unified Trajectory Tiling Approach to High Quality Speech Rendering
The term “trajectory tiling” means that trajectories from a statistical model (HMMs in this case) are not input to a vocoder, but are “covered over” or “tiled” with waveform fragments.
Zen, Black & Tokuda: Statistical parametric speech synthesis
A review article that makes some useful connections between HMM-based speech synthesis and unit selection.
Taylor – Chapter 15 – Hidden-Markov-model synthesis
Written from a traditional “starting from automatic speech recognition” viewpoint, so you will need to make the connections for yourself to the more general concept of text-to-speech as a regression problem.
King: A beginners’ guide to statistical parametric speech synthesis
A deliberately gentle, non-technical introduction to the topic. Every item in the small, carefully chosen bibliography is worth following up.