Bennett: Large Scale Evaluation of Corpus-based Synthesisers

An analysis of the first Blizzard Challenge, which is an evaluation of speech synthesisers using a common database.

Hunt & Black: Unit selection in a concatenative speech synthesis system using a large speech database

The classic description of unit selection, formulated as a search through a network.
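
As a rough illustration of what "a search through a network" means here, the sketch below runs a Viterbi-style dynamic programming search over a lattice of candidate units, minimising the sum of target costs and join costs. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `select_units`, the two cost callables and the numeric toy data in the usage note are all hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")  # type of a target specification (hypothetical)
U = TypeVar("U")  # type of a candidate unit (hypothetical)


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    join_cost: Callable[[U, U], float],
) -> List[U]:
    """Return the candidate sequence with the lowest total cost."""
    if not targets or len(targets) != len(candidates):
        raise ValueError("need exactly one candidate list per target")
    if any(len(c) == 0 for c in candidates):
        raise ValueError("every target needs at least one candidate")

    # best[i][j]: lowest cost of any path ending at candidate j of position i
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    # back[i][j]: index of the predecessor on that lowest-cost path
    back = [[-1] * len(candidates[0])]

    for i in range(1, len(targets)):
        prev = candidates[i - 1]
        row_cost, row_back = [], []
        for u in candidates[i]:
            # Cheapest predecessor, including the cost of joining it to u
            k = min(
                range(len(prev)),
                key=lambda p: best[i - 1][p] + join_cost(prev[p], u),
            )
            total = best[i - 1][k] + join_cost(prev[k], u)
            row_cost.append(total + target_cost(targets[i], u))
            row_back.append(k)
        best.append(row_cost)
        back.append(row_back)

    # Trace back from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda p: best[-1][p])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path
```

With toy numeric "units", a call such as `select_units([1.0, 2.0, 3.0], [[0.9, 5.0], [2.5, 1.0], [3.0, 2.0]], lambda t, u: abs(t - u), lambda a, b: 10 * abs(b - a))` shows how a large join cost can make the search prefer units that join smoothly over units that match each target best in isolation.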

King: A beginners’ guide to statistical parametric speech synthesis

A deliberately gentle, non-technical introduction to the topic. Every item in the small and carefully-chosen bibliography is worth following up.

Kominek & Black: CMU ARCTIC databases for speech synthesis

Widely used, copyright-free speech databases for speech synthesis.

Ling et al.: Deep Learning for Acoustic Modeling in Parametric Speech Generation

A key review article on deep learning for acoustic modelling in parametric speech generation.

Łańcucki: FastPitch: Parallel Text-to-speech with Pitch Prediction

Very similar to FastSpeech 2, FastPitch has the advantage of an official open-source implementation by the author (at NVIDIA).

Shen et al.: Natural TTS Synthesis By Conditioning WaveNet On Mel Spectrogram Predictions

Tacotron 2 was one of the most successful sequence-to-sequence models for text-to-speech of its time and inspired many subsequent models.

Talkin: A Robust Algorithm for Pitch Tracking (RAPT)

The classic algorithm for estimating F0 from speech signals.

Taylor – Chapter 16 – Unit-selection synthesis

A substantial chapter covering target cost, join cost and search.

Taylor – Chapter 3 – The text-to-speech problem

Discusses the differences between spoken and written forms of language, and describes the structure of a typical TTS system.

Taylor – Section 17.1 – Databases

Covers databases for synthesis, including the important issue of labelling the data.

Taylor – Section 17.2 – Evaluation

Testing of the system by the developers, as well as via listening tests.