Watts et al. Where do the improvements come from in sequence-to-sequence neural TTS?

A systematic investigation of the benefits of moving from frame-by-frame models to sequence-to-sequence models.

Wang et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

This paper introduces the VALL-E model, which frames speech synthesis as a language modelling task in which a sequence of audio codec codes is generated conditionally, given a preceding sequence of text (and a speech prompt).
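
A minimal sketch of this framing, under simplifying assumptions (it is not the authors' code): text tokens and audio codec codes are merged into a single token sequence and a decoder-only Transformer is trained with ordinary next-token prediction. The vocabulary sizes, model dimensions and toy data below are illustrative; VALL-E additionally uses a non-autoregressive model for the residual codec quantisers, which is omitted here.

```python
import torch
import torch.nn as nn

class CodecLM(nn.Module):
    def __init__(self, n_text=256, n_codes=1024, d_model=256,
                 n_layers=4, n_heads=4, max_len=2048):
        super().__init__()
        self.vocab = n_text + n_codes          # shared vocabulary: text ids first, then code ids
        self.embed = nn.Embedding(self.vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, self.vocab)

    def forward(self, tokens):                 # tokens: (batch, T) integer ids
        T = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        # causal mask: each position may only attend to earlier positions
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=tokens.device), 1)
        return self.head(self.backbone(x, mask=mask))

# toy training step: text token ids followed by codec code ids (offset past the text ids)
text = torch.randint(0, 256, (1, 20))
codes = torch.randint(256, 256 + 1024, (1, 80))
seq = torch.cat([text, codes], dim=1)

model = CodecLM()
logits = model(seq[:, :-1])                    # predict the next token at every position
loss = nn.functional.cross_entropy(logits.reshape(-1, model.vocab),
                                   seq[:, 1:].reshape(-1))
```

At inference time, the text (plus the codec-encoded speech prompt) would form the prefix, codec codes would be sampled one at a time, and the neural codec's decoder would turn them back into a waveform.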

Tachibana et al. Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

DCTTS is comparable to Tacotron in quality, but is much faster to train because it uses non-recurrent (fully convolutional) architectures for the encoder and decoder, together with a guided attention loss that speeds up alignment learning.
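
The guided attention loss from the title is simple enough to show in a few lines; this is a sketch under assumed tensor shapes, not the authors' implementation. Attention weight far from the diagonal of the text-to-frame alignment is penalised, encouraging a roughly monotonic alignment early in training. The value g = 0.2 follows the paper; the toy attention matrix is illustrative.

```python
import torch

def guided_attention_loss(attn, g=0.2):
    """attn: (batch, N_text, T_frames) attention matrix from the decoder."""
    _, N, T = attn.shape
    n = torch.arange(N).float().unsqueeze(1) / N        # (N, 1) normalised text position
    t = torch.arange(T).float().unsqueeze(0) / T        # (1, T) normalised frame position
    w = 1.0 - torch.exp(-((n - t) ** 2) / (2 * g ** 2)) # penalty grows away from the diagonal
    return (attn * w.unsqueeze(0)).mean()

# usage: add this term to the spectrogram reconstruction loss during training
attn = torch.softmax(torch.randn(2, 50, 200), dim=1)    # toy attention over 50 text tokens
loss = guided_attention_loss(attn)
```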

Shen et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Tacotron 2 was one of the most successful sequence-to-sequence models for text-to-speech of its time and inspired many subsequent models.

Ren et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

FastSpeech 2 improves over FastSpeech by not requiring a complicated teacher-student training regime; instead, it is trained directly on the data. It is very similar to FastPitch, which was released around the same time by different authors.
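
A rough sketch of what "trained directly on the data" means in practice, with assumed shapes and hypothetical toy targets (not the authors' code): small predictors for per-phoneme duration and pitch (energy is handled analogously) are supervised with ground-truth values extracted from the training data, for example from forced alignment and a pitch tracker, so no teacher model or distilled spectrogram targets are needed.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Small conv stack predicting one scalar (duration, pitch or energy) per phoneme."""
    def __init__(self, d_model=256, hidden=256, kernel=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model, hidden, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2), nn.ReLU(),
        )
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                          # x: (batch, time, d_model)
        h = self.net(x.transpose(1, 2)).transpose(1, 2)
        return self.out(h).squeeze(-1)             # (batch, time)

# toy training step: losses against ground-truth targets, no teacher model involved
enc = torch.randn(2, 30, 256)                      # encoder outputs for 30 phonemes
gt_duration = torch.rand(2, 30) * 10               # frames per phoneme, e.g. from forced alignment
gt_pitch = torch.randn(2, 30)                      # e.g. pitch-tracker output averaged per phoneme

dur_pred, pitch_pred = VariancePredictor(), VariancePredictor()
loss = (nn.functional.mse_loss(dur_pred(enc), torch.log1p(gt_duration))
        + nn.functional.mse_loss(pitch_pred(enc), gt_pitch))
```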

Łańcucki. FastPitch: Parallel Text-to-speech with Pitch Prediction

Very similar to FastSpeech 2, FastPitch has the advantage of an official open-source implementation by the author (at NVIDIA).