reading

Clark et al: Multisyn: Open-domain unit selection for the Festival speech synthesis system

A description of the implementation and evaluation of Festival’s unit selection engine, called Multisyn.

Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech

Written for a non-technical audience, this gently introduces some key concepts in speech signal processing.

Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech (Sections 1-5)

Written for a non-technical audience, this gently introduces some key concepts in speech signal processing. Read sections 1-5 (up to and including ‘Fourier Analysis’).

Handbook of phonetic sciences – Ch 20 – Intro to Signal Processing for Speech (Sections 6-7)

Written for a non-technical audience, this gently introduces some key concepts in speech signal processing. Read sections 6-7.

King et al: Speech synthesis using non-uniform units in the Verbmobil project

Of purely historical interest, this is an example of a system using a heterogeneous unit type inventory, developed shortly before Hunt & Black published their influential paper.

King: Measuring a decade of progress in Text-to-Speech

A distillation of the key findings of the first 10 years of the Blizzard Challenge.

Mayo et al: Multidimensional scaling of listener responses to synthetic speech

Multi-dimensional scaling is a way to uncover the different perceptual dimensions that listeners use, when rating synthetic speech.

Norrenbrock et al: Quality prediction of synthesised speech…

Although standard speech quality measures such as PESQ do not work well for synthetic speech, specially constructed methods do work to some extent.

Pollet & Breen: Synthesis by Generation and Concatenation of Multiform Segments

Another way to combine waveform concatenation and SPSS is to alternate between waveform fragments and vocoder-generated waveforms.

Qian et al: A Unified Trajectory Tiling Approach to High Quality Speech Rendering

The term “trajectory tiling” means that trajectories from a statistical model (HMMs in this case) are not input to a vocoder, but are “covered over” or “tiled” with waveform fragments.

Ren et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

FastSpeech 2 improves over FastSpeech by not requiring a complicated teacher-student training regime, but instead being trained directly on the data. It is very similar to FastPitch 2, which was released around the same by different authors.

Tachibana et al. Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

DCTTS is comparable to Tacotron, but is faster because it uses non-recurrent architectures for the encoder and decoder.