The front end

We need to process the input text, first to identify the words, then to decide how they should be said.

Tokenisation and normalisation
The first task is to convert all of the text into words, including items such as digits and abbreviations that are not written the way they are spoken.
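
As a rough illustration of what "converting text into words" involves, here is a minimal sketch in Python. The lookup tables and the function names `number_to_words` and `normalise` are made up for this example and are far too small for real use. A real front end also needs context to resolve ambiguous tokens (for instance, "St" as "street" or "saint"), which this sketch ignores.

```python
import re

# Hypothetical, deliberately tiny lookup tables, for illustration only.
ABBREVIATIONS = {"dr": "doctor", "mr": "mister", "st": "street"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..99 as a list of words."""
    if n < 10:
        return [ONES[n]]
    if n < 20:
        return [TEENS[n - 10]]
    words = [TENS[n // 10]]
    if n % 10:
        words.append(ONES[n % 10])
    return words


def normalise(text):
    """Tokenise text, then rewrite every token as one or more plain words."""
    # Tokenisation: runs of letters (with apostrophes) or runs of digits.
    tokens = re.findall(r"[A-Za-z']+|\d+", text)
    words = []
    for token in tokens:
        lower = token.lower()
        if token.isdigit():
            n = int(token)
            if n < 100:
                words.extend(number_to_words(n))
            else:
                # Simplification: read longer numbers digit by digit.
                words.extend(ONES[int(d)] for d in token)
        elif lower in ABBREVIATIONS:
            words.append(ABBREVIATIONS[lower])
        else:
            words.append(lower)
    return words


print(normalise("Dr Smith lives at 42 Elm St."))
# ['doctor', 'smith', 'lives', 'at', 'forty', 'two', 'elm', 'street']
```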
Letter to sound
Once the text is entirely converted to words, we need to decide on their pronunciations.
CART
Classification and regression trees are widely applicable models for making predictions. We can use them for letter-to-sound, prosody, and many other tasks.
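
To make the idea concrete, here is a minimal sketch of growing a classification tree for one letter-to-sound decision: whether the letter "c" is pronounced "k" or "s", given the previous and next letters. The training examples, the phone symbols, and the function names are all invented for this illustration. The tree is grown greedily by asking yes/no questions that reduce Gini impurity, with no pruning and no stopping criterion other than running out of useful questions, so it should be read as a toy version of the technique rather than a practical implementation.

```python
from collections import Counter

# Hypothetical training data: (previous letter, next letter) -> phone for "c".
# "#" marks a word boundary.
TRAINING = [
    (("#", "a"), "k"),  # cat
    (("#", "o"), "k"),  # cot
    (("#", "u"), "k"),  # cut
    (("a", "t"), "k"),  # act
    (("#", "i"), "s"),  # city
    (("#", "e"), "s"),  # cell
    (("a", "e"), "s"),  # ace
    (("i", "y"), "s"),  # icy
]


def gini(labels):
    """Gini impurity of a list of labels (0.0 means all labels agree)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_question(rows):
    """Return the (feature index, value) question that most reduces impurity,
    or None if no question improves on the unsplit data."""
    best, best_score = None, gini([label for _, label in rows])
    for index in range(len(rows[0][0])):
        for value in sorted({features[index] for features, _ in rows}):
            yes = [lab for feats, lab in rows if feats[index] == value]
            no = [lab for feats, lab in rows if feats[index] != value]
            if not yes or not no:
                continue
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
            if score < best_score:
                best, best_score = (index, value), score
    return best


def grow(rows):
    """Recursively grow a tree. Internal nodes are tuples, leaves are labels."""
    question = best_question(rows)
    if question is None:
        # Leaf: predict the most common label among the remaining rows.
        return Counter(label for _, label in rows).most_common(1)[0][0]
    index, value = question
    yes = [row for row in rows if row[0][index] == value]
    no = [row for row in rows if row[0][index] != value]
    return (question, grow(yes), grow(no))


def predict(tree, features):
    """Follow the yes/no questions from the root down to a leaf."""
    while isinstance(tree, tuple):
        (index, value), yes, no = tree
        tree = yes if features[index] == value else no
    return tree


tree = grow(TRAINING)
print(predict(tree, ("#", "a")))  # 'k', as in "cat"
print(predict(tree, ("a", "e")))  # 's', as in "ace"
```

The same procedure applies to other prediction tasks by changing the features and labels. For a continuous target, a regression tree would replace the impurity measure with something like variance, which is not shown here.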
Prosody prediction
Prosody is typically predicted in several stages: placement of events, classification of their types, then realisation.
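
The three stages named above can be sketched as a simple pipeline. Everything in this example is an assumption made for illustration: the function-word list, the hand-written rules (which stand in for trained models such as classification trees), the choice of ToBI-style labels "H*", "L-L%" and "H-H%", and the F0 target values in Hz.

```python
# Hypothetical function-word list: words that do not receive an accent here.
FUNCTION_WORDS = {"a", "an", "and", "in", "is", "it", "of", "on", "the", "to"}

# Hypothetical F0 offsets in Hz, relative to a speaker baseline.
F0_OFFSET_HZ = {"H*": 30, "L-L%": -30, "H-H%": 50}


def place_events(words):
    """Stage 1, placement: an accent on each content word, a boundary at the end."""
    events = [{"word": i, "kind": "accent"}
              for i, word in enumerate(words) if word not in FUNCTION_WORDS]
    events.append({"word": len(words) - 1, "kind": "boundary"})
    return events


def classify_events(events, final_punctuation):
    """Stage 2, classification: give each placed event a type label."""
    for event in events:
        if event["kind"] == "accent":
            event["type"] = "H*"
        elif final_punctuation == "?":
            event["type"] = "H-H%"  # rising boundary for a question
        else:
            event["type"] = "L-L%"  # falling boundary otherwise
    return events


def realise(events, baseline_hz=120):
    """Stage 3, realisation: turn each typed event into an F0 target in Hz."""
    return [(event["word"], baseline_hz + F0_OFFSET_HZ[event["type"]])
            for event in events]


words = ["the", "cat", "sat"]
events = classify_events(place_events(words), ".")
print(realise(events))
# [(1, 150), (2, 150), (2, 90)]
```

Keeping the stages separate means each one can be replaced by a better predictor without touching the others.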
