We need to process the input text, first to identify the words, then to decide how they should be said.
Letter to sound
Once the text is entirely converted to words, we need to decide on their pronunciations.
CART
Classification and regression trees (CARTs) are widely applicable models for making predictions. We can use them for letter-to-sound conversion, prosody prediction, and many other tasks.
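To make the idea concrete, here is a minimal sketch of a classification tree that predicts how the letter "c" is pronounced from its neighbouring letters. Everything in it is an assumption made for illustration: the toy data set, the two features (previous and next letter, with "#" marking a word boundary), and the helper names gini, best_question, grow and predict. It grows the tree greedily by asking yes/no questions that reduce Gini impurity, and omits the stopping criteria and pruning that a practical system would need.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_question(rows, labels):
    """Return the (feature index, value) question giving the lowest weighted
    impurity, or None if no question improves on the unsplit data."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for v in sorted({r[f] for r in rows}):
            yes = [l for r, l in zip(rows, labels) if r[f] == v]
            no = [l for r, l in zip(rows, labels) if r[f] != v]
            if not yes or not no:
                continue
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
            if score < best_score - 1e-12:
                best, best_score = (f, v), score
    return best

def grow(rows, labels):
    """Grow a tree. A leaf is a label; an internal node is a tuple
    (feature index, value, yes-subtree, no-subtree)."""
    question = best_question(rows, labels)
    if question is None:
        return Counter(labels).most_common(1)[0][0]
    f, v = question
    yes = [(r, l) for r, l in zip(rows, labels) if r[f] == v]
    no = [(r, l) for r, l in zip(rows, labels) if r[f] != v]
    return (f, v, grow(*zip(*yes)), grow(*zip(*no)))

def predict(tree, row):
    """Answer the questions from the root down until a leaf is reached."""
    while isinstance(tree, tuple):
        f, v, yes, no = tree
        tree = yes if row[f] == v else no
    return tree

# Toy training data (an assumption): features are (previous letter, next letter)
# around a letter "c"; the label is the predicted sound, "k" or "s".
rows = [("#", "a"), ("#", "e"), ("#", "i"), ("#", "o"),
        ("#", "u"), ("i", "e"), ("a", "t"), ("n", "y")]
labels = ["k", "s", "s", "k", "k", "s", "k", "s"]

tree = grow(rows, labels)
print(predict(tree, ("s", "e")))  # an unseen context, as in "scent"
```

The same tree-growing code would serve for other tasks by changing only the features and labels, which is what makes this kind of model so widely applicable.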
Prosody prediction
Prosody is typically predicted in several stages: placement of prosodic events, classification of their types, then realisation.
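The three stages can be sketched as a simple pipeline. This is only an illustration of how the stages fit together, not a real prosody model: the function-word list, the event labels "prenuclear" and "nuclear", the F0 values in Hz, and all function names are assumptions, and each hand-written rule stands in for what would normally be a trained predictor.

```python
# Toy function-word list (an assumption); such words receive no event here.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "is"}

def place_events(words):
    """Stage 1, placement: decide which words carry a prosodic event."""
    return [w.lower() not in FUNCTION_WORDS for w in words]

def classify_events(has_event):
    """Stage 2, classification: label each placed event with a type.
    The last event is called 'nuclear', earlier ones 'prenuclear'."""
    positions = [i for i, e in enumerate(has_event) if e]
    last = positions[-1] if positions else None
    types = []
    for i, e in enumerate(has_event):
        if not e:
            types.append(None)
        elif i == last:
            types.append("nuclear")
        else:
            types.append("prenuclear")
    return types

# Hypothetical F0 targets in Hz for each event type (None means no event).
F0_TARGETS = {None: 120.0, "prenuclear": 150.0, "nuclear": 170.0}

def realise(words, types):
    """Stage 3, realisation: turn event types into one F0 target per word."""
    return [(w, F0_TARGETS[t]) for w, t in zip(words, types)]

def predict_prosody(words):
    """Run the three stages in sequence."""
    has_event = place_events(words)
    types = classify_events(has_event)
    return realise(words, types)

contour = predict_prosody("the cat sat on the mat".split())
for word, f0 in contour:
    print(word, f0)
```

Keeping the stages separate means each one can be replaced independently, for example by swapping a hand-written rule for a classification tree trained on labelled data.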