This is typically predicted in several stages: placement of events, classification of their types, then realisation.
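The three stages can be sketched as a small pipeline. This is only a toy illustration, assuming the "events" are intonational events such as pitch accents and that realisation means assigning an F0 target. The function-word list, the two event labels, the Hz values, and every function name here are hypothetical placeholders; a real system would typically use trained models at each stage.

```python
from dataclasses import dataclass

# Hypothetical toy list; a real system would use POS tags or a trained model.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on", "at"}


@dataclass
class Event:
    word_index: int     # which word the event is placed on
    kind: str = ""      # filled in by the classification stage
    f0_hz: float = 0.0  # filled in by the realisation stage


def place_events(words):
    """Stage 1 (placement): toy rule, every content word gets an event."""
    return [Event(i) for i, w in enumerate(words)
            if w.lower() not in FUNCTION_WORDS]


def classify_events(events):
    """Stage 2 (classification): toy rule, the last event is 'nuclear'."""
    for e in events:
        e.kind = "prenuclear"
    if events:
        events[-1].kind = "nuclear"
    return events


def realise_events(events, baseline_hz=120.0):
    """Stage 3 (realisation): map each event type to an F0 target."""
    peak_above_baseline = {"prenuclear": 30.0, "nuclear": 50.0}  # assumed values
    for e in events:
        e.f0_hz = baseline_hz + peak_above_baseline[e.kind]
    return events


def predict(words):
    """Run the three stages in order."""
    return realise_events(classify_events(place_events(words)))


if __name__ == "__main__":
    for event in predict("the cat sat on the mat".split()):
        print(event)
```

Keeping the stages as separate functions mirrors the staged description above: each stage could be replaced by a better predictor without touching the other two.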
Reading
Jurafsky & Martin (2nd ed) – Section 8.3 – Prosodic Analysis
Beyond getting the phones right, we also need to consider other aspects of speech such as intonation and pausing.
Taylor – Chapter 6 – Prosody prediction from text
Predicting phrasing, prominence, intonation and tune, from text input.