We want a unit sequence that is linguistically similar to the target utterance we are synthesising and that also concatenates well. These two goals trade off against each other, so we formulate the problem as the sum of two kinds of cost function: a target cost for every candidate unit selected from the database, and a join cost between every pair of concatenated units.
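As a sketch of that sum, the total cost of a candidate unit sequence could be written as below. The notation is our assumption and is not taken from the text: $t_i$ is the $i$-th target, $u_i$ the candidate unit chosen for it, $T$ the target cost, $J$ the join cost, and $N$ the number of targets.

```latex
C(t_{1:N}, u_{1:N}) = \sum_{i=1}^{N} T(t_i, u_i) + \sum_{i=1}^{N-1} J(u_i, u_{i+1})
```

Unit selection then amounts to finding the sequence $u_{1:N}$ that minimises $C$.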
Independent Feature Formulation (IFF) target cost
Festival uses a target cost function that is a weighted sum of linguistic feature mismatches. Only symbolic information from the front-end is required, and no explicit predictions of acoustic values (e.g., F0 and duration) are needed.
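To make "weighted sum of linguistic feature mismatches" concrete, here is a minimal sketch in Python. The feature names and weight values are hypothetical, chosen only for illustration; they are not Festival's actual features or weights. Each feature contributes its weight when target and candidate disagree, and nothing when they agree.

```python
def target_cost(target, candidate, weights):
    """Weighted sum of symbolic feature mismatches.

    Each feature adds its weight to the cost if the candidate's value
    differs from the target's value, and adds nothing if they match.
    """
    return sum(
        weight
        for name, weight in weights.items()
        if target.get(name) != candidate.get(name)
    )


# Hypothetical features and weights, for illustration only.
weights = {"stress": 2.0, "phrase_position": 1.0, "left_context": 0.5}
target = {"stress": 1, "phrase_position": "final", "left_context": "t"}
candidate = {"stress": 1, "phrase_position": "initial", "left_context": "d"}

cost = target_cost(target, candidate, weights)  # stress matches; the other two do not
```

Note that nothing here is acoustic: both the target and the candidate are described purely by symbolic features.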
Target cost for diphones
Linguistic features from the front end are associated with phones, but the acoustic units of concatenation are actually diphones, so we need to define how the target cost is calculated for diphones.
The join cost
Concatenations (i.e., joins) might be perceived by the listener, so we need to minimise the potential for perceptual discontinuity. The join cost quantifies this, by measuring acoustic discontinuity.
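One simple way to measure acoustic discontinuity is the distance between the acoustic feature vectors on either side of the join. The sketch below assumes that each unit stores one feature vector for its first frame and one for its last frame; the field names `start_frame` and `end_frame`, and the choice of Euclidean distance, are our assumptions rather than a description of any particular system.

```python
import math


def join_cost(left_unit, right_unit):
    """Euclidean distance between the frames on either side of the join.

    Assumes each unit is a dict with hypothetical keys "start_frame" and
    "end_frame", each a list of acoustic parameters of equal length.
    """
    a = left_unit["end_frame"]
    b = right_unit["start_frame"]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy two-dimensional frames, for illustration only.
left = {"start_frame": [1.0, 1.0], "end_frame": [0.0, 3.0]}
right = {"start_frame": [4.0, 0.0], "end_frame": [2.0, 2.0]}

cost = join_cost(left, right)
```

A cost of zero means the two frames are identical, so a join there should be the least likely to be heard.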
How every local cost influences the entire unit sequence
We have deliberately formulated the target and join costs to be computed locally, so that we can use Dynamic Programming. But every local decision (of which candidate to use for a particular target) has a potential effect on all the other decisions, via the join costs on the left and right edges of that candidate.
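The search can be sketched as a Viterbi-style dynamic programme over the lattice of candidates. In this illustration the function name `viterbi_select` and the toy cost functions are our own; units are plain integers so that the arithmetic is easy to follow by hand.

```python
def viterbi_select(targets, candidates, target_cost, join_cost):
    """Return (best unit sequence, its total cost).

    candidates[i] is the list of candidate units for targets[i].
    best[i][j] holds the lowest total cost of any partial sequence
    that ends in candidates[i][j].
    """
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]

    for i in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[i]:
            # Cost of reaching c from each candidate at the previous position.
            costs = [
                best[i - 1][k] + join_cost(prev, c)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + target_cost(targets[i], c))
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


# Toy costs (assumptions): units and targets are integers.
def toy_target_cost(t, u):
    return abs(t - u)


def toy_join_cost(a, b):
    return 2 * abs(a - b)


targets = [5, 5, 5]
candidates = [[1], [5, 2], [1]]
path, total = viterbi_select(targets, candidates, toy_target_cost, toy_join_cost)
```

In this toy lattice the middle candidate `5` has a target cost of zero, yet the search prefers `2`, because `5` would join badly to its neighbours on both sides. That is the sense in which a local cost influences the whole sequence.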