Forum Replies Created
The reasons for making joins in the middle of phones, rather than at phone boundaries, were covered in Speech Processing. The main reason is that the middle of a phone is a more acoustically stable position, further away from the effects of co-articulation. Think of diphones as “units of co-articulation” which go from one stable mid-phone position to the next.
Your other point relates to what Taylor says on page 483: “These high-level features are also likely to influence voice quality and spectral effects and if these are left out of the set of specification features then their influence cannot be used in the synthesiser. This can lead to a situation in which the high-level features are included in addition to the F0 contour.”
I’ll touch on this in the lecture.
In the Cereproc system, the pieces are called “spurts” and are either sentences or parts of sentences. We do enough text processing to predict where pauses will be (it might be as simple as at every period and comma). We then assume that these pauses are long enough to prevent any co-articulation spreading across them.
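To make that idea concrete, here is a minimal Python sketch of rule-based chunking into spurts, assuming a pause at every period and comma. This is not CereProc’s actual code; the function name and the regular expression are my own illustration of the principle.

```python
import re

def split_into_spurts(text):
    """Split text into 'spurts' at likely pause positions.

    A sketch of the simple rule described above: assume a pause (and
    therefore no co-articulation spreading across it) after every
    period and comma.
    """
    # Split on whitespace that follows a period or comma, keeping the punctuation.
    return [piece for piece in re.split(r'(?<=[.,])\s+', text.strip()) if piece]

print(split_into_spurts("Hello, world. This is a longer spurt, with a pause in it."))
# ['Hello,', 'world.', 'This is a longer spurt,', 'with a pause in it.']
```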
Very few systems (if any) deal with units larger than sentences in any meaningful way.
1) Theoretically there is no problem at all doing that, but it is not implemented in Festival. If you wanted to evaluate this kind of thing, you might manually edit the synthetic speech to insert those effects – that would be a perfectly acceptable experimental technique.
2) that’s part of the PhD topic of Rasmus Dall
I’ll touch on this in the lecture, but this is the point where we will depart from unit selection (with an ASF target cost) and move on to statistical parametric synthesis, a couple of weeks from now.
All good questions, but we’re going to talk about NNs for synthesis a bit later in the course, so make sure to ask them again at that point.
At this stage, we can state that an NN is just a non-linear regression model, and so replacing a regression tree with an NN is not a big conceptual leap. That should be much clearer after we have covered HMM-based speech synthesis and the way that it uses regression trees.
We’ll cover this in the lecture.
As you say, we can place any labels we like on the database (whether automatically or manually), and then include appropriate sub-costs in the target cost to prefer units with matching values for these new linguistic features.
The hardest part is usually predicting these from the text input, at synthesis time. But, if we allow markup on that text, this information could be supplied by the user, or by whatever system is generating the text.
It’s important to note that every new sub-cost added to the target cost effectively increases the sparsity of the linguistic feature space. We may need to record a (much) larger database. We would also have to carefully tune the weight on the new sub-cost to make sure that choosing candidate units that match the new feature doesn’t result in choosing candidates that are worse matches in the other (possibly more important) features.
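As a concrete sketch of the weighted sub-cost idea, here is some illustrative Python. The feature names, the 0/1 mismatch sub-costs and the weight values are all invented for illustration; this is not Festival’s actual implementation.

```python
def target_cost(target, candidate, weights):
    """Weighted sum of target sub-costs (sketch).

    `target` and `candidate` are dicts of linguistic features.
    Each sub-cost is 0 for a match and 1 for a mismatch.
    """
    cost = 0.0
    for feature, weight in weights.items():
        mismatch = 0.0 if target.get(feature) == candidate.get(feature) else 1.0
        cost += weight * mismatch
    return cost

weights = {
    "phoneme": 10.0,
    "stress": 2.0,
    "phrase_final": 1.0,
    "emphasis": 0.5,   # the newly added sub-cost: its weight needs careful tuning
}

target = {"phoneme": "n", "stress": 1, "phrase_final": False, "emphasis": True}
candidate = {"phoneme": "n", "stress": 1, "phrase_final": False, "emphasis": False}
print(target_cost(target, candidate, weights))  # 0.5: only the new sub-cost mismatches
```

Note how the weight on the new “emphasis” sub-cost controls how much a mismatch in that feature can outweigh mismatches in the other, possibly more important, features.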
This is coming up in the following lecture, so we’ll answer it then.
We’ll cover both points in the lecture.
Your list of design choices is pretty comprehensive, I think. We’ll recap that in the lecture.
Yes, unit selection really is “synthesis” because it can create an appropriate waveform for any given input text (via the linguistic specification). The data storage and search are how it is implemented but the clever part is deciding what to do when the database doesn’t contain the exact units that we need (and that will happen almost all the time).
We’ll look at this in the lecture.
We’ll look at this in the lecture.
By “arbitrary point” Taylor means that the acoustic features take continuous values (e.g., F0 in Hz) rather than discrete values (e.g., “stressed?”).
Those acoustic values have been predicted, given the linguistic features. We actually already know how to build a model to make such predictions – see Speech Processing…
I think you’ve just missed one simple point: it will not be possible, in general, to find any candidates in the database that have exactly the same linguistic specification as the target.
In your example, where you are using phone-sized units and an ASF target cost, your target specification is “phoneme /n/ with an F0 of 121Hz and a duration of 60ms”. It is very unlikely that we will find a candidate with exactly those values. Imagine that we find these candidates:
- phoneme /n/ with an F0 of 101Hz and a duration of 63ms
- phoneme /n/ with an F0 of 120Hz and a duration of 93ms
- phoneme /n/ with an F0 of 114Hz and a duration of 56ms
None of these will have zero target cost.
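Here is a small worked sketch using exactly those numbers. The simple weighted absolute distance and the unit weights are assumptions for illustration; a real ASF target cost might use normalised or perceptually motivated distances.

```python
def asf_target_cost(target, candidate, w_f0=1.0, w_dur=1.0):
    """Weighted absolute distance between the predicted (target) acoustics
    and a candidate's actual acoustics (sketch)."""
    return (w_f0 * abs(target["f0"] - candidate["f0"])
            + w_dur * abs(target["dur"] - candidate["dur"]))

target = {"f0": 121, "dur": 60}   # predicted for /n/: 121 Hz, 60 ms
candidates = [
    {"f0": 101, "dur": 63},
    {"f0": 120, "dur": 93},
    {"f0": 114, "dur": 56},
]

for c in candidates:
    print(c, asf_target_cost(target, c))
# {'f0': 101, 'dur': 63} 23.0
# {'f0': 120, 'dur': 93} 34.0
# {'f0': 114, 'dur': 56} 11.0   <- the best match, but still not a zero cost
```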
1) A simple way to do that would be to add the fillers (e.g., “Hmm”) as words in the dictionary. You can then make sure there are some example recordings of that word in your database. Try it and see if it works…
2) We’ll discuss intelligibility etc in the lecture on evaluation, so please ask this question again at that point.
Yes – spot on.
The decision tree is in fact performing a regression or classification task. Given the linguistic features, it is predicting which units in the database would be suitable to use for synthesising the current target position.
If we think of the tree as providing one or more candidate units at each leaf, it is performing classification.
We can also think of it as a regression tree that is predicting an acoustic specification, represented either as a set of exemplar units or (as you say) a probability density. The latter is how HMM-based speech synthesis works.
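Here is a toy sketch of those two views of the tree. The yes/no questions, unit IDs and Gaussian parameters are all invented for illustration.

```python
# Each leaf can be read either as a set of candidate units (classification)
# or as a predicted acoustic distribution (regression).
TREE = {
    "question": lambda f: f["stressed"],
    "yes": {
        "question": lambda f: f["phrase_final"],
        "yes": {"candidates": ["n_0042", "n_0113"], "f0_mean": 105.0, "f0_var": 40.0},
        "no":  {"candidates": ["n_0007", "n_0250"], "f0_mean": 128.0, "f0_var": 55.0},
    },
    "no": {"candidates": ["n_0019"], "f0_mean": 115.0, "f0_var": 70.0},
}

def descend(tree, features):
    """Answer yes/no questions about the linguistic features until a leaf is reached."""
    node = tree
    while "question" in node:
        node = node["yes"] if node["question"](features) else node["no"]
    return node

leaf = descend(TREE, {"stressed": True, "phrase_final": False})
print(leaf["candidates"])               # classification view: which exemplar units to consider
print(leaf["f0_mean"], leaf["f0_var"])  # regression view: a predicted density (cf. HMM synthesis)
```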