Forum Replies Created
Your descriptions of IFF and ASF are correct. You are also right to say that the acoustic features in ASF are predicted from the same linguistic features used in IFF.
The key point to understand is that many different combinations of linguistic features can result in the same (or very similar) acoustic features. So, sparsity might be less of a problem in the ASF case.
In other words, we don’t really need to find a candidate unit that has the same linguistic features as the target, we just need it to sound like it has the same linguistic features.
However, for an ASF target cost to work well, we need to
- predict the acoustic features accurately from the linguistic features
- measure distances in acoustic space in a way that correlates with perception
Neither of those is trivial. Your phrase “direct mappings” suggests these mappings are easy to learn: they are not.
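As a small illustration of the second point (a sketch only; the function names and the choice of sub-cost are mine, not part of any particular system), comparing F0 values in the semitone domain rather than in raw Hz is one simple way to make an acoustic distance behave more like perception, since pitch perception is roughly logarithmic:
```
import math

def f0_subcost_hz(target_f0, candidate_f0):
    # naive distance in raw Hz: a 20 Hz error counts the same
    # whether the target F0 is 100 Hz or 300 Hz
    return abs(target_f0 - candidate_f0)

def f0_subcost_semitones(target_f0, candidate_f0):
    # distance in semitones: closer to how listeners perceive pitch,
    # so more likely to correlate with perceived mismatch
    return abs(12.0 * math.log2(candidate_f0 / target_f0))

print(f0_subcost_hz(100.0, 120.0), f0_subcost_semitones(100.0, 120.0))  # 20.0  ~3.16
print(f0_subcost_hz(300.0, 320.0), f0_subcost_semitones(300.0, 320.0))  # 20.0  ~1.12
```
The same 20 Hz error is a much bigger perceptual mismatch at 100 Hz than at 300 Hz; the semitone distance reflects that, while the raw Hz distance does not.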
I think it’s one of those terms that linguists use so frequently that they forget to define it carefully. First, we need to know what a phrase is. In the context of speech, we mean the prosodic phrase. This short sentence has a single prosodic phrase when spoken:
“The cat sat on the mat.”
and this one has two:
“The cat sat on the mat, and the dog ran round the tree.”
Phrase-final means the last word, syllable or phone in a prosodic phrase. It’s important because special things happen in phrase-final position: syllables become longer, and F0 often lowers (in statements), for example.
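As a toy illustration (a hypothetical sketch, not how any particular front end does it), once the prosodic phrase boundaries are known, attaching a phrase-final flag is straightforward:
```
# hypothetical example: two prosodic phrases, as in the second sentence above
phrases = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["and", "the", "dog", "ran", "round", "the", "tree"],
]

# mark the last word in each prosodic phrase as phrase-final;
# the same idea applies to syllables and phones
for phrase in phrases:
    for i, word in enumerate(phrase):
        phrase_final = (i == len(phrase) - 1)
        print(word, "phrase_final" if phrase_final else "")
```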
The reasons for making joins in the middle of phones, rather than at phone boundaries, were covered in Speech Processing. The main reason is that this is a more acoustically-stable position, and further away from the effects of co-articulation. Think of diphones as “units of co-articulation” which go from one stable mid-phone position to the next.
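A minimal sketch of that idea, assuming we already have the phone sequence (the function name is made up for illustration): each diphone unit runs from the middle of one phone to the middle of the next, so every join falls in a stable region.
```
def phones_to_diphones(phones):
    # each unit spans from the midpoint of one phone to the midpoint
    # of the next, so every join falls in a stable mid-phone region
    return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]

# "cat" as /k ae t/, padded with silence at both ends
print(phones_to_diphones(["sil", "k", "ae", "t", "sil"]))
# ['sil-k', 'k-ae', 'ae-t', 't-sil']
```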
Your other point relates to what Taylor says on page 483: “These high-level features are also likely to influence voice quality and spectral effects and if these are left out of the set of specification features then their influence cannot be used in the synthesiser. This can lead to a situation in which the high-level features are included in addition to the F0 contour.”
I’ll touch on this in the lecture.
In the Cereproc system, the pieces are called “spurts” and are either sentences or parts of sentences. We do enough text processing to predict where pauses will be (it might be as simple as placing one at every period and comma). We then assume that these pauses are long enough to prevent any co-articulation spreading across them.
Very few systems (if any) deal with units larger than sentences in any meaningful way.
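Here is a deliberately naive sketch of that kind of splitting, using just periods and commas as pause locations (a real system would do proper text processing and pause prediction; the function name is invented):
```
import re

def split_into_spurts(text):
    # split at every period and comma, keeping non-empty chunks;
    # we then assume the pause at each split point is long enough
    # that no co-articulation spreads across it
    return [s.strip() for s in re.split(r"[.,]", text) if s.strip()]

print(split_into_spurts("The cat sat on the mat, and the dog ran round the tree."))
# ['The cat sat on the mat', 'and the dog ran round the tree']
```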
1) Theoretically there is no problem at all doing that, but it is not implemented in Festival. If you wanted to evaluate this kind of thing, you might manually edit the synthetic speech to insert those effects; that would be a perfectly acceptable experimental technique.
2) That’s part of the PhD topic of Rasmus Dall.
I’ll touch on this in the lecture, but this is the point where we will depart from unit selection (with an ASF target cost) and move on to statistical parametric synthesis, a couple of weeks from now.
All good questions, but we’re going to talk about NNs for synthesis a bit later in the course, so make sure to ask them again at that point.
At this stage, we can state that an NN is just a non-linear regression model, and so replacing a regression tree with an NN is not a big conceptual leap. That should be much clearer after we have covered HMM-based speech synthesis and the way that it uses regression trees.
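To underline that point, here is a sketch (using scikit-learn purely for illustration; the feature encoding and data are invented) in which the only change is swapping the regression model. Both map a vector of linguistic features to an acoustic value such as duration.
```
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# toy data: each row stands for an encoded linguistic feature vector
# (e.g. phone identity, stress, phrase-final flag); each target is an
# acoustic value such as duration in ms
X = np.random.rand(200, 10)
y = 50 + 100 * X[:, 0] + 30 * X[:, 1] * X[:, 2]

tree = DecisionTreeRegressor(max_depth=5).fit(X, y)
nn = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X, y)

# same inputs, same kind of output: both are just regression models
x_new = np.random.rand(1, 10)
print(tree.predict(x_new), nn.predict(x_new))
```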
We’ll cover this in the lecture.
As you say, we can place any labels we like on the database (whether automatically or manually), and then include appropriate sub-costs in the target cost to prefer units with matching values for these new linguistic features.
The hardest part is usually predicting these from the text input, at synthesis time. But, if we allow markup on that text, this information could be supplied by the user, or whatever system is generating the text.
It’s important to note that every new sub-cost added to the target cost effectively increases the sparsity of the linguistic feature space. We may need to record a (much) larger database. We would also have to carefully tune the weight on the new sub-cost, to make sure that preferring candidates which match the new feature doesn’t come at the expense of worse matches on the other (possibly more important) features.
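As a sketch of what “including an appropriate sub-cost” might look like (all names, features, and weights here are invented for illustration), the target cost is a weighted sum of mismatch penalties, and the new feature simply becomes one more term whose weight has to be balanced against the others:
```
# hypothetical IFF-style target cost: a weighted sum of binary
# mismatch sub-costs over linguistic features
WEIGHTS = {
    "phone": 10.0,
    "stress": 3.0,
    "phrase_final": 2.0,
    "emphasis": 1.0,   # the newly added feature; its weight needs tuning
}

def target_cost(target, candidate):
    cost = 0.0
    for feature, weight in WEIGHTS.items():
        if target.get(feature) != candidate.get(feature):
            cost += weight
    return cost

target = {"phone": "n", "stress": 1, "phrase_final": False, "emphasis": True}
candidate = {"phone": "n", "stress": 1, "phrase_final": False, "emphasis": False}
print(target_cost(target, candidate))  # 1.0: mismatch only on the new feature
```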
This is coming up in the following lecture, so we’ll answer it then.
We’ll cover both points in the lecture.
Your list of design choices is pretty comprehensive, I think. We’ll recap that in the lecture.
Yes, unit selection really is “synthesis” because it can create an appropriate waveform for any given input text (via the linguistic specification). The data storage and search are how it is implemented, but the clever part is deciding what to do when the database doesn’t contain the exact units that we need (and that will happen almost all the time).
We’ll look at this in the lecture.
We’ll look at this in the lecture.
By “arbitrary point” Taylor means that the acoustic features take continuous values (e.g., F0 in Hz) rather than discrete values (e.g., “stressed?”).
Those acoustic values have been predicted, given the linguistic features. We actually already know how to build a model to make such predictions – see Speech Processing…
I think you’ve just missed one simple point: it will not be possible, in general, to find any candidates in the database that have exactly the same linguistic specification as the target.
In your example, where you are using phone-sized units and an ASF target cost, your target specification is “phoneme /n/ with an F0 of 121Hz and a duration of 60ms”. It is very unlikely that we will find a candidate with exactly those values. Imagine that we find these candidates:
- phoneme /n/ with an F0 of 101Hz and a duration of 63ms
- phoneme /n/ with an F0 of 120Hz and a duration of 93ms
- phoneme /n/ with an F0 of 114Hz and a duration of 56ms
None of these will have zero target cost.
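To make that concrete, here is a sketch of an ASF target cost for this example (the normalisation and weights are my own choices, not a standard recipe); it simply shows that every candidate gets a non-zero cost because none of them matches the target exactly.
```
def asf_target_cost(target, candidate, w_f0=1.0, w_dur=1.0):
    # hypothetical ASF target cost: weighted relative differences between
    # the predicted target acoustics and the candidate's actual acoustics
    f0_term = abs(target["f0"] - candidate["f0"]) / target["f0"]
    dur_term = abs(target["dur"] - candidate["dur"]) / target["dur"]
    return w_f0 * f0_term + w_dur * dur_term

target = {"f0": 121.0, "dur": 60.0}      # predicted acoustics for phoneme /n/
candidates = [
    {"f0": 101.0, "dur": 63.0},
    {"f0": 120.0, "dur": 93.0},
    {"f0": 114.0, "dur": 56.0},
]

for c in candidates:
    print(round(asf_target_cost(target, c), 3))
# approximately 0.215, 0.558, 0.125: all non-zero; the third candidate is cheapest
```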