Forum Replies Created
We’ll look at a more detailed example of greedy text selection in the lecture.
Your suggestion to normalise for the length of the sentence is a good idea; otherwise we might just select the longest sentences (because they contain more diphones than shorter sentences).
You make a good point about final total coverage: 100% might be impossible simply because there are no occurrences of certain very rare diphones in our large corpus. The ARCTIC corpus covers around 75-80% of all possible diphones. The initial large corpus contained at least one example of about 90% of all possible diphones (reducing to around 80% when discarding sentences that are not “nice”), so that would be a ceiling on the possible coverage that could ever be obtained.
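In the meantime, here is a very rough sketch of what length-normalised greedy selection could look like. The diphone sequences and the scoring function are made up for illustration; this is not the actual ARCTIC selection script.

```python
# Rough sketch of greedy text selection with length normalisation.
# Sentences are assumed to be pre-processed into diphone sequences;
# the data and scoring here are illustrative only.

def greedy_select(sentences, target_size):
    """sentences: list of (text, diphone_list) pairs."""
    covered = set()
    selected = []
    remaining = list(sentences)
    while len(selected) < target_size and remaining:
        # Score each candidate by how many *new* diphone types it adds,
        # normalised by its length so that long sentences are not favoured
        # just because they contain more diphone tokens.
        def score(item):
            _, diphones = item
            new_types = set(diphones) - covered
            return len(new_types) / len(diphones) if diphones else 0.0
        best = max(remaining, key=score)
        if score(best) == 0.0:
            break  # no remaining sentence adds any new diphone type
        selected.append(best)
        covered |= set(best[1])
        remaining.remove(best)
    return selected, covered

# Toy usage:
corpus = [
    ("the cat sat", ["dh-ax", "ax-k", "k-ae", "ae-t", "t-s", "s-ae", "ae-t"]),
    ("a dog ran",   ["ax-d", "d-ao", "ao-g", "g-r", "r-ae", "ae-n"]),
]
chosen, coverage = greedy_select(corpus, target_size=1)
```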
A training algorithm is used to train a model on some data. Give me more context to your question and I’ll provide a more specific answer.
We’ll do a more detailed example in the lecture.
I agree that this is a somewhat strange design decision in the ARCTIC corpora. In the tech report, the authors don’t justify this decision, but I assume it is because questions are too sparse to attempt coverage of them, and because the features used in their text selection algorithm don’t capture the differences between statements and questions.
Your suggestion to remove sentences that are questions from the corpus entirely, rather than keep them without a question mark, seems sensible to me.
Your descriptions of IFF and ASF are correct. You are also right to say that the acoustic features in ASF are predicted from the same linguistic features used in IFF.
The key point to understand is that many different combinations of linguistic features can map to the same (or very similar) acoustic features. So, sparsity might be less of a problem in the ASF case.
In other words, we don’t really need to find a candidate unit that has the same linguistic features as the target, we just need it to sound like it has the same linguistic features.
However, for an ASF target cost to work well, we need to:
- predict the acoustic features accurately from the linguistic features
- measure distances in acoustic space in a way that correlates with perception
Neither of those is trivial. Your phrase “direct mappings” suggests these mappings are easy to learn: they are not.
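To make those two steps concrete, here is a toy sketch of an ASF-style target cost. The predictor class, feature dimensions and weights are placeholders invented for illustration; in a real system, those two steps are exactly where the hard work lies.

```python
import numpy as np

# Toy ASF-style target cost: predict the target's acoustic features
# (e.g. F0, duration, a few spectral parameters) from its linguistic
# features, then measure a weighted distance to the candidate's
# measured acoustic features.

class ToyAcousticPredictor:
    """Stand-in for a trained regression model (tree, NN, ...)."""
    def __init__(self, weight_matrix):
        self.W = weight_matrix
    def predict(self, linguistic_features):
        return np.asarray(linguistic_features) @ self.W

def asf_target_cost(target_linguistic, candidate_acoustic, predictor, feature_weights):
    predicted = predictor.predict(target_linguistic)
    diff = predicted - np.asarray(candidate_acoustic)
    # Weighted Euclidean distance; a perceptually-motivated distance
    # would be needed in practice.
    return float(np.sqrt(np.sum(feature_weights * diff ** 2)))

# Made-up dimensions: 6 linguistic features -> 3 acoustic features
rng = np.random.default_rng(0)
predictor = ToyAcousticPredictor(rng.random((6, 3)))
target_linguistic = rng.random(6)
candidate_acoustic = rng.random(3)
weights = np.array([1.0, 0.5, 0.5])  # e.g. weight F0 errors more heavily
print(asf_target_cost(target_linguistic, candidate_acoustic, predictor, weights))
```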
I think it’s one of those terms that linguists use so frequently, they forget to define it carefully. First we need to know what a phrase is. In the context of speech, we mean the prosodic phrase. This short sentence has a single prosodic phrase when spoken:
“The cat sat on the mat.”
and this one has two:
“The cat sat on the mat, and the dog ran round the tree.”
Phrase-final means the last word, syllable or phone in a prosodic phrase. It’s important because special things happen in phrase-final position: syllables become longer, and F0 often lowers (in statements), for example.
The reasons for making joins in the middle of phones, rather than at phone boundaries, were covered in Speech Processing. The main reason is that this is a more acoustically stable position, further away from the effects of co-articulation. Think of diphones as “units of co-articulation” which go from one stable mid-phone position to the next.
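If it helps, here is a tiny illustration of diphones as units that run from the middle of one phone to the middle of the next; the phone symbols are just examples.

```python
# Each diphone spans from the (acoustically stable) middle of one phone
# to the middle of the next, so joins fall in stable regions.

def phones_to_diphones(phones):
    """['sil', 'k', 'ae', 't', 'sil'] -> ['sil-k', 'k-ae', 'ae-t', 't-sil']"""
    return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]

print(phones_to_diphones(["sil", "k", "ae", "t", "sil"]))
```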
Your other point relates to what Taylor says on page 483: “These high-level features are also likely to influence voice quality and spectral effects and if these are left out of the set of specification features then their influence cannot be used in the synthesiser. This can lead to a situation in which the high-level features are included in addition to the F0 contour.”
I’ll touch on this in the lecture.
In the Cereproc system, the pieces are called “spurts” and are either sentences or parts of sentences. We do enough text processing to predict where pauses will be (this might be as simple as inserting a pause at every period and comma). We then assume that these pauses are long enough to prevent any co-articulation spreading across them.
Very few systems (if any) deal with units larger than sentences in any meaningful way.
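As a rough illustration only (nothing like the real Cereproc front end, which does proper text processing), splitting text into spurts at predicted pauses might look like this:

```python
import re

# Chop text into "spurts" at predicted pauses. Here the pause predictor
# is just "every period and comma", as mentioned above.

def split_into_spurts(text):
    pieces = re.split(r"[.,]", text)
    return [p.strip() for p in pieces if p.strip()]

print(split_into_spurts("The cat sat on the mat, and the dog ran round the tree."))
# ['The cat sat on the mat', 'and the dog ran round the tree']
```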
1) Theoretically, there is no problem at all doing that, but it is not implemented in Festival. If you wanted to evaluate this kind of thing, you could manually edit the synthetic speech to insert those effects; that would be a perfectly acceptable experimental technique.
2) That’s part of the PhD topic of Rasmus Dall.
I’ll touch on this in the lecture, but this is the point where we will depart from unit selection (with an ASF target cost) and move on to statistical parametric synthesis, a couple of weeks from now.
All good questions, but we’re going to talk about NNs for synthesis a bit later in the course, so make sure to ask them again at that point.
At this stage, we can state that an NN is just a non-linear regression model, and so replacing a regression tree with an NN is not a big conceptual leap. That should be much clearer after we have covered HMM-based speech synthesis and the way that it uses regression trees.
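If you want to convince yourself of that now, here is a small sketch (using scikit-learn purely as an example library, and random data) in which the only thing that changes is the choice of regression model:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# Both models map (numerically coded) linguistic features to an acoustic
# parameter; swapping one for the other only changes the regressor.
X = np.random.rand(200, 10)                               # toy linguistic features
y = X @ np.random.rand(10) + 0.1 * np.random.randn(200)   # toy acoustic target

tree = DecisionTreeRegressor(max_depth=5).fit(X, y)
net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X, y)

print(tree.predict(X[:1]), net.predict(X[:1]))
```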
We’ll cover this in the lecture.
As you say, we can place any labels we like on the database (whether automatically or manually), and then include appropriate sub-costs in the target cost to prefer units with matching values for these new linguistic features.
The hardest part is usually predicting these from the text input, at synthesis time. But, if we allow markup on that text, this information could be supplied by the user, or whatever system is generating the text.
It’s important to note that every new sub-cost added to the target cost effectively increases the sparsity of the linguistic feature space, so we may need to record a (much) larger database. We would also have to tune the weight on the new sub-cost carefully, so that preferring candidates which match the new feature doesn’t mean choosing candidates that are worse matches on the other (possibly more important) features.
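Schematically, adding a new linguistic feature just means adding one more weighted term to the target cost. The feature names and weights below are invented for illustration:

```python
# Target cost as a weighted sum of simple 0/1 mismatch sub-costs, with a
# newly added feature ("emphasis") as one extra term with its own weight.

def target_cost(target, candidate, weights):
    """target/candidate: dicts of linguistic feature values."""
    cost = 0.0
    for feature, weight in weights.items():
        cost += weight * (0.0 if target.get(feature) == candidate.get(feature) else 1.0)
    return cost

weights = {"stress": 1.0, "phrase_final": 2.0, "emphasis": 0.5}
t = {"stress": 1, "phrase_final": True, "emphasis": True}
c = {"stress": 1, "phrase_final": True, "emphasis": False}
print(target_cost(t, c, weights))  # 0.5
```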
This is coming up in the following lecture, so we’ll answer it then.
We’ll cover both points in the lecture.