Forum Replies Created
The reasons for making joins in the middle of phones, rather than at phone boundaries, were covered in Speech Processing. The main reason is that the middle of a phone is a more acoustically stable position, further away from the effects of co-articulation. Think of diphones as “units of co-articulation” which go from one stable mid-phone position to the next.
Your other point relates to what Taylor says on page 483: “These high-level features are also likely to influence voice quality and spectral effects and if these are left out of the set of specification features then their influence cannot be used in the synthesiser. This can lead to a situation in which the high-level features are included in addition to the F0 contour.”
I’ll touch on this in the lecture.
In the Cereproc system, the pieces are called “spurts” and are either sentences or parts of sentences. We do enough text processing to predict where pauses will be (it might be as simple as at every period and comma). We then assume that these pauses are long enough to prevent any co-articulation spreading across them.
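To make that idea concrete, here is a minimal Python sketch of rule-based chunking into spurts, assuming a pause at every period and comma. This is not CereProc’s actual code; the function name and the regular expression are my own illustration of the principle.

```python
import re

def split_into_spurts(text):
    """Split text into 'spurts' at likely pause positions.

    A sketch of the simple rule described above: assume a pause (and
    therefore no co-articulation spreading across it) after every
    period and comma.
    """
    # Split on whitespace that follows a period or comma, keeping the punctuation.
    return [piece for piece in re.split(r'(?<=[.,])\s+', text.strip()) if piece]

print(split_into_spurts("Hello, world. This is a longer spurt, with a pause in it."))
# ['Hello,', 'world.', 'This is a longer spurt,', 'with a pause in it.']
```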
Very few systems (if any) deal with units larger than sentences in any meaningful way.
1) Theoretically there is no problem at all doing that, but it is not implemented in Festival. If you wanted to evaluate this kind of thing, you might manually edit the synthetic speech to insert those effects – that would be a perfectly acceptable experimental technique.
2) that’s part of the PhD topic of Rasmus Dall
I’ll touch on this in the lecture, but this is the point where we will depart from unit selection (with an ASF target cost) and move on to statistical parametric synthesis, a couple of weeks from now.
All good questions, but we’re going to talk about NNs for synthesis a bit later in the course, so make sure to ask them again at that point.
At this stage, we can state that an NN is just a non-linear regression model, and so replacing a regression tree with an NN is not a big conceptual leap. That should be much clearer after we have covered HMM-based speech synthesis and the way that it uses regression trees.
We’ll cover this in the lecture.
As you say, we can place any labels we like on the database (whether automatically or manually), and then include appropriate sub-costs in the target cost to prefer units with matching values for these new linguistic features.
The hardest part is usually predicting these from the text input, at synthesis time. But, if we allow markup on that text, this information could be supplied by the user, or by whatever system is generating the text.
It’s important to note that every new sub-cost added to the target cost effectively increases the sparsity of the linguistic feature space. We may need to record a (much) larger database. We would also have to carefully tune the weight on the new sub-cost to make sure that choosing candidate units that match the new feature doesn’t result in choosing candidates that are worse matches in the other (possibly more important) features.
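As a concrete sketch of the weighted sub-cost idea, here is some illustrative Python. The feature names, the 0/1 mismatch sub-costs and the weight values are all invented for illustration; this is not Festival’s actual implementation.

```python
def target_cost(target, candidate, weights):
    """Weighted sum of target sub-costs (sketch).

    `target` and `candidate` are dicts of linguistic features.
    Each sub-cost is 0 for a match and 1 for a mismatch.
    """
    cost = 0.0
    for feature, weight in weights.items():
        mismatch = 0.0 if target.get(feature) == candidate.get(feature) else 1.0
        cost += weight * mismatch
    return cost

weights = {
    "phoneme": 10.0,
    "stress": 2.0,
    "phrase_final": 1.0,
    "emphasis": 0.5,   # the newly added sub-cost: its weight needs careful tuning
}

target = {"phoneme": "n", "stress": 1, "phrase_final": False, "emphasis": True}
candidate = {"phoneme": "n", "stress": 1, "phrase_final": False, "emphasis": False}
print(target_cost(target, candidate, weights))  # 0.5: only the new sub-cost mismatches
```

Note how the weight on the new “emphasis” sub-cost controls how much a mismatch in that feature can outweigh mismatches in the other, possibly more important, features.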
This is coming up in the following lecture, so we’ll answer it then.
We’ll cover both points in the lecture.
Your list of design choices is pretty comprehensive, I think. We’ll recap that in the lecture.
Yes, unit selection really is “synthesis” because it can create an appropriate waveform for any given input text (via the linguistic specification). The data storage and search are how it is implemented but the clever part is deciding what to do when the database doesn’t contain the exact units that we need (and that will happen almost all the time).
We’ll look at this in the lecture.
We’ll look at this in the lecture.
By “arbitrary point” Taylor means that the acoustic features take continuous values (e.g., F0 in Hz) rather than discrete values (e.g., “stressed?”).
Those acoustic values have been predicted, given the linguistic features. We actually already know how to build a model to make such predictions – see Speech Processing…
I think you’ve just missed one simple point: it will not be possible, in general, to find any candidates in the database that have exactly the same linguistic specification as the target.
In your example, where you are using phone-sized units and an ASF target cost, your target specification is “phoneme /n/ with an F0 of 121Hz and a duration of 60ms”. It is very unlikely that we will find a candidate with exactly those values. Imagine that we find these candidates:
- phoneme /n/ with an F0 of 101Hz and a duration of 63ms
- phoneme /n/ with an F0 of 120Hz and a duration of 93ms
- phoneme /n/ with an F0 of 114Hz and a duration of 56ms
None of these will have zero target cost.
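Here is a small worked sketch using exactly those numbers. The simple weighted absolute distance and the unit weights are assumptions for illustration; a real ASF target cost might use normalised or perceptually motivated distances.

```python
def asf_target_cost(target, candidate, w_f0=1.0, w_dur=1.0):
    """Weighted absolute distance between the predicted (target) acoustics
    and a candidate's actual acoustics (sketch)."""
    return (w_f0 * abs(target["f0"] - candidate["f0"])
            + w_dur * abs(target["dur"] - candidate["dur"]))

target = {"f0": 121, "dur": 60}   # predicted for /n/: 121 Hz, 60 ms
candidates = [
    {"f0": 101, "dur": 63},
    {"f0": 120, "dur": 93},
    {"f0": 114, "dur": 56},
]

for c in candidates:
    print(c, asf_target_cost(target, c))
# {'f0': 101, 'dur': 63} 23.0
# {'f0': 120, 'dur': 93} 34.0
# {'f0': 114, 'dur': 56} 11.0   <- the best match, but still not a zero cost
```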
1) A simple way to do that would be to add the fillers (e.g., “Hmm”) as words in the dictionary. You can then make sure there are some example recordings of that word in your database. Try it and see if it works…
2) We’ll discuss intelligibility etc in the lecture on evaluation, so please ask this question again at that point.
Yes – spot on.
The decision tree is in fact performing a regression or classification task. Given the linguistic features, it is predicting which units in the database would be suitable to use for synthesising the current target position.
If we think of the tree as providing one or more candidate units at each leaf, it is performing classification.
We can also think of it as a regression tree that is predicting an acoustic specification, represented either as a set of exemplar units or (as you say) a probability density. The latter is how HMM-based speech synthesis works.
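Here is a toy sketch of those two views of the tree. The yes/no questions, unit IDs and Gaussian parameters are all invented for illustration.

```python
# Each leaf can be read either as a set of candidate units (classification)
# or as a predicted acoustic distribution (regression).
TREE = {
    "question": lambda f: f["stressed"],
    "yes": {
        "question": lambda f: f["phrase_final"],
        "yes": {"candidates": ["n_0042", "n_0113"], "f0_mean": 105.0, "f0_var": 40.0},
        "no":  {"candidates": ["n_0007", "n_0250"], "f0_mean": 128.0, "f0_var": 55.0},
    },
    "no": {"candidates": ["n_0019"], "f0_mean": 115.0, "f0_var": 70.0},
}

def descend(tree, features):
    """Answer yes/no questions about the linguistic features until a leaf is reached."""
    node = tree
    while "question" in node:
        node = node["yes"] if node["question"](features) else node["no"]
    return node

leaf = descend(TREE, {"stressed": True, "phrase_final": False})
print(leaf["candidates"])               # classification view: which exemplar units to consider
print(leaf["f0_mean"], leaf["f0_var"])  # regression view: a predicted density (cf. HMM synthesis)
```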