Forum Replies Created
I’ve updated my previous response….
How to do detective work on the target cost?
Well, it will be forensic detective work, I think. You will need to look at the linguistic context of the target and the linguistic context of all available candidates in the database (including the one that was chosen), and then count the mismatches for each: basically, compute the target cost yourself.
I don’t recommend doing this. Looking at a single target in isolation will not tell the whole story: the candidates chosen for all the other targets have an influence on this choice, via the sequence of join costs.
The fact that units with a relatively high target cost have been chosen simply means that they are part of the lowest-overall-cost sequence. One possible reason for that is that there is only one available candidate for a given target diphone type, and so it will be always used, no matter how high the cost (e.g., even if it has “bad F0”).
The same applies to “bad duration”.
You might think that a candidate can only be an outlier if there are several other diphones of the same type. But we look at the two halves of the diphone separately. So, “outlier” is with respect to the monophone duration distribution.
The source code says:
Specifically, if the targ/cand segment type is expected to be voiced, then an f0 of zero is bad (results from poor pitch tracking).
That is, all voiced sounds should have a value for F0, as determined by the pitch tracker.
Festival’s Multisyn unit selection engine uses a pure “IFF” target cost function (an independent feature formulation, in Taylor’s terminology). It makes no explicit predictions of any acoustic properties.
The ToBI predictions made by the front end are not used in the target cost.
OK – I see. The basic target cost (the weighted sum of feature mismatches) is normalised to the 0-1 range. After that, penalties may be added for things like “bad F0” or “bad duration” and those penalties can have values such as 25 or 10.
So a target cost of, say, 10.375 is likely to be a basic cost of 0.375 plus a penalty of 10.
There is no need to store word position: it can be deduced easily on the fly by querying the utterance structure (segments have a syllable as parent, which in turn has a word as parent).
Multisyn does actually use “phrase position” in the target cost (this was omitted in the lecture slides – apologies). Here are the actual costs currently used in Multisyn:
(10 tc_stress) (5 tc_syl_pos) (5 tc_word_pos) (6 tc_partofspeech) (7 tc_phrase_pos) (4 tc_left_context) (3 tc_right_context) (25 tc_bad_f0) (10 tc_bad_duration)
where “tc_phrase_pos” looks at the match/mismatch of the phrase-break feature of the word that the current segment belongs to.
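Putting the weights and penalties together, here is a minimal sketch in Python of how such a target cost could be computed. The function and the mismatch-flag representation are mine, not Multisyn’s actual code; only the weights and penalty values are taken from the list above.

```python
# Minimal sketch of an "IFF"-style target cost, using the Multisyn
# weights and penalties quoted above. Names are hypothetical; each
# flag is 1 for a target/candidate mismatch, 0 for a match.
WEIGHTS = {
    "tc_stress": 10, "tc_syl_pos": 5, "tc_word_pos": 5,
    "tc_partofspeech": 6, "tc_phrase_pos": 7,
    "tc_left_context": 4, "tc_right_context": 3,
}
PENALTIES = {"tc_bad_f0": 25, "tc_bad_duration": 10}

def target_cost(mismatch, penalty):
    # Weighted sum of feature mismatches, normalised to the 0-1 range
    basic = sum(w * mismatch.get(f, 0) for f, w in WEIGHTS.items())
    basic /= sum(WEIGHTS.values())
    # Penalties are added after normalisation, so they dominate the total
    return basic + sum(p * penalty.get(f, 0) for f, p in PENALTIES.items())

# e.g., stress + syllable-position mismatches plus "bad duration":
# basic = (10 + 5) / 40 = 0.375, total = 0.375 + 10 = 10.375
print(target_cost({"tc_stress": 1, "tc_syl_pos": 1}, {"tc_bad_duration": 1}))
```

This reproduces the arithmetic in the earlier reply: a total cost of 10.375 decomposes into a basic cost of 0.375 plus a “bad duration” penalty of 10.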
Can you post an example of negative target costs – e.g., the output of (utt.relation.print yourutt 'Unit)
There’s no intuition of ‘high’ or ‘low’ costs – it is their value relative to the costs of alternate unit sequences that matters.
See section 3.7.1 in this paper:
Robert A. J. Clark, Korin Richmond, and Simon King. Multisyn: Open-domain unit selection for the Festival speech synthesis system. Speech Communication, 49(4):317-330, 2007. Publisher’s version, or open access version.
Let’s go over the difference between pitch tracking and pitch marking in the lecture.
For speech, we often conflate the terms “pitch” and “fundamental frequency” even though we really should not. See also this topic.
Most algorithms would use cross-correlation (also known as modified auto-correlation), even if it does need a little bit more computation. In speech synthesis, F0 estimation is typically a one-time process that happens during voice building and so we don’t care too much about a little extra computational cost, if that gives a better result.
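For illustration, a bare-bones frame-level F0 estimator along those lines might look like the following sketch. The function name and parameter values are mine; a real tracker adds a voicing decision and post-processing.

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Crude single-frame F0 estimate from the autocorrelation function.
    Illustrative only: assumes the frame is longer than fs/f0_min samples,
    and makes no voicing decision."""
    frame = frame - np.mean(frame)
    # Autocorrelation, keeping non-negative lags only
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Search only lags within the plausible F0 range: this avoids the
    # zero-lag peak and reduces the chance of picking a formant-related peak
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)
    best_lag = lag_min + np.argmax(r[lag_min:lag_max])
    return fs / best_lag
```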
I think when you say “low F0 bias” you mean a bias towards picking peaks at smaller lags. That would be a bias towards picking higher values of F0. For example, we might accidentally pick a peak that corresponds to F1 in some cases.
The YIN pitch tracker (see the paper, or the open access version) performs a transformation (look for “Cumulative mean normalized difference function” in the paper) of a difference function closely related to autocorrelation, to avoid picking F1.
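For reference, that normalized difference function is only a few lines of Python; this is a sketch with my own variable names (see the YIN paper for the exact definitions):

```python
import numpy as np

def cmndf(frame, max_lag):
    """Cumulative mean normalized difference function from the YIN paper.
    d[tau] is the squared difference function; dividing by the running
    mean of d over lags 1..tau removes the spurious dip at very small
    lags, which is what helps avoid picking an F1-related peak."""
    d = np.zeros(max_lag)
    for tau in range(1, max_lag):
        diff = frame[:-tau] - frame[tau:]
        d[tau] = np.dot(diff, diff)
    dprime = np.ones(max_lag)          # d'(0) is defined as 1
    dprime[1:] = d[1:] * np.arange(1, max_lag) / np.maximum(
        np.cumsum(d[1:]), 1e-12)
    return dprime                      # F0 lag = first deep minimum
```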
This is a really ‘old school’ type of signal processing, from the days when the implementation would be in analogue hardware and would have to be causal (i.e., cannot look ahead at the rest of the signal) and real-time, by definition.
You are correct that detection can only happen in the exponentially decaying part. The blanking time is there to prevent any detections in the short period after the previous detection. It sets a lower bound on the fundamental period that can be detected (i.e., it determines the maximum F0 that can be detected). The blanking time is a parameter of the method and has to be set by the designer.
We cannot be certain that the first peak to cross the threshold will correspond to the F0 component. So, we do not expect this method to be very robust. I’m sure we could carefully tune the blanking time and the slope of the exponential decay to make it work in some cases, but it would probably be hard to find values for those two parameters that work for a wide variety of voices.
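To make that concrete, here is a toy simulation of such a detector in Python. All names and parameter values are illustrative; real analogue implementations differ in detail.

```python
def detect_pitch_marks(x, fs, blanking_ms=2.0, decay=0.999):
    """Toy peak-picking detector with a blanking time and an exponentially
    decaying threshold. Illustrative sketch only."""
    blanking = int(fs * blanking_ms / 1000)  # max detectable F0 = 1000/blanking_ms Hz
    marks, threshold, hold = [], 0.0, 0
    for n, sample in enumerate(x):
        if hold > 0:
            hold -= 1            # inside the blanking time: no detections allowed
        elif sample > threshold:
            marks.append(n)      # detection: reset the threshold, start blanking
            threshold = sample
            hold = blanking
        threshold *= decay       # threshold decays exponentially towards zero
    return marks
```

As the reply above says, making one fixed pair of (blanking_ms, decay) values work across a wide variety of voices is the hard part.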
In speech, there is essentially a one-to-one relationship between perceived pitch and the physical property of F0. That’s why we so often conflate these two terms (e.g., a “pitch tracker” is really tracking F0).
One exception to this is that listeners can perceive a fundamental frequency that is actually missing, perhaps due to transmission over an old-fashioned telephone line, or reproduction through small or low-quality loudspeakers.
It is possible to construct sounds that have a complicated relationship between the physical and perceived properties. Although these are not really relevant to speech, they are still interesting. My favourite is the “Shepard–Risset glissando” because it drives musicians crazy.
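Returning to the missing fundamental: it is easy to demonstrate for yourself. This sketch (illustrative values throughout) synthesises a tone containing harmonics 2 to 5 of 200 Hz but no energy at 200 Hz itself; listeners still report a pitch of 200 Hz.

```python
import numpy as np

# Missing-fundamental demo: harmonics 2..5 of 200 Hz, nothing at 200 Hz.
fs, f0, dur = 16000, 200.0, 1.0
t = np.arange(int(fs * dur)) / fs
tone = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(2, 6)) / 4.0
# To listen, write it out, e.g. with scipy:
# from scipy.io import wavfile
# wavfile.write("tone.wav", fs, tone.astype(np.float32))
```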
An excellent idea, and one that has indeed been proposed in the literature, specifically for the case where the fundamental is absent (e.g., speech transmitted down old-fashioned telephone lines).
What you propose is to find the greatest common divisor of a set of candidate values for F0. See http://dx.doi.org/10.1121/1.1910902 (the full text is behind a paywall: you’ll need to enter the JASA website via the University library to gain access).
This could be combined with any method of finding candidate values for F0 (e.g., autocorrelation), and we would expect some post-processing (e.g., dynamic programming) to further improve results.
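As an illustration only (a naive grid search of my own devising, not the method in the cited paper), F0 could be recovered as the approximate greatest common divisor of a set of measured harmonic frequencies like this:

```python
import numpy as np

def approx_gcd_f0(harmonics, f0_min=60.0, f0_max=400.0, step=0.5, tol=0.01):
    """Estimate F0 as the approximate greatest common divisor of measured
    harmonic frequencies. Searching from high to low avoids returning a
    subharmonic (the 'octave error'). Illustrative sketch only."""
    for f in np.arange(f0_max, f0_min, -step):
        # How far is each harmonic from an integer multiple of f?
        error = sum(abs(h / f - round(h / f)) for h in harmonics)
        if error < tol:
            return float(f)
    return None  # no plausible common divisor found

# Harmonics of 200 Hz with the fundamental itself missing:
print(approx_gcd_f0([400.0, 600.0, 800.0]))  # -> 200.0
```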
OK, I get this idea. You are proposing to ‘loop’ the contents of the window, to effectively create a longer signal.
I think the problem with this will be that when we cycle around, we will create a discontinuity in the waveform (because we don’t loop in perfect multiples of the pitch period: the window is not generally aligned to the pitch period).
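A quick numerical illustration of that seam problem (values are mine): tile a 25 ms excerpt of a 110 Hz sinusoid, which is not a whole number of periods, and look at the jump where the copies join.

```python
import numpy as np

fs, f0 = 16000, 110.0                   # 25 ms is ~2.75 periods at 110 Hz
segment = np.sin(2 * np.pi * f0 * np.arange(int(0.025 * fs)) / fs)
looped = np.tile(segment, 3)            # 'looping' the window contents
# The waveform jumps at every seam, because the window length is not an
# exact multiple of the pitch period:
print(f"jump at each seam: {abs(segment[0] - segment[-1]):.2f}")  # ~1.00
```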
The POS will also be nil for words where the pronunciation does not depend on POS. That will actually be the case for most words. Try looking up a spelling that has two possible pronunciations, differentiated by POS, such as “present”. Use the lex.lookup_all function to retrieve all entries.