Forum Replies Created
I’ve updated my previous response….
How to do detective work on the target cost?
Well, it will be forensic detective work, I think. You will need to look at the linguistic context of the target and the linguistic context of all available candidates in the database (including the one that was chosen), and then count the mismatches for each: basically, compute the target cost yourself.
I don’t recommend doing this. Looking at a single target in isolation will not tell the whole story: the candidates chosen for all the other targets have an influence on this choice, via the sequence of join costs.
The fact that units with a relatively high target cost have been chosen simply means that they are part of the lowest-overall-cost sequence. One possible reason for that is that there is only one available candidate for a given target diphone type, and so it will be always used, no matter how high the cost (e.g., even if it has “bad F0”).
The same applies to “bad duration”.
You might think that a candidate can only be an outlier if there are several other diphones of the same type. But we look at the two halves of the diphone separately. So, “outlier” is with respect to the monophone duration distribution.
The source code says:
Specifically, if the targ/cand segment type is expected to be voiced, then an f0 of zero is bad (results from poor pitch tracking).
That is, all voiced sounds should have a value for F0, as determined by the pitch tracker.
Festival’s Multisyn unit selection engine uses a pure “IFF” target cost function (an independent feature formulation, in Taylor’s terminology). It makes no explicit predictions of any acoustic properties.
The ToBI predictions made by the front end are not used in the target cost.
OK – I see. The basic target cost (the weighted sum of feature mismatches) is normalised to the 0-1 range. After that, penalties may be added for things like “bad F0” or “bad duration” and those penalties can have values such as 25 or 10.
So a target cost of, say, 10.375 is likely to be a basic cost of 0.375 plus a penalty of 10.
There is no need to store word position: it can be deduced easily on the fly by querying the utterance structure (segments have a syllable as parent, which in turn has a word as parent).
Multisyn does actually use “phrase position” in the target cost (this was omitted in the lecture slides – apologies). Here are the actual costs currently used in Multisyn:
(10 tc_stress) (5 tc_syl_pos) (5 tc_word_pos) (6 tc_partofspeech) (7 tc_phrase_pos) (4 tc_left_context) (3 tc_right_context) (25 tc_bad_f0) (10 tc_bad_duration)
where “tc_phrase_pos” looks at the match/mismatch of the phrase-break feature of the word that the current segment belongs to.
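Putting the weights and penalties together, here is a minimal sketch in Python of how such a target cost could be computed. The function and the mismatch-flag representation are mine, not Multisyn’s actual code; only the weights and penalty values are taken from the list above.

```python
# Minimal sketch of an "IFF"-style target cost, using the Multisyn
# weights and penalties quoted above. Names are hypothetical; each
# flag is 1 for a target/candidate mismatch, 0 for a match.
WEIGHTS = {
    "tc_stress": 10, "tc_syl_pos": 5, "tc_word_pos": 5,
    "tc_partofspeech": 6, "tc_phrase_pos": 7,
    "tc_left_context": 4, "tc_right_context": 3,
}
PENALTIES = {"tc_bad_f0": 25, "tc_bad_duration": 10}

def target_cost(mismatch, penalty):
    # Weighted sum of feature mismatches, normalised to the 0-1 range
    basic = sum(w * mismatch.get(f, 0) for f, w in WEIGHTS.items())
    basic /= sum(WEIGHTS.values())
    # Penalties are added after normalisation, so they dominate the total
    return basic + sum(p * penalty.get(f, 0) for f, p in PENALTIES.items())

# e.g., stress + syllable-position mismatches plus "bad duration":
# basic = (10 + 5) / 40 = 0.375, total = 0.375 + 10 = 10.375
print(target_cost({"tc_stress": 1, "tc_syl_pos": 1}, {"tc_bad_duration": 1}))
```

This reproduces the arithmetic in the earlier reply: a total cost of 10.375 decomposes into a basic cost of 0.375 plus a “bad duration” penalty of 10.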
Can you post an example of negative target costs – e.g., the output of (utt.relation.print yourutt 'Unit)
There’s no intuition of ‘high’ or ‘low’ costs – it is their value relative to the costs of alternate unit sequences that matters.
See section 3.7.1 in this paper:
Robert A. J. Clark, Korin Richmond, and Simon King. Multisyn: Open-domain unit selection for the Festival speech synthesis system. Speech Communication, 49(4):317-330, 2007. Publisher’s version, or open access version.
Let’s go over the difference between pitch tracking and pitch marking in the lecture.
For speech, we often conflate the terms “pitch” and “fundamental frequency” even though we really should not. See also this topic.
Most algorithms would use cross-correlation (also known as modified auto-correlation), even if it does need a little bit more computation. In speech synthesis, F0 estimation is typically a one-time process that happens during voice building and so we don’t care too much about a little extra computational cost, if that gives a better result.
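For illustration, a bare-bones frame-level F0 estimator along those lines might look like the following sketch. The function name and parameter values are mine; a real tracker adds a voicing decision and post-processing.

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Crude single-frame F0 estimate from the autocorrelation function.
    Illustrative only: assumes the frame is longer than fs/f0_min samples,
    and makes no voicing decision."""
    frame = frame - np.mean(frame)
    # Autocorrelation, keeping non-negative lags only
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Search only lags within the plausible F0 range: this avoids the
    # zero-lag peak and reduces the chance of picking a formant-related peak
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)
    best_lag = lag_min + np.argmax(r[lag_min:lag_max])
    return fs / best_lag
```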
I think when you say “low F0 bias” you mean a bias towards picking peaks at smaller lags. That would be a bias towards picking higher values of F0. For example, we might accidentally pick a peak that corresponds to F1 in some cases.
The YIN pitch tracker (see the paper, or the open access version) performs a transformation (look for “Cumulative mean normalized difference function” in the paper) of a difference function closely related to autocorrelation, to avoid picking F1.
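For reference, that normalized difference function is only a few lines of Python; this is a sketch with my own variable names (see the YIN paper for the exact definitions):

```python
import numpy as np

def cmndf(frame, max_lag):
    """Cumulative mean normalized difference function from the YIN paper.
    d[tau] is the squared difference function; dividing by the running
    mean of d over lags 1..tau removes the spurious dip at very small
    lags, which is what helps avoid picking an F1-related peak."""
    d = np.zeros(max_lag)
    for tau in range(1, max_lag):
        diff = frame[:-tau] - frame[tau:]
        d[tau] = np.dot(diff, diff)
    dprime = np.ones(max_lag)          # d'(0) is defined as 1
    dprime[1:] = d[1:] * np.arange(1, max_lag) / np.maximum(
        np.cumsum(d[1:]), 1e-12)
    return dprime                      # F0 lag = first deep minimum
```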
This is a really ‘old school’ type of signal processing, from the days when the implementation would be in analogue hardware and would have to be causal (i.e., cannot look ahead at the rest of the signal) and real-time, by definition.
You are correct that detection can only happen in the exponentially decaying part. The blanking time is there to prevent any detections in the short period after the previous detection. It sets a lower bound on the fundamental period that can be detected (i.e., it determines the maximum F0 that can be detected). The blanking time is a parameter of the method and has to be set by the designer.
We cannot be certain that the first peak to cross the threshold will correspond to the F0 component. So, we do not expect this method to be very robust. I’m sure we could carefully tune the blanking time and the slope of the exponential decay to make it work in some cases, but it would probably be hard to find values for those two parameters that work for a wide variety of voices.
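To make that concrete, here is a toy simulation of such a detector in Python. All names and parameter values are illustrative; real analogue implementations differ in detail.

```python
def detect_pitch_marks(x, fs, blanking_ms=2.0, decay=0.999):
    """Toy peak-picking detector with a blanking time and an exponentially
    decaying threshold. Illustrative sketch only."""
    blanking = int(fs * blanking_ms / 1000)  # max detectable F0 = 1000/blanking_ms Hz
    marks, threshold, hold = [], 0.0, 0
    for n, sample in enumerate(x):
        if hold > 0:
            hold -= 1            # inside the blanking time: no detections allowed
        elif sample > threshold:
            marks.append(n)      # detection: reset the threshold, start blanking
            threshold = sample
            hold = blanking
        threshold *= decay       # threshold decays exponentially towards zero
    return marks
```

As the reply above says, making one fixed pair of (blanking_ms, decay) values work across a wide variety of voices is the hard part.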
In speech, there is essentially a one-to-one relationship between perceived pitch and the physical property of F0. That’s why we so often conflate these two terms (e.g., a “pitch tracker” is really tracking F0).
One exception to this is that listeners can perceive a fundamental frequency that is actually missing, perhaps due to transmission over an old-fashioned telephone line, or reproduction through small or low-quality loudspeakers.
It is possible to construct sounds that have a complicated relationship between the physical and perceived properties. Although these are not really relevant to speech, they are still interesting. My favourite is the “Shepard–Risset glissando” because it drives musicians crazy.
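Returning to the missing fundamental: it is easy to demonstrate for yourself. This sketch (illustrative values throughout) synthesises a tone containing harmonics 2 to 5 of 200 Hz but no energy at 200 Hz itself; listeners still report a pitch of 200 Hz.

```python
import numpy as np

# Missing-fundamental demo: harmonics 2..5 of 200 Hz, nothing at 200 Hz.
fs, f0, dur = 16000, 200.0, 1.0
t = np.arange(int(fs * dur)) / fs
tone = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(2, 6)) / 4.0
# To listen, write it out, e.g. with scipy:
# from scipy.io import wavfile
# wavfile.write("tone.wav", fs, tone.astype(np.float32))
```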
An excellent idea, and one that has indeed been proposed in the literature, specifically for the case where the fundamental is absent (e.g., speech transmitted down old-fashioned telephone lines).
What you propose is to find the greatest common divisor of a set of candidate values for F0. See http://dx.doi.org/10.1121/1.1910902 (the full text is behind a paywall: you’ll need to enter the JASA website via the University library to gain access).
This could be combined with any method of finding candidate values for F0 (e.g., autocorrelation), and we would expect some post-processing (e.g., dynamic programming) to further improve results.
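As an illustration only (a naive grid search of my own devising, not the method in the cited paper), F0 could be recovered as the approximate greatest common divisor of a set of measured harmonic frequencies like this:

```python
import numpy as np

def approx_gcd_f0(harmonics, f0_min=60.0, f0_max=400.0, step=0.5, tol=0.01):
    """Estimate F0 as the approximate greatest common divisor of measured
    harmonic frequencies. Searching from high to low avoids returning a
    subharmonic (the 'octave error'). Illustrative sketch only."""
    for f in np.arange(f0_max, f0_min, -step):
        # How far is each harmonic from an integer multiple of f?
        error = sum(abs(h / f - round(h / f)) for h in harmonics)
        if error < tol:
            return float(f)
    return None  # no plausible common divisor found

# Harmonics of 200 Hz with the fundamental itself missing:
print(approx_gcd_f0([400.0, 600.0, 800.0]))  # -> 200.0
```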
OK, I get this idea. You are proposing to ‘loop’ the contents of the window, to effectively create a longer signal.
I think the problem with this will be that when we cycle around, we will create a discontinuity in the waveform (because we don’t loop in perfect multiples of the pitch period: the window is not generally aligned to the pitch period).
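A quick numerical illustration of that seam problem (values are mine): tile a 25 ms excerpt of a 110 Hz sinusoid, which is not a whole number of periods, and look at the jump where the copies join.

```python
import numpy as np

fs, f0 = 16000, 110.0                   # 25 ms is ~2.75 periods at 110 Hz
segment = np.sin(2 * np.pi * f0 * np.arange(int(0.025 * fs)) / fs)
looped = np.tile(segment, 3)            # 'looping' the window contents
# The waveform jumps at every seam, because the window length is not an
# exact multiple of the pitch period:
print(f"jump at each seam: {abs(segment[0] - segment[-1]):.2f}")  # ~1.00
```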
The POS will also be nil for words where the pronunciation does not depend on POS. That will actually be the case for most words. Try looking up a spelling that has two possible pronunciations, differentiated by POS, such as “present”. Use the lex.lookup_all function to retrieve all entries.