Forum Replies Created
Most algorithms would use cross-correlation (also known as modified auto-correlation), even if it does need a little bit more computation. In speech synthesis, F0 estimation is typically a one-time process that happens during voice building and so we don’t care too much about a little extra computational cost, if that gives a better result.
I think when you say “low F0 bias” you mean a bias towards picking peaks at smaller lags. That would be a bias towards picking higher values of F0. For example, we might accidentally pick a peak that corresponds to F1 in some cases.
The YIN pitch tracker (or open access version) performs a transformation (look for “Cumulative mean normalized difference function” in the paper) of the difference function, a close relative of the autocorrelation function, to avoid picking peaks at very short lags, such as one corresponding to F1.
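If it helps to see that step written down, here is a minimal NumPy sketch of the idea, assuming a single analysis frame that is longer than the longest period of interest; the F0 search range and threshold are illustrative, and the paper's fixed integration window is simplified away, so this is not the authors' reference implementation.

import numpy as np

def cmndf(frame, max_lag):
    # Difference function d(tau) over one frame (assumes len(frame) > max_lag),
    # then the cumulative mean normalised difference d'(tau) from the YIN paper.
    d = np.zeros(max_lag)
    for tau in range(1, max_lag):
        diff = frame[:len(frame) - tau] - frame[tau:]
        d[tau] = np.sum(diff ** 2)
    dprime = np.ones(max_lag)                               # d'(0) is defined as 1
    running_mean = np.cumsum(d[1:]) / np.arange(1, max_lag)
    dprime[1:] = d[1:] / np.maximum(running_mean, 1e-12)
    return dprime

def estimate_f0(frame, fs, fmin=60.0, fmax=400.0, threshold=0.1):
    # Take the first lag where d'(tau) dips below the threshold.
    max_lag = int(fs / fmin)
    min_lag = int(fs / fmax)
    dprime = cmndf(frame, max_lag)
    for tau in range(min_lag, max_lag):
        if dprime[tau] < threshold:
            return fs / tau
    return fs / (min_lag + int(np.argmin(dprime[min_lag:])))  # fall back to the global minimum

The division by the running mean pushes d'(tau) towards 1 at short lags, which is what reduces the bias towards picking high-F0 (short-lag) peaks.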
This is a really ‘old school’ type of signal processing, from the days when the implementation would be in analogue hardware and would have to be causal (i.e., cannot look ahead at the rest of the signal) and real-time, by definition.
You are correct that detection can only happen in the exponentially decaying part. The blanking time is there to prevent any detections in the short period of time after the previous detection. It is a threshold on the minimum fundamental period that can be detected (i.e., it determines the maximum F0 that can be detected). The blanking time is a parameter of the method and will have to be set by the designer.
We cannot be certain that the first peak to cross the threshold will correspond to the F0 component. So, we do not expect this method to be very robust. I’m sure we could carefully tune the blanking time and the slope of the exponential decay to make it work in some cases, but it would probably be hard to find values for those two parameters that work for a wide variety of voices.
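Purely as a thought experiment, here is a toy, causal version of such a detector in NumPy; the parameter names and default values (blanking time, decay time constant) are illustrative and not taken from any published design.

import numpy as np

def decay_threshold_detector(x, fs, blanking_ms=2.0, decay_ms=5.0):
    # After each detection the threshold is reset to the detected sample value,
    # no detections are allowed during the blanking time, and the threshold
    # then decays exponentially until the rectified waveform crosses it again.
    blank = int(blanking_ms * 1e-3 * fs)
    alpha = np.exp(-1.0 / (decay_ms * 1e-3 * fs))     # per-sample decay factor
    x = np.maximum(np.asarray(x, dtype=float), 0.0)   # half-wave rectify
    threshold = float(np.max(x))                      # start high and let it decay
    last_detection = -(blank + 1)
    detections = []
    for n, sample in enumerate(x):
        if n - last_detection > blank and sample > threshold:
            detections.append(n)
            last_detection = n
            threshold = sample
        else:
            threshold *= alpha
    periods = np.diff(detections) / fs                # gaps between detections = T0 estimates
    return detections, periods

With these toy defaults, the 2 ms blanking time caps the detectable F0 at 500 Hz, and the decay time constant controls how soon the detector becomes sensitive again, which is exactly why those two parameters are so hard to tune for a wide variety of voices.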
In speech, there is essentially a one-to-one relationship between perceived pitch and the physical property of F0. That’s why we so often conflate these two terms (e.g., a “pitch tracker” is really tracking F0).
One exception to this is that listeners can perceive a fundamental frequency that is actually missing, perhaps due to transmission over an old-fashioned telephone line, or reproduction through small or low-quality loudspeakers.
It is possible to construct sounds that have a complicated relationship between the physical and perceived properties. Although these are not really relevant to speech, they are still interesting. My favourite is the “Shepard–Risset glissando” because it drives musicians crazy.
An excellent idea, and one that has indeed been proposed in the literature, specifically for the case where the fundamental is absent (e.g., speech transmitted down old-fashioned telephone lines).
What you propose is to find the greatest common divisor of a set of candidate values for F0. See http://dx.doi.org/10.1121/1.1910902 (the full text is behind a paywall: you’ll need to enter the JASA website via the University library to gain access).
This could be combined with any way of finding candidate values for F0 (e.g., autocorrelation) and we would also expect that some post processing (e.g., dynamic programming) would further improve results.
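To illustrate the idea only (a brute-force sketch, not the method in the paper linked above): given some measured harmonic frequencies, we can search for the largest F0 whose integer multiples fit them all well, even when the fundamental itself is absent.

import numpy as np

def approximate_gcd(harmonics_hz, f0_min=50.0, f0_max=400.0, step=0.5, tol=0.01):
    # For each candidate F0, measure how far each harmonic lies from an exact
    # integer multiple, then keep the LARGEST candidate that fits well;
    # keeping the smallest would return a subharmonic (F0/2, F0/3, ...) instead.
    candidates = np.arange(f0_min, f0_max, step)
    freqs = np.asarray(harmonics_hz, dtype=float)
    errors = np.array([np.mean(np.abs(freqs / f0 - np.round(freqs / f0)))
                       for f0 in candidates])
    good = candidates[errors < tol]
    return float(good.max()) if good.size else float(candidates[np.argmin(errors)])

# Harmonics at 200, 300 and 400 Hz with the fundamental absent -> 100 Hz
print(approximate_gcd([200.0, 300.0, 400.0]))

In practice the input frequencies would come from whatever candidate generator we choose (peaks in the spectrum or in the autocorrelation function), and the post-processing mentioned above would then smooth the track across frames.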
OK, I get this idea. You are proposing to ‘loop’ the contents of the window, to effectively create a longer signal.
I think the problem with this will be that when we cycle around, we will create a discontinuity in the waveform (because we don’t loop in perfect multiples of the pitch period: the window is not generally aligned to the pitch period).
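Here is a tiny NumPy demonstration of that seam problem, with the sampling rate, F0 and window length chosen purely for illustration.

import numpy as np

fs = 16000
f0 = 110.0                               # true fundamental: period of about 145.5 samples
t = np.arange(int(0.025 * fs)) / fs      # a 25 ms window = 400 samples = 2.75 periods
frame = np.sin(2 * np.pi * f0 * t)

looped = np.tile(frame, 3)               # 'loop' the window to fake a longer signal
# The window does not contain a whole number of pitch periods, so the waveform
# jumps at every seam: here from about -1.0 straight back up to 0.0.
print(looped[len(frame) - 1], looped[len(frame)])

That jump behaves like a click: it spreads energy across the spectrum and is likely to distort the very periodicity measurement we were hoping to improve.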
The POS will also be nil for words where the pronunciation does not depend on POS. That will actually be the case for most words. Try looking up a spelling that has two possible pronunciations, differentiated by POS, such as “present”. Use the lex.lookup_all function to retrieve all entries.
That is the expected behaviour for any words that are not in the lexicon: it is not easy to predict the POS of a single unknown word taken out of context, so Festival does not attempt to do that.
Here’s another example of a proprietary markup language: Neospeech’s VTML.
Yes, you are correct that “the y-axis is the output of correlation function for a given value of lag”.
The units on the horizontal axis are simply units of time (lag can be measured in whatever units you like, most commonly samples).
The units on the vertical axis are the square of the units on a waveform. However, we tend not to write units on the vertical axis of a waveform, because the scale is usually not calibrated. We simply label the axis with, say, amplitude. If we do put units, then they will be relative, such as the bits used to store each sample, or decibels relative to the maximum possible amplitude.
Why do we need to find just one peak? Well, you are right to say that each peak corresponds to a similarity of the waveform with a shifted version of itself. But, perhaps your misunderstanding is that you are thinking of these as “moments” (in time). That’s not what they are. The horizontal axis in the autocorrelation plot is lag and not time (although the units of lag are the same as the units of time).
So, multiple peaks correspond to multiple lags, and therefore to multiple candidate values for the fundamental period (= 1/F0). Since F0 really does have exactly one value (*), we do just want to find the one correct peak.
(*) although that’s not going to be true for creaky voice or any other type of irregular vocal fold vibration
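To connect lag back to F0 with some concrete (purely illustrative) numbers:

fs = 16000             # sampling rate in Hz
best_lag = 160         # lag, in samples, of the peak we chose
T0 = best_lag / fs     # fundamental period = 0.01 s
f0 = fs / best_lag     # F0 = 1 / T0 = 100 Hz
print(T0, f0)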
OK, here’s what should be a much simpler way. Create versions of localdir_multisyn-gam.scm for each of your voices. Start Festival, and manually load each such voice definition
festival> (load ".../mylocaldirvoice1_multisyn-gam.scm")
nil
festival> (load ".../mylocaldirvoice2_multisyn-gam.scm")
nil
where you should replace “…” with the absolute path to where you have placed those files. Now you can use those voices
festival> (voice_mylocaldirvoice1_multisyn-gam)
Please wait: Initialising multisyn voice.
...etc.
festival> (SayText "This is the first voice")
...etc.
festival> (voice_mylocaldirvoice2_multisyn-gam)
Please wait: Initialising multisyn voice.
...etc.
festival> (SayText "This is the second voice")
And to do the whole thing from the beginning and then synthesise from some text with SABLE markup, start Festival and do:
festival> (load ".../mylocaldirvoice1_multisyn-gam.scm")
nil
festival> (load ".../mylocaldirvoice2_multisyn-gam.scm")
nil
festival> (tts "somefile.sable" nil)
For unit selection voices that use the multisyn engine, you need to set voice-path-multisyn and not voice-path. Try this command:
festival> (set! voice-path-multisyn "/Path/To/Your/Voice/Directory/")
and Festival will look in the location you specify, instead of the system location /Volumes/Network/courses/ss/festival/lib.incomplete/voices-multisyn/
So, whatever directory (in your own filespace) you set voice-path-multisyn to, it should have the same structure as the system directory above.
If that works, you could create a file called .festivalrc in your home directory (or edit the existing file, if you have one) that contains the command, so it is executed every time you start Festival:
(set! voice-path-multisyn "/Path/To/Your/Voice/Directory/")
Hint: by default, the Finder on a Mac will hide files whose names start with a period. Use the Terminal to see all files, where you need to use ls -a rather than just ls.
It was superseded in the sense that it never made it as far as a formalised standard (e.g., via the W3C) and instead we have various vendor-specific approaches (e.g., Microsoft’s SAPI5).
“SAPI 5 synthesis markup format is similar to the format published by the SABLE Consortium. However, this format and SABLE version 1.0 are not interoperable. At this time, it’s not determined if they will become partially interoperable in the future.” (SAPI 5.3 documentation, Microsoft)
2. a little fiddly, but possible
A voice in Festival is defined by a set of files, including some Scheme that defines the locations of the various files needed (LPCs, utts, etc). On the system here in Edinburgh, the definition of voice_localdir_multisyn-gam is here:
/Volumes/Network/courses/ss/festival/lib.incomplete/voices-multisyn/english/localdir_multisyn-gam
This is a little different to normal voices, in that it looks in the current (local) directory for the voice files, and not in Festival’s own library.
You will need to make your own copy of that directory and its contents, and modify the file
festvox/localdir_multisyn-gam.scm
to change the name of the voice (you can’t have two voices with the same name) and the paths used to find the voice data.
See how far you get, then come back for more help if you need it.