Forum Replies Created
That is the expected behaviour for any words that are not in the lexicon: it is not easy to predict the POS of a single unknown word taken out of context, so Festival does not attempt to do that.
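For example, you can query the lexicon for a single word, out of any sentence context, from the Festival prompt (a minimal sketch; “blimbling” is just a made-up word that will not be in the lexicon):

festival> (lex.lookup "blimbling" nil)

For an out-of-lexicon word like this, Festival falls back on its letter-to-sound rules to produce a pronunciation, but it does not try to guess a part of speech from the isolated word alone.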
Here’s another example of a proprietary markup language: Neospeech’s VTML.
Yes, you are correct that “the y-axis is the output of correlation function for a given value of lag”.
The units on the horizontal axis are simply time (measured in whatever units you would like to measure lag in, which is most commonly samples).
The units on the vertical axis are the square of the units on a waveform. However, we tend not to write units on the vertical axis of a waveform, because the scale is usually not calibrated. We simply label the axis with, say, amplitude. If we do put units, then they will be relative, such as the bits used to store each sample, or decibels relative to the maximum possible amplitude.
Why do we need to find just one peak? Well, you are right to say that each peak corresponds to a similarity between the waveform and a shifted version of itself. But perhaps your misunderstanding is that you are thinking of these peaks as “moments” (in time). That’s not what they are. The horizontal axis in the autocorrelation plot is lag and not time (although the units of lag are the same as the units of time).
So, multiple peaks correspond to multiple lags, and therefore to multiple candidate values for the fundamental period (= 1/F0). Since F0 really does only have exactly one value (*) we do just want to find the one correct peak.
(*) although that’s not going to be true for creaky voice or any other type of irregular vocal fold vibration
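In symbols, one common (unnormalised) definition of the autocorrelation of a frame of $N$ samples $x[n]$, at a lag of $\tau$ samples, is

$$R(\tau) = \sum_{n=0}^{N-1-\tau} x[n]\, x[n+\tau]$$

For a periodic frame, $R(\tau)$ has peaks near $\tau = T_0, 2T_0, 3T_0, \dots$ (the fundamental period and its multiples), which is why there are several candidate lags; picking the correct peak $\tau^{*}$ gives $F_0 = F_s / \tau^{*}$, where $F_s$ is the sample rate. (This is just a sketch of the idea, not the exact definition used by any particular pitch tracker.)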
OK, here’s what should be a much simpler way. Create versions of localdir_multisyn-gam.scm for each of your voices. Start Festival, and manually load each such voice definition:
festival> (load ".../mylocaldirvoice1_multisyn-gam.scm")
nil
festival> (load ".../mylocaldirvoice2_multisyn-gam.scm")
nil
where you should replace “…” with the absolute path to where you have placed those files. Now you can use those voices:
festival> (voice_mylocaldirvoice1_multisyn-gam)
Please wait: Initialising multisyn voice.
...etc.
festival> (SayText "This is the first voice")
...etc.
festival> (voice_mylocaldirvoice2_multisyn-gam)
Please wait: Initialising multisyn voice.
...etc.
festival> (SayText "This is the second voice")
And to do the whole thing from the beginning and then synthesise from some text with SABLE markup, start Festival and do:
festival> (load ".../mylocaldirvoice1_multisyn-gam.scm")
nil
festival> (load ".../mylocaldirvoice2_multisyn-gam.scm")
nil
festival> (tts "somefile.sable" nil)
For unit selection voices that use the multisyn engine, you need to set voice-path-multisyn and not voice-path. Try this command:
festival> (set! voice-path-multisyn "/Path/To/Your/Voice/Directory/")
and Festival will look in the location you specify, instead of the system location /Volumes/Network/courses/ss/festival/lib.incomplete/voices-multisyn/
So, whatever directory (in your own filespace) you set voice-path-multisyn to should have the same structure as the system directory above. If that works, you could create a file called .festivalrc in your home directory (or edit the existing file, if you have one) that contains the command, so it is executed every time you start Festival:

(set! voice-path-multisyn "/Path/To/Your/Voice/Directory/")
Hint: the Finder on a Mac will hide files whose names start with a period, by default. Use the Terminal to see all files, where you need to use ls -a rather than just ls.
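For instance, a minimal sketch from the Terminal (the directory path is a placeholder; replace it with your own voice directory):

bash$ echo '(set! voice-path-multisyn "/Path/To/Your/Voice/Directory/")' >> ~/.festivalrc
bash$ ls -a ~

The first command appends the setting to ~/.festivalrc (creating the file if it does not exist); the second lists all files in your home directory, including hidden ones, so you can confirm it is there.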
It was superseded in the sense that it never made it as far as a formalised standard (e.g., via the W3C) and instead we have various vendor-specific approaches (e.g., Microsoft’s SAPI5).
SAPI 5 synthesis markup format is similar to the format published by the SABLE Consortium. However, this format and SABLE version 1.0 are not interoperable. At this time, it’s not determined if they will become partially interoperable in the future. (SAPI 5.3 documentation, Microsoft)
2. a little fiddly, but possible
A voice in Festival is defined by a set of files, including some Scheme that defines the locations of the various files needed (LPCs, utts, etc). On the system here in Edinburgh, the definition of voice_localdir_multisyn-gam is here:
/Volumes/Network/courses/ss/festival/lib.incomplete/voices-multisyn/english/localdir_multisyn-gam
This is a little different to normal voices, in that it looks in the current (local) directory for the voice files, and not in Festival’s own library.
You will need to make your own copy of that directory and its contents, and modify the file
festvox/localdir_multisyn-gam.scm
to change the name of the voice (you can’t have two voices with the same name) and the paths used to find the voice data. See how far you get, then come back for more help if you need it.
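As a very rough sketch of those first steps from the shell (the destination name mylocaldirvoice1_multisyn-gam is just an example; adapt all paths and names to your own setup):

bash$ cp -r /Volumes/Network/courses/ss/festival/lib.incomplete/voices-multisyn/english/localdir_multisyn-gam mylocaldirvoice1_multisyn-gam
bash$ cd mylocaldirvoice1_multisyn-gam/festvox
bash$ mv localdir_multisyn-gam.scm mylocaldirvoice1_multisyn-gam.scm
# then edit mylocaldirvoice1_multisyn-gam.scm to rename the voice and update the paths it uses to find the voice data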
1. some tags need to be supported by the voice
Why will unit selection voices generally not support tags that modify pitch, duration, emphasis, etc.?
This might be because you did a cut-and-paste from this webpage, which picked up HTML versions of some characters?
Yes, this will change between voices. The format of the name of a voice is the same as you would use within Festival, minus the “voice_” prefix. Try creating a file called test.sable (make sure the suffix is .sable and that your editor doesn’t add another suffix) with contents along the lines of the sketch below, and run it through Festival like this:
bash$ festival --tts test.sable
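Here is a minimal sketch of what test.sable might contain. The SPEAKER NAME value is a placeholder: use the name of one of your own voices, without the “voice_” prefix, and note that the exact DOCTYPE line depends on which SABLE DTD your Festival installation expects.

<?xml version="1.0"?>
<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN" "Sable.v0_2.dtd" []>
<SABLE>
Changes of speaker may appear in the text.
<SPEAKER NAME="mylocaldirvoice1_multisyn-gam">
Using one speaker.
</SPEAKER>
Eventually returning to the original default speaker.
</SABLE>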
Note that SABLE was a putative standard developed a long time ago by us in Edinburgh with a few companies. It has been superseded. See also the earlier standard SSML and the related standard for interactive systems, VoiceXML.
There was an error in a path inside the text2wave script. Try again and report back. Remember to source the setup.sh first, as this sets your PATH.
(Also, post the complete error message including the full command line you are running so I can replicate the error.)
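For reference, a typical sequence looks something like this (file names are placeholders, and the location of setup.sh depends on your own setup):

bash$ source setup.sh
bash$ text2wave mytext.txt -o myoutput.wav

Posting the exact commands you ran, in this form, together with the full error output, makes it much easier to replicate the problem.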
The usual form of surprising results is that listeners didn’t hear an improvement that the designers thought they had made, or that some other aspect of the synthetic speech masked the possible improvement (e.g., the speech did sound more prosodically natural, but the waveform quality was lower, and so listeners preferred the baseline).
I’m struggling to think of any genuine positive surprises, but will keep thinking…
Yes, that would seem a reasonable conclusion. Your hypothetical MDS test has found that listeners only use prosodic naturalness to distinguish between stimuli. Either they do not hear segmental problems, or there are none (it doesn’t matter which).