Forum Replies Created
I can get SABLE to switch between voices that are not in the same directory, so clearly it can be done. I just need to know how to “declare the new voice to Festival”, or add its directory to the voice-path.
Something to do with this, presumably:
The variable voice-path contains a list of directories where voices will be automatically searched for. If this is not set it is set automatically by appending `/voices/’ to all paths in festival load-path. You may add new directories explicitly to this variable in your `sitevars.scm’ file or your own `.festivalrc’ as you wish.
I tried this:
(voice-location NAME DIR DOCSTRING)
Record the location of a voice. Called for each voice found on voice-path. Can be called in site-init or .festivalrc for additional voices which exist elsewhere.
But that didn’t work. Clearly, I need to tell Festival where my voice(s) are. The manual implies that this is a simple process, but I can’t get it to work.
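For reference, this is the sort of thing I have been putting in my ~/.festivalrc (the paths here are made up, and I am guessing about whether voice-location wants the name quoted, so please treat it as a sketch of my intent rather than something I know to be correct):

;; let Festival search my own directory for voices (path is hypothetical)
(set! voice-path (cons "/home/s1567647/my_voices/" voice-path))
;; make sure the voice's festvox/*.scm can be found when I call (voice_voice1_multisyn-gam)
(set! load-path (cons "/home/s1567647/my_voices/voice1_multisyn-gam/festvox/" load-path))
;; or, following the manual, register the one voice explicitly
;; (I am not sure whether NAME should be quoted here)
(voice-location 'voice1_multisyn-gam
                "/home/s1567647/my_voices/voice1_multisyn-gam/"
                "my first multisyn voice")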
OK, getting a bit stuck. I’ve located and copied localdir_multisyn-gam.scm, opened it and edited it to change the data paths, and changed its name to voice1_multisyn-gam.scm. Now, the simple thing to do would be to just put it back in the festvox directory that it came from, but of course I don’t have permission to do that. Now, you say “make your own copy of that directory”, OK fine, I can do that, but then how will I tell Festival to find that new directory? Doesn’t Festival know where to find everything it needs based on this script: festival>(voice_localdir_multisyn-gam)? So wouldn’t I need to edit THAT script to point it to the new directory I would make? That was my reasoning, which sent me looking for the voice_ script… and I can’t find it anywhere on the network.
Help?
OK, I actually got it to work by using the example straight out of the manual (not sure exactly what is different about that from your script). A couple of things:
1. Not all of the tags seem to work.
2. The voice switching DOES work – hooray!! Now, on to the next problem: If I’m using ‘localdir’ to identify my voice, how could I switch that to one of my other voices, which is in another directory? (I’ve put each voice in its own directory – seems like the only way to go, especially since they don’t share the same wav files). Suggestions?
I tried your example from above, and here’s what the terminal returned:
ppls-atlab-017:ss s1567647$ festival --tts test.sable
Error: Expected name, but got <space> after <
in unnamed entity at line 1 char 2 of file:/var/folders/0h/9r06nc9x49q8b8rlczy9nhk401jk1t/T//est_01054_00000
festival: text modes, caught error and tidying up
Suggestions?
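For what it’s worth, here is the kind of SABLE file I am feeding it, modelled on the example in the manual (the SPEAKER names are my guesses, since I don’t know whether it wants the full voice name or one of the generic names like male1; the error above reads to me as though the parser hit a stray space straight after the very first ‘<’ of whatever file it was actually given):

<?xml version="1.0"?>
<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN" "Sable.v0_2.dtd" []>
<SABLE>
<SPEAKER NAME="voice1_multisyn-gam">
This sentence should come out in my first voice.
</SPEAKER>
<SPEAKER NAME="localdir_multisyn-gam">
And this one should switch to my other voice.
</SPEAKER>
</SABLE>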
So… if VoiceXML is the standard for interactive systems, is there another standard for purely TTS systems? You say SABLE has been ‘superseded’. By what? And will that standard work with Festival?
In general, this seems like a HIGHLY relevant area to our course of study. Will this be covered at all during the SS course? If not, I’d like to put in a request for either one of your amazing ‘extra’ videos, or one of your amazing ‘extra’ lectures!!
Can you post a link to papers that showed the kind of surprising results you describe here? Audio examples would be even better. I would like to understand the experiment design that was used. For instance, in the second example you cite, at least based on the information you stated, I would argue that they should not have been surprised: asking anyone, even a phonetician or musician, to ‘only focus on the prosody’ and ignore the low-quality waveform (especially if the baseline actually had a higher-quality waveform!!) is, in my opinion, a flawed experimental approach.
In any experiment where the listeners DIDN’T hear an expected improvement, my first question is: why not? Was it the experiment design (as mentioned previously), or were the designers so caught up in their technical achievement that they didn’t notice that in reality the improvement was actually quite subtle from an audio perspective, or even unnoticeable to the ‘average’ listener? Has this exact situation ever happened to you, where you were working on improving a voice and you thought you had made some improvement, but your listeners couldn’t hear it? And if so, what is your explanation for why that happened?
Do the audio examples from these 2 papers still exist somewhere? Can I listen to them?
Would it be possible to use Natural Language Generation to compose sentences to make up a corpus with 100% coverage? Or, if we knew what the ‘missing’ 20% was, could we generate sentences to fill in those missing diphones?
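To make the second idea concrete, this is the kind of selection loop I am imagining (toy Python with made-up diphones and candidate sentences; the NLG step would just be a way of topping up the candidate pool):

# toy sketch: greedily pick candidate sentences that cover the most still-missing diphones
missing = {"p-aa", "aa-k", "k-t", "t-s"}            # hypothetical missing diphones
candidates = {                                      # diphones each candidate sentence contains
    "sentence A": {"p-aa", "aa-k"},
    "sentence B": {"k-t"},
    "sentence C": {"aa-k", "k-t", "t-s"},
}

chosen = []
while missing:
    best = max(candidates, key=lambda s: len(candidates[s] & missing))
    if not candidates[best] & missing:
        break                                       # no remaining candidate helps
    chosen.append(best)
    missing -= candidates[best]

print(chosen)                                       # ['sentence C', 'sentence A']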
Failing all that, could we simply hand-write additional sentences to increase our coverage?
Having struggled with this myself in the studio, I don’t see how removing the punctuation would make any difference. A sentence that seems to be worded like a question is difficult to speak without question-like prosody, whether or not there is a period at the end of the text. Why do you say this would be an improvement?
OK, let’s leave aside 2) for now (I have an entire catalog of evaluation questions, so I’ll wait until later in the course to ask those).
Focusing on disfluencies, and what I am calling ‘human-ness’: yes, I see how it would be easy enough to add specific word-like fillers to your recordings, then add them to your dictionary, and then add them to the text you want to synthesize, just like any other word. Hmm, umm, uh, err, ahh all come to mind.
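Something like this at the Festival prompt, I imagine (the nil part-of-speech tag and the phones/stress are guesses on my part and depend on the voice’s phone set, so this is only to illustrate the idea):

festival> (lex.add.entry '("umm" nil ((( ah m ) 0))))
festival> (lex.add.entry '("err" nil ((( er ) 0))))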
1) How about something that wouldn’t have a phonetic spelling, and isn’t composed of diphones, like a sigh or a deep intake of breath? Could we use a symbol in the text, like #, and tell Festival to consider this a word? And then… map this symbol directly to a particular wav file?
2) What if I wanted unexpectedness (in an attempt to increase ‘human-ness’)? Let’s say I don’t want to write ‘hmm’ and ‘uh’ in my text to be synthesized. I want the system to insert filled pauses for me, not at random, but based on some concept of where a human might actually insert them in natural speech. Could this be considered a kind of post-lexical rule, but with some probability function added to it? For example, a rule that says ‘when the word “well” is followed by the word “I”, insert in between them one of the filler words “hmm”, “uh”, and “err”, or nothing, but with different likelihoods, such that about 71% of the time nothing is inserted, about 17% of the time “hmm” is inserted, about 9% of the time “uh” is inserted, and about 3% of the time “err” is inserted’.
(link fixed)
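Just to pin down what I mean by a probability function, here is a toy version (the context test and the exact percentages are invented for the example):

import random

FILLERS = ["", "hmm", "uh", "err"]          # "" means insert nothing
WEIGHTS = [0.71, 0.17, 0.09, 0.03]

def add_filled_pauses(words):
    # insert a filled pause between "well" and "I", with the probabilities above
    out = []
    for i, w in enumerate(words):
        out.append(w)
        nxt = words[i + 1] if i + 1 < len(words) else None
        if w.lower() == "well" and nxt == "I":
            filler = random.choices(FILLERS, weights=WEIGHTS)[0]
            if filler:
                out.append(filler)
    return out

print(" ".join(add_filled_pauses("Well I suppose that could work".split())))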
That’s excellent. I wish you would extend it and show what happens when the token loops back to the beginning! But I think it’s just clicked for me. Any token that leaves the model, kills off any other tokens that left at the same time with a lower log prob, and loops back to the beginning will simply be in competition with whatever token it meets there, and by Viterbi, the winner will win and the loser will die off. This will happen many times (N-3, I suppose, for a 3-emitting-state HMM). If at any time t the token that came around from the end and looped back beats the token it meets in state one (which just went around its own self-transition loop), it will now be the current winner, at least in that one state. If it were to be the ultimate winner at the Nth turn of the handle, then whatever ‘loops back to the beginning’ it made will have been recorded as the word sequence.
The tech paper calls this the ‘Word Link Record’. Each time a token leaves a model, ready to loop back to the beginning, it adds that model’s name to its WLR. Any token that ends up the ‘winner’ of the whole sequence will have a WLR that contains every model, in order, that it passed through. Also important to note that each time the token loops back to the beginning, it clones itself so it can go out to all possible models, given the topology (which is constrained by the language model).
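Here is a toy of how I picture just the bookkeeping (it collapses a whole word model into a single step, which is not what the real frame-by-frame algorithm does, and the scores are random stand-ins; it is only meant to show the cloning, the Viterbi max at the loop-back, and the WLR growing):

import random
from dataclasses import dataclass, field

@dataclass
class Token:
    logprob: float
    wlr: list = field(default_factory=list)    # word link record: words exited so far

WORD_MODELS = ["one", "two", "junk"]            # stand-ins for whole-word HMMs

def pass_through(token, word):
    # pretend the token went through the HMM for `word`: its log prob drops by some
    # amount (random here, standing in for transition and observation scores), and on
    # exit the word is appended to a copy of its WLR
    return Token(token.logprob - random.uniform(1.0, 5.0), token.wlr + [word])

best_at_start = Token(0.0)                      # one token at the start of the looped grammar

for _ in range(4):                              # a few trips around the loop
    # clone the surviving token into every word model in parallel...
    exiting = [pass_through(best_at_start, w) for w in WORD_MODELS]
    # ...and when they loop back to the start, Viterbi keeps only the best one
    best_at_start = max(exiting, key=lambda tok: tok.logprob)

print(best_at_start.wlr)                        # the winner's WLR = the recognised word sequence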
How’s that???
Well, despite the fact that I wrote 2. down and you have deemed it correct, I still struggle with a mental model of this ‘compiled network’. Is the following statement also true:
Bearing in mind that there are always 11 models running in parallel (10 digits plus junk) every time a new HMM start state is entered: after, say, 103 turns of the handle (which is only 1.03 seconds in duration), some 1,100 HMMs are all churning away, generating observations. After 1,003 turns of the handle, there are 11,000 HMMs in action.
That seems like a lot! Is this where pruning comes in? Or have I now gone off track?
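My understanding of what beam pruning would do here, roughly (made-up scores, not the actual HTK implementation): at each turn of the handle, find the best partial log prob anywhere in the network and kill every token that falls more than some fixed beam below it.

# made-up partial log probs for the currently active tokens at one time step
active = {"state A": -120.0, "state B": -135.5, "state C": -410.2, "state D": -128.9}

BEAM = 50.0                                  # hypothetical beam width, in log-prob units
best = max(active.values())
survivors = {s: p for s, p in active.items() if p >= best - BEAM}

print(survivors)                             # state C is far behind the best and is pruned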
Following on to this topic: I have trouble understanding why the ‘single most likely state sequence’ (from Viterbi) is not in fact the best state sequence for our model. Why does considering, and weighting, every possible state sequence (from Baum-Welch) generate a better model? Yes, I can verify that this is true empirically, but I still don’t really understand why. I’m assuming the key is what you stated above: “Baum-Welch algorithm computes exact state occupancies whereas the Viterbi algorithm only computes an approximation”. The video on this on LEARN is also quite good and clear… but there’s still this gap in my understanding. Why isn’t ‘the best’ (the most likely) the best?
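Writing the definitions out to check my own understanding (please correct me if this is wrong): the Viterbi approximation assigns each frame wholly to one state, so the occupancy of state i at time t is 1 if the single best path passes through i and 0 otherwise, whereas Baum-Welch uses the soft occupancy

gamma_t(i) = P(q_t = i | O, lambda) = alpha_t(i) * beta_t(i) / sum_j alpha_t(j) * beta_t(j)

which takes every possible path into account, weighted by how likely it is.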
Thank you. By looking ahead at the slides of the upcoming lectures, and pairing that with last year’s video lectures, I was able to expand my understanding of the process.