Forum Replies Created
I emailed the Blizzard Challenge support and found the synthetic speech of previous participants here: http://www.cstr.ed.ac.uk/projects/blizzard/data.html
Hi Simon,
Hope you are well!
I am reading the Blizzard Challenge 2019 papers (http://www.festvox.org/blizzard/blizzard2019.html) and would like to know if there is a way we could listen to the synthetic voices submitted by the different teams?
Also, please kindly advise if you would prefer alumni to post questions in other venues (e.g. LinkedIn or email).
Thank you!
Thanks, Simon!
I have a follow-up question on speech intelligibility measures.
I was reviewing the Evaluation videos and noticed that the objective measures (e.g. MCD and RMSE) seem to relate to naturalness rather than intelligibility.
Is there a way to measure intelligibility “objectively”? Thanks!
I did the Blizzard listening test today and have a couple of questions about WER.
How should WER be calculated for Mandarin words that originally come from other languages? In Mandarin, many different characters sound exactly the same.
For example, “Victoria Falls” could be written as any of the following (and many more combinations, at least for the transliteration of “Victoria”):
1. 维多利亚瀑布,
2. 维多莉亚瀑布,
3. 维多莉雅瀑布,
4. 維多利亞瀑布.
Specifically, can we use WER to measure the intelligibility of all (or most) languages?
Thanks!
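On the WER question above: WER is the Levenshtein (edit) distance between reference and hypothesis, divided by the reference length; for Mandarin it is usually computed per character (character error rate, CER). A minimal sketch — note that homophone variants such as 利/莉 still count as errors unless multiple reference spellings are accepted, or scoring is done on pronunciations (e.g. pinyin), which would need a pronunciation lexicon not shown here:

```python
# Sketch: word/character error rate via edit distance.
# WER = (substitutions + deletions + insertions) / reference length.

def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions
    needed to turn ref into hyp (single rolling-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(prev + (r != h),  # substitution (or match)
                      d[j] + 1,         # deletion of r
                      d[j - 1] + 1)     # insertion of h
            prev, d[j] = d[j], cur
    return d[-1]

def error_rate(ref_tokens, hyp_tokens):
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

# English WER: tokenise on whitespace.
print(error_rate("the cat sat".split(), "the cat sat down".split()))
# Mandarin CER: tokenise per character; 利/莉 and 亚/雅 count as
# 2 substitutions out of 6 reference characters.
print(error_rate(list("维多利亚瀑布"), list("维多莉雅瀑布")))
```

So for languages without whitespace word boundaries, character- or syllable-level error rate is the usual substitute for word-level WER.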
In the “Front end” video in Deep Learning for Text-to-Speech Synthesis, using the Merlin toolkit, Mr. Watts mentioned that we need to find a suitable TTS front end for languages that are not supported by Festival, for example Thai, Korean, Japanese, Cantonese and Mandarin.
Is there a recommended or efficient way (or place/platform) to conduct that search, other than googling?
Thanks!
Thanks Simon!
A couple of follow-up questions on the overall WER for a system:
1. When testing the intelligibility of a system, sentences of various lengths will surely be used. Should we use a weighted average (taking sentence length into consideration) to calculate the overall WER for the system?
2. When reporting WER for different sentences within a system, is it a good idea to include the reference sentence in the results?
3. I am calculating WER manually, and wonder if this is why many listening tests recruit fewer than 20 listeners?
Thanks!
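On question 1 above: the standard convention (used by tools such as sclite) is to report total errors divided by total reference words, which is exactly a length-weighted average of the per-sentence WERs. A minimal sketch with illustrative numbers (not real data) showing how the two averages differ:

```python
# Sketch: corpus-level WER. Pooling errors over the whole corpus is
# the same as length-weighting the per-sentence WERs; the unweighted
# mean over-counts short sentences. Numbers below are illustrative.
results = [
    (2, 10),   # (errors, reference length) for sentence 1
    (0, 3),    # sentence 2
    (4, 20),   # sentence 3
]

unweighted = sum(e / n for e, n in results) / len(results)
corpus_wer = sum(e for e, _ in results) / sum(n for _, n in results)

print(f"mean of per-sentence WERs: {unweighted:.3f}")
print(f"corpus (pooled/weighted) WER: {corpus_wer:.3f}")
```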
Hi,
I am curious about how WER is calculated in the Blizzard Challenge. Is it done by a human marker? Or is it done by sclite together with a human marker? Thanks!
Thanks, Simon!
The synthesised utterances are rather short, ranging from 3 to 7 seconds.
I want to add a short silence at the beginning of each sentence for the intelligibility tests, so that participants have perhaps 1 second (or less) of “buffering” time to get ready.
I am not sure if this is overthinking, but that is why I want to test it out.
Is there a way to add a short silence at the beginning and end of a synthesised utterance? I tried adding a colon or a full stop at the beginning of the sentence, but it doesn’t work.
((R:Token.parent.punc in ("?" "." ":"))
((BB))
((R:Token.parent.punc in ("'" "\"" "," ";"))
((B))
Thanks!
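An alternative to coaxing Festival's phrasing rules into producing leading silence is to pad the saved waveform afterwards. A minimal sketch using Python's standard-library wave module, assuming PCM WAV output (as Festival produces with 'riff); the filenames are examples:

```python
import wave

def pad_with_silence(src, dst, lead_s=1.0, trail_s=0.5):
    """Copy a PCM WAV file, writing silence (zero samples) before
    and after the original audio."""
    with wave.open(src, "rb") as win:
        params = win.getparams()
        frames = win.readframes(win.getnframes())
    bytes_per_frame = params.sampwidth * params.nchannels
    lead = b"\x00" * (int(lead_s * params.framerate) * bytes_per_frame)
    trail = b"\x00" * (int(trail_s * params.framerate) * bytes_per_frame)
    with wave.open(dst, "wb") as wout:
        wout.setparams(params)
        wout.writeframes(lead + frames + trail)

# Example (filenames are illustrative):
# pad_with_silence("myutt.wav", "myutt_padded.wav", lead_s=1.0)
```

This keeps the synthesis step untouched, so the same padded stimuli can be regenerated for every system under test.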
Hi,
I know we can save a synthesised waveform with the commands below, but is there a way to save multiple waveforms in a batch (i.e. 20-30 sentences for one system)?
festival> (set! myutt (SayText "Hello world."))
festival> (utt.save.wave myutt "myutt.wav" 'riff)
Thanks!
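For batch synthesis, Festival ships a command-line wrapper, text2wave (one text file in, one WAV out), so you can loop over sentences from a script instead of saving each utterance interactively. A sketch that only builds the commands — the sentences and filenames are illustrative, and actually running them requires Festival to be installed:

```python
# Sketch: batch synthesis by building one `text2wave` command per
# sentence. `text2wave` is Festival's command-line synthesis tool.
import os
import subprocess
import tempfile

sentences = ["Hello world.", "This is sentence two."]  # example inputs

def build_jobs(sentences, outdir):
    """Write each sentence to its own text file and return
    (command, text_file) pairs, one per sentence."""
    jobs = []
    for i, text in enumerate(sentences, 1):
        txt = os.path.join(outdir, f"sent{i:02d}.txt")
        with open(txt, "w") as f:
            f.write(text + "\n")
        wav = os.path.join(outdir, f"sent{i:02d}.wav")
        jobs.append((["text2wave", "-o", wav, txt], txt))
    return jobs

outdir = tempfile.mkdtemp()
for cmd, _ in build_jobs(sentences, outdir):
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment with Festival installed
```

Alternatively, the same loop can be written directly in Festival's Scheme, wrapping SayText and utt.save.wave in a function applied over a list of sentences.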
J&M 2nd Edition, 9.3.4 (9.11):
Could you please elaborate a bit more on the Hamming window? Where do 0.54 and 0.46 come from? Are they fixed numbers, or could they vary depending on design/preference? Thanks!

J&M 2nd Edition, 9.3.4 (9.14) mentions that the mel frequency m can be computed from the raw acoustic frequency as follows:
mel(f) = 1127 ln(1 + f/700)
Could you please explain what f and ln stand for, respectively? Also, where do 1127 and 700 come from? Thanks!

J&M 2nd Edition, 9.3.3, page 299 mentions that the fast Fourier transform, or FFT, is very efficient but only works for values of N that are powers of 2.
Could you please explain why it only works for values of N that are powers of 2?
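The constants asked about above can be checked numerically. In the Hamming window, w[n] = 0.54 - 0.46 cos(2*pi*n / (N-1)), the 0.54/0.46 pair are fixed design constants (chosen to suppress the window's sidelobes; other windows make other choices, e.g. the Hann window uses 0.5/0.5). In the mel formula, f is the frequency in Hz, ln is the natural logarithm, and 1127 and 700 are constants fitted to perceptual data. A minimal sketch:

```python
import math

def hamming(n, N):
    """Hamming window sample: 0.54 and 0.46 are fixed design
    constants; a Hann window would use 0.5 and 0.5 instead."""
    return 0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))

def mel(f):
    """Mel scale of a frequency f in Hz; ln is the natural log,
    and 1127 and 700 are perceptually fitted constants."""
    return 1127 * math.log(1 + f / 700)

print(hamming(0, 401))    # edge of the window: 0.54 - 0.46 ≈ 0.08
print(hamming(200, 401))  # centre of the window: 0.54 + 0.46 = 1.0
print(mel(1000))          # ≈ 1000, i.e. 1000 Hz maps to about 1000 mel
```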
In Speech Synthesis and Recognition, page 124 on continuous speech recognition, the caption of figure 8.9 explains that three template sequences are being considered: T1-T3-T1-T3, T1-T3-T3-T1 and T1-T1-T1-T1. However, judging from the illustration, I read the three sequences as T1-T3-T1-T2, T1-T3-T3-T2 and T1-T3-T3-T3. Could you please explain how to read this trace-back chart correctly? Thanks!
Attachments: