Tools for running a web-based listening test
February 5, 2016 at 11:36 #2431
If you don’t fancy writing your own CGI script, there are plenty of other ways to run an online listening test. Some of them will encode your audio files as mp3, or require that you do that before uploading the files. Whilst this is not ideal, if you use a high bitrate (128kbps or higher) then it will not be a problem in most cases. Listen to the raw waveforms and the encoded versions carefully using good headphones, to see if you hear a significant difference.
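If you need to do the encoding yourself, here is a minimal sketch (my own, not from any particular tool) that batch-encodes every wav file in a directory at 192 kbps, comfortably above the suggested minimum, by calling the lame encoder from Python; the directory names and bitrate are just assumptions, and ffmpeg or a plain shell loop would work equally well.
# a minimal sketch, assuming the "lame" encoder is installed and your
# stimuli live in wav/ ; encodes each file at 192 kbps into mp3/
import subprocess
from pathlib import Path
Path("mp3").mkdir(exist_ok=True)
for wav in sorted(Path("wav").glob("*.wav")):
    mp3 = Path("mp3") / (wav.stem + ".mp3")
    subprocess.run(["lame", "-b", "192", str(wav), str(mp3)], check=True)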
-
February 5, 2016 at 11:37 #2432
Windows-only software suggested by someone in CSTR
http://www.wondershare.com/pro/quizcreator.html
which creates a test in Flash. Not the best choice, but it works fine for people with no knowledge of web pages.
-
February 5, 2016 at 11:37 #2433
-
February 5, 2016 at 11:47 #2434
A simple option is Google Forms: http://www.google.co.uk/forms/about/
although it has to be used in a bit of a roundabout way; this screencast shows how: http://screencast-o-matic.com/watch/c2QbINnWz3
-
February 7, 2016 at 14:36 #2556
If you want to try an objective measure (perhaps to see if it correlates with your listeners’ judgements), here’s a Python implementation of Mel Cepstral Distortion (MCD) by Matt Shannon.
This requires some skill in compiling code (if you do this, please post information here); it is entirely optional and certainly not required for this assignment.
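For reference, the distortion itself is a simple formula. Here is a rough numpy sketch of the per-utterance calculation (not Matt Shannon's implementation), assuming you already have two frame-aligned mel-cepstral sequences of the same length; real implementations usually align the natural and synthetic sequences with dynamic time warping first and, as below, exclude the 0th (energy) coefficient.
import numpy as np
def mcd_db(ref_mcep, syn_mcep):
    # ref_mcep, syn_mcep: arrays of shape (frames, coefficients), frame-aligned
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]           # drop the 0th (energy) coefficient
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(per_frame)  # mean distortion in dB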
-
March 22, 2016 at 12:17 #2842
-
March 24, 2019 at 23:51 #9735
Hi,
I know we can save a synthesized waveform with the commands below, but is there a way to save multiple waveforms in a batch (e.g. 20-30 sentences for one system)?
festival> (set! myutt (SayText "Hello world."))
festival> (utt.save.wave myutt "myutt.wav" 'riff)
Thanks!
-
March 25, 2019 at 10:37 #9736
One easy way is to write a very simple program (Python or shell script) to generate a file that contains something like
(voice_localdir_multisyn-rpx)
(set! myutt (SayText "Hello world."))
(utt.save.wave myutt "sentence001.wav" 'riff)
(set! myutt (SayText "Here is the next sentence."))
(utt.save.wave myutt "sentence002.wav" 'riff)
Your program should save this into a file, perhaps called
generate_test_sentences.scm
and then you can execute that in Festival simply by passing it on the command line like this:
$ festival generate_test_sentences.scm
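If it helps, here is one way to write that generator program in Python; the input file name test_sentences.txt (one sentence per line) is just an assumption, so adapt it and the voice name to your own setup.
# writes generate_test_sentences.scm from a plain text file of sentences
with open("test_sentences.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]
with open("generate_test_sentences.scm", "w") as scm:
    scm.write("(voice_localdir_multisyn-rpx)\n")
    for i, text in enumerate(sentences, 1):
        text = text.replace('"', '\\"')   # escape any double quotes for Scheme
        scm.write('(set! myutt (SayText "%s"))\n' % text)
        scm.write('(utt.save.wave myutt "sentence%03d.wav" \'riff)\n' % i)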
-
March 28, 2019 at 14:30 #9751
Is there a way to add a short silence at the beginning and the end of a synthesised utterance? I tried adding a colon or a full stop at the beginning of the sentence, but it doesn't work.
((R:Token.parent.punc in ("?" "." ":"))
((BB))
((R:Token.parent.punc in ("'" "\"" "," ";"))
((B))
Thanks!
-
March 28, 2019 at 14:46 #9752
Why do you need to do this?
One option would be just to add some silence using sox.
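For example, this small Python sketch calls sox to pad every wav file with one second of silence at the start and half a second at the end; the directory names and durations are just placeholders (a plain shell loop over sox would work equally well).
# assumes sox is installed; pads wav/*.wav into padded/*.wav
import subprocess
from pathlib import Path
Path("padded").mkdir(exist_ok=True)
for wav in sorted(Path("wav").glob("*.wav")):
    out = Path("padded") / wav.name
    # "pad 1.0 0.5" adds 1.0 s of silence at the start and 0.5 s at the end
    subprocess.run(["sox", str(wav), str(out), "pad", "1.0", "0.5"], check=True)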
-
March 28, 2019 at 15:27 #9753
Thanks Simon!
The synthesised utterances are rather short, ranging from 3 to 7 seconds.
I want to add a short silence at the beginning of each sentence for the intelligibility tests, so that there is perhaps a second (or less) of "buffering" time for participants to get ready.
I am not sure if this is overthinking, but that's why I want to test it out.
-
March 28, 2019 at 18:08 #9756
You are probably overthinking. If the audio plays correctly in whatever tool you use to implement the listening test, then there is no reason to pad with extra silence.
-
April 1, 2019 at 23:49 #9766
I made an attempt using BeaqleJS. If you know something about Linux server management or Docker, it is straightforward. However, there are some potential bugs: one I found is that I can't enable randomisation of the AB order in preference tests (i.e. presenting AB or BA with probability 0.5 each). I would like to share the following instructions and issues that might be relevant.
The following link is my test; it will be available until April 15th 2019.
http://148.70.111.22/
Deployment:
1. First, you need a server. I suggest Vultr because you can create and delete a server at any time, i.e. you don't need to pay for a whole month. A minimal hardware configuration is sufficient.
2. BeaqleJS is mainly JavaScript with HTML, so any HTTP server is sufficient. If you don't want to use Docker:
2.1 Install an HTTP server (Apache or Nginx).
2.2 Enable the HTTP service on your machine (I assume you use Linux):
$ sudo systemctl enable httpd && sudo systemctl start httpd
Note: the above line differs depending on which Linux distribution you are using.
2.3 Configure your server: open the HTTP port to the outside, set the correct user and group, add the site configuration, etc. You can find instructions online.
2.4 Copy the whole of BeaqleJS to /var/www/html (I assume this is your default site directory):
$ git clone https://github.com/HSU-ANT/beaqlejs.git
$ sudo cp -r beaqlejs/* /var/www/html
2.5 Create an audio directory under your site directory, and remember to give appropriate access to the audio files. I simply gave full access (not safe!):
$ cd /var/www/html && sudo mkdir audio && sudo chmod 777 audio
2.6 Copy your audio files to the audio directory; scp or rsync will do the job:
$ scp -r audio/ username@server:/var/www/html/
2.7 Create a config file for your test case. You can simply copy one from beaqlejs/config and modify the choices. The GitHub page gives detailed instructions:
https://github.com/HSU-ANT/beaqlejs
2.8 Edit index.html to choose which test you want to use:
$ vim index.html
line 17:
<script src="config/example_config_mushra.js" type="text/javascript"></script>
Choose your own config file to replace the example above.
line 35: testHandle = new MushraTest(TestConfig);
Three variants are provided: MUSHRA, ABX, and Preference (AB test). If necessary, define your own test class as described on their GitHub page.
2.9 BeaqleJS supports online submission. You first need to enable PHP support in your HTTP server, then set EnableOnlineSubmission and BeaqleServiceURL in your config file. The submission server is defined in beaqlejs/web_service/beaqleJS_Service.php. You also need to give read and execute access to the php file. The test results will be stored under the web_service/result/ directory as txt files. For more detail, please refer to their GitHub page.
3. If you are a Docker user:
3.1 They provide a sample docker-compose file at beaqlejs/tools/Docker/
3.2 If you prefer a raw docker command, you can simply use:
$ docker run -d -p 80:80 -v /path/to/BeaqleJS:/var/www/html --name beaqlejs-server php:7.0-apache
$ docker start/stop beaqlejs-server
3.3 You can modify the php:7.0-apache image according to your needs.
3.4 The other configuration is the same as 2.4-2.9; do remember to enable access to your audio and php directories on the host machine.
Known Issues:
1. You can take the test on a mobile device. However, some Chinese Android phones use a customised web browser based on an older version of Chrome, which might have issues with the test.
2. On a poor network connection, BeaqleJS can fail to load an audio file and give the error "not able to access audio file"; sometimes the error message is simply confusing.
3. For the preference test (AB test), it is not able to randomise the AB order.
A workaround: modify line 1375 of the js/beaqle.js file.
Comment out this line:
// if (this.TestConfig.RandomizeFileOrder && RandFileNumber>0.5) {
and replace it with:
if (RandFileNumber>0.5) {
The original condition uses a variable, RandomizeFileOrder, that is never defined, so it always prevents the AB order from being randomised. I have opened an issue but don't know when the developer will fix it. You can also modify this to suit your own needs.
-
April 7, 2019 at 23:13 #9775
Hi,
I am curious about how WER is calculated in the Blizzard Challenge. Is it done by a human marker, or by sclite along with a human marker? Thanks!
-
April 8, 2019 at 09:45 #9776
The Blizzard Challenge uses a standard dynamic programming approach to align the reference with what the listener transcribed – very much like HResults from HTK or sclite. WER is then calculated in the usual way, summing up insertions, deletions and substitutions and dividing by the total number of words in the reference.
The procedure is slightly enhanced for Blizzard to allow for listeners’ typos (which are defined in a manually-created lookup table, updated for each new test set used once we see the typical mistakes they make for those particular sentences).
For your listening tests, I recommend manually correcting any typos, then either computing WER manually, or using HResults – that’s just a matter of getting things in the right file format. Your reference would be in an MLF and your listener transcriptions would be in .rec files.
Whilst we are on this topic, this is a good time to remember that in general you cannot compute WER per sentence, then average over all sentences. This is only valid if all sentences have the same number of words (in the reference).
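To see why this matters: with two sentences of 10 and 2 reference words containing 2 and 1 errors respectively, the pooled WER is 3/12 = 25%, whereas averaging the per-sentence WERs would give (20% + 50%) / 2 = 35%. If you do end up computing WER yourself rather than with HResults, here is a minimal Python sketch of that dynamic-programming alignment and the pooled calculation; the pairs of reference and transcribed word lists are assumed to come from your own test data.
def align_counts(ref, hyp):
    # dynamic-programming alignment of reference words against a listener
    # transcription; returns (substitutions, deletions, insertions)
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),
                          d[i - 1][j] + 1,
                          d[i][j - 1] + 1)
    i, j, S, D, I = n, m, 0, 0, 0
    while i > 0 or j > 0:   # backtrack to count each error type
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] != hyp[j - 1]:
                S += 1
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            D, i = D + 1, i - 1
        else:
            I, j = I + 1, j - 1
    return S, D, I
def corpus_wer(pairs):
    # pairs: list of (reference_words, transcribed_words), one pair per sentence;
    # errors are pooled over the whole test set before dividing
    S = D = I = N = 0
    for ref, hyp in pairs:
        s, d, ins = align_counts(ref, hyp)
        S, D, I, N = S + s, D + d, I + ins, N + len(ref)
    return (S + D + I) / N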
-
April 8, 2019 at 12:18 #9777
Thanks Simon!
A couple of follow-up questions on the overall WER for a system:
1. When testing the intelligibility of a system, surely sentences of various lengths might be used. Should we use a weighted average (taking sentence length into consideration) to calculate the overall WER for the system?
2. When reporting WER for different sentences within a system, is it a good idea to include the reference sentence in the results?
3. I am calculating WER manually and wonder if this is why many listening tests recruit fewer than 20 listeners?
Thanks!
-
April 8, 2019 at 17:02 #9778
1. A weighted average WER is correct, but the simplest way to calculate this is just to sum up insertions, deletions and substitutions across the entire test set, then divide by the total number of words in the reference.
2. I’m not sure you would often want to report WER for individual sentences – this will be highly variable and likely to be based on very few samples. You would need to have a specific reason to report (and analyse) per-sentence WER.
3. No, that’s not the reason! It’s easy to automate WER calculation. Published work using too few listeners just indicates lazy experimenters!
-
June 28, 2019 at 21:25 #9786
I did the Blizzard listening test today and have a couple of questions about WER.
How do you calculate WER in Mandarin for words that originally come from other languages? In Mandarin, many different characters sound exactly the same.
For example, the characters for Victoria Falls could be written in any of the following ways (and many more combinations, at least for the transliteration of "Victoria"):
1. 维多利亚瀑布,
2. 维多莉亚瀑布,
3. 维多莉雅瀑布 ,
4. 維多利亞瀑布.
Specifically, can we use WER to measure the intelligibility of all (or most) languages?
Thanks!
-
June 30, 2019 at 11:32 #9787
The Blizzard Challenges in 2008, 2009, and 2010 included tasks on Mandarin Chinese and the summary papers for these years (available from http://festvox.org/blizzard/index.html) tell you about the two measures used: pinyin error rate with or without tone.
The general answer is: no, Word Error Rate is not the most useful measure of intelligibility for all languages.
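As a rough illustration (not the official Blizzard scoring code), one could compute a pinyin error rate by converting both the reference and the listener's transcription to pinyin and aligning the syllable sequences; this sketch assumes the third-party pypinyin package, and including or omitting tone numbers gives the two variants.
from pypinyin import lazy_pinyin, Style
def edit_distance(ref, hyp):
    # classic Levenshtein distance over token lists
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]
def pinyin_error_rate(reference, transcription, with_tone=True):
    style = Style.TONE3 if with_tone else Style.NORMAL
    ref = lazy_pinyin(reference, style=style)
    hyp = lazy_pinyin(transcription, style=style)
    return edit_distance(ref, hyp) / len(ref)
# homophones map to the same pinyin, so 利 vs 莉 is not counted as an error;
# a tone difference such as 亚 vs 雅 only counts when with_tone=True
print(pinyin_error_rate("维多利亚瀑布", "维多莉雅瀑布", with_tone=False))
print(pinyin_error_rate("维多利亚瀑布", "维多莉雅瀑布", with_tone=True))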
-
July 9, 2019 at 00:08 #9788
Thanks Simon!
I have a follow-up question on speech intelligibility measures.
I was reviewing the Evaluation videos and noticed that the objective measures (e.g. MCD and RMSE) seem to be related to naturalness rather than intelligibility.
Is there a way to measure intelligibility "objectively"? Thanks!
-
July 9, 2019 at 09:01 #9789
People have tried using automatic speech recognition to evaluate the intelligibility of synthetic speech, but with only limited success. So the simple answer is that there is no objective measure of intelligibility.
-
April 20, 2020 at 11:18 #11196
Hi,
I’m trying to analyse my intelligibility results by calculating WER with HResults, but I’m not quite sure how to go about it. I know that I need a reference.mlf file, and some rec files, but I don’t know what HMM list to use. Currently, my HResults command looks like:
HResults -p \
-I ./reference.mlf \
rec/intel*
My rec files are called intel[1-10].lab, and my MLF looks like:
#!MLF!#
"/*intel1.lab"
transcription 1
.
"/*intel2.lab"
transcription 2
.
etc.
When I run this, it says "No transcriptions found."
Thanks.
-
April 20, 2020 at 14:50 #11198
Format of the reference MLF should be:
#!MLF!#
"*/intel1.lab"
word1
word2
word3
.
"*/intel2.lab"
word1
word2
.
with a final newline at the end of the file. The format of each rec file should also be one word per line (and you might need dummy start/end time?) – look at your rec files from the Speech Processing digit recogniser assignment.
-
April 20, 2020 at 16:18 #11199
Thanks for your reply.
I’ve now changed the format of my MLF, and I’ve looked at some dummy times to put in my rec files. They currently have the format:
0 5100000 word1 -3050.923340
0 5100000 word2 -3050.923340
I don't really understand the purpose of the dummy times, but I still get the "No transcriptions found" error. If possible, is there a resource I could use to learn how to calculate WER manually?
Thanks again.
-
April 21, 2020 at 12:56 #11200
HResults needs another file, listing the valid labels, so you need to do:
$ HResults -p -I ./reference.mlf wordlist rec/intel*.rec
where the file wordlist contains a list of all possible words that could be found in the transcriptions or rec files.
I’ve also checked, and the dummy timestamps are not necessary: the rec files can just have one word per line.
Here’s one way to make the wordlist file, assuming rec files with one word per line and no timestamps:
$ cat reference.mlf rec/intel*.rec | egrep -v '#|"|\.' | sort -u > wordlist
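For illustration, a matching rec file such as rec/intel1.rec can then simply contain the listener's transcription one word per line (word1, word2, word3 are just placeholders):
word1
word2
word3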
-
April 21, 2020 at 16:07 #11201
Thanks a lot for your help, it works now!
-