Forum Replies Created
February 7, 2016 at 12:13 in reply to: Output quality: common evaluation framework & materials (SUS) #2547
The reason for using SUS is, of course, to avoid a ceiling effect in intelligibility. But you are not the first person(*) to suggest that SUS are highly unnatural and to wonder how much a SUS test actually tells us about real-world intelligibility.
A slightly more ecologically valid example would be evaluating the intelligibility of synthetic speech in noise, where SUS would be too difficult and ‘more normal’ sentences could be used instead. But such tests are still generally done in the lab, with artificially-added noise. They could hardly be called ecologically valid.
You ask whether “there [are] intelligibility tests using situations that mimic the desired applications?” This would certainly be desirable, and commercial companies might do this as part of usability testing. Unfortunately, mimicking the end application is a lot of work, which makes the test slow and expensive. Once we start evaluating the synthetic speech as part of a final application, it also gets harder to separate out the underlying causes of users’ responses. At this point we reach the limits of my expertise; it would be better to ask an expert, such as Maria Wolters.
* Paul Taylor always told me he was very sceptical of SUS intelligibility testing. He asserted that all commercial systems were already at ceiling intelligibility in real-world conditions, so there was no point measuring it; researchers should focus on naturalness instead. I agree with him as far as listening in quiet conditions is concerned, but synthetic speech is certainly not at ceiling intelligibility when heard in noise.
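As an aside on how noise is typically added for such tests: below is a rough sketch in Python of mixing noise into a synthetic sentence at a chosen signal-to-noise ratio. The file names and the choice of 0 dB SNR are placeholders for illustration, not recommendations, and any WAV I/O library would do.

```python
# Rough sketch: mix noise into a synthetic sentence at a chosen SNR,
# as might be done when preparing stimuli for an intelligibility-in-noise test.
# File names and the 0 dB target are placeholders, not recommendations.

import numpy as np
import soundfile as sf

speech, sr = sf.read("synthetic_sentence.wav")
noise, noise_sr = sf.read("babble_noise.wav")
assert sr == noise_sr, "resample the noise first if the sample rates differ"

# Assume the noise recording is at least as long as the speech
noise = noise[:len(speech)]

target_snr_db = 0.0
speech_power = np.mean(speech ** 2)
noise_power = np.mean(noise ** 2)

# Scale the noise so that 10*log10(speech_power / scaled_noise_power) equals the target SNR
scale = np.sqrt(speech_power / (noise_power * 10 ** (target_snr_db / 10)))
stimulus = speech + scale * noise

sf.write("stimulus_snr0dB.wav", stimulus, sr)
```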
February 7, 2016 at 12:00 in reply to: Output quality: common evaluation framework & materials (SUS) #2546
In the field of speech coding there are standardised tests and methodologies for evaluating codecs. This standardisation is driven by commercial interests: both the companies that invent new codecs and those that use them (e.g., telecoms or broadcasters).
But in speech synthesis there appears to be no commercial demand for equivalent standardised tests. Commercial producers of speech synthesisers never reveal the evaluation results for their products (the same is true of automatic speech recognition).
There are, however, conventions and accepted methods for evaluation that are widely used in research and development. SUS is one such method and is fairly widely used (although Word Error Rate, rather than Sentence Error Rate, is usually reported).
The Blizzard Challenge is the only substantial effort to make fair comparisons across multiple systems. The listening test design in the Blizzard Challenge is straightforward (it includes a section of SUS) and is widely used by others. The materials (speech databases + text of the test sentences) are publicly available and are also quite widely used. This is a kind of de facto standardisation.
There are some examples for the Speech Communication paper.
This is potentially confusing – and we don’t want to get hung up on terminology. I’ve added some clarification.
The reasons for avoiding very long sentences in the prompts for recording a unit selection database are:
- they are hard to read out without the speaker making a mistake
- the proportion of phrase-initial and phrase-final diphones is low
Short sentences might be avoided because they have unusual prosody, and so units from short phrases (e.g., “Hi!”) may not be very suitable for synthesising ‘ordinary’ sentences.
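For illustration only, here is a rough sketch in Python of how one might filter candidate prompt sentences by length when selecting a recording script. The word-count thresholds are arbitrary assumptions, not recommended values.

```python
# Sketch: filter candidate prompts for a unit selection recording script.
# The length thresholds below are arbitrary illustrative values.

MIN_WORDS = 5   # very short sentences may have unusual prosody
MAX_WORDS = 20  # very long sentences are hard to read out without mistakes

def keep_prompt(sentence):
    """Keep sentences that are neither very short nor very long."""
    n_words = len(sentence.split())
    return MIN_WORDS <= n_words <= MAX_WORDS

candidates = [
    "Hi!",
    "The quick brown fox jumps over the lazy dog.",
    "An extremely long sentence " + "with many many clauses " * 10,
]
prompts = [s for s in candidates if keep_prompt(s)]
print(prompts)  # only the middle sentence survives
```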
This is a good point, and one we do indeed have to confront in a practical test. In a SUS test, we compare words (not their pronunciations). It is therefore appropriate to allow listeners to type in homophones, or indeed to mis-spell words.
There is usually either some pre-processing of the typed-in responses, before we compute the Word Error Rate (WER), or we allow for these mismatches when performing the dynamic programming alignment as part of the WER computation. This might be achieved by creating lists of acceptable matches for each word in the correct transcription, such as
correct word: your
allowable responses: your, you’re, yore, youre
Such lists need updating for each listening test (after gathering the listeners’ responses) because listeners seem to be very good at finding new ways to mis-spell or mis-type words!
I’ve attached an example list of acceptable variants for a set of Semantically Unpredictable Sentences, taken from the tools used to run the Blizzard Challenge.
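For illustration, here is a minimal sketch (in Python) of how such a list of acceptable variants might be used when computing WER. The variant table and function names are made up, and this is not the actual Blizzard Challenge scoring code.

```python
# Minimal sketch of scoring one typed SUS response against the reference,
# allowing homophones and known mis-spellings. The variant list and function
# names are hypothetical; this is not the Blizzard Challenge code.

# One entry per reference word, updated after each listening test
ACCEPTABLE = {
    "your": {"your", "you're", "yore", "youre"},
}

def is_match(ref_word, resp_word):
    """A response word is correct if it is an acceptable variant of the reference word."""
    variants = ACCEPTABLE.get(ref_word.lower(), {ref_word.lower()})
    return resp_word.lower() in variants

def word_error_rate(reference, response):
    """Word Error Rate via dynamic programming (Levenshtein) alignment over words."""
    ref, hyp = reference.split(), response.split()
    # d[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if is_match(ref[i - 1], hyp[j - 1]) else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# The homophone "you're" does not count as an error:
print(word_error_rate("leave your coat", "leave you're coat"))  # prints 0.0
```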
Packaging a speech synthesis voice
Some operating systems provide a way to plug in your own voice, and make it available to all applications on that computer. In Windows, this is called SAPI.
There is no freely-available SAPI5 wrapper for Festival at the current time.
Making applications based on Festival
Festival has a very liberal license that allows you to do almost anything you like (except remove the headers in the source code that say who wrote it). The only practical problems would be speed and memory usage.
There is a faster and simpler system related to Festival, called flite. To make a voice for flite, you need to use the festvox voice building process, but you can start from the same data that you might have collected when building a Festival voice using the multisyn unit selection method.
Making applications based on HTK
You need to be careful about the license conditions, which forbid you from redistributing HTK. I think it is fine to distribute models trained with HTK, though. There is a toolkit for building real-time applications around HTK, called ATK (aimed mainly at spoken dialogue systems).
Several questions there, so let’s deal with them one-by-one.
Running HTK and Festival on Windows
HTK is straightforward to compile on various operating systems (it’s written in plain C), so should be usable on Windows. You might want to install Cygwin, to get a ‘unix-like’ environment.
Festival is trickier: not impossible, but painful. I do not recommend wasting time on this, because you can simply run a virtual Linux machine on your Windows computer.
This seems to be a pretty clear set of instructions. After installing VirtualBox (this is the ‘host’ software), you can download an image (basically a snapshot of a hard drive) of a Linux machine here:
https://virtualboximages.com/VirtualBox+Scientific+Linux+Images
https://virtualboximages.com/Scientific+Linux+7+x86_64+Desktop+VirtualBox+Virtual+Computer
and just load it into VirtualBox. You will then need to install the necessary software (e.g., Festival and HTK) on that Linux machine.
A simple option is Google Forms: http://www.google.co.uk/forms/about/ although it only supports this kind of listening test in a bit of a roundabout way; this screencast shows how: http://screencast-o-matic.com/watch/c2QbINnWz3
Windows-only software suggested by someone in CSTR:
http://www.wondershare.com/pro/quizcreator.html
which creates the test in Flash. It is not the best choice, but it works fine for people with no knowledge of web pages.
If you have an Informatics DICE account then you already have webspace that you can use to run a listening test, using CGI scripting. Using my username of ‘simonk’ as an example, this is the location on the filesystem (from any DICE machine)
/public/homepages/simonk
In there, the directory ‘web’ is served up by the web server at the URL http://homepages.inf.ed.ac.uk/simonk/ and the ‘cgi’ directory is somewhere you can put scripts. So this file
/public/homepages/simonk/web/mypage.html
would have the URL http://homepages.inf.ed.ac.uk/simonk/mypage.html
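To make the CGI part concrete, here is a minimal Python sketch of a response-logging script that might live in that ‘cgi’ directory. The form field names and the log file path are hypothetical placeholders, and you would need to check how CGI scripts are actually configured on DICE before relying on this.

```python
#!/usr/bin/env python3
# Minimal sketch of a CGI script for logging listening test responses.
# The form field names and the log file path are hypothetical placeholders;
# check how CGI is configured on your own webspace before relying on this.

import cgi

form = cgi.FieldStorage()
listener = form.getvalue("listener_id", "unknown")
sentence = form.getvalue("sentence_id", "unknown")
response = form.getvalue("typed_response", "")

# Append the response to a simple tab-separated log file
# (placeholder path: it must be somewhere the CGI process can write to)
with open("responses.tsv", "a") as log:
    log.write("\t".join([listener, sentence, response]) + "\n")

# A CGI script must print an HTTP header, a blank line, then the page body
print("Content-Type: text/html")
print()
print("<html><body><p>Thank you, your response has been recorded.</p></body></html>")
```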
Sebastian Andersson, Kallirroi Georgila, David Traum, Matthew Aylett, and Robert Clark. Prediction and realisation of conversational characteristics by utilising spontaneous speech for unit selection. In Proc. Speech Prosody, Chicago, USA, May 2010. PDF
Sebastian Andersson, Junichi Yamagishi, and Robert A.J. Clark. Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis. Speech Communication, 54(2):175-188, 2012. DOI: 10.1016/j.specom.2011.08.001