Forum Replies Created
Multiple rounds of evaluation would be normal when developing a large system over a period of time. For the “Build your own unit selection voice” exercise, that would probably take too much time though.
In general, it’s difficult to evaluate a single system in isolation: this is because most types of evaluation provide a relative judgement compared to one or more other systems or references. Even in the case of intelligibility testing, where evaluating a single system sounds reasonable, we still need to interpret the result: for example, is a Word Error Rate of 15% good or bad? One way to know would be to measure the intelligibility of natural speech under the same conditions.
When comparing multiple systems, we would normally use the same speaker and in fact the exact same database (unless we were investigating the effect of database size or content). Trying to compare two systems built from different speakers’ data would not enable us to separate the effects of speaker from those of the system.
February 7, 2016 at 12:13 in reply to: Output quality: common evaluation framework & materials (SUS) #2547
The reason for using SUS is, of course, to avoid a ceiling effect in intelligibility. But you are not the first person(*) to suggest that SUS are highly unnatural and to wonder how much a SUS test actually tells us about real-world intelligibility.
A slightly more ecologically valid example would be evaluating the intelligibility of synthetic speech in noise, where SUS would be too difficult and ‘more normal’ sentences could be used instead. But such tests are still generally done in the lab, with artificially added noise. They could hardly be called ecologically valid.
You ask whether “there [are] intelligibility tests using situations that mimic the desired applications?” This would certainly be desirable, and commercial companies might do this as part of usability testing. Unfortunately, mimicking the end application is a lot of work, and so makes the test slow and expensive. Once we start evaluating the synthetic speech as part of a final application, it also gets harder to separate out the underlying causes of users’ responses. At this point, we reach the limit of my expertise, and you would be better off asking an expert such as Maria Wolters.
* Paul Taylor always told me he was very sceptical of SUS intelligibility testing. He asserted that all commercial systems were already at ceiling intelligibility in real-world conditions, so there was no point measuring it; researchers should focus on naturalness instead. I agree with him as far as listening in quiet conditions is concerned, but synthetic speech is certainly not at ceiling intelligibility when heard in noise.
February 7, 2016 at 12:00 in reply to: Output quality: common evaluation framework & materials (SUS) #2546
In the field of speech coding there are standardised tests and methodologies for evaluating codecs. This standardisation is driven by commercial concerns: both the companies that invent new codecs and those that use them (e.g., telecoms or broadcasters).
But in speech synthesis there appears to be no commercial demand for equivalent standardised tests. Commercial producers of speech synthesisers never reveal the evaluation results for their products (the same is true of automatic speech recognition).
There are, however, conventions and accepted methods for evaluation that are widely used in research and development. SUS is one such method and is fairly widely used (although Word Error Rate, rather than Sentence Error Rate, is usually reported).
The Blizzard Challenge is the only substantial effort to make fair comparisons across multiple systems. The listening test design in the Blizzard Challenge is straightforward (it includes a section of SUS) and is widely used by others. The materials (speech databases + text of the test sentences) are publicly available and are also quite widely used. This is a kind of de facto standardisation.
There are some examples for the Speech Communication paper.
This is potentially confusing – and we don’t want to get hung up on terminology. I’ve added some clarification.
The reasons for avoiding very long sentences in the prompts for recording a unit selection database are:
- they are hard to read out without the speaker making a mistake
- the proportion of phrase-initial and phrase-final diphones is low (a toy calculation below illustrates this)
Short sentences might be avoided because they have unusual prosody, and so units from short phrases (e.g., “Hi!”) may not be very suitable for synthesising ‘ordinary’ sentences.
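To put a rough number on that second point, here is a toy Python calculation. This is my illustration, not part of the original exercise, and the counting convention (treating the joins to silence at each phrase edge as diphones) is an assumption:

```python
# Toy calculation: a sentence of n phones has roughly n + 1 diphones
# (counting the joins to silence at each utterance edge), but only the
# few diphones touching a phrase boundary are phrase-initial or
# phrase-final. Longer sentences therefore dilute boundary coverage.

def boundary_fraction(n_phones, n_internal_breaks=0):
    """Fraction of diphones adjacent to a phrase boundary."""
    n_diphones = n_phones + 1               # includes the two utterance-edge joins
    n_boundary = 2 + 2 * n_internal_breaks  # utterance start + end, plus internal breaks
    return n_boundary / n_diphones

for n in (10, 30, 60, 120):  # short through very long sentences
    print("%3d phones: %4.0f%% boundary diphones" % (n, 100 * boundary_fraction(n)))
```

A 10-phone sentence has around 18% boundary diphones; a 120-phone sentence has under 2%, so a database recorded from very long prompts will be starved of phrase-initial and phrase-final units.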
This is a good point, and one we do indeed have to confront in a practical test. In a SUS test, we compare words (not their pronunciations). It is therefore appropriate to allow listeners to type in homophones, or indeed to mis-spell words.
There is usually either some pre-processing of the typed-in responses, before we compute the Word Error Rate (WER), or we allow for these mismatches when performing the dynamic programming alignment as part of the WER computation. This might be achieved by creating lists of acceptable matches for each word in the correct transcription, such as
correct word: your
allowable responses: your, you’re, yore, youre

Such lists need updating for each listening test (after gathering the listeners’ responses) because listeners seem to be very good at finding new ways to mis-spell or mis-type words!
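To make the mechanics concrete, here is a minimal Python sketch of a WER computation whose dynamic programming alignment treats any listed variant as a correct match. This is my illustration rather than the actual Blizzard Challenge tooling, and the ALLOWED table just repeats the example above:

```python
# Minimal WER with per-word lists of acceptable responses (homophones,
# common mis-spellings). Illustrative only; real tools differ in detail.

ALLOWED = {
    "your": {"your", "you're", "yore", "youre"},
}

def matches(ref_word, hyp_word):
    # a typed-in word counts as correct if it is an acceptable variant
    return hyp_word in ALLOWED.get(ref_word, {ref_word})

def wer(reference, response):
    """Word Error Rate via dynamic-programming (Levenshtein) alignment."""
    ref = reference.lower().split()
    hyp = response.lower().split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if matches(ref[i - 1], hyp[j - 1]) else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("watch your step", "watch you're step"))  # 0.0: homophone accepted
print(wer("watch your step", "watch that step"))    # 0.33: one substitution
```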
I’ve attached an example list of acceptable variants for a set of Semantically Unpredictable Sentences, taken from the tools used to run the Blizzard Challenge.
Packaging a speech synthesis voice
Some operating systems provide a way to plug in your own voice, and make it available to all applications on that computer. In Windows, this is called SAPI.
There is no freely-available SAPI5 wrapper for Festival at the current time.
Making applications based on Festival
Festival has a very liberal license that allows you to do almost anything you like (except remove the headers in the source code that say who wrote it). The only practical problems would be speed and memory usage.
There is a faster and simpler system related to Festival, called flite. To make a voice for flite, you need to use the festvox voice building process, but you can start from the same data that you might have collected when building a Festival voice using the multisyn unit selection method.
Making applications based on HTK
You need to be careful about the license conditions, which forbid you from redistributing HTK. I think it is fine to distribute models trained with HTK though. There is an API for building real-time applications around HTK, called ATK (and aimed at spoken dialogue systems).
Several questions there, so let’s deal with them one by one.
Running HTK and Festival on Windows
HTK is straightforward to compile on various operating systems (it’s written in plain C), so should be usable on Windows. You might want to install Cygwin, to get a ‘unix-like’ environment.
Festival is trickier: not impossible, but painful, and I do not recommend wasting time on this because you can simply run a virtual Linux machine on your Windows computer instead.
The instructions seem pretty clear. After installing VirtualBox (the ‘host’ software), you can download an image (basically a snapshot of a hard drive) of a Linux machine here:
https://virtualboximages.com/VirtualBox+Scientific+Linux+Images
https://virtualboximages.com/Scientific+Linux+7+x86_64+Desktop+VirtualBox+Virtual+Computer
and just load it into VirtualBox. You will then need to install the necessary software (e.g., Festival and HTK) on that Linux machine yourself.
A simple option is Google Forms: http://www.google.co.uk/forms/about/
although it works in a bit of a roundabout way, as shown in this screencast: http://screencast-o-matic.com/watch/c2QbINnWz3
Another option is a piece of Windows-only software suggested by someone in CSTR:
http://www.wondershare.com/pro/quizcreator.html
which creates a test in Flash. Not the best choice, but it works fine for people with no knowledge of web pages.
If you have an Informatics DICE account then you already have webspace that you can use to run a listening test, using CGI scripting. Using my username of ‘simonk’ as an example, this is the location on the filesystem (from any DICE machine):
/public/homepages/simonk
In there, the directory ‘web’ is served up by the web server at the URL http://homepages.inf.ed.ac.uk/simonk/ and the ‘cgi’ directory is somewhere you can put scripts. So this file
/public/homepages/simonk/web/mypage.html
would have the URL http://homepages.inf.ed.ac.uk/simonk/mypage.html
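As a rough illustration of what a script in the ‘cgi’ directory might look like (this is my sketch, not part of the original post; the form field names and log file path are invented), a minimal Python response-logger could be:

```python
#!/usr/bin/env python
# Minimal CGI script for logging listening-test responses.
# Hypothetical example: the field names and log path are made up,
# and the log file must be writable by the web server.
import cgi

form = cgi.FieldStorage()
listener = form.getfirst("listener", "unknown")
stimulus = form.getfirst("stimulus", "unknown")
response = form.getfirst("response", "")

# append one tab-separated line per submitted response
with open("responses.log", "a") as log:
    log.write("%s\t%s\t%s\n" % (listener, stimulus, response))

# a CGI response must start with a header line and a blank line
print("Content-Type: text/plain")
print("")
print("Thanks! Your response has been recorded.")
```

A web page in the ‘web’ directory would then submit its form to the corresponding script in the ‘cgi’ directory.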