Forum Replies Created
This is potentially confusing – and we don’t want to get hung up on terminology. I’ve added some clarification.
The reasons for avoiding very long sentences in the prompts for recording a unit selection database are
- they are hard to read out without the speaker making a mistake
- the proportion of phrase-initial and phrase-final diphones is low
Short sentences might be avoided because they have unusual prosody, and so units from short phrases (e.g., “Hi!”) may not be very suitable for synthesising ‘ordinary’ sentences.
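As a toy illustration of that trade-off, here is a minimal sketch (Python, with made-up word-count thresholds) of filtering candidate prompts by length; a real script would of course also consider diphone coverage, not just length.

```python
# Illustrative sketch: keep only prompts of a comfortable length, discarding
# very long sentences (hard to read aloud without mistakes) and very short
# ones (unusual prosody). The thresholds are arbitrary examples.

def select_prompts(candidates, min_words=5, max_words=20):
    """Keep only prompts whose word count falls within the given range."""
    selected = []
    for sentence in candidates:
        n_words = len(sentence.split())
        if min_words <= n_words <= max_words:
            selected.append(sentence)
    return selected

if __name__ == "__main__":
    prompts = ["Hi!",
               "The quick brown fox jumps over the lazy dog.",
               "A very long sentence " + "with many extra words " * 10]
    print(select_prompts(prompts))  # only the middle sentence survives
```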
This is a good point, and one we do indeed have to confront in a practical test. In a SUS test, we compare words (not their pronunciations). It is therefore appropriate to allow listeners to type in homophones, or indeed to mis-spell words.
There is usually either some pre-processing of the typed-in responses, before we compute the Word Error Rate (WER), or we allow for these mismatches when performing the dynamic programming alignment as part of the WER computation. This might be achieved by creating lists of acceptable matches for each word in the correct transcription, such as
correct word: your
allowable responses: your, you’re, yore, youre
Such lists need updating for each listening test (after gathering the listeners’ responses) because listeners seem to be very good at finding new ways to mis-spell or mis-type words!
I’ve attached an example list of acceptable variants for a set of Semantically Unpredictable Sentences, taken from the tools used to run the Blizzard Challenge.
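To make the idea concrete, here is a minimal sketch of a WER computation whose dynamic programming alignment treats any listed variant as a correct match. It is written in Python purely for illustration and is not the Blizzard Challenge tool itself; the function name and the example variant list are made up.

```python
# Minimal sketch: word error rate with per-word lists of acceptable variants.
# A typed response counts as correct if it appears in the allowable set for
# the corresponding reference word (homophones, common mis-spellings, ...).
# Each variant list should include the correct word itself.

def wer(reference, hypothesis, allowable=None):
    """Levenshtein alignment over words; returns (S + D + I) / len(reference)."""
    allowable = allowable or {}

    def match(ref_word, hyp_word):
        return hyp_word in allowable.get(ref_word, {ref_word})

    R, H = len(reference), len(hypothesis)
    # d[i][j] = minimum edit cost aligning reference[:i] with hypothesis[:j]
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        d[i][0] = i
    for j in range(1, H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if match(reference[i - 1], hypothesis[j - 1]) else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # match or substitution
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[R][H] / R

ref = "you should call your mother".split()
hyp = "you should call you're mother".split()
variants = {"your": {"your", "you're", "yore", "youre"}}
print(wer(ref, hyp))            # 0.2: "you're" counted as an error
print(wer(ref, hyp, variants))  # 0.0: "you're" accepted for "your"
```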
Packaging a speech synthesis voice
Some operating systems provide a way to plug in your own voice, and make it available to all applications on that computer. In Windows, this is called SAPI.
There is no freely-available SAPI5 wrapper for Festival at the current time.
Making applications based on Festival
Festival has a very liberal license that allows you to do almost anything you like (except remove the headers in the source code that say who wrote it). The only practical problem would be speed and memory usage.
There is a faster and simpler system related to Festival, called flite. To make a voice for flite, you need to use the festvox voice building process, but you can start from the same data that you might have collected when building a Festival voice using the multisyn unit selection method.
Making applications based on HTK
You need to be careful about the license conditions, which forbid you from redistributing HTK. I think it is fine to distribute models trained with HTK though. There is an API for building real-time applications around HTK, called ATK (and aimed at spoken dialogue systems).
Several questions there, so let’s deal with them one-by-one.
Running HTK and Festival on Windows
HTK is straightforward to compile on various operating systems (it’s written in plain C), so it should be usable on Windows. You might want to install Cygwin to get a Unix-like environment.
Festival is trickier: not impossible, but painful, and I do not recommend wasting time on this, because you can simply run a virtual Linux machine on your Windows computer.
This seems to be a pretty clear set of instructions. After installing VirtualBox (this is the ‘host’ software), you can download an image (basically a snapshot of a hard drive) of a Linux machine here:
https://virtualboximages.com/VirtualBox+Scientific+Linux+Images
https://virtualboximages.com/Scientific+Linux+7+x86_64+Desktop+VirtualBox+Virtual+Computer
and just load it into VirtualBox. You will then need to install the software on that Linux machine.
A simple option is Google Forms: http://www.google.co.uk/forms/about/
It can be made to work for a listening test, but in a bit of a roundabout way: http://screencast-o-matic.com/watch/c2QbINnWz3
Windows-only software suggested by someone in CSTR
http://www.wondershare.com/pro/quizcreator.html
which creates a test in Flash. It’s not the best choice, but it works fine for people with no knowledge of web pages.
If you have an Informatics DICE account then you already have webspace that you can use to run a listening test, using CGI scripting. Using my username of ‘simonk’ as an example, this is the location on the filesystem (from any DICE machine)
/public/homepages/simonk
In there, the directory ‘web’ is served up by the web server at the URL http://homepages.inf.ed.ac.uk/simonk/ and the ‘cgi’ directory is where you can put scripts. So this file
/public/homepages/simonk/web/mypage.html
would have the URL
http://homepages.inf.ed.ac.uk/simonk/mypage.html
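As an illustration only (not a ready-made test), here is a minimal sketch of the kind of CGI script you could put in that ‘cgi’ directory to collect listening-test responses. It assumes a Python interpreter with the standard cgi module is available on the server; the results file path and the form field names are invented for the example.

```python
#!/usr/bin/env python
# Illustrative sketch of a CGI script for logging listening-test responses.
# The file path and field names are placeholders; point your HTML form's
# 'action' at this script's URL.

import cgi
import datetime

RESULTS_FILE = "/tmp/listening_test_responses.csv"  # example path only

def main():
    form = cgi.FieldStorage()
    listener = form.getfirst("listener_id", "unknown")
    stimulus = form.getfirst("stimulus", "")
    response = form.getfirst("response", "")

    # Append one line per submission; a real test should lock the file or
    # write one file per listener to avoid concurrent-write problems.
    with open(RESULTS_FILE, "a") as f:
        f.write("%s,%s,%s,%s\n" % (datetime.datetime.now().isoformat(),
                                   listener, stimulus, response))

    # CGI response: header, blank line, then the body.
    print("Content-Type: text/html\n")
    print("<html><body>Thank you, your response has been recorded.</body></html>")

if __name__ == "__main__":
    main()
```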
Sebastian Andersson, Kallirroi Georgila, David Traum, Matthew Aylett, and Robert Clark. Prediction and realisation of conversational characteristics by utilising spontaneous speech for unit selection. In Proc. Speech Prosody, Chicago, USA, May 2010. PDF
Sebastian Andersson, Junichi Yamagishi, and Robert A.J. Clark. Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis. Speech Communication, 54(2):175-188, 2012. DOI: 10.1016/j.specom.2011.08.001
Using spontaneous speech as the basis for a speech synthesiser is an attractive idea, but is rather hard in practice, for several reasons. Here are some of them:
Word-level transcription: spontaneous speech is harder to transcribe, even at the word level, than read speech, because it is not entirely made of words (as found in a lexicon). ASR could be tried, as could hand-transcription, but both would have difficulty with this; remember that commercial ASR is designed for careful, planned speech such as dictation, and will not work very well for unplanned speech.
Phonetic transcription: even harder than word-level transcription, because the pronunciations deviate considerably from those found in the lexicon (due to co-articulation, assimilation, deletion,…)
Phonetic alignment: the idea that speech is a linear string of phones (“beads on a string”) was never quite true even for read speech, but is even more problematic for spontaneous speech.
Here’s an experiment to try:
- record a spontaneous utterance
- transcribe the words
- record a read-text version of that
- compare the spontaneous and read-text versions side by side
- listen
- examine waveforms and spectrograms (a small plotting sketch for this step follows the list)
- try to hand-label word and phone boundaries
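If you would like to do the visual part of that comparison programmatically, here is a minimal sketch using Python with SciPy and matplotlib; the filenames are placeholders for your own recordings, and it assumes mono WAV files.

```python
# Illustrative sketch: plot a spontaneous utterance and its read-text version
# side by side (waveform on top, spectrogram below) for visual comparison.

import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile

def plot_pair(spontaneous_wav, read_wav):
    """Waveforms and spectrograms of two recordings, one column each."""
    fig, axes = plt.subplots(2, 2, figsize=(12, 6))
    for col, path in enumerate([spontaneous_wav, read_wav]):
        rate, samples = wavfile.read(path)          # assumes mono WAV
        samples = samples.astype(float)
        times = np.arange(len(samples)) / rate
        axes[0, col].plot(times, samples)
        axes[0, col].set_title(path)
        f, t, Sxx = signal.spectrogram(samples, fs=rate)
        axes[1, col].pcolormesh(t, f, 10 * np.log10(Sxx + 1e-10),
                                shading="gouraud")  # dB scale for visibility
        axes[1, col].set_xlabel("Time (s)")
        axes[1, col].set_ylabel("Frequency (Hz)")
    plt.tight_layout()
    plt.show()

plot_pair("spontaneous.wav", "read.wav")  # placeholder filenames
```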
That function is calculating the midpoints, yes. The code you’re showing is used for stripping the join cost coefficients during voice building, but it’s performing the same calculation that is done during synthesis.
In lectures, we did indeed gloss over a couple of special cases:
Diphthongs: the 50% point is a poor choice, since the spectrum may be changing rapidly there, so we make the join 75% of the way through the segment, where the spectrum is generally a little more stable.
Stops: the end of the closure (stored in cl_end) will have been found during forced alignment (how?) and so we use that as the join point; picking the 50% point in a stop (=closure+burst) might sometimes be before the burst, and other times in the middle of the burst, so would be a bad place to make a join (e.g., we might end up with two bursts in the synthetic speech).
Diphone boundaries are generally just the midpoint between phone boundaries. So, there is no need to store this information in the .utt files because it’s very fast to compute on the fly (e.g., as the file is loaded).
Likewise, it’s easy to construct an index of all available diphones on the fly, as the .utt files are loaded, and store it in memory.
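Here is a minimal sketch of those join-point rules and the on-the-fly diphone index, written in Python rather than Festival’s own code; the Segment class, the phone inventories, and the example times are simplified stand-ins for what the .utt files actually provide.

```python
# Illustrative sketch of choosing diphone join points from phone boundaries,
# following the rules described above: midpoint by default, 75% of the way
# through a diphthong, and the end of the closure (cl_end) for stops.

DIPHTHONGS = {"ai", "au", "oi", "ei", "ou"}   # example inventory only
STOPS = {"p", "t", "k", "b", "d", "g"}

class Segment:
    def __init__(self, name, start, end, cl_end=None):
        self.name = name        # phone label
        self.start = start      # start time in seconds
        self.end = end          # end time in seconds
        self.cl_end = cl_end    # end of closure, if this is a stop

def join_point(seg):
    """Return the time at which to cut this phone when forming diphones."""
    if seg.name in STOPS and seg.cl_end is not None:
        return seg.cl_end                                  # cut at end of closure
    if seg.name in DIPHTHONGS:
        return seg.start + 0.75 * (seg.end - seg.start)    # 75% through
    return seg.start + 0.5 * (seg.end - seg.start)         # plain midpoint

def diphone_index(segments):
    """Build an in-memory index of available diphones, computed on the fly."""
    index = {}
    for left, right in zip(segments, segments[1:]):
        name = left.name + "_" + right.name
        index.setdefault(name, []).append((join_point(left), join_point(right)))
    return index

segments = [Segment("h", 0.00, 0.08),
            Segment("ai", 0.08, 0.30),
            Segment("t", 0.30, 0.42, cl_end=0.38)]
print(diphone_index(segments))
```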