Forum Replies Created
This is a good point, and one we do indeed have to confront in a practical test. In a SUS test, we compare words (not their pronunciations). It is therefore appropriate to allow listeners to type in homophones, or indeed to mis-spell words.
There is usually either some pre-processing of the typed-in responses, before we compute the Word Error Rate (WER), or we allow for these mismatches when performing the dynamic programming alignment as part of the WER computation. This might be achieved by creating lists of acceptable matches for each word in the correct transcription, such as
correct word: your
allowable responses: your, you’re, yore, youre
Such lists need updating for each listening test (after gathering the listeners’ responses) because listeners seem to be very good at finding new ways to mis-spell or mis-type words!
I’ve attached an example list of acceptable variants for a set of Semantically Unpredictable Sentences, taken from the tools used to run the Blizzard Challenge.
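For concreteness, here is a minimal sketch in Python of how such lists can be folded into the dynamic programming alignment used to compute WER, by treating any listed variant as a correct match. This is not the Blizzard Challenge tool itself; the variant list and the example sentence are just placeholders.

ALLOWED = {
    "your": {"your", "you're", "yore", "youre"},   # example list from above
}

def matches(ref_word, hyp_word):
    """True if the typed word counts as correct for this reference word."""
    return hyp_word in ALLOWED.get(ref_word, {ref_word})

def wer(ref, hyp):
    """Word Error Rate via the usual dynamic programming alignment,
    using matches() instead of exact string equality."""
    # d[i][j] = minimum number of edits to align ref[:i] with hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if matches(ref[i - 1], hyp[j - 1]) else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution (or match)
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the dog ate your hat".split(),
          "the dog ate youre hat".split()))        # 0.0 because 'youre' is accepted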
Packaging a speech synthesis voice
Some operating systems provide a way to plug in your own voice, and make it available to all applications on that computer. In Windows, this is called SAPI.
There is no freely-available SAPI5 wrapper for Festival at the current time.
Making applications based on Festival
Festival has a very liberal license that allows you to do almost anything you like (except remove the headers in the source code that say who wrote it). The only practical problems would be speed and memory usage.
There is a faster and simpler system related to Festival, called flite. To make a voice for flite, you need to use the festvox voice building process, but you can start from the same data that you might have collected when building a Festival voice using the multisyn unit selection method.
Making applications based on HTK
You need to be careful about the license conditions, which forbid you from redistributing HTK. I think it is fine to distribute models trained with HTK though. There is an API for building real-time applications around HTK, called ATK (and aimed at spoken dialogue systems).
Several questions there, so let’s deal with them one-by-one.
Running HTK and Festival on Windows
HTK is straightforward to compile on various operating systems (it’s written in plain C), so it should be usable on Windows. You might want to install Cygwin, to get a ‘unix-like’ environment.
Festival is trickier – not impossible, but painful – and I do not recommend wasting time on this, because you can simply run a virtual Linux machine on your Windows computer.
These instructions seem pretty clear. After installing VirtualBox (this is the ‘host’ software), you can download an image (basically a snapshot of a hard drive) of a Linux machine here:
https://virtualboximages.com/VirtualBox+Scientific+Linux+Images
https://virtualboximages.com/Scientific+Linux+7+x86_64+Desktop+VirtualBox+Virtual+Computer
and just load it into VirtualBox. You will then need to install software on that Linux machine.
A simple option is Google Forms: http://www.google.co.uk/forms/about/
although it works only in a bit of a roundabout way; this screencast shows how: http://screencast-o-matic.com/watch/c2QbINnWz3
There is also some Windows-only software suggested by someone in CSTR:
http://www.wondershare.com/pro/quizcreator.html
which creates a test in Flash. Not the best choice, but it works fine for people with no knowledge of web pages.
If you have an Informatics DICE account then you already have webspace that you can use to run a listening test, using CGI scripting. Using my username of ‘simonk’ as an example, this is the location on the filesystem (from any DICE machine)
/public/homepages/simonk
In there, the directory ‘web’ is served up by the web server at the URL http://homepages.inf.ed.ac.uk/simonk/ and the ‘cgi’ directory is somewhere you can put scripts. So this file
/public/homepages/simonk/web/mypage.html
would have the URL http://homepages.inf.ed.ac.uk/simonk/mypage.html
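As an illustration of what could go in the ‘cgi’ directory, here is a minimal sketch of a script that records listening-test responses submitted from a web form. It is my own example rather than anything provided by CSTR: the field names and log file location are placeholders, and you would need to check that the server is configured to run Python CGI scripts.

#!/usr/bin/env python3
import os
from urllib.parse import parse_qs

# Emit the HTTP header, then a blank line, before any other output.
print("Content-Type: text/plain\n")

# Parse the form fields from the query string (GET request).
query = parse_qs(os.environ.get("QUERY_STRING", ""))
listener = query.get("listener", ["unknown"])[0]
response = query.get("response", [""])[0]

# Append one line per submission; the script needs write permission here.
with open("responses.log", "a") as f:
    f.write(f"{listener}\t{response}\n")

print("Thanks, your response has been recorded.")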
Sebastian Andersson, Kallirroi Georgila, David Traum, Matthew Aylett, and Robert Clark. Prediction and realisation of conversational characteristics by utilising spontaneous speech for unit selection. In Proc. Speech Prosody, Chicago, USA, May 2010.
Sebastian Andersson, Junichi Yamagishi, and Robert A.J. Clark. Synthesis and evaluation of conversational characteristics in HMM-based speech synthesis. Speech Communication, 54(2):175-188, 2012. DOI: 10.1016/j.specom.2011.08.001
Using spontaneous speech as the basis for a speech synthesiser is an attractive idea, but is rather hard in practice, for several reasons. Here are some of them:
- Word-level transcription: spontaneous speech is harder to transcribe, even at the word level, than read speech, because it is not entirely made of words (as found in a lexicon); ASR could be tried, as could hand transcription, but both would have difficulty with this. Remember that commercial ASR is designed for careful, planned speech such as dictation and will not work very well for unplanned speech.
- Phonetic transcription: even harder than word-level transcription, because the pronunciations deviate considerably from those found in the lexicon (due to co-articulation, assimilation, deletion, …).
- Phonetic alignment: the idea that speech is a linear string of phones (“beads on a string”) was never quite true even for read speech, but is even more problematic for spontaneous speech.
Here’s an experiment to try:
- record a spontaneous utterance
- transcribe the words
- record a read-text version of that
- compare the spontaneous and read-text versions side by side
- listen
- examine waveforms and spectrograms (see the sketch after this list)
- try to hand-label word and phone boundaries
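If you want to do the waveform and spectrogram part of that comparison programmatically, here is a minimal sketch using scipy and matplotlib (my own choice of tools, not something prescribed above); the filenames are placeholders and only the first channel is used if the recordings are stereo.

import matplotlib.pyplot as plt
from scipy.io import wavfile

fig, axes = plt.subplots(2, 2, figsize=(12, 6))
for col, (title, path) in enumerate([("spontaneous", "spontaneous.wav"),
                                     ("read text", "read.wav")]):
    rate, samples = wavfile.read(path)             # sample rate and samples
    if samples.ndim > 1:
        samples = samples[:, 0]                    # keep the first channel only
    axes[0][col].plot(samples)                     # waveform
    axes[0][col].set_title(title)
    axes[1][col].specgram(samples, Fs=rate)        # spectrogram
plt.tight_layout()
plt.show()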
That function is calculating the midpoints, yes. The code you’re showing is used for stripping the join cost coefficients during voice building, but it’s performing the same calculation that is done during synthesis.
In lectures, we did indeed gloss over a couple of special cases:
- Diphthongs: the 50% point is a poor choice, since the spectrum may be changing rapidly there, so we make the join 75% of the way through the segment, where the spectrum is generally a little more stable.
- Stops: the end of the closure (stored in cl_end) will have been found during forced alignment (how?) and so we use that as the join point; picking the 50% point in a stop (= closure + burst) might sometimes be before the burst, and other times in the middle of the burst, so it would be a bad place to make a join (e.g., we might end up with two bursts in the synthetic speech).
Diphone boundaries are generally just the midpoint between phone boundaries. So, there is no need to store this information in the .utt files because it’s very fast to compute on the fly (e.g., as the file is loaded).
Likewise, it’s easy to construct an index of all available diphones on the fly, as the .utt files are loaded, and store it in memory.
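To make that concrete, here is a minimal sketch of computing join points and building a diphone index on the fly from a phone segmentation. This is not Festival’s multisyn code: the segment format, the phone label sets and the availability of a separate closure-end time are all simplifying assumptions for illustration.

from collections import defaultdict

DIPHTHONGS = {"ai", "au", "oi", "ei", "ou"}        # assumed diphthong labels
STOPS = {"p", "t", "k", "b", "d", "g"}             # assumed stop labels

def join_point(phone, start, end, cl_end=None):
    """Time (in seconds) at which a join may be made inside this phone."""
    if phone in STOPS and cl_end is not None:
        return cl_end                              # end of closure, from forced alignment
    if phone in DIPHTHONGS:
        return start + 0.75 * (end - start)        # 75% point: spectrum more stable
    return start + 0.5 * (end - start)             # default: the midpoint

def build_diphone_index(utterances):
    """utterances: {utt_id: [(phone, start, end, cl_end), ...]} in time order.
    Returns {diphone: [(utt_id, cut_in_first_phone, cut_in_second_phone), ...]}."""
    index = defaultdict(list)
    for utt_id, segments in utterances.items():
        for (p1, s1, e1, c1), (p2, s2, e2, c2) in zip(segments, segments[1:]):
            index[p1 + "-" + p2].append((utt_id,
                                         join_point(p1, s1, e1, c1),
                                         join_point(p2, s2, e2, c2)))
    return index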
Labelling the diphones (not the features, just the phonemes)
We’ll look at this in detail in the lecture.
Yes, this would be pretty straightforward to do. As you say, you could treat it as a special kind of word (presumably pronounced as a special new phone).
In fact, you might find that – at least in unit selection – you will get these in-breaths ‘for free’, because the phrase-initial ‘silence’ in the recorded database often contains an audible in-breath, and those silence diphones will be chosen to synthesise silence in phrase-initial positions.