Forum Replies Created
Qualtrics might convert them to mp3 silently (certainly some platforms do) – check in a browser by taking your completed test as a subject.
The main problems with using wav files on the web are:
- They are larger than mp3 – not a problem here: we care about quality not size
- Some browsers, notably Safari, will not play wav files
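If you want to check programmatically rather than by ear, here is a minimal sketch, assuming the requests package is available and using a placeholder URL (paste in a real stimulus URL copied from your published survey), that inspects the Content-Type the server actually returns:

```
# Sketch: check what audio format a hosted stimulus is actually served as.
# The URL below is a placeholder - replace it with a real stimulus URL
# copied from your published survey.
import requests

url = "https://example.com/path/to/stimulus.wav"  # placeholder

response = requests.head(url, allow_redirects=True)
print("Status:      ", response.status_code)
print("Content-Type:", response.headers.get("Content-Type"))
# "audio/wav" (or "audio/x-wav") suggests the original file is being served;
# "audio/mpeg" would indicate it has been converted to mp3.
```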
The listener is potentially a significant factor that could affect the results. Language background is one important aspect of this, and for that reason, the work reported in many scientific papers only uses native listeners.
Within speech synthesis evaluation, there is not a lot of work exploring this. One paper worth skimming is
Wester, Valentini-Botinhao and Henter, “Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations”, in Proc. Interspeech 2015, pp. 3476–3480, although that paper focuses more on the number of listeners than on their properties.
From the Blizzard Challenge we know that non-natives have systematically higher average (and higher standard deviation) WER than native speakers when transcribing speech. We sometimes also find that people with high exposure to TTS (“Speech Experts”) do not rank systems in exactly the same order as the general population of listeners.
If we think that listener properties could affect results, then there are several possible approaches, including:
1. use a listener pool that is as homogeneous as possible, typically “normal-hearing native speakers” – this is what we do in Edinburgh most of the time
2. use a large listener pool and collect information about, for example, language background or previous exposure to TTS, so that the results can be analysed – this is what Blizzard does
Neither is perfect: approach 1 limits the available pool of subjects, while approach 2 results in unbalanced sub-groups, which complicates the statistical analysis.
For the assignment, I do not recommend attempting to restrict your listeners to only native speakers of English, or to native speakers of your own first language (where that’s not English) – just get as many listeners as possible. So, take approach 2.
Approach 2 involves collecting information about each individual listener. Be very careful to collect only what is essential for testing your hypothesis (e.g., language background) and not to ask for intrusive personal information that you don’t need (e.g., gender, age, ethnicity).
But, for the assignment, investigation of listener factors is optional and it would be fine to omit this and to analyse system properties instead. You choose!
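If you do go down that route, the analysis itself can be very simple. Here is a minimal sketch, assuming your results are in a CSV file with hypothetical columns listener, native_speaker, system and score, and that pandas is available:

```
# Sketch: compare listener sub-groups (approach 2 above).
# Column names and the filename are hypothetical - adapt them to your own data.
import pandas as pd

results = pd.read_csv("results.csv")  # placeholder filename

# Mean score per system, split by a listener property such as language background
summary = (results
           .groupby(["native_speaker", "system"])["score"]
           .agg(["mean", "std", "count"]))
print(summary)
# Beware: the sub-groups will almost certainly be unbalanced (see the caveat above),
# so treat any differences between them with caution.
```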
Festival is showing its age: it doesn’t support UTF-8. It only supports ASCII.
Terminology in this reply:
- sentence = the text being synthesised
- utterance = the synthetic speech for a sentence
You’re right to worry about listener boredom or fatigue.
But you also need to think about the effect of the sentence, which can be large (or at least unknown): some sentences are just harder to synthesise than others. In general, we therefore prioritise using the same sentences across all systems we are comparing.
If you place utterances side-by-side for direct comparison (e.g., ranking or MUSHRA) then you would always use the same sentence. Listeners would indeed have to listen to the same sentence uttered multiple times.
If you present utterances one at a time (e.g., MOS) then you can (pseudo)randomise the order so that listeners do not get the same sentence several times in a row, although they will still hear multiple utterances saying the same sentence across the test (or that section) as a whole.
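Here is a minimal sketch of one way to do that pseudo-randomisation, assuming the stimuli are (system, sentence) pairs with placeholder names; it simply re-shuffles until no two consecutive items share a sentence:

```
# Sketch: pseudo-randomise a listening-test playlist so the same sentence
# never occurs twice in a row. Rejection sampling: shuffle, check, repeat.
import random
import itertools

systems = ["A", "B", "C"]                             # placeholder system names
sentences = ["sent01", "sent02", "sent03", "sent04"]  # placeholder sentence ids
stimuli = list(itertools.product(systems, sentences))

def pseudo_randomise(items, max_attempts=1000):
    items = list(items)
    for _ in range(max_attempts):
        random.shuffle(items)
        # accept only if no two adjacent stimuli use the same sentence
        if all(a[1] != b[1] for a, b in zip(items, items[1:])):
            return items
    raise RuntimeError("Could not find an ordering; relax the constraint?")

for system, sentence in pseudo_randomise(stimuli):
    print(system, sentence)
```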
Here’s a way to ssh via a gateway machine (outside the University firewall) to a machine inside the firewall, in a single line. It does not require the VPN:

$ ssh -t s1234567@student.ssh.inf.ed.ac.uk ssh -t s1234567@ppls-atl-0020.ppls.ed.ac.uk
Password:
s1234567@ppls-atl-0020.ppls.ed.ac.uk's password:

The first password request is for student.ssh.inf.ed.ac.uk, the second for ppls-atl-0020.ppls.ed.ac.uk.

Setting up ssh keys appropriately should allow you to do this without passwords, except Informatics don’t allow ssh keys, so you need to use Kerberos – see their support pages.
This part does work though, to avoid needing a password for the lab computer: generate keys on student.ssh.inf.ed.ac.uk and copy them to ppls-atl-0020.ppls.ed.ac.uk using ssh-copy-id.

The error “ssh_exchange_identification: read: Connection reset by peer” usually means you had a few failed login attempts in a short period of time. Wait and try again later.
Please include the complete command line you are running, and the full error message, so I can help you.
Semantically Unpredictable Sentences (SUS) follow a simple template format, given in the paper along with links to word lists. From these, a simple script can be written to randomly generate SUS.
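As a concrete illustration, here is a minimal sketch of such a script, using one simple SUS-style template and tiny placeholder word lists (the real templates and word lists should of course be taken from the paper):

```
# Sketch: randomly generate Semantically Unpredictable Sentences from a template.
# The word lists here are tiny placeholders; use the lists referenced in the paper.
import random

nouns = ["table", "river", "doctor", "window"]
verbs = ["eats", "paints", "follows", "breaks"]
adjectives = ["green", "silent", "heavy", "early"]

# One SUS-style template: Det Adj Noun Verb Det Noun
def make_sus():
    return "The {} {} {} the {}.".format(
        random.choice(adjectives),
        random.choice(nouns),
        random.choice(verbs),
        random.choice(nouns),
    )

for _ in range(5):
    print(make_sus())
```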
Remember that SUS may not be necessary if you don’t have a ceiling effect on intelligibility – you will want to informally find that out before proceeding with SUS. Using SUS with a very low-intelligibility voice might lead to a floor effect!
Harvard sentences are semantically plausible and (supposedly) phonetically-balanced when used in groups of 10. They are still widely used for intelligibility testing when there is no risk of ceiling effect, such as in noise (or, in the case of this assignment, when the synthetic voice is far from perfect!).
I would expect students to continue studying even when there are no classes. Therefore, yes, I would expect you to have worked through all materials according to the originally planned class schedule.
What we actually cover in each remaining class may be adjusted to make best use of the available class time (but without scheduling additional hours to replace cancelled classes).
It’s too early for me to make an announcement about what the effect on the exam might be. I have not yet written the exam.
There might be some non-ASCII (and non-printing – therefore hard to detect) characters in a few sentences. Here’s one way to remove all non-ASCII characters:
cat input.txt | iconv -c -t ASCII > output.txt
Or you could simply manually remove those sentences that get split across two lines by Festival.
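If you would rather find the offending sentences than strip them blindly, here is a small sketch that reports which lines contain non-ASCII characters and also writes a cleaned copy (filenames are placeholders):

```
# Sketch: locate (and remove) non-ASCII characters in a sentence file.
# Filenames are placeholders - adapt to your own files.
with open("input.txt", encoding="utf-8") as fin, \
     open("output.txt", "w", encoding="ascii") as fout:
    for lineno, line in enumerate(fin, start=1):
        if not line.isascii():
            bad = [c for c in line if not c.isascii()]
            print("line {}: non-ASCII characters {}".format(lineno, bad))
        # drop anything outside ASCII, like `iconv -c -t ASCII` above
        fout.write(line.encode("ascii", errors="ignore").decode("ascii"))
```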
F0 is real-valued. Taylor argues that this means there is a very natural way to measure the distance between two F0 values. For example, we could take their difference. I would make this argument on the basis of perception: it is clear that a larger difference in F0 values will generally produce a larger perceived difference in two speech sounds. The relationship is not linear, but at least it is monotonic.
This is in contrast to using multiple high-level features such as stress, accentuation, phrasing and phonetic identity. It is not at all clear what distance metric we should use here, for reasons including:
- they are not real-valued
- we don’t know their relative importance
- we don’t know if/how they are correlated with one another
- the relationship with perception is not so obvious as for F0
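To make the point about F0 concrete: the raw difference in Hertz is a usable distance, but because pitch perception is roughly logarithmic, the same difference expressed in semitones is often closer to what listeners hear. A minimal sketch (the semitone conversion is the standard formula; the example values are arbitrary):

```
# Sketch: two ways to measure the distance between F0 values.
import math

def hz_distance(f1, f2):
    return abs(f1 - f2)

def semitone_distance(f1, f2):
    # 12 semitones per octave (a doubling of frequency)
    return abs(12.0 * math.log2(f1 / f2))

# The same 20 Hz difference is perceptually much larger at low F0:
print(hz_distance(100, 120), semitone_distance(100, 120))  # 20 Hz, about 3.2 semitones
print(hz_distance(300, 320), semitone_distance(300, 320))  # 20 Hz, about 1.1 semitones
```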
make_mfcc_list uses utts.data as its source of filenames, so perhaps you have modified that?
Weijia W: Adding a word to the script like that is a very good technique – this is exactly the right line of thinking to explore what is happening in each step of voice building.
Bingzi Y: you are right – “prprprfg” wasn’t a good choice of “word” because it would have been classified by Festival as an NSW and expanded into something else (perhaps treated as an LSEQ?). Your “Moschops” is a better choice because this is clearly a possible word in English (in fact, it happens to be a real word in this case).
You need to carefully distinguish two very different ways in which we save computation in both DTW and the Viterbi algorithm for HMMs.
Dynamic Programming: this algorithm efficiently evaluates all possible paths (= state sequences for HMMs). All paths are evaluated, and none are disregarded. This algorithm is exact and introduces no errors compared to a naive exhaustive search of the paths one at a time.
Pruning: this involves not exploring some paths (state sequences) at all. In DTW, this means that we will not visit every single point in the grid. In the Viterbi algorithm for HMMs implemented as token passing, it means that not all states will have tokens at all time steps. Pruning introduces errors whenever an unexplored part of the grid would have been on the globally most likely path.
In Dynamic Programming, we talk about “throwing away” all but the locally best path when two or more paths meet. The paths that are “thrown away” have already been evaluated up to that point. Extending those paths further would involve exactly the same computations as extending the best path. So we are effectively still evaluating all paths. We save computation without introducing any error: that’s the magic of Dynamic Programming.
This is not the same as pruning, in which we stop exploring some of the locally best paths, because there is another path (into another point on the DTW grid, or arriving at a different state in the HMM) that is much better.
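Here is a minimal sketch contrasting the two, assuming two short 1-D feature sequences and an absolute-difference local cost: plain DTW fills every cell of the grid (Dynamic Programming, exact), while an optional fixed band around the diagonal is one very simple way of not exploring parts of the grid (pruning, which can introduce errors):

```
# Sketch: DTW by Dynamic Programming, with an optional band constraint
# as a very simple form of pruning. 1-D sequences, purely for illustration.
import math

def dtw(x, y, band=None):
    INF = math.inf
    n, m = len(x), len(y)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Pruning: do not visit grid points too far from the diagonal
            if band is not None and abs(i - j) > band:
                continue
            cost = abs(x[i - 1] - y[j - 1])
            # Dynamic Programming: keep only the best of the paths that meet here;
            # extending the others would add exactly the same cost from now on.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

x = [1.0, 2.0, 3.0, 4.0, 3.0]
y = [1.0, 1.0, 2.0, 4.0, 4.0, 3.0]
print(dtw(x, y))          # exact: all paths evaluated
print(dtw(x, y, band=1))  # pruned: may differ if the best path leaves the band
```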
Your first explanation of search space is correct.
Token passing is an algorithm, not a model.
Each token “generates” the given observation; to do so, we compute the probability of that observation under the current state’s pdf. Yes, we just “look up” that probability.
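A tiny sketch of that “look up”, assuming a single 1-D Gaussian per state (names and values here are hypothetical): when a token is passed into a state at some time step, the state’s pdf is evaluated at the current observation and the resulting log probability is added to the token’s score.

```
# Sketch: the "look up" of an observation probability during token passing.
# One Gaussian per state, 1-D observations, hypothetical values.
import math

def gaussian_logpdf(x, mean, var):
    # log of the Gaussian density N(x; mean, var)
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

class Token:
    def __init__(self, log_prob=0.0):
        self.log_prob = log_prob

state_mean, state_var = 1.5, 0.25  # this state's emission pdf
observation = 1.2                  # the observation at the current time step

token = Token()
# The token "generates" the observation: just evaluate the state's pdf there
token.log_prob += gaussian_logpdf(observation, state_mean, state_var)
print(token.log_prob)
```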