Forum Replies Created
Please read “Please briefly describe if you use any external Artificial Intelligence (AI) related tools in doing this assignment…” which applies to all use of such tools.
Whilst professional programmers certainly use AI “co-pilots” to write code faster, I strongly recommend against this for beginners: for one thing, you do not yet have the skill to judge whether the resulting code is correct!
You will learn a lot more, and build more confidence, doing it yourself from scratch.
These notes are not currently available – there is everything you need here on speech.zone and the forums.
Reproducing a figure from another source may not be the best way to show your understanding.
If you really want to do this, then you need to cite the source in the caption of the figure, in the same way that you would cite the source of a direct text quote.
It’s up to you whether to include the original figure, or redraw it.
No, they are not. I’ve clarified that in the Formatting instructions.
You can check when AT 4.02 is available (i.e., whenever there is no class scheduled) by visiting timetables then searching for 4.02 and selecting the one in Appleton Tower. This link should take you directly there (although you may need to authenticate with EASE first).
Key points
You are correct that the x-axis (horizontal) will be frequency, and will be labelled in units of Hertz (Hz).
The vertical axis is magnitude, which is most commonly plotted on a logarithmic scale and is therefore labelled in decibels (dB).
Additional detail
Magnitude is a ratio (in this case, of filter output to its input), and therefore has no units: formally, we say it is dimensionless. So dB are not actually a unit, but a scale.
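To make the dB scale concrete: for amplitude-type quantities such as filter gain, the conversion from a magnitude ratio to decibels is 20·log10(ratio). A minimal sketch in Python:

```python
import math

def magnitude_to_db(ratio):
    """Convert a dimensionless magnitude ratio (filter output / input)
    to decibels. For amplitude-type quantities, dB = 20 * log10(ratio)."""
    return 20.0 * math.log10(ratio)

# A gain of 1.0 (output equals input) is 0 dB.
# A gain of 10x is +20 dB; halving the amplitude is about -6 dB.
print(magnitude_to_db(1.0))              # 0.0
print(magnitude_to_db(10.0))             # 20.0
print(round(magnitude_to_db(0.5), 1))    # -6.0
```

Note that the argument is dimensionless, consistent with the point above: the dB value expresses a ratio on a logarithmic scale, not a physical unit.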
emulabel is an outdated program mentioned in some old documentation on festvox.org.

In "Pitchmark the speech", the command make_pmlab_pm creates label files from the pitchmarks, and places them in the pm_lab directory. These can be viewed in the same way as any other label file (such as the aligned phone labels), e.g., using wavesurfer.

You can use Qualtrics to build the survey, but host your audio files somewhere else, then enter their URLs into Qualtrics.
You can host the audio files anywhere that is able to provide a URL for the file. For example, a free github page, which might give you URLs like this:
https://jonojace.github.io/IS19-robustness-audio-samples/figure3/g_100clean.wav
Yes, you need both an abstract and an introduction.
April 8, 2024 at 16:00 in reply to: Autocorrelation and Pitch Prediction in FastPitch Vs. UnitSelec #17712
You need to more clearly separate two independent design choices:
1. How to estimate F0 for recorded speech (which will become the database for a unit selection system, or the training data for a FastPitch model).
The method for estimating F0 (whether autocorrelation based like RAPT, or something else) is independent of the method used for synthesis. The synthesis methods just need values for F0, they don’t care where they come from.
2. Using F0 during synthesis (which will be either the unit selection algorithm, or FastPitch inference).
In a unit selection system that doesn’t employ any signal modification, you are correct in stating that the system can only synthesise speech with F0 values found in the database. FastPitch can, in theory, generate any F0 value.
But both methods use the data to learn how to predict F0, so they are both constrained by what is present in the database. The ‘model’ of F0 prediction in unit selection is implicit: the combination of target and join cost function. The model of F0 prediction in FastPitch is explicit.
So, in practice, as you suggest, FastPitch is very constrained by what is present in the training data. In that regard, it’s not so very different to unit selection.
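To illustrate design choice 1 above, here is a toy autocorrelation-based F0 estimator. This is a deliberately simplified sketch, not RAPT: a real pitch tracker adds windowing, a voicing decision, and dynamic programming over frames.

```python
import math

def estimate_f0(x, sample_rate, f0_min=110.0, f0_max=400.0):
    """Toy autocorrelation pitch estimator: return the F0 (in Hz) whose lag
    gives the largest normalised autocorrelation within the search range."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_r = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # normalised autocorrelation at this lag
        r = sum(x[n] * x[n + lag] for n in range(len(x) - lag)) / (len(x) - lag)
        if r > best_r:
            best_r, best_lag = r, lag
    return sample_rate / best_lag

# A pure 200 Hz sine sampled at 16 kHz: the autocorrelation peaks at a
# lag of 80 samples (one period), giving 16000 / 80 = 200 Hz.
sr = 16000
signal = [math.sin(2 * math.pi * 200 * n / sr) for n in range(1600)]
print(estimate_f0(signal, sr))  # 200.0
```

The synthesis method (unit selection or FastPitch) would consume the F0 values this produces; it never sees how they were estimated.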
There is probably either a formatting error or a non-ASCII character in your utts.data. If you can't easily locate it, try using binary search to find the offending line (here I'll assume utts.data has 600 lines):

0. make a backup of utts.data
1. make a file containing only the first half of utts.data, for example with head -300 utts.data > first.data
2. try check_script on first.data
3a. if you get an error, then take the first half again: head -150 utts.data > first.data
3b. if you don't get an error, make a file containing the first three-quarters of utts.data: head -450 utts.data > first.data

and iterate, varying the number of lines in a binary search pattern, until you home in on the error.
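As an alternative to bisecting by hand, you could scan the file for non-ASCII bytes directly. A minimal sketch (the demo file here is a made-up stand-in; on your own data you would pass the path to utts.data):

```python
import os
import tempfile

def find_non_ascii(path):
    """Return (line_number, raw_line) for each line containing a non-ASCII byte."""
    bad = []
    with open(path, "rb") as f:
        for i, line in enumerate(f, start=1):
            if any(b > 127 for b in line):
                bad.append((i, line))
    return bad

# Demo on a small stand-in file; on real data, call find_non_ascii("utts.data").
demo = tempfile.NamedTemporaryFile("wb", suffix=".data", delete=False)
demo.write('( uniq_0001 "ok line" )\n'.encode("utf-8"))
demo.write('( uniq_0002 "bad line\u2019 )\n'.encode("utf-8"))  # curly quote: non-ASCII
demo.close()
print(find_non_ascii(demo.name))  # reports line 2
os.unlink(demo.name)
```

This catches non-ASCII characters but not formatting errors, so the binary search above is still useful when the bytes all look fine.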
The SoundStream codes are an alternative to the mel spectrogram.
To do Text-to-Speech, we would train a model to generate SoundStream codes, instead of generating a mel spectrogram.
Before training the system, we would pass all our training data waveforms through the SoundStream encoder, thus converting each waveform into a sequence of codes.
(In the case of a mel spectrogram, we would pass each waveform through a mel-scale filterbank to convert it to a mel spectrogram.)
Then we train a speech synthesis model to predict a code sequence given a phone (or text) input.
To do speech synthesis, we perform inference with the model to generate a sequence of codes, given a phone (or text) input. We then pass that sequence of codes through the decoder of SoundStream which outputs a waveform.
(In the case of a mel spectrogram, we would pass the mel spectrogram to a neural vocoder, which would output a waveform.)
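The two pipelines can be sketched side by side. Everything below is schematic: each function is a placeholder standing in for a trained neural network (the real SoundStream encoder/decoder and the acoustic model), and only shows where each component sits in the pipeline.

```python
# Schematic only: every function here is a placeholder for a trained network.

def soundstream_encode(waveform):
    """Placeholder for the SoundStream encoder: waveform -> code sequence."""
    return [hash(round(s, 3)) % 1024 for s in waveform[::160]]  # fake codes

def soundstream_decode(codes):
    """Placeholder for the SoundStream decoder: code sequence -> waveform."""
    return [0.0] * (len(codes) * 160)  # fake waveform samples

def acoustic_model(phones):
    """Placeholder for the synthesis model: phone sequence -> code sequence
    (in a mel-spectrogram system, it would predict mel frames instead)."""
    return [hash(p) % 1024 for p in phones for _ in range(5)]

# Training-time preparation: convert every training waveform to codes,
# which become the targets the synthesis model learns to predict.
training_waveforms = [[0.0] * 16000]  # stand-in for real recordings
training_targets = [soundstream_encode(w) for w in training_waveforms]

# Inference: phones -> codes -> waveform.
# The SoundStream decoder plays the role a neural vocoder plays for mels.
codes = acoustic_model(["h", "e", "l", "ou"])
waveform = soundstream_decode(codes)
print(len(codes), len(waveform))  # 20 3200
```

The structural point is that the decoder replaces the neural vocoder, and the code sequence replaces the mel spectrogram as the intermediate representation.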
If the output is small, and you’re running Festival in interactive mode, just copy-paste from the terminal into any plain text editor.
If you want to capture everything from an interactive session, this will capture stdout in the file out.txt but still issue it to the terminal so you can use Festival interactively:

$ festival | tee out.txt

If you are running Festival in batch (non-interactive) mode, you can redirect stdout to a file using > like this:

$ festival -b some_batch_script.scm > out.txt

You can't make a causal link from a lower target cost to "sounding better", at least for any individual diphone or even an individual utterance. As you say, other factors are at play – notably the join cost.
Remember that the costs are only ever used relative to other costs: the search minimises the total cost.
If you want to inspect the target cost for a synthesised utterance, it is available in the Utterance object.
To inspect the differences between selected units (e.g., the diphones from different source utterances that you mention), you can look at the utterances they were taken from. For example, you could look at the original left and right phonetic context of the diphone in the source utterance, and compare that to the context in which it is being used in the target utterance. The more different these are, the worse we expect that unit to sound. This difference is exactly what the target cost measures.
The unit selection search algorithm only guarantees that the selected sequence has the lowest sum of join and target costs.
It does not necessarily select an individual candidate unit that has the lowest target cost for its target position. So be careful when talking about “achieving a lower target cost”. The search will of course tend to achieve that, but only for the whole sequence.
When you say “the choice of one [candidate unit is] better than the other”, I think you simply mean “sounds better”. So that is what you would report to illustrate this; remember that expert listening (i.e., by yourself) is a valid method, provided you specify that in the experiment.
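The point that the search minimises the total cost of the whole sequence, rather than the target cost of each unit individually, can be shown with a tiny Viterbi-style search over made-up costs:

```python
# Toy unit-selection search: pick one candidate per target position,
# minimising the sum of target costs plus join costs between neighbours.
# All cost values are made up purely for illustration.

def search(target_costs, join_cost):
    """target_costs[t][c] = target cost of candidate c at position t.
    join_cost(t, a, b) = join cost between candidate a at position t
    and candidate b at position t + 1.
    Returns (best_total_cost, best_candidate_sequence) by dynamic programming."""
    best = [(tc, [c]) for c, tc in enumerate(target_costs[0])]
    for t in range(1, len(target_costs)):
        new_best = []
        for c, tc in enumerate(target_costs[t]):
            total, path = min(
                (best[a][0] + join_cost(t - 1, a, c), best[a][1])
                for a in range(len(best))
            )
            new_best.append((total + tc, path + [c]))
        best = new_best
    return min(best)

# At position 0, candidate 0 has the lower target cost, but it joins badly
# to everything at position 1, so the search selects candidate 1 instead:
# an individual unit with a higher target cost wins on total cost.
target_costs = [[1.0, 2.0], [1.0, 3.0]]
join = lambda t, a, b: 5.0 if a == 0 else 0.0
print(search(target_costs, join))  # (3.0, [1, 0])
```

This is exactly why "achieving a lower target cost" is only meaningful for the sequence as a whole, not for any single selected unit.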