Forum Replies Created
You are confusing the time domain and frequency domain. A signal can have energy at some frequency F without there being an impulse occurring every 1/F seconds.
Develop your intuitions with this tool (tip: you can use your mouse to draw any waveform you like; also, use the ‘Mag/Phase View’ rather than sines and cosines) and let me know if that helps.
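If it helps to see the same idea numerically rather than graphically, here is a minimal Python (NumPy) sketch; the sampling rate and the frequency F are just assumed values for illustration. A sine wave at F Hz has all its energy at F, yet its waveform contains no impulses at all; an impulse train with one impulse every 1/F seconds has energy at F and at all of its harmonics.

import numpy as np

fs = 16000                 # sampling rate in Hz (assumed value, for illustration)
F = 100                    # the frequency of interest, in Hz
t = np.arange(fs) / fs     # one second of sample times

# A sine wave at F Hz: all its energy is at F, but there are no impulses anywhere
sine = np.sin(2 * np.pi * F * t)

# An impulse train with one impulse every 1/F seconds: energy at F *and* all its harmonics
impulses = np.zeros(fs)
impulses[::fs // F] = 1.0

for name, x in [("sine", sine), ("impulse train", impulses)]:
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    print(name, "has strong energy at", freqs[mag > 0.5 * mag.max()][:5], "Hz")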
The coefficients in the filter equations are weights applied to speech samples. This representation of the filter is actually very hard to interpret. The relationship between the coefficients and the frequency response is complicated and not something we are going to cover in Speech Processing.
For this course, the way we will understand the filter’s behaviour is empirical rather than theoretical. We will excite the filter with an impulse and inspect the output impulse response. We can take the DFT of that to obtain the filter’s frequency response.
A key point is that there are multiple domains in which we can define and characterise the filter. One is the filter coefficients in the difference equation. Another is the filter’s impulse response – a pitch period of speech waveform. Yet another is its frequency response.
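As a concrete illustration of that empirical approach, here is a minimal Python (NumPy/SciPy) sketch. The coefficient values are invented purely for this example: they make a simple resonator, not a real vocal tract.

import numpy as np
from scipy.signal import lfilter

fs = 16000                                   # sampling rate in Hz (assumed)
r, theta = 0.95, 2 * np.pi * 1000 / fs       # invented pole radius and angle: a resonance near 1 kHz
b = [1.0]                                    # weights on the input x
a = [1.0, -2 * r * np.cos(theta), r ** 2]    # weights on past outputs y (the feedback terms)

# 1. Excite the filter with an impulse...
impulse = np.zeros(512)
impulse[0] = 1.0

# 2. ...and inspect the output: the impulse response
h = lfilter(b, a, impulse)

# 3. Take the DFT of the impulse response to obtain the frequency response
H = np.abs(np.fft.rfft(h))
freqs = np.fft.rfftfreq(len(h), 1 / fs)
print("strongest response near", freqs[np.argmax(H)], "Hz")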
So, where do the filter coefficients come from? Conceptually, this is straightforward: take a pitch period of speech and find a set of coefficients that gives that output, when the input is an impulse. You might imagine doing that by trial and error (and can do exactly that in the notebooks).
The formal method for finding a set of filter coefficients that give the desired output speech signal involves solving a set of simultaneous equations; there are various methods available for this. They are out-of-scope for this course: we just want you to understand at a conceptual level.
Yes, Module 2 tutorials are this week (w/c 2020-10-05). Module 3 will be released in the first part of this week, giving you at least a week to work on it before the tutorials next week.
The earlier parts of the course have undergone the most changes (we hope they are improvements) this year, hence the gradual release of modules.
The later parts of the course will actually be released further ahead than this, giving you more time to work on them.
Remember that the tutorial for a module is not the end point for that material. It’s just one of the multiple learning modes to use. You’ll always need to go back over material again, and synthesise what you have learned from videos, tutorials, readings, and discussion.
Let’s clear up the terms ‘lossless’ and ‘codec’ first. In the Speech Processing course, we are only ever talking about raw waveforms. These are ‘lossless’ and there is no ‘codec’ as such: the values of the samples are stored directly. ‘WAV’ is just a file format for storing raw waveforms preceded by a header containing useful information such as sample rate, duration, number of channels, etc.
A lossy codec, such as mp3 or AAC, does not store the samples, but encodes them in a way that loses some unimportant information (determined using a model of human hearing, for general-purpose codecs). We don’t need to understand these codecs for the Speech Processing course. Speech-specific codecs, such as that used on your mobile phone, typically use the source-filter model, rather than a model of hearing.
Now on to the value of different sampling rates and bit depths. For consumer audio, there is little or no benefit in using a sampling rate higher than 44.1 kHz or a bit depth greater than 16. We more often see 48 kHz in professional audio, simply because it divides by 2 or 3 more sensibly.
In professional audio, such as a music recording studio, we may well use a higher sampling rate and greater bit depth. This is because the signal will undergo all sorts of processing as part of the production process (e.g. time and pitch modification). This processing will introduce artefacts, and having a very high Nyquist frequency will place those artefacts up beyond the range of human hearing. A greater bit depth simply means storing each sample with greater precision, again giving more robustness against some sorts of processing, such as changing the level (e.g., when mixing tracks together). Just before publishing the music, the audio is downsampled to 44.1 kHz and the bit depth reduced to 16.
Some people claim to be able to hear the difference between 48 kHz and 96 kHz. You would need a very well-produced example audio file, a good ear, and expensive equipment to try this for yourself.
Here is an example of reducing bit depth, so you can hear the effect.
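If you would like to make such an example yourself, here is a minimal Python sketch. The file names are hypothetical, and it assumes the soundfile package is installed.

import numpy as np
import soundfile as sf          # assumes the soundfile package is installed

x, fs = sf.read("speech.wav")   # hypothetical input file: samples as floats in [-1, 1]

bits = 4                        # target bit depth: 4 bits = 16 quantisation levels
scale = 2 ** (bits - 1) - 1
x_quantised = np.round(x * scale) / scale      # snap every sample to the nearest level

sf.write("speech_4bit.wav", x_quantised, fs)   # hypothetical output file: listen and compare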
A very good question, which is about how the theory of sampling meets practical application.
Let’s start with the easy case of a sampling frequency much higher than any component frequency of the signal being sampled. We will get many samples per cycle and therefore good reconstruction when we ‘join the dots’ (the samples). All good so far.
Now consider the limiting case: a sampling rate that is only twice as high as any component of the signal being sampled. You are right that we might be ‘lucky’ or ‘unlucky’ in where the sampling points fall with respect to the signal being sampled. This is explored in one of the notebooks.
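Here is a minimal Python sketch of that limiting case, with an assumed sampling rate of 16 kHz. A component at exactly the Nyquist frequency is either captured or missed entirely, depending on where the sample points happen to fall.

import numpy as np

fs = 16000                       # sampling rate (assumed); Nyquist frequency = 8000 Hz
f = fs / 2                       # the limiting case: a component exactly at the Nyquist frequency
n = np.arange(100)               # sample indices

# The same sinusoid, but the sample points fall at different places on the waveform
lucky = np.sin(2 * np.pi * f * n / fs + np.pi / 2)   # samples land on the peaks: amplitude preserved
unlucky = np.sin(2 * np.pi * f * n / fs)             # samples land on the zero crossings: signal vanishes

print("peak sample value, lucky:  ", np.abs(lucky).max())    # 1.0
print("peak sample value, unlucky:", np.abs(unlucky).max())  # effectively 0.0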
The take-home message is that a digital representation of a signal is going to be reliable for frequencies well below the Nyquist frequency, but that we should not entirely trust the representation of frequencies close to the Nyquist frequency.
The Nyquist frequency is a theoretical limit – an absolute maximum frequency that we can capture by sampling. In practical applications, we would not want to operate too close to this limit. For example, if you thought that speech synthesis requires perfect reconstruction of frequencies up to 10 kHz, then you would not choose to sample at 20 kHz but something a little higher.
The digital signal is an approximation of the original analogue signal that was sampled. We have made compromises, which are unavoidable in all real engineering applications. They are not a problem, so long as we understand the consequences.
The quotes are there to handle files with spaces in their name.
ffmpeg -i in.aac out.wav
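For example (these file names are made up), quote any name that contains a space:

ffmpeg -i "my recording.aac" "my recording.wav"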
The two forms of energy exchanged are the same as in the swing.
A swing has maximum potential energy at its highest point, when it is momentarily stationary (= no kinetic energy). It has maximum kinetic energy at its lowest point, when it is moving fastest.
In air resonating in a tube, the same two forms of energy are exchanged. Potential energy is air under increased pressure, and kinetic energy is air moving at maximum velocity.
Look at these air molecules – they alternate between being “all bunched together and stationary” (maximum potential energy) and being “evenly spaced and moving quickly” (maximum kinetic energy).
To understand why pressure is potential energy, imagine a cylinder storing gas: it contains high pressure gas. Open the valve and this is converted into kinetic energy as the gas rushes out at high speed. Potential energy has been converted to (= exchanged for) kinetic energy. The total amount of energy is conserved.
The form of filter you give only has terms involving x[.] on the right hand side (“RHS” in maths jargon). This is a Finite Impulse Response (FIR) filter, and you can explore that in one of the Module 2 Jupyter notebooks.
The equation operates in the time domain, and the coefficients are simply weights applied to input (x) samples: the output is nothing more than a weighted sum of input samples.
Importantly, there is no term involving the output (y) on the RHS. This means there is no “feedback”. For any given input, the filter’s output will only continue for a finite time (the “F” in FIR) after the input stops.
In contrast, if we start putting some y[.] terms on the RHS, then there will be some feedback, and the filter can potentially produce output for an infinite duration after the input has ceased. This form of filter can exhibit resonance, and so is the form we will use to model the vocal tract filter.
You can explore Infinite Impulse Response (IIR) filters in one of the Module 2 Jupyter notebooks.
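If you want to see the difference before opening the notebooks, here is a minimal Python (SciPy) sketch; the coefficient values are invented just for illustration.

import numpy as np
from scipy.signal import lfilter

impulse = np.zeros(50)
impulse[0] = 1.0

# FIR: only x[.] terms (the b coefficients), no feedback.
# The output stops 3 samples after the input stops.
b_fir = [0.5, 0.3, 0.2]
y_fir = lfilter(b_fir, [1.0], impulse)

# IIR: a y[.] term on the RHS introduces feedback: y[n] = x[n] + 0.9 * y[n-1]
# The output decays but never exactly stops.
y_iir = lfilter([1.0], [1.0, -0.9], impulse)

print("FIR output beyond sample 3:", np.abs(y_fir[3:]).max())   # exactly zero
print("IIR output at sample 49:   ", y_iir[49])                 # still non-zero (0.9 ** 49)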
Try the notebooks, then post follow-up questions.
An analogue signal – such as a sound wave propagating through air – may contain frequencies over a very wide range, with no upper limit.
When we need a digital representation of such a signal, we need to choose a sampling rate (which then determines the Nyquist frequency). Our choice of sampling rate will be influenced by:
- What information in the sound we think is important – we might say that only frequencies up to 8 kHz are useful for Automatic Speech Recognition, for example.
- Practical considerations such as the amount of storage the digital waveform will require (higher sampling rate = larger files) or whether we need to transmit it (higher sampling rate = larger bandwidth required).
We will generally choose the lowest possible sampling rate that satisfies the first requirement, related to the application we are building.
We must remove any components of the analogue signal that are above the Nyquist frequency, before sampling it. This is done in the analogue domain using a low-pass filter (an ‘anti-aliasing filter’). There is such a filter in your computer’s audio input, for example.
You’re unlikely to ever need to build an analogue-to-digital convertor, so you might be wondering why we care about this…
The same thing applies when reducing the sampling rate of an existing digital signal – a process known as downsampling. For example, to halve the sampling rate, we cannot simply take every second sample. We must first pass the digital signal through a low-pass filter (an ‘anti-aliasing filter’ in the digital domain) to remove everything above the new, lower, Nyquist frequency.
Downsampling is quite common when preparing existing speech recordings for use in speech technology. They may have been recorded at a higher sampling rate than we wish to use.
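Here is a minimal Python (SciPy) sketch of halving the sampling rate. The signal is synthetic, chosen so you can see the aliasing; resample_poly applies the anti-aliasing filter for us, in contrast to naive decimation.

import numpy as np
from scipy.signal import resample_poly

fs = 32000
t = np.arange(fs) / fs
# A 2 kHz component (below the new Nyquist of 8 kHz) plus a 12 kHz component (above it)
x = np.sin(2 * np.pi * 2000 * t) + np.sin(2 * np.pi * 12000 * t)

# Wrong: simply take every second sample. The 12 kHz component aliases down to 4 kHz.
naive = x[::2]

# Right: low-pass filter below the new Nyquist frequency, then decimate.
# resample_poly applies a suitable anti-aliasing filter internally.
filtered = resample_poly(x, up=1, down=2)

for name, y in [("naive", naive), ("filtered", filtered)]:
    mag = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), 1 / 16000)
    print(name, "-> strong energy at", freqs[mag > 0.25 * mag.max()], "Hz")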
If you get an error message, include it here. If there is no error message, then something else is wrong. Here’s what it looks like when SayText runs correctly:
festival> (SayText "Hello world.")
#<Utterance 0x7f17db4840f0>
festival>
and here’s what some errors might look like:
festival> SayText "Hello world."
#<CLOSURE (text) (begin "(SayText TEXT) TEXT, a string, is rendered as speech." (utt.play (utt.synth (eval (list (quote Utterance) (quote Text) text)))))>
"Hello world."

festival> (SayText "Hello world."
>

festival> (SayText Hello world.)
SIOD ERROR: unbound variable : Hello

festival> SayText("Hello world.")
#<CLOSURE (text) (begin "(SayText TEXT) TEXT, a string, is rendered as speech." (utt.play (utt.synth (eval (list (quote Utterance) (quote Text) text)))))>
SIOD ERROR: bad function : "Hello world."
Yes, that’s right. In that case, the waveforms will look different, but (in general) we will not hear any difference.
I like your recipe analogy – let’s try using it: If we construct a recipe using the wrong phase, we’ll use the correct ingredients (i.e., sinusoids with the correct magnitudes), but in the wrong relationship to each other.
On the left of the attached picture (you may need to be logged in to see it) is a cake constructed with the correct phases of all the ingredients. On the right, the same ingredients with the wrong phases. Close your eyes and they will taste the same, but they look very different.
Note that the sinusoid basis functions in Fourier analysis can never cancel each other out though – because they are orthogonal.
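Here is a minimal Python sketch along those lines: two signals built from exactly the same ‘ingredients’ (the same magnitudes at the same frequencies) but combined with different phases. The frequencies, magnitudes and phase shifts are all invented for illustration. Plot the two waveforms and they look quite different, but their magnitude spectra are identical.

import numpy as np

fs = 16000
t = np.arange(fs) / fs
mags = [1.0, 0.5, 0.25]         # the 'ingredients': magnitudes...
freqs = [200, 400, 600]         # ...at these frequencies (an invented harmonic series)

# The same ingredients, combined with the original phases and with altered phases
original = sum(m * np.sin(2 * np.pi * f * t) for m, f in zip(mags, freqs))
rephased = sum(m * np.sin(2 * np.pi * f * t + f / 100) for m, f in zip(mags, freqs))

# Identical magnitude spectra, so they sound the same; the waveforms look different
same_magnitudes = np.allclose(np.abs(np.fft.rfft(original)), np.abs(np.fft.rfft(rephased)))
print("magnitude spectra identical:", same_magnitudes)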
The video ‘Frequency domain’ will help you understand why phase is less important than magnitude, for human perception, and for speech technology.
The terms ‘pitch period’ and ‘fundamental period’ are used interchangeably in the field. You’re right that this is technically incorrect.
‘Register’ here just means ‘in a different frequency range’.
Don’t worry if you think you are only analysing these sounds in very simple terms: you are, and that’s the point here – just get your hands on some audio samples and inspect them. Do the readings as well as the exercises – they will help.