It’s really not easy to look at a waveform and guess how much energy it will have at each harmonic (except in some special cases like the above). That’s why we prefer to inspect signals in the frequency domain.
Sounds like you are spending too much time looking at waveforms and not enough time with the spectrum?
Periodic signals
All periodic signals have energy only at multiples of the fundamental frequency (which are called the harmonics). We can see the periodicity in the waveform, but not much else.
How much energy at each harmonic is what differentiates one signal from another.
Special cases (where inspecting the waveform makes sense)
The sine wave is the simplest case: it has energy at the fundamental frequency only and no energy at all the other multiples.
The impulse train has an equal amount of energy at every multiple of the fundamental.
A square wave has energy at all the odd multiples of the fundamental and no energy at the even multiples.
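If you want to check these special cases for yourself, here is a minimal sketch (assuming NumPy and Matplotlib are available; the sampling rate and fundamental frequency are arbitrary choices for illustration) that builds a square wave and an impulse train and plots their magnitude spectra. You should see peaks only at odd multiples of the fundamental for the square wave, and equal-height peaks at every multiple for the impulse train.

import numpy as np
import matplotlib.pyplot as plt

fs = 16000                      # sampling rate in Hz (an assumption for this sketch)
f0 = 100                        # fundamental frequency in Hz
t = np.arange(fs) / fs          # one second of time axis

# Square wave: expect energy at odd harmonics of f0 only
square = np.sign(np.sin(2 * np.pi * f0 * t))

# Impulse train: expect equal energy at every harmonic of f0
impulses = np.zeros(fs)
impulses[::fs // f0] = 1.0

for name, x in [("square wave", square), ("impulse train", impulses)]:
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    plt.plot(freqs, spectrum, label=name)

plt.xlim(0, 1000)
plt.xlabel("frequency (Hz)")
plt.ylabel("magnitude")
plt.legend()
plt.show()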
The general case (where inspecting the waveform is of limited use)
Voiced speech has energy at all multiples of the fundamental (in common with the impulse train) but the amount of energy varies with frequency (why?) and so voiced speech does not sound like an impulse train.
Yes, the figure is different in the 2nd edition, which has 4 sub-plots, all with the same spectral envelope but different F0. (The attachment in #7813 is from one version of that edition, but that version contains serious errors.)
In the top sub-figure, the waveform makes 6 cycles within the first pitch period and that’s what the “6 peaks” is referring to. This is the resonance of the vocal tract – the “ringing” of the filter in response to an input impulse. It corresponds to the peak in the spectral envelope at 600 Hz.
All the waveforms in the other sub-figures have the same ‘ringing’ behaviour, it’s just that the input impulses are spaced at different fundamental periods.
Attached is a correct Figure 4.13 from my hardcopy 2nd edition.
The course deliberately uses multiple modes of learning so you see material from many points of view. You’re right that the SIGNALS notebooks were a bit tough – that’s because we wanted you to see the mathematical way to do signal processing: eventually you will understand some of this and see that it can be the most simple and direct route to understanding. For now, just getting some intuitive understanding is all you need.
The gap between the videos (pretty pictures but glossing over details) and the notebooks (all the gory details, possibly too much) can be filled by some of the readings and by asking questions in tutorials or here on the forums. Try posting about what you think you understand and ask for confirmation, as well as posting about what you are struggling with.
In Module 3, there is a consolidation tutorial for the SIGNALS material – use that to ask questions about how much you are expected to understand at this point in the course (and how much more by the end of the course).
I hope you are writing your own notes, as you would if we had lectures. Writing your own ‘textbook’ which draws together all the material is my top tip for learning on this course.
Persevere for a week or so, then tell us how you are doing.
You are confusing the time domain and frequency domain. A signal can have energy at some frequency F without there being an impulse occurring every 1/F seconds.
Develop your intuitions with this tool (tip: you can use your mouse to draw any waveform you like; also, use the ‘Mag/Phase View’ rather than sines and cosines) and let me know if that helps.
The coefficients in the filter equations are weights applied to speech samples. This representation of the filter is actually very hard to interpret. The relationship between the coefficients and the frequency response is complicated and not something we are going to cover in Speech Processing.
For this course, the way we will understand the filter’s behaviour is empirical rather than theoretical. We will excite the filter with an impulse and inspect the output impulse response. We can take the DFT of that to obtain the filter’s frequency response.
A key point is that there are multiple domains in which we can define and characterise the filter. One is the filter coefficients in the difference equation. Another is the filter’s impulse response – a pitch period of speech waveform. Yet another is its frequency response.
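To make that empirical route concrete, here is a minimal sketch (assuming NumPy and SciPy; the filter coefficients are made up purely for illustration, giving a single resonance near 600 Hz) of exciting a filter with an impulse and taking the DFT of the impulse response to obtain the frequency response.

import numpy as np
from scipy.signal import lfilter

fs = 16000                      # sampling rate in Hz (an assumption for this sketch)
# Made-up filter coefficients: a single resonance near 600 Hz, for illustration only
b = [1.0]                                               # input (feedforward) coefficients
a = [1.0, -1.8 * np.cos(2 * np.pi * 600 / fs), 0.81]    # feedback coefficients

# Excite the filter with a single impulse and record the output (the impulse response)
impulse = np.zeros(512)
impulse[0] = 1.0
impulse_response = lfilter(b, a, impulse)

# The DFT of the impulse response gives the filter's frequency response
frequency_response = np.abs(np.fft.rfft(impulse_response))
freqs = np.fft.rfftfreq(len(impulse_response), d=1 / fs)
print("peak of the frequency response is near", freqs[np.argmax(frequency_response)], "Hz")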
So, where do the filter coefficients come from? Conceptually, this is straightforward: take a pitch period of speech and find a set of coefficients that gives that output, when the input is an impulse. You might imagine doing that by trial and error (and can do exactly that in the notebooks).
The formal method for finding a set of filter coefficients that give the desired output speech signal involves solving a set of simultaneous equations; there are various methods available for this. They are out-of-scope for this course: we just want you to understand at a conceptual level.
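For anyone curious about those simultaneous equations (entirely optional, and beyond what the course requires), here is a minimal sketch of one standard approach: linear prediction via the autocorrelation "normal equations", solved with NumPy. The frame used here is just a made-up damped oscillation standing in for a pitch period of speech.

import numpy as np

def lpc_coefficients(frame, order):
    """Estimate filter coefficients for one frame of speech by solving
    the autocorrelation normal equations (one of several possible methods)."""
    # Autocorrelation of the frame at lags 0 .. order
    full = np.correlate(frame, frame, mode='full')
    r = full[len(frame) - 1 : len(frame) + order]
    # Build the Toeplitz system R a = r and solve for the coefficients
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])

# Placeholder 'pitch period': a decaying oscillation, for illustration only
n = np.arange(400)
frame = np.exp(-0.01 * n) * np.sin(2 * np.pi * 0.05 * n)
print(lpc_coefficients(frame, order=10))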
Yes, Module 2 tutorials are this week (w/c 2020-10-05). Module 3 will be released in the first part of this week, giving you at least a week to work on it before the tutorials next week.
The earlier parts of the course have undergone the most changes (we hope they are improvements) this year, hence the gradual release of modules.
The later parts of the course will be released even further in advance, giving you more time to work on them.
Remember that the tutorial for a module is not the end point for that material. It’s just one of the multiple learning modes to use. You’ll always need to go back over material again, and synthesise what you have learned from videos, tutorials, readings, and discussion.
Let’s clear up the terms ‘lossless’ and ‘codec’ first. In the Speech Processing course, we are only ever talking about raw waveforms. These are ‘lossless’ and there is no ‘codec’ as such: the values of the samples are stored directly. ‘WAV’ is just a file format for storing raw waveforms preceded by a header containing useful information such as sample rate, duration, number of channels, etc.
A lossy codec, such as mp3 or AAC, does not store the samples, but encodes them in a way that discards some perceptually unimportant information (determined, for general-purpose codecs, using a model of human hearing). We don’t need to understand these codecs for the Speech Processing course. Speech-specific codecs, such as the one used on your mobile phone, typically use the source-filter model rather than a model of hearing.
Now on to the value of different sampling rates and bit depths. For consumer audio, there is little or no benefit in using a sampling rate higher than 44.1 kHz or a bit depth greater than 16. We more often see 48 kHz in professional audio, simply because it divides more conveniently by 2 or 3.
In professional audio, such as a music recording studio, we may well use a higher sampling rate and greater bit depth. This is because the signal will undergo all sorts of processing as part of the production process (e.g., time and pitch modification). This processing will introduce artefacts, and having a very high Nyquist frequency will place those artefacts up beyond the range of human hearing. A greater bit depth simply means storing each sample with greater precision, again giving more robustness against some sorts of processing, such as changing the level (e.g., when mixing tracks together). Just before publishing the music, the audio is downsampled to 44.1 kHz and the bit depth reduced to 16.
Some people claim to be able to hear the difference between 48 kHz and 96 kHz. You would need a very well-produced example audio file, a good ear, and expensive equipment to try this for yourself.
Here is an example of reducing bit depth, so you can hear the effect.
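If you’d like to make a similar example yourself, here is a minimal sketch (assuming SciPy, and a 16-bit mono WAV file; the filename speech.wav is just a placeholder) that re-quantises a recording to 8 bits so you can listen to the added quantisation noise.

import numpy as np
from scipy.io import wavfile

# 'speech.wav' is a placeholder name; any 16-bit mono WAV file will do
fs, samples = wavfile.read("speech.wav")

bits = 8                              # new, lower bit depth
step = 2 ** (16 - bits)               # quantisation step relative to 16-bit
quantised = (samples // step) * step  # throw away the least significant bits

wavfile.write("speech_8bit.wav", fs, quantised.astype(np.int16))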
A very good question, which is about how the theory of sampling meets practical application.
Let’s start with the easy case of a sampling frequency much higher than any component frequency of the signal being sampled. We will get many samples per cycle and therefore good reconstruction when we ‘join the dots’ (the samples). All good so far.
Now consider the limiting case: a sampling rate that is only twice as high as any component of the signal being sampled. You are right that we might be ‘lucky’ or ‘unlucky’ in where the sampling points fall with respect to the signal being sampled. This is explored in one of the notebooks.
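Here is a minimal sketch of that ‘lucky vs unlucky’ effect, assuming NumPy: we sample a sine at exactly the Nyquist frequency with two different phases, and look at the amplitude each set of samples captures.

import numpy as np

fs = 16000                 # sampling rate in Hz (an assumption for this sketch)
f = fs / 2                 # a component exactly at the Nyquist frequency
n = np.arange(32)          # sample indices

lucky = np.sin(2 * np.pi * f * n / fs + np.pi / 2)   # samples land on the peaks
unlucky = np.sin(2 * np.pi * f * n / fs)             # samples land on the zero crossings

print("peak amplitude captured (lucky):  ", np.max(np.abs(lucky)))    # about 1
print("peak amplitude captured (unlucky):", np.max(np.abs(unlucky)))  # essentially 0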
The take-home message is that a digital representation of a signal is going to be reliable for frequencies well below the Nyquist frequency, but that we should not entirely trust the representation of frequencies close to the Nyquist frequency.
The Nyquist frequency is a theoretical limit – an absolute maximum frequency that we can capture by sampling. In practical applications, we would not want to operate too close to this limit. For example, if you thought that speech synthesis requires perfect reconstruction of frequencies up to 10 kHz, then you would not choose to sample at 20 kHz, but at something a little higher.
The digital signal is an approximation of the original analogue signal that was sampled. We have made compromises, which are unavoidable in all real engineering applications. They are not a problem, so long as we understand the consequences.
The quotes are there to handle files with spaces in their name.
ffmpeg -i in.aac out.wav
The two forms of energy exchanged are the same as in the swing.
A swing has maximum potential energy at its highest point, when it is momentarily stationary (= no kinetic energy). It has maximum kinetic energy at its lowest point, when it is moving fastest.
In air resonating in a tube, the same two forms of energy are exchanged. Potential energy is air under increased pressure, and kinetic energy is air moving at maximum velocity.
Look at these air molecules – they alternate between being “all bunched together and stationary” (maximum potential energy) and being “evenly spaced and moving quickly” (maximum kinetic energy).
To understand why pressure is potential energy, imagine a cylinder storing gas: it contains high pressure gas. Open the valve and this is converted into kinetic energy as the gas rushes out at high speed. Potential energy has been converted to (= exchanged for) kinetic energy. The total amount of energy is conserved.
The form of filter you give only has terms involving x[.] on the right hand side (“RHS” in maths jargon). This is a Finite Impulse Response (FIR) filter, and you can explore that in one of the Module 2 Jupyter notebooks.
The equation operates in the time domain, and the coefficients are simply weights applied to input (x) samples: the output is nothing more than a weighted sum of input samples.
Importantly, there is no term involving the output (y) on the RHS. This means there is no “feedback”. For any given input, the filter’s output will only continue for a finite time (the “F” in FIR) after the input stops.
In contrast, if we start putting some y[.] terms on the RHS, then there will be some feedback, and the filter can potentially produce output for an infinite duration after the input has ceased. This form of filter can exhibit resonance, and so is the form we will use to model the vocal tract filter.
You can explore Infinite Impulse Response (IIR) filters in one of the Module 2 Jupyter notebooks.
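If you want a preview before the notebooks, here is a minimal sketch (assuming NumPy and SciPy; the coefficients are made up for illustration) comparing the two forms: the FIR output stops shortly after the impulse, while the IIR output keeps ‘ringing’.

import numpy as np
from scipy.signal import lfilter

impulse = np.zeros(50)
impulse[0] = 1.0

# FIR: only x[.] terms (no feedback); made-up weights for illustration
fir_out = lfilter([0.5, 0.3, 0.2], [1.0], impulse)

# IIR: a y[.] term on the RHS introduces feedback, so the output rings on
iir_out = lfilter([1.0], [1.0, -0.9], impulse)

print("FIR output after 10 samples:", fir_out[10])   # exactly zero
print("IIR output after 10 samples:", iir_out[10])   # still non-zero (0.9 ** 10)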
Try the notebooks, then post follow-up questions.
An analogue signal – such as a sound wave propagating through air – may contain frequencies over a very wide range, with no upper limit.
When we need a digital representation of such a signal, we need to choose a sampling rate (which then determines the Nyquist frequency). Our choice of sampling rate will be influenced by:
- What information in the sound we think is important – we might say that only frequencies up to 8 kHz are useful for Automatic Speech Recognition, for example.
- Practical considerations such as the amount of storage the digital waveform will require (higher sampling rate = larger files) or whether we need to transmit it (higher sampling rate = larger bandwidth required).
We will generally choose the lowest possible sampling rate that satisfies the first requirement, related to the application we are building.
We must remove any components of the analogue signal that are above the Nyquist frequency, before sampling it. This is done in the analogue domain using a low-pass filter (an ‘anti-aliasing filter’). There is such a filter in your computer’s audio input, for example.
You’re unlikely to ever need to build an analogue-to-digital convertor, so you might be wondering why we care about this…
The same thing applies when reducing the sampling rate of an existing digital signal – a process known as downsampling. For example, to halve the sampling rate, we cannot simply take every second sample. We must first pass the digital signal through a low-pass filter (an ‘anti-aliasing filter’ in the digital domain) to remove everything above the new, lower, Nyquist frequency.
Downsampling is quite common when preparing existing speech recordings for use in speech technology. They may have been recorded at a higher sampling rate than we wish to use.
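Here is a minimal sketch of downsampling in practice, assuming SciPy (the filename speech_32k.wav is just a placeholder for a recording at 32 kHz): scipy.signal.decimate applies an anti-aliasing low-pass filter before discarding samples, whereas naive slicing does not.

import numpy as np
from scipy.signal import decimate
from scipy.io import wavfile

# 'speech_32k.wav' is a placeholder name for a 32 kHz recording
fs, samples = wavfile.read("speech_32k.wav")
samples = samples.astype(np.float64)

# Wrong: taking every second sample aliases anything above the new Nyquist frequency
naive = samples[::2]

# Right: low-pass filter (anti-aliasing) first, then discard samples
downsampled = decimate(samples, 2)

wavfile.write("speech_16k.wav", fs // 2,
              np.clip(downsampled, -32768, 32767).astype(np.int16))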
If you get an error message, include it here. If there is no error message, then something else is wrong. Here’s what it looks like when SayText runs correctly:
festival> (SayText "Hello world.")
#<Utterance 0x7f17db4840f0>
festival>
and here’s what some errors might look like:
festival> SayText "Hello world."
#<CLOSURE (text) (begin "(SayText TEXT) TEXT, a string, is rendered as speech." (utt.play (utt.synth (eval (list (quote Utterance) (quote Text) text)))))>
"Hello world."
festival> (SayText "Hello world."
>
festival> (SayText Hello world.)
SIOD ERROR: unbound variable : Hello
festival> SayText("Hello world.")
#<CLOSURE (text) (begin "(SayText TEXT) TEXT, a string, is rendered as speech." (utt.play (utt.synth (eval (list (quote Utterance) (quote Text) text)))))>
SIOD ERROR: bad function : "Hello world."
Yes, that’s right. In that case, the waveforms will look different, but (in general) we will not hear any difference.