Forum Replies Created
Please try the machine scp1.ppls.ed.ac.uk and report back (instructions now updated).

No, that’s not correct. The last N/2 values are identical to the first N/2 – they are just copies of the same values (their ordering is mirrored around the Nyquist frequency).
There are only N/2 magnitudes. No more.
The N numbers (samples) in the time domain (waveform) have been transformed into N/2 magnitudes and N/2 phases in the frequency domain. So, in the frequency domain (= magnitude spectrum & phase spectrum) there are also exactly N numbers.
In other words, the transform to the frequency domain has preserved all of the information in the waveform. That means the inverse transform will perfectly reconstruct the waveform.
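This bookkeeping can be checked numerically – a sketch using NumPy with an arbitrary random signal (not taken from the course notebooks):

```python
import numpy as np

N = 16
x = np.random.default_rng(0).standard_normal(N)   # any real-valued 'waveform'
X = np.fft.fft(x)
mag = np.abs(X)

# Magnitudes above the Nyquist bin mirror those below it
# (bins 0 and N/2 are their own mirror images)
print(np.allclose(mag[1:N//2], mag[N//2+1:][::-1]))   # True

# The inverse transform reconstructs the waveform perfectly
print(np.allclose(np.fft.ifft(X).real, x))            # True
```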
Yes, that’s all correct.
You obtained the notebooks from a git repository and that created a folder called uoe_speech_processing_course somewhere on your machine. Navigate to that folder in the terminal before running jupyter notebook, and then your browser will open in the right place. Otherwise, just navigate in your browser to wherever you have uoe_speech_processing_course (but note that Jupyter won’t allow you to navigate upwards from the folder it was started in – this might be your problem?).

It’s really not easy to look at a waveform and guess how much energy it will have at each harmonic (except in some special cases like the above). That’s why we prefer to inspect signals in the frequency domain.
Sounds like you are spending too much time looking at waveforms and not enough time with the spectrum?
Periodic signals
All periodic signals have energy only at multiples of the fundamental frequency (which are called the harmonics). We can see the periodicity in the waveform, but not much else.
How much energy at each harmonic is what differentiates one signal from another.
Special cases (where inspecting the waveform makes sense)
The sine wave is the simplest case: it has energy at the fundamental frequency only and no energy at all the other multiples.
The impulse train has an equal amount of energy at every multiple of the fundamental.
A square wave has energy at all the odd multiples of the fundamental and no energy at the even multiples.
The general case (where inspecting the waveform is of limited use)
Voiced speech has energy at all multiples of the fundamental (in common with the impulse train) but the amount of energy varies with frequency (why?) and so voiced speech does not sound like an impulse train.
Yes, the figure is different in the 2nd edition, in which there are 4 sub-plots, all with the same spectral envelope but different F0. (The attachment in #7813 is from one version of that edition, but that version has serious errors in it.)
In the top sub-figure, the waveform makes 6 cycles within the first pitch period and that’s what the “6 peaks” is referring to. This is the resonance of the vocal tract – the “ringing” of the filter in response to an input impulse. It corresponds to the peak in the spectral envelope at 600 Hz.
All the waveforms in the other sub-figures have the same ‘ringing’ behaviour, it’s just that the input impulses are spaced at different fundamental periods.
Attached is a correct Figure 4.13 from my hardcopy 2nd edition.
Attachments:
The course deliberately uses multiple modes of learning so you see material from many points of view. You’re right that the SIGNALS notebooks were a bit tough – that’s because we wanted you to see the mathematical way of doing signal processing; eventually you will understand some of it and see that it can be the most direct route to understanding. For now, just getting some intuitive understanding is all you need.
The gap between the videos (pretty pictures but glossing over details) and the notebooks (all the gory details, possibly too much) can be filled by some of the readings and by asking questions in tutorials or here on the forums. Try posting about what you think you understand and ask for confirmation, as well as posting about what you are struggling with.
In Module 3, there is a consolidation tutorial for the SIGNALS material – use that to ask questions about how much you are expected to understand at this point in the course (and how much more by the end of the course).
I hope you are writing your own notes, as you would if we had lectures. Writing your own ‘textbook’ which draws together all the material is my top tip for learning on this course.
Persevere for a week or so, then tell us how you are doing.
You are confusing the time domain and frequency domain. A signal can have energy at some frequency F without there being an impulse occurring every 1/F seconds.
Develop your intuitions with this tool (tip: you can use your mouse to draw any waveform you like; also, use the ‘Mag/Phase View’ rather than sines and cosines) and let me know if that helps.
The coefficients in the filter equations are weights applied to speech samples. This representation of the filter is actually very hard to interpret. The relationship between the coefficients and the frequency response is complicated and not something we are going to cover in Speech Processing.
For this course, the way we will understand the filter’s behaviour is empirical rather than theoretical. We will excite the filter with an impulse and inspect the output impulse response. We can take the DFT of that to obtain the filter’s frequency response.
A key point is that there are multiple domains in which we can define and characterise the filter. One is the filter coefficients in the difference equation. Another is the filter’s impulse response – a pitch period of speech waveform. Yet another is its frequency response.
So, where do the filter coefficients come from? Conceptually, this is straightforward: take a pitch period of speech and find a set of coefficients that gives that output when the input is an impulse. You might imagine doing that by trial and error (and you can do exactly that in the notebooks).
The formal method for finding a set of filter coefficients that gives the desired output speech signal involves solving a set of simultaneous equations; there are various methods available for this. They are out of scope for this course: we just want you to understand at a conceptual level.
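The empirical approach can be sketched in a few lines of NumPy. The coefficients below are made up for illustration (they are not from any real speech signal); they are chosen to place a single resonance near 600 Hz. We excite the filter with an impulse, record the impulse response, and take its DFT to get the frequency response:

```python
import numpy as np

fs = 16000                                  # assumed sample rate (Hz)
r, f = 0.98, 600.0                          # pole radius and resonant frequency (illustrative)
theta = 2 * np.pi * f / fs
a1, a2 = 2 * r * np.cos(theta), -r * r      # difference equation: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]

N = 1024
x = np.zeros(N); x[0] = 1.0                 # a single input impulse
y = np.zeros(N)                             # the impulse response
for n in range(N):
    y[n] = x[n]
    if n >= 1: y[n] += a1 * y[n - 1]
    if n >= 2: y[n] += a2 * y[n - 2]

# The DFT of the impulse response is the filter's frequency response
H = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(N, 1 / fs)
print(freqs[np.argmax(H)])                  # peak close to 600 Hz
```

The impulse response rings and decays, just like the waveform within one pitch period of voiced speech.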
Yes, Module 2 tutorials are this week (w/c 2020-10-05). Module 3 will be released in the first part of this week, giving you at least a week to work on it before the tutorials next week.
The earlier parts of the course have undergone the most changes (we hope they are improvements) this year, hence the gradual release of modules.
The later parts of the course will actually be released further ahead than this, giving you more time to work on them.
Remember that the tutorial for a module is not the end point for that material. It’s just one of the multiple learning modes to use. You’ll always need to go back over material again, and synthesise what you have learned from videos, tutorials, readings, and discussion.
Let’s clear up the terms ‘lossless’ and ‘codec’ first. In the Speech Processing course, we are only ever talking about raw waveforms. These are ‘lossless’ and there is no ‘codec’ as such: the values of the samples are stored directly. ‘WAV’ is just a file format for storing raw waveforms preceded by a header containing useful information such as sample rate, duration, number of channels, etc.
A lossy codec, such as MP3 or AAC, does not store the samples, but encodes them in a way that loses some unimportant information (determined using a model of human hearing, for general-purpose codecs). We don’t need to understand these codecs for the Speech Processing course. Speech-specific codecs, such as the one used on your mobile phone, typically use the source-filter model rather than a model of hearing.
Now on to the value of different sampling rates and bit depths. For consumer audio, there is little or no benefit in using a sampling rate higher than 44.1 kHz or a bit depth greater than 16. We more often see 48 kHz in professional audio, simply because it divides by 2 or 3 more sensibly.
In professional audio, such as a music recording studio, we may well use a higher sampling rate and a greater bit depth. This is because the signal will undergo all sorts of processing as part of the production process (e.g., time and pitch modification). This processing will introduce artefacts, and having a very high Nyquist frequency will place those artefacts up beyond the range of human hearing. A greater bit depth simply means storing each sample with greater precision, again giving more robustness against some sorts of processing, such as changing the level (e.g., when mixing tracks together). Just before publishing the music, the audio is downsampled to 44.1 kHz and the bit depth reduced to 16.
Some people claim to be able to hear the difference between 48 kHz and 96 kHz. You would need a very well-produced example audio file, a good ear, and expensive equipment to try this for yourself.
Here is an example of reducing bit depth, so you can hear the effect.
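A minimal sketch of what reducing bit depth does to the samples (simple mid-tread quantisation of a test tone; this is not the exact example referred to above):

```python
import numpy as np

def reduce_bit_depth(samples, bits):
    """Quantise floating-point samples in [-1, 1] to the given bit depth."""
    levels = 2 ** (bits - 1)
    return np.round(samples * levels) / levels

fs = 16000                                   # assumed sample rate (Hz)
t = np.arange(fs) / fs                       # one second of time axis
tone = 0.8 * np.sin(2 * np.pi * 440 * t)     # a 440 Hz test tone

coarse = reduce_bit_depth(tone, 4)           # keep only 4 bits per sample
print(len(np.unique(coarse)))                # very few distinct sample values remain
```

The quantisation error is what you hear as added noise (or outright distortion, at very low bit depths).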
A very good question, which is about how the theory of sampling meets practical application.
Let’s start with the easy case of a sampling frequency much higher than any component frequency of the signal being sampled. We will get many samples per cycle and therefore good reconstruction when we ‘join the dots’ (the samples). All good so far.
Now consider the limiting case: a sampling rate that is only twice as high as any component of the signal being sampled. You are right that we might be ‘lucky’ or ‘unlucky’ in where the sampling points fall with respect to the signal being sampled. This is explored in one of the notebooks.
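The ‘lucky’/‘unlucky’ point can be sketched numerically (illustrative numbers): sample a sine component lying exactly at the Nyquist frequency, and what you capture depends entirely on where the samples fall relative to its phase.

```python
import numpy as np

fs = 100.0                        # illustrative sampling rate (Hz)
f = fs / 2                        # a component exactly at the Nyquist frequency
t = np.arange(50) / fs

lucky = np.sin(2 * np.pi * f * t + np.pi / 2)   # samples land on the peaks
unlucky = np.sin(2 * np.pi * f * t)             # samples land on the zero crossings

print(np.max(np.abs(lucky)))      # full amplitude captured
print(np.max(np.abs(unlucky)))    # essentially nothing captured
```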
The take-home message is that a digital representation of a signal is going to be reliable for frequencies well below the Nyquist frequency, but that we should not entirely trust the representation of frequencies close to the Nyquist frequency.
The Nyquist frequency is a theoretical limit – an absolute maximum frequency that we can capture by sampling. In practical applications, we would not want to operate too close to this limit. For example, if you thought that speech synthesis requires perfect reconstruction of frequencies up to 10 kHz, then you would not choose to sample at 20 kHz but something a little higher.
The digital signal is an approximation of the original analogue signal that was sampled. We have made compromises, which are unavoidable in all real engineering applications. They are not a problem, so long as we understand the consequences.
The quotes are there to handle files with spaces in their name.
ffmpeg -i in.aac out.wav
October 2, 2020 at 12:34 in reply to: Handbook of Phonetic Sciences – Chapter 20 – Intro to Signal Processing #12169

The two forms of energy exchanged are the same as in the swing.
A swing has maximum potential energy at its highest point, when it is momentarily stationary (= no kinetic energy). It has maximum kinetic energy at its lowest point, when it is moving fastest.
In air resonating in a tube, the same two forms of energy are exchanged. Potential energy is air under increased pressure, and kinetic energy is air moving at maximum velocity.
Look at these air molecules – they alternate between being “all bunched together and stationary” (maximum potential energy) and being “evenly spaced and moving quickly” (maximum kinetic energy).
To understand why pressure is potential energy, imagine a cylinder storing gas: it contains high pressure gas. Open the valve and this is converted into kinetic energy as the gas rushes out at high speed. Potential energy has been converted to (= exchanged for) kinetic energy. The total amount of energy is conserved.
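The same exchange can be sketched with a toy simple-harmonic oscillator (a frictionless ‘swing’ with made-up unit mass and stiffness, not a model of air in a tube):

```python
import numpy as np

m, k, dt = 1.0, 1.0, 0.001        # mass, stiffness, time step (all illustrative)
x, v = 1.0, 0.0                   # start at maximum displacement, momentarily stationary

ke, pe = [], []
for _ in range(10000):
    v += -(k / m) * x * dt        # semi-implicit Euler: nearly energy-conserving
    x += v * dt
    ke.append(0.5 * m * v * v)    # kinetic energy
    pe.append(0.5 * k * x * x)    # potential energy

total = np.array(ke) + np.array(pe)
print(total.min(), total.max())   # total stays (almost exactly) constant
```

Each form of energy swings between zero and the maximum while their total is conserved, just as in the swing and the resonating tube.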