DFT inputs and outputs
- This topic has 1 reply, 2 voices, and was last updated 2 years, 3 months ago by Catherine Lai.
October 6, 2022 at 15:56 #16065
Today I asked a question about the linear relationship between the input samples and the output frequencies of the DFT. I would like to make sure that I understood the answer.
If you look at my attached image you can see that the top left corner illustrates the effect of window width; wave A only completes half a cycle in window 1 whereas it completes one full cycle in window 2. The lowest frequency we can capture given the width of window 1 would be the frequency of wave B whereas wave A is the lowest frequency wave we can possibly capture given the width of window 2.
Below this is the effect of sampling rate. If we only sample sparsely we would get wave alpha, which is very different from a possible underlying pattern of wave beta. This is also why we have the Nyquist frequency effect which is that the sampling must be such that two points are included per wave period, otherwise we wouldn’t get the max and min points and therefore not a fair representation of the actual wave. The distance between each sample on the time axis (dt) is the sample period. The sample period determines window width because dt x N(samples) = window width.
The window width and the sampling rate can be set by the engineers when we build the model but we cannot think of them as independent things (even though we could technically set these parameters independently of one another). This relationship between sampling rate and window width is what I wanted to ask about. Let me give the scenario:
1) You must first turn the analogue signal into a digital signal through sampling. Your sampled representation of the entire original wave is the input to your speech system. The number of input sample points equals the number of frequency outputs. This number times the sample period also determines the width of the entirety of the sampled sequence.
2) So if we imagine that the dotted line in my screenshot is the total duration of the sampled sequence then what we are doing in processing (or at least how I imagine it) is going over this digitized signal with a window of a width and a shape that we can set ourselves, independently of sampling rate. We are processing slices of the total wave, one window frame at a time. If we imagine that the square I have drawn over the dotted line is our window which we slide across the signal, then its width also determines what frequencies you can detect. Its width will determine what range of frequencies you could see BUT if you have set the sampling rate independently of the frame width then you will sometimes end up with a mismatch between which frequencies could be detected given window width vs. which frequencies are detected given the sample rate of the digitized signal. This results in the leakage effect in the frequency domain where we see a mismatch between where the formants appear in the plot vs. where they are supposed to appear given the harmonic structure that can be detected by our window.
– Is this correct or is it the case that the window width and the sampling rate are always set in tandem? Seeing as dt x N(samples) = window width, we have a close relationship between the two parameters which means that whatever sample rate we choose determines the width of our window. We don’t set them independently? If this is the case then maybe my confusion above stems from the fact that I think of the process in several steps: first you digitize the analogue signal by sampling and then you “slide” a window across this signal to process it. Maybe this is totally wrong as the input to our system is instantaneous so you cannot separate the steps – segmenting and sampling take place simultaneously, as the analogue signal arrives in the system in real-time. ???
– If the latter is true, then leakage wouldn’t be the result of a mismatch between sampling rate and window width but between the language model and the analogue wave (i.e. we have set the sampling rate and window width in such a way that the particular frequencies of the analogue signal cannot be detected easily)? Which one is it?
October 11, 2022 at 18:52 #16077
Hi Rebecka,
Sorry about the delay in responding!
Your description of the relationship between the number of inputs and the DFT analysis frequencies is good. Sorry, I misunderstood your question in class and went on a bit of a tangent!
For your scenario, I would first just clarify that “The number of input sample points in the analysis window equals the number of frequency outputs of applying the DFT to that window”, which I think is what you are saying in any case.
In point 2:
We are processing slices of the total wave, one window frame at a time. If we imagine that the square I have drawn over the dotted line is our window which we slide across the signal, then its width also determines what frequencies you can detect.
Yes!
Its width will determine what range of frequencies you could see BUT if you have set the sampling rate independently of the frame width then you will sometimes end up with a mismatch between which frequencies could be detected given window width vs. which frequencies are detected given the sample rate of the digitized signal. This results in the leakage effect in the frequency domain where we see a mismatch between where the formants appear in the plot vs. where they are supposed to appear given the harmonic structure that can be detected by our window.
I think you’ve got it, but it’s worth unpacking this a bit here!
1. If you have a fixed number of samples per window, changing the sampling rate will change the length of the window (in seconds). Conversely, if you have a fixed sampling rate, changing the length of the input window (in seconds) will change the number of samples you can take in that window. So, if you always keep to the same number of samples in a window but change the sampling rate, the DFT outputs will map to basis sinusoids of different frequencies.
2. Depending on what frequencies are actually in your input signal, you may get leakage with one sampling rate (due to the change in window length) but not with another. For example, assume your window is 10 samples long and your sampling rate is 1000 Hz. Then the window is 10 ms long and the DFT analysis frequencies are multiples of 100 Hz, so the DFT should be able to pick up multiples of 100 Hz faithfully. In contrast, if your sampling rate is 800 Hz (still 10 samples), the window is 12.5 ms long and the DFT would pick up multiples of 80 Hz. This means that a 100 Hz sine wave would appear as a single spike in the former, but cause leakage in the latter (spilling over into 80 Hz and 160 Hz, and potentially other frequencies in the spectrum).
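To make the 1000 Hz vs. 800 Hz example concrete, here is a minimal sketch in plain Python. The function names `dft_magnitudes` and `analyse` are made up for this illustration, and the DFT is the naive textbook sum rather than an FFT from a library. It samples a 100 Hz sine wave with 10 samples per window and prints the magnitude at each analysis frequency up to the Nyquist frequency.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Naive DFT: return the magnitude of each of the N output bins."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n)
    ]


def analyse(sample_rate, num_samples=10, tone_hz=100.0):
    """Sample a sine wave and pair each analysis frequency with its magnitude."""
    samples = [math.sin(2 * math.pi * tone_hz * t / sample_rate)
               for t in range(num_samples)]
    mags = dft_magnitudes(samples)
    # Analysis frequencies are multiples of sample_rate / num_samples.
    bin_spacing = sample_rate / num_samples
    # Only keep the bins from 0 Hz up to the Nyquist frequency.
    return [(k * bin_spacing, round(mags[k], 3))
            for k in range(num_samples // 2 + 1)]


for rate in (1000, 800):
    print(f"sampling rate {rate} Hz:")
    for freq, mag in analyse(rate):
        print(f"  {freq:6.1f} Hz  magnitude {mag}")
```

At 1000 Hz the window holds exactly one cycle of the sine wave, so all the energy lands in the 100 Hz bin. At 800 Hz the same 10 samples cover 1.25 cycles, so no single analysis frequency matches the tone and the energy is spread over several bins.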
– Is this correct or is it the case that the window width and the sampling rate are always set in tandem? Seeing as dt x N(samples) = window width, we have a close relationship between the two parameters which means that whatever sample rate we choose determines the width of our window. We don’t set them independently?
Yes, you’re right. We can consider window width in time and sampling rate independently, but this would then change the number of samples we can analyse in a window. In reality, the sampling rate is usually set first and then the window width is usually determined based on the application we have in mind (e.g. do we want a narrowband or wideband view of the spectrum? It turns out that seeing all the fine detail of the harmonic structure is not that useful for word recognition, for example).
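As a rough illustration of how the window width follows from the application once the sampling rate is fixed, the sketch below computes the number of samples per window and the spacing between DFT analysis frequencies. The 16 kHz sampling rate and the 5 ms and 25 ms window lengths are assumed values picked for the example, and `window_stats` is a hypothetical helper.

```python
def window_stats(sample_rate_hz, window_seconds):
    """Return (samples per window, spacing between DFT analysis frequencies in Hz)."""
    num_samples = round(sample_rate_hz * window_seconds)
    # Spacing is sample_rate / N, which equals 1 / window duration.
    bin_spacing_hz = sample_rate_hz / num_samples
    return num_samples, bin_spacing_hz


SAMPLE_RATE_HZ = 16000  # assumed value for this example

for window_ms in (5, 25):  # assumed short and long analysis windows
    n, spacing = window_stats(SAMPLE_RATE_HZ, window_ms / 1000)
    print(f"{window_ms} ms window: {n} samples, analysis frequencies {spacing} Hz apart")
```

The longer window gives closely spaced analysis frequencies (a narrowband view that can resolve individual harmonics), while the shorter window gives widely spaced ones (a wideband view that smooths over the harmonics).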
If this is the case then maybe my confusion above stems from the fact that I think of the process in several steps: first you digitize the analogue signal by sampling and then you “slide” a window across this signal to process it. Maybe this is totally wrong as the input to our system is instantaneous so you cannot separate the steps – segmenting and sampling take place simultaneously, as the analogue signal arrives in the system in real-time. ???
The input isn’t really instantaneous. The waveform needs to be sampled and quantized (i.e. discretized) so we can even get it onto the computer. We then do windowing and apply the DFT (and potentially other things) in the digital realm. So the leakage is really due to the fact that we can’t always know what exact frequencies will be in our (analogue) input, so some of them will not line up with the DFT analysis frequencies (this is very much the case for speech, as everyone’s voice is a bit different and so characterised by different frequency components!). Separately, we have constraints on memory for storing digital recordings (a higher sampling rate means you have to store more samples), and we also need to make sure the sampling rate is high enough to capture the frequencies that are important for the task. For example, humans can hear up to around 20 kHz but can still understand people through telephones, which only have an 8 kHz sampling rate (though we notice the loss of quality). Separate again are constraints based on what sort of frequency analysis we want to do (how much detail we actually want to extract).
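The trade-off between storage and captured frequency range can be sketched with some simple arithmetic. The example assumes uncompressed 16-bit mono audio, and the 44.1 kHz rate is an assumed comparison point. Neither is from the discussion above, and `describe` is a hypothetical helper.

```python
BYTES_PER_SAMPLE = 2  # assumption: 16-bit mono, uncompressed


def describe(sample_rate_hz, seconds=60):
    """Return (highest representable frequency in Hz, bytes needed for `seconds` of audio)."""
    nyquist_hz = sample_rate_hz / 2
    storage_bytes = sample_rate_hz * BYTES_PER_SAMPLE * seconds
    return nyquist_hz, storage_bytes


for rate in (8000, 44100):
    nyquist, size = describe(rate)
    print(f"{rate} Hz: frequencies up to {nyquist} Hz, {size} bytes per minute")
```

With telephone-style 8 kHz sampling, only frequencies up to 4 kHz are represented, which is still enough to understand speech, and each minute of audio takes far less space than at the higher rate.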