Module 3 questions (Digital speech signals)
October 7, 2021 at 11:22 #14866
Post your questions for the Q&A session here (or on Teams)!
October 8, 2021 at 13:26 #14879
Hello Catherine! I am quite confused about what the window does, how to calculate the window size, and the relationship between window size and frequency (spectral) resolution.
I was reading Chapter 7 in Digital Signal Processing and it talks about the window on page 159. It says that a short stretch of the signal is windowed to yield a short-time spectrum. The size of the window is taken to be the duration of one period, from which the amplitude and phase of the first harmonic (of the window size) are computed. The computation is then repeated for two, three, four, etc. periods, giving amplitude and phase information for the respective harmonics.
And it is exemplified as follows: for a signal sampled at 10 kHz, a 10 ms window yields a spectral resolution of 100 Hz (1,000/10), a 50 ms window gives one of 20 Hz (1,000/50), etc. I have looked up some information and explanations online but I'm still confused. Could you please help me with it? Thanks in advance!
October 15, 2021 at 12:01 #14923
This question gets at one of the most important parts of understanding the DFT: The number of input samples determines which frequencies you can accurately detect with the DFT.
For a recording, we can assume that we have a fixed sampling rate, f_s, so the time between each sample, the sampling period, is 1/f_s. For example, if the sampling rate is f_s = 1000 Hz, the sampling period will be 1/1000 = 0.001 seconds.
So, 10 samples will capture a window size of 10*0.001=0.01 seconds.
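To make that arithmetic concrete, here is a minimal Python sketch (the variable names fs, Ts and N are mine, purely for illustration):

# Sampling period and window duration for the example above
fs = 1000                   # sampling rate in Hz (samples per second)
Ts = 1 / fs                 # sampling period in seconds: 0.001 s
N = 10                      # number of samples in the window
window_duration = N * Ts    # 10 * 0.001 = 0.01 seconds
print(Ts, window_duration)  # 0.001 0.01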
Taking these 10 samples as input, we can call our input size N=10. The range of frequencies we can analyse with the DFT is determined by the sampling rate, but the number of frequencies we can actually analyse is determined by the size of the input. The frequencies we can analyse are exactly N points spread evenly from 0 to the sampling rate.
The lowest analysis frequency (associated with DFT[1]) will be the sampling rate divided by the input size, f_s/N, and the other analysis frequencies will be multiples of that up to the sampling rate. So in the example above, the lowest analysis frequency will be f_s/N = 1000/10 = 100 Hz, and the DFT output will represent the 10 frequencies 100 Hz, 200 Hz, 300 Hz, …, 1000 Hz.
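As a small sketch of that spacing (again, the names are mine and the code just mirrors the numbers above), the analysis frequencies are the multiples of f_s/N:

import numpy as np

fs = 1000                        # sampling rate in Hz
N = 10                           # input (window) size in samples

bin_spacing = fs / N             # lowest analysis frequency: 100.0 Hz
freqs = np.arange(1, N + 1) * bin_spacing
print(bin_spacing)               # 100.0
print(freqs)                     # [ 100.  200.  300. ... 1000.]
# Note: bin_spacing equals 1 / (N / fs), i.e. 1 / window duration,
# which is the point about the 'first harmonic' made below.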
This is essentially what the textbook is getting at: the ‘first harmonic’ mentioned in the text is the lowest analysis frequency. If we work from the DFT equations, we can (eventually) see that the lowest analysis frequency is the same as the frequency of a sine wave that has the same period as the window size. So if the window size is 0.01 seconds, the lowest analysis frequency will be 1/period = 1/0.01 = 100 Hz.
Now, for sampling rate 1000 Hz and input size N=10, we won’t be able to accurately tell if the input has a frequency component of 30 Hz, because this falls between the analysis frequencies (which are all multiples of 100 Hz).
If instead we analyze a longer window, e.g. 0.1 seconds, we would need N=100 samples (100*0.001=0.1 seconds). In this case the lowest analysis frequency will be f_s/N = 1000/100 = 10 Hz (equivalently, period=window size=0.1, frequency=1/period = 10 Hz). So the DFT output will represent N=100 frequencies: 10, 20, 30,…,1000 Hz. This time, with input size N=100 we will be able to capture an input frequency of 30 Hz, because it matches one of our DFT analysis frequencies in this case.
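Here's a short numpy sketch (my own example, not from the textbook) that shows the difference: a 30 Hz sine analysed with N=10 smears its energy across several bins, while with N=100 it lands exactly on one analysis frequency.

import numpy as np

fs = 1000                                        # sampling rate in Hz

def dft_magnitudes(n_samples, freq=30):
    # Magnitude spectrum of a freq-Hz sine analysed with an n_samples-point DFT
    t = np.arange(n_samples) / fs
    x = np.sin(2 * np.pi * freq * t)
    bins = np.fft.rfftfreq(n_samples, d=1 / fs)  # analysis frequencies in Hz
    mags = np.abs(np.fft.rfft(x))
    return bins, mags

# N = 10: analysis frequencies are 0, 100, ..., 500 Hz, so the 30 Hz component
# leaks across several bins rather than showing up as one clear peak.
print(dft_magnitudes(10))

# N = 100: analysis frequencies include 30 Hz, so almost all the energy
# appears in that single bin.
print(dft_magnitudes(100))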
In general, given a fixed sampling rate, the longer the time window we want to analyse, the bigger the input size N will be, and the more frequencies we will be able to accurately detect in the input. This means higher frequency resolution. But if we take a longer window as input, we have less time resolution. For example, if we were to take the DFT over a whole diphthong like [ai], we wouldn't be able to "see" the changes in formants from the beginning of the vowel to the end, because we just get one spectrum out for the whole window. To increase the time resolution, we need shorter windows, but this means there are fewer frequencies we can detect with the DFT because the input size is smaller.
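To put some numbers on the trade-off, here is a small sketch assuming (purely for illustration) a 16 kHz recording and non-overlapping windows:

fs = 16000                                  # assumed sampling rate in Hz

for window_ms in (5, 25, 100):
    N = int(fs * window_ms / 1000)          # window length in samples
    bin_spacing = fs / N                    # frequency resolution in Hz
    spectra_per_second = 1000 / window_ms   # time resolution (non-overlapping windows)
    print(window_ms, "ms:", N, "samples,", bin_spacing,
          "Hz between analysis frequencies,", spectra_per_second, "spectra per second")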
The length of window you want to use depends on what type of analysis you want to do. If we want to look at how the overall spectral envelope changes with detailed time resolution (e.g. to track changes in formants), and don't care too much about the fine spectral details (like the harmonics due to the voice source), a shorter window would be preferred. If we wanted to capture more of the fine detail of the harmonic structure and know that the sound is stable (like a sustained vowel), then we would prefer a longer window. In practice, since phones are very different in their spectral and temporal characteristics, we usually keep to a default window size and step (e.g. frame size = 25 ms, frame step = 10 ms) and then do some other analysis to obtain the features we want. We'll come back to this later in the course when we talk about feature inputs for ASR!
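As a rough sketch of what that default framing looks like in code (the 25 ms / 10 ms values are from above; the function name, the Hamming window and the 16 kHz noise signal are my own assumptions):

import numpy as np

def frame_signal(x, fs, frame_ms=25, step_ms=10):
    # Slice a signal into overlapping analysis frames: 25 ms windows, 10 ms step
    frame_len = int(fs * frame_ms / 1000)
    step = int(fs * step_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // step)
    frames = np.stack([x[i * step : i * step + frame_len] for i in range(n_frames)])
    # A tapered window (here Hamming) is usually applied to each frame before the DFT
    return frames * np.hamming(frame_len)

fs = 16000                                       # assumed sampling rate
x = np.random.randn(fs)                          # 1 second of noise standing in for speech
frames = frame_signal(x, fs)
spectra = np.abs(np.fft.rfft(frames, axis=1))    # one short-time spectrum per frame
print(frames.shape, spectra.shape)               # (98, 400) (98, 201)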
Here’s a link to the recording of the discussion of this in the Q&A session (12/10/2021): Link to recording (Teams)