Reducing a FFT spectrum range - audio

I am currently running Python's Numpy fft on 44100Hz audio samples which gives me a working frequency range of 0Hz - 22050Hz (thanks Nyquist). Once I use fft on those time domain values, I have 128 points in my fft spectrum giving me 172Hz for each frequency bin size.
I would like to tighten the frequency bin to 86Hz and still keep to only 128 fft points, instead of increasing my fft count to 256 through an adjustment on how I'm creating my samples.
The question I have is whether or not this is theoretically possible. My thought would be to run fft on any Hz values between 0Hz to 11025Hz only. I don't care about anything above that anyway. This would cut my working spectrum in half and put my frequency bins at 86Hz and while keeping to my 128 spectrum bins. Perhaps this can be accomplished via a window function in the time domain?
Currently the code I'm using to create my samples and then convert to fft is:
import numpy as np
sample_rate = 44100
chunk = 128
record_seconds = 2
stream = self.audio.open(format=pyaudio.paInt16, channels=1,
rate=sample_rate, input=True, frames_per_buffer=6300)
sample_list = []
for i in range(0, int(sample_rate / chunk * record_seconds)):
data = stream.read(chunk)
sample_list.append(np.fromstring(data, dtype=np.int16))
### then later ###:
for samp in sample_list:
samp_fft = np.fft.fft(samp) ...
I hope I worded this clearly enough. Let me know if I need to adjust my explanation or terminology.

What you are asking for is not possible. As you mentioned in a comment you require a short time window. I assume this is because you're trying to detect when a signal arrives at a certain frequency (as I've answered your earlier question on the subject) and you want the detection to be time sensitive. However, it seems your bin size is too large for your requirements.
There are only two ways to decrease the bin size. 1) Increase the length of the FFT. Unfortunately this also means that it will take longer to acquire the data. 2) lower the sample rate (either by sample rate conversion or at the hardware level) but since the samples arrive slower it will also take longer to acquire the data.
I'm going to suggest to you a 3rd option (from what I've gleaned from this and your other questions is possibly a better solution) which is: Perform the frequency detection in the time domain. What this would require is a time-domain bandpass filter followed by an RMS meter. Implementation wise this would be one or more biquad filters that you could implement in python for the filter - there are probably implementations already available. The tricky part would be designing the filter but I'd be happy to help you in chat. The RMS meter is basically taking the square root of the sum of the squares of the output samples from the filter.

Doubling the size of the FFT would be the obvious thing to do, but if there is a good reason why you can't do this then consider 2x downsampling prior to the FFT to get the effective sample rate down to 22050 Hz:
- Apply low pass filter with cut off at 11 kHz
- Discard every other sample from filtered output
- Apply FFT to down-sampled data

If you are not trying to resolve between close adjacent frequency peaks or noise, then, to half the frequency bin spacing, you can zero-pad your data to double the FFT length, without having to wait for more data. Then, if you only want the lower half of the frequency range 0..Fs/2, just throw away the middle half of the FFT result vector (which is usually far more efficient than trying to compute the lower half of the frequency range via non-FFT means).
Note that zero-padding gives the same result as high-quality interpolation (as in smoothing a plot of the original FFT result points). It does not increase peak separation resolution, but might make it easier to pick out more precise peak locations in the plot if the noise level is low enough.

Related

FFT - When to window?

I've seen the various FFT questions on here but I'm confused on part of the implementation. Instead of performing the FFT in real time, I want to do it offline. Lets say I have the raw data in float[] audio. The sampling rate is 44100 and so audio[0] to audio[44099] will contain 1 seconds worth of audio. If my FFT function handles the windowing (e.g. Hanning), do I simply put the entire audio buffer into the function in one go? Or, do I have to cut the audio into chunks of 4096 (my window size) and then input that into the FFT which will then perform the windowing function on top?
You may need to copy your input data to a separate buffer and get it in the correct format, e.g. if your FFT is in-place, or if it requires interleaved complex data (real/imaginary). However if your FFT routine can take a purely real input and is not in-place (i.e. non-destructive) then you may just be able to pass a pointer to the original sample data, along with an appropriate size parameter.
Typically for 1s of audio, e.g. speech or music, you would pick an FFT size which corresponds to a reasonably stationary chunk of audio, e.g. 10 ms or 20 ms. So at 44.1 kHz your FFT size might be say 512 or 1024. You would then generate successive spectra by advancing through your buffer and doing a new FFT at each starting point. Note that it's common practice to overlap these successive buffers, typically by 50%. So if N = 1024 your first FFT would be for samples 0..1023, your second would be for samples 512..1535, then 1024..2047, etc.
The choice of whether to calculate one FFT over the entire data set (in the OP's case, 44100 samples representing 1-second of data), or whether to do a series of FFT's over smaller subsets of the full data set, depends on the data, and on the intended purpose of the FFT.
If the data is relatively static spectrally over the full data set, then one FFT over the entire data set is probably all that's needed.
However, if the data is spectrally dynamic over the data set, then multiple sliding FFT's over small subsets of the data would create a more accurate time-frequency representation of the data.
The plot below shows the power spectrum of an acoustic guitar playing an A4 note. The audio signal was sampled at 44.1 KHz and the data set contains 131072 samples, almost 3 seconds of data. This data set was pre-multiplied with a Hann window function.
The plot below shows the power spectrum of a subset of 16384 samples (0 to 16383) taken from the full data set of the acoustic guitar A4 note. This subset was also pre-multiplied with a Hann window function.
Notice how the spectral energy distribution of the subset is significantly different from the spectral energy distribution of the full data set.
If we were to extract subsets from the full data set, using a sliding 16384 sample frame, and calculate the power spectrum of each frame, we would create an accurate time-frequency picture of the full data set.
References:
Real audio signal data, Hann window function, plots, FFT, and spectral analysis were done here:
Fast Fourier Transform, spectral analysis, Hann window function, audio data
The chunk size or window length you pick controls the frequency resolution and the time resolution of the FFT result. You have to determine which you want or what trade-off to make.
Longer windows give you better frequency resolution, but worse time resolution. Shorter windows, vice versa. Each FFT result bin will contain a frequency bandwidth of roughly 1 to 2 times the sample rate divided by the FFT length, depending on the window shape (rectangular, von Hann, etc.), not just one single frequency. If your entire data chunk is stationary (frequency content doesn't change), then you may not need any time resolution, and can go for 1 to 2 Hz frequency "resolution" in your 1 second of data. Averaging multiple short FFT windows might also help reduce the variance of your spectral estimations.

Real-Time FFT with High Resolution while Keeping Latency Low

I have read all the wikipedia articles and stackoverflow articles on fft and resolution. However, nothing has helped in learning how to get high resolution frequency without having a huge latency issues.
If I understand signal processing correctly:
I have a sampling rate of 44,100, and I take 256 block. Then the frequency resolution would be 44,100/2/256 = 86.1 Hz per frequency bin with FFT.
Constantly I see examples like http://www.tunelab-world.com/, and http://www.spectraplus.com/ that are able to determine the frequency down to .01 Hz.
If I did that with my above method I would need 4410,000 bins to get that kind of resolution. At 44,100 sampling rate it would take 100 seconds to fill in the data from the input.
I know I am missing something, but I can't figure what.
How can I get a signal, and then draw a graph or display the frequency of a peak with that kind of accuracy without taking a gazillion bins or waiting forever?
Thanks in advance for your help!
If you want a high frequency resolution FFT output, you have to perform the FFT over many samples: there is simply no way round that.
What you are probably seeing in other applications is overlapping: they may do a 4096 pt FFT on the first set of data, then move along 256 samples and do another 4096 pt FFT (on 3840 of the samples they have already used, plus a new 256 samples).
This allows you to show regular (different) updates with a fine frequency resolution. It will be no good for capturing transient signals, but looks good on an active display.
The reason you can get better accuracy is that the frequency estimation problem lends itself to being solved with higher accuracy than many other estimation problems.
The Cramer-Rao Lower Bound (CRLB) on the accuracy is given by:
which means that the variance of the frequency estimate (a measure of the expected error) goes down as the cube of T, the duration of the measurements. "Normal" estimation problems tend to have this measure go down as the square of T.
Using the FFT maximizer (the bin with the largest peak) will only get you the square of T.
As Adrian Taylor says, the examples you give are probably starting with a higher number of samples and then updating by a shorter duration.
For kicks, there are some frequency estimation algorithms here that might be of interest. They are quicker than the FFT, and more accurate.
SpectraPlus says "High Resolution FFT Analysis up to 1,048,576 pts"; that won't get you to 0.01 Hz resolution at 44.1 kHz.
TuneLab seems to go down to 0.01 cents, but the "spectrum display" appears to have a resolution of around 2.5 Hz at 440 Hz. The "phase display" is nothing special.
What are you trying to do? If you merely want to implement a guitar tuner, you don't need (and probably don't want) an FFT. Not knowing any better, I'd go for a PLL.

"Winamp style" spectrum analyzer

I have a program that plots the spectrum analysis (Amp/Freq) of a signal, which is preety much the DFT converted to polar. However, this is not exactly the sort of graph that, say, winamp (right at the top-left corner), or effectively any other audio software plots. I am not really sure what is this sort of graph called (if it has a distinct name at all), so I am not sure what to look for.
I am preety positive about the frequency axis being base two exponential, the amplitude axis puzzles me though.
Any pointers?
Actually an interesting question. I know what you are saying; the frequency axis is certainly logarithmic. But what about the amplitude? In response to another poster, the amplitude can't simply be in units of dB alone, because dB has no concept of zero. This introduces the idea of quantization error, SNR, and dynamic range.
Assume that the received digitized (i.e., discrete time and discrete amplitude) time-domain signal, x[n], is equal to s[n] + e[n], where s[n] is the transmitted discrete-time signal (i.e., continuous amplitude) and e[n] is the quantization error. Suppose x[n] is represented with b bits, and for simplicity, takes values in [0,1). Then the maximum peak-to-peak amplitude of e[n] is one quantization level, i.e., 2^{-b}.
The dynamic range is the defined to be, in decibels, 20 log10 (max peak-to-peak |s[n]|)/(max peak-to-peak |e[n]|) = 20 log10 1/(2^{-b}) = 20b log10 2 = 6.02b dB. For 16-bit audio, the dynamic range is 96 dB. For 8-bit audio, the dynamic range is 48 dB.
So how might Winamp plot amplitude? My guesses:
The minimum amplitude is assumed to be -6.02b dB, and the maximum amplitude is 0 dB. Visually, Winamp draws the window with these thresholds in mind.
Another nonlinear map, such as log(1+X), is used. This function is always nonnegative, and when X is large, it approximates log(X).
Any other experts out there who know? Let me know what you think. I'm interested, too, exactly how this is implemented.
To generate a power spectrum you need to do the following steps:
apply window function to time domain data (e.g. Hanning window)
compute FFT
calculate log of FFT bin magnitudes for N/2 points of FFT (typically 10 * log10(re * re + im * im))
This gives log magnitude (i.e. dB) versus linear frequency.
If you also want a log frequency scale then you will need to accumulate the magnitude from appropriate ranges of bins (and you will need a fairly large FFT to start with).
Well I'm not 100% sure what you mean but surely its just bucketing the data from an FFT?
If you want to get the data such that you have (for a 44Khz file) frequency points at 22Khz, 11Khz 5.5Khz etc then you could use a wavelet decomposition, i guess ...
This thread may help ya a bit ...
Converting an FFT to a spectogram
Same sort of information as a spectrogram I'd guess ...
What you need is power spectrum graph. You have to compute DFT of your signal's current window. Then square each value.

Measure audio noise level

I'm trying to get a qualitative handle on the amount of static or noise present in a audio stream. The normal content of the stream is voice or music.
I've been experiementing with taking the stddev of the samples, and that does give me some handle on the presence of voice vs. empty channel noise (ie. a high stddev usually indicates voice or music)
Was wondering if anyone else had some pointers on this.
Doesn't the peak value give you the answer? If you're looking at a signal from a good ADC, the ambient level should be in the 1's or 10's of counts, while voice or music will get up into the thousands of counts. Is there some kind of automatic gain control that makes this strategy not work?
If you need something more complex, the peak to RMS ratio might be a bit more reliable than simply RMS level (RMS = stddev). Pure noise will have a ratio of around 3-5, while sinusoids, for instance, have a peak to RMS ratio of 1.4. However, you can get more discrimination by looking at the spectrum of the signal. Static is usually spectrally smooth or even flat, while voice and music are spectrally structured. So a Fourier transform might be what you're looking for. Assuming a signal x that contains, say 0.5 seconds worth of data, here's some Matlab code:
Sx = fft(x .* hann(length(x), 'periodic'))
The HANN function applies a Hann window to reduce spectral leakage, while the FFT function quickly calculates the Fourier transform. Now you have a couple of choices. If you want to determine whether the signal x consists of static or voice/music, take the peak to RMS ratio of the spectrum:
pk2rms = max(abs(Sx))/sqrt(sum(abs(Sx).^2)/length(Sx))
I'd expect pure static to have a peak to RMS ratio around 3-5 (again), while voice/music would be at least an order of magnitude higher. This takes advantage of the fact that pure white noise has the same "structure" in time and frequency domains.
If you want to get a numerical estimate of the noise level, you can calculate the power in Sx over time, using an average:
Gxx = ((k-1)*Gxx + Sx.*conj(Sx))/k
Over time, the peaks in Gxx should come and go, but you should see a constant minimum value corresponding to the noise floor. In general, audio spectra are easier to look at on a dB (log vertical) scale.
Some notes:
1. I picked 0.5 seconds for the length of x, but I'm not sure what an optimal value here is. If you pick a value that's too short, x will not have much structure. In that case, the DC component of the signal will have a lot of energy. I expect you can still use the peak to RMS discriminator, though, if you first toss out the bin in Sx corresponding to DC.
2. I'm not sure what a good value for k is, but that equation corresponds to exponential averaging. You can probably experiment with k to figure out an optimal value. This might work best with a short x.
There are different kinds of noise. White, pink, brown. Noise can come from many places. Is a 60hertz hum noise or signal?
For white noise, I'd look at the fft and find the lowest value to see what your noise floor is.

How do you analyse the fundamental frequency of a PCM or WAV sample? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I have a sample held in a buffer from DirectX. It's a sample of a note played and captured from an instrument. How do I analyse the frequency of the sample (like a guitar tuner does)? I believe FFTs are involved, but I have no pointers to HOWTOs.
The FFT can help you figure out where the frequency is, but it can't tell you exactly what the frequency is. Each point in the FFT is a "bin" of frequencies, so if there's a peak in your FFT, all you know is that the frequency you want is somewhere within that bin, or range of frequencies.
If you want it really accurate, you need a long FFT with a high resolution and lots of bins (= lots of memory and lots of computation). You can also guess the true peak from a low-resolution FFT using quadratic interpolation on the log-scaled spectrum, which works surprisingly well.
If computational cost is most important, you can try to get the signal into a form in which you can count zero crossings, and then the more you count, the more accurate your measurement.
None of these will work if the fundamental is missing, though. :)
I've outlined a few different algorithms here, and the interpolated FFT is usually the most accurate (though this only works when the fundamental is the strongest harmonic - otherwise you need to be smarter about finding it), with zero-crossings a close second (though this only works for waveforms with one crossing per cycle). Neither of these conditions is typical.
Keep in mind that the partials above the fundamental frequency are not perfect harmonics in many instruments, like piano or guitar. Each partial is actually a little bit out of tune, or inharmonic. So the higher-frequency peaks in the FFT will not be exactly on the integer multiples of the fundamental, and the wave shape will change slightly from one cycle to the next, which throws off autocorrelation.
To get a really accurate frequency reading, I'd say to use the autocorrelation to guess the fundamental, then find the true peak using quadratic interpolation. (You can do the autocorrelation in the frequency domain to save CPU cycles.) There are a lot of gotchas, and the right method to use really depends on your application.
There are also other algorithms that are time-based, not frequency based.
Autocorrelation is a relatively simple algorithm for pitch detection.
Reference: http://cnx.org/content/m11714/latest/
I have written c# implementations of autocorrelation and other algorithms that are readable. Check out http://code.google.com/p/yaalp/.
http://code.google.com/p/yaalp/source/browse/#svn/trunk/csaudio/WaveAudio/WaveAudio
Lists the files, and PitchDetection.cs is the one you want.
(The project is GPL; so understand the terms if you use the code).
Guitar tuners don't use FFT's or DFT's. Usually they just count zero crossings. You might not get the fundamental frequency because some waveforms have more zero crossings than others but you can usually get a multiple of the fundamental frequency that way. That's enough to get the note although you might be one or more octaves off.
Low pass filtering before counting zero crossings can usually get rid of the excess zero crossings. Tuning the low pass filter requires some knowlegde of the range of frequency you want to detect though
FFTs (Fast-Fourier Transforms) would indeed be involved. FFTs allow you to approximate any analog signal with a sum of simple sine waves of fixed frequencies and varying amplitudes. What you'll essentially be doing is taking a sample and decomposing it into amplitude->frequency pairs, and then taking the frequency that corresponds to the highest amplitude.
Hopefully another SO reader can fill the gaps I'm leaving between the theory and the code!
A little more specifically:
If you start with the raw PCM in an input array, what you basically have is a graph of wave amplitude vs time.Doing a FFT will transform that to a frequency histogram for frequencies from 0 to 1/2 the input sampling rate. The value of each entry in the result array will be the 'strength' of the corresponding sub-frequency.
So to find the root frequency given an input array of size N sampled at S samples/second:
FFT(N, input, output);
max = max_i = 0;
for(i=0;i<N;i++)
if (output[i]>max) max_i = i;
root = S/2.0 * max_i/N ;
Retrieval of fundamental frequencies in a PCM audio signal is a difficult task, and there would be a lot to talk about it...
Anyway, usually time-based method are not suitable for polyphonic signals, because a complex wave given by the sum of different harmonic components due to multiple fundamental frequencies has a zero-crossing rate which depends only from the lowest frequency component...
Also in the frequency domain the FFT is not the most suitable method, since frequency spacing between notes follow an exponential scale, not linear. This means that a constant frequency resolution, used in the FFT method, may be insufficient to resolve lower frequency notes if the size of the analysis window in the time domain is not large enough.
A more suitable method would be a constant-Q transform, which is DFT applied after a process of low-pass filtering and decimation by 2 (i.e. halving each step the sampling frequency) of the signal, in order to obtain different subbands with different frequency resolution. In this way the calculation of DFT is optimized. The trouble is that also time resolution is variable, and increases for the lower subbands...
Finally, if we are trying to estimate the fundamental frequency of a single note, FFT/DFT methods are ok. Things change for a polyphonic context, in which partials of different sounds overlap and sum/cancel their amplitude depending from their phase difference, and so a single spectral peak could belong to different harmonic contents (belonging to different notes). Correlation in this case don't give good results...
Apply a DFT and then derive the fundamental frequency from the results. Googling around for DFT information will give you the information you need -- I'd link you to some, but they differ greatly in expectations of math knowledge.
Good luck.

Resources