Compute MFCC using Librosa

Compute MFCC using Librosa - python-3.x

I am trying to use the librosa library to compute the MFCC of my time series. The time series is directly from data collected from a device at a sampling rate of 50 Hz.
Could someone help clarify on what values I could use for n_fft, hop_length, win_length and window? And their meaning?
Thanks in advance

MFCC is based on short-time Fourier transform (STFT), n_fft, hop_length, win_length and window are the parameters for STFT.
STFT divide a longer time signal into shorter segments of equal length and then compute Fourier transform separately on each shorter segment. Fourier transform transforms the signal from time domain to frequency domain. The figure below demonstrates the steps to calculate STFT.
n_fft is the number of bins of Fourier transform. Its value depends on the type of the signal and relates sampling rate, typically are the power of two. In your case, it's hard to say what is the appropriate value, since I don't know what the signal is. hop_length is the overlap of two consecutive segments, typically chosen to be 1/2 or 1/4 of n_fft. We typically apply a window on the segment. If you are not familiar with signal processing, you can leave this value to its default.

Related

Reducing a FFT spectrum range

I am currently running Python's Numpy fft on 44100Hz audio samples which gives me a working frequency range of 0Hz - 22050Hz (thanks Nyquist). Once I use fft on those time domain values, I have 128 points in my fft spectrum giving me 172Hz for each frequency bin size.
I would like to tighten the frequency bin to 86Hz and still keep to only 128 fft points, instead of increasing my fft count to 256 through an adjustment on how I'm creating my samples.
The question I have is whether or not this is theoretically possible. My thought would be to run fft on any Hz values between 0Hz to 11025Hz only. I don't care about anything above that anyway. This would cut my working spectrum in half and put my frequency bins at 86Hz and while keeping to my 128 spectrum bins. Perhaps this can be accomplished via a window function in the time domain?
Currently the code I'm using to create my samples and then convert to fft is:
import numpy as np
sample_rate = 44100
chunk = 128
record_seconds = 2
stream = self.audio.open(format=pyaudio.paInt16, channels=1,
rate=sample_rate, input=True, frames_per_buffer=6300)
sample_list = []
for i in range(0, int(sample_rate / chunk * record_seconds)):
data = stream.read(chunk)
sample_list.append(np.fromstring(data, dtype=np.int16))
### then later ###:
for samp in sample_list:
samp_fft = np.fft.fft(samp) ...
I hope I worded this clearly enough. Let me know if I need to adjust my explanation or terminology.

What you are asking for is not possible. As you mentioned in a comment you require a short time window. I assume this is because you're trying to detect when a signal arrives at a certain frequency (as I've answered your earlier question on the subject) and you want the detection to be time sensitive. However, it seems your bin size is too large for your requirements.
There are only two ways to decrease the bin size. 1) Increase the length of the FFT. Unfortunately this also means that it will take longer to acquire the data. 2) lower the sample rate (either by sample rate conversion or at the hardware level) but since the samples arrive slower it will also take longer to acquire the data.
I'm going to suggest to you a 3rd option (from what I've gleaned from this and your other questions is possibly a better solution) which is: Perform the frequency detection in the time domain. What this would require is a time-domain bandpass filter followed by an RMS meter. Implementation wise this would be one or more biquad filters that you could implement in python for the filter - there are probably implementations already available. The tricky part would be designing the filter but I'd be happy to help you in chat. The RMS meter is basically taking the square root of the sum of the squares of the output samples from the filter.

Doubling the size of the FFT would be the obvious thing to do, but if there is a good reason why you can't do this then consider 2x downsampling prior to the FFT to get the effective sample rate down to 22050 Hz:
- Apply low pass filter with cut off at 11 kHz
- Discard every other sample from filtered output
- Apply FFT to down-sampled data

If you are not trying to resolve between close adjacent frequency peaks or noise, then, to half the frequency bin spacing, you can zero-pad your data to double the FFT length, without having to wait for more data. Then, if you only want the lower half of the frequency range 0..Fs/2, just throw away the middle half of the FFT result vector (which is usually far more efficient than trying to compute the lower half of the frequency range via non-FFT means).
Note that zero-padding gives the same result as high-quality interpolation (as in smoothing a plot of the original FFT result points). It does not increase peak separation resolution, but might make it easier to pick out more precise peak locations in the plot if the noise level is low enough.

How to resolve frequency from PCM samples

I'd like to build a an audio visualizer display using led strips to be used at parties. Building the display and programming the rendering engine is fairly straightforward, but I don't have any experience in signal processing, aside from rendering PCM samples.
The primary feature I'd like to implement would be animation driven by audible frequency. To keep things super simple and get the hang of it, I'd like to start by simply rendering a color according to audible frequency of the input signal (e.g. the highest audible frequency would be rendered as white).
I understand that reading input samples as PCM gives me the amplitude of air pressure (intensity) with respect to time and that using a Fourier transform outputs the signal as intensity with respect to frequency. But from there I'm lost as to how to resolve the actual frequency.
Would the numeric frequency need to be resolved as the inverse transform of the of the Fourier transform (e.g. the intensity is the argument and the frequency is the result)?
I understand there are different types of Fourier transforms that are suitable for different purposes. Which is useful for such an application?

You can transform the samples from time domain to frequency domain using DFT or FFT. It outputs frequencies and their intensities. Actually you get a set of frequencies not just one. Based on that LED strips can be lit. See DFT spectrum tracer

"The frequency", as in a single numeric audio frequency spectrum value, does not exist for almost all sounds. That's why an FFT gives you all N/2 frequency bins of the full audio spectrum, up to half the sample rate, with a resolution determined by the length of the FFT.

FFT - When to window?

I've seen the various FFT questions on here but I'm confused on part of the implementation. Instead of performing the FFT in real time, I want to do it offline. Lets say I have the raw data in float[] audio. The sampling rate is 44100 and so audio[0] to audio[44099] will contain 1 seconds worth of audio. If my FFT function handles the windowing (e.g. Hanning), do I simply put the entire audio buffer into the function in one go? Or, do I have to cut the audio into chunks of 4096 (my window size) and then input that into the FFT which will then perform the windowing function on top?

You may need to copy your input data to a separate buffer and get it in the correct format, e.g. if your FFT is in-place, or if it requires interleaved complex data (real/imaginary). However if your FFT routine can take a purely real input and is not in-place (i.e. non-destructive) then you may just be able to pass a pointer to the original sample data, along with an appropriate size parameter.
Typically for 1s of audio, e.g. speech or music, you would pick an FFT size which corresponds to a reasonably stationary chunk of audio, e.g. 10 ms or 20 ms. So at 44.1 kHz your FFT size might be say 512 or 1024. You would then generate successive spectra by advancing through your buffer and doing a new FFT at each starting point. Note that it's common practice to overlap these successive buffers, typically by 50%. So if N = 1024 your first FFT would be for samples 0..1023, your second would be for samples 512..1535, then 1024..2047, etc.

The choice of whether to calculate one FFT over the entire data set (in the OP's case, 44100 samples representing 1-second of data), or whether to do a series of FFT's over smaller subsets of the full data set, depends on the data, and on the intended purpose of the FFT.
If the data is relatively static spectrally over the full data set, then one FFT over the entire data set is probably all that's needed.
However, if the data is spectrally dynamic over the data set, then multiple sliding FFT's over small subsets of the data would create a more accurate time-frequency representation of the data.
The plot below shows the power spectrum of an acoustic guitar playing an A4 note. The audio signal was sampled at 44.1 KHz and the data set contains 131072 samples, almost 3 seconds of data. This data set was pre-multiplied with a Hann window function.
The plot below shows the power spectrum of a subset of 16384 samples (0 to 16383) taken from the full data set of the acoustic guitar A4 note. This subset was also pre-multiplied with a Hann window function.
Notice how the spectral energy distribution of the subset is significantly different from the spectral energy distribution of the full data set.
If we were to extract subsets from the full data set, using a sliding 16384 sample frame, and calculate the power spectrum of each frame, we would create an accurate time-frequency picture of the full data set.
References:
Real audio signal data, Hann window function, plots, FFT, and spectral analysis were done here:
Fast Fourier Transform, spectral analysis, Hann window function, audio data

The chunk size or window length you pick controls the frequency resolution and the time resolution of the FFT result. You have to determine which you want or what trade-off to make.
Longer windows give you better frequency resolution, but worse time resolution. Shorter windows, vice versa. Each FFT result bin will contain a frequency bandwidth of roughly 1 to 2 times the sample rate divided by the FFT length, depending on the window shape (rectangular, von Hann, etc.), not just one single frequency. If your entire data chunk is stationary (frequency content doesn't change), then you may not need any time resolution, and can go for 1 to 2 Hz frequency "resolution" in your 1 second of data. Averaging multiple short FFT windows might also help reduce the variance of your spectral estimations.

Why do I need to apply a window function to samples when building a power spectrum of an audio signal?

I have found for several times the following guidelines for getting the power spectrum of an audio signal:
collect N samples, where N is a power of 2
apply a suitable window function to the samples, e.g. Hanning
pass the windowed samples to an FFT routine - ideally you want a real-to-complex FFT but if all you have a is complex-to-complex FFT then pass 0 for all the imaginary input parts
calculate the squared magnitude of your FFT output bins (re * re + im * im)
(optional) calculate 10 * log10 of each magnitude squared output bin to get a magnitude value in dB
Now that you have your power spectrum you just need to identify the peak(s), which should be pretty straightforward if you have a reasonable S/N ratio. Note that frequency resolution improves with larger N. For the above example of 44.1 kHz sample rate and N = 32768 the frequency resolution of each bin is 44100 / 32768 = 1.35 Hz.
But... why do I need to apply a window function to the samples? What does that really means?
What about the power spectrum, is it the power of each frequency in the range of sample rate? (example: windows media player visualizer of sound?)

Most real world audio signals are non-periodic, meaning that real audio signals do not generally repeat exactly, over any given time span.
However, the math of the Fourier transform assumes that the signal being Fourier transformed is periodic over the time span in question.
This mismatch between the Fourier assumption of periodicity, and the real world fact that audio signals are generally non-periodic, leads to errors in the transform.
These errors are called "spectral leakage", and generally manifest as a wrongful distribution of energy across the power spectrum of the signal.
The plot below shows a closeup of the power spectrum of an acoustic guitar playing the A4 note. The spectrum was calculated with the FFT (Fast Fourier Transform), but the signal was not windowed prior to the FFT.
Notice the distribution of energy above the -60 dB line, and the three distinct peaks at roughly 440 Hz, 880 Hz, and 1320 Hz. This particular distribution of energy contains "spectral leakage" errors.
To somewhat mitigate the "spectral leakage" errors, you can pre-multiply the signal by a window function designed specifically for that purpose, like for example the Hann window function.
The plot below shows the Hann window function in the time-domain. Notice how the tails of the function go smoothly to zero, while the center portion of the function tends smoothly towards the value 1.
Now let's apply the Hann window to the guitar's audio data, and then FFT the resulting signal.
The plot below shows a closeup of the power spectrum of the same signal (an acoustic guitar playing the A4 note), but this time the signal was pre-multiplied by the Hann window function prior to the FFT.
Notice how the distribution of energy above the -60 dB line has changed significantly, and how the three distinct peaks have changed shape and height. This particular distribution of spectral energy contains fewer "spectral leakage" errors.
The acoustic guitar's A4 note used for this analysis was sampled at 44.1 KHz with a high quality microphone under studio conditions, it contains essentially zero background noise, no other instruments or voices, and no post processing.
References:
Real audio signal data, Hann window function, plots, FFT, and spectral analysis were done here:
Fast Fourier Transform, spectral analysis, Hann window function, audio data

As #cyco130 says, your samples are already windowed by a rectangular function. Since a Fourier Transform assumes periodicity, any discontinuity between the last sample and the repeated first sample will cause artefacts in the spectrum (e.g. "smearing" of the peaks). This is known as spectral leakage. To reduce the effect of this we apply a tapered window function such as a Hann window which smooths out any such discontinuity and thereby reduces artefacts in the spectrum.

Note that a non-rectangular window has both benefits and costs. The result of a window in the time-domain is equivalent to a convolution of the window's transform with the signal's spectrum. A typical window, such as a von Hann window, will reduce the "leakage" from any non-periodic spectral content, which will result in a less noisy looking spectrum; but, in return, the convolution will "blur" any exactly or close to periodic spectral peaks across a few adjacent bins. e.g. all the spectral peaks will become rounder looking which may reduce frequency estimation accuracy. If you know, apriori, that there is no non-periodic content (e.g. data from some rotationally synchronous sampling system), a non-rectangular window could actually make the FFT look worse.
A non-rectangular window is also an informationally lossy process. A significant amount of spectral information near the edges of the window will be thrown away, assuming finite precision arithmetic. So non-rectangular windows are best used with overlapping window processing, and/or when one can assume that the spectrum of interest is either stationary across the entire window width, or centered in the window.

If you're not applying any windowing function, you're actually aplying a rectangular windowing function. Different windowing functions have different characteristics, it depends on what you want exactly.

"Winamp style" spectrum analyzer

I have a program that plots the spectrum analysis (Amp/Freq) of a signal, which is preety much the DFT converted to polar. However, this is not exactly the sort of graph that, say, winamp (right at the top-left corner), or effectively any other audio software plots. I am not really sure what is this sort of graph called (if it has a distinct name at all), so I am not sure what to look for.
I am preety positive about the frequency axis being base two exponential, the amplitude axis puzzles me though.
Any pointers?

Actually an interesting question. I know what you are saying; the frequency axis is certainly logarithmic. But what about the amplitude? In response to another poster, the amplitude can't simply be in units of dB alone, because dB has no concept of zero. This introduces the idea of quantization error, SNR, and dynamic range.
Assume that the received digitized (i.e., discrete time and discrete amplitude) time-domain signal, x[n], is equal to s[n] + e[n], where s[n] is the transmitted discrete-time signal (i.e., continuous amplitude) and e[n] is the quantization error. Suppose x[n] is represented with b bits, and for simplicity, takes values in [0,1). Then the maximum peak-to-peak amplitude of e[n] is one quantization level, i.e., 2^{-b}.
The dynamic range is the defined to be, in decibels, 20 log10 (max peak-to-peak |s[n]|)/(max peak-to-peak |e[n]|) = 20 log10 1/(2^{-b}) = 20b log10 2 = 6.02b dB. For 16-bit audio, the dynamic range is 96 dB. For 8-bit audio, the dynamic range is 48 dB.
So how might Winamp plot amplitude? My guesses:
The minimum amplitude is assumed to be -6.02b dB, and the maximum amplitude is 0 dB. Visually, Winamp draws the window with these thresholds in mind.
Another nonlinear map, such as log(1+X), is used. This function is always nonnegative, and when X is large, it approximates log(X).
Any other experts out there who know? Let me know what you think. I'm interested, too, exactly how this is implemented.

To generate a power spectrum you need to do the following steps:
apply window function to time domain data (e.g. Hanning window)
compute FFT
calculate log of FFT bin magnitudes for N/2 points of FFT (typically 10 * log10(re * re + im * im))
This gives log magnitude (i.e. dB) versus linear frequency.
If you also want a log frequency scale then you will need to accumulate the magnitude from appropriate ranges of bins (and you will need a fairly large FFT to start with).

Well I'm not 100% sure what you mean but surely its just bucketing the data from an FFT?
If you want to get the data such that you have (for a 44Khz file) frequency points at 22Khz, 11Khz 5.5Khz etc then you could use a wavelet decomposition, i guess ...
This thread may help ya a bit ...
Converting an FFT to a spectogram
Same sort of information as a spectrogram I'd guess ...

What you need is power spectrum graph. You have to compute DFT of your signal's current window. Then square each value.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string