Python adaptfilt 2.0 FloatingPointError: invalid value encountered in multiply - python-3.x

I am trying to use the "echo cancel" example in python 3.4 from the library adaptfilt 2.0 which looks like this:
import numpy as np
import adaptfilt as adf
# Get u(n) - this is available on github or pypi in the examples folder
u = np.load('speech.npy')
# Generate received signal d(n) using randomly chosen coefficients
coeffs = np.concatenate(([0.8], np.zeros(8), [-0.7], np.zeros(9),
[0.5], np.zeros(11), [-0.3], np.zeros(3),
[0.1], np.zeros(20), [-0.05]))
d = np.convolve(u, coeffs)
# Add background noise
v = np.random.randn(len(d)) * np.sqrt(5000)
d += v
# Apply adaptive filter
M = 100 # Number of filter taps in adaptive filter
step = 0.1 # Step size
y, e, w = adf.nlms(u, d, M, step, returnCoeffs=True)
# Calculate mean square weight error
mswe = adf.mswe(w, coeffs)
It works as expected. But then I wanted to to do the same thing with some real data from music file and I get an error:
Traceback (most recent call last):
File "C:/Python34/Lib/site-packages/adaptfilt/echocancel.py", line 86, in <module>
y, e, w = adf.nlms(u, d, M, step, returnCoeffs=True)
File "C:\Python34\Lib\site-packages\adaptfilt\nlms.py", line 149, in nlms
w = leakstep * w + step * normFactor * x * e[n]
FloatingPointError: invalid value encountered in multiply
The code I used is this:
import numpy as np
import adaptfilt as adf
import pyaudio
import wave
np.seterr(all='raise')
p = pyaudio.PyAudio()
stream = p.open(format = p.get_format_from_width(2),
channels = 1,
rate = 44100,
input = True,
output = True,
#stream_callback = self.callback
)
wf = wave.open("XXX.wav", 'rb')
while u != " ":
data = wf.readframes(1024)
u = np.fromstring(data, np.int16)
# Generate received signal d(n) using randomly chosen coefficients
coeffs = np.concatenate(([0.8], np.zeros(8), [-0.7], np.zeros(9),
[0.5], np.zeros(11), [-0.3], np.zeros(3),
[0.1], np.zeros(20), [-0.05]))
coeffs.dtype = np.int16
d = np.convolve(u, coeffs)
# Add background noise
v = np.random.randn(len(d)) * np.sqrt(5000)
d += v
# Apply adaptive filter
M = 100 # Number of filter taps in adaptive filter
step = 0.1 # Step size
y, e, w = adf.nlms(u, d, M, step, returnCoeffs=True)
# Calculate mean square weight error
mswe = adf.mswe(w, coeffs)
stream.write(y.astype(np.int16).tostring())
The only difference I see is that the array from "speech.npy" is type of float64 and my array from the wav file is type of int16.

I was also able to get 'adaptfilt 2.0' to work on music data (a mono .wav file with 44.1kHz sampling rate) by casting 'data' to float64.
The adaptive filter took longer to converge, but it worked OK. Below is the Mean squared weight error plot showing the filter converge.
I would also add that to apply this to music, you'll likely need a longer filter tap (M). The 'speech.npy' array used in the original script is just an array with no sample rate information, but we can assume the sample rate in a speech file is lower than 44.1kHz.
I played the 'speech.npy' back at varying sample rates, and just by listening to what sounds natural, I'm guessing it's in the 10-12kHz range. This means that the "impulse response" stored in 'coeffs', is ~5.7 ms (assuming 10kHz sampling rate). At 44.1kHz, the impulse response is only ~1.3ms. This is very short and unlikely to model the impulse response for real signals.

This error arises due to numerical issues when using the int16 type for computations somewhere inside the module. Using your example and casting the input data to a floating point type such as
u = np.fromstring(data, np.int16)
u = np.float64(u)
resolves the issue for me.

Related

Python, Extract spline coefficient

I am using python3, Scipy
I have a 3d points (x,y,z]
From them I make s apline using scipy.interpolate.splprep
x_points = np.linspace(0, 2*np.pi, 10)
y_points = np.sin(x_points)
z_points = np.cos(x_points)
path = np.vstack([x_points, y_points, z_points])
tck, u = sc.splprep(path, k=3, s=0)
I wish to get the coefficients of the spline[i]:
For example the latest splins:
sp9 = a9 + b9(x-x4) + c9(x-x4)^2 + d9(x-x4)^3
I know that the tck is (t,c,k) a tuple containing the vector of knots, the B-spline coefficients, and the degree of the spline.
But I don't see how I can get this spline function and plot only it
I tried using this method:
import numpy as np
import scipy.interpolate as sc
x_points = np.linspace(0, 2*np.pi, 10)
y_points = np.sin(x_points)
z_points = np.cos(x_points)
path = np.vstack([x_points, y_points, z_points])
tck, u = sc.splprep(path, k=3, s=0)
p = sc.PPoly.from_spline(tck)
but I'm getting this error on the last line:
p = sc.PPoly.from_spline(tck) File
"C:\Users...\Python38\lib\site-packages\scipy\interpolate\interpolate.py",
line 1314, in from_spline cvals = np.empty((k + 1, len(t)-1),
dtype=c.dtype)
AttributeError: 'list' object has no attribute 'dtype'
The coefficients in the tck tuple are in the b-spline basis. If you want to convert them to the power basis, you can do PPoly.from_spline(tck) .
An obligatory note however: converting between bases incurs numerical errors.
EDIT. First, as it's splprep, you'll need to convert the list-of-arrays c into a proper numpy array and transpose (it's a known wart of splPrep). Then, as it turns out, PPoly.from_spline does not handle multidimensional c (this might be a nice pull request to the scipy repository), so you'll need to e.g. loop over the dimensions. Something along the lines of (continuing from your OP)
t, c, k = tck
cc = np.asarray(c) # cc.shape is (3, 10) now
spl0 = sc.PPoly.from_spline((t, cc.T[0], k))
print(spl0.c) # here are your coefficients for the component 0

FFT plot of raw PCM comes wrong for higher frequency in python

Here I am using fft function of numpy to plot the fft of PCM wave generated from a 10000Hz sine wave. But the amplitude of the plot I am getting is wrong.
The frequency is coming correct using fftfreq function which I am printing in the console itself. My python code is here.
import numpy as np
import matplotlib.pyplot as plt
frate = 44100
filename = 'Sine_10000Hz.bin' #signed16 bit PCM of a 10000Hz sine wave
f = open('Sine_10000Hz.bin','rb')
y = np.fromfile(f,dtype='int16') #Extract the signed 16 bit PCM value of 10000Hz Sine wave
f.close()
####### Spectral Analysis #########
fft_value = np.fft.fft(y)
freqs = np.fft.fftfreq(len(fft_value)) # frequencies associated with the coefficients:
print("freqs.min(), freqs.max()")
idx = np.argmax(np.abs(fft_value)) # Find the peak in the coefficients
freq = freqs[idx]
freq_in_hertz = abs(freq * frate)
print("\n\n\n\n\n\nfreq_in_hertz")
print(freq_in_hertz)
for i in range(2):
print("Value at index {}:\t{}".format(i, fft_value[i + 1]), "\nValue at index {}:\t{}".format(fft_value.size -1 - i, fft_value[-1 - i]))
#####
n_sa = 8 * int(freq_in_hertz)
t_fft = np.linspace(0, 1, n_sa)
T = t_fft[1] - t_fft[0] # sampling interval
N = n_sa #Here it is n_sample
print("\nN value=")
print(N)
# 1/T = frequency
f = np.linspace(0, 1 / T, N)
plt.ylabel("Amplitude")
plt.xlabel("Frequency [Hz]")
plt.xlim(0,15000)
# 2 / N is a normalization factor Here second half of the sequence gives us no new information that the half of the FFT sequence is the output we need.
plt.bar(f[:N // 2], np.abs(fft_value)[:N // 2] * 2 / N, width=15,color="red")
Output comes in the console (Only minimal prints I am pasting here)
freqs.min(), freqs.max()
-0.5 0.49997732426303854
freq_in_hertz
10000.0
Value at index 0: (19.949569768991054-17.456031216294324j)
Value at index 44099: (19.949569768991157+17.45603121629439j)
Value at index 1: (9.216783424692835-13.477631008179145j)
Value at index 44098: (9.216783424692792+13.477631008179262j)
N value=
80000
The frequency extraction is coming correctly but in the plot something I am doing is incorrect which I don't know.
Updating the work:
When I am change the multiplication factor 10 in the line n_sa = 10 * int(freq_in_hertz) to 5 gives me correct plot. Whether its correct or not I am not able to understand
In the line plt.xlim(0,15000) if I increase max value to 20000 again is not plotting. Till 15000 it is plotting correctly.
I generated this Sine_10000Hz.bin using Audacity tool where I generate a sine wave of freq 10000Hz of 1sec duration and a sampling rate of 44100. Then I exported this audio to signed 16bit with headerless (means raw PCM). I could able to regenerate this sine wave using this script. Also I want to calculate the FFT of this. So I expect a peak at 10000Hz with amplitude 32767. You can see i changed the multiplication factor 8 instead of 10 in the line n_sa = 8 * int(freq_in_hertz). Hence it worked. But the amplitude is showing incorrect. I will attach my new figure here
I'm not sure exactly what you are trying to do, but my suspicion is that the Sine_10000Hz.bin file isn't what you think it is.
Is it possible it contains more than one channel (left & right)?
Is it realy signed 16 bit integers?
It's not hard to create a 10kHz sine wave in 16 bit integers in numpy.
import numpy as np
import matplotlib.pyplot as plt
n_samples = 2000
f_signal = 10000 # (Hz) Signal Frequency
f_sample = 44100 # (Hz) Sample Rate
amplitude = 2**3 # Arbitrary. Must be > 1. Should be > 2. Larger makes FFT results better
time = np.arange(n_samples) / f_sample # sample times
# The signal
y = (np.sin(time * f_signal * 2 * np.pi) * amplitude).astype('int16')
If you plot 30 points of the signal you can see there are about 5 points per cycle.
plt.plot(time[:30], y[:30], marker='o')
plt.xlabel('Time (s)')
plt.yticks([]); # Amplitude value is artificial. hide it
If you plot 30 samples of the data from Sine_10000Hz.bin does it have about 5 points per cycle?
This is my attempt to recreate the FFT work as I understand it.
fft_value = np.fft.fft(y) # compute the FFT
freqs = np.fft.fftfreq(len(fft_value)) * f_sample # frequencies for each FFT bin
N = len(y)
plt.plot(freqs[:N//2], np.abs(fft_value[:N//2]))
plt.yscale('log')
plt.ylabel("Amplitude")
plt.xlabel("Frequency [Hz]")
I get the following plot
The y-axis of this plot is on a log scale. Notice that the amplitude of the peak is in the thousands. The amplitude of most of the rest of the data points are around 100.
idx_max = np.argmax(np.abs(fft_value)) # Find the peak in the coefficients
idx_min = np.argmin(np.abs(fft_value)) # Find the peak in the coefficients
print(f'idx_max = {idx_max}, idx_min = {idx_min}')
print(f'f_max = {freqs[idx_max]}, f_min = {freqs[idx_min]}')
print(f'fft_value[idx_max] {fft_value[idx_max]}')
print(f'fft_value[idx_min] {fft_value[idx_min]}')
produces:
idx_max = 1546, idx_min = 1738
f_max = -10010.7, f_min = -5777.1
fft_value[idx_max] (-4733.232076236707+219.11718299533203j)
fft_value[idx_min] (-0.17017443966211232+0.9557200531465061j)
I'm adding a link to a script I've build that outputs the FFT with ACTUAL amplitude (for real signals - e.g. your signal). Have a go and see if it works:
dt=1/frate in your constellation....
https://stackoverflow.com/a/53925342/4879610
After a long home work I could able to find my issue. As I mentioned in the Updating the work: the reason was with the number of samples which I took was wrong.
I changed the two lines in the code
n_sa = 8 * int(freq_in_hertz)
t_fft = np.linspace(0, 1, n_sa)
to
n_sa = y.size //number of samples directly taken from the raw 16bits
t_fft = np.arange(n_sa)/frate //Here we need to divide each samples by the sampling rate
This solved my issue.
My spectral output is
Special thanks to #meta4 and #YoniChechik for giving me some suggestions.

How to implement Pytorch 1D crosscorrelation for long signals in fourier domain?

I have a series of signals length n = 36,000 which I need to perform crosscorrelation on. Currently, my cpu implementation in numpy is a little slow. I've heard Pytorch can greatly speed up tensor operations, and provides a way to perform computations in parallel on the GPU. I'd like to explore this option, but I'm not quite sure how to accomplish this using the framework.
Because of the length of these signals, I'd prefer to perform the crosscorrelation operation in the frequency domain.
Normally using numpy I'd perform the operation like so:
import numpy as np
signal_length=36000
# make the signals
signal_1 = np.random.uniform(-1,1, signal_length)
signal_2 = np.random.uniform(-1,1, signal_length)
# output target length of crosscorrelation
x_cor_sig_length = signal_length*2 - 1
# get optimized array length for fft computation
fast_length = np.fftpack.next_fast_len(x_cor_sig_length)
# move data into the frequency domain. axis=-1 to perform
# along last dimension
fft_1 = np.fft.rfft(src_data, fast_length, axis=-1)
fft_2 = np.fft.rfft(src_data, fast_length, axis=-1)
# take the complex conjugate of one of the spectrums. Which one you choose depends on domain specific conventions
fft_1 = np.conj(fft_1)
fft_multiplied = fft_1 * fft_2
# back to time domain.
prelim_correlation = np.fft.irfft(result, x_corr_sig_length, axis=-1)
# shift the signal to make it look like a proper crosscorrelation,
# and transform the output to be purely real
final_result = np.real(np.fft.fftshift(prelim_correlation),axes=-1)).astype(np.float64)
Looking at the Pytorch documentation, there doesn't seem to be an equivalent for numpy.conj(). I'm also not sure if/how I need to implement a fftshift after the irfft operation.
So how would you go about writing a 1D crosscorrelation in Pytorch using the fourier method?
A few things to be considered.
Python interpreter is very slow, what those vectorization libraries do is to move the workload to a native implementation. In order to make any difference you need to be able to give perform many operations in one python instruction. Evaluating things on GPU follows the same principle, while GPU has more compute power it is slower to copy data to/from GPU.
I adapted your example to process multiple signals simulataneously.
import numpy as np
def numpy_xcorr(BATCH=1, signal_length=36000):
# make the signals
signal_1 = np.random.uniform(-1,1, (BATCH, signal_length))
signal_2 = np.random.uniform(-1,1, (BATCH, signal_length))
# output target length of crosscorrelation
x_cor_sig_length = signal_length*2 - 1
# get optimized array length for fft computation
fast_length = next_fast_len(x_cor_sig_length)
# move data into the frequency domain. axis=-1 to perform
# along last dimension
fft_1 = np.fft.rfft(signal_1, fast_length, axis=-1)
fft_2 = np.fft.rfft(signal_2 + 0.1 * signal_1, fast_length, axis=-1)
# take the complex conjugate of one of the spectrums.
fft_1 = np.conj(fft_1)
fft_multiplied = fft_1 * fft_2
# back to time domain.
prelim_correlation = np.fft.irfft(fft_multiplied, fast_length, axis=-1)
# shift the signal to make it look like a proper crosscorrelation,
# and transform the output to be purely real
final_result = np.fft.fftshift(np.real(prelim_correlation), axes=-1)
return final_result, np.sum(final_result)
Since torch 1.7 we have the torch.fft module that provides an interface similar to numpy.fft, the fftshift is missing but the same result can be obtained with torch.roll. Another point is that numpy uses by default 64-bit precision and torch will use 32-bit precision.
The fast length consists in choosing smooth numbers (those having that are factorized in to small prime numbers, and I suppose you are familiar with this subject).
def next_fast_len(n, factors=[2, 3, 5, 7]):
'''
Returns the minimum integer not smaller than n that can
be written as a product (possibly with repettitions) of
the given factors.
'''
best = float('inf')
stack = [1]
while len(stack):
a = stack.pop()
if a >= n:
if a < best:
best = a;
if best == n:
break; # no reason to keep searching
else:
for p in factors:
b = a * p;
if b < best:
stack.append(b)
return best;
Then the torch implementation goes
import torch;
import torch.fft
def torch_xcorr(BATCH=1, signal_length=36000, device='cpu', factors=[2,3,5], dtype=torch.float):
signal_length=36000
# torch.rand is random in the range (0, 1)
signal_1 = 1 - 2*torch.rand((BATCH, signal_length), device=device, dtype=dtype)
signal_2 = 1 - 2*torch.rand((BATCH, signal_length), device=device, dtype=dtype)
# just make the cross correlation more interesting
signal_2 += 0.1 * signal_1;
# output target length of crosscorrelation
x_cor_sig_length = signal_length*2 - 1
# get optimized array length for fft computation
fast_length = next_fast_len(x_cor_sig_length, [2, 3])
# the last signal_ndim axes (1,2 or 3) will be transformed
fft_1 = torch.fft.rfft(signal_1, fast_length, dim=-1)
fft_2 = torch.fft.rfft(signal_2, fast_length, dim=-1)
# take the complex conjugate of one of the spectrums. Which one you choose depends on domain specific conventions
fft_multiplied = torch.conj(fft_1) * fft_2
# back to time domain.
prelim_correlation = torch.fft.irfft(fft_multiplied, dim=-1)
# shift the signal to make it look like a proper crosscorrelation,
# and transform the output to be purely real
final_result = torch.roll(prelim_correlation, (fast_length//2,))
return final_result, torch.sum(final_result);
And here a code to test the results
import time
funcs = {'numpy-f64': lambda b: numpy_xcorr(b, factors=[2,3,5], dtype=np.float64),
'numpy-f32': lambda b: numpy_xcorr(b, factors=[2,3,5], dtype=np.float32),
'torch-cpu-f64': lambda b: torch_xcorr(b, device='cpu', factors=[2,3,5], dtype=torch.float64),
'torch-cpu': lambda b: torch_xcorr(b, device='cpu', factors=[2,3,5], dtype=torch.float32),
'torch-gpu-f64': lambda b: torch_xcorr(b, device='cuda', factors=[2,3,5], dtype=torch.float64),
'torch-gpu': lambda b: torch_xcorr(b, device='cuda', factors=[2,3,5], dtype=torch.float32),
}
times ={}
for batch in [1, 10, 100]:
times[batch] = {}
for l, f in funcs.items():
t0 = time.time()
t1, t2 = f(batch)
tf = time.time()
del t1
del t2
times[batch][l] = 1000 * (tf - t0) / batch;
I obtained the following results
And what surprised myself is the result when the numbers are not so smooth e.g. using 17-smooth length the torch implementation is so much better that I used logarithmic scale here (with batch size 100 the torch gpu was 10000 times faster than numpy with batch size 1).
Remember that these functions are generating the data at the GPU in general we want to copy the final results to the CPU, if we consider the time spent copying the final result to CPU I observed times up to 10x higher than the cross correlation computation (random data generation + three FFTs).

Converting a wav file to amplitude and frequency values for textual, time-series analysis

I'm processing wav files for amplitude and frequency analysis with FFT, but I am having trouble getting the data out to csv in a time series format.
Using #Beginner's answer heavily from this post: How to convert a .wav file to a spectrogram in python3, I'm able to get the spectrogram output in an image. I'm trying to simplify that somewhat to get to a text output in csv format, but I'm not seeing how to do so. The outcome I'm hoping to achieve would look something like the following:
time_in_ms, amplitude_in_dB, freq_in_kHz
.001, -115, 1
.002, -110, 2
.003, 20, 200
...
19000, 20, 200
For my testing, I have been using http://soundbible.com/2123-40-Smith-Wesson-8x.html, (Notes: I simplified the wav down to a single channel and removed metadata w/ Audacity to get it to work.)
Heavy props to #Beginner for 99.9% of the following, anything nonsensical is surely mine.
import numpy as np
from matplotlib import pyplot as plt
import scipy.io.wavfile as wav
from numpy.lib import stride_tricks
filepath = "40sw3.wav"
""" short time fourier transform of audio signal """
def stft(sig, frameSize, overlapFac=0.5, window=np.hanning):
win = window(frameSize)
hopSize = int(frameSize - np.floor(overlapFac * frameSize))
# zeros at beginning (thus center of 1st window should be for sample nr. 0)
samples = np.append(np.zeros(int(np.floor(frameSize/2.0))), sig)
# cols for windowing
cols = np.ceil( (len(samples) - frameSize) / float(hopSize)) + 1
# zeros at end (thus samples can be fully covered by frames)
samples = np.append(samples, np.zeros(frameSize))
frames = stride_tricks.as_strided(samples, shape=(int(cols), frameSize), strides=(samples.strides[0]*hopSize, samples.strides[0])).copy()
frames *= win
return np.fft.rfft(frames)
""" scale frequency axis logarithmically """
def logscale_spec(spec, sr=44100, factor=20.):
timebins, freqbins = np.shape(spec)
scale = np.linspace(0, 1, freqbins) ** factor
scale *= (freqbins-1)/max(scale)
scale = np.unique(np.round(scale))
# create spectrogram with new freq bins
newspec = np.complex128(np.zeros([timebins, len(scale)]))
for i in range(0, len(scale)):
if i == len(scale)-1:
newspec[:,i] = np.sum(spec[:,int(scale[i]):], axis=1)
else:
newspec[:,i] = np.sum(spec[:,int(scale[i]):int(scale[i+1])], axis=1)
# list center freq of bins
allfreqs = np.abs(np.fft.fftfreq(freqbins*2, 1./sr)[:freqbins+1])
freqs = []
for i in range(0, len(scale)):
if i == len(scale)-1:
freqs += [np.mean(allfreqs[int(scale[i]):])]
else:
freqs += [np.mean(allfreqs[int(scale[i]):int(scale[i+1])])]
return newspec, freqs
""" compute spectrogram """
def compute_stft(audiopath, binsize=2**10):
samplerate, samples = wav.read(audiopath)
s = stft(samples, binsize)
sshow, freq = logscale_spec(s, factor=1.0, sr=samplerate)
ims = 20.*np.log10(np.abs(sshow)/10e-6) # amplitude to decibel
return ims, samples, samplerate, freq
""" plot spectrogram """
def plot_stft(ims, samples, samplerate, freq, binsize=2**10, plotpath=None, colormap="jet"):
timebins, freqbins = np.shape(ims)
plt.figure(figsize=(15, 7.5))
plt.imshow(np.transpose(ims), origin="lower", aspect="auto", cmap=colormap, interpolation="none")
plt.colorbar()
plt.xlabel("time (s)")
plt.ylabel("frequency (hz)")
plt.xlim([0, timebins-1])
plt.ylim([0, freqbins])
xlocs = np.float32(np.linspace(0, timebins-1, 5))
plt.xticks(xlocs, ["%.02f" % l for l in ((xlocs*len(samples)/timebins)+(0.5*binsize))/samplerate])
ylocs = np.int16(np.round(np.linspace(0, freqbins-1, 10)))
plt.yticks(ylocs, ["%.02f" % freq[i] for i in ylocs])
if plotpath:
plt.savefig(plotpath, bbox_inches="tight")
else:
plt.show()
plt.clf()
"" HERE IS WHERE I'm ATTEMPTING TO GET IT OUT TO TXT """
ims, samples, samplerate, freq = compute_stft(filepath)
""" Print lengths """
print('ims len:', len(ims))
print('samples len:', len(samples))
print('samplerate:', samplerate)
print('freq len:', len(freq))
""" Write values to files """
np.savetxt(filepath + '-ims.txt', ims, delimiter=', ', newline='\n', header='ims')
np.savetxt(filepath + '-samples.txt', samples, delimiter=', ', newline='\n', header='samples')
np.savetxt(filepath + '-frequencies.txt', freq, delimiter=', ', newline='\n', header='frequencies')
In terms of values out, the file I'm analyzing is approx 19.1 seconds long and the sample rate is 44100, so I’d expect to have about 842k values for any given variable. But I'm not seeing what I expected. Instead here is what I see:
freqs comes out with just a handful of values, 512 and while they appear to be correct range for expected frequency, they are ordered least to greatest, not in time series like I expected. The 512 values, I assume, is the "fast" in FFT, basically down-sampled...
ims, appears to be amplitude, but values seem too high, although sample size is correct. Should be seeing -50 up to ~240dB.
samples . . . not sure.
In short, can someone advise on how I'd get the FFT out to a text file with time, amp, and freq values for the entire sample set? Is savetxt the correct route, or is there a better way? This code can certainly be used to make a great spectrogram, but how can I just get out the data?
Your output format is too limiting, as the audio spectrum at any interval in time usually contains a range of frequencies. e.g the FFT of a 1024 samples will contain 512 frequency bins for one window of time or time step, each with an amplitude. If you want a time step of one millisecond, then you will have to offset the window of samples you feed each STFT to center the window at that point in your sample vector. Although with an FFT about 23 milliseconds long, that will involve a high overlap of windows. You could use shorter windows, but the time-frequency trade-off will result in proportionately less frequency resolution.

Python Librosa : What is the default frame size used to compute the MFCC features?

Using Librosa library, I generated the MFCC features of audio file 1319 seconds into a matrix 20 X 56829. The 20 here represents the no of MFCC features (Which I can manually adjust it). But I don't know how it segmented the audio length into 56829. What is the frame size it takes process the audio?
import numpy as np
import matplotlib.pyplot as plt
import librosa
def getPathToGroundtruth(episode):
"""Return path to groundtruth file for episode"""
pathToGroundtruth = "../../../season01/Audio/" \
+ "Season01.Episode%02d.en.wav" % episode
return pathToGroundtruth
def getduration(episode):
pathToAudioFile = getPathToGroundtruth(episode)
y, sr = librosa.load(pathToAudioFile)
duration = librosa.get_duration(y=y, sr=sr)
return duration
def getMFCC(episode):
filename = getPathToGroundtruth(episode)
y, sr = librosa.load(filename) # Y gives
data = librosa.feature.mfcc(y=y, sr=sr)
return data
data = getMFCC(1)
Short Answer
You can specify the change the length by changing the parameters used in the stft calculations. The following code will double the size of your output (20 x 113658)
data = librosa.feature.mfcc(y=y, sr=sr, n_fft=1012, hop_length=256, n_mfcc=20)
Long Answer
Librosa's librosa.feature.mfcc() function really just acts as a wrapper to librosa's librosa.feature.melspectrogram() function (which is a wrapper to librosa.core.stft and librosa.filters.mel functions).
All of the parameters pertaining to segementation of the audio signal - namely the frame and overlap values - are specified utilized in the Mel-scaled power spectrogram function (with other tune-able parameters specified for nested core functions). You specify these parameters as keyword arguments in the librosa.feature.mfcc() function.
All extra **kwargs parameters are fed to librosa.feature.melspectrogram() and subsequently to librosa.filters.mel()
By Default, the Mel-scaled power spectrogram window and hop length are the following:
n_fft=2048
hop_length=512
So assuming you used the default sample rate (sr=22050), the output of your mfcc function makes sense:
output length = (seconds) * (sample rate) / (hop_length)
(1319) * (22050) / (512) = 56804 samples
The parameters that you are able to tune, are the following:
Melspectrogram Parameters
-------------------------
y : np.ndarray [shape=(n,)] or None
audio time-series
sr : number > 0 [scalar]
sampling rate of `y`
S : np.ndarray [shape=(d, t)]
power spectrogram
n_fft : int > 0 [scalar]
length of the FFT window
hop_length : int > 0 [scalar]
number of samples between successive frames.
See `librosa.core.stft`
kwargs : additional keyword arguments
Mel filter bank parameters.
See `librosa.filters.mel` for details.
If you want to further specify characteristics of the mel filterbank used to define the Mel-scaled power spectrogram, you can tune the following
Mel Frequency Parameters
------------------------
sr : number > 0 [scalar]
sampling rate of the incoming signal
n_fft : int > 0 [scalar]
number of FFT components
n_mels : int > 0 [scalar]
number of Mel bands to generate
fmin : float >= 0 [scalar]
lowest frequency (in Hz)
fmax : float >= 0 [scalar]
highest frequency (in Hz).
If `None`, use `fmax = sr / 2.0`
htk : bool [scalar]
use HTK formula instead of Slaney
Documentation for Librosa:
librosa.feature.melspectrogram
librosa.filters.mel
librosa.core.stft

Resources