Converting mel spectrograms to log mel energies - audio

I want to convert a mel spectrogram to log mel energies. This is what I used:
y, sr = librosa.load(filename, sr=16000)
mel_spectrogram = librosa.feature.melspectrogram(
y=y, sr=sr, n_mels=128, n_fft=1024, hop_length=512, power=2)
log_mel_spectrogram = librosa.power_to_db(mel_spectrogram)
I thought this converts to log mel energies, but then I found this line of code:
log_mel_spectrogram = 20.0 / power * np.log10(np.maximum(mel_spectrogram, sys.float_info.epsilon))
My question is: what is the difference between log-mel spectrograms and log mel energies, and which line of code should I use?
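One way to see how the two lines relate is to compare them numerically. The sketch below uses NumPy only. `power_to_db_sketch` is a hypothetical re-implementation written from the behaviour librosa documents for `power_to_db` (defaults `ref=1.0`, `amin=1e-10`, `top_db=80.0`); it is not librosa's own code, and the random array only stands in for a real mel spectrogram. With `power=2`, the factor `20.0 / power` is 10, so both lines compute `10 * log10` of the mel power. Under these assumptions they should differ only in the floor value and in the `top_db` clipping.

```python
import sys
import numpy as np

def log_mel_formula(mel_power, power=2.0):
    # The hand-written line from the question
    return 20.0 / power * np.log10(np.maximum(mel_power, sys.float_info.epsilon))

def power_to_db_sketch(S, ref=1.0, amin=1e-10, top_db=80.0):
    # Assumption: mirrors the documented behaviour of librosa.power_to_db
    log_spec = 10.0 * np.log10(np.maximum(amin, S))
    log_spec -= 10.0 * np.log10(np.maximum(amin, ref))
    if top_db is not None:
        # Clip everything more than top_db below the maximum
        log_spec = np.maximum(log_spec, log_spec.max() - top_db)
    return log_spec

# Stand-in mel power spectrogram (128 bands x 20 frames), about 40 dB of range
rng = np.random.default_rng(0)
mel_power = rng.uniform(1e-3, 10.0, size=(128, 20))

a = log_mel_formula(mel_power)
b = power_to_db_sketch(mel_power)
max_diff = np.max(np.abs(a - b))
print(max_diff)
```

For input like this, whose dynamic range stays below 80 dB, the two results agree to floating-point precision. They would diverge only for very small values (different floors) or when `top_db` clipping kicks in.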

Related

FFT plot of raw PCM comes wrong for higher frequency in python

Here I am using NumPy's fft function to plot the FFT of a PCM wave generated from a 10000 Hz sine wave, but the amplitude in the plot I am getting is wrong.
The frequency comes out correct using the fftfreq function, which I print to the console. My Python code is here.
import numpy as np
import matplotlib.pyplot as plt
frate = 44100
filename = 'Sine_10000Hz.bin' #signed16 bit PCM of a 10000Hz sine wave
f = open('Sine_10000Hz.bin','rb')
y = np.fromfile(f,dtype='int16') #Extract the signed 16 bit PCM value of 10000Hz Sine wave
f.close()
####### Spectral Analysis #########
fft_value = np.fft.fft(y)
freqs = np.fft.fftfreq(len(fft_value)) # frequencies associated with the coefficients:
print("freqs.min(), freqs.max()")
idx = np.argmax(np.abs(fft_value)) # Find the peak in the coefficients
freq = freqs[idx]
freq_in_hertz = abs(freq * frate)
print("\n\n\n\n\n\nfreq_in_hertz")
print(freq_in_hertz)
for i in range(2):
    print("Value at index {}:\t{}".format(i, fft_value[i + 1]), "\nValue at index {}:\t{}".format(fft_value.size -1 - i, fft_value[-1 - i]))
#####
n_sa = 8 * int(freq_in_hertz)
t_fft = np.linspace(0, 1, n_sa)
T = t_fft[1] - t_fft[0] # sampling interval
N = n_sa #Here it is n_sample
print("\nN value=")
print(N)
# 1/T = frequency
f = np.linspace(0, 1 / T, N)
plt.ylabel("Amplitude")
plt.xlabel("Frequency [Hz]")
plt.xlim(0,15000)
# 2 / N is a normalization factor Here second half of the sequence gives us no new information that the half of the FFT sequence is the output we need.
plt.bar(f[:N // 2], np.abs(fft_value)[:N // 2] * 2 / N, width=15,color="red")
Output in the console (I am pasting only a minimal set of prints here):
freqs.min(), freqs.max()
-0.5 0.49997732426303854
freq_in_hertz
10000.0
Value at index 0: (19.949569768991054-17.456031216294324j)
Value at index 44099: (19.949569768991157+17.45603121629439j)
Value at index 1: (9.216783424692835-13.477631008179145j)
Value at index 44098: (9.216783424692792+13.477631008179262j)
N value=
80000
The frequency extraction is correct, but I am doing something wrong in the plot and I don't know what.
Update:
When I change the multiplication factor in the line n_sa = 10 * int(freq_in_hertz) from 10 to 5, I get a correct plot. I am not able to tell whether that is actually correct.
If I increase the maximum value in the line plt.xlim(0,15000) to 20000, it again does not plot. Up to 15000 it plots correctly.
I generated Sine_10000Hz.bin with Audacity: I generated a 10000 Hz sine wave of 1 s duration at a sampling rate of 44100 Hz, then exported the audio as headerless (raw PCM) signed 16-bit. I was able to regenerate this sine wave using this script. I also want to calculate its FFT, so I expect a peak at 10000 Hz with amplitude 32767. As you can see, I changed the multiplication factor to 8 instead of 10 in the line n_sa = 8 * int(freq_in_hertz), and with that it worked, but the amplitude shown is incorrect. I will attach my new figure here.
I'm not sure exactly what you are trying to do, but my suspicion is that the Sine_10000Hz.bin file isn't what you think it is.
Is it possible it contains more than one channel (left & right)?
Is it really signed 16-bit integers?
It's not hard to create a 10kHz sine wave in 16 bit integers in numpy.
import numpy as np
import matplotlib.pyplot as plt
n_samples = 2000
f_signal = 10000 # (Hz) Signal Frequency
f_sample = 44100 # (Hz) Sample Rate
amplitude = 2**3 # Arbitrary. Must be > 1. Should be > 2. Larger makes FFT results better
time = np.arange(n_samples) / f_sample # sample times
# The signal
y = (np.sin(time * f_signal * 2 * np.pi) * amplitude).astype('int16')
If you plot 30 points of the signal you can see there are about 5 points per cycle.
plt.plot(time[:30], y[:30], marker='o')
plt.xlabel('Time (s)')
plt.yticks([]); # Amplitude value is artificial. hide it
If you plot 30 samples of the data from Sine_10000Hz.bin does it have about 5 points per cycle?
This is my attempt to recreate the FFT work as I understand it.
fft_value = np.fft.fft(y) # compute the FFT
freqs = np.fft.fftfreq(len(fft_value)) * f_sample # frequencies for each FFT bin
N = len(y)
plt.plot(freqs[:N//2], np.abs(fft_value[:N//2]))
plt.yscale('log')
plt.ylabel("Amplitude")
plt.xlabel("Frequency [Hz]")
I get the following plot
The y-axis of this plot is on a log scale. Notice that the amplitude of the peak is in the thousands. The amplitude of most of the rest of the data points are around 100.
idx_max = np.argmax(np.abs(fft_value)) # Find the peak in the coefficients
idx_min = np.argmin(np.abs(fft_value)) # Find the smallest coefficient
print(f'idx_max = {idx_max}, idx_min = {idx_min}')
print(f'f_max = {freqs[idx_max]}, f_min = {freqs[idx_min]}')
print(f'fft_value[idx_max] {fft_value[idx_max]}')
print(f'fft_value[idx_min] {fft_value[idx_min]}')
produces:
idx_max = 1546, idx_min = 1738
f_max = -10010.7, f_min = -5777.1
fft_value[idx_max] (-4733.232076236707+219.11718299533203j)
fft_value[idx_min] (-0.17017443966211232+0.9557200531465061j)
I'm adding a link to a script I've built that outputs the FFT with the ACTUAL amplitude (for real signals, such as yours). Have a go and see if it works:
In your setup, dt = 1/frate.
https://stackoverflow.com/a/53925342/4879610
After a lot of homework I was able to find my issue. As I mentioned in my update to the question, the cause was that the number of samples I used was wrong.
I changed these two lines in the code
n_sa = 8 * int(freq_in_hertz)
t_fft = np.linspace(0, 1, n_sa)
to
n_sa = y.size  # number of samples, taken directly from the raw 16-bit data
t_fft = np.arange(n_sa) / frate  # divide each sample index by the sampling rate
This solved my issue.
My spectral output is
Special thanks to @meta4 and @YoniChechik for their suggestions.
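To make the fix above concrete, here is a minimal sketch of an amplitude-normalised spectrum for this kind of signal. Since Sine_10000Hz.bin is not available here, the code synthesises a stand-in signal with the properties described in the question (1 s, 44100 Hz, full-scale int16); that substitution and the use of `np.fft.rfft` instead of `np.fft.fft` are assumptions of this sketch, not part of the original script.

```python
import numpy as np

frate = 44100      # sampling rate (Hz), as in the question
f_signal = 10000   # sine frequency (Hz)
n_sa = frate       # number of samples: 1 second of audio

# Synthetic stand-in for the contents of Sine_10000Hz.bin (assumption)
t = np.arange(n_sa) / frate
y = np.round(32767 * np.sin(2 * np.pi * f_signal * t)).astype(np.int16)

# One-sided FFT; frequencies come from the real sample count and rate
spectrum = np.fft.rfft(y.astype(np.float64))
freqs_hz = np.fft.rfftfreq(n_sa, d=1.0 / frate)

# 2 / N scaling recovers the sine amplitude for bins other than DC and Nyquist
amplitude = np.abs(spectrum) * 2 / n_sa

peak = np.argmax(amplitude)
peak_freq = freqs_hz[peak]
peak_amp = amplitude[peak]
print(peak_freq, peak_amp)
```

The key point is that both the frequency axis and the `2 / N` scaling use the actual number of samples in `y`, rather than a value derived from the detected frequency.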

librosa.util.exceptions.ParameterError: Invalid shape for monophonic audio: ndim=2, shape=(172972, 2)

Please somebody help me to solve this
I was following this tutorial:
https://data-flair.training/blogs/python-mini-project-speech-emotion-recognition/
And used their dataset, which they took from the RAVDESS dataset after lowering its sample rate. I can train on this data easily. But when I use the original data from here:
https://zenodo.org/record/1188976
using just "Audio_Speech_Actors_01-24.zip", and try to train the model, it gives me the error below:
Traceback (most recent call last):
File "C:/Users/raj.pandey/Desktop/speech-emotion-recognition/main.py", line 64, in <module>
x_train, x_test, y_train, y_test = load_data(test_size=0.20)
File "C:/Users/raj.pandey/Desktop/speech-emotion-recognition/main.py", line 57, in load_data
feature = extract_feature(file, mfcc=True, chroma=True, mel=True)
File "C:/Users/raj.pandey/Desktop/speech-emotion-recognition/main.py", line 32, in extract_feature
stft = np.abs(librosa.stft(X))
File "C:\Users\raj.pandey\Desktop\speech-emotion-recognition\lib\site-packages\librosa\core\spectrum.py", line 215, in stft
util.valid_audio(y)
File "C:\Users\raj.pandey\Desktop\speech-emotion-recognition\lib\site-packages\librosa\util\utils.py", line 268, in valid_audio
'ndim={:d}, shape={}'.format(y.ndim, y.shape))
librosa.util.exceptions.ParameterError: Invalid shape for monophonic audio: ndim=2, shape=(172972, 2)
The tutorial trains on the same dataset, except that they have lowered the sample rate. Why isn't it running on the original one?
Does it have anything to do with this line in the code:
X = sound_file.read(dtype="float32")
Out of curiosity, I also tried to predict from an .mp3 file, and it resulted in an error. Then I converted that .mp3 file to .wav and tried again, but it still gives the error in the title.
How can I solve this error and make it train on the original data? If it trains on the original data, then I think it can predict on the file converted from .mp3 to .wav.
Below is the code that I am using:
import librosa
import soundfile
import os
import glob
import pickle
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
# DataFlair - Emotions in the RAVDESS dataset
emotions = {
    '01': 'neutral',
    '02': 'calm',
    '03': 'happy',
    '04': 'sad',
    '05': 'angry',
    '06': 'fearful',
    '07': 'disgust',
    '08': 'surprised'
}
# DataFlair - Emotions to observe
observed_emotions = ['calm', 'happy', 'fearful', 'disgust']
# DataFlair - Extract features (mfcc, chroma, mel) from a sound file
def extract_feature(file_name, mfcc, chroma, mel):
    with soundfile.SoundFile(file_name) as sound_file:
        X = sound_file.read(dtype="float32")
        sample_rate = sound_file.samplerate
        if chroma:
            stft = np.abs(librosa.stft(X))
        result = np.array([])
        if mfcc:
            mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
            result = np.hstack((result, mfccs))
        if chroma:
            chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
            result = np.hstack((result, chroma))
        if mel:
            mel = np.mean(librosa.feature.melspectrogram(X, sr=sample_rate).T, axis=0)
            result = np.hstack((result, mel))
    return result
# DataFlair - Load the data and extract features for each sound file
def load_data(test_size=0.2):
    x, y = [], []
    for file in glob.glob("C:\\Users\\raj.pandey\\Desktop\\speech-emotion-recognition\\Dataset\\Actor_*\\*.wav"):
    # for file in glob.glob("C:\\Users\\raj.pandey\\Desktop\\speech-emotion-recognition\\Dataset\\newactor\\*.wav"):
        file_name = os.path.basename(file)
        emotion = emotions[file_name.split("-")[2]]
        if emotion not in observed_emotions:
            continue
        feature = extract_feature(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)
# DataFlair - Split the dataset
x_train, x_test, y_train, y_test = load_data(test_size=0.20)
# DataFlair - Get the shape of the training and testing datasets
# print((x_train.shape[0], x_test.shape[0]))
# DataFlair - Get the number of features extracted
# print(f'Features extracted: {x_train.shape[1]}')
# DataFlair - Initialize the Multi Layer Perceptron Classifier
model = MLPClassifier(alpha=0.01, batch_size=256, epsilon=1e-08, hidden_layer_sizes=(300,), learning_rate='adaptive',
max_iter=500)
# DataFlair - Train the model
model.fit(x_train, y_train)
# print(model.fit(x_train, y_train))
# DataFlair - Predict for the test set
y_pred = model.predict(x_test)
# print("This is y_pred: ", y_pred)
# DataFlair - Calculate the accuracy of our model
accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
# DataFlair - Print the accuracy
# print("Accuracy: {:.2f}%".format(accuracy * 100))
# Predicting random files
tar_file = "C:\\Users\\raj.pandey\\Desktop\\speech-emotion-recognition\\Dataset\\newactor\\pls-hold-while-try.wav"
new_feature = extract_feature(tar_file, mfcc=True, chroma=True, mel=True)
data = []
data.append(new_feature)
data = np.array(data)
z_pred = model.predict(data)
print("This is output: ", z_pred)
The dataset provided by the tutorial to train was this: https://drive.google.com/file/d/1wWsrN2Ep7x6lWqOXfr4rpKGYrJhWc8z7/view
The original dataset you can get from here(which isn't working with the program):https://zenodo.org/record/1188976 (Audio_speech_actor one)
When predicting random files, any .wav file with speech in it results in an error. And if you use a text-to-speech converter, get the .wav, and pass it here, it always says "fearful". I have tried converting an .mp3 to .wav to get it to work, but it still gives an error.
Anyone checked yet how can I get it working?
I've just run into the same problem. For anyone reading this who prefers not to delete the stereo files, it is possible to convert them to mono using the command-line tool ffmpeg:
ffmpeg -i stereo_file_name.wav -ac 1 mono_file_name.wav
Link to ffmpeg
Related Stack Overflow Post
from pydub import AudioSegment
# ... inside the for loop of load_data():
        file_name = os.path.basename(file)
        # converting stereo audio to mono (this overwrites the original file)
        sound = AudioSegment.from_wav(file)
        sound = sound.set_channels(1)
        sound.export(file, format="wav")
        emotion = emotions[file_name.split("-")[2]]
        if emotion not in observed_emotions:
            continue
        feature = extract_feature(file, mfcc=True, chroma=True, mel=True)
        x.append(feature)
        y.append(emotion)
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)
This happens because the audio file has 2 audio channels; it must have only one audio channel.
I'm working on the same dataset and got the same error as well. What I did was convert the audio to mono using an online converter https://convertio.co/ by following this website's directions https://videoconvert.minitool.com/video-converter/stereo-to-mono.html (point 1, convertio)
array(['fearful'], dtype='<U7')
The above line is my output too, it predicts it as fearful, might be because of the accuracy (mine is 73.96%, but it varies)
The tutorial trains on the same dataset, except that they have lowered the sample rate. Why isn't it running on the original one?
Even though people have already answered this question: the authors of that tutorial did not mention that the dataset posted on their Google Drive has only mono audio tracks, while the original dataset contains some stereo tracks.
As Arryan Sinha already showed, just use the pydub package and the job is done.
Other than this, I would suggest not paying too much attention to that tutorial, because the classifier's accuracy is, most of the time, about 50%, which is not great. To check properly whether the classifier is good, try printing a confusion matrix. That helps to see how well a classifier is doing.
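As an illustration of the confusion-matrix suggestion, here is a small NumPy-only sketch. The helper `confusion_matrix` and the label lists are hypothetical, made up for this example (they are not results from the tutorial's model); scikit-learn also provides `sklearn.metrics.confusion_matrix` for the same purpose.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    # Rows are true classes, columns are predicted classes
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm

labels = ['calm', 'happy', 'fearful', 'disgust']
# Made-up labels, only to show the mechanics
y_true = ['calm', 'calm', 'happy', 'fearful', 'disgust', 'happy']
y_pred = ['calm', 'fearful', 'happy', 'fearful', 'fearful', 'happy']

cm = confusion_matrix(y_true, y_pred, labels)
accuracy = cm.trace() / cm.sum()
print(cm)
print(accuracy)
```

A column that collects counts from many different rows (here the 'fearful' column) points to a class the model over-predicts, which a single accuracy number would hide.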
Some of the audio files are stereo (and the code needs mono); those are causing the break. Removing those files from the dataset eliminates this error. In load_data(), add a line print(file). It will tell you which files are breaking the code; then just remove them.
def load_data(test_size=0.2):
    x,y=[],[]
    for file in glob.glob("\Actor_*\\*.wav"):
        file_name=os.path.basename(file)
        print(file)
        emotion=emotions[file_name.split("-")[2]]
        ...
I found 4 files that are causing this :
Actor_01/03-01-02-01-01-02-01.wav
Actor_05/03-01-02-01-02-02-05.wav
Actor_20/03-01-03-01-02-01-20.wav
Actor_20/03-01-06-01-01-02-20.wav
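An alternative to deleting or rewriting the stereo files is to downmix in memory before feature extraction. The sketch below assumes the array layout implied by the error message, shape (frames, channels), which is what `sound_file.read()` returns for a stereo file. `to_mono` is a hypothetical helper for this example, and the constant-valued array only stands in for real audio.

```python
import numpy as np

def to_mono(samples):
    # Average the channels of a (frames, channels) array; pass mono through
    samples = np.asarray(samples)
    if samples.ndim == 1:
        return samples
    return samples.mean(axis=1)

# Stand-in for a stereo read with the shape from the error message
stereo = np.zeros((172972, 2), dtype=np.float32)
stereo[:, 0] = 0.5
stereo[:, 1] = -0.25

mono = to_mono(stereo)
print(mono.shape)
```

In extract_feature() this would be applied right after the read, for example `X = to_mono(sound_file.read(dtype="float32"))`, so that librosa always receives a one-dimensional signal.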

Converting a wav file to amplitude and frequency values for textual, time-series analysis

I'm processing wav files for amplitude and frequency analysis with FFT, but I am having trouble getting the data out to csv in a time series format.
Using @Beginner's answer heavily from this post: How to convert a .wav file to a spectrogram in python3, I'm able to get the spectrogram output in an image. I'm trying to simplify that somewhat to get to a text output in csv format, but I'm not seeing how to do so. The outcome I'm hoping to achieve would look something like the following:
time_in_ms, amplitude_in_dB, freq_in_kHz
.001, -115, 1
.002, -110, 2
.003, 20, 200
...
19000, 20, 200
For my testing, I have been using http://soundbible.com/2123-40-Smith-Wesson-8x.html, (Notes: I simplified the wav down to a single channel and removed metadata w/ Audacity to get it to work.)
Heavy props to @Beginner for 99.9% of the following, anything nonsensical is surely mine.
import numpy as np
from matplotlib import pyplot as plt
import scipy.io.wavfile as wav
from numpy.lib import stride_tricks
filepath = "40sw3.wav"
""" short time fourier transform of audio signal """
def stft(sig, frameSize, overlapFac=0.5, window=np.hanning):
    win = window(frameSize)
    hopSize = int(frameSize - np.floor(overlapFac * frameSize))
    # zeros at beginning (thus center of 1st window should be for sample nr. 0)
    samples = np.append(np.zeros(int(np.floor(frameSize/2.0))), sig)
    # cols for windowing
    cols = np.ceil( (len(samples) - frameSize) / float(hopSize)) + 1
    # zeros at end (thus samples can be fully covered by frames)
    samples = np.append(samples, np.zeros(frameSize))
    frames = stride_tricks.as_strided(samples, shape=(int(cols), frameSize), strides=(samples.strides[0]*hopSize, samples.strides[0])).copy()
    frames *= win
    return np.fft.rfft(frames)
""" scale frequency axis logarithmically """
def logscale_spec(spec, sr=44100, factor=20.):
    timebins, freqbins = np.shape(spec)
    scale = np.linspace(0, 1, freqbins) ** factor
    scale *= (freqbins-1)/max(scale)
    scale = np.unique(np.round(scale))
    # create spectrogram with new freq bins
    newspec = np.complex128(np.zeros([timebins, len(scale)]))
    for i in range(0, len(scale)):
        if i == len(scale)-1:
            newspec[:,i] = np.sum(spec[:,int(scale[i]):], axis=1)
        else:
            newspec[:,i] = np.sum(spec[:,int(scale[i]):int(scale[i+1])], axis=1)
    # list center freq of bins
    allfreqs = np.abs(np.fft.fftfreq(freqbins*2, 1./sr)[:freqbins+1])
    freqs = []
    for i in range(0, len(scale)):
        if i == len(scale)-1:
            freqs += [np.mean(allfreqs[int(scale[i]):])]
        else:
            freqs += [np.mean(allfreqs[int(scale[i]):int(scale[i+1])])]
    return newspec, freqs
""" compute spectrogram """
def compute_stft(audiopath, binsize=2**10):
    samplerate, samples = wav.read(audiopath)
    s = stft(samples, binsize)
    sshow, freq = logscale_spec(s, factor=1.0, sr=samplerate)
    ims = 20.*np.log10(np.abs(sshow)/10e-6) # amplitude to decibel
    return ims, samples, samplerate, freq
""" plot spectrogram """
def plot_stft(ims, samples, samplerate, freq, binsize=2**10, plotpath=None, colormap="jet"):
    timebins, freqbins = np.shape(ims)
    plt.figure(figsize=(15, 7.5))
    plt.imshow(np.transpose(ims), origin="lower", aspect="auto", cmap=colormap, interpolation="none")
    plt.colorbar()
    plt.xlabel("time (s)")
    plt.ylabel("frequency (hz)")
    plt.xlim([0, timebins-1])
    plt.ylim([0, freqbins])
    xlocs = np.float32(np.linspace(0, timebins-1, 5))
    plt.xticks(xlocs, ["%.02f" % l for l in ((xlocs*len(samples)/timebins)+(0.5*binsize))/samplerate])
    ylocs = np.int16(np.round(np.linspace(0, freqbins-1, 10)))
    plt.yticks(ylocs, ["%.02f" % freq[i] for i in ylocs])
    if plotpath:
        plt.savefig(plotpath, bbox_inches="tight")
    else:
        plt.show()
    plt.clf()
""" HERE IS WHERE I'm ATTEMPTING TO GET IT OUT TO TXT """
ims, samples, samplerate, freq = compute_stft(filepath)
""" Print lengths """
print('ims len:', len(ims))
print('samples len:', len(samples))
print('samplerate:', samplerate)
print('freq len:', len(freq))
""" Write values to files """
np.savetxt(filepath + '-ims.txt', ims, delimiter=', ', newline='\n', header='ims')
np.savetxt(filepath + '-samples.txt', samples, delimiter=', ', newline='\n', header='samples')
np.savetxt(filepath + '-frequencies.txt', freq, delimiter=', ', newline='\n', header='frequencies')
In terms of values out, the file I'm analyzing is approx 19.1 seconds long and the sample rate is 44100, so I’d expect to have about 842k values for any given variable. But I'm not seeing what I expected. Instead here is what I see:
freqs comes out with just a handful of values (512), and while they appear to be in the correct range for the expected frequencies, they are ordered from least to greatest, not as a time series like I expected. The 512 values, I assume, come from the "fast" in FFT, basically down-sampled...
ims, appears to be amplitude, but values seem too high, although sample size is correct. Should be seeing -50 up to ~240dB.
samples . . . not sure.
In short, can someone advise on how I'd get the FFT out to a text file with time, amp, and freq values for the entire sample set? Is savetxt the correct route, or is there a better way? This code can certainly be used to make a great spectrogram, but how can I just get out the data?
Your output format is too limiting, as the audio spectrum at any interval in time usually contains a range of frequencies. E.g., the FFT of 1024 samples will contain 512 frequency bins for one window of time (one time step), each with an amplitude. If you want a time step of one millisecond, then you will have to offset the window of samples you feed to each STFT frame so that the window is centered at that point in your sample vector. With an FFT about 23 milliseconds long, though, that will involve a high overlap of windows. You could use shorter windows, but the time-frequency trade-off will result in proportionately less frequency resolution.
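One text layout that fits this picture is a "long" table with one row per (time step, frequency bin) pair. The following is a minimal NumPy sketch of that idea, not the code from the question: `stft_to_rows` is a hypothetical helper, the 1 kHz test tone stands in for the wav file, and the dB values are relative to an arbitrary reference of 1.0.

```python
import numpy as np

def stft_to_rows(samples, samplerate, frame_size=1024, hop_size=512):
    # Returns an array with columns: time_s, freq_hz, level_db
    window = np.hanning(frame_size)
    n_frames = 1 + (len(samples) - frame_size) // hop_size
    # rfft of frame_size samples yields frame_size // 2 + 1 frequency bins
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / samplerate)
    rows = []
    for i in range(n_frames):
        start = i * hop_size
        frame = samples[start:start + frame_size] * window
        magnitude = np.abs(np.fft.rfft(frame))
        level_db = 20.0 * np.log10(np.maximum(magnitude, 1e-12))
        time_s = (start + frame_size / 2) / samplerate  # centre of the window
        rows.append(np.column_stack([np.full(freqs.shape, time_s), freqs, level_db]))
    return np.vstack(rows)

# Stand-in signal: 1 second of a 1 kHz tone at 44100 Hz
samplerate = 44100
t = np.arange(samplerate) / samplerate
samples = np.sin(2 * np.pi * 1000.0 * t)

rows = stft_to_rows(samples, samplerate)
peak_freq = rows[np.argmax(rows[:, 2]), 1]
print(rows.shape, peak_freq)
```

The resulting array can be written with `np.savetxt('stft_rows.csv', rows, delimiter=',', header='time_s,freq_hz,level_db', comments='')`. Note that the row count is the number of frames times the number of bins, not the number of audio samples.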

Python Librosa : What is the default frame size used to compute the MFCC features?

Using the Librosa library, I generated the MFCC features of a 1319-second audio file as a 20 x 56829 matrix. The 20 here represents the number of MFCC features (which I can adjust manually). But I don't know how it segmented the audio length into 56829 frames. What frame size does it use to process the audio?
import numpy as np
import matplotlib.pyplot as plt
import librosa
def getPathToGroundtruth(episode):
    """Return path to groundtruth file for episode"""
    pathToGroundtruth = "../../../season01/Audio/" \
        + "Season01.Episode%02d.en.wav" % episode
    return pathToGroundtruth
def getduration(episode):
    pathToAudioFile = getPathToGroundtruth(episode)
    y, sr = librosa.load(pathToAudioFile)
    duration = librosa.get_duration(y=y, sr=sr)
    return duration
def getMFCC(episode):
    filename = getPathToGroundtruth(episode)
    y, sr = librosa.load(filename) # Y gives
    data = librosa.feature.mfcc(y=y, sr=sr)
    return data
data = getMFCC(1)
Short Answer
You can change the length by changing the parameters used in the STFT calculation. The following code will double the size of your output (20 x 113658):
data = librosa.feature.mfcc(y=y, sr=sr, n_fft=1012, hop_length=256, n_mfcc=20)
Long Answer
Librosa's librosa.feature.mfcc() function really just acts as a wrapper around librosa.feature.melspectrogram() (which in turn wraps the librosa.core.stft and librosa.filters.mel functions).
All of the parameters pertaining to segmentation of the audio signal, namely the frame and overlap values, are used in the Mel-scaled power spectrogram function (with other tunable parameters specified for the nested core functions). You specify these parameters as keyword arguments to the librosa.feature.mfcc() function.
All extra **kwargs parameters are fed to librosa.feature.melspectrogram() and subsequently to librosa.filters.mel()
By default, the Mel-scaled power spectrogram window and hop length are the following:
n_fft=2048
hop_length=512
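So assuming you used the default sample rate (sr=22050), the output of your mfcc function makes sense:
output length = (seconds) * (sample rate) / (hop_length)
(1319) * (22050) / (512) ≈ 56804 frames
which is close to the 56829 frames you got; a small difference like this is expected if the file is slightly longer than 1319 seconds.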
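The arithmetic above can be checked with a few lines of Python. This sketch assumes librosa's default centered framing, where the frame count is one more than the number of whole hops in the signal; `expected_frames` is a hypothetical helper for this example, not a librosa function.

```python
sr = 22050          # librosa.load default sample rate
n_fft = 2048        # default frame (window) length in samples
hop_length = 512    # default hop in samples

def expected_frames(n_samples, hop_length):
    # Assumes centered framing: one frame per whole hop, plus one
    return 1 + n_samples // hop_length

frame_ms = 1000.0 * n_fft / sr        # duration of one frame
hop_ms = 1000.0 * hop_length / sr     # time between successive frames
n_frames = expected_frames(1319 * sr, hop_length)

print(round(frame_ms, 2), round(hop_ms, 2), n_frames)
```

So with the defaults each frame covers about 93 ms of audio and a new frame starts about every 23 ms. Under the same assumption, a duration of roughly 1319.55 s rather than exactly 1319 s would give the 56829 columns reported in the question.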
The parameters that you are able to tune, are the following:
Melspectrogram Parameters
-------------------------
y : np.ndarray [shape=(n,)] or None
    audio time-series
sr : number > 0 [scalar]
    sampling rate of `y`
S : np.ndarray [shape=(d, t)]
    power spectrogram
n_fft : int > 0 [scalar]
    length of the FFT window
hop_length : int > 0 [scalar]
    number of samples between successive frames.
    See `librosa.core.stft`
kwargs : additional keyword arguments
    Mel filter bank parameters.
    See `librosa.filters.mel` for details.
If you want to further specify characteristics of the mel filterbank used to define the Mel-scaled power spectrogram, you can tune the following
Mel Frequency Parameters
------------------------
sr : number > 0 [scalar]
    sampling rate of the incoming signal
n_fft : int > 0 [scalar]
    number of FFT components
n_mels : int > 0 [scalar]
    number of Mel bands to generate
fmin : float >= 0 [scalar]
    lowest frequency (in Hz)
fmax : float >= 0 [scalar]
    highest frequency (in Hz).
    If `None`, use `fmax = sr / 2.0`
htk : bool [scalar]
    use HTK formula instead of Slaney
Documentation for Librosa:
librosa.feature.melspectrogram
librosa.filters.mel
librosa.core.stft

Python adaptfilt 2.0 FloatingPointError: invalid value encountered in multiply

I am trying to use the "echo cancel" example in python 3.4 from the library adaptfilt 2.0 which looks like this:
import numpy as np
import adaptfilt as adf
# Get u(n) - this is available on github or pypi in the examples folder
u = np.load('speech.npy')
# Generate received signal d(n) using randomly chosen coefficients
coeffs = np.concatenate(([0.8], np.zeros(8), [-0.7], np.zeros(9),
[0.5], np.zeros(11), [-0.3], np.zeros(3),
[0.1], np.zeros(20), [-0.05]))
d = np.convolve(u, coeffs)
# Add background noise
v = np.random.randn(len(d)) * np.sqrt(5000)
d += v
# Apply adaptive filter
M = 100 # Number of filter taps in adaptive filter
step = 0.1 # Step size
y, e, w = adf.nlms(u, d, M, step, returnCoeffs=True)
# Calculate mean square weight error
mswe = adf.mswe(w, coeffs)
It works as expected. But then I wanted to do the same thing with some real data from a music file, and I get an error:
Traceback (most recent call last):
File "C:/Python34/Lib/site-packages/adaptfilt/echocancel.py", line 86, in <module>
y, e, w = adf.nlms(u, d, M, step, returnCoeffs=True)
File "C:\Python34\Lib\site-packages\adaptfilt\nlms.py", line 149, in nlms
w = leakstep * w + step * normFactor * x * e[n]
FloatingPointError: invalid value encountered in multiply
The code I used is this:
import numpy as np
import adaptfilt as adf
import pyaudio
import wave
np.seterr(all='raise')
p = pyaudio.PyAudio()
stream = p.open(format = p.get_format_from_width(2),
                channels = 1,
                rate = 44100,
                input = True,
                output = True,
                #stream_callback = self.callback
                )
wf = wave.open("XXX.wav", 'rb')
while u != " ":
    data = wf.readframes(1024)
    u = np.fromstring(data, np.int16)
    # Generate received signal d(n) using randomly chosen coefficients
    coeffs = np.concatenate(([0.8], np.zeros(8), [-0.7], np.zeros(9),
                             [0.5], np.zeros(11), [-0.3], np.zeros(3),
                             [0.1], np.zeros(20), [-0.05]))
    coeffs.dtype = np.int16
    d = np.convolve(u, coeffs)
    # Add background noise
    v = np.random.randn(len(d)) * np.sqrt(5000)
    d += v
    # Apply adaptive filter
    M = 100 # Number of filter taps in adaptive filter
    step = 0.1 # Step size
    y, e, w = adf.nlms(u, d, M, step, returnCoeffs=True)
    # Calculate mean square weight error
    mswe = adf.mswe(w, coeffs)
    stream.write(y.astype(np.int16).tostring())
The only difference I see is that the array from "speech.npy" is type of float64 and my array from the wav file is type of int16.
I was also able to get 'adaptfilt 2.0' to work on music data (a mono .wav file with 44.1kHz sampling rate) by casting 'data' to float64.
The adaptive filter took longer to converge, but it worked OK. Below is the Mean squared weight error plot showing the filter converge.
I would also add that to apply this to music, you'll likely need more filter taps (M). The 'speech.npy' array used in the original script is just an array with no sample rate information, but we can assume the sample rate in a speech file is lower than 44.1kHz.
I played the 'speech.npy' back at varying sample rates, and just by listening to what sounds natural, I'm guessing it's in the 10-12kHz range. This means that the "impulse response" stored in 'coeffs', is ~5.7 ms (assuming 10kHz sampling rate). At 44.1kHz, the impulse response is only ~1.3ms. This is very short and unlikely to model the impulse response for real signals.
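The durations quoted above follow directly from the length of 'coeffs'. Here is the arithmetic as a short sketch; the 10 kHz rate is the guess from the answer, not a known property of 'speech.npy', and `duration_ms` is a hypothetical helper.

```python
import numpy as np

# Same coefficients as in the example script
coeffs = np.concatenate(([0.8], np.zeros(8), [-0.7], np.zeros(9),
                         [0.5], np.zeros(11), [-0.3], np.zeros(3),
                         [0.1], np.zeros(20), [-0.05]))

def duration_ms(n_taps, sample_rate):
    # Time span covered by n_taps consecutive samples, in milliseconds
    return 1000.0 * n_taps / sample_rate

n_taps = len(coeffs)
ms_10k = duration_ms(n_taps, 10000)   # guessed speech sample rate
ms_44k = duration_ms(n_taps, 44100)   # music sample rate

print(n_taps, ms_10k, round(ms_44k, 2))
```

The same 57 taps cover less than a quarter of the time at 44.1 kHz, which is why a larger M is likely needed for music.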
This error arises due to numerical issues when using the int16 type for computations somewhere inside the module. Using your example and casting the input data to a floating point type such as
u = np.fromstring(data, np.int16)
u = np.float64(u)
resolves the issue for me.
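A small sketch of why the cast matters. It assumes, without having checked the adaptfilt source, that the filter normalises by something like the dot product of the input with itself; with int16 inputs NumPy keeps that result in int16 and it wraps around silently. The byte string below is a made-up stand-in for one chunk from `wf.readframes(...)`, and `np.frombuffer` is used as the non-deprecated counterpart of `np.fromstring` for binary data.

```python
import numpy as np

# Hypothetical stand-in for one chunk returned by wf.readframes(...)
data = np.array([30000, -30000, 12345], dtype=np.int16).tobytes()

u_int = np.frombuffer(data, dtype=np.int16)   # raw int16 samples
u = u_int.astype(np.float64)                  # cast before any arithmetic

energy_int = np.dot(u_int, u_int)   # stays int16, so the sum wraps around
energy_float = np.dot(u, u)         # exact in float64

print(energy_int, energy_float)
```

The int16 result cannot represent the true energy (about 1.95e9), so any normalisation built on it would be meaningless. Casting to float64 first avoids that.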

Resources