How to judge the similarity between two pieces of audio? - audio

I want to implement something like a singing-score feature that measures how similar two audio recordings are, but I don't know of a simple way to do it. I looked at many GitHub projects, and most of them mention simhash, but I suspect it does not work well for audio similarity, so I am asking for advice here.

One approach is to find the frequencies that are present in an audio segment by using autocorrelation.
There are many implementations of this, e.g. in librosa, scipy and numpy.
Here is a very loose and sloppy implementation without signal-processing libraries, just to give you an understanding of the algorithm:
import matplotlib.pyplot as plt
import math
'''
create a test signal
'''
sr = 44100 #hz
freq = 200 #hz
duration = 0.1 #sec
signal = [math.sin(x * (math.pi * 2 * freq) * (1 / sr)) for x in range(0, int(sr * duration))]
'''
compute the auto correlation at a given frequency
'''
def auto_correlation(signal, freq, sr):
    sample_delay = int(sr / freq)
    score = 0
    for i in range(0, len(signal) - sample_delay):
        score += signal[i] * signal[i + sample_delay]
    return score / len(signal)
'''
iterate over frequency spectrum and test the autocorrelation
'''
start_freq = 150
end_freq = 1000
scores = []
for freq in range(start_freq, end_freq):
    scores.append(auto_correlation(signal, freq, sr))
max_index = scores.index(max(scores))
print("estimated frequency : {} hz".format(max_index + start_freq))
plt.ylabel("correlation")
plt.xlabel("frequency Hz")
plt.plot([x + start_freq for x in range(len(scores))], scores)
You could then iterate through the audio files, find the dominant frequency of each segment and compare the results.
Another possibility is to compute the FFT of each audio file and compare the spectra. librosa is a great library if you're in Python territory.
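As a rough illustration of the FFT-based comparison, here is a minimal sketch that scores two signals by the cosine similarity of their magnitude spectra. The function names, the 4096-sample window and the test tones are assumptions made for this example; a real singing scorer would most likely compare many short frames and track pitch over time.

```python
import numpy as np

def magnitude_spectrum(signal, n_fft=4096):
    """Magnitude spectrum of the first n_fft samples (zero-padded if shorter)."""
    chunk = np.asarray(signal[:n_fft], dtype=float)
    padded = np.zeros(n_fft)
    # A Hann window reduces spectral leakage before the FFT.
    padded[:len(chunk)] = chunk * np.hanning(len(chunk))
    return np.abs(np.fft.rfft(padded))

def spectral_similarity(a, b, n_fft=4096):
    """Cosine similarity between two magnitude spectra, between 0 and 1."""
    spec_a = magnitude_spectrum(a, n_fft)
    spec_b = magnitude_spectrum(b, n_fft)
    denom = np.linalg.norm(spec_a) * np.linalg.norm(spec_b)
    return float(spec_a @ spec_b / denom) if denom > 0 else 0.0

# Hypothetical test signals: two 200 Hz tones (different level and phase) and one 800 Hz tone.
sr = 44100
t = np.arange(4096) / sr
tone_200 = np.sin(2 * np.pi * 200 * t)
tone_200_quiet = 0.3 * np.sin(2 * np.pi * 200 * t + 1.0)
tone_800 = np.sin(2 * np.pi * 800 * t)

print(spectral_similarity(tone_200, tone_200_quiet))  # expected to be near 1
print(spectral_similarity(tone_200, tone_800))        # expected to be near 0
```

Because the score only looks at magnitudes, it ignores loudness and phase, which is usually what you want when comparing a sung note against a reference.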

Related

Generating sawtooth wave with python math module

I know I can do this with scipy or numpy, but I want to do it with just built-in modules in this case. So far, I have come up with this code to generate samples of a sawtooth wave of a specific frequency, at a specific sampling rate (and plot it):
import math
import matplotlib.pyplot as plt
def sawtooth_sample(amplitude, freq, samplerate, i):
    value = math.atan(math.tan(2.0 * math.pi * float(freq) * (float(i) / float(samplerate))))
    return amplitude * value
def plot_samples(num_samples, frequency):
    # Generate the samples
    samples = []
    for i in range(num_samples):
        samples.append(sawtooth_sample(1.0, frequency, 44100, i))
    # Plot the samples
    plt.plot(samples)
    plt.show()
So, then I do this to test it:
# Generate 1000 samples of a 100Hz sawtooth wave, sampled at 44.1KHz
plot_samples(1000, 100)
And, I get this:
Great. Looks very sawtooth-y.
But, when I try with a higher frequency, like this:
# Generate 20 samples of a 10KHz sawtooth wave, sampled at 44.1KHz
plot_samples(20, 10000)
Then I get this:
Not very sawtooth-y any more. Looks more like a triangle wave, with a low harmonic that is sawtooth-shaped. What am I doing wrong and/or missing here?
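One possible reading of the two plots, offered as a sketch rather than a confirmed diagnosis: atan(tan(x)) repeats every pi, not every 2*pi, so the formula above produces a wave at twice the requested frequency. At a requested 10 kHz that leaves only about two samples per period at 44.1 kHz, too few to trace a ramp. The helper samples_per_period is a name introduced here for illustration.

```python
import math

def sawtooth_sample(amplitude, freq, samplerate, i):
    # Same formula as in the question.
    return amplitude * math.atan(math.tan(2.0 * math.pi * freq * i / samplerate))

def samples_per_period(freq, samplerate):
    # atan(tan(x)) has period pi, so the waveform repeats every 1 / (2 * freq) seconds.
    return samplerate / (2.0 * freq)

print(samples_per_period(100, 44100))    # a couple of hundred samples per ramp
print(samples_per_period(10000, 44100))  # only about two samples per ramp

# Numerical check of the period with a sample rate that gives a whole number
# of samples per period (40000 / (2 * 100) = 200).
a = sawtooth_sample(1.0, 100, 40000, 7)
b = sawtooth_sample(1.0, 100, 40000, 207)
print(abs(a - b))  # effectively zero
```

If that reading is right, using math.pi instead of 2.0 * math.pi inside the tangent would give the requested frequency, and frequencies close to half the sample rate would still look jagged however the wave is generated.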

Skewed random sample from Numpy random generator sample (numpy.random.Generator.choice)

I have written a piece of Python code to generate a mixture of normal distributions and I want to sample from it. Since the result is my probability density function, I want the sample to be representative of the original distribution.
So I have developed a function to create the PDF:
import numpy as np
import matplotlib.pyplot as plt
def gaussian_pdf(amplitude, mean, std, sample_int):
    coeff = (amplitude / std) / np.sqrt(2 * np.pi)
    if len(amplitude) > 1:
        # create mixture distribution
        # get distribution support
        absciss_array = np.linspace(np.min(mean) - 4 * std[np.argmin(mean)],
                                    np.max(mean) + 4 * std[np.argmax(mean)],
                                    sample_int)
        normal_array = np.zeros(len(absciss_array))
        for index in range(0, len(amplitude)):
            # normal density: exp(-0.5 * z**2) with z = (x - mean) / std
            normal_array += coeff[index] * np.exp(-0.5 * ((absciss_array - mean[index]) / std[index]) ** 2)
    else:
        # create simple gaussian distribution
        absciss_array = np.linspace(mean - 4 * std, mean + 4 * std, sample_int)
        normal_array = coeff * np.exp(-0.5 * ((absciss_array - mean) / std) ** 2)
    # normalize so the values sum to 1 and can be used as probabilities
    return np.ascontiguousarray(normal_array / np.sum(normal_array))
And I have tested the sampling with the main part of the script:
def main():
    amplitude = np.asarray([1, 2, 1])
    mean = np.asarray([0.5, 1, 2.5])
    std = np.asarray([0.1, 0.2, 0.3])
    no_sample = 10000
    # create mixture gaussian array
    gaussian_array = gaussian_pdf(amplitude, mean, std, no_sample)
    # plot data
    fig, ax = plt.subplots()
    absciss = np.linspace(np.min(gaussian_array), np.max(gaussian_array), no_sample)
    ax.plot(absciss, gaussian_array)
    # create random generator to sample from distribution
    rng = np.random.default_rng(424242)
    # sample from distribution
    sample = rng.choice(a=gaussian_array, size=100, replace=True, p=gaussian_array)
    # plot results
    ax.plot(sample, np.full_like(sample, -0.00001), '|k', markeredgewidth=1)
    plt.show()
    return None
I then have the result :
You can see with the dark lines the samples that have been extracted from the distribution. The problem is that, even if I specify to use the probability array in the numpy function, the sampling is skewed towards the end of the distribution. I have tried several times with other seeds but the result does not change...
I expect to have more samples in the area where the probability density is greater...
Would someone please help me ? Am I missing something here ?
Thanks in advance.
Well, actually the answer was to sample from a uniformly spaced array of values (a=absciss) and to use the PDF array only as the probabilities p. Thanks to @amzon-ex for pointing it out.
The code is then:
absciss = np.linspace(np.min(gaussian_array), np.max(gaussian_array), no_sample)
sample_other = rng.choice(a=absciss, size=100, replace=True, p=gaussian_array)
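To make the fix concrete, here is a self-contained sketch of the same idea: rng.choice draws from the grid of x values, and the normalized PDF is passed only as the weights. The helper mixture_pdf and the variable names are assumptions for this example, and the grid spans the support of the mixture (means plus or minus four standard deviations) rather than the range of PDF values.

```python
import numpy as np

def mixture_pdf(x, amplitude, mean, std):
    """Weighted sum of normal densities evaluated on the grid x."""
    x = np.asarray(x, dtype=float)[:, None]           # shape (n_points, 1)
    coeff = amplitude / (std * np.sqrt(2 * np.pi))    # shape (n_components,)
    density = coeff * np.exp(-0.5 * ((x - mean) / std) ** 2)
    return density.sum(axis=1)

amplitude = np.array([1.0, 2.0, 1.0])
mean = np.array([0.5, 1.0, 2.5])
std = np.array([0.1, 0.2, 0.3])

# Uniformly spaced x values covering the support of the mixture.
x_grid = np.linspace(mean.min() - 4 * std[mean.argmin()],
                     mean.max() + 4 * std[mean.argmax()],
                     10_000)

# Probabilities for each grid point; they must sum to 1 for rng.choice.
weights = mixture_pdf(x_grid, amplitude, mean, std)
weights /= weights.sum()

rng = np.random.default_rng(424242)
# Draw x values (not PDF values), weighted by the PDF.
sample = rng.choice(a=x_grid, size=5000, replace=True, p=weights)

print(sample.mean())  # should land near the mixture mean of 1.25
```

The original call passed the PDF values themselves as `a`, so the draws were probabilities rather than positions on the x axis, which would explain why the tick marks did not follow the shape of the curve.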

FFT plot of raw PCM comes wrong for higher frequency in python

Here I am using numpy's fft function to plot the FFT of a PCM wave generated from a 10000 Hz sine wave, but the amplitude in the plot is wrong.
The frequency comes out correctly using the fftfreq function, which I print to the console. My Python code is here.
import numpy as np
import matplotlib.pyplot as plt
frate = 44100
filename = 'Sine_10000Hz.bin' #signed16 bit PCM of a 10000Hz sine wave
f = open('Sine_10000Hz.bin','rb')
y = np.fromfile(f,dtype='int16') #Extract the signed 16 bit PCM value of 10000Hz Sine wave
f.close()
####### Spectral Analysis #########
fft_value = np.fft.fft(y)
freqs = np.fft.fftfreq(len(fft_value)) # frequencies associated with the coefficients:
print("freqs.min(), freqs.max()")
print(freqs.min(), freqs.max())
idx = np.argmax(np.abs(fft_value)) # Find the peak in the coefficients
freq = freqs[idx]
freq_in_hertz = abs(freq * frate)
print("\n\n\n\n\n\nfreq_in_hertz")
print(freq_in_hertz)
for i in range(2):
    print("Value at index {}:\t{}".format(i, fft_value[i + 1]), "\nValue at index {}:\t{}".format(fft_value.size -1 - i, fft_value[-1 - i]))
#####
n_sa = 8 * int(freq_in_hertz)
t_fft = np.linspace(0, 1, n_sa)
T = t_fft[1] - t_fft[0] # sampling interval
N = n_sa #Here it is n_sample
print("\nN value=")
print(N)
# 1/T = frequency
f = np.linspace(0, 1 / T, N)
plt.ylabel("Amplitude")
plt.xlabel("Frequency [Hz]")
plt.xlim(0,15000)
# 2 / N is a normalization factor Here second half of the sequence gives us no new information that the half of the FFT sequence is the output we need.
plt.bar(f[:N // 2], np.abs(fft_value)[:N // 2] * 2 / N, width=15,color="red")
Console output (I am pasting only the relevant prints here):
freqs.min(), freqs.max()
-0.5 0.49997732426303854
freq_in_hertz
10000.0
Value at index 0: (19.949569768991054-17.456031216294324j)
Value at index 44099: (19.949569768991157+17.45603121629439j)
Value at index 1: (9.216783424692835-13.477631008179145j)
Value at index 44098: (9.216783424692792+13.477631008179262j)
N value=
80000
The frequency extraction is correct, but something I am doing in the plot is wrong and I don't know what.
Update:
When I change the multiplication factor in the line n_sa = 10 * int(freq_in_hertz) from 10 to 5, I get a correct plot, but I don't understand whether that is actually correct.
In the line plt.xlim(0,15000), if I increase the maximum to 20000, it again does not plot. Up to 15000 it plots correctly.
I generated Sine_10000Hz.bin with Audacity: I created a 10000 Hz sine wave of 1 s duration at a sampling rate of 44100 Hz, then exported it as headerless signed 16-bit audio (i.e. raw PCM). I was able to regenerate the sine wave from the file using this script. I also want to calculate its FFT, so I expect a peak at 10000 Hz with amplitude 32767. As you can see, I changed the multiplication factor to 8 instead of 10 in the line n_sa = 8 * int(freq_in_hertz), and then it worked, but the amplitude shown is incorrect. I will attach my new figure here.
I'm not sure exactly what you are trying to do, but my suspicion is that the Sine_10000Hz.bin file isn't what you think it is.
Is it possible it contains more than one channel (left & right)?
Is it really signed 16-bit integers?
It's not hard to create a 10kHz sine wave in 16 bit integers in numpy.
import numpy as np
import matplotlib.pyplot as plt
n_samples = 2000
f_signal = 10000 # (Hz) Signal Frequency
f_sample = 44100 # (Hz) Sample Rate
amplitude = 2**3 # Arbitrary. Must be > 1. Should be > 2. Larger makes FFT results better
time = np.arange(n_samples) / f_sample # sample times
# The signal
y = (np.sin(time * f_signal * 2 * np.pi) * amplitude).astype('int16')
If you plot 30 points of the signal you can see there are about 5 points per cycle.
plt.plot(time[:30], y[:30], marker='o')
plt.xlabel('Time (s)')
plt.yticks([]); # Amplitude value is artificial. hide it
If you plot 30 samples of the data from Sine_10000Hz.bin does it have about 5 points per cycle?
This is my attempt to recreate the FFT work as I understand it.
fft_value = np.fft.fft(y) # compute the FFT
freqs = np.fft.fftfreq(len(fft_value)) * f_sample # frequencies for each FFT bin
N = len(y)
plt.plot(freqs[:N//2], np.abs(fft_value[:N//2]))
plt.yscale('log')
plt.ylabel("Amplitude")
plt.xlabel("Frequency [Hz]")
I get the following plot
The y-axis of this plot is on a log scale. Notice that the amplitude of the peak is in the thousands. The amplitude of most of the rest of the data points are around 100.
idx_max = np.argmax(np.abs(fft_value)) # Find the peak in the coefficients
idx_min = np.argmin(np.abs(fft_value)) # Find the smallest coefficient
print(f'idx_max = {idx_max}, idx_min = {idx_min}')
print(f'f_max = {freqs[idx_max]}, f_min = {freqs[idx_min]}')
print(f'fft_value[idx_max] {fft_value[idx_max]}')
print(f'fft_value[idx_min] {fft_value[idx_min]}')
produces:
idx_max = 1546, idx_min = 1738
f_max = -10010.7, f_min = -5777.1
fft_value[idx_max] (-4733.232076236707+219.11718299533203j)
fft_value[idx_min] (-0.17017443966211232+0.9557200531465061j)
I'm adding a link to a script I've built that outputs the FFT with the ACTUAL amplitude (for real signals, e.g. your signal). Have a go and see if it works:
dt = 1/frate in your setup.
https://stackoverflow.com/a/53925342/4879610
After a lot of homework I was able to find my issue. As I mentioned in the update above, the cause was that the number of samples I used was wrong.
I changed these two lines in the code
n_sa = 8 * int(freq_in_hertz)
t_fft = np.linspace(0, 1, n_sa)
to
n_sa = y.size # number of samples, taken directly from the raw 16-bit data
t_fft = np.arange(n_sa)/frate # sample times: divide each sample index by the sampling rate
This solved my issue.
My spectral output is
Special thanks to @meta4 and @YoniChechik for giving me some suggestions.
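The fix can be checked without the original file. The sketch below builds a synthetic stand-in for Sine_10000Hz.bin (an assumption: one second of a full-scale 10 kHz sine at 44100 Hz, stored as int16) and derives both the frequency axis and the 2/N scaling from the actual number of samples.

```python
import numpy as np

frate = 44100       # sampling rate in Hz
f_signal = 10000    # tone frequency in Hz
amplitude = 32767   # full scale for signed 16-bit PCM

# Synthetic stand-in for the raw PCM file.
t = np.arange(frate) / frate
y = np.round(amplitude * np.sin(2 * np.pi * f_signal * t)).astype(np.int16)

n = y.size                                # number of samples drives everything below
spectrum = np.fft.rfft(y)                 # one-sided FFT of a real signal
freqs = np.fft.rfftfreq(n, d=1 / frate)   # bin frequencies in Hz

# 2 / n recovers the sine amplitude for every bin except DC and Nyquist,
# which would need 1 / n instead.
magnitude = np.abs(spectrum) * 2 / n

peak = np.argmax(magnitude)
print(freqs[peak], magnitude[peak])  # roughly 10000 Hz and 32767
```

The peak only reaches the full amplitude here because 10 kHz falls exactly on an FFT bin for a one-second signal; for other lengths or frequencies the energy spreads over neighbouring bins and the peak reads lower.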

Converting a wav file to amplitude and frequency values for textual, time-series analysis

I'm processing wav files for amplitude and frequency analysis with FFT, but I am having trouble getting the data out to csv in a time series format.
Leaning heavily on @Beginner's answer from this post: How to convert a .wav file to a spectrogram in python3, I'm able to get the spectrogram output as an image. I'm trying to simplify that somewhat to get text output in CSV format, but I'm not seeing how to do so. The outcome I'm hoping to achieve would look something like the following:
time_in_ms, amplitude_in_dB, freq_in_kHz
.001, -115, 1
.002, -110, 2
.003, 20, 200
...
19000, 20, 200
For my testing, I have been using http://soundbible.com/2123-40-Smith-Wesson-8x.html, (Notes: I simplified the wav down to a single channel and removed metadata w/ Audacity to get it to work.)
Heavy props to @Beginner for 99.9% of the following, anything nonsensical is surely mine.
import numpy as np
from matplotlib import pyplot as plt
import scipy.io.wavfile as wav
from numpy.lib import stride_tricks
filepath = "40sw3.wav"
""" short time fourier transform of audio signal """
def stft(sig, frameSize, overlapFac=0.5, window=np.hanning):
    win = window(frameSize)
    hopSize = int(frameSize - np.floor(overlapFac * frameSize))
    # zeros at beginning (thus center of 1st window should be for sample nr. 0)
    samples = np.append(np.zeros(int(np.floor(frameSize/2.0))), sig)
    # cols for windowing
    cols = np.ceil( (len(samples) - frameSize) / float(hopSize)) + 1
    # zeros at end (thus samples can be fully covered by frames)
    samples = np.append(samples, np.zeros(frameSize))
    frames = stride_tricks.as_strided(samples, shape=(int(cols), frameSize), strides=(samples.strides[0]*hopSize, samples.strides[0])).copy()
    frames *= win
    return np.fft.rfft(frames)
""" scale frequency axis logarithmically """
def logscale_spec(spec, sr=44100, factor=20.):
    timebins, freqbins = np.shape(spec)
    scale = np.linspace(0, 1, freqbins) ** factor
    scale *= (freqbins-1)/max(scale)
    scale = np.unique(np.round(scale))
    # create spectrogram with new freq bins
    newspec = np.complex128(np.zeros([timebins, len(scale)]))
    for i in range(0, len(scale)):
        if i == len(scale)-1:
            newspec[:,i] = np.sum(spec[:,int(scale[i]):], axis=1)
        else:
            newspec[:,i] = np.sum(spec[:,int(scale[i]):int(scale[i+1])], axis=1)
    # list center freq of bins
    allfreqs = np.abs(np.fft.fftfreq(freqbins*2, 1./sr)[:freqbins+1])
    freqs = []
    for i in range(0, len(scale)):
        if i == len(scale)-1:
            freqs += [np.mean(allfreqs[int(scale[i]):])]
        else:
            freqs += [np.mean(allfreqs[int(scale[i]):int(scale[i+1])])]
    return newspec, freqs
""" compute spectrogram """
def compute_stft(audiopath, binsize=2**10):
    samplerate, samples = wav.read(audiopath)
    s = stft(samples, binsize)
    sshow, freq = logscale_spec(s, factor=1.0, sr=samplerate)
    ims = 20.*np.log10(np.abs(sshow)/10e-6) # amplitude to decibel
    return ims, samples, samplerate, freq
""" plot spectrogram """
def plot_stft(ims, samples, samplerate, freq, binsize=2**10, plotpath=None, colormap="jet"):
    timebins, freqbins = np.shape(ims)
    plt.figure(figsize=(15, 7.5))
    plt.imshow(np.transpose(ims), origin="lower", aspect="auto", cmap=colormap, interpolation="none")
    plt.colorbar()
    plt.xlabel("time (s)")
    plt.ylabel("frequency (hz)")
    plt.xlim([0, timebins-1])
    plt.ylim([0, freqbins])
    xlocs = np.float32(np.linspace(0, timebins-1, 5))
    plt.xticks(xlocs, ["%.02f" % l for l in ((xlocs*len(samples)/timebins)+(0.5*binsize))/samplerate])
    ylocs = np.int16(np.round(np.linspace(0, freqbins-1, 10)))
    plt.yticks(ylocs, ["%.02f" % freq[i] for i in ylocs])
    if plotpath:
        plt.savefig(plotpath, bbox_inches="tight")
    else:
        plt.show()
    plt.clf()
""" HERE IS WHERE I'm ATTEMPTING TO GET IT OUT TO TXT """
ims, samples, samplerate, freq = compute_stft(filepath)
""" Print lengths """
print('ims len:', len(ims))
print('samples len:', len(samples))
print('samplerate:', samplerate)
print('freq len:', len(freq))
""" Write values to files """
np.savetxt(filepath + '-ims.txt', ims, delimiter=', ', newline='\n', header='ims')
np.savetxt(filepath + '-samples.txt', samples, delimiter=', ', newline='\n', header='samples')
np.savetxt(filepath + '-frequencies.txt', freq, delimiter=', ', newline='\n', header='frequencies')
In terms of values out, the file I'm analyzing is approx 19.1 seconds long and the sample rate is 44100, so I’d expect to have about 842k values for any given variable. But I'm not seeing what I expected. Instead here is what I see:
freqs comes out with just a handful of values (512), and while they appear to be in the correct range for the expected frequencies, they are ordered from least to greatest, not as a time series like I expected. I assume the 512 values come from the "fast" in FFT, basically down-sampled...
ims appears to be amplitude, but the values seem too high, although the sample size is correct. I should be seeing -50 up to ~240 dB.
samples: not sure.
In short, can someone advise on how I'd get the FFT out to a text file with time, amp, and freq values for the entire sample set? Is savetxt the correct route, or is there a better way? This code can certainly be used to make a great spectrogram, but how can I just get out the data?
Your output format is too limiting, as the audio spectrum at any interval in time usually contains a range of frequencies. For example, the FFT of 1024 samples will contain 512 frequency bins for one window of time (one time step), each with its own amplitude. If you want a time step of one millisecond, you will have to offset the window of samples you feed to each STFT so that it is centered at that point in your sample vector. With an FFT window about 23 milliseconds long, that will involve a high overlap between windows. You could use shorter windows, but the time-frequency trade-off will result in proportionately less frequency resolution.
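One way to reconcile this with a CSV export is a "long" layout with one row per (time step, frequency bin) pair. The sketch below is an assumption about what such an export could look like, not the code from the question: the function names, the column names and the synthetic 1 kHz tone that stands in for the WAV file are all made up for illustration, and the dB values use an arbitrary reference.

```python
import csv
import io
import numpy as np

def stft_rows(samples, samplerate, frame_size=1024, hop_size=512):
    """Yield (time_s, freq_hz, level_db) for every frame and frequency bin."""
    samples = np.asarray(samples, dtype=float)
    window = np.hanning(frame_size)
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / samplerate)
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        frame = samples[start:start + frame_size] * window
        magnitude = np.abs(np.fft.rfft(frame))
        # Relative level in dB; the floor avoids log10(0) on silent bins.
        level_db = 20.0 * np.log10(np.maximum(magnitude, 1e-12))
        time_s = (start + frame_size / 2) / samplerate  # center of the frame
        for freq_hz, db in zip(freqs, level_db):
            yield time_s, freq_hz, db

def write_csv(rows, handle):
    """Write the rows to an open text handle as CSV with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["time_s", "freq_hz", "level_db"])
    for time_s, freq_hz, db in rows:
        writer.writerow([f"{time_s:.6f}", f"{freq_hz:.1f}", f"{db:.2f}"])

# Synthetic one-second 1 kHz tone in place of the WAV file.
samplerate = 44100
t = np.arange(samplerate) / samplerate
tone = np.sin(2 * np.pi * 1000 * t)

buffer = io.StringIO()  # swap for open("out.csv", "w", newline="") to write a file
write_csv(stft_rows(tone, samplerate), buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])
print(len(lines) - 1, "data rows")
```

The row count is the number of frames times the number of frequency bins, so a smaller hop_size gives finer time steps at the cost of a much larger file.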

Python Librosa : What is the default frame size used to compute the MFCC features?

Using the Librosa library, I generated the MFCC features of a 1319-second audio file as a 20 x 56829 matrix. The 20 is the number of MFCC features (which I can adjust manually), but I don't know how the audio length was segmented into 56829 frames. What frame size does it use to process the audio?
import numpy as np
import matplotlib.pyplot as plt
import librosa
def getPathToGroundtruth(episode):
    """Return path to groundtruth file for episode"""
    pathToGroundtruth = "../../../season01/Audio/" \
        + "Season01.Episode%02d.en.wav" % episode
    return pathToGroundtruth
def getduration(episode):
    pathToAudioFile = getPathToGroundtruth(episode)
    y, sr = librosa.load(pathToAudioFile)
    duration = librosa.get_duration(y=y, sr=sr)
    return duration
def getMFCC(episode):
    filename = getPathToGroundtruth(episode)
    y, sr = librosa.load(filename) # Y gives
    data = librosa.feature.mfcc(y=y, sr=sr)
    return data
data = getMFCC(1)
Short Answer
You can change the output length by changing the parameters used in the STFT calculation. The following code will double the size of your output (20 x 113658):
data = librosa.feature.mfcc(y=y, sr=sr, n_fft=1012, hop_length=256, n_mfcc=20)
Long Answer
Librosa's librosa.feature.mfcc() function really just acts as a wrapper around librosa's librosa.feature.melspectrogram() function (which is itself a wrapper around the librosa.core.stft and librosa.filters.mel functions).
All of the parameters pertaining to segmentation of the audio signal, namely the frame and overlap values, are used in the Mel-scaled power spectrogram function (with other tunable parameters specified for the nested core functions). You specify these parameters as keyword arguments to the librosa.feature.mfcc() function.
All extra **kwargs parameters are fed to librosa.feature.melspectrogram() and subsequently to librosa.filters.mel().
By default, the Mel-scaled power spectrogram window and hop length are the following:
n_fft=2048
hop_length=512
So assuming you used the default sample rate (sr=22050), the output of your mfcc function makes sense:
output length = (seconds) * (sample rate) / (hop_length)
(1319) * (22050) / (512) ≈ 56804 frames, which is close to the 56829 columns you got
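The arithmetic can be sketched in a few lines. The helper mfcc_frame_count is a name made up for this example, and it assumes librosa's default centered framing, which yields 1 + n_samples // hop_length frames; the small gap to the observed 56829 would then be consistent with the file being slightly longer than 1319 seconds.

```python
def mfcc_frame_count(n_samples, hop_length=512):
    # Assumption: centered framing (librosa's default), one frame per hop plus one.
    return 1 + n_samples // hop_length

sr = 22050               # librosa.load's default sample rate
n_samples = 1319 * sr    # duration rounded to whole seconds

print(mfcc_frame_count(n_samples))                  # frames with the default hop of 512
print(mfcc_frame_count(n_samples, hop_length=256))  # halving the hop roughly doubles the frames

# Working backwards from the observed frame count to an implied duration.
observed_frames = 56829
implied_seconds = (observed_frames - 1) * 512 / sr
print(implied_seconds)  # a little over 1319 s
```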
The parameters that you are able to tune, are the following:
Melspectrogram Parameters
-------------------------
y : np.ndarray [shape=(n,)] or None
    audio time-series
sr : number > 0 [scalar]
    sampling rate of `y`
S : np.ndarray [shape=(d, t)]
    power spectrogram
n_fft : int > 0 [scalar]
    length of the FFT window
hop_length : int > 0 [scalar]
    number of samples between successive frames.
    See `librosa.core.stft`
kwargs : additional keyword arguments
    Mel filter bank parameters.
    See `librosa.filters.mel` for details.
If you want to further specify characteristics of the mel filterbank used to define the Mel-scaled power spectrogram, you can tune the following
Mel Frequency Parameters
------------------------
sr : number > 0 [scalar]
    sampling rate of the incoming signal
n_fft : int > 0 [scalar]
    number of FFT components
n_mels : int > 0 [scalar]
    number of Mel bands to generate
fmin : float >= 0 [scalar]
    lowest frequency (in Hz)
fmax : float >= 0 [scalar]
    highest frequency (in Hz).
    If `None`, use `fmax = sr / 2.0`
htk : bool [scalar]
    use HTK formula instead of Slaney
Documentation for Librosa:
librosa.feature.melspectrogram
librosa.filters.mel
librosa.core.stft

Resources