Applying CNN to Short-time Fourier transform? - python-3.x

So I have code that returns a short-time Fourier transform (STFT) spectrum of a .wav file. I want to be able to take, say, a millisecond of the spectrum and train a CNN on it.
I'm not quite sure how to implement that. I know how to format image data to feed into the CNN and how to train the network, but I'm lost on how to take the FFT data and divide it into small time frames.
The STFT code (sorry for the long listing):
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from skimage import util

rate, audio = wavfile.read('scale_a_lydian.wav')
audio = np.mean(audio, axis=1)
N = audio.shape[0]
L = N / rate
M = 1024
# Audio is 44.1 kHz, i.e. ~44100 samples / second
# the window function takes 1024 samples, or ~0.02 seconds of audio (1024 / 44100 = ~0.02 s),
# and the window shifts 100 samples each time,
# so there end up being (total_samples - 1024) // 100 + 1 slices in total
slices = util.view_as_windows(audio, window_shape=(M,), step=100) #slices overlap
win = np.hanning(M + 1)[:-1]
slices = slices * win #each slice is 1024 samples (0.02 seconds of audio)
slices = slices.T #transpose matrix -> make each column 1024 samples (ie. make each column one slice)
spectrum = np.fft.fft(slices, axis=0)[:M // 2 + 1:-1] #FFT each slice, then keep 510 bins: the upper half of the FFT reversed, which mirrors the positive-frequency half for a real signal
spectrum = np.abs(spectrum) #take absolute value of slices
# take SampleSize * Slices
# transpose into slices * samplesize
# Take the first row -> slice * samplesize
# transpose back to samplesize * slice (essentially get 0.01s of spectrum)
spectrum2 = spectrum.T
spectrum2 = spectrum2[:1]
spectrum2 = spectrum2.T
The following outputs an FFT spectrum:
N = spectrum2.shape[0]
L = N / rate
f, ax = plt.subplots(figsize=(4.8, 2.4))
S = np.abs(spectrum2)
S = 20 * np.log10(S / np.max(S))
ax.imshow(S, origin='lower', cmap='viridis',
extent=(0, L, 0, rate / 2 / 1000))
ax.axis('tight')
ax.set_ylabel('Frequency [kHz]')
ax.set_xlabel('Time [s]');
plt.show()
(Feel free to correct any theoretical errors that I put in the comments)
So, from what I understand, I have a numpy array (spectrum) with each column being a slice of 510 samples (cut roughly in half, because half of each FFT slice is redundant for a real signal), and each row corresponding to one frequency bin?
EDIT: So the above code theoretically outputs 0.01 s of audio as a spectrum, which is exactly what I need. Is this true, or am I not thinking about it right?
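As an illustration, here is a minimal sketch of how the spectrogram could be cut into fixed-width time windows for a CNN, assuming the spectrum array built above; the window width and names like frames_per_window and cnn_input are illustrative, not part of the original code:
# each column of `spectrum` is one 1024-sample window, and each new column
# advances by 100 samples, i.e. ~0.0023 s of audio per column
frames_per_window = 5                   # group a few columns per training example (tune to taste)
n_windows = spectrum.shape[1] // frames_per_window
cnn_input = np.stack([spectrum[:, i * frames_per_window:(i + 1) * frames_per_window]
                      for i in range(n_windows)])
cnn_input = cnn_input[..., np.newaxis]  # shape (n_windows, 510, frames_per_window, 1) for a 2-D CNN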

Related

How to fix the issue of plotting a 2D sine wave in python

I want to generate a 2D travelling sine wave. To do this, I've set the parameters of the plane wave and generated the wave at a given time instant as follows:
import numpy as np
import random
import matplotlib.pyplot as plt
f = 10 # frequency
fs = 100 # sample frequency
Ts = 1/fs # sample period
t = np.arange(0,0.5, Ts) # time index
c = 50 # speed of wave
w = 2*np.pi *f # angular frequency
k = w/c # wave number
resolution = 0.02
x = np.arange(-5, 5, resolution)
y = np.arange(-5, 5, resolution)
dx = np.array(x); M = len(dx)
dy = np.array(y); N = len(dy)
[xx, yy] = np.meshgrid(x, y);
theta = np.pi / 4 # direction of propagation
kx = k* np.cos(theta)
ky = k * np.sin(theta)
So, the plane wave would be
plane_wave = np.sin(kx * xx + ky * yy - w * t[1])
plt.figure();
plt.imshow(plane_wave,cmap='seismic',origin='lower', aspect='auto')
That gives a smooth plane wave, as shown in the attached figure. The variation of the sine wave with time, plotted with plt.figure(); plt.plot(plane_wave[2,:]), is shown in a second figure.
However, when I append plane waves at different time instants, a discontinuity appears (figures 03 and 04), and I want to get rid of this problem.
I'm new to Python and any help will be highly appreciated. Thanks in advance.
arr = []
for count in range(len(t)):
    p = np.sin(kx * xx + ky * yy - w * t[count]); # plane wave
    arr.append(p)
arr = np.array(arr)
print(arr.shape)
pp,q,r = arr.shape
sig = np.reshape(arr, (-1, r))
print('The signal shape is :', sig.shape)
plt.figure(); plt.imshow(sig.transpose(),cmap='seismic',origin='lower', aspect='auto')
plt.xlabel('X'); plt.ylabel('Y')
plt.figure(); plt.plot(sig[2,:])
This is not so much a programming problem; it has more to do with the fact that you are using the physical quantities in a somewhat unusual way. Your plots are absolutely fine and correct.
What you seem to have misunderstood is that you are dealing with a 2D problem with a third dimension added for time. This is by no means wrong, but if you try to append the snapshots of the 2D wave side by side, you are using the x spatial dimension (again) to represent temporal variations, which leads to an inconsistent use of that coordinate axis. To make this more intuitive, consider the two time instants separately. Doesn't it match your intuition that all points on the 2D plane must have different amplitudes at the two instants (unless, of course, time has progressed by a multiple of the period of the wave)? This is indeed the case. Thus, when you try to append the two snapshots, a discontinuity appears. To avoid that you would have to either use a time step equal to one period, which I believe is of no practical use, or a constant time step chosen so that the phase of the wave on the left border of the image at the current time equals the phase of the wave on the right border of the image at the previous time step. Even then, this will always be one constant time step, alternating the phase on the edges of the image between those two values.
The same applies to the 1D case because you use the two coordinate axes to represent the wave (x is the x spatial dimension and y is used to represent the amplitude). This is what can be seen in your last plot.
Now, what would be the solution you may ask. The solution is provided by simple inspection of the mathematical formula of the wave function. In 2D, it is a scalar function of three variables (that is, takes as input three values and outputs one) and so you need at least four dimensions to represent it. Alas, we can't perceive a fourth spatial dimension, but this is not a problem in your case as the output of the function is represented with colors. Then there are three dimensions that could be used to represent the temporal evolution of your function. All you have to do is to create a 3D array where the third dimension represents time and all 2D snapshots will be stored in the first two dimensions.
When it comes to visual representation of the results you could either use some kind of waterfall plots where the z-axis will represent time or utilize the fourth dimension we can perceive, time that is, to create an animation of the evolution of the wave.
I am not very familiar with Python, so I will only provide a generic, naive implementation. I am sure a lot of people here could simplify and/or optimise the following snippet. I assume that everything in your first two blocks of code is available, so changes only have to be made in the last block you present.
arr = np.zeros((xx.shape[0], xx.shape[1], len(t))) # initialise the array that holds the temporal evolution of the snapshots
for i in range(len(t)):
    arr[:, :, i] = np.sin(kx * xx + ky * yy - w * t[i])
# Below you can plot the figures with any function you prefer or make an animation out of it
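For instance, a minimal animation sketch, assuming the arr and t arrays from the snippet above and using matplotlib.animation.FuncAnimation (everything here is illustrative rather than the only way to do it):
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

fig, ax = plt.subplots()
im = ax.imshow(arr[:, :, 0], cmap='seismic', origin='lower', aspect='auto')

def update(i):
    # show the i-th snapshot instead of stacking it along the x axis
    im.set_data(arr[:, :, i])
    ax.set_title('t = %.3f s' % t[i])
    return [im]

anim = FuncAnimation(fig, update, frames=len(t), interval=50)
plt.show()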

Normalization - Signal with different sampling rates

I am trying to solve a signal processing problem. I have a signal like the one in the attached plot.
My job is to use the FFT to plot the frequency spectrum of the signal. This is what I have coded so far:
import re
import numpy as np

def Extract_Data(filepath, pattern):
    data = []
    with open(filepath) as file:
        for line in file:
            m = re.match(pattern, line)
            if m:
                data.append(list(map(float, m.groups())))
    #print(data)
    data = np.asarray(data)
    #Convert lists to arrays
    variable_array = data[:,1]
    time_array = data[:,0]
    return variable_array, time_array

def analysis_FFT(filepath, pattern):
    signal, time = Extract_Data(filepath, pattern)
    signal_FFT = np.fft.fft(signal)
    N = len(signal_FFT)
    T = time[-1]
    #Frequencies
    signal_freq = np.fft.fftfreq(N, d = T/N)
    #Shift the frequencies
    signal_freq_shift = np.fft.fftshift(signal_freq)
    #Real and imaginary parts of the signal
    signal_real = signal_FFT.real
    signal_imag = signal_FFT.imag
    signal_abs = pow(signal_real, 2) + pow(signal_imag, 2)
    #Shift the signal
    signal_shift = np.fft.fftshift(signal_FFT)
    #signal_shift = np.fft.fftshift(signal_FFT)
    #Spectrum
    signal_spectrum = np.abs(signal_shift)
What I'm really concerned about is the sampling rate. If you look at the plot, it appears that the sampling rate of the first ~0.002 s is not the same as in the rest of the signal, so I'm thinking maybe I need to normalize the signal.
However, when I use np.fft.fftfreq(N, d = T/N), it seems like np.fft.fftfreq assumes the signal has the same sampling rate throughout the domain. So I'm not sure how I could normalize the signal with np.fft. Any suggestions?
Cheers.
This is what I got when I plotted the shifted frequency [Hz] against the shifted signal.
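One simple way to handle the uneven sampling before calling np.fft.fft would be to resample onto a uniform time grid first. A minimal sketch, assuming the signal and time arrays returned by Extract_Data above, and using np.interp purely for illustration:
uniform_t = np.linspace(time[0], time[-1], len(time))   # uniform grid over the same span
uniform_signal = np.interp(uniform_t, time, signal)     # linear resampling (illustrative choice)
spec = np.abs(np.fft.fftshift(np.fft.fft(uniform_signal)))
freqs = np.fft.fftshift(np.fft.fftfreq(len(uniform_t), d=uniform_t[1] - uniform_t[0]))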
I generated a synthetic signal similar to yours and, like you did, plotted the spectrum over the whole time span. Your plot was good as far as the whole spectrum goes; it just appears not to show the absolute value.
import numpy as np
import matplotlib.pyplot as p
%matplotlib inline
T=0.05 # 1/20 sec
n=5000 # 5000 Sa, so 100kSa/sec sampling frequency
sf=n/T
d=T/n
t=np.linspace(0,T,n)
fr=260 # Hz
y1= - np.cos(2*np.pi*fr*t) * np.exp(- 20* t)
y2= 3*np.sin(2*np.pi*10*fr*t+0.5) *np.exp(-2e6*(t-0.001)**2)
y=(y1+y2)/30
f=np.fft.fftshift(np.fft.fft(y))
freq=np.fft.fftshift(np.fft.fftfreq(n,d))
p.figure(figsize=(12,8))
p.subplot(311)
p.plot(t,y ,color='green', lw=1 )
p.xlabel('time (sec)')
p.ylabel('Velocity (m/s)')
p.subplot(312)
p.plot(freq,np.abs(f)/n)
p.xlabel('freq (Hz)')
p.ylabel('Velocity (m/s)');
p.subplot(313)
s=slice(n//2-500,n//2+500,1)
p.plot(freq[s],np.abs(f)[s]/n)
p.xlabel('freq (Hz)')
p.ylabel('Velocity (m/s)');
On the bottom, I zoomed in a bit to show the two main frequency components. Note that we are showing both the positive and negative frequencies (only the positive ones, times two, are physical). The Gaussians at 2600 Hz indicate the frequency spectrum of the burst (the FT of a Gaussian is a Gaussian). The straight lines at 260 Hz indicate the slow base frequency (the FT of a sine is a delta).
That, however, hides the timing of the two separate frequency components: the short (in my case Gaussian) burst at the start at about 2.6 kHz and the decaying low tone at about 260 Hz. The spectrogram plots the spectra of short pieces (nperseg) of your signal as vertical stripes, where color indicates intensity. You can set some overlap between the time frames, which should be some fraction of the segment length. By stacking these stripes over time, you get a plot of the spectral change over time.
from scipy.signal import spectrogram
f, t, Sxx = spectrogram(y,sf,nperseg=256,noverlap=64)
p.pcolormesh(t, f[:20], Sxx[:20,:])
#p.pcolormesh(t, f, Sxx)
p.ylabel('Frequency [Hz]')
p.xlabel('Time [sec]')
p.show()
It is instructive to try to generate the spectrogram yourself with just the FFT; otherwise the settings of the spectrogram function might not be very intuitive at first.
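As a rough illustration (not scipy's implementation), a hand-rolled spectrogram under the same assumptions, reusing the np and p imports and the y, sf, n variables defined above, with a window length and hop chosen to mimic nperseg=256, noverlap=64, might look like this:
nperseg, hop = 256, 192                        # 256-sample windows, 64 samples of overlap
win = np.hanning(nperseg)
starts = np.arange(0, n - nperseg, hop)
frames = np.stack([y[s:s + nperseg] * win for s in starts])
S = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # power spectrum of each short piece
freqs = np.fft.rfftfreq(nperseg, d=1 / sf)
times = (starts + nperseg / 2) / sf
p.pcolormesh(times, freqs[:20], S.T[:20, :])   # compare with the spectrogram() plot above
p.ylabel('Frequency [Hz]')
p.xlabel('Time [sec]')
p.show()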

GStreamer - Generate audio waveform from MP4 file

I have a two-part question,
1) I have an MP4 file and want to generate its audio waveform.
2) I have another MP4 file which has audio at channel [0] and channel [1] plus a video track; I want to generate waveforms for both channels as separate images.
How can I achieve both of the above using GStreamer?
Is the raw audio in 16-bit or 32-bit format? What is the sample rate (44100 Hz?) and what is the time duration?
Anyway, assuming 44.1 kHz at a 10 second duration... since you can't draw 441 thousand samples across the pixel width, choose a final display size (e.g. width = 800 px and height = 600 px) and do the math:
//# is (samplerate * duration) / width...
(44100 * 10) / 800 = 551;
After reading the first 2 values, you jump ahead by 551 samples and repeat until you have 800 points in total.
So in your raw data, starting from pos = 0:
1) Check this and the next sample, then multiply their values together (sample[pos] x sample[pos+1]).
2) Take that result and divide by 65535 (the maximum value of 16 bits, or 2 bytes). That's the final value of your first sample or point.
3) Draw a line scaled to the image height (e.g. 600 px), so if sample = 0.83 then:
line_height = (600 x 0.83); //# gives 498 as the line height
line_count += 1; //# add 1 to the line count (stop when it reaches 800)
4) Skip ahead from position [pos] by +551 samples and repeat from step (1) until line_count == 800. A rough sketch of this loop is given below.
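To make the loop concrete, here is a rough Python sketch of the same idea. It assumes the raw 16-bit mono PCM has already been extracted from the MP4 (for example with a GStreamer or ffmpeg pipeline) into a hypothetical file raw.pcm, and it normalises by 32768 squared rather than a single division by 65535 so the value lands in the 0..1 range:
import numpy as np

samples = np.frombuffer(open('raw.pcm', 'rb').read(), dtype='<i2').astype(np.float64)

width, height = 800, 600
step = len(samples) // width                    # e.g. 441000 // 800 = 551
heights = []
for x in range(width):
    pos = x * step
    value = abs(samples[pos] * samples[pos + 1]) / (32768.0 ** 2)  # pair product, normalised to 0..1
    heights.append(int(value * height))         # scale to the image height
# `heights` now holds one line height per pixel column; draw them with any 2D canvas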

AR model with MATLAB

I'm using the following code taken from MATLAB documentation to estimate the parameters of an ARMA model:
y = sin([1:300]') + 0.5 * randn(300, 1);
y = iddata(y);
mb = ar(y, 4, 'burg');
At this point, if I type mb what I get is this:
Discrete-time IDPOLY model:
A(q)y(t) = e(t)
A(q) = 1 - 0.2764 q^-1 + 0.2069 q^-2 + 0.4804 q^-3 + 0.1424 q^-4
Estimated using AR ('burg'/'now') from data set y
Loss function 0.314965 and FPE 0.323364
Sampling interval: 1
How can I use the variable mb I obtained to generate samples with those coefficients?
mb doesn't look like a vector.
In particular, how can I handle missing data?
Use: sim(mb,input)
More info about sim, quoted from the documentation:
Simulate linear models.
Syntax
y = sim(m,ue)
[y, ysd] = sim(m,ue,init)
Description
m is an arbitrary idmodel object.
ue is an iddata object, containing inputs only. The number of input
channels in ue must either be equal to the number of inputs of the
model m, or equal to the sum of the number of inputs and noise sources
(= number of outputs). In the latter case the last inputs in ue are
regarded as noise sources and a noise-corrupted simulation is
obtained. The noise is scaled according to the property
m.NoiseVariance in m, so in order to obtain the right noise level
according to the model, the noise inputs should be white noise with
zero mean and unit covariance matrix. If no noise sources are
contained in ue, a noise-free simulation is obtained.

How to find the fundamental frequency of a guitar string sound?

I want to build a guitar tuner app for iPhone. My goal is to find the fundamental frequency of the sound generated by a guitar string. I have used bits of code from the aurioTouch sample provided by Apple to calculate the frequency spectrum, and I find the frequency with the highest amplitude. It works fine for pure tones (those that have only one frequency), but for sounds from a guitar string it produces wrong results. I have read that this is because of the overtones generated by the guitar string, which might have higher amplitudes than the fundamental. How can I find the fundamental frequency so it works for guitar strings? Is there an open-source library in C/C++/Obj-C for sound analysis (or signal processing)?
You can use the signal's autocorrelation, which is the inverse transform of the magnitude squared of the DFT. If you're sampling at 44100 samples/s, then an 82.4 Hz fundamental is about 535 samples, whereas 1479.98 Hz is about 30 samples. Look for the peak positive lag in that range (e.g. from 28 to 560). Make sure your window is at least two periods of the longest fundamental, which would be 1070 samples here. Rounding up to the next power of two gives a 2048-sample buffer. For better frequency resolution and a less biased estimate, use a longer buffer, but not so long that the signal is no longer approximately stationary. Here's an example in Python:
from pylab import *
import wave
fs = 44100.0 # sample rate
K = 3 # number of windows
L = 8192 # 1st pass window overlap, 50%
M = 16384 # 1st pass window length
N = 32768 # 1st pass DFT lenth: acyclic correlation
# load a sample of guitar playing an open string 6
# with a fundamental frequency of 82.4 Hz (in theory),
# but this sample is actually at about 81.97 Hz
g = fromstring(wave.open('dist_gtr_6.wav').readframes(-1),
dtype='int16')
g = g / float64(max(abs(g))) # normalize to +/- 1.0
mi = len(g) / 4 # start index
def welch(x, w, L, N):
    # Welch's method
    M = len(w)
    K = (len(x) - L) / (M - L)
    Xsq = zeros(N/2+1) # len(N-point rfft) = N/2+1
    for k in range(K):
        m = k * (M - L)
        xt = w * x[m:m+M]
        # use rfft for efficiency (assumes x is real-valued)
        Xsq = Xsq + abs(rfft(xt, N)) ** 2
    Xsq = Xsq / K
    Wsq = abs(rfft(w, N)) ** 2
    bias = irfft(Wsq) # for unbiasing Rxx and Sxx
    p = dot(x,x) / len(x) # avg power, used as a check
    return Xsq, bias, p
# first pass: acyclic autocorrelation
x = g[mi:mi + K*M - (K-1)*L] # len(x) = 32768
w = hamming(M) # hamming[m] = 0.54 - 0.46*cos(2*pi*m/M)
# reduces the side lobes in DFT
Xsq, bias, p = welch(x, w, L, N)
Rxx = irfft(Xsq) # acyclic autocorrelation
Rxx = Rxx / bias # unbias (bias is tapered)
mp = argmax(Rxx[28:561]) + 28 # index of 1st peak in 28 to 560
# 2nd pass: cyclic autocorrelation
N = M = L - (L % mp) # window an integer number of periods
# shortened to ~8192 for stationarity
x = g[mi:mi+K*M] # data for K windows
w = ones(M); L = 0 # rectangular, non-overlaping
Xsq, bias, p = welch(x, w, L, N)
Rxx = irfft(Xsq) # cyclic autocorrelation
Rxx = Rxx / bias # unbias (bias is constant)
mp = argmax(Rxx[28:561]) + 28 # index of 1st peak in 28 to 560
Sxx = Xsq / bias[0]
Sxx[1:-1] = 2 * Sxx[1:-1] # fold the freq axis
Sxx = Sxx / N # normalize S for avg power
n0 = N / mp
np = argmax(Sxx[n0-2:n0+3]) + n0-2 # bin of the nearest peak power
# check
print "\nAverage Power"
print " p:", p
print "Rxx:", Rxx[0] # should equal dot product, p
print "Sxx:", sum(Sxx), '\n' # should equal Rxx[0]
figure().subplots_adjust(hspace=0.5)
subplot2grid((2,1), (0,0))
title('Autocorrelation, R$_{xx}$'); xlabel('Lags')
mr = r_[:3 * mp]
plot(Rxx[mr]); plot(mp, Rxx[mp], 'ro')
xticks(mp/2 * r_[1:6])
grid(); axis('tight'); ylim(1.25*min(Rxx), 1.25*max(Rxx))
subplot2grid((2,1), (1,0))
title('Power Spectral Density, S$_{xx}$'); xlabel('Frequency (Hz)')
fr = r_[:5 * np]; f = fs * fr / N;
vlines(f, 0, Sxx[fr], colors='b', linewidth=2)
xticks((fs * np/N * r_[1:5]).round(3))
grid(); axis('tight'); ylim(0,1.25*max(Sxx[fr]))
show()
Output:
Average Power
p: 0.0410611012542
Rxx: 0.0410611012542
Sxx: 0.0410611012542
The peak lag is 538, which is 44100/538 = 81.97 Hz. The first-pass acyclic DFT shows the fundamental at bin 61, which is 82.10 +/- 0.67 Hz. The 2nd pass uses a window length of 538*15 = 8070, so the DFT frequencies include the fundamental period and harmonics of the string. This enables an unbiased cyclic autocorrelation for an improved PSD estimate with less harmonic spreading (i.e. the correlation can wrap around the window periodically).
Edit: Updated to use Welch's method to estimate the autocorrelation. Overlapping the windows compensates for the Hamming window. I also calculate the tapered bias of the hamming window to unbias the autocorrelation.
Edit: Added a 2nd pass with cyclic correlation to clean up the power spectral density. This pass uses 3 non-overlapping rectangular windows of length 538*15 = 8070 (short enough to be nearly stationary). The bias for cyclic correlation is a constant, instead of the Hamming window's tapered bias.
Finding the musical pitches in a chord is far more difficult than estimating the pitch of one single string or note played at a time. The overtones for the multiple notes in a chord might all be overlapping and interleaving. And all the notes in common chords may themselves be at overtone frequencies for one or more non-existent lower pitched notes.
For single notes, autocorrelation is a common technique used by some guitar tuners. But with autocorrelation, you have to be aware of some potential octave uncertainty, as guitars may produce inharmonic and decaying overtones which thus don't exactly match from pitch period to pitch period. Cepstrum and Harmonic Product Spectrum are two other pitch estimation methods which may or may not have different problems, depending on the guitar and the note.
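For reference, here is a minimal sketch of the Harmonic Product Spectrum idea mentioned above, written in Python rather than C/Obj-C; the function name, window handling and parameter values are illustrative only:
import numpy as np

def hps_pitch(x, fs, n_harmonics=4):
    # multiply the magnitude spectrum by downsampled copies of itself,
    # so the harmonics reinforce the bin of the fundamental
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    hps = spec.copy()
    for h in range(2, n_harmonics + 1):
        hps[:len(spec[::h])] *= spec[::h]
    peak = np.argmax(hps[1:]) + 1   # skip the DC bin
    return peak * fs / len(x)       # convert the bin index to Hz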
RAPT appears to be one published algorithm for more robust pitch estimation. YIN is another.
Also, Objective-C is a superset of ANSI C, so you can use any C DSP routines you find for pitch estimation within an Objective-C app.
Use libaubio (link) and be happy. Trying to implement a fundamental frequency estimator myself was one of my biggest time sinks. If you want to do it yourself, I advise you to follow the YINFFT method (link).
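For what it's worth, a rough sketch of how aubio's Python bindings are typically used for this; the file name and parameters are illustrative, so check the aubio documentation for the exact API:
import aubio

hop, win = 512, 2048
src = aubio.source('guitar.wav', 0, hop)           # 0 = use the file's own sample rate
pitch_o = aubio.pitch('yinfft', win, hop, src.samplerate)
pitch_o.set_unit('Hz')

while True:
    samples, read = src()
    if read < hop:
        break
    print(pitch_o(samples)[0])                     # estimated fundamental for this hop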
