I am trying to construct a plot spectrum of an audio sample, similar to the one created by Audacity. According to Audacity's wiki page, Plot Spectrum (attached example) works as follows:
Plot Spectrum takes the audio in blocks of 'Size' samples, does the
FFT, and averages all the blocks together.
I was thinking I would use the STFT functionality recently provided by TensorFlow.
I am using audio blocks of size 512, and my code is as follows:
audio_binary = tf.read_file(audio_file)
waveform = tf.contrib.ffmpeg.decode_audio(
    audio_binary,
    file_format="wav",
    samples_per_second=4000,
    channel_count=1
)
stft = tf.contrib.signal.stft(
    waveform,
    512,  # frame_length
    512,  # frame_step
    fft_length=512,
    window_fn=functools.partial(tf.contrib.signal.hann_window, periodic=True),  # matches Audacity
    pad_end=True,
    name="STFT"
)
But the result of stft is just an empty array, when I expect the FFT results for each frame (of 512 samples).
What is wrong with the way that I am making this call?
I have verified that the waveform audio data is being read correctly by running it through the plain tf.fft function.
audio_file = tf.placeholder(tf.string)
audio_binary = tf.read_file(audio_file)
waveform = tf.contrib.ffmpeg.decode_audio(
    audio_binary,
    file_format="wav",
    samples_per_second=sample_rate,  # from the .wav file's info (sample rate)
    channel_count=1                  # from the .wav file's info (audio channels)
)
stft = tf.contrib.signal.stft(
    tf.transpose(waveform),
    frame_length,   # frame_length
    frame_step,     # frame_step
    fft_length=fft_length,
    window_fn=functools.partial(tf.contrib.signal.hann_window,
                                periodic=False),  # matches Audacity
    pad_end=False,
    name="STFT"
)
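Once stft comes back with shape [channels, frames, fft_length // 2 + 1], averaging the per-frame magnitude spectra roughly reproduces Audacity's "average all the blocks together" step. A minimal sketch, assuming sample_rate and fft_length are defined as above:

import numpy as np

magnitude = tf.abs(stft)                          # magnitude spectrum per frame
avg_spectrum = tf.reduce_mean(magnitude, axis=1)  # average across frames

# frequency axis for plotting: fft_length // 2 + 1 bins from 0 Hz to Nyquist
freqs = np.linspace(0, sample_rate / 2, fft_length // 2 + 1)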
From the documentation, https://pytorch.org/audio/stable/backend.html#torchaudio.backend.sox_io_backend.load, it seems there is no parameter for loading audio at a fixed sampling rate, which is important for training models.
How can I load a PyTorch audio tensor at a fixed sampling rate with torchaudio?
You can use Resample from torchaudio.transforms:
import torchaudio
from torchaudio import transforms

waveform, sample_rate = torchaudio.load('test.wav', normalize=True)
# the second argument is the target rate; here one tenth of the original
transform = transforms.Resample(sample_rate, sample_rate / 10)
waveform = transform(waveform)
You can resample with torchaudio.functional.resample:
import torchaudio

arr, org_sr = torchaudio.load('path')
# new_sr is the sampling rate you want the model to see, e.g. 16000
arr = torchaudio.functional.resample(arr, orig_freq=org_sr, new_freq=new_sr)
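As a rough sketch, you could wrap load and resample into a small helper that always returns audio at a fixed rate (the names load_fixed_rate and TARGET_SR below are placeholders, not part of torchaudio):

import torchaudio
import torchaudio.functional as F

TARGET_SR = 16000  # whatever fixed rate your model expects

def load_fixed_rate(path, target_sr=TARGET_SR):
    waveform, sr = torchaudio.load(path)
    if sr != target_sr:
        waveform = F.resample(waveform, orig_freq=sr, new_freq=target_sr)
    return waveform, target_sr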
I checked PyAudio, but it offers the ability to record the input and manipulate it; I just want to take an action when audio input exists.
You can implement simple input audio detection with PyAudio. You just need to decide what you mean by audio existence.
In the following example code I have used a simple root-mean-square (RMS) calculation with a threshold. Another option is a peak test: just compare the amplitude of each audio sample with a peak amplitude threshold (a sketch of this follows the code below). Which is most useful for you depends on the application.
You can play around with the threshold value (i.e. the minimum amplitude, or loudness, of the audio) and the chunk size (i.e. the latency of the audio detection) to get the behaviour you want.
import pyaudio
import math
from array import array

RATE = 44100
CHUNK = 1024
AUDIO_EXISTENCE_THRESHOLD = 1000

def detect_input_audio(data, threshold):
    if not data:
        return False
    # interpret the raw byte buffer as signed 16-bit samples
    samples = array('h', data)
    rms = math.sqrt(sum(x**2 for x in samples) / len(samples))
    return rms > threshold

audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, input=True,
                    rate=RATE, frames_per_buffer=CHUNK)

# keep reading until input audio is detected
data = stream.read(CHUNK)
while not detect_input_audio(data, AUDIO_EXISTENCE_THRESHOLD):
    data = stream.read(CHUNK)

# Do something when input audio exists
# ...

stream.stop_stream()
stream.close()
audio.terminate()
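A minimal sketch of the peak-test alternative mentioned above (the threshold value is only an illustration) could be:

from array import array

PEAK_THRESHOLD = 1000  # example peak amplitude threshold

def detect_input_audio_peak(data, threshold=PEAK_THRESHOLD):
    # True if any 16-bit sample exceeds the amplitude threshold
    if not data:
        return False
    samples = array('h', data)
    return max(abs(s) for s in samples) > threshold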
I am looking to perform feature extraction on human accelerometer data to use for activity recognition. The sampling rate of my data is 100 Hz.
From the various sources I have researched, an FFT seems a favourable method to use. I have the data in a sliding-window format; the length of each window is 256 samples. I am using Python to do this with the NumPy library. The code I have used to apply the FFT is:
import numpy as np

def fft_transform(window_data):
    fft_data = []
    fft_freq = []
    power_spec = []
    for window in window_data:
        fft_window = np.fft.fft(window)
        fft_data.append(fft_window)
        freq = np.fft.fftfreq(np.array(window).shape[-1], d=0.01)  # d = 1/100 Hz
        fft_freq.append(freq)
        fft_ps = np.abs(fft_window)**2
        power_spec.append(fft_ps)
    return fft_data, fft_freq, power_spec
This gives output which looks like this:
fft_data
array([ 2.92394828e+01 +0.00000000e+00j,
-6.00104665e-01 -7.57915977e+00j,
-1.02677676e+01 -1.55806119e+00j,
-7.17273995e-01 -6.64043705e+00j,
3.45758079e+01 +3.60869421e+01j,
etc..
freq_data
array([ 0. , 0.390625, 0.78125 , 1.171875, 1.5625 , etc...
power_spectrum
array([ 8.54947354e+02, 5.78037884e+01, 1.07854606e+02,
4.46098863e+01, 2.49775388e+03, etc...
I have also plotted the results using the code below, where fst_ps is the first window of power_spec and fst_freq is the first window of fft_freq.
import matplotlib.pyplot as plt
import numpy as np

width, height = 12, 6  # figure size in inches
fig = plt.figure(figsize=(width, height))
fig1 = fig.add_subplot(221)
fig2 = fig.add_subplot(222)
fig1.plot(fst_freq, fst_ps)
fig2.plot(fst_freq, np.log10(fst_ps))
plt.show()
I am looking for some advice on what my next step is for extracting features. Thanks
Now that you have decomposed the signal into its spectrum, the next step is to work out which frequencies are relevant for your application. That is quite difficult to judge from a single spectrum picture. Remember that one frequency bin in the spectrum is just the underlying signal restricted to a narrow frequency range; some frequencies may simply not matter for your task.
A better way is to use the STFT to examine your signal's features in the time-frequency domain. For example, you may read this article about the STFT approach in Python. This method is usually applied to search for time-frequency patterns that can be recognised as features. For example, in a human voice pattern (as in the article) you may see sustained, slowly varying frequencies with characteristic durations and frequency bounds. Compute the STFT of your signal and look for patterns in the spectrogram that can be extracted as features for your task.
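For instance, a minimal sketch using scipy.signal.stft (x below is placeholder data standing in for one accelerometer axis sampled at 100 Hz):

import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

fs = 100                        # sampling rate from the question
x = np.random.randn(10 * fs)    # placeholder for one accelerometer axis

f, t, Zxx = signal.stft(x, fs=fs, nperseg=256)
plt.pcolormesh(t, f, np.abs(Zxx), shading='gouraud')
plt.xlabel('Time [s]')
plt.ylabel('Frequency [Hz]')
plt.show()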
I read 48 kHz, 16-bit PCM speech data using SciPy's WAV-read functionality.
Next, I perform these steps in order:
decimation -> normalisation
Decimation and normalisation are done using the following steps:
import numpy as N
import scipy.signal

yiir = scipy.signal.decimate(topRightChn, 3)  # downsample 48 kHz -> 16 kHz
timeSerDownSmpldSig = N.array(yiir)
factor = 2**16
normtimeSerDownSmpldSig = normalise(timeSerDownSmpldSig, factor)  # normalise is my own helper (not shown)
My decimated (downsampled) signal is supposed to be 16 kHz (hence the downsampling factor of 3 above). Now I want to view the normalised, downsampled NumPy array normtimeSerDownSmpldSig in Adobe Audition.
What steps in Python and/or Adobe Audition do I need to perform? How can I use NumPy's savetxt function to view the above array in Adobe Audition?
My yiir signal values look like the following:
Downsampled signal yiir, first 10 values:
[ -6.95990948e-05  -2.71091920e-02  -3.76441923e-01  -5.65301893e-01
   1.59163252e-01  -2.44745081e+00  -4.11047340e+00  -2.81722036e+00
  -1.89322873e+00  -2.51526839e+00]
Downsampled signal yiir, last 10 values:
[-1.73357094 -3.41704894 -2.77903517  0.87867336 -2.00060527 -2.63675154
 -5.93578443 -5.70939184 -3.68355598 -4.29757849]
Array signal obtained from IIR decimate in Python:
shape: (6400000,)
type: <class 'numpy.dtype'>
dtype: float64
rows: 6400000
min, max: -875.162306537, 874.341374084
The Adobe Audition documentation at this link (page 45) - http://www.newhopechurch.ca/docs/tech/AUDITION.pdf - says the following about ASCII text data:
ASCII Text Data (.txt)
Audio data can be read to or written from files in a standard text format, with each sample separated by a carriage return, and channels separated by a tab character. An optional header can be placed before the data. If there is no header text, then the data is assumed to be 16-bit signed decimal integers. The header is formatted as KEYWORD: value with the keywords being: SAMPLES, BITSPERSAMPLE, CHANNELS, SAMPLERATE, and NORMALIZED. The values for NORMALIZED are either TRUE or FALSE. For example,
SAMPLES: 1582
BITSPERSAMPLE: 16
CHANNELS: 2
SAMPLERATE: 22050
NORMALIZED: FALSE
164 -1372
492 -876
etc...
Options
Choose any of the following:
• Include Format Header places a header before the data.
• Normalized Data normalizes the data between -1.0 and 1.0.
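Based on that format description, a minimal sketch of writing a mono array with numpy.savetxt might look like this (the function name write_audition_txt and the fmt choice are arbitrary, not an Audition API):

import numpy as np

def write_audition_txt(path, samples, sample_rate, normalized=True):
    # header keywords as described in the Audition documentation above
    header = ("SAMPLES: {}\n"
              "BITSPERSAMPLE: 16\n"
              "CHANNELS: 1\n"
              "SAMPLERATE: {}\n"
              "NORMALIZED: {}").format(len(samples), sample_rate,
                                       "TRUE" if normalized else "FALSE")
    # one sample per line, no comment character in front of the header
    np.savetxt(path, samples, fmt="%.8f", header=header, comments="")

# e.g. write_audition_txt("downsampled.txt", normtimeSerDownSmpldSig, 16000)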
numpy.savetxt does not create WAV files. You can use scipy.io.wavfile.write.
For example, the following creates a WAV file containing a single channel (monophonic). The signal is 3 seconds of a 440 Hz sine wave sampled at 44100 samples per second.
In [18]: from scipy.io import wavfile
In [19]: fs = 44100
In [20]: T = 3.0
In [21]: t = np.linspace(0, T, int(T*fs), endpoint=False)
In [22]: y = np.sin(2*np.pi*440*t)
In [23]: wavfile.write("sine440.wav", fs, y)
Another alternative is wavio.
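A rough sketch of the same sine written as a 16-bit file with wavio (assuming wavio is installed):

import numpy as np
import wavio

fs = 44100
t = np.linspace(0, 3, 3 * fs, endpoint=False)
y = np.sin(2 * np.pi * 440 * t)
wavio.write("sine440_wavio.wav", y, fs, sampwidth=2)  # 16-bit samples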
I am playing around with some basics of the Audiostream package for Kivy.
I would like to make a simple online input-filter-output system, for example, take in microphone data, impose a band-pass filter, send to speakers.
However, I can't seem to figure out what data format the microphone input is in, or how to manipulate it. In the code below, buf is of type string, but how can I get the data out of it to manipulate it [i.e. function(buf)] to do something like a band-pass filter?
The code currently functions to just send the microphone input directly to the speakers.
Thanks.
from time import sleep
from audiostream import get_input
from audiostream import get_output, AudioSample

# get speakers, create sample and bind to speakers
stream = get_output(channels=2, rate=22050, buffersize=1024)
sample = AudioSample()
stream.add_sample(sample)

# define what happens on mic input, with the buffer as argument
def mic_callback(buf):
    print 'got', len(buf)
    # HERE: How do I manipulate buf?
    # modified_buf = function(buf)
    # sample.write(modified_buf)
    sample.write(buf)

# get the default audio input (mic in most cases)
mic = get_input(callback=mic_callback)
mic.start()
sample.play()

sleep(3)  # record for 3 seconds

mic.stop()
sample.stop()
The buffer is composed of bytes that need to be interpreted as signed shorts. You can use the struct or array module to get the values. In your example you have 2 channels (L/R). Let's say you want to turn the right channel's volume down by 20% (i.e. keep only 80% of the original level on the right channel):
from array import array

def mic_callback(buf):
    # convert our byte buffer into a signed short array
    values = array("h", buf)
    # get right-channel values only (interleaved L/R)
    r_values = values[1::2]
    # reduce by 20%, converting back to int since array("h") only accepts integers
    r_values = [int(x * 0.8) for x in r_values]
    # you can only assign an array to a slice, not a list,
    # so we convert the list back to an array
    values[1::2] = array("h", r_values)
    # convert the array back to a byte buffer for the speaker
    sample.write(values.tostring())
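For the band-pass part of the question, a rough sketch using SciPy (the 300-3000 Hz band, the 4th-order Butterworth design, and the interleaved stereo layout are assumptions, not something Audiostream dictates):

import numpy as np
from scipy import signal

RATE = 22050  # matches the output stream rate above
# design the filter once, outside the callback
b, a = signal.butter(4, [300, 3000], btype='bandpass', fs=RATE)

def mic_callback(buf):
    samples = np.frombuffer(buf, dtype=np.int16).astype(np.float64)
    stereo = samples.reshape(-1, 2)                  # interleaved L/R
    filtered = signal.lfilter(b, a, stereo, axis=0)  # filter each channel
    out = np.clip(filtered, -32768, 32767).astype(np.int16)
    sample.write(out.tobytes())

# Note: for gap-free streaming you would also carry the filter state
# (scipy.signal.lfilter's zi argument) across successive callbacks.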