Realtime Audio Processing with FFT - python-3.x

So I'm doing real-time audio processing in Python. The good news is, I found this link, which helps me collect data from my PC mic and plot all the data in real time, which is fantastic.
I also found this code from other links, where I can stream the data from the mic to the speaker for a given time.
# Open a full-duplex stream: 16-bit samples, reading from the mic and writing to the speaker.
self.stream = self.p.open(format=pyaudio.paInt16, channels=self.CHANNELS, rate=self.RATE,
                          input=True, output=True, frames_per_buffer=self.CHUNK)

def stream_data(self):
    # Read and immediately play back one CHUNK at a time for RECORD_SECONDS.
    for i in range(0, int(self.RATE / self.CHUNK * self.RECORD_SECONDS)):
        data = self.stream.read(self.CHUNK)   # raw bytes from the microphone
        self.stream.write(data, self.CHUNK)   # play the same bytes back out
Where my idea diverges from the above link is that I want to apply an FFT to the microphone data before I send it to the speaker. If I print the 'data' from the above code, I see that it is a whole lot of hex gibberish that has to be converted to decimal format. From the earlier link, I know how to do that as well:
data = np.frombuffer(self.stream.read(self.CHUNK), dtype=np.int16)
Now I have the data that I need in decimal format. But once I have processed this data, how can I convert it back to the byte format that 'self.stream.write' can understand and output to the speaker? I'm not sure how that gets done.

I believe I've been able to find an answer, so in case it helps someone else as well, here is the paper that helped me:
Real-Time Digital Signal Processing Using pyaudio_helper and the ipywidgets
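In short, the missing piece is just the reverse of np.frombuffer: process the int16 samples, cast them back to int16, and serialize them with tobytes() before handing them to stream.write(). A minimal sketch of the round trip (the FFT here is only a placeholder for whatever processing you want to apply):

import numpy as np

def process_chunk(raw_bytes):
    # bytes -> int16 samples
    samples = np.frombuffer(raw_bytes, dtype=np.int16)
    spectrum = np.fft.rfft(samples)                  # frequency-domain view of the chunk
    # ... modify 'spectrum' here (filtering, etc.) ...
    processed = np.fft.irfft(spectrum, n=len(samples))
    # Clip to the int16 range, cast back, and serialize for stream.write().
    return np.clip(processed, -32768, 32767).astype(np.int16).tobytes()

Inside the loop, the playback line then becomes self.stream.write(process_chunk(self.stream.read(self.CHUNK)), self.CHUNK).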

Related

Meaning of sample values in a wav file

For a school project, I am supposed to analyze a short sound recording in WAV format. I am done with the project: I DFT'd it, filtered out unwanted frequencies, and got the correct result. What eludes me, though, is the meaning of the values of the individual samples of my WAV file. I have tens of thousands of samples that look like this:
[ 0.06234258 0.16020246 0.14122963 ... -0.01704375 -0.08993937 -0.09293508]
However, no matter how much I multiply these values by a number, the resulting sound sounds the same. If I multiply every sample by 1000, it sounds just as it sounded before. The same goes for dividing. So what do these samples mean, if not volume?
EDIT:
Here is the code I'm using:
import soundfile as sf
import IPython
samples, sampling_freq = sf.read('recording.wav')
IPython.display.display(IPython.display.Audio(samples, rate=sampling_freq))  # This one displays a playable bar.
The samples (basically a long array of floating-point numbers) in the file are the Pulse Code Modulated (PCM) data representing the audio.
Given that audio players use this data to recreate the original audio wave, multiplying every sample by some factor should increase the volume. However, some audio players scale down (re-normalize) the samples to prevent clipping, which can be why it sounds the same.
The ideal way to visualize the audio is Audacity, which can show the audio waveform in real time.
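A quick way to confirm that the samples do encode amplitude is to scale them, write them back to disk with soundfile, and play the new file in a player that does not re-normalize. A minimal sketch (the output file name is a placeholder):

import soundfile as sf

samples, sampling_freq = sf.read('recording.wav')
sf.write('recording_quiet.wav', samples * 0.1, sampling_freq)   # scaled down by 20 dB
# Playing 'recording_quiet.wav' in a non-normalizing player (or passing
# normalize=False to IPython.display.Audio in recent IPython versions)
# makes the volume difference audible.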

I need to analyse many audio WAV files for characteristic noise, ideas?

I need to be able to analyze (search through) hundreds of WAV files and detect, but not remove, static noise. Currently, I must listen to each conversation and find the characteristic noise/static manually, which takes too much time. Ideally, I would need a program that can read each new WAV file and detect characteristic signatures of the static noise, such as periods of bursts of white noise or full-audio-band, high-amplitude noise (like AM radio noise over a phone conversation, a wall of white noise), or bursts of peak high-frequency, high-amplitude noise (as in crackling on the phone line) against a background of normal voice. I do not need to remove the noise but simply detect it and flag the recording for further troubleshooting. Ideas?
I can listen to the recordings and find the static or crackling, but this takes time. I need an automated or batch process that can run on its own and flag the troubled call recordings (WAV files for a phone PBX). These are SIP and analog conversations depending on the leg of the conversation, so RTSP/SIP packet analysis might be an option, but the raw WAV file is the simplest. I can use Audacity, but this still requires opening each file and looking at the visual representation of the audio spectrogram, which is only a little faster than listening to each call and still cumbersome.
I currently have no code or methods for this task. I simply listen to each call wav file to find the noise.
I need a batch WAV-file search that can flag recordings that contain the characteristic noise, static, or crackling over the recorded phone conversation.
Unless you can tell the program what the noise looks like, it's going to be challenging to run any sort of batch processing. I was facing a similar challenge, and that prompted me to develop free and open-source software to help users with audio exploration, analysis and signal separation:
App: https://audioexplorer.online/
Docs: https://tracek.github.io/audio-explorer/
Source code: https://github.com/tracek/audio-explorer
Essentially, it visualises audio as a 2D scatter plot rather than only "linearly", as in a waveform or spectrogram. When you upload audio, the following happens:
Onsets are detected (based on the high-frequency-content algorithm from aubio) according to the threshold you set. Set it to None if you want all of them.
For each audio fragment, audio features are calculated based on your selection. There's no universally best set of features; it all depends on the application. For starters you might try e.g. pitch statistics. Consider setting proper values for the bandpass filter and the sample length (that's the length of the audio fragment we're going to use). In the future the sample length could be established dynamically. Check the docs for more info.
The result is that for each fragment you have many features, e.g. 6 or 60. That means we have a k-dimensional structure (where k is the number of features), which we then project to 2D space with the dimensionality-reduction algorithm of your choice. Uniform Manifold Approximation and Projection (UMAP) is a sound choice.
In theory, the resulting embedding should be such that similar sounds (according to the features we selected) sit close together, while different ones sit further apart. Your noise should now be separated from your "not noise" and form a cluster.
When you hover over the graph, a set of icons appears in the upper-right corner. One is lasso selection. Use it to mark points, inspect the spectrogram and e.g. download a table with the features that describe that signal. At that moment you can also reduce the noise (an extra button appears) in a similar way to Audacity: it analyses the spectrum and reduces those frequencies with some smoothing.
It does not completely solve your problem right now, but it could cut the effort considerably. Going through hundreds of WAVs could take the better part of a day, but you will be done. Want it automated? There's a CLI (command-line interface) that I am developing at the same time. In the not-too-distant future it should take what you have labelled as noise and signal and then use supervised machine learning to go through everything in batch mode.
Suggestions / feedback? Drop an issue on GitHub.
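If you would rather prototype the same idea in plain Python first, a rough sketch could compute a few per-file features and project them to 2D with UMAP; the features below (spectral flatness and zero-crossing rate) and the folder name are just examples, not what audio-explorer uses internally:

import glob
import numpy as np
import librosa
import umap                                    # pip install umap-learn

rows, names = [], []
for path in glob.glob('calls/*.wav'):          # placeholder folder of recordings
    y, sr = librosa.load(path, sr=None, mono=True)
    flatness = librosa.feature.spectral_flatness(y=y)   # noise-like frames score high
    zcr = librosa.feature.zero_crossing_rate(y)
    rows.append([flatness.mean(), flatness.max(), zcr.mean(), zcr.max()])
    names.append(path)

embedding = umap.UMAP(n_neighbors=15).fit_transform(np.array(rows))
# Plot 'embedding' (e.g. with matplotlib); files whose points sit apart from the
# main cloud are candidates for static or crackle and can be flagged for review.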

Processing: applying a recorded acoustic space (impulse response) to an audio file

I need to apply an impulse response recorded at 48 kHz to an audio file recorded at 44.1 kHz. The audio could be someone speaking, for example. If I'm using the correct term, I need to convolve the two audio files together so it would sound like someone is speaking inside a cathedral.
What I don't know is how to go about doing this. I looked at the Minim library, since it's the only audio library that I remember using, and I found an example that applies the impulse response of a low-pass filter to an audio file. Is there a way to convolve two audio files together to output a new sound? Audio processing isn't my forte, so please don't mind my ignorance. Thanks, I'm trying to figure this out along the way.
Yes, convolution is what you want, but first your sources need to be at the same sample rate. Once they are, you have two options for performing the convolution: 1. you can do it "directly", which is the most straightforward way but takes M*N time, or 2. you can do it using the Fourier transform, which is more complex but faster; in that case you will also need to implement the overlap-add algorithm. Looking at the docs of Minim, it looks to me like that example uses a standard IIR filter, not convolution by an impulse response, so I don't think it will help. You will have to do a lot of work to build FFT-based convolution on top of what Minim gives you. If you want to go the "direct" route, it will look something like this:
// Direct (time-domain) convolution: each output sample is a weighted sum of past input samples.
for( i in 0...input.length )
    for( j in 0...conv.length )
        output[i] += i-j < 0 ? 0 : input[i-j] * conv[j];
More details here: http://www-rohan.sdsu.edu/~jiracek/DAGSAW/4.3.html, or google "discrete convolution".
Update: Minim does give you convolution: http://code.compartmental.net/minim/javadoc/ddf/minim/effects/Convolver.html
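For reference, outside of Minim the same pipeline is short in Python; a minimal sketch (file names are placeholders) using scipy for resampling and FFT-based convolution:

import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve, resample_poly

speech, sr_speech = sf.read('speech_44100.wav')      # 44.1 kHz dry signal
ir, sr_ir = sf.read('cathedral_ir_48000.wav')        # 48 kHz impulse response

# Work in mono for simplicity.
if speech.ndim > 1:
    speech = speech.mean(axis=1)
if ir.ndim > 1:
    ir = ir.mean(axis=1)

ir = resample_poly(ir, 147, 160)                     # 48 kHz -> 44.1 kHz (44100/48000 = 147/160)

wet = fftconvolve(speech, ir)                        # FFT-based convolution
wet /= np.max(np.abs(wet))                           # normalize to avoid clipping
sf.write('speech_in_cathedral.wav', wet, sr_speech)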

Changing duration of Guitar pluck in PCM data

Folks,
I am struggling with a simple concept related to the duration of play of PCM data. I would appreciate your feedback.
The application I am developing plays guitar notes from a music sheet.
I have implemented Jaffe-Smith Algorithm for guitar plucking.
https://ccrma.stanford.edu/~jos/Mohonk05/Extended_Karplus_Strong_EKS_Algorithm.html.
Let's say I compute samples for note A (440 Hz) for one second.
At a sample rate of 11025 Hz, I will be storing 11025 samples that can be sent to the computer speakers as PCM audio.
Computing samples for all the unique notes on the guitar takes quite some time, so I am thinking I will pre-compute them, save them as binary data, and simply load them when the application runs.
So far so good.
Now, let's say I want to play a song (a list of various notes). Let's say the song needs to be played at 100 beats per minute. Let's say I have to play note A for one beat or 0.6 seconds (60/100).
Recalculating samples for 0.6 seconds may take quite some time.
Can I simply play (11025 * 0.6) samples? Will this create any side effect?
Is there a better way to achieve what I am trying to do?
Thank you in advance for your help.
Regards,
Peter
What you're basically trying to do is create a synthesized guitar, yes? I might suggest that you go with the sampler route instead.
By sample, I mean a small clip of audio (not a single sample in the sense of ADC or DAC).
Basically, you can flatten what you need into 4 parts:
Attack
Decay
Sustain
Release
These four parts work in that order, and are generally referred to as an ADSR envelope. The attack of the note is the initial sound. For a guitar, you are going to hear a pluck and the start of a pitch. The decay is going to be the sample of the string as it starts to fade away. The sustain is a sample repeated over and over again until you release the key. The release sample is what is played when you release the key. For a guitar, you might hear a sample of lightly putting fingers back on the string to stop their vibration.
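As an illustration of the envelope idea (independent of any particular synth library, and with made-up stage lengths and levels), a piecewise-linear ADSR curve can be built with NumPy and multiplied onto a precomputed note:

import numpy as np

def adsr_envelope(n_samples, sr, attack=0.01, decay=0.05, sustain_level=0.6, release=0.1):
    # Piecewise-linear ADSR envelope; times in seconds, levels in 0..1.
    a, d, r = int(attack * sr), int(decay * sr), int(release * sr)
    s = max(n_samples - a - d - r, 0)                 # remaining samples form the sustain
    env = np.concatenate([
        np.linspace(0.0, 1.0, a, endpoint=False),             # attack: ramp up
        np.linspace(1.0, sustain_level, d, endpoint=False),    # decay: fall to sustain level
        np.full(s, sustain_level),                             # sustain: hold
        np.linspace(sustain_level, 0.0, r),                    # release: fade to silence
    ])
    return env[:n_samples]

# Usage, assuming note_samples holds the precomputed pluck (e.g. 11025 samples at 11025 Hz):
# shaped = note_samples * adsr_envelope(len(note_samples), 11025)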
Now, you could generate all of these samples in real time, but doing so will likely be very CPU-intensive.
Regarding your question: "Can I simply play (11025 * 0.6) samples?" Yes, at a sample rate of 11025, that will be 0.6 seconds of audio. Also keep in mind though that you should be sending a continuous stream of data to the sound card, filling any empty spots with 0 (for signed PCM).
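On the truncation question specifically: taking the first int(11025 * 0.6) samples works, but cutting the waveform mid-cycle can produce an audible click, so it is common to apply a short fade-out at the cut. A minimal sketch, with a zero buffer standing in for the precomputed note:

import numpy as np

SR = 11025
note_samples = np.zeros(SR)                  # placeholder for the precomputed 1-second note

beat = int(SR * 0.6)                         # one beat at 100 BPM
clip = note_samples[:beat].copy()

fade_len = int(SR * 0.01)                    # 10 ms linear fade-out to avoid a click
clip[-fade_len:] *= np.linspace(1.0, 0.0, fade_len)

# 'clip' can now be sent to the sound card; pad with zeros if the output
# buffer expects a fixed block size.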

Where can I learn how to work with audio data formats?

I'm working on an OpenGL project that involves a speaking cartoon face. My hope is to play the speech (encoded as MP3s) and animate its mouth using the audio data. I've never really worked with audio before, so I'm not sure where to start, but some googling led me to believe my first step would be converting the MP3 to PCM.
I don't really anticipate the need for any Fourier transforms, though that could be nice. The mouth really just needs to move around when there's audio (I was thinking of basing it on volume).
Any tips on how to implement something like this, or pointers to resources, would be much appreciated. Thanks!
-S
Whatever you do, you're going to need to decode the MP3s into PCM data first. There are a number of third-party libraries that can do this for you. Then, you'll need to analyze the PCM data and do some signal processing on it.
Automatically generating realistic lipsync data from audio is a very hard problem, and you're wise to not try to tackle it. I like your idea of simply basing it on the volume. One way you could compute the current volume is to use a rolling window of some size (e.g. 1/16 second), and compute the average power in the sound wave over that window. That is, at frame T, you compute the average power over frames [T-N, T], where N is the number of frames in your window.
Thanks to Parseval's theorem, we can easily compute the power in a wave without having to take the Fourier transform or anything complicated -- the average power is just the sum of the squares of the PCM values in the window, divided by the number of frames in the window. Then, you can convert the power into a decibel rating by dividing it by some base power (which can be 1 for simplicity), taking the logarithm, and multiplying by 10.
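As an illustration of that windowed-power calculation (the window size and the dB-to-mouth mapping are just example choices), assuming the MP3 has already been decoded into a float array pcm sampled at sr:

import numpy as np

def window_db(pcm, sr, t, window=1.0 / 16):
    # Average power, in dB, over the window ending at time t (seconds).
    n = int(window * sr)                              # frames in the rolling window
    end = int(t * sr)
    chunk = pcm[max(end - n, 0):end]
    if len(chunk) == 0:
        return -np.inf
    power = np.mean(chunk.astype(np.float64) ** 2)    # sum of squares / frame count
    return 10.0 * np.log10(power + 1e-12)             # dB relative to a base power of 1

# Per animation frame, map loudness to mouth openness, e.g.:
# openness = np.clip((window_db(pcm, sr, t) + 40.0) / 40.0, 0.0, 1.0)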
