How to detect input audio existence and perform an action whenever it exists? - python-3.x

I checked PyAudio, but it offers the ability to record the input and manipulate it; I just want to perform an action when audio input exists.

You can implement simple input audio detection using PyAudio. You just need to decide what you mean by audio existence.
In the following example code I have used a simple root mean square (RMS) calculation with a threshold. Another option is a peak test, which just compares the amplitude of each audio sample with a peak amplitude threshold; a minimal sketch of that variant follows the example below. Which one is most useful for you depends on the application.
You can play around with the threshold value (i.e. the minimum amplitude or loudness of audio) and the chunk size (i.e. the latency of the audio detection) to get the behaviour you want.
import math
import struct

import pyaudio

RATE = 44100
CHUNK = 1024
AUDIO_EXISTENCE_THRESHOLD = 1000

def detect_input_audio(data, threshold):
    if not data:
        return False
    rms = math.sqrt(sum(x**2 for x in data) / len(data))
    return rms > threshold

audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, input=True,
                    rate=RATE, frames_per_buffer=CHUNK)

data = []
# Wait until the RMS level of a chunk exceeds the threshold
while not detect_input_audio(data, AUDIO_EXISTENCE_THRESHOLD):
    chunk = stream.read(CHUNK)
    # Unpack the raw bytes into signed 16-bit samples
    data = struct.unpack('<{}h'.format(len(chunk) // 2), chunk)

# Do something when input audio exists
# ...

stream.stop_stream()
stream.close()
audio.terminate()
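If the peak test mentioned above fits your application better, a minimal sketch (using the same unpacked 16-bit samples as in the example) could look like this:
def detect_input_audio_peak(data, threshold):
    # Trigger as soon as any single sample exceeds the threshold,
    # instead of looking at the RMS level of the whole chunk.
    return bool(data) and max(abs(x) for x in data) > threshold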

Related

How to set individual image display durations with ffmpeg-python

I am using ffmpeg-python 0.2.0 with Python 3.10.0. Displaying videos in VLC 3.0.17.4.
I am making an animation from a set of images. Each image is displayed for a different amount of time.
I have the basics in place with inputting images and concatenating streams, but I can't figure out how to correctly set frame duration.
Consider the following example:
stream1 = ffmpeg.input(image1_file)
stream2 = ffmpeg.input(image2_file)
combined_streams = ffmpeg.concat(stream1, stream2)
output_stream = ffmpeg.output(combined_streams, output_file)
ffmpeg.run(output_stream)
With this I get a video with a duration of a split second that barely shows an image before ending, which is to be expected with two individual frames.
For this example, my goal is to have a video of 5 seconds total duration, showing the image in stream1 for 2 seconds and the image in stream2 for 3 seconds.
Attempt 1: Setting t for inputs
stream1 = ffmpeg.input(image1_file, t=2)
stream2 = ffmpeg.input(image2_file, t=3)
combined_streams = ffmpeg.concat(stream1, stream2)
output_stream = ffmpeg.output(combined_streams, output_file)
ffmpeg.run(output_stream)
With this, I get a video with the duration of a split second and no image displayed.
Attempt 2: Setting frames for inputs
stream1 = ffmpeg.input(image1_file, frames=48)
stream2 = ffmpeg.input(image2_file, frames=72)
combined_streams = ffmpeg.concat(stream1, stream2)
output_stream = ffmpeg.output(combined_streams, output_file, r=24)
ffmpeg.run(output_stream)
In this case, I get the following error from ffmpeg:
Option frames (set the number of frames to output) cannot be applied to input url ########## -- you are trying to apply an input option to an output file or vice versa. Move this option before the file it belongs to.
I can't tell if this is a bug in ffmpeg-python or if I did it wrong.
Attempt 3: Setting framerate for inputs
stream1 = ffmpeg.input(image1_file, framerate=1/2)
stream2 = ffmpeg.input(image2_file, framerate=1/3)
combined_streams = ffmpeg.concat(stream1, stream2)
output_stream = ffmpeg.output(combined_streams, output_file)
ffmpeg.run(output_stream)
With this, I get a video with the duration of a split second and no image displayed. However, when I set both framerate values to 1/2, I get an animation of 4 seconds duration that displays the first image for two seconds and the second image for two seconds. This is the closest I got to a functional solution, but it is not quite there.
I am aware that multiple images can be globbed by input, but that would apply the same duration setting to all images, and my images each have different durations, so I am looking for a different solution.
Any ideas for how to get ffmpeg-python to do this are much appreciated.
A simple solution is adding loop=1 and framerate=24 to the code from Attempt 1 (setting t for inputs):
import ffmpeg
image1_file = 'image1_file.png'
image2_file = 'image2_file.png'
output_file = 'output_file.mp4'
stream1 = ffmpeg.input(image1_file, framerate=24, t=2, loop=1)
stream2 = ffmpeg.input(image2_file, framerate=24, t=3, loop=1)
combined_streams = ffmpeg.concat(stream1, stream2)
output_stream = ffmpeg.output(combined_streams, output_file)
ffmpeg.run(output_stream)
loop=1 - Makes the input image repeat in a loop (the looped duration is limited by t=2 and t=3).
framerate=24 - Images don't have a framerate (as opposed to video), so they get the default framerate (25fps) if framerate is not specified. Assuming the desired output framerate is 24fps, we may set the input framerate to 24fps as well, which also prevents a framerate conversion.
You need to manipulate the timestamp of the source images and use the ts_from_file option of the image2 demuxer:
ts_from_file
If set to 1, will set frame timestamp to modification time of image file. Note that monotonity of timestamps is not provided: images go in the same order as without this option. Default value is 0. If set to 2, will set frame timestamp to the modification time of the image file in nanosecond precision.
You should be able to use os.utime if it's OK to modify the original files, or shutil.copy2 to copy them and modify the copies; a rough sketch of that approach follows.
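For illustration, a minimal sketch of that idea (the file names, frame numbering and timestamps are made up for the example, not taken from the question) might look like this:
import os
import shutil
import ffmpeg

# (image file, time in seconds at which it should appear)
images = [('image1_file.png', 0), ('image2_file.png', 2)]

workdir = 'frames'
os.makedirs(workdir, exist_ok=True)

# Copy each image into a numbered sequence and stamp its modification
# time with the moment it should appear in the output.
for i, (src, start) in enumerate(images):
    dst = os.path.join(workdir, 'frame%03d.png' % i)
    shutil.copy2(src, dst)
    os.utime(dst, (start, start))

# Read the sequence with the image2 demuxer, taking frame timestamps
# from the files' modification times.
stream = ffmpeg.input(os.path.join(workdir, 'frame%03d.png'), ts_from_file=1)
ffmpeg.run(ffmpeg.output(stream, 'output_file.mp4'))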

torchaudio load audio with specific sampling rate

From the documentation (https://pytorch.org/audio/stable/backend.html#torchaudio.backend.sox_io_backend.load), it seems there is no parameter for loading audio with a fixed sampling rate, which is important for training models.
How to load a pytorch audio tensor with a fixed sampling rate with torchaudio?
Resample can be used from transforms.
import torchaudio
from torchaudio import transforms

waveform, sample_rate = torchaudio.load('test.wav', normalize=True)
# resample to one tenth of the original rate (replace with your target rate)
transform = transforms.Resample(sample_rate, sample_rate / 10)
waveform = transform(waveform)
You can also resample with torchaudio.functional.resample:
import torchaudio

new_sr = 16000  # target sampling rate
arr, org_sr = torchaudio.load('path')
arr = torchaudio.functional.resample(arr, orig_freq=org_sr, new_freq=new_sr)
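Putting the two steps together, a small helper that always returns audio at a fixed rate could look like this (a sketch; load_with_sample_rate and the 16 kHz default are just example names and values):
import torchaudio
import torchaudio.functional as F

def load_with_sample_rate(path, target_sr=16000):
    # Load the file at its native rate, then resample only if needed.
    waveform, sr = torchaudio.load(path)
    if sr != target_sr:
        waveform = F.resample(waveform, orig_freq=sr, new_freq=target_sr)
    return waveform, target_sr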

Proper usage of tensorflows STFT function

I am trying to construct a plot spectrum of an audio sample similar to the one that is created using Audacity. From Audacity's wiki page, the plot spectrum (attached example) performs:
Plot Spectrum take the audio in blocks of 'Size' samples, does the FFT, and averages all the blocks together.
I was thinking I would use the STFT functionality recently provided by Tensorflow.
I am using audio blocks of size 512, and my code is as follows:
audio_binary = tf.read_file(audio_file)
waveform = tf.contrib.ffmpeg.decode_audio(
    audio_binary,
    file_format="wav",
    samples_per_second=4000,
    channel_count=1
)
stft = tf.contrib.signal.stft(
    waveform,
    512,  # frame_length
    512,  # frame_step
    fft_length=512,
    window_fn=functools.partial(tf.contrib.signal.hann_window, periodic=True),  # matches audacity
    pad_end=True,
    name="STFT"
)
But the result of stft is just an empty array, when I expect the FFT results for each frame (of 512 samples).
What is wrong with the way that I am making this call?
I have verified that waveform audio data is being correctly read with just the regular tf.fft function.
decode_audio returns the waveform with shape [samples, channels], while tf.contrib.signal.stft operates on the innermost (last) dimension, so here it was framing the length-1 channel axis instead of the time axis. Transposing the waveform fixes this:
audio_file = tf.placeholder(tf.string)
audio_binary = tf.read_file(audio_file)
waveform = tf.contrib.ffmpeg.decode_audio(
    audio_binary,
    file_format="wav",
    samples_per_second=sample_rate,  # sample rate of the .wav file
    channel_count=1                  # number of audio channels
)
stft = tf.contrib.signal.stft(
    tf.transpose(waveform),          # stft expects samples along the last axis
    frame_length,
    frame_step,
    fft_length=fft_length,
    window_fn=functools.partial(tf.contrib.signal.hann_window,
                                periodic=False),  # matches audacity
    pad_end=False,
    name="STFT"
)
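To mimic Audacity's Plot Spectrum, which averages all the FFT blocks together, one could then average the magnitudes of the STFT frames; a sketch (not part of the original answer) could be:
# stft has shape [channels, frames, fft_bins]; average the magnitudes over frames
magnitude = tf.abs(stft)
avg_spectrum = tf.reduce_mean(magnitude, axis=1)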

exporting scipy array for speech data to ascii text readable in adobe audition

I read 48 kHz, 16-bit precision PCM speech data using scipy's WAV read functionality.
Next, I perform these steps in order: decimation -> normalisation.
Decimation and normalisation are done as follows:
yiir = scipy.signal.decimate(topRightChn, 3)
timeSerDownSmpldSig = N.array(yiir)
factor = 2**16
normtimeSerDownSmpldSig = normalise(timeSerDownSmpldSig, factor)
My decimated (or downsampled) signal is supposed to be 16 kHz (hence the downsampling factor of 3 above). Now, I want to view the normalised downsampled numpy array normtimeSerDownSmpldSig in Adobe Audition.
What steps in Python and/or Adobe Audition do I need to perform? How can I use the savetxt function to view the above array in Adobe Audition?
My yiir signal values look like the following:
Downsampled signal yiir, first 10 values:
[ -6.95990948e-05  -2.71091920e-02  -3.76441923e-01  -5.65301893e-01
   1.59163252e-01  -2.44745081e+00  -4.11047340e+00  -2.81722036e+00
  -1.89322873e+00  -2.51526839e+00]
Downsampled signal yiir, last 10 values:
[-1.73357094 -3.41704894 -2.77903517  0.87867336 -2.00060527 -2.63675154
 -5.93578443 -5.70939184 -3.68355598 -4.29757849]
Array signal obtained from IIR decimate in Python:
shape: (6400000,)
type: <class 'numpy.dtype'>
dtype: float64
rows: 6400000
min, max: -875.162306537, 874.341374084
Information on usage in Adobe Audition at this link (page 45) - http://www.newhopechurch.ca/docs/tech/AUDITION.pdf - gives the following:
ASCII Text Data (.txt)
Audio data can be read to or written from files in a standard text format, with each sample separated by a carriage return, and channels separated by a tab character. An optional header can be placed before the data. If there is no header text, then the data is assumed to be 16-bit signed decimal integers. The header is formatted as KEYWORD: value with the keywords being: SAMPLES, BITSPERSAMPLE, CHANNELS, SAMPLERATE, and NORMALIZED. The values for NORMALIZED are either TRUE or FALSE. For example,
SAMPLES: 1582
BITSPERSAMPLE: 16
CHANNELS: 2
SAMPLERATE: 22050
NORMALIZED: FALSE
164 -1372
492 -876
etc...
Options
Choose any of the following:
• Include Format Header places a header before the data.
• Normalized Data normalizes the data between -1.0 and 1.0.
numpy.savetxt does not create WAV files. You can use scipy.io.wavfile.write.
For example, the following creates a WAV file containing a single channel (monophonic). The signal is 3 seconds of a 440 Hz sine wave sampled at 44100 samples per second.
In [18]: import numpy as np
In [19]: from scipy.io import wavfile
In [20]: fs = 44100
In [21]: T = 3.0
In [22]: t = np.linspace(0, T, int(T*fs), endpoint=False)
In [23]: y = np.sin(2*np.pi*440*t)
In [24]: wavfile.write("sine440.wav", fs, y)
Another alternative is wavio.
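If you do want the ASCII text format described in the quoted documentation rather than a WAV file, a rough sketch (assuming normtimeSerDownSmpldSig is a mono array normalised to [-1.0, 1.0] at 16 kHz, as in the question) could be:
import numpy as np

sig = normtimeSerDownSmpldSig  # the normalised, downsampled array from the question
header = ("SAMPLES: {}\n"
          "BITSPERSAMPLE: 16\n"
          "CHANNELS: 1\n"
          "SAMPLERATE: 16000\n"
          "NORMALIZED: TRUE").format(len(sig))
# One sample per line, preceded by the optional format header
np.savetxt("speech_16k.txt", sig, fmt="%.8f", header=header, comments="")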

Kivy/Audiostream microphone input data format

I am playing around with some basics of the Audiostream package for Kivy.
I would like to make a simple online input-filter-output system, for example, take in microphone data, impose a band-pass filter, send to speakers.
However, I can't seem to figure out what data format the microphone input is in or how to manipulate it. In the code below, buf is of type string, but how can I get the data out of it and manipulate it [i.e. function(buf)] to do something like a band-pass filter?
The code currently functions to just send the microphone input directly to the speakers.
Thanks.
from time import sleep
from audiostream import get_input
from audiostream import get_output, AudioSample
#get speakers, create sample and bind to speakers
stream = get_output(channels=2, rate=22050, buffersize=1024)
sample = AudioSample()
stream.add_sample(sample)
#define what happens on mic input with arg as buffer
def mic_callback(buf):
    print 'got', len(buf)
    # HERE: How do I manipulate buf?
    # modified_buf = function(buf)
    # sample.write(modified_buf)
    sample.write(buf)
# get the default audio input (mic on most cases)
mic = get_input(callback=mic_callback)
mic.start()
sample.play()
sleep(3) #record for 3 seconds
mic.stop()
sample.stop()
The buffer is composed of bytes that need to be interpreted as signed shorts. You can use the struct or array module to get the values. In your example, you have 2 channels (L/R). Let's say you want to turn the right channel volume down by 20% (i.e. keep 80% of the original level on the right channel only):
from array import array

def mic_callback(buf):
    # convert our byte buffer into a signed short array
    values = array("h", buf)
    # get the right-channel values only (interleaved as L, R, L, R, ...)
    r_values = values[1::2]
    # reduce by 20%; cast back to int because array("h") only accepts integers
    r_values = [int(x * 0.8) for x in r_values]
    # you can assign only an array to a slice, not a list,
    # so we need to convert the list back to an array
    values[1::2] = array("h", r_values)
    # convert the array back to a byte buffer for the speaker
    sample.write(values.tostring())
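If the end goal is the band-pass filter from the question, the same byte-unpacking idea can be combined with NumPy/SciPy; a rough sketch (assuming 22050 Hz stereo int16 input and an arbitrary 300-3000 Hz pass band, neither of which comes from the original post):
import numpy as np
from scipy.signal import butter, lfilter

# 4th-order Butterworth band-pass, 300-3000 Hz at a 22050 Hz sample rate
b, a = butter(4, [300, 3000], btype="band", fs=22050)

def mic_callback(buf):
    # interleaved L/R int16 samples -> shape (n_frames, 2)
    samples = np.frombuffer(buf, dtype=np.int16).reshape(-1, 2).astype(np.float32)
    # filter each channel along the time axis
    filtered = lfilter(b, a, samples, axis=0)
    sample.write(filtered.astype(np.int16).tobytes())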
