Voice Activity Detection - python-3.x

I am having a problem getting a binary result from webrtcvad for an audio file in WAV format. I am using librosa to load the .wav file. Can anyone tell me how to use librosa along with webrtcvad to get a binary output indicating whether the audio contains speech or not?
Webrtcvad module works correctly with the wave module
The above link helped me a lot, but I am still confused: it contains a good explanation, yet I get a lot of errors during implementation.

py-webrtcvad expects the audio data to be 16-bit PCM little-endian, which is the most common storage format in WAV files.
librosa and its underlying I/O library pysoundfile, however, return floating-point arrays in the range [-1.0, 1.0] by default. To convert this to bytes containing 16-bit PCM, you can use the following float_to_pcm16 function.
I have tested the read_pcm16 function as a direct replacement for read_wave in the official py-webrtcvad example. It also allows opening any audio file supported by soundfile (WAV, FLAC, OGG, etc.).
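import numpy
import soundfile

def float_to_pcm16(audio):
    # Clip to [-1.0, 1.0] so out-of-range floats cannot wrap around in int16
    clipped = numpy.clip(audio, -1.0, 1.0)
    # '<i2' is signed 16-bit little-endian, the layout py-webrtcvad expects
    ints = (clipped * 32767).astype('<i2')
    # tobytes() replaces the deprecated tostring()
    return ints.tobytes()

def read_pcm16(path):
    audio, sample_rate = soundfile.read(path)
    # webrtcvad only supports these sample rates
    assert sample_rate in (8000, 16000, 32000, 48000)
    # webrtcvad also expects mono audio; a stereo file needs downmixing first
    pcm_data = float_to_pcm16(audio)
    return pcm_data, sample_rate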
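To get from PCM bytes like those returned by read_pcm16 to a binary speech/no-speech answer, the audio still has to be cut into frames, because webrtcvad only classifies frames of 10, 20 or 30 ms. Below is a minimal sketch of that step. The helper names split_frames, speech_flags and contains_speech, and the min_ratio threshold, are illustrative assumptions and not part of py-webrtcvad; the classifier is passed in as a callable so the framing logic stays independent of the library.

```python
def split_frames(pcm_bytes, sample_rate, frame_ms=30):
    """Yield consecutive frames of 16-bit mono PCM, each frame_ms long."""
    # webrtcvad accepts only 10, 20 or 30 ms frames
    assert frame_ms in (10, 20, 30)
    bytes_per_frame = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    # A trailing partial frame is dropped
    last_start = len(pcm_bytes) - bytes_per_frame
    for start in range(0, last_start + 1, bytes_per_frame):
        yield pcm_bytes[start:start + bytes_per_frame]


def speech_flags(pcm_bytes, sample_rate, is_speech, frame_ms=30):
    """Return one 0/1 flag per frame, using is_speech(frame, sample_rate)."""
    return [int(is_speech(frame, sample_rate))
            for frame in split_frames(pcm_bytes, sample_rate, frame_ms)]


def contains_speech(flags, min_ratio=0.1):
    """Hypothetical rule: the clip counts as speech if enough frames are voiced."""
    return bool(flags) and sum(flags) / len(flags) >= min_ratio
```

With py-webrtcvad installed, the callable would be the is_speech method of a Vad object, for example vad = webrtcvad.Vad(2) followed by flags = speech_flags(pcm_data, sample_rate, vad.is_speech). If the audio is loaded with librosa instead of soundfile, librosa.load(path, sr=16000) resamples to a supported rate and downmixes to mono before the float_to_pcm16 conversion.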

Related

how to recognize some special sound

How can I compare the mic stream data with wave file data to find out whether the file's sound is mixed into the stream data?
I'm using pyaudio and FFT to get audio frequency data frames. Are there any libraries for audio recognition? I do not need speech detection, just detection of certain sounds saved in files.

Wav audio level is too large

I have a mono wav file of a 'glass breaking' sound. When I graphically display its levels in Python using the librosa library, it shows a very large range of amplitudes, between +/- 20000 instead of +/- 1. When I open the same wav file with Audacity, the levels are between +/- 1.
My question is what generates this difference in displayed amplitude levels and how can I correct it in Python? MinMax scaling will distort the sound and I want to avoid it if possible.
The code is:
from scipy.io import wavfile
fs1, glass_break_data = wavfile.read('test_break_glass_normalized.wav')
%matplotlib inline
import matplotlib.pyplot as plt
import librosa.display
sr=44100
x = glass_break_data.astype('float')
plt.figure(figsize=(14, 5))
librosa.display.waveplot(x, sr=sr)
These are the images from the notebook and Audacity:
WAV usually uses integer values to represent individual samples, not floats. So what you see in the librosa plot is accurate for a 16 bit/sample audio file.
Programs like VLC show the format, including bit depth per sample in their info dialog, so you can easily check.
Another way to check the format might be using soxi or ffmpeg.
Audacity normalizes everything to floats in the range of -1 to 1; it does not show you the original format.
The same is true for librosa.load(): it also normalizes to [-1, 1]. wavfile.read(), on the other hand, does not normalize. For more info on ways to read WAV audio, please see for example this answer.
If you use librosa.load instead of wavfile.read, it will normalize the range to [-1, 1]:
glass_break_data, fs1 = librosa.load('test_break_glass_normalized.wav')
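If you prefer to keep wavfile.read, the integer samples can be rescaled by a fixed constant instead of min-max scaling. The sketch below assumes the array comes from scipy.io.wavfile.read; the function name pcm_to_float is an illustrative choice. Because every sample is divided by the same full-scale value, the shape of the waveform is unchanged and only the axis units differ.

```python
import numpy as np


def pcm_to_float(samples):
    """Scale integer PCM samples to floats in [-1.0, 1.0)."""
    samples = np.asarray(samples)
    if samples.dtype.kind == 'f':
        # Already floating point, nothing to rescale
        return samples.astype(np.float32)
    if samples.dtype == np.uint8:
        # 8-bit WAV is unsigned with the zero level at 128
        return (samples.astype(np.float32) - 128.0) / 128.0
    # Signed integers: divide by the magnitude of the most negative value,
    # e.g. 32768 for int16
    full_scale = float(-int(np.iinfo(samples.dtype).min))
    return samples.astype(np.float32) / full_scale
```

For the code in the question this would mean x = pcm_to_float(glass_break_data), and passing sr=fs1 to the plot rather than a hard-coded 44100.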

What data should I write to SDL audio callback buffer?

I am learning how to generate wave audio by using SDL2.0.
When I init the SDL audio, it asks me to provide an SDL_AudioFormat, which specifies the audio format, and a callback function, which is called when the audio system needs more data.
The SDL documentation lists many audio formats, but gives no information about what actual data I should write to the callback buffer.
I tested these formats:
float with Sine: (-1,1)
S8(signed byte) with square wave: [-128, 127]
U16(unsigned short): [-32768, 32767]
All of them worked.
The question is that I don't know what exactly these audio formats mean.
Can somebody give me some information about it?

How does an audio converter work?

I currently have the idea to code a small audio converter (e.g. FLAC to MP3 or m4a format) application in C# or Python but my problem is I do not know at all how audio conversion works.
After some research, I read about analog-to-digital and digital-to-analog converters, but I guess this would be digital-to-digital or something like that, wouldn't it?
If someone could precisely explain how it works, it would be greatly appreciated.
Thanks.
digital audio in its raw form is called PCM, the format fundamental to any audio processing system ... it's uncompressed ... just a series of integers representing the height of the audio curve for each sample of the curve (the Y axis where time is the X axis along this curve)
... this PCM audio can be compressed using some codec then bundled inside a container, often together with video or metadata channels ... so to convert audio from A to B you would first need to understand the container spec as well as the compressed audio codec so you can decompress audio A into PCM format ... then do the reverse ... compress the PCM into the codec of B then bundle it into the container of B
Before venturing further into this I suggest you master the art of WAVE audio files ... the beauty of WAVE is that in its simplest form it's just a 44 byte header followed by the uncompressed integers of the audio curve ... write some code to read a WAVE file then parse the header (identify bit depth, sample rate, channel count, endianness) to enable you to iterate across each audio sample for each channel ... prove that it's working by sending your bytes into an output WAVE file ... diff input WAVE against output WAVE as they should be identical ... once mastered you are ready to venture into your above stated goal ... do not skip over grokking the notion of interleaving stereo audio, as well as spreading out a single audio sample which has a bit depth of 16 bits across two bytes of storage, and the reverse, namely stitching together multiple bytes into a single integer with a bit depth of 16, 24 or even 32 bits while keeping endianness squared away ... this may sound scary at first however all necessary details are on the net, as that's how I taught myself this level of detail
modern audio compression algorithms leverage knowledge of how people perceive sound to discard information which is indiscernible (lossy), as opposed to lossless algorithms which retain all the informational load of the source ... opus (http://opus-codec.org/) is a current favorite codec that is royalty-free and open source
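The WAVE exercise suggested above can be sketched in Python using only the standard library. This is a minimal sketch under a stated assumption: the file has the canonical 44 byte PCM header with no extra chunks, which is not guaranteed for files found in the wild. The function names parse_wav_header and deinterleave_int16 are illustrative.

```python
import io
import struct
import wave


def parse_wav_header(data):
    """Parse a canonical 44-byte PCM WAVE header (assumes no extra chunks)."""
    (riff, _size, wave_id, fmt_id, _fmt_size, audio_format, channels,
     sample_rate, _byte_rate, _block_align, bits, data_id, data_size
     ) = struct.unpack('<4sI4s4sIHHIIHH4sI', data[:44])
    assert (riff, wave_id, fmt_id, data_id) == (b'RIFF', b'WAVE', b'fmt ', b'data')
    assert audio_format == 1  # 1 means uncompressed PCM
    return channels, sample_rate, bits, data_size


def deinterleave_int16(payload, channels):
    """Stitch little-endian byte pairs into signed ints, one list per channel."""
    count = len(payload) // 2
    samples = struct.unpack('<%dh' % count, payload[:count * 2])
    # Interleaved layout is L R L R ..., so every channels-th sample belongs together
    return [list(samples[ch::channels]) for ch in range(channels)]


# Build a tiny stereo file in memory: left = 1, 2, 3 and right = -1, -2, -3
buf = io.BytesIO()
with wave.open(buf, 'wb') as w:
    w.setnchannels(2)
    w.setsampwidth(2)      # 2 bytes = 16 bits per sample
    w.setframerate(8000)
    w.writeframes(struct.pack('<6h', 1, -1, 2, -2, 3, -3))
raw = buf.getvalue()

channels, sample_rate, bits, data_size = parse_wav_header(raw)
left, right = deinterleave_int16(raw[44:44 + data_size], channels)
```

Writing the parsed samples back out and comparing the bytes against the input is then the round-trip check described above.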

Create audio file from samples of amplitude

If I have a text file of sample amplitudes (0-26522), how can I create a playable audio file from them?
I have a vague recollection of tinkering with .pcm files and 8-bit samples way back in the nineties.
Is there any software to automatically create an audio file (PCM or other format) from my samples? I found SoX, but even after looking at the documentation I can't figure out if it can do what I want, and if so, how...
There is a GUI audio workstation called Audacity that lets you do this:
File -> Import -> Raw Data
Encoding: Signed 16-bit PCM // even though your ints are unsigned it still works
Byte order: little endian
Channels: 1 channel (mono)
then just hit Import
To confirm this works, in a text editor I did a cut and paste (followed by select all, then paste, paste, paste, paste) of the list of ints below about 10 times to generate several thousand ints in a vertical column ... this is my toy input file ... after the above Import, just save by doing
File -> Export Audio
where you choose the output format (mp3, aac, PCM, ...) ... once I did this the output mp3 was playable ... using my toy input file I did hear a sine tone
3
305
20294
11029
585
3
305
20294
11029
585
3
305
20294
11029
585
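If the samples are stored as decimal text, one number per line as in the list above, a scripted alternative is to parse them and write a WAV file directly. The following is a minimal sketch using only the Python standard library. The function names, the file names in the usage note and the default sample rate of 8000 Hz are assumptions; the real rate has to be whatever the samples were captured at, otherwise the audio plays too fast or too slow.

```python
import struct
import wave


def read_amplitudes(text):
    """Parse whitespace-separated integers, e.g. one sample per line."""
    return [int(token) for token in text.split()]


def amplitudes_to_wav(values, path, sample_rate=8000):
    """Write unsigned amplitudes as a mono 16-bit WAV file.

    sample_rate is an assumption; use the rate the samples were captured at.
    """
    # Centre around zero so the unsigned values do not carry a DC offset
    midpoint = (max(values) + min(values)) // 2
    # Clamp to the signed 16-bit range as a safety net
    centred = [max(-32768, min(32767, v - midpoint)) for v in values]
    with wave.open(path, 'wb') as w:
        w.setnchannels(1)
        w.setsampwidth(2)          # 2 bytes = 16 bits per sample
        w.setframerate(sample_rate)
        w.writeframes(struct.pack('<%dh' % len(centred), *centred))
```

A call might look like amplitudes_to_wav(read_amplitudes(open('samples.txt').read()), 'out.wav'), where both file names are placeholders.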
