Wav audio level is too large - python-3.x

I have a mono wav file of a 'glass breaking' sound. When I graphically display its levels in Python using the librosa library, it shows a very large range of amplitudes, between +/- 20000, instead of +/- 1. When I open the same wav file with Audacity, the levels are between +/- 1.
My question is: what causes this difference in displayed amplitude levels, and how can I correct it in Python? Min-max scaling would distort the sound, and I want to avoid it if possible.
The code is:
from scipy.io import wavfile
fs1, glass_break_data = wavfile.read('test_break_glass_normalized.wav')  # integer samples straight from the file

%matplotlib inline
import matplotlib.pyplot as plt
import librosa.display

sr = 44100
x = glass_break_data.astype('float')  # cast to float, but the values keep their integer scale
plt.figure(figsize=(14, 5))
librosa.display.waveplot(x, sr=sr)
These are the images from the notebook and Audacity:

WAV usually uses integer values to represent individual samples, not floats. So what you see in the librosa plot is accurate for a 16 bit/sample audio file.
Programs like VLC show the format, including bit depth per sample in their info dialog, so you can easily check.
Another way to check the format might be using soxi or ffmpeg.
Audacity normalizes everything to floats in the range of -1 to 1—it does not show you the original format.
The same is true for librosa.load()—it also normalizes to [-1,1]. wavfile.read() on the other hand, does not normalize. For more info on ways to read WAV audio, please see for example this answer.

If you use librosa.load instead of wavfile.read, it will normalize the range to [-1, 1]:
glass_break_data, fs1 = librosa.load('test_break_glass_normalized.wav')
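Note that librosa.load also resamples to 22050 Hz by default; pass sr=None to keep the original rate. Alternatively, if you want to stay with wavfile.read, you can scale the integer samples yourself. A minimal sketch, assuming the file is 16-bit integer PCM (dividing by a constant only changes the units, not the waveform shape, so nothing is distorted):
import numpy as np
from scipy.io import wavfile

fs1, glass_break_data = wavfile.read('test_break_glass_normalized.wav')

# Divide by the integer dtype's full-scale value (32768 for int16)
# to map the samples into [-1.0, 1.0].
full_scale = np.iinfo(glass_break_data.dtype).max + 1
x = glass_break_data.astype(np.float32) / full_scale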

Related

Meaning of sample values in a wav file

For a school project, I am supposed to analyze a short sound recording in WAV format. I am done with the project: I DFT'd it, filtered out unwanted frequencies, and got the correct result. What eludes me, though, is the meaning of the values of the individual samples of my WAV file. I have tens of thousands of samples that look like this:
[ 0.06234258 0.16020246 0.14122963 ... -0.01704375 -0.08993937 -0.09293508]
However, no matter what number I multiply these values by, the resulting sound sounds the same. If I multiply every sample by 1000, it sounds just as it did before. The same goes for dividing. So what do these samples mean, if not volume?
EDIT:
Here is the code I'm using:
import soundfile as sf
import IPython
samples, sampling_freq = sf.read('recording.wav')
IPython.display.display(IPython.display.Audio(samples, rate=sampling_freq )) #This one displays a playable bar.
The samples (basically a long array of floating-point numbers) in the file are the Pulse Code Modulated (PCM) data representing the audio.
Given that audio players use this data to recreate the original audio wave, multiplying every sample by some factor should increase the volume. However, some audio players scale down (re-normalize) the samples to prevent clipping, which is likely why it sounds the same.
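As a small illustration of that re-normalization effect (the sample values below are made up; in practice they would come from sf.read), a constant scaling factor cancels out once each signal is peak-normalized:
import numpy as np

# Toy samples standing in for the data returned by sf.read('recording.wav')
samples = np.array([0.062, 0.160, 0.141, -0.017, -0.090, -0.093])
louder = samples * 1000.0  # multiply every sample by a constant factor

# A player that peak-normalizes rescales each signal to the same maximum,
# so the constant factor cancels and both versions sound identical.
print(np.allclose(samples / np.max(np.abs(samples)),
                  louder / np.max(np.abs(louder))))   # True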
The ideal way to visualize the audio is Audacity, which can show the audio waveform in real time.

Voice Activity Detection

I am having a problem while trying to get a binary result using webrtcvad on a WAV-format audio file. I am using librosa to load the audio file in .wav format. Can anyone tell me how to use librosa along with webrtcvad to get a binary output of whether the audio contains speech or not?
Webrtcvad module works correctly with the wave module
The above link helped me a lot, but I am still confused: it contains a good explanation, yet a lot of errors come up during implementation.
py-webrtcvad expects the audio data to be 16-bit little-endian PCM, which is the most common storage format in WAV files.
librosa and its underlying I/O library pysoundfile, however, always return floating-point arrays in the range [-1.0, 1.0]. To convert this to bytes containing 16-bit PCM you can use the following float_to_pcm16 function.
I have tested the read_pcm16 function as a direct replacement for read_wave in the official py-webrtcvad example, while allowing any audio file supported by soundfile (WAV, FLAC, OGG, etc.) to be opened.
import numpy
import soundfile

def float_to_pcm16(audio):
    # Scale floats in [-1.0, 1.0] to signed 16-bit integers,
    # then serialize as little-endian bytes for py-webrtcvad.
    ints = (audio * 32767).astype(numpy.int16)
    little_endian = ints.astype('<i2')
    buf = little_endian.tobytes()
    return buf

def read_pcm16(path):
    audio, sample_rate = soundfile.read(path)
    assert sample_rate in (8000, 16000, 32000, 48000)
    pcm_data = float_to_pcm16(audio)
    return pcm_data, sample_rate
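A minimal usage sketch (the file name and the 30 ms frame length are assumptions, not part of the answer; webrtcvad only accepts 10, 20 or 30 ms frames of mono 16-bit PCM at one of the supported sample rates):
import webrtcvad

pcm_data, sample_rate = read_pcm16('speech.wav')  # hypothetical mono input file
vad = webrtcvad.Vad(2)                            # aggressiveness 0 (least) to 3 (most)

frame_ms = 30
frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
for offset in range(0, len(pcm_data) - frame_bytes + 1, frame_bytes):
    frame = pcm_data[offset:offset + frame_bytes]
    print(vad.is_speech(frame, sample_rate))          # True if the frame contains speech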

numpy ifft output has much larger power than original signal

I'm having a weird problem using the numpy fft module. I have the following bit of test code:
import numpy as np
import scipy.io.wavfile
import matplotlib.pyplot as plt
fs, a = scipy.io.wavfile.read('test.wav') # import audio file
spectrum = np.fft.fft(a) # create spectrum
b = np.real(np.fft.ifft(spectrum)) # reconstruct signal
# Print power of original and output signal
print(np.average(a**2))
print(np.average(b**2))
It outputs:
1497.887578558565
4397203.934254291
As these values suggest, the output is much louder than the input. The documentation for numpy.fft.ifft states:
"This function computes the inverse of the one-dimensional n-point discrete Fourier transform computed by fft. In other words, ifft(fft(a)) == a to within numerical accuracy."
Thus the two signals should be nearly identical, yet they obviously are not.
What am I doing wrong here?
Okay, I managed to find the solution myself in the end.
The problem arises because the output of wavfile.read is an integer (int16) array. The FFT itself is fine: ifft(fft(a)) does reconstruct the signal. What goes wrong is the power calculation: a**2 on an int16 array overflows and wraps around, so np.average(a**2) reports a far smaller value than the true power, while b is a float64 array and b**2 is computed correctly. Typecasting a to np.float64 before the comparison fixes it.
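For reference, a minimal sketch of the corrected comparison (same file name as in the question):
import numpy as np
import scipy.io.wavfile

fs, a = scipy.io.wavfile.read('test.wav')
a = a.astype(np.float64)               # work in floating point to avoid int16 overflow

spectrum = np.fft.fft(a)
b = np.real(np.fft.ifft(spectrum))

print(np.average(a**2))                # both averages now agree
print(np.average(b**2))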

Isolating the head in a grayscale CT image using Python

I am dealing with CT images that contain the head of the patient but also 'shadows' of the metallic cylinder.
These 'shadows' can appear at the bottom, left, or right of the image. In the image above the shadow appears only at the bottom; in the image below it appears on the left and right sides. I don't have any prior knowledge of whether there is a shadow of the cylinder in the image. I must somehow detect it and remove it; then I can proceed to segment out the skull/head.
To create a reproducible example I would like to provide the numpy array (128x128) representing the image but I don't know how to upload it to stackoverflow.
How can I achieve my objective?
I tried segmentation with ndimage and scikit-image but it does not work. I am getting too many segments.
12 Original Images
The 12 Images Binarized
The 12 Images Stripped (with dilation, erosion = 0.1, 0.1)
The images marked in red cannot be used to create a rectangular mask that envelops the skull, which is my ultimate objective.
Please note that I will not be able to inspect the images one by one during the application of the algorithm.
You could use a combination of erosion (with an appropriate number of iterations) to remove the thin details, followed by dilation (also with an appropriate number of iterations) to restore the non-thin details to approximately the original size.
In code, this would look like:
import io
import requests
import numpy as np
import scipy as sp
import PIL as pil
import scipy.ndimage
import matplotlib.pyplot as plt

# : load the data
url = 'https://i.stack.imgur.com/G4cQO.png'
response = requests.get(url)
img = pil.Image.open(io.BytesIO(response.content)).convert('L')
arr = np.array(img)
mask_arr = arr.astype(bool)

# : strip thin objects: erode to remove them, then dilate to restore the rest
struct = None        # default structuring element
n_erosion = 6
n_dilation = 7
strip_arr = sp.ndimage.binary_dilation(
    sp.ndimage.binary_erosion(mask_arr, struct, n_erosion),
    struct, n_dilation)

# : display the original mask, the stripped mask, and their difference
for image in (mask_arr, strip_arr, mask_arr ^ strip_arr):
    plt.figure()
    plt.imshow(image, cmap='gray')
    plt.show()
Starting from this image (mask_arr):
One would get to this image (strip_arr):
The difference being (mask_arr ^ strip_arr):
EDIT
(addressing the issues raised in the comments)
Using a different input, for example a binarization of the original image with a much lower threshold, will help retain larger, non-thin details of the head that do not disappear during erosion.
Alternatively, you may get more robust results by fitting an ellipse to the head.
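Since the ultimate objective is a rectangular mask enveloping the skull, here is a minimal sketch (not part of the original answer) that builds such a mask from the bounding box of the largest connected component of strip_arr from the code above:
import numpy as np
from skimage.measure import label, regionprops

# Take the largest connected component of the stripped mask and use its
# bounding box as a rectangular mask around the head.
labels = label(strip_arr)
head = max(regionprops(labels), key=lambda p: p.area)
min_row, min_col, max_row, max_col = head.bbox

rect_mask = np.zeros_like(strip_arr, dtype=bool)
rect_mask[min_row:max_row, min_col:max_col] = True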
Rather than "pure" image processing, like Ander Biguri above, I'd suggest maybe a different approach (actually two).
The concept here is to not rely on purely algorithmic image processing, but leverage the knowledge of the specifics of the situation you have:
1) Given that the container is metal (as you stated), another approach that might be a lot easier is simple thresholding based on the specific HU number of the metal frame.
While you show the images as simple greyscale, in reality CT images are 16-bit images that are window-levelled when viewed as an 8-bit (256-level) greyscale representation, so the pictures above do not represent the full information available in the image data, which is actually 16-bit.
The metal frame would likely have an HU value significantly different from (higher than) anything within the anatomy. If that is the case, then simple thresholding followed by subtraction would be a much simpler way to remove it.
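A rough sketch of that idea (the file name, variable names and threshold are assumptions, not part of the answer; the threshold must be tuned to the actual HU values of the frame):
import numpy as np

# Hypothetical input: a raw CT slice in Hounsfield units.
ct = np.load('ct_slice.npy')

metal_threshold = 2000            # HU; metal usually sits far above any anatomy
metal_mask = ct > metal_threshold

cleaned = ct.copy()
cleaned[metal_mask] = -1000       # replace frame pixels with the HU value of air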
2) Another approach would also be based on considering the geometry and properties of the specific situation you have:
In the images above, you could look at a vertical profile up the middle of your image (column-wise) to find the location of the frame, that location being the point where the vertical profile crosses into an HU value matching the frame.
From that point, you could use a flood-fill approach (e.g. scikit-image's flood_fill) to find all connected points within a certain tolerance.
That also would give you a set of points (mask) matching the frame that you could use to remove it from the original image.
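A hedged sketch of that second approach, using skimage.segmentation.flood and reusing the hypothetical ct array and metal_threshold from the sketch above (it assumes the frame actually crosses the centre column and that a tolerance of 200 HU is reasonable):
import numpy as np
from skimage.segmentation import flood

col = ct.shape[1] // 2                                 # centre column
frame_rows = np.where(ct[:, col] > metal_threshold)[0]
seed = (int(frame_rows[0]), col)                       # first frame pixel along that column

frame_mask = flood(ct, seed, tolerance=200)            # connected region within +/-200 HU of the seed
cleaned = np.where(frame_mask, -1000, ct)              # remove the frame from the image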
I'm thinking that either of these approaches would be both faster and more robust for the situation you're proposing.

Librosa generated waveplots are flat for certain audio sounds

Certain wave plots generated by librosa’s display module are just flat lines that fill the entire axes.
I used native sampling rates to load some wav files into librosa, and my dataset is a mix of stereo and mono files. I know the wave plots are incorrect because they look nothing like the waveform display of the same files in Audacity.
I've tried playing with the figure width, height and DPI, but the generated waveplots do not improve. Below is the waveplot generated by librosa for one of these audio files and the expected wave plot in Audacity.
Librosa Waveplot
Audacity Waveplot
The code used to generate the plot is derived from the librosa documentation:
sound, sr = librosa.load(input_dir, sr=None)
matplotlib.pyplot.figure(figsize=(width, height), dpi=dpi)
librosa.display.waveplot(numpy.array(sound), sr=sr)
matplotlib.pyplot.tight_layout()
