librosa write_wav in mono? - python-3.x

I want to resample a mono recording recorded at 40,000 Hz to 44,100 Hz.
The code below works, but librosa seems to save in stereo, making the file twice the size, which I don't need since I have a lot of samples to process.
So I need to save the result in mono.
Code:
# resampling a .wav file to a specific sample rate
import os

import librosa
import resampy

# this is the sample rate we want
sr_target = 44100

directory_in_str = '/home/hugo/test/'
directory = os.fsencode(directory_in_str)

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith(".wav"):
        file_path = os.path.join(directory_in_str, filename)
        print(file_path)
        # Load the audio file at its native sampling rate
        x, sr_orig = librosa.load(file_path, mono=True, sr=None)
        print("Original sample rate is : ", sr_orig)
        # x is now a 1-d numpy array, with `sr_orig` audio samples per second
        # We can resample this to any sampling rate we like, here 44100 Hz
        y = resampy.resample(x, sr_orig, sr_target)
        file_path_new = os.path.join(directory_in_str + 'new/', filename)
        # write it back
        librosa.output.write_wav(file_path_new, y, sr_target)
Question: I want to save the resampled file in mono; I seem to get stereo, and there is no option to save mono only...

Whether the output is mono or stereo depends on y: if y has the shape (n,), the output is mono; if y has the shape (2, n), the output is stereo. librosa.output.write_wav won't automatically turn a mono signal into stereo.
From your code, your output audio should already be mono: you load with mono=True, and resampy preserves the array shape. The file having twice the size doesn't mean that it's stereo. It is more likely caused by a different data type between the input and output audio, e.g. a 16-bit integer source rewritten as the 32-bit floats that librosa.load returns.
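If file size is the concern, one workaround (a minimal sketch, not from the original answer, using the soundfile package rather than librosa) is to write 16-bit PCM explicitly:

import soundfile as sf

# y is the 1-d (mono) float32 array produced by resampy above;
# PCM_16 halves the file size relative to 32-bit float samples.
sf.write(file_path_new, y, sr_target, subtype='PCM_16')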

Related

How to encode video using H265?

I'm very new to codecs. I have 100 frames and I want to convert them into a video using an H265 encoder. Previously I tried h264, and the generated video had a file size of 2.5 MB. Now when I try hvc1, the output video file is much bigger (65 MB). I also tried hev1 and still get the same 65 MB file. But I've heard that h265 should generate a smaller file compared to h264. Can anyone help me with this problem?
import cv2
import os
import re

image_folder = 'backend/data'
video_name = 'backend/h265/output.mp4'

images = [img for img in os.listdir(image_folder) if img.endswith(".jpg")]

def natural_sort_key(s, _nsre=re.compile('([0-9]+)')):
    return [int(text) if text.isdigit() else text.lower()
            for text in _nsre.split(s)]

sorted_images = sorted(images, key=natural_sort_key)

frame = cv2.imread(os.path.join(image_folder, sorted_images[0]))
height, width, layers = frame.shape

fourcc = cv2.VideoWriter_fourcc(*'hev1')  # hvc1 and hev1 are two codec ids of hevc
video = cv2.VideoWriter(video_name, fourcc, 27, (width, height))

for image in sorted_images:
    video.write(cv2.imread(os.path.join(image_folder, image)))

cv2.destroyAllWindows()
video.release()
Am I using the correct codec IDs for HEVC in this code?
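Not from the original post, but a hedged sanity check worth adding: cv2.VideoWriter can fail to open (or fall back to another codec) when the requested encoder isn't available in the FFmpeg build OpenCV was linked against, which would explain a surprising output size. Frames written to an unopened writer are silently dropped, so it helps to verify:

fourcc = cv2.VideoWriter_fourcc(*'hev1')
video = cv2.VideoWriter(video_name, fourcc, 27, (width, height))
# If this is False, the requested HEVC encoder is likely missing
# from the OpenCV/FFmpeg build and nothing will be written.
if not video.isOpened():
    raise RuntimeError("VideoWriter failed to open with fourcc 'hev1'")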

WAV file written with AudioSegment.export() sounds half the speed of the same file rewritten with soundfile.write

I am currently processing some audio data. I have an audio file that I created by splitting a larger file on silence using pydub.
However, if I take this audio file after exporting it with pydub, convert the AudioSegment's samples to a numpy array, and re-write it using soundfile, the resulting file sounds at about half the speed of the original. What could be going wrong?
import soundfile as sf
import numpy as np
from pydub import AudioSegment, effects
from pydub.silence import split_on_silence  # needed for split_on_silence below
from pathlib import Path

# This code takes a large .mp3 file ("original_audio_mp3") with a sample rate of 44100 Hz;
# original_audio_mp3 and desired_sample_rate are defined elsewhere in the asker's code.
sound = AudioSegment.from_file(original_audio_mp3)
if sound.frame_rate != desired_sample_rate:
    sound = sound.set_frame_rate(desired_sample_rate)  # convert to 16000 Hz sample rate
sound = effects.normalize(sound)  # normalize audio file
dBFS = sound.dBFS  # get decibels relative to full scale
sound_chunks = split_on_silence(sound,
                                min_silence_len=200,      # measured in ms
                                silence_thresh=dBFS - 30  # 30 dB below the file's dBFS counts as "silence"
                                )

# this "audio_segment_0.wav" file came from the above code.
audio_file_path = Path("audio_segment_0.wav")
raw_audio = AudioSegment.from_file(audio_file_path).set_frame_rate(16000)

# append 200 ms of silence to beginning and end of file
raw_audio = effects.normalize(raw_audio)
silence = AudioSegment.silent(duration=200, frame_rate=16000)
raw_audio_w_silence = silence + raw_audio + silence

# export it
raw_audio_w_silence.export("pydub_audio.wav", format='wav')  # the output from this sounds completely OK.

# read audio, manipulate and write with soundfile
new_audio = AudioSegment.from_file("pydub_audio.wav").set_frame_rate(16000)
new_audio_signal = np.array(new_audio.get_array_of_samples(), dtype=np.float32) / 32768.0  # scale to [-1.0, 1.0]

# the output from down here, using the scaled numpy array, sounds about half the speed of the first.
sf.write("soundfile_export.wav", data=new_audio_signal, samplerate=new_audio.frame_rate, format='wav')
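A hedged guess at the cause (not stated in the original post): if the exported wav is stereo, get_array_of_samples() returns both channels interleaved in one flat array, and writing that 1-d array as mono doubles the duration, i.e. half speed. Reshaping by the segment's channel count before writing would preserve the channel structure:

import numpy as np
import soundfile as sf
from pydub import AudioSegment

new_audio = AudioSegment.from_file("pydub_audio.wav").set_frame_rate(16000)
signal = np.array(new_audio.get_array_of_samples(), dtype=np.float32) / 32768.0
# de-interleave: soundfile expects shape (n_frames, n_channels) for multichannel data
signal = signal.reshape(-1, new_audio.channels)
sf.write("soundfile_export.wav", signal, new_audio.frame_rate)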

Converting a wav file to amplitude and frequency values for textual, time-series analysis

I'm processing wav files for amplitude and frequency analysis with FFT, but I am having trouble getting the data out to csv in a time series format.
Using @Beginner's answer heavily from this post: How to convert a .wav file to a spectrogram in python3, I'm able to get the spectrogram output as an image. I'm trying to simplify that somewhat to get to a text output in csv format, but I'm not seeing how to do so. The outcome I'm hoping to achieve would look something like the following:
time_in_ms, amplitude_in_dB, freq_in_kHz
.001, -115, 1
.002, -110, 2
.003, 20, 200
...
19000, 20, 200
For my testing, I have been using http://soundbible.com/2123-40-Smith-Wesson-8x.html (notes: I simplified the wav down to a single channel and removed metadata with Audacity to get it to work).
Heavy props to @Beginner for 99.9% of the following; anything nonsensical is surely mine.
import numpy as np
from matplotlib import pyplot as plt
import scipy.io.wavfile as wav
from numpy.lib import stride_tricks

filepath = "40sw3.wav"

""" short time fourier transform of audio signal """
def stft(sig, frameSize, overlapFac=0.5, window=np.hanning):
    win = window(frameSize)
    hopSize = int(frameSize - np.floor(overlapFac * frameSize))
    # zeros at beginning (thus center of 1st window should be for sample nr. 0)
    samples = np.append(np.zeros(int(np.floor(frameSize/2.0))), sig)
    # cols for windowing
    cols = np.ceil((len(samples) - frameSize) / float(hopSize)) + 1
    # zeros at end (thus samples can be fully covered by frames)
    samples = np.append(samples, np.zeros(frameSize))
    frames = stride_tricks.as_strided(samples, shape=(int(cols), frameSize),
                                      strides=(samples.strides[0]*hopSize, samples.strides[0])).copy()
    frames *= win
    return np.fft.rfft(frames)

""" scale frequency axis logarithmically """
def logscale_spec(spec, sr=44100, factor=20.):
    timebins, freqbins = np.shape(spec)
    scale = np.linspace(0, 1, freqbins) ** factor
    scale *= (freqbins-1)/max(scale)
    scale = np.unique(np.round(scale))
    # create spectrogram with new freq bins
    newspec = np.complex128(np.zeros([timebins, len(scale)]))
    for i in range(0, len(scale)):
        if i == len(scale)-1:
            newspec[:,i] = np.sum(spec[:,int(scale[i]):], axis=1)
        else:
            newspec[:,i] = np.sum(spec[:,int(scale[i]):int(scale[i+1])], axis=1)
    # list center freq of bins
    allfreqs = np.abs(np.fft.fftfreq(freqbins*2, 1./sr)[:freqbins+1])
    freqs = []
    for i in range(0, len(scale)):
        if i == len(scale)-1:
            freqs += [np.mean(allfreqs[int(scale[i]):])]
        else:
            freqs += [np.mean(allfreqs[int(scale[i]):int(scale[i+1])])]
    return newspec, freqs

""" compute spectrogram """
def compute_stft(audiopath, binsize=2**10):
    samplerate, samples = wav.read(audiopath)
    s = stft(samples, binsize)
    sshow, freq = logscale_spec(s, factor=1.0, sr=samplerate)
    ims = 20.*np.log10(np.abs(sshow)/10e-6)  # amplitude to decibel
    return ims, samples, samplerate, freq

""" plot spectrogram """
def plot_stft(ims, samples, samplerate, freq, binsize=2**10, plotpath=None, colormap="jet"):
    timebins, freqbins = np.shape(ims)
    plt.figure(figsize=(15, 7.5))
    plt.imshow(np.transpose(ims), origin="lower", aspect="auto", cmap=colormap, interpolation="none")
    plt.colorbar()
    plt.xlabel("time (s)")
    plt.ylabel("frequency (hz)")
    plt.xlim([0, timebins-1])
    plt.ylim([0, freqbins])
    xlocs = np.float32(np.linspace(0, timebins-1, 5))
    plt.xticks(xlocs, ["%.02f" % l for l in ((xlocs*len(samples)/timebins)+(0.5*binsize))/samplerate])
    ylocs = np.int16(np.round(np.linspace(0, freqbins-1, 10)))
    plt.yticks(ylocs, ["%.02f" % freq[i] for i in ylocs])
    if plotpath:
        plt.savefig(plotpath, bbox_inches="tight")
    else:
        plt.show()
    plt.clf()

""" HERE IS WHERE I'M ATTEMPTING TO GET IT OUT TO TXT """
ims, samples, samplerate, freq = compute_stft(filepath)

""" Print lengths """
print('ims len:', len(ims))
print('samples len:', len(samples))
print('samplerate:', samplerate)
print('freq len:', len(freq))

""" Write values to files """
np.savetxt(filepath + '-ims.txt', ims, delimiter=', ', newline='\n', header='ims')
np.savetxt(filepath + '-samples.txt', samples, delimiter=', ', newline='\n', header='samples')
np.savetxt(filepath + '-frequencies.txt', freq, delimiter=', ', newline='\n', header='frequencies')
In terms of values out, the file I'm analyzing is approx. 19.1 seconds long and the sample rate is 44100 Hz, so I'd expect to have about 842k values for any given variable. But I'm not seeing what I expected. Instead, here is what I see:
freq comes out with just a handful of values (512), and while they appear to be in the correct range for the expected frequencies, they are ordered least to greatest, not in time series like I expected. The 512 values, I assume, is the "fast" in FFT, basically down-sampled...
ims appears to be amplitude, but the values seem too high, although the sample size is correct. I should be seeing -50 up to ~240 dB.
samples . . . not sure.
In short, can someone advise how I'd get the FFT out to a text file with time, amplitude, and frequency values for the entire sample set? Is savetxt the correct route, or is there a better way? This code can certainly be used to make a great spectrogram, but how can I just get the data out?
Your output format is too limiting, as the audio spectrum at any interval in time usually contains a range of frequencies. E.g. the FFT of 1024 samples will contain 512 frequency bins for one window of time, or time step, each with an amplitude. If you want a time step of one millisecond, then you will have to offset the window of samples you feed each STFT to center the window at that point in your sample vector; with an FFT about 23 milliseconds long, that will involve a high overlap of windows. You could use shorter windows, but the time-frequency trade-off will result in proportionately less frequency resolution.
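For completeness, a sketch (not part of the original answer) of one way to flatten the spectrogram into the long-form rows the question asks for, reusing compute_stft() and filepath from the question's code; the hop size of 512 follows from frameSize=1024 with overlapFac=0.5 in stft():

import numpy as np

ims, samples, samplerate, freq = compute_stft(filepath)
hop = 512  # hopSize implied by stft() above
rows = []
for t in range(ims.shape[0]):          # each time step...
    for f in range(ims.shape[1]):      # ...carries every frequency bin
        rows.append((t * hop / samplerate, freq[f], ims[t, f]))
np.savetxt(filepath + '-long.csv', np.array(rows), delimiter=', ',
           header='time_s, freq_hz, amp_db')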

convert numpy array to AudioFileClip in MoviePy

I am trying to convert a numpy array of an audio file sampled at 44100 Hz into an AudioFileClip in MoviePy so I can overdub a VideoFileClip. The online documentation is unclear on this topic.
Any advice?
Thanks.
The relevant class is AudioArrayClip in AudioClip.py.
Here are a couple of examples of how to generate 2 seconds of mono and stereo random noise:
import numpy as np
from moviepy.audio.AudioClip import AudioArrayClip

rate = 44100  # Sampling rate in samples per second.
duration = 2  # Duration in seconds

# The /2 in the mono shape is the workaround mentioned in the edit below:
# without it, the mono file comes out with twice the intended duration.
data_mono = np.random.uniform(-1, 1, (int(duration*rate/2), 1))
data_stereo = np.random.uniform(-1, 1, (rate*duration, 2))

audio_mono = AudioArrayClip(data_mono, fps=rate)
audio_stereo = AudioArrayClip(data_stereo, fps=rate)

audio_mono.write_audiofile('mono.mp3')
audio_stereo.write_audiofile('stereo.mp3')
Edit: Updated the workaround to get the correct duration of the mono file (Python 3.7.5, moviepy 1.0.0).
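To then overdub a VideoFileClip, as the question asks, a minimal sketch (assuming moviepy 1.x; 'video.mp4' is a hypothetical input file, not from the original post):

from moviepy.video.io.VideoFileClip import VideoFileClip

video = VideoFileClip('video.mp4')  # hypothetical input video
# set_audio returns a copy of the video clip with the new audio track;
# the audio plays for its own duration (trim with set_duration if needed).
dubbed = video.set_audio(audio_stereo)
dubbed.write_videofile('overdubbed.mp4')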

Python Librosa: What is the default frame size used to compute the MFCC features?

Using the Librosa library, I generated MFCC features for a 1319-second audio file as a matrix of 20 x 56829. The 20 here represents the number of MFCC features (which I can adjust manually). But I don't know how it segmented the audio length into 56829 frames. What frame size does it use to process the audio?
import numpy as np
import matplotlib.pyplot as plt
import librosa

def getPathToGroundtruth(episode):
    """Return path to groundtruth file for episode"""
    pathToGroundtruth = "../../../season01/Audio/" \
        + "Season01.Episode%02d.en.wav" % episode
    return pathToGroundtruth

def getduration(episode):
    pathToAudioFile = getPathToGroundtruth(episode)
    y, sr = librosa.load(pathToAudioFile)
    duration = librosa.get_duration(y=y, sr=sr)
    return duration

def getMFCC(episode):
    filename = getPathToGroundtruth(episode)
    y, sr = librosa.load(filename)  # y: audio time series, sr: sample rate
    data = librosa.feature.mfcc(y=y, sr=sr)
    return data

data = getMFCC(1)
Short Answer
You can change the output length by adjusting the parameters used in the STFT calculation. The following code will double the size of your output (20 x 113658):
data = librosa.feature.mfcc(y=y, sr=sr, n_fft=1024, hop_length=256, n_mfcc=20)
Long Answer
Librosa's librosa.feature.mfcc() function really just acts as a wrapper around librosa's librosa.feature.melspectrogram() function (which is itself a wrapper around the librosa.core.stft and librosa.filters.mel functions).
All of the parameters pertaining to segmentation of the audio signal - namely the frame and overlap values - are utilized in the Mel-scaled power spectrogram function (with other tunable parameters specified for the nested core functions). You specify these parameters as keyword arguments in the librosa.feature.mfcc() function.
All extra **kwargs parameters are fed to librosa.feature.melspectrogram() and subsequently to librosa.filters.mel().
By default, the Mel-scaled power spectrogram window and hop length are the following:
n_fft=2048
hop_length=512
So assuming you used the default sample rate (sr=22050), the output length of your mfcc function makes sense:
output length = (seconds) * (sample rate) / (hop_length)
(1319) * (22050) / (512) = 56804 frames
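A quick way to sanity-check that relationship (a sketch, not from the original answer; it assumes a librosa version whose STFT centers frames by default, which adds one extra frame):

import numpy as np
import librosa

sr = 22050                               # librosa's default sample rate
y = np.zeros(3 * sr, dtype=np.float32)   # 3 seconds of silence

mfcc = librosa.feature.mfcc(y=y, sr=sr)  # defaults: n_fft=2048, hop_length=512
print(mfcc.shape)                        # (20, 130): 1 + (3 * 22050) // 512 frames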
The parameters that you are able to tune are the following:
Melspectrogram Parameters
-------------------------
y : np.ndarray [shape=(n,)] or None
    audio time-series
sr : number > 0 [scalar]
    sampling rate of `y`
S : np.ndarray [shape=(d, t)]
    power spectrogram
n_fft : int > 0 [scalar]
    length of the FFT window
hop_length : int > 0 [scalar]
    number of samples between successive frames.
    See `librosa.core.stft`
kwargs : additional keyword arguments
    Mel filter bank parameters.
    See `librosa.filters.mel` for details.
If you want to further specify characteristics of the mel filterbank used to define the Mel-scaled power spectrogram, you can tune the following:
Mel Frequency Parameters
------------------------
sr : number > 0 [scalar]
    sampling rate of the incoming signal
n_fft : int > 0 [scalar]
    number of FFT components
n_mels : int > 0 [scalar]
    number of Mel bands to generate
fmin : float >= 0 [scalar]
    lowest frequency (in Hz)
fmax : float >= 0 [scalar]
    highest frequency (in Hz).
    If `None`, use `fmax = sr / 2.0`
htk : bool [scalar]
    use HTK formula instead of Slaney
Documentation for Librosa:
librosa.feature.melspectrogram
librosa.filters.mel
librosa.core.stft
