librosa.resample() resamples to a lower rate than needed

I am doing some audio pre-processing to train an ML model.
All the audio files of the dataset are:
RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz.
I am using the following snippet of code to resample the dataset to 8000 Hz:
samples, sample_rate = librosa.load(filename, sr = 16000)
samples = librosa.resample(samples, sample_rate, 8000)
Then I use the following snippet to reshape the new samples:
samples.reshape(1,8000,1)
For some reason, I keep getting the following error: ValueError: cannot reshape array of size 4000 into shape (1,8000,1). The size differs from one file to another, but it is always less than 8000 (the desired sample rate).
I double-checked the original sample rate and it was 16000 Hz. I also tried loading the files with a sample rate of 8000, but I had no luck.
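One possible explanation, offered as an assumption because the clips themselves are not shown: resampling changes the number of samples per second, not the duration of the clip, so a clip shorter than one second yields fewer than 8000 samples at 8000 Hz. An array of size 4000 would correspond to a 0.5 s clip. The sketch below uses plain Python lists and two hypothetical helpers, expected_length and fix_length, to show the arithmetic and a zero-pad or truncate step before reshaping. It assumes the resampler rounds the output length up.

```python
import math

def expected_length(n_samples, orig_sr, target_sr):
    """Number of samples after resampling (assumes the length is rounded up)."""
    return math.ceil(n_samples * target_sr / orig_sr)

def fix_length(samples, target_len):
    """Zero-pad or truncate a sequence to exactly target_len samples."""
    samples = list(samples)[:target_len]
    return samples + [0.0] * (target_len - len(samples))

# A 0.5 s clip recorded at 16000 Hz holds 8000 samples; at 8000 Hz it holds 4000.
half_second = expected_length(8000, 16000, 8000)

# Only a full 1 s clip gives the 8000 samples that reshape(1, 8000, 1) expects.
one_second = expected_length(16000, 16000, 8000)

# Padding the short clip with zeros makes the reshape possible.
padded = fix_length([0.1] * half_second, 8000)
```

With NumPy arrays the same step could be done with np.pad or librosa.util.fix_length; whether padding or cropping is appropriate depends on the model being trained.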

Related

How to maintain normalization when converting normalized WAV files to mp3?

I have a script that uses sox to first normalize a bunch of wav files. It then takes the normalized wav files and converts them to mp3. I use the max amplitude stat to check how 'normalized' the files are. The max amp stats of the normalized files are within the same range. When I look at the max amplitude stats of the mp3 files, they are not maintaining the same close range. How can I maintain normalization when converting from wav to mp3?
The command I use to normalize the files:
sox file.wav --norm=-1 norm.wav
The command I use to convert the files to mp3:
sox norm.wav -c 1 newFile.mp3

How to gain volumes of specific bands of audio files using ffmpeg?

I want to increase or decrease the volume of specific frequency bands with ffmpeg.
I think the bandreject and bandpass filters can do something similar.
But is there any way to reject 80% of the energy of specific bands?
Thanks in advance.
Use the equalizer filter.
Example that attenuates by 10 dB at 1000 Hz with a bandwidth of 200 Hz and by 5 dB at 8000 Hz with a bandwidth of 1000 Hz:
ffmpeg -i input.mp3 -af equalizer=frequency=1000:width=200:width_type=h:gain=-10,equalizer=frequency=8000:width=1000:width_type=h:gain=-5 output.wav
Or you can do it in one filter instance using the anequalizer filter.
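To relate the "80% of energy" in the question to the gain values above: removing 80% of the energy leaves a power ratio of 0.2, which is about -7 dB. The Python sketch below shows that conversion and builds the same -af argument as the command above. The helper names energy_fraction_to_db and equalizer_chain are hypothetical, and the result is only approximate because equalizer is a peaking filter whose full gain applies near the centre frequency.

```python
import math

def energy_fraction_to_db(remaining_fraction):
    """Convert the fraction of energy (power) that remains into a gain in dB."""
    return 10.0 * math.log10(remaining_fraction)

def equalizer_chain(bands):
    """Build an -af argument from (frequency_hz, width_hz, gain_db) tuples."""
    return ",".join(
        f"equalizer=frequency={freq}:width={width}:width_type=h:gain={gain:g}"
        for freq, width, gain in bands
    )

# Rejecting 80% of the energy keeps 20% of it.
gain_db = energy_fraction_to_db(0.2)

# Reproduces the filter string used in the ffmpeg command above.
chain = equalizer_chain([(1000, 200, -10), (8000, 1000, -5)])
```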

Understanding audio file spectrogram values

I am currently struggling to understand how the power spectrum is stored in the Kaldi framework.
I seem to have successfully created some data files using
$cmd JOB=1:$nj $logdir/spect_${name}.JOB.log \
compute-spectrogram-feats --verbose=2 \
scp,p:$logdir/wav_spect_${name}.JOB.scp ark:- \| \
copy-feats --compress=$compress $write_num_frames_opt ark:- \
ark,scp:$specto_dir/raw_spectogram_$name.JOB.ark,$specto_dir/raw_spectogram_$name.JOB.scp
This gives me a large file with data points for different audio files, like this.
The problem is that I am not sure how to interpret this data set. I know that an FFT is performed beforehand, which I guess is a good thing.
The output example given above is from a file that is 1 second long.
All the default settings were used for computing the spectrogram, so the sample frequency should be 16 kHz, frame length = 25 ms and overlap = 10 ms.
The number of data points in the first set is 25186.
Given this information, can I interpret the output in some way?
Usually when one performs an FFT, the frequency bin size can be computed as F_s/N = bin_size, where F_s is the sample frequency and N is the FFT length. So is this the same case? 16000/25186 = 0.6... Hz/bin?
Or am I interpreting it incorrectly?
The formula F_s/N is indeed what you would use to compute the frequency bin size. However, as you mention, N is the FFT length, not the total number of samples. Based on the approximate 25 ms frame length, the 10 ms hop size and the fact that your generated output data file has 98 lines of 257 values (98 * 257 = 25186) for some presumably real-valued input, it would seem that the FFT length used was 512. This would give you a frequency bin size of 16000/512 = 31.25 Hz/bin.
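The numbers in that reasoning can be checked with a few lines of Python. This is a sketch under two assumptions not stated in the question: the 400-sample frame is zero-padded to the next power of two, and only whole frames that fit inside the signal are kept.

```python
sample_rate = 16000                         # Hz
frame_length = sample_rate * 25 // 1000     # 25 ms -> 400 samples
frame_shift = sample_rate * 10 // 1000      # 10 ms -> 160 samples
num_samples = 1 * sample_rate               # the file is 1 second long

# Assumption: the frame is padded to the next power of two before the FFT.
fft_length = 1
while fft_length < frame_length:
    fft_length *= 2

# A real-valued input of length N gives N/2 + 1 distinct frequency bins.
num_bins = fft_length // 2 + 1

# Assumption: only frames that fit completely inside the signal are kept.
num_frames = 1 + (num_samples - frame_length) // frame_shift

total_values = num_frames * num_bins
bin_size_hz = sample_rate / fft_length
```

Under these assumptions the script gives 98 frames of 257 bins, 25186 values in total, and a bin size of 31.25 Hz.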
Based on this scaling, plotting your raw data with the following Matlab script (with the data previously loaded in the Z matrix):
fs = 16000; % 16 kHz sampling rate
hop_size = 0.010; % 10 millisecond
[X,Y]=meshgrid([0:size(Z,1)-1]*hop_size, [0:size(Z,2)-1]*fs/512);
surf(X,Y,transpose(Z),'EdgeColor','None','facecolor','interp');
view(2);
xlabel('Time (seconds)');
ylabel('Frequency (Hz)');
gives a graph in which the dark red regions are the areas of highest intensity.

How to calculate total conversion duration before converting with FFMPEG in nodeJS

I would like to convert a video with FFMPEG in nodeJS.
How can I calculate the total conversion duration before starting the conversion?
Example: how long does a 1 GB AVI movie take to be converted to MKV?
You can't know in advance the exact amount of time needed for executing the conversion.
If you know the total number of frames of the target file you can use this formula:
T_full_conversion_time = T_elapsed * T_total_frame_count/ T_converted_frames
Subtracting T_elapsed from T_full_conversion_time then gives an estimate of the remaining time.
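As an illustration of the formula, here is a minimal sketch in Python (the same arithmetic carries over directly to nodeJS). The function name estimate_conversion_time and the sample numbers are hypothetical; the elapsed time and the converted frame count would have to come from the progress output of the running conversion.

```python
def estimate_conversion_time(elapsed_s, converted_frames, total_frames):
    """Return (estimated full time, estimated remaining time) in seconds."""
    if converted_frames <= 0:
        raise ValueError("no frames converted yet, cannot extrapolate")
    # T_full_conversion_time = T_elapsed * T_total_frame_count / T_converted_frames
    full_s = elapsed_s * total_frames / converted_frames
    return full_s, full_s - elapsed_s

# Hypothetical progress: 4500 of 18000 frames converted after 30 seconds.
full_s, remaining_s = estimate_conversion_time(30.0, 4500, 18000)
```

The estimate assumes the conversion speed stays roughly constant, so it tends to be unreliable early in the conversion.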

splitting a flac image into tracks

This is a follow-up question to "Flac samples calculation".
Do I apply the offset generated by that formula from the beginning of the file, or from after the metadata, where the stream starts?
My goal is to programmatically divide the file myself - largely as a learning exercise. My thought is that I would write down my flac header and metadata blocks based on values learned from the image and then the actual track I get from the master image using my cuesheet.
Currently in my code I can parse each metadata block and end up where the frames start.
Suppose you are trying to decode starting at M:S.F = 3:45.30. There are 75 frames (CDDA sectors) per second, and obviously there are 60 seconds per minute. To convert M:S.F from your cue sheet into a sample offset value, I would first calculate the number of CDDA sectors to the desired starting point: (((60 * 3) + 45) * 75) + 30 = 16,905. Since there are 75 sectors per second, assuming the audio is sampled at 44,100 Hz there are 44,100 / 75 = 588 audio samples per sector. So the desired audio sample offset where you will start decoding is 588 * 16,905 = 9,940,140.
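The conversion above can be written as a small Python function. The name cue_to_sample_offset is hypothetical, and the default of 44,100 Hz is the same assumption made in the worked example.

```python
SECTORS_PER_SECOND = 75  # CDDA sectors ("frames" in cue sheet notation)

def cue_to_sample_offset(minutes, seconds, frames, sample_rate=44100):
    """Convert a cue sheet M:S.F position into a PCM sample offset."""
    sectors = (minutes * 60 + seconds) * SECTORS_PER_SECOND + frames
    samples_per_sector = sample_rate // SECTORS_PER_SECOND
    return sectors * samples_per_sector

# The 3:45.30 position from the example above.
offset = cue_to_sample_offset(3, 45, 30)
```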
The offset just calculated is an offset into the decompressed PCM samples, not into the compressed FLAC stream (nor in bytes). So for each FLAC frame, calculate the number of samples it contains and keep a running tally of your position. Skip FLAC frames until you find the one containing your starting audio sample. At this point you can start decoding the audio, throwing away any samples in the FLAC frame that you don't need.
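The running tally can be sketched as follows in Python. The function locate_frame is hypothetical and takes the per-frame sample counts as a plain list; in a real splitter these would be read from each FLAC frame header while scanning the stream. The fixed block size of 4096 is used here only as an example.

```python
def locate_frame(block_sizes, target_sample):
    """Find the frame containing target_sample.

    Returns (frame index, number of leading samples to discard in that frame).
    """
    position = 0  # sample index at which the current frame starts
    for index, size in enumerate(block_sizes):
        if position + size > target_sample:
            return index, target_sample - position
        position += size
    raise ValueError("target sample lies beyond the end of the stream")

# Hypothetical stream of 3000 frames with 4096 samples each,
# searching for the sample offset computed from 3:45.30.
frame_index, samples_to_skip = locate_frame([4096] * 3000, 9940140)
```

Decoding would then start at frame_index, discarding the first samples_to_skip samples of that frame.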
FLAC also supports a SEEKTABLE block, the use of which would greatly speed up (and alter) the process I just described. If you haven't already, you can look at the implementation of the reference decoder.
