Splitting a FLAC image into tracks

This is a follow-up question to Flac samples calculation.
Do I apply the offset produced by that formula from the beginning of the file, or from the point after the metadata where the stream starts (here)?
My goal is to programmatically divide the file myself, largely as a learning exercise. My thought is that I would write out my FLAC header and metadata blocks based on values read from the image, and then append the actual track data I extract from the master image using my cue sheet.
Currently my code can parse each metadata block and ends up where the frames start.

Suppose you are trying to decode starting at M:S.F = 3:45.30. There are 75 frames (CDDA sectors) per second, and obviously there are 60 seconds per minute. To convert M:S.F from your cue sheet into a sample offset value, I would first calculate the number of CDDA sectors to the desired starting point: (((60 * 3) + 45) * 75) + 30 = 16,905. Since there are 75 sectors per second, assuming the audio is sampled at 44,100 Hz there are 44,100 / 75 = 588 audio samples per sector. So the desired audio sample offset where you will start decoding is 588 * 16,905 = 9,940,140.
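As a sketch, that arithmetic could be wrapped up like this in Python (assuming CDDA timing of 75 sectors per second and 44,100 Hz audio):

SECTORS_PER_SECOND = 75
SAMPLE_RATE = 44100
SAMPLES_PER_SECTOR = SAMPLE_RATE // SECTORS_PER_SECOND  # 588

def msf_to_sample_offset(minutes, seconds, frames):
    """Convert a cue-sheet M:S.F index to a PCM sample offset."""
    sectors = ((minutes * 60) + seconds) * SECTORS_PER_SECOND + frames
    return sectors * SAMPLES_PER_SECTOR

print(msf_to_sample_offset(3, 45, 30))  # 9940140, matching the worked example above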
The offset just calculated is an offset into the decompressed PCM samples, not into the compressed FLAC stream (nor in bytes). So for each FLAC frame, calculate the number of samples it contains and keep a running tally of your position. Skip FLAC frames until you find the one containing your starting audio sample. At this point you can start decoding the audio, throwing away any samples in the FLAC frame that you don't need.
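A rough sketch of that running tally, assuming you already have some way of iterating over the FLAC frames and of reading the sample count from each frame header (both of those are your own code, not shown here):

def skip_to_sample(frames, target_sample):
    """Find the frame containing target_sample.

    `frames` is assumed to yield (frame, samples_in_frame) pairs from your own
    frame parser. Returns that frame plus the number of leading samples to
    throw away once the frame is decoded.
    """
    position = 0  # running tally of decoded PCM samples seen so far
    for frame, samples_in_frame in frames:
        if position + samples_in_frame > target_sample:
            return frame, target_sample - position
        position += samples_in_frame
    raise ValueError("target sample is past the end of the stream")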
FLAC also supports a SEEKTABLE block, the use of which would greatly speed up (and alter) the process I just described. If you haven't already, you can look at the implementation of the reference decoder.
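For reference, a SEEKTABLE block body is just a list of 18-byte seek points, so a minimal sketch of reading it (assuming you already have the raw block body in hand) might look like:

import struct

PLACEHOLDER = 0xFFFFFFFFFFFFFFFF  # placeholder seek points are ignored

def parse_seektable(block_body):
    """Each seek point is a 64-bit target sample number, a 64-bit byte offset
    relative to the first frame header, and a 16-bit sample count."""
    points = []
    for i in range(0, len(block_body), 18):
        sample, offset, n_samples = struct.unpack(">QQH", block_body[i:i + 18])
        if sample != PLACEHOLDER:
            points.append((sample, offset, n_samples))
    return points

# To seek, take the last point whose sample number is <= your target sample,
# jump to (position of the first frame + offset), and decode forward from there.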

What is the maximum number of channels we can create in an audio file with the FFMPEG amerge filter?

We have a requirement to merge multiple single-channel audio files into one multi-channel audio file.
Each channel represents a speaker in the audio file.
I tried the amerge filter and could do it for up to 8 files. I get a blank audio file when I try it with 10 audio files, and the FFMPEG amerge command doesn't seem to produce any error either.
Can I create one multi-channel audio file from N input files, where N may be 100+? Is it possible?
I am new to these audio APIs, so any guidance is appreciated.
Max inputs is 64. According to ffmpeg -h filter=amerge:
inputs <int> ..F.A...... specify the number of inputs (from 1 to 64) (default 2)
Or look at the source code at libavfilter/af_amerge.c and refer to SWR_CH_MAX.
Can I create one multi-channel audio file from N input files, where N may be 100+? Is it possible?
Chain multiple amerge filters, with a max of 64 inputs per filter, or use the amix filter, which has a max of 32767 inputs.
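As an illustration of the chaining approach, here is a hedged Python sketch that builds such an ffmpeg command; the input/output file names and the helper itself are made up for this example:

def build_amerge_command(inputs, output="merged.wav", per_filter=64):
    """Merge N mono inputs into one multi-channel file, chaining amerge
    filters so no single amerge gets more than `per_filter` inputs."""
    cmd = ["ffmpeg"]
    for path in inputs:
        cmd += ["-i", path]
    filters, stage_labels = [], []
    # First stage: merge the inputs in groups of up to `per_filter`.
    for g, start in enumerate(range(0, len(inputs), per_filter)):
        group = range(start, min(start + per_filter, len(inputs)))
        pads = "".join(f"[{i}:a]" for i in group)
        filters.append(f"{pads}amerge=inputs={len(group)}[m{g}]")
        stage_labels.append(f"[m{g}]")
    # Second stage: merge the intermediate streams into the final output.
    if len(stage_labels) > 1:
        filters.append(f"{''.join(stage_labels)}amerge=inputs={len(stage_labels)}[out]")
        out_label = "[out]"
    else:
        out_label = stage_labels[0]
    cmd += ["-filter_complex", ";".join(filters), "-map", out_label, output]
    return cmd

# e.g. 100 hypothetical speaker files -> one 100-channel output file
print(" ".join(build_amerge_command([f"spk{i}.wav" for i in range(100)])))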

Librosa.resample() resamples to a lower rate than needed

I am doing some audio pre-processing to train a ML model.
All the audio files of the dataset are:
RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz.
I am using the following snippet of code to resample the dataset to 8000 Hz:
samples, sample_rate = librosa.load(filename, sr = 16000)
samples = librosa.resample(samples, sample_rate, 8000)
then I use the following snippet to reshape the new samples:
samples.reshape(1,8000,1)
But for some reason I keep getting the following error: ValueError: cannot reshape array of size 4000 into shape (1,8000,1). The size differs from file to file, but it is always less than 8000 (the number of samples I expect at the desired sample rate).
I double-checked the original sample rate and it was 16000 Hz. I also tried to load the files with a sample rate of 8000, but had no luck.
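For context, the resampled length is the clip duration times the target rate, so an array of 4000 samples at 8000 Hz corresponds to a 0.5-second clip. A hedged sketch that pads or trims every clip to exactly one second before reshaping (assuming that is the intent; recent librosa releases expect the keyword arguments shown) could look like this:

import librosa

TARGET_SR = 8000
TARGET_LEN = 8000  # one second at 8 kHz; the reshape assumes exactly this many samples

def load_fixed_length(filename):
    samples, sr = librosa.load(filename, sr=16000)
    samples = librosa.resample(samples, orig_sr=16000, target_sr=TARGET_SR)
    # Zero-pad (or trim) so every clip ends up with exactly TARGET_LEN samples.
    samples = librosa.util.fix_length(samples, size=TARGET_LEN)
    return samples.reshape(1, TARGET_LEN, 1)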

HLS protocol: get absolute elapsed time during a live stream

I have a very basic question, and I can't tell whether I searched for the wrong thing or the answer is so simple that I missed it.
I'm implementing a web app using hls.js as the JavaScript library, and I need a way to get the absolute elapsed time of a live stream, e.g. if a user joins the stream after 10 minutes, I need a way to detect that the user's 1st second is the 601st second of the stream.
Inspecting the stream fragments I found some information like startPTS and endPTS, but it is always relative to the chunks I retrieved rather than to the whole stream, e.g. if a user joins the stream after 10 minutes and the chunk duration is 2 seconds, the first chunk I get will have startPTS = 0 and endPTS = 2, the second chunk will have startPTS = 2 and endPTS = 4, and so on (rounding the values to the nearest integer).
Is there a way to extract the absolute elapsed time I need from an HLS live stream?
I'm having the exact same need on iOS (AVPlayer) and came up with the following solution:
Read the m3u8 manifest; for me it looks like this:
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-MEDIA-SEQUENCE:410
#EXT-X-TARGETDURATION:8
#EXTINF:8.333,
410.ts
#EXTINF:8.333,
411.ts
#EXTINF:8.334,
412.ts
#EXTINF:8.333,
413.ts
#EXTINF:8.333,
414.ts
#EXTINF:8.334,
415.ts
Observe that the first 409 segments are not part of the manifest.
Multiply EXT-X-MEDIA-SEQUENCE by EXT-X-TARGETDURATION and you have an approximation of the clock time for the first available segment.
Note also that each segment is not exactly 8 s long, so by using the target duration I'm actually accumulating an error of about 333 ms per segment:
410 * 8 = 3280 seconds = 54.6666 minutes
In this case the segments are always 8.333 or 8.334 s, so multiplying by the EXTINF duration instead, I get:
410 * 8.333 = 3416.53 seconds = 56.9421 minutes
Those roughly 56.9421 minutes are still an approximation (since we don't know exactly how many times the remaining 0.001 s error has accumulated), but it's much, much closer to the real clock time.
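A small Python sketch of this estimate, assuming a playlist shaped like the one above (the local file name is hypothetical):

import re

def elapsed_before_first_segment(m3u8_text):
    """Estimate how many seconds of the live stream were served before the
    first segment still listed in the playlist."""
    media_seq = int(re.search(r"#EXT-X-MEDIA-SEQUENCE:(\d+)", m3u8_text).group(1))
    durations = [float(d) for d in re.findall(r"#EXTINF:([\d.]+)", m3u8_text)]
    # Assume the evicted segments lasted roughly as long as the listed ones
    # (about 8.333 s here), rather than the 8 s target duration.
    average = sum(durations) / len(durations)
    return media_seq * average

manifest = open("playlist.m3u8").read()  # a local copy of the live playlist
print(elapsed_before_first_segment(manifest) / 60, "minutes")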

Understanding audio file spectrogram values

I am currently struggling to understand how the power spectrum is stored in the Kaldi framework.
I seem to have successfully created some data files using
$cmd JOB=1:$nj $logdir/spect_${name}.JOB.log \
compute-spectrogram-feats --verbose=2 \
scp,p:$logdir/wav_spect_${name}.JOB.scp ark:- \| \
copy-feats --compress=$compress $write_num_frames_opt ark:- \
ark,scp:$specto_dir/raw_spectogram_$name.JOB.ark,$specto_dir/raw_spectogram_$name.JOB.scp
This gives me a large file with data points for different audio files, like this.
The problem is that I am not sure how I should interpret this data set. I know that prior to this an FFT is performed, which I guess is a good thing.
The output example given above is from a file which is 1 second long.
All the standard settings were used for computing the spectrogram, so the sample frequency should be 16 kHz, frame length = 25 ms and frame shift = 10 ms.
The number of data points in the first set is 25186.
Given this information, can I interpret the output in some way?
Usually when one performs fft, the frequency bin size can be extracted by F_s/N=bin_size where F_s is the sample frequency and N is the FFT length. So is this the same case? 16000/25186 = 0.6... Hz/bin?
Or am I interpreting it incorrectly?
The formula F_s/N is indeed what you would use to compute the frequency bin size. However, as you mention, N is the FFT length, not the total number of samples. Based on the approximate 25 ms frame length, the 10 ms hop size and the fact that your generated output data file has 98 lines of 257 values for some presumably real-valued input, it would seem that the FFT length used was 512. This would give you a frequency bin size of 16000/512 = 31.25 Hz/bin.
Based on this scaling, plotting your raw data with the following Matlab script (with the data previously loaded in the Z matrix):
fs = 16000; % 16 kHz sampling rate
hop_size = 0.010; % 10 millisecond
[X,Y]=meshgrid([0:size(Z,1)-1]*hop_size, [0:size(Z,2)-1]*fs/512);
surf(X,Y,transpose(Z),'EdgeColor','None','facecolor','interp');
view(2);
xlabel('Time (seconds)');
ylabel('Frequency (Hz)');
gives this graph (the dark red regions are the areas of highest intensity).
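As a cross-check of those numbers, a short Python sketch (assuming Kaldi's default 25 ms window, 10 ms shift, snipped edges and a 512-point FFT) reproduces the 98 x 257 = 25186 values and the 31.25 Hz/bin figure:

sample_rate = 16000
frame_length = int(0.025 * sample_rate)  # 400 samples per frame
frame_shift = int(0.010 * sample_rate)   # 160 samples between frame starts
n_fft = 512                              # smallest power of two >= 400

num_samples = 1 * sample_rate            # the 1-second file from the question
num_frames = 1 + (num_samples - frame_length) // frame_shift  # 98
num_bins = n_fft // 2 + 1                                     # 257
print(num_frames, num_bins, num_frames * num_bins)  # 98 257 25186
print(sample_rate / n_fft)                          # 31.25 Hz per bin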

How to calculate total conversion duration before converting with FFMPEG in nodeJS

I would like to convert a video with FFMPEG in nodeJS.
How can I calculate the total conversion duration before running the conversion?
Example: how long does it take to convert a 1 GB AVI movie to MKV?
You can't know in advance the exact amount of time needed to execute the conversion.
If you know the total number of frames of the target file, you can use this formula while the conversion is running:
T_full_conversion_time = T_elapsed * N_total_frames / N_converted_frames
Subtracting T_elapsed from T_full_conversion_time then gives you an estimate of the remaining time.
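A minimal Python sketch of that estimate (the frame counts here are made-up numbers; how you obtain them, e.g. by parsing ffmpeg's progress output, is up to you):

import time

def estimate_remaining_seconds(start_time, total_frames, converted_frames):
    """Extrapolate the remaining conversion time from progress so far."""
    elapsed = time.time() - start_time
    if converted_frames == 0:
        return None  # no progress yet, nothing to extrapolate from
    full_conversion_time = elapsed * total_frames / converted_frames
    return full_conversion_time - elapsed

# Example: 2 minutes in, 3000 of 90000 frames done -> about 58 minutes left.
print(estimate_remaining_seconds(time.time() - 120, 90000, 3000) / 60)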
