I'm working on a program that compares two video files and shows their differences.
I compare the audio tracks of the files using SoX and FFmpeg:
invert one of the files (SoX)
mix the other file with the inverted version of the first (SoX)
detect silence (FFmpeg)
But if the two files differ only in volume level, the entire audio track is detected as non-silent ranges.
How can I tell that two files have the same audio track, just at different volume levels?
I tried changing the sound level via SoX: sox -v 1.1 input.wav output.wav
and then compared the statistical information (-n stat).
It works fine. Here is the result of dividing each parameter, audio2/audio1:
Samples read 1.00;
Length (seconds) 1.00;
Scaled by 1.00;
Maximum amplitude 1.10;
Minimum amplitude 1.10;
Midline amplitude 1.10;
Mean norm 1.10;
Mean amplitude 1.00;
RMS amplitude 1.10;
Maximum delta 1.10;
Mean delta 1.10;
RMS delta 1.10;
Rough frequency 1.00;
Volume adjustment 1/1.10;
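The per-parameter ratio check described above can be sketched directly in Python (the NumPy-based stat computation and the synthetic demo signal are my own illustration, not part of the original SoX workflow):

```python
import numpy as np

def amplitude_stats(x):
    """A few of the sox -n stat parameters, computed from a float sample array."""
    return {
        "max_amplitude": np.max(x),
        "min_amplitude": np.min(x),
        "mean_norm": np.mean(np.abs(x)),
        "rms_amplitude": np.sqrt(np.mean(x ** 2)),
    }

def stat_ratios(a, b):
    """Divide each parameter of track b by the same parameter of track a."""
    sa, sb = amplitude_stats(a), amplitude_stats(b)
    return {k: sb[k] / sa[k] for k in sa}

# Demo: a 1 kHz tone and the same tone scaled by 1.1 (a pure volume change).
t = np.linspace(0, 1, 16000, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)
ratios = stat_ratios(tone, 1.1 * tone)
# For a pure volume change, every amplitude-linear parameter scales by the
# same factor, which is what the 1.10 entries in the list above show.
print(ratios)
```

If the two tracks really differ only in gain, all amplitude-linear ratios come out equal, so dividing the RMS amplitudes gives you the scale factor to normalise with before the invert-and-mix comparison.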
BUT! When I used FFmpeg to change the volume of the video: ffmpeg -i input.mp4 -vcodec copy -af "volume=10dB" output.mp4 (or volume=volume=0.5) and then compared the SoX audio statistics, I couldn't find any pattern:
Samples read 1.00
Length (seconds) 1.00
Scaled by 1.00
Maximum amplitude 0.71
Minimum amplitude 0.64
Midline amplitude -2401.73
Mean norm 0.34
Mean amplitude 0.50
RMS amplitude 0.36
Maximum delta 0.37
Mean delta 0.34
RMS delta 0.36
Rough frequency 0.99
Volume adjustment 0.71
I will be grateful for any ideas and help.
I am doing some audio pre-processing to train a ML model.
All the audio files of the dataset are:
RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz.
I am using the following snippet of code to resample the dataset to 8000 Hz:
samples, sample_rate = librosa.load(filename, sr=16000)
samples = librosa.resample(samples, orig_sr=sample_rate, target_sr=8000)
then I use the following snippet to reshape the new samples:
samples.reshape(1,8000,1)
but for some reason I keep getting the following error: ValueError: cannot reshape array of size 4000 into shape (1,8000,1). The size differs from file to file, but it is always less than 8000 samples (the desired sample rate).
I double-checked the original sample rate and it was 16000 Hz; I also tried loading the files with a sample rate of 8000, but had no luck.
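For illustration, the arithmetic implies the clips are simply shorter than one second: resampling from 16000 Hz to 8000 Hz halves the sample count, so an array of size 4000 corresponds to a 0.5 s clip. A common workaround (my suggestion, not something stated in the question) is to pad or truncate every resampled array to a fixed length before reshaping:

```python
import numpy as np

def to_fixed_length(samples, target_len=8000):
    """Pad with zeros (or truncate) so the clip always has target_len samples."""
    if len(samples) < target_len:
        samples = np.pad(samples, (0, target_len - len(samples)))
    return samples[:target_len]

# Demo: a 0.5 s clip resampled to 8000 Hz has only 4000 samples,
# which is exactly the array size from the error message.
short_clip = np.zeros(4000, dtype=np.float32)
fixed = to_fixed_length(short_clip)
reshaped = fixed.reshape(1, 8000, 1)  # now succeeds
print(reshaped.shape)  # (1, 8000, 1)
```

Whether zero-padding is acceptable depends on the ML model; truncating or padding symmetrically are equally easy variations.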
I want to increase or decrease the volume of specific frequency bands with FFmpeg.
I think the bandreject and bandpass filters can do something similar.
But is there any way to reject 80% of the energy of specific bands?
Thanks in advance.
Use the equalizer filter.
Example to attenuate 10 dB at 1000 Hz with a bandwidth of 200 Hz and attenuate 5 dB at 8000 Hz with a bandwidth of 1000 Hz:
ffmpeg -i input.mp3 -af equalizer=frequency=1000:width=200:width_type=h:gain=-10,equalizer=frequency=8000:width=1000:width_type=h:gain=-5 output.wav
Or you can do it in one filter instance using the anequalizer filter.
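As a side note on the 80% figure from the question: an energy ratio converts to a filter gain as 10·log10(ratio), so keeping only 20% of a band's energy corresponds to roughly -7 dB, which you would pass as the equalizer's gain parameter. A quick check:

```python
import math

# Rejecting 80% of the energy means keeping 20%; in decibels that is:
gain_db = 10 * math.log10(0.2)
print(round(gain_db, 2))  # -6.99
```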
I have data that is 157-dimensional with 688 data-points. With the data I would like to perform clustering.
Since K-Means is the simplest algorithm, I have decided to begin with this method.
Here is the Sklearn function call:
bench_k_means(KMeans(init='k-means++', n_clusters=4, n_init=10), name="k-means++", data=sales)
Here are some output metrics:
init time inertia homo compl v-meas ARI AMI num_clusters
k-means++ 0.06s 38967 0.262 0.816 0.397 0.297 0.250 4
k-means++ 0.05s 29825 0.321 0.847 0.466 0.338 0.306 6
k-means++ 0.07s 23131 0.411 0.836 0.551 0.430 0.393 8
k-means++ 0.09s 20566 0.636 0.817 0.715 0.788 0.621 10
k-means++ 0.09s 18695 0.534 0.794 0.638 0.568 0.513 12
k-means++ 0.11s 16805 0.773 0.852 0.810 0.916 0.760 14
k-means++ 0.11s 15297 0.822 0.775 0.798 0.811 0.761 16
Can someone please help me interpret them?
I know that it is good to have low inertia and a high homogeneity score, but I do not know what good thresholds for these are.
For example, 15297 is the lowest inertia I have obtained, but it occurs when the number of clusters is set to 16. Is this good or bad?
Available abbreviations:
homo = homogeneity score;
compl = completeness score;
v_meas = v-measure score;
ARI = adjusted Rand score;
AMI = adjusted mutual info.
The more centroids you have, the lower the inertia you will get.
Having more centroids (num_clusters = number of centroids) means more ways for inputs to be assigned to a centre, which lowers the overall magnitude of inertia in a multi-dimensional space.
However, having more centroids also means it may be harder for the machine to reach convergence within the defined max_iter of each n_init (by default, max_iter is set to 300). So for each random initialisation of centroids (each start of n_init), your machine runs the KMeans update at most 300 times, trying to reach a state where no reassignment of inputs is possible. If it converges earlier, it proceeds to the next n_init; equally, if it does not find a solution within the defined number of iterations (300 in your case), it still moves on to the next random placement of centroids. After 10 initialisations, the output with the best inertia is taken. You may try increasing both max_iter and num_clusters to see that it takes longer to find a solution.
There are no universal thresholds for homo and inertia, simply because datasets differ. The number of centroids should be chosen empirically, judging from the structure of the data and the number of clusters the inputs should plausibly form.
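One common empirical heuristic is the elbow method: plot inertia against the number of clusters and look for the point where the decrease levels off. A minimal sketch with scikit-learn on synthetic data (make_blobs and all parameters here are illustrative stand-ins, not the asker's sales data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 688 x 157 dataset.
X, _ = make_blobs(n_samples=688, n_features=157, centers=10, random_state=0)

ks = list(range(2, 18, 2))
inertias = []
for k in ks:
    km = KMeans(init="k-means++", n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    inertias.append(km.inertia_)

# Inertia always shrinks as k grows; the "elbow" is where the marginal
# improvement becomes small, not the global minimum of inertia.
for k, inertia in zip(ks, inertias):
    print(k, round(inertia))
```

This is why the lowest inertia at k = 16 is not automatically "good": inertia alone will always favour the largest k you try.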
compl is the completeness metric, which reaches its upper bound (1.0) if all inputs of a given class are assigned to the same cluster. Since its range is [0.0, 1.0], you may interpret it as a proportion. homo is the homogeneity metric, with the same range; it reaches 1.0 if each cluster contains inputs of a single class. v_meas is simply the harmonic mean of these two metrics.
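The relationship between the three scores can be checked directly with sklearn.metrics (the toy labels here are made up purely for illustration):

```python
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

labels_true = [0, 0, 1, 1, 2, 2]   # ground-truth classes
labels_pred = [0, 0, 1, 2, 2, 2]   # cluster assignments

homo = homogeneity_score(labels_true, labels_pred)
compl = completeness_score(labels_true, labels_pred)
v = v_measure_score(labels_true, labels_pred)

# v-measure is the harmonic mean of homogeneity and completeness.
harmonic = 2 * homo * compl / (homo + compl)
print(round(v, 6), round(harmonic, 6))
```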
ARI is actually the adjusted Rand score. You can read more about ARI and AMI.
More general information about completeness score and homogeneity measure is here.
Also, you should consider reducing the dimensionality with PCA, because performing KMeans on highly multi-dimensional data may give less satisfying results.
I am currently struggling to understand how the power spectrum is stored in the kaldi framework.
I seem to have successfully created some data files using
$cmd JOB=1:$nj $logdir/spect_${name}.JOB.log \
compute-spectrogram-feats --verbose=2 \
scp,p:$logdir/wav_spect_${name}.JOB.scp ark:- \| \
copy-feats --compress=$compress $write_num_frames_opt ark:- \
ark,scp:$specto_dir/raw_spectogram_$name.JOB.ark,$specto_dir/raw_spectogram_$name.JOB.scp
This gives me a large file with data points for different audio files, like this.
The problem is that I am not sure how I should interpret this data set. I know that an FFT is performed prior to this, which I guess is a good thing.
The output example given above is from a file that is 1 second long.
All the defaults were used for computing the spectrogram, so the sample frequency should be 16 kHz, frame length = 25 ms and frame shift = 10 ms.
The number of data points in the first set is 25186.
Given these informations, can I interpret the output in some way?
Usually when one performs fft, the frequency bin size can be extracted by F_s/N=bin_size where F_s is the sample frequency and N is the FFT length. So is this the same case? 16000/25186 = 0.6... Hz/bin?
Or am I interpreting it incorrectly?
Usually when one performs fft, the frequency bin size can be extracted by F_s/N=bin_size where F_s is the sample frequency and N is the FFT length.
So is this the same case? 16000/25186 = 0.6... Hz/bin?
The formula F_s/N is indeed what you would use to compute the frequency bin size. However, as you mention, N is the FFT length, not the total number of samples. Based on the approximate 25 ms frame length, the 10 ms hop size, and the fact that your generated output data file has 98 lines of 257 values for presumably real-valued input, it would seem that the FFT length used was 512 (98 × 257 = 25186 data points). This would give you a frequency bin size of 16000/512 = 31.25 Hz/bin.
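The frame count and bin size above can be reproduced with a few lines (the numbers are taken from the question; rounding the frame length up to the next power of two mirrors Kaldi's default round-to-power-of-two behaviour):

```python
fs = 16000                    # sample rate (Hz)
frame_len = int(0.025 * fs)   # 25 ms frame -> 400 samples
hop = int(0.010 * fs)         # 10 ms hop  -> 160 samples

# Kaldi rounds the FFT size up to the next power of two by default.
n_fft = 1
while n_fft < frame_len:
    n_fft *= 2

num_frames = 1 + (fs * 1 - frame_len) // hop   # frames in 1 second of audio
num_bins = n_fft // 2 + 1                      # one-sided spectrum of real input

print(n_fft)                  # 512
print(num_frames, num_bins)   # 98 257
print(num_frames * num_bins)  # 25186 data points, as observed
print(fs / n_fft)             # 31.25 Hz per bin
```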
Based on this scaling, plotting your raw data with the following Matlab script (with the data previously loaded in the Z matrix):
fs = 16000; % 16 kHz sampling rate
hop_size = 0.010; % 10 millisecond
[X,Y]=meshgrid([0:size(Z,1)-1]*hop_size, [0:size(Z,2)-1]*fs/512);
surf(X,Y,transpose(Z),'EdgeColor','None','facecolor','interp');
view(2);
xlabel('Time (seconds)');
ylabel('Frequency (Hz)');
gives this graph (the dark red regions are the areas of highest intensity):
This is a follow up question to Flac samples calculation.
Do I apply the offset generated by that formula from the beginning of the file, or from where the stream starts after the metadata (here)?
My goal is to divide the file programmatically myself, largely as a learning exercise. My thought is that I would write out my FLAC header and metadata blocks based on values learned from the image, and then the actual track I get from the master image using my cue sheet.
Currently my code can parse each metadata block and end up where the frames start.
Suppose you are trying to decode starting at M:S.F = 3:45.30. There are 75 frames (CDDA sectors) per second, and obviously there are 60 seconds per minute. To convert M:S.F from your cue sheet into a sample offset value, I would first calculate the number of CDDA sectors to the desired starting point: (((60 * 3) + 45) * 75) + 30 = 16,905. Since there are 75 sectors per second, assuming the audio is sampled at 44,100 Hz there are 44,100 / 75 = 588 audio samples per sector. So the desired audio sample offset where you will start decoding is 588 * 16,905 = 9,940,140.
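The sector/sample arithmetic above, written as a small Python helper (the function name is mine):

```python
def msf_to_sample_offset(minutes, seconds, frames, sample_rate=44100):
    """Convert a cue-sheet M:S.F index to a PCM sample offset.

    There are 75 CDDA sectors per second, so at 44.1 kHz each
    sector holds 44100 / 75 = 588 audio samples.
    """
    sectors = ((minutes * 60 + seconds) * 75) + frames
    samples_per_sector = sample_rate // 75   # 588
    return sectors * samples_per_sector

print(msf_to_sample_offset(3, 45, 30))  # 9940140
```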
The offset just calculated is an offset into the decompressed PCM samples, not into the compressed FLAC stream (nor in bytes). So for each FLAC frame, calculate the number of samples it contains and keep a running tally of your position. Skip FLAC frames until you find the one containing your starting audio sample. At this point you can start decoding the audio, throwing away any samples in the FLAC frame that you don't need.
FLAC also supports a SEEKTABLE block, the use of which would greatly speed up (and alter) the process I just described. If you haven't already, you can look at the implementation of the reference decoder.