How to detect loud sound in audio files using Sox? - linux

I've got several small audio files and I need to find out which ones contain loud sounds. With the stat command of Sox I get max and min amplitudes which are always around -1 and +1.
For example, this sound is louder:
$ sox out6.wav -n stat
Samples read: 220500
Length (seconds): 5.000000
Scaled by: 2147483647.0
Maximum amplitude: 0.999939
Minimum amplitude: -1.000000
Midline amplitude: -0.000031
Mean norm: 0.079951
Mean amplitude: -0.002050
RMS amplitude: 0.244085
Maximum delta: 0.386505
Minimum delta: 0.000000
Mean delta: 0.007803
RMS delta: 0.024331
Rough frequency: 699
Volume adjustment: 1.000
than this one:
$ sox out5.wav -n stat
Samples read: 220500
Length (seconds): 5.000000
Scaled by: 2147483647.0
Maximum amplitude: 0.999939
Minimum amplitude: -1.000000
Midline amplitude: -0.000031
Mean norm: 0.035560
Mean amplitude: -0.000054
RMS amplitude: 0.121909
Maximum delta: 0.085022
Minimum delta: 0.000000
Mean delta: 0.002599
RMS delta: 0.006305
Rough frequency: 363
Volume adjustment: 1.000
But they both have the same min and max amplitude.
How can I determine which one is the loudest?

Peak amplitude is not a good measure of overall loudness. All it reports is the largest or smallest sample value that occurs over a period. The problem is that a clip of all zeros with a single one measures the same peak amplitude as a clip of all ones. The RMS (root mean square) amplitude is a better gauge of loudness. It's computed by taking the mean of the squares of all the samples and then taking the square root of the result. In your output, out6.wav has an RMS amplitude of 0.244085 and out5.wav has 0.121909, so out6.wav is the louder one. https://en.wikipedia.org/wiki/Amplitude
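As a rough illustration of the peak-versus-RMS point, the sketch below computes both values for a stream of sample values with awk. The function name peak_and_rms and the one-sample-per-line input format are assumptions made for this example; they are not part of SoX.

```shell
#!/bin/sh
# Hypothetical helper: reads one sample value per line on stdin and
# prints the peak absolute amplitude and the RMS amplitude.
peak_and_rms() {
    awk '
        BEGIN { peak = 0; sumsq = 0; n = 0 }
        NF {
            x = $1 + 0
            a = (x < 0) ? -x : x      # absolute value
            if (a > peak) peak = a
            sumsq += x * x            # accumulate squares
            n++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "peak=%.3f rms=%.3f\n", peak, sqrt(sumsq / n)
        }
    '
}

# Three zeros and a single one: same peak as a clip of all ones, lower RMS.
printf '0\n0\n0\n1\n' | peak_and_rms    # peak=1.000 rms=0.500
printf '1\n1\n1\n1\n' | peak_and_rms    # peak=1.000 rms=1.000
```

With real files there is no need to compute this by hand: the same figure is the "RMS amplitude" line that sox file.wav -n stat already prints.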

Related

How find sampleCount knowing length audio file and sampleRate?

I have been looking for a long time for how to find sampleCount, but I have found no answer. Is there an algorithm or formula for calculating it? What is known: the duration is 850 ms, the file size is 37 KB, it is a WAV file, and sampleRate is 48000. I can check the result: you should get a sampleCount of 40681, as I have in the file. I need this so that I can calculate sampleCount for other audio files. I am waiting for your help.
I tried and got 40800. I multiplied the rate by the time in seconds.
Yes, the sample count is equal to the sample rate, multiplied by the duration.
So for an audio file that is exactly 850 milliseconds, at a 48 kHz sample rate:
0.850 s * 48000 samples/s = 40800 samples
Now, with MP3s you have to be careful. There is some padding at the beginning of the file for cleanly initializing the decoder, and the amount of padding can vary based on the encoder and its configuration. (You can read all about the troubles this has caused on the Wikipedia page for "gapless playback".) Additionally, your MP3 duration will be determined on MP3 frame boundaries, and not arbitrary PCM boundaries... assuming your decoder/player does not support gapless playback.
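For uncompressed PCM, the relation between duration and sample count can be sketched in shell as follows. The helper names sample_count and duration_ms are made up for this example. Running the formula in reverse shows that 40681 samples at 48 kHz last about 847.5 ms, which suggests (as an assumption, not something the question confirms) that the 850 ms figure is a rounded display value.

```shell
#!/bin/sh
# Hypothetical helpers for uncompressed PCM audio.

# sample_count DURATION_MS SAMPLE_RATE -> samples per channel
sample_count() {
    echo $(( $1 * $2 / 1000 ))
}

# duration_ms SAMPLE_COUNT SAMPLE_RATE -> duration in milliseconds
duration_ms() {
    awk -v n="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", n * 1000 / rate }'
}

sample_count 850 48000     # 40800
duration_ms 40681 48000    # 847.5
```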

Explanation of audio stat using sox

I have a bunch of audio files and need to split each file on silence using SoX. However, I realized that some files have a very noisy background and some don't, so I can't use a single set of parameters to iterate over all files doing the split. I'm trying to figure out how to separate them by background noise. Here is what I got from sox input1.flac -n stat and sox input2.flac -n stat:
Samples read: 18207744
Length (seconds): 568.992000
Scaled by: 2147483647.0
Maximum amplitude: 0.999969
Minimum amplitude: -1.000000
Midline amplitude: -0.000015
Mean norm: 0.031888
Mean amplitude: -0.000361
RMS amplitude: 0.053763
Maximum delta: 0.858917
Minimum delta: 0.000000
Mean delta: 0.018609
RMS delta: 0.039249
Rough frequency: 1859
Volume adjustment: 1.000
and
Samples read: 198976896
Length (seconds): 6218.028000
Scaled by: 2147483647.0
Maximum amplitude: 0.999969
Minimum amplitude: -1.000000
Midline amplitude: -0.000015
Mean norm: 0.156168
Mean amplitude: -0.000010
RMS amplitude: 0.211787
Maximum delta: 1.999969
Minimum delta: 0.000000
Mean delta: 0.091605
RMS delta: 0.123462
Rough frequency: 1484
Volume adjustment: 1.000
The former does not have a noisy background and the latter does. I suspect I can use the sample mean or the maximum delta to tell them apart, because of the big gap between the two files.
Can anyone explain the meaning of those stats to me, or at least show me where I can look it up myself? (I tried the official documentation, but it doesn't explain them.) Many thanks.
I don't know how I've managed to miss stat in the SoX docs all this time, it's right there.
Length: length of the audio file in seconds
Scaled by: what the input is scaled by. By default 2^31-1, to go from 32-bit signed integer to [-1, 1]
Maximum amplitude: maximum sample value
Minimum amplitude: minimum sample value
Midline amplitude: aka mid-range, the midpoint between the maximum and minimum values
Mean norm: arithmetic mean of the samples' absolute values
Mean amplitude: arithmetic mean of the samples' values
RMS amplitude: root mean square, the root of the mean of the squared values
Maximum delta: maximum difference between two successive samples
Minimum delta: minimum difference between two successive samples
Mean delta: arithmetic mean of the differences between successive samples
RMS delta: root mean square of the differences between successive samples
Rough frequency: estimate of the input file's frequency, in hertz. Unsure of the method used
Volume adjustment: value that should be sent to -v so that the peak absolute amplitude is 1
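To make the definitions above concrete, here is a minimal awk sketch that computes a few of them from a list of sample values. The helper name basic_stats and the one-sample-per-line input are assumptions for illustration; the code follows the definitions as listed and is not claimed to match SoX's internal implementation.

```shell
#!/bin/sh
# Hypothetical helper: reads one sample per line (already scaled to [-1, 1])
# and prints a few of the statistics described in the list.
basic_stats() {
    awk '
        function abs(v) { return (v < 0) ? -v : v }
        NF {
            x = $1 + 0
            if (n == 0 || x > max) max = x
            if (n == 0 || x < min) min = x
            sum += x; sumabs += abs(x); sumsq += x * x
            if (n > 0) {
                d = abs(x - prev)     # delta between successive samples
                if (d > maxd) maxd = d
                sumd += d
            }
            prev = x
            n++
        }
        END {
            if (n < 2) exit 1         # need at least one delta
            printf "Maximum amplitude: %.3f\n", max
            printf "Minimum amplitude: %.3f\n", min
            printf "Midline amplitude: %.3f\n", (max + min) / 2
            printf "Mean norm: %.3f\n", sumabs / n
            printf "Mean amplitude: %.3f\n", sum / n
            printf "RMS amplitude: %.3f\n", sqrt(sumsq / n)
            printf "Maximum delta: %.3f\n", maxd
            printf "Mean delta: %.3f\n", sumd / (n - 1)
        }
    '
}

# A square-like signal alternating between 0.5 and -0.5.
printf '0.5\n-0.5\n0.5\n-0.5\n' | basic_stats
```

For this alternating signal the mean amplitude is zero while the mean norm and RMS amplitude are both 0.5, which shows why the mean amplitude says little about loudness.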
Personally I'd rather use the stats function, whose output I find much more practically useful.
As a measure to differentiate between the more or less noisy audio I'd try using the difference between the highest and lowest sound levels. The quietest parts will never be quieter than the background noise alone, so if there is little difference the audio is either noisy, or just loud all the time, like a compressed pop song. You could take the difference between the maximum and minimum RMS values, or between peak and minimum RMS. The RMS window length should be kept fairly short, say between 10 and 200ms, and if the audio has fade-in or fade-out sections, those should be trimmed away, though I didn't include that in the code.
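# Requires bash (uses here-strings) plus sox, grep, sed and bc.
audio="input1.flac"
width=0.01   # RMS window length in seconds (10 ms)
# Mixes down multi-channel files to mono so that stats prints one column.
# stats writes its report to stderr, hence 2>&1.
stats=$(sox "$audio" -n channels 1 stats -w "$width" 2>&1 |\
grep "Pk lev dB\|RMS Pk dB\|RMS Tr dB" |\
sed 's/[^0-9.-]*//g')
# The three matching lines arrive in this order: peak level, RMS peak, RMS trough.
peak=$(head -n 1 <<< "$stats")
rmsmax=$(head -n 2 <<< "$stats" | tail -n 1)
rmsmin=$(tail -n 1 <<< "$stats")
# Differences in dB; bc does the decimal arithmetic.
rmsdif=$(bc <<< "scale=3; $rmsmax - $rmsmin")
pkmindif=$(bc <<< "scale=3; $peak - $rmsmin")
echo "
max RMS: $rmsmax
min RMS: $rmsmin
diff RMS: $rmsdif
peak-min: $pkmindif
"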
The documentation is found in sox.pdf in the install directory.
For example, if you install the Windows 32-bit version of SoX 14.4.2, the PDF is found at C:\Program Files (x86)\sox-14-4-2\sox.pdf and the documentation for stat is on pages 35 - 36.
I also found a webpage version here.
I'd use the "mean norm" value as a decider. It works for me, especially if you get pops or clicks on the line. If the line is clean however, then Maximum Amplitude might be a better value to use (I notice your Maximum Amplitude is the same on both, so therefore do not use this in your case).

Retrieve audio duration from kbps and size

I have this data:
Bit speed: 276 kilobytes/seconds
File size: 6.17 MB
Channels: 2
Layer: 3
Frequency: 44100 HZ
How can I retrieve the audio duration in seconds or milliseconds?
You can't. To get the duration you need the sampling rate in samples per second, but also the number of channels (mono, stereo, etc.) and the sample size in bytes (usually 1 to 3). And unless it is raw audio, there is also additional data that takes up some space. 276 kbps does not help here. If it is an MP3, the file is compressed, and you simply can't tell just by looking at the file size.

Discrete Fourier transform frequency bound?

For an 8 kHz WAV sound I took a 20 ms excerpt, which contains 160 samples of data, and plotted the FFT spectrum in Audacity.
It showed magnitudes at 3000 and 4000 Hz as well. Shouldn't it only show magnitudes up to 80 Hz, because there are 160 samples of data?
For a sample rate of Fs = 8 kHz the FFT will give meaningful results from DC to Nyquist (= Fs / 2), i.e. 0 to 4 kHz. The width of each FFT bin will be 1 / 20 ms = 50 Hz.
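The numbers in this answer can be reproduced with a small shell sketch. The helper name fft_bins is hypothetical; it only applies the relations Nyquist = Fs / 2 and bin width = Fs / N.

```shell
#!/bin/sh
# Hypothetical helper: fft_bins SAMPLE_RATE_HZ NUM_SAMPLES
# Prints the Nyquist frequency, the FFT bin width and the window length.
fft_bins() {
    awk -v fs="$1" -v n="$2" 'BEGIN {
        printf "nyquist_hz=%g bin_width_hz=%g window_ms=%g\n", fs / 2, fs / n, 1000 * n / fs
    }'
}

fft_bins 8000 160    # nyquist_hz=4000 bin_width_hz=50 window_ms=20
```

The upper frequency limit depends only on the sample rate, while the number of samples sets the spacing between bins. A longer window narrows the bins: fft_bins 8000 8000 reports a bin width of 1 Hz.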
Actually, Audacity shows the peak at 4503 Hz, which means it resolves down to 1 Hz bins. By the way, if I take the 20 ms and repeat it 50 times to make a 1 s sample, is the FFT then going to have 1 Hz bins? Also, Audacity has an option for the window. As far as I know, if you use windowing then the components should be powers of 2, like 1, 2, 4, 8, etc., but it shows the exact frequencies, so why does it use windowing?
The sampling rate should be at least 2 * the highest frequency in the signal.
For signals with different frequencies you should change the sampling rate accordingly.

Audio samples per second?

I am wondering about the relationship between a block of samples and its time equivalent. Given my rough idea so far:
Number of samples played per second = total filesize / duration.
So say I have a 1.02 MB file with a duration of 12 sec (avg), I will have about 89,300 samples played per second. Is this right?
Are there other ways to compute this? For example, how can I know how much time a byte[1024] array is equivalent to?
Generally speaking for PCM samples you can divide the total length (in bytes) by the duration (in seconds) to get the number of bytes per second (for WAV files there will be some inaccuracy to account for the header). How these translate into samples depends on
1) the sample rate
2) bits used per sample, i.e. commonly used is 16 bits = 2 bytes
3) number of channels, i.e. for stereo this is 2
If you know 2) and 3) you can determine 1)
In your example, 89300 bytes/second, assuming stereo and 16 bits per sample, would be 89300 / 4 ≈ 22 kHz sample rate.
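As a sketch of that calculation, assuming uncompressed PCM and ignoring any file header, the sample rate can be recovered from the byte rate like this. The helper name sample_rate_from_byte_rate is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper:
#   sample_rate_from_byte_rate BYTES_PER_SEC CHANNELS BITS_PER_SAMPLE
# Assumes uncompressed PCM and ignores any file header.
sample_rate_from_byte_rate() {
    bytes_per_frame=$(( $2 * $3 / 8 ))    # one sample for every channel
    echo $(( $1 / bytes_per_frame ))
}

sample_rate_from_byte_rate 89300 2 16    # 22325
```

The result, 22325, is close to the common 22050 Hz rate, which is consistent with the file size and duration in the question being approximate.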
In addition to @BrokenGlass's very good answer, I'll just add that for uncompressed audio with a fixed sample rate, number of channels and bits per sample, the arithmetic is fairly straightforward. E.g. for "CD quality" audio we have a 44.1 kHz sample rate, 16 bits per sample, 2 channels (stereo), therefore the data rate is:
44100 * 16 * 2
= 1,411,200 bits / sec
= 176,400 bytes / sec
= 10 MB / minute (approx)
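The same arithmetic also answers the byte[1024] part of the question. The sketch below assumes uncompressed PCM; the helper names bytes_per_second and buffer_ms are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for uncompressed PCM.

# bytes_per_second SAMPLE_RATE BITS_PER_SAMPLE CHANNELS
bytes_per_second() {
    echo $(( $1 * $2 * $3 / 8 ))
}

# buffer_ms NUM_BYTES SAMPLE_RATE BITS_PER_SAMPLE CHANNELS
# Prints how many milliseconds of audio a buffer of NUM_BYTES holds.
buffer_ms() {
    rate=$(bytes_per_second "$2" "$3" "$4")
    awk -v b="$1" -v r="$rate" 'BEGIN { printf "%.3f\n", b * 1000 / r }'
}

bytes_per_second 44100 16 2     # 176400
buffer_ms 1024 44100 16 2       # 5.805
```

So at CD quality a 1024-byte buffer holds roughly 5.8 ms of audio; at lower sample rates or with fewer channels the same buffer covers proportionally more time.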

Resources