Explanation of audio stat using SoX

I have a bunch of audio files and need to split each file on silence using SoX. However, I've realized that some files have a very noisy background and some don't, so I can't use a single set of parameters to iterate over all the files doing the split. I'm trying to figure out how to separate them by noisy background. Here is what I got from sox input1.flac -n stat and sox input2.flac -n stat:
Samples read: 18207744
Length (seconds): 568.992000
Scaled by: 2147483647.0
Maximum amplitude: 0.999969
Minimum amplitude: -1.000000
Midline amplitude: -0.000015
Mean norm: 0.031888
Mean amplitude: -0.000361
RMS amplitude: 0.053763
Maximum delta: 0.858917
Minimum delta: 0.000000
Mean delta: 0.018609
RMS delta: 0.039249
Rough frequency: 1859
Volume adjustment: 1.000
and
Samples read: 198976896
Length (seconds): 6218.028000
Scaled by: 2147483647.0
Maximum amplitude: 0.999969
Minimum amplitude: -1.000000
Midline amplitude: -0.000015
Mean norm: 0.156168
Mean amplitude: -0.000010
RMS amplitude: 0.211787
Maximum delta: 1.999969
Minimum delta: 0.000000
Mean delta: 0.091605
RMS delta: 0.123462
Rough frequency: 1484
Volume adjustment: 1.000
The former does not have a noisy background and the latter does. I suspect I can use the Mean delta or the Maximum delta, given the big gap between the two files.
Can anyone explain the meaning of those stats to me, or at least show me where I can find that myself? (I tried looking it up in the official documentation, but it isn't explained there.) Many thanks.

I don't know how I've managed to miss stat in the SoX docs all this time; it's right there.
Length
length of the audio file in seconds
Scaled by
what the input is scaled by; by default 2^31-1, to map 32-bit signed integers to [-1, 1]
Maximum amplitude
maximum sample value
Minimum amplitude
minimum sample value
Midline amplitude
aka mid-range; the midpoint between the maximum and minimum sample values
Mean norm
arithmetic mean of samples' absolute values
Mean amplitude
arithmetic mean of samples' values
RMS amplitude
root mean square: the square root of the mean of the squared sample values
Maximum delta
maximum difference between two successive samples
Minimum delta
minimum difference between two successive samples
Mean delta
arithmetic mean of differences between successive samples
RMS delta
root mean square of differences between successive samples
Rough frequency
an estimate of the input's dominant frequency, in hertz (I'm unsure of the method used)
Volume adjustment
the value that should be passed to -v so that the peak absolute amplitude becomes 1
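If you want to sanity-check these definitions yourself, here is a minimal Python/NumPy sketch (my own, not from SoX; it assumes a mono float array in [-1, 1], and that the delta stats are computed on absolute differences, which matches the output above):

import numpy as np

def stat_like(samples):
    # Recompute the main figures of `sox ... stat` from a float array in [-1, 1]
    deltas = np.abs(np.diff(samples))   # absolute differences of successive samples
    return {
        "Maximum amplitude": samples.max(),
        "Minimum amplitude": samples.min(),
        "Midline amplitude": (samples.max() + samples.min()) / 2,
        "Mean norm":         np.abs(samples).mean(),
        "Mean amplitude":    samples.mean(),
        "RMS amplitude":     np.sqrt(np.mean(samples ** 2)),
        "Maximum delta":     deltas.max(),
        "Minimum delta":     deltas.min(),
        "Mean delta":        deltas.mean(),
        "RMS delta":         np.sqrt(np.mean(deltas ** 2)),
    }

# Example: one second of a half-scale 440 Hz tone with a little noise
t = np.linspace(0, 1, 44100, endpoint=False)
x = 0.5 * np.sin(2 * np.pi * 440 * t) + 0.01 * np.random.randn(44100)
for name, value in stat_like(x).items():
    print(f"{name:18s} {value: .6f}")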
Personally I'd rather use the stats effect, whose output I find much more practically useful.
As a measure to differentiate between the more and less noisy audio, I'd try using the difference between the highest and lowest sound levels. The quietest parts will never be quieter than the background noise alone, so if there is little difference, the audio is either noisy or just loud all the time, like a compressed pop song. You could take the difference between the maximum and minimum RMS values, or between the peak and the minimum RMS. The RMS window length should be kept fairly short, say between 10 and 200 ms, and if the audio has fade-in or fade-out sections, those should be trimmed away, though I didn't include that in the script below.
audio="input1.flac"
width=0.01   # RMS window length in seconds (10 ms)
# channels 1 mixes multi-channel files down to mono; stats prints to stderr
stats=$(sox "$audio" -n channels 1 stats -w $width 2>&1 |
    grep "Pk lev dB\|RMS Pk dB\|RMS Tr dB" |
    sed 's/[^0-9.-]*//g')
peak=$(head -n 1 <<< "$stats")                 # peak level (dB)
rmsmax=$(head -n 2 <<< "$stats" | tail -n 1)   # loudest windowed RMS (dB)
rmsmin=$(tail -n 1 <<< "$stats")               # quietest windowed RMS (dB)
rmsdif=$(bc <<< "scale=3; $rmsmax - $rmsmin")
pkmindif=$(bc <<< "scale=3; $peak - $rmsmin")
echo "
max RMS:  $rmsmax
min RMS:  $rmsmin
diff RMS: $rmsdif
peak-min: $pkmindif
"

The documentation is found in sox.pdf in the install directory.
For example, if you install the Windows 32-bit version of SoX 14.4.2, the PDF is found at C:\Program Files (x86)\sox-14-4-2\sox.pdf and the documentation for stat is on pages 35 - 36.
I also found a webpage version here.

I'd use the "mean norm" value as a decider. It works for me, especially if you get pops or clicks on the line. If the line is clean, however, then Maximum Amplitude might be a better value to use (I notice your Maximum Amplitude is the same in both files, so don't use it in your case).
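A sketch of how that decision could be automated, calling the sox binary from Python (the 0.1 threshold is a placeholder of mine; tune it against files you've already classified by ear):

import re
import subprocess

def mean_norm(path):
    # `sox <file> -n stat` prints its report to stderr
    out = subprocess.run(["sox", path, "-n", "stat"],
                         capture_output=True, text=True).stderr
    return float(re.search(r"Mean norm:\s*([\d.]+)", out).group(1))

NOISY_THRESHOLD = 0.1   # placeholder value; tune on known clean/noisy examples
for f in ["input1.flac", "input2.flac"]:
    label = "noisy" if mean_norm(f) > NOISY_THRESHOLD else "clean"
    print(f, label)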

Related

How to detect loud sound in audio files using Sox?

I've got several small audio files and I need to find out which ones contain loud sounds. With the stat command of Sox I get max and min amplitudes which are always around -1 and +1.
For example, this sound is louder:
$ sox out6.wav -n stat
Samples read: 220500
Length (seconds): 5.000000
Scaled by: 2147483647.0
Maximum amplitude: 0.999939
Minimum amplitude: -1.000000
Midline amplitude: -0.000031
Mean norm: 0.079951
Mean amplitude: -0.002050
RMS amplitude: 0.244085
Maximum delta: 0.386505
Minimum delta: 0.000000
Mean delta: 0.007803
RMS delta: 0.024331
Rough frequency: 699
Volume adjustment: 1.000
than this one:
$ sox out5.wav -n stat
Samples read: 220500
Length (seconds): 5.000000
Scaled by: 2147483647.0
Maximum amplitude: 0.999939
Minimum amplitude: -1.000000
Midline amplitude: -0.000031
Mean norm: 0.035560
Mean amplitude: -0.000054
RMS amplitude: 0.121909
Maximum delta: 0.085022
Minimum delta: 0.000000
Mean delta: 0.002599
RMS delta: 0.006305
Rough frequency: 363
Volume adjustment: 1.000
But they both have the same min and max amplitude.
How can I determine which one is louder?
Peak amplitude is not a good measure of overall loudness. All this measurement does is find the maximum or minimum sample value that occurs over a period. The problem with this is that a clip of all zeros with a single one will measure the same peak amplitude as a clip of all ones. The RMS (root mean square) amplitude is a better gauge of loudness: it's computed by taking the mean of the squared samples and then the square root of the result. https://en.wikipedia.org/wiki/Amplitude
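To make that concrete, here is a small NumPy sketch of mine: both clips below have the same peak amplitude, but their RMS values differ by a factor of almost 500:

import numpy as np

quiet = np.zeros(220500)
quiet[0] = 1.0            # one full-scale sample: same peak...
loud = np.ones(220500)    # ...as a clip that is full-scale throughout

for name, clip in [("quiet", quiet), ("loud", loud)]:
    peak = np.abs(clip).max()
    rms = np.sqrt(np.mean(clip ** 2))
    print(f"{name}: peak = {peak:.3f}, RMS = {rms:.5f}")
# quiet: peak = 1.000, RMS = 0.00213
# loud:  peak = 1.000, RMS = 1.00000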

Creating audio level meter - signal normalization

I have a program which tracks an audio signal in real time. For every processed sample I can read its value, in the range [-1, 1].
I would like to create (and later display) an audio level meter. From what I understand, to do that I need to keep converting the audio signal on each channel to dB in real time, and then display the dB values for each channel in some graphical form of bars.
I am a bit lost on how to do it, and it should be a simple matter. Would just normalizing from [-1, 1] to [0, 1] (like (sample + 1) / 2) and then calculating 20*log10 of each upcoming sample do it?
You can't plot the signal directly, as it is always varying between positive and negative.
Therefore you need to average out the strength of the signal every so many samples.
Say you're sampling at 44.1kHz, perhaps you might choose 4410 samples so you're updating your display 10 times per second.
So you calculate the RMS of your 4410 samples - see http://en.wikipedia.org/wiki/Root_mean_square
The RMS value is always positive.
You can then convert this to dB:
dBV = 20 x log10(Vrms)
This assumes that your maximum signal -1 to +1 corresponds to -1 to +1 volt. You will need to do further adjustments if not.
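A minimal sketch of that recipe in Python/NumPy (the 4410-sample block size and the -60 dB floor are my own arbitrary choices):

import numpy as np

def level_meter_db(samples, block=4410):
    # One dB reading per block: 10 readings per second at 44.1 kHz
    for i in range(0, len(samples) - block + 1, block):
        rms = np.sqrt(np.mean(samples[i:i + block] ** 2))
        yield 20 * np.log10(max(rms, 1e-3))   # clamp at -60 dB; log10(0) is undefined

# One second of a half-scale sine: RMS = 0.5 / sqrt(2), i.e. about -9 dB
t = np.linspace(0, 1, 44100, endpoint=False)
x = 0.5 * np.sin(2 * np.pi * 440 * t)
print([round(db, 1) for db in level_meter_db(x)])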

Transforming Audio Samples From Time Domain to Frequency Domain

As a software engineer I am facing some difficulties while working on a signal processing problem. I don't have much experience in this area.
What I am trying to do is sample the environmental sound at a 44100 Hz sampling rate and, for fixed-size windows, test whether a specific frequency (20 kHz) exists and is higher than a threshold value.
Here is what I do according to the perfect answer in How to extract frequency information from samples from PortAudio using FFTW in C
102400 samples (2320 ms) are gathered from the audio port at a 44100 Hz sampling rate. Sample values are between 0.0 and 1.0.
int samplingRate = 44100;
int numberOfSamples = 102400;
float samples[numberOfSamples] = ListenMic_Function(numberOfSamples,samplingRate);
Window size or FFT Size is 1024 samples (23.2 ms)
int N = 1024;
Number of windows is 100
int noOfWindows = numberOfSamples / N;
Splitting the samples into noOfWindows (100) windows, each of size N (1024) samples
float windowSamplesIn[noOfWindows][N];
for i:= 0 to noOfWindows -1
windowSamplesIn[i] = subarray(samples,i*N,(i+1)*N);
endfor
Applying a Hanning window function to each window
float windowSamplesOut[noOfWindows][N];
for i:= 0 to noOfWindows -1
windowSamplesOut[i] = HanningWindow_Function(windowSamplesIn[i]);
endfor
Applying FFT on each window (real to complex conversion done inside the FFT function)
float frequencyData[noOfWindows][samplingRate/2];
for i:= 0 to noOfWindows -1
frequencyData[i] = RealToComplex_FFT_Function(windowSamplesOut[i], samplingRate);
endfor
In the last step, I use the FFT function implemented in this link: http://www.codeproject.com/Articles/9388/How-to-implement-the-FFT-algorithm ; because I cannot implement an FFT function from scratch.
What I can't be sure of is this: when giving N (1024) samples to the FFT function as input, should samplingRate/2 (22050) decibel values be returned as output? Is that what an FFT function does?
I understand that because of the Nyquist frequency I can detect at most half the sampling rate. But is it possible to get decibel values for each frequency up to samplingRate/2 (22050) Hz?
Thanks,
Vahit
See How do I obtain the frequencies of each value in an FFT?
From a 1024 sample input, you can get back 512 meaningful frequency-levels.
So, yes, within your window, you'll get back a level for the Nyquist frequency.
The lowest frequency level you'll see is for DC (0 Hz), and the next one up will be for SampleRate/1024, or around 43 Hz, the next for 2 * SampleRate/1024, and so on, up to 512 * SampleRate/1024 Hz.
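You can check those bin frequencies directly, e.g. with NumPy:

import numpy as np

fs, n = 44100, 1024
freqs = np.fft.rfftfreq(n, d=1 / fs)   # centre frequency of each real-FFT bin
print(len(freqs))                      # 513: DC, 511 intermediate bins, Nyquist
print(freqs[0], freqs[1], freqs[-1])   # 0.0, ~43.07, 22050.0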
Since only one band of your FFT is used, I would expect your results to be tarnished by side-band effects, even with proper windowing. It might work, but you might also get false positives with some input frequencies. Also, your signal is close to your Nyquist frequency, so you are assuming a fairly good signal path up to your FFT. I don't think this is the right approach.
I think a better approach to this kind of signal detection would be with a high-order filter (depending on your requirements, I would guess fourth or fifth order, which isn't actually that high). If you don't know how to design a high-order filter, you could use two or three second-order filters in series. Designing a second-order filter, sometimes called a "biquad", is described here:
http://www.musicdsp.org/files/Audio-EQ-Cookbook.txt
albeit very tersely and with some assumptions of prior knowledge. I would use a high-pass (HP) filter with corner frequency as low as you can make it, probably between 18 and 20 kHz. Keep in mind there is some attenuation at the corner frequency, so after applying a filter multiple times you will drop a little signal.
After you filter the audio, take the RMS or average amplitude (that is, the average of the absolute value), to find the average level over a time period.
This technique has several advantages over what you are doing now, including better latency (you can start detecting within a few samples), better reliability (you won't get false-positives in response to loud signals at spurious frequencies), and so on.
This post might be of relevance: http://blog.bjornroche.com/2012/08/why-eq-is-done-in-time-domain.html
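For reference, a minimal sketch of that approach (my own, using the high-pass coefficients from the Audio-EQ-Cookbook linked above; the 18 kHz corner, the Q of 0.707, and the test signal are assumptions to illustrate the idea, not a definitive implementation):

import numpy as np

def biquad_highpass(x, fs, f0, q=0.707):
    # RBJ Audio-EQ-Cookbook high-pass biquad, applied in Direct Form I
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    cosw = np.cos(w0)
    b0, b1, b2 = (1 + cosw) / 2, -(1 + cosw), (1 + cosw) / 2
    a0, a1, a2 = 1 + alpha, -2 * cosw, 1 - alpha
    y = np.zeros_like(x)
    x1 = x2 = y1 = y2 = 0.0
    for n in range(len(x)):
        y[n] = (b0 * x[n] + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, x[n]
        y2, y1 = y1, y[n]
    return y

# 1 kHz tone plus a weak 20 kHz component; filter twice for a steeper rolloff,
# then take the average absolute amplitude as the detection statistic.
fs = 44100
t = np.arange(fs) / fs
sig = 0.3 * np.sin(2 * np.pi * 1000 * t) + 0.05 * np.sin(2 * np.pi * 20000 * t)
filtered = biquad_highpass(biquad_highpass(sig, fs, 18000), fs, 18000)
print("average amplitude above the corner:", np.abs(filtered).mean())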

discrete fourier transform frequency bound?

For an 8 kHz WAV sound I took a 20 ms sample, which holds 160 samples of data, and plotted the FFT spectrum in Audacity.
It shows magnitudes at 3000 and 4000 Hz as well. Shouldn't it only show magnitudes up to 80 Hz, since there are just 160 samples of data?
For a sample rate of Fs = 8 khz the FFT will give meaningful results from DC to Nyquist (= Fs / 2), i.e. 0 to 4 kHz. The width of each FFT bin will be 1 / 20 ms = 50 Hz.
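You can confirm the 50 Hz bin width with a one-liner, e.g. in NumPy:

import numpy as np

freqs = np.fft.rfftfreq(160, d=1 / 8000)   # 160 samples at 8 kHz
print(freqs[:4], freqs[-1])                # [0. 50. 100. 150.] ... 4000.0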
Actually Audacity shows the peak at 4503 Hz, which suggests it resolves down to 1 Hz bins. By the way, if I take the 20 ms and repeat it 50 times to make a 1 s sample, will the FFT then have 1 Hz bins? Audacity also has an option for the window; as far as I know, with windowing the components should come out at multiples of 2, like 1, 2, 4, 8, etc., but it shows the exact frequencies, so why does it use windowing?
The best sampling rate is 2 * the highest frequency of interest.
For different frequencies you should change the sampling rate.

Deciding on length of FFT

I am working on a tool to compare two wave files for similarity in their waveforms. For example, I have a wave file of duration 1 min, and I make another wave file from the first one, but zero out every other 5-second stretch of data.
Now my software should report a waveform difference at the intervals 5 s to 10 s, 15 s to 20 s, 25 s to 30 s, and so on...
As of now, with initial development, this is working fine.
Following are 3 test sets:
1. I have two wave files with a sampling rate of 960 Hz, mono, with 138551 data samples each (around 1 min 12 sec files). I am using a 128-point FFT (splitting each file into 128-sample chunks) and the results are good.
2. When I use the same algorithm on wave files with a sampling rate of 48 kHz, 2-channel, with 6927361 data samples per channel (around a 2 min 24 sec file), the process becomes too slow. When I use a 4096-point FFT, the speed is better.
3. But the 4096-point FFT on files of 22050 Hz, 2-channel, with 55776 data samples per channel (around a 0.6 sec file) gives very poor results. In this case the 128-point FFT gives good results.
So I am confused about how to decide on the FFT length so that my results are good in each case.
I guess the length should depend on the number of samples and the sampling rate.
Please give your inputs on this.
Thanks
The length of the FFT, N, will determine the resolution in the frequency domain:
resolution (Hz) = sample_rate (Hz) / N
So for example in case (1) you have resolution = 960 / 128 = 7.5 Hz. So each bin in the resulting FFT (or presumably in the power spectrum derived from it) will be 7.5 Hz wide, and you will be able to differentiate between frequency components which are at least this far apart.
Since you don't say what kind of waveforms these are, or what the purpose of your application is, it's hard to know what kind of resolution you need.
One important further point - many people using FFT for the first time are unaware that in general you need to apply a window function prior to the FFT to avoid spectral leakage.
I have to say I have found your question very cryptic. I think you should look into the short-time Fourier transform (STFT). The reason I say this is that you are looking at quite a large number of samples if you use a sampling frequency of 44.1 kHz over 2 min with 2 channels. One FFT across the entire duration would take quite a while indeed, not to mention the estimate would be biased, as the signal's mean and variance will change drastically over the whole duration. To avoid this you want to frame the time-domain signal first; these frames can be as small as 20-40 ms (commonly used for speech) and often overlapping (the Welch method of spectral estimation). Then you apply a window function such as a Hamming or Hanning window to reduce spectral leakage and calculate an N-point FFT for each frame, where N is the next power of two above the number of samples in that frame.
For example:
Fs = 8Khz, single channel;
time = 120sec;
no_samples = time * Fs = 960000 ;
frame length T_length= 20ms;
frame length in samples N_length = 160;
frame overlap T_overlap= 10ms;
frame overlap in samples N_overlap= 80;
Num of frames N_frames = (no_samples - (N_length-N_overlap))/N_overlap = 11999;
FFT length = 256;
So you will be processing 11999 frames in total, but your FFT length will be small. You will only need an FFT length of 256 (the next power of two above the frame length of 160). Most algorithms that implement the FFT require the signal length and FFT length to be the same. All you have to do is append zeros to your framed signal up until 256: pad each frame with x zeros, where x = FFT_length - N_length. My latest Android app does this on recorded speech and uses the short-time FFT data to display the spectrogram of speech; it also performs various spectral modification and filtering. It's called Speech Enhancement for Android.
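Here is a sketch of that framing scheme in Python/NumPy, using the example numbers above (random noise stands in for the recording; the Hanning window and the zero-padded 256-point FFT are as described):

import numpy as np

fs = 8000
x = np.random.randn(120 * fs)          # stand-in for 120 s of mono audio
n_len, n_hop, n_fft = 160, 80, 256     # 20 ms frames, 10 ms hop, padded FFT
window = np.hanning(n_len)

frames = []
for start in range(0, len(x) - n_len + 1, n_hop):
    frame = x[start:start + n_len] * window   # window to reduce spectral leakage
    spectrum = np.fft.rfft(frame, n=n_fft)    # rfft zero-pads the frame to n_fft
    frames.append(np.abs(spectrum))

print(len(frames), len(frames[0]))     # 11999 frames, 129 magnitude bins each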
