How the ffmpeg astats crest factor is calculated - audio

I'm scripting an ffmpeg processing chain for my work. The aim is to normalize/compress a lot of audio files (MP3s).
It's done in Python and the critical part is the line:
ffmpeg -y -i "Input.mp3" -codec:a libmp3lame -b:a 96k -af acompressor=threshold=-15dB:ratio=5:attack=0.01:release=1000:knee=2,dynaudnorm=g=3:m=2:p=0.95 "Output.mp3"
The Python script is complete and working, but the audio files (voice recordings) differ a lot from one another, so I can't use the same parameters for all of them.
I experimented with the values reported by the ffmpeg astats filter and discovered that the crest factor (standard ratio of peak to RMS level) gives a good reference for choosing better parameters programmatically.
In fact, I saw that a recording with a nice dynamic range, one that sounds smooth and looks smooth in shape, gets crest values around 9-15 (so the compression/normalization parameters can be fairly conservative), whereas audio with crest values around 22-30 needs more aggressive processing.
(All empirically)
Can somebody clarify how the crest values are actually calculated? Which peaks are taken into account? (And why is the flat factor always 0?)
Or, if somebody knows how to get a value representing the 'smoothness' of the sound, that would be nice too.
Thanks for the ideas.

Generally speaking, the crest factor is defined as (Wikipedia):
crest_factor = abs(x_peak) / x_rms
Looking into ffmpeg's source code, we see that the crest factor is defined as:
p->sigma_x2 ? FFMAX(-p->nmin, p->nmax) / sqrt(p->sigma_x2 / p->nb_samples) : 1)
Setting aside the case p->sigma_x2 == 0, we see that:
crest_factor = FFMAX(-p->nmin, p->nmax) / sqrt(p->sigma_x2 / p->nb_samples)
which matches the formula above, given that:
max(- x_min, + x_max) is equivalent to abs(x_peak)
p->sigma_x2 designates the sum of squares of audio samples and p->nb_samples corresponds to the number of audio samples, so sqrt(p->sigma_x2 / p->nb_samples) is the RMS value.
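As a quick way to check the expression above on your own data, here is a minimal Python sketch of the same computation. The function name `crest_factor` and the plain list-of-floats input are assumptions made for illustration; this is not ffmpeg's code, only a re-statement of the quoted formula.

```python
import math


def crest_factor(samples):
    """Peak-to-RMS ratio, following the astats expression quoted above."""
    # Sum of squares of the samples (p->sigma_x2 in the quoted snippet).
    sigma_x2 = sum(x * x for x in samples)
    if sigma_x2 == 0:
        # Same fallback as the quoted ternary: silence yields 1.
        return 1.0
    # max(-x_min, +x_max) is the largest absolute sample value.
    peak = max(-min(samples), max(samples))
    rms = math.sqrt(sigma_x2 / len(samples))
    return peak / rms


# One full period of a sine wave: the crest factor is sqrt(2).
sine = [math.sin(2 * math.pi * i / 1000) for i in range(1000)]
print(crest_factor(sine))
```

A square wave gives 1.0 (peak equals RMS), while a signal with rare, tall peaks over a quiet background gives a large value, which is consistent with the empirical observation in the question.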
Hope it helps!

Related

ffmpeg quality conversion options (video compression)

Can you provide a link to, or an explanation of, the -q:v 1 argument that deals with video/image quality and compression in ffmpeg?
Let me explain...
for f in *
do
    extension="${f##*.}"
    filename="${f%.*}"
    ffmpeg -i "$f" -q:v 1 "$filename"_lq."$extension"
    rm -f "$f"
done
The ffmpeg for loop above compresses all images and videos in your working directory. It basically lowers the quality, which results in smaller file sizes (the desired outcome).
I'm most interested in the -q:v 1 argument of this for loop. The 1 in the -q:v 1 argument is what controls the amount of compression. But I can't find any documentation describing how to change this value of 1, and describing what it does. Is it a percentage? Multiplier? How do I adjust this knob? Can/should I use negative values? Integers only? Min/max values? etc.
I started with the official documentation but the best I could find was a section on video quality, and the -q flag description is sparse.
-frames[:stream_specifier] framecount (output,per-stream)
Stop writing to the stream after framecount frames.
.
-q[:stream_specifier] q (output,per-stream)
-qscale[:stream_specifier] q (output,per-stream)
Use fixed quality scale (VBR). The meaning of q/qscale is codec-dependent. If qscale is used without a stream_specifier then it applies only to the video stream, this is to maintain compatibility with previous behavior and as specifying the same codec specific value to 2 different codecs that is audio and video generally is not what is intended when no stream_specifier is used.
-q:v is probably being ignored
You are outputting MP4, so it is most likely that you are using the encoder libx264 which outputs H.264 video.
-q:v / -qscale:v is ignored by libx264.
The console output even provides a warning about this: -qscale is ignored, -crf is recommended.
For more info on -crf see FFmpeg Wiki: H.264.
When can I use -q:v?
The MPEG* encoders (mpeg4, mpeg2video, mpeg1video, mjpeg, libxvid, msmpeg4) can use -q:v / -qscale:v.
See How can I extract a good quality JPEG image from a video file with ffmpeg? for more info on this option.
This option is an alias for -qscale:v, which might be why you didn't encounter it during your research (even though my result came up first when searching for "ffmpeg q:v" on Google).
This link explains that the qscale option is neither a multiplier nor a percentage: it selects a fixed-quality, variable-bitrate mode, so it indirectly determines the bitrate. For a given encoder, the lower this number, the higher the bitrate and quality. It usually spans 1-31, but some encoders accept only a subset of this range.
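The two answers above can be condensed into a small helper that picks the quality flag by encoder. This is only a sketch: the function name `quality_args` and the default values are assumptions for illustration (23 is libx264's default CRF), and the encoder list is the one given above.

```python
# Encoders that the answer above lists as honouring -q:v / -qscale:v.
QSCALE_ENCODERS = {
    "mpeg4", "mpeg2video", "mpeg1video", "mjpeg", "libxvid", "msmpeg4",
}


def quality_args(encoder, q=2, crf=23):
    """Return ffmpeg arguments that set the quality for the given encoder."""
    if encoder in QSCALE_ENCODERS:
        # Lower q means higher quality and bitrate; the usual range is 1-31.
        if not 1 <= q <= 31:
            raise ValueError("q should be between 1 and 31")
        return ["-c:v", encoder, "-q:v", str(q)]
    if encoder == "libx264":
        # libx264 ignores -q:v; -crf is the recommended quality control.
        return ["-c:v", encoder, "-crf", str(crf)]
    raise ValueError("no quality option known for encoder: " + encoder)


print(quality_args("mjpeg", q=2))
print(quality_args("libx264"))
```

The returned list can be spliced into an argument list passed to `subprocess.run` together with the input and output file names.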

Making Sound Too High To Hear Or Undetectable with Sox/FFMPEG

I want to make a sound that is too high to be detected by the human ear. From my understanding, humans can hear sounds between 20hz and 44000hz.
With sox, I am making a sound that is 50000hz. The problem is I can still hear it. The command I am using is this:
sox -n -r 50000 output.wav rate -L -s 50050 synth 3 sine
Either I have super good hearing or I am doing something wrong. How can I make this sound undetectable with SoX or FFmpeg?
Human hearing is generally considered to range between 20Hz and 20kHz, although most people don't hear much above 16kHz. Digital signals can only represent frequencies up to half of their sampling rate, known as the Nyquist frequency, and so, in order to accurately reproduce audio for the human ear, a sampling rate of at least 40kHz is needed. In practice, a sampling rate of 44.1kHz or 48kHz is almost always used, leaving room for an inaudible sound somewhere in the 20-22kHz range.
For example, this command generates a WAV file with a sampling rate of 48kHz containing a sine wave at 22kHz that is completely inaudible to me:
sox -n -r 48000 output.wav synth 3 sine 22000
I think part of your problem was that you were using the wrong syntax to specify the pitch to sox. This question has some good information about using SoX to generate simple tones.
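To make the Nyquist constraint concrete, here is a small Python sketch that generates the same kind of tone using only the standard library and refuses frequencies the sampling rate cannot represent. The function names `sine_samples` and `write_wav` and the 0.5 amplitude are assumptions chosen for illustration.

```python
import math
import struct
import wave


def sine_samples(freq_hz, rate_hz, seconds, amplitude=0.5):
    """Return float samples of a sine tone, rejecting tones at or above Nyquist."""
    nyquist = rate_hz / 2
    if not 0 < freq_hz < nyquist:
        raise ValueError("frequency must be below the Nyquist frequency")
    count = int(rate_hz * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / rate_hz)
            for i in range(count)]


def write_wav(path, samples, rate_hz):
    """Write mono 16-bit PCM samples (floats in [-1, 1]) to a WAV file."""
    frames = b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(frames)


# A 22 kHz tone fits under the 24 kHz Nyquist limit of a 48 kHz file.
tone = sine_samples(22000, 48000, 3)
# write_wav("output.wav", tone, 48000)
```

Asking for a 50 kHz tone at a 48 kHz sampling rate raises an error here, because such a tone cannot be represented at that rate.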

How to remove long silent and unchanged video sections with ffmpeg?

I want to identify areas in a .mp4 (H264 + AAC) video that are silent and unchanged frames and cut them out.
Of course there would be some fine-tuning regarding thresholds and algorithms to measure unchanged frames.
My problem is more general, regarding how I would go about automating this?
Is it possible to solve this with ffmpeg? (preferably with C or python)
How can I programmatically analyse the audio?
How can I programmatically analyse video frames?
For audio silence see this.
For still video scenes ffmpeg might not be the ideal tool.
You could use scene change detection with a low threshold to find the specific frames, then extract those frames and compare them with something like ImageMagick's compare tool:
ffprobe -show_frames -print_format compact -f lavfi "movie=test.mp4,select=gt(scene\,.1)"
compare -metric RMSE frame1.png frame0.png
I don't expect this to work very well.
Your best bet is to use something like OpenCV to find differences between frames.
OpenCV Simple Motion Detection
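Whichever tool produces the per-frame difference scores (the RMSE from compare, or a difference computed with OpenCV), the automation step is then to find long runs of near-zero differences. A minimal sketch of that step follows; the function name `static_segments`, the score list, and the threshold values are all hypothetical and would need tuning on real footage.

```python
def static_segments(diffs, fps, threshold, min_seconds):
    """Find long runs of nearly unchanged frames.

    diffs[i] is a difference score between frame i and frame i + 1.
    Returns (start_seconds, end_seconds) pairs for runs in which every
    score stays below threshold for at least min_seconds.
    """
    segments = []
    start = None
    for i, score in enumerate(diffs):
        if score < threshold:
            if start is None:
                start = i  # a static run begins here
        elif start is not None:
            if (i - start) / fps >= min_seconds:
                segments.append((start / fps, i / fps))
            start = None
    # Close a run that lasts until the end of the video.
    if start is not None and (len(diffs) - start) / fps >= min_seconds:
        segments.append((start / fps, len(diffs) / fps))
    return segments


scores = [5, 0, 0, 0, 0, 5, 0, 5]
print(static_segments(scores, fps=2, threshold=1, min_seconds=2))
```

Intersecting these segments with the silent intervals found in the audio gives the candidate ranges to cut.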

Programmatic mix analysis of stereo audio files - is bass panned to one channel?

I want to analyze my music collection, which is all CD audio data (stereo 16-bit PCM, 44.1kHz). What I want to do is programmatically determine if the bass is mixed (panned) only to one channel. Ideally, I'd like to be able to run a program like this
mono-bass-checker music.wav
And have it output something like "bass is not panned" or "bass is mixed primarily to channel 0".
I have a rudimentary start on this, which in pseudocode looks like this:
binsize = 2^N  # define a window or FFT bin as a power of 2
while not end of audio file:
    read binsize samples from audio file
    de-interleave channels into two separate arrays
    chan0_fft_result = fft on channel 0 array
    chan1_fft_result = fft on channel 1 array
    for each index i in (number of items in chanX_fft_result/2):
        frequency_bin = i * 44100 / binsize
        # define bass as below 150 Hz (and above 30 Hz, since I can't hear it)
        if frequency_bin > 150 or frequency_bin < 30: ignore
        magnitude = sqrt(chanX_fft_result[i].real^2 + chanX_fft_result[i].imag^2)
I'm not really sure where to go from here. Some concepts I've read about but are still too nebulous to me:
Window function. I'm currently not using one, just naively reading from the audio file 0 to 1024, 1025 to 2048, etc (for example with binsize=1024). Is this something that would be useful to me? And if so, how does it get integrated into the program?
Normalizing and/or scaling of the magnitude. Lots of people do this for the purpose of making pretty spectrograms, but do I need to do that in my case? I understand human hearing roughly works on a log scale, so perhaps I need to massage the magnitude result in some way to filter out what I wouldn't be able to hear anyway? Is something like A-weighting relevant here?
binsize. I understand that a bigger binsize gets me more frequency bins... but I can't decide if that helps or hurts in this case.
I can generate a "mono bass song" using sox like this:
sox -t null /dev/null --encoding signed-integer --bits 16 --rate 44100 --channels 1 sine40hz_mono.wav synth 5.0 sine 40.0
sox -t null /dev/null --encoding signed-integer --bits 16 --rate 44100 --channels 1 sine329hz_mono.wav synth 5.0 sine 329.6
sox -M sine40hz_mono.wav sine329hz_mono.wav sine_merged.wav
In the resulting "sine_merged.wav" file, one channel is pure bass (40Hz) and one is non-bass (329 Hz). When I compute the magnitude of bass frequencies for each channel of that file, I do see a significant difference. But what's curious is that the 329Hz channel has non-zero sub-150Hz magnitude. I would expect it to be zero.
Even then, with this trivial sox-generated file, I don't really know how to interpret the data I'm generating. And obviously, I don't know how I'd generalize to my actual music collection.
FWIW, I'm trying to do this with libsndfile and fftw3 in C, based on help from these other posts:
WAV-file analysis C (libsndfile, fftw3)
Converting an FFT to a spectogram
How do I obtain the frequencies of each value in an FFT?
Not using a window function (the same as using a rectangular window) will splatter some of the high frequency content (anything not exactly periodic in your FFT length) into all other frequency bins of an FFT result, including low frequency bins. (Sometimes this is called spectral "leakage".)
To minimize this, try applying a window function (von Hann, etc.) before the FFT, and expect to have to use some threshold level, instead of expecting zero content in any bins.
Also note that the bass notes from many musical instruments can generate some very powerful high-frequency overtones or harmonics that will show up in the upper bins of an FFT, so you can't rule out a strong bass mix just because a lot of high-frequency content is present.
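To illustrate how a window and a threshold fit together, here is a minimal pure-Python sketch (not the libsndfile/fftw3 C program from the question). It applies a Hann window to one block per channel, sums the squared magnitudes of the bins between 30 Hz and 150 Hz, and compares the two channels by ratio. The names `bass_energy` and `describe_bass_pan` and the ratio of 4 are assumptions for illustration; a direct DFT over only the bass bins stands in for a full FFT to keep the example dependency-free.

```python
import cmath
import math


def bass_energy(samples, rate=44100, lo_hz=30.0, hi_hz=150.0):
    """Sum of squared DFT magnitudes in [lo_hz, hi_hz] after a Hann window."""
    n = len(samples)
    # The Hann window tapers the block edges, which reduces spectral leakage.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(samples)]
    energy = 0.0
    for k in range(n // 2):
        freq = k * rate / n
        if freq < lo_hz or freq > hi_hz:
            continue  # only the bass bins are evaluated
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        energy += abs(acc) ** 2
    return energy


def describe_bass_pan(chan0, chan1, ratio=4.0):
    """Compare bass energy per channel against a ratio threshold."""
    e0, e1 = bass_energy(chan0), bass_energy(chan1)
    if e0 > ratio * e1:
        return "bass is mixed primarily to channel 0"
    if e1 > ratio * e0:
        return "bass is mixed primarily to channel 1"
    return "bass is not panned"


# Same test signal as sine_merged.wav: 40 Hz on one side, 329.6 Hz on the other.
n, rate = 4096, 44100
chan0 = [math.sin(2 * math.pi * 40.0 * i / rate) for i in range(n)]
chan1 = [math.sin(2 * math.pi * 329.6 * i / rate) for i in range(n)]
print(describe_bass_pan(chan0, chan1))
```

For a real song, the per-block energies would be accumulated over the whole file before the ratio is taken, and the ratio threshold would need to be tuned by listening.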

What is the effect of the "quality" option in SoX mp3 compression?

I am converting audio files of several different formats to mp3 using SoX. According to the docs, you can use the -C argument to specify compression options like the bitrate and quality, the quality being after the decimal point, for example:
sox input.wav -C 128.01 output.mp3 (highest quality, slower)
sox input.wav -C 128.99 output.mp3 (lowest quality, faster)
I expected the second one to sound terrible, however, the audio quality between the two sounds exactly the same. If that is the case, I do not understand why one performs so much slower or what I would gain by setting the compression to higher "quality".
Can someone please tell me if there is a real difference or advantage to using higher quality compression versus lower quality?
P.S. I also checked the file size of each output file and both are exactly the same size. But when hashed, each file comes out with a different hash.
The parameters are passed on to LAME. According to the LAME documentation (section “algorithm quality selection”/-q), the quality value has an effect on noise shaping and the psychoacoustic model used. They recommend a quality of 2 (i.e. -C 128.2 in SoX), saying that 0 and 1 are much slower, but hardly better.
However, the main factor determining the quality remains the bit rate. It is therefore not too surprising that there is no noticeable difference in your case.
For me, it is much faster with a plain bitrate and no quality fraction:
time sox input.mp3 -C 128 output.mp3
real 0m7.417s
user 0m7.334s
sys 0m0.057s
time sox input.mp3 -C 128.02 output.mp3
real 0m39.805s
user 0m39.430s
sys 0m0.205s

Resources