I want to normalize the amplitude of sound files, and use the following:
sox infile.wav outfile.wav gain -n -3
The sound however contains peaks:
So of course, these are taken into account when the normalization is done.
Can SoX help me to detect such peaks and lower their amplitude before normalization?
(I haven't been able to find this in the documentation). Or, what would be some other recommended way to accomplish this?
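To make the problem concrete, here is a minimal Python sketch of what peak normalization to -3 dB does, assuming the samples are floats in the range -1.0 to 1.0. The function name peak_normalize_gain is made up for this illustration and is not part of SoX. It only shows why a single isolated spike determines the gain applied to the whole file.

```python
def peak_normalize_gain(samples, target_db=-3.0):
    """Return the linear gain that brings the largest |sample| to target_db dBFS."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return 1.0  # silent input: nothing to scale
    target = 10 ** (target_db / 20)  # convert dBFS to linear amplitude
    return target / peak


quiet_body = [0.1, -0.1, 0.08, -0.09]  # the bulk of the recording
with_peak = quiet_body + [0.9]         # same recording plus one isolated spike

print(peak_normalize_gain(quiet_body))  # gain chosen from the quiet material
print(peak_normalize_gain(with_peak))   # much smaller gain, dictated by the spike
```

With the spike present, the rest of the audio is amplified far less than it could be, which is the effect described above.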
I have tried PyRubberband, librosa, praat-parselmouth and pysox. All of them work, but I still hear some noise or small artifacts in the output. They also shift the audio by around 100 ms.
How can I tune them to get the best possible quality, or can you suggest a library that does it better?
Update: FFmpeg approach:
ffmpeg -i input.wav -af asetrate=48000*1.1,aresample=48000,atempo=1/1.1 output.wav
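The filter chain in that command can be built for other shift amounts. Below is a rough Python sketch that only assembles the -af string from the same three filters used above (asetrate, aresample, atempo). The helper names are hypothetical, and the sketch does not check whether the resulting tempo factor is inside the range atempo accepts.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a shift of the given number of equal-tempered semitones."""
    return 2 ** (semitones / 12)


def pitch_shift_filter(sample_rate, ratio):
    """Build an ffmpeg -af string: resample trick to shift pitch, atempo to restore duration."""
    shifted_rate = round(sample_rate * ratio)  # play the samples faster or slower
    return (
        f"asetrate={shifted_rate},"
        f"aresample={sample_rate},"
        f"atempo={1 / ratio:.6f}"              # undo the speed change
    )


print(pitch_shift_filter(48000, 1.1))
print(pitch_shift_filter(48000, semitones_to_ratio(2)))
```

The first call reproduces the 1.1 ratio from the command above; the second shows a shift of two semitones.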
I would guess that of the four mentioned, PyRubberband is probably the best algorithm. Depending on how much you want to shift pitch, you will never reach perfect results. This has to do with the fact that (as far as I know) they all use a phase vocoder, which transforms the signal into the frequency domain, shifts and then transforms it back into the time domain using the imperfect Griffin-Lim algorithm. Griffin-Lim tends to introduce small phase artifacts, which leads to a slightly metallic sound.
To learn more about time scale modification/pitch shifting, I recommend this overview article by Driedger.
I want to make a sound that is too high to be detected by the human ear. From my understanding, humans can hear sounds between 20 Hz and 44000 Hz.
With SoX, I am making a sound that is 50000 Hz. The problem is I can still hear it. The command I am using is this:
sox -n -r 50000 output.wav rate -L -s 50050 synth 3 sine
Either I have super good hearing or I am doing something wrong. How can I make this sound undetectable with SoX or FFmpeg?
Human hearing is generally considered to range between 20Hz and 20kHz, although most people don't hear much above 16kHz. Digital signals can only represent frequencies up to half of their sampling rate, known as the Nyquist frequency, and so, in order to accurately reproduce audio for the human ear, a sampling rate of at least 40kHz is needed. In practice, a sampling rate of 44.1kHz or 48kHz is almost always used, leaving plenty of space for an inaudible sound somewhere in the 20-22kHz range.
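As a small illustration of the Nyquist limit, the Python sketch below computes which frequency a sampled tone actually ends up at when it is folded back (aliased) into the representable range. The function names are made up for this example, and it assumes an ideal sampler with no anti-aliasing filter.

```python
def nyquist(sample_rate):
    """Highest frequency a signal sampled at sample_rate can represent."""
    return sample_rate / 2


def aliased_frequency(freq, sample_rate):
    """Frequency that is actually represented when a tone at freq is sampled at sample_rate."""
    folded = freq % sample_rate
    return min(folded, sample_rate - folded)


print(nyquist(48000))                   # upper limit at 48 kHz sampling
print(aliased_frequency(22000, 48000))  # below the limit: unchanged
print(aliased_frequency(30000, 48000))  # above the limit: folds back down
```

A tone above the Nyquist frequency does not disappear; it shows up at a lower, possibly audible, frequency.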
For example, this command generates a WAV file with a sampling rate of 48kHz containing a sine wave at 22kHz that is completely inaudible to me:
sox -n -r 48000 output.wav synth 3 sine 22000
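For comparison, roughly the same file can be produced without SoX. This is a minimal sketch using only Python's standard wave, struct and math modules; the function name write_sine and the 0.5 amplitude are arbitrary choices for the example, and the output is 16-bit mono.

```python
import math
import struct
import wave


def write_sine(path, freq, sample_rate=48000, seconds=3.0, amplitude=0.5):
    """Write a mono 16-bit WAV file containing a sine wave; return the frame count."""
    if freq >= sample_rate / 2:
        raise ValueError("frequency must be below the Nyquist frequency")
    n_frames = int(sample_rate * seconds)
    frames = bytearray()
    for i in range(n_frames):
        value = amplitude * math.sin(2 * math.pi * freq * i / sample_rate)
        frames += struct.pack("<h", int(value * 32767))  # little-endian signed 16-bit
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return n_frames


if __name__ == "__main__":
    write_sine("output.wav", 22000)  # 22 kHz tone at a 48 kHz sampling rate
```

The explicit check against half the sampling rate is there so that a request such as 50 kHz at 48 kHz fails loudly instead of silently producing an aliased tone.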
I think part of your problem was that you were using the wrong syntax to specify the pitch to sox. This question has some good information about using SoX to generate simple tones.
I want to identify areas in a .mp4 (H264 + AAC) video that are silent and unchanged frames and cut them out.
Of course there would be some fine-tuning regarding thresholds and algorithms to measure unchanged frames.
My question is more general: how would I go about automating this?
Is it possible to solve this with ffmpeg? (preferably with C or Python)
How can I programmatically analyse the audio?
How can I programmatically analyse video frames?
For audio silence see this.
For still video scenes ffmpeg might not be the ideal tool.
You could use scene change detection with a low threshold to find the specific frames, then extract those frames and compare them with something like ImageMagick's compare tool:
ffprobe -show_frames -print_format compact -f lavfi "movie=test.mp4,select=gt(scene\,.1)"
compare -metric RMSE frame1.png frame0.png
I don't expect this to work very well.
Your best bet is to use something like OpenCV to find differences between frames.
OpenCV Simple Motion Detection
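The frame comparison idea can be sketched without any image library. The Python below assumes the frames have already been decoded into flat lists of grayscale pixel values (for example by OpenCV, which is not used here), and the threshold and min_length values are arbitrary placeholders that would need tuning. The function names are hypothetical.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute per-pixel difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def static_runs(frames, threshold=2.0, min_length=3):
    """Return (start, end) index pairs, end exclusive, of runs of near-identical frames."""
    runs = []
    start = 0
    for i in range(1, len(frames) + 1):
        # A run ends at the last frame or when two neighbours differ too much.
        changed = i == len(frames) or mean_abs_diff(frames[i - 1], frames[i]) > threshold
        if changed:
            if i - start >= min_length:
                runs.append((start, i))
            start = i
    return runs


# Four identical frames, one different frame, then three identical frames.
frames = [[10, 10]] * 4 + [[200, 200]] + [[50, 50]] * 3
print(static_runs(frames))
```

The resulting index ranges could then be converted to timestamps using the frame rate and intersected with the silent ranges found in the audio.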
I want to analyze my music collection, which is all CD audio data (stereo 16-bit PCM, 44.1kHz). What I want to do is programmatically determine if the bass is mixed (panned) only to one channel. Ideally, I'd like to be able to run a program like this
mono-bass-checker music.wav
And have it output something like "bass is not panned" or "bass is mixed primarily to channel 0".
I have a rudimentary start on this, which in pseudocode looks like this:
binsize = 2^N  # FFT length (window size), a power of 2
while not end of audio file:
    read binsize samples from audio file
    de-interleave channels into two separate arrays
    chan0_fft_result = fft on channel 0 array
    chan1_fft_result = fft on channel 1 array
    for each index i in (number of items in chanX_fft_result / 2):
        frequency_bin = i * 44100 / binsize
        # define bass as below 150 Hz (and above 30 Hz, since I can't hear it)
        if frequency_bin > 150 or frequency_bin < 30: ignore
        magnitude = sqrt(chanX_fft_result[i].real^2 + chanX_fft_result[i].imag^2)
I'm not really sure where to go from here. Some concepts I've read about but are still too nebulous to me:
Window function. I'm currently not using one, just naively reading from the audio file 0 to 1024, 1025 to 2048, etc (for example with binsize=1024). Is this something that would be useful to me? And if so, how does it get integrated into the program?
Normalizing and/or scaling of the magnitude. Lots of people do this for the purpose of making pretty spectrograms, but do I need to do that in my case? I understand human hearing roughly works on a log scale, so perhaps I need to massage the magnitude result in some way to filter out what I wouldn't be able to hear anyway? Is something like A-weighting relevant here?
binsize. I understand that a bigger binsize gets me more frequency bins... but I can't decide if that helps or hurts in this case.
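To put numbers on the binsize trade-off, here is a short Python sketch that lists which FFT bins fall inside the 30 Hz to 150 Hz band for a few FFT lengths at 44.1 kHz. It uses the same bin-to-frequency formula as the pseudocode above; the function names are made up for the illustration.

```python
def bin_frequency(index, sample_rate, fft_size):
    """Centre frequency in Hz of FFT bin `index`."""
    return index * sample_rate / fft_size


def bass_bins(sample_rate, fft_size, low_hz=30.0, high_hz=150.0):
    """Indices of the bins whose centre frequency lies in [low_hz, high_hz]."""
    return [
        i
        for i in range(fft_size // 2 + 1)
        if low_hz <= bin_frequency(i, sample_rate, fft_size) <= high_hz
    ]


for fft_size in (1024, 4096, 16384):
    bins = bass_bins(44100, fft_size)
    spacing = 44100 / fft_size  # frequency resolution in Hz per bin
    print(fft_size, round(spacing, 2), len(bins))
```

A larger FFT gives more bins inside the bass band, but each FFT then covers a longer stretch of audio, so the result is less localized in time.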
I can generate a "mono bass song" using sox like this:
sox -t null /dev/null --encoding signed-integer --bits 16 --rate 44100 --channels 1 sine40hz_mono.wav synth 5.0 sine 40.0
sox -t null /dev/null --encoding signed-integer --bits 16 --rate 44100 --channels 1 sine329hz_mono.wav synth 5.0 sine 329.6
sox -M sine40hz_mono.wav sine329hz_mono.wav sine_merged.wav
In the resulting "sine_merged.wav" file, one channel is pure bass (40Hz) and one is non-bass (329 Hz). When I compute the magnitude of bass frequencies for each channel of that file, I do see a significant difference. But what's curious is that the 329Hz channel has non-zero sub-150Hz magnitude. I would expect it to be zero.
Even then, with this trivial sox-generated file, I don't really know how to interpret the data I'm generating. And obviously, I don't know how I'd generalize to my actual music collection.
FWIW, I'm trying to do this with libsndfile and fftw3 in C, based on help from these other posts:
WAV-file analysis C (libsndfile, fftw3)
Converting an FFT to a spectogram
How do I obtain the frequencies of each value in an FFT?
Not using a window function (the same as using a rectangular window) will splatter some of the high frequency content (anything not exactly periodic in your FFT length) into all other frequency bins of an FFT result, including low frequency bins. (Sometimes this is called spectral "leakage".)
To minimize this, try applying a window function (von Hann, etc.) before the FFT, and expect to have to use some threshold level, instead of expecting zero content in any bins.
Also note that the bass notes from many musical instruments can generate some very powerful high frequency overtones or harmonics that will show up in the upper bins of an FFT, so you can't rule out a strong bass mix just because there is a lot of high frequency content.
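Here is a rough Python sketch of the windowing and threshold idea, applied to the two test tones from the question (40 Hz and 329.6 Hz). It uses a naive DFT over the bass bins only, so it needs nothing beyond the standard library; in a real program this would be an FFT (fftw3 in C, for instance). The function names and the 10 dB threshold are assumptions for illustration, not recommended values.

```python
import cmath
import math


def hann(n):
    """Hann (von Hann) window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def bass_energy(samples, sample_rate, window=None, low_hz=30.0, high_hz=150.0):
    """Sum of squared DFT magnitudes over the bins between low_hz and high_hz."""
    n = len(samples)
    if window is not None:
        samples = [s * w for s, w in zip(samples, window)]
    energy = 0.0
    for k in range(n // 2 + 1):
        if low_hz <= k * sample_rate / n <= high_hz:
            # Naive DFT of a single bin; fine for the handful of bass bins.
            x = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples))
            energy += abs(x) ** 2
    return energy


def classify(chan0, chan1, sample_rate, threshold_db=10.0):
    """Compare windowed bass energy of two channels against a dB threshold."""
    window = hann(len(chan0))
    e0 = bass_energy(chan0, sample_rate, window)
    e1 = bass_energy(chan1, sample_rate, window)
    tiny = 1e-12  # avoids division by zero and log of zero for silent channels
    diff_db = 10 * math.log10((e0 + tiny) / (e1 + tiny))
    if diff_db > threshold_db:
        return "bass is mixed primarily to channel 0"
    if diff_db < -threshold_db:
        return "bass is mixed primarily to channel 1"
    return "bass is not panned"


n, rate = 1024, 44100
tone_40 = [math.sin(2 * math.pi * 40.0 * i / rate) for i in range(n)]
tone_329 = [math.sin(2 * math.pi * 329.6 * i / rate) for i in range(n)]

# Leakage of the 329.6 Hz tone into the bass bins, without and with a window.
print(bass_energy(tone_329, rate), bass_energy(tone_329, rate, hann(n)))
print(classify(tone_40, tone_329, rate))
```

For a whole song, one option would be to accumulate the per-channel bass energy over all blocks before comparing, rather than classifying each block separately.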
I am converting audio files of several different formats to mp3 using SoX. According to the docs, you can use the -C argument to specify compression options like the bitrate and quality, the quality being after the decimal point, for example:
sox input.wav -C 128.01 output.mp3 (highest quality, slower)
sox input.wav -C 128.99 output.mp3 (lowest quality, faster)
I expected the second one to sound terrible, however, the audio quality between the two sounds exactly the same. If that is the case, I do not understand why one performs so much slower or what I would gain by setting the compression to higher "quality".
Can someone please tell me if there is a real difference or advantage to using higher quality compression versus lower quality?
P.S. I also checked the file size of each output file and both are exactly the same size. But when hashed, each file comes out with a different hash.
The parameters are passed on to LAME. According to the LAME documentation (section “algorithm quality selection”/-q), the quality value has an effect on noise shaping and the psychoacoustic model used. They recommend a quality of 2 (i.e. -C 128.2 in SoX), saying that 0 and 1 are much slower, but hardly better.
However, the main factor determining the quality remains the bit rate. It is therefore not too surprising that there is no noticeable difference in your case.
For me, it is much faster with a plain bitrate and no quality suffix:
time sox input.mp3 -C 128 output.mp3
real 0m7.417s
user 0m7.334s
sys 0m0.057s
time sox input.mp3 -C 128.02 output.mp3
real 0m39.805s
user 0m39.430s
sys 0m0.205s