I'm trying to detect speech volume above a threshold in short (2-3 second) audio files with sox, but the result always comes out at about 90% of maximum volume, whether the file contains silence or noise.
This is the command I'm using (I've tried varying the scale option):
sox noise.wav -n stats -s 99
If I shout with the microphone in my mouth, or bash it, I can get a detectable difference of about 95% volume, but it is a desktop-style microphone. Playing back the audio files, the silence is audibly quiet, and there is a clear difference when someone speaks from a distance.
Is there a setting I'm missing, or has anyone else encountered this?
I've been using SoX to generate white noise. I'm after a way of modulating the volume across the entire track in a way that will create a pattern similar to this:
White noise envelope effect
I've experimented with fade, but that fades in to 100% volume and fades out to 0% volume, which is just a pain in this instance.
The tremolo effect isn't quite what I'm after either, as the frequency of the pattern will be changing over time.
The only other alternative is to split the white noise file into separate files, apply fade and then apply trim to either end so it doesn't fade all the way, but this seems like a lot of unnecessary processing.
I've been checking out this example, "Using SoX to change the volume level of a range of time in an audio file", but I don't think it's quite what I'm after.
I'm using the command line in Ubuntu with SoX, but I'm open to suggestions using ffmpeg or any other Linux-based command-line solution.
With ffmpeg, you could use the volume filter:
ffmpeg -i input.wav -af \
"volume='if(lt(mod(t\,5)/5\,0.5), 0.2+0.8*mod(2*t\,5)/5\, 1.0-0.8*mod(t-(5/2)\,5)/(5/2))':eval=frame" \
output.wav
The expression in the filter above increases the volume linearly from 0.2 to 1.0 between t=0 and t=2.5 seconds, then brings it gradually back down to 0.2 at t=5 seconds. The period of the envelope here is 5 seconds.
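To sanity-check the shape of that envelope before running ffmpeg, here is a minimal Python sketch of the same piecewise arithmetic. The function name `envelope` and its parameters are illustrative only and are not part of ffmpeg; the defaults mirror the 5-second period and the 0.2 to 1.0 range used above.

```python
import math

def envelope(t, period=5.0, low=0.2, high=1.0):
    """Triangular gain: low -> high over the first half period, then back to low."""
    phase = math.fmod(t, period) / period  # position within the period, 0 <= phase < 1
    if phase < 0.5:
        # Rising half: matches 0.2 + 0.8 * mod(2*t, 5) / 5
        return low + (high - low) * (phase / 0.5)
    # Falling half: matches 1.0 - 0.8 * mod(t - 5/2, 5) / (5/2)
    return high - (high - low) * ((phase - 0.5) / 0.5)

# Print the gain at a few points across one period.
for t in (0.0, 1.25, 2.5, 3.75, 5.0):
    print(f"t={t:4.2f}s gain={envelope(t):.2f}")
```

Changing `period` over time (the varying pattern frequency mentioned in the question) would need a different phase calculation; this sketch only covers the fixed-period case.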
To detect speech I'm playing with this sox command:
rec voice.wav silence 1 5 30% 1 0:00:02 30%
It should start recording whenever the input volume rises above the 30% threshold and stop 2 seconds after the audio falls below the same threshold.
It works, but it would be much better if it were "retriggerable". I mean: if the audio falls below the threshold and then rises again, it should continue recording (i.e. the user is still speaking).
It should stop only when it detects silence for a whole 2 seconds.
Or do you recommend any other "VOX" tool?
I've spent a lot of time experimenting with SoX to do VOX and have gotten it to work reasonably well. I've been using Audacity to view the resulting waveform, and have settled on the following SoX command:
rec snd.wav silence 1 .5 2.85% 1 1.0 3.0% vad gain -n : newfile : restart
This will:
wait until it hears activity above the threshold for a half second, then start recording (silence 1 .5 2.85%)
stop recording when audible activity stays below the threshold for one second (... 1 1.0 3.0%)
trim off any initial silence up to voice detection (vad)
normalize the gain (gain -n)
store the result into a new file (snd001.wav, snd002.wav)
restart the process
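The start and stop conditions in the list above can be modelled as a small state machine. The following Python sketch is a simplified illustration of that logic, not SoX's actual implementation: it assumes the audio has already been reduced to one level per frame (0.0 to 1.0), and the function name `vox_segments` and its parameters are hypothetical.

```python
def vox_segments(levels, start_thresh, start_frames, stop_thresh, stop_frames):
    """Return (first_frame, last_frame) pairs, one per detected utterance.

    levels: one level per frame, 0.0..1.0 (assumed precomputed).
    start_frames / stop_frames: how many consecutive frames must be
    above / below the threshold to start / stop a recording.
    """
    segments = []
    recording = False
    loud = 0   # consecutive frames above start_thresh while waiting
    quiet = 0  # consecutive frames below stop_thresh while recording
    start = 0
    for i, level in enumerate(levels):
        if not recording:
            loud = loud + 1 if level > start_thresh else 0
            if loud >= start_frames:
                recording, quiet = True, 0
                start = i - loud + 1  # first frame of the loud run
        else:
            # Any activity resets the silence counter, so a short pause
            # does not end the recording.
            quiet = quiet + 1 if level < stop_thresh else 0
            if quiet >= stop_frames:
                segments.append((start, i))  # includes the trailing quiet frames
                recording, loud = False, 0
    if recording:
        segments.append((start, len(levels) - 1))
    return segments

levels = [0.0, 0.5, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.5]
print(vox_segments(levels, 0.03, 2, 0.03, 2))
```

In this model the single quiet frame in the middle of the example does not split the utterance, because the silence counter restarts as soon as the level rises again.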
Getting the "silence" numbers correct involved a lot of trial and error, and will depend on ambient noise as well as the sensitivity of your microphone. I'm using the microphone in the Logitech QuickCam IM on a Raspberry Pi through USB.
On a side note, this whole thing complains with the following...
rec FAIL formats: can't open input `default': snd_pcm_open error: No such file or directory
... until I created this variable in the environment:
export AUDIODEV=hw:1,0
Again - this involved a lot of experimentation with the values for "silence", and it WILL need some tweaking for your environment.
Currently, I use sox like this:
sox -d -e u-law --endian little -b 8 -c 1 -r 8000 -t ul - silence 1 0.3 1% 1 0.3 1%
For reference, this records audio from the default microphone and outputs little-endian, u-law formatted audio at 8 bits and an 8 kHz sample rate. The effects chain trims audio until the level stays above the threshold for 0.3 seconds, then continues to record until there are 0.3 seconds of silence. All of this goes to stdout, which I use to stream to a remote server.
I am using all of this to record a bit of voice and finish when I am done speaking. I use specialized hardware to trigger sox at the start of the recording. I can switch to almost any audio format or codec as long as it supports on-the-fly formatting/encoding. My target platform is Raspbian on the Raspberry Pi 2 B.
My ideal solution would be to use vad to stop the recording when the user is finished speaking. My hope is that this would work even with background chatter. However, the sox documentation on the vad effect states this:
The use of the norm effect is recommended, but remember that neither
reverse nor norm is suitable for use with streamed audio.
I haven't been able to piece parameters together to get vad and streaming working. Is it possible to use the vad effect to stop the recording of audio while still maintaining the stdin->sox->stdout piping? Are there better alternatives?
Is it possible to use the vad effect to stop the recording of audio while still maintaining the stdin->sox->stdout piping?
No. The vad effect can trim silence only from the front of the audio, so you could use it only to detect the start of a recording, not its end or any pauses.
The reverse and norm effects need all of the input data before they produce any output, which is why they cannot be used with streaming.
The key is to select a good threshold for the silence effect so that it treats "background chatter" as silence.
You could also use noisered (with a profile based on previous recordings) before silence to reduce the chance of noise triggering the recording, but this will also affect the output and probably will not treat "background chatter" as noise.
I'm having trouble finding a program which will resample 16-bit 44.1 kHz PCM into 12-bit 25 kHz. There's not much info on this to be found...
Anyone have a clue? I tried Audacity, ffmpeg,... but to no avail. I was thinking about reducing the amplitude of a 16-bit sample by 75% in a normal editor and throwing away the highest nibble, but something tells me it might not be that simple...
You can do that using Audacity, but it's not so intuitive. There is no direct resampling option, so you need to do it in two steps.
In the Audio Track menu you can use Set Rate to change the sampling rate to 25000 Hz.
That only changes the playback speed, so you also need to resample it. That is done with Change Speed in the Effect menu. The speed change is 44100/25000 = 1.764, which is 76.4% faster.
Now you can export the track to the 12-bit format that you want.
I found another way in the meantime: sox (the command-line version), which does very good and complete conversion: http://sox.sourceforge.net/ I am converting the sample rate with sox and processing the raw file through PHP.
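On the bit-depth half of the conversion, here is a minimal Python sketch of reducing one sample from 16 to 12 bits. It assumes signed 16-bit input and a signed 12-bit target; the thread does not say what 12-bit layout the target expects, so treat that as an assumption, and the function name `to_12bit` is hypothetical. Note that keeping the full amplitude range means dropping the lowest four bits rather than the highest nibble.

```python
def to_12bit(sample16):
    """Map a signed 16-bit sample (-32768..32767) to signed 12-bit (-2048..2047)."""
    if not -32768 <= sample16 <= 32767:
        raise ValueError("expected a signed 16-bit sample")
    # Arithmetic right shift by 4 discards the four least significant bits,
    # i.e. divides by 16 and rounds towards negative infinity.
    return sample16 >> 4

# Convert a few representative samples, including both extremes.
print([to_12bit(s) for s in (-32768, -1, 0, 16, 32767)])
```

This is plain truncation with no dithering; whether that is acceptable depends on the material and the playback device.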
I would like to save a quiet audio file with more volume. Can you suggest a program or method to do it?
The device I would like to use the audio file on is not very intelligent and the maximum volume setting is too quiet for me. Other audio files (that are louder) can be played fine on the device. So I thought, I open the file on PC, modify it to be louder, save it, then the device will play this fine too. I am aware of distortions and such, that is not the point now.
I have used VLC player and I can find a setting where the audio file is loud enough with little distortion, but I cannot find an option to save the file with these settings. It is an MP3 file.
Thanks for the help,
Sziro
Increasing the volume of an audio file so that the peaks are at (or near) the maximum level is called normalisation. You can use an audio editor like Audacity, or there are dedicated solutions. Normally, if saving to MP3, you normalise to slightly less than full volume (say -0.5 dB).
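To make the arithmetic concrete, here is a minimal Python sketch of peak normalisation. It assumes the samples are already decoded to floats in the range -1.0 to 1.0, and the function name `normalisation_gain` is illustrative rather than taken from any particular tool.

```python
def normalisation_gain(samples, target_dbfs=-0.5):
    """Return the linear gain that moves the loudest peak to target_dbfs."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return 1.0  # silent or empty input: nothing to scale
    target_peak = 10 ** (target_dbfs / 20)  # dB relative to full scale -> linear
    return target_peak / peak

quiet = [0.05, -0.25, 0.1]            # peak amplitude is 0.25
gain = normalisation_gain(quiet)
louder = [s * gain for s in quiet]    # peak now sits just under full scale
print(gain, max(abs(s) for s in louder))
```

Because the gain depends on the single loudest sample, the whole file has to be read before any of it can be scaled.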
You might also want to consider compressing the audio. This will be useful if the peaks in your audio are much louder than the quiet passages, and the quiet passages are hard to hear as a result. Again, you can do this in Audacity.