sox effect: retriggerable silence - audio

To detect speech I'm playing with this sox command:
rec voice.wav silence 1 5 30% 1 0:00:02 30%
It should start recording whenever the input volume rises above the 30% threshold and stop 2 seconds after the audio falls below the same threshold.
It works, but it would be much better if it were "retriggerable". I mean: if the audio falls below the threshold and then rises again, it should continue recording (i.e. the user is still speaking).
It should stop only when it detects silence for a whole 2 seconds.
Or do you recommend any other "VOX" tool?

I've spent a lot of time experimenting with SOX to do VOX and have gotten it to work reasonably well. I've been using Audacity to view the resultant wave form, and have settled on the following SOX command...
rec snd.wav silence 1 .5 2.85% 1 1.0 3.0% vad gain -n : newfile : restart
This will:
wait until it hears activity above the threshold for a half second, then start recording (silence 1 .5 2.85%)
stop recording when audible activity falls below the threshold for one second (... 1 1.0 3.0%)
trim off any initial silence up to voice detection (vad)
normalize the gain (gain -n)
store the result into a new file (snd001.wav, snd002.wav)
restart the process
Getting the "silence" numbers correct involved a lot of trial and error, and will depend on ambient noise as well as the sensitivity of your microphone. I'm using the microphone in the Logitech QuickCam IM on a Raspberry Pi through USB.
On a side note, this whole thing complains with the following...
rec FAIL formats: can't open input `default': snd_pcm_open error: No such file or directory
... until I created this variable in the environment:
export AUDIODEV=hw:1,0
Again - this involved a lot of experimentation with the values for "silence", and it WILL need some tweaking for your environment.
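The retriggerable behavior asked for in the question (keep recording through short pauses, stop only after a full stretch of silence) comes down to a hold timer that is reset whenever the level rises above the threshold again. Here is a minimal Python sketch of that logic. It works on a list of per-frame levels; the function name gate_frames, the frame-based units and the sample values are assumptions made for illustration and are not part of SoX.

```python
def gate_frames(levels, threshold, hold_frames):
    """Return (start, end) frame indices of one retriggerable recording.

    levels      -- per-frame loudness values in the range 0.0..1.0
    threshold   -- level above which a frame counts as "sound"
    hold_frames -- number of consecutive quiet frames that ends the recording
    Returns None if the threshold is never exceeded.
    """
    start = None
    quiet = 0  # consecutive quiet frames seen since the last loud frame
    for i, level in enumerate(levels):
        if level > threshold:
            if start is None:
                start = i   # first loud frame opens the gate
            quiet = 0       # any loud frame retriggers the hold timer
        elif start is not None:
            quiet += 1
            if quiet >= hold_frames:
                return (start, i + 1)  # end is exclusive, trailing quiet included
    if start is None:
        return None
    return (start, len(levels))  # input ended before the hold time elapsed


# Two bursts separated by a 2-frame pause, followed by a long silence.
levels = [0.0, 0.5, 0.6, 0.1, 0.1, 0.7, 0.0, 0.0, 0.0, 0.0]
print(gate_frames(levels, threshold=0.3, hold_frames=3))  # (1, 9)
```

The 2-frame pause is shorter than the hold time, so the second burst lands in the same recording. Only the run of 3 quiet frames at the end closes it.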

Related

record screen with high quality and minimum size in ElectronJS (Windows)

As I said in the title, I need to record my screen from an Electron app.
My needs are:
high quality (720p or 1080p)
minimum size
record audio + screen + mic
low impact on PC hardware while recording
no need for any wait after the recorder stopped
By minimum size I mean about 400MB at 720p and 700MB at 1080p for a 3 to 4 hour recording. We could already achieve this with Bandicam and OBS, so it is possible.
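To put the size target in perspective, file size divided by duration gives the average total bitrate (video plus audio) the encoder has to hit. Here is a rough calculation in Python. It assumes 1 MB = 1,000,000 bytes and takes 3.5 hours as the midpoint of the stated range; both are assumptions, and average_bitrate_kbps is just an illustrative helper name.

```python
def average_bitrate_kbps(size_mb, hours):
    """Average bitrate in kilobits per second for a file of size_mb megabytes."""
    bits = size_mb * 1_000_000 * 8
    seconds = hours * 3600
    return bits / seconds / 1000


for label, size_mb in (("720p", 400), ("1080p", 700)):
    print(f"{label}: {average_bitrate_kbps(size_mb, 3.5):.0f} kbit/s")

# For comparison: the 1GB-per-hour figure reported for RecordRTC below.
print(f"1GB/hour: {average_bitrate_kbps(1000, 1):.0f} kbit/s")
```

Under these assumptions the targets work out to roughly 254 and 444 kbit/s, against about 2222 kbit/s for 1GB per hour.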
I already tried:
the simple MediaStreamRecorder API using RecordRTC.Js; produces huge file sizes, like 1GB per hour for 720p video.
compressing the output video using FFmpeg; it can take up to 1 hour for a 3-hour recording
saving every chunk in the 'ondataavailable' event and, right after, running FFmpeg to convert it and reduce its size, then appending all the compressed files (also with FFmpeg); there are two problems. 1, the chunks have different PTS, but that can be fixed by tuning the compress command args. 2, the main problem: the audio data headers are only available in the first chunk, so this approach produces a video that only has audio for the first few seconds
recording the video with FFmpeg itself; the end users need to change some things manually (Stereo Mix), the configs are too complex, it makes the whole PC slower while recording (like fps drops, even if I set -threads to 1), and in some cases it needs a long time after recording has finished to wrap it all up
searching the internet for applications that can be used from the command line; I couldn't find much. Famous applications like Bandicam and OBS have command line args, but there are not many args to play with and I can't set many options, which leads to other problems
I don't know what else I can do. Please tell me if you know a way or a simple tool that can be used through the CLI to achieve this, and guide me through it.
I ended up using the portable mode of high-level third-party applications like obs-studio and adding them to our final package. I also created a js file to control the application using the CLI.
This way I could pre-set my options (such as the crf value, etc.), and now our average output size for a 3:30 hour video at 1080p resolution is about 700MB, which is impressive.

Remove start of new track from end of audio file

I have a couple of recordings where the end of the track has silence followed by the start (a fraction) of a new track.
I tried to remove the start of the new track from the end of the file.
My command (*nix):
sox file_in.mp3 -C 320 file_out.mp3 silence 1 0.75 0.2% -1 0.75 0.2%
Preferably I'd keep the silence at the end, or just add some new silence. Any help appreciated.
You can probably use the below-periods parameter without the above-periods one.
A bit of silence can optionally be kept with the -l parameter (or you can just use pad with a fixed amount).
For example:
sox input.mp3 -C 320 output.mp3 silence -l 0 1 0.5 0.5%
Note: if the beginning of a new track contains long periods of silence, it may be tricky, but is usually possible, to find suitable duration and threshold parameters. The general case is probably not tractable by simple processing at all; reliably marking inner periods of silence that belong to a track would need a machine learning approach.
Also, if it is just a couple of files, opening an editor such as Audacity is the quickest approach and gives the best result.
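The idea of cutting the file at a sufficiently long quiet stretch can also be sketched in a few lines of Python. This sketch keeps everything up to and including the last quiet run of at least min_silence samples and drops whatever follows it. Choosing the last qualifying run is an assumption about what is wanted here, not a description of how the silence effect works internally, and cut_at_last_silence is a hypothetical helper name.

```python
def cut_at_last_silence(samples, threshold, min_silence):
    """Return samples truncated at the end of the last quiet run.

    A quiet run is at least min_silence consecutive samples whose absolute
    value is at or below threshold. Everything after the last such run is
    dropped; the run itself is kept. If no run qualifies, the input is
    returned unchanged.
    """
    cut = None  # index just past the last qualifying quiet run
    run = 0     # length of the current quiet run
    for i, s in enumerate(samples):
        if abs(s) <= threshold:
            run += 1
            if run >= min_silence:
                cut = i + 1
        else:
            run = 0
    return samples if cut is None else samples[:cut]


# Track, three silent samples, then the first two samples of the next track.
samples = [0.5, 0.4, 0.0, 0.0, 0.0, 0.6, 0.7]
print(cut_at_last_silence(samples, threshold=0.01, min_silence=3))
# [0.5, 0.4, 0.0, 0.0, 0.0]
```

As with the SoX parameters, min_silence and threshold have to be tuned so that pauses inside a track do not qualify.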

piping a tone generator to aplay()

I thought it would be a simple task ...
- platform: linux on laptop
- language: python
- object: generate a tone to be heard on the speaker or headphones. The tone will be modified in real time, many times per second (think of a metal detector)
The initial design was to generate a tone in python and pipe it to aplay().
Since aplay consumes data at a known rate (the sampling rate), I thought my tone generator would not have to care about timing if the silences (between tones) were generated at the normal sampling rate (null amplitude).
The first result showed a significant time lag (many seconds). I found that the pipe is fairly long by default (64KB). That's 8 seconds of samples (at 8 kHz).
I found a way to reduce the pipe size to 4KB but it is still too long (0.5s lag).
Sampling at a very high frequency would reduce the lag but I don't like that solution.
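The lag figures above follow directly from buffer size divided by the byte rate of the stream. Here is a small Python check; it assumes 8-bit mono samples, which is what the 64KB and 8-second figures imply, and buffer_latency_seconds is an illustrative helper name.

```python
def buffer_latency_seconds(buffer_bytes, sample_rate, bytes_per_sample=1, channels=1):
    """Worst-case delay added by a full buffer sitting in front of the player."""
    return buffer_bytes / (sample_rate * bytes_per_sample * channels)


print(buffer_latency_seconds(64 * 1024, 8000))  # 8.192
print(buffer_latency_seconds(4 * 1024, 8000))   # 0.512
```

The same formula shows why a higher sampling rate or wider samples shorten the lag: the buffer drains faster, but at the cost of generating more data per second.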
Second approach was to generate a real silence (no sample) during a silence and the generator would sleep() during the silence.
Result is that aplay complains about underrunning and, for some reason, the tones were truncated and mishandled (bad rendering).
So, my question is:
What are the best ways to send a tone to the audio stack without piping?

SoX detecting volume is always near max

I'm trying to detect speech volume above a threshold in short (2-3 second) audio files with sox, but it always comes out at about 90% of max volume regardless of silence or noise.
This is the command I'm using (I've tried varying the scale option):
sox noise.wav -n stats -s 99
If I shout with the microphone in my mouth, or bash it, I can get a detectable difference of about 95% volume, but it is a desktop style microphone. Playing back the audio files, there is an audible silence recorded, and there is still a big distinction when speaking from a distance.
Is there a setting I'm missing, or has anyone else encountered this?
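One way to cross-check numbers like these independently of sox is to compute the peak and RMS level of the samples directly. The Python sketch below does that for signed 16-bit sample values and reports both in dB relative to full scale. The 16-bit full-scale value, the synthetic test signals and the name levels_dbfs are assumptions for illustration; it does not diagnose the problem above, it only gives a second measurement to compare against.

```python
import math


def levels_dbfs(samples, full_scale=32768):
    """Return (peak_dbfs, rms_dbfs) for a sequence of signed integer samples."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))

    def to_db(value):
        # Digital silence has no finite dB value.
        return 20 * math.log10(value / full_scale) if value > 0 else float("-inf")

    return to_db(peak), to_db(rms)


quiet = [100, -100] * 50      # low-level square wave
loud = [16384, -16384] * 50   # half-scale square wave

print(levels_dbfs(quiet))
print(levels_dbfs(loud))
```

A half-scale signal comes out at about -6 dB. If a recording that sounds silent still measures close to 0 dB, the peak is probably dominated by something other than speech, for example a click or a DC offset.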

Using sox for voice detection and streaming

Currently, I use sox like this:
sox -d -e u-law --endian little -b 8 -c 1 -r 8000 -t ul - silence 1 0.3 1% 1 0.3 1%
For reference, this records audio from the default microphone and outputs little endian, u-law formatted audio at 8 bits and an 8k rate. The effects filter trims audio until the noise stays above a threshold for 0.3 seconds, then continues to record until there is 0.3 seconds of silence. All of this goes to stdout, which I use to stream to a remote server.
I am using all of this to record a bit of voice and finish when I am done speaking. To trigger sox, I use specialized hardware to trigger the start of the recording. I can switch to using almost any audio format or codec as long as it supports on the fly formatting/encoding. My target platform is raspbian on the raspberry pi 2 B.
My ideal solution would be to use vad to stop the recording when the user is finished speaking. My hope is that this would work even with background chatter. However, the sox documentation on the vad effect states this:
The use of the norm effect is recommended, but remember that neither
reverse nor norm is suitable for use with streamed audio.
I haven't been able to piece parameters together to get vad and streaming working. Is it possible to use the vad effect to stop the recording of audio while still maintaining the stdin->sox->stdout piping? Are there better alternatives?
Is it possible to use the vad effect to stop the recording of audio while still maintaining the stdin->sox->stdout piping?
No. The vad effect can trim silence only from the front of the audio. So you could only use it to detect recording start, and not ending and pauses.
The reverse and norm filters need all the input data before they produce any data on output, that is why they cannot be used with streaming.
The key is to select a good threshold for the silence filter so that it treats "background chatter" as silence.
You could also use noisered (with a profile based on previous recordings) before silence to reduce noise triggering the recording, but this will also affect the output and probably will not treat "background chatter" as noise.

Resources