Lag when piping data from Node.js to play (SoX)

Recently I was super excited to discover baudio, but unfortunately it has a bug (issue #13) that renders it practically useless. In light of this, I've taken the core ideas and started making my own version. Here's a breakdown of what it does:
generate amplitude output at short intervals in the form of 8-bit PCM data (i.e. single hex characters representing amplitude levels 0-15)
pipe node's stdout to play's (SoX's) stdin
I need it to be an instrument usable in real time, so I'm piping the data (8-bit amplitude samples) in chunks 20 times per second, but this seems to cause a lot of lag and choppiness. When I create a file out of node's output and pipe the uninterrupted stream of hex digits to play:
cat bitfile | play -c 1 -r 44100 -t s8 -
it works perfectly.
Here is the JS for my node applet; it requires the 'through' module:
npm install through
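For reference, here is a minimal sketch (not the original applet; the tone frequency and chunk size are assumptions) of the approach described above: push signed 8-bit samples through a 'through' stream to stdout in 50 ms chunks, 20 times per second.

var through = require('through');

var SAMPLE_RATE = 44100;
var CHUNK_MS = 50;                                      // 20 chunks per second
var SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS / 1000;  // 2205 samples per chunk
var freq = 440;                                         // example tone
var phase = 0;

var tone = through();                                   // identity stream
tone.pipe(process.stdout);

setInterval(function () {
  var buf = Buffer.alloc(SAMPLES_PER_CHUNK);
  for (var i = 0; i < SAMPLES_PER_CHUNK; i++) {
    buf.writeInt8(Math.round(127 * Math.sin(phase)), i); // signed 8-bit sample
    phase += 2 * Math.PI * freq / SAMPLE_RATE;
  }
  tone.queue(buf);                                       // queue the chunk for stdout
}, CHUNK_MS);

// Run with: node tone.js | play -c 1 -r 44100 -t s8 -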
So my question is:
Can I do anything to reduce this piping lag? Faster hardware? A different approach? Is it hopeless?

Related

record screen with high quality and minimum size in ElectronJS (Windows)

As I said in the title, I need to record my screen from an Electron app.
My needs are:
high quality (720p or 1080p)
minimum size
record audio + screen + mic
low impact on PC hardware while recording
no need for any wait after the recorder stopped
By minimum size I mean about 400 MB at 720p and 700 MB at 1080p for a 3- to 4-hour recording. We have already achieved this with Bandicam and OBS, so it's possible.
I already tried:
the simple MediaStreamRecorder API using RecordRTC.js; it produces huge files, around 1 GB per hour for 720p video
compressing the output video with FFmpeg; it can take up to 1 hour for a 3-hour recording
saving every chunk with the 'ondataavailable' event and, right after, running FFmpeg to convert and shrink each chunk, then appending all the compressed files (also with FFmpeg); there are two problems: (1) the chunks have different PTS values, though this can be fixed by tuning the compression arguments, and (2) the main problem: the audio headers are only present in the first chunk, so this approach produces a video that has audio only for the first few seconds
recording the video with FFmpeg itself; end users have to change some things manually (Stereo Mix), the configuration is too complex, it slows the whole PC down while recording (FPS drops, even with -threads set to 1), and in some cases it takes a long time to wrap everything up after recording finishes
searching the internet for applications that can be driven from the command line; I couldn't find much, and while well-known applications like Bandicam and OBS do have command line arguments, there aren't many to play with and I can't set many options, which leads to other problems
I don't know what else I can do. Please tell me if you know a way or a simple tool that can be used through the CLI to achieve this, and guide me through it.
I ended up using the portable mode of high-level third-party applications like obs-studio and adding them to our final package. I also created a JS file to control the application from the command line, as sketched below.
This way I could pre-set my options (such as the CRF value, etc.), and now our average output size for a 3.5-hour recording at 1080p resolution is about 700 MB, which is impressive.
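A rough sketch of that controller script (paths, profile name, and launch flags are assumptions; check them against your OBS build):

const { spawn } = require('child_process');
const path = require('path');

// Start a portable OBS install with a pre-configured profile so the recording
// options (encoder, CRF, etc.) are already set, and begin recording immediately.
const obsDir = path.join(__dirname, 'obs-studio', 'bin', '64bit');
const obs = spawn(path.join(obsDir, 'obs64.exe'), [
  '--portable',
  '--profile', 'Recording720p', // profile created beforehand with the desired settings
  '--startrecording',
  '--minimize-to-tray'
], { cwd: obsDir });            // OBS expects to be launched from its bin directory

obs.on('exit', (code) => console.log('OBS exited with code', code));

Stopping the recording cleanly is a separate concern (for example, closing OBS gracefully or driving it through something like the obs-websocket plugin) and is not shown here.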

Why can I sometimes concatenate audio data using NodeJS Buffers, and sometimes I cannot?

As part of a project I am working on, there is a requirement to concatenate multiple pieces of audio data into one large audio file. The audio files are generated from four sources, and the individual files are stored in a Google Cloud Storage bucket. Each file is an mp3 file, and it is easy to verify that each individual file is generated correctly (individually, I can play them, edit them in my favourite software, etc.).
To merge the audio files together, a nodejs server loads the files from the Google Cloud storage as an array buffer using an axios POST request. From there, it puts each array buffer into a node Buffer using Buffer.from(), so now we have an array of Buffer objects. Then it uses Buffer.concat() to concatenate the Buffer objects into one big Buffer, which we then convert to Base64 data and send to the client server.
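As a condensed sketch of that flow (the endpoint and request shape are assumptions), the server-side concatenation looks roughly like this:

const axios = require('axios');

async function buildCombinedAudio(fileUrls) {
  const responses = await Promise.all(
    fileUrls.map((url) => axios.post(url, {}, { responseType: 'arraybuffer' }))
  );
  const buffers = responses.map((res) => Buffer.from(res.data)); // one Buffer per mp3
  const combined = Buffer.concat(buffers);                       // naive byte-level concatenation
  return combined.toString('base64');                            // sent on to the client server
}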
This is cool, but the issue arises when concatenating audio generated from different sources. The four sources I mentioned above are text-to-speech platforms, such as Google Cloud Voice and Amazon Polly. Specifically, we have files from Google Cloud Voice, Amazon Polly, IBM Watson, and Microsoft Azure Text to Speech. Essentially just four text-to-speech solutions. Again, all individual files work, but when concatenating them together via this method there are some interesting effects.
When the sound files are concatenated, seemingly depending on which platform they originate from, the sound data either will or will not be included in the final sound file. Below is a 'compatibility' table based on my testing:
| Platform   | Google | Amazon | Microsoft | IBM |
|------------|--------|--------|-----------|-----|
| Google     | Yes    | No     | No        | No  |
| Amazon     |        | No     | No        | Yes |
| Microsoft  |        |        | Yes       | No  |
| IBM        |        |        |           | Yes |
The effect is as follows: when I play the large output file, it always starts playing the first sound file included. From there, if the next sound file is compatible it is heard; otherwise it is skipped entirely (no empty sound or anything). If a file was skipped, its 'length' (for example, 10 s for a 10-second audio file) is still included at the end of the generated output file. However, the moment my audio player hits the point where the last 'compatible' audio has played, it immediately skips to the end.
As a scenario:
Input:
sound1.mp3 (3s) -> Google
sound2.mp3 (5s) -> Amazon
sound3.mp3 (7s) -> Google
sound4.mp3 (11s) -> IBM
Output:
output.mp3 (26s) -> first 10s is sound1 and sound3, last 16s is skipped.
In this case, the output sound file would be 26s seconds long. For the first 10 seconds, you would hear the sound1.mp3 and sound3.mp3 played back to back. Then at 10s (at least playing this mp3 file in firefox) the player immediately skips to the end at 26s.
My question is: Does anyone have any ideas why sometimes I can concatenate audio data in this way, and other times I cannot? And how come there is this 'missing' data included at the end of the output file? Shouldn't concatenating the binary data work in all cases if it works for some cases, as all the files have mp3 encoding? If I am wrong please let me know what I can do to successfully concatenate any mp3 files :)
I can provide my nodeJS backend code, but the process and methods used are described above.
Thanks for reading!
Potential Sources of Problems
Sample Rate
44.1 kHz is often used for music, as it's what is used on CD audio. 48 kHz is usually used for video, as it's what was used on DVDs. Both of those sample rates are much higher than is required for speech, so it's likely that your various text-to-speech providers are outputting something different. 22.05 kHz (half of 44.1 kHz) is common, and 11.025 kHz is out there too.
While each frame specifies its own sample rate, making it possible to generate a stream with varying sample rates, I've never seen a decoder attempt to switch sample rates mid-stream. I suspect that the decoder is skipping these frames, or maybe even skipping over an arbitrary block until it gets consistent data again.
Use something like FFmpeg (or FFprobe) to figure out what the sample rates of your files are:
ffmpeg -i sound2.mp3
You'll get an output like this:
Duration: 00:13:50.22, start: 0.011995, bitrate: 192 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 192 kb/s
In this example, 44.1 kHz is the sample rate.
Channel Count
I'd expect your voice MP3s to be in mono, but it wouldn't hurt to check to be sure. As with above, check the output of FFmpeg. In my example above, it says stereo.
As with sample rate, technically each frame could specify its own channel count but I don't know of any player that will pull off switching channel count mid-stream. Therefore, if you're concatenating, you need to make sure all the channel counts are the same.
ID3 Tags
It's common for there to be ID3 metadata at the beginning (ID3v2) and/or end (ID3v1) of the file. It's less expected to have this data mid-stream. You would want to make sure this metadata is all stripped out before concatenating.
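As a sketch of that stripping step (it only handles the common case of one leading ID3v2 tag and one trailing ID3v1 tag, and ignores the optional ID3v2 footer):

// ID3v2 header: "ID3" + 2 version bytes + 1 flags byte + 4-byte "syncsafe" size.
function stripId3v2(buf) {
  if (buf.length >= 10 && buf.toString('latin1', 0, 3) === 'ID3') {
    const size = ((buf[6] & 0x7f) << 21) | ((buf[7] & 0x7f) << 14) |
                 ((buf[8] & 0x7f) << 7)  |  (buf[9] & 0x7f);
    return buf.slice(10 + size);  // skip the 10-byte header plus the tag body
  }
  return buf;
}

// ID3v1 is a fixed 128-byte block at the end of the file starting with "TAG".
function stripId3v1(buf) {
  const start = buf.length - 128;
  if (start >= 0 && buf.toString('latin1', start, start + 3) === 'TAG') {
    return buf.slice(0, start);
  }
  return buf;
}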
MP3 Bit Reservoir
MP3 frames don't necessarily stand alone. If you have a constant bitrate stream, the encoder may still use less data to encode one frame, and more data to encode another. When this happens, some frames contain data for other frames. That way, frames that could benefit from the extra bandwidth can get it while still fitting the whole stream within a constant bitrate. This is the "bit reservoir".
If you cut a stream and splice in another stream, you may split up a frame and its dependent frames. This typically causes an audio glitch, but may also cause the decoder to skip ahead. Some badly behaving decoders will just stop playing altogether. In your example, you're not cutting anything so this probably isn't the source of your trouble... but I mention it here because it's definitely relevant to the way you're working these streams.
See also: http://wiki.hydrogenaud.io/index.php?title=Bit_reservoir
Solutions
Pick a "normal" format, resample and rencode non-conforming files
If most of your sources are all the exact same format and only one or two outstanding, you could convert the non-conforming file. From there, strip ID3 tags from everything and concatenate away.
To do the conversion, I'd recommend kicking it over to FFmpeg as a child process.
const { spawn } = require('child_process');

const ffmpeg = spawn('ffmpeg', [
  // Input
  '-i', inputFile, // Use '-' to read from STDIN instead
  // Set sample rate
  '-ar', '44100',
  // Set audio channel count
  '-ac', '1',
  // Audio bitrate... try to match the others, but not as critical
  '-b:a', '64k',
  // Ensure we output an MP3
  '-f', 'mp3',
  // Output
  outputFile // As with the input, use '-' to write to STDOUT
]);
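If both the input and output are set to '-', the conversion can run entirely over pipes. A sketch, where sourceStream is an assumed readable stream carrying the original MP3:

const chunks = [];
ffmpeg.stdout.on('data', (chunk) => chunks.push(chunk));
ffmpeg.stdout.on('end', () => {
  const convertedMp3 = Buffer.concat(chunks); // standardized MP3, ready to concatenate
});
sourceStream.pipe(ffmpeg.stdin);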
Best Solution: Let FFmpeg (or similar) do the work for you
The simplest, most robust solution to all of this is to let FFmpeg build a brand new stream for you. This will cause your audio files to be decoded to PCM, and a new stream made. You can add parameters to resample those inputs, and modify channel counts if needed. Then output one stream. Use the concat filter.
This way, you can accept audio files of any type, you don't have to write the code to hack those streams together, and once set up you won't have to worry about it.
The only downside is that it will require a re-encoding of everything, meaning another generation of quality lost. This would be required for any non-conforming files anyway, and it's just speech, so I wouldn't give it a second thought.
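For example, the concat filter can be driven from Node like this (a sketch; the filenames, sample rate, and bitrate are assumptions):

const { spawn } = require('child_process');

// Decode three inputs, resample each to 44.1 kHz mono, concatenate them,
// and encode a single new MP3.
spawn('ffmpeg', [
  '-i', 'sound1.mp3',
  '-i', 'sound2.mp3',
  '-i', 'sound3.mp3',
  '-filter_complex',
  '[0:a]aresample=44100,aformat=channel_layouts=mono[a0];' +
  '[1:a]aresample=44100,aformat=channel_layouts=mono[a1];' +
  '[2:a]aresample=44100,aformat=channel_layouts=mono[a2];' +
  '[a0][a1][a2]concat=n=3:v=0:a=1[out]',
  '-map', '[out]',
  '-b:a', '64k',
  'output.mp3'
], { stdio: 'inherit' });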
@Brad's answer was the solution! The first solution he suggested worked. It took some messing around to get FFmpeg working correctly, but in the end using the fluent-ffmpeg library worked.
Each file in my case was stored on Google Cloud Storage, not on the server's hard drive. This posed some problems for FFmpeg, which needs file paths to take multiple inputs, or a single input stream (only one piped input is supported, since there is only one STDIN).
One solution is to put the files on the hard drive temporarily, but this would not work for our use case, as this function may be called heavily and the hard drive adds latency.
So, instead, we did as suggested and loaded each file into FFmpeg to convert it into a standardized format. This was a bit tricky, but in the end requesting each file as a stream, using that stream as the input for ffmpeg, and then using fluent-ffmpeg's pipe() method (which returns a stream) as the output worked.
We then bound an event listener to the 'data' event of this pipe and pushed the data into an array (bufs.push(data)); on the stream's 'end' event we concatenated this array using Buffer.concat(bufs), followed by a promise resolve.
Then, once all request promises were resolved, we could be sure FFmpeg had processed each file, and those buffers were concatenated in the required groups as before using Buffer.concat(), converted to base64 data, and sent to the client.
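A sketch of that per-file normalization step with fluent-ffmpeg (the sample rate, channel count, and bitrate are assumed values):

const ffmpeg = require('fluent-ffmpeg');

// Convert one input stream to a standardized MP3 and collect the output into a Buffer.
function normalizeToBuffer(inputStream) {
  return new Promise((resolve, reject) => {
    const bufs = [];
    const out = ffmpeg(inputStream)
      .audioFrequency(44100) // common sample rate
      .audioChannels(1)      // mono
      .audioBitrate('64k')
      .format('mp3')
      .on('error', reject)
      .pipe();               // returns a readable stream of the converted MP3
    out.on('data', (data) => bufs.push(data));
    out.on('end', () => resolve(Buffer.concat(bufs)));
  });
}

// Then: Promise.all(streams.map(normalizeToBuffer))
//   .then((buffers) => Buffer.concat(buffers).toString('base64'));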
This works great, and now it seems to be able to handle every combination of files/sources I can throw at it!
In conclusion:
The answer to the question was that the mp3 data must have been encoded differently (different channels, sample rates, etc.), and loading it through FFmpeg and outputting it in a 'unified' way made the mp3 data compatible.
The solution was to process each file in ffmpeg separately, pipe the ffmpeg output into a buffer, then concatenate the buffers.
Thanks @Brad for your suggestions and detailed answer!

piping a tone generator to aplay()

I thought it would be a simple task ...
- platform: linux on laptop
- language: python
- object: generate a tone to be heard on speakers or headphones. The tone will be modified in real time, many times per second (think of a metal detector)
The initial design was to generate a tone in Python and pipe it to aplay.
Since aplay consumes data at a known rate (the sampling rate), I thought my tone generator would not have to care about timing if silences (between tones) were generated at the normal sampling rate (zero amplitude).
The first result showed a significant time lag (many seconds). I found that the pipe is fairly large by default (64 KB); that's 8 seconds of samples at 8 kHz.
I found a way to reduce the pipe size to 4 KB, but it is still too long (0.5 s lag).
Sampling at a very high frequency would reduce the lag, but I don't like that solution.
The second approach was to generate real silence (no samples at all) during a silence, with the generator sleep()ing for its duration.
The result is that aplay complains about underruns and, for some reason, the tones were truncated and mishandled (bad rendering).
So, my question is:
What are the best ways to send a tone to the audio stack without piping?

Correct way to encode Kinect audio with lame.exe

I receive data from a Kinect v2, which is (I believe; information is hard to find) 16 kHz mono audio as 32-bit floating-point PCM. The data arrives in up to 4 "SubFrames", which contain 256 samples each.
When I send this data to lame.exe with -r -s 16 --bitwidth 32 -m m, I get output containing gaps (supposedly where the second channel should be). Those command line switches should, however, take stereo input and downmix it to mono.
I've also tried importing the raw data into Audacity, but I still can't figure out the correct way to get continuous audio out of it.
EDIT: I can get continuous audio when I only save the first SubFrame. The audio still doesn't sound right though.
In the end I went with Ogg Vorbis, a free format, so no problems there either. I use the following command line switches for oggenc2.exe:
oggenc2.exe --raw-format=3 --raw-chan=1 --raw-rate=16000 - --output=[filename]

Using sox for voice detection and streaming

Currently, I use sox like this:
sox -d -e u-law --endian little -b 8 -c 1 -r 8000 -t ul - silence 1 0.3 1% 1 0.3 1%
For reference, this records audio from the default microphone and outputs little-endian, u-law formatted audio at 8 bits and an 8 kHz rate. The silence effect trims audio until the noise exceeds a threshold for 0.3 seconds, then continues to record until there is 0.3 seconds of silence. All of this streams to stdout, which I use to stream to a remote server.
I am using all of this to record a bit of voice and finish when I am done speaking. To start sox, I use specialized hardware that triggers the start of the recording. I can switch to almost any audio format or codec as long as it supports on-the-fly formatting/encoding. My target platform is Raspbian on the Raspberry Pi 2 B.
My ideal solution would be to use VAD to stop the recording when the user is finished speaking. My hope is that this would work even with background chatter. However, the sox documentation for the vad effect states this:
The use of the norm effect is recommended, but remember that neither reverse nor norm is suitable for use with streamed audio.
I haven't been able to piece parameters together to get vad and streaming working. Is it possible to use the vad effect to stop the recording of audio while still maintaining the stdin->sox->stdout piping? Are there better alternatives?
Is it possible to use the vad effect to stop the recording of audio while still maintaining the stdin->sox->stdout piping?
No. The vad effect can trim silence only from the front of the audio. So you could only use it to detect recording start, and not ending and pauses.
The reverse and norm filters need all the input data before they produce any data on output, that is why they cannot be used with streaming.
The key is to select a good threshold for the silence filter so that it treats "background chatter" as silence.
You could also use noisered (with a profile based on previous recordings) before silence to keep noise from triggering the recording, but this will also affect the output and probably will not treat "background chatter" as noise.
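For example (the thresholds, durations, and noise profile path are assumptions to tune for your setup), the two effects could be chained like this:

sox -d -e u-law --endian little -b 8 -c 1 -r 8000 -t ul - noisered noise.prof 0.2 silence 1 0.3 5% 1 2.0 5%

Here noise.prof would be generated beforehand with the noiseprof effect from a recording of typical background noise.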
