Is there a way to ensure mp3 duration accuracy with variable bit rate using FFMPEG? - node.js

In our application, we process audio files using ffmpeg. Specifically, we use the Node.js library fluent-ffmpeg (available on npm).
Our audio files are generated from various text-to-speech providers. We recently noticed that when we used SSML to add pauses to the generated audio, the duration reported for the file was no longer correct. Upon further investigation, we noticed that the durations of the standard audio files were also incorrect, just closer overall because their data is more consistent. The estimate was worst when we put a pause at the beginning of the audio, overshooting by a very large margin (e.g., a 25s audio clip would read as 3 minutes long, but skip to the end when playing past the 25s mark).
I did some searching and research into the structure of MP3 files, and to me it seems like the issue is because the duration gets estimated by various audio players. Windows media player is an example, but Firefox's web player seems to also do this. I tried changing the ffmpeg command from using .audioQuality(0), which sets ffmpeg to use VBR, to .audioBitrate(320), which tells ffmpeg to use a constant bitrate.
For reference, we are using libmp3lame, and the full commands that get run for the VBR and CBR cases are the following:
For VBR (broken durations): ffmpeg -i <URL> -acodec libmp3lame -aq 0 -f mp3 pipe:1
For CBR (correct duration): ffmpeg -i <URL> -acodec libmp3lame -b:a 320k -f mp3 pipe:1
Note: we then pipe the output to the requesting client application after sending the appropriate file headers, hence the pipe:1 output. The input is a cloud storage url where the source file is located
This fixes our problem of having a correct duration, and it makes sense to me why this would fix it if the problem was because the duration is being estimated by some of these players / audio consumers. But, this came at the cost that the file size was significantly larger, which also makes sense to me. While testing we found that compared to the same file in WAV, the VBR mp3 was about 10% of the WAV file size, while the CBR mp3 was still 50% of the WAV file size. This practically defeats the purpose of supporting the mp3 format for our use-case, which is a smaller but slightly lossy alternative to the large WAV file.
While researching, I found that there can be ID3 tags in a chunk at the beginning of the mp3 file, giving the consumer of the audio information such as the duration before it has processed the whole file. But I also found that there doesn't seem to be a standard, at least for duration; the tags mostly cover things like song title, album, artist, etc.
My question is, is there a way to get the proper duration onto an mp3 file, preferably via some ffmpeg mechanism, while still using VBR? Thanks!

FFmpeg does write a Xing header by default with duration info. However, that value is only known after the entire stream data has been received, so ffmpeg has to seek to the head to write it. Since you're piping the output, that can't be done.
Write the file locally or to some seekable destination, and then upload.
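A minimal sketch of that approach, written in Python rather than Node for brevity and assuming ffmpeg is on the PATH: encode to a temporary file (which is seekable, so ffmpeg can go back and fill in the Xing header), then read the finished bytes and send them on. The helper names build_vbr_command and encode_to_bytes are made up for illustration.

```python
import os
import subprocess
import tempfile


def build_vbr_command(source, out_path):
    """Return ffmpeg arguments that write a VBR MP3 to a seekable file."""
    return [
        "ffmpeg", "-y",
        "-i", source,
        "-acodec", "libmp3lame",
        "-aq", "0",      # same VBR quality setting as in the question
        "-f", "mp3",
        out_path,        # a real file instead of pipe:1, so ffmpeg can seek back
    ]


def encode_to_bytes(source):
    """Encode to a temporary file, then return its contents for streaming."""
    fd, out_path = tempfile.mkstemp(suffix=".mp3")
    os.close(fd)
    try:
        subprocess.run(build_vbr_command(source, out_path), check=True)
        with open(out_path, "rb") as handle:
            return handle.read()
    finally:
        os.remove(out_path)
```

The same idea carries over to fluent-ffmpeg: point the output at a temporary path instead of a pipe, and stream the file to the client once the encode has finished.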

Related

Why can I sometimes concatenate audio data using NodeJS Buffers, and sometimes I cannot?

As part of a project I am working on, there is a requirement to concatenate multiple pieces of audio data into one large audio file. The audio files are generated from four sources, and the individual files are stored in a Google Cloud storage bucket. Each file is an mp3 file, and it is easy to verify that each individual file is generating correctly (individually, I can play them, edit them in my favourite software, etc.).
To merge the audio files together, a nodejs server loads the files from the Google Cloud storage as an array buffer using an axios POST request. From there, it puts each array buffer into a node Buffer using Buffer.from(), so now we have an array of Buffer objects. Then it uses Buffer.concat() to concatenate the Buffer objects into one big Buffer, which we then convert to Base64 data and send to the client server.
This is cool, but the issue arises when concatenating audio generated from different sources. The four sources I mentioned above are text-to-speech platforms: Google Cloud Voice, Amazon Polly, IBM Watson, and Microsoft Azure Text to Speech. Again, all individual files work, but when concatenating them together via this method there are some interesting effects.
When the sound files are concatenated, seemingly depending on which platform they originate from, the sound data either will or will not be included in the final sound file. Below is a 'compatibility' table based on my testing:
| Platform  | Google | Amazon | Microsoft | IBM |
|-----------|--------|--------|-----------|-----|
| Google    | Yes    | No     | No        | No  |
| Amazon    |        | No     | No        | Yes |
| Microsoft |        |        | Yes       | No  |
| IBM       |        |        |           | Yes |
The effect is as follows: When I play the large output file, it will always start playing the first sound file included. From there, if the next sound file is compatible, it is heard, otherwise it is skipped entirely (no empty sound or anything). If it was skipped, the 'length' of that file (for example 10s long audio file) is included at the end of the generated output sound file. However, the moment that my audio player hits the point where the last 'compatible' audio has played, it immediately skips to the end.
As a scenario:
Input:
sound1.mp3 (3s) -> Google
sound2.mp3 (5s) -> Amazon
sound3.mp3 (7s) -> Google
sound4.mp3 (11s) -> IBM
Output:
output.mp3 (26s) -> first 10s is sound1 and sound3, last 16s is skipped.
In this case, the output sound file would be 26 seconds long. For the first 10 seconds, you would hear sound1.mp3 and sound3.mp3 played back to back. Then at 10s (at least when playing this mp3 file in Firefox) the player immediately skips to the end at 26s.
My question is: Does anyone have any ideas why sometimes I can concatenate audio data in this way, and other times I cannot? And how come there is this 'missing' data included at the end of the output file? Shouldn't concatenating the binary data work in all cases if it works for some cases, as all the files have mp3 encoding? If I am wrong please let me know what I can do to successfully concatenate any mp3 files :)
I can provide my nodeJS backend code, but the process and methods used are described above.
Thanks for reading!
Potential Sources of Problems
Sample Rate
44.1 kHz is often used for music, as it's what is used on CD audio. 48 kHz is usually used for video, as it's what was used on DVDs. Both of those sample rates are much higher than is required for speech, so it's likely that your various text-to-speech providers are outputting something different. 22.05 kHz (half of 44.1 kHz) is common, and 11.025 kHz is out there too.
While each frame specifies its own sample rate, making it possible to generate a stream with varying sample rates, I've never seen a decoder attempt to switch sample rates mid-stream. I suspect that the decoder is skipping these frames, or maybe even skipping over an arbitrary block until it gets consistent data again.
Use something like FFmpeg (or FFprobe) to figure out what the sample rates of your files are:
ffmpeg -i sound2.mp3
You'll get an output like this:
Duration: 00:13:50.22, start: 0.011995, bitrate: 192 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 192 kb/s
In this example, 44.1 kHz is the sample rate.
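If you'd rather check this programmatically than read the log, something like the following sketch could work. It assumes ffprobe is installed and uses its JSON output; the function names (probe_command, parse_probe, audio_format, all_compatible) are hypothetical helpers, not part of any library.

```python
import json
import subprocess


def probe_command(path):
    """ffprobe arguments that print the first audio stream's format as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a:0",
        "-show_entries", "stream=codec_name,sample_rate,channels",
        "-of", "json",
        path,
    ]


def parse_probe(json_text):
    """Extract codec, sample rate and channel count from ffprobe's JSON."""
    stream = json.loads(json_text)["streams"][0]
    return {
        "codec": stream["codec_name"],
        "sample_rate": int(stream["sample_rate"]),  # ffprobe reports this as a string
        "channels": int(stream["channels"]),
    }


def audio_format(path):
    """Run ffprobe on a file and return its parsed audio format."""
    result = subprocess.run(
        probe_command(path), capture_output=True, text=True, check=True
    )
    return parse_probe(result.stdout)


def all_compatible(formats):
    """True when every file has the same codec, sample rate and channels."""
    return len({tuple(sorted(fmt.items())) for fmt in formats}) <= 1
```

Running audio_format over each provider's file and passing the results to all_compatible should tell you quickly whether plain byte concatenation even has a chance.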
Channel Count
I'd expect your voice MP3s to be in mono, but it wouldn't hurt to check to be sure. As with above, check the output of FFmpeg. In my example above, it says stereo.
As with sample rate, technically each frame could specify its own channel count but I don't know of any player that will pull off switching channel count mid-stream. Therefore, if you're concatenating, you need to make sure all the channel counts are the same.
ID3 Tags
It's common for there to be ID3 metadata at the beginning (ID3v2) and/or end (ID3v1) of the file. It's less expected to have this data mid-stream. You would want to make sure this metadata is all stripped out before concatenating.
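As a rough sketch of the stripping step (Python, operating on the raw bytes of one file): an ID3v2 tag starts with "ID3" and stores its size as a 4-byte synchsafe integer, and an ID3v1 tag is a fixed 128-byte block starting with "TAG" at the very end. The function names are made up for this example, and it ignores rarer cases such as appended ID3v2 tags.

```python
def synchsafe_to_int(four_bytes):
    """Decode a 4-byte synchsafe integer (7 useful bits per byte)."""
    value = 0
    for byte in four_bytes:
        value = (value << 7) | (byte & 0x7F)
    return value


def strip_id3(data):
    """Return the MP3 bytes without leading ID3v2 or trailing ID3v1 tags."""
    # ID3v2: 10-byte header, then `size` bytes, plus a 10-byte footer if flagged.
    while len(data) >= 10 and data[:3] == b"ID3":
        size = synchsafe_to_int(data[6:10])
        footer = 10 if data[5] & 0x10 else 0
        data = data[10 + size + footer:]
    # ID3v1: fixed 128-byte block at the end of the file.
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        data = data[:-128]
    return data
```

You would call strip_id3 on each buffer before handing the results to your concatenation step.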
MP3 Bit Reservoir
MP3 frames don't necessarily stand alone. If you have a constant bitrate stream, the encoder may still use less data to encode one frame, and more data to encode another. When this happens, some frames contain data for other frames. That way, frames that could benefit from the extra bandwidth can get it while still fitting the whole stream within a constant bitrate. This is the "bit reservoir".
If you cut a stream and splice in another stream, you may split up a frame and its dependent frames. This typically causes an audio glitch, but may also cause the decoder to skip ahead. Some badly behaving decoders will just stop playing altogether. In your example, you're not cutting anything so this probably isn't the source of your trouble... but I mention it here because it's definitely relevant to the way you're working these streams.
See also: http://wiki.hydrogenaud.io/index.php?title=Bit_reservoir
Solutions
Pick a "normal" format, resample and re-encode non-conforming files
If most of your sources share the exact same format and only one or two are outliers, you could convert the non-conforming files. From there, strip ID3 tags from everything and concatenate away.
To do the conversion, I'd recommend kicking it over to FFmpeg as a child process.
const child_process = require('child_process');

child_process.spawn('ffmpeg', [
  // Input
  '-i', inputFile, // Use '-' to read from STDIN instead
  // Set sample rate
  '-ar', '44100',
  // Set audio channel count
  '-ac', '1',
  // Audio bitrate... try to match others, but not as critical
  '-b:a', '64k',
  // Ensure we output an MP3
  '-f', 'mp3',
  // Output
  outputFile // As with input, use '-' to write to STDOUT
]);
Best Solution: Let FFmpeg (or similar) do the work for you
The simplest, most robust solution to all of this is to let FFmpeg build a brand new stream for you. This will cause your audio files to be decoded to PCM, and a new stream made. You can add parameters to resample those inputs, and modify channel counts if needed. Then output one stream. Use the concat filter.
This way, you can accept audio files of any type, you don't have to write the code to hack those streams together, and once setup you won't have to worry about it.
The only downside is that it will require a re-encoding of everything, meaning another generation of quality lost. This would be required for any non-conforming files anyway, and it's just speech, so I wouldn't give it a second thought.
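For illustration, here is a hedged sketch of how the concat-filter command line could be assembled (in Python; the function name build_concat_command and the default sample rate and channel count are my own choices, not something from your setup). It only builds the argument list, so you can inspect it before running it with ffmpeg.

```python
def build_concat_command(inputs, output, sample_rate=44100, channels=1):
    """Build ffmpeg arguments that decode all inputs and join them into one MP3."""
    if not inputs:
        raise ValueError("at least one input file is required")
    cmd = ["ffmpeg", "-y"]
    for path in inputs:
        cmd += ["-i", path]
    # One [N:a] pad per input, fed into a single audio-only concat filter.
    pads = "".join(f"[{index}:a]" for index in range(len(inputs)))
    graph = f"{pads}concat=n={len(inputs)}:v=0:a=1[out]"
    cmd += [
        "-filter_complex", graph,
        "-map", "[out]",
        "-ar", str(sample_rate),   # output sample rate
        "-ac", str(channels),      # output channel count
        "-f", "mp3",
        output,
    ]
    return cmd
```

For two inputs the filter graph comes out as [0:a][1:a]concat=n=2:v=0:a=1[out]. The resulting list can be passed to subprocess.run, or to child_process.spawn if you port it to Node.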
@Brad's answer was the solution! The first solution he suggested worked. It took some messing around to get FFmpeg working correctly, but in the end using the fluent-ffmpeg library worked.
Each file in my case was stored on Google Cloud Storage, and not on the server's hard drive. This posed some problems for FFmpeg, as multiple inputs have to be given as file paths; an input stream also works, but only one is supported, as there is only one STDIN.
One solution is to put the files on the hard drive temporarily, but this would not work for our use case, as this function may be called very often and the hard drive adds latency.
So, instead we did as suggested and loaded each file into ffmpeg to convert it into a standardized format. This was a bit tricky, but in the end requesting each file as a stream, using that stream as an input for ffmpeg, then using fluent-ffmpeg's pipe() method (which returns a stream) as output worked.
We then bound an event listener to the 'data' event for this pipe, and pushed the data to an array (bufs.push(data)), and on stream 'end' we concatenated this array using Buffer.concat(bufs), followed by a promise resolve.
Then once all the request promises were resolved, we could be sure ffmpeg had processed each file, and then those buffers were concatenated in the required groups as before using Buffer.concat(), converted to base64 data, and sent to the client.
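For readers who want the shape of this without fluent-ffmpeg, here is a rough Python equivalent of the pipeline described above. It is a sketch, not our production code: normalize_command mirrors the flags from the accepted answer, and run_filter and normalize_and_join are hypothetical helper names. Each downloaded file is fed to ffmpeg on STDIN, the converted bytes are read from STDOUT, and the results are joined.

```python
import subprocess


def normalize_command():
    """ffmpeg arguments that read from STDIN and write a uniform MP3 to STDOUT."""
    return [
        "ffmpeg", "-i", "pipe:0",
        "-ar", "44100",   # one sample rate for every file
        "-ac", "1",       # one channel count for every file
        "-b:a", "64k",
        "-f", "mp3",
        "pipe:1",
    ]


def run_filter(command, data):
    """Send bytes to a child process and return everything it writes to STDOUT."""
    result = subprocess.run(
        command,
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        check=True,
    )
    return result.stdout


def normalize_and_join(chunks, command=None):
    """Convert each input separately, then concatenate the converted bytes."""
    command = command or normalize_command()
    return b"".join(run_filter(command, chunk) for chunk in chunks)
```

Because the command is a parameter, the plumbing can be tried with any filter program (for example cat) before ffmpeg is involved.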
This works great, and now it seems to be able to handle every combination of files/sources I can throw at it!
In conclusion:
The answer to the question was that the mp3 data must have been encoded differently (different channels, sample rates, etc.), and loading it through ffmpeg and outputting it in a 'unified' way made the mp3 data compatible.
The solution was to process each file in ffmpeg separately, pipe the ffmpeg output into a buffer, then concatenate the buffers.
Thanks @Brad for your suggestions and detailed answer!

Combine Audio and Images in Stream

I would like to be able to create images on the fly and also create audio on the fly too and be able to combine them together into an rtmp stream (for Twitch or YouTube). The goal is to accomplish this in Python 3 as that is the language my bot is written in. Bonus points for not having to save to disk.
So far, I have figured out how to stream to rtmp servers using ffmpeg by loading a PNG image and playing it on loop as well as loading a mp3 and then combining them together in the stream. The problem is I have to load at least one of them from file.
I know I can use Moviepy to create videos, but I cannot figure out whether or not I can stream the video from Moviepy to ffmpeg or directly to rtmp. I think that I have to generate a lot of really short clips and send them, but I want to know if there's an existing solution.
There's also OpenCV which I hear can stream to rtmp, but cannot handle audio.
A redacted version of an ffmpeg command I have successfully tested with is
ffmpeg -loop 1 -framerate 15 -i ScreenRover.png -i "Song-Stereo.mp3" -c:v libx264 -preset fast -pix_fmt yuv420p -threads 0 -f flv rtmp://SITE-SUCH-AS-TWITCH/.../STREAM-KEY
or
cat Song-Stereo.mp3 | ffmpeg -loop 1 -framerate 15 -i ScreenRover.png -i - -c:v libx264 -preset fast -pix_fmt yuv420p -threads 0 -f flv rtmp://SITE-SUCH-AS-TWITCH/.../STREAM-KEY
I know these commands are not set up properly for smooth streaming; the result manages to screw up both Twitch's and YouTube's players, and I will have to figure out how to fix that.
The problem with this is I don't think I can stream both the image and the audio at once when creating them on the spot. I have to load one of them from the hard drive. This becomes a problem when trying to react to a command or user chat or anything else that requires live reactions. I also do not want to destroy my hard drive by constantly saving to it.
As for the Python code, this is what I have tried so far in order to create a video. It still saves to the HD and is not responsive in real time, so it is not very useful to me. The video itself is okay, with one exception: as time passes, the time shown in the QR code and the video's own clock drift farther and farther apart as the video gets closer to the end. I can work around that limitation if it shows up while live streaming.
def make_frame(t):
    img = qrcode.make("Hello! The second is %s!" % t)
    return numpy.array(img.convert("RGB"))

clip = mpy.VideoClip(make_frame, duration=120)
clip.write_gif("test.gif", fps=15)
gifclip = mpy.VideoFileClip("test.gif")
gifclip.set_duration(120).write_videofile("test.mp4", fps=15)
My goal is to be able to produce something along the lines of this pseudo-code:
original_video = qrcode_generator("I don't know, a clock, pyotp, today's news sources, just anything that can be generated on the fly!")
original_video.overlay_text(0,0,"This is some sample text, the left two are coordinates, the right three are font, size, and color", Times_New_Roman, 12, Blue)
original_video.add_audio(sine_wave_generator(0,180,2)) # frequency min-max, seconds
# NOTICE - I did not add any time measurements to the actual video itself. The whole point is this is a live stream and not a video clip, so the time frame would be now. The 2 seconds listed above is for our pseudo sine wave generator to know how long the audio clip should be, not for the actual streaming library.
stream.send_to_rtmp_server(original_video) # Doesn't matter if ffmpeg or some native library
The above example is what I am looking for in terms of video creation in Python and then streaming. I am not trying to create a clip and then stream it later; I am trying to have the program respond to outside events and then update its stream to do whatever it wants. It is sort of like a chat bot, but with video instead of text.
def track_movement(...):
    ...
    return ...
original_video = user_submitted_clip(chat.lastVideoMessage)
original_video.overlay_text(0,0,"The robot watches the user's movements and puts a blue square around it.", Times_New_Roman, 12, Blue)
original_video.add_audio(sine_wave_generator(0,180,2)) # frequency min-max, seconds
# It would be awesome if I could also figure out how to perform advanced actions such as tracking movements or pulling a face out of a clip and then applying effects to it on the fly. I know OpenCV can track movements and I hear that it can work with streams, but I cannot figure out how that works. Any help would be appreciated! Thanks!
Because I forgot to add the imports, here are some useful imports I have in my file!
import numpy
import pyotp
import qrcode
from io import BytesIO
from moviepy import editor as mpy
The library pyotp is for generating one-time password authenticator codes, qrcode is for the QR codes, numpy turns each image into an array, BytesIO is used for virtual files, and moviepy is what I used to generate the GIF and MP4. I believe BytesIO might be useful for piping data to the streaming service, but how that happens depends entirely on how data is sent to the service, whether through ffmpeg on the command line (from subprocess import Popen, PIPE) or through a native library.
Are you using ffmpeg.exe and running a command through CMD? If so, you can use either the concat demuxer or a pipe. With the concat demuxer, ffmpeg takes its image input from a text file. The text file should contain the image paths, and ffmpeg can find those images in different folders. The following line shows how to use the concat demuxer, with the image locations saved in input.txt.
ffmpeg -f concat -i input.txt -vsync vfr -pix_fmt yuv420p output.mp4
But the most suitable solution would be to use a data pipe to feed images to ffmpeg.
cat *.png | ffmpeg -f image2pipe -i - output.mkv
You can check this link to see more information about the ffmpeg data pipe.
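Since the question is about Python, here is a minimal sketch of the pipe approach from a script, with nothing written to disk. It assumes raw RGB frames are written to ffmpeg's STDIN and, to keep the example short, lets ffmpeg's own lavfi sine source provide the audio instead of a second pipe. The names build_stream_command, solid_frame and stream_frames are invented for this example, and the RTMP URL is whatever your service gives you.

```python
import subprocess

WIDTH, HEIGHT, FPS = 320, 240, 15


def build_stream_command(url, width=WIDTH, height=HEIGHT, fps=FPS):
    """ffmpeg arguments: raw RGB frames on STDIN plus a generated sine tone."""
    return [
        "ffmpeg",
        "-f", "rawvideo", "-pixel_format", "rgb24",
        "-video_size", f"{width}x{height}", "-framerate", str(fps),
        "-i", "-",                                   # video frames from STDIN
        "-f", "lavfi", "-i", "sine=frequency=440:sample_rate=44100",
        "-c:v", "libx264", "-preset", "fast", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",                                 # stop when the frames stop
        "-f", "flv", url,
    ]


def solid_frame(red, green, blue, width=WIDTH, height=HEIGHT):
    """One raw rgb24 frame filled with a single colour (3 bytes per pixel)."""
    return bytes((red, green, blue)) * (width * height)


def stream_frames(url, frames):
    """Start ffmpeg and write each raw frame to its STDIN as it is produced."""
    process = subprocess.Popen(build_stream_command(url), stdin=subprocess.PIPE)
    try:
        for frame in frames:
            process.stdin.write(frame)
    finally:
        process.stdin.close()
        process.wait()
```

In a real bot the frames iterable would be a generator that yields a new image (for example the bytes of a numpy array from qrcode) at the chosen frame rate, so the stream reacts to events as they happen.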
Generating multiple videos and streaming at real time is not a very stable solution. You can run into several problems.
I have settled on using Gstreamer to create my streams on the fly. It lets me take separate video and audio streams and combine them. I do not have a working example right now, but I hope to either get an answer or figure it out on my own soon, in my follow-up question "Gstreamer in Python exits instantly, but is fine on command line".

mkv file out of sync with linear drift

I have a bunch of mkv files, with FLAC as the audio codec and FFV1 as the video one.
The files were created using an EasyCap acquisition dongle from a VCR analog source. Specifically, I used VLC's "open acquisition device" prompt and selected PAL. Then, I converted the files (audio PCM, video raw YUV) to (FLAC, FFV1) using
ffmpeg.exe -i input.avi -acodec flac -vcodec ffv1 -level 3 -threads 4 -coder 1 -context 1 -g 1 -slices 24 -slicecrc 1 output.mkv
Now, the files are progressively out of sync. It may be because, while the video (maybe) has a constant framerate, the FLAC track has a variable framerate. So, is there a way to sync the video track to the audio, or something similar? Can FFmpeg do this? Thanks
EDIT
On Mulvya's hint, I plotted the sync difference at various times; the first column shows the seconds elapsed, the second shows the difference in seconds. The plot looks linear, with a constant slope of 0.0078. NOTE: the measurements were taken by hand with a stopwatch.
EDIT 2
Playing around with VirtualDub, I found that changing the framerate to 25 fps from the original 24.889 (Video->Frame rate...->Change frame rate to) and using the track converted to wav definitely does work. Two problems, though: VirtualDub crashes when importing the original FFV1-FLAC mkv file, so I had to convert the video to H264 to try it out; moreover, I find it difficult to use an external encoder to save VirtualDub's output.
So, could I avoid using VirtualDub, and simply use ffmpeg for it? Here's the exported vdscript:
VirtualDub.audio.SetSource("E:\\4_track2.wav", "");
VirtualDub.audio.SetMode(0);
VirtualDub.audio.SetInterleave(1,500,1,0,0);
VirtualDub.audio.SetClipMode(1,1);
VirtualDub.audio.SetEditMode(1);
VirtualDub.audio.SetConversion(0,0,0,0,0);
VirtualDub.audio.SetVolume();
VirtualDub.audio.SetCompression();
VirtualDub.audio.EnableFilterGraph(0);
VirtualDub.video.SetInputFormat(0);
VirtualDub.video.SetOutputFormat(7);
VirtualDub.video.SetMode(3);
VirtualDub.video.SetSmartRendering(0);
VirtualDub.video.SetPreserveEmptyFrames(0);
VirtualDub.video.SetFrameRate2(25,1,1);
VirtualDub.video.SetIVTC(0, 0, 0, 0);
VirtualDub.video.SetCompression();
VirtualDub.video.filters.Clear();
VirtualDub.audio.filters.Clear();
The first line imports the wav-converted audio track.
Can I set an equivalent pipe in ffmpeg (possibly, using FLAC - not wav)? SetFrameRate2 is maybe the key, here.

mpeg-dash with live stream

I would like to use MPEG-DASH technology in situations where I am constantly receiving a live video stream from a client. The Web server gets a live video stream, keeps generating the m4s file, and declares it in mpd. So the new segment can be played back constantly.
(I'm using FFMPEG's ffserver. So the video stream continues to accumulate in /tmp/feed1.ffm file.)
Using MP4Box seems to be able to generate mpd, init.mp4, m4s for already existing files. But it does not seem to support live streaming.
I want fragmented mp4 in segment format rather than mpeg-ts.
A lot of advice is needed!
GPAC maintainer here. The dashcast project (and likely its dashcastx replacement from our Signals platform) should help you. Please open issues on GitHub if you run into any problems.
Please note that there are some projects like this one using FFmpeg to generate some HLS and then GPAC to ingest the TS segments to produce MPEG-DASH. This introduces some latency but proved to be very robust.
The information below may be useful.
The latest ffmpeg supports live streaming and also mp4 fragmenting.
Example command
ffmpeg -re -y -i <input> -c copy -f dash -window_size 10 -use_template 1 -use_timeline 1 <ClearLive>.mpd

Extract audio from Transport Stream and preserve length

I'm using ffmpeg to extract audio from MPEG Transport Stream file recorded by DVB-S card. The command:
ffmpeg -i video.ts -vn audio.wav
The source file seems to be corrupted. I noticed the corruption happens from time to time, especially for videos longer than 1 hour. I've got errors like these:
[mp2 # 0x1bb5500] Header missing
Error while decoding stream #0:1
[mpegts # 0x17eaf40] Continuity check failed for pid 5261 expected 2 got 6
The problem is that the resulting audio.wav is shorter than the source video (40m33s and 40m59s respectively). I'm looking for a way to preserve the original length in the resulting audio file.
I tried a recent ffmpeg under Windows and avconv under Ubuntu, with both MP3 and WAV as the output format. In every case I got the same results.
I couldn't find out whether it's possible to do this with ffmpeg; however, I found ProjectX, a tool which tries to fix the broken TS stream. Website: http://project-x.sourceforge.net/
With:
java -jar ProjectX.jar -demux my_video.ts
the stream is demuxed into audio and video files which are guaranteed to have the same length. I simply mux them back using ffmpeg.

Resources