Setting dwScale and dwRate values in the AVISTREAMHEADER structure when muxing AVI - audio

While capturing from some audio and video sources and encoding into an AVI container, I set the audio as the master stream to keep audio and video synchronized, and this gave the best synchronization results.
http://msdn.microsoft.com/en-us/library/windows/desktop/dd312034(v=vs.85).aspx
But this method results in a higher reported FPS value, about 40 or 50 instead of 30 FPS.
If the media file is simply played back, everything is OK, but if I try to re-encode it with different software into another video format, it goes out of sync.
How can I programmatically set dwScale and dwRate values in the AVISTREAMHEADER structure at AVI muxing?

MSDN:
This method works by adjusting the dwScale and dwRate values in the AVISTREAMHEADER structure.
You requested that the multiplexer manage the scale/rate values, so you cannot adjust them yourself. You should be seeing more odd things in your file than just a higher FPS. The file itself is most likely out of sync, and as soon as you process it with other applications that don't do playback fine-tuning, you start seeing the issues. You might have the video media type reporting one frame rate while the effective rate is different.

Related

Why can I sometimes concatenate audio data using NodeJS Buffers, and sometimes I cannot?

As part of a project I am working on, there is a requirement to concatenate multiple pieces of audio data into one large audio file. The audio files are generated from four sources, and the individual files are stored in a Google Cloud storage bucket. Each file is an mp3 file, and it is easy to verify that each individual file is generating correctly (individually, I can play them, edit them in my favourite software, etc.).
To merge the audio files together, a nodejs server loads the files from the Google Cloud storage as an array buffer using an axios POST request. From there, it puts each array buffer into a node Buffer using Buffer.from(), so now we have an array of Buffer objects. Then it uses Buffer.concat() to concatenate the Buffer objects into one big Buffer, which we then convert to Base64 data and send to the client server.
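Roughly, the flow looks like this (a simplified sketch with placeholder URLs, using an axios GET for brevity; not the actual backend code):

const axios = require('axios');

// Hypothetical file URLs in the Google Cloud Storage bucket.
const fileUrls = [
  'https://storage.googleapis.com/my-bucket/sound1.mp3',
  'https://storage.googleapis.com/my-bucket/sound2.mp3'
];

async function naiveConcat(urls) {
  // Download each file as an ArrayBuffer and wrap it in a node Buffer.
  const buffers = await Promise.all(
    urls.map(async url => {
      const res = await axios.get(url, { responseType: 'arraybuffer' });
      return Buffer.from(res.data);
    })
  );
  // Concatenate the raw MP3 bytes and encode the result as Base64.
  return Buffer.concat(buffers).toString('base64');
}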
This is cool, but the issue arises when concatenating audio generated from different sources. The 4 sources I mentioned above are text-to-speech software platforms, such as Google Cloud Voice and Amazon Polly. Specifically, we have files from Google Cloud Voice, Amazon Polly, IBM Watson, and Microsoft Azure Text to Speech. Essentially just four text-to-speech solutions. Again, all individual files work, but when concatenating them together via this method there are some interesting effects.
When the sound files are concatenated, seemingly depending on which platform they originate from, the sound data either will or will not be included in the final sound file. Below is a 'compatibility' table based on my testing:
| Platform / | Google | Amazon | Microsoft | IBM |
|------------|--------|--------|-----------|-----|
| Google     | Yes    | No     | No        | No  |
| Amazon     |        | No     | No        | Yes |
| Microsoft  |        |        | Yes       | No  |
| IBM        |        |        |           | Yes |
The effect is as follows: when I play the large output file, it always starts playing the first sound file included. From there, if the next sound file is compatible it is heard, otherwise it is skipped entirely (no empty sound or anything). If it was skipped, the 'length' of that file (for example, a 10s-long audio file) is still included at the end of the generated output sound file. However, the moment the audio player hits the point where the last 'compatible' audio has finished playing, it immediately skips to the end.
As a scenario:
Input:
sound1.mp3 (3s) -> Google
sound2.mp3 (5s) -> Amazon
sound3.mp3 (7s) -> Google
sound4.mp3 (11s) -> IBM
Output:
output.mp3 (26s) -> first 10s is sound1 and sound3, last 16s is skipped.
In this case, the output sound file would be 26 seconds long. For the first 10 seconds you would hear sound1.mp3 and sound3.mp3 played back to back. Then at 10s (at least when playing this mp3 file in Firefox) the player immediately skips to the end at 26s.
My question is: Does anyone have any ideas why sometimes I can concatenate audio data in this way, and other times I cannot? And how come there is this 'missing' data included at the end of the output file? Shouldn't concatenating the binary data work in all cases if it works for some cases, as all the files have mp3 encoding? If I am wrong please let me know what I can do to successfully concatenate any mp3 files :)
I can provide my nodeJS backend code, but the process and methods used are described above.
Thanks for reading!
Potential Sources of Problems
Sample Rate
44.1 kHz is often used for music, as it's what is used on CD audio. 48 kHz is usually used for video, as it's what was used on DVDs. Both of those sample rates are much higher than is required for speech, so it's likely that your various text-to-speech providers are outputting something different. 22.05 kHz (half of 44.1 kHz) is common, and 11.025 kHz is out there too.
While each frame specifies its own sample rate, making it possible to generate a stream with varying sample rates, I've never seen a decoder attempt to switch sample rates mid-stream. I suspect that the decoder is skipping these frames, or maybe even skipping over an arbitrary block until it gets consistent data again.
Use something like FFmpeg (or FFprobe) to figure out what the sample rates of your files are:
ffmpeg -i sound2.mp3
You'll get an output like this:
Duration: 00:13:50.22, start: 0.011995, bitrate: 192 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 192 kb/s
In this example, 44.1 kHz is the sample rate.
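If you'd rather check this programmatically from Node than read the ffmpeg banner, ffprobe can report the same information as JSON; a small sketch (assuming ffprobe is installed alongside ffmpeg):

const { execFile } = require('child_process');

// Ask ffprobe for just the sample rate and channel count, as JSON.
execFile('ffprobe', [
  '-v', 'error',
  '-show_entries', 'stream=sample_rate,channels',
  '-of', 'json',
  'sound2.mp3'
], (err, stdout) => {
  if (err) throw err;
  const { streams } = JSON.parse(stdout);
  console.log(streams[0].sample_rate, streams[0].channels); // e.g. "44100" 2
});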
Channel Count
I'd expect your voice MP3s to be in mono, but it wouldn't hurt to check to be sure. As with above, check the output of FFmpeg. In my example above, it says stereo.
As with sample rate, technically each frame could specify its own channel count but I don't know of any player that will pull off switching channel count mid-stream. Therefore, if you're concatenating, you need to make sure all the channel counts are the same.
ID3 Tags
It's common for there to be ID3 metadata at the beginning (ID3v2) and/or end (ID3v1) of the file. It's less expected to have this data mid-stream. You would want to make sure this metadata is all stripped out before concatenating.
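If you're shelling out to FFmpeg anyway, one way to strip the tags while re-muxing is sketched below; `-map_metadata -1` drops the metadata, and the mp3 muxer's `id3v2_version` / `write_id3v1` options control whether the tag structures themselves are written (worth confirming against your FFmpeg version):

const { spawn } = require('child_process');

// Copy the audio frames untouched, but drop ID3v1/ID3v2 metadata.
spawn('ffmpeg', [
  '-i', 'input.mp3',
  '-map_metadata', '-1',    // discard global metadata
  '-c:a', 'copy',           // no re-encode, just re-mux
  '-write_id3v1', '0',      // mp3 muxer: no ID3v1 tag at the end
  '-id3v2_version', '0',    // mp3 muxer: no ID3v2 header at the start
  'stripped.mp3'
], { stdio: 'inherit' });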
MP3 Bit Reservoir
MP3 frames don't necessarily stand alone. If you have a constant bitrate stream, the encoder may still use less data to encode one frame, and more data to encode another. When this happens, some frames contain data for other frames. That way, frames that could benefit from the extra bandwidth can get it while still fitting the whole stream within a constant bitrate. This is the "bit reservoir".
If you cut a stream and splice in another stream, you may split up a frame and its dependent frames. This typically causes an audio glitch, but may also cause the decoder to skip ahead. Some badly behaving decoders will just stop playing altogether. In your example, you're not cutting anything so this probably isn't the source of your trouble... but I mention it here because it's definitely relevant to the way you're working these streams.
See also: http://wiki.hydrogenaud.io/index.php?title=Bit_reservoir
Solutions
Pick a "normal" format, then resample and re-encode non-conforming files
If most of your sources are the exact same format and only one or two stand out, you could convert just the non-conforming files. From there, strip ID3 tags from everything and concatenate away.
To do the conversion, I'd recommend kicking it over to FFmpeg as a child process.
const child_process = require('child_process');

child_process.spawn('ffmpeg', [
  // Input
  '-i', inputFile, // Use '-' to read from STDIN instead
  // Set sample rate
  '-ar', '44100',
  // Set audio channel count
  '-ac', '1',
  // Audio bitrate... try to match the others, but not as critical
  '-b:a', '64k',
  // Ensure we output an MP3
  '-f', 'mp3',
  // Output
  outputFile // As with input, use '-' to write to STDOUT
]);
Best Solution: Let FFmpeg (or similar) do the work for you
The simplest, most robust solution to all of this is to let FFmpeg build a brand new stream for you. This will cause your audio files to be decoded to PCM, and a new stream made. You can add parameters to resample those inputs, and modify channel counts if needed. Then output one stream. Use the concat filter.
This way, you can accept audio files of any type, you don't have to write the code to hack those streams together, and once it's set up you won't have to worry about it.
The only downside is that it will require a re-encoding of everything, meaning another generation of quality lost. This would be required for any non-conforming files anyway, and it's just speech, so I wouldn't give it a second thought.
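A sketch of what that might look like from Node, using the same child_process approach as above (file names are placeholders; the concat filter decodes each input to PCM, joins them, and re-encodes a single stream):

const { spawn } = require('child_process');

const inputs = ['sound1.mp3', 'sound2.mp3', 'sound3.mp3', 'sound4.mp3'];

const args = [];
inputs.forEach(f => args.push('-i', f));

args.push(
  '-filter_complex',
  // e.g. "[0:a][1:a][2:a][3:a]concat=n=4:v=0:a=1[out]"
  inputs.map((_, i) => `[${i}:a]`).join('') +
    `concat=n=${inputs.length}:v=0:a=1[out]`,
  '-map', '[out]',
  '-ac', '1',        // mono is plenty for speech
  '-ar', '44100',    // one consistent sample rate
  '-b:a', '64k',
  'output.mp3'
);

spawn('ffmpeg', args, { stdio: 'inherit' });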
@Brad's answer was the solution! The first solution he suggested worked. It took some messing around to get FFmpeg working correctly, but in the end using the fluent-ffmpeg library worked.
Each file in my case was stored in Google Cloud Storage, and not on the server's hard drive. This posed some problems for FFmpeg, as it requires file paths when given multiple inputs, or an input stream (but only one of those is supported, since there is only one STDIN).
One solution is to put the files on the hard drive temporarily, but this would not work for our use case, as this function may see a lot of use and the hard drive adds latency.
So, instead we did as suggested and loaded each file into ffmpeg to convert it into a standardized format. This was a bit tricky, but in the end requesting each file as a stream, using that stream as an input for ffmpeg, then using fluent-ffmpeg's pipe() method (which returns a stream) as output worked.
We then bound an event listener to the 'data' event for this pipe, and pushed the data to an array (bufs.push(data)), and on stream 'end' we concatenated this array using Buffer.concat(bufs), followed by a promise resolve.
Then once all requests promises were resolved, we could be sure ffmpeg had processed each file, and then those buffers were concatenated in the required groups as before using Buffer.concat(), converted to base64 data, and sent to the client.
This works great, and now it seems to be able to handle every combination of files/sources I can throw at it!
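Roughly, the per-file conversion step looked like this; a simplified sketch with placeholder names rather than the exact production code:

const axios = require('axios');
const ffmpeg = require('fluent-ffmpeg');

// Fetch one file from storage as a stream, run it through ffmpeg to a
// standardized mono 44.1 kHz MP3, and resolve with the converted bytes.
function normalizeFile(fileUrl) {
  return axios.get(fileUrl, { responseType: 'stream' }).then(res =>
    new Promise((resolve, reject) => {
      const bufs = [];
      ffmpeg(res.data)           // HTTP stream as ffmpeg's single input
        .audioFrequency(44100)
        .audioChannels(1)
        .audioBitrate('64k')
        .format('mp3')
        .on('error', reject)
        .pipe()                  // fluent-ffmpeg returns an output stream
        .on('data', chunk => bufs.push(chunk))
        .on('end', () => resolve(Buffer.concat(bufs)));
    })
  );
}

// Once every file has been normalized, concatenation behaves as expected.
async function buildOutput(fileUrls) {
  const buffers = await Promise.all(fileUrls.map(normalizeFile));
  return Buffer.concat(buffers).toString('base64');
}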
In conclusion:
The answer to the question was that the mp3 data must have been encoded differently (different channels, sample rates, etc.), and loading it through ffmpeg and outputting it in a 'unified' way made the mp3 data compatible.
The solution was to process each file in ffmpeg separately, pipe the ffmpeg output into a buffer, then concatenate the buffers.
Thanks @Brad for your suggestions and detailed answer!

How to get the exact timestamp of audio recording start with PortAudio?

We are using PortAudio for recording audio in our electron application. As a node wrapper, we use naudiodon.
The application needs to record both audio and video, but from different sources. Audio, as said, is recorded with PortAudio, with additional app logic on top. Video, on the other hand, is recorded with the standard MediaRecorder API, with its own formats, properties, and codecs.
We use event 'onstart' to track actual video start and in order to sync audio and video, we must also know the exact audio start time.
Problem is: We are not able to detect that exact timestamp of audio start. What should be the correct way of doing it?
Here is what we tried:
 1. The first option is to listen to portaudio.AudioIO events, such as 'data' and 'readable'. These fire as soon as PortAudio has a new data chunk, so taking the arrival time of the very first chunk minus its length in milliseconds gives an approximate audio start (see the sketch after this list).
 2. The second option is to add a Writable pipe to AudioIO and do pretty much the same thing as with the events.
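For reference, the first option might look roughly like this; this is only a sketch assuming naudiodon's AudioIO stream interface with 16-bit mono input, and the option values are illustrative:

const portAudio = require('naudiodon');

const sampleRate = 48000;      // illustrative values - match your device
const channels = 1;
const bytesPerSample = 2;      // 16-bit samples

const ai = new portAudio.AudioIO({
  inOptions: {
    channelCount: channels,
    sampleFormat: portAudio.SampleFormat16Bit,
    sampleRate: sampleRate,
    deviceId: -1               // default input device
  }
});

let estimatedStartMs = null;

ai.on('data', chunk => {
  if (estimatedStartMs === null) {
    // Arrival time of the first chunk minus that chunk's duration.
    // If PortAudio buffered several chunks internally before releasing
    // the first one, this estimate will be later than the real start.
    const chunkMs = (chunk.length / (bytesPerSample * channels) / sampleRate) * 1000;
    estimatedStartMs = Date.now() - chunkMs;
  }
});

ai.start();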
The issue is that with either approach the calculated start does not always match the actual timestamp of the audio start. While experimenting with PortAudio we noticed that the calculated timestamp comes out later than it should, as though some chunks are buffered before actually being released.
The actual audio start and the first chunk release can differ by roughly 50 - 500 ms with a chunk length of ~50 ms. So chunks sometimes get buffered and sometimes don't. Is there any way to track the actual start time of the first chunk? I wasn't able to find anything relevant in the PortAudio docs.
Or are there other ways to keep using PortAudio while recording video separately, and still achieve the desired result of keeping them in sync?
PortAudio 19.5, Naudiodon 2.1.0, Electron 6.0.9, Node.js 12.4.0

Moviepy write_videofile changes number of video and audio frames even after using 'rawvideo' as the codec parameter

I am using moviepy (Python) to read the video and audio frames of a video and, after making some changes, write them back to a video file, say new.avi, to preserve the changes. To avoid compression, I am using codec='rawvideo' in the write_videofile function. But when I read the video and audio frames back, their counts differ from what was written; they are usually higher.
Can anybody tell me the reason? Is it because of the ffmpeg build used, or something else? Does it always happen, or could there be a problem with my machine? Thank you :-)

AVAssetWriter real-time processing audio from file and audio from AVCaptureSession

I'm trying to create a MOV file with two audio tracks and one video track, and I'm trying to do so without AVAssetExportSession or AVComposition, as I want to have the resultant file ready almost immediately after the AVCaptureSession ends. An export after the capture session may only take a few seconds, but not in the case of a 5 minute capture session. This looks like it should be possible, but I feel like I'm just a step away:
There's source #1 - video and audio recorded via AVCaptureSession (handled via AVCaptureVideoDataOutput and AVCaptureAudioDataOutput).
There's source #2 - an audio file read in with an AVAssetReader. Here I use an AVAssetWriterInput and requestMediaDataWhenReadyOnQueue. I call setTimeRange on its AVAssetReader, from CMTimeZero to the duration of the asset, and this shows correctly as 27 seconds when logged out.
I have each of the three inputs working on a queue of its own, and all three are concurrent. Logging shows that they're all handling sample buffers - none appear to be lagging behind or stuck in a queue that isn't processing.
The important point is that the audio file works on its own, using all the same AVAssetWriter code. If I set my AVAssetWriter to output a WAVE file and refrain from adding the writer inputs from #1 (the capture session), I finish my writer session when the audio-from-file samples are depleted. The audio file reports as being of a certain size, and it plays back correctly.
With all three writer inputs added, and the file type set to AVFileTypeQuickTimeMovie, the requestMediaDataWhenReadyOnQueue process for the audio-from-file still appears to read the same data. The resultant mov file shows three tracks, two audio, one video, and the duration of the captured audio and video are not identical in length but they've obviously worked, and the video plays back with both intact. The third track (the second audio track), however, shows a duration of zero.
Does anyone know if this whole solution is possible, and why the duration of the from-file audio track is zero when it's in a MOV file? If there was a clear way for me to mix the two audio tracks I would, but for one, AVAssetReaderAudioMixOutput takes two AVAssetTracks, and I essentially want to mix an AVAssetTrack with captured audio, and they aren't managed or read in the same way.
I'd also considered that the QuickTime Movie won't accept certain audio formats, but I'm making a point of passing the same output settings dictionary to both audio AVAssetWriterInputs, and the captured audio does play and report its duration (and the from-file audio plays when in a WAV file with those same output settings), so I don't think this is an issue.
Thanks.
I discovered that the reason for this is:
I correctly use the presentation time stamp of the incoming capture session data (the PTS of the video data, at the moment) to begin a writer session (startSessionAtSourceTime), which meant that the audio data read from the file had the wrong timestamps - outside the time range dictated to the AVAssetWriter session. So I had to further process the data from the audio file, changing its timing information using CMSampleBufferCreateCopyWithNewTiming.
// Keep the duration of the buffer read from the audio file.
CMTime bufferDuration = CMSampleBufferGetOutputDuration(nextBuffer);
CMSampleBufferRef timeAdjustedBuffer;
CMSampleTimingInfo timingInfo;
timingInfo.duration = bufferDuration;
// Rebase the presentation time onto the time used to start the writer
// session, so the buffer falls inside the session's time range.
timingInfo.presentationTimeStamp = _presentationTimeUsedToStartSession;
timingInfo.decodeTimeStamp = kCMTimeInvalid;
// Create a copy of the sample buffer carrying the adjusted timing info.
CMSampleBufferCreateCopyWithNewTiming(kCFAllocatorDefault, nextBuffer, 1, &timingInfo, &timeAdjustedBuffer);

A way to add data "mid stream" to encoded audio (possibly with AAC)

Is there a way to add lossless data to an AAC audio stream?
Essentially I am looking to be able to inject "this frame of audio should be played at XXX time" every n frames in.
If I used a lossless codec, I suppose I could just inject my own header mid-stream and that data would stay intact, since the stream has to come out the same as it went in, just as gzip does not lose data.
Any ideas? I suppose I could encode the data into chunks of AAC on the server and, at the network layer, add a timestamp saying "play the following chunk of AAC at time x", but I'd prefer to find a way to add it to the audio itself.
This is not really possible (short of writing your own specialized encoder), as AAC (and MP3) frames are not truly standalone.
There is a concept of the bit reservoir, where unused bandwidth from one frame can be utilized for a later frame that may need more bandwidth to store a more complicated sound. That is, data from frame 1 might be needed in frame 2 and/or 3. If you cut the stream between frames 1 and 2 and insert your alternative frames, the reference to the bit reservoir data is broken and you have damaged frame 2's ability to be decoded.
There are encoders that can work in a mode where the bit reservoir isn't used (at the cost of quality). If operating in this mode, you should be able to cut the stream more freely along frame boundaries.
Unfortunately, the best way to handle this is to do it in the time domain when dealing with your raw PCM samples. This gives you more control over the timing placement anyway, and ensures that your stream can also be used with other codecs.
