Problem understanding audio stream number of samples when decoded with ffmpeg - audio

The two streams I am decoding are an audio stream (ADTS AAC, 1 channel, 44100, 8-bit, 128bps) and a video stream (H264), which are received in an MPEG-TS stream. I noticed something that doesn't make sense to me when I decode the AAC audio frames and try to line up the audio/video stream timestamps. I'm decoding the PTS for each video and audio frame; however, I only get a PTS in the audio stream every 7 frames.
When I decode a single audio frame I get back 1024 samples, always. The frame rate is 30fps, so I see 30 frames each with 1024 samples, which comes to 30,720 samples and not the expected 44,100 samples. This is a problem when computing the timeline, as the timestamps on the frames are slightly different between the audio and video streams. It's very close, but since I compute the timestamps via (1024 samples * 1,000 / 44,100 * 10,000 ticks) it's never going to line up exactly with the 30fps video.
Am I doing something wrong here with decoding the ffmpeg audio frames, or misunderstanding audio samples?
In my particular application these timestamps are critical, as I am trying to line up LTC timestamps, which are decoded at the audio frame level, with video frames.
FFProbe.exe:
Video:
r_frame_rate=30/1
avg_frame_rate=30/1
codec_time_base=1/60
time_base=1/90000
start_pts=7560698279
start_time=84007.758656
Audio:
r_frame_rate=0/0
avg_frame_rate=0/0
codec_time_base=1/44100
time_base=1/90000
start_pts=7560686278
start_time=84007.625311
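
The arithmetic in the question can be checked with a short sketch. A decoded AAC frame here carries 1024 samples, so at 44,100 Hz about 43.07 audio frames cover one second, independent of the 30 fps video rate; counting only 30 audio frames per second is what yields 30,720. The Python below is illustrative only (the function names are made up and are not part of FFmpeg). It uses exact fractions to derive per-frame timestamps in the 1/90000 time base from the ffprobe values above, under the assumption that a PTS received every 7 frames belongs to the first frame of its group.

```python
from fractions import Fraction

SAMPLES_PER_AAC_FRAME = 1024  # frame length observed in the question


def aac_frames_per_second(sample_rate):
    """AAC frames needed to cover one second of audio (not tied to video fps)."""
    return Fraction(sample_rate, SAMPLES_PER_AAC_FRAME)


def frame_duration_ticks(sample_rate, ticks_per_second):
    """Exact duration of one AAC frame, expressed in ticks of the given clock."""
    return Fraction(SAMPLES_PER_AAC_FRAME * ticks_per_second, sample_rate)


def interpolate_pts(stamped_pts, frames_after_stamp, sample_rate, ticks_per_second=90000):
    """PTS of a frame that follows a stamped frame.

    Assumption: the received PTS belongs to the first frame of its group.
    """
    step = frame_duration_ticks(sample_rate, ticks_per_second)
    return stamped_pts + frames_after_stamp * step


# Values from the ffprobe output: 44,100 Hz audio, time_base 1/90000.
audio_start_pts = 7560686278
for i in range(7):
    pts = interpolate_pts(audio_start_pts, i, 44100)
    print(i, round(pts), float(pts) / 90000)
```

In the 90 kHz time base one frame lasts 102400/49 ticks (about 2089.8), which is not an integer. Keeping an exact running total and rounding only when a timestamp is needed avoids accumulating drift over long recordings.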

Related

FFmpeg adding extra 128ms in audio file while converting a WAV to AAC

I have a stream of audio bytes and am live streaming it using HLS. First, I convert a few audio bytes to WAV chunks and then convert WAV to AAC. During the conversion to AAC, FFmpeg adds an extra 128ms to every chunk. Because of the extra 128ms per chunk, the audio length grows significantly over time compared to the original audio length.
I tried reading audio chunks in multiples of 1024 samples for the AAC conversion, but it didn't work.

Mux segmented mpegts audio and video to single clip with error correction

I have a recording as a collection of files in mpegts format, like
audio: a-1.ts, a-2.ts, a-3.ts, a-4.ts
video: v-1.ts, v-2.ts, v-3.ts
I need to make a single video clip in mp4 or mkv format.
However, there are two problems:
1. Audio and video segments each have a different duration, and the number of audio segments differs from the number of video segments. The total durations of audio and video match. Hence I cannot concat audio and video segments pairwise using mpeg and merge them afterwards; I get sync issues that increase progressively.
2. A few segments are corrupt or missing. So if I concat the audio and video streams separately using ffmpeg, I get streams of different lengths. When I merge these streams using ffmpeg, I have correct a/v synchronization until the time when the first missing packet is encountered.
It's OK if video freezes for a while or there is silence for a while as long as most of the video is in sync with audio.
I've checked with tsduck and PCR seems to be present in all audio and video segments yet I could not find a way to merge streams using mpegTS PCR as sync reference. Please advise how can I achieve this.

How is the AAC encoder priming delay handled in HLS?

As per Apple, in AAC encoding 2112 priming samples are added at the beginning of audio. When creating HLS stream with AAC audio, will these priming samples be added to the beginning of each HLS segment or only to the first HLS segment? And, how does this AAC encoder delay affect HLS DISCONTINUITY tags later in the HLS stream?
https://developer.apple.com/library/archive/documentation/QuickTime/QTFF/QTFFAppenG/QTFFAppenG.html
It depends on the AAC variant you use.
For 'old-style' AAC-LC you only have priming samples at the beginning of the stream and not at the beginning of each segment.
But the delay is carried through the entire stream.
Typically a new piece of media is displayed after a DISCONTINUITY tag - for example an advertisement - so you will receive another set of priming samples.
Your AAC audio decoder needs to discard the priming samples (the first 2112 PCM output samples) after startup and after each DISCONTINUITY.
If you use the more modern xHE-AAC - you don't have to worry about priming samples anymore.
Another wrinkle - in the early days it was just assumed that AAC-LC has 2112 priming samples.
Now the number can be different and it can be signaled in the MP4 container as Edit-List.
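
The discard step described in this answer can be sketched as follows. This is a minimal illustration, not the API of any real decoder: the class name `PrimingTrimmer` and its methods are hypothetical, the 2112 default is the value quoted above (the actual count may differ and may be signaled by the container), and samples are assumed to be counted per channel on a mono or de-interleaved buffer.

```python
PRIMING_SAMPLES = 2112  # value quoted in the answer; may differ per encoder


class PrimingTrimmer:
    """Drops the first `priming` PCM samples after startup and after each reset."""

    def __init__(self, priming=PRIMING_SAMPLES):
        self.priming = priming
        self.remaining = priming  # samples still to be discarded

    def reset(self):
        """Call when a DISCONTINUITY is seen, since new priming samples follow."""
        self.remaining = self.priming

    def feed(self, pcm):
        """Return one decoded frame's samples with any pending priming removed."""
        drop = min(self.remaining, len(pcm))
        self.remaining -= drop
        return pcm[drop:]
```

With 1024-sample frames, the first two frames are discarded entirely and the third loses its first 64 samples, because 2112 = 2 * 1024 + 64.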

About definition for terms of audio codec

When I was studying the Cocoa Audio Queue documentation, I came across several audio codec terms. They are defined in a structure named AudioStreamBasicDescription.
Here are the terms:
1. Sample rate
2. Packet
3. Frame
4. Channel
I know about sample rate and channel, but I am confused by the other two. What do they mean?
You can also answer this question by example. For example, I have a dual-channel PCM-16 source with a sample rate of 44.1kHz, which means there are 2*44100 = 88200 Bytes PCM data per second. But how about packet and frame?
Thank you in advance!
You are already familiar with the sample rate definition.
The sampling frequency or sampling rate, fs, is defined as the number of samples obtained in one second (samples per second), thus fs = 1/T.
So for a sampling rate of 44100 Hz, you have 44100 samples per second (per audio channel).
The number of frames per second in video is a similar concept to the number of samples per second in audio. Frames for our eyes, samples for our ears.
If you have 16-bit stereo PCM, it means you have 16*44100*2 = 1411200 bits per second => ~172 kB per second => around 10 MB per minute.
Here are the definitions, reworded from Apple's terms:
Sample: a single number representing the value of one audio channel at one point in time.
Frame: a group of one or more samples, with one sample for each channel, representing the audio on all channels at a single point in time.
Packet: a group of one or more frames, representing the audio format's smallest encoding unit, and the audio for all channels across a short amount of time.
As you can see there is a subtle difference between audio and video frame notions. In one second you have for stereo audio at 44.1 kHz: 88200 samples and thus 44100 frames.
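
The sample, frame, and bit-rate figures above can be reproduced with a small sketch. The function name `pcm_rates` and the dictionary keys are made up for illustration; the only inputs are the sample rate, channel count, and bit depth from the example.

```python
def pcm_rates(sample_rate, channels, bits_per_sample):
    """Per-second figures for uncompressed PCM audio."""
    frames_per_second = sample_rate                 # one frame = one sample per channel
    samples_per_second = sample_rate * channels     # individual sample values
    bits_per_second = samples_per_second * bits_per_sample
    return {
        "frames_per_second": frames_per_second,
        "samples_per_second": samples_per_second,
        "bits_per_second": bits_per_second,
        "bytes_per_second": bits_per_second // 8,
    }


rates = pcm_rates(44100, 2, 16)
print(rates)
print(rates["bytes_per_second"] / 1024, "KiB per second")
```

For 16-bit stereo at 44.1 kHz this gives 44100 frames, 88200 samples, and 176400 bytes per second, so the 88200 in the question counts samples rather than bytes.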
Compressed formats like MP3 and AAC pack multiple frames into packets (these packets can then be written to an MP4 file, for example, where they can be efficiently interleaved with video content). Dealing with larger packets helps the encoder identify bit patterns for better coding efficiency.
MP3, for example, uses packets of 1152 frames, which are the basic atomic unit of an MP3 stream. PCM audio is just a series of samples, so it can be divided down to the individual frame, and it really has no packet size at all.
For AAC you can have 1024 (or 960) frames per packet. This is described in the Apple document you pointed at:
The number of frames in a packet of audio data. For uncompressed audio, the value is 1. For variable bit-rate formats, the value is a larger fixed number, such as 1024 for AAC. For formats with a variable number of frames per packet, such as Ogg Vorbis, set this field to 0.
In MPEG-based file formats a packet is referred to as a data frame (not to be confused with the previous audio frame notion). See Brad's comment for more information on the subject.
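
To make the packet sizes above concrete, the sketch below computes how many packets per second each format produces and how long one packet lasts. The table `FRAMES_PER_PACKET` and the function names are illustrative, and the values are the ones quoted in this answer (1024 for AAC, 1152 for MP3, 1 for uncompressed PCM).

```python
from fractions import Fraction

# Frames per packet as quoted in the answer above.
FRAMES_PER_PACKET = {"aac": 1024, "mp3": 1152, "pcm": 1}


def packets_per_second(codec, sample_rate):
    """Exact number of packets needed to carry one second of audio."""
    return Fraction(sample_rate, FRAMES_PER_PACKET[codec])


def packet_duration_ms(codec, sample_rate):
    """Playback duration of a single packet in milliseconds."""
    return 1000.0 * FRAMES_PER_PACKET[codec] / sample_rate


for codec in ("aac", "mp3", "pcm"):
    print(codec, float(packets_per_second(codec, 44100)), packet_duration_ms(codec, 44100))
```

At 44.1 kHz an AAC packet covers roughly 23.2 ms and an MP3 packet roughly 26.1 ms, which is why neither lines up with a whole number of video frames.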

AAC RTP timestamps and synchronization

I am currently streaming audio (AAC-HBR at 8kHz) and video (H264) using RTP. Both feeds work fine individually, but when put together they get out of sync pretty fast (less than 15 sec).
I am not sure how to increment the timestamp in the audio RTP header. I thought it should be the time difference between two RTP packets (around 127ms) or a constant increment of 1/8000 (0.125 ms), but neither worked. Instead I managed to find a sweet spot: when I increment the timestamp by 935 for each packet, it stays synchronized for about a minute.
AAC frame size is 1024 samples. Try to increment by (1/8000) * 1024 = 128 ms. Or a multiple of that in case your packet has multiple AAC frames.
Does that help?
Bit late, but thought of putting up my answer.
The timestamp increment between consecutive audio RTP packets == the number of audio samples contained in each RTP packet.
For AAC, each frame consist of 1024 samples, so timestamp on RTP packet should increase by 1024.
Difference between the clocktime of 2 RTP packets = (1/8000)*1024 = 128ms, i.e sender should send the rtp packets with difference of 128 ms.
Bit more information from other sampling rates:
AAC sampled at 44100 Hz means 44100 samples of the signal in 1 sec.
So 1024 samples means (1000ms/44100)*1024 = 23.21995 ms.
So the timestamp difference between 2 RTP packets = 1024, but
the difference of clock time between 2 RTP packets in the RTP session should be 23.21995 ms.
Trying to correlate with other example:
For example for G711 family (PCM, PCMU, PCMA), The sampling frequency = 8k.
So the 20ms packet should have samples == 8000/50 == 160.
And hence RTP timestamps are incremented by 160.
The difference of clock time between 2 RTP packets should be 20ms.
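
The three cases worked through above follow the same rule, sketched here in Python. The function name `rtp_schedule` is hypothetical, and the sketch assumes, as this answer does, that the RTP clock rate equals the audio sample rate. The timestamp advances by the number of samples per packet, while the wall-clock send interval is that sample count divided by the clock rate.

```python
def rtp_schedule(clock_rate, samples_per_packet, count, first_timestamp=0):
    """Yield (rtp_timestamp, send_offset_ms) for `count` consecutive packets.

    Assumption: the RTP clock rate equals the audio sample rate.
    """
    for n in range(count):
        # RTP timestamps are 32-bit and wrap around.
        timestamp = (first_timestamp + n * samples_per_packet) % 2**32
        offset_ms = 1000.0 * n * samples_per_packet / clock_rate
        yield timestamp, offset_ms


print(list(rtp_schedule(8000, 1024, 3)))    # AAC at 8 kHz
print(list(rtp_schedule(44100, 1024, 3)))   # AAC at 44.1 kHz
print(list(rtp_schedule(8000, 160, 3)))     # G.711, 20 ms packets
```

If a packet carries several AAC frames, `samples_per_packet` becomes 1024 times the number of frames in the packet.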
IMHO video and audio de-sync on Android is difficult to fight if the streams are taken from different media recorders. They just capture different start frames, and there seems to be no way to find out how big the de-sync is and adjust for it with audio or video timestamps on the fly.
