How is the AAC encoder priming delay handled in HLS?

According to Apple, AAC encoding adds 2112 priming samples at the beginning of the audio. When creating an HLS stream with AAC audio, will these priming samples be added to the beginning of each HLS segment, or only to the first segment? And how does this AAC encoder delay affect HLS DISCONTINUITY tags later in the stream?
https://developer.apple.com/library/archive/documentation/QuickTime/QTFF/QTFFAppenG/QTFFAppenG.html

It depends on the AAC variant you use.
For 'old-style' AAC-LC you only have priming samples at the beginning of the stream and not at the beginning of each segment.
But the delay is carried through the entire stream.
Typically a new piece of media is displayed after a DISCONTINUITY tag - for example an advertisement - so you will receive another set of priming samples.
Your AAC audio decoder needs to discard the priming samples (the first 2112 PCM output samples) after startup and after each DISCONTINUITY.
If you use the more modern xHE-AAC - you don't have to worry about priming samples anymore.
Another wrinkle - in the early days it was simply assumed that AAC-LC has 2112 priming samples.
Now the number can differ, and it can be signaled in the MP4 container as an edit list.
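To illustrate the decoder-side handling described above, here is a minimal Python sketch. The class name and the fixed 2112-sample count are assumptions for illustration only; a real player should prefer the count signaled by the container (edit list) when present.

```python
# Minimal sketch: drop AAC priming samples from decoded PCM.
# The fixed 2112-sample count is an assumption; prefer the signaled value.

PRIMING_SAMPLES = 2112

class PrimingTrimmer:
    def __init__(self, priming=PRIMING_SAMPLES):
        self.priming = priming
        self.to_drop = priming  # drop at startup

    def on_discontinuity(self):
        # After EXT-X-DISCONTINUITY a new encode starts, so a fresh
        # set of priming samples has to be dropped again.
        self.to_drop = self.priming

    def feed(self, pcm_samples):
        """Return only the PCM samples that should actually be played."""
        if self.to_drop == 0:
            return pcm_samples
        drop = min(self.to_drop, len(pcm_samples))
        self.to_drop -= drop
        return pcm_samples[drop:]
```

Usage would be: call `on_discontinuity()` whenever the playlist signals EXT-X-DISCONTINUITY, and pass each decoded 1024-sample frame through `feed()`.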

Related

FFmpeg adding extra 128ms in audio file while converting a WAV to AAC

I have a stream of audio bytes and am doing a live stream using HLS. First, I convert a few audio bytes to WAV chunks and then convert the WAV to AAC. Converting each chunk to AAC with FFmpeg adds an extra 128 ms, so over time the audio length grows significantly compared to the original audio length.
I tried reading audio chunks in multiples of 1024 samples for the AAC conversion, but it didn't help.
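The growth per chunk is consistent with the per-encode delay discussed in the answer above: each separate AAC encode adds its own priming. One common workaround is to keep a single long-lived encoder running for the whole live stream so the delay is added only once. A hedged sketch, assuming raw 16-bit little-endian mono PCM at 44100 Hz and an HLS output chosen for illustration (all assumptions; adapt to your actual source):

```python
# Sketch: feed all PCM chunks through ONE long-lived ffmpeg encoder so the
# AAC encoder delay is added once for the whole stream, not once per chunk.
# Assumptions: s16le mono 44100 Hz input, HLS output paths for illustration.
import subprocess

encoder = subprocess.Popen(
    ["ffmpeg",
     "-f", "s16le", "-ar", "44100", "-ac", "1", "-i", "pipe:0",  # raw PCM in
     "-c:a", "aac", "-b:a", "128k",
     "-f", "hls", "-hls_time", "6", "live.m3u8"],                # segment once
    stdin=subprocess.PIPE)

def push_pcm(pcm_bytes: bytes) -> None:
    # Append the next live chunk; ffmpeg keeps encoding and segmenting
    # continuously, so no per-chunk priming delay accumulates.
    encoder.stdin.write(pcm_bytes)
    encoder.stdin.flush()
```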

Mux segmented mpegts audio and video to single clip with error correction

I have a recording as a collection of files in mpegts format, like
audio: a-1.ts, a-2.ts, a-3.ts, a-4.ts
video: v-1.ts, v-2.ts, v-3.ts
I need to make a single video clip in mp4 or mkv format.
However, there are two problems:
audio and video segments each have different durations, and the number of audio segments differs from the number of video segments, although the total durations of audio and video match. Hence I cannot concatenate audio/video segments pairwise using ffmpeg and merge them afterwards; I get sync issues that increase progressively
a few segments are corrupt or missing, so if I concatenate the audio and video streams separately using ffmpeg I get streams of different lengths. When I merge these streams with ffmpeg, A/V synchronization is correct only until the first missing packet is encountered.
It's OK if video freezes for a while or there is silence for a while as long as most of the video is in sync with audio.
I've checked with tsduck, and PCR seems to be present in all audio and video segments, yet I could not find a way to merge the streams using the MPEG-TS PCR as a sync reference. Please advise how I can achieve this.
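Not from the original thread, but as a baseline before attempting any PCR-based resynchronization: concatenate each stream separately with ffmpeg's concat demuxer, then mux the two results. File names are taken from the question; this sketch does not repair the gaps caused by missing or corrupt segments, it only shows the stream-by-stream concat and mux steps.

```python
# Baseline sketch: concatenate audio and video segments separately with the
# concat demuxer, then mux. Does NOT heal gaps from missing/corrupt segments.
import subprocess

def write_list(path, segments):
    with open(path, "w") as f:
        for seg in segments:
            f.write(f"file '{seg}'\n")

write_list("audio.txt", ["a-1.ts", "a-2.ts", "a-3.ts", "a-4.ts"])
write_list("video.txt", ["v-1.ts", "v-2.ts", "v-3.ts"])

for lst, out in (("audio.txt", "audio.ts"), ("video.txt", "video.ts")):
    subprocess.run(["ffmpeg", "-f", "concat", "-safe", "0",
                    "-i", lst, "-c", "copy", out], check=True)

# Mux the two concatenated streams into a single clip.
subprocess.run(["ffmpeg", "-i", "video.ts", "-i", "audio.ts",
                "-map", "0:v", "-map", "1:a", "-c", "copy", "out.mkv"],
               check=True)
```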

How to capture raw audio samples after every 10MS using FFmpeg programmatically?

I am using FFmpeg to record audio from PulseAudio at a 48 kHz sampling rate and a 32 kbps bitrate. I then encode these recorded samples with Opus, which is configured with default settings such as frame_size set to 120, initial padding set to 120, and so on. I am not sure about the recording callback settings for PulseAudio, but I have read in the Opus RFC that it uses 120 ms by default and can support 10, 20, 40, 60, and any other frame size that is a multiple of 2.5 ms.
I have a project where I have added WebRTC P2P connection support to send FFmpeg-recorded audio/video over the audio/video channel of WebRTC to the next connected peer. In the case of sending plain audio and plain video recorded through FFmpeg, there is no issue sending these over an audio/video channel (obviously); however, in the case of FFmpeg-encoded data, I have to modify the WebRTC audio/video encoder factories (which is clear to me).
I want to capture raw audio samples using FFmpeg in a callback every 10 ms.
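One way to get fixed 10 ms chunks regardless of FFmpeg's internal frame sizes is to read raw PCM from an ffmpeg pipe and re-chunk it yourself. The sketch below assumes the `default` PulseAudio source, 48 kHz, mono, s16le output, and a hypothetical `on_10ms_chunk` callback (all assumptions); at 48 kHz, 10 ms is 480 samples, i.e. 960 bytes of s16le mono.

```python
# Sketch: pull raw PCM from ffmpeg (PulseAudio capture) and deliver it in
# fixed 10 ms chunks. Assumes the "default" pulse source, 48 kHz, mono, s16le.
import subprocess

SAMPLE_RATE = 48000
CHANNELS = 1
BYTES_PER_SAMPLE = 2                                            # s16le
CHUNK_BYTES = SAMPLE_RATE // 100 * CHANNELS * BYTES_PER_SAMPLE  # 10 ms = 960 B

capture = subprocess.Popen(
    ["ffmpeg", "-f", "pulse", "-i", "default",
     "-ar", str(SAMPLE_RATE), "-ac", str(CHANNELS),
     "-f", "s16le", "pipe:1"],
    stdout=subprocess.PIPE)

def on_10ms_chunk(pcm: bytes) -> None:
    # Hand the 480-sample chunk to the Opus encoder / WebRTC pipeline here.
    pass

buf = b""
while True:
    data = capture.stdout.read(CHUNK_BYTES - len(buf))
    if not data:
        break
    buf += data
    if len(buf) == CHUNK_BYTES:
        on_10ms_chunk(buf)
        buf = b""
```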

Problem understanding audio stream number of samples when decoded with ffmpeg

The two streams I am decoding are an audio stream (ADTS AAC, 1 channel, 44100 Hz, 8-bit, 128 bps) and a video stream (H.264), both received in an MPEG-TS stream, but I noticed something that doesn't make sense to me when I decode the AAC audio frames and try to line up the audio/video stream timestamps. I'm decoding the PTS for each video and audio frame; however, I only get a PTS in the audio stream every 7 frames.
When I decode a single audio frame I always get back 1024 samples. The frame rate is 30 fps, so I see 30 frames, each with 1024 samples, which equals 30,720 samples rather than the expected 44,100 samples. This is a problem when computing the timeline, as the timestamps on the frames are slightly different between the audio and video streams. It's very close, but since I compute the timestamps via (1024 samples * 1,000 / 44,100 * 10,000 ticks), it's never going to line up exactly with the 30 fps video.
Am I doing something wrong here with decoding the FFmpeg audio frames, or am I misunderstanding audio samples?
And in my particular application these timestamps are critical, as I am trying to line up LTC timestamps, which are decoded at the audio frame level, with the video frames.
FFProbe.exe:
Video:
r_frame_rate=30/1
avg_frame_rate=30/1
codec_time_base=1/60
time_base=1/90000
start_pts=7560698279
start_time=84007.758656
Audio:
r_frame_rate=0/0
avg_frame_rate=0/0
codec_time_base=1/44100
time_base=1/90000
start_pts=7560686278
start_time=84007.625311
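For reference, the mismatch described above comes from assuming one audio frame per video frame: each AAC frame is 1024 samples, so at 44100 Hz there are about 43.07 audio frames per second, not 30, and the audio PTS advances by 1024/44100 s per frame (about 2089.8 ticks in the 90 kHz time base). A small worked example using the numbers from the ffprobe output above:

```python
# Worked numbers: AAC frames are 1024 samples each, so the audio frame rate
# is 44100 / 1024 ~= 43.07 frames/s, not 30 frames/s like the video.
SAMPLE_RATE = 44100
SAMPLES_PER_FRAME = 1024
TIME_BASE = 90000                      # MPEG-TS 90 kHz clock

audio_frames_per_second = SAMPLE_RATE / SAMPLES_PER_FRAME                # ~43.07
pts_delta_per_audio_frame = SAMPLES_PER_FRAME * TIME_BASE / SAMPLE_RATE  # ~2089.8

def audio_frame_time(n, start_pts=7560686278):
    # Wall-clock time (seconds) of audio frame n, counted from stream start.
    return (start_pts + n * pts_delta_per_audio_frame) / TIME_BASE

print(audio_frames_per_second, pts_delta_per_audio_frame, audio_frame_time(0))
# audio_frame_time(0) reproduces the start_time of 84007.625311 shown above.
```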

RTP AAC Packet Depacketizer

I asked earlier about H264 at RTP H.264 Packet Depacketizer
My question now is about the audio packets.
I noticed via the RTP packets that audio frames like AAC, G.711, G.726 and others all have the Marker Bit set.
I think the frames are independent. Am I right?
My question is: audio frames are small, but I know that I can have more than one frame per RTP packet. Regardless of how many frames I have, are they always complete, or can a frame be fragmented across RTP packets?
The difference between audio and video is that audio is typically encoded either in individual samples, or in certain [small] frames without reference to previous data. Additionally, the amount of data is small. So audio does not typically need complicated fragmentation to be transmitted over RTP. However, for any payload type you should again refer to the RFC that describes the details:
AAC - RTP Payload Format for MPEG-4 Audio/Visual Streams
G.711 - RTP Payload Format for ITU-T Recommendation G.711.1
G.726 - RTP Profile for Audio and Video Conferences with Minimal Control
Other
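As an illustration (not from the thread), here is a minimal sketch for the widely used `mpeg4-generic` AAC-hbr packetization of RFC 3640, assuming sizeLength=13 and indexLength=3; the real values come from the SDP `fmtp` line. In that mode the payload starts with a 16-bit AU-headers-length (in bits), followed by one 16-bit header per access unit, then the AU data back to back; a payload may carry several complete AUs, or a fragment of one large AU whose final packet is flagged by the RTP marker bit.

```python
# Sketch: split an RTP payload in RFC 3640 mpeg4-generic AAC-hbr mode
# (sizeLength=13, indexLength=3 -- assumed; take the real values from SDP).
import struct

def split_aac_hbr_payload(payload: bytes):
    """Return the list of access-unit byte strings in one RTP payload."""
    # First 16 bits: total length of the AU headers, in bits.
    (au_headers_len_bits,) = struct.unpack_from(">H", payload, 0)
    n_headers = au_headers_len_bits // 16          # each header is 16 bits here
    sizes = []
    for i in range(n_headers):
        (hdr,) = struct.unpack_from(">H", payload, 2 + 2 * i)
        sizes.append(hdr >> 3)                     # top 13 bits = AU size
    aus, offset = [], 2 + 2 * n_headers
    for size in sizes:
        aus.append(payload[offset:offset + size])  # may be a fragment if the
        offset += size                             # AU exceeds one packet
    return aus
```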
