How can I programmatically mux multiple RTP audio streams together? - audio

I have several RTP streams coming in from the network, and since RTP can only handle one stream in each direction, I need to be able to merge a couple of them to send back to another client (it could be one that is already sending an RTP stream, or not; that part isn't important).
My guess is that there is some algorithm for mixing audio bytes.
RTP Stream 1 ---------------------
                                  \_____________________ (1 MUXED 2) RTP Stream Out
RTP Stream 2 ---------------------

There is an IETF draft on RTP stream muxing which might help you.
If you want to use only one stream, you could send the data from the multiple streams as different channels. The way audio channels are multiplexed (interleaved) in WAV files is a good model; you can adopt a similar strategy.
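As a rough illustration of the WAV-style approach, here is a minimal sketch that interleaves two mono buffers into one stereo buffer. It assumes both streams have already been decoded to 16-bit little-endian linear PCM at the same sample rate, and that the payload format you send out can carry two channels; the function name is made up for this example.

```python
import struct

def interleave_mono_to_stereo(left: bytes, right: bytes) -> bytes:
    """Interleave two 16-bit LE mono PCM buffers as L R L R ... (WAV channel order)."""
    # Only whole samples present in both buffers are used.
    n = min(len(left), len(right)) // 2
    l = struct.unpack(f"<{n}h", left[:n * 2])
    r = struct.unpack(f"<{n}h", right[:n * 2])
    # One sample from each stream per output frame.
    out = [sample for pair in zip(l, r) for sample in pair]
    return struct.pack(f"<{2 * n}h", *out)
```

The receiver then hears stream 1 on the left channel and stream 2 on the right, which keeps the two sources separable but doubles the bandwidth.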

I think you are talking about a VoIP conference.
I believe the mediastreamer2 library supports a conference filter.
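For reference, the "algorithm for mixing audio bytes" the question guesses at is usually just sample-wise addition with clipping. The sketch below is not how mediastreamer2 does it (I have not checked its source); it assumes both streams were already decoded from their RTP payloads (e.g. G.711) to time-aligned 16-bit little-endian linear PCM at the same sample rate. The mixed result would then be re-encoded and sent as a new RTP stream.

```python
import struct

def mix_pcm16(a: bytes, b: bytes) -> bytes:
    """Mix two 16-bit LE mono PCM buffers by adding samples and clipping."""
    n = min(len(a), len(b)) // 2  # whole samples available in both buffers
    sa = struct.unpack(f"<{n}h", a[:n * 2])
    sb = struct.unpack(f"<{n}h", b[:n * 2])
    # Saturate instead of letting the sum wrap around, which would sound like a loud click.
    mixed = [max(-32768, min(32767, x + y)) for x, y in zip(sa, sb)]
    return struct.pack(f"<{n}h", *mixed)
```

A real conference mixer also has to handle jitter, packet loss and different packet sizes before it gets to this step.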


Does YouTube store video and audio separately

youtube-dl can be used to see what formats are used to store YouTube content:
youtube-dl -F
The above command hints that the audio and video are mostly stored separately. Is it right? Does YouTube streaming combine audio and video in real-time?
Formats for a sample YouTube content
Most large streaming services will use ABR (adaptive bitrate) streaming.
The two most common ABR streaming formats are HLS and MPEG-DASH. Both provide a manifest or index file, which the player downloads first and which contains links to the media streams: typically audio, video, subtitle tracks, etc.
For encrypted content the audio and video, and even different bit rate video tracks, may all have separate encryption keys.
The player will download the audio and video tracks and synchronise them for playback.
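To make the manifest idea concrete, here is a small sketch that lists the separate audio and video playlists referenced by an HLS-style master playlist. The sample playlist and its URIs are invented for illustration and are not what YouTube actually serves; the parsing is deliberately naive and only looks at the two tags shown.

```python
import re

# Hypothetical master playlist: one audio rendition, two video bitrates.
SAMPLE = """#EXTM3U
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",NAME="English",URI="audio/en.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360,AUDIO="aud"
video/360p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720,AUDIO="aud"
video/720p.m3u8
"""

def list_tracks(playlist: str):
    """Return (audio_uris, video_uris) found in a master playlist."""
    audio, video = [], []
    lines = [ln.strip() for ln in playlist.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-MEDIA:") and "TYPE=AUDIO" in line:
            m = re.search(r'URI="([^"]+)"', line)
            if m:
                audio.append(m.group(1))
        elif line.startswith("#EXT-X-STREAM-INF:") and i + 1 < len(lines):
            # The URI of a variant stream is the line following the tag.
            video.append(lines[i + 1])
    return audio, video
```

Calling `list_tracks(SAMPLE)` gives one audio playlist and two video playlists, which the player fetches independently and synchronises itself.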
In general, streaming video and audio are sent in separate channels; the same goes for multi-track audio like 5.1. During transport these channels are wrapped in a media container such as MP4.
The motive is partly that audio and video use distinct compression algorithms: some algorithms are best for audio, others for video, and baked into the video algorithms is the spreading and sharing of data over time across video frames (see B-frames for details). These channels are not limited to video and audio. If you own both the sending and receiving sides, you can send arbitrary data in many distinct channels by making up your own data protocol. As an aside, a modern codec like H.265 allows data to get sent from the receiver back to the sender when you think you are simply viewing a movie (read the RFC).
YouTube stores each of its various flavors of video and audio in separate files on its end, then combines them based on the desired streaming quality on a per-download basis.

Save audio from RTP stream that contains RFC 2833 RTP events

I'm trying to extract audio from a telephone session captured with Wireshark. The capture was sent to us by the telephone provider for debugging/analysis. I have 3 files: signalling, and two files with UDP data, one for each direction. After merging two of these files (one direction with signalling), Wireshark provides RTP stream analysis. What I observe (as I do for a second session capture) is that Wireshark isn't able to export the RTP stream audio (Payload type: ITU-T G.711 PCMA (8)) for one direction. This happens to be an RTP stream containing "RFC 2833 RTP events" (Payload type: telephone-event (106)). These events seem to transport DTMF tones out-of-band; for each DTMF tone, there is a section of 7 consecutive RTP events of this type. What Wireshark produces is an 8 GB *.au file for an audio stream of less than two minutes. For the opposite-direction stream I get an audio file that is 2 MB in size.
I have to admit that this is just guesswork: I connect the error with a feature that I can see. I'm a bit confused that Wireshark obviously knows these events but fails to save the corresponding audio stream. Do I maybe need some plugin for that?
I tried to search the web for this issue but without success.
This question was previously asked on Network Engineering but turned out to be off-topic there.
You can filter the DTMF events out of the Wireshark capture (pcap) with the display filter rtp.p_type != 106 and then save only the G.711 data in a separate file.
Then run the RTP analysis on that file and save the audio payload in .au/.raw format.
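If you would rather do the same filtering in your own tooling, the idea can be sketched as below. This assumes you already have the raw RTP packets as byte strings (UDP headers stripped); the function names are hypothetical, and 106 is simply the dynamic payload type seen in this particular capture.

```python
import struct

def rtp_payload_type(packet: bytes) -> int:
    """Return the 7-bit payload type from a raw RTP packet."""
    if len(packet) < 12:
        raise ValueError("too short for an RTP header")
    if packet[0] >> 6 != 2:
        raise ValueError("not RTP version 2")
    # Second byte: marker bit (top bit) followed by the 7-bit payload type.
    return packet[1] & 0x7F

def drop_telephone_events(packets, event_pt=106):
    """Keep only packets whose payload type is not the telephone-event type."""
    return [p for p in packets if rtp_payload_type(p) != event_pt]
```

Masking with `0x7F` matters here because RFC 2833 event packets often have the marker bit set, which would otherwise make the byte look like a different payload type.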

How do Shoutcast servers and clients deal with mp3 frame headers and frame dependencies?

Short story:
If I myself intend to receive and then send a Shoutcast compatible audio stream processed by my application, then how to do it properly using an mp3 (de/en)coder library? Pseudo code, or better - lame mp3 specific code would be highly appreciated.
Long story:
More specific questions which bother me were caused by an article about mp3, which says:
Generally, frames are independent items. Each frame has its own header
and audio informations. There is no file header. Therefore, you can
cut any part of MPEG file and play it correctly (this should be done
on frame boundaries but most applications will handle incorrect
headers). For Layer III, this is not 100% correct. Due to internal
data organization in MPEG version 1 Layer III files, frames are often
dependent of each other and they cannot be cut off just like that.
This made me wonder, how Shoutcast servers and clients deal with frame headers and frame dependencies.
Do I have to encode to constant bitrate (CBR) only, if I want to achieve maximum compatibility with most Shoutcast players out there?
Is the mp3 frame header used at all, or is the stream format deduced from a Shoutcast protocol specific HTTP header?
Does the Shoutcast protocol guarantee (or is it common good practice) to start serving the mp3 stream on frame boundaries and continue to respond with chunks that are cut at frame boundaries? But what is the minimum or recommended size of an mp3 frame for streaming live audio?
How does Shoutcast deal with frame dependencies - does it do something special with mp3 encoding to ensure that the served stream does not have frames which depend on previous frames (if this is even possible)? Or maybe it ignores these dependencies on server side/client side, thus getting audio quality reduction or even artifacts?
SHOUTcast servers do not know or care about the data being passed through them. They send it as-is. You can actually send arbitrary data through a SHOUTcast server, and receive it. SHOUTcast will segment the media data wherever the buffer size falls.
It's up to the client to re-sync to the data. It does this by locating the frame header, then beginning to decode. Once the codec has enough frames to reliably play back audio, it will begin outputting raw PCM. It's up to the codec to decide when it's safe to start playback. Since the codec knows what it's doing in terms of decoding the media, it knows when it has sufficient data (including bit reservoirs) to begin without artifacts. It's also worth noting that the bit reservoir cannot be carried on too far, so it takes only a few frames at worst to handle it.
This is one of the reasons it's important to have a sizable buffer server-side, to flush to the clients as fast as possible on connect. If playback is to start quickly, the codec needs more data than the current frame to begin.
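The "locate the frame header" step can be sketched roughly as follows. This is a simplified illustration, not what any particular player does: it only checks the 11 sync bits and rejects header values that are reserved, and the function name is made up. A real decoder would normally also confirm that another header sits at the computed end of the frame before trusting the match.

```python
def find_mp3_sync(buf: bytes, start: int = 0) -> int:
    """Return the offset of the first plausible MPEG audio frame header, or -1."""
    for i in range(start, len(buf) - 3):  # a frame header is 4 bytes long
        b0, b1, b2 = buf[i], buf[i + 1], buf[i + 2]
        if b0 != 0xFF or (b1 & 0xE0) != 0xE0:
            continue  # the 11 sync bits are not all set
        version = (b1 >> 3) & 0x03      # 1 is reserved
        layer = (b1 >> 1) & 0x03        # 0 is reserved
        bitrate_idx = (b2 >> 4) & 0x0F  # 15 is invalid
        rate_idx = (b2 >> 2) & 0x03     # 3 is reserved
        if version == 1 or layer == 0 or rate_idx == 3:
            continue
        # Index 0 ("free format") is legal but rare in streams; it is
        # rejected here only to cut down on false positives.
        if bitrate_idx in (0, 15):
            continue
        return i
    return -1
```

Because the server cuts the stream at arbitrary byte positions, the client simply discards everything before the returned offset and starts feeding the decoder from there.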

RTP AAC Packet Depacketizer

I asked earlier about H264 at RTP H.264 Packet Depacketizer
My question now is about the audio packets.
I noticed via the RTP packets that audio frames like AAC, G.711, G.726 and others all have the Marker Bit set.
I think frames are independent. Am I right?
My question is: audio is small, but I know that I can have more than one frame per RTP packet. Regardless of how many frames I have, are they complete? Or can a frame be fragmented across RTP packets?
The difference between audio and video is that audio is typically encoded either in individual samples, or in certain [small] frames without reference to previous data. Additionally, the amount of data is small. So audio does not typically need complicated fragmentation to be transmitted over RTP. However, for any payload type you should again refer to the RFC that describes the details:
AAC - RTP Payload Format for MPEG-4 Audio/Visual Streams
G.711 - RTP Payload Format for ITU-T Recommendation G.711.1
G.726 - RTP Profile for Audio and Video Conferences with Minimal Control
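As one concrete case of "several frames per packet", here is a sketch for AAC carried in the AAC-hbr mode of RFC 3640. That is an assumption on my part: it is a different payload format from the MPEG-4 Audio/Visual (LATM) one listed above, so check the SDP of your stream first. The sketch assumes the usual AAC-hbr parameters (13-bit AU-size, 3-bit AU-index), and the function name is hypothetical.

```python
def split_aac_hbr_payload(payload: bytes):
    """Split an RFC 3640 AAC-hbr RTP payload into (declared_size, data) pairs."""
    if len(payload) < 2:
        raise ValueError("payload too short")
    # First 16 bits: total length of the AU-header section, in bits.
    headers_bits = int.from_bytes(payload[0:2], "big")
    if headers_bits % 16:
        raise ValueError("expected 16-bit AU headers (sizelength=13, indexlength=3)")
    count = headers_bits // 16
    pos = 2 + 2 * count  # access unit data starts after the header section
    if len(payload) < pos:
        raise ValueError("truncated AU-header section")
    frames = []
    for i in range(count):
        hdr = int.from_bytes(payload[2 + 2 * i:4 + 2 * i], "big")
        size = hdr >> 3  # upper 13 bits: AU size in bytes
        data = payload[pos:pos + size]
        frames.append((size, data))
        pos += size
    return frames
```

If a returned `data` is shorter than its declared size, the packet holds only a fragment of one access unit and the rest follows in later packets; otherwise every frame in the packet is complete.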

Strategy for time-indexed audio archive with lossy compression

For part of one of my projects, I am considering developing an audio archive for internet radio stations. This archive would be indexed and addressable by date/time.
For example, the server would connect to a stream (generally encoded in MP3), and save the stream data. A client could connect to this server and request audio from 2011-07-05 15:58:30 to 2011-07-05 15:59:37. The server would return the audio data to the client for playback.
My initial thought was to save the data to 1-minute chunks of raw MP3 data to disk, and reference these files from a database. The server would be dumb to the stream/file format, and wouldn't understand mpeg frames. It would simply pass on data to the client, dividing the chunks up linearly to send. It would be up to the client to sync to the stream. This is not unlike how internet radio servers run in general. SHOUTcast servers simply output the data, byte for byte, that is sent to them from the encoder. When a client connects, data is sent, regardless of whether or not it even ends on an MP3 frame. It is up to the client to sync.
I am wondering if there might be a better approach, maximizing compatibility with clients and audio formats. Any thoughts on how to go about this?
The only other thing I can think of is decoding the MP3 to raw PCM audio and re-encoding as necessary when requested. I would prefer not to go this route due to the disk space required, and the loss of quality when re-encoding.
This question is language-agnostic, but if it is helpful, I will likely implement a solution in PHP with MySQL as the database.
You don't have to worry about this, since all the MP3 I have accessed over SHOUTcast is constant bitrate, so you don't have to index it. I have a proof-of-concept project that kept the archive in 5-minute chunks, then used PHP to combine those files and pseudo-stream them to Winamp via SHOUTcast. It worked!
And since you are working with MP3, you can assume (and you'll assume correctly) that the density of the captured file is linear, so to reach the 30-second mark of a 60-second file you should seek to the middle. Since MP3 decoders are robust enough, you don't have to track the frames at all here.
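The seek arithmetic can be sketched like this, shown in Python for brevity even though the question mentions PHP. It assumes a CBR stream, chunks that start exactly on minute boundaries (the 1-minute layout from the question), and a bitrate you already know from the stream headers; the function name is made up.

```python
from datetime import datetime

def locate(ts: datetime, bitrate_kbps: int):
    """Return (chunk_start, byte_offset) for a timestamp in a CBR archive."""
    # Chunks are assumed to be 1 minute long and aligned to the minute.
    chunk_start = ts.replace(second=0, microsecond=0)
    seconds_in = (ts - chunk_start).total_seconds()
    # CBR: bytes grow linearly with time (kbps means 1000 bits per second).
    byte_offset = int(seconds_in * bitrate_kbps * 1000 / 8)
    return chunk_start, byte_offset
```

For a 128 kbps stream, a request starting at 2011-07-05 15:58:30 maps to the 15:58 chunk at byte 480000; the client's decoder then re-syncs to the next frame header on its own.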
AACplus is a whole different story. It's inherently VBR.
