Live audio streaming container formats

When I start receiving a live audio (radio) stream (e.g. MP3 or AAC), I assume the received data is not a raw bitstream (i.e. raw encoder output) but is always wrapped in some container format. If this assumption is correct, then I guess I cannot start decoding from an arbitrary place in the stream, but have to wait for some sync byte. Is that right? Is it usual to have sync bytes? Is there a header following the sync byte from which I can determine the codec used, the number of channels, the sample rate, etc.?
When I connect to a live stream, will I receive data starting at the nearest sync byte, or will I get data from the current position and have to search for the sync byte myself?
Some streams, like Icecast, include stream-related information in the HTTP response headers, but I think I can skip those and deal directly with the stream format.
Is that correct?
Regards,
STeN

When you look at SHOUTcast/Icecast, the data that comes across is pure MPEG Layer III audio data, and nothing more. (Provided you haven't requested metadata.)
It can be cut at an arbitrary place, so you need to sync to the stream. This is usually done by finding a potential header, and using the data in that header to find sequential headers. Once you have found a few frame headers, you can safely assume you have synced up to the stream and start decoding for playback.
Again, there is no "container format" for these. It's just raw data.
Now, if you want metadata, you have to request it from the server. The data is then just injected into the stream every x number of bytes. See http://www.smackfu.com/stuff/programming/shoutcast.html.
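To make the syncing step concrete, here is a minimal sketch (not production code) of scanning a buffer for a plausible MPEG Layer III header and confirming it by checking that further headers appear at the computed frame lengths. The tables are abbreviated, and free-format and MPEG-2.5 streams are ignored:
// Abbreviated lookup tables: bitrate index -> kbps, sample rate index -> Hz.
const BITRATES_KBPS = {
  mpeg1: [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 0],
  mpeg2: [0, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160, 0],
};
const SAMPLE_RATES = {
  mpeg1: [44100, 48000, 32000, 0],
  mpeg2: [22050, 24000, 16000, 0],
};

// Parse the 4-byte header at `offset`; returns null if it is not a usable Layer III header.
function parseHeader(buf, offset) {
  if (offset + 4 > buf.length) return null;
  const b1 = buf[offset], b2 = buf[offset + 1], b3 = buf[offset + 2];
  if (b1 !== 0xff || (b2 & 0xe0) !== 0xe0) return null;   // 11-bit sync word
  const version = (b2 >> 3) & 0x03;                       // 11 = MPEG-1, 10 = MPEG-2
  const layer = (b2 >> 1) & 0x03;                         // 01 = Layer III
  if (layer !== 0x01 || (version !== 0x03 && version !== 0x02)) return null;
  const isMpeg1 = version === 0x03;
  const table = isMpeg1 ? 'mpeg1' : 'mpeg2';
  const bitrate = BITRATES_KBPS[table][(b3 >> 4) & 0x0f] * 1000;
  const sampleRate = SAMPLE_RATES[table][(b3 >> 2) & 0x03];
  if (!bitrate || !sampleRate) return null;
  const padding = (b3 >> 1) & 0x01;
  const samplesPerFrame = isMpeg1 ? 1152 : 576;
  const frameLength = Math.floor((samplesPerFrame / 8) * bitrate / sampleRate) + padding;
  return { frameLength, bitrate, sampleRate };
}

// Scan for an offset where several consecutive headers line up.
function findSync(buf, headersToConfirm = 3) {
  for (let i = 0; i + 4 <= buf.length; i++) {
    let pos = i, confirmed = 0;
    while (confirmed < headersToConfirm) {
      const h = parseHeader(buf, pos);
      if (!h) break;
      pos += h.frameLength;
      confirmed++;
    }
    if (confirmed === headersToConfirm) return i;   // likely a real frame boundary
  }
  return -1;
}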

Doom9 has great starting info about both the MPEG and AAC frame formats. Shoutcast will add some 'metadata' now and then, and it's really trivial to handle. The thing I want to share with you is this: I have an application that can capture all kinds of streams, and Shoutcast, both AAC and MP3, is among them. The first versions cut their files at arbitrary points based on time, for example every 5 minutes, regardless of the MP3/AAC frames. That was somewhat OK for MP3 (the files were playable) but very bad for aacPlus.
The thing is, the aacPlus decoder ISN'T that forgiving about bad data, and I had everything from access violations to mysterious software shutdowns with no errors of any kind.
Anyway, if you want to capture a stream: open a socket to the server and read the response; you'll find some headers there, then use that info to strip the metadata that will be injected now and then (see the sketch after the links below). Use the frame header information for both aacPlus and MP3 to determine frame boundaries, and try to honor them and split the file at the right place.
mp3 frame header:
http://www.mp3-tech.org/programmer/frame_header.html
aacplus frame header:
http://wiki.multimedia.cx/index.php?title=Understanding_AAC
also this:
aacplus frame alignment problems
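To illustrate the metadata-stripping part: if you request the stream with an Icy-MetaData: 1 header, the server's response includes icy-metaint, the number of audio bytes between metadata blocks, and each block starts with a length byte counted in 16-byte units. A rough sketch of de-interleaving that (the callback names are made up for the example):
function makeIcyDemuxer(metaInt, onAudio, onMetadata) {
  let audioRemaining = metaInt;    // audio bytes left before the next metadata block
  let pending = Buffer.alloc(0);   // unconsumed input

  return function onData(chunk) {
    pending = Buffer.concat([pending, chunk]);
    while (pending.length > 0) {
      if (audioRemaining > 0) {
        const take = Math.min(audioRemaining, pending.length);
        onAudio(pending.subarray(0, take));
        pending = pending.subarray(take);
        audioRemaining -= take;
      } else {
        const metaLength = pending[0] * 16;            // length byte, in 16-byte units
        if (pending.length < 1 + metaLength) return;   // wait for more data
        if (metaLength > 0) {
          onMetadata(pending.subarray(1, 1 + metaLength).toString('utf8')); // encoding varies by server
        }
        pending = pending.subarray(1 + metaLength);
        audioRemaining = metaInt;
      }
    }
  };
}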

Unfortunately it's not always that easy; check the format and notes here:
MPEG frame header format

I will continue the discussion by answering myself (even though we are discouraged from doing that):
I was also looking into the streamed data and found that the sequence ff f3 82 70 is repeated frequently - I suggest this is the MPEG frame header, so I tried to work out what it means:
ff f3 82 70 (hex) = 11111111 11110011 10000010 01110000 (bin)
Analysis:
11111111111 | sync word
10          | MPEG Version 2
01          | Layer III
1           | no CRC protection
1000        | 64 kbps (MPEG-2 Layer III bitrate index)
00          | 22050 Hz
1           | padding
0           | private bit
01          | joint stereo
11          | mode extension
0, 0, 00    | copyright, original, emphasis
Any comments on that?
When I start receiving the streaming data, should I discard all data prior to this header before handing the buffer to the class that deals with the DSP? I know this can be implementation specific, but I would like to know what the usual approach is here...
BR
STeN
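As a general answer to that last question: yes, the usual pattern is to buffer the incoming bytes, drop everything before the first confirmed frame header, and only then feed the decoder. A tiny sketch of that pattern (socket and decoder are placeholders, and findSync() is a helper along the lines of the one sketched earlier, confirming several consecutive headers):
let pending = Buffer.alloc(0);
let synced = false;

socket.on('data', (chunk) => {
  pending = Buffer.concat([pending, chunk]);
  if (!synced) {
    const offset = findSync(pending);   // -1 if no confirmed header yet
    if (offset < 0) return;             // keep buffering
    pending = pending.subarray(offset); // discard the partial leading frame
    synced = true;
  }
  decoder.write(pending);               // hand aligned data to the DSP/decoder class
  pending = Buffer.alloc(0);
});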

Related

Why can I sometimes concatenate audio data using NodeJS Buffers, and sometimes I cannot?

As part of a project I am working on, there is a requirement to concatenate multiple pieces of audio data into one large audio file. The audio files are generated from four sources, and the individual files are stored in a Google Cloud storage bucket. Each file is an mp3 file, and it is easy to verify that each individual file is generated correctly (individually, I can play them, edit them in my favourite software, etc.).
To merge the audio files together, a nodejs server loads the files from the Google Cloud storage as an array buffer using an axios POST request. From there, it puts each array buffer into a node Buffer using Buffer.from(), so now we have an array of Buffer objects. Then it uses Buffer.concat() to concatenate the Buffer objects into one big Buffer, which we then convert to Base64 data and send to the client server.
This is cool, but the issue arises when concatenating audio generated from different sources. The four sources I mentioned above are text-to-speech platforms; specifically, we have files from Google Cloud Voice, Amazon Polly, IBM Watson, and Microsoft Azure Text to Speech. Again, all individual files work, but when concatenating them together via this method there are some interesting effects.
When the sound files are concatenated, seemingly depending on which platform they originate from, the sound data either will or will not be included in the final sound file. Below is a 'compatibility' table based on my testing:
| Platform / | Google | Amazon | Microsoft | IBM |
|------------|--------|--------|-----------|-----|
| Google     | Yes    | No     | No        | No  |
| Amazon     |        | No     | No        | Yes |
| Microsoft  |        |        | Yes       | No  |
| IBM        |        |        |           | Yes |
The effect is as follows: when I play the large output file, it always starts playing the first sound file included. From there, if the next sound file is compatible it is heard; otherwise it is skipped entirely (no silence or anything). If a file was skipped, its 'length' (for example, a 10 s long audio file) is still included at the end of the generated output file. However, the moment the audio player passes the point where the last 'compatible' audio has played, it immediately skips to the end.
As a scenario:
Input:
sound1.mp3 (3s) -> Google
sound2.mp3 (5s) -> Amazon
sound3.mp3 (7s) -> Google
sound4.mp3 (11s) -> IBM
Output:
output.mp3 (26s) -> first 10s is sound1 and sound3, last 16s is skipped.
In this case, the output sound file would be 26s seconds long. For the first 10 seconds, you would hear the sound1.mp3 and sound3.mp3 played back to back. Then at 10s (at least playing this mp3 file in firefox) the player immediately skips to the end at 26s.
My question is: Does anyone have any ideas why sometimes I can concatenate audio data in this way, and other times I cannot? And how come there is this 'missing' data included at the end of the output file? Shouldn't concatenating the binary data work in all cases if it works for some cases, as all the files have mp3 encoding? If I am wrong please let me know what I can do to successfully concatenate any mp3 files :)
I can provide my nodeJS backend code, but the process and methods used are described above.
Thanks for reading!
Potential Sources of Problems
Sample Rate
44.1 kHz is often used for music, as it's what is used on CD audio. 48 kHz is usually used for video, as it's what was used on DVDs. Both of those sample rates are much higher than is required for speech, so it's likely that your various text-to-speech providers are outputting something different. 22.05 kHz (half of 44.1 kHz) is common, and 11.025 kHz is out there too.
While each frame specifies its own sample rate, making it possible to generate a stream with varying sample rates, I've never seen a decoder attempt to switch sample rates mid-stream. I suspect that the decoder is skipping these frames, or maybe even skipping over an arbitrary block until it gets consistent data again.
Use something like FFmpeg (or FFprobe) to figure out what the sample rates of your files are:
ffmpeg -i sound2.mp3
You'll get an output like this:
Duration: 00:13:50.22, start: 0.011995, bitrate: 192 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 192 kb/s
In this example, 44.1 kHz is the sample rate.
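If you'd rather check this programmatically than read the banner output, ffprobe can print just the fields you care about (standard ffprobe options; the file name is a placeholder):
ffprobe -v error -select_streams a:0 \
  -show_entries stream=codec_name,sample_rate,channels \
  -of default=noprint_wrappers=1 sound2.mp3
This prints something like codec_name=mp3, sample_rate=44100 and channels=2, one per line, which is easy to parse from a script.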
Channel Count
I'd expect your voice MP3s to be in mono, but it wouldn't hurt to check to be sure. As above, check the output of FFmpeg. In my example above, it says stereo.
As with sample rate, technically each frame could specify its own channel count but I don't know of any player that will pull off switching channel count mid-stream. Therefore, if you're concatenating, you need to make sure all the channel counts are the same.
ID3 Tags
It's common for there to be ID3 metadata at the beginning (ID3v2) and/or end (ID3v1) of the file. It's less expected to have this data mid-stream. You would want to make sure this metadata is all stripped out before concatenating.
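A rough sketch of stripping both kinds of tag from a Buffer before you concatenate (this handles the common layout only; it ignores rarities like ID3v2 footers or tags appended mid-file):
// Strip a leading ID3v2 tag and a trailing ID3v1 tag from an MP3 buffer.
function stripId3(buf) {
  let start = 0;
  let end = buf.length;

  // ID3v2: "ID3" + version (2 bytes) + flags (1 byte) + 4-byte syncsafe size (7 bits per byte).
  if (buf.length >= 10 && buf.toString('latin1', 0, 3) === 'ID3') {
    const size = (buf[6] << 21) | (buf[7] << 14) | (buf[8] << 7) | buf[9];
    start = 10 + size;
  }

  // ID3v1: fixed 128-byte block at the very end, starting with "TAG".
  if (end - start >= 128 && buf.toString('latin1', end - 128, end - 125) === 'TAG') {
    end -= 128;
  }

  return buf.subarray(start, end);
}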
MP3 Bit Reservoir
MP3 frames don't necessarily stand alone. If you have a constant bitrate stream, the encoder may still use less data to encode one frame, and more data to encode another. When this happens, some frames contain data for other frames. That way, frames that could benefit from the extra bandwidth can get it while still fitting the whole stream within a constant bitrate. This is the "bit reservoir".
If you cut a stream and splice in another stream, you may split up a frame and its dependent frames. This typically causes an audio glitch, but may also cause the decoder to skip ahead. Some badly behaving decoders will just stop playing altogether. In your example, you're not cutting anything so this probably isn't the source of your trouble... but I mention it here because it's definitely relevant to the way you're working these streams.
See also: http://wiki.hydrogenaud.io/index.php?title=Bit_reservoir
Solutions
Pick a "normal" format, resample and rencode non-conforming files
If most of your sources are all the exact same format and only one or two outstanding, you could convert the non-conforming file. From there, strip ID3 tags from everything and concatenate away.
To do the conversion, I'd recommend kicking it over to FFmpeg as a child process.
const { spawn } = require('child_process');

const ffmpeg = spawn('ffmpeg', [
  // Input
  '-i', inputFile, // Use '-' to read from a pipe on STDIN instead
  // Set sample rate
  '-ar', '44100',
  // Set audio channel count
  '-ac', '1',
  // Audio bitrate... try to match the others, but this is not as critical
  '-b:a', '64k',
  // Ensure we output an MP3
  '-f', 'mp3',
  // Output
  outputFile // As with input, use '-' to write to STDOUT instead
]);
Best Solution: Let FFmpeg (or similar) do the work for you
The simplest, most robust solution to all of this is to let FFmpeg build a brand new stream for you. This will cause your audio files to be decoded to PCM, and a new stream made. You can add parameters to resample those inputs, and modify channel counts if needed. Then output one stream. Use the concat filter.
This way, you can accept audio files of any type, you don't have to write the code to hack those streams together, and once it's set up you won't have to worry about it.
The only downside is that it will require a re-encoding of everything, meaning another generation of quality lost. This would be required for any non-conforming files anyway, and it's just speech, so I wouldn't give it a second thought.
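For reference, a command-line sketch of that approach, normalizing each input and then joining them with the concat filter (file names, sample rate, channel count, and bitrate are placeholders):
ffmpeg -i sound1.mp3 -i sound2.mp3 -i sound3.mp3 -i sound4.mp3 \
  -filter_complex "[0:a]aresample=44100,aformat=channel_layouts=mono[a0];[1:a]aresample=44100,aformat=channel_layouts=mono[a1];[2:a]aresample=44100,aformat=channel_layouts=mono[a2];[3:a]aresample=44100,aformat=channel_layouts=mono[a3];[a0][a1][a2][a3]concat=n=4:v=0:a=1[out]" \
  -map "[out]" -b:a 64k output.mp3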
@Brad's answer was the solution! The first solution he suggested worked. It took some messing around to get FFmpeg working correctly, but in the end using the fluent-ffmpeg library did the job.
Each file in my case was stored in Google Cloud Storage, not on the server's hard drive. This posed some problems for FFmpeg, as it requires file paths when given multiple inputs, or a single input stream (only one is supported, since there is only one STDIN).
One solution is to put the files on the hard drive temporarily, but this would not work for our use case, as this function may see heavy use and writing to disk adds latency.
So, instead we did as suggested and loaded each file into ffmpeg to convert it into a standardized format. This was a bit tricky, but in the end requesting each file as a stream, using that stream as an input for ffmpeg, then using fluent-ffmpeg's pipe() method (which returns a stream) as output worked.
We then bound an event listener to the 'data' event for this pipe, and pushed the data to an array (bufs.push(data)), and on stream 'end' we concatenated this array using Buffer.concat(bufs), followed by a promise resolve.
Then once all request promises were resolved, we could be sure ffmpeg had processed each file, and those buffers were concatenated in the required groups as before using Buffer.concat(), converted to base64 data, and sent to the client.
This works great, and now it seems to be able to handle every combination of files/sources I can throw at it!
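A condensed sketch of that pipeline, assuming fluent-ffmpeg and readable streams obtained from the storage requests described above (function names are just for the example):
const ffmpeg = require('fluent-ffmpeg');

// Re-encode one input stream to a normalized MP3 and collect it into a Buffer.
function normalizeToMp3Buffer(inputStream) {
  return new Promise((resolve, reject) => {
    const bufs = [];
    const output = ffmpeg(inputStream)
      .audioFrequency(44100)   // common sample rate
      .audioChannels(1)        // common channel count
      .audioBitrate('64k')
      .format('mp3')
      .on('error', reject)
      .pipe();                 // returns a PassThrough stream

    output.on('data', (chunk) => bufs.push(chunk));
    output.on('end', () => resolve(Buffer.concat(bufs)));
  });
}

// Normalize every file, then concatenate the results and return base64.
async function buildCombinedMp3(fileStreams) {
  const normalized = await Promise.all(fileStreams.map(normalizeToMp3Buffer));
  return Buffer.concat(normalized).toString('base64');
}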
In conclusion:
The answer to the question was that the MP3 data must have been encoded differently (different channel counts, sample rates, etc.), and loading it through ffmpeg and outputting it in a 'unified' way made the MP3 data compatible.
The solution was to process each file in ffmpeg separately, pipe the ffmpeg output into a buffer, then concatenate the buffers.
Thanks @Brad for your suggestions and detailed answer!

Is it possible to splice advertisements or messages dynamically into an MP3 file via a standard GET request?

Say you have an MP3 file and it's 60,000,000 bytes, and you also have an MP3 advertisement that's 500,000 bytes, both encoded at the same bit rate.
Would it be possible using an nginx or apache module to change the MP3 "Content-Length" header value to 60,500,000 and then control the incoming "Content-Range" requests so the first 500,000 bytes return the advertisement audio, and any range request greater than 500,000 begins returning the regular audio file with a 500,000 byte offset?
Or is it only possible to splice advertisements (or messages) into an MP3 file using an application such as FFmpeg to re-render the entire file?
Apologies if this is a stupid question, I'm just trying to think outside of the box.
You cannot arbitrarily splice MP3 without artifacts and decoder errors.
You also generally cannot cut/splice MP3 on frame boundaries due to the bit reservoir. Basically, a particular MP3 frame may contain data from another frame so the available bandwidth can be used more efficiently when it's needed. Ignoring the bit reservoir can also cause artifacts and/or decoder errors.
What you can do is re-encode around your advertisement and eventually re-join the stream. That is, at the point of ad insertion, decode the stream to PCM, mix in (or replace the audio with) your ad, and re-encode this parallel stream back to MP3. If the encoding parameters are the same, eventually (after a couple of extra MP3 frames) you'll have identical bitstreams, and you can go back to reading the stream from the same buffer.
If you're doing this for ad insertion on internet radio (live) streams, keep in mind that you'll have to do this on the server for every client (or at least for each ad variant and timing variant). If this is for podcasts or other pre-recorded content, I'd recommend the FFmpeg route. You won't have to build anything, you can stream and cache the output as it's being encoded, and you'll have compatibility with other codecs without building one-off code for each codec/container.
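For the pre-recorded (podcast) case, a rough FFmpeg command-line sketch that decodes, splices an ad in at a chosen offset (300 seconds here), and re-encodes everything; file names, offset, and bitrate are placeholders:
ffmpeg -i episode.mp3 -i ad.mp3 \
  -filter_complex "[0:a]asplit=2[s1][s2];[s1]atrim=end=300,asetpts=PTS-STARTPTS[pre];[s2]atrim=start=300,asetpts=PTS-STARTPTS[post];[pre][1:a][post]concat=n=3:v=0:a=1[out]" \
  -map "[out]" -b:a 128k output.mp3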

How do Shoutcast servers and clients deal with mp3 frame headers and frame dependencies?

Short story:
If I intend to receive, process, and then re-send a Shoutcast-compatible audio stream from my own application, how do I do it properly using an MP3 encoder/decoder library? Pseudo code, or better yet LAME-specific code, would be highly appreciated.
Long story:
The more specific questions that bother me were prompted by an article about MP3, which says:
Generally, frames are independent items. Each frame has its own header and audio informations. There is no file header. Therefore, you can cut any part of MPEG file and play it correctly (this should be done on frame boundaries but most applications will handle incorrect headers). For Layer III, this is not 100% correct. Due to internal data organization in MPEG version 1 Layer III files, frames are often dependent of each other and they cannot be cut off just like that.
This made me wonder, how Shoutcast servers and clients deal with frame headers and frame dependencies.
Do I have to encode to constant bitrate (CBR) only, if I want to achieve maximum compatibility with the most of Shoutcast players out there?
Is the MP3 frame header used at all, or is the stream format deduced from a Shoutcast-protocol-specific HTTP header?
Does the Shoutcast protocol guarantee (or is it common good practice) to start serving the MP3 stream on a frame boundary and to keep responding with chunks that are cut at frame boundaries? And what is the minimum or recommended frame size for streaming live audio?
How does Shoutcast deal with frame dependencies - does it do something special with the MP3 encoding to ensure that the served stream does not have frames which depend on previous frames (if that is even possible)? Or does it ignore these dependencies on the server/client side, accepting audio quality reduction or even artifacts?
SHOUTcast servers do not know or care about the data being passed through them. They send it as-is. You can actually send arbitrary data through a SHOUTcast server, and receive it. SHOUTcast will segment the media data wherever the buffer size falls.
It's up to the client to re-sync to the data. It does this by locating a frame header and then beginning to decode. Once the codec has enough frames to reliably play back audio, it will begin outputting raw PCM. It's up to the codec to decide when it's safe to start playback. Since the codec knows what it's doing in terms of decoding the media, it knows when it has sufficient data (including bit reservoir data) to begin without artifacts. It's also worth noting that the bit reservoir cannot be carried forward very far, so at worst it only takes a few frames to handle.
This is one of the reasons it's important to have a sizable buffer server-side, to flush to the clients as fast as possible on connect. If playback is to start quickly, the codec needs more data than the current frame to begin.

A way to add data "mid stream" to encoded audio (possibly with AAC)

Is there a way to add lossless data to an AAC audio stream?
Essentially I am looking to inject "this frame of audio should be played at XXX time" markers every n frames.
If I used a lossless codec, I suppose I could just inject my own header mid-stream and the data would come out intact, since it has to be the same on the way out, just as gzip does not lose data.
Any ideas? I suppose I could encode the data into chunks of AAC on the server and, at the network layer, add a timestamp saying "play the following chunk of AAC at time x", but I'd prefer to find a way to add it to the audio itself.
This is not really possible (short of writing your own specialized encoder), as AAC (and MP3) frames are not truly standalone.
There is a concept of the bit reservoir, where unused bandwidth from one frame can be utilized for a later frame that may need more bandwidth to store a more complicated sound. That is, data from frame 1 might be needed in frame 2 and/or 3. If you cut the stream between frames 1 and 2 and insert your alternative frames, the reference to the bit reservoir data is broken and you have damaged frame 2's ability to be decoded.
There are encoders that can work in a mode where the bit reservoir isn't used (at the cost of quality). If operating in this mode, you should be able to cut the stream more freely along frame boundaries.
Unfortunately, the best way to handle this is to do it in the time domain when dealing with your raw PCM samples. This gives you more control over the timing placement anyway, and ensures that your stream can also be used with other codecs.

Differences in WAV format (JS / NodeJS)

I'm trying to use WebRTC to record audio and then store it on the server side. My server is made using NodeJS and express, and I'm using POST to transmit the data from the client to the server.
On the client I'm converting the WAV blob to base64 and transferring that; on the server side I read it, convert it back to binary, and then write it to a file. Should be fine, right?
There's just one problem: I'm getting some really bad inconsistencies between what you can download from the client and what gets sent to the server. Sometimes bytes are added, other times chunks of data are deleted. If it were only added bytes, that would suggest a charset problem (converting from one encoding to another, and then another, etc.), but at some points I had 280 bytes added, for example.
I've added a picture here of a hex diff :
http://i.stack.imgur.com/psqf4.png (sorry, I don't have enough reputation so far to post an image directly)
Also, running file with these gives me the following :
(uuid.wav is the server one, while output (1).wav is the client one)
9F2B75D3-4C34-4C8F-935E-FC7637D7A054.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 4 bit, stereo 11321924 Hz
output (1).wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz
... so clearly something is going wrong here. Also, trying to fix the headers or convert the WAV gives me an error along the lines of: could not find data chunk / data chunk has size 0.
Any ideas what might be causing this?
This looks suspiciously like some layer of code is attempting to convert the binary data to Unicode. 0x44 0xAC (which is 0xAC44 in little endian, which is 44100, which indicates the 44.1 kHz sample rate) is turning into 0x44 0xC2 0xAC. This gets byte-swapped to 0x00ACC244, which is 11321924 Hz, which reconciles with what you saw in the corrupted file.
Those 0xC2 additions definitely look like Unicode (UTF-8) artifacts. I don't know exactly which data types and functions you are using, but you will need to audit the steps to make sure none of them attempt to do implicit Unicode conversions.
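As a concrete illustration of keeping the bytes binary end to end, here is a minimal sketch of the server side (assuming an Express route and a JSON body carrying the base64 string; route and field names are placeholders):
const fs = require('fs');
const express = require('express');

const app = express();
app.use(express.json({ limit: '25mb' }));   // base64 payloads are ~33% larger than the raw WAV

app.post('/upload', (req, res) => {
  // Decode straight to binary; never call .toString() on the audio bytes before writing.
  const wavBuffer = Buffer.from(req.body.audio, 'base64');
  fs.writeFileSync('recording.wav', wavBuffer);   // write the Buffer itself, not a string
  res.sendStatus(204);
});

app.listen(3000);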
