AAC-LC format and RTP - audio

I'm trying to encode AAC-LC data packed in 3GPP to RTP. I've gone through RFC 3640, but I still don't know exactly where to start. What will I find in the AAC data exactly? If I'm not wrong, the first 40 bytes will be the MP4 header, but what comes afterwards and where can I find its definition? In order to build the RTP payload, I have to include the AU headers section, but I don't know if it is already included in the AAC data, and I can't find it anywhere.
Once I have taken out the MP4 header, I have the following data:
00 00 14 03 E9 1C 00 00 14 03 E9 1C
Is this the AU header? How do I interpret this data?
Another question: what is the relation between AAC-LC and AAC-lbr? I mean, I know the first stands for low complexity and the second for low bit rate, but are they the same? Does one include the other?
Thanks in advance, I'm really new to AAC and I'm quite lost!

I'm trying to do the opposite, i.e. decode an RTP AAC stream, so some of the references I've found so far might be useful to you:
http://www.rfc-editor.org/rfc/rfc3016.txt
This describes the RTP structure. What I've found in reading my stream is that there is also a framing header around the RTP packets, 2 bytes for the length:
https://www.rfc-editor.org/rfc/rfc4571
On top of that I have found an extra framing of 2 bytes, 0x24 0x00. I still have no idea what that is all about (possibly RTSP interleaved framing: a '$' byte, a channel ID, then a 2-byte length, per RFC 2326 section 10.12), but I thought I'd let you know you may need to recreate that as well.
Sadly it seems like a lot of the interesting specs are 'pay to view'. Although I did find some useful info from this blog:
http://thompsonng.blogspot.com/2010/03/rfc-3640-for-aac.html
Regarding your other question, I have AAC-hbr, which is apparently also AAC-LC, although once again I haven't found publicly available specs for this.
Your AU header looks a bit similar to what I've got:
0x00 0x00 0x01 0xB6 0x55 0x60 0x63 0xFF 0xFF 0x7A 0x7D 0xD5 0xF7 0xB7 0xA7 0xDF
Although I was expecting the first 16 bits to be a length for the headers, so like yourself, I'm not quite sure what I'm looking at...
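For what it's worth, RFC 3640's aac-hbr mode (mpeg4-generic with sizeLength=13 and indexLength=3 in the SDP) defines the AU header section as a 16-bit AU-headers-length field, counted in bits, followed by a 13-bit AU-size plus 3-bit AU-index per access unit. A minimal sketch, assuming a single AU per RTP packet (the helper name is mine, not from the RFC):

// Hedged sketch: RFC 3640 aac-hbr AU header section for one AAC access unit.
// AU-headers-length is in BITS; one header of 13-bit size + 3-bit index = 16.
static byte[] buildAuHeaderSection(int auSizeBytes) {
    byte[] section = new byte[4];
    section[0] = 0x00;                          // AU-headers-length, high byte
    section[1] = 0x10;                          // AU-headers-length = 16 bits
    int header = (auSizeBytes & 0x1FFF) << 3;   // 13-bit AU-size, 3-bit AU-index = 0
    section[2] = (byte) (header >> 8);
    section[3] = (byte) header;
    return section;  // prepend to the raw AAC frame to form the RTP payload
}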
Anyway I hope some of that was helpful.

Related

Sending image directly to Epson Projector, trouble decoding jpeg image

I have an Epson Ex5220 that doesn't have a Linux driver, and I have been trying to work out its communication over WiFi. I can connect and send images captured through packet traces from a Windows machine with a driver, but I cannot create an acceptable image myself. Here is where the problem lies:
In the data send, a JPEG image is sent with a header attached, like this:
00:00:00:01:00:00:00:00:02:70:01:a0:00:00:00:07:90:80:85:00
00:00:00:04 - Number of jpeg images being sent (only the first header)
00:00 - X offset
00:00 - Y offset
02:70 - Width of Jpeg image (624 in this case)
01:a0 - Height of Jpeg image (416 in this case)
00:00:00:07:90 - Unknown (I believe it's a version number perhaps)
80:85:00 - (What I'm after) Some count of data units?
Following the header is a normal JPEG image. If I strip off that header, I can view the image. Here is a screenshot of a partial capture with the 3 bytes highlighted (not reproduced here).
I have found what seems to be a baseline by setting those last three bytes to 80:85:00. Anything less and the image will not project. Also, the smallest image size I can send to the projector is 3w x 1h, which correlates with my first two images shown below.
Here are some examples:
1a - All white (RGB565) image 1024x768 - filesize 12915 - 4 blocks
2a - Color (RGB565) image 1024x768 - filesize 58577 - only 3 blocks
I then changed the 3 bytes to 00:b5:80 (incremented the middle one by 0x30)
1b - All white (RGB565) image 1024x768 - filesize 12915 - 22 full rows and 4 blocks.
2b - Color (RGB565) image 1024x768 - filesize 58577 - 7 rows and 22 blocks.
So it seems that the 3 bytes have something to do with data units. I've read lots of stuff about JPEG and am still digesting much of it, but I think if I knew what was required to calculate data units, I'd find my mystery 3 bytes.
ADDITIONAL INFO:
The projector only supports the use of RGB565 jpeg images inside the data send.
(Posted on behalf of the OP).
I was able to solve this, but I would like to know why this works. Here is the formula for those last 3 bytes:
int iSize = bImage.length;                   // JPEG payload size in bytes
baHeader[17] = (byte) (iSize | 0x80);        // low 7 bits, high bit forced on
baHeader[18] = (byte) ((iSize >> 7) | 0x80); // next 7 bits, high bit forced on
baHeader[19] = (byte) (iSize >> 14);         // remaining high bits
I got fed up with messing with it, so I just looked at several images, wrote down all the file sizes and the magic bytes, converted everything to binary, and hammered away at ANDing, ORing, and bit-shifting until I forced a formula that worked. I would like to know if this is related to calculating JPEG data units. I'm still researching JPEG, but it's not simple stuff!
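For what it's worth, the formula above amounts to a base-128 split of the JPEG byte count: 7 bits per byte, least significant group first, with the high bit forced on in the first two bytes. Whether Epson thinks of it that way is unknown; here is a hedged sketch of the inverse (the names are illustrative):

// Hedged sketch: inverse of the encoding above. Recovers the JPEG size from
// header bytes 17..19 for sizes below 2^22.
static int decodeSizeField(byte b17, byte b18, byte b19) {
    return (b17 & 0x7F)
         | (b18 & 0x7F) << 7
         | (b19 & 0xFF) << 14;
}
// e.g. 0x80 0x85 0x00 -> 0 | (5 << 7) | 0 = 640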
It looks like you're misinterpreting how the SOS marker works. Here are the bytes you show in one of your examples:
SOS = 00 0C 03 01 00 02 11 03 11 00 3F 00 F9 FE
This erroneously has two bytes of compressed data (F9 FE) included in the SOS. The length of 12 (00 0C) includes the 2 length bytes themselves, so there are really only 10 bytes of data for this marker.
The 00 byte before the F9 FE is the "successive approximation" bits field and is used for progressive JPEG images. It's actually a pair of 4-bit fields.
The bytes that you see as varying between images are really the first 2 compressed data bytes (which encode the DC value for the first MCU).
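To make the length rule concrete, here is a hedged sketch that walks a JPEG's marker segments (an illustrative helper, not code from the question); it shows why the F9 FE bytes fall outside the SOS segment:

// Hedged sketch: list JPEG marker segments. Each table/header segment carries
// a big-endian length that counts the two length bytes themselves, so the
// payload is (length - 2) bytes. Entropy-coded data follows SOS unframed.
static void listMarkers(byte[] jpeg) {
    int pos = 2;                                  // skip SOI (FF D8)
    while (pos + 4 <= jpeg.length) {
        int marker = ((jpeg[pos] & 0xFF) << 8) | (jpeg[pos + 1] & 0xFF);
        int length = ((jpeg[pos + 2] & 0xFF) << 8) | (jpeg[pos + 3] & 0xFF);
        System.out.printf("marker %04X, %d payload bytes%n", marker, length - 2);
        if (marker == 0xFFDA) break;              // compressed data follows SOS
        pos += 2 + length;                        // marker bytes + length field
    }
}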

How to find resolution and framerate values in H.264 MPEG-2 TS?

I'm working on an MPEG-2 TS video containing an H.264 stream, and I'm looking for the video properties stored in the stream by scanning the PAT, PMT, PES, etc.
I'm able to read the PAT, the PMT, and the elementary stream types and PIDs. Now I would like to find the resolution and the framerate (fps). Are they located in the PES header, or elsewhere? They are not in the PAT or PMT.
Below, Transport Stream Packet Editor is able to find two different pieces of information, one by itself and the other from the Haali Media Decoder helper codec (screenshot not reproduced here). How do I get the first one?
Pseudo-code is welcomed.
I am not sure about the availability of the height/width information in the MPEG2-TS headers themselves, because a TS file can carry multiple programs. But if you are targeting only TS files made of an H.264 elementary stream, then you can get this information from the SPS of the H.264 elementary stream.
Every H.264 NAL unit starts with a three- or four-byte start code, 0x00 0x00 0x01 or 0x00 0x00 0x00 0x01. The NAL unit is an SPS if ANDing the byte after the start code with 0x1F yields 0x07.
E.g. for an SPS NAL unit 0x00 0x00 0x00 0x01 0x67 ..., the AND operation gives (0x67 & 0x1F) = 0x07.
Parsing the SPS header is also not an easy task, but you can find the details in the FFmpeg source code.
Hope this helps.
No, they are not present in the PES header. To find the resolution and frame rate of H.264 video in MPEG2-TS, you need to parse the SPS (sequence parameter set) from the H.264 stream.
These are the steps for parsing H.264 NAL (network abstraction layer) units:
Parse the NAL unit prefix (a 3-byte (0x00,0x00,0x01) or 4-byte (0x00,0x00,0x00,0x01) code), then the header (the byte following the prefix).
Check the type of the NAL unit (the last 5 bits) from the header byte.
If the NAL unit is of type 7, it is an SPS NAL unit; parse it.
This ITU link gives the documentation for the H.264 standard.
See section 7.3.2.1.1, "Sequence parameter set data syntax", for the syntax used to find the parameters in the SPS.
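As a rough illustration of the scanning steps above (a sketch only: it locates SPS NAL units in an Annex B stream and leaves the Exp-Golomb parsing of width/height to the reader):

// Hedged sketch: scan an H.264 Annex B elementary stream for start codes and
// report SPS NAL units (nal_unit_type == 7). A 4-byte start code contains the
// 3-byte one, so searching for 00 00 01 finds both forms.
static void findSps(byte[] es) {
    for (int i = 0; i + 3 < es.length; i++) {
        if (es[i] == 0 && es[i + 1] == 0 && es[i + 2] == 1) {
            int nalType = es[i + 3] & 0x1F;       // low 5 bits of the NAL header
            if (nalType == 7) {
                System.out.println("SPS NAL unit at offset " + (i + 3));
            }
        }
    }
}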
I presume working code for this resides inside the ffprobe binary from the FFmpeg project, as it produces the desired output:
$ ffprobe -v quiet -show_streams output1.mp4
[STREAM]
index=0
codec_name=h264
... // A bunch of stream data
width=1280
height=1024
sample_aspect_ratio=1:1
display_aspect_ratio=5:4
....
r_frame_rate=30000/1001
avg_frame_rate=30000/1001
time_base=1/30000
...
[/STREAM]
The information you are looking for is inside the H.264 SPS NAL units.
You need to parse the PES data, extract the NALUs and then parse the SPS data. There you'll find the resolution. If SPS carries VUI information you have information about the desired frame rate.
MPEG2-TS is a transport stream, it transports something but does not carry detailed information about what it carries. It just wraps stuff.
What you could use from MPEG2-TS is PTS/DTS of the PES header and average the frame rate from the presentation time stamps provided.
To do it properly, parse the PES header, parse the NALU headers, parse the actual SPS NAL unit and if present the VUI it contains.
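A minimal sketch of that last route, assuming a PES header whose PTS_DTS_flags indicate a PTS is present (byte offsets per the MPEG-2 systems layout); averaging successive PTS deltas against the 90 kHz clock then estimates the frame rate:

// Hedged sketch: extract the 33-bit PTS from the start of a PES packet.
// pes[0..2] = 00 00 01, pes[3] = stream_id, pes[4..5] = PES_packet_length.
static long readPts(byte[] pes) {
    if ((pes[7] & 0x80) == 0) return -1;          // PTS_DTS_flags: no PTS present
    return ((long) (pes[9]  & 0x0E)) << 29        // '0010' + PTS[32..30] + marker
         | ((long) (pes[10] & 0xFF)) << 22        // PTS[29..22]
         | ((long) (pes[11] & 0xFE)) << 14        // PTS[21..15] + marker
         | ((long) (pes[12] & 0xFF)) << 7         // PTS[14..7]
         | ((long) (pes[13] & 0xFE)) >> 1;        // PTS[6..0] + marker
}
// fps estimate: 90000.0 / average PTS delta between successive video frames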

Understanding the ZMODEM protocol

I need to include basic file-sending and file-receiving routines in my program, and it needs to be through the ZMODEM protocol. The problem is that I'm having trouble understanding the spec.
For reference, here is the specification.
The spec doesn't define the various constants, so here's a header file from Google.
It seems to me like there are a lot of important things left undefined in that document:
It constantly refers to ZDLE-encoding, but what is it? When exactly do I use it, and when don't I use it?
After a ZFILE data frame, the file's metadata (filename, modify date, size, etc.) are transferred. This is followed by a ZCRCW block and then a block whose type is undefined according to the spec. The ZCRCW block allegedly contains a 16-bit CRC, but the spec doesn't define on what data the CRC is computed.
It doesn't define the CRC polynomial it uses. I found out by chance that the CRC32 poly is the standard CRC32, but I've had no such luck with the CRC16 poly. Never mind: I found it through trial and error. The CRC16 poly is 0x1021.
I've looked around for reference code, but all I can find are unreadable, undocumented C files from the early 90s. I've also found this set of documents from MSDN, but it's painfully vague and contradicted by tests that I've run: http://msdn.microsoft.com/en-us/library/ms817878.aspx (you may need to view it through Google's cache)
To illustrate my difficulties, here is a simple example. I've created a plaintext file on the server containing "Hello world!", and it's called helloworld.txt.
I initiate the transfer from the server with the following command:
sx --zmodem helloworld.txt
This prompts the server to send the following ZRQINIT frame:
2A 2A 18 42 30 30 30 30 30 30 30 30 30 30 30 30 **.B000000000000
30 30 0D 8A 11 00.Š.
Three issues with this:
Are the padding bytes (0x2A) arbitrary? Why are there two here, but in other instances there's only one, and sometimes none?
The spec doesn't mention the [CR] [LF] [XON] at the end, but the MSDN article does. Why is it there?
Why does the [LF] have bit 0x80 set?
After this, the client needs to send a ZRINIT frame. I got this from the MSDN article:
2A 2A 18 42 30 31 30 30 30 30 30 30 32 33 62 65 **.B0100000023be
35 30 0D 8A 50.Š
In addition to the [LF] 0x80 flag issue, I have two more issues:
Why isn't [XON] included this time?
Is the CRC calculated on the binary data or the ASCII hex data? If it's on the binary data I get 0x197C, and if it's on the ASCII hex data I get 0xF775; neither of these is what's actually in the frame (0xBE50). (Solved: it follows whichever mode you're using. In BIN or BIN32 mode, it's the CRC of the binary data; in ASCII hex mode, it's the CRC of the bytes represented by the ASCII hex characters.)
The server responds with a ZFILE frame:
2A 18 43 04 00 00 00 00 DD 51 A2 33 *.C.....ÝQ¢3
OK. This one makes sense. If I calculate the CRC32 of [04 00 00 00 00], I indeed get 0x33A251DD. But now we don't have ANY [CR] [LF] [XON] at the end. Why is this?
Immediately after this frame, the server also sends the file's metadata:
68 65 6C 6C 6F 77 6F 72 6C 64 2E 74 78 74 00 31 helloworld.txt.1
33 20 32 34 30 20 31 30 30 36 34 34 20 30 20 31 3 240 100644 0 1
20 31 33 00 18 6B 18 50 D3 0F F1 11 13..k.PÓ.ñ.
This doesn't even have a header, it just jumps straight to the data. OK, I can live with that. However:
We have our first mysterious ZCRCW frame: [18 6B]. How long is this frame? Where is the CRC data, and is it CRC16 or CRC32? It's not defined anywhere in the spec.
The MSDN article specifies that the [18 6B] should be followed by [00], but it isn't.
Then we have a frame with an undefined type: [18 50 D3 0F F1 11]. Is this a separate frame or is it part of ZCRCW?
The client needs to respond with a ZRPOS frame, again taken from the MSDN article:
2A 2A 18 42 30 39 30 30 30 30 30 30 30 30 61 38 **.B0900000000a8
37 63 0D 8A 7c.Š
Same issues as with the ZRINIT frame: the CRC is wrong, the [LF] has bit 0x80 set, and there's no [XON].
The server responds with a ZDATA frame:
2A 18 43 0A 00 00 00 00 BC EF 92 8C *.C.....¼ï’Œ
Same issues as ZFILE: the CRC is all fine, but where's the [CR] [LF] [XON]?
After this, the server sends the file's payload. Since this is a short example, it fits in one block (max size is 1024):
48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0A Hello world!.
From what the article seems to mention, payloads are escaped with [ZDLE]. So how do I transmit a payload byte that happens to match the value of [ZDLE]? Are there any other values like this?
The server ends with these frames:
18 68 05 DE 02 18 D0 .h.Þ..Ð
2A 18 43 0B 0D 00 00 00 D1 1E 98 43 *.C.....Ñ.˜C
I'm completely lost on the first one. The second makes as much sense as the ZRINIT and ZDATA frames.
My buddy wonders if you are implementing a time machine.
I don't know that I can answer all of your questions -- I've never actually had to implement zmodem myself -- but here are a few answers:
From what the article seems to mention, payloads are escaped with [ZDLE]. So how do I transmit a payload byte that happens to match the value of [ZDLE]? Are there any other values like this?
This is explicitly addressed in the document you linked to at the beginning of your questions, which says:
The ZDLE character is special. ZDLE represents a control sequence of some sort. If a ZDLE character appears in binary data, it is prefixed with ZDLE, then sent as ZDLEE.
It constantly refers to ZDLE-encoding, but what is it? When exactly do I use it, and when don't I use it?
In the Old Days, certain "control characters" were used to control the communication channel (hence the name). For example, sending XON/XOFF characters might pause the transmission. ZDLE is used to escape characters that may be problematic. According to the spec, these are the characters that are escaped by default:
ZMODEM software escapes ZDLE, 020, 0220, 021, 0221, 023, and 0223. If preceded by 0100 or 0300 (@), 015 and 0215 are also escaped to protect the Telenet command escape CR-@-CR. The receiver ignores 021, 0221, 023, and 0223 characters in the data stream.
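A hedged sketch of that rule (the helper names are mine; the conditional escaping of CR after '@' is omitted):

// Hedged sketch: ZDLE-escape binary data. Escaped bytes are sent as ZDLE
// (0x18) followed by the byte XOR 0x40, so ZDLE itself goes out as 18 58.
static final int ZDLE = 0x18;

static boolean needsEscape(int b) {
    int low7 = b & 0x7F;          // catches the 8-bit twins 0x90, 0x91, 0x93
    return b == ZDLE || low7 == 0x10 || low7 == 0x11 || low7 == 0x13;
}

static void writeEscaped(java.io.OutputStream out, byte[] data) throws java.io.IOException {
    for (byte d : data) {
        int b = d & 0xFF;
        if (needsEscape(b)) {
            out.write(ZDLE);
            out.write(b ^ 0x40);
        } else {
            out.write(b);
        }
    }
}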
I've looked around for reference code, but all I can find are unreadable, undocumented C files from the early 90s.
Does this include the code for the lrzsz package? This is still widely available on most Linux distributions (and surprisingly handy for transmitting files over an established ssh connection).
There are a number of other implementations out there, including several in software listed on freecode, including qodem, syncterm, MBSE, and others. I believe the syncterm implementation is written as a library that may be reasonably easy to use from your own code (but I'm not certain).
You may find additional code if you poke around older collections of MS-DOS software.
I can't blame you. The user manual is not organized in a user-friendly way.
Are the padding bytes (0x2A) arbitrary?
No, from page 14,15:
A binary header begins with the sequence ZPAD, ZDLE, ZBIN.
A hex header begins with the sequence ZPAD, ZPAD, ZDLE, ZHEX.
The spec doesn't mention the [CR] [LF] [XON] at the end, but the MSDN article does. Why is it there?
Page 15
* * ZDLE B TYPE F3/P0 F2/P1 F1/P2 F0/P3 CRC-1 CRC-2 CR LF XON
Why does the [LF] have bit 0x80 set?
Not sure. From Tera Term I got both control characters XORed with 0x80 (8D 8A 11).
We have our first mysterious ZCRCW frame: [18 6B]. How long is this frame? Where is the CRC data, and is it CRC16 or CRC32? It's not defined anywhere in the spec.
The ZCRCW is not a header or a frame type, it's more like a footer that tells the receiver what to expect next. In this case it's the footer of the data subpacket containing the file name. It's going to be a 32 bit checksum because you're using a "C" type binary header.
ZDLE C TYPE F3/P0 F2/P1 F1/P2 F0/P3 CRC-1 CRC-2 CRC-3 CRC-4
Then we have a frame with an undefined type: [18 50 D3 0F F1 11]. Is this a separate frame or is it part of ZCRCW?
That's the CRC for the ZCRCW data subpacket. It's 5 bytes because the first CRC byte is 0x10, a control character that needs to be ZDLE-escaped. The trailing 0x11 is XON, which, per the spec text quoted earlier, the receiver ignores in the data stream.
and there's no [XON].
XON is just for Hex headers. You don't use it for a binary header.
ZDLE A TYPE F3/P0 F2/P1 F1/P2 F0/P3 CRC-1 CRC-2
So how do I transmit a payload byte that happens to match the value of [ZDLE]?
0x18 0x58, i.e. ZDLE followed by ZDLE XOR 0x40 (AKA ZDLEE).
18 68 05 DE 02 18 D0
That's the footer of the data subpacket. The next 5 bytes are the CRC (the last byte is ZDLE-encoded).
The ZDLE + ZBIN (0x18 0x41) means the frame is CRC-CCITT(XMODEM 16) with Binary Data.
ZDLE + ZHEX (0x18 0x42) means CRC-CCITT(XMODEM 16) with HEX data.
The HEX data is tricky, since at first some people don't understand it. Every two ASCII bytes represent one byte in binary. For example, 0x30 0x45 0x41 0x32 means 0x30 0x45 and 0x41 0x32, or in ASCII '0' 'E' 'A' '2', i.e. 0x0E and 0xA2. The two nibbles of each binary byte are expanded into an ASCII representation. I found in several dataloggers that some devices use lower case to represent A-F (a-f) in hex; it doesn't matter, but on those you will not find 0x30 0x45 0x41 0x32 ("0EA2") but 0x30 0x65 0x61 0x32 ("0ea2"). It doesn't change a thing, it's just a little confusing at first.
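A hedged sketch of that nibble expansion in reverse (the helper name is mine; Character.digit accepts both cases):

// Hedged sketch: turn the ASCII hex body of a ZHEX header back into binary.
static byte[] decodeHexBody(byte[] ascii) {
    byte[] out = new byte[ascii.length / 2];
    for (int i = 0; i < out.length; i++) {
        int hi = Character.digit(ascii[2 * i], 16);
        int lo = Character.digit(ascii[2 * i + 1], 16);
        out[i] = (byte) ((hi << 4) | lo);
    }
    return out;
}
// "0EA2" or "0ea2" -> { 0x0E, 0xA2 }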
And yes, the CRC16 for ZBIN or ZHEX is CRC-CCITT(XMODEM).
The ZDLE ZBIN32 (0x18 0x43) or ZDLE ZBINR32 (0x18 0x44) use CRC-32 calculation.
Note that the ZDLE and the format byte that follows it are excluded from the CRC calculation.
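For reference, a small sketch of that CRC with the parameters stated above (polynomial 0x1021, initial value 0x0000, no reflection, no final XOR). For a binary header you would feed it the type byte and the four flag/position bytes:

// Hedged sketch of CRC-CCITT (XMODEM) as used by ZBIN/ZHEX frames.
static int crc16Xmodem(byte[] data) {
    int crc = 0x0000;
    for (byte d : data) {
        crc ^= (d & 0xFF) << 8;
        for (int bit = 0; bit < 8; bit++) {
            crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) & 0xFFFF
                                        : (crc << 1) & 0xFFFF;
        }
    }
    return crc;
}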
I am digging into ZMODEM since I need to "talk" to some elevator door boards, to program a new set of parameters at once instead of using their software to change attribute by attribute. This "talk" could happen on the bench instead of sitting on the elevator car roof with a notebook. Those boards talk ZMODEM, but as I don't have the packet format they expect, the board keeps rejecting my file transfer. The boards send 0x2A 0x2A 0x18 0x42 0x30 0x31 0x30 (x6) + CRC; Tera Term, transferring the file over ZMODEM, sends to the board 0x2A 0x2A 0x18 0x42 0x30 0x30 ... + CRC. I don't know what this 00 or 01 after the 0x42 means. The PC sends this, then the filename and some file attributes. The board, after a few seconds, answers with "No file received"...

Live audio streaming container formats

When I start receiving a live audio (radio) stream (e.g. MP3 or AAC), I think the received data are not a raw bitstream (i.e. raw encoder output), but are always wrapped in some container format. If this assumption is correct, then I guess I cannot start decoding from an arbitrary place in the stream, but have to wait for some sync byte. Is that right? Is it usual to have sync bytes? Is there any header following the sync byte from which I can determine the codec used, number of channels, sample rate, etc.?
When I connect to a live stream, will I receive data starting at the nearest sync byte, or will I get them from the actual position and have to check for the sync byte first?
Some streams like Icecast use headers in the HTTP response, where stream-related information is included, but I think I can skip them and deal directly with the stream format.
Is that correct?
Regards,
STeN
When you look at SHOUTcast/Icecast, the data that comes across is pure MPEG Layer III audio data, and nothing more. (Provided you haven't requested metadata.)
It can be cut at an arbitrary place, so you need to sync to the stream. This is usually done by finding a potential header, and using the data in that header to find sequential headers. Once you have found a few frame headers, you can safely assume you have synced up to the stream and start decoding for playback.
Again, there is no "container format" for these. It's just raw data.
Now, if you want metadata, you have to request it from the server. The data is then just injected into the stream every x number of bytes. See http://www.smackfu.com/stuff/programming/shoutcast.html.
Doom9 has great starting info about both MPEG and AAC frame formats. Shoutcast will add some 'metadata' now and then, and it's really trivial. The thing I want to share with you is this: I have an application that can capture all kinds of streams, and Shoutcast, both AAC and MP3, is among them. The first versions cut their files at arbitrary points according to time, for example every 5 minutes, regardless of the MP3/AAC frames. It was somewhat OK for MP3 (the files were playable) but very bad for aacplus.
The thing is, the aacplus decoder ISN'T that forgiving about wrong data, and I had everything from access violations to mysterious software shutdowns with no errors of any kind.
Anyway, if you want to capture a stream: open the socket to the server and read the response; you'll have some header there. Then use that info to strip the metadata that will be injected now and then. Use the header information for both aacplus and MP3 to determine frame boundaries, and try to honor them and split the file at the right place.
mp3 frame header:
http://www.mp3-tech.org/programmer/frame_header.html
aacplus frame header:
http://wiki.multimedia.cx/index.php?title=Understanding_AAC
also this:
aacplus frame alignment problems
Unfortunately it's not always that easy, check the format and notes here:
MPEG frame header format
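Pulling the header fields together, here is a hedged sketch of validating a sync word and computing the frame length so capture files can be split on frame boundaries (Layer III only, MPEG-2.5 left out; the tables are transcribed from the mp3-tech.org page linked above):

// Hedged sketch: b1 and b2 are the two bytes after the 0xFF sync byte.
// Returns the frame length in bytes, or -1 for an invalid Layer III header.
static final int[] BITRATE_V1_L3 = {0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320};
static final int[] BITRATE_V2_L3 = {0, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160};
static final int[] SAMPLERATE_V1 = {44100, 48000, 32000};
static final int[] SAMPLERATE_V2 = {22050, 24000, 16000};

static int frameLength(int b1, int b2) {
    if ((b1 & 0xE0) != 0xE0) return -1;           // remaining three sync bits
    int version = (b1 >> 3) & 0x03;               // 11 = MPEG-1, 10 = MPEG-2
    if (version < 2) return -1;                    // MPEG-2.5/reserved not handled
    if (((b1 >> 1) & 0x03) != 0x01) return -1;    // 01 = Layer III
    int bitrateIdx = (b2 >> 4) & 0x0F;
    int srIdx = (b2 >> 2) & 0x03;
    if (bitrateIdx == 0 || bitrateIdx == 15 || srIdx == 3) return -1;
    boolean mpeg1 = (version == 3);
    int bitrate = (mpeg1 ? BITRATE_V1_L3 : BITRATE_V2_L3)[bitrateIdx] * 1000;
    int sampleRate = (mpeg1 ? SAMPLERATE_V1 : SAMPLERATE_V2)[srIdx];
    int padding = (b2 >> 1) & 0x01;
    int samplesPerFrame = mpeg1 ? 1152 : 576;     // Layer III
    return samplesPerFrame / 8 * bitrate / sampleRate + padding;
}

A splitter can scan for 0xFF, call this on the next two bytes, and only trust the sync point if another valid header shows up exactly one frame length later.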
I will continue the discussion by answering myself (even though we are discouraged from doing that):
I was also looking into the streamed data and I have found that the sequence ff f3 82 70 is frequently repeated. This, I suggest, is the MPEG frame header, so I tried to work out what it means:
ff f3 82 70 (hex) = 11111111 11110011 10000010 01110000 (bin)
Analysis
11111111111 | SYNC
10 | MPEG version 2
01 | Layer III
1 | No CRC
1000 | 64 kbps
00 | 22050Hz
1 | Padding
0 | Private
01 | Joint stereo
11 | Mode extension
0 | Not copyrighted
0 | Copy of original media
00 | No emphasis
Any comments to that?
When starting to receive the streaming data, should I discard all data prior to this header before giving the buffer to the class which deals with the DSP? I know this can be implementation specific, but I would like to know what the usual procedure is here...
BR
STeN

Decoding sniffed packets

I understand that each packet has a header that looks like a random mix of chars. On the other hand, the content itself can be pure ASCII and therefore human-readable. Some of the packets I sniffed were readable (raw HTML headers for sure). But some packets looked like this:
0000 00 15 af 51 68 b2 00 e0 98 be cf d6 08 00 45 00 ...Qh... ......E.
0010 05 dc 90 39 40 00 2e 06 99 72 08 13 f0 49 c0 a8 ...9#... .r...I..
0020 64 6b 00 50 c1 32 02 7a 60 4f 4c b6 45 62 50 10 dk.P.2.z `OL.EbP.
That was just a part; these packets are usually longer. My question is: how can I decode the packet content/data? Do I need the whole stream? Is the decoding simple, or can every application encode its data slightly differently, to keep these packets secure?
Edit:
I don't care about the header, Wireshark shows that. However, that's totally worthless info. I want to decode the data/content.
The content of a packet is defined by the process sending it. Think of it like a telephone call. What's said is dependent on who is calling and who they are talking to. You have to study the programs that construct it to determine how to "decode" it. There are some sniffers that will parse some commonly used methods of encoding and try to do this already.
Why not just use something like wireshark?
Packet contents will depend on the application sending the packet in question, as mentioned in an earlier post. You can also use Wireshark's protocol reference for understanding some of the common protocols.
What you have listed here is the Packet Bytes view; what you need is the Packet Details view to understand what the seemingly random data corresponds to. In the Packet Details view, when you select various parts of the packet, it will highlight the corresponding bytes in the Packet Bytes view.
If you're using C#, grab SharpPcap and look at the examples in code to get a feel for how it works.
Set the filter to only capture UDP, capture a packet, parse it to udp, and extract the payload. The payload's format is based on the application sending it.
There's a lot of extra gibberish because every UDP packet carries a stack of headers:
Ethernet header
IP header
UDP header
before your data, and all incoming data is in binary format until you parse it into something meaningful.
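If you want to see the layering without a library, here is a hedged sketch that peels those three headers off a raw IPv4 frame (assumes no VLAN tag and no IP fragmentation; offsets follow the standard header layouts):

// Hedged sketch: return the UDP payload of a captured Ethernet frame,
// or null if it isn't a plain IPv4/UDP packet.
static byte[] udpPayload(byte[] frame) {
    if (frame.length < 42) return null;                  // Ethernet + IPv4 + UDP minimum
    int etherType = ((frame[12] & 0xFF) << 8) | (frame[13] & 0xFF);
    if (etherType != 0x0800) return null;                // not IPv4
    int ip = 14;                                         // Ethernet header is 14 bytes
    int ihl = (frame[ip] & 0x0F) * 4;                    // IP header length in bytes
    if ((frame[ip + 9] & 0xFF) != 17) return null;       // IP protocol 17 = UDP
    int udp = ip + ihl;
    int udpLen = ((frame[udp + 4] & 0xFF) << 8) | (frame[udp + 5] & 0xFF);
    return java.util.Arrays.copyOfRange(frame, udp + 8, udp + udpLen);
}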
