Wireshark mean jitter rtp analysis - voip

I would like to know how Wireshark calculates the mean jitter. Shouldn't it just be the sum of all the jitter values divided by the number of received packets? I have a stream (with packet loss), and when I run Wireshark's RTP analysis, export the analysis for this stream, sum all the jitter values, and divide by the number of received packets, I get a smaller mean jitter than Wireshark reports.

After some research I found the file that implements Wireshark's RTP analysis, mentioned in this link (https://wiki.wireshark.org/RTP_statistics). The file is called tap-rtp-common.c and can be found at https://github.com/giuliano108/wireshark-rtpmon/blob/master/tap-rtp-common.c. It turns out that the mean jitter is not the mean of the per-packet jitter values, but rather the sum of all the diffs (between timestamps and arrival times) over the total number of frames.
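For reference, the per-packet jitter that tap-rtp-common.c tracks follows the estimator from RFC 3550, section 6.4.1. Below is a minimal sketch (my own illustration, not Wireshark's actual code) of both the smoothed per-packet jitter and the mean of the absolute diffs described above, assuming arrival times and RTP timestamps have already been converted to the same clock units:

```python
# Sketch of the RFC 3550 (section 6.4.1) interarrival jitter estimator.
# arrivals[i] and rtp_ts[i] are assumed to be in the same clock units.

def jitter_stats(arrivals, rtp_ts):
    """Return (per-packet smoothed jitter values, mean of absolute diffs)."""
    j = 0.0
    jitters = []
    diffs = []
    for i in range(1, len(arrivals)):
        # D(i-1, i): change in transit time between consecutive packets
        d = (arrivals[i] - arrivals[i - 1]) - (rtp_ts[i] - rtp_ts[i - 1])
        diffs.append(abs(d))
        # Smoothed estimator from the RFC: J += (|D| - J) / 16
        j += (abs(d) - j) / 16.0
        jitters.append(j)
    mean_of_diffs = sum(diffs) / len(diffs) if diffs else 0.0
    return jitters, mean_of_diffs
```

Averaging the raw |D| values gives a noticeably different number than averaging the smoothed jitter values, which would explain the discrepancy in the question.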

Related

About definition for terms of audio codec

When I was studying the Cocoa Audio Queue documentation, I came across several audio codec terms. They are defined in a structure named AudioStreamBasicDescription.
Here are the terms:
1. Sample rate
2. Packet
3. Frame
4. Channel
I know about sample rate and channel, but I was confused by the other two. What do the other two terms mean?
Also, you can answer this question by example. For example, I have a dual-channel PCM-16 source with a sample rate of 44.1 kHz, which means there are 2 × 44100 = 88200 samples of PCM data per second. But how about packet and frame?
Thank you in advance!
You are already familiar with the sample rate definition.
The sampling frequency or sampling rate, fs, is defined as the number of samples obtained in one second (samples per second), thus fs = 1/T.
So for a sampling rate of 44100 Hz, you have 44100 samples per second (per audio channel).
The number of frames per second in video is a similar concept to the number of samples per second in audio. Frames for our eyes, samples for our ears. Additional info here.
If you have 16-bit stereo PCM, it means you have 16 × 44100 × 2 = 1411200 bits per second => ~172 kB per second => around 10 MB per minute.
Now to the definitions, reworded from Apple's terms:
Sample: a single number representing the value of one audio channel at one point in time.
Frame: a group of one or more samples, with one sample for each channel, representing the audio on all channels at a single point in time.
Packet: a group of one or more frames, representing the audio format's smallest encoding unit, and the audio for all channels across a short amount of time.
As you can see there is a subtle difference between audio and video frame notions. In one second you have for stereo audio at 44.1 kHz: 88200 samples and thus 44100 frames.
Compressed formats like MP3 and AAC pack multiple frames into packets (these packets can then be written into an MP4 file, for example, where they can be efficiently interleaved with video content). As you might expect, dealing with larger packets helps the encoder identify bit patterns for better coding efficiency.
MP3, for example, uses packets of 1152 frames, which are the basic atomic unit of an MP3 stream. PCM audio is just a series of samples, so it can be divided down to the individual frame, and it really has no packet size at all.
For AAC you can have 1024 (or 960) frames per packet. This is described in the Apple document you pointed at:
The number of frames in a packet of audio data. For uncompressed audio, the value is 1. For variable bit-rate formats, the value is a larger fixed number, such as 1024 for AAC. For formats with a variable number of frames per packet, such as Ogg Vorbis, set this field to 0.
In MPEG-based file formats a packet is referred to as a data frame (not to be confused with the previous audio frame notion). See Brad's comment for more information on the subject.
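The packet sizes above translate directly into durations: a packet lasts frames_per_packet / sample_rate seconds. A small sketch using the figures quoted in this answer (1152 frames per MP3 packet, 1024 per AAC packet):

```python
# Duration of one codec packet = frames_per_packet / sample_rate.
# Frame counts taken from the answer above: MP3 = 1152, AAC = 1024.

def packet_duration_ms(frames_per_packet, sample_rate_hz):
    """Wall-clock duration of one packet, in milliseconds."""
    return 1000.0 * frames_per_packet / sample_rate_hz

# MP3 at 44.1 kHz: 1152 / 44100 ≈ 26.12 ms per packet
print(round(packet_duration_ms(1152, 44100), 2))   # 26.12
# AAC at 44.1 kHz: 1024 / 44100 ≈ 23.22 ms per packet
print(round(packet_duration_ms(1024, 44100), 2))   # 23.22
```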

How to prevent data throttling with audio codec streaming

I am sampling an incoming audio stream at 8 ksps. I have a codec that takes ~1.6 ms to encode a packet of data (80 samples) into an encoded packet (5 samples). At this rate I get 8000*1.662e-3 ~= 13 samples every encoding cycle, but I need 80 samples every cycle. How do I keep the stream continuous? My only guess is to slow down the bitrate of the outgoing encoded stream, but I'm not sure how to calculate this in general such that buffers on the incoming side don't fill up and the receiving side's buffers don't get starved.
This seems like a basic tenet of streaming but I can't find any info on methods. Thanks for any help!
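For what it's worth, the arithmetic in the question can be laid out the other way around: at 8 ksps an 80-sample packet represents 10 ms of audio, while encoding it takes only ~1.6 ms, so an input buffer that accumulates 80 samples before each encode keeps the stream continuous. A sketch of that accounting (illustrative numbers from the question, not a full streaming implementation):

```python
sample_rate = 8000          # samples per second
samples_per_packet = 80
encode_time_s = 1.6e-3      # time to encode one packet

# Real-time duration covered by one packet of input audio
packet_duration_s = samples_per_packet / sample_rate   # 0.010 s

# Samples that arrive while one packet is being encoded
arrived_during_encode = sample_rate * encode_time_s    # ~12.8

# The encoder keeps up with real time as long as it finishes
# each packet before the next packet's worth of samples arrives.
assert encode_time_s < packet_duration_s
print(packet_duration_s, arrived_during_encode)
```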

AAC RTP timestamps and synchronization

I am currently streaming audio (AAC-HBR at 8 kHz) and video (H.264) using RTP. Both feeds work fine individually, but when put together they get out of sync pretty fast (less than 15 seconds).
I am not sure how to increment the timestamp in the audio RTP header. I thought it should be the time difference between two RTP packets (around 127 ms) or a constant increment of 1/8000 (0.125 ms), but neither worked. Instead I managed to find a sweet spot: when I increment the timestamp by 935 for each packet, it stays synchronized for about a minute.
The AAC frame size is 1024 samples, so try incrementing the timestamp by 1024 per frame, which corresponds to (1/8000) × 1024 = 128 ms of audio. Or a multiple of that, in case your packet contains multiple AAC frames.
Does that help?
A bit late, but I thought I'd put up my answer.
The timestamp increment on an audio RTP packet == the number of audio samples contained in the RTP packet.
For AAC, each frame consists of 1024 samples, so the timestamp on the RTP packet should increase by 1024.
The difference in clock time between 2 RTP packets = (1/8000) × 1024 = 128 ms, i.e. the sender should send the RTP packets 128 ms apart.
Bit more information from other sampling rates:
Now, AAC sampled at 44100 Hz means 44100 samples of signal in 1 second.
So 1024 samples correspond to (1000 ms / 44100) × 1024 ≈ 23.22 ms.
So the timestamp increment between 2 RTP packets is still 1024, but
the difference in clock time between 2 RTP packets in the RTP session should be ≈23.22 ms.
Trying to correlate with other example:
For example, for the G.711 family (PCM, PCMU, PCMA), the sampling frequency = 8 kHz.
So a 20 ms packet should contain 8000/50 == 160 samples,
and hence RTP timestamps are incremented by 160.
The difference of clock time between 2 RTP packets should be 20ms.
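The rule in this answer (timestamp increment = samples per packet; wall-clock gap = increment / sample rate) can be sketched as:

```python
def rtp_increment_and_gap_ms(samples_per_packet, sample_rate_hz):
    """RTP timestamp increment and the wall-clock gap between packets (ms)."""
    return samples_per_packet, 1000.0 * samples_per_packet / sample_rate_hz

# AAC, 1024-sample frames at 8 kHz: increment 1024, gap 128 ms
print(rtp_increment_and_gap_ms(1024, 8000))    # (1024, 128.0)
# AAC at 44.1 kHz: increment 1024, gap ~23.22 ms
print(rtp_increment_and_gap_ms(1024, 44100))
# G.711 (PCMU/PCMA), 20 ms packets at 8 kHz: increment 160, gap 20 ms
print(rtp_increment_and_gap_ms(160, 8000))     # (160, 20.0)
```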
IMHO audio and video de-sync on Android is difficult to fight if the streams are captured by different media recorders. They just capture different start frames, and there seems to be no way to find out how big the de-sync is and adjust for it with audio or video timestamps on the fly.

How to calculate effective time offset in RTP

I have to calculate the time offset between packets in RTP streams. With a video stream encoded with the Theora codec I have timestamp fields like
2856000
2940000
3024000
...
So I assume that the transmission offset is 84000. With the Speex audio codec I have timestamp fields like
38080
38400
38720
...
So I assume that the transmission offset is 320. Why are the values so different? Are they microseconds, milliseconds, or what? Can I generalize a formula to calculate the delay between packets in microseconds that works with any codec? Thank you.
RTP timestamps are media dependent. They use the sampling rate of the codec in use. You have to convert them to milliseconds before comparing them with your clock or with timestamps from other RTP streams.
Added:
To convert the timestamp to seconds, just divide the timestamp by the sample rate. For most audio codecs, the sample rate is 8 kHz.
See here for a few examples.
Note that video codecs typically use 90000 for the timestamp rate.
Instead of guessing at the clock rate, look at the a=rtpmap line in the SDP for the payload in use. Example:
m=audio 5678 RTP/AVP 0 8 99
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:99 AAC-LD/16000
If the payload is 0 or 8, timestamps are 8 kHz. If it's 99, they're 16 kHz. Note that the rtpmap line has an optional 'channels' parameter, as in "a=rtpmap:<payload> <name>/<rate>[/<channels>]".
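A minimal sketch of pulling the clock rate out of rtpmap attributes and using it to convert a timestamp difference into milliseconds (assuming standard "a=rtpmap:<payload> <name>/<rate>" syntax; the SDP lines below are illustrative):

```python
def clock_rates_from_sdp(sdp_lines):
    """Map payload type -> clock rate from a=rtpmap lines.
    Expected format: a=rtpmap:<payload> <name>/<rate>[/<channels>]"""
    rates = {}
    for line in sdp_lines:
        if not line.startswith("a=rtpmap:"):
            continue
        payload, codec = line[len("a=rtpmap:"):].split(" ", 1)
        rates[int(payload)] = int(codec.split("/")[1])
    return rates

sdp = [
    "m=audio 5678 RTP/AVP 0 8 99",
    "a=rtpmap:0 PCMU/8000",
    "a=rtpmap:8 PCMA/8000",
    "a=rtpmap:99 AAC-LD/16000",
]
rates = clock_rates_from_sdp(sdp)

# Inter-packet delay in ms = timestamp difference / clock rate.
# e.g. the Speex offset of 320 from the question, at an 8 kHz clock:
delta_ts = 320
print(1000.0 * delta_ts / rates[0])   # 40.0
```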
Been researching this question for about an hour for the case of audio. Seems like the answer is: the RTP timestamp is incremented by the number of audio time units (samples) in a packet. Take this example where you have a stream of encoded, 2 channel audio, sampled at 44100 before the audio was encoded. Say that you send 512 audio samples (256 time units because we have 2 channel audio) for every packet. Assuming the first packet has a timestamp of 0 (it should be random though according to the RTP spec (RFC 3550)), the second timestamp would be 256, and the third 512. The receiver can convert the value back to an actual time by dividing the timestamp by the audio sample rate, so the first packet would be T0, the second equals 256/44100=0.0058 seconds, the third equals 512/44100=0.0116 seconds, etc.
Someone please correct me if I'm wrong, I'm not sure why there aren't any articles online that state it this way. I guess it would be more complicated if the resolution of the RTP timestamp is different than the sample rate of the audio stream. Nevertheless, converting the timestamp to a different resolution is not complicated. Use the example as before, but change the resolution of the RTP timestamp to 90 kHz, as in MPEG4 Audio (RFC 3016). From the source side the first timestamp is 0, the second is 90000*(256/44100)=522, and the third is 1044. And on the receiver, the time is 0 for first packet, 522/90000=0.0058 for the second, and 1044/90000=0.0116 for the third. Again, someone please correct me if I'm wrong.
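The resolution conversion described above can be written out directly (same numbers as the example: 256 frames per packet, 44.1 kHz audio, 90 kHz RTP clock):

```python
audio_rate = 44100        # audio frame rate before encoding
rtp_rate = 90000          # RTP timestamp clock, as in RFC 3016
frames_per_packet = 256

# Timestamp increment when the RTP clock differs from the audio rate
increment = round(rtp_rate * frames_per_packet / audio_rate)
print(increment)          # 522

# The receiver converts back to seconds by dividing by the RTP clock rate:
# third packet's timestamp is 2 * increment
print(round(increment * 2 / rtp_rate, 4))   # 0.0116
```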

RTP packet combining

I have a bunch of RTP packets that I'd like to re-assemble into an audio stream. For each packet, I have the sequence number, SSRC, timestamp, and a byte array representing the data itself.
Currently I'm taking each subset of packets by their SSRC, then ordering them by timestamp and combining the byte arrays in that order. Afterwards, I'm mixing the byte arrays. The resulting audio data sounds great (by great, I mean everything is in time), but I'm worried that it's due to not having much packet loss.
So, a couple questions...
For missing packets, a missing sequence number shows where I need to add a bit of empty audio. I believe the sequence number "wraps around" quite often, so I need to use the timestamp to break the packets up into subsets. Then I can look for missing sequence numbers in those subsets and fill in as needed. Does that sound like the right thing to do?
I haven't quite figured out what else the timestamp is good for. Since I'm recording already existing packets and filling in the missing ones, maybe I don't need to worry about this as much?
1) Avoid using timestamps in your algorithm. It will fail if you receive a stream from a badly behaved client (improper timestamps), and the timestamp increment changes with codec type, in which case you would need different subsets for different codecs. There are no such limitations on the sequence number: sequence numbers increment monotonically, so you can track lost packets easily with them.
2) The timestamp is used for synchronization between audio and video, mainly for lip sync: a relationship between audio and video timestamps is established to achieve synchronization. In your case it's only audio, so you can avoid using the timestamp.
Edit: According to RFC 3389 (Real-time Transport Protocol (RTP) Payload for Comfort Noise (CN))
RTP allows discontinuous transmission (silence suppression) on any
audio payload format. The receiver can detect silence suppression
on the first packet received after the silence by observing that
the RTP timestamp is not contiguous with the end of the interval
covered by the previous packet even though the RTP sequence number
has incremented only by one. The RTP marker bit is also normally
set on such a packet.
1) I don't think the sequence number "wraps around" quickly. It is a 16-bit value, so it wraps every 65536 packets, and even if a packet is sent every 10 milliseconds that gives more than 10 minutes of transmission. It is very unlikely that packets will be lost for that long, so in my opinion you should only check the sequence number; checking the timestamp is pointless.
2) I think you shouldn't worry much about the timestamp. I know that some implementations don't even fill in this value and rely only on the sequence number.
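One wraparound-safe way to compare 16-bit sequence numbers (a sketch along the lines of the handling suggested in RFC 3550's appendix; the helper names are my own):

```python
def seq_delta(prev, cur):
    """Signed distance from prev to cur on the 16-bit sequence ring."""
    return ((cur - prev + 0x8000) & 0xFFFF) - 0x8000

def find_lost(seqs):
    """Count packets missing between consecutive received packets."""
    lost = 0
    for prev, cur in zip(seqs, seqs[1:]):
        gap = seq_delta(prev, cur)
        if gap > 1:
            lost += gap - 1
    return lost

# The gap spans the 65535 -> 0 wraparound, yet only one packet is missing
print(find_lost([65533, 65534, 0, 1]))   # 1  (65535 was lost)
```

Because seq_delta treats the numbers as points on a ring, a jump from 65534 to 0 is correctly seen as a gap of 2, not a huge backwards jump.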
I think what Zulijn is getting at in his answer above is that if your packets are stored in the order they were captured, then you can use some simple rules to find out-of-order packets - e.g. look back 50 packets and forward 50 packets, and if a packet is not there then it counts as lost.
This should avoid any issues with the sequence number having wrapped around. To handle lost packets there are many techniques you can use, so it would be useful to google 'audio packet loss' or 'VoIP packet loss concealment'. As Adam mentions, the timestamp increment will vary with the codec, so you need to understand this if you are going to use it.
You don't mention what the actual application is, but if you are trying to understand what the received audio actually sounded like, you really need some more info, in particular the jitter buffer size - this effectively determines how long the receiver will wait for an out-of-sequence packet before deciding it is lost. What this means for you is that there may be out-of-sequence packets in your file which a 'real world' receiver would have given up on and not played back - i.e. your reconstruction from the file may give a higher quality than the 'real time' experience.
If it is a two-way transmission, then delay is very important too (even if it is a constant delay and hence does not affect jitter and packet loss). This is the type of effect you used to get on some radio telephones and still do on some satellite phones (or VoIP phones), and it can significantly impact the user experience.
Finally, different codecs and clients may apply different techniques to correct lost packets, insert 'silent tones' for any gaps in the audio (e.g. pauses in conversation), suppress background noise etc.
To get a proper feel for the user experience you would have to try to 'replay' your captured packets as accurately as possible using the same codec, jitter buffer and any error correction/packet loss techniques the receiver used also.
