How to calculate effective time offset in RTP - voip

I have to calculate the time offset between packets in RTP streams. With a video stream encoded with the Theora codec, I have timestamp values like
2856000
2940000
3024000
...
So I assume that the transmission offset is 84000. With the Speex audio codec, I have timestamp values like
38080
38400
38720
...
So I assume that the transmission offset is 320. Why are the values so different? Are they microseconds, milliseconds, or what? Can I generalize a formula to calculate the delay between packets in microseconds that works with any codec? Thank you.

RTP timestamps are media dependent. They use the sampling rate of the codec in use. You have to convert them to milliseconds before comparing them with your clock or with timestamps from other RTP streams.
Added:
To convert the timestamp to seconds, just divide the timestamp by the sample rate. For most audio codecs, the sample rate is 8 kHz.
See here for a few examples.
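For example, here is a minimal sketch of that conversion. The clock rates used for the question's numbers are assumptions (Theora streams commonly use a 90 kHz clock, and the Speex delta would match 16 kHz wideband); the real values come from the SDP as described below.

#include <cstdint>
#include <iostream>

// Convert an RTP timestamp difference to milliseconds using the codec's
// clock rate (ticks per second).
double rtpDeltaToMs(uint32_t olderTs, uint32_t newerTs, uint32_t clockRate) {
    uint32_t delta = newerTs - olderTs;   // unsigned math handles 32-bit wrap-around
    return 1000.0 * delta / clockRate;
}

int main() {
    // Video values from the question, assuming a 90 kHz clock.
    std::cout << rtpDeltaToMs(2856000, 2940000, 90000) << " ms\n"; // ~933.3 ms
    // Speex values from the question, assuming 16 kHz wideband
    // (at 8 kHz narrowband the same delta would mean 40 ms).
    std::cout << rtpDeltaToMs(38080, 38400, 16000) << " ms\n";     // 20 ms
}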

Note that video codecs typically use 90000 for the timestamp rate.
Instead of guessing at the clock rate, look at the a=rtpmap line in the SDP for the payload in use. Example:
m=audio 5678 RTP/AVP 0 8 99
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:99 AAC-LD/16000
If the payload is 0 or 8, timestamps are 8 kHz. If it's 99, they're 16 kHz. Note that the rtpmap line has an optional 'channels' parameter, as in "a=rtpmap:<payload> <name>/<rate>[/<channels>]"
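A rough sketch of pulling the clock rate out of such a line (the parsing is deliberately simplified, and the dynamic-payload example line is just an illustration):

#include <iostream>
#include <string>

// Extract the clock rate from an SDP rtpmap attribute such as
// "a=rtpmap:99 AAC-LD/16000". Returns 0 if the line cannot be parsed.
unsigned parseRtpmapClockRate(const std::string& line) {
    std::string::size_type slash = line.find('/');
    if (slash == std::string::npos) return 0;
    // The clock rate runs from after the '/' to the next '/' (the optional
    // channels field) or to the end of the line.
    std::string::size_type end = line.find('/', slash + 1);
    std::string rate = line.substr(slash + 1,
        end == std::string::npos ? std::string::npos : end - slash - 1);
    return static_cast<unsigned>(std::stoul(rate));
}

int main() {
    std::cout << parseRtpmapClockRate("a=rtpmap:0 PCMU/8000") << "\n";     // 8000
    std::cout << parseRtpmapClockRate("a=rtpmap:99 AAC-LD/16000") << "\n"; // 16000
    std::cout << parseRtpmapClockRate("a=rtpmap:97 L16/44100/2") << "\n";  // 44100
}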

Been researching this question for about an hour for the case of audio. Seems like the answer is: the RTP timestamp is incremented by the number of audio time units (samples) in a packet. Take this example where you have a stream of encoded 2-channel audio, sampled at 44100 Hz before the audio was encoded. Say that you send 512 audio samples (256 time units, because we have 2-channel audio) in every packet. Assuming the first packet has a timestamp of 0 (it should be random, though, according to the RTP spec (RFC 3550)), the second timestamp would be 256, and the third 512. The receiver can convert the value back to an actual time by dividing the timestamp by the audio sample rate, so the first packet would be t=0, the second 256/44100 = 0.0058 seconds, the third 512/44100 = 0.0116 seconds, etc.
Someone please correct me if I'm wrong; I'm not sure why there aren't any articles online that state it this way. I guess it would be more complicated if the resolution of the RTP timestamp is different from the sample rate of the audio stream. Nevertheless, converting the timestamp to a different resolution is not complicated. Take the same example as before, but change the resolution of the RTP timestamp to 90 kHz, as in MPEG-4 audio (RFC 3016). On the source side the first timestamp is 0, the second is 90000 * (256/44100) = 522, and the third is 1044. And on the receiver, the time is 0 for the first packet, 522/90000 = 0.0058 for the second, and 1044/90000 = 0.0116 for the third. Again, someone please correct me if I'm wrong.
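A small sketch of the arithmetic in that example (the 44100 Hz sample rate and 256 frames per packet are simply the numbers used above):

#include <iostream>

int main() {
    // From the example above: audio sampled at 44100 Hz,
    // 512 samples per packet across 2 channels = 256 frames per packet.
    const double audioRate = 44100.0;
    const double framesPerPacket = 256.0;

    // Case 1: the RTP timestamp clock equals the audio sample rate.
    std::cout << "increment: " << framesPerPacket
              << ", packet duration: " << framesPerPacket / audioRate
              << " s\n";                                        // 256, ~0.0058 s

    // Case 2: the RTP timestamp clock is 90 kHz (as in RFC 3016).
    const double rtpClock = 90000.0;
    std::cout << "90 kHz increment: "
              << rtpClock * framesPerPacket / audioRate << "\n"; // ~522
}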

Related

how to deal with processing time delay of audio codec while streaming over RTP

In section 2.1 of the Speex codec manual it says:
Every speech codec introduces a delay in the transmission. For Speex, this delay is equal to the frame size, plus some amount of "look-ahead" required to process each frame. In narrowband operation (8 kHz), the delay is 30 ms, while for wideband (16 kHz), the delay is 34 ms. These values don't account for the CPU time it takes to encode or decode the frames.
In RTP Payload Format for the Speex Codec, RFC 5574, it says:
ptime: SHOULD be a multiple of 20 msec
I have a 20 ms frame time of encoded data, so I assume my ptime should be 20.
The delay for the encoding is 30 ms or more. The time between RTP packets is 20 ms. How is this supposed to work? Is every other RTP payload an empty packet? How do I resolve this?
Seemingly this is an issue with every codec. I must be missing some fundamental understanding of how streaming works.
I have validated I can stream a pre-encoded buffer and it sounds as intended.
I have tried:
Creating a large queue in the beginning to compensate; however, this quickly becomes zero length.
Sending zero data as the payload
Ideas I haven't yet tried:
Send a packet of all padding and mark the RTP header as padding
Increase the sequence but not the timestamp until the next actual payload is ready (this sounds like it is against the spec?)
Note: I'm now wondering if the delay mentioned by Speex is within the encoded output, and the delay I am seeing while streaming is due to my limited CPU (embedded).
My note was correct. This question is flawed.
The Speex manual is referring to a delay in the audio output, not an inherent delay of processing time. Therefore the issue in question is not an issue.
I'm glad I asked the question, it helped me come to the solution.
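For what it's worth, a minimal sketch of the packet arithmetic discussed above, assuming Speex narrowband (8 kHz) with one 20 ms frame per packet; the codec's look-ahead delay does not change this pacing:

#include <iostream>

int main() {
    // Speex narrowband: 8 kHz sampling, 20 ms frames (assumed, per the thread).
    const int sampleRate = 8000;
    const int ptimeMs = 20;                              // one frame per packet

    int samplesPerPacket = sampleRate * ptimeMs / 1000;  // 160
    std::cout << "RTP timestamp increment per packet: " << samplesPerPacket << "\n";
    // Packets are sent every 20 ms regardless of the codec's ~30 ms
    // algorithmic (look-ahead) delay; that delay only shifts the output audio.
}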

About definition for terms of audio codec

When I was studying the Cocoa Audio Queue documentation, I met several terms in audio codecs. They are defined in a structure named AudioStreamBasicDescription.
Here are the terms:
1. Sample rate
2. Packet
3. Frame
4. Channel
I know about sample rate and channel, but I was confused by the other two. What do the other two terms mean?
You can also answer this question by example. For example, I have a dual-channel PCM-16 source with a sample rate of 44.1 kHz, which means there are 2 * 44100 = 88200 samples (176400 bytes) of PCM data per second. But what about packet and frame?
Thank you in advance!
You are already familiar with the sample rate definition.
The sampling frequency or sampling rate, fs, is defined as the number of samples obtained in one second (samples per second), thus fs = 1/T.
So for a sampling rate of 44100 Hz, you have 44100 samples per second (per audio channel).
The number of frames per second in video is a similar concept to the number of samples per second in audio. Frames for our eyes, samples for our ears. Additional info here.
If you have 16-bit stereo PCM, it means you have 16 * 44100 * 2 = 1411200 bits per second => ~172 kB per second => around 10 MB per minute.
Now to the definitions, in reworded terms from Apple:
Sample: a single number representing the value of one audio channel at one point in time.
Frame: a group of one or more samples, with one sample for each channel, representing the audio on all channels at a single point in time.
Packet: a group of one or more frames, representing the audio format's smallest encoding unit, and the audio for all channels across a short amount of time.
As you can see there is a subtle difference between audio and video frame notions. In one second you have for stereo audio at 44.1 kHz: 88200 samples and thus 44100 frames.
Compressed formats like MP3 and AAC pack multiple frames into packets (these packets can then be written into an MP4 file, for example, where they can be efficiently interleaved with video content). You can see that dealing with larger packets helps identify bit patterns for better coding efficiency.
MP3, for example, uses packets of 1152 frames, which are the basic atomic unit of an MP3 stream. PCM audio is just a series of samples, so it can be divided down to the individual frame, and it really has no packet size at all.
For AAC you can have 1024 (or 960) frames per packet. This is described in the Apple document you pointed at:
The number of frames in a packet of audio data. For uncompressed audio, the value is 1. For variable bit-rate formats, the value is a larger fixed number, such as 1024 for AAC. For formats with a variable number of frames per packet, such as Ogg Vorbis, set this field to 0.
In MPEG-based file formats, a packet is referred to as a data frame (not to be confused with the previous audio frame notion). See Brad's comment for more information on the subject.
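To put the sample/frame/packet arithmetic above into a small sketch (the values match the 44.1 kHz stereo 16-bit example, and the 1152-frame MP3 packet size quoted above):

#include <iostream>

int main() {
    // Stereo 16-bit PCM at 44.1 kHz, using the terminology above.
    const int sampleRate     = 44100;  // frames per second
    const int channels       = 2;
    const int bytesPerSample = 2;      // 16-bit

    int bytesPerFrame  = channels * bytesPerSample;     // 4
    int bytesPerSecond = sampleRate * bytesPerFrame;    // 176400
    std::cout << bytesPerSecond << " bytes/s, "
              << bytesPerSecond * 8 << " bits/s\n";     // 1411200 bits/s

    // Packets: MP3 groups 1152 frames per packet, AAC typically 1024.
    const int mp3FramesPerPacket = 1152;
    std::cout << "one MP3 packet covers "
              << 1000.0 * mp3FramesPerPacket / sampleRate << " ms\n"; // ~26.1 ms
}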

AAC RTP timestamps and synchronization

I am currently streaming audio (AAC-HBR at 8 kHz) and video (H264) using RTP. Both feeds work fine individually, but when put together they get out of sync pretty fast (less than 15 sec).
I am not sure how to increment the timestamp in the audio RTP header. I thought it should be the time difference between two RTP packets (around 127 ms) or a constant increment of 1/8000 (0.125 ms). But neither worked; instead I managed to find a sweet spot: when I increment the timestamp by 935 for each packet, it stays synchronized for about a minute.
The AAC frame size is 1024 samples. Try incrementing the timestamp by 1024 per packet, which corresponds to (1/8000) * 1024 = 128 ms, or by a multiple of that in case your packet contains multiple AAC frames.
Does that help?
A bit late, but I thought I'd put up my answer.
The timestamp increment between audio RTP packets == the number of audio samples contained in each RTP packet.
For AAC, each frame consists of 1024 samples, so the timestamp on the RTP packet should increase by 1024.
The difference in clock time between 2 RTP packets = (1/8000) * 1024 = 128 ms, i.e. the sender should send the RTP packets 128 ms apart.
A bit more information for other sampling rates:
Now, AAC sampled at 44100 Hz means 44100 samples of signal in 1 second.
So 1024 samples correspond to (1000 ms / 44100) * 1024 = 23.21995 ms.
So the timestamp difference between 2 RTP packets = 1024, but
the difference in clock time between 2 RTP packets in the RTP session should be 23.21995 ms.
Trying to correlate with another example:
For example, for the G.711 family (PCM, PCMU, PCMA), the sampling frequency = 8 kHz.
So a 20 ms packet should contain 8000/50 == 160 samples.
And hence RTP timestamps are incremented by 160.
The difference in clock time between 2 RTP packets should be 20 ms.
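A compact sketch tying these cases together (the frame sizes and rates are the ones mentioned above):

#include <iostream>

// For audio RTP, the timestamp advances by the number of sample frames in a
// packet; the wall-clock spacing is that count divided by the sample rate.
void report(const char* codec, int samplesPerPacket, int sampleRate) {
    std::cout << codec << ": timestamp += " << samplesPerPacket
              << ", packet interval = "
              << 1000.0 * samplesPerPacket / sampleRate << " ms\n";
}

int main() {
    report("AAC @ 8 kHz (1 frame)",    1024, 8000);   // += 1024, 128 ms
    report("AAC @ 44.1 kHz (1 frame)", 1024, 44100);  // += 1024, ~23.22 ms
    report("G.711 @ 8 kHz (20 ms)",    160,  8000);   // += 160, 20 ms
}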
IMHO video and audio de-sync in Android is difficult to fight if they are taken from different media recorders. They just capture different start frames, and there is (as it seems) no way to find out how big the de-sync is and adjust it with audio or video timestamps on the fly.

What do the bytes in a .wav file represent?

When I store the data in a .wav file into a byte array, what do these values mean?
I've read that they are in two-byte representations, but what exactly is contained in these two-byte values?
You will have heard that audio signals are represented by some kind of wave. If you have ever seen those wave diagrams with a line going up and down -- that's basically what's inside those files. Take a look at the picture at http://en.wikipedia.org/wiki/Sampling_rate
You see your audio wave (the gray line). The current value of that wave is repeatedly measured and given as a number. That's the numbers in those bytes. There are two different things that can be adjusted with this: The number of measurements you take per second (that's the sampling rate, given in Hz -- that's how many per second you grab). The other adjustment is how exact you measure. In the 2-byte case, you take two bytes for one measurement (that's values from -32768 to 32767 normally). So with those numbers given there, you can recreate the original wave (up to a limited quality, of course, but that's always so when storing stuff digitally). And recreating the original wave is what your speaker is trying to do on playback.
There are some more things you need to know. First, since it's two bytes, you need to know the byte order (big endian, little endian) to recreate the numbers correctly. Second, you need to know how many channels you have, and how they are stored. Typically you would have mono (one channel) or stereo (two), but more is possible. If you have more than one channel, you need to know how they are stored. Often you have them interleaved, which means you get one value for each channel for every point in time, and after that all values for the next point in time.
To illustrate: if you have 8 bytes of data for two channels and 16-bit numbers:
abcdefgh
Here a and b would make up the first 16-bit number, which is the first value for channel 1; c and d would be the first number for channel 2. e and f are the second value for channel 1, g and h the second value for channel 2. You wouldn't hear much there, because that would not come close to a second of data...
If you take all that information together, you can calculate the bit rate, that is, how many bits of information are generated by the recorder per second. In our example, you generate 2 bytes per channel on every sample. With two channels, that would be 4 bytes. You need about 44000 samples per second to represent the sounds a human being can normally hear. So you'll end up with 176000 bytes per second, which is 1408000 bits per second.
And of course, these are not 2-bit values but 2-byte values; otherwise you would have really bad quality.
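To make the interleaving concrete, here is a small sketch decoding eight such bytes as 16-bit little-endian stereo (the byte values are made up):

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // Eight raw bytes "abcdefgh" as in the example above: 16-bit little-endian
    // samples, two channels interleaved (L, R, L, R).
    std::vector<uint8_t> raw = {0x00, 0x40, 0x00, 0xC0, 0xFF, 0x3F, 0x01, 0xC0};

    for (std::size_t i = 0; i + 3 < raw.size(); i += 4) {
        // Little endian: low byte first, then high byte; reinterpret as signed.
        int16_t left  = static_cast<int16_t>(raw[i]     | (raw[i + 1] << 8));
        int16_t right = static_cast<int16_t>(raw[i + 2] | (raw[i + 3] << 8));
        std::cout << "frame " << i / 4 << ": L=" << left << " R=" << right << "\n";
    }
}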
The first 44 bytes are commonly a standard RIFF header, as described here:
http://tiny.systems/software/soundProgrammer/WavFormatDocs.pdf
and here: http://www.topherlee.com/software/pcm-tut-wavformat.html
Apple/OSX/macOS/iOS created .wav files might add an 'FLLR' padding chunk to the header and thus increase the size of the initial RIFF header from 44 bytes to 4k bytes (perhaps for better disk or storage block alignment of the raw sample data).
The rest is very often 16-bit linear PCM in signed 2's-complement little-endian format, representing arbitrarily scaled samples at a rate of 44100 Hz.
The WAVE (.wav) file contains a header, which indicates the formatting information of the audio file's data. Following the header is the actual raw audio data. You can check the exact meaning of the header fields below.
Positions   Typical Value   Description
1-4         "RIFF"          Marks the file as a RIFF multimedia file. Characters are each 1 byte long.
5-8         (integer)       The overall file size in bytes (32-bit integer) minus 8 bytes. Typically, you'd fill this in after file creation is complete.
9-12        "WAVE"          RIFF file format header. For our purposes, it always equals "WAVE".
13-16       "fmt "          Format sub-chunk marker. Includes a trailing space.
17-20       16              Length of the rest of the format sub-chunk below.
21-22       1               Audio format code, a 2-byte (16-bit) integer. 1 = PCM (pulse code modulation).
23-24       2               Number of channels as a 2-byte (16-bit) integer. 1 = mono, 2 = stereo, etc.
25-28       44100           Sample rate as a 4-byte (32-bit) integer. Common values are 44100 (CD) and 48000 (DAT). Sample rate = number of samples per second, or Hertz.
29-32       176400          (SampleRate * BitsPerSample * Channels) / 8. This is the byte rate.
33-34       4               (BitsPerSample * Channels) / 8. 1 = 8-bit mono, 2 = 8-bit stereo or 16-bit mono, 4 = 16-bit stereo.
35-36       16              Bits per sample.
37-40       "data"          Data sub-chunk header. Marks the beginning of the raw data section.
41-44       (integer)       The number of bytes of the data section below this point. Also equal to (#ofSamples * #ofChannels * BitsPerSample) / 8.
45+                         The raw audio data.
I copied all of these from http://www.topherlee.com/software/pcm-tut-wavformat.html
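As a rough illustration (not production code), the fixed offsets from that table can be read like this; the file name is hypothetical, and real files may contain extra chunks (such as Apple's 'FLLR'), so a robust reader should walk the chunk list instead of assuming fixed offsets:

#include <cstdint>
#include <fstream>
#include <iostream>

int main() {
    std::ifstream f("input.wav", std::ios::binary);   // hypothetical file name
    uint8_t h[44];
    if (!f.read(reinterpret_cast<char*>(h), sizeof h)) return 1;

    // Little-endian field readers; offsets are the table positions minus 1.
    auto u16 = [&](int off) { return uint16_t(h[off] | (h[off + 1] << 8)); };
    auto u32 = [&](int off) { return uint32_t(h[off] | (h[off + 1] << 8)
                                    | (h[off + 2] << 16) | (h[off + 3] << 24)); };

    std::cout << "channels:        " << u16(22) << "\n"
              << "sample rate:     " << u32(24) << "\n"
              << "byte rate:       " << u32(28) << "\n"
              << "bits per sample: " << u16(34) << "\n"
              << "data bytes:      " << u32(40) << "\n";
}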
As others have pointed out, there's metadata in the wav file, but I think your question may be, specifically, what do the bytes (of data, not metadata) mean? If that's true, the bytes represent the value of the signal that was recorded.
What does that mean? Well, if you extract the two bytes (say) that represent each sample (assume a mono recording, meaning only one channel of sound was recorded), then you've got a 16-bit value. In WAV, 16-bit is (always?) signed and little-endian (AIFF, Mac OS's answer to WAV, is big-endian, by the way). So if you take the value of that 16-bit sample and divide it by 2^16 (or 2^15, I guess, if it's signed data), you'll end up with a sample that is normalized to be within the range -1 to 1. Do this for all samples and plot them versus time (and time is determined by how many samples/second are in the recording; e.g. 44.1 kHz means 44.1 samples per millisecond, so the first sample value will be plotted at t=0, the 44th at about t=1 ms, etc.) and you've got a signal that roughly represents what was originally recorded.
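A minimal sketch of that normalization, dividing by 2^15 = 32768 for signed 16-bit data (the sample values are made up):

#include <cstdint>
#include <iostream>

int main() {
    const double sampleRate = 44100.0;
    int16_t samples[] = {0, 16384, -32768, 32767};   // example mono samples

    for (int i = 0; i < 4; ++i) {
        double value = samples[i] / 32768.0;   // normalize signed 16-bit to [-1, 1)
        double t = i / sampleRate;             // time of this sample in seconds
        std::cout << "t=" << t << " s  value=" << value << "\n";
    }
}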
I suppose your question is "What do the bytes in the data block of a .wav file represent?" Let us go through everything systematically.
Prelude:
Let us say we play a 5 kHz sine wave using some device and record it in a file called 'sine.wav', and the recording is done on a single channel (mono). Now you already know what the header in that file represents.
Let us go through some important definitions:
Sample: A sample of any signal means the amplitude of that signal at the point where the sample is taken.
Sampling rate: Many such samples can be taken within a given interval of time. Suppose we take 10 samples of our sine wave within 1 second. Each sample is spaced 0.1 second apart. So we have 10 samples per second, thus the sampling rate is 10 Hz. Bytes 25 to 28 in the header denote the sampling rate.
Now coming to the answer of your question:
It is not possible practically to write the whole sine wave to the file because there are infinite points on a sine wave. Instead, we fix a sampling rate and start sampling the wave at those intervals and record the amplitudes. (The sampling rate is chosen such that the signal can be reconstructed with minimal distortion, using the samples we are going to take. The distortion in the reconstructed signal because of the insufficient number of samples is called 'aliasing'.)
To avoid aliasing, the sampling rate is chosen to be more than twice the frequency of our sine wave (5 kHz) (this is called the 'sampling theorem', and a rate of twice the frequency is called the 'Nyquist rate'). Thus we decide to go with a sampling rate of 12 kHz, which means we will sample our sine wave 12000 times in one second.
Once we start recording, if we record the signal for 5 seconds, we will have 12000 * 5 = 60000 samples (values). We take these 60000 values and put them in an array. Then we create the proper header to reflect our metadata, and then we convert these samples, which we have noted in decimal, to their hexadecimal equivalents. These values are then written into the data bytes of our .wav file.
[Plot of the sampled sine wave, plotted on http://fooplot.com]
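A small sketch of this sampling step (the 5 kHz tone and 12 kHz rate are the ones chosen above; quantizing the amplitudes to 16-bit values is an extra assumption for illustration):

#include <cmath>
#include <cstdint>
#include <iostream>

int main() {
    const double pi = 3.14159265358979323846;
    const double toneHz = 5000.0;       // frequency of the sine wave
    const double sampleRate = 12000.0;  // above the 10 kHz Nyquist rate
    const int numSamples = 12000;       // one second of audio

    for (int n = 0; n < numSamples; ++n) {
        double t = n / sampleRate;
        // Quantize the amplitude to a signed 16-bit value.
        int16_t sample = static_cast<int16_t>(32767 * std::sin(2 * pi * toneHz * t));
        if (n < 5) std::cout << sample << "\n";   // show the first few values
    }
}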
Two-bit audio wouldn't sound very good. :) Most commonly, the bytes represent sample values as 16-bit signed numbers that represent the audio waveform sampled at a frequency such as 44.1 kHz.

setting timestamps for audio samples in directshow graph

I am developing a DirectShow audio decoder filter to decode AC-3 audio.
The filter is used in a live graph, decoding a TS multicast.
The demuxer (MainConcept) provides me with the demuxed audio data, but does not provide timestamps for the samples.
How can I get/compute the correct timestamps for the audio?
I found this forum post:
http://www.ureader.com/msg/14712447.aspx
In it, a member gives the following formula for calculating the timestamps for audio, given its format (sample rate, number of channels, bits per sample):
With PCM audio, duration_in_secs = 8 * buffer_size / wBitsPerSample / nChannels / nSamplesPerSec, or duration_in_secs = buffer_size / nAvgBytesPerSec (since, for PCM audio, nAvgBytesPerSec = wBitsPerSample * nChannels * nSamplesPerSec / 8).
The only thing you need to add is a tracking variable that tells you what sample number in the stream you are at, so you can use it to offset the start and end times by the duration (duration_in_secs) when doing linear streaming. For seek operations you would of course need to know or calculate the sample number into the stream.
Don't forget that the units for timestamps in DirectShow are typed as REFERENCE_TIME, a long integer or Int64. Each unit is equal to 100 nanoseconds. That is why you see in video filters the value 10,000,000 being divided by the relevant number of frames per second (FPS) to calculate timestamps for each frame because 10,000,000 equals 1 second in a REFERENCE_TIME variable.
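Here is a rough sketch of that bookkeeping; REFERENCE_TIME is typedef'd locally so the snippet stands alone (DirectShow defines it as a 64-bit integer of 100 ns units), and the PCM format and buffer size are made-up example values:

#include <cstdint>
#include <iostream>

// Stand-in for DirectShow's REFERENCE_TIME: 100 ns units, 10,000,000 per second.
typedef int64_t REFERENCE_TIME;

int main() {
    const REFERENCE_TIME unitsPerSecond = 10000000;

    // Hypothetical PCM format and buffer size.
    const int nSamplesPerSec = 48000, nChannels = 2, wBitsPerSample = 16;
    const int bufferBytes = 19200;                 // one demuxed buffer

    int64_t samplesSoFar = 0;                      // running sample counter
    for (int i = 0; i < 3; ++i) {
        int64_t samplesInBuffer = 8LL * bufferBytes / wBitsPerSample / nChannels;
        REFERENCE_TIME start = samplesSoFar * unitsPerSecond / nSamplesPerSec;
        REFERENCE_TIME stop  = (samplesSoFar + samplesInBuffer)
                               * unitsPerSecond / nSamplesPerSec;
        std::cout << "buffer " << i << ": start=" << start
                  << " stop=" << stop << " (100 ns units)\n";
        samplesSoFar += samplesInBuffer;           // advance for the next buffer
    }
}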
Each AC-3 frame embeds data for 6 * 256 samples. The sampling rate can be 32 kHz, 44.1 kHz or 48 kHz (as defined by the AC-3 specification, Digital Audio Compression Standard (AC-3, E-AC-3)). The frames themselves do not carry timestamps, so you need to assume a continuous stream and increment the timestamps accordingly. As you mentioned the source is live, you might need to re-adjust timestamps on data starvation.
Each AC-3 frame is of fixed length (which you can identify from the bitstream header), so you might also check whether the demultiplexer is giving you a single AC-3 frame or several in a batch.
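A quick sketch of the per-frame duration this implies at each of the allowed sampling rates:

#include <iostream>

int main() {
    // Each AC-3 frame carries 6 * 256 = 1536 samples; at 48 kHz that is 32 ms.
    const int samplesPerFrame = 6 * 256;
    const int sampleRates[] = {32000, 44100, 48000};

    for (int rate : sampleRates) {
        double frameMs = 1000.0 * samplesPerFrame / rate;
        std::cout << rate << " Hz: " << frameMs << " ms per frame\n";
    }
    // For a continuous stream, add this duration to each frame's start time
    // to get the next frame's timestamp.
}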
