I am sampling an incoming audio stream at 8 ksps. I have a codec that takes ~1.6 ms to encode a packet of data (80 samples) into an encoded packet (5 samples). At that rate only 8000 * 1.662e-3 ~= 13 samples arrive during each encoding cycle, but I need 80 samples per cycle. How do I keep the stream continuous? My only guess is to slow down the bitrate of the outgoing encoded stream, but I'm not sure how to calculate this in general so that buffers on the incoming side don't fill up and the receiving side's buffers don't get starved.
This seems like a basic tenet of streaming, but I can't find any info on methods. Thanks for any help!
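For reference, a minimal sketch of the rate bookkeeping above, assuming the codec consumes fixed 80-sample frames: 80 samples at 8 ksps represent 80 / 8000 = 10 ms of audio, while encoding them takes only ~1.6 ms, so the encoder runs several times faster than real time and the input buffer only needs to hold samples until a full frame has accumulated (encode_and_send below is a placeholder, not a real API):

    #include <stdint.h>

    #define FRAME_SAMPLES 80   /* 80 samples = 10 ms of audio at 8 ksps */

    static int16_t frame[FRAME_SAMPLES];
    static int filled = 0;

    /* Called once per incoming sample (8000 times per second). */
    void on_sample(int16_t s)
    {
        frame[filled++] = s;
        if (filled == FRAME_SAMPLES) {
            /* A full 10 ms frame is ready; encoding it takes ~1.6 ms, so the
             * encoder finishes long before the next frame has accumulated.
             * encode_and_send(frame, FRAME_SAMPLES);   <- placeholder */
            filled = 0;
        }
    }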
I'm trying to extract audio from a telephone session captured with Wireshark. The capture was sent to us by the telephone provider for debugging/analysis. I have 3 files: signalling, and two files with UDP data, one for each direction. After merging two of these files (one direction with signalling), Wireshark provides RTP stream analysis. What I observe (as I do for a second session capture) is that Wireshark isn't able to export the RTP stream audio (Payload type: ITU-T G.711 PCMA (8)) for one direction. This happens to be the RTP stream containing "RFC 2833 RTP events" (Payload type: telephone-event (106)). These events seem to transport DTMF tones out-of-band; for each DTMF tone there is a section of 7 consecutive RTP events of this type. What Wireshark produces is an 8 GB *.au file for an audio stream of less than two minutes. For the opposite-direction stream I get an audio file that is 2 MB in size.
I have to admit that this is just guesswork: I'm connecting the error with a feature I can see. I'm a bit confused that Wireshark obviously knows about these events but fails to save the corresponding audio stream. Do I maybe need some plugin for that?
I tried to search the web for this issue but without success.
This question was previously asked on Network Engineering but turned out to be off-topic there.
You can filter out the DTMF events (rtp.p_type != 106) from the Wireshark capture (pcap) and then save only the G.711 data in a separate file.
Then do the RTP analysis and save the audio payload in .au/.raw format.
I'm trying to use the Opus Forward Error Correction (FEC) feature.
I have a service which does the encoding with OPUS_SET_INBAND_FEC(1)
and OPUS_SET_PACKET_LOSS_PERC(20), uses 10 ms packets, and sends them over UDP.
I'm not clear on the decoding process though.
When a packet is lost, do I only need to call decode with fec=1, or do I also need to call decode with fec=0 on the next packet afterwards?
How do I know up front the size of the PCM buffer that I pass to decode with FEC enabled?
I managed to make it work.
The encoding part stated in the question was correct:
Set OPUS_SET_INBAND_FEC(1) and OPUS_SET_PACKET_LOSS_PERC(x) on the encoder, where 0 < x < 100.
Send packets with a duration of at least 10 ms (for example: 480 samples at 48 kHz).
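A minimal sketch of that encoder setup, assuming 48 kHz mono, OPUS_APPLICATION_VOIP and the 20% loss figure from the question (error handling trimmed):

    #include <opus/opus.h>

    /* Create an encoder with in-band FEC enabled, as described above. */
    OpusEncoder *make_fec_encoder(void)
    {
        int err = 0;
        OpusEncoder *enc = opus_encoder_create(48000, 1, OPUS_APPLICATION_VOIP, &err);
        if (err != OPUS_OK || enc == NULL)
            return NULL;

        opus_encoder_ctl(enc, OPUS_SET_INBAND_FEC(1));        /* embed FEC data       */
        opus_encoder_ctl(enc, OPUS_SET_PACKET_LOSS_PERC(20)); /* expected loss, 0-100 */
        return enc;
    }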
For the decoding part, when a packet is lost, call the decode function on the next packet first with fec=1 and again with fec=0.
When calling decode with fec=1, the PCM buffer you pass in will be filled entirely.
If you don't know how long the PCM buffer should be, use OPUS_GET_LAST_PACKET_DURATION(&x) on the decoder, where x will receive the duration of the last packet.
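And a sketch of the decode path when a packet goes missing, assuming mono, fixed-duration packets (so that OPUS_GET_LAST_PACKET_DURATION matches the lost frame) and a pcm buffer large enough for two frames:

    #include <opus/opus.h>

    /* Recover a lost packet from the FEC data carried in the packet that
     * follows it, then decode that following packet normally. */
    int decode_after_loss(OpusDecoder *dec,
                          const unsigned char *next_pkt, opus_int32 next_len,
                          opus_int16 *pcm)
    {
        /* Duration (in samples) of the last packet the decoder saw; with
         * fixed-duration packets this equals the size of the missing frame. */
        opus_int32 lost_size = 0;
        opus_decoder_ctl(dec, OPUS_GET_LAST_PACKET_DURATION(&lost_size));

        /* 1) fec=1: reconstruct the lost frame from next_pkt's FEC data. */
        int n1 = opus_decode(dec, next_pkt, next_len, pcm, lost_size, 1);
        if (n1 < 0)
            return n1;

        /* 2) fec=0: decode next_pkt itself. */
        int n2 = opus_decode(dec, next_pkt, next_len, pcm + n1, lost_size, 0);
        return (n2 < 0) ? n2 : n1 + n2;
    }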
In section 2.1 of the Speex codec manual it says:
Every speech codec introduces a delay in the transmission. For Speex, this delay is equal to the frame size, plus some amount
of “look-ahead” required to process each frame. In narrowband operation (8 kHz), the delay is 30 ms, while for wideband (16
kHz), the delay is 34 ms. These values don’t account for the CPU time it takes to encode or decode the frames.
In the RTP Payload Format for the Speex Codec, RFC 5574, it says:
ptime: SHOULD be a multiple of 20 msec
I have a 20 ms frame time of encoded data, so I assume my ptime should be 20.
The delay for the encoding is 30 ms or more, yet the time between RTP packets is 20 ms. How is this supposed to work? Is every other RTP payload an empty packet? How do I resolve this?
Seemingly this would be an issue with every codec, so I must be missing some fundamental understanding of how streaming works.
I have validated I can stream a pre-encoded buffer and it sounds as intended.
I have tried:
Creating a large queue at the beginning to compensate; however, it quickly drains to zero length.
Sending zero data as the payload
Ideas I haven't yet tried:
Sending a packet of all padding and marking it as padding in the RTP header
Increasing the sequence number but not the timestamp until the next actual payload is ready (this sounds like it's against the spec?)
Note: I'm now wondering if the delay mentioned by Speex is inherent in the encoded output, and the delay I am seeing while streaming is due to my limited CPU (embedded).
My note was correct; the question was flawed.
The Speex manual is referring to a delay in the audio output (algorithmic latency), not to processing time. For narrowband, the 30 ms is simply the 20 ms frame plus the codec's look-ahead; it shifts the output in time but does not change the pacing of one packet per 20 ms frame. Therefore the issue in the question is not an issue.
I'm glad I asked the question; it helped me come to the solution.
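To make that pacing concrete, here is a minimal sketch of a steady-state Speex encode loop at 8 kHz: each call to speex_encode_int() consumes exactly one 160-sample (20 ms) frame and yields one encoded frame, so one RTP packet goes out every 20 ms regardless of the codec's internal look-ahead. The capture read and RTP send are placeholders, not real APIs:

    #include <speex/speex.h>

    void encode_loop(void)
    {
        void *enc = speex_encoder_init(&speex_nb_mode);      /* narrowband, 8 kHz */
        int frame_size = 0;                                  /* 160 at 8 kHz      */
        speex_encoder_ctl(enc, SPEEX_GET_FRAME_SIZE, &frame_size);

        SpeexBits bits;
        speex_bits_init(&bits);

        spx_int16_t pcm[160];
        char payload[200];

        for (;;) {
            /* read_20ms_of_pcm(pcm, frame_size);   <- placeholder capture call */
            speex_bits_reset(&bits);
            speex_encode_int(enc, pcm, &bits);               /* one frame in...  */
            int nbytes = speex_bits_write(&bits, payload, sizeof payload);
            /* ...one RTP payload out: timestamp += 160, ptime = 20 ms */
            (void)nbytes;
        }
    }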
Short story:
If I intend to receive a SHOUTcast-compatible audio stream, process it in my application, and then send it back out, how do I do that properly using an MP3 (de/en)coder library? Pseudo code, or better, LAME-specific code, would be highly appreciated.
Long story:
The more specific questions that bother me were prompted by an article about MP3, which says:
Generally, frames are independent items. Each frame has its own header
and audio informations. There is no file header. Therefore, you can
cut any part of MPEG file and play it correctly (this should be done
on frame boundaries but most applications will handle incorrect
headers). For Layer III, this is not 100% correct. Due to internal
data organization in MPEG version 1 Layer III files, frames are often
dependent of each other and they cannot be cut off just like that.
This made me wonder how SHOUTcast servers and clients deal with frame headers and frame dependencies.
Do I have to encode at a constant bitrate (CBR) only, if I want to achieve maximum compatibility with most SHOUTcast players out there?
Is the MP3 frame header used at all, or is the stream format deduced from a SHOUTcast-protocol-specific HTTP header?
Does the SHOUTcast protocol guarantee (or is it common good practice) to start serving the MP3 stream on a frame boundary and to respond with chunks that are cut at frame boundaries? And what is the minimum or recommended size of an MP3 frame for streaming live audio?
How does SHOUTcast deal with frame dependencies? Does it do something special with the MP3 encoding to ensure that the served stream does not contain frames that depend on previous frames (if that is even possible)? Or does it ignore these dependencies on the server/client side, accepting reduced audio quality or even artifacts?
SHOUTcast servers do not know or care about the data being passed through them; they send it as-is. You can actually send arbitrary data through a SHOUTcast server and receive it. SHOUTcast will segment the media data wherever the buffer boundary happens to fall.
It's up to the client to re-sync to the data. It does this by locating a frame header and then beginning to decode. Once the codec has enough frames to reliably play back audio, it will begin outputting raw PCM. It's up to the codec to decide when it's safe to start playback. Since the codec knows what it's doing in terms of decoding the media, it knows when it has sufficient data (including bit reservoirs) to begin without artifacts. It's also worth noting that the bit reservoir cannot reach back very far, so at worst it takes only a few frames to satisfy it.
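As a rough sketch of that re-sync step (decoder libraries do this internally; nothing here is SHOUTcast-specific), the client can scan the raw byte stream for the 11-bit MPEG audio sync word. A real decoder would also validate the remaining header fields and confirm that another valid header follows at the computed frame length before trusting the boundary:

    #include <stddef.h>
    #include <stdint.h>

    /* Return the offset of the first candidate MPEG audio frame header in buf,
     * or -1 if none is found. A frame header starts with 11 set bits (0xFFE). */
    static ptrdiff_t find_mp3_sync(const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i + 1 < len; i++) {
            if (buf[i] == 0xFF && (buf[i + 1] & 0xE0) == 0xE0)
                return (ptrdiff_t)i;
        }
        return -1;
    }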
This is one of the reasons it's important to have a sizable buffer server-side, to flush to the clients as fast as possible on connect. If playback is to start quickly, the codec needs more data than the current frame to begin.
I am currently streaming audio (AAC-HBR at 8 kHz) and video (H.264) using RTP. Both feeds work fine individually, but when put together they get out of sync pretty fast (in less than 15 seconds).
I am not sure how to increment the timestamp in the audio RTP header. I thought it should be the time difference between two RTP packets (around 127 ms) or a constant increment of 1/8000 (0.125 ms), but neither worked. Instead I managed to find a sweet spot: when I increment the timestamp by 935 for each packet, it stays synchronized for about a minute.
The AAC frame size is 1024 samples, so increment the timestamp by 1024 per frame, which at 8 kHz corresponds to (1/8000) * 1024 = 128 ms of audio, or by a multiple of that if your packet carries multiple AAC frames.
Does that help?
A bit late, but I thought I'd put up my answer.
The timestamp increment on an audio RTP packet == the number of audio samples contained in the RTP packet.
For AAC, each frame consists of 1024 samples, so the timestamp on the RTP packet should increase by 1024.
The difference in clock time between 2 RTP packets = (1/8000) * 1024 = 128 ms, i.e. the sender should send the RTP packets 128 ms apart.
A bit more information for other sampling rates:
AAC sampled at 44100 Hz means 44100 samples of signal per second.
So 1024 samples correspond to (1000 ms / 44100) * 1024 = 23.21995 ms.
So the timestamp increment between 2 RTP packets is still 1024, but
the difference in clock time between 2 RTP packets in the RTP session should be 23.21995 ms.
To correlate with another example:
For the G.711 family (PCM, PCMU, PCMA), the sampling frequency is 8 kHz.
So a 20 ms packet should contain 8000/50 == 160 samples.
Hence the RTP timestamps are incremented by 160.
The difference in clock time between 2 RTP packets should be 20 ms.
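To make the arithmetic above concrete, a small self-contained sketch (the codecs and packet sizes are just the examples from this answer):

    #include <stddef.h>
    #include <stdio.h>

    /* RTP timestamp increment = samples per packet;
     * wall-clock spacing between packets = samples per packet / sample rate. */
    int main(void)
    {
        struct { const char *codec; unsigned rate; unsigned samples; } cases[] = {
            { "AAC @ 8000 Hz",   8000,  1024 },  /* one AAC frame */
            { "AAC @ 44100 Hz",  44100, 1024 },  /* one AAC frame */
            { "G.711 @ 8000 Hz", 8000,  160  },  /* 20 ms of PCM  */
        };

        for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
            double ms = 1000.0 * cases[i].samples / cases[i].rate;
            printf("%-17s timestamp increment = %u, packet spacing = %.3f ms\n",
                   cases[i].codec, cases[i].samples, ms);
        }
        return 0;
    }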
IMHO, video and audio de-sync on Android is difficult to fight if the streams come from different media recorders. They simply capture different start frames, and there seems to be no way to find out how big the de-sync is and compensate for it with audio or video timestamps on the fly.