Web Audio Api Realtime streaming PCM ADPCM

Web Audio Api Realtime streaming PCM ADPCM - audio

I have a server that passes the client PCM or ADPCM data.
I initially decided to use PCM because I did not want to deal with encoding and decoding.
I got PCM to work however between each chunk of audio I heard glitches.(Sort of like clipping)
So I thought maybe the reason is latency/high quality audio and all that stuff.
So I decided to use ADPCM to reduce the data amount. I wrote a adpcm to pcm decoder in javascript. It was a hassle. I was hoping that since the data count reduced maybe that would stop the glitches(data would catch up with what is being played)
But I was wrong. I still get the glitches.
Can this even be done with TCP ? Or is it a lost cause. I dont have UDP over websockets.
Do I need to implement a buffering algorithm ? I don't want to do this as it is real time audio and i just want to process it as fast as I can.
Do you guys know a good link to read about real time audio over the web.
I can give code example but this is a high level question.
PS: I tried to use tabs but we get a buffering issue and we cant control it.
I also dont get any flow control from the server. It does not say that Audio starter or audio stopped our paused.
It is a push protocol and All I get is ADPCM and PCM data

Yes, of course you can use TCP. UDP is often used in telephony applications as the lower overhead makes everything faster, and for this application it doesn't matter if packets are dropped or arrive in the wrong order. But since UDP isn't an option, you can use TCP.
As you have suspected, it seems to me that your problem is buffer underruns. Either your connection to the server is not fast enough (or at least consistently fast enough), or you are not providing data from the encoder at a fast enough rate. This can happen if you are recording data in real time, and trying to play it back in real time.
A solution is to buffer data server-side before sending it to the client. Have as large of buffer as your latency requirements allow. For internet radio purposes, I usually pick a 30-second buffer, as latency doesn't matter. For your purposes, you will probably want a buffer of at least 64KB. This is the maximum size allowed in a TCP packet. This packet will get fragmented along the way, but that is okay.
You might also look into how your server is sending data. Experiment with disabling the Nagle algorithm so that your server isn't waiting for ACKs before sending more data.

Related

TCP or UDP? Delays building up on production for video stream

I am creating a video stream from a camera with FFMPEG and nodejs stream Duplex class.
this.ffmpegProcess = spawn('"ffmpeg"', [
'-i', '-',
'-loglevel', 'info',
/**
* MJPEG Stream
*/
'-map', '0:v',
'-c:v', 'mjpeg',
'-thread_type', 'frame', // suggested for performance on StackOverflow.
'-q:v', '20', // force quality of image. 2-31 when 2=best, 31=worst
'-r', '25', // force framerate
'-f', 'mjpeg',
`-`,
], {
shell: true,
detached: false,
});
On local network, we are testing it with a few computers and every thing works very stable with latency of maximum 2 seconds.
However, we have imported the service to AWS production, when we connect the first computer we have a latency of around 2 seconds, but if another client connects to the stream, they both start delaying a lot, where the latency builds up to 10+ seconds, and in addition the video is very slow, as in the motion between frames.
Now I asked about TCP or UDP is because we are using TCP for the stream, which means that every packet that is sent is implementing the TCP syn-ack-synack protocol and sequence.
My question is, could TCP really be causing such an issue that the delays get up to 10 seconds with very slow motion? because on local network it works very just fine.

Yes, TCP is definitely not correct protocol for it. Can be used but with some modification on the source side. UDP sadly is not a magic bullet, without additional logic UDP will not solve the problem either (if you don't care that you will see broken frames constructed randomly from other frames).
Explanation
The main features of the TCP are that packets are delivered in the correct order and all packets are delivered. These two features are very useful but quite harming for video streaming.
On the local network bandwidth is quite large and also packet loss is very low so TCP works fine. On the internet bandwidth is limited and every reach of the limit causes packet loss. TCP will deal with packet loss by resending the packet. Each resend will cause extending delay of the stream.
For better understanding try to imagine that a packet contains whole frame (JPEG image). Let's assume that normal delay of the link is 100ms (time in which frame will be in transit). For 25 FPS you need to deliver each frame every 40ms. If the frame is the lost in transit, TCP will ensure that copy will resend. TCP can detect this situation and fix in ideal situation in 2 times of delay - 2*100ms (in the reality it will be more, but for sake of simplicity I am keeping this). So during image loss 5 frames are waiting in the receiver queue and waits for single missing frame. In this example one missing frame causes 5 frames of delay. And because TCP creates queue of packets. delay will not be ever reduced, only can grow. In the ideal situation when bandwidth is enough delay will be still the same.
How to fix it
What I have done in the nodejs was fixing the source side. Packets in the TCP can be skipped only if the source will do it, TCP has no way how to do it itself.
For that purpose, I used event drain.
Idea behind the algorithm is that ffmpeg generates frame it's own speed. And node.js reads frames and always keep the last received. It has also outgoing buffer with size of a single frame. So if sending of single frame is delayed due to network conditions, incoming images from ffmpeg are silently discarded (this compensates low bandwidth) except the last received. So when TCP buffer signals (by drain event) that some frame was correctly sent, nodejs will take the last received image and write it to the stream.
This algorithm regulates itself. If sending bandwidth is enough (sending is faster then frames generated by ffmpeg) then no packet will be discarded and 25fps will be delivered. If bandwidth can transfer only half of frames in average one of two frames will be discarded so receiver will se 12.5fps but not growing delay.
Probably the most complicated part in this algorithm is correctly slicing byte stream to frames.

On either the server or a client, run a wireshark trace, this will tell you the exact "delay" of the packets, or which side is not doing the right thing. It sound's like the server is not living up to your expectations. Make sure the stream is UDP.

This might not be relevant for your use case, but this is where a HTTP based content delivery system, like HLS, is very useful.
UDP or TCP, trying to stream the data in real time across the internet is hard.
HLS will spit out files that can be downloaded over HTTP (which is actually quite efficient when serving large files, and ensures the integrity of the video as it's got built in error correction and so on), and re-assembled in order on the client side by the player.
It adapts easily to different quality streams (part of the protocol) which lets the client decide how quickly it consume the data.
As an aside, Netflix and the like use Widevine DRM which is somewhat similar - HTTP based delivery of the content, although widevine will take the different quality encodes, stuff it into one gigantic file and use HTTP range requests to switch between quality streams.

Is it realistic to stream 12-16 bit audio through SPP bluetooth in realtime?

I have tried to send 12-bit audio to be listened to in real time through the HC05 SPP bluetooth module hooked up to an arduino and DAC over serial with a python RFCOMM socket. I have since learned that Serial Port Protocol is not very great at all for this purpose due to its low bandwidth. I figured I could definitely send the data and then play it out through a DAC, but I doubt an arduino would hold an array the size of a WAV file and maybe not even an mp3 file, but that would defeat the purpose of controlling the audio (play,pause,rewind,etc) from my computer. Would it be more realistic and worthwhile to use an A2DP enabled bluetooth module? Or is it still possible to listen to acceptable quality 12-16 bit audio in real time with SPP? I have tried to use lower bit songs, adjusted baud rates for the arduino and HC-05 serial ports, and tried to adjust the magnitude of the values outputted by the DAC to the audio port and I still seem to get crackly audio. It seems the problem comes down to the low bitrate transfer speed of SPP, or am I wrong?

Is it realistic to stream 12-16 bit audio through SPP bluetooth in realtime?
Sure, at some awfully slow sample rate <= 8 kHz. You'd be better off sending 8-bit audio at a higher sample rate.
Would it be more realistic and worthwhile to use an A2DP enabled bluetooth module?
Yes, absolutely, without question. That's what it's designed for, as I mentioned in your other question.
Or is it still possible to listen to acceptable quality 12-16 bit audio in real time with SPP?
Acceptable is subjective. If it's just voice, you can get away with it. If you want reasonable audio quality for music, almost universally, no, it's not acceptable.
It seems the problem comes down to the low bitrate transfer speed of SPP, or am I wrong?
Without any code to inspect and debug, it's impossible to say what the specific problem is that you're referring to. Undoubtedly, the low bandwidth will not enable good quality audio anyway.
If you must continue to use SPP and simple codecs like PCM, at least use differential PCM to save a bit more bandwidth.

How do Shoutcast servers and clients deal with mp3 frame headers and frame dependencies?

Short story:
If I myself intend to receive and then send a Shoutcast compatible audio stream processed by my application, then how to do it properly using an mp3 (de/en)coder library? Pseudo code, or better - lame mp3 specific code would be highly appreciated.
Long story:
More specific questions which bother me were caused by an article about mp3, which says:
Generally, frames are independent items. Each frame has its own header
and audio informations. There is no file header. Therefore, you can
cut any part of MPEG file and play it correctly (this should be done
on frame boundaries but most applications will handle incorrect
headers). For Layer III, this is not 100% correct. Due to internal
data organization in MPEG version 1 Layer III files, frames are often
dependent of each other and they cannot be cut off just like that.
This made me wonder, how Shoutcast servers and clients deal with frame headers and frame dependencies.
Do I have to encode to constant bitrate (CBR) only, if I want to achieve maximum compatibility with the most of Shoutcast players out there?
Is the mp3 frame header used at all or the stream format is deduced from a Shoutcast protocol specific HTTP header?
Does Shoutcast protocol guarantee (or is it common good practice) to start serving mp3 stream on frame boundaries and continue to respond with chunks that are cut at frame boundaries? But what is the minimum or recommended size of a mp3 frame for streaming live audio?
How does Shoutcast deal with frame dependencies - does it do something special with mp3 encoding to ensure that the served stream does not have frames which depend on previous frames (if this is even possible)? Or maybe it ignores these dependencies on server side/client side, thus getting audio quality reduction or even artifacts?

SHOUTcast servers do not know or care about the data being passed through them. They send it as-is. You can actually send arbitrary data through a SHOUTcast server, and receive it. SHOUTcast will segment the media data wherever the buffer size falls.
It's up to the client to re-sync to the data. It does this by locating the frame header, then being decoding. Once the codec has enough frames to reliably play back audio, it will begin outputting raw PCM. It's up to the codec when to decide it's safe to start playback. Since the codec knows what it's doing in terms of decoding the media, it knows when it has sufficient data (including bit reservoirs) to begin without artifacts. It's also worth noting that the bit reservoir cannot be carried on too far, so it doesn't take but a few frames at worst to handle it.
This is one of the reasons it's important to have a sizable buffer server-side, to flush to the clients as fast as possible on connect. If playback is to start quickly, the codec needs more data than the current frame to begin.

Strategy for time-indexed audio archive with lossy compression

For part of one of my projects, I am considering developing an audio archive for internet radio stations. This archive would be indexed and addressable by date/time.
For example, the server would connect to a stream (generally encoded in MP3), and save the stream data. A client could connect to this server and request audio from 2011-07-05 15:58:30 to 2011-07-05 15:59:37. The server would return the audio data to the client for playback.
My initial thought was to save the data to 1-minute chunks of raw MP3 data to disk, and reference these files from a database. The server would be dumb to the stream/file format, and wouldn't understand mpeg frames. It would simply pass on data to the client, dividing the chunks up linearly to send. It would be up to the client to sync to the stream. This is not unlike how internet radio servers run in general. SHOUTcast servers simply output the data, byte for byte, that is sent to them from the encoder. When a client connects, data is sent, regardless of whether or not it even ends on an MP3 frame. It is up to the client to sync.
I am wondering if there might be a better approach, maximizing compatibility with clients and audio formats. Any thoughts on how to go about this?
The only other thing I can think of is decoding the MP3 to raw PCM audio and re-encoding as necessary when requested. I would prefer not to go this route due to the disk space required, and the loss of quality when re-encoding.
This question is language-agnostic, but if it is helpful, I will likely implement a solution in PHP with MySQL as the database.

You don't have to worry about this, since ALL mp3 that I accessed over shoutcast is Constant Bitrate. Do you don't have to index it. I have POC project that had archive in 5 minute chunks, then uses PHP to combine that files and pseudo-stream it to the winamp via shoutcast. It worked!
And since you are working with mp3, you can assume (and you'll assume correctly) that the density of the captured file is linear, so to access 30 second of the 60 second file you should seek in the middle. Since mp3 decoders are robust enough, you don't have to track the frames at all here.
AACplus, whole different story. It's inherent VBR.

RTP packet combining

I have a bunch of RTP packets that I'd like to re-assemble into an audio stream. For each packet, I have the sequence number, SSRC, timestamp, and a byte array representing the data itself.
Currently I'm taking each subset of packets by their SSRC, then ordering them by timestamp and combining the byte arrays in that order. Afterwards, I'm mixing the byte arrays. The resulting audio data sounds great (by great, I mean everything is in time), but I'm worried that it's due to not having much packet loss.
So, a couple questions...
For missing packets, a missing sequence number shows where I need to add a bit of empty audio. I believe the sequence number "wraps around" quite often, so I need to use timestamp to break them up into subsets. Then I can look for missing sequence numbers in those subsets and add as needed. Does that sound like the right thing to do?
I haven't quite figured out what else the timestamp is good for. Since I'm recording already existing packets and filling in the missing ones, maybe I don't need to worry about this as much?

1) Avoid using timestamps in your algorithm. Your algorithm will fail in case you are receiving stream from bad clients (Improper timestamps). And "timestamps increment" value changes with codec types. In that case you may need different subsets for different codecs. There is no limitations on sequence number. Sequence number are incremented monotonically. Using sequence number you can track lost packets easily.
2) Timestamp is used for synchronization between Audio and video. Mainly for lip sync. A relationship between audio and video timestamps is established for achieving synchronization. In your case its only audio so you can avoid using timestamp.
Edit: According to RFC 3389 (Real-time Transport Protocol (RTP) Payload for Comfort Noise (CN))
RTP allows discontinuous transmission (silence suppression) on any
audio payload format. The receiver can detect silence suppression
on the first packet received after the silence by observing that
the RTP timestamp is not contiguous with the end of the interval
covered by the previous packet even though the RTP sequence number
has incremented only by one. The RTP marker bit is also normally
set on such a packet.

1) I don't think sequence number "wrap around" quickly. This is 16-bit value so it wraps every 65536 messages and even if message is send every 10 milliseconds this give more than 10 minutes of transmission. It is very unlikely that packet will be lost for so long. So in my opinion you should only check sequence number, checking timestamp is pointless.
2) I think you shouldn't worry much about timestamp. I know that some protocols didn't even fill this value and relay only on sequence number.

I think what Zulijn is getting at in his answer above is that if your packets are stored in the order they were captured then you can use some simple rules to find out-of-order packets - e.g. look back 50 packets and forward 50 packets. If it is not there then it counts as a lost packet.
This should avoid any issues with the sequence number having wrapped around. To handle any lost packets there are many techniques you can use, so it would be useful to google 'Audio packet loss' or 'VOIP packet loss concealment'. As Adam mentions timestamp will vary with codec so you need to understand this if you are going to use it.
You don't mention what the actual application is but if you are trying to understand what the received audio actually sounded like, you really need some more info, in particular the jitter buffer size - this effectively determines how long the receiver will wait for an out of sequence packet before deciding it is lost. What this means to you is that there may be out-of-sequence packets in your file which the 'real world' receiver would have given up and not played back - i.e. your reconstruction from the file may give a higher quality than the 'real time' experience.
If it is a two way transmission, then delay is very important also (even if it is a constant delay and hence does not affect jitter and packet loss). This is the type of effect you used to get on some radio telephones and still do on some satellite phones (or VoIP phones), and it can significantly impact the user experience.
Finally, different codecs and clients may apply different techniques to correct lost packets, insert 'silent tones' for any gaps in the audio (e.g. pauses in conversation), suppress background noise etc.
To get a proper feel for the user experience you would have to try to 'replay' your captured packets as accurately as possible using the same codec, jitter buffer and any error correction/packet loss techniques the receiver used also.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string