How client side handle so many audio streams in large Conference? - audio

In a SFU audio conference platform, media server simply route audio packets. Lets say in client side I keep audio packet queue for each present participant (updated by signaling server) and at a certain rate I simply dequeue from every queue, handle, pick top 4-6 voice packets and mix for play. If sequence number is missing for some participants I even send nack and wait for some threshold time for that participants queue to be dequeued (to maintain the voice flow).
But to make this solution scalable, I have to do this dequeue then pick top 4-6 voice from media server side and send it to every one. Now, from client side, even if some participant's packet sequence gets missing I am not sure whether it was actually missing or it was not able to make it to top 4-6 voice packets in server (as I need to send nNack and wait if packet actually got missing).
How can I handle this usecase efficiently and any suggestion with top mixing numbers or anything is highly appreciable?

Related

Recommendation on client side echo cancellations

I am developing a MCU based voip service. I think the traditional way of doing MCU is, you have N audio mixers at server and every participant in the call receive a steam that does not have their own voice encoded.
Guess what I wish to do is, have only 1 audio mixer running at server and (on a broadcast kind model) send the final mixer audio to every participant (For scalability obviously).
Now this obviously creates a problem of hearing your own voice coming from speaker as MCU’s output stream.
I am wondering if there is any “client side echo cancellation” project that I can use to cancel the voice of user at desktop/mobile level.
The general approach is to filter/subtract the own voice in the MCU. Doing this on the client side does not work.

Websocket Dropped Frames?

Trying to solve a perplexing issue w/streaming audio over websockets. We are using Nexmo (Twilio competitor) which enables bidirectional streaming of call audio over websockets. Nexmo connects to our websocket server and starts sending 16khz sampled audio frames of length 640 bytes each.
Everything was working great until recently the websocket audio suddenly started dropping clumps of frames, resulting in gaps in the audio.
But the most interesting thing is the following:
When Nexmo connects directly to our digitalocean vps, frames are dropped
When Nexmo connects via an ngrok tunnel, everything starts working again
Any ideas on where to look for a real solution would be awesome.
Make certain the process which is receiving the websocket traffic has a separate thread just to handle this traffic ... any system will drop traffic if its too busy with other tasks ... if your receiving end has some event loop its maintaining while getting preempted by incoming websocket interrupts you will drop packets
I did a project where receiving end was the browser which was running an event loop to perform the audio rendering while simultaneously also handling the websocket traffic - not a good idea since the critical portions of this event loop must not be allowed to get preempted ... I had to create a webworker process on browser side to handle all the websocket traffic to then populate a circular audio buffer ... this webworker was viewed as a client by the browser event loop which was rendering the audio yet was now permitted to never get preempted by incoming traffic ... only when browser event loop reached its lull period did it then request to retrieve another gulp of data buffered up by webworker audio buffer queue

Mix multiple RTP streams into a single one

I am trying to build a basic conference call system based on plain RTP.
_____
RTP IN #1 ______ | | _______ MIX RTP receiver #1
|______| MIX |_____|
______| | RTP | |_______ MIX RTP receiver #2
RTP IN #2 |_____|
I am creating RTP streams on Android via the AudioStream class and using a server written in Node.js to receive them.
The naive approach I've been using is that the server receives the UDP packets and forwards them to the participants of the conversation. This works perfectly as long as there are two participants, and it's basically the same as if the two were sending their RTP stream to each other.
I would like this to work with multiple participants, but forwarding the RDP packets as they arrive to the server doesn't seem to work, probably for obvious reasons. With more than two participants, the result of delivering the packets coming in from different sources to each of the participants (excluding the sender of such packet) results in a completely broken audio.
Without changing the topology of the network (star rather than mesh) I presume that the server will need to take care of carrying out some operations on the packets in order to extract a unique output RTP stream containing the mixed input RTP streams.
I'm just not sure how to go about doing this.
In your case I know two options:
MCU or Multipoint Control Unit
Or RTP simulcast
MCU Control Unit
This is middle box (network element) that gets several RTP streams and generate one or more RTP streams.
You can implement it by yourself but it is not trivial because you need to deal with:
Stream decoding (and therefore you need jitter buffer and codecs implementation)
Stream mixing - so you need some synchronisation between streams (collect some data from source 1 and source 2, mix them and send to destination 3)
Also there are several project that can do it for you (like Asterisk, FreeSWITCH etc), you can try to write some integration level with them. I haven't heard anything about something on Node.js
Simulcast
This is pretty new technology and their specifications available only in IETF drafts. Core idea here is to send several RTP streams inside one RTP stream simultaneously.
When destination receives several RTP streams it needs to do exactly the same as MCU does - decode all streams and mix them together but in this case destination may use hardware audio mixer to do that.
Main cons for this approach is bandwidth to the client device. If you have N participants you need:
either send all N streams to all other
or select streams based on some metadata like voice activity or audio level
First one is not efficient, second is very tricky.
The options suggested by Dimtry's answer were not feasible in my case because:
The middle box solution is difficult to implement, requires too many resources or requires to rely on an external piece of software, which I didn't want to have to rely on, especially because Android RTP stack should work out of the box with basic support from a server component, especially for hole punching
The simulcast solution cannot be used because the Android RTP package cannot handle that and ad far as my understanding goes it's only capable of handling simple RTP streams
Other options I've been evaluating:
SIP
Android supports it but it's more of a high level feature and I wanted to build the solution into my own custom application, without relying on additional abstractions introduced by a high level protocol such as SIP. Also, this felt just too complex to set up, and conferencing doesn't even seem to be a core feature but rather an extension
WebRTC
This is supposed to be the de-facto standard for peer 2 peer voice and video conferencing but looking through code examples it just looks too difficult to set up. Also requires support from servers for hole punching.
Our solution
Even though I had, and still have, little experience on this I thought there must be a way to make it work using plain RTP and some support from a simple server component.
The server component is necessary for hole punching, otherwise getting the clients to talk to each other is really tricky.
So what we ended up doing for conference calling is have the caller act as the mixer and the server component as the middle-man to deliver RTP packets to the participants.
In practice:
whenever a N-user call is started, we instantiate N-1 simple UDP broadcast servers, listening on N-1 different ports
We send those N-1 ports to the initiator of the call via a signaling mechanism built on socket.io and 1 port to each of the remaining participants
The server component listening on those ports will simply act as a relay: whenever it receives a UDP packet containing the RTP data it will forward it to all the connected clients (the sockets it has seen thus far) except the sender
The initiator of the call will receive and send data to the other participants, mixing it via the Android AudioGroup class
The participants will send data only to the initiator of the call, and they will receive the mixed audio (together with the caller's own voice and the other participants' voices) on the server port that has been assigned to them
This allows for a very simple implementation, both on the client and on the server side, with minimal signaling work required. It's certainly not a bullet proof conferencing solution, but given the simplicity and feature completeness (especially regarding common network issues like NAT traversal, which using a server aid is basically a non-issue) is in my opinion better than writing lots of code which requires many resources for mixing server-side, relying on external software like SIP servers, or using protocols like WebRTC which basically achieve the same with lots more effort implementation wise.

How to measure voice delay between multiple hops

I am developing a voip solution, where my voice has to pass through 2 hops other than originator of voice before handed over to TDM.
Lets say, Hop A is generating Voice (RTP) and sent to Hop B (it may mix the stream with other streams) and sends to Hop C (it may also do some processing) and hands over the stream to PBX (TDM).
Now how can i measure how much delay is introduced by each hop?

a UDP socket based rateless file transmission

I'm new to socket programming and I need to implement a UDP based rateless file transmission system to verify a scheme in my research. Here is what I need to do:
I want a server S to send a file to a group of peers A, B, C.., etc. The file is divided into a number of packets. At the beginning, peers will send a Request message to the server to initialize transmission. Whenever S receives a request from a client, it ratelessly transmit encoded packets(how to encode is done by my design, the encoding itself has the erasure-correction capability, that's why I can transmit ratelessly via UDP) to that client. The client keeps collecting packets and try to decode them. When it finally decodes all packets and re-construct the file successfully, it sends back a Stop message to the server and S will stop transmitting to this client.
Peers request the file asynchronously (they may request the file at different time). And the server will have to be able to concurrently serve multiple peers. The encoded packets for different clients are different (they are all encoded from the same set source packets, though).
Here is what I'm thinking about the implementation. I have not much experience with unix network programming though, so I'm wondering if you can help me assess it, and see if it is possible or efficient.
I'm gonna implement the server as a concurrent UDP server with two socket ports(similar to TFTP according to the UNP book). One is to receive controlling messages, as in my context it is for the Request and Stop messages. The server will maintain a flag (=1 initially) for each request. When it receives a Stop message from the client, the flag will be set to 0.
When the serve receives a request, it will fork() a new process that use the second socket and port to send encoded packets to the client. The server keeps sending packets to the client as long as the flag is 1. When it turns to 0, the sending ends.
The client program is easy to do. Just send a Request, recvfrom() the server, progressively decode the file and send a Stop message in the end.
Is this design workable? The main concerns I have are: (1), is that efficient by forking multiple processes? Or should I use threads? (2), If I have to use multiple processes, how can the flag bit be known by the child process? Thanks for your comments.
Using UDB for file transfer is not best idea. There is no way for server or client to know if any packet has been lost so you would only know that during reconstruction assuming you have some mechanism (like counter) to detect lost packes. It would then be hard to request just one of those packets that got lost. And in the end you would have a code that would do what TCP sockets do. So I suggest to start with TCP.
Typical design of a server involves a listener thread that spawns a worker thread whenever there is a new client request. That new thread would handle communication with that particular client and then end. You should keep a limit of clients (threads) that are served simultaneously. Do not spawn a new process for each client - that is inefficient and not needed as this will get you nothing that you can't achieve with threads.
Thread programming requires carefulness so do not cut corners. Otherwise you will have hard time finding and diagnosing problems.
File transfer with UDP wil be fun :(
Your struct/class for each message should contain a sequence number and a checksum. This should enable each client to detect, and ask for the retransmission of, any missing blocks at the end of the transfer.
Where UDP might be a huge winner is on a local LAN. You could UDP-broadcast the entire file to all clients at once and then, at the end, ask each client in turn which blocks it has missing and send just those. I wish Kaspersky etc. would use such a scheme for updating all my local boxes.
I have used such a broadcast scheme on a CANBUS network where there are dozens of microControllers that need new images downloaded. Software upgrades take minutes instead of hours.

Resources