Recommendation on client side echo cancellations - audio

I am developing a MCU based voip service. I think the traditional way of doing MCU is, you have N audio mixers at server and every participant in the call receive a steam that does not have their own voice encoded.
Guess what I wish to do is, have only 1 audio mixer running at server and (on a broadcast kind model) send the final mixer audio to every participant (For scalability obviously).
Now this obviously creates a problem of hearing your own voice coming from speaker as MCU’s output stream.
I am wondering if there is any “client side echo cancellation” project that I can use to cancel the voice of user at desktop/mobile level.

The general approach is to filter/subtract the own voice in the MCU. Doing this on the client side does not work.

Related

Capture audio/video in NodeJS from hardware, steam it to frontend

I am running a NodeJS/express server on a device (I have both BeagleBone and Raspberry Pi). This software needs to also access camera / microphone hardware directly, and send these audio/video streams to connected clients. I don't care at this point how the video /audio gets to the client, just that it gets there are can be decoded and played. The client in this case is a React Native Mobile app.
I want to take a moment to mention that this "device code" is NOT running in a browser. It's NodeJS / server side code. Consider this, for all intents and purposes, a "security" camera device. It sits somewhere and broadcasts what it sees and hears.
How do I:
a) Access video and audio streams in NodeJS
b) Get those streams into some kind of format that a web browser can play?
c) Decode the given video/audio in React Native?
d) Decode the video/audio in React (web)?
Working code examples would be greatly appreciate as opposed to explanations that lead me to dead ends when things don't work as expected.
I've been googling this for the last month and can not find answers. Can't even find someone else doing this same kind of project. Any help is appreciated. Thanks

Ways to broadcast audio from WebAudio API to server-side and then to connected clients

I am developing a colaborative instrument playing game, where multiple users will play an instrument (a synthesizer or sample, using the WebAudio API). On my first prototype I've set up a keyboard that sends note/volume signals via Socket.io to the server, and when the server gets that signal it sends it back to all connected sockets, which will play the corresponding note.
You might have guessed it right: there's a massive amount of lag and inconsistency as to the order of arrival of notes.
What are some efficient ways that I can send the output of WebAudio to the server, and have it broadcast to all connected users, so I have some sort of consistency?
You could try using a MediaStream by adding an MediaStreamAudioDestinationNode to your audio node graph as a destination and use that stream with either WebRTC or RecordRTC to send to your server.
Here is some info I found you could look at.
It does talk about using the getUserMedia method, but both getUserMedia and MediaStreamAudioDestinationNode methods send out a MediaStream constructor. This info
has some ideas on how you could send a MediaStream to your sever. However it does say that it needs to be recorded first. Not when it's live and running.
Sending a MediaStream to host Server with WebRTC after it is captured by getUserMedia
I hope this helps :)

PSTN to Internet

I'm making a PSTN interface for some voice chat software I use, just for personal use. My current plan is this:
With a mobile phone, I call a number registered with Twilio and input a keypad sequence to authenticate myself.
My server instructs Twilio to forward a SIP call to my server.
I decode the SIP stream and pipe it into my (NodeJS-run) voice bot for the voice chat, and the voice chat data from the other callers are sent back to the phone.
Is there a simpler way to do this, or is this the best plan? If possible I'd like to cut SIP out of the equation since decoding that is probably a pain. The voice bot at the end is necessary, but everything else could be changed. The core requirement is that PSTN data is turned into a computer-readable stream on the Internet and that the Internet stream can be sent back to the phone.

Mix multiple RTP streams into a single one

I am trying to build a basic conference call system based on plain RTP.
_____
RTP IN #1 ______ | | _______ MIX RTP receiver #1
|______| MIX |_____|
______| | RTP | |_______ MIX RTP receiver #2
RTP IN #2 |_____|
I am creating RTP streams on Android via the AudioStream class and using a server written in Node.js to receive them.
The naive approach I've been using is that the server receives the UDP packets and forwards them to the participants of the conversation. This works perfectly as long as there are two participants, and it's basically the same as if the two were sending their RTP stream to each other.
I would like this to work with multiple participants, but forwarding the RDP packets as they arrive to the server doesn't seem to work, probably for obvious reasons. With more than two participants, the result of delivering the packets coming in from different sources to each of the participants (excluding the sender of such packet) results in a completely broken audio.
Without changing the topology of the network (star rather than mesh) I presume that the server will need to take care of carrying out some operations on the packets in order to extract a unique output RTP stream containing the mixed input RTP streams.
I'm just not sure how to go about doing this.
In your case I know two options:
MCU or Multipoint Control Unit
Or RTP simulcast
MCU Control Unit
This is middle box (network element) that gets several RTP streams and generate one or more RTP streams.
You can implement it by yourself but it is not trivial because you need to deal with:
Stream decoding (and therefore you need jitter buffer and codecs implementation)
Stream mixing - so you need some synchronisation between streams (collect some data from source 1 and source 2, mix them and send to destination 3)
Also there are several project that can do it for you (like Asterisk, FreeSWITCH etc), you can try to write some integration level with them. I haven't heard anything about something on Node.js
Simulcast
This is pretty new technology and their specifications available only in IETF drafts. Core idea here is to send several RTP streams inside one RTP stream simultaneously.
When destination receives several RTP streams it needs to do exactly the same as MCU does - decode all streams and mix them together but in this case destination may use hardware audio mixer to do that.
Main cons for this approach is bandwidth to the client device. If you have N participants you need:
either send all N streams to all other
or select streams based on some metadata like voice activity or audio level
First one is not efficient, second is very tricky.
The options suggested by Dimtry's answer were not feasible in my case because:
The middle box solution is difficult to implement, requires too many resources or requires to rely on an external piece of software, which I didn't want to have to rely on, especially because Android RTP stack should work out of the box with basic support from a server component, especially for hole punching
The simulcast solution cannot be used because the Android RTP package cannot handle that and ad far as my understanding goes it's only capable of handling simple RTP streams
Other options I've been evaluating:
SIP
Android supports it but it's more of a high level feature and I wanted to build the solution into my own custom application, without relying on additional abstractions introduced by a high level protocol such as SIP. Also, this felt just too complex to set up, and conferencing doesn't even seem to be a core feature but rather an extension
WebRTC
This is supposed to be the de-facto standard for peer 2 peer voice and video conferencing but looking through code examples it just looks too difficult to set up. Also requires support from servers for hole punching.
Our solution
Even though I had, and still have, little experience on this I thought there must be a way to make it work using plain RTP and some support from a simple server component.
The server component is necessary for hole punching, otherwise getting the clients to talk to each other is really tricky.
So what we ended up doing for conference calling is have the caller act as the mixer and the server component as the middle-man to deliver RTP packets to the participants.
In practice:
whenever a N-user call is started, we instantiate N-1 simple UDP broadcast servers, listening on N-1 different ports
We send those N-1 ports to the initiator of the call via a signaling mechanism built on socket.io and 1 port to each of the remaining participants
The server component listening on those ports will simply act as a relay: whenever it receives a UDP packet containing the RTP data it will forward it to all the connected clients (the sockets it has seen thus far) except the sender
The initiator of the call will receive and send data to the other participants, mixing it via the Android AudioGroup class
The participants will send data only to the initiator of the call, and they will receive the mixed audio (together with the caller's own voice and the other participants' voices) on the server port that has been assigned to them
This allows for a very simple implementation, both on the client and on the server side, with minimal signaling work required. It's certainly not a bullet proof conferencing solution, but given the simplicity and feature completeness (especially regarding common network issues like NAT traversal, which using a server aid is basically a non-issue) is in my opinion better than writing lots of code which requires many resources for mixing server-side, relying on external software like SIP servers, or using protocols like WebRTC which basically achieve the same with lots more effort implementation wise.

Continuous audio download stream

I'm looking to set up a server which will read from a some audio input device and serve that audio continuously to clients.
I don't need the audio to be necessarily be played by the client in real time I just want to be able for the client to start downloading from the point at which they join and then leave again.
So say the server broadcasts 30 seconds of audio data, a client could connect 5 seconds in and download 10 seconds of it (giving them 0:05 - 0:15).
Can you do this kind of partial download over TCP starting at whenever the client connects and end up with a playable audio file?
Sorry if this question is a bit too broad and not a 'how do a set variable x to y' kinda question. Let me know if there's a better forum to to post this in.
Disconnect the concepts of file and connection. They're not related. A TCP connection simply supports the reliable transfer of data. Nothing more. What your application chooses to send over that connection is its business, so you need to set your application in a way that it sends the data you want.
It sounds like what you want is a simple HTTP progressive internet radio stream, which is commonly provided by SHOUTcast and Icecast servers. I recommend Icecast to get started. The user connects, they get a small buffer of a few second in front to get them started (optional), and when they disconnect, that's it.

Resources