Google Speech-to-Text captures the audio coming from the speaker - Node.js

I am using the Google Speech-to-Text API to perform streaming speech recognition on an audio stream, such as microphone input in video conferencing.
However, if the output device is a speaker and two or more peers are speaking simultaneously, Google Speech-to-Text also captures the audio coming from the speaker, even if the peer is not speaking.
How can I provide processed audio to Speech-to-Text or the SoX recorder to avoid this issue?
I am using Node.js and Angular.
I tried to store the streamed audio in a file and then provide it to the speech client.
Expected result:
Speech-to-Text transcribes only the microphone audio and does not recognize the speaker audio.

Related

Stream audio (background music) for Unreal Engine 4

Is it possible to stream audio (via a REST API) in UE4 for a game? I checked the UE4 audio docs, and it seems we need to import audio files as assets and load them accordingly. Is there any way to stream audio from a server as a simple HTTP stream for a game in UE4?
Any advice would be a great help.
(https://docs.unrealengine.com/en-US/WorkingWithMedia/Audio/index.html)

Transcribing an audio file using the Azure Cognitive Services SDK | Python 3.x

Microsoft has a library for transcription, but the official examples only use microphone input. I want to understand how to transcribe an audio file in WAV format.
The official documentation only illustrates how to recognize speech from microphone input.
I think you could try using a custom audio stream; please refer to this article.
The Speech SDK's Audio Input Stream API provides a way to stream audio
streams into the recognizers instead of using either the microphone or
the input file APIs.
However, you need to make sure your format is supported by the Azure Speech Service. As the documentation states:
Currently, only the following configuration is supported:
Audio samples in PCM format, one channel, 16000 samples per second,
32000 bytes per second, two block align (16 bit including padding for
a sample), 16 bits per sample.
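As a rough illustration of the requirement quoted above, the sketch below reads a WAV header with Python's standard `wave` module and reports any mismatch with the stated configuration (one channel, 16000 samples per second, 16 bits per sample). It does not call the Speech SDK. The names `check_wav_format` and `make_pcm_wav` are hypothetical helpers introduced only for this example.

```python
import io
import wave

# Values taken from the quoted requirement: PCM, one channel, 16 kHz, 16-bit.
# The byte rate follows from them: 16000 * 1 * 2 = 32000 bytes per second.
REQUIRED_CHANNELS = 1
REQUIRED_SAMPLE_RATE = 16000
REQUIRED_SAMPLE_WIDTH = 2  # bytes per sample (16 bits)


def check_wav_format(source):
    """Return a list of mismatches between a WAV header and the required format.

    `source` may be a file path or a binary file-like object.
    An empty list means the header matches the quoted configuration.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != REQUIRED_CHANNELS:
            problems.append(f"channels: {wav.getnchannels()}")
        if wav.getframerate() != REQUIRED_SAMPLE_RATE:
            problems.append(f"sample rate: {wav.getframerate()}")
        if wav.getsampwidth() != REQUIRED_SAMPLE_WIDTH:
            problems.append(f"bytes per sample: {wav.getsampwidth()}")
    return problems


def make_pcm_wav(channels, sample_rate, sample_width, frames=160):
    """Build a silent in-memory WAV file, used here only to try out the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00" * (frames * channels * sample_width))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(check_wav_format(make_pcm_wav(1, 16000, 2)))
    print(check_wav_format(make_pcm_wav(2, 8000, 2)))
```

If the list is not empty, the file would need to be converted (for example with an external tool) before being pushed into a custom audio stream.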

Nexmo audio sampling rate

We have connected the Nexmo Voice WebSocket API (telephony) with the Google Voice Recognition API, but the quality is poor. We assume the reason is the sampling rate. Google requires 16 kHz audio that has not been upsampled. Does Nexmo support this?
See our example in https://www.youtube.com/watch?v=cIxS_CF3t00
Nexmo's voice core runs at 16-bit/16 kHz, but we are limited to whatever audio the phone company delivers the call to us at; generally this is 8 kHz with the G.711 codec.
We do up-sample, but as you've found, that doesn't always work well with speech recognition APIs.
I haven't tested the Google APIs myself yet, but it's very near the top of my to-do list.
In the meantime, you might want to take a look at the IBM Watson APIs, as they have a Narrowband speech model which seems to work much better with telephony. There's some sample code for that here: https://github.com/nexmo-community/voice-watson-speechtotext
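To illustrate why up-sampling rarely helps recognition, here is a minimal sketch that doubles the sample rate of 8 kHz audio by linear interpolation. This is an assumption made for illustration only, not Nexmo's actual resampler, and `upsample_2x_linear` is a hypothetical name. Every inserted sample is computed from its neighbours, so the output has twice as many samples but carries no speech information that was not already in the 8 kHz signal.

```python
def upsample_2x_linear(samples):
    """Double the sample rate of a sequence of integer PCM samples.

    Each original sample is kept, and the midpoint between it and the
    next sample is inserted after it. The last sample is repeated so the
    result is exactly twice as long as the input.
    """
    if not samples:
        return []
    out = []
    for current, following in zip(samples, samples[1:]):
        out.append(current)
        out.append((current + following) // 2)  # interpolated, not measured
    out.append(samples[-1])
    out.append(samples[-1])  # hold the final value
    return out


if __name__ == "__main__":
    narrowband = [0, 100, 200]  # three samples at 8 kHz
    print(upsample_2x_linear(narrowband))  # six samples at 16 kHz
```

This is why a model trained on narrowband telephony audio can do better than feeding up-sampled audio to a wideband model.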

Record audio from both sides of a Skype call in Node.js

I am writing a Node.js program for transcribing Skype calls using the Google Speech API. I can easily read the microphone input, but I cannot read what the other person is saying. Is there a way to do this using Node.js?

Audio stream interfacing APIs, Tokbox and Twilio

Hi, so we have a brain teaser on our hands.
Currently we're attempting to build a conferencing application with Tokbox. The setup is simple and the video conferencing works fine.
However, we want to be able to break into voice: if users X and Y are video conferencing but user Z doesn't have a computer, Z can dial in via a Twilio phone number. The issue comes with the audio. We need the Twilio audio to be layered into the Tokbox audio so everybody can hear each other.
The best solution is to turn off the Tokbox audio and let the Twilio client handle the audio by posting the Tokbox audio through their client; however, this seems like it would be a slow option.
Ideally Tokbox would be able to handle the Twilio audio, but it currently doesn't have support for that.
Apart from extending Tokbox with a lot of custom code, I was just wondering if you guys know any way of mixing audio into one layer?
With OpenTok's new iOS v2.2 beta or Android v2.2 beta there is an included tutorial that lets you build your own Publisher object so that you can stream your own media (pictures, videos from your phone, etc.). You might be able to build your own Publisher object to stream Twilio's audio.
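On the general question of mixing audio into one layer, the core operation can be sketched independently of either service. The example below assumes both streams have already been decoded to 16-bit mono PCM at the same sample rate and in the machine's native byte order. It does not use any Tokbox or Twilio API, and `mix_pcm16` is a hypothetical name. It sums the two buffers sample by sample and clips the result to the 16-bit range.

```python
from array import array

INT16_MIN, INT16_MAX = -32768, 32767


def mix_pcm16(first: bytes, second: bytes) -> bytes:
    """Mix two 16-bit mono PCM buffers by summing samples and clipping.

    The shorter buffer is padded with silence so that both contribute
    over the full length of the longer one.
    """
    a, b = array("h"), array("h")
    a.frombytes(first)
    b.frombytes(second)

    length = max(len(a), len(b))
    a.extend([0] * (length - len(a)))
    b.extend([0] * (length - len(b)))

    # Clip each sum so loud passages saturate instead of wrapping around.
    mixed = array(
        "h",
        (max(INT16_MIN, min(INT16_MAX, x + y)) for x, y in zip(a, b)),
    )
    return mixed.tobytes()


if __name__ == "__main__":
    caller = array("h", [1000, 30000]).tobytes()
    conference = array("h", [2000, 30000, -5]).tobytes()
    print(list(array("h", mix_pcm16(caller, conference))))
```

A real bridge would also have to handle resampling, jitter, and latency between the two sources, which this sketch leaves out.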

Resources