We have connected the Nexmo Voice WebSocket API (telephony) with the Google Speech Recognition API, but the quality is poor. We assume the reason is the sampling rate: Google expects 16 kHz audio that has not been upsampled. Does Nexmo support this?
See our example in https://www.youtube.com/watch?v=cIxS_CF3t00
Nexmo's voice core runs at 16-bit/16 kHz, but we are limited to whatever audio the phone company delivers the call to us in; generally this is the 8 kHz G.711 codec.
We do up-sample, but as you've found, that doesn't always work well with speech recognition APIs.
I haven't tested the Google APIs myself yet, but it's very near the top of my to-do list.
In the meantime you might want to take a look at the IBM Watson APIs, as they have a narrowband speech model which seems to work much better with telephony. There's some sample code for that here: https://github.com/nexmo-community/voice-watson-speechtotext
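To illustrate why upsampling doesn't fully help: a minimal sketch of naive linear-interpolation upsampling from 8 kHz to 16 kHz PCM (this is not Nexmo's actual resampler, just an illustration). Doubling the sample count this way cannot recover the 4-8 kHz frequency content that the telephone network never captured, which is why a 16 kHz speech model can still struggle with upsampled telephony audio.

```typescript
// Naive 8 kHz -> 16 kHz upsampling by linear interpolation.
// The output has twice as many samples, but no new high-frequency
// information: the extra samples are just midpoints of existing ones.
function upsampleLinear(samples: Int16Array): Int16Array {
  const out = new Int16Array(samples.length * 2);
  for (let i = 0; i < samples.length; i++) {
    const current = samples[i];
    // Repeat the last sample at the end of the buffer.
    const next = i + 1 < samples.length ? samples[i + 1] : current;
    out[2 * i] = current;                              // original sample
    out[2 * i + 1] = Math.round((current + next) / 2); // interpolated midpoint
  }
  return out;
}
```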
Related
Hi, I would like to know whether, if my own audioTrack is muted and I start speaking while muted, an event can be returned, similar to how Teams tells you that you are muted.
The more general question: are we able to track audio events while speaking? Dominant speaker seems to be the only audio speaking event I can find in Twilio. Any hints on obtaining an audio speaking event would be great.
Twilio developer evangelist here.
It sounds like you are using Twilio Video (since you mention dominant speaker events). Twilio Video itself doesn't have "audio speaking" events, and neither does the web platform itself.
You can, however, do some audio analysis in the browser to tell whether a person is making noise, and compare that to whether their audio track is currently enabled in order to show a warning that they are speaking while muted.
To do so, you would need to access the localParticipant's audio track. From that you can get the underlying mediaStreamTrack, turn it into a MediaStream, and then pass it to the Web Audio API for analysis. I have an example of doing this to show the volume of the localParticipant's audio here: https://github.com/philnash/phism/blob/main/client/src/lib/volume-meter.js.
Once you have that volume, you can choose a threshold above which you decide the user is trying to speak, and then check whether that threshold is crossed while the user is muted.
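The threshold comparison itself can be a tiny pure function. A minimal sketch (the function name and threshold value are illustrative, not part of the Twilio API; `volume` would come from the Web Audio analysis linked above, and `trackEnabled` from the track's enabled state):

```typescript
// Decide whether to show a "you appear to be speaking while muted" warning.
// volume: a level reading (e.g. 0..1) from Web Audio analysis of the mic.
// speakingThreshold: level above which we treat the user as speaking.
// trackEnabled: whether the local audio track is currently enabled (unmuted).
function shouldWarnMuted(
  volume: number,
  speakingThreshold: number,
  trackEnabled: boolean
): boolean {
  const isSpeaking = volume >= speakingThreshold;
  return isSpeaking && !trackEnabled; // noise detected while the track is off
}
```

You would call this on each analysis tick and debounce the warning so a single cough doesn't trigger it.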
Let me know if that helps.
I really don't have any knowledge of this area (WebRTC, video conferencing, audio conferencing, etc.).
I want to add to my system (web application) a client support using audio conference.
I was looking at Twilio, and it seems like a good solution, but I don't think it fits my case, because it always needs a virtual phone number to work, and I don't need one in my system.
What I need is something like Facebook calls, Google Hangouts (without video).
Is there any solution/library/API for this? It doesn't have to be a free solution.
Hi, we have a brain teaser on our hands.
Currently we're attempting to build a conferencing application with Tokbox; the setup is simple and the video conferencing works fine.
However, we want to be able to break into voice. This means that if users X and Y are video conferencing but user Z doesn't have a computer, Z can dial in via a Twilio phone number. The issue comes with the audio: we need the Twilio audio to be layered into the Tokbox audio so everybody can hear each other.
The best solution would be to turn off the Tokbox audio and let the Twilio client handle the audio, by posting the Tokbox audio through their client; however, this seems like it would be a slow option.
Ideally Tokbox would be able to handle the Twilio audio, but it currently doesn't have support for that.
Apart from extending Tokbox with a lot of custom code, I was just wondering if you know of any way of mixing the audio into one layer?
With OpenTok's new iOS v2.2 beta or Android v2.2 beta there is an included tutorial that lets you build your own Publisher object so that you can stream your own media (pictures, videos from your phone, etc.). You might be able to build your own Publisher object to stream Twilio's audio.
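If you do end up combining the two audio sources yourself before feeding them into a custom Publisher, the core mixing step for mono 16-bit PCM is just summing samples and clamping. A minimal sketch (not OpenTok or Twilio API code, and assuming both streams are already at the same sample rate):

```typescript
// Mix two mono 16-bit PCM buffers into one by summing sample-by-sample.
// Clamping to the int16 range avoids wrap-around distortion when both
// sources are loud at the same time; a shorter buffer is padded with silence.
function mixPcm(a: Int16Array, b: Int16Array): Int16Array {
  const length = Math.max(a.length, b.length);
  const out = new Int16Array(length);
  for (let i = 0; i < length; i++) {
    const sum = (a[i] ?? 0) + (b[i] ?? 0); // missing samples count as silence
    out[i] = Math.max(-32768, Math.min(32767, sum)); // clamp, don't wrap
  }
  return out;
}
```

In practice you might attenuate each source (e.g. multiply by 0.7) before summing to reduce clipping.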
I'm working on a desktop application built with XNA. It has a text-to-speech feature, and I'm using the Microsoft Translator V2 API to do the job. More specifically, I'm using the Speak method (http://msdn.microsoft.com/en-us/library/ff512420.aspx), and I play the audio with the SoundEffect and SoundEffectInstance classes.
The service works fine, but I'm having some issues with the audio. The quality is not very good and the volume is not loud enough.
I need a way to improve the volume programmatically (I've already tried some basic solutions on CodeProject, but the algorithms are not very good and the resulting audio is very low quality), or maybe use another API.
Are there any good algorithms to improve the audio programmatically? Are there other good text-to-speech APIs out there with better audio quality and WAV support?
Thanks in advance
If you are doing offline processing of audio, you can try Audacity; it has very good tools for offline processing. If you are processing real-time streaming audio, you can try SoliCall Pro, which creates a virtual audio device and filters all audio it captures.
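If you still want to boost the volume in code, the most common pitfall is integer wrap-around when a scaled sample exceeds the 16-bit range, which sounds like harsh distortion. A minimal sketch of gain with clipping on raw 16-bit PCM samples (illustrative only; you would apply this to the decoded WAV data before handing it to SoundEffect):

```typescript
// Apply a gain factor to 16-bit PCM samples, clamping to the int16 range.
// Clamping trades wrap-around distortion for mild clipping; modest gain
// factors (e.g. 1.5-2.0) give the cleanest results on quiet TTS output.
function applyGain(samples: Int16Array, gain: number): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const boosted = Math.round(samples[i] * gain);
    out[i] = Math.max(-32768, Math.min(32767, boosted)); // clip, don't wrap
  }
  return out;
}
```

Note that gain also amplifies whatever noise or compression artifacts the TTS service produced, so it won't fix the underlying quality issue.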
I'm developing a mobile application using J2ME. It needs a speech recognition function, so that the application can process and act upon commands given by the user. What I wanted to know is:
Is this technically possible (I'm a novice to j2me programming)?
If it is possible, where can I find a j2me library for speech recognition?
Thanks in advance,
Nuwan
This is technically possible, but in reality most devices that run J2ME aren't powerful enough to do it in pure Java code. You need to look for devices which support JSR 113 - Java Speech API 2.0.
There is a Java Speech API implementation (JSR-113) which is supposed to do speech recognition. But, unfortunately, I don't know of any device that supports it :)
If you want to implement speech recognition yourself, there are many limitations in J2ME, such as slow performance and the impossibility of accessing audio data while recording.
An in-between approach may be to do very simple ASR on the client (e.g. yes/no, digits, etc.) and send anything beyond that to a server. The limits on what the client can do may change in the future if you upgrade your phone.