I know how to provide an audio config to improve ASR when using third-party telephony partners such as Twilio.
But how can we provide an audio config to boost and improve ASR when using the Google telephony gateway?
To improve speech recognition for the Dialogflow phone gateway in Dialogflow ES, you can enable both settings under the Speech tab > SPEECH RECOGNITION QUALITY.
Enabling enhanced speech models for the Dialogflow phone gateway integration is supported only for the following languages:
English (en)
English - US (en-US)
For more information, you may refer to the following links:
Auto Speech Adaptation Documentation
Data logging and enhanced speech models
You may also want to look into using a webhook for iterative confirmation of spoken sequences to help with speech recognition.
Boosting and improving recognition through the audio config should work the same way with the Google telephony gateway. You can refer to this article for a sample implementation of the audio configuration on an agent.
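As a rough illustration of what an audio config with boosting looks like, the sketch below builds the `queryInput` part of a Dialogflow ES `detectIntent` request as plain JSON. The field names follow the Dialogflow ES `InputAudioConfig` message; the helper name `build_query_input`, the example phrases, and the 8 kHz mu-law defaults are assumptions for illustration, and the answer above does not confirm whether the phone gateway applies a per-request config.

```python
import json


def build_query_input(phrases, boost=10.0, sample_rate_hertz=8000):
    """Return a `queryInput` dict with speech-adaptation phrase hints.

    `build_query_input` is a hypothetical helper, not part of any SDK.
    """
    # Dialogflow documents boost values between 0 and 20.
    if not 0 < boost <= 20:
        raise ValueError("boost is expected to be in the range (0, 20]")
    return {
        "audioConfig": {
            # Assumed telephony defaults: 8 kHz mu-law, US English.
            "audioEncoding": "AUDIO_ENCODING_MULAW",
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": "en-US",
            # Phrases the recognizer should favour, with their boost.
            "speechContexts": [{"phrases": list(phrases), "boost": boost}],
            "model": "phone_call",
            "modelVariant": "USE_ENHANCED",
        }
    }


if __name__ == "__main__":
    body = {"queryInput": build_query_input(["account number", "PIN"], boost=15)}
    print(json.dumps(body, indent=2))
```

Higher boost values make the listed phrases more likely to be recognized, at the cost of more false positives, so it is usually worth starting low and tuning against real call audio.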
Related
I am building an Android app that uses speech synthesis and continuous speech recognition simultaneously.
To prevent the recognizer from hearing what the synthesizer plays, the user wears earphones.
When the volume is raised, the recognizer starts picking up what the synthesizer plays.
How can I help the recognizer ignore sounds from the synthesizer? Is there a way to tell the recognizer:
"okay, right now synthesis is saying the following text. If you pick it from the mic, please ignore".
The app is built in Android Studio.
Synthesis is built-in.
Recognizer is Google Cloud Speech, but I can consider other engines as well.
Some notes for more clarification
I was able to achieve very good results by giving commands in one language and playing synthesis in another language.
However, when both use the same language and the volume is raised, there are some interruptions.
It is very important to keep continuous speech recognition, because the user should have freedom to speak whenever he wants.
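One way to approximate the "please ignore what the synthesizer is saying" behaviour is to filter transcripts after recognition: keep the text currently being synthesized and discard any recognized phrase that closely matches part of it. The sketch below is an assumption-laden illustration, not a feature of Google Cloud Speech; the function names and the 0.8 threshold are hypothetical, and it is written in Python only to show the logic, which would need porting to the Android app.

```python
import difflib
import re


def _words(text):
    """Lower-case the text, drop punctuation, and split into words."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).split()


def is_probable_echo(transcript, spoken_text, threshold=0.8):
    """Return True if `transcript` looks like a fragment of `spoken_text`.

    Slides a window the size of the transcript over the synthesized text
    and compares word sequences with difflib's similarity ratio.
    """
    heard = _words(transcript)
    spoken = _words(spoken_text)
    if not heard or not spoken:
        return False
    window = len(heard)
    best = 0.0
    for start in range(max(1, len(spoken) - window + 1)):
        candidate = spoken[start:start + window]
        ratio = difflib.SequenceMatcher(None, heard, candidate).ratio()
        best = max(best, ratio)
    return best >= threshold


if __name__ == "__main__":
    playing = "The weather today is sunny with a light breeze."
    for heard in ["the weather today is sunny", "stop playback"]:
        label = "echo, ignore" if is_probable_echo(heard, playing) else "user command"
        print(f"{heard!r}: {label}")
```

This keeps recognition continuous, but it cannot help when the user happens to say the same words the synthesizer is playing, so the threshold is a trade-off.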
We have connected the Nexmo Voice WebSocket API (telephony) to the Google Voice Recognition API, but the quality is poor. We assume the reason is the sampling rate: Google requires 16 kHz audio that has not been upsampled. Does Nexmo support this?
See our example in https://www.youtube.com/watch?v=cIxS_CF3t00
Nexmo's voice core runs at 16-bit/16 kHz, but we are limited to whatever audio the phone company delivers the call to us in, which is generally the 8 kHz G.711 codec.
We do up-sample, but as you've found, that doesn't always work well with speech recognition APIs.
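To see why up-sampling does not help much, here is a minimal sketch of a naive 2x up-sampler that inserts the midpoint between neighbouring samples. Every output value is derived from the 8 kHz input, so the result has a 16 kHz rate but no new content above the original 4 kHz band. The function name is hypothetical and this is not Nexmo's implementation; production resamplers use proper low-pass filters rather than linear interpolation.

```python
def upsample_2x(samples):
    """Double the sample rate by linear interpolation.

    `samples` is a list of integer PCM values. Each original sample is
    followed by the midpoint to its successor; the last one is repeated.
    """
    out = []
    for i, current in enumerate(samples):
        following = samples[i + 1] if i + 1 < len(samples) else current
        out.append(current)
        out.append((current + following) // 2)
    return out


if __name__ == "__main__":
    narrowband = [0, 10, 20]          # pretend 8 kHz samples
    print(upsample_2x(narrowband))    # twice as many samples, same information
```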
I haven't tested the Google APIs myself yet, but it's very near the top of my to-do list.
In the meantime, you might want to take a look at the IBM Watson APIs, as they have a Narrowband speech model which seems to work much better with telephony. There's some sample code for that here: https://github.com/nexmo-community/voice-watson-speechtotext
I have been asked to develop a text-to-speech module for our product, which should support a variety of text-to-speech engines.
Is there a standard that describes how to interface with a third-party TTS (text-to-speech) or ASR (automatic speech recognition) service?
Most ASR engines use the Media Resource Control Protocol (MRCP) as the standard for their interface. It can also be used for TTS.
It depends on your purpose and the field in which you will use ASR and TTS.
You can use MRCP to control ASR and TTS media resources if you will use them in IVR (Interactive Voice Response) applications such as call centers. In that case you would interface your MRCP server with a voice gateway, such as one from Cisco, and a VXML server.
A well-known and common MRCP implementation is UniMRCP, a C implementation of the protocol. It is a good, stable open source project.
In the end, as I said, it depends on your purpose. You may never need MRCP at all: you can run your TTS engine as a standalone server if it works on its own.
Well-known open source TTS engines are MaryTTS, written in Java, and Festival, written in C++.
Well-known open source ASR engines are CMU Sphinx4, written in Java, and Julius, written in C.
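If MRCP is more than the product needs, a thin adapter layer is one common way to support several TTS engines behind a single module. The sketch below is a hypothetical design, not a standard: `TtsEngine`, `TtsRegistry`, and the placeholder `SilenceEngine` are invented names, and a real adapter would call MaryTTS, Festival, or an MRCP client inside `synthesize`.

```python
from abc import ABC, abstractmethod


class TtsEngine(ABC):
    """Common interface that every engine adapter implements."""

    @abstractmethod
    def synthesize(self, text: str, voice: str = "default") -> bytes:
        """Return raw 16-bit mono PCM audio for `text`."""


class SilenceEngine(TtsEngine):
    """Placeholder adapter: emits silence instead of calling a real engine."""

    SAMPLES_PER_CHAR = 160  # arbitrary: 10 ms per character at 16 kHz

    def synthesize(self, text: str, voice: str = "default") -> bytes:
        return b"\x00\x00" * (self.SAMPLES_PER_CHAR * len(text))


class TtsRegistry:
    """Maps engine names to adapters so callers never touch an engine directly."""

    def __init__(self) -> None:
        self._engines = {}

    def register(self, name: str, engine: TtsEngine) -> None:
        self._engines[name] = engine

    def synthesize(self, name: str, text: str, voice: str = "default") -> bytes:
        if name not in self._engines:
            raise KeyError(f"no TTS engine registered as {name!r}")
        return self._engines[name].synthesize(text, voice)


if __name__ == "__main__":
    registry = TtsRegistry()
    registry.register("silence", SilenceEngine())
    audio = registry.synthesize("silence", "hello")
    print(len(audio), "bytes of PCM audio")
```

Adding a new engine then means writing one adapter class and registering it; the rest of the product only depends on `TtsRegistry`.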
I'm working on a desktop application built with XNA. It has a text-to-speech feature, and I'm using the Microsoft Translator V2 API to do the job. More specifically, I'm using the Speak method (http://msdn.microsoft.com/en-us/library/ff512420.aspx), and I play the audio with the SoundEffect and SoundEffectInstance classes.
The service works fine, but I'm having some issues with the audio. The quality is not very good and the volume is not loud enough.
I need a way to increase the volume programmatically (I've already tried some basic solutions from CodeProject, but the algorithms are not very good and the resulting audio is very low quality), or maybe use another API.
Are there any good algorithms for improving the audio programmatically? Are there other good text-to-speech APIs out there with better audio quality and WAV support?
Thanks in advance
If you are processing audio offline, you can try Audacity, which has very good tools for that. If you are processing real-time streaming audio, you can try SoliCall Pro. It creates a virtual audio device and filters all the audio that it captures.
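For the volume part of the question, peak normalization is one simple algorithm to try before reaching for external tools: find the loudest sample and scale everything so that it sits just below full scale. This is a minimal sketch assuming 16-bit signed PCM samples already decoded into a list of integers; the function name and the 0.9 headroom factor are illustrative choices, not something the answer above prescribes.

```python
INT16_MIN = -32768
INT16_MAX = 32767


def normalize_peak(samples, headroom=0.9):
    """Scale 16-bit PCM samples so the loudest one reaches headroom * full scale.

    Returns a new list; silent or empty input is returned unchanged.
    """
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)
    gain = (INT16_MAX * headroom) / peak
    # Clamp after scaling so rounding can never overflow the 16-bit range.
    return [max(INT16_MIN, min(INT16_MAX, round(s * gain))) for s in samples]


if __name__ == "__main__":
    quiet = [100, -200, 50, 0]
    louder = normalize_peak(quiet)
    print("peak before:", max(abs(s) for s in quiet))
    print("peak after: ", max(abs(s) for s in louder))
```

Because a single gain is applied to every sample, this raises the volume without adding clipping distortion, although it also amplifies any noise already present in the recording.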
I'm developing a mobile application using J2ME. I need a speech recognition function so that the application can process and act upon commands given by the user. What I want to know is:
Is this technically possible (I'm a novice at J2ME programming)?
If it is possible, where can I find a J2ME library for speech recognition?
Thanks in advance,
Nuwan
This is technically possible, but in reality most devices that run J2ME aren't powerful enough to do it in pure Java code. You need to look for devices that support JSR 113 - Java™ Speech API 2.0.
Look at JSR 113 - Java™ Speech API 2.0.
There is a Java Speech API implementation (JSR-113), which is supposed to do speech recognition:
But, unfortunately, I don't know if any device supports it :)
If you want to implement speech recognition yourself, J2ME has many limitations, such as slow performance and the inability to access audio data while recording.
An in-between way may be to do very simple ASR on the client (e.g. yes, no, digits) and send anything beyond that to the server. The limits on what the client can do may change in the future if you upgrade your phone.
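The in-between approach can be sketched as a small routing function: try a tiny on-device vocabulary first and fall back to the server for everything else. The recognizers are passed in as plain callables because no concrete J2ME library is named in this thread; `recognize`, `LOCAL_VOCABULARY`, and the 0.8 confidence threshold are hypothetical, and the code is in Python only to show the control flow.

```python
# Words a weak on-device recognizer might plausibly handle.
LOCAL_VOCABULARY = frozenset(
    ["yes", "no", "zero", "one", "two", "three", "four",
     "five", "six", "seven", "eight", "nine"]
)


def recognize(audio, local_asr, remote_asr, min_confidence=0.8):
    """Return (text, source), where source is "client" or "server".

    `local_asr(audio)` must return (word, confidence);
    `remote_asr(audio)` must return the recognized text.
    """
    word, confidence = local_asr(audio)
    if word in LOCAL_VOCABULARY and confidence >= min_confidence:
        return word, "client"
    # Out of vocabulary or low confidence: let the server decide.
    return remote_asr(audio), "server"


if __name__ == "__main__":
    def fake_local(audio):
        return ("yes", 0.95)      # stand-in for an on-device recognizer

    def fake_remote(audio):
        return "call my office"   # stand-in for a server-side recognizer

    print(recognize(b"", fake_local, fake_remote))
```

Keeping the vocabulary in one set makes it easy to grow the client-side share later if newer devices become capable of more.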