Is it possible to improve the speech recognizer in the Actions on Google simulator? (dialogflow-es)

I am working on a Dialogflow project, building a chatbot in Dutch. When testing the bot in the Actions on Google simulator, the biggest issue we find is that the speech recognizer misrecognizes half the numbers we say. The project has the Dutch locale and works fine with short strings of digits, but as soon as we go past five digits in a row the recognition falls apart.
Does anyone know a solution for this?
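The simulator itself exposes no recognizer settings, so there is nothing to tune from inside it. If you front the agent with your own audio integration (a phone gateway, a web client, etc.), the underlying Cloud Speech-to-Text API does let you bias recognition toward digit sequences via speech adaptation. A minimal sketch, assuming the google-cloud-speech Python package and that the digit-sequence class token is supported for nl-NL:

    # Sketch: biasing Cloud Speech-to-Text toward long digit sequences.
    # Assumes google-cloud-speech is installed and credentials are configured;
    # this applies when you control the STT call yourself, not in the simulator.
    from google.cloud import speech

    client = speech.SpeechClient()

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="nl-NL",
        # Speech adaptation: the built-in digit-sequence class token tells the
        # recognizer to expect long runs of spoken digits.
        speech_contexts=[
            speech.SpeechContext(phrases=["$OOV_CLASS_DIGIT_SEQUENCE"])
        ],
    )

    with open("utterance.wav", "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(result.alternatives[0].transcript)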

Related

How do you speed up Dialogflow CX speech recognition on single-word responses?

I have a Dialogflow CX agent working in the Polish (pl) language as an audio bot using AudioCodes.
I want it to respond to yes/no answers (pl: "tak"/"nie"), yet it usually takes 15 seconds or more to detect the end of the utterance. Enabling advanced speech settings and setting "End of speech sensitivity" and "No speech timeout" does not help.
I'd love to set some AudioCodes parameters, like fast STT recognition, but I don't know where to set them.
Any ideas on how to speed up the detection time? Forcing users to respond with two or more words is not allowed in my case.
Well, "End of speech sensitivity" in Dialogflow CX is only available for agents using the en-US language tag, as seen here.
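If the sensitivity knob is gated to en-US, one API-level workaround (hedged: I have not verified that AudioCodes passes this configuration through) is to let the recognizer endpoint itself after a single utterance and to bias it toward the two expected answers. A minimal sketch with the google-cloud-dialogflow-cx Python package:

    # Sketch: asking Dialogflow CX to stop listening after the first utterance.
    # Session wiring is omitted; only the audio config is shown.
    from google.cloud import dialogflowcx_v3 as df

    audio_config = df.InputAudioConfig(
        audio_encoding=df.AudioEncoding.AUDIO_ENCODING_LINEAR_16,
        sample_rate_hertz=8000,  # typical telephony rate
        # Endpoint as soon as one utterance ("tak"/"nie") is detected,
        # instead of waiting out a long no-speech window.
        single_utterance=True,
        # Bias recognition toward the expected short answers.
        phrase_hints=["tak", "nie"],
    )

    query_input = df.QueryInput(
        audio=df.AudioInput(config=audio_config),
        language_code="pl",
    )
    # query_input goes into the first StreamingDetectIntentRequest; the
    # requests that follow carry the raw audio chunks.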

How to detect filler sounds like um, uh, etc. using CMUSphinx / Mozilla DeepSpeech / Google STT?

I am working on a project in speech recognition, and the task is to detect filler sounds like um, uh, eh, etc. in audio clips of children/students speaking English. Their spoken English is not that great.
How can this be done using CMUSphinx, Mozilla DeepSpeech, Google Cloud Speech, or Kaldi?
Or do I need to start from scratch?
I have also gone through other posts and papers on how to build an ASR system, but since this is not a long-term project, I don't have the time to build one from scratch and evaluate the results. I am also okay with lower accuracy that I can improve later on.
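One caveat up front: the big cloud engines tend to strip disfluencies from their transcripts, so an open-source keyword-spotting pass is usually the quicker route for this. A minimal sketch with PocketSphinx keyword search, assuming the classic pocketsphinx-python bindings and two small files you write yourself (fillers.dict and fillers.kws are hypothetical names):

    # Sketch: spotting filler words with PocketSphinx keyword search.
    #
    # fillers.dict maps each filler to phones, e.g.:
    #   um  AH M
    #   uh  AH
    # fillers.kws lists keyphrases with detection thresholds, e.g.:
    #   um /1e-10/
    #   uh /1e-10/
    from pocketsphinx import AudioFile

    audio = AudioFile(
        audio_file="student_clip.raw",  # 16 kHz, 16-bit mono raw PCM is safest here
        lm=False,                       # disable the language model...
        kws="fillers.kws",              # ...and run keyword search instead
        dic="fillers.dict",
    )

    # Each detection comes back with frame-level timing (100 frames/second),
    # so you also get where in the clip each filler occurred.
    for phrase in audio:
        for word, prob, start_frame, end_frame in phrase.segments(detailed=True):
            print(word, start_frame / 100.0, end_frame / 100.0)

The thresholds in fillers.kws need tuning per recording condition: looser thresholds catch more fillers but also produce more false positives.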

Perform Speech-to-Text in Python using pre-transcribed text as guide

I'm working on a Python application that's meant to align video clips based on what actors are saying on screen.
As an example, I have a scene where actors are reading dialogue from a script. They perform the 3-minute scene 10 times.
I am currently transcribing what they say using speech-to-text, but because the actors are reading the same dialogue repeatedly, I want to use the pre-transcribed dialogue (the movie script) to guide the speech-to-text engine toward a more accurate result.
For example:
"Are you telling me you built a time machine out of a Delorean?"
Speech to text returns:
"Are you talking me you building a time machine out of a daylight?"
Using the original script, I should be able to figure out where the mistakes are, estimate the correct line, and lock everything against the movie script.
I'm currently using CMUSphinx in Python to get my STT data and it works very well. But I'm having some trouble with the logic on this next part.
I'll post some code shortly!
EDIT: Discovered that the search term I was looking for is "audio aligner" and "long audio aligner." These seem to be tools included in some STT packages. CMUSphinx in particular may have the ability to do this built in. Exploring that.
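In the meantime, the text-level correction described above can be prototyped with nothing but the standard library: align the recognized words against the script words and prefer the script wherever the two disagree. A minimal sketch (the two strings are stand-ins for real data):

    # Sketch: snapping noisy STT output back to a known script with difflib.
    import difflib

    script_text = "are you telling me you built a time machine out of a delorean"
    stt_text = "are you talking me you building a time machine out of a daylight"

    script_words = script_text.split()
    stt_words = stt_text.split()

    matcher = difflib.SequenceMatcher(None, stt_words, script_words)
    corrected = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            corrected.extend(stt_words[i1:i2])     # STT agreed with the script
        else:
            corrected.extend(script_words[j1:j2])  # otherwise trust the script

    print(" ".join(corrected))
    # -> are you telling me you built a time machine out of a delorean

This only repairs the text; a long audio aligner is still the right tool when you also need word timings for the cut points.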

How do you leave the microphone on in Android Studio

I am currently doing a speech-to-text project. The person in question could be talking at random and for random lengths of time, and on various phrases the app responds according to the phrase. At the moment the user has to press a button to begin each recording. So far I have no issues with doing speech to text and grabbing key phrases from the snippet, but having to continually press the button is not an option. So where should I be looking, so that the user presses start and stop once and the microphone stays on in between?
This of course leads to other questions and issues. I'm not looking for answers to the following things, but I will state them so that you can see what I have considered.
Will the mobile device overheat with the mic staying on?
How do you leave the mic on while converting two seconds of speech to text every two seconds?
I did instantiate two overlapping instances of turning on the mic; I can tell you the device did not like that at all.
Thanks for any hints and suggestions
I am sure there will be other issues I've not thought of.
I am using a PC and the latest version of Android Studio (1.4, I believe).
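Not an Android answer as such, but the keep-the-mic-open pattern itself can be sketched platform-agnostically. In Python, with the third-party speech_recognition package (an assumption; on Android the equivalent is restarting SpeechRecognizer from its onResults/onError callbacks), a background listener keeps capturing while earlier chunks are still being transcribed:

    # Sketch: continuous chunked listening with the speech_recognition package.
    import time
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    microphone = sr.Microphone()

    def on_phrase(r, audio):
        # Runs on a worker thread for each captured chunk; the microphone
        # keeps listening while this transcription is in flight.
        try:
            print(r.recognize_google(audio))
        except sr.UnknownValueError:
            pass  # unintelligible chunk; keep going

    with microphone as source:
        recognizer.adjust_for_ambient_noise(source)

    # phrase_time_limit caps each chunk, roughly the "2 seconds at a time" idea.
    stop_listening = recognizer.listen_in_background(
        microphone, on_phrase, phrase_time_limit=2
    )

    time.sleep(30)    # listen for a while...
    stop_listening()  # ...then release the microphone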

How to convert human voice into digital format?

I am working on a project where a biometric system is used to secure access, and we are planning to use the human voice for it.
The idea is that a person says some words or sentences and the system stores that voice in digital format. The next time the person wants to enter the system, he/she has to speak some words, which may or may not be the same as the words used earlier.
We don't want to match the words themselves; we want to match the frequency characteristics of the voice.
I have read some research papers on such systems, but those papers don't include any implementation details.
So I just want to know whether there is any software/API that can convert an analog voice into digital format and also tell us the frequency of the voice.
Until now I have worked on ordinary web-based applications, so I know common APIs and platforms such as Java EE and C#, but I have no experience with this kind of application.
Please enlighten me!
http://www.loquendo.com/en/products/speaker-verification/
http://www.nuance.com/for-business/by-solution/contact-center-customer-care/cccc-solutions-services/verifier/index.htm
(two links removed due to reported virus content)
http://www.persay.com/products.asp
This is as good a starting point as any: http://marsyas.info/
It's an open-source software framework for audio processing. They've listed a bunch of projects that have used the framework in various ways, so you can probably draw inspiration from them.
http://marsyas.info/about/projects. The Telligence project in particular seems closest to your needs, as it was used to classify audio by gender: http://marsyas.info/about/projects#5Teligence
I believe there are two steps to a project like this:
The first step is to record the voice from an analog input into a digital format (let's assume WAV/PCM). For this you can use the DirectShow API in C#, or standard Wave-In as in this project: http://www.codeproject.com/KB/audio-video/cswavrec.aspx. You may want to compress your audio files later on; there are many options for this, and on Windows you might consider the Windows Media Format SDK to avoid licensing issues with other formats.
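For comparison, the same first step in Python instead of C# might look like the sketch below, assuming the third-party sounddevice package; any WAV recorder produces the same 16-bit PCM result:

    # Sketch: capture the microphone to a 16-bit PCM WAV file.
    import wave
    import sounddevice as sd

    RATE, SECONDS = 16000, 5

    # Record five seconds of mono 16-bit audio from the default input device.
    frames = sd.rec(int(RATE * SECONDS), samplerate=RATE, channels=1, dtype="int16")
    sd.wait()  # block until the recording finishes

    with wave.open("enrollment.wav", "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(RATE)
        wav.writeframes(frames.tobytes())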
The second step is to build or use a voice recognition framework. If you want to build one, you will probably need to define a set of "features" for your sound fragments and select and implement a recognition algorithm; there are many approaches available, and the IEEE and ACM.org websites are usually good sources. If you want to use an existing framework, you may want to consider Nuance Recognizer (commercial) or http://cmusphinx.sourceforge.net (open source).
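As a toy version of the second step, once the voice is digital, a crude "voice frequency" measurement is just arithmetic on the samples. The sketch below estimates the fundamental frequency of a mono 16-bit WAV by autocorrelation, using only numpy and the standard library; real speaker verification relies on much richer features (MFCCs and the like):

    # Sketch: crude fundamental-frequency (pitch) estimate via autocorrelation.
    import wave
    import numpy as np

    with wave.open("enrollment.wav", "rb") as wav:
        rate = wav.getframerate()
        pcm = wav.readframes(wav.getnframes())  # assumes mono, 16-bit
    signal = np.frombuffer(pcm, dtype=np.int16).astype(np.float64)
    signal -= signal.mean()

    # Autocorrelate a short frame from the middle of the clip (assumed voiced);
    # the correlation peaks at lags matching the pitch period.
    mid = len(signal) // 2
    frame = signal[mid : mid + 2048]
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]

    # Search lags corresponding to 50-400 Hz, the usual range of human pitch.
    lo, hi = rate // 400, rate // 50
    lag = lo + int(np.argmax(corr[lo:hi]))
    print(f"Estimated fundamental frequency: {rate / lag:.1f} Hz")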
Hope this helps.
