How do you speed up Dialogflow CX speech recognition on single word responses? - speech-to-text

I have a Dialogflow CX agent working in the Polish [pl] language as an audio bot using AudioCodes.
I want it to respond to yes/no answers (Polish: "tak"/"nie"), yet it usually takes 15 seconds or more to detect the end of the utterance. Enabling advanced speech settings and adjusting "End of speech sensitivity" and "No speech timeout" does not help.
I'd love to set some AudioCodes parameters, like fast STT recognition, but I don't know where to set them.
Any ideas on how to speed up the detection time? Forcing users to respond with two or more words is not allowed in my case.

Well, "End of speech sensitivity" in Dialogflow CX is only available for agents using the en-US language tag, as seen here.

Related

Perform Speech-to-Text in Python using pre-transcribed text as guide

I'm working on a python application that's meant to align video clips based on what actors are saying on screen.
As an example, I have a scene where actors are reading dialogue from a script. They do the 3-minute scene 10 times.
I am currently transcribing what they say using speech-to-text, but because the actors are reading the same dialogue repeatedly, I want to use the pre-transcribed dialogue (the movie script) to help guide the speech-to-text engine to be more accurate.
For example:
"Are you telling me you built a time machine out of a Delorean?"
Speech to text returns:
"Are you talking me you building a time machine out of a daylight?"
Using the original script, I should be able to figure out where the mistakes are, estimate the correct line, and lock everything against the movie script.
I'm currently using CMUSphinx in Python to get my STT data and it works very well. But I'm having some trouble with the logic on this next part.
I'll post some code shortly!
EDIT: Discovered that the search term I was looking for is "audio aligner" and "long audio aligner." These seem to be tools included in some STT packages. CMUSphinx in particular may have the ability to do this built in. Exploring that.
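Not from the original post, but before reaching for a dedicated aligner, the "lock against the script" step can be prototyped as a word-level sequence alignment between the STT output and the known script, for example with difflib from the standard library:

    # Rough sketch: align noisy STT words against the known script words
    # and substitute the script's wording wherever the two disagree.
    import difflib

    def align_to_script(stt_text: str, script_text: str) -> list[tuple[str, str]]:
        stt_words = stt_text.lower().split()
        script_words = script_text.lower().split()
        matcher = difflib.SequenceMatcher(a=stt_words, b=script_words)

        pairs = []  # (recognized, corrected-from-script)
        for op, a1, a2, b1, b2 in matcher.get_opcodes():
            if op == "equal":
                pairs.extend(zip(stt_words[a1:a2], script_words[b1:b2]))
            else:
                # Mismatch: trust the script for the corrected text.
                pairs.append((" ".join(stt_words[a1:a2]),
                              " ".join(script_words[b1:b2])))
        return pairs

    stt = "are you talking me you building a time machine out of a daylight"
    script = "are you telling me you built a time machine out of a delorean"
    for recognized, corrected in align_to_script(stt, script):
        if recognized != corrected:
            print(f"{recognized!r} -> {corrected!r}")

A long-audio aligner will also give you timestamps, which this word-level sketch does not, but it is often enough to find and correct the mis-recognized spans.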

Is it possible to improve the speech recognizer in the Actions-on-Google simulator

I am working on a DialogFlow project building a chatbot in Dutch. When testing the bot in the Actions-on-Google simulator, the biggest issue we find is that the speech recognizer does not recognize half the numbers we say. The project has the Dutch locale and works fine with short strings of numbers, but as soon as we go over 5 numbers in a row the speech recognizer goes funny.
Does anyone know a solution for this?

Problem with transcription of short audio clips ("yes", "no") with Google

I'm having difficulties when trying to transcribe short user audio answers such as "yes" or "no".
I'm using the Dialogflow detectIntent function with audio as input, but the same thing happens using the Google Speech-to-Text API. I assume both use the same algorithms. Basically, the problem is that in a lot of cases the response is empty.
Audio clips are taken from a phone call (MULAW, 8 kHz), and the encoding and sample rate match what I'm sending in the request, because it works with almost all the audio clips.
We only have a problem with short responses. We can hear the audio and the word (yes/no) is quite clear, but both Dialogflow and Google Speech-to-Text return an empty response.
Did someone have the same problem? Is there any configuration that can be applied to solve or mitigate this problem?
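Not a confirmed fix, but two things worth trying on the plain Speech-to-Text side are the phone_call model and speech adaptation phrases that bias the recognizer toward the expected one-word answers. A minimal sketch with the Python client, assuming a MULAW 8 kHz clip read from disk and an English-speaking caller:

    # Hedged sketch: transcribe a short MULAW/8kHz telephony clip, biasing
    # the recognizer toward short yes/no answers.
    from google.cloud import speech_v1 as speech

    def transcribe_short_clip(path: str) -> str:
        client = speech.SpeechClient()
        with open(path, "rb") as f:
            content = f.read()

        config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.MULAW,
            sample_rate_hertz=8000,
            language_code="en-US",
            model="phone_call",  # model tuned for telephony audio
            speech_contexts=[speech.SpeechContext(phrases=["yes", "no"])],
        )
        audio = speech.RecognitionAudio(content=content)

        response = client.recognize(config=config, audio=audio)
        return " ".join(r.alternatives[0].transcript for r in response.results)

If the short clips start with the word right at the beginning, also check that the clip is not being trimmed so tightly that the recognizer has no leading silence to work with; padding a few hundred milliseconds of silence sometimes helps with one-word utterances.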

Microsoft Speech Recognition defaults vs API

So I've been using Microsoft Speech Recognition in Windows 10, doing the training exercises, dictating text into Wordpad and correcting it, adding words to the dictionary and so on. I would like to use the software to transcribe .wav files. It appears one can do this using the Windows Speech Recognition API, but this seems to involve creating and loading one's own grammar files, which suggests to me that this would basically create a new speech recognizer, that uses the same building blocks but is a different program from the one that runs when I click "Start Speech Recognition" in the start menu. In particular, it would perform differently because of differences in training or configuration.
Am I wrong in this? And if I'm not, is there still a way of retrieving all the data the default speech recognizer uses so I can reproduce its behavior exactly? If I need to create a separate speech recognizer with its own grammar files and separate training history and so on in order to transcribe .wav files, then so be it, but I'd like to better understand what's going on here.
The Woundify open source project contains examples of how to convert wav files to text (STT).

Are there any open-source phoneme sets (for speech synthesis)?

I am trying to make a super basic speech synthesizer, and I need some form of phoneme audio files so that I can piece them together and build words. Are there any open phoneme sets that I would be able to use for this?
For a super basic speech synthesizer it's worth checking out espeak (http://espeak.sourceforge.net); it's better than gluing sound files together.
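Roughly like this, assuming the espeak command-line tool is installed (-w writes the synthesized speech to a wav file instead of playing it):

    # Rough sketch: call the espeak CLI from Python and write a wav file.
    import subprocess

    def synthesize(text: str, out_path: str = "out.wav") -> None:
        subprocess.run(
            ["espeak", "-v", "en", "-w", out_path, text],
            check=True,
        )

    synthesize("Hello world")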
This may be more than you're looking for, but have you checked into http://www.vocaloid.com/en/ by any chance? There are many speech products on the market. You might also be interested in http://msdn.microsoft.com/en-us/library/hh361572(v=office.14).aspx
