How to disable disfluency removal for Google Cloud Speech to Text API - speech-to-text

I am building an app that captures user audio and analyzes disfluency in a reader's speech, so it is important for me to know all forms of disfluency.
I noticed that Google's speech to text cloud API automatically removes disfluencies in speech. For example:
"so uhh, I will probably do that umm probably next week"
Gets transcribed to:
"so I will probably do that probably next week"
Is there a way to keep the uhhs and umms?

Related

Azure speech to text full breaks/filler words detection

I've been looking for a model that is capable of detecting what we call "full breaks" or "filler words" such as "eh", "uhmm", "ahh", but Azure doesn't get them.
I've been playing with Azure's speech to text web UI, but it seems it doesn't catch these types of words/expressions.
I wonder if there is some option in the API configuration to "toggle" the detection of full breaks or filler words.
Thank you in advance

How to change stress in words in Azure Speech

Please tell me how I can change the stress in some words in the Azure text-to-speech voice engine. I use Russian voices and am not working through SSML.
When I send text for processing, the engine sometimes puts the stress on the wrong syllable or letter in certain words.
I know that some voice engines use special characters such as + or ' in front of a stressed vowel, but I have not found such an option here.
To specify the stress for individual words, you can use the SpeakSsmlAsync method and pass a lexicon URL, or you can specify it directly in the SSML by using the phoneme element. In both cases you can use IPA.
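As a sketch, a phoneme element supplying an IPA pronunciation could look like the following (the voice name and the example word are illustrative assumptions, not taken from the question):

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="ru-RU">
  <voice name="ru-RU-SvetlanaNeural">
    <!-- Force the stress onto the second syllable by giving the IPA pronunciation -->
    Откройте <phoneme alphabet="ipa" ph="zɐˈmok">замок</phoneme>.
  </voice>
</speak>
```

The word "замок" is a convenient test case because its meaning changes with stress ("lock" vs. "castle"), so a wrong default is easy to hear.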

Control Actions-on-Google Media-Response (e.g. start at minute 3)

I want to develop a Google Action (ideally using Dialogflow).
But the action needs some features for which I couldn't find a solution, and I'm not sure they are even possible.
My use cases:
The Google Action starts an mp3. Someone stops and exits the action, and when the user starts it again, I would like to resume the mp3.
But I couldn't find a way to determine the "offset" at which the user stopped the mp3.
And even if I had this offset, I didn't find a way to tell Google Assistant to play the mp3 starting at, e.g., minute 51.
I would be really surprised if the Google Action possibilities were so extremely restricted.
Can someone confirm that these use cases are not possible, or can someone give me a hint how to do it?
I only found this one, which is restricted to starting an mp3 from the beginning.
https://developers.google.com/actions/assistant/responses#media_responses
Kind Regards
Stefan
To start an mp3 file at a certain point, you can try the SSML audio tag and its clipBegin property.
clipBegin - A TimeDesignation that is the offset from the audio source's beginning to start playback from. If this value is greater than or equal to the audio source's actual duration, then no audio is inserted.
https://developers.google.com/actions/reference/ssml
To use this, your mp3 file has to be hosted over HTTPS.
Hope that this helps.
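A minimal sketch of such an SSML response, assuming a placeholder HTTPS URL for the mp3:

```xml
<speak>
  <!-- clipBegin skips the first 3060 seconds (51 minutes) of the file -->
  <audio src="https://example.com/podcast.mp3" clipBegin="3060s">
    Sorry, the audio could not be played.
  </audio>
</speak>
```

The inner text serves as a fallback if the audio cannot be fetched; clipBegin takes an SSML time designation such as "3060s" or "250ms".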
You could use Conversational Actions (instead of Dialogflow), where media responses allow using a start_offset:
....
"content": {
  "media": {
    "start_offset": "2.12345s",
...
For more details see
https://developers.google.com/assistant/conversational/prompts-media#MediaResponseProperties
Conversational Actions also seem to be the newest technology for Google Actions, or at least the most recently released.
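A fuller sketch of such a media prompt might look like the following; the URL and media name are placeholder assumptions, and the snake_case field names follow the snippet quoted above, so treat this as illustrative rather than a definitive payload:

```json
{
  "content": {
    "media": {
      "media_type": "AUDIO",
      "start_offset": "3060s",
      "media_objects": [
        {
          "name": "Episode 12",
          "url": "https://example.com/podcast.mp3"
        }
      ]
    }
  }
}
```

To resume playback, your webhook would store the offset reported when the user pauses and send it back as start_offset on the next session.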

Google Home -> Dialogflow entity matching very bad for non-dictionary entities?

With Dialogflow (API.AI) I find that vessel names are not matched well when the input comes from Google Home.
It seems the speech-to-text engine completely ignores them and transcribes based only on dictionary words, so Dialogflow can't match the resulting text at the end.
Is it really like that, or is there some way to improve it?
Thanks and
Best regards
I'd recommend looking at Dialogflow's training feature to identify where the speech recognition of the Google Assistant may not have worked the way you expect. In those cases, you'll see how Google's speech recognition detected words you may not have accounted for. Where you'd like to match these unrecognized words to an entity value, simply add them as synonyms.
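As an illustrative sketch, an entity entry mapping common misrecognitions back to the canonical vessel name could look like this (the vessel name and its misheard variants are invented examples):

```json
{
  "value": "Evergreen",
  "synonyms": ["Evergreen", "ever green", "every green"]
}
```

Each misrecognition you spot in the training view becomes another synonym, so future utterances resolve to the same entity value.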

Match with Phrase for Google Speech API

I capture audio from a speaker saying "I want to meet John Disilva". I pass this to the Google Speech API with the phrases { 'John Disilva', 'Ashish Mundra' }. However, the Google Speech API returns the full utterance, i.e. 'I want to meet John Disilva'.
Is there a way I can get only my phrase as the return value, as I am only interested in extracting the name part?
The reason is that I cannot control what someone says into my mic. They can say 'I would like to see John Disilva' or 'Do you know John Disilva', but I am sure that my user will always have that name somewhere in the sentence, and I want to extract it.
If the Google Speech API could give me the exact phrase via which it detected John Disilva in that sentence, I could use that phrase for further processing in my code.
This isn't possible with the Google Speech API. Your best bet may be to do post-processing to see which name is present. If you need something more accurate than that, look for an ASR system that supports "keyword spotting."
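A minimal sketch of that post-processing, assuming the transcript has already been returned as a string and the list of names of interest is known in advance:

```python
def find_known_names(transcript, known_names):
    """Return every known name that appears in the transcript (case-insensitive)."""
    text = transcript.lower()
    return [name for name in known_names if name.lower() in text]

names = ["John Disilva", "Ashish Mundra"]
print(find_known_names("I want to meet John Disilva", names))  # → ['John Disilva']
```

Passing the names as phrase hints still helps here: it biases recognition toward the right spelling, which makes this simple substring check more reliable.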