Bing Speech to Text API returning very wrong text - speech-to-text

I am trying the "Bing Speech To Text API" in audio files that contains a real conversations between a person that answer customers in a call-center, and a customer that calls the call center to solve his doubts. Thus, these audios have two persons talking, and sometimes have long silence period when the customer is waiting an answer from support. These audios have 5 to 10 minutes long.
My questions are:
What is the best approach to transcribing audio like this, using Microsoft Cognitive Services?
Which APIs do I have to use, besides Bing Speech to Text?
Do I have to cut or convert the audio files before sending them to Bing Speech to Text?
I am asking because the Bing Speech to Text API is returning text that is wildly different from the audio content; it is impossible to use or understand. But, of course, I assume I am making some mistake.
Please, could you explain to me the best strategy to work with audio files like this?
I would be very glad for any help.
Best Regards,

I ran into this problem with conversations as well. Make sure that the transcription mode is set to "conversation" instead of "interactive."
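For reference, here is roughly what that mode switch looks like against the classic Bing Speech REST endpoint, where the recognition mode is a path segment in the URL. This is a minimal sketch, not a verified implementation: the key and file name are placeholders, it assumes 16 kHz mono PCM WAV, and the REST endpoint only accepted short utterances, so 5-10 minute recordings would need the client library or a batch transcription service regardless.

import requests

MODE = "conversation"  # instead of the default "interactive"
url = (
    "https://speech.platform.bing.com/speech/recognition/"
    f"{MODE}/cognitiveservices/v1?language=en-US&format=detailed"
)

with open("call.wav", "rb") as f:  # placeholder file name
    resp = requests.post(
        url,
        headers={
            "Ocp-Apim-Subscription-Key": "YOUR_KEY",  # placeholder
            "Content-Type": "audio/wav; codec=audio/pcm; samplerate=16000",
        },
        data=f.read(),
    )

print(resp.json())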

Related

Is there a way to get timestamps of speaker switch times using Google Cloud's speech to text service?

I know there is a way to get words delineated by speaker using the Google Cloud Speech-to-Text API. I'm looking for a way to get the timestamps of when the speaker changes in a longer file. I know that Descript must do something like this under the hood, and that is what I am trying to replicate. My desired end result is to be able to split an audio file with multiple speakers into clips of each speaker, in the order in which they occurred.
I know I could probably extract timestamps for each word and then iterate through the results, collecting the timestamps where the previous result has a different speaker than the current one. This seems very tedious for a long audio file, and I'm not sure how accurate it is.
Google "Speech to text" - phone model does what you are looking at by giving result end times for each identified speaker.
Check more here https://cloud.google.com/speech-to-text/docs/phone-model
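To make the word-iteration approach concrete, here is a minimal sketch with the google-cloud-speech Python client; it assumes an 8 kHz mono LINEAR16 recording and a two-speaker call, and field names follow the current v1 client. For audio longer than about a minute you would use long_running_recognize with the file in Cloud Storage instead of recognize.

from google.cloud import speech

client = speech.SpeechClient()

with open("call.wav", "rb") as f:  # placeholder file name
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
    model="phone_call",  # the phone model the answer refers to
    diarization_config=speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=2,  # assumed two-person call
        max_speaker_count=2,
    ),
)

response = client.recognize(config=config, audio=audio)

# With diarization enabled, the last result aggregates all words,
# each carrying a speaker tag and time offsets.
words = response.results[-1].alternatives[0].words

# Print the timestamp of every speaker switch.
previous_tag = None
for word in words:
    if word.speaker_tag != previous_tag:
        print(f"speaker {word.speaker_tag} starts at "
              f"{word.start_time.total_seconds():.2f}s")
        previous_tag = word.speaker_tag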

How can I do speaker identification (diarization) with microsoft speech to text without previous voice enrollment?

In my application, I need to record a conversation between people, and there's no room in the physical workflow to take a 20-second sample of each person's voice for the purpose of training the recognizer, nor to ask each person to read a canned passphrase for training. But without doing that, as far as I can tell, there's no way to get speaker identification.
Is there any way to just record, say, 5 people speaking and have the recognizer automatically classify returned text as belonging to one of the 5 distinct people, without previous training?
(For what it's worth, IBM Watson can do this, although it doesn't do it very accurately, in my testing.)
If I understand your question correctly, then Conversation Transcription should be a solution for your scenario, as it will show the speakers as Speaker[x], incrementing for each new speaker, if you don't generate user profiles beforehand.
User voice samples are optional. Without this input, the transcription will show different speakers, but shown as "Speaker1", "Speaker2", etc. instead of recognizing as pre-enrolled specific speaker names.
You can get started with the real-time conversation transcription quickstart.
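For orientation, here is a rough sketch of that quickstart using today's Azure Speech SDK for Python (the shipped successor of this preview); the key, region, and file name are placeholders, and class names may differ slightly between SDK versions, so check the current quickstart.

import time
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
audio_config = speechsdk.audio.AudioConfig(filename="meeting.wav")  # placeholder

transcriber = speechsdk.transcription.ConversationTranscriber(
    speech_config=speech_config, audio_config=audio_config
)

done = False

def on_transcribed(evt):
    # Without enrolled voice profiles, speakers come back as generic IDs.
    print(f"{evt.result.speaker_id}: {evt.result.text}")

def on_stopped(evt):
    global done
    done = True

transcriber.transcribed.connect(on_transcribed)
transcriber.session_stopped.connect(on_stopped)
transcriber.canceled.connect(on_stopped)

transcriber.start_transcribing_async().get()
while not done:
    time.sleep(0.5)
transcriber.stop_transcribing_async().get()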
Microsoft Conversation Transcription is in preview and currently targets microphone-array devices, so the input should be recorded by a microphone array. If your recordings come from a common microphone, it may not work and you may need special configuration. You can also try batch diarization, which supports offline transcription with diarization of 2 speakers for now; support for more than 2 speakers is expected very soon, probably this month.
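For the batch route, a hedged sketch of what a diarization-enabled job submission looks like against the v3.0 batch transcription REST endpoint; the region, key, and content URL are placeholders (the URL must point to downloadable audio, e.g. a SAS-signed blob), and property names should be verified against the current API version.

import requests

region = "YOUR_REGION"  # placeholder
key = "YOUR_KEY"        # placeholder

body = {
    "displayName": "call-center batch job",
    "locale": "en-US",
    "contentUrls": ["https://example.com/call.wav"],  # placeholder audio URL
    "properties": {
        "diarizationEnabled": True,
        "wordLevelTimestampsEnabled": True,
    },
}

resp = requests.post(
    f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions",
    headers={"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/json"},
    json=body,
)
resp.raise_for_status()
print("poll this URL for results:", resp.json()["self"])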

Is there anyway to make google assistant's speech recognition better recognise words used in my dialogflow agent?

I am using Dialogflow to create a chatbot that can be used on Google Assistant. However, the speech recognition often misrecognizes the intended word. For example, when I say the word "seal", it wrongly recognizes the spoken word as "shield".
Is there any way to "train" or make google assistant better recognize a word?
If you have a limited number of words that you would like to improve upon, then using Dialogflow's entities would be an option. For instance, if you are trying to recognize certain animals, you can create a set of animals as entities and set the intent to look for an animal entity in the user input.
Besides this option, I don't know of any other way to improve the speech recognition itself. You could train Dialogflow to map both "seal" and "shield" to your desired intent, but that doesn't change the actual word; it will still be "shield".
For any other improvements to the speech recognition, I'm afraid you will have to wait for updates from Google to their algorithms.
Just found out there is a new beta feature in Dialogflow that should help:
https://cloud.google.com/dialogflow/docs/speech-adaptation
Edit: however, it does not work with Actions on Google.
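For context, Dialogflow's speech adaptation rides on the same phrase-hint mechanism that the Cloud Speech-to-Text API exposes directly, so a minimal sketch of biasing recognition toward "seal" looks like this; the boost value is purely illustrative and the file name is a placeholder.

from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    # Hint the recognizer toward "seal" so it wins over "shield".
    speech_contexts=[speech.SpeechContext(phrases=["seal"], boost=15.0)],
)

with open("utterance.wav", "rb") as f:  # placeholder file name
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)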

Google actions sdk 2 nodejs response / chat bubble limit

I am using the Google Actions SDK v2 and trying to build a gaming application. The documentation says conv.ask() is limited to 2 responses per turn, so I can only show 2 chat bubbles and cannot display more until after the next user input. But when I look at some other published applications, they display many more than 2 in a row. I can't understand, or find any info on, how they get around this limitation. 2 seems an unreasonable limit.
For speech you can merge text lines together and it will sound fine, but the on-screen presentation is awful without being able to break it into more responses.
Does anyone out there have any insight on this?
In fact, everything in a single line would sound bad. Why don't you try separating the necessary texts with SSML? I recommend it.
You can use the break tag to put a pause between each text.
<speak>
I can pause <break time="3s"/>.
I can pause a second time <break time="3s"/>.
</speak>
Here you have the documentation.
Now, if what you want is to give multiple selection options, you can also use suggestion chips.
https://developers.google.com/actions/assistant/responses#suggestion_chip

Phone number and Date of Birth from human speech

Is there an effective natural language processor that can fetch the phone number and date of birth from human speech? Each user has a different way of specifying their phone number and date of birth. Hence, simply converting speech to text and then parsing the text for a phone number is not helpful.
You can use the Google Speech-to-Text API. I used it for account-number entry by blind people; I was working for a bank, so there were lots of numbers involved as input, e.g. account number, card number, etc.
With the Google STT engine you can define custom voice inputs.
I also created a feedback mechanism using the Text-to-Speech API so that the app can tell the user when their input is invalid and ask them to speak again.
You can see a code snippet on GitHub:
https://github.com/hiteshsahu/Android-TTS-STT
The easiest way is to extract text from speech first; there are plenty of tools for that, both proprietary (Nuance) and tinker-friendly open source like Sphinx, and plenty of tools to extract dates and phone numbers expressed in different ways. IBM Watson offers one, Smart Formatting (beta), to normalize dates and phone numbers in its own transcripts. To guess which dates are birthdays, you can detect related keywords ("birth", "born", and so on) nearby, as in the sketch after the links below.
For a few free alternatives, check:
For phone numbers:
https://www.npmjs.com/package/phone-number-extractor
https://github.com/googlei18n/libphonenumber
For date extraction, check these previous questions:
Extracting dates from text in Java
Best way to identify and extract dates from text Python?
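To make the free route concrete, here is a small sketch combining the Python port of libphonenumber (the phonenumbers package) with dateutil's fuzzy date parsing, using the birth-keyword heuristic from above; the sample sentence and region code are made up for illustration.

import phonenumbers
from dateutil import parser

transcript = "I was born on March 3rd 1985 and my number is 415-555-2671."

# Phone numbers: PhoneNumberMatcher scans free text for candidates.
cleaned = transcript
for match in phonenumbers.PhoneNumberMatcher(transcript, "US"):
    print("phone:", phonenumbers.format_number(
        match.number, phonenumbers.PhoneNumberFormat.E164))
    # Strip the matched digits so they don't confuse the date parser.
    cleaned = cleaned.replace(match.raw_string, "")

# Dates: fuzzy parsing skips the surrounding words; a nearby birth
# keyword suggests the date is a date of birth.
if any(kw in cleaned.lower() for kw in ("born", "birth")):
    dob = parser.parse(cleaned, fuzzy=True)
    print("possible DOB:", dob.date())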
There is a patent for the process you are asking about, but you might have to pay royalties or something:
http://www.freepatentsonline.com/8416928.html
If you want to fetch the phone number and date of birth from human speech, you can use another option and implement it:
https://cloud.google.com/speech/
This API is really useful for converting speech to text. I also had this problem at one point, so you can try it too.
Another API, which is really good for authentication:
https://api.ai/
I hope this helps you.
