Is there a way to get timestamps of speaker switch times using Google Cloud's speech to text service? - python-3.x

I know there is a way to get words delineated by speaker using the Google Cloud Speech-to-Text API. I'm looking for a way to get the timestamps of when the speaker changes in a longer file. I know that Descript must do something like this under the hood, which I am trying to replicate. My desired end result is to be able to split an audio file with multiple speakers into clips of each speaker, in the order in which they occurred.
I know I could probably extract timestamps for each word and then iterate through the results, recording the timestamp whenever a word's speaker differs from the previous word's speaker. This seems very tedious for a long audio file and I'm not sure how accurate it would be.

Google "Speech to text" - phone model does what you are looking at by giving result end times for each identified speaker.
Check more here https://cloud.google.com/speech-to-text/docs/phone-model
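To make that word-level approach concrete, here is a minimal sketch, assuming the google-cloud-speech Python client; the Cloud Storage URI, speaker counts, and audio encoding below are placeholders you would replace for your own file:

from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="phone_call",          # the enhanced phone model mentioned above
    use_enhanced=True,
    diarization_config=speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=2,
        max_speaker_count=2,
    ),
)
audio = speech.RecognitionAudio(uri="gs://your-bucket/your-audio.wav")

operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=3600)

# With diarization enabled, the last result aggregates every word with a
# speaker_tag; walk the words and start a new segment whenever the tag changes.
words = response.results[-1].alternatives[0].words
segments = []
current_speaker = None
for w in words:
    if w.speaker_tag != current_speaker:
        if segments:
            segments[-1]["end"] = w.start_time.total_seconds()
        segments.append({"speaker": w.speaker_tag, "start": w.start_time.total_seconds()})
        current_speaker = w.speaker_tag
if segments:
    segments[-1]["end"] = words[-1].end_time.total_seconds()

for s in segments:
    print(f"Speaker {s['speaker']}: {s['start']:.2f}s to {s['end']:.2f}s")

The resulting start/end pairs can then be fed to any audio-cutting tool (ffmpeg, pydub, etc.) to produce per-speaker clips in order.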

Related

How can I do speaker identification (diarization) with microsoft speech to text without previous voice enrollment?

In my application, I need to record a conversation between people and there's no room in the physical workflow to take a 20 second sample of each person's voice for the purpose of training the recognizer, nor to ask each person to read a canned passphrase for training. But without doing that, as far as I can tell, there's no way to get speaker identification.
Is there any way to just record, say, 5 people speaking and have the recognizer automatically classify returned text as belonging to one of the 5 distinct people, without previous training?
(For what it's worth, IBM Watson can do this, although it doesn't do it very accurately, in my testing.)
If I understand your question correctly, Conversation Transcription should be a solution for your scenario: it labels the speakers as Speaker[x] and increments the label for each new speaker if you don't generate user profiles beforehand.
User voice samples are optional. Without this input, the transcription will still show different speakers, but as "Speaker1", "Speaker2", etc. instead of recognizing them as pre-enrolled, named speakers.
You can get started with the real-time conversation transcription quickstart.
Microsoft Conversation Transcription, which is in Preview, currently targets microphone-array devices, so the input should be recorded by a microphone array. If your recordings come from a common microphone, it may not work and you may need special configuration. You can also try batch diarization, which supports offline transcription with diarization of 2 speakers for now; support for more than 2 speakers is expected very soon, probably this month.
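If it helps, here is a rough sketch of the real-time approach, assuming the azure-cognitiveservices-speech Python SDK; the key, region, and file name are placeholders, and the exact setup varies between SDK versions (older previews required creating and joining a Conversation object first):

import time
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
audio_config = speechsdk.audio.AudioConfig(filename="meeting.wav")

transcriber = speechsdk.transcription.ConversationTranscriber(
    speech_config=speech_config, audio_config=audio_config
)

done = False

def on_transcribed(evt):
    # Without enrolled voice profiles the service labels speakers generically
    # (for example "Guest-1", "Guest-2"), which is enough to tell them apart.
    print(f"{evt.result.speaker_id}: {evt.result.text}")

def on_stopped(evt):
    global done
    done = True

transcriber.transcribed.connect(on_transcribed)
transcriber.session_stopped.connect(on_stopped)
transcriber.canceled.connect(on_stopped)

transcriber.start_transcribing_async().get()
while not done:
    time.sleep(0.5)
transcriber.stop_transcribing_async().get()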

How to train custom speech model in Microsoft cognitive services Speech to text

I'm doing a POC with Speech to Text. I need to recognize specific words like "D-STUM" (daily stand-up meeting). The problem is, every time I tell my program to recognize "D-STUM", I get "Destiny", "This theme", etc.
I already went on speech.microsoft.com/.../customspeech, and I've recorded around 40 wav files of people saying "D-STUM". I've also created a file named "trans.txt" which lists every wav file with the word "D-STUM" after each file, like this:
D_stum_1.wav D-STUM
D_stum_2.wav D-STUM
D_stum_3.wav D-STUM
D_stum_4.wav D-STUM
...
Then I uploaded a zip containing the wav files and the trans.txt file, trained a model with that data, and created an endpoint. I referenced this endpoint in my software and launched it.
I expect my custom speech-to-text to recognize people saying "D-STUM" and to display "D-STUM" as text. I never had "D-STUM" displayed after customizing the model.
Did I do something wrong? Is it the right way to do a custom training?
Is 40 samples not enough for the model to be properly trained?
Thank you for your answers.
Custom Speech has several ways to get a better understanding of specific words:
By providing audio samples with their transcriptions, as you have done
By providing text samples (without audio)
Based on my previous use cases, I would highly suggest creating a training file with 5 to 10 sentences in it, each containing "D-STUM" in its usage context, and then duplicating those sentences 10 to 20 times in the file.
This worked for us for getting specific words understood.
Additionally, if you are using "en-US" or "de-DE" as the target language, you can use a pronunciation file; see here
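For illustration, a small helper sketch that generates the kind of plain-text training file described above (Custom Speech related text is UTF-8 text with one sentence per line); the example sentences and the repetition count are made up:

sentences = [
    "Let's start the D-STUM in five minutes.",
    "Who is taking notes for today's D-STUM?",
    "The D-STUM is moved to ten o'clock.",
    "Please keep your D-STUM update short.",
    "We skipped the D-STUM yesterday.",
]

repetitions = 15  # somewhere in the suggested 10-to-20 range

with open("related_text.txt", "w", encoding="utf-8") as f:
    for _ in range(repetitions):
        for sentence in sentences:
            f.write(sentence + "\n")

The resulting related_text.txt can then be uploaded as plain-text language data in the Custom Speech portal and used to train a new model alongside the audio + transcript dataset.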

Google actions sdk 2 nodejs response / chat bubble limit

I am using the Google Actions SDK v2 and trying to build a gaming application. The documentation says conv.ask() is limited to 2 responses per turn. This basically means I can only show 2 chat bubbles, and then it will not allow me to display more until after user input. But when I look at some other published applications, they have many more than 2 in a row displayed. I can't seem to understand or find any info on how they get around this limitation. 2 seems an unreasonable limit.
For speech you can merge the text lines together and it will sound fine, but the presentation on screen is awful without being able to break it down into more responses.
Does anyone out there have any insight on this?
In fact, everything merged into a single line can sound bad too. Why don't you try separating the necessary text with the help of SSML? I recommend it.
You can use the break tag to put a pause between each piece of text.
<speak>
I can pause <break time="3s"/>.
I can pause a second time <break time="3s"/>.
</speak>
You can find the documentation here.
Now, if what you want is to offer multiple selection options, you can also use suggestion chips.
https://developers.google.com/actions/assistant/responses#suggestion_chip

Bing Speech to Text API returning very wrong text

I am trying the "Bing Speech To Text API" on audio files that contain real conversations between a call-center agent who answers customers and a customer who calls the call center to resolve their questions. These audios therefore have two people talking, and sometimes long periods of silence while the customer waits for an answer from support. The audios are 5 to 10 minutes long.
My questions are:
What is the best approach to transcribing audio like this to text using Microsoft Cognitive Services?
What APIs do I have to use besides Bing Speech to Text?
Do I have to cut or convert the audio before sending it to Bing Speech to Text?
I am asking because the Bing Speech to Text API is returning text that is very, very different from the audio content. It is impossible to use or understand. But, of course, I think I am making some mistake.
Please, could you explain to me the best strategy to work with audio files like this?
I would be very glad for any help.
Best Regards,
I had run into this problem with conversations as well. Make sure that the transcription mode is set to "conversation" instead of "interactive."
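For what it's worth, a rough sketch of what that looks like with the legacy Bing Speech REST endpoint, where the mode is part of the URL path; the subscription key, language, and file name are placeholders, and note that this REST endpoint only accepts short clips, so 5 to 10 minute recordings would still need to be split or sent through the Speech SDK or batch transcription instead:

import requests

SUBSCRIPTION_KEY = "YOUR_KEY"
MODE = "conversation"  # instead of "interactive"

# Exchange the subscription key for a short-lived access token.
token = requests.post(
    "https://api.cognitive.microsoft.com/sts/v1.0/issueToken",
    headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
).text

url = f"https://speech.platform.bing.com/speech/recognition/{MODE}/cognitiveservices/v1"

with open("clip.wav", "rb") as f:
    response = requests.post(
        url,
        params={"language": "en-US", "format": "detailed"},
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "audio/wav; codec=audio/pcm; samplerate=16000",
        },
        data=f,
    )

print(response.json())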

Identify start/stop times of spoken words within a phrase using Sphinx

I'm trying to identify the start/end time of individual words within a phrase. I have a WAV file of the phrase AND the text of the utterance.
Is there an intelligent way of combining these two data (audio, text) to improve Sphinx's recognition abilities? What I'd like as output are accurate start/stop times for each word within the phrase.
(I know you can pass -time yes to pocketsphinx to get the time data I'm looking for -- however, the speech recognition itself is not very accurate.)
The solution cannot be for a specific speaker, as the corpus I'm working with contains a lot of different speakers, although they are all using US English.
We have a specific tool for that: the audio aligner in sphinx4. You can check
http://cmusphinx.sourceforge.net/2014/07/long-audio-aligner-landed-in-trunk/
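As a possible alternative from Python, newer pocketsphinx releases expose forced alignment directly; a minimal sketch, assuming pocketsphinx 5.x with its bundled US English model and a mono 16 kHz WAV file (the transcript string and file name are placeholders):

import wave
from pocketsphinx import Decoder

TRANSCRIPT = "the known text of the utterance goes here"

decoder = Decoder(samprate=16000)    # uses the bundled US English acoustic model
decoder.set_align_text(TRANSCRIPT)   # switch the decoder into forced-alignment mode

with wave.open("phrase.wav", "rb") as wav:
    assert wav.getframerate() == 16000 and wav.getnchannels() == 1
    audio = wav.readframes(wav.getnframes())

decoder.start_utt()
decoder.process_raw(audio, full_utt=True)
decoder.end_utt()

frate = decoder.config["frate"]      # frames per second, 100 by default
for seg in decoder.seg():
    print(f"{seg.word}: {seg.start_frame / frate:.2f}s to {seg.end_frame / frate:.2f}s")

Because the word sequence is fixed to the known transcript, the decoder only has to place the word boundaries in time, which is much more robust across different speakers than free recognition.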
