How can I do speaker identification (diarization) with Microsoft Speech to Text without previous voice enrollment?

In my application, I need to record a conversation between people and there's no room in the physical workflow to take a 20 second sample of each person's voice for the purpose of training the recognizer, nor to ask each person to read a canned passphrase for training. But without doing that, as far as I can tell, there's no way to get speaker identification.
Is there any way to just record, say, 5 people speaking and have the recognizer automatically classify returned text as belonging to one of the 5 distinct people, without previous training?
(For what it's worth, IBM Watson can do this, although it doesn't do it very accurately, in my testing.)

If I understand your question correctly, Conversation Transcription should fit your scenario: if you don't create user profiles beforehand, it labels the speakers as Speaker[x], incrementing the number for each new speaker.
User voice samples are optional. Without this input, the transcription
will show different speakers, but shown as "Speaker1", "Speaker2",
etc. instead of recognizing as pre-enrolled specific speaker names.
You can get started with the real-time conversation transcription quickstart.

Microsoft Conversation Transcription is in Preview and currently targets microphone array devices, so the input should be recorded with a microphone array. If your recordings come from an ordinary microphone, it may not work and you will need special configuration. You can also try batch diarization, which supports offline transcription and can diarize 2 speakers for now; support for more than 2 speakers is expected very soon, probably this month.

Related

Is there a way to get timestamps of speaker switch times using Google Cloud's speech to text service?

I know there is a way to get words delineated by speaker using the Google Cloud Speech-to-Text API. I'm looking for a way to get the timestamps of when the speaker changes in a longer file. I know that Descript must do something like this under the hood, which is what I am trying to replicate. My desired end result is to be able to split an audio file with multiple speakers into clips of each speaker, in the order that they occurred.
I know I could probably extract timestamps for each word and then iterate through the results, getting the timestamps for when a previous result is a different speaker than the current result. This seems very tedious for a long audio file and I'm not sure how accurate this is.
The phone model in Google Speech-to-Text does what you are looking for: it gives result end times for each identified speaker.
Check more here https://cloud.google.com/speech-to-text/docs/phone-model
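For what it's worth, the word-by-word approach described in the question is less tedious than it sounds: it is a single pass over the words. Here is a minimal sketch, assuming you have already flattened the API response into a list of words that each carry a start time, an end time, and a speaker tag. The `Word` and `Turn` classes and the `speaker_turns` function are hypothetical names for illustration, not part of the Google client library.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    text: str
    start: float   # seconds from the beginning of the file
    end: float     # seconds from the beginning of the file
    speaker: int   # speaker tag assigned by the recognizer


@dataclass
class Turn:
    speaker: int
    start: float
    end: float


def speaker_turns(words: Iterable[Word]) -> List[Turn]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: List[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w.speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1].end = w.end
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(w.speaker, w.start, w.end))
    return turns


words = [
    Word("hello", 0.0, 0.4, 1),
    Word("there", 0.4, 0.8, 1),
    Word("hi", 1.0, 1.2, 2),
    Word("thanks", 1.5, 2.0, 1),
]
turns = speaker_turns(words)
for t in turns:
    print(t.speaker, t.start, t.end)
```

Each resulting turn gives you a start and end time you could hand to an audio tool to cut the clip. The accuracy is only as good as the speaker tags the recognizer returns.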

Bing Speech to Text API returning very wrong text

I am trying the "Bing Speech To Text API" on audio files that contain real conversations between a call-center agent and a customer who calls to get questions answered. Each file therefore has two people talking, and sometimes long periods of silence while the customer waits for an answer from support. The files are 5 to 10 minutes long.
My questions are:
What is the best approach to transcribe audio like this to text using Microsoft Cognitive Services?
What APIs do I have to use, besides Bing Speech To Text?
Do I have to cut or convert the audio files before sending them to Bing Speech To Text?
I am asking because the Bing Speech to Text API is returning text that is very different from the audio content. It is impossible to use or understand. But, of course, I am probably making some mistake.
Could you please explain the best strategy for working with audio files like this?
I would be very glad for any help.
Best regards,
I ran into this problem with conversations as well. Make sure that the transcription mode is set to "conversation" instead of "interactive".

Phone number and Date of Birth from human speech

Is there an effective natural language processor that can extract a phone number and date of birth from human speech? Each user has a different way of saying their phone number and date of birth, so converting speech to text and then parsing the text for a phone number is not helpful.
You can use the Google Speech-to-Text API. I used it to let blind people enter account numbers. I was working for a bank, so there were lots of numbers involved as input, e.g. account numbers, card numbers, etc.
With the Google STT engine you can define custom voice inputs.
I also created a feedback mechanism using the Text-to-Speech API so that the app can tell the user when their input is invalid and ask them to speak again.
You can see a code snippet on GitHub:
https://github.com/hiteshsahu/Android-TTS-STT
The easiest way is to extract text from the speech first. There are plenty of tools for that, both proprietary (Nuance) and tinker-friendly open source (such as Sphinx), and plenty of tools to extract dates and phone numbers that are expressed in different ways. IBM Watson offers one, Smart Formatting (beta), which normalizes dates and phone numbers in its own transcripts. To guess which dates are birthdays, try detecting related keywords (birth, born, and so on) nearby.
For a few free alternatives, check:
For phone numbers:
https://www.npmjs.com/package/phone-number-extractor
https://github.com/googlei18n/libphonenumber
For date extraction, check these previous questions:
Extracting dates from text in Java
Best way to identify and extract dates from text Python?
There is a patent for the process you are asking about, so you might have to pay royalties or something similar.
http://www.freepatentsonline.com/8416928.html
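To illustrate the keyword idea from this answer, here is a rough sketch that works on text the recognizer has already produced. It assumes the transcript contains digits (not spelled-out numbers like "five five five") and that dates appear as day/month/year; both are assumptions, and the function names, regular expressions, and 40-character window are made up for illustration. A dedicated library such as libphonenumber would be more robust for real phone number parsing.

```python
import re
from datetime import date
from typing import List, Optional

# Assumed formats; real transcripts will need more patterns than these.
DATE_RE = re.compile(r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
BIRTH_KEYWORDS = ("birth", "born", "birthday")


def find_phone_numbers(text: str) -> List[str]:
    """Return digit-only strings that look like phone numbers."""
    # Blank out dates first so "04-12-1990" is not mistaken for a phone number.
    text = DATE_RE.sub(" ", text)
    found = []
    for m in PHONE_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if 7 <= len(digits) <= 15:
            found.append(digits)
    return found


def find_birth_date(text: str, window: int = 40) -> Optional[date]:
    """Return the first valid date that has a birth-related keyword nearby."""
    for m in DATE_RE.finditer(text):
        context = text[max(0, m.start() - window): m.end() + window].lower()
        if any(k in context for k in BIRTH_KEYWORDS):
            day, month, year = (int(g) for g in m.groups())
            try:
                return date(year, month, day)  # day/month/year order is assumed
            except ValueError:
                continue  # e.g. month 13: not a real date, keep looking
    return None


transcript = "I was born on 04/12/1990 and my number is 555 123 4567."
print(find_birth_date(transcript))
print(find_phone_numbers(transcript))
```

A date with no birth-related word within the window is ignored, which is the whole point of the keyword check.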
If you want to extract the phone number and date of birth from human speech, here is another option you can implement:
https://cloud.google.com/speech/
This API is really useful for converting speech to text. I had this problem at one point too, so you can try it as well.
Another API that is really good for authentication:
https://api.ai/
I hope it helps you.

RFID Limitations

My graduate project is a Smart Attendance System for a university using RFID.
What if one student has multiple cards (cheating) and wants to mark his friend as present as well? In that situation my system cannot detect the human deception: it records every RFID tag the reader detects, so it marks both students as present and stores them in the database.
I have been facing this problem from the beginning, and it is a huge flaw in my system.
I need a solution or any idea for this problem, whether it is implemented in code or in real life, to identify the actual people.
There are a few ways you could do this depending upon your dedication, the exact tech available to you, and the consistency of the environment you are working with. Here are the first two that come to mind:
1) Create a grid of reader antennae on the ceiling of your room and use signal response times to the three nearest readers to get a decent level of confidence as to where the student tag is. If two tags register as being too close, display the associated names for the professor to call out and confirm presence. This solution will be highly dependent upon the precision of your equipment and stability of temperature/humidity in the room (and possibly other things like liquid and metal presence).
2) Similar to the first solution, but a little different. Some readers and tags (Impinj R2000 and Indy Readers, Impinj Monza 5+ for sure, maybe others as well) have the ability to report a response time and a phase angle associated with the signal received from an interrogated tag. Using a setup similar to the first, you can get a much higher level of reliability and precision if you use this method.
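As a rough illustration of the first approach, here is a sketch of 2D trilateration from three readers plus a check for tags that sit suspiciously close together. It assumes you have already converted response times into distance estimates and that positions are projected onto the floor plane; the function names and the `min_gap` threshold are hypothetical, and real measurements will be noisy enough that you would want some filtering on top.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def trilaterate(readers: List[Point], distances: List[float]) -> Point:
    """Estimate a tag position from three reader positions and distances."""
    (x1, y1), (x2, y2), (x3, y3) = readers
    d1, d2, d3 = distances
    # Subtracting the first circle equation from the other two leaves
    # two linear equations in x and y.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("readers must not be collinear")
    # Solve the 2x2 system with Cramer's rule.
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det)


def suspicious_pairs(positions: Dict[str, Point], min_gap: float) -> List[Tuple[str, str]]:
    """Return pairs of tag owners whose estimated positions are closer than min_gap."""
    return [
        (a, b)
        for a, b in combinations(sorted(positions), 2)
        if math.dist(positions[a], positions[b]) < min_gap
    ]


readers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
tag = trilaterate(readers, [math.sqrt(2), math.sqrt(10), math.sqrt(10)])
pairs = suspicious_pairs(
    {"alice": (1.0, 1.0), "bob": (1.1, 1.0), "carol": (3.0, 3.0)}, min_gap=0.3
)
print(tag, pairs)
```

Any pair returned by `suspicious_pairs` would be the names shown to the professor for a manual check.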
Your software could randomly pick a few names of attending people, so that the professor can ask them to identify themselves. This will not eliminate the possibility of cheating, but it will increase the risk of being caught.
Another idea: count the number of attendees (either by the professor or by camera plus software) and compare that to the number of RFID tags visible.
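Both ideas are simple to wire in. A minimal sketch, with hypothetical function names and assuming the head count comes from somewhere else (the professor or a camera system):

```python
import random
from typing import List, Optional, Sequence


def pick_spot_checks(present: Sequence[str], k: int,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Pick up to k distinct names for the professor to call out."""
    rng = rng or random.Random()
    return rng.sample(list(present), min(k, len(present)))


def extra_tags(tags_seen: int, people_counted: int) -> int:
    """Positive result: more tags than people, so someone carries extra cards."""
    return tags_seen - people_counted


present = ["Ana", "Ben", "Chen", "Dina"]
checks = pick_spot_checks(present, 2, random.Random(0))
print(checks, extra_tags(tags_seen=31, people_counted=30))
```

Passing a seeded `random.Random` only makes the example repeatable; in real use you would leave it unseeded so students cannot predict who gets called.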
There is no solution for this RFID limitation itself.
But if you can, use biometric (fingerprint) recognition together with the RFID card. For this, your system has to:
1) Integrate a biometric scanner with your RFID reader
2) Store the biometric data on the card
Then, when taking attendance:
1) Read the UID
2) Scan the student's biometric
3) Match the scanned biometric against the one stored on the card (step 2 of the setup above)
4) Record attendance (present if the biometric matched, absent if not)
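The attendance flow above could be sketched roughly like this. Everything here is hypothetical: the `Card` class, the `take_attendance` function, and the 0.8 threshold are placeholders, and `toy_similarity` only stands in for whatever matching function your scanner's SDK actually provides (real fingerprint matching is fuzzy, not a byte comparison).

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Card:
    uid: str
    template: bytes  # biometric template written to the card at enrollment


def take_attendance(card: Card, scanned: bytes,
                    similarity: Callable[[bytes, bytes], float],
                    threshold: float = 0.8) -> Tuple[str, bool]:
    """Return (uid, present) after comparing the live scan to the card's template."""
    score = similarity(card.template, scanned)
    return card.uid, score >= threshold


def toy_similarity(a: bytes, b: bytes) -> float:
    """Placeholder matcher: fraction of equal bytes. Not a real fingerprint matcher."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


card = Card(uid="04A1B2C3", template=b"\x01\x02\x03\x04\x05")
result_ok = take_attendance(card, b"\x01\x02\x03\x04\x05", toy_similarity)
result_bad = take_attendance(card, b"\x09\x09\x09\x09\x09", toy_similarity)
print(result_ok, result_bad)
```

Because the template lives on the card, a friend holding the card cannot be marked present without also passing the scan.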
Well, we all have that glitch, and there is nothing you can do about it directly, but I think a camera system would help minimise it.
Why use a camera system and not a biometric fingerprint system? Let's rephrase the question: why use RFID at all if there is a biometric fingerprint system? ;)
Ideally, use RFID middleware that handles the tag reading.
Once the reader detects a tag, the middleware simply calls the security camera system, requests a snapshot, and stores it in the DB. I'm using an RFID middleware called Envoy.

Methods for determining acoustical similarity (but not fingerprinting)

I'm looking for methods that work in practice for determining some kind of acoustical similarity between different songs.
Most of the methods I've seen so far (MFCC etc.) actually seem to aim at finding identical songs only (i.e. fingerprinting, for music recognition rather than recommendation), while most recommendation systems seem to work on network data (co-listened songs) and tags.
Most MPEG-7 audio descriptors also seem to be along this line. Plus, most of them are defined at the level of "extract this and that", but nobody seems to actually use these features to compute song similarity, let alone an efficient search for similar items...
Tools such as http://gjay.sourceforge.net/ and http://imms.luminal.org/ seem to use some simple spectral analysis, file system location, tags, plus user input such as the "color" and rating manually assigned by the user or how often the song was listened and skipped.
So: which audio features are reasonably fast to compute for a common music collection, and can be used to generate interesting playlists and find similar songs? Ideally, I'd like to feed in an existing playlist, and get out a number of songs that would match this playlist.
So I'm really interested in acoustic similarity, not so much identification / fingerprinting. Actually, I'd just want to remove identical songs from the result, because I don't want them twice.
And I'm also not looking for query by humming. I don't even have a microphone attached.
Oh, and I'm not looking for an online service. First of all, I don't want to send all my data to Apple etc.; secondly, I want to get recommendations only from the songs I own (I don't want to buy additional music right now, while I haven't explored all of my music. I haven't even converted all my CDs into mp3 yet ...); and thirdly, my music taste is not mainstream: I don't want the system to recommend Mariah Carey all the time.
Plus of course, I'm really interested in what techniques work well, and which don't... Thank you for any recommendations of relevant literature and methods.
Only one application has ever done this really well. MusicIP mixer.
http://www.spicefly.com/article.php?page=musicip-software
It hasn't been updated for about ten years (and even then the interface was a bit clunky), it requires a very old version of Java, and doesn't work with all file formats - but it was and still is cross-platform and free. It does everything you're asking : generates acoustic fingerprints for every mp3/ogg/flac/m3u in your collection, saves them to a tag on the song, and given one or more songs, generates a playlist similar to those songs. It only uses the acoustics of the songs, so it's just as likely to add an unreleased track which only you have on your own hard drive as a famous song.
I love it, but every time I update my operating system / buy a new computer it takes forever to get it working again.
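If you end up rolling your own, the "feed in a playlist, get similar songs out" part is fairly simple once you have one feature vector per song. The sketch below is not how MusicIP works internally; it is a generic approach that assumes you have already computed fixed-length features for each track (for example averaged spectral features) with some other tool. The function names and the toy vectors are made up for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of the playlist's feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def recommend(features: Dict[str, Sequence[float]],
              playlist: List[str], k: int) -> List[str]:
    """Rank songs outside the playlist by similarity to the playlist centroid."""
    target = centroid([features[s] for s in playlist])
    candidates = [s for s in features if s not in playlist]
    candidates.sort(key=lambda s: cosine(features[s], target), reverse=True)
    return candidates[:k]


# Toy two-dimensional features; real ones would have many more dimensions.
features = {
    "song_a": [1.0, 0.0],
    "song_b": [0.9, 0.1],
    "song_c": [0.0, 1.0],
    "song_d": [0.8, 0.3],
}
picks = recommend(features, ["song_a"], k=2)
print(picks)
```

Songs already in the playlist are excluded from the candidates, which also covers the "don't give me the same song twice" requirement as long as duplicates share a key.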
