Open source for Automatic Speech Matching?

Automatic Speech Matching (ASM) is not Automatic Speech Recognition: ASM compares two pieces of speech audio and returns a percentage indicating how closely they match.
This technique is typically used in scenarios like these:
1. Pronunciation learning. For example, given a standard recording of the word "Hello", a student learning English records their own "Hello", and we use ASM to measure how similar, or how correct, the student's pronunciation is. So we need some kind of algorithm to compare these two 1-D audio signals.
2. We can extend the above from a single word to a whole sentence. How would we match those audio signals then?
The question here is whether there is a good open-source or commercial solution for ASM, or any other good approach to this kind of real-world requirement.
Thanks in advance!

Comparison against a template will not give good results because it cannot actually indicate what was spoken incorrectly. A good pronunciation-learning framework matches not against a template but against an acoustic model representing both correct and incorrect pronunciation. That way it can detect the errors a learner makes. You can read:
The SRI EduSpeakTM System: Recognition and Pronunciation Scoring
http://www.speech.sri.com/people/hef/papers/EduSpeak.ps
For an implementation of this algorithm on iPhone, you can check
http://ottercall.com
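
For reference, the template comparison discussed above is usually implemented as dynamic time warping (DTW) over MFCC features. A minimal sketch (using librosa as an assumed dependency; file names and score scaling are placeholders) also shows the limitation: it yields a single score with no hint of which sounds were mispronounced:

    # Naive template matching: DTW over MFCCs gives one overall
    # similarity score, but no feedback on WHICH sounds were wrong.
    import librosa
    import numpy as np

    def similarity(reference_wav, student_wav):
        # Load both recordings at a common sample rate.
        ref, sr = librosa.load(reference_wav, sr=16000)
        stu, _ = librosa.load(student_wav, sr=16000)
        # MFCC feature matrices, shape (n_mfcc, n_frames).
        ref_mfcc = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=13)
        stu_mfcc = librosa.feature.mfcc(y=stu, sr=sr, n_mfcc=13)
        # Cumulative DTW cost between the two feature sequences.
        D, wp = librosa.sequence.dtw(X=ref_mfcc, Y=stu_mfcc)
        cost = D[-1, -1] / len(wp)  # normalize by path length
        # Map cost to a rough 0..1 "match percentage" (ad hoc scaling).
        return float(np.exp(-cost / 100.0))

    print(similarity("hello_reference.wav", "hello_student.wav"))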

Related

Audio segmentation

What I am trying to do is "separate" vowels from consonants in an audio file (a WAV file). For example, a file might contain the sentence "I am fine", and I have to separate the vowel sounds from the consonant ones. After the separation, I can ignore the consonants because they have no importance in this project. I also have to ignore the pauses in speech (the pauses between words). So this is my problem: how to separate the vowels from the consonants.
I was advised that for segmentation I could use a fuzzy c-means (FCM) algorithm or the histogram method. I searched for these two methods but could not find anything that helped me.
Can someone walk me through the steps I have to take, or give me some useful links? I want to mention that I can also use other methods (not necessarily FCM or histograms).
Thanks!
You can use hidden Markov model (HMM) based segmentation to segment your speech signal into its constituent phonemes.
You need a correct transcription of the speech signal and letter-to-sound (LTS) rules to do this.
Once the speech is segmented correctly, you can separate the vowels easily.
This link will be useful:
http://hts.sp.nitech.ac.jp/
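
Once you have a phoneme-level alignment from such a tool, pulling out the vowels is mostly bookkeeping. A rough sketch in Python (assuming the alignment has already been produced as (phoneme, start, end) tuples; the ARPAbet vowel set and timings shown are illustrative):

    # Extract vowel segments from a WAV file, given a phoneme-level
    # alignment from an HMM segmenter (alignment format assumed here).
    import numpy as np
    from scipy.io import wavfile

    # ARPAbet-style vowel labels; adjust to your phone set.
    VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
              "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

    def extract_vowels(wav_path, alignment):
        """alignment: list of (phoneme, start_sec, end_sec) tuples."""
        sr, samples = wavfile.read(wav_path)  # assumes mono audio
        pieces = [samples[int(s * sr):int(e * sr)]
                  for phone, s, e in alignment
                  if phone.rstrip("012") in VOWELS]  # "AH0" -> "AH"
        # Concatenating drops consonants and inter-word pauses.
        return sr, np.concatenate(pieces)

    # Hypothetical alignment for "I am ..." (times are made up).
    align = [("AY", 0.10, 0.25), ("AE", 0.40, 0.52), ("M", 0.52, 0.60)]
    sr, vowels_only = extract_vowels("i_am_fine.wav", align)
    wavfile.write("vowels_only.wav", sr, vowels_only)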

Search in a book with speech

I am trying to build a program that will find which page/sentence of a book is being read into the microphone. I have the book's text and its audio content. The user will start reading from a random page, and the program is supposed to sync to the user and show the section of the book that is being read. It might seem like a useless program, but please bear with me.
Would an approach similar to Shazam-like programs work? I am not sure how effective those algorithms are for speech. Also, the speaker will be different and might have an accent and read at a different speed.
Another approach would be converting the speech to text and searching for the text in the book. The problem is that the language of the book is a rare one for which there is no language model available. In addition, the script does not use Latin characters, which makes programming difficult (for me, at least).
Are there any solutions anyone can recommend? Would extracting features from the audio file and comparing them with features extracted in real time (from the microphone) work? Which features?
Is there any implementation/code I can start with? Any language is OK, but I'd prefer C.
You need to use a speech recognizer.
Create a language model directly from the book text. That will make recognition of the book reading very accurate, both for the original narration and for the user's reading.
Use this language model to recognize the book audio and assign timestamps to the words, or use a more advanced algorithm to perform text-to-audio alignment.
Recognize the user's speech with the book-specific language model and use the recognized text to display the current position in the book.
You can use CMUSphinx for all of the tasks mentioned.
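The last step, mapping recognized text back to a position, can start as a simple fuzzy substring search. A minimal sketch in Python (standard library only; the file name and recognizer output shown are placeholders):

    # Locate a recognized snippet inside the full book text with a
    # sliding-window fuzzy match (standard library only).
    import difflib

    def locate(book_words, recognized):
        """Return (word offset, similarity) of the best match."""
        window = len(recognized)
        best_pos, best_score = 0, 0.0
        for i in range(len(book_words) - window + 1):
            score = difflib.SequenceMatcher(
                None, book_words[i:i + window], recognized).ratio()
            if score > best_score:
                best_pos, best_score = i, score
        return best_pos, best_score

    book_words = open("book.txt", encoding="utf-8").read().split()
    recognized = "recognizer output goes here".split()  # hypothetical
    pos, score = locate(book_words, recognized)
    print("best match at word %d (similarity %.2f)" % (pos, score))

This brute-force scan is quadratic, but for a book-sized text it is fast enough to prototype with before moving to a proper alignment algorithm.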

How to count the number of words spoken using any method (SR or otherwise)

I am having some trouble finding pointers on how to perform what appears to be a deceptively easy task:
Given an audio stream, how do you count the number of words that have been spoken, in real time?
I don't need to recognize what the words are, just keep a running count of the words that have been uttered. The counter doesn't have to be very accurate, and could even count utterances and other "grunts" like coughs.
It appears that all speech recognition systems depend on a pre-defined grammar being provided before they can analyze the spoken phonemes and convert them to known words with some degree of accuracy. But I don't care about accuracy at all, only about the rate of words being spoken.
What is important is that this runs in real time and lets the system raise an alert after a certain number of words have been spoken. The system will present a visual cue to pause, and then the speaker can continue.
I've looked at the CMU Sphinx FAQ and found that "word spotting" is not yet supported. I don't really need a real-time search for particular words, but it comes closer to what I am looking for. Looking for very small silences in the waveform seems a very crude way of doing this and probably not very accurate, but that's all I have right now; a rough sketch of that approach is below.
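To make that concrete, here is roughly what I mean by the crude silence-gap counter (a minimal sketch, assuming 16 kHz mono int16 input and a hand-tuned energy threshold; real speech would need adaptive thresholding):

    # Crude utterance counter: count energy bursts separated by short
    # silences. Threshold and gap length are hand-tuned guesses.
    import numpy as np

    FRAME = 320              # 20 ms at 16 kHz
    ENERGY_THRESHOLD = 1e7   # depends on mic level; must be tuned
    MIN_GAP_FRAMES = 5       # ~100 ms of silence ends a "word"

    def count_bursts(samples):
        """samples: int16 numpy array, 16 kHz mono audio."""
        words, in_burst, silent = 0, False, 0
        for i in range(0, len(samples) - FRAME, FRAME):
            frame = samples[i:i + FRAME].astype(np.float64)
            energy = np.sum(frame * frame)
            if energy > ENERGY_THRESHOLD:
                if not in_burst:
                    words += 1         # a new burst started
                in_burst, silent = True, 0
            elif in_burst:
                silent += 1
                if silent >= MIN_GAP_FRAMES:
                    in_burst = False   # gap long enough: burst over
        return words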
Any pointers on algorithms, research papers or any other insights would be appreciated!

Synchronizing text and audio. Is there a NLP/speech-to-text library to do this?

I would like to synchronize a spoken recording against a known text. Is there a speech-to-text / natural language processing library that would facilitate this? I imagine I'd want to detect word boundaries and compute candidate matches from a dictionary. Most of the questions I've found on SO concern written language.
Desired, but not required:
Open Source
Compatible with American English out-of-the-box
Cross-platform
Thoroughly documented
Edit: I realize this is a very broad, even naive, question, so thanks in advance for your guidance.
What I've found so far:
OpenEars (iOS Sphinx/Flite wrapper)
Forced Alignment
It sounds like you want to do forced alignment between your audio and the known text.
Pretty much any research- or industry-grade speech recognition system can do this, since forced alignment is an important part of training a recognizer on data that does not have phone-level alignments between the audio and the transcript.
Alignment with CMUSphinx
The Sphinx4-1.0 beta 5 release of CMU's open-source speech recognition system now includes a demo of how to do alignment between a transcript and long speech recordings.
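
The Sphinx4 demo itself is Java; as a rough Python analogue, the pocketsphinx bindings expose a text-alignment mode. A sketch assuming the 5.x API (set_align_text constrains decoding to the given transcript; the API differs between releases, so check your version's docs):

    # Force-align a known transcript to audio with pocketsphinx
    # (assumes the 5.x Python API; earlier releases differ).
    from pocketsphinx import Decoder

    decoder = Decoder(samprate=16000)   # default US English model
    decoder.set_align_text("text of the passage being read aloud")

    decoder.start_utt()
    with open("recording.raw", "rb") as f:   # 16 kHz 16-bit mono PCM
        decoder.process_raw(f.read(), full_utt=True)
    decoder.end_utt()

    # Each segment carries the word and start/end frames (100 frames/s).
    for seg in decoder.seg():
        print(seg.word, seg.start_frame / 100.0, seg.end_frame / 100.0)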

How to split male and female voices from an audio file (in C++ or Java)

I want to differentiate between the male and female voices in an audio file and separate them. As output, I want the two voices separated. Can you please help me out, and can the coding be done in Java or C++?
This is potentially a very complicated question, and it is similar to writing your own speech recognition (or identification) algorithm.
You would start by converting the audio into the frequency domain, which is done using a Fast Fourier Transform (FFT).
For each time slice you take an FFT of, this gives you a list of frequencies and their amplitudes. You will somehow need to detect the fundamental tone by analysing the harmonics; the 2nd and 3rd harmonics will be the clearest. It is very hard to figure out which harmonics they are, especially with background noise and the natural differences between people's voices in terms of which harmonics are loudest. Then you can try to determine whether the speaker is male or female from whatever you guessed the fundamental tone to be.
Keep in mind that during many parts of speech, like sibilance ('s', 't', etc.), there is no tone, just noise. The detection will need to be fairly intelligent.
Hope that sets you in the right general direction.
Note: if the two voices are simultaneous and you want to separate them cleanly, then this won't help you. I don't believe anyone alive has solved such a problem.
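For the non-overlapping case, a minimal sketch of the pitch-based idea described above (assumptions: clean mono audio, one speaker per file, autocorrelation standing in for harmonic analysis, and a crude ~165 Hz male/female boundary):

    # Classify a clean, single-speaker recording as "male" or "female"
    # from its median fundamental frequency (autocorrelation pitch).
    import numpy as np
    from scipy.io import wavfile

    def frame_pitch(frame, sr, fmin=60.0, fmax=400.0):
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(sr / fmax), int(sr / fmin)
        if hi >= len(ac) or ac[0] <= 0:
            return None
        lag = lo + int(np.argmax(ac[lo:hi]))
        # Reject unvoiced frames (weak periodicity), e.g. sibilance.
        return sr / lag if ac[lag] / ac[0] > 0.3 else None

    def classify(wav_path, boundary_hz=165.0):
        sr, x = wavfile.read(wav_path)      # assumes mono audio
        x = x.astype(np.float64)
        n = int(0.03 * sr)                  # 30 ms frames
        pitches = [p for i in range(0, len(x) - n, n)
                   if (p := frame_pitch(x[i:i + n], sr)) is not None]
        if not pitches:
            return "unvoiced"
        return "male" if np.median(pitches) < boundary_hz else "female"

    print(classify("speaker.wav"))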
I think this is already possible. I just started taking an online course on Machine Learning from Stanford University with Professor Andrew Ng, and during the first lecture he shows a demo in which an audio recording of two overlapping voices is processed and the individual voices are extracted (and the same with music in the background and a person speaking). Apparently it uses an unsupervised learning algorithm that can pull apart the two underlying patterns. You may want to look into that course (there's one version of it here: http://www.academicearth.org/courses/machine-learning)
One tool that makes this possible is LIUM SpkDiarization. Written in Java and available under the GPL, it is a speaker diarization tool that uses statistical models for male, female and child voices. Luckily for you, the models are provided, so you can use it without having to tag recordings and train the models yourself.
See the scripting page of the LIUM wiki for examples; search the page for "gender".
I would start by saying this is impossible. Speech recognition is really, really hard.
You're not clear in your question: are the voices overlapping? If so, splitting them up will be absurdly difficult.
If they are separate, your more likely bet is to gather a large set of samples of male and female voices and look for common characteristics (and a way to identify them programmatically). If the samples aren't recorded cleanly (if they have background noise), things get even more complicated.
You may get away with using the average tone: male voices are generally deeper than female ones.
What you are asking is one hell of a task. thomasrutter gave some pointers on how to do it, but I guess the algorithm would have to be really robust if you wanted to use it everywhere (on all sorts of music, with singing, of course). Maybe it would be better/easier to start by separating (splitting) a single instrument sample out of a song.
