CMU Sphinx for Voice/Speaker Recognition - audio

I'm looking for a way to match a known data set, let's say a list of MP3s or wav files, each which is a sample of someone speaking. At this point I know file ABC is of Person X speaking.
I would then like to take another sample, and do some voice matching to show who this voice is most likely of, given then known data set.
Also, I don't necessarily care what the person has said, as long as I can find a match, i.e I don't need any transcribing or otherwise.
I'm aware CMU Sphinx doesn't do voice recognition, and it's primarily used for voice-to-text, but I have seen other systems, eg: the LIUM Speaker Diarization (http://cmusphinx.sourceforge.net/wiki/speakerdiarization) or the VoiceID project (https://code.google.com/p/voiceid/) which uses CMU as a base for this type of work.
If I am to use CMU, how can I do voice matching?
Also, if CMU Sphinx isn't the best framework, is there an alternate that's open source?

This is a subject which would be adequate in complexity for a PhD thesis. There are no good and reliable systems as of right now.
The task you're up for is a very complex one. How you should approach it depends on your situation.
do you have a limited amount of people? how many?
how much data do you have for each person?
If you have very few people to recognize, you may attempt something as simple as obtaining formants of those people and comparing them to a sample.
Otherwise - you have to contact some academics who work on the subject or jury rig a solution of your own. Either way, as I said, it is a difficult problem.

Related

How to count the number of spoken syllables in an audio file?

I have many audio files with clean audio and only spoken voice in Mandarin Chinese. I need to estimate of how many syllables are spoken in each file. Is there a tool for OS X, Windows, or Linux that can estimate these?
sample01.wav 15
sample02.wav 8
sample03.wav 5
sample04.wav 1
sample05.wav 18
As there are many files, command-line or batch-capable software is preferred, e.g.:
$ application sample01.wav
15
A solution that uses speech-to-text, then counts the number of characters present would be suitable to.
The automatic segmentation of speech is an active scientific domain, meaning that there is no method that works perfectly.
In 2009, de Jong and Wempe proposed a method to automatically detect syllables in a human speech signal using Praat. This methods compares well with man-made segmentation, and has been employed in many third-party scientific studies. You can find a detailed description of the method in their scientific article (pdf), along with an historical perspective on previously proposed methods. The Praat script per se and a couple of tutorials can be found on a dedicated website (www - speechrate).
You may also be interested in another segmentation algorithm developed by Harma that has been implemented in Matlab (Harma Syllable Segmentation)
You can use formants to determine this. Each syllable should correspond to a formant. Here is more information on formants:
https://en.wikipedia.org/wiki/Formants
This might be of interest for you
http://sites.google.com/site/speechrate/
Your question requires specific attention and solution for Speech to Text.
I really doubt any free open source library, easily available and serving to purpose will be served.
I have used one but for reverse purpose "text to speech".
Though this is not a free library, i would love to help just Google "annosoft lipsync"...
http://www.annosoft.com/lipsync-sdks
This library is available for SDK evaluation as well....

How to decode speech input

What I want to do is create an API that translates human speech into the IPA (International Phonetic Alphabet) format. My question is, where are the resources on how to decode speech at the level of the original audio waveform. I looked for an API, but most of what I found just translates straight to the roman alphabet. I'm looking to create something a little more accurate in its ability to distinguish vocal phonetics.
I would just like to start out by saying that this project is much more difficult and complicated than you think it is. Speech to text processing is a very large and complicated field with a huge amount of research that has been done into it. The reason most parsers send things straight to roman characters is because most of their processing is a probabilistic matching of vague sounds with their context of other vague sounds to guess which words make sense together. You are much more likely to find something that will give you Soundex rather than IPA. That said, this is a problem that has been approached on several fronts. Your best bet is probably the Sphinx project from CMU.
http://cmusphinx.sourceforge.net/wiki/start
That will give you a good start, but you make an assumption that speech to text processing is a lot more developed than it actually is, and there is no simple way of translating speech to IPA through the waveform with any kind of accuracy. Sphinx is very modular and completely open source and so it would give you a huge amount of power at your fingertips, and at that point whether or not you can figure out how to make this work is up to you, but again. This is not a solved problem in any way.

How to count number of words spoken using any method (SR or otherwise)

I am having some trouble getting pointers to how to perform what appears to be a deceptively easy task:
Given an audio stream, how do you count the number of words that have been spoken, in real-time?
I don't need to recognize what the words are, but rather just have an accurate counter on words that have been uttered. The counter doesn't have to be too accurate and could even consider utterances and other "grunts" like coughs.
It appears that all Speech Recognition systems depend on a pre-defined grammar to be provided before they can analyze the phonemes that are spoken to convert to known words with some degree of accuracy. But I don't care about the accuracy at all, but rather the rate of words being spoken.
What is important is that this runs in real time, and allow the system to provide alerts after a certain number of words have been spoken. The system will encourage a visual cue to pause, and then the speaker can continue.
I've looked at CMU Sphinx FAQ and found that the idea of "word spotting" is not yet supported. I don't really need a real time search of particular words, but it approximates more closely to what I am looking for. Looking for very small silences in the waveform seems to be a very crude way of doing this and probably not very accurate at all, but that's all I have right now.
Any pointers on algorithms, research papers or any other insights would be appreciated!

How to compare spoken audio against reference recording - language learning

I am looking for a way to compare a user submitted audio recording against a reference recording for comparison in order to give someone a grade or percentage for language learning.
I realize that this is a very un-scientific way of doing things and is more than a gimmick than anything.
My first thoughts are some sort of audio fingerprinting, or waveform comparison.
Any ideas where I should be looking?
This is by no means a trivial problem to solve, though there is an abundance of research on the topic. Presently the most successful forms of machine learning in the speech recognition domain apply Hidden Markov Model techniques.
You may also want to take a look at existing implementations of HMM algorithms. One such library in its early stages is ghmm.
Perhaps even better and more readily applicable to your problem is HTK.
In addition to chomp's great answer, one important keyword you probably need to look up is Dynamic Time Warping (DTW). This is the wikipedia article: http://en.wikipedia.org/wiki/Dynamic_time_warping

How to split male and female voices from an audio file(in c++ or java)

I want to differentiate betwen the male n female voices in an audio file and seperate them.As an output I want the two voices seperated.Can u please help me out n can the coding be done in java or c++
This is potentially a very complicated question, and it is similar to writing your own speech recognition (or identification) algorithm.
You would start by converting the audio into the frequency domain, which is done using a Fast Fourier Transform.
For each slice in time that you take an FFT, this will give you a list of frequencies and their amplitudes. You will somehow need to detect the fundamental tone by analysing the harmonics. The 2nd and 3rd harmonics will be clearest. It's very hard to figure out which harmonics they are, especially with the background noise and the natural difference between people's voices in terms of which harmonics are loudest. Then you can try to determine if the speaker is male or female by whatever you guessed the fundamental tone to be.
Keep in mind that during many parts of speech like sibilance ('s', 't', etc) there is no tone, just noise. It will need to be pretty intelligent.
Hope that sets you in the right general direction.
Note: if the two voices are simultaneous and you want to separate them cleanly, then this won't help you. I don't believe anyone alive has solved such a problem.
I think this is already possible. I just started taking an on-line course on Machine Learning by Stanford University with professor Andrew Ng, and during the first lecture he shows a demo where an audio recording of two overlapping voices is processed and the individual voices extracted (the same with music in the background and a person speaking). Apparently it uses an unsupervised learning algorithm that allows it to extract the two underlying patterns. You may want to look into that course (there's one version of the course here: http://www.academicearth.org/courses/machine-learning)
One such tool that makes this possible is LIUM spkdiarization. Written in Java and available under GPL, it is a speech recognition tool and uses statistical models for male, female and child. Luckily for you, the models are provided and you can use it without having to tag the recordings and train the models.
See the scripting page of the LIUM wiki for examples, search in page for "gender".
I would start by saying this is impossible. Speech recognition is really, really hard.
You're not clear in your question - are the voices overlapping? If so, splitting them up will be absurdly difficult.
If they are separate, your more likely bet is to have a large set of samples of male and female voices, and look for common characteristics (and a way to programmatically identify them). If the samples aren't recorded cleanly (if they have background noise), things get even more complicated.
You may get away with an average tone - male voices are generally deeper than female..
What you are asking is one hell of a task. thomasrutter wrote some "pointers" how to do it - but, i guess the algorithm would have to be really really robust if you would wish to use it everywhere (in all sorts of music (with singing of course)). Maybe it would be better/easier to start with separating (spliting) a single instrument sample from the song.

Resources