I have many audio files containing clean audio with only spoken voice in Mandarin Chinese. I need an estimate of how many syllables are spoken in each file. Is there a tool for OS X, Windows, or Linux that can produce estimates like these?
sample01.wav 15
sample02.wav 8
sample03.wav 5
sample04.wav 1
sample05.wav 18
As there are many files, command-line or batch-capable software is preferred, e.g.:
$ application sample01.wav
15
A solution that uses speech-to-text and then counts the number of characters present would be suitable too.
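To illustrate the speech-to-text route, here is a minimal Python sketch; it assumes the SpeechRecognition package and Google's free web recognizer (an assumption, not a requirement) and counts each recognized Han character as one syllable, which holds for Mandarin:

import re
import sys
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile(sys.argv[1]) as source:
    audio = recognizer.record(source)

# Google's free web recognizer; needs a network connection and may rate-limit
text = recognizer.recognize_google(audio, language="zh-CN")

# In Mandarin, each Han character corresponds to one spoken syllable
print(len(re.findall(r"[\u4e00-\u9fff]", text)))

Called as python count_syllables.py sample01.wav, it prints a single number, matching the batch usage shown above.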
Automatic segmentation of speech is an active research area, which means that no method works perfectly.
In 2009, de Jong and Wempe proposed a method to automatically detect syllables in a human speech signal using Praat. This method compares well with man-made segmentation and has been employed in many third-party scientific studies. You can find a detailed description of the method in their scientific article (pdf), along with a historical perspective on previously proposed methods. The Praat script itself and a couple of tutorials can be found on a dedicated website (www - speechrate).
You may also be interested in another segmentation algorithm developed by Harma that has been implemented in Matlab (Harma Syllable Segmentation).
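To give a rough feel for what these syllable-nuclei methods do, here is a minimal Python sketch (assuming numpy and librosa, which neither tool actually uses) that counts intensity peaks above a threshold. Unlike the Praat script it performs no voicedness check, so treat it purely as an illustration of the idea:

import sys
import numpy as np
import librosa

def count_syllable_nuclei(path, peak_db_above_median=2.0, min_gap_s=0.1):
    y, sr = librosa.load(path, sr=16000)
    # Short-time RMS energy in dB, a rough stand-in for the Praat intensity contour
    rms = librosa.feature.rms(y=y, frame_length=400, hop_length=160)[0]
    db = librosa.amplitude_to_db(rms, ref=np.max)
    threshold = np.median(db) + peak_db_above_median
    hop_s = 160 / sr
    peak_times = []
    for i in range(1, len(db) - 1):
        is_local_max = db[i] >= db[i - 1] and db[i] > db[i + 1]
        if db[i] > threshold and is_local_max:
            t = i * hop_s
            # Merge peaks that are closer together than a plausible syllable
            if not peak_times or t - peak_times[-1] >= min_gap_s:
                peak_times.append(t)
    return len(peak_times)

if __name__ == "__main__":
    print(count_syllable_nuclei(sys.argv[1]))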
You can use formants to determine this. Each syllable should correspond to a formant. Here is more information on formants:
https://en.wikipedia.org/wiki/Formants
This might be of interest to you:
http://sites.google.com/site/speechrate/
Your question needs a dedicated speech-to-text solution.
I really doubt there is a free, open-source library that is readily available and serves this purpose.
I have used one, but for the reverse purpose, "text to speech".
Though this is not a free library, I would love to help: just Google "annosoft lipsync"...
http://www.annosoft.com/lipsync-sdks
The library is available for SDK evaluation as well.
I'm looking for a way to match a known data set, let's say a list of MP3s or wav files, each which is a sample of someone speaking. At this point I know file ABC is of Person X speaking.
I would then like to take another sample and do some voice matching to show whose voice it most likely is, given the known data set.
Also, I don't necessarily care what the person has said, as long as I can find a match, i.e. I don't need any transcribing or the like.
I'm aware CMU Sphinx doesn't do voice recognition, and it's primarily used for voice-to-text, but I have seen other systems, eg: the LIUM Speaker Diarization (http://cmusphinx.sourceforge.net/wiki/speakerdiarization) or the VoiceID project (https://code.google.com/p/voiceid/) which uses CMU as a base for this type of work.
If I am to use CMU, how can I do voice matching?
Also, if CMU Sphinx isn't the best framework, is there an alternate that's open source?
This is a subject complex enough for a PhD thesis. There are no good, reliable systems as of right now.
The task you're up for is a very complex one. How you should approach it depends on your situation.
Do you have a limited number of people? How many?
How much data do you have for each person?
If you have very few people to recognize, you may attempt something as simple as obtaining formants of those people and comparing them to a sample.
Otherwise - you have to contact some academics who work on the subject or jury-rig a solution of your own. Either way, as I said, it is a difficult problem.
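If you do end up rigging your own solution, a common baseline is to model each known speaker with a Gaussian mixture over MFCC features and score a new sample against every model. Here is a minimal sketch assuming numpy, librosa, and scikit-learn (none of which are prescribed above); the file names are placeholders:

import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path):
    y, sr = librosa.load(path, sr=16000)
    # 13 MFCCs per frame; rows are frames, columns are coefficients
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

def train_speaker_model(wav_paths, n_components=8):
    X = np.vstack([mfcc_features(p) for p in wav_paths])
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(X)
    return gmm

def identify(sample_path, models):
    """models: dict mapping speaker name -> trained GaussianMixture."""
    feats = mfcc_features(sample_path)
    # Higher average log-likelihood means a better match
    scores = {name: gmm.score(feats) for name, gmm in models.items()}
    return max(scores, key=scores.get)

# models = {"person_x": train_speaker_model(["abc_1.wav", "abc_2.wav"]),
#           "person_y": train_speaker_model(["def_1.wav"])}
# print(identify("unknown.wav", models))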
I am working on a project where a biometric system is used to secure the system. We are planning to use the human voice for this.
The idea is to allow the person to say some words or sentences, and the system will store that voice in digital format. The next time the person wants to enter the system, he/she has to speak some words, which may or may not be different from the words used earlier.
We don't want to match the words; we want to match the voice frequency.
I have read some research papers regarding this system but those papers don't have any implementation details.
So I just want to know whether there is any software/API that can convert analog voice into digital format and also tell us the frequency of the voice.
Until now I have been working on normal web-based applications, so I know common APIs and platforms like Java EE, C#, etc., but I don't have any experience with this kind of application.
Please enlighten !!!
http://www.loquendo.com/en/products/speaker-verification/
http://www.nuance.com/for-business/by-solution/contact-center-customer-care/cccc-solutions-services/verifier/index.htm
(two links removed due to reported virus content)
http://www.persay.com/products.asp
This is as good a starting point as any : http://marsyas.info/
It's an open-source software framework for audio processing. They've listed a bunch of projects that have used their framework in various ways, so you could probably draw inspiration from them.
http://marsyas.info/about/projects. The Telligence project in particular seems the closest to your needs, as it was used to classify audio by gender: http://marsyas.info/about/projects#5Teligence
There are two steps to a project like this one, I believe:
The first step would be to record the voice from an analog input into a digital format (let's assume WAV-PCM). For this you can use the DirectShow API in C#, or standard Wav-In as in this project: http://www.codeproject.com/KB/audio-video/cswavrec.aspx. You may consider compressing your audio files later on; there are many options for this. On Windows you may consider the Windows Media Format SDK to avoid licensing issues with other formats.
The second step is to build or use a voice recognition framework. If you want to build a recognition framework, you will probably need to define a set of "features" for your sound fragments and select and implement a recognition algorithm. There are many approaches available for this; the IEEE and ACM.org websites are usually good sources. If you want to use an existing framework, you may want to consider Nuance Recognizer (commercial) or http://cmusphinx.sourceforge.net (open source).
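To illustrate just the "frequency of the voice" part of the original question, here is a minimal Python sketch (assuming numpy and scipy rather than the C#/DirectShow route above) that estimates the fundamental frequency of a short WAV clip by autocorrelation; the file name is hypothetical:

import numpy as np
from scipy.io import wavfile

def estimate_f0(path, fmin=60.0, fmax=400.0):
    sr, x = wavfile.read(path)
    if x.ndim > 1:
        x = x.mean(axis=1)   # mix stereo down to mono
    x = x.astype(np.float64)
    x -= x.mean()
    # Autocorrelation of the whole clip (fine for a short, voiced utterance)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

# print(estimate_f0("enrollment_sample.wav"))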
Hope this helps.
I would like to synchronize a spoken recording against a known text. Is there a speech-to-text / natural language processing library that would facilitate this? I imagine I'd want to detect word boundaries and compute candidate matches from a dictionary. Most of the questions I've found on SO concern written language.
Desired, but not required:
Open Source
Compatible with American English out-of-the-box
Cross-platform
Thoroughly documented
Edit: I realize this is a very broad, even naive, question, so thanks in advance for your guidance.
What I've found so far:
OpenEars (iOS Sphinx/Flite wrapper)
Forced Alignment
It sounds like you want to do forced alignment between your audio and the known text.
Pretty much all research- or industry-grade speech recognition systems will be able to do this, since forced alignment is an important part of training a recognition system on data that doesn't have phone-level alignments between the audio and the transcript.
Alignment CMUSphinx
The Sphinx4-1.0 beta 5 release of CMU's open source speech recognition system now includes a demo on how to do alignment between a transcript and long speech recordings.
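The Sphinx4 demo does the real alignment work against the acoustic signal. Purely to illustrate the idea of matching a known text to recognizer output, here is a toy Python sketch that assumes you already have word-level timestamps from some recognizer (the hyp list below is made up):

from difflib import SequenceMatcher

# Hypothetical recognizer output: (word, start time in seconds)
hyp = [("the", 0.00), ("quick", 0.31), ("brown", 0.58), ("fox", 0.82)]
ref = "The quick brown fox".split()

matcher = SequenceMatcher(a=[w.lower() for w, _ in hyp],
                          b=[w.lower() for w in ref])
times = {}
for block in matcher.get_matching_blocks():
    for k in range(block.size):
        # reference word index -> start time of the matching recognized word
        times[block.b + k] = hyp[block.a + k][1]

for i, word in enumerate(ref):
    print(word, times.get(i, "?"))

A real forced aligner works at the acoustic-model level rather than on recognized words, so use the CMUSphinx demo for anything serious.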
I want to differentiate between the male and female voices in an audio file and separate them. As an output I want the two voices separated. Can you please help me out, and can the coding be done in Java or C++?
This is potentially a very complicated question, and it is similar to writing your own speech recognition (or identification) algorithm.
You would start by converting the audio into the frequency domain, which is done using a Fast Fourier Transform.
For each slice in time that you take an FFT, this will give you a list of frequencies and their amplitudes. You will somehow need to detect the fundamental tone by analysing the harmonics. The 2nd and 3rd harmonics will be clearest. It's very hard to figure out which harmonics they are, especially with the background noise and the natural difference between people's voices in terms of which harmonics are loudest. Then you can try to determine if the speaker is male or female by whatever you guessed the fundamental tone to be.
Keep in mind that during many parts of speech like sibilance ('s', 't', etc) there is no tone, just noise. It will need to be pretty intelligent.
Hope that sets you in the right general direction.
Note: if the two voices are simultaneous and you want to separate them cleanly, then this won't help you. I don't believe anyone alive has solved such a problem.
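For the single-speaker case described above, here is a crude Python sketch (assuming numpy and scipy) that takes an FFT per frame, picks the strongest peak in the typical pitch range, and thresholds the median. It skips the harmonic analysis discussed above, so expect it to be fooled easily:

import numpy as np
from scipy.io import wavfile

def classify_gender(path, frame_s=0.04, f_split=165.0):
    sr, x = wavfile.read(path)
    if x.ndim > 1:
        x = x.mean(axis=1)   # mix down to mono
    x = x.astype(np.float64)
    n = int(sr * frame_s)
    pitches = []
    for start in range(0, len(x) - n, n):
        frame = x[start:start + n] * np.hanning(n)
        spec = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(n, 1.0 / sr)
        band = (freqs > 60) & (freqs < 400)       # typical voice pitch range
        if spec[band].max() > 10 * spec.mean():   # crude voiced-frame check
            pitches.append(freqs[band][np.argmax(spec[band])])
    if not pitches:
        return "unknown"
    return "male" if np.median(pitches) < f_split else "female"

# print(classify_gender("speaker.wav"))   # hypothetical file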
I think this is already possible. I just started taking an online course on Machine Learning from Stanford University with Professor Andrew Ng, and during the first lecture he shows a demo where an audio recording of two overlapping voices is processed and the individual voices extracted (the same with music in the background and a person speaking). Apparently it uses an unsupervised learning algorithm that allows it to extract the two underlying patterns. You may want to look into that course (there's one version of the course here: http://www.academicearth.org/courses/machine-learning)
One such tool that makes this possible is LIUM spkdiarization. Written in Java and available under the GPL, it is a speaker diarization toolkit that uses statistical models for male, female, and child voices. Luckily for you, the models are provided, so you can use it without having to tag recordings and train the models yourself.
See the scripting page of the LIUM wiki for examples; search the page for "gender".
I would start by saying this is impossible. Speech recognition is really, really hard.
You're not clear in your question - are the voices overlapping? If so, splitting them up will be absurdly difficult.
If they are separate, your more likely bet is to have a large set of samples of male and female voices, and look for common characteristics (and a way to programmatically identify them). If the samples aren't recorded cleanly (if they have background noise), things get even more complicated.
You may get away with an average tone - male voices are generally deeper than female voices.
What you are asking is one hell of a task. thomasrutter wrote some "pointers" on how to do it, but I guess the algorithm would have to be really robust if you wished to use it everywhere (in all sorts of music, with singing of course). Maybe it would be better/easier to start with separating (splitting) a single instrument sample from the song.
For a project we're currently working on, we need a library of spoken words in many different languages.
Two options seem possible: text-to-speech or "real" recordings by native speakers. As quality is important to us, we're thinking about taking the latter path.
In order to create a prototype for our application, we're looking for libraries that contain as many words in different languages as possible. To get a feeling for the quality of our approach, this library should not be made up of synthesized speech.
Do you know of any available/accessible libraries?
A co-worker just found this community based library, which is nice, but rather small in size:
Forvo.com
I've just found this on the Audacity wiki: VoxForge. From their site:
VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
We will make available all submitted audio files under the GPL license, and then 'compile' them into acoustic models for use with Open Source speech recognition engines such as Sphinx, ISIP, Julius and HTK (note: HTK has distribution restrictions).
There is also Old time radio; not sure if this is the sort of spoken word you're after, though.
My guess is that you won't find a library anywhere that consists of just individual words. Whatever you find, you're going to have to open the audio up in an editor (like Pro Tools or Cool Edit) and chop it up into individual words.
You would probably be better off creating a list of all the words you need for each language, and then finding native speakers to read them while you record. You can have them read slowly, so that you'll have an easy time chopping up each individual word.
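If the recordings are clean, the chopping itself can be scripted instead of done by hand. Here is a minimal sketch assuming the Python pydub package and a hypothetical input file; it splits on silences and writes one WAV per word:

from pydub import AudioSegment
from pydub.silence import split_on_silence

# "word_list.wav" is a hypothetical recording of one speaker reading a word list
recording = AudioSegment.from_wav("word_list.wav")

chunks = split_on_silence(
    recording,
    min_silence_len=400,                   # ms of silence that separates words
    silence_thresh=recording.dBFS - 16,    # anything 16 dB below average counts as silence
    keep_silence=100,                      # keep a little padding around each word
)

for i, chunk in enumerate(chunks):
    chunk.export(f"word_{i:03d}.wav", format="wav")

You would still want to spot-check the results, since breaths and plosives can fool a simple silence threshold.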
One I used to use a lot: http://shtooka.net/index.php
Easy access to the recordings.