Library for extracting words (speech) out from audio stream? - audio

I have an audio stream and I would extract words (speech) from it. So for example having audio.wav I would get 001.wav, 002.wav, 003.wav, etc where each XXX.wav is one word.
I am looking for a library or program to do it -- platform does not matter, but I prefer open-source solution.
Thank you in advance for help.

Nuance, the company that makes Dragon Naturally Speaking, has a number of Software Development Kits.
The Audio Mining kit seems to match your requirements:
Dragon NaturallySpeaking SDK
AudioMining is a speaker-independent
speech recognition toolkit that
enables the indexing of 100% of the
speech information within audio files.
The technology uses highly accurate
speech recognition to turn audio files
into XML text with timestamp
information. This can be integrated
with standard text-search products to
enable rapid access to specific audio
content.
The speech to speech+metadata is far and away the hardest part to get right. Once you have the speech + metadata, extracting the words as individual audio files is much more straightforward.

Related

Are there any open-source phoneme sets (for speech synthesis)?

I am trying to make a super basic speech synthesizer, and I need some form of phoneme audio files so that I can piece them together and build words. Are there any open phoneme sets that I would be able to use for this?
For super basic speech synthesizer it's worth to check espeak http://espeak.sourceforge.net, it's better than to glue sound files together.
This may be more than you're looking for, but have you checked into http://www.vocaloid.com/en/ by any chance? There are many speech products on the market. You might also be interested in http://msdn.microsoft.com/en-us/library/hh361572(v=office.14).aspx

Audio content analysis for online audiovisual data

I want to work on a project where I have to segment and classify online audiovisual data based on its audio content, i.e. different parts of the audio visual data will be segmented and classified as silence, music, speech, speech+background music, etc based on their audio content.
I am aware that I have to obtain the audio part from the audiovisual data and extract features like zero crossing, spectral peaks, etc. and find out segment boundaries in order to segment audio data.
But I'm lost in the beginning itself.
I do not know how to start off with the project. The output of the software are segments of audiovisual data under different categories like silence, speech, music, etc.
It will be really helpful if someone lets me know
Which programming language is convenient for this purpose?
What steps should i follow in order to develop this software?
I have no background in digital signal processing. It'll be really helpful if I get some guidance
I'd suggest to look into a multimedia framework such as GStreamer. It is crossplatform, but the easiest to get started on Linux where it originates from. It already comes with all kind of plugins to receve, demux and decode audio and video. It also has a couple of analyzers (such as level and spectrum analyzers for audio as well as voice activity detection). Those could be a good starting point for your experiments. Gstreamer itself is written in C, but applications can use the language bindings to python, perl, c#, c++, java, ...

How to convert human voice into digital format?

I am working on a project where biometric system is used to secure the system. We are planning to use human voice to secure the system.
Idea is to allow the person to say some words or sentences and system will store that voice in digital format. Next time person wants to enter the system, he/she has to speak some words which may or may not be different from the words used earlier.
We don't want to match words but want to match voice frequency.
I have read some research papers regarding this system but those papers don't have any implementation details.
So just want to know whether there is any software/API which can convert analog voice into digital format and will also tell us the frequency of voice.
Until now I was working on normal web based applications so I know normal APIs and platforms like Java EE, C#, etc but I don't have any experience about this kind of application.
Please enlighten !!!
http://www.loquendo.com/en/products/speaker-verification/
http://www.nuance.com/for-business/by-solution/contact-center-customer-care/cccc-solutions-services/verifier/index.htm
(two links removed due to reported virus content)
http://www.persay.com/products.asp
This is as good a starting point as any : http://marsyas.info/
It's a open source software framework for audio processing. They've listed a bunch of projects that have used their framework in various ways so you could probably draw inspiration from it.
http://marsyas.info/about/projects. The Telligence project in particular seems the closest to your needs as it it was used to gender classify audio : http://marsyas.info/about/projects#5Teligence
There are two steps on a project like this one I believe:
First step would be to record the voice from an analog input into digital format (let's assume wav-pcm). For this you can use DirectShow API in C#, or standard Wav-In as in this project: http://www.codeproject.com/KB/audio-video/cswavrec.aspx. You may consider compressing your audio files later on, there are many options for this, in Windows you may consider Windows Media Format SDK to avoid licensing issues with other formats.
Second step is to build or use a voice recognition framework, if you want to build a recognition framework you will probably need to define a set of "features" for your sound fragments and select+implement a recognition algorithm. There are many aproaches available for this, IEEE amd ACM.org websties are usually good sources. If you want to use an existing framework you may want to consider Nuance Recognizer (commercial) or http://cmusphinx.sourceforge.net (open source).
Hope this helps.

Is there open source audio feature extraction software avaliable?

I undertaking a personal project which involves the development of a system which will automatically generate audio thumbnail clips (about 30 seconds in length) from a full length track.
In order to do this I want to look at the energy and pitch of the audio to try and correctly identify its major structural features.
Is there any open source software available that can do energy/pitch extraction? If not I will start looking into alternative methods using MATLAB.
Thanks!
YAAFE (Yet Another Audio Feature Extractor) http://yaafe.sourceforge.net/ does audio feature extraction in MATLAB, Python and C.
You might want to look into the Echo Nest API. It has a lot of audio analysis capabilities, and I know there's a script bundled in the Remix package that can automagically turn songs into shorter or longer versions (I believe the script is called earworm).
Audacity may do it.
Try JAudio which can extract features from an audio.
MARSYAS contains bextract for analysis, can find MFCCs and various other timbral and spectral features. http://marsyas.info/

Libraries of audio samples (spoken text)

For a project we're currently working on, we need a library of spoken words in many different languages.
Two options seem possible: text-to-speech or "real" recordings by native speakers. As the quality is important to us, we're thinking about going the latter path.
In order to create a prototype for our application, we're looking for libraries that contain as many words in different languages as possible. To get a feeling for the quality of our approach, this library should not be made up of synthesized speech.
Do you know of any available/accessible libraries?
A co-worker just found this community based library, which is nice, but rather small in size:
Forvo.com
I've just found this on the Audacity wiki: VoxForge. From their site:
VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
We will make available all submitted audio files under the GPL license, and then 'compile' them into acoustic models for use with Open Source speech recognition engines such as Sphinx, ISIP, Julius and HTK (note: HTK has distribution restrictions).
There is also Old time radio, not sure if this is the sort of spoken word you're after though.
My guess is that you won't find a library anywhere that consists of just individual words. Whatever you find, you're going to have to open the audio up in an editor (like Pro Tools or Cool Edit) and chop it up into individual words.
You would probably be better off creating a list of all the words you need for each language, and then finding native speakers to read them while you record. You can have them read slowly, so that you'll have an easy time chopping up each individual word.
One I use to use a lot: http://shtooka.net/index.php
Easy access to the recordings.

Resources