Libraries of audio samples (spoken text) - audio

For a project we're currently working on, we need a library of spoken words in many different languages.
Two options seem possible: text-to-speech or "real" recordings by native speakers. As the quality is important to us, we're thinking about going the latter path.
In order to create a prototype for our application, we're looking for libraries that contain as many words in different languages as possible. To get a feeling for the quality of our approach, this library should not be made up of synthesized speech.
Do you know of any available/accessible libraries?

A co-worker just found this community-based library, which is nice, but rather small in size:
Forvo.com

I've just found this on the Audacity wiki: VoxForge. From their site:
VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
We will make available all submitted audio files under the GPL license, and then 'compile' them into acoustic models for use with Open Source speech recognition engines such as Sphinx, ISIP, Julius and HTK (note: HTK has distribution restrictions).

There is also Old Time Radio, though I'm not sure this is the sort of spoken word you're after.

My guess is that you won't find a library anywhere that consists of just individual words. Whatever you find, you're going to have to open the audio up in an editor (like Pro Tools or Cool Edit) and chop it up into individual words.
You would probably be better off creating a list of all the words you need for each language, and then finding native speakers to read them while you record. You can have them read slowly, so that you'll have an easy time chopping up each individual word.
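If the speakers pause between words, the chopping can be partly automated. Here is a minimal sketch, assuming the recording is already loaded as a list of floats in [-1, 1]; the function name, the threshold and the minimum pause length are my own illustrative choices and would need tuning for real recordings.

```python
import math

def split_on_silence(samples, rate, threshold=0.02, min_silence=0.3, frame=0.01):
    """Return (start_sec, end_sec) spans of audio separated by long pauses.

    samples: floats in [-1, 1]; rate: samples per second.
    threshold, min_silence and frame are illustrative defaults, not tuned values.
    """
    hop = max(1, int(rate * frame))
    # Mark each short frame as loud or quiet based on its RMS level.
    loud = []
    for i in range(0, len(samples), hop):
        chunk = samples[i:i + hop]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        loud.append(rms >= threshold)

    min_gap = int(round(min_silence / frame))  # quiet frames that end a word
    spans, start, quiet = [], None, 0
    for idx, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = idx
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_gap:
                end = idx - quiet + 1  # index of the first quiet frame
                spans.append((start * hop / rate, end * hop / rate))
                start, quiet = None, 0
    if start is not None:
        # The recording ended before a long enough pause was seen.
        end = len(loud) - quiet
        spans.append((start * hop / rate, min(end * hop, len(samples)) / rate))
    return spans
```

Each returned span can then be cut out and saved as its own file. Pauses shorter than `min_silence` stay inside a word, so slow, well-separated reading makes this much more reliable.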

One I used to use a lot: http://shtooka.net/index.php
Easy access to the recordings.

Related

How to count the number of spoken syllables in an audio file?

I have many audio files with clean audio and only spoken voice in Mandarin Chinese. I need an estimate of how many syllables are spoken in each file. Is there a tool for OS X, Windows, or Linux that can estimate this?
sample01.wav 15
sample02.wav 8
sample03.wav 5
sample04.wav 1
sample05.wav 18
As there are many files, command-line or batch-capable software is preferred, e.g.:
$ application sample01.wav
15
A solution that uses speech-to-text and then counts the number of characters present would be suitable too.
The automatic segmentation of speech is an active scientific domain, meaning that there is no method that works perfectly.
In 2009, de Jong and Wempe proposed a method to automatically detect syllables in a human speech signal using Praat. This method compares well with manual segmentation, and has been employed in many third-party scientific studies. You can find a detailed description of the method in their scientific article (pdf), along with a historical perspective on previously proposed methods. The Praat script itself and a couple of tutorials can be found on a dedicated website (www - speechrate).
You may also be interested in another segmentation algorithm developed by Harma that has been implemented in Matlab (Harma Syllable Segmentation).
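To give a feel for the intensity-based idea, here is a rough sketch that counts peaks in the frame-level intensity that rise at least a couple of dB above the preceding dip. It is only loosely inspired by the de Jong and Wempe approach as I understand it, it is not their Praat script, and it skips the voicing check entirely; the function names and default thresholds are my own assumptions.

```python
import math

def frame_levels_db(samples, rate, frame_sec=0.02):
    """Mean power of consecutive frames, in dB relative to full scale."""
    hop = max(1, int(rate * frame_sec))
    levels = []
    for i in range(0, len(samples) - hop + 1, hop):
        power = sum(x * x for x in samples[i:i + hop]) / hop
        levels.append(10 * math.log10(power + 1e-12))  # offset avoids log(0)
    return levels

def count_intensity_peaks(samples, rate, floor_db=-40.0, min_dip_db=2.0):
    """Count local intensity maxima that follow a dip of at least min_dip_db.

    samples: floats in [-1, 1]. floor_db and min_dip_db are illustrative
    defaults (assumptions), not values taken from any published script.
    """
    levels = frame_levels_db(samples, rate)
    if len(levels) < 3:
        return 0
    count = 0
    valley = levels[0]  # lowest level seen since the last counted peak
    for i in range(1, len(levels) - 1):
        cur = levels[i]
        is_peak = cur > levels[i - 1] and cur >= levels[i + 1]
        if is_peak and cur >= floor_db and cur - valley >= min_dip_db:
            count += 1
            valley = cur  # the next peak must follow a fresh dip
        valley = min(valley, cur)
    return count
```

Reading each WAV file into floats and printing the count per file would give the batch output asked for in the question, but expect over- and under-counting on real speech without the extra checks the published method applies.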
You can use formants to determine this. Each syllable should correspond to a formant. Here is more information on formants:
https://en.wikipedia.org/wiki/Formants
This might be of interest for you
http://sites.google.com/site/speechrate/
Your question calls for a speech-to-text solution.
I really doubt you will find a free, open-source, easily available library that serves this purpose.
I have used one, but for the reverse purpose: "text to speech".
Though it is not a free library, I would love to help: just Google "annosoft lipsync".
http://www.annosoft.com/lipsync-sdks
This library is also available for SDK evaluation.

Are there any open-source phoneme sets (for speech synthesis)?

I am trying to make a super basic speech synthesizer, and I need some form of phoneme audio files so that I can piece them together and build words. Are there any open phoneme sets that I would be able to use for this?
For a super basic speech synthesizer it's worth checking out espeak (http://espeak.sourceforge.net); that works better than gluing sound files together.
This may be more than you're looking for, but have you checked into http://www.vocaloid.com/en/ by any chance? There are many speech products on the market. You might also be interested in http://msdn.microsoft.com/en-us/library/hh361572(v=office.14).aspx

How to convert human voice into digital format?

I am working on a project where a biometric system is used to secure the system. We are planning to use the human voice to secure the system.
The idea is to allow the person to say some words or sentences, and the system will store that voice in digital format. The next time the person wants to enter the system, he/she has to speak some words, which may or may not be different from the words used earlier.
We don't want to match words but want to match voice frequency.
I have read some research papers regarding this system but those papers don't have any implementation details.
So I just want to know whether there is any software/API which can convert an analog voice into digital format and will also tell us the frequency of the voice.
Until now I have been working on normal web-based applications, so I know normal APIs and platforms like Java EE, C#, etc., but I don't have any experience with this kind of application.
Please enlighten !!!
http://www.loquendo.com/en/products/speaker-verification/
http://www.nuance.com/for-business/by-solution/contact-center-customer-care/cccc-solutions-services/verifier/index.htm
(two links removed due to reported virus content)
http://www.persay.com/products.asp
This is as good a starting point as any: http://marsyas.info/
It's an open source software framework for audio processing. They've listed a bunch of projects that have used their framework in various ways, so you could probably draw inspiration from them:
http://marsyas.info/about/projects. The Teligence project in particular seems the closest to your needs, as it was used to classify audio by gender: http://marsyas.info/about/projects#5Teligence
I believe there are two steps in a project like this one:
The first step is to record the voice from an analog input into digital format (let's assume wav-pcm). For this you can use the DirectShow API in C#, or standard Wav-In as in this project: http://www.codeproject.com/KB/audio-video/cswavrec.aspx. You may consider compressing your audio files later on. There are many options for this; on Windows you may consider the Windows Media Format SDK to avoid licensing issues with other formats.
The second step is to build or use a voice recognition framework. If you want to build a recognition framework, you will probably need to define a set of "features" for your sound fragments and select and implement a recognition algorithm. There are many approaches available for this; the IEEE and ACM.org websites are usually good sources. If you want to use an existing framework, you may want to consider Nuance Recognizer (commercial) or http://cmusphinx.sourceforge.net (open source).
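As an example of a very simple "feature", here is a sketch that estimates the fundamental frequency (pitch) of a short voiced fragment by picking the strongest autocorrelation lag. It assumes the recorded PCM data has already been converted to a list of floats; the function name and the 60 to 400 Hz search range are my own assumptions. Pitch alone is unlikely to be enough to tell speakers apart, so treat this as an illustration of feature extraction rather than a verification method.

```python
def estimate_pitch(samples, rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency in Hz from the autocorrelation peak.

    samples: floats from a short voiced fragment; rate: samples per second.
    fmin and fmax bound the search and are illustrative (assumed) defaults.
    Returns None when no positive correlation is found (e.g. silence).
    """
    min_lag = int(rate / fmax)
    max_lag = int(rate / fmin)
    n = len(samples)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, min(max_lag, n - 1) + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        score = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return rate / best_lag if best_lag else None
```

The loop is quadratic in the fragment length, so it is best run on fragments of a few hundred to a few thousand samples; the resolution is limited to whole-sample lags.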
Hope this helps.

Synchronizing text and audio. Is there a NLP/speech-to-text library to do this?

I would like to synchronize a spoken recording against a known text. Is there a speech-to-text / natural language processing library that would facilitate this? I imagine I'd want to detect word boundaries and compute candidate matches from a dictionary. Most of the questions I've found on SO concern written language.
Desired, but not required:
Open Source
Compatible with American English out-of-the-box
Cross-platform
Thoroughly documented
Edit: I realize this is a very broad, even naive, question, so thanks in advance for your guidance.
What I've found so far:
OpenEars (iOS Sphinx/Flite wrapper)
Forced Alignment
It sounds like you want to do forced alignment between your audio and the known text.
Pretty much all research/industry grade speech recognition systems will be able to do this, since forced alignment is an important part of training a recognition system on data that doesn't have phone-level alignments between the audio and the transcript.
Alignment CMUSphinx
The Sphinx4-1.0 beta 5 release of CMU's open source speech recognition system now includes a demo on how to do alignment between a transcript and long speech recordings.
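If your recognizer only gives you a word list with timestamps rather than a proper aligner, a much cruder text-level fallback is to match that output against the known text. The sketch below uses Python's difflib for this; it is not acoustic forced alignment, and the `(word, start, end)` input format is a hypothetical one that you would adapt to whatever your recognizer actually returns.

```python
from difflib import SequenceMatcher

def align_transcript(known_words, recognized):
    """Attach recognizer timestamps to the words of a known transcript.

    known_words: list of words from the known text.
    recognized: list of (word, start_sec, end_sec) tuples; this format is
    an assumption, not the output of any particular recognizer.
    Returns one (start, end) pair or None per known word.
    """
    ref = [w.lower() for w in known_words]
    hyp = [w.lower() for w, _, _ in recognized]
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    timings = [None] * len(known_words)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = recognized[block.b + k]
            timings[block.a + k] = (start, end)
    return timings
```

Words the recognizer got wrong come back as `None`, and their times could be interpolated from the matched neighbours.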

Is there open source audio feature extraction software available?

I am undertaking a personal project which involves the development of a system which will automatically generate audio thumbnail clips (about 30 seconds in length) from a full-length track.
In order to do this I want to look at the energy and pitch of the audio to try and correctly identify its major structural features.
Is there any open source software available that can do energy/pitch extraction? If not I will start looking into alternative methods using MATLAB.
Thanks!
YAAFE (Yet Another Audio Feature Extractor) http://yaafe.sourceforge.net/ does audio feature extraction in MATLAB, Python and C.
You might want to look into the Echo Nest API. It has a lot of audio analysis capabilities, and I know there's a script bundled in the Remix package that can automagically turn songs into shorter or longer versions (I believe the script is called earworm).
Audacity may do it.
Try JAudio, which can extract features from an audio file.
MARSYAS contains bextract for analysis; it can find MFCCs and various other timbral and spectral features. http://marsyas.info/
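For the energy half of the question, a first experiment needs no library at all. Here is a minimal sketch, assuming the track is already decoded to a list of floats, that picks the window with the highest total energy as the thumbnail candidate. The function name and defaults are my own, and "loudest 30 seconds" is only a crude stand-in for real structural analysis; pitch is not considered here.

```python
def loudest_window(samples, rate, window_sec=30.0, hop_sec=1.0):
    """Return (start_sec, end_sec) of the window with the highest energy.

    samples: floats in [-1, 1]; rate: samples per second.
    window_sec and hop_sec are illustrative (assumed) defaults.
    """
    hop = max(1, int(rate * hop_sec))
    # Energy of each hop-sized block of the track.
    block_energy = [
        sum(x * x for x in samples[i:i + hop])
        for i in range(0, len(samples), hop)
    ]
    blocks_per_window = max(1, int(round(window_sec / hop_sec)))
    if len(block_energy) <= blocks_per_window:
        return 0.0, len(samples) / rate  # track is shorter than one window
    # Slide the window one block at a time, updating a running sum.
    current = sum(block_energy[:blocks_per_window])
    best, best_start = current, 0
    for start in range(1, len(block_energy) - blocks_per_window + 1):
        current += block_energy[start + blocks_per_window - 1] - block_energy[start - 1]
        if current > best:
            best, best_start = current, start
    start_sec = best_start * hop / rate
    return start_sec, start_sec + window_sec
```

The same sliding-window structure would work with a richer score per block (for example features from one of the tools above) in place of raw energy.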

Resources