Are there any open-source phoneme sets (for speech synthesis)? - audio

I am trying to make a super basic speech synthesizer, and I need some form of phoneme audio files so that I can piece them together and build words. Are there any open phoneme sets that I would be able to use for this?

For a super basic speech synthesizer it's worth checking out eSpeak (http://espeak.sourceforge.net); it will work better than gluing sound files together.
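As a rough sketch of that suggestion, eSpeak can be driven from a script so that it writes a word straight to a WAV file. This assumes the `espeak` binary is installed and on the PATH; the function names and the output file name are made up for illustration.

```python
import shutil
import subprocess


def build_espeak_command(text, wav_path, voice=None):
    """Build the argument list for an eSpeak call that writes a WAV file."""
    cmd = ["espeak"]
    if voice:
        cmd += ["-v", voice]       # -v selects a voice, e.g. "en"
    cmd += ["-w", wav_path, text]  # -w writes audio to a file instead of playing it
    return cmd


def synthesize(text, wav_path, voice=None):
    """Run eSpeak if it is installed; return True on success."""
    if shutil.which("espeak") is None:
        return False  # eSpeak was not found on the PATH
    result = subprocess.run(build_espeak_command(text, wav_path, voice))
    return result.returncode == 0


if __name__ == "__main__":
    ok = synthesize("hello world", "hello.wav")
    print("written" if ok else "espeak not available or failed")
```

Building the argument list separately from running it keeps the command easy to inspect before anything is executed.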

This may be more than you're looking for, but have you checked into http://www.vocaloid.com/en/ by any chance? There are many speech products on the market. You might also be interested in http://msdn.microsoft.com/en-us/library/hh361572(v=office.14).aspx

Related

speech-to-text .. .VOX files to text, is this possible?

A little background: I'm faced with converting 3000 IVR scripts for a new PBX & IVR.
Currently the voice scripts exist in .VOX format, and they're not written out or documented.
I humbly ask if anyone knows of a program that I can dump a .vox file into and have it produce a text document. If the .vox format is a problem, I could probably convert all of them to .wav or whatever.
Yes, there are tons of apps/programs which do speech to text in real time, but I want to be able to "upload/dump" the recording into a program and obtain text.
Can someone point me in the right direction?
Thank you in advance for any sort of comment/help.
SF
The problem is that you are hoping to perform generic natural language processing on low quality audio files. Low quality audio data significantly reduces the reliability of natural language processing software. Upsampling your audio files will not improve their content which means poor results even if you did have access to a natural language engine.
Your best bet is to work with a company that performs hybrid machine/human transcription and pay them per transcription. Alternatively, you could consider working with Amazon Mechanical Turk and buy some general purpose human effort to get these transcribed. In both cases it is likely that VOX files would not work; you would first need to convert them to WAV or MP3 files so that the third party can use off-the-shelf software to listen to the prompts.
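A minimal sketch of the conversion step, under the assumption that the files are headerless Dialogic (OKI) 4-bit ADPCM, mono, sampled at 8000 Hz. That is a common layout for .vox files, but some use 6000 Hz or a different encoding entirely, so listen to the result before converting all 3000. The function names are illustrative.

```python
import wave

# Step sizes and index adjustments of the Dialogic (OKI) 4-bit ADPCM scheme.
STEP_SIZES = [
    16, 17, 19, 21, 23, 25, 28, 31, 34, 37, 41, 45, 50, 55, 60, 66,
    73, 80, 88, 97, 107, 118, 130, 143, 157, 173, 190, 209, 230, 253,
    279, 307, 337, 371, 408, 449, 494, 544, 598, 658, 724, 796, 876,
    963, 1060, 1166, 1282, 1411, 1552,
]
INDEX_ADJUST = [-1, -1, -1, -1, 2, 4, 6, 8]


def decode_vox(data):
    """Decode Dialogic ADPCM bytes into a list of 16-bit PCM samples."""
    sample, index, out = 0, 0, []
    for byte in data:
        for nibble in (byte >> 4, byte & 0x0F):  # high nibble first
            step = STEP_SIZES[index]
            diff = step >> 3
            if nibble & 4:
                diff += step
            if nibble & 2:
                diff += step >> 1
            if nibble & 1:
                diff += step >> 2
            sample += -diff if nibble & 8 else diff
            sample = max(-2048, min(2047, sample))  # decoder state is 12-bit
            index = max(0, min(48, index + INDEX_ADJUST[nibble & 7]))
            out.append(sample << 4)                 # scale 12-bit up to 16-bit
    return out


def vox_to_wav(vox_path, wav_path, sample_rate=8000):
    """Convert one .vox file to a 16-bit mono WAV; return the sample count."""
    with open(vox_path, "rb") as f:
        samples = decode_vox(f.read())
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)  # assumed rate; adjust if playback sounds wrong
        w.writeframes(b"".join(s.to_bytes(2, "little", signed=True) for s in samples))
    return len(samples)
```

Each input byte holds two samples, so a file of N bytes yields 2N samples. If the output plays too fast or too slow, the assumed sample rate is the first thing to change.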

How to decode speech input

What I want to do is create an API that translates human speech into the IPA (International Phonetic Alphabet) format. My question is, where are the resources on how to decode speech at the level of the original audio waveform. I looked for an API, but most of what I found just translates straight to the roman alphabet. I'm looking to create something a little more accurate in its ability to distinguish vocal phonetics.
I would just like to start out by saying that this project is much more difficult and complicated than you think it is. Speech to text processing is a very large and complicated field with a huge amount of research that has been done into it. The reason most parsers send things straight to roman characters is because most of their processing is a probabilistic matching of vague sounds with their context of other vague sounds to guess which words make sense together. You are much more likely to find something that will give you Soundex rather than IPA. That said, this is a problem that has been approached on several fronts. Your best bet is probably the Sphinx project from CMU.
http://cmusphinx.sourceforge.net/wiki/start
That will give you a good start, but you are assuming that speech to text processing is a lot more developed than it actually is. There is no simple way of translating speech to IPA through the waveform with any kind of accuracy. Sphinx is very modular and completely open source, so it would put a huge amount of power at your fingertips; at that point, whether or not you can figure out how to make this work is up to you. But again, this is not a solved problem in any way.

How to convert human voice into digital format?

I am working on a project where a biometric system is used to secure the system. We are planning to use the human voice for this.
The idea is to have the person say some words or sentences, and the system will store that voice in digital format. The next time the person wants to enter the system, he/she has to speak some words, which may or may not be different from the words used earlier.
We don't want to match the words; we want to match the voice frequency.
I have read some research papers regarding this system but those papers don't have any implementation details.
So just want to know whether there is any software/API which can convert analog voice into digital format and will also tell us the frequency of voice.
Until now I was working on normal web based applications so I know normal APIs and platforms like Java EE, C#, etc but I don't have any experience about this kind of application.
Please enlighten !!!
http://www.loquendo.com/en/products/speaker-verification/
http://www.nuance.com/for-business/by-solution/contact-center-customer-care/cccc-solutions-services/verifier/index.htm
(two links removed due to reported virus content)
http://www.persay.com/products.asp
This is as good a starting point as any : http://marsyas.info/
It's an open source software framework for audio processing. They've listed a bunch of projects that have used their framework in various ways, so you could probably draw inspiration from them:
http://marsyas.info/about/projects. The Teligence project in particular seems the closest to your needs, as it was used to classify audio by gender: http://marsyas.info/about/projects#5Teligence
I believe there are two steps in a project like this one:
The first step would be to record the voice from an analog input into digital format (let's assume wav-pcm). For this you can use the DirectShow API in C#, or standard Wav-In as in this project: http://www.codeproject.com/KB/audio-video/cswavrec.aspx. You may consider compressing your audio files later on. There are many options for this; on Windows you may consider the Windows Media Format SDK to avoid licensing issues with other formats.
The second step is to build or use a voice recognition framework. If you want to build a recognition framework, you will probably need to define a set of "features" for your sound fragments and select and implement a recognition algorithm. There are many approaches available for this; the IEEE and ACM.org websites are usually good sources. If you want to use an existing framework, you may want to consider Nuance Recognizer (commercial) or http://cmusphinx.sourceforge.net (open source).
Hope this helps.
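To make the "features" idea a bit more concrete, here is a minimal sketch that reads a recorded wav-pcm file and estimates the fundamental frequency (pitch) of one frame by autocorrelation. It assumes 16-bit mono PCM input; the function names and the 50 to 500 Hz search range are illustrative choices. Pitch on its own is far too weak to verify a speaker, so treat this as one simple feature, not a working verifier.

```python
import math
import struct
import wave


def read_mono_wav(path):
    """Return (samples, sample_rate) for a 16-bit mono PCM WAV file."""
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return list(samples), rate


def estimate_pitch(frame, rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation."""
    mean = sum(frame) / len(frame)
    x = [s - mean for s in frame]            # remove any DC offset
    min_lag = int(rate / fmax)               # shortest period considered
    max_lag = min(int(rate / fmin), len(x) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return rate / best_lag if best_lag else 0.0  # 0.0 means no pitch found


if __name__ == "__main__":
    # Synthetic 200 Hz tone standing in for a voiced frame of real speech.
    rate = 8000
    tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(1024)]
    print(estimate_pitch(tone, rate))
```

On real recordings you would run `estimate_pitch` over many short frames (a few tens of milliseconds each) and look at the distribution, since pitch varies within every utterance.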

Library for extracting words (speech) out from audio stream?

I have an audio stream and I would like to extract words (speech) from it. So, for example, given audio.wav I would get 001.wav, 002.wav, 003.wav, etc., where each XXX.wav is one word.
I am looking for a library or program to do it -- platform does not matter, but I prefer open-source solution.
Thank you in advance for help.
Nuance, the company that makes Dragon Naturally Speaking, has a number of Software Development Kits.
The Audio Mining kit seems to match your requirements:
Dragon NaturallySpeaking SDK
AudioMining is a speaker-independent speech recognition toolkit that enables the indexing of 100% of the speech information within audio files. The technology uses highly accurate speech recognition to turn audio files into XML text with timestamp information. This can be integrated with standard text-search products to enable rapid access to specific audio content.
Going from speech to speech plus metadata (the recognized words with their timestamps) is far and away the hardest part to get right. Once you have the speech plus metadata, extracting the words as individual audio files is much more straightforward.
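The straightforward part can be sketched as follows, assuming the recognizer has already given you a start and end time in seconds for each word. The function name, the `(start, end)` tuple format and the output file pattern are illustrative and not part of any particular SDK.

```python
import wave


def split_by_timestamps(src_path, segments, out_pattern="%03d.wav"):
    """Write one WAV per (start_sec, end_sec) pair; return the file names."""
    names = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end) in enumerate(segments, start=1):
            src.setpos(int(start * rate))                       # seek to word start
            frames = src.readframes(int((end - start) * rate))  # read the word
            name = out_pattern % i
            with wave.open(name, "wb") as dst:
                dst.setparams(params)  # same channels, sample width and rate
                dst.writeframes(frames)
            names.append(name)
    return names
```

For example, `split_by_timestamps("audio.wav", [(0.40, 0.85), (0.90, 1.30)])` would write 001.wav and 002.wav next to the script.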

Libraries of audio samples (spoken text)

For a project we're currently working on, we need a library of spoken words in many different languages.
Two options seem possible: text-to-speech or "real" recordings by native speakers. As the quality is important to us, we're thinking about going the latter path.
In order to create a prototype for our application, we're looking for libraries that contain as many words in different languages as possible. To get a feeling for the quality of our approach, this library should not be made up of synthesized speech.
Do you know of any available/accessible libraries?
A co-worker just found this community based library, which is nice, but rather small in size:
Forvo.com
I've just found this on the Audacity wiki: VoxForge. From their site:
VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
We will make available all submitted audio files under the GPL license, and then 'compile' them into acoustic models for use with Open Source speech recognition engines such as Sphinx, ISIP, Julius and HTK (note: HTK has distribution restrictions).
There is also Old time radio, not sure if this is the sort of spoken word you're after though.
My guess is that you won't find a library anywhere that consists of just individual words. Whatever you find, you're going to have to open the audio up in an editor (like Pro Tools or Cool Edit) and chop it up into individual words.
You would probably be better off creating a list of all the words you need for each language, and then finding native speakers to read them while you record. You can have them read slowly, so that you'll have an easy time chopping up each individual word.
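If the speakers do leave clear pauses between words, the chopping can be partly automated by looking for stretches of silence. The sketch below is a plain amplitude threshold, not a real voice activity detector. The threshold, the minimum pause length and the function name are assumptions that you would need to tune for your recordings.

```python
def find_words(samples, rate, threshold=500, min_silence=0.3, window=0.01):
    """Return (start_sec, end_sec) pairs for loud regions separated by silence."""
    win = max(1, int(rate * window))                   # samples per analysis window
    gap_needed = max(1, round(min_silence / window))   # quiet windows that end a word
    regions, start, quiet = [], None, 0
    n_windows = len(samples) // win
    for w in range(n_windows):
        chunk = samples[w * win:(w + 1) * win]
        loud = max(abs(s) for s in chunk) >= threshold
        if loud:
            if start is None:
                start = w          # first loud window of a new word
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= gap_needed:
                # The word ended right after the last loud window.
                regions.append((start * win / rate, (w - quiet + 1) * win / rate))
                start, quiet = None, 0
    if start is not None:          # recording ended in the middle of a word
        regions.append((start * win / rate, (n_windows - quiet) * win / rate))
    return regions
```

The returned time ranges still need a quick check by ear, since breaths and background noise can trip a fixed threshold.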
One I used to use a lot: http://shtooka.net/index.php
Easy access to the recordings.

Resources