I want to use CMU sphinx4 to transcribe a given audio file. The input is an audio file in .wav format containing a conversation in Indian English. I am new to CMUSphinx and can't find an easy step-by-step description of the process.
You might want to look into the transcriber demo provided with the sphinx api.
You can just change the language model and the acoustic model, configure the same in the config.xml file, and use the same code.
The language model: depending on the use case of your application, you can use the WSJ language model with 5k words, or you can build your own model. To build your own language model, you can read more here. One easy way is to use the lmtool; search for "lmtool cmu".
The acoustic model: as you want an application for Indian accents, you need audio files of Indian English speech and the corresponding transcription file. Based on your use case, you can either train your own acoustic model or adapt an existing one. Read more here. You can also search for data sets online.
Configure things in the config.xml file so your application uses your language and acoustic model.
For a beginner, these steps might be helpful:
Read about the Sphinx architecture and try the demos.
Learn what a language model is.
Read about how to construct a language model (lmtool, cmuclmtk, etc.).
Learn what an acoustic model is.
Read about how to train or adapt an acoustic model.
Configure the config.xml file in your Java application to use these models.
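As a rough illustration of the "construct the language model" step, here is a minimal sketch of preparing a text corpus before feeding it to a language-model tool. It assumes the tool wants one normalized utterance per line wrapped in `<s>` and `</s>` markers, which is my understanding of the cmuclmtk workflow; if your tool expects plain sentences (as lmtool seems to), drop the markers. The function name `normalize_for_lm`, the lowercasing, and the sentence-splitting rule are all assumptions for the example, not part of any Sphinx API.

```python
import re


def normalize_for_lm(raw_text):
    """Turn free text into one normalized utterance per line, wrapped in <s> ... </s>."""
    # Split on sentence-ending punctuation. This is a rough heuristic,
    # not a real sentence tokenizer.
    sentences = re.split(r"[.!?]+", raw_text)
    lines = []
    for sentence in sentences:
        # Replace everything except letters, apostrophes and spaces, then lowercase.
        # The casing must match the pronunciation dictionary you use (assumption: lowercase).
        cleaned = re.sub(r"[^A-Za-z' ]+", " ", sentence).lower()
        words = cleaned.split()
        if words:
            lines.append("<s> " + " ".join(words) + " </s>")
    return lines


if __name__ == "__main__":
    sample = "Please transfer the call. What is my account balance?"
    for line in normalize_for_lm(sample):
        print(line)
```

The sentences you feed in should resemble what the speakers in your recordings actually say; a model built from unrelated text will not help much.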
Related
Is it possible to adapt an acoustic model using "sphinx4" only? I have already checked the website, but the commands mentioned under "Adapting the default acoustic model" are for pocketsphinx.
I have also checked some other solutions but all those used pocketsphinx.
It is not possible; you need sphinxtrain and pocketsphinx for adaptation.
So I've been using Microsoft Speech Recognition in Windows 10, doing the training exercises, dictating text into Wordpad and correcting it, adding words to the dictionary and so on. I would like to use the software to transcribe .wav files. It appears one can do this using the Windows Speech Recognition API, but this seems to involve creating and loading one's own grammar files, which suggests to me that this would basically create a new speech recognizer, that uses the same building blocks but is a different program from the one that runs when I click "Start Speech Recognition" in the start menu. In particular, it would perform differently because of differences in training or configuration.
Am I wrong about this? And if I'm not, is there still a way of retrieving all the data the default speech recognizer uses so I can reproduce its behavior exactly? If I need to create a separate speech recognizer with its own grammar files, separate training history and so on in order to transcribe .wav files, then so be it, but I'd like to better understand what's going on here.
The Woundify open source project contains examples of how to convert wav files to text (STT).
I am trying to make a super basic speech synthesizer, and I need some form of phoneme audio files so that I can piece them together and build words. Are there any open phoneme sets that I would be able to use for this?
For a super basic speech synthesizer it's worth checking espeak (http://espeak.sourceforge.net); that works better than gluing sound files together.
This may be more than you're looking for, but have you checked into http://www.vocaloid.com/en/ by any chance? There are many speech products on the market. You might also be interested in http://msdn.microsoft.com/en-us/library/hh361572(v=office.14).aspx
I am trying to build a program that will find which page/sentence in a book is being read into a microphone. I have the book's text and its audio content. The user will start reading from a random page, and the program is supposed to sync to the user and show the section of the book that is being read. It might seem like a useless program, but please bear with me.
Would an approach similar to Shazam-like programs work? I am not sure how effective those algorithms are for speech. Also, the speaker will be different and might have an accent and read at a different speed.
Another approach would be converting the speech to text and searching for the text in the book. The problem is that the language of the book is a rare one for which there is no language model available. In addition, the script does not use Latin characters, which makes programming difficult (for me at least).
Are there any solutions that anyone can recommend? Would extracting features from the audio file and comparing them with the features extracted in real time from the microphone work? Which features?
Is there any implementation/code that I can start with? Any language is OK, but I prefer C.
You need to use a speech recognizer.
Create a language model directly from the book text. That will make recognition of the book reading very accurate, both for the original reading and for the reading by the user.
Use this language model to recognize the book audio and assign timestamps to the words, or use a more advanced algorithm to perform text-to-audio alignment.
Recognize the user's speech with the book-specific language model and use the recognized text to display the position in the book.
You can use CMUSphinx for the mentioned tasks.
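The last step, mapping recognized words to a position in the book, could be sketched roughly as below. This is only one possible lookup strategy (word-trigram voting), not something CMUSphinx provides; the names `build_trigram_index` and `locate` are made up for the example. It works on lists of words, so the script of the language does not matter as long as the recognizer and the book text use the same word forms.

```python
from collections import Counter, defaultdict


def build_trigram_index(book_words):
    """Map every run of three consecutive book words to the offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(book_words) - 2):
        index[tuple(book_words[i:i + 3])].append(i)
    return index


def locate(recognized_words, index):
    """Return the most likely book offset where the recognized snippet starts, or None."""
    votes = Counter()
    for j in range(len(recognized_words) - 2):
        trigram = tuple(recognized_words[j:j + 3])
        for i in index.get(trigram, ()):
            # A trigram found at book offset i and snippet offset j
            # suggests the snippet starts at book offset i - j.
            votes[i - j] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    book = "the quick brown fox jumps over the lazy dog and the quick red fox sleeps".split()
    index = build_trigram_index(book)
    # One word is misrecognized ("a" instead of "the"); the other trigrams still agree.
    heard = "fox jumps over a lazy dog and the".split()
    print(locate(heard, index))
```

Voting over several trigrams tolerates a few recognition errors, since one wrong word only spoils the trigrams that contain it.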
I am a final year CS student, and very interested in OCR and NLP.
The problem is that I don't know anything about OCR yet, and my project lasts only 5 months. I would like to know what kind of OCR & NLP work is viable for my project.
Is writing a (simple) OCR engine for a single language too hard for my project? What about adding support for a language to existing FOSS OCR software?
My background is in the commercial side of OCR, and in my experience writing anything but a simple OCR engine would take a fair amount of time. To get even reasonable results, your input files would have to contain very clean text characters, or you would need lots of marked-up training data to train the engine. This would limit your input data to high quality printed documents and computer generated documents, such as a Word document exported to a TIFF image. Commercial OCR engines do a much better job reading standard scanned invoices and letters than even Tesseract OCR, and they still make mistakes.
You could write a simple OCR engine and use NLP and language analysis to show how it can improve the OCR results. Most of the OCR engines are doing this anyway but it could be an interesting project. The commercial engines have had years of fine tuning to improve their recognition accuracy and they use every trick they can think of.
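To give an idea of what "using language analysis to improve OCR results" can look like at its simplest, here is a small sketch that replaces out-of-vocabulary OCR words with the closest word from a known word list. It uses Python's standard `difflib` module; the function name `correct_ocr_words`, the sample vocabulary, and the 0.8 similarity cutoff are assumptions chosen for the example, and real engines use far richer language models than this.

```python
import difflib


def correct_ocr_words(ocr_words, vocabulary, cutoff=0.8):
    """Replace words missing from the vocabulary with their closest vocabulary match."""
    corrected = []
    for word in ocr_words:
        if word in vocabulary:
            corrected.append(word)
            continue
        # get_close_matches returns the best candidates whose similarity ratio
        # is at least `cutoff`; an empty list means nothing was close enough.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return corrected


if __name__ == "__main__":
    vocabulary = ["invoice", "total", "amount", "recognition"]
    # Typical OCR confusion: the letter "o" read as the digit "0".
    print(correct_ocr_words(["inv0ice", "total", "am0unt", "xyz"], vocabulary))
```

For a project, you could measure the word error rate of the raw OCR output against the corrected output to show how much the language step helps.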
This article may give you some ideas on one way to write an OCR engine:
http://www.codeproject.com/KB/dotnet/simple_ocr.aspx
You may be able to contribute to the Tesseract project, but you would first need to research what has already been included, what has not, and whether anyone else is working on the same problem.