Pocketsphinx cannot decode mfc file while pocketsphinx_continuous decodes corresponding wav

I have been working with CMU Sphinx on Turkish speech-to-text for a couple of months. I have succeeded in training on about 100 hours of audio. My goal was to use the resulting acoustic model with the Sphinx3 decoder. However, Sphinx3 cannot decode my test wav files. I then noticed that sphinxtrain runs pocketsphinx_batch at the end of training to test the model.
So I started working on pocketsphinx. I am at a point where pocketsphinx_batch cannot decode a wav file (it only produces "ııı" and nothing else), while pocketsphinx_continuous produces more meaningful output from the same file (e.g. 10 correct words out of 15).
I guess I am missing some configuration steps. I have a compressed archive at this link,
which includes the acoustic and language models, the dictionary, and the wav files I am trying to decode.
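For reference, I invoke pocketsphinx_batch roughly like this (all file names are placeholders for the files in the archive):

pocketsphinx_batch -hmm model_parameters/turkish.cd_cont -lm etc/turkish.lm -dict etc/turkish.dic -ctl etc/test.fileids -cepdir feat -cepext .mfc -hyp result.hyp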
I would appreciate help getting my model to work with Sphinx3 and pocketsphinx_batch.
Thank you.

Fortunately I found the problem: it was the feature vectors produced by sphinx_fe. I had been creating them with default values. After reading make_feats.pl and sphinxtrain.cfg, I created feature vectors compatible with the acoustic model. sphinxtrain.cfg sets the lifter parameter to 22, but sphinx_fe defaults to a lifter of 0, which means no liftering. Once I created the mfc files with a lifter value of 22, decoding worked.
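For anyone hitting the same issue: the fix amounts to passing the training-time feature parameters to sphinx_fe, something along these lines (file names are placeholders, and the sample rate and transform must match your training configuration as well):

sphinx_fe -i test.wav -o test.mfc -mswav yes -samprate 16000 -transform dct -lifter 22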

Related

How can I train TensorFlow to read variable length numbers on an image?

I have a set of images like this,
and I'm trying to train TensorFlow in Python to read the numbers on the images.
I'm new to machine learning, and in my research I found a solution to a similar problem that uses CTC to train/predict variable-length data on an image.
I'm trying to figure out whether I should use CTC, or find a way to create a new image for every digit in the images I already have.
For example, if the number in my image is 213, I would create 3 new images to train the model, containing the digits 2, 1, and 3 respectively, using those digits as labels. I'm looking for tutorials or even TensorFlow documentation that can help me with that.
In the case of text, CTC absolutely makes sense: you don't want to split a text (like "213") into "2", "1", "3" manually, because it is often difficult to segment the text into individual characters.
CTC, on the other hand, just needs images and the corresponding ground-truth texts as input for training. You don't have to manually take care of things like the alignment of characters, the width of characters, or the number of characters; CTC handles that for you.
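If it helps, here is a minimal self-contained sketch of the training-side idea using TensorFlow's tf.nn.ctc_loss (all shapes and sizes are made up for illustration; SimpleHTR's actual network is different):

import tensorflow as tf

# Logits would come from a CNN+RNN over the image: one time step per
# horizontal position, one score per character class plus a CTC blank.
time_steps, batch, num_classes = 32, 2, 11   # e.g. 10 digits + 1 blank
logits = tf.random.normal([time_steps, batch, num_classes])  # time-major

# Ground-truth digit sequences, e.g. "213" -> [2, 1, 3], zero-padded.
labels = tf.constant([[2, 1, 3], [7, 0, 0]], dtype=tf.int32)
label_length = tf.constant([3, 1], dtype=tf.int32)  # true label lengths
logit_length = tf.fill([batch], time_steps)         # frames per image

# CTC aligns the labels to the time steps internally; no manual
# per-character segmentation of the image is needed.
loss = tf.nn.ctc_loss(labels=labels, logits=logits,
                      label_length=label_length,
                      logit_length=logit_length,
                      logits_time_major=True,
                      blank_index=num_classes - 1)
print(loss.shape)  # (2,) -- one loss value per image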
I don't want to repeat myself here, so I just point you to the tutorials I've written about text recognition and to the source code:
Build a Handwritten Text Recognition System using TensorFlow
SimpleHTR: a TensorFlow model for text-recognition
You can use the SimpleHTR model as a starting point. To get good results, you will have to generate training data (e.g. write a rendering tool which renders realistic-looking examples) and train the model from scratch with that data (more details on training can be found in the README).
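A rendering tool can start out very simple. A hypothetical Pillow-based sketch (fonts, sizes, and the noise model are all assumptions you would want to refine):

import random
from PIL import Image, ImageDraw, ImageFont

def render_number(text, width=128, height=32):
    # Render a digit string on a light grey background with a dark fill,
    # at a small random offset, roughly mimicking scanned digits.
    img = Image.new("L", (width, height), color=random.randint(200, 255))
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # swap in realistic fonts for better data
    draw.text((random.randint(2, 10), random.randint(2, 10)), text,
              fill=random.randint(0, 60), font=font)
    return img  # pair the image with `text` as its ground-truth label

render_number("213").save("213.png")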

What algorithm is used for audio feature extraction in Google's AudioSet?

I am getting started with Google's AudioSet. While the dataset is extensive, I find the information regarding the audio feature extraction very vague. The website mentions:
128-dimensional audio features extracted at 1Hz. The audio features were extracted using a VGG-inspired acoustic model described in Hershey et al., trained on a preliminary version of YouTube-8M. The features were PCA-ed and quantized to be compatible with the audio features provided with YouTube-8M. They are stored as TensorFlow Record files.
In the paper, the authors discuss using mel spectrograms on 960 ms chunks to get a 96x64 representation. It is unclear to me how they get from there to the 1x128 representation used in AudioSet. Does anyone know more about this?
They use the 96x64 data as input to a modified VGG network. The last layer of the VGG is FC-128, so its output is 1x128; that is where the dimensionality comes from.
The architecture of VGG can be found here: https://github.com/tensorflow/models/blob/master/research/audioset/vggish_slim.py
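To see just the shape bookkeeping, a hypothetical Keras sketch (this is not the real VGGish layer stack or its weights; it only mirrors the 96x64-in, 128-out idea, and vggish_slim.py is the authoritative source):

import tensorflow as tf

x = tf.random.normal([1, 96, 64, 1])       # one 960 ms log-mel patch (96x64)
h = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
h = tf.keras.layers.MaxPool2D()(h)         # VGGish stacks several conv/pool blocks
h = tf.keras.layers.Flatten()(h)
embedding = tf.keras.layers.Dense(128)(h)  # the final FC-128 layer
print(embedding.shape)                     # (1, 128) -> the 1x128 feature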

Small Data training in CMU Sphinx

I have installed sphinxbase, sphinxtrain and pocketsphinx on Linux (Ubuntu). Now I am trying to train a model with the speech corpus, transcriptions, dictionary, etc. obtained from VoxForge (the data in my etc and wav folders comes from VoxForge).
As I am new to this, I just want to train on a little data and get some results: a few lines of transcripts and a few wav files, say 10 wav files and the 10 transcript lines corresponding to them, like the person in this video is doing.
But when I run sphinxtrain, I get an error:
Estimated Total Hours Training: 0.07021431623931
This is a small amount of data, no comment at this time
If I set CFG_CD_TRAIN to "no", I don't know what that means.
What changes do I need to make to get rid of this error?
PS: I cannot add more data, because I first want to see some results to better understand the whole scenario.
Not enough data for the training, we can only train CI models
You need at least 30 minutes of audio data to train CI models. Alternatively, you can set CFG_CD_TRAIN to "no": it controls whether context-dependent (CD) phone models are trained after the context-independent (CI) ones, and with this little data only CI training is feasible.
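In etc/sphinx_train.cfg this is a one-line change (variable name as in sphinxtrain's template config):

$CFG_CD_TRAIN = 'no';   # skip the context-dependent stages; train CI models only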

Where can I get CoNLL-X training data?

I'm trying to train the Stanford Neural Network Dependency Parser to check phrase similarity.
The way I tried it is:
java edu.stanford.nlp.parser.nndep.DependencyParser -trainFile trainPath -devFile devPath -embedFile wordEmbeddingFile -embeddingSize wordEmbeddingDimensionality -model modelOutputFile.txt.gz
The error that I got is:
Train File: C:\Users\rohit\Downloads\CoreNLP-master\CoreNLP-master\data\edu\stanford\nlp\parser\trees\en-onetree.txt
Dev File: null
Model File: modelOutputFile.txt.gz
Embedding File: null
Pre-trained Model File: null
################### Train
#Trees: 1
0 tree(s) are illegal (0.00%).
1 tree(s) are legal but have multiple roots (100.00%).
0 tree(s) are legal but not projective (0.00%).
###################
#Word: 3
#POS:3
#Label: 2
###################
#Transitions: 3
#Labels: 1
ROOTLABEL: null
Random generator initialized with seed 1459831358061
Exception in thread "main" java.lang.NullPointerException
at edu.stanford.nlp.parser.nndep.Util.scaling(Util.java:49)
at edu.stanford.nlp.parser.nndep.DependencyParser.readEmbedFile(DependencyParser.java:636)
at edu.stanford.nlp.parser.nndep.DependencyParser.setupClassifierForTraining(DependencyParser.java:787)
at edu.stanford.nlp.parser.nndep.DependencyParser.train(DependencyParser.java:676)
at edu.stanford.nlp.parser.nndep.DependencyParser.main(DependencyParser.java:1247)
The help embedded in the code says that the training file should be a "Path to a training treebank in CoNLL-X format".
Does anyone know where I can find some CoNLL-X training data to train on?
I provided a training file but not an embedding file and got this error.
My guess is that it might work if I supply an embedding file.
Please shed some light on which training file and embedding file I should use, and where I can find them.
CoNLL-X treebanks
You can get training data for Danish, Dutch, Portuguese, and Swedish for free here. For other languages, you'll unfortunately probably need to license a treebank from the LDC (details for many languages are on that page).
Universal Dependencies are in CoNLL-U format, which can usually be converted to CoNLL-X format with some work.
Lastly, there's a large list of treebanks and their availability on this page. You should be able to convert many of the dependency treebanks in this list into CoNLL-X format if they're not already in that format.
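For reference, CoNLL-X is a ten-column, tab-separated format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), one token per line with a blank line between sentences. A made-up example sentence (tabs shown as spaces):

1   The     the     DET   DT   _   2   det     _   _
2   cat     cat     NOUN  NN   _   3   nsubj   _   _
3   sleeps  sleep   VERB  VBZ  _   0   root    _   _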
Training the Stanford Neural Net Dependency parser
From this page: The embedding file is optional, but the treebank is not. The best treebank and embedding files to use depend on which language and type of text you'd like to parse. Ideally, you would train on as much data as possible in the domain/genre that you're trying to parse.
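If you do supply -embedFile, the expected format (as far as I can tell from readEmbedFile) is plain text with one word per line followed by its whitespace-separated vector values; a made-up example matching -embeddingSize 4:

the 0.418 0.250 -0.412 0.122
cat 0.013 0.237 -0.169 0.410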

Using CMU's Pocketsphinx with a small set of words

I want to use CMU Pocketsphinx to recognize words from a small set. I created a corpus for them and generated the model files here - http://www.speech.cs.cmu.edu/tools/lmtool.html.
Now, when I run the pocketsphinx_continuous executable with this model on my 12-core Linux machine, it takes about 5 seconds to recognize each word.
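For reference, I am running roughly this (the numeric file names are placeholders; lmtool names its outputs with a generated number):

pocketsphinx_continuous -inmic yes -lm 1234.lm -dict 1234.dic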
Is this library usually this slow, or am I doing something wrong?
The console output shows that it is still searching and evaluating a large number of words, whereas my model contains only 12 words.
Is there any other lightweight and easy-to-use library I can use for this simple task of distinguishing between about 12-15 words?
