Creating UTF8 language dictionary using g2p-seq2seq - cmusphinx

I am preparing a CMU Sphinx dictionary for a new language.
I've made a few hundred transliterations in ur.txt and I'm training on it by passing it to g2p-seq2seq as described in the docs, but it results in Accuracy: 0 and Error: 1.
The wordlist file is UTF-8 with Urdu characters: http://pastebin.com/2rRXay9J
I'm just testing it for the first time; can anyone identify the issue, or is it correct?
# g2p-seq2seq --train ur.txt --model ur-model3 --size 512 --max_steps 50 &
Preparing G2P data
Creating vocabularies in ur-model3
Creating vocabulary ur-model3/vocab.phoneme
Creating vocabulary ur-model3/vocab.grapheme
Reading development and training data.
Creating 2 layers of 512 units.
Created model with fresh parameters.
Training done.
Creating 2 layers of 512 units.
Reading model parameters from ur-model3
Beginning calculation word error rate (WER) on test sample.
Words: 14
Errors: 14
WER: 1.000
Accuracy: 0.000
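As a quick sanity check (a minimal sketch, assuming the training file uses the CMUdict-style "word phoneme1 phoneme2 ..." layout that g2p-seq2seq trains on), one can scan ur.txt for lines that have no phoneme sequence at all:

# Minimal sanity check for the training dictionary: flag lines that do not
# contain a word plus at least one phoneme (assumed CMUdict-style layout).
import io

with io.open('ur.txt', encoding='utf-8') as f:
    for n, line in enumerate(f, 1):
        parts = line.strip().split()
        if len(parts) < 2:
            print('line %d has no phonemes: %r' % (n, line))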

Related

Encoding an array of texts with a vocabulary of 100000 words in the context of deep learning

I'm training a Keras model to do multi-class, mutually exclusive categorization of texts.
I've got 15,000+ texts and a vocabulary of 100,000 words. I tried one-hot encoding using the 10,000 most used words of this vocabulary, and the 2D matrix of shape (15000, 10000) I obtain is 2.4 GB once saved to a file.
What are my options, since I don't want to create a (15000, 100000)-shaped matrix?
I read about Embedding and skip-gram, but wasn't sure I understood how they should be used in this context.
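One common alternative is to keep the texts as padded integer sequences of word ids and let an Embedding layer map them to dense vectors, so the (15000, 10000) matrix is never materialized. A minimal sketch (max_len and num_classes are assumed placeholders, not values from the question):

# Sketch: Embedding over integer word ids instead of one-hot vectors.
from keras.models import Sequential
from keras.layers import Embedding, GlobalAveragePooling1D, Dense

vocab_size = 10000   # keep the 10,000 most frequent words, as in the question
max_len = 200        # assumed maximum text length after padding/truncation
num_classes = 5      # hypothetical number of exclusive categories

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=128, input_length=max_len),
    GlobalAveragePooling1D(),
    Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, ...)  # x_train: (num_texts, max_len) int array of word ids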

CTC + BLSTM Architecture Stalls/Hangs before 1st epoch

I am working on code for online handwriting recognition.
It uses the CTC loss function and Word Beam Search (custom implementation: githubharald).
TF Version: 1.14.0
Following are the parameters used:
batch_size: 128
total_epoches: 300
hidden_unit_size: 128
num_layers: 2
input_dims: 10 (number of input Features)
num_classes: 80 (CTC output logits)
save_freq: 5
learning_rate: 0.001
decay_rate: 0.99
momentum: 0.9
max_length: 1940.0 (BLSTM with variable length time stamps)
label_pad: 63
The problem I'm facing is that after changing the decoder from the CTC greedy decoder to Word Beam Search, my code stalls after a particular step. It does not show the output of the first epoch and has been stuck there for about 5-6 hours now.
The step it is stuck after: tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
I am using an Nvidia DGX-2 for training (name: Tesla V100-SXM3-32GB).
Here is the paper describing word beam search, maybe it contains some useful information for you (I'm the author of the paper).
I would look at your task as two separate parts:
optical model, i.e. train a model that is as good as possible at reading text just by "looking" at it
language model, i.e. use a large enough text corpus, use a fast enough mode of the decoder
To select the best model for part (1), using best path (greedy) decoding for validation is good enough.
If the best path contains wrong characters, chances are high that beam search also has no chance to recover (even when using language models).
Now to part (2). Regarding runtime of word beam search: you are using "NGramsForecast" mode, which is the slowest of all modes. It has running time O(W*log(W)) with W being the number of words in the dictionary. "NGrams" has O(log(W)).
If you look into the paper and go to Table 1, you see that the runtime gets much worse when using the forecast modes ("NGramsForecast" or "NGramsForecastAndSample"), while character error rate may or may not get better (e.g. "Words" mode has 90ms runtime, while "NGramsForecast" has over 16s for the IAM dataset).
For practical use cases, I suggest the following:
if you have a dictionary (that means, a list of unique words), then use "Words" mode
if you have a large text corpus containing enough sentences in the target language, then use "NGrams" mode
don't use the forecast modes, instead use "Words" or "NGrams" mode and increase the beam width if you need better character error rate
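For part (1) above, a best-path validation decode can be done with TensorFlow's built-in greedy decoder. A minimal sketch for the TF 1.x API, where logits, seq_len and labels are assumed tensors matching the question's setup (80 CTC output classes):

# Sketch: best-path (greedy) CTC decoding for validating the optical model (TF 1.x).
import tensorflow as tf

# Assumed shapes: logits [max_time, batch_size, num_classes], seq_len [batch_size].
logits = tf.placeholder(tf.float32, [None, None, 80])
seq_len = tf.placeholder(tf.int32, [None])
labels = tf.sparse_placeholder(tf.int32)   # ground-truth label sequences

# Best-path decoding: pick the most likely character at every time step.
decoded, _ = tf.nn.ctc_greedy_decoder(logits, seq_len)
# Character error rate for model selection during validation.
cer = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32), labels))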

using flow_from_directory for training and validation, without augmentation

I am training a simple CNN with Nt=148 + Nv=37 images for training and validation respectively. I used the ImageDataGenerator.flow_from_directory() method because I plan to use data augmentation in the future, but for the time being I don't want any data augmentation. I just want to read the images from disk one by one (and each exactly once; this is primarily important for the validation) to avoid loading all of them into memory.
But the following makes me think that something different than expected is happening:
the training and validation accuracy reach values which do not resemble a fraction with 148 or 37 as the denominator. Trying to estimate a reasonable denominator from a submultiple of the deltas leads to numbers much bigger than 148 (about 534 or 551; see (*) below for why I think they should be multiples of 19) and than 37
verifying all predictions on both the training and validation datasets (with a separate program, which reads the validation directory only once and doesn't use the above generators) shows a number of failures which is not exactly (1-val_acc)*Nv, as I would expect
(*) Lastly, I found that the batch size I used for both is 19, so I expect that I am providing 19*7=133 or 19*8=152 training images per epoch and 19 or 38 images as the validation set at each epoch end.
By the way: is it possible to use the model.fit_generator() with generators built from the ImageGenerator.flow_from_directory() to achieve:
- no data augmentation
- both generators should respectively supply all images to the training process and to the validation process exactly once per epoch
- shuffling is fine, and actually desired, so that each epoch runs different
Meanwhile I am leaning towards setting the batch size equal to the validation set length (i.e. 37). Since it divides the training set size evenly, I think the numbers should work out.
But I am still unsure whether the following code achieves the requirement of "no data augmentation at all":
from keras.preprocessing.image import ImageDataGenerator

# Rescaling only; no augmentation parameters are set on either generator.
train_augmenter = ImageDataGenerator(rescale=1./255)
valid_augmenter = ImageDataGenerator(rescale=1./255)
val_batch_size = 37

train_generator = train_augmenter.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=val_batch_size,
    class_mode='binary',
    color_mode='grayscale',
    follow_links=True)

validation_generator = valid_augmenter.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=val_batch_size,
    class_mode='binary',
    color_mode='grayscale',
    follow_links=True)
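To make each image appear exactly once per epoch, the step counts can be derived from the generator sizes. A sketch, assuming model is an already-compiled Keras model:

# Sketch: steps chosen so every image is used exactly once per epoch.
steps_per_epoch = train_generator.samples // val_batch_size        # 148 // 37 = 4
validation_steps = validation_generator.samples // val_batch_size  # 37 // 37 = 1

model.fit_generator(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    validation_data=validation_generator,
    validation_steps=validation_steps,
    epochs=50)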
Some issues in your situation.
First of all, that amount of images is quite low. Scrape a lot more images and use augmentation.
Second, the typical split I have seen is:
from the total data:
80% for train
20% for validation.
Put the images you select in folders with that proportion.
Third, you can check what data your code actually generates by adding this argument to your flow_from_directory call, after the last argument (and put a comma after that last argument):
save_to_dir='folder_to_see_augmented_images'
Then run the model (compile, and then fit) and check the contents of the save_to_dir folder.
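For example, reusing the names from the snippet in the question, the validation call could look like this ('augmented_check' is just a hypothetical folder name that must exist before running):

# Sketch: dump the batches the generator actually yields so they can be inspected.
validation_generator = valid_augmenter.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=val_batch_size,
    class_mode='binary',
    color_mode='grayscale',
    follow_links=True,
    save_to_dir='augmented_check')   # hypothetical folder; create it beforehand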

How can I do binary classification using CNN and RNN LSTM

I am new to deep learning and machine learning techniques. I am learning by working through some examples in Python and also watching YouTube videos.
Now I want to do binary classification of two datasets with a CNN model and an RNN model to compare their performance.
The criterion is: if the length of the data in a column is 16, then it is class 0; otherwise class 1.
The dataset images are attached herewith (caption: class 1 if plaintext length is 16, otherwise class 0).
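A minimal sketch of what such a binary classifier could look like in Keras, assuming each row is encoded as a fixed-length integer sequence (vocab_size and max_len are placeholders, not values from the question):

# Sketch: an LSTM-based binary classifier (sigmoid output, binary cross-entropy loss).
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 256   # assumed: byte/character-level encoding of each row
max_len = 32       # assumed fixed sequence length after padding

model = Sequential([
    Embedding(vocab_size, 64, input_length=max_len),
    LSTM(64),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(x, y, ...)  # y in {0, 1} according to the length criterion above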

which is the most suitable method for training among model.fit(), model.train_on_batch(), model.fit_generator()

I have a training dataset of 600 images with (512*512*1) resolution, categorized into 2 classes (300 images per class). Using some augmentation techniques I have increased the dataset to 10000 images. After the following preprocessing steps:
all_images = np.array(all_images) / 255.0
all_images = all_images.astype('float16')
all_images = all_images.reshape(-1, 512, 512, 1)
I saved these images to an H5 file.
I am using an AlexNet architecture for classification, with 3 convolutional and 3 overlapping max-pool layers.
I want to know which of the following cases will be best for training using Google Colab where memory size is limited to 12GB.
1. model.fit(x, y, validation_split=0.2)
# For this I have to load all the data into memory, and then applying AlexNet to the data will simply cause a resource-exhausted error.
2. model.train_on_batch(x, y)
# For this I have written a script which randomly loads the data batch-wise from the H5 file into memory and trains on that data. I am confused by the property of train_on_batch(), i.e. a single gradient update. Will this affect my training procedure, or will it be the same as model.fit()?
3. model.fit_generator()
# Giving the original directory of images to its data generator function, which automatically augments the data, and then training using model.fit_generator(). I haven't tried this yet.
Please guide me on which of these methods will be best in my case. I have read many answers here, here, and here about model.fit(), model.train_on_batch() and model.fit_generator(), but I am still confused.
model.fit - suitable if you load the data as a numpy array and train without augmentation.
model.fit_generator - if your dataset is too big to fit in memory, and/or you want to apply augmentation on the fly.
model.train_on_batch - less common; usually used when training more than one model at a time (a GAN, for example).
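For options 2/3, a batch generator over the H5 file can also be plugged straight into model.fit_generator, so the full 10000-image array never sits in RAM. A minimal sketch, assuming the file is laid out as 'images' and 'labels' datasets in a hypothetical train.h5:

# Sketch: stream batches from an H5 file into fit_generator.
import h5py
import numpy as np

def h5_batches(path, batch_size=16):
    with h5py.File(path, 'r') as f:
        n = f['images'].shape[0]
        while True:                       # Keras generators must loop forever
            for start in np.arange(0, n, batch_size):
                x = f['images'][start:start + batch_size]
                y = f['labels'][start:start + batch_size]
                yield x, y

# steps_per_epoch = ceil(num_images / batch_size), e.g. 10000 / 16 = 625
# model.fit_generator(h5_batches('train.h5'), steps_per_epoch=625, epochs=10)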
