CTC + BLSTM Architecture Stalls/Hangs before 1st epoch

I am working on code for online handwriting recognition.
It uses the CTC loss function and Word Beam Search (custom implementation: githubharald).
TF Version: 1.14.0
Following are the parameters used:
batch_size: 128
total_epoches: 300
hidden_unit_size: 128
num_layers: 2
input_dims: 10 (number of input Features)
num_classes: 80 (CTC output logits)
save_freq: 5
learning_rate: 0.001
decay_rate: 0.99
momentum: 0.9
max_length: 1940.0 (BLSTM with variable length time stamps)
label_pad: 63
The problem I'm facing is that after changing the decoder from the CTC greedy decoder to Word Beam Search, my code stalls at a particular step. It never shows the output of the first epoch and has now been stuck for about 5-6 hours.
The step it is stuck after: tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
I am using an Nvidia DGX-2 for training (name: Tesla V100-SXM3-32GB)

Here is the paper describing word beam search, maybe it contains some useful information for you (I'm the author of the paper).
I would look at your task as two separate parts:
optical model, i.e. train a model that is as good as possible at reading text just by "looking" at it
language model, i.e. use a large enough text corpus, use a fast enough mode of the decoder
To select the best model for part (1), using best path (greedy) decoding for validation is good enough.
If the best path contains wrong characters, chances are high that beam search cannot recover either (even when using a language model).
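For part (1), a minimal validation setup with TensorFlow's built-in greedy decoder could look like this (a sketch for TF 1.x; logits, seq_len and sparse_labels are placeholders for your model's output, sequence lengths and ground-truth sparse labels):

import tensorflow as tf
# logits: float tensor [max_time, batch_size, num_classes] (time-major, unscaled)
# seq_len: int32 tensor [batch_size] with the true sequence lengths
decoded, _ = tf.nn.ctc_greedy_decoder(logits, seq_len)
# label error rate between the greedy decoding and the ground-truth sparse labels
ler = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32), sparse_labels))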
Now to part (2). Regarding runtime of word beam search: you are using "NGramsForecast" mode, which is the slowest of all modes. It has running time O(W*log(W)) with W being the number of words in the dictionary. "NGrams" has O(log(W)).
If you look into the paper and go to Table 1, you see that the runtime gets much worse when using the forecast modes ("NGramsForecast" or "NGramsForecastAndSample"), while character error rate may or may not get better (e.g. "Words" mode has 90ms runtime, while "NGramsForecast" has over 16s for the IAM dataset).
For practical use cases, I suggest the following:
if you have a dictionary (that means, a list of unique words), then use "Words" mode
if you have a large text corpus containing enough sentences in the target language, then use "NGrams" mode
don't use the forecast modes; instead use "Words" or "NGrams" mode and increase the beam width if you need a better character error rate (see the sketch below)
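For example, switching to "Words" mode with the TF custom-op build of CTCWordBeamSearch would look roughly like this (a sketch based on the project's README; the .so path, the corpus/chars/wordChars files, the beam width of 25 and the softmax tensor mat are assumptions about your setup):

import tensorflow as tf
word_beam_search_module = tf.load_op_library('TFWordBeamSearch.so')
# mat: softmax of the BLSTM output, shape [max_time, batch_size, num_classes], CTC blank as last class
corpus = open('corpus.txt').read()          # text the dictionary is built from
chars = open('chars.txt').read()            # all characters the model can output
word_chars = open('wordChars.txt').read()   # characters that form words
decoded = word_beam_search_module.word_beam_search(mat, 25, 'Words', 0.0, corpus, chars, word_chars)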

Related

seq2seq / transformer model - what's the best way to batchify my inputs?

I'm trying to build a character-level model that matches diacritics for Hebrew characters (each character is decorated with a diacritic). Note that the correct diacritic is dependent on the word, the context and the part-of-speech (not trivial).
I built an LSTM-based model which achieves 18% word-level accuracy (18% of the words were exactly right in all their characters, on an unseen test set).
Now I'm trying to beat that with a transformer model, following the PyTorch seq2seq tutorial, and I'm reaching far worse results (7% word-level accuracy).
My training dataset is 100K sentences, most with up to 30 characters, but some go all the way to 80 characters.
My question (finally) - what's the best way to batchify these inputs for the transformer? I prepared 30-character chunks that cover each sentence (e.g. a 55-character sentence => 30 + 25) and padded with zeros when a chunk is shorter than 30. I'm also trying to split the chunks between words (on spaces) and not mid-word.
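For reference, the chunking described above could be sketched like this (plain Python; the 30-character window, splitting on spaces and zero padding follow the description, while char_to_id is a hypothetical character-to-index mapping and no single word is assumed to exceed the window):

def chunk_sentence(sentence, char_to_id, max_len=30, pad_id=0):
    # split into chunks of at most max_len characters, breaking on spaces
    chunks, current = [], ''
    for word in sentence.split(' '):
        candidate = word if not current else current + ' ' + word
        if len(candidate) <= max_len:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    # map characters to ids and pad every chunk to max_len with zeros
    ids = [[char_to_id[c] for c in chunk] for chunk in chunks]
    return [row + [pad_id] * (max_len - len(row)) for row in ids]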
Is this the way to go? Am I missing some better (and better-known) technique?

How to use Bert for long text classification?

We know that BERT has a maximum limit of 512 tokens, so if an article is much longer than that, say 10,000 tokens, how can BERT be used?
You have basically three options:
You cut the longer texts off and only use the first 512 tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient.
You can split your text into multiple subtexts, classify each of them and combine the results back together (for example, choose the class that was predicted for most of the subtexts; see the sketch below). This option is obviously more expensive.
You can even feed the output token for each subtext (as in option 2) to another network (but you won't be able to fine-tune) as described in this discussion.
I would suggest trying option 1, and only if this is not good enough, consider the other options.
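As an illustration of option 2, a chunk-and-vote pipeline with the Hugging Face transformers library might look like this (a sketch; the bert-base-uncased checkpoint, the 510-token window and the majority vote are illustrative choices, and the model would still need to be fine-tuned on your task):

from collections import Counter
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def classify_long_text(text, window=510):
    # tokenize once, then split into windows that fit the 512-token limit
    # (510 content tokens plus [CLS] and [SEP])
    ids = tokenizer.encode(text, add_special_tokens=False)
    votes = []
    for start in range(0, len(ids), window):
        chunk = [tokenizer.cls_token_id] + ids[start:start + window] + [tokenizer.sep_token_id]
        with torch.no_grad():
            logits = model(torch.tensor([chunk])).logits
        votes.append(int(logits.argmax(dim=-1)))
    # the class predicted for most of the subtexts wins
    return Counter(votes).most_common(1)[0][0]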
This paper compared a few different strategies: How to Fine-Tune BERT for Text Classification?.
On the IMDb movie review dataset, they actually found that cutting out the middle of the text (rather than truncating the beginning or the end) worked best! It even outperformed more complex "hierarchical" approaches involving breaking the article into chunks and then recombining the results.
As another anecdote, I applied BERT to the Wikipedia Personal Attacks dataset here, and found that simple truncation worked well enough that I wasn't motivated to try other approaches :)
In addition to chunking data and passing it to BERT, check the following new approaches.
There is new research on long-document analysis. Since you asked about BERT: a similar pre-trained transformer, Longformer, has recently been made available by Allen NLP (https://arxiv.org/abs/2004.05150); the link points to the paper.
The related-work section also mentions some previous work on long sequences; Google those too. I'd suggest at least going through Transformer-XL (https://arxiv.org/abs/1901.02860). As far as I know it was one of the initial models for long sequences, so it is good to use as a foundation before moving on to Longformer.
You can leverage the Hugging Face Transformers library, which includes the following Transformers that work with long texts (more than 512 tokens):
Reformer: combines the modeling capacity of a Transformer with an architecture that can be executed efficiently on long sequences.
Longformer: uses an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.
Other recently proposed efficient Transformer models include Sparse Transformers (Child et al., 2019), Linformer (Wang et al., 2020), Sinkhorn Transformers (Tay et al., 2020b), Performers (Choromanski et al., 2020b), Synthesizers (Tay et al., 2020a), Linear Transformers (Katharopoulos et al., 2020), and BigBird (Zaheer et al., 2020).
The paper by authors from Google Research and DeepMind compares these Transformers based on the Long-Range Arena "aggregated metrics". They also suggest that Longformer performs better than Reformer on the classification task.
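For example, loading Longformer for classification from the Hugging Face hub is short (a sketch; allenai/longformer-base-4096 is the published checkpoint name, while num_labels and long_document are placeholders for your task and input):

from transformers import LongformerTokenizer, LongformerForSequenceClassification

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained("allenai/longformer-base-4096", num_labels=2)

inputs = tokenizer(long_document, truncation=True, max_length=4096, return_tensors="pt")
logits = model(**inputs).logits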
I have recently (April 2021) published a paper regarding this topic that you can find on arXiv (https://arxiv.org/abs/2104.07225).
There, Table 1 reviews previous approaches to the problem in question, and the whole manuscript is about long text classification, proposing a new method called Text Guide. This new method claims to improve performance over the naive and semi-naive text selection methods used in the paper (https://arxiv.org/abs/1905.05583) mentioned in one of the previous answers to this question.
Long story short about your options:
Low computational cost: use naive/semi-naive approaches to select a part of the original text instance. Examples include choosing the first n tokens, or compiling a new text instance out of the beginning and end of the original one.
Medium to high computational cost: use recent transformer models (like Longformer) that have a 4096-token limit instead of 512. In some cases this allows covering the whole text instance, and the modified attention mechanism decreases computational cost.
High computational cost: divide the text instance into chunks that fit a model like BERT with the 'standard' 512-token limit per instance, deploy the model on each part separately, and join the resulting vector representations.
Now, in my recently published paper there is a new method proposed, called Text Guide. Text Guide is a text selection method that allows for improved performance when compared to naive or semi-naive truncation methods. As a text selection method, Text Guide doesn't interfere with the language model, so it can be used to improve the performance of models with a 'standard' token limit (512 for transformer models) or an 'extended' limit (4096, as for instance for the Longformer model). Summary: Text Guide is a low-computational-cost method that improves performance over naive and semi-naive truncation methods. If text instances exceed the limit of models deliberately developed for long text classification, like Longformer (4096 tokens), it can also improve their performance.
There are two main methods:
Concatenating 'short' BERTs (each limited to 512 tokens)
Constructing a genuinely long BERT (CogLTX, Blockwise BERT, Longformer, Big Bird)
I summarized some typical papers on BERT for long text in this post: https://lethienhoablog.wordpress.com/2020/11/19/paper-dissected-and-recap-4-which-bert-for-long-text/
You can have an overview of all methods there.
There is an approach used in the paper Defending Against Neural Fake News (https://arxiv.org/abs/1905.12616).
Their generative model produced outputs of 1024 tokens and they wanted to use BERT to distinguish human from machine generations. They extended the sequence length BERT uses simply by initializing 512 more position embeddings and training them while fine-tuning BERT on their dataset.
You can use the max_position_embeddings argument in the configuration while downloading the BERT model into your kernel. With this argument you can choose 512, 1024 or 2048 as the maximum sequence length.
max_position_embeddings (int, optional, defaults to 512) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
https://huggingface.co/transformers/model_doc/bert.html
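A minimal sketch of that configuration route (note that the pretrained checkpoint only ships 512 position embeddings, so a longer position table has to be trained, as in the answer above that adds 512 extra embeddings during fine-tuning):

from transformers import BertConfig, BertModel

# override the default 512-position limit in the configuration
config = BertConfig.from_pretrained("bert-base-uncased", max_position_embeddings=1024)
# building from the config gives an untrained model with room for 1024 positions
model = BertModel(config)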
A relatively straightforward way to go is altering the input. For example, you can truncate the input or separately classify multiple parts of the input and aggregate the results. However, you would probably lose some useful information this way.
The main obstacle to applying BERT to long texts is that attention needs O(n^2) operations for n input tokens. Some newer methods try to subtly change BERT's architecture and make it suitable for longer texts. For instance, Longformer limits the attention span to a fixed value, so every token is only related to a set of nearby tokens. The table in the Longformer paper (2020, Iz Beltagy et al.) lists a set of attention-based models for long-text classification: LTR methods process the input in chunks from left to right and are suitable for auto-regressive applications. Sparse methods mostly reduce the computational order to O(n) by avoiding a full quadratic attention matrix calculation.

Dataset for RNN-LSTM as Spell checker in python

I have a dataset of more than 5 million records which has many noisy features (words) in it, so I thought of doing spell correction and abbreviation handling.
When I googled for spell correction packages in Python, I found packages like autocorrect, textblob, hunspell etc. and Peter Norvig's method.
Below is a sample of my dataset:
Id description
1 switvch for air conditioner..............
2 control tfrmr...........
3 coling pad.................
4 DRLG machine
5 hair smothing kit...............
I tried the spell correction functions from the above packages using this code:
dataset['description']=dataset['description'].apply(lambda x: list(set([spellcorrection_function(item) for item in x])))
For the entire dataset it took more than 12 hours to complete the spell correction, and it also introduced some noise (for about 20% of the total words, which are important).
For example, in the last row, "smothing" was corrected to "something" but it should be "smoothing" (I would not expect "something" in this context).
Approaching Further
When I observed the dataset, the spelling of a word is not always wrong; there are also correctly spelled instances elsewhere in the dataset. So I tokenized the entire dataset, split correct words from wrong words using a dictionary, applied the Jaro-Winkler similarity method between all pairs of words, and selected the pairs with a similarity value of 0.93 or more:
Wrong word correct word similarity score
switvch switch 0.98
coling cooling 0.98
smothing smoothing 0.99
I got more than 50k pairs of similar words, which I put in a dictionary with the wrong word as key and the correct word as value; a sketch of this step is shown below.
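For reference, the pairing step could be sketched like this (assuming the jellyfish package for the Jaro-Winkler score, named jaro_winkler in older releases; wrong_words and correct_words are the two token sets described above):

import jellyfish

def build_correction_dict(wrong_words, correct_words, threshold=0.93):
    corrections = {}
    for wrong in wrong_words:
        # keep the best-scoring dictionary word at or above the similarity threshold
        best_word, best_score = None, threshold
        for correct in correct_words:
            score = jellyfish.jaro_winkler_similarity(wrong, correct)
            if score >= best_score:
                best_word, best_score = correct, score
        if best_word is not None:
            corrections[wrong] = best_word
    return corrections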
I also kept words with their abbreviations (~3k pairs) in a dictionary:
key value
tfrmr transformer
drlg drilling
I then searched and replaced the key-value pairs using this code:
dataset['description'] = dataset['description'].replace(similar_word_dictionary, regex=True)
dataset['description'] = dataset['description'].replace(abbreviation_dictionary, regex=True)
This code took more than a day to complete for only 10% of my entire dataset, which I found inefficient.
Along with the Python packages I also found Deep Spelling, which is a very efficient way of doing spelling correction. There was a very clear explanation of an RNN-LSTM as a spell checker.
As I don't know much about RNNs and LSTMs, I have only a very basic understanding of the above link.
Question
I am confused about how to construct the training set for the RNN for my problem. Should I:
use the correct words (without any spelling mistakes) in the entire dataset as the training set and the entire description column as the test set,
or use the pairs of similar words and the abbreviation list as the training set and the descriptions of my dataset as the test set (where the model finds the wrong word in a description and corrects it),
or some other way? Could someone please tell me how I can approach this further?
Could you give some more information about the model you are building?
It makes sense to use a character-level sequence-to-sequence model, similar to the one you would use for translation. There are already some approaches trying to do the same (1, 2, 3).
Maybe draw on them for some inspiration?
Now, with regard to the dataset: it seems that the one you are trying to use mostly has errors? If you don't have the correct version of each phrase, I don't think you can use this dataset.
A simple approach would be to take an existing dataset and introduce random noise into it. The Deep Spelling blog talks about how you can do that with an existing text corpus. Also, a recommendation from myself would be to use small-ish standalone sentences as the training set. A good place to find those is machine translation datasets (like the Tatoeba project); only use the English phrases. Out of those you can create pairs of (input_phrase, target_phrase) where the input_phrase is potentially noisy (but not always).
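A minimal sketch of that noise-injection idea (the edit operations, the 10% noise rate and clean_sentences are illustrative assumptions):

import random
import string

def add_noise(phrase, noise_level=0.1):
    # randomly drop, swap or replace characters to create a noisy input phrase
    chars, out, i = list(phrase), [], 0
    while i < len(chars):
        if random.random() < noise_level:
            op = random.choice(["drop", "swap", "replace"])
            if op == "drop":
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], chars[i]])
                i += 2
                continue
            out.append(random.choice(string.ascii_lowercase))
            i += 1
            continue
        out.append(chars[i])
        i += 1
    return "".join(out)

# training pairs: (potentially noisy input, clean target)
pairs = [(add_noise(s), s) for s in clean_sentences]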
With regard to performance: firstly, 12 hours of training for one pass over a 5M dataset sounds about right for a home PC. You can use a GPU or a cloud solution (1, 2) for faster training.
Now for false-positive correction, the dictionary you have created could indeed be handy: if a word exists in this dictionary, don't accept a "correction" on it from the model.

load Doc2Vec model and get new sentence's vectors for test

I have read lots of examples regarding doc2vec, but I couldn't find an answer. As a concrete example: I want to build a model with doc2vec and then train some ML models on top of it. After that, how can I get the vector of a raw string from the same trained Doc2Vec model? I need to predict with my ML model using a vector of the same size and meaning.
There is a collection of example Jupyter (aka IPython) notebooks in the gensim docs/notebooks directory. You can view them online at:
https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
But they'll be in your gensim installation directory, if you can find that for your current working environment.
Those that include doc2vec in their name demonstrate the use of the Doc2Vec class. The most basic intro operates on the 'Lee' corpus that's bundled with gensim for use in its unit tests. (It's really too small for real Doc2Vec success, but by forcing smaller models and many training iterations the notebook just barely manages to get some consistent results.) See:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
It includes a section on inferring a vector for a new text.
Note that inference is performed on a list of string tokens, not a raw string. And those tokens should have been preprocessed/tokenized the same way as the original training data for the model, so that the vocabularies are compatible. (Any unknown words in a new text are silently ignored.)
Note also that especially on short texts, it often helps to provide a much-larger-than-default value of the optional steps parameter to infer_vector() - say 50 or 200 rather than the default 5. It may also help to provide a starting alpha parameter more like the training default of 0.025 than the method-default of 0.1.
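Putting that together, inference on a new raw string might look like this (a sketch; my_doc2vec.model and simple_preprocess stand in for your own saved model and whatever tokenization you used during training, and in newer gensim releases the steps parameter is called epochs):

from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

model = Doc2Vec.load("my_doc2vec.model")
# tokenize the raw string the same way the training texts were tokenized
tokens = simple_preprocess("some new raw string to vectorize")
# more inference passes and a training-like alpha often help on short texts
vector = model.infer_vector(tokens, steps=50, alpha=0.025)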

how to perform classification

I'm trying to perform document classification into two categories (category1 and category2), using Weka.
I've gathered a training set of 600 documents belonging to both categories, and the total number of documents to be classified is 1,000,000.
To perform the classification, I apply the StringToWordVector filter and set the following filter options to true:
- IDF transform
- TF transform
- OutputWordCounts
I'd like to ask a few questions about this process.
1) How many documents should I use as the training set so that over-fitting is avoided?
2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result from the classifier, or do they not play any role?
3) As the classification method I usually choose NaiveBayes, but the results I get are the following:
-------------------------
Correctly Classified Instances 393 70.0535 %
Incorrectly Classified Instances 168 29.9465 %
Kappa statistic 0.415
Mean absolute error 0.2943
Root mean squared error 0.5117
Relative absolute error 60.9082 %
Root relative squared error 104.1148 %
----------------------------
and if I use SMO the results are:
------------------------------
Correctly Classified Instances 418 74.5098 %
Incorrectly Classified Instances 143 25.4902 %
Kappa statistic 0.4742
Mean absolute error 0.2549
Root mean squared error 0.5049
Relative absolute error 52.7508 %
Root relative squared error 102.7203 %
Total Number of Instances 561
------------------------------
So in document classification, which one is the "better" classifier?
Which one is better for small data sets like the one I have?
I've read that NaiveBayes performs better with big data sets, but if I increase my data set, will it cause over-fitting?
Also, about the Kappa statistic: is there an accepted threshold, or does it not matter in this case because there are only two categories?
Sorry for the long post, but I've been trying for a week to improve the classification results with no success, although I tried to get documents that fit better in each category.
1) How many documents shall I use as training set, so that over-fitting is avoided?
You don't need to choose the size of the training set; in WEKA, you can just use 10-fold cross-validation. Coming back to the question: the machine learning algorithm has a much bigger influence on over-fitting than the size of the data set.
2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result from the classifier, or does it not play any role?
It definitely plays a role, but whether the result gets better cannot be promised.
3) As the classification method I usually choose NaiveBayes, but the results I get are the following:
Usually, to judge whether a classification algorithm is good or not, the ROC/AUC/F-measure values are considered the most important indicators. You can learn about them in any machine learning book.
To answer your questions:
I would use (10-fold) cross-validation to evaluate your method. The model is trained 10 times on 90% of the data and tested on 10% of the data, using different parts of the data each time. The results are therefore less biased towards your current (random) selection of train and test set.
Removing stop words (i.e., frequently occurring words with little discriminating value like the, he or and) is a common strategy to improve your classifier. Weka's StringToWordVector allows you to select a file containing these stop words, but it should also have a default list with English stop words.
Given your results, SMO is the better of the two classifiers (e.g., it has more Correctly Classified Instances). You might also want to take a look at (Lib)SVM or LibLinear (you may need to install them if they are not in Weka natively; Weka 3.7.6 has a package manager allowing easy installation), which can perform quite well on document classification as well.
Regarding the second question
2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result at the classifier or it doesn't play any role?
I was building a classifier and training it on the famous 20 Newsgroups dataset. When testing it without preprocessing, the results were not good, so I pre-processed the data according to the following steps (a sketch in Python follows below):
Substitute TAB, NEWLINE and RETURN characters by SPACE.
Keep only letters (that is, turn punctuation, numbers, etc. into SPACES).
Turn all letters to lowercase.
Substitute multiple SPACES by a single SPACE.
The title/subject of each document is simply added in the beginning of the document's text.
no-short Obtained from the previous file, by removing words that are less than 3 characters long. For example, removing "he" but keeping "him".
no-stop Obtained from the previous file, by removing the 524 SMART stopwords. Some of them had already been removed, because they were shorter than 3 characters.
stemmed Obtained from the previous file, by applying Porter's Stemmer to the remaining words. Information about stemming can be found here.
These steps are taken from http://web.ist.utl.pt/~acardoso/datasets/
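A minimal sketch of those normalization steps in Python (the three-word stop list is a placeholder for the 524 SMART stopwords; NLTK's PorterStemmer stands in for Porter's stemmer):

import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = {"the", "he", "and"}  # placeholder for the SMART stopword list

def preprocess(text):
    text = re.sub(r"[\t\n\r]", " ", text)    # tabs/newlines/returns -> space
    text = re.sub(r"[^a-zA-Z]", " ", text)   # keep only letters
    text = text.lower()                      # lowercase
    text = re.sub(r" +", " ", text).strip()  # collapse multiple spaces
    tokens = [t for t in text.split(" ") if len(t) >= 3]   # no-short
    tokens = [t for t in tokens if t not in stop_words]    # no-stop
    return " ".join(stemmer.stem(t) for t in tokens)       # stemmed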
