How to use BERT for long text classification? - nlp

We know that BERT has a maximum limit of 512 tokens, so if an article is much longer than that, say 10,000 tokens,
how can BERT be used?

You have basically three options:
You cut the longer texts off and only use the first 512 tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient.
You can split your text into multiple subtexts, classify each of them, and combine the results back together (for example, choose the class that was predicted for most of the subtexts; a sketch of this approach appears below). This option is obviously more expensive.
You can even feed the output token for each subtext (as in option 2) to another network (but you won't be able to fine-tune) as described in this discussion.
I would suggest trying option 1 first, and only considering the other options if the results are not good enough.
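For option 2, here is a rough sketch of chunking plus a majority vote, assuming a HuggingFace text-classification pipeline and a hypothetical (here: sentiment) checkpoint standing in for your own fine-tuned BERT classifier:

```python
from collections import Counter
from transformers import pipeline

# Hypothetical checkpoint: swap in your own fine-tuned BERT classifier.
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = clf.tokenizer

def classify_long(text, chunk_tokens=510):
    # Split the document into ~510-token chunks (leaving room for [CLS]/[SEP]).
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_tokens] for i in range(0, len(ids), chunk_tokens)]
    texts = [tokenizer.decode(chunk) for chunk in chunks]
    # Classify each chunk and take a majority vote over the predicted labels.
    preds = clf(texts, truncation=True)
    return Counter(p["label"] for p in preds).most_common(1)[0][0]

print(classify_long("some very long article text ... " * 500))
```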

This paper compared a few different strategies: How to Fine-Tune BERT for Text Classification?.
On the IMDb movie review dataset, they actually found that cutting out the middle of the text (rather than truncating the beginning or the end) worked best! It even outperformed more complex "hierarchical" approaches involving breaking the article into chunks and then recombining the results.
As another anecdote, I applied BERT to the Wikipedia Personal Attacks dataset here, and found that simple truncation worked well enough that I wasn't motivated to try other approaches :)
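For reference, here is a minimal sketch of that "cut out the middle" (head+tail) selection; I believe the paper keeps roughly the first 128 and last 382 tokens, but treat the exact split as a hyperparameter:

```python
def head_plus_tail(token_ids, max_len=512, head=128, tail=382):
    """Keep the first `head` and last `tail` tokens of an over-long document,
    i.e. cut out the middle, leaving two slots for [CLS] and [SEP]."""
    if len(token_ids) <= max_len - 2:
        return token_ids
    return token_ids[:head] + token_ids[-tail:]

# Example with dummy token ids:
print(len(head_plus_tail(list(range(10_000)))))  # -> 510
```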

In addition to chunking data and passing it to BERT, check the following new approaches.
There is new research on long-document analysis. Since you asked about BERT, a similar pre-trained transformer, Longformer, has recently been released by Allen AI (https://arxiv.org/abs/2004.05150); check out that link for the paper.
The related-work section of that paper also mentions some previous work on long sequences; Google them too. I would suggest at least going through Transformer-XL (https://arxiv.org/abs/1901.02860). As far as I know, it was one of the first models for long sequences, so it is a good foundation before moving on to Longformer.

You can leverage the HuggingFace Transformers library, which includes the following Transformers that work with long texts (more than 512 tokens):
Reformer: combines the modeling capacity of a Transformer with an architecture that can be executed efficiently on long sequences.
Longformer: uses an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.
Other recently proposed efficient Transformer models include Sparse Transformers (Child et al., 2019), Linformer (Wang et al., 2020), Sinkhorn Transformers (Tay et al., 2020b), Performers (Choromanski et al., 2020b), Synthesizers (Tay et al., 2020a), Linear Transformers (Katharopoulos et al., 2020), and BigBird (Zaheer et al., 2020).
A paper by authors from Google Research and DeepMind compares these Transformers using the Long-Range Arena "aggregated metrics".
They also suggest that Longformer performs better than Reformer on the classification task.
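As a concrete starting point for the classification case, here is a minimal sketch of loading Longformer for sequence classification (using the public allenai/longformer-base-4096 checkpoint; the classification head is freshly initialised and still needs fine-tuning on your data):

```python
from transformers import LongformerTokenizerFast, LongformerForSequenceClassification

checkpoint = "allenai/longformer-base-4096"
tokenizer = LongformerTokenizerFast.from_pretrained(checkpoint)
# The sequence-classification head is randomly initialised: fine-tune before use.
model = LongformerForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

long_document = "your article of up to roughly 4096 tokens " * 300
inputs = tokenizer(long_document, return_tensors="pt",
                   truncation=True, max_length=4096)
logits = model(**inputs).logits
print(logits.shape)  # (1, num_labels)
```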

I have recently (April 2021) published a paper regarding this topic that you can find on arXiv (https://arxiv.org/abs/2104.07225).
There, Table 1 reviews previous approaches to the problem in question, and the whole manuscript is about long text classification, proposing a new method called Text Guide. This new method claims to improve performance over the naive and semi-naive text selection methods used in the paper (https://arxiv.org/abs/1905.05583) mentioned in one of the previous answers to this question.
Long story short about your options:
Low computational cost: use naive/semi-naive approaches to select part of the original text instance. Examples include choosing the first n tokens, or compiling a new text instance out of the beginning and end of the original one.
Medium to high computational cost: use recent transformer models (like Longformer) that have a 4096-token limit instead of 512. In some cases this allows covering the whole text instance, and the modified attention mechanism decreases the computational cost.
High computational cost: divide the text instance into chunks that fit a model like BERT with the 'standard' limit of 512 tokens per instance, deploy the model on each part separately, and join the resulting vector representations (a sketch of this option follows below).
Now, in my recently published paper there is a new method proposed, called Text Guide. Text Guide is a text selection method that improves performance compared to naive or semi-naive truncation methods. As a text selection method, Text Guide doesn't interfere with the language model, so it can be used to improve the performance of models with a 'standard' token limit (512 for transformer models) or an 'extended' limit (4096, as for the Longformer model). Summary: Text Guide is a low-computational-cost method that improves performance over naive and semi-naive truncation methods. If text instances exceed the limit of models deliberately developed for long text classification, like Longformer (4096 tokens), it can also improve their performance.
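For illustration, here is a rough sketch of the third ("high computational cost") option above: chunking with plain BERT and mean-pooling the per-chunk [CLS] vectors into a single document vector for a downstream classifier (the pooling choice and variable names are illustrative, not part of the Text Guide paper):

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def document_vector(text, chunk_tokens=510):
    # Split into BERT-sized chunks, encode each, mean-pool the [CLS] vectors.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_tokens] for i in range(0, len(ids), chunk_tokens)]
    cls_vectors = []
    with torch.no_grad():
        for chunk in chunks:
            enc = tokenizer(tokenizer.decode(chunk), return_tensors="pt",
                            truncation=True, max_length=512)
            cls_vectors.append(bert(**enc).last_hidden_state[0, 0].numpy())
    return np.mean(cls_vectors, axis=0)

# Hypothetical training data:
# X = np.stack([document_vector(d) for d in train_texts])
# clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
```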

There are two main methods:
Concatenating 'short' BERTs (each limited to 512 tokens)
Constructing a truly long BERT (CogLTX, Blockwise BERT, Longformer, Big Bird)
I summarized some typical papers on BERT for long text in this post: https://lethienhoablog.wordpress.com/2020/11/19/paper-dissected-and-recap-4-which-bert-for-long-text/
You can get an overview of all the methods there.

There is an approach used in the paper Defending Against Neural Fake News ( https://arxiv.org/abs/1905.12616)
Their generative model produced outputs of 1024 tokens, and they wanted to use BERT to distinguish human from machine generations. They extended the sequence length that BERT uses simply by initializing 512 more position embeddings and training them while fine-tuning BERT on their dataset.
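This is not the authors' code, but a rough sketch of the same idea against a current HuggingFace BertModel; attribute names such as embeddings.position_embeddings follow that implementation and may differ between library versions:

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
new_max = 1024
old = model.embeddings.position_embeddings           # nn.Embedding(512, hidden)
new = torch.nn.Embedding(new_max, old.embedding_dim)
with torch.no_grad():
    # Keep the 512 pretrained position vectors; positions 512..1023 stay
    # randomly initialised and get trained during fine-tuning.
    new.weight[: old.num_embeddings] = old.weight
model.embeddings.position_embeddings = new
model.config.max_position_embeddings = new_max
# Depending on the transformers version, the cached position_ids buffer
# also needs to cover the new length:
model.embeddings.register_buffer(
    "position_ids", torch.arange(new_max).expand((1, -1)), persistent=False
)
```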

You can use the max_position_embeddings argument in the configuration while downloading the BERT model into your kernel. With this argument you can choose 512, 1024, or 2048 as the maximum sequence length.
max_position_embeddings (int, optional, defaults to 512) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
https://huggingface.co/transformers/model_doc/bert.html
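For illustration (with the caveat that a checkpoint such as bert-base-uncased only ships 512 trained position embeddings, so anything beyond that is randomly initialised and must be trained, as the previous answer explains):

```python
from transformers import BertConfig, BertModel

# A BERT configured for 1024 positions. Built from the config alone,
# the whole model is untrained; loading pretrained weights into it would
# still leave positions 512..1023 without pretrained embeddings.
config = BertConfig(max_position_embeddings=1024)
model = BertModel(config)
print(model.config.max_position_embeddings)  # 1024
```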

A relatively straightforward way to go is altering the input. For example, you can truncate the input or separately classify multiple parts of the input and aggregate the results. However, you would probably lose some useful information this way.
The main obstacle to applying BERT to long texts is that self-attention needs O(n^2) operations for n input tokens. Some newer methods subtly change BERT's architecture to make it compatible with longer texts. For instance, Longformer limits the attention span to a fixed window so that each token only attends to a set of nearby tokens. A table in the Longformer paper (Beltagy et al., 2020) surveys attention-based models for long-text classification: LTR methods process the input in chunks from left to right and are suitable for auto-regressive applications, while sparse methods mostly reduce the computational order to O(n) by avoiding a full quadratic attention matrix calculation.

Related

Which HuggingFace summarization models support more than 1024 tokens? Which model is more suitable for programming related articles?

If this is not the best place to ask this question, please lead me to the most accurate one.
I am planning to use one of the Huggingface summarization models (https://huggingface.co/models?pipeline_tag=summarization) to summarize my lecture video transcriptions.
So far I have tested facebook/bart-large-cnn and sshleifer/distilbart-cnn-12-6, but they only support a maximum of 1,024 tokens as input.
So, here are my questions:
Are there any summarization models that support longer inputs such as 10,000 word articles?
What are the optimal output lengths for given input lengths? Let's say for a 1,000 word input, what is the optimal (minimum) output length (the min. length of the summarized text)?
Which model would likely work on programming related articles?
Question 1
Are there any summarization models that support longer inputs such as 10,000 word articles?
Yes, the Longformer Encoder-Decoder (LED) [1] model published by Beltagy et al. is able to process up to 16k tokens. Various LED models are available here on HuggingFace. There is also PEGASUS-X [2] published recently by Phang et al. which is also able to process up to 16k tokens. Models are also available here on HuggingFace.
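A minimal generation sketch with one of the public LED checkpoints (the checkpoint name and generation settings are only examples; LED expects global attention on at least the first token):

```python
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

checkpoint = "allenai/led-large-16384-arxiv"   # example checkpoint
tokenizer = LEDTokenizer.from_pretrained(checkpoint)
model = LEDForConditionalGeneration.from_pretrained(checkpoint)

transcript = "paste your lecture transcript here " * 2000   # ~10k words
inputs = tokenizer(transcript, return_tensors="pt",
                   truncation=True, max_length=16384)
# Global attention on the first token, local attention everywhere else.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(inputs["input_ids"],
                             global_attention_mask=global_attention_mask,
                             num_beams=4, max_length=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```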
Alternatively, you can look at either:
Extractive followed by abstractive summarisation, or
Splitting a large document into chunks of max_input_length (e.g. 1024), summarising each, and then concatenating the summaries together (a naive version is sketched after this list). Care will have to be taken over how the documents are chunked, so as to avoid splitting mid-way through particular topics, or producing a relatively short final chunk that may yield an unusable summary.
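Here is the naive chunk-and-concatenate version mentioned above, using one of the checkpoints from the question (token-based chunking only; a production version should split on topic or paragraph boundaries):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
tokenizer = summarizer.tokenizer

def summarize_long(text, chunk_tokens=900):
    # Chunk by tokens so every piece fits inside the 1024-token input limit.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_tokens] for i in range(0, len(ids), chunk_tokens)]
    pieces = [tokenizer.decode(chunk, skip_special_tokens=True) for chunk in chunks]
    partial = summarizer(pieces, min_length=40, max_length=150, truncation=True)
    return " ".join(p["summary_text"] for p in partial)

print(summarize_long("your long transcript text " * 3000))
```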
Question 2
What are the optimal output lengths for given input lengths? Let's say for a 1,000 word input, what is the optimal (minimum) output length (i.e. the min. length of the summarized text)?
This is a very difficult question to answer, as it is hard to empirically evaluate the quality of a summarisation. I would suggest running a few tests yourself with varied output length limits (e.g. 20, 50, 100, 200) and finding what subjectively works best. Each model and document genre will be different. Anecdotally, I would say 50 words will be a good minimum, with 100-150 offering better results.
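For example, a quick way to eyeball this on your own transcripts (the length pairs are arbitrary starting points):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
excerpt = "a roughly 1,000-word excerpt of a lecture transcript " * 150

for min_len, max_len in [(20, 60), (50, 120), (100, 200)]:
    result = summarizer(excerpt, min_length=min_len, max_length=max_len,
                        truncation=True)[0]["summary_text"]
    print(f"--- min_length={min_len}, max_length={max_len} ---\n{result}\n")
```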
Question 3
Which model would likely work on programming related articles?
I can imagine three possible cases for what constitutes a programming related article.
Source code summarisation (which involves producing a natural (informal) language summary of code (formal language)).
Traditional abstractive summarisation (i.e. natural language summary of natural language but for articles talking about programming yet have no code).
Combination of both 1 and 2.
For case (1), I'm not aware of any implementations on HuggingFace that focus on this problem. However, it is an active research topic (see [3], [4], [5]).
For case (2), you can use the models you've been using already, and if feasible, fine tune on your own specific dataset of programming related articles.
For case (3), simply look at combining implementations from both (1) and (2) based on whether the input is categorised as either formal (code) or informal (natural) language.
References
[1] Beltagy, I., Peters, M.E. and Cohan, A., 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
[2] Phang, J., Zhao, Y. and Liu, P.J., 2022. Investigating Efficiently Extending Transformers for Long Input Summarization. arXiv preprint arXiv:2208.04347.
[3] Ahmad, W.U., Chakraborty, S., Ray, B. and Chang, K.W., 2020. A transformer-based approach for source code summarization. arXiv preprint arXiv:2005.00653.
[4] Wei, B., Li, G., Xia, X., Fu, Z. and Jin, Z., 2019. Code generation as a dual task of code summarization. Advances in neural information processing systems, 32.
[5] Wan, Y., Zhao, Z., Yang, M., Xu, G., Ying, H., Wu, J. and Yu, P.S., 2018, September. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE international conference on automated software engineering (pp. 397-407).

PEGASUS pre-training for summarisation tasks

I am unsure of how the evaluation for large document summarisation is conducted for the recently introduced PEGASUS model for single document summarisation.
The authors show evaluation on large-document datasets like Big Patent, PubMed, etc., with document lengths exceeding the input size of the transformer models.
To quote from the paper, they did talk about this but didn't really elaborate further.
CNN/DailyMail, Multi-News, arXiv, PubMed, BIGPATENT datasets contain input documents longer than the maximum input length (L_input = 512 tokens) in pre-training. This would present a problem for position embeddings which would never be updated for longer input lengths, but we confirm the postulation that sinusoidal positional encodings (Vaswani et al., 2017) generalize well when fine-tuning PEGASUS-Large beyond the input lengths observed in training up to L_input = 1024 tokens. Since average input length in BIGPATENT, arXiv, PubMed and Multi-News are well beyond 1024 tokens, further scaling up L_input or applying a two-stage approach (Liu et al., 2018) may improve performance even more, although this is outside the scope of this work.
They did mention that the input length goes up to 1024 tokens, and for the PEGASUS Large model on HuggingFace the maximum number of input tokens is also 1024.
I am not sure how they managed to extend their document summarisation beyond 1024 tokens.
I would like to do something similar for my own long-document summarisations.

How can I simplify BoWs?

I'm trying to apply some binary text classification, but I don't feel that having millions of >1k-length vectors is a good idea. So, what alternatives are there to the basic BoW model?
I think there are quite a few different approaches, based on what exactly you are aiming for in your prediction task (processing speed over accuracy, variance in your text data distribution, etc.).
Without any further information on your current implementation, I think the following avenues offer ways for improvement in your approach:
Using sparse data representations. This might be a very obvious point, but choosing the right data structure to represent your input vectors can already save you a great deal of pain. Sklearn offers a variety of options and details them in its great user guide. Specifically, you could either use scipy.sparse matrices, or alternatively represent the data with sklearn's DictVectorizer.
Limit your vocabulary. There might be some words that you can easily ignore when building your BoW representation. I'm again assuming that you're working with some implementation similar to sklearn's CountVectorizer, which already offers a great number of possibilities. The most obvious option is stopwords, which can simply be dropped from your vocabulary entirely, but of course you can also limit it further with pre-processing steps such as lemmatization/stemming, lowercasing, etc. CountVectorizer specifically also allows you to control the minimum and maximum document frequency (don't confuse this with corpus frequency), which again should limit the size of your vocabulary.
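For instance, both suggestions combined in a CountVectorizer (the thresholds are illustrative; tune them for your corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "a dog chased the cat",
    "my dog sat on the sofa",
    "the mouse hid under the sofa",
]  # stand-in for your real corpus

vectorizer = CountVectorizer(
    stop_words="english",   # drop stopwords entirely
    lowercase=True,
    min_df=2,               # ignore terms appearing in fewer than 2 documents
    max_df=0.9,             # ignore terms appearing in more than 90% of documents
    max_features=50_000,    # hard cap on vocabulary size
)
X = vectorizer.fit_transform(docs)   # scipy.sparse matrix, not a dense array
print(X.shape, len(vectorizer.vocabulary_))
```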

Document classification using LSA/SVD

I am trying to do document classification using Support Vector Machines (SVM). The documents I have are collection of emails. I have around 3000 documents to train the SVM classifier and have a test document set of around 700 for which I need classification.
I initially used binary DocumentTermMatrix as the input for SVM training. I got around 81% accuracy for the classification with the test data. DocumentTermMatrix was used after removing several stopwords.
Since I wanted to improve the accuracy of this model, I tried using LSA/SVD-based dimensionality reduction and used the resulting reduced factors as input to the classification model (I tried with 20, 50, 100 and 200 singular values from the original bag of ~3000 words). The performance of the classification worsened in each case. (Another reason for using LSA/SVD was to overcome memory issues with one of the response variables, which had 65 levels.)
Can someone provide some pointers on how to improve the performance of LSA/SVD classification? I realize this is general question without any specific data or code but would appreciate some inputs from the experts on where to start the debugging.
FYI, I am using R for the text preprocessing (packages: tm, snowball, lsa) and for building the classification models (package: kernelsvm).
Thank you.
Here's some general advice - nothing specific to LSA, but it might help improving the results nonetheless.
'Binary DocumentTermMatrix' seems to imply your data is represented by binary values, i.e. 1 for a term existing in a document and 0 for a non-existing term; moving to another scoring scheme (e.g. tf-idf) might lead to better results.
LSA is a good method for dimensionality reduction in some cases, but less so in others. So depending on the exact nature of your data, it might be a good idea to consider additional methods, e.g. information gain (Infogain).
If the main incentive for reducing the dimensionality is the one parameter with 65 levels, maybe treating this parameter specifically, e.g. by some form of quantization, would lead to a better tradeoff?
This might not be the best-tailored answer; I hope these suggestions help.
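To make the first two suggestions concrete, here is the analogous pipeline sketched in Python/scikit-learn (the question uses R's tm/lsa packages; this is just the same idea, with hypothetical train_texts/train_labels/test_texts variables):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import LinearSVC

lsa_svm = make_pipeline(
    TfidfVectorizer(stop_words="english", sublinear_tf=True),  # tf-idf, not binary 0/1
    TruncatedSVD(n_components=200),                            # LSA with 200 singular vectors
    Normalizer(copy=False),                                    # re-normalise after SVD
    LinearSVC(C=1.0),
)
# lsa_svm.fit(train_texts, train_labels)        # e.g. the ~3000 training emails
# predictions = lsa_svm.predict(test_texts)     # the ~700 test emails
```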
Maybe you could use lemmatization over stemming to reduce unacceptable outcomes.
Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
One instance:
go, goes, going -> Lemma: go, go, go | Stemming: go, goe, go
Also use a predefined set of rules so that shortened (contracted) forms are expanded. For instance:
I'am -> I am
shouldn't -> should not
can't -> can not
How to deal with parentheses inside a sentence?
This is a dog (its name is doggy).
Text inside parentheses often refers to alias names of the entities mentioned. You can either remove it or do coreference analysis and treat it as a new sentence.
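A small illustrative sketch of both normalisation steps (the contraction list is obviously incomplete, and dropping parentheses is the simpler of the two options mentioned):

```python
import re

CONTRACTIONS = {
    "i'am": "i am",
    "i'm": "i am",
    "can't": "can not",
    "shouldn't": "should not",
    "won't": "will not",
}

def normalise(text):
    text = text.lower()
    # Drop parenthetical asides such as "(its name is doggy)".
    text = re.sub(r"\([^)]*\)", " ", text)
    # Expand contracted forms using the rule table above.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return re.sub(r"\s+", " ", text).strip()

print(normalise("This is a dog(Its name is doggy). It can't bark, shouldn't it?"))
# -> "this is a dog . it can not bark, should not it?"
```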
Try to use Local LSA, which can improve the classification process compared to Global LSA. In addition, LSA's power depends entirely on its parameters, so try to tweak parameters (start with 1, then 2 or more) and compare results to enhance the performance.

NLP software for classification of large datasets

Background
For years I've been using my own Bayesian-like methods to categorize new items from external sources based on a large and continually updated training dataset.
There are three types of categorization done for each item:
30 categories, where each item must belong to at least one category and at most two categories.
10 other categories, where each item is only associated with a category if there is a strong match, and each item can belong to as many categories as match.
4 other categories, where each item must belong to only one category, and if there isn't a strong match the item is assigned to a default category.
Each item consists of English text of around 2,000 characters. In my training dataset there are about 265,000 items, which contain a rough estimate of 10,000,000 features (unique three word phrases).
My homebrew methods have been fairly successful, but definitely have room for improvement. I've read the NLTK book's chapter "Learning to Classify Text", which was great and gave me a good overview of NLP classification techniques. I'd like to be able to experiment with different methods and parameters until I get the best classification results possible for my data.
The Question
What off-the-shelf NLP tools are available that can efficiently classify such a large dataset?
Those I've tried so far:
NLTK
TIMBL
I tried to train them with a dataset that consisted of less than 1% of the available training data: 1,700 items, 375,000 features. For NLTK I used a sparse binary format, and a similarly compact format for TIMBL.
Both seemed to rely on doing everything in memory, and quickly consumed all system memory. I can get them to work with tiny datasets, but nothing large. I suspect that if I tried incrementally adding the training data the same problem would occur either then or when doing the actual classification.
I've looked at Google's Prediction API, which seems to do much of what I'm looking for, but not everything. I'd also like to avoid relying on an external service if possible.
About the choice of features: in testing with my homebrew methods over the years, three word phrases produced by far the best results. Although I could reduce the number of features by using words or two word phrases, that would most likely produce inferior results and would still be a large number of features.
After this post, and based on personal experience, I would recommend Vowpal Wabbit. It is said to have one of the fastest text classification algorithms.
MALLET has a number of classifiers (NB, MaxEnt, CRF, etc.). It was written by Andrew McCallum's group. SVMLib is another good option, but SVM models typically require a bit more tuning than MaxEnt. Alternatively, some sort of online clustering like K-means might not be bad in this case.
SVMLib and MALLET are quite fast (C and Java) once you have your model trained. Model training can take a while though! Unfortunately it's not always easy to find example code. I have some examples of how to use MALLET programmatically (along with the Stanford Parser, which is slow and probably overkill for your purposes). NLTK is a great learning tool, and it is simple enough that if you can prototype what you are doing there, that's ideal.
NLP is more about features and data quality than about which machine learning method you use. 3-grams might be good, but how about character n-grams across those? I.e., all the character n-grams within a 3-gram, to account for spelling variations/stemming/etc. (see the sketch below)? Named entities might also be useful, or some sort of lexicon.
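For example, word 3-grams and character n-grams can be combined into one sparse feature space; a scikit-learn illustration (with ~10,000,000 features you may prefer HashingVectorizer, which avoids keeping the vocabulary in memory):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

features = FeatureUnion([
    # the three-word phrases already used as features
    ("word_3grams", CountVectorizer(analyzer="word", ngram_range=(3, 3), binary=True)),
    # character n-grams within word boundaries, robust to spelling variation
    ("char_ngrams", CountVectorizer(analyzer="char_wb", ngram_range=(3, 5), binary=True)),
])

docs = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown dog jumps over a lazy fox",
]  # stand-in for the ~265,000 training items
X = features.fit_transform(docs)   # sparse matrix usable by most linear classifiers
print(X.shape)
```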
I would recommend Mahout as it is intended for handling very large scale data sets.
The ML algorithms are built over Apache Hadoop(map/reduce), so scaling is inherent.
Take a look at the classification section at the link below and see if it helps.
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Have you tried MALLET?
I can't be sure that it will handle your particular dataset but I've found it to be quite robust in previous tests of mine.
However, my focus was on topic modeling rather than classification per se.
Also, be aware that with many NLP solutions you needn't input the "features" yourself (such as the n-grams, i.e. the three-word phrases and two-word phrases mentioned in the question) but can instead rely on the various NLP functions to produce their own statistical model.
