PEGASUS pre-training for summarisation tasks - nlp

I am unsure of how the evaluation for large document summarisation is conducted for the recently introduced PEGASUS model for single document summarisation.
The authors show evaluation against large-document datasets like BIGPATENT, PubMed, etc., with document lengths exceeding the input size of the transformer models.
To quote from the paper, they did talk about this but didn't really elaborate further.
CNN/DailyMail, Multi-News, arXiv, PubMed, BIGPATENT datasets contain input documents longer than the maximum input length (L_input = 512 tokens) in pre-training. This would present a problem for position embeddings which would never be updated for longer input lengths, but we confirm the postulation that sinusoidal positional encodings (Vaswani et al., 2017) generalize well when fine-tuning PEGASUS_LARGE beyond the input lengths observed in training up to L_input = 1024 tokens. Since average input length in BIGPATENT, arXiv, PubMed and Multi-News are well beyond 1024 tokens, further scaling up L_input or applying a two-stage approach (Liu et al., 2018) may improve performance even more, although this is outside the scope of this work.
They did mention that the input length is up to 1024 tokens. In the PEGASUS Large model on Hugging Face, the maximum number of input tokens is also 1024.
I am not sure how they managed to extend their document summarisations for more than 1024 tokens.
I would also like to do something similar for my own long-document summarisation experiments.


OpenAI API: Does fine-tuning have a token limit?

In the documentation for the GPT-3 API, it says: "One limitation to keep in mind is that, for most models, a single API request can only process up to 2,048 tokens (roughly 1,500 words) between your prompt and completion."
In the documentation for fine-tuning, it says: "The more training samples you have, the better. We recommend having at least a couple hundred examples. In general, we've found that each doubling of the dataset size leads to a linear increase in model quality."
My question is: does the 1,500-word limit also apply to fine-tuned models? Does "doubling of the dataset size" refer to the number of training examples rather than the size of each training example?
As far as I understand...
GPT-3 models have token limits because you can only provide 1 prompt and you only get 1 completion. Therefore, as stated in the official OpenAI article:
Depending on the model used, requests can use up to 4097 tokens shared
between prompt and completion. If your prompt is 4000 tokens, your
completion can be 97 tokens at most.
Whereas, fine-tuning as such does not have a token limit (i.e., you can have a million training examples, a million prompt-completion pairs) as stated on the official OpenAI website:
The more training examples you have, the better. We recommend having
at least a couple hundred examples. In general, we've found that each
doubling of the dataset size leads to a linear increase in model quality.
But each individual fine-tuning prompt-completion pair does have a token limit, and no pair should exceed it.
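The shared budget described in the quote above can be sketched as simple arithmetic. This is only an illustration: the function name `max_completion_tokens` is hypothetical, and the default of 4097 is just the figure from the quote, which depends on the model used.

```python
def max_completion_tokens(prompt_tokens: int, context_limit: int = 4097) -> int:
    """Tokens left for the completion when prompt and completion share one limit.

    `context_limit` defaults to the figure quoted above; the real value
    depends on the model (assumption for illustration).
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # A prompt at or over the limit leaves no room for a completion.
    return max(context_limit - prompt_tokens, 0)


print(max_completion_tokens(4000))  # 97, matching the quoted example
```

The same check could be applied to each prompt-completion pair in a fine-tuning dataset before uploading it.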

Evaluation of gensim Doc2Vec model for Recommendations

I have developed a pipeline to extract text from documents, preprocess the text, and train a gensim Doc2vec model on given documents. Given a document in my corpus, I would like to recommend other documents in the corpus.
I want to know how I can evaluate my model without having a pre-defined list of "good" recommendations. Any ideas?
One simple self-check that can be used to catch some big problems with a Doc2Vec model training pipeline – like gross misparameterizations, or insufficient data/epochs – is to re-infer vectors for the training texts (using .infer_vector()), and check that generally:
the bulk-trained vector for the same text is "close to" the re-inferred vector - such as its nearest-neighbor, or one of the top neighbors, in a .most_similar() operation on the re-inferred text
the overall list of nearest-neighbors (from .most_similar()) for the bulk-trained vector, & the re-inferred vector, are very similar.
They won't necessarily be identical, for reasons explained in Q11 & Q12 of the Gensim Project FAQ, but if they're wildly-different, then something foundational has gone wrong, like:
insufficient (in quantity or quality/form) training data
misparameterizations, like too few epochs or too-large (overfitting-prone) vectors for the quantity of data
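A minimal sketch of that self-check, written against plain Python lists so it does not depend on any particular library version. The helper names (`self_rank`, `self_check`) and the toy vectors are assumptions for illustration; in a real pipeline the bulk-trained vectors would come from the trained model and the re-inferred ones from .infer_vector().

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def self_rank(doc_id, reinferred_vec, bulk_vectors):
    """Rank (0 = nearest) of a document's own bulk-trained vector among all
    bulk-trained vectors, ordered by similarity to its re-inferred vector."""
    ranked = sorted(
        bulk_vectors,
        key=lambda d: cosine(reinferred_vec, bulk_vectors[d]),
        reverse=True,
    )
    return ranked.index(doc_id)


def self_check(bulk_vectors, reinferred_vectors, top_k=1):
    """Fraction of documents whose own bulk vector is among the top_k
    neighbors of their re-inferred vector."""
    hits = sum(
        self_rank(d, reinferred_vectors[d], bulk_vectors) < top_k
        for d in bulk_vectors
    )
    return hits / len(bulk_vectors)


# Toy vectors standing in for real model output (hypothetical data).
bulk = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
reinferred = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [-0.8, 0.2]}
print(self_check(bulk, reinferred))
```

A score far below 1.0 on real data would be the kind of "wildly different" result that suggests one of the problems listed above.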
Ultimately, though, the variety of data sources & intended uses & possible dimensions of "recommendation-worthy" mean that you need custom inputs, based on your project's needs, usually from the intended audience (or your own ability to simulate/represent it).
In the original paper introducing the "Paragraph Vector" algorithm (what's inside the Doc2Vec class), and a followup evaluating it on Wikipedia & arXiv articles, several of the evaluations used triplets of documents, where 2 of the triplet were conjectured to be "necessarily similar" based on some preexisting system's groupings, and the 3rd randomly-chosen.
The algorithm's performance, and relative performance under different parameter choices, was scored based on how often it placed the 2 presumptively-related documents closer-together than the 3rd randomly-chosen document.
For example, one of the original paper's evaluations used brief search-engine-result snippets as documents, and considered any 2 documents that appeared as sibling top-10 results for the same query as presumptively-related. Two of the followup paper's evaluations used the human-curated categories of Wikipedia or arXiv as signalling that articles of the same category should be presumptively-related.
It's imperfect, but allowed the creation of large evaluation sets from already-existing systems/data, which generally pointed results in the same direction as human senses-of-relatedness.
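The triplet scoring described above can be sketched as follows. This is one possible reading, not the papers' actual code: a triplet counts as correct when the first document is more similar (by cosine) to its presumptively-related partner than to the randomly-chosen third document. The function name `triplet_accuracy` and the toy vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def triplet_accuracy(triplets, vectors):
    """Share of (doc, related, random) triplets in which `related` is
    closer to `doc` than `random` is."""
    correct = 0
    for doc, related, random_doc in triplets:
        sim_related = cosine(vectors[doc], vectors[related])
        sim_random = cosine(vectors[doc], vectors[random_doc])
        if sim_related > sim_random:
            correct += 1
    return correct / len(triplets)


# Hypothetical doc-vectors and triplets, for illustration only.
vectors = {
    "d1": [1.0, 0.0],
    "d2": [0.9, 0.1],
    "d3": [0.0, 1.0],
    "d4": [0.1, 0.9],
}
triplets = [("d1", "d2", "d3"), ("d3", "d4", "d1"), ("d1", "d3", "d2")]
print(triplet_accuracy(triplets, vectors))
```

Because the score is a single number, it can be recomputed for each candidate parameter setting and the settings compared directly.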
Perhaps you can find a similar preexisting guide for your data. Or, as you perform ad-hoc checking, be sure to capture every judgement you make, so that it becomes, over time, a growing dataset of desirable pairings that are either (a) better than some other result that was co-presented; or (b) just "presumably good enough" that they usually should rank higher than other random 3rd documents. A large amount of imprecision in such desirability-data is tolerable, as it can even out as the set of probe-pairings grows, and the power of being able to automate bulk quantitative evaluations (reusing old assessments against new parameters/models) drives far more overall improvement than any small glitches in the evaluations cost.

How to use Bert for long text classification?

We know that BERT has a maximum length of 512 tokens. So if an article is much longer than that, say 10,000 tokens,
how can BERT be used?
You have basically three options:
You cut the longer texts off and only use the first 512 tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient.
You can split your text into multiple subtexts, classify each of them, and combine the results back together (for example, choose the class that was predicted for most of the subtexts). This option is obviously more expensive.
You can even feed the output token for each subtext (as in option 2) to another network (but you won't be able to fine-tune) as described in this discussion.
I would suggest to try option 1, and only if this is not good enough to consider the other options.
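The first two options can be sketched in a few lines. This is a simplified illustration: `classify_chunk` is a hypothetical stand-in for a real model call, the toy classifier below is invented for the demo, and real BERT inputs also reserve room for special tokens, which this sketch ignores.

```python
from collections import Counter


def truncate(tokens, max_len=512):
    """Option 1: keep only the first max_len tokens."""
    return list(tokens[:max_len])


def split_into_chunks(tokens, max_len=512):
    """Split a token list into consecutive chunks of at most max_len tokens."""
    return [list(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]


def classify_by_majority(tokens, classify_chunk, max_len=512):
    """Option 2: classify each chunk and return the most common label.
    Ties go to the label that was seen first."""
    chunks = split_into_chunks(tokens, max_len)
    if not chunks:
        raise ValueError("cannot classify an empty token list")
    votes = Counter(classify_chunk(chunk) for chunk in chunks)
    return votes.most_common(1)[0][0]


def toy_classifier(chunk):
    """Hypothetical classifier: 'pos' if 'good' outnumbers 'bad'."""
    return "pos" if chunk.count("good") > chunk.count("bad") else "neg"


tokens = ["good", "good", "bad", "good",
          "bad", "bad", "bad", "good",
          "good", "good"]
print(truncate(tokens, max_len=4))
print(classify_by_majority(tokens, toy_classifier, max_len=4))
```

Majority voting is only one way to combine the per-chunk results; averaging predicted probabilities would be another.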
This paper compared a few different strategies: How to Fine-Tune BERT for Text Classification?.
On the IMDb movie review dataset, they actually found that cutting out the middle of the text (rather than truncating the beginning or the end) worked best! It even outperformed more complex "hierarchical" approaches involving breaking the article into chunks and then recombining the results.
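A minimal sketch of "cutting out the middle": keep some tokens from the start and fill the rest of the budget from the end. The function name and the default `head_len=128` are assumptions chosen for illustration, not values taken from this answer.

```python
def keep_head_and_tail(tokens, max_len=512, head_len=128):
    """Drop the middle of a long token list, keeping the first head_len
    tokens and the last (max_len - head_len) tokens."""
    if not 0 <= head_len <= max_len:
        raise ValueError("head_len must be between 0 and max_len")
    tokens = list(tokens)
    if len(tokens) <= max_len:
        return tokens  # short enough, nothing to cut
    tail_len = max_len - head_len
    # Guard against tokens[-0:], which would return the whole list.
    tail = tokens[-tail_len:] if tail_len else []
    return tokens[:head_len] + tail


print(keep_head_and_tail(list(range(10)), max_len=6, head_len=2))
# [0, 1, 6, 7, 8, 9]
```

The split between head and tail is a tunable choice; which split works best is likely to depend on where the informative text sits in your documents.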
As another anecdote, I applied BERT to the Wikipedia Personal Attacks dataset here, and found that simple truncation worked well enough that I wasn't motivated to try other approaches :)
In addition to chunking data and passing it to BERT, check the following new approaches.
There is also new research on long-document analysis. Since you asked about BERT: a similar pre-trained transformer, Longformer, has recently been made available by ALLEN NLP. Check out the Longformer paper for details.
The related work section also mentions some previous work on long sequences; look those up too. I'd suggest at least going through Transformer-XL. As far as I know, it was one of the initial models for long sequences, so it would be a good foundation before moving on to Longformer.
You can leverage from the HuggingFace Transformers library that includes the following list of Transformers that work with long texts (more than 512 tokens):
Reformer: that combines the modeling capacity of a Transformer with an architecture that can be executed efficiently on long sequences.
Longformer: with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.
Eight other recently proposed efficient Transformer models include Sparse Transformers (Child et al., 2019), Linformer (Wang et al., 2020), Sinkhorn Transformers (Tay et al., 2020b), Performers (Choromanski et al., 2020b), Synthesizers (Tay et al., 2020a), Linear Transformers (Katharopoulos et al., 2020), and BigBird (Zaheer et al., 2020).
The paper from the authors from Google Research and DeepMind tries to make a comparison between these Transformers based on Long-Range Arena "aggregated metrics":
They also suggest that Longformers have better performance than Reformer when it comes to the classification task.
I have recently (April 2021) published a paper on this topic, which you can find on arXiv.
There, Table 1 reviews previous approaches to the problem in question, and the whole manuscript is about long text classification and proposes a new method called Text Guide. This new method claims to improve performance over the naive and semi-naive text selection methods used in the paper that was mentioned in one of the previous answers to this question.
Long story short about your options:
Low computational cost: use naive/semi naive approaches to select a part of original text instance. Examples include choosing first n tokens, or compiling a new text instance out of the beginning and end of original text instance.
Medium to high computational cost: use recent transformer models (like Longformer) that have a 4096-token limit instead of 512. In some cases this will allow for covering the whole text instance, and the modified attention mechanism decreases computational cost.
High computational cost: divide the text instance into chunks that fit a model like BERT with ‘standard’ 512 limit of tokens per instance, deploy the model on each part separately, join the resulting vector representations.
Now, in my recently published paper there is a new method proposed called Text Guide. Text Guide is a text selection method that allows for improved performance when compared to naive or semi-naive truncation methods. As a text selection method, Text Guide doesn’t interfere with the language model, so it can be used to improve performance of models with ‘standard’ limit of tokens (512 for transformer models) or ‘extended’ limit (4096 as for instance for the Longformer model). Summary: Text Guide is a low-computational-cost method that improves performance over naive and semi-naive truncation methods. If text instances are exceeding the limit of models deliberately developed for long text classification like Longformer (4096 tokens), it can also improve their performance.
There are two main methods:
Concatenating 'short' BERT altogether (which consists of 512 tokens max)
Constructing a real long BERT (CogLTX, Blockwise BERT, Longformer, Big Bird)
I summarized some typical papers on BERT for long text in this post:
You can have an overview of all methods there.
There is an approach used in the paper Defending Against Neural Fake News.
Their generative model was producing outputs of 1024 tokens and they wanted to use BERT for human vs machine generations. They extended the sequence length which BERT uses simply by initializing 512 more embeddings and training them while they were fine-tuning BERT on their dataset.
You can use the max_position_embeddings argument in the configuration while downloading the BERT model into your kernel. With this argument you can choose 512, 1024, or 2048 as the max sequence length.
max_position_embeddings (int, optional, defaults to 512) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
A relatively straightforward way to go is altering the input. For example, you can truncate the input or separately classify multiple parts of the input and aggregate the results. However, you would probably lose some useful information this way.
The main obstacle of applying Bert on long texts is that attention needs O(n^2) operations for n input tokens. Some newer methods try to subtly change the Bert's architecture and make it compatible for longer texts. For instance, Longformer limits the attention span to a fixed value so every token would only be related to a set of nearby tokens. This table (Longformer 2020, Iz Beltagy et al.) demonstrates a set of attention-based models for long-text classification:
LTR methods process the input in chunks from left to right and are suitable for auto-regressive applications. Sparse methods mostly reduce the computational order to O(n) by avoiding a full quadratic attention matrix calculation.
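To make the cost argument concrete, the sketch below counts how many (query, key) pairs are scored under full attention versus a fixed local window. It is a deliberate simplification: the function names and the window size of 256 are assumptions, and real models such as Longformer combine local windows with other attention patterns that are not modelled here.

```python
def full_attention_pairs(n):
    """Every token attends to every token: n * n pairs."""
    return n * n


def windowed_attention_pairs(n, window):
    """Each token attends only to tokens at most `window` positions away
    on either side (itself included), clipped at the sequence edges."""
    return sum(
        min(n - 1, i + window) - max(0, i - window) + 1
        for i in range(n)
    )


for n in (512, 4096):
    full = full_attention_pairs(n)
    local = windowed_attention_pairs(n, window=256)
    print(f"n={n}: full={full}, windowed={local}, ratio={local / full:.3f}")
```

With the window fixed, the windowed count grows roughly linearly in n, while the full count grows quadratically.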

Which method dm or dbow works well for document similarity using Doc2Vec?

I'm trying to find out the similarity between 2 documents. I'm using Gensim Doc2Vec to train on around 10k documents. There are around 10 string-type tags. Each tag consists of a unique word and covers some set of documents. The model is trained using the distributed memory method.
Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=2, window=10, dm=1, dm_mean=1, epochs=50, seed=25, vector_size=100, workers=1)
I've tried both dm and dbow. dm gives better results (similarity scores) than dbow. I understand the concepts of dm vs dbow, but I don't know which method is good for measuring similarity between two documents.
First question: Which method is the best to perform well on similarities?
model.wv.n_similarity(<words_1>, <words_2>) gives similarity score using word vectors.
model.docvecs.similarity_unseen_docs(model, doc1, doc2) gives similarity score using doc vectors where doc1 and doc2 are not tags/ or indexes of doctags. Each doc1 and doc2 contains 10-20 words kind of sentences.
Both wv.n_similarity and docvecs.similarity_unseen_docs provide different similarity scores on same types of documents.
docvecs.similarity_unseen_docs gives slightly better results than wv.n_similarity, but wv.n_similarity sometimes also gives good results.
Question: What is the difference between docvecs.similarity_unseen_docs and wv.n_similarity? Can I use docvecs.similarity_unseen_docs to find the similarity score between unseen data? (It might be a silly question.)
I ask because I think docvecs.similarity_unseen_docs computes the similarity score on tags, not on the actual words belonging to those tags. I'm not sure; please correct me if I'm wrong.
How can I convert cosine similarity score to probability?
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
model = Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=2, window=10, dm=1, dm_mean=1, epochs=50, seed=25, vector_size=100, workers=4)
# Tag each document with a unique ID
tagged_data = [TaggedDocument(words=_d, tags=[str(i)]) for i, _d in enumerate(<list_of_list_of_tokens>)]
# Build the vocabulary first, then train the model
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)
# Similarity score from averaged word vectors
model.wv.n_similarity(<doc_words1>, <doc_words2>)
# Similarity score from inferred doc vectors
model.docvecs.similarity_unseen_docs(model, <doc_words1>, <doc_words2>)
Both PV-DM mode (dm=1, the default) and PV-DBOW mode (dm=0) can work well. Which is better will depend on your data and goals. Once you have a robust way to quantitatively score the quality of a model's results, for your project goals – which you'll want to be able to tune all of the model's meta-parameters, including DM/DBOW mode – you can and should try both.
PV-DBOW trains fast, and often works very well on shortish-docs (a few dozens of words). Note, though, that this mode doesn't train usable word-vectors unless you also add the dbow_words=1 option, which will slow training.
Using model.wv.n_similarity() relies on word-vectors only. It averages each set of word-vectors, then reports the cosine-similarity between those two averages. (So, it will only be sensible in PV-DM mode, or PV-DBOW with dbow_words=1 activated.)
Using model.docvecs.similarity_unseen_docs() uses infer_vector() to treat each of the supplied docs as new texts, for which a true Doc2Vec doc-vector (not a mere average-of-word-vectors) is calculated. (This method operates on lists-of-words, not lists-of-tags.)
Which is better is something you should test for your goals. The average-of-word-vectors is a simpler, faster technique for making a text-vector – but still works ok for a lot of purposes. The inferred doc-vectors take longer to calculate, but with a good model, may be better for some tasks.
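The average-of-word-vectors calculation described above can be sketched without any library, as a rough illustration of the idea rather than Gensim's actual implementation. The names `mean_vector` and `averaged_similarity`, the toy vectors, and the choice to skip unknown words are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(words, word_vectors):
    """Average the vectors of the words that have one.
    Skipping unknown words is a choice made for this sketch."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        raise ValueError("none of the words has a vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def averaged_similarity(words_1, words_2, word_vectors):
    """Cosine similarity between the two averaged word-vector sets."""
    return cosine(mean_vector(words_1, word_vectors),
                  mean_vector(words_2, word_vectors))


# Hypothetical 2-dimensional word-vectors, for illustration only.
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.8, 0.2], "car": [0.0, 1.0]}
print(averaged_similarity(["cat", "dog"], ["dog"], toy_vectors))
print(averaged_similarity(["cat", "dog"], ["car"], toy_vectors))
```

An inferred doc-vector, by contrast, comes from running the trained model on the new text, which is why it takes longer to compute.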
Other notes on your setup:
often, setting min_count as low as 2 is a bad idea: those rare words don't have enough examples to mean much, and actually interfere with the quality of surrounding words
10k documents is on the smallish side for a training corpus, compared to published Doc2Vec results (which usually use tens-of-thousands to millions of documents).
published results often use 10-20 training epochs (though more, like your choice of 50, might be helpful especially for smaller corpuses)
on typical multi-core machines workers=1 will be much slower than the default (workers=3); on a machine with 8 or more cores, up to workers=8 is often a good idea. (Though, unless using the newer corpus_file input option, more workers up to the full count of 16, 32, etc cores doesn't help.)
classic Doc2Vec usage doesn't assign docs just known labels (as in your "10 string type of tags"), but unique IDs for each document. In some cases using, or adding, known labels as tags may help, but beware that if you're only supplying 10 tags, you've essentially turned your 10,000 documents into 10 documents (from the perspective of the model's view, which sees all texts with the same tag as if they were segments of one larger document with that tag). In plain PV-DBOW, training only 10 doc-vectors, of 100-dimensions each, from just 10 distinct examples wouldn't make much sense: it'd be prone to severe overfitting. (In PV-DM or PV-DBOW with dbow_words, the fact that the model is training both 10 doc-vectors and many hundreds/thousands of other vocabulary-word word-vectors would help offset the risk of overfitting.)

Natural Language Generation - how to go beyond templates

We've built a system that analyzes some data and outputs some results in plain English (i.e. no charts etc.). The current implementation relies on lots of templates and some randomization in order to give as much diversity to the text as possible.
We'd like to switch to something more advanced with the hope that the produced text is less repetitive and sounds less robotic. I've searched a lot on google but I cannot find something concrete to start from. Any ideas?
EDIT: The data fed to the NLG mechanism are in JSON format. Here is an example about web analytics data. The JSON file may contain, for example, a metric (e.g. visits), its value in the last X days, whether the last value is expected or not, and which dimensions (e.g. countries or marketing channels) affected its change.
The current implementation could give something like this:
Overall visits in the UK mainly from ABC email campaign reached 10K (+20% DoD) and were above the expected value by 10%. Users were mainly landing on XXX page while the increase was consistent across devices.
We're looking to find a way to depend less on templates, sound even more natural and increase the vocabulary.
What you are looking for is a hot research area and a pretty tough task. Currently there is no way to generate sentences that are 100% meaningful, diverse, and natural. One approach to generating sentences is to use n-grams. With this method you can generate sentences that look more natural and diverse, but they will probably be meaningless and grammatically incorrect.
A more up-to-date approach is to use deep learning. In any case, if you want to generate meaningful sentences, your best option may be your current template-based method.
You can find an introduction to basics of n-gram based NLG here:
Generating Random Text with Bigrams
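A minimal sketch of the bigram idea mentioned above, assuming a tiny made-up corpus; the function names and the sample sentence are hypothetical. It records which words follow each word in the training text, then walks those transitions at random.

```python
import random
from collections import defaultdict


def build_bigram_model(tokens):
    """Map each word to the list of words that follow it in the text.
    Repeated followers stay in the list, so frequent bigrams are
    sampled more often."""
    followers = defaultdict(list)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current].append(nxt)
    return followers


def generate(followers, start, length, rng):
    """Generate up to `length` words by repeatedly sampling a follower
    of the last word; stop early if the last word has no follower."""
    words = [start]
    while len(words) < length:
        options = followers.get(words[-1])
        if not options:
            break
        words.append(rng.choice(options))
    return words


corpus = ("visits in the uk increased while visits in the us decreased "
          "and visits from email increased").split()
model = build_bigram_model(corpus)
rng = random.Random(0)  # fixed seed so runs are repeatable
print(" ".join(generate(model, "visits", 8, rng)))
```

Every adjacent word pair in the output was seen in the training text, which is why the result looks locally fluent, but nothing ties the sentence as a whole to the underlying data.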
This tool seems to implement some of the most famous techniques for natural language generation: simplenlg
Have you tried Neural Networks especially LSTM and GRU architectures? These models are the most recent developments in predicting sequences of words. Generating natural language means to generate a sequence of words such that it makes sense with respect to the input and earlier words in the sequence. This is equivalent to predicting time series. LSTM is designed for predicting time series. Hence, it is commonly used to predict a sequence of words, given an input sequence, an input word, or any other input that can be embedded in a vector.
Deep learning libraries such as Tensorflow, Keras, and Torch all have sequence to sequence implementations that can be used for generating natural language by predicting a sequence of words given an input.
Note that usually these models need a huge amount of training data.
You need to meet two criteria in order to benefit from such models:
You should be able to represent your input as a vector.
You need a relatively large amount of input/target pairs.
