How to generate sentence embeddings using the Longformer model - python-3.x

I am using the Hugging Face mrm8488/longformer-base-4096-finetuned-squadv2 pre-trained model
https://huggingface.co/mrm8488/longformer-base-4096-finetuned-squadv2.
I want to generate sentence-level embeddings. I have a data frame which has a text column.
I am using this code:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(ckpt)
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this." # I will pas text-column here from my data-frame
#question = "What has Huggingface done ?"
encoding = tokenizer(question, text, return_tensors="pt")
# I don't want to use it for Question-Answer use-case. I just need the sentence embeddings
input_ids = encoding["input_ids"]
# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
attention_mask = encoding["attention_mask"]
How can I modify the above code to generate embeddings for these sentences?
I have the following examples:
Text
i've added notes to the claim and it's been escalated for final review
after submitting the request you'll receive an email confirming the open request.
hello my name is person and i'll be assisting you
this is sam and i'll be assisting you for date.
I'll return the amount as asap.
ill return it to you.

Longformer uses a local attention mechanism by default, and you need to pass a global attention mask to let at least one token attend to all tokens of your sequence.
import torch
from transformers import LongformerTokenizer, LongformerModel
ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = LongformerTokenizer.from_pretrained(ckpt)
model = LongformerModel.from_pretrained(ckpt)
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this." # I will pas text-column here from my data-frame
#question = "What has Huggingface done ?"
encoding = tokenizer(text, return_tensors="pt")
global_attention_mask = [1].extend([0]*encoding["input_ids"].shape[-1])
encoding["global_attention_mask"] = global_attention_mask
# I don't want to use it for Question-Answer use-case. I just need the sentence embeddings
# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
o = model(**encoding)
sentence_embedding = o.last_hidden_state[:,0]
You should keep in mind that mrm8488/longformer-base-4096-finetuned-squadv2 was not pre-trained to produce meaningful sentence embeddings and faces the same issues as MLM-pretrained BERT models regarding sentence embeddings.
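If the first-token embedding turns out to be too noisy for your sentences, a common alternative is mean pooling over the token embeddings. A minimal sketch, reusing o and encoding from the snippet above (whether it works better is something to validate on your own data):
token_embeddings = o.last_hidden_state                   # (1, seq_len, hidden)
mask = encoding["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1) mask of real tokens
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)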

Related

How do I make a paraphrase generation using BERT/ GPT-2

I am trying hard to understand how to do paraphrase generation using BERT/GPT-2, but I cannot work out how to do it. Could you please provide me with any resources that would help me build a paraphrase generation model?
"The input would be a sentence and the output would be a paraphrase of the sentence"
Here is my recipe for training a paraphraser:
Instead of BERT (encoder only) or GPT (decoder only) use a seq2seq model with both encoder and decoder, such as T5, BART, or Pegasus. I suggest using the multilingual T5 model that was pretrained for 101 languages. If you want to load embeddings for your own language (instead of using all 101), you can follow this recipe.
Find a corpus of paraphrases for your language and domain. For English, ParaNMT, PAWS, and QQP are good candidates. A corpus called Tapaco, extracted from Tatoeba, is a paraphrasing corpus that covers 73 languages, so it is a good starting point if you cannot find a paraphrase corpus for your language.
Fine-tune your model on this corpus. The code can be something like this:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# use here a backbone model of your choice, e.g. google/mt5-base
backbone_model = 'cointegrated/rut5-base-multitask'
model = T5ForConditionalGeneration.from_pretrained(backbone_model)
tokenizer = T5Tokenizer.from_pretrained(backbone_model)
model.cuda()
optimizer = torch.optim.Adam(params=[p for p in model.parameters() if p.requires_grad], lr=1e-5)

# todo: load the paraphrasing corpus and define the get_batch function
for i in range(100500):
    xx, yy = get_batch()  # a batch of source sentences and their paraphrases
    x = tokenizer(xx, return_tensors='pt', padding=True).to(model.device)
    y = tokenizer(yy, return_tensors='pt', padding=True).to(model.device)
    # do not force the model to predict pad tokens
    y.input_ids[y.input_ids == 0] = -100
    loss = model(
        input_ids=x.input_ids,
        attention_mask=x.attention_mask,
        labels=y.input_ids,
        decoder_attention_mask=y.attention_mask,
        return_dict=True
    ).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained('my_paraphraser')
tokenizer.save_pretrained('my_paraphraser')
A more complete version of this code can be found in this notebook.
After the training, the model can be used in the following way:
from transformers import pipeline
pipe = pipeline(task='text2text-generation', model='my_paraphraser')
print(pipe('Here is your text'))
# [{'generated_text': 'Here is the paraphrase or your text.'}]
If you want your paraphrases to be more diverse, you can control the generation process using arguments like:
print(pipe(
    'Here is your text',
    encoder_no_repeat_ngram_size=3,  # make output different from input
    do_sample=True,                  # randomize
    num_beams=5,                     # try more options
    max_length=128,                  # longer texts
))
Enjoy!
You can use T5 paraphrasing models for generating paraphrases.

Whitelist tokens for text generation (XLNet, GPT-2) in huggingface-transformers

In the documentation on text generation (https://huggingface.co/transformers/main_classes/model.html#generative-models) there is the option to put
bad_words_ids (List[int], optional) – List of token ids that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use tokenizer.encode(bad_word, add_prefix_space=True).
Is there also the option to put something along the lines of "allowed_words_ids"? The idea would be to restrict the language of the generated texts.
I'd also suggest doing what Sahar Mills said. You can do it in the following way.
You get the whole vocab of the model you are using, e.g.
from transformers import AutoTokenizer
# Load tokenizer
checkpoint = "CenIA/distillbert-base-spanish-uncased" #Example model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
vocab = tokenizer.get_vocab()
list(vocab.keys())[:100] # to see the first 100 words
Define the words you do want in the model.
words_to_delete = ['forzado', 'vendieron', 'verticales'] # i.e. the allowed words, which will be deleted from the bad-words list
Define a function to create the bad_words_ids, that is, the whole model vocab minus the words you want in the model:
def create_bad_words_ids(vocab, words_to_delete):
    # keep every vocab entry except the allowed words, as single-token id lists
    return [[token_id] for token, token_id in vocab.items() if token not in words_to_delete]

bad_words_ids = create_bad_words_ids(vocab=vocab, words_to_delete=words_to_delete)
print(bad_words_ids[:10])
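The resulting list is what you pass to generate(). A minimal usage sketch follows; gpt2 is used purely as an example of a generative checkpoint, and it assumes bad_words_ids was rebuilt from that model's own tokenizer so the ids line up.
from transformers import AutoModelForCausalLM, AutoTokenizer

gen_tokenizer = AutoTokenizer.from_pretrained("gpt2")   # example generative checkpoint
gen_model = AutoModelForCausalLM.from_pretrained("gpt2")
# assumption: bad_words_ids was rebuilt here with gen_tokenizer.get_vocab()
inputs = gen_tokenizer("The weather today is", return_tensors="pt")
output_ids = gen_model.generate(**inputs, max_length=30, bad_words_ids=bad_words_ids)
print(gen_tokenizer.decode(output_ids[0], skip_special_tokens=True))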
Hope it helps,
cheers

Should BERT embeddings be made on tokens or sentences?

I am making a sentence classification model and using BERT word embeddings in it. Due to the very large dataset, I combined all the sentences into one string and made embeddings on the tokens generated from it.
s = " ".join(text_list)
len(s)
Here s is the string and text_list contains the sentences on which I want to make my word embeddings.
I then tokenize the string
stokens = tokenizer.tokenize(s)
My question is, will BERT perform better if given whole sentences one at a time, or is making embeddings on the tokens of the whole string also fine?
Here is the code for my embedding generator
pool = []
all = []
i = 0
while i != 600000:
    chunk = stokens[i:i+500]  # take the next 500 tokens without overwriting stokens
    chunk = ["[CLS]"] + chunk + ["[SEP]"]
    input_ids = get_ids(chunk, tokenizer, max_seq_length)
    input_masks = get_masks(chunk, max_seq_length)
    input_segments = get_segments(chunk, max_seq_length)
    a, b = embedd(input_ids, input_masks, input_segments)
    pool.append(a)
    all.append(b)
    print(i)
    i += 500
What I am essentially doing here is: I have a string of length 600000, I take 500 tokens at a time, generate embeddings for them, and append them to a list called pool.
For classification, you don't have to concatenate the sentences. By concatenating, you are merging the sentences of different classes.
If you are fine-tuning BERT, by default a logistic regression layer is learnt on top of the [CLS] token for the classification task. Since it is an attention-based transformer model, it assumes that each token has seen the other tokens and has captured the context, so the [CLS] token is sufficient.
However, if you want to use the embeddings directly, you can learn a classifier on a single vector, i.e. the [CLS] token embedding or the averaged embeddings of all tokens. Or you can take the embedding of each token and feed the sequence to other classifiers such as a CNN or an RNN.
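For illustration, a minimal sketch with the transformers library (the checkpoint name and the two sentences are just examples; whether you take the [CLS] vector or the mean is up to you):
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
model = AutoModel.from_pretrained("bert-base-uncased")

def sentence_vector(sentence, use_cls=True):
    enc = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state          # (1, seq_len, hidden)
    if use_cls:
        return hidden[:, 0]                              # [CLS] embedding
    mask = enc["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)          # mean over real tokens

# embed each sentence separately, then train any classifier on the vectors
vectors = torch.cat([sentence_vector(s) for s in ["sentence one", "sentence two"]])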

Improving on the basic, existing GloVe model

I am using GloVe as part of my research. I've downloaded the models from here. I've been using GloVe for sentence classification. The sentences I'm classifying are specific to a particular domain, say some STEM subject. However, since the existing GloVe models are trained on a general corpus, they may not yield the best results for my particular task.
So my question is, how would I go about loading the pretrained model and retraining it a little more on my own corpus, so that it learns the semantics of my corpus as well? There would be merit in doing this if it were possible.
After a little digging, I found this issue on the git repo. Someone suggested the following:
Yeah, this is not going to work well due to the optimization setup. But what you can do is train GloVe vectors on your own corpus and then concatenate those with the pretrained GloVe vectors for use in your end application.
So that answers that.
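A rough sketch of that suggestion, assuming you already have the two sets of vectors as plain dicts mapping words to NumPy arrays (the dimensions here are made up for illustration):
import numpy as np

def concat_vectors(pretrained, domain, dim_pre=100, dim_dom=50):
    # concatenate pretrained GloVe vectors with vectors trained on your own corpus
    combined = {}
    for word in set(pretrained) | set(domain):
        left = pretrained.get(word, np.zeros(dim_pre))   # zeros if the word is missing on one side
        right = domain.get(word, np.zeros(dim_dom))
        combined[word] = np.concatenate([left, right])
    return combined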
I believe GloVe (Global Vectors) is not meant to be extended this way, since it is based on the overall word co-occurrence statistics of a single corpus known only at initial training time.
What you can do is use the gensim.scripts.glove2word2vec API to convert GloVe vectors into the word2vec format, but I don't think you can continue training, since they are loaded into a KeyedVectors object, not a full model.
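For what it's worth, the conversion itself looks roughly like this (the file names are just examples):
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove2word2vec("glove.6B.100d.txt", "glove.6B.100d.w2v.txt")  # rewrite with a word2vec header
vectors = KeyedVectors.load_word2vec_format("glove.6B.100d.w2v.txt")
print(vectors.most_similar("physics", topn=5))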
The Mittens library (installable via pip) does that, provided your corpus/vocab is not too huge or your RAM is big enough to handle the entire co-occurrence matrix.
Three steps:
import csv
import pickle
import numpy as np
from collections import Counter
from nltk.corpus import brown
from mittens import GloVe, Mittens
from sklearn.feature_extraction import stop_words
from sklearn.feature_extraction.text import CountVectorizer
1- Load pretrained model - Mittens needs a pretrained model to be loaded as a dictionary. Get the pretrained model from https://nlp.stanford.edu/projects/glove
with open("glove.6B.100d.txt", encoding='utf-8') as f:
reader = csv.reader(f, delimiter=' ',quoting=csv.QUOTE_NONE)
embed = {line[0]: np.array(list(map(float, line[1:])))
for line in reader}
Data pre-processing
sw = list(stop_words.ENGLISH_STOP_WORDS)
brown_data = brown.words()[:200000]
brown_nonstop = [token.lower() for token in brown_data if token.lower() not in sw]
new_vocab = [token for token in brown_nonstop if token not in pre_glove.keys()]
The Brown corpus is used as a sample dataset here, and new_vocab represents the vocabulary not present in the pretrained GloVe. The co-occurrence matrix is built from new_vocab. It is a sparse matrix, requiring O(n^2) space. You can optionally filter out rare new_vocab words to save space.
new_vocab_rare = [k for (k, v) in Counter(new_vocab).items() if v <= 1]
corp_vocab = list(set(new_vocab) - set(new_vocab_rare))
Remove those rare words and prepare the dataset:
brown_tokens = [token for token in brown_nonstop if token not in new_vocab_rare]
brown_doc = [' '.join(brown_tokens)]
2- Building co-occurrence matrix:
sklearn's CountVectorizer transforms the document into a word-document matrix.
The matrix multiplication X.T * X then gives the word-word co-occurrence matrix.
cv = CountVectorizer(ngram_range=(1,1), vocabulary=corp_vocab)
X = cv.fit_transform(brown_doc)
Xc = (X.T * X)
Xc.setdiag(0)
coocc_ar = Xc.toarray()
3- Fine-tuning the mittens model - Instantiate the model and run the fit function.
mittens_model = Mittens(n=100, max_iter=1000)  # n must match the dimension of the pretrained vectors (100d here)
new_embeddings = mittens_model.fit(
    coocc_ar,
    vocab=corp_vocab,
    initial_embedding_dict=pre_glove)
Save the model as pickle for future use.
newglove = dict(zip(corp_vocab, new_embeddings))
f = open("repo_glove.pkl","wb")
pickle.dump(newglove, f)
f.close()
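Later, the saved vectors can be reloaded and queried like a plain dict ("physics" below is just an example word):
import pickle

with open("repo_glove.pkl", "rb") as f:
    repo_glove = pickle.load(f)

vector = repo_glove.get("physics")  # None if the word was not in corp_vocab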

Natural Language Processing Model

I'm a beginner in NLP and am making a project to parse and understand the intentions of input lines typed by a user in English.
Here is what I think I should do:
Create a text of sentences with POS tagging & marked intentions for every sentence by hand.
Create a model, say a decision tree, and train it on the above sentences.
Try the model on user input:
do basic tokenizing and POS tagging on the user's input sentence and test it on the above model to determine the intention of the sentence.
It all may be completely wrong or silly but I'm determined to learn how to do it. I don't want to use ready-made solutions and the programming language is not a concern.
How would you guys do this task? Which model would you choose and why? What steps are normally taken to build NLP parsers?
Thanks
I would use NLTK.
There is an online book with a chapter on tagging, and a chapter on parsing. They also provide models in python.
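To give an idea of the tagging step, a minimal NLTK sketch (the example sentence is made up):
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("Book me a flight to London tomorrow")
print(nltk.pos_tag(tokens))
# e.g. [('Book', 'VB'), ('me', 'PRP'), ('a', 'DT'), ('flight', 'NN'), ...]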
Here is a simple example based on NLTK and Bayes
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = [w.lower() for w in movie_reviews.words()]
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

print(find_features(movie_reviews.words("neg/cv000_29416.txt")))

featuresets = [(find_features(rev), category) for (rev, category) in documents]
training_set = featuresets[:1900]
testing_set = featuresets[1900:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Algo Accuracy: ", nltk.classify.accuracy(classifier, testing_set) * 100)
