Getting an embedded output from huggingface transformers - nlp

To compare different paragraphs, I am trying to use a transformer model: I feed each paragraph to the model and then compare the outputs to see which paragraphs are most similar.
For this purpose, I am using the roberta-base model. I first run the RoBERTa tokenizer on a paragraph and then pass the tokenized output to the RoBERTa model. But the process fails due to lack of memory: even 25 GB of RAM is not enough to process a paragraph of 1324 lines.
Any idea how I can make this better, or any suggestions about what mistakes I might be making?
from transformers import RobertaTokenizer, RobertaModel
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base").to(device)
inputs = tokenizer(dict_anrika['Anrika'], return_tensors="pt", truncation=True,
                   padding=True).to(device)
outputs = model(**inputs)

It sounds like you gave the model an input of shape [1324, longest_length_in_batch], which is huge. I tried a [1000, 512] input and found that even a server with 200 GB of RAM hits OOM.
One solution is to break the huge input into smaller batches, for example 10 lines at a time, as in the sketch below.
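A minimal sketch of that idea (assuming dict_anrika['Anrika'] is a list of strings, as in your code; the mean-pooling of each line's token vectors into a single embedding is just an illustration):

import torch
from transformers import RobertaTokenizer, RobertaModel

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base").to(device)
model.eval()

lines = dict_anrika['Anrika']   # assumed to be a list of strings
batch_size = 10
embeddings = []

with torch.no_grad():   # inference only, so skip gradient bookkeeping and save memory
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True,
                           padding=True).to(device)
        outputs = model(**inputs)
        # one vector per line: average the token embeddings
        # (for a stricter average you could mask out padding via inputs['attention_mask'])
        embeddings.append(outputs.last_hidden_state.mean(dim=1).cpu())

embeddings = torch.cat(embeddings)   # shape: [number_of_lines, hidden_size]

Because only the pooled vectors are kept (and moved back to the CPU), memory usage stays roughly constant no matter how many lines the paragraph has.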

Related

How to use SciBERT in the best manner?

I'm trying to use BERT models to do text classification. As the texts are scientific, I intend to use the SciBERT pre-trained model: https://github.com/allenai/scibert
I have run into several limitations and want to know whether there are solutions for them:
When I do tokenization and batching, it only allows me to use a max_length of <= 512. Is there any way to use more tokens? Doesn't this limit of 512 mean that I am not actually using all the text information during training? Is there any way to use all the text?
I have tried to use this pretrained model with other architectures such as DeBERTa or RoBERTa, but it doesn't let me; it has only worked with BERT. Is there any way I can do that?
I know this is a general question, but are there any suggestions for improving my fine-tuning (from data to hyperparameters, etc.)? Currently, I'm getting ~75% accuracy. Thanks.
Code:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from transformers import (BertTokenizer, BertForSequenceClassification,
                          AdamW, get_linear_schedule_with_warmup)

tokenizer = BertTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
encoded_data_train = tokenizer.batch_encode_plus(
    df_train.text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding=True,
    truncation=True,          # needed so sequences longer than max_length are cut
    max_length=256,
    return_tensors='pt'       # TensorDataset below expects tensors, not lists
)
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df_train.label.values)

dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataloader_train = DataLoader(dataset_train,
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)

model = BertForSequenceClassification.from_pretrained('allenai/scibert_scivocab_uncased',
                                                      num_labels=len(labels),
                                                      output_attentions=False,
                                                      output_hidden_states=False)
epochs = 1
optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,  # required argument
                                            num_training_steps=len(dataloader_train) * epochs)
When I do tokenization and batching, it only allows me to use a max_length of <= 512. Is there any way to use more tokens? Doesn't this limit of 512 mean that I am not actually using all the text information during training? Is there any way to use all the text?
Yes, you are not using the complete text. This is one of the limitations of BERT and T5 models, which are limited to 512 and 1024 tokens respectively, to the best of my knowledge.
I can suggest using Longformer, BigBird or Reformer models, which can handle much longer sequences: Longformer and BigBird accept up to 4096 tokens, and Reformer has been run on sequences of tens of thousands of tokens. These are well suited to processing longer texts such as scientific documents.
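For illustration, a minimal sketch of loading Longformer for classification (the checkpoint is the public allenai/longformer-base-4096 model; num_labels=2 and the input string are placeholders):

from transformers import LongformerTokenizer, LongformerForSequenceClassification

# Longformer accepts up to 4096 tokens instead of BERT's 512
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2)

inputs = tokenizer("A very long scientific document ...",
                   truncation=True, max_length=4096, return_tensors="pt")
outputs = model(**inputs)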
I have tried to use this pretrained model with other architectures such as DeBERTa or RoBERTa, but it doesn't let me; it has only worked with BERT. Is there any way I can do that?
SciBERT is actually a pre-trained BERT model.
See this issue for more details, where they discuss the feasibility of converting BERT to RoBERTa:
Since you're working with a BERT model that was pre-trained, you unfortunately won't be able to change the tokenizer now from a WordPiece (BERT) to a Byte-level BPE (RoBERTa).
I know this is a general question, but are there any suggestions for improving my fine-tuning (from data to hyperparameters, etc.)? Currently, I'm getting ~75% accuracy.
I would first try to tune the most important hyperparameter, learning_rate. I would then explore different values for the hyperparameters of the AdamW optimizer and the num_warmup_steps hyperparameter of the scheduler, for example along the lines of the sketch below.
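A rough sketch of that search (reusing dataloader_train, labels and epochs from the code in the question; fine_tune_and_evaluate is a hypothetical placeholder standing in for your existing training and validation loop):

from transformers import (BertForSequenceClassification, AdamW,
                          get_linear_schedule_with_warmup)

def fine_tune_and_evaluate(model, optimizer, scheduler, dataloader_train):
    """Hypothetical placeholder: run your fine-tuning loop here and
    return the validation accuracy."""
    return 0.0

results = {}
for lr in [5e-5, 3e-5, 2e-5, 1e-5]:      # a common starting grid for BERT-style models
    for warmup in [0, 100]:
        model = BertForSequenceClassification.from_pretrained(
            'allenai/scibert_scivocab_uncased', num_labels=len(labels))
        optimizer = AdamW(model.parameters(), lr=lr, eps=1e-8)
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=warmup,
            num_training_steps=len(dataloader_train) * epochs)
        results[(lr, warmup)] = fine_tune_and_evaluate(
            model, optimizer, scheduler, dataloader_train)

best_lr, best_warmup = max(results, key=results.get)
print(best_lr, best_warmup)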

How to deal with large amount of sentences with gensim word2vec?

I have a very large number of sentences, and the problem is that I cannot load them all into memory at once; especially when I tokenize the sentences and split them into lists of words, my RAM fills up really fast.
But I couldn't find any example of how to train gensim word2vec in batches, meaning that in each epoch I guess I have to somehow load batches of data from disk, tokenize them, feed them to the model, then unload them and load the next batch.
How can I overcome this problem and train a word2vec model when I don't have enough RAM to load all the sentences (not even 20% of them)?
My sentences are stored in a text file, with each line representing one sentence.
You can define your own corpus as suggested in the docs; the size of the corpus basically doesn't matter in this case, because it is streamed from disk rather than held in memory:
from gensim.test.utils import datapath
from gensim import utils

class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        for line in open(corpus_path):
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)
Then train it as follows:
import gensim.models
sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)
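Since your sentences are already in a text file with one sentence per line, the same pattern can stream straight from that file (a minimal sketch; 'sentences.txt' is a placeholder for your actual path):

from gensim import utils
import gensim.models

class FileCorpus(object):
    """Streams one tokenized sentence at a time, so the file never sits in RAM."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield utils.simple_preprocess(line)

sentences = FileCorpus('sentences.txt')   # placeholder path
model = gensim.models.Word2Vec(sentences=sentences)

Word2Vec iterates over the corpus several times (once to build the vocabulary, then once per training epoch), which is why a restartable iterable like this is needed rather than a one-shot generator; gensim also ships gensim.models.word2vec.LineSentence, which does essentially the same thing for one-sentence-per-line files.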

How to improve code to speed up word embedding with transformer models?

I need to compute word embeddings for a bunch of documents with different language models.
No problem with that, the script works fine, except that I'm working on a notebook without a GPU, and each text needs around 1.5 s to be processed, which is far too long (I have thousands of texts to process).
Here is how I'm doing it with PyTorch and the transformers library:
import torch
from transformers import CamembertModel, CamembertTokenizer

docs = [text1, text2, ..., text20000]
tok = CamembertTokenizer.from_pretrained('camembert-base')
model = CamembertModel.from_pretrained('camembert-base', output_hidden_states=True)

# let's try with a batch size of 64 documents
docids = [tok.encode(doc, max_length=512, return_tensors='pt',
                     pad_to_max_length=True) for doc in docs[:64]]
ids = torch.cat(tuple(docids))

device = 'cuda' if torch.cuda.is_available() else 'cpu'  # cpu in my case...
model = model.to(device)
ids = ids.to(device)

model.eval()
with torch.no_grad():
    out = model(input_ids=ids)
# 103s later...
Does anyone have any ideas or suggestions to improve the speed?
I don't think there is a trivial way to significantly improve the speed without using a GPU.
One of the approaches I can think of is smart batching, which is used by Sentence-Transformers: you sort inputs by length so that texts of similar length end up in the same batch, which avoids padding everything to the full 512-token limit. I'm not sure how much of a speedup this will get you, but it's probably the only way to improve things significantly in a short period of time.
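A rough sketch of the idea, reusing docs, tok, model and device from your code and assuming a reasonably recent transformers version (taking the first token's hidden state as the document vector is just an illustration):

import torch

model.eval()
# sort document indices by tokenized length so similar-length texts share a batch
order = sorted(range(len(docs)),
               key=lambda i: len(tok.encode(docs[i], truncation=True, max_length=512)))

batch_size = 8
embeddings = [None] * len(docs)
with torch.no_grad():
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        enc = tok([docs[i] for i in idx], padding=True, truncation=True,
                  max_length=512, return_tensors='pt').to(device)
        out = model(**enc)
        vecs = out.last_hidden_state[:, 0, :].cpu()   # one vector per document
        for j, i in enumerate(idx):
            embeddings[i] = vecs[j]

Each batch is now only padded to its own longest document rather than to 512 tokens, which is where the time saving comes from.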
Otherwise, if you have access to Google Colab, you can also use their GPU environment, provided the processing can be completed in a reasonable time.

Gensim's Word2Vec not training provided documents

I'm facing a Gensim training problem using Word2Vec: model.wv.vocab is not picking up any further words from the trained corpus; the only words in it are the ones from the initialization call.
In fact, after trying many times with my own code, even the official site's example didn't work.
I tried saving the model at many points in my code, and I even tried saving and reloading the corpus alongside the train instruction.
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
path = get_tmpfile("word2vec.model")
model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
print(len(model.wv.vocab))
model.train([["hello", "world"]], total_examples=1, epochs=1)
model.save("word2vec.model")
print(len(model.wv.vocab))
The first print statement gives 12, which is right.
The second also gives 12, when it's supposed to give 14 (the original vocabulary plus 'hello' and 'world').
Additional calls to train() don't expand the known vocabulary. So, there is no way that the value of len(model.wv.vocab) will change after another call to train(). (Either 'hello' and 'world' are already known to the model, in which case they were in the original count of 12, or they weren't known, in which case they were ignored.)
The vocabulary is only established during a specific build_vocab() phase, which happens automatically if, as your code shows, you supplied a training corpus (common_texts) in model instantiation.
You can use a call to build_vocab() with the optional added parameter update=True to incrementally update a model's vocabulary, but this is best considered an advanced/experimental technique that introduces added complexities. (Whether such vocab-expansion, and then followup incremental training, actually helps or hurts will depend on getting a lot of other murky choices about alpha, epochs, corpus-sizing, training modes, and corpus-contents correct.)
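If you do want to go the vocab-expansion route despite those caveats, a minimal sketch (continuing from the code in the question, which uses the pre-4.0 gensim API) looks like:

new_sentences = [["hello", "world"]]

# explicitly add the new corpus's words to the existing vocabulary, then train on it
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences), epochs=1)

print(len(model.wv.vocab))   # now reports 14: the original 12 plus 'hello' and 'world'

Note that in gensim 4.x the size parameter becomes vector_size and model.wv.vocab is replaced by model.wv.key_to_index.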

How do I know if my tensorflow structure is good for my problem?

There are two sets of very similar code below, with a very simple input as an illustrative example for my question. I think an explanation of the following observation could answer my question. Thanks!
When I run the following code, the model can be trained quickly and can predict good results.
import tensorflow as tf
import numpy as np
from tensorflow import keras
model = tf.keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(xs, ys, epochs=1000)
print(model.predict([7.0]))
However, when I run the following code, which is very similar to the one above, the model trains very slowly, may not be well trained, and gives bad predictions (i.e. the loss easily drops below 1 with the code above but stays at around 20000 with the code below).
model = keras.Sequential()
model.add(keras.layers.Dense(2, activation='relu', input_shape=(1,)))
model.add(keras.layers.Dense(1))

# model.compile(optimizer=tf.train.AdamOptimizer(0.1), loss='mean_squared_error')
model.compile(optimizer=tf.train.AdamOptimizer(1), loss='mean_squared_error')

xs = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
ys = np.array([100, 150, 200, 250, 300, 350, 400, 450, 500, 550], dtype=float)

model.fit(xs, ys, epochs=1000)
print(model.predict([7.0]))
One more note: when I train the model with the second set of code, it is only occasionally well trained (roughly 8 out of 10 times it is not well trained, and the loss remains >10000 after 1000 epochs).
I don't think there is any direct way to choose the best deep architecture other than running multiple experiments, varying the hyperparameters and changing the architecture. Compare the performance of each experiment and choose the best one. There are a few articles listed below which may be helpful for you.
link-1, link-2, link-3
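As a concrete starting point for such experiments, here is a minimal sketch that retrains the second model from the question with a few different Adam learning rates (the values are only illustrative) and compares the final losses:

import numpy as np
import tensorflow as tf
from tensorflow import keras

xs = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
ys = np.array([100, 150, 200, 250, 300, 350, 400, 450, 500, 550], dtype=float)

for lr in [1.0, 0.1, 0.01]:
    model = keras.Sequential([
        keras.layers.Dense(2, activation='relu', input_shape=(1,)),
        keras.layers.Dense(1),
    ])
    # same optimizer as in the question, only the learning rate changes
    model.compile(optimizer=tf.train.AdamOptimizer(lr), loss='mean_squared_error')
    history = model.fit(xs, ys, epochs=1000, verbose=0)
    print(lr, history.history['loss'][-1])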
