I need to get the last layer of embeddings from a BERT model using HuggingFace. The following code works, but it is extremely slow. How can I speed it up?
This is a toy example; my real data consists of thousands of examples with long texts.
import transformers
import pandas as pd
from transformers import BertModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
def getPrediction(text):
    encoded_input = tokenizer(text, return_tensors='pt')
    outputs = model(**encoded_input)
    embedding = outputs[0][:, -1]
    embedding_list = embedding.cpu().detach().tolist()[0]
    return embedding_list
df = pd.DataFrame({'text':['First text', 'Second text']})
results = pd.DataFrame(df.apply(lambda x: getPrediction(x.text), axis=1))
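One common speedup, sketched below and not a guaranteed fix: tokenize the texts in padded batches, disable gradient tracking, and run on a GPU when one is available. The batch size of 32 is an arbitrary placeholder.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

texts = df["text"].tolist()
all_embeddings = []
with torch.no_grad():
    for i in range(0, len(texts), 32):
        batch = tokenizer(texts[i:i + 32], padding=True, truncation=True,
                          return_tensors="pt").to(device)
        out = model(**batch)
        # same slice as getPrediction: last token of the final hidden layer
        all_embeddings.extend(out[0][:, -1].cpu().tolist())

results = pd.DataFrame(all_embeddings)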
This is a multi-class classification problem with sklearn: I am using a OneVsOneClassifier to train and predict 150 intents.
Data:
text intents
text1 int1
text2 int2
I convert these intents into labels using:
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)  # reuse the encoder fitted on the training labels
Expectation:
Without changing the training pipeline or its parameters, reduce the inference time. It is currently slow, roughly 1 second per prediction, so the plan is to convert the pipeline to ONNX format and then use that for inference on single examples.
Code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC,LinearSVC
def create_pipe(clf):
# Each pipeline uses the same column transformer.
column_trans = ColumnTransformer(
[('Text', TfidfVectorizer(), 'text')
],
remainder='drop')
pipeline = Pipeline([('prep',column_trans),
('clf', clf)])
return pipeline
def fit_and_print(pipeline):
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(metrics.classification_report(y_test, y_pred,
target_names=le.classes_,
digits=3))
clf = OneVsOneClassifier(LinearSVC(random_state=42, class_weight='balanced'))
pipeline = create_pipe(clf)
%time fit_and_print(pipeline)
# convert input to df
def create_test_data(x):
    d = {'text': x}
    df = pd.DataFrame(d, index=[0])
    return df

revs = []
for idx in [948, 5717, 458]:
    cur = test.loc[idx, 'text']
    revs.append(cur)
print(revs)

revs = sam['text'].values

%%time
for rev in revs:
    c_res = pipeline.predict(create_test_data(rev))
    print(rev, '=', labels[c_res[0]])
ONNX conversion code
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType, StringTensorType
initial_type = [('UTTERANCE', StringTensorType([None, 2]))]
model_onnx = convert_sklearn(pipeline, initial_types=initial_type)
Error
MissingShapeCalculator: Unable to find a shape calculator for type '<class 'sklearn.multiclass.OneVsOneClassifier'>'.
It usually means the pipeline being converted contains a
transformer or a predictor with no corresponding converter
implemented in sklearn-onnx. If the converted is implemented
in another library, you need to register
the converted so that it can be used by sklearn-onnx (function
update_registered_converter). If the model is not yet covered
by sklearn-onnx, you may raise an issue to
https://github.com/onnx/sklearn-onnx/issues
to get the converter implemented or even contribute to the
project. If the model is a custom model, a new converter must
be implemented. Examples can be found in the gallery.
How can I resolve this? And how do I run predictions after converting to ONNX format?
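Not a verified fix, but the error means the installed sklearn-onnx has no converter registered for OneVsOneClassifier; a commonly suggested workaround is to switch to an estimator that is covered, such as OneVsRestClassifier, or to check whether a newer skl2onnx release adds the converter. For the second part of the question, once convert_sklearn succeeds, inference typically goes through onnxruntime. A sketch of that part, assuming model_onnx was produced as above:
# Sketch only: assumes the conversion above succeeded and that the model's
# single input is the string column declared in initial_type.
import numpy as np
import onnxruntime as rt

sess = rt.InferenceSession(model_onnx.SerializeToString())
input_name = sess.get_inputs()[0].name

# inputs are passed as a 2-D numpy array of strings; here, one example with one text column
sample = np.array([["I want to book a flight"]])
onnx_outputs = sess.run(None, {input_name: sample})

predicted_ids = onnx_outputs[0]             # first output holds the predicted class ids
print(le.inverse_transform(predicted_ids))  # map back to the original intent names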
I'm trying to use Hugging Face's bert-base-uncased model to train an emoji-prediction model on tweets, and it seems that after the first epoch the model immediately starts to overfit. I have tried the following:
Increasing the training data (I increased this from 1x to 10x with no effect)
Changing the learning rate (no differences there)
Using different models from hugging face (the results were the same again)
Changing the batch size (went from 32, 72, 128, 256, 512, 1024)
Creating a model from scratch, but I ran into issues and decided to post here first to see if I was missing anything obvious.
At this point, I'm concerned that the individual tweets don't give enough information for the model to make a good guess, but wouldn't it be random in that case, rather than overfitting?
Also, training time seems to be ~4.5 hours on Colab's free GPUs, is there any way to speed that up? I tried their TPU, but it doesn't seem to be recognized.
This is what the data looks like
And this is my code below:
import pandas as pd
import json
import re
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.model_selection import train_test_split
import torch
from transformers import TrainingArguments, Trainer
from transformers import EarlyStoppingCallback
from sklearn.metrics import accuracy_score,precision_score, recall_score, f1_score
import numpy as np
# opening up the data and removing all symbols
df = pd.read_json('/content/drive/MyDrive/computed_results.json.bz2')
df['text_no_emoji'] = df['text_no_emoji'].apply(lambda text: re.sub(r'[^\w\s]', '', text))
# loading the tokenizer and the model from huggingface
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5).to('cuda')
# test train split
train, test = train_test_split(df[['text_no_emoji', 'emoji_codes']].sample(frac=1), test_size=0.2)
# defining a dataset class that generates the encodings and labels on the fly to minimize memory usage
class Dataset(torch.utils.data.Dataset):
    def __init__(self, input, labels=None):
        self.input = input
        self.labels = labels

    def __getitem__(self, pos):
        encoded = tokenizer(self.input[pos], truncation=True, max_length=15, padding='max_length')
        label = self.labels[pos]
        ret = {key: torch.tensor(val) for key, val in encoded.items()}
        ret['labels'] = torch.tensor(label)
        return ret

    def __len__(self):
        return len(self.labels)

# training and validation datasets are defined here
train_dataset = Dataset(train['text_no_emoji'].tolist(), train['emoji_codes'].tolist())
val_dataset = Dataset(test['text_no_emoji'].tolist(), test['emoji_codes'].tolist())  # validation texts and labels both come from the test split

# defining the training arguments
args = TrainingArguments(
    output_dir="output",
    evaluation_strategy="epoch",
    logging_steps=10,
    per_device_train_batch_size=1024,
    per_device_eval_batch_size=1024,
    num_train_epochs=5,
    save_steps=3000,
    seed=0,
    load_best_model_at_end=True,
    weight_decay=0.2,
)

# defining the model trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)
# Training the model
trainer.train()
Results: After this, the training generally stops pretty fast due to the early stopper
The dataset can be found here (39 Mb compressed)
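One small observation on the code above: the sklearn metrics are imported but never wired into the Trainer, so the only evaluation signal is the loss. A minimal compute_metrics hook (a sketch, assuming single-label classification over the five classes) would make per-epoch accuracy visible and the train/validation gap easier to quantify:
# Sketch: pass this to Trainer(..., compute_metrics=compute_metrics) to log
# accuracy and macro F1 at every evaluation, alongside the loss.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    labels = eval_pred.label_ids
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro"),
    }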
I am trying to embed documents consisting of a few sentences each using Hugging Face transformer models. I have a single node with multiple GPUs and I want the embedding to run in parallel, distributed across all 8 GPUs. I tried PyTorch's DistributedDataParallel, but it looks like every sentence is sent to every GPU, and for all sentences it returns a single tensor. This is a sample of the code:
import torch
import torch.nn as nn
import torch.nn.functional as F
import time
import argparse
import os
from transformers import AlbertTokenizer, AlbertModel
import numpy
from tqdm import tqdm
from torch.utils.data import DataLoader, TensorDataset

def parse_args():
    parse = argparse.ArgumentParser()
    parse.add_argument(
        '--local_rank',
        dest='local_rank',
        type=int,
        default=0,
    )
    parse.add_argument("--gpu", type=str, default='None',
                       help="choose gpu device.")
    return parse.parse_args()

def train():
    args = parse_args()
    if not args.gpu == 'None':
        device = torch.device("cuda")
        os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu
    else:
        device = torch.device("cpu")

    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(
        backend='nccl',
        init_method='env://',
    )

    tokenizer = AlbertTokenizer.from_pretrained('albert-xxlarge-v2')
    sentences = ['I love tea',
                 'He hates tea',
                 'We love tea',
                 'python coder',
                 'geeksforgeeks',
                 'coder in geeksforgeeks']

    sentence_tokens = []
    for sent in sentences:
        token_id = tokenizer.encode(sent, max_length=128, add_special_tokens=True, pad_to_max_length=True)
        sentence_tokens.append(token_id)
    original_sentences = torch.tensor(sentence_tokens)
    train_dataset = TensorDataset(original_sentences)

    # setup training sampler
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, num_replicas=len(sentences))
    # setup training data loader with the train sampler setup
    train_dataloader = DataLoader(train_dataset, batch_size=16, sampler=train_sampler, shuffle=False)

    model = AlbertModel.from_pretrained('albert-xxlarge-v2', return_dict=True)
    model = model.to(device)
    model = nn.parallel.DistributedDataParallel(model,
                                                device_ids=[args.local_rank],
                                                output_device=args.local_rank,
                                                find_unused_parameters=True)

    for batch in train_dataloader:
        batch_input_tensors = batch[0].to('cuda')
        outputs = model(batch_input_tensors)
        last_hidden_states = outputs.last_hidden_state
        average = torch.mean(last_hidden_states, dim=1)

if __name__ == "__main__":
    train()
All of the sentences are sent to all 8 GPUs, and the output last_hidden_states is a single tensor. I averaged the tensor elements because I thought the results should be the same on every GPU at the end, but they are not.
How can I do this in a distributed way, so that the sentences are split across the GPUs and embedded there, and so that I end up with one feature-vector tensor per sentence (or, in my real case, per document)?
Thanks.
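For reference, here is a minimal sketch (not a drop-in fix) of how the sharding is often done: each process keeps the DistributedSampler's default rank assignment so it only sees its own slice of the sentences, no DDP wrapper is needed for inference-only embedding, and the per-rank embeddings are collected at the end with all_gather. It assumes the script is launched with torchrun so that RANK, WORLD_SIZE and LOCAL_RANK are set.
# Sketch: shard sentences across ranks, embed locally, gather everything back.
# Assumes launch via `torchrun --nproc_per_node=8 embed.py`.
import os
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from transformers import AlbertTokenizer, AlbertModel

def embed(sentences):
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    tokenizer = AlbertTokenizer.from_pretrained("albert-xxlarge-v2")
    model = AlbertModel.from_pretrained("albert-xxlarge-v2").to(local_rank).eval()

    enc = tokenizer(sentences, max_length=128, padding="max_length",
                    truncation=True, return_tensors="pt")
    dataset = TensorDataset(enc["input_ids"], enc["attention_mask"])
    # num_replicas/rank default to the process group's world size and rank,
    # so every GPU sees a disjoint shard of the sentences
    sampler = DistributedSampler(dataset, shuffle=False)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)

    local_embs = []
    with torch.no_grad():
        for input_ids, attention_mask in loader:
            out = model(input_ids=input_ids.to(local_rank),
                        attention_mask=attention_mask.to(local_rank))
            # mean-pool the last hidden state into one vector per sentence
            local_embs.append(out.last_hidden_state.mean(dim=1))
    local_embs = torch.cat(local_embs)

    # every rank ends up with all embeddings; note the order is
    # rank-interleaved, not the original sentence order
    gathered = [torch.zeros_like(local_embs) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_embs)
    return torch.cat(gathered)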
I want to use one of the models for sequence classification provided by Hugging Face. It seems they provide a function called
glue_convert_examples_to_features()
for preparing the data so that it can be fed into the models.
However, this conversion function only applies to the GLUE datasets, and I can't find an easy way to apply it to my custom data. Am I overlooking a prebuilt function like the one above? What would be an easy way to convert my custom data, with one sequence and two labels, into the format the model expects?
Huggingface added a fine-tuning with custom datasets guide that contains a lot of useful information. Using the IMDB sequence classification section of that guide, I was able to adapt a notebook that had used a GLUE dataset to work with my own pandas dataframe.
from transformers import (
    AutoConfig,
    AutoTokenizer,
    TFAutoModelForSequenceClassification,
    AdamW
)
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

df = pd.read_pickle('data.pkl')
train_texts = df.text.values    # an array of strings
train_labels = df.label.values  # an array of integers

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

train_encodings = tokenizer(train_texts.tolist(), truncation=True, max_length=96, padding=True)
val_encodings = tokenizer(val_texts.tolist(), truncation=True, max_length=96, padding=True)

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))

num_labels = 3
num_train_examples = len(train_dataset)
num_dev_examples = len(val_dataset)

# hyperparameters (defined before they are used below)
learning_rate = 2e-5
train_batch_size = 8
eval_batch_size = 8
num_epochs = 1

train_dataset = train_dataset.shuffle(100).batch(train_batch_size)
val_dataset = val_dataset.shuffle(100).batch(eval_batch_size)

train_steps_per_epoch = int(num_train_examples / train_batch_size)
dev_steps_per_epoch = int(num_dev_examples / eval_batch_size)

config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, config=config)

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

history = model.fit(train_dataset,
                    epochs=num_epochs,
                    steps_per_epoch=train_steps_per_epoch,
                    validation_data=val_dataset,
                    validation_steps=dev_steps_per_epoch)
Notebook credits: digitalepidemiologylab covid-twitter-bert colab
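In case it is useful, a small usage sketch (not from the original notebook) for predicting with the fine-tuned model, reusing the tokenizer, model, and max_length from the code above:
# Sketch: classify new texts with the fine-tuned model.
new_texts = ["an example sentence to classify"]
inputs = tokenizer(new_texts, truncation=True, max_length=96, padding=True,
                   return_tensors="tf")
logits = model(dict(inputs)).logits
predicted_classes = tf.argmax(logits, axis=-1).numpy()
print(predicted_classes)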
Below is some code for a classifier. I used pickle to save and load the classifier as instructed on this page. However, when I load it to use it later, I cannot use the CountVectorizer() and TfidfTransformer() to convert raw text into the vectors the classifier expects.
The only way I was able to get it to work was to analyze the text immediately after training the classifier, as seen below.
import os
import sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import pandas
import pickle

class Classifier:
    def __init__(self):
        self.moviedir = os.getcwd() + '/txt_sentoken'

    def Training(self):
        # loading all files
        self.movie = load_files(self.moviedir, shuffle=True)

        # split data into training and test sets
        docs_train, docs_test, y_train, y_test = train_test_split(self.movie.data, self.movie.target,
                                                                  test_size=0.20, random_state=12)

        # initialize CountVectorizer
        self.movieVzer = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, max_features=5000)

        # fit and transform using the training text
        docs_train_counts = self.movieVzer.fit_transform(docs_train)

        # convert raw frequency counts into TF-IDF values
        self.movieTfmer = TfidfTransformer()
        docs_train_tfidf = self.movieTfmer.fit_transform(docs_train_counts)

        # using the fitted vectorizer and transformer, transform the test data
        docs_test_counts = self.movieVzer.transform(docs_test)
        docs_test_tfidf = self.movieTfmer.transform(docs_test_counts)

        # Now ready to build a classifier.
        # We will use Multinomial Naive Bayes as our model.
        # Train the Multinomial Naive Bayes classifier. Again, we call it "fitting".
        self.clf = MultinomialNB()
        self.clf.fit(docs_train_tfidf, y_train)

        # save the model
        filename = 'finalized_model.pkl'
        pickle.dump(self.clf, open(filename, 'wb'))

        # predict the test set results, find accuracy
        y_pred = self.clf.predict(docs_test_tfidf)

        # accuracy
        print(sklearn.metrics.accuracy_score(y_test, y_pred))

        self.Categorize()

    def Categorize(self):
        # very short and fake movie reviews
        reviews_new = ['This movie was excellent', 'Absolute joy ride', 'It is pretty good',
                       'This was certainly a movie', 'I fell asleep halfway through',
                       "We can't wait for the sequel!!", 'I cannot recommend this highly enough',
                       'What the hell is this shit?']
        reviews_new_counts = self.movieVzer.transform(reviews_new)          # turn text into count vectors
        reviews_new_tfidf = self.movieTfmer.transform(reviews_new_counts)   # turn counts into tfidf vectors

        # have the classifier make a prediction
        pred = self.clf.predict(reviews_new_tfidf)

        # print out the results
        for review, category in zip(reviews_new, pred):
            print('%r => %s' % (review, self.movie.target_names[category]))
Following MaximeKan's suggestion, I worked out a way to save all three objects.
saving the model and the vectorizers
import pickle
with open(filename, 'wb') as fout:
pickle.dump((movieVzer, movieTfmer, clf), fout)
loading the model and the vectorizers for use
import pickle
with open('finalized_model.pkl', 'rb') as f:
movieVzer, movieTfmer, clf = pickle.load(f)
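After loading, the same transform-then-predict steps from Categorize() apply; a quick usage sketch:
# use the loaded vectorizers and classifier on new text
new_counts = movieVzer.transform(['I loved this film'])
new_tfidf = movieTfmer.transform(new_counts)
print(clf.predict(new_tfidf))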
This is happening because you should save not only the classifier but also the vectorizers. Otherwise you end up refitting the vectorizers on unseen data, which obviously will not contain exactly the same words as the training data, so the dimensions will change. This is a problem because your classifier expects its input in a specific format.
Thus, the solution to your problem is quite simple: save your vectorizers as pickle files as well, and load them along with your classifier before using them.
Note: to avoid saving and loading several separate objects, you could consider putting them together in a pipeline, which is equivalent, as sketched below.
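For completeness, a simplified sketch of that pipeline variant (the custom tokenizer argument is omitted for brevity; docs_train and y_train are the splits from the classifier code above), so that only one object needs to be pickled:
# Sketch of the pipeline variant: one object to fit, pickle, and reload.
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

movie_clf = Pipeline([
    ('vect', CountVectorizer(min_df=2, max_features=5000)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
movie_clf.fit(docs_train, y_train)   # raw training texts go straight in

with open('finalized_model.pkl', 'wb') as fout:
    pickle.dump(movie_clf, fout)

with open('finalized_model.pkl', 'rb') as fin:
    loaded = pickle.load(fin)
print(loaded.predict(['Absolute joy ride']))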