Huggingface token classification pipeline giving different outputs than just calling model() directly - pytorch

I am trying to mask named entities in text, using a RoBERTa-based model.
The suggested way to use the model is via the Huggingface pipeline, but I find that it is rather slow to use it that way. Using a pipeline on text data also prevents me from using my GPU for computation, as the text cannot be put onto the GPU.
Because of this, I decided to put the model on the GPU, tokenize the text myself (using the same tokenizer I pass to the pipeline), put the tokens on the GPU and pass them to the model afterwards. This works, but the outputs of the model called directly this way differ significantly from the outputs of the pipeline.
I can't find a reason for this, nor a way to fix it.
I tried reading through the token classification pipeline source code but couldn't find a difference between my usage and what the pipeline does.
Examples of code which produce different results:
Suggested usage in the model card:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

ner_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
classifier = pipeline("ner", model=model, tokenizer=ner_tokenizer, framework='pt')
out = classifier(dataset['text'])
'out' is now a list of lists of dictionaries, one inner list per string in the list of strings 'dataset['text']', where each dictionary holds information about a named entity found in that string.
My custom usage:
text_batch = dataset['text']
encodings_batch = ner_tokenizer(text_batch,padding="max_length", truncation=True, max_length=128, return_tensors="pt")
input_ids = encodings_batch['input_ids']
input_ids = input_ids.to(TORCH_DEVICE)
outputs = model(input_ids)[0]
outputs = outputs.to('cpu')
label_ner_ids = outputs.argmax(dim=2).to('cpu')
'label_ner_ids' is now a 2-dimensional tensor whose elements are the predicted labels for each token in a given line of text, so label_ner_ids[i, j] is the label for the j-th token in the i-th string of the list of strings 'text_batch'. The token labels here differ from the outputs of the pipeline usage.
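For reference, here is a minimal sketch of the direct call that also forwards the attention mask returned by the tokenizer (the pipeline passes the tokenizer's full output to the model rather than input_ids alone); the variable names mirror the snippet above, and the rest is an illustrative assumption, not the pipeline's exact code:
import torch

encodings_batch = ner_tokenizer(text_batch, padding="max_length", truncation=True, max_length=128, return_tensors="pt")
# Move every returned tensor (input_ids, attention_mask, ...) to the GPU
encodings_batch = {k: v.to(TORCH_DEVICE) for k, v in encodings_batch.items()}
with torch.no_grad():
    # Forwarding the attention mask lets the model ignore the padding positions
    outputs = model(**encodings_batch)[0]
label_ner_ids = outputs.argmax(dim=2).to('cpu')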

Related

When doing pre-training of a transformer model, how can I add words to the vocabulary?

Given a trained DistilBERT language model for a given language, taken from the Huggingface hub, I want to pre-train the model on a specific domain, and I want to add new words that are:
definitely not present in the original training set
and impossible to handle via word-piece tokenization - basically you can think of these words as "codes" that are a normalized form of a named entity
Consider that:
I would like to avoid learning a new tokenizer: I am fine with adding the new words and then letting the model learn their embeddings via pre-training
the number of the "words" is way larger than the number of "unused" tokens in the "stock" vocabulary
The only advice that I have found is the one reported here:
Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
Do you think this is the only way to achieve my goal?
If yes, I have no idea how to write this "script": does someone have some hints on how to proceed (sample code, documentation, etc.)?
As per my comment, I'm assuming that you go with a pre-trained checkpoint, if only to "avoid [learning] a new tokenizer."
Also, the solution works with PyTorch, which might be more suitable for such changes. I haven't checked TensorFlow (which is mentioned in one of your quotes), so no guarantees that this works across platforms.
To solve your problem, let us divide this into two sub-problems:
Adding the new tokens to the tokenizer, and
Re-sizing the token embedding matrix of the model accordingly.
The first can actually be achieved quite simply by using .add_tokens(). I'm referencing the slow tokenizer's implementation of it (because it's in Python), but from what I can see, this also exists for the faster Rust-based tokenizers.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Will return an integer corresponding to the number of added tokens
# The input could also be a list of strings instead of a single string
num_new_tokens = tokenizer.add_tokens("dennlinger")
You can quickly verify that this worked by looking at the encoded input ids:
print(tokenizer("This is dennlinger."))
# 'input_ids': [101, 2023, 2003, 30522, 1012, 102]
The index 30522 now corresponds to the new token with my username, so we can check the first part. However, if we look at the function docstring of .add_tokens(), it also says:
Note: when adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the model so that its embedding matrix matches the tokenizer.
In order to do that, please use the PreTrainedModel.resize_token_embeddings method.
Looking at this particular function, the description is a bit confusing, but we can get a correctly resized matrix (with randomly initialized weights for the new tokens) by simply passing the previous vocabulary size plus the number of new tokens:
from transformers import AutoModel
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.resize_token_embeddings(model.config.vocab_size + num_new_tokens)
# Test that everything worked correctly
model(**tokenizer("This is dennlinger", return_tensors="pt"))
EDIT: Notably, .resize_token_embeddings() also takes care of any associated weights; this means that if you are pre-training, it will also resize the language modeling head (which should have the same number of tokens), and handle tied weights that would be affected by the increased number of tokens.
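As a quick sanity check (a minimal sketch; the model and tokenizer mirror the snippets above), you can compare the size of the embedding matrix to the tokenizer's vocabulary before and after resizing:
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

num_new_tokens = tokenizer.add_tokens(["dennlinger"])
print(len(tokenizer))                                # vocabulary size including the new token
print(model.get_input_embeddings().weight.shape[0])  # embedding rows, still the old vocab size

model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().weight.shape[0])  # now matches len(tokenizer)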

How to slice Kinetics400 training dataset? (pytorch)

I am trying to run the official script for video classification.
I want to tweak some functions, and running through all the examples would cost me too much time.
I wonder how I can slice the training Kinetics dataset based on that script.
This is the code I added before
train_sampler = RandomClipSampler(dataset.video_clips, args.clips_per_video)
in the script (let's say I just want to run 100 examples):
tr_split_len = 100
dataset = torch.utils.data.random_split(dataset, [tr_split_len, len(dataset)-tr_split_len])[0]
Then, when hitting
train_sampler = RandomClipSampler(dataset.video_clips, args.clips_per_video)
it raises the error:
AttributeError: 'Subset' object has no attribute 'video_clips'
So the type of the dataset changes from torchvision.datasets.kinetics.Kinetics400 to torch.utils.data.dataset.Subset.
I understand why that happens. So how can I do it (hopefully not by using break in the dataloader loop)?
Thanks.
It seems that torchvision.datasets.kinetics.Kinetics400 internally uses an object of class VideoClips to store the information about the clips. It is stored in the member variable Kinetics400().video_clips.
The VideoClips class has a function called subset, that takes a list of indices and returns a new VideoClips object with only the clips with the specified indices. You could then just replace the old VideoClips object with the new one in your dataset.
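A minimal sketch of that idea (the 100-clip cut-off and the choice of the first indices are assumptions; the surrounding names mirror the script):
# Keep only the first 100 clips by replacing the dataset's VideoClips object
tr_split_len = 100
dataset.video_clips = dataset.video_clips.subset(list(range(tr_split_len)))

train_sampler = RandomClipSampler(dataset.video_clips, args.clips_per_video)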

Customize the encode module in huggingface bert model

I am working on a text classification project using the Huggingface transformers module. The encode_plus function provides users with a convenient way of generating the input ids, attention masks, token type ids, etc. For instance:
from transformers import BertTokenizer
pretrained_model_name = 'bert-base-cased'
bert_base_tokenizer = BertTokenizer.from_pretrained(pretrained_model_name)
sample_text = 'Bamboo poles, installation by an unknown building constructor #discoverhongkong #hongkonginsta'
encoding = bert_base_tokenizer.encode_plus(
    sample_text,
    max_length=70,
    add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
    return_token_type_ids=True,
    pad_to_max_length=True,
    return_attention_mask=True,
    return_tensors='pt',  # Return PyTorch tensors
)
print('*'*20)
print(encoding['input_ids'])
print(encoding['attention_mask'])
print(encoding['token_type_ids'])
print('*'*20)
However, my current project requires me to generate customized ids for a given text. For instance, for a list of words [HK, US, UK], I want to generate ids for these words and set the ids of all other words, which do not exist in this list, to zero. These ids are used to look up embeddings in a separate, customized embedding matrix, not the pretrained BERT embeddings.
How can I achieve this kind of customized encoder? Any suggestions and solutions are welcome! Thanks~
I think you can use the [unusedX] tokens in the BERT vocab and add your custom tokens there. That way you can easily refer to them with a valid token id.
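Here is a minimal sketch of one way to build such a custom id mapping on top of the tokenizer (the word list, the id assignments, and the custom_encode helper are illustrative assumptions):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Custom vocabulary: ids start at 1, everything else maps to 0
custom_words = ['HK', 'US', 'UK']
tokenizer.add_tokens(custom_words)  # keeps these words as single tokens when tokenizing
custom_vocab = {word: idx + 1 for idx, word in enumerate(custom_words)}

def custom_encode(text, max_length=70):
    tokens = tokenizer.tokenize(text)
    # Tokens outside the custom vocabulary get id 0
    ids = [custom_vocab.get(token, 0) for token in tokens]
    # Pad / truncate to a fixed max_length
    return ids[:max_length] + [0] * max(0, max_length - len(ids))

print(custom_encode('HK and US sign a new agreement'))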

Is there an easy way with built-in functions to automatically retrain a keras NLP model?

I have a natural language processing model built with Keras using keras.preprocessing.text.Tokenizer. I know that I can retrain the old model by calling its .fit(...) after importing it, but I need to update my tokenizer as well. The tokenizer does several things: it splits a string on spaces, eliminates symbols, converts to lowercase, keeps only the most used tokens after building its dictionary, maps the tokens to integer ids, and pads with 0 if the sentence is too short.
Ex:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(df_train['message'][0:100].values)
x_train = tokenizer.texts_to_sequences(df_train['message'][0:100].values)
x_train = pad_sequences(x_train, padding='post', maxlen=maxlen)
This process is needed to be able to feed the sequences into an NLP network. The problem appears when I try to retrain automatically. Every time I retrain, the tokenizer must be updated, and if I add new text, all the values in the dictionary that the Tokenizer class uses (i.e. the encoding of each word) change.
Ex:
If I update like this: tokenizer.fit_on_texts(df_train['message'][100:200].values),
then
x_train = tokenizer.texts_to_sequences(df_train['message'][0:100].values)
will output different encodings for the same sentences, but I need the encodings to stay the same. The official documentation says that fit_on_texts(self, texts) "Updates internal vocabulary based on a list of texts." It does update the vocabulary, but it also changes the values of all existing keys, not just the new ones.
Is there an official method to keep the old values of the words and generate new values only for the new words?
So according to the source for Tokenizer, all the words are stored in the Tokenizer class. Specifically in:
self.word_counts = OrderedDict()
self.word_docs = defaultdict(int)
So if you want to avoid re-fitting on words the tokenizer has already seen, you can filter your new input down to the items that are not already in word_counts.keys() (a sketch of this follows after the note below).
However, there's one more catch:
When you run .fit_on_texts it expects a list of texts, or a single string. Depending on how you're doing things, you might also be incrementing your document counter. If that's intentional, you do not need to do anything here; otherwise, you will need to handle decrementing self.document_count at a minimum.
(You might also need to manipulate self.word_index and self.index_word if you want the existing word-to-index mapping to stay fixed.)
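Here is a minimal sketch of assigning indices only to unseen words, leaving existing indices untouched (tokenizer is the already-fitted Tokenizer and new_texts holds the new messages; both names are assumptions, and note that the num_words cut-off in texts_to_sequences still applies to the new indices):
from keras.preprocessing.text import text_to_word_sequence

# Collect words that the tokenizer has not seen yet
new_words = set()
for text in new_texts:
    for word in text_to_word_sequence(text):
        if word not in tokenizer.word_index:
            new_words.add(word)

# Give each new word a fresh index; old words keep their existing encodings
next_index = max(tokenizer.word_index.values()) + 1
for word in sorted(new_words):
    tokenizer.word_index[word] = next_index
    tokenizer.index_word[next_index] = word
    next_index += 1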

Tensorflow: Setting an array element with a sequence

I'm trying to train a CNN using my own image dataset, but when passing the batch data and labels to the feed_dict I get the error ValueError: setting an array element with a sequence. From what I read here, this is a dimension issue, probably coming from my batch_label tensor, but I couldn't figure out how to make it a one-hot tensor (which is what my graph expects).
I uploaded the full code as a gist here: https://gist.github.com/guivn/f7f753547f77a3b12992
TL;DR: You can't feed a tf.Tensor object (viz. batch_data and batch_labels in your gist) as the value for another tensor. (I believe the error message should be clearer about this in more recent versions of TensorFlow.)
Unfortunately you can't currently use the feed/tf.placeholder() mechanism to pass the result of one TensorFlow graph to another. We are investigating ways to make this easier, since it is a common confusion and feature request. For your exact program, however, this should be easy to solve: simply move the lines that create the input into the graph and use them in place of the placeholders. Your program will then look something like:
with graph.as_default():
    # Input data.
    filename_and_label_tensor = tf.train.string_input_producer(['train.txt'], shuffle=True)
    data, label = parse_csv(filename_and_label_tensor)
    tf_train_dataset, tf_train_labels = tf.train.batch([data, label], batch_size, num_threads=4)
    # Rest of the model construction goes here....
Typically, if you want to pass another dataset through the same model—e.g. for evaluation—it's easiest to make another copy of the graph (perhaps sharing the same tf.Variable objects).
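As an illustration of that variable-sharing idea, here is a minimal sketch under TF1-style APIs (the model_fn body, feature_dim, num_classes, and tf_eval_dataset are assumptions, not part of the original program):
import tensorflow as tf

def model_fn(inputs):
    # Variables created with tf.get_variable are shared when reuse is enabled
    weights = tf.get_variable("weights", [feature_dim, num_classes])
    biases = tf.get_variable("biases", [num_classes])
    return tf.matmul(inputs, weights) + biases

with tf.variable_scope("model") as scope:
    train_logits = model_fn(tf_train_dataset)   # first call creates the variables
    scope.reuse_variables()
    eval_logits = model_fn(tf_eval_dataset)     # second call reuses the same tf.Variable objects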
