TokenClassification with BERT: add tensors after text was already tokenized? - pytorch

I am using Bert for NER TokenClassification.
Since I want to manually truncate the (training) text and add padding and special tokens on my own, the tokenizer function looks like this:
tokenized_text = tokenizer.encode_plus(text, add_special_tokens=False, is_split_into_words=True)
I have successfully trained my model and now want to use it to predict new text.
The Huggingface tutorial suggest to do it as follows:
with torch.no_grad():
logits = model(**tokenized_text).logits
predicted_token_class_ids = logits.argmax(dim = -1)
predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]]
My problem is that in order to use the code above tokenized_text has to be in (pytorch) tensor format, but I originally did not use the return_tensors="pt" parameter, since I wanted to leave "input_ids", "token_type_ids" and "attention_mask" as list datatype to manipulate them easier.
So my question is basically if I can transform an already tokenized text to a tokenized text in the tensor format.
As far as the documentation tells return_tensors="pt" just returns torch.Tensor objects for the "input_ids", "token_type_ids" and the "attention_mask".
So I simply tried to use:
tokenized_text["input_ids"] = torch.Tensor(tokenized_text["input_ids"])
tokenized_text["token_type_ids"] = torch.Tensor(tokenized_text["token_type_ids"])
tokenized_text["attention_mask"] = torch.Tensor(tokenized_text["attention_mask"])
This made my tokenized text look like this:
{'input_ids': tensor([ 101., 5911., 26664., ....
'token_type_ids': tensor([0., 0., 0., ....
'attention_mask': tensor([1., 1., 1., .... }
Which is a bit weird, since if I use return_tensors="pt" from the beginning the tokenized text looks like this: (Basically it has one more layer of [ ] and not a "." after every element.
{'input_ids': tensor([[19770, 30882, 215, ....
'token_type_ids': tensor([[0, 0, 0, ....
'attention_mask': tensor([[1, 1, 1, .... }
I tried that on a custom text just to get the reference, currently it is not really an option for me to use return_tensors="pt" directly during my tokenization.
If I run the prediction code as suggested by Huggingface on the return_tensors="pt" tokenized text it works just fine, but if I use my manually to tensor converted tokenized text I receive the following error:
ValueError: not enough values to unpack (expected 2, got 1)
Does anyone have a suggestion as to what I should change or experienced another way to predict new data with a trained model?

I could solve it after some more digging through the documentation. Turns out that just using torch.Tensor(tokenized_text["input_ids"] was not enough.
I had to add another dimension so that the tensor has the size of [1,512].
I did this with:
local_copy["input_ids"] = local_copy["input_ids"][None, :]
I had to typecast my tensor from float to int with:
local_copy["input_ids"] = local_copy["input_ids"].type(torch.int64)

Related

Custom tokenizer not working in countvectorizer sklearn

I am trying to make a Countvectorizer with a custom tokenizer function. I am facing a weird problem with it. In below code temp_tok is a list of 5 values which is used as vocabulary later.
temp_tok = ["or", "Normal sinus rhythm", "sinus", "anuj","Normal sinus"]
def tokenize(text):
return [temp_tok[0],temp_tok[1], "sinus", "Normal sinus"]
def tokenize2(text):
return [i for i in temp_tok if i in text]
text = "Normal sinus rhythm"
The output of text for both functions is same which is
tokenize(text)
output = ['or', 'Normal sinus rhythm', 'sinus', 'Normal sinus']
But when I build vectorizer with these tokenizer, it gives unexpected output for tokenize2. My vocabulary is temp_tok for both. I experimented with n_gram range but it is not helping.
vectorizer = CountVectorizer(vocabulary=temp_tok,tokenizer = tokenize)
vectorizer2 = CountVectorizer(vocabulary=temp_tok,tokenizer = tokenize2)
While vectorizer.transform([text]) is giving expected output, vectorizer2.transform([text]) is giving 1 only for "or" and "sinus"
vectorizer.transform(["Normal sinus rhythm"]).toarray()
array([[1, 1, 1, 0, 1]])
vectorizer.transform(["Normal sinus rhythm"]).toarray()
array([[1, 0, 1, 0, 0]])
I also tried passing dictionary instead of list temp_tok as vocabulary to Countvectorizer but it doesn't help. Is this sklearn problem or I am doing something wrong?
Countvectorizer is passing the text by converting it to lower case. So tokenize2 is not working while tokenize works well.
This can be seen by adding a print function in tokenize2.
def tokenize2(text):
print(text)
return [i for i in temp_tok if i in text]
A good solution would be to change the elements in temp_tok to lower cases. Else any technique to handle small case, capital case would work.

How to get words from output of XLNet using Transformers library

I am using Hugging Face's Transformer library to work with different NLP models. Following code does masking with XLNet. It outputs a tensor with numbers. How do I convert the output to words again?
import torch
from transformers import XLNetModel, XLNetTokenizer, XLNetLMHeadModel
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetLMHeadModel.from_pretrained('xlnet-base-cased')
# We show how to setup inputs to predict a next token using a bi-directional context.
input_ids = torch.tensor(tokenizer.encode("I went to <mask> York and saw the <mask> <mask> building.")).unsqueeze(0) # We will predict the masked token
print(input_ids)
perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token
target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float) # Shape [1, 1, seq_length] => let's predict one token
target_mapping[0, 0, -1] = 1.0 # Our first (and only) prediction will be the last token of the sequence (the masked token)
outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
next_token_logits = outputs[0] # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
The current output I get is:
tensor([[[ -5.1466, -17.3758, -17.3392, ..., -12.2839, -12.6421, -12.4505]]],
grad_fn=AddBackward0)
The output you have is a tensor of size 1 by 1 by vocabulary size. The meaning of the nth number in this tensor is the estimated log-odds of the nth vocabulary item. So, if you want to get out the word that the model predicts to be most likely to come in the final position (the position you specified with target_mapping), all you need to do is find the word in the vocabulary with the maximum predicted log-odds.
Just add the following to the code you have:
predicted_index = torch.argmax(next_token_logits[0][0]).item()
predicted_token = tokenizer.convert_ids_to_tokens(predicted_index)
So predicted_token is the token the model predicts as most likely in that position.
Note, by default behaviour of XLNetTokenizer.encoder() adds special tokens and to the end of a string of tokens when it encodes it. The code you have given masks and predicts the final word, which, after running though tokenizer.encoder() is the special character '<cls>', which is probably not what you want.
That is, when you run
tokenizer.encode("I went to <mask> York and saw the <mask> <mask> building.")
the result is a list of token ids,
[35, 388, 22, 6, 313, 21, 685, 18, 6, 6, 540, 9, 4, 3]
which, if you convert back to tokens (by calling tokenizer.convert_ids_to_tokens() on the above id list), you will see has two extra tokens added at the end,
['▁I', '▁went', '▁to', '<mask>', '▁York', '▁and', '▁saw', '▁the', '<mask>', '<mask>', '▁building', '.', '<sep>', '<cls>']
So, if the word you are meaning to predict is 'building', you should use perm_mask[:, :, -4] = 1.0 and target_mapping[0, 0, -4] = 1.0.

Treat a tuple/list of Tensors as a single Tensor

I'm using Pytorch for some robotics Reinforcement Learning tasks. I'd like to use both images and information about the state as observations for this task. The implementation I'm using does not directly support this so I'm making some amendments. Expected observations are either state, as a 1 dimensional Tensor, or images as a 3 dimensional Tensor (channels, width, height). In my task I would like the observation to be a tuple of Tensors.
In many places in my codebase, the observation is of course expected to be a single Tensor, not a tuple of Tensors. Is there an easy way to treat a tuple of Tensors as a single Tensor?
For example, I would like:
observation.to(device)
to work as normal when observation is a single Tensor, and call .to(device) on each Tensor when observation is a tuple of Tensors.
It should be simple enough to create a data type that can support this, but I'm wondering does such a data type already exist? I haven't found anything so far.
If your tensors are all of the same size, you can use torch.stack to concatenate them into one tensor with one more dimension.
Example:
>>> import torch
>>> a=torch.randn(2,1)
>>> b=torch.randn(2,1)
>>> c=torch.randn(2,1)
>>> a
tensor([[ 0.7691],
[-0.0297]])
>>> b
tensor([[ 0.4844],
[-0.9142]])
>>> c
tensor([[ 0.0210],
[-1.1543]])
>>> torch.stack((a,b,c))
tensor([[[ 0.7691],
[-0.0297]],
[[ 0.4844],
[-0.9142]],
[[ 0.0210],
[-1.1543]]])
You can then use torch.unbind to go the other direction.

How to assign a new value to a pytorch Variable without breaking backpropagation?

I have a pytorch variable that is used as a trainable input for a model. At some point I need to manually reassign all values in this variable.
How can I do that without breaking the connections with the loss function?
Suppose the current values are [1.2, 3.2, 43.2] and I simply want them to become [1,2,3].
Edit
At the time I asked this question, I hadn't realized that PyTorch doesn't have a static graph as Tensorflow or Keras do.
In PyTorch, the training loop is made manually and you need to call everything in each training step. (There isn't the notion of placeholder + static graph for later feeding data).
Consequently, we can't "break the graph", since we will use the new variable to perform all the further computations again. I was worried about a problem that happens in Keras, not in PyTorch.
You can use the data attribute of tensors to modify the values, since modifications on data do not affect the graph. So the graph will still be intact and modifications of the data attribute itself have no influence on the graph. (Operations and changes on data are not tracked by autograd and thus not present in the graph)
Since you haven't given an example, this example is based on your comment statement: 'Suppose I want to change the weights of a layer.'
I used normal tensors here, but this works the same for weight.data and bias.data attributes of a layers.
Here is a short example:
import torch
import torch.nn.functional as F
# Test 1, random vector with CE
w1 = torch.rand(1, 3, requires_grad=True)
loss = F.cross_entropy(w1, torch.tensor([1]))
loss.backward()
print('w1.data', w1)
print('w1.grad', w1.grad)
print()
# Test 2, replacing values of w2 with w1, before CE
# to make sure that everything is exactly like in Test 1 after replacing the values
w2 = torch.zeros(1, 3, requires_grad=True)
w2.data = w1.data
loss = F.cross_entropy(w2, torch.tensor([1]))
loss.backward()
print('w2.data', w2)
print('w2.grad', w2.grad)
print()
# Test 3, replace data after computation
w3 = torch.rand(1, 3, requires_grad=True)
loss = F.cross_entropy(w3, torch.tensor([1]))
# setting values
# the graph of the previous computation is still intact as you can in the below print-outs
w3.data = w1.data
loss.backward()
# data were replaced with values from w1
print('w3.data', w3)
# gradient still shows results from computation with w3
print('w3.grad', w3.grad)
Output:
w1.data tensor([[ 0.9367, 0.6669, 0.3106]])
w1.grad tensor([[ 0.4351, -0.6678, 0.2326]])
w2.data tensor([[ 0.9367, 0.6669, 0.3106]])
w2.grad tensor([[ 0.4351, -0.6678, 0.2326]])
w3.data tensor([[ 0.9367, 0.6669, 0.3106]])
w3.grad tensor([[ 0.3179, -0.7114, 0.3935]])
The most interesting part here is w3. At the time backward is called the values are replaced by values of w1. But the gradients are calculated based on the CE-function with values of original w3. The replaced values have no effect on the graph.
So the graph connection is not broken, replacing had no influence on graph. I hope this is what you were looking for!

How to correctly encode labels with tensorflow's one-hot encoding?

I've been trying to learn Tensorflow with python 3.6 and decided on building a facial recognition program using data from the University of Essex's face data base (http://cswww.essex.ac.uk/mv/allfaces/index.html). So far I've been following Tensorflow's MNIST Expert guide, but when I start testing, my accuracy is 0 for every epoch, so I know something is wrong. I feel most shaky on how I'm handling the labels, so I figure that's where the problem is.
The labels in the dataset are either numeric IDs, like 987323, or someone's name, like "fordj". My idea to deal with this was to create a "pre-encoding" encode_labels function, which gives each unique label in the test and training sets their own unique integer value. I checked to make sure each unique label in the test and train sets have the same unique value. It also returns a dictionary so that I can easily map back to the original label from the encoded version. If I don't do this step and pass the labels as I retrieve them (i.e "fordj"), I get an error saying
UnimplementedError (see above for traceback): Cast string to int32 is not supported
[[Node: Cast = CastDstT=DT_INT32, SrcT=DT_STRING, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
The way I'm interpreting this is that since many of the labels are people's names, tensorflow can't convert a label like "fordj" to a tf.int32. The code to grab labels and paths is here:
def get_paths_and_labels(path):
""" image_paths : list of relative image paths
labels : mix of alphanumeric characters """
image_paths = [path + image for image in os.listdir(path)]
labels = [i.split(".")[-3] for i in image_paths]
labels = [i.split("/")[-1] for i in labels]
return image_paths, labels
def encode_labels(train_labels, test_labels):
""" Assigns a numeric value to each label since some are subject's names """
found_labels = []
index = 0
mapping = {}
for i in train_labels:
if i in found_labels:
continue
mapping[i] = index
index += 1
found_labels.append(i)
return [mapping[i] for i in train_labels], [mapping[i] for i in test_labels], mapping
Here is how I assign my training and testing labels. I then want to use tensorflow's one-hot encoder to encode them again for me.
def main():
# Grabs the labels and each image's relative path
train_image_paths, train_labels = get_paths_and_labels(TRAIN_PATH)
# Smallish dataset so I can read it all into memory
train_images = [cv2.imread(image) for image in train_image_paths]
test_image_paths, test_labels = get_paths_and_labels(TEST_PATH)
test_images = [cv2.imread(image) for image in test_image_paths]
num_classes = len(set(train_labels))
# Placeholders
x = tf.placeholder(tf.float32, shape=[None, IMAGE_SIZE[0] * IMAGE_SIZE[1]])
y_ = tf.placeholder(tf.float32, shape=[None, num_classes])
x_image = tf.reshape(x, [-1, IMAGE_SIZE[0], IMAGE_SIZE[1], 1])
# One-hot labels
train_labels, test_labels, mapping = encode_labels(train_labels, test_labels)
train_labels = tf.one_hot(indices=tf.cast(train_labels, tf.int32), depth=num_classes)
test_labels = tf.one_hot(indices=tf.cast(test_labels, tf.int32), depth=num_classes)
I'm sure I'm doing something wrong. I know sklearn has a LabelEncoder, though I haven't tried it out yet. Thanks for any advice on this, all help is appreciated!
The way I'm interpreting this is that since many of the labels are people's names, tensorflow can't convert a label like "fordj" to a tf.int32.
You're right. Tensorflow can't do that. Instead, you can create a mapping function from a nome to a unique (and progressive) ID. Once you did that, you can correctly one-encode every numeric ID with its one-hot representation.
You already have the relation between the numeric ID and the string label, hence you can do something like:
train_labels, test_labels, mapping = encode_labels(train_labels, test_labels)
numeric_train_ids = [labels[idx] for idx in train_labels]
numeric_test_ids = [labels[idx] for idx in test_labels]
one_hot_train_labels = tf.one_hot(indices=numeric_train_ids, depth=num_classes)
one_hot_test_labels = tf.one_hot(indices=numeric_test_ids, depth=num_classes)

Resources