How to parallelize BERT evaluation - pytorch

For example:
with torch.no_grad():
    outputs = model(my_token, my_segment)
    hidden_states = outputs[2]
The code above evaluates my_token. As the size of my_token increases, the evaluation time increases as well. Is there a way to parallelize the operation?
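One common way to get parallelism, sketched here under assumptions, is to pad several inputs to the same length and pass them through the model as a single batch, so that one forward pass covers all sequences. TinyEncoder below is a hypothetical stand-in for the BERT model (it is not the real architecture), and batched_hidden_states is an illustrative helper name, not a library function.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for BERT: token + segment embeddings, one linear layer."""

    def __init__(self, vocab_size=100, hidden=16):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden, padding_idx=0)
        self.seg = nn.Embedding(2, hidden)
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, tokens, segments):
        return self.proj(self.tok(tokens) + self.seg(segments))


def batched_hidden_states(model, token_seqs, segment_seqs, batch_size=32):
    """Evaluate many sequences in padded batches instead of one at a time."""
    model.eval()
    outputs = []
    with torch.no_grad():
        for start in range(0, len(token_seqs), batch_size):
            chunk = token_seqs[start:start + batch_size]
            toks = pad_sequence(chunk, batch_first=True)
            segs = pad_sequence(segment_seqs[start:start + batch_size], batch_first=True)
            hidden = model(toks, segs)  # one forward pass for the whole batch
            # Trim the padding so each result matches its original length.
            for row, seq in zip(hidden, chunk):
                outputs.append(row[: len(seq)])
    return outputs


model = TinyEncoder()
token_seqs = [torch.tensor([5, 6, 7]), torch.tensor([8, 9]), torch.tensor([3, 4, 5, 6])]
segment_seqs = [torch.zeros_like(t) for t in token_seqs]
states = batched_hidden_states(model, token_seqs, segment_seqs, batch_size=2)
print([tuple(s.shape) for s in states])
```

A real BERT model would also need an attention mask so that padded positions are ignored. With several GPUs, wrapping the model in torch.nn.DataParallel would additionally split each batch across devices.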

Related

Pytorch's Transformer decoder accuracy fluctuation

I have a sequence to sequence POS tagging model which uses Transformer decoder to generate target tokens.
My implementation of Pytorch's Transformer decoder is as follows:
In the initialization:
self.decoder_layer = nn.TransformerDecoderLayer(d_model=ENV_HIDDEN_SIZE, nhead=2, batch_first=True, dim_feedforward=300, activation="relu")
self.transformer_decoder = nn.TransformerDecoder(self.decoder_layer, num_layers=2)
And in the forward function:
if infer == False:  # for training
    embedded = embedded * math.sqrt(ENV_HIDDEN_SIZE)
    embedded = self.pos_encoder(embedded)
    zol = self.transformer_decoder(tgt=embedded,
                                   memory=newtensor,
                                   memory_mask=self.transformer_mask,
                                   memory_key_padding_mask=x_mask,
                                   tgt_mask=self.transformer_mask)
    scores = self.slot_trans(self.dropout3(zol))
else:  # for inference
    bos = Variable(torch.LongTensor([[tag2index['<BOS>']] * batch_size])).cuda().transpose(1, 0)
    bos = self.embedding(bos)
    tokens = bos
    for i in range(length):
        temp_embedded = tokens * math.sqrt(ENV_HIDDEN_SIZE)
        temp_embedded = self.pos_encoder(temp_embedded)
        zol = self.transformer_decoder(tgt=temp_embedded,
                                       memory=newtensor,
                                       tgt_mask=self.transformer_mask[:i+1, :i+1],
                                       memory_key_padding_mask=x_mask,
                                       memory_mask=self.transformer_mask[:i+1, :])
        scores = self.slot_trans(self.dropout3(zol))
        softmaxed = self.softmax(scores)
        _, input = torch.max(softmaxed, 2)
        newtok = self.embedding(input)
        tokens = torch.cat((bos, newtok), dim=1)
The memory_mask is generated by the function generate_square_subsequent_mask, defined as:
def generate_square_subsequent_mask(sz: int):
    """Generates an upper-triangular matrix of -inf, with zeros on diag."""
    return torch.triu(torch.ones(sz, sz) * float('-inf'), diagonal=1)
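To make the mask concrete, here is a small sketch of what the function returns and how it affects attention weights. Row i is the query position and column j is the position being attended to; -inf entries are blocked, so position i can only see positions j <= i. The size 4 and the all-zero scores are arbitrary choices for illustration.

```python
import torch


def generate_square_subsequent_mask(sz: int) -> torch.Tensor:
    """Matrix with -inf above the diagonal and zeros elsewhere."""
    return torch.triu(torch.ones(sz, sz) * float("-inf"), diagonal=1)


mask = generate_square_subsequent_mask(4)
print(mask)

# The mask is added to the attention scores before the softmax,
# so every -inf entry ends up with zero attention weight.
scores = torch.zeros(4, 4)
weights = torch.softmax(scores + mask, dim=-1)
print(weights)
```

When such a mask is passed as memory_mask, row i restricts which source (memory) positions target position i may attend to.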
I am observing something weird. If I do not pass a memory_mask built with generate_square_subsequent_mask (which, according to this post, I should not need to), the accuracy drops severely. Furthermore, the test-set accuracy fluctuates randomly between 50% and 90% from epoch to epoch, while the training-set accuracy does not.
If I do pass the memory_mask, everything is fine, and the test-set accuracy increases steadily to 95%. Without the memory_mask, the final accuracy is also lower.
Things I tried:
Without memory_mask: Tuning the learning rate.
Without memory_mask: Increasing the nhead and num_layers.
Using a simple linear layer.
As a final note, using a simple linear layer instead of the transformer decoder gives better accuracy. Any ideas as to why this is happening?

How to evaluate loss only on elements satisfying a condition pytorch

I'm working on a regression problem in PyTorch. I get good results on my evaluation set, but I want to make sure it's not because I have many small elements and fewer large ones. Therefore, I would like to check whether I get a similar loss for the large elements (e.g. elements > 0.01). I use MSE loss.
Can anyone please suggest a way of doing so?
Thanks!
You can zero out the loss for the smaller elements (assuming the size of an element is determined by your regression target) by implementing your own loss function like this:
import torch

class CustomMSE:
    def __init__(self, threshold=0.01, reduction=torch.mean):
        self.threshold = threshold
        self.reduction = reduction

    def __call__(self, output, target):
        # Do not reduce, so you get per-element loss
        loss = torch.nn.functional.mse_loss(output, target, reduction="none")
        loss[target < self.threshold] = 0
        return self.reduction(loss)

criterion = CustomMSE()
You can use it just like torch.nn.MSELoss; this should give you the general idea.
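Note that zeroing the small elements and then averaging still divides by the total element count. If the goal is the mean error over the large elements only, one alternative, sketched here, is to select them with a boolean mask before reducing. The function name masked_mse and the sample values are made up for illustration.

```python
import torch
import torch.nn.functional as F


def masked_mse(output, target, threshold=0.01):
    """MSE over only the elements whose target is at least `threshold`."""
    mask = target >= threshold
    if not mask.any():
        # No element qualifies; return zero rather than the NaN of an empty mean.
        return output.new_zeros(())
    return F.mse_loss(output[mask], target[mask])


target = torch.tensor([0.001, 0.002, 0.5, 1.0])
output = torch.tensor([0.0, 0.0, 0.4, 1.2])

loss_all = F.mse_loss(output, target)    # averaged over all four elements
loss_large = masked_mse(output, target)  # averaged over the two large elements
print(loss_all.item(), loss_large.item())
```

Comparing the two values shows whether the error on the large elements is in line with the overall error.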

Gradient calculation not disabled in no_grad() PyTorch

Why is the gradient calculation of y not disabled in the following piece of code?
x = torch.randn(3, requires_grad=True)
print(x.requires_grad)
print((x ** 2).requires_grad)
y = x**2
print(y.requires_grad)
with torch.no_grad():
    print((x ** 2).requires_grad)
    print(y.requires_grad)
Which gives the following output:
True
True
True
False
True
The official documentation says that the results will have requires_grad=False even when the inputs have requires_grad=True:
Disabling gradient calculation is useful for inference, when you are sure
that you will not call :meth:Tensor.backward(). It will reduce memory
consumption for computations that would otherwise have requires_grad=True.
In this mode, the result of every computation will have
requires_grad=False, even when the inputs have requires_grad=True.
I don't know the specific implementation of torch.no_grad(), but the documentation says "the result of every computation", which means it applies only to results computed inside the context, not to variables that already exist.
Run the code below:
with torch.no_grad():
    print(x.requires_grad)
which gives the output:
True
The same holds for y, which was not computed inside the torch.no_grad() context.
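A short sketch of the distinction: no_grad() affects only tensors created inside the block. A tensor computed earlier keeps its flag, and detach() can be used to get a version of it that is cut off from the graph. The variable names z and y_detached are illustrative.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x ** 2                      # computed outside no_grad: tracks gradients

with torch.no_grad():
    z = x ** 2                  # computed inside no_grad: does not track gradients
    flags_inside = (x.requires_grad, y.requires_grad, z.requires_grad)

y_detached = y.detach()         # same values as y, but detached from the graph

print(flags_inside)
print(y_detached.requires_grad)
```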

Running out of memory during evaluation in Pytorch

I'm training a model in PyTorch. Every 10 epochs, I evaluate the train and test error on the entire train and test datasets. For some reason the evaluation function causes an out-of-memory error on my GPU. This is strange because I use the same batch size for training and evaluation. I believe it's due to the net.forward() method being called repeatedly and all the hidden values being stored in memory, but I'm not sure how to get around this.
def evaluate(self, data):
    correct = 0
    total = 0
    loader = self.train_loader if data == "train" else self.test_loader
    for step, (story, question, answer) in enumerate(loader):
        story = Variable(story)
        question = Variable(question)
        answer = Variable(answer)
        _, answer = torch.max(answer, 1)
        if self.config.cuda:
            story = story.cuda()
            question = question.cuda()
            answer = answer.cuda()
        pred_prob = self.mem_n2n(story, question)[0]
        _, output_max_index = torch.max(pred_prob, 1)
        toadd = (answer == output_max_index).float().sum().data[0]
        correct = correct + toadd
        total = total + answer.size(0)
    acc = correct / total
    return acc
I think it fails during validation because you don't call optimizer.zero_grad(). zero_grad detaches the gradients, making each gradient tensor a leaf. It is normally called at every step of the training loop.
The volatile flag on Variable was removed in PyTorch 0.4.0.
Ref - migration_guide_to_0.4.0
Starting from 0.4.0, to avoid computing gradients during validation, use torch.no_grad().
Code example from the migration guide:
# evaluate
with torch.no_grad():  # operations inside don't track history
    for input, target in test_loader:
        ...
For 0.3.X, using volatile should work.
I would suggest setting the volatile flag to True for all variables used during the evaluation:
story = Variable(story, volatile=True)
question = Variable(question, volatile=True)
answer = Variable(answer, volatile=True)
Thus, the gradients and operation history are not stored, and you will save a lot of memory.
Also, you could delete references to those variables at the end of the batch processing:
del story, question, answer, pred_prob
Don't forget to set the model to evaluation mode (and back to train mode after you have finished the evaluation). For instance:
model.eval()
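Putting the pieces together for PyTorch 0.4.0 and later, a minimal sketch of an evaluation routine might look like the following. The model and data here are toy placeholders, not the memory network from the question. The relevant parts are model.eval(), torch.no_grad(), and accumulating plain Python numbers via .item() rather than tensors.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, device="cpu"):
    """Return classification accuracy without building an autograd graph."""
    was_training = model.training
    model.eval()                      # disable dropout, use running batch-norm stats
    correct, total = 0, 0
    with torch.no_grad():             # no history is stored for these forward passes
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            preds = model(inputs).argmax(dim=1)
            correct += (preds == labels).sum().item()  # plain Python int
            total += labels.size(0)
    model.train(was_training)         # restore the previous mode
    return correct / total


# Toy placeholders for illustration only.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(0.5), nn.Linear(16, 3))
data = TensorDataset(torch.randn(50, 8), torch.randint(0, 3, (50,)))
acc = evaluate(model, DataLoader(data, batch_size=10))
print(acc)
```

Restoring the previous mode at the end means the caller does not have to remember to call model.train() again before the next training epoch.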

Periodically log gradients without requiring two functions (or slowdown) in Theano

For diagnostic purposes, I am grabbing the gradients of the network periodically. One way to do this is to return the gradients as output of the theano function. However, copying the gradients from the GPU to CPU memory every time may be costly so I would prefer to do it only periodically. At the moment, I am achieving this by creating two function objects, one which returns the gradient and one which doesn't.
However, I do not know whether this is optimal and am looking for a more elegant way to achieve the same thing.
Your first function obviously executes a training step and updates all your parameters.
The second function must return the gradients of your parameters.
The fastest way to do what you are asking is to add the updates for the training step to the second function and when logging the gradients, don't call the first function, but only the second.
gradients = [ ... ]
train_f = theano.function([x, y], [], updates=updates)
train_grad_f = theano.function([x, y], gradients, updates=updates)
num_iters = 1000
grad_array = []
for i in range(num_iters):
    # every 10 training steps keep log of gradients
    if i % 10 == 0:
        grad_array.append(train_grad_f(...))
    else:
        train_f(...)
Update
If you wish to have a single function to do this, you can do the following:
import theano.tensor as T
from theano.ifelse import ifelse
no_grad = T.iscalar('no_grad')
example_gradient = T.grad(example_cost, example_variable)
# if no_grad is > 0 then return the gradient, otherwise return zeros array
out_grad = ifelse(T.gt(no_grad,0), example_gradient, T.zeros_like(example_variable))
train_f = theano.function([x, y, no_grad], [out_grad], updates=updates)
So when you want to retrieve the gradients you call
train_f(x_data, y_data, 1)
otherwise
train_f(x_data, y_data, 0)

Resources