RuntimeError: CUDA error: device-side assert triggered - BART model - pytorch

I am trying to run BART language model for a text generation task.
My code was working fine when I used it with another encoder-decoder model (T5), but with BART I am getting this error:
File "train_bart.py", line 89, in train
outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, labels=lm_labels)
File ".../venv/tf_23/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File ".../venv/tf_23/lib/python3.6/site-packages/transformers/models/bart/modeling_bart.py", line 1308, in forward
return_dict=return_dict,
File ".../venv/tf_23/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File ".../venv/tf_23/lib/python3.6/site-packages/transformers/models/bart/modeling_bart.py", line 1196, in forward
return_dict=return_dict,
File ".../venv/tf_23/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File ".../venv/tf_23/lib/python3.6/site-packages/transformers/models/bart/modeling_bart.py", line 985, in forward
attention_mask, input_shape, inputs_embeds, past_key_values_length
File ".../venv/tf_23/lib/python3.6/site-packages/transformers/models/bart/modeling_bart.py", line 866, in _prepare_decoder_attent
ion_mask
).to(self.device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
And this is where the error happens:
for _, data in tqdm(enumerate(loader, 0), total=len(loader), desc='Processing batches..'):
    y = data['target_ids'].to(device, dtype=torch.long)
    y_ids = y[:, :-1].contiguous()
    lm_labels = y[:, 1:].clone().detach()
    lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100  # ignore pad tokens in the loss
    ids = data['source_ids'].to(device, dtype=torch.long)
    mask = data['source_mask'].to(device, dtype=torch.long)
    outputs = model(input_ids=ids, attention_mask=mask, decoder_input_ids=y_ids, labels=lm_labels)
    loss = outputs[0]
Here, loader is a DataLoader over the tokenized and processed data.

I suggest you change the batch size to 1 and run the code on the CPU temporarily to get a more descriptive traceback. This will tell you where the bug is.
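For example, a minimal sketch of that debugging step, reusing model, loader, and tokenizer from the question (on CPU the device-side assert usually surfaces as a readable "index out of range" error):
import torch

device = torch.device('cpu')
model = model.to(device)

data = next(iter(loader))  # a single batch is enough to reproduce
ids = data['source_ids'].to(device, dtype=torch.long)
mask = data['source_mask'].to(device, dtype=torch.long)
y = data['target_ids'].to(device, dtype=torch.long)
y_ids = y[:, :-1].contiguous()
lm_labels = y[:, 1:].clone().detach()
lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100

outputs = model(input_ids=ids, attention_mask=mask,
                decoder_input_ids=y_ids, labels=lm_labels)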

After fighting for many hours, I found that the error was due to adding new tokens to the BART tokenizer. Thus I needed to resize the model's input embeddings matrix:
model.resize_token_embeddings(len(tokenizer))
What is still not clear to me is that, without resizing the embeddings matrix, I was able to fine-tune the T5 model without any problem, but not BART.
Maybe this is because BART shares weights between the input embeddings and the output layer (I am not sure of this either).
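For reference, a minimal sketch of the whole fix (the checkpoint name and added token are illustrative):
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')

# Any tokens added to the tokenizer...
tokenizer.add_tokens(['<new_token>'])

# ...must be matched by resizing the embedding matrix; otherwise the new
# token ids index past the end of the embedding table, which is exactly
# the kind of out-of-range lookup that triggers the device-side assert.
model.resize_token_embeddings(len(tokenizer))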

Related

TypeError: cannot assign 'torch.cuda.FloatTensor' as parameter 'weight_hh_l0' (torch.nn.Parameter or None expected)

I am trying to train the model implemented in this repo https://bitbucket.org/VioletPeng/language-model/src/master/ (the second model: title to title-storyline to story model)
The training would go fine for the first epoch, but as soon as it tries to call the train function to start the second epoch everything breaks and I get the following error:
TypeError: cannot assign 'torch.cuda.FloatTensor' as parameter 'weight_hh_l0' (torch.nn.Parameter or None expected)
I don't know what the issue is. I tried looking this error up, changing .cuda to .to(device), and using device= inside the tensor initialization where possible, but none of this seems to do anything.
Below is the full exception stack trace:
File "pytorch_src/main.py", line 253, in <module>
train()
File "pytorch_src/main.py", line 209, in train
output, hidden, rnn_hs, dropped_rnn_hs = model(data, hidden, return_h=True)
File "/home/e/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/e/Documents/Amal/language-model/pytorch_src/model.py", line 81, in forward
raw_output, new_h = rnn(raw_output, hidden[l])
File "/home/e/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/e/Documents/Amal/language-model/pytorch_src/weight_drop.py", line 47, in forward
self._setweights()
File "/home/e/Documents/Amal/language-model/pytorch_src/weight_drop.py", line 44, in _setweights
setattr(self.module, name_w, w)
File "/home/e/anaconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 108, in __setattr__
super(RNNBase, self).__setattr__(attr, value)
File "/home/e/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 801, in __setattr__
.format(torch.typename(value), name))
I downgraded my Python to 3.6, reinstalled all the requirements, and it worked.
So the issue was probably an incompatible torch version.
Newer versions of PyTorch require parameters as torch.nn.Parameter.
I think you need to change the code as follows; at least, it helped me with the same error in code based on the same codebase:
def _setweights(self):
    for name_w in self.weights:
        raw_w = getattr(self.module, name_w + '_raw')
        w = torch.nn.functional.dropout(raw_w, p=self.dropout, training=self.training)
        # Newer PyTorch only accepts nn.Parameter (or None) for registered
        # parameter names, so wrap the dropped-out weights before assigning.
        setattr(self.module, name_w, torch.nn.Parameter(w))
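For context, a standalone sketch of why the wrapping is needed (not from the codebase above):
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=4)

# On newer PyTorch versions, assigning a plain tensor to a registered
# parameter name raises the TypeError from the question:
#   lstm.weight_hh_l0 = torch.randn(16, 4)   # TypeError
# Wrapping it in nn.Parameter is accepted:
lstm.weight_hh_l0 = nn.Parameter(torch.randn(16, 4))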

Pytorch - Caught StopIteration in replica 1 on device 1 error while Training on GPU

I am trying to train a BertPunc model on the train2012 data used in the git link: https://github.com/nkrnrnk/BertPunc.
While running on the server, with 4 GPUs enabled, below is the error I get:
StopIteration: Caught StopIteration in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/stenoaimladmin/.local/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/stenoaimladmin/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/stenoaimladmin/notebooks/model_BertPunc.py", line 16, in forward
x = self.bert(x)
File "/home/stenoaimladmin/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/stenoaimladmin/anaconda3/lib/python3.8/site-packages/pytorch_pretrained_bert/modeling.py", line 861, in forward
sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask,
File "/home/stenoaimladmin/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/stenoaimladmin/anaconda3/lib/python3.8/site-packages/pytorch_pretrained_bert/modeling.py", line 727, in forward
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration
From the link: https://github.com/huggingface/transformers/issues/8145, this appears to be happening when the data gets moved back and forth between multiple GPUs.
As per the git link: https://github.com/interpretml/interpret-text/issues/117, we need to downgrade the PyTorch version from 1.7, which I currently use, to 1.4. For me, downgrading isn't an option, as I have other scripts that use torch 1.7. What should I do to overcome this error?
I can't put the whole code here as there are too many lines, but here is the snippet that gives me the error:
bert_punc, optimizer, best_val_loss = train(bert_punc, optimizer, criterion, epochs_top,
    data_loader_train, data_loader_valid, save_path, punctuation_enc, iterations_top, best_val_loss=1e9)
Here is my DataParallel code:
bert_punc = nn.DataParallel(BertPunc(segment_size, output_size, dropout)).cuda()
I tried changing the DataParallel line to divert the training to only 1 GPU out of the 4 present, but that gave me an out-of-memory issue, so I had to revert the code back to the default.
Here is the link to all scripts that I am using: https://github.com/nkrnrnk/BertPunc
Please advise.
change
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
to
extended_attention_mask = extended_attention_mask.to(dtype=torch.float32) # fp16 compatibility
For more details, see https://github.com/vid-koci/bert-commonsense/issues/6
I second Xiaoou Wang's answer. Just adding the path of the file that needed updating in my environment, for clarity:
"/data/home/cohnstav/anaconda3/envs/BestEnv/lib/python3.8/site-packages/pytorch_pretrained_bert/modeling.py"

Variable_scope runtime error when creating keras custom layer using tensorflow hub models and tensorflow 2.0 as backend

I'm trying to use the pretrained tf-hub elmo model by integrating it into a keras layer.
Keras Layer:
class ElmoEmbeddingLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(ElmoEmbeddingLayer, self).__init__(**kwargs)
        self.dimensions = 1024
        self.trainable = True
        self.elmo = None

    def build(self, input_shape):
        url = 'https://tfhub.dev/google/elmo/2'
        self.elmo = hub.Module(url)
        self._trainable_weights += trainable_variables(
            scope="^{}_module/.*".format(self.name))
        super(ElmoEmbeddingLayer, self).build(input_shape)

    def call(self, x, mask=None):
        result = self.elmo(
            x,
            signature="default",
            as_dict=True)["elmo"]
        return result

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.dimensions
When I run the code I get the following error:
Traceback (most recent call last):
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 170, in <module>
validation_steps=validation_dataset.size())
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 79, in train_gpu
model = build_model(self.config, self.embeddings, self.sequence_len, self.out_classes, summary=True)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 8, in build_model
return my_model(embeddings, config, sequence_length, out_classes, summary)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 66, in my_model
inputs, embedding = resolve_inputs(embeddings, sequence_length, model_config, input_type)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 19, in resolve_inputs
return elmo_input(model_conf)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 58, in elmo_input
embedding = ElmoEmbeddingLayer()(input_text)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 616, in __call__
self._maybe_build(inputs)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1966, in _maybe_build
self.build(input_shapes)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\custom_layers.py", line 21, in build
self.elmo = hub.Module(url)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow_hub\module.py", line 156, in __init__
abs_state_scope = _try_get_state_scope(name, mark_name_scope_used=False)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow_hub\module.py", line 389, in _try_get_state_scope
"name_scope was already taken." % abs_state_scope)
RuntimeError: variable_scope module/ was unused but the corresponding name_scope was already taken.
It seems to be due to the eager execution behaviour. If I disable eager execution, I have to surround the model.fit call with a TensorFlow session and initialize the variables using sess.run(global_variables_initializer()) to avoid the following error:
Traceback (most recent call last):
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 168, in <module>
validation_steps=validation_dataset.size().eval(session=Session()))
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 90, in train_gpu
class_weight=weighted)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\training.py", line 643, in fit
use_multiprocessing=use_multiprocessing)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\training_arrays.py", line 664, in fit
steps_name='steps_per_epoch')
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\training_arrays.py", line 294, in model_iteration
batch_outs = f(actual_inputs)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\backend.py", line 3353, in __call__
run_metadata=self.run_metadata)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\client\session.py", line 1458, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
(0) Failed precondition: Error while reading resource variable module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/class tensorflow::Var does not exist.
[[{{node elmo_embedding_layer/module_apply_default/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/Read/ReadVariableOp}}]]
(1) Failed precondition: Error while reading resource variable module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/class tensorflow::Var does not exist.
[[{{node elmo_embedding_layer/module_apply_default/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/Read/ReadVariableOp}}]]
[[metrics/f1_micro/Identity/_223]]
0 successful operations.
0 derived errors ignored.
My solution:
with Session() as sess:
    sess.run(global_variables_initializer())
    history = model.fit(self.train_data.repeat(),
                        epochs=self.config['epochs'],
                        validation_data=self.validation_data.repeat(),
                        steps_per_epoch=steps_per_epoch,
                        validation_steps=validation_steps,
                        callbacks=self.__callbacks(monitor_metric),
                        class_weight=weighted)
The main question is whether there is another way to use the elmo tf-hub module in a keras custom layer and train my model. Another question is whether my current solution affects training performance or causes the GPU OOM error (I get the OOM error after a few epochs with a higher batch size, which I've found to be related to sessions not being closed or to memory leaks).
If you wrap your model in a Session() block, you will also have to wrap all other code that uses your model in a Session() block, which takes a lot of time and effort. I have another way to deal with it:
First, create the elmo module and register a session with Keras:
elmo_model = hub.Module("https://tfhub.dev/google/elmo/3", trainable=True,
                        name='elmo_module')
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())
K.set_session(sess)
Then, instead of creating the elmo module directly in your ElmoEmbeddingLayer:
self.elmo = hub.Module(url)
self._trainable_weights += trainable_variables(
    scope="^{}_module/.*".format(self.name))
you can do the following; I think it works normally:
self.elmo = elmo_model
self._trainable_weights += trainable_variables(
    scope="^elmo_module/.*")
Here is a simple solution that I used in my case:
This happened to me while I was using a separate Python script to create the module.
To solve it, I passed the tf.Session() from the main script to tf.keras.backend in the other script by creating an entry point that passes it in before the layer's __init__ is called.
Example:
Main file:
import tensorflow.compat.v1 as tf
from ModuleFile import ModuleLayer

def __main__():
    init_args = [...]
    input = ...
    sess = tf.keras.backend.get_session()
    ModuleLayer.__init_session__(sess)
    module_layer = ModuleLayer(init_args)(input)
Module file:
import tensorflow.compat.v1 as tf

class ModuleLayer(tf.keras.layers.Layer):
    @staticmethod
    def __init_session__(session):
        tf.keras.backend.set_session(session)

    def __init__(self, *args):
        ...
Hope that helps :)

Error in Keras when I want to calculate the Sensitivity and Specificity

I am writing code for classification between two types of images based on a CNN.
I want to measure the accuracy, sensitivity, and specificity of my work, but unfortunately I get the following error.
Could you please let me know what my problem is?
m = tf.keras.metrics.SensitivityAtSpecificity(0.5)
model.compile(optimizer='adam', loss=keras.losses.binary_crossentropy, metrics=['accuracy',m])
error:
Traceback (most recent call last):
File "C:/Users/Hamed/PycharmProjects/Deep Learning/CNN.py", line 77, in <module>
validation_steps = 1600//batch_size)
File "C:\Users\Hamed\Anaconda3\envs\tensorflowGPU\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\Users\Hamed\Anaconda3\envs\tensorflowGPU\lib\site-packages\keras\engine\training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "C:\Users\Hamed\Anaconda3\envs\tensorflowGPU\lib\site-packages\keras\engine\training_generator.py", line 217, in fit_generator
class_weight=class_weight)
File "C:\Users\Hamed\Anaconda3\envs\tensorflowGPU\lib\site-packages\keras\engine\training.py", line 1217, in train_on_batch
outputs = self.train_function(ins)
File "C:\Users\Hamed\Anaconda3\envs\tensorflowGPU\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "C:\Users\Hamed\Anaconda3\envs\tensorflowGPU\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "C:\Users\Hamed\Anaconda3\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
run_metadata_ptr)
File "C:\Users\Hamed\Anaconda3\envs\tensorflowGPU\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Resource localhost/false_negatives/class tensorflow::Var does not exist.
[[{{node metrics/sensitivity_at_specificity/AssignAddVariableOp_1}}]]
[[{{node metrics/sensitivity_at_specificity/Mean}}]]
The metric tf.keras.metrics.SensitivityAtSpecificity calculates sensitivity at a given specificity; see the documentation.
Unfortunately sensitivity and specificity metrics are not yet included in Keras, so you have to write your own custom metric as is specified here.
The following is one simple way to calculate specificity found at this answer.
from keras import backend as K

def specificity(y_true, y_pred):
    """
    param:
        y_pred - Predicted labels
        y_true - True labels
    Returns:
        Specificity score
    """
    neg_y_true = 1 - y_true
    neg_y_pred = 1 - y_pred
    fp = K.sum(neg_y_true * y_pred)      # false positives
    tn = K.sum(neg_y_true * neg_y_pred)  # true negatives
    specificity = tn / (tn + fp + K.epsilon())
    return specificity
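A sensitivity (true positive rate) counterpart can be written along the same lines; this is my sketch, not taken from the linked answer:
def sensitivity(y_true, y_pred):
    # TP / (TP + FN), with epsilon to avoid division by zero
    true_positives = K.sum(y_true * y_pred)
    possible_positives = K.sum(y_true)
    return true_positives / (possible_positives + K.epsilon())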
You can get Keras implementations for specificity and sensitivity on this link.
You can try this, if it helps:
import keras

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=[keras.metrics.Precision(), keras.metrics.Recall(),
                       keras.metrics.SpecificityAtSensitivity(0.5),
                       keras.metrics.SensitivityAtSpecificity(0.5), 'accuracy'])

Skorch training object from scratch

I'm trying to use a skorch class to execute GridSearch on a classifier.
I tried running with the vanilla NeuralNetClassifier object, but I haven't found a way to pass the Adam optimizer only the trainable weights (I'm using pre-trained embeddings and would like to keep them frozen). It would be doable if the module were initialized first and those weights then passed with the optimizer__params option, but module needs an uninitialized model. Is there a way around this?
net = NeuralNetClassifier(module=RNN, module__vocab_size=vocab_size, module__hidden_size=hidden_size,
                          module__embedding_dim=embedding_dim, module__pad_id=pad_id,
                          module__dataset=ClaimsDataset, lr=lr, criterion=nn.CrossEntropyLoss,
                          optimizer=torch.optim.Adam, optimizer__weight_decay=35e-3, device='cuda',
                          max_epochs=nb_epochs, warm_start=True)
The code above works. However, with the batch_size set at 64, I've got to run the model for the specified number of epochs on every batch, which is not the behavior I'm seeking. I'd be grateful if someone could suggest a nicer way to do this.
My other issue is with subclassing skorch.NeuralNet. I run into a similar problem: figuring out a way to pass only the trainable weights to the Adam optimizer. The code below is what I've got so far.
class Train(skorch.NeuralNet):
    def __init__(self, module, lr, norm, *args, **kwargs):
        self.module = module
        self.lr = lr
        self.norm = norm
        self.params = [p for p in self.module.parameters(self) if p.requires_grad]
        super(Train, self).__init__(*args, **kwargs)

    def initialize_optimizer(self):
        self.optimizer = torch.optim.Adam(params=self.params, lr=self.lr, weight_decay=35e-3, amsgrad=True)

    def train_step(self, Xi, yi, **fit_params):
        self.module.train()
        self.optimizer.zero_grad()
        yi = variable(yi)
        output = self.module(Xi)
        loss = self.criterion(output, yi)
        loss.backward()
        nn.utils.clip_grad_norm_(self.params, max_norm=self.norm)
        self.optimizer.step()

    def score(self, y_t, y_p):
        return accuracy_score(y_t, y_p)
Initializing the class gives the error:
Traceback (most recent call last):
File "/snap/pycharm-community/74/helpers/pydev/pydevd.py", line 1664, in <module>
main()
File "/snap/pycharm-community/74/helpers/pydev/pydevd.py", line 1658, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/snap/pycharm-community/74/helpers/pydev/pydevd.py", line 1068, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/snap/pycharm-community/74/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/l/Documents/Bsrc/cv.py", line 115, in <module>
main()
File "/home/l/B/src/cv.py", line 86, in main
trainer = Train(module=RNN, criterion=nn.CrossEntropyLoss, lr=lr, norm=max_norm)
File "/home/l/B/src/cv.py", line 22, in __init__
self.params = [p for p in self.module.parameters(self) if p.requires_grad]
File "/home/l/B/src/cv.py", line 22, in <listcomp>
self.params = [p for p in self.module.parameters(self) if p.requires_grad]
File "/home/l/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 739, in parameters
for name, param in self.named_parameters():
AttributeError: 'Train' object has no attribute 'named_parameters'
but module needs an uninitialized model
That is not correct; you can pass an initialized model as well. The documentation of the module parameter states:
It is, however, also possible to pass an instantiated module, e.g. a PyTorch Sequential instance.
The problem is that when passing an initialized model you cannot pass any module__ parameters to the NeuralNet as this would require the module to be re-initialized. But of course that's problematic if you want to do a grid search over module parameters.
A solution for this would be to overwrite initialize_module and, after creating a new instance, load and freeze the parameters (by setting the parameters' requires_grad attribute to False):
def _load_embedding_weights(self):
    # placeholder for actually loading the pre-trained embeddings; wrap
    # them in nn.Parameter so they can be assigned to .weight
    return torch.nn.Parameter(torch.randn(1, 100))

def initialize_module(self):
    kwargs = self._get_params_for('module')
    self.module_ = self.module(**kwargs)
    # load weights
    self.module_.embedding0.weight = self._load_embedding_weights()
    # freeze layer
    self.module_.embedding0.weight.requires_grad = False
    return self
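If you also want the optimizer itself to receive only the trainable parameters, a similar override of initialize_optimizer should work (my sketch along the same lines; exact method signatures vary between skorch versions):
def initialize_optimizer(self):
    kwargs = self._get_params_for('optimizer')
    # hand the optimizer only the parameters that were not frozen above
    params = [p for p in self.module_.parameters() if p.requires_grad]
    self.optimizer_ = self.optimizer(params, **kwargs)
    return self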
