I'm trying to use the skorch library to execute GridSearch on a classifier.
I tried running with the vanilla NeuralNetClassifier object, but I haven't found a way to pass the Adam optimizer only the trainable weights (I'm using pre-trained embeddings and I would like to keep them frozen). It would be doable if the module were initialized first and those weights were then passed with the optimizer__params option, but module needs an uninitialized model. Is there a way around this?
net = NeuralNetClassifier(module=RNN,
                          module__vocab_size=vocab_size,
                          module__hidden_size=hidden_size,
                          module__embedding_dim=embedding_dim,
                          module__pad_id=pad_id,
                          module__dataset=ClaimsDataset,
                          lr=lr,
                          criterion=nn.CrossEntropyLoss,
                          optimizer=torch.optim.Adam,
                          optimizer__weight_decay=35e-3,
                          device='cuda',
                          max_epochs=nb_epochs,
                          warm_start=True)
The code above works. However, with batch_size set to 64, the model runs for the specified number of epochs on every batch, which is not the behavior I'm seeking. I'd be grateful if someone could suggest a nicer way to do this.
My other issue is with subclassing skorch.NeuralNet. I run into a similar problem: figuring out a way to pass only the trainable weights to the Adam optimizer. The code below is what I've got so far.
class Train(skorch.NeuralNet):
    def __init__(self, module, lr, norm, *args, **kwargs):
        self.module = module
        self.lr = lr
        self.norm = norm
        self.params = [p for p in self.module.parameters(self) if p.requires_grad]
        super(Train, self).__init__(*args, **kwargs)

    def initialize_optimizer(self):
        self.optimizer = torch.optim.Adam(params=self.params, lr=self.lr, weight_decay=35e-3, amsgrad=True)

    def train_step(self, Xi, yi, **fit_params):
        self.module.train()
        self.optimizer.zero_grad()
        yi = variable(yi)
        output = self.module(Xi)
        loss = self.criterion(output, yi)
        loss.backward()
        nn.utils.clip_grad_norm_(self.params, max_norm=self.norm)
        self.optimizer.step()

    def score(self, y_t, y_p):
        return accuracy_score(y_t, y_p)
Initializing the class gives the error:
Traceback (most recent call last):
File "/snap/pycharm-community/74/helpers/pydev/pydevd.py", line 1664, in <module>
main()
File "/snap/pycharm-community/74/helpers/pydev/pydevd.py", line 1658, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/snap/pycharm-community/74/helpers/pydev/pydevd.py", line 1068, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/snap/pycharm-community/74/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/l/Documents/Bsrc/cv.py", line 115, in <module>
main()
File "/home/l/B/src/cv.py", line 86, in main
trainer = Train(module=RNN, criterion=nn.CrossEntropyLoss, lr=lr, norm=max_norm)
File "/home/l/B/src/cv.py", line 22, in __init__
self.params = [p for p in self.module.parameters(self) if p.requires_grad]
File "/home/l/B/src/cv.py", line 22, in <listcomp>
self.params = [p for p in self.module.parameters(self) if p.requires_grad]
File "/home/l/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 739, in parameters
for name, param in self.named_parameters():
AttributeError: 'Train' object has no attribute 'named_parameters'
but module needs an uninitialized model
That is not correct; you can pass an initialized model as well. The documentation of the module parameter states:
It is, however, also possible to pass an instantiated module, e.g. a PyTorch Sequential instance.
The problem is that when passing an initialized module you cannot pass any module__ parameters to the NeuralNet, as this would require the module to be re-initialized. That is, of course, problematic if you want to do a grid search over module parameters.
A solution for this is to override initialize_module and, after creating a new instance, load and freeze the parameters (by setting the parameter's requires_grad attribute to False):
def _load_embedding_weights(self):
    return torch.randn(1, 100)

def initialize_module(self):
    kwargs = self._get_params_for('module')
    self.module_ = self.module(**kwargs)

    # load weights (wrapped in nn.Parameter so the module accepts the assignment)
    self.module_.embedding0.weight = nn.Parameter(self._load_embedding_weights())
    # freeze layer
    self.module_.embedding0.weight.requires_grad = False

    return self
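Building on that, the original goal of giving Adam only the trainable weights can be handled the same way; a minimal sketch of an initialize_optimizer override (it reuses the _get_params_for helper from above and assumes lr is not also set via optimizer__lr):

def initialize_optimizer(self):
    # hand the optimizer only the parameters that are still trainable
    trainable = [p for p in self.module_.parameters() if p.requires_grad]
    kwargs = self._get_params_for('optimizer')
    self.optimizer_ = self.optimizer(trainable, lr=self.lr, **kwargs)
    return self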
I have set up multiple PyTorch Lightning projects in the past, and while setting up a new quick demo project I stumbled across this weird error, and somehow I cannot get rid of it.
Here are the relevant sections of my model file:
class TSModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 10, kernel_size=(3, 3), padding=(1, 1)),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))
        )
        self.classifier = nn.Sequential(
            nn.Linear(10*16*16, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        N = x.shape[0]
        x = self.backbone(x)
        x = x.view(N, -1)
        return self.classifier(x)

    def configure_optimizers(self):
        params = [p for p in self.parameters() if p.requires_grad]
        return torch.optim.AdamW(self.parameters())
However, when starting the training process, the program exits and the following is thrown:
Traceback (most recent call last):
File "/torchserve-example/main.py", line 25, in <module>
ts_train()
File "/torchserve-example/main.py", line 21, in ts_train
trainer.fit(model, datamodule)
File ".local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
self._run(model)
File ".local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 715, in _run
self.accelerator.setup(self, model) # note: this sets up self.lightning_module
File ".local/lib/python3.8/site-packages/pytorch_lightning/accelerators/cpu.py", line 39, in setup
return super().setup(trainer, model)
File ".local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in setup
self.setup_optimizers(trainer)
File ".local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 374, in setup_optimizers
optimizers, lr_schedulers, optimizer_frequencies = self.training_type_plugin.init_optimizers(
File ".local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 190, in init_optimizers
return trainer.init_optimizers(model)
File ".local/lib/python3.8/site-packages/pytorch_lightning/trainer/optimizers.py", line 34, in init_optimizers
optim_conf = model.configure_optimizers()
File "/torchserve-example/model.py", line 52, in configure_optimizers
return torch.optim.AdamW(self.parameters())
File ".local/lib/python3.8/site-packages/torch/optim/adamw.py", line 47, in __init__
super(AdamW, self).__init__(params, defaults)
File ".local/lib/python3.8/site-packages/torch/optim/optimizer.py", line 55, in __init__
self.add_param_group(param_group)
File ".local/lib/python3.8/site-packages/torch/optim/optimizer.py", line 242, in add_param_group
assert isinstance(param_group, dict), "param group must be a dict"
AssertionError: param group must be a dict
When I execute print(type(params[0])) in configure_optimizers, it prints <class 'torch.nn.parameter.Parameter'> to stdout. Any idea what went wrong here?
Note: as this error occurs during initialization of the optimizer, it is probably not directly related to PyTorch Lightning, which is why I included pytorch as a tag as well.
In the library code, I found:
# if not isinstance(param_groups[0], dict):
# param_groups = [{'params': param_groups}]
When I comment these lines back in, everything works normally.
I'm leaving the question open because it does not feel like a good solution to change the underlying library, or to just copy this code section into my own file.
Actually, this line in your code is wrong:
def configure_optimizers(self):
    params = [p for p in self.parameters() if p.requires_grad]
    return torch.optim.AdamW(self.parameters())
You are passing params, not self.parameters(); the latter would work fine. With params created like this, you are essentially passing a list with a generator inside, which is not an instance of dict.
In PyTorch it is possible to pass multiple groups of parameters, each with its own learning rate etc., as a list of dicts. This is what your params looks like to PyTorch's API.
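For reference, the list-of-dicts form is meant for cases like this; a sketch reusing the backbone/classifier split from TSModel above (the learning rates are arbitrary):

# one dict per parameter group, each with its own hyperparameters
optimizer = torch.optim.AdamW([
    {'params': model.backbone.parameters(), 'lr': 1e-4},
    {'params': model.classifier.parameters(), 'lr': 1e-3},
])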
I am trying to run the BART language model for a text generation task.
My code was working fine when I used it for another encoder-decoder model (T5), but with BART I am getting this error:
File "train_bart.py", line 89, in train
outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, labels=lm_labels)
File ".../venv/tf_23/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File ".../venv/tf_23/lib/python3.6/site-packages/transformers/models/bart/modeling_bart.py", line 1308, in forward
return_dict=return_dict,
File ".../venv/tf_23/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File ".../venv/tf_23/lib/python3.6/site-packages/transformers/models/bart/modeling_bart.py", line 1196, in forward
return_dict=return_dict,
File ".../venv/tf_23/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File ".../venv/tf_23/lib/python3.6/site-packages/transformers/models/bart/modeling_bart.py", line 985, in forward
attention_mask, input_shape, inputs_embeds, past_key_values_length
File ".../venv/tf_23/lib/python3.6/site-packages/transformers/models/bart/modeling_bart.py", line 866, in _prepare_decoder_attent
ion_mask
).to(self.device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
And this is where the error happens:
for _, data in tqdm(enumerate(loader, 0), total=len(loader), desc='Processing batches..'):
    y = data['target_ids'].to(device, dtype=torch.long)
    y_ids = y[:, :-1].contiguous()
    lm_labels = y[:, 1:].clone().detach()
    lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
    ids = data['source_ids'].to(device, dtype=torch.long)
    mask = data['source_mask'].to(device, dtype=torch.long)

    outputs = model(input_ids=ids, attention_mask=mask, decoder_input_ids=y_ids, labels=lm_labels)
    loss = outputs[0]
loader is the tokenized and processed data.
I suggest you change the batch size to 1 and run the code on the CPU temporarily to get a more descriptive traceback.
This will tell you where the bug is.
After fighting with this for many hours, I found that the error was due to adding new tokens to the BART tokenizer. I therefore needed to resize the model's input embeddings matrix:
model.resize_token_embeddings(len(tokenizer))
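For context, the pattern that caused the mismatch and the fix look like this; a small sketch (the token strings are placeholders):

# adding tokens makes the tokenizer vocabulary larger than the model's embedding matrix
tokenizer.add_tokens(['<new_tok_1>', '<new_tok_2>'])
# so the embedding matrix must be resized to match
model.resize_token_embeddings(len(tokenizer))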
The point that is still not clear to me is that, without resizing the embeddings matrix, I was able to fine-tune the T5 model without any problem, but not BART.
Maybe this is because BART shares weights between the input and the output layers (I am not sure of this either).
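One way to check that weight sharing through the Hugging Face API; a small sketch, with the checkpoint name only as an example:

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')
# prints True if the LM head reuses the same Parameter object as the input embeddings
print(model.get_input_embeddings().weight is model.get_output_embeddings().weight)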
I built a Keras model with a custom layer, and it was saved to a .h5 file by the ModelCheckpoint callback.
When I tried to load this model after training, the error message below showed up:
__init__() missing 1 required positional argument: 'pool_size'
This is the definition of the custom layer and its __init__ method:
class MyMeanPooling(Layer):
    def __init__(self, pool_size, axis=1, **kwargs):
        self.supports_masking = True
        self.pool_size = pool_size
        self.axis = axis
        self.y_shape = None
        self.y_mask = None
        super(MyMeanPooling, self).__init__(**kwargs)
This is how I add this layer to my model:
x = MyMeanPooling(globalvars.pool_size)(x)
This is how I load the model:
from keras.models import load_model
model = load_model(model_path, custom_objects={'MyMeanPooling': MyMeanPooling})
This is the full error message:
Traceback (most recent call last):
File "D:/My Projects/Attention_BLSTM/script3.py", line 9, in <module>
model = load_model(model_path, custom_objects={'MyMeanPooling': MyMeanPooling})
File "D:\ProgramData\Anaconda3\envs\tf\lib\site-packages\keras\engine\saving.py", line 419, in load_model
model = _deserialize_model(f, custom_objects, compile)
File "D:\ProgramData\Anaconda3\envs\tf\lib\site-packages\keras\engine\saving.py", line 225, in _deserialize_model
model = model_from_config(model_config, custom_objects=custom_objects)
File "D:\ProgramData\Anaconda3\envs\tf\lib\site-packages\keras\engine\saving.py", line 458, in model_from_config
return deserialize(config, custom_objects=custom_objects)
File "D:\ProgramData\Anaconda3\envs\tf\lib\site-packages\keras\layers\__init__.py", line 55, in deserialize
printable_module_name='layer')
File "D:\ProgramData\Anaconda3\envs\tf\lib\site-packages\keras\utils\generic_utils.py", line 145, in deserialize_keras_object
list(custom_objects.items())))
File "D:\ProgramData\Anaconda3\envs\tf\lib\site-packages\keras\engine\network.py", line 1022, in from_config
process_layer(layer_data)
File "D:\ProgramData\Anaconda3\envs\tf\lib\site-packages\keras\engine\network.py", line 1008, in process_layer
custom_objects=custom_objects)
File "D:\ProgramData\Anaconda3\envs\tf\lib\site-packages\keras\layers\__init__.py", line 55, in deserialize
printable_module_name='layer')
File "D:\ProgramData\Anaconda3\envs\tf\lib\site-packages\keras\utils\generic_utils.py", line 147, in deserialize_keras_object
return cls.from_config(config['config'])
File "D:\ProgramData\Anaconda3\envs\tf\lib\site-packages\keras\engine\base_layer.py", line 1109, in from_config
return cls(**config)
TypeError: __init__() missing 1 required positional argument: 'pool_size'
Actually, I don't think you can load this model.
The most likely issue is that you did not implement the get_config() method in your layer. This method returns a dictionary of configuration values that should be saved:
def get_config(self):
    config = {'pool_size': self.pool_size,
              'axis': self.axis}
    base_config = super(MyMeanPooling, self).get_config()
    return dict(list(base_config.items()) + list(config.items()))
You have to retrain the model after adding this method to your layer, as the previously saved model does not have the configuration for this layer saved into it. That is why you cannot load it; it requires retraining after making this change.
From the answer by LiamHe (commented on Sep 27, 2017) on the following issue: https://github.com/keras-team/keras/issues/4871.
I met the same problem today: TypeError: __init__() missing 1 required positional argument. Here is how I solved the problem (Keras 2.0.2):
Give the positional arguments of the layer default values.
Override the layer's get_config function with something like:
def get_config(self):
    config = super().get_config()
    config['pool_size'] = self.pool_size  # or wherever you store the argument from __init__
    return config
Add the layer class to custom_objects when you are loading the model.
If you don't have enough time to retrain the model as in Matias Valdenegro's solution, you can set a default value for pool_size in the MyMeanPooling class, as in the following code. Note that the default must be consistent with the value used while training the model. Then you can load the model.
class MyMeanPooling(Layer):
    # The default value of pool_size should be consistent with the value used while training the model
    def __init__(self, pool_size=2, axis=1, **kwargs):
        self.supports_masking = True
        self.pool_size = pool_size
        self.axis = axis
        self.y_shape = None
        self.y_mask = None
        super(MyMeanPooling, self).__init__(**kwargs)
ref: https://www.jianshu.com/p/e97112c34e43
I'm trying to use the pretrained tf-hub ELMo model by integrating it into a Keras layer.
Keras Layer:
class ElmoEmbeddingLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(ElmoEmbeddingLayer, self).__init__(**kwargs)
        self.dimensions = 1024
        self.trainable = True
        self.elmo = None

    def build(self, input_shape):
        url = 'https://tfhub.dev/google/elmo/2'
        self.elmo = hub.Module(url)
        self._trainable_weights += trainable_variables(
            scope="^{}_module/.*".format(self.name))
        super(ElmoEmbeddingLayer, self).build(input_shape)

    def call(self, x, mask=None):
        result = self.elmo(
            x,
            signature="default",
            as_dict=True)["elmo"]
        return result

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.dimensions
When I run the code I get the following error:
Traceback (most recent call last):
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 170, in <module>
validation_steps=validation_dataset.size())
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 79, in train_gpu
model = build_model(self.config, self.embeddings, self.sequence_len, self.out_classes, summary=True)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 8, in build_model
return my_model(embeddings, config, sequence_length, out_classes, summary)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 66, in my_model
inputs, embedding = resolve_inputs(embeddings, sequence_length, model_config, input_type)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 19, in resolve_inputs
return elmo_input(model_conf)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\models.py", line 58, in elmo_input
embedding = ElmoEmbeddingLayer()(input_text)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 616, in __call__
self._maybe_build(inputs)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1966, in _maybe_build
self.build(input_shapes)
File "D:\Google Drive\Licenta\Gemini\Emotion Analysis\nn\architectures\custom_layers.py", line 21, in build
self.elmo = hub.Module(url)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow_hub\module.py", line 156, in __init__
abs_state_scope = _try_get_state_scope(name, mark_name_scope_used=False)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow_hub\module.py", line 389, in _try_get_state_scope
"name_scope was already taken." % abs_state_scope)
RuntimeError: variable_scope module/ was unused but the corresponding name_scope was already taken.
It seems to be due to the eager execution behaviour. If I disable eager execution, I have to wrap the model.fit call in a TensorFlow session and initialize the variables with sess.run(global_variables_initializer()) to avoid the following error:
Traceback (most recent call last):
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 168, in <module>
validation_steps=validation_dataset.size().eval(session=Session()))
File "D:/Google Drive/Licenta/Gemini/Emotion Analysis/nn/trainer/model.py", line 90, in train_gpu
class_weight=weighted)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\training.py", line 643, in fit
use_multiprocessing=use_multiprocessing)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\training_arrays.py", line 664, in fit
steps_name='steps_per_epoch')
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\engine\training_arrays.py", line 294, in model_iteration
batch_outs = f(actual_inputs)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\keras\backend.py", line 3353, in __call__
run_metadata=self.run_metadata)
File "D:\Apps\Anaconda\envs\tf2.0\lib\site-packages\tensorflow\python\client\session.py", line 1458, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
(0) Failed precondition: Error while reading resource variable module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/class tensorflow::Var does not exist.
[[{{node elmo_embedding_layer/module_apply_default/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/Read/ReadVariableOp}}]]
(1) Failed precondition: Error while reading resource variable module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/module/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/class tensorflow::Var does not exist.
[[{{node elmo_embedding_layer/module_apply_default/bilm/RNN_0/RNN/MultiRNNCell/Cell1/rnn/lstm_cell/bias/Read/ReadVariableOp}}]]
[[metrics/f1_micro/Identity/_223]]
0 successful operations.
0 derived errors ignored.
My solution:
with Session() as sess:
    sess.run(global_variables_initializer())
    history = model.fit(self.train_data.repeat(),
                        epochs=self.config['epochs'],
                        validation_data=self.validation_data.repeat(),
                        steps_per_epoch=steps_per_epoch,
                        validation_steps=validation_steps,
                        callbacks=self.__callbacks(monitor_metric),
                        class_weight=weighted)
The main question is whether there is another way to use the ELMo tf-hub module in a custom Keras layer and train my model. Another question is whether my current solution affects training performance or causes the GPU OOM error (I get the OOM error after a few epochs with a higher batch size, which I've found to be related to sessions not being closed, or to memory leaks).
If you wrap your model in a Session() block, you will also have to wrap all other code that uses your model in a Session() block, which takes a lot of time and effort. I have another way to deal with it.
First, create an elmo module and register a session with Keras:
elmo_model = hub.Module("https://tfhub.dev/google/elmo/3", trainable=True,
                        name='elmo_module')
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())
K.set_session(sess)
Then, instead of creating the elmo module directly in your ElmoEmbeddingLayer:
self.elmo = hub.Module(url)
self._trainable_weights += trainable_variables(
    scope="^{}_module/.*".format(self.name))
you can do the following; I think it works normally:
self.elmo = elmo_model
self._trainable_weights += trainable_variables(
    scope="^elmo_module/.*")
Here is a simple solution that I used in my case:
This happened to me while I was using a separate Python script to create the module.
To solve it, I passed the tf.Session() from the main script to the tf.keras.backend in the other script, by creating an entry point to pass it before calling the layer's __init__.
Example:
Main file:
import tensorflow.compat.v1 as tf
from ModuleFile import ModuleLayer

def __main__():
    init_args = [...]
    input = ...
    sess = tf.keras.backend.get_session()
    ModuleLayer.__init_session__(sess)
    module_layer = ModuleLayer(init_args)(input)
Module file:
import tensorflow.compat.v1 as tf

class ModuleLayer(tf.keras.layers.Layer):
    @staticmethod
    def __init_session__(session):
        tf.keras.backend.set_session(session)

    def __init__(self, *args):
        ...
Hope that helps :)
I'm trying to use torch.nn.DataParallel for an RNN model. My model looks like this:
class EncoderRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, input_batch, input_batch_length, hidden):
        embedded = self.embedding(input_batch)
        packed_input = nn.utils.rnn.pack_padded_sequence(embedded, input_batch_length.cpu().numpy(),
                                                         batch_first=True)
        output, hidden = self.gru(packed_input, hidden)
        return output, hidden

class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, target_batch, target_batch_length, hidden, train=False):
        embedded = self.embedding(target_batch)
        output = F.relu(embedded)
        if train:
            # minus 1 to eliminate <EOS>
            packed_target = nn.utils.rnn.pack_padded_sequence(output, (target_batch_length - 1).cpu().numpy(),
                                                              batch_first=True)
            output, hidden = self.gru(packed_target, hidden)
            output = self.softmax(self.out(output[0]))
        return output, hidden
And I implemented DataParallel like this when declaring the model:
encoder = nn.DataParallel(encoder)
decoder = nn.DataParallel(decoder)
The code runs on a server with 4 GPUs, and I received the following error message:
/home/cjunjie/NLP/DocSummarization/model.py:18: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
output, hidden = self.gru(packed_input, hidden)
Traceback (most recent call last):
File "train.py", line 144, in <module>
train_iteration(encoder, decoder, fileDataSet)
File "train.py", line 110, in train_iteration
target_indices, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
File "train.py", line 41, in train
encoder_output, encoder_hidden = encoder(input_batch, input_batch_length, encoder_hidden)
File "/home/cjunjie/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/home/cjunjie/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 74, in forward
return self.gather(outputs, self.output_device)
File "/home/cjunjie/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 86, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/cjunjie/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 65, in gather
return gather_map(outputs)
File "/home/cjunjie/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 60, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/cjunjie/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 60, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/cjunjie/anaconda3/lib/python3.6/site-packages/torch/nn/utils/rnn.py", line 39, in __new__
return super(PackedSequence, cls).__new__(cls, *args[0])
File "/home/cjunjie/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 57, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/cjunjie/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 58, in forward
assert all(map(lambda i: i.is_cuda, inputs))
AssertionError
I searched for the same problem, but none of the results offered a solution. Can anyone help?
In order to run the code on GPUs, you need to copy both the variables and the model weights to CUDA. I suspect you did not copy the model weights. To do that, run:
encoder.cuda()
decoder.cuda()
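For completeness, a minimal sketch of the whole setup (names follow the question's code; the inputs and the initial hidden state must live on the GPU as well):

# move the model weights to the GPU, then wrap in DataParallel
encoder = nn.DataParallel(EncoderRNN(vocab_size, hidden_size).cuda())
decoder = nn.DataParallel(DecoderRNN(hidden_size, vocab_size).cuda())

# the tensors fed to forward() must be CUDA tensors as well
input_batch = input_batch.cuda()
encoder_hidden = encoder_hidden.cuda()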