I am trying to use WandB gradient visualization to debug the gradient flow in my neural net on Google Colab. Without WandB logging, training runs without error, using 11 GB of the 16 GB on the P100 GPU. However, adding the line wandb.watch(model, log='all', log_freq=3) causes a CUDA out of memory error.
How does WandB logging create extra GPU memory overhead?
Is there some way to reduce the overhead?
--adding training loop code--
learning_rate = 0.001
num_epochs = 50
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = MyModel()
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
wandb.watch(model, log='all', log_freq=3)
for epoch in range(num_epochs):
    train_running_loss = 0.0
    train_accuracy = 0.0
    model = model.train()

    ## training step
    for i, (name, output_array, input) in enumerate(trainloader):
        output_array = output_array.to(device)
        input = input.to(device)
        comb = torch.zeros(1, 1, 100, 1632).to(device)

        ## forward + backprop + loss
        output = model(input, comb)
        loss = my_loss(output, output_array)
        optimizer.zero_grad()
        loss.backward()

        ## update model params
        optimizer.step()

        train_running_loss += loss.detach().item()
        temp = get_accuracy(output, output_array)
        print('check 13')
        !nvidia-smi | grep MiB | awk '{print $9 $10 $11}'
        train_accuracy += temp
-----edit-----
I think WandB is creating an extra copy of the gradient during logging preprocessing. Here is the traceback:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-11-13de83557b55> in <module>()
60 get_ipython().system("nvidia-smi | grep MiB | awk '{print $9 $10 $11}'")
61
---> 62 loss.backward()
63
64 print('check 10')
4 frames
/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
253 create_graph=create_graph,
254 inputs=inputs)
--> 255 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
256
257 def register_hook(self, hook):
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
147 Variable._execution_engine.run_backward(
148 tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 149 allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
150
151
/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py in <lambda>(grad)
283 self.log_tensor_stats(grad.data, name)
284
--> 285 handle = var.register_hook(lambda grad: _callback(grad, log_track))
286 self._hook_handles[name] = handle
287 return handle
/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py in _callback(grad, log_track)
281 if not log_track_update(log_track):
282 return
--> 283 self.log_tensor_stats(grad.data, name)
284
285 handle = var.register_hook(lambda grad: _callback(grad, log_track))
/usr/local/lib/python3.7/dist-packages/wandb/wandb_torch.py in log_tensor_stats(self, tensor, name)
219 # Remove nans from tensor. There's no good way to represent that in histograms.
220 flat = flat[~torch.isnan(flat)]
--> 221 flat = flat[~torch.isinf(flat)]
222 if flat.shape == torch.Size([0]):
223 # Often the whole tensor is nan or inf. Just don't log it in that case.
RuntimeError: CUDA out of memory. Tried to allocate 4.65 GiB (GPU 0; 15.90 GiB total capacity; 10.10 GiB already allocated; 717.75 MiB free; 14.27 GiB reserved in total by PyTorch)
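As a plain-PyTorch illustration of the suspected copy (not specific to WandB): boolean-mask indexing such as flat = flat[~torch.isinf(flat)] always materializes a new tensor, so the histogram preprocessing briefly needs a bool mask plus a copy of the kept values on top of the gradient itself.
import torch

# Boolean-mask indexing returns a copy, not a view.
flat = torch.randn(10_000_000, device="cuda")  # stand-in for a flattened gradient
mask = ~torch.isinf(flat)                      # extra bool tensor
filtered = flat[mask]                          # new float tensor, roughly the same size as flat
print(filtered.data_ptr() != flat.data_ptr())  # True: separate storage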
---update----
Indeed, commenting out the offending line flat = flat[~torch.isinf(flat)] gets the logging step to just barely fit into GPU memory.
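If the full histograms are not strictly needed, one way to reduce the overhead (a suggestion based on the standard wandb.watch arguments; the exact memory savings are not measured here) is to log less often, log only gradients, or watch just the submodule being debugged:
# Log gradient histograms only, and far less often than every 3 steps
wandb.watch(model, log="gradients", log_freq=500)

# Or watch only a submodule (model.encoder is a hypothetical name)
# wandb.watch(model.encoder, log="all", log_freq=100)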
Related
I am working through the PyTorch Lightning tutorial:
https://pytorch-lightning.readthedocs.io/en/stable/starter/introduction.html
Because I wanted to try GPU training, I changed the definition of the trainer as below.
trainer = pl.Trainer(limit_train_batches=100, max_epochs=1, gpus=1)
Then I got the following error.
RuntimeError Traceback (most recent call last)
Cell In [3], line 4
1 # train the model (hint: here are some helpful Trainer arguments for rapid idea iteration)
2 # trainer = pl.Trainer(limit_train_batches=100, max_epochs=3)
3 trainer = pl.Trainer(limit_train_batches=100, max_epochs=3, accelerator='gpu', devices=1)
----> 4 trainer.fit(model=autoencoder, train_dataloaders=train_loader)
File ~/miniconda3/envs/py38-cu116/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:696, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
677 r"""
678 Runs the full optimization routine.
679
(...)
693 datamodule: An instance of :class:`~pytorch_lightning.core.datamodule.LightningDataModule`.
694 """
695 self.strategy.model = model
--> 696 self._call_and_handle_interrupt(
697 self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
698 )
File ~/miniconda3/envs/py38-cu116/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:650, in Trainer._call_and_handle_interrupt(self, trainer_fn, *args, **kwargs)
648 return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
649 else:
--> 650 return trainer_fn(*args, **kwargs)
651 # TODO(awaelchli): Unify both exceptions below, where `KeyboardError` doesn't re-raise
652 except KeyboardInterrupt as exception:
[...]
File ~/miniconda3/envs/py38-cu116/lib/python3.8/site-packages/pytorch_lightning/core/module.py:1450, in LightningModule.backward(self, loss, optimizer, optimizer_idx, *args, **kwargs)
1433 def backward(
1434 self, loss: Tensor, optimizer: Optional[Optimizer], optimizer_idx: Optional[int], *args, **kwargs
1435 ) -> None:
1436 """Called to perform backward on the loss returned in :meth:`training_step`. Override this hook with your
1437 own implementation if you need to.
1438
(...)
1448 loss.backward()
1449 """
-> 1450 loss.backward(*args, **kwargs)
File ~/miniconda3/envs/py38-cu116/lib/python3.8/site-packages/torch/_tensor.py:396, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
387 if has_torch_function_unary(self):
388 return handle_torch_function(
389 Tensor.backward,
390 (self,),
(...)
394 create_graph=create_graph,
395 inputs=inputs)
--> 396 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File ~/miniconda3/envs/py38-cu116/lib/python3.8/site-packages/torch/autograd/__init__.py:173, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
168 retain_graph = create_graph
170 # The reason we repeat same the comment below is that
171 # some Python versions print out the first line of a multi-line function
172 # calls in the traceback and some print out the last line
--> 173 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
174 tensors, grad_tensors_, retain_graph, create_graph, inputs,
175 allow_unreachable=True, accumulate_grad=True)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
The only thing I added to the tutorial code is gpus=1, so I cannot figure out what the problem is. How can I fix this?
FYI, I tried passing devices=1, accelerator='ddp' instead of gpus=1, and got the following error.
ValueError: You selected an invalid accelerator name: `accelerator='ddp'`. Available names are: cpu, cuda, hpu, ipu, mps, tpu.
My environments are:
CUDA 11.6
Python 3.8.13
PyTorch 1.12.1
PyTorch Lightning 1.7.7
I think you made a mistake in the trainer's arguments:
accelerator selects the hardware type (cpu, cuda, hpu, ipu, mps, tpu; gpu also works as an alias);
devices is the number of devices (or a list of device indices) to use;
and "ddp" should be passed to the strategy argument:
trainer = pl.Trainer(
    accelerator="gpu",
    devices=[0],
    strategy="ddp"
)
hope it helps!
Though I'm not sure about the reason, the issue disappeared when I used Python 3.10 instead of 3.8.
I am trying to build a segmentation model using PyTorch and implement a custom IoULoss:
def IoULoss(inputs, targets, smooth=1e-6):
    inputs = (inputs.view(inputs.size(0), -1) > 0.5)
    targets = targets.view(targets.size(0), -1)
    intersection = (inputs & targets).float().sum(1)
    union = (inputs | targets).float().sum(1)
    IoU = (intersection + smooth) / (union + smooth)
    return 1 - IoU.mean()
But when I train the model, I get this error:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Is there any good way to cast my predictions to labels?
Full error traceback:
RuntimeError Traceback (most recent call last)
<ipython-input-53-3bfc1b43c8ba> in <module>()
----> 1 my_train(model, 30, torch.optim.Adam(model.parameters(), lr=0.01), IoULoss, train_loader)
2 frames
<ipython-input-41-ebe9c66b1806> in my_train(clf, epochs, optimizer, criterion, train_data, test_data)
22 epoch_loss += loss.item()
23
---> 24 loss.backward()
25 optimizer.step()
26
/usr/local/lib/python3.7/dist-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
253 create_graph=create_graph,
254 inputs=inputs)
--> 255 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
256
257 def register_hook(self, hook):
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
147 Variable._execution_engine.run_backward(
148 tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 149 allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
150
151
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Training loop:
def my_train(clf, epochs, optimizer, criterion, train_data, test_data=None):
    cur_min_loss = 10e8
    train_losses = []
    for epoch_step in range(epochs):
        epoch_loss = 0.0
        for i, batch in enumerate(train_data):
            X, y = batch
            optimizer.zero_grad()
            prediction = clf(X)
            loss = criterion(prediction, y)
            epoch_loss += loss.item()
            loss.backward()
            optimizer.step()
            del prediction
            del X
            del y
            torch.cuda.empty_cache()
        train_losses.append(epoch_loss / (i + 1))
The criterion is IoULoss, the final activation of clf is a Sigmoid, the optimizer is Adam, and train_data is a custom dataset inherited from the PyTorch Dataset class.
The first expression in your loss function:
inputs.view(inputs.size(0), -1) > 0.5
is not a differentiable operation, so the gradient cannot propagate through it.
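A common workaround is to compute a "soft" IoU directly on the sigmoid probabilities inside the loss, so every operation stays differentiable, and keep the 0.5 threshold only for evaluation metrics. A minimal sketch (the name SoftIoULoss is just illustrative):
def SoftIoULoss(inputs, targets, smooth=1e-6):
    # inputs are sigmoid probabilities in [0, 1]; no thresholding here,
    # so gradients can flow back through the predictions
    inputs = inputs.view(inputs.size(0), -1)
    targets = targets.view(targets.size(0), -1).float()
    intersection = (inputs * targets).sum(1)
    union = inputs.sum(1) + targets.sum(1) - intersection
    IoU = (intersection + smooth) / (union + smooth)
    return 1 - IoU.mean()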
I ran this block of code using TF 2.2.0, Keras and some TPU config:
try:
    TPU_WORKER = os.environ["TPU_NAME"]
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print(f"Running on TPU: {tpu.cluster_spec().as_dict()['worker']}")
    print(f"TPU_WORKER: {TPU_WORKER}")
except ValueError:
    tpu = None
    gpus = tf.config.experimental.list_logical_devices("GPU")

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
elif len(gpus) > 1:  # multiple GPUs on the VM
    strategy = tf.distribute.MirroredStrategy(gpus)
else:
    strategy = tf.distribute.get_strategy()
and got this error message:
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-27-a49335a43189> in <module>
15
16 if tpu:
---> 17 tf.config.experimental_connect_to_cluster(tpu)
18 tf.tpu.experimental.initialize_tpu_system(tpu)
19 strategy = tf.distribute.experimental.TPUStrategy(tpu)
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/remote.py in connect_to_cluster(cluster_spec_or_resolver, job_name, task_index, protocol, make_master_device_default, cluster_device_filters)
181 context.set_server_def(server_def)
182 else:
--> 183 context.update_server_def(server_def)
184
185 if make_master_device_default and isinstance(
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in update_server_def(server_def)
2137
2138 def update_server_def(server_def):
-> 2139 context().update_server_def(server_def)
2140
2141
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in update_server_def(self, server_def, keep_alive_secs)
596 # Current executor might have pending nodes that involves updated remote
597 # devices. Wait for them to finish before updating.
--> 598 self.executor.wait()
599 self.executor.clear_error()
600 pywrap_tfe.TFE_ContextUpdateServerDef(self._context_handle,
/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/executor.py in wait(self)
65 def wait(self):
66 """Waits for ops dispatched in this executor to finish."""
---> 67 pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
68
69 def clear_error(self):
InvalidArgumentError: {{function_node __inference_train_function_75067}} Compilation failure: XLA can't deduce compile time constant output shape for strided slice: [4,?], output shape must be a compile-time constant
[[{{node model/tf_op_layer_strided_slice/strided_slice}}]]
TPU compilation failed
[[tpu_compile_succeeded_assert/_6359544293025479410/_3]]
This error:
InvalidArgumentError: {{function_node __inference_train_function_75067}} Compilation failure: XLA can't deduce compile time constant output shape for strided slice: [4,?], output shape must be a compile-time constant
[[{{node model/tf_op_layer_strided_slice/strided_slice}}]]
TPU compilation failed
[[tpu_compile_succeeded_assert/_6359544293025479410/_3]]
occurred during the previous run, and since then I can't re-run my code.
The workaround is to restart the notebook and re-run it.
But then I get the same error elsewhere:
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<timed exec> in <module>
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
64 def _method_wrapper(self, *args, **kwargs):
65 if not self._in_multi_worker_mode(): # pylint: disable=protected-access
---> 66 return method(self, *args, **kwargs)
67
68 # Running inside `run_distribute_coordinator` already.
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
853 context.async_wait()
854 logs = tmp_logs # No error, now safe to assign to logs.
--> 855 callbacks.on_train_batch_end(step, logs)
856 epoch_logs = copy.copy(logs)
857
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py in on_train_batch_end(self, batch, logs)
387 """
388 if self._should_call_train_batch_hooks:
--> 389 logs = self._process_logs(logs)
390 self._call_batch_hook(ModeKeys.TRAIN, 'end', batch, logs=logs)
391
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py in _process_logs(self, logs)
263 """Turns tensors into numpy arrays or Python scalars."""
264 if logs:
--> 265 return tf_utils.to_numpy_or_python_type(logs)
266 return {}
267
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py in to_numpy_or_python_type(tensors)
521 return t # Don't turn ragged or sparse tensors to NumPy.
522
--> 523 return nest.map_structure(_to_single_numpy_or_python_type, tensors)
524
/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/nest.py in map_structure(func, *structure, **kwargs)
615
616 return pack_sequence_as(
--> 617 structure[0], [func(*x) for x in entries],
618 expand_composites=expand_composites)
619
/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/nest.py in <listcomp>(.0)
615
616 return pack_sequence_as(
--> 617 structure[0], [func(*x) for x in entries],
618 expand_composites=expand_composites)
619
/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py in _to_single_numpy_or_python_type(t)
517 def _to_single_numpy_or_python_type(t):
518 if isinstance(t, ops.Tensor):
--> 519 x = t.numpy()
520 return x.item() if np.ndim(x) == 0 else x
521 return t # Don't turn ragged or sparse tensors to NumPy.
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in numpy(self)
959 """
960 # TODO(slebedev): Consider avoiding a copy for non-CPU or remote tensors.
--> 961 maybe_arr = self._numpy() # pylint: disable=protected-access
962 return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
963
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in _numpy(self)
927 return self._numpy_internal()
928 except core._NotOkStatusException as e:
--> 929 six.raise_from(core._status_to_exception(e.code, e.message), None)
930
931 #property
/opt/conda/lib/python3.7/site-packages/six.py in raise_from(value, from_value)
InvalidArgumentError: {{function_node __inference_train_function_78422}} Compilation failure: XLA can't deduce compile time constant output shape for strided slice: [16,?], output shape must be a compile-time constant
[[{{node model/tf_op_layer_strided_slice/strided_slice}}]]
TPU compilation failed
[[tpu_compile_succeeded_assert/_626429452001451780/_8]]
when trying to train/fit the Keras model, although from the above call stack it's not clear at which point the error occurred.
One more question: how do we clear the cache or buffer that is storing this error, so that we can reset the TPU and run the code again after making changes, without having to restart the session or kernel?
When I run the same TPU initialization code in Colab (runtime set to TPU):
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
print(f"Running on TPU: {tpu.cluster_spec().as_dict()['worker']}")
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)
It runs without errors, reinitializes the TPU, and clears the eager caches; here are the logs:
Running on TPU: ['10.18.71.154:8470']
WARNING:tensorflow:TPU system grpc://10.18.71.154:8470 has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.
WARNING:tensorflow:TPU system grpc://10.18.71.154:8470 has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.
INFO:tensorflow:Initializing the TPU system: grpc://10.18.71.154:8470
INFO:tensorflow:Initializing the TPU system: grpc://10.18.71.154:8470
INFO:tensorflow:Clearing out eager caches
INFO:tensorflow:Clearing out eager caches
....
For your second issue, "[4,?], output shape must be a compile-time constant", give your model's inputs and outputs static shapes when building the model.
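For example, a minimal sketch (the layer sizes and batch/sequence lengths below are made up, not taken from your model): fix the batch size when defining the inputs and when batching the dataset, so XLA sees compile-time-constant shapes instead of [4, ?]:
import tensorflow as tf

batch_size, seq_len = 16, 128  # assumed values for illustration

inputs = tf.keras.Input(shape=(seq_len,), batch_size=batch_size, dtype=tf.int32)
x = tf.keras.layers.Embedding(10000, 64)(inputs)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

# drop_remainder=True keeps every batch at exactly batch_size,
# avoiding the dynamic shapes that XLA cannot compile on TPU
# dataset = dataset.batch(batch_size, drop_remainder=True)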
I am trying to implement a Deep Q-Network (DQN) with a graph convolutional network (GCN) using the Deep Graph Library (DGL). The base code is taken from this repository. However, after I calculate the loss between the policy network and the target network and run loss.backward(), I get TypeError: 'NoneType' object is not iterable. I have printed the loss value and it is not None.
I ran the original code from the repository and it runs perfectly. I have also implemented the GCN code in DGL and it seems to run. I have also visualized the graph using torchviz, but I am unable to find why it is giving this error.
The code snippet is given below:
target = reward_tens + self.gamma * torch.max(self.model_target(observation_tens, self.G) + observation_tens * (-1e5), dim=1)[0]
current_q_values = self.model(last_observation_tens, self.G)
next_q_values = current_q_values.clone()
current_q_values[range(self.minibatch_length), action_tens, :] = target
L = self.criterion(current_q_values, next_q_values)
print('loss:', L.item())
self.optimizer.zero_grad()
L.backward(retain_graph=True)
self.optimizer.step()
loss: 1461729.125
TypeError Traceback (most recent call last)
<ipython-input-17-cd5e862dd609> in <module>()
62
63 if __name__ == "__main__":
---> 64 main()
7 frames
<ipython-input-17-cd5e862dd609> in main()
55 print("Running a single instance simulation...")
56 my_runner = Runner(env_class, agent_class, args.verbose)
---> 57 final_reward = my_runner.loop(graph_dic,args.ngames,args.epoch, args.niter)
58 print("Obtained a final reward of {}".format(final_reward))
59 agent_class.save_model()
<ipython-input-14-45cfc883a37b> in loop(self, graphs, games, nbr_epoch, max_iter)
45 # if self.verbose:
46 # print("Simulation step {}:".format(i))
---> 47 (obs, act, rew, done) = self.step()
48 action_list.append(act)
49
<ipython-input-14-45cfc883a37b> in step(self)
16 #reward = torch.tensor([reward], device=device)
17
---> 18 self.agent.reward(observation, action, reward,done)
19
20 return (observation, action, reward, done)
<ipython-input-16-76d612e8663c> in reward(self, observation, action, reward, done)
129 print('loss:',L.item())
130 self.optimizer.zero_grad()
--> 131 L.backward(retain_graph=True)
132 self.optimizer.step()
133
/usr/local/lib/python3.6/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
148 products. Defaults to ``False``.
149 """
--> 150 torch.autograd.backward(self, gradient, retain_graph, create_graph)
151
152 def register_hook(self, hook):
/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
97 Variable._execution_engine.run_backward(
98 tensors, grad_tensors, retain_graph, create_graph,
---> 99 allow_unreachable=True) # allow_unreachable flag
100
101
/usr/local/lib/python3.6/dist-packages/torch/autograd/function.py in apply(self, *args)
75
76 def apply(self, *args):
---> 77 return self._forward_cls.backward(self, *args)
78
79
/usr/local/lib/python3.6/dist-packages/dgl/backend/pytorch/tensor.py in backward(ctx, grad_out)
394 def backward(ctx, grad_out):
395 reducer, graph, target, in_map, out_map, in_data_nd, out_data_nd, degs \
--> 396 = ctx.backward_cache
397 ctx.backward_cache = None
398 grad_in = None
TypeError: 'NoneType' object is not iterable
Update: Issue solved by building from source on master branch.
Check out this issue for details.
I had the same issue when generating a toy dataset of random graphs in DGL. For each graph I computed the corresponding targets using G.update_all(fn.copy_e('msg'), fn.sum('msg', 'c')) and target = dgl.sum_nodes(G, 'c'). When I called loss.backward() I got the same error you do.
I fixed this by wrapping the creation of my DataLoader object in torch.no_grad(), so that loss.backward() does not call the backward() functions inside target, where ctx.backward_cache is None because the forward() of the CopyReduce() class in dgl/tensor.py was never called before it.
I am not sure my fix applies directly to your problem, but you should check whether you have a backward pass without a prior forward pass, or whether the tensors in your loss function refer to the same computations in the graph and thereby call backward() twice.
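In code, the fix looks roughly like this (a sketch; G and the message/field names follow the toy setup described above and will differ in your code):
import torch
import dgl
import dgl.function as fn

with torch.no_grad():
    # build the targets without recording autograd history, so loss.backward()
    # never reaches a DGL backward() whose forward() ran outside training
    G.update_all(fn.copy_e('msg', 'm'), fn.sum('m', 'c'))
    target = dgl.sum_nodes(G, 'c')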
I hope this helps.
I am getting the following error while doing sequence-to-sequence on characters: the characters are fed to an LSTM and decoded to words using attention. The forward propagation is fine, but while computing loss.backward() I get the following error.
RuntimeError: Gradients aren't CUDA tensors
My train() function is as follows.
def train(input_batch, input_batch_length, target_batch, target_batch_length, batch_size):
    # Zero gradients of both optimizers
    encoderchar_optimizer.zero_grad()
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    encoder_input = Variable(torch.FloatTensor(len(input_batch), batch_size, 500))
    for ix, w in enumerate(input_batch):
        w = w.contiguous().view(15, batch_size)
        reshaped_input_length = [x[ix] for x in input_batch_length]  # [15 ,.. 30 times] * 128
        if USE_CUDA:
            w = w.cuda()
            # reshaped_input_length = Variable(torch.LongTensor(reshaped_input_length)).cuda()
        hidden_all, output = encoderchar(w, reshaped_input_length)
        encoder_input[ix] = output.transpose(0, 1).contiguous().view(batch_size, -1)

    if USE_CUDA:
        encoder_input = encoder_input.cuda()

    temporary_target_batch_length = [15] * batch_size
    encoder_hidden_all, encoder_output = encoder(encoder_input, target_batch_length)
    decoder_input = Variable(torch.LongTensor([SOS_token] * batch_size))
    decoder_hidden = encoder_output
    max_target_length = max(temporary_target_batch_length)
    all_decoder_outputs = Variable(torch.zeros(max_target_length, batch_size, decoder.output_size))

    # Move new Variables to CUDA
    if USE_CUDA:
        decoder_input = decoder_input.cuda()
        all_decoder_outputs = all_decoder_outputs.cuda()
        target_batch = target_batch.cuda()

    # Run through decoder one time step at a time
    for t in range(max_target_length):
        decoder_output, decoder_hidden, decoder_attn = decoder(
            decoder_input, decoder_hidden, encoder_hidden_all
        )
        all_decoder_outputs[t] = decoder_output
        decoder_input = target_batch[t]  # Next input is current target
        if USE_CUDA:
            decoder_input = decoder_input.cuda()

    # Loss calculation and backpropagation
    loss = masked_cross_entropy(
        all_decoder_outputs.transpose(0, 1).contiguous(),  # -> batch x seq
        target_batch.transpose(0, 1).contiguous(),  # -> batch x seq
        target_batch_length
    )
    loss.backward()

    # Clip gradient norms
    ecc = torch.nn.utils.clip_grad_norm(encoderchar.parameters(), clip)
    ec = torch.nn.utils.clip_grad_norm(encoder.parameters(), clip)
    dc = torch.nn.utils.clip_grad_norm(decoder.parameters(), clip)

    # Update parameters with optimizers
    encoderchar_optimizer.step()
    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.data[0], ec, dc
Full Stack Trace is here.
RuntimeError Traceback (most recent call last)
<ipython-input-10-9778e12ded02> in <module>()
11 data_target_batch_index= Variable(torch.LongTensor(data_target_batch_index)).transpose(0,1)
12 # Send the data for training
---> 13 loss, ar1, ar2 = train(data_input_batch_index, data_input_batch_length, data_target_batch_index, data_target_batch_length, batch_size)
14
15 # Keep track of loss
<ipython-input-8-9c71c385f8cd> in train(input_batch, input_batch_length, target_batch, target_batch_length, batch_size)
54 target_batch_length
55 )
---> 56 loss.backward()
57
58 # Clip gradient norms
/home/ubuntu/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/autograd/variable.py in backward(self, gradient, retain_variables)
144 'or with gradient w.r.t. the variable')
145 gradient = self.data.new().resize_as_(self.data).fill_(1)
--> 146 self._execution_engine.run_backward((self,), (gradient,), retain_variables)
147
148 def register_hook(self, hook):
/home/ubuntu/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/autograd/function.py in _do_backward(self, gradients, retain_variables)
207 def _do_backward(self, gradients, retain_variables):
208 self.retain_variables = retain_variables
--> 209 result = super(NestedIOFunction, self)._do_backward(gradients, retain_variables)
210 if not retain_variables:
211 del self._nested_output
/home/ubuntu/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/autograd/function.py in backward(self, *gradients)
215 def backward(self, *gradients):
216 nested_gradients = _unflatten(gradients, self._nested_output)
--> 217 result = self.backward_extended(*nested_gradients)
218 return tuple(_iter_None_tensors(result))
219
/home/ubuntu/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/nn/_functions/rnn.py in backward_extended(self, grad_output, grad_hy)
314 grad_hy,
315 grad_input,
--> 316 grad_hx)
317
318 if any(self.needs_input_grad[1:]):
/home/ubuntu/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py in backward_grad(fn, input, hx, weight, output, grad_output, grad_hy, grad_input, grad_hx)
371 hidden_size, dcy.size()))
372 if not dhy.is_cuda or not dy.is_cuda or (dcy is not None and not dcy.is_cuda):
--> 373 raise RuntimeError('Gradients aren\'t CUDA tensors')
374
375 check_error(cudnn.lib.cudnnRNNBackwardData(
RuntimeError: Gradients aren't CUDA tensors
Any suggestions about what I am doing wrong?
Make sure that all the objects that inherit from nn.Module also have .cuda() called on them, and call it before you pass any tensors to them (essentially, before training).
For example (and I am guessing your encoder and decoder are such objects), do this right before you call train():
encoder = encoder.cuda()
decoder = decoder.cuda()
This ensures that all of the model's parameters are initialized in cuda memory.
Edit
In general, whenever you have this kind of error,
RuntimeError: Gradients aren't CUDA tensors
it means that somewhere (from model creation, to defining the inputs, to finally supplying the outputs to the loss function) you missed specifying that a Variable object should be in GPU memory. You will have to go through every step of your model, verifying that all Variable objects are in GPU memory.
Additionally, you don't have to call .cuda() on the outputs. Given that the inputs are in GPU memory, all operations also take place in GPU memory, and so do your outputs.
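A quick sanity check (a sketch using the names from the question; every line should print True before loss.backward() is called):
# verify that every module's parameters and every input actually live on the GPU
for name, module in [('encoderchar', encoderchar), ('encoder', encoder), ('decoder', decoder)]:
    print(name, 'on cuda:', next(module.parameters()).is_cuda)
print('encoder_input on cuda:', encoder_input.is_cuda)
print('target_batch on cuda:', target_batch.is_cuda)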