Extracting Meaningful Error Message from 'RuntimeError: CUDA error: device-side assert triggered' on Google Colab in Pytorch

I am experiencing the following error while training a generative network via Pytorch 1.9.0+cu102:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
I am using a Google Colaboratory GPU session. The error is triggered on either of these two lines:
running_loss += loss.item()
or
target = target.to(device)
It produces the error on the first line the first time I run the notebook, and on the second line each subsequent time I try to run the block. The first error occurs after training for 3 batches; the second error happens on the first batch. I can confirm that the device is cuda:0, that the device is available, and that target is a PyTorch tensor. Naturally, I tried to take the advice of the error and run:
!CUDA_LAUNCH_BLOCKING=1
and
os.system('CUDA_LAUNCH_BLOCKING=1')
However, neither of these lines changes the error message. According to a different post, this is because Colab runs these lines in a subshell. The error does not occur when running on the CPU, and I do not have access to any GPU besides the one on Colab. While this question has been asked in many different forms, no answers are particularly helpful to me because they either recommend passing the aforementioned variable, are about a situation fundamentally different from my own (such as training a classifier with an inappropriate number of classes), or recommend a solution I have already tried, such as resetting the runtime or switching to the CPU.
I am hoping to gain insight into the following questions:
1. Is there a way for me to get a more specific error message? My efforts to set the launch-blocking variable have been unsuccessful.
2. How can I be getting this error on two seemingly very different lines? And how can my network train for 3 batches (it is always 3) but fail on the fourth?
3. Does this situation remind anyone of an error they have encountered previously, with a possible route for ameliorating it given the limited information I can extract?

I was successfully able to get more information about the error by executing:
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
BEFORE importing torch. This allowed me to get a more detailed traceback and ultimately diagnose the problem as an inappropriate loss function.
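For reference, a minimal sketch of the ordering that matters (the variable only takes effect if it is set before torch initializes CUDA):
import os

# Must run before `import torch`; otherwise the already-initialized CUDA
# context ignores the setting and kernel launches stay asynchronous.
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

import torch  # device-side asserts now surface with an accurate traceback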

This can mainly be due to one of two reasons:
1. An inconsistency in the number of classes
2. Wrong input to the loss function
If it's the first one, then you should get the same error when you change the runtime back to CPU.
In my case, it was the second one. I had used BCE loss, whose input should be between 0 and 1. If it's any other value, this error can appear. So I fixed it by using:
criterion=nn.BCEWithLogitsLoss()
instead of:
criterion=nn.BCELoss()
Oh yeah, and I also set:
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
at the beginning of the code.
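To make the difference concrete, here is a minimal sketch (the tensors are made up for illustration):
import torch
import torch.nn as nn

logits = torch.tensor([2.3, -1.7, 0.4])   # raw model outputs, unbounded
targets = torch.tensor([1.0, 0.0, 1.0])

# BCELoss expects probabilities in [0, 1]; feeding it raw logits is what
# trips the device-side assert on CUDA. BCEWithLogitsLoss applies the
# sigmoid internally, so unbounded logits are safe (and it is more
# numerically stable):
loss = nn.BCEWithLogitsLoss()(logits, targets)

# Equivalent but less stable: nn.BCELoss()(torch.sigmoid(logits), targets)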

Related

Synchronization problem while executing Simulink FMU in ROS 2-Gazebo (TF_OLD_DATA warning)

I'm working on a co-simulation project between Simulink and Gazebo. The aim is to move a robot model in Gazebo along trajectory coordinates computed in Simulink. I'm using MATLAB R2022a, ROS 2 Dashing, and Gazebo 9.9.0 on a computer running Ubuntu 18.04.
The problem is that when launching the FMU with the fmi_adapter, I'm getting the following message. It is tagged as [INFO], but it is actually messing up my whole project.
[fmi_adapter_node-1] [INFO] [fmi_adapter_node]: Simulation time 1652274762.959713 is greater than timer's time 1652274762.901340. Is your step size to large?
Note that the timer's time is lower than the simulation time. Even if I try to change the step size with the optional argument of the fmi_adapter_node, the same log appears with small differences in the times. I'm using the following commands:
ros2 launch fmi_adapter fmi_adapter_node.launch.py fmu_path:=FMI/Trajectory/RobotMARA_SimulinkFMU_v2.fmu # default step size: 0.2
ros2 launch fmi_adapter fmi_adapter_node.launch.py fmu_path:=FMI/Trajectory/RobotMARA_SimulinkFMU_v2.fmu _step_size:=0.001
As you would expect, the outputs of the FMU are the xyz coordinates of the robot trajectory at each time step. Since the fmi_adapter_node creates topics for both inputs and outputs, I'm reading the output xyz values by means of 3 subscribers with the following code. Those coordinates are then used to program the robot trajectories with the MoveIt Python API.
When I run that Python code, I get the following warning over and over, and the robot manipulator doesn't actually move.
[ WARN] [1652274804.119514250]: TF_OLD_DATA ignoring data from the past for frame motor6_link at time 870.266 according to authority unknown_publisher
Possible reasons are listed at http://wiki.ros.org/tf/Errors%20explained
The previous warning is explained here, but I'm not able to fix it. I've tried clicking Reset in RViz, but nothing changes. I've also tried the following without success:
ros2 param set /fmi_adapter_node use_sim_time true # it just sets the timer's time to 0
It seems that the clock is taking negative values, so there is a synchronization problem.
Any help is welcome.
The warning message by the FMIAdapterNode is emitted if the timer's period is only slightly greater than the simulation step-size and if the timer is preempted by other processes or threads.
I created an issue at https://github.com/boschresearch/fmi_adapter/issues/9 which explains this in more detail and lists two possible fixes. It would be great if you could contribute to this discussion.
I assume that the TF_OLD_DATA error is not related to the fmi_adapter. Looking at the code snippet at ROS Answers, I wondered whether x,y,z values are re-published at all given that the lines
pose.position.x = listener_x.value
pose.position.y = listener_y.value
pose.position.z = listener_z.value
are not inside a callback and are executed even before rospy.spin(), but maybe that's just truncated.
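For illustration only, a minimal sketch of that suggestion in the ROS 1 rospy style the snippet uses, with the pose re-read and re-published inside the loop (pose_pub, the topic name, and the listener_* objects are assumptions based on the truncated snippet):
import rospy
from geometry_msgs.msg import Pose

rospy.init_node("trajectory_relay")  # hypothetical node name
pose_pub = rospy.Publisher("target_pose", Pose, queue_size=1)  # hypothetical topic
rate = rospy.Rate(10)
pose = Pose()
while not rospy.is_shutdown():
    # Re-read the latest FMU outputs on every cycle instead of once at startup.
    pose.position.x = listener_x.value
    pose.position.y = listener_y.value
    pose.position.z = listener_z.value
    pose_pub.publish(pose)
    rate.sleep()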

pytorch loss.backward() keeps running for hours

I am using PyTorch to train on some X-ray images, but I ran into the following issue:
at the line loss.backward(), the program just keeps running and never ends, and there is no error or warning.
loss, outputs = self.forward(images, targets)
loss = loss / self.accumulation_steps
print("loss calculated: " + str(loss))
if phase == "train":
    print("running loss backwarding!")
    loss.backward()
    print("loss is backwarded!")
    if (itr + 1) % self.accumulation_steps == 0:
        self.optimizer.step()
        self.optimizer.zero_grad()
The loss calculated before this is something like tensor(0.8598, grad_fn=<DivBackward0>).
Could anyone help me understand why this keeps running, or suggest good ways to debug the backward() call?
I am using torch 1.2.0+cu92 with the compatible cuda 10.0.
Thank you so much!!
It's hard to give a definite answer but I have a guess.
Your code looks fine, but from the output you've posted (tensor(0.8598, grad_fn=<DivBackward0>)) I conclude that you are operating on the CPU and not on the GPU (a CUDA tensor would also print device='cuda:0').
One possible explanation is that the backward pass is not running forever, but just takes very, very long. Training a large network on a CPU is much slower than on a GPU. Check your CPU and memory utilization. It might be that your data and model are too big to fit into your main memory, forcing the operating system to use your hard disk, which would slow down execution by several additional orders of magnitude. If this is the case, I generally recommend:
Use a smaller batch size.
Downscale your images (if possible).
Only open images that are currently needed.
Reduce the size of your model.
Use your GPU (if available) by calling model.cuda(); images = images.cuda() before starting your training.
If that doesn't solve your problem you could start narrowing down the issue by doing some of the following:
Create a minimal working example to reproduce the issue.
Check if the problem persists with other, very simple model architectures.
Check if the problem persists with different input data.
Check if the problem persists with a different PyTorch version.
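As a concrete sketch of the GPU suggestion above (model and the data loader are assumed from the question's training code):
import torch

# Pick the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

for images, targets in loader:
    # Move each batch to the same device as the model before the forward pass.
    images, targets = images.to(device), targets.to(device)
    ...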

Debugging the optimization run while training variables of a pre-trained tensorflow model

I am loading a pre-trained model and then extracting only the trainable variables which I want to optimize (basically change or fine-tune) according to my custom loss. The problem is that the moment I pass a mini-batch of data to it, it just hangs and there is no progress. I used TensorBoard for visualization, but I don't know how to debug when there is no log info available. I put some basic print statements around it but didn't get any helpful information.
Just to give an idea, this is the code, in sequence:
# Load and build the model
model = skip_thoughts_model.SkipThoughtsModel(model_config, mode="train")
with tf.variable_scope("SkipThoughts"):
    model.build()
theta = [v for v in tf.get_collection(tf.GraphKeys.MODEL_VARIABLES, scope='SkipThoughts') if "SkipThoughts" in v.name]

# F Representation using Skip-Thoughts model
opt_F = tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])

# Training
sess.run([opt_F], feed_dict={idx: idxTensor})
And the model is from this repository:
The problem is with training, i.e. the last step. I verified that the theta list is not empty; it has 26 elements, like ...
SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/beta:0
SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/gamma:0
SkipThoughts/logits/weights:0
SkipThoughts/logits/biases:0
SkipThoughts/decoder_post/gru_cell/gates/layer_norm/w_h/beta:0
...
Also, even after using tfdbg the issue remains. Maybe it really takes a lot of time, or it is stuck waiting on some other process? So, I also tried breaking down the
tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])
step into
opt = tf.train.AdamOptimizer(learning_rate)
gvs = opt.compute_gradients(model.total_loss, var_list=theta)
opt_F = opt.apply_gradients(gvs)
...
g = sess.run(gvs, feed_dict = {idx: idxTensor})
so that I could check whether the gradients are computed in the first place, but it got stuck at the same point. In addition, I also tried computing the gradients with tf.gradients over just one of the variables, and for just one dimension, but the issue still exists.
I am running this piece of code in an IPython notebook on an Azure cluster with one Tesla K80 GPU. The GPU usage stays the same throughout the execution, and there is no out-of-memory error.
The kernel interrupt doesn't work, and the only way to stop it is to restart the notebook. Moreover, if I run this code as a plain Python file, I still have to kill the process explicitly. In either case I don't get a stack trace showing exactly where it is stuck. How should one debug such an issue?
Any help and pointers in this regard would be much appreciated.
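One generic way to turn such a hang into a stack trace, sketched here on the assumption of the TF 1.x Session API used above (sess, gvs, idx, idxTensor are from the question), is to give sess.run a deadline so it raises an error instead of blocking forever:
import tensorflow as tf

# Abort the blocking run() call after 60 s; a DeadlineExceededError with a
# stack trace is raised instead of the call hanging indefinitely.
run_options = tf.RunOptions(timeout_in_ms=60000)
g = sess.run(gvs, feed_dict={idx: idxTensor}, options=run_options)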

tf.layers.batch_normalization throws data_type error while using Tensorflow v 1.4 but not 1.0

The following statement,
bn1 = tf.contrib.layers.batch_normalization(inputs=conv1, axis=1, training = is_training)
Throws an error: InternalError: The CPU implementation of FusedBatchNorm only supports NHWC tensor format for now.
on my CPU using TensorFlow v1.4.
However, I've ensured that the code uses data in NHWC format. The same piece of code works on my friend's CPU; the only difference is that he's using TensorFlow v1.0, where the code runs smoothly without issues.
I tried to look up tensorflow documentation,
https://www.tensorflow.org/performance/performance_guide
It suggests feeding in two extra arguments: fused=True, data_format='NHWC'.
However, as per
https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
there is no provision for the two above-mentioned arguments. And in fact, the code throws an error saying batch_normalization received an unexpected argument.
Any responses on the potential reason behind the issue and how I could get around it without rolling back my Tensorflow version (because that would be absurd) are most welcome.
Thank you so much for your time and effort.
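One hedged guess, based only on the snippet above: axis=1 tells batch normalization that the channel dimension comes first (NCHW), which is exactly the layout the CPU FusedBatchNorm kernel rejects. If the data really is NHWC, the channel axis is the last one:
# Sketch: for NHWC (channels-last) data the channel axis is 3 (or -1), not 1;
# axis=1 declares NCHW, which the CPU FusedBatchNorm implementation rejects.
bn1 = tf.layers.batch_normalization(inputs=conv1, axis=3, training=is_training)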

Sklearn OneClassSVM cannot handle large data sets (0xC0000005)

I am using OneClassSVM for outlier detection.
clf = svm.OneClassSVM(kernel='rbf', nu=k, tol=0.001)
clf.fit(train_x)
However, it crashes with the following error:
Process finished with exit code -1073741819 (0xC0000005)
The data size is around 20 MB (train_x), which is not very big to me.
My computer has 8 GB of memory.
However, if I reduce the file size, it works, and this behavior is inconsistent:
sometimes it works and sometimes it does not.
Has anyone had this problem before?
Thanks!
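For what it's worth, exit code 0xC0000005 is an access violation, which in this setting is often memory pressure inside the rbf-kernel solver. A common workaround, sketched here with train_x assumed to be a NumPy array and k taken from the question, is to fit on a random subsample:
import numpy as np
from sklearn import svm

# Hypothetical workaround: cap the number of samples the solver sees.
rng = np.random.default_rng(0)
idx = rng.choice(len(train_x), size=min(len(train_x), 20000), replace=False)

clf = svm.OneClassSVM(kernel='rbf', nu=k, tol=0.001)
clf.fit(train_x[idx])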
