I have trained a ResNet152 on a custom dataset.
When I try to load it this way:
trained_model = torch.nn.Module.load_state_dict(torch.load('/content/drive/My Drive/X-Ray-pneumonia-with-CV/X-ray-pytorch-model.pth'))
trained_model.eval()
I got an error:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
And when I add map_location:
trained_model = torch.nn.Module.load_state_dict(torch.load('/content/drive/My Drive/X-Ray-pneumonia-with-CV/X-ray-pytorch-model.pth',
map_location = torch.device('cpu')))
trained_model.eval()
I got another error:
TypeError: load_state_dict() missing 1 required positional argument: 'state_dict'
So what did I do wrong? Please help.
Instead of invoking torch.nn.Module.load_state_dict directly on the class, you should first instantiate an object of the module class you want to load. Otherwise the self argument of load_state_dict is not bound to anything, so the state dict you load via torch.load is passed as self instead of state_dict. Have a look at this answer to understand the difference.
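For example, a minimal sketch of the corrected loading code, assuming the checkpoint holds a state dict for a torchvision ResNet152 whose final layer was replaced for your custom dataset (num_classes here is a hypothetical placeholder):

import torch
import torchvision

# Instantiate the model first, with the same architecture used during training
num_classes = 2  # hypothetical: use whatever your custom dataset required
trained_model = torchvision.models.resnet152()
trained_model.fc = torch.nn.Linear(trained_model.fc.in_features, num_classes)

# Call load_state_dict on the instance, not on torch.nn.Module
state_dict = torch.load('/content/drive/My Drive/X-Ray-pneumonia-with-CV/X-ray-pytorch-model.pth',
                        map_location=torch.device('cpu'))
trained_model.load_state_dict(state_dict)
trained_model.eval()

If the file was instead saved with torch.save(model) rather than torch.save(model.state_dict()), then torch.load already returns the full model object and you can assign it directly instead of calling load_state_dict.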
I am experiencing the following error while training a generative network with PyTorch 1.9.0+cu102:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
This was in a Google Colaboratory GPU session. The error was triggered on either of these two lines:
running_loss += loss.item()
or
target = target.to(device)
It produces the error on the first line when I first run the notebook, and on the second line each subsequent time I try to run the block. The first error occurs after training for 3 batches; the second error happens on the first batch. I can confirm that the device is cuda:0, that the device is available, and that target is a PyTorch tensor. Naturally, I tried to take the advice of the error and run:
!CUDA_LAUNCH_BLOCKING=1
and
os.system('CUDA_LAUNCH_BLOCKING=1')
However, neither of these lines changes the error message. According to a different post, this is because Colab runs these lines in a subshell. The error does not occur when running on the CPU, and I do not have access to any GPU other than the one on Colab. While this question has been asked in many different forms, no existing answer is particularly helpful to me: they either recommend setting the aforementioned environment variable, address a situation fundamentally different from my own (such as training a classifier with an inappropriate number of classes), or suggest something I have already tried, such as resetting the runtime or switching to the CPU.
I am hoping to gain insight into the following questions:
Is there a way for me to get a more specific error message? Efforts to set the launch blocking variable have been unsuccessful.
How could it be that I am getting this error on two seemingly very different lines? How could it be that my network trains for 3 batches (it is always 3), but fails on the fourth?
Does this situation remind anyone of an error they have encountered previously, and is there a possible route for ameliorating it given the limited information I can extract?
I was able to get more information about the error by executing:
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
BEFORE importing torch. This allowed me to get a more detailed traceback and ultimately diagnose the problem as an inappropriate loss function.
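A minimal sketch of that ordering (the key point is that the environment variable is set before torch is imported, so the CUDA runtime picks it up):

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"  # must come before importing torch

import torch  # kernels now launch synchronously, so the traceback points at the failing op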
This is mainly due to one of two reasons:
Inconsistency in the number of classes
Wrong input for the loss function
If it's the first one, then you should see the same error when you change the runtime back to CPU.
In my case, it was the second one. I had used BCE loss, whose input should be between 0 and 1; if it takes any other value, this error can appear. So I fixed this by using:
criterion=nn.BCEWithLogitsLoss()
instead of:
criterion=nn.BCELoss()
I also set:
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
at the beginning of the code (as in the answer above) to get a usable traceback.
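To illustrate the difference between the two losses, here is a small self-contained sketch with a hypothetical batch of raw model outputs (logits):

import torch
import torch.nn as nn

logits = torch.randn(8, 1)                    # raw, unbounded model outputs
target = torch.randint(0, 2, (8, 1)).float()  # binary labels

# BCELoss expects probabilities in [0, 1], so the model must apply a sigmoid itself;
# feeding raw logits can trigger the device-side assert on the GPU.
loss_bce = nn.BCELoss()(torch.sigmoid(logits), target)

# BCEWithLogitsLoss applies the sigmoid internally (and is more numerically stable),
# so it can take the raw logits directly.
loss_bce_logits = nn.BCEWithLogitsLoss()(logits, target)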
I am running code on a server. There are 2 GPUs there, and the first one is busy. Yet I can't find a way to switch between them. I am using PyTorch, if that is important. The following line of code should be modified:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
The modification should be made only here.
Thanks.
cuda by default chooses cuda:0; switching to the other GPU can be done through cuda:1.
So, your line becomes:
device = 'cuda:1' if torch.cuda.is_available() else 'cpu'
You can read more about CUDA semantics.
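A minimal end-to-end sketch, assuming at least two GPUs are visible (the model and every tensor involved in the forward pass must live on the same device):

import torch
import torch.nn as nn

device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(10, 2).to(device)    # placeholder model; move yours the same way
x = torch.randn(4, 10, device=device)  # inputs created on (or moved to) the same device
y = model(x)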
Here is the way I'm doing it while using FastAI and a pre-trained model for inference.
First, during model definition with fai (import fastai.vision.all as fai) I obtain the model instance and put it on the specified GPU (say, gpu_id=3):
model = fai.nn.Sequential(body, head)
model.cuda(device=gpu_id)
Then, while loading the model weights, I also specify which device to use (otherwise it creates a copy of the model on GPU 0):
model.load_state_dict(torch.load(your_model_state_filepath, torch.device(gpu_id)))
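After that, a minimal inference sketch under the same assumptions (model and gpu_id come from the snippet above; batch is a hypothetical input tensor):

model.eval()
with torch.no_grad():
    batch = batch.cuda(device=gpu_id)  # inputs must be on the same GPU as the model
    preds = model(batch)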
I have been getting this error after a few epochs.
I have tried several suggestions found in similar questions, such as:
reducing the batch size of both training and test to 1
reducing the data size
using kill -9 pid
using more than one GPU by setting os.environ['CUDA_VISIBLE_DEVICES'] = '0,2'
reducing the number of output neurons of the LSTM model
adding gpu_options = tf.GPUOptions(allow_growth=True) and
session = tf.InteractiveSession(config=tf.ConfigProto(gpu_options=gpu_options)) (consolidated in the snippet below)
adding del model after using the model
adding k.clear_session(), though I'm not sure I used this particular one correctly
None of them work.
Does anyone have any other suggestions? Please help.
The tensor shape mentioned in the error changes between runs, but the error message itself remains the same.
I'm using Python 3.7, tensorflow-gpu==1.14, CuDNN==7.6.5, CUDA==10.0.
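For reference, the TF 1.x memory-growth configuration from the list of attempts above, written as a self-contained snippet (it only stops TensorFlow from reserving all GPU memory up front; it is not a guaranteed fix for this error):

import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all at session creation
gpu_options = tf.GPUOptions(allow_growth=True)
config = tf.ConfigProto(gpu_options=gpu_options)
session = tf.InteractiveSession(config=config)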
I am loading a pre-trained model and then extracting only the trainable variables which I want to optimize (basically change or fine-tune) according to my custom loss. The problem is that the moment I pass a mini-batch of data to it, it just hangs and there is no progress. I used TensorBoard for visualization but don't know how to debug when there is no log info available. I put some basic print statements around it but didn't get any helpful information.
Just to give an idea, this is the relevant piece of code, in order:
# Load and build the model
model = skip_thoughts_model.SkipThoughtsModel(model_config, mode="train")
with tf.variable_scope("SkipThoughts"):
    model.build()
theta = [v for v in tf.get_collection(tf.GraphKeys.MODEL_VARIABLES, scope='SkipThoughts') if "SkipThoughts" in v.name]
# F Representation using Skip-Thoughts model
opt_F = tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])
# Training
sess.run([opt_F], feed_dict = {idx: idxTensor})
And the model is from this repository:
The problem is with training, i.e. the last step. I verified that the theta list is not empty; it has 26 elements in it, like:
SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/beta:0
SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/gamma:0
SkipThoughts/logits/weights:0
SkipThoughts/logits/biases:0
SkipThoughts/decoder_post/gru_cell/gates/layer_norm/w_h/beta:0
...
Also, even after using tf.debug the issue remains. Maybe it really takes a lot of time, or it is stuck waiting for some other process? So I also tried breaking down the
tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])
step into
opt = tf.train.AdamOptimizer(learning_rate)
gvs = opt.compute_gradients(model.total_loss, var_list=theta)
opt_F = opt.apply_gradients(gvs)
...
g = sess.run(gvs, feed_dict = {idx: idxTensor})
so that I can check whether the gradients are computed in the first place, but it got stuck at the same point. In addition, I also tried computing the gradients with tf.gradients over just one of the variables, and for just one dimension, but the issue still exists.
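For concreteness, a minimal form of that single-variable check might look like this (model.total_loss, theta, sess, idx and idxTensor are the names from the snippets above; this only illustrates the check, it is not a fix):

# Gradient of the loss with respect to a single trainable variable
g0 = tf.gradients(model.total_loss, [theta[0]])

# If even this run hangs, evaluating the graph itself is stuck,
# not the construction of the optimizer step.
grad_val = sess.run(g0, feed_dict={idx: idxTensor})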
I am running this piece of code in an IPython notebook on an Azure cluster with one Tesla K80 GPU. The GPU usage stays the same throughout the execution and there is no out-of-memory error.
The kernel interrupt doesn't work, and the only way to stop it is to restart the notebook. Moreover, if I run this code as a Python script, I also have to kill the process explicitly. In either case I don't get a stack trace to know exactly where it is stuck. How should one debug such an issue?
Any help and pointers in this regard would be much appreciated.
The following statement,
bn1 = tf.contrib.layers.batch_normalization(inputs=conv1, axis=1, training = is_training)
throws the following error on my CPU using TensorFlow v1.4:
InternalError: The CPU implementation of FusedBatchNorm only supports NHWC tensor format for now.
However, I've ensured that the code uses data in NHWC format. The same piece of code works on my friend's CPU; the only difference is that he is using TensorFlow v1.0, and there the code runs smoothly without issues.
I tried looking up the TensorFlow documentation,
https://www.tensorflow.org/performance/performance_guide
It suggests feeding in two extra arguments: fused=True, data_format='NHWC'.
However, as per
https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
there is no such provision for the two above-mentioned arguments, and in fact the code throws an error saying batch_normalization received an unexpected argument.
Any responses about the potential reason behind the issue, and how I could get around it without rolling back my TensorFlow version (because that would be absurd), are most welcome.
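For reference, a sketch of how those arguments map onto the two related TF 1.x APIs (conv1 and is_training are the names from the snippet above; this only illustrates where each argument is accepted, not a confirmed fix for the InternalError):

# tf.layers.batch_normalization has no data_format argument; the layout is expressed through
# axis, the channel dimension, so NHWC input corresponds to axis=-1 (axis=3 for 4-D tensors).
bn_a = tf.layers.batch_normalization(inputs=conv1, axis=-1, training=is_training, fused=True)

# tf.contrib.layers.batch_norm (a different function) is the one that accepts fused and data_format.
bn_b = tf.contrib.layers.batch_norm(inputs=conv1, fused=True, data_format='NHWC', is_training=is_training)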
Thank you so much for your time and effort.