RuntimeError: One of the differentiated Tensors does not require grad - pytorch

GN_params = list(np.load('/home/linghuxiongkun/workspace/guided-diffusion/guided_diffusion/GN_param_names.npy'))
for k, v in self.model.named_parameters():
    if k not in GN_params:
        v.requires_grad = False
self.opt = AdamW(
    filter(lambda p: p.requires_grad, self.model.parameters()),
    lr=self.lr,
    weight_decay=self.weight_decay,
)
I am trying to fine-tune part of the parameters of the network, but this error occurs. Is there any solution to this problem?

Welcome to the community, Eric. We would benefit from more information on your side in order to reproduce the bug, but the error message seems straightforward.
It looks like you are trying to fine-tune some tensors that do not require gradients. Normally you would fine-tune only the network heads, leaving the backbone frozen. Here it seems that the optimizer is trying to compute gradients over the frozen part of the network.
Instead of filtering the model parameters by requires_grad, try just passing all the parameters:
# Change the optimizer call
self.opt = AdamW(
    self.model.parameters(), lr=self.lr, weight_decay=self.weight_decay
)
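For completeness, a minimal self-contained sketch of the suggested pattern: freeze everything except the normalization parameters, then hand all parameters to AdamW. The model and the way the trainable names are collected are illustrative stand-ins, not the original guided-diffusion code.

from torch import nn
from torch.optim import AdamW

# Hypothetical stand-in for self.model (the real code loads a diffusion model).
model = nn.Sequential(nn.Linear(8, 8), nn.GroupNorm(2, 8), nn.Linear(8, 2))

# Stand-in for the names stored in GN_param_names.npy: all GroupNorm parameters.
gn_params = {
    f"{mod_name}.{p_name}"
    for mod_name, mod in model.named_modules()
    if isinstance(mod, nn.GroupNorm)
    for p_name, _ in mod.named_parameters()
}

# Freeze everything that is not a GroupNorm parameter.
for name, param in model.named_parameters():
    param.requires_grad = name in gn_params

# Pass all parameters, as suggested above; frozen ones simply never receive gradients.
opt = AdamW(model.parameters(), lr=1e-4, weight_decay=0.0)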

Related

ValueError: Tensor Tensor("dense_4/Sigmoid:0", shape=(?, 1025), dtype=float32) is not an element of this graph

Today I suddenly started getting this error for no apparent reason while I was running model.fit(). This used to work before; I am using TF 2.3.0, more specifically its Keras module.
The function is called on validation inside a generator, which is fed into model.predict().
Basically, I load a checkpoint, I resume training the network, and I make a prediction on validation.
The error keeps occurring even when training a model from scratch and erasing all the related data. It's as if something had been hardcoded somewhere, as I was able to run model.fit() up until a few hours ago.
I saw several solutions like THIS, but none of these variations really works for me, as they lead to trickier error messages.
I even tried installing a different version of TF, thinking that this was due to some old version, but the error still occurs.
I will answer my own question, as this one was particularly tricky and none of the solutions I found on the internet worked for me, probably because they are outdated.
I'll write down just the relevant part to add to the code; feel free to add more technical explanations.
I like using args for passing variables around, but it can work without it:
import tensorflow as tf
from tensorflow.python.keras.backend import set_session
from tensorflow.keras.models import load_model
import generator  # custom generator

def main(args):
    # open a new session and grab the default TF graph
    args.sess = tf.compat.v1.Session()
    args.graph = tf.compat.v1.get_default_graph()
    set_session(args.sess)
    # define the training generator
    train_generator = generator(args.train_data)
    # load the model and train
    args.model = load_model(args.model_path)
    args.model.fit(train_generator)
Then, in the model prediction function:
# In my specific case, the predict_output() function is
# called inside the generator function
def predict_output(args, x):
    with args.graph.as_default():
        set_session(args.sess)
        y = args.model.predict(x)
    return y

Calling `scipy.optimize.minimize` inside an `sklearn` classifier makes it break in a parallel job

I have run into a silent crash that I am attributing to breaking thread-safety.
Here are the details of what happened. First, I defined a custom sklearn estimator that uses scipy.optimize at fitting time, similar to:
class CustomClassifier(BaseEstimator, ClassifierMixin):
    ...
    def fit(self, X, y=None):
        ...
        # optimizes with respect to some metric by using scipy.optimize.minimize
        ...
        return self
    ...
Downstream, I run a cross-validated measurement of its performance, looking like:
cv_errors = cross_val_score(CustomClassifier(), X, y, n_jobs=-1)
cross_val_score is the sklearn out-of-the-box function, n_jobs=-1 means that I am asking for it to be parallelised on as many cores as available.
The output is that my cv_errors is an array of NaNs. After doing some bug chasing, I noticed that setting n_jobs=1 gives me an array populated by the errors, as expected. It looks like the parallelisation step, coupled with the use of scipy.optimize.minimize, is the culprit.
Is there a way to have it working in parallel?
I think I found a way around it:
from joblib import parallel_backend

with parallel_backend('multiprocessing'):
    cv_errors = cross_val_score(CustomClassifier(), X, y, n_jobs=-1, error_score='raise')
seems to be safe here. If anyone has an explanation of what is happening behind the scenes, and why the default 'loky' backend breaks while 'multiprocessing' does not, I am listening. Also, setting error_score='raise' means that a crash will not be silenced.
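For reference, a self-contained sketch of the whole setup under the multiprocessing backend; the classifier and its objective are toy placeholders, not the original code, and the __main__ guard is there so worker processes can import the class cleanly.

import numpy as np
from joblib import parallel_backend
from scipy.optimize import minimize
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import cross_val_score


class CustomClassifier(BaseEstimator, ClassifierMixin):
    # Toy stand-in: fits a weight vector by calling scipy.optimize.minimize.
    def fit(self, X, y=None):
        def objective(w):
            return np.mean((X @ w - y) ** 2)

        self.coef_ = minimize(objective, x0=np.zeros(X.shape[1]),
                              method="Nelder-Mead").x
        return self

    def predict(self, X):
        return (X @ self.coef_ > 0.5).astype(int)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((100, 3))
    y = (X.sum(axis=1) > 1.5).astype(int)

    # Forcing the multiprocessing backend avoided the silent NaNs here;
    # error_score='raise' surfaces any worker exception instead of silencing it.
    with parallel_backend('multiprocessing'):
        cv_errors = cross_val_score(CustomClassifier(), X, y,
                                    n_jobs=-1, error_score='raise')
    print(cv_errors)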

CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

I got the following error when I ran my PyTorch deep learning model in Colab:
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in linear(input, weight, bias)
   1370         ret = torch.addmm(bias, input, weight.t())
   1371     else:
-> 1372         output = input.matmul(weight.t())
   1373         if bias is not None:
   1374             output += bias

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
I even reduced the batch size from 128 to 64, i.e. cut it in half, but I still got this error. Earlier, I ran the same code with a batch size of 128 and didn't get any error like this.
No, the batch size does not matter in this case.
The most likely reason is that there is an inconsistency between the number of labels and the number of output units.
Try printing the size of the final output in the forward pass and check the size of the output:
print(model.fc1(x).size())
Here fc1 should be replaced by the name of your model's last linear layer before returning.
Make sure that label.size() is equal to prediction.size() before calculating the loss.
And even after fixing that problem, you'll have to restart the GPU runtime (I needed to do this in my case when using a Colab GPU).
This answer might also be helpful
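As a hedged illustration of the shape and label check suggested above (the model, sizes, and layer are made up for the example):

import torch
from torch import nn

num_classes = 10  # must match the number of distinct labels in the dataset
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, num_classes))  # hypothetical model

x = torch.randn(64, 1, 28, 28)
labels = torch.randint(0, num_classes, (64,))

logits = model(x)
print(logits.size(), labels.size())       # torch.Size([64, 10]) torch.Size([64])
assert logits.size(1) == num_classes      # output units == number of classes
assert int(labels.max()) < num_classes    # labels must lie in [0, num_classes)

loss = nn.CrossEntropyLoss()(logits, labels)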
This error can actually be due to different reasons. It is recommended to debug CUDA errors by running the code on the CPU, if possible. If that’s not possible, try to execute the script via:
CUDA_LAUNCH_BLOCKING=1 python [YOUR_PROGRAM]
This will help you get the right line of code which raised the error in the stack trace so that you can resolve it.
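In a notebook such as Colab, the same flag can also be set from Python, as long as it happens before CUDA is initialized; a minimal sketch:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must run before CUDA is initialized

import torch  # import torch (and run any .cuda() calls) only after setting the flag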
Reducing the batch size worked for me, and the training proceeded as planned.
First, try running the same code on your CPU to check if everything is fine with your tensors' shapes.
In my case everything was fine. And since this error means "Resource allocation failed inside the cuBLAS library", I tried decreasing the batch size and it solved the issue. You said you reduced it to 64 and it didn't help. Can you try 32, 8, 1?
I encountered this problem when the number of labels did not equal the number of the network's output channels, i.e. the number of classes predicted.
I had the same problem. While I don't know the exact reason, I know the cause: the last line of my nn.Module was
self.fc3 = nn.Linear(84, num_classes)
I doubled the real number of classes but did not change the value of the variable num_classes; this probably caused a mismatch when I was outputting the results somewhere.
After I fixed the value of num_classes, it just worked out.
I recommend going over the numbers in your model again.
I was facing CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)` on Colab.
Updating PyTorch to 1.8.1 fixed the issue.
I ran into this issue because I was passing parameters in the wrong order to the BCELoss function. This became apparent only after switching to CPU.
Good chance that there is a layer mismatch. Double check to make sure all the dimensions are consistent at each layer.
An accurate error message can be obtained by switching to the CPU. In my case, I had 8 class placeholders at the input of torch.nn.CrossEntropyLoss, but there were 9 different labels (0~8).
My model classifies two classes with only one neuron in the last layer. I had this problem when the last layer was nn.Linear(512, 1) in a PyTorch environment, but my labels were just [0] or [1]. I solved the problem by adding the layer nn.Sigmoid().
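A minimal sketch of that single-output binary setup (the layer size follows the answer; the batch size and the choice of nn.BCELoss are assumptions):

import torch
from torch import nn

model = nn.Sequential(nn.Linear(512, 1), nn.Sigmoid())  # one output neuron, probability in (0, 1)
criterion = nn.BCELoss()  # assumed loss for the single-probability output

x = torch.randn(8, 512)
labels = torch.randint(0, 2, (8, 1)).float()  # targets are 0.0 or 1.0, shaped like the output

loss = criterion(model(x), labels)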
For a large-scale dataset, just delete the temporary variables:
for batch_idx, (x, target) in enumerate(train_dataloader):
    ...
    del x, target, loss, outputs
Reducing the batch size worked for me.
Reducing the batch size didn't work for me. I had defined num_classes in main.py and also in the model structure. I forgot to change num_classes in the model structure, and therefore I got an error. After changing it, the training process started.
This is probably a mismatch of dimensions or indices. You can get clearer feedback about the error by running your model on the CPU. You can reduce the dataset size if needed; in my case, as it was a simple prediction, I just switched to the CPU and found out it was a token outside my model's vocabulary range.
Reducing the maximum sequence length for a model that has a limit (e.g. BERT) solves the error for me.
Also, I faced the same issue when I resized the embedding layer of a model: model.resize_token_embeddings(NEW_SIZE), trained, and saved it.
At prediction time, when I loaded the model, I needed to resize the embedding layer again!
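For reference, a hedged sketch of that reload-and-resize step with the Hugging Face transformers API (the model class and the path are placeholders):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical save directory; the tokenizer already contains the added tokens.
tokenizer = AutoTokenizer.from_pretrained("path/to/saved_model")
model = AutoModelForSequenceClassification.from_pretrained("path/to/saved_model")

# Resize again at load time so the embedding matrix matches the saved vocabulary size.
model.resize_token_embeddings(len(tokenizer))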
I got the same issue in Google Colab with the GPU runtime, so I changed from GPU to TPU.
Then it got resolved.
I recommend trying the same.

How to run predictions on image using a pretrained tensorflow model?

I have adapted this retrain.py script to use with several pretrained models.
After training is done, it generates a 'retrained_graph.pb', which I then read and try to use to run predictions on an image using this code:
def get_top_labels(image_data):
    '''
    Returns a list of labels and their probabilities
    image_data: content of image as string
    '''
    with tf.compat.v1.Session() as sess:
        softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')
        predictions = sess.run(softmax_tensor, {'DecodeJpeg/contents:0': image_data})
    return predictions
This works fine for the inception_v3 model, because it has a tensor called 'DecodeJpeg'; other models I'm using, such as inception_v4, mobilenet, and inception_resnet_v2, don't.
My question is: can I add an op to the graph, like the one used in add_jpeg_decoding in the retrain.py script, so that I can afterwards use it for prediction?
Would it be possible to do something like this:
predictions = sess.run(softmax_tensor, {image_data_tensor: image_data}) where image_data_tensor is a variable that depends on what model I'm using?
I looked through Stack Overflow and couldn't find a question that solves my problem; I'd really appreciate any help with this, thanks.
I need to at least know if it's possible.
Sorry for the repost, I got no views on my first one.
So after some research, I figured out a way; I'm leaving an answer here in case someone needs it. What you need to do is do the decoding yourself: get a tensor from the image using t = read_tensor_from_image_file found here, then run your predictions using this piece of code:
start = time.time()
results = sess.run(output_layer_name, {input_layer_name: t})
end = time.time()
return results
Usually input_layer_name = 'input:0' and output_layer_name = 'final_result:0'.
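The read_tensor_from_image_file helper referenced above comes from TensorFlow's label_image example; as a rough stand-in, the same preprocessing can be sketched with PIL and NumPy (the default size and normalization below are assumptions, so match them to your model):

import numpy as np
from PIL import Image

def read_tensor_from_image_file(file_name, input_height=299, input_width=299,
                                input_mean=0, input_std=255):
    # Decode and resize the image, then normalize it to the range the model expects.
    image = Image.open(file_name).convert("RGB").resize((input_width, input_height))
    t = (np.asarray(image, dtype=np.float32) - input_mean) / input_std
    return np.expand_dims(t, axis=0)  # shape [1, height, width, 3], ready to feed to input:0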

CUDA vs. DataParallel: Why the difference?

I have a simple neural network model, and I apply either cuda() or DataParallel() to the model as follows:
model = torch.nn.DataParallel(model).cuda()
OR,
model = model.cuda()
When I don't use DataParallel but simply move my model to cuda(), I need to explicitly convert the batch inputs to cuda() before giving them to the model; otherwise it returns the following error:
torch.index_select received an invalid combination of arguments - got (torch.cuda.FloatTensor, int, torch.LongTensor)
But with DataParallel, the code works fine. Everything else is the same. Why does this happen? Why don't I need to move the batch inputs explicitly to cuda() when I use DataParallel?
Because DataParallel allows CPU inputs, as its first step is to transfer the inputs to the appropriate GPUs.
Info source: https://discuss.pytorch.org/t/cuda-vs-dataparallel-why-the-difference/4062/3
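For illustration, a minimal sketch of the two call patterns (the model and shapes are made up; this needs at least one GPU to run):

import torch
from torch import nn

inputs = torch.randn(4, 10)  # a batch that lives on the CPU

# Plain .cuda(): the inputs must be moved to the GPU explicitly.
cuda_model = nn.Linear(10, 2).cuda()
out = cuda_model(inputs.cuda())

# DataParallel: CPU inputs are accepted, because its first step is to
# scatter the inputs to the appropriate GPUs.
dp_model = torch.nn.DataParallel(nn.Linear(10, 2)).cuda()
out = dp_model(inputs)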
