I have saved and loaded the checkpoints as per the PyTorch manual, and it all seems OK. Now, usually, when I want to start training, I have something like this in PyTorch:
for itr in range(1, args.niters + 1):
    optimizer.zero_grad()  # should I or should I not when checkpoints are loaded?
I am unsure if I should call zero_grad() here (which I use when I start training from scratch), since I am reloading all my weights and biases.
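For context, here is a minimal sketch of how the checkpoint is loaded and training resumed; the file name checkpoint.pt, the dictionary keys, and the batch/target/loss_fn names are placeholders following the PyTorch tutorial convention, not my actual code:

import torch

checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

model.train()
for itr in range(1, args.niters + 1):
    optimizer.zero_grad()            # clears leftover .grad buffers; the loaded weights are untouched
    output = model(batch)            # placeholder forward pass
    loss = loss_fn(output, target)   # placeholder loss
    loss.backward()
    optimizer.step()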
Apologies if this is a daft question.
Related
I'm trying to update the weights of a model during training only for those batches in which the loss is smaller than the one obtained in the previous batch.
So, in the batch loop, I store the loss obtained at each iteration, and then I have tried evaluating a condition: if the loss at time t-1 is smaller than the one at time t, then I proceed as follows:
if loss[t-1] <= loss[t]:
    loss.backward()
    optimizer.step()
else:
    # do nothing, or what?
That is, nothing should be done in the else branch. Nonetheless, I get an error saying CUDA is running out of memory.
Of course, before computing the loss, I call optimizer.zero_grad().
The for loop that runs over the batches seems to run fine, but memory usage blows up. I read that setting the gradients to None might prevent the weight update, and I have tried several statements (output.clone().detach() and also optimizer.zero_grad(set_to_none=True)), but I am not sure they work; I think they did not, and the memory usage explosion still occurs.
Is there a way to get this done?
This is a common problem when storing losses from consecutive steps.
The out-of-memory error occurs because you are storing the losses in a list. Each loss still references its computation graph, which stays in memory for as long as you keep a reference to it. An easy fix is to detach the tensor when you append it to the list:
# loss = loss_fn(...)
losses.append(loss.detach())  # keep only a graph-free copy for the comparison
Then you can work with
if losses[t] <= losses[t-1]:  # current loss is smaller
    loss.backward()           # call backward on the live loss, not the detached copy
    optimizer.step()
else:
    pass
Storing the raw loss in a list keeps the whole graph of that batch alive for every element of losses. Instead, what you can do is the following:
losses.append(loss.cpu().tolist())  # store a plain Python number, not the tensor with its graph
optimizer.zero_grad()
if len(losses) < 2 or losses[-1] <= losses[-2]:  # current loss is smaller (always update on the first batch)
    loss.backward()
    optimizer.step()
Since you only update the model when the current loss is smaller than the previous one, you don't actually need to store all the losses: the last value and the previous one are enough. If you do want to keep a finite number of graphs around, you need to be careful about your available memory, which is quite limited in many applications.
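Putting the two points together, here is a minimal sketch of such a conditional-update loop (model, loader, loss_fn and the starting value are assumptions, not code from the question):

prev_loss = float("inf")                 # assumed start value, so the first batch always updates
for inputs, targets in loader:
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    current = loss.item()                # plain Python float, keeps no graph alive
    if current <= prev_loss:             # update only when the loss did not increase
        loss.backward()
        optimizer.step()
    prev_loss = current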
I am new to PyTorch. May I ask what the difference is between adding loss.item() or not? Compare the following two pieces of code:
for epoch in range(epochs):
    trainingloss = 0
    for i in range(0, X.size()[1], batch_size):
        indices = permutation[i:i+batch_size]
        F = model.forward(X[n])
        optimizer.zero_grad()
        criterion = loss(X, n)
        criterion.backward()
        optimizer.step()
        trainingloss += criterion.item()
and this
for epoch in range(epochs):
    for i in range(0, X.size()[1], batch_size):
        indices = permutation[i:i+batch_size]
        F = model.forward(X[n])
        optimizer.zero_grad()
        criterion = loss(X, n)
        criterion.backward()
        optimizer.step()
If anyone has any idea please help. Thank you very much.
Calling loss.item() gives you the value of the loss as a plain Python number, detached from the computation graph that PyTorch builds (this is what .item() does for scalar PyTorch tensors).
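For illustration, a tiny example (output, target and loss_fn are placeholder names, and the printed values are only illustrative):

loss = loss_fn(output, target)
print(loss)         # a 0-dim tensor carrying a grad_fn, e.g. tensor(0.4325, grad_fn=<NllLossBackward0>)
print(loss.item())  # a plain Python float with no graph attached, e.g. 0.4325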
If you add the line trainingloss += criterion.item() at the end of each batch loop, you keep a running total of the loss over the epoch by adding up the loss of each minibatch in your training set. This is necessary because you are using minibatches: the loss of a single minibatch is not the same as the loss over all the batches.
Note: If you use PyTorch variables outside the optimization loop, e.g. in a different scope, which could happen if you call something like return loss, it is crucial that you call .item() on any PyTorch variables that are part of the computation graph (as a general rule of thumb, any outputs/loss/models that interact with PyTorch methods will likely be part of your computation graph). If not, this can cause the computation graph to not be de-allocated/deleted from Python memory, and can lead to CPU/GPU memory leaks. What you have above looks correct though!
Also, in the future, PyTorch's DataLoader class can help you with minibatches with less boilerplate code - it can loop over your dataset such that each item you loop over is a training batch - i.e. you don't require two for loops in your optimization.
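For example, a minimal DataLoader sketch (the tensors X and y and the model/optimizer/loss_fn are assumed to exist already):

from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(X, y)                     # wrap existing tensors
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for epoch in range(epochs):
    trainingloss = 0.0
    for xb, yb in loader:                         # each item is already a minibatch
        optimizer.zero_grad()
        pred = model(xb)
        batch_loss = loss_fn(pred, yb)
        batch_loss.backward()
        optimizer.step()
        trainingloss += batch_loss.item()         # accumulate a plain number, not the graph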
I hope you enjoy learning/using PyTorch!
In your training loop, criterion.backward() computes the gradient of each trainable parameter of the feed-forward path, and optimizer.step() then updates the parameters based on the computed gradients and the chosen optimization technique. At the end of this step, the training of the model on that particular batch is finished, and the trainingloss += criterion.item() part is only there for tracking and monitoring the training process and the loss value at each training step.
A model should be set to evaluation mode for inference by calling model.eval().
Do we also need to do this during training, before getting the model outputs, for example within a training epoch, if the network contains one or more dropout and/or batch-normalization layers?
If this is not done, won't the output of the forward pass in the training epoch be affected by the randomness of dropout?
Many example codes do not do this and something along these lines is the common approach:
for t in range(num_epochs):
    # forward pass
    yhat = model(x)
    # get the loss
    loss = criterion(yhat, y)
    # backward pass, optimizer step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
For example, here is some example code to look at: convolutional_neural_network/main.py
Should this instead be?
for t in range(num_epochs):
    # forward pass
    model.eval()  # disable dropout etc.
    yhat = model(x)
    # get the loss
    loss = criterion(yhat, y)
    # backward pass, optimizer step
    model.train()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
TLDR:
Should this instead be?
No!
Why?
More explanation:
Different Modules behave differently depending on whether they are in training or evaluation/test mode.
BatchNorm and Dropout are only two examples of such modules; basically, any module that has a distinct training phase follows this rule.
When you call .eval(), you are signaling all modules in the model to shift their operation accordingly.
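For instance, a quick check (assuming model is any nn.Module):

model.eval()
print(model.training)                                  # False
print(all(not m.training for m in model.modules()))    # True: every submodule switched as well
model.train()
print(model.training)                                  # True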
Update
The answer is that during training you should not use eval mode, and yes, as long as you have not set eval mode, dropout will be active and act randomly in each forward pass. Similarly, all other modules that have two phases will behave accordingly; that is, BN will always update its mean/var on each pass, and if you use a batch_size of 1 it will error out, as it cannot do BN with a batch of 1.
As was pointed out in the comments, it should be noted that during training you should not call eval() before the forward pass. It effectively disables all modules that behave differently between train and test mode, such as BN and Dropout (basically any module that keeps updateable statistics or impacts the network topology like dropout), and you will not see them contributing to your network's learning. So don't code like that!
Let me explain a bit what happens during training:
When you are in training mode, all of the modules that make up your model may have two modes, training and test mode. These modules either have statistics or parameters that are updated during training, like BN, or affect the network topology in a sense, like Dropout (by disabling some features during the forward pass). Some modules, such as ReLU(), operate the same way in either mode and thus do not change when the mode changes.
When you are in training mode, you feed an image; it passes through the layers until it faces a dropout layer, and here some features are disabled, so their responses to the next layer are omitted. The output then goes through the other layers until it reaches the end of the network and you get a prediction.
The network may make correct or wrong predictions, and the weights will be updated accordingly: if the answer was right, the features/combinations of features that resulted in the correct answer will be positively affected, and vice versa.
So during training you do not need to, and should not, disable dropout, as it affects the output and should be affecting it so that the model learns a better set of features.
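For reference, a minimal sketch of the usual pattern, train mode during the training loop and eval mode only for validation (torch, model, criterion, optimizer, train_loader and val_loader are assumed to be set up already):

for t in range(num_epochs):
    model.train()                          # dropout/BatchNorm in training behaviour
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

    model.eval()                           # switch to evaluation behaviour for validation
    with torch.no_grad():                  # no graph needed while evaluating
        val_loss = sum(criterion(model(xb), yb).item() for xb, yb in val_loader)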
I hope this makes it a bit clearer for you. If you still feel you need more, say so in the comments.
In Keras, we use ModelCheckpoint to save our trained models. The Keras documentation explains that monitor is the "quantity to monitor", but I still can't understand it. What is the effect of monitor on our machine learning process?
keras.callbacks.callbacks.ModelCheckpoint(filepath, monitor='val_loss', verbose=0, save_best_only=False, save_weights_only=False, mode='auto', period=1)
https://keras.io/callbacks/
From the Keras documentation, I'm explaining the parameters of ModelCheckpoint. It is used to save your best model while training; the reason is that after training for a few epochs the model may start to diverge, show poor performance, or overfit. More epochs do not always mean better performance, so it's better to keep saving the weights while training.
save_best_only: if save_best_only=True, the latest best model according to the quantity monitored will not be overwritten. Here it clearly says that the model will be saved based on the value of the quantity being monitored.
mode: one of {auto, min, max}. If save_best_only=True, the decision to overwrite the current save file is made based on either the maximization or the minimization of the monitored quantity. For val_acc, this should be max, for val_loss this should be min, etc. In auto mode, the direction is automatically inferred from the name of the monitored quantity.
The mode is determined by the metric you monitor: if it's a loss, the mode must be min; if it's something like accuracy, F1 score, etc., the mode must be max. (You want to save the weights that give the lowest loss or the best accuracy so far.)
verbose: verbosity mode, 0 or 1. verbose determines how much information you want to get printed about your metrics (0 means nothing will be printed, 1 means some information will be printed)
Other parameters should be very easy to understand.
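For illustration, a minimal usage sketch matching the signature above (the file name, data variables and epoch count are placeholders):

from keras.callbacks import ModelCheckpoint

# keep only the weights that achieve the lowest validation loss seen so far
checkpoint = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min',
                             save_best_only=True, verbose=1)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=50,
          callbacks=[checkpoint])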
I am training on blood cell images using Chainer. While training, the epoch details do not get updated and the training does not run for the given number of epochs.
I want to understand the cause of this problem.
When the training is interrupted and restarted, only a single epoch is updated and displayed.
I am not sure of the reason behind the problem, so I can't point to a particular section of the code, whether it is the data pre-processing, the data feeding, or the classifier/evaluator section.
You can see the whole code here: https://github.com/atom2k17/BloodCell-Chainer/blob/master/WithoutKerasDD-checkpoint.ipynb
After each training epoch, main/loss, validation/loss, etc. should be populated with values from that epoch, and the report should update after each epoch finishes.
Can you try modifying

valid_iter = iterators.SerialIterator(valid, batch_size)

to the following?

valid_iter = iterators.SerialIterator(valid, batch_size, repeat=False, shuffle=False)
Without the repeat=False option, the iterator never finishes, so
E.Evaluator(valid_iter, model_loss, device=gpu_id) never finishes either.
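For context, a minimal sketch of the corrected setup (the variable names follow the question's notebook; the extensions import alias is an assumption):

from chainer import iterators
from chainer.training import extensions as E

train_iter = iterators.SerialIterator(train, batch_size)              # repeats and shuffles for training
valid_iter = iterators.SerialIterator(valid, batch_size,
                                      repeat=False, shuffle=False)    # a single pass per evaluation
trainer.extend(E.Evaluator(valid_iter, model_loss, device=gpu_id))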