Question on restoring training after loading model - pytorch

After training for 24 hours, the training process saved the model files via torch.save. Then a power failure (or some other issue) caused the process to exit. Normally, we can load the model and continue training from the last step.
Should we also load the state of the optimizer (Adam, etc.)? Is it necessary?

Yes, you can load the model from the last step and retrain it from that very step.
If you want to use the model only for inference, it is enough to save its state_dict:
torch.save(model.state_dict(), PATH)
and load it as:
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.eval()
However, for your use case you need to save the optimizer state_dict as well. For that purpose, save a checkpoint dictionary:
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
    ...
}, PATH)
and load the model for further training as:
model = TheModelClass(*args, **kwargs)
optimizer = TheOptimizerClass(*args, **kwargs)
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
model.eval()
# - or -
model.train()
It is necessary to save the optimizer state dictionary, since this contains buffers and parameters that are updated as the model trains.

Loading the optimizer state is also necessary in other cases, for example when a learning rate scheduler is being used.
In that particular case, the learning rate of the optimizer is restored to the value it had at the saved state.
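For instance, here is a minimal sketch of checkpointing an optimizer together with a StepLR scheduler so that training resumes with the correct learning rate (the placeholder model, hyperparameters, and file name are illustrative assumptions):
import torch

# Placeholder model just for illustration; use your own nn.Module.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
epoch = 5  # illustrative: current epoch counter from your training loop

# Save everything needed to resume, including the scheduler state.
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
}, 'checkpoint.pt')

# Resume: rebuild the same objects, then restore their states.
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
start_epoch = checkpoint['epoch'] + 1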

Related

How can I save loss and accuracy metrics in mlflow after each epoch?

I would like to see metrics like loss and accuracy as a graph by storing each value for the corresponding metrics after each epoch during training/testing phase of a keras model.
PS: I know that we can do it by using the autolog feature of mlflow for Keras like below, but I don't want to use that.
mlflow.keras.autolog()
After searching the internet and combining a few concepts, I was able to solve the problem I had asked about. In Keras, we can create custom callbacks that are called at various points (start/end of an epoch, batch, etc.) during the training, testing, and prediction phases of a model.
So, I created a Keras custom callback to store loss/accuracy values after each epoch as mlflow metrics like below.
import mlflow
from tensorflow import keras  # or `import keras`, depending on your setup

class CustomCallback(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        mlflow.log_metrics({
            "loss": logs["loss"],
            "sparse_categorical_accuracy": logs["sparse_categorical_accuracy"],
            "val_loss": logs["val_loss"],
            "val_sparse_categorical_accuracy": logs["val_sparse_categorical_accuracy"],
        })
I used the above callback during training of my model like below.
history = model.fit(
    features_train,
    labels_train,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    callbacks=[CustomCallback()],
    validation_split=0.2,
)
The custom callback stored all the values after each epoch during training, and I was able to view them as graphs in the MLflow UI.
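As a usage sketch, you can also wrap the training call in an explicit MLflow run so the metrics logged by the callback are grouped under one run; the experiment name here is an illustrative assumption:
import mlflow

mlflow.set_experiment("keras-training")  # illustrative experiment name
with mlflow.start_run():
    # CustomCallback calls mlflow.log_metrics(), which logs to the active run.
    history = model.fit(
        features_train,
        labels_train,
        batch_size=BATCH_SIZE,
        epochs=EPOCHS,
        callbacks=[CustomCallback()],
        validation_split=0.2,
    )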

tf.keras how to save ModelCheckPoint object

ModelCheckpoint can be used to save the best model based on a specific monitored metric, so it obviously has information about the best metric value stored within the object. If you train on Google Colab, for example, your instance can be killed without warning and you would lose this information after a long training session.
I tried to pickle the ModelCheckpoint object so that I can reuse the same object when I bring my notebook back up, but got:
TypeError: can't pickle _thread.lock objects
Is there a good way to do this? You can try to reproduce it with:
import pickle
import tensorflow as tf

chkpt_cb = tf.keras.callbacks.ModelCheckpoint('model.{epoch:02d}-{val_loss:.4f}.h5',
                                              monitor='val_loss',
                                              verbose=1,
                                              save_best_only=True)
with open('chkpt_cb.pickle', 'wb') as f:
    pickle.dump(chkpt_cb, f, protocol=pickle.HIGHEST_PROTOCOL)
If the callback object is not to be pickled (due to the thread lock issue, and it is not advisable anyway), I can pickle this instead:
best = chkpt_cb.best
This stores the best monitored metric the callback has seen. It is a float, which you can pickle, reload next time, and then do this:
chkpt_cb.best = best # if chkpt_cb is a brand new object you create when colab killed your session.
This is my own setup:
# All paths should be on Google Drive; I omitted that here for simplicity.
import os
import pickle
import tensorflow as tf

chkpt_cb = tf.keras.callbacks.ModelCheckpoint(filepath='model.{epoch:02d}-{val_loss:.4f}.h5',
                                              monitor='val_loss',
                                              verbose=1,
                                              save_best_only=True)

# Restore the best metric seen so far, if a previous session saved one.
if os.path.exists('chkpt_cb.best.pickle'):
    with open('chkpt_cb.best.pickle', 'rb') as f:
        best = pickle.load(f)
    chkpt_cb.best = best

# Persist the best metric after every epoch.
def save_chkpt_cb():
    with open('chkpt_cb.best.pickle', 'wb') as f:
        pickle.dump(chkpt_cb.best, f, protocol=pickle.HIGHEST_PROTOCOL)

save_chkpt_cb_callback = tf.keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: save_chkpt_cb()
)

history = model.fit_generator(generator=train_data_gen,
                              validation_data=dev_data_gen,
                              epochs=5,
                              callbacks=[chkpt_cb, save_chkpt_cb_callback])
So even when your Colab session gets killed, you can still retrieve the last best metric, inform your new instance about it, and continue training as usual. This especially helps when you re-compile a stateful optimizer, which may cause a regression in the loss/metric, and you don't want to save those models for the first few epochs.
I think you might be misunderstanding the intended usage of the ModelCheckpoint object. It is a callback that periodically gets called during training at a particular phase. The ModelCheckpoint callback in particular gets called after every epoch (if you keep the default period=1) and saves your model to disk in the filename you specify to the filepath argument. The model is saved in the same way described here. Then if you want to load that model later, you can do something like
from keras.models import load_model
model = load_model('my_model.h5')
Other answers on SO provide nice guidance and examples for continuing training from a saved model, for example: Loading a trained Keras model and continue training. Importantly, the saved H5 file stores everything about your model that is needed to continue training.
As suggested in the Keras documentation, you should not use pickle to serialize your model. Simply register the ModelCheckpoint callback with your 'fit' function:
chkpt_cb = tf.keras.callbacks.ModelCheckpoint('model.{epoch:02d}-{val_loss:.4f}.h5',
                                              monitor='val_loss',
                                              verbose=1,
                                              save_best_only=True)
model.fit(x_train, y_train,
          epochs=100,
          steps_per_epoch=5000,
          callbacks=[chkpt_cb])
Your model will be saved in an H5 file named as you specified, with the epoch number and loss value automatically formatted for you. For example, the saved file for the 5th epoch with loss 0.0023 would be named model.05-0.0023.h5, and since you set save_best_only=True, the model will only be saved if your loss is better than the previously saved one, so you don't pollute your directory with a bunch of unneeded model files.

Early Stopping, Model has gone through how many epochs?

I am using Keras. I am training my neural network with early stopping. My patience is 10 and the epoch with the lowest validation loss is 15. My network runs until epoch 25 and then stops; however, if I understand correctly, the model I end up with is the one from epoch 25, not epoch 15.
Is there an easy way to revert to the 15 epoch model or do I need to re-instantiate the model and run 15 epochs?
Yes, there is: the restore_best_weights parameter of the EarlyStopping callback. Set it to True and Keras will keep track of the weights that produced the best loss:
callback = EarlyStopping(..., restore_best_weights=True)
See all the parameters for this callback here.
Yes, you get the model (weights) corresponding to the epoch at which training stops. A commonly used strategy is to save the model whenever the validation loss/accuracy improves.
Early stopping doesn't work the way you are thinking: it does not return the model with the lowest loss or highest accuracy. It stops training once there has been no improvement in the monitored loss/accuracy for x epochs (10 in your case, the patience parameter).
You should use the ModelCheckpoint callback instead, e.g.
keras.callbacks.ModelCheckpoint(filepath, monitor='val_loss', verbose=0, save_best_only=True, save_weights_only=False, mode='auto', period=1)
https://keras.io/callbacks/
This will save (checkpoint) the best model encountered during the training history.
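A minimal sketch combining both suggestions, restoring the best weights in memory and checkpointing the best model to disk (the model, data, and file name are illustrative assumptions):
from tensorflow import keras

# Stop after 10 epochs with no val_loss improvement and roll back to the best weights.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss',
                                           patience=10,
                                           restore_best_weights=True)
# Also write the best model seen so far to disk.
checkpoint = keras.callbacks.ModelCheckpoint('best_model.h5',  # illustrative path
                                             monitor='val_loss',
                                             save_best_only=True)

model.fit(x_train, y_train,
          validation_split=0.2,
          epochs=100,
          callbacks=[early_stop, checkpoint])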

keras: load saved model weights in a model for evaluation

I finished the model training process. During training, I used ModelCheckpoint to save the weights of the best model:
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1,
                             save_best_only=True, mode='max')
After training, I load the model weights into a model for evaluation, but I found that the model does not give the best accuracy observed during training. I reload the model as follows:
model.load_weights(filepath)  # load saved weights
model = Sequential()
model.add(Convolution2D(32, 7, 7, input_shape=(3, 128, 128)))
....
....
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

# evaluate the model
scores = model.evaluate_generator(test_generator, val_samples)
print("Accuracy = ", scores[1])
The highest accuracy saved by ModelCheckpoint is about 85%, but the re-compiled model only gives an accuracy of 16%.
Is there something wrong I am doing?
To be safe, is there any way to directly save the best model rather than the model weights?
Putting model.load_weights(filepath) after compiling the model fixes the problem!!
But I am still curious about saving the best model during training
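For reference, a sketch of the corrected order, plus the alternative of checkpointing the full model (architecture + weights) so you can reload it directly with load_model; the layer details, optimizer, and file names are illustrative assumptions taken from the question:
# Build and compile first, then load the saved weights.
model = Sequential()
model.add(Convolution2D(32, 7, 7, input_shape=(3, 128, 128)))
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
model.load_weights(filepath)

# Alternative: save the full best model (save_weights_only defaults to False)
# and reload it later without rebuilding the architecture.
from keras.models import load_model
checkpoint = ModelCheckpoint('best_model.h5', monitor='val_acc',
                             save_best_only=True, mode='max')
# ... train with callbacks=[checkpoint] ...
best_model = load_model('best_model.h5')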
Two tips for making sure you're using the best model trained:
Add the val_acc to the file name
You can create your ModelCheckpoint like this:
checkpoint = ModelCheckpoint('my-model-{val_acc:.2f}.hdf5', monitor='val_acc', verbose=1,
                             save_best_only=True, mode='max')
That way, you'll have multiple files, and you would be able to make sure you pick the best model.
Read the training output
When you look at the output of Keras while fitting, you'll see:
Epoch 000XX: val_acc improved from 0.8 to 0.85, saving model to my-model-0.85.hdf5
Let's say you have a bunch of data that you are training on, and you decide to save the weights of your best iteration only. If you have not iterated through all of your data before you find your 'best' model weights, you will effectively be throwing away data, and any later evaluation using these so-called best weights will not correlate with your in-batch evaluation.

Is it possible to continue training from a specific epoch?

A resource manager I'm using to fit a Keras model limits the access to a server to 1 day at a time. After this day, I need to start a new job. Is it possible with Keras to save the current model at epoch K, and then load that model to continue training epoch K+1 (i.e., with a new job)?
You can save weights after every epoch by specifying a callback:
weight_save_callback = ModelCheckpoint('/path/to/weights.{epoch:02d}-{val_loss:.2f}.hdf5',
                                       monitor='val_loss', verbose=0,
                                       save_best_only=False, mode='auto')
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          callbacks=[weight_save_callback])
This will save the weights after every epoch. You can then load them with:
model = Sequential()
model.add(...)
model.load_weights('path/to/weights.hdf5')
Of course your model needs to be the same in both cases.
You can add the initial_epoch argument to model.fit. This will allow you to continue training from a specific epoch.
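A minimal sketch of resuming with initial_epoch, assuming the previous job saved a full model after epoch 10 with the checkpoint callback above (the file name and epoch counts are illustrative assumptions):
from keras.models import load_model

last_epoch = 10  # illustrative: the last completed epoch of the previous job
model = load_model('weights.10-0.35.hdf5')  # illustrative checkpoint file

# epochs is the index of the final epoch to reach; training resumes after last_epoch.
model.fit(X_train, y_train,
          initial_epoch=last_epoch,
          epochs=last_epoch + 20,
          callbacks=[weight_save_callback])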
You can automatically start your training at the next epoch!
What you need is to keep track of your training with a training log file, as follows:
import sys
from keras.callbacks import ModelCheckpoint, CSVLogger

if len(sys.argv) == 1:
    model = ...                 # you start training normally, no command line arguments
    model.compile(...)
    i_epoch = -1                # you need this to start at epoch 0
    app = False                 # you want to start logging from scratch
else:
    from keras.models import load_model
    model = load_model(sys.argv[1])   # you give the saved model as input file
    with open(csvloggerfile) as f:    # you use your training log to get the right epoch number
        i_epoch = list(f)
        i_epoch = int(i_epoch[-2][:i_epoch[-2].find(',')])
    app = True                  # you want to append to the log file

checkpointer = ModelCheckpoint(savemodel...)
csv_logger = CSVLogger(csvloggerfile, append=app)

model.fit(X, Y, initial_epoch=i_epoch + 1, callbacks=[checkpointer, csv_logger])
That's all folks!
