Is it possible to continue training from a specific epoch? - keras

A resource manager I'm using to fit a Keras model limits access to a server to one day at a time. After this day, I need to start a new job. Is it possible with Keras to save the current model at epoch K, and then load that model in a new job to continue training at epoch K+1?

You can save the weights after every epoch by specifying a callback:
from keras.callbacks import ModelCheckpoint

weight_save_callback = ModelCheckpoint('/path/to/weights.{epoch:02d}-{val_loss:.2f}.hdf5',
                                       monitor='val_loss', verbose=0,
                                       save_best_only=False, mode='auto')
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs,
          callbacks=[weight_save_callback])
This will save the weights after every epoch. You can then load them with:
model = Sequential()
model.add(...)
model.load_weights('/path/to/weights.hdf5')
Of course your model needs to be the same in both cases.
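Note that load_weights restores only the layer weights, not the optimizer state, so stateful optimizers such as Adam or RMSprop restart cold. A minimal sketch of the alternative, assuming you can afford to save the full model so that the optimizer state survives the restart:
from keras.models import load_model

# Job 1: save architecture + weights + optimizer state in one file.
model.save('/path/to/model.h5')

# Job 2: restore everything in one call; no need to rebuild the model.
model = load_model('/path/to/model.h5')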

You can pass the initial_epoch argument to model.fit. This allows you to continue training from a specific epoch: Keras resumes the epoch counter there, so logging, checkpoint filenames, and learning-rate schedules pick up at the right place.
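A minimal sketch, assuming the model was saved with model.save at the end of epoch K (note that epochs is the index of the final epoch, not a count of additional epochs):
from keras.models import load_model

K = 10  # hypothetical number of epochs completed by the previous job
model = load_model('/path/to/model.h5')
model.fit(X_train, y_train,
          initial_epoch=K,   # resume counting from epoch K
          epochs=K + 10)     # run 10 more epochs, up to epoch K+10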

You can automatically start your training at the next epoch!
What you need is to keep track of your training with a training log file, as follows:
import sys
from keras.callbacks import ModelCheckpoint, CSVLogger

if len(sys.argv) == 1:
    # No command-line arguments: start training from scratch.
    model = ...
    model.compile(...)
    i_epoch = -1   # so that training starts at epoch 0
    app = False    # start the log file from scratch
else:
    # The saved model is given as an input file.
    from keras.models import load_model
    model = load_model(sys.argv[1])
    # Use the training log to recover the right epoch number;
    # [-2] skips a possibly incomplete final line.
    with open(csvloggerfile) as f:
        log_lines = list(f)
    i_epoch = int(log_lines[-2][:log_lines[-2].find(',')])
    app = True     # append to the existing log file

checkpointer = ModelCheckpoint(savemodel...)
csv_logger = CSVLogger(csvloggerfile, append=app)
model.fit(X, Y, initial_epoch=i_epoch + 1, callbacks=[checkpointer, csv_logger])
That's all folks!

Related

Writing training and testing data sets into separate files

I am training an autoencoder neural network for work. I take an image NumPy array dataset as input (16110 samples in total) and want to split it into training and test sets using the autoencoder.fit command below. While training, Keras reports: Train on 12856 samples, validate on 3254 samples.
However, I need to save both the training and the testing data into separate files. How can I do it?
from keras.callbacks import EarlyStopping, ModelCheckpoint

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)
mc = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min', save_best_only=True)
history = autoencoder.fit(dataNoise, dataNoise, epochs=30, batch_size=256,
                          shuffle=True,  # shuffle expects a boolean, not a batch size
                          callbacks=[es, mc], validation_split=0.2)
You can use the train_test_split function from sklearn. See the code below:
from sklearn.model_selection import train_test_split

train_split = 0.9  # fraction of the data to use for training
train_noise, valid_noise = train_test_split(dataNoise, train_size=train_split,
                                            shuffle=True, random_state=123)
Now use train_noise as x and y, and pass valid_noise as the validation data in model.fit.
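Since the question also asks how to save both sets into separate files, here is a minimal sketch, assuming dataNoise is a NumPy array (the filenames are placeholders):
import numpy as np

np.save('train_data.npy', train_noise)  # training set
np.save('valid_data.npy', valid_noise)  # validation/test set

# Later, reload them with:
train_noise = np.load('train_data.npy')
valid_noise = np.load('valid_data.npy')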

Question on restoring training after loading model

Having trained for 24 hours, the training process saved the model files via torch.save. Then a power-off or some other issue caused the process to exit. Normally, we can load the model and continue training from the last step.
Should we not also load the states of the optimizers (Adam, etc.)? Is it necessary?
Yes, you can load the model from the last step and retrain it from that very step.
If you want to use the model only for inference, it is enough to save the model's state_dict:
torch.save(model.state_dict(), PATH)
and load it as:
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.eval()
However, for your concern, you need to save the optimizer state dict as well. For that purpose, save a checkpoint dictionary:
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
    ...
}, PATH)
and load the model for further training as:
model = TheModelClass(*args, **kwargs)
optimizer = TheOptimizerClass(*args, **kwargs)
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
model.eval()
# - or -
model.train()
It is necessary to save the optimizer state dictionary, since it contains buffers and parameters (for example, Adam's running moment estimates) that are updated as the model trains.
Loading the optimizer state is also necessary in some cases, such as when a learning-rate scheduler is being used.
In that particular case, the learning rate of the optimizer will be re-adjusted to the point where it was at the saved state.
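A minimal sketch of extending the checkpoint to a learning-rate scheduler (StepLR is only an example; every torch.optim.lr_scheduler class exposes the same state_dict/load_state_dict pair):
import torch

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Save: include the scheduler state next to the model and optimizer states.
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
}, PATH)

# Load: restore all three before resuming the training loop.
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])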

tf.keras how to save ModelCheckpoint object

ModelCheckpoint can be used to save the best model based on a specific monitored metric, so it obviously has information about the best metric stored within its object. If you train on Google Colab, for example, your instance can be killed without warning and you would lose this info after a long training session.
I tried to pickle the ModelCheckpoint object, so that I could reuse the same object when I bring my notebook back, but got:
TypeError: can't pickle _thread.lock objects
Is there a good way to do this? You can try to reproduce it with:
import pickle
import tensorflow as tf

chkpt_cb = tf.keras.callbacks.ModelCheckpoint('model.{epoch:02d}-{val_loss:.4f}.h5',
                                              monitor='val_loss',
                                              verbose=1,
                                              save_best_only=True)
with open('chkpt_cb.pickle', 'wb') as f:  # pickle needs a binary-mode file
    pickle.dump(chkpt_cb, f, protocol=pickle.HIGHEST_PROTOCOL)
Since the callback object is not meant to be pickled (it holds a thread lock, and pickling it is not advisable anyway), you can pickle this instead:
best = chkpt_cb.best
This stores the best monitored metric the callback has seen. It is a float, which you can pickle, reload next time, and then restore:
chkpt_cb.best = best  # chkpt_cb is the brand-new object you create after Colab kills your session
This is my own setup:
import os
import pickle
import tensorflow as tf

# All paths should be on Google Drive; omitted here for simplicity.
chkpt_cb = tf.keras.callbacks.ModelCheckpoint(filepath='model.{epoch:02d}-{val_loss:.4f}.h5',
                                              monitor='val_loss',
                                              verbose=1,
                                              save_best_only=True)

# Restore the best metric seen so far, if a previous session saved one.
if os.path.exists('chkpt_cb.best.pickle'):
    with open('chkpt_cb.best.pickle', 'rb') as f:
        best = pickle.load(f)
    chkpt_cb.best = best

def save_chkpt_cb():
    with open('chkpt_cb.best.pickle', 'wb') as f:
        pickle.dump(chkpt_cb.best, f, protocol=pickle.HIGHEST_PROTOCOL)

# Persist the best metric at the end of every epoch.
save_chkpt_cb_callback = tf.keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: save_chkpt_cb()
)

history = model.fit_generator(generator=train_data_gen,
                              validation_data=dev_data_gen,
                              epochs=5,
                              callbacks=[chkpt_cb, save_chkpt_cb_callback])
So even when your Colab session gets killed, you can still retrieve the last best metric and inform your new instance about it, then continue training as usual. This especially helps when re-compiling a stateful optimizer causes a temporary regression in the loss/metric and you don't want the first few post-restart models to overwrite your best checkpoint.
I think you might be misunderstanding the intended usage of the ModelCheckpoint object. It is a callback that gets called periodically during training at a particular phase. The ModelCheckpoint callback in particular gets called after every epoch (if you keep the default period=1) and saves your model to disk under the filename you specify in the filepath argument. The model is saved in the same way described here. Then, if you want to load that model later, you can do something like:
from keras.models import load_model
model = load_model('my_model.h5')
Other answers on SO provide nice guidance and examples for continuing training from a saved model, for example: Loading a trained Keras model and continue training. Importantly, the saved H5 file stores everything about your model that is needed to continue training.
As suggested in the Keras documentation, you should not use pickle to serialize your model. Simply register the ModelCheckpoint callback with your fit function:
chkpt_cb = tf.keras.callbacks.ModelCheckpoint('model.{epoch:02d}-{val_loss:.4f}.h5',
                                              monitor='val_loss',
                                              verbose=1,
                                              save_best_only=True)
model.fit(x_train, y_train,
          epochs=100,
          steps_per_epoch=5000,
          callbacks=[chkpt_cb])
Your model will be saved in an H5 file named as you specified, with the epoch number and loss value automatically formatted into the name. For example, the saved file for the 5th epoch with loss 0.0023 would be model.05-0.0023.h5, and since you set save_best_only=True, the model will only be saved if your loss is better than the previously saved one, so you don't pollute your directory with a bunch of unneeded model files.

how to correctly shape input of a multiclass classification using keras stacked LSTM model

I am working on a multiclass classification problem and, after dabbling with multiple neural network architectures, I settled on a stacked LSTM structure, as it yields the best accuracy for my use case. Unfortunately the network takes a long time (almost 48 hours) to reach a good accuracy (~1000 epochs), even with GPU acceleration. [Accuracy and loss plots omitted.]
At this point, given the good performance but the very slow training, I suspect a bug in my code. I tested it using the golden tests mentioned here, which consist of running tests with only 2 points in either the testing set or the training set, along with eliminating the dropouts. Unfortunately, these runs yield a testing accuracy better than the training accuracy, which should not be the case as far as I know. I suspect that I am shaping my data in the wrong way. Any hints, suggestions, and advice are appreciated.
My code is the following:
# -*- coding: utf-8 -*-
import keras
import numpy as np
from time import time
from utils import dmanip, vis
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.utils import to_categorical
from keras.callbacks import TensorBoard
from sklearn.preprocessing import LabelEncoder
from tensorflow.python.client import device_lib
from sklearn.model_selection import train_test_split
###############################################################################
####################### Extract the data from .csv file #######################
###############################################################################
# get data
data, column_names = dmanip.get_data(file_path='../data_one_outcome.csv')
# split data
X = data.iloc[:, :-1]
y = data.iloc[:, -1:].astype('category')
###############################################################################
########################## init global config vars ############################
###############################################################################
# check if GPU is used
print(device_lib.list_local_devices())
# init
n_epochs = 1500
n_comps = X.shape[1]
###############################################################################
################################## Keras RNN ##################################
###############################################################################
# encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y))
# split the dataset
x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.35,
                                                    random_state=True,
                                                    shuffle=True)
# expand dimensions
x_train = np.expand_dims(x_train, axis=2)
x_test = np.expand_dims(x_test, axis=2)
# define model
model = Sequential()
model.add(LSTM(units=n_comps, return_sequences=True,
               input_shape=(x_train.shape[1], 1),
               dropout=0.2, recurrent_dropout=0.2))
model.add(LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(4, activation='softmax'))
# print model architecture summary
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy',  # 4-class softmax output: categorical, not binary, cross-entropy
              optimizer='adam', metrics=['accuracy'])
# Create a TensorBoard instance with the path to the logs directory
tensorboard = TensorBoard(log_dir='./logs/rnn/{}'.format(time()))
# fit the model
history = model.fit(x_train, y_train, epochs=n_epochs, batch_size=100,
                    validation_data=(x_test, y_test), callbacks=[tensorboard])
# plot results
vis.plot_nn_stats(history=history, stat_type="accuracy", fname="RNN-accuracy")
vis.plot_nn_stats(history=history, stat_type="loss", fname="RNN-loss")
My data is a large 2D matrix of shape (38607, 150), where 149 columns are features and 38607 is the number of samples, plus a target vector with 4 classes:
       feat1  feat2  ...  feat148  feat149  target
1      2.250  0.926  ...     16.0      0.0  class1
2      2.791  1.235  ...      1.0      0.0  class2
...      ...    ...  ...      ...      ...     ...
38606  2.873  1.262  ...    281.0      0.0  class3
38607  3.222  1.470  ...    467.0      1.0  class4
Regarding the slowness of training: you can think of using tf.data instead of DataFrames and NumPy arrays, because achieving peak performance requires an efficient input pipeline that delivers data for the next step before the current step has finished, and the tf.data API helps to build flexible and efficient input pipelines.
For more information regarding tf.data, please refer to this TensorFlow Documentation 1 and Documentation 2.
This TensorFlow tutorial guides you through converting your DataFrame to the tf.data format; a minimal sketch is shown below.
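A minimal sketch of such a pipeline, reusing the x_train, y_train, x_test, and y_test arrays already built in the question (the batch size mirrors the one passed to model.fit):
import tensorflow as tf

train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_ds = (train_ds
            .shuffle(buffer_size=len(x_train))         # reshuffle each epoch
            .batch(100)                                # same batch size as before
            .prefetch(tf.data.experimental.AUTOTUNE))  # overlap input prep and training
val_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(100)

# Then: history = model.fit(train_ds, epochs=n_epochs, validation_data=val_ds,
#                           callbacks=[tensorboard])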
One more feature that can be of use to you is tf.profiler. Using the TensorFlow Profiler, you can not only visualize the time and memory consumed in each phase of your project, it also provides suggestions/recommendations for reducing the time/memory consumption and hence optimizing your project; a sketch of enabling it is shown below.
For more information on the TensorFlow Profiler, refer to this documentation, this tutorial, and this TensorFlow DevSummit YouTube video.
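A minimal sketch of enabling the profiler through the TensorBoard callback already used in the question (profile_batch requires TF 2.x; the batch range is illustrative):
from tensorflow.keras.callbacks import TensorBoard

# Profile batches 10-20 of the first epoch; the trace shows up in TensorBoard's Profile tab.
tensorboard = TensorBoard(log_dir='./logs/rnn/profile', profile_batch='10,20')
# Pass it to model.fit via callbacks=[tensorboard] as before.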
Regarding testing accuracy higher than training accuracy: this is not a big problem and happens sometimes.
Probable reason 1: dropout. Why are you using dropout and recurrent_dropout in your model? Was it overfitting? If the model does not overfit without them, consider removing them: with dropout (0.2) and recurrent_dropout (0.2), 20% of the features and 20% of the time steps are zeroed out during training, whereas during testing all features and time steps are used, so the model looks more robust and shows better testing accuracy.
Probable reason 2: holding out 35% of the data for testing is a bit more than usual. You can make it 20% or 25%.
Probable reason 3: your training data might contain several arduous cases to learn, while your testing data may contain easier cases to predict. To mitigate this, split the data once again with a different random seed.
For more information, please refer to this ResearchGate link and this Stack Overflow link.
Hope this helps. Happy learning!

Is it ok to run same model multiple times?

My question: I ran a Keras model for 100 epochs (epochs=100) and stopped for some time to let the CPU and GPU cool down.
Then I ran 100 epochs again, and the loss continued decreasing from where it had stopped in the previous 100 epochs.
Does this work in all conditions?
For instance, if I want to train my model for 1000 epochs, can I stop after every 100 epochs, wait until my CPU and GPU cool down, and then run the next 100 epochs?
It will not work in all conditions. For example, if you shuffle the data and perform a validation split like this:
model.fit(x, y, epochs=1, verbose=1, validation_split=0.2, shuffle=True)
then, if the data is reshuffled between runs, the validation split contains different samples each time, so over several runs the model will effectively have trained on the entire dataset, which is not what you expect.
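A minimal sketch of the safer pattern, holding out a fixed validation set once so that repeated fit calls never leak validation samples into training (the seed and split size are illustrative):
from sklearn.model_selection import train_test_split

# Split once, with a fixed seed, before the first training run.
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=42)

# Every resumed run then validates on the same held-out set.
model.fit(x_train, y_train, epochs=100, shuffle=True,
          validation_data=(x_val, y_val))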
Furthermore, each call to fit replaces the history information (accuracy, loss, etc. at each epoch) given by:
model.history
so callback functions that rely on this history, like EarlyStopping, will not work properly across runs (source code here).
Otherwise, it works: calling fit again does not mess with the Keras optimizer state, as you can see in the source code of the Keras optimizers (e.g., the Adadelta optimizer).
However, I do not recommend doing this, because it could cause bugs in future development. A cleaner way would be to create a custom callback that pauses every few epochs, like this:
import time
import keras

class DelayCallback(keras.callbacks.Callback):
    def __init__(self, delay_value=10, epoch_to_complete=10):
        self.delay_value = delay_value              # cool-down duration, in seconds
        self.epoch_to_complete = epoch_to_complete  # pause every N epochs

    def on_epoch_begin(self, epoch, logs={}):
        if (epoch + 1) % self.epoch_to_complete == 0:
            print("cooling down")
            time.sleep(self.delay_value)
        return

model.fit(x_train, y_train,
          batch_size=32,
          epochs=20,
          verbose=1, callbacks=[DelayCallback()])
