Azure ML Service dump logs - azure-machine-learning-service

With the AzureML service, how can I dump the correct Loss curve or Accuracy curve for different epochs for keras deep learning on multiple nodes with Horovod?
The Loss vs epochs plt from Keras deep learning using Horovod and AzureML appears to have issues.
Training CNN with Keras/Horovod (2 GPUs) and AMLS SDK generates weird graphs

It seems like you might be training 2 models and the averaging of the gradients from the different nodes is note happening. Can you share more of your training script -- are you wrapping your optimizer in a DistributedOptimizer like so:
# Horovod: adjust learning rate based on number of GPUs.
opt = keras.optimizers.Adadelta(1.0 * hvd.size())
# Horovod: add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)
In addition, you really only want one machine to log, so usually only attach an AzureML logger for rank 0, like so:
class LogToAzureMLCallback(tf.keras.callbacks.Callback):
def on_batch_end(self, batch, logs=None):
Run.get_context().log('acc',logs['acc'])
def on_epoch_end(self, epoch, logs=None):
Run.get_context().log('epoch_acc',logs['acc'])
callbacks = [
# Horovod: broadcast initial variable states from rank 0 to all other processes.
# This is necessary to ensure consistent initialization of all workers when
# training is started with random weights or restored from a checkpoint.
hvd.callbacks.BroadcastGlobalVariablesCallback(0)
]
# Horovod: save checkpoints only on worker 0 and only log to AzureML from worker 0.
if hvd.rank() == 0:
callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
callbacks.append(LogToAzureMLCallback())
model.fit(x_train, y_train,
batch_size=batch_size,
callbacks=callbacks,
epochs=epochs,
verbose=1,
validation_data=(x_test, y_test))

How are you logging these metrics? From the graph it looks like there's two sets of datapoints interleaved.

Related

How can I save loss and accuracy metrics in mlflow after each epoch?

I would like to see metrics like loss and accuracy as a graph by storing each value for the corresponding metrics after each epoch during training/testing phase of a keras model.
PS: I know that we can do it by using autolog feature of mlflow for keras like below, but I dont want to use that.
mlflow.keras.autolog()
After searching through the internet and combining a few concepts, I was able to solve the problem that I had asked. In Keras, we can create custom callbacks that can be called at various points (start/end of epoch, batch, etc) during training, testing, and prediction phase of a model.
So, I created a Keras custom callback to store loss/accuracy values after each epoch as mlflow metrics like below.
class CustomCallback(keras.callbacks.Callback):
def on_epoch_end(self, epoch, logs=None):
mlflow.log_metrics({
"loss": logs["loss"],
"sparse_categorical_accuracy":
logs["sparse_categorical_accuracy"],
"val_loss": logs["val_loss"],
"val_sparse_categorical_accuracy":
logs["val_sparse_categorical_accuracy"],
})
I called this above callback during training of my model like below.
history = model.fit(
features_train,
labels_train,
batch_size=BATCH_SIZE,
epochs=EPOCHS,
callbacks=[CustomCallback()],
validation_split=0.2
)
The keras custom callback stored all the values during training after each epoch which I was able to see as a graph in mlflow UI like below.

GridSearchCV, Data Leaks & Production Process Clarity

I've read a bit about integrating scaling with cross-fold validation and hyperparameter tuning without risking data leaks. The most sensical solution I've found (according to my knowledge) involves creating a pipeline that includes the scalar and GridSeachCV, for when you want to grid search and cross-fold validate. I've also read that, even when using cross-fold validation, it is useful to, at the very beginning, create a hold-out test set for an additional, final evaluation of your model after hyperparameter tuning. Putting that all together looks like this:
# train, test, split, unscaled data to create a final test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# instantiate pipeline with scaler and model, so that each training set
# in each fold is fit to the scalar and each training/test set in each fold
# is respectively transformed by fit scalar, preventing data leaks between each test/train
pipe = Pipeline([('sc', StandardScaler()),
('knn', KNeighborsClassifier())
])
# define hypterparameters to search
params = {'knn_n_neighbors': [3, 5, 7, 11]}
# create grid
search = GridSearchCV(estimator=pipe,
param_grid=params,
cv=5,
return_train_Score=True)
search.fit(X_train, y_train)
Assuming my understanding and the above process is correct, my question is what's next?
My guess is that we:
fit X_train to our scaler
transform X_train and X_test with our scaler
train a new model using X_train and our newly discovered best parameters from the grid search process
test the new model with our very first holdout-test set.
Presumably, because the Gridsearch evaluated models with scaling based on various slices of the data, the difference in values from scaling our final, whole train and test data should be fine.
Finally, when it is time to process completely new data points through our production model, do those datapoints need to be transformed according to the scalar fit to our original X_train?
Thank you for any help. I hope I am not completely misunderstanding fundamental aspects of this process.
Bonus Question:
I've seen example code like above from a number of sources. How does pipeline know to fit the scalar to the crossfold's training data, then transform the training and test data? Usually we have to define that process:
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)
GridSearchCV will help you find the best set of hyperparameter according to your pipeline and dataset. In order to do that it will use cross validation (split the your train dataset into 5 equal subset in you case). This means that your best_estimator will be trained on 80% of the train set.
As you know the more data a model see, the better its result is. Therefore once you have the optimal hyperparameters, it is wise to retrain the best estimator on all your training set and assess its performance with the test set.
You can retrain the best estimator using the whole train set by specifying the parameter refit=True of the Gridsearch and then score your model on the best_estimator as follows:
search = GridSearchCV(estimator=pipe,
param_grid=params,
cv=5,
return_train_Score=True,
refit=True)
search.fit(X_train, y_train)
tuned_pipe = search.best_estimator_
tuned_pipe.score(X_test, y_test)

Is it ok to run same model multiple times?

My Question is I ran a keras model for 100 epochs(gave epochs=100) and stopped for some time for cooling purpose of CPU and GPU.
Than I ran 100 epochs again and loss is decreasing from where it has stopped in the previous 100 epochs .
Is it works in all conditions.
Like if there are on 1000 epochs I want to train my model, can I stop after every 100 epochs and wait until my cpu and GPU cools and run the next 100 epochs.
Can I do this?
It will not work in all conditions. For examples if you shuffle the data and perform a validation split like this :
fit(x,y,epochs=1, verbose=1, validation_split=0.2, shuffle=True)
You will use the entire dataset for training which is not what you expect.
Furthermore, by doing multiple fit you will erase history information (accuracy, loss, etc at each epoch), given by :
model.history
So some callback functions that use this history will not work properly, like EarlyStopping (source code here).
Otherwise, it works as it does not mess around with the keras optimizer as you can see in the source code of keras optimizers (Adadelta optimizer).
However, I do not recommend to do this. Because it could cause bugs in future development. A cleaner way to do that would be to create a custom callback function with a delay like this :
import time
class DelayCallback(keras.callbacks.Callback):
def __init__(self,delay_value=10, epoch_to_complete=10):
self.delay_value = delay_value # in second
self.epoch_to_complete = epoch_to_complete
def on_epoch_begin(self, epoch, logs={}):
if (epoch+1) % self.epoch_to_complete == 0:
print("cooling down")
time.sleep(self.delay_value)
return
model.fit(x_train, y_train,
batch_size=32,
epochs=20,
verbose=1, callbacks=[DelayCallback()])

How Keras really fits the models via epochs

I am a bit confused on how Keras fits the models. In general, Keras models are fitted by simply using model.fit(...) something like the following:
model.fit(X_train, y_train, epochs=300, batch_size=64, validation_data=(X_test, y_test))
My question is: Because I stated the testing data by the argument validation_data=(X_test, y_test), does it mean that each epoch is independent? In other words, I understand that at each epoch, Keras train the model using the training data (after getting shuffled) followed by testing the trained model using the provided validation_data. If that's the case, then no matter how many epochs I choose, I only take the results of the last epoch!!
If this scenario is correct, so we do we need multiple epoches? Unless these epoches are dependent somwhow where each epoch uses the same NN weights from the previous epoch, correct?
Thank you
When Keras fit your model it pass throught all the dataset at each epoch by a step corresponding to your batch_size.
For exemple if you have a dataset of 1000 items and a batch_size of 8, the weight of your model will be updated by using 8 items and this until it have seen all your data set.
At the end of that epoch, the model will try to do a prediction on your validation set.
If we have made only one epoch, it would mean that the weight of the model is updated only once per element (because it only "saw" one time the complete dataset).
But in order to minimize the loss function and by backpropagation, we need to update those weights multiple times in order to reach the optimum loss, so pass throught all the dataset multiple times, in other word, multiple epochs.
I hope i'm clear, ask if you need more informations.

Overfitting after one epoch

I am training a model using Keras.
model = Sequential()
model.add(LSTM(units=300, input_shape=(timestep,103), use_bias=True, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(units=536))
model.add(Activation("sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
while True:
history = model.fit_generator(
generator = data_generator(x_[train_indices],
y_[train_indices], batch = batch, timestep=timestep),
steps_per_epoch=(int)(train_indices.shape[0] / batch),
epochs=1,
verbose=1,
validation_steps=(int)(validation_indices.shape[0] / batch),
validation_data=data_generator(
x_[validation_indices],y_[validation_indices], batch=batch,timestep=timestep))
It is a multiouput classification accoriding to scikit-learn.org definition:
Multioutput regression assigns each sample a set of target values.This can be thought of as predicting several properties for each data-point, such as wind direction and magnitude at a certain location.
Thus, it is a recurrent neural network I tried out different timestep sizes. But the result/problem is mostly the same.
After one epoch, my train loss is around 0.0X and my validation loss is around 0.6X. And this values keep stable for the next 10 epochs.
Dataset is around 680000 rows. Training data is 9/10 and validation data is 1/10.
I ask for intuition behind that..
Is my model already over fittet after just one epoch?
Is 0.6xx even a good value for a validation loss?
High level question:
Therefore it is a multioutput classification task (not multi class), I see the only way by using sigmoid an binary_crossentropy. Do you suggest an other approach?
I've experienced this issue and found that the learning rate and batch size have a huge impact on the learning process. In my case, I've done two things.
Reduce the learning rate (try 0.00005)
Reduce the batch size (8, 16, 32)
Moreover, you can try the basic steps for preventing overfitting.
Reduce the complexity of your model
Increase the training data and also balance each sample per class.
Add more regularization (Dropout, BatchNorm)

Resources