Setting adaptable learning rate for Adam without using callbacks - python-3.x

I am exploring the translation model with attention from the TensorFlow docs (NMT with Attention). In TF 2.0, the optimizer is defined as
optimizer = tf.keras.optimizers.Adam()
How do I set a learning rate in this case? Is it just initializing the argument like below? How do I set an adaptable learning rate?
tf.keras.optimizers.Adam(learning_rate=0.001)
In the NMT with Attention model, they don't use Keras to define the model, so I cannot use callbacks or 'model.fit' like below:
model.fit(x_train, y_train, callbacks=[LearningRateReducerCb()], epochs=5)

I don't have much experience with NMT, but judging by the link you provided, the best option is probably a LearningRateSchedule, which can be passed directly as the learning rate parameter of any optimizer (e.g. Adam in your example). The process would be as follows:
1. Define your adaptive learning rate schedule (e.g. an AdaptiveLRSchedule class that inherits from LearningRateSchedule and implements whatever adaptive learning rate you prefer, similar to this example).
2. Instantiate your object: learning_rate = AdaptiveLRSchedule(your_parameters)
3. Use it as the learning rate in the Adam optimizer: optimizer = tf.keras.optimizers.Adam(learning_rate)
4. Keep the rest as it is in the example (optimizer.apply_gradients(zip(gradients, variables)) should now apply the gradients using the adaptive learning rate you defined).
Note that instead of defining your own class in the first step, you can use one of the schedules that already exist in TF, such as ExponentialDecay.
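The steps above can be sketched roughly as follows, assuming TensorFlow 2.x. The class name AdaptiveLRSchedule and its parameters (initial_lr, decay_rate, decay_steps) are hypothetical, and the inverse-time decay rule is only a stand-in for whatever rule you actually want.

```python
import tensorflow as tf


class AdaptiveLRSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Hypothetical schedule: lr = initial_lr / (1 + decay_rate * step / decay_steps)."""

    def __init__(self, initial_lr=1e-3, decay_rate=1.0, decay_steps=1000):
        super().__init__()
        self.initial_lr = initial_lr
        self.decay_rate = decay_rate
        self.decay_steps = decay_steps

    def __call__(self, step):
        # The optimizer passes its own iteration counter as `step`.
        step = tf.cast(step, tf.float32)
        return self.initial_lr / (1.0 + self.decay_rate * step / self.decay_steps)

    def get_config(self):
        # Only needed if the optimizer is going to be serialized.
        return {
            "initial_lr": self.initial_lr,
            "decay_rate": self.decay_rate,
            "decay_steps": self.decay_steps,
        }


learning_rate = AdaptiveLRSchedule(initial_lr=1e-3, decay_rate=1.0, decay_steps=1000)
optimizer = tf.keras.optimizers.Adam(learning_rate)
```

With this in place, the existing optimizer.apply_gradients(...) call in the training step needs no changes; the schedule is evaluated at every optimizer step.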
The second option is to set the learning rate manually in train_step (here in your example). After backpropagation (see the apply_gradients call in (4) above, which comes from your example) you can set the learning rate directly via optimizer.learning_rate.assign(new_lr), where new_lr is the new learning rate returned by an adaptive function that you have to define yourself (something like new_lr = adaptive_lr(optimizer.learning_rate)).
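A minimal sketch of this second option, assuming TensorFlow 2.x in eager mode and an optimizer created with a plain float learning rate. The toy Dense layer, the data, and the adaptive_lr rule (halve the rate whenever the loss fails to improve) are made up for illustration; only optimizer.learning_rate.assign comes from the answer.

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(1)  # toy stand-in for the real model
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)
x = tf.ones((4, 3))
y = tf.zeros((4, 1))


def adaptive_lr(current_lr, loss, best_loss, factor=0.5, min_lr=1e-5):
    """Hypothetical rule: shrink the rate whenever the loss fails to improve."""
    if loss >= best_loss:
        return max(current_lr * factor, min_lr)
    return current_lr


def train_step(best_loss):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(layer(x) - y))
    variables = layer.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    # After the update, overwrite the optimizer's learning rate by hand.
    current_lr = float(optimizer.learning_rate.numpy())
    optimizer.learning_rate.assign(adaptive_lr(current_lr, float(loss), best_loss))
    return float(loss)


best_loss = float("inf")
for _ in range(20):
    loss = train_step(best_loss)
    best_loss = min(best_loss, loss)
```

Reading the rate with .numpy() as done here only works outside tf.function; inside a compiled train_step the rule would have to be written with TensorFlow ops.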

Related

Change learning rate in Keras once loss stop decreasing

I'm new to deep learning.
I have built a small architecture and compiled it with the Adam optimizer as shown below:
model.compile(optimizer=Adam(learning_rate=0.0001), loss="mse")
#Train it by providing training images
model.fit(x, x, epochs=10, batch_size=16)
Now I'm aware of the various types of decay that change the learning rate at a given epoch, but is there a way to change my learning rate automatically once my loss stops decreasing?
PS: It might be a silly question, but please don't be harsh on me as I'm new!
You can use the Callbacks API in Keras.
It provides the following classes in keras.callbacks to alter learning rate on each epoch:
1. LearningRateScheduler
You can create your own learning rate schedule as a function of epoch.
Then pass the callback object to the callbacks argument to the fit method in a list.
For example, say your callback object is called lr_callback, then you would use:
model.fit(train_X, train_y, epochs=10, callbacks=[lr_callback])
Refer: keras.callbacks.LearningRateScheduler
2. ReduceLROnPlateau
This reduces the learning rate once the monitored quantity (for example the loss) stops improving by at least min_delta. You can also set the patience and other useful parameters.
You can pass the callback to the fit method in the same way as done above.
Refer: keras.callbacks.ReduceLROnPlateau
The usage for both the callbacks is detailed quite well in the docs, which I have linked above already.
Alternatively, you could define your own callback to schedule the learning rate, if the above do not satisfy your requirements.
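As a rough sketch of the ReduceLROnPlateau option, assuming tf.keras and an autoencoder-style setup like the one in the question. The random data, layer sizes and parameter values below are made up for illustration.

```python
import numpy as np
import tensorflow as tf

# Toy data: the model is trained to reproduce its input, as in model.fit(x, x, ...).
x = np.random.rand(64, 8).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(8),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss="mse")

# Halve the learning rate when the training loss has not improved by at least
# min_delta for `patience` consecutive epochs, but never go below min_lr.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="loss", factor=0.5, patience=2, min_delta=1e-4, min_lr=1e-6
)

history = model.fit(x, x, epochs=10, batch_size=16, callbacks=[reduce_lr], verbose=0)
```

If validation data is passed to fit, monitor="val_loss" is usually the more meaningful choice.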
As you use the Adam optimizer, the size of each update is already adapted per parameter from running averages of the gradient and of its square, so the effective step changes during training even though learning_rate itself stays fixed. I am not sure what you want to achieve, but the learning_rate you define in Adam() only sets the base scale of those steps.
For more information on Adam and optimizers, I recommend the book Hands-On Machine Learning by Aurélien Géron.
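To make the behaviour of Adam's step size concrete, here is a scalar sketch of the standard Adam update rule in plain Python. It is not Keras' implementation; the function name adam_step is made up, and the default values are the commonly used ones.

```python
import math


def adam_step(grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # running average of its square
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return update, m, v


# First step with a tiny gradient and with a huge gradient.
small, _, _ = adam_step(grad=0.01, m=0.0, v=0.0, t=1)
large, _, _ = adam_step(grad=100.0, m=0.0, v=0.0, t=1)
print(small, large)
```

Both first steps come out at roughly lr, because the update is normalized by the gradient's magnitude. The lr value itself is never modified by Adam, which is why a schedule or a callback is still needed to lower it.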

Monitor F1 Score (or a custom metric in general) in a keras callback

Keras 2.0 removed F1 score, but I would like to monitor its value. I am using a sequential model to train a Neural Net.
I defined a function, as suggested here How to calculate F1 Macro in Keras?.
This function works fine only if I use it inside model.compile. That way I see its value at each step. The problem is that I don't just want to see its value; I would like my training to behave differently according to its value, using Keras callbacks.
If I try to insert my custom metric in the callbacks then I get this error:
'function object is not iterable'
Do you know how to define a function such that it can be used as an argument in the callbacks?
A Keras callback lets you retrieve the model at different points in training, based on the metric you keep track of. It does not change what the model is trained to optimize.
You can train your model only with respect to some loss function, for example cross entropy for a classification problem. The readily available loss functions in Keras are given here
Precision, recall and F1 score are not differentiable functions. Hence, they cannot be used as a loss function for model training.
If you want to tune your hyperparameters (such as learning rate or class weights) to improve the F1 score, you can do that.
For tuning hyperparameters you can use hyperopt (see its tutorials).
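As a sketch of tracking F1 from a callback, assuming tf.keras and a binary classifier with a single sigmoid output: the names f1_score_binary and F1Monitor are hypothetical, and stopping after a few epochs without improvement is only one example of making training react to the metric.

```python
import numpy as np
import tensorflow as tf


def f1_score_binary(y_true, y_pred):
    """F1 for binary labels, computed with NumPy outside the training graph."""
    y_true = np.asarray(y_true).astype(bool).ravel()
    y_pred = np.asarray(y_pred).astype(bool).ravel()
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


class F1Monitor(tf.keras.callbacks.Callback):
    """Hypothetical callback: stop when validation F1 stalls for `patience` epochs."""

    def __init__(self, x_val, y_val, patience=3):
        super().__init__()
        self.x_val, self.y_val, self.patience = x_val, y_val, patience
        self.best, self.wait, self.history = -1.0, 0, []

    def on_epoch_end(self, epoch, logs=None):
        probs = self.model.predict(self.x_val, verbose=0)
        f1 = f1_score_binary(self.y_val, probs > 0.5)
        self.history.append(f1)
        if f1 > self.best:
            self.best, self.wait = f1, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.model.stop_training = True
```

It would be used as model.fit(x, y, callbacks=[F1Monitor(x_val, y_val)]). The callbacks argument expects a list of Callback objects, so passing a bare metric function there is a plausible cause of the 'function object is not iterable' error.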

Sklearn overfitting

I have a data set containing 1000 points, each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing. I am training it using the sklearn support vector regressor. I get 100% accuracy on the training set, but the results on the test set are not good. I think it may be because of overfitting. Can you please suggest something to solve the problem?
You may be right: if your model scores very high on the training data but does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model with a different configuration. I assume you are using train_test_split provided in sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR, create several models, and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be instantiated with several input parameters, each of which can be set to a number of different values. For simplicity, let's assume you are only dealing with two parameters that you want to tweak: 'kernel' and 'C', while keeping the third parameter 'degree' set to 4. You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
from sklearn.svm import SVR

for kernel in ('rbf', 'linear'):
    for c in (0.1, 1, 10):
        svr = SVR(kernel=kernel, C=c, degree=4)
        svr.fit(train_features, train_target)
        score = svr.score(test_features, test_target)  # R^2 on the held-out data
        print(kernel, c, score)
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
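from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

parameters = {'kernel': ('linear', 'rbf'), 'C': (0.1, 1, 10)}
clf = GridSearchCV(SVR(degree=4), parameters)  # SVR rather than SVC: the target is continuous
clf.fit(train_features, train_target)
print(clf.best_score_)
print(clf.best_params_)
model = clf.best_estimator_  # This is your model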
I am working on a little tool to simplify using sklearn for small projects, and make it a matter of configuring a yaml file, and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
Try fitting your model with sklearn's K-Fold cross-validation on the training data. This gives you a fair split of the data and a better model, though at a cost in running time, which should not matter much for a small dataset where the priority is accuracy.
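A small sketch of K-Fold cross-validation with an SVR, assuming scikit-learn and NumPy. The synthetic data stands in for the real training split, and the kernel and C values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the training split: 2 inputs, 1 continuous output.
rng = np.random.default_rng(0)
train_features = rng.uniform(-1, 1, size=(200, 2))
train_target = np.sin(3 * train_features[:, 0]) + train_features[:, 1] ** 2

# Five shuffled folds; each fold is held out once while the rest is used for fitting.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVR(kernel="rbf", C=1.0), train_features, train_target, cv=cv)

# For a regressor the default score is R^2, one value per fold.
print(scores.mean(), scores.std())
```

A large gap between these fold scores and the score on the training data itself is another sign of overfitting.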
A few hints:
Since you have only two inputs, it would be great if you plot your data. Try either a scatter with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by @shahins.
In particular, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it, which corresponds to regularizing the estimate more.
If it's taking too long, you can also try RandomizedSearchCV.
As a side note on @shahins' answer (I am not allowed to add comments), the two implementations are not equivalent. GridSearchCV is better since it performs cross-validation within the training set to tune the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data
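The hints above (scale the data, tune with GridSearchCV on the training set only, keep the test set for the final check) could be combined roughly like this, assuming scikit-learn. The synthetic data and the parameter grid are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in: 1000 points, 2 inputs on very different scales, 1 noisy output.
rng = np.random.default_rng(0)
features = rng.uniform(-1, 1, size=(1000, 2)) * [1.0, 100.0]
target = (np.sin(3 * features[:, 0]) + (features[:, 1] / 100.0) ** 2
          + rng.normal(0, 0.1, 1000))

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=0
)

# Scaling lives inside the pipeline, so it is re-fitted on each CV training fold.
pipeline = make_pipeline(StandardScaler(), SVR())
param_grid = {"svr__kernel": ["linear", "rbf"], "svr__C": [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)  # hyperparameters are tuned on the training set only

print(search.best_params_, search.best_score_)
print(search.score(X_test, y_test))  # the test set is touched once, at the end
```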

Adaptive learning rate Lasagne

I am using the Lasagne and Theano libraries to build my own deep learning model following the MNIST example. Can anyone please tell me how to adaptively change the learning rate?
I recommend having a look at https://github.com/Lasagne/Lasagne/blob/master/lasagne/updates.py.
If you are using sgd, you can use a momentum term (e.g. https://github.com/Lasagne/Lasagne/blob/master/lasagne/updates.py#L156) to adaptively change the learning rate. If you want to do anything non-standard, the momentum implementation gives you enough hints on how to create something similar on your own.
I think the best way of doing this is by creating a theano shared variable for your learning rate, passing the shared variable to the updates function and changing through the set_value method, as follows:
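import numpy as np
import theano
import lasagne

lr_shared = theano.shared(np.array(0.1, dtype=theano.config.floatX))
updates = lasagne.updates.rmsprop(..., learning_rate=lr_shared)
...
for epoch in range(num_epochs):
    # Divide the learning rate by 10 every 10 epochs (note: this also fires at epoch 0)
    if epoch % 10 == 0:
        lr_shared.set_value(lr_shared.get_value() / 10)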
Of course you can change the optimizer and the if condition; this is just an example.

Custom operation implementation for RBM/DBN with tensorflow?

Since Google released tensorflow, it has become something of a trend among deep learning frameworks.
I'd like to do some experiments with RBM/DBN (Restricted Boltzmann Machine/Deep Belief Network). I made some attempts myself and managed to implement it reasonably well by combining the available tensorflow APIs. See code and previous answer.
So, if running performance is not a concern, here's the gift of an RBM/DBN implementation with tensorflow.
But running performance must be considered in the future. Because of the particular procedure of the CD (Contrastive Divergence) algorithm, I think it works against the framework (data flow graph) used by tensorflow. That's why my code seems weird.
So, a custom operation should be implemented for acceleration. I've followed the current documentation about adding custom ops.
REGISTER_OP("NaiveRbm")
    .Input("visible: float32")
    .Input("weights: float32")
    .Input("h_bias: float32")
    .Input("v_bias: float32")
    .Output("hidden: float32")
    .Doc(R"doc(
Naive Rbm for separate training use. DO NOT mix up with other operations
)doc");
In my design, NaiveRbm is an operation that takes visible, weights, h_bias, v_bias as input, but computes its output from only the first 3 variables (simply sigmoid(X*W+hb)); its gradient should return at least the gradients for the last 3 variables.
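For reference, the forward computation sigmoid(X*W+hb) and one step of CD-1 can be sketched in plain NumPy as below. This is a textbook-style sketch and not the custom op itself; the function names, shapes and learning rate are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd1_step(v0, W, h_bias, v_bias, lr=0.1, rng=None):
    """One CD-1 update in place; returns the hidden probabilities sigmoid(v0 @ W + h_bias)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Positive phase: hidden probabilities and a binary sample given the data.
    h0_prob = sigmoid(v0 @ W + h_bias)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(v0.dtype)
    # Negative phase: reconstruct the visible units, then the hidden units again.
    v1_prob = sigmoid(h0 @ W.T + v_bias)
    h1_prob = sigmoid(v1_prob @ W + h_bias)
    # Parameter updates from the difference between data and reconstruction statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n
    h_bias += lr * (h0_prob - h1_prob).mean(axis=0)
    v_bias += lr * (v0 - v1_prob).mean(axis=0)
    return h0_prob


rng = np.random.default_rng(0)
visible = (rng.random((8, 6)) < 0.5).astype(np.float64)  # batch of 8 binary vectors
weights = 0.01 * rng.standard_normal((6, 4))
h_bias = np.zeros(4)
v_bias = np.zeros(6)
weights_before = weights.copy()
hidden = cd1_step(visible, weights, h_bias, v_bias, rng=rng)
```

The sampling step in the middle is the part that does not map onto a plain gradient of a loss, which is what makes CD awkward to express through an optimizer's minimize call.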
Imagine example pseudocode like this:
X = tf.placeholder()
W1, hb1, vb1 = tf.Variable()
W2, hb2, vb2 = tf.Variable()
rbm1 = NaiveRbm(X,W1,hb1,vb1)
train_op = tf.train.MomentumOptimizer(0.01, 0.5).minimize(rbm1)
rbm2 = NaiveRbm(tf.stop_gradient(rbm1), W2, hb2, vb2)
train_op2 = tf.train.MomentumOptimizer(0.01, 0.5).minimize(rbm2)
with tf.Session() as sess:
    for batch in batches:
        sess.run(train_op, feed_dict={X: batch})
    for batch in batches:
        sess.run(train_op2, feed_dict={X: batch})
But the tensorflow library is too complex for me. After spending a lot of time looking into how to use these existing operations (sigmoid, matmul, ma_add, relu, random_uniform) inside a custom operation, I have not found a solution myself.
So, I'd like to ask if someone could help me with the remaining work.
PS: before getting any ideas, I'd like to dive into Theano since it already implements RBM/DBN. In my opinion, Caffe is not really suitable for RBM/DBN because of its framework.
Update: After going through the Theano tutorials, I found that the key reason Theano has an RBM/DBN implementation while tensorflow doesn't is the scan mechanism. So we may have to wait for tensorflow to implement scan before RBM/DBN can be implemented.
