Why models often benefit from reducing the learning rate during training - keras

In Keras official documentation for ReduceLROnPlateau class (https://keras.io/api/callbacks/reduce_lr_on_plateau/)
they mention that
"Models often benefit from reducing the learning rate"
Why is that so?
It's counter-intuitive for me at least, since from what I know- a higher learning rate allows taking further steps from my current position.
Thanks!

Neither too high nor too low learning rate should be considered for training a NN. A large learning rate can miss the global minimum and in extreme cases can cause the model to diverge completely from the optimal solution. On the other hand, a small learning rate can stuck to a local minimum.
ReduceLROnPlateau purpose is to track your model's performance and reduce the learning rate when there is no improvement for x number of epochs. The intuition is that the model approached a sub-optimal solution with current learning rate and oscillate around the global minimum. Reducing the learning rate would enable the model to take smaller learning steps to the optimal solution of the cost function.
Image source

Related

How to put more weight on one class during training in Pytorch [duplicate]

I have a multilabel classification problem, which I am trying to solve with CNNs in Pytorch. I have 80,000 training examples and 7900 classes; every example can belong to multiple classes at the same time, mean number of classes per example is 130.
The problem is that my dataset is very imbalance. For some classes, I have only ~900 examples, which is around 1%. For “overrepresented” classes I have ~12000 examples (15%). When I train the model I use BCEWithLogitsLoss from pytorch with a positive weights parameter. I calculate the weights the same way as described in the documentation: the number of negative examples divided by the number of positives.
As a result, my model overestimates almost every class… Mor minor and major classes I get almost twice as many predictions as true labels. And my AUPRC is just 0.18. Even though it’s much better than no weighting at all, since in this case the model predicts everything as zero.
So my question is, how do I improve the performance? Is there anything else I can do? I tried different batch sampling techniques (to oversample minority class), but they don’t seem to work.
I would suggest either one of these strategies
Focal Loss
A very interesting approach for dealing with un-balanced training data through tweaking of the loss function was introduced in
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollar Focal Loss for Dense Object Detection (ICCV 2017).
They propose to modify the binary cross entropy loss in a way that decrease the loss and gradient of easily classified examples while "focusing the effort" on examples where the model makes gross errors.
Hard Negative Mining
Another popular approach is to do "hard negative mining"; that is, propagate gradients only for part of the training examples - the "hard" ones.
see, e.g.:
Abhinav Shrivastava, Abhinav Gupta and Ross Girshick Training Region-based Object Detectors with Online Hard Example Mining (CVPR 2016)
#Shai has provided two strategies developed in the deep learning era. I would like to provide you some additional traditional machine learning options: over-sampling and under-sampling.
The main idea of them is to produce a more balanced dataset by sampling before starting your training. Note that you probably will face some problems such as losing the data diversity (under-sampling) and overfitting the training data (over-sampling), but it might be a good start point.
See the wiki link for more information.

Multilabel classification with class imbalance in Pytorch

I have a multilabel classification problem, which I am trying to solve with CNNs in Pytorch. I have 80,000 training examples and 7900 classes; every example can belong to multiple classes at the same time, mean number of classes per example is 130.
The problem is that my dataset is very imbalance. For some classes, I have only ~900 examples, which is around 1%. For “overrepresented” classes I have ~12000 examples (15%). When I train the model I use BCEWithLogitsLoss from pytorch with a positive weights parameter. I calculate the weights the same way as described in the documentation: the number of negative examples divided by the number of positives.
As a result, my model overestimates almost every class… Mor minor and major classes I get almost twice as many predictions as true labels. And my AUPRC is just 0.18. Even though it’s much better than no weighting at all, since in this case the model predicts everything as zero.
So my question is, how do I improve the performance? Is there anything else I can do? I tried different batch sampling techniques (to oversample minority class), but they don’t seem to work.
I would suggest either one of these strategies
Focal Loss
A very interesting approach for dealing with un-balanced training data through tweaking of the loss function was introduced in
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollar Focal Loss for Dense Object Detection (ICCV 2017).
They propose to modify the binary cross entropy loss in a way that decrease the loss and gradient of easily classified examples while "focusing the effort" on examples where the model makes gross errors.
Hard Negative Mining
Another popular approach is to do "hard negative mining"; that is, propagate gradients only for part of the training examples - the "hard" ones.
see, e.g.:
Abhinav Shrivastava, Abhinav Gupta and Ross Girshick Training Region-based Object Detectors with Online Hard Example Mining (CVPR 2016)
#Shai has provided two strategies developed in the deep learning era. I would like to provide you some additional traditional machine learning options: over-sampling and under-sampling.
The main idea of them is to produce a more balanced dataset by sampling before starting your training. Note that you probably will face some problems such as losing the data diversity (under-sampling) and overfitting the training data (over-sampling), but it might be a good start point.
See the wiki link for more information.

XGBOOST faster than random forest?

I am doing kaggle inclass challege of bosten hosing prices and learnt that XGBoost is faster than RandomForest but when implemented was slower.i Want to ask when XGBoost becomes faster and when RandomForest??.I am new to machine learning and need your help.Thanking in advance
Mainly, the parameters you choose have strong impact in the speed of your algorithm, (e.g learning rate, depth of the tree, number of features etc.), there's a trade-off between accuracy and speed, so i suggest you put the parameters you've chosen for every model and see how to change it to get faster performance with reasonable accuracy.

Why does removing validation samples from Keras model improve test accuracy so much

I'm doing a programming assignment for Andrew Ng's Deep Learning course on Convolutional Models that involves training and evaluating a model using Keras. What I've observed after a little playing with various knobs is something curious: The test accuracy of the model greatly improves (from 50 percentile to 90 percentile) by setting the validation_fraction parameter on the Model.fit operation to 0. This is surprising to me; I would have thought that eliminating the validation samples would lead to over-fitting of the model, which would, in turn, reduce accuracy on the test set.
Can someone please explain why this is happening?
You're right, there is more training data, but the increase is pretty negligible since dI was setting the validation fraction to 0.1, so that would increase the training data by 11.111...% However, thinking about it some more, I realized that removing the validation step doesn't have any effect on the model, hence no impact on test accuracy. I think that I must have changed some other parameter, too, though I don't remember which.
As Matias says, it means there is more training data to work with.
However, I'd also make sure that the test accuracy is actually increasing from 50 to 90% consistently. Run it over a couple times to make sure. There is a possibility that, because there is very little validation samples, that the model got lucky. That's why it is important to have a lot of validation data - to make sure the model isn't just getting lucky, and that there's actually a method to the madness.
I go over some of the "norms" when it comes to training and testing data in my book about stock prediction (another great way in my opinion to learn about Deep Learning). Feel free to check it out and learn more, as it's great for beginners.
Good Luck!

Computational Learning theory based on PAC-learning framework

Consider a Machine Learning Algorithm which train from a training set, with the help of PAC learning model we get bounds on training sample size needed so the probability that error is limited(by epsilon) is bounded(by delta).
What does PAC learning model say about computational(time) complexity.
Suppose a Learning Algorithm is given more time(like more iterations) how the error and probability that error is limited changes
As an learning algorithm which takes one hour to train is of no practical use in financial prediction problems. I need how the performance changes as time given to algorithm changes both in terms of error bounds and what is the probability that error is bounded
The PAC model simply tells you how many pieces of data you need to get a certain level of error with some probability. This can be translated into the impact on the run time by looking at the actual machine learning algorithm your using.
For example, if your algorithm runs in time O(2^n), and the PAC model says you need 1000 examples to have a 95% chance of having .05 error and 10,000 example for .005 error, then you know you should expect a HUGE slowdown for the increased accuracy. Whereas the same PAC information for a O(log n) algorithm would probably lead you to go ahead and get the lower error.
On a side note, it sounds like you might be confused about how most supervised learning algorithms work:
Suppose a Learning Algorithm is given more time(like more iterations) how the error and probability that error is limited changes
In most cases you can't really just give the same algorithm more time and expect better results, unless you chance the parameters (e.g. learning rate) or increase the number of examples. Perhaps by 'iterations' you meant examples, in which case the impact of the number of examples on the probably and error rate can be found by manipulating the system of equations used for the PAC learning model; see the wiki article.

Resources