Choosing the learning_rate using fastai's learn.lr_find() - pytorch

I am going over this Heroes Recognition ResNet34 notebook published on Kaggle.
The author uses fastai's learn.lr_find() method to find the optimal learning rate.
Plotting the loss function against the learning rate yields the following figure:
It seems that the loss reaches a minimum for 1e-1, yet in the next step the author passes 1e-2 as the max_lr in fit_one_cycle in order to train his model:
learn.fit_one_cycle(6,1e-2)
Why use 1e-2 over 1e-1 in this example? Wouldn't this only make the training slower?

The idea for a learning rate range test as done in lr_find comes from this paper by Leslie Smith: https://arxiv.org/abs/1803.09820 That has a lot of other useful tuning tips; it's worth studying closely.
In lr_find, the learning rate is slowly ramped up (in a log-linear way). You don't want to pick the point at which loss is lowest; you want to pick the point at which it is dropping fastest per step (=net is learning as fast as possible). That does happen somewhere around the middle of the downward slope or 1e-2, so the guy who wrote the notebook has it about right. Anything between 0.5e-2 and 3e-2 has roughly the same slope and would be a reasonable choice; the smaller values would correspond to a bit slower learning (=more epochs needed, also less regularization) but with a bit less risk of reaching a plateau too early.
I'll try to add a bit of intuition about what is happening when loss is the lowest in this test, say learning rate=1e-1. At this point, the gradient descent algorithm is taking large steps in the direction of the gradient, but loss is not decreasing. How can this happen? Well, it would happen if the steps are consistently too large. Think of trying to get into a well (or canyon) in the loss landscape. If your step size is larger than the size of the well, you can consistently step over it every time and end up on the other side.
This picture from a nice blog post by Jeremy Jordan shows it visually:
In the picture, it shows the gradient descent climbing out of a well by taking too large steps (maybe lr=1+0 in your test). I think this rarely happens exactly like that unless lr is truly excessive; more likely, the well is in a relatively flat landscape, and the gradient descent can step over it, not being able to get into the well in the first place. High-dimensional loss landscapes are hard to visualize, and may be very irregular, but in a sense the lr_find test is looking for the scale of the typical features in the landscape and then picking a learning rate that gives you a step which is similar sized but a bit smaller.

You can find the suggested learning rate as follows:
_, lr = learner.lr_find()

Related

Bias/Dirft compensation for integration of linear accelerometer data using Kalman filtering

I have become a part of this infinite question of how to estimate position from accelerometer data achieved by an Inertial measurement unit. I am wondering how to compensate for integration ''drift'' during linear movement using Kalman filtering.
At this moment I got my acceleration in a fixed coordinate system and all movements are in know directions with no change in angular position.
So at this point we got acceleration in 3D (x-y-z) in known directions, an acceleration in x will yield for zero acceleration in y and z and so on. Assuming perfect conditions, which are not the case, of course some noise with be added to the other directions when moving in one direction but lets ''leave'' this out at this point. In addition, It is important to note that the system only has to estimate a limited period, approximately about 1 second using a sampling freq of 512 Hz.
It also important to note that I have compensated for the offset (gravity and misalignment of the accelerometer in the IMU) and bias of the acceleromter data when static. Meaning when the sensor is non-moving all my readings are constant zero before going into the Kalman filter.
To more characterize my problem I have this graph to illustrate my problem with drift. This is estimations on 5 seconds to more show what I'm struggling with.
Position-estimation-drift-problem
Here we are looking into a movement in one direction, the movement are 20cm movement in y direction which in my case are forward relative to my starting position.
Is there a way to reduce/eliminate this drift when integrating my signal. For instance assume something about drifting when my sensor is non-moving. Or to compute using some correction in my Kalman algorithm to subtract or add to my estimated velocity and position. The system does not have to run in real time so any tuning bias compensation can be adjusted for looking back into the data. But I would be preferable if it was possible to take new measurements with slightly different movements and not tune more then needed.
Finally where/how can I compensate for this, in the Kalman algorithm or before/after, or should I be in for a disappointment already?
If I left out some important information please ask so i can elaborate more, an at last any thoughts/ideas are welcome!
Remember I do only need to estimate for second’s worth of time so my hope is that this makes it more achievable, but i might be wrong?
I can only guess/suggest few tricks, but you will probably get some significant error if you only based on accelerometer.
seems that detecting motionless is not resetting the speed, just acceleration (according to your graph) so this should be an easy fix
if we are talking an a car/other type of surface motion with contact / friction, your motionless can be set by characterizing the noise of in motion/self sensor noise
kalman parameters may be off
run multiple kernels and average results (may also try particle filter)
if its not for online application you can also try fitting offsets/drift and reduce them by assuming there is not motion in constant speed or other approaches that can replace the kalman filter which is designed for real time best estimation.
error seems a-symmetric in time, just run it in both directions (:
what are you measuring at 512 Hz??? maybe you can better model it
I can go on and on but if you supply data and code, it would be much easier.
Good luck,
Lev

How to visualize error surface in keras?

We see pretty pictures of error surface with a global minima and convergence of a neural network in many books. How can I visualize something similar in keras i.e containing error surface and how my model is converging to achieve global minimal error? Below is an example image of such illustrations. And this link has animated illustration of different optimizers. I explored tensorboard log callback for this purpose but could not find any such thing. A little guidance will be appreciated.
The pictures and animations are made for didatic purposes, but the error surface is completely unknown (or incredibly complex to be understood or visualized). That's the whole idea behind using gradient descent.
We only know, at a single point, the direction towards which the funcion increases, through getting the current gradient.
You could try to plot the way (line) you're following by getting the weights values at each iteration and the error, but then you'd face another problem: it's a massively multidimensional function. It's not actually a surface. The number of variables is the number of weights you have in the model (often thousands or even millions). This is absolutely impossible to visualize or even conceive as a visual thing.
To plot such a surface, you'd have to manually change all thousands of weights to get the error for each arrangement. Besides the "impossible to visualize" problem, this would be excessively time consuming.

Gaussian-Bernoulli RBM high reconstruction error

I'm normalizing my data to zero mean and unit variance as recommended in most literature to pre-train a GB-RBM. But whatever learning rate I choose and whatsoever is the number of epochs, my mean reconstruction error never drops below around 0.6.
Reconstruction errors for the stacked BB-RBMs easily drop to 0.01 within a few epochs. I've used several toolkits which implement GBRBMs as mentioned in http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf but all have the same issue. Am I missing something or is the reconstruction error meant to stay above 50% ?
I'm normalizing my data by subtracting mean and dividing by the standard deviation along each dimension of input vector:
size(mfcc) --> [mlength rows x 39 cols]
mmean=mean(mfcc);
mstd=std(mfcc);
mfcc=mfcc-ones(mlength,1)*mmean;
mfcc=mfcc./(ones(mlength,1)*mstd);
This does give me zero mean and unit var along each dimension. I have tried different datasets, different features and different toolkits, but my reconstr error never drops below 0.6 for GBRBMs.
Thanks
I would guess you are using exp() as the sigmoid and then using a 3rd party library to do the matrix functions?
if the above is true, I would guess the 3rd party library is swallowing the exp() overflow errors but still stopping the calculation and so the hidden/recreated vectors are invalid.
edit based on comment below:
theano.tensor.nnet.sigmoid() is using exp() so I would first try switching to hard_sigmoid(). It won't be as nice of a curve, but it won't overflow/underflow so you can see if that is the source of error.
I assume you tried other data preprocessing and still had the high reconstruction errors?

How do I measure the distribution of an attribute of a given population?

I have a catalog of 900 applications.
I need to determine how their reliability is distributed as a whole. (i.e. is it normal).
I can measure the reliability of an individual application.
How can I determine the reliability of the group as a whole without measuring each one?
That's a pretty open-ended question! Overall, distribution fitting can be quite challenging and works best with large samples (100's or even 1000's). It's generally better to pick a modeling distribution based on known characteristics of the process you're attempting to model than to try purely empirical fitting.
If you're going to go empirical, for a start you could take a random sample, measure the reliability scores (whatever you're using for that) of your sample, sort them, and plot them vs normal quantiles. If they fall along a relatively straight line the normal distribution is a plausible model, and you can estimate sample mean and variance to parameterize it. You can apply the same idea of plotting vs quantiles from other proposed distributions to see if they are plausible as well.
Watch out for behavior in the tails, in particular. Pretty much by definition the tails occur rarely and may be under-represented in your sample. Like all things statistical, the larger the sample size you can draw on the better your results will be.
I'd also add that my prior belief would be that a normal distribution wouldn't be a great fit. Your reliability scores probably fall on a bounded range, tend to fall more towards one side or the other of that range. If they tend to the high range, I'd predict that they get lopped off at the end of the range and have a long tail to the low side, and vice versa if they tend to the low range.

2D shape optimization through genetic algorithms

I just recently started learning about genetic algorithms and am now trying to implement them in 2D shape optimization in physics simulaiton. The simulation produces a single scalar for each shape. (I guess this is kind of similar to boxcar2d http://boxcar2d.com/)
The 2D shapes are actually the union of several 2D "sub shapes." Each subshape is stored as an list of angles/radii. The 2D shape is then stored as a list of subshape lists. This serves as my chromosone right now.
Right now for fitness, I will probably use the scalar the simulation produced. My question is, how should I go about the selection and reproduction process? Would tournament be more appropriate, or would I want to use truncation in combination with the proportional selection? Also, how do you find a good mutation rate/population size, etc
sorry for so many questions but thanks in advance. I just don't really know where to start.
On my point of view the best way is to use adaptive reproduction strategy during evolution: at the first steps (let name it - "the first phase of calculations") you might set high mutation probability, at this phase you should find enough good solution. At the "second phase" of algorithm you might set decreasing of mutation probability every few steps - at this phase you should improve your solution. But sometimes in my practice I've noticed degradation of population during second phase of optimization (when each chromosome is strongly similar to other) - which effects with extremly slowing down of optimization pperformance, so my solution was to improve algorithm with high valued mutation random perturbations and it helps.
Also I'll advice you to read about differential evolution algorithm - http://en.wikipedia.org/wiki/Differential_evolution. As for me it's performance is much more faster than genetic algorithm.

Resources