How to use the Adam optimizer, considering its adaptive learning rate? - pytorch

In the Adam optimization algorithm, the effective learning rate is adjusted as the number of iterations grows. I don't quite understand Adam's design, especially when using batch training. With batch training, if there are 19,200 pictures and 64 pictures are trained at a time, that is 300 iterations per epoch. If we train for 200 epochs, there are 60,000 iterations in total. I don't know whether that many iterations will shrink the learning rate to a very small value. So, during training, should we re-initialize the optimizer after each epoch, or do nothing throughout the whole process?
Using PyTorch. I have tried re-initializing the optimizer after each epoch when I use batch training, and I do nothing when the amount of data is small.
For example, I don't know whether these two pieces of code are correct:
optimizer = optim.Adam(model.parameters(), lr=0.1)
for epoch in range(100):
    ### Some code
    optimizer.step()
Another piece of code:
for epoch in range(100):
    optimizer = optim.Adam(model.parameters(), lr=0.1)
    ### Some code
    optimizer.step()

You can read the official paper here https://arxiv.org/pdf/1412.6980.pdf
Your update looks somewhat like this (for brevity's sake I have omitted the bias-correction warm-up phase; momentum is the running mean of the gradients and velocity is their root mean square, i.e. the square root of Adam's second-moment estimate):
new_theta = old_theta - learning_rate * momentum / (velocity + eps)
The intuition here is that if momentum > velocity, the optimizer is on a plateau, so the learning_rate is effectively increased because momentum/velocity > 1. On the other hand, if momentum < velocity, the optimizer is on a steep slope or in a noisy region, so the learning_rate is effectively decreased.
The learning_rate is therefore not necessarily decreased throughout training, as you suggested in your question.
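In practice that means you create the Adam optimizer once, before the epoch loop, and keep it for the whole run; re-creating it every epoch also throws away Adam's running moment estimates. Below is a minimal runnable sketch of that pattern, with a toy linear model and random batches standing in for your real model and data:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)                               # stand-in for your network
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)   # built once, not per epoch

for epoch in range(200):
    for batch in range(300):                           # e.g. 19,200 images / batch size 64
        x, y = torch.randn(64, 10), torch.randn(64, 1) # stand-in for a real minibatch
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()                               # Adam adapts the step per parameter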

Related

Mask R-CNN optimizer and learning rate scheduler in Pytorch

In the Mask R-CNN paper here, the optimizer is described as follows for training on the MS COCO 2014/2015 dataset for instance segmentation (I believe this is the dataset; correct me if that is wrong):
We train on 8 GPUs (so effective minibatch size is 16) for 160k iterations, with a learning rate of 0.02 which is decreased by 10 at the 120k iteration. We use a weight decay of 0.0001 and momentum of 0.9. With ResNeXt [45], we train with 1 image per GPU and the same number of iterations, with a starting learning rate of 0.01.
I'm trying to write an optimizer and learning rate scheduler in Pytorch for a similar application, to match this description.
For the optimizer I have:
def get_Mask_RCNN_Optimizer(model, learning_rate=0.02):
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=0.0001)
    return optimizer
For the learning rate scheduler I have:
def get_MASK_RCNN_LR_Scheduler(optimizer, step_size):
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=0.1, verbose=True)
    return scheduler
When the authors say "decreased by 10", do they mean divide by 10? Or do they literally mean subtract 10, in which case we would have a negative learning rate, which seems odd/wrong? Any insights appreciated.
The authors mean divide by 10, as pointed out in the comments.
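Putting the two helpers together, a hedged usage sketch (a tiny placeholder model stands in for the real Mask R-CNN, and the real loss computation is elided): if the scheduler is stepped once per iteration, step_size=120000 reproduces the single x0.1 drop at the 120k-th of the 160k iterations.

import torch
import torch.nn as nn

model = nn.Linear(8, 1)  # placeholder; use your Mask R-CNN model here
optimizer = get_Mask_RCNN_Optimizer(model, learning_rate=0.02)
scheduler = get_MASK_RCNN_LR_Scheduler(optimizer, step_size=120000)

for iteration in range(160000):
    # ... forward pass on a batch, loss.backward() ...
    optimizer.step()   # placeholder step; in real training this follows backward()
    scheduler.step()   # per-iteration stepping; gamma=0.1 fires once, at iteration 120k
                       # (note: verbose=True in the helper logs the lr at every step)

print(optimizer.param_groups[0]['lr'])  # 0.002 after the drop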

Regarding number of epochs for torchvision models

I was trying to find out how many epochs the pretrained AlexNet model (available from torchvision) was trained for on ImageNet, and also what learning rate was used. I tried checking the checkpoint keys to see if any epoch info was stored.
Any suggestions on how to find it out?
According to this comment on GitHub by a PyTorch team member, most of the training was done with a variant of https://github.com/pytorch/examples/tree/master/imagenet. All the models were trained on Imagenet. According to the file:
The default learning rate schedule starts at 0.1 and decays by a factor of 10 every 30 epochs, though they recommend 0.01 as the initial learning rate for AlexNet.
The default value for epochs is 90.
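As far as I can tell, that schedule amounts to the following sketch (this mirrors the described behaviour of the reference script, not its exact code; the momentum and weight decay are the example's defaults, and a tiny placeholder model stands in for AlexNet):

import torch
import torch.nn as nn

model = nn.Linear(8, 2)            # placeholder for torchvision's alexnet
base_lr = 0.01                     # 0.1 by default; 0.01 recommended for AlexNet
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=1e-4)

for epoch in range(90):            # default number of epochs
    lr = base_lr * (0.1 ** (epoch // 30))   # divide by 10 every 30 epochs
    for group in optimizer.param_groups:
        group['lr'] = lr
    # ... train one epoch on ImageNet here ...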

Wasserstein GAN critic training ambiguity

I'm running a DCGAN-based GAN, and am experimenting with WGANs, but am a bit confused about how to train the WGAN.
In the official Wasserstein GAN PyTorch implementation, the discriminator/critic is said to be trained Diters (usually 5) times per each generator training.
Does this mean that the critic/discriminator trains on Diters batches or the whole dataset Diters times? If I'm not mistaken, the official implementation suggests the discriminator/critic is trained on the whole dataset Diters times, but other implementations of WGAN (in PyTorch and TensorFlow etc.) do the opposite.
Which is correct? The WGAN paper indicates (to me, at least) that it is Diters batches. Training on the whole dataset is obviously orders of magnitude slower.
Thanks in advance!
The correct interpretation is to consider an iteration as a batch.
In the original paper, for each iteration of the critic/discriminator they sample a batch of size m of real data and a batch of size m of prior samples p(z) to work with. After the critic is trained for Diters iterations, they train the generator, which also starts by sampling a batch of prior samples from p(z).
Therefore, each iteration is working on a batch.
The same happens in the official implementation. What may be confusing is that they use the variable name niter to represent the number of epochs to train the model. Although they use a different scheme to set Diters at lines 162-166:
# train the discriminator Diters times
if gen_iterations < 25 or gen_iterations % 500 == 0:
    Diters = 100
else:
    Diters = opt.Diters
they are, as in the paper, training the critic over Diters batches.
This implementation of WGAN also shows it as Diters batches for the discriminator for each run of the generator: https://github.com/shayneobrien/generative-models/blob/74fbe414f81eaed29274e273f1fb6128abdb0ff5/src/w_gan.py#L88
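To make the structure concrete, here is a minimal runnable sketch of that schedule (the toy critic/generator and random tensors stand in for real networks and data; the RMSprop learning rate and clipping value follow the paper):

import torch
import torch.nn as nn

nz, diters_default = 8, 5
netD = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))   # toy critic
netG = nn.Sequential(nn.Linear(nz, 32), nn.ReLU(), nn.Linear(32, 16))  # toy generator
optD = torch.optim.RMSprop(netD.parameters(), lr=5e-5)
optG = torch.optim.RMSprop(netG.parameters(), lr=5e-5)

def sample_real(batch=64):
    return torch.randn(batch, 16)   # stand-in for one real minibatch

gen_iterations = 0
for _ in range(100):                # generator updates
    # same schedule as the official code: extra critic steps early and every 500 gen steps
    Diters = 100 if gen_iterations < 25 or gen_iterations % 500 == 0 else diters_default
    for _ in range(Diters):         # Diters *batches*, not Diters passes over the dataset
        optD.zero_grad()
        real = sample_real()
        fake = netG(torch.randn(real.size(0), nz)).detach()
        lossD = -(netD(real).mean() - netD(fake).mean())
        lossD.backward()
        optD.step()
        for p in netD.parameters():             # weight clipping from the WGAN paper
            p.data.clamp_(-0.01, 0.01)
    optG.zero_grad()
    lossG = -netD(netG(torch.randn(64, nz))).mean()
    lossG.backward()
    optG.step()
    gen_iterations += 1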

Multivariate LSTM Forecast Loss and evaluation

I have a CNN-RNN model architecture with bidirectional LSTMs for a time series regression problem. My loss does not converge over 50 epochs. Each epoch has 20k samples. The loss keeps bouncing between 0.001 and 0.01.
batch_size=1
epochs = 50
model.compile(loss='mean_squared_error', optimizer='adam')
trainingHistory=model.fit(trainX,trainY,epochs=epochs,batch_size=batch_size,shuffle=False)
I tried to train the model with incorrectly paired X and Y data, for which the loss stays around 0.5. Is it a reasonable conclusion that my X and Y have a non-linear relationship which can be learned by my model over more epochs?
The predictions of my model capture the pattern, but with an offset. I use dynamic time warping distance to manually check the accuracy of the predictions; is there a better way?
Model :
model = Sequential()
model.add(LSTM(units=128, dropout=0.05, recurrent_dropout=0.35, return_sequences=True, batch_input_shape=(batch_size,featureSteps,input_dim)))
model.add(LSTM(units=32, dropout=0.05, recurrent_dropout=0.35, return_sequences=False))
model.add(Dense(units=2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
If you tested with:
Wrong data: loss ~0.5
Correct data: loss ~0.01
Then your model is actually capable of learning something.
There are some possibilities there:
Your output data does not fit in the range of the last layer's activation
Your model reached a limit for the current learning rate (gradient update steps are too big and can't improve the model anymore).
Your model is not good enough for the task.
Your data has some degree of random factors
Case 1:
Make sure your Y is within the range of your last activation function.
For a tanh (the LSTM's default), all Y data should be between -1 and +1
For a sigmoid, between 0 and 1
For a softmax, between 0 and 1, but make sure your last dimension is not 1, otherwise all results will be 1, always.
For a relu, between 0 and infinity
For linear, any value
Convergence goes better if you have a bounded activation instead of one that goes to infinity.
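For case 1, a minimal sketch of rescaling the targets into the output activation's range (the random trainY here just stands in for your real targets, and you would invert the same transform on the predictions afterwards):

import numpy as np

trainY = np.random.uniform(-50.0, 50.0, size=(1000, 1))    # stand-in for your targets
y_min, y_max = trainY.min(), trainY.max()

y_for_sigmoid = (trainY - y_min) / (y_max - y_min)           # into [0, 1]
y_for_tanh = 2.0 * (trainY - y_min) / (y_max - y_min) - 1.0  # into [-1, 1]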
Case 2:
If the data is fine, try decreasing the learning rate once your model stagnates: you can recompile the model (after training) with a lower learning rate, usually dividing it by 10. The default learning rate for Adam in Keras is 0.001:
from keras.optimizers import Adam
# after training enough with the default value:
model.compile(loss='mse', optimizer=Adam(lr=0.00001))
trainingHistory2 = model.fit(.........)
# you can even do this again if you notice that the loss decreased and stopped again:
model.compile(loss='mse', optimizer=Adam(lr=0.000001))
If the problem was the learning rate, this will let your model learn more than it already did (there might be some difficulty at the beginning until the optimizer adjusts itself).
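As an alternative to recompiling by hand, Keras also has a ReduceLROnPlateau callback that lowers the learning rate automatically when the monitored loss stagnates; a small sketch (the factor, patience and min_lr values are only illustrative):

from keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.1, patience=5,
                              min_lr=1e-6, verbose=1)
# trainingHistory = model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size,
#                             shuffle=False, callbacks=[reduce_lr])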
Case 3:
If you got no success, maybe it's time to increase the model's capability.
Maybe add more units to the layers, add more layers or even change the model.
Case 4:
There's probably nothing you can do about this...
But if you increased the model like in case 3, be careful with overfitting (keep some test data to compare the test loss versus the training loss).
Models that are too powerful can simply memorize your data instead of learning important insights about it.
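In Keras, one simple way to keep that comparison is model.fit's validation_split argument, which holds out the last fraction of the training arrays (the 0.2 below is just an example):

# hold out 20% of the data so each epoch reports val_loss next to the training loss
trainingHistory = model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size,
                            shuffle=False, validation_split=0.2)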

Overfitting after one epoch

I am training a model using Keras.
model = Sequential()
model.add(LSTM(units=300, input_shape=(timestep,103), use_bias=True, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(units=536))
model.add(Activation("sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
while True:
    history = model.fit_generator(
        generator=data_generator(x_[train_indices], y_[train_indices],
                                 batch=batch, timestep=timestep),
        steps_per_epoch=int(train_indices.shape[0] / batch),
        epochs=1,
        verbose=1,
        validation_steps=int(validation_indices.shape[0] / batch),
        validation_data=data_generator(x_[validation_indices], y_[validation_indices],
                                       batch=batch, timestep=timestep))
It is a multioutput classification task, according to the scikit-learn.org definition:
Multioutput regression assigns each sample a set of target values. This can be thought of as predicting several properties for each data-point, such as wind direction and magnitude at a certain location.
Since it is a recurrent neural network, I tried out different timestep sizes, but the result/problem is mostly the same.
After one epoch, my training loss is around 0.0X and my validation loss is around 0.6X, and these values stay stable for the next 10 epochs.
The dataset has around 680,000 rows; 9/10 is used for training and 1/10 for validation.
I am asking for the intuition behind this:
Is my model already overfitted after just one epoch?
Is 0.6xx even a good value for a validation loss?
High level question:
Since it is a multioutput classification task (not multi-class), I see the only option as using sigmoid and binary_crossentropy. Do you suggest another approach?
I've experienced this issue and found that the learning rate and batch size have a huge impact on the learning process. In my case, I did two things:
Reduce the learning rate (try 0.00005)
Reduce the batch size (8, 16, 32)
Moreover, you can try the basic steps for preventing overfitting (a combined sketch follows this list):
Reduce the complexity of your model
Increase the training data and balance the samples per class.
Add more regularization (Dropout, BatchNorm)
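A hedged sketch combining those suggestions with the model from the question (the learning rate, batch size, dropout and timestep values below are placeholders to experiment with, not fixed rules):

from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation
from keras.optimizers import Adam

timestep = 100   # placeholder; use the same timestep as your data generator
batch = 16       # smaller batch size

model = Sequential()
model.add(LSTM(units=300, input_shape=(timestep, 103), use_bias=True,
               dropout=0.4, recurrent_dropout=0.4))     # a bit more regularization
model.add(Dense(units=536))
model.add(Activation("sigmoid"))
model.compile(loss="binary_crossentropy",
              optimizer=Adam(lr=0.00005),               # reduced learning rate
              metrics=["accuracy"])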
