In the Mask R-CNN paper here, the optimizer is described as follows for training on the MS COCO 2014/2015 dataset for instance segmentation (I believe this is the dataset, correct me if this is wrong):
We train on 8 GPUs (so effective minibatch
size is 16) for 160k iterations, with a learning rate of
0.02 which is decreased by 10 at the 120k iteration. We
use a weight decay of 0.0001 and momentum of 0.9. With
ResNeXt [45], we train with 1 image per GPU and the same
number of iterations, with a starting learning rate of 0.01.
I'm trying to write an optimizer and learning rate scheduler in Pytorch for a similar application, to match this description.
For the optimizer I have:
def get_Mask_RCNN_Optimizer(model, learning_rate=0.02):
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=0.0001)
    return optimizer
For the learning rate scheduler I have:
def get_MASK_RCNN_LR_Scheduler(optimizer, step_size):
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=0.1, verbose=True)
    return scheduler
When the authors say "decreased by 10", do they mean divide by 10? Or do they literally mean subtract 10, in which case we would end up with a negative learning rate, which seems odd/wrong. Any insights appreciated.
The authors mean divide by 10, as pointed out in the comments.
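If you want to match the paper's schedule exactly (a single drop by a factor of 10 at iteration 120k out of 160k), MultiStepLR is a closer fit than StepLR. A minimal sketch, assuming you call scheduler.step() once per training iteration rather than once per epoch (the function name is mine):
import torch

def get_mask_rcnn_lr_scheduler(optimizer, milestones=(120000,)):
    # Multiply the learning rate by 0.1 (i.e. divide by 10) at each milestone
    # iteration; the paper uses a single drop at 120k out of 160k iterations.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=list(milestones), gamma=0.1)
    return scheduler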
I am trying to train a CNN on football game audio to predict highlights. The data is composed of MFCC spectrograms (https://librosa.org/doc/main/generated/librosa.feature.mfcc.html) of duration t=1s, at a rate of 10Hz. These MFCC spectrograms (~3000 per game, t=300s of labelled footage) are all labelled: 1 if the segment corresponds to a highlight, 0 if it corresponds to a lowlight. They are all 32x40 matrices: 40 high for the number of MFC coefficients (see the librosa doc) and 32 wide for 32 samples per second.
I am training a CNN on this data. Here's its architecture:
CNN architecture
I have a balanced set taken from the PSGvMU game, composed of 50% highlight / 50% lowlight MFCC spectrograms. This set is split into an 80% balanced training dataset and a 20% balanced validation dataset.
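Roughly, the split looks like this (a sketch with made-up variable names; spectrograms and labels stand for the data and labels described above):
from sklearn.model_selection import train_test_split

# stratify keeps the 50/50 highlight/lowlight balance in both splits
X_train, X_val, y_train, y_val = train_test_split(
    spectrograms, labels, test_size=0.2, stratify=labels, random_state=0)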
I am training my model for 10 epochs, with batch size 32 and the Adam optimizer with lr=0.001. Here are the training epochs:
Training epochs accuracies and validation accuracies
Every time I test my model on new MFCC spectrograms, the predictions (between 0 and 1) have a very high mean (~0.99+), and the optimal classification threshold (calculated as argmax(accuracy | threshold)) is often also very high, typically around 0.99-0.999.
Accuracy as function of classification threshold graph
The problem is that I need to know the true labels to get a good classification threshold and hence good results.
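The threshold search is essentially this (a sketch with my own variable names; preds and y_true stand for the predicted scores and the true 0/1 labels):
import numpy as np

def best_threshold(preds, y_true, thresholds=np.linspace(0.0, 1.0, 1001)):
    # Pick the threshold that maximizes accuracy on the labelled data.
    accuracies = [np.mean((preds >= t).astype(int) == y_true) for t in thresholds]
    return thresholds[int(np.argmax(accuracies))]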
What do you think about my approach? Is there something wrong with my model? Or am I just lacking data / overfitting a lot?
In the Adam optimization algorithm, the effective learning rate is adapted as the iterations progress. I don't quite understand Adam's design, especially when using mini-batch training. If there are 19,200 pictures and each step trains on 64 of them, one epoch is 300 iterations; with 200 epochs, that is 60,000 iterations in total. I don't know whether that many iterations will shrink the learning rate to something very small. So during training, should we re-initialize the optimizer after each epoch, or do nothing throughout the whole process?
I am using PyTorch. I have tried re-initializing the optimizer after each epoch when I use mini-batch training, and doing nothing when the amount of data is small.
For example, I don't know whether either of these two pieces of code is correct:
optimizer = optim.Adam(model.parameters(), lr=0.1)
for epoch in range(100):
    ### Some code
    optimizer.step()
Another piece of code:
for epoch in range(100):
    optimizer = optim.Adam(model.parameters(), lr=0.1)
    ### Some code
    optimizer.step()
You can read the official paper here https://arxiv.org/pdf/1412.6980.pdf
Your update looks somewhat like this (for brevity's sake I have omitted the bias-correction "warm-up" phase):
new_theta = old_theta - learning_rate * momentum / (sqrt(velocity) + eps)
The intuition here is that if momentum > sqrt(velocity), the optimizer is in a plateau, so the effective step is increased because momentum / sqrt(velocity) > 1. On the other hand, if momentum < sqrt(velocity), the optimizer is in a steep slope or noisy region, so the effective step is decreased.
The learning_rate isn't necessarily decreased throughout training, as you worried in your question.
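As a rough illustration of that update, here is a sketch of one Adam step without bias correction (not the actual PyTorch implementation; the function name is mine and the defaults follow the paper):
import numpy as np

def adam_step(theta, grad, m, v, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving average of the gradient ("momentum")
    # and of the squared gradient ("velocity").
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # The per-parameter step is lr * m / sqrt(v); it adapts to the gradient
    # statistics and does not simply shrink as the iteration count grows.
    theta = theta - lr * m / (np.sqrt(v) + eps)
    return theta, m, v
Note that re-creating the optimizer every epoch (your second snippet) would also throw away the running m and v estimates, so the usual pattern is to build the optimizer once before the training loop and only call optimizer.step() inside it.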
I was trying to find out how many epochs the pretrained AlexNet model (available from torchvision) was trained for on ImageNet, and also what learning rate was used. I tried checking the checkpoint keys to see if any epoch info was stored.
Any suggestions on how to find it out?
According to this comment on GitHub by a PyTorch team member, most of the training was done with a variant of https://github.com/pytorch/examples/tree/master/imagenet. All the models were trained on ImageNet. According to that file:
The default learning rate schedule starts at 0.1 and decays by a factor of 10 every 30 epochs, though they recommend using 0.01 as the initial learning rate for AlexNet.
The default value for epochs is 90.
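The schedule described there boils down to roughly this rule (a sketch of the decay described above, not a copy of the script):
def adjusted_lr(initial_lr, epoch):
    # Start at initial_lr (0.1 by default, 0.01 recommended for AlexNet)
    # and divide the learning rate by 10 every 30 epochs, for 90 epochs total.
    return initial_lr * (0.1 ** (epoch // 30))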
Is it true that "in Keras, if you multiply the loss function of a model by some constant C and also divide the learning rate by C, no difference in the training process will occur"?
I have a model implemented by Keras. I define a loss function as:
def my_loss(y_true, y_est):
    return something
In the first scenario I use an Adam optimizer with a learning rate equal to 0.005, and I compile the model with that loss function and optimizer. I fit the model on a set of training data and observe that its loss falls from 0.2 to 0.001 in less than 100 epochs.
In the second scenario I change the loss function to:
def my_loss(y_true, y_est):
    return 1000 * something
and the learning rate of the optimizer to 0.000005. Then I compile the model with the new loss function and optimizer, and see what happens to its loss.
In my understanding, since the gradient of the new loss is 1000 times the previous gradient and the new learning rate is 1/1000 of the previous learning rate, the loss in the second scenario should fall from 200 to 1 in less than 100 epochs. But surprisingly, I observe that the loss gets stuck around 200 and almost does not decrease.
Does anyone have any justifications for that?
If you used SGD, the result would be what you expect. With Adam, however, the scale of the loss has (almost) no effect: the update divides the first-moment estimate of the gradient by the square root of the second-moment estimate, so a constant factor on the gradients cancels out. I suggest you go through the Adam update formulas. In effect, all you did was lower the learning rate, and 0.000005 is too small for your network.
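A quick numerical illustration of that invariance (a sketch, not the Keras internals; adam_direction is a made-up helper):
import numpy as np

def adam_direction(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    # Compute the (bias-uncorrected) Adam step direction m / (sqrt(v) + eps)
    # after seeing a sequence of gradients for one parameter.
    m, v = 0.0, 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
    return m / (np.sqrt(v) + eps)

grads = np.random.randn(100)
print(adam_direction(grads))          # roughly the same value...
print(adam_direction(1000 * grads))   # ...even with gradients scaled by 1000
Scaling the gradients by 1000 scales both m and sqrt(v) by about 1000, so the ratio, and hence the step, barely changes; only the change to the learning rate remains.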
I have a CNN-RNN model architecture with bidirectional LSTMs for a time series regression problem. My loss does not converge over 50 epochs. Each epoch has 20k samples. The loss keeps bouncing between 0.001 and 0.01.
batch_size=1
epochs = 50
model.compile(loss='mean_squared_error', optimizer='adam')
trainingHistory = model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, shuffle=False)
I tried training the model with incorrectly paired X and Y data, for which the loss stays around 0.5. Is it a reasonable conclusion that my X and Y have a non-linear relationship which can be learned by my model over more epochs?
The predictions of my model capture the pattern, but with an offset. I use dynamic time warping distance to manually check the accuracy of the predictions; is there a better way?
Model :
model = Sequential()
model.add(LSTM(units=128, dropout=0.05, recurrent_dropout=0.35, return_sequences=True, batch_input_shape=(batch_size,featureSteps,input_dim)))
model.add(LSTM(units=32, dropout=0.05, recurrent_dropout=0.35, return_sequences=False))
model.add(Dense(units=2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
If you tested with:
Wrong data: loss ~0.5
Correct data: loss ~0.01
Then your model is actually capable of learning something.
There are some possibilities there:
Your output data does not fit in the range of the last layer's activation
Your model reached a limit for the current learning rate (gradient update steps are too big and can't improve the model anymore).
Your model is not good enough for the task.
Your data has some degree of random factors
Case 1:
Make sure your Y is within the range of your last activation function.
For a tanh (the LSTM's default), all Y data should be between -1 and +1
For a sigmoid, between 0 and 1
For a softmax, between 0 and 1, but make sure your last dimension is not 1, otherwise all results will be 1, always.
For a relu, between 0 and infinity
For linear, any value
Convergence goes better if you have a limited activation instead of one that goes to infinity.
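For example, if the last activation is a tanh, you could rescale the targets into [-1, 1] before training (a minimal sketch, assuming trainY is a NumPy array):
# Map trainY linearly into [-1, 1] to match a tanh output layer.
y_min, y_max = trainY.min(), trainY.max()
trainY_scaled = 2.0 * (trainY - y_min) / (y_max - y_min) - 1.0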
Case 2:
If the data is OK, try decreasing the learning rate once your model stagnates. You can recompile the model (after training) with a lower learning rate; we usually divide it by 10. The default learning rate for Adam in Keras is 0.001:
from keras.optimizers import Adam

#after training enough with the default value:
model.compile(loss='mse', optimizer=Adam(lr=0.00001))
trainingHistory2 = model.fit(.........)

#you can even do this again if you notice that the loss decreased and stopped again:
model.compile(loss='mse', optimizer=Adam(lr=0.000001))
If the problem was the learning rate, this will let your model learn more than it already did (there might be some difficulty at the beginning until the optimizer adjusts itself).
Case 3:
If you got no success, maybe it's time to increase the model's capability.
Maybe add more units to the layers, add more layers or even change the model.
Case 4:
There's probably nothing you can do about this...
But if you increased the model like in case 3, be careful with overfitting (keep some test data to compare the test loss versus the training loss).
Too good models can simply memorize your data instead of learning important insights about it.
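If you do increase the model as in case 3, a simple way to keep an eye on overfitting in Keras is to hold out part of the training data in fit and compare the two losses (a sketch reusing the variables from the question):
# Hold out 20% of the training data and monitor both losses per epoch.
history = model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size,
                    shuffle=False, validation_split=0.2)
print(history.history['loss'][-1], history.history['val_loss'][-1])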