Optimum number of iterations when visualizing deep network - keras

I'm reading through the Keras visualization documentation, and the number of iterations used for the gradient ascent on the filter activation loss is set to 20:
for i in range(20):
    loss_value, grads_value = iterate([input_img_data])
    input_img_data += grads_value * step
Now, if I want to find the optimum number of iterations, how do I find it? Should I wait until the gradient becomes zero and use the second derivative test to check whether it's a maximum or a minimum? If so, are there built-in functions in Keras that support this?

Do you mean epochs when you say the optimum number of iterations? If so, I recommend setting up early stopping: when the loss value stops changing for a while, the optimization has converged and you have found a good number of epochs (iterations).
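As a rough illustration of that idea (this is not from the Keras docs; iterate, input_img_data and step are assumed to be defined as in the visualization example above), you could replace the fixed 20 iterations with a loop that stops once the activation loss stops improving:

prev_loss = None
tolerance = 1e-4      # minimum improvement still counted as progress (arbitrary)
max_iterations = 200  # safety cap so the loop always terminates

for i in range(max_iterations):
    loss_value, grads_value = iterate([input_img_data])
    input_img_data += grads_value * step
    if prev_loss is not None and abs(loss_value - prev_loss) < tolerance:
        break  # the loss has plateaued; more iterations barely change the image
    prev_loss = loss_value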

Related

Pytorch: correct way to sum batch loss with epoch loss

I'm calculating two losses: one per batch, and one per epoch computed at the end of the batch loop. When I try to sum these two losses I get the following error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [64, 49]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
I have my reasons for summing these two losses.
The general idea of the code is something like this:
loss_epoch = 0  # it's zero in the first epoch
for epoch in epochs:
    for batch in batches:
        optimizer.zero_grad()
        loss_batch = criterion_batch(output_batch, target_batch)
        loss = loss_batch + loss_epoch  # adds zero in the first epoch
        loss.backward()
        optimizer.step()
    loss_epoch = criterion_epoch(output_epoch, target_epoch)
I get that the problem is that I'm modifying the gradient when I calculate another loss at the end of the inner loop (the one that goes through the batches), but I couldn't solve this problem.
It also might have something to do with the order of the operations (loss calculation, backward, zero_grad, step).
I need to calculate the loss_epoch at the end of the batch loop because I'm using the entire dataset to calculate this loss.
Assuming that you do not want to backpropagate the epoch_loss through every forward pass for the entire dataset (which of course would be computationally infeasible for a dataset of any non-trivial size), you could detach the epoch_loss and essentially add it as a scalar which is updated once per epoch. Not entirely sure if this is the behavior you want though.
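A rough sketch of that suggestion, reusing the names from the question (criterion_batch, criterion_epoch, batches, etc. are assumed to exist): detach the epoch-level loss so it enters the next epoch as a plain constant and backward() only flows through the current batch loss.

import torch

loss_epoch = torch.tensor(0.0)  # zero in the first epoch
for epoch in range(num_epochs):
    for batch in batches:
        optimizer.zero_grad()
        loss_batch = criterion_batch(output_batch, target_batch)
        loss = loss_batch + loss_epoch  # loss_epoch carries no graph, so nothing old is re-traversed
        loss.backward()
        optimizer.step()
    # recompute the epoch-level loss over the whole dataset and cut it from the graph
    loss_epoch = criterion_epoch(output_epoch, target_epoch).detach()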

What is the correct way to implement gradient accumulation in pytorch?

Broadly there are two ways:
(1) Call loss.backward() on every batch, but only call optimizer.step() and optimizer.zero_grad() every N batches. Is it the case that the gradients of the N batches are summed up? If so, to maintain the same learning rate per effective batch, do we have to divide the learning rate by N?
(2) Accumulate the loss instead of the gradient, and call (loss / N).backward() every N batches. This is easy to understand, but does it defeat the purpose of saving memory (because the gradients of the N batches are computed at once)? The learning rate doesn't need adjusting to maintain the same learning rate per effective batch, but it should be multiplied by N if you want to maintain the same learning rate per example.
Which one is better, or more commonly used in packages such as pytorch-lightning? It seems that optimizer.zero_grad() is a perfect fit for gradient accumulation, so (1) should be the recommended approach.
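To make option (1) concrete, here is a minimal sketch (not from the question; model, criterion, optimizer and dataloader are assumed to exist) where the loss is divided by N so the accumulated gradient is an average over the effective batch rather than a sum:

N = 4  # accumulation steps, chosen arbitrarily for illustration
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(dataloader):
    loss = criterion(model(inputs), targets) / N  # scale so gradients average over N mini-batches
    loss.backward()                               # gradients are summed into .grad
    if (i + 1) % N == 0:
        optimizer.step()       # one update per effective batch of N mini-batches
        optimizer.zero_grad()  # reset the accumulated gradients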
You can use PyTorch Lightning and get this feature out of the box: see the Trainer argument accumulate_grad_batches, which you can also pair with gradient_clip_val; more in the docs.
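For reference, a minimal sketch of the Lightning route (the LightningModule and dataloader are assumed to exist already):

from pytorch_lightning import Trainer

# accumulate gradients over 4 batches and clip them, without touching the training loop
trainer = Trainer(accumulate_grad_batches=4, gradient_clip_val=0.5)
trainer.fit(model, train_dataloader)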

Batch normalization setup at train and test time

Recently I have read many articles discussing batch normalization in Keras.
According to this website:
Set “training=False” of “tf.layers.batch_normalization” when training will get a better validation result
The answer said that:
If you turn on batch normalization with training = True that will start to normalize the batches within themselves and collect a moving average of the mean and variance of each batch. Now here's the tricky part. The moving average is an exponential moving average, with a default momentum of 0.99 for tf.layers.batch_normalization(). The mean starts at 0, the variance at 1 again. But since each update is applied with a weight of (1 - momentum), it only asymptotically reaches the actual mean and variance. For example, in 100 steps it will reach about 63.4% of the real value, because 0.99^100 is about 0.366. If you have numerically large values, the difference can be enormous.
Since my batch size is small, more steps are needed, and the difference between the training and test statistics could be big, which leads to bad results when predicting.
So I have to set training=False in the call, about which the link above also says:
When you set training = False that means the batch normalization layer will use its internally stored average of mean and variance to normalize the batch, not the batch's own mean and variance.
And I know that during test time we should use the moving mean and moving variance from training time, and I know that the moving_mean_initializer can be set:
keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', moving_mean_initializer='zeros', moving_variance_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None)
I am not sure whether my understanding below is correct:
(1) Set training=False when testing and training=True when training.
(2) Use history_weight = ModelCheckpoint(filepath="weights.{epoch:02d}.hdf5", save_weights_only=True, save_best_only=False) to store the normalization weights (including the moving mean and variance, and of course gamma and beta).
(3) Initialize the layer with what we get from step (2).
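A minimal sketch of steps (2) and (3), assuming a compiled model and data already exist (variable names are mine); note that load_weights restores the BatchNormalization moving statistics, so the initializers do not need to be changed:

from tensorflow.keras.callbacks import ModelCheckpoint

# Step (2): save all weights each epoch; BatchNormalization's gamma, beta,
# moving_mean and moving_variance are part of the saved weights.
history_weight = ModelCheckpoint(filepath="weights.{epoch:02d}.hdf5",
                                 save_weights_only=True,
                                 save_best_only=False)
model.fit(x_train, y_train, epochs=10, callbacks=[history_weight])

# Step (3): restoring the weights brings back the stored moving statistics.
model.load_weights("weights.10.hdf5")
model.evaluate(x_test, y_test)  # BatchNormalization now uses the restored statistics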
If anything I mentioned above is wrong, please correct me.
I am also not sure how people typically deal with this problem. Would the approach I propose work?
Thanks in advance!
I did some testing: after training, I set all batch normalization layers' moving mean and moving variance to zero, and it gave bad results.
So I believe that in inference mode Keras uses the moving mean and moving variance.
As for the training flag: whether you set it to True or False, the only difference between the two is whether the moving mean and moving variance get updated or not.
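To see what the flag does to the stored statistics, here is a small sketch (tf.keras; the layer and data are made up for illustration):

import numpy as np
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization(momentum=0.99)
x = (np.random.randn(32, 10) * 5.0 + 3.0).astype("float32")  # non-zero mean and variance

_ = bn(x, training=True)   # normalizes with the batch statistics and updates the moving averages
print(bn.moving_mean.numpy()[:3], bn.moving_variance.numpy()[:3])

_ = bn(x, training=False)  # normalizes with the stored moving_mean / moving_variance
print(bn.moving_mean.numpy()[:3], bn.moving_variance.numpy()[:3])  # unchanged by this call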

Keras reinforcement training with softmax

A project I am working on has a reinforcement learning stage that uses the REINFORCE algorithm. The model has a final softmax activation layer, and because of that a negative learning rate is used as a replacement for negative rewards. I have some doubts about this process and can't find much literature on using a negative learning rate.
Does reinforcement learning work with switching the learning rate between positive and negative? And if not, what would be a better approach: get rid of the softmax, or does Keras have a nicer option for this?
Loss function:
def log_loss(y_true, y_pred):
    '''
    Keras 'loss' function for the REINFORCE algorithm,
    where y_true is the action that was taken, and updates
    with the negative gradient will make that action more likely.
    We use the negative gradient because Keras expects training data
    to minimize a loss function.
    '''
    return -y_true * K.log(K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon()))
Switching learning rate:
K.set_value(optimizer.lr, lr * (+1 if won else -1))
learner_net.train_on_batch(np.concatenate(st_tensor, axis=0),
                           np.concatenate(mv_tensor, axis=0))
Update, test results
I ran a test with only positive reinforcement samples, omitting all negative examples and thus the negative learning rate. The winning rate is rising and the model is improving, so I can safely assume that using a negative learning rate is not correct.
Does anybody have any thoughts on how we should implement it?
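One direction worth considering (sketched here with made-up variable names, not taken from the post): keep the learning rate positive and instead scale each sample's contribution to log_loss with a signed reward via the sample_weight argument of train_on_batch, so winning samples push the chosen move's probability up and losing ones push it down.

import numpy as np

# Hypothetical sketch: states, actions_one_hot and won_flags stand in for the
# data already collected above. The sign lives in the per-sample weight, so the
# optimizer's learning rate never has to be flipped.
rewards = np.where(won_flags, 1.0, -1.0).astype("float32")
learner_net.train_on_batch(states, actions_one_hot, sample_weight=rewards)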
Update, model explanation
We are trying to recreate AlphaGo as described by DeepMind, the slow policy net:
For the first stage of the training pipeline, we build on prior work on predicting expert moves in the game of Go using supervised learning [13, 21-24]. The SL policy network p_σ(a | s) alternates between convolutional layers with weights σ and rectifier nonlinearities. A final softmax layer outputs a probability distribution over all legal moves a.
Not sure if it is the best way, but at least I found a way that works.
For all negative training samples I reuse the network prediction, set the action I want to unlearn to zero, and adjust all values so that they sum up to one again.
I tried several ways to redistribute the removed value afterwards, but haven't run enough tests to be sure what works best:
apply softmax (the action that has to be unlearned gets a nonzero value again)
redistribute the old action value over all other actions
set all illegal action values to zero and distribute the total removed value
distribute the value proportionally to the values of the other actions
There are probably several other ways to do this, and what works best may depend on the use case, but this one works at least.
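For reference, a small sketch of the workaround described above (function and variable names are mine), using the proportional redistribution variant:

import numpy as np

def make_unlearn_target(pred, action_index):
    # Zero out the move to be unlearned and redistribute its probability
    # proportionally over the remaining actions so the target sums to one.
    target = np.asarray(pred, dtype=np.float64).copy()
    target[action_index] = 0.0
    total = target.sum()
    if total > 0:
        target /= total
    return target

# Example: the network predicted [0.5, 0.3, 0.2] and action 0 should be unlearned.
print(make_unlearn_target([0.5, 0.3, 0.2], 0))  # -> [0.  0.6 0.4]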

Adaboost Implementation with Decision stump

I have been trying to implement AdaBoost using a decision stump as the weak classifier, but I do not know how to give preference to the weighted misclassified instances.
A decision stump is basically a rule that specifies a feature, a threshold and a polarity. So given samples, you have to find the one feature-threshold-polarity combination that has the lowest error. Usually you count the misclassifications and divide that count by the number of samples to get the error. In AdaBoost a weighted error is used, which means that instead of counting the misclassifications, you sum up the weights that are assigned to the misclassified samples. I hope this is all clear so far.
Now, to give a higher preference to the misclassified samples in the next round, you adjust the weights assigned to the samples by either increasing the weights of the misclassified samples or decreasing the weights of the correctly classified ones. Assume that E is your weighted error; you multiply the misclassified sample weights by the value (1-E)/E. Since the decision stump is better than random guessing, E will be < 0.5, which means that (1-E)/E will be > 1, so the weights are increased (e.g. E = 0.4 => (1-E)/E = 1.5). If, on the other hand, you want to decrease the correctly classified sample weights, use E/(1-E) instead. However, do not forget to normalize the weights afterwards so that they sum up to 1. This is important for the computation of the weighted error.
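A small sketch of that weight update (variable names are mine; predictions and labels are the stump's outputs and the true labels, and weights are the current sample weights summing to 1):

import numpy as np

def update_weights(weights, predictions, labels):
    misclassified = predictions != labels
    E = np.sum(weights[misclassified])         # weighted error of the stump
    new_weights = weights.copy()
    new_weights[misclassified] *= (1 - E) / E  # increase the misclassified sample weights
    return new_weights / new_weights.sum()     # renormalize so they sum to 1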
