I'm using a CycleGAN to convert summer images to winter images. The generator loss is still very high after 100 epochs, although a decrease can be seen. On the other hand, the discriminator loss is almost zero from the very beginning.
Summer-to-winter generator loss & winter discriminator loss
The cycle-consistency and identity losses look okay to me, I think.
Cycle loss (summer, winter) in the first row and identity loss (summer, winter) in the second row
As can be seen in this image, the mountains get a purple tint. Correct me if I'm wrong, but does this result from the discriminator loss? (full cycled image)
So far, to improve the discriminator loss, I tried a few things:
adjusted the Adam optimizer settings for the discriminator and the generator
added GaussianNoise to the samples before passing them to the discriminator (a minimal sketch of what I mean is below)
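For reference, this is roughly what I mean by adding GaussianNoise before the discriminator. The input shape, noise stddev and conv stack are just placeholders, not my exact architecture:

import tensorflow as tf

noisy_discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(256, 256, 3)),
    tf.keras.layers.GaussianNoise(0.1),                       # only active while training
    tf.keras.layers.Conv2D(64, 4, strides=2, padding="same"),
    tf.keras.layers.LeakyReLU(0.2),
    tf.keras.layers.Conv2D(1, 4, padding="same"),              # PatchGAN-style real/fake map
])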
Does anybody have an idea what else I could try to fix the discriminator loss?
Thank you in advance :)
Related
I'm working on a competition on Kaggle. First, I trained a Longformer base with the competition dataset and achieved quite a good result on the leaderboard. Due to the CUDA memory limit and time limit, I could only train for 2 epochs with a batch size of 1. The loss started at about 2.5 and gradually decreased to 0.6 at the end of my training.
I then continued training for 2 more epochs using those saved weights. This time I used a slightly larger learning rate (the one from the Longformer paper) and added the validation data to the training data (meaning I no longer split the dataset 90/10). I did this to try to achieve a better result.
However, this time the loss started at about 0.4 and steadily increased to 1.6 by about halfway through the first epoch. I stopped because I didn't want to waste computational resources.
Should I have waited longer? Could it eventually have led to a better test result? I think the model could have been slightly overfitting at first.
Your model got fitted to the original training data the first time you trained it. When you added the validation data to the training set the second time around, the distribution of your training data must have changed significantly. Thus, the loss increased in your second training session since your model was unfamiliar with this new distribution.
Should you have waited longer? Yes, the loss would eventually have decreased (although not necessarily to a value lower than the original training loss).
Could it have led to a better test result? Probably. It depends on whether your validation data contains patterns that are:
Not present in your training data already
Similar to those that your model will encounter in deployment
In fact, it's possible for training loss to increase while training accuracy also increases. Accuracy is not perfectly (negatively) correlated with any loss function, simply because a loss function is a continuous function of the model outputs whereas accuracy is a discrete function of them. For example, a model that predicts with low confidence but is always correct is 100% accurate, whereas a model that predicts with high confidence but is occasionally wrong can produce a lower loss value yet less than 100% accuracy.
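A small numeric sketch of this point (the predictions below are made up purely for illustration):

import numpy as np

labels = np.array([1, 1, 1, 1])                 # all samples belong to class 1

def cross_entropy(p_class1):
    # Binary cross-entropy of the predicted probability for class 1
    return -np.mean(np.log(np.where(labels == 1, p_class1, 1 - p_class1)))

def accuracy(p_class1):
    return np.mean((p_class1 > 0.5) == (labels == 1))

low_conf = np.array([0.6, 0.6, 0.6, 0.6])       # low confidence, always correct
high_conf = np.array([0.99, 0.99, 0.99, 0.4])   # high confidence, once wrong

print(cross_entropy(low_conf), accuracy(low_conf))    # ~0.51 loss, 100% accuracy
print(cross_entropy(high_conf), accuracy(high_conf))  # ~0.24 loss, 75% accuracy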
I'm training a deep learning model using tensorflow.keras. The loss function is triplet loss. The optimizer is Adam, with a learning rate of 0.00005.
Initially the training loss was 0.38 and it started converging slowly. At the 17th epoch the val_loss reached 0.1705. Then suddenly, in the 18th epoch, both the training loss and the val_loss became 0.5 and stayed there for 5-6 epochs. The loss values didn't change at all.
Any insight on this behavior would be great. Thank you.
For some models, the loss goes down during training, but at some point an update is applied to a variable that drastically changes your model's output. If this happens, sometimes the model is not able to recover from this new state, and the best it can do from then on is to output 0.5 for every input (I assume you are doing a binary classification task).
Why can these erroneous updates happen even though they are bad for your model? Updates are done using gradient descent, and gradient descent only uses the first derivative. This means the model knows how it needs to change a specific variable, but only very close to that variable's current value. If your learning rate is too high, the update can be too big, and that gradient descent step can be very bad for your model's performance.
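For example (illustrative values only), lowering the learning rate and clipping the global gradient norm both limit how far a single update can move the weights:

import tensorflow as tf

# A smaller learning rate plus a global-norm clip bounds the size of each update
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5, clipnorm=1.0)
# then e.g.: model.compile(optimizer=optimizer, loss=your_triplet_loss)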
I read the first as training loss much greater than validation loss; that is underfitting.
I read the second as training loss much less than validation loss; that is overfitting.
Not an expert, but my assumptions have been:
Typically, validation loss should be similar to, but slightly higher than, training loss. As long as validation loss is lower than or even equal to training loss, one should keep training.
If training loss is decreasing without an increase in validation loss, then again keep training.
If validation loss starts increasing, then it is time to stop (see the EarlyStopping sketch after this list).
If the overall accuracy is still not acceptable, then review the mistakes the model is making and think about what you could change:
More data? More / different data augmentations? Generative data?
Different architecture?
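A minimal sketch of that stopping rule using Keras's EarlyStopping callback (the patience value and the commented-out fit call are placeholders for your own setup):

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=3,                  # tolerate a few bad epochs before stopping
    restore_best_weights=True,   # roll back to the best validation epoch
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])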
Underfitting – Validation and training error both high
Overfitting – Validation error high, training error low
Good fit – Validation error low, slightly higher than the training error
Unknown fit – Validation error low, training error 'high'
My question is related to this one
I am working to implement the method described in the article https://drive.google.com/file/d/1s-qs-ivo_fJD9BU_tM5RY8Hv-opK4Z-H/view . The final algorithm to use is here (it is on page 6):
d is a unit vector
ξ (xi) is a non-zero number
D is the loss function (sparse cross-entropy in my case)
The idea is to do adversarial training: modify the data in the direction where the network is most sensitive to small changes, and train the network on the modified data but with the same labels as the original data.
The loss function used to train the model is here:
l is a loss measure on the labelled data
Rvadv is the value inside the gradient in the picture of algorithm 1
the article chose alpha = 1
The idea is to incorporate the performance of the model on the labelled dataset into the loss.
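In case it helps, here is how I currently understand the method in code. I believe the article is the virtual adversarial training (VAT) paper, so this is only a rough Keras sketch under that assumption; the model is assumed to output softmax probabilities, and the values of xi and eps and the helper names are mine, not the article's.

import tensorflow as tf

kl = tf.keras.losses.KLDivergence()
sce = tf.keras.losses.SparseCategoricalCrossentropy()

def l2_normalize_per_sample(d):
    # Scale each sample's perturbation to unit L2 norm
    flat = tf.reshape(d, [tf.shape(d)[0], -1])
    norm = tf.norm(flat, axis=1, keepdims=True) + 1e-12
    return tf.reshape(flat / norm, tf.shape(d))

def vat_loss(model, x, y, xi=1e-6, eps=2.0, alpha=1.0):
    # Predictions on the clean batch, used as a fixed target distribution
    p = tf.stop_gradient(model(x, training=False))

    # One power-iteration step: the gradient of the divergence w.r.t. a tiny
    # random perturbation xi * d points in the most sensitive direction
    d = l2_normalize_per_sample(tf.random.normal(tf.shape(x)))
    with tf.GradientTape() as tape:
        tape.watch(d)
        dist = kl(p, model(x + xi * d, training=False))
    r_vadv = eps * l2_normalize_per_sample(tf.stop_gradient(tape.gradient(dist, d)))

    # Rvadv: divergence between predictions on clean and perturbed inputs
    r = kl(p, model(x + r_vadv, training=True))
    # Supervised term l plus alpha * Rvadv (the article uses alpha = 1)
    return sce(y, model(x, training=True)) + alpha * r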
I am trying to implement this method in Keras with the MNIST dataset and mini-batches of 100 samples. When I try to do the final gradient descent step to update the weights, NaN values appear after some iterations, and I don't know why. I posted the notebook in a Colab session (I don't know for how long it will stay up, so I also posted the code in a gist):
Colab session: https://colab.research.google.com/drive/1lowajNWD-xvrJDEcVklKOidVuyksFYU3?usp=sharing
gist: https://gist.github.com/DridriLaBastos/e82ec90bd699641124170d07e5a8ae4c
This is a fairly standard NaN-in-training problem. I suggest you read this answer about the NaN issue with the Adam solver for the causes and solutions in the common cases.
Basically, I just made the following two changes and the code ran without NaNs in the gradients:
Reduce the learning rate of the optimizer in model.compile to optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3)
Replace C = [loss(label, pred) for label, pred in zip(yBatchTrain, dumbModel(dataNoised, training=False))] with C = loss(yBatchTrain, dumbModel(dataNoised, training=False))
If you still get this kind of error, the next few things you could try are:
Clip the loss or gradient
Switch all tensors from tf.float32 to tf.float64
Next time you face this kind of error, you can use tf.debugging.check_numerics to find the root cause of the NaN.
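For example, here is a small sketch of those suggestions; the tensors are toy values and clipnorm=1.0 is just an illustrative choice:

import tensorflow as tf

# Gradient clipping can be set directly on the Keras optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# tf.debugging.check_numerics raises as soon as a tensor contains NaN/Inf
bad = tf.math.log(tf.constant([1.0, 0.0]))      # second entry becomes -inf
try:
    tf.debugging.check_numerics(bad, "found NaN/Inf after tf.math.log")
except tf.errors.InvalidArgumentError as e:
    print(e.message)

# tf.debugging.enable_check_numerics() instruments every op instead, so the
# first op that produces a NaN/Inf raises an error with a stack trace.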
I am new at deep learning.
I have a dataset with 1001 values of human upper-body pose. The model I trained on it has 4 conv layers and 2 fully connected layers with ReLU and dropout. This is the result I got after 200 iterations. Does anyone have any idea why the training-loss curve drops so sharply?
I think I probably need more data. Since my dataset consists of numerical values, what do you think is the best data augmentation method to use here?
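For example, would simple numeric augmentations like the sketch below make sense? The joint layout (12 joints with (x, y) coordinates) is just an assumption for illustration:

import numpy as np

rng = np.random.default_rng(0)
pose = rng.normal(size=(1, 12, 2))          # e.g. 12 upper-body joints with (x, y)

def jitter(p, sigma=0.01):
    # Add small Gaussian noise to every joint coordinate
    return p + rng.normal(scale=sigma, size=p.shape)

def mirror(p):
    # Flip left/right by negating x (assumes x is centred at 0)
    out = p.copy()
    out[..., 0] *= -1
    return out

augmented = np.concatenate([pose, jitter(pose), mirror(pose)], axis=0)
print(augmented.shape)                       # (3, 12, 2)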
Is it true that "in Keras, if you multiply the loss function of a model by some constant C, and also divide the learning rate by C, no difference in the training process will occur"?
I have a model implemented by Keras. I define a loss function as:
def my_loss(y_true, y_est):
    return something
In the first scenario I use an Adam optimizer with a learning rate of 0.005, and I compile the model with that loss function and optimizer. I fit the model on a set of training data and observe that its loss falls from 0.2 to 0.001 in fewer than 100 epochs.
In the second scenario I change the loss function to:
def my_loss(y_true, y_est):
    return 1000 * something
and the learning rate of the optimizer to 0.000005. Then I compile the model with the new loss function and optimizer, and watch what happens to its loss.
In my understanding, since the gradient of the new loss is 1000 times the previous gradient, and the new learning rate is 0.001 times the previous learning rate, in the second scenario the loss should fall from 200 to 1 in fewer than 100 epochs. But surprisingly, I observe that the loss gets stuck around 200 and almost does not decrease.
Does anyone have any justifications for that?
If you used SGD, the result would be what you expect. However, the loss scale has (almost) no effect on Adam, because Adam normalizes each gradient by its running second moment; I suggest you study Adam's update formulas. In effect, you only changed the learning rate of the network, and that learning rate is too small for your network.
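A small numerical sketch of this (toy values; Adam's first bias-corrected step is approximately learning_rate * sign(gradient), so scaling the gradient by C barely changes it):

import tensorflow as tf

def adam_step(grad, lr=0.005):
    # Size of the very first Adam update for a single scalar weight
    var = tf.Variable([1.0])
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    opt.apply_gradients([(tf.constant([grad]), var)])
    return 1.0 - float(var.numpy()[0])

print(adam_step(0.2))           # original gradient
print(adam_step(0.2 * 1000.0))  # gradient scaled by C = 1000
# Both print roughly 0.005 (= the learning rate): scaling the loss by C does
# not change Adam's step, but dividing the learning rate by C shrinks it.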