I'm trying to implement a multi-class segmentation in Keras:
input image is grayscale (i.e 1 channel)
ground truth image has 3 channels, each pixel is a one-hot vector of length 3
prediction is standard U-Net trained with categorical_crossentropy outputting 3 channels (softmax-ed)
What is wrong with this setup? The training loss has some weird behaviour:
in my lucky cases it behaves as expected (decreases)
90 % of the time it's stuck at ~0.9
My implementation can be found here
I don't think there is anything wrong with the code: if my ground truth is 1-channel (i.e 0s everywhere and 1s somewhere) and use binary_crossentropy + sigmoid as final activation I see no weird behaviour.
I'll answer my own question. The solution is to weight each class i.e using a weighted cross entropy loss
Related
I am doing landuse classification with 4 classes and as an output from softmax function of unet model i gained output of [8,4,128,128] and my mask image is [8,1,128,128] so to calculate loss I used nn.cross entropy function should i have to make any modification for good results as I find the output of testing image was 4 channels images which are not like mask image.``
I am debugging a sequence-to-sequence model and purposely tried to perfectly overfit a small dataset of ~200 samples (sentence pairs of length between 5-50). I am using negative log-likelihood loss in pytorch. I get low loss (~1e^-5), but the accuracy on the same dataset is only 33%.
I trained the model on 3 samples as well and obtained 100% accuracy, yet during training I had loss. I was under the impression that negative log-likelihood only gives loss (loss is in the same region of ~1e^-5) if there is a mismatch between predicted and target label?
Is a bug in my code likely?
There is no bug in your code.
The way things usually work in deep nets is that the networks predicts the logits (i.e., log-likelihoods). These logits are then transformed to probability using soft-max (or a sigmoid function). Cross-entropy is finally evaluated based on the predicted probabilities.
The advantage of this approach is that is numerically stable, and easy to train with. On the other side, because of the soft-max you can never have "perfect" 0/1 probabilities for your predictions: That is, even when your network has perfect accuracy it will never assign probability 1 to the correct prediction, but "close to one". As a result, the loss will always be positive (albeit small).
I am following this keras tutorial to create an autoencoder using the MNIST dataset. Here is the tutorial: https://blog.keras.io/building-autoencoders-in-keras.html.
However, I am confused with the choice of activation and loss for the simple one-layer autoencoder (which is the first example in the link). Is there a specific reason sigmoid activation was used for the decoder part as opposed to something such as relu? I am trying to understand whether this is a choice I can play around with, or if it should indeed be sigmoid, and if so why? Similarily, I understand the loss is taken by comparing each of the original and predicted digits on a pixel-by-pixel level, but I am unsure why the loss is binary_crossentropy as opposed to something like mean squared error.
I would love clarification on this to help me move forward! Thank you!
MNIST images are generally normalized in the range [0, 1], so the autoencoder should output images in the same range, for easier learning. This is why a sigmoid activation is used at the output.
The mean squared error loss has a non-linear penalty, with big errors having a larger penalty than smaller errors, which generally leads to converging to the mean of the solution, instead of a more accurace solution. The binary cross-entropy does not have this problem, and thus it is preferred. It works because the output of the model and the labels are in the [0, 1] range, and the loss is applied to all pixels.
I recently have implemented the RESUNET for a parasite segmentation on blood sample images. The model is described in this papaer, https://arxiv.org/pdf/1711.10684.pdf and here is the code https://github.com/DuFanXin/deep_residual_unet/blob/master/res_unet.py. The segmentation output is a binary image. I trained the model with the weighted Binary cross-entropy Loss, given more weight to the parasite class since there is an imbalance of classes in my images. The last ouput layer has a sigmoid activation.
I calculate precision, recall, and Dice Coefficient value to verify how good is the segmentation on trainning. On training and validation I got good numerical results:
Training
dice_coeff: .6895, f2: 0.8611, precision: 0.6320, recall: 0.9563
Validation
val_dice_coeff: .6433, val_f2: 0.7752, val_precision: 0.6052, val_recall: 0.8499
However, when I try to visually see the segmentations of the validation set my algorithm outputs all black. After analyzing the predictions returned by the model, almost all values are close to zero, so it cannot correctly differenciate between background and foreground. The problems is: Why my metrics shows good numerical values but the segmentation ouput not?
I mean, the metrics are not giving me good information? Why the recall value is higher even if the output is all black?
I trained for about 50 epochs, and my training curves shows constantly learning. Is this because the vanishing gradient problem?
No, you do not have a vanishing gradient issue.
I am almost 100% sure that the problem is related to the way in which you test.
The numbers in your training/validation do not lie.
Ensure that you use the exact same preprocessing on your test dataset, exactly the same preprocessing that is applied during the training.
E.g. : If you use "rescale = 1/255.0" parameter in Keras ImageDataGenerator(), ensure that when you load the test image, divide it by 255.0 before predicting on it.
Note that the aforementioned is a pure example; your inconsistency in train/test preprocessing may stem from other reasons.
I am training an encoder-decoder LSTM in keras for text summarization and the CNN dataset with the following architecture
Picture of bidirectional encoder-decoder LSTM
I am pretraining the word embedding (of size 256) using skip-gram and
I then pad the input sequences with zeros so all articles are of equal length
I put a vector of 1's in each summary to act as the "start" token
Use MSE, RMSProp, tanh activation in the decoder output later
Training: 20 epochs, batch_size=100, clip_norm=1,dropout=0.3, hidden_units=256, LR=0.001, training examples=10000, validation_split=0.2
The network trains and training and validation MSE go down to 0.005, however during inference, the decoder keeps producing a repetition of a few words that make no sense and are nowhere near the real summary.
My question is, is there anything fundamentally wrong in my training approach, the padding, loss function, data size, training time so that the network fails to generalize?
Your model looks ok, except for the loss function. I can't figure out how MSE is applicable to word prediction. Cross-entropy loss looks like a natural choice here.
Generated word repetition can be caused by the way the decoder works at inference time: you should not simply select the most probable word from the distribution, but rather sample from it. This will give more variance to the generated text. Start looking at beam search.
If I were to pick a single technique to boost sequence to sequence model performance, it's certainly attention mechanism. There are lots of post about it, you can start with this one, for example.