PyTorch Inequality Gradient - pytorch

I wrote a PyTorch model roughly as follows:
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.layer1 = nn.Sequential(nn.Linear(64 * 64, 16), nn.LeakyReLU(0.2))
        self.layer2 = nn.Sequential(nn.Linear(16, 32), nn.LeakyReLU(0.2))
        self.layer3 = nn.Sequential(nn.Linear(32, 64), nn.LeakyReLU(0.2))
        self.layer4 = nn.Sequential(nn.Linear(64, 15), nn.Tanh())

    def forward(self, x):
        return (self.layer4(self.layer3(self.layer2(self.layer1(x)))) < 0).float()
Notice what I want to do: I want forward to return a tensor of 0s and 1s. However, this does not train, probably because the derivative of the inequality is zero.
How can I make a model like this one train, for example, if I want to do image segmentation?

As you said, you can't train through something like x < 0: the comparison has zero gradient almost everywhere.
You should be fine if you drop the < 0 part and use
    return self.layer4(self.layer3(self.layer2(self.layer1(x))))
as long as you use an appropriate loss. I think what you want is nn.BCEWithLogitsLoss. In that case, remove the Tanh from the last layer, since nn.BCEWithLogitsLoss applies a sigmoid internally.
(You could also use nn.BCELoss() with a sigmoid as the last layer, or even stick with Tanh, but I don't see a reason to take the long way.)
During training, the network then tries as hard as it can to fit its outputs to the 0/1 targets. Only at test time do you take the output of the last layer and apply a threshold to turn the values into exact 1s and 0s, as you did with (output < 0).float().
Searching for "multilabel classification" will turn up useful resources.
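A minimal sketch of that train-then-threshold workflow, assuming random placeholder data (the tensors `x` and `y`, the batch size, and the optimizer settings are illustrative and not from the question). It thresholds with `> 0`, which corresponds to a sigmoid probability above 0.5; flip the comparison if your labels use the opposite convention.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same stack as in the question, but the last layer emits raw logits (no Tanh).
model = nn.Sequential(
    nn.Linear(64 * 64, 16), nn.LeakyReLU(0.2),
    nn.Linear(16, 32), nn.LeakyReLU(0.2),
    nn.Linear(32, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 15),
)

# Hypothetical data: 8 flattened 64x64 inputs with 15 binary labels each.
x = torch.randn(8, 64 * 64)
y = torch.randint(0, 2, (8, 15)).float()

criterion = nn.BCEWithLogitsLoss()  # applies the sigmoid internally
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training: the loss sees the continuous logits, so gradients can flow.
losses = []
for _ in range(20):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Inference: threshold only now. A logit above 0 means sigmoid(logit) > 0.5.
with torch.no_grad():
    pred = (model(x) > 0).float()
```

The hard comparison appears only inside `torch.no_grad()`, so it never takes part in backpropagation.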


How to handle different sizes of input data using PyTorch's built-in neural network modules

I built a simple PyTorch model as below. However, I receive an error message that the sizes of mat1 and mat2 are not aligned. How do I tweak the code to allow for data of different dimensions?
import torch
import torch.nn as nn

class simpleNet(nn.Module):
    def __init__(self, input_dim, hidden_size, num_classes):
        """
        :param input_dim: input feature dimension
        :param hidden_size: hidden dimension
        :param num_classes: total number of classes
        """
        # The class name passed to super() must match the enclosing class
        super(simpleNet, self).__init__()
        # hidden layer
        self.hidden = nn.Linear(input_dim, hidden_size)
        # Second fully connected layer that outputs one score per class
        self.output = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = None
        x = self.hidden(x)
        x = torch.sigmoid(x)
        x = self.output(x)
        out = x
        return out
I am trying to build a toy neural network using PyTorch.
For your neural network to work, the output size of each layer must equal the input size of the next layer. Since the snippet shows only the architecture, without the code that instantiates the model and feeds it data, I cannot tell what you could simplify; mismatched sizes between layers are not good practice, though. As a brute-force workaround, you can use torch's reshape function to make the output of one layer match the input expected by the next. Refer to:
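As a hedged illustration of the reshape idea, the sketch below flattens each sample before the first nn.Linear and derives input_dim from the data instead of hard-coding it. The class name `SimpleNet`, the batch shape, and the layer sizes are assumptions made for the example; the question does not show the real data.

```python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, input_dim, hidden_size, num_classes):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_size)
        self.output = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Flatten everything except the batch dimension: (N, ...) -> (N, features).
        x = x.reshape(x.size(0), -1)
        return self.output(torch.sigmoid(self.hidden(x)))

# Hypothetical batch of 4 single-channel 28x28 images.
batch = torch.randn(4, 1, 28, 28)
input_dim = batch[0].numel()  # derived from the data rather than hard-coded
net = SimpleNet(input_dim, hidden_size=32, num_classes=10)
logits = net(batch)
```

The "mat1 and mat2" error typically appears when the last dimension of the tensor reaching nn.Linear differs from the in_features it was built with, so computing input_dim from one sample keeps the two in sync.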

Multivariate multi-step time forecasting bad prediction results PyTorch LSTM Seq2Seq

I am trying to build an LSTM based Seq2Seq model in PyTorch for multivariate multistep prediction.
The data used is shown in the figure, where the last column is the target and all preceding columns are features. For preprocessing, I use MinMaxScaler to scale all data between -1 and 1.
[Figure: Features and Target]
Then I used an Encoder-Decoder structure.
class Seq2Seq(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size, batch_size):
        super().__init__()  # required before registering submodules
        self.output_size = output_size
        self.Encoder = Encoder(input_size, hidden_size, num_layers, batch_size)
        self.Decoder = Decoder(input_size, hidden_size,
                               num_layers, output_size, batch_size)

    def forward(self, input_seq):
        batch_size, seq_len, _ = input_seq.shape
        h, c = self.Encoder(input_seq)
        outputs = torch.zeros(batch_size, seq_len, self.output_size).to(device)
        for t in range(seq_len):
            _input = input_seq[:, t, :]
            # print(_input.shape)
            output, h, c = self.Decoder(_input, h, c)
            outputs[:, t, :] = output
        # Only the prediction from the last time step is returned
        return outputs[:, -1, :]
The Training
def seq2seq_train(model, Dtr, Val, path):
    model = model
    loss_function = nn.MSELoss().to(device)
    # loss_function = nn.L1Loss().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
After 100 epochs of training, the losses and test results are as follows.
[Figure: Loss History]
[Figure: Test Result]
The validation loss doesn't seem to drop, and the predictions are poor.
I then used Optuna to tune the hyperparameters, including the number of hidden units, the number of LSTM layers, dropout, etc., but the results are still not good: every configuration has a high validation loss.
I would like to know what causes this result: is it a problem with the data, the model structure, or the hyperparameters?
I hope to get help. Thank you very much.
Tentative answer based on the info provided:
Note that when one uses cross-entropy loss for classification, as is usually done, bad predictions are penalized much more strongly than good predictions are rewarded. For a cat image, the loss is −log(prediction), where prediction is the probability the model assigns to "cat", so even if many cat images are correctly predicted (low loss), a single badly misclassified cat image will have a high loss, hence "blowing up" your mean loss. See this answer for further illustration of this phenomenon. (Increasing loss with stable accuracy could also be caused by good predictions being classified a little worse, but I find that less likely because of this loss "asymmetry".)
So I think that when both accuracy and loss are increasing, the network is starting to overfit, and both phenomena are happening at the same time. The network is starting to learn patterns that are relevant only to the training set and do not generalize, so some images from the validation set get predicted really wrong, with an effect amplified by the "loss asymmetry". However, it is at the same time still learning some patterns that are useful for generalization (phenomenon one, "good learning"), as more and more images are being correctly classified.
There is also a great explanation in this Tweet that concisely explains why you may encounter validation loss being lower than training loss.
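A small numeric sketch of that asymmetry, taking the per-sample loss to be the negative log of the probability assigned to the true class. The probabilities 0.9 and 0.001 and the sample count are made up for illustration.

```python
import math

def mean_cross_entropy(probs_for_true_class):
    """Mean of -log(p) over the probabilities assigned to the correct class."""
    return sum(-math.log(p) for p in probs_for_true_class) / len(probs_for_true_class)

# 100 validation samples, all classified correctly with p = 0.9.
all_good = [0.9] * 100
# Same set, but one sample is now confidently wrong (p = 0.001 for the true class).
one_bad = [0.9] * 99 + [0.001]

loss_good = mean_cross_entropy(all_good)
loss_bad = mean_cross_entropy(one_bad)
accuracy_bad = sum(p > 0.5 for p in one_bad) / len(one_bad)

print(f"all correct:     mean loss = {loss_good:.3f}")
print(f"one bad mistake: mean loss = {loss_bad:.3f}, accuracy = {accuracy_bad:.2f}")
```

One confidently wrong sample out of a hundred barely moves the accuracy but raises the mean loss noticeably, which is how loss and accuracy can move in different directions.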

What does PyTorch classifier output?

I am new to deep learning and have started learning PyTorch. I created a classifier model with the following structure.
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class model(nn.Module):
    def __init__(self):
        super(model, self).__init__()
        resnet = models.resnet34(pretrained=True)
        # Keep the convolutional backbone, dropping the pooling and fc head
        layers = list(resnet.children())[:8]
        self.features1 = nn.Sequential(*layers[:6])
        self.features2 = nn.Sequential(*layers[6:])
        self.classifier = nn.Sequential(nn.BatchNorm1d(512), nn.Linear(512, 3))

    def forward(self, x):
        x = self.features1(x)
        x = self.features2(x)
        x = F.relu(x)
        x = nn.AdaptiveAvgPool2d((1,1))(x)
        x = x.view(x.shape[0], -1)
        return self.classifier(x)
Basically, I want to classify among three classes {0, 1, 2}. During evaluation, I passed in an image and the model returned a tensor with three values, like below:
(tensor([[-0.1526, 1.3511, -1.0384]], device='cuda:0', grad_fn=<AddmmBackward>)
So my question is: what are these three numbers? Are they probabilities?
P.S. Please pardon me if I asked something too silly.
The final nn.Linear (fully connected) layer of your model's self.classifier produces values that we can call scores, for example [10.3, -3.5, -12.0]. You can see the same in your example: [-0.1526, 1.3511, -1.0384]. These values are not normalized and cannot be interpreted as probabilities.
This is just the "raw, unscaled" network output, which is hard to use or interpret directly. That is why the common practice is to convert the scores to a normalized probability distribution by applying softmax after the final layer, as @skinny_func has already described. You then get probabilities between 0 and 1, which is a more intuitive representation.
So after training, you apply softmax to the output tensor to get the probability of each class, then choose the class with the maximal value (highest probability).
In your case:
prob = torch.nn.functional.softmax(model(x), dim=1)
_, pred_class = torch.max(prob, dim=1)
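To make this concrete, here is a short sketch that applies softmax to the three scores quoted in the question. It uses a hard-coded tensor in place of model(x), so no trained model is needed.

```python
import torch
import torch.nn.functional as F

# Raw scores copied from the question's output.
logits = torch.tensor([[-0.1526, 1.3511, -1.0384]])

prob = F.softmax(logits, dim=1)   # each row now sums to 1
pred_class = prob.argmax(dim=1)   # index of the most probable class

print(prob)
print(pred_class)
```

Softmax preserves the ordering of the scores, so taking the argmax of the raw scores picks the same class; softmax is only needed when you want the probabilities themselves.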

Why do multilayer perceptrons outperform RNNs in CartPole?

Recently, I compared two models for a DQN on the CartPole-v0 environment. One of them is a multilayer perceptron with 3 layers, and the other is an RNN built from an LSTM and 1 fully connected layer. I have an experience replay buffer of size 200000, and training doesn't start until it is filled up.
Although the MLP solved the problem in a reasonable number of training steps (meaning it achieved a mean reward of 195 over the last 100 episodes), the RNN model did not converge as quickly, and its maximum mean reward never even reached 195!
I have already tried increasing the batch size, adding more neurons to the LSTM's hidden state, increasing the RNN's sequence length, and making the fully connected layer more complex, but every attempt failed: I saw enormous fluctuations in mean reward, so the model hardly converged at all. Might these be signs of early overfitting?
class DQN(nn.Module):
    def __init__(self, n_input, output_size, n_hidden, n_layers, dropout=0.3):
        super(DQN, self).__init__()
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lstm = nn.LSTM(input_size=n_input,
        self.dropout = nn.Dropout(dropout)
        self.fully_connected = nn.Linear(n_hidden, output_size)

    def forward(self, x, hidden_parameters):
        batch_size = x.size(0)
        output, hidden_state = self.lstm(x.float(), hidden_parameters)
        seq_length = output.shape[1]
        output1 = output.contiguous().view(-1, self.n_hidden)
        output2 = self.dropout(output1)
        output3 = self.fully_connected(output2)
        new = output3.view(batch_size, seq_length, -1)
        new = new[:, -1]
        return new.float(), hidden_state

    def init_hidden(self, batch_size, device):
        weight = next(self.parameters()).data
        hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().to(device),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_().to(device))
        return hidden
Contrary to what I expected, the simpler model gave a much better result than the other, even though an RNN is supposed to be better at processing time series data.
Can anybody tell me the reason for this?
Also, I should state that I applied no feature engineering and both DQNs worked with raw data. Could the RNN outperform the MLP with normalized features? (I mean feeding both models normalized data.)
Is there anything you can recommend to improve training efficiency for RNNs and achieve the best results?
"Contrary to what I expected, the simpler model gave a much better result than the other, even though an RNN is supposed to be better at processing time series data."
There is no time series in cart-pole: the state contains all the information needed for an optimal decision. It would be different if, for instance, you learned from images and needed to estimate the pole velocity from a series of images.
Also, it is not true that the more complex model should perform better. On the contrary, it is more likely to overfit. For cart-pole you don't even need a NN; a simple linear approximator with RBFs or random Fourier features would suffice. An LSTM-based RNN is certainly overkill for such a simple problem.
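A rough sketch of what a linear Q-function over random Fourier features could look like. The feature count, the bandwidth, and the random states are assumptions chosen for illustration; this shows only the function approximator, not a training loop, and makes no claim about how fast it solves CartPole.

```python
import math
import torch

torch.manual_seed(0)

state_dim, n_actions, n_features = 4, 2, 256  # CartPole: 4 state variables, 2 actions
bandwidth = 1.0  # assumed kernel width; would need tuning in practice

# Fixed (untrained) random projection that approximates an RBF kernel.
W = torch.randn(n_features, state_dim) / bandwidth
b = 2 * math.pi * torch.rand(n_features)

def features(state):
    """Map a batch of states (N, 4) to random Fourier features (N, n_features)."""
    return math.sqrt(2.0 / n_features) * torch.cos(state @ W.T + b)

# The only learned parameters: one weight vector per action.
theta = torch.zeros(n_features, n_actions, requires_grad=True)

def q_values(state):
    return features(state) @ theta  # (N, n_actions), linear in theta

states = torch.randn(5, state_dim)  # hypothetical states, not from a real environment
q = q_values(states)
greedy_action = q.argmax(dim=1)
```

Because the Q-values are linear in theta, the model has no recurrent state and far fewer moving parts than an LSTM, which fits a task where the current observation already holds everything needed.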

Keras: Pixelwise class imbalance in binary image segmentation

I have a task in which I input a 500x500x1 image and get out a 500x500x1 binary segmentation. When working, only a small fraction of the 500x500 should be triggered (small "targets"). I'm using a sigmoid activation at the output. Since such a small fraction is desired to be positive, the training tends to stall with all outputs at zero, or very close. I've written my own loss function that partially deals with it, but I'd like to use binary cross entropy with a class weighting if possible.
My question is in two parts:
1. If I naively apply binary_crossentropy as the loss to my 500x500x1 output, will it apply on a per-pixel basis as desired?
2. Is there a way for Keras to apply class weighting with the single sigmoid output per pixel?
To answer your questions:
1. Yes, binary_crossentropy is applied per pixel, provided you feed your segmentation network pairs of the form (500x500x1 grayscale image, 500x500x1 corresponding mask).
2. By passing the 'class_weight' parameter to model.fit.
Suppose you have 2 classes with a 90%-10% distribution. Then you may want to penalise your algorithm 9 times more when it makes a mistake on the less well represented class (the class with 10% in this case). Suppose you have 900 examples of class 0 and 100 examples of class 1.
Then your class weights dictionary is as follows (there are multiple ways to compute it; what is important is to assign a greater weight to the less well represented class):
class_weights = {0: 1000/900, 1: 1000/100}
Example: model.fit(X_train, Y_train, epochs=30, batch_size=32, class_weight=class_weights)
NOTE: class_weight is available only in the 2D case. For 3D or higher-dimensional targets, one should use 'sample_weight'. For segmentation purposes, you would rather use the sample_weight parameter.
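For segmentation, the same weighting idea can be expressed as a per-pixel weight map built from the mask. The NumPy sketch below shows one way to compute it, using the "total / count" rule from the dictionary above. The helper name `pixel_weight_map` and the toy mask are hypothetical, and how such a map is handed to Keras depends on the Keras version, so that part is not shown.

```python
import numpy as np

def pixel_weight_map(mask):
    """Return per-pixel weights with the same shape as a binary mask.

    Each class gets weight total / count, so the rarer class weighs more.
    """
    mask = np.asarray(mask)
    total = mask.size
    n_pos = int(mask.sum())
    n_neg = total - n_pos
    w_pos = total / max(n_pos, 1)  # guard against masks with no positives
    w_neg = total / max(n_neg, 1)
    return np.where(mask == 1, w_pos, w_neg).astype("float32")

# Hypothetical 10x10 mask with 10 positive pixels (10% positive, 90% negative).
mask = np.zeros((10, 10), dtype="uint8")
mask[0, :] = 1
weights = pixel_weight_map(mask)
```

With a 90%-10% split, the positive pixels end up weighted 9 times more heavily than the negative ones, matching the ratio described above.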
The biggest gain will come from using other loss functions. Some losses other than binary_crossentropy and categorical_crossentropy inherently perform better on unbalanced datasets. Dice loss is one such loss function.
Keras implementation:
from keras import backend as K  # Keras 2.x backend alias used below

smooth = 1.  # avoids division by zero when both masks are empty

def dice_coef(y_true, y_pred):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_coef_loss(y_true, y_pred):
    return 1 - dice_coef(y_true, y_pred)
You can also use the sum of binary_crossentropy and other losses as the loss function if it suits you, e.g. loss = dice_loss + bce.
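To see why Dice helps with imbalance, the NumPy sketch below compares the two losses on a degenerate "almost all zeros" prediction. It mirrors the Dice formula above but does not use the Keras backend; the mask size, the target size, and the constant prediction 0.01 are made up for illustration.

```python
import numpy as np

def dice_loss(y_true, y_pred, smooth=1.0):
    intersection = np.sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)

def bce_loss(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))))

# Hypothetical 500x500 mask with a single 10x10 target (0.04% positive pixels).
y_true = np.zeros((500, 500))
y_true[100:110, 100:110] = 1.0

# A degenerate prediction that outputs (almost) zero everywhere.
y_empty = np.full((500, 500), 0.01)

bce_empty = bce_loss(y_true, y_empty)
dice_empty = dice_loss(y_true, y_empty)
combined = dice_empty + bce_empty  # the summed loss mentioned above

print(f"BCE  = {bce_empty:.4f}")
print(f"Dice = {dice_empty:.4f}")
```

The mean BCE stays small because the many background pixels dominate the average, while the Dice loss stays close to 1 because it only measures overlap with the target, so the all-zero solution is no longer attractive.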
