I'm trying to calculate the rmse error of two torch tensors. I would like to ignore/mask the rows where the labels are 0 (missing values). How could I modify this line to take that restriction into account?
torch.sqrt(((preds.detach() - labels) ** 2).mean()).item()
Thank you in advance.
This can be solved by defining a custom MSE loss function* that masks out the missing values, 0 in your case, from both the input and target tensors:
def mse_loss_with_nans(input, target):
# Missing data are nan's
# mask = torch.isnan(target)
# Missing data are 0's
mask = target == 0
out = (input[~mask]-target[~mask])**2
loss = out.mean()
return loss
(*) Computing MSE is equivalent to RMSE from an optimisation point of view -- with the advantage of being computationally faster.
Related
I have a loss based on 2 things:
MSE loss
A custom loss term based on the network weights.
I have this code:
net = CustomNet()
mse_loss = torch.nn.MSELoss()
def custom_loss(output, target):
weights = net.linear_layer.weight.data
return mse_loss(output, target) + torch.linalg.norm(weights # weights.T -
torch.eye(weights.shape[0]))
When I try to remove the MSE loss (so my loss is only based on the weights):
def custom_loss(output, target):
weights = net.linear_layer.weight.data
return torch.linalg.norm(weights # weights.T -
torch.eye(weights.shape[0]))
I am getting the error:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I can notice that mse loss has grad_fn=<MseLossBackward object at 0x14908c450>
What am I doing wrong? Why can't I use only the second loss?
You can't use the second term alone because it doesn't have a grad_fn function as the first term does. This means if you're having both terms it will only backpropagate on the first term (the MSE loss) and will not consider the second. Having no grad_fn means it is seen as a constant w.r.t. the input or parameter and has no effect on the gradient.
The tensors you use to compute the second term do not require a gradient. More specifically any tensor that you get using the data attribute won't require a gradient. In your case net.linear_layer.weight.data.
Instead you should access the tensor directly via:
>>> weights = net.linear_layer.weight
I am trying to solve one multilabel problem with 270 labels and i have converted target labels into one hot encoded form. I am using BCEWithLogitsLoss(). Since training data is unbalanced, I am using pos_weight argument but i am bit confused.
pos_weight (Tensor, optional) – a weight of positive examples. Must be a vector with length equal to the number of classes.
Do i need to give total count of positive values of each label as a tensor or they mean something else by weights?
The PyTorch documentation for BCEWithLogitsLoss recommends the pos_weight to be a ratio between the negative counts and the positive counts for each class.
So, if len(dataset) is 1000, element 0 of your multihot encoding has 100 positive counts, then element 0 of the pos_weights_vector should be 900/100 = 9. That means that the binary crossent loss will behave as if the dataset contains 900 positive examples instead of 100.
Here is my implementation:
(new, based on this post)
pos_weight = (y==0.).sum()/y.sum()
(original)
def calculate_pos_weights(class_counts):
pos_weights = np.ones_like(class_counts)
neg_counts = [len(data)-pos_count for pos_count in class_counts]
for cdx, pos_count, neg_count in enumerate(zip(class_counts, neg_counts)):
pos_weights[cdx] = neg_count / (pos_count + 1e-5)
return torch.as_tensor(pos_weights, dtype=torch.float)
Where class_counts is just a column-wise sum of the positive samples. I posted it on the PyTorch forum and one of the PyTorch devs gave it his blessing.
Maybe is a little late, but here is how I calculate the same. Looking into the documentation:
For example, if a dataset contains 100 positive and 300 negative examples of a single class, then pos_weight for the class should be equal to 300/100 = 3.
So an easy way to calcule the positive weight is using the tensor methods with your label vector "y", in my case train_dataset.data.y. And then calculating the total negative labels.
num_positives = torch.sum(train_dataset.data.y, dim=0)
num_negatives = len(train_dataset.data.y) - num_positives
pos_weight = num_negatives / num_positives
Then the weights can be used easily as:
criterion = torch.nn.BCEWithLogitsLoss(pos_weight = pos_weight)
PyTorch solution
Well, actually I have gone through docs and you can simply use pos_weight indeed.
This argument gives weight to positive sample for each class, hence if you have 270 classes you should pass torch.Tensor with shape (270,) defining weight for each class.
Here is marginally modified snippet from documentation:
# 270 classes, batch size = 64
target = torch.ones([64, 270], dtype=torch.float32)
# Logits outputted from your network, no activation
output = torch.full([64, 270], 0.9)
# Weights, each being equal to one. You can input your own here.
pos_weight = torch.ones([270])
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
criterion(output, target) # -log(sigmoid(0.9))
Self-made solution
When it comes to weighting, there is no built-in solution, but you may code one yourself really easily:
import torch
class WeightedMultilabel(torch.nn.Module):
def __init__(self, weights: torch.Tensor):
self.loss = torch.nn.BCEWithLogitsLoss()
self.weights = weights.unsqueeze()
def forward(outputs, targets):
return self.loss(outputs, targets) * self.weights
Tensor has to be of the same length as the number of classes in your multilabel classification (270), each giving weight for your specific example.
Calculating weights
You just add labels of every sample in your dataset, divide by the minimum value and inverse at the end.
Sort of snippet:
weights = torch.zeros_like(dataset[0])
for element in dataset:
weights += element
weights = 1 / (weights / torch.min(weights))
Using this approach class occurring the least will give normal loss, while others will have weights smaller than 1.
It might cause some instability during training though, so you might want to experiment with those values a little (maybe log transform instead of linear?)
Other approach
You may think about upsampling/downsampling (though this operation is complicated as you would add/delete other classes as well, so advanced heuristics would be needed I think).
Just to provide a quick revision on #crypdick's answer, this implementation of the function worked for me:
def calculate_pos_weights(class_counts,data):
pos_weights = np.ones_like(class_counts)
neg_counts = [len(data)-pos_count for pos_count in class_counts]
for cdx, (pos_count, neg_count) in enumerate(zip(class_counts, neg_counts)):
pos_weights[cdx] = neg_count / (pos_count + 1e-5)
return torch.as_tensor(pos_weights, dtype=torch.float)
Where data is the dataset you're trying to apply weights to.
I have to train a network on unlabelled data of binary type (True/False), which sounds like unsupervised learning. This is what the normalised data look like:
array([[-0.05744527, -1.03575495, -0.1940105 , -1.15348956, -0.62664491,
-0.98484037],
[-0.05497629, -0.50935675, -0.19396862, -0.68990988, -0.10551919,
-0.72375012],
[-0.03275552, 0.31480204, -0.1834951 , 0.23724946, 0.15504367,
0.29810553],
...,
[-0.05744527, -0.68482282, -0.1940105 , -0.87534175, -0.23580062,
-0.98484037],
[-0.05744527, -1.50366446, -0.1940105 , -1.52435329, -1.14777063,
-0.98484037],
[-0.05744527, -1.26970971, -0.1940105 , -1.33892142, -0.88720777,
-0.98484037]])
However, I do have a constraint on the total number of True labels in my data. This doesn't mean I can build a classical custom loss function in Keras taking (y_true, y_pred) arguments as required: my external constraint is just on the predicted total of True and False, not on the individual labels.
My question is whether there is a somewhat "standard" approach to this kind of problems, and how that is implementable in Keras.
POSSIBLE SOLUTION
Should I assign y_true randomly as 0/1, have a network return y_pred as 1/0 with a sigmoid activation function, and then define my loss function as
sum_y_true = 500 # arbitrary constant known a priori
def loss_function(y_true, y_pred):
loss = np.abs(y_pred.sum() - sum_y_true)
return loss
In the end, I went with the following solution, which worked.
1) Define batches in your dataframe df with a batch_id column, so that in each batch Y_train is your identical "batch ground truth" (in my case, the total number of True labels in the batch). You can then pass these instances together to the network. This can be done with a generator:
def grouper(g,x,y):
while True:
for gr in g.unique():
# this assigns indices to the entire set of values in g,
# then subsects to all the rows in which g == gr
indices = g == gr
yield (x[indices],y[indices])
# train set
train_generator = grouper(df.loc[df['set'] == 'train','batch_id'], X_train, Y_train)
# validation set
val_generator = grouper(df.loc[df['set'] == 'val','batch_id'], X_val, Y_val)
2) define a custom loss function, to track how close the total number of instances predicted as true matches the ground truth:
def custom_delta(y_true, y_pred):
loss = K.abs(K.mean(y_true) - K.sum(y_pred))
return loss
def custom_wrapper():
def custom_loss_function(y_true, y_pred):
return custom_delta(y_true, y_pred)
return custom_loss_function
Note that here
a) Each y_true label is already the sum of the ground truth in our batch (cause we don't have individual values). That's why y_true is not summed over;
b) K.mean is actually a bit of an overkill to extract a single scalar from this uniform tensor, in which all y_true values in each batch are identical - K.min or K.max would also work, but I haven't tested whether their performance is faster.
3) Use fit_generator instead of fit:
fmodel = Sequential()
# ...your layers...
# Create the loss function object using the wrapper function above
loss_ = custom_wrapper()
fmodel.compile(loss=loss_, optimizer='adam')
history1 = fmodel.fit_generator(train_generator, steps_per_epoch=total_batches,
validation_data=val_generator,
validation_steps=df.loc[encs.df['set'] == 'val','batch_id'].nunique(),
epochs=20, verbose = 2)
This way the problem is basically addressed as one of supervised learning, although without individual labels, which means that notions like true/false positive are meaningless here.
This approach not only managed to give me a y_pred that closely matches the totals I know per batch. It actually finds two groups (True/False) that occupy the expected different portions of parameter space.
It’s known that sparse_categorical_crossentropy in keras can get the average loss function among each category. But what if only one certain category was I concerned most? Like if I want to define the precision(=TP/(TP+FP)) based on this category as loss function, how can I write it? Thanks!
My codes were like:
from keras import backend as K
def my_loss(y_true,y_pred):
y_true = K.cast(y_true,"float32")
y_pred = K.cast(K.argmax(y_pred),"float32")
nominator = K.sum(K.cast(K.equal(y_true,y_pred) & K.equal(y_true, 0),"float32"))
denominator = K.sum(K.cast(K.equal(y_pred,0),"float32"))
return -(nominator + K.epsilon()) / (denominator + K.epsilon())
And the error is like:
argmax is not differentiable
I don't recommend you to use precision as the loss function.
It is not differentiable that can't be set as a loss function for nn.
you can max it by predicting all the instance as class negative, that makes no sense.
One of the alternative solution is using F1 as the loss function, then tuning the probability cut-off manually for obtaining a desirable level of precision as well as recall is not too low.
You can pass to the fit method a parameter class_weight where you determine which classes are more important.
It should be a dictionary:
{
0: 1, #class 0 has weight 1
1: 0.5, #class 1 has half the importance of class 0
2: 0.7, #....
...
}
Custom loss
If that is not exactly what you need, you can create loss functions like:
import keras.backend as K
def customLoss(yTrue,yPred):
create operations with yTrue and yPred
- yTrue = the true output data (equal to y_train in most examples)
- yPred = the model's calculated output
- yTrue and yPred have exactly the same shape: (batch_size,output_dimensions,....)
- according to the output shape of the last layer
- also according to the shape of y_train
all operations must be like +, -, *, / or operations from K (backend)
return someResultingTensor
You cannot used argmax as it is not differentiable. That means that backprop will not work if loss function can't be differentiated.
Instead of using argmax, do y_true * y_pred.
I have a 1000 classes in the network and they have multi-label outputs. For each training example, the number of positive output is same(i.e 10) but they can be assigned to any of the 1000 classes. So 10 classes have output 1 and rest 990 have output 0.
For the multi-label classification, I am using 'binary-cross entropy' as cost function and 'sigmoid' as the activation function. When I tried this rule of 0.5 as the cut-off for 1 or 0. All of them were 0. I understand this is a class imbalance problem. From this link, I understand that, I might have to create extra output labels.Unfortunately, I haven't been able to figure out how to incorporate that into a simple neural network in keras.
nclasses = 1000
# if we wanted to maximize an imbalance problem!
#class_weight = {k: len(Y_train)/(nclasses*(Y_train==k).sum()) for k in range(nclasses)}
inp = Input(shape=[X_train.shape[1]])
x = Dense(5000, activation='relu')(inp)
x = Dense(4000, activation='relu')(x)
x = Dense(3000, activation='relu')(x)
x = Dense(2000, activation='relu')(x)
x = Dense(nclasses, activation='sigmoid')(x)
model = Model(inputs=[inp], outputs=[x])
adam=keras.optimizers.adam(lr=0.00001)
model.compile('adam', 'binary_crossentropy')
history = model.fit(
X_train, Y_train, batch_size=32, epochs=50,verbose=0,shuffle=False)
Could anyone help me with the code here and I would also highly appreciate if you could suggest a good 'accuracy' metric for this problem?
Thanks a lot :) :)
I have a similar problem and unfortunately have no answer for most of the questions. Especially the class imbalance problem.
In terms of metric there are several possibilities: In my case I use the top 1/2/3/4/5 results and check if one of them is right. Because in your case you always have the same amount of labels=1 you could take your top 10 results and see how many percent of them are right and average this result over your batch size. I didn't find a possibility to include this algorithm as a keras metric. Instead, I wrote a callback, which calculates the metric on epoch end on my validation data set.
Also, if you predict the top n results on a test dataset, see how many times each class is predicted. The Counter Class is really convenient for this purpose.
Edit: If found a method to include class weights without splitting the output.
You need a numpy 2d array containing weights with shape [number classes to predict, 2 (background and signal)].
Such an array could be calculated with this function:
def calculating_class_weights(y_true):
from sklearn.utils.class_weight import compute_class_weight
number_dim = np.shape(y_true)[1]
weights = np.empty([number_dim, 2])
for i in range(number_dim):
weights[i] = compute_class_weight('balanced', [0.,1.], y_true[:, i])
return weights
The solution is now to build your own binary crossentropy loss function in which you multiply your weights yourself:
def get_weighted_loss(weights):
def weighted_loss(y_true, y_pred):
return K.mean((weights[:,0]**(1-y_true))*(weights[:,1]**(y_true))*K.binary_crossentropy(y_true, y_pred), axis=-1)
return weighted_loss
weights[:,0] is an array with all the background weights and weights[:,1] contains all the signal weights.
All that is left is to include this loss into the compile function:
model.compile(optimizer=Adam(), loss=get_weighted_loss(class_weights))