In the PyTorch RNN implementation, there are two biases, b_ih and b_hh.
Why is this? Is it different from using one bias? If so, how? Does it affect performance or efficiency?
Actually, the previous (accepted) answer is wrong. The second bias parameter is required only for compatibility with CuDNN. See the comment in the source code itself:
class RNNBase(Module):
    ...
    def __init__(self, ...):
        ...
        w_ih = Parameter(torch.empty((gate_size, layer_input_size), **factory_kwargs))
        w_hh = Parameter(torch.empty((gate_size, real_hidden_size), **factory_kwargs))
        b_ih = Parameter(torch.empty(gate_size, **factory_kwargs))
        # Second bias vector included for CuDNN compatibility. Only one  <--- this
        # bias vector is needed in standard definition.                  <--- comment
        b_hh = Parameter(torch.empty(gate_size, **factory_kwargs))
        ...
The formula in the PyTorch documentation for RNN is self-explanatory; b_ih and b_hh are exactly the biases in the equation:

h_t = tanh(x_t W_ih^T + b_ih + h_{t-1} W_hh^T + b_hh)

You may think of b_ih as the bias for the input (paired with w_ih, the input weights) and b_hh as the bias for the hidden state (paired with w_hh, the hidden weights), but since both are added inside the same sum, they are mathematically equivalent to a single bias b = b_ih + b_hh.
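To see the equivalence in practice, here is a minimal sketch (using the standard nn.RNN parameter names) that folds both biases into one and checks that the output is unchanged:

import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=3, hidden_size=4, batch_first=True)
x = torch.randn(1, 5, 3)

out_two_biases, _ = rnn(x)

# Fold the sum of both biases into b_ih and zero out b_hh.
with torch.no_grad():
    rnn.bias_ih_l0 += rnn.bias_hh_l0
    rnn.bias_hh_l0.zero_()

out_one_bias, _ = rnn(x)
print(torch.allclose(out_two_biases, out_one_bias))  # True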
I was studying the masked language modeling codebase in Huggingface Transformers, and I have a question about the language model head.
Consider the final linear layer, where we project the hidden size to the vocabulary size (https://github.com/huggingface/transformers/blob/f2fbe4475386bfcfb3b83d0a3223ba216a3c3a91/src/transformers/models/bert/modeling_bert.py#L685-L702):
self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
self.bias = nn.Parameter(torch.zeros(config.vocab_size))
self.decoder.bias = self.bias
The bias term is initialized to zero here. Later, when we initialize the weights, we tie the weight of this linear layer to the word embedding, but we don't do anything similar for the bias term. I wonder how we can understand that, and why we want to initialize the bias term as a zero vector.
https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L1060-L1079
def tie_weights(self):
    """
    Tie the weights between the input embeddings and the output embeddings.

    If the `torchscript` flag is set in the configuration, can't handle parameter sharing so we are cloning the
    weights instead.
    """
    if getattr(self.config, "tie_word_embeddings", True):
        output_embeddings = self.get_output_embeddings()
        if output_embeddings is not None:
            self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())

    if getattr(self.config, "is_encoder_decoder", False) and getattr(self.config, "tie_encoder_decoder", False):
        if hasattr(self, self.base_model_prefix):
            self = getattr(self, self.base_model_prefix)
        self._tie_encoder_decoder_weights(self.encoder, self.decoder, self.base_model_prefix)

    for module in self.modules():
        if hasattr(module, "_tie_weights"):
            module._tie_weights()
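For reference, here is a minimal, hypothetical sketch of what the tying amounts to (the names are illustrative, not the actual Huggingface internals): the decoder shares the embedding's weight matrix, while the bias remains an independent parameter with nothing to tie it to.

import torch
import torch.nn as nn

vocab_size, hidden_size = 1000, 64  # illustrative sizes

embedding = nn.Embedding(vocab_size, hidden_size)
decoder = nn.Linear(hidden_size, vocab_size, bias=False)
decoder.bias = nn.Parameter(torch.zeros(vocab_size))

# Tying: both modules now reference the same Parameter object, so
# gradients from the LM head flow into the embedding matrix.
decoder.weight = embedding.weight
assert decoder.weight is embedding.weight

# The bias has no counterpart in the embedding, so it cannot be tied;
# starting it at zero means the tied head initially scores tokens purely
# by similarity to their embeddings.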
My understanding:
The final linear layer receives hidden representations that have been transformed by several feed-forward layers, so they may no longer match the embedding space exactly; the bias term gives the model a way to compensate for that mismatch.
As I'm not sure my understanding is accurate, I would like to seek your opinions.
I tried to write a cross-entropy loss function by myself. My loss function gives the same loss value as the official one, but when I use it in the code instead of the official cross-entropy loss, the code does not converge; when I use the official cross-entropy loss, it converges. Here is my code, please give me some suggestions. Thanks very much.
The input 'out' is a tensor of shape (B × C) and 'label' contains class indices (1 × B):
import torch
import torch.nn as nn

class MylossFunc(nn.Module):
    def __init__(self):
        super(MylossFunc, self).__init__()

    def forward(self, out, label):
        out = torch.nn.functional.softmax(out, dim=1)
        n = len(label)
        loss = torch.zeros(1, requires_grad=True)
        tmp = torch.log(out)
        for i in range(n):
            # Clamp each log-probability at -100 to limit the effect of log(0)
            loss = loss - torch.max(tmp[i][label[i]], torch.scalar_tensor(-100)) / n
        loss = torch.sum(loss)
        return loss
Instead of using torch.softmax and torch.log separately, you should use torch.log_softmax; otherwise your training will become unstable, with nan values everywhere.
This happens because when you take the softmax of your logits using the following line:
out = torch.nn.functional.softmax(out, dim=1)
you might get a zero in one of the components of out, and applying torch.log to it then results in nan (since log(0) is undefined). That is why torch (and other common libraries) provide a single stable operation, log_softmax, to avoid the numerical instabilities that occur when you use torch.softmax and torch.log individually.
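As a sketch of the stable version applied to the asker's setup (shapes assumed as in the question: out is B × C, label has length B), checked against the built-in loss:

import torch
import torch.nn.functional as F

def my_loss(out, label):
    # log_softmax computes log(softmax(out)) in one numerically stable pass
    log_probs = F.log_softmax(out, dim=1)
    return -log_probs[torch.arange(len(label)), label].mean()

out = torch.randn(8, 5, requires_grad=True)
label = torch.randint(0, 5, (8,))
print(torch.allclose(my_loss(out, label), F.cross_entropy(out, label)))  # True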
I am training a PyTorch model to perform binary classification. My minority class makes up about 10% of the data, so I want to use a weighted loss function. The docs for BCELoss and CrossEntropyLoss say that I can use a 'weight' for each sample.
However, when I declare CE_loss = nn.BCELoss() or nn.CrossEntropyLoss() and then call CE_loss(output, target, weight=batch_weights), where output, target, and batch_weights are tensors of length batch_size, I get the following error message:
forward() got an unexpected keyword argument 'weight'
Another way you could accomplish your goal is to use reduction='none' when initializing the loss and then multiply the resulting per-element tensor by your weights before computing the mean.
e.g.
loss = torch.nn.BCELoss(reduction='none')  # keep the per-element losses
model = torch.sigmoid

weights = torch.rand(10, 1)
inputs = torch.rand(10, 1)
targets = torch.rand(10, 1)

intermediate_losses = loss(model(inputs), targets)
final_loss = torch.mean(weights * intermediate_losses)
Of course for your scenario you still would need to calculate the weights tensor. But hopefully this helps!
Could it be that you want to apply separate fixed weights to all elements of class 0 and class 1 in your dataset? It is not clear what value you are passing for batch_weights here. If so, then that is not what the weight parameter in BCELoss does. The weight parameter expects you to pass a separate weight for every ELEMENT in the dataset, not for every CLASS. There are several ways around this. You could construct a weight table for every element. Alternatively, you could use a custom loss function that does what you want:
import torch

def BCELoss_class_weighted(weights):
    def loss(input, target):
        # Clamp predictions away from 0 and 1 to avoid log(0)
        input = torch.clamp(input, min=1e-7, max=1 - 1e-7)
        bce = -weights[1] * target * torch.log(input) \
              - (1 - target) * weights[0] * torch.log(1 - input)
        return torch.mean(bce)
    return loss
Note that it is important to add a clamp to avoid numerical instability.
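As a hypothetical usage example (the 9:1 weighting matches the roughly 10% minority class from the question):

# weights[0] applies to class 0, weights[1] to class 1; values are illustrative
loss_fn = BCELoss_class_weighted(torch.tensor([1.0, 9.0]))

predictions = torch.sigmoid(torch.randn(16, 1))
targets = torch.randint(0, 2, (16, 1)).float()
print(loss_fn(predictions, targets))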
HTH Jeroen
The issue is where you're providing the weight parameter. As mentioned in the docs, the weight should be provided during module instantiation.
For example, something like,
import torch
from torch import nn

weights = torch.FloatTensor([2.0, 1.2])
loss = nn.BCELoss(weight=weights)
You can find a more concrete example here or another helpful PT forum discussion here.
You need to pass the weights (a tensor with one entry per class) when constructing the loss, like below:
CE_loss = nn.CrossEntropyLoss(weight=…)
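For instance, a hypothetical two-class setup with a 10% minority class might look like this (the weight values are illustrative):

import torch
from torch import nn

class_weights = torch.tensor([1.0, 9.0])  # hypothetical: upweight the minority class
CE_loss = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(4, 2)
targets = torch.tensor([0, 1, 1, 0])
print(CE_loss(logits, targets))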
This is similar to the idea of @Jeroen Vuurens, but the class weights are determined by the target mean:
y_train_mean = y_train.mean()  # fraction of positive samples

# The combined weight is 1/(1 - p) when target == 0 and 1/p when target == 1
bi_cls_w2 = 1 / (1 - y_train_mean)
bi_cls_w1 = 1 / y_train_mean - bi_cls_w2

bce_loss = nn.BCELoss(reduction='none')
loss_fun = lambda pred, target: ((bi_cls_w1 * target + bi_cls_w2) * bce_loss(pred, target)).mean()
I have read many papers where convolutional neural networks are used for super-resolution, image segmentation, autoencoders, and so on. They use different kinds of upsampling, a.k.a. deconvolutions, which are discussed over here in a different question.
Here in TensorFlow there is a function for this.
Here in Keras there are some.
I implemented the Keras one:
x = tf.keras.layers.UpSampling1D(size=2)(x)
and I used this one, taken from a super-resolution repo here:
import tensorflow as tf

class SubPixel1D(tf.keras.layers.Layer):
    def __init__(self, r):
        super(SubPixel1D, self).__init__()
        self.r = r

    def call(self, inputs):
        with tf.name_scope('subpixel'):
            X = tf.transpose(inputs, [2, 1, 0])  # (r, w, b)
            X = tf.compat.v1.batch_to_space_nd(X, [self.r], [[0, 0]])  # (1, r*w, b)
            X = tf.transpose(X, [2, 1, 0])
            return X
But I realized that neither of them has parameters in my model summary. Isn't it necessary for these layers to have parameters so they can learn the upsampling?
In Keras, UpSampling simply repeats your input up to the size provided (you can find the documentation here), so there is no need for these layers to have parameters.
I think you have confused upsampling with transposed convolution / deconvolution.
If you look at the actual source code of UpSampling1D on GitHub, the upsampling involved is either nearest-neighbor or bilinear interpolation, and neither of these interpolation schemes has learnable parameters (weights or biases) unless it is followed by a convolution layer.
Since SubPixel1D likewise uses no convolution or other learnable layers, it has no trainable parameters either.
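To make the contrast concrete, here is a small sketch (the layer choices are illustrative): UpSampling1D contributes zero parameters to the model summary, while a transposed convolution, which actually learns its upsampling kernel, contributes weights:

import tensorflow as tf

inp = tf.keras.layers.Input(shape=(16, 8))  # (length, channels)

up = tf.keras.layers.UpSampling1D(size=2)(inp)  # no learnable weights
deconv = tf.keras.layers.Conv1DTranspose(
    filters=8, kernel_size=3, strides=2, padding='same')(inp)  # learnable kernel

model = tf.keras.Model(inp, [up, deconv])
model.summary()  # UpSampling1D: 0 params; Conv1DTranspose: 3*8*8 + 8 = 200 params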
I am trying to solve a multilabel problem with 270 labels, and I have converted the target labels into one-hot encoded form. I am using BCEWithLogitsLoss(). Since the training data is unbalanced, I am using the pos_weight argument, but I am a bit confused.
pos_weight (Tensor, optional) – a weight of positive examples. Must be a vector with length equal to the number of classes.
Do I need to give the total count of positive values of each label as a tensor, or do they mean something else by weights?
The PyTorch documentation for BCEWithLogitsLoss recommends that pos_weight be the ratio between the negative and positive counts for each class.
So, if len(dataset) is 1000 and element 0 of your multi-hot encoding has 100 positive counts, then element 0 of the pos_weight vector should be 900/100 = 9. The binary cross-entropy loss will then behave as if the dataset contained 900 positive examples of that class instead of 100.
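A quick sketch to verify that behavior: with pos_weight = 9, each positive example contributes nine times as much to the loss as it would unweighted:

import torch
import torch.nn.functional as F

logit = torch.tensor([0.3])
target = torch.ones(1)  # a positive example

plain = F.binary_cross_entropy_with_logits(logit, target)
weighted = F.binary_cross_entropy_with_logits(logit, target,
                                              pos_weight=torch.tensor(9.0))
print(torch.allclose(weighted, 9 * plain))  # True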
Here is my implementation:
(new, based on this post)
pos_weight = (y == 0.).sum(dim=0) / y.sum(dim=0)  # per-class negative/positive ratio
(original)
def calculate_pos_weights(class_counts):
    pos_weights = np.ones_like(class_counts)
    neg_counts = [len(data) - pos_count for pos_count in class_counts]
    for cdx, pos_count, neg_count in enumerate(zip(class_counts, neg_counts)):
        pos_weights[cdx] = neg_count / (pos_count + 1e-5)
    return torch.as_tensor(pos_weights, dtype=torch.float)
Where class_counts is just a column-wise sum of the positive samples. I posted it on the PyTorch forum and one of the PyTorch devs gave it his blessing.
Maybe it's a little late, but here is how I calculate the same thing. Looking at the documentation:
For example, if a dataset contains 100 positive and 300 negative examples of a single class, then pos_weight for the class should be equal to 300/100 = 3.
So an easy way to calculate the positive weight is to use tensor methods on your label vector "y" (in my case train_dataset.data.y) and then compute the total number of negative labels:
num_positives = torch.sum(train_dataset.data.y, dim=0)
num_negatives = len(train_dataset.data.y) - num_positives
pos_weight = num_negatives / num_positives
Then the weights can be used easily as:
criterion = torch.nn.BCEWithLogitsLoss(pos_weight = pos_weight)
PyTorch solution
Well, actually I have gone through the docs and you can indeed simply use pos_weight.
This argument gives a weight to the positive sample of each class, hence if you have 270 classes you should pass a torch.Tensor with shape (270,) defining the weight for each class.
Here is a marginally modified snippet from the documentation:
# 270 classes, batch size = 64
target = torch.ones([64, 270], dtype=torch.float32)
# Logits outputted from your network, no activation
output = torch.full([64, 270], 0.9)
# Weights, each being equal to one. You can input your own here.
pos_weight = torch.ones([270])
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
criterion(output, target) # -log(sigmoid(0.9))
Self-made solution
When it comes to per-class weighting, there is no built-in solution, but you can easily code one yourself:
import torch

class WeightedMultilabel(torch.nn.Module):
    def __init__(self, weights: torch.Tensor):
        super().__init__()
        # Keep per-element losses so they can be reweighted per class
        self.loss = torch.nn.BCEWithLogitsLoss(reduction='none')
        self.weights = weights.unsqueeze(0)  # shape (1, num_classes) for broadcasting

    def forward(self, outputs, targets):
        return (self.loss(outputs, targets) * self.weights).mean()
The weights tensor has to have the same length as the number of classes in your multilabel classification (270), each entry giving the weight for one class.
Calculating weights
You just sum the labels of every sample in your dataset, divide by the minimum value, and invert at the end. A sketch of the snippet:
weights = torch.zeros_like(dataset[0])
for element in dataset:
    weights += element

weights = 1 / (weights / torch.min(weights))
Using this approach, the class occurring the least gets a normal loss, while all others get weights smaller than 1.
It might cause some instability during training, though, so you may want to experiment with those values a little (maybe a log transform instead of a linear one?).
Other approach
You may think about upsampling/downsampling (though this operation is complicated, as you would add/delete samples of other classes as well, so I think advanced heuristics would be needed).
Just to provide a quick revision of @crypdick's answer, this implementation of the function worked for me:
import numpy as np
import torch

def calculate_pos_weights(class_counts, data):
    pos_weights = np.ones_like(class_counts)
    neg_counts = [len(data) - pos_count for pos_count in class_counts]
    for cdx, (pos_count, neg_count) in enumerate(zip(class_counts, neg_counts)):
        pos_weights[cdx] = neg_count / (pos_count + 1e-5)
    return torch.as_tensor(pos_weights, dtype=torch.float)
Where data is the dataset you're trying to apply weights to.