What is the range of random_state in train_test_split - scikit-learn

I have a dataset with 300 observations, and I am using train_test_split with 75% as training data and 25% as test data.
I got an accuracy of 90% for random_state = 2.
For random_state = 138, the accuracy was 92%.
If I keep increasing random_state, at some point I get 96% to 100%.
What is the valid range of random_state?

According to the documentation:
Integer values must be in the range [0, 2**32 - 1]
In other words, 0 to 4294967295.
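There is no meaningful relationship between the value of the random seed and the quality of your model. The seed only determines which rows end up in the train and test sets, so the differences in accuracy you see are sampling noise, which is large with a test set of only 75 observations. Don't treat it like a hyperparameter.
It's a good idea to read section 10.3 of the User Guide, which also explains how you can control the random number generation with more nuance.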
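To see how much of the accuracy difference comes from the split alone, one option is to repeat the split over several seeds and look at the spread. The following is only a sketch: the synthetic data from make_classification and the LogisticRegression model are stand-ins (assumptions), not the asker's dataset or model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 300-observation dataset (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

scores = []
for seed in range(20):
    # Same data and same model every time; only the split changes.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

scores = np.array(scores)
print(f"min={scores.min():.3f} max={scores.max():.3f} "
      f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

The mean and standard deviation over seeds describe the model better than the single best seed does.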

Related

Weighted random sampler - oversample or undersample?

Problem
I am training a deep learning model in PyTorch for binary classification, and I have a dataset containing unbalanced class proportions. My minority class makes up about 10% of the given observations. To avoid the model learning to just predict the majority class, I want to use the WeightedRandomSampler from torch.utils.data in my DataLoader.
Let's say I have 1000 observations (900 in class 0, 100 in class 1), and a batch size of 100 for my dataloader.
Without weighted random sampling, I would expect each training epoch to consist of 10 batches.
Questions
Will only 10 batches be sampled per epoch when using this sampler - and consequently, would the model 'miss' a large portion of the majority class during each epoch, since the minority class is now overrepresented in the training batches?
Will using the sampler result in more than 10 batches being sampled per epoch (meaning the same minority class observations may appear many times, and also that training would slow down)?

A small snippet of code to use WeightedRandomSampler
First, define the function:
def make_weights_for_balanced_classes(images, nclasses):
    n_images = len(images)
    count_per_class = [0] * nclasses
    for _, image_class in images:
        count_per_class[image_class] += 1
    weight_per_class = [0.] * nclasses
    for i in range(nclasses):
        weight_per_class[i] = float(n_images) / float(count_per_class[i])
    weights = [0] * n_images
    for idx, (image, image_class) in enumerate(images):
        weights[idx] = weight_per_class[image_class]
    return weights
And after this, use it in the following way:
import torch
from torchvision import datasets
dataset_train = datasets.ImageFolder(traindir)
# For unbalanced dataset we create a weighted sampler
weights = make_weights_for_balanced_classes(dataset_train.imgs, len(dataset_train.classes))
weights = torch.DoubleTensor(weights)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))
# Leave shuffle unset: DataLoader raises a ValueError if shuffle=True is combined with a sampler
train_loader = torch.utils.data.DataLoader(dataset_train, batch_size=args.batch_size,
    sampler=sampler, num_workers=args.workers, pin_memory=True)

It depends on what you're after; check the torch.utils.data.WeightedRandomSampler documentation for details.
The num_samples argument specifies how many samples are drawn per epoch when the sampler is used with torch.utils.data.DataLoader (assuming you weighted them correctly):
If you set it to len(dataset) you will get the first case (10 batches per epoch)
If you set it to 1800 (in your case) you will get the second case (18 batches per epoch)
Will only 10 batches be sampled per epoch when using this sampler - and consequently, would the model 'miss' a large portion of the majority class during each epoch [...]
Yes, but a new set of samples is drawn in the next epoch, so majority-class observations missed in one epoch can appear in later ones.
Will using the sampler result in more than 10 batches being sampled per epoch (meaning the same minority class observations may appear many times, and also that training would slow down)?
Training would not slow down overall: each epoch would take longer, but convergence should take approximately the same time (fewer epochs will be necessary because each one contains more data).
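The effect of the two num_samples settings can be checked on a toy dataset with the same class proportions as in the question. This is a sketch under assumptions: the random features, the inverse-frequency weights, and the helper name make_loader are illustrative and not taken from the question.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy data matching the question: 900 samples of class 0, 100 of class 1.
labels = torch.cat([torch.zeros(900, dtype=torch.long),
                    torch.ones(100, dtype=torch.long)])
features = torch.randn(1000, 4)  # placeholder features (an assumption)
dataset = TensorDataset(features, labels)

# One weight per sample: the inverse frequency of that sample's class.
class_counts = torch.bincount(labels)
sample_weights = (1.0 / class_counts.double())[labels]

def make_loader(num_samples):
    # num_samples sets how many indices the sampler draws per epoch.
    sampler = WeightedRandomSampler(sample_weights, num_samples=num_samples,
                                    replacement=True)
    return DataLoader(dataset, batch_size=100, sampler=sampler)

loader_same_size = make_loader(len(dataset))
loader_oversampled = make_loader(1800)
print(len(loader_same_size))    # 10
print(len(loader_oversampled))  # 18

# Share of minority-class samples actually drawn in one epoch.
drawn_labels = torch.cat([batch_labels for _, batch_labels in loader_same_size])
minority_fraction = drawn_labels.float().mean().item()
print(f"minority fraction in one epoch: {minority_fraction:.2f}")
```

With these weights the minority fraction should come out close to one half, which is why a 1000-sample epoch cannot cover all 900 majority observations.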

Can we send same datapoints in same epoch?

If we set steps_per_epoch (when training with an ImageDataGenerator) higher than the total number of possible batches (total_samples / batch_size), will the model revisit the same data points from the start, or will it ignore the extra steps?
Ex:
Flattened image shape which will go to Dense layer: (2000*1)
batch size: 20
Total no of batches possible: 100 (2000/20)
steps per epoch: 1000 (set explicitly)

As far as I know, steps_per_epoch is independent of the 'real' epoch (one full pass over the data, which takes number_of_inputs/batch_size steps). Let's use an example similar to yours, with 2000 data points and a batch_size of 20 (which means 2000/20 = 100 steps for one 'real' epoch):
If you set steps_per_epoch = 1000: Keras asks for a loop of 1000 batches, which basically means 10 'real' epochs (the whole dataset is traversed 10 times).
If you set steps_per_epoch = 50: Keras asks for a loop of 50 batches, and the remaining 50 batches of the 'real' epoch are visited in the next loop.
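The arithmetic above can be reproduced with a small pure-Python simulation. It is only a sketch: endless_batches is a hypothetical stand-in for a generator that wraps around at the end of the data, and unlike a real Keras generator it does not shuffle.

```python
from collections import Counter
from itertools import islice

def endless_batches(n_samples, batch_size):
    """Yield batches of sample indices forever, wrapping around at the end of the data."""
    while True:
        for start in range(0, n_samples, batch_size):
            yield list(range(start, min(start + batch_size, n_samples)))

def visits_in_one_epoch(n_samples, batch_size, steps_per_epoch):
    """Count how often each sample index is seen in one Keras-style 'epoch'."""
    counts = Counter()
    for batch in islice(endless_batches(n_samples, batch_size), steps_per_epoch):
        counts.update(batch)
    return counts

visits_1000 = visits_in_one_epoch(2000, 20, steps_per_epoch=1000)
visits_50 = visits_in_one_epoch(2000, 20, steps_per_epoch=50)

print(set(visits_1000.values()))  # {10}: every sample is seen 10 times
print(len(visits_50))             # 1000: only half of the samples are seen
```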

Found input variables with inconsistent numbers of samples: [799996, 199999]

I am splitting a single DataFrame, so why am I getting inconsistent numbers of samples in X_train and X_test (if that is what the error means)?
X_train, X_test = train_test_split(df[categorical_cols+ numeric_cols], test_size=0.2, random_state=4)
regression = LinearRegression().fit(X_train, X_test)
regression.score(X)

In your example, train_test_split will do something roughly equivalent to the following:
Shuffle the records
Put 20% of them (rounded up) in the test set
Put the rest in the training set
The two outputs therefore have different lengths by design: here 799996 training rows and 199999 test rows. The error is raised by fit(X_train, X_test), which treats X_test as the target y, and X and y must have the same number of samples. Pass the target column to train_test_split together with the features, then fit on X_train and y_train.
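A minimal sketch of splitting features and target together, so that fit and score receive arrays of matching length. The synthetic X and y are assumptions standing in for the DataFrame columns in the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption): 1000 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# Passing X and y together keeps features and targets aligned in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)
print(X_train.shape, y_train.shape)  # (800, 3) (800,)
print(X_test.shape, y_test.shape)    # (200, 3) (200,)

regression = LinearRegression().fit(X_train, y_train)
r2 = regression.score(X_test, y_test)  # score needs both features and targets
print(f"R^2 on the test set: {r2:.3f}")
```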

Nan loss in keras with triplet loss

I'm trying to learn an embedding for Paris6k images by combining VGG with Adrian Ung's triplet loss. The problem is that after a small number of iterations, still in the first epoch, the loss becomes NaN, and then the accuracy and validation accuracy grow to 1.
I've already tried lowering the learning rate, increasing the batch size (only to 16 because of memory), changing the optimizer (Adam and RMSprop), checking whether there are None values in my dataset, changing the data format from 'float32' to 'float64', adding a small bias to the values, and simplifying the model.
Here is my code:
base_model = VGG16(include_top = False, input_shape = (512, 384, 3))
input_images = base_model.input
input_labels = Input(shape=(1,), name='input_label')
embeddings = Flatten()(base_model.output)
labels_plus_embeddings = concatenate([input_labels, embeddings])
model = Model(inputs=[input_images, input_labels], outputs=labels_plus_embeddings)
batch_size = 16
epochs = 2
embedding_size = 64
opt = Adam(lr=0.0001)
model.compile(loss=tl.triplet_loss_adapted_from_tf, optimizer=opt, metrics=['accuracy'])
label_list = np.vstack(label_list)
x_train = image_list[:2500]
x_val = image_list[2500:]
y_train = label_list[:2500]
y_val = label_list[2500:]
dummy_gt_train = np.zeros((len(x_train), embedding_size + 1))
dummy_gt_val = np.zeros((len(x_val), embedding_size + 1))
H = model.fit(
    x=[x_train, y_train],
    y=dummy_gt_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=([x_val, y_val], dummy_gt_val),
    callbacks=callbacks_list)
There are 3366 images, with values scaled to the range [0, 1].
The network takes dummy target values because it tries to learn embeddings such that images of the same class are a small distance apart while images of different classes are far apart, so the real class label is passed in as part of the training input.
I noticed that I was previously splitting the classes incorrectly (and keeping images that should have been discarded), and at that time I did not have the NaN loss problem.
What should I try to do?
Thanks in advance, and sorry for my English.

In some cases, the random NaN loss can be caused by your data: if there are no positive pairs in a batch, you will get a NaN loss.
As you can see in Adrian Ung's notebook (or in the TensorFlow Addons triplet loss; it's the same code):
semi_hard_triplet_loss_distance = math_ops.truediv(
    math_ops.reduce_sum(
        math_ops.maximum(
            math_ops.multiply(loss_mat, mask_positives), 0.0)),
    num_positives,
    name='triplet_semihard_loss')
There is a division by the number of positive pairs (num_positives), which leads to NaN when that number is zero.
I suggest you inspect your data pipeline to ensure there is at least one positive pair in each of your batches. (You can, for example, adapt some of the code in triplet_loss_adapted_from_tf to get the num_positives of your batch and check that it is greater than 0.)
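One way to run such a check outside the loss function is to count same-label pairs directly from the label array of each batch. This is a NumPy sketch, not the code from the notebook: count_positive_pairs is a hypothetical helper, and it counts ordered pairs (i, j) with i != j, which is an assumption about how the pairs are tallied.

```python
import numpy as np

def count_positive_pairs(labels):
    """Count ordered pairs (i, j), i != j, that share a label within one batch."""
    labels = np.asarray(labels).reshape(-1)
    same_label = labels[:, None] == labels[None, :]
    np.fill_diagonal(same_label, False)  # a sample does not form a pair with itself
    return int(same_label.sum())

batch_without_positives = [3, 7, 1, 9]
batch_with_positives = [3, 7, 3, 9]
print(count_positive_pairs(batch_without_positives))  # 0
print(count_positive_pairs(batch_with_positives))     # 2

# With zero positive pairs the summed loss is 0 as well, so the division is 0 / 0.
with np.errstate(invalid="ignore"):
    loss_value = np.divide(np.float32(0.0), np.float32(0.0))
print(np.isnan(loss_value))  # True
```

Running the counter over every batch before training shows how often a batch has no positives at the current batch size.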

Try increasing your batch size. This happened to me too. As mentioned in the previous answer, the loss cannot find any positive pairs in the batch. I had 250 classes and was getting a NaN loss initially. I increased the batch size to 128/256 and then there was no issue.
From what I saw, Paris6k has 12 or 15 classes. Increase your batch size to 32, and if you run out of GPU memory, try a model with fewer parameters. You could start with EfficientNet-B0, which has 5.3M parameters compared to 138M for VGG16.

I have implemented a package for triplet generation so that every batch is guaranteed to include positive pairs. It is compatible with TF/Keras only.
https://github.com/ma7555/kerasgen (Disclaimer: I am the owner)

How to calculate unbalanced weights for BCEWithLogitsLoss in pytorch

I am trying to solve a multilabel problem with 270 labels, and I have converted the target labels into one-hot encoded form. I am using BCEWithLogitsLoss(). Since the training data is unbalanced, I am using the pos_weight argument, but I am a bit confused.
pos_weight (Tensor, optional) – a weight of positive examples. Must be a vector with length equal to the number of classes.
Do I need to pass the total count of positive values for each label as a tensor, or do they mean something else by weights?

The PyTorch documentation for BCEWithLogitsLoss recommends setting pos_weight to the ratio between the negative counts and the positive counts for each class.
So, if len(dataset) is 1000 and element 0 of your multi-hot encoding has 100 positive counts, then element 0 of the pos_weight vector should be 900/100 = 9. That means the binary cross-entropy loss will behave as if the dataset contained 900 positive examples instead of 100.
Here is my implementation:
(new, based on this post)
# y is the (num_samples, num_classes) multi-hot label tensor; reducing over dim 0 gives one weight per class
pos_weight = (y == 0.).sum(dim=0) / y.sum(dim=0)
(original)
import numpy as np
import torch
def calculate_pos_weights(class_counts, data):
    # float dtype so the ratios are not truncated when class_counts holds integers
    pos_weights = np.ones_like(class_counts, dtype=float)
    neg_counts = [len(data) - pos_count for pos_count in class_counts]
    for cdx, (pos_count, neg_count) in enumerate(zip(class_counts, neg_counts)):
        pos_weights[cdx] = neg_count / (pos_count + 1e-5)
    return torch.as_tensor(pos_weights, dtype=torch.float)
Where class_counts is just a column-wise sum of the positive samples and data is the dataset. I posted it on the PyTorch forum and one of the PyTorch devs gave it his blessing.
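A small worked example may help to see both the per-class ratio and what pos_weight does to the loss. It is a sketch on made-up labels (10 samples, 3 classes, an assumption), not the 270-label dataset from the question.

```python
import torch

# Hypothetical multi-hot labels: 10 samples, 3 classes.
y = torch.zeros(10, 3)
y[0, 0] = 1.0    # class 0: 1 positive, 9 negatives
y[:5, 1] = 1.0   # class 1: 5 positives, 5 negatives
y[:2, 2] = 1.0   # class 2: 2 positives, 8 negatives

num_positives = y.sum(dim=0)
num_negatives = y.shape[0] - num_positives
pos_weight = num_negatives / num_positives
print(pos_weight)  # tensor([9., 1., 4.])

# For a positive target, the weighted loss is pos_weight times the plain loss.
logits = torch.zeros(1, 3)
target = torch.ones(1, 3)
plain = torch.nn.BCEWithLogitsLoss(reduction="none")(logits, target)
weighted = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight, reduction="none")(logits, target)
print(torch.allclose(weighted, pos_weight * plain))  # True
```

Negative targets are left unscaled, so only the positive term of each class is multiplied by its weight.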

Maybe this is a little late, but here is how I calculate it. From the documentation:
For example, if a dataset contains 100 positive and 300 negative examples of a single class, then pos_weight for the class should be equal to 300/100 = 3.
So an easy way to calculate the positive weight is to use tensor methods on your label tensor y (in my case train_dataset.data.y): sum the positive labels per class, then derive the number of negative labels from the total.
num_positives = torch.sum(train_dataset.data.y, dim=0)
num_negatives = len(train_dataset.data.y) - num_positives
pos_weight = num_negatives / num_positives
Then the weights can be used easily as:
criterion = torch.nn.BCEWithLogitsLoss(pos_weight = pos_weight)

PyTorch solution
Well, actually, I have gone through the docs and you can indeed simply use pos_weight.
This argument gives a weight to the positive samples of each class, hence if you have 270 classes you should pass a torch.Tensor with shape (270,) defining the weight for each class.
Here is marginally modified snippet from documentation:
# 270 classes, batch size = 64
target = torch.ones([64, 270], dtype=torch.float32)
# Logits outputted from your network, no activation
output = torch.full([64, 270], 0.9)
# Weights, each being equal to one. You can input your own here.
pos_weight = torch.ones([270])
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
criterion(output, target) # -log(sigmoid(0.9))
Self-made solution
When it comes to weighting, there is no built-in solution, but you may code one yourself really easily:
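import torch
class WeightedMultilabel(torch.nn.Module):
    def __init__(self, weights: torch.Tensor):
        super().__init__()
        # reduction="none" keeps per-element losses so each class can be weighted
        self.loss = torch.nn.BCEWithLogitsLoss(reduction="none")
        self.weights = weights.unsqueeze(0)  # shape (1, num_classes)
    def forward(self, outputs, targets):
        # weight each class column, then average to a scalar loss
        return (self.loss(outputs, targets) * self.weights).mean()
The weights tensor has to have the same length as the number of classes in your multilabel classification (270), with each element giving the weight for that class.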
Calculating weights
You just add up the labels of every sample in your dataset, divide by the minimum value, and invert at the end.
A rough snippet:
# assumes each dataset element is a multi-hot label tensor
weights = torch.zeros_like(dataset[0])
for element in dataset:
    weights += element
weights = 1 / (weights / torch.min(weights))
With this approach, the class occurring the least gets the normal loss (weight 1), while the others get weights smaller than 1.
It might cause some instability during training, though, so you might want to experiment with those values a little (maybe a log transform instead of a linear one?).
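The weight calculation can be tried on a tiny label matrix to see the numbers it produces. This is a sketch on made-up labels; the log-damped variant is only one possible reading of the "log transform" idea and is an assumption, not something taken from the answer.

```python
import torch

# Hypothetical multi-hot labels: 4 samples, 3 classes.
labels = torch.tensor([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
])

counts = labels.sum(dim=0)             # positives per class: [4, 1, 2]
relative = counts / counts.min()       # frequency relative to the rarest class
linear_weights = 1 / relative          # the rarest class keeps weight 1
print(linear_weights)  # tensor([0.2500, 1.0000, 0.5000])

# Assumed log-damped variant: frequent classes are down-weighted less sharply.
log_weights = 1 / (1 + torch.log(relative))
print(log_weights)
```

Both variants keep the rarest class at weight 1 and differ only in how fast the weight falls off for frequent classes.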
Other approach
You may also consider upsampling/downsampling, though this is complicated in the multilabel case: adding or removing a sample changes the counts of every label it carries, so more advanced heuristics would be needed, I think.

Just to provide a quick revision of @crypdick's answer, this implementation of the function worked for me:
def calculate_pos_weights(class_counts, data):
    # float dtype so the ratios are not truncated when class_counts holds integers
    pos_weights = np.ones_like(class_counts, dtype=float)
    neg_counts = [len(data) - pos_count for pos_count in class_counts]
    for cdx, (pos_count, neg_count) in enumerate(zip(class_counts, neg_counts)):
        pos_weights[cdx] = neg_count / (pos_count + 1e-5)
    return torch.as_tensor(pos_weights, dtype=torch.float)
Where data is the dataset you're trying to apply weights to.
