Does a `DataLoader` created from `ConcatDataset` create a batch from a different files, or a single file? - pytorch

I am working with multiple files, and multiple training samples in each file. I will use ConcatDataset as described here:
https://discuss.pytorch.org/t/dataloaders-multiple-files-and-multiple-rows-per-column-with-lazy-evaluation/11769/7
I need to have negative samples in addition to my true samples, and I need my negative samples to be randomly selected from all the training data files. So, I am wondering, would the returned batch samples just be a random consecutive chuck from a single file, or would be batch span across multiple random indexes across all the datafiles?
If there are more details needed about what I am trying to do exactly, it's because I am trying to train over a TPU with Pytorch XLA.
Normally for negative samples, I would just use a 2nd DataSet and DataLoader, however, I am trying to train over TPUs with Pytorch XLA (alpha was just released a few days ago https://github.com/pytorch/xla ), and to do that I need to send my DataLoader to a torch_xla.distributed.data_parallel.DataParallel object, like model_parallel(train_loop_fn, train_loader) which can be seen in these example notebooks
https://github.com/pytorch/xla/blob/master/contrib/colab/resnet18-training-xrt-1-15.ipynb
https://github.com/pytorch/xla/blob/master/contrib/colab/mnist-training-xrt-1-15.ipynb
So, I am now limited to a single DataLoader, which will need to handle both the true samples, and negative samples that need to be randomly selected from all my files.

ConcatDataset is a custom class that is subclassed from torch.utils.data.Dataset. Let's take a look at one example.
class ConcatDataset(torch.utils.data.Dataset):
def __init__(self, *datasets):
self.datasets = datasets
def __getitem__(self, i):
return tuple(d[i] for d in self.datasets)
def __len__(self):
return min(len(d) for d in self.datasets)
train_loader = torch.utils.data.DataLoader(
ConcatDataset(
dataset1,
dataset2
),
batch_size=args.batch_size,
shuffle=True,
num_workers=args.workers,
pin_memory=True)
for i, (input, target) in enumerate(train_loader):
...
Here, two datasets namely dataset1 (a list of examples) and dataset2 are combined to form a single training dataset. The __getitem__ function returns one example from the dataset and will be used by the BatchSampler to form the training mini-batches.
Would the returned batch samples just be a random consecutive chuck from a single file, or would be batch span across multiple random indexes across all the datafiles?
Since you have combined all your data files to form one dataset, now it depends on what BatchSampler do you use to sample mini-batches. There are several samplers implemented in PyTorch, for example, RandomSampler, SequentialSampler, SubsetRandomSampler, WeightedRandomSampler. See their usage in the documentation.
You can have your custom BatchSampler too as follows.
class MyBatchSampler(Sampler):
def __init__(self, *params):
# write your code here
def __iter__(self):
# write your code here
# return an iterable
def __len__(self):
# return the size of the dataset
The __iter__ function should return an iterable of mini-batches. You can implement your logic of forming mini-batches in this function.
To randomly sample negative examples for training, one alternative could be to pick negative examples for each positive example in the __init__ function of the ConcatDataset class.

Related

Subclass of PyTorch DataLoader for changing batch output

I'm interested in a way of applying a transform to a batch generated by a PyTorch DataLoader class. My minimal example is something like this:
class CustomLoader(torch.utils.data.DataLoader):
def __iter__(self):
result = super().__iter__()
return some_function(result)
But this errors since the DataLoader.__iter()__ returns _MultiProcessingDataLoaderIter or _SingleProcessingDataLoaderIter. Weirdly though, directly returning the output does return a Tensor, so any explanation there would be greatly appreciated!
I understand that in general, transform to data should be done in the subclassed Dataset class. However, in my case the data is tabular and the transform is via numpy, and doing it on a sample-wise basis is much slower (5x) than doing it on an entire batch, since surely these operations are vectorized under the hood.
I know I can do something simple like
for X, y in loader:
X = some_function(X)
But I'd also like to use the DataLoader with pytorch-lightning, so this isn't an option.
What is the proper way to subclass PyTorch Dataloaders?
__iter__() is a generator. You will need to yield the result instead of returning it. You can read more about generators here
Regarding your problem to apply a transform to a batch, you can create a custom Dataset instead of DataLoader and then apply the transforms.
class MyDataset(Dataset):
def __init__(self, transforms=None):
super().__init__()
self.data = ... # define your data here
self.transforms = transforms
def __getitem__(self, idx):
x = self.data[idx]
if self.transforms: x = self.transforms(x)
return x
# use your `MyDataset` class for creating your dataloader
dataloader = DataLoader(MyDataset(transforms = CustomTransforms(), batch_size=4)
You can use this dataloader with PyTorch Lightning Trainer as well.
If you are using PyTorch Lightning, I would suggest you to join our Slack channel and ask questions on Github Discussions as well.
Thanks :)
EDIT: (Add transforms to Batch)
If you are using PyTorch Lightning then I would recommend to use LightningDataModule which provides on_before_batch_transfer hook that can be used to apply transforms on a batch ;)
Here is an example:
def on_before_batch_transfer(self, batch, dataloader_idx):
batch['x'] = transforms(batch['x'])
return batch
Checkout the documentation for more

How to load data from multiply datasets in pytorch

I have two datasets of images - indoors and outdoors, they don't have the same number of examples.
Each dataset has images that contain a certain number of classes (minimum 1 maximum 4), these classes can appear in both datasets, and each class has 4 categories - red, blue, green, white.
Example:
Indoor - cats, dogs, horses
Outdoor - dogs, humans
I am trying to train a model, where I tell it, "here is an image that contains a cat, tell me it's color" regardless of where it was taken (Indoors, outdoors, In a car, on the moon)
To do that,
I need to present my model examples so that every batch has only one category (cat, dog, horse or human), but I want to sample from all datasets (two in this case) that contains these objects and mix them. How can I do this?
It has to take into account that the number of examples in each dataset is different, and that some categories appear in one dataset where others can appear in more than one.
and each batch must contain only one category.
I would appreciate any help, I have been trying to solve this for a few days now.
Assuming the question is:
Combine 2+ data sets with potentially overlapping categories of objects (distinguishable by label)
Each object has 4 "subcategories" for each color (distinguishable by label)
Each batch should only contain a single object category
The first step will be to ensure consistency of the object labels from both data sets, if not already consistent. For example, if the dog class is label 0 in the first data set but label 2 in the second data set, then we need to make sure the two dog categories are correctly merged. We can do this "translation" with a simple data set wrapper:
class TranslatedDataset(Dataset):
"""
Args:
dataset: The original dataset.
translate_label: A lambda (function) that maps the original
dataset label to the label it should have in the combined data set
"""
def __init__(self, dataset, translate_label):
super().__init__()
self._dataset = dataset
self._translate_label = translate_label
def __len__(self):
return len(self._dataset)
def __getitem__(self, idx):
inputs, target = self._dataset[idx]
return inputs, self._translate_label(target)
The next step is combining the translated data sets together, which can be done easily with a ConcatDataset:
first_original_dataset = ...
second_original_dataset = ...
first_translated = TranslateDataset(
first_original_dataset,
lambda y: 0 if y is 2 else 2 if y is 0 else y, # or similar
)
second_translated = TranslateDataset(
second_original_dataset,
lambda y: y, # or similar
)
combined = ConcatDataset([first_translated, second_translated])
Finally, we need to restrict batch sampling to the same class, which is possible with a custom Sampler when creating the data loader.
class SingleClassSampler(torch.utils.data.Sampler):
def __init__(self, dataset, batch_size):
super().__init__()
# We need to create sequential groups
# with batch_size elements from the same class
indices_for_target = {} # dict to store a list of indices for each target
for i, (_, target) in enumerate(dataset):
# converting to string since Tensors hash by reference, not value
str_targ = str(target)
if str_targ not in indices_for_target:
indices_for_target[str_targ] = []
indices_for_target[str_targ] += [i]
# make sure we have a whole number of batches for each class
trimmed = {
k: v[:-(len(v) % batch_size)]
for k, v in indices_for_target.items()
}
# concatenate the lists of indices for each class
self._indices = sum(list(trimmed.values()))
def __len__(self):
return len(self._indices)
def __iter__(self):
yield from self._indices
Then to use the sampler:
loader = DataLoader(
combined,
sampler=SingleClassSampler(combined, 64),
batch_size=64,
shuffle=True
)
I haven't run this code, so it might not be exactly right, but hopefully it will put you on the right track.
torch.utils.data Docs

PyTorch: while loading batched data using Dataloader, how to transfer the data to GPU automatically

If we use a combination of the Dataset and Dataloader classes (as shown below), I have to explicitly load the data onto the GPU using .to() or .cuda(). Is there a way to instruct the dataloader to do it automatically/implicitly?
Code to understand/reproduce the scenario:
from torch.utils.data import Dataset, DataLoader
import numpy as np
class DemoData(Dataset):
def __init__(self, limit):
super(DemoData, self).__init__()
self.data = np.arange(limit)
def __len__(self):
return self.data.shape[0]
def __getitem__(self, idx):
return (self.data[idx], self.data[idx]*100)
demo = DemoData(100)
loader = DataLoader(demo, batch_size=50, shuffle=True)
for i, (i1, i2) in enumerate(loader):
print('Batch Index: {}'.format(i))
print('Shape of data item 1: {}; shape of data item 2: {}'.format(i1.shape, i2.shape))
# i1, i2 = i1.to('cuda:0'), i2.to('cuda:0')
print('Device of data item 1: {}; device of data item 2: {}\n'.format(i1.device, i2.device))
Which will output the following; note - without explicit device transfer instruction, the data is loaded onto CPU:
Batch Index: 0
Shape of data item 1: torch.Size([50]); shape of data item 2: torch.Size([50])
Device of data item 1: cpu; device of data item 2: cpu
Batch Index: 1
Shape of data item 1: torch.Size([50]); shape of data item 2: torch.Size([50])
Device of data item 1: cpu; device of data item 2: cpu
A possible solution is at this PyTorch GitHub repo. Issue(still open at the time this question was posted), but, I am unable to make it to work when the dataloader has to return multiple data-items!
You can modify the collate_fn to handle several items at once:
from torch.utils.data.dataloader import default_collate
device = torch.device('cuda:0') # or whatever device/cpu you like
# the new collate function is quite generic
loader = DataLoader(demo, batch_size=50, shuffle=True,
collate_fn=lambda x: tuple(x_.to(device) for x_ in default_collate(x)))
Note that if you want to have multiple workers for the dataloader, you'll need to add
torch.multiprocessing.set_start_method('spawn')
after your if __name__ == '__main__' (see this issue).
Having said that, it seems like using pin_memory=True in your DataLoader would be much more efficient. Have you tried this option?
See memory pinning for more information.
Update (Feb 8th, 2021)
This post made me look at my "data-to-model" time spent during training.
I compared three alternatives:
DataLoader works on CPU and only after the batch is retrieved data is moved to GPU.
Same as (1) but with pin_memory=True in DataLoader.
The proposed method of using collate_fn to move data to GPU.
From my limited experimentation it seems like the second option performs best (but not by a big margin).
The third option required fussing about the start_method of the data loader processes, and it seems to incur an overhead at the beginning of each epoch.

How to deal with imbalanced classes in Keras

I am working on a multi-label image classification problem with Keras and so I utilize functions flow_from_dataframe() and fit_generator().
I have about 2000 classes and as you can guess they are highly skewed / imbalanced. After searching a bit I came across with arguments class_weight and classes and I decided to give them a try. My problem is, I am not sure if I use them correctly. Here is an example:
Let's assume that I have flatten all class occurrences so that I get the following list of (duplicated) labels:
labels = ['classD', 'classA', 'classA', 'classC', 'classD', 'classD']
And this is the function that computes classes and class_weight:
from collections import Counter
def get_classes_weights(l, n):
counter = Counter(l).most_common(n)
classes = [cls for cls, ocu in counter]
majority = max([ocu for cls, ocu in counter])
weights = {idx: float(majority/ocu) for idx, (cls, ocu) in enumerate(counter)}
return classes, weights
Let's also assume that I what to consider the top-2 classes only:
classes, class_weight = get_classes_weights(labels, 2)
This gives:
classes: ['classD', 'classA']
and:
class_weight: {0: 1.0, 1: 1.5}
And finally, this is how I use them within the functions:
generator_train.flow_from_dataframe(
classes=classes,
)
model.fit_generator(
class_weight=class_weight
)
So my question are:
Is the above the right way to apply weights given that I work on a multi-label image classification problem?
Does my validation set need to be balanced or it is OK if it has been taken from the same distribution as the training set (20% and 80% random selection, respectively)?

Keras: Get True labels (y_test) from ImageDataGenerator or predict_generator

I am using ImageDataGenerator().flow_from_directory(...) to generate batches of data from directories.
After the model builds successfully I'd like to get a two column array of True and Predicted class labels. With model.predict_generator(validation_generator, steps=NUM_STEPS) I can get a numpy array of predicted classes. Is it possible to have the predict_generator output the corresponding True class labels?
To add: validation_generator.classes does print the True labels but in the order that they are retrieved from the directory, it doesn't take into account the batching or sample expansion by augmentation.
You can get the prediction labels by:
y_pred = numpy.rint(predictions)
and you can get the true labels by:
y_true = validation_generator.classes
You should set shuffle=False in the validation generator before this.
Finally, you can print confusion matrix by
print confusion_matrix(y_true, y_pred)
There's another, slightly "hackier" way, of retrieving the true labels as well. Note that this approach can handle when setting shuffle=True in your generator (it's generally speaking a good idea to shuffle your data - either if you do this manually where you've stored the data, or through the generator, which is probably easier). You will need your model for this approach though.
# Create lists for storing the predictions and labels
predictions = []
labels = []
# Get the total number of labels in generator
# (i.e. the length of the dataset where the generator generates batches from)
n = len(generator.labels)
# Loop over the generator
for data, label in generator:
# Make predictions on data using the model. Store the results.
predictions.extend(model.predict(data).flatten())
# Store corresponding labels
labels.extend(label)
# We have to break out from the generator when we've processed
# the entire once (otherwise we would end up with duplicates).
if (len(label) < generator.batch_size) and (len(predictions) == n):
break
Your predictions and corresponding labels should now be stored in predictions and labels, respectively.
Lastly, remember that we should NOT add data augmentation on the validation and test sets/generators.
Using np.rint() method will get one hot coding result like [1., 0., 0.] which once I've tried creating a confusion matrix with confusion_matrix(y_true, y_pred) it caused error. Because validation_generator.classes returns class labels as a single number.
In order to get a class number for example 0, 1, 2 as class label specified, I have found the selected answer in this topic useful. here
You should try this to resolve the class probabilities and convert it to a single class based on the score.
if Y_preds.ndim !=1:
Y_preds = np.argmax(Y_preds, axis=1)

Resources