Pytorch: multiple datasets with multiple losses

I am using multiple datasets. I have multiple losses, each of which must be evaluated on a subset of these datasets. I want to generate a batch from each dataset and evaluate each loss on all of its appropriate batches. Some of the losses are pairwise (they need to load pairs of corresponding datapoints) whereas others are computed on single datapoints. I need to design this in a way that makes it easy to add new datasets later. Is there any PyTorch built-in that would help with this? What is the best way to design this in PyTorch? Thanks in advance.

It's not clear from your question what exactly your settings are.
However, you can have multiple Dataset instances, one for each of your datasets.
On top of your datasets, you can implement a "tagged dataset": a dataset that adds a "tag" to all of its samples:
import torch.utils.data as data

class TaggedDataset(data.Dataset):
    """Wraps an existing dataset and attaches a fixed tag to every sample."""
    def __init__(self, dataset, tag):
        super(TaggedDataset, self).__init__()
        self.ds_ = dataset
        self.tag_ = tag

    def __len__(self):
        return len(self.ds_)

    def __getitem__(self, index):
        return self.ds_[index], self.tag_
Give a different tag to each dataset, concat all of them into a single ConcatDataset, and wrap a regular DataLoader around it.
Now, in your training code:
for (input, label), tag in my_tagged_loader:
    # process each batch according to the dataset tags it carries
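A minimal sketch of wiring this together: dataset_a, dataset_b, model, loss_a, and loss_b are placeholder names, not from the original question, and the assumption is that each underlying dataset returns an (input, label) pair.
import torch
from torch.utils.data import ConcatDataset, DataLoader

# Tag each dataset, concatenate them, and wrap one DataLoader around the result.
tagged = ConcatDataset([
    TaggedDataset(dataset_a, tag=0),
    TaggedDataset(dataset_b, tag=1),
])
my_tagged_loader = DataLoader(tagged, batch_size=32, shuffle=True)

for (input, label), tag in my_tagged_loader:
    # `tag` is a 1-D tensor with one tag per sample in the batch;
    # boolean masks route each sample to the appropriate loss.
    loss = 0.0
    mask_a = tag == 0
    if mask_a.any():
        loss = loss + loss_a(model(input[mask_a]), label[mask_a])
    mask_b = tag == 1
    if mask_b.any():
        loss = loss + loss_b(model(input[mask_b]), label[mask_b])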

Related

Writing a custom pytorch dataloader iter with pre-processing on batch

A typical custom PyTorch Dataset looks like this,
class TorchCustomDataset(torch.utils.data.Dataset):
    def __init__(self, filenames, speech_labels):
        pass

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        return 1, 0
Here, with __getitem__ I can read any file, and apply any pre-processing for that specific file.
What if I want to apply some tensor-level pre-processing to the whole batch of data? Technically, it's possible to just iterate through the data loader to get the batch sample and apply the pre-processing on it.
But how to do it with a custom data loader? In short, what will be the __getitem__ equivalent for data loader to apply some operation on the whole batch of data?
You can override the collate_fn of the DataLoader: this function takes the individual items from the underlying Dataset and forms the batch, so you can add your custom batch-level pre-processing at that point by supplying your own collate_fn.
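A minimal sketch of such a collate_fn, assuming __getitem__ returns (feature, label) pairs; the normalization step is just an illustrative stand-in for whatever tensor-level pre-processing you need, and the batch size is arbitrary.
import torch
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate

def my_collate(items):
    # `items` is the list of samples returned by __getitem__ for this batch.
    features, labels = default_collate(items)   # stack the samples into batch tensors first
    features = features.float()
    # Example batch-level operation: normalize over the whole batch.
    features = (features - features.mean()) / (features.std() + 1e-8)
    return features, labels

loader = DataLoader(TorchCustomDataset(filenames, speech_labels),
                    batch_size=16, collate_fn=my_collate)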

Pytorch: How to get the first N item from dataloader

There are 3000 pictures in my list, but I only want the first N of them, like 1000, for training.
I wonder how I can achieve this by changing the loop code:
for i, (image, label) in enumerate(train_loader):
You could slice the enumerated loader:
for i, (image, label) in list(enumerate(train_loader))[:1000]:
This is not a good way to partition training and validation data, though.
First, the DataLoader class supports lazy loading (examples are not loaded into memory until needed), whereas casting to a list requires all data to be loaded into memory, likely triggering an out-of-memory error. Second, this may not always return the same 1000 elements if the DataLoader shuffles. In general, the DataLoader class does not support indexing, so it is not really suitable for selecting a specific subset of your dataset; casting to a list works around this, but at the expense of the useful properties of the DataLoader class.
Best practice is to use a separate Dataset object for the training and validation partitions, or at least to partition the data at the Dataset level rather than relying on stopping the training after the first 1000 examples. Then create a separate DataLoader for each partition.
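A minimal sketch of that approach, assuming `dataset` is the map-style Dataset behind train_loader (the batch sizes are illustrative):
import torch
from torch.utils.data import DataLoader, Subset, random_split

# Option 1: take exactly the first 1000 examples for training, the rest for validation.
train_set = Subset(dataset, range(1000))
valid_set = Subset(dataset, range(1000, len(dataset)))

# Option 2: a random 1000 / remainder split instead.
# train_set, valid_set = random_split(dataset, [1000, len(dataset) - 1000])

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=32, shuffle=False)

for image, label in train_loader:   # now iterates over the 1000-example partition only
    ...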

Parallelize a Random Forest prediction

I am trying to code a specific version of a Random Forest and to make both the training and the prediction computations parallel using joblib Parallel.
Suppose that I have written a TreeEstimator with .fit and .predict methods. The .fit method outputs the construction of the tree and the .predict method outputs an array of values. My code then looks like this:
class RandomForest:
    def __init__(self):
        ...

    def fit(self, dataset):
        self.roots = Parallel(n_jobs=self.n_jobs)(
            delayed(tree.fit)(dataset) for tree in self.trees)

    def predict(self, dataset):
        preds = Parallel(n_jobs=self.n_jobs)(
            delayed(self.parallel_pred)(row) for row in dataset)
        return np.array(preds)

    def parallel_pred(self, row):
        pred = Parallel(n_jobs=self.n_jobs)(
            delayed(tree.predict_row)(self.roots[i], row)
            for i, tree in enumerate(self.trees))
        return pred
The .fit method works just fine: all my CPUs are used at a good percentage. However, the .predict method seems to do its computations mainly on one CPU, with a few other CPUs occasionally used at ~2-5% at most. I have also tried a for loop over all the rows of the dataset, applying the .parallel_pred method to each row; it does not help, i.e. it still runs mostly on one CPU. I have also tried
Parallel()(delayed(tree_preds)(row) for row in dataset), with the tree_preds function being a for loop over all trees that collects each prediction one by one, but it still runs mostly on one CPU.
To sum up, I want one loop over my test dataset and one loop over my tree estimators, to get, for each row of the dataset, the predictions of all trees, and I would like to do this in a parallelizable way.
I use Python 3.+ with Ubuntu 20.+.
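One common restructuring (a sketch, not an answer from the original thread) is to parallelize over the trees with one coarse-grained task per tree that predicts the whole dataset, instead of spawning a nested Parallel pool per row: nested joblib pools usually fall back to sequential execution, and per-row tasks are too small to amortize the dispatch overhead. The sketch assumes each tree can predict a whole dataset given its root, via a hypothetical predict_all helper that is not part of the code above.
import numpy as np
from joblib import Parallel, delayed

def predict_forest(forest, dataset):
    # One task per tree: each worker walks the whole dataset down a single tree.
    # `predict_all` is a hypothetical per-tree helper that loops over the rows
    # internally; replace it with whatever whole-dataset predict your trees expose.
    per_tree = Parallel(n_jobs=forest.n_jobs)(
        delayed(tree.predict_all)(forest.roots[i], dataset)
        for i, tree in enumerate(forest.trees))
    # per_tree has shape (n_trees, n_rows); transpose so each row of the result
    # holds the predictions of all trees for one sample.
    return np.array(per_tree).T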

Why override Dataset instead of directly passing in inputs and labels, pytorch

Sorry if what I say here is wrong -- new to pytorch.
From what I can tell there are two main ways of getting training data and passing it through a network. One is to override Dataset and the other is to just prepare your data correctly and then iterate over it, as shown in this example: pytorch classification example
which does something like
rnn(input, hidden, output)
for i in range(input.size()[0]):
    output, hidden = rnn(input[i], hidden)
The other way would be to do something like
for epoch in range(epochs):
    for data, target in trainloader:
        # compute model etc.
where in this method, trainloader is from doing something like
trainloader = DataLoader(my_data)
after overriding __getitem__ and __len__.
My question here is: what are the differences between these methods, and why would you use one over the other? Also, it seems to me that overriding Dataset doesn't work for something that has, let's say, an input layer of 100 nodes with an output of 10 nodes, since __getitem__ needs to return a (data, label) pair. This seems like a case where I probably don't understand how to use Dataset very well, but that is why I'm asking in the first place. I think I read something about a collate function which might help in this scenario?
The Dataset and DataLoader classes in PyTorch help us feed our own training data into the network. The Dataset class provides an interface for accessing all the training or testing samples in your dataset. To achieve this, you have to implement at least two methods, __getitem__ and __len__, so that each training sample can be accessed by its index. In the initialization part of the class, we load the dataset (as float type) and convert it into Float torch tensors. __getitem__ will return the features and the target value.
What are the differences between these methods?
In PyTorch you can either prepare your data so that the default DataLoader can consume it and get an iterable object, or you can override the defaults to perform custom operations, e.g. some preprocessing of text/images, stacking frames from video clips, etc.
Our DataLoader behaves like an iterator, so we can loop over it and fetch a different mini-batch every time.
Basic Sample
from torch.utils.data import DataLoader
train_loader = DataLoader(dataset=train_data, batch_size=16, shuffle=True)
valid_loader = DataLoader(dataset=valid_data, batch_size=16, shuffle=True)
# To retrieve a sample mini-batch, one can simply run the command below —
# it will return a list containing two tensors:
# one for the features, another one for the labels.
next(iter(train_loader))
next(iter(valid_loader))
Custom Sample
import torch
from torch.utils.data import Dataset, DataLoader

class SampleData(Dataset):
    def __init__(self, data):
        # `data` is expected to be a pandas DataFrame; convert it to a float tensor
        self.data = torch.FloatTensor(data.values.astype('float'))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        target = self.data[index][-1]      # last column is the target
        data_val = self.data[index][:-1]   # remaining columns are the features
        return data_val, target

train_dataset = SampleData(train_data)
valid_dataset = SampleData(valid_data)
device = "cuda" if torch.cuda.is_available() else "cpu"
kwargs = {'num_workers': 1, 'pin_memory': True} if device == 'cuda' else {}
train_loader = DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True, **kwargs)
test_loader = DataLoader(valid_dataset, batch_size=test_batch_size, shuffle=False, **kwargs)
Why would you use one over the other?
It solely depends on your use case and the amount of control you want. PyTorch gives you all the power, and it is you who decides how much of it to use. Suppose you are solving a simple image classification problem. Then:
You can simply put all the images in a root folder, with each subfolder containing the samples belonging to a particular class, and label each folder with its class name. When training, you just need to specify the path to the root folder, and torchvision's ImageFolder dataset (wrapped in a regular DataLoader) will automatically pick up the images and their labels from each folder (see the sketch below).
But on the other hand, if you are classifying video clips or video sequences (generally known as video tagging) in a large video file, then you need to write a custom Dataset/DataLoader that loads the frames from the video, stacks them, and feeds them to the model.
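A minimal sketch of the folder-per-class setup mentioned above, using torchvision's ImageFolder; the root path, image size, and batch size are placeholders.
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
# Layout: data/train/cat/xxx.png, data/train/dog/yyy.png, ...
# The class labels are inferred from the sub-folder names.
train_data = torchvision.datasets.ImageFolder(root="data/train", transform=transform)
train_loader = DataLoader(train_data, batch_size=16, shuffle=True)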
You can find some useful links below for further reference:
https://pytorch.org/docs/stable/data.html
https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel
https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

PyTorch: Relation between Dynamic Computational Graphs - Padding - DataLoader

As far as I understand, the strength of PyTorch is supposed to be that it works with dynamic computational graphs. In the context of NLP, that means that sequences with variable lengths do not necessarily need to be padded to the same length. But, if I want to use the PyTorch DataLoader, I need to pad my sequences anyway because the DataLoader only takes tensors, given that I, as a total beginner, do not want to build a customized collate_fn.
Now this makes me wonder - doesn’t this wash away the whole advantage of dynamic computational graphs in this context?
Also, if I pad my sequences to feed it into the DataLoader as a tensor with many zeros as padding tokens at the end (in the case of word ids), will it have any negative effect on my training since PyTorch may not be optimized for computations with padded sequences (since the whole premise is that it can work with variable sequence lengths in the dynamic graphs), or does it simply not make any difference?
I will also post this question in the PyTorch Forum...
Thanks!
In the context of NLP, that means that sequences with variable lengths do not necessarily need to be padded to the same length.
This means that you don't need to pad sequences unless you are doing data batching which is currently the only way to add parallelism in PyTorch. DyNet has a method called autobatching (which is described in detail in this paper) that does batching on the graph operations instead of the data, so this might be what you want to look into.
But, if I want to use the PyTorch DataLoader, I need to pad my sequences anyway because the DataLoader only takes tensors, given that I, as a total beginner, do not want to build a customized collate_fn.
You can use the DataLoader provided you write your own Dataset class and use batch_size=1. The twist is to use numpy arrays for your variable-length sequences (otherwise default_collate will give you a hard time):
import numpy as np
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader

class FooDataset(Dataset):
    def __init__(self, data, target):
        assert len(data) == len(target)
        self.data = data
        self.target = target

    def __getitem__(self, index):
        return self.data[index], self.target[index]

    def __len__(self):
        return len(self.data)

data = [[1, 2, 3], [4, 5, 6, 7, 8]]
data = [np.array(n) for n in data]
targets = ['a', 'b']
ds = FooDataset(data, targets)
dl = DataLoader(ds, batch_size=1)
print(list(enumerate(dl)))
# [(0, [
#  1  2  3
# [torch.LongTensor of size 1x3]
# , ('a',)]), (1, [
#  4  5  6  7  8
# [torch.LongTensor of size 1x5]
# , ('b',)])]
Now this makes me wonder - doesn’t this wash away the whole advantage of dynamic computational graphs in this context?
Fair point, but the main strength of dynamic computational graphs is (at least currently) the possibility of using debugging tools like pdb, which rapidly decreases your development time. Debugging is much harder with static computation graphs. There is also no reason why PyTorch would not implement further just-in-time optimizations or a concept similar to DyNet's auto-batching in the future.
Also, if I pad my sequences to feed it into the DataLoader as a tensor with many zeros as padding tokens at the end [...], will it have any negative effect on my training [...]?
Yes, both in runtime and for the gradients. The RNN will iterate over the padding just like normal data which means that you have to deal with it in some way. PyTorch supplies you with tools for dealing with padded sequences and RNNs, namely pad_packed_sequence and pack_padded_sequence. These will let you ignore the padded elements during RNN execution, but beware: this does not work with RNNs that you implement yourself (or at least not if you don't add support for it manually).
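A minimal sketch of those two helpers with a built-in LSTM; the vocabulary size, embedding dimension, and hidden size are illustrative, not taken from the question.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

batch = torch.tensor([[1, 2, 3, 0, 0],           # padded word ids (0 = padding)
                      [4, 5, 6, 7, 8]])
lengths = torch.tensor([3, 5])                   # true lengths before padding

emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
rnn = nn.LSTM(input_size=4, hidden_size=6, batch_first=True)

packed = pack_padded_sequence(emb(batch), lengths, batch_first=True,
                              enforce_sorted=False)
packed_out, _ = rnn(packed)                      # the padded positions are never fed to the LSTM
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape, out_lengths)                    # torch.Size([2, 5, 6]) tensor([3, 5])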

Resources