Writing a custom PyTorch DataLoader iterator with pre-processing on the batch

A typical custom PyTorch Dataset looks like this:
class TorchCustomDataset(torch.utils.data.Dataset):
    def __init__(self, filenames, speech_labels):
        pass

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        return 1, 0
Here, with __getitem__ I can read any file and apply any pre-processing for that specific file.
What if I want to apply some tensor-level pre-processing to the whole batch of data? Technically, it's possible to just iterate through the data loader to get the batch sample and apply the pre-processing to it.
But how can I do this with a custom data loader? In short, what is the __getitem__ equivalent for a data loader, i.e. where can I apply some operation to the whole batch of data?

You can override the collate_fn of DataLoader: this function takes the individual items from the underlying Dataset and assembles them into a batch. You can add your custom batch-level pre-processing at that point by supplying your own collate_fn.
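A minimal sketch of what that can look like, assuming each __getitem__ call returns a (feature_tensor, label) pair of fixed shape; my_dataset and the batch-level normalization are just illustrative placeholders:
import torch
from torch.utils.data import DataLoader

def my_collate_fn(samples):
    # samples is a list of (feature_tensor, label) tuples from __getitem__
    data = torch.stack([features for features, _ in samples])
    labels = torch.tensor([label for _, label in samples])
    # batch-level pre-processing, e.g. normalize over the whole batch
    data = (data - data.mean()) / (data.std() + 1e-8)
    return data, labels

loader = DataLoader(my_dataset, batch_size=16, shuffle=True, collate_fn=my_collate_fn)
for data, labels in loader:
    pass  # each batch arrives already pre-processed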

Related

PyTorch - Save just the model structure without weights and then load and train it

I want to separate model structure authoring and training. The model author designs the model structure, saves the untrained model to a file, and then sends it to a training service, which loads the model structure and trains the model.
Keras has the ability to save the model config and then load it.
How can the same be accomplished with PyTorch?
You can write your own function to do that in PyTorch. Saving the weights is straightforward: you simply do torch.save(model.state_dict(), 'weightsAndBiases.pth').
For saving the model structure, you can do this:
(Assume you have a model class named Network, and you instantiate yourModel = Network())
model_structure = {'input_size': 784,
                   'output_size': 10,
                   'hidden_layers': [each.out_features for each in yourModel.hidden_layers],
                   'state_dict': yourModel.state_dict()  # if you want to save the weights
                  }

torch.save(model_structure, 'model_structure.pth')
Similarly, we can write a function to load the structure.
def load_structure(filepath):
    structure = torch.load(filepath)
    model = Network(structure['input_size'],
                    structure['output_size'],
                    structure['hidden_layers'])
    # model.load_state_dict(structure['state_dict'])  # if you had saved the weights as well
    return model

model = load_structure('model_structure.pth')
print(model)
Edit:
Okay, the above covers the case where you have access to the source code of your class, or where the class is relatively simple, so that you could define a generic class like this:
import torch.nn as nn
import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self, input_size, output_size, hidden_layers, drop_p=0.5):
        ''' Builds a feedforward network with arbitrary hidden layers.

            Arguments
            ---------
            input_size: integer, size of the input layer
            output_size: integer, size of the output layer
            hidden_layers: list of integers, the sizes of the hidden layers
        '''
        super().__init__()
        # Input to the first hidden layer
        self.hidden_layers = nn.ModuleList([nn.Linear(input_size, hidden_layers[0])])
        # Add a variable number of additional hidden layers
        layer_sizes = zip(hidden_layers[:-1], hidden_layers[1:])
        self.hidden_layers.extend([nn.Linear(h1, h2) for h1, h2 in layer_sizes])
        self.output = nn.Linear(hidden_layers[-1], output_size)
        self.dropout = nn.Dropout(p=drop_p)

    def forward(self, x):
        ''' Forward pass through the network, returns the output logits '''
        for each in self.hidden_layers:
            x = F.relu(each(x))
            x = self.dropout(x)
        x = self.output(x)
        return F.log_softmax(x, dim=1)
However, that will only work for simple cases so I suppose that's not what you intended.
One option is to define the model architecture in a separate .py file and import it along with the other necessities (useful if the model architecture is complex), or you can define the model then and there.
Another option is converting your PyTorch model to ONNX and saving it.
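A minimal sketch of the ONNX route, assuming the model accepts a single input tensor of shape (1, 784); the file name is arbitrary:
import torch

# a dummy input with the shape the model expects; it is used to trace the graph
dummy_input = torch.randn(1, 784)
torch.onnx.export(yourModel, dummy_input, 'model.onnx')
# the resulting file contains both the graph and the weights, and can be run
# without the original Python class, e.g. with onnxruntime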
The other option is the PyTorch analogue of TensorFlow's .pb file, which defines both the architecture and the weights of the model; in PyTorch you would do it this way:
torch.save(model, filepath)
This will save the model object itself, as torch.save() is just a pickle-based save at the end of the day.
model = torch.load(filepath)
This, however, has limitations: your model class definition might not be picklable, for example (which can happen with some complicated models).
Because this is such an iffy workaround, the answer you'll usually get is: no, you have to declare the class definition before loading the trained model, i.e. you need access to the model class source code.
Side notes:
An official answer by one of the core PyTorch devs on the limitations of loading a PyTorch model without its code:
We only save the source code of the class definition. We do not save beyond that (like the package sources that the class is referring to).
import foo

class MyModel(...):
    def forward(input):
        foo.bar(input)
Here the package foo is not saved in the model checkpoint.
There are limitations on robustly serializing Python constructs. For example, the default picklers cannot serialize lambdas. There are helper packages that can serialize more Python constructs than the standard one, but they still have limitations. Dill is one such package.
Given these limitations, there is no robust way to have torch.load work without having the original source files.

Why override Dataset instead of directly passing in inputs and labels? (PyTorch)

Sorry if what I say here is wrong -- I'm new to PyTorch.
From what I can tell there are two main ways of getting training data and passing it through a network. One is to override Dataset and the other is to just prepare your data correctly and then iterate over it, as shown in this example: pytorch classification example
which does something like
rnn(input, hidden, output)
for i in range(input.size()[0]):
    output, hidden = rnn(input[i], hidden)
The other way would be to do something like
for epoch in range(epochs):
    for data, target in trainloader:
        # compute model output, loss, etc.
where in this method, trainloader comes from doing something like
trainloader = DataLoader(my_data)
after overriding __getitem__ and __len__.
My question is: what are the differences between these methods, and why would you use one over the other? Also, it seems to me that overriding Dataset doesn't work for something that has, let's say, an input layer of 100 nodes with an output of 10 nodes, since __getitem__ needs to return a (data, label) pair. This seems like a case where I probably don't understand how to use Dataset very well, but that is why I'm asking in the first place. I think I read something about a collate function which might help in this scenario?
The Dataset and DataLoader classes in PyTorch help us feed our own training data into the network. The Dataset class provides an interface for accessing all the training or testing samples in your dataset. To achieve this, you have to implement at least two methods, __getitem__ and __len__, so that each training sample can be accessed by its index. In the initialization part of the class below, we load the dataset (as float type) and convert it into float torch tensors. __getitem__ returns the features and the target value.
What are the differences between these methods?
In PyTorch, you can either prepare your data so that the default DataLoader can consume it and you get an iterable object, or you can write a custom Dataset/DataLoader to perform custom operations, for example preprocessing text/images, stacking frames from video clips, etc.
Our DataLoader behaves like an iterator, so we can loop over it and fetch a different mini-batch every time.
Basic Sample
from torch.utils.data import DataLoader
train_loader = DataLoader(dataset=train_data, batch_size=16, shuffle=True)
valid_loader = DataLoader(dataset=valid_data, batch_size=16, shuffle=True)
# To retrieve a sample mini-batch, one can simply run the command below —
# it will return a list containing two tensors:
# one for the features, another one for the labels.
next(iter(train_loader))
next(iter(valid_loader))
Custom Sample
import torch
from torch.utils.data import Dataset, DataLoader

class SampleData(Dataset):
    def __init__(self, data):
        # expects a pandas DataFrame; the last column is the target
        self.data = torch.FloatTensor(data.values.astype('float'))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        target = self.data[index][-1]
        data_val = self.data[index][:-1]
        return data_val, target

train_dataset = SampleData(train_data)
valid_dataset = SampleData(valid_data)

device = "cuda" if torch.cuda.is_available() else "cpu"
kwargs = {'num_workers': 1, 'pin_memory': True} if device == 'cuda' else {}
train_loader = DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True, **kwargs)
test_loader = DataLoader(valid_dataset, batch_size=test_batch_size, shuffle=False, **kwargs)
Why would you use one over the other?
It depends solely on your use-case and the amount of control you want. PyTorch gives you all the power, and it is you who decides how much of it to use. Suppose you are solving a simple image classification problem. Then:
You can simply put all the images in a root folder, with each subfolder containing the samples belonging to a particular class, and name each folder after its class. When training, we just need to specify the path to the root folder, and PyTorch's ImageFolder dataset (wrapped in a DataLoader) will automatically pick up the images and their labels from each folder while we train the model, as sketched below.
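A minimal sketch of that folder-per-class setup, assuming the images live under a hypothetical data/train directory:
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# data/train/cat/001.jpg, data/train/dog/001.jpg, ...; labels come from the folder names
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_data = datasets.ImageFolder('data/train', transform=transform)
train_loader = DataLoader(train_data, batch_size=16, shuffle=True)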
But if, on the other hand, you are classifying video clips or video sequences (generally known as video tagging) from a large video file, then you need to write a custom Dataset to load the frames from the video, stack them, and feed that input to the DataLoader.
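A hypothetical sketch of such a Dataset for the video case, assuming the frames of each clip have already been decoded into tensors (clips) with one label per clip (labels):
import torch
from torch.utils.data import Dataset

class VideoClipDataset(Dataset):
    def __init__(self, clips, labels, num_frames=16):
        # clips: list of tensors of shape (frames, channels, height, width)
        self.clips = clips
        self.labels = labels
        self.num_frames = num_frames

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, index):
        # take a fixed number of frames so every sample has the same shape
        clip = self.clips[index][:self.num_frames]
        return clip, self.labels[index]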
You can find some useful links below for further reference:
https://pytorch.org/docs/stable/data.html
https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel
https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

Pytorch: multiple datasets with multiple losses

I am using multiple datasets. I have multiple losses, each of which must be evaluated on a subset of these datasets. I want to generate a batch from each dataset, and evaluate each loss on all of its appropriate batches. Some of the losses are pairwise (need to load pairs of corresponding datapoints) whereas others are computed on single datapoints. I need to design this in such a way that is open to easily adding new datasets. Is there any pytorch builtin that would help with this? What is the best way to design this in pytorch? Thanks in advance.
It's not clear from your question what exactly your settings are.
However, you can have multiple Dataset instances, one for each of your datasets.
On top of your datasets, you can implement a "tagged dataset", a dataset that adds a "tag" for all samples:
from torch.utils import data

class TaggedDataset(data.Dataset):
    def __init__(self, dataset, tag):
        super(TaggedDataset, self).__init__()
        self.ds_ = dataset
        self.tag_ = tag

    def __len__(self):
        return len(self.ds_)

    def __getitem__(self, index):
        return self.ds_[index], self.tag_
Give a different tag to each dataset, concat all of them into a single ConcatDataset, and wrap a regular DataLoader around it.
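A minimal sketch of that wiring, assuming two hypothetical datasets ds_a and ds_b that each return (input, label) pairs:
from torch.utils.data import ConcatDataset, DataLoader

tagged = ConcatDataset([TaggedDataset(ds_a, 'a'), TaggedDataset(ds_b, 'b')])
my_tagged_loader = DataLoader(tagged, batch_size=32, shuffle=True)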
Now, in your training code
for (input, label), tag in my_tagged_loader:
    # process each input according to the dataset tag it got

How to parallelize a for loop inside some_module.forward(some_input) (on the GPU)?

Let's say I have a model (in pseudo-code, mostly):
class SomeLayer(nn.Module):
    def __init__(self, s):
        # init some layers etc.
        self.N = s * s

    def forward(self, input_tensor):
        # initialize some variables
        some_results = []
        for iter_i in range(self.N):
            # do independent operations on different parts of input_tensor;
            # each operation is basically a copy of a subtensor of input_tensor
            # such that its size depends on iter_i
            # append results to some_results
        return some_results
What is the correct way of parallelizing that sort of for loop? Currently I'm planning to write a small CUDA kernel for that and load it from Python, but it feels a bit like overkill, and I assume there should be a simple way to do this, although I haven't been able to find it in the documentation.

Keras- Loss per sample within batch

How do I get the sample loss while training instead of the total loss? The loss history is available which gives the total batch loss but it doesn't provide the loss for individual samples.
If possible I would like to have something like this:
on_batch_end(batch, logs, **sample_losses**)
Is something like this available, and if not, can you provide some hints on how to change the code to support this?
To the best of my knowledge it is not possible to get this information via callbacks since the loss is already computed once the callbacks are called (have a look at keras/engine/training.py). To simply inspect the losses you may override the loss function, e.g.:
def myloss(ytrue, ypred):
    x = keras.objectives.mean_squared_error(ytrue, ypred)
    return theano.printing.Print('loss for each sample')(x)

model.compile(loss=myloss)
Actually, this can be done using a callback. This is now included in the Keras documentation on callbacks. Define your own callback like this:
class LossHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        self.losses = []

    def on_batch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))
And then pass this callback to your model. You should get per-batch losses appended to the history object.
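A minimal sketch of wiring that up, assuming an already compiled model and training arrays x_train and y_train:
history_cb = LossHistory()
model.fit(x_train, y_train, batch_size=32, epochs=10, callbacks=[history_cb])
print(history_cb.losses)  # one loss value per batch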
I have also not found any existing functions in the Keras API that can return individual sample losses while still computing on a minibatch. It seems you have to hack Keras, or maybe access the TensorFlow graph directly.
Set the batch size to 1 and use callbacks in model.evaluate, OR manually calculate the loss between the prediction (model.predict) and the ground truth.
