Data loading with variable batch size?

I am currently working on patch-based super-resolution. Most papers divide an image into smaller patches and then use the patches as input to the model. I was able to create patches using a custom dataset. The code is given below:
import torch.utils.data as data
from torchvision.transforms import CenterCrop, ToTensor, Compose, ToPILImage, Resize, RandomHorizontalFlip, RandomVerticalFlip
from os import listdir
from os.path import join
from PIL import Image
import random
import os
import numpy as np
import torch

def is_image_file(filename):
    return any(filename.endswith(extension) for extension in [".png", ".jpg", ".jpeg", ".bmp"])

class TrainDatasetFromFolder(data.Dataset):
    def __init__(self, dataset_dir, patch_size, is_gray, stride):
        super(TrainDatasetFromFolder, self).__init__()
        self.imageHrfilenames = []
        self.imageHrfilenames.extend(join(dataset_dir, x)
                                     for x in sorted(listdir(dataset_dir)) if is_image_file(x))
        self.is_gray = is_gray
        self.patchSize = patch_size
        self.stride = stride

    def _load_file(self, index):
        filename = self.imageHrfilenames[index]
        hr = Image.open(self.imageHrfilenames[index])
        downsizes = (1, 0.7, 0.45)
        downsize = 2
        w_ = int(hr.width * downsizes[downsize])
        h_ = int(hr.height * downsizes[downsize])
        aug = Compose([Resize([h_, w_], interpolation=Image.BICUBIC),
                       RandomHorizontalFlip(),
                       RandomVerticalFlip()])
        hr = aug(hr)
        rv = random.randint(0, 4)
        hr = hr.rotate(90 * rv, expand=1)
        filename = os.path.splitext(os.path.split(filename)[-1])[0]
        return hr, filename

    def _patching(self, img):
        img = ToTensor()(img)
        LR_ = Compose([ToPILImage(), Resize(self.patchSize // 2, interpolation=Image.BICUBIC), ToTensor()])
        HR_p, LR_p = [], []
        for i in range(0, img.shape[1] - self.patchSize, self.stride):
            for j in range(0, img.shape[2] - self.patchSize, self.stride):
                temp = img[:, i:i + self.patchSize, j:j + self.patchSize]
                HR_p += [temp]
                LR_p += [LR_(temp)]
        return torch.stack(LR_p), torch.stack(HR_p)

    def __getitem__(self, index):
        HR_, filename = self._load_file(index)
        LR_p, HR_p = self._patching(HR_)
        return LR_p, HR_p

    def __len__(self):
        return len(self.imageHrfilenames)
Suppose the batch size is 1: the loader takes one image and gives an output of size [x, 3, patchsize, patchsize]. When the batch size is 2, I get two outputs of size [x, 3, patchsize, patchsize] with different x (for example, image 1 may give [50, 3, patchsize, patchsize] and image 2 may give [75, 3, patchsize, patchsize]). To handle this, a custom collate function was required that stacks these two outputs along dimension 0. The collate function is given below:
def my_collate(batch):
    data = torch.cat([item[0] for item in batch], dim=0)
    target = torch.cat([item[1] for item in batch], dim=0)
    return [data, target]
This collate function concatenates along x (from the above example, I finally get [125, 3, patchsize, patchsize]). For training purposes, I need to train the model using a minibatch size of, say, 25. Is there any method or function I can use to get an output of size [25, 3, patchsize, patchsize] directly from the dataloader, using the necessary number of images as input?

The following code snippet works for your purpose.
First, we define a ToyDataset which takes in a list of tensors (tensors) of variable length in dimension 0. This is similar to the samples returned by your dataset.
import torch
from torch.utils.data import Dataset
from torch.utils.data.sampler import RandomSampler

class ToyDataset(Dataset):
    def __init__(self, tensors):
        self.tensors = tensors

    def __getitem__(self, index):
        return self.tensors[index]

    def __len__(self):
        return len(self.tensors)  # fixed: was `len(tensors)`, which referenced a global instead of the attribute
Secondly, we define a custom data loader. The usual PyTorch division of labor for creating datasets and data loaders is roughly the following:
There is an indexed dataset, to which you can pass an index and which returns the associated sample.
There is a sampler which yields an index; different strategies for drawing indices give rise to different samplers.
The sampler is used by a batch_sampler to draw multiple indices at once (as many as specified by batch_size).
There is a data loader which combines sampler and dataset to let you iterate over the dataset; importantly, the data loader also owns a function (collate_fn) which specifies how the multiple samples retrieved via the indices from the batch_sampler should be combined.
For your use case, this usual division of labor does not work well, because instead of drawing a fixed number of indices, we need to keep drawing indices until the objects associated with them reach the cumulative size we desire. This means we need to inspect the objects immediately and use this knowledge to decide whether to return a batch or keep drawing indices. This is what the custom data loader below does (shown after a short reference sketch of the standard pieces).
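For reference, here is a minimal sketch of those standard pieces working together; SquaresDataset is a hypothetical toy dataset introduced purely for illustration, not part of the original code:
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.sampler import BatchSampler, RandomSampler

class SquaresDataset(Dataset):  # hypothetical toy dataset
    def __len__(self):
        return 10

    def __getitem__(self, index):
        return torch.tensor([index, index ** 2], dtype=torch.float32)

ds = SquaresDataset()
sampler = RandomSampler(ds)  # yields one index at a time
batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)  # groups indices
# The DataLoader fetches ds[i] for each drawn index and merges the samples via collate_fn
loader = DataLoader(ds, batch_sampler=batch_sampler,
                    collate_fn=lambda samples: torch.stack(samples))
for batch in loader:
    print(batch.shape)  # torch.Size([4, 2]) for full batches, [2, 2] for the last one
And here is the custom loader itself: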
class CustomLoader(object):
    def __init__(self, dataset, my_bsz, drop_last=True):
        self.ds = dataset
        self.my_bsz = my_bsz
        self.drop_last = drop_last
        self.sampler = RandomSampler(dataset)

    def __iter__(self):
        batch = torch.Tensor()
        for idx in self.sampler:
            batch = torch.cat([batch, self.ds[idx]])
            while batch.size(0) >= self.my_bsz:
                if batch.size(0) == self.my_bsz:
                    yield batch
                    batch = torch.Tensor()
                else:
                    return_batch, batch = batch.split([self.my_bsz, batch.size(0) - self.my_bsz])
                    yield return_batch
        if batch.size(0) > 0 and not self.drop_last:
            yield batch
Here we iterate over the dataset; after drawing an index and loading the associated object, we concatenate it to the tensors drawn before (batch). We keep doing this until we reach the desired size, so that we can cut out and yield a batch. We retain the rows in batch which we did not yield. Because a single instance may exceed the desired batch_size, we use a while loop.
You could modify this minimal CustomLoader to add more features in the style of PyTorch's DataLoader. There is also no need to use a RandomSampler to draw indices; others would work equally well. If your data is large, it is also possible to avoid repeated concats, for example by collecting samples in a list and keeping track of the cumulative length of its tensors; a sketch of that follows.
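As one possible sketch of that optimization (not the answer's original code), a drop-in replacement for CustomLoader.__iter__ that buffers samples in a list and performs a single concat per emitted batch:
    def __iter__(self):
        # Buffer samples in a list; concatenate only when a batch is emitted,
        # instead of calling torch.cat on every draw.
        buffer, buffered_rows = [], 0
        for idx in self.sampler:
            sample = self.ds[idx]
            buffer.append(sample)
            buffered_rows += sample.size(0)
            while buffered_rows >= self.my_bsz:
                stacked = torch.cat(buffer)  # one concat per emitted batch
                out, rest = stacked.split([self.my_bsz, stacked.size(0) - self.my_bsz])
                yield out
                buffer = [rest]
                buffered_rows = rest.size(0)
        if buffered_rows > 0 and not self.drop_last:
            yield torch.cat(buffer)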
Here is an example that demonstrates it works:
patch_size = 5
channels = 3
dim0sizes = torch.LongTensor(100).random_(1, 100)
data = torch.randn(size=(dim0sizes.sum(), channels, patch_size, patch_size))
tensors = torch.split(data, dim0sizes.tolist())
ds = ToyDataset(tensors)
dl = CustomLoader(ds, my_bsz=250, drop_last=False)
for i in dl:
    print(i.size(0))

(Related, but not exactly on topic)
For batch size adaptation you can use the code exemplified in this repo. It is implemented for a different purpose (maximizing GPU memory usage), but it is not too hard to translate to your problem.
The code does batch adaptation and batch spoofing.
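The batch-spoofing part is essentially gradient accumulation; a minimal sketch of that idea (model, loss_fn, optimizer, and loader are placeholders, not names from the repo):
accum_steps = 4  # gradients of 4 small batches are accumulated into one "spoofed" large batch
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so the sum matches the large-batch loss
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()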

To improve on the previous answer, I found a repo that uses a DataManager to achieve different patch sizes and batch sizes. It basically initializes different dataloaders with different settings, and a set_epoch function is used to select the appropriate dataloader for a given epoch.
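The repo's actual API may differ; as a rough, hypothetical sketch of the idea (all names below are illustrative):
from torch.utils.data import DataLoader

class DataManager:  # hypothetical sketch, not the repo's actual class
    """Keeps one DataLoader per training phase and swaps them by epoch."""
    def __init__(self, make_dataset, schedule):
        # schedule maps a starting epoch to (patch_size, batch_size),
        # e.g. {0: (32, 16), 50: (64, 4)}
        self.schedule = schedule
        self.loaders = {
            start: DataLoader(make_dataset(patch_size), batch_size=batch_size, shuffle=True)
            for start, (patch_size, batch_size) in schedule.items()
        }
        self.loader = self.loaders[min(schedule)]

    def set_epoch(self, epoch):
        # pick the loader whose phase started most recently before `epoch`
        start = max(s for s in self.schedule if s <= epoch)
        self.loader = self.loaders[start]
        return self.loader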

Related

TensorFlow: How to retrieve parts of my multi-input Dataset and their respective loss?

First of all, I am quite new to how AI and TensorFlow work.
My problem is the following: I need to train my neural network on paired images, one that is unchanged and the same one transformed. This implies, at the end, a joint loss calculation over the paired images in order to compute the mutual information for an unsupervised image analysis problem.
Also, since my dataset is 4,000 RGB images of size 256*256, I need to use a data generator.
Here is an example of what I already did for my data generator:
class dataset(object):
    def __init__(self, data_list, batch_size):
        self.dataset = None
        self.batch_size = batch_size  # fixed: was `BATCH_SIZE`, which silently used the global
        self.current_batch = 0
        self.data_list = data_list
        self.normal_image = None
        self.transformed_image = None
        self.label = None

    def generator(self):
        index = self.current_batch * self.batch_size
        self.current_batch = self.current_batch + 1
        for image, label in self.data_list[index:]:
            self.label = label
            image = image / 255.0
            self.normal_image = image
            self.transformed_image = utils.get_random_crop(image, height=200, width=200)
            yield ({'normal_image': self.normal_image,
                    'transformed_image': self.transformed_image},
                   {'label': self.label})

    def data_loader(self):
        self.dataset = tf.data.Dataset.from_generator(self.generator,
                                                      output_types=(
                                                          {'normal_image': tf.float32,
                                                           'transformed_image': tf.float32},
                                                          {'label': tf.int32})).batch(self.batch_size)
        return self.dataset

train_dataset = dataset(train_list, BATCH_SIZE)
test_dataset = dataset(test_list, BATCH_SIZE)
Note that train_list and test_list are just raw numpy arrays that I retrieved from my image collection.
Here are my 2 questions:
How can I retrieve specifically the loss from my normal and transformed images, so that I can do a joint loss calculation at the end of each epoch?
I got my data generator (it seems to work fine); each next() retrieves the next batch of my collection. However, as you can see, I have a (kind of?) tuple inside my dataset: {normal_image, transformed_image}.
I am having a hard time finding out how to access specifically one of those entries inside this (kind of?) tuple, in order to feed my CNN with the normal_image and the transformed_image one at a time, etc.
dataset.transformed_image would have been too good, haha!
Also, in my dataset class I have self.normal_image and self.transformed_image, but I use them only for plotting. They are not tensors like the ones in my dataset :(
Thanks for your time!

Reduce multiclass image classification to binary classification in Pytorch

I am working on the STL-10 image dataset, which consists of 10 different classes. I want to reduce this multiclass image classification problem to binary classification, such as class 1 vs. the rest. I am using PyTorch torchvision to download and use the STL data, but I am unable to do it as one vs. the rest.
train_data = torchvision.datasets.STL10(root='data', split='train', transform=data_transforms['train'], download=True)
test_data = torchvision.datasets.STL10(root='data', split='test', transform=data_transforms['val'], download=True)
train_dataloader = DataLoader(train_data, batch_size=64, shuffle=True, num_workers=2)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True, num_workers=2)
For torchvision datasets, there is a built-in way to do this. You need to define a transformation function or class and pass it as target_transform when creating the dataset.
torchvision.datasets.STL10(root: str, split: str = 'train', folds: Union[int, NoneType] = None, transform: Union[Callable, NoneType] = None, target_transform: Union[Callable, NoneType] = None, download: bool = False)
Here is a working example for reference:
import torchvision
from torch.utils.data import DataLoader
from torchvision import transforms

class Multi2UniLabelTfm():
    def __init__(self, pos_label=5):
        if isinstance(pos_label, int) or isinstance(pos_label, float):
            pos_label = [pos_label, ]
        self.pos_label = pos_label

    def __call__(self, y):
        if y in self.pos_label:
            return 1
        else:
            return 0

if __name__ == '__main__':
    test_tfms = transforms.Compose([
        transforms.ToTensor()
    ])
    data_transforms = {'val': test_tfms}

    # Original labels
    # target_transform = None

    # Label 5 is converted to 1. The rest are 0.
    # target_transform = Multi2UniLabelTfm(pos_label=5)

    # Labels 5, 6, 7 are converted to 1. The rest are 0.
    target_transform = Multi2UniLabelTfm(pos_label=[5, 6, 7])

    test_data = torchvision.datasets.STL10(root='data', split='test', transform=data_transforms['val'],
                                           download=True, target_transform=target_transform)
    test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True, num_workers=2)

    for idx, (x, y) in enumerate(test_dataloader):
        print(idx, y)
        if idx == 5:
            break
You need to relabel the images. Initially, the 10 classes correspond to labels 0 through 9. If you want binary classification, you need to change the label of the pictures of category 1 (or whichever class you pick) to 0, and the pictures of all other categories to 1.
One way is to update label values at runtime, before passing them to the loss function in the training loop. Let's say we want to relabel class 5 as 1, and the rest as 0:
my_class_id = 5
for imgs, labels in train_dataloader:
    labels = torch.where(labels == my_class_id, 1, 0)
    ...
You may also need to do similar relabeling for test_dataloader. Also, I am not sure about the datatype of the labels; if the loss function expects floats, convert accordingly.
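For instance, if you train with BCEWithLogitsLoss (which expects float targets), the relabeling could look like this sketch (model is a placeholder assumed to have a single output unit):
loss_fn = torch.nn.BCEWithLogitsLoss()
my_class_id = 5
for imgs, labels in train_dataloader:
    targets = (labels == my_class_id).float()  # float 0./1. targets for BCE
    logits = model(imgs).squeeze(1)            # hypothetical model, one logit per image
    loss = loss_fn(logits, targets)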

How to load data from multiple datasets in PyTorch

I have two datasets of images, indoors and outdoors; they don't have the same number of examples.
Each dataset has images that contain a certain number of classes (minimum 1, maximum 4), these classes can appear in both datasets, and each class has 4 categories: red, blue, green, white.
Example:
Indoor: cats, dogs, horses
Outdoor: dogs, humans
I am trying to train a model where I tell it, "here is an image that contains a cat, tell me its color", regardless of where it was taken (indoors, outdoors, in a car, on the moon).
To do that, I need to present my model with examples so that every batch has only one category (cat, dog, horse, or human), but I want to sample from all datasets (two in this case) that contain these objects and mix them. How can I do this?
It has to take into account that the number of examples in each dataset is different, and that some categories appear in only one dataset while others can appear in more than one.
And each batch must contain only one category.
I would appreciate any help; I have been trying to solve this for a few days now.
Assuming the question is:
Combine 2+ data sets with potentially overlapping categories of objects (distinguishable by label)
Each object has 4 "subcategories" for each color (distinguishable by label)
Each batch should only contain a single object category
The first step will be to ensure consistency of the object labels from both data sets, if not already consistent. For example, if the dog class is label 0 in the first data set but label 2 in the second data set, then we need to make sure the two dog categories are correctly merged. We can do this "translation" with a simple data set wrapper:
class TranslatedDataset(Dataset):
    """
    Args:
      dataset: The original dataset.
      translate_label: A lambda (function) that maps the original
        dataset label to the label it should have in the combined data set
    """
    def __init__(self, dataset, translate_label):
        super().__init__()
        self._dataset = dataset
        self._translate_label = translate_label

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, idx):
        inputs, target = self._dataset[idx]
        return inputs, self._translate_label(target)
The next step is combining the translated data sets together, which can be done easily with a ConcatDataset:
first_original_dataset = ...
second_original_dataset = ...

first_translated = TranslatedDataset(
    first_original_dataset,
    lambda y: 0 if y == 2 else 2 if y == 0 else y,  # or similar; note `==`, not `is`
)
second_translated = TranslatedDataset(
    second_original_dataset,
    lambda y: y,  # or similar
)
combined = ConcatDataset([first_translated, second_translated])
Finally, we need to restrict batch sampling to the same class, which is possible with a custom Sampler when creating the data loader.
class SingleClassSampler(torch.utils.data.Sampler):
    def __init__(self, dataset, batch_size):
        super().__init__(dataset)
        # We need to create sequential groups
        # with batch_size elements from the same class
        indices_for_target = {}  # dict to store a list of indices for each target
        for i, (_, target) in enumerate(dataset):
            # converting to string since Tensors hash by reference, not value
            str_targ = str(target)
            if str_targ not in indices_for_target:
                indices_for_target[str_targ] = []
            indices_for_target[str_targ] += [i]
        # make sure we have a whole number of batches for each class
        # (fixed: v[:-(len(v) % batch_size)] emptied the list when its
        # length was already a multiple of batch_size)
        trimmed = {
            k: v[:len(v) - (len(v) % batch_size)]
            for k, v in indices_for_target.items()
        }
        # concatenate the lists of indices for each class
        # (fixed: sum over lists needs [] as its start value)
        self._indices = sum(trimmed.values(), [])

    def __len__(self):
        return len(self._indices)

    def __iter__(self):
        yield from self._indices
Then to use the sampler:
loader = DataLoader(
    combined,
    sampler=SingleClassSampler(combined, 64),
    batch_size=64,
    # note: shuffle must not be set together with a custom sampler
)
I haven't run this code, so it might not be exactly right, but hopefully it will put you on the right track.
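As a quick sanity check (a sketch, assuming integer class targets and the loader defined above), you can iterate a few batches and assert each one contains a single class:
for i, (inputs, targets) in enumerate(loader):
    assert targets.unique().numel() == 1, "batch mixes classes"
    if i == 4:
        break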
torch.utils.data Docs

parallelising tf.data.Dataset.from_generator with TF2.1

There are already two posts about this topic, but they have not been updated for the recent TF2.1 release.
In brief, I've got a lot of tif images to read and parse with a specific pipeline.
import functools
import tensorflow as tf
import numpy as np

files = ...   # a list of str
labels = ...  # a list of int
n_unique_label = len(np.unique(labels))

gen = functools.partial(generator, file_list=files, label_list=labels, param1=x1, param2=x2)
dataset = tf.data.Dataset.from_generator(gen, output_types=(tf.float32, tf.int32))
dataset = dataset.map(lambda b, c: (b, tf.one_hot(c, depth=n_unique_label)))
This processing works well. Nevertheless, I need to parallelize the file-parsing part, so I tried the following solution:
from functools import partial

files = ...  # a list of str
files = tf.data.Dataset.from_tensor_slices(files)

def wrapper(file_path):
    parser = partial(tif_parser, param1=x1, param2=x2)
    return tf.py_function(parser, inp=[file_path], Tout=[tf.float32])

dataset = files.map(wrapper, num_parallel_calls=2)
The difference is that I parse one file at a time here with the parser function. However, it does not work:
File "loader.py", line 643, in tif_parser
image = numpy.array(Image.open(file_path)).astype(float)
File "python3.7/site-packages/PIL/Image.py", line 2815, in open
fp = io.BytesIO(fp.read())
AttributeError: 'tensorflow.python.framework.ops.EagerTensor' object has no attribute 'read'
[[{{node EagerPyFunc}}]] [Op:IteratorGetNextSync]
As far as I understand, the tif_parser function does not receive a string but an (unevaluated) tensor. For now, this function is fairly simple:
def tif_parser(file_path, param1=1, param2=2):
    image = numpy.array(Image.open(file_path)).astype(float)
    image /= 255.0
    return image
Here is how I have proceeded:
dataset = tf.data.Dataset.from_tensor_slices((files, labels))

def wrapper(file_path, label):
    import functools
    parser = functools.partial(tif_parser, param1=x1, param2=x2)
    return tf.data.Dataset.from_generator(parser, (tf.float32, tf.int32), args=(file_path, label))

dataset = dataset.interleave(wrapper, cycle_length=tf.data.experimental.AUTOTUNE)
# The labels are converted to 1-hot vectors; could be integrated in tif_parser
dataset = dataset.map(lambda i, l: (i, tf.one_hot(l, depth=unique_label_count)))
dataset = dataset.shuffle(buffer_size=file_count, reshuffle_each_iteration=True)
dataset = dataset.batch(batch_size=batch_size, drop_remainder=False)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
Concretely, I generate a data set every time the parser is called. The parser is run cycle_length times at each call, meaning that cycle_length images are read at once. This suits my specific case, because I cannot load all the images in memory. I am unsure whether the prefetch is used correctly here.
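(For reference, the earlier py_function attempt can also be made to work by converting the EagerTensor to a Python string inside the parser. A sketch, reusing the body of tif_parser above; the set_shape dimensions are assumptions about the image layout:)
def tif_parser_eager(file_path, param1=1, param2=2):
    # file_path arrives as an EagerTensor; .numpy() yields the raw bytes to decode
    path = file_path.numpy().decode("utf-8")
    image = numpy.array(Image.open(path)).astype(float)
    image /= 255.0
    return image

def wrapper(file_path):
    image = tf.py_function(tif_parser_eager, inp=[file_path], Tout=tf.float32)
    image.set_shape([None, None, None])  # py_function loses static shape information
    return image

dataset = files.map(wrapper, num_parallel_calls=tf.data.experimental.AUTOTUNE)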

How to use multiprocessing in PyTorch?

I'm trying to use PyTorch with a complex loss function. To accelerate the code, I hope I can use the PyTorch multiprocessing package.
As a first trial, I put 10x1 features into the NN and get a 10x4 output.
After that, I want to pass the 10x4 parameters into a function to do some calculations. (The calculations will be more complex in the future.)
After the calculations, the function returns a 10x1 array. This array is set as NN_energy and used to calculate the loss function.
Besides, I also want to know if there is another method to create a backward-able array to store the NN_energy array, instead of using
NN_energy = net(Data_in)[0:10, 0]
Thanks a lot.
Full Code:
import torch
import numpy as np
from torch.autograd import Variable
from torch import multiprocessing

def func(msg, BOP):
    ans = (BOP[msg][0] + BOP[msg][1] / BOP[msg][2]) * BOP[msg][3]
    return ans

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden_1, n_hidden_2, n_output):
        super(Net, self).__init__()
        self.hidden_1 = torch.nn.Linear(n_feature, n_hidden_1)   # hidden layer
        self.hidden_2 = torch.nn.Linear(n_hidden_1, n_hidden_2)  # hidden layer
        self.predict = torch.nn.Linear(n_hidden_2, n_output)     # output layer

    def forward(self, x):
        x = torch.tanh(self.hidden_1(x))  # activation function for hidden layer
        x = torch.tanh(self.hidden_2(x))  # activation function for hidden layer
        x = self.predict(x)               # linear output
        return x

if __name__ == '__main__':  # apply_async
    Data_in = Variable(torch.from_numpy(np.asarray(list(range(0, 10))).reshape(10, 1)).float())
    Ground_truth = Variable(torch.from_numpy(np.asarray(list(range(20, 30))).reshape(10, 1)).float())

    net = Net(n_feature=1, n_hidden_1=15, n_hidden_2=15, n_output=4)  # define the network
    optimizer = torch.optim.Rprop(net.parameters())
    loss_func = torch.nn.MSELoss()  # this is for regression mean squared loss

    NN_output = net(Data_in)
    args = range(0, 10)
    pool = multiprocessing.Pool()
    return_data = pool.map(func, zip(args, NN_output))
    pool.close()
    pool.join()

    NN_energy = net(Data_in)[0:10, 0]
    for i in range(0, 10):
        NN_energy[i] = return_data[i]

    loss = torch.sqrt(loss_func(NN_energy, Ground_truth))  # must be (1. nn output, 2. target)
    print(loss)
Error messages:
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\multiprocessing\reductions.py", line 126, in reduce_tensor
    raise RuntimeError("Cowardly refusing to serialize non-leaf tensor which requires_grad, "
RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).
First of all, the Torch Variable API has been deprecated for a very long time; just don't use it.
Next, torch.from_numpy( np.asarray(list(range( 0,10))).reshape(10,1) ).float() is wrong on many levels: np.asarray of a list is useless since a copy will be performed anyway, and np.array takes a list as input by design. Then, np.arange is available to return a range as a numpy array, and it is also available in Torch. Next, specifying both dimensions for reshape is useless and error-prone; you could simply do reshape((-1, 1)), or even better unsqueeze(-1).
Here is the simplified expression: torch.arange(10, dtype=torch.float32, requires_grad=True).unsqueeze(-1).
Using a multiprocessing pool is bad practice if batch processing is possible. It will be both more efficient and more readable. Indeed, performing N small algebraic operations in parallel is always slower than a single larger algebraic operation, and even more so on GPU. More importantly, computing the gradient is not supported by multiprocessing, hence the error that you get. Yet this is only partially true, because it has been supported for CPU tensors since 1.6.0; have a look at the official release changelog.
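For the func shown in the question, the whole pool can in fact be replaced by a single vectorized, differentiable expression over the batch; a minimal sketch:
# Vectorized equivalent of calling func once per row, differentiable end to end
NN_output = net(Data_in)  # shape (10, 4)
NN_energy = (NN_output[:, 0] + NN_output[:, 1] / NN_output[:, 2]) * NN_output[:, 3]
loss = torch.sqrt(loss_func(NN_energy.unsqueeze(1), Ground_truth))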
Could you post a more representative example of what func could be, to make sure you really need multiprocessing?
NB: Distributed autograd, which may be what you are looking for, is available in PyTorch as an experimental feature, in beta since 1.6.0. Have a look at the official documentation.
