How to use a BatchSampler within a DataLoader - pytorch

I need to use a BatchSampler within a PyTorch DataLoader instead of calling __getitem__ of the dataset multiple times (remote dataset, each query is pricey). I cannot figure out how to use the BatchSampler with any given dataset, e.g.:
class MyDataset(Dataset):
    def __init__(self, remote_ddf):
        self.ddf = remote_ddf

    def __len__(self):
        return len(self.ddf)

    def __getitem__(self, idx):
        return self.ddf[idx]  # --------> this is as expensive as a batch call

    def get_batch(self, batch_idx):
        return self.ddf[batch_idx]

my_loader = DataLoader(MyDataset(remote_ddf),
                       batch_sampler=BatchSampler(Sampler(), batch_size=3))
What I do not understand, and have not found any example of online or in the torch docs, is how to make the loader use my get_batch function instead of the __getitem__ function.
Edit:
Following Szymon Maszke's answer, this is what I tried, and yet __getitem__ gets one index per call instead of a list of size batch_size:
class Dataset(Dataset):
    def __init__(self):
        ...

    def __len__(self):
        ...

    def __getitem__(self, batch_idx):  # ------> here I get only one index
        return self.wiki_df.loc[batch_idx]

loader = DataLoader(
    dataset=dataset,
    batch_sampler=BatchSampler(
        SequentialSampler(dataset),
        batch_size=self.hparams.batch_size,
        drop_last=False,
    ),
    num_workers=self.hparams.num_data_workers,
)

You can't use get_batch instead of __getitem__, and I don't see a point in doing it like that.
torch.utils.data.BatchSampler takes indices from your Sampler() instance (in this case, 3 of them) and returns them as a list, so they can be used in your MyDataset __getitem__ method (check the source code; most samplers and data-related utilities are easy to follow, should you need to).
I assume your self.ddf supports list slicing (e.g. self.ddf[[25, 44, 115]] returns the values correctly and uses only one expensive call). In that case, simply rename get_batch to __getitem__ and you are good to go:
class MyDataset(Dataset):
    def __init__(self, remote_ddf):
        self.ddf = remote_ddf

    def __len__(self):
        return len(self.ddf)

    def __getitem__(self, batch_idx):
        return self.ddf[batch_idx]  # batch_idx is a list
EDIT: You have to pass the BatchSampler as the sampler argument, not as batch_sampler, otherwise the batch will be divided into single indices. This should be fine:
loader = DataLoader(
    dataset=dataset,
    # This line below!
    sampler=BatchSampler(
        SequentialSampler(dataset),
        batch_size=self.hparams.batch_size,
        drop_last=False,
    ),
    num_workers=self.hparams.num_data_workers,
)
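Putting the pieces together, here is a minimal runnable sketch (a plain Python list stands in for the remote self.ddf, which is an assumption for illustration). Note that on recent PyTorch versions you may also want batch_size=None, which disables the loader's own automatic batching; otherwise the default batch_size=1 wraps each fetched batch in an extra leading dimension of size 1.

import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset, SequentialSampler

class MyDataset(Dataset):
    def __init__(self, remote_ddf):
        self.ddf = remote_ddf

    def __len__(self):
        return len(self.ddf)

    def __getitem__(self, batch_idx):
        # batch_idx is a list such as [0, 1, 2]: one "expensive" call per batch
        return torch.tensor([self.ddf[i] for i in batch_idx])

dataset = MyDataset(list(range(10)))  # stand-in for the remote data
loader = DataLoader(
    dataset,
    sampler=BatchSampler(SequentialSampler(dataset), batch_size=3, drop_last=False),
    batch_size=None,  # each sampled "index" is already a whole batch
)

for batch in loader:
    print(batch)  # tensor([0, 1, 2]), tensor([3, 4, 5]), ...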

Related

PyTorch Dataset / Dataloader from random source

I have a source of random (non-deterministic, non-repeatable) data that I'd like to wrap in a Dataset and DataLoader for PyTorch training. How can I do this?
__len__ is not defined, as the source is infinite (with possible repetition).
__getitem__ is not defined, as the source is non-deterministic.
When defining a custom dataset class, you'd ordinarily subclass torch.utils.data.Dataset and define __len__() and __getitem__().
However, for cases where you want sequential but not random access, you can use an iterable-style dataset. To do this, you instead subclass torch.utils.data.IterableDataset and define __iter__(). Whatever is returned by __iter__() should be a proper iterator; it should maintain state (if necessary) and define __next__() to obtain the next item in the sequence. __next__() should raise StopIteration when there's nothing left to read. In your case with an infinite dataset, it never needs to do this.
Here's an example:
import torch

class MyInfiniteIterator:
    def __next__(self):
        return torch.randn(10)

class MyInfiniteDataset(torch.utils.data.IterableDataset):
    def __iter__(self):
        return MyInfiniteIterator()

dataset = MyInfiniteDataset()
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

for step, batch in enumerate(dataloader):
    # ... do some stuff here ...
    # the dataset is infinite, so break out manually, e.g.:
    if step >= 100:
        break
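Since generator functions return iterators, __iter__() can equivalently be written as a generator, which avoids the separate iterator class:

import torch

class MyInfiniteDataset(torch.utils.data.IterableDataset):
    def __iter__(self):
        while True:
            yield torch.randn(10)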

Subclass of PyTorch DataLoader for changing batch output

I'm interested in a way of applying a transform to a batch generated by a PyTorch DataLoader class. My minimal example is something like this:
class CustomLoader(torch.utils.data.DataLoader):
    def __iter__(self):
        result = super().__iter__()
        return some_function(result)
But this errors, since DataLoader.__iter__() returns a _MultiProcessingDataLoaderIter or _SingleProcessDataLoaderIter. Weirdly though, directly returning the output does return a Tensor, so any explanation there would be greatly appreciated!
I understand that in general, transforms on the data should be done in the subclassed Dataset class. However, in my case the data is tabular and the transform uses numpy, and doing it sample-wise is much slower (5x) than doing it on an entire batch, since these operations are surely vectorized under the hood.
I know I can do something simple like

for X, y in loader:
    X = some_function(X)

But I'd also like to use the DataLoader with pytorch-lightning, so this isn't an option.
What is the proper way to subclass the PyTorch DataLoader?
__iter__() here needs to be a generator: you will need to yield each batch instead of returning the parent iterator.
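A minimal sketch of that, assuming some_function from the question operates on a whole collated batch:

import torch

class CustomLoader(torch.utils.data.DataLoader):
    def __iter__(self):
        # yield transformed batches instead of returning the parent iterator
        for batch in super().__iter__():
            yield some_function(batch)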
Regarding your goal of applying a transform to a batch, you can also create a custom Dataset (instead of subclassing DataLoader) and apply the transforms there:
class MyDataset(Dataset):
    def __init__(self, transforms=None):
        super().__init__()
        self.data = ...  # define your data here
        self.transforms = transforms

    def __getitem__(self, idx):
        x = self.data[idx]
        if self.transforms:
            x = self.transforms(x)
        return x

# use your `MyDataset` class for creating your dataloader
dataloader = DataLoader(MyDataset(transforms=CustomTransforms()), batch_size=4)
You can use this dataloader with the PyTorch Lightning Trainer as well.
If you are using PyTorch Lightning, I would suggest you join our Slack channel and ask questions on GitHub Discussions as well.
Thanks :)
EDIT: (Add transforms to Batch)
If you are using PyTorch Lightning, then I would recommend using LightningDataModule, which provides the on_before_batch_transfer hook that can be used to apply transforms to a batch ;)
Here is an example:

def on_before_batch_transfer(self, batch, dataloader_idx):
    batch['x'] = transforms(batch['x'])
    return batch

Check out the documentation for more.
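In context, the hook lives on the data module; a sketch, assuming (as above) that each batch is a dict with an 'x' key and that transforms is defined elsewhere:

import pytorch_lightning as pl
from torch.utils.data import DataLoader

class MyDataModule(pl.LightningDataModule):
    def train_dataloader(self):
        return DataLoader(MyDataset(), batch_size=4)

    def on_before_batch_transfer(self, batch, dataloader_idx):
        # runs on every batch before it is moved to the device
        batch['x'] = transforms(batch['x'])
        return batch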

python self keyword in a list

I am reading a tutorial about Keras and came across this class, which is inherited by another class.
class LearningRateDecay:
    def plot(self, epochs, title="Learning Rate Schedule"):
        lrs = [self(i) for i in epochs]
        plt.style.use("ggplot")
        plt.figure()
        plt.plot(epochs, lrs)
        plt.title(title)
        plt.xlabel("Epoch #")
        plt.ylabel("Learning Rate")
in which epochs is defined like this:

N = 100
epochs = np.arange(0, N)

and is passed to the plot function like this:

a = LearningRateDecay()
a.plot(epochs, "Learning Rate Schedule")

I can't understand what self(i) means. Is this about accessing self's elements or something else?
Any help is greatly appreciated.
Thank you
self(i) refers to calling an object, i.e. using its __call__ method.
For example, consider this class:
class MyClass:
    def __call__(self, arg):
        print(arg)

    def method(self):
        self('test')

a = MyClass()
a.method()  # Prints 'test'
a('other')  # Prints 'other'
In your example, this class probably has some defined behavior associated with __call__ (it can also be inherited from a parent class). Refer to the class documentation to learn about it.
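For instance, a subclass of LearningRateDecay might implement __call__ as a step schedule; this is a hypothetical sketch, and the tutorial's actual subclass may differ:

import numpy as np

class StepDecay(LearningRateDecay):
    def __init__(self, init_lr=0.01, factor=0.25, drop_every=10):
        self.init_lr = init_lr
        self.factor = factor
        self.drop_every = drop_every

    def __call__(self, epoch):
        # the learning rate drops by `factor` every `drop_every` epochs
        exp = np.floor((1 + epoch) / self.drop_every)
        return float(self.init_lr * (self.factor ** exp))

With such a subclass, self(i) inside plot simply evaluates the schedule at epoch i.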

Create wrapper to return particular values from an already existing function

Hi, I have a function called
tfnet.return_predict()
which, when run on an image, outputs a set of values: the class of the detected object, the confidence, and the coordinates of the bounding box. What I want to do is make a wrapper that returns only the confidence value.
My code is as follows. I am using Darkflow to perform prediction of classes on images.
# Initialise libraries

# Load the YOLO neural network
tfnet = TFNet(options)  # call the YOLO network

image = cv2.imread('C:/darkflow/Car.jpg', cv2.IMREAD_COLOR)  # load image
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

print(tfnet.return_predict(image))  # function to run predictions
The output of print is
[{'label': 'Car', 'confidence': 0.32647023, 'topleft': {'x': 98, 'y': 249}, 'bottomright': {'x': 311, 'y': 455}}]
From this output I want to create a wrapper that just returns the 'confidence' value.
I know how to create wrappers and define functions for them, but how do I do it for already defined functions?
Any suggestion is of great help to me.
EDIT: I tried:

def log_calls(tfnet.return_predict):
    def wrapper(*args, **kwargs):
        # name = func.__name__
        print('before {name} was called')
        r = func(*args, **kwargs)
        print('after {name} was called')
        return r
    return wrapper

But the tfnet.return_predict in the parameter list raises an error:
SyntaxError: invalid syntax
Do you need to redefine the tfnet.return_predict function to only return confidence? Or is having a separate function okay? If it's the latter, then it seems like you can just do this:
def conf_only(*args, **kwargs):
    out = tfnet.return_predict(*args, **kwargs)
    return out[0]["confidence"]
and calling conf_only returns just that part of the dict.
If you need to have tfnet.return_predict redefined and want that to only return confidence, then you can make a decorator:
def conf_deco(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)[0]["confidence"]
    return wrapper
For example, pretending dummy_function is already predefined:

def dummy_function(*args, **kwargs):
    print(args, kwargs)
    return [{"confidence": .32, "other": "asdf"}]
In [4]: dummy_function("something", kw='else')
('something',) {'kw': 'else'}
Out[4]: [{'confidence': 0.32, 'other': 'asdf'}]
Now redefine it with:
In [6]: dummy_function = conf_deco(dummy_function)
and it'll only return the confidence value
In [7]: dummy_function("something", kw='else')
('something',) {'kw': 'else'}
Out[7]: 0.32
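Applied to the original function, that would presumably be:

tfnet.return_predict = conf_deco(tfnet.return_predict)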

pytorch: how can I use picture as label in dataloader?

I want to do some image reconstruction using autoencoders in PyTorch; however, I haven't found a way to use an image as the label for an input image (the label image is different from the original one).
I've tried the ImageFolder method, but I think that's for classification, and I am currently unable to come up with a solution. Should I create a custom dataset for this?
Thanks in advance!
Write your custom Dataset; below is a simple example.

from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, input_imgs, label_imgs, transform):
        self.input_imgs = input_imgs
        self.label_imgs = label_imgs
        self.transform = transform

    def __len__(self):
        return len(self.input_imgs)

    def __getitem__(self, idx):
        input_img, label_img = self.input_imgs[idx], self.label_imgs[idx]
        return self.transform(input_img), self.transform(label_img)
And then pass it to a DataLoader:

dataloader = DataLoader(CustomDataset(input_imgs, label_imgs, transform), batch_size=4)
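Each batch then yields an (input, label) image pair, which is what a reconstruction loss needs; a minimal sketch of the training step, where model and criterion are placeholders for your autoencoder and loss:

for input_batch, label_batch in dataloader:
    output = model(input_batch)            # reconstructed image
    loss = criterion(output, label_batch)  # compare against the label image
    # ... backward pass, optimizer step, etc.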
