I have two datasets, but one is larger than the other and I want to subsample it (resample in each epoch).
I probably cannot use the DataLoader's sampler argument, as I would be passing the already concatenated dataset to the DataLoader.
How do I achieve this simply?
I think one solution would be to write a class SubsampledDataset(IterableDataset) which would resample every time __iter__ is called (each epoch).
(Or better use a map-style dataset, but is there a hook that gets called every epoch, like __iter__ gets?)
This is what I have so far (untested). Usage:
dataset1: Any = ...
# subsample original_dataset2, so that it is equally large in each epoch
dataset2 = RandomSampledDataset(original_dataset2, num_samples=len(dataset1))
concat_dataset = ConcatDataset([dataset1, dataset2])
data_loader = torch.utils.data.DataLoader(
    concat_dataset,
    sampler=RandomSamplerWithNewEpochHook(dataset2.new_epoch_hook, concat_dataset)
)
The result is that concat_dataset will be shuffled each epoch (RandomSampler); in addition, the dataset2 component is a new sample of the (possibly larger) original_dataset2, different in each epoch.
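For context, a minimal sketch of how this is driven (num_epochs and the training step are assumed and not part of the code above): each new iteration over data_loader calls the sampler's __iter__, which runs new_epoch_hook() and therefore resamples dataset2 before the epoch starts.
for epoch in range(num_epochs):
    for batch in data_loader:  # sampler.__iter__ fires new_epoch_hook() here
        ...  # training step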
You can add more datasets to be subsampled by replacing:
sampler=RandomSamplerWithNewEpochHook(dataset2.new_epoch_hook, ...
with a hook that calls every dataset's hook, e.g.:
sampler=RandomSamplerWithNewEpochHook(
    lambda: (dataset2.new_epoch_hook(), dataset3.new_epoch_hook(), dataset4.new_epoch_hook()),
    ...
(Note that chaining the hooks with the and operator would not call them; it would only evaluate the bound methods.)
Code:
from typing import Callable, Optional, Sized

import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset
from torch.utils.data.sampler import RandomSampler

class RandomSamplerWithNewEpochHook(RandomSampler):
    """Wraps torch.utils.data.RandomSampler and calls the supplied new_epoch_hook before each epoch."""

    def __init__(self, new_epoch_hook: Callable, data_source: Sized, replacement: bool = False,
                 num_samples: Optional[int] = None, generator=None):
        super().__init__(data_source, replacement, num_samples, generator)
        self.new_epoch_hook = new_epoch_hook

    def __iter__(self):
        self.new_epoch_hook()
        return super().__iter__()
class RandomSampledDataset(Dataset):
    """Subsamples a dataset; the sample is different in each epoch.

    This helps when concatenating datasets, as the subsampling rate can be different for each dataset.
    Call new_epoch_hook before each epoch (this can be done using e.g. RandomSamplerWithNewEpochHook).

    This would arguably be harder to achieve with a concatenated dataset and a sampler argument to the
    DataLoader: the sampler would have to be aware of the indices of the subdatasets' items in the
    concatenated dataset and of the subsampling for each subdataset.
    """

    def __init__(self, dataset, num_samples, transform=lambda im: im):
        self.dataset = dataset
        self.transform = transform
        self.num_samples = num_samples
        self.sampler = RandomSampler(dataset, num_samples=num_samples)
        self.current_epoch_samples = None

    def new_epoch_hook(self):
        # Draw a fresh set of indices into the underlying dataset for this epoch.
        # (torch.tensor cannot consume an iterator directly, hence the list().)
        self.current_epoch_samples = torch.tensor(list(self.sampler), dtype=torch.int)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, item):
        if item < 0 or item >= len(self):
            raise IndexError
        img = self.dataset[self.current_epoch_samples[item].item()]
        return self.transform(img)
You can stop the iteration by raising StopIteration. This exception is caught by the DataLoader, which simply stops the iteration. So you can do something like this:
import torch
from torch.utils.data import Dataset, DataLoader

class SubDataset(Dataset):
    """SubDataset class."""

    def __init__(self, dataset, length):
        self.dataset = dataset
        self.elem = 0
        self.length = length

    def __getitem__(self, index):
        self.elem += 1
        if self.elem > self.length:
            self.elem = 0
            raise StopIteration  # caught by DataLoader
        return self.dataset[index]

    def __len__(self):
        return len(self.dataset)
if __name__ == '__main__':
    torch.manual_seed(0)
    dataloader = DataLoader(SubDataset(torch.arange(10), 5), shuffle=True)
    for _ in range(3):
        for x in dataloader:
            print(x)
    print(len(dataloader))  # 10!!
Output:
Note that setting __len__ to self.length will cause a problem, because the DataLoader will then only use indices between 0 and length-1 (which is not what you want). Unfortunately, I found no way to set the actual length without this behaviour (due to a DataLoader restriction). So be careful: len(dataset) is the original length and dataset.length is the new length.
I am trying to create a data pipeline for U-Net for image segmentation. I came across the keras.utils.Sequence class, through which I can create a data pipeline, but I am unable to understand how it works.
Links for the code: Keras code, source code.
def __iter__(self):
    """Create a generator that iterate over the Sequence."""
    for item in (self[i] for i in range(len(self))):
        yield item
I would highly appreciate it if anyone could tell me how this works.
You don't need a generator; the Sequence class is there to manage that. You need to define a class that inherits from tensorflow.keras.utils.Sequence and define the methods __init__, __getitem__, and __len__. In addition, you can define the method on_epoch_end, which is called at the end of each epoch and is usually used to shuffle the sample indexes.
There is an example in the link you gave (TensorFlow Sequence).
Below is another example of Sequence.
Note that you can pass the data to the __init__ constructor, but you may as well read the data from files in the __getitem__ method, assuming you know where to read it, e.g. by passing the name of a directory or directories into the constructor. This is necessary if there is a lot of data.
from tensorflow import keras
import numpy as np

class SequenceExample(keras.utils.Sequence):

    def __init__(self, x_in, y_in, batch_size, shuffle=True):
        # Initialization
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.x = x_in
        self.y = y_in
        self.datalen = len(y_in)
        self.indexes = np.arange(self.datalen)
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __getitem__(self, index):
        # get batch indexes from shuffled indexes
        batch_indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        x_batch = self.x[batch_indexes]
        y_batch = self.y[batch_indexes]
        return x_batch, y_batch

    def __len__(self):
        # Denotes the number of batches per epoch
        return self.datalen // self.batch_size

    def on_epoch_end(self):
        # Updates indexes after each epoch
        self.indexes = np.arange(self.datalen)
        if self.shuffle:
            np.random.shuffle(self.indexes)
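A minimal usage sketch (the model and the x_train/y_train arrays are assumed and not defined above): Keras calls __len__ to know how many batches make up an epoch, __getitem__(i) to fetch batch i, and on_epoch_end once each epoch finishes.
train_seq = SequenceExample(x_train, y_train, batch_size=32)
model.fit(train_seq, epochs=10)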
I am trying to train a classifier on the MNIST dataset using pytorch-lightning.
import pytorch_lightning as pl
from torchvision import transforms
from torchvision.datasets import MNIST, SVHN
from torch.utils.data import DataLoader, random_split

class MNISTData(pl.LightningDataModule):

    def __init__(self, data_dir='./', batch_size=256):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.transform = transforms.ToTensor()

    def download(self):
        MNIST(self.data_dir, train=True, download=True)
        MNIST(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        if stage == 'fit' or stage is None:
            mnist_train = MNIST(self.data_dir, train=True, transform=self.transform)
            self.mnist_train, self.mnist_val = random_split(mnist_train, [55000, 5000])
        if stage == 'test' or stage is None:
            self.mnist_test = MNIST(self.data_dir, train=False, transform=self.transform)

    def train_dataloader(self):
        mnist_train = DataLoader(self.mnist_train, batch_size=self.batch_size)
        return mnist_train

    def val_dataloader(self):
        mnist_val = DataLoader(self.mnist_val, batch_size=self.batch_size)
        return mnist_val

    def test_dataloader(self):
        mnist_test = DataLoader(self.mnist_test, batch_size=self.batch_size)
After using MNISTData().setup(), I got MNISTData().mnist_train, MNISTData().mnist_val, MNISTData().mnist_test, whose lengths are 55000, 5000, and 10000, of type torch.utils.data.dataset.Subset.
But when I call the dataloaders via MNISTData().train_dataloader, MNISTData().val_dataloader, MNISTData().test_dataloader, I only get DataLoaders with 215, 20, and None items in them.
Does someone know the reason, or could someone fix the problem?
As I said in the comments, and as Ivan posted in his answer, there was a missing return statement:
def test_dataloader(self):
    mnist_test = DataLoader(self.mnist_test, batch_size=self.batch_size)
    return mnist_test  # <<< missing return
As per your comment, if we try:
a = MNISTData()
# skip download, assuming you already have it
a.setup()
b, c, d = a.train_dataloader(), a.val_dataloader(), a.test_dataloader()
# len(b)=215, len(c)=20, len(d)=40
I think your question is why the lengths of b, c, and d differ from the lengths of the datasets. The answer is that the len() of a DataLoader is equal to the number of batches, not the number of samples, therefore:
import math
batch_size = 256
len(b) = math.ceil(55000 / batch_size) = 215
len(c) = math.ceil(5000 / batch_size) = 20
len(d) = math.ceil(10000 / batch_size) = 40
BTW, we're using math.ceil because DataLoader has drop_last=False by default, otherwise it would be math.floor.
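For example, continuing the snippet above, building the train loader with drop_last=True would discard the incomplete final batch:
loader = DataLoader(a.mnist_train, batch_size=256, drop_last=True)
len(loader)  # math.floor(55000 / 256) = 214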
Your test_dataloader function is missing a return statement!
def test_dataloader(self):
    mnist_test = DataLoader(self.mnist_test, batch_size=self.batch_size)
    return mnist_test
>>> ds = MNISTData()
>>> ds.download()
>>> ds.setup()
Then:
>>> [len(subset) for subset in \
(ds.mnist_train, ds.mnist_val, ds.mnist_test)]
[55000, 5000, 10000]
>>> [len(loader) for loader in \
(ds.train_dataloader(), ds.val_dataloader(), ds.test_dataloader())]
[215, 20, 40]
Others pointing out the fact that you are missing a return in test_dataloader() are certainly correct.
Judging by how the question is framed, it seems you are confused about the length of a Dataset and a DataLoader.
len(Dataset(..)) returns the number of data samples in your dataset.
whereas len(DataLoader(ds, ...)) returns the number of batches; and that depends on the batch_size=... you requested, whether you want to drop the last incomplete batch, etc. The exact calculations are provided correctly by @Berriel.
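A tiny illustration of the difference with a toy dataset (the numbers are arbitrary):
import torch
from torch.utils.data import DataLoader

ds = torch.arange(100)          # 100 samples
loader = DataLoader(ds, batch_size=32)
print(len(ds))      # 100 samples
print(len(loader))  # 4 batches, i.e. ceil(100 / 32)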
I am new to PyTorch and I am learning to create batches of data for segmentation. The code is shown below:
class NumbersDataset(Dataset):
    def __init__(self):
        self.X = list(df['input_img'])
        self.y = list(df['mask_img'])

    def __len__(self):
        return len(self.X), len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

if __name__ == '__main__':
    dataset = NumbersDataset()
    dataloader = DataLoader(dataset, batch_size=50, shuffle=True, num_workers=2)
    # print(len(dataset))
    # plt.imshow(dataset[100])
    # plt.show()
    print(next(iter(dataloader)))
where the df['input_img'] column contains the location of the image ('/path/to/pic/480p/boxing-fisheye/00010.jpg') and df['mask_img'] contains the locations of all the mask images. I am trying to load the images, but I get the error:
TypeError: 'tuple' object cannot be interpreted as an integer
However, if I don't use DataLoader and just do the following:
dataset = NumbersDataset()
print(len(dataset))
print(dataset[10:20])
then I get what I expect. Can someone tell me what I am doing wrong?
You cannot return a tuple from the __len__ method; the expected type is int.
# No matter how you choose to implement the method, you can only return a
# single value of type `int`. Since self.X and self.y are paired (image and
# mask), the dataset length is simply len(self.X); summing the two lengths
# would let the DataLoader request indices past the end of the lists.
def __len__(self):
    return len(self.X)
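As a side note (beyond the __len__ fix), since the columns hold file paths, a possible sketch of a __getitem__ that actually loads the image/mask pair could look like the following; the use of PIL and to_tensor here is an assumption about how you want to read the files:
from PIL import Image
from torchvision.transforms.functional import to_tensor

def __getitem__(self, idx):
    # open the image and its mask from the stored paths and convert to tensors
    img = Image.open(self.X[idx]).convert('RGB')
    mask = Image.open(self.y[idx])
    return to_tensor(img), to_tensor(mask)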
I have a dataset of 3-D (time_step x inputsize x total_num) matrices stored in a .mat file. I want to use DataLoader to get an input dataset for an LSTM with batch_size 5. My code is as follows:
import scipy.io as s
import torch
from torch.utils.data import Dataset, DataLoader

file_path = "…/database/frameLength100/notOverlap/a.mat"
mat_data = s.loadmat(file_path)
tensor_data = torch.from_numpy(mat_data['a'])  # Tensor

class CustomDataset(Dataset):
    def __init__(self, tensor_data):
        self.tensor_data = tensor_data

    def __getitem__(self, index):
        data = self.tensor_data[index]
        label = 1
        return data, label

    def __len__(self):
        return len(self.tensor_data)

custom_dataset = CustomDataset(tensor_data=tensor_data)
train_loader = DataLoader(dataset=custom_dataset, batch_size=5, shuffle=True)
I think the code is wrong but I have no idea how to correct it. What confuses me is how I can make the DataLoader know which dimension is 'total_num' so that I get batches of size 5.
If I understand correctly, you want the batching to happen along the total_num dimension, i.e. dimension 2.
You could simply use that dimension to index your dataset, i.e. change __getitem__ to data = self.tensor_data[:, :, index], and accordingly in __len__, return self.tensor_data.size(2) instead of len(self.tensor_data). Each batch will then have size [5, time_step, inputsize], since the DataLoader's default collation stacks the samples along a new first dimension.
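A sketch of the dataset with just those two changes applied (everything else is your original code):
class CustomDataset(Dataset):
    def __init__(self, tensor_data):
        self.tensor_data = tensor_data

    def __getitem__(self, index):
        # index along dimension 2 (total_num)
        data = self.tensor_data[:, :, index]
        label = 1
        return data, label

    def __len__(self):
        # number of samples along the total_num dimension
        return self.tensor_data.size(2)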
I'm new to Python and Keras, and I have successfully built a neural network that saves weight files after every epoch. However, I want more granularity (I'm visualizing layer weight distributions as a time series) and would like to save the weights after every N batches, rather than after every epoch.
Does anyone have any suggestions?
You can create your own callback (https://keras.io/callbacks/). Something like:
from keras.callbacks import Callback

class WeightsSaver(Callback):
    def __init__(self, N):
        self.N = N
        self.batch = 0

    def on_batch_end(self, batch, logs={}):
        if self.batch % self.N == 0:
            name = 'weights%08d.h5' % self.batch
            self.model.save_weights(name)
        self.batch += 1
I use self.batch instead of the batch argument provided because the latter restarts at 0 at each epoch.
Then add it to your fit call. For example, to save weights every 5 batches:
model.fit(X_train, Y_train, callbacks=[WeightsSaver(5)])
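If you are on tf.keras rather than standalone Keras, the same idea works with the on_train_batch_end hook; a hedged variant (the filename pattern is just an example):
from tensorflow.keras.callbacks import Callback

class WeightsSaver(Callback):
    def __init__(self, N):
        super().__init__()
        self.N = N
        self.batch_count = 0

    def on_train_batch_end(self, batch, logs=None):
        # count batches across epochs and save every N-th one
        if self.batch_count % self.N == 0:
            self.model.save_weights('weights%08d.h5' % self.batch_count)
        self.batch_count += 1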
As stated by grovina, you can create your own callback.
https://keras.io/callbacks/
Look at the source of the callbacks here: https://github.com/keras-team/keras/blob/master/keras/callbacks.py#L148
Under the Callback class there are numerous hooks for your desired requirement. In your case, if you want to save the weights every N epochs, define the function on_epoch_end.
Sample code:
class WeightsSaver(Callback):
    def __init__(self, N):
        self.N = N
        self.epoch = 0

    def on_epoch_end(self, epoch, logs={}):
        if self.epoch % self.N == 0:
            name = 'weights%04d.hdf5' % self.epoch
            self.model.save_weights(name)
        self.epoch += 1

callbacks_list = [WeightsSaver(10)]  # save weights every 10 epochs
model.fit(train_X, train_Y, epochs=n_epochs, batch_size=batch_size, callbacks=callbacks_list)