PyTorch DataLoader: complete dataset in RAM

I was wondering whether the PyTorch DataLoader can also fetch the complete dataset into RAM, so that performance does not suffer when enough RAM is available.

A concrete example of the answer below:
import torch
from torch.utils.data import DataLoader

class mydataset(torch.utils.data.Dataset):
    def __init__(self, data):
        # `data` is a dict of arrays/tensors that already lives entirely in RAM
        self.data = data

    def __getitem__(self, index):
        return self.data['x'][index, :], self.data['y'][index, :]

    def __len__(self):
        return self.data['x'].shape[0]

torch_data_train = mydataset(data_train)
dataload_train = DataLoader(torch_data_train, batch_size=batch_size, shuffle=True, num_workers=2)

You can extend torch.utils.data.Dataset and create your own Dataset implementation. In the __init__ method of your custom dataset you can load all the data into a list or any other data structure, which will be held fully in RAM. __getitem__ then only accesses that structure and returns a single item.
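For illustration, data_train in the example above could be built once from files on disk and kept entirely in RAM; the .npy file names here are assumptions, not part of the original answer:
import numpy as np
import torch

# Hypothetical: load the full arrays into RAM once, up front
data_train = {
    'x': torch.from_numpy(np.load('x_train.npy')).float(),
    'y': torch.from_numpy(np.load('y_train.npy')).float(),
}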

Related

How may I integrate Pyarrow with Pytorch Dataset when dataset is too large to load into memory at once?

A typical PyTorch Dataset implementation can be as follows:
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data[:][:-1]
        self.target = data[:][-1]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        return x, y
The reason I want to implement it as a Dataset is that I want to use the PyTorch DataLoader later for my mini-batch training.
However, if the data comes from a directory containing multiple parquet files, how can I write def __getitem__(self, index): for it without loading all the data into memory? I know Pyarrow is good at loading data in batches, but I didn't find a good reference to make it work. Any suggestions? Thank you in advance!
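One possible approach, sketched under the assumption that each parquet file holds numeric feature columns plus a 'label' column (the directory layout and column name are hypothetical): stream record batches with pyarrow's ParquetFile.iter_batches inside an IterableDataset, so only one chunk is ever in memory at a time.
import glob
import torch
import pyarrow.parquet as pq
from torch.utils.data import IterableDataset, DataLoader

class ParquetIterableDataset(IterableDataset):
    def __init__(self, parquet_dir, batch_size=1024):
        self.files = sorted(glob.glob(f"{parquet_dir}/*.parquet"))
        self.batch_size = batch_size

    def __iter__(self):
        for path in self.files:
            pf = pq.ParquetFile(path)
            # iter_batches reads one chunk at a time, so only a single
            # record batch is materialised in memory
            for record_batch in pf.iter_batches(batch_size=self.batch_size):
                df = record_batch.to_pandas()
                x = torch.tensor(df.drop(columns=['label']).values, dtype=torch.float32)
                y = torch.tensor(df['label'].values, dtype=torch.float32)
                yield x, y

# batch_size=None because the dataset already yields full mini-batches
loader = DataLoader(ParquetIterableDataset('data/train'), batch_size=None)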

PyTorch: Apply data augmentation on training data after random_split

I have a dataset that does not have separate folders for training and testing. I want to apply data augmentation with transforms only on the training data after doing the split:
train_data, valid_data = D.random_split(dataset, lengths=[train_size, valid_size])
Does anyone know how this can be achieved? I have a custom dataset with __init__ and __getitem__ defined. The training and validation datasets are then passed on to the DataLoader.
You can have a custom Dataset just for the transformations:
from torch.utils.data import Dataset

class TrDataset(Dataset):
    def __init__(self, base_dataset, transformations):
        super(TrDataset, self).__init__()
        self.base = base_dataset
        self.transformations = transformations

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        x, y = self.base[idx]
        return self.transformations(x), y
Once you have this Dataset wrapper, you can have different transformations for the train and validation sets:
raw_train_data, raw_valid_data = D.random_split(dataset, lengths=[train_size, valid_size])
train_data = TrDataset(raw_train_data, train_transforms)
valid_data = TrDataset(raw_valid_data, val_transforms)
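For reference, the two pipelines could look something like this with torchvision; the specific augmentations, the crop size, and the assumption that the base dataset yields PIL images are all hypothetical:
from torchvision import transforms

# Augmentation only on the training side; validation just converts to tensors
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
])
val_transforms = transforms.Compose([
    transforms.ToTensor(),
])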

How does keras.utils.Sequence work?

I am trying to create a data pipeline for a U-Net for image segmentation. I came across the keras.utils.Sequence class, through which I can create a data pipeline, but I am unable to understand how it works.
Links for the code: Keras code, Source code
def __iter__(self):
    """Create a generator that iterate over the Sequence."""
    for item in (self[i] for i in range(len(self))):
        yield item
I would highly appreciate it if anyone could tell me how this works.
You don't need a generator; the Sequence class is there to manage that. You need to define a class inherited from tensorflow.keras.utils.Sequence and define the methods __init__, __getitem__ and __len__. In addition, you can define the method on_epoch_end, which is called at the end of each epoch and is usually used to shuffle the sample indexes.
There is an example in the link you gave (TensorFlow Sequence).
Below is another example of Sequence.
Note that you can pass the data to the __init__ constructor, but you may as well read the data from files in the __getitem__ method, assuming you know where to read it, e.g. by passing the name of a directory or directories into the constructor. This is necessary if there is a lot of data.
from tensorflow import keras
import numpy as np

class SequenceExample(keras.utils.Sequence):
    def __init__(self, x_in, y_in, batch_size, shuffle=True):
        # Initialization
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.x = x_in
        self.y = y_in
        self.datalen = len(y_in)
        self.indexes = np.arange(self.datalen)
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __getitem__(self, index):
        # Get batch indexes from the shuffled indexes
        batch_indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        x_batch = self.x[batch_indexes]
        y_batch = self.y[batch_indexes]
        return x_batch, y_batch

    def __len__(self):
        # Denotes the number of batches per epoch
        return self.datalen // self.batch_size

    def on_epoch_end(self):
        # Updates indexes after each epoch
        self.indexes = np.arange(self.datalen)
        if self.shuffle:
            np.random.shuffle(self.indexes)
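Continuing the example above, a hypothetical usage sketch (the array shapes, model and hyperparameters are assumptions, not from the original answer): the Sequence is passed straight to model.fit, which calls __len__, __getitem__ and on_epoch_end for you.
# Hypothetical data and model, just to show how the Sequence is consumed
x = np.random.rand(1000, 8).astype('float32')
y = np.random.rand(1000, 1).astype('float32')

model = keras.Sequential([keras.layers.Dense(16, activation='relu'), keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

seq = SequenceExample(x, y, batch_size=32)
model.fit(seq, epochs=5)  # on_epoch_end reshuffles the indexes between epochs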

Load data into GPU directly using PyTorch

In my training loop, I load a batch of data onto the CPU and then transfer it to the GPU:
import torch.utils as utils

train_loader = utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=4, pin_memory=True)
for inputs, labels in train_loader:
    inputs, labels = inputs.to(device), labels.to(device)
This way of loading data is very time-consuming. Is there any way to load the data directly onto the GPU without the transfer step?
@PeterJulian, first of all, thanks for the reply. As far as I know, there is no single-line command for loading a whole dataset onto the GPU. Actually, in my reply I meant to use .to(device) in the __init__ of the dataset. There are some examples in the link I shared previously. I also left an example data loader code below. Hope both the examples in the link and the code below help.
import torch
from torch.utils.data import Dataset

class SampleDataset(Dataset):
    def __init__(self, device='cuda'):
        super(SampleDataset, self).__init__()
        # The whole tensor is created once and moved to the GPU up front
        self.data = torch.ones(1000)
        self.data = self.data.to(device)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        element = self.data[i]
        return element
You can load all the data into one tensor and then move it to GPU memory (assuming you have enough GPU memory). When you need an item, you use the one inside the tensor, which is already in GPU memory. Hope it helps.
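One caveat, added here as an assumption rather than part of the answer above: if the dataset already holds CUDA tensors, the DataLoader should run in the main process and skip pinning, roughly like this:
loader = torch.utils.data.DataLoader(
    SampleDataset(device='cuda'),
    batch_size=128,
    shuffle=True,
    num_workers=0,     # CUDA tensors are awkward to share across worker processes
    pin_memory=False,  # pinning applies only to CPU tensors
)
for batch in loader:
    pass  # batch is already on the GPU; no .to(device) copy is needed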

PyTorch Data Won't Fit in Memory - Example?

I am trying to find an example of training in PyTorch in batches from data held on disk, akin to the Keras fit_generator. How would I alter the code below to read the CSV from disk instead of loading it into memory?
I have found that one can iterate over a custom data loader like the one below, but I am unsure how to do this without loading all the data into memory.
I would like to:
Train and validate the model with data held on disk
Use mini-batches of the full data on disk
Repeat x epochs
import torch
import torch.utils.data as utils_data
from torch.utils.data import Dataset
from sklearn.datasets import load_boston

class testLoader(Dataset):
    def __init__(self):
        # Regular old numpy: the whole Boston dataset is loaded into memory here
        boston = load_boston()
        x = boston.data
        y = boston.target
        self.x = torch.from_numpy(x)
        self.y = torch.from_numpy(y)
        self.length = x.shape[0]
        self.vars = x.shape[1]

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.length

training_samples = testLoader()
train_loader = utils_data.DataLoader(training_samples, batch_size=64, shuffle=True)
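One way this could be answered, sketched as an assumption rather than an accepted answer: index the byte offset of every CSV row once in __init__, then seek and parse a single row per __getitem__, so the DataLoader only ever materialises one mini-batch of rows. The file name 'train.csv' and the convention that the last column is the target are hypothetical.
import torch
from torch.utils.data import Dataset, DataLoader

class CsvOnDiskDataset(Dataset):
    def __init__(self, csv_path, skip_header=True):
        self.csv_path = csv_path
        self.offsets = []
        with open(csv_path, 'rb') as f:
            if skip_header:
                f.readline()
            offset = f.tell()
            for line in f:
                # Record where each row starts; only the offsets are kept in memory
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        # Read exactly one row from disk on demand (assumes a purely numeric CSV)
        with open(self.csv_path, 'rb') as f:
            f.seek(self.offsets[index])
            values = [float(v) for v in f.readline().decode().strip().split(',')]
        x = torch.tensor(values[:-1], dtype=torch.float32)
        y = torch.tensor(values[-1:], dtype=torch.float32)
        return x, y

training_samples = CsvOnDiskDataset('train.csv')
train_loader = DataLoader(training_samples, batch_size=64, shuffle=True, num_workers=2)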
