PyTorch Data Won't Fit in Memory - Example? - pytorch

I am trying to find an example of training in PyTorch in batches from data on disk, akin to the Keras fit_generator. How would I alter the code below to read the CSV from disk instead of loading it into memory?
I have found that one can iterate over a custom data loader like the one below, but I am unsure how to do this without loading all the data into memory.
I would like to:
Train and validate the model with data held on disk
Use mini-batches of the full data on disk
Repeat x epochs
import torch
import torch.utils.data as utils_data
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older release
from torch.utils.data import Dataset

class testLoader(Dataset):
    def __init__(self):
        # regular old numpy
        boston = load_boston()
        x = boston.data
        y = boston.target
        self.x = torch.from_numpy(x)
        self.y = torch.from_numpy(y)
        self.length = x.shape[0]
        self.vars = x.shape[1]

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.length

training_samples = testLoader()
train_loader = utils_data.DataLoader(training_samples, batch_size=64, shuffle=True)
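One possible way to get the behaviour asked for here is a map-style Dataset that records only the byte offset of each CSV row up front and parses a single row on demand in __getitem__. The sketch below is an illustration under stated assumptions, not an established recipe: LazyCsvDataset is a hypothetical name, the file is assumed to be purely numeric with one header row and no quoted newlines, and the target is assumed to sit in the last column. The small temporary CSV only stands in for a file that is too large for RAM.

```python
import csv
import os
import tempfile

import torch
from torch.utils.data import DataLoader, Dataset


class LazyCsvDataset(Dataset):
    """Map-style dataset that keeps only row byte offsets in memory.

    Hypothetical helper: assumes a numeric CSV with one header row,
    no quoted newlines, and the target in the last column.
    """

    def __init__(self, path):
        self.path = path
        self.offsets = []
        # Single pass over the file to record where each data row starts.
        with open(path, "rb") as f:
            f.readline()  # skip the header row
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if line.strip():
                    self.offsets.append(pos)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        # Seek straight to the requested row and parse only that line.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[index])
            line = f.readline().decode("utf-8")
        values = [float(v) for v in next(csv.reader([line]))]
        x = torch.tensor(values[:-1], dtype=torch.float32)
        y = torch.tensor(values[-1], dtype=torch.float32)
        return x, y


# Demo on a small throwaway CSV (stand-in for a file too big for RAM).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "data.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["a", "b", "target"])
    for i in range(10):
        writer.writerow([i, 2 * i, 3 * i])

dataset = LazyCsvDataset(csv_path)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
for epoch in range(2):  # repeat x epochs
    for x_batch, y_batch in loader:
        pass  # forward/backward pass would go here
```

Only the list of offsets lives in memory, so shuffle=True still works across the whole file. Reopening the file for every row is simple and safe with worker processes, but it is not the fastest option; a separate instance over a validation file would cover the validation split.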

Related

How may I integrate PyArrow with a PyTorch Dataset when the dataset is too large to load into memory at once?

A typical PyTorch Dataset implementation can be as follows:
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data):
        # data is assumed to be a 2-D array whose last column is the target
        self.data = data[:, :-1]
        self.target = data[:, -1]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        return x, y
The reason I want to implement it as a Dataset is that I want to use the PyTorch DataLoader later for my mini-batch training.
However, if the data comes from a directory containing multiple Parquet files, how can I write def __getitem__(self, index): for it without loading all the data into memory? I know PyArrow is good at loading data in batches, but I didn't find a good reference to make it work. Any suggestions? Thank you in advance!

Changing the checkpoint path of lr_find

I want to tune the learning rate for my PyTorch Lightning model. My code runs on a GPU cluster, so I can only write to certain folders that I bind mount. However, trainer.tuner.lr_find tries to write the checkpoint to the folder where my script runs and since this folder is not writable, it fails with the following error:
OSError: [Errno 30] Read-only file system: '/opt/xrPose/.lr_find_43df1c5c-0aed-4205-ac56-2fe4523ca4a7.ckpt'
Is there any way to change the checkpoint path for lr_find? I checked the documentation, but I couldn't find any information on that in the part related to checkpointing.
My code is below:
res = trainer.tuner.lr_find(model, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader, min_lr=1e-5)
logging.info(f"suggested learning rate: {res.suggestion()}")
model.hparams.learning_rate = res.suggestion()
You may need to specify default_root_dir when initializing the Trainer:
trainer = Trainer(default_root_dir='./my_dir')
Description from the Official Documentation:
default_root_dir - Default path for logs and weights when no logger or pytorch_lightning.callbacks.ModelCheckpoint callback passed.
Code example:
import numpy as np
import torch
from pytorch_lightning import LightningModule, Trainer
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    def __init__(self) -> None:
        super().__init__()

    def __getitem__(self, index):
        x = np.zeros((10,), np.float32)
        y = np.zeros((1,), np.float32)
        return x, y

    def __len__(self):
        return 100

class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = torch.nn.MSELoss()(y_hat, y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

model = MyModel()
trainer = Trainer(default_root_dir='./my_dir')
train_dataloader = DataLoader(MyDataset())
trainer.tuner.lr_find(model, train_dataloader)
The checkpoint path is defined in lr_finder.py as:
# Save initial model, that is loaded after learning rate is found
ckpt_path = os.path.join(trainer.default_root_dir, f".lr_find_{uuid.uuid4()}.ckpt")
trainer.save_checkpoint(ckpt_path)
The only way of changing the directory for saving the checkpoint is to change the default_root_dir. But be aware that this is also the directory that the lightning logs are saved to.
You can easily change it with trainer = Trainer(default_root_dir='./NAME_OF_THE_DIR').

PyTorch Dataloader: Dataset complete in RAM

I was wondering if the PyTorch DataLoader can also fetch the complete dataset into RAM, so that performance does not suffer if there is enough RAM available?
A concrete example of the previous answer:
class mydataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        return self.data['x'][index,:], self.data['y'][index,:]

    def __len__(self):
        return self.data['x'].shape[0]

torch_data_train = mydataset(data_train)
dataload_train = DataLoader(torch_data_train, batch_size=batch_size, shuffle=True, num_workers=2)
You can extend torch.utils.data.Dataset and create your own Dataset implementation. In the __init__ function of your custom dataset you can then load all the data into a list or any other data structure, which will then be held fully in RAM. __getitem__ will then only access that structure and return a single item.

How keras.utils.Sequence works?

I am trying to create a data pipeline for U-Net image segmentation. I came across the keras.utils.Sequence class, with which I can create a data pipeline, but I am unable to understand how it works.
Links to the code: Keras code, source code
def __iter__(self):
    """Create a generator that iterate over the Sequence."""
    for item in (self[i] for i in range(len(self))):
        yield item
I would highly appreciate it if anyone could tell me how this works.
You don't need a generator. The sequence class is there to manage that. You need to define a class inherited from tensorflow.keras.utils.Sequence and define the methods:
__init__, __getitem__, __len__. In addition, you can define the method on_epoch_end, which is called at the end of each epoch and is usually used to shuffle the sample indexes.
There is an example at the link you gave (TensorFlow Sequence).
Below is another example of Sequence.
Note that you can pass the data to the __init__ constructor, but you may as well read the data from files in the __getitem__ method, assuming you know where to read it, e.g. by passing the name of a directory or directories into the constructor. This is necessary if there is a lot of data.
from tensorflow import keras
import numpy as np

class SequenceExample(keras.utils.Sequence):
    def __init__(self, x_in, y_in, batch_size, shuffle=True):
        # Initialization
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.x = x_in
        self.y = y_in
        self.datalen = len(y_in)
        self.indexes = np.arange(self.datalen)
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __getitem__(self, index):
        # get batch indexes from shuffled indexes
        batch_indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        x_batch = self.x[batch_indexes]
        y_batch = self.y[batch_indexes]
        return x_batch, y_batch

    def __len__(self):
        # Denotes the number of batches per epoch
        return self.datalen // self.batch_size

    def on_epoch_end(self):
        # Updates indexes after each epoch
        self.indexes = np.arange(self.datalen)
        if self.shuffle:
            np.random.shuffle(self.indexes)
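To see how the __iter__ method quoted in the question fits together with __len__ and __getitem__, it may help to strip away Keras entirely. The class below is a dependency-free stand-in written for illustration; MiniSequence is a hypothetical name and this is not the Keras implementation. It copies only the shape of the quoted __iter__: ask __len__ how many batches exist, then call __getitem__ once per batch index and yield the result.

```python
class MiniSequence:
    """Dependency-free stand-in that mimics the Sequence protocol.

    Hypothetical class for illustration only; it is not the Keras
    implementation.
    """

    def __init__(self, items, batch_size):
        self.items = list(items)
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches, counting a final partial batch (ceiling division).
        return -(-len(self.items) // self.batch_size)

    def __getitem__(self, index):
        # Return the batch at position `index`.
        start = index * self.batch_size
        return self.items[start:start + self.batch_size]

    def __iter__(self):
        # Same shape as the method quoted in the question: visit every
        # batch index in order and yield what __getitem__ returns.
        for item in (self[i] for i in range(len(self))):
            yield item


seq = MiniSequence(range(7), batch_size=3)
batches = list(seq)
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Unlike SequenceExample above, which uses floor division in __len__ and therefore drops a trailing partial batch, this sketch rounds up so that the last short batch is kept.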

Load data into GPU directly using PyTorch

In my training loop, I load a batch of data onto the CPU and then transfer it to the GPU:
import torch.utils as utils
train_loader = utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=4, pin_memory=True)
for inputs, labels in train_loader:
    inputs, labels = inputs.to(device), labels.to(device)
This way of loading data is very time-consuming. Is there any way to load data directly onto the GPU without the transfer step?
@PeterJulian, first of all, thanks for the reply. As far as I know, there is no single-line command for loading a whole dataset onto the GPU. In my reply I actually meant using .to(device) in the __init__ of the dataset. There are some examples in the link I shared previously, and I have left an example dataset below. I hope both the examples in the link and the code below help.
class SampleDataset(Dataset):
    def __init__(self, device='cuda'):
        super(SampleDataset, self).__init__()
        self.data = torch.ones(1000)
        self.data = self.data.to(device)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        element = self.data[i]
        return element
You can load all the data into a tensor and then move it to GPU memory (assuming you have enough memory). When you need it, index into that tensor, which is already in GPU memory. Hope it helps.
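The idea of keeping everything in one device-resident tensor can also be used without a DataLoader at all, by slicing mini-batches from a shuffled index. The snippet below is a minimal sketch under assumptions: the random features and labels are toy data invented for illustration, the dataset is assumed to fit in device memory, and the code falls back to the CPU when no GPU is available.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy data standing in for a dataset small enough to fit on the device.
features = torch.randn(1000, 8)
labels = torch.randint(0, 2, (1000,))

# One-time transfer; after this no per-batch host-to-device copy is needed.
features = features.to(device)
labels = labels.to(device)

batch_size = 128
batch_sizes = []
for epoch in range(2):
    # Shuffle by indexing with a random permutation created on the same device.
    perm = torch.randperm(features.shape[0], device=device)
    for start in range(0, features.shape[0], batch_size):
        idx = perm[start:start + batch_size]
        inputs, targets = features[idx], labels[idx]
        batch_sizes.append(inputs.shape[0])
        # forward/backward pass would go here
```

If a Dataset like SampleDataset above is wrapped in a DataLoader instead, it is probably safest to keep num_workers=0, since sharing CUDA tensors with worker processes tends to cause problems.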
