How may I integrate Pyarrow with Pytorch Dataset when dataset is too large to load into memory at once? - pytorch

A typical Pytorch Dataset implementation can be as follows:
import torch
from torch.utils.data import Dataset
class CustomDataset(Dataset):
def __init__(self, data):
self.data = data[:][:-1]
self.target = data[:][-1]
def __len__(self):
return len(self.data)
def __getitem__(self, index):
x = self.data[index]
y = self.target[index]
return x, y
The reason I want to implement it in Dataset is because I want to use pytorch DataLoader later for my mini-batch training.
However, if data is from a directory containing multiple parquet files, how can I write def __getitem__(self, index): for it with loading all data into memory? I know Pyarrow is good at loading data in bathes but I didn't find a good reference to make it work. Any suggestions? Thank you in advance!

Related

Changing the checkpoint path of lr_find

I want to tune the learning rate for my PyTorch Lightning model. My code runs on a GPU cluster, so I can only write to certain folders that I bind mount. However, trainer.tuner.lr_find tries to write the checkpoint to the folder where my script runs and since this folder is not writable, it fails with the following error:
OSError: [Errno 30] Read-only file system: '/opt/xrPose/.lr_find_43df1c5c-0aed-4205-ac56-2fe4523ca4a7.ckpt'
Is there anyway to change the checkpoint path for lr_find? I checked the documentation but I couldn't find any information on that, in the part related to checkpointing.
My code is below:
res = trainer.tuner.lr_find(model, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader, min_lr=1e-5)
logging.info(f"suggested learning rate: {res.suggestion()}")
model.hparams.learning_rate = res.suggestion()
You may need to specify default_root_dir when initialize Trainer:
trainer = Trainer(default_root_dir='./my_dir')
Description from the Official Documentation:
default_root_dir - Default path for logs and weights when no logger or
pytorch_lightning.callbacks.ModelCheckpoint callback passed.
Code example:
import numpy as np
import torch
from pytorch_lightning import LightningModule, Trainer
from torch.utils.data import DataLoader, Dataset
class MyDataset(Dataset):
def __init__(self) -> None:
super().__init__()
def __getitem__(self, index):
x = np.zeros((10,), np.float32)
y = np.zeros((1,), np.float32)
return x, y
def __len__(self):
return 100
class MyModel(LightningModule):
def __init__(self):
super().__init__()
self.model = torch.nn.Linear(10, 1)
def forward(self, x):
return self.model(x)
def training_step(self, batch, batch_idx):
x, y = batch
y_hat = self(x)
loss = torch.nn.MSELoss()(y_hat, y)
return loss
def configure_optimizers(self):
return torch.optim.Adam(self.parameters(), lr=0.02)
model = MyModel()
trainer = Trainer(default_root_dir='./my_dir')
train_dataloader = DataLoader(MyDataset())
trainer.tuner.lr_find(model, train_dataloader)
As it is defined in the lr_finder.py as:
# Save initial model, that is loaded after learning rate is found
ckpt_path = os.path.join(trainer.default_root_dir, f".lr_find_{uuid.uuid4()}.ckpt")
trainer.save_checkpoint(ckpt_path)
The only way of changing the directory for saving the checkpoint is to change the default_root_dir. But be aware that this is also the directory that the lightning logs are saved to.
You can easily change it with trainer = Trainer(default_root_dir='./NAME_OF_THE_DIR').

PyTorch Dataloader: Dataset complete in RAM

I was wondering if the PyTorch Dataloader can also fetch the complete dataset into RAM so that performance does not suffer if there is enough RAM available
A concrete example of previous answer:
class mydataset(torch.utils.data.Dataset):
def __init__(self, data):
self.data = data
def __getitem__(self, index):
return self.data['x'][index,:], self.data['y'][index,:]
def __len__(self):
return self.data['x'].shape[0]
torch_data_train = mydataset(data_train)
dataload_train = DataLoader(torch_data_train, batch_size=batch_size, shuffle=True, num_workers=2)
You can extend torch.util.data.Dataset and create your own Dataset implementation. In the __init__ function of your custom dataset you can then load all data in a list or any other data structure, which will be fully loaded into ram. The __getitem__ will then only access the structure and return a single item.

JIT the collate function in Pytorch

I need to create a DataLoader where the collator function would require to have non trivial computation, actually a double layer loop which is significantly slowing down the training process. For example, consider this toy code where I try to use numba to JIT the collate function:
import torch
import torch.utils.data
import numba as nb
class Dataset(torch.utils.data.Dataset):
def __init__(self):
self.A = np.zeros((100000, 300))
self.B = np.ones((100000, 300))
def __getitem__(self, index):
return self.A[index], self.B[index]
def __len__(self):
return self.A.shape[0]
#nb.njit(cache=True)
def _collate_fn(batch):
batch_data = np.zeros((len(batch), 300))
for i in range(len(batch)):
batch_data[i] = batch[i][0] + batch[i][1]
return batch_data
and then I create the DataLoader as follows:
train_dataset = Dataset()
train_loader = torch.utils.data.DataLoader(
train_dataset,
batch_size=256,
num_workers=6,
collate_fn=_collate_fn,
shuffle=True)
However this just gets stuck but works fine if I remove the JITing of the _collate_fn. I am not able to understand what is happening here. I don't have to stick to numba and can use anything which will help me overcome the loop inefficiencies in Python. TIA and Happy 12,021.

DataLoader Class Errors Pytorch

I am beginner pytorch user, and I am trying to use dataloader.
Actually, I am trying to implement this into my network but it takes a very long time to load. And so, I debugged my network to see if the network itself has the problem, but it turns out it has something to with my dataloader class. Here is the code:
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
class DiabetesDataset(Dataset):
def __init__(self, csv):
self.xy = pd.read_csv(csv)
def __len__(self):
return len(self.xy)
def __getitem__(self, index):
self.x_data = torch.Tensor(xy.iloc[:, 0:-1].values)
self.y_data = torch.Tensor(xy.iloc[:, [-1]].values)
return self.x_data[index], self.y_data[index]
dataset = DiabetesDataset("trial.csv")
train_loader = DataLoader(dataset=dataset,
batch_size=1,
shuffle=True,
num_workers=2)`
for a in train_loader:
print(a)
To verify that the dataloader causes all the delay, I created a dummy csv file with 2 columns of 1s and 2s, for a total of 10 samples for each columns. Then, I looped over the train_loader object, it has been more than 1 hr and it is still running, considering that the sample size is small and batch size is set to 1.
I am not sure as to what the error to my code is and it is causing this issue.
Any comments/inputs are greatly appreciated!
There are some bugs in your code - could you check if this works (it is working on my computer with your toy example):
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
import torch
class DiabetesDataset(Dataset):
def __init__(self, csv):
self.xy = pd.read_csv(csv)
def __len__(self):
return len(self.xy)
def __getitem__(self, index):
x_data = torch.Tensor(self.xy.iloc[:, 0:-1].values)
y_data = torch.Tensor(self.xy.iloc[:, [-1]].values)
return x_data[index], y_data[index]
dataset = DiabetesDataset("trial.csv")
train_loader = DataLoader(
dataset=dataset,
batch_size=1,
shuffle=True,
num_workers=2)
if __name__ == '__main__':
for a in train_loader:
print(a)
Edit: Your code is not working because you are missing a self in the __getitem__ method (self.xy.iloc...) and because you do not have a if __name__ == '__main__ at the end of your script. For the second error, see RuntimeError on windows trying python multiprocessing

Pytorch Data Wont Fit in Memory - Example?

I am trying to find an example of training in Pytorch in batch from data on disk - akin to the Keras fit_generator. How would I alter the code below to read the csv from disk instead of loading it to memory?
I have found that one can iterate over a custom data loader like below, but I am unsure how to do this without loading all the data in memory.
I would like to:
Train and validate the model with data held on disk
Use mini-batches of the full data on disk
Repeat x epochs
class testLoader(Dataset):
def __init__(self):
#regular old numpy
boston = load_boston()
x=boston.data
y=boston.target
self.x = torch.from_numpy(x)
self.y = torch.from_numpy(y)
self.length = x.shape[0]
self.vars =x.shape[1]
def __getitem__(self, index):
return self.x[index], self.y[index]
def __len__(self):
return self.length
training_samples=testLoader()
train_loader = utils_data.DataLoader(training_samples, batch_size=64, shuffle=True)

Resources