JIT the collate function in PyTorch

I need to create a DataLoader whose collate function requires non-trivial computation, specifically a doubly nested loop, which is significantly slowing down training. For example, consider this toy code where I try to use numba to JIT the collate function:
import numpy as np
import torch
import torch.utils.data
import numba as nb

class Dataset(torch.utils.data.Dataset):
    def __init__(self):
        self.A = np.zeros((100000, 300))
        self.B = np.ones((100000, 300))

    def __getitem__(self, index):
        return self.A[index], self.B[index]

    def __len__(self):
        return self.A.shape[0]

@nb.njit(cache=True)
def _collate_fn(batch):
    batch_data = np.zeros((len(batch), 300))
    for i in range(len(batch)):
        batch_data[i] = batch[i][0] + batch[i][1]
    return batch_data
and then I create the DataLoader as follows:
train_dataset = Dataset()
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=256,
    num_workers=6,
    collate_fn=_collate_fn,
    shuffle=True)
However, this just gets stuck, while it works fine if I remove the JITing of _collate_fn. I am not able to understand what is happening here. I don't have to stick to numba and can use anything that will help me overcome the loop inefficiencies in Python. TIA and Happy 12,021.
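For reference, a minimal vectorized sketch (not from the original post) that removes the Python-level loop with NumPy stacking and broadcasting, using the same toy shapes as above; it avoids numba entirely:

import numpy as np

def _collate_fn_vectorized(batch):
    # Stack the per-sample arrays once, then add them in a single vectorized op,
    # which avoids the double Python loop without needing numba at all.
    a = np.stack([item[0] for item in batch])  # (batch_size, 300)
    b = np.stack([item[1] for item in batch])  # (batch_size, 300)
    return a + b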

Related

How can I integrate Pyarrow with a PyTorch Dataset when the dataset is too large to load into memory at once?

A typical Pytorch Dataset implementation can be as follows:
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data[:][:-1]
        self.target = data[:][-1]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        return x, y
The reason I want to implement it as a Dataset is that I want to use the PyTorch DataLoader later for my mini-batch training.
However, if the data comes from a directory containing multiple parquet files, how can I write def __getitem__(self, index): for it without loading all the data into memory? I know Pyarrow is good at loading data in batches, but I didn't find a good reference to make it work. Any suggestions? Thank you in advance!
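One possible approach, as a sketch under assumptions rather than a definitive implementation: index the parquet row groups up front and read a single row group per __getitem__ call, so only one chunk is ever materialized. The directory layout and column order (features first, target last) are assumptions here.

import glob
import os

import pyarrow.parquet as pq
import torch
from torch.utils.data import Dataset

class ParquetRowGroupDataset(Dataset):
    """Map-style dataset that reads one parquet row group per item."""

    def __init__(self, data_dir):
        # Build an index of (file, row_group) pairs; only metadata is read here.
        self._index = []
        for path in sorted(glob.glob(os.path.join(data_dir, "*.parquet"))):
            pf = pq.ParquetFile(path)
            for rg in range(pf.num_row_groups):
                self._index.append((path, rg))

    def __len__(self):
        return len(self._index)

    def __getitem__(self, index):
        path, rg = self._index[index]
        # Only this row group is loaded into memory.
        df = pq.ParquetFile(path).read_row_group(rg).to_pandas()
        x = torch.tensor(df.iloc[:, :-1].values, dtype=torch.float32)
        y = torch.tensor(df.iloc[:, -1].values, dtype=torch.float32)
        return x, y

Note that each item is then a whole row group rather than a single row, so the DataLoader would typically be used with batch_size=None or a custom collate function.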

Changing the checkpoint path of lr_find

I want to tune the learning rate for my PyTorch Lightning model. My code runs on a GPU cluster, so I can only write to certain folders that I bind mount. However, trainer.tuner.lr_find tries to write the checkpoint to the folder where my script runs and since this folder is not writable, it fails with the following error:
OSError: [Errno 30] Read-only file system: '/opt/xrPose/.lr_find_43df1c5c-0aed-4205-ac56-2fe4523ca4a7.ckpt'
Is there any way to change the checkpoint path for lr_find? I checked the part of the documentation related to checkpointing, but I couldn't find any information on that.
My code is below:
res = trainer.tuner.lr_find(model, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader, min_lr=1e-5)
logging.info(f"suggested learning rate: {res.suggestion()}")
model.hparams.learning_rate = res.suggestion()
You may need to specify default_root_dir when initializing the Trainer:
trainer = Trainer(default_root_dir='./my_dir')
Description from the Official Documentation:
default_root_dir - Default path for logs and weights when no logger or
pytorch_lightning.callbacks.ModelCheckpoint callback passed.
Code example:
import numpy as np
import torch
from pytorch_lightning import LightningModule, Trainer
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    def __init__(self) -> None:
        super().__init__()

    def __getitem__(self, index):
        x = np.zeros((10,), np.float32)
        y = np.zeros((1,), np.float32)
        return x, y

    def __len__(self):
        return 100

class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = torch.nn.MSELoss()(y_hat, y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

model = MyModel()
trainer = Trainer(default_root_dir='./my_dir')
train_dataloader = DataLoader(MyDataset())
trainer.tuner.lr_find(model, train_dataloader)
The checkpoint path is defined in lr_finder.py as:
# Save initial model, that is loaded after learning rate is found
ckpt_path = os.path.join(trainer.default_root_dir, f".lr_find_{uuid.uuid4()}.ckpt")
trainer.save_checkpoint(ckpt_path)
The only way of changing the directory for saving the checkpoint is to change the default_root_dir. But be aware that this is also the directory that the lightning logs are saved to.
You can easily change it with trainer = Trainer(default_root_dir='./NAME_OF_THE_DIR').

How does keras.utils.Sequence work?

I am trying to create a data pipeline for U-Net for image segmentation. I came across the keras.utils.Sequence class, through which I can create a data pipeline, but I am unable to understand how it works.
Links for the code: Keras code, Source code
def __iter__(self):
    """Create a generator that iterate over the Sequence."""
    for item in (self[i] for i in range(len(self))):
        yield item
I would highly appreciate it if anyone could tell me how this works.
You don't need a generator. The Sequence class is there to manage that. You need to define a class that inherits from tensorflow.keras.utils.Sequence and define the methods
__init__, __getitem__, and __len__. In addition, you can define the method on_epoch_end, which is called at the end of each epoch and is usually used to shuffle the sample indexes.
There is an example in the link you gave: Tensorflow Sequence.
Below is another example of Sequence.
Note that you can pass the data to the __init__ constructor, but you may as well read the data from files in the __getitem__ method, assuming you know where to read it, e.g. by passing the name of a directory or directories into the constructor. This is necessary if there is a lot of data.
from tensorflow import keras
import numpy as np

class SequenceExample(keras.utils.Sequence):
    def __init__(self, x_in, y_in, batch_size, shuffle=True):
        # Initialization
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.x = x_in
        self.y = y_in
        self.datalen = len(y_in)
        self.indexes = np.arange(self.datalen)
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __getitem__(self, index):
        # get batch indexes from shuffled indexes
        batch_indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        x_batch = self.x[batch_indexes]
        y_batch = self.y[batch_indexes]
        return x_batch, y_batch

    def __len__(self):
        # Denotes the number of batches per epoch
        return self.datalen // self.batch_size

    def on_epoch_end(self):
        # Updates indexes after each epoch
        self.indexes = np.arange(self.datalen)
        if self.shuffle:
            np.random.shuffle(self.indexes)
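A usage sketch (the model and data below are made up purely for illustration): the Sequence instance is passed directly to model.fit, which calls __len__ once per epoch, __getitem__ once per batch, and on_epoch_end after each epoch.

import numpy as np
from tensorflow import keras

# Toy data, illustrative only: 1000 samples with 10 features, binary labels.
x = np.random.rand(1000, 10).astype("float32")
y = np.random.randint(0, 2, size=(1000,)).astype("float32")

seq = SequenceExample(x, y, batch_size=32, shuffle=True)

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Keras pulls batches from the Sequence; no manual generator is needed.
model.fit(seq, epochs=3)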

Parallelization of multiple independent models in tensorflow-gpu / keras

I need to train a set of models but do not benefit from GPU acceleration with tensorflow-gpu / keras, as training time grows linearly with the number of models trained.
In the code below,
class Models(tf.keras.Model):
    def __init__(self, N_MODELS=1):
        super(Models, self).__init__()
        self.block_i = [estimate_affine()
                        for node in range(N_MODELS)]

    def call(self, inputs):
        x = [self.block_i[i](input_i) for i, input_i in enumerate(inputs)]
        return x
a list of N_MODELS layers is built; since the layers are independent, they should be parallelizable. As that is not the case, even though the output is what I expect, I guess my implementation is not optimal. Any idea how to make it parallelizable?
Best
Paul
Here is a toy network of N_MODELS linear regressions:
import tensorflow as tf
tf.enable_eager_execution()
from tensorflow.keras import layers
import numpy as np
from numpy import random
import time

class estimate_affine(layers.Layer):
    def __init__(self):
        '''
        '''
        super(estimate_affine, self).__init__()
        self.a = tf.Variable(initial_value=[0.], dtype='float32', trainable=True, name='par1')
        self.b = tf.Variable(initial_value=[0.], dtype='float32', trainable=True, name='par2')

    def call(self, inputs):
        return (self.a, self.b)

class Models(tf.keras.Model):
    def __init__(self, N_MODELS=1):
        super(Models, self).__init__()
        self.block_i = [estimate_affine()
                        for node in range(N_MODELS)]

    def call(self, inputs):
        x = [self.block_i[i](input_i) for i, input_i in enumerate(inputs)]
        return x

N_ITERATIONS = 100
N_POINTS = 100
ls_t = []
for N_MODELS in [5, 10, 50, 100, 1000]:
    t = time.time()
    ### Aim is to fit N_MODELS on N_POINTS which are basically N_MODELS of ax+b
    a = np.random.randint(0, 10, N_MODELS)
    b = np.random.randint(0, 10, N_MODELS)
    noise = np.random.rand(N_POINTS) * 1
    x = np.linspace(0, 1, N_POINTS)
    dataset = np.array([a_i * (x + noise) + b_i for a_i, b_i in zip(a, b)])
    model = Models(N_MODELS=N_MODELS)
    optimizer = tf.keras.optimizers.SGD(learning_rate=5e-3)
    for i in range(N_ITERATIONS):
        with tf.GradientTape() as tape:
            outputs = model(dataset)
            L = tf.reduce_sum([((outputs[idx][0]*x + outputs[idx][1])
                                - dataset[idx, :])**2 for idx in range(N_MODELS)])
        grads = tape.gradient(L, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    t_diff = time.time() - t
    print('N_MODEL : {}, time : {}'.format(N_MODELS, t_diff))
    ls_t.append(t_diff)
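One direction sometimes worth trying (a sketch assuming TF 2.x, not from the original post): fuse the per-model scalars into one (N_MODELS,) variable per parameter, so a single broadcasted kernel updates all models at once instead of looping over N_MODELS Python-level layers.

import numpy as np
import tensorflow as tf

N_MODELS, N_POINTS, N_ITERATIONS = 1000, 100, 100

# Same synthetic setup as above: N_MODELS lines a*x + b.
a_true = np.random.randint(0, 10, N_MODELS).astype(np.float32)
b_true = np.random.randint(0, 10, N_MODELS).astype(np.float32)
x = np.linspace(0, 1, N_POINTS).astype(np.float32)
dataset = a_true[:, None] * x[None, :] + b_true[:, None]   # (N_MODELS, N_POINTS)

# One slope and one intercept per model, stored as batched variables.
a = tf.Variable(tf.zeros(N_MODELS))
b = tf.Variable(tf.zeros(N_MODELS))
optimizer = tf.keras.optimizers.SGD(learning_rate=5e-3)

for _ in range(N_ITERATIONS):
    with tf.GradientTape() as tape:
        pred = a[:, None] * x[None, :] + b[:, None]         # broadcast over all models
        loss = tf.reduce_sum((pred - dataset) ** 2)
    grads = tape.gradient(loss, [a, b])
    optimizer.apply_gradients(zip(grads, [a, b]))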

DataLoader class errors in PyTorch

I am a beginner PyTorch user, and I am trying to use the DataLoader.
Actually, I am trying to implement this into my network, but it takes a very long time to load. So I debugged my network to see if the network itself had the problem, but it turns out it has something to do with my DataLoader class. Here is the code:
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd

class DiabetesDataset(Dataset):
    def __init__(self, csv):
        self.xy = pd.read_csv(csv)

    def __len__(self):
        return len(self.xy)

    def __getitem__(self, index):
        self.x_data = torch.Tensor(xy.iloc[:, 0:-1].values)
        self.y_data = torch.Tensor(xy.iloc[:, [-1]].values)
        return self.x_data[index], self.y_data[index]

dataset = DiabetesDataset("trial.csv")
train_loader = DataLoader(dataset=dataset,
                          batch_size=1,
                          shuffle=True,
                          num_workers=2)

for a in train_loader:
    print(a)
To verify that the DataLoader causes all the delay, I created a dummy csv file with 2 columns of 1s and 2s, for a total of 10 samples in each column. Then I looped over the train_loader object; it has been running for more than 1 hour, even though the sample size is small and the batch size is set to 1.
I am not sure what the error in my code is that is causing this issue.
Any comments/inputs are greatly appreciated!
There are some bugs in your code - could you check if this works (it is working on my computer with your toy example):
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
import torch

class DiabetesDataset(Dataset):
    def __init__(self, csv):
        self.xy = pd.read_csv(csv)

    def __len__(self):
        return len(self.xy)

    def __getitem__(self, index):
        x_data = torch.Tensor(self.xy.iloc[:, 0:-1].values)
        y_data = torch.Tensor(self.xy.iloc[:, [-1]].values)
        return x_data[index], y_data[index]

dataset = DiabetesDataset("trial.csv")
train_loader = DataLoader(
    dataset=dataset,
    batch_size=1,
    shuffle=True,
    num_workers=2)

if __name__ == '__main__':
    for a in train_loader:
        print(a)
Edit: Your code is not working because you are missing a self in the __getitem__ method (self.xy.iloc...) and because you do not have an if __name__ == '__main__' guard at the end of your script. For the second error, see RuntimeError on windows trying python multiprocessing.
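As a side note (not part of the original answer), the fixed __getitem__ above still converts the entire DataFrame to tensors on every call; a sketch that does the conversion once in __init__ avoids that repeated work:

import pandas as pd
import torch
from torch.utils.data import Dataset

class DiabetesDatasetPreloaded(Dataset):
    def __init__(self, csv):
        xy = pd.read_csv(csv)
        # Convert to tensors once here, not on every __getitem__ call.
        self.x_data = torch.tensor(xy.iloc[:, 0:-1].values, dtype=torch.float32)
        self.y_data = torch.tensor(xy.iloc[:, [-1]].values, dtype=torch.float32)

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]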
