Forward-pass for PyTorch Module in Cython - pytorch

I am creating a neural network for DeepSet where each element in the set is itself a set. Since these sets are not vectorizable I am representing my input as lists of lists of torch.tensors.
For the forward-pass of my network I don't think there is any alternative to using for-loops/list comprehension. Potentially these for-loops iterate a relatively big number of times. As such, training my networks is very time consuming since they are running in Python.
I have tried with TorchScript, without a major improvement in runningtime (about 40% improved runningtime). I hope that re-writing my loops in Cython might yield better reuslts. However I can't figure out how to combine nn.Modules from PyTorch with Cython. Any suggestions?
Here is my module:
class DSSN(nn.Module):
def __init__(self, pd_rho, dh_rho, fc_network, device):
super(DSSN, self).__init__()
self.pers_lay1 = pd_rho
self.pers_lay2 = dh_rho
self.fc = fc_network
def forward(self:nn.Module, x:List[List[torch.Tensor]]) -> torch.Tensor:
x = torch.stack([torch.stack([self.pers_lay1(pd) for pd in sample]) for sample in x])
x = self.pers_lay2(x)
x = self.fc(x)
return x

Related

Subclass of PyTorch DataLoader for changing batch output

I'm interested in a way of applying a transform to a batch generated by a PyTorch DataLoader class. My minimal example is something like this:
class CustomLoader(torch.utils.data.DataLoader):
def __iter__(self):
result = super().__iter__()
return some_function(result)
But this errors since the DataLoader.__iter()__ returns _MultiProcessingDataLoaderIter or _SingleProcessingDataLoaderIter. Weirdly though, directly returning the output does return a Tensor, so any explanation there would be greatly appreciated!
I understand that in general, transform to data should be done in the subclassed Dataset class. However, in my case the data is tabular and the transform is via numpy, and doing it on a sample-wise basis is much slower (5x) than doing it on an entire batch, since surely these operations are vectorized under the hood.
I know I can do something simple like
for X, y in loader:
X = some_function(X)
But I'd also like to use the DataLoader with pytorch-lightning, so this isn't an option.
What is the proper way to subclass PyTorch Dataloaders?
__iter__() is a generator. You will need to yield the result instead of returning it. You can read more about generators here
Regarding your problem to apply a transform to a batch, you can create a custom Dataset instead of DataLoader and then apply the transforms.
class MyDataset(Dataset):
def __init__(self, transforms=None):
super().__init__()
self.data = ... # define your data here
self.transforms = transforms
def __getitem__(self, idx):
x = self.data[idx]
if self.transforms: x = self.transforms(x)
return x
# use your `MyDataset` class for creating your dataloader
dataloader = DataLoader(MyDataset(transforms = CustomTransforms(), batch_size=4)
You can use this dataloader with PyTorch Lightning Trainer as well.
If you are using PyTorch Lightning, I would suggest you to join our Slack channel and ask questions on Github Discussions as well.
Thanks :)
EDIT: (Add transforms to Batch)
If you are using PyTorch Lightning then I would recommend to use LightningDataModule which provides on_before_batch_transfer hook that can be used to apply transforms on a batch ;)
Here is an example:
def on_before_batch_transfer(self, batch, dataloader_idx):
batch['x'] = transforms(batch['x'])
return batch
Checkout the documentation for more

How to use Numba to speed up sparse linear system solvers in Python that are provided in scipy.sparse.linalg?

I wish to speed up the sparse system solver part of my code using Numba. Here is what I have up till now:
# Both numba and numba-scipy packages are installed. I am using PyCharm IDE
import numba
import numba_scipy
# import other required stuff
#numba.jit(nopython=True)
def solve_using_numba(A, b):
return sp.linalg.gmres(A, b)
# total = the number of points in the system
A = sp.lil_matrix((total, total), dtype=float)
# populate A with appropriate data
A = A.tocsc()
b = np.zeros((total, 1), dtype=float)
# populate b with appropriate data
y, exit_code = solve_using_numba(A, b)
# plot solution
This raises the error
argument 0: cannot determine Numba type of <class 'scipy.sparse.csc.csc_matrix'>
In the official documentation, numba-scipy extends Numba to make it aware of SciPy. But it seems that here, numba cannot work with scipy sparse matrix classes. Where am I going wrong and what can I do to fix this?
I only need to speed up the sparse system solution part of the code because the other stuff is pretty lightweight like taking a couple of user inputs, constructing the A and b matrices, and plotting the end result.

How to use multiprocessing in PyTorch?

I'm trying to use PyTorch with complex loss function. In order to accelerate the code, I hope that I can use the PyTorch multiprocessing package.
The first trial, I put 10x1 features into the NN and get 10x4 output.
After that, I want to pass 10x4 parameters into a function to do some calculation. (The calculation will be complex in the future.)
After calculating, the function will return a 10x1 array in total. This array will be set as NN_energy and calculate loss function.
Besides, I also want to know if there is another method to create a backward-able array to store the NN_energy array, instead of using
NN_energy = net(Data_in)[0:10,0]
Thanks a lot.
Full Code:
import torch
import numpy as np
from torch.autograd import Variable
from torch import multiprocessing
def func(msg,BOP):
ans = (BOP[msg][0]+BOP[msg][1]/BOP[msg][2])*BOP[msg][3]
return ans
class Net(torch.nn.Module):
def __init__(self, n_feature, n_hidden_1, n_hidden_2, n_output):
super(Net, self).__init__()
self.hidden_1 = torch.nn.Linear(n_feature , n_hidden_1) # hidden layer
self.hidden_2 = torch.nn.Linear(n_hidden_1, n_hidden_2) # hidden layer
self.predict = torch.nn.Linear(n_hidden_2, n_output ) # output layer
def forward(self, x):
x = torch.tanh(self.hidden_1(x)) # activation function for hidden layer
x = torch.tanh(self.hidden_2(x)) # activation function for hidden layer
x = self.predict(x) # linear output
return x
if __name__ == '__main__': # apply_async
Data_in = Variable( torch.from_numpy( np.asarray(list(range( 0,10))).reshape(10,1) ).float() )
Ground_truth = Variable( torch.from_numpy( np.asarray(list(range(20,30))).reshape(10,1) ).float() )
net = Net( n_feature=1 , n_hidden_1=15 , n_hidden_2=15 , n_output=4 ) # define the network
optimizer = torch.optim.Rprop( net.parameters() )
loss_func = torch.nn.MSELoss() # this is for regression mean squared loss
NN_output = net(Data_in)
args = range(0,10)
pool = multiprocessing.Pool()
return_data = pool.map( func, zip(args, NN_output) )
pool.close()
pool.join()
NN_energy = net(Data_in)[0:10,0]
for i in range(0,10):
NN_energy[i] = return_data[i]
loss = torch.sqrt( loss_func( NN_energy , Ground_truth ) ) # must be (1. nn output, 2. target)
print(loss)
Error messages:
File
"C:\ProgramData\Anaconda3\lib\site-packages\torch\multiprocessing\reductions.py",
line 126, in reduce_tensor
raise RuntimeError("Cowardly refusing to serialize non-leaf tensor which requires_grad, "
RuntimeError: Cowardly refusing to serialize non-leaf tensor which
requires_grad, since autograd does not support crossing process
boundaries. If you just want to transfer the data, call detach() on
the tensor before serializing (e.g., putting it on the queue).
First of all, Torch Variable API is deprecated since a very long time, just don't use it.
Next, torch.from_numpy( np.asarray(list(range( 0,10))).reshape(10,1) ).float() is wrong at many levels: np.asarray of list is useless since a copy will be performed anyway, and np.array takes list as input by design. Then, np.arange is available to return a range as numpy array, and it is also available on Torch. Next, specifying both dimension for reshape is useless and error prone, you could simply do reshape((-1, 1)), or even better unsqueeze(-1).
Here is the simplified expression torch.arange(10, dtype=torch.float32, requires_grad=True).unsqueeze(-1).
Using multiprocessing pool is a bad practice if using batch processing is possible. It will be both way more efficient and readable. Indeed, performing N small algebraic operations in parallel is always slower and a larger single algebraic operation, and even more on GPU. More importantly, computing the gradient is not supported by multiprocessing, hence the error that you get. Yet, this is partially true, because it is supports for tensors on cpu since 1.6.0. Have a lok, to the official release changelog.
Could you post a more representative example of what func method could be to make sure you really need it ?
NB: Distributed autograd as you are looking is now available in Pytorch as an experimental feature available in beta since 1.6.0. Have a look to the official documentation.

Data loading with variable batch size?

I am currently working on patch based super-resolution. Most of the papers divide an image into smaller patches and then use the patches as input to the models.I was able to create patches using custom dataloader. The code is given below:
import torch.utils.data as data
from torchvision.transforms import CenterCrop, ToTensor, Compose, ToPILImage, Resize, RandomHorizontalFlip, RandomVerticalFlip
from os import listdir
from os.path import join
from PIL import Image
import random
import os
import numpy as np
import torch
def is_image_file(filename):
return any(filename.endswith(extension) for extension in [".png", ".jpg", ".jpeg", ".bmp"])
class TrainDatasetFromFolder(data.Dataset):
def __init__(self, dataset_dir, patch_size, is_gray, stride):
super(TrainDatasetFromFolder, self).__init__()
self.imageHrfilenames = []
self.imageHrfilenames.extend(join(dataset_dir, x)
for x in sorted(listdir(dataset_dir)) if is_image_file(x))
self.is_gray = is_gray
self.patchSize = patch_size
self.stride = stride
def _load_file(self, index):
filename = self.imageHrfilenames[index]
hr = Image.open(self.imageHrfilenames[index])
downsizes = (1, 0.7, 0.45)
downsize = 2
w_ = int(hr.width * downsizes[downsize])
h_ = int(hr.height * downsizes[downsize])
aug = Compose([Resize([h_, w_], interpolation=Image.BICUBIC),
RandomHorizontalFlip(),
RandomVerticalFlip()])
hr = aug(hr)
rv = random.randint(0, 4)
hr = hr.rotate(90*rv, expand=1)
filename = os.path.splitext(os.path.split(filename)[-1])[0]
return hr, filename
def _patching(self, img):
img = ToTensor()(img)
LR_ = Compose([ToPILImage(), Resize(self.patchSize//2, interpolation=Image.BICUBIC), ToTensor()])
HR_p, LR_p = [], []
for i in range(0, img.shape[1] - self.patchSize, self.stride):
for j in range(0, img.shape[2] - self.patchSize, self.stride):
temp = img[:, i:i + self.patchSize, j:j + self.patchSize]
HR_p += [temp]
LR_p += [LR_(temp)]
return torch.stack(LR_p),torch.stack(HR_p)
def __getitem__(self, index):
HR_, filename = self._load_file(index)
LR_p, HR_p = self._patching(HR_)
return LR_p, HR_p
def __len__(self):
return len(self.imageHrfilenames)
Suppose the batch size is 1, it takes an image and gives an output of size [x,3,patchsize,patchsize]. When batch size is 2, I will have two different outputs of size [x,3,patchsize,patchsize] (for example image 1 may give[50,3,patchsize,patchsize], image 2 may give[75,3,patchsize,patchsize] ). To handle this a custom collate function was required that stacks these two outputs along dimension 0. The collate function is given below:
def my_collate(batch):
data = torch.cat([item[0] for item in batch],dim = 0)
target = torch.cat([item[1] for item in batch],dim = 0)
return [data, target]
This collate function concatenates along x (From the above example, I finally get [125,3,patchsize,pathsize]. For training purposes, I need to train the model using a minibatch size of say 25. Is there any method or any functions which I can use to directly get an output of size [25 , 3, patchsize, pathsize] directly from the dataloader using the necessary number of images as input to the Dataloader?
The following code snippet works for your purpose.
First, we define a ToyDataset which takes in a list of tensors (tensors) of variable length in dimension 0. This is similar to the samples returned by your dataset.
import torch
from torch.utils.data import Dataset
from torch.utils.data.sampler import RandomSampler
class ToyDataset(Dataset):
def __init__(self, tensors):
self.tensors = tensors
def __getitem__(self, index):
return self.tensors[index]
def __len__(self):
return len(tensors)
Secondly, we define a custom data loader. The usual Pytorch dichotomy to create datasets and data loaders is roughly the following: There is an indexed dataset, to which you can pass an index and it returns the associated sample from the dataset. There is a sampler which yields an index, there are different strategies to draw indices which give rise to different samplers. The sampler is used by a batch_sampler to draw multiple indices at once (as many as specified by batch_size). There is a dataloader which combines sampler and dataset to let you iterate over a dataset, importantly the data loader also owns a function (collate_fn) which specifies how the multiple samples retrieved from the dataset using the indices from the batch_sampler should be combined. For your use case, the usual PyTorch dichotomy does not work well, because instead of drawing a fixed number of indices, we need to draw indices until the objects associated with the indices exceed the cumulative size we desire. This means we need immediate inspection of the objects and use this knowledge to decide whether to return a batch or keep drawing indices. This is what the custom data loader below does:
class CustomLoader(object):
def __init__(self, dataset, my_bsz, drop_last=True):
self.ds = dataset
self.my_bsz = my_bsz
self.drop_last = drop_last
self.sampler = RandomSampler(dataset)
def __iter__(self):
batch = torch.Tensor()
for idx in self.sampler:
batch = torch.cat([batch, self.ds[idx]])
while batch.size(0) >= self.my_bsz:
if batch.size(0) == self.my_bsz:
yield batch
batch = torch.Tensor()
else:
return_batch, batch = batch.split([self.my_bsz,batch.size(0)-self.my_bsz])
yield return_batch
if batch.size(0) > 0 and not self.drop_last:
yield batch
Here we iterate over the dataset, after drawing an index and loading the associated object, we concatenate it to the tensors we drew before (batch). We keep doing this until we reach the desired size, such that we can cut out and yield a batch. We retain the rows in batch, which we did not yield. Because it may be the case that a single instance exceeds the desired batch_size, we use a while loop.
You could modify this minimal CustomDataloader to add more features in the style of PyTorch's dataloader. There is also no need to use a RandomSampler to draw in indices, others would work equally well. It would also be possible to avoid repeated concats, in case your data is large by using for example a list and keeping track of the cumulative length of its tensors.
Here is an example, that demonstrates it works:
patch_size = 5
channels = 3
dim0sizes = torch.LongTensor(100).random_(1, 100)
data = torch.randn(size=(dim0sizes.sum(), channels, patch_size, patch_size))
tensors = torch.split(data, list(dim0sizes))
ds = ToyDataset(tensors)
dl = CustomLoader(ds, my_bsz=250, drop_last=False)
for i in dl:
print(i.size(0))
(Related, but not exactly in topic)
For batch size adaptation you can use the code as exemplified in this repo. It is implemented for a different purpose (maximize GPU memory usage), but it is not too hard to translate to your problem.
The code does batch adaptation and batch spoofing.
To improve the previous answer, I found a repo that uses DataManger to achieve different patch sizes and batch sizes. It is basically initiating different dataloaders with different settings and a set_epoch function is used to set the appropriate dataloader for a given epoch.

not able to restore tensorflow graph using python3 multiprocessing

I want to be able to load an existing tensorflow network simultaneously from several processes using multiprocessing library to do inference on different cores simultaneously.
def spawn_process(x):
g = tf.Graph()
sess = tf.Session(graph=g)
with g.as_default():
meta_graph = tf.train.import_meta_graph('model.meta')
meta_graph.restore(sess, tf.train.latest_checkpoint(chkp_dir)
x_ph = tf.get_collection('x')[0] # placeholder tensor that we use to pass in new x's to do inference on.
pred = tf.get_collection('pred')[0] # tensor responsible for computing prediction given x
prediction = sess.run(pred, feed_dict={x_ph: x})
This is basically the function I want to pass to Pool.map to infer parallely.
Above function works with just map and it looks like this
predictions = list(map(spawn_process, range(10)))
predictions = [spawn_process(x) for x in range(10)]
Both of the above work as expected.
But when I try do this, it fails, and each process just hangs right before the meta_graph.restore line and I'm stumped.
p = multiprocess.Pool(4)
predictions = p.map(spawn_process, range(10))
p.close()
p.join()
I don't know why this isn't working with tensorflow, this normally works for me when I try to do any sort of computation parallely. It stops right before the meta_graph.restore line and all the processes hang there.

Resources