Why is the instance of DataLoader a global variable in pytorch - python-3.x

I was debugging my PyTorch code and found that an instance of the class DataLoader seems to be a global variable by default. I don't understand why this is the case, so I've set up a minimal working example that reproduces my observation. The code is below:
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, df, n_feats, mode):
        data = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]).transpose()
        x = data[:, list(range(n_feats))]  # features
        y = data[:, -1]                    # target
        self.x = torch.FloatTensor(x)
        self.y = torch.FloatTensor(y)

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return len(self.x)

def prep_dataloader(df, n_feats, mode, batch_size):
    dataset = MyDataset(df, n_feats, mode)
    dataloader = DataLoader(dataset, batch_size, shuffle=False)
    return dataloader

df = None  # placeholder so the example runs; MyDataset builds its data internally
tr_set = prep_dataloader(df, 1, 'train', 200)

def test():
    print(tr_set)

test()
As shown above, tr_set was created before the function test and is not passed to test. However, when I run the code above (calling test() at the end), I get the following result:
<torch.utils.data.dataloader.DataLoader object at 0x7fb6c2ea7610>
Originally, I was expecting to get an error like "NameError: name 'tr_set' is not defined". However, the function was aware of tr_set and printed the object, even though tr_set was not passed as an argument. I'm confused by this, because tr_set seems to behave like a global variable here.
I'm wondering about the reason for this and possible ways that I can prevent it from becoming a global variable. Thank you!
(Update: in case this matters, I was running the code above in a Jupyter notebook.)

This has nothing to do with DataLoader or with how PyTorch works; it is standard Python scoping.
Since tr_set is bound at the top level of your module (or notebook cell), it is a global variable in Python's sense, i.e. global to that module's namespace; it is not automatically visible to other modules unless they import it. The reason test has access to tr_set is not a closure: inside test the name tr_set is never assigned, so Python resolves it at call time by looking it up in the module's global namespace (local scope first, then enclosing, then global, then built-ins). If you don't want test to depend on module-level state, pass the DataLoader in as an argument.
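A minimal sketch of the two styles (the explicit-parameter version is how you avoid depending on module state; names are illustrative):

def test_implicit():
    # tr_set is never assigned in this function, so Python looks the name
    # up in the module's global namespace at call time.
    print(tr_set)

def test_explicit(loader):
    # The parameter shadows any global of the same name; the function now
    # only sees what the caller passes in.
    print(loader)

test_implicit()        # resolved through the global lookup
test_explicit(tr_set)  # explicit and self-contained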

Related

How does a pytorch function (such as RoIPool) work?

For example, I'm trying to view the implementation of RoI Pooling in PyTorch.
Here is a code fragment showing how to use RoIPool in PyTorch:
import torch
from torchvision.ops.roi_pool import RoIPool
device = torch.device('cuda')
# create feature layer, proposals and targets
num_proposals = 10
feature_map = torch.randn(1, 64, 32, 32)
proposals = torch.zeros((num_proposals, 4))
proposals[:, 0] = torch.randint(0, 16, (num_proposals,))
proposals[:, 1] = torch.randint(0, 16, (num_proposals,))
proposals[:, 2] = torch.randint(16, 32, (num_proposals,))
proposals[:, 3] = torch.randint(16, 32, (num_proposals,))
roi_pool_obj = RoIPool(3, 2**-1)
roi_pool = roi_pool_obj(feature_map, [proposals])
I'm using PyCharm, so when I follow RoIPool from the second line, it opens a file located at ~/anaconda3/envs/CV/lib/python3.8/site-packages/torchvision/ops/roi_pool.py, which is exactly the same as the code in the documentation.
I pasted the code below without docstrings.
from typing import List, Union

import torch
from torch import nn, Tensor
from torch.jit.annotations import BroadcastingList2
from torch.nn.modules.utils import _pair
from torchvision.extension import _assert_has_ops

from ..utils import _log_api_usage_once
from ._utils import convert_boxes_to_roi_format, check_roi_boxes_shape

def roi_pool(
    input: Tensor,
    boxes: Union[Tensor, List[Tensor]],
    output_size: BroadcastingList2[int],
    spatial_scale: float = 1.0,
) -> Tensor:
    if not torch.jit.is_scripting() and not torch.jit.is_tracing():
        _log_api_usage_once(roi_pool)
    _assert_has_ops()
    check_roi_boxes_shape(boxes)
    rois = boxes
    output_size = _pair(output_size)
    if not isinstance(rois, torch.Tensor):
        rois = convert_boxes_to_roi_format(rois)
    output, _ = torch.ops.torchvision.roi_pool(input, rois, spatial_scale, output_size[0], output_size[1])
    return output

class RoIPool(nn.Module):
    def __init__(self, output_size: BroadcastingList2[int], spatial_scale: float):
        super().__init__()
        _log_api_usage_once(self)
        self.output_size = output_size
        self.spatial_scale = spatial_scale

    def forward(self, input: Tensor, rois: Tensor) -> Tensor:
        return roi_pool(input, rois, self.output_size, self.spatial_scale)

    def __repr__(self) -> str:
        s = f"{self.__class__.__name__}(output_size={self.output_size}, spatial_scale={self.spatial_scale})"
        return s
So, in the code example:
When running roi_pool_obj = RoIPool(3, 2**-1), it creates an instance of RoIPool by calling its __init__ method, which only initializes two instance variables;
When running roi_pool = roi_pool_obj(feature_map, [proposals]), it must somehow call the forward() method (but I don't know how), which then calls the roi_pool() function above;
When running the roi_pool() function, it does some checking first and then computes the output with the line output, _ = torch.ops.torchvision.roi_pool(input, rois, spatial_scale, output_size[0], output_size[1]).
But this doesn't show the details of how roi_pool is implemented, and PyCharm shows "Cannot find declaration to go to" when I try to follow torch.ops.torchvision.roi_pool.
To summarize, I have two questions:
How does running roi_pool = roi_pool_obj(feature_map, [proposals]) end up calling forward()?
How can I view the source code of torch.ops.torchvision.roi_pool, and where is the file containing its implementation located?
Last but not least, I've just started reading source code, which is pretty difficult for me. I'd appreciate it if you could also provide some advice or tutorials.
RoIPool is a subclass of torch.nn.Module. Source code:
https://github.com/pytorch/vision/blob/07ae61bf9c21ddd1d5f65d326aa9636849b383ca/torchvision/ops/roi_pool.py#L56
nn.Module defines the __call__ method, which in turn calls the forward method. Source code:
https://github.com/pytorch/pytorch/blob/b2311192e6c4745aac3fdd774ac9d56a36b396d4/torch/nn/modules/module.py#L1234
When you execute the statement roi_pool = roi_pool_obj(feature_map, [proposals]), the __call__ method invokes the forward() of RoIPool. Source code:
https://github.com/pytorch/vision/blob/07ae61bf9c21ddd1d5f65d326aa9636849b383ca/torchvision/ops/roi_pool.py#L67
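In essence, Python's obj(...) syntax invokes type(obj).__call__, and nn.Module implements __call__ to dispatch to forward. A minimal sketch of the mechanism (a simplification; the real nn.Module.__call__ also runs registered hooks around this dispatch):

class Module:
    def __call__(self, *args, **kwargs):
        # Simplified: nn.Module's real __call__ also handles forward
        # and backward hooks before and after this dispatch.
        return self.forward(*args, **kwargs)

class Doubler(Module):
    def forward(self, x):
        return x * 2

m = Doubler()
print(m(21))  # 42: m(21) -> Module.__call__ -> Doubler.forward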
RoIPool.forward calls torch.ops.torchvision.roi_pool:
https://github.com/pytorch/vision/blob/07ae61bf9c21ddd1d5f65d326aa9636849b383ca/torchvision/ops/roi_pool.py#L52
torch.ops is an object that loads native libraries implemented in C++:
https://github.com/pytorch/pytorch/blob/b2311192e6c4745aac3fdd774ac9d56a36b396d4/torch/_ops.py#L537
so when you access torch.ops.torchvision, it resolves the op from the native torchvision library.
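For illustration, once torchvision is imported, the registered op can even be called directly; a small sketch (CPU tensors; the RoI format (batch_index, x1, y1, x2, y2) and the argument order are taken from the roi_pool wrapper quoted above):

import torch
import torchvision  # importing torchvision loads the native library that registers the op

feature_map = torch.randn(1, 64, 32, 32)
rois = torch.tensor([[0.0, 0.0, 0.0, 16.0, 16.0]])  # (batch_index, x1, y1, x2, y2)
# Same call as in roi_pool.py: (input, rois, spatial_scale, pooled_h, pooled_w).
# The raw op returns the pooled output plus an argmax tensor.
output, argmax = torch.ops.torchvision.roi_pool(feature_map, rois, 0.5, 3, 3)
print(output.shape)  # torch.Size([1, 64, 3, 3])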
Here is where the roi_pool function is registered:
https://github.com/pytorch/vision/blob/7947fc8fb38b1d3a2aca03f22a2e6a3caa63f2a0/torchvision/csrc/ops/roi_pool.cpp#L53
Here you can find the actual implementations of roi_pool:
CPU:
https://github.com/pytorch/vision/blob/7947fc8fb38b1d3a2aca03f22a2e6a3caa63f2a0/torchvision/csrc/ops/cpu/roi_pool_kernel.cpp
GPU:
https://github.com/pytorch/vision/blob/7947fc8fb38b1d3a2aca03f22a2e6a3caa63f2a0/torchvision/csrc/ops/cuda/roi_pool_kernel.cu

PyTorch Dataset / Dataloader from random source

I have a source of random (non-deterministic, non-repeatable) data, that I'd like to wrap in Dataset and Dataloader for PyTorch training. How can I do this?
__len__ is not defined, as the source is infinite (with possible repetition).
__getitem__ is not defined, as the source is non-deterministic.
When defining a custom dataset class, you'd ordinarily subclass torch.utils.data.Dataset and define __len__() and __getitem__().
However, for cases where you want sequential but not random access, you can use an iterable-style dataset. To do this, you instead subclass torch.utils.data.IterableDataset and define __iter__(). Whatever is returned by __iter__() should be a proper iterator; it should maintain state (if necessary) and define __next__() to obtain the next item in the sequence. __next__() should raise StopIteration when there's nothing left to read. In your case with an infinite dataset, it never needs to do this.
Here's an example:
import torch

class MyInfiniteIterator:
    def __next__(self):
        return torch.randn(10)

class MyInfiniteDataset(torch.utils.data.IterableDataset):
    def __iter__(self):
        return MyInfiniteIterator()

dataset = MyInfiniteDataset()
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

for batch in dataloader:
    # ... Do some stuff here ...
    # ...
    # if some_condition:
    #     break
    pass
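As a side note, __iter__() can also be written as a generator function, since a generator is itself a proper iterator that maintains its own state; an equivalent sketch:

import torch

class MyInfiniteDataset(torch.utils.data.IterableDataset):
    def __iter__(self):
        # A generator keeps its own state and would raise StopIteration
        # when it returns; for an infinite source it simply never returns.
        while True:
            yield torch.randn(10)

dataloader = torch.utils.data.DataLoader(MyInfiniteDataset(), batch_size=32)
batch = next(iter(dataloader))
print(batch.shape)  # torch.Size([32, 10])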

How to use multiprocessing in PyTorch?

I'm trying to use PyTorch with a complex loss function. In order to accelerate the code, I hope I can use the PyTorch multiprocessing package.
In a first trial, I put 10x1 features into the NN and get a 10x4 output.
After that, I want to pass the 10x4 output into a function to do some calculation. (The calculation will become more complex in the future.)
After calculating, the function returns a 10x1 array in total. This array is set as NN_energy and used to calculate the loss function.
Besides, I also want to know if there is another way to create an array that supports backward (to store the NN_energy values) instead of using
NN_energy = net(Data_in)[0:10,0]
Thanks a lot.
Full Code:
import torch
import numpy as np
from torch.autograd import Variable
from torch import multiprocessing

def func(msg, BOP):
    ans = (BOP[msg][0] + BOP[msg][1] / BOP[msg][2]) * BOP[msg][3]
    return ans

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden_1, n_hidden_2, n_output):
        super(Net, self).__init__()
        self.hidden_1 = torch.nn.Linear(n_feature, n_hidden_1)   # hidden layer
        self.hidden_2 = torch.nn.Linear(n_hidden_1, n_hidden_2)  # hidden layer
        self.predict = torch.nn.Linear(n_hidden_2, n_output)     # output layer

    def forward(self, x):
        x = torch.tanh(self.hidden_1(x))  # activation function for hidden layer
        x = torch.tanh(self.hidden_2(x))  # activation function for hidden layer
        x = self.predict(x)               # linear output
        return x

if __name__ == '__main__':  # apply_async
    Data_in = Variable(torch.from_numpy(np.asarray(list(range(0, 10))).reshape(10, 1)).float())
    Ground_truth = Variable(torch.from_numpy(np.asarray(list(range(20, 30))).reshape(10, 1)).float())
    net = Net(n_feature=1, n_hidden_1=15, n_hidden_2=15, n_output=4)  # define the network
    optimizer = torch.optim.Rprop(net.parameters())
    loss_func = torch.nn.MSELoss()  # this is for regression mean squared loss

    NN_output = net(Data_in)
    args = range(0, 10)
    pool = multiprocessing.Pool()
    return_data = pool.map(func, zip(args, NN_output))
    pool.close()
    pool.join()

    NN_energy = net(Data_in)[0:10, 0]
    for i in range(0, 10):
        NN_energy[i] = return_data[i]

    loss = torch.sqrt(loss_func(NN_energy, Ground_truth))  # must be (1. nn output, 2. target)
    print(loss)
Error messages:
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\multiprocessing\reductions.py", line 126, in reduce_tensor
    raise RuntimeError("Cowardly refusing to serialize non-leaf tensor which requires_grad, "
RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).
First of all, the torch.autograd.Variable API has been deprecated for a very long time; just don't use it.
Next, torch.from_numpy(np.asarray(list(range(0, 10))).reshape(10, 1)).float() is wrong on several levels: np.asarray of a list is pointless since a copy will be performed anyway, and np.array takes a list as input by design. Then, np.arange is available to return a range directly as a NumPy array, and its equivalent (torch.arange) is also available in Torch. Finally, specifying both dimensions for reshape is unnecessary and error-prone; you could simply do reshape((-1, 1)), or even better unsqueeze(-1).
Here is the simplified expression: torch.arange(10, dtype=torch.float32, requires_grad=True).unsqueeze(-1).
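A quick sanity check that the two expressions match (a sketch; requires_grad is left out here, as it does not change the values):

import numpy as np
import torch

a = torch.from_numpy(np.asarray(list(range(0, 10))).reshape(10, 1)).float()
b = torch.arange(10, dtype=torch.float32).unsqueeze(-1)
assert torch.equal(a, b)  # same values, same shape: (10, 1)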
Using a multiprocessing pool is bad practice when batch processing is possible: batching is both far more efficient and more readable. Indeed, performing N small algebraic operations in parallel is always slower than one larger algebraic operation, and even more so on GPU. More importantly, computing gradients is not supported by multiprocessing, hence the error that you get. (This is only partially true: it has been supported for tensors on CPU since 1.6.0; have a look at the official release changelog.)
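As a sketch of the batched alternative for the func in the question (stand-in tensors; the point is that the whole computation stays inside autograd, so no pool is needed):

import torch

def func_batched(BOP):
    # Vectorized form of (BOP[i][0] + BOP[i][1] / BOP[i][2]) * BOP[i][3],
    # computed for all rows at once; gradients flow through normally.
    return (BOP[:, 0] + BOP[:, 1] / BOP[:, 2]) * BOP[:, 3]

NN_output = torch.randn(10, 4, requires_grad=True)  # stand-in for net(Data_in)
NN_energy = func_batched(NN_output)                 # shape (10,)
NN_energy.sum().backward()                          # gradients reach NN_output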
Could you post a more representative example of what func might be, so we can make sure you really need multiprocessing?
NB: distributed autograd, which is what you seem to be looking for, is now available in PyTorch as an experimental feature, in beta since 1.6.0. Have a look at the official documentation.

Sklearn method in class

I would like to create a class that uses sklearn transformation methods. I found this article and I am using it as an example.
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.base import TransformerMixin

def minmax(dataframe):
    minmax_transformer = preprocessing.MinMaxScaler()
    return minmax_transformer

class FunctionFeaturizer(TransformerMixin):
    def __init__(self, scaler):
        self.scaler = scaler

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        fv = self.scaler(X)
        return fv

if __name__ == "__main__":
    scaling = FunctionFeaturizer(minmax)
    df = pd.DataFrame({'feature': np.arange(10)})
    df_scaled = scaling.fit(df).transform(df)
    print(df_scaled)
The output is StandardScaler(copy=True, with_mean=True, with_std=True), which is actually the result of preprocessing.StandardScaler().fit(df) if I use it outside the class.
What I am expecting is:
array([[0. ],
[0.11111111],
[0.22222222],
[0.33333333],
[0.44444444],
[0.55555556],
[0.66666667],
[0.77777778],
[0.88888889],
[1. ]])
I have a feeling that I am mixing a few things up here, but I do not know what.
Update
I did some modifications:
def minmax():
    return preprocessing.MinMaxScaler()

class FunctionFeaturizer(TransformerMixin):
    def __init__(self, scaler):
        self.scaler = scaler

    def fit(self, X, y=None):
        return self

    def fit_transform(self, X):
        self.scaler.fit(X)
        return self.scaler.transform(X)

if __name__ == "__main__":
    scaling = FunctionFeaturizer(minmax)
    df = pd.DataFrame({'feature': np.arange(10)})
    df_scaled = scaling.fit_transform(df)
    print(df_scaled)
But now I am receiving the following error:
Traceback (most recent call last):
  File "C:/my_file.py", line 33, in <module>
    test_scale = scaling.fit_transform(df)
  File "C:/my_file.py", line 26, in fit_transform
    self.scaler.fit(X)
AttributeError: 'function' object has no attribute 'fit'
Solving your error
In your code you have:
if __name__ == "__main__":
    scaling = FunctionFeaturizer(minmax)
    df = pd.DataFrame({'feature': np.arange(10)})
    df_scaled = scaling.fit_transform(df)
    print(df_scaled)
change the line
scaling = FunctionFeaturizer(minmax)
to
scaling = FunctionFeaturizer(minmax())
You need to call the function so that the MinMaxScaler instance is returned to you.
Suggestion
Instead of implementing fit and fit_transform, implement fit and transform, unless you can optimize both steps into fit_transform. This way, it is clearer what you are doing.
If you implement only fit and transform, you can still call fit_transform because you extend the TransformerMixin class. It will just call both functions in a row.
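A minimal sketch of that suggestion, passing a scaler instance as recommended above:

from sklearn import preprocessing
from sklearn.base import TransformerMixin

class FunctionFeaturizer(TransformerMixin):
    def __init__(self, scaler):
        self.scaler = scaler  # expects a scaler instance, e.g. minmax()

    def fit(self, X, y=None):
        self.scaler.fit(X)
        return self

    def transform(self, X):
        return self.scaler.transform(X)

featurizer = FunctionFeaturizer(preprocessing.MinMaxScaler())
# fit_transform is inherited from TransformerMixin: it calls fit, then transform.
print(featurizer.fit_transform([[0], [1], [2], [3]]))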
Getting your expected results
Your transformer is looking at every column of your dataset and distributing the values linearly between 0 and 1.
So, whether you get your expected results really depends on what your df looks like. You did not share that with us, so it is difficult to tell.
However, if you have df = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]], you will see your expected result:
if __name__ == "__main__":
    scaling = FunctionFeaturizer(minmax())
    df = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
    df_scaled = scaling.fit_transform(df)
    print(df_scaled)

> [[0.        ]
>  [0.11111111]
>  [0.22222222]
>  [0.33333333]
>  [0.44444444]
>  [0.55555556]
>  [0.66666667]
>  [0.77777778]
>  [0.88888889]
>  [1.        ]]

TensorFlow layers: using custom(ized) initialization function?

Why does obtaining a new initialization function with partial give me an error, while a lambda doesn't?
All of these functions:
f_init = partial(tf.random_normal, mean=0.0, stddev=0.01, partition_info=None)
f_init = partial(tf.contrib.layers.xavier_initializer, partition_info=None)
f_init = partial(tf.random_normal, mean=0.0, stddev=0.01)
f_init = tf.contrib.layers.xavier_initializer
Throw the following exception:
TypeError: ... got an unexpected keyword argument 'partition_info'
(where ... stands for xavier_initializer and the other functions, of course)
When applied to a simple conv2d layer:
conv1 = tf.layers.conv2d(x, 32, [5, 5],
                         strides=[1, 1],
                         padding="same",
                         activation=tf.nn.relu,
                         kernel_initializer=f_init,
                         name="conv1")
However, if I use a lambda to obtain custom initialization functions:
f_init = lambda shape, dtype, partition_info=None: \
    tf.random_normal(shape, mean=0.0, stddev=0.01, dtype=dtype)
...it works without any problems.
Shouldn't partial also return a new anonymous function of, e.g., tf.random_normal supplied with mean=0.0 and stddev=0.01 like the lambda statement does?
The error says that the functions tf.random_normal and tf.contrib.layers.xavier_initializer do not have a parameter named partition_info, which is indeed the case: there is no such parameter (see here and here).
Your lambda works, because it does not pass the partition_info to tf.random_normal, which is correct.
Also make sure not to confuse functions that return initialization values (like tf.random_normal) with the corresponding initializers (like tf.random_normal_initializer). The former returns floats; the latter creates a callable that expects a shape, a dtype, and the partition_info. When called, that callable returns the normally distributed values.
Your lambda conforms to this signature, and thus it works. But when using partial, the signature of the resulting callable is just the list of parameters that haven't been frozen by the call to partial:
f_init = partial(tf.random_normal, mean=0.0, stddev=0.01)
Since tf.random_normal has the signature:
def random_normal(shape, mean=0.0, stddev=1.0, dtype=dtypes.float32,
                  seed=None, name=None):
    # ...
you can use the partial as if it were defined like this:
def f_init(shape, dtype=dtypes.float32, seed=None, name=None):
    # ...
Note that there is no parameter named partition_info, but TensorFlow will try to pass it when calling f_init, resulting in the error you got.
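The mechanism is easy to reproduce with functools.partial alone; here is a sketch with a hypothetical stand-in function (not TensorFlow code):

from functools import partial

def random_normal(shape, mean=0.0, stddev=1.0, dtype="float32",
                  seed=None, name=None):
    return f"normal{shape}, mean={mean}, stddev={stddev}"

f_init = partial(random_normal, mean=0.0, stddev=0.01)
print(f_init((5, 5)))  # fine: shape is still a free parameter

try:
    f_init((5, 5), partition_info=None)  # what the layer effectively does
except TypeError as e:
    print(e)  # got an unexpected keyword argument 'partition_info'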
To customize things like mean and stddev, though, you do not need to create a custom initializer. This, for example, creates an initializer that returns normally distributed values with mean 0.0 and standard deviation 0.01:
f_init = tf.random_normal_initializer(mean=0.0, stddev=0.01)
But if you need a custom initializer, e.g. to implement custom initialization logic, you could follow this pattern (see here):
class RandomNormal(Initializer):
    def __init__(self, mean=0.0, stddev=1.0, seed=None, dtype=dtypes.float32):
        self.mean = mean
        self.stddev = stddev
        self.seed = seed
        self.dtype = _assert_float_dtype(dtypes.as_dtype(dtype))

    def __call__(self, shape, dtype=None, partition_info=None):
        if dtype is None:
            dtype = self.dtype
        normal = random_ops.random_normal(shape, self.mean, self.stddev,
                                          dtype, seed=self.seed)
        # do what you want with normal here
        return normal

    def get_config(self):
        return {"mean": self.mean,
                "stddev": self.stddev,
                "seed": self.seed,
                "dtype": self.dtype.name}

# Alias to lower_case, 'function-style' name
random_normal = RandomNormal
