There seems to be a problem mixing PyTorch's autograd with joblib. I need to compute gradients in parallel for a lot of samples. Joblib works fine with other parts of PyTorch, but mixing it with autograd gives errors. I made a very small example in which the serial version works fine but the parallel version crashes.
import torch
from torch import autograd
from joblib import Parallel, delayed
import numpy as np
torch.autograd.set_detect_anomaly(True)
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)
def Grad(X, Out):
    return autograd.grad(Out, X, create_graph=True, allow_unused=False)[0]
xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()
    yi = xi * xi
    xs += [xi]
    ys += [yi]
Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)
Grads_parallel = Parallel(n_jobs=2)([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)
The error message is not very helpful either:
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
Joblib does not copy the graph associated with the operations to the worker processes. One way to work around this is to perform the computation inside the job.
import torch
from torch import autograd
from joblib import Parallel, delayed
import numpy as np
torch.autograd.set_detect_anomaly(False)
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)
def Grad(X, Out):
    # This will compute yi in the job, and thus will
    # create the graph here
    yi = Out[0](*Out[1])
    # now the differentiation works
    return autograd.grad(yi, X, create_graph=True, allow_unused=False)[0]
torch.set_num_threads(1)
xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()
    # a (function, arguments) pair; the function is only called inside the job
    yi = (lambda xi: xi * xi, [xi])
    xs += [xi]
    ys += [yi]
Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)
Grads_parallel = Parallel(n_jobs=2)([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)
Edit
More philosophical questions are
(1) Does it make sense to use joblib parallelism if you can simply vectorize your operations and let torch use intra-operator parallelism?
(2) mak14 mentioned using the threading backend, and it is good that it fixes your example. But because of Python's global interpreter lock, threads running Python code effectively share one CPU; that suits I/O-bound jobs, like making HTTP requests, but not CPU-bound operations.
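For point (1), here is a minimal sketch of the vectorized alternative, assuming every sample uses the same function y = x * x as in the question. Because each y_i depends only on x_i, differentiating the sum yields all per-sample gradients in one call. The variable names are illustrative.

```python
import torch
from torch import autograd

torch.manual_seed(0)

# All samples live in one tensor instead of a Python list of scalars.
xs = torch.rand(10, requires_grad=True)
ys = xs * xs

# d(sum_i y_i)/dx_j = dy_j/dx_j, since y_i only depends on x_i.
(grads,) = autograd.grad(ys.sum(), xs, create_graph=True)
print("grads", grads)
```

This only applies when the samples do not interact; if they do, the gradient of the sum mixes contributions from different samples.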
Edit #2
The existence of torch.multiprocessing suggests that gradients require some special treatment. You could attempt to write a joblib backend that uses torch.multiprocessing instead of multiprocessing or threading.
Here is an overview of how graphs are constructed in both frameworks:
https://www.tensorflow.org/guide/intro_to_graphs
https://pytorch.org/blog/computational-graphs-constructed-in-pytorch/
But I fear that giving a definite answer as to why one works and not the other would require looking into the implementation.
The problem is that Parallel uses "loky" as its default backend. If you use "threading" as the backend instead, your code will run as intended; refer to the documentation of the Joblib Parallel class.
So, editing your provided code to the following:
from joblib import Parallel, delayed
import numpy as np
import torch
torch.autograd.set_detect_anomaly(True)
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)
def Grad(X, Out):
    return torch.autograd.grad(Out, X, create_graph=True, allow_unused=False)[0]
xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()
    yi = xi * xi
    xs += [xi]
    ys += [yi]
Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)
Grads_parallel = Parallel(n_jobs=2, backend="threading")([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)
will give the following results:
Grads_serial [tensor(0.6083, grad_fn=<AddBackward0>), tensor(0.0944, grad_fn=<AddBackward0>), tensor(1.8791, grad_fn=<AddBackward0>), tensor(1.5986, grad_fn=<AddBackward0>), tensor(0.4832, grad_fn=<AddBackward0>), tensor(1.3194, grad_fn=<AddBackward0>), tensor(0.4672, grad_fn=<AddBackward0>), tensor(1.0045, grad_fn=<AddBackward0>), tensor(1.8631, grad_fn=<AddBackward0>), tensor(0.2853, grad_fn=<AddBackward0>)]
Grads_parallel [tensor(0.6083, grad_fn=<AddBackward0>), tensor(0.0944, grad_fn=<AddBackward0>), tensor(1.8791, grad_fn=<AddBackward0>), tensor(1.5986, grad_fn=<AddBackward0>), tensor(0.4832, grad_fn=<AddBackward0>), tensor(1.3194, grad_fn=<AddBackward0>), tensor(0.4672, grad_fn=<AddBackward0>), tensor(1.0045, grad_fn=<AddBackward0>), tensor(1.8631, grad_fn=<AddBackward0>), tensor(0.2853, grad_fn=<AddBackward0>)]
I hope this response is helpful to you. Have a good day.
Related
I am creating a neural network for DeepSet where each element in the set is itself a set. Since these sets are not vectorizable I am representing my input as lists of lists of torch.tensors.
For the forward-pass of my network I don't think there is any alternative to using for-loops/list comprehension. Potentially these for-loops iterate a relatively big number of times. As such, training my networks is very time consuming since they are running in Python.
I have tried TorchScript, without a major improvement (running time improved by about 40%). I hope that rewriting my loops in Cython might yield better results. However, I can't figure out how to combine nn.Modules from PyTorch with Cython. Any suggestions?
Here is my module:
from typing import List
import torch
from torch import nn
class DSSN(nn.Module):
    def __init__(self, pd_rho, dh_rho, fc_network, device):
        super(DSSN, self).__init__()
        self.pers_lay1 = pd_rho
        self.pers_lay2 = dh_rho
        self.fc = fc_network
    def forward(self: nn.Module, x: List[List[torch.Tensor]]) -> torch.Tensor:
        # apply the inner layer to every element of every sample, then stack
        x = torch.stack([torch.stack([self.pers_lay1(pd) for pd in sample]) for sample in x])
        x = self.pers_lay2(x)
        x = self.fc(x)
        return x
I'm trying to determine how to compute the KL divergence of two torch.distributions.Distribution objects. I couldn't find a function to do that so far. Here is what I've tried:
import torch as t
from torch import distributions as tdist
import torch.nn.functional as F
def kl_divergence(x: t.distributions.Distribution, y: t.distributions.Distribution):
    """Compute the KL divergence between two distributions."""
    return F.kl_div(x, y)
a = tdist.Normal(0, 1)
b = tdist.Normal(1, 1)
print(kl_divergence(a, b)) # TypeError: kl_div(): argument 'input' (position 1) must be Tensor, not Normal
torch.nn.functional.kl_div is computing the KL-divergence loss. The KL-divergence between two distributions can be computed using torch.distributions.kl.kl_divergence.
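A minimal sketch of that call, using the two normal distributions from the question. For two normals with equal variance the closed form is (mu_a - mu_b)^2 / (2 * sigma^2), which is 0.5 here.

```python
import torch
from torch import distributions as tdist

a = tdist.Normal(0.0, 1.0)
b = tdist.Normal(1.0, 1.0)

# Analytic KL(a || b), looked up from the registered (Normal, Normal) rule.
kl = tdist.kl.kl_divergence(a, b)
print(kl)
```

Note that kl_divergence works on the distribution objects themselves, so no sampling is needed; it raises NotImplementedError for pairs of distribution types without a registered rule.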
tdist.Normal(...) returns a normal distribution object; you have to draw a sample from the distribution...
x = a.sample()
y = b.sample()
I am trying to get the gradient of a sum over some indices of an array using bincount. However, PyTorch does not implement the gradient for it. This can be implemented with a loop and torch.sum, but it is too slow. Is it possible to do this efficiently in PyTorch (maybe with einsum or index_add)? Of course, we can loop over the indices and add them one by one, but that would increase the size of the computational graph significantly and performs very poorly.
import torch
from torch import autograd
import numpy as np
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)
inds = tt([1, 5, 7, 1], False).long()
y = tt(np.arange(4) + 0.1).float()
sum_y_section = torch.bincount(inds, y * y, minlength=8)
#sum_y_section = torch.sum(y * y)
grad = autograd.grad(sum_y_section, y, create_graph=True, allow_unused=False)
print("sum_y_section", sum_y_section)
print("grad", grad)
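We can use a new feature in PyTorch 1.11 called scatter_reduce. Note that the call below matches the 1.11 signature; from 1.12 on, scatter_reduce also takes a src tensor, so the call needs adjusting there.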
bincount = lambda inds, arr: torch.scatter_reduce(arr, 0, inds, reduce="sum")
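The question also mentions index_add. Here is a minimal sketch of that route, offered as an alternative rather than as the method of the answer above. The helper name bincount_sum is hypothetical, and the inputs mirror the question's example.

```python
import torch
from torch import autograd

def bincount_sum(inds, weights, minlength):
    """Differentiable weighted bincount: out[k] = sum of weights[i] where inds[i] == k."""
    out = torch.zeros(minlength, dtype=weights.dtype)
    # index_add is out-of-place and supports autograd with respect to `weights`.
    return out.index_add(0, inds, weights)

inds = torch.tensor([1, 5, 7, 1])
y = torch.tensor([0.1, 1.1, 2.1, 3.1], requires_grad=True)

sum_y_section = bincount_sum(inds, y * y, minlength=8)
(grad,) = autograd.grad(sum_y_section.sum(), y, create_graph=True)
print("sum_y_section", sum_y_section)
print("grad", grad)
```

Each element of y lands in exactly one bin, so the gradient of the summed bins with respect to y is 2 * y.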
I’d try to use a hook to manipulate the gradient in a custom way
I tried using NumPy, SciPy and scikit-learn, but couldn't find what I need in any of them. Basically, I need to fit a curve to a dataset while restricting some of the coefficients to known values. I found how to do it in MATLAB using fittype, but couldn't do it in Python.
In my case I have a dataset of X and Y and I need to find the best-fitting curve. I know it's a second-degree polynomial (ax^2 + bx + c) and I know the values of b and c, so I just need to find the value of a.
The solution I found in MATLAB was https://www.mathworks.com/matlabcentral/answers/216688-constraining-polyfit-with-known-coefficients which is the same problem as mine, except that their polynomial was of degree 5. How could I do something similar in Python?
To add some info: I need to fit a curve to a dataset, so things like scipy.optimize.curve_fit, which expects a function, won't work (at least as far as I tried).
The tools you have available usually expect functions that take either only their parameters (a being the only unknown in your case), or their parameters plus some data (a, x, and y in your case).
Scipy's curve-fit handles that use-case just fine, so long as we hand it a function that it understands. It expects x first and all your parameters as the remaining arguments:
from scipy.optimize import curve_fit
import numpy as np
b = 0
c = 0
def f(x, a):
    return c+x*(b+x*a)
x = np.linspace(-5, 5)
y = x**2
# params == [1.]
params, _ = curve_fit(f, x, y)
Alternatively, you can reach for your favorite minimization routine. The difference here is that you manually construct the error function so that it takes only the parameters you care about, and then you don't need to provide the data to scipy.
from scipy.optimize import minimize
import numpy as np
b = 0
c = 0
x = np.linspace(-5, 5)
y = x**2
def error(a):
    prediction = c+x*(b+x*a)
    return np.linalg.norm(prediction-y)/len(prediction)**.5
result = minimize(error, np.array([42.]))
assert result.success
# params == [1.]
params = result.x
I don't think scipy has a partially applied polynomial fit function built-in, but you could use either of the above ideas to easily build one yourself if you do that kind of thing a lot.
from scipy.optimize import curve_fit
import numpy as np
def polyfit(coefs, x, y):
    # build a mapping from null coefficient locations to locations in the function
    # coefficients we're passing to curve_fit
    #
    # idx[j]==i means that unknown_coefs[i] belongs in coefs[j]
    _tmp = [i for i,c in enumerate(coefs) if c is None]
    idx = {j:i for i,j in enumerate(_tmp)}
    def f(x, *unknown_coefs):
        # create the entire polynomial's coefficients by filling in the unknown
        # values in the right places, using the aforementioned mapping
        p = [(unknown_coefs[idx[i]] if c is None else c) for i,c in enumerate(coefs)]
        return np.polyval(p, x)
    # we're passing an initial value just so that scipy knows how many parameters
    # to use
    params, _ = curve_fit(f, x, y, np.zeros((sum(c is None for c in coefs),)))
    # return all the polynomial's coefficients, not just the few we just discovered
    return np.array([(params[idx[i]] if c is None else c) for i,c in enumerate(coefs)])
x = np.linspace(-5, 5)
y = x**2
# (unknown)x^2 + 0x + 0
# params == [1, 0, 0.]
params = polyfit([None, 0, 0], x, y)
Similar features exist in nearly every mainstream scientific library; you just might need to reshape your problem a bit to frame it in terms of the available primitives.
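As one example of such a reframing (a sketch, not part of the approaches above): with b and c known, the model is linear in a, so subtracting the known terms from y leaves a one-parameter least-squares problem with the closed form a = sum(x^2 * r) / sum(x^4), where r = y - (b*x + c). The helper name fit_leading_coef and the test data are assumptions made for illustration.

```python
import numpy as np

def fit_leading_coef(x, y, b, c):
    """Least-squares estimate of a in y ~ a*x**2 + b*x + c, with b and c fixed."""
    residual = y - (b * x + c)   # what is left for the a*x**2 term to explain
    basis = x ** 2
    # Normal equation for a single unknown: a = <basis, residual> / <basis, basis>
    return float(basis @ residual / (basis @ basis))

x = np.linspace(-5, 5)
y = 3.0 * x**2 + 2.0 * x + 1.0   # synthetic data with a = 3, b = 2, c = 1
a = fit_leading_coef(x, y, b=2.0, c=1.0)
print(a)
```

This avoids an iterative optimizer altogether, but it only works when the unknown coefficients enter the model linearly, as they do for polynomials.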
I am using scikit learn to train a classification model. I have both discrete and continuous features in my training data. I want to do feature selection using maximum mutual information. If I have vectors x and labels y and the first three feature values are discrete I can get the MMI values like so:
mutual_info_classif(x, y, discrete_features=[0, 1, 2])
Now I'd like to use the same mutual information selection in a pipeline. I'd like to do something like this
SelectKBest(score_func=mutual_info_classif).fit(x, y)
but there's no way to pass the discrete features mask to SelectKBest. Is there some syntax to do this that I'm overlooking, or do I have to write my own score function wrapper?
Unfortunately I could not find this functionality for the SelectKBest.
But what we can do easily is extend the SelectKBest as our custom class to override the fit() method which will be called.
This is the current fit() method of SelectKBest (taken from source at github)
# No provision for extra parameters here
def fit(self, X, y):
    X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
    ....
    ....
    # Here only the X, y are passed to scoring function
    score_func_ret = self.score_func(X, y)
    ....
    ....
    self.scores_ = np.asarray(self.scores_)
    return self
Now we will define our new class SelectKBestCustom with the changed fit(). I have copied everything from the above source, changing only two lines (commented about it):
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.utils import check_X_y
class SelectKBestCustom(SelectKBest):
    # Changed here
    def fit(self, X, y, discrete_features='auto'):
        X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
        if not callable(self.score_func):
            raise TypeError("The score function should be a callable, %s (%s) "
                            "was passed."
                            % (self.score_func, type(self.score_func)))
        self._check_params(X, y)
        # Changed here also (passed by keyword, since recent scikit-learn
        # versions make discrete_features keyword-only)
        score_func_ret = self.score_func(X, y, discrete_features=discrete_features)
        if isinstance(score_func_ret, (list, tuple)):
            self.scores_, self.pvalues_ = score_func_ret
            self.pvalues_ = np.asarray(self.pvalues_)
        else:
            self.scores_ = score_func_ret
            self.pvalues_ = None
        self.scores_ = np.asarray(self.scores_)
        return self
This can be called simply like:
clf = SelectKBestCustom(mutual_info_classif,k=2)
clf.fit(X, y, discrete_features=[0, 1, 2])
Edit:
The above solution can be useful in pipelines also, and the discrete_features parameter can be assigned different values when calling fit().
Another Solution (less preferable):
Still, if you just need SelectKBest to work with mutual_info_classif temporarily (just analysing the results), we can also write a custom function that calls mutual_info_classif internally with hard-coded discrete_features. Something along the lines of:
def mutual_info_classif_custom(X, y):
    # To change discrete_features,
    # you need to redefine the function each time
    # Because once the func def is supplied to selectKBest, it cant be changed
    discrete_features = [0, 1, 2]
    return mutual_info_classif(X, y, discrete_features=discrete_features)
Usage of the above function:
selector = SelectKBest(mutual_info_classif_custom).fit(X, y)
You could also use partials as follows:
from functools import partial
discrete_mutual_info_classif = partial(mutual_info_classif, discrete_features=[0, 1, 2])
SelectKBest(score_func=discrete_mutual_info_classif).fit(x, y)
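Since the question asks about pipelines, here is a minimal end-to-end sketch of the partial approach inside a Pipeline. The synthetic data, the step names, the choice of k=2 and the LogisticRegression classifier are all assumptions made for illustration; random_state is fixed only to make the continuous-feature estimates repeatable.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
n = 200
X_disc = rng.integers(0, 3, size=(n, 3)).astype(float)  # columns 0-2: discrete
X_cont = rng.normal(size=(n, 2))                         # columns 3-4: continuous
X = np.hstack([X_disc, X_cont])
y = (X[:, 0] > 0).astype(int)  # label is determined by discrete feature 0

# Bind the discrete-feature mask up front so SelectKBest can call score_func(X, y).
score_func = partial(mutual_info_classif, discrete_features=[0, 1, 2], random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(score_func=score_func, k=2)),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

selected = pipe.named_steps["select"].get_support(indices=True)
print("selected feature indices:", selected)
```

The mask is fixed when the partial is built, so it must refer to column positions as they are at the point where the selector sits in the pipeline.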