How do I recalculate the gradient after changing the input? Pytorch - pytorch

I am trying to do something like this without redefining a = f(x,y):
a = f(x,y)
find gradient of a with respect to x
change x
find gradient of a with respect to x
find gradient of a with respect to y
I tried a partial example below but it just gives me an error. Does anyone know how I can do this without redefining the original function everytime?
>>> x = torch.tensor([2.], requires_grad=True)
>>> y = 10*x**2
>>> torch.autograd.grad(y,x, retain_graph=True)
(tensor([40.]),)
>>> x = torch.tensor([1.], requires_grad=True)
>>> torch.autograd.grad(y,x, retain_graph=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/Philip/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 157, in grad
inputs, allow_unused)
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

How do I recalculate the gradient after changing the input?
In general you need to recompute the output with the new input.
Consider the way that the backpropagation algorithm works. Depending on the form of f, different intermediate results will need to be saved for later use by the backpropagation algorithm. These intermediate results may or may not depend on the original value of x, even when computing the gradient w.r.t. y.
For example, if f(x,y) = g(h(x,y)) then by the chain rule df/dy = dg/dh * dh/dy. To make this a little more concrete lets consider the case where g is some non-linear function and h(x,y) = x*y. Then we have that df/dy = g'(h(x,y))*x. The reason backpropagation works efficiently here is that it caches the intermediate value of h(x,y) during forward pass, so all it needs to do is plug that value into g' during the backward pass. If you change the value of x, then the cached value of h(x,y) will no longer be the correct value needed to compute the gradient you are interested in (which should be using the value of h computed using the new value of x). Therefore you must recompute the forward pass again to store the correct cached values.

Related

How to get a 2D output from linear layer in pytorch?

I would like to project a tensor into a space with an additional dimension.
I tried
torch.nn.Linear(
in_features=num_inputs,
out_features=(num_inputs, num_additional),
)
But this results in an error
A workaround would be to
torch.nn.Linear(
in_features=num_inputs,
out_features=num_inputs*num_additional,
)
and then change the view the output
output.view(batch_size, num_inputs, num_additional)
But I imagine this workaround will get tricky to read, especially when a projection into more than one additional dimension is desired.
Is there a more direct way to code this operation?
Perhaps the source code for linear can be changed
https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear
To accept more dimensions for the weight and bias initialization, and F.linear seems like it would need to be replaced with a different function.
IMO the workaround you provided is already clear enough. However, if you want to express this as a single operation, you can always write your own module by subclassing torch.nn.Linear:
import numpy as np
import torch
class MultiDimLinear(torch.nn.Linear):
def __init__(self, in_features, out_shape, **kwargs):
self.out_shape = out_shape
out_features = np.prod(out_shape)
super().__init__(in_features, out_features, **kwargs)
def forward(self, x):
out = super().forward(x)
return out.reshape((len(x), *self.out_shape))
if __name__ == '__main__':
tmp = torch.empty((32, 10))
linear = MultiDimLinear(in_features=10, out_shape=(10, 10))
out = linear(tmp)
print(out.shape) # (32, 10, 10)
Another way would be to use torch.einsum
https://pytorch.org/docs/stable/generated/torch.einsum.html
torch.einsum can prevent summation across dimensions in tensor to tensor multiplication operations. This can allow separate multiplication operations to happen in parallel. [ I do not know if this would necessarily result in GPU efficiency; if the operations are still occurring in the same kernel. In fact, it may be slower https://github.com/pytorch/pytorch/issues/32591 ]
How this would work is to directly initialize the weight and bias tensors (look at source code for the torch linear layer for that code)
Say that the input (X) has dimensions (a, b), where a is the batch size.
Say that you want to pass this input through a series of classifiers, represented in a single weight tensor (W) with dimensions (c, d, e), where c is the number of classifiers, and e is the number of classes for the classifier
import torch
x = torch.arange(2*4).view(2, 4)
w = torch.arange(5*4*6).view(5, 4, 2)
torch.einsum('ab, cbe -> ace', x, w)
in the last line, a and b are the dimensions of the input as mentioned above. What might be the tricky part is c, b, and e are the dimensions of the classifiers weight tensor; I didn't use d, I used b instead. That is because the vector multiplication is happening along that dimension for the inputs tensor and the weight tensor. So that's why the left side of the einsum equation is ab, cbe. The right side of the einsum equation is simply what dimensions to exclude from summation.
The final dimensions we want is (a, c, e). a is the batch size, c is the number of classifiers, and e is the number of classes for each classifier. We do not want to add those values, so to preserve their separation, the left side of the equation is ace.
For those unfamiliar with einsum, this will be harder to read than the word around I created (though I highly recommend learning it, because it gets very easy and intuitive very fast even though it's a bit tricky at first https://www.youtube.com/watch?v=pkVwUVEHmfI )
However, for paralyzing certain operations (especially on GPU), it seems that einsum is the only way to do it. For example so that in my previous example, I didn't want to use a classification head yet, I just wanted to project to multiple dimensions.
import torch
x = torch.arange(2*4).view(2, 4)
w = torch.arange(5*4*6).view(5, 4, 4)
y = torch.einsum('ab, cbe -> ace', x, w)
And say I do a few other operations to y, perhaps some non linear operations, activations, etc.
z = f(y)
z will still have the dimensions 2, 5, 4. Batch size two, 5 hidden states per batch, and the dimension of those hidden states are 4.
And then I want to apply a classifier to each separate tensor.
w2 = torch.arange(4*2).view(4, 2)
final = torch.einsum('fgh, hj -> fgj', z, w2)
Quick refresh, 2 is the batch size, 5 is the number of classifier, and 2 is the number of outputs for each classifier.
The output dimensions, f, g, j (2, 5, 2) will not be summed across, and thus will be preserved in the output.
As cited in the github link, this may be slower than just using regular linear layers. There may be efficiencies in a very large number of parallel operations.

Which is the error of a value corresponding to the maximum of a function?

This is my problem:
The first input is the observed data of MUSE, which is an astronomical instrument provides cubes, i.e. an image for each wavelength with a certain range. This means that, taken all the wavelengths corresponding to the pixel i,j, I can extract the spectrum for this pixel. Since these images are observed, for each pixel I have an error.
The second input is a spectrum template, i.e. a model of a spectrum. This template is assumed to be without error. I map this spectra at various redshift (this means multiply the wavelenghts for a factor 1+z, where z belong to a certain range).
The core of my code is the cross-correlation between the cube, i.e. the spectra extracted from each pixel, and the template mapped at different redshift. The result is a cross-correlation function for each pixel for each z, let's call this computed function as f(z). Taking, for each pixel, the argmax of f(z), I get the best redshift.
This is a common and widely-used process, indeed, it actually works well.
My question:
Since my input, i.e. the MUSE cube, has an error, I have propagated this error through the cross-correlation, obtaining an error on f(z), i.e. each f_i has a error sigma_i. So, how can I compute the error on z_max, which is the value of z corresponding to the maximum of f?
Maybe a solution could be the implementation of bootstrap method: I can extract, within the error of f, a certain number of function, for each of them I computed the argamx, so i can have an idea about the scatter of z_max.
By the way, I'm using python (3.x) and tensorflow has been used to compute the cross-correlation function.
Thanks!
EDIT
Following #TF_Support suggestion I'm trying to add some code and some figures to better understand the problem. But, before this, maybe it's better a little of math.
With this expression I had computed the cross-correlation:
where S is the spectra, T is the template and N is the normalization coefficient. Since S has an error, I had propagated these errors through the previous relation founding:
where SST_k is the the sum of the template squared and sigma_ij is the error on on S_ij (actually, I should have written sigma_S_ij).
The follow function (implemented with tensorflow 2.1) makes the cross-correlation between one template and the spectra of batch pixels, and computes the error on the cross-correlation function:
#tf.function
def make_xcorr_err1(T, S, sigma_S):
sum_spectra_sq = tf.reduce_sum(tf.square(S), 1) #shape (batch,)
sum_template_sq = tf.reduce_sum(tf.square(T), 0) #shape (Nz, )
norm = tf.sqrt(tf.reshape(sum_spectra_sq, (-1,1))*tf.reshape(sum_template_sq, (1,-1))) #shape (batch, Nz)
xcorr = tf.matmul(S, T, transpose_a = False, transpose_b= False)/norm
foo1 = tf.matmul(sigma_S**2, T**2, transpose_a = False, transpose_b= False)/norm**2
foo2 = xcorr**2 * tf.reshape(sum_template_sq**2, (1,-1)) * tf.reshape(tf.reduce_sum((S*sigma_S)**2, 1), (-1,1))/norm**4
foo3 = - 2 * xcorr * tf.reshape(sum_template_sq, (1,-1)) * tf.matmul(S*(sigma_S)**2, T, transpose_a = False, transpose_b= False)/norm**3
sigma_xcorr = tf.sqrt(tf.maximum(foo1+foo2+foo3, 0.))
Maybe, in order to understand my problem, more important than code is an image representing an output. This is the cross-correlation function for a single pixel, in red the maximum value, let's call z_best, i.e. the best cross-correlated value. The figure also shows the 3 sigma errors (the grey limits are +3sigma -3sigma).
If i zoom-in near the peak, I get this:
As you can see the maximum (as any other value) oscillates within a certain range. I would like to find a way to map this fluctuations of maximum (or the fluctuations around the maximum, or the fluctuations of the whole function) to an error on the value corresponding the maximum, i.e. an error on z_best.

Covariance matrix is NoneType in lmfit Python3-6

I want to fit a f(x,y) function using lmfit. Dataset is small and there are many fitting parameters (6 points on x-axis, 11 points on y-axis and 16 unconstrained fitting parameters). Using all defaults from Model.fit I cannot obtain covariance matrix and during the fitting process the values of free parameters are not being changed at all.
I tried to change initial values for the parameters. However, when I set the same kind of problem in OriginPro Surface Fitting functionality, the Levenberg-Marquardt algorithm manages to fit data and estimate the errors (although quite large-valued for certain parameters). This means that there has to be some problem with my code. I can't find where the problem lies. I'm not Python master.
The MWE is as below.
import numpy as np
from lmfit import Model, Parameters
import numdifftools # not calling this doesn't change anything
x, y = np.array([226.5, 361.05, 404.41, 589, 632.8, 1013.98]), np.linspace(0,100,11)
X, Y = np.meshgrid(x, y)
Z = np.array([[1.3945, 1.34896, 1.34415, 1.33432, 1.33306, 1.32612],\
[1.39422, 1.3487, 1.34389, 1.33408, 1.33282, 1.32591],\
[1.39336, 1.34795, 1.34315, 1.33336, 1.33211, 1.32524],\
[1.39208, 1.34682, 1.34205, 1.3323, 1.33105, 1.32424],\
[1.39046, 1.3454, 1.34065, 1.33095, 1.32972, 1.32296],\
[1.38854, 1.34373, 1.33901, 1.32937, 1.32814, 1.32145],\
[1.38636, 1.34184, 1.33714, 1.32757, 1.32636, 1.31974],\
[1.38395, 1.33974, 1.33508, 1.32559, 1.32438, 1.31784],\
[1.38132, 1.33746, 1.33284, 1.32342, 1.32223, 1.31576],\
[1.37849, 1.33501, 1.33042, 1.32109, 1.31991, 1.31353],\
[1.37547, 1.33239, 1.32784, 1.31861, 1.31744, 1.31114]])
#This has to be defined beforehand (otherwise parameters names are not defined error)
a1,a2,a3,a4 = 1.3208, -1.2325E-5, -1.8674E-6, 5.0233E-9
b1,b2,b3,b4 = 5208.2413, -0.5179, -2.284E-2, 6.9608E-5
c1,c2,c3,c4 = -2.5551E8, -18341.336, -920, 2.7729
d1,d2,d3,d4 = 9.3495, 2E-3, 3.6733E-5, -1.2932E-7
# Function to fit
def model(x, y, *args):
return a1+a2*y+a3*np.power(y,2)+a4*np.power(y,3)+\
(b1+b2*y+b3*np.power(y,2)+b4*np.power(y,3))/np.power(x,2)+\
(c1+c2*y+c3*np.power(y,2)+c4*np.power(y,3))/np.power(x,4)+\
(d1+d2*y+d3*np.power(y,2)+d4*np.power(y,3))/np.power(x,6)
# This is the callable that is passed to Model.fit. M is a (2,N) array
# where N is the total number of data points in Z, which will be ravelled
# to one dimension.
def _model(M, **args):
x, y = M
arr = model(x, y, params)
return arr
# We need to ravel the meshgrids of X, Y points to a pair of 1-D arrays.
xdata = np.vstack((X.ravel(), Y.ravel()))
# Fitting parameters.
fmodel = Model(_model)
params = Parameters()
params.add_many(('a1',1.3208,True,1,np.inf,None,None),\
('a2',-1.2325E-5,True,-np.inf,np.inf,None,None),\
('a3',-1.8674E-6,True,-np.inf,np.inf,None,None),\
('a4',5.0233E-9,True,-np.inf,np.inf,None,None),\
('b1',5208.2413,True,-np.inf,np.inf,None,None),\
('b2',-0.5179,True,-np.inf,np.inf,None,None),\
('b3',-2.284E-2,True,-np.inf,np.inf,None,None),\
('b4',6.9608E-5,True,-np.inf,np.inf,None,None),\
('c1',-2.5551E8,True,-np.inf,np.inf,None,None),\
('c2',-18341.336,True,-np.inf,np.inf,None,None),\
('c3',-920,True,-np.inf,np.inf,None,None),\
('c4',2.7729,True,-np.inf,np.inf,None,None),\
('d1',9.3495,True,-np.inf,np.inf,None,None),\
('d2',2E-3,True,-np.inf,np.inf,None,None),\
('d3',3.6733E-5,True,-np.inf,np.inf,None,None),\
('d4',-1.2932E-7,True,-np.inf,np.inf,None,None))
result = fmodel.fit(Z.ravel(), params, M=xdata)
fit = model(X, Y, result.params)
print(result.covar)
This code results in covariance being NoneType. I expect that it will after all be calculated, because Origin can somehow manage. If it is needed I can provide all parameters from Origin Surface Fitting Parameters.
When plotting Z-fit difference, there is quite large discrepancy for low x-values (not happening in Origin).
You are not defining your model function in a way that can be used sensibly by lmfit. You have:
def _model(M, **args):
x, y = M
arr = model(x, y, params)
return arr
def model(x, y, *args):
return a1+a2*y+a3*np.power(y,2)+a4*np.power(y,3)+\
(b1+b2*y+b3*np.power(y,2)+b4*np.power(y,3))/np.power(x,2)+\
(c1+c2*y+c3*np.power(y,2)+c4*np.power(y,3))/np.power(x,4)+\
(d1+d2*y+d3*np.power(y,2)+d4*np.power(y,3))/np.power(x,6)
model = Model(_model)
Which has a few problems:
args is not used in _model, and params is not defined in the function so will be module-level.
Similarly in model, args is not used and a1, a2, etc will be taken from the module-level (programming) variables and (importantly!!) these will not be updated in the fit.
In short, your model function never sees varying values for the parameters.
lmfit.Model takes the named function arguments and turns those into parameter names. It does not turn **kws or *position_args into parameter names. So I think that what you want to do is write a model function like this:
def model(x, y, a1, a2, a2, a4, b1, b2, b3 ,b3, c1, c2, c3, c4,
d1, d2, d3, d4):
return a1+a2*y+a3*np.power(y,2)+a4*np.power(y,3)+\
(b1+b2*y+b3*np.power(y,2)+b4*np.power(y,3))/np.power(x,2)+\
(c1+c2*y+c3*np.power(y,2)+c4*np.power(y,3))/np.power(x,4)+\
(d1+d2*y+d3*np.power(y,2)+d4*np.power(y,3))/np.power(x,6)
Then create a model from that with:
# Note: don't give a function and Model instance the same name!!
my_model = Model(model, independent_vars=('x', 'y'))
With that model defined you can run the fit, and without having to unravel your data (the independent data in lmfit can be of almost any data type, and data arrays can be multi-dimensional):
result = my_model.fit(Z, params, x=X, y=Y)
For what it is worth, making such changes works for me in the sense that the fit runs to completion. The fit still gets stuck with some of the parameters not updating from their initial values, but that is sort of a separate question from the mechanics of setting up and running the fit, and is probably due to polynomials being pretty unstable or poor initial estimates.
As an aside: np.power(y,n) can be spelled y**n and readability counts. Also, numerical stability is sometimes improved with replaced
a + b*x + c*x**2 + d*x**3
with
a + x*(b + x*(c + x*d))
Though I do not know if that would help in your case.

Matrix value not needed for Lop?

In the theano derivatives tutorial here:
http://deeplearning.net/software/theano/tutorial/gradients.html#tutcomputinggrads
the example of Lop works without an explicit value of the W matrix in the dot product. And, in fact, the partial derivatives in this case do remove the values of the components of W so they are not needed.
But, attempting a similar thing with the Rop throws an error:
theano.gof.fg.MissingInputError: ("An input of the graph, used to compute dot(Elemwise{second,no_inplace}.0, ), was not provided and not given a value.
How is this different?
Theano will try to optimize the computation graph, but it does not always work.
In the Lop example, Theano can detect that we don't actually need that W, but when changed to the Rop it just can't.
The Lop example:
W = T.dmatrix('W')
v = T.dvector('v')
x = T.dvector('x')
y = T.dot(x, W)
VJ = T.Lop(y, W, v)
f = theano.function([v, x], VJ)
f([2, 2], [0, 1])
If I just change y = T.dot(x, W) to y = T.dot(x, W**1), Theano will fail to do the optimization and throw the same error message at me say that I did not provide enough parameters.
Actually in the Rop example, if we change the values given to W, it does not affect the result at all, because Theano failed to optimize that.
p.s. I find the Theano documents very unclear sometimes.

matrices are not aligned Error: Python SciPy fmin_bfgs

Problem Synopsis:
When attempting to use the scipy.optimize.fmin_bfgs minimization (optimization) function, the function throws a
derphi0 = np.dot(gfk, pk)
ValueError: matrices are not aligned
error. According to my error checking this occurs at the very end of the first iteration through fmin_bfgs--just before any values are returned or any calls to callback.
Configuration:
Windows Vista
Python 3.2.2
SciPy 0.10
IDE = Eclipse with PyDev
Detailed Description:
I am using the scipy.optimize.fmin_bfgs to minimize the cost of a simple logistic regression implementation (converting from Octave to Python/SciPy). Basically, the cost function is named cost_arr function and the gradient descent is in gradient_descent_arr function.
I have manually tested and fully verified that *cost_arr* and *gradient_descent_arr* work properly and return all values properly. I also tested to verify that the proper parameters are passed to the *fmin_bfgs* function. Nevertheless, when run, I get the ValueError: matrices are not aligned. According to the source review, the exact error occurs in the
def line_search_wolfe1
function in # Minpack's Wolfe line and scalar searches as supplied by the scipy packages.
Notably, if I use scipy.optimize.fmin instead, the fmin function runs to completion.
Exact Error:
File
"D:\Users\Shannon\Programming\Eclipse\workspace\SBML\sbml\LogisticRegression.py",
line 395, in fminunc_opt
optcost = scipy.optimize.fmin_bfgs(self.cost_arr, initialtheta, fprime=self.gradient_descent_arr, args=myargs, maxiter=maxnumit, callback=self.callback_fmin_bfgs, retall=True)
File
"C:\Python32x32\lib\site-packages\scipy\optimize\optimize.py", line
533, in fmin_bfgs old_fval,old_old_fval)
File "C:\Python32x32\lib\site-packages\scipy\optimize\linesearch.py", line
76, in line_search_wolfe1
derphi0 = np.dot(gfk, pk)
ValueError: matrices are not aligned
I call the optimization function with:
optcost = scipy.optimize.fmin_bfgs(self.cost_arr, initialtheta, fprime=self.gradient_descent_arr, args=myargs, maxiter=maxnumit, callback=self.callback_fmin_bfgs, retall=True)
I have spent a few days trying to fix this and cannot seem to determine what is causing the matrices are not aligned error.
ADDENDUM: 2012-01-08
I worked with this a lot more and seem to have narrowed the issues (but am baffled on how to fix them). First, fmin (using just fmin) works using these functions--cost, gradient. Second, the cost and the gradient functions both accurately return expected values when tested in a single iteration in a manual implementation (NOT using fmin_bfgs). Third, I added error code to optimize.linsearch and the error seems to be thrown at def line_search_wolfe1 in line: derphi0 = np.dot(gfk, pk).
Here, according to my tests, scipy.optimize.optimize pk = [[ 12.00921659]
[ 11.26284221]]pk type = and scipy.optimize.optimizegfk = [[-12.00921659] [-11.26284221]]gfk type =
Note: according to my tests, the error is thrown on the very first iteration through fmin_bfgs (i.e., fmin_bfgs never even completes a single iteration or update).
I appreciate ANY guidance or insights.
My Code Below (logging, documentation removed):
Assume theta = 2x1 ndarray (Actual: theta Info Size=(2, 1) Type = )
Assume X = 100x2 ndarray (Actual: X Info Size=(2, 100) Type = )
Assume y = 100x1 ndarray (Actual: y Info Size=(100, 1) Type = )
def cost_arr(self, theta, X, y):
theta = scipy.resize(theta,(2,1))
m = scipy.shape(X)
m = 1 / m[1] # Use m[1] because this is the length of X
logging.info(__name__ + "cost_arr reports m = " + str(m))
z = scipy.dot(theta.T, X) # Must transpose the vector theta
hypthetax = self.sigmoid(z)
yones = scipy.ones(scipy.shape(y))
hypthetaxones = scipy.ones(scipy.shape(hypthetax))
costright = scipy.dot((yones - y).T, ((scipy.log(hypthetaxones - hypthetax)).T))
costleft = scipy.dot((-1 * y).T, ((scipy.log(hypthetax)).T))
def gradient_descent_arr(self, theta, X, y):
theta = scipy.resize(theta,(2,1))
m = scipy.shape(X)
m = 1 / m[1] # Use m[1] because this is the length of X
x = scipy.dot(theta.T, X) # Must transpose the vector theta
sig = self.sigmoid(x)
sig = sig.T - y
grad = scipy.dot(X,sig)
grad = m * grad
return grad
def fminunc_opt_bfgs(self, initialtheta, X, y, maxnumit):
myargs= (X,y)
optcost = scipy.optimize.fmin_bfgs(self.cost_arr, initialtheta, fprime=self.gradient_descent_arr, args=myargs, maxiter=maxnumit, retall=True, full_output=True)
return optcost
In case anyone else encounters this problem ....
1) ERROR 1: As noted in the comments, I incorrectly returned the value from my gradient as a multidimensional array (m,n) or (m,1). fmin_bfgs seems to require a 1d array output from the gradient (that is, you must return a (m,) array and NOT a (m,1) array. Use scipy.shape(myarray) to check the dimensions if you are unsure of the return value.
The fix involved adding:
grad = numpy.ndarray.flatten(grad)
just before returning the gradient from your gradient function. This "flattens" the array from (m,1) to (m,). fmin_bfgs can take this as input.
2) ERROR 2: Remember, the fmin_bfgs seems to work with NONlinear functions. In my case, the sample that I was initially working with was a LINEAR function. This appears to explain some of the anomalous results even after the flatten fix mentioned above. For LINEAR functions, fmin, rather than fmin_bfgs, may work better.
QED
As of current scipy version you need not pass fprime argument. It will compute the gradient for you without any issues. You can also use 'minimize' fn and pass method as 'bfgs' instead without providing gradient as argument.

Resources