Gradient flow through torch.nn.Parameter() - pytorch

I have a toy example
a = torch.ones(10)
b = torch.nn.Parameter(a,requires_grad=True)
c = (b**2).sum()
c.backward()
print(b.grad)
print(a.grad)
b.grad is calculated successfully, but a.grad is None. How do I make the gradient flow through torch.nn.Parameter? This example looks artificial, but I work with a class A derived from nn.Module whose parameters are initialized with outputs from some other module B, and I want gradients to flow through A's parameters to B's parameters.

@a_guest's answer is wrong. Using requires_grad=True here changes nothing, since torch.nn.Parameter is not tracked in the computation graph. You should do it the other way around: create the Parameter tensor first, then extract a raw tensor reference out of it:
a = torch.nn.Parameter(torch.ones((10,)), requires_grad=True)
b = a[:] # silly hack to convert in a raw tensor including the computation graph
b.retain_grad() # Otherwise backward pass will not store the gradient since it is not a leaf
c = (b**2).sum()
c.backward()
print(b.grad)
print(a.grad)
Another approach would be to copy the content of tensor a into b explicitly. Note that an nn.Parameter is always a graph leaf, so wrapping the copy in torch.nn.Parameter() would not let gradients reach a; the copy has to stay a plain, differentiable tensor:
a = torch.ones((10,), requires_grad=True)
b = a.clone()    # explicit, differentiable copy of a
b.retain_grad()  # b is not a leaf, so ask autograd to keep its gradient
c = (b**2).sum()
c.backward()
print(b.grad)
print(a.grad)
Yet it is not very convenient, since the copy must be done systematically.
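For the concrete use case in the question (a module A whose parameters are produced by another module B), a minimal sketch along the same lines keeps the generated tensor as a plain attribute instead of an nn.Parameter, so the computation graph back to B is preserved. The modules A and B below are hypothetical placeholders, not code from the question:
import torch

class B(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(3, 10)
    def forward(self, z):
        return self.linear(z)

class A(torch.nn.Module):
    def __init__(self, weight):
        super().__init__()
        # Plain attribute, NOT an nn.Parameter: keeps the graph back to B intact.
        self.weight = weight
    def forward(self, x):
        return (x * self.weight).sum()

b_module = B()
a_module = A(b_module(torch.randn(3)))
loss = a_module(torch.ones(10))
loss.backward()
print(b_module.linear.weight.grad is not None)  # True: gradients reached B's parameters
The trade-off is that self.weight is then not registered as a parameter of A, so it will not appear in A.parameters().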

Related

How to get a 2D output from linear layer in pytorch?

I would like to project a tensor into a space with an additional dimension.
I tried
torch.nn.Linear(
    in_features=num_inputs,
    out_features=(num_inputs, num_additional),
)
But this results in an error.
A workaround would be to use
torch.nn.Linear(
    in_features=num_inputs,
    out_features=num_inputs*num_additional,
)
and then change the view of the output:
output.view(batch_size, num_inputs, num_additional)
But I imagine this workaround will get tricky to read, especially when a projection into more than one additional dimension is desired.
Is there a more direct way to code this operation?
Perhaps the source code for Linear (https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear) could be changed to accept more dimensions for the weight and bias initialization, and F.linear seems like it would need to be replaced with a different function.
IMO the workaround you provided is already clear enough. However, if you want to express this as a single operation, you can always write your own module by subclassing torch.nn.Linear:
import numpy as np
import torch

class MultiDimLinear(torch.nn.Linear):
    def __init__(self, in_features, out_shape, **kwargs):
        self.out_shape = out_shape
        out_features = np.prod(out_shape)
        super().__init__(in_features, out_features, **kwargs)

    def forward(self, x):
        out = super().forward(x)
        return out.reshape((len(x), *self.out_shape))

if __name__ == '__main__':
    tmp = torch.empty((32, 10))
    linear = MultiDimLinear(in_features=10, out_shape=(10, 10))
    out = linear(tmp)
    print(out.shape)  # (32, 10, 10)
Another way would be to use torch.einsum
https://pytorch.org/docs/stable/generated/torch.einsum.html
torch.einsum lets you control which dimensions are summed over in tensor-to-tensor multiplication operations, which can allow separate multiplications to happen in parallel. [I do not know whether this necessarily results in GPU efficiency if the operations still occur in the same kernel; in fact, it may be slower: https://github.com/pytorch/pytorch/issues/32591]
The way this works is to initialize the weight and bias tensors directly (look at the source code of the torch linear layer for how that is done).
Say that the input (X) has dimensions (a, b), where a is the batch size.
Say that you want to pass this input through a series of classifiers, represented by a single weight tensor (W) with dimensions (c, d, e), where c is the number of classifiers and e is the number of classes per classifier:
import torch
x = torch.arange(2*4).view(2, 4)
w = torch.arange(5*4*6).view(5, 4, 6)
torch.einsum('ab, cbe -> ace', x, w)
In the last line, a and b are the dimensions of the input, as mentioned above. The slightly tricky part is that c, b and e are the dimensions of the classifier weight tensor; I used b instead of d because the vector multiplication happens along that shared dimension of the input tensor and the weight tensor. That is why the left side of the einsum equation is 'ab, cbe'. The dimensions listed on the right side of the equation are the ones that are kept, i.e. excluded from summation.
The final dimensions we want are (a, c, e): a is the batch size, c is the number of classifiers, and e is the number of classes for each classifier. We do not want to sum across those values, so to preserve their separation the right side of the equation is 'ace'.
For those unfamiliar with einsum, this will be harder to read than the workaround I described above (though I highly recommend learning it, because it becomes easy and intuitive very quickly, even though it's a bit tricky at first: https://www.youtube.com/watch?v=pkVwUVEHmfI).
However, for parallelizing certain operations (especially on the GPU), einsum sometimes seems to be the only way to express them. For example, suppose that in my previous example I didn't want to use a classification head yet, and just wanted to project to multiple dimensions.
import torch
x = torch.arange(2*4).view(2, 4)
w = torch.arange(5*4*4).view(5, 4, 4)
y = torch.einsum('ab, cbe -> ace', x, w)
And say I do a few other operations to y, perhaps some non linear operations, activations, etc.
z = f(y)
z will still have the dimensions (2, 5, 4): batch size 2, 5 hidden states per sample, and each hidden state of dimension 4.
And then I want to apply a classifier to each separate tensor.
w2 = torch.arange(4*2).view(4, 2)
final = torch.einsum('fgh, hj -> fgj', z, w2)
Quick refresher: 2 is the batch size, 5 is the number of classifiers, and 2 is the number of outputs for each classifier.
The output dimensions f, g, j (2, 5, 2) are not summed across, and thus are preserved in the output.
As noted in the GitHub issue linked above, this may be slower than just using regular linear layers. There may be efficiency gains when a very large number of operations run in parallel.
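For reference, the same first projection can also be written with broadcasted torch.matmul; a minimal sketch (reusing the shapes from the first einsum example above) to check the equivalence:
import torch

x = torch.arange(2*4).view(2, 4).float()
w = torch.arange(5*4*6).view(5, 4, 6).float()

y_einsum = torch.einsum('ab, cbe -> ace', x, w)
# (1, 2, 4) @ (5, 4, 6) broadcasts to (5, 2, 6); move the classifier axis back to position 1
y_matmul = torch.matmul(x.unsqueeze(0), w).permute(1, 0, 2)

print(torch.allclose(y_einsum, y_matmul))  # True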

Retrieve elements from a 3D tensor with a 2D index tensor

I am playing around with GPT2 and I have 2 tensors:
O: An output tensor of shape (B, S-1, V), where B is the batch size, S is the number of timesteps and V is the vocabulary size. This is the output of a generative model and is softmaxed along the last (vocabulary) dimension.
L: A 2D tensor of shape (B, S-1), where each element is the index of the correct token at each timestep for each sample. These are basically the labels.
I want to extract the predicted probability of the corresponding correct token from tensor O based on tensor L, such that I end up with a 2D tensor of shape (B, S-1). Is there an efficient way of doing this apart from using loops?
For reference, I based my answer on this Medium article.
Essentially, your answer lies in torch.gather, assuming that both of your tensors are just regular torch.Tensors (or can be converted to one).
import torch
# Specify some arbitrary dimensions for now
B = 3
V = 6
S = 4
# Make example reproducible
torch.manual_seed(42)
# L necessarily has to be a torch.LongTensor, otherwise indexing will fail.
L = torch.randint(0, V, size=[B, S])
O = torch.rand([B, S, V])
# Now collect the results. The index tensor needs the same number of dimensions
# as O, with matching sizes except along the axis you want to collect along.
X = torch.gather(O, dim=2, index=L.unsqueeze(dim=2))
# Make sure X has no "unnecessary" dimension
X = X.squeeze(dim=2)
It is a bit difficult to see whether this produces exactly the correct results, which is why I included a random seed that makes the example deterministic, so you can easily verify that it gets you the desired result. For clarification, one could also use a lower-dimensional tensor, where it becomes clearer what exactly torch.gather does.
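For instance, a minimal 2D illustration (with made-up values) of what torch.gather does along dim=1:
import torch

scores = torch.tensor([[0.1, 0.7, 0.2],
                       [0.5, 0.3, 0.2]])
labels = torch.tensor([[1], [0]])             # index of the "correct" column in each row
picked = torch.gather(scores, dim=1, index=labels)
print(picked.squeeze(1))                      # tensor([0.7000, 0.5000])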
Note that torch.gather theoretically also allows you to collect multiple indices per row. Meaning, if you instead had a multiclass example in which multiple values are correct, you could similarly use a tensor L of shape [B, S, number_of_correct_samples].
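A quick sketch of that multi-label variant (with a hypothetical k correct tokens per position):
import torch

B, S, V, k = 3, 4, 6, 2
O = torch.rand([B, S, V])
L_multi = torch.randint(0, V, size=[B, S, k])     # k correct token indices per timestep
X_multi = torch.gather(O, dim=2, index=L_multi)   # shape [B, S, k]
print(X_multi.shape)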

Covariance matrix is NoneType in lmfit Python3-6

I want to fit an f(x,y) function using lmfit. The dataset is small and there are many fitting parameters (6 points on the x-axis, 11 points on the y-axis and 16 unconstrained fitting parameters). Using all the defaults of Model.fit I cannot obtain the covariance matrix, and during the fitting process the values of the free parameters are not changed at all.
I tried changing the initial values of the parameters. However, when I set up the same kind of problem in OriginPro's Surface Fitting functionality, the Levenberg-Marquardt algorithm manages to fit the data and estimate the errors (although they are quite large for certain parameters). This means that there has to be some problem with my code, but I can't find where it lies. I'm not a Python master.
The MWE is as below.
import numpy as np
from lmfit import Model, Parameters
import numdifftools # not calling this doesn't change anything
x, y = np.array([226.5, 361.05, 404.41, 589, 632.8, 1013.98]), np.linspace(0,100,11)
X, Y = np.meshgrid(x, y)
Z = np.array([[1.3945, 1.34896, 1.34415, 1.33432, 1.33306, 1.32612],\
[1.39422, 1.3487, 1.34389, 1.33408, 1.33282, 1.32591],\
[1.39336, 1.34795, 1.34315, 1.33336, 1.33211, 1.32524],\
[1.39208, 1.34682, 1.34205, 1.3323, 1.33105, 1.32424],\
[1.39046, 1.3454, 1.34065, 1.33095, 1.32972, 1.32296],\
[1.38854, 1.34373, 1.33901, 1.32937, 1.32814, 1.32145],\
[1.38636, 1.34184, 1.33714, 1.32757, 1.32636, 1.31974],\
[1.38395, 1.33974, 1.33508, 1.32559, 1.32438, 1.31784],\
[1.38132, 1.33746, 1.33284, 1.32342, 1.32223, 1.31576],\
[1.37849, 1.33501, 1.33042, 1.32109, 1.31991, 1.31353],\
[1.37547, 1.33239, 1.32784, 1.31861, 1.31744, 1.31114]])
#This has to be defined beforehand (otherwise parameters names are not defined error)
a1,a2,a3,a4 = 1.3208, -1.2325E-5, -1.8674E-6, 5.0233E-9
b1,b2,b3,b4 = 5208.2413, -0.5179, -2.284E-2, 6.9608E-5
c1,c2,c3,c4 = -2.5551E8, -18341.336, -920, 2.7729
d1,d2,d3,d4 = 9.3495, 2E-3, 3.6733E-5, -1.2932E-7
# Function to fit
def model(x, y, *args):
    return a1+a2*y+a3*np.power(y,2)+a4*np.power(y,3)+\
           (b1+b2*y+b3*np.power(y,2)+b4*np.power(y,3))/np.power(x,2)+\
           (c1+c2*y+c3*np.power(y,2)+c4*np.power(y,3))/np.power(x,4)+\
           (d1+d2*y+d3*np.power(y,2)+d4*np.power(y,3))/np.power(x,6)
# This is the callable that is passed to Model.fit. M is a (2,N) array
# where N is the total number of data points in Z, which will be ravelled
# to one dimension.
def _model(M, **args):
    x, y = M
    arr = model(x, y, params)
    return arr
# We need to ravel the meshgrids of X, Y points to a pair of 1-D arrays.
xdata = np.vstack((X.ravel(), Y.ravel()))
# Fitting parameters.
fmodel = Model(_model)
params = Parameters()
params.add_many(('a1',1.3208,True,1,np.inf,None,None),\
('a2',-1.2325E-5,True,-np.inf,np.inf,None,None),\
('a3',-1.8674E-6,True,-np.inf,np.inf,None,None),\
('a4',5.0233E-9,True,-np.inf,np.inf,None,None),\
('b1',5208.2413,True,-np.inf,np.inf,None,None),\
('b2',-0.5179,True,-np.inf,np.inf,None,None),\
('b3',-2.284E-2,True,-np.inf,np.inf,None,None),\
('b4',6.9608E-5,True,-np.inf,np.inf,None,None),\
('c1',-2.5551E8,True,-np.inf,np.inf,None,None),\
('c2',-18341.336,True,-np.inf,np.inf,None,None),\
('c3',-920,True,-np.inf,np.inf,None,None),\
('c4',2.7729,True,-np.inf,np.inf,None,None),\
('d1',9.3495,True,-np.inf,np.inf,None,None),\
('d2',2E-3,True,-np.inf,np.inf,None,None),\
('d3',3.6733E-5,True,-np.inf,np.inf,None,None),\
('d4',-1.2932E-7,True,-np.inf,np.inf,None,None))
result = fmodel.fit(Z.ravel(), params, M=xdata)
fit = model(X, Y, result.params)
print(result.covar)
This code results in the covariance being NoneType. I expect that it can in fact be calculated, because Origin somehow manages to. If needed, I can provide all the parameters from Origin's Surface Fitting.
When plotting the difference between Z and the fit, there is quite a large discrepancy for low x-values (which does not happen in Origin).
You are not defining your model function in a way that can be used sensibly by lmfit. You have:
def _model(M, **args):
    x, y = M
    arr = model(x, y, params)
    return arr

def model(x, y, *args):
    return a1+a2*y+a3*np.power(y,2)+a4*np.power(y,3)+\
           (b1+b2*y+b3*np.power(y,2)+b4*np.power(y,3))/np.power(x,2)+\
           (c1+c2*y+c3*np.power(y,2)+c4*np.power(y,3))/np.power(x,4)+\
           (d1+d2*y+d3*np.power(y,2)+d4*np.power(y,3))/np.power(x,6)
model = Model(_model)
Which has a few problems:
args is not used in _model, and params is not defined inside the function, so it will be taken from module level.
Similarly in model, args is not used, and a1, a2, etc. will be taken from the module-level variables and (importantly!) will not be updated during the fit.
In short, your model function never sees varying values for the parameters.
lmfit.Model takes the named function arguments and turns them into parameter names. It does not turn **kws or *positional_args into parameter names. So I think that what you want to do is write a model function like this:
def model(x, y, a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4,
          d1, d2, d3, d4):
    return a1+a2*y+a3*np.power(y,2)+a4*np.power(y,3)+\
           (b1+b2*y+b3*np.power(y,2)+b4*np.power(y,3))/np.power(x,2)+\
           (c1+c2*y+c3*np.power(y,2)+c4*np.power(y,3))/np.power(x,4)+\
           (d1+d2*y+d3*np.power(y,2)+d4*np.power(y,3))/np.power(x,6)
Then create a model from that with:
# Note: don't give a function and Model instance the same name!!
my_model = Model(model, independent_vars=('x', 'y'))
With that model defined you can run the fit, and without having to unravel your data (the independent data in lmfit can be of almost any data type, and data arrays can be multi-dimensional):
result = my_model.fit(Z, params, x=X, y=Y)
For what it is worth, with those changes the fit runs to completion for me. It still gets stuck with some of the parameters not updating from their initial values, but that is somewhat separate from the mechanics of setting up and running the fit, and is probably due to the polynomials being fairly unstable or the initial estimates being poor.
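Once the fit has run (result = my_model.fit(...) above), a quick way to check whether the covariance was estimated, using standard lmfit result attributes:
print(result.fit_report())   # parameter values and, if available, their uncertainties
print(result.covar)          # covariance matrix, or None if it could not be estimated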
As an aside: np.power(y, n) can be spelled y**n, and readability counts. Also, numerical stability is sometimes improved by replacing
a + b*x + c*x**2 + d*x**3
with
a + x*(b + x*(c + x*d))
though I do not know whether that would help in your case.
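For illustration, a quick numerical check of that rewrite on one of the cubic terms of the model, using the a-coefficients from the question:
import numpy as np

y = np.linspace(0, 100, 11)
a1, a2, a3, a4 = 1.3208, -1.2325e-5, -1.8674e-6, 5.0233e-9

naive  = a1 + a2*y + a3*y**2 + a4*y**3
horner = a1 + y*(a2 + y*(a3 + y*a4))
print(np.allclose(naive, horner))  # True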

Only backpropagate up to a given variable

I’m working on a GAN where the discriminator operates on the latent space vector produced by the encoder. The details aren’t important, but the paper describing the model is https://arxiv.org/abs/1706.00409 if you want to take a look.
Essentially, my problem is that my training code requires an unnecessary backward pass through the encoder and I’m not sure how to get around this such that it becomes optimal. Here is the relevant code, where E is the encoder and D is the discriminator.
latent_vec = E(input) # latent_vec will be a Variable with requires_grad=True
predictions = D(latent_vec)
e_loss = encoder_loss(predictions, ground_truth)
e_optimizer.zero_grad()
e_loss.backward() #backpropagates through both D and E, which is necessary
e_optimizer.step()
d_loss = discriminator_loss(predictions, ground_truth)
d_optimizer.zero_grad()
d_loss.backward() #backpropagates through D, but also through E unnecessarily
d_optimizer.step()
This still works because the optimizers are only modifying the parameters of their respective models, but it’s inefficient because d_loss.backward() unnecessarily backpropagates through E. I’m aware that I can recreate a version of latent_vec that will prevent backpropagation through E using
latent_vec_no_grad = Variable(latent_vec.data), but then I would be stuck with 2 forward passes through D (one with latent_vec so that e_loss backpropagates through E, and one with latent_vec_no_grad so that d_loss only backpropagates through D).
Ideally, a flag such as latent_vec.block_backprop = True could be set after optimizing E, but no such flag exists. Is there an elegant solution that would make training optimal?
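A minimal sketch of one possibility, assuming a recent PyTorch where Tensor.backward accepts an inputs argument restricting which leaves receive gradients (reusing the names from the snippet above), so the discriminator update never has to traverse E:
e_optimizer.zero_grad()
e_loss.backward(retain_graph=True)             # through both D and E, as before; keep the shared graph
e_optimizer.step()

d_loss = discriminator_loss(predictions, ground_truth)
d_optimizer.zero_grad()
d_loss.backward(inputs=list(D.parameters()))   # gradients are only computed for D's parameters
d_optimizer.step()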

How to add a confusion matrix to Theano examples?

I want to make use of Theano's logistic regression classifier, but I would like to make an apples-to-apples comparison with previous studies I've done to see how deep learning stacks up. I recognize this is probably a fairly simple task if I was more proficient in Theano, but this is what I have so far. From the tutorials on the website, I have the following code:
def errors(self, y):
    # check if y has the same dimension as y_pred
    if y.ndim != self.y_pred.ndim:
        raise TypeError(
            'y should have the same shape as self.y_pred',
            ('y', y.type, 'y_pred', self.y_pred.type)
        )
    # check if y is of the correct datatype
    if y.dtype.startswith('int'):
        # the T.neq operator returns a vector of 0s and 1s, where 1
        # represents a mistake in prediction
        return T.mean(T.neq(self.y_pred, y))
I'm pretty sure this is where I need to add the functionality, but I'm not certain how to go about it. What I need is either access to y_pred and y for each and every run (to update my confusion matrix in python) or to have the C++ code handle the confusion matrix and return it at some point along the way. I don't think I can do the former, and I'm unsure how to do the latter. I've done some messing around with an update function along the lines of:
def confuMat(self, y):
    x = T.vector('x')
    classes = T.scalar('n_classes')
    onehot = T.eq(x.dimshuffle(0,'x'), T.arange(classes).dimshuffle('x',0))
    oneHot = theano.function([x, classes], onehot)
    yMat = T.matrix('y')
    yPredMat = T.matrix('y_pred')
    confMat = T.dot(yMat.T, yPredMat)
    confusionMatrix = theano.function(inputs=[yMat, yPredMat], outputs=confMat)

    def confusion_matrix(x, y, n_class):
        return confusionMatrix(oneHot(x, n_class), oneHot(y, n_class))

    t = np.asarray(confusion_matrix(y, self.y_pred, self.n_out))
    print(t)
But I'm not completely clear on how to get this to interface with the function in question and give me a NumPy array I can work with.
I'm quite new to Theano, so hopefully this is an easy fix for one of you. I'd like to use this classifier as my output layer in a number of configurations, so that I could use the confusion matrix with other architectures.
I suggest a brute-force sort of approach. First you need an output for the predictions; create a function for it.
prediction = theano.function(
    inputs=[index],
    outputs=MLPlayers.predicts,
    givens={
        x: test_set_x[index * batch_size: (index + 1) * batch_size]})
In your test loop, gather the predictions...
labels = labels + test_set_y.eval().tolist()
for mini_batch in xrange(n_test_batches):
    wrong = wrong + int(test_model(mini_batch))
    predictions = predictions + prediction(mini_batch).tolist()
Now create confusion matrix this way:
correct = 0
confusion = numpy.zeros((outs,outs), dtype = int)
for index in xrange(len(predictions)):
    if labels[index] == predictions[index]:   # use ==, not "is", for value comparison
        correct = correct + 1
    confusion[int(predictions[index]), int(labels[index])] += 1
You can find this kind of an implementation in this repository.
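Once labels and predictions are plain lists or NumPy arrays, the counting loop can also be replaced by a vectorized version; a small sketch, assuming integer class labels and the same convention as above (rows = predicted class, columns = true class):
import numpy as np

def confusion_matrix(labels, predictions, n_classes):
    labels = np.asarray(labels, dtype=int)
    predictions = np.asarray(predictions, dtype=int)
    conf = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(conf, (predictions, labels), 1)   # accumulate counts per (predicted, true) pair
    return conf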
