Extending PyTorch: Python vs. C++ vs. CUDA

I have been trying to implement a custom Conv2d module in which grad_input (dx) and grad_weight (dw) are calculated using different grad_output (dy) values. I implemented this by extending torch.autograd, as in the PyTorch tutorials.
However, I am confused by the information in this link.
Is extending the autograd.Function not enough?
What is the difference between writing a new autograd function in Python vs C++?
How about the CUDA implementations in /torch/nn/blob/master/lib/THNN/generic/SpatialConvolutionMM.c where dx and dw are calculated? Should I change them too?
Here is my custom function:
import torch
import torch.nn.functional as F

class myCustomConv2d(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w, bias=None, stride=1, padding=0, dilation=1, groups=1):
        ctx.save_for_backward(x, w, bias)
        ctx.stride = stride
        ctx.padding = padding
        ctx.dilation = dilation
        ctx.groups = groups
        out = F.conv2d(x, w, bias, stride, padding, dilation, groups)
        return out

    @staticmethod
    def backward(ctx, grad_output):
        input, weight, bias = ctx.saved_tensors
        stride = ctx.stride
        padding = ctx.padding
        dilation = ctx.dilation
        groups = ctx.groups
        grad_input = grad_weight = grad_bias = None
        dy_for_inputs = myspecialfunction1(grad_output)
        dy_for_weights = myspecialfunction2(grad_output)
        grad_input = torch.nn.grad.conv2d_input(input.shape, weight, dy_for_inputs, stride, padding, dilation, groups)
        grad_weight = torch.nn.grad.conv2d_weight(input, weight.shape, dy_for_weights, stride, padding, dilation, groups)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = dy_for_weights.sum((0, 2, 3)).squeeze(0)
        return grad_input, grad_weight, grad_bias, None, None, None, None

Is extending the autograd.Function not enough?
It is enough if your code reuses PyTorch components wrapped within the Python interface (which seems to be the case here). The gradient is composed automatically.
What is the difference between writing a new autograd function in Python vs C++?
Performance. The more custom your operation is (and the harder it is to compose from existing PyTorch operations), the more of a performance improvement you can expect from a C++ implementation.
How about the CUDA implementations in /torch/nn/blob/master/lib/THNN/generic/SpatialConvolutionMM.c where dx and dw are calculated? Should I change them too?
No need for that, unless you want to create specialized CUDA ops.
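For completeness, here is a minimal sketch (not from the original answer) of how such a custom Function is typically used: wrap it in an nn.Module that owns the weight and bias, and call it through .apply. The wrapper name, parameter shapes, and initialisation below are illustrative assumptions, and the example only runs once myspecialfunction1/myspecialfunction2 from the question are defined.

import torch
import torch.nn as nn

class MyCustomConv2dLayer(nn.Module):
    # hypothetical wrapper module; shapes and init are assumptions, not from the question
    def __init__(self, in_channels, out_channels, kernel_size,
                 stride=1, padding=0, dilation=1, groups=1):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(out_channels, in_channels // groups, kernel_size, kernel_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_channels))
        self.stride, self.padding = stride, padding
        self.dilation, self.groups = dilation, groups

    def forward(self, x):
        # custom autograd Functions are invoked through .apply, never called directly
        return myCustomConv2d.apply(x, self.weight, self.bias,
                                    self.stride, self.padding, self.dilation, self.groups)

layer = MyCustomConv2dLayer(3, 16, kernel_size=3, padding=1)
out = layer(torch.randn(2, 3, 32, 32))
out.sum().backward()   # invokes the custom backward with its two modified grad_outputs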

Related

Is a neural network using layernorm, flatten, dropout then linear equivalent to a linear regression?

I've joined a new project where someone has defined a class similar to the following:
class FeatureClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_norm = nn.LayerNorm(512)
        self.flatten = nn.Flatten()
        self.dropout = nn.Dropout(0.1)
        self.fc1 = nn.Linear(512, 2)

    def forward(self, x):
        x = self.layer_norm(x)
        x = self.flatten(nn.functional.relu(x))
        x = self.dropout(x)
        x = self.fc1(nn.functional.relu(x))
        return x
This is returning quite poor results. My impression is that this is effectively some data manipulation and a ReLU, so a linear regression on a subset of normalised data. The author of the code, however, contends that it is non-linear. What aspect of this network might make it non-linear?
The code returns quite poor fits and losses. I tried upping the dropout to 0.5 to see if that had an impact, but it did not. In my mind this confirms the linear behaviour to a degree, as with a larger dropout I would expect the behaviour to change if any complex information was being extracted.
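As a side note (my addition, not part of the original post), one quick empirical probe of the linearity claim is to check whether the module, in eval mode, preserves differences: for an affine map f(x) = Wx + c, f(x + d) - f(x) is the same for every x. A hedged sketch, assuming 512-dimensional inputs as in the code above:

import torch

model = FeatureClassifier().eval()  # eval() disables dropout so the check is deterministic

x1, x2, d = torch.randn(1, 512), torch.randn(1, 512), torch.randn(1, 512)

with torch.no_grad():
    # for an affine ("linear regression"-like) map, these two differences would be equal
    diff1 = model(x1 + d) - model(x1)
    diff2 = model(x2 + d) - model(x2)

print((diff1 - diff2).abs().max())  # typically clearly non-zero, because of ReLU and LayerNorm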

Troubles in unsupervised domain adaptation with GCN

I am trying to implement an unsupervised domain adaptation network following the paper GCAN: Graph Convolutional Adversarial Network for Unsupervised Domain Adaptation, presented at CVPR 2019 (it can be found at this link). I have some trouble understanding parts of the paper.
I have included the image from the paper explaining the structure of the model. I have trouble understanding whether the input of the model is a single image or multiple images, since there is a domain classification network that should classify the domain an image comes from, but at the same time there is a part in which the alignment of the class centroids is evaluated.
Moreover, there is no indication of how to compute the class centroids themselves, and since I am not an expert in this area, I wonder how it is possible to compute them and optimise them using the loss function given in the paper.
The last thing I'm wondering about is an error that I get in the code (I am using PyTorch to implement the solution). This is the code I wrote for the model:
class GCAN(nn.Module):
    def __init__(self, num_classes, gcn_in_channels=256, gcn_out_channels=150):
        super(GCAN, self).__init__()
        self.cnn = resnet50(pretrained=True)
        resnet_features = self.cnn.fc.in_features
        combined_features = resnet_features + gcn_out_channels
        self.cnn = nn.Sequential(*list(self.cnn.children())[:-1])
        self.dsa = alexnet(pretrained=True)
        self.gcn = geometric_nn.GCNConv(in_channels=gcn_in_channels,
                                        out_channels=gcn_out_channels)
        self.domain_alignment = nn.Sequential(
            nn.Linear(in_features=combined_features, out_features=1024),
            nn.ReLU(),
            nn.Linear(in_features=1024, out_features=1024),
            nn.ReLU(),
            nn.Linear(in_features=1024, out_features=1),
            nn.Sigmoid()
        )
        self.classifier = nn.Sequential(
            nn.Linear(in_features=combined_features, out_features=1024),
            nn.Dropout(p=0.2),
            nn.ReLU(),
            nn.Linear(in_features=1024, out_features=1024),
            nn.Dropout(p=0.2),
            nn.ReLU(),
            nn.Linear(in_features=1024, out_features=num_classes),
            nn.Softmax()
        )

    def forward(self, xs):
        resnet_features = self.cnn(xs)
        scores = self.dsa(xs)
        scores = scores.cpu().detach().numpy()
        adjacency_matrix = np.matmul(scores, np.transpose(scores))
        graph = nx.from_numpy_matrix(adjacency_matrix)  # networkx
        gcn_features = self.gcn(graph)
        concat_features = torch.cat((resnet_features, gcn_features))
        domain_classification = self.domain_alignment(concat_features)
        pseudo_label = self.classifier(concat_features)
        return domain_classification, pseudo_label
When I try to print the model summary, I get the following error:
forward() missing 1 required positional argument: 'edge_index'
But looking at the documentation of the GCN convolution (which is the part that raises the error), I have given the layer both in_channels and out_channels. What am I missing in this case?
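For context only (my addition, not an answer from the original thread), here is a hedged sketch of how a PyTorch Geometric GCNConv layer is normally invoked, assuming geometric_nn in the code above refers to torch_geometric.nn: its forward takes a node-feature tensor and an edge_index tensor, which is what the error message refers to. The shapes below are illustrative assumptions.

import torch
from torch_geometric.nn import GCNConv

conv = GCNConv(in_channels=256, out_channels=150)

x = torch.randn(8, 256)                    # 8 nodes with 256 features each (assumed shapes)
edge_index = torch.tensor([[0, 1, 2, 3],   # source node indices
                           [1, 2, 3, 0]])  # target node indices

out = conv(x, edge_index)                  # forward(x, edge_index) -> shape [8, 150]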

Don't include an operation for gradient computation in PyTorch

I have a custom layer. Let the layer be called 'Gaussian'
import numpy as np
import torch
import torch.nn as nn

class Gaussian(nn.Module):
    def __init__(self, k=0.5, n=0.1):  # k and n were left undefined in the original snippet; defaults added here
        super(Gaussian, self).__init__()
        self.k = k  # fraction of the batch to perturb
        self.n = n  # scale of the perturbation

    @torch.no_grad()
    def forward(self, x):
        _r = np.random.randint(0, x.shape[0], x.shape[0])
        _sample = x[_r]
        _d = _sample - x
        _number = int(self.k * x.shape[0])
        x[1:_number] = x[1:_number] + (self.n * _d[1:_number]).detach()
        return x
The above class will be used as below:
cnn_model = nn.Sequential(nn.Conv2d(1, 32, 5), Gaussian(), nn.ReLU(), nn.Conv2d(32, 32, 5))
If x is the input, I want the gradient of x to exclude the operations that are present in the Gaussian module, but include the calculations in the other layers of the neural network (nn.Conv2d, etc.).
In the end, my aim is to use the Gaussian module to perform calculations, but those calculations should not be included in the gradient computation.
I tried to do the following:
Used the @torch.no_grad() decorator above the forward method of the Gaussian module
Using detach after every operation in the Gaussian module:
x[1: _number] = x[1: _number] + (self.n * _d[1: _number]).detach() and similarly for other operations
Using y = x.detach() in the forward method, performing the operations on y, and then setting x.data = y
Are the above methods correct?
Gradient calculation only makes sense when there are parameters to optimise.
If your module does not have any parameters, then no gradients will be stored for it, because there are no parameters to associate them with.
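As a hedged sketch (not from the original answer) of the detach-based approach described in the question: operations performed on detached tensors, or inside a torch.no_grad() block, are excluded from the autograd graph, while gradients still flow through the surrounding layers.

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 4, 3)
x = torch.randn(2, 1, 8, 8, requires_grad=True)

h = conv(x)
# the perturbation is computed from h but detached, so autograd treats it as a constant
perturbation = (0.1 * (h[torch.randperm(h.shape[0])] - h)).detach()
h = h + perturbation

h.sum().backward()
print(conv.weight.grad is not None)  # True: gradients still reach the Conv2d parameters
print(x.grad is not None)            # True: gradients flow to the input through conv only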

Local fully connected layer - Pytorch

Assume we have a feature representation with kN neurons before the classification layer. Now, the classification layer produces an output layer of size N with only local connections.
That is, the k-th neuron at the output is computed using the input neurons at locations kN to kN+N. Hence, every N locations in the input layer (with stride N) give a single neuron value at the output.
This is done using conv1dlocal in Keras; however, PyTorch does not seem to have this.
Weight matrix in a standard linear layer: kN x N = kN^2 variables
Weight matrix in the local linear layer: (k x 1) weights, N times = kN variables
This is currently triaged on the PyTorch issue tracker; in the meantime you can get similar behaviour using fold and unfold. See this answer:
https://github.com/pytorch/pytorch/issues/499#issuecomment-503962218
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalLinear(nn.Module):
    def __init__(self, in_features, local_features, kernel_size, padding=0, stride=1, bias=True):
        super(LocalLinear, self).__init__()
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding

        fold_num = (in_features + 2 * padding - self.kernel_size) // self.stride + 1
        self.weight = nn.Parameter(torch.randn(fold_num, kernel_size, local_features))
        self.bias = nn.Parameter(torch.randn(fold_num, local_features)) if bias else None

    def forward(self, x: torch.Tensor):
        x = F.pad(x, [self.padding] * 2, value=0)
        x = x.unfold(-1, size=self.kernel_size, step=self.stride)
        x = torch.matmul(x.unsqueeze(2), self.weight).squeeze(2) + self.bias
        return x
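A brief usage sketch (my addition, not from the linked comment), using the LocalLinear class above: with stride equal to kernel_size, each output position sees its own disjoint window of the input, which matches the locally connected behaviour described in the question. The sizes are illustrative assumptions.

# input of size 12, windows of 3 with stride 3 -> 4 output positions
layer = LocalLinear(in_features=12, local_features=1, kernel_size=3, stride=3)

x = torch.randn(5, 12)   # batch of 5
y = layer(x)
print(y.shape)           # torch.Size([5, 4, 1]); squeeze(-1) for one value per position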

How to compute gradient of the error with respect to the model input?

Given a simple 2-layer neural network, the traditional idea is to compute the gradient w.r.t. the weights/model parameters. For an experiment, I want to compute the gradient of the error w.r.t. the input. Are there existing PyTorch methods that allow me to do this?
More concretely, consider the following neural network:
import torch.nn as nn
import torch.nn.functional as F

class NeuralNet(nn.Module):
    def __init__(self, n_features, n_hidden, n_classes, dropout):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(n_features, n_hidden)
        self.sigmoid = nn.Sigmoid()
        self.fc2 = nn.Linear(n_hidden, n_classes)
        self.dropout = dropout

    def forward(self, x):
        x = self.sigmoid(self.fc1(x))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
I instantiate the model and an optimizer for the weights as follows:
import torch.optim as optim
model = NeuralNet(n_features=args.n_features,
                  n_hidden=args.n_hidden,
                  n_classes=args.n_classes,
                  dropout=args.dropout)
optimizer_w = optim.SGD(model.parameters(), lr=0.001)
While training, I update the weights as usual. Now, given that I have values for the weights, I should be able to use them to compute the gradient w.r.t. the input. I am unable to figure out how.
def train(epoch):
    t = time.time()
    model.train()
    optimizer_w.zero_grad()
    output = model(features)
    loss_train = F.nll_loss(output[idx_train], labels[idx_train])
    acc_train = accuracy(output[idx_train], labels[idx_train])
    loss_train.backward()
    optimizer_w.step()
    # grad_features = loss_train.backward() w.r.t. features
    # features -= 0.001 * grad_features

for epoch in range(args.epochs):
    train(epoch)
It is possible: just set input.requires_grad = True for each input batch you're feeding in, and after loss.backward() you should see that input.grad holds the expected gradient. In other words, if your input to the model (which you call features in your code) is some M x N x ... tensor, features.grad will be a tensor of the same shape, where each element of grad holds the gradient with respect to the corresponding element of features. In the notation below, I use i as a generalized index; if your features tensor has, for instance, 3 dimensions, read it as features.grad[i, j, k], etc.
Regarding the error you're getting: PyTorch operations build a tree representing the mathematical operation they describe, which is then used for differentiation. For instance, c = a + b will create a tree where a and b are leaf nodes and c is not a leaf (since it results from other expressions). Your model is the expression, and its inputs as well as its parameters are the leaves, whereas all intermediate and final outputs are not leaves. You can think of leaves as "constants" or "parameters" and of all other variables as functions of those. The error message tells you that you can only set requires_grad on leaf variables.
Your problem is that at the first iteration, features is random (or however else you initialize it) and is therefore a valid leaf. After the first iteration, features is no longer a leaf, since it becomes an expression calculated from the previous one. In pseudocode, you have
f_1 = initial_value # valid leaf
f_2 = f_1 + your_grad_stuff # not a leaf: f_2 is a function of f_1
To deal with that, you need to use detach, which breaks the links in the tree and makes autograd treat a tensor as if it were constant, no matter how it was created. In particular, no gradient calculations will be backpropagated through detach. So you need something like
features = features.detach() - 0.01 * features.grad
Note: perhaps you need to sprinkle a couple more detaches here and there, which is hard to say without seeing your whole code and knowing the exact purpose.
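Putting the answer together, here is a minimal hedged sketch of computing the gradient of the loss with respect to the input and taking a step on it. It reuses the NeuralNet class from the question; the sizes, learning rate, and random data are illustrative assumptions.

import torch
import torch.nn.functional as F

# NeuralNet is the model defined in the question above; sizes here are illustrative
model = NeuralNet(n_features=16, n_hidden=32, n_classes=3, dropout=0.5)
features = torch.randn(8, 16, requires_grad=True)   # leaf tensor: its gradient lands in features.grad
labels = torch.randint(0, 3, (8,))

for step in range(10):
    model.zero_grad()
    output = model(features)
    loss = F.nll_loss(output, labels)
    loss.backward()                                  # fills model-parameter grads and features.grad

    # update the input, then detach so the result is a fresh leaf for the next iteration
    with torch.no_grad():
        features = (features - 0.001 * features.grad).detach().requires_grad_(True)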
