Does this LSTM loop code break the computational graph in PyTorch?

The code below is from https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html
for i in inputs:
    # Step through the sequence one element at a time.
    # After each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)
To me it seems like handling the LSTM this way breaks the computational graph, since hidden keeps getting overwritten. Shouldn't all the hidden states be stored in an array so that the computational graph is maintained and backprop can flow through the hidden states?

In this case, no: you are feeding the hidden state from one time step to the next at every loop iteration. This means gradient flow is preserved and backpropagation will occur through the hidden states as well.
To give a clear answer to your question: yes, the hidden variable is being overwritten. However, the activations corresponding to those hidden states have been cached in memory for backpropagation.
If you take the example from the tutorial page, they are looping through the sequence of elements one by one:
import torch
import torch.nn as nn

torch.manual_seed(1)
lstm = nn.LSTM(3, 3)           # H_in = 3, H_hidden = H_out = 3
inputs = torch.randn(5, 1, 3)  # sequence length = 5
Our data sample is shaped as (sequence_length=5, batch_size=1, feature_size=3).
The hidden states h_0 and c_0 are initialized once:
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))
Then we loop over the sequence, running the LSTM block on one element of the sequence at a time:
for element in inputs:
    out, hidden = lstm(element[None], hidden)
The reshape with view in the tutorial page is superfluous and won't work in the general case where batch_size is not equal to 1. Doing element[None] simply adds a leading dimension to the tensor, which is what we want.
So the element passed in is essentially a single-element sequence of shape (1, 1, 3). Do note this is a stateful call, since we are providing the hidden states from the previous time step.
Here the final output and hidden states are:
>>> out
tensor([[[-0.3600, 0.0893, 0.0215]]], grad_fn=<StackBackward>) # <- h_5
>>> hidden
(tensor([[[-0.3600, 0.0893, 0.0215]]], grad_fn=<StackBackward>),   # <- h_5
 tensor([[[-1.1298, 0.4467, 0.0254]]], grad_fn=<StackBackward>))   # <- c_5
This is actually the default behavior of nn.LSTM when you call it with the whole sequence at once: the hidden states are passed from one sequence element to the next internally.
torch.manual_seed(1)
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))
out, hidden = lstm(inputs, hidden)
Then:
>>> out
tensor([[[-0.2682, 0.0304, -0.1526]],   # <- h_1
        [[-0.5370, 0.0346, -0.1958]],   # <- h_2
        [[-0.3947, 0.0391, -0.1217]],   # <- h_3
        [[-0.1854, 0.0740, -0.0979]],   # <- h_4
        [[-0.3600, 0.0893, 0.0215]]], grad_fn=<StackBackward>)   # <- h_5
>>> hidden
(tensor([[[-0.3600, 0.0893, 0.0215]]], grad_fn=<StackBackward>),   # <- h_5
 tensor([[[-1.1298, 0.4467, 0.0254]]], grad_fn=<StackBackward>))   # <- c_5
You can see that here out contains all the consecutive hidden states, whereas in the previous example it only contained the last hidden state.
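To convince yourself that the graph stays intact, here is a minimal sketch (not from the tutorial) that backpropagates from the final output of the loop and checks that gradients reach both the LSTM parameters and the first element of the sequence:

import torch
import torch.nn as nn

torch.manual_seed(1)
lstm = nn.LSTM(3, 3)
inputs = torch.randn(5, 1, 3, requires_grad=True)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))

# Same loop as above, overwriting hidden at every step
for element in inputs:
    out, hidden = lstm(element[None], hidden)

# Backpropagate from the last output only
out.sum().backward()

# Both are non-zero, so the gradient flowed back through all overwritten hidden states
print(lstm.weight_ih_l0.grad.abs().sum())
print(inputs.grad[0].abs().sum())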

Related

Expand the tensor by several dimensions

In PyTorch, given a tensor of size=[3], how to expand it by several dimensions to the size=[3,2,5,5] such that the added dimensions have the corresponding values from the original tensor. For example, making size=[3] vector=[1,2,3] such that the first tensor of size [2,5,5] has values 1, the second one has all values 2, and the third one all values 3.
In addition, how to expand the vector of size [3,2] to [3,2,5,5]?
One way to do it that I can think of is creating a tensor of the target size with ones_like and then using einsum, but I think there should be an easier way.
You can first unsqueeze the appropriate number of singleton dimensions, then expand them to a view of the target shape with torch.Tensor.expand:
>>> x = torch.rand(3)
>>> target = [3,2,5,5]
>>> x[:, None, None, None].expand(target)
A nice workaround is to use torch.Tensor.reshape or torch.Tensor.view to perform the multiple unsqueezes in one call:
>>> x.view(-1, 1, 1, 1).expand(target)
This allows for a more general approach to handle any arbitrary target shape:
>>> x.view(len(x), *(1,)*(len(target)-1)).expand(target)
For an even more general implementation, where x can be multi-dimensional:
>>> x = torch.rand(3, 2)
# just to make sure the target shape is valid w.r.t to x
>>> assert list(x.shape) == list(target[:x.ndim])
>>> x.view(*x.shape, *(1,)*(len(target)-x.ndim)).expand(target)
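Here is a minimal sketch (not from the question) checking both cases, including the [3, 2] to [3, 2, 5, 5] expansion with the general formula. Note that expand returns a view and copies no memory, so call .contiguous() if you need a real copy:

import torch

target = [3, 2, 5, 5]

# Case 1: expand a [3] vector
x = torch.tensor([1., 2., 3.])
y = x.view(-1, 1, 1, 1).expand(target)
print(y.shape)        # torch.Size([3, 2, 5, 5])
print(y[0].unique())  # tensor([1.]) -> the first [2, 5, 5] block is all ones

# Case 2: expand a [3, 2] tensor with the general formula
x = torch.rand(3, 2)
y = x.view(*x.shape, *(1,)*(len(target) - x.ndim)).expand(target)
print(y.shape)                    # torch.Size([3, 2, 5, 5])
print((y[1, 0] == x[1, 0]).all()) # True: each [5, 5] block repeats one entry of x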

How to get the partial derivative of probability to input in pyTorch?

I want to generate attack samples via the following steps:
Find a pre-trained CNN classification model whose input is X and output is P(y|X), where the most likely prediction for X is y.
I want to input X' and get y_fool, where X' is not far from X and y_fool is not equal to y.
The steps for getting X' are given in an image (not reproduced here).
How can I get the partial derivative described in the image?
Here is my code, but I got None (the model is VGG16):
x = torch.autograd.Variable(image, requires_grad=True)
output = model(image)
prob = nn.functional.softmax(output[0], dim=0)
prob.backward(torch.ones(prob.size()))
print(x.grad)
How should I modify my code?
Here, the point is to backpropagate a "false" example through the network, in other words you need to maximize one particular coordinate of your output which does not correspond to the actual label of x.
Let's say for example that your model outputs N-dimensional vectors, that x label should be [1, 0, 0, ...] and that we will try to make the model actually predict [0, 1, 0, 0, ...] (so y_fool actually has its second coordinate set to 1, instead of the first one).
A quick note on the side: Variable is deprecated, just set the requires_grad flag to True. So you get:
x = torch.tensor(image, requires_grad=True)
output = model(x)
# If the model is well trained, prob_vector[1] should be almost 0 at the beginning
prob_vector = nn.functional.softmax(output[0], dim=0)  # output has a batch dimension, so take output[0]
# We want to fool the model and maximize this coordinate instead of prob_vector[0]
fool_prob = prob_vector[1]
# fool_prob is a scalar tensor, so we can call backward on it directly
fool_prob.backward()
# and you should have your gradients:
print(x.grad)
After that, if you want to use an optimizer in your loop to modify x, remember that PyTorch's optimizer.step() tries to minimize the loss, whereas you want to maximize it. So either you use a negative learning rate or you change the sign before backpropagating:
# Maximizing a scalar is minimizing its opposite
(-fool_prob).backward()
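After the backward pass you have the gradient of the target probability with respect to the input, and you can take a step on the input in that direction. Here is a hypothetical sketch of a single FGSM-style update (the epsilon value and the sign-of-gradient step are assumptions, not something specified in the question):

import torch
import torch.nn as nn

# `model` and `image` are assumed to exist as in the question
epsilon = 0.01

x = image.clone().detach().requires_grad_(True)
output = model(x)
prob_vector = nn.functional.softmax(output[0], dim=0)
fool_prob = prob_vector[1]   # coordinate we want to maximize
fool_prob.backward()

# Gradient ascent step on the input: move x in the direction that increases fool_prob
x_fool = (x + epsilon * x.grad.sign()).detach()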

Convolution with RGB images - what values does a RGB filter hold?

Convolution for a grayscale image is straightforward. You have a filter of shape n x n x 1 and convolve the input image to extract whatever features you desire.
I also understand how convolution would work for an RGB image. The filter would have a shape of n x n x 3. However, would all 3 'layers' in the filter hold the same kernel? For example, if the 0th layer held a particular map of values, would layers 1 and 2 also hold those exact values? I am asking in regards to convolutional neural networks, not conventional image processing. I understand the weights of each filter are learned and are randomized initially; am I correct in thinking that each layer would have different randomized values?
Would all 3 'layers' in the filter hold the same kernel?
The short answer is no. The longer answer is that there isn't a kernel per layer; instead there is just one kernel which handles all input and output layers at once.
The code below shows step by step how one would calculate each convolution manually, and from this we can see that at a high level the calculation goes like this:
take a patch from the batch of images (BatchSize x 3x3x3 in your case)
flatten it to [BatchSize, 27]
matrix multiply it by the reshaped kernel [27, output_filters]
add in the bias of shape [output_filters]
All the colors are processed at once using matrix multiplication with the kernel matrix. If we think about the kernel matrix, we can see that the values in the kernel matrix that are used to generate the first filter are in the first column, and the values to generate the second filter are in the second column. So, indeed, the values are different and not reused, but they are not stored or applied separately.
The code walkthrough
import tensorflow as tf
import numpy as np
# Define a 3x3 kernel that after convolution will create an image with 2 filters (channels)
conv_layer = tf.keras.layers.Conv2D(filters=2, kernel_size=3)
# Lets create a random input image
starting_image = np.array( np.random.rand(1,4,4,3), dtype=np.float32)
# and process it
result = conv_layer(starting_image)
weight, bias = conv_layer.get_weights()
print('size of weight', weight.shape)
print('size of bias', bias.shape)
size of weight (3, 3, 3, 2)
size of bias (2,)
# The output of the convolution of the 4x4x3 image input
# is a 2x2x2 output (because we don't have padding)
result.numpy()
array([[[[-0.34940776, -0.6426925 ],
         [-0.81834394, -0.16166998]],
        [[-0.37515935, -0.28143463],
         [-0.60084903, -0.5310158 ]]]], dtype=float32)
# Now let's see how we can recreate this using the weights
# The way convolution is done is to extract a patch
# the size of the kernel (3x3 in this case)
# We will use the first patch, the first three rows and columns and all the colors
patch = starting_image[0,:3,:3,:]
print('patch.shape' , patch.shape)
# Then we flatten the patch
flat_patch = np.reshape( patch, [1,-1] )
print('New shape is', flat_patch.shape)
patch.shape (3, 3, 3)
New shape is (1, 27)
# next we take the weight and reshape it to be [-1,filters]
flat_weight = np.reshape( weight, [-1,2] )
print('flat_weight shape is ',flat_weight.shape)
flat_weight shape is (27, 2)
# we have the patch of shape [1,27] and the weight of [27,2]
# doing a matrix multiplication of the two shapes [1,27]*[27,2] = a shape of [1,2]
# which is the output we want, 2 filter outputs for this patch
output_for_patch = np.matmul(flat_patch,flat_weight)
# but we haven't added the bias yet, so lets do that
output_for_patch = output_for_patch + bias
# Finally, we can see that our manual calculation matches
# what Conv2D does exactly for the first patch
output_for_patch
array([[-0.34940773, -0.64269245]], dtype=float32)
If we compare this to the full convolution above, we can see that this is exactly the first patch
array([[[[-0.34940776, -0.6426925 ],
         [-0.81834394, -0.16166998]],
        [[-0.37515935, -0.28143463],
         [-0.60084903, -0.5310158 ]]]], dtype=float32)
We would repeat this process for each patch. If we want to optimize this code some more, instead of passing only one image patch at a time [1,27] we can pass [batch_number,27] patches at a time and the kernel will process them all at once returning [batch_number,filter_size].
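As a rough sketch of that optimization (reusing starting_image, flat_weight, bias and result from the walkthrough above), we can gather all 3x3 patches of the 4x4 image, flatten them into a [4, 27] matrix, and multiply by the [27, 2] kernel in a single matmul; the result should match the Conv2D output:

import numpy as np

patches = []
for row in range(2):          # 4x4 input, 3x3 kernel, no padding -> 2x2 output positions
    for col in range(2):
        patches.append(starting_image[0, row:row+3, col:col+3, :].reshape(-1))
patches = np.stack(patches)            # shape (4, 27)

manual = patches @ flat_weight + bias  # shape (4, 2)
print(manual.reshape(2, 2, 2))         # matches result.numpy()[0]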

How to assign a new value to a pytorch Variable without breaking backpropagation?

I have a pytorch variable that is used as a trainable input for a model. At some point I need to manually reassign all values in this variable.
How can I do that without breaking the connections with the loss function?
Suppose the current values are [1.2, 3.2, 43.2] and I simply want them to become [1,2,3].
Edit
At the time I asked this question, I hadn't realized that PyTorch doesn't have a static graph the way TensorFlow and Keras do.
In PyTorch, the training loop is written manually and you need to call everything in each training step (there is no notion of a placeholder plus a static graph that data is fed into later).
Consequently, we can't "break the graph", since the new variable will be used to perform all further computations again. I was worried about a problem that happens in Keras, not in PyTorch.
You can use the data attribute of tensors to modify the values: operations on and changes to data are not tracked by autograd, so they do not appear in the graph, and the existing graph stays intact.
Since you haven't given an example, this example is based on your comment statement: 'Suppose I want to change the weights of a layer.'
I used normal tensors here, but this works the same for the weight.data and bias.data attributes of a layer.
Here is a short example:
import torch
import torch.nn.functional as F
# Test 1, random vector with CE
w1 = torch.rand(1, 3, requires_grad=True)
loss = F.cross_entropy(w1, torch.tensor([1]))
loss.backward()
print('w1.data', w1)
print('w1.grad', w1.grad)
print()
# Test 2, replacing values of w2 with w1, before CE
# to make sure that everything is exactly like in Test 1 after replacing the values
w2 = torch.zeros(1, 3, requires_grad=True)
w2.data = w1.data
loss = F.cross_entropy(w2, torch.tensor([1]))
loss.backward()
print('w2.data', w2)
print('w2.grad', w2.grad)
print()
# Test 3, replace data after computation
w3 = torch.rand(1, 3, requires_grad=True)
loss = F.cross_entropy(w3, torch.tensor([1]))
# setting values
# the graph of the previous computation is still intact, as you can see in the print-outs below
w3.data = w1.data
loss.backward()
# data were replaced with values from w1
print('w3.data', w3)
# gradient still shows results from computation with w3
print('w3.grad', w3.grad)
Output:
w1.data tensor([[ 0.9367, 0.6669, 0.3106]])
w1.grad tensor([[ 0.4351, -0.6678, 0.2326]])
w2.data tensor([[ 0.9367, 0.6669, 0.3106]])
w2.grad tensor([[ 0.4351, -0.6678, 0.2326]])
w3.data tensor([[ 0.9367, 0.6669, 0.3106]])
w3.grad tensor([[ 0.3179, -0.7114, 0.3935]])
The most interesting part here is w3. By the time backward is called, its values have been replaced by the values of w1, yet the gradients are calculated based on the cross-entropy computation with the original w3 values. The replaced values have no effect on the graph.
So the graph connection is not broken, and the replacement had no influence on the graph. I hope this is what you were looking for!
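As a side note, in current PyTorch the same effect is usually achieved with an in-place operation inside a torch.no_grad() block, which likewise keeps the assignment out of the graph. A minimal sketch for the "change the weights of a layer" case (this is an alternative idiom, not part of the original answer):

import torch
import torch.nn as nn

layer = nn.Linear(3, 1)

with torch.no_grad():
    layer.weight.copy_(torch.tensor([[1., 2., 3.]]))  # in-place replacement, not tracked
    layer.bias.zero_()

# The parameters still require grad and take part in the next forward/backward pass
out = layer(torch.rand(1, 3))
out.sum().backward()
print(layer.weight.grad)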

How to set up the number of inputs neurons in sklearn MLPClassifier?

Given a dataset of n samples, m features, and using sklearn.neural_network.MLPClassifier, how can I set hidden_layer_sizes to start with m inputs? For instance, I understand that if hidden_layer_sizes=(10,10) it means there are 2 hidden layers each of 10 neurons (i.e., units), but I don't know if this also implies 10 inputs as well.
Thank you
This classifier/regressor, as implemented, does this automatically when fit is called.
This can be seen in its source code.
Excerpt:
n_samples, n_features = X.shape

# Ensure y is 2D
if y.ndim == 1:
    y = y.reshape((-1, 1))

self.n_outputs_ = y.shape[1]

layer_units = ([n_features] + hidden_layer_sizes +
               [self.n_outputs_])
You can see that the hidden_layer_sizes you provide is surrounded by layer dimensions derived from your data within .fit(). This is why the docstring describes the tuple length with a subtraction of 2:
Parameters
hidden_layer_sizes : tuple, length = n_layers - 2, default (100,)
The ith element represents the number of neurons in the ith hidden layer.
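A small sketch illustrating this (the random data is just an assumption for shape inspection): after fit, coefs_[0] maps the m input features to the first hidden layer, so its first dimension equals the number of features in X.

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.random.rand(50, 7)        # n = 50 samples, m = 7 features
y = np.random.randint(0, 2, 50)  # binary labels

clf = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=300)
clf.fit(X, y)

print(clf.coefs_[0].shape)  # (7, 10)  -> m inputs to the first hidden layer
print(clf.coefs_[1].shape)  # (10, 10)
print(clf.coefs_[2].shape)  # (10, 1)  -> single output unit for binary labels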
