retain_graph problem with GRU in Pytorch 1.6 - python-3.x

I am aware that, when calling loss.backward(), we need to specify retain_graph=True if there are multiple networks and multiple loss functions to optimize each network separately. But even with (or without) this parameter I am getting errors. Following is an MWE to reproduce the issue (on PyTorch 1.6).
import torch
from torch import nn
from torch import optim
torch.autograd.set_detect_anomaly(True)
class GRU1(nn.Module):
    def __init__(self):
        super(GRU1, self).__init__()
        self.brnn = nn.GRU(input_size=2, bidirectional=True, num_layers=1, hidden_size=100)

    def forward(self, x):
        return self.brnn(x)

class GRU2(nn.Module):
    def __init__(self):
        super(GRU2, self).__init__()
        self.brnn = nn.GRU(input_size=200, bidirectional=True, num_layers=1, hidden_size=1)

    def forward(self, x):
        return self.brnn(x)
gru1 = GRU1()
gru2 = GRU2()
gru1_opt = optim.Adam(gru1.parameters())
gru2_opt = optim.Adam(gru2.parameters())
criterion = nn.MSELoss()
for i in range(100):
    gru1_opt.zero_grad()
    gru2_opt.zero_grad()
    vector = torch.randn((15, 100, 2))
    gru1_output, _ = gru1(vector)  # (15, 100, 200)
    loss_gru1 = criterion(gru1_output, torch.randn((15, 100, 200)))
    loss_gru1.backward(retain_graph=True)
    gru1_opt.step()
    gru1_output, _ = gru1(vector)  # (15, 100, 200)
    gru2_output, _ = gru2(gru1_output)  # (15, 100, 2)
    loss_gru2 = criterion(gru2_output, torch.randn((15, 100, 2)))
    loss_gru2.backward(retain_graph=True)
    gru2_opt.step()
    print(f"GRU1 loss: {loss_gru1.item()}, GRU2 loss: {loss_gru2.item()}")
With retain_graph set to True I get the error
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 300]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
The error without the parameter is
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.
which is expected.
Please point at what needs to be changed in the above code for it to begin training. Any help is appreciated.

In such a case, one can detach the computation graph to exclude the parameters that don't need to be optimized. In this case, the computation graph should be detached after the second forward pass with gru1 i.e.
....
gru1_opt.step()
gru1_output, _ = gru1(vector)
gru1_output = gru1_output.detach()
....
This way, you won't "try to backward through the graph a second time" as the error mentioned.
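Putting the two snippets together, the loop body would look roughly like this; with gru1_output detached before it reaches gru2, retain_graph is no longer needed on either backward call:
for i in range(100):
    gru1_opt.zero_grad()
    gru2_opt.zero_grad()
    vector = torch.randn((15, 100, 2))

    # Train gru1 on its own loss; no retain_graph needed.
    gru1_output, _ = gru1(vector)
    loss_gru1 = criterion(gru1_output, torch.randn((15, 100, 200)))
    loss_gru1.backward()
    gru1_opt.step()

    # Second forward pass with the updated gru1, then cut the graph so that
    # loss_gru2 only optimizes gru2's parameters.
    gru1_output, _ = gru1(vector)
    gru1_output = gru1_output.detach()
    gru2_output, _ = gru2(gru1_output)
    loss_gru2 = criterion(gru2_output, torch.randn((15, 100, 2)))
    loss_gru2.backward()
    gru2_opt.step()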

Related

RNN Input Dimension! Pytorch

I'm having a hard time feeding data into an RNN in PyTorch.
The output dimension from my previous linear layer is (32, 50), where 32 is the batch size.
I want to feed this to an RNN layer.
I tried reshaping my data into two forms, [32, 1, 50] and [32, 50, 1], and both times I get an error.
class trainer:
    def __init__(self):
        self.r = nn.RNN(input_size=50, hidden_size=2, num_layers=3, batch_first=True)

    def forward(self, previous_layer_output):
        previous_layer_output.unsqueeze_(-1)  # makes shape [32, 50, 1]
        # I also tried previous_layer_output.unsqueeze_(1), which makes shape [32, 1, 50]
        rnn_out = self.r(previous_layer_output)
I get this error: RuntimeError: input.size(-1) must be equal to input_size. Expected 50, got 1
If I do previous_layer_output.unsqueeze_(1) I get the error below:
ValueError: Expected target size (32, 50), got torch.Size([32])
To distill and isolate your problem, I wrote this code.
import torch
from torch import nn
rnn = nn.RNN(input_size=50, hidden_size=2, num_layers=2, batch_first=True)
a = torch.rand(32, 50)
a.unsqueeze_(1)
o = rnn(a)
and it worked fine for me. I am not sure what the problem is, but it might be somewhere else in your code.
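For reference, here is a slightly extended version of that check that also prints the shapes coming out of the RNN; with batch_first=True the output is [batch, seq_len, hidden_size], and the numbers below are just the ones from the question:
import torch
from torch import nn

rnn = nn.RNN(input_size=50, hidden_size=2, num_layers=2, batch_first=True)

a = torch.rand(32, 50)   # [batch, features] from the previous linear layer
a = a.unsqueeze(1)       # [32, 1, 50]: one timestep of 50 features
output, hidden = rnn(a)
print(output.shape)      # torch.Size([32, 1, 2])
print(hidden.shape)      # torch.Size([2, 32, 2]): [num_layers, batch, hidden_size]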

Change custom loss parameter and NN parameter with respect to epoch

I have a Keras model defined in the following manner (Tried to keep only the necessary parts):
temperature = 5.0
def knowledge_distillation_loss(y_true, y_pred, lambda_const):
    y_true, logits = y_true[:, :10], y_true[:, 10:]
    y_soft = K.softmax(logits/temperature)
    y_pred, y_pred_soft = y_pred[:, :10], y_pred[:, 10:]
    return lambda_const*logloss(y_true, y_pred) + logloss(y_soft, y_pred_soft)

def get_model(num_labels):
    # Some layers for model
    model.add(Dense(num_labels))
    logits = model.layers[-1].output
    probabilities = Activation('softmax')(logits)
    # softed probabilities
    logits_T = Lambda(lambda x: x/temperature)(logits)
    probabilities_T = Activation('softmax')(logits_T)
    output = concatenate([probabilities, probabilities_T])
    model = Model(model.input, output)
    lambda_const = 0.07
    model.compile(
        optimizer=optimizers.SGD(lr=1e-1, momentum=0.9, nesterov=True),
        loss=lambda y_true, y_pred: knowledge_distillation_loss(y_true, y_pred, lambda_const),
        metrics=[accuracy])
    return model
I am following this reference.
This is implemented using fit_generator() on Keras with the TensorFlow backend. Obviously, I will have trouble when loading the model, since temperature is hard-coded.
Also,
I wish to update the temperature parameter with respect to the epoch number, in both the loss function and the model.
How do I define such a control signal?
I've turned this into a complete example of one way to do this.
You could make a class for the loss function.
class TemperatureLossFunction:
    def __init__(self, temperature):
        self.temperature = temperature

    def loss_fun(self, y_truth, y_pred):
        return self.temperature*keras.losses.mse(y_truth, y_pred)

    def setTemperature(self, t, session=None):
        if session:
            session.run(self.temperature.assign(t))
        elif tensorflow.get_default_session():
            tensorflow.get_default_session().run(self.temperature.assign(t))

class TemperatureLossCallback(keras.callbacks.Callback):
    def __init__(self, temp_lf):
        self.temp_lf = temp_lf

    def on_epoch_end(self, epoch, params):
        self.temp_lf.setTemperature(epoch)
I've created two functions for working with this; the first creates and saves the model.
def init(session):
    global temperature  # global for serialization issues
    temperature = tensorflow.Variable(5.0)
    tlo = TemperatureLossFunction(temperature)
    inp = keras.layers.Input((4, 4))
    l1 = keras.layers.Lambda(lambda x: temperature*x)
    op = l1(inp)
    m = keras.models.Model(inputs=[inp], outputs=[op])
    m.compile(optimizer=keras.optimizers.SGD(0.01), loss=tlo.loss_fun)
    # make sure the session is the one you're using!
    session.run(temperature.initializer)
The first test I run makes sure we are changing the value.
m.evaluate( numpy.ones((1, 4, 4)), numpy.zeros((1, 4, 4)) )
session.run(temperature.assign(1))
m.evaluate( numpy.ones((1, 4, 4)), numpy.zeros((1, 4, 4)) )
The second test I run makes sure we can change the values with a callback.
cb = TemperatureLossCallback(tlo)
def gen():
    for i in range(10):
        yield numpy.ones((1, 4, 4)), numpy.zeros((1, 4, 4))

m.fit_generator(
    gen(), steps_per_epoch=1, epochs=10, callbacks=[cb]
)
m.save("junk.h5")
Finally, to demonstrate reloading the file.
def restart(session):
    global temperature
    temperature = tensorflow.Variable(5.0)
    tlo = TemperatureLossFunction(temperature)
    loss_fun = tlo.loss_fun
    m = keras.models.load_model(
        "junk.h5",
        custom_objects={"loss_fun": tlo.loss_fun}
    )
    session.run(temperature.initializer)
    m.evaluate(numpy.ones((1, 4, 4)), numpy.zeros((1, 4, 4)))
    session.run(temperature.assign(1))
    m.evaluate(numpy.ones((1, 4, 4)), numpy.zeros((1, 4, 4)))
For completeness, this is the code I use to start the program:
import sys
import tensorflow

if __name__ == "__main__":
    sess = tensorflow.Session()
    with sess.as_default():
        if "restart" in sys.argv:
            restart(sess)
        else:
            init(sess)
One downside of this method: if you run it, you will see that the temperature variable does not get loaded from the model file; it takes on the value assigned in the code.
On the plus side, both the loss function and the layer reference the same Variable.
One way I found to save the variable value is to create a new layer and use the variable as the weight for the new layer.
class VLayer(keras.layers.Layer):
    def __init__(self, *args, **kwargs):
        super().__init__(**kwargs)

    def build(self, input_shape):
        self.v1 = self.add_weight(
            dtype="float32",
            shape=(),
            trainable=False,
            initializer="zeros"
        )

    def call(self, x):
        return x*self.v1

    def setValue(self, val):
        self.set_weights([numpy.array(val)])
Now when you load the model, the weight will be loaded. Unfortunately, I could not find a way to link the weight to a Variable on load, so there will be two variables: one for the loss function and one for the layer. Both of them can be set from a callback though, so I feel this method is on a more robust path.
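As a rough illustration of that last point, a callback along these lines could update both per epoch. This is a sketch built on the classes above, not code from the original answer, and the schedule 5.0 / (epoch + 1) is just a made-up example:
class TemperatureScheduler(keras.callbacks.Callback):
    def __init__(self, temp_lf, v_layer):
        self.temp_lf = temp_lf   # TemperatureLossFunction instance from above
        self.v_layer = v_layer   # VLayer instance from above

    def on_epoch_end(self, epoch, logs=None):
        t = 5.0 / (epoch + 1)            # hypothetical schedule
        self.temp_lf.setTemperature(t)   # updates the loss-function Variable
        self.v_layer.setValue(t)         # updates the layer weight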

Pytorch. Optimizing the input to a model: Trying to backward through the graph a second time, but the buffers have already been freed

Currently, I'm trying to optimize the values of an input tensor, x, to a model.
I want to restrict the input to only contain values in the range [0.0, 1.0].
There is not much information about how to do this when the thing being optimized is not a layer as such.
I've created a minimum working example below, which gives the error message in the title of this post.
The magic happens in the optimize_x() function.
If I comment out the line model.x = model.x.clamp(min=0.0, max=1.0), the issue is fixed, but the tensor is obviously not clamped.
I'm aware that I could just set retain_graph=True, but it's not clear whether this is the right way to go, or if there is a better way of achieving this functionality.
import torch
from torch.distributions import Uniform
class OptimizeInputModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Sequential(
            torch.nn.Linear(123, 1000),
            torch.nn.Dropout(0.4),
            torch.nn.ReLU(),
            torch.nn.Linear(1000, 100),
            torch.nn.Dropout(0.4),
            torch.nn.ReLU(),
            torch.nn.Linear(100, 1),
            torch.nn.Sigmoid(),
        )
        in_shape = (1, 123)
        self.x = torch.ones(in_shape) * 0.1
        self.x.requires_grad = True

    def forward(self) -> torch.Tensor:
        return self.model(self.x)

class MyLossFunc(torch.nn.Module):
    def forward(self, y: torch.Tensor) -> torch.Tensor:
        loss = torch.sum(-y)
        return loss

def optimize_x():
    model = OptimizeInputModel()
    optimizer = torch.optim.Adam([model.x], lr=1e-4)
    loss_fn = MyLossFunc()
    for epoch in range(50000):
        # Constrain X to have no values < 0
        model.x = model.x.clamp(min=0.0, max=1.0)
        y = model()
        loss = loss_fn(y)
        if epoch % 9 == 0:
            print(f'Epoch: {epoch}\t Loss: {loss}')
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

optimize_x()
Full error message:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
For anyone in the future who might have the same question.
My solution was to do (note the underscore!):
model.x.data.clamp_(min=0.0, max=1.0)
instead of:
model.x = model.x.clamp(min=0.0, max=1.0)
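Note that mutating .data bypasses autograd; an equivalent and, in more recent PyTorch versions, generally preferred way to do the same in-place clamp on the leaf tensor is to wrap it in torch.no_grad():
with torch.no_grad():
    model.x.clamp_(min=0.0, max=1.0)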

how to build a multidimensional autoencoder with pytorch

I followed this great answer for sequence autoencoder,
LSTM autoencoder always returns the average of the input sequence.
but I ran into some problems when I tried to change the code:
question one:
Your explanation is very professional, but the problem is a little bit different from mine; I attached some code I changed from your example. My input features are 2-dimensional, and my output is the same as the input.
for example:
input_x = torch.Tensor([[0.0,0.0], [0.1,0.1], [0.2,0.2], [0.3,0.3], [0.4,0.4]])
output_y = torch.Tensor([[0.0,0.0], [0.1,0.1], [0.2,0.2], [0.3,0.3], [0.4,0.4]])
input_x and output_y are the same: 5 timesteps, 2-dimensional features.
import torch
import torch.nn as nn
import torch.optim as optim
class LSTM(nn.Module):
    def __init__(self, input_dim, latent_dim, num_layers):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.latent_dim = latent_dim
        self.num_layers = num_layers
        self.encoder = nn.LSTM(self.input_dim, self.latent_dim, self.num_layers)
        # I changed here, to 40 dimensions; I think there is some problem
        # self.decoder = nn.LSTM(self.latent_dim, self.input_dim, self.num_layers)
        self.decoder = nn.LSTM(40, self.input_dim, self.num_layers)

    def forward(self, input):
        # Encode
        _, (last_hidden, _) = self.encoder(input)
        # It is way more general that way
        encoded = last_hidden.repeat(input.shape)
        # Decode
        y, _ = self.decoder(encoded)
        return torch.squeeze(y)

model = LSTM(input_dim=2, latent_dim=20, num_layers=1)
loss_function = nn.MSELoss()
optimizer = optim.Adam(model.parameters())

y = torch.Tensor([[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4]])
x = y.view(len(y), -1, 2)  # I changed here

while True:
    y_pred = model(x)
    optimizer.zero_grad()
    loss = loss_function(y_pred, y)
    loss.backward()
    optimizer.step()
    print(y_pred)
The above code can learn very well; can you help review the code and give some guidance?
However, when I feed 2 examples into the model at once, it no longer works:
for example, change the code:
y = torch.Tensor([[0.0,0.0], [0.1,0.1], [0.2,0.2], [0.3,0.3], [0.4,0.4]])
to:
y = torch.Tensor([[[0.0,0.0],[0.5,0.5]], [[0.1,0.1], [0.6,0.6]], [[0.2,0.2],[0.7,0.7]], [[0.3,0.3],[0.8,0.8]], [[0.4,0.4],[0.9,0.9]]])
When I compute the loss function, it raises an error; can anyone take a look?
question two:
my training samples have different lengths:
for example:
x1 = [[0.0,0.0], [0.1,0.1], [0.2,0.2], [0.3,0.3], [0.4,0.4]] #with 5 timesteps
x2 = [[0.5,0.5], [0.6,0.6], [0.7,0.7]] #with only 3 timesteps
How can I feed these two training samples into the model at the same time for batch training?
Recurrent N-dimensional autoencoder
First of all, LSTMs work on 1D samples, while yours are 2D, as LSTMs are usually used for words encoded with a single vector.
No worries though: one can flatten such a 2D sample to 1D; an example for your case would be:
import torch
var = torch.randn(10, 32, 100, 100)
var.reshape((10, 32, -1)) # shape: [10, 32, 100 * 100]
Please notice this is not really general: what if you had a 3D input? The snippet below generalizes this notion to any dimensionality of your samples, provided the preceding dimensions are batch_size and seq_len:
import torch
input_size = 3  # number of trailing dimensions that make up a single sample
var = torch.randn(10, 32, 100, 100, 35)
var.reshape(var.shape[:-input_size] + (-1,))  # shape: [10, 32, 100 * 100 * 35]
Finally, you can employ it inside a neural network as follows. Look especially at the forward method and the constructor arguments:
import torch

class LSTM(torch.nn.Module):
    # input_dim has to be the size after flattening
    # For a 20x20 single input it would be 400
    def __init__(
        self,
        input_dimensionality: int,
        input_dim: int,
        latent_dim: int,
        num_layers: int,
    ):
        super(LSTM, self).__init__()
        self.input_dimensionality: int = input_dimensionality
        self.input_dim: int = input_dim  # It is 1d, remember
        self.latent_dim: int = latent_dim
        self.num_layers: int = num_layers
        self.encoder = torch.nn.LSTM(self.input_dim, self.latent_dim, self.num_layers)
        # You can have any latent dim you want, just the output has to be the exact same size as the input
        # In this case, with only an encoder and a decoder, it has to be input_dim though
        self.decoder = torch.nn.LSTM(self.latent_dim, self.input_dim, self.num_layers)

    def forward(self, input):
        # Save original size first:
        original_shape = input.shape
        # Flatten 2d (or 3d or however many you specified in constructor)
        input = input.reshape(input.shape[: -self.input_dimensionality] + (-1,))
        # Rest goes as in my previous answer
        _, (last_hidden, _) = self.encoder(input)
        encoded = last_hidden.repeat(input.shape)
        y, _ = self.decoder(encoded)
        # You have to reshape the output to what the original was
        reshaped_y = y.reshape(original_shape)
        return torch.squeeze(reshaped_y)
Remember you have to reshape your output in this case. It should work for any dimensions.
Batching
When it comes to batching and sequences of different lengths, it is a little more complicated.
You have to pad each sequence in the batch before pushing it through the network. Usually the values you pad with are zeros; you may configure that inside the LSTM though.
You may check this link for an example. You will have to use functions like torch.nn.utils.rnn.pack_padded_sequence and others to make it work; you may check this answer.
Oh, and since PyTorch 1.1 you don't have to sort your sequences by length in order to pack them. But when it comes to this topic, grab some tutorials; they should make things clearer.
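As a rough illustration (a minimal sketch, not taken from the linked answers), padding and packing the two sequences from the question could look like this:
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

x1 = torch.tensor([[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4]])  # 5 timesteps
x2 = torch.tensor([[0.5, 0.5], [0.6, 0.6], [0.7, 0.7]])                          # 3 timesteps

# Pad to the same length -> shape [max_seq_len, batch, features] = [5, 2, 2]
padded = pad_sequence([x1, x2])
lengths = torch.tensor([5, 3])

# Pack so the LSTM skips the padded timesteps
# (enforce_sorted=False is available since PyTorch 1.1)
packed = pack_padded_sequence(padded, lengths, enforce_sorted=False)

lstm = torch.nn.LSTM(input_size=2, hidden_size=20)
packed_output, (h_n, c_n) = lstm(packed)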
Lastly: please separate your questions. Once the autoencoding works with a single example, move on to batching, and if you have issues there, please post a new question on StackOverflow. Thanks.

pyspark and pytorch: unable to retrieve gradients on the driver

I'm new to PyTorch and I'm exploring the feasibility of using it with Spark (for now I'm working in Spark standalone mode).
At the moment I'm stuck on a very specific issue.
Let's start with a very simple model:
# linmodel.py
import torch
import torch.nn as nn
import numpy as np
def standardize(x):
    return (x - np.mean(x)) / np.std(x)

def add_noise(y):
    rnd = np.random.randn(y.shape[0])
    return y + rnd

def cost(target, predicted):
    cost = torch.sum((torch.t(target) - predicted) ** 2)
    return cost

class LinModel(nn.Module):
    def __init__(self, in_size, out_size):
        super(LinModel, self).__init__()  # always call parent's init
        self.linear = nn.Linear(in_size, out_size, bias=False)  # layer parameters

    def forward(self, x):
        return self.linear(x)
This instantiates a basic linear model, along with some utility functions. The goal is to approximate a target matrix and to keep track of how the gradients behave.
I'm trying to achieve the following:
1. create my target matrix
2. split the inputs on the workers
3. instantiate models and optimizer on the workers
4. compute the approximation on subsets of input
5. retrieve the gradients for further analysis
And everything works fine until point 5.
Here's the code:
#test.py
import torch
import torch.nn as nn
import numpy as np
import torch.optim
from torch.autograd import Variable
from pyspark import SparkContext
import linmodel
def prepare_input(nsamples=400):
    Xold = np.linspace(0, 1000, nsamples).reshape([nsamples, 1])
    X = linmodel.standardize(Xold)
    W = np.random.randint(1, 10, size=(5, 1))
    Y = W.dot(X.T)  # target
    for i in range(Y.shape[1]):
        Y[:, i] = linmodel.add_noise(Y[:, i])
    x = Variable(torch.from_numpy(X), requires_grad=False).type(torch.FloatTensor)
    y = Variable(torch.from_numpy(Y), requires_grad=False).type(torch.FloatTensor)
    print("created torch variables {} {}".format(x.size(), y.size()))
    return x, y, W

def initialize(tup):
    x, y = tup[0]  # data
    m, o = tup[1]  # model and optimizer
    model, optimizer = torch_step(x, y, m, o)
    # here we have the gradients
    print('gradient: {}'.format([param.grad.data for param in model.parameters()]))
    return (x, y), (model, optimizer)

def create_model():
    model = linmodel.LinModel(1, 5)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    return model, optimizer

def torch_step(x, y, model, optimizer):
    prediction = model(x)
    loss = linmodel.cost(y, prediction)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return model, optimizer

def main(sc, num_partitions=4):
    x, y, W = prepare_input()
    parts_x = list(torch.split(x, int(x.size()[0] / num_partitions)))
    parts_y = list(torch.split(y, int(x.size()[0] / num_partitions), 1))
    rdd_models = sc.parallelize([create_model() for _ in range(num_partitions)]).repartition(num_partitions)
    rdd_x = sc.parallelize(parts_x).repartition(num_partitions)
    rdd_y = sc.parallelize(parts_y).repartition(num_partitions)
    parts = rdd_x.zip(rdd_y)  # [((100x1), (5x100)), ...]
    full = parts.zip(rdd_models).map(initialize).cache()
    models_out = full.map(lambda x: x[1][0]).collect()
    test_model = models_out[0]
    print(type(test_model))
    print('gradient: {}'.format([param.grad.data for param in test_model.parameters()]))

if __name__ == '__main__':
    sc = SparkContext(appName='test')
    main(sc)
As you can see in the comments, when the function initialize is mapped over the full RDD, the executors' logs show that the gradients are computed.
When I collect the result and try to access the very same attribute on the driver, I receive an AttributeError: 'NoneType' object has no attribute 'data',
meaning that all the parameters' .grad attributes are set to None.
I'm sure I'm missing something big here, but I cannot see it.
Any hint is appreciated.
Thanks a lot.
There are two major mistakes in your approach (according to me):
Since you want a distributed training, your approach of instantiating the model separately in all of the executors is wrong. You should instantiate the model in the head node (node where spark driver is located) and then distribute that model to all the executors. So each executor independently does forward pass and calculates the gradients on its portion of data and passes the gradients to the head node for weight update (weight update has to be serialized). Then the updated network is again scattered to the executors for the next iteration.
A much bigger concern is that I am not sure whether the gradient buffers are copied from the executors to the head node when you perform .collect(), which is why model.grad can end up set to None. To begin debugging, I suggest you use only one executor (and one partition) and then perform a .collect() to see whether the gradient buffers are being copied. Or, if you are comfortable with Java or Scala, you can look at the collect() method's implementation.
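If it turns out the gradient buffers are not serialized with the models, one way to sidestep the issue (a sketch, not part of the original answer) is to have the workers return the gradients explicitly as NumPy arrays and collect those instead of the model objects:
def initialize(tup):
    x, y = tup[0]  # data
    m, o = tup[1]  # model and optimizer
    model, optimizer = torch_step(x, y, m, o)
    # Copy the gradients into plain numpy arrays so they survive serialization
    grads = [param.grad.data.numpy().copy() for param in model.parameters()]
    return (x, y), (model, optimizer), grads

# then, on the driver:
# full = parts.zip(rdd_models).map(initialize).cache()
# grads_out = full.map(lambda t: t[2]).collect()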
Hope this helps.
