Can't perform forward pass twice in DDP - PyTorch

I am trying to forward-pass two different inputs through the same model, as shown below:
for epoch in range(num_epochs):
    dataloader.sampler.set_epoch(epoch)
    for batch_index, (real, _) in enumerate(dataloader):
        disc.zero_grad()
        real = real.to(rank)
        noise = torch.randn((batch_size, z_dim, 1, 1)).to(rank)
        fake_img = gen(noise)
        fake_img_clone = fake_img.detach().clone()
        disc_real = disc(real).reshape(-1)
        lossD_real = critereon(disc_real, torch.ones_like(disc_real))
        disc_fake = disc(fake_img.detach()).reshape(-1)
        lossD_fake = critereon(disc_fake, torch.zeros_like(disc_fake))
        lossD = (lossD_fake + lossD_real) / 2
        opt_disc.step()
However, I keep getting the error "one of the variables needed for gradient computation has been modified by an inplace operation"
Setting torch.autograd.set_detect_anomaly(True, check_nan=True) shows that the error occurs in disc_real = disc(real).reshape(-1), but when I debug manually, the error appears only once I add the second forward pass, disc_fake = disc(fake_img.detach()).reshape(-1).
I am currently using the latest version of PyTorch. Please help me solve this :frowning:

The error message "one of the variables needed for gradient computation has been modified by an inplace operation" typically occurs when you modify a tensor in-place, which can break the computation graph and cause issues with backpropagation.
In your code, the line fake_img_clone = fake_img.detach().clone() creates a tensor that is never used afterwards. Keep in mind what the two calls do: detach() returns a new tensor with the same data that still shares storage with the original, so an in-place modification of the detached tensor also modifies the original; clone() is what produces an independent copy.
Since the clone is not needed, you can simply pass fake_img.detach() to the discriminator. This gives a tensor that is cut off from the generator's computation graph, so the discriminator loss will not try to backpropagate into the generator.
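As a quick standalone illustration of that storage sharing (a sketch, not part of the training loop):

import torch

x = torch.zeros(3)
shared = x.detach()           # new tensor, but same underlying storage as x
copied = x.detach().clone()   # clone() makes an independent copy

shared += 1                   # in-place change through the detached view
print(x)       # tensor([1., 1., 1.])  -- the original sees the change
print(copied)  # tensor([0., 0., 0.])  -- the clone does not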
Here's the updated code:
for epoch in range(num_epochs):
    dataloader.sampler.set_epoch(epoch)
    for batch_index, (real, _) in enumerate(dataloader):
        disc.zero_grad()
        real = real.to(rank)
        noise = torch.randn((batch_size, z_dim, 1, 1)).to(rank)
        fake_img = gen(noise)
        disc_real = disc(real).reshape(-1)
        lossD_real = critereon(disc_real, torch.ones_like(disc_real))
        with torch.no_grad():
            fake_img_detached = fake_img.detach()
        disc_fake = disc(fake_img_detached).reshape(-1)
        lossD_fake = critereon(disc_fake, torch.zeros_like(disc_fake))
        lossD = (lossD_fake + lossD_real) / 2
        lossD.backward()
        opt_disc.step()
In the updated code, we detach the fake_img tensor without cloning it and store the result in fake_img_detached, wrapping the detach in a torch.no_grad() context so that this step is not tracked by autograd. We then use fake_img_detached for the discriminator's second forward pass, which should avoid the in-place modification issue.

Related

Can't get Keras Code Example #1 to work with multi-label dataset

Apologies in advance.
I am attempting to recreate this CNN (from the Keras Code Examples), with another dataset.
https://keras.io/examples/vision/image_classification_from_scratch/
The dataset I am using is one for retinal scans, and classifies images on a scale from 0-4. So, it's a multi-label image classification.
The Keras example used is binary classification (cats v dogs), though I would have hoped it wouldn't make much difference (maybe this is a big assumption on my part).
I skipped the 'image augmentation' part of the walkthrough. So, I have not created the
data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
    ]
)
part. So, instead of:
def make_model(input_shape, num_classes):
    inputs = keras.Input(shape=input_shape)

    # Image augmentation block
    x = data_augmentation(inputs)

    # Entry block
    x = layers.Rescaling(1.0 / 255)(x)
    .......
at the beginning of the model, I have:
def make_model(input_shape, num_classes):
    inputs = keras.Input(shape=input_shape)

    # Image augmentation block
    x = keras.Sequential(inputs)

    # Entry block
    x = layers.Rescaling(1.0 / 255)(x)
    .......
However, I keep getting different errors no matter how much I change things around, such as "TypeError: Keras symbolic inputs/outputs do not implement __len__." or "ValueError: Exception encountered when calling layer "rescaling_3" (type Rescaling).".
What am I missing here?
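For reference, a minimal sketch of what the entry block can look like when the augmentation step is skipped (my reading of the intent; the elided body of the example is filled with placeholder layers here, and the labels are assumed to be integers 0-4, i.e. one class per image). The symbolic inputs go straight into Rescaling rather than being wrapped in keras.Sequential:

from tensorflow import keras
from tensorflow.keras import layers

def make_model(input_shape, num_classes):
    inputs = keras.Input(shape=input_shape)

    # No augmentation block: feed the inputs straight into the entry block
    x = layers.Rescaling(1.0 / 255)(inputs)

    # Placeholder body standing in for the omitted layers of the Keras example
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)

    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model = make_model(input_shape=(180, 180, 3), num_classes=5)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])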

Pytorch model gradients are printed correctly but copied wrongly

I want to copy the gradients of the loss with respect to the weights for different data samples using PyTorch. In the code below, I iterate over one sample at a time from the data loader (batch size = 1) and collect the gradients for the first fully connected (fc1) layer. The gradients should differ between samples. The print statement shows correct gradients, which are different for different samples, but when I store them in a list I get the same gradients repeated. Any suggestions would be much appreciated. Thanks in advance!
grad_list = []
for data in test_loader:
    inputs, labels = data[0], data[1]
    inputs = torch.autograd.Variable(inputs)
    labels = torch.autograd.Variable(labels)
    # zero the parameter gradients
    optimizer.zero_grad()
    # forward + backward
    output = target_model(inputs)
    loss = criterion(output, labels)
    loss.backward()
    grad_list.append(target_model.fc1.weight.grad.data)
    print(target_model.fc1.weight.grad.data)
Try using clone and detach instead:
grad_list.append(target_model.fc1.weight.grad.clone().detach())
The .grad.data you are appending is a reference to the parameter's gradient buffer (i.e. the actual memory and the values it contains), which is overwritten on every backward pass, so every list entry ends up pointing at the same, latest values. What you need is a replica of the gradient tensor (clone) that is also cut off from the autograd graph (detach), so later backward passes cannot touch it.
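A small standalone sketch of the difference, using a hypothetical toy model (zero_grad(set_to_none=False) keeps reusing the same gradient buffer across iterations, as older PyTorch defaults did):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

aliased, copied = [], []
for _ in range(3):
    optimizer.zero_grad(set_to_none=False)   # zero the existing .grad buffer in place
    loss = criterion(model(torch.randn(1, 4)), torch.randn(1, 2))
    loss.backward()
    aliased.append(model.weight.grad)                  # reference to the shared buffer
    copied.append(model.weight.grad.clone().detach())  # independent snapshot

print(all(g is aliased[0] for g in aliased))   # True: every entry is the same object
print(torch.equal(copied[0], copied[1]))       # almost surely False: snapshots differ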

Number of operations increases with tf.gradients

I was trying to calculate the gradient with respect to the input using a combination of Keras and TensorFlow. The code, which runs inside a loop, looks like this:
import keras.backend as K

loss = K.categorical_crossentropy(model.output, target)  # model.output stands in for the model's output tensor
gradient = sess.run(
    [tf.gradients(loss, model.input, colocate_gradients_with_ops=True)],
    feed_dict={model.input: img})  # img is a numpy array that fits the dimension requirements
n_operations = len(tf.get_default_graph().get_operations())
I noticed that n_operations increases every iteration, and so does the time each iteration takes. Is that normal? Is there any way to prevent this?
Thank you!
No, this is not the desired behavior. The problem is that you are defining the gradient operation again and again, when you only need to define it once and then execute it. tf.gradients pushes new operations onto the graph and returns handles to those gradients, so you only have to run those handles to get the results. Calling the function repeatedly keeps adding operations, which will eventually ruin your performance. The solution is as follows:
# outside the loop
loss = K.categorical_crossentropy(model.output, target)
gradients = tf.gradients(loss, model.input, colocate_gradients_with_ops=True)

# inside the loop
gradient_np = sess.run([gradients], feed_dict={model.input: img})  # img is a numpy array that fits the dimension requirements
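A self-contained sketch of the same pattern with a hypothetical toy model, assuming a TF 1.x / graph-mode Keras setup like the one in the question (note that keras.backend.categorical_crossentropy takes the target first):

import numpy as np
import tensorflow as tf          # 1.x graph mode assumed, as in the question
import keras.backend as K
from keras.layers import Dense, Input
from keras.models import Model

# Hypothetical toy model standing in for the real one
inp = Input(shape=(8,))
out = Dense(3, activation="softmax")(inp)
model = Model(inp, out)

target = tf.placeholder(tf.float32, shape=(None, 3))

# Graph construction happens once, outside the loop
loss = K.categorical_crossentropy(target, model.output)
grad_op = tf.gradients(loss, model.input, colocate_gradients_with_ops=True)

sess = K.get_session()
for _ in range(3):
    img = np.random.rand(4, 8).astype("float32")
    tgt = np.eye(3, dtype="float32")[np.random.randint(0, 3, size=4)]
    grads = sess.run(grad_op, feed_dict={model.input: img, target: tgt})
    # Stays constant: no new ops are added inside the loop
    print(len(tf.get_default_graph().get_operations()))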

Seq2seq for non-sentence, float data; stuck configuring the decoder

I am trying to apply sequence-to-sequence modelling to EEG data. The encoding works just fine, but getting the decoding to work is proving problematic. The input-data has the shape None-by-3000-by-31, where the second dimension is the sequence-length.
The encoder looks like this:
initial_state = lstm_sequence_encoder.zero_state(batchsize, dtype=self.model_precision)

encoder_output, state = dynamic_rnn(
    cell=LSTMCell(32),
    inputs=lstm_input,            # shape=(None, 3000, 32)
    initial_state=initial_state,  # zeroes
    dtype=lstm_input.dtype        # tf.float32
)
I use the final state of the RNN as the initial state of the decoder. For training, I use the TrainingHelper:
training_helper = TrainingHelper(target_input, [self.sequence_length])

training_decoder = BasicDecoder(
    cell=lstm_sequence_decoder,
    helper=training_helper,
    initial_state=thought_vector
)

output, _, _ = dynamic_decode(
    decoder=training_decoder,
    maximum_iterations=3000
)
My troubles start when I try to implement inference. Since I am using non-sentence data, I do not need to tokenize or embed, because the data is essentially embedded already. The InferenceHelper class seemed the best way to achieve my goal. So this is what I use. I'll give my code then explain my problem.
def _sample_fn(decoder_outputs):
    return decoder_outputs

def _end_fn(_):
    return tf.tile([False], [self.lstm_layersize])  # Batch-size is sequence-length because of time major

inference_helper = InferenceHelper(
    sample_fn=_sample_fn,
    sample_shape=[32],
    sample_dtype=target_input.dtype,
    start_inputs=tf.zeros(batchsize_placeholder, 32),  # the batchsize varies
    end_fn=_end_fn
)

inference_decoder = BasicDecoder(
    cell=lstm_sequence_decoder,
    helper=inference_helper,
    initial_state=thought_vector
)

output, _, _ = dynamic_decode(
    decoder=inference_decoder,
    maximum_iterations=3000
)
The Problem
I don't know what the shape of the inputs should be. I know the start-inputs should be zero because it is the first time-step. But this throws errors; it expects the input to be (1,32).
I also thought I should pass the output of each time-step unchanged to the next. However, this raises problems at run-time: the batch-size varies, so the shape is partial. The library throws an exception at this as it tries to convert the start_input to a tensor:
...
self._start_inputs = ops.convert_to_tensor(
    start_inputs, name='start_inputs')
Any ideas?
This is a lesson in poor documentation.
I fixed my problem, but failed to address the variable batch-size problem.
The _end_fn was causing problems I was unaware of. I also managed to work out what the appropriate fields of the InferenceHelper are. I've named the fields in case anyone needs guidance in the future:
def _end_fn(_):
    return tf.tile([False], [batchsize])

inference_helper = InferenceHelper(
    sample_fn=_sample_fn,
    sample_shape=[lstm_number_of_units],  # In my case, 32
    sample_dtype=tf.float32,              # Depends on the data
    start_inputs=tf.zeros((batchsize, lstm_number_of_units)),
    end_fn=_end_fn
)
As for the batch-size problem, there are two things I'm considering:
Changing the internal state of my model object. My TensorFlow computation graph is built inside a class. A class-field records the batch-size. Changing this during training may work. Or:
Pad the batches so that they are 200 sequences long. This will waste time.
Preferably I'd like a way to dynamically manage the batch-sizes.
EDIT: I found a way. It involves simply substituting square-brackets for parentheses:
inference_helper = InferenceHelper(
    sample_fn=_sample_fn,
    sample_shape=[self.lstm_layersize],
    sample_dtype=target_input.dtype,
    start_inputs=tf.zeros([batchsize, self.lstm_layersize]),
    end_fn=_end_fn
)
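On the remaining batch-size question, one option (untested here, so treat it as a sketch) is to derive the batch size from the encoder input at run time with tf.shape; the resulting scalar tensor is accepted by both tf.tile and tf.zeros:

dynamic_batchsize = tf.shape(lstm_input)[0]   # resolved at run time, not at graph-build time

def _end_fn(_):
    return tf.tile([False], [dynamic_batchsize])

inference_helper = InferenceHelper(
    sample_fn=_sample_fn,
    sample_shape=[self.lstm_layersize],
    sample_dtype=target_input.dtype,
    start_inputs=tf.zeros(tf.stack([dynamic_batchsize, self.lstm_layersize])),
    end_fn=_end_fn
)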

Tensorflow: Summaries defined in function not accessible in tensorboard

I have a graph and a set of custom functions that define multilayer RNNs according to an input list which will specify the number of units in each layer. For instance:
def BuildLayers(....):
    # takes inputs, list of layer sizes, mask information, etc
    #
    # invokes BuildLayer(...) several times
    #
    # returns RNN output and states of last layer
BuildLayer loops through a more detailed function which builds and returns individual layers:
def BuildLayer(....):
    # Takes individual layer size, output of previous layer, etc
    #
    # handles bookkeeping of RNNCells, wrappers, reshaping, etc
    # **Important! Defines scope for each layer**
    #
    # returns RNN output and states of last layer
And ultimately this would be called in a function that defines a graph and runs it in a session:
def Experiment(parameters):
    tf.reset_default_graph()
    graph = tf.Graph()
    with graph.as_default():
        #
        # Placeholders
        # BuildLayers(...)
        # Loss function definitions
        # optimizer definitions
    with tf.Session(graph=graph) as session:
        #
        # Loop through epochs:
        # etc
I.e., if the layer size parameter is [16, 32, 16], we end up with an RNN that has a cell of 16 units in layer1, scoped as layer1, 32 units in layer 2, scoped appropriately, and 16 units in layer 3, scoped, etc.
This seems to work fine: a casual inspection of the graph in TensorBoard looks correct, the nodes look correct, the thing trains, etc.
Problem: How can I add histogram summaries, e.g., of kernel weights and biases, to that function definition? I've done so naively, as such:
def buildLayer(numUnits, numLayer, input, lengths):
    name = 'layer' + "{0:0=2d}".format(numLayer)
    with tf.variable_scope(name):
        cellfw = tf.contrib.rnn.GRUCell(numUnits, activation=tf.nn.tanh)
        cellbw = tf.contrib.rnn.GRUCell(numUnits, activation=tf.nn.tanh)
        outputs, state = tf.nn.bidirectional_dynamic_rnn(
            cell_fw=cellfw, cell_bw=cellbw, inputs=input,
            dtype=tf.float32, sequence_length=lengths)
        outputs = tf.concat([outputs[0], outputs[1]], axis=2)
        FwKernel = tf.get_default_graph().get_tensor_by_name(
            name + '/bidirectional_rnn/fw/gru_cell/gates/kernel:0')
        FwKernel_sum = tf.summary.histogram("FwKernel", FwKernel, 'rnn')
    return outputs, state
And then, at the end of the graph definition, I assumed these summaries would be caught up in the
merged = tf.summary.merge_all()
statement. They aren't. I'm confused by this behavior. I can see the histogram summary definitions on a visual inspection of the graph in TensorBoard; they're there. But they don't seem to make it into the merge, and so are never accessible in TensorBoard as histograms per se.
How do I get summaries, which are defined in a function, to show up in TensorBoard, preferably through a merge and without passing them around through function calls like excess baggage?
The least painful way I have found to avoid this is to pass a single list (i.e., "summaries") through each function, and within the BuildLayers function, to append or extend that list with all desired histogram summaries.
Then, in the main graph definition, rather than a merge_all
merged = tf.summary.merge_all()
instead use a merge and pass the list in as the argument
merged = tf.summary.merge(summaries)
This has the disadvantage of not actually being a merge_all: if you have defined other summaries (typically scalar summaries for loss functions, at least), you have to append them to the summaries list manually, or carry around two merge objects, or something similar, which misses the self-advertised point of merge_all.
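A stripped-down sketch of this pattern (using a plain dense layer as a stand-in for the bidirectional GRU, just to keep it short):

import tensorflow as tf   # 1.x graph-style API assumed, as in the question

def build_layer(inputs, num_units, num_layer, summaries):
    # Builds one layer and appends its kernel histogram to the shared list
    name = 'layer' + "{0:0=2d}".format(num_layer)
    with tf.variable_scope(name):
        out = tf.layers.dense(inputs, num_units, activation=tf.nn.tanh)
        kernel = tf.get_default_graph().get_tensor_by_name(name + '/dense/kernel:0')
        summaries.append(tf.summary.histogram("kernel", kernel))
    return out

graph = tf.Graph()
with graph.as_default():
    summaries = []
    x = tf.placeholder(tf.float32, [None, 8])
    h = build_layer(x, 16, 1, summaries)
    h = build_layer(h, 32, 2, summaries)
    loss = tf.reduce_mean(tf.square(h))
    summaries.append(tf.summary.scalar("loss", loss))
    merged = tf.summary.merge(summaries)   # merges exactly what was collected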
I leave this here as an answer to my own question because it might help someone, but will pointedly not accept it because I am hoping to be shown a better way.
Most likely the problem is that the summaries are created inside the with graph.as_default(): context, so the summary operations are added to that graph's _collections["SUMMARIES"] list. But when you call merge_all() you are no longer in that context (the one that made graph the default), so merge_all() looks for summaries in the default graph that was created when you imported TensorFlow, which is probably empty.
To fix the issue, simply call merge_all() within the same with graph.as_default(): context.
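A minimal self-contained sketch of that fix (with hypothetical placeholders, since the original Experiment code is only outlined above):

import tensorflow as tf   # 1.x graph mode assumed

graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(tf.float32, [None, 4])
    w = tf.get_variable("w", shape=[4, 1])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    tf.summary.scalar("loss", loss)
    tf.summary.histogram("w", w)
    merged = tf.summary.merge_all()   # called inside the context, so it sees this graph's summaries
    init = tf.global_variables_initializer()

with tf.Session(graph=graph) as session:
    session.run(init)
    writer = tf.summary.FileWriter("/tmp/tb_demo", graph)   # hypothetical log directory
    summary_values = session.run(merged, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]})
    writer.add_summary(summary_values, global_step=0)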
Here are some relevant code links:
https://github.com/tensorflow/tensorflow/blob/92e6c3e4f5c1cabfda1e61547a6a1b268ef95fa5/tensorflow/python/summary/summary.py#L293
https://github.com/tensorflow/tensorflow/blob/92e6c3e4f5c1cabfda1e61547a6a1b268ef95fa5/tensorflow/python/framework/ops.py#L5649
