loss.backward() gives RuntimeError: CUDA error: device-side assert triggered - pytorch

I'm trying to use BCELoss without any success.
Loss = BCELoss()
opt = optim.AdamW(model.parameters(), lr=0.01, betas=(0.9, 0.99), weight_decay=0.001)
loss = Loss(z, y)
opt.zero_grad()
loss.backward()
z, y have the shape: (128, 2)
I'm getting error (from loss.backward()):
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
When using CrossEntropyLoss (with y of shape (128,)) everything works!
What is wrong, and what do I need to change?

Binary cross entropy (BCELoss) expects one probability per sample, in [0, 1].
Your z needs to have shape [128] (float, with values in [0, 1]) and y also shape [128], as a float tensor containing only 0s and 1s.
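For reference, a minimal sketch of shapes and dtypes that BCELoss accepts; the names z and y follow the question, while logits and the sigmoid are assumptions about where the model output comes from:

import torch
from torch import nn

logits = torch.randn(128, requires_grad=True)   # hypothetical raw model scores, shape (128,)
z = torch.sigmoid(logits)                       # BCELoss input must lie in [0, 1]
y = torch.randint(0, 2, (128,)).float()         # BCELoss target must be float, same shape as z

loss_fn = nn.BCELoss()
loss = loss_fn(z, y)
loss.backward()

# nn.BCEWithLogitsLoss applies the sigmoid internally and is more numerically
# stable: loss = nn.BCEWithLogitsLoss()(logits, y)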

Related

Train two model iteratively with PyTorch

I hope to train two cascaded networks, e.g. X->Z->Y, Z=net1(X), Y=net2(Z).
I hope to optimize the parameters of these two networks iteratively, i.e., for fixed parameters of net1, first train the parameters of net2 using the MSE(predY, Y) loss until convergence; then use the converged MSE loss to train one iteration of net1, and so on.
So I define one optimizer for each network. My training code is below:
net1 = SimpleLinearF()
opt1 = torch.optim.Adam(net1.parameters(), lr=0.01)
loss_func = nn.MSELoss()
for itera1 in range(num_iters1 + 1):
    predZ = net1(X)
    net2 = SimpleLinearF()
    opt2 = torch.optim.Adam(net2.parameters(), lr=0.01)
    for itera2 in range(num_iters2 + 1):
        predY = net2(predZ)
        loss = loss_func(predY, Y)
        if itera2 % (num_iters2 // 2) == 0:
            print('iteration: {:d}, loss: {:.7f}'.format(int(itera2), float(loss)))
        loss.backward(retain_graph=True)
        opt2.step()
        opt2.zero_grad()
    loss.backward()
    opt1.step()
    opt1.zero_grad()
However, I encounter the following error:
RuntimeError: one of the variables needed for gradient computation has been modified by an
inplace operation: [torch.FloatTensor [1, 1]], which is output 0 of AsStridedBackward0, is at
version 502; expected version 501 instead. Hint: enable anomaly detection to find the
operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Does anyone know why this error occurs? How should I solve this problem? Many thanks.
I found the answer to my question after some searching on the PyTorch computation graph.
Removing the retain_graph=True and adding a .detach() in net2(predZ) solves this error.
The detach operation cuts net1 out of the computation graph of net2/opt2.
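As a sketch of how the loop looks with those two changes applied; the recomputed outer loss for net1 is my addition and is not spelled out in the original answer:

for itera1 in range(num_iters1 + 1):
    predZ = net1(X)
    net2 = SimpleLinearF()
    opt2 = torch.optim.Adam(net2.parameters(), lr=0.01)

    # Inner loop: train net2 on a detached predZ so its graph never reaches net1.
    for itera2 in range(num_iters2 + 1):
        predY = net2(predZ.detach())
        loss = loss_func(predY, Y)
        loss.backward()          # retain_graph=True is no longer needed
        opt2.step()
        opt2.zero_grad()

    # Outer step: recompute the loss through the non-detached predZ to update net1.
    loss1 = loss_func(net2(predZ), Y)
    loss1.backward()
    opt1.step()
    opt1.zero_grad()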

How to use nn.CrossEntropyLoss() for a PatchGAN Discriminator output?

I am trying to use the nn.CrossEntropyLoss() to find the cross-entropy loss between reals and fakes of a patchGAN discriminator that outputs a tensor of shape (batch_size, 1, 30, 30).
I am confused by the documentation here, which asks for class indices instead of target tensors.
CE_loss = nn.CrossEntropyLoss()
real_loss = CE_loss(discriminator_real_outputs, torch.ones_like(discriminator_real_outputs))
fake_loss= CE_loss(discriminator_fake_outputs, torch.zeros_like(discriminator_fake_outputs))
I understood from an error that the target requires long integers, so I converted it to long.
CE_loss = nn.CrossEntropyLoss()
real_loss = CE_loss(discriminator_real_outputs, torch.ones_like(discriminator_real_outputs).long())
fake_loss= CE_loss(discriminator_fake_outputs, torch.zeros_like(discriminator_fake_outputs).long())
Then it asked for the target to be of shape (batch_size, 30, 30) instead of (batch_size, 1, 30, 30); I fixed that too.
After that, it returned the error cuda runtime error (710): device-side assert triggered, which left the Google Colab GPU unusable until I reset the runtime.
I want to use this loss like other losses in the form
loss = Loss(input, target) without the index. How do I go about this?
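Not from the original post, but as a hedged sketch of the two shape/dtype combinations that type-check with these losses (all tensor names below are hypothetical): nn.CrossEntropyLoss wants at least two logit channels plus long class indices, while nn.BCEWithLogitsLoss keeps the loss = Loss(input, target) form with the single-channel patch output.

import torch
from torch import nn

batch_size = 8

# nn.CrossEntropyLoss: C >= 2 logit channels, target holds long class indices.
two_class_logits = torch.randn(batch_size, 2, 30, 30)             # (N, C, H, W)
real_targets = torch.ones(batch_size, 30, 30, dtype=torch.long)   # (N, H, W), values in {0, 1}
ce_loss = nn.CrossEntropyLoss()(two_class_logits, real_targets)

# nn.BCEWithLogitsLoss keeps the loss = Loss(input, target) form with the
# (N, 1, 30, 30) PatchGAN output and a float target of the same shape.
patch_logits = torch.randn(batch_size, 1, 30, 30)
bce_loss = nn.BCEWithLogitsLoss()(patch_logits, torch.ones_like(patch_logits))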

Using SSIM loss in TensorFlow returns NaN values

I'm training a network with MRI images and I wanted to use SSIM as the loss function. Until now I was using MSE, and everything was working fine. But when I tried to use SSIM (tf.image.ssim), I get a bunch of these warning messages:
/usr/local/lib/python3.7/dist-packages/matplotlib/image.py:397: UserWarning: Warning: converting a masked element to nan.
dv = (np.float64(self.norm.vmax) -
/usr/local/lib/python3.7/dist-packages/matplotlib/image.py:398: UserWarning: Warning: converting a masked element to nan.
np.float64(self.norm.vmin))
/usr/local/lib/python3.7/dist-packages/matplotlib/image.py:405: UserWarning: Warning: converting a masked element to nan.
a_min = np.float64(newmin)
/usr/local/lib/python3.7/dist-packages/matplotlib/image.py:410: UserWarning: Warning: converting a masked element to nan.
a_max = np.float64(newmax)
/usr/local/lib/python3.7/dist-packages/matplotlib/colors.py:933: UserWarning: Warning: converting a masked element to nan.
dtype = np.min_scalar_type(value)
/usr/local/lib/python3.7/dist-packages/numpy/ma/core.py:713: UserWarning: Warning: converting a masked element to nan.
data = np.array(a, copy=False, subok=subok)
The code runs anyway but no figure is produced. I am not sure what's happening here or where I should look. I am using tensorflow 2.4.0.
I am attaching a summary of my code here:
generator = Generator()  # A U-Net defined in tf.keras
gen_learningrate = 5e-4
generator_optimizer = tf.keras.optimizers.Adam(gen_learningrate, beta_1=0.9, beta_2=0.999, epsilon=1e-8)

# Generator loss
def generator_loss(gen_output, target):
    # SSIM loss
    loss = -tf.reduce_mean(tf.image.ssim(target, gen_output, 2))
    return loss

@tf.function
def train_step(input_image, target, epoch):
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        gen_output = generator(input_image, training=True)
        loss = generator_loss(gen_output, target)
    generator_gradients = gen_tape.gradient(loss, generator.trainable_variables)
    generator_optimizer.apply_gradients(zip(generator_gradients,
                                            generator.trainable_variables))
    return loss

def fit(train_ds, epochs, test_ds):
    for input_image, target in train_ds:
        loss = train_step(input_image, target, epoch)

fit(train_dataset, EPOCHS, test_dataset)
I have explored a little and noticed that most people using tf.image.ssim() as a loss function have used tf.train() from TensorFlow 1.x or model.fit() from tf.keras. I suspect the NaN values returned have something to do with the GradientTape() function, but I'm not sure how.
In my experience this warning is typically related to attempting to plot a point with a coordinate at infinity. Of course you should really show us more code for us to help you effectively.
Your network prediction may get so close to the real image that the SSIM computation produces an infinity (and NaN once you perform further operations with it).
Be careful when using it.
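A minimal sketch of one way to guard the SSIM loss against non-finite values before they reach the optimizer or the plots, assuming (as max_val=2 in the question implies) image batches scaled to [-1, 1]; the clipping range is an assumption:

import tensorflow as tf

def generator_loss(gen_output, target):
    # Keep both image batches inside the range that max_val=2 assumes.
    gen_output = tf.clip_by_value(gen_output, -1.0, 1.0)
    target = tf.clip_by_value(target, -1.0, 1.0)

    ssim = tf.image.ssim(target, gen_output, max_val=2.0)
    # Replace any non-finite per-image SSIM values before reducing.
    ssim = tf.where(tf.math.is_finite(ssim), ssim, tf.zeros_like(ssim))
    return -tf.reduce_mean(ssim)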

Is my Keras Convolutional Model returning any values?

For my project I'm mostly following this simple GAN tutorial, except that my data is a time series of 3 values in the range [-1, 1]. I stripped away a lot of its complexity to try to understand where the discrepancy is coming from. However, after lots of trial & error and Stack Overflow searches it's time I raise my hand and ask for help. I'm running Python 3.6 / Conda 4.8.3 in a VSCode Jupyter notebook on OSX with TensorFlow 2.0.0. My simplified discriminator does not return any errors in my notebook.
def build_discriminator():
    discriminator_input = Input(shape=(4000, 3), name='discriminator_input')
    x = discriminator_input
    x = Conv1D(32, 3, strides=1, padding="same", input_shape=(4000, 3))(x)
    x = LeakyReLU()(x)
    x = Dropout(0.3)(x)
    x = Flatten()(x)
    discriminator_output = Dense(1, activation='sigmoid')(x)
    return Model(discriminator_input, discriminator_output)

# Test it with some random noise of the same shape as the training data
d = build_discriminator()
noise = tf.random.uniform(
    (1, 4000, 3), minval=-1, maxval=1, dtype=tf.dtypes.float32
)
decision = d(noise)
Output I'm getting:
print(decision)
<tf.Tensor 'model_1/dense_6/Sigmoid:0' shape=(1, 1) dtype=float32>
I was expecting that putting random noise the same size as a training sample into the untrained discriminator would at least give a value between [0, 1], confirming that the network is processing data.
Expected output:
<tf.Tensor [[0.014325]] shape=(1, 1) dtype=float32>
I need a bit of help interpreting this discrepancy. Does that mean my model isn't processing at all? Or am I missing something more subtle? What do I need to change so that my discriminator returns a tensor of values?
Against recommendations I spent some time removing Keras & TensorFlow from Conda and installing them with pip, so that tf.__version__ correctly returned 2.2.0 in the notebook. To my surprise it worked and returned the expected result.
<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[0.49497133]], dtype=float32)>
Posting here in case anyone else stumbles across this question with the same problem.
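For anyone hitting the same symptom, a quick sketch of how to check whether the notebook is actually running eagerly; the symbolic name in the original output ('model_1/dense_6/Sigmoid:0') is what a non-eager (graph-mode) tensor looks like, while the fixed install returns a numeric eager tensor. The sketch reuses build_discriminator from the question:

import tensorflow as tf

print(tf.__version__)            # confirm the notebook sees the TF you installed
print(tf.executing_eagerly())    # True in a healthy TF 2.x setup

d = build_discriminator()
noise = tf.random.uniform((1, 4000, 3), minval=-1, maxval=1)
decision = d(noise)
print(decision.numpy())          # only works on eager tensors, e.g. [[0.49...]]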

Custom distance loss function in Pytorch?

I want to implement the following distance loss function in PyTorch. I was following this thread from the pytorch forum: https://discuss.pytorch.org/t/custom-loss-functions/29387/4
np.linalg.norm(output - target)
# where output.shape = [1, 2] and target.shape = [1, 2]
So I have implemented the loss function like this
def my_loss(output, target):
    loss = torch.tensor(np.linalg.norm(output.detach().numpy() - target.detach().numpy()))
    return loss
With this loss function, calling backward gives a runtime error:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
My entire code looks like this
model = nn.Linear(2, 2)
x = torch.randn(1, 2)
target = torch.randn(1, 2)
output = model(x)
loss = my_loss(output, target)
loss.backward()  # <----- Error here
print(model.weight.grad)
PS: I am aware of the pairwise loss of pytorch but due to some limitation of it, I have to implement it myself.
Following the pytorch source code, I have tried the following:
class my_function(torch.nn.Module):  # forgot to define backward()
    def forward(self, output, target):
        loss = torch.tensor(np.linalg.norm(output.detach().numpy() - target.detach().numpy()))
        return loss
model = nn.Linear(2, 2)
x = torch.randn(1, 2)
target = torch.randn(1, 2)
output = model(x)
criterion = my_function()
loss = criterion(output, target)
loss.backward()
print(model.weight.grad)
And I get the same runtime error:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
How can I implement the loss function correctly?
This happens because you are detaching tensors in the loss function. You had to detach because you wanted to use np.linalg.norm, but that breaks the graph, and you get the error that the tensor does not have a grad_fn.
You can replace
loss = torch.tensor(np.linalg.norm(output.detach().numpy() - target.detach().numpy()))
with the equivalent torch operation
loss = torch.norm(output - target)
This should work fine.
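Putting it together, a minimal runnable version of the snippet from the question with the detach-free loss:

import torch
from torch import nn

def my_loss(output, target):
    # Euclidean (Frobenius) norm, like np.linalg.norm but differentiable.
    return torch.norm(output - target)

model = nn.Linear(2, 2)
x = torch.randn(1, 2)
target = torch.randn(1, 2)

output = model(x)
loss = my_loss(output, target)
loss.backward()
print(model.weight.grad)  # now populated instead of raising a RuntimeError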
