Gradients vanishing despite using Kaiming initialization - pytorch

I was implementing a conv block in PyTorch with a PReLU activation function. I used Kaiming initialization for all the weights and set all the biases to zero. However, when I tested these blocks (by stacking 100 such conv and activation blocks on top of each other), I noticed that the output values are of the order of 10^(-10). Is this normal, considering I am stacking up to 100 layers? Adding a small bias to each layer fixes the problem, but in Kaiming initialization the biases are supposed to be zero.
Here is the conv block code:
from collections.abc import Iterable  # `collections.Iterable` was removed in Python 3.10
import torch.nn as nn
# same_padding(kernel_size) is a helper from the asker's code that is not shown here.
def convBlock(
    input_channels, output_channels, kernel_size=3, padding=None, activation="prelu"
):
    """
    Initializes a conv block using Kaiming Initialization
    """
    padding_par = 0
    if padding == "same":
        padding_par = same_padding(kernel_size)
    conv = nn.Conv2d(input_channels, output_channels, kernel_size, padding=padding_par)
    relu_negative_slope = 0.25
    act = None
    if activation == "prelu" or activation == "leaky_relu":
        nn.init.kaiming_normal_(conv.weight, a=relu_negative_slope, mode="fan_in")
        if activation == "prelu":
            act = nn.PReLU(init=relu_negative_slope)
        else:
            act = nn.LeakyReLU(negative_slope=relu_negative_slope)
    if activation == "relu":
        nn.init.kaiming_normal_(conv.weight, nonlinearity="relu")
        act = nn.ReLU()
    nn.init.constant_(conv.bias.data, 0)
    block = nn.Sequential(conv, act)
    return block
def flatten(lis):
    # Recursively yield the leaf items of nested iterables (nn.Sequential is iterable too)
    for item in lis:
        if isinstance(item, Iterable) and not isinstance(item, str):
            for x in flatten(item):
                yield x
        else:
            yield item
def Sequential(args):
    flattened_args = list(flatten(args))
    return nn.Sequential(*flattened_args)
This is the test code:
import numpy as np
import torch
ls = []
for i in range(100):
    ls.append(convBlock(3, 3, 3, "same"))
model = Sequential(ls)
test = np.ones((1, 3, 5, 5))
model(torch.Tensor(test))
And the output I am getting is
tensor([[[[-1.7771e-10, -3.5088e-10, 5.9369e-09, 4.2668e-09, 9.8803e-10],
[ 1.8657e-09, -4.0271e-10, 3.1189e-09, 1.5117e-09, 6.6546e-09],
[ 2.4237e-09, -6.2249e-10, -5.7327e-10, 4.2867e-09, 6.0034e-09],
[-1.8757e-10, 5.5446e-09, 1.7641e-09, 5.7018e-09, 6.4347e-09],
[ 1.2352e-09, -3.4732e-10, 4.1553e-10, -1.2996e-09, 3.8971e-09]],
[[ 2.6607e-09, 1.7756e-09, -1.0923e-09, -1.4272e-09, -1.1840e-09],
[ 2.0668e-10, -1.8130e-09, -2.3864e-09, -1.7061e-09, -1.7147e-10],
[-6.7161e-10, -1.3440e-09, -6.3196e-10, -8.7677e-10, -1.4851e-09],
[ 3.1475e-09, -1.6574e-09, -3.4180e-09, -3.5224e-09, -2.6642e-09],
[-1.9703e-09, -3.2277e-09, -2.4733e-09, -2.3707e-09, -8.7598e-10]],
[[ 3.5573e-09, 7.8113e-09, 6.8232e-09, 1.2285e-09, -9.3973e-10],
[ 6.6368e-09, 8.2877e-09, 9.2108e-10, 9.7531e-10, 7.0011e-10],
[ 6.6954e-09, 9.1019e-09, 1.5128e-08, 3.3151e-09, 2.1899e-10],
[ 1.2152e-08, 7.7002e-09, 1.6406e-08, 1.4948e-08, -6.0882e-10],
[ 6.9930e-09, 7.3222e-09, -7.4308e-10, 5.2505e-09, 3.4365e-09]]]],
grad_fn=<PreluBackward>)

Amazing question (and welcome to StackOverflow)! Research paper for quick reference.
TLDR
Try wider networks (64 channels)
Add Batch Normalization after activation (or even before, shouldn't make much difference)
Add residual connections (shouldn't improve much over batch norm, last resort)
Please try these in this order and leave a comment saying which of them (if any) worked in your case, as I'm curious too.
Things you do differently
Your neural network is very deep, yet very narrow (only 81 weights per layer!).
Because of that, one cannot reliably draw those weights from a normal distribution; the sample is just too small.
Try wider networks, 64 channels or more.
You are trying a much deeper network than the paper's authors did.
Section: Comparison Experiments
We conducted comparisons on a deep but efficient model with 14 weight
layers (actually 22 was also tested in comparison with Xavier)
That was due to the paper's release date (2015) and the hardware limitations of the time.
Is this normal?
The approach itself is quite unusual for a network of this depth, at least by current practice:
each conv block is usually followed by an activation like ReLU and by Batch Normalization (which normalizes the signal and helps with exploding/vanishing signals);
networks of this depth (even half as deep as yours) usually also use residual connections (though this is not directly linked to vanishing/small signals; it has more to do with the degradation problem of very deep networks, like 1000 layers).
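To make the 81-per-layer figure above concrete, here is a small sketch of the arithmetic behind Kaiming initialization for the question's layers (3 input channels, 3 output channels, 3x3 kernel, negative slope 0.25). The helper names are hypothetical; the formulas are the usual fan-in rule, std = gain / sqrt(fan_in), with gain = sqrt(2 / (1 + a^2)) for leaky-ReLU-style activations.

```python
import math
import random
import statistics


def conv_weight_count(in_channels, out_channels, kernel_size):
    """Number of weights in a square 2D conv kernel (bias excluded)."""
    return in_channels * out_channels * kernel_size * kernel_size


def kaiming_std(in_channels, kernel_size, negative_slope=0.0):
    """Target std of Kaiming-normal weights in fan_in mode."""
    fan_in = in_channels * kernel_size * kernel_size
    gain = math.sqrt(2.0 / (1.0 + negative_slope ** 2))
    return gain / math.sqrt(fan_in)


n_weights = conv_weight_count(3, 3, 3)
target_std = kaiming_std(3, 3, negative_slope=0.25)

# Draw one layer's worth of weights and compare the sample std with the target.
rng = random.Random(0)
sample = [rng.gauss(0.0, target_std) for _ in range(n_weights)]
sample_std = statistics.pstdev(sample)

print(n_weights, round(target_std, 4), round(sample_std, 4))
```

With so few draws per layer, the sample std can miss the target by several percent, and across 100 stacked layers such per-layer deviations multiply. Widening the layers shrinks that sampling error.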

Related

Numbers of hidden layers and units in AutoKeras dense block

I am training a model with Autokeras. So far my best model is this:
structured_data_block_1/normalize:false
structured_data_block_1/dense_block_1/use_batchnorm:true
structured_data_block_1/dense_block_1/num_layers:2
structured_data_block_1/dense_block_1/units_0:32
structured_data_block_1/dense_block_1/dropout:0
structured_data_block_1/dense_block_1/units_1:32
dense_block_2/use_batchnorm:true
dense_block_2/num_layers:2
dense_block_2/units_0:128
dense_block_2/dropout:0
dense_block_2/units_1:16
dense_block_3/use_batchnorm:false
dense_block_3/num_layers:1
dense_block_3/units_0:32
dense_block_3/dropout:0
dense_block_3/units_1:32
regression_head_1/dropout:0
optimizer:"adam"
learning_rate:0.1
dense_block_2/units_2:32
structured_data_block_1/dense_block_1/units_2:256
dense_block_3/units_2:128
My first dense_block_1 has 2 layers (num_layers:2), so how can it have three units entries? It says units_0: 32, units_1: 32 and units_2: 256, which implies to me that there are three layers, so why is num_layers:2?
If I wanted to recreate the above model in this code, how would I do it properly?
input_node = ak.StructuredDataInput()
output_node = ak.StructuredDataBlock(categorical_encoding=False, normalize=False)(input_node)
output_node = ak.DenseBlock()(output_node)
output_node = ak.DenseBlock()(output_node)
output_node = ak.RegressionHead()(output_node)
Thanks for any input.

Coupling of Different Blocks in a UNET

I am starting to work with neural networks using Keras. I am trying to adapt the model (a UNet-like architecture) given by Sim, Oh, Kim, Jung in "Optimal Transport driven CycleGAN for Unsupervised Learning in Inverse Problems" (Fig. 10).
def def_generator(image_shape=(256,256,3)):
    init= RandomNormal(stddev=0.02)
    #Start 1st Block
    in_image = Input(shape=image_shape)
    g1=Conv2D(64,(3,3))(in_image)
    g1=InstanceNormalization(axis=-1)(g1)
    g1=LeakyReLU(alpha=0.2)(g1)
    g1=Conv2D(64,(3,3))(g1)
    g1=InstanceNormalization(axis=-1)(g1)
    g1=LeakyReLU(alpha=0.2)(g1)
    #End of 1st Block
    #Start of 2nd Block
    g2=MaxPool2D()(g1)
    g2=Conv2D(128,(3,3))(g2)
    g2=InstanceNormalization(axis=-1)(g2)
    g2=LeakyReLU(alpha=0.2)(g2)
    g2=Conv2D(128,(3,3))(g2)
    g2=InstanceNormalization(axis=-1)(g2)
    g2=LeakyReLU(alpha=0.2)(g2)
    #End of 2nd Block
    #Start of 3rd Block
    g3=MaxPool2D()(g2)
    g3=Conv2D(256,(3,3))(g3)
    g3=InstanceNormalization(axis=-1)(g3)
    g3=LeakyReLU(alpha=0.2)(g3)
    g3=Conv2D(256,(3,3))(g3)
    g3=InstanceNormalization(axis=-1)(g3)
    g3=LeakyReLU(alpha=0.2)(g3)
    #End of 3rd Block
    #Start of 4th block
    g4=MaxPool2D()(g3)
    g4=Conv2D(512,(3,3))(g4)
    g4=InstanceNormalization(axis=-1)(g4)
    g4=LeakyReLU(alpha=0.2)(g4)
    g4=Conv2D(512,(3,3))(g4)
    g4=InstanceNormalization(axis=-1)(g4)
    g4=LeakyReLU(alpha=0.2)(g4)
    g4=Conv2D(256,(3,3))(g4)
    g4=InstanceNormalization(axis=-1)(g4)
    g4=LeakyReLU(alpha=0.2)(g4)
    g4=Conv2DTranspose(256,(2,2),strides=(4,4),output_padding=1)(g4)
    #End of 4th Block
    #Start of 5th Block
    g5input=Concatenate()([g4,g3])
    g5=Conv2D(256,(3,3))(g5input)
    g5=InstanceNormalization(axis=-1)(g5)
    g5=LeakyReLU(alpha=0.2)(g5)
    g5=Conv2D(256,(3,3))(g5)
    g5=InstanceNormalization(axis=-1)(g5)
    g5=LeakyReLU(alpha=0.2)(g5)
    g5=Conv2DTranspose(128,(2,2),strides=(3,3), padding='same', output_padding=0)(g5)
    #End of 5th Block
    #Start of 6th block
    g6input=Concatenate()([g5,g2])
    g6=Conv2D(128,(2,2))(g6input)
    g6=InstanceNormalization(axis=-1)(g6)
    g6=LeakyReLU(alpha=0.2)(g6)
    g6=Conv2D(128,(2,2))(g6)
    g6=InstanceNormalization(axis=-1)(g6)
    g6=LeakyReLU(alpha=0.2)(g6)
    g6=Conv2DTranspose(64,(2,2),strides=(2,2), padding='valid', output_padding=1)(g6)
    #End of 6th Block
    #Start of 7th block
    g7input=Concatenate()([g6,g1])
    g7=Conv2D(64,(2,2))(g7input)
    g7=InstanceNormalization(axis=-1)(g7)
    g7=LeakyReLU(alpha=0.2)(g7)
    g7=Conv2D(64,(2,2))(g7)
    g7=InstanceNormalization(axis=-1)(g7)
    g7=LeakyReLU(alpha=0.2)(g7)
    g7=Conv2DTranspose(1,(1,1))(g7)
    model=Model(in_image, g5)
    model.compile(loss='mse', optimizer=Adam(lr=2e-4,beta_1=0.5), loss_weights=[0.5], metrics=['accuracy'])
    return model
g=def_generator((120,120,1))
print(g.summary())
I always run into the problem that the dimensions of the layers to be concatenated are not compatible.
I understand that this issue results from the preceding MaxPooling + Conv2D steps.
I am now wondering whether there is a trick or strategy to avoid or reduce this issue.
Any help will be appreciated.
Best wishes
Michael
The problem is simple: you are concatenating layers of different sizes. This happens because you are running the network on images whose side lengths are not divisible by a large enough power of 2. When you max-pool an image whose side is not divisible by 2 you lose a pixel (243x243 -> 121x121), and when you double it with the transpose you get a different size (121x121 -> 242x242). The concatenation then fails because 242 differs from 243 (at least this is what I think; you should have shared the error).
This means that whenever an image reaches a max-pooling layer, its side length needs to be divisible by 2.
So, the solution:
having 4 blocks means that the image sides need to be divisible by at least 16, otherwise it will not work.
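The size bookkeeping in this answer can be checked with a few lines of plain Python before building the model. This is only a sketch with hypothetical helper names; it assumes 2x2 max pooling with stride 2 (the Keras MaxPool2D default) and an upsampling step that exactly doubles the size.

```python
def pooled(size, pool=2):
    """Side length after max pooling with 'valid' padding: floor division."""
    return size // pool


def doubled(size):
    """Side length after an upsampling step that doubles the size."""
    return size * 2


def skip_sizes_match(size, n_pools):
    """True if every decoder size equals the matching encoder size."""
    encoder = [size]
    for _ in range(n_pools):
        encoder.append(pooled(encoder[-1]))
    current = encoder[-1]
    for skip in reversed(encoder[:-1]):
        current = doubled(current)
        if current != skip:
            return False
    return True


def next_valid_size(size, n_pools):
    """Smallest size >= `size` that is divisible by 2 ** n_pools."""
    factor = 2 ** n_pools
    return -(-size // factor) * factor


print(pooled(243), doubled(pooled(243)))   # 121 242
print(skip_sizes_match(243, 4))            # False
print(skip_sizes_match(240, 4))            # True
print(next_valid_size(243, 4))             # 256
```

Note that the model in the question also uses Conv2D with its default padding='valid', so each 3x3 convolution additionally shortens each side by 2 pixels; passing padding='same' keeps the sizes in line with the arithmetic above.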

Computing gradient twice for two different losses in Pytorch

I want to compute the gradients twice for two different losses in the same iteration.
Code:
batch_output0, batch_output1 = get_output_from_model(model=model,
                                                     data=batch[0])
train_loss0 = loss_fun0(batch_output0, batch_labels0.float().view(-1, 1))
train_loss0.backward()
grad0_conv_w = model.conv1.conv1.weight.grad
batch_output0, batch_output1 = get_output_from_model(model=model,
                                                     data=batch[0])
train_loss1 = loss_fun1(batch_output1, batch_labels1.float().view(-1, 1))
train_loss1.backward()
grad1_conv_w = model.conv1.conv1.weight.grad
Outputs:
train_loss0: tensor(0.6950, grad_fn=<BinaryCrossEntropyBackward>)
train_loss1: tensor(25.5431, grad_fn=<MseLossBackward>)
Grad0: tensor([-2.4883e-05, 3.7842e-05, 1.2635e-04, ..., -1.6413e-04,
-1.8419e-04, -1.7884e-04])
Grad1: tensor([-2.4883e-05, 3.7842e-05, 1.2635e-04, ..., -1.6413e-04,
-1.8419e-04, -1.7884e-04])
You may note that even though the two losses are quite different, the gradients for the corresponding losses are exactly the same.
Please help me to diagnose the problem.
Thank you.
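One likely cause, sketched here under assumptions since the model itself is not shown: `.grad` is accumulated across `backward()` calls, and `weight.grad` returns a reference to that one tensor rather than a copy, so `grad0_conv_w` and `grad1_conv_w` would end up showing the same (accumulated) values. The toy parameter and losses below are hypothetical stand-ins; the pattern is to clone the gradient and reset it between the two backward passes.

```python
import torch

# Hypothetical stand-in for model.conv1.conv1.weight
w = torch.nn.Parameter(torch.tensor([1.0, 2.0]))
x = torch.tensor([3.0, 4.0])

loss0 = (w * x).sum()          # d(loss0)/dw = x
loss0.backward()
grad0 = w.grad.clone()         # snapshot; plain `w.grad` would be a reference

w.grad = None                  # reset so the next backward does not accumulate

loss1 = (w ** 2).sum()         # d(loss1)/dw = 2 * w
loss1.backward()
grad1 = w.grad.clone()

print(grad0)  # tensor([3., 4.])
print(grad1)  # tensor([2., 4.])
```

With a real model, calling `model.zero_grad()` or `optimizer.zero_grad()` between the two passes plays the role of `w.grad = None`.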

Confusion About Implementing LeafSystem With Vector Output Port Correctly

I'm a student teaching myself Drake, specifically pydrake with Dr. Russ Tedrake's excellent Underactuated Robotics course. I am trying to write a combined energy shaping and lqr controller for keeping a cartpole system balanced upright. I based the diagram on the cartpole example found in Chapter 3 of Underactuated Robotics [http://underactuated.mit.edu/acrobot.html], and the SwingUpAndBalanceController on Chapter 2: [http://underactuated.mit.edu/pend.html].
I have found that, because I use the cart_pole.sdf model, I have to create an abstract input port to receive the FramePoseVector from cart_pole.get_output_port(0). From there I know that I have to create a control signal output of type BasicVector to feed into a Saturation block before feeding into the cartpole's actuation port.
The problem I'm encountering right now is that I'm not sure how to get the system's current state data in the DeclareVectorOutputPort's callback function. I was under the assumption I would use the LeafContext parameter in the callback function, OutputControlSignal, obtaining the BasicVector continuous state vector. However, this resulting vector, x_bar is always NaN. Out of desperation (and testing to make sure the rest of my program worked) I set x_bar to the controller's initialization cart_pole_context and have found that the simulation runs with a control signal of 0.0 (as expected). I can also set output to 100 and the cartpole simulation just flies off into endless space (as expected).
TL;DR: What is the proper way to obtain the continuous state vector in a custom controller extending LeafSystem with a DeclareVectorOutputPort?
Thank you for any help! I really appreciate it :) I've been teaching myself so it's been a little arduous haha.
# Combined Energy Shaping (SwingUp) and LQR (Balance) Controller
# with a simple state machine
class SwingUpAndBalanceController(LeafSystem):
    def __init__(self, cart_pole, cart_pole_context, input_i, ouput_i, Q, R, x_star):
        LeafSystem.__init__(self)
        self.DeclareAbstractInputPort("state_input", AbstractValue.Make(FramePoseVector()))
        self.DeclareVectorOutputPort("control_signal", BasicVector(1),
                                     self.OutputControlSignal)
        (self.K, self.S) = BalancingLQRCtrlr(cart_pole, cart_pole_context,
                                             input_i, ouput_i, Q, R, x_star).get_LQR_matrices()
        (self.A, self.B, self.C, self.D) = BalancingLQRCtrlr(cart_pole, cart_pole_context,
                                                             input_i, ouput_i,
                                                             Q, R, x_star).get_lin_matrices()
        self.energy_shaping = EnergyShapingCtrlr(cart_pole, x_star)
        self.energy_shaping_context = self.energy_shaping.CreateDefaultContext()
        self.cart_pole_context = cart_pole_context
    def OutputControlSignal(self, context, output):
        #xbar = copy(self.cart_pole_context.get_continuous_state_vector())
        xbar = copy(context.get_continuous_state_vector())
        xbar_ = np.array([xbar[0], xbar[1], xbar[2], xbar[3]])
        xbar_[1] = wrap_to(xbar_[1], 0, 2.0*np.pi) - np.pi
        # If x'Sx <= 2, then use LQR ctrlr. Cost-to-go J_star = x^T * S * x
        threshold = np.array([2.0])
        if (xbar_.dot(self.S.dot(xbar_)) < 2.0):
            #output[:] = -self.K.dot(xbar_) # u = -Kx
            output.set_value(-self.K.dot(xbar_))
        else:
            self.energy_shaping.get_input_port(0).FixValue(self.energy_shaping_context,
                                                           self.cart_pole_context.get_continuous_state_vector())
            output_val = self.energy_shaping.get_output_port(0).Eval(self.energy_shaping_context)
            output.set_value(output_val)
        print(output)
Here are two things that might help:
If you want to get the state of the cart-pole from MultibodyPlant, you probably want to be connecting to the continuous_state output port, which gives you a normal vector instead of the abstract-type FramePoseVector. In that case, your call to get_input_port().Eval(context) should work just fine.
If you do really want to read the FramePoseVector, then you have to evaluate the input port slightly differently. You can find an example of that here.

Feature extraction in loop seems to cause memory leak in pytorch

I have spent considerable time trying to debug some PyTorch code, and I have created a minimal example of it to help pin down what the issue might be.
I have removed all portions of the code that are unrelated to the issue, so the remaining piece won't make much sense from a functional standpoint, but it still displays the error I'm facing.
The overall task I'm working on is in a loop and every pass of the loop is computing the embedding of the image and adding it to a variable storing it. It's effectively aggregating it (not concatenating, so the size remains the same). I don't expect the number of iterations to force the datatype to overflow, I don't see this happening here nor in my code.
I have added multiple metrics to evaluate the size of the tensors I'm working with to make sure they're not growing in memory footprint
I'm checking the overall GPU memory usage to verify the issue leading to the final RuntimeError: CUDA out of memory.
My environment is as follows:
- python 3.6.2
- Pytorch 1.4.0
- Cudatoolkit 10.0
- Driver version 410.78
- GPU: Nvidia GeForce GT 1030 (2GB VRAM)
(though I've replicated this experiment with the same result on a Titan RTX with 24GB,
same pytorch version and cuda toolkit and driver, it only goes out of memory further in the loop).
Complete code below. I have marked 2 lines as culprits, as deleting them removes the issue, though obviously I need to find a way to execute them without having memory issues. Any help would be much appreciated! You may try with any image named "source_image.bmp" to replicate the issue.
import torch
from PIL import Image
import torchvision
from torchvision import transforms
from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit
import sys
import os
os.environ["CUDA_VISIBLE_DEVICES"]='0' # this is necessary on my system to allow the environment to recognize my nvidia GPU for some reason
os.environ['CUDA_LAUNCH_BLOCKING'] = '1' # to debug by having all CUDA functions executed in place
torch.set_default_tensor_type('torch.cuda.FloatTensor')
# Preprocess image
tfms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),])
img = tfms(Image.open('source_image.bmp')).unsqueeze(0).cuda()
model = torchvision.models.resnet50(pretrained=True).cuda()
model.eval() # we put the model in evaluation mode, to prevent storage of gradient which might accumulate
nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)
info = nvmlDeviceGetMemoryInfo(h)
print(f'Total available memory : {info.total / 1000000000}')
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])
orig_embedding = feature_extractor(img)
embedding_depth = 2048
mem0 = 0
embedding = torch.zeros(2048, img.shape[2], img.shape[3]) #, dtype=torch.float)
patch_size=[4,4]
patch_stride=[2,2]
patch_value=0.0
# Here, we iterate over the patch placement, defined at the top left location
for row in range(img.shape[2]-1):
    for col in range(img.shape[3]-1):
        print("######################################################")
        ######################################################
        # Isolated line, culprit 1 of the GPU memory leak
        ######################################################
        patched_embedding = feature_extractor(img)
        delta_embedding = (patched_embedding - orig_embedding).view(-1, 1, 1)
        ######################################################
        # Isolated line, culprit 2 of the GPU memory leak
        ######################################################
        embedding[:,row:row+1,col:col+1] = torch.add(embedding[:,row:row+1,col:col+1], delta_embedding)
        print("img size:\t\t", img.element_size() * img.nelement())
        print("patched_embedding size:\t", patched_embedding.element_size() * patched_embedding.nelement())
        print("delta_embedding size:\t", delta_embedding.element_size() * delta_embedding.nelement())
        print("Embedding size:\t\t", embedding.element_size() * embedding.nelement())
        del patched_embedding, delta_embedding
        torch.cuda.empty_cache()
        info = nvmlDeviceGetMemoryInfo(h)
        print("\nMem usage increase:\t", info.used / 1000000000 - mem0)
        mem0 = info.used / 1000000000
        print(f'Free:\t\t\t {(info.total - info.used) / 1000000000}')
print("Done.")
Add this to your code right after you load the model:
for param in model.parameters():
    param.requires_grad = False
Source: https://pytorch.org/docs/stable/notes/autograd.html#excluding-subgraphs-from-backward
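An alternative that is often used for pure feature extraction is to run the forward passes under `torch.no_grad()`, so that no autograd graph is recorded at all. The sketch below uses a tiny hypothetical stand-in model rather than the ResNet-50 from the question, and runs on the CPU.

```python
import torch

# Hypothetical stand-in for the feature extractor in the question
feature_extractor = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
)
feature_extractor.eval()

img = torch.rand(1, 3, 16, 16)
embedding = torch.zeros(8, 4, 4)

# Nothing computed inside this block is tracked by autograd.
with torch.no_grad():
    orig_embedding = feature_extractor(img)
    for row in range(4):
        for col in range(4):
            patched_embedding = feature_extractor(img)
            delta = (patched_embedding - orig_embedding).view(-1, 1, 1)
            embedding[:, row:row + 1, col:col + 1] += delta

print(orig_embedding.requires_grad, embedding.requires_grad)  # False False
```

In the original loop, each in-place update of `embedding` with a tensor that requires grad extends the autograd history, which would explain why memory grows with every iteration.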

Resources