I am using this model to get depth maps from images:
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D, Flatten, Dense, Reshape
from keras.models import Model
from keras.optimizers import Adam

def get_model(learning_rate=0.001, channels=2):
    h = 128  # height of the image
    w = 128  # width of the image
    c = channels  # number of channels
    encoding_size = 512
    # encoder
    image = Input(shape=(c, h, w))
    conv_1_1 = Conv2D(32, (3, 3), activation='relu', padding='same')(image)
    conv_1_2 = Conv2D(32, (3, 3), activation='relu', padding='same')(conv_1_1)
    pool_1_2 = MaxPooling2D((2, 2))(conv_1_2)
    conv_2_1 = Conv2D(64, (3, 3), activation='relu', padding='same')(pool_1_2)
    conv_2_2 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv_2_1)
    pool_2_2 = MaxPooling2D((2, 2))(conv_2_2)
    conv_3_1 = Conv2D(128, (3, 3), activation='relu', padding='same')(pool_2_2)
    conv_3_2 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv_3_1)
    # pool_3_2 = MaxPooling2D((2, 2))(conv_3_2)
    # conv_4_1 = Conv2D(256, (3, 3), activation='relu', padding='same')(pool_3_2)
    # conv_4_2 = Conv2D(256, (3, 3), activation='relu', padding='same')(conv_4_1)
    # pool_4_3 = MaxPooling2D((2, 2))(conv_4_2)
    # conv_5_1 = Conv2D(512, (3, 3), activation='relu', padding='same')(pool_4_3)
    # conv_5_2 = Conv2D(512, (3, 3), activation='relu', padding='same')(conv_5_1)
    flat_5_2 = Flatten()(conv_3_2)
    encoding = Dense(encoding_size, activation='tanh')(flat_5_2)
    # decoder
    reshaped_6_1 = Reshape((8, 8, 8))(encoding)
    conv_6_1 = Conv2D(128, (3, 3), activation='relu', padding='same')(reshaped_6_1)
    conv_6_2 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv_6_1)
    upsample_6_2 = UpSampling2D((2, 2))(conv_6_2)
    conv_7_1 = Conv2D(64, (3, 3), activation='relu', padding='same')(upsample_6_2)
    conv_7_2 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv_7_1)
    upsample_7_2 = UpSampling2D((2, 2))(conv_7_2)
    conv_8_1 = Conv2D(32, (3, 3), activation='relu', padding='same')(upsample_7_2)
    conv_8_2 = Conv2D(32, (3, 3), activation='relu', padding='same')(conv_8_1)
    upsample_8_2 = UpSampling2D((2, 2))(conv_8_2)
    conv_9_1 = Conv2D(16, (3, 3), activation='relu', padding='same')(upsample_8_2)
    conv_9_2 = Conv2D(16, (3, 3), activation='relu', padding='same')(conv_9_1)
    upsample_9_2 = UpSampling2D((2, 2))(conv_9_2)
    conv_10_1 = Conv2D(8, (3, 3), activation='relu', padding='same')(upsample_9_2)
    conv_10_2 = Conv2D(1, (3, 3), activation='relu', padding='same')(conv_10_1)
    output = Conv2D(1, (1, 1), activation=relu_normalized, padding='same')(conv_10_2)
    model = Model(inputs=image, outputs=output)
    model.compile(loss='mae', optimizer=Adam(learning_rate))
    return model
Input: 2x128x128 (two b/w images) - squashed to [0, 1] (preprocessing normalization)
Output: 1x128x128 (depth map) - squashed to [0, 1] by relu_normalized
NOTE: relu_normalized is just relu followed by squashing the values to 0-1 so as to have a proper image. Sigmoid doesn't seem to fit this criterion.
When I add any more layers, the loss becomes constant and backprop stops working properly, because both the outputs and the gradients become zero (and hence changing the learning rate didn't change anything in the network).
So if I want to go deeper to generalize more, by uncommenting the lines above (and of course connecting conv_5_2 to flat_5_2), what am I missing?
My thoughts:
Using sigmoid would lead to the vanishing gradient problem, but since I am using ReLUs, would that problem still exist?
Changing anything in the network, like the encoding size, or even changing the activations to ELU or SELU, doesn't show any progress.
Why are my outputs getting closer to zero when I try to add even one more conv layer followed by max pooling?
UPDATE:
Here's relu_normalized,
from keras import backend as K
from keras.activations import relu

def relu_normalized(x):
    epsilon = 1e-6
    relu_x = relu(x)
    relu_scaled_x = relu_x / (K.max(relu_x) + epsilon)
    return relu_scaled_x
and later, after getting the output, which has range [0, 1], we simply do output_image = 255 * output and can save this as a b/w image.
If you want to go deeper you have to add some batch normalization layers (in Keras: https://keras.io/layers/normalization/#batchnormalization); a sketch is given after the quote below.
From Ian Goodfellow's book, on the batch normalization chapter:
Very deep models involve the composition of several functions or layers. The gradient tells how to update each parameter, under the assumption that the other layers do not change. In practice, we update all of the layers simultaneously. When we make the update, unexpected results can happen because many functions composed together are changed simultaneously, using updates that were computed under the assumption that the other functions remain constant.
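As a concrete illustration, here is a minimal sketch of how BatchNormalization could be slotted into the encoder blocks of a model like the questioner's; the channels_last input shape and the exact placement of the normalization layers are assumptions, not part of the original code:
from keras.layers import Input, Conv2D, MaxPooling2D, BatchNormalization

image = Input(shape=(128, 128, 2))
x = Conv2D(32, (3, 3), activation='relu', padding='same')(image)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)  # re-centre and re-scale activations before downsampling
x = MaxPooling2D((2, 2))(x)
x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPooling2D((2, 2))(x)
# ...continue the deeper blocks (128, 256, 512 filters) in the same pattern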
Also, tanh saturates easily, so use it only if you need it :)
There is a problem that can happen with "relu" when your learning rate is too big.
There is a high chance of all activations going to 0 and getting stuck there, never to change anymore (when they're at 0, their gradient is also 0).
Since I'm not an expert on fine-tuning the parameters for "relu", and my results with "relu" are always bad, I prefer using "sigmoid" or "tanh" (it's worth trying, although there might be some vanishing there...). I keep my images in the range 0 to 1 and use "binary_crossentropy" as the loss, which is a lot faster than "mae"/"mse" in this case.
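A minimal sketch of that alternative applied to the depth-map model above; it reuses conv_10_2, image and learning_rate from the questioner's get_model, and this placement is an assumption rather than the questioner's actual code:
# Sketch: sigmoid output in [0, 1] with binary cross-entropy, replacing the
# relu_normalized output layer and the 'mae' loss of the original model.
output = Conv2D(1, (1, 1), activation='sigmoid', padding='same')(conv_10_2)
model = Model(inputs=image, outputs=output)
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate))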
Another thing that happened to me was an "apparently" frozen loss function: the value was changing so little that the displayed decimals weren't enough to see the variation, but after a lot of epochs it found a reasonable way to go down properly. (Probably some kind of saturation indeed, but for me it's still better than getting freezes or NaNs.)
You can introduce recurrent layers like LSTM which would "trap" the errors using gating, potentially improving the situation.
Related
I am trying to solve a use case of handwritten text recognition. I have used a CNN and an LSTM to create a network. The output of this needs to be fed to a CTC layer. I could find some code to do this in native TensorFlow. Is there an easier option for this in Keras?
model = Sequential()
model.add(Conv2D(64, kernel_size=(5,5),activation = 'relu', input_shape=(128,32,1), padding='same', data_format='channels_last'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Conv2D(128, kernel_size=(5,5),activation = 'relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Conv2D(256, kernel_size=(5,5),activation = 'relu', padding='same'))
model.add(Conv2D(256, kernel_size=(5,5),activation = 'relu', padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(1,2),padding='same'))
model.add(Conv2D(512, kernel_size=(5,5),activation = 'relu', padding='same'))
model.add(BatchNormalization())
model.add(Conv2D(512, kernel_size=(5,5),activation = 'relu', padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2), strides=(1,2),padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(1,1)))
model.add(Conv2D(512, kernel_size=(5,5),activation = 'relu', padding='same'))
model.add(Lambda(lambda x: x[:, :, 0, :], output_shape=(None,31,512), mask=None, arguments=None))
#model.add(Bidirectional(LSTM(256, return_sequences=True), input_shape=(31, 256)))
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(Dense(75, activation = 'softmax'))
Any help on how we can easily add CTC loss and decoding layers to this would be great.
A CTC loss function requires four arguments to compute the loss: the predicted outputs, the ground-truth labels, the input sequence length to the LSTM, and the ground-truth label length. To get this we need to create a custom loss function and then pass it to the model. To make it compatible with your defined model, we need to create a model which takes these four inputs and outputs the loss. This model will be used for training; for testing, the model that you created earlier can be used.
Let's rebuild the Keras model you used in a slightly different way, so that we can create two different versions of the model to be used at training and testing time.
# input with shape of height=32 and width=128
inputs = Input(shape=(32, 128, 1))
# convolution layer with kernel size (3,3)
conv_1 = Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
# pooling layer with kernel size (2,2)
pool_1 = MaxPool2D(pool_size=(2, 2), strides=2)(conv_1)
conv_2 = Conv2D(128, (3, 3), activation='relu', padding='same')(pool_1)
pool_2 = MaxPool2D(pool_size=(2, 2), strides=2)(conv_2)
conv_3 = Conv2D(256, (3, 3), activation='relu', padding='same')(pool_2)
conv_4 = Conv2D(256, (3, 3), activation='relu', padding='same')(conv_3)
# pooling layer with kernel size (2,1)
pool_4 = MaxPool2D(pool_size=(2, 1))(conv_4)
conv_5 = Conv2D(512, (3, 3), activation='relu', padding='same')(pool_4)
# Batch normalization layer
batch_norm_5 = BatchNormalization()(conv_5)
conv_6 = Conv2D(512, (3, 3), activation='relu', padding='same')(batch_norm_5)
batch_norm_6 = BatchNormalization()(conv_6)
pool_6 = MaxPool2D(pool_size=(2, 1))(batch_norm_6)
conv_7 = Conv2D(512, (2, 2), activation='relu')(pool_6)
squeezed = Lambda(lambda x: K.squeeze(x, 1))(conv_7)
# bidirectional LSTM layers with units=128
blstm_1 = Bidirectional(LSTM(128, return_sequences=True, dropout=0.2))(squeezed)
blstm_2 = Bidirectional(LSTM(128, return_sequences=True, dropout=0.2))(blstm_1)
outputs = Dense(len(char_list) + 1, activation='softmax')(blstm_2)
# model to be used at test time
test_model = Model(inputs, outputs)
We will use the CTC loss function during training. So, let's implement it and create the training model with it:
labels = Input(name='the_labels', shape=[max_label_len], dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')
def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')(
    [outputs, labels, input_length, label_length])
#model to be used at training time
training_model = Model(inputs=[inputs, labels, input_length, label_length], outputs=loss_out)
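Since the Lambda layer already outputs the CTC loss value, the training model is typically compiled with a dummy loss that simply passes that value through; a minimal sketch (the optimizer here is an arbitrary choice):
# The 'ctc' output of the model already is the loss, so the loss function just
# returns y_pred; y_true is a dummy argument supplied by Keras.
training_model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer='adam')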
Train this model and save the weights to a .h5 file.
Now use the test model and load the saved weights of the training model, passing by_name=True so that only the matching layers' weights are loaded.
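A minimal sketch of that save/load step; the file name is illustrative:
# Save the trained weights, then load only the layers whose names match the test
# model (the extra label/length inputs and the CTC Lambda are skipped).
training_model.save_weights('ctc_ocr_weights.h5')
test_model.load_weights('ctc_ocr_weights.h5', by_name=True)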
I have been following the Keras documentation to build a CNN autoencoder: https://blog.keras.io/building-autoencoders-in-keras.html
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model
from keras import backend as K
input_img = Input(shape=(28, 28, 1)) # adapt this if using `channels_first` image data format
x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)
# at this point the representation is (4, 4, 8) i.e. 128-dimensional
x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
I have noticed that it uses Conv2D in its decoding layers instead of Conv2DTranspose, but some other articles explain CNN autoencoders using Conv2DTranspose as a replacement for UpSampling2D and Conv2D. I have seen several questions related to Conv2DTranspose itself, but I haven't found an answer to my question.
My question is: can I use Conv2DTranspose instead of the UpSampling2D and Conv2D layers? If so, why haven't the authors of the Keras documentation used it themselves? Does it make any difference?
Transposed convolutions often produce checkerboard artifacts: small adjacent squares that are easily distinguishable from each other. These make it very easy for humans to recognize fake images from real ones.
You can read this article for more information.
In short, using resize + Conv2D instead of Conv2DTranspose minimizes these checkerboard artifacts.
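A minimal sketch contrasting the two decoder styles; the shapes and filter counts are illustrative and not taken from the blog post:
from keras.layers import Input, Conv2D, Conv2DTranspose, UpSampling2D

feat = Input(shape=(8, 8, 8))  # a small feature map to be upsampled

# Transposed convolution: learnable upsampling, but prone to checkerboard artifacts
# when the kernel size is not divisible by the stride.
up_a = Conv2DTranspose(16, (3, 3), strides=2, padding='same', activation='relu')(feat)

# Resize-then-convolve: fixed nearest-neighbour upsampling followed by an ordinary
# convolution, which tends to avoid the checkerboard pattern.
up_b = UpSampling2D((2, 2))(feat)
up_b = Conv2D(16, (3, 3), padding='same', activation='relu')(up_b)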
I am trying to iterate over the batch one by one to calculate the mean intersection over union, but the fit function shows this:
Error: An operation has None for the gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
Please help, as I am new to Keras.
# y_true shape: (None, 4)
import keras.backend as K

def iou(y_true, y_pred):
    # determine the (x, y)-coordinates of the intersection rectangle
    iou = 0
    for i in range(K.int_shape(y_pred)[0]):
        boxA = y_pred[i]
        boxB = y_true[i]
        xA = K.max(boxA[0], boxB[0])
        yA = K.max(boxA[2], boxB[2])
        xB = K.min(boxA[1], boxB[1])
        yB = K.min(boxA[3], boxB[3])
        interArea = K.max(0, xB - xA + 1) * K.max(0, yB - yA + 1)
        boxAArea = (boxA[1] - boxA[0] + 1) * (boxA[3] - boxA[2] + 1)
        boxBArea = (boxB[1] - boxB[0] + 1) * (boxB[3] - boxB[2] + 1)
        iou += interArea / float(boxAArea + boxBArea - interArea)
    # MEAN
    mean = iou / K.int_shape(y_pred)[0]
    return 1 - mean

model.compile(optimizer='adam', loss=iou, metrics=['accuracy'])
model.fit(x_train, y_train, epochs=20, batch_size=50)
My model works fine with mean squared error as the loss function. Model:
input_shape = (180, 240, 3)
model = Sequential([
Conv2D(32, (3, 3), input_shape=input_shape, padding='same',activation='relu'),
MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
BatchNormalization(),
Conv2D(64, (3, 3), activation='relu', padding='same'),
MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
BatchNormalization(),
Conv2D(128, (3, 3), activation='relu', padding='same',),
Conv2D(256, (3, 3), activation='relu', padding='same',),
MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
Conv2D(128, (3, 3), activation='relu', padding='same',),
MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
BatchNormalization(),
Flatten(),
Dense(4096, activation='relu'),
Dense(4096, activation='relu'),
Dense(4, activation='relu')
])
It means that all operations inside your custom loss function must be differentiable, since otherwise the optimization procedure cannot be executed. To that end, you need to check one by one which operation is the culprit in your code and substitute it with a differentiable Keras backend analogue, or find some other alternative.
Considering the provided code snippet, there are several suggestions to make it work (a vectorized sketch follows this list):
the for-loop should be vectorized
since you are using max(0, ...) in order to get the intersection area, it may happen that it is constantly 0 and no gradient flows, so check that it does not get stuck there
for the mean calculation there is a ready-to-use Keras backend function, K.mean
it is good practice to bound the values in order to improve your optimization (e.g., to the (0, 1) range)
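A minimal sketch of a vectorized, differentiable version, assuming the boxes are encoded as (x1, x2, y1, y2) in the same order as the original code:
import keras.backend as K

def iou_loss(y_true, y_pred):
    # Elementwise K.maximum/K.minimum keep everything differentiable and avoid the
    # Python loop over the (unknown) batch dimension.
    x1 = K.maximum(y_true[:, 0], y_pred[:, 0])
    x2 = K.minimum(y_true[:, 1], y_pred[:, 1])
    y1 = K.maximum(y_true[:, 2], y_pred[:, 2])
    y2 = K.minimum(y_true[:, 3], y_pred[:, 3])

    inter = K.maximum(x2 - x1, 0.0) * K.maximum(y2 - y1, 0.0)
    area_true = (y_true[:, 1] - y_true[:, 0]) * (y_true[:, 3] - y_true[:, 2])
    area_pred = (y_pred[:, 1] - y_pred[:, 0]) * (y_pred[:, 3] - y_pred[:, 2])
    union = area_true + area_pred - inter

    iou = inter / (union + K.epsilon())  # epsilon guards against division by zero
    return 1.0 - K.mean(iou)             # K.mean replaces the manual batch average
Note that the gradient is still zero whenever the predicted and true boxes do not overlap at all, which is the second point in the list above.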
I'm trying to implement a 3D convnet followed by an LSTM layer for sequence generation, using 3D images as input, in Keras with the TensorFlow backend.
I would like to start training with the weights of an existing pre-trained model in order to avoid the common issues with random initialization.
In order to start with a basic example, I took VGG-16 and implemented a "3D" version of this network (without the FC layers):
img_input = Input((100, 80, 80, 3))
x = Conv3D(64, (3, 3, 3), activation='relu', padding='same', name='block1_conv1')(img_input)
x = Conv3D(64, (3, 3, 3), activation='relu', padding='same', name='block1_conv2')(x)
x = MaxPooling3D((1, 2, 2), strides=(1, 2, 2), name='block1_pool')(x)
x = Conv3D(128, (3, 3, 3), activation='relu', padding='same', name='block2_conv1')(x)
x = Conv3D(128, (3, 3, 3), activation='relu', padding='same', name='block2_conv2')(x)
x = MaxPooling3D((1, 2, 2), strides=(1, 2, 2), name='block2_pool')(x)
x = Conv3D(256, (3, 3, 3), activation='relu', padding='same', name='block3_conv1')(x)
x = Conv3D(256, (3, 3, 3), activation='relu', padding='same', name='block3_conv2')(x)
x = Conv3D(256, (3, 3, 3), activation='relu', padding='same', name='block3_conv3')(x)
x = MaxPooling3D((1, 2, 2), strides=(1, 2, 2), name='block3_pool')(x)
x = Conv3D(512, (3, 3, 3), activation='relu', padding='same', name='block4_conv1')(x)
x = Conv3D(512, (3, 3, 3), activation='relu', padding='same', name='block4_conv2')(x)
x = Conv3D(512, (3, 3, 3), activation='relu', padding='same', name='block4_conv3')(x)
x = MaxPooling3D((1, 2, 2), strides=(1, 2, 2), name='block4_pool')(x)
x = Conv3D(512, (3, 3, 3), activation='relu', padding='same', name='block5_conv1')(x)
x = Conv3D(512, (3, 3, 3), activation='relu', padding='same', name='block5_conv2')(x)
x = Conv3D(512, (3, 3, 3), activation='relu', padding='same', name='block5_conv3')(x)
x = MaxPooling3D((1, 2, 2), strides=(1, 2, 2), name='block5_pool')(x)
So I would like to know how I can load the weights of the pretrained VGG-16 into each one of the 100 slices (my 3D images are composed of 100 80x80 RGB slices).
Any advice you can give me would be useful.
Thanks
This depends on what you are looking to do in your application.
If you are just looking to process the 3D image in terms of slices, then defining a TimeDistributed VGG16 network (Conv2D instead of Conv3D) would be the way to go.
The model then becomes something like this for each layer you defined above:
img_input = Input((100,80,80,3))
x = TimeDistributed(Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv1', trainable=False))(img_input)
x = TimeDistributed(Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv2', trainable=False))(x)
x = TimeDistributed(MaxPooling2D((2, 2), strides=(2, 2), name='block1_pool', trainable=False))(x)
...
...
Note that I have included the option trainable=False here. This is pretty useful if you only want to train the deeper layers and freeze the lower layers with the well-trained weights of VGG.
To load the VGG weights for the model, you can then use the load_weights function of Keras.
model.load_weights(filepath, by_name=True)
If you set the names of the layers which you do not want to train to be the same as those defined in VGG16, then you can simply load those layers by name here.
However, spatio-temporal feature learning is something that can potentially be done much better by using 3D ConvNets.
If this is the basis of your application, then you cannot directly import VGG16 weights into a Conv3D model, because the number of parameters in each layer increases; the filters go from being 3*3 to 3*3*3, for example.
You could still load the weights layer by layer into the model by considering how the 3*3 kernel should be placed within the 3*3*3 kernel when initializing with the VGG16 weights. The set_weights() function takes as input a list of NumPy arrays (the kernel weights and the bias, respectively). You can extract each layer's weights from VGG16, construct a new NumPy array for an equivalent Conv3D weight matrix, and feed it to your Conv3D model.
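A minimal sketch of that idea for a single layer, assuming the Conv3D layers above have been wrapped in a Model (called model_3d here) with the same layer names, and that the pretrained 2D kernel is placed in the central depth slice of the 3x3x3 kernel with zeros elsewhere; this placement is one choice among several:
import numpy as np
from keras.applications import VGG16

vgg = VGG16(weights='imagenet', include_top=False)
kernel_2d, bias = vgg.get_layer('block1_conv1').get_weights()  # shape (3, 3, in, out)

depth = 3  # temporal extent of the 3D kernel
kernel_3d = np.zeros((depth,) + kernel_2d.shape, dtype=kernel_2d.dtype)  # (3, 3, 3, in, out)
kernel_3d[depth // 2] = kernel_2d  # centre slice gets the pretrained 2D weights

# model_3d is assumed to be Model(img_input, x) built from the Conv3D layers above
model_3d.get_layer('block1_conv1').set_weights([kernel_3d, bias])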
But I would encourage you to look at the existing literature and models for processing 3D images to see if those can give you a better initialization via transfer learning.
For example, C3D is one such popular model. ShapeNet and Pascal3D are popular 3D datasets.
This discussion on how to process video data might also be useful to give you better insights on how to proceed.
I am trying to build an encoder-decoder model for time series data with 1D convolutions in Keras. Consider this simple model:
inputs = Input(shape = (timesteps, input_dim))
t = Conv1D(16, kernel_size=3, padding='same')(inputs)
encoded = Conv1D(16, kernel_size=2, strides=2)(t)
t = UpSampling1D(2)(encoded)
t = Conv1D(16, kernel_size=3, padding='same')(t)
decoded = Conv1D(1, kernel_size=3, padding='same')(t)
model = Model(inputs, decoded)
My questions are:
Where should I use dilation (dilation_rate=2)? In the encoder only, or in both, in order to maximize the receptive field?
What should I use as the latent representation? A fully connected layer, a lower-dimensional image (as above), pooling, or fewer filters?
This answer is for other people who came here via Google:
Dilation vs. stride: stride makes the output feature map smaller, so you only use it once per downsampling step. Dilation makes the kernel effectively bigger by inserting zeros between its weights. It gives the same receptive-field growth as strides, but without making the output smaller.
Keras/tf.keras example:
x = input_img
x = Conv2D(16, (3, 3), padding='valid')(x)
x = Conv2D(16, (3, 3), strides=2, padding='valid')(x)
x = Conv2D(16, (3, 3), padding='valid')(x)
x = Conv2D(16, (3, 3), strides=2, padding='valid')(x)
x = Conv2D(16, (3, 3), padding='valid')(x)
encoded = Conv2D(num_features, (2, 2), padding='valid')(x)
Is the same as:
x = Conv2D(16, (3, 3), padding='valid')(x)
x = Conv2D(16, (3, 3), padding='valid')(x)
x = Conv2D(16, (3, 3), dilation_rate=2, padding='valid')(x)
x = Conv2D(16, (3, 3), dilation_rate=2, padding='valid')(x)
x = Conv2D(16, (3, 3), dilation_rate=4, padding='valid')(x)
encoded = Conv2D(num_features, (2, 2), dilation_rate=4, padding='valid')(x)
If you replace the strides in an auto-encoder with dilation_rate like this, it will work. (Conv2DTranspose also has a dilation_rate argument, but that does not work: https://github.com/keras-team/keras/issues/8159. A workaround is to train your network with strides (encoder) and UpSampling2D (decoder), then load those weights into a simpler dilated encoder when you're going to use it.)
About pooling: pooling is not needed in this case, but it can help remove location bias. Another method is translation augmentation, which achieves the same result. Depending on your problem, you may or may not want this.
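A minimal sketch of translation augmentation with Keras' ImageDataGenerator, matching the 2D examples above; the shift fractions are arbitrary choices:
from keras.preprocessing.image import ImageDataGenerator

# Randomly shift images by up to 10% horizontally and vertically so the network
# cannot rely on absolute location.
datagen = ImageDataGenerator(width_shift_range=0.1, height_shift_range=0.1)
# model.fit_generator(datagen.flow(x_train, y_train, batch_size=32), steps_per_epoch=len(x_train) // 32)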
Fully connected layers: these are completely out of style here. Just use a convolutional layer whose kernel spans the whole feature map to connect everything; it is exactly the same, but makes it possible to feed in a bigger input.
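A minimal sketch of that idea in 1D, matching the encoder-decoder in the question; the sizes are illustrative:
from keras.layers import Input, Conv1D

x = Input(shape=(8, 16))  # e.g. 8 timesteps with 16 filters at the bottleneck
# A kernel spanning all 8 timesteps acts like a fully connected layer over the
# feature map, but the same layer still works if the input gets longer.
latent = Conv1D(32, kernel_size=8)(x)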
Fewer or more filters: I never know. Visualize your filters and/or filter responses. If you see filters that are very similar, you used too many filters, or you did not encourage enough diversity between them (dropout and data augmentation could help with that).