Where to use dilated convolution in autoencoder for temporal data? - keras

I am trying to build an encoder-decoder model for time series data with 1D convolution in Keras. Consider this simple model:
from keras.layers import Input, Conv1D, UpSampling1D
from keras.models import Model

inputs = Input(shape = (timesteps, input_dim))
t = Conv1D(16, kernel_size=3, padding='same')(inputs)
encoded = Conv1D(16, kernel_size=2, strides=2)(t)
t = UpSampling1D(2)(encoded)
# decode from the upsampled code, not from the raw inputs
t = Conv1D(16, kernel_size=3, padding='same')(t)
decoded = Conv1D(1, kernel_size=3, padding='same')(t)
model = Model(inputs, decoded)
My questions are:
Where to use dilation (dilation_rate=2)? In the encoder only or in both in order to maximize the receptive field?
What should I use as a latent representation? Fully connected layer, lower dimensional image (as above), pooling or fewer filters?

This answer is for the other people who came here via google:
Dilation vs. stride: a stride makes the response smaller, so you only use it once. Dilation makes the kernel bigger by inserting zeros between its weights. It widens the receptive field the same way a stride does, but without making the response smaller.
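To make the difference concrete, here is a minimal pure-Python sketch of a 'valid' 1D convolution with optional stride and dilation. The helper conv1d_valid and the toy signal are made up for illustration and are not part of Keras; the point is only to compare output lengths.

```python
def conv1d_valid(signal, kernel, stride=1, dilation=1):
    """Plain 'valid' 1D cross-correlation with optional stride and dilation."""
    # A dilated kernel spans (k - 1) * dilation + 1 input samples.
    span = (len(kernel) - 1) * dilation + 1
    outputs = []
    for start in range(0, len(signal) - span + 1, stride):
        taps = [signal[start + j * dilation] for j in range(len(kernel))]
        outputs.append(sum(w * v for w, v in zip(kernel, taps)))
    return outputs

signal = list(range(10))   # 0, 1, ..., 9
kernel = [1, 1, 1]         # simple summing kernel

plain = conv1d_valid(signal, kernel)
strided = conv1d_valid(signal, kernel, stride=2)
dilated = conv1d_valid(signal, kernel, dilation=2)

# The stride halves the number of outputs; the dilation only trims the borders.
print(len(plain), len(strided), len(dilated))  # 8 4 6
```

The strided version drops every other position, while the dilated version keeps (almost) full resolution and simply looks at samples that are further apart.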
Keras/tf.keras example:
x = input_img
x = Conv2D(16, (3, 3), padding='valid')(x)
x = Conv2D(16, (3, 3), strides=2, padding='valid')(x)
x = Conv2D(16, (3, 3), padding='valid')(x)
x = Conv2D(16, (3, 3), strides=2, padding='valid')(x)
x = Conv2D(16, (3, 3), padding='valid')(x)
encoded = Conv2D(num_featers, (2, 2), padding='valid')(x)
Is the same as:
x = Conv2D(16, (3, 3), padding='valid')(x)
x = Conv2D(16, (3, 3), padding='valid')(x)
x = Conv2D(16, (3, 3), dilation_rate=2, padding='valid')(x)
x = Conv2D(16, (3, 3), dilation_rate=2, padding='valid')(x)
x = Conv2D(16, (3, 3), dilation_rate=4, padding='valid')(x)
encoded = Conv2D(num_featers, (2, 2), dilation_rate=4, padding='valid')(x)
If you replace the strides in an autoencoder with dilation_rate like this, it will work. (Conv2DTranspose also has a dilation_rate argument, but it does not work: https://github.com/keras-team/keras/issues/8159. A workaround is to train your network with strides (encoder) and UpSampling2D (decoder), then load those weights into a simpler encoder with dilation when you are going to use it.)
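One way to check in what sense the two stacks are "the same" is to compare their receptive fields. The sketch below uses the standard receptive-field recurrence for stacked convolutions; the helper name and the (kernel, stride, dilation) tuples are my own encoding of the two listings, along one spatial axis.

```python
def receptive_field(layers):
    """layers: (kernel_size, stride, dilation) per conv layer, in order.
    Returns how many input pixels (along one axis) one output unit can see."""
    rf, jump = 1, 1  # jump = spacing, in input pixels, between neighbouring units
    for kernel_size, stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf

# Strided encoder: 3x3, 3x3 stride 2, 3x3, 3x3 stride 2, 3x3, 2x2
strided_stack = [(3, 1, 1), (3, 2, 1), (3, 1, 1), (3, 2, 1), (3, 1, 1), (2, 1, 1)]
# Dilated encoder: 3x3, 3x3, 3x3 d=2, 3x3 d=2, 3x3 d=4, 2x2 d=4
dilated_stack = [(3, 1, 1), (3, 1, 1), (3, 1, 2), (3, 1, 2), (3, 1, 4), (2, 1, 4)]

print(receptive_field(strided_stack), receptive_field(dilated_stack))  # 25 25
```

Both stacks see a 25-pixel window per output unit. The output sizes still differ: the dilated stack produces a dense, near full-resolution map, of which the strided outputs are a subsampled grid.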
About the pooling: pooling is not needed in this case, but it can help to remove location bias. Another way to get the same result is translation augmentation. Whether you want this or not depends on your problem.
Fully connected: fully connected layers are completely out of style. Just use a convolution layer whose kernel is big enough to connect everything. This is exactly the same, but it makes it possible to feed in a bigger input.
Fewer or more filters: I never know. Visualize your filters and/or the filter responses. If you see filters that are very similar, you used too many filters, or you did not encourage enough diversity between them (dropout and data augmentation could help with that).
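The equivalence between a fully connected layer and a "full-size" convolution can be checked numerically. This is a NumPy sketch, not Keras code; conv2d_valid is a hypothetical helper and the shapes (4, 4, 8) and 10 units are arbitrary choices for the demo.

```python
import numpy as np

def conv2d_valid(fmap, kernel):
    """'valid' cross-correlation of an (H, W, C) map with a (kh, kw, C, U) kernel."""
    kh, kw, _, units = kernel.shape
    out_h = fmap.shape[0] - kh + 1
    out_w = fmap.shape[1] - kw + 1
    out = np.zeros((out_h, out_w, units))
    for i in range(out_h):
        for j in range(out_w):
            patch = fmap[i:i + kh, j:j + kw, :]
            out[i, j] = np.einsum('hwc,hwcu->u', patch, kernel)
    return out

rng = np.random.default_rng(0)
H, W, C, units = 4, 4, 8, 10
fmap = rng.normal(size=(H, W, C))
dense_w = rng.normal(size=(H * W * C, units))

# Fully connected layer on the flattened map (bias left out for brevity).
dense_out = fmap.reshape(-1) @ dense_w

# Same weights viewed as one kernel that covers the whole map.
kernel = dense_w.reshape(H, W, C, units)
conv_out = conv2d_valid(fmap, kernel)        # shape (1, 1, units)

# The conv version also accepts a bigger input and returns a small output map.
bigger_out = conv2d_valid(rng.normal(size=(H + 2, W + 1, C)), kernel)
print(conv_out.shape, bigger_out.shape)
```

On the original input size the two give identical numbers; on the larger input the dense layer would fail with a shape error, while the convolution slides over the extra positions.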

Related

Autoencoder of CNN - decrease or increase filters?

In an Autoencoder based on CNN, will you increase or decrease the number of filters between layers ? As we compress the information, I was thinking of decreasing.
Example here of the encoder part where the number of filters is decreased at each new layer, from 16 to 8 to 4.
x = Conv2D(filters = 16, kernel_size = 3, activation='relu', padding='same', name='encoder_1a')(inputs)
x = MaxPooling2D(pool_size = (2, 2), padding='same', name='encoder_1b')(x)
x = Conv2D(filters = 8, kernel_size = 3, activation='relu', padding='same', name='encoder_2a')(x)
x = MaxPooling2D(pool_size = (2, 2), padding='same', name='encoder_2b')(x)
x = Conv2D(filters = 4, kernel_size = 3, activation='relu', padding='same', name='encoder_3a')(x)
x = MaxPooling2D(pool_size = (2, 2), padding='same', name='encoder_3b')(x)
It is not always the case that the number of filters is reduced or increased as layers are added to the encoder. In most convolutional autoencoder architectures I have seen, the encoder decreases the height and width through strided convolution or pooling, and the depth of the layer (the number of filters) is increased, kept the same as in the previous layer, or varied with each new layer. But there are also examples where the number of output channels (filters) decreases with more layers.
Usually an autoencoder encodes the input into a latent representation (a vector or embedding) that has a lower dimension than the input, while minimizing the reconstruction error. So both of the above can be used to create an undercomplete autoencoder by varying the kernel size or the number of layers, adding an extra layer of a certain dimension at the end of the encoder, etc.
Filter increase example
In the example image (not reproduced here), the number of filters increases as more layers are added to the encoder. But the input has 28*28*1 = 784 features, while the flattened representation has 3*3*128 = 1152, which is more. So another layer is added before the final layer, which is the embedding layer. It is a fully connected layer that reduces the feature dimension to a predefined number of outputs. Even this last dense/fully connected layer can be replaced by varying the number of layers or the kernel size so that the output has shape (1, 1, NUM_FILTERS).
Filter decrease example
An easy example of the number of filters decreasing in the encoder as layers are added can be found in the Keras convolutional autoencoder example, which looks just like your code.
import keras
from keras import layers
input_img = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(16, (3, 3), activation='relu', padding='same')(input_img)
x = layers.MaxPooling2D((2, 2), padding='same')(x)
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = layers.MaxPooling2D((2, 2), padding='same')(x)
x = layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = layers.MaxPooling2D((2, 2), padding='same')(x)
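To see how much this encoder actually compresses, the shapes can be traced by hand. The small sketch below does that arithmetic in plain Python (no Keras needed); same_pool is a helper I made up, relying on the fact that 'same' pooling with stride 2 rounds the size up.

```python
import math

def same_pool(size, pool=2):
    """Output length of pooling with padding='same' and stride equal to the pool size."""
    return math.ceil(size / pool)

side, channels = 28, 1
trace = [(side, side, channels)]
for filters in (16, 8, 8):
    channels = filters       # Conv2D with padding='same' keeps height and width
    side = same_pool(side)   # MaxPooling2D((2, 2), padding='same') halves them, rounding up
    trace.append((side, side, channels))

latent_dim = side * side * channels
print(trace)             # [(28, 28, 1), (14, 14, 16), (7, 7, 8), (4, 4, 8)]
print(784, latent_dim)   # 784 128
```

So the 784-dimensional input ends up as a 128-dimensional code, which is what makes this variant undercomplete without any dense layer.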
References
https://www.deeplearningbook.org/contents/autoencoders.html
https://xifengguo.github.io/papers/ICONIP17-DCEC.pdf
https://blog.keras.io/building-autoencoders-in-keras.html

Why do Keras documentation examples for autoencoders use Conv2D instead of Conv2DTranspose

I have been following the Keras documentation to build up a CNN autoencoder
https://blog.keras.io/building-autoencoders-in-keras.html .
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model
from keras import backend as K
input_img = Input(shape=(28, 28, 1)) # adapt this if using `channels_first` image data format
x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)
# at this point the representation is (4, 4, 8) i.e. 128-dimensional
x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
I have noticed that it uses Conv2D in its decoding layers instead of Conv2DTranspose. But some other articles explain CNN autoencoders using Conv2DTranspose as a replacement for UpSampling2D and Conv2D. I have seen several questions related to Conv2DTranspose itself, but I haven't found an answer to my question.
My question is: can I use Conv2DTranspose instead of the UpSampling2D and Conv2D layers? If so, why haven't the authors themselves (Keras documentation) used it? Does it make any difference?
Transposed convolutions often result in so-called checkerboard artifacts: small adjacent squares that are easily distinguishable from each other. These make it very easy for humans to tell fake images from real ones.
You can read this article for more information.
In short, using resizing + Conv2D instead of Conv2DTranspose minimizes these checkerboard artifacts.
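A common explanation for the checkerboard pattern is uneven overlap: each input pixel of a transposed convolution "paints" a kernel-sized patch into the output, and when the kernel size is not divisible by the stride, some output pixels receive more contributions than others. Here is a 1D sketch that just counts those contributions; overlap_counts is a hypothetical helper, not a Keras function.

```python
def overlap_counts(input_len, kernel_size, stride):
    """How many input positions contribute to each output position of a
    1D transposed convolution without padding."""
    out_len = (input_len - 1) * stride + kernel_size
    counts = [0] * out_len
    for i in range(input_len):
        for k in range(kernel_size):
            counts[i * stride + k] += 1
    return counts

print(overlap_counts(4, kernel_size=3, stride=2))  # [1, 1, 2, 1, 2, 1, 2, 1, 1]
print(overlap_counts(4, kernel_size=4, stride=2))  # [1, 1, 2, 2, 2, 2, 2, 2, 1, 1]
```

With kernel 3 and stride 2 the counts alternate between 1 and 2, which in 2D shows up as a checkerboard. With a kernel size divisible by the stride the interior is even, and with upsampling followed by an ordinary convolution every interior output pixel is covered by the same number of kernel taps by construction.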

Make a custom loss function for mean intersection over union for regression in bounding boxes

I am trying to iterate over the batch one by one to calculate the mean intersection over union, but the fit function shows this:
Error: An operation has None for the gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
Help as I am new to keras
#y_true shape: (None, 4)
import keras.backend as K
def iou(y_true, y_pred):
    # determine the (x, y)-coordinates of the intersection rectangle
    iou = 0
    for i in range(K.int_shape(y_pred)[0]):
        boxA = y_pred[i]
        boxB = y_true[i]
        xA = K.max(boxA[0], boxB[0])
        yA = K.max(boxA[2], boxB[2])
        xB = K.min(boxA[1], boxB[1])
        yB = K.min(boxA[3], boxB[3])
        interArea = K.max(0, xB - xA + 1) * K.max(0, yB - yA + 1)
        boxAArea = (boxA[1] - boxA[0] + 1) * (boxA[3] - boxA[2] + 1)
        boxBArea = (boxB[1] - boxB[0] + 1) * (boxB[3] - boxB[2] + 1)
        iou += interArea / float(boxAArea + boxBArea - interArea)
    #MEAN
    mean = iou/K.int_shape(y_pred)[0]
    return 1-mean
model.compile(optimizer='adam', loss=iou, metrics=['accuracy'])
model.fit(x_train, y_train, epochs=20, batch_size = 50)
My model works fine with mean squared error as the loss function. Model:
input_shape = (180, 240, 3)
model = Sequential([
Conv2D(32, (3, 3), input_shape=input_shape, padding='same',activation='relu'),
MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
BatchNormalization(),
Conv2D(64, (3, 3), activation='relu', padding='same'),
MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
BatchNormalization(),
Conv2D(128, (3, 3), activation='relu', padding='same',),
Conv2D(256, (3, 3), activation='relu', padding='same',),
MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
Conv2D(128, (3, 3), activation='relu', padding='same',),
MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
BatchNormalization(),
Flatten(),
Dense(4096, activation='relu'),
Dense(4096, activation='relu'),
Dense(4, activation='relu')
])
It means that all operations inside your custom loss function must be differentiable, since otherwise the optimization procedure cannot be executed. To that end, you need to check one by one which operation is the culprit in your code and substitute it with a differentiable Keras backend analogue, or find some other alternative.
Considering the provided code snippet, there may be several possible suggestions to make it work:
for-loop should be vectorized
since you are using max(0, ...) in order to get an intersection area, it may happen that it is constant 0 and no gradient is available so check if it is not stuck there
for mean calculation there is a ready-to-use Keras backend function K.mean
it is a good practice to bound the values in order to improve your optimization (e.g., to (0,1) range)
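The suggestions above can be sketched as a vectorized loss. To keep it runnable without a Keras install, this version is written in NumPy; in a real Keras loss the same structure would use the backend's element-wise K.maximum / K.minimum and K.mean (note that K.max and K.min are reductions whose second argument is an axis, not a second tensor). The box layout (x_min, x_max, y_min, y_max) is taken from the question's indexing; dropping the "+ 1" pixel terms and adding eps are my assumptions.

```python
import numpy as np

def iou_loss(y_true, y_pred, eps=1e-7):
    """1 - mean IoU for boxes laid out as (x_min, x_max, y_min, y_max), shape (batch, 4)."""
    # Intersection rectangle, computed for the whole batch at once.
    x_left = np.maximum(y_true[:, 0], y_pred[:, 0])
    x_right = np.minimum(y_true[:, 1], y_pred[:, 1])
    y_top = np.maximum(y_true[:, 2], y_pred[:, 2])
    y_bottom = np.minimum(y_true[:, 3], y_pred[:, 3])

    # Clamp at zero so non-overlapping boxes get zero intersection.
    inter = np.maximum(0.0, x_right - x_left) * np.maximum(0.0, y_bottom - y_top)
    area_true = (y_true[:, 1] - y_true[:, 0]) * (y_true[:, 3] - y_true[:, 2])
    area_pred = (y_pred[:, 1] - y_pred[:, 0]) * (y_pred[:, 3] - y_pred[:, 2])
    union = area_true + area_pred - inter

    iou = inter / (union + eps)   # eps avoids division by zero for empty boxes
    return 1.0 - np.mean(iou)

y_true = np.array([[0.0, 2.0, 0.0, 2.0], [0.0, 1.0, 0.0, 1.0]])
y_pred = np.array([[1.0, 3.0, 1.0, 3.0], [0.0, 1.0, 0.0, 1.0]])
loss = iou_loss(y_true, y_pred)   # IoUs are 1/7 and 1, so the loss is about 3/7
```

Keep in mind the caveat from the list: when predicted and true boxes do not overlap at all, the clamped intersection is constant 0 and this loss provides no gradient, so it may need to be combined with something like MSE early in training.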

How to copy weights from a 2D convnet in to a 3D Convnet on Keras?

I'm trying to implement a 3D convnet followed by an LSTM layer for sequence generation using 3D images as input, on Keras with the TensorFlow backend.
I would like to start training with the weights of an existing pre-trained model in order to avoid common issues with random initialization.
In order to start with a basic example, I took VGG-16 and I implemented a "3D" version of this network (without the FC layers):
img_input = Input((100,80,80,3))
x = Conv3D(64, (3, 3 ,3), activation='relu', padding='same', name='block1_conv1')(img_input)
x = Conv3D(64, (3, 3 ,3), activation='relu', padding='same', name='block1_conv2')(x)
x = MaxPooling3D((1, 2, 2), strides=(1, 2, 2), name='block1_pool')(x)
x = Conv3D(128, (3, 3 ,3), activation='relu', padding='same', name='block2_conv1')(x)
x = Conv3D(128, (3, 3 ,3), activation='relu', padding='same', name='block2_conv2')(x)
x = MaxPooling3D((1, 2 ,2), strides=(1,2, 2), name='block2_pool')(x)
x = Conv3D(256, (3, 3 ,3), activation='relu', padding='same', name='block3_conv1')(x)
x = Conv3D(256, (3, 3 , 3), activation='relu', padding='same', name='block3_conv2')(x)
x = Conv3D(256, (3, 3, 3), activation='relu', padding='same', name='block3_conv3')(x)
x = MaxPooling3D((1, 2 ,2), strides=(1,2, 2), name='block3_pool')(x)
x = Conv3D(512, (3, 3 ,3), activation='relu', padding='same', name='block4_conv1')(x)
x = Conv3D(512, (3, 3 ,3), activation='relu', padding='same', name='block4_conv2')(x)
x = Conv3D(512, (3, 3 ,3), activation='relu', padding='same', name='block4_conv3')(x)
x = MaxPooling3D((1, 2 ,2), strides=(1, 2, 2), name='block4_pool')(x)
x = Conv3D(512, (3, 3 ,3), activation='relu', padding='same', name='block5_conv1')(x)
x = Conv3D(512, (3, 3 ,3), activation='relu', padding='same', name='block5_conv2')(x)
x = Conv3D(512, (3, 3 ,3), activation='relu', padding='same', name='block5_conv3')(x)
x = MaxPooling3D((1, 2 ,2), strides=(1, 2, 2), name='block5_pool')(x)
So I would like to know how I can load the weights of the pretrained VGG-16 into each one of the 100 slices (my 3D images are composed of 100 80x80 RGB slices).
Any advice you can give me would be useful.
Thanks
This depends on what you are looking to do in your application.
If you are just looking to process the 3D image in terms of slices, then defining a TimeDistributed VGG16 network (Conv2D instead of Conv3D) would be the way to go.
The model then becomes something like this for each layer you define above:
img_input = Input((100,80,80,3))
x = TimeDistributed(Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv1', trainable=False))(img_input)
x = TimeDistributed(Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv2', trainable=False))(x)
x = TimeDistributed(MaxPooling2D((2, 2), strides=(2, 2), name='block1_pool', trainable=False))(x)
...
...
Note that I have included the option 'trainable=False' here. This is pretty useful if you only want to train the deeper layers and freeze the lower layers with the well-trained weights of VGG.
To load the VGG weights for the model, you can then use the load_weights function of Keras.
model.load_weights(filepath, by_name=True)
If you set the layer names which you do not want to train to be the same as what is defined in the VGG16, then you can simply load those layers by name over here.
However, spatio-temporal feature learning can potentially be done much better with 3D ConvNets.
If this is the basis of your application, then you cannot directly import VGG16 weights into a Conv3D model, because the number of parameters in each layer increases when the filter goes from 3*3 to 3*3*3, for example.
You could still load the weights layer by layer to the model by considering which patch of 3*3 from the 3*3*3 would be most suitable for initialization with VGG16 weights. set_weights() function takes as input a list of numpy arrays (for the kernel weights and the bias respectively). You can extract each layer weight from VGG16 and then construct a new numpy array for an equivalent Conv3D weight matrix and feed it to your Conv3D model.
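As a rough illustration of that last step, here is a NumPy sketch of one possible choice: put the 2D kernel into the middle depth slice of the 3D kernel and leave the other slices at zero, so the Conv3D layer initially behaves like the 2D layer applied slice by slice. The function name is made up, the random array is only a stand-in for a real VGG16 kernel, and the (depth, kh, kw, in, out) layout assumes channels_last data.

```python
import numpy as np

def center_slice_kernel(kernel_2d, depth=3):
    """Embed a (kh, kw, in, out) Conv2D kernel in the middle depth slice of a
    (depth, kh, kw, in, out) Conv3D kernel; every other slice stays zero."""
    kernel_3d = np.zeros((depth,) + kernel_2d.shape, dtype=kernel_2d.dtype)
    kernel_3d[depth // 2] = kernel_2d
    return kernel_3d

# Stand-in for the kernel and bias you would get from a VGG16 conv layer's get_weights().
rng = np.random.default_rng(0)
kernel_2d = rng.normal(size=(3, 3, 3, 64)).astype("float32")
bias = np.zeros(64, dtype="float32")

kernel_3d = center_slice_kernel(kernel_2d)
print(kernel_3d.shape)  # (3, 3, 3, 3, 64)

# With real models this pair would then be passed to the matching Conv3D layer:
# conv3d_layer.set_weights([kernel_3d, bias])
```

Another option sometimes used is to copy the 2D kernel into every depth slice and divide by the depth; which one works better for your data is something you would have to try.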
But I would encourage you to look at existing literature and models for processing 3D images to see if those can give you the better initialization using transfer learning.
For example, C3D is one such popular model. ShapeNet and Pascal3D are popular 3D datasets.
This discussion on how to process video data might also be useful to give you better insights on how to proceed.

Neural Networks: Why can't I go deeper?

I am using this model to get depth maps from images:
def get_model(learning_rate=0.001, channels=2):
    h = 128 # height of the image
    w = 128 # width of the image
    c = channels # no of channels
    encoding_size = 512
    # encoder
    image = Input(shape=(c, h, w))
    conv_1_1 = Conv2D(32, (3, 3), activation='relu', padding='same')(image)
    conv_1_2 = Conv2D(32, (3, 3), activation='relu', padding='same')(conv_1_1)
    pool_1_2 = MaxPooling2D((2, 2))(conv_1_2)
    conv_2_1 = Conv2D(64, (3, 3), activation='relu', padding='same')(pool_1_2)
    conv_2_2 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv_2_1)
    pool_2_2 = MaxPooling2D((2, 2))(conv_2_2)
    conv_3_1 = Conv2D(128, (3, 3), activation='relu', padding='same')(pool_2_2)
    conv_3_2 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv_3_1)
    # pool_3_2 = MaxPooling2D((2, 2))(conv_3_2)
    # conv_4_1 = Conv2D(256, (3, 3), activation='relu', padding='same')(pool_3_2)
    # conv_4_2 = Conv2D(256, (3, 3), activation='relu', padding='same')(conv_4_1)
    # pool_4_3 = MaxPooling2D((2, 2))(conv_4_2)
    # conv_5_1 = Conv2D(512, (3, 3), activation='relu', padding='same')(pool_4_3)
    # conv_5_2 = Conv2D(512, (3, 3), activation='relu', padding='same')(conv_5_1)
    flat_5_2 = Flatten()(conv_3_2)
    encoding = Dense(encoding_size, activation='tanh')(flat_5_2)
    # decoder
    reshaped_6_1 = Reshape((8, 8, 8))(encoding)
    conv_6_1 = Conv2D(128, (3, 3), activation='relu', padding='same')(reshaped_6_1)
    conv_6_2 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv_6_1)
    upsample_6_2 = UpSampling2D((2, 2))(conv_6_2)
    conv_7_1 = Conv2D(64, (3, 3), activation='relu', padding='same')(upsample_6_2)
    conv_7_2 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv_7_1)
    upsample_7_2 = UpSampling2D((2, 2))(conv_7_2)
    conv_8_1 = Conv2D(32, (3, 3), activation='relu', padding='same')(upsample_7_2)
    conv_8_2 = Conv2D(32, (3, 3), activation='relu', padding='same')(conv_8_1)
    upsample_8_2 = UpSampling2D((2, 2))(conv_8_2)
    conv_9_1 = Conv2D(16, (3, 3), activation='relu', padding='same')(upsample_8_2)
    conv_9_2 = Conv2D(16, (3, 3), activation='relu', padding='same')(conv_9_1)
    upsample_9_2 = UpSampling2D((2, 2))(conv_9_2)
    conv_10_1 = Conv2D(8, (3, 3), activation='relu', padding='same')(upsample_9_2)
    conv_10_2 = Conv2D(1, (3, 3), activation='relu', padding='same')(conv_10_1)
    output = Conv2D(1, (1, 1), activation=relu_normalized, padding='same')(conv_10_2)
    model = Model(inputs=image, outputs=output)
    model.compile(loss='mae', optimizer=Adam(learning_rate))
    return model
Input: 2x128x128 (two bw images) - Squished to [0,1] (preprocessing normalization)
Output: 1x128x128 (depth map) - Squished to [0,1] by relu-normalized
NOTE: relu_normalized is just relu followed by squishing values to 0-1 so as to have a proper image. Sigmoid doesn't seem to fit this criteria.
When I add any more layers, the loss becomes a constant and backprop is not happening properly because both the output and gradients are becoming zero (and hence changing the learning rate didn't change anything in the network)
So if I want to go deeper to generalize more, by uncommenting the lines (and of course connecting conv_5_2 to flat_5_2), what is it that I am missing?
My thoughts:
Using Sigmoid would lead to vanishing gradient problem, but I am using relu's, would that problem still exist?
Changing anything in the network, like encoding size, even changing to activations to elu or selu doesn't show any progress.
Why are my outputs getting closer to zero when I try to add even one more conv layer followed by max_pooling?
UPDATE:
Here's relu_normalized,
def relu_normalized(x):
    epsilon = 1e-6
    relu_x = relu(x)
    relu_scaled_x = relu_x / (K.max(relu_x) + epsilon)
    return relu_scaled_x
and later, after getting the output, which has range [0,1], we simply do output_image = 255 * output and we can save this as a b/w image.
If you want to go deeper, you have to add some batch normalization layers (in Keras: https://keras.io/layers/normalization/#batchnormalization) in this case.
From Ian Goodfellow's book, on the batch normalization chapter:
Very deep models involve the composition of several functions or layers. The
gradient tells how to update each parameter, under the assumption that the other
layers do not change. In practice, we update all of the layers simultaneously.
When we make the update, unexpected results can happen because many functions
composed together are changed simultaneously, using updates that were computed
under the assumption that the other functions remain constant
Also, tanh is easily saturated so use only if you need it :)
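For readers who have not used it, this is roughly what a batch normalization layer computes at training time, written as a NumPy sketch rather than the Keras layer itself. It assumes channels-last tensors of shape (batch, height, width, channels), and the gamma, beta and eps values are placeholders (in the real layer gamma and beta are learned).

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-3):
    """Training-mode batch norm: normalise each channel over the batch and
    spatial axes, then apply a scale (gamma) and shift (beta)."""
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
# Activations with an awkward scale and offset, as can build up in deep stacks.
activations = rng.normal(loc=3.0, scale=5.0, size=(16, 8, 8, 4))
normalised = batch_norm(activations)

print(normalised.mean(axis=(0, 1, 2)))  # close to 0 for every channel
print(normalised.std(axis=(0, 1, 2)))   # close to 1 for every channel
```

Because every layer then sees inputs with a stable mean and variance, the gradient signal is less likely to collapse as more conv blocks are stacked.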
There is a problem that might happen with "relu" when the learning rate is too big.
There is a high chance of all activations going to 0 and getting stuck there never to change anymore. (When they're at 0, their gradient is also 0).
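Here is a toy illustration of that effect with a single ReLU unit in plain Python. The data, the starting weights and the two learning rates are arbitrary values picked for the demo, and it uses squared error rather than the MAE from the question to keep the gradient simple.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def gradients(w, b, xs, ys):
    """Gradient of the mean squared error of y_hat = relu(w * x + b) w.r.t. w and b."""
    grad_w = grad_b = 0.0
    for x, y in zip(xs, ys):
        z = w * x + b
        upstream = 2.0 * (relu(z) - y) * relu_grad(z)
        grad_w += upstream * x / len(xs)
        grad_b += upstream / len(xs)
    return grad_w, grad_b

xs = [0.2, 0.5, 0.9]   # inputs in [0, 1], like normalised pixels
ys = [0.1, 0.2, 0.3]   # targets

def one_step(lr, w=1.0, b=0.0):
    """One gradient descent step from the same starting point."""
    grad_w, grad_b = gradients(w, b, xs, ys)
    return w - lr * grad_w, b - lr * grad_b

w_small, b_small = one_step(lr=0.1)
w_big, b_big = one_step(lr=10.0)

grads_small = gradients(w_small, b_small, xs, ys)   # still non-zero: unit keeps learning
grads_big = gradients(w_big, b_big, xs, ys)         # exactly zero: unit is stuck
print(grads_small, grads_big)
```

After the oversized step both w and b are negative, so the pre-activation is negative for every input, the output is 0 everywhere and no gradient flows back to move the unit again.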
Since I'm not an expert on adjusting the parameters in detail for using "relu", and my results with "relu" are always bad, I prefer using "sigmoid" or "tanh". (It's worth trying, although there might be some vanishing there...). I keep my images ranged from 0 to 1 and use "binary_crossentropy" as loss, which is a lot faster than "mae/mse" in this case.
Another thing that happened to me was an "apparently" frozen loss function. It happened that the value was changing so little, that the displayed decimals weren't enough to see the variation, but after a lot of epochs, it found a reasonable way to go down properly. (Probably some kind of saturation indeed, but for me it's still better than getting freezes or nans)
You can introduce recurrent layers like LSTM which would "trap" the errors using gating, potentially improving the situation.

Resources