I am running a U-Net as defined below:
inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = Lambda(lambda x: x / 255) (inputs)
c1 = Conv2D(8, (3, 3), activation='relu', padding='same') (s)
c1 = Conv2D(8, (3, 3), activation='relu', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)
c2 = Conv2D(16, (3, 3), activation='relu', padding='same') (p1)
c2 = Conv2D(16, (3, 3), activation='relu', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)
c3 = Conv2D(32, (3, 3), activation='relu', padding='same') (p2)
c3 = Conv2D(32, (3, 3), activation='relu', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)
c4 = Conv2D(64, (3, 3), activation='relu', padding='same') (p3)
c4 = Conv2D(64, (3, 3), activation='relu', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)
c5 = Conv2D(128, (3, 3), activation='relu', padding='same') (p4)
c5 = Conv2D(128, (3, 3), activation='relu', padding='same') (c5)
u6 = Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c4])
c6 = Conv2D(64, (3, 3), activation='relu', padding='same') (u6)
c6 = Conv2D(64, (3, 3), activation='relu', padding='same') (c6)
u7 = Conv2DTranspose(32, (2, 2), strides=(2, 2), padding='same') (c6)
u7 = concatenate([u7, c3])
c7 = Conv2D(32, (3, 3), activation='relu', padding='same') (u7)
c7 = Conv2D(32, (3, 3), activation='relu', padding='same') (c7)
u8 = Conv2DTranspose(16, (2, 2), strides=(2, 2), padding='same') (c7)
u8 = concatenate([u8, c2])
c8 = Conv2D(16, (3, 3), activation='relu', padding='same') (u8)
c8 = Conv2D(16, (3, 3), activation='relu', padding='same') (c8)
u9 = Conv2DTranspose(8, (2, 2), strides=(2, 2), padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
c9 = Conv2D(8, (3, 3), activation='relu', padding='same') (u9)
c9 = Conv2D(8, (3, 3), activation='relu', padding='same') (c9)
outputs = Conv2D(10, (1, 1), activation='sigmoid') (c9)
model = Model(inputs=[inputs], outputs=[outputs])
model.compile(optimizer='Adamax', loss = dice, metrics = [mIoU])
Notice that I'm doing multi-class prediction on ten classes. The inputs are 256x256x3 (RGB) images and the ground truths are binary masks of size 256x256x10, since depth = num_classes = 10. My question is: I accidentally forgot to change the activation function from sigmoid to softmax and ran the network. The network still ran. How is this possible? Is it because it's treating each binary mask independently?
More intriguingly, the network actually yielded better results with sigmoid than it did with softmax.
Q1: Why is my network still trainable with a *wrong* loss function?
A1: Because your network is optimized by gradient descent, which does not care which loss function is used as long as it is differentiable. This fact reveals how difficult it can be to debug a network when it doesn't work, because the problem is not a code bug (e.g. one causing a memory leak or numerical overflow), but a bug that is not scientifically sound (e.g. your regression target is in the range (0, 100), but you use sigmoid as the activation function of the last dense layer).
Q2: How come `sigmoid` gives better performance than `softmax`?
A2: First, using a sigmoid output means training 10 binary classifiers, one for each class (i.e. the classic one-vs-all, or one-vs-rest, setting), so it is also technically sound.
The only difference between sigmoid and softmax is that the sum of the class-wise predicted probabilities is always 1 for the softmax network, while it is not necessarily 1 for the sigmoid network. In other words, with the sigmoid network you may face ambiguity when deciding on a label at test time.
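A quick numeric illustration of that difference (plain NumPy, independent of the model above):
import numpy as np
logits = np.array([2.0, -1.0, 0.5])              # per-pixel scores for 3 hypothetical classes
sigmoid = 1.0 / (1.0 + np.exp(-logits))
softmax = np.exp(logits) / np.exp(logits).sum()
print(sigmoid.sum())   # ~1.77, not constrained to sum to 1
print(softmax.sum())   # 1.0, always sums to 1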
As to why sigmoid is better than softmax here, it is related to many aspects and is difficult to analyze without careful study. One possible explanation is that sigmoid treats the rows of the weight matrix of the last layer independently, while softmax treats them dependently. Therefore, sigmoid may better handle samples with contradicting gradient directions. Another thought is that you could try the recently proposed heated-up softmax.
Finally, if you find that the sigmoid version gives you better performance but you still want a softmax network, you can reuse all the layers up to the final classification layer of the sigmoid network and fine-tune a new softmax layer, or use both losses, as in a multi-task problem.
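A hedged sketch of that fine-tuning option (the layer indexing is my assumption, based on the model defined in the question, where the sigmoid head is the last layer):
from keras.layers import Conv2D
from keras.models import Model
# `sigmoid_model` is the trained network from the question; layers[-2] produces the c9 feature maps.
features = sigmoid_model.layers[-2].output
softmax_out = Conv2D(10, (1, 1), activation='softmax', name='softmax_head')(features)
softmax_model = Model(inputs=sigmoid_model.input, outputs=softmax_out)
# Optionally freeze the reused layers and train only the new head at first.
for layer in softmax_model.layers[:-1]:
    layer.trainable = False
# The loss here is illustrative; the dice loss from the question would also work.
softmax_model.compile(optimizer='Adamax', loss='categorical_crossentropy')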
Hi guys, I am working with autoencoders and I am trying to get the features from a specific layer of the autoencoder (I am not interested in the latent space). I am using the following code:
#Define autoencoder
import keras
input_shape = (1, 512, 512, 1)
SIZE = 512
encoder = keras.models.Sequential()
encoder.add(keras.layers.Conv2D(32, (9, 9), activation='elu', padding='same', input_shape=(SIZE, SIZE, 1)))
encoder.add(keras.layers.BatchNormalization())
encoder.add(keras.layers.Conv2D(64, (7, 7), activation='elu', padding='same'))
encoder.add(keras.layers.BatchNormalization())
encoder.add(keras.layers.Conv2D(32, (5, 5), activation='elu', padding='same'))
encoder.add(keras.layers.MaxPooling2D((2, 2), padding='same'))
encoder.add(keras.layers.BatchNormalization())
encoder.add(keras.layers.Conv2D(32, (3, 3), activation='elu', padding='same'))
encoder.add(keras.layers.MaxPooling2D((2, 2), padding='same'))
encoder.add(keras.layers.BatchNormalization())
#Decoder
decoder = keras.models.Sequential()
decoder.add(keras.layers.Conv2D(32, (3, 3), activation='elu', padding='same'))
decoder.add(keras.layers.UpSampling2D((2, 2)))
decoder.add(keras.layers.BatchNormalization())
decoder.add(keras.layers.Conv2D(32, (5, 5), activation='elu', padding='same'))
decoder.add(keras.layers.UpSampling2D((2, 2)))
decoder.add(keras.layers.BatchNormalization())
decoder.add(keras.layers.Conv2D(64, (7, 7), activation='elu', padding='same'))
#decoder.add(keras.layers.UpSampling2D((2, 2)))
decoder.add(keras.layers.BatchNormalization())
decoder.add(keras.layers.Conv2D(32, (9,9), activation='elu', padding='same'))
#decoder.add(keras.layers.UpSampling2D((2, 2)))
decoder.add(keras.layers.BatchNormalization())
#decoder.add(keras.layers.Conv2D(64, (11,11), activation='elu', padding='same'))
#decoder.add(keras.layers.UpSampling2D((2, 2)))
#decoder.add(keras.layers.BatchNormalization())
decoder.add(keras.layers.Conv2D(1, (3, 3), activation='elu', padding='same'))
autoencoder = keras.models.Sequential([encoder,decoder])
autoencoder.compile(loss='mean_squared_error', optimizer = "adam")
autoencoder.summary()
Finally, I train the autoencoder:
model_train = autoencoder.fit(X_train_noise, X_train,
epochs=5000,
shuffle=True)
Now, I need to enter a new image but I only want the outputs from the third layer from the encoder part of the defined CNN. Any ideas?
Thanks!!!
In a Sequential model you can get any layer's output via model.layers[index] or model.get_layer(layer_name).
For example, for the third layer's output:
features_for_third_layer = encoder.layers[2].output
or
features_for_third_layer = autoencoder.layers[0].layers[2].output
After training the autoencoder, if you just want the output of an arbitrary layer of the encoder for a new image, one easy way is to define another model like this:
new_model = keras.models.Model(inputs=encoder.input, outputs=encoder.layers[2].output)
Then, you can get the output like this:
feature_vector = new_model.predict([image])
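A hedged usage note (the variable names are illustrative): predict expects a batch, so a single 512x512x1 image needs a batch dimension first.
import numpy as np
batch = np.expand_dims(image, axis=0)     # image assumed to be a (512, 512, 1) array -> (1, 512, 512, 1)
feature_maps = new_model.predict(batch)   # (1, H, W, channels) of the chosen layer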
I am using the u-net code from this Kaggle notebook that I've also pasted below:
inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = Lambda(lambda x: x / 255) (inputs)
c1 = Conv2D(8, (3, 3), activation='relu', padding='same') (s)
c1 = Conv2D(8, (3, 3), activation='relu', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)
c2 = Conv2D(16, (3, 3), activation='relu', padding='same') (p1)
c2 = Conv2D(16, (3, 3), activation='relu', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)
c3 = Conv2D(32, (3, 3), activation='relu', padding='same') (p2)
c3 = Conv2D(32, (3, 3), activation='relu', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)
c4 = Conv2D(64, (3, 3), activation='relu', padding='same') (p3)
c4 = Conv2D(64, (3, 3), activation='relu', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)
c5 = Conv2D(128, (3, 3), activation='relu', padding='same') (p4)
c5 = Conv2D(128, (3, 3), activation='relu', padding='same') (c5)
u6 = Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c4])
c6 = Conv2D(64, (3, 3), activation='relu', padding='same') (u6)
c6 = Conv2D(64, (3, 3), activation='relu', padding='same') (c6)
u7 = Conv2DTranspose(32, (2, 2), strides=(2, 2), padding='same') (c6)
u7 = concatenate([u7, c3])
c7 = Conv2D(32, (3, 3), activation='relu', padding='same') (u7)
c7 = Conv2D(32, (3, 3), activation='relu', padding='same') (c7)
u8 = Conv2DTranspose(16, (2, 2), strides=(2, 2), padding='same') (c7)
u8 = concatenate([u8, c2])
c8 = Conv2D(16, (3, 3), activation='relu', padding='same') (u8)
c8 = Conv2D(16, (3, 3), activation='relu', padding='same') (c8)
u9 = Conv2DTranspose(8, (2, 2), strides=(2, 2), padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
c9 = Conv2D(8, (3, 3), activation='relu', padding='same') (u9)
c9 = Conv2D(8, (3, 3), activation='relu', padding='same') (c9)
outputs = Conv2D(1, (1, 1), activation='sigmoid') (c9)
model = Model(inputs=[inputs], outputs=[outputs])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[mean_iou])
My question is where to properly add a kernel_regularizer (L2 regularization). I've looked at countless repos and notebooks, but I'm not able to find any source where L2 regularization was used successfully. Although I know how L2 regularization works, I have no intuition about which layers to add it to.
Hence, some intuition on where to add the kernel regularizer and what to set the parameter to would be helpful.
Going over the Kaggle notebook you linked, it appears that no weight regularization is being used anywhere in the model (so the code you added is correct).
This is quite peculiar and very uncommon; in almost all cases and models, L2 weight regularization (a.k.a. ridge regression) is used in every single layer, perhaps just with different weight-decay coefficients.
I suggest adding weight regularization to all the layers, but starting with a very small weight-decay coefficient:
c1 = Conv2D(8, (3, 3), activation='relu', padding='same', kernel_regularizer=regularizers.l2(w_decay)) (s)
c1 = Conv2D(8, (3, 3), activation='relu', padding='same', kernel_regularizer=regularizers.l2(w_decay)) (c1)
p1 = MaxPooling2D((2, 2)) (c1)
...
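For completeness, a hedged sketch of the pieces assumed in the snippet above (the import path and the 1e-5 starting value are illustrative choices, not taken from the linked notebook):
from keras import regularizers
# Start with a very small weight-decay coefficient and only increase it if the model still overfits.
w_decay = 1e-5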
I am trying to apply batch normalization to a U-Net, and I have the following architecture:
inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = Lambda(lambda x: x / 255) (inputs)
width = 32
activation = 'sigmoid'
c1 = Conv2D(width, (3, 3), activation='elu', padding='same') (s)
c1 = Conv2D(width, (3, 3), activation='elu', padding='same') (c1)
c1 = BatchNormalization()(c1)
p1 = MaxPooling2D((2, 2)) (c1)
#p1 = Dropout(0.2)(p1)
c2 = Conv2D(width*2, (3, 3), activation='elu', padding='same') (p1)
c2 = Conv2D(width*2, (3, 3), activation='elu', padding='same') (c2)
c2 = BatchNormalization()(c2)
p2 = MaxPooling2D((2, 2)) (c2)
#p2 = Dropout(0.2)(p2)
c3 = Conv2D(width*4, (3, 3), activation='elu', padding='same') (p2)
c3 = Conv2D(width*4, (3, 3), activation='elu', padding='same') (c3)
c3 = BatchNormalization()(c3)
p3 = MaxPooling2D((2, 2)) (c3)
#p3 = Dropout(0.2)(p3)
c4 = Conv2D(width*8, (3, 3), activation='elu', padding='same') (p3)
c4 = Conv2D(width*8, (3, 3), activation='elu', padding='same') (c4)
c4 = BatchNormalization()(c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)
#p4 = Dropout(0.2)(p4)
c5 = Conv2D(width*16, (3, 3), activation='elu', padding='same') (p4)
c5 = Conv2D(width*16, (3, 3), activation='elu', padding='same') (c5)
u6 = Conv2DTranspose(width*8, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c4])
#u6 = Dropout(0.2)(u6)
c6 = Conv2D(width*8, (3, 3), activation='elu', padding='same') (u6)
c6 = Conv2D(width*8, (3, 3), activation='elu', padding='same') (c6)
u7 = Conv2DTranspose(width*4, (2, 2), strides=(2, 2), padding='same') (c6)
u7 = concatenate([u7, c3])
#u7 = Dropout(0.2)(u7)
c7 = Conv2D(width*4, (3, 3), activation='elu', padding='same') (u7)
c7 = Conv2D(width*4, (3, 3), activation='elu', padding='same') (c7)
u8 = Conv2DTranspose(width*2, (2, 2), strides=(2, 2), padding='same') (c7)
u8 = concatenate([u8, c2])
#u8 = Dropout(0.2)(u8)
c8 = Conv2D(width*2, (3, 3), activation='elu', padding='same') (u8)
c8 = Conv2D(width*2, (3, 3), activation='elu', padding='same') (c8)
u9 = Conv2DTranspose(width, (2, 2), strides=(2, 2), padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
#u9 = Dropout(0.2)(u9)
c9 = Conv2D(width, (3, 3), activation='elu', padding='same') (u9)
c9 = Conv2D(width, (3, 3), activation='elu', padding='same') (c9)
outputs = Conv2D(num_classes, (1, 1), activation=activation) (c9)
model = Model(inputs=[inputs], outputs=[outputs])
What happens is that the training loss very quickly approaches a plateau value (within 2 epochs), while the validation loss remains nan the whole time. I looked at other posts: some say it's because the dimension ordering is wrong, but if that were true I shouldn't be getting a training loss either. Others say the loss is vanishing because of the learning rate, but that is also contradicted by the fact that I do get a loss on the training set. What am I doing wrong?
If num_classes > 1, your activation should be "softmax" and not "sigmoid", and then it'll probably work.
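A hedged sketch of that change (the loss is my assumption, since the question does not show the compile call; with one-hot masks, softmax pairs with categorical cross-entropy):
activation = 'softmax' if num_classes > 1 else 'sigmoid'
outputs = Conv2D(num_classes, (1, 1), activation=activation) (c9)
model = Model(inputs=[inputs], outputs=[outputs])
model.compile(optimizer='adam', loss='categorical_crossentropy')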
I wasn't passing in any validation data to the fit method! I needed to do something like this: model.fit(X_train, Y_train, validation_split=0.1, batch_size=8, epochs=30)
I'm trying to implement a 3D convnet followed by an LSTM layer for sequence generation, using 3D images as input, in Keras with the TensorFlow backend.
I would like to start training with the weights of an existing pre-trained model in order to avoid the common issues with random initialization.
In order to start with a basic example, I took VGG-16 and implemented a "3D" version of this network (without the FC layers):
img_input = Input((100,80,80,3))
x = Conv3D(64, (3, 3, 3), activation='relu', padding='same', name='block1_conv1')(img_input)
x = Conv3D(64, (3, 3, 3), activation='relu', padding='same', name='block1_conv2')(x)
x = MaxPooling3D((1, 2, 2), strides=(1, 2, 2), name='block1_pool')(x)
x = Conv3D(128, (3, 3, 3), activation='relu', padding='same', name='block2_conv1')(x)
x = Conv3D(128, (3, 3, 3), activation='relu', padding='same', name='block2_conv2')(x)
x = MaxPooling3D((1, 2, 2), strides=(1, 2, 2), name='block2_pool')(x)
x = Conv3D(256, (3, 3, 3), activation='relu', padding='same', name='block3_conv1')(x)
x = Conv3D(256, (3, 3, 3), activation='relu', padding='same', name='block3_conv2')(x)
x = Conv3D(256, (3, 3, 3), activation='relu', padding='same', name='block3_conv3')(x)
x = MaxPooling3D((1, 2, 2), strides=(1, 2, 2), name='block3_pool')(x)
x = Conv3D(512, (3, 3, 3), activation='relu', padding='same', name='block4_conv1')(x)
x = Conv3D(512, (3, 3, 3), activation='relu', padding='same', name='block4_conv2')(x)
x = Conv3D(512, (3, 3, 3), activation='relu', padding='same', name='block4_conv3')(x)
x = MaxPooling3D((1, 2, 2), strides=(1, 2, 2), name='block4_pool')(x)
x = Conv3D(512, (3, 3, 3), activation='relu', padding='same', name='block5_conv1')(x)
x = Conv3D(512, (3, 3, 3), activation='relu', padding='same', name='block5_conv2')(x)
x = Conv3D(512, (3, 3, 3), activation='relu', padding='same', name='block5_conv3')(x)
x = MaxPooling3D((1, 2, 2), strides=(1, 2, 2), name='block5_pool')(x)
So I would like to know how I can load the weights of the pre-trained VGG-16 into each one of the 100 slices (my 3D images are composed of 100 80x80 RGB slices).
Any advice you can give me would be useful.
Thanks
This depends on what you are looking to do in your application.
If you are just looking to process the 3D image slice by slice, then defining a TimeDistributed VGG16 network (Conv2D instead of Conv3D) would be the way to go.
The model then becomes something like this for each layer you define above:
img_input = Input((100,80,80,3))
x = TimeDistributed(Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv1', trainable=False))(img_input)
x = TimeDistributed(Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv2', trainable=False))(x)
x = TimeDistributed(MaxPooling2D((2, 2), strides=(2, 2), name='block1_pool', trainable=False))(x)
...
...
Note that I have included the option trainable=False here. This is pretty useful if you only want to train the deeper layers and freeze the lower layers with the well-trained weights of VGG.
To load the VGG weights for the model, you can then use the load_weights function of Keras.
model.load_weights(filepath, by_name=True)
If you set the layer names which you do not want to train to be the same as what is defined in the VGG16, then you can simply load those layers by name over here.
However, spatio-temporal feature learning is something that can potentially be done much better with 3D ConvNets.
If this is the basis of your application, then you cannot directly import VGG16 weights into a Conv3D model, because the number of parameters in each layer increases; for example, each filter goes from being 3x3 to 3x3x3.
You could still load the weights layer by layer into the model by considering how the 3x3 VGG16 kernels could best be used to initialize the 3x3x3 kernels. The set_weights() function takes as input a list of NumPy arrays (for the kernel weights and the bias, respectively). You can extract each layer's weights from VGG16, construct an equivalent Conv3D weight matrix as a new NumPy array, and feed it to your Conv3D model.
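As a hedged sketch of that layer-by-layer idea (an I3D-style "inflation", which is my assumption rather than something from the question): repeat each 2D kernel along the new temporal axis and rescale. This assumes the Conv3D model above has been built as model3d = Model(img_input, x) and keeps the VGG16 layer names.
import numpy as np
from keras.applications import VGG16
vgg2d = VGG16(weights='imagenet', include_top=False)
for layer3d in model3d.layers:
    weights3d = layer3d.get_weights()
    if not weights3d:
        continue  # Input / pooling layers have no weights
    try:
        kernel2d, bias = vgg2d.get_layer(layer3d.name).get_weights()  # kernel2d: (3, 3, in, out)
    except ValueError:
        continue  # no layer with this name in VGG16
    depth = weights3d[0].shape[0]  # temporal kernel size, here 3
    # Repeat the 2D kernel along the temporal axis and rescale so activations keep roughly the same magnitude.
    kernel3d = np.repeat(kernel2d[np.newaxis, ...], depth, axis=0) / depth
    layer3d.set_weights([kernel3d, bias])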
But I would encourage you to look at the existing literature and models for processing 3D images to see if those can give you a better initialization through transfer learning.
For example, C3D is one such popular model. ShapeNet and Pascal3D are popular 3D datasets.
This discussion on how to process video data might also be useful to give you better insights on how to proceed.
I am using this model to get depth maps from images:
def get_model(learning_rate=0.001, channels=2):
    h = 128  # height of the image
    w = 128  # width of the image
    c = channels  # no of channels
    encoding_size = 512
    # encoder
    image = Input(shape=(c, h, w))
    conv_1_1 = Conv2D(32, (3, 3), activation='relu', padding='same')(image)
    conv_1_2 = Conv2D(32, (3, 3), activation='relu', padding='same')(conv_1_1)
    pool_1_2 = MaxPooling2D((2, 2))(conv_1_2)
    conv_2_1 = Conv2D(64, (3, 3), activation='relu', padding='same')(pool_1_2)
    conv_2_2 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv_2_1)
    pool_2_2 = MaxPooling2D((2, 2))(conv_2_2)
    conv_3_1 = Conv2D(128, (3, 3), activation='relu', padding='same')(pool_2_2)
    conv_3_2 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv_3_1)
    # pool_3_2 = MaxPooling2D((2, 2))(conv_3_2)
    # conv_4_1 = Conv2D(256, (3, 3), activation='relu', padding='same')(pool_3_2)
    # conv_4_2 = Conv2D(256, (3, 3), activation='relu', padding='same')(conv_4_1)
    # pool_4_3 = MaxPooling2D((2, 2))(conv_4_2)
    # conv_5_1 = Conv2D(512, (3, 3), activation='relu', padding='same')(pool_4_3)
    # conv_5_2 = Conv2D(512, (3, 3), activation='relu', padding='same')(conv_5_1)
    flat_5_2 = Flatten()(conv_3_2)
    encoding = Dense(encoding_size, activation='tanh')(flat_5_2)
    # decoder
    reshaped_6_1 = Reshape((8, 8, 8))(encoding)
    conv_6_1 = Conv2D(128, (3, 3), activation='relu', padding='same')(reshaped_6_1)
    conv_6_2 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv_6_1)
    upsample_6_2 = UpSampling2D((2, 2))(conv_6_2)
    conv_7_1 = Conv2D(64, (3, 3), activation='relu', padding='same')(upsample_6_2)
    conv_7_2 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv_7_1)
    upsample_7_2 = UpSampling2D((2, 2))(conv_7_2)
    conv_8_1 = Conv2D(32, (3, 3), activation='relu', padding='same')(upsample_7_2)
    conv_8_2 = Conv2D(32, (3, 3), activation='relu', padding='same')(conv_8_1)
    upsample_8_2 = UpSampling2D((2, 2))(conv_8_2)
    conv_9_1 = Conv2D(16, (3, 3), activation='relu', padding='same')(upsample_8_2)
    conv_9_2 = Conv2D(16, (3, 3), activation='relu', padding='same')(conv_9_1)
    upsample_9_2 = UpSampling2D((2, 2))(conv_9_2)
    conv_10_1 = Conv2D(8, (3, 3), activation='relu', padding='same')(upsample_9_2)
    conv_10_2 = Conv2D(1, (3, 3), activation='relu', padding='same')(conv_10_1)
    output = Conv2D(1, (1, 1), activation=relu_normalized, padding='same')(conv_10_2)
    model = Model(inputs=image, outputs=output)
    model.compile(loss='mae', optimizer=Adam(learning_rate))
    return model
Input: 2x128x128 (two b/w images), squished to [0, 1] (preprocessing normalization)
Output: 1x128x128 (depth map), squished to [0, 1] by relu_normalized
NOTE: relu_normalized is just relu followed by squishing the values to [0, 1] so as to have a proper image. Sigmoid doesn't seem to fit this criterion.
When I add any more layers, the loss becomes constant and backprop stops working properly, because both the outputs and the gradients become zero (and hence changing the learning rate doesn't change anything in the network).
So if I want to go deeper in order to generalize more, by uncommenting the lines above (and of course connecting conv_5_2 to flat_5_2), what am I missing?
My thoughts:
Using sigmoid would lead to the vanishing gradient problem, but I am using relus; would that problem still exist?
Changing anything in the network, like the encoding size, or even changing the activations to elu or selu, doesn't show any progress.
Why are my outputs getting closer to zero when I try to add even one more conv layer followed by max pooling?
UPDATE:
Here's relu_normalized,
def relu_normalized(x):
    epsilon = 1e-6
    relu_x = relu(x)
    relu_scaled_x = relu_x / (K.max(relu_x) + epsilon)
    return relu_scaled_x
and later, after getting the output, which is in the range [0, 1], we simply do output_image = 255 * output and can save this as a b/w image.
If you want to go deeper, you have to add some batch normalization layers (in Keras: https://keras.io/layers/normalization/#batchnormalization) in this case.
From Ian Goodfellow's book, in the batch normalization chapter:
"Very deep models involve the composition of several functions or layers. The gradient tells how to update each parameter, under the assumption that the other layers do not change. In practice, we update all of the layers simultaneously. When we make the update, unexpected results can happen because many functions composed together are changed simultaneously, using updates that were computed under the assumption that the other functions remain constant."
Also, tanh saturates easily, so use it only if you need it :)
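As a hedged illustration of where such layers could go (variable names follow the get_model code above; this is just a sketch, not a drop-in replacement), you can interleave BatchNormalization after the convolutions in each block, before pooling:
from keras.layers import BatchNormalization
conv_3_1 = Conv2D(128, (3, 3), activation='relu', padding='same')(pool_2_2)
conv_3_1 = BatchNormalization()(conv_3_1)
conv_3_2 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv_3_1)
conv_3_2 = BatchNormalization()(conv_3_2)
pool_3_2 = MaxPooling2D((2, 2))(conv_3_2)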
There is a problem that might happen with "relu" when the learning rate is too big.
There is a high chance of all activations going to 0 and getting stuck there, never to change anymore. (When they're at 0, their gradient is also 0.)
Since I'm not an expert on tuning the parameters for "relu" in detail, and my results with "relu" are always bad, I prefer using "sigmoid" or "tanh". (It's worth trying, although there might be some vanishing there...) I keep my images ranged from 0 to 1 and use "binary_crossentropy" as the loss, which is a lot faster than "mae"/"mse" in this case.
Another thing that happened to me was an "apparently" frozen loss function. It turned out the value was changing so little that the displayed decimals weren't enough to show the variation, but after a lot of epochs it found a reasonable way to go down properly. (Probably some kind of saturation indeed, but for me it's still better than getting freezes or NaNs.)
You can introduce recurrent layers like LSTM which would "trap" the errors using gating, potentially improving the situation.