Why CuDNNLSTM has more parameres than LSTM in keras?

Why CuDNNLSTM has more parameres than LSTM in keras? - keras

I have been trying to compute number of parameters in LSTM cell in Keras. I created two models one with LSTM and other with CuDNNLSTM.
Partial summary of models are as
CuDNNLSTM Model:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 300) 192000
_________________________________________________________________
bidirectional (Bidirectional (None, None, 600) 1444800
LSTM model
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, None, 300) 192000
_________________________________________________________________
bidirectional (Bidirectional (None, None, 600) 1442400
Number of parameters in LSTM is following the formula for lstm parameter computation available all over the internet. However, CuDNNLSTM has 2400 extra parameters.
What is the cause of these extra parameters?
code
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
from tensorflow.compat.v1.keras.models import Sequential
from tensorflow.compat.v1.keras.layers import CuDNNLSTM, Bidirectional, Embedding, LSTM
model = Sequential()
model.add(Embedding(640, 300))
model.add(Bidirectional(<LSTM type>(300, return_sequences=True)))

LSTM parameters can be grouped in 3 categories: input weight matrices (W), recurrent weight matrices (R), biases (b). Part of the LSTM cell's computation is W*x + b_i + R*h + b_r where b_i are input biases and b_r are recurrent biases.
If you let b = b_i + b_r, you could rewrite the above expression as W*x + R*h + b. In doing so, you've eliminated the need to keep two separate bias vectors (b_i and b_r) and instead, you only need to store one vector (b).
cuDNN sticks with the original mathematical formulation and stores b_i and b_r separately. Keras does not; it only stores b. That's why cuDNN's LSTM has more parameters than Keras.

Related

Meaning of 2D input in Keras LSTM

In Keras, LSTM is in the shape of [batch, timesteps, feature]. What if I indicate the input as keras.Input(shape=(20, 1)) and feed a matrix of (100, 20, 1) as input? What's the number of batch that it's considering in this case? Is the batch size 100 with 20 time stems in each batch?

TL;DR
The batch, timestep, features in your case is defined as None, 20, 1, where the batch represents the batch_size parameter passed during model.fit. The model does not need to know this before hand. Therefore, when you define your input layer (or your LSTM layer's input shape), you simply defined (timesteps, features) which is (20, 1). A simple model.summary() would show you that that input size is translated to (None, 20, 1) while creating the computation graph.
Deeper dive into the subject
A good way to understand whats going on is to simply print the summary of your model. Let me take a simple example here and walk you through the steps -
#Creating a simple stacked LSTM model
from tensorflow.keras import layers, Model
import numpy as np
inp = layers.Input((20,1)) #<------
x = layers.LSTM(5, return_sequences=True)(inp)
x = layers.LSTM(4)(x)
out = layers.Dense(1, activation='sigmoid')(x)
model = Model(inp, out)
model.compile(loss='binary_crossentropy')
model.summary()
Model: "model_8"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_10 (InputLayer) [(None, 20, 1)] 0
lstm_14 (LSTM) (None, 20, 5) 140
lstm_15 (LSTM) (None, 4) 160
dense_8 (Dense) (None, 1) 5
=================================================================
Total params: 305
Trainable params: 305
Non-trainable params: 0
_________________________________________________________________
As you see here, the flow of tensors (more specifically how the shapes of tensors change as they flow down the network) are displayed. As you can see, I am using the functional API which allows me to specifically create an input layer of the shape 20,1 which I then pass to the LSTM. But interestingly, you can see that the actual shape of this Input layer is (None, 20, 1). This is the batch, timesteps, features that you are also referring to.
The time steps are 20, and a single feature, so thats easy to understand, however, the None is a placeholder for the batch_size parameter which you define during the model.fit
#Fit model
X_train, y_train = np.random.random((100,20,1)), np.random.random((100,))
model.fit(X_train, y_train, batch_size=10, epochs=2)
Epoch 1/2
10/10 [==============================] - 1s 4ms/step - loss: 0.6938
Epoch 2/2
10/10 [==============================] - 0s 3ms/step - loss: 0.6932
In this example, I set the batch_size to 10. This means, that when you train the model, each "step" will pass batches of the shape (10, 20, 1) to the model and there will be 10 such steps in each epoch, because the overall size of the training data is (100, 20, 1). This is indicated by the 10/10 that you see in front of the progress bar for each epoch.
Another interesting thing to note, is that you dont necessarily need to define the dimensions of the input as long as your obey the basic rules of model training and batch size constraints. Here is an example. Here I define the number of timesteps as None which means that I can now pass variable length timesteps (variable length sentences for an example) to encode using the LSTM layers.
from tensorflow.keras import layers, Model
import numpy as np
inp = layers.Input((None,1)) #<------
x = layers.LSTM(5, return_sequences=True)(inp)
x = layers.LSTM(4)(x)
out = layers.Dense(1, activation='sigmoid')(x)
model = Model(inp, out)
model.compile(loss='binary_crossentropy')
model.summary()
Model: "model_10"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_12 (InputLayer) [(None, None, 1)] 0
lstm_18 (LSTM) (None, None, 5) 140
lstm_19 (LSTM) (None, 4) 160
dense_10 (Dense) (None, 1) 5
=================================================================
Total params: 305
Trainable params: 305
Non-trainable params: 0
_________________________________________________________________
This means that the model doesn't need to know how many timesteps it will have to work with beforehand, similar to the fact that it doesn't need to know what batch_size it would get beforehand. These things can be interpreted during the model.fit or passed as a parameter. Notice the model.summary() simply extends this lack of information around the timesteps dimension to the subsequent layers.
An important note though - LSTMs can work with variable size inputs because all you have to do is pass the timesteps as None in the example above, however, you have to ensure that each batch independently has the same number of time steps. In other words, to work with variable-sized sentences say [(20,1), (25, 1), (20, 1), ...] either use a batch size of 1 so that each batch has a consistent size, or create a generator which creates batches of equal batch_size and combine sentences with constant length. For example the first batch is only 5 (20,1) sentences, the second batch is only 5 (25,1) sentences etc. The second method is faster than the first, but may be more painful to setup.
Bonus
Also, for anyone curious around what is the effect of batch_size on model training, a large batch_size might be very helpful to speed up computation speed as its preferred over decaying the learning rate but it can cause what is known as a Generalization Gap. This topic is well explored in this awesome paper.
These 2 papers should give a lot of clarity around how to use batch_size as a powerful parameter for your model training, which is quite often ignored.

How to access intermediate layers in a pretrained model after retraining it on a new image dataset

I trained a VGG16 model on a labeled image dataset using the categorical crossentropy. I removed the fully connected layer and replaced it with new layers, as follows:
vgg16_model = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
model = Sequential()
model.add(vgg16_model)
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(70, activation='softmax'))
Then I trained the full model on my dataset. I want to use it now as a feature extraction model by extracting image features using any of the intermediate layers that belong to vgg16_model.
After training and saving the model, I can only access the layers that I added Dense and Dropout using pop() function to remove them and only keep the trained feature extractor model (vgg16).
i = 0
while i < 4:
model.pop()
This keeps only VGG16 model. However, the layers inside are not accessible, I tried:
new_model = Model(model.input,model.layers[-1].output)
But I get this error:
ValueError: Graph disconnected: cannot obtain value for tensor KerasTensor(type_spec=TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name='input_9'), name='input_9', description="created by layer 'input_9'") at layer "block1_conv1". The following previous layers were accessed without issue: []
How can I modify my model to consider early k layers at a given time, then use the model for prediction?

Defining the keras model in the way you did, unfortunately has many complications when we tried to features from intermediate layer. I've raised a ticket, see here.
As you want to extract image features using any of the intermediate layers that belong to vgg16_model, you can try the following approach.
# [good practice]: check first to know name and shape
# for layer in model.layers:
# print(layer.name, layer.output_shape)
# vgg16 (None, 7, 7, 512)
# flatten (None, 25088)
# dense (None, 256)
# dropout (None, 256)
# dense_1 (None, 70)
# get the trained model first
trained_vgg16 = keras.Model(
inputs=model.get_layer(name="vgg16").inputs,
outputs=model.get_layer(name="vgg16").outputs,
)
x = tf.ones((1, 224, 224, 3))
y = trained_vgg16(x)
y.shape
TensorShape([1, 7, 7, 512])
Next, use this trained_vgg16 model to build the target model. For example,
# extract only 1 intermediate layer
feature_extractor_block3_pool = keras.Model(
inputs=trained_vgg16.inputs,
outputs=trained_vgg16.get_layer(name="block3_pool").output,
)
# or, 2 based on purpose.
feature_extractor_block3_pool_block4_conv3 = keras.Model(
inputs=trained_vgg16.inputs,
outputs=[
trained_vgg16.get_layer(name="block3_pool").output,
trained_vgg16.get_layer(name="block4_conv3").output,
],
)
# or, all
feature_extractor = keras.Model(
inputs=trained_vgg16.inputs,
outputs=[layer.output for layer in trained_vgg16.layers],
)

How LeakyReLU layer works without setting the number of units?

When building Sequential model, I notice there is a difference between adding relu layer and LeakyReLU layer.
test = Sequential()
test.add(Dense(1024, activation="relu"))
test.add(LeakyReLU(0.2))
Why cant we add layer with activation = "LeakyReLU" ? (LeakyReLU is not a string which keras can work with)
When adding relu layer, we set the number of units (1024 in my example)
Why can't we do the same for LeakyReLU ?
I was sure that the different between relu and LeakyReLU is the method behavior, but it seems more than that.

We could specify the activation function in the dense layer itself, by using aliases like activation='relu', which would use the default keras parameters for relu. There is no such aliases available in keras, for LeakyRelu activation function. We have to use tf.keras.layers.LeakyRelu or tf.nn.leaky_relu.
We cannot set number of units in Relu layer, it just takes the previous output tensor and applies the relu activation function on it. You have specified the number of units for the Dense layer not the relu layer. When we specify Dense(1024, activation="relu") we multiply the inputs with weights, add biases and apply relu function on the output (all of this is mentioned on a single line). From the method mentioned on step 1, this process is done in 2 stages firstly to multiply weights, add biases and then to apply the LeakyRelu activation function (mentioned in 2 lines).

import tensorflow as tf
test = Sequential()
test.add(Dense(1024, input_dim=784, activation="relu", name="First"))
test.add(Dense(512, activation=tf.keras.layers.LeakyReLU(alpha=0.01), name="middle"))
test.add(Dense(1, activation='sigmoid', name="Last"))
test.compile(loss='binary_crossentropy', optimizer="adam")
print(test.summary())
ouput:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
First (Dense) (None, 1024) 803840
_________________________________________________________________
middle (Dense) (None, 512) 524800
_________________________________________________________________
Last (Dense) (None, 1) 513
=================================================================

Enquiry about the input & output shape of LSTMs in Keras

I am trying to predict a time series with LSTM and am writing my code in Python by using Keras.
I have 30 features as input (continuous value) and a binary output.
I would like to use the 20 previous timesteps (t-20, t-19, .. , t-1) of each input feature in order to predict the output of next timestep (t+1).
My batch size is fixed at 52. What does this exactly mean?
I don't understand how to define the shape of the input layer.
The stacked LSTM example in the Keras documentation says that the last dimension of the 3D tensor will be 'data_dim'.
Is it input dimension or output dimension?
If this is output dimension, then I can't use more than one input feature as in my case the input_shape will be (batch_size=52,time_step=20,data_dim=1).
Also, in case data_dim is input shape, then I have tried to define a four layers-LSTM and the model shape results to be like this.
Layer (type) Output Shape Param #
================================================================= input_2 (InputLayer) (52, 20, 30) 0
_________________________________________________________________ lstm_3 (LSTM) (52, 20, 128) 81408
_________________________________________________________________ lstm_4 (LSTM) (52, 128) 131584
_________________________________________________________________ dense_2 (Dense) (52, 1) 129
================================================================= Total params: 213,121 Trainable params: 213,121 Non-trainable params: 0
Does this architecture make sense? Am I making some obvious mistakes?
My snippet of code is the one below:
input_layer=Input(batch_shape=(batch_size,input_timesteps,input_dims))
lstm1=LSTM(num_neurons,activation = 'relu',dropout=0.0,stateful=False,return_sequences=True)(input_layer)
lstm2=LSTM(num_neurons,activation = 'relu',dropout=0.0,stateful=False,return_sequences=False)(lstm1)
output_layer=Dense(1, activation='sigmoid')(lstm2)
model=Model(inputs=input_layer,outputs=output_layer)
I am getting very poor results and thus trying to debug each step.

If you want to use deep learning techniques you should try to overfit first and then reduce the complexity till you reach a break even point in terms of both neural complexity, training error and test error.
You are actually using a larger feature space in the hidden layer, are you sure your data are able to fit this?
Do you have enough rows to let the model learn this complex representation?
Otherwise I would suggest you something like this, in order to extrapolate the most important dimensions:
num_neurons1 = int(input_dims/2)
num_neurons2 = int(input_dims/4)
input_layer=Input(batch_shape=(batch_size, input_timesteps, input_dims))
lstm1=LSTM(num_neurons, activation = 'relu', dropout=0.0, stateful=False, return_sequences=True, kernel_initializer="he_normal")(input_layer)
lstm2=LSTM(num_neurons2, activation = 'relu', dropout=0.0, stateful=False, return_sequences=False, kernel_initializer="he_normal")(lstm1)
output_layer=Dense(1, activation='sigmoid')(lstm2)
model=Model(inputs=input_layer,outputs=output_layer)
Also, you are using relu as activation function.
Does it fit your data? Would be better you have only positive data after rescaling & normalization.
In case it does fit, you can also use a proper kernel initialization.
To better understand the problem, you could also post the optimizer parameters and the behaviour while training during epochs.

Understanding SimpleRNN process

I have created the following SimpleRNN using Keras:
X = X.reshape((X.shape[0], X.shape[1], 1))
tr_X, ts_X, tr_y, ts_y = train_test_split(X, y, train_size=.8)
batch_size = 1000
print('RNN model...')
model = Sequential()
model.add(SimpleRNN(64, activation='relu', batch_input_shape=(batch_size, X.shape[1], 1)))
model.add(Dense(1, activation='relu'))
print('Training...')
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print (model.summary())
print ('\n')
model.fit(tr_X, tr_y,
batch_size=batch_size, epochs=1,
shuffle=True, validation_data=(ts_X, ts_y))
For the model summary, I get the following:
Layer (type) Output Shape Param #
=================================================================
simple_rnn_1 (SimpleRNN) (1000, 64) 4224
_________________________________________________________________
dense_1 (Dense) (1000, 1) 65
=================================================================
Total params: 4,289
Trainable params: 4,289
Non-trainable params: 0
_________________________________________________________________
Given that I have a dataset of 10,000 samples and 64 features. My goal is to generate a classification model by training it using this dataset (class labels are binary 0 and 1). Now, I am trying to understand what is going on here. As seen in 'Output Shape' column, the simple_rnn_1 has (1000, 64). I interpret it as 1000 rows (which is the batch) and 64 features. Assuming the code above is logically correct, my questions is:
How does RNN handle this matrix (i.e., (1000,64))? Does it input
each column something like this figure?
Should SimpleRNN() units always be equal to the number of features?
Thank you

In the code, you defined batch_input_shape to be with shape: (batch_size, X.shape[1], 1)
which means that you will insert to the RNN, batch_size examples, each example contains X.shape[1] time-stamps (number of pink boxes in your image) and each time-stamp is shape 1 (scalar).
So yes, input shape of (1000,64,1) will be exactly like you said - each column will be input to the RNN.
No! units will be your output dim. Usually more units means more complex network (just like in regular neural network) -> more parameters to learn.
Units will be the shape of the RNN's internal state.
(So, in your example, if you declare units=2000 your output will be (1000,2000).)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Why CuDNNLSTM has more parameres than LSTM in keras? - keras

Related

Meaning of 2D input in Keras LSTM

How to access intermediate layers in a pretrained model after retraining it on a new image dataset

How LeakyReLU layer works without setting the number of units?

Enquiry about the input & output shape of LSTMs in Keras

Understanding SimpleRNN process

Categories

Resources