In Keras, LSTM is in the shape of [batch, timesteps, feature]. What if I indicate the input as keras.Input(shape=(20, 1)) and feed a matrix of (100, 20, 1) as input? What's the number of batch that it's considering in this case? Is the batch size 100 with 20 time stems in each batch?
TL;DR
The batch, timestep, features in your case is defined as None, 20, 1, where the batch represents the batch_size parameter passed during model.fit. The model does not need to know this before hand. Therefore, when you define your input layer (or your LSTM layer's input shape), you simply defined (timesteps, features) which is (20, 1). A simple model.summary() would show you that that input size is translated to (None, 20, 1) while creating the computation graph.
Deeper dive into the subject
A good way to understand whats going on is to simply print the summary of your model. Let me take a simple example here and walk you through the steps -
#Creating a simple stacked LSTM model
from tensorflow.keras import layers, Model
import numpy as np
inp = layers.Input((20,1)) #<------
x = layers.LSTM(5, return_sequences=True)(inp)
x = layers.LSTM(4)(x)
out = layers.Dense(1, activation='sigmoid')(x)
model = Model(inp, out)
model.compile(loss='binary_crossentropy')
model.summary()
Model: "model_8"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_10 (InputLayer) [(None, 20, 1)] 0
lstm_14 (LSTM) (None, 20, 5) 140
lstm_15 (LSTM) (None, 4) 160
dense_8 (Dense) (None, 1) 5
=================================================================
Total params: 305
Trainable params: 305
Non-trainable params: 0
_________________________________________________________________
As you see here, the flow of tensors (more specifically how the shapes of tensors change as they flow down the network) are displayed. As you can see, I am using the functional API which allows me to specifically create an input layer of the shape 20,1 which I then pass to the LSTM. But interestingly, you can see that the actual shape of this Input layer is (None, 20, 1). This is the batch, timesteps, features that you are also referring to.
The time steps are 20, and a single feature, so thats easy to understand, however, the None is a placeholder for the batch_size parameter which you define during the model.fit
#Fit model
X_train, y_train = np.random.random((100,20,1)), np.random.random((100,))
model.fit(X_train, y_train, batch_size=10, epochs=2)
Epoch 1/2
10/10 [==============================] - 1s 4ms/step - loss: 0.6938
Epoch 2/2
10/10 [==============================] - 0s 3ms/step - loss: 0.6932
In this example, I set the batch_size to 10. This means, that when you train the model, each "step" will pass batches of the shape (10, 20, 1) to the model and there will be 10 such steps in each epoch, because the overall size of the training data is (100, 20, 1). This is indicated by the 10/10 that you see in front of the progress bar for each epoch.
Another interesting thing to note, is that you dont necessarily need to define the dimensions of the input as long as your obey the basic rules of model training and batch size constraints. Here is an example. Here I define the number of timesteps as None which means that I can now pass variable length timesteps (variable length sentences for an example) to encode using the LSTM layers.
from tensorflow.keras import layers, Model
import numpy as np
inp = layers.Input((None,1)) #<------
x = layers.LSTM(5, return_sequences=True)(inp)
x = layers.LSTM(4)(x)
out = layers.Dense(1, activation='sigmoid')(x)
model = Model(inp, out)
model.compile(loss='binary_crossentropy')
model.summary()
Model: "model_10"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_12 (InputLayer) [(None, None, 1)] 0
lstm_18 (LSTM) (None, None, 5) 140
lstm_19 (LSTM) (None, 4) 160
dense_10 (Dense) (None, 1) 5
=================================================================
Total params: 305
Trainable params: 305
Non-trainable params: 0
_________________________________________________________________
This means that the model doesn't need to know how many timesteps it will have to work with beforehand, similar to the fact that it doesn't need to know what batch_size it would get beforehand. These things can be interpreted during the model.fit or passed as a parameter. Notice the model.summary() simply extends this lack of information around the timesteps dimension to the subsequent layers.
An important note though - LSTMs can work with variable size inputs because all you have to do is pass the timesteps as None in the example above, however, you have to ensure that each batch independently has the same number of time steps. In other words, to work with variable-sized sentences say [(20,1), (25, 1), (20, 1), ...] either use a batch size of 1 so that each batch has a consistent size, or create a generator which creates batches of equal batch_size and combine sentences with constant length. For example the first batch is only 5 (20,1) sentences, the second batch is only 5 (25,1) sentences etc. The second method is faster than the first, but may be more painful to setup.
Bonus
Also, for anyone curious around what is the effect of batch_size on model training, a large batch_size might be very helpful to speed up computation speed as its preferred over decaying the learning rate but it can cause what is known as a Generalization Gap. This topic is well explored in this awesome paper.
These 2 papers should give a lot of clarity around how to use batch_size as a powerful parameter for your model training, which is quite often ignored.
Related
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
I need to use encoder-decoder structure to predict 2D trajectories. As almost all available tutorials are related to NLP -with sparse vectors-, I couldn't be sure about how to adapt the solutions to a continuous data.
In addition to my ignorance in seqence-to-sequence models, embedding process for words confused me more. I have a dataset that consists of 3,000,000 samples each having x-y coordinates (-1, 1) with 125 observations, which means the shape of each sample is (125, 2). I thought I could think of this as 125 words with 2 dimensional already embedded words, but the encoder and the decoder in this Keras Tutorial expect 3D arrays as (num_pairs, max_english_sentence_length, num_english_characters).
I doubt I need to train each sample (125, 2) separately with this model, as the way Google's search bar does with only one word written.
As far as I understood, an encoder is many-to-one type model and a decoder is one-to-many type model. I need to get a memory state c and a hiddenstate h as vectors(?). Then I should use those vectors as input to decoder and extract predictions in the shape of (x,y) as many as I determine as encoder output.
I'd be so thankful if someone could give an example of an encoder-decoder LSTM architecture over the shape of my dataset, especially in terms of dimensions required for encoder-decoder inputs and outputs, particulary on Keras model if possible.
I assume you want to forecast 50 time steps with the 125 previous ones (as an example). I give you the most basic Encoder-Decoder Structure for time Series but it can be improved (with Luong Attention for instance).
from tensorflow.keras import layers,models
input_timesteps=125
input_features=2
output_timesteps=50
output_features=2
units=100
#Input
encoder_inputs = layers.Input(shape=(input_timesteps,input_features))
#Encoder
encoder = layers.LSTM(units, return_state=True, return_sequences=False)
encoder_outputs, state_h, state_c = encoder(encoder_inputs) # because return_sequences=False => encoder_outputs=state_h
#Decoder
decoder = layers.RepeatVector(output_timesteps)(state_h)
decoder_lstm = layers.LSTM(units, return_sequences=True, return_state=False)
decoder = decoder_lstm(decoder, initial_state=[state_h, state_c])
#Output
out = layers.TimeDistributed(Dense(output_features))(decoder)
model = models.Model(encoder_inputs, out)
So the core idea here is :
Encode the time series into two states : state_h and state_c. Check this to understand the work of LSTM cells.
Repeat state_h the number of time steps you want to forecast
Decode using an LSTM with initial states calculated by the encoder
Use a Dense layer to shape the number of needed features for each time steps
I advise you to test our achtecture and visualize them with model.summary() and tf.keras.utils.plot_model(mode,show_shapes=True). It gives you good representations like, for the summary :
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_5 (InputLayer) [(None, 125, 2)] 0
__________________________________________________________________________________________________
lstm_8 (LSTM) [(None, 100), (None, 41200 input_5[0][0]
__________________________________________________________________________________________________
repeat_vector_4 (RepeatVector) (None, 50, 100) 0 lstm_8[0][1]
__________________________________________________________________________________________________
lstm_9 (LSTM) (None, 50, 100) 80400 repeat_vector_4[0][0]
lstm_8[0][1]
lstm_8[0][2]
__________________________________________________________________________________________________
time_distributed_4 (TimeDistrib (None, 50, 2) 202 lstm_9[0][0]
==================================================================================================
Total params: 121,802
Trainable params: 121,802
Non-trainable params: 0
__________________________________________________________________________________________________
and the model plotted :
I have been trying to compute number of parameters in LSTM cell in Keras. I created two models one with LSTM and other with CuDNNLSTM.
Partial summary of models are as
CuDNNLSTM Model:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 300) 192000
_________________________________________________________________
bidirectional (Bidirectional (None, None, 600) 1444800
LSTM model
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, None, 300) 192000
_________________________________________________________________
bidirectional (Bidirectional (None, None, 600) 1442400
Number of parameters in LSTM is following the formula for lstm parameter computation available all over the internet. However, CuDNNLSTM has 2400 extra parameters.
What is the cause of these extra parameters?
code
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
from tensorflow.compat.v1.keras.models import Sequential
from tensorflow.compat.v1.keras.layers import CuDNNLSTM, Bidirectional, Embedding, LSTM
model = Sequential()
model.add(Embedding(640, 300))
model.add(Bidirectional(<LSTM type>(300, return_sequences=True)))
LSTM parameters can be grouped in 3 categories: input weight matrices (W), recurrent weight matrices (R), biases (b). Part of the LSTM cell's computation is W*x + b_i + R*h + b_r where b_i are input biases and b_r are recurrent biases.
If you let b = b_i + b_r, you could rewrite the above expression as W*x + R*h + b. In doing so, you've eliminated the need to keep two separate bias vectors (b_i and b_r) and instead, you only need to store one vector (b).
cuDNN sticks with the original mathematical formulation and stores b_i and b_r separately. Keras does not; it only stores b. That's why cuDNN's LSTM has more parameters than Keras.
Currently the goal of my task is to be able to, given input of two sentences,
output two labelings of the phrasal-synonyms of the two sentences.
e.g. I am [happy], I am [having a good time]
However, I can never get my dense layer to output the dimension I want, which is [batch size, max_sentence_length, num_classes].
I've looked at some other posts saying that dense layer only allows the "units" parameter to be 1-D, so I should flatten my labels before I get it through the dense layer, and reshape my label matrix accordingly for loss function calculation, but I don't know if this is actually the appropriate way to approach it.
My current model is something like this:
Two inputs: (None, 58)
GLoVE embedding: (None, 58, 300)
BiLSTM (without return sequences): (None, 256)
Dense: (None, 290)
I've tried getting this to run too, but the performance is abysmal so I am questioning myself if I did anything incorrectly...
My intended result was originally to have the dense layer output dimension (None, 58, 5), as I have sentence length 58 and 5 different labels, thus the 290 in dense: 58*5=290.
I am trying to predict a time series with LSTM and am writing my code in Python by using Keras.
I have 30 features as input (continuous value) and a binary output.
I would like to use the 20 previous timesteps (t-20, t-19, .. , t-1) of each input feature in order to predict the output of next timestep (t+1).
My batch size is fixed at 52. What does this exactly mean?
I don't understand how to define the shape of the input layer.
The stacked LSTM example in the Keras documentation says that the last dimension of the 3D tensor will be 'data_dim'.
Is it input dimension or output dimension?
If this is output dimension, then I can't use more than one input feature as in my case the input_shape will be (batch_size=52,time_step=20,data_dim=1).
Also, in case data_dim is input shape, then I have tried to define a four layers-LSTM and the model shape results to be like this.
Layer (type) Output Shape Param #
================================================================= input_2 (InputLayer) (52, 20, 30) 0
_________________________________________________________________ lstm_3 (LSTM) (52, 20, 128) 81408
_________________________________________________________________ lstm_4 (LSTM) (52, 128) 131584
_________________________________________________________________ dense_2 (Dense) (52, 1) 129
================================================================= Total params: 213,121 Trainable params: 213,121 Non-trainable params: 0
Does this architecture make sense? Am I making some obvious mistakes?
My snippet of code is the one below:
input_layer=Input(batch_shape=(batch_size,input_timesteps,input_dims))
lstm1=LSTM(num_neurons,activation = 'relu',dropout=0.0,stateful=False,return_sequences=True)(input_layer)
lstm2=LSTM(num_neurons,activation = 'relu',dropout=0.0,stateful=False,return_sequences=False)(lstm1)
output_layer=Dense(1, activation='sigmoid')(lstm2)
model=Model(inputs=input_layer,outputs=output_layer)
I am getting very poor results and thus trying to debug each step.
If you want to use deep learning techniques you should try to overfit first and then reduce the complexity till you reach a break even point in terms of both neural complexity, training error and test error.
You are actually using a larger feature space in the hidden layer, are you sure your data are able to fit this?
Do you have enough rows to let the model learn this complex representation?
Otherwise I would suggest you something like this, in order to extrapolate the most important dimensions:
num_neurons1 = int(input_dims/2)
num_neurons2 = int(input_dims/4)
input_layer=Input(batch_shape=(batch_size, input_timesteps, input_dims))
lstm1=LSTM(num_neurons, activation = 'relu', dropout=0.0, stateful=False, return_sequences=True, kernel_initializer="he_normal")(input_layer)
lstm2=LSTM(num_neurons2, activation = 'relu', dropout=0.0, stateful=False, return_sequences=False, kernel_initializer="he_normal")(lstm1)
output_layer=Dense(1, activation='sigmoid')(lstm2)
model=Model(inputs=input_layer,outputs=output_layer)
Also, you are using relu as activation function.
Does it fit your data? Would be better you have only positive data after rescaling & normalization.
In case it does fit, you can also use a proper kernel initialization.
To better understand the problem, you could also post the optimizer parameters and the behaviour while training during epochs.
I have created the following SimpleRNN using Keras:
X = X.reshape((X.shape[0], X.shape[1], 1))
tr_X, ts_X, tr_y, ts_y = train_test_split(X, y, train_size=.8)
batch_size = 1000
print('RNN model...')
model = Sequential()
model.add(SimpleRNN(64, activation='relu', batch_input_shape=(batch_size, X.shape[1], 1)))
model.add(Dense(1, activation='relu'))
print('Training...')
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print (model.summary())
print ('\n')
model.fit(tr_X, tr_y,
batch_size=batch_size, epochs=1,
shuffle=True, validation_data=(ts_X, ts_y))
For the model summary, I get the following:
Layer (type) Output Shape Param #
=================================================================
simple_rnn_1 (SimpleRNN) (1000, 64) 4224
_________________________________________________________________
dense_1 (Dense) (1000, 1) 65
=================================================================
Total params: 4,289
Trainable params: 4,289
Non-trainable params: 0
_________________________________________________________________
Given that I have a dataset of 10,000 samples and 64 features. My goal is to generate a classification model by training it using this dataset (class labels are binary 0 and 1). Now, I am trying to understand what is going on here. As seen in 'Output Shape' column, the simple_rnn_1 has (1000, 64). I interpret it as 1000 rows (which is the batch) and 64 features. Assuming the code above is logically correct, my questions is:
How does RNN handle this matrix (i.e., (1000,64))? Does it input
each column something like this figure?
Should SimpleRNN() units always be equal to the number of features?
Thank you
In the code, you defined batch_input_shape to be with shape: (batch_size, X.shape[1], 1)
which means that you will insert to the RNN, batch_size examples, each example contains X.shape[1] time-stamps (number of pink boxes in your image) and each time-stamp is shape 1 (scalar).
So yes, input shape of (1000,64,1) will be exactly like you said - each column will be input to the RNN.
No! units will be your output dim. Usually more units means more complex network (just like in regular neural network) -> more parameters to learn.
Units will be the shape of the RNN's internal state.
(So, in your example, if you declare units=2000 your output will be (1000,2000).)