How to set the length of input layer in lstm - keras

I'm building a LSTM model to classify some review data using Keras. Output is just 1 or 0.
I added an word embedding layer before feeding the text data into lstm layer. Part of my code is shown below. Here is max_feature is vocabulary size. Word vector size is 2. Size of each document is maxlen. All document is already padded to the same length.
However, I'm always confused with the length of LSTM layer. Should that be the same length as the length of my documents (maxlen)? Feed in all words in each document, and get an output?.....
There are quite a few online sources explaining LSTM. But in terms of implementation, I feel not many of them give clear explainations...
Really appreciate it if someone could clarify on this.
# max_features: vocabulary size
# word vector size: 2
# maxlen: my document size, already padded to the same length
# Build our model
print('Build model...')
model = Sequential()
model.add(Embedding(input_dim=max_features, output_dim= 2, input_length=maxlen ))
model.add(LSTM(units= ???? ))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])

The number of units in the LSTM is irrelevant to the data dimensions, units is the number of neurons/nodes in the LSTM layer.
LSTM is a recursive network, which parameters are used again and again at the same layer:
'A' represents the LSTM cell, and the amount of 'A's are the same as your input length. Units represents the hidden dimensions of 'A'.

Related

How to understand hidden_states of the returns in BertModel?(huggingface-transformers)

Returns last_hidden_state (torch.FloatTensor of shape (batch_size,
sequence_length, hidden_size)): Sequence of hidden-states at the
output of the last layer of the model.
pooler_output (torch.FloatTensor: of shape (batch_size, hidden_size)):
Last layer hidden-state of the first token of the sequence
(classification token) further processed by a Linear layer and a Tanh
activation function. The Linear layer weights are trained from the
next sentence prediction (classification) objective during
pre-training.
This output is usually not a good summary of the semantic content of
the input, you’re often better with averaging or pooling the sequence
of hidden-states for the whole input sequence.
hidden_states (tuple(torch.FloatTensor), optional, returned when
config.output_hidden_states=True): Tuple of torch.FloatTensor (one for
the output of the embeddings + one for the output of each layer) of
shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the
initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when
config.output_attentions=True): Tuple of torch.FloatTensor (one for
each layer) of shape (batch_size, num_heads, sequence_length,
sequence_length).
Attentions weights after the attention softmax, used to compute the
weighted average in the self-attention heads.
This is from https://huggingface.co/transformers/model_doc/bert.html#bertmodel. Although the description in the document is clear, I still don't understand the hidden_states of returns. There is a tuple, one for the output of the embeddings, and the other for the output of each layer.
Please tell me how to distinguish them, or what is the meaning of them? Thanks very much!![wink~
hidden_states (tuple(torch.FloatTensor), optional, returned when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
For a given token, its input representation is constructed by summing the corresponding token embedding, segment embedding, and position embedding. This input representation is called the initial embedding output which can be found at index 0 of the tuple hidden_states.
This figure explains how the embeddings are calculated.
The remaining 12 elements in the tuple contain the output of the corresponding hidden layer. E.g: the last hidden layer can be found at index 12, which is the 13th item in the tuple. The dimension of both the initial embedding output and the hidden states are [batch_size, sequence_length, hidden_size]. It would be useful to compare the indexing of hidden_states bottom-up with this image from the BERT paper.
last_hidden_state contains the hidden representations for each token in each sequence of the batch. So the size is (batch_size, seq_len, hidden_size).
You can refer to Difference between CLS hidden state and pooled_output for more clarification.
I find the answer in the length of this tuple. The length is (1+num_layers). And the output of the last layer is different from the embedding output, because layer output plus the initial embedding. :D

Request for improvement suggestion on my CNN learning model?

I’m trying to build a classification model for production line. If I understand correctly , it’s possible to use a CNN to classify numerical data .(and not only pictures)
My data is an array of 21 columns  per line:
20 different measurements and the last column is a type . It can be 0 or 1 or 2
each line of the array use a timestamp as index
type 0 represents 80 % of the production, and do not need extra treatment
but type 1 and 2 need extra treatment after production (so I need to clearly identify them)
To recreate something a CNN can use , I created a dataset where each label has for learning data an array made of the last previous 20 lines since it’s position .
So each label has for corresponding learning data , a square array of 20x20 measurements (like a picture ) .
(data already have been normalized using keras ColumnTransformer
after reading about unbalanced dataset , i decided to include only a type 0 each time I found a type 1 or 2 . At the end my dataset size is 18 000 lines , data shape '(18206, 20, 20)'
my learning model is pretty basic and looks like this :
train, test, train_label, test_label = train_test_split(X,y,test_size=0.3,shuffle=True)
##Call CNN model
sizePic = 20
model = Sequential()
model.add(Dense(sizePic*3, input_shape=(sizePic,sizePic,), activation='relu'))
model.add(Dense(sizePic, activation='relu'))
model.add(Flatten())
model.add(Dense(3, activation='softmax'))
# Compile model
sgd = optimizers.SGD(lr=0.03)
model.compile(loss='sparse_categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
self.logger.info(model.summary())
# Fit the model
model.fit(train, train_label, epochs=750, batch_size=200,verbose=1)
# evaluate the model
self.learning_scores = model.evaluate(test, test_label, verbose=2)
self.logger.info("scores %r"%self.learning_scores)
at the end prediction scores are :
scores [0.6088506683505354, 0.7341632843017578]
I have been changing parameters like batch_size and learning rate , but with no big improvement . To my understanding, it's better to start this way than adding layers to the model , is this correct ?
Any suggestion ??
thanks for your time
You are not using any conv layer, only fully connected layers (and don't be afraid of adding some conv layers because they have way less parameters than dense layers)

Embedding vs inserting word vectors directly to input layer

I used gensim to build a word2vec embedding of my corpus.
Currently I'm converting my (padded) input sentences to the word vectors using the gensim model.
This vectors are used as input for the model.
model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(MAX_SEQUENCE_LENGTH, dim)))
model.add(Bidirectional(
LSTM(num_lstm, dropout=0.5, recurrent_dropout=0.4, return_sequences=True))
)
...
model.fit(training_sentences_vectors, training_labels, validation_data=validation_data)
Are there any drawbacks using the word vectors directly without a keras embedding layer?
I'm also currently adding additional (one-hot encoded) tags to the input tokens by concatenating them to each word vector, does this approach make sense?
In your current setup, the drawback will be that you will not be able to set your word vectors to be trainable. You will not be able to fine tune your model for your task.
What I mean by this is that Gensim has only learned the "Language Model". It understands your corpus and its contents. However, it does not know how to optimize for whatever downstream task you are using keras for. Your model's weights will help to fine tune your model, however you will likely experience an increase in performance if you extract the embeddings from gensim, use them to initialize a keras embedding layer, and then pass in indexes instead of word vectors for your input layer.
There's an elegant way to do what you need.
Problem with your solution is that:
the size of the input is large: (batch_size, MAX_SEQUENCE_LENGTH, dim) and may not fit in memory.
You won't be able to train and update the word vectors as per your task
You can instead get away with just: (batch_size, MAX_SEQUENCE_LENGTH). The keras embedding layer allows you to pass in a word index and get a vector. So, 42 -> Embedding Layer -> [3, 5.2, ..., 33].
Conveniently, gensim's w2v model has a function get_keras_embedding which creates the needed embedding layer for you with the trained weights.
gensim_model = # train it or load it
embedding_layer = gensim_model.wv.get_keras_embedding(train_embeddings=True)
embedding_layer.mask_zero = True # No need for a masking layer
model = Sequential()
model.add(embedding_layer) # your embedding layer
model.add(Bidirectional(
LSTM(num_lstm, dropout=0.5, recurrent_dropout=0.4, return_sequences=True))
)
But, you have to make sure the index for a word in the data is the same as the index for the word2vec model.
word2index = {}
for index, word in enumerate(model.wv.index2word):
word2index[word] = index
Use the above word2index dictionary to convert your input data to have the same index as the gensim model.
For example, your data might be:
X_train = [["hello", "there"], ["General", "Kenobi"]]
new_X_train = []
for sent in X_train:
temp_sent = []
for word in sent:
temp_sent.append(word2index[word])
# Add the padding for each sentence. Here I am padding with 0
temp_sent += [0] * (MAX_SEQUENCE_LENGTH - len(temp_sent))
new_X_train.append(temp_sent)
X_train = numpy.as_array(new_X_train)
Now you can use X_train and it will be like: [[23, 34, 0, 0], [21, 63, 0, 0]]
The Embedding Layer will map the index to that vector automatically and train it if needed.
I think this is the best way of doing it but I'll dig into how gensim wants it to be done and update this post if needed.

How to correctly get layer weights from Conv2D in keras?

I have Conv2D layer defines as:
Conv2D(96, kernel_size=(5, 5),
activation='relu',
input_shape=(image_rows, image_cols, 1),
kernel_initializer=initializers.glorot_normal(seed),
bias_initializer=initializers.glorot_uniform(seed),
padding='same',
name='conv_1')
This is the first layer in my network.
Input dimensions are 64 by 160, image is 1 channel.
I am trying to visualize weights from this convolutional layer but not sure how to get them.
Here is how I am doing this now:
1.Call
layer.get_weights()[0]
This returs an array of shape (5, 5, 1, 96). 1 is because images are 1-channel.
2.Take 5 by 5 filters by
layer.get_weights()[0][:,:,:,j][:,:,0]
Very ugly but I am not sure how to simplify this, any comments are very appreciated.
I am not sure in these 5 by 5 squares. Are they filters actually?
If not could anyone please tell how to correctly grab filters from the model?
I tried to display the weights like so only the first 25. I have the same question that you do is this the filter or something else. It doesn't seem to be the same filters that are derived from deep belief networks or stacked RBM's.
Here is the untrained visualized weights:
and here are the trained weights:
Strangely there is no change after training! If you compare them they are identical.
and then the DBN RBM filters layer 1 on top and layer 2 on bottom:
If i set kernel_intialization="ones" then I get filters that look good but the net loss never decreases though with many trial and error changes:
Here is the code to display the 2D Conv Weights / Filters.
ann = Sequential()
x = Conv2D(filters=64,kernel_size=(5,5),input_shape=(32,32,3))
ann.add(x)
ann.add(Activation("relu"))
...
x1w = x.get_weights()[0][:,:,0,:]
for i in range(1,26):
plt.subplot(5,5,i)
plt.imshow(x1w[:,:,i],interpolation="nearest",cmap="gray")
plt.show()
ann.fit(Xtrain, ytrain_indicator, epochs=5, batch_size=32)
x1w = x.get_weights()[0][:,:,0,:]
for i in range(1,26):
plt.subplot(5,5,i)
plt.imshow(x1w[:,:,i],interpolation="nearest",cmap="gray")
plt.show()
---------------------------UPDATE------------------------
So I tried it again with a learning rate of 0.01 instead of 1e-6 and used the images normalized between 0 and 1 instead of 0 and 255 by dividing the images by 255.0. Now the convolution filters are changing and the output of the first convolutional filter looks like so:
The trained filter you'll notice is changed (not by much) with a reasonable learning rate:
Here is image seven of the CIFAR-10 test set:
And here is the output of the first convolution layer:
And if I take the last convolution layer (no dense layers in between) and feed it to a classifier untrained it is similar to classifying raw images in terms of accuracy but if I train the convolution layers the last convolution layer output increases the accuracy of the classifier (random forest).
So I would conclude the convolution layers are indeed filters as well as weights.
In layer.get_weights()[0][:,:,:,:], the dimensions in [:,:,:,:] are x position of the weight, y position of the weight, the n th input to the corresponding conv layer (coming from the previous layer, note that if you try to obtain the weights of first conv layer then this number is 1 because only one input is driven to the first conv layer) and k th filter or kernel in the corresponding layer, respectively. So, the array shape returned by layer.get_weights()[0] can be interpreted as only one input is driven to the layer and 96 filters with 5x5 size are generated. If you want to reach one of the filters, you can type, lets say the 6th filter
print(layer.get_weights()[0][:,:,:,6].squeeze()).
However, if you need the filters of the 2nd conv layer (see model image link attached below), then notice for each of 32 input images or matrices you will have 64 filters. If you want to get the weights of any of them for example weights of the 4th filter generated for the 8th input image, then you should type
print(layer.get_weights()[0][:,:,8,4].squeeze()).
enter image description here

Keras: LSTM with class weights

my question is quite closely related to this question but also goes beyond it.
I am trying to implement the following LSTM in Keras where
the number of timesteps be nb_tsteps=10
the number of input features is nb_feat=40
the number of LSTM cells at each time step is 120
the LSTM layer is followed by TimeDistributedDense layers
From the question referenced above I understand that I have to present the input data as
nb_samples, 10, 40
where I get nb_samples by rolling a window of length nb_tsteps=10 across the original timeseries of shape (5932720, 40). The code is hence
model = Sequential()
model.add(LSTM(120, input_shape=(X_train.shape[1], X_train.shape[2]),
return_sequences=True, consume_less='gpu'))
model.add(TimeDistributed(Dense(50, activation='relu')))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(20, activation='relu')))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(10, activation='relu')))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(3, activation='relu')))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
Now to my question (assuming the above is correct so far):
The binary responses (0/1) are heavily imbalanced and I need to pass a class_weight dictionary like cw = {0: 1, 1: 25} to model.fit(). However I get an exception class_weight not supported for 3+ dimensional targets. This is because I present the response data as (nb_samples, 1, 1). If I reshape it into a 2D array (nb_samples, 1) I get the exception Error when checking model target: expected timedistributed_5 to have 3 dimensions, but got array with shape (5932720, 1).
Thanks a lot for any help!
I think you should use sample_weight with sample_weight_mode='temporal'.
From the Keras docs:
sample_weight: Numpy array of weights for the training samples, used
for scaling the loss function (during training only). You can either
pass a flat (1D) Numpy array with the same length as the input samples
(1:1 mapping between weights and samples), or in the case of temporal
data, you can pass a 2D array with shape (samples, sequence_length),
to apply a different weight to every timestep of every sample. In this
case you should make sure to specify sample_weight_mode="temporal" in
compile().
In your case you would need to supply a 2D array with the same shape as your labels.
If this is still an issue.. I think the TimeDistributed Layer expects and returns a 3D array (kind of similar to if you have return_sequences=True in the regular LSTM layer). Try adding a Flatten() layer or another LSTM layer at the end before the prediction layer.
d = TimeDistributed(Dense(10))(input_from_previous_layer)
lstm_out = Bidirectional(LSTM(10))(d)
output = Dense(1, activation='sigmoid')(lstm_out)
Using temporal is a workaround. Check out this stack. The issue is also documented on github.

Resources