Embedding vs inserting word vectors directly into the input layer - keras

I used gensim to build a word2vec embedding of my corpus.
Currently I'm converting my (padded) input sentences to word vectors using the gensim model.
These vectors are then used as input to the model.
model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(MAX_SEQUENCE_LENGTH, dim)))
model.add(Bidirectional(
    LSTM(num_lstm, dropout=0.5, recurrent_dropout=0.4, return_sequences=True)
))
...
model.fit(training_sentences_vectors, training_labels, validation_data=validation_data)
Are there any drawbacks to using the word vectors directly, without a keras embedding layer?
I'm also currently adding additional (one-hot encoded) tags to the input tokens by concatenating them to each word vector. Does this approach make sense?

In your current setup, the drawback is that you will not be able to set your word vectors to be trainable, so you cannot fine-tune them for your task.
What I mean by this is that gensim has only learned a language model: it understands your corpus and its contents, but it does not know how to optimize for whatever downstream task you are using Keras for. The rest of your model's weights will still adapt to the task, but you will likely see better performance if you extract the embeddings from gensim, use them to initialize a Keras Embedding layer, and then pass in word indices instead of word vectors at your input layer.
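For illustration, here is a minimal sketch of that approach, assuming gensim 3.x/4.x (where the trained weight matrix is exposed as gensim_model.wv.vectors) and reusing the question's variable names (MAX_SEQUENCE_LENGTH, num_lstm); the exact settings are placeholders, not a definitive implementation:
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM

# Weight matrix of shape (vocab_size, dim) taken from the trained gensim model
embedding_weights = gensim_model.wv.vectors

model = Sequential()
model.add(Embedding(
    input_dim=embedding_weights.shape[0],
    output_dim=embedding_weights.shape[1],
    weights=[embedding_weights],          # initialize with the gensim vectors
    input_length=MAX_SEQUENCE_LENGTH,
    mask_zero=True,                       # only correct if index 0 is reserved for padding
    trainable=True))                      # allow the vectors to be fine-tuned
model.add(Bidirectional(
    LSTM(num_lstm, dropout=0.5, recurrent_dropout=0.4, return_sequences=True)))
# model.fit now expects integer word indices of shape (batch_size, MAX_SEQUENCE_LENGTH)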

There's an elegant way to do what you need.
The problems with your current solution are that:
the input is large, (batch_size, MAX_SEQUENCE_LENGTH, dim), and may not fit in memory;
you won't be able to train and update the word vectors for your task.
You can instead get away with just: (batch_size, MAX_SEQUENCE_LENGTH). The keras embedding layer allows you to pass in a word index and get a vector. So, 42 -> Embedding Layer -> [3, 5.2, ..., 33].
Conveniently, gensim's w2v model has a function get_keras_embedding which creates the needed embedding layer for you with the trained weights.
gensim_model = ...  # train it or load it
embedding_layer = gensim_model.wv.get_keras_embedding(train_embeddings=True)
embedding_layer.mask_zero = True # No need for a masking layer
model = Sequential()
model.add(embedding_layer) # your embedding layer
model.add(Bidirectional(
    LSTM(num_lstm, dropout=0.5, recurrent_dropout=0.4, return_sequences=True)
))
But you have to make sure that the index of a word in your data is the same as its index in the word2vec model.
word2index = {}
for index, word in enumerate(gensim_model.wv.index2word):
    word2index[word] = index
Use the above word2index dictionary to convert your input data to the same indices as the gensim model.
For example, your data might be:
X_train = [["hello", "there"], ["General", "Kenobi"]]
new_X_train = []
for sent in X_train:
    temp_sent = []
    for word in sent:
        temp_sent.append(word2index[word])
    # Add the padding for each sentence. Here I am padding with 0
    temp_sent += [0] * (MAX_SEQUENCE_LENGTH - len(temp_sent))
    new_X_train.append(temp_sent)
X_train = numpy.asarray(new_X_train)
Now you can use X_train and it will be like: [[23, 34, 0, 0], [21, 63, 0, 0]]
The Embedding Layer will map the index to that vector automatically and train it if needed.
I think this is the best way of doing it but I'll dig into how gensim wants it to be done and update this post if needed.

Related

numpy array for sequential network: varying sequence length

I have a recurrent network (RNN) whose task is to learn to classify vectors (float32) into two classes. My model is really simple so far:
model = Sequential([
    SimpleRNN(units=10, input_shape=(None, len_vector)),
    Dense(1, activation="relu")
])
model.compile(loss='mse', optimizer='Adam', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=30)
To train this network, I create a dataset with 1000 instances of sequences of vectors. When I create sequences that all have the same length, the training works perfectly and the dataset has shape:
[<number of sequences>, <number of vectors in each sequence>, <number of floats in each vector>]
The problem is that my model must be able to work on sequences of varying length, and I don't know how (or even whether it is possible) to create a numpy array where one dimension is not constant.
While searching for a solution, I saw that setting the array's dtype=object makes it possible to assign lists of different shapes to the elements of a numpy array, but the keras model only accepts dtype="float32".
Is there a way I can build this numpy array dataset? Should I change the algorithm used to train the model? Or is the only solution to pad the sequences with null vectors to unify their length?
(Thanks for the help. I'm fairly new to deep learning so I apologize if I'm asking for something obvious.)
Use Ragged Tensors; they let you work with variable-length inputs:
import numpy as np
import tensorflow as tf

_input = tf.keras.layers.Input(shape=(None, 100))
lstm = tf.keras.layers.LSTM(20)(_input)
func = tf.keras.backend.function(inputs=_input, outputs=lstm)

# Each sample has a different number of timesteps (34, 55, 60, 70)
rt = tf.ragged.constant([np.random.randn(1, 34, 100),
                         np.random.randn(1, 55, 100),
                         np.random.randn(1, 60, 100),
                         np.random.randn(1, 70, 100)])
func(rt[1])
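Alternatively, here is a minimal sketch of wiring a ragged input into a full Keras model, assuming a recent TensorFlow 2.x release in which Keras recurrent layers accept ragged tensors; len_vector is the question's per-vector size, and the sigmoid output with binary cross-entropy is the usual choice for a two-class problem rather than anything from the original answer:
import tensorflow as tf

# Declare the input as ragged: each sample may have a different number of timesteps
inputs = tf.keras.Input(shape=(None, len_vector), ragged=True)
x = tf.keras.layers.SimpleRNN(units=10)(inputs)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# X_train must then be a tf.RaggedTensor, e.g. built with tf.ragged.constant
# from a nested Python list of shape (num_samples, timesteps_i, len_vector)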

How to combine a POS tag feature with the word vector obtained from a pretrained gensim word2vec and use it in an embedding layer in keras

I have already pretrained a word2vec model in gensim. In Keras, I want to combine the word vector obtained from the pretrained word2vec model with that word's POS tag feature, which I encode as a one-hot vector. I think this requires an embedding matrix, so I want to build an embedding layer in Keras that achieves this and can be used in further layers (LSTM). Can you tell me in detail how to do this?
Just merge both features. Take care of the shapes when merging: your word embedding layer output has 3 dimensions, while your POS input has 2, so you have to reshape the POS input to 3 dimensions before concatenating.
from keras.layers import Input, Concatenate, Reshape
from keras.layers import Embedding
max_len = 100
word_in = Input(shape=(max_len,))
emb_word = Embedding(1000, 300, input_length=max_len)(word_in)
pos_ohe = Input(shape=(max_len,))
pos_in = Reshape((max_len, 1))(pos_ohe)  # make the POS input 3-dimensional
x = Concatenate()([emb_word, pos_in])    # shape: (batch, max_len, 300 + 1)
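To complete the picture, here is a minimal sketch of feeding the concatenated tensor into an LSTM and building the model; the layer sizes and the single sigmoid output are placeholders, not part of the original answer:
from keras.layers import LSTM, Dense
from keras.models import Model

# The concatenated (batch, max_len, 301) tensor feeds further layers as usual
h = LSTM(64)(x)
out = Dense(1, activation='sigmoid')(h)

model = Model(inputs=[word_in, pos_ohe], outputs=out)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.fit([word_indices, pos_tags], labels, ...) with both inputs of shape (batch, max_len)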

How to set the length of input layer in lstm

I'm building an LSTM model to classify some review data using Keras. The output is just 1 or 0.
I added a word embedding layer before feeding the text data into the LSTM layer. Part of my code is shown below. Here max_features is the vocabulary size, the word vector size is 2, and the size of each document is maxlen. All documents are already padded to the same length.
However, I'm always confused about the length of the LSTM layer. Should it be the same as the length of my documents (maxlen), i.e. feed in all the words of each document and get an output?
There are quite a few online sources explaining LSTMs, but in terms of implementation, I feel not many of them give clear explanations...
Really appreciate it if someone could clarify on this.
# max_features: vocabulary size
# word vector size: 2
# maxlen: my document size, already padded to the same length
# Build our model
print('Build model...')
model = Sequential()
model.add(Embedding(input_dim=max_features, output_dim=2, input_length=maxlen))
model.add(LSTM(units= ???? ))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
The number of units in the LSTM is unrelated to the length of your input; units is the number of neurons/nodes in the LSTM layer, i.e. the dimensionality of its hidden state.
An LSTM is a recurrent network whose parameters are reused again and again within the same layer: in the usual unrolled diagram, 'A' represents the LSTM cell, the number of 'A's equals your input length (maxlen), and units is the hidden dimension of 'A'.
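For example, here is a minimal sketch showing that units only sets the size of the LSTM's output, while maxlen sets how many timesteps it runs over; units=64 and the other placeholder values are arbitrary choices, not from the question:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

max_features = 20000   # placeholder vocabulary size
maxlen = 80            # placeholder document length

model = Sequential()
model.add(Embedding(input_dim=max_features, output_dim=2, input_length=maxlen))
model.add(LSTM(units=64))
model.add(Dense(1, activation='sigmoid'))
model.summary()
# Embedding output: (None, 80, 2) -> the LSTM runs over maxlen timesteps
# LSTM output:      (None, 64)    -> 64 comes from units, not from maxlen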

How to use the result of embedding with mask_zero=True in keras

In Keras, I want to calculate the mean of the nonzero embedding outputs.
I wonder what the difference is between mask_zero=True and False in the Embedding layer.
I tried the code below:
input_data = Input(shape=(5,), dtype='int32', name='input')
embedding_layer = Embedding(1000, 24, input_length=5, mask_zero=True, name='embedding')
out = embedding_layer(input_data)
def antirectifier(x):
    x = K.mean(x, axis=1, keepdims=True)
    return x

def antirectifier_output_shape(input_shape):
    shape = list(input_shape)
    return tuple(shape)

out = Lambda(antirectifier, output_shape=antirectifier_output_shape, name='lambda')(out)
But it seems that the result is the mean over all elements. How can I calculate the mean over only the nonzero inputs?
From the function's documentation:
If this is True then all subsequent layers in the model need to support masking.
Your Lambda layer doesn't support masking; recurrent layers in Keras, for example, do. If you set mask_zero=True in your embedding, then all the 0 indices that you feed to the embedding layer will be propagated as "masked", and the following layers that are able to understand this mask will make use of it.
Basically, if you build a "mean" layer that grabs the mask and computes the average only over the non-masked values, you will get the desired result; a sketch of such a layer is shown below.
You can find here a way to build your lambda layers so that they support masking.
I hope it helps.
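Here is a minimal sketch of such a masked-mean layer, written against tf.keras (TensorFlow 2.x); the class name MaskedMean is illustrative and not from the answer, but the idea is exactly the one described above:
import tensorflow as tf
from tensorflow.keras.layers import Layer

class MaskedMean(Layer):
    """Averages over the timestep axis, counting only non-masked positions."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.supports_masking = True

    def call(self, inputs, mask=None):
        if mask is None:
            return tf.reduce_mean(inputs, axis=1)
        mask = tf.cast(mask, inputs.dtype)       # (batch, timesteps)
        mask = tf.expand_dims(mask, axis=-1)     # (batch, timesteps, 1)
        summed = tf.reduce_sum(inputs * mask, axis=1)
        counts = tf.reduce_sum(mask, axis=1)
        return summed / tf.maximum(counts, 1.0)  # avoid dividing by zero

    def compute_mask(self, inputs, mask=None):
        return None  # the sequence mask is consumed here

out = MaskedMean(name='masked_mean')(embedding_layer(input_data))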

Keras SimpleRNN - Shape MFCC vectors

I'm currently trying to implement a Recurrent Neural Network in Keras. The data consists of a collection of 45,000 entries, where each entry is a variable-length sequence of MFCC vectors with 13 coefficients each:
spoken = numpy.load('spoken.npy')
print(spoken[0]) # Gives:
example_row = [
[
5.67170000e-01 -1.79430000e-01 -7.27360000e+00 -9.59300000e-02
-9.30140000e-02 -1.62960000e-01 4.11620000e-01 3.00590000e-01
6.86360000e-02 1.07130000e+00 1.07090000e-01 5.00890000e-01
7.51750000e-01],
[.....]
]
print(spoken.shape) # Gives: (45000,)
print(spoken[0].shape) # Gives (N, 13) --> N is the number of MFCC vectors
I'm struggling to understand how I need to reshape this Numpy array in order to feed it to the SimpleRNN of Keras:
model = Sequential()
model.add(SimpleRNN(units=10, activation='relu', input_shape=?))
.....
Therefore, my question is: how do I need to reshape a collection of variable-length MFCC vectors so that I can feed it to the SimpleRNN object of Keras?
It was actually quite simple, since Keras has a built-in function for reformatting the array and padding with zeros to get a fixed length:
from keras.preprocessing.sequence import pad_sequences
spoken_train = pad_sequences(spoken_train, maxlen=100, dtype='float32')  # float32 keeps the MFCC values (the default dtype is int32)
See github issue
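For completeness, here is a hedged sketch of how the padded array could then be fed to the SimpleRNN from the question; the output layer, loss, and the 'labels' array are placeholders, not from the original answer:
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense

# After padding, spoken_train has shape (num_samples, 100, 13)
model = Sequential()
model.add(SimpleRNN(units=10, activation='relu', input_shape=(100, 13)))
model.add(Dense(1, activation='sigmoid'))  # placeholder output layer
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(spoken_train, labels, epochs=10)  # 'labels' is a placeholder target array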
