numpy array for sequential network: varying sequence length - python-3.x

I have a recurrent network (RNN) whose task is to learn to classify vectors (float32) in two classes. My model is really simple so far:
model = Sequential([
    SimpleRNN(units=10, input_shape=(None, len_vector)),
    Dense(1, activation="relu")
])
model.compile(loss='mse', optimizer='Adam', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=30)
To train this network, I create a dataset with 1000 instances of sequences of vectors. When I create sequences with the same length each, the training works perfectly and the dataset has shape:
[<number of sequences>, <number of vectors in each sequence>, <number of floats in each vector>]
The problem is that my model must be able to work on sequences of varying length. I don't know how (or even whether it is possible) to create a numpy array in which one dimension is not constant.
While searching for a solution, I saw that setting the array's dtype=object makes it possible to assign lists of different shapes to the elements of a numpy array, but the Keras model only accepts dtype="float32".
Is there a way I can build this numpy array dataset? Or should I change the algorithm used to train the model? Or is the only solution to pad the sequences with null vectors to unify their length?
(Thanks for the help. I'm fairly new to deep learning so I apologize if I'm asking for something obvious.)

Use ragged tensors; they let you feed variable-length inputs:
import numpy as np
import tensorflow as tf
_input = tf.keras.layers.Input(shape=(None, 100))
lstm = tf.keras.layers.LSTM(20)(_input)
func = tf.keras.backend.function(inputs=_input, outputs=lstm)
# Four sequences of different lengths (34, 55, 60 and 70 steps), 100 features per step
rt = tf.ragged.constant([np.random.randn(1, 34, 100),
                         np.random.randn(1, 55, 100),
                         np.random.randn(1, 60, 100),
                         np.random.randn(1, 70, 100)])
func(rt[1])
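For comparison, the padding route mentioned in the question can be sketched in plain NumPy. This is only an illustration under assumptions: the helper name pad_with_zero_vectors and the toy sizes are made up here, and sequences are padded at the end with zero vectors.

```python
import numpy as np

def pad_with_zero_vectors(sequences, dtype="float32"):
    """Stack variable-length (steps, features) arrays into one dense array.

    Returns the padded array of shape (n_sequences, max_steps, features)
    and a boolean mask marking the real (non-padding) timesteps.
    """
    n_features = sequences[0].shape[1]
    max_steps = max(len(seq) for seq in sequences)
    padded = np.zeros((len(sequences), max_steps, n_features), dtype=dtype)
    mask = np.zeros((len(sequences), max_steps), dtype=bool)
    for i, seq in enumerate(sequences):
        padded[i, :len(seq)] = seq   # real vectors first, zero vectors after
        mask[i, :len(seq)] = True
    return padded, mask

rng = np.random.default_rng(0)
sequences = [rng.standard_normal((n, 4)) for n in (3, 5, 2)]  # toy data
X_train, mask = pad_with_zero_vectors(sequences)
print(X_train.shape, X_train.dtype)   # (3, 5, 4) float32
print(mask.sum(axis=1))               # [3 5 2]
```

The resulting array has a constant shape and dtype float32, so it can be passed to model.fit directly. If the padded steps should be ignored, a Masking(mask_value=0.0) layer in front of the recurrent layer is the usual Keras counterpart to the mask computed here.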

Related

Embedding vs inserting word vectors directly to input layer

I used gensim to build a word2vec embedding of my corpus.
Currently I'm converting my (padded) input sentences to the word vectors using the gensim model.
These vectors are used as input for the model.
model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(MAX_SEQUENCE_LENGTH, dim)))
model.add(Bidirectional(
    LSTM(num_lstm, dropout=0.5, recurrent_dropout=0.4, return_sequences=True))
)
...
model.fit(training_sentences_vectors, training_labels, validation_data=validation_data)
Are there any drawbacks to using the word vectors directly, without a Keras embedding layer?
I'm also currently adding additional (one-hot encoded) tags to the input tokens by concatenating them to each word vector, does this approach make sense?
In your current setup, the drawback is that you cannot make your word vectors trainable, so you cannot fine-tune them for your task.
What I mean by this is that gensim has only learned the "language model": it understands your corpus and its contents, but it does not know how to optimize for whatever downstream task you are using Keras for. The rest of your model's weights will still adapt to the task; however, you will likely see better performance if you extract the embeddings from gensim, use them to initialize a Keras Embedding layer, and then pass in word indices instead of word vectors as the model input.
There's an elegant way to do what you need.
The problems with your solution are:
The input is large, (batch_size, MAX_SEQUENCE_LENGTH, dim), and may not fit in memory.
You won't be able to train and update the word vectors for your task.
You can instead get away with just: (batch_size, MAX_SEQUENCE_LENGTH). The keras embedding layer allows you to pass in a word index and get a vector. So, 42 -> Embedding Layer -> [3, 5.2, ..., 33].
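To make the index-to-vector mapping concrete, here is a minimal NumPy sketch of what an embedding lookup amounts to. The sizes and the random matrix are illustrative assumptions, not values from the question; a real Embedding layer would hold the trained word2vec weights.

```python
import numpy as np

vocab_size, dim = 1000, 50        # illustrative sizes, not from the post
batch_size, max_len = 32, 20

# An embedding layer is essentially a trainable (vocab_size, dim) lookup table.
embedding_matrix = np.random.default_rng(0).standard_normal(
    (vocab_size, dim)).astype("float32")

# Model input as word indices: one small integer per token.
indices = np.random.default_rng(1).integers(
    0, vocab_size, size=(batch_size, max_len)).astype("int32")

# Looking the indices up yields the dense vectors the model would otherwise be fed.
vectors = embedding_matrix[indices]

print(indices.shape, vectors.shape)       # (32, 20) (32, 20, 50)
print(vectors.nbytes // indices.nbytes)   # 50
```

With these dtypes the index input is smaller than the vector input by a factor of dim, which is the memory saving described above.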
Conveniently, gensim's w2v model has a function, get_keras_embedding (gensim 3.x API), which creates the needed embedding layer for you with the trained weights.
gensim_model = ...  # train it or load it
embedding_layer = gensim_model.wv.get_keras_embedding(train_embeddings=True)
embedding_layer.mask_zero = True  # No need for a masking layer
model = Sequential()
model.add(embedding_layer)  # your embedding layer
model.add(Bidirectional(
    LSTM(num_lstm, dropout=0.5, recurrent_dropout=0.4, return_sequences=True))
)
But you have to make sure that the index of a word in your data is the same as its index in the word2vec model.
word2index = {}
for index, word in enumerate(gensim_model.wv.index2word):  # index_to_key in gensim 4
    word2index[word] = index
Use the word2index dictionary above to convert your input data to the same indices as the gensim model.
For example, your data might be:
X_train = [["hello", "there"], ["General", "Kenobi"]]
new_X_train = []
for sent in X_train:
    temp_sent = []
    for word in sent:
        temp_sent.append(word2index[word])
    # Add the padding for each sentence. Here I am padding with 0
    temp_sent += [0] * (MAX_SEQUENCE_LENGTH - len(temp_sent))
    new_X_train.append(temp_sent)
X_train = numpy.asarray(new_X_train)
Now you can use X_train, which will look like: [[23, 34, 0, 0], [21, 63, 0, 0]]
The Embedding Layer will map the index to that vector automatically and train it if needed.
I think this is the best way of doing it but I'll dig into how gensim wants it to be done and update this post if needed.
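One caveat worth sketching: with mask_zero=True, Keras treats index 0 as padding, so a vocabulary that assigns index 0 to a real word would have that word masked. A possible workaround, shown here under assumptions, is to shift every word index up by one and prepend an all-zero row to the weight matrix. The names index2word and trained_vectors below are hypothetical stand-ins for the gensim vocabulary list and its vectors.

```python
import numpy as np

# Hypothetical stand-ins for the gensim vocabulary and its trained vectors.
index2word = ["hello", "there", "General", "Kenobi"]
trained_vectors = np.arange(4 * 3, dtype="float32").reshape(4, 3)

MAX_SEQUENCE_LENGTH = 4
PAD_INDEX = 0

# Shift every word index up by one so that 0 is free to mean "padding".
word2index = {word: i + 1 for i, word in enumerate(index2word)}

# Prepend a zero row so that row k still holds the vector of the word mapped to k.
embedding_matrix = np.vstack(
    [np.zeros((1, trained_vectors.shape[1]), dtype="float32"), trained_vectors])

def encode(sentence):
    """Convert words to shifted indices, then truncate or right-pad to a fixed length."""
    ids = [word2index[w] for w in sentence][:MAX_SEQUENCE_LENGTH]
    return ids + [PAD_INDEX] * (MAX_SEQUENCE_LENGTH - len(ids))

X_train = np.asarray([encode(s) for s in [["hello", "there"], ["General", "Kenobi"]]])
print(X_train)
# [[1 2 0 0]
#  [3 4 0 0]]
```

The shifted matrix would then be used to initialize an Embedding layer with input_dim equal to the vocabulary size plus one.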

Keras SimpleRNN - Shape MFCC vectors

I'm currently trying to implement a Recurrent Neural Network in Keras. The data consists of 45,000 entries, each of which is a variable-length collection of MFCC vectors with 13 coefficients each:
spoken = numpy.load('spoken.npy')
print(spoken[0]) # Gives:
example_row = [
[
5.67170000e-01 -1.79430000e-01 -7.27360000e+00 -9.59300000e-02
-9.30140000e-02 -1.62960000e-01 4.11620000e-01 3.00590000e-01
6.86360000e-02 1.07130000e+00 1.07090000e-01 5.00890000e-01
7.51750000e-01],
[.....]
]
print(spoken.shape) # Gives: (45000,0)
print(spoken[0].shape) # Gives (N, 13) --> N amount of MFCC vectors
I'm struggling to understand how I need to reshape this Numpy array in order to feed it to the SimpleRNN of Keras:
model = Sequential()
model.add(SimpleRNN(units=10, activation='relu', input_shape=?))
.....
Therefore, my question is how do I need to reshape a collection of variable length MFCC vectors so that I can feed it to the SimpleRNN object of Keras?
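It was actually quite simple, since Keras has a built-in function for reformatting the array and padding it with zeros to a fixed length. Pass dtype='float32', because the default dtype is int32, which would truncate the MFCC coefficients to integers:
from keras.preprocessing.sequence import pad_sequences
spoken_train = pad_sequences(spoken_train, maxlen=100, dtype='float32')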
See github issue
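As a rough illustration of what that call does to (N, 13) arrays, here is a NumPy sketch that mimics the default behaviour of pad_sequences, which pads and truncates at the front. The helper pad_pre and the toy lengths are assumptions made for this example; it is not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, n_features, dtype="float32"):
    """Left-pad (or left-truncate) each (N, n_features) array to maxlen steps."""
    out = np.zeros((len(sequences), maxlen, n_features), dtype=dtype)
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                                    # leave an all-zero row
        kept = np.asarray(seq, dtype=dtype)[-maxlen:]   # keep the last maxlen steps
        out[i, -len(kept):] = kept                      # zeros end up in front
    return out

rng = np.random.default_rng(0)
spoken = [rng.standard_normal((n, 13)) for n in (80, 100, 250)]  # toy MFCC data
spoken_train = pad_pre(spoken, maxlen=100, n_features=13)
print(spoken_train.shape)   # (3, 100, 13)
```

With data in this form, the recurrent layer's input_shape would be (100, 13), that is (timesteps, features).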

2-dimensional LSTM in Keras

I am new to Keras and LSTMs -- I want to train a model on 2-dimensional sequences (ie, movement in a grid-space), as opposed to 1-dimensional sequences (like characters of text).
As a test, I first tried just one dimension, and I am doing it successfully with the following setup:
model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=X[0].shape, dropout=0.2, recurrent_dropout=0.2))
model.add(LSTM(512, return_sequences=False, dropout=0.2))
model.add(Dense(len(y[0]), activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=['accuracy'])
model.fit(X, y, epochs=50)
I'm formatting the data like this:
data = ...  # list of integers (1D)
inputs = []
outputs = []
for i in range(len(data) - SEQUENCE_LENGTH):
    inputs.append(data[i:i + SEQUENCE_LENGTH])
    outputs.append(data[i + SEQUENCE_LENGTH])
X = np.array([to_categorical(np.array(window), CATEGORY_LENGTH) for window in inputs])
y = to_categorical(np.array(outputs), CATEGORY_LENGTH)
This is straightforward and converges quickly.
But if instead of a list of integers, my data consists of 2D tuples, I can no longer create categorical (one-hot) arrays to pass to the LSTM layers.
I've tried not using categorical arrays and simply passing the tuples to the model. In this case, I've changed my output layer to:
model.add(Dense(1, activation="linear"))
But that does not converge, or at least moves incredibly slowly.
How can I adapt this code to handle input with additional dimensions?
This previous answer should apply to your question as well. The only difference is that you will have to convert your tuple to a data frame beforehand.
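One possible way to keep the categorical setup with 2D positions, sketched here as an assumption and not necessarily what the referenced answer proposes, is to map each (x, y) grid cell to a single integer and one-hot encode that. The grid size and the toy path below are hypothetical.

```python
import numpy as np

GRID_WIDTH, GRID_HEIGHT = 4, 3              # hypothetical grid size
CATEGORY_LENGTH = GRID_WIDTH * GRID_HEIGHT  # one category per grid cell
SEQUENCE_LENGTH = 3

def cell_id(x, y):
    """Map a grid position to a single integer category."""
    return y * GRID_WIDTH + x

def one_hot(ids, n_classes):
    """One-hot encode an integer array of any shape along a new last axis."""
    return np.eye(n_classes, dtype="float32")[np.asarray(ids)]

data = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (3, 2)]   # toy path through the grid
ids = [cell_id(x, y) for x, y in data]

inputs, outputs = [], []
for i in range(len(ids) - SEQUENCE_LENGTH):
    inputs.append(ids[i:i + SEQUENCE_LENGTH])
    outputs.append(ids[i + SEQUENCE_LENGTH])

X = one_hot(inputs, CATEGORY_LENGTH)
y = one_hot(outputs, CATEGORY_LENGTH)
print(ids)                 # [0, 1, 5, 6, 10, 11]
print(X.shape, y.shape)    # (3, 3, 12) (3, 12)
```

X and y then have the same layout as in the 1D case, so the softmax output layer and categorical cross-entropy loss can stay as they are. The trade-off is that the number of categories grows with the grid area.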

Strange behaviour sequence to sequence learning for variable length sequences

I am training a sequence to sequence model for variable length sequences with Keras, but I am running into some unexpected problems. It is unclear to me whether the behaviour I am observing is the desired behaviour of the library and why it would be.
Model Creation
I've made a recurrent model with an embeddings layer and a GRU recurrent layer that illustrates the problem. I used mask_zero=True for the embeddings layer instead of a masking layer, but changing this doesn't seem to make a difference (nor does adding a masking layer before the output):
import numpy
from keras.layers import Embedding, GRU, TimeDistributed, Dense, Input
from keras.models import Model
import keras.preprocessing.sequence
numpy.random.seed(0)
input_layer = Input(shape=(3,), dtype='int32', name='input')
embeddings = Embedding(input_dim=20, output_dim=2, input_length=3, mask_zero=True, name='embeddings')(input_layer)
recurrent = GRU(5, return_sequences=True, name='GRU')(embeddings)
output_layer = TimeDistributed(Dense(1), name='output')(recurrent)
model = Model(input=input_layer, output=output_layer)
output_weights = model.layers[-1].get_weights()
output_weights[1] = numpy.array([0.2])
model.layers[-1].set_weights(output_weights)
model.compile(loss='mse', metrics=['mse'], optimizer='adam', sample_weight_mode='temporal')
I use masking and the sample_weight parameter to exclude the padding values from the training/evaluation. I will test this model on one input/output sequence which I pad using the Keras padding function:
X = [[1, 2]]
X_padded = keras.preprocessing.sequence.pad_sequences(X, dtype='float32', maxlen=3)
Y = [[[1], [2]]]
Y_padded = keras.preprocessing.sequence.pad_sequences(Y, maxlen=3, dtype='float32')
Output Shape
Why is the output expected to be formatted in this way? Why can I not use input/output sequences that have exactly the same dimensionality? model.evaluate(X_padded, Y_padded) gives me a dimensionality error.
Then, when I run model.predict(X_padded) I get the following output (with numpy.random.seed(0) before generating the model):
[[[ 0.2 ]
[ 0.19946882]
[ 0.19175649]]]
Why isn't the first input masked for the output layer? Is the output value computed anyway (and equal to the bias, as the hidden layer values are 0)? This does not seem desirable. Adding a Masking layer before the output layer does not solve this problem.
MSE calculation
Then, when I evaluate the model (model.evaluate(X_padded, Y_padded)), this returns the Mean Squared Error (MSE) of the entire sequence (1.3168) including this first value, which I suppose is to be expected when it isn't masked, but not what I would want.
From the Keras documentation I understand I should use the sample_weight parameter to solve this problem, which I tried:
sample_weight = numpy.array([[0, 1, 1]])
model_evaluation = model.evaluate(X_padded, Y_padded, sample_weight=sample_weight)
print model.metrics_names, model_evaluation
The output I get is
['loss', 'mean_squared_error'] [2.9329459667205811, 1.3168648481369019]
This leaves the metric (MSE) unaltered: it is still the MSE over all values, including the one that I wanted masked. Why? This is not what I want when I evaluate my model. It does change the loss value, which appears to be the MSE over the last two values, normalised so as not to give more weight to longer sequences.
Am I doing something wrong with the sample weights? Also, I really cannot figure out how this loss value came about. What should I do to exclude the padded values from both training and evaluation? (I assume the sample_weight parameter works the same way in the fit function.)
It was indeed a bug in the library; the issue is resolved in Keras 2.
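For reference, the two quantities discussed in the question can be checked by hand. The sketch below is plain NumPy arithmetic on the prediction values printed in the question, assuming the targets were pre-padded to [0, 1, 2]; it is not the Keras implementation.

```python
import numpy as np

y_true = np.array([[[0.0], [1.0], [2.0]]])                 # Y_padded (padding in front)
y_pred = np.array([[[0.2], [0.19946882], [0.19175649]]])   # predictions from the question
sample_weight = np.array([[0.0, 1.0, 1.0]])                # 0 excludes the padded step

squared_error = ((y_true - y_pred) ** 2).mean(axis=-1)     # shape (1, 3)

# Unmasked: average over all three timesteps, padding included.
mse_all = squared_error.mean()
# Masked: weighted sum divided by the number of real timesteps.
mse_masked = (squared_error * sample_weight).sum() / sample_weight.sum()

print(round(mse_all, 4), round(mse_masked, 4))   # 1.3169 1.9553
```

The unmasked value matches the reported metric of about 1.3168, which confirms that the metric included the padded timestep. The masked value is what an evaluation restricted to the two real timesteps would give.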

Keras: LSTM with class weights

My question is quite closely related to this question but also goes beyond it.
I am trying to implement the following LSTM in Keras, where
the number of timesteps is nb_tsteps=10
the number of input features is nb_feat=40
the number of LSTM cells at each time step is 120
the LSTM layer is followed by TimeDistributedDense layers
From the question referenced above I understand that I have to present the input data as
nb_samples, 10, 40
where I get nb_samples by rolling a window of length nb_tsteps=10 across the original timeseries of shape (5932720, 40). The code is hence
model = Sequential()
model.add(LSTM(120, input_shape=(X_train.shape[1], X_train.shape[2]),
               return_sequences=True, consume_less='gpu'))
model.add(TimeDistributed(Dense(50, activation='relu')))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(20, activation='relu')))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(10, activation='relu')))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(3, activation='relu')))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
Now to my question (assuming the above is correct so far):
The binary responses (0/1) are heavily imbalanced and I need to pass a class_weight dictionary like cw = {0: 1, 1: 25} to model.fit(). However I get an exception class_weight not supported for 3+ dimensional targets. This is because I present the response data as (nb_samples, 1, 1). If I reshape it into a 2D array (nb_samples, 1) I get the exception Error when checking model target: expected timedistributed_5 to have 3 dimensions, but got array with shape (5932720, 1).
Thanks a lot for any help!
I think you should use sample_weight with sample_weight_mode='temporal'.
From the Keras docs:
sample_weight: Numpy array of weights for the training samples, used
for scaling the loss function (during training only). You can either
pass a flat (1D) Numpy array with the same length as the input samples
(1:1 mapping between weights and samples), or in the case of temporal
data, you can pass a 2D array with shape (samples, sequence_length),
to apply a different weight to every timestep of every sample. In this
case you should make sure to specify sample_weight_mode="temporal" in
compile().
In your case you would need to supply a 2D array with the same shape as your labels.
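As a minimal sketch of that suggestion, the class_weight dictionary from the question can be turned into a per-timestep weight array with NumPy. The toy labels below are made up for illustration; only cw comes from the question.

```python
import numpy as np

cw = {0: 1, 1: 25}                       # class weights from the question

# Toy binary labels shaped like the model output: (nb_samples, nb_tsteps, 1)
y_train = np.array([[[0], [1], [0]],
                    [[1], [1], [0]]])

labels_2d = y_train[..., 0]              # drop the trailing axis -> (nb_samples, nb_tsteps)
sample_weight = np.where(labels_2d == 1, cw[1], cw[0]).astype("float32")

print(sample_weight.shape)      # (2, 3)
print(sample_weight.tolist())   # [[1.0, 25.0, 1.0], [25.0, 25.0, 1.0]]
```

The array would then be passed as model.fit(..., sample_weight=sample_weight) on a model compiled with sample_weight_mode='temporal', while the targets keep their 3D shape.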
If this is still an issue: I think the TimeDistributed layer expects and returns a 3D array (similar to what you get with return_sequences=True in a regular LSTM layer). Try adding a Flatten() layer or another LSTM layer before the final prediction layer.
d = TimeDistributed(Dense(10))(input_from_previous_layer)
lstm_out = Bidirectional(LSTM(10))(d)
output = Dense(1, activation='sigmoid')(lstm_out)
Using temporal is a workaround. Check out this stack. The issue is also documented on github.

Resources