Multi-Feature Sequence Padding and Masking in RNN using Keras

I have built an LSTM architecture in Keras, but I am not certain whether duplicating time steps is a good way to deal with variable sequence lengths.
I have a multidimensional data set with multi-feature sequences and varying numbers of time steps. It is multivariate time series data with multiple examples to train the LSTM on, and Y is either 0 or 1. Currently, I am duplicating the last time step of each sequence to ensure timesteps = 3.
I would appreciate it if someone could address the following questions or concerns:
1. Is creating additional time steps with feature values set to zero a more suitable approach?
2. What is the right way to frame this problem, pad the sequences, and mask them for evaluation?
3. I am duplicating the last time step in the Y variable as well for prediction, and the value 1 in Y only appears at the last time step, if at all.
import numpy as np

# The input sequences are
trainX = np.array([
    # Datapoint 1
    [
        # Input features at timestep 1
        [1, 2, 3],
        # Input features at timestep 2
        [5, 2, 3]  # <------ duplicate this to ensure compliance
    ],
    # Datapoint 2
    [
        # Features at timestep 1
        [1, 8, 9],
        # Features at timestep 2
        [9, 8, 9],
        # Features at timestep 3
        [7, 6, 1]
    ]
])

# The desired model outputs are as follows:
trainY = np.array([
    # Datapoint 1
    [
        # Target class at timestep 1
        [0],
        # Target class at timestep 2
        [1]  # <---------- duplicate this to ensure compliance
    ],
    # Datapoint 2
    [
        # Target class at timestep 1
        [0],
        # Target class at timestep 2
        [0],
        # Target class at timestep 3
        [0]
    ]
])
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Flatten, Dense
from keras.constraints import maxnorm

timesteps = 3
model = Sequential()
model.add(LSTM(3, kernel_initializer='uniform', return_sequences=True,
               batch_input_shape=(None, timesteps, trainX.shape[2]),
               kernel_constraint=maxnorm(3), name='LSTM'))
model.add(Dropout(0.2))
model.add(LSTM(3, return_sequences=True, kernel_constraint=maxnorm(3), name='LSTM-2'))
model.add(Flatten(name='Flatten'))
model.add(Dense(timesteps, activation='sigmoid', name='Dense'))
model.compile(loss="mse", optimizer="sgd", metrics=["mse"])
model.fit(trainX, trainY, epochs=2000, batch_size=2)
predY = model.predict(testX)

In my opinion there are two solutions to your problem (duplicating timesteps is not one of them):
Use pad_sequences in combination with a Masking layer. This is the common approach: thanks to the padding, every sample has the same number of timesteps. The good thing about this method is that it is very easy to implement, and the Masking layer will even give you a small performance boost.
The downside of this approach: if you train on a GPU, CuDNNLSTM is the layer to go with, since it is highly optimized for the GPU and therefore a lot faster; but it does not work with a Masking layer, and if your dataset has a wide range of timesteps, you lose performance. A minimal sketch of this option is shown below.
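For illustration, here is a minimal sketch of option 1 using pad_sequences and a Masking layer. The layer sizes, padding value, and the padded per-timestep targets are illustrative assumptions, not a drop-in replacement for your model:

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense

# Variable-length samples: 2 and 3 timesteps, 3 features per timestep
sequences = [
    [[1, 2, 3], [5, 2, 3]],
    [[1, 8, 9], [9, 8, 9], [7, 6, 1]],
]
# Pad every sample to the longest length with all-zero timesteps
padded_X = pad_sequences(sequences, maxlen=3, dtype='float32', padding='post', value=0.0)

# Targets padded the same way (the padded step is ignored thanks to the mask)
padded_Y = np.array([[[0], [1], [0]],
                     [[0], [0], [0]]], dtype='float32')

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(3, 3)))  # skip all-zero timesteps
model.add(LSTM(8, return_sequences=True))
model.add(Dense(1, activation='sigmoid'))               # one prediction per timestep
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(padded_X, padded_Y, epochs=10, batch_size=2)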
Set your timesteps shape to None and write a Keras generator that groups your batches by timesteps (I think you will also have to use the functional API). Now you can use CuDNNLSTM, and every sample is computed with only the relevant timesteps (instead of padded ones), which is much more efficient. A rough sketch of such a generator follows below.
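A rough sketch of option 2, assuming a hypothetical generator that buckets samples by length so that each batch has a single, unpadded timestep count:

import numpy as np
from collections import defaultdict

def batches_by_length(samples, targets, batch_size):
    """Yield batches in which every sample has the same number of timesteps."""
    buckets = defaultdict(list)
    for x, y in zip(samples, targets):
        buckets[len(x)].append((x, y))
    while True:                                   # Keras generators loop forever
        for length, items in buckets.items():
            for i in range(0, len(items), batch_size):
                chunk = items[i:i + batch_size]
                X = np.array([x for x, _ in chunk], dtype='float32')
                Y = np.array([y for _, y in chunk], dtype='float32')
                yield X, Y                        # shape (batch, length, n_features), no padding

You would then pass this generator to model.fit_generator (with steps_per_epoch set appropriately) and build the model with a timestep dimension of None, e.g. input_shape=(None, n_features).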
If you're new to Keras and performance is not that important, go with option 1. If you have a production environment where you often have to train the network and cost matters, try option 2.

Related

Can't use combination of gradients for multiple loss functions of a multi-output Keras model

I am doing time-series forecasting in Keras with a CNN and the EHR dataset. The goal is to predict both which molecule to give to the patient and the time until the next patient visit. I have to implement a bi-objective gradient descent based on this paper. The algorithm to implement is here (end of page 7, beginning of page 8):
The model I chose is this one:
With time series of length 3 as input (corresponding to 3 consecutive visits for a client)
And 2 outputs:
the ATC code (the code of the molecule to predict)
the time to wait until the next visit (in categories of months: 0, 1, 2, 3, 4 for >= 4)
Both outputs use the SparseCategoricalCrossentropy loss function.
When I start to implement the first operation, gs - gl, I get an error:
Some values in my gradients are None and I don't know why. My optimizer is defined as optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3) when compiling my model.
Also, when I try some operations on the gradients to see how things work, I have another problem: only one output branch is taken into account, which will pose a problem later because I have to consider each loss function separately:
With this code, I get this output message: WARNING:tensorflow:Gradients do not exist for variables ['outputWaitTime/kernel:0', 'outputWaitTime/bias:0'] when minimizing the loss.
EPOCHS = 1
for epoch in range(EPOCHS):
    with tf.GradientTape() as ATCTape, tf.GradientTape() as WTTape:
        predictions = model(xTrain, training=False)
        ATCLoss = loss(yTrain[:, :, 0], predictions[ATC_CODE])
        WTLoss = loss(yTrain[:, :, 1], predictions[WAIT_TIME])

    ATCGrads = ATCTape.gradient(ATCLoss, model.trainable_variables)
    WTGrads = WTTape.gradient(WTLoss, model.trainable_variables)
    grads = ATCGrads + WTGrads

    model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
With this code, it's okay, but both losses are combined into one, whereas I need to consider both losses separately
EPOCHS = 1
for epoch in range(EPOCHS):
    with tf.GradientTape() as tape:
        predictions = model(xTrain, training=False)
        ATCLoss = loss(yTrain[:, :, 0], predictions[ATC_CODE])
        WTLoss = loss(yTrain[:, :, 1], predictions[WAIT_TIME])
        lossValue = ATCLoss + WTLoss

    grads = tape.gradient(lossValue, model.trainable_variables)
    model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
I need help to understand why I have all of those problems.
The notebook containing all the code is here: https://colab.research.google.com/drive/1b6UorAAEddNKFQCxaK1Wsuj09U645KhU?usp=sharing
The implementation begins in the part Model Creation
The reason you get None in ATCGrads and WTGrads is that each of the two gradients corresponds to a loss computed with respect to a different output, outputATC or outputWaitTime. If an output's value is not used to calculate a loss, then there are no gradients of that loss with respect to that output, and hence you get None gradients for that output layer. That is also the reason you get WARNING:tensorflow:Gradients do not exist for variables ['outputWaitTime/kernel:0', 'outputWaitTime/bias:0'] when minimizing the loss: each loss simply has no gradients for the other branch's variables. If you combine the losses into one, then both outputs are used to calculate the loss, and there is no warning.
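A tiny self-contained demonstration of this behaviour (the two Dense branches here are hypothetical stand-ins for outputATC and outputWaitTime):

import tensorflow as tf

x = tf.constant([[1.0, 2.0]])
branch_a = tf.keras.layers.Dense(1, name='a')
branch_b = tf.keras.layers.Dense(1, name='b')

with tf.GradientTape() as tape:
    ya = branch_a(x)
    yb = branch_b(x)          # computed, but not used in the loss below
    loss = tf.reduce_sum(ya)  # depends only on branch_a

grads = tape.gradient(loss, branch_a.trainable_variables + branch_b.trainable_variables)
print(grads)  # tensors for branch_a's kernel/bias, None for branch_b's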
So if you want to do an element-wise subtraction of the two gradient lists, you can first convert None to 0. before subtracting. Note that you cannot use tf.math.subtract(gs, gl), because it requires the shapes of all inputs to match, so:
import tensorflow as tf
gs = [tf.constant([1., 2.]), tf.constant(3.), None]
gl = [tf.constant([3., 4.]), None, tf.constant(4.)]
to_zero = lambda i : 0. if i is None else i
gs = list(map(to_zero, gs))
gl = list(map(to_zero, gl))
sub = [s_i - l_i for s_i, l_i in zip(gs, gl)]
print(sub)
Output:
[<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-2., -2.], dtype=float32)>,
<tf.Tensor: shape=(), dtype=float32, numpy=3.0>,
<tf.Tensor: shape=(), dtype=float32, numpy=-4.0>]
Also beware that tape.gradient() returns a list or nested structure of Tensors (or IndexedSlices, or None), one for each element in sources, and the returned structure has the same structure as sources. Adding two lists with [1, 2] + [3, 4] in Python will not give you [4, 6] the way it would for NumPy arrays; instead it concatenates the two lists and gives you [1, 2, 3, 4].
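For example, this is why the grads = ATCGrads + WTGrads line in your first snippet ends up with twice as many entries as model.trainable_variables (a quick illustrative check):

import tensorflow as tf

a = [tf.constant(1.0), tf.constant(2.0)]
b = [tf.constant(3.0), tf.constant(4.0)]

print(len(a + b))                                # 4: Python lists concatenate
elementwise = [x + y for x, y in zip(a, b)]      # 2: element-wise sum instead
print([t.numpy() for t in elementwise])          # [4.0, 6.0]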

Ways to limit the output of an NN regression problem to a certain range (i.e. I want my NN to always predict output values only between -20 and +30)

I am training an NN for a regression problem, so the output layer has a linear activation function. The NN output is supposed to be between -20 and 30. My NN performs well most of the time; however, sometimes it gives output greater than 30, which is not desirable for my system. Does anyone know an activation function that can provide this kind of restriction on the output, or have suggestions on modifying the linear activation function for my application?
I am using Keras with the TensorFlow backend for this application.
What you can do is activate your last layer with a sigmoid, so the result will be between 0 and 1, and then create a custom layer to map it to the desired range:
from keras import backend as K

def get_range(input, maxx, minn):
    # rescale each row from its own [min, max] onto [minn, maxx]
    return minn + (maxx - minn) * (input - K.min(input, axis=1, keepdims=True)) / (K.max(input, axis=1, keepdims=True) - K.min(input, axis=1, keepdims=True))
and then add this to your network:
out = layers.Lambda(get_range, arguments={'maxx': 30, 'minn': -20})(sigmoid_output)
The output will be rescaled to lie between 'minn' and 'maxx'.
UPDATE
If you want to clip the outputs to the range without rescaling them, do this instead (note that clipping should be applied to the raw linear output, since a sigmoid output already lies in [0, 1]):
def clip(input, maxx, minn):
    return K.clip(input, minn, maxx)
out = layers.Lambda(clip, arguments={'maxx': 30, 'minn': -20})(linear_output)  # linear_output: the un-squashed Dense output
What you should do is normalize your target outputs to the range [-1, 1] or [0, 1], then use a tanh (for [-1, 1]) or sigmoid (for [0, 1]) activation at the output, and train the model on the normalized data.
Then you can denormalize the predictions during inference to get values back in your original range. A minimal sketch is shown below.
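A minimal sketch of that approach, assuming the target range [-20, 30] from the question; the layer sizes and the random data are placeholders:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

y_min, y_max = -20.0, 30.0

def normalize(y):              # map [-20, 30] -> [0, 1]
    return (y - y_min) / (y_max - y_min)

def denormalize(y_scaled):     # map [0, 1] -> [-20, 30]
    return y_scaled * (y_max - y_min) + y_min

model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(10,)))
model.add(Dense(1, activation='sigmoid'))   # output is bounded to [0, 1]
model.compile(loss='mse', optimizer='adam')

X = np.random.rand(100, 10)
y = np.random.uniform(y_min, y_max, size=(100, 1))
model.fit(X, normalize(y), epochs=5, verbose=0)

preds = denormalize(model.predict(X))       # predictions back in [-20, 30]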

Predicting a Time Series data with LSTM in Keras

I'm trying to prepare a model that will predict the first two numbers from a given array of numbers. So, the input dataset is like this -
[1 2 3 5]
[4 8 5 9]
[10 2 3 15]
Output will be -
[1 2]
[4 8]
[10 2]
So, the RNN architectures are like the ones below (taken from here) -
Then, the basic architecture I'm trying to achieve should be something close to this -
So, it should be a Many-To-Many network. (Resembles the fourth image)
Question - So, how can I create this type of model with Keras?
My Findings -
I tried something like this -
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

n_samples = 10000
input = np.random.randint(5, 10, (n_samples, 5))
output = input[..., 0:2]
rinp = input.reshape(n_samples, 1, 5)

model = Sequential()
model.add(LSTM(10, input_shape=(1, 5)))
model.add(Dense(2))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(rinp, output, epochs=1000, batch_size=500, verbose=1)
But as you can see, this is not even close. It is like an MLP and does not utilize any time steps, because the input shape is (n_samples, 1, 5), so there is only one time step.
So, my implementation is wrong.
I've seen some One-to-One, Many-to-One, and Many-to-Many examples from here.
In Many-to-Many example, the author used the following code snippet.
from numpy import array
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

length = 5
seq = array([i / float(length) for i in range(length)])
X = seq.reshape(1, length, 1)
y = seq.reshape(1, length, 1)
# define LSTM configuration
n_neurons = length
n_batch = 1
n_epoch = 1000
# create LSTM
model = Sequential()
model.add(LSTM(n_neurons, input_shape=(length, 1), return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(loss='mean_squared_error', optimizer='adam')
print(model.summary())
# train LSTM
model.fit(X, y, epochs=n_epoch, batch_size=n_batch, verbose=2)
# evaluate
result = model.predict(X, batch_size=n_batch, verbose=0)
for value in result[0, :, 0]:
    print('%.1f' % value)
As you can see from the X and y values, the described model is like the one below -
Which is not the one I'm trying to achieve.
Any example regarding the architecture I'm trying to implement would be greatly helpful.
It looks like you are trying to build a sequence-to-sequence (seq2seq) model based on the drawing. There is a very nice tutorial online to get you started. Instead of predicting sentences, you can just predict fixed-length outputs of length 2. This architecture and its variants are often used for machine translation. Based on your data, I'm guessing you are trying to experiment with the problem of long-term dependencies in recurrent networks; otherwise it wouldn't make sense to use seq2seq for any practical purpose. A minimal sketch is shown below.
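For orientation, here is a minimal encoder-decoder sketch in the spirit of that tutorial, simplified with a RepeatVector decoder; the layer sizes and the random data are illustrative assumptions, not the tutorial's exact model:

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

n_samples, seq_len, out_len = 10000, 5, 2
X = np.random.randint(5, 10, (n_samples, seq_len, 1)).astype('float32')
y = X[:, :out_len, :]                              # target: the first two numbers

model = Sequential()
model.add(LSTM(32, input_shape=(seq_len, 1)))      # encoder: summarize the whole sequence
model.add(RepeatVector(out_len))                   # repeat the summary for each output step
model.add(LSTM(32, return_sequences=True))         # decoder: unroll for out_len steps
model.add(TimeDistributed(Dense(1)))               # one predicted value per output step
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X, y, epochs=10, batch_size=128, verbose=1)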

LSTM's for Binary classification in Keras?

Suppose I have the following dataset X with 2 features and Y labels.
X = [[0.3, 0.1], [0.2, 0.9], [0.4, 0.0]]
Y = [0, 1, 0]
# split into input (X) and output (Y) variables
X = dataset[:, 0:2]  # X features are the first two columns
Y = dataset[:, 2]
model = Sequential()
model.add(Embedding(2, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y)
It works, but I wanted to know more about parameter_1, parameter_2, and parameter_3, which go into
Embedding(parameter_1, parameter_2, input_length=parameter_3)
P.S. I just put in random stuff and don't know what I am doing.
What would be the proper parameters to fill into Embedding() given the data set I described above?
Alright, following the more precise questions in the comments, here is the explanation.
An embedding layer is usually used to embed words, so I will use a running example with words, but you can think of them as categorical features.
The embedding layer is indeed useful for representing words (categorical features) as vectors in a continuous vector space.
When you have a text, you tokenize the words and assign each of them a number. They then become categorical features labelled with an index. For example, the sentence "I embed stuff" becomes the list of categorical objects [2, 1, 3], where a dictionary maps the index to each word: {1: "embed", 2: "I", 3: "stuff", 4: "some_other_words", 0: "<pad>"}
When you use a neural network or any continuous mathematical framework, those discrete objects (= categories) are unordered: there is no sense in 2 > 1 when you talk about your words. These are not numerical values, they are categories. So you want to turn them into numbers, that is, embed them in a vector space.
This is precisely what the Embedding() layer does: it maps every index to a vector. To do that, there are three main parameters to define:
How many indices you want to use in total. This is the number of words you have in your vocabulary, or the number of categories that the categorical feature you want to encode has. This is the input_dim parameter. In our little example, we have 5 words in the vocabulary (indices from 0 to 4), so input_dim = 5. The reason it is called a "dimension" is that under the hood, Keras transforms the index into a one-hot vector of dimension = the number of different elements. For example, the word "stuff", which is index 3, will be transformed into the 5-dimensional vector [0, 0, 0, 1, 0] before being embedded. This is why your inputs should be integers: they are indices representing where the 1 is in the one-hot vector.
How big you want your output vectors to be. This is the size of the vector space your features will live in. The parameter is output_dim. If you don't have a lot of words in your vocabulary (or many different categories for your feature), this number should be low; in our case we will set it to output_dim = 2. Our 5 words will live in a 2D space.
As embedding layers are often the first layers in a neural network, you need to specify the number of words in each sample. This is the input_length. Our sample was a 3-word phrase, so input_length = 3.
The reason you usually have the embedding layer as the first layer is that it takes integer inputs, whereas other layers in a neural network return real values, so that wouldn't work.
So, to summarize: what comes into the layer is a sequence of indices, [2, 1, 3] in our example, and what comes out is the embedded vector corresponding to each index. This might be something like [[0.2, 0.4], [-1.2, 0.3], [-0.5, -0.8]].
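For concreteness, a minimal sketch of this exact toy setup (the printed vectors are illustrative; an untrained embedding returns random values):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(input_dim=5,      # 5 indices in the vocabulary (0..4)
                    output_dim=2,     # embed each index into a 2-D vector
                    input_length=3))  # each sample is a sequence of 3 indices
model.compile('rmsprop', 'mse')

sample = np.array([[2, 1, 3]])        # "I embed stuff" as indices
print(model.predict(sample).shape)    # (1, 3, 2): one 2-D vector per index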
And to come back to your example: the input should be a list of samples, each sample being a list of indices. There is no point in embedding features that are already real values, since such values already have a mathematical meaning the model can understand, as opposed to categorical values.
Is it clearer now? :)

model selection with Keras and Theano takes a very long time

I am performing nested cross-validation for model selection and performance estimation for a set of recurrent neural networks with different architectures and parameters, using Keras and Theano, which are set up to run on an AWS P2 instance with a Tesla K80 GPU and CUDA and cuDNN installed/enabled.
To perform model selection, I compare 30 models sampled from the parameter space using
from sklearn.model_selection import ParameterSampler

param_grid = {
    'nb_hidden_layers': [1, 2, 3],
    'dropout_frac': [0.15, 0.20],
    'output_activation': ['sigmoid', 'softmax'],
    'optimization': ['Adedelta', 'RMSprop', 'Adam'],
    'learning_rate': [0.001, 0.005, 0.010],
    'batch_size': [64, 100, 150, 200],
    'nb_epoch': [10, 15, 20],
    'perform_batchnormalization': [True, False]
}
params_list = list(ParameterSampler(param_grid, n_iter = 30))
I then construct an RNN model using the function NeuralNetworkClassifier() defined below:
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Activation, BatchNormalization
from keras.optimizers import SGD, RMSprop, Adam, Adadelta

def NeuralNetworkClassifier(params, units_in_hidden_layer = [50, 75, 100, 125, 150]):
    nb_units_in_hidden_layers = np.random.choice(units_in_hidden_layer, size = params['nb_hidden_layers'], replace = False)
    layers = [8]  # number of features in every week
    layers.extend(nb_units_in_hidden_layers)
    layers.extend([1])  # node identifying quit/stay
    model = Sequential()
    # constructing all layers up to, but not including, the penultimate one
    layer_idx = -1  # ensures correct behaviour for nb_hidden_layers = 1 (for which the loop below never runs)
    for layer_idx in range(len(layers) - 3):
        # all LSTM layers, up to and including the penultimate one, need return_sequences = True
        model.add(LSTM(input_dim = layers[layer_idx], output_dim = layers[layer_idx + 1], init = 'he_uniform', return_sequences = True))
        if params['perform_batchnormalization'] == True:
            model.add(BatchNormalization())
        model.add(Activation('relu'))
        model.add(Dropout(params['dropout_frac']))
    # constructing the penultimate layer; the last LSTM layer needs return_sequences = False
    model.add(LSTM(input_dim = layers[layer_idx + 1], output_dim = layers[(layer_idx + 1) + 1], init = 'he_uniform', return_sequences = False))
    if params['perform_batchnormalization'] == True:
        model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(params['dropout_frac']))
    # constructing the final layer
    model.add(Dense(output_dim = layers[-1], init = 'he_normal'))
    model.add(Activation(params['output_activation']))
    if params['optimization'] == 'SGD':
        optim = SGD()
        optim.lr.set_value(params['learning_rate'])
    elif params['optimization'] == 'RMSprop':
        optim = RMSprop()
        optim.lr.set_value(params['learning_rate'])
    elif params['optimization'] == 'Adam':
        optim = Adam()
    elif params['optimization'] == 'Adedelta':
        optim = Adadelta()
    model.compile(loss = 'binary_crossentropy', optimizer = optim, metrics = ['precision'])
    return model
which constructs an RNN whose number of hidden layers is given by the parameter 'nb_hidden_layers' in param_grid, and where the number of hidden units in each layer is randomly sampled from the list [50, 75, 100, 125, 150]. At the end, this function compiles the model and returns it.
During the nested cross-validation (CV), the inner loop (which runs IN times) compares the performance of the 30 randomly selected models. After this step, I pick the best-performing model in the outer loop and estimate its performance on a hold-out dataset; this scheme is repeated OUT times. Therefore, I am compiling an RNN model OUT x IN x 30 times, and this takes an extremely long time; for example, when OUT=4 and IN=3, my method takes between 6 and 7 hours to finish.
I see that the GPU is being used sporadically (its usage never goes above 40%); however, most of the time it is the CPU that is being used. My (uneducated) guess is that compilation is being done on the CPU many, many times and takes the bulk of the computing time, whereas model fitting and prediction are done on the GPU and take a short time.
My questions:
Is there a way to remedy this situation?
Is compile actually done on the CPU?
How do people do nested CV to select the best RNN architecture?
Is it reasonable for me to perform this scheme on the production server? Do you suggest I do one big nested CV, that might take 24 hours, to select the best performing model and just use that one model afterwards on the production server?
Thank you all.
I can't answer all your questions, but I still hope this helps.
Compilation is done on the CPU because it mainly consists of symbolic graph operations and code generation. To make things worse, Theano graph optimization uses pure Python code, which adds overhead compared to a C/C++ implementation.
To improve Theano compilation time (at the cost of runtime performance):
Use less aggressive optimization
In /home/ec2-user/.theanorc add the line:
optimizer = fast_compile
Or disable optimization entirely with:
optimizer = None
Precompile some blocks
If there are common blocks shared among your models, you can precompile them with theano.OpFromGraph.
You can't do this in Keras alone, though.
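As a rough illustration of what that looks like (the tanh(x + y) block here is a hypothetical stand-in for a real shared sub-graph):

import theano
import theano.tensor as T

x = T.vector('x')
y = T.vector('y')
# compile the shared sub-graph once and reuse it as a single Op
shared_block = theano.OpFromGraph([x, y], [T.tanh(x + y)])

a = T.vector('a')
b = T.vector('b')
f = theano.function([a, b], shared_block(a, b))
print(f([1.0, 2.0], [3.0, 4.0]))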
Switch frameworks
Keras supports the TensorFlow backend. Compared to Theano, TensorFlow works more like a VM than a compiler. Typically TF runs slower than Theano but compiles much faster.
