Training many-to-many stateful LSTM with and without final dense layer - keras

I am trying to train a recurrent model in Keras containing an LSTM for regression purposes.
I would like to use the model online and, as far as I understood, I need to train a stateful LSTM.
Since the model has to output a sequence of values, I hope it computes the loss on each of the expected output vectors.
However, I fear my code is not working this way, and I would be grateful if anyone could help me understand whether I am doing this right or whether there is a better approach.
The input to the model is a sequence of 128-dimensional vectors. Each sequence in the training set has a different length.
At each timestep, the model should output a vector of 3 elements.
I am trying to train and compare two models:
A) a simple LSTM with 128 inputs and 3 outputs;
B) a simple LSTM with 128 inputs and 100 outputs + a dense layer with 3 outputs;
For model A) I wrote the following code:
# Model
model = Sequential()
model.add(LSTM(3, batch_input_shape=(1, None, 128), return_sequences=True,
               activation="linear", stateful=True))
model.compile(loss='mean_squared_error', optimizer=Adam())

# Training
for i in range(n_epoch):
    for j in np.random.permutation(n_sequences):
        X = data[j]             # j-th sequence
        X = X[np.newaxis, ...]  # X has size 1 x NTimes x 128
        Y = dataY[j]            # Y has size NTimes x 3
        history = model.fit(X, Y, epochs=1, batch_size=1, verbose=0, shuffle=False)
        model.reset_states()
With this code, model A) seems to train fine because the output sequence approaches the ground-truth sequence on the training set.
However, I wonder if the loss is really computed by considering all NTimes output vectors.
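A quick way to check this (a minimal sketch using the model and arrays above; the target is expanded to 1 x NTimes x 3 to include the batch dimension, and states are reset so both passes start from the same state) is to compare the loss Keras reports against a manually computed per-timestep MSE:

model.reset_states()
pred = model.predict(X)                          # shape 1 x NTimes x 3
manual_mse = np.mean((pred - Y[np.newaxis, ...]) ** 2)
model.reset_states()
keras_mse = model.evaluate(X, Y[np.newaxis, ...], verbose=0)
print(manual_mse, keras_mse)                     # should match if every timestep counts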
For model B), I could not find any way to get the entire output sequence due to the dense layer. Hence, I wrote:
# Model
model = Sequential()
model.add(LSTM(100, batch_input_shape=(1, None, 128), stateful=True))
model.add(Dense(3, activation="linear"))
model.compile(loss='mean_squared_error', optimizer=Adam())

# Training
for i in range(n_epoch):
    for j in np.random.permutation(n_sequences):
        X = data[j]             # j-th sequence
        X = X[np.newaxis, ...]  # X has size 1 x NTimes x 128
        Y = dataY[j]            # Y has size NTimes x 3
        loss = 0
        for h in range(X.shape[1]):
            x = X[0, h, :]
            x = x[np.newaxis, np.newaxis, ...]  # h-th vector in j-th sequence
            y = Y[h, :]
            y = y[np.newaxis, ...]
            loss += model.train_on_batch(x, y)
        model.reset_states()  # after the end of the sequence
With this code, model B) does not train well: it seems the training does not converge, and the loss values increase and decrease cyclically.
I have also tried using only the last vector as Y and then calling the fit function on the whole training sequence X, but with no improvement.
Any idea? Thank you!

If you still want three outputs per step of your sequence, you need to wrap your Dense layer in TimeDistributed, like so:
model.add(TimeDistributed(Dense(3, activation="linear")))
This applies the dense layer to each timestep independently.
See https://keras.io/layers/wrappers/#timedistributed
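Putting it together, here is a minimal sketch of model B trained on whole sequences (assuming the same data layout as in the question; the target must be expanded to 1 x NTimes x 3, and the LSTM needs return_sequences=True so the loss is computed at every timestep):

from keras.layers import TimeDistributed

# Model B with a per-timestep Dense head (a sketch, not the poster's exact code)
model = Sequential()
model.add(LSTM(100, batch_input_shape=(1, None, 128),
               return_sequences=True, stateful=True))
model.add(TimeDistributed(Dense(3, activation="linear")))
model.compile(loss='mean_squared_error', optimizer=Adam())

for i in range(n_epoch):
    for j in np.random.permutation(n_sequences):
        X = data[j][np.newaxis, ...]   # 1 x NTimes x 128
        Y = dataY[j][np.newaxis, ...]  # 1 x NTimes x 3
        model.fit(X, Y, epochs=1, batch_size=1, verbose=0, shuffle=False)
        model.reset_states()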

Related

Regression Model with 3 Hidden DenseVariational Layers in Tensorflow-Probability returns nan as loss during training

I am getting acquainted with Tensorflow-Probability and here I am running into a problem. During training, the model returns nan as the loss (possibly meaning a huge loss that overflows). Since the functional form of the synthetic data is not overly complicated and the ratio of data points to parameters is not alarming at first glance, I wonder what the problem is and how it could be corrected.
The code is the following, accompanied by some possibly helpful images:
# Create and plot 5000 data points
x_train = np.linspace(-1, 2, 5000)[:, np.newaxis]
y_train = np.power(x_train, 3) + 0.1*(2+x_train)*np.random.randn(5000)[:, np.newaxis]
plt.scatter(x_train, y_train, alpha=0.1)
plt.show()
# Define the prior weight distribution -- all N(0, 1) -- and not trainable
def prior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    prior_model = Sequential([
        tfpl.DistributionLambda(
            lambda t: tfd.MultivariateNormalDiag(loc=tf.zeros(n), scale_diag=tf.ones(n))
        )
    ])
    return prior_model
# Define variational posterior weight distribution -- multivariate Gaussian
def posterior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    posterior_model = Sequential([
        # The parameters of the model are declared as trainable Variables.
        tfpl.VariableLayer(tfpl.MultivariateNormalTriL.params_size(n), dtype=dtype),
        # The posterior returns, to the variational layer that calls it, a
        # MultivariateNormalTriL object with as many dimensions as the parameters
        # of the variational Dense layer. That means each parameter is generated
        # by a distinct Normal, shifted and scaled by a mu and sigma learned from
        # the data, independently of all the other weights. The output of the
        # VariableLayer becomes the input to the MultivariateNormalTriL object.
        # The shape of the VariableLayer is the number of parameters needed to
        # create a MultivariateNormalTriL living in a space of n dimensions
        # (event_size = n); this number is returned by
        # tfpl.MultivariateNormalTriL.params_size(n).
        tfpl.MultivariateNormalTriL(n)
    ])
    return posterior_model
x_in = Input(shape=(1,))
x = tfpl.DenseVariational(units=2**4,
                          make_prior_fn=prior,
                          make_posterior_fn=posterior,
                          kl_weight=1/x_train.shape[0],
                          activation='relu')(x_in)
x = tfpl.DenseVariational(units=2**4,
                          make_prior_fn=prior,
                          make_posterior_fn=posterior,
                          kl_weight=1/x_train.shape[0],
                          activation='relu')(x)
x = tfpl.DenseVariational(units=tfpl.IndependentNormal.params_size(1),
                          make_prior_fn=prior,
                          make_posterior_fn=posterior,
                          kl_weight=1/x_train.shape[0])(x)
y_out = tfpl.IndependentNormal(1)(x)
model = Model(inputs=x_in, outputs=y_out)

def nll(y_true, y_pred):
    return -y_pred.log_prob(y_true)

model.compile(loss=nll, optimizer='Adam')
model.summary()
# Train the model
history = model.fit(x_train, y_train, epochs=500)
The problem seems to be in the loss function: the negative log-likelihood of an independent normal distribution with no constraint on its location and scale leaves the variance untamed, which blows up the final loss value. Since you're experimenting with variational layers, you are presumably interested in estimating the epistemic uncertainty; to that end, I'd recommend applying a constant variance.
I tried to make a couple of slight changes to your code within the following lines:
first of all, the final output y_out comes directly from the final variational layer, without any IndependentNormal distribution layer:
y_out = tfpl.DenseVariational(units=1,
                              make_prior_fn=prior,
                              make_posterior_fn=posterior,
                              kl_weight=1/x_train.shape[0])(x)
second, the loss function now contains the normal-distribution calculation you need, but with a static variance, to keep the loss from blowing up during training:
def nll(y_true, y_pred):
    dist = tfp.distributions.Normal(loc=y_pred, scale=1.0)
    return tf.reduce_sum(-dist.log_prob(y_true))
then the model is compiled and trained in the same way as before:
model.compile(loss=nll, optimizer= 'Adam')
history = model.fit(x_train, y_train, epochs=3000)
and finally let's sample 100 different predictions from the trained model and plot these values to visualize the epistemic uncertainty of the model:
predicted = [model(x_train) for _ in range(100)]
for i, res in enumerate(predicted):
    plt.plot(x_train, res, alpha=0.1)
plt.scatter(x_train, y_train, alpha=0.1)
plt.show()
After 3000 epochs the result looks like this (with the number of training points reduced from 5000 to 3000 to speed up training):
The model has 38,589 trainable parameters but you have only 5,000 data points, so effective training is impossible with so many parameters.

Sentiment Analysis using LSTM (model does not generate good output)

I made a sentiment analysis model using LSTM, but my model gives very bad predictions.
Here is the complete code
Dataset for amazon review
My LSTM model looks like this:
def ltsm_model(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the ltsm_model model's graph.

    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    model -- a model instance in Keras
    """
    ### START CODE HERE ###
    # Define sentence_indices as the input of the graph; it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(shape=input_shape, dtype='int32')
    # Create the embedding layer pretrained with GloVe vectors
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    # Propagate sentence_indices through your embedding layer to get the embeddings
    embeddings = embedding_layer(sentence_indices)
    # Propagate the embeddings through an LSTM layer with a 128-dimensional hidden state.
    # Be careful: the returned output should be a batch of sequences.
    X = LSTM(128, return_sequences=True)(embeddings)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through another LSTM layer with a 128-dimensional hidden state.
    # Be careful: the returned output should be a single hidden state, not a batch of sequences.
    X = LSTM(128, return_sequences=False)(X)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through a Dense layer to get back a batch of 2-dimensional vectors
    X = Dense(2, activation='relu')(X)
    # Add a softmax activation
    X = Activation('softmax')(X)
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=[sentence_indices], outputs=X)
    ### END CODE HERE ###
    return model
This is my testing data:
x_test = np.array(['amazing!: this soundtrack is my favorite music..'])
X_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
print(x_test[0] +' '+ str(np.argmax(model.predict(X_test_indices))))
I got the following output for this:
amazing!: this soundtrack is my favorite music.. 0
But the sentiment should be positive, and the output should be 1.
How can I improve my model's performance? This is a pretty bad model, I suppose.
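One detail worth checking in the model above, offered as an observation rather than a verified fix: Dense(2, activation='relu') followed by a separate softmax clips negative logits to zero before normalisation, which can flatten the predicted probabilities. The conventional pattern is a linear Dense feeding the softmax:

# Let the final Dense produce raw (linear) logits; apply softmax once
X = Dense(2)(X)
X = Activation('softmax')(X)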

Keras - Issues using pre-trained word embeddings

I'm following Keras tutorials on word embeddings and replicated the code (with a few modifications) from this particular one:
Using pre-trained word embeddings in a Keras model
It's a topic classification problem in which they are loading pre-trained word vectors and use them via a fixed embedding layer.
When using the pre-trained embedding vectors I can, in fact, achieve their 95% accuracy. This is the code:
embedding_layer = Embedding(len(embed_matrix), len(embed_matrix.columns),
                            weights=[embed_matrix],
                            input_length=data.shape[1:], trainable=False)
sequence_input = Input(shape=(MAXLEN,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Dropout(0.2)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x) # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
output = Dense(target.shape[1], activation='softmax')(x)
model = Model(sequence_input, output)
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['acc'])
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=2,
          batch_size=128)
The issue happens when I remove the embedding vectors and use completely random vectors, surprisingly achieving higher accuracy: 96.5%.
The code is the same, with one modification: weights=[random_matrix]. That's a matrix with the same shape as embed_matrix, but filled with random values. So this is the embedding layer now:
embedding_layer = Embedding(len(embed_matrix), len(embed_matrix.columns),
                            weights=[random_matrix],
                            input_length=data.shape[1:], trainable=False)
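For reference, the random matrix could be built along these lines (a sketch only; the post doesn't show the exact construction, and the uniform range here is an assumption):

import numpy as np

# Same shape as embed_matrix, filled with small random values
random_matrix = np.random.uniform(low=-0.05, high=0.05,
                                  size=(len(embed_matrix), len(embed_matrix.columns)))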
I experimented many times with random weights and the result is always similar. Notice that even though those weights are random, the trainable parameter is still False, so the NN is not updating them.
After that, I fully removed the embedding layer and used the word sequences as the input, expecting that those weights were not contributing to the model's accuracy. With that, I got nothing more than 16% accuracy.
So, what is going on? How can random embeddings achieve the same or better performance than pre-trained ones?
And why does using word indices (normalized, of course) as inputs result in such poor accuracy?

Keras: feed output as input at next timestep

The goal is to predict a timeseries Y of 87601 timesteps (10 years) and 9 targets. The input features X (exogenous input) are 11 timeseries of 87600 timesteps. The output has one extra timestep because it includes the initial value.
The output Yt at timestep t depends on the input Xt and on the previous output Yt-1.
Hence, the model should look like this: Model layout
I could only find this thread on this: LSTM: How to feed the output back to the input? #4068.
I tried to implement this with Keras as follows:
def build_model():
    # Input layers
    input_x = layers.Input(shape=(features,), name='input_x')
    input_y = layers.Input(shape=(targets,), name='input_y-1')
    # Merge two inputs
    merge = layers.concatenate([input_x, input_y], name='merge')
    # Normalise input
    norm = layers.Lambda(normalise, name='scale')(merge)
    # Hidden layers
    x = layers.Dense(128, input_shape=(features,))(norm)
    # Output layer
    output = layers.Dense(targets, activation='relu', name='output')(x)
    model = Model(inputs=[input_x, input_y], outputs=output)
    model.compile(loss='mean_squared_error', optimizer=Adam())
    return model

def make_prediction(model, X, y):
    y_pred = [y[0, None, :]]
    for i in range(len(X)):
        y_pred.append(model.predict([X[i, None, :], y_pred[i]]))
    y_pred = np.asarray(y_pred)
    y_pred = y_pred.reshape(y_pred.shape[0], y_pred.shape[2])
    return y_pred

# Fit
model = build_model()
model.fit([X_train, y_train[:-1]], [y_train[1:]], epochs=200,
          batch_size=24, shuffle=False)
# Predict
y_hat = make_prediction(model, X_train, y_train)
This works, but it is not what I want to achieve, as there is no connection between output and next input during training. Hence, the model doesn't learn how to correct for an error in the fed-back output, which results in poor prediction accuracy, as the error on the output accumulates at every timestep.
Is there a way in Keras to implement the output-input feedback during the training stage?
Also, as the initial value of Y is always known, I want to feed this to the network as well.
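One common workaround, sketched here against the question's own build_model interface purely as an illustration (not a confirmed solution): train step by step in a closed loop with train_on_batch, feeding the previous prediction back in, so the network sees, and can learn to correct, its own errors:

# Closed-loop (autoregressive) training sketch; assumes build_model(),
# X_train (NTimes x features) and y_train (NTimes+1 x targets) from above.
model = build_model()
for epoch in range(200):
    y_prev = y_train[0, None, :]               # known initial value
    for t in range(len(X_train)):
        x_t = X_train[t, None, :]
        y_t = y_train[t + 1, None, :]
        model.train_on_batch([x_t, y_prev], y_t)
        y_prev = model.predict([x_t, y_prev])  # feed output back as next input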

LSTM Model in Keras with Auxiliary Inputs

I have a dataset with 2 columns; each column contains a set of documents. I have to match the documents in Col A with the documents provided in Col B. This is a supervised classification problem, so my training data contains a label column indicating whether the documents match or not.
To solve the problem, I have created a set of features, say f1-f25 (by comparing the 2 documents), and then trained a binary classifier on these features. This approach works reasonably well, but now I would like to evaluate Deep Learning models on this problem (specifically, LSTM models).
I am using the keras library in Python. After going through the keras documentation and other tutorials available online, I have managed to do the following:
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model

# Each document contains a series of 200 words
# The necessary text pre-processing steps have been completed to transform
# each doc to a fixed length seq
main_input1 = Input(shape=(200,), dtype='int32', name='main_input1')
main_input2 = Input(shape=(200,), dtype='int32', name='main_input2')

# Next I add a word embedding layer (embed_matrix is separately created
# for each word in my vocabulary by reading from a pre-trained embedding model)
x = Embedding(output_dim=300, input_dim=20000,
              input_length=200, weights=[embed_matrix])(main_input1)
y = Embedding(output_dim=300, input_dim=20000,
              input_length=200, weights=[embed_matrix])(main_input2)

# Next separately pass each embedded doc through an LSTM layer to transform
# the seq of vectors into a single vector
lstm_out_x1 = LSTM(32)(x)
lstm_out_x2 = LSTM(32)(y)

# concatenate the 2 layers and stack a dense layer on top
x = keras.layers.concatenate([lstm_out_x1, lstm_out_x2])
x = Dense(64, activation='relu')(x)

# generate intermediate output
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(x)

# add auxiliary input - auxiliary input contains 25 features for each document pair
auxiliary_input = Input(shape=(25,), name='aux_input')

# merge aux output with aux input and stack dense layers on top
main_input = keras.layers.concatenate([auxiliary_output, auxiliary_input])
x = Dense(64, activation='relu')(main_input)
x = Dense(64, activation='relu')(x)

# finally add the main output layer
main_output = Dense(1, activation='sigmoid', name='main_output')(x)

model = Model(inputs=[main_input1, main_input2, auxiliary_input],
              outputs=main_output)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit([x1, x2, aux_input], y,
          epochs=3, batch_size=32)
However, when I score this on the training data, I get the same prob. score for all cases. The issue seems to be with the way auxiliary input is fed in (because it generates meaningful output when I remove the aux. input).
I also tried inserting the auxiliary input at different places in the network, but somehow I couldn't get this to work.
Any pointers?
Well, this has been open for several months and people are voting it up.
I did something very similar recently using this dataset that can be used to forecast credit card defaults; it contains categorical data about customers (gender, education level, marriage status, etc.) as well as payment history as a time series, so I had to merge time series with non-series data. My solution was very similar to yours, combining an LSTM with Dense layers, and I try to adapt the approach to your problem. What worked for me is dense layer(s) on the auxiliary input.
Furthermore, in your case a shared layer would make sense, so the same weights are used to "read" both documents. My proposal for testing on your data:
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model

# Each document contains a series of 200 words
# The necessary text pre-processing steps have been completed to transform
# each doc to a fixed length seq
main_input1 = Input(shape=(200,), dtype='int32', name='main_input1')
main_input2 = Input(shape=(200,), dtype='int32', name='main_input2')

# Next I add a word embedding layer (embed_matrix is separately created
# for each word in my vocabulary by reading from a pre-trained embedding model)
x1 = Embedding(output_dim=300, input_dim=20000,
               input_length=200, weights=[embed_matrix])(main_input1)
x2 = Embedding(output_dim=300, input_dim=20000,
               input_length=200, weights=[embed_matrix])(main_input2)

# Next pass each embedded doc through an LSTM layer to transform the seq of
# vectors into a single vector
# Comment Manngo: Here I changed to a shared layer
# Also renamed y as it was confusing
# Now x and y are x1 and x2
lstm_reader = LSTM(32)
lstm_out_x1 = lstm_reader(x1)
lstm_out_x2 = lstm_reader(x2)

# concatenate the 2 layers and stack dense layers on top
x = keras.layers.concatenate([lstm_out_x1, lstm_out_x2])
x = Dense(64, activation='relu')(x)
x = Dense(32, activation='relu')(x)

# generate intermediate output
# Comment Manngo: This is created as a dead-end
# It will not be used as an input of any layers below
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(x)

# add auxiliary input - auxiliary input contains 25 features for each document pair
# Comment Manngo: Dense branch on the comparison features
auxiliary_input = Input(shape=(25,), name='aux_input')
aux = Dense(64, activation='relu')(auxiliary_input)
aux = Dense(32, activation='relu')(aux)

# OLD: merge aux output with aux input and stack dense layer on top
# Comment Manngo: this actually merges the document branch with the processed aux input
main_merge = keras.layers.concatenate([x, aux])
main = Dense(64, activation='relu')(main_merge)
main = Dense(64, activation='relu')(main)

# finally add the main output layer
main_output = Dense(1, activation='sigmoid', name='main_output')(main)

# Compile
# Comment Manngo: also define weighting of outputs, main as 1, auxiliary as 0.5
model = Model(inputs=[main_input1, main_input2, auxiliary_input],
              outputs=[main_output, auxiliary_output])
model.compile(optimizer='adam',
              loss={'main_output': 'binary_crossentropy', 'aux_output': 'binary_crossentropy'},
              loss_weights={'main_output': 1., 'aux_output': 0.5},
              metrics=['accuracy'])

# Train model on main_output and on auxiliary_output as a support
# Comment Manngo: Unknown information marked with placeholders ____
# We have 3 inputs: x1 and x2 (the 2 documents) and aux_in (the 25 features)
# We have 2 outputs: main and auxiliary; both have the same targets -> (binary) y
model.fit({'main_input1': __x1__, 'main_input2': __x2__, 'aux_input': __aux_in__},
          {'main_output': __y__, 'aux_output': __y__},
          epochs=1000,
          batch_size=__,
          validation_split=0.1,
          callbacks=[____])
I don't know how much this will help since I don't have your data, so I can't try it. Nevertheless, this is my best shot.
I didn't run the above code for obvious reasons.
I found answers at https://datascience.stackexchange.com/questions/17099/adding-features-to-time-series-model-lstm. Philippe Remy wrote a library, cond_rnn, to condition RNNs on auxiliary inputs. I used his library and it's very helpful.
# 10 stations
# 365 days
# 3 continuous variables A and B => C is target.
# 2 conditions dim=5 and dim=1. First cond is one-hot. Second is continuous.
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from cond_rnn import ConditionalRNN
stations = 10 # 10 stations.
time_steps = 365 # 365 days.
continuous_variables_per_station = 3 # A,B,C where C is the target.
condition_variables_per_station = 2 # 2 variables of dim 5 and 1.
condition_dim_1 = 5
condition_dim_2 = 1
np.random.seed(123)
continuous_data = np.random.uniform(size=(stations, time_steps, continuous_variables_per_station))
condition_data_1 = np.zeros(shape=(stations, condition_dim_1))
condition_data_1[:, 0] = 1 # dummy.
condition_data_2 = np.random.uniform(size=(stations, condition_dim_2))
window = 50 # we split series in 50 days (look-back window)
x, y, c1, c2 = [], [], [], []
for i in range(window, continuous_data.shape[1]):
    x.append(continuous_data[:, i - window:i])
    y.append(continuous_data[:, i])
    c1.append(condition_data_1)  # just replicate.
    c2.append(condition_data_2)  # just replicate.
# now we have (batch_dim, station_dim, time_steps, input_dim).
x = np.array(x)
y = np.array(y)
c1 = np.array(c1)
c2 = np.array(c2)
print(x.shape, y.shape, c1.shape, c2.shape)
# let's collapse the station_dim in the batch_dim.
x = np.reshape(x, [-1, window, x.shape[-1]])
y = np.reshape(y, [-1, y.shape[-1]])
c1 = np.reshape(c1, [-1, c1.shape[-1]])
c2 = np.reshape(c2, [-1, c2.shape[-1]])
print(x.shape, y.shape, c1.shape, c2.shape)
model = Sequential(layers=[
    ConditionalRNN(10, cell='GRU'),      # num_cells = 10
    Dense(units=1, activation='linear')  # regression problem.
])
model.compile(optimizer='adam', loss='mse')
model.fit(x=[x, c1, c2], y=y, epochs=2, validation_split=0.2)
