Is it possible to make the output length longer than the input with Keras?

I want to predict some indices related to space weather (Kp, Dst, etc.) with an RNN or LSTM. I was able to build a many-to-one model, although it shows poor accuracy. However, my goal is to predict 7 days into the future from the last 3 days of observations.
The question is: is it functionally possible to build an RNN whose output length (number of timesteps) is longer than its input?
Any help would be greatly appreciated!

You can. One way of doing it is with the so-called "sequence-to-sequence" architecture. It enables you to input a sequence of data points and predict a sequence of values. It works by encoding the input into a fixed-size vector (usually the last hidden state of an LSTM) and then using it as the starting state of another LSTM that is unrolled for n time steps.
If you have labels for all 7 steps you want to predict, then you can use a model like the following one, where the decoder takes the labels as input for each time step 0 <= t < n:
from keras.models import Model
from keras.layers import Input, LSTM, Dense

encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# Keep only the final states as the summary of the input sequence.
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
As you can see, the first LSTM encodes the input sequence (each timestep having num_encoder_tokens features) and we keep only its last states (encoder_states = [state_h, state_c]), which are used to initialise the second LSTM (decoder_lstm); the decoder is then fed with the labels and tries to predict the label sequence shifted right by one.
For example, if your input sequence is day1, day2, day3 and you want to predict day4, day5, ..., day10, then you feed the encoder with day1 to day3 and use day4 ... day9 as input for the decoder. The decoder labels are then day5 ... day10.
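To make the day1 ... day10 example concrete, here is a minimal sketch (not part of the original answer) of how the training arrays could be laid out from a single daily index series; the function name, array names and shapes are assumptions:
import numpy as np

# Hypothetical sketch: build (encoder_input, decoder_input, decoder_target)
# triples from a 1-D daily index series, matching the day1..day10 example above.
def make_windows(series, n_in=3, n_out=7):
    enc_in, dec_in, dec_out = [], [], []
    for start in range(len(series) - n_in - n_out + 1):
        past = series[start:start + n_in]                   # day1..day3
        future = series[start + n_in:start + n_in + n_out]  # day4..day10
        enc_in.append(past)
        dec_in.append(future[:-1])   # day4..day9  -> decoder input
        dec_out.append(future[1:])   # day5..day10 -> decoder target (shifted by one)
    # reshape to (samples, timesteps, features) for the LSTMs
    to3d = lambda a: np.array(a)[..., np.newaxis].astype("float32")
    return to3d(enc_in), to3d(dec_in), to3d(dec_out)
For a real-valued index you would typically also replace the softmax Dense layer with a linear one and train with an MSE loss instead of a categorical one.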
You can read more about this kind of model at https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html.

Related

Keras: how to see into the black box of a model

I want to see the dimensions while the model is training, because the model trains as a black box.
I want to see the dimension of the decoder input at each step as the decoder moves through the time series. It is a teacher-forcing decoder. But when I fit the model it only shows me accuracy, loss, epochs and iterations, even if I apply verbose=1 or 2. The Keras code is below.
# TRAINING WITH TEACHER FORCING
# Define an input sequence and process it.
encoder_inputs = Input(shape=(n_timesteps_in, n_features))
encoder_lstm = LSTM(LSTMoutputDimension, return_state=True)
LSTM_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
# We discard `LSTM_outputs` and only keep the other states.
encoder_states = [state_h, state_c]
decoder_inputs = Input(shape=(None, n_features), name='decoder_inputs')
decoder_lstm = LSTM(LSTMoutputDimension, return_sequences=True, return_state=True, name='decoder_lstm')
# Set up the decoder, using the `context vector` as initial state.
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
# Complete the decoder model by adding a Dense layer with softmax activation
# for prediction of the next output.
# The Dense layer will output a one-hot encoded representation, as for the input,
# so we use n_features neurons.
decoder_dense = Dense(n_features, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_outputs)
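The snippet above stops before the model is closed off. As a hedged sketch (assuming the same variables are still in scope), one way to at least see the static shape of every layer is to wrap everything in a Model and print its summary; this shows the layer dimensions, though not the per-batch values during training:
from keras.models import Model

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()  # prints the output shape of the encoder LSTM, decoder LSTM and Dense layers

# The symbolic tensors also carry their (batch, timesteps, features) shapes:
print(decoder_inputs.shape)   # (None, None, n_features) -- timesteps unknown until runtime
print(decoder_outputs.shape)  # (None, None, n_features)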

Dimension of the input layer for embeddings in Keras

It is not clear to me whether there is any difference between specifying the input dimension, Input(shape=(20,)), or not specifying it, Input(shape=(None,)), in the following example:
input_layer = Input(shape=(None,))
emb = Embedding(86, 300) (input_layer)
lstm = Bidirectional(LSTM(300)) (emb)
output_layer = Dense(10, activation="softmax") (lstm)
model = Model(input_layer, output_layer)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["acc"])
history = model.fit(my_x, my_y, epochs=1, batch_size=632, validation_split=0.1)
my_x (shape: 2000, 20) contains integers referring to characters, while my_y contains the one-hot encoding of some labels. With Input(shape=(None,)), I see that I could use model.predict(my_x[:, 0:10]), i.e., I could give only 10 characters as an input instead of 20: how is that possible? I was assuming that all the 20 dimensions in my_x were needed to predict the corresponding y.
What you say with shape=(20,) is that the sequences you feed into the model have a strict length of 20, while None leaves the length unspecified. A model usually needs a fixed input size, but recurrent neural networks (such as the LSTM you use there) do not need a fixed sequence length: the LSTM simply loops over the timesteps, so it does not care whether your sequence contains 20 or 100 of them. However, when you fix the number of timesteps to 20, the model expects 20 and will raise an error if it does not get them.
For more information, see this post by Tim.
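As a hedged illustration of that difference, reusing the names from the question (the slice lengths are only examples):
# With Input(shape=(None,)) the sequence length is left open, so any length works at predict time:
model.predict(my_x[:, 0:10])   # 10 characters per sample
model.predict(my_x)            # the full 20 characters

# With Input(shape=(20,)) Keras checks the declared length, and
# model.predict(my_x[:, 0:10]) raises a shape-mismatch error.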

How to test a model trained using teacher forcing

I used Keras to train a seq2seq model (keras.models.Model). The X and y for the model are [X_encoder, X_decoder] and y, i.e. a list of encoder and decoder inputs, plus the labels (note that the decoder input, X_decoder, is 'y' shifted by one position relative to the actual y; basically, teacher forcing).
So my question is: after training, when it comes to actual prediction where I do not have any labels, how do I provide 'X_decoder' as input? Or do I train on something else?
This is a snippet of the model definition, in case it helps:
# Encoder
encoder_inputs = Input(batch_shape=(batch_size, max_len,), dtype='int32')
encoder_embedding = embedding_layer(encoder_inputs)
encoder_LSTM = CuDNNLSTM(hidden_dim, return_state=True, stateful=True)
encoder_outputs, state_h, state_c = encoder_LSTM(encoder_embedding)
# Decoder
decoder_inputs = Input(shape=(max_len,), dtype='int32')
decoder_embedding = embedding_layer(decoder_inputs)
decoder_LSTM = CuDNNLSTM(hidden_dim, return_state=True, return_sequences=True)
decoder_outputs, _, _ = decoder_LSTM(decoder_embedding, initial_state=[state_h, state_c])
# Output
outputs = TimeDistributed(Dense(vocab_size, activation='softmax'))(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], outputs)
# model fitting:
model.fit([X_encoder, X_decoder], y,
          steps_per_epoch=int(number_of_train_samples / batch_size),
          epochs=epochs)
Usually, when you train a seq2seq model, the first token of decoder_inputs is a special <start> token. So when you try to generate a sentence, you do it like this:
first_token = decoder(encoder_state, [<start>])
second_token = decoder(encoder_state, [<start>, first_token])
third_token = decoder(encoder_state, [<start>, first_token, second_token])
...
You execute this recursion until your decoder generates another special token, <end>; then you stop.
Here is a very crude pythonic decoder for your model. It is inefficient, because it reads the whole input over and over again instead of memorizing the RNN state, but it works.
input_seq = ...  # some array of token indices
result = np.array([[START_TOKEN]])
next_token = -1
for i in range(100500):
    next_token = model.predict([input_seq, result])[0][-1].argmax()
    if next_token == END_TOKEN:
        break
    result = np.concatenate([result, [[next_token]]], axis=1)
output_seq = result[0][1:]  # omit the first START_TOKEN
A more efficient solution would output the RNN state along with each token and use it to produce the next token.
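A rough sketch of that more efficient approach for the model above, heavily hedged: it reuses the already-trained layer objects, ignores the stateful/batch_shape details of the encoder, and the names dec_token, dec_state_h, dec_state_c, encoder_model and decoder_model are hypothetical:
from keras.models import Model
from keras.layers import Input

# Encoder: map the input sequence to its final LSTM states.
encoder_model = Model(encoder_inputs, [state_h, state_c])

# Decoder: predict one token at a time, carrying the states explicitly.
dec_token = Input(shape=(1,), dtype='int32')      # the previously generated token
dec_state_h = Input(shape=(hidden_dim,))
dec_state_c = Input(shape=(hidden_dim,))
dec_emb = embedding_layer(dec_token)              # reuse the trained embedding
dec_seq, dec_h, dec_c = decoder_LSTM(dec_emb, initial_state=[dec_state_h, dec_state_c])
output_layer = model.layers[-1]                   # assumed to be the TimeDistributed(Dense) above
dec_probs = output_layer(dec_seq)
decoder_model = Model([dec_token, dec_state_h, dec_state_c],
                      [dec_probs, dec_h, dec_c])

# Inference: states = encoder_model.predict(input_seq), then repeatedly call
# decoder_model.predict([last_token, h, c]) and feed the returned h, c back in.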

LSTM Encoder-Decoder Inference Model

Many tutorials for the seq2seq encoder-decoder architecture based on LSTM (for example, English-French translation) define the model as follows:
encoder_inputs = Input(shape=(None,))
en_x = Embedding(num_encoder_tokens, embedding_size)(encoder_inputs)
# Encoder LSTM
encoder = LSTM(50, return_state=True)
encoder_outputs, state_h, state_c = encoder(en_x)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,))
# French word embeddings
dex = Embedding(num_decoder_tokens, embedding_size)
final_dex = dex(decoder_inputs)
# Decoder LSTM
decoder_lstm = LSTM(50, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(final_dex,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
# While training, the model takes English and French words and outputs
# the translated French word
fullmodel = Model([encoder_inputs, decoder_inputs], decoder_outputs)
# rmsprop is preferred for NLP tasks
fullmodel.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])
fullmodel.fit([encoder_input_data, decoder_input_data], decoder_target_data,
              batch_size=128,
              epochs=100,
              validation_split=0.20)
Then, for prediction, they define inference models as follows:
# Define the encoder model
encoder_model = Model(encoder_inputs, encoder_states)
encoder_model.summary()
# Redefine the decoder model; at prediction time the decoder will receive
# the states below from the encoder.
decoder_state_input_h = Input(shape=(50,))
decoder_state_input_c = Input(shape=(50,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
final_dex2 = dex(decoder_inputs)
decoder_outputs2, state_h2, state_c2 = decoder_lstm(final_dex2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)
# The sampling model takes the encoder states and decoder_inputs (the seed initially)
# and outputs the predictions (French word indices); decoder_states2 are returned
# so they can be fed back in at the next step.
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)
Then predict using:
# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0] = target_token_index['START_']
    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)
        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += ' ' + sampled_char
        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '_END' or
                len(decoded_sentence) > 52):
            stop_condition = True
        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index
        # Update states
        states_value = [h, c]
    return decoded_sentence
My question is: they trained the model named 'fullmodel' to get the best weights, but in the prediction part they use the inference models named encoder_model and decoder_model, so don't they end up not using any weights from 'fullmodel'?!
I don't understand how they benefit from the trained model!
The trick is that everything is built from the same layer objects, so the trained variables get reused.
If you look carefully, the trained layer weights are being reused.
For example, while creating decoder_model we use the decoder_lstm layer that was defined as part of the full model:
decoder_outputs2, state_h2, state_c2 = decoder_lstm(final_dex2, initial_state=decoder_states_inputs)
and the encoder model likewise uses encoder_inputs and encoder_states defined previously:
encoder_model = Model(encoder_inputs, encoder_states)
Due to the architecture of the encoder-decoder model, we need to perform these implementation tricks.
Also, as the Keras documentation mentions: "With the functional API, it is easy to reuse trained models: you can treat any model as if it were a layer, by calling it on a tensor. Note that by calling a model you aren't just reusing the architecture of the model, you are also reusing its weights." For more details see https://keras.io/getting-started/functional-api-guide/#all-models-are-callable-just-like-layers
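A tiny, self-contained illustration of that point (the names here are made up, not taken from the question):
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

shared = Dense(4, name='shared_dense')    # one layer object

inp_a = Input(shape=(8,))
model_a = Model(inp_a, shared(inp_a))     # "training" model

inp_b = Input(shape=(8,))
model_b = Model(inp_b, shared(inp_b))     # "inference" model built later

# Both models point at the very same layer and weights, so anything
# model_a learns is immediately visible to model_b.
assert model_a.get_layer('shared_dense') is model_b.get_layer('shared_dense')
print(np.allclose(model_a.get_weights()[0], model_b.get_weights()[0]))  # True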

Keras seq2seq padding

I am working on a seq2seq chatbot. I would like to ask how to ignore PAD symbols in the chatbot's responses while val_acc is being computed.
For example, my model generates the response: [I, am, reading, a, book, PAD, PAD, PAD, PAD, PAD]
But the right response should be: [My, brother, is, playing, football, PAD, PAD, PAD, PAD, PAD]
In this case, the chatbot responded totally wrong, but val_acc is 50% because of the padding symbols.
I use the Keras encoder-decoder model (https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html) with teacher forcing.
My code is here:
encoder_inputs = Input(shape=(sentenceLength,), name="Encoder_input")
encoder = LSTM(n_units, return_state=True, name='Encoder_lstm')
Shared_Embedding = Embedding(output_dim=embedding, input_dim=vocab_size, name="Embedding", mask_zero='True')
word_embedding_context = Shared_Embedding(encoder_inputs)
encoder_outputs, state_h, state_c = encoder(word_embedding_context)
encoder_states = [state_h, state_c]
decoder_inputs = Input(shape=(None,), name="Decoder_input")
decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True, name="Decoder_lstm")
word_embedding_answer = Shared_Embedding(decoder_inputs)
decoder_outputs, _, _ = decoder_lstm(word_embedding_answer, initial_state=encoder_states)
decoder_dense = Dense(vocab_size, activation='softmax', name="Dense_layer")
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
The encoder input is a sentence where each word is an integer and 0 is padding: [1,2,5,4,3,0,0,0] -> user question
The decoder input is also a sentence where each word is an integer, 0 is padding and 100 is the GO symbol: [100,8,4,2,0,0,0,0,0] -> chatbot response shifted by one timestep
The decoder output is the sentence where the words are integers and these integers are one-hot encoded: [8,4,2,0,0,0,0,0,0] -> chatbot response
The problem is that val_acc is too high, even when the model predicts totally wrong sentences. I think this is caused by the padding. Is there something wrong with my model? Should I add another mask to my decoder?
Here are my graphs:
You are correct; it is because that tutorial doesn't use Masking (see the documentation) to ignore those padding values, and it shows examples with equal input and output lengths. In your case, the model will still take PAD as input and output PAD, but the mask will ignore them. For example, to mask the encoder:
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
masked_encoder_inputs = Masking()(encoder_inputs)  # assuming PAD timesteps are all zeros
encoder = LSTM(latent_dim, return_state=True)
# Now the LSTM will ignore the PADs when encoding
# by skipping those timesteps that are masked
encoder_outputs, state_h, state_c = encoder(masked_encoder_inputs)
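The decoder side can be masked in the same way, which keeps the padded timesteps out of the loss (and, in recent Keras versions, out of metrics such as val_acc). A hedged sketch continuing the snippet above, assuming the same imports and that PAD timesteps are all-zero vectors:
decoder_inputs = Input(shape=(None, num_decoder_tokens))
masked_decoder_inputs = Masking()(decoder_inputs)   # again assuming PAD is zeros
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(masked_decoder_inputs, initial_state=[state_h, state_c])
# If your Keras version complains that Dense does not support masking,
# wrap it in TimeDistributed(Dense(...)), which does.
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)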
