Strange behaviour in sequence-to-sequence learning for variable-length sequences - Keras

I am training a sequence to sequence model for variable length sequences with Keras, but I am running into some unexpected problems. It is unclear to me whether the behaviour I am observing is the desired behaviour of the library and why it would be.
Model Creation
I've made a recurrent model with an embedding layer and a GRU recurrent layer that illustrates the problem. I used mask_zero=True on the Embedding layer instead of a Masking layer, but changing this doesn't seem to make a difference (nor does adding a Masking layer before the output):
import numpy
from keras.layers import Embedding, GRU, TimeDistributed, Dense, Input
from keras.models import Model
import keras.preprocessing.sequence
numpy.random.seed(0)
input_layer = Input(shape=(3,), dtype='int32', name='input')
embeddings = Embedding(input_dim=20, output_dim=2, input_length=3, mask_zero=True, name='embeddings')(input_layer)
recurrent = GRU(5, return_sequences=True, name='GRU')(embeddings)
output_layer = TimeDistributed(Dense(1), name='output')(recurrent)
model = Model(input=input_layer, output=output_layer)
# manually set the output layer's bias to 0.2 so that masked timesteps are easy to spot
output_weights = model.layers[-1].get_weights()
output_weights[1] = numpy.array([0.2])
model.layers[-1].set_weights(output_weights)
model.compile(loss='mse', metrics=['mse'], optimizer='adam', sample_weight_mode='temporal')
I use masking and the sample_weight parameter to exclude the padding values from the training/evaluation. I will test this model on one input/output sequence which I pad using the Keras padding function:
X = [[1, 2]]
X_padded = keras.preprocessing.sequence.pad_sequences(X, dtype='float32', maxlen=3)
Y = [[[1], [2]]]
Y_padded = keras.preprocessing.sequence.pad_sequences(Y, maxlen=3, dtype='float32')
Output Shape
Why is the output expected to be formatted in this way? Why can I not use input and output sequences that have exactly the same dimensionality? model.evaluate(X_padded, Y_padded) gives me a dimensionality error.
Then, when I run model.predict(X_padded) I get the following output (with numpy.random.seed(0) before generating the model):
[[[ 0.2 ]
[ 0.19946882]
[ 0.19175649]]]
Why isn't the first timestep masked for the output layer? Is the output value computed anyway (and equal to the bias, as the hidden layer values are 0)? This does not seem desirable. Adding a Masking layer before the output layer does not solve this problem.
MSE calculation
Then, when I evaluate the model (model.evaluate(X_padded, Y_padded)), it returns the mean squared error (MSE) of the entire sequence (1.3168), including this first value, which I suppose is to be expected when it isn't masked, but it is not what I want.
From the Keras documentation I understand I should use the sample_weight parameter to solve this problem, which I tried:
sample_weight = numpy.array([[0, 1, 1]])
model_evaluation = model.evaluate(X_padded, Y_padded, sample_weight=sample_weight)
print model.metrics_names, model_evaluation
The output I get is
['loss', 'mean_squared_error'] [2.9329459667205811, 1.3168648481369019]
This leaves the metric (MSE) unaltered: it is still the MSE over all values, including the one that I wanted masked. Why? This is not what I want when I evaluate my model. The sample weights do change the loss value, which appears to be the MSE over the last two values, normalised so as not to give more weight to longer sequences.
Am I doing something wrong with the sample weights? Also, I really cannot figure out how this loss value came about. What should I do to exclude the padded values from both training and evaluation? (I assume the sample_weight parameter works the same way in the fit function.)

It was indeed a bug in the library; in Keras 2 this issue is resolved.
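If you want to sanity-check the numbers yourself, the masked MSE can be computed by hand from the variables defined in the question (a minimal sketch, assuming Y_padded has shape (1, 3, 1)):
pred = model.predict(X_padded)                       # shape (1, 3, 1)
mask = sample_weight[..., None]                      # (1, 3, 1); zero at the padded timestep
masked_mse = ((pred - Y_padded) ** 2 * mask).sum() / mask.sum()
print(masked_mse)                                    # MSE over the two real timesteps only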

Related

Why unused embedding vector changed after backward?

I'm partially fine-tuning an embedding matrix (only for some particular indices). But in the backward phase, other embeddings that are not expected to be modified are changed as well.
Here is an example:
import torch
from torch import nn
embeds = nn.Embedding(10, 3)
optim = torch.optim.Adam(params=embeds.parameters(), lr=0.001, weight_decay=1e-6)
indexes = torch.tensor([1,2,3]) # expected to finetune
bce_loss = nn.BCELoss()
optim.zero_grad()
loss = bce_loss(embeds(indexes).sum(dim=-1), torch.zeros(3))
loss.backward()
optim.step()
I only want to fine-tune the 1st, 2nd, and 3rd embeddings, but after the backward pass, the entire embedding matrix is changed. Why? And how can I solve this problem? Thanks!
The change in the embedding matrix after optim.step() is not entirely due to the loss you defined (i.e. BCELoss). You are overlooking the other part of the loss that is implicitly defined by the weight_decay parameter in your optimizer. When you supply a non-zero weight decay, it adds a second component to the loss that is over all parameters of the model. Just turn that off:
optim = torch.optim.Adam(..., weight_decay=0.)
and you will see the expected result.
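If it helps, here is a quick way to convince yourself that only the selected rows move once weight decay is off (a minimal sketch; it uses a trivial sum loss instead of BCELoss, purely to produce a gradient):
import torch
from torch import nn
embeds = nn.Embedding(10, 3)
before = embeds.weight.detach().clone()
optim = torch.optim.Adam(params=embeds.parameters(), lr=0.001, weight_decay=0.)
indexes = torch.tensor([1, 2, 3])
loss = embeds(indexes).sum()   # gradient is nonzero only for rows 1, 2, 3
optim.zero_grad()
loss.backward()
optim.step()
changed = (embeds.weight.detach() != before).any(dim=1)
print(changed)                 # True only at indices 1, 2 and 3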

Tensorflow 1.15 / Keras 2.3.1 Model.train_on_batch() returns more values than there are outputs/loss functions

I am trying to train a model that has more than one output and as a result, also has more than one loss function attached to it when I compile it.
I haven't done anything similar before (not from scratch, at least).
Here's some code I am using to figure out how this works.
import numpy as np
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
batch_size = 50
input_size = 10
output_size = 10  # not defined in the original snippet; any value works
i = Input(shape=(input_size,))
x = Dense(100)(i)
x_1 = Dense(output_size)(x)
x_2 = Dense(output_size)(x)
model = Model(i, [x_1, x_2])
model.compile(optimizer = 'adam', loss = ["mse", "mse"])
# Data creation
x = np.random.random_sample([batch_size, input_size]).astype('float32')
y = np.random.random_sample([batch_size, output_size]).astype('float32')
loss = model.train_on_batch(x, [y,y])
print(loss) # sample output [0.8311912, 0.3519104, 0.47928077]
I would expect the variable loss to have two entries (one for each loss function); however, I get back three. I thought maybe one of them was the weighted average, but that does not seem to be the case.
Could anyone explain how passing in multiple loss functions works? I am obviously misunderstanding something.
I believe the three outputs are the sum of all the losses, followed by the individual losses on each output.
For example, if you look at the sample output you've printed there:
0.3519104 + 0.47928077 = 0.83119117 ≈ 0.8311912
Your assumption that there should be two losses is incorrect. You have a model with two outputs, and you specified one loss for each output, but the model has to be trained on a single loss, so Keras trains the model on a new loss that is the sum of the per-output losses.
You can control how these losses are mixed using the loss_weights parameter in model.compile. I think by default it takes weight values equal to 1.0.
So in the end, what train_on_batch returns is the total loss, the MSE of output one, and the MSE of output two. That is why you get three values.
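A quick way to see which returned value is which is model.metrics_names, and loss_weights in compile controls the mixing (a small sketch; the exact metric names depend on the auto-generated layer names):
print(model.metrics_names)                  # e.g. ['loss', 'dense_1_loss', 'dense_2_loss']
total, loss_1, loss_2 = model.train_on_batch(x, [y, y])
print(abs(total - (loss_1 + loss_2)))       # ~0 with the default loss_weights of 1.0
model.compile(optimizer='adam', loss=['mse', 'mse'], loss_weights=[0.7, 0.3])
# after recompiling, the reported total becomes 0.7 * mse_1 + 0.3 * mse_2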

Is it possible to predict a certain numerical value given a DNA sequence using LSTM?

I have DNA sequences of 16 letters. For each 16-letter DNA sequence there is an output value, the so-called 'inhibition value', which ranges from 0 to 100. When I tried using an LSTM, the prediction only outputs a constant. Does the problem lie in the code, or is it just not a suitable task for an LSTM, or RNNs in general, to solve?
I have tried to increase batch size and epochs, make the LSTM deeper, change the number of LSTM units, but none of them works.
I was also wondering whether the encoding method matters. I tried a one-hot encoder at first, but it didn't work. Then I changed it to LabelEncoder, but that is also not working; the same constant output is produced.
Below is the code for my model structure:
from keras.layers import Input, LSTM, Dense
from keras.models import Model
import keras
def create_model():
    input1 = Input(shape=(16, 1))
    classifier = LSTM(64, input_shape=(16, 1), return_sequences=True)(input1)
    for i in range(2):
        classifier = LSTM(32, return_sequences=True)(classifier)
    classifier = LSTM(32)(classifier)
    classifier = Dense(1, activation='relu')(classifier)
    model = Model(inputs=[input1], outputs=classifier)
    adam = keras.optimizers.Adam(lr=0.01)
    model.compile(loss='mean_squared_error', optimizer=adam)
    return model
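For reference, a minimal sketch of the data layout this model expects (random dummy data, purely to illustrate the shapes; the real inputs would be the label-encoded letters reshaped to (samples, 16, 1)):
import numpy as np
model = create_model()
X = np.random.randint(0, 4, size=(100, 16, 1)).astype('float32')   # label-encoded letters
y = np.random.uniform(0, 100, size=(100, 1)).astype('float32')     # inhibition values
model.fit(X, y, epochs=2, batch_size=16)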
If anyone is wondering why I use the functional API instead of Sequential, it is because of a possible modification where I need to use two input variables that need to be processed independently before being concatenated at the end.
Thank you in advance.

How to use the result of embedding with mask_zero=True in keras

In Keras, I want to calculate the mean of the nonzero embedding outputs.
I wonder what the difference is between mask_zero=True and False in the Embedding layer.
I tried the code below:
from keras.layers import Input, Embedding, Lambda
from keras import backend as K
input_data = Input(shape=(5,), dtype='int32', name='input')
embedding_layer = Embedding(1000, 24, input_length=5, mask_zero=True, name='embedding')
out = embedding_layer(input_data)
def antirectifier(x):
    x = K.mean(x, axis=1, keepdims=True)
    return x
def antirectifier_output_shape(input_shape):
    shape = list(input_shape)
    return tuple(shape)
out = Lambda(antirectifier, output_shape=antirectifier_output_shape, name='lambda')(out)
But it seems that the result is the mean over all elements; how can I calculate the mean over only the nonzero inputs?
From the function's documentation:
If this is True then all subsequent layers in the model need to
support masking
Your Lambda layer doesn't support masking. Recurrent layers in Keras, for example, do support masking. If you set mask_zero=True in your Embedding layer, then all the 0 indices that you feed to it will be propagated as "masked", and the following layers that are able to understand the "masked" information will use it.
Basically, if you build a "mean" layer that grabs the mask and computes the average only for non-masked values, then you will get the desired results.
You can find here a way to build Lambda layers that support masking.
I hope it helps.
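For example, a mask-aware mean can be built by recomputing the mask from the input indices inside the Lambda (a minimal sketch; it deliberately rebuilds the mask itself rather than relying on mask propagation, since a plain Lambda layer ignores the propagated mask):
from keras.layers import Input, Embedding, Lambda
from keras.models import Model
from keras import backend as K
input_data = Input(shape=(5,), dtype='int32', name='input')
embedded = Embedding(1000, 24, input_length=5, name='embedding')(input_data)
def masked_mean(args):
    emb, idx = args
    mask = K.expand_dims(K.cast(K.not_equal(idx, 0), 'float32'), axis=-1)   # (batch, 5, 1)
    total = K.sum(emb * mask, axis=1)                                       # (batch, 24)
    count = K.maximum(K.sum(mask, axis=1), 1.0)                             # avoid division by zero
    return total / count
mean = Lambda(masked_mean, name='masked_mean')([embedded, input_data])
model = Model(inputs=input_data, outputs=mean)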

Keras: LSTM with class weights

My question is quite closely related to this question, but also goes beyond it.
I am trying to implement the following LSTM in Keras where
the number of timesteps is nb_tsteps=10
the number of input features is nb_feat=40
the number of LSTM cells at each time step is 120
the LSTM layer is followed by TimeDistributedDense layers
From the question referenced above I understand that I have to present the input data as
nb_samples, 10, 40
where I get nb_samples by rolling a window of length nb_tsteps=10 across the original timeseries of shape (5932720, 40). The code is hence
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, TimeDistributed
model = Sequential()
model.add(LSTM(120, input_shape=(X_train.shape[1], X_train.shape[2]),
               return_sequences=True, consume_less='gpu'))
model.add(TimeDistributed(Dense(50, activation='relu')))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(20, activation='relu')))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(10, activation='relu')))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(3, activation='relu')))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
Now to my question (assuming the above is correct so far):
The binary responses (0/1) are heavily imbalanced, and I need to pass a class_weight dictionary like cw = {0: 1, 1: 25} to model.fit(). However, I get the exception class_weight not supported for 3+ dimensional targets. This is because I present the response data as (nb_samples, 1, 1). If I reshape it into a 2D array (nb_samples, 1), I get the exception Error when checking model target: expected timedistributed_5 to have 3 dimensions, but got array with shape (5932720, 1).
Thanks a lot for any help!
I think you should use sample_weight with sample_weight_mode='temporal'.
From the Keras docs:
sample_weight: Numpy array of weights for the training samples, used
for scaling the loss function (during training only). You can either
pass a flat (1D) Numpy array with the same length as the input samples
(1:1 mapping between weights and samples), or in the case of temporal
data, you can pass a 2D array with shape (samples, sequence_length),
to apply a different weight to every timestep of every sample. In this
case you should make sure to specify sample_weight_mode="temporal" in
compile().
In your case you would need to supply a 2D array with the same shape as your labels.
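For instance, to emulate cw = {0: 1, 1: 25} per timestep, something like this should work (a minimal sketch, assuming y_train has shape (nb_samples, 10, 1)):
import numpy as np
sample_weight = np.ones(y_train.shape[:2])            # shape (nb_samples, 10)
sample_weight[y_train[:, :, 0] == 1] = 25              # up-weight the rare class
model.compile(loss='binary_crossentropy', optimizer='adam', sample_weight_mode='temporal')
model.fit(X_train, y_train, sample_weight=sample_weight)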
If this is still an issue: I think the TimeDistributed layer expects and returns a 3D array (similar to having return_sequences=True in a regular LSTM layer). Try adding a Flatten() layer or another LSTM layer at the end, before the prediction layer.
d = TimeDistributed(Dense(10))(input_from_previous_layer)
lstm_out = Bidirectional(LSTM(10))(d)
output = Dense(1, activation='sigmoid')(lstm_out)
Using sample_weight_mode='temporal' is a workaround. Check out this stack. The issue is also documented on GitHub.
