Using Keras for predicting the next word

I have a sequence prediction problem that I approach as a language model.
My data contains 4 choices (1-4) and a reward (1-100).
I started using Keras but I'm not sure it has the flexibility I need.
This is how the model's architecture looks:
I'm not sure about the test phase. One option is sampling:
And I'm not sure how to evaluate the output of this option vs my test set.
Another option is to give the trained model a sequence and let it predict the last timestep's value (like giving a sentence and predicting the last word) - but still having x = t_hat.
Is this possible in Keras? I can't find examples like this.
Besides passing the previous choice (or previous word) as an input, I need to pass a second feature, which is the reward value. The choices are one-hot encoded; how can I combine a single number with an encoded vector?
EDIT:
This is the training phase (I haven't done the sampling yet):
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(64, input_shape=(seq_length, X_train.shape[2]), return_sequences=True))  # one output per timestep
model.add(Dense(y_cat_train.shape[2], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_cat_train, epochs=100, batch_size=10, verbose=2)
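For illustration, a rough sketch of what the sampling step could look like with this model (the sliding-window loop and the zero reward placeholder are assumptions, not part of the original post):

import numpy as np

# Start from a seed window of shape (1, seq_length, num_features)
seq = X_train[:1].copy()
sampled_choices = []
for _ in range(20):                                   # sample 20 future steps
    probs = model.predict(seq)[0, -1]                 # softmax over the last timestep
    choice = np.random.choice(len(probs), p=probs)    # draw the next choice
    sampled_choices.append(choice)
    step = np.zeros((1, 1, seq.shape[2]))
    step[0, 0, choice] = 1.0                          # one-hot of the sampled choice
    # the reward feature is left at 0 here as a placeholder (assumption)
    seq = np.concatenate([seq[:, 1:], step], axis=1)  # slide the window forward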

Keras was designed to support all kinds of needs, and it should fit your need - yes.
In your case you are using LSTM cells with some arbitrary number of units (usually 64 or 128), with a<1>, a<2>, a<3>... a<Ty> as hidden states. Note: your last index should not be 3; it should be Ty.
I would suggest checking the to_categorical function (https://keras.io/utils/#to_categorical) to convert your data to one-hot encoded format.
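For example, a minimal sketch of one-hot encoding the choices with to_categorical and appending the reward as an extra feature column (the example values and the scaling by 100 are assumptions):

import numpy as np
from keras.utils import to_categorical

choices = np.array([2, 4, 1])    # hypothetical choices in 1-4
rewards = np.array([10, 55, 87]) # hypothetical rewards in 1-100

onehot = to_categorical(choices - 1, num_classes=4)          # shape (3, 4)
rewards_scaled = (rewards / 100.0).reshape(-1, 1)            # shape (3, 1)
features = np.concatenate([onehot, rewards_scaled], axis=1)  # shape (3, 5)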

Related

LSTM Autoencoder for Anomaly detection in time series, correct way to fit model

I'm trying to find correct examples of using an LSTM Autoencoder for detecting anomalies in time series data. I see a lot of examples where the LSTM Autoencoder is fitted with labels that are the future time steps of the feature sequences (as in usual time series forecasting with an LSTM), but I suppose this kind of model should be trained with labels that are the same sequence as the sequence of features (the previous time steps).
For example, the first Google result for this search: https://towardsdatascience.com/time-series-of-price-anomaly-detection-with-lstm-11a12ba4f6d9
1. This function defines how the labels (y) are built:

def create_sequences(X, y, time_steps=TIME_STEPS):
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        Xs.append(X.iloc[i:(i + time_steps)].values)
        ys.append(y.iloc[i + time_steps])
    return np.array(Xs), np.array(ys)

X_train, y_train = create_sequences(train[['Close']], train['Close'])
X_test, y_test = create_sequences(test[['Close']], test['Close'])
2. The model is fitted as follows:

history = model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.1,
                    callbacks=[keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, mode='min')],
                    shuffle=False)
Could you kindly comment on the way the autoencoder is implemented in the linked towardsdatascience.com article?
Is that method correct, or should the model be fitted the following way?
model.fit(X_train,X_train)
Thanks in advance!
This is a time-series autoencoder; if you want to predict the future, that is how it goes. How an autoencoder / machine-learning model is fitted differs from problem to problem: you cannot train and fit one model or workflow for all problems. A time-series task can mean predicting within a period we have already collected data for, or forecasting the future from collected data, and the two are constructed differently. For example, time-series data for the subsurface of the earth is modeled differently from a weather forecast; one model cannot work for both.
By definition, an autoencoder is any model attempting to reproduce its input, independent of the type of architecture (LSTM, CNN, ...).
Framed this way, it is an unsupervised task, so the training would be: model.fit(X_train, X_train)
Now, what she does in the article you linked is to use a common LSTM autoencoder architecture, but applied to time-series forecasting:
model.add(LSTM(128, input_shape=(X_train.shape[1], X_train.shape[2])))  # encoder: compress the window
model.add(RepeatVector(X_train.shape[1]))                               # repeat the encoding per timestep
model.add(LSTM(128, return_sequences=True))                             # decoder
model.add(TimeDistributed(Dense(X_train.shape[2])))                     # one output vector per timestep
She's preprocessing the data in a way that gets X_train = [x(t-seq) ... x(t)] and y_train = x(t+1):
for i in range(len(X) - time_steps):
    Xs.append(X.iloc[i:(i + time_steps)].values)
    ys.append(y.iloc[i + time_steps])
So the model does not, per se, reproduce the input it is fed, but that doesn't mean it's not a valid implementation, since it produces valuable predictions.
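For comparison, a minimal sketch of training the same architecture as a true autoencoder, i.e. reconstructing its own input (the data and hyperparameters here are made up):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

X_train = np.random.normal(size=(100, 30, 1))  # hypothetical: 100 windows of 30 steps

model = Sequential()
model.add(LSTM(128, input_shape=(X_train.shape[1], X_train.shape[2])))  # encoder
model.add(RepeatVector(X_train.shape[1]))
model.add(LSTM(128, return_sequences=True))                             # decoder
model.add(TimeDistributed(Dense(X_train.shape[2])))
model.compile(optimizer='adam', loss='mse')

model.fit(X_train, X_train, epochs=10, batch_size=32)  # the input is also the target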

Why can't I use softmax in a regression task for probabilities?

I have a supervised learning task f(X) = y, where X is a 2-dimensional np.array of np.int8 and y is a 1-dimensional array of np.float64 containing probabilities (so numbers between 0 and 1). I want to build a neural network model that performs regression in order to predict said probabilities y given X.
As the output of my Network is one real value (i.e. the output layer has one neuron) and is a probability (so in the range [0, 1]), I believe I should use softmax as the activation function of the output layer (i.e. output neuron) in order to squash the network's output to [0, 1].
As it is a regression task, I opted for using the mean_squared_error loss (instead of cross_entropy_loss that is typically used in classification tasks and often paired with softmax).
However, as I am trying to fit(X, y), the loss does not change at all between epochs and remains constant. Any ideas why? Is the combination of softmax and mean_squared_error loss wrong for some reason, and why?
If I remove the softmax it does work, but then my model would also predict non-probabilities, which I do not want. Yes, I could squash it myself afterwards, but that doesn't seem right.
My code basically is (after removing some irrelevant additional callbacks for EarlyStopping and learning-rate scheduling):
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(W1_size, input_shape=(input_dims,), activation='relu'))
model.add(Dense(1, activation='softmax'))    # softmax over a single neuron always outputs 1
# compile model
model.compile(optimizer=Adam(), loss='mse')  # mse is the standard loss for regression
# fit
model.fit(X, y, batch_size=batch_size, epochs=MAX_EPOCHS)
Edit: Turns out I needed the sigmoid function to squash one real value to [0, 1], as the accepted answer suggests. The softmax function for a vector of size 1 is always 1.
As you stated, you want to perform a regression task (which means finding a continuous mapping between your input and desired output).
The softmax function creates a pseudo-probability distribution over multi-dimensional outputs (all values sum to 1). This is the reason why the softmax function fits classification tasks perfectly (predicting probabilities for different classes).
As you want to perform a regression task and your output is one-dimensional, softmax would not work properly because it is always 1 for a one-dimensional input.
A function which maps a one-dimensional input continuously to [0, 1] works fine here (e.g. sigmoid).
Note that you can also interpret both the output of the sigmoid and the softmax function as probabilities. But be careful: these are only pseudo-probabilities, and they do not represent the certainty or uncertainty of your model in making predictions.
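A quick numeric sketch of the difference (plain NumPy, names are illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(softmax(np.array([3.7])))  # [1.] - always 1 for a single output
print(sigmoid(np.array([3.7])))  # [0.97587...] - a usable value in (0, 1)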

I need to understand this LSTM and Masking layer result

I'm new to Keras LSTMs. Could you please explain this model.summary() from Rasa Core training?
[screenshot of the model after training - image not included]
Also, what is the Masking layer doing, and what does the value -1 in it mean?
A Masking layer is meant to "ignore steps" in sequences.
Your LSTM is working with sequences of 5 steps and 42 features per step.
If all features in a step have the same value defined in the Masking layer (-1 in the example), that step will be ignored during training.
The idea is to simulate variable-length sequences.
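A minimal sketch of such a model (the layer sizes below are assumptions, not read from the screenshot):

from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense

model = Sequential()
# Steps whose 42 features all equal -1 are skipped by the layers that follow
model.add(Masking(mask_value=-1.0, input_shape=(5, 42)))
model.add(LSTM(32))
model.add(Dense(10, activation='softmax'))
model.summary()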
Not sure exactly what you don't understand, but model.summary() prints a summary representation of your model (keras.io).
It lists all layers used in the given model with their respective output shapes.
This particular model obviously starts with a masking layer for the input sequences (I guess because of padding) and is followed by the simplest LSTM model possible.

Many-to-many RNN in Keras - predict output for every nth input

I'm trying to figure out how to build a model using LSTM/GRU that predicts many to many, but only for every nth (7 in my case) input. For example, my input data has one timestep per day for a whole year, but I'm only trying to predict the output at the end of each week, not each day.
The only information I was able to find is this answer:
Many to one and many to many LSTM examples in Keras
It says:
"Many-to-many when number of steps differ from input/output length: this is freaky hard in Keras. There are no easy code snippets to code that."
In PyTorch it seems you can set ignore_index in the loss function, which I think should do the trick.
Is there a solution for Keras?
I think I found the answer. Since I'm trying to predict every nth value, we can just keep the outputs from the LSTM layer that we are trying to predict and discard the rest. I created a Lambda layer to do that - it just reads every 7th value from the LSTM output.
This is the code:
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Lambda, TimeDistributed, Dense

X = np.random.normal(0, 1, size=(100, 365, 5))  # 100 samples, 365 daily steps, 5 features
y = np.random.randint(2, size=(100, 52, 1))     # one binary label per week (52 weeks)

model = Sequential()
model.add(LSTM(1, input_shape=(365, 5), return_sequences=True))
model.add(Lambda(lambda x: x[:, 6::7, :]))      # keep every 7th timestep: 52 of 365
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=3, verbose=1)

Inner workings of Keras LSTM

I am working on a multi-class classification task: the goal is to identify what is the correct language of origin of a certain surname. For this, I am using a Keras LSTM.
So far, I have only worked with PyTorch, and I am very surprised by the "black box" character of Keras. For this classification task, my understanding is that I need to retrieve the output of the last time step for a given input sequence in the LSTM and then apply the softmax on it to get the probability distribution over all classes.
Interestingly, without me explicitly defining to do so, the LSTM seems to automatically do the right thing and chooses the last time step's output and not e.g. the hidden state to apply the softmax on (good training & validation results so far). How is that possible? Does the choice of the appropriate loss function categorical_crossentropy indicate to the model that it should use the last time step's output to do the classification?
Code:
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.optimizers import Adam
from keras import regularizers

model = Sequential()
model.add(Dense(100, input_shape=(max_len, len(alphabet)),
                kernel_regularizer=regularizers.l2(0.00001)))
model.add(Dropout(0.85))
model.add(LSTM(100))  # return_sequences defaults to False: only the last timestep's output is passed on
model.add(Dropout(0.85))
model.add(Dense(num_output_classes, activation='softmax'))
adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, decay=1e-6)
model.compile(loss='categorical_crossentropy',
              optimizer=adam,
              metrics=['accuracy'])
history = model.fit(train_data, train_labels,
                    epochs=5000,
                    batch_size=num_train_examples,
                    validation_data=(valid_data, valid_labels))
No, returning the last time step's output is just what every Keras RNN layer does by default. See the documentation for return_sequences, which causes a layer to return every time step's output instead (which is necessary for stacking RNN layers). There's no automatic intuition based on what kinds of layers you're hooking together; you just got what you wanted by default, presumably because the designers figured that was the most common case.
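A small sketch of the difference (the shapes here are illustrative):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM

x = np.random.normal(size=(2, 10, 8))  # 2 samples, 10 steps, 8 features

last_only = Sequential([LSTM(16, input_shape=(10, 8))])
print(last_only.predict(x).shape)      # (2, 16): only the last step's output

all_steps = Sequential([LSTM(16, input_shape=(10, 8), return_sequences=True)])
print(all_steps.predict(x).shape)      # (2, 10, 16): one output per step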
