I need to understand this LSTM and Masking layers result - keras

I'm new to Keras LSTMs. Could you please explain this model.summary() output from Rasa Core training?
[screenshot: model.summary() output after training]
Also, what is the Masking layer doing, and what does the value -1 in it mean?

A Masking layer is meant to "ignore steps" in sequences.
Your LSTM is working with sequences of 5 steps and 42 features per step.
If all features in a step have the same value defined in Masking (-1 in the example), that step will be ignored during training.
The idea is to simulate variable length sequences.
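For instance, here is a minimal sketch of such a setup (the layer sizes are placeholders, not the actual Rasa Core model); any step whose 42 features are all -1 is skipped:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

model = Sequential()
model.add(Masking(mask_value=-1.0, input_shape=(5, 42)))  # ignore steps filled entirely with -1
model.add(LSTM(32))                                       # hidden size chosen arbitrarily here
model.add(Dense(10, activation='softmax'))                # output size chosen arbitrarily here
model.summary()                                           # prints the layer / output-shape / parameter table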

Not sure exactly what you don't understand, but model.summary() prints a summary representation of your model (keras.io).
It lists all layers used in the given model with their respective output shapes and parameter counts.
This particular model starts with a Masking layer for the input sequences (I guess because of padding) and is followed by the simplest LSTM model possible.

Related

Conv2D filters and CNN architecture

I am currently pursuing my undergraduate degree, and I am working on a CNN model to recognize Telugu characters.
This question has two parts:
1. I have Telugu character images of shape (32, 32, 1), and I want to train my CNN model to recognize the characters. What should my model architecture be, and how do I decide the architecture, the number of parameters, and the hidden layers? I know my case is essentially the same as handwritten digit recognition, but I want to know how to decide those things. Is there any common practice for building such an architecture?
2. The operation Conv2D(32, (5,5)) means 32 filters of size 5x5 are applied to the input. My question is: are these filters all the same or all different? If different, what kind of filters are initialized, and who decides them?
I tried searching the internet, but everywhere I go the answer I get is that the Conv2D operation applies filters to the input and does the convolution operation.
To decide which model architecture would be best, you need to experiment. That's the only way. Since you want to classify, a VGG architecture would be a good starting point, I believe. You need to experiment with the number of parameters, as it depends on your problem. You can use Keras Tuner for that: https://keras.io/keras_tuner/
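For example, a minimal Keras Tuner sketch (the search space, num_classes, and the x_train/y_train names are placeholders, not taken from your project):

import keras_tuner as kt
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

num_classes = 50  # placeholder: number of Telugu character classes

def build_model(hp):
    model = Sequential([
        Conv2D(hp.Int('filters', 16, 64, step=16), (3, 3), activation='relu',
               input_shape=(32, 32, 1)),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(hp.Int('units', 32, 128, step=32), activation='relu'),
        Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

tuner = kt.RandomSearch(build_model, objective='val_accuracy', max_trials=10)
tuner.search(x_train, y_train, epochs=5, validation_split=0.2)  # your images and integer labels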
For kernel initialization, as far as I know convolutional layers in Keras use Glorot uniform initialization by default, but you can change that with the kernel_initializer parameter. Long story short, convolutional layers are initialized with a distribution function, and as training goes on the filters change the values inside them, which is the learning process. https://keras.io/api/layers/initializers
Edit: I forgot to mention that while I suggest a VGG-style architecture, you should downsize it a lot. Your input shape is small, so if your model is too deep, you will overfit really quickly.
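As a rough illustration of what "VGG-style but downsized" could look like for a (32, 32, 1) input (the layer sizes and num_classes are placeholders; kernel_initializer just makes the Glorot default explicit):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

num_classes = 50  # placeholder: number of Telugu character classes

model = Sequential([
    # block 1: 32 different 3x3 filters, all learned during training
    Conv2D(32, (3, 3), activation='relu', kernel_initializer='glorot_uniform',
           input_shape=(32, 32, 1)),
    MaxPooling2D((2, 2)),
    # block 2
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    # classifier head
    Flatten(),
    Dense(128, activation='relu'),
    Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])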

Do I need to apply the Softmax Function ANYWHERE in my multi-class classification Model?

I am currently turning my binary classification model into a multi-class classification model. Bear with me, I am very new to PyTorch and machine learning.
Most of what I state here, I know from the following video:
https://www.youtube.com/watch?v=7q7E91pHoW4&t=654s
What I read / know is that CrossEntropyLoss already has the softmax function implemented, so my output layer is linear.
What I then read / saw is that I can just choose my model's prediction by taking the torch.max() of my model output (which comes from my last linear layer). This feels weird, because I have some negative outputs and I thought I needed to apply the softmax function first, but it seems to work fine without it.
So now the big, confusing question I have is: when would I use the softmax function? Would I only use it when my loss doesn't have it implemented? But then I would choose my prediction based on the outputs of the softmax layer, which wouldn't be the same as with the linear output layer.
Thank you guys for every answer this gets.
For calculating the loss with CrossEntropyLoss you do not need softmax, because CrossEntropyLoss already includes it. However, to turn the model outputs into probabilities you still need to apply softmax.
Let's say you didn't apply softmax at the end of your model and trained it with cross-entropy loss. Then you want to evaluate your model on new data, get outputs, and use those outputs for classification. At this point you can manually apply softmax to your outputs, and there will be no problem. This is how it is usually done.
Training()
MODEL ----> FC LAYER ---> raw outputs ---> CrossEntropy loss
Eval()
MODEL ----> FC LAYER ---> raw outputs ---> Softmax ---> Probabilities
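A minimal PyTorch sketch of that flow (the model, shapes, and data below are toy placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))  # toy model, 3 classes
criterion = nn.CrossEntropyLoss()   # applies log-softmax internally, so feed it raw logits

x = torch.randn(8, 20)              # dummy batch of 8 samples
y = torch.randint(0, 3, (8,))       # dummy integer class labels

# Training: no softmax anywhere
logits = model(x)
loss = criterion(logits, y)

# Evaluation: softmax only if you want probabilities; argmax gives the same class either way
with torch.no_grad():
    probs = torch.softmax(model(x), dim=1)
    preds = probs.argmax(dim=1)     # identical to model(x).argmax(dim=1)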
Yes, you need to apply softmax on the output layer. When you are doing binary classification you are free to use relu, sigmoid, tanh, etc. as the activation function. But when you are doing multi-class classification, softmax is required, because the softmax activation function distributes the probability across the output nodes, so you can easily conclude that the output node with the highest probability belongs to a particular class. Thank you. Hope this is useful!

Doesn't keras.layers.Flatten lose information?

I'm brand new to Keras and ML in general. I'm looking at https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/, and it uses Flatten between Embedding and Dense because Embedding produces a 2D output but Dense requires a single dimension.
I'm sure I'm missing something obvious here, but why doesn't this lose track of which words are in which input vectors? How are we still able to know that input #3 was "nice work" and is associated with label #3, 1, for "positive"?
I guess the original dimensions are retained from the original input and then somehow restored for Dense's output? Or am I just totally missing a major conceptual aspect?
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
# vocab_size and max_length are defined earlier in the linked tutorial
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
Thanks for any guidance!
The Embedding layer gives you a vector for each word token, so its output per sample is 2D. We need to flatten before any classifier block.
There is some information lost. For example, when we use convolutional layers and then flatten the feature maps, the spatial information is lost, but we have already extracted the most important features using the Conv layers, and we feed those features to the fully connected layers.
In your example, the temporal dimension is no longer maintained; usually it's desirable to pass the output of the embedding matrix to an RNN/Conv layer for further feature extraction.
Flatten is only applied to the non-batch dimensions, meaning the examples are still kept separate (if that's what you mean).
For each sample, let's say "nice work", we get 2 vectors (1 for "nice", 1 for "work"). Since we only want to know the overall sentiment of the sentence, once we extract the features we can apply Flatten.
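A quick way to see that samples are never mixed is to look at the shapes (vocab_size and max_length below are placeholder values):

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

vocab_size, max_length = 50, 4   # placeholders, just to show the shapes

model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))  # output: (batch, 4, 8)
model.add(Flatten())                                          # output: (batch, 32) -- same numbers per sample, laid end to end
model.add(Dense(1, activation='sigmoid'))                     # one sentiment score per sample
model.summary()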

Using Dropout with Keras and LSTM/GRU cell

In Keras you can specify a dropout layer like this:
model.add(Dropout(0.5))
But with a GRU cell you can specify the dropout as a parameter in the constructor:
model.add(GRU(units=512,
              return_sequences=True,
              dropout=0.5,
              input_shape=(None, features_size,)))
What's the difference? Is one preferable to the other?
The Keras documentation adds it as a separate Dropout layer (see "Sequence classification with LSTM").
The recurrent layers perform the same repeated operation over and over.
In each timestep, it takes two inputs:
Your inputs (a step of your sequence)
Internal inputs (can be states and the output of the previous step, for instance)
Note that the dimensions of the input and the output may not match, which means that "your input" dimensions will not match "the recurrent input (previous step/states)" dimensions.
Then in every recurrent timestep there are two operations with two different kernels:
One kernel is applied to "your inputs" to process and transform it in a compatible dimension
Another (called recurrent kernel by keras) is applied to the inputs of the previous step.
Because of this, keras also uses two dropout operations in the recurrent layers. (Dropouts that will be applied to every step)
A dropout for the first conversion of your inputs
A dropout for the application of the recurrent kernel
So, in fact there are two dropout parameters in RNN layers:
dropout, applied to the first operation on the inputs
recurrent_dropout, applied to the other operation on the recurrent inputs (previous output and/or states)
You can see this description coded in both GRUCell and LSTMCell, for instance, in the source code.
What is correct?
This is open to creativity.
You can use a Dropout(...) layer; it's not "wrong", but it will possibly drop "timesteps" too! (Unless you set noise_shape properly or use SpatialDropout1D, which is not yet documented.)
Maybe you want that, maybe you don't. If you use the parameters in the recurrent layer, you will apply dropout only to the other dimensions, without dropping a single step. This seems healthy for recurrent layers, unless you want your network to learn how to deal with sequences containing gaps (this last sentence is a guess).
Also, with the dropout parameters, you will be really dropping parts of the kernel as the operations are dropped "in every step", while using a separate layer will let your RNN perform non-dropped operations internally, since your dropout will affect only the final output.
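A small sketch of both options side by side (features_size and the layer sizes are placeholders):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dropout, Dense

features_size = 42   # placeholder

# Option A: dropout built into the recurrent layer, applied inside every timestep
model_a = Sequential([
    GRU(512, return_sequences=True, dropout=0.5, recurrent_dropout=0.2,
        input_shape=(None, features_size)),
    Dense(1),
])

# Option B: a separate Dropout layer applied to the GRU's output
# (may drop whole timesteps unless noise_shape / SpatialDropout1D is used)
model_b = Sequential([
    GRU(512, return_sequences=True, input_shape=(None, features_size)),
    Dropout(0.5),
    Dense(1),
])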

LSTM with variable sequences & return full sequences

How can I set up a keras model such that the final LSTM layer outputs a prediction for each time step while having variable sequence lengths as input?
I'd then like to provide labels for each of the timesteps after a dense layer with linear activation.
When I try to add a reshape or a dense layer to the LSTM model that is returning the full sequence and has a masking layer to take care of variable sequence lengths, it says:
The reshape and the dense layers do not support masking.
Would this be possible to do?
You can use the TimeDistributed layer wrapper for this. This applies the layer you want to each timestep. In your case, you could also just use TimeDistributedDense.
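A minimal sketch of that setup (n_features and the layer sizes are placeholders):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, TimeDistributed, Dense

n_features = 10   # placeholder

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(None, n_features)))   # pad shorter sequences with zeros
model.add(LSTM(64, return_sequences=True))                           # one output vector per timestep
model.add(TimeDistributed(Dense(1, activation='linear')))            # one prediction per timestep
model.compile(optimizer='adam', loss='mse')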
