Suppose I have the following dataset X with 2 features and labels Y.
X = [[0.3, 0.1], [0.2, 0.9], [0.4, 0.0]]
Y = [0, 1, 0]
# split into input (X) and output (Y) variables
X = dataset[:, 0:2]  # X features are the first two columns
Y = dataset[:, 2]
model = Sequential()
model.add(Embedding(2, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y)
It works, but I wanted to know more about parameter_1, parameter_2, and parameter_3 that go in
Embedding(parameter_1, parameter_2, input_length=parameter_3)
P.S. I just put in random stuff and don't know what I am doing.
What would be the proper parameters to fill in Embedding() given the data set I described above?
Alright, following the more precise questions in the comments, here is the explanation.
An embedding layer is usually used to embed words, so I will use a running example with words, but you can think of them as categorical features.
The embedding layer is indeed useful for representing words (categorical features) as vectors in a continuous vector space.
When you have a text, you tokenize your words and assign each a number. They then become categorical features labelled with an index. For example, the sentence "I embed stuff" becomes the list of categorical objects [2, 1, 3], where a dictionary maps each index to a word: {1: "embed", 2: "I", 3: "stuff", 4: "some_other_words", 0: "<pad>"}
When you use a neural network or any continuous mathematical framework, those discrete objects (= categories) are unordered: 2 > 1 makes no sense when you talk about your words. They are not "numerical values"; they are categories. So you want to turn them into numbers, i.e. to embed them in a vector space.
This is precisely what the Embedding() layer does: it maps each index to a vector. To do that, there are three main parameters to define:
How many indices you want to use in total. This is the number of words in your vocabulary, or the number of categories that the categorical feature you want to encode has. This is the input_dim parameter. In our little example, we have 5 words in the vocabulary (indices from 0 to 4), so input_dim = 5. The reason it is called a "dimension" is that under the hood, Keras transforms the index into a one-hot vector whose dimension equals the number of distinct elements. For example, the word "stuff", which has index 3, will be transformed into the 5-dimensional vector [0 0 0 1 0] before being embedded. This is why your inputs should be integers: they are indices indicating where the 1 is in the one-hot vector.
How big you want your output vectors to be. This is the size of the vector space where your features will live. The parameter is output_dim. If you don't have a lot of words in your vocabulary (different categories for your features), this number should be low; in our case we will set output_dim = 2. Our 5 words will live in a 2D space.
As embedding layers are often the first layer in a neural network, you need to specify the number of words in each sample. This will be the input_length. Our sample was a 3-word phrase, so input_length = 3.
The reason the embedding layer usually comes first is that it takes integer inputs, whereas layers in neural networks return real values, so it wouldn't work anywhere else.
So to summarize, what comes into the layer is a sequence of indices, [2, 1, 3] in our example, and what comes out is the embedded vector corresponding to each index. This might be something like [[0.2, 0.4], [-1.2, 0.3], [-0.5, -0.8]].
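To make this concrete, here is a minimal sketch of those three parameters on the toy vocabulary above (assuming TensorFlow's bundled Keras; the embedding weights are randomly initialized, so your actual vectors will differ):
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

# 5 indices in the vocabulary, embedded into a 2D space, 3 words per sample
model = Sequential()
model.add(Embedding(input_dim=5, output_dim=2, input_length=3))

sample = np.array([[2, 1, 3]])   # "I embed stuff" as a list of indices
vectors = model.predict(sample)
print(vectors.shape)             # (1, 3, 2): one sample, 3 words, one 2D vector per word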
And to come back to your example, the input should be a list of samples, each sample being a list of indices. There is no point in embedding features that are already real values: such values already have a mathematical meaning the model can use, as opposed to categorical values.
Is it clearer now? :)
Related
In the embedding example here:
https://www.tensorflow.org/text/guide/word_embeddings
result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
result.shape
TensorShape([2, 3, 5])
Then it explains:
When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of shape (samples, sequence_length, embedding_dimensionality). To convert from this sequence of variable length to a fixed representation there are a variety of standard approaches. You could use an RNN, Attention, or pooling layer before passing it to a Dense layer. This tutorial uses pooling because it's the simplest.
The GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.
Then the code:
embedding_dim=16
model = Sequential([
    vectorize_layer,
    Embedding(vocab_size, embedding_dim, name="embedding"),
    GlobalAveragePooling1D(),
    Dense(16, activation='relu'),
    Dense(1)
])
As I understand it, GlobalAveragePooling1D computes, for each example, a single average over the sequence for each of the n embedding dimensions. I don't understand this part:
This allows the model to handle input of variable length, in the simplest way possible.
Similarly:
To convert from this sequence of variable length to a fixed representation there are a variety of standard approaches.
In each embedding layer, the input length is already fixed by the parameter 'input_length', and truncation and padding are used to enforce that fixed input length. So what does it mean to say that GlobalAveragePooling1D is used to convert from this sequence of variable length to a fixed representation? What does 'variable length' mean here?
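For reference, a minimal sketch (my own illustration, not from the tutorial) of the claim being quoted: the pooled output shape depends only on the embedding dimension, not on the sequence length:
import tensorflow as tf

pool = tf.keras.layers.GlobalAveragePooling1D()

# Same embedding dimension (5), different sequence lengths
short_batch = tf.random.normal((2, 3, 5))  # (samples, sequence_length=3, embedding_dim=5)
long_batch = tf.random.normal((2, 7, 5))   # (samples, sequence_length=7, embedding_dim=5)

print(pool(short_batch).shape)  # (2, 5)
print(pool(long_batch).shape)   # (2, 5) -- same fixed shape either way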
If we apply torch.nn.Conv2d to an RGB image, which can be understood as 3 two-dimensional matrices, the parameter in_channels corresponds to the 3 channels 'R', 'G' and 'B'. In my view, an embedded sentence of shape [sentence length, embedding size] should be considered as a single two-dimensional matrix, so why, in torch.nn.Conv1d, is in_channels not 1 but the embedding size? That doesn't match the meaning it has in torch.nn.Conv2d.
Could you explain the true meaning of in_channels in torch.nn.Conv1d for NLP / TextCNN, and why it differs from torch.nn.Conv2d?
Thanks!
The embedding dimension can be considered as the in_channels in NLP when using Conv1d.
Explanation:
Assume you are trying to input a sentence of length N, as follows:
I am a user of this ...... N.
Each of these words is converted to a word embedding, say E-dimensional.
The embedded sentence will then have shape [N, E].
If you consider a batch of input sentences of batch size B, the shape becomes [B, N, E].
So now you can create the convolution as:
conv = nn.Conv1d(in_channels=E, out_channels=.......)
Take a look at https://gist.github.com/spro/c87cc706625b8a54e604fb1024106556
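A minimal sketch of that setup (the sizes here are hypothetical); note that nn.Conv1d expects channels before length, so the [B, N, E] tensor has to be permuted to [B, E, N] first:
import torch
import torch.nn as nn

B, N, E = 4, 10, 32           # batch size, sentence length, embedding dim (made up)
x = torch.randn(B, N, E)      # a batch of embedded sentences
x = x.permute(0, 2, 1)        # Conv1d wants (batch, channels, length) = (B, E, N)

conv = nn.Conv1d(in_channels=E, out_channels=16, kernel_size=3)
out = conv(x)
print(out.shape)              # torch.Size([4, 16, 8]): N - kernel_size + 1 = 8 positions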
I am working on an LSTM, and after pre-processing I get the data X as a nested list containing 100 samples; each sample is a list of 3 feature lists, and each feature list contains a sequence of 50 points.
X = [list:100 [list:3 [list:50]]]
Y = [list:100]
Since it's a multivariate LSTM, I am not sure how to give all 3 sequences as input to the Keras LSTM. Do I need to convert it to a Pandas DataFrame?
model = models.Sequential()
model.add(layers.Bidirectional(layers.LSTM(units=32,
                                           input_shape=(?, ?, ?))))
You can do the following to convert the lists into NumPy arrays:
X = np.array(X)
Y = np.array(Y)
Calling the following after this conversion:
print(X.shape)
print(Y.shape)
should output: (100, 3, 50) and (100,), respectively. Finally, the input_shape of the LSTM layer can be (None, 50).
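For completeness, a minimal end-to-end sketch, with random stand-ins for your data, assuming (as the shape (100, 3, 50) implies) 3 timesteps of 50 features each; if the 50 points are actually your timesteps, transpose X to shape (100, 50, 3) instead:
import numpy as np
from tensorflow.keras import models, layers

# Stand-ins for your data: 100 samples, 3 timesteps, 50 features
X = np.random.rand(100, 3, 50)
Y = np.random.randint(0, 2, size=(100,))

model = models.Sequential()
model.add(layers.Bidirectional(layers.LSTM(units=32), input_shape=(None, 50)))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y, epochs=5, batch_size=16)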
LSTM Call arguments Doc:
inputs: A 3D tensor with shape [batch, timesteps, feature].
You would have to transform that list into a numpy array to work with Keras.
As per the shape of X you have provided, it should work in theory. However, you do have to figure out what the 3 dimensions of your array actually contain.
The 1st dimension is the batch dimension, i.e. how many samples you have.
The 2nd dimension is your timestep data.
Ex: words in a sentence, "cat sat on dog" -> 'cat' is timestep 1, 'sat' is timestep 2, 'on' is timestep 3, and so on.
The 3rd dimension represents the features of your data at each timestep. For the sentence earlier, we could vectorize each word into an embedding, so each timestep carries that word's feature vector.
What is the difference between LSTM and LSTMCell in Pytorch (currently version 1.1)? It seems that LSTMCell is a special case of LSTM (i.e. with only one layer, unidirectional, no dropout).
Then, what's the purpose of having both implementations? Unless I'm missing something, it's trivial to use an LSTM object as an LSTMCell (or, alternatively, it's pretty easy to use multiple LSTMCells to create an LSTM object).
Yes, you can emulate one with the other; the reason for having them separate is efficiency.
LSTMCell is a cell that takes arguments:
Input of shape batch × input dimension;
A tuple of LSTM hidden states of shape batch × hidden dimension.
It is a straightforward implementation of the equations.
LSTM is a layer applying an LSTM cell (or multiple LSTM cells) in a "for loop", but the loop is heavily optimized using cuDNN. Its input is
A three-dimensional tensor of inputs of shape sequence length × batch × input dimension (PyTorch's default layout; batch comes first if batch_first=True);
Optionally, an initial state of the LSTM, i.e., a tuple (h_0, c_0) of hidden states, each of shape (num layers × num directions) × batch × hidden dimension.
You might often want to use the LSTM cell in a different context than applying it over a sequence, e.g. to make an LSTM that operates over a tree-like structure. When you write a decoder in a sequence-to-sequence model, you also call the cell in a loop, stopping when the end-of-sequence symbol is decoded.
Let me show some specific examples:
# LSTM example:
>>> rnn = nn.LSTM(10, 20, 2)             # input_size=10, hidden_size=20, num_layers=2
>>> input = torch.randn(5, 3, 10)        # (seq_len=5, batch=3, input_size=10)
>>> h0 = torch.randn(2, 3, 20)           # (num_layers, batch, hidden_size)
>>> c0 = torch.randn(2, 3, 20)
>>> output, (hn, cn) = rnn(input, (h0, c0))
# LSTMCell example:
>>> rnn = nn.LSTMCell(10, 20)            # input_size=10, hidden_size=20
>>> input = torch.randn(5, 3, 10)        # (seq_len=5, batch=3, input_size=10)
>>> hx = torch.randn(3, 20)              # (batch, hidden_size)
>>> cx = torch.randn(3, 20)
>>> output = []
>>> for i in range(5):
...     hx, cx = rnn(input[i], (hx, cx))
...     output.append(hx)
The key difference:
LSTM: the third argument, 2, stands for num_layers, the number of stacked recurrent layers. Unrolled, there are seq_len × num_layers = 5 × 2 = 10 cell applications: no explicit loop, but more cells.
LSTMCell: called in a for loop (seq_len = 5 times), where the output of the i-th step is fed as input to the (i+1)-th step. There is only one cell, truly recurrent.
If we set num_layers=1 in the LSTM, or stack a second LSTMCell inside the loop, the two snippets above compute the same thing (see the sketch below).
Obviously, it is easier to apply parallel computing with the LSTM.
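To make the equivalence concrete, here is a small sketch (my own, not from the PyTorch docs) that copies a single-layer LSTM's weights into an LSTMCell and checks that the looped cell reproduces the layer's output:
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(10, 20, 1)      # single layer, so one cell suffices
cell = nn.LSTMCell(10, 20)

# Copy the layer's parameters into the cell so both compute the same function
cell.weight_ih.data.copy_(lstm.weight_ih_l0.data)
cell.weight_hh.data.copy_(lstm.weight_hh_l0.data)
cell.bias_ih.data.copy_(lstm.bias_ih_l0.data)
cell.bias_hh.data.copy_(lstm.bias_hh_l0.data)

x = torch.randn(5, 3, 10)      # (seq_len, batch, input_size)
h0 = torch.zeros(1, 3, 20)
c0 = torch.zeros(1, 3, 20)
out_lstm, _ = lstm(x, (h0, c0))          # whole sequence in one call

hx, cx = h0[0], c0[0]
outputs = []
for t in range(5):                        # explicit loop over timesteps
    hx, cx = cell(x[t], (hx, cx))
    outputs.append(hx)
out_cell = torch.stack(outputs)

print(torch.allclose(out_lstm, out_cell, atol=1e-6))  # True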
I have constructed an LSTM architecture using Keras, but I am not certain whether duplicating time steps is a good approach for dealing with variable sequence lengths.
I have a multidimensional dataset with multi-feature sequences and varying numbers of time steps. It is multivariate time-series data with multiple examples to train the LSTM on, and Y is either 0 or 1. Currently, I am duplicating the last time step of each sequence to ensure timesteps = 3.
I would appreciate it if someone could answer the following questions or concerns:
1. Would creating additional time steps with feature values of zero be more suitable?
2. What is the right way to frame this problem, pad the sequences, and mask them for evaluation?
3. I am duplicating the last time step in the Y variable as well for prediction, and the value 1 in Y only appears at the last time step, if at all.
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Flatten, Dense
from keras.constraints import maxnorm

# The input sequences are
trainX = np.array([
[
# Input features at timestep 1
[1, 2, 3],
# Input features at timestep 2
[5, 2, 3] #<------ duplicate this to ensure compliance
],
# Datapoint 2
[
# Features at timestep 1
[1, 8, 9],
# Features at timestep 2
[9, 8, 9],
# Features at timestep 3
[7, 6, 1]
]
])
# The desired model output is as follows:
trainY = np.array([
# Datapoint 1
[
# Target class at timestep 1
[0],
# Target class at timestep 2
[1] #<---------- duplicate this to ensure compliance
],
# Datapoint 2
[
# Target class at timestep 1
[0],
# Target class at timestep 2
[0],
# Target class at timestep 3
[0]
]
])
timesteps = 3
model = Sequential()
model.add(LSTM(3, kernel_initializer='uniform', return_sequences=True,
               batch_input_shape=(None, timesteps, trainX.shape[2]),
               kernel_constraint=maxnorm(3), name='LSTM'))
model.add(Dropout(0.2))
model.add(LSTM(3, return_sequences=True, kernel_constraint=maxnorm(3), name='LSTM-2'))
model.add(Flatten(name='Flatten'))
model.add(Dense(timesteps, activation='sigmoid', name='Dense'))
model.compile(loss="mse", optimizer="sgd", metrics=["mse"])
model.fit(trainX, trainY, epochs=2000, batch_size=2)
predY = model.predict(testX)
In my opinion there are two solutions to your problem (duplicating timesteps is none of them):
Use pad_sequences in combination with a Masking layer. This is the common approach: thanks to padding, every sample has the same number of timesteps. The good thing about this method is that it's very easy to implement, and the Masking layer will give you a little performance boost.
The downside of this approach: if you train on a GPU, CuDNNLSTM is the layer to go with, since it is highly optimized for GPU and therefore a lot faster, but it does not work with a masking layer, and if your dataset has a wide range of timesteps you lose performance to the padding.
Set your timesteps shape to None and write a Keras generator that groups your batches by timesteps (I think you'll also have to use the functional API). Now you can use CuDNNLSTM, and every sample will be computed with only the relevant timesteps (instead of padded ones), which is much more efficient.
If you're new to Keras and performance is not so important, go with option 1; see the sketch below. If you have a production environment where you often have to train the network and costs matter, try option 2.
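For option 1, a minimal sketch using the toy data from the question (the padding value, layer sizes, and training settings are my own illustrative choices):
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

# Ragged toy data from the question: 2 and 3 timesteps of 3 features each
sequences = [
    [[1, 2, 3], [5, 2, 3]],
    [[1, 8, 9], [9, 8, 9], [7, 6, 1]],
]
labels = np.array([1, 0])  # per-sequence label, taken from the last timestep of Y

# Pad every sample to 3 timesteps; 0.0 works as a mask value here
# because no real timestep in this data is all zeros
trainX = pad_sequences(sequences, maxlen=3, padding='post',
                       value=0.0, dtype='float32')

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(3, 3)))  # padded steps are skipped downstream
model.add(LSTM(3))                                      # last state only; no need to pad Y
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(trainX, labels, epochs=10, batch_size=2)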