PyTorch LSTM vs LSTMCell

What is the difference between LSTM and LSTMCell in PyTorch (currently version 1.1)? It seems that LSTMCell is a special case of LSTM (i.e., with only one layer, unidirectional, and no dropout).
Then what's the purpose of having both implementations? Unless I'm missing something, it's trivial to use an LSTM object as an LSTMCell (or, alternatively, it's pretty easy to use multiple LSTMCells to create the LSTM object).

Yes, you can emulate one by another, the reason for having them separate is efficiency.
LSTMCell is a cell that takes arguments:
Input of shape batch × input dimension;
A tuple of LSTM hidden states of shape batch × hidden dimension.
It is a straightforward implementation of the equations.
LSTM is a layer applying an LSTM cell (or multiple LSTM cells) in a "for loop", but the loop is heavily optimized using cuDNN. Its input is
A three-dimensional tensor of inputs of shape sequence length × batch × input dimension (or batch × sequence length × input dimension with batch_first=True);
Optionally, an initial state of the LSTM, i.e., a tuple (h_0, c_0) of hidden states, each of shape num_layers × batch × hidden dimension (the first dimension becomes num_layers * num_directions for a bidirectional LSTM).
You often might want to use the LSTM cell in a different context than applying it over a sequence, e.g., to make an LSTM that operates over a tree-like structure. When you write a decoder in sequence-to-sequence models, you also call the cell in a loop and stop the loop when the end-of-sequence symbol is decoded, as in the sketch below.
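For illustration, a minimal sketch of such a decoder loop. Everything here (vocabulary size, dimensions, token ids, the zero-initialized states that would normally come from an encoder) is a made-up assumption, not from the answer above:
import torch
import torch.nn as nn

VOCAB, EMB, HID, SOS_ID, EOS_ID, MAX_LEN = 12, 10, 20, 0, 1, 10
batch = 3
embed = nn.Embedding(VOCAB, EMB)
cell = nn.LSTMCell(EMB, HID)
proj = nn.Linear(HID, VOCAB)

hx = torch.zeros(batch, HID)  # in practice, states from an encoder
cx = torch.zeros(batch, HID)
token = torch.full((batch,), SOS_ID, dtype=torch.long)
decoded = []
for _ in range(MAX_LEN):
    hx, cx = cell(embed(token), (hx, cx))  # one step of the recurrence
    token = proj(hx).argmax(dim=-1)        # greedy next-token choice
    decoded.append(token)
    if (token == EOS_ID).all():            # stop at end-of-sequence
        break
print(torch.stack(decoded).shape)          # (steps, batch)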

Let me show some specific examples:
# LSTM example:
>>> rnn = nn.LSTM(10, 20, 2)        # input_size=10, hidden_size=20, num_layers=2
>>> input = torch.randn(5, 3, 10)   # (seq_len, batch, input_size)
>>> h0 = torch.randn(2, 3, 20)      # (num_layers, batch, hidden_size)
>>> c0 = torch.randn(2, 3, 20)
>>> output, (hn, cn) = rnn(input, (h0, c0))
# LSTMCell example:
>>> rnn = nn.LSTMCell(10, 20)
>>> input = torch.randn(6, 3, 10)   # (time steps, batch, input_size)
>>> hx = torch.randn(3, 20)         # (batch, hidden_size)
>>> cx = torch.randn(3, 20)
>>> output = []
>>> for i in range(6):
...     hx, cx = rnn(input[i], (hx, cx))
...     output.append(hx)
The key difference:
LSTM: the third argument, 2, is num_layers, the number of stacked recurrent layers. The cell computation happens seq_len × num_layers = 5 × 2 times inside the optimized loop: no Python loop, but more cell applications.
LSTMCell: you run the recurrence yourself in a Python for loop (here 6 time steps); the hidden state produced at step i becomes the state fed into step i+1. There is only one cell, which is truly recurrent.
If we set num_layers=1 in the LSTM, or stack one more LSTMCell, the two snippets compute essentially the same thing, as the sketch below confirms.
Obviously, it is easier to apply parallel computation (e.g., via cuDNN) in LSTM.
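A minimal sanity check (my own sketch, not from the answer) that copies the weights of a single-layer nn.LSTM into an nn.LSTMCell and confirms that looping the cell reproduces the layer's output:
import torch
import torch.nn as nn

lstm = nn.LSTM(10, 20, num_layers=1)
cell = nn.LSTMCell(10, 20)

# Share the parameters so both modules compute the same function.
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(5, 3, 10)                        # (seq_len, batch, input_size)
h0, c0 = torch.zeros(1, 3, 20), torch.zeros(1, 3, 20)

out_lstm, _ = lstm(x, (h0, c0))                  # optimized internal loop

hx, cx = h0[0], c0[0]
outs = []
for t in range(x.size(0)):                       # the same loop, by hand
    hx, cx = cell(x[t], (hx, cx))
    outs.append(hx)
out_cell = torch.stack(outs)

print(torch.allclose(out_lstm, out_cell, atol=1e-6))  # True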

Related

torch crossentropy loss calculation difference between 2D input and 3D input

I am running a test on torch.nn.CrossEntropyLoss, using the example shown on the official page.
loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=False)
target = torch.randn(3, 5).softmax(dim=1)
output = loss(input, target)
The output is 2.05.
In the example, both the input and the target are 2D tensors. Since in most NLP cases the input is a 3D tensor (and correspondingly the output as well), I wrote a couple of lines of testing code and found a weird issue.
input = torch.stack([input])
target = torch.stack([target])
output = loss(input, target)
The output is 0.9492.
This result really confuses me: except for the dimensions, the numbers inside the tensors are exactly the same. Does anyone know why the results differ?
The reason I am testing this method is that I am working on a project with Transformers.BartForConditionalGeneration. The loss given in the output always has shape (1,), which is confusing: if my batch size is greater than 1, I would expect to get batch-size-many losses instead of just one. I took a look at the code; it simply uses nn.CrossEntropyLoss(), so I figured the issue may be in nn.CrossEntropyLoss(), and now I am stuck there.
In the second case, you are adding an extra dimension, which means the softmax over the logits tensor (input) ends up applied along a different dimension: nn.CrossEntropyLoss always treats dim=1 as the class dimension, and after stacking, dim=1 holds what used to be the batch dimension.
Here we compute the two quantities separately:
>>> import torch.nn.functional as F
>>> loss = nn.CrossEntropyLoss()
>>> input = torch.randn(3, 5, requires_grad=False)
>>> target = torch.randn(3, 5).softmax(dim=1)
First you have loss(input, target) which is identical to:
>>> o = -target*F.log_softmax(input, 1)
>>> o.sum(1).mean()
And your second scenario, loss(input[None], target[None]), is identical to:
>>> o = -target[None]*F.log_softmax(input[None], 1)
>>> o.sum(1).mean()
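In other words, for a 3D NLP-style batch the class dimension must stay at position 1, or you can flatten the tokens into a 2D batch. A minimal sketch (the shapes here are my own example, not from the question), assuming logits of shape (batch, seq_len, num_classes) and hard class-index targets:
import torch
import torch.nn as nn

loss = nn.CrossEntropyLoss()
logits = torch.randn(2, 7, 5)          # (batch, seq_len, num_classes)
target = torch.randint(5, (2, 7))      # one class index per token

# CrossEntropyLoss expects the class dim at position 1 for >2D input...
out1 = loss(logits.permute(0, 2, 1), target)
# ...or, equivalently, flatten all tokens into one 2D batch.
out2 = loss(logits.reshape(-1, 5), target.reshape(-1))
print(torch.allclose(out1, out2))      # True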

LSTM input shape through json file

I am working on an LSTM, and after pre-processing the data I get the data X as a list of 100 samples; each sample contains 3 lists of features, and each of those lists contains a sequence of 50 points.
X = [list:100 [list:3 [list:50]]]
Y = [list:100]
Since it's a multivariate LSTM, I am not sure how to give all 3 sequences as an input to the Keras LSTM. Do I need to convert them to a Pandas data frame?
model = models.Sequential()
model.add(layers.Bidirectional(layers.LSTM(units=32,
                                           input_shape=(?, ?, ?))))
You can do the following to convert the lists into NumPy arrays:
X = np.array(X)
Y = np.array(Y)
Calling the following after this conversion:
print(X.shape)
print(Y.shape)
should output: (100, 3, 50) and (100,), respectively. Finally, the input_shape of the LSTM layer can be (None, 50).
LSTM Call arguments Doc:
inputs: A 3D tensor with shape [batch, timesteps, feature].
You would have to transform that list into a numpy array to work with Keras.
As per the shape of X you have provided, it should work in theory. However, you do have to figure out what the 3 dimensions of your array actually contain.
The 1st dimension is the number of samples, i.e., how many sequences of data you have (the batch dimension).
The 2nd dimension is your timestep data.
Ex: words in a sentence, "cat sat on dog" -> 'cat' is timestep 1, 'sat' is timestep 2, 'on' is timestep 3, and so on.
The 3rd dimension represents the features of your data at each timestep. For our sentence earlier, we could vectorize each word, so each timestep holds that word's vector.
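Putting both answers together, a minimal runnable sketch with random stand-in data of the shapes from the question (the Dense head and mse loss are assumptions, since the question doesn't say what Y represents):
import numpy as np
from tensorflow.keras import models, layers

X = np.random.rand(100, 3, 50)   # (samples, timesteps, features)
Y = np.random.rand(100)          # (samples,)

model = models.Sequential()
model.add(layers.Bidirectional(layers.LSTM(units=32),
                               input_shape=(None, 50)))  # timesteps left open
model.add(layers.Dense(1))       # assumed scalar target per sample
model.compile(optimizer="adam", loss="mse")
model.fit(X, Y, epochs=1, verbose=0)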

dimension of the input layer for embeddings in Keras

It is not clear to me whether there is any difference between specifying the input dimension, Input(shape=(20,)), or not, Input(shape=(None,)), in the following example:
input_layer = Input(shape=(None,))
emb = Embedding(86, 300) (input_layer)
lstm = Bidirectional(LSTM(300)) (emb)
output_layer = Dense(10, activation="softmax") (lstm)
model = Model(input_layer, output_layer)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["acc"])
history = model.fit(my_x, my_y, epochs=1, batch_size=632, validation_split=0.1)
my_x (shape: 2000, 20) contains integers referring to characters, while my_y contains the one-hot encoding of some labels. With Input(shape=(None,)), I see that I could use model.predict(my_x[:, 0:10]), i.e., I could give only 10 characters as an input instead of 20: how is that possible? I was assuming that all the 20 dimensions in my_x were needed to predict the corresponding y.
With shape=(None,) you say that the sequences you feed into the model do not have a fixed length, whereas shape=(20,) enforces a strict length of 20. While many models need a fixed input size, recurrent neural networks (like the LSTM you use there) do not need a fixed sequence length: the LSTM simply loops over the timesteps, so it does not care whether your sequence contains 20 or 100 of them. However, when you specify the number of timesteps as 20, the model expects exactly 20 and will raise an error if it does not get them.
For more information, see this post by Tim.
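To see this concretely, here is a small sketch (dummy data, same architecture as in the question) showing that the None-length model accepts inputs of different lengths:
import numpy as np
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model

inp = Input(shape=(None,))                 # sequence length left open
x = Embedding(86, 300)(inp)
x = Bidirectional(LSTM(300))(x)
out = Dense(10, activation="softmax")(x)
model = Model(inp, out)

# Neither Embedding nor LSTM depends on a fixed length, so both calls work:
print(model.predict(np.random.randint(0, 86, (4, 20))).shape)  # (4, 10)
print(model.predict(np.random.randint(0, 86, (4, 10))).shape)  # (4, 10)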

Why does output shape in a simple Elman RNN depend on the sequence length (while hidden state shape doesn't)?

I am learning about RNNs, and am trying to code one up using PyTorch.
I have some trouble understanding the output dimensions
Here is some code for a simple RNN architecture
class RNN(nn.Module):
    def __init__(self, input_size, hidden_dim, n_layers):
        super(RNN, self).__init__()
        self.hidden_dim = hidden_dim
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)

    def forward(self, x, hidden):
        r_out, hidden = self.rnn(x, hidden)
        return r_out, hidden
So, what I understand is that hidden_dim is the number of blocks I will have in my hidden layer, and essentially the number of features in the output and in the hidden state.
I create some dummy data, to test it
test_rnn = RNN(input_size=1, hidden_dim=4, n_layers=1)
# generate evenly spaced, test data pts
time_steps = np.linspace(0, 6, 3)
data = np.sin(time_steps)
data.resize((3, 1))
test_input = torch.Tensor(data).unsqueeze(0) # give it a batch_size of 1 as first dimension
print('Input size: ', test_input.size())
# test out rnn sizes
test_out, test_h = test_rnn(test_input, None)
print('Hidden state size: ', test_h.size())
print('Output size: ', test_out.size())
What I get is
Input size: torch.Size([1, 3, 1])
Hidden state size: torch.Size([1, 1, 4])
Output size: torch.Size([1, 3, 4])
So I understand that the shape of x is determined like so:
x = (batch_size, seq_length, input_size), so batch size 1, 3 time steps (sequence length), and an input of 1 feature.
For the hidden state, like so: hidden = (n_layers, batch_size, hidden_dim), so I had 1 layer, batch size 1, and 4 blocks in my hidden layer.
What I don't get is the RNN output. From the documentation, r_out = (batch_size, time_step, hidden_size). Wasn't the output supposed to be the same as the hidden state that is output from the hidden units? That is, if I have 4 units in my hidden layer, I would expect it to output 4 numbers for the hidden state and 4 numbers for the output. Why is the time step a dimension of the output? Each hidden unit takes in some numbers and outputs a state S and an output Y, and both of these are equal, yes? I tried a diagram; this is what I came up with. Help me understand what part of it I'm doing wrong.
So TL;DR
Why does the output shape in a simple Elman RNN depend on the sequence length (while the hidden state shape doesn't)? In the diagram I drew, I see both of them being the same.
In the PyTorch API, the output is the sequence of hidden states produced during the RNN computation, i.e., there is one hidden state vector per input vector. The hidden state is the last hidden state, the state the RNN ends with after processing the input, so test_out[:, -1, :] equals test_h[-1].
Vector y in your diagram is the same as a hidden state Ht; it indeed has 4 numbers, but the state is different for every time step, so you get 4 numbers for every time step.
The reason why PyTorch returns the sequence of outputs = hidden states separately (they are not the same in LSTMs, though) is that you can have a batch of sequences of different lengths. In that case, the final state is not simply test_out[:, -1, :], because you would need to select the final states based on the lengths of the individual sequences.
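A quick sketch (mirroring the dimensions from the question) that verifies the last output step equals the final hidden state for a single-layer RNN:
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=1, hidden_size=4, num_layers=1, batch_first=True)
x = torch.randn(1, 3, 1)                     # (batch, seq_len, input_size)
out, h = rnn(x)                              # out: (1, 3, 4), h: (1, 1, 4)

# `out` holds one hidden state per time step; `h` holds only the final one.
print(torch.allclose(out[:, -1, :], h[-1]))  # True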

Keras conv1d layer parameters: filters and kernel_size

I am very confused by these two parameters of the Conv1D layer in Keras:
https://keras.io/layers/convolutional/#conv1d
the documentation says:
filters: Integer, the dimensionality of the output space (i.e. the number of output filters in the convolution).
kernel_size: An integer or tuple/list of a single integer, specifying the length of the 1D convolution window.
But that does not seem to relate to the standard terminologies I see on many tutorials such as https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks/ and https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
Using the second tutorial link, which uses Keras, I'd imagine that kernel_size corresponds to the conventional 'filter' concept, which defines the sliding window over the input feature space. But what about the filters parameter of Conv1D? What does it do?
For example, in the following code snippet:
model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(Conv1D(filters=100, kernel_size=4, padding='same', activation='relu'))
suppose the embedding layer outputs a matrix of dimension 50 (rows, each row is a word in a sentence) × 300 (columns, the word vector dimension): how does the Conv1D layer transform that matrix?
Many thanks
You're right to say that kernel_size defines the size of the sliding window.
The filters parameter is just how many different windows you will have (all of them with the same length, which is kernel_size): how many different results or channels you want to produce.
When you use filters=100 and kernel_size=4, you are creating 100 different filters, each of them with length 4. The result will bring 100 different convolutions.
Also, each filter has enough parameters to consider all input channels.
The Conv1D layer expects these dimensions:
(batchSize, length, channels)
I suppose the best way to use it is to have the number of words in the length dimension (as if the words in order formed a sentence), and the channels be the output dimension of the embedding (numbers that define one word).
So:
batchSize = number of sentences
length = number of words in each sentence
channels = dimension of the embedding's output.
The convolutional layer will pass 100 different filters, each filter will slide along the length dimension (word by word, in groups of 4), considering all the channels that define the word.
The outputs are shaped as:
(number of sentences, 50 words, 100 output channels or filters)
and the filters are shaped as:
(4 = kernel length, 300 = word vector dimension, 100 = output dimension of the convolution)
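A small sketch that confirms these shapes, using the 50 × 300 embedding output from the question as the assumed input:
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv1D(filters=100, kernel_size=4, padding='same',
                  activation='relu', input_shape=(50, 300)),
])
print(model.output_shape)              # (None, 50, 100)
print(model.layers[0].kernel.shape)    # (4, 300, 100)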
The code below can help illustrate this. I ran into a similar question and answered it myself.
import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.layers import Conv1D, MaxPool1D

tf.random.set_seed(1)  # nowadays instead of tf.set_random_seed(1)

batch, rows, cols = 3, 8, 3
m, n, k = batch, rows, cols
input_shape = (batch, rows, cols)

np.random.seed(132)
data = np.random.randint(low=1, high=6, size=input_shape, dtype='int32')
data = np.float32(data)
data = tf.constant(data)
print("Data:")
print(K.eval(data))
print()
print(f'm,n,k: {input_shape}')
#############################
# Understanding filters and kernel_size
#############################
num_filters = 5
kernel_size = 3
'''
A few notes about kernel_size:
1. max kernel_size == num_rows (here 8)
2. since this is Conv1D, the kernel is a matrix of ones with kernel_size rows
   (and one column per input channel):
   kernel_size = 1 -> [[1, 1, 1]]
   kernel_size = 2 -> [[1, 1, 1], [1, 1, 1]]
   kernel_size = 3 -> [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
I have chosen tf.keras.initializers.constant(1) to create this matrix of ones;
its number of rows is kernel_size.
'''
y = Conv1D(filters=num_filters, kernel_size=kernel_size,
           kernel_initializer=tf.keras.initializers.constant(1),
           # glorot_uniform(seed=12) would be the usual trainable default
           input_shape=(n, k)  # (length, channels) = (8, 3)
           )(data)
#########################
# Checking the outcome
#########################
print(K.eval(y))
print(f'Resulting output_shape == (batch_size, num_rows - kernel_size + 1, num_filters): {y.shape}')
# Verification: with an all-ones kernel, each output element is the sum of one
# window of kernel_size rows across all input channels.
print(K.eval(tf.math.reduce_sum(data, axis=2, keepdims=True)))  # per-row sums across channels
##########################################
# Understanding MaxPool and strides
##########################################
pool = MaxPool1D(pool_size=3,strides=3)(y)
print(K.eval(pool))
print(f'Shape of Pool: {pool.shape}')
