Why do multilayer perceptrons outperform RNNs in CartPole? - pytorch

Recently, I compared two models for a DQN on the CartPole-v0 environment. One of them is a multilayer perceptron with 3 layers and the other is an RNN built from an LSTM and one fully connected layer. I have an experience replay buffer of size 200000 and training doesn't start until it is filled up.
Although the MLP solved the problem within a reasonable number of training steps (meaning it reached a mean reward of 195 over the last 100 episodes), the RNN model converged far more slowly and its maximum mean reward never even reached 195.
I have already tried increasing the batch size, adding more units to the LSTM's hidden state, increasing the RNN's sequence length and making the fully connected layer more complex, but every attempt failed: I saw enormous fluctuations in the mean reward, so the model hardly converged at all. Could these be the signs of early overfitting?
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_input, output_size, n_hidden, n_layers, dropout=0.3):
        super(DQN, self).__init__()
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        # batch_first=True, so inputs are (batch, seq_len, n_input)
        self.lstm = nn.LSTM(input_size=n_input,
                            hidden_size=n_hidden,
                            num_layers=n_layers,
                            dropout=dropout,
                            batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fully_connected = nn.Linear(n_hidden, output_size)

    def forward(self, x, hidden_parameters):
        batch_size = x.size(0)
        output, hidden_state = self.lstm(x.float(), hidden_parameters)
        seq_length = output.shape[1]
        # apply dropout and the linear head to every time step
        output1 = output.contiguous().view(-1, self.n_hidden)
        output2 = self.dropout(output1)
        output3 = self.fully_connected(output2)
        new = output3.view(batch_size, seq_length, -1)
        # keep only the Q-values of the last time step
        new = new[:, -1]
        return new.float(), hidden_state

    def init_hidden(self, batch_size, device):
        # zero-initialized (h_0, c_0) with the same dtype as the model weights
        weight = next(self.parameters()).data
        hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().to(device),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_().to(device))
        return hidden
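For illustration, the model above can be called on a batch of state sequences like this; the sizes here are arbitrary assumptions (CartPole-v0 has 4 state variables and 2 actions), not the values from the actual training code:

import torch

model = DQN(n_input=4, output_size=2, n_hidden=64, n_layers=2)

batch_size, seq_len = 32, 8
device = torch.device("cpu")
states = torch.randn(batch_size, seq_len, 4)    # (batch, sequence, state features)
hidden = model.init_hidden(batch_size, device)  # zeroed (h_0, c_0)

q_values, hidden = model(states, hidden)        # q_values has shape (batch, output_size)
print(q_values.shape)                           # torch.Size([32, 2])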
Contrary to what I expected, the simpler model gave a much better result than the other, even though an RNN is supposed to be better at processing time-series data.
Can anybody tell me the reason for this?
Also, I should state that I applied no feature engineering and both DQNs worked with raw data. Could the RNN outperform the MLP if both models were fed normalized features?
Is there anything you can recommend to improve training efficiency for RNNs so that they achieve the best results?

Contrary to what I expected, the simpler model gave a much better result than the other; even though RNNs are supposed to be better at processing time series data.
There is no time series in CartPole: the state contains all the information needed for an optimal decision. It would be different if, for instance, you were learning from images and had to estimate the pole velocity from a sequence of frames.
Also, it is not true that a more complex model should perform better. On the contrary, it is more likely to overfit. For CartPole you don't even need a neural network: a simple linear approximator with RBFs or random Fourier features would suffice (see the sketch below). An RNN with an LSTM is certainly overkill for such a simple problem.
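To illustrate that last suggestion, here is a minimal sketch of a random-Fourier-feature mapping for CartPole's 4-dimensional state. The feature count and bandwidth are arbitrary assumptions, and the linear Q-function on top would still need to be trained (e.g. with semi-gradient Q-learning):

import numpy as np

n_features, bandwidth = 256, 1.0
rng = np.random.default_rng(0)

# random projection and phases for cosine random Fourier features
W = rng.normal(scale=1.0 / bandwidth, size=(n_features, 4))  # 4 = CartPole state dimension
b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

def rff(state):
    # map a raw 4-d state to a fixed random feature vector
    return np.sqrt(2.0 / n_features) * np.cos(W @ state + b)

# Q(s, a) is then linear in the features, with one weight vector per action
theta = np.zeros((2, n_features))  # 2 actions in CartPole
q_values = theta @ rff(np.array([0.0, 0.1, -0.02, 0.3]))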

Related

Ideal number of nodes in an autoencoder for small dataset with few features in PyTorch

I am dealing with a dataset with 6 features and around 1000 samples in total. I am hoping to use unsupervised learning, with dimensionality reduction followed by clustering, as the labels for these data are often incorrect. I tested linear methods like PCA first and found that the data are not linearly separable. I am now using an autoencoder to perform dimensionality reduction, but am running into some questions regarding the number of nodes. I am testing a simple autoencoder in PyTorch at the moment, with one hidden layer. However, I am uncertain of the number of nodes that would be appropriate for this problem. I am a bit confused about whether the advice I have seen on node selection refers to 'input layer size' as the total training data size (0.2*1000 samples), the total dataset (1000 samples), or the features themselves (6 features).
Here is my current PyTorch code aimed at handling this problem:
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=6, latent_dim=2):
        super(Autoencoder, self).__init__()
        self.input_dim = input_dim
        self.latent_dim = latent_dim
        # encoder: input_dim -> 32 -> latent_dim
        self.encode = nn.Sequential(nn.Linear(self.input_dim, 32),
                                    nn.LeakyReLU(0.02),
                                    nn.Linear(32, self.latent_dim),  # takes the 32-unit output of the previous layer
                                    )
        # decoder: latent_dim -> 32 -> input_dim
        self.decode = nn.Sequential(nn.Linear(self.latent_dim, 32),
                                    nn.LeakyReLU(0.02),
                                    nn.Linear(32, self.input_dim)
                                    )
        self.apply(weights_init)  # weights_init is a custom initializer defined elsewhere

    def encoded(self, x):
        # encodes data to latent space
        return self.encode(x)

    def decoded(self, x):
        # decodes latent-space data back to 'real' space
        return self.decode(x)

    def forward(self, x):
        en = self.encoded(x)
        de = self.decoded(en)
        return de
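For completeness, a minimal sketch of how this autoencoder might be trained on the 6-feature data; the initializer, optimizer, learning rate and standardization step are assumptions for illustration, not taken from the question:

import torch
import torch.nn as nn

def weights_init(m):
    # simple Xavier init, standing in for the question's custom initializer
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

X = torch.randn(1000, 6)            # placeholder for the real 1000 x 6 dataset
X = (X - X.mean(0)) / X.std(0)      # standardize each feature

model = Autoencoder(input_dim=6, latent_dim=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    recon = model(X)
    loss = criterion(recon, X)      # reconstruction loss
    loss.backward()
    optimizer.step()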
This yields the training/test loss and latent space shown in the attached plots.
I would greatly appreciate any advice on this subject. I recognize this is likely a rather simple question, so apologies in advance!

Multivariate multi-step time-series forecasting gives bad prediction results (PyTorch LSTM Seq2Seq)

I am trying to build an LSTM-based Seq2Seq model in PyTorch for multivariate, multi-step prediction.
[Figure: the data.]
The data used are shown in the figure above, where the last column is the target and all the preceding columns are features. For preprocessing, I use MinMaxScaler to scale all the data to between -1 and 1.
[Figure: features and target.]
Then I used an Encoder-Decoder structure.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size, batch_size):
        super().__init__()
        self.output_size = output_size
        self.Encoder = Encoder(input_size, hidden_size, num_layers, batch_size)
        self.Decoder = Decoder(input_size, hidden_size,
                               num_layers, output_size, batch_size)

    def forward(self, input_seq):
        batch_size, seq_len, _ = input_seq.shape
        # encode the whole input sequence into a final hidden/cell state
        h, c = self.Encoder(input_seq)
        outputs = torch.zeros(batch_size, seq_len, self.output_size).to(device)  # device is defined globally
        # decode step by step, feeding the original input of each step into the decoder
        for t in range(seq_len):
            _input = input_seq[:, t, :]
            # print(_input.shape)
            output, h, c = self.Decoder(_input, h, c)
            outputs[:, t, :] = output
        # only the prediction of the last time step is returned
        return outputs[:, -1, :]
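The Encoder and Decoder classes are not shown in the question; a minimal sketch of what they might look like, consistent with how they are called above (this is an assumption, not the asker's actual code):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, batch_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)

    def forward(self, input_seq):
        # return only the final hidden and cell states
        _, (h, c) = self.lstm(input_seq)
        return h, c

class Decoder(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size, batch_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, _input, h, c):
        # _input: (batch, input_size) -> add a time dimension of length 1
        output, (h, c) = self.lstm(_input.unsqueeze(1), (h, c))
        return self.linear(output.squeeze(1)), h, c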
The Training
def seq2seq_train(model, Dtr, Val, path):
    model = model
    loss_function = nn.MSELoss().to(device)
    # loss_function = nn.L1Loss().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                                 weight_decay=1e-4)
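The rest of the training function is not shown in the question. For context, a typical continuation of seq2seq_train might look roughly like this, assuming Dtr and Val are DataLoaders yielding (sequence, target) batches:

    best_val = float("inf")
    for epoch in range(100):
        model.train()
        for seq, target in Dtr:
            seq, target = seq.to(device), target.to(device)
            optimizer.zero_grad()
            loss = loss_function(model(seq), target)
            loss.backward()
            optimizer.step()

        # validation pass; keep the checkpoint with the lowest validation loss
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_function(model(s.to(device)), t.to(device)).item()
                           for s, t in Val) / len(Val)
        if val_loss < best_val:
            best_val = val_loss
            torch.save(model.state_dict(), path)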
After 100 epochs of training, the losses and test results are as follows.
[Figure: loss history.]
[Figure: test result.]
The validation loss doesn't seem to drop, and the prediction seems bad.
Then I used Optuna to optimize hyperparameters, including the number of hidden units, the number of LSTM layers, dropout, etc., but the results are still not good: all runs have a high validation loss.
I would like to know what causes this result: is it a problem with the data, the model structure, or the hyperparameters?
I hope to get help, thank you very much.
Tentative answer based on info provided:
Note that when one uses cross-entropy loss for classification, as is usually done, bad predictions are penalized much more strongly than good predictions are rewarded. For a cat image, the loss is −log(prediction), where prediction is the predicted probability of the correct class, so even if many cat images are correctly predicted (low loss), a single confidently misclassified cat image will have a very high loss, hence "blowing up" your mean loss. See this answer for further illustration of this phenomenon. (Increasing loss with stable accuracy could also be caused by good predictions getting classified a little worse, but I find that less likely because of this loss "asymmetry".)
So I think that when both accuracy and loss are increasing, the network is starting to overfit, and both phenomena are happening at the same time. The network is learning patterns that are only relevant to the training set and not great for generalization (phenomenon two), so some images from the validation set get predicted really wrong, with the effect amplified by the "loss asymmetry". At the same time, it is still learning some patterns which are useful for generalization (phenomenon one, "good learning"), as more and more images are being correctly classified.
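A quick numeric illustration of that asymmetry (the probabilities are made up):

import math

# a confidently correct prediction adds almost nothing to the mean loss...
print(-math.log(0.99))  # ~0.01
# ...while a single confidently wrong prediction dominates it
print(-math.log(0.01))  # ~4.61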
There is also a concise explanation in this Tweet of why you may encounter validation loss being lower than training loss.

How to understand the bias term in language model head (when we tie the word embeddings)?

I was learning the masked language modeling codebase in Huggingface Transformers. Just a question to understand the language model head.
Here is the final linear layer, where we project from hidden size to vocab size (https://github.com/huggingface/transformers/blob/f2fbe4475386bfcfb3b83d0a3223ba216a3c3a91/src/transformers/models/bert/modeling_bert.py#L685-L702):
self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
self.bias = nn.Parameter(torch.zeros(config.vocab_size))
self.decoder.bias = self.bias
We set the bias term to zero here. Later, when we initialize the weights, we tie the weight of this linear layer to the word embedding matrix.
But we don't do such a thing for the bias term. I wonder how we can understand that, and why we want to initialize the bias term as a zero vector.
https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L1060-L1079
def tie_weights(self):
    """
    Tie the weights between the input embeddings and the output embeddings.

    If the `torchscript` flag is set in the configuration, can't handle parameter sharing so we are cloning the
    weights instead.
    """
    if getattr(self.config, "tie_word_embeddings", True):
        output_embeddings = self.get_output_embeddings()
        if output_embeddings is not None:
            self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())

    if getattr(self.config, "is_encoder_decoder", False) and getattr(self.config, "tie_encoder_decoder", False):
        if hasattr(self, self.base_model_prefix):
            self = getattr(self, self.base_model_prefix)
        self._tie_encoder_decoder_weights(self.encoder, self.decoder, self.base_model_prefix)

    for module in self.modules():
        if hasattr(module, "_tie_weights"):
            module._tie_weights()
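To see concretely what tying shares and what it does not, here is a minimal standalone sketch (toy sizes, not Huggingface code): after tying, the decoder's weight matrix is the same tensor as the embedding matrix, while the bias stays a separate, zero-initialized parameter that is learned on its own.

import torch
import torch.nn as nn

hidden_size, vocab_size = 8, 100

embeddings = nn.Embedding(vocab_size, hidden_size)
decoder = nn.Linear(hidden_size, vocab_size, bias=False)
bias = nn.Parameter(torch.zeros(vocab_size))

# tie: the output projection reuses the embedding matrix (both have shape (vocab, hidden))
decoder.weight = embeddings.weight

hidden_states = torch.randn(2, 5, hidden_size)  # (batch, seq, hidden)
logits = decoder(hidden_states) + bias          # the bias shifts every token's logit independently
print(logits.shape)                             # torch.Size([2, 5, 100])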
My understanding:
Because the final linear layer receives hidden representations that have been transformed by several feed-forward layers, they may no longer match the embedding space exactly, so the bias term is needed to compensate for that mismatch.
As I'm not sure my understanding is accurate, I would like to hear your opinions.

How effective is transfer learning? Keeping only two specific output features without resetting them

I want to keep only two specific output features without resetting them, because resetting the features would lose the pre-trained weights.
For example, I don't want to do this:
# https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html?highlight=transfer%20learning%20ant%20bees
model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, 2)
The code above follows the transfer learning tutorial on PyTorch.
I want to do this to see how effective transfer learning is.
Even without transfer learning, a model might be effective: removing 998 of the 1000 categories and leaving only two, ant and bee, could already give a decent classifier since only two choices remain.
I do not want to re-train the model; I want to use the weights as they are, otherwise it would be the same as transfer learning.
You can certainly try this. You can reduce the model output to just the two logits you want to compare:
chosen_cats = torch.Tensor([ant_index, bee_index]).long()

with torch.set_grad_enabled(phase == 'train'):
    outputs = model(inputs)
    # keep only the ant and bee logits out of the 1000 ImageNet classes
    outputs = torch.index_select(outputs, 1, chosen_cats)
    _, preds = torch.max(outputs, 1)
    loss = criterion(outputs, labels)
In this scenario, the preds will be 0 or 1, with 0 predicting ant and 1 predicting bee, so you will need to also modify your labels to reflect this.
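A minimal sketch of that label remapping; ant_index and bee_index are the ImageNet class indices (which you would have to look up), and this assumes the labels tensor currently holds those original indices:

import torch

label_map = {ant_index: 0, bee_index: 1}
labels = torch.tensor([label_map[int(l)] for l in labels])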

Why do I get the same prediction for all training samples?

I have a neural network with num_labels separate outputs, where each output consists of a softmax layer with two nodes (Yes/No).
I am taking the output of a convolution_layer and feeding it as input to a simple softmax_layer, which I then feed into each of said outputs:
softmax_layer = Dense(num_labels, activation='softmax', name='softmax_layer')(convolution_layer)

outputs = list()
for i in range(num_labels):
    out_y = Dense(2, activation='softmax', name='out_{:d}'.format(i))(softmax_layer)
    outputs.append(out_y)
So far I was able to train the model by providing a list of training samples, but now I have noticed that I am getting the exact same output for completely different samples in a batch:
[Output omitted: each column is a (2,1) array, the prediction for one sample.]
I've checked the samples; they are different. I've also tried, for example, feeding the convolution_layer directly into the outputs, and in that case the predictions are different. I only see this outcome when I do it the way shown above.
I could live with the outputs being merely "similar"; in that case I'd think the network is just not learning what I want it to learn. But since they are exactly the same, I am not quite sure what the problem is.
I've tried something similar with a simple feed-forward network:
from keras.layers import Input, Dense
from keras.models import Model

class FeedForward:
    def __init__(self, input_dim, nb_classes):
        in_x = Input(shape=(input_dim,), name='in_x')
        h1 = Dense(14, name='h1', activation='relu')(in_x)
        h2 = Dense(8, name='h2', activation='relu')(h1)
        out = Dense(nb_classes, name='out', activation='softmax')(h2)
        self.model = Model(input=[in_x], output=[out])

    def compile_model(self, optimizer='adam', loss='binary_crossentropy'):
        self.model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
But it behaves similarly. I can't imagine it's due to imbalanced data: there are 13 classes, and while there is some imbalance, it's not as though one class holds 90% of the mass.
Am I doing this right?
