Seq2Seq (GNN + RNN) - Odd predictions despite optimized loss - PyTorch

I’d like to ask for your advice/expertise on an issue that I am currently facing.
Summary:
I am training a Seq2Seq model that generates a natural language question based on a graph. Training and validation loss decrease throughout training, but the generated questions are nonsense.
Context:
My goal is to create a model that is capable of generating a question given a small ‘question graph’, that represents some semantic context. For example, a question graph such as
(Person) - [acts in] - (Movie) - [directed by] - (Director)
(think of paths that can be matched using Cypher in Neo4j), could result in a question like
Who acts in the movie that is directed by the director?
Node attributes, and copying them into the generated question, can be ignored for now.
My Approach:
Using a question dataset (currently all questions from HotpotQA), I generate a question graph for each question using algorithmic approaches (NER & entity tagging for nodes, dependency parsing for edges). This works fine. Combined with the original questions from which the question graphs originate, I now have a set of training pairs (question graph - question) that I am using to train a Seq2Seq model as follows:
Encoder: GNN that outputs a single graph embedding for each question graph. Each node and edge is initialized with pre-trained word embeddings (ConceptNet Numberbatch). It has a single GNN layer (NNConv from PyTorch Geometric) to prevent over-smoothing of node embeddings, since question graphs consist of only about 3-4 nodes. This layer makes use of edge embeddings that encode the edge labels (e.g. acts in and directed by).
Decoder: RNN, with the initial hidden state being the graph embedding produced by the encoder. I use a GRU layer with a learnable embedding layer. The vocabulary is built by keeping only words with a minimum word frequency (say, 8 occurrences).
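The vocabulary filtering is essentially the following (a sketch; the special tokens shown are illustrative):

from collections import Counter

def build_vocab(tokenized_questions, min_freq=8):
    # keep only tokens occurring at least min_freq times; everything else maps to UNK
    counts = Counter(tok for q in tokenized_questions for tok in q)
    itos = ["PAD", "SOS", "EOS", "UNK"]
    itos += sorted(tok for tok, c in counts.items() if c >= min_freq)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos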
Problem:
The underlying issue is that I can optimize the loss, but when examining the generated questions, they neither make sense grammatically nor do the generated words seem related to the question graph (only very vaguely). Even predicting on the training set does not improve that quality. Also, in terms of generating nice questions, I am not able to overfit the model (by disabling dropout, using a complex model and running many, many epochs).
What I have tried/observed:
I trained using various numbers of training pairs (1k - 60k). All show similar behaviour.
I tested various parameters: vocab size, batch size (16-128), teacher forcing (0.1 to 0.5) and especially learning rates (including scheduling) ranging from 0.1 to 0.001.
Various model complexities (dropout from 0.2 to 0.5, stacking 1, 2 or 3 GRU layers).
I typically notice high oscillations of the validation loss in early training. Interestingly, these oscillations persist when I use the training data for evaluation (the hints given here regarding this did not help me: https://stats.stackexchange.com/questions/255105/why-is-the-validation-accuracy-fluctuating/392215#392215)
I noticed that predictions are generally much shorter than the actual question and often contain very repetitive patterns.
While loss curves behave differently, the generated questions are of similar (low) quality.
Example Outputs:
(For context: the “MASK_” tokens are predictions of nodes. During preprocessing, I replace all named entities with their corresponding label; e.g. 'Who visited Michael last weekend?' could become 'MASK_Person visited MASK_Person last MASK_DATE?'. Since I do not adopt a classical copy mechanism to copy attribute names from the question graph, I instead predict the node labels (which are part of the vocabulary) and replace them with node attributes later in my architecture.)
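As a rough illustration of that masking step (shown here with spaCy and its stock entity labels, which differ from my actual label set and pipeline):

import spacy

nlp = spacy.load("en_core_web_sm")

def mask_entities(question: str) -> str:
    # replace each named-entity span with a MASK_<label> token
    doc = nlp(question)
    pieces, last = [], 0
    for ent in doc.ents:
        pieces.append(question[last:ent.start_char])
        pieces.append(f"MASK_{ent.label_}")
        last = ent.end_char
    pieces.append(question[last:])
    return "".join(pieces)

print(mask_entities("Who visited Michael last weekend?"))
# e.g. 'Who visited MASK_PERSON MASK_DATE?' (exact spans depend on the NER model)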
Actual: What type of series was the 2010 series which starred a Hong Kong actress born in 1970?
Predicted: ['which', 'was', 'a', 'of', 'a', 'the', 'in']
Actual: Which film was made first Portrait of Gina or Crazy Love?
Predicted: ['the', 'MASK_Movie', 'a', '?', 'a']
Actual: DJ Fisher was an agent for which basketball player whose name
means “special blessing”?
Predicted: ['which', 'was', 'a', 'UNK', '?']
Actual: What year did the group who sung “Another Rainy Day in New
York City” form?
Predicted: ['what', 'MASK_MusicGroup', 'a', 'the', 'that']
Actual: The song Lifted by Love was included in the soundtrack of a
film directed by who ?
Predicted: ['which', 'was', 'a', 'the', 'a', 'of', 'a']
Actual: Huh Jung directed a South Korean horror film that was
released on what day in 2017?
Predicted: ['what', 'directed', 'did', 'the', 'MASK_Nationality', 'in', '?', 'in']
Actual: What Indian Constitution established authority estimated the
money scandal around the 2G Spectrum scam?
Predicted: ['when', 'did', 'was', 'has', 'that']
Actual: Dolomedes briangreenei has been named after which American theoretical physicist and mathematician?
Predicted: ['the', 'of', 'MASK_Ordinal', '?']
Actual: Who released the album on which Outro is the final track?
Predicted: ['the', 'what', 'is', 'the', 'the', 'the', 'that']
Here are two representative training curves I obtained; both lead to output of similar quality to the above:
A: Loss curves. Using 10K samples, initial lr=0.065, low-complexity decoder (single GRU, single dropout of 0.25, batch_size=128)
B: Loss curves. Using 20K samples, initial lr=0.065, higher-complexity decoder (3 GRUs, two dropouts of 0.4, batch_size=128)
(Note the scale on the y-axis.)
Encoder & Decoder:
Encoder GNN:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import NNConv

class QuestionGraphGNN(torch.nn.Module):
    def __init__(self, in_channels=301, hidden_channels=256, out_channels=(vocab size), dropout=0.4, aggr='mean'):
        super(QuestionGraphGNN, self).__init__()
        # edge network: maps each edge embedding to an (in_channels x hidden_channels) weight matrix
        nn1 = torch.nn.Sequential(
            torch.nn.Linear(in_channels, hidden_channels),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_channels, in_channels * hidden_channels))
        self.conv = NNConv(in_channels, hidden_channels, nn1, aggr=aggr)
        self.lin = nn.Linear(hidden_channels, out_channels)
        self.dropout = dropout

    def forward(self, x, edge_index, edge_attr):
        x = self.conv(x, edge_index, edge_attr)
        x = F.leaky_relu(x)
        x = F.dropout(x, p=self.dropout)
        x = self.lin(x)
        return x
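The pooling of the per-node outputs into the single graph embedding mentioned above is not shown inside the module; roughly, that step could look like this (node_out and batch are placeholders for the GNN output and the usual PyTorch Geometric batch vector):

from torch_geometric.nn import global_mean_pool

# node_out: output of QuestionGraphGNN.forward(), shape [num_nodes, out_channels]
# batch: assigns each node to its graph in the mini-batch
graph_embedding = global_mean_pool(node_out, batch)   # shape [num_graphs, out_channels]
decoder_hidden = graph_embedding.unsqueeze(0)          # add the num_layers dim expected by the GRU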
Decoder RNN:
class DecoderRNN(nn.Module):
    def __init__(self, embedding_size, output_size, dropout=0.4):
        super(DecoderRNN, self).__init__()
        self.output_size = output_size
        self.dropout = dropout
        self.embedding = nn.Embedding(output_size, embedding_size)
        self.gru1 = nn.GRU(embedding_size, embedding_size)
        self.gru2 = nn.GRU(embedding_size, embedding_size)
        self.gru3 = nn.GRU(embedding_size, embedding_size)
        self.out = nn.Linear(embedding_size, output_size)
        self.logsoftmax = nn.LogSoftmax(dim=1)

    def forward(self, inp, hidden):
        # one decoding step: a single token index and the previous hidden state
        output = self.embedding(inp).view(1, 1, -1)
        output = F.leaky_relu(output)
        output = F.dropout(output, p=self.dropout)
        output, hidden = self.gru1(output, hidden)
        output = F.dropout(output, p=self.dropout)
        output, hidden = self.gru2(output, hidden)
        output, hidden = self.gru3(output, hidden)
        out = self.out(output[0])
        output = self.logsoftmax(out)
        return output, hidden
The training loop follows the implementation of this tutorial: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#training-the-model
The Loss is torch.nn.NLLLoss.
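For completeness, my per-pair training step follows the tutorial's structure, roughly like this (a simplified sketch; encode_and_pool is a stand-in for my encoder call plus pooling, not actual code from my project):

import random
import torch

def train_step(graph, target_ids, encoder, decoder, enc_opt, dec_opt,
               criterion, sos_idx, teacher_forcing_ratio=0.5):
    # one (question graph, question) pair per step, as in the linked tutorial
    enc_opt.zero_grad()
    dec_opt.zero_grad()

    # assumed: encoding + pooling yields one vector of the decoder's hidden size
    graph_embedding = encode_and_pool(encoder, graph)
    hidden = graph_embedding.view(1, 1, -1)

    loss = 0.0
    inp = torch.tensor([[sos_idx]])
    for t in range(target_ids.size(0)):
        output, hidden = decoder(inp, hidden)
        loss += criterion(output, target_ids[t].view(1))
        if random.random() < teacher_forcing_ratio:
            inp = target_ids[t].view(1, 1)                  # teacher forcing
        else:
            inp = output.argmax(dim=1).detach().view(1, 1)  # feed back own prediction

    loss.backward()
    enc_opt.step()
    dec_opt.step()
    return loss.item() / target_ids.size(0)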
Thank you for taking the time to read this post. I have already learned a lot throughout this project, but I feel I have reached a point where I am out of ideas for further improvement. Any advice on which road to take is appreciated.

Related

How to get intermediate output gradients in a PyTorch model

We can get the loss of the last layer with loss = loss_fn(y_pred, y_true), which results in a loss: Tensor.
Then we call loss.backward() to do backpropagation.
After optimizer.step() we can see the updated model.parameters().
Take the example below:
y = Model1(x)              # with optimizer1
z = Model2(y)              # with optimizer2
loss = loss_fn(z, z_true)
loss.backward()
optimizer2.step()          # update Model2 parameters
# in order to update Model1 parameters, I think we should do:
y.backward(gradient=the_output_gradient_from_Model2)
optimizer1.step()
How do I get the intermediate backpropagation result, e.g. the gradient of the output, which would then be passed to y_pred.backward(gradient=grad)?
Update: The solution is setting requires_grad=True and taking the tensor's x.grad. Thanks for the answers.
PS: The scenario is federated learning: the model is split into two parts. The first part takes the input and forwards it to the second part. The second part calculates the loss and backpropagates it to the first part, so that the first part can take that gradient and do its own backpropagation.
I will assume you're referring to intermediate gradients when you say "loss of a specific layer".
You can access the gradient of the layer with respect to the output loss by accessing the grad attribute on the parameters of your model which require gradient computation.
Here is a simplistic setup:
>>> f = nn.Sequential(
...     nn.Linear(10, 5),
...     nn.Linear(5, 2),
...     nn.Linear(2, 2, bias=False),
...     nn.Sigmoid())
>>> x = torch.rand(3, 10).requires_grad_(True)
>>> f(x).mean().backward()
Navigate through all the parameters per layer:
>>> for n, c in f.named_children():
...     for p in c.parameters():
...         print(f'<{n}>:{p.grad}')
<0>:tensor([[-0.0054, -0.0034, -0.0028, -0.0058, -0.0073, -0.0066, -0.0037, -0.0044,
-0.0035, -0.0051],
[ 0.0037, 0.0023, 0.0019, 0.0040, 0.0050, 0.0045, 0.0025, 0.0030,
0.0024, 0.0035],
[-0.0016, -0.0010, -0.0008, -0.0017, -0.0022, -0.0020, -0.0011, -0.0013,
-0.0010, -0.0015],
[ 0.0095, 0.0060, 0.0049, 0.0102, 0.0129, 0.0116, 0.0066, 0.0077,
0.0063, 0.0091],
[ 0.0005, 0.0003, 0.0002, 0.0005, 0.0006, 0.0006, 0.0003, 0.0004,
0.0003, 0.0004]])
<0>:tensor([-0.0090, 0.0062, -0.0027, 0.0160, 0.0008])
<1>:tensor([[-0.0035, 0.0035, -0.0026, -0.0106, -0.0002],
[-0.0020, 0.0020, -0.0015, -0.0061, -0.0001]])
<1>:tensor([-0.0289, -0.0166])
<2>:tensor([[0.0355, 0.0420],
[0.0354, 0.0418]])
To supplement the gradient-related answer(s), it should be said that you can't get the loss of a single layer; loss is a model-level concept, and generally you can't say which layer is responsible for the error. If the model is deep enough, you can freeze any one layer and it can still train to high accuracy.
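For the split-model scenario described in the question's PS, a minimal sketch of handing the intermediate gradient between the two halves could look like this (the model halves, optimizers, and data are placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

part1 = nn.Linear(10, 5)   # first half of the model
part2 = nn.Linear(5, 2)    # second half of the model
opt1 = torch.optim.SGD(part1.parameters(), lr=0.1)
opt2 = torch.optim.SGD(part2.parameters(), lr=0.1)

x = torch.rand(3, 10)
y_true = torch.randint(0, 2, (3,))

# forward on the first half; detach before handing the activation to the second half
y = part1(x)
y_detached = y.detach().requires_grad_(True)

# the second half computes the loss and backpropagates to its input
z = part2(y_detached)
loss = F.cross_entropy(z, y_true)
loss.backward()
opt2.step()

# the first half receives the gradient w.r.t. its output and continues backprop
y.backward(gradient=y_detached.grad)
opt1.step()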

Keras Semantic Similarity model from pre-trained embeddings

I want to implement a Keras model to predict the similarity between two sentences from word embeddings as follows (I included my full script at the end):
Load word-embedding models, e.g., Word2Vec and fastText.
Generate samples (X1 and X2) by computing the average word vectors for all words in a sentence. If two or more models are used, calculate the arithmetic mean of all embeddings (Frustratingly Easy Meta-Embedding -- Computing Meta-Embeddings by Averaging Source Word Embeddings).
Concatenate X1 and X2 into one array before feeding them to the network.
Compile (and evaluate) the Keras model.
The entire script is as follows:
import numpy as np
from gensim.models import Word2Vec
from keras.layers import Dense
from keras.models import Sequential
from sklearn.model_selection import train_test_split


def encoder_vector(v: str, model: Word2Vec) -> np.array:
    wv_dim = model.vector_size
    if v in model.wv:
        return model.wv[v]
    else:
        return np.zeros(wv_dim)


def encoder_words_avg(words: list[str], model: Word2Vec) -> np.array:
    dim = model.vector_size
    words = [word for word in words if word in model.wv]
    if len(words) >= 1:
        return np.mean(model.wv[words], axis=0)
    else:
        return np.zeros(dim)


def load_samples(mappings, w2v_model, fast_model):
    dim = w2v_model.vector_size
    num = len(mappings)
    X1 = np.zeros((num, dim))
    X2 = np.zeros((num, dim))
    y = np.zeros((num, 1))
    for i in range(num):
        mapping = mappings[i].split("|")
        sentence_1, sentence_2 = mapping[1:]
        e = np.zeros((2, dim))
        # Compute meta-embedding by averaging all embeddings.
        e[0, :] = encoder_words_avg(words=sentence_1.split(), model=w2v_model)
        e[1, :] = encoder_words_avg(words=sentence_1.split(), model=fast_model)
        X1[i] = e.mean(axis=0)
        e[0, :] = encoder_words_avg(words=sentence_2.split(), model=w2v_model)
        e[1, :] = encoder_words_avg(words=sentence_2.split(), model=fast_model)
        X2[i] = e.mean(axis=0)
        y[i] = 0.0 if mapping[0].startswith("-") else 1.0
    return X1, X2, y


def baseline_model(X_train, X_test, y_train, y_test):
    model = Sequential()
    model.add(
        Dense(
            200,
            input_shape=(X_train.shape[1],),
            activation="relu",
            kernel_initializer="he_uniform",
        )
    )
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train, batch_size=8, epochs=14)
    # Evaluate the trained model, using the train and test data
    _, train_acc = model.evaluate(X_train, y_train, verbose=0)
    _, test_acc = model.evaluate(X_test, y_test, verbose=0)
    print("Train: %.3f, Test: %.3f\n" % (train_acc, test_acc))
    return model


def main():
    w2v_model = Word2Vec.load("")
    fast_model = Word2Vec.load("")
    mappings = [
        "1|boiled chicken egg|hen egg whole boiled",
        "2|tomato|tomato substance",
        "3|sweet potatoes|potato chip",
        "-1|watering plants|cornsalad plant",
        "-2|butter|butane",
        "-3|olive plant|black olives",
    ]
    X1, X2, y = load_samples(mappings, w2v_model=w2v_model, fast_model=fast_model)
    # Concatenate both arrays into one before feeding to the network.
    X = np.concatenate([X1, X2], axis=1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    model = baseline_model(X_train, X_test, y_train, y_test)
    model.summary()
The above script seems to work, but the prediction result is very poor even when using only Word2Vec (which makes me think there could be an issue with the Keras model...). Any ideas on how to improve the outcome? Am I doing something wrong?
Thank you.
It's unclear what you're intending to predict.
Do you want your Keras NN to report the same value that the precise cosine-similarity calculation between the two text summary vectors would report? If so, why not just... do the calculation? It's not something I'd necessarily expect a neural architecture to approximate better.
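For reference, that direct calculation needs no training at all; a sketch using the averaged sentence vectors your script already computes (X1, X2):

import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # returns 0.0 when either vector is all zeros
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# similarity of the i-th sentence pair:
# sim = cosine_similarity(X1[i], X2[i])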
Alternatively, if your tiny 6-pair dataset is the target:
Your existing 'gold standard' answers don't seem obviously correct to me. Superficially, 'olive plant' & 'black olives' seem nearly as 'similar' as 'tomato' & 'tomato substance'. Similarly, 'watering plants' & 'cornsalad plant' about-as-similar as 'sweet potatoes' & 'potato chip'.
A mere 6 examples (maybe 5 after the train/test split?) is both inadequate to usefully train a larger neural classifier, and, to the extent the classifier might be easily trained (indeed 'overfit') to the 5 training examples, it won't necessarily have learned anything generalizable to the one hold-out example (which uses vectors quite far from the training texts). With such a paucity of training data, and testing on inputs that might be arbitrarily different from the training data, "very poor" performance is to be expected. Neural nets require lots of varied training examples!
Finally, the strategy of creating combined-embeddings-by-averaging, as investigated by your linked paper, is another atypical practice that seems fishy to me. Even if it could offer some benefits, there's no reason to mix that atypical, somewhat non-intuitive extra practice into your experiment before even having things work with a more typical and simple baseline approach, for comparison, to be sure the extra 'meta'/averaging is worth the complication.
The paper itself doesn't really show any advantage over concatenation, which has a stronger theoretical basis (preserving each model's full independent spaces) than averaging, except by a tiny amount in 1 of 6 tests. Further, the average of GLoVe & CBOW performs the same or worse than GLoVe alone on 3 of their 6 evaluations, and only minimally better on the other 3. That implies to me the outperformance might be mainly random jitter introduced by the extra steps, and the averaging is, at best, a cheap option to consider for a tiny boost, not a generally better approach.
The paper also leaves many natural related questions unaddressed:
Is averaging better than, say, just picking a random half of each models' dimensions for concatenation? That'd be even cheaper!
Might some of the slight lift in some tasks be due not to the averaging, but to the other transformations they've applied (the l2-normalization applied to each source model, or across the whole of each dimension for the GLoVe model)? It's unclear whether this model post-processing was only applied before dual-model averaging, or also to GLoVe in its solo evaluation.
There's other work suggesting post-training transformations of word-vector spaces may improve performance on downstream tasks (see, for example, 'All But The Top'), so it is important to distinguish which steps, exactly, provide which advantages.
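To make the averaging-versus-concatenation comparison concrete, for a single word's vectors from two source models (illustrative shapes only):

import numpy as np

# hypothetical 300-d vectors for the same word from the two source models
v_w2v = np.random.rand(300)
v_ft = np.random.rand(300)

meta_avg = (v_w2v + v_ft) / 2.0           # averaging: stays 300-d, mixes the two spaces
meta_cat = np.concatenate([v_w2v, v_ft])  # concatenation: 600-d, keeps each space intact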

Why does a multilayer perceptron outperform an RNN in CartPole?

Recently, I compared two models for a DQN on CartPole-v0 environment. One of them is a multilayer perceptron with 3 layers and the other is an RNN built up from an LSTM and 1 fully connected layer. I have an experience replay buffer of size 200000 and the training doesn't start until it is filled up.
Although the MLP has solved the problem within a reasonable number of training steps (meaning a mean reward of 195 over the last 100 episodes), the RNN model could not converge as quickly, and its maximum mean reward did not even reach 195!
I have already tried increasing the batch size, adding more neurons to the LSTM's hidden state, increasing the RNN's sequence length, and making the fully connected layer more complex, but every attempt failed: I saw enormous fluctuations in mean reward, so the model hardly converged at all. Might these be signs of early overfitting?
class DQN(nn.Module):
    def __init__(self, n_input, output_size, n_hidden, n_layers, dropout=0.3):
        super(DQN, self).__init__()
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lstm = nn.LSTM(input_size=n_input,
                            hidden_size=n_hidden,
                            num_layers=n_layers,
                            dropout=dropout,
                            batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fully_connected = nn.Linear(n_hidden, output_size)

    def forward(self, x, hidden_parameters):
        batch_size = x.size(0)
        output, hidden_state = self.lstm(x.float(), hidden_parameters)
        seq_length = output.shape[1]
        output1 = output.contiguous().view(-1, self.n_hidden)
        output2 = self.dropout(output1)
        output3 = self.fully_connected(output2)
        new = output3.view(batch_size, seq_length, -1)
        new = new[:, -1]
        return new.float(), hidden_state

    def init_hidden(self, batch_size, device):
        weight = next(self.parameters()).data
        hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().to(device),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_().to(device))
        return hidden
Contrary to what I expected, the simpler model gave a much better result than the other, even though an RNN is supposed to be better at processing time-series data.
Can anybody tell me what's the reason for this?
Also, I should state that I applied no feature engineering and both DQNs worked with raw data. Could the RNN outperform the MLP when using normalized features (i.e., feeding both models normalized data)?
Is there anything you can recommend to improve training efficiency for RNNs and achieve the best results?
"Contrary to what I expected, the simpler model gave a much better result than the other, even though an RNN is supposed to be better at processing time-series data."
There is no time series in CartPole; the state contains all the information needed for an optimal decision. It would be different if, for instance, you learned from images and needed to estimate the pole velocity from a series of images.
Also, it is not true that the more complex model should perform better. On the contrary, it is more likely to overfit. For CartPole you don't even need a NN; a simple linear approximator with RBFs or random Fourier features would suffice. An RNN with an LSTM is certainly overkill for such a simple problem.
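For illustration, a random-Fourier-feature mapping of the 4-dimensional CartPole state can be as small as this (a sketch; a linear Q-function would then be fit on top of these features):

import numpy as np

rng = np.random.default_rng(0)
n_features, state_dim, bandwidth = 256, 4, 1.0
W = rng.normal(scale=1.0 / bandwidth, size=(n_features, state_dim))
b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

def rff(state: np.ndarray) -> np.ndarray:
    # map a raw 4-d state to random Fourier features
    return np.sqrt(2.0 / n_features) * np.cos(W @ state + b)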

LSTMs for binary classification in Keras?

Suppose I have the following dataset X with 2 features and labels Y.
X = [[0.3, 0.1], [0.2, 0.9], [0.4, 0.0]]
Y = [0, 1, 0]

# split into input (X) and output (Y) variables
X = dataset[:, 0:2]  # X features are the first two columns
Y = dataset[:, 2]

model = Sequential()
model.add(Embedding(2, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y)
It works, but I wanted to know more about parameter_1, parameter_2, parameter_3 that go in
Embedding(parameter_1, parameter_2, input_length=parameter_3)
P.S. I just put in random stuff and don't know what I am doing.
What would be the proper parameters to fill in Embedding(), given the dataset I described above?
Alright, following more precise questions in the comments, here is the explanation.
An embedding layer is usually used to embed words, so I will use a running example with words, but you can think of them as categorical features.
The embedding layer is indeed useful to represent words (categorical features) as vectors in a continuous vector space.
When you have a text, you tokenize your words and assign each a number. They then become categorical features labelled with an index. For example, the sentence "I embed stuff" becomes the list of categorical objects [2, 1, 3], where a dictionary maps each index to a word: {1: "embed", 2: "I", 3: "stuff", 4: "some_other_words", 0: "<pad>"}.
When you use a neural network or a continuous mathematical framework, those discrete objects (categories) are unordered; there is no sense in 2 > 1 when you talk about your words. They are not "numerical values", they are categories. So you want to turn them into numbers, to embed them in a vector space.
This is precisely what the Embedding() layer does: it maps every index to a vector. To do that, there are three main parameters to define:
How many indices you want to use in total. This is the number of words you have in your vocabulary, or the number of categories of the categorical feature you want to encode. This is the input_dim parameter. In our little example, we have 5 words in the vocabulary (indices from 0 to 4), so input_dim = 5. The reason it is called a "dimension" is that, under the hood, Keras transforms the index number into a one-hot vector of dimension = the number of different elements. For example, the word "stuff", which is index 3, will be transformed into the 5-dimensional vector [0 0 0 1 0] before being embedded. This is why your inputs should be integers: they are indices representing where the 1 is in the one-hot vector.
How big you want your output vectors to be. This is the size of the vector space your features will live in. The parameter is output_dim. If you don't have a lot of words in your vocabulary (different categories for your features), this number should be low; in our case we will set it to output_dim = 2. Our 5 words will live in a 2D space.
As embedding layers are often the first layer in a neural network, you need to specify the number of words you have in each sample. This will be the input_length. Our sample was a 3-word phrase, so input_length = 3.
The reason you usually have the embedding layer as the first layer is that it takes integer inputs; other layers in neural networks return real values, so it wouldn't work elsewhere.
So to summarize, what goes into the layer is a sequence of indices, [2, 1, 3] in our example, and what comes out is the embedded vector corresponding to each index. This might be something like [[0.2, 0.4], [-1.2, 0.3], [-0.5, -0.8]].
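As a concrete illustration of the toy example above, using the older Keras API from the question (the printed vectors will differ because the embedding is randomly initialized):

import numpy as np
from keras.layers import Embedding
from keras.models import Sequential

# 5 tokens in the vocabulary, embedded in a 2-D space, 3 tokens per sample
model = Sequential()
model.add(Embedding(input_dim=5, output_dim=2, input_length=3))
model.compile(optimizer="rmsprop", loss="mse")

sample = np.array([[2, 1, 3]])        # "I embed stuff"
print(model.predict(sample).shape)    # (1, 3, 2): one 2-D vector per token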
And to come back to your example, the input should be a list of samples, each sample being a list of indices. There is no use embedding features that are already real values; those already have mathematical meaning the model can understand, as opposed to categorical values.
Is it clearer now? :)

Why do I get the same prediction for all training samples?

I have a neural network with num_labels separate outputs where each output consists of a softmax layer with two nodes (Yes/No).
I take the output of a convolution_layer and feed it as input to a simple softmax_layer, which I then feed into each of said outputs:
softmax_layer = Dense(num_labels, activation='softmax', name='softmax_layer')(convolution_layer)

outputs = list()
for i in range(num_labels):
    out_y = Dense(2, activation='softmax', name='out_{:d}'.format(i))(softmax_layer)
    outputs.append(out_y)
So far I have been able to train the model by providing a list of training samples, but now I noticed that I am getting the exact same output for completely different samples in a batch:
Please note: Here, each column consists of (2,1) arrays. Each column is the prediction for one sample.
I've checked the samples; they are different. I've also tried, e.g., feeding the convolution_layer directly into the outputs; in that case the predictions are different. I only see this outcome when I do it the way shown above.
I could live with the outputs being "similar"; in that case I'd think the network is just not learning what I want it to learn. But since they are exactly the same, I am not quite sure what the problem is.
I've tried something similar with a simple feed forward network:
class FeedForward:
    def __init__(self, input_dim, nb_classes):
        in_x = Input(shape=(input_dim,), name='in_x')
        h1 = Dense(14, name='h1', activation='relu')(in_x)
        h2 = Dense(8, name='h2', activation='relu')(h1)
        out = Dense(nb_classes, name='out', activation='softmax')(h2)
        self.model = Model(input=[in_x], output=[out])

    def compile_model(self, optimizer='adam', loss='binary_crossentropy'):
        self.model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
But it behaves similarly. I can't imagine it's due to imbalanced data. There are 13 classes; there is some imbalance, but it's not as if one class has 90% of the mass.
Am I doing this right?
