Pytorch's Transformer decoder accuracy fluctuation - pytorch

I have a sequence to sequence POS tagging model which uses Transformer decoder to generate target tokens.
My implementation of Pytorch's Transformer decoder is as follows:
in the initialization:
self.decoder_layer = nn.TransformerDecoderLayer(d_model=ENV_HIDDEN_SIZE, nhead=2,batch_first=True,dim_feedforward=300 ,activation="relu")
self.transformer_decoder = nn.TransformerDecoder(self.decoder_layer, num_layers=2)
and in the forward function:
if infer==False: # for training
embedded=embedded*math.sqrt(ENV_HIDDEN_SIZE)
embedded = self.pos_encoder(embedded)
zol = self.transformer_decoder(tgt=embedded,memory=newtensor
,memory_mask=self.transformer_mask
,memory_key_padding_mask=x_mask
,tgt_mask=self.transformer_mask)
scores = self.slot_trans(self.dropout3(zol))
else: #for inferrence
bos = Variable(torch.LongTensor([[tag2index['<BOS>']]*batch_size])).cuda().transpose(1,0)
bos = self.embedding(bos)
tokens=bos
for i in range(length):
temp_embedded=tokens*math.sqrt(ENV_HIDDEN_SIZE)
temp_embedded = self.pos_encoder(temp_embedded)
zol = self.transformer_decoder(tgt=temp_embedded,
memory=newtensor,
tgt_mask=self.transformer_mask[:i+1,:i+1],
memory_key_padding_mask=x_mask,
memory_mask=self.transformer_mask[:i+1,:]
)
scores = self.slot_trans(self.dropout3(zol))
softmaxed = self.softmax(scores)
_,input = torch.max(softmaxed,2)
newtok = self.embedding(input)
tokens=torch.cat((bos,newtok),dim=1)
and the memory_mask is generated by the function "generate_square_subsequent_mask" as given:
def generate_square_subsequent_mask(sz: int) :
"""Generates an upper-triangular matrix of -inf, with zeros on diag."""
return torch.triu(torch.ones(sz, sz) * float('-inf'), diagonal=1)
I am observing something weird. If I do not feed the memory_mask with generate_subsequent_mask -which I should not according to this post-, the accuracy severely decreases. Furthermore, accuracy of the model fluctuates between 50% and 90% on each epoch randomly on the test set but not the training set.
if I do feed the memory_mask, everything is fine, and model accuracy steadily increases to 95% on the test set. Moreover, the final accuracy takes a hit when not feeding the memory_mask.
Things I tried:
Without memory_mask: Tuning the learning rate.
Without memory_mask: Increasing the nhead and num_layers.
Using a simple linear layer.
At the end-note, using a simple linear layer instead of the transformer decoder provides a better accuracy. Any ideas as to why this is happening?

Related

How do I build a probability matrix output layer in Keras

Suppose I need to build a network that takes two inputs:
A patient's information, represented as an array of features
Selected treatment, represented as one-hot encoded array
Now how do I build a network that outputs a 2D probability matrix A where A[i,j] represents the probability the patient will end up at state j under treatment i. Let's say there are n possible states, and under any treatment, the total probability of all n states sums up to 1.
I wanted to do this because I was motivated by a similar network, where the inputs are the same as above, but the output is a 1d array representing the expected lifetime after treatment i is delivered. And such network is built as follows:
def default_dense(feature_shape, n_treatment):
feature_input = keras.layers.Input(feature_shape)
treatment_input = keras.layers.Input((n_treatments,))
hidden_1 = keras.layers.Dense(16, activation = 'relu')(feature_input)
hidden_2 = keras.layers.Dense(16, activation = 'relu')(hidden_1)
output = keras.layers.Dense(n_treatments)(hidden_2)
output_on_action = keras.layers.multiply([output, treatment_input])
model = keras.models.Model([feature_input, treatment_input], output_on_action)
model.compile(optimizer=tf.optimizers.Adam(0.001),loss='mse')
return model
And the training is simply
model.fit(x = [features, encoded_treatments], y = encoded_treatments * lifetime[:, np.newaxis], verbose = 0)
This is super handy because when predicting, I can use np.ones() as the encoded_treatments, and the network gives expected lifetimes under all treatments, thus choosing the best one is one-step. Certainly I can create multiple networks, each for a treatment, but it would be much less efficient.
Now the questions is, can I do the same to probability output?
I have figured it out myself. The trick is to use RepeatVector() and Permute() layers to generate a matrix mask for treatments.
The output is an element-wise Multiply() of the mask and a Softmax() of same size.

TF Keras Custom Layer accuracy drop with element-wise operations

I'm writing a custom layer for a TF Keras application. This layer should be able to perform a 2D convolution with additional masking information.
The layer is quite simple (omitting the init and compute_output_shape functions):
def build(self, input_shape):
ks = self.kernel_size + (int(input_shape[0][-1]),self.filters)
self.kernel = self.add_weight(name = 'kernel',shape = ks)
self.ones = self.add_weight(name='ones',shape=ks,
trainable=False, initializer= initializers.get('ones'))
self.bias = self.add_weight(name='bias',shape=(self.filters,))
def call(self,x):
img,msk = x
#img = tf.multiply(img,msk)
img = tf.nn.convolution(img,self.kernel)
msk = tf.nn.convolution(msk,self.ones)
#img = tf.divide(img,msk)
img = bias_add(img,self.bias)
return [img,msk]
The problem lies within those two commented out lines. They should just provide a simple, element-wise multiplication and division. If they are commented out, everything works fine. If I just comment one in, the accuracy of my model drops by around factor 2-3.
For testing, I simply used a mask of ones. That should have no influence for the output of this layer or it's performance (in accuracy terms).
I tried this with the current version of TF (r 1.12), the current nightly (r 1.13) and the 2.0 preview. Also I tried to replace the troublesome lines with e.g. keras Lambda layers and keras Multiply layers.
This might or might not be correlated to this problem:
Custom TF-Keras Layer performs worse than built-in layer
Mathematically the element-wise operations shouldn't have an impact (as long as the mask is only consistent of ones).
Also the element-wise operations shouldn't have an impact on the performance of this layer, since they don't influence the weights, and don't influence the data.
I don't know why this happens and hope some of you have an idea.
EDIT: Added kernel initializer, which I forgot before

UpSampling1D in keras insanely slow?

I'm trying to build an autoencoder in Keras, everything is going fine but when I add the UpSampling1D layer and run the code and try to get a model summary the program just freezes forever. My problem is that I have an input and output size of 220500, the convolutional layers have no problems with this and compile almost instantly. However the upsampling layers start to become insanely slow when the number of layers to upsample reach about 50 000 and basically freeze. Is there any way around this or is it some inherent limitation in the upsample layer? Also, why would this be? How come a convolution can handle much larger sizes than upsampling? :S
Here's my actual code:
def autoencoder(input_dim):
input_layer = Input(shape=(input_dim,1))
encode = Conv1D(filters=1,kernel_size=10,strides=2,activation="relu",padding='same')(input_layer)
encode = BatchNormalization()(encode)
n=20
for i in range(15):
encode = Conv1D(filters=1,kernel_size=10,strides=2,activation="relu",padding='same')(encode)
encode = BatchNormalization()(encode)
decode = Conv1D(filters=1,kernel_size=10,strides=1,activation="relu",padding='same')(encode)
decode = UpSampling1D(2)(decode)
decode = BatchNormalization()(decode)
for i in range(14):
decode = Conv1D(filters=1,kernel_size=10,strides=1,activation="relu",padding='same')(decode)
decode = UpSampling1D(2)(decode)
decode = BatchNormalization()(decode)
decode = Conv1D(filters=1,kernel_size=10,strides=1,activation="sigmoid",padding='same')(decode)
autoencoder_model = Model(input_layer, decode)
autoencoder_model.compile(optimizer='adadelta', loss='binary_crossentropy')
autoencoder_model.summary()
return autoencoder_model

How to compare predictive power of PCA and NMF

I would like to compare the output of an algorithm with different preprocessed data: NMF and PCA.
In order to get somehow a comparable result, instead of choosing just the same number of components for each PCA and NMF, I would like to pick the amount that explains e.g 95% of retained variance.
I was wondering if its possible to identify the variance retained in each component of NMF.
For instance using PCA this would be given by:
retainedVariance(i) = eigenvalue(i) / sum(eigenvalue)
Any ideas?
TL;DR
You should loop over different n_components and estimate explained_variance_score of the decoded X at each iteration. This will show you how many components do you need to explain 95% of variance.
Now I will explain why.
Relationship between PCA and NMF
NMF and PCA, as many other unsupervised learning algorithms, are aimed to do two things:
encode input X into a compressed representation H;
decode H back to X', which should be as close to X as possible.
They do it in a somehow similar way:
Decoding is similar in PCA and NMF: they output X' = dot(H, W), where W is a learned matrix parameter.
Encoding is different. In PCA, it is also linear: H = dot(X, V), where V is also a learned parameter. In NMF, H = argmin(loss(X, H, W)) (with respect to H only), where loss is mean squared error between X and dot(H, W), plus some additional penalties. Minimization is performed by coordinate descent, and result may be nonlinear in X.
Training is also different. PCA learns sequentially: the first component minimizes MSE without constraints, each next kth component minimizes residual MSE subject to being orthogonal with the previous components. NMF minimizes the same loss(X, H, W) as when encoding, but now with respect to both H and W.
How to measure performance of dimensionality reduction
If you want to measure performance of an encoding/decoding algorithm, you can follow the usual steps:
Train your encoder+decoder on X_train
To measure in-sample performance, compare X_train'=decode(encode(X_train)) with X_train using your preferred metric (e.g. MAE, RMSE, or explained variance)
To measure out-of-sample performance (generalizing ability) of your algorithm, do step 2 with the unseen X_test.
Let's try it with PCA and NMF!
from sklearn import decomposition, datasets, model_selection, preprocessing, metrics
# use the well-known Iris dataset
X, _ = datasets.load_iris(return_X_y=True)
# split the dataset, to measure overfitting
X_train, X_test = model_selection.train_test_split(X, test_size=0.5, random_state=1)
# I scale the data in order to give equal importance to all its dimensions
# NMF does not allow negative input, so I don't center the data
scaler = preprocessing.StandardScaler(with_mean=False).fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)
# train the both decomposers
pca = decomposition.PCA(n_components=2).fit(X_train_sc)
nmf = decomposition.NMF(n_components=2).fit(X_train_sc)
print(sum(pca.explained_variance_ratio_))
It will print you explained variance ratio of 0.9536930834362043 - the default metric of PCA, estimated using its eigenvalues. We can measure it in a more direct way - by applying a metric to actual and "predicted" values:
def get_score(model, data, scorer=metrics.explained_variance_score):
""" Estimate performance of the model on the data """
prediction = model.inverse_transform(model.transform(data))
return scorer(data, prediction)
print('train set performance')
print(get_score(pca, X_train_sc))
print(get_score(nmf, X_train_sc))
print('test set performance')
print(get_score(pca, X_test_sc))
print(get_score(nmf, X_test_sc))
which gives
train set performance
0.9536930834362043 # same as before!
0.937291711378812
test set performance
0.9597828443047842
0.9590555069007827
You can see that on the training set PCA performs better than NMF, but on the test set their performance is almost identical. This happens, because NMF applies lots of regularization:
H and W (the learned parameter) must be non-negative
H should be as small as possible (L1 and L2 penalties)
W should be as small as possible (L1 and L2 penalties)
These regularizations make NMF fit worse than possible to the training data, but they might improve its generalizing ability, which happened in our case.
How to choose the number of components
In PCA, it is simple, because its components h_1, h_2, ... h_k are learned sequentially. If you add the new component h_(k+1), the first k will not change. Thus, you can estimate performance of each component, and these estimates will not depent on the number of components. This makes it possible for PCA to output the explained_variance_ratio_ array after only a single fit to data.
NMF is more complex, because all its components are trained at the same time, and each one depends on all the rest. Thus, if you add the k+1th component, the first k components will change, and you cannot match each particular component with its explained variance (or any other metric).
But what you can to is to fit a new instance of NMF for each number of components, and compare the total explained variance:
ks = [1,2,3,4]
perfs_train = []
perfs_test = []
for k in ks:
nmf = decomposition.NMF(n_components=k).fit(X_train_sc)
perfs_train.append(get_score(nmf, X_train_sc))
perfs_test.append(get_score(nmf, X_test_sc))
print(perfs_train)
print(perfs_test)
which would give
[0.3236945680665101, 0.937291711378812, 0.995459457205891, 0.9974027602663655]
[0.26186701106012833, 0.9590555069007827, 0.9941424954209546, 0.9968456603914185]
Thus, three components (judging by the train set performance) or two components (by the test set) are required to explain at least 95% of variance. Please notice that this case is unusual and caused by a small size of training and test data: usually performance degrades a little bit on the test set, but in my case it actually improved a little.

Keras: triplet loss with positive and negative sample within batch

I try to refactor my Keras code to use 'Batch Hard' sampling for the triplets, as proposed in https://arxiv.org/pdf/1703.07737.pdf.
" the core idea is to form batches by randomly sampling P classes
(person identities), and then randomly sampling K images of each class
(person), thus resulting in a batch of PK images. Now, for each
sample a in the batch, we can select the hardest positive and the
hardest negative samples within the batch when forming the triplets
for computing the loss, which we call Batch Hard"
So at the moment I have a Python generator (for use with model.fit_generator in Keras) which produces batches on the CPU. Then the actual forward and backward passes through the model could be done on the GPU.
However, how to make this fit with the 'Batch Hard' method? The generator samples 64 images, for which 64 triplets should be formed. First a forward pass is required to obtain the 64 embeddings with the current model.
embedding_model = Model(inputs = input_image, outputs = embedding)
But then the hardest positive and hardest negative have to be selected from the 64 embeddings to form triplets. Then the loss can be computed
anchor = Input(input_shape, name='anchor')
positive = Input(input_shape, name='positive')
negative = Input(input_shape, name='negative')
f_anchor = embedding_model(anchor)
f_pos = embedding_model(pos)
f_neg = embedding_model(neg)
triplet_model = Model(inputs = [anchor, positive, negative], outputs=[f_anchor, f_pos, f_neg])
And this triplet_model can be trained by defining a triplet loss function. However, is it possible with Keras to use the fit_generator and the 'Batch Hard' method? Or how to obtain access to the embeddings from the other samples in the batch?
Edit: With keras.layers.Lambda I can define an own layer creating triplets with input (batch_size, height, width, 3) and output (batch_size, 3, height, width, 3), but I also need access to the id's somewhere. Is this possible within the layer?

Resources