TF Keras Custom Layer accuracy drop with element-wise operations - python-3.x

I'm writing a custom layer for a TF Keras application. This layer should be able to perform a 2D convolution with additional masking information.
The layer is quite simple (omitting the init and compute_output_shape functions):
def build(self, input_shape):
ks = self.kernel_size + (int(input_shape[0][-1]),self.filters)
self.kernel = self.add_weight(name = 'kernel',shape = ks)
self.ones = self.add_weight(name='ones',shape=ks,
trainable=False, initializer= initializers.get('ones'))
self.bias = self.add_weight(name='bias',shape=(self.filters,))
def call(self,x):
img,msk = x
#img = tf.multiply(img,msk)
img = tf.nn.convolution(img,self.kernel)
msk = tf.nn.convolution(msk,self.ones)
#img = tf.divide(img,msk)
img = bias_add(img,self.bias)
return [img,msk]
The problem lies within those two commented out lines. They should just provide a simple, element-wise multiplication and division. If they are commented out, everything works fine. If I just comment one in, the accuracy of my model drops by around factor 2-3.
For testing, I simply used a mask of ones. That should have no influence for the output of this layer or it's performance (in accuracy terms).
I tried this with the current version of TF (r 1.12), the current nightly (r 1.13) and the 2.0 preview. Also I tried to replace the troublesome lines with e.g. keras Lambda layers and keras Multiply layers.
This might or might not be correlated to this problem:
Custom TF-Keras Layer performs worse than built-in layer
Mathematically the element-wise operations shouldn't have an impact (as long as the mask is only consistent of ones).
Also the element-wise operations shouldn't have an impact on the performance of this layer, since they don't influence the weights, and don't influence the data.
I don't know why this happens and hope some of you have an idea.
EDIT: Added kernel initializer, which I forgot before

Related

Retrieve only the last hidden state from lstm layer in pytorch sequential

I have a pytorch model:
model = torch.nn.Sequential(
torch.nn.LSTM(40, 256, 3, batch_first=True),
torch.nn.Linear(256, 256),
torch.nn.ReLU()
)
And for the LSTM layer, I want to retrieve only the last hidden state from the batch to pass through the rest of the layers. Ex:
_, (hidden, _) = lstm(data)
hidden = hidden[-1]
Though, that example only works for a subclassed model. I need to somehow do this on a nn.Sequential() model that way when I save it, it can properly be converted to a tensorflow.js model. The reason I can't make and train this model in tensorflow.js is because I'm trying to implement this repo: Resemblyzer in tensorflow.js while still using the same weights as the pretrained Resemblyzer model which was made in pytorch as a subclassed model. I thought of using the torchvisions.transformations.Lambda() transformation but I would assume that would make it incompatible with tensorflow.js. Is there any way to make this possible while still allowing the model to convert properly?
You could split up your sequential but only doing so in the forward definition of your model on inference. Once defined:
model = nn.Sequential(nn.LSTM(40, 256, 3, batch_first=True),
nn.Linear(256, 256),
nn.ReLU())
You can split it:
>>> lstm, fc = model[0], model[1:]
Then infer in two steps:
>>> out, (hidden, _) = lstm(data)
>>> hidden = hidden[-1]
>>> out = fc(out) # <- or fc(out[-1]) depending on what you want
Though the answer is provided above, I thought of elaborating on the same as PyTorch LSTM documentation is confusing.
In TF, we directly get the last_state as the output. No further action needed.
Let us check the Torch output of LSTM:
There are 2 outputs - a sequence and a tuple. We are interested in the last state so we can ignore the sequence and focus on the tuple. The tuple consists of 2 values - the first is the hidden state of the last cell (of all layers in the LSTM) and the second is the cell state of the last cell (again of all layers in the LSTM). We are interested in the hidden state. So
_, tup = self.bilstm(inp)
We are interested in tup[0]. Let us dig further into this.
The shape of tup[0] is somewhat odd with batch size at the centre. On the left of the batch size is the number of layers in the LSTM (multiply 2 if is biLSTM). On the right is the dimension you have provided while defining the LSTM. You could take the output from the last layer by simply doing a tup[0][-1] which is the answer provided above.
Alternatively if you want to make use of hidden states across layers, you may try something like:
out = tup[0].swapaxes(0,1)
out = out.reshape(*out.shape[:-2], -1)
The first line produces shape of batch_size, num_layers, hidden_size_specified. The second line produces shape of batch_size, num_layers x hidden_size_specified
(For e.g., Let us say, yours is a biLSTM and you have 3 layers and your hiddensize is 100, you could choose to concatenate the output such that you get one vector of 2 x 3 x 100 = 600 dimensions and then run a simple linear layer on top of this to get the output you want.)
There is another way to get the output of the LSTM. We discussed that the first output of an LSTM is a sequence:
sequence, tup = self.bilstm(inp)
This sequence is the output of the LAST hidden layer of the LSTM. It is a sequence because it contains hidden states of EVERY cell in this layer. So its length will be the input sequence length that you have provided. We could choose to take the hidden state of the last element in the sequence by doing a:
#shape of sequence is: batch_size, seq_size, dim
sequence = sequence.swapaxes(0,1)
#shape of sequence is: seq_size, batch_size, dim
sequence = sequence[-1]
#shape of sequence is: batch_size, dim (ie last seq is taken)
Needless to say this will be the same value we got by taking the last layer from tup[0]. Well, not quite! If the LSTM is a biLSTM, then using the sequence approach returns is 2 x hidden_size dim output (which is correct) wheras using the tup[0][-1] approach will give us only hidden_size dim even for a biLSTM. OP's LSTM is a non-biLSTM so both answers hold true.

Keras: triplet loss with positive and negative sample within batch

I try to refactor my Keras code to use 'Batch Hard' sampling for the triplets, as proposed in https://arxiv.org/pdf/1703.07737.pdf.
" the core idea is to form batches by randomly sampling P classes
(person identities), and then randomly sampling K images of each class
(person), thus resulting in a batch of PK images. Now, for each
sample a in the batch, we can select the hardest positive and the
hardest negative samples within the batch when forming the triplets
for computing the loss, which we call Batch Hard"
So at the moment I have a Python generator (for use with model.fit_generator in Keras) which produces batches on the CPU. Then the actual forward and backward passes through the model could be done on the GPU.
However, how to make this fit with the 'Batch Hard' method? The generator samples 64 images, for which 64 triplets should be formed. First a forward pass is required to obtain the 64 embeddings with the current model.
embedding_model = Model(inputs = input_image, outputs = embedding)
But then the hardest positive and hardest negative have to be selected from the 64 embeddings to form triplets. Then the loss can be computed
anchor = Input(input_shape, name='anchor')
positive = Input(input_shape, name='positive')
negative = Input(input_shape, name='negative')
f_anchor = embedding_model(anchor)
f_pos = embedding_model(pos)
f_neg = embedding_model(neg)
triplet_model = Model(inputs = [anchor, positive, negative], outputs=[f_anchor, f_pos, f_neg])
And this triplet_model can be trained by defining a triplet loss function. However, is it possible with Keras to use the fit_generator and the 'Batch Hard' method? Or how to obtain access to the embeddings from the other samples in the batch?
Edit: With keras.layers.Lambda I can define an own layer creating triplets with input (batch_size, height, width, 3) and output (batch_size, 3, height, width, 3), but I also need access to the id's somewhere. Is this possible within the layer?

How to correctly implement a batch-input LSTM network in PyTorch?

This release of PyTorch seems provide the PackedSequence for variable lengths of input for recurrent neural network. However, I found it's a bit hard to use it correctly.
Using pad_packed_sequence to recover an output of a RNN layer which were fed by pack_padded_sequence, we got a T x B x N tensor outputs where T is the max time steps, B is the batch size and N is the hidden size. I found that for short sequences in the batch, the subsequent output will be all zeros.
Here are my questions.
For a single output task where the one would need the last output of all the sequences, simple outputs[-1] will give a wrong result since this tensor contains lots of zeros for short sequences. One will need to construct indices by sequence lengths to fetch the individual last output for all the sequences. Is there more simple way to do that?
For a multiple output task (e.g. seq2seq), usually one will add a linear layer N x O and reshape the batch outputs T x B x O into TB x O and compute the cross entropy loss with the true targets TB (usually integers in language model). In this situation, do these zeros in batch output matters?
Question 1 - Last Timestep
This is the code that i use to get the output of the last timestep. I don't know if there is a simpler solution. If it is, i'd like to know it. I followed this discussion and grabbed the relative code snippet for my last_timestep method. This is my forward.
class BaselineRNN(nn.Module):
def __init__(self, **kwargs):
...
def last_timestep(self, unpacked, lengths):
# Index of the last output for each sequence.
idx = (lengths - 1).view(-1, 1).expand(unpacked.size(0),
unpacked.size(2)).unsqueeze(1)
return unpacked.gather(1, idx).squeeze()
def forward(self, x, lengths):
embs = self.embedding(x)
# pack the batch
packed = pack_padded_sequence(embs, list(lengths.data),
batch_first=True)
out_packed, (h, c) = self.rnn(packed)
out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)
# get the outputs from the last *non-masked* timestep for each sentence
last_outputs = self.last_timestep(out_unpacked, lengths)
# project to the classes using a linear layer
logits = self.linear(last_outputs)
return logits
Question 2 - Masked Cross Entropy Loss
Yes, by default the zero padded timesteps (targets) matter. However, it is very easy to mask them. You have two options, depending on the version of PyTorch that you use.
PyTorch 0.2.0: Now pytorch supports masking directly in the CrossEntropyLoss, with the ignore_index argument. For example, in language modeling or seq2seq, where i add zero padding, i mask the zero padded words (target) simply like this:
loss_function = nn.CrossEntropyLoss(ignore_index=0)
PyTorch 0.1.12 and older: In the older versions of PyTorch, masking was not supported, so you had to implement your own workaround. I solution that i used, was masked_cross_entropy.py, by jihunchoi. You may be also interested in this discussion.
A few days ago, I found this method which uses indexing to accomplish the same task with a one-liner.
I have my dataset batch first ([batch size, sequence length, features]), so for me:
unpacked_out = unpacked_out[np.arange(unpacked_out.shape[0]), lengths - 1, :]
where unpacked_out is the output of torch.nn.utils.rnn.pad_packed_sequence.
I have compared it with the method described here, which looks similar to the last_timestep() method Christos Baziotis is using above (also recommended here), and the results are the same in my case.

How to initialize weights when using RELU activation function

I want to make a Conv network and I wish to use the RELU activation function. Can someone please give me a clue of the correct way to initialize weights (I'm using Theano)
Thanks
I'm not sure there is a hard and fast best way to initialize weights and bias for a ReLU layer.
Some claim that (a slightly modified version of) Xavier initialization works well with ReLUs. Others that small Gaussian random weights plus bias=1 (ensuring the weighted sum of positive inputs will remain positive and thus not end up in the ReLUs zero region).
In Theano, these can be achieved like this (assuming weights post-multiply the input):
w = theano.shared((numpy.random.randn((in_size, out_size)) * 0.1).astype(theano.config.floatX))
b = theano.shared(numpy.ones(out_size))
or
w = theano.shared((numpy.random.randn((in_size, out_size)) * tt.sqrt(2 / (in_size + out_size))).astype(theano.config.floatX))
b = theano.shared(numpy.zeros(out_size))

Why is the logloss negative?

I just applied the log loss in sklearn for logistic regression: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
My code looks something like this:
def perform_cv(clf, X, Y, scoring):
kf = KFold(X.shape[0], n_folds=5, shuffle=True)
kf_scores = []
for train, _ in kf:
X_sub = X[train,:]
Y_sub = Y[train]
#Apply 'log_loss' as a loss function
scores = cross_validation.cross_val_score(clf, X_sub, Y_sub, cv=5, scoring='log_loss')
kf_scores.append(scores.mean())
return kf_scores
However, I'm wondering why the resulting logarithmic losses are negative. I'd expect them to be positive since in the documentation (see my link above) the log loss is multiplied by a -1 in order to turn it into a positive number.
Am I doing something wrong here?
Yes, this is supposed to happen. It is not a 'bug' as others have suggested. The actual log loss is simply the positive version of the number you're getting.
SK-Learn's unified scoring API always maximizes the score, so scores which need to be minimized are negated in order for the unified scoring API to work correctly. The score that is returned is therefore negated when it is a score that should be minimized and left positive if it is a score that should be maximized.
This is also described in sklearn GridSearchCV with Pipeline and in scikit-learn cross validation, negative values with mean squared error
a similar discussion can be found here.
In this way, an higher score means better performance (less loss).
I cross checked the sklearn implementation with several other methods. It seems to be an actual bug within the framework. Instead consider the follwoing code for calculating the log loss:
import scipy as sp
def llfun(act, pred):
epsilon = 1e-15
pred = sp.maximum(epsilon, pred)
pred = sp.minimum(1-epsilon, pred)
ll = sum(act*sp.log(pred) + sp.subtract(1,act)*sp.log(sp.subtract(1,pred)))
ll = ll * -1.0/len(act)
return ll
Also take into account that the dimensions of act and pred have to Nx1 column vectors.

Resources