Keras: triplet loss with positive and negative sample within batch - keras

I try to refactor my Keras code to use 'Batch Hard' sampling for the triplets, as proposed in https://arxiv.org/pdf/1703.07737.pdf.
" the core idea is to form batches by randomly sampling P classes
(person identities), and then randomly sampling K images of each class
(person), thus resulting in a batch of PK images. Now, for each
sample a in the batch, we can select the hardest positive and the
hardest negative samples within the batch when forming the triplets
for computing the loss, which we call Batch Hard"
So at the moment I have a Python generator (for use with model.fit_generator in Keras) which produces batches on the CPU. Then the actual forward and backward passes through the model could be done on the GPU.
However, how to make this fit with the 'Batch Hard' method? The generator samples 64 images, for which 64 triplets should be formed. First a forward pass is required to obtain the 64 embeddings with the current model.
embedding_model = Model(inputs = input_image, outputs = embedding)
But then the hardest positive and hardest negative have to be selected from the 64 embeddings to form triplets. Then the loss can be computed
anchor = Input(input_shape, name='anchor')
positive = Input(input_shape, name='positive')
negative = Input(input_shape, name='negative')
f_anchor = embedding_model(anchor)
f_pos = embedding_model(pos)
f_neg = embedding_model(neg)
triplet_model = Model(inputs = [anchor, positive, negative], outputs=[f_anchor, f_pos, f_neg])
And this triplet_model can be trained by defining a triplet loss function. However, is it possible with Keras to use the fit_generator and the 'Batch Hard' method? Or how to obtain access to the embeddings from the other samples in the batch?
Edit: With keras.layers.Lambda I can define an own layer creating triplets with input (batch_size, height, width, 3) and output (batch_size, 3, height, width, 3), but I also need access to the id's somewhere. Is this possible within the layer?

Related

Pytorch's Transformer decoder accuracy fluctuation

I have a sequence to sequence POS tagging model which uses Transformer decoder to generate target tokens.
My implementation of Pytorch's Transformer decoder is as follows:
in the initialization:
self.decoder_layer = nn.TransformerDecoderLayer(d_model=ENV_HIDDEN_SIZE, nhead=2,batch_first=True,dim_feedforward=300 ,activation="relu")
self.transformer_decoder = nn.TransformerDecoder(self.decoder_layer, num_layers=2)
and in the forward function:
if infer==False: # for training
embedded=embedded*math.sqrt(ENV_HIDDEN_SIZE)
embedded = self.pos_encoder(embedded)
zol = self.transformer_decoder(tgt=embedded,memory=newtensor
,memory_mask=self.transformer_mask
,memory_key_padding_mask=x_mask
,tgt_mask=self.transformer_mask)
scores = self.slot_trans(self.dropout3(zol))
else: #for inferrence
bos = Variable(torch.LongTensor([[tag2index['<BOS>']]*batch_size])).cuda().transpose(1,0)
bos = self.embedding(bos)
tokens=bos
for i in range(length):
temp_embedded=tokens*math.sqrt(ENV_HIDDEN_SIZE)
temp_embedded = self.pos_encoder(temp_embedded)
zol = self.transformer_decoder(tgt=temp_embedded,
memory=newtensor,
tgt_mask=self.transformer_mask[:i+1,:i+1],
memory_key_padding_mask=x_mask,
memory_mask=self.transformer_mask[:i+1,:]
)
scores = self.slot_trans(self.dropout3(zol))
softmaxed = self.softmax(scores)
_,input = torch.max(softmaxed,2)
newtok = self.embedding(input)
tokens=torch.cat((bos,newtok),dim=1)
and the memory_mask is generated by the function "generate_square_subsequent_mask" as given:
def generate_square_subsequent_mask(sz: int) :
"""Generates an upper-triangular matrix of -inf, with zeros on diag."""
return torch.triu(torch.ones(sz, sz) * float('-inf'), diagonal=1)
I am observing something weird. If I do not feed the memory_mask with generate_subsequent_mask -which I should not according to this post-, the accuracy severely decreases. Furthermore, accuracy of the model fluctuates between 50% and 90% on each epoch randomly on the test set but not the training set.
if I do feed the memory_mask, everything is fine, and model accuracy steadily increases to 95% on the test set. Moreover, the final accuracy takes a hit when not feeding the memory_mask.
Things I tried:
Without memory_mask: Tuning the learning rate.
Without memory_mask: Increasing the nhead and num_layers.
Using a simple linear layer.
At the end-note, using a simple linear layer instead of the transformer decoder provides a better accuracy. Any ideas as to why this is happening?

Retrieve only the last hidden state from lstm layer in pytorch sequential

I have a pytorch model:
model = torch.nn.Sequential(
torch.nn.LSTM(40, 256, 3, batch_first=True),
torch.nn.Linear(256, 256),
torch.nn.ReLU()
)
And for the LSTM layer, I want to retrieve only the last hidden state from the batch to pass through the rest of the layers. Ex:
_, (hidden, _) = lstm(data)
hidden = hidden[-1]
Though, that example only works for a subclassed model. I need to somehow do this on a nn.Sequential() model that way when I save it, it can properly be converted to a tensorflow.js model. The reason I can't make and train this model in tensorflow.js is because I'm trying to implement this repo: Resemblyzer in tensorflow.js while still using the same weights as the pretrained Resemblyzer model which was made in pytorch as a subclassed model. I thought of using the torchvisions.transformations.Lambda() transformation but I would assume that would make it incompatible with tensorflow.js. Is there any way to make this possible while still allowing the model to convert properly?
You could split up your sequential but only doing so in the forward definition of your model on inference. Once defined:
model = nn.Sequential(nn.LSTM(40, 256, 3, batch_first=True),
nn.Linear(256, 256),
nn.ReLU())
You can split it:
>>> lstm, fc = model[0], model[1:]
Then infer in two steps:
>>> out, (hidden, _) = lstm(data)
>>> hidden = hidden[-1]
>>> out = fc(out) # <- or fc(out[-1]) depending on what you want
Though the answer is provided above, I thought of elaborating on the same as PyTorch LSTM documentation is confusing.
In TF, we directly get the last_state as the output. No further action needed.
Let us check the Torch output of LSTM:
There are 2 outputs - a sequence and a tuple. We are interested in the last state so we can ignore the sequence and focus on the tuple. The tuple consists of 2 values - the first is the hidden state of the last cell (of all layers in the LSTM) and the second is the cell state of the last cell (again of all layers in the LSTM). We are interested in the hidden state. So
_, tup = self.bilstm(inp)
We are interested in tup[0]. Let us dig further into this.
The shape of tup[0] is somewhat odd with batch size at the centre. On the left of the batch size is the number of layers in the LSTM (multiply 2 if is biLSTM). On the right is the dimension you have provided while defining the LSTM. You could take the output from the last layer by simply doing a tup[0][-1] which is the answer provided above.
Alternatively if you want to make use of hidden states across layers, you may try something like:
out = tup[0].swapaxes(0,1)
out = out.reshape(*out.shape[:-2], -1)
The first line produces shape of batch_size, num_layers, hidden_size_specified. The second line produces shape of batch_size, num_layers x hidden_size_specified
(For e.g., Let us say, yours is a biLSTM and you have 3 layers and your hiddensize is 100, you could choose to concatenate the output such that you get one vector of 2 x 3 x 100 = 600 dimensions and then run a simple linear layer on top of this to get the output you want.)
There is another way to get the output of the LSTM. We discussed that the first output of an LSTM is a sequence:
sequence, tup = self.bilstm(inp)
This sequence is the output of the LAST hidden layer of the LSTM. It is a sequence because it contains hidden states of EVERY cell in this layer. So its length will be the input sequence length that you have provided. We could choose to take the hidden state of the last element in the sequence by doing a:
#shape of sequence is: batch_size, seq_size, dim
sequence = sequence.swapaxes(0,1)
#shape of sequence is: seq_size, batch_size, dim
sequence = sequence[-1]
#shape of sequence is: batch_size, dim (ie last seq is taken)
Needless to say this will be the same value we got by taking the last layer from tup[0]. Well, not quite! If the LSTM is a biLSTM, then using the sequence approach returns is 2 x hidden_size dim output (which is correct) wheras using the tup[0][-1] approach will give us only hidden_size dim even for a biLSTM. OP's LSTM is a non-biLSTM so both answers hold true.

PyTorch's CrossEntropyLoss - how to deal with the sequence length dimension with transformers?

I'm training a transformer model for text generation.
let's assume:
vocab size = 100
embbeding size = 50
max sequence length = 30
batch size = 32
loss = cross entropy loss
the last layer in the model is a fully connected layer,
mapping from shape [30, 32, 50] to [30, 32, 100].
the idea is that each of the last 30 sequences in the first dimension, I have a target vector I want to calculate loss with.
the issue is that based on the docs, this loss only excepts 2 dims on the prediction and one on the target - so how can I fit my 3D prediction into it?
(and 2D target?)
Use torch.BCELoss() instead (Binary cross entropy). This expects input and target to be the same size but they can be any size, and should fall within the range [0,1]. It performs cross-entropy loss element-wise.
EDIT: if you expect only one element from the vocab to be output, then you should use CrossEntropyLoss and instead encode your labels as a 1D vector rather than a 2D vector (i.e. do 1-hot decoding). BCE treats each element in the output for a single example as independent from the others, which is not a valid assumption for a multi-class style problem. I originally misread and thought the final output was an embedding, rather than an element from the vocabulary, hence my original suggestion.

How to handle samples with multiple images in a pytorch image processing model?

My model training involves encoding multiple variants of a same image then summing the produced representation over all variants for the image.
The data loader produces tensor batches of the shape: [batch_size,num_variants,1,height,width].
The 1 corresponds to image color channels.
How can I train my model with minibatches in pytorch?
I am looking for a proper way to forward all the batch_size×num_variant images through the network and summing the results over all groups of variants.
My current solution involves flattening the first two dimensions and doing a for loop to sum the representations, but I feel like there should be a better way an d I am not sure the gradients will remember everything.
Not sure I understood you correctly, but I guess this is what you want (say the batched image tensor is called image):
Nb, Nv, inC, inH, inW = image.shape
# treat each variant as if it's an ordinary image in the batch
image = image.reshape(Nb*Nv, inC, inH, inW)
output = model(image)
_, outC, outH, outW = output.shape[1]
# reshapes the output such that dim==1 indicates variants
output = output.reshape(Nb, Nv, outC, outH, outW)
# summing over the variants and lose the dimension of summation, [Nb, outC, outH, outW]
output = output.sum(dim=1, keepdim=False)
I used inC, outC, inH, etc. in case the input and output channels/sizes are different.

Get Keras LSTM output inside Tensorflow code

I'm working with time-variant graph embedding, where at each time step, the adjacency matrix of the graph changes. The main idea is to perform the node embedding of each timestep of the graph by looking to a set of node features and the adjacency matrix. The node embedding step is long and complicated, and is not part of the core of the problem, so I will skip this part. Suffice it to say that I use Graph Convolutional Network to embed the nodes.
Consider that I have a stack of B adjacency matrices A with sizes NxN, where B = batch size and N = number of nodes in the graph. Also, the matrices are stacked according to a time series, where matrix in index i comes before matrix in index i+1. I have already embedded the nodes of the graph, which results in a matrix of dimensions B x N x E, where E = size of the embedding (parameter). Note that the model has to deal with any graph, therefore, N is not a parameter. Another important comment is that each batch contains adjacency matrices from the same graph, and therefore all matrices of a batch have the same number of node, but the matrices of other batches may have different number of nodes.
I now need to pass these embedding through an LSTM cell. I never used Keras before, so I'm having a hard time making the Keras LSTM blend in my Tensorflow code. What I want to do is: pass each node embedding through an LSTM such that the number of timesteps = B and the LSTM batch size = N, that is, the input to my LSTM has the shape [N, B, E], where N and B are only known through execution time. I want the output of my LSTM to have the shape of [B, E*E]. The embedding matrix is called here self.embed_mat. Here is my code:
def _LSTM_layer(self):
with tf.variable_scope(self.scope, reuse=tf.AUTO_REUSE), tf.device(self.device):
in_shape = tf.shape(self.embed_mat)
lstm_input = tf.reshape(self.embed_mat, [in_shape[1], in_shape[0], EMBED_SIZE]) #lstm = [N, B, E]
input_plh = K.placeholder(name="lstm_input", shape=(None, None, EMBED_SIZE))
lstm = LSTM(EMBED_SIZE*EMBED_SIZE, input_shape=(None, None, EMBED_SIZE))
get_output = K.function(inputs=[input_plh], outputs=[lstm(input_plh)])
h = get_output([lstm_input])
I am a bit lost with the K.function part. All I want is the output tensor of the LSTM cell. I've seen that in order to get that with Keras, we need to use K.function, but I don't quite get it what it does. When I call get_output([lstm_input]), I get the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'worker_global/A/shape' with dtype int64 and shape [?]
Here, A is the stacked adjacency matrices with dimension BxNxN. What is going on here? Does the value of N needs to be known during graph building step? I think I made some dumb mistake with the LSTM cell, but I can't get what it is.
Thanks in advance!
If you want to get the output of your LSTM layer "out" given input of "inp" in a keras Sequential() model called "model," where "inp" is your first / input layer and "out" is an LSTM layer that happens to be, for the sake of this example, in the 4th position in your sequential model, you would obtain the output of that LSTM layer from the data you call "lstm_input" above with the following code:
inp = model.layers[0].input
out = model.layers[3].output
inp_to_out = K.function([inp], [out])
output = inp_to_out([lstm_input])

Resources