How to measure execution time of each layer in CNNs - conv-neural-network

In convolutional neural network architectures for image classification (e.g. VGG or AlexNet), I would like to compare the time it takes to compute the result of each layer of the network during a forward pass at test time (preferably using Caffe).
In particular, I am interested in how much time is spent on convolutional layers vs. fully connected layers.

Another way of doing it would be to create two networks: one with only the convolutional layers and one with only the dense (fully connected) layers. Do a forward pass through the convolutional network and measure its time, then pass its output into the fully connected network, do a forward pass, and measure that time.
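If you go through pycaffe, a rough version of this split-timing idea can be done with partial forward passes via Net.forward(start=..., end=...). This is only a sketch: the prototxt/weights paths and the layer names ('conv1', 'pool5', 'fc6', 'prob') are assumptions for an AlexNet/VGG-style deploy net, and on a GPU the timings are only approximate without synchronization.
import time
import numpy as np
import caffe

# Hypothetical paths and layer names -- adjust to your own deploy.prototxt.
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)
net.blobs['data'].data[...] = np.random.rand(*net.blobs['data'].data.shape)

start = time.perf_counter()
net.forward(start='conv1', end='pool5')   # convolutional block only
conv_time = time.perf_counter() - start

start = time.perf_counter()
net.forward(start='fc6', end='prob')      # fully connected block only
fc_time = time.perf_counter() - start

print('conv: %.2f ms, fc: %.2f ms' % (conv_time * 1e3, fc_time * 1e3))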

So, what is your problem? Did caffe time not work?

What about using the caffe::Timer class in net.cpp just for your test, like this:
#include "caffe/util/benchmark.hpp"  // for caffe::Timer

template <typename Dtype>
Dtype Net<Dtype>::ForwardFromTo(int start, int end) {
  ...  // some original contents
  Timer timer;
  for (int i = start; i <= end; ++i) {
    ...  // some original contents
    string layer_name = layers_[i]->layer_param().name();  // get layer name
    timer.Start();
    Dtype layer_loss = layers_[i]->Forward(bottom_vecs_[i], top_vecs_[i]);
    float forward_time = timer.MicroSeconds();  // stops the timer, returns elapsed microseconds
    LOG(ERROR) << layer_name << " consumes: " << forward_time
               << " microseconds during forward.";
    ...
  }
  return loss;
}

Every Caffe layer has a forward and a backward function in the src/caffe/layers directory; for example, src/caffe/layers/pooling_layer.cpp is the CPU implementation and src/caffe/layers/pooling_layer.cu is the GPU implementation. So you need to add timing code to the forward function in the .cpp or .cu file, depending on whether you are using the CPU or the GPU.
Or, the simplest way: use the caffe time command, which reports the forward and backward time of each layer.
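If you prefer to stay in Python rather than patch net.cpp, a rough per-layer breakdown can also be obtained from pycaffe by timing each layer's forward individually. Again only a sketch: the paths are placeholders, and GPU timings measured this way are approximate without synchronization.
import time
import caffe

net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)  # placeholder paths
net.forward()  # warm-up pass so all blobs are allocated

for name in list(net._layer_names):
    start = time.perf_counter()
    net.forward(start=name, end=name)  # run just this layer
    elapsed = (time.perf_counter() - start) * 1e6
    print('%-20s %10.1f us' % (name, elapsed))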

Related

PyTorch's Transformer decoder accuracy fluctuation

I have a sequence-to-sequence POS tagging model which uses a Transformer decoder to generate target tokens.
My implementation of PyTorch's Transformer decoder is as follows.
In the initialization:
self.decoder_layer = nn.TransformerDecoderLayer(d_model=ENV_HIDDEN_SIZE, nhead=2,batch_first=True,dim_feedforward=300 ,activation="relu")
self.transformer_decoder = nn.TransformerDecoder(self.decoder_layer, num_layers=2)
And in the forward function:
if infer == False:  # for training
    embedded = embedded * math.sqrt(ENV_HIDDEN_SIZE)
    embedded = self.pos_encoder(embedded)
    zol = self.transformer_decoder(tgt=embedded, memory=newtensor,
                                   memory_mask=self.transformer_mask,
                                   memory_key_padding_mask=x_mask,
                                   tgt_mask=self.transformer_mask)
    scores = self.slot_trans(self.dropout3(zol))
else:  # for inference
    bos = Variable(torch.LongTensor([[tag2index['<BOS>']] * batch_size])).cuda().transpose(1, 0)
    bos = self.embedding(bos)
    tokens = bos
    for i in range(length):
        temp_embedded = tokens * math.sqrt(ENV_HIDDEN_SIZE)
        temp_embedded = self.pos_encoder(temp_embedded)
        zol = self.transformer_decoder(tgt=temp_embedded,
                                       memory=newtensor,
                                       tgt_mask=self.transformer_mask[:i+1, :i+1],
                                       memory_key_padding_mask=x_mask,
                                       memory_mask=self.transformer_mask[:i+1, :])
        scores = self.slot_trans(self.dropout3(zol))
        softmaxed = self.softmax(scores)
        _, input = torch.max(softmaxed, 2)
        newtok = self.embedding(input)
        tokens = torch.cat((bos, newtok), dim=1)
The memory_mask is generated by the function "generate_square_subsequent_mask", given below:
def generate_square_subsequent_mask(sz: int):
    """Generates an upper-triangular matrix of -inf, with zeros on diag."""
    return torch.triu(torch.ones(sz, sz) * float('-inf'), diagonal=1)
I am observing something weird. If I do not feed the memory_mask generated by generate_square_subsequent_mask (which I should not, according to this post), the accuracy severely decreases. Furthermore, the model's accuracy fluctuates randomly between 50% and 90% from epoch to epoch on the test set, but not on the training set.
If I do feed the memory_mask, everything is fine and the model's accuracy steadily increases to 95% on the test set. Moreover, the final accuracy takes a hit when not feeding the memory_mask.
Things I tried:
Without memory_mask: Tuning the learning rate.
Without memory_mask: Increasing the nhead and num_layers.
Using a simple linear layer.
As an end note, using a simple linear layer instead of the transformer decoder provides better accuracy. Any ideas as to why this is happening?

TF Keras Custom Layer accuracy drop with element-wise operations

I'm writing a custom layer for a TF Keras application. This layer should be able to perform a 2D convolution with additional masking information.
The layer is quite simple (omitting the init and compute_output_shape functions):
def build(self, input_shape):
    ks = self.kernel_size + (int(input_shape[0][-1]), self.filters)
    self.kernel = self.add_weight(name='kernel', shape=ks)
    self.ones = self.add_weight(name='ones', shape=ks,
                                trainable=False, initializer=initializers.get('ones'))
    self.bias = self.add_weight(name='bias', shape=(self.filters,))

def call(self, x):
    img, msk = x
    # img = tf.multiply(img, msk)
    img = tf.nn.convolution(img, self.kernel)
    msk = tf.nn.convolution(msk, self.ones)
    # img = tf.divide(img, msk)
    img = bias_add(img, self.bias)
    return [img, msk]
The problem lies within those two commented-out lines. They should just perform a simple element-wise multiplication and division. If they are commented out, everything works fine. If I comment just one of them back in, the accuracy of my model drops by roughly a factor of 2-3.
For testing, I simply used a mask of ones. That should have no influence on the output of this layer or its performance (in terms of accuracy).
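As a quick sanity check of that claim (a minimal sketch, written with TF 2 eager semantics for brevity rather than the 1.12/1.13 graph mode used above), multiplying by a mask of ones is indeed an exact no-op on the image tensor:
import numpy as np
import tensorflow as tf

img = tf.constant(np.random.rand(1, 8, 8, 3), dtype=tf.float32)
msk = tf.ones_like(img)

masked = tf.multiply(img, msk)  # element-wise multiply by a mask of ones
print(bool(tf.reduce_all(tf.equal(img, masked))))  # True: values are identical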
I tried this with the current version of TF (r1.12), the current nightly (r1.13), and the 2.0 preview. I also tried replacing the troublesome lines with, e.g., Keras Lambda layers and Keras Multiply layers.
This might or might not be related to this problem:
Custom TF-Keras Layer performs worse than built-in layer
Mathematically, the element-wise operations shouldn't have an impact (as long as the mask consists only of ones).
The element-wise operations also shouldn't have an impact on the performance of this layer, since they influence neither the weights nor the data.
I don't know why this happens and hope some of you have an idea.
EDIT: Added kernel initializer, which I forgot before

How to correctly implement a batch-input LSTM network in PyTorch?

This release of PyTorch seems to provide PackedSequence for variable-length inputs to recurrent neural networks. However, I found it a bit hard to use correctly.
Using pad_packed_sequence to recover the output of an RNN layer that was fed by pack_padded_sequence, we get a T x B x N tensor outputs, where T is the max number of time steps, B is the batch size and N is the hidden size. I found that for short sequences in the batch, the subsequent outputs are all zeros.
Here are my questions.
For a single-output task where one needs the last output of each sequence, a simple outputs[-1] will give a wrong result, since this tensor contains lots of zeros for short sequences. One would need to construct indices from the sequence lengths to fetch the individual last output of each sequence. Is there a simpler way to do that?
For a multiple-output task (e.g. seq2seq), one usually adds a linear layer N x O, reshapes the batch outputs T x B x O into TB x O, and computes the cross-entropy loss with the true targets TB (usually integers, as in a language model). In this situation, do the zeros in the batch output matter?
Question 1 - Last Timestep
This is the code that I use to get the output of the last timestep. I don't know if there is a simpler solution; if there is, I'd like to know it. I followed this discussion and grabbed the relevant code snippet for my last_timestep method. This is my forward:
class BaselineRNN(nn.Module):
    def __init__(self, **kwargs):
        ...

    def last_timestep(self, unpacked, lengths):
        # Index of the last output for each sequence.
        idx = (lengths - 1).view(-1, 1).expand(unpacked.size(0),
                                               unpacked.size(2)).unsqueeze(1)
        return unpacked.gather(1, idx).squeeze()

    def forward(self, x, lengths):
        embs = self.embedding(x)
        # pack the batch
        packed = pack_padded_sequence(embs, list(lengths.data),
                                      batch_first=True)
        out_packed, (h, c) = self.rnn(packed)
        out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)
        # get the outputs from the last *non-masked* timestep for each sentence
        last_outputs = self.last_timestep(out_unpacked, lengths)
        # project to the classes using a linear layer
        logits = self.linear(last_outputs)
        return logits
Question 2 - Masked Cross Entropy Loss
Yes, by default the zero-padded timesteps (targets) matter. However, it is very easy to mask them. You have two options, depending on the version of PyTorch that you use.
PyTorch 0.2.0: PyTorch now supports masking directly in CrossEntropyLoss, with the ignore_index argument. For example, in language modeling or seq2seq, where I add zero padding, I mask the zero-padded words (targets) simply like this:
loss_function = nn.CrossEntropyLoss(ignore_index=0)
PyTorch 0.1.12 and older: In the older versions of PyTorch, masking was not supported, so you had to implement your own workaround. The solution that I used was masked_cross_entropy.py by jihunchoi. You may also be interested in this discussion.
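A minimal, self-contained sketch of the ignore_index approach above (shapes and values are made up for illustration):
import torch
import torch.nn as nn

# Flattened logits of shape (T*B, O) and integer targets of shape (T*B,),
# where 0 is the padding index that should not contribute to the loss.
logits = torch.randn(6, 5)
targets = torch.tensor([3, 1, 0, 2, 0, 4])  # the zeros are padding

loss_fn = nn.CrossEntropyLoss(ignore_index=0)
loss = loss_fn(logits, targets)  # padded positions are masked out of the loss
print(loss.item())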
A few days ago, I found this method which uses indexing to accomplish the same task with a one-liner.
I have my dataset batch first ([batch size, sequence length, features]), so for me:
unpacked_out = unpacked_out[np.arange(unpacked_out.shape[0]), lengths - 1, :]
where unpacked_out is the output of torch.nn.utils.rnn.pad_packed_sequence.
I have compared it with the method described here, which looks similar to the last_timestep() method Christos Baziotis is using above (also recommended here), and the results are the same in my case.
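A toy check of that one-liner, with made-up shapes and a batch-first padded output:
import numpy as np
import torch

# Fake padded RNN output: batch of 3 sequences, max length 4, hidden size 2.
unpacked_out = torch.arange(3 * 4 * 2, dtype=torch.float32).view(3, 4, 2)
lengths = np.array([4, 2, 3])

# Pick the last valid timestep of each sequence.
last = unpacked_out[np.arange(unpacked_out.shape[0]), lengths - 1, :]
print(last.shape)  # torch.Size([3, 2]) -- one "last" vector per sequence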

How to initialize weights when using RELU activation function

I want to make a conv network and I wish to use the ReLU activation function. Can someone please give me a clue about the correct way to initialize the weights? (I'm using Theano.)
Thanks
I'm not sure there is a hard and fast best way to initialize weights and bias for a ReLU layer.
Some claim that (a slightly modified version of) Xavier initialization works well with ReLUs. Others claim that small Gaussian random weights plus bias = 1 work well (ensuring that the weighted sum of positive inputs remains positive and thus does not end up in the ReLU's zero region).
In Theano, these can be achieved like this (assuming weights post-multiply the input):
w = theano.shared((numpy.random.randn(in_size, out_size) * 0.1).astype(theano.config.floatX))
b = theano.shared(numpy.ones(out_size, dtype=theano.config.floatX))
or
w = theano.shared((numpy.random.randn(in_size, out_size) * numpy.sqrt(2.0 / (in_size + out_size))).astype(theano.config.floatX))
b = theano.shared(numpy.zeros(out_size, dtype=theano.config.floatX))
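A minimal usage sketch, under the assumption (stated above) that the weights post-multiply the input; the layer sizes here are made up:
import numpy
import theano
import theano.tensor as tt

# Illustrative layer sizes.
in_size, out_size = 784, 256

# Small Gaussian weights, bias = 1 (the first option above).
w = theano.shared((numpy.random.randn(in_size, out_size) * 0.1).astype(theano.config.floatX))
b = theano.shared(numpy.ones(out_size, dtype=theano.config.floatX))

x = tt.matrix('x')                      # (batch, in_size)
y = tt.nnet.relu(tt.dot(x, w) + b)      # affine transform followed by ReLU
f = theano.function([x], y)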

Purpose of 'givens' variables in Theano.function

I was reading the logistic regression code given at http://deeplearning.net/tutorial/logreg.html. I am confused about the difference between the inputs and givens variables of a function. The functions that compute the mistakes made by a model on a minibatch are:
test_model = theano.function(inputs=[index],
                             outputs=classifier.errors(y),
                             givens={
                                 x: test_set_x[index * batch_size:(index + 1) * batch_size],
                                 y: test_set_y[index * batch_size:(index + 1) * batch_size]})

validate_model = theano.function(inputs=[index],
                                 outputs=classifier.errors(y),
                                 givens={
                                     x: valid_set_x[index * batch_size:(index + 1) * batch_size],
                                     y: valid_set_y[index * batch_size:(index + 1) * batch_size]})
Why couldn't/wouldn't one just make x and y shared input variables and let them be defined when an actual model instance is created?
The givens parameter allows you to separate the description of the model from the exact definition of the input variables. This is a consequence of what the givens parameter does: it modifies the graph before compiling it. In other words, in the graph we substitute the keys in givens with their associated values.
In the deep learning tutorial, we use normal Theano variables to build the model, and we use givens to speed up GPU execution. If we keep the dataset on the CPU, we transfer a mini-batch to the GPU at each function call; as we do many iterations over the dataset, we end up transferring it multiple times to the GPU. Since the dataset is small enough to fit on the GPU, we put it in a shared variable so that it is transferred to the GPU if one is available (or stays on the CPU if the GPU is disabled). Then, when compiling the function, we swap the input with a slice corresponding to the mini-batch of the dataset to use, so the input of the Theano function is just the index of the mini-batch we want to use.
I don't think anything is stopping you from doing it that way (I didn't try the updates= dictionary with an input variable directly, but why not). Note, however, that to push data to a GPU in a useful manner you will need it to be in a shared variable (from which x and y are taken in this example).
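A toy illustration of the substitution (self-contained; the data here is made up): x never appears in the compiled function's inputs, because givens replaces it with a slice of the shared dataset selected by index.
import numpy
import theano
import theano.tensor as tt

batch_size = 2
data = theano.shared(numpy.arange(12, dtype=theano.config.floatX).reshape(6, 2))

x = tt.matrix('x')           # symbolic placeholder used to build the graph
index = tt.lscalar('index')  # the only real input of the compiled function

# At compile time, x is replaced by a slice of the shared dataset.
f = theano.function(inputs=[index],
                    outputs=x.sum(),
                    givens={x: data[index * batch_size:(index + 1) * batch_size]})

print(f(0))  # sums rows 0-1 of the dataset, no data passed in explicitly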
