Config change for a pre-trained transformer model - pytorch

I am trying to implement a classification head for the Reformer transformer. The classification head works fine, but when I try to change one of the config parameters, config.axial_pos_shape (i.e. the sequence length parameter of the model), it throws an error:
size mismatch for reformer.embeddings.position_embeddings.weights.0: copying a param with shape torch.Size([512, 1, 64]) from checkpoint, the shape in current model is torch.Size([64, 1, 64]).
size mismatch for reformer.embeddings.position_embeddings.weights.1: copying a param with shape torch.Size([1, 1024, 192]) from checkpoint, the shape in current model is torch.Size([1, 128, 192]).
The config:
{
  "architectures": [
    "ReformerForSequenceClassification"
  ],
  "attention_head_size": 64,
  "attention_probs_dropout_prob": 0.1,
  "attn_layers": [
    "local",
    "lsh",
    "local",
    "lsh",
    "local",
    "lsh"
  ],
  "axial_norm_std": 1.0,
  "axial_pos_embds": true,
  "axial_pos_embds_dim": [
    64,
    192
  ],
  "axial_pos_shape": [
    64,
    256
  ],
  "chunk_size_feed_forward": 0,
  "chunk_size_lm_head": 0,
  "eos_token_id": 2,
  "feed_forward_size": 512,
  "hash_seed": null,
  "hidden_act": "relu",
  "hidden_dropout_prob": 0.05,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": true,
  "layer_norm_eps": 1e-12,
  "local_attention_probs_dropout_prob": 0.05,
  "local_attn_chunk_length": 64,
  "local_num_chunks_after": 0,
  "local_num_chunks_before": 1,
  "lsh_attention_probs_dropout_prob": 0.0,
  "lsh_attn_chunk_length": 64,
  "lsh_num_chunks_after": 0,
  "lsh_num_chunks_before": 1,
  "max_position_embeddings": 8192,
  "model_type": "reformer",
  "num_attention_heads": 2,
  "num_buckets": [
    64,
    128
  ],
  "num_chunks_after": 0,
  "num_chunks_before": 1,
  "num_hashes": 1,
  "num_hidden_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 100
    }
  },
  "vocab_size": 320
}
Python Code:
import torch
from transformers import ReformerConfig, ReformerForSequenceClassification

config = ReformerConfig()
config.max_position_embeddings = 8192
config.axial_pos_shape = [64, 128]
# config = ReformerConfig.from_pretrained('./cnp/config.json', output_attention=True)
model = ReformerForSequenceClassification(config)
model.load_state_dict(torch.load("./cnp/pytorch_model.bin"))

I ran into the same issue when trying to halve the default max sequence length of 65536 (128 * 512) used in Reformer pre-training.
As @cronoik mentioned, you must:
1. load the pretrained Reformer,
2. resize it to your needs by dropping the unnecessary weights,
3. save this new model,
4. load this new model to perform your desired tasks.
Those unnecessary weights are the ones from the Position Embeddings layer. The Reformer model uses the Axial Position Encodings strategy to learn its position embeddings (rather than storing one full-size embedding vector per position, as BERT does). Axial Position Encodings store the position embeddings in a memory-efficient manner, using two small tensors rather than one big one.
However, the idea of position embeddings remains exactly the same: obtain a different embedding for each position.
That said, in theory (correct me if I am misunderstanding something), removing the last position embeddings to match your custom max sequence length should not hurt performance. You can refer to this post from HuggingFace for a more detailed description of Axial Position Encodings and to understand where to truncate your position embeddings tensor.
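To see why this is memory-efficient, here is a minimal standalone sketch, not from the original answer, that mirrors (in simplified form) how the two small tensors combine into one embedding per position; the shapes are the ones from the config above after the change in the question's code (axial_pos_shape = [64, 128], axial_pos_embds_dim = [64, 192], hidden_size = 256):

import torch

d1, d2 = 64, 128      # axial_pos_shape -> covers d1 * d2 = 8192 positions
dim1, dim2 = 64, 192  # axial_pos_embds_dim -> dim1 + dim2 = hidden_size = 256

# The two small parameter tensors (same layout as weights.0 and weights.1 in the error message).
w0 = torch.randn(d1, 1, dim1)   # (64, 1, 64)
w1 = torch.randn(1, d2, dim2)   # (1, 128, 192)

# Each tensor is broadcast over the axis it does not cover, then the features are concatenated.
full = torch.cat([w0.expand(d1, d2, dim1), w1.expand(d1, d2, dim2)], dim=-1)
print(full.reshape(d1 * d2, dim1 + dim2).shape)  # torch.Size([8192, 256]): one 256-dim embedding per position

Storing 64*64 + 128*192 = 28,672 position-embedding parameters instead of 8192*256 ≈ 2.1 million is the whole point of the factorisation.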
I have managed to resize and use Reformer with a custom max length of 32768 (128 * 256) with the following code:
import torch
from transformers import ReformerForSequenceClassification

# Load the initial pretrained model
model = ReformerForSequenceClassification.from_pretrained('google/reformer-enwik8', num_labels=2)

# Reshape the Axial Position Embeddings layer to match the desired max seq length
# (truncate the second axial tensor to 256 positions)
model.reformer.embeddings.position_embeddings.weights[1] = torch.nn.Parameter(
    model.reformer.embeddings.position_embeddings.weights[1][0][:256]
)

# Update the config to match the custom max seq length
model.config.axial_pos_shape = 128, 256
model.config.max_position_embeddings = 128 * 256  # 32768

# Save the model with the custom max length
output_model_path = "path/to/model"
model.save_pretrained(output_model_path)
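For completeness, the last step of the procedure above (loading the resized model and using it) could look like the sketch below. This is only an illustration under the assumption that the checkpoint saved above matches the updated config; the dummy input is not part of the original answer and merely checks that longer sequences are now accepted.

import torch
from transformers import ReformerForSequenceClassification

output_model_path = "path/to/model"  # the path used with save_pretrained() above

# Reload the resized checkpoint; its config now reports axial_pos_shape = (128, 256).
model = ReformerForSequenceClassification.from_pretrained(output_model_path)
model.eval()

# Dummy batch of token ids; any length up to the new maximum of 32768 should now work.
# (google/reformer-enwik8 is character-level, so real inputs would be encoded characters.)
input_ids = torch.randint(0, model.config.vocab_size, (1, 4096))
with torch.no_grad():
    logits = model(input_ids).logits
print(logits.shape)  # torch.Size([1, 2])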

Related

PyTorch high-dimensional tensor through linear layer

I have a tensor of size (32, 128, 50) in PyTorch. These are 50-dim word embeddings with a batch size of 32. That is, the three indices in my size correspond to number of batches, maximum sequence length (with 'pad' token), and the size of each embedding. Now, I want to pass this through a linear layer to get an output of size (32, 128, 1). That is, for every word embedding in every sequence, I want to make it one dimensional. I tried adding a linear layer to my network going from 50 to 1 dimension, and my output tensor is of the desired shape. So I think this works, but I would like to understand how PyTorch deals with this issue, since I did not explicitly tell it which dimension to apply the linear layer to. I played around with this and found that:
If I input a tensor of shape (32, 50, 50) -- thus creating ambiguity by having two dimensions along which the linear layer could be applied (two 50s) -- it only applies it to the last dim and gives an output tensor of shape (32, 50, 1).
If I input a tensor of shape (32, 50, 128) it does NOT output a tensor of shape (32, 1, 128), but rather gives me an error.
This suggests that a linear layer in PyTorch applies the transformation to the last dimension of your tensor. Is that the case?
In the nn.Linear docs, it is specified that the input of this module can be any tensor of size (*, H_in) and the output will be a tensor of size (*, H_out), where:
* means any number of dimensions
H_in is the number of in_features
H_out is the number of out_features
To make this concrete: a tensor of size (n, m, 50) can be processed by a Linear module with in_features=50, while a tensor of size (n, 50, m) would need a Linear module with in_features=m (in your case 128).
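A short runnable sketch of this behaviour, using the shapes from the question:

import torch
import torch.nn as nn

# Word embeddings: batch of 32 sequences, 128 tokens each, 50-dim embeddings.
x = torch.randn(32, 128, 50)

# Linear is applied to the last dimension only; all leading dims are treated as batch dims.
linear = nn.Linear(in_features=50, out_features=1)
print(linear(x).shape)  # torch.Size([32, 128, 1])

# A (32, 50, 128) tensor raises a shape error, because its last dim (128) != in_features (50).
y = torch.randn(32, 50, 128)
try:
    linear(y)
except RuntimeError as e:
    print("shape mismatch:", e)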

How to perform Batch inferencing with RoBERTa ONNX quantized model?

I have converted a RoBERTa PyTorch model to an ONNX model and quantized it. I am able to get scores from the ONNX model for a single input data point (one sentence at a time). I want to understand how to get batch predictions using an ONNX Runtime inference session by passing multiple inputs to the session. Below is an example scenario.
Model: roberta-quant.onnx, an ONNX quantized version of the RoBERTa PyTorch model
Code used to convert RoBERTa to ONNX:
torch.onnx.export(model,
                  args=tuple(inputs.values()),      # model input
                  f=export_model_path,              # where to save the model
                  opset_version=11,                 # the ONNX version to export the model to
                  do_constant_folding=True,         # whether to execute constant folding for optimization
                  input_names=['input_ids',         # the model's input names
                               'attention_mask'],
                  output_names=['output_0'],        # the model's output names
                  dynamic_axes={'input_ids': symbolic_names,      # variable length axes
                                'attention_mask': symbolic_names,
                                'output_0': {0: 'batch_size'}})
Input sample to ONNXRuntime inference session:
{
    'input_ids': array([[ 0, 510, 35, 21071, ....., 1, 1, 1, 1, 1, 1]]),
    'attention_mask': array([[1, 1, 1, 1, ......., 0, 0, 0, 0, 0, 0]])
}
Running the ONNX model for 400 data samples (sentences) using an ONNX Runtime inference session:
session = onnxruntime.InferenceSession("roberta_quantized.onnx", providers=['CPUExecutionProvider'])

for i in range(400):
    ort_inputs = {
        'input_ids': input_ids[i].cpu().reshape(1, max_seq_length).numpy(),  # max_seq_length=128 here
        'input_mask': attention_masks[i].cpu().reshape(1, max_seq_length).numpy()
    }
    ort_outputs = session.run(None, ort_inputs)
In the above code I am looping through the 400 sentences sequentially to get the scores in ort_outputs. Please help me understand how I can perform batch processing here with the ONNX model, i.e. send the input_ids and attention_masks for multiple sentences at once and get the scores for all of them in ort_outputs.
Thanks in advance!
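For orientation, here is a minimal sketch of what batched scoring could look like. It assumes the exported graph really has a dynamic batch axis on every input (as the dynamic_axes argument above suggests), reuses the question's input_ids, attention_masks and max_seq_length variables, and feeds the input names declared at export time ('input_ids' and 'attention_mask'); it is a sketch, not a tested answer.

import numpy as np
import onnxruntime

session = onnxruntime.InferenceSession("roberta_quantized.onnx", providers=['CPUExecutionProvider'])

batch_size = 16
all_scores = []
for start in range(0, 400, batch_size):
    end = min(start + batch_size, 400)
    # Stack several (1, max_seq_length) rows into one (batch, max_seq_length) array.
    ort_inputs = {
        'input_ids': np.vstack([input_ids[j].cpu().reshape(1, max_seq_length).numpy() for j in range(start, end)]),
        'attention_mask': np.vstack([attention_masks[j].cpu().reshape(1, max_seq_length).numpy() for j in range(start, end)]),
    }
    # One run returns scores for the whole batch, shape (batch, num_labels).
    scores = session.run(None, ort_inputs)[0]
    all_scores.append(scores)

all_scores = np.concatenate(all_scores, axis=0)  # (400, num_labels)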

Keras layer shape in plot_model()

I am using Keras to make a CNN, and I want to visualize the model with plot_model().
When I look at the shape of the Conv2d layers, there is a thing that I can't figure out.
Let's say my Conv2D layer has kernel size [8 x 8], stride [4 x 4], padding 'same', and I want 16 feature maps.
The input shape to this layer is [None, 3, 160, 320] and the output is [None, 1, 40, 16].
'None' is the number of samples, but what are 1 and 40? I guess 16 is the number of feature maps?
Since I used padding='same', shouldn't the output have the same width and height as the input, or does 'same' not mean that?
Thanks!
Well, since you're using strides, you'll never get the same shape.
Your convolutional filter (which can be seen as a sliding window) jumps four pixels at each step of its sliding.
As a result, with padding='same' each spatial dimension of the output is the corresponding input dimension divided by the stride of 4, rounded up:
3 / 4 rounded up = 1
160 / 4 = 40
16 is the number of feature maps, indeed.
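The arithmetic can be checked with a small sketch; this is written against tf.keras (an assumption on my part, with the default channels_last data format), so the shapes are interpreted exactly as in the question:

import tensorflow as tf

# Same layer as in the question: 16 filters, 8x8 kernel, 4x4 strides, padding='same'.
inputs = tf.keras.Input(shape=(3, 160, 320))  # interpreted as (height=3, width=160, channels=320)
outputs = tf.keras.layers.Conv2D(16, kernel_size=(8, 8), strides=(4, 4), padding='same')(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()  # output shape: (None, 1, 40, 16) -> ceil(3/4)=1, ceil(160/4)=40, 16 feature maps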

InvalidArgumentError: logits and labels must have the same first dimension seq2seq Tensorflow

I am getting this error in seq2seq.sequence_loss even though the first dimensions of logits and labels are the same, i.e. batchSize.
I have created a seq2seq model in TF 1.0. My loss function is as follows:
logits = self.decoder_logits_train
targets = self.decoder_train_targets
self.loss = seq2seq.sequence_loss(logits=logits, targets=targets, weights=self.loss_weights)
self.train_op = tf.train.AdamOptimizer().minimize(self.loss)
I get the following error when running my network during training:
InvalidArgumentError (see above for traceback): logits and labels must have the same first dimension, got logits shape [1280,150000] and labels shape [1536]
[[Node: sequence_loss/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](sequence_loss/Reshape, sequence_loss/Reshape_1)]]
I confirmed the shapes of the logits and targets tensors as follows:
a,b = sess.run([model.decoder_logits_train, model.decoder_train_targets], feed_dict)
print(np.shape(a)) # (128, 10, 150000) which is (BatchSize, MaxSeqSize, Vocabsize)
print(np.shape(b)) # (128, 12) which is (BatchSize, Max length of seq including padding)
So, since the first dimensions of targets and logits are the same, why am I getting this error?
Interestingly, in the error you can observe that the shape of logits is reported as (1280, 150000), which is (128 * 10, 150000), i.e. the product of the first two dimensions times vocab_size; likewise for targets, (1536) is 128 * 12, again the product of the first two dimensions.
Note: TensorFlow 1.0, CPU version
Maybe your padding is wrong. If you pad _EOS to the end of the target sequence, then max_length (the real length of the target sentence) should increase by 1, giving [batch, max_len + 1]. Since you padded both _GO and _EOS, your target sentence length increases by 2, which makes it equal 12.
I read some other people's implementations of NMT; they only pad _EOS to the target sentence, and _GO only to the decoder input. Tell me if I'm wrong.
I had the same error as you and I understood the problem:
The problem:
You run the decoder with these parameters:
targets are the decoder_inputs. They have length max_length because of padding. Shape: [batch_size, max_length]
sequence_length are the non-padded-lengths of all the targets of your current batch. Shape: [batch_size]
Your logits, which are the output of tf.contrib.seq2seq.dynamic_decode, have shape:
[batch_size, longer_sequence_in_this_batch, n_classes]
Where longer_sequence_in_this_batch is equal to tf.reduce_max(sequence_length)
So, you have a problem when computing the loss because you try to use both:
Your logits with 1st dimension shape longer_sequence_in_this_batch
Your targets with 1st dimension shape max_length
Note that longer_sequence_in_this_batch <= max_length
How to fix it:
You can simply apply some padding to your logits.
logits = self.decoder_logits_train
targets = self.decoder_train_targets
paddings = [[0, 0], [0, max_length - tf.shape(logits)[1]], [0, 0]]
padded_logits = tf.pad(logits, paddings, 'CONSTANT', constant_values=0)
self.loss = seq2seq.sequence_loss(logits=padded_logits, targets=targets,
                                  weights=self.loss_weights)
Using this method, you ensure that your logits are padded like the targets and have dimension [batch_size, max_length, n_classes].
For more information about the pad function, visit
Tensorflow's documentation
The error message seems to be a bit misleading, as you actually need the first and second dimensions to be the same. This is written here:
logits: A Tensor of shape [batch_size, sequence_length, num_decoder_symbols] and dtype float. The logits correspond to the prediction across all classes at each timestep.
targets: A Tensor of shape [batch_size, sequence_length] and dtype int. The target represents the true class at each timestep.
This also makes sense: the logits are per-timestep score vectors over all classes, while the targets give the true class at each timestep, so they need to cover the same number of timesteps.
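For reference, a minimal standalone TF 1.x sketch (with made-up sizes, not the question's model) in which sequence_loss accepts the tensors because the first two dimensions of logits and targets match:

import numpy as np
import tensorflow as tf  # TF 1.x, as in the question
from tensorflow.contrib import seq2seq

batch_size, max_len, vocab_size = 4, 12, 1000

logits = tf.placeholder(tf.float32, [batch_size, max_len, vocab_size])
targets = tf.placeholder(tf.int32, [batch_size, max_len])  # same first TWO dims as logits
weights = tf.ones([batch_size, max_len])

loss = seq2seq.sequence_loss(logits=logits, targets=targets, weights=weights)

with tf.Session() as sess:
    feed = {
        logits: np.random.randn(batch_size, max_len, vocab_size),
        targets: np.random.randint(0, vocab_size, (batch_size, max_len)),
    }
    print(sess.run(loss, feed))  # a scalar loss, no shape error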

Tensorflow: why softmax outputs [1, 0, 0..., 0]

I have a neural net model; its last layer is a fully connected layer with 9 output neurons.
To train my network correctly, I'm using softmax_cross_entropy_with_logits.
It trains fine, but when I evaluate my model, I also want probabilities.
So I take an evaluation sample and feed it to the network.
After that I apply softmax to the output and get
[[ 0. 0. 0. 0. 0. 1. 0. 0. 0.]]
Here are the unnormalized probabilities (logits) as well:
[[ -2710.10620117 -2914.37866211 -5045.04443359 -4361.91601562
-459.57000732 8843.65820312 -1871.62756348 5447.12451172
-10947.22949219]]
I'm getting a probability of 1 for one class and zeros for all the rest.
Could anyone please help to handle this issue?
EDIT:
Input images are of shape 64 * 160.
All activation functions are relu.
Max poolings are 2x2.
In conv_plus_max_pool_layer(x_image, 5, 1, 96), 5 is the kernel size.
Here is network layout:
hidden_block_1 = conv_plus_max_pool_layer(x_image, 5, 1, 96)
hidden_block_2 = conv_plus_max_pool_layer(hidden_block_1, 5, 96, 256)
hidden_block_3 = conv_plus_max_pool_layer(hidden_block_2, 3, 256, 384)
hidden_block_4 = conv_plus_max_pool_layer(hidden_block_3, 3, 384, 512)
fc1 = dropout_plus_fc(4 * 10 * 512, 512, hidden_block_4, keep_prob_drop1)
output = dropout_plus_fc(512, model_net10_train.class_num, fc1, keep_prob_drop2)
Looks like your network is pretty sure about the output ;)
In this case, I don't think we can do a lot for you without your network layout... Some gut feelings from my side: the layer leading up to your output layer has too many nodes (thus giving you these huge numbers), and I suspect that you don't use nonlinearities such as ReLU or tanh. Other things you might want to check are the initial values of the weights (they might be too big) and the learning rate you are using (it might be too high).
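To see why logits of this magnitude always produce a one-hot-looking softmax output, here is a small standalone NumPy sketch using the numbers from the question:

import numpy as np

def softmax(z):
    z = z - np.max(z)  # numerical stabilisation
    e = np.exp(z)
    return e / e.sum()

# Logits from the question: the largest one dominates the others by thousands.
logits = np.array([-2710.1, -2914.4, -5045.0, -4361.9, -459.6, 8843.7, -1871.6, 5447.1, -10947.2])
print(softmax(logits))           # [0, 0, 0, 0, 0, 1, 0, 0, 0] up to float precision

# With logits on a smaller scale, softmax spreads the probability mass instead of saturating.
print(softmax(logits / 1000.0))  # no longer a hard one-hot vector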

Resources