How to change parameters of pre-trained longformer model from huggingface - python-3.x

I am using the Hugging Face pre-trained LongformerModel to extract sentence embeddings. I want to change the token length / maximum sentence length parameters, but I am not able to do so. Here is the code.
import torch
from transformers import LongformerModel, LongformerTokenizer

model = LongformerModel.from_pretrained('allenai/longformer-base-4096', output_hidden_states=True)
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model.eval()
text = [" I like to play cricket"]
input_ids = torch.tensor(tokenizer.encode(text, max_length=20, padding=True, add_special_tokens=True)).unsqueeze(0)
print(tokenizer.encode(text, max_length=20, padding=True, add_special_tokens=True))
# [0, 38, 101, 7, 310, 5630, 2]
I expected the encoder to give me a list of size 20 with padding, since I passed max_length=20, but it returned a list of size 7 only. Why?
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)
attention_mask[:, [0,-1]] = 2
outputs = model(input_ids, attention_mask=attention_mask, return_dict=True)
hidden_states = outputs[2]
print ("Number of layers:", len(hidden_states), " (initial embeddings + 12 BERT layers)")
layer_i = 0
print ("Number of batches:", len(hidden_states[layer_i]))
batch_i = 0
print ("Number of tokens:", len(hidden_states[layer_i][batch_i]))
token_i = 0
print ("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))
Output:
Number of layers: 13 (initial embeddings + 12 BERT layers)
Number of batches: 1
Number of tokens: 512 # How can I change this parameter to pick up my sentence length during run-time
Number of hidden units: 768
How can I reduce the number of tokens to the sentence length instead of 512? Every time I input a new sentence, it should pick up that length.

Question regarding padding
padding=True pads your input to the longest sequence in the batch. padding='max_length' pads your input to the specified max_length (see the documentation):
from transformers import LongformerTokenizer
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
text=[" I like to play cricket"]
print(tokenizer.encode(text[0],max_length=20,padding='max_length',add_special_tokens=True))
Output:
[0, 38, 101, 7, 310, 5630, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Question regarding the number of tokens of the hidden states
The Longformer implementation applies padding to your sequence to match the attention window sizes. You can see the size of the attention windows in your model config:
model.config.attention_window
Output:
[512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512]
This is the corresponding code line: link.
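If you only need embeddings for the tokens you actually passed in, one option (a minimal sketch, not part of the original answer) is to slice each hidden-state tensor back to the length of input_ids; the extra positions were added internally only to fill the attention window:
# Sketch: drop the window padding, keeping only positions for real input tokens.
# Assumes `input_ids` and `hidden_states` from the question's code above.
seq_len = input_ids.shape[1]                        # 7 in the example
trimmed = [h[:, :seq_len, :] for h in hidden_states]
print(trimmed[0].shape)                             # torch.Size([1, 7, 768])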

Related

Need clear concept of the dimensions of output and hidden from LSTM layers

I know that output carries the hidden states from the last layer for all time steps, and hidden is the last time step's hidden state from every layer.
In my setting, each document has 850 tokens, each token is embedded into 100 dimensions, and I use a 2-layer LSTM with a hidden size of 100.
I thought it would take one token per time step and produce a 100-dim hidden state, so for 850 tokens in a document it would produce output = [1, 850, 100], hidden = [1, 2, 100] and cell = [1, 2, 100]. But hidden and cell come out as [2, 850, 100].
import torch
from torch import nn

input_dim = len(tok2indx)  # size of the vocabulary
emb_dim = 100              # embedding of each word
hid_dim = 100              # dimension of each hidden state coming out of a time step
n_layers = 2               # LSTM layers

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout, device=device)
        self.dropout = nn.Dropout(dropout)

    def forward(self, X):
        embedded = self.embedding(X).to(device)
        outputs, (hidden, cell) = self.rnn(embedded)
        return outputs, hidden, cell
If the encoder is passed a single document
enc = Encoder()
encd = enc.forward(train_x[:1])
print(encd[0].shape, encd[1].shape, encd[2].shape)
Output:
torch.Size([1, 850, 100]) torch.Size([2, 850, 100]) torch.Size([2, 850, 100])
With ten documents
encd = enc.forward(train_x[:10])
print(encd[0].shape, encd[1].shape, encd[2].shape)
Output:
torch.Size([10, 850, 100]) torch.Size([2, 850, 100]) torch.Size([2, 850, 100])
What's tripping you up is the input format to the LSTM. By default an LSTM layer expects input of shape sequence (L), batch (N), features (H), while your code feeds it NLH (batch, sequence, features). To use it correctly, set batch_first=True on the LSTM layer; then the input and output will be in the order you expect.
But there is a catch here too. Only output (the first return value) switches to NLH; hidden and cell (the second and third) keep their (num_layers, batch, hidden) layout regardless of batch_first.
The second thing to note is that the first dimension of hidden and cell equals the number of layers, i.e. 2 in your example (each layer keeps its own hidden state), hence [2, 850, 100] instead of [1, 850, 100].
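To make the shape behaviour concrete, here is a minimal sketch (not from the original answer) using a plain nn.LSTM with batch_first=True and a stand-in batch of ten 850-token documents:
# Sketch: shape check with batch_first=True.
import torch
from torch import nn

rnn = nn.LSTM(input_size=100, hidden_size=100, num_layers=2, batch_first=True)
x = torch.randn(10, 850, 100)             # (batch, seq_len, features)
output, (hidden, cell) = rnn(x)
print(output.shape)                        # torch.Size([10, 850, 100])  -> batch first
print(hidden.shape, cell.shape)            # torch.Size([2, 10, 100])    -> (num_layers, batch, hidden)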

Given groups=1, weight of size [32, 3, 3, 3], expected input[1, 1, 32, 340] to have 3 channels, but got 1 channels instead

This is the question:
Before we define the model, we define the size of our alphabet. Our alphabet consists of lowercase English letters, and additionally a special character used for space between symbols or before and after the word. For the first part of this assignment, we don't need that extra character.
Our end goal is to learn to transcribe words of arbitrary length. However, first, we pre-train our simple convolutional neural net to recognize single characters. In order to be able to use the same model for one character and for entire words, we are going to design the model in a way that makes sure that the output size for one character (or when the input image size is 32x18) is 1x27, and Kx27 whenever the input image is wider. K here will depend on the particular architecture of the network, and is affected by strides, poolings, among other things. A little more formally, our model f_θ, for an input image x, gives output energies l = f_θ(x). If x ∈ ℝ^(32×18), then l ∈ ℝ^(1×27). If x ∈ ℝ^(32×100), for example, our model may output l ∈ ℝ^(10×27), where l_i corresponds to a particular window in x, for example from x_(0, 9i) to x_(32, 9i+18) (again, this will depend on the particular architecture).
The code:
# constants for number of classes in total, and for the special extra character for empty space
ALPHABET_SIZE = 27  # extra character for the space in between
BETWEEN = 26
print(alphabet.shape) # RETURNS: torch.Size([32, 340])
My CNN Block:
import torch
from torch import nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

"""
Remember basics:
1. Bigger strides = less overlap
2. More filters = more features
Image shape = 32, 18
Alphabet shape = 32, 340
"""

class SimpleNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn_block = torch.nn.Sequential(
            nn.Conv2d(3, 32, 3),
            nn.BatchNorm2d(32),
            nn.Conv2d(32, 32, 3),
            nn.BatchNorm2d(32),
            nn.Conv2d(32, 32, 3),
            nn.BatchNorm2d(32),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3),
            nn.BatchNorm2d(64),
            nn.Conv2d(64, 64, 3),
            nn.BatchNorm2d(64),
            nn.Conv2d(64, 64, 3),
            nn.BatchNorm2d(64),
            nn.MaxPool2d(2)
        )

    def forward(self, x):
        x = self.cnn_block(x)
        # after applying cnn_block, x.shape should be:
        # batch_size, alphabet_size, 1, width
        return x[:, :, 0, :].permute(0, 2, 1)

model = SimpleNet()
alphabet_energies = model(alphabet.view(1, 1, *alphabet.shape))

def plot_energies(ce):
    fig = plt.figure(dpi=200)
    ax = plt.axes()
    im = ax.imshow(ce.cpu().T)
    ax.set_xlabel('window locations →')
    ax.set_ylabel('← classes')
    ax.xaxis.set_label_position('top')
    ax.set_xticks([])
    ax.set_yticks([])
    cax = fig.add_axes([ax.get_position().x1 + 0.01, ax.get_position().y0, 0.02, ax.get_position().height])
    plt.colorbar(im, cax=cax)

plot_energies(alphabet_energies[0].detach())
I get the error in the title at alphabet_energies = model(alphabet.view(1, 1, *alphabet.shape))
Any help would be appreciated.
You should start by replacing nn.Conv2d(3, 32, 3) with nn.Conv2d(1, 32, 3).
Your model begins with a Conv2d from 3 channels to 32, but your input image has only 1 channel (a greyscale image).
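A minimal sketch (assuming the rest of the network from the question stays unchanged) of the corrected first layer, with a shape check on a dummy single-channel input:
# Sketch: shape check with the first conv layer taking 1 input channel.
import torch
from torch import nn

first_layer = nn.Conv2d(1, 32, 3)         # 1 input channel instead of 3
dummy = torch.randn(1, 1, 32, 340)        # (batch, channels, height, width), like the alphabet image
print(first_layer(dummy).shape)           # torch.Size([1, 32, 30, 338])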

Ignore padding class (0) during multi class classification

I have a problem where, given a set of tokens, I need to predict another token. For this task I use an embedding layer with vocab size + 1 as input_size. The +1 is because the sequences are padded with zeros. E.g. given a vocab size of 10,000 and max_sequence_len=6, x_train looks like:
array([[ 0, 0, 0, 11, 22, 4],
[ 29, 6, 12, 29, 1576, 29],
...,
[ 0, 0, 67, 8947, 7274, 7019],
[ 0, 0, 0, 15, 10000, 50]])
y_train consists of integers between 1 and 10000; in other words, this becomes a multi-class classification problem with 10000 classes.
My problem: when I specify the output size of the output layer, I would like to specify 10000, but then the model will predict the classes 0-9999. Another approach is to set the output size to 10001, but then the model can predict the 0 class (padding), which is unwanted.
Since y_train is mapped from 1 to 10000, I could remap it to 0-9999, but since the labels share their mapping with the input, this seems like an unnecessary workaround.
EDIT:
I realize, as @Andrey pointed out in the comments, that I could allow for 10001 classes and simply add the padding token to the vocabulary, although I am never interested in the network predicting 0's.
How can I tell the model to predict the labels 1-10000, while at the same time having 10000 classes, not 10001?
I would use the following approach:
import tensorflow as tf
inputs = tf.keras.layers.Input(shape=())
x = tf.keras.layers.Embedding(10001, 512)(inputs) # input shape of full vocab size [10001]
x = tf.keras.layers.Dense(10000, activation='softmax')(x) # training weights based on reduced vocab size [10000]
z = tf.zeros(tf.shape(x)[:-1])[..., tf.newaxis]
x = tf.concat([z, x], axis=-1) # add constant zero on the first position (to avoid predicting 0)
model = tf.keras.Model(inputs=inputs, outputs=x)
inputs = tf.random.uniform([10, 10], 0, 10001, dtype=tf.int32)
labels = tf.random.uniform([10, 10], 0, 10001, dtype=tf.int32)
model.compile(loss='sparse_categorical_crossentropy')
model.fit(inputs, labels)
pred = model.predict(inputs) # all zero positions filled by 0 (which is minimum value)
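As a small follow-up (not part of the original answer): because the Dense softmax already sums to 1 over classes 1-10000, the concatenated position 0 always carries probability exactly 0, so argmax can never return the padding class. A quick sanity check on the `pred` array computed above:
# Sketch: confirm the padding class can never be predicted.
import numpy as np

print(np.allclose(pred[..., 0], 0.0))         # True: probability of class 0 is exactly 0
print((pred.argmax(axis=-1) == 0).any())      # False: argmax never returns the padding class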

Understanding input shape to PyTorch conv1D?

This seems to be one of the common questions on here (1, 2, 3), but I am still struggling to define the right shape for input to PyTorch conv1D.
I have text sequences of length 512 (number of tokens per sequence) with each token being represented by a vector of length 768 (embedding). The batch size I am using is 6.
So my input tensor to conv1D is of shape [6, 512, 768].
input = torch.randn(6, 512, 768)
Now, I want to convolve over the length of my sequence (512) with a kernel size of 2 using the conv1D layer from PyTorch.
Understanding 1:
I assumed that "in_channels" is the embedding dimension of the Conv1d layer. If so, then a Conv1d layer will be defined in this way, where
in_channels = embedding dimension (768)
out_channels = 100 (arbitrary number)
kernel = 2
convolution_layer = nn.Conv1d(768, 100, 2)
feature_map = convolution_layer(input)
But with this assumption, I get the following error:
RuntimeError: Given groups=1, weight of size 100 768 2, expected input `[4, 512, 768]` to have 768 channels, but got 512 channels instead
Understanding 2:
Then I assumed that "in_channels" is the sequence length of the input sequence. If so, then a Conv1d layer will be defined in this way, where
in_channels = sequence length (512)
out_channels = 100 (arbitrary number)
kernel = 2
convolution_layer = nn.Conv1d(512, 100, 2)
feature_map = convolution_layer(input)
This works fine and I get an output feature map of dimension [batch_size, 100, 767]. However, I am confused. Shouldn't the convolutional layer convolve over the sequence length of 512 and output a feature map of dimension [batch_size, 100, 511]?
I will be really grateful for your help.
In pytorch your input shape of [6, 512, 768] should actually be [6, 768, 512] where the feature length is represented by the channel dimension and sequence length is the length dimension. Then you can define your conv1d with in/out channels of 768 and 100 respectively to get an output of [6, 100, 511].
Given an input of shape [6, 512, 768] you can convert it to the correct shape with Tensor.transpose.
input = input.transpose(1, 2).contiguous()
The .contiguous() ensures the memory of the tensor is stored contiguously which helps avoid potential issues during processing.
I found an answer to it (source).
So, usually, BERT outputs vectors of shape
[batch_size, sequence_length, embedding_dim].
where,
sequence_length = number of words or tokens in a sequence (max_length sequence BERT can handle is 512)
embedding_dim = the vector length of the vector describing each token (768 in case of BERT).
thus, input = torch.randn(batch_size, 512, 768)
Now, we want to convolve over the text sequence of length 512 using a kernel size of 2.
So, we define a PyTorch Conv1d layer as follows,
convolution_layer = nn.Conv1d(in_channels, out_channels, kernel_size)
where,
in_channels = embedding_dim
out_channels = arbitrary int
kernel_size = 2 (I want bigrams)
thus, convolution_layer = nn.Conv1d(768, 100, 2)
Now we need a connecting link between the input expected by convolution_layer and the actual input.
For this, we need to go from the
current input shape: [batch_size, 512, 768]
to the expected input shape: [batch_size, 768, 512]
To achieve this expected input shape, we use the transpose function from PyTorch.
input_transposed = input.transpose(1, 2)
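Putting the pieces together, here is a minimal self-contained sketch (with the arbitrary out_channels=100 from above) reproducing the shapes discussed:
# Sketch: bigram convolution over a BERT-style output of shape [batch, seq_len, emb_dim].
import torch
from torch import nn

batch_size, seq_len, emb_dim = 6, 512, 768
x = torch.randn(batch_size, seq_len, emb_dim)         # [6, 512, 768]

conv = nn.Conv1d(in_channels=emb_dim, out_channels=100, kernel_size=2)
x_t = x.transpose(1, 2)                                # [6, 768, 512]: channels = embedding dim
feature_map = conv(x_t)
print(feature_map.shape)                               # torch.Size([6, 100, 511])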
I have a suggestion for you which may not be what you asked for, but it might help. Because your input is (6, 512, 768) you can use Conv2d instead of Conv1d.
All you need to do is add a dimension of 1 at index 1 with input.unsqueeze(1), which acts as your channel (think of it as a grayscale image):
def forward(self, x):
    x = self.embedding(x)        # [batch, seq length, embedding] = [5, 512, 768]
    x = torch.unsqueeze(x, 1)    # [5, 1, 512, 768], like a grayscale image
and you can define your Conv2d layer like this:
window_size = 3      # for trigrams
EMBEDDING_SIZE = 768
NUM_FILTERS = 10     # or whatever you want

self.conv = nn.Conv2d(in_channels=1,
                      out_channels=NUM_FILTERS,
                      kernel_size=(window_size, EMBEDDING_SIZE),
                      padding=(window_size - 1, 0))
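For reference, a quick shape check of this Conv2d alternative (a sketch, assuming the [5, 1, 512, 768] input from the snippet above):
# Sketch: output shape of the Conv2d trigram filter on a [5, 1, 512, 768] input.
import torch
from torch import nn

window_size, EMBEDDING_SIZE, NUM_FILTERS = 3, 768, 10
conv = nn.Conv2d(1, NUM_FILTERS, kernel_size=(window_size, EMBEDDING_SIZE),
                 padding=(window_size - 1, 0))
x = torch.randn(5, 1, 512, 768)
print(conv(x).shape)    # torch.Size([5, 10, 514, 1]) -> one column of window scores per filter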

CNN Autoencoder with Embedding(300D GloveVec) layer for 10-15 word sentence not working problem due to padding

I am using pretrained GloVe vectors from Stanford to get a meaningful representation of each word, but I want representations for sentences containing 5-15 words, so that I can use cosine similarity to match against a new incoming sentence. I fix each sentence to 15 words and apply an embedding layer, so the new input shape is 15 x 300 dimensions (if a sentence has fewer than 15 words, it is padded out to 15 words with one random uniformly distributed 300-D vector).
Below are my network shapes
[None, 15] -- raw inputs: padded (with 1) token IDs before embedding
[None, 15, 300, 1], --input
[None, 8, 150, 128], -- conv 1
[None, 4, 75, 64], -- conv 2
[None, 2, 38, 32], -- conv 3
[None, 1, 19, 16], -- conv 4
[None, 1, 10, 4] -- conv 5
[None, 50] --------- Latent shape (new meaningful representation) ------
[None, 1, 10, 4] -- encoded input for de-conv
[None, 1, 19, 16], -- conv_trans 5
[None, 2, 38, 32], -- conv_trans 4
[None, 4, 75, 64], -- conv_trans 3
[None, 8, 150, 128], -- conv_trans 2
[None, 15, 300, 1] -- conv_trans 1 -- for loss function with input
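Given the [None, 50] latent above, the cosine-similarity match described in the question might look like this minimal sketch (a hypothetical helper, not part of the original post):
# Sketch: cosine-similarity match between two 50-d latent sentence vectors.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

z_new = np.random.randn(50)        # latent vector of the incoming sentence
z_stored = np.random.randn(50)     # latent vector of a stored sentence
print(cosine_similarity(z_new, z_stored))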
Here is the CNN autoencoder model with the embedding layer that I tried in TensorFlow:
self._inputs = tf.placeholder(dtype=tf.int64, shape=[None, self.sent_len], name='input_x')  # (?, 15)
losses = []
# lookup layer
with tf.variable_scope('embedding') as scope:
    self._W_emb = _variable_on_cpu(name='embedding', shape=[self.vocab_size, self.emb_size],
                                   initializer=tf.random_uniform_initializer(minval=-1.0, maxval=1.0))
    # assigned pretrained embedding here, so the initializer would be overridden
    sent_batch = tf.nn.embedding_lookup(params=self._W_emb, ids=self._inputs)
    sent_batch = tf.expand_dims(sent_batch, -1)
self._x = sent_batch

# encoder: stack of strided conv layers
encoder = []
shapes = []
current_input = sent_batch
shapes.append(current_input.get_shape().as_list())
for layer_i, n_output in enumerate(n_filters[1:]):
    with tf.variable_scope('Encode_conv-%d' % layer_i) as scope:
        n_input = current_input.get_shape().as_list()[3]
        W, wd = _variable_with_weight_decay('W-%d' % layer_i, shape=[filter_size, filter_size, n_input, n_output],
                                            initializer=tf.random_uniform_initializer(minval=-1.0, maxval=1.0),
                                            wd=self.l2_reg)
        losses.append(wd)
        biases = _variable_on_cpu('bias-%d' % layer_i, shape=[n_output], initializer=tf.constant_initializer(0.00))
        encoder.append(W)
        output = tf.nn.relu(tf.add(tf.nn.conv2d(current_input, W, strides=[1, 2, 2, 1], padding='SAME'), biases), name=scope.name)
        current_input = output
        shapes.append(output.get_shape().as_list())

# z = current_input
original_shape = current_input.get_shape().as_list()
flatsize = original_shape[1] * original_shape[2] * original_shape[3]
height, width, channel = original_shape[1] * 1, original_shape[2] * 1, original_shape[3] * 1
current_input = tf.reshape(current_input, [-1, flatsize])

with tf.variable_scope('Encode_Z-%d' % layer_i) as scope:
    W_en, wd_en = _variable_with_weight_decay('W', shape=[current_input.get_shape().as_list()[1], outsize],
                                              initializer=tf.truncated_normal_initializer(stddev=0.05),
                                              wd=self.l2_reg)
    losses.append(wd_en)
    biases_en = _variable_on_cpu('bias', shape=[outsize], initializer=tf.constant_initializer(0.00))
    self._z = tf.nn.relu(tf.nn.bias_add(tf.matmul(current_input, W_en), biases_en))  # compressed representation (?, 50)

with tf.variable_scope('Decode_Z-%d' % layer_i) as scope:
    W_dc, wd_dc = _variable_with_weight_decay('W', shape=[self._z.get_shape().as_list()[1], current_input.get_shape().as_list()[1]],
                                              initializer=tf.truncated_normal_initializer(stddev=0.05), wd=self.l2_reg)
    losses.append(wd_dc)
    biases_dc = _variable_on_cpu('bias', shape=[current_input.get_shape().as_list()[1]], initializer=tf.constant_initializer(0.00))
    current_input = tf.nn.relu(tf.nn.bias_add(tf.matmul(self._z, W_dc), biases_dc))

current_input = tf.reshape(current_input, [-1, height, width, channel])

# decoder: mirror the encoder with transposed convolutions (weights reused from the encoder)
encoder.reverse()
shapes.reverse()
for layer_i, shape in enumerate(shapes[1:]):
    with tf.variable_scope('Decode_conv-%d' % layer_i) as scope:
        W = encoder[layer_i]
        b = _variable_on_cpu('bias-%d' % layer_i, shape=[W.get_shape().as_list()[2]], initializer=tf.constant_initializer(0.00))
        hh, ww, cc = shape[1], shape[2], shape[3]
        output = tf.nn.relu(tf.add(tf.nn.conv2d_transpose(current_input, W, [tf.shape(sent_batch)[0], hh, ww, cc],
                                                          strides=[1, 2, 2, 1], padding='SAME'), b), name=scope.name)
        current_input = output
self._y = current_input

# loss (despite the variable name, this is an MSE reconstruction loss)
with tf.variable_scope('loss') as scope:
    cross_entropy_loss = tf.reduce_mean(tf.square(current_input - sent_batch))
    losses.append(cross_entropy_loss)
    self._total_loss = tf.add_n(losses, name='total_loss')

opt = tf.train.AdamOptimizer(0.0001)
grads = opt.compute_gradients(self._total_loss)
self._train_op = opt.apply_gradients(grads)
But the results are not good: the cosine similarity of the two sentences below is 0.9895 after taking the compressed latent representation from the model above.
Functional disorders of polymorphonuclear neutrophils
Unspecified fracture of skull, sequela
And if I take sentences with 2-5 words, the similarity goes up to 0.9999 (I suspect the issue is caused by the larger number of padding positions, which all share the same uniformly distributed embedding vector from the lookup).
The following information may be helpful:
Total of 10,000 training samples with 10 epochs
Used Relu activations
MSE loss function
Adam optimizers
Below is the distribution of word counts over all sentences: [plot not included]
Finally, can anyone suggest what's going wrong? Or is the approach itself not a good one to proceed with?
