HuggingFace transformers - encoding long input with context

I am using a BERT-like model, which has a limit on input length.
I want to encode a long input and feed it into BERT.
The most common solution I know of is a sliding window that adds context to each segment of the input.
For example:
model_max_size = 5
stride = 2
input = [1, ..., 12]
output = [
[1, 2, 3, 4, 5], -> [1, 2, 3, 4, 5]
[4, 5, 6, 7, 8], -> [6, 7, 8]
[7, 8, 9, 10, 11], -> [9, 10, 11]
[10, 11, 12] -> [12]
]
Is there a known good strategy?
Do you send each input into consecutive windows and average their outputs?
Is there a built-in implementation for this?
The HuggingFace tokenizer has the stride and return_overflowing_tokens options, but they are not quite what I need, as they work only for the first sliding window.
*I know there are other models that accept longer input (e.g. Longformer, BigBird), but I need to use this specific one.
Thanks!
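For reference, the windowing in the example above can be sketched in plain Python. This is only an illustration, not a HuggingFace API: the function name sliding_windows and its parameters are hypothetical, and overlap plays the role of stride in the example (the number of items shared by consecutive windows).

```python
def sliding_windows(tokens, max_size, overlap):
    """Split tokens into windows of at most max_size items.

    Consecutive windows share `overlap` items. Returns (window, new_items)
    pairs, where new_items are the items not covered by earlier windows.
    """
    if not 0 <= overlap < max_size:
        raise ValueError("overlap must satisfy 0 <= overlap < max_size")
    step = max_size - overlap
    pairs = []
    start = 0
    while True:
        window = tokens[start:start + max_size]
        # The first window keeps everything; later ones drop the shared prefix.
        new_items = window if start == 0 else window[overlap:]
        pairs.append((window, new_items))
        if start + max_size >= len(tokens):
            break
        start += step
    return pairs


for window, new_items in sliding_windows(list(range(1, 13)), max_size=5, overlap=2):
    print(window, "->", new_items)
```

How the per-window model outputs are then combined (keeping only the new positions, or averaging the overlapping ones) is a separate design choice that this sketch does not settle.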

Related

creating tensor by composition of smaller tensors

I would like to create a 4x4 tensor that is composed of four smaller 2x2 tensors in this manner:
The tensor I would like to create:
in_t = torch.tensor([[14, 7, 6, 2],
[ 4, 8, 11, 1],
[ 3, 5, 9, 10],
[12, 15, 16, 13]])
I would like to create this tensor composed from these four smaller tensors:
a = torch.tensor([[14, 7], [ 4, 8]])
b = torch.tensor([[6, 2], [11, 1]])
c = torch.tensor([[3, 5], [12, 15]])
d = torch.tensor([[9, 10], [16, 13]])
I have tried to use torch.cat like this:
mm_ab = torch.cat((a,b,c,d), dim=0)
but I end up with an 8x2 tensor.

You can control the layout of your tensor and achieve the desired result with a combination of torch.transpose and torch.reshape. You can perform an outer transpose followed by an inner transpose:
>>> stack = torch.stack((a,b,c,d))
>>> stack
tensor([[[14, 7],
[ 4, 8]],
[[ 6, 2],
[11, 1]],
[[ 3, 5],
[12, 15]],
[[ 9, 10],
[16, 13]]])
Reshape-transpose-reshape-transpose-reshape:
>>> stack.reshape(4,2,-1).transpose(0,1).reshape(-1,2,4).transpose(0,1).reshape(-1,4)
tensor([[14, 7, 6, 2],
[ 4, 8, 11, 1],
[ 3, 5, 9, 10],
[12, 15, 16, 13]])
Essentially, a reshape lets you regroup and view the tensor's elements differently, while a transpose changes their order (the result is no longer contiguous). Alternating the two lets you reach the desired layout.
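The same idea can be written more compactly with a single permute. This is a sketch of an alternative, not part of the answer above; it assumes the four blocks all have the same 2x2 shape and are stacked in row-major block order (a, b, c, d).

```python
import torch

a = torch.tensor([[14, 7], [4, 8]])
b = torch.tensor([[6, 2], [11, 1]])
c = torch.tensor([[3, 5], [12, 15]])
d = torch.tensor([[9, 10], [16, 13]])

# (4, 2, 2) -> (block_row, block_col, row, col)
blocks = torch.stack((a, b, c, d)).reshape(2, 2, 2, 2)
# Reorder the axes to (block_row, row, block_col, col),
# then merge each pair of axes into a 4x4 matrix.
in_t = blocks.permute(0, 2, 1, 3).reshape(4, 4)
print(in_t)
```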

If you arrange your tensors in the following grid and concatenate them, you will get exactly your output:
tensor a  tensor b
tensor c  tensor d
You started with a good and easy approach; this completes your attempt:
p1 = torch.concat((a,b),axis=1)
p2 = torch.concat((c,d),axis=1)
p3 = torch.concat((p1,p2),axis=0)
print(p3)
#output
tensor([[14, 7, 6, 2],
[ 4, 8, 11, 1],
[ 3, 5, 9, 10],
[12, 15, 16, 13]])

How can I calculate all cross-terms in pytorch?

I would like to calculate all cross-terms of each vector in a matrix.
For example, consider the following matrix:
X = tensor([[1, 2, 3],
[4, 5, 6]]),
and I would like to obtain all cross-terms of each vector in this matrix as:
Y = [[1*1, 1*2, 1*3, 2*2, 2*3, 3*3],
[4*4, 4*5, 4*6, 5*5, 5*6, 6*6]].
= [[1, 2, 3, 4, 6, 9],
[16, 20, 24, 25, 30, 36]].
That is, I want the products of all combinations (with repetition) of each vector's elements.
I believe this can be calculated using torch.combinations;
however, torch.combinations does not provide a batched implementation,
and I couldn't produce the above result in PyTorch.
How can I calculate all cross-terms in pytorch?

You can stack the products of the combinations with replacement for each row of the matrix:
>>> torch.stack(tuple(torch.prod(torch.combinations(X[i], with_replacement=True), 1) for i in range(X.shape[0])), 0)
tensor([[ 1, 2, 3, 4, 6, 9],
[16, 20, 24, 25, 30, 36]])
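A loop-free variant is also possible by indexing the columns with the upper-triangular index pairs. This is a sketch of an alternative to the answer above; it assumes X is a 2D tensor whose rows are the vectors.

```python
import torch

X = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])

n = X.shape[1]
# Index pairs (i, j) with i <= j, in row-major order.
i, j = torch.triu_indices(n, n)
# Advanced indexing picks the paired columns for every row at once.
Y = X[:, i] * X[:, j]
print(Y)
```

Because the pairs come out in row-major order, the columns of Y follow the same order as in the question (1*1, 1*2, 1*3, 2*2, 2*3, 3*3).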

Does Scipy recognize the special structure of this matrix to decompose it faster?

I have a matrix in which many rows are already in upper triangular form. I would like to know whether scipy.linalg.lu recognizes this special structure and decomposes the matrix faster. If I decompose this matrix on paper, I only apply Gaussian elimination to the rows that are not in upper triangular form. For example, I would only transform the last row of matrix B.
import numpy as np
A = np.array([[2, 5, 8, 7, 8],
[5, 2, 2, 8, 9],
[7, 5, 6, 6, 10],
[5, 4, 4, 8, 10]])
B = np.array([[2, 5, 8, 7, 8],
[0, 2, 2, 8, 9],
[0, 0, 6, 6, 10],
[5, 4, 4, 8, 10]])
My square matrix is very large and this procedure is repeated thousands of times, so I would like to use this special structure to reduce the computational cost.
Thank you so much for your elaboration!

Not automatically.
You'll need to exploit the structure yourself if you want to. Whether you can make it faster than the built-in implementation depends on many factors (the number of zeros, etc.).
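As a rough illustration of what exploiting the structure could look like for matrix B, here is a minimal sketch that eliminates only the last row. The function name lu_last_row is hypothetical. It assumes every other row is already upper triangular with nonzero diagonal entries, and it does no pivoting (scipy.linalg.lu uses partial pivoting), so it is not numerically robust in general.

```python
import numpy as np


def lu_last_row(B):
    """Return (L, U) with L @ U == B, eliminating only the last row.

    Assumes every row except the last is already upper triangular and
    that the diagonal entries used as pivots are nonzero. No pivoting.
    """
    U = np.array(B, dtype=float)
    n_rows = U.shape[0]
    L = np.eye(n_rows)
    for k in range(n_rows - 1):
        if U[k, k] == 0:
            raise ZeroDivisionError(f"zero pivot at position {k}")
        factor = U[-1, k] / U[k, k]
        L[-1, k] = factor
        # Subtract a multiple of row k from the last row only.
        U[-1, k:] -= factor * U[k, k:]
        U[-1, k] = 0.0  # clear round-off in the eliminated entry
    return L, U


B = np.array([[2, 5, 8, 7, 8],
              [0, 2, 2, 8, 9],
              [0, 0, 6, 6, 10],
              [5, 4, 4, 8, 10]])
L, U = lu_last_row(B)
print(np.allclose(L @ U, B))
```

Each elimination step touches a single row, so the work grows roughly quadratically with the matrix size instead of cubically. Whether that beats the optimized built-in routine in practice would have to be measured.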

How should I understand the nn.Embeddings arguments num_embeddings and embedding_dim?

I'm trying to get used to the Embedding class in the PyTorch nn module.
I've noticed that quite a few other people have had the same problem as myself, and therefore posted questions on the PyTorch discussion forum and on Stack Overflow, but I'm still having some confusion.
According to the official documentation, the arguments that are passed are num_embeddings and embedding_dim which each refer to how large our dictionary (or vocabulary) is and how many dimensions we want our embeddings to be, respectively.
What I'm confused about is how exactly I should interpret those. For example, the small practice code that I ran:
import torch
import torch.nn as nn
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3)
a = torch.LongTensor([[1, 2, 3, 4], [4, 3, 2, 1]]) # (2, 4)
b = torch.LongTensor([[1, 2, 3], [2, 3, 1], [4, 5, 6], [3, 3, 3], [2, 1, 2],
[6, 7, 8], [2, 5, 2], [3, 5, 8], [2, 3, 6], [8, 9, 6],
[2, 6, 3], [6, 5, 4], [2, 6, 5]]) # (13, 3)
c = torch.LongTensor([[1, 2, 3, 2, 1, 2, 3, 3, 3, 3, 3],
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]]) # (2, 11)
When I run a, b, and c through the embedding variable, I get embedded results of shapes (2, 4, 3), (13, 3, 3), (2, 11, 3).
What's confusing me is that I thought that if the number of samples exceeds the predefined number of embeddings, we should get an error. Since the embedding I've defined has 10 embeddings, shouldn't b give me an error, since it is a tensor containing 13 words of dimension 3?

In your case, here is how your input tensors are interpreted:
a = torch.LongTensor([[1, 2, 3, 4], [4, 3, 2, 1]]) # 2 sequences of 4 elements
Moreover, this is how your embedding layer is interpreted:
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3) # 10 distinct elements, each of which is embedded in a 3-dimensional space
So, it doesn't matter if your input tensor has more than 10 elements, as long as they are in the range [0, 9]. For example, if we create a tensor of two elements such as:
d = torch.LongTensor([[1, 10]]) # 1 sequence of 2 elements
We would get the following error when we pass this tensor through the embedding layer:
RuntimeError: index out of range: Tried to access index 10 out of table with 9 rows
To summarize, num_embeddings is the total number of unique elements in the vocabulary, and embedding_dim is the size of each embedded vector once passed through the embedding layer. Therefore, you can have a tensor of 10+ elements, as long as each element in the tensor is in the range [0, 9], because you defined a vocabulary size of 10 elements.
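A small check of this behaviour, as a sketch: the tensor b here is random rather than the one from the question, and the exact exception type and message for an out-of-range index depend on the PyTorch version, so both IndexError and RuntimeError are caught.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=3)

# 13 sequences of 3 indices each; every index is in [0, 9], so this works.
b = torch.randint(0, 10, (13, 3))
print(embedding(b).shape)  # torch.Size([13, 3, 3])

# Index 10 has no row in the 10-row table, so the lookup fails.
d = torch.LongTensor([[1, 10]])
try:
    embedding(d)
except (IndexError, RuntimeError) as err:
    print("lookup failed:", err)
```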

Why there are two square brackets required inside numpy array?

I am learning Python, and I recently came across the NumPy module. With NumPy, one can convert lists to arrays and perform operations much faster.
Let's say we create an array with following values :
import numpy as np
np_array = np.array([1,2,3,4,5])
So we need one pair of square brackets to store one list as an array. Now, if I want to create a 2D array, why should it be defined like this:
np_array = np.array([[1,2,3,4,5],[6,7,8,9,10]])
And not like this:
np_array = np.array([1,2,3,4,5],[6,7,8,9,10])
I apologize if this question is a duplicate, but I couldn't find any answer.
Many Thanks

The array function has the following signature:
array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
If you use
np_array = np.array([1,2,3,4,5],[6,7,8,9,10])
the call passes [1,2,3,4,5] as object and [6,7,8,9,10] as dtype, which doesn't make sense because a list of integers is not a valid data type.

This actually has little to do with numpy. You are essentially asking what is the difference between foo(a, b) and foo([a, b]).
arbitrary_function([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]) passes two lists as separate arguments to arbitrary_function (one argument is [1, 2, 3, 4, 5] and the second is [6, 7, 8, 9, 10]).
arbitrary_function([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]) passes a list of lists ([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]) to arbitrary_function.
Now, numpy creators could have chosen to allow arbitrary_function([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]) but it would have made little to no sense to do so.
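The difference between the two calls can be checked directly. This is a small sketch; the exact wording of the error message depends on the NumPy version, so it is printed rather than quoted.

```python
import numpy as np

# One argument: a list of two lists, interpreted as a 2x5 array.
two_d = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(two_d.shape)  # (2, 5)

# Two arguments: the second list lands in the dtype parameter.
try:
    np.array([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
except TypeError as err:
    print("TypeError:", err)
```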
