Why would a Torchscript trace return different looking encoded_inputs compared to the original Transformer model? - pytorch

I'm working with a fine-tuned mBART-50 model that I need to speed up for inference, because the Hugging Face model as-is is fairly slow on my current hardware. I chose TorchScript because I couldn't get ONNX to export this particular model; it seems support will come at a later time (I would be glad to be proven wrong).
Convert the Transformers model to a TorchScript trace:
import torch
""" Model data """
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-one-to-many-mmt", torchscript= True)
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")
tokenizer.src_lang = 'en_XX'
dummy = "To celebrate World Oceans Day, we're swimming through a shoal of jack fish just off the coast of Baja, California, in Cabo Pulmo National Park. This Mexican marine park in the Sea of Cortez is home to the northernmost and oldest coral reef on the west coast of North America, estimated to be about 20,000 years old. Jacks are clearly plentiful here, but divers and snorkelers in Cabo Pulmo can also come across many other species of fish and marine mammals, including several varieties of sharks, whales, dolphins, tortoises, and manta rays."
myTokenBatch = tokenizer(dummy, max_length=192, padding='max_length', truncation = True, return_tensors="pt")
torch.jit.save(torch.jit.trace(model, [myTokenBatch.input_ids,myTokenBatch.attention_mask]), "././traced-model/mbart-many.pt")
Inference Step:
import torch
""" Model data """
from transformers import MBart50TokenizerFast
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")
model = torch.jit.load('././traced-model/mbart-many.pt')
tokenizer.src_lang = 'en_XX'
dummy = "To celebrate World Oceans Day, we're swimming through a shoal of jack fish just off the coast of Baja, California, in Cabo Pulmo National Park. This Mexican marine park in the Sea of Cortez is home to the northernmost and oldest coral reef on the west coast of North America, estimated to be about 20,000 years old. Jacks are clearly plentiful here, but divers and snorkelers in Cabo Pulmo can also come across many other species of fish and marine mammals, including several varieties of sharks, whales, dolphins, tortoises, and manta rays."
myTokenBatch = tokenizer(dummy, max_length=192, padding='max_length', truncation = True, return_tensors="pt")
encode, pool , norm = model(myTokenBatch.input_ids,myTokenBatch.attention_mask)
Expected Encoding Output:
These are tokens that can be decoded to words with MBart50TokenizerFast.
tensor([[250004, 717, 176016, 6661, 55609, 7, 10013, 4, 642,
25, 107, 192298, 8305, 10, 15756, 289, 111, 121477,
67155, 1660, 5773, 70, 184085, 111, 118191, 4, 39897,
4, 23, 143740, 21694, 432, 9907, 5227, 5, 3293,
181815, 122084, 9201, 23, 70, 27414, 111, 48892, 169,
83, 5368, 47, 70, 144477, 9022, 840, 18, 136,
10332, 525, 184518, 456, 4240, 98, 70, 65272, 184085,
111, 23924, 21629, 4, 25902, 3674, 47, 186, 1672,
6, 91578, 5369, 10332, 5, 21763, 7, 621, 123019,
32328, 118, 7844, 3688, 4, 1284, 41767, 136, 120379,
2590, 1314, 23, 143740, 21694, 432, 831, 2843, 1380,
36880, 5941, 3789, 114149, 111, 67155, 136, 122084, 21968,
8080, 4, 26719, 40368, 285, 68794, 111, 54524, 1224,
4, 148, 50742, 7, 4, 13111, 19379, 1779, 4,
43807, 125216, 7, 4, 136, 332, 102, 62656, 7,
5, 2, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1]])
Actual Output:
I don't know what this is. Output of print(encode):
(tensor([[[[-9.3383e-02, -2.0395e-01, 4.8226e-03, ..., 1.8068e+00,
1.1528e-01, 7.0406e-02],
[-4.4630e-02, -2.2453e-01, 9.5264e-02, ..., 1.6921e+00,
1.4607e-01, 4.8238e-02],
[-7.8206e-01, 1.2699e-01, 1.6467e+00, ..., -1.7057e+00,
8.7768e-01, 8.2230e-01],
[-1.2145e-02, -2.1855e-03, -6.0966e-03, ..., 2.9296e-02,
2.2141e-03, 3.2074e-02],
[-1.4671e-02, -2.8995e-03, -5.8610e-03, ..., 2.8525e-02,
2.4620e-03, 3.1593e-02],
[-1.5877e-02, -3.5165e-03, -4.8743e-03, ..., 2.8930e-02,
2.9877e-03, 3.3892e-02]]]], grad_fn=<CopyBackwards>))

Found the answer here: https://stackoverflow.com/a/66117248/13568346
You can't directly convert a seq2seq (encoder-decoder) model using this method. To convert one, you have to split it and convert the parts separately: the encoder to ONNX and the decoder to ONNX. You can follow this guide (it was written for T5, which is also a seq2seq model). You need to provide a dummy input to the encoder and to the decoder separately; by default, converting with this method only provides the dummy input to the encoder.
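The traced forward pass returns floating-point tensors (such as logits and hidden states), not token IDs; getting IDs back requires a decoding loop around the decoder. As a rough sketch of the same split-and-convert idea applied to TorchScript, the toy modules below are traced separately and then driven by a plain Python greedy loop. ToyEncoder, ToyDecoder and greedy_decode are hypothetical stand-ins written for illustration, not MBart and not part of the Transformers API.

```python
import torch
from torch import nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a seq2seq encoder (not MBart)."""

    def __init__(self, vocab_size=32, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, input_ids, attention_mask):
        hidden = self.embed(input_ids)
        # Zero out the padded positions.
        return hidden * attention_mask.unsqueeze(-1).to(hidden.dtype)


class ToyDecoder(nn.Module):
    """Hypothetical stand-in for a seq2seq decoder with an output head."""

    def __init__(self, vocab_size=32, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, decoder_input_ids, encoder_hidden):
        # Crude "attention": add the mean encoder state to every position.
        context = encoder_hidden.mean(dim=1, keepdim=True)
        return self.lm_head(self.embed(decoder_input_ids) + context)


def greedy_decode(encoder, decoder, input_ids, attention_mask, start_id, max_new_tokens):
    """Run the encoder once, then pick the arg-max token step by step."""
    encoder_hidden = encoder(input_ids, attention_mask)
    output_ids = torch.full((input_ids.shape[0], 1), start_id, dtype=torch.long)
    for _ in range(max_new_tokens):
        logits = decoder(output_ids, encoder_hidden)  # float scores, not IDs
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        output_ids = torch.cat([output_ids, next_id], dim=1)
    return output_ids


torch.manual_seed(0)
encoder, decoder = ToyEncoder().eval(), ToyDecoder().eval()
input_ids = torch.tensor([[5, 6, 7, 1, 1]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0]])

with torch.no_grad():
    encoder_hidden = encoder(input_ids, attention_mask)
    # Each half gets its own dummy inputs and its own trace.
    traced_encoder = torch.jit.trace(encoder, (input_ids, attention_mask))
    traced_decoder = torch.jit.trace(decoder, (torch.tensor([[2]]), encoder_hidden))
    generated = greedy_decode(
        traced_encoder, traced_decoder, input_ids, attention_mask,
        start_id=2, max_new_tokens=4,
    )

print(generated.dtype, generated.shape)
```

A real model would also need to handle cached key/value states and stopping criteria, which this sketch leaves out.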


How does Huggingface's tokenizers tokenize non-English characters?

I use tokenizers to split natural-language sentences into tokens, and I have some questions about the results.
Here are some examples I tried:
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# {'input_ids': [42468], 'attention_mask': [1]}
# {'input_ids': [22755, 239, 46237, 112, 19526, 254, 161, 222, 240, 42468, 33232, 104, 163, 224, 117, 161, 243, 232], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# {'input_ids': [30266, 109], 'attention_mask': [1, 1]}
# {'input_ids': [30266, 109, 12859, 105], 'attention_mask': [1, 1, 1, 1]}
# {'input_ids': [30266, 109, 12859, 105, 26998, 13298, 16253], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
# {'input_ids': [26998, 13298, 16253], 'attention_mask': [1, 1, 1]}
tokenizer("This is my fault")
{'input_ids': [1212, 318, 616, 8046], 'attention_mask': [1, 1, 1, 1]}
The code above shows some examples I tried.
The last example is an English sentence, and I can see that This corresponds to "This": 1212 in vocab.json and is corresponds to "\u0120is": 318.
But I cannot understand why this tool tokenizes non-English sequences into tokens that I cannot find in the vocab.
For example:
東 is tokenized into 30266 and 109. The corresponding entries in vocab.json are "æĿ": 30266 and "±": 109.
メ is tokenized into 26998. The corresponding entry in vocab.json is "ãĥ¡": 26998.
I searched the Hugging Face documentation and website and found no clue.
The source code is written in Rust, which is hard for me to understand.
So could you help me figure out why?
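One likely explanation is that GPT-2 uses byte-level BPE: text is first encoded as UTF-8 bytes, and each byte is then written as a printable stand-in character before merges are applied. The sketch below re-implements that byte-to-character table in plain Python, following the scheme used by GPT-2's reference encoder; to_vocab_form is a hypothetical helper name introduced here. It reproduces the vocab.json strings quoted above.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Every other byte is moved to a code point at 256 or above.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def to_vocab_form(text):
    """Hypothetical helper: show text the way it is spelled in vocab.json."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_vocab_form("東"))   # æĿ± (three UTF-8 bytes, split as "æĿ" + "±")
print(to_vocab_form("メ"))   # ãĥ¡
print(to_vocab_form(" is"))  # Ġis, where Ġ is \u0120 and stands for the space
```

So 東 is three bytes (E6 9D B1); the vocabulary happens to contain a merged token for the first two bytes and a single-byte token for the third, which gives the two IDs 30266 and 109.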

Is there a way to use Huggingface pretrained tokenizer with wordpiece prefix?

I'm doing a sequence labeling task with BERT. To align the word pieces with labels, I need some marker to identify them so that I can get a single embedding for each word by either summing or averaging.
For example, I want the word New~york tokenized into New ##~ ##york. Looking at some old examples on the internet, that is what BertTokenizer used to produce, but clearly not anymore (according to its documentation).
So when I run:
from transformers import BertTokenizer, BertTokenizerFast
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
batch_sentences = ["hello, i'm testing this efauenufefu"]
inputs = tokenizer(batch_sentences, return_tensors="pt")
decoded = tokenizer.decode(inputs["input_ids"][0])
and I get:
[CLS] hello, i'm testing this efauenufefu [SEP]
But the encoding clearly suggests otherwise: the nonsense at the end was indeed broken up into pieces...
In [4]: inputs
{'input_ids': tensor([[ 101, 19082, 117, 178, 112, 182, 5193, 1142, 174, 8057,
23404, 16205, 11470, 1358, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
I also tried BertTokenizerFast, which, unlike BertTokenizer, lets you specify the wordpiece prefix:
tokenizer2 = BertTokenizerFast("bert-base-cased-vocab.txt", wordpieces_prefix = "##")
batch_sentences = ["hello, i'm testing this efauenufefu"]
inputs = tokenizer2(batch_sentences, return_tensors="pt")
decoded = tokenizer2.decode(inputs["input_ids"][0])
Yet decoding gave me exactly the same result...
[CLS] hello, i'm testing this efauenufefu [SEP]
So, is there a way to use the pretrained Huggingface tokenizer with prefix, or must I train a custom tokenizer myself?
Maybe you are looking for tokenize:
from transformers import BertTokenizerFast
t = BertTokenizerFast.from_pretrained('bert-base-uncased')
t.tokenize("hello, i'm testing this efauenufefu")
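You can also get a mapping from each token to its respective word, among other things. For the encoding o below, that mapping is the list of word indices shown after it (one entry per token):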
o = t("hello, i'm testing this efauenufefu", add_special_tokens=False, return_attention_mask=False, return_token_type_ids=False)
[0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7]
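With such a mapping, one embedding per word can be obtained by averaging the vectors of tokens that share a word index. The sketch below does this in plain Python on made-up vectors; average_by_word is a hypothetical helper, and the token vectors are dummy values rather than real BERT outputs. With a fast tokenizer, I believe the index list itself is available from the encoding via word_ids() in recent Transformers versions.

```python
from collections import defaultdict


def average_by_word(token_vectors, word_ids):
    """Average token vectors that share a word index; one vector per word."""
    groups = defaultdict(list)
    for vector, word_id in zip(token_vectors, word_ids):
        if word_id is not None:  # special tokens such as [CLS] map to no word
            groups[word_id].append(vector)
    return [
        [sum(column) / len(vectors) for column in zip(*vectors)]
        for _, vectors in sorted(groups.items())
    ]


# Word indices from the example above: the last six tokens belong to word 7.
word_ids = [0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7]
# Dummy 2-dimensional "embeddings", one per token.
token_vectors = [[float(i), 1.0] for i in range(len(word_ids))]

word_vectors = average_by_word(token_vectors, word_ids)
print(len(word_vectors))  # 8
print(word_vectors[-1])   # [9.5, 1.0]
```

Summing instead of averaging only requires dropping the division by len(vectors).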

Pytorch how to reshape/reduce the number of filters without altering the shape of the individual filters

With a 3D tensor of shape (number of filters, height, width), how can one reduce the number of filters with a reshape which keeps the original filters together as whole blocks?
Assume the new size has dimensions chosen such that a whole number of the original filters can fit side by side in one of the new filters. So an original size of (4, 2, 2) can be reshaped to (2, 2, 4).
A visual explanation of the side by side reshape where you see the standard reshape will alter the individual filter shapes:
I have tried various PyTorch functions such as gather and index_select but have not found a way to reach the end result in a general manner (i.e. one that works for different numbers of filters and different filter sizes).
I think it would be easier to rearrange the tensor values after performing the reshape but could not get a tensor of the pytorch reshaped form:
for completeness, the original tensor before reshaping:
Another option is to construct a list of parts and concatenate them
import torch
x = torch.arange(4).reshape(4, 1, 1).repeat(1, 2, 2)
y = torch.cat([x[i::2] for i in range(2)], dim=2)
print('Before\n', x)
print('After\n', y)
which gives
tensor([[[0, 0],
[0, 0]],
[[1, 1],
[1, 1]],
[[2, 2],
[2, 2]],
[[3, 3],
[3, 3]]])
tensor([[[0, 0, 1, 1],
[0, 0, 1, 1]],
[[2, 2, 3, 3],
[2, 2, 3, 3]]])
Or a little more generally we could write a function that takes groups of neighbors along a source dimension and concatenates them along a destination dimension
def group_neighbors(x, group_size, src_dim, dst_dim):
    assert x.shape[src_dim] % group_size == 0
    parts = []
    for i in range(group_size):
        # take every group_size-th slice along src_dim, starting at offset i
        index = [slice(None)] * x.dim()
        index[src_dim] = slice(i, None, group_size)
        parts.append(x[tuple(index)])
    return torch.cat(parts, dim=dst_dim)
x = torch.arange(4).reshape(4, 1, 1).repeat(1, 2, 2)
# read as "take neighbors in groups of 2 from dimension 0 and concatenate them in dimension 2"
y = group_neighbors(x, group_size=2, src_dim=0, dst_dim=2)
print('Before\n', x)
print('After\n', y)
You could do it by chunking tensor and then recombining.
def side_by_side_reshape(x):
    n_pairs = x.shape[0] // 2
    filter_size = x.shape[-1]
    x = x.reshape((n_pairs, 2, filter_size, filter_size))
    # place the two filters of each pair side by side, then restack the pairs
    return torch.stack(list(map(lambda pair: torch.hstack(pair.unbind()), x)))
>>> p = torch.arange(1, 91).reshape((10, 3, 3))
>>> side_by_side_reshape(p)
tensor([[[ 1, 2, 3, 10, 11, 12],
[ 4, 5, 6, 13, 14, 15],
[ 7, 8, 9, 16, 17, 18]],
[[19, 20, 21, 28, 29, 30],
[22, 23, 24, 31, 32, 33],
[25, 26, 27, 34, 35, 36]],
[[37, 38, 39, 46, 47, 48],
[40, 41, 42, 49, 50, 51],
[43, 44, 45, 52, 53, 54]],
[[55, 56, 57, 64, 65, 66],
[58, 59, 60, 67, 68, 69],
[61, 62, 63, 70, 71, 72]],
[[73, 74, 75, 82, 83, 84],
[76, 77, 78, 85, 86, 87],
[79, 80, 81, 88, 89, 90]]])
I know this is not ideal, since it relies on map, list and unbind, which disrupt the memory layout. This is what I can offer until I figure out how to do it with views only (that is, a real reshape).
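One possible way to avoid the Python-level loop is to split the filter axis into (groups, group_size), move the group_size axis next to the width axis with permute, and then merge those two axes. The sketch below assumes a 3D input of shape (number of filters, height, width); side_by_side is a hypothetical name. Note that the final reshape generally has to copy, because the permuted tensor is no longer contiguous, so this is not a pure view either.

```python
import torch


def side_by_side(x, group_size):
    """Place every group_size consecutive filters next to each other along the width."""
    n, h, w = x.shape
    assert n % group_size == 0
    # (n, h, w) -> (groups, group_size, h, w)
    grouped = x.reshape(n // group_size, group_size, h, w)
    # (groups, group_size, h, w) -> (groups, h, group_size, w) -> (groups, h, group_size * w)
    return grouped.permute(0, 2, 1, 3).reshape(n // group_size, h, group_size * w)


x = torch.arange(4).reshape(4, 1, 1).repeat(1, 2, 2)
y = side_by_side(x, group_size=2)
print(y.shape)  # torch.Size([2, 2, 4])
print(y)
```

Because each output row is built from whole rows of the original filters, the individual filters stay intact, and the function works for any group size that divides the number of filters.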

How to slice a Tensor by another Tensor in Keras?

I have a Keras network with two inputs:
image of shape (128, 128, 3)
bounding-box of shape (4), i.e. (x0, y0, x1, y1)
In my network definition, I need to extract the image patch defined by the bounding box from the input image, but I do not know how (my attempts did not work). Here is my current attempt. Can someone please help me understand how to slice tensors by the values of other tensors in Keras?
# get masked image and bounding box information as inputs
masked_img = Input(shape=self.input_shape)
mask_bounding_box = Input(shape=(4,))
# fill in the masked region and extract the fill-in region
filled_img = self.generator(masked_img)
fill_in = K.slice(filled_img, (int(mask_bounding_box[0]), int(mask_bounding_box[1])),
(int(mask_bounding_box[2]), int(mask_bounding_box[3])))
Does anybody know how to do this? Any hint in the right direction would help me, please ...
Thanks in advance!
Here's a native NumPy solution.
import numpy as np
a = np.arange(48).reshape(3,4,4)
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]],
[[16, 17, 18, 19],
[20, 21, 22, 23],
[24, 25, 26, 27],
[28, 29, 30, 31]],
[[32, 33, 34, 35],
[36, 37, 38, 39],
[40, 41, 42, 43],
[44, 45, 46, 47]]])
box = (1,1,2,2) # slicing from (1,1) to (2,2)
b = a[:, box[0]:box[2]+1, box[1]:box[3]+1] # slicing on all channels
array([[[ 5, 6],
[ 9, 10]],
[[21, 22],
[25, 26]],
[[37, 38],
[41, 42]]])
K.slice() from keras.backend requires a start index and a size for each dimension, so you could do it like this:
import keras.backend as K
start = (0, 1, 1)  # 1st channel, x1, y1
sizes = (3, 2, 2)  # number of channels, x2-x1+1, y2-y1+1
# sess is assumed to be an existing TensorFlow session
with sess.as_default():
    b = K.slice(a, start, sizes)
    print(b.eval())
[[[ 5 6]
[ 9 10]]
[[21 22]
[25 26]]
[[37 38]
[41 42]]]
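The conversion from an inclusive bounding box to the start and size arguments can be sketched in plain Python as follows. box_to_start_and_size and slice_nested are hypothetical helpers introduced for illustration; the sketch assumes the channels-first layout and fixed (non-tensor) box values used in the example above, and it checks the arithmetic on nested lists rather than calling Keras.

```python
def box_to_start_and_size(box, n_channels):
    """Convert an inclusive (x0, y0, x1, y1) box to channels-first (start, size)."""
    x0, y0, x1, y1 = box
    start = (0, x0, y0)
    size = (n_channels, x1 - x0 + 1, y1 - y0 + 1)
    return start, size


def slice_nested(a, start, size):
    """Pure-Python equivalent of a start/size slice on a 3-level nested list."""
    (c0, r0, k0), (nc, nr, nk) = start, size
    return [
        [row[k0:k0 + nk] for row in channel[r0:r0 + nr]]
        for channel in a[c0:c0 + nc]
    ]


# Same values as np.arange(48).reshape(3, 4, 4), built without NumPy.
a = [[[c * 16 + r * 4 + k for k in range(4)] for r in range(4)] for c in range(3)]

start, size = box_to_start_and_size((1, 1, 2, 2), n_channels=3)
print(start, size)                   # (0, 1, 1) (3, 2, 2)
print(slice_nested(a, start, size))  # [[[5, 6], [9, 10]], [[21, 22], [25, 26]], [[37, 38], [41, 42]]]
```

When the box arrives as a symbolic input tensor, as in the question, the start and size values are not known at graph-construction time, so they would have to be computed with tensor operations inside the model rather than with int().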

Method of vectors in various vector length to fixed length (NLP)

Recently I have been reading about Natural Language Processing, its vectorization methods, and the advantages of each vectorizer.
I am interested in character-level vectorization, but the main concern with vectorizing each word character by character seems to be giving the embedding a fixed length.
I do not want to simply fill the remainder with 0s, which is well known as zero padding. For instance, if the target fixed length is 100 and only 72 characters exist, then 28 zeros are padded at the end.
"The example of paragraphs and phrases.... ... in vectorizer form" < with length 72
[0, 25, 60, 12, 24, 0, 19, 99, 7, 32, 47, 11, 19, 43, 18, 19, 6, 25,
43, 99, 0, 32, 40, 14, 20, 5, 37, 47, 99, 11, 29, 7, 19, 47, 18, 20,
60, 18, 19, 2, 19, 11, 31, 130, 130, 76, 0, 32, 40, 14, 20, 7, 19, 47,
18, 20, 60, 11, 37, 43, 99, 11, 29, 99, 17, 39, 47, 11, 31, 18, 19,
43, 0, 19, 77, 0, 0, 0, 0, 0, 0, 0, 0, ...., 0, 0, 0, 0, 0, 0]
I want the vectors to be fairly distributed across N fixed dimensions, unlike the one above.
If you know of any papers or algorithms that address this, or a common way to produce fixed-length vectors from variable-length vectors, please share.
Further information added as gojomo requested:
I am trying to get character-level vectors for the words in a corpus.
Let's say, in the above example, "The example of paragraphs...." starts with
T [40]
h [17]
e [3]
e [3]
x [53]
a [1]
m [21]
p [25]
l [14]
e [3]
Notice that each character has its own number (it could be ASCII, etc.) and each word is represented by the combination of its characters' numbers, for example,
The [40, 17, 3]
example [3, 53, 1, 21, 25, 14, 3]
so the vectors do not have the same dimension. In the case mentioned above, many people pad 0s at the end to make them a uniform size.
For example, if someone wants each word to have dimension 300, then 297 zeros will be padded to the word "The" and 293 zeros will be padded to "example", like
The [40, 17, 3, 0, 0, 0, 0, 0, ...., 0]
example [3, 53, 1, 21, 25, 14, 3, 0, 0, 0, 0, 0, ...., 0]
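For reference, the zero padding described above can be written as the following minimal sketch; pad_to_length is a hypothetical helper name, and the character codes are the ones from the example.

```python
def pad_to_length(codes, target_length, pad_value=0):
    """Right-pad a list of character codes with pad_value up to target_length."""
    if len(codes) > target_length:
        raise ValueError("sequence is longer than the target length")
    return codes + [pad_value] * (target_length - len(codes))


the = pad_to_length([40, 17, 3], 300)
example = pad_to_length([3, 53, 1, 21, 25, 14, 3], 300)

print(len(the), the.count(0))          # 300 297
print(len(example), example.count(0))  # 300 293
```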
I do not think this padding method is appropriate for my experiments, so I want to know if there are any methods to convert these vectors to a uniform length in a non-sparse form (if this term is allowed).
Even a phrase with two words, "The example", is only 11 characters long, which is still not long enough.
In any case, I would like to know if there are well-known techniques to convert variable-length vectors to a fixed length.
Thank you !
