huggingface transformers: truncation strategy in encode_plus - pytorch

encode_plus in huggingface's transformers library allows truncation of the input sequence. Two parameters are relevant: truncation and max_length. I'm passing a paired input sequence to encode_plus and need to truncate the input sequence simply in a "cut off" manner, i.e., if the whole sequence consisting of both inputs text and text_pair is longer than max_length it should just be truncated correspondingly from the right.
It seems that none of the truncation strategies allows this. Instead, longest_first removes tokens from the longest sequence (which could be either text or text_pair, but not simply from the right or end of the overall sequence; e.g., if text is longer than text_pair, it seems this would remove tokens from text first), only_first and only_second remove tokens from only the first or second sequence (hence also not simply from the end), and do_not_truncate does not truncate at all. Or did I misunderstand this, and longest_first might actually be what I'm looking for?

No, longest_first is not the same as cutting from the right. When you set the truncation strategy to longest_first, the tokenizer will compare the lengths of text and text_pair every time a token needs to be removed, and remove a token from the longer one. This could, for example, mean that it first cuts 3 tokens from text_pair and then cuts the remaining tokens that need to be removed alternately from text and text_pair. An example:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
seq1 = 'This is a long uninteresting text'
seq2 = 'What could be a second sequence to the uninteresting text'
print(len(tokenizer.tokenize(seq1)))
print(len(tokenizer.tokenize(seq2)))
print(tokenizer(seq1, seq2))
print(tokenizer(seq1, seq2, truncation=True, max_length=15))
print(tokenizer.decode(tokenizer(seq1, seq2, truncation=True, max_length=15)['input_ids']))
Output:
9
13
{'input_ids': [101, 2023, 2003, 1037, 2146, 4895, 18447, 18702, 3436, 3793, 102, 2054, 2071, 2022, 1037, 2117, 5537, 2000, 1996, 4895, 18447, 18702, 3436, 3793, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2023, 2003, 1037, 2146, 4895, 18447, 102, 2054, 2071, 2022, 1037, 2117, 5537, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] this is a long unint [SEP] what could be a second sequence [SEP]
As far as I can tell from your question, you are actually looking for only_second because it cuts from the right (which is text_pair):
print(tokenizer(seq1, seq2, truncation='only_second', max_length=15))
Output:
{'input_ids': [101, 2023, 2003, 1037, 2146, 4895, 18447, 18702, 3436, 3793, 102, 2054, 2071, 2022, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
It throws an exception when your text input alone is already longer than the specified max_length. That is correct in my opinion, because in that case the input is no longer a sequence pair.
Just in case only_second doesn't meet your requirements, you can simply create your own truncation strategy. As an example, here is only_second by hand:
myMax_len = 15  # the desired maximum total length, as in the examples above
tok_seq1 = tokenizer.tokenize(seq1)
tok_seq2 = tokenizer.tokenize(seq2)
maxLengthSeq2 = myMax_len - len(tok_seq1) - 3  # 3 special tokens in a BERT sequence pair
if len(tok_seq2) > maxLengthSeq2:
    tok_seq2 = tok_seq2[:maxLengthSeq2]
input_ids = [tokenizer.cls_token_id]
input_ids += tokenizer.convert_tokens_to_ids(tok_seq1)
input_ids += [tokenizer.sep_token_id]
token_type_ids = [0]*len(input_ids)
input_ids += tokenizer.convert_tokens_to_ids(tok_seq2)
input_ids += [tokenizer.sep_token_id]
token_type_ids += [1]*(len(tok_seq2)+1)
attention_mask = [1]*len(input_ids)
print(input_ids)
print(token_type_ids)
print(attention_mask)
Output:
[101, 2023, 2003, 1037, 2146, 4895, 18447, 18702, 3436, 3793, 102, 2054, 2071, 2022, 102]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
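And if only_second still doesn't fit because you really want a strict "cut from the right over the combined sequence", here is a minimal sketch along the same lines (variable names are illustrative; note that when seq1 alone exhausts the budget, the second segment ends up empty and you may want to drop its trailing [SEP]):
myMax_len = 15
tok_seq1 = tokenizer.tokenize(seq1)
tok_seq2 = tokenizer.tokenize(seq2)
budget = myMax_len - 3                  # reserve [CLS] and two [SEP] tokens
keep1 = tok_seq1[:budget]               # the first sequence may use the whole budget
keep2 = tok_seq2[:budget - len(keep1)]  # the second sequence gets whatever is left
input_ids = ([tokenizer.cls_token_id]
             + tokenizer.convert_tokens_to_ids(keep1)
             + [tokenizer.sep_token_id]
             + tokenizer.convert_tokens_to_ids(keep2)
             + [tokenizer.sep_token_id])
token_type_ids = [0]*(len(keep1) + 2) + [1]*(len(keep2) + 1)
attention_mask = [1]*len(input_ids)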

Related

How to do BERT tokenization in batches and concatenate results in dictionary with tensor lists?

I am tokenizing sequences in batches and storing the results in a dictionary. How can I combine the tensor dictionaries into a single dictionary at the end?
Here is my code:
sequence = ["hello world", "I am","hiiiiiiiiiii","whyyyyyyy", "hello", "hi", "what", "hiiiii", "please", "heuuuuu", "whuuuuuu", "why"]
tokenizer = tr.XLMRobertaTokenizer.from_pretrained("nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large")
orig={}
def chunks(lst, n):
"""Yield successive n-sized chunks from lst."""
for i in range(0, len(lst), n):
yield lst[i:i + n]
for c in chunks(sequence, 6):
print(c)
train_encodings = tokenizer.batch_encode_plus(c, truncation=True, padding=True, max_length=512, return_tensors="pt")
print(train_encodings)
orig.update(train_encodings)
The problem is that every batch has different dimensions of input_ids and attention_mask, since the pad token is added based on the longest sequence in that batch.
{'input_ids': tensor([[ 0, 33600, 31, 8999, 2, 1],
[ 0, 87, 444, 2, 1, 1],
[ 0, 1274, 195922, 153869, 2, 1],
[ 0, 15400, 34034, 34034, 34034, 2],
[ 0, 33600, 31, 2, 1, 1],
[ 0, 1274, 2, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0]])}
{'input_ids': tensor([[ 0, 2367, 2, 1, 1, 1],
[ 0, 1274, 153869, 2, 1, 1],
[ 0, 22936, 2, 1, 1, 1],
[ 0, 71570, 125489, 2, 1, 1],
[ 0, 148, 1132, 125489, 34, 2],
[ 0, 15400, 2, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 0, 0, 0]])}
To avoid duplicates I followed this link: Updating a dictionary in python, but I was not able to solve this.
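One way around this, sketched under the assumption that padding every batch to the same fixed length is acceptable: use padding='max_length' so all batches share the sequence dimension, then concatenate along the batch dimension instead of calling dict.update (which overwrites the previous batch):
import torch

merged = {}
for c in chunks(sequence, 6):
    enc = tokenizer.batch_encode_plus(c, truncation=True, padding='max_length', max_length=512, return_tensors="pt")
    for key, tensor in enc.items():
        # concatenate along dim 0; plain dict.update would replace the previous batch
        merged[key] = torch.cat([merged[key], tensor]) if key in merged else tensor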

Why do other values change in an ndarray when I try to change a specific cell value?

For example, I have a 3D ndarray of shape (10,10,10), and whenever I try to change all the cells in the section [5,:,9] to a specific single value, I end up changing values in the section [4,:,9] too, which makes no sense to me. I do not get this behavior when I convert to a list of lists.
I use a simple for loop:
for i in range(0, 10):
    matrix[5, i, 9] = matrix[5, 9, 9]
Is there any way to avoid this? I don't want to convert back and forth between a list of lists and an ndarray, as it takes too much processing time.
Doesn't happen that way for me:
In [232]: arr = np.ones((10,10,10),int)
In [233]: arr[5,9,9] = 10
In [234]: for i in range(10): arr[5,i,9]=arr[5,9,9]
In [235]: arr[5,:,9]
Out[235]: array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
In [236]: arr[4,:,9]
Out[236]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
or assigning a whole "column" at once:
In [237]: arr[5,:,9] = np.arange(10)
In [239]: arr[5]
Out[239]:
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 3],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 4],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 5],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 6],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 7],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 8],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 9]])
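If you do see neighbouring slices change together, one hypothetical cause worth checking is that your array is a view whose slices share memory (e.g. created via broadcasting or stride tricks rather than freshly allocated); np.shares_memory gives a quick diagnostic:
import numpy as np

# a broadcast view: every slice along axis 0 aliases the same memory (and is read-only)
base = np.zeros((10, 10))
aliased = np.broadcast_to(base, (10, 10, 10))
print(np.shares_memory(aliased[4], aliased[5]))  # True

# a freshly allocated array has independent slices
fresh = np.zeros((10, 10, 10))
print(np.shares_memory(fresh[4], fresh[5]))      # False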

Python: Converting Binary to Decimal

What I'm currently doing is an implementation of genetic algorithms. I have written my crossover and mutation methods, and I'm now writing my fitness method.
I need to convert my list of 0s and 1s to decimal values for calculating distance.
My current output that I'm working with is a list of integer values of 1s and 0s (example below):
[[0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1], [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1]]
<class 'list'>
I want to convert these numbers to their respective decimal equivalents.
I have tried converting the list into groups of 4 and then calling a binaryToDecimal function to convert the bits to decimal values. However, I'm getting the error 'TypeError: 'numpy.ndarray' object is not callable'.
I have summarized my code and this is what it looks like so far.
def converting_binary_to_decimal(L):
    output = []
    for l in L:
        l = list(map(str, l))
        sub_output = []
        for j in range(0, len(l)-1, 4):
            sub_output.append(int(''.join(l[j:j+4]), 2))
        output.append(sub_output)
    return output

def chunks(L, n):
    for i in range(0, len(L), n):
        yield L[i:i+n]

def fitness(child):
    newList1 = list(chunks(child[0], 4))
    newList2 = list(chunks(child[1], 4))

if __name__ == "__main__":
    myFitness = fitness(afterMU)
A sample output of what I want is:
[[0, 13, 6, 8, 12, 8, 10, 9, 15], [0, 8, 7, 0, 4, 4, 1, 8, 15]]
Try this code.
def converting_binary_to_decimal(L):
    output = []
    for l in L:
        l = list(map(str, l))
        sub_output = []
        for j in range(0, len(l)-1, 4):
            sub_output.append(int(''.join(l[j:j+4]), 2))
        output.append(sub_output)
    return output
L = [[0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1], [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]]
converting_binary_to_decimal(L)
I think I figured it out.
x = [0, 1, 1, 0]
k = 4
n = len(x)//k
for i in range(n):
    y = x[i*k:(i+1)*k]
    y = [str(j) for j in y]
    y = ''.join(y)
    y = int(y, 2)
    print(y)
Thank you.
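As a side note, the same per-chunk conversion can be written more compactly (a sketch; bits_to_decimals is just an illustrative name):
def bits_to_decimals(bits, k=4):
    """Convert a flat list of 0/1 ints into decimals, k bits per value."""
    return [int(''.join(map(str, bits[i:i+k])), 2) for i in range(0, len(bits), k)]

print(bits_to_decimals([0, 0, 0, 0, 1, 1, 0, 1]))  # [0, 13]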

numpy packbits pack to uint16 array

I've got a 3D numpy bit array, and I need to pack the bits along the third axis, which is exactly what numpy.packbits does. Unfortunately it packs only to uint8, but I need more data; is there a similar way to pack it to uint16 or uint32?
Depending on your machine's endianness it is either a matter of simple view casting or of byte swapping and then view casting:
>>> a = np.random.randint(0, 2, (4, 16))
>>> a
array([[1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0],
[0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1],
[1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1]])
>>> np.packbits(a.reshape(-1, 2, 8)[:, ::-1]).view(np.uint16)
array([53226, 23751, 25853, 64619], dtype=uint16)
# check:
>>> [bin(x + (1<<16))[-16:] for x in _]
['1100111111101010', '0101110011000111', '0110010011111101', '1111110001101011']
You may have to reshape in the end.
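The same pattern should extend to uint32 by grouping four bytes per value instead of two (an untested sketch following the little-endian example above):
>>> b = np.random.randint(0, 2, (4, 32))
>>> np.packbits(b.reshape(-1, 4, 8)[:, ::-1]).view(np.uint32)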

Conversion of numpy 2d array to ENVI binary file through gdal

I have SAR CEOS format files which consist of data file, leader file, null volume directory file and volume directory file.
I am reading the data file using gdal ReadAsArray and then I am doing operations on this 2d Array and now I want to save this 2d array as an ENVI binary file.
Kindly guide how to do this in Python 3.5.
You can find help at this tutorial website: https://pcjericks.github.io/py-gdalogr-cookbook/
For example:
import gdal, ogr, os, osr
import numpy as np

def array2raster(newRasterfn, rasterOrigin, pixelWidth, pixelHeight, array):
    cols = array.shape[1]
    rows = array.shape[0]
    originX = rasterOrigin[0]
    originY = rasterOrigin[1]
    driver = gdal.GetDriverByName('ENVI')
    outRaster = driver.Create(newRasterfn, cols, rows, 1, gdal.GDT_Byte)
    outRaster.SetGeoTransform((originX, pixelWidth, 0, originY, 0, pixelHeight))
    outband = outRaster.GetRasterBand(1)
    outband.WriteArray(array)
    outRasterSRS = osr.SpatialReference()
    outRasterSRS.ImportFromEPSG(4326)
    outRaster.SetProjection(outRasterSRS.ExportToWkt())
    outband.FlushCache()

def main(newRasterfn, rasterOrigin, pixelWidth, pixelHeight, array):
    reversed_arr = array[::-1]  # reverse array so the tif looks like the array
    array2raster(newRasterfn, rasterOrigin, pixelWidth, pixelHeight, reversed_arr)  # convert array to raster

if __name__ == "__main__":
    rasterOrigin = (-123.25745, 45.43013)
    pixelWidth = 10
    pixelHeight = 10
    newRasterfn = 'test.tif'
    array = np.array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                      [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                      [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1],
                      [1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1],
                      [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1],
                      [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1],
                      [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1],
                      [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                      [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                      [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
    main(newRasterfn, rasterOrigin, pixelWidth, pixelHeight, array)
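To tie this to your workflow, a minimal sketch (the input and output paths are placeholders, and GDT_Float32 is an assumption; the cookbook's GDT_Byte would clip typical SAR values):
import gdal

# read the CEOS data file into a 2D numpy array
src = gdal.Open('path/to/ceos_data_file')  # placeholder path
arr = src.GetRasterBand(1).ReadAsArray()

# ... your operations on arr ...

# write the result as an ENVI binary file, copying the georeferencing
driver = gdal.GetDriverByName('ENVI')
out = driver.Create('output_envi', src.RasterXSize, src.RasterYSize, 1, gdal.GDT_Float32)
out.SetGeoTransform(src.GetGeoTransform())
out.SetProjection(src.GetProjection())
out.GetRasterBand(1).WriteArray(arr)
out.FlushCache()
out = None  # de-reference to close and flush the file to disk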
