How does Huggingface's tokenizers tokenize non-English characters? - nlp

I use tokenizers to tokenize natural language sentences into tokens, but I came up with some questions.
Here are some examples I tried:
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer("是")
# {'input_ids': [42468], 'attention_mask': [1]}
tokenizer("我说你倒是快点啊")
# {'input_ids': [22755, 239, 46237, 112, 19526, 254, 161, 222, 240, 42468, 33232, 104, 163, 224, 117, 161, 243, 232], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokenizer("東")
# {'input_ids': [30266, 109], 'attention_mask': [1, 1]}
tokenizer("東京")
# {'input_ids': [30266, 109, 12859, 105], 'attention_mask': [1, 1, 1, 1]}
tokenizer("東京メトロ")
# {'input_ids': [30266, 109, 12859, 105, 26998, 13298, 16253], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
tokenizer("メトロ")
# {'input_ids': [26998, 13298, 16253], 'attention_mask': [1, 1, 1]}
tokenizer("This is my fault")
# {'input_ids': [1212, 318, 616, 8046], 'attention_mask': [1, 1, 1, 1]}
The code above shows some examples I tried.
The last example is an English sentence, and I can understand that "This" corresponds to "This": 1212 in vocab.json and "is" corresponds to "\u0120is": 318.
But I cannot understand why this tool tokenizes non-English sequences into tokens I cannot find in the vocab.
For example:
東 is tokenized into 30266 and 109. The corresponding entries in vocab.json are "æĿ": 30266 and "±": 109.
メ is tokenized into 26998. The corresponding entry in vocab.json is "ãĥ¡": 26998.
I searched the Hugging Face documentation and website and found no clue.
And the source code is written in Rust, which is hard for me to understand.
So could you help me figure out why?
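For reference, here is a small sketch to inspect the mapping (assuming the slow GPT2Tokenizer, which exposes a byte_encoder table): GPT-2's BPE appears to operate on UTF-8 bytes that are first remapped to printable characters, which would explain why 東 shows up as "æĿ" and "±":
from transformers import GPT2Tokenizer, GPT2TokenizerFast
slow_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")    # slow tokenizer, exposes byte_encoder
fast_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print("東".encode("utf-8"))
# b'\xe6\x9d\xb1'  -- three UTF-8 bytes
print([slow_tokenizer.byte_encoder[b] for b in "東".encode("utf-8")])
# ['æ', 'Ŀ', '±']  -- each byte mapped to a printable character
print(fast_tokenizer.convert_ids_to_tokens([30266, 109]))
# ['æĿ', '±']  -- BPE merged the first two bytes into one token
print(fast_tokenizer.decode([30266, 109]))
# 東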

Related

Plot a dataframe based on specific group/id in Python

I have a dataset given as follows:
#Load the required libraries
import pandas as pd
import matplotlib.pyplot as plt
#Create dataset
data = {'id': [1, 1, 1, 1, 1, 1, 1,
               2, 2, 2, 2, 2, 2,
               3, 3, 3, 3, 3,
               4, 4, 4, 4,
               5, 5, 5, 5, 5, 5],
        'cycle': [1, 2, 3, 4, 5, 6, 7,
                  1, 2, 3, 4, 5, 6,
                  1, 2, 3, 4, 5,
                  1, 2, 3, 4,
                  1, 2, 3, 4, 5, 6],
        'Salary': [7, 7, 7, 7, 7, 7, 7,
                   4, 4, 4, 4, 4, 4,
                   8, 8, 8, 8, 8,
                   10, 10, 10, 10,
                   15, 15, 15, 15, 15, 15],
        'Jobs': [123, 18, 69, 65, 120, 11, 52,
                 96, 120, 10, 141, 52, 6,
                 101, 99, 128, 1, 141,
                 141, 123, 12, 66,
                 12, 128, 66, 100, 141, 52],
        'Days': [123, 128, 66, 66, 120, 141, 52,
                 96, 120, 120, 141, 52, 96,
                 15, 123, 128, 120, 141,
                 141, 123, 128, 66,
                 123, 128, 66, 120, 141, 52],
        }
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
In order to plot 'cycle' vs 'Salary' for id = 1, I have used the following code:
plt.plot(df.groupby(by="id").get_group(1)['cycle'], df.groupby(by="id").get_group(1)['Salary'], label = 'id=1')
plt.xlabel('cycle')
plt.ylabel('Salary')
plt.legend()
plt.xlim(0, 10)
plt.ylim(0, 20)
plt.show()
However, instead of only id = 1, I wish to plot 'cycle' vs 'Salary' for all ids in one single plot, with one line per id. Can somebody please let me know how to achieve this in Python?
Use a pivot:
ax = df.pivot(index='cycle', columns='id', values='Salary').plot()
# display
ax.set_ylim(bottom=0)
ax.set_xlim(left=0)
ax.set_ylabel('Salary')
This draws one line per id. Alternatively, swapping the axes, using seaborn.lineplot:
import seaborn as sns
sns.lineplot(data=df, x='Salary', y='cycle', hue='id',
             palette='Set1', estimator=None)
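For reference, a plain matplotlib loop over groupby (a sketch, not part of the original answers) also draws one line per id:
import matplotlib.pyplot as plt
# one line per id, assuming the df defined in the question
fig, ax = plt.subplots()
for id_, grp in df.groupby('id'):
    ax.plot(grp['cycle'], grp['Salary'], label=f'id={id_}')
ax.set_xlabel('cycle')
ax.set_ylabel('Salary')
ax.legend()
plt.show()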

Is there a way to use Huggingface pretrained tokenizer with wordpiece prefix?

I'm doing a sequence labeling task with BERT. In order to align the word pieces with labels, I need some marker to identify them so I can get a single embedding for each word by either summing or averaging.
For example, I want the word New~york tokenized into New ##~ ##york. Looking at some old examples on the internet, that is what you used to get from BertTokenizer, but clearly not anymore (says their documentation).
So when I run:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
batch_sentences = ["hello, i'm testing this efauenufefu"]
inputs = tokenizer(batch_sentences, return_tensors="pt")
decoded = tokenizer.decode(inputs["input_ids"][0])
print(decoded)
and I get:
[CLS] hello, i'm testing this efauenufefu [SEP]
But the encoding clearly suggests otherwise: the nonsense at the end was indeed broken up into pieces...
In [4]: inputs
Out[4]:
{'input_ids': tensor([[ 101, 19082, 117, 178, 112, 182, 5193, 1142, 174, 8057,
23404, 16205, 11470, 1358, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
I also tried BertTokenizerFast, which, unlike BertTokenizer, allows you to specify a wordpiece prefix:
tokenizer2 = BertTokenizerFast("bert-base-cased-vocab.txt", wordpieces_prefix = "##")
batch_sentences = ["hello, i'm testing this efauenufefu"]
inputs = tokenizer2(batch_sentences, return_tensors="pt")
decoded = tokenizer2.decode(inputs["input_ids"][0])
print(decoded)
Yet the decoder gave me exactly the same...
[CLS] hello, i'm testing this efauenufefu [SEP]
So, is there a way to use the pretrained Huggingface tokenizer with a wordpiece prefix, or must I train a custom tokenizer myself?
Maybe you are looking for tokenize:
from transformers import BertTokenizerFast
t = BertTokenizerFast.from_pretrained('bert-base-uncased')
t.tokenize("hello, i'm testing this efauenufefu")
Output:
['hello',
',',
'i',
"'",
'm',
'testing',
'this',
'e',
'##fa',
'##uen',
'##uf',
'##ef',
'##u']
You can also get a mapping of each token to the respective word and other things:
o = t("hello, i'm testing this efauenufefu", add_special_tokens=False, return_attention_mask=False, return_token_type_ids=False)
o.words()
Output:
[0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7]
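If the goal is one embedding per word (summing or averaging the pieces, as described in the question), this mapping can drive the pooling. A minimal sketch, assuming a plain BertModel and mean pooling:
import torch
from transformers import BertTokenizerFast, BertModel

tok = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

enc = tok("hello, i'm testing this efauenufefu", return_tensors="pt")
word_ids = enc.word_ids()  # one entry per token, None for [CLS]/[SEP]
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)

# average the piece vectors that belong to the same word
words = sorted({w for w in word_ids if w is not None})
word_vectors = torch.stack([
    hidden[[i for i, w in enumerate(word_ids) if w == wid]].mean(dim=0)
    for wid in words
])
print(word_vectors.shape)  # (num_words, hidden_size)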

Get the length of every sentence before padding in torchtext bucketiterator

Is it possible to get the length of every sentence before padding in a torchtext BucketIterator:
train_loader = torchtext.legacy.data.BucketIterator(train_data, batch_size = 64, repeat=True, shuffle=True, sort_key = lambda x: len(x.text), sort=False, sort_within_batch=True, device = device)
BucketIterator dataloader:
inputs: tensor([[ 34, 87, 2, ..., 227, 239, 263],
[ 138, 7, 1006, ..., 840, 142, 665],
[ 549, 4, 1028, ..., 11, 14, 4],
...,
[ 1, 1, 5, ..., 66, 23, 13],
[ 1, 1, 1062, ..., 177, 252, 1587],
[ 1, 1, 66, ..., 553, 52, 73]]), shape: torch.Size([64, 91])
Like when using pytorch dataloader:
train_loader = data.DataLoader(train_data, batch_size = 64, shuffle=True, collate_fn=padding)
def padding(batch):
    doc = [doc['input'] for doc in batch]
    len_doc = [len(doc['input']) for doc in batch]
    doc_pad = pad_sequence(doc, batch_first=True, padding_value=0)
    return doc_pad, len_doc
pytorch dataloader:
inputs: tensor([[ 2, 1396, 2686, ..., 0, 0, 0],
[ 2, 1391, 1396, ..., 0, 0, 0],
[ 2, 2018, 2597, ..., 0, 0, 0],
...,
[ 2, 1546, 1623, ..., 0, 0, 0],
[ 2, 1435, 1396, ..., 0, 0, 0],
[ 2, 1391, 1396, ..., 0, 0, 0]]), shape: torch.Size([64, 40])
inputs_len_before_padding: tensor([18, 8, 21, 16, 16, 12, 40, 12, 9, 12, 17, 12, 17, 15, 16, 12, 8, 24,
25, 10, 22, 8, 8, 13, 12, 22, 17, 14, 21, 14, 19, 13, 21, 8, 28, 16,
31, 24, 23, 19, 10, 7, 16, 12, 16, 12, 17, 12, 18, 11, 8, 13, 17, 14,
11, 13, 13, 20, 8, 12, 22, 7, 9, 11]), shape: torch.Size([64])
Here is a minimal example that uses torchtext.data.Field and torchtext.data.BucketIterator:
import torchtext.data as data
# sample data
text = [
    'This is sentence 1.',
    'This sentence is a bit longer than the previous sentence.'
]
# define field -- notice include_lengths is set to True
text_field = data.Field(include_lengths=True, tokenize=lambda x: x.split())
fields = [('text', text_field)]
# create dataset and build vocabulary
examples = [data.Example.fromlist([t], fields) for t in text]
dataset = data.Dataset(examples, fields)
text_field.build_vocab(dataset)
# create iterator
data_iter = data.BucketIterator(dataset, batch_size=2, shuffle=False)
# the text field will now return both the data tensor and the length of the input text
for x in data_iter:
    print('Data:', x.text[0])
    print('Lengths:', x.text[1])
This should print (data tensor shortened for brevity):
Data: tensor([[ 2, 2],
...
[ 1, 10]])
Lengths: tensor([ 4, 10])

Pytorch how to reshape/reduce the number of filters without altering the shape of the individual filters

With a 3D tensor of shape (number of filters, height, width), how can one reduce the number of filters with a reshape which keeps the original filters together as whole blocks?
Assume the new size has dimensions chosen such that a whole number of the original filters can fit side by side in one of the new filters. So an original size of (4, 2, 2) can be reshaped to (2, 2, 4).
A visual, side-by-side comparison makes the difference clear: a standard reshape alters the shapes of the individual filters.
I have tried various pytorch functions such as gather and index_select but have not found a way to get to the end result in a general manner (i.e., one that works for different numbers of filters and different filter sizes).
I think it would be easier to rearrange the tensor values after performing the reshape, but I could not find a way to get from the pytorch-reshaped form:
[[[1,2,3,4],
[5,6,7,8]],
[[9,10,11,12],
[13,14,15,16]]]
to:
[[[1,2,5,6],
[3,4,7,8]],
[[9,10,13,14],
[11,12,15,16]]]
for completeness, the original tensor before reshaping:
[[[1,2],
[3,4]],
[[5,6],
[7,8]],
[[9,10],
[11,12]],
[[13,14],
[15,16]]]
Another option is to construct a list of parts and concatenate them
x = torch.arange(4).reshape(4, 1, 1).repeat(1, 2, 2)
y = torch.cat([x[i::2] for i in range(2)], dim=2)
print('Before\n', x)
print('After\n', y)
which gives
Before
tensor([[[0, 0],
[0, 0]],
[[1, 1],
[1, 1]],
[[2, 2],
[2, 2]],
[[3, 3],
[3, 3]]])
After
tensor([[[0, 0, 1, 1],
[0, 0, 1, 1]],
[[2, 2, 3, 3],
[2, 2, 3, 3]]])
Or a little more generally we could write a function that takes groups of neighbors along a source dimension and concatenates them along a destination dimension
def group_neighbors(x, group_size, src_dim, dst_dim):
    assert x.shape[src_dim] % group_size == 0
    return torch.cat([
        x[[slice(None)] * src_dim + [slice(i, None, group_size)] + [slice(None)] * (len(x.shape) - (src_dim + 2))]
        for i in range(group_size)
    ], dim=dst_dim)
x = torch.arange(4).reshape(4, 1, 1).repeat(1, 2, 2)
# read as "take neighbors in groups of 2 from dimension 0 and concatenate them in dimension 2"
y = group_neighbors(x, group_size=2, src_dim=0, dst_dim=2)
print('Before\n', x)
print('After\n', y)
You could do it by chunking the tensor and then recombining:
def side_by_side_reshape(x):
    n_pairs = x.shape[0] // 2
    filter_size = x.shape[-1]
    x = x.reshape((n_pairs, 2, filter_size, filter_size))
    return torch.stack(list(map(lambda pair: torch.hstack(pair.unbind()), x)))
>> p = torch.arange(1, 91).reshape((10, 3, 3))
>> side_by_side_reshape(p)
tensor([[[ 1, 2, 3, 10, 11, 12],
[ 4, 5, 6, 13, 14, 15],
[ 7, 8, 9, 16, 17, 18]],
[[19, 20, 21, 28, 29, 30],
[22, 23, 24, 31, 32, 33],
[25, 26, 27, 34, 35, 36]],
[[37, 38, 39, 46, 47, 48],
[40, 41, 42, 49, 50, 51],
[43, 44, 45, 52, 53, 54]],
[[55, 56, 57, 64, 65, 66],
[58, 59, 60, 67, 68, 69],
[61, 62, 63, 70, 71, 72]],
[[73, 74, 75, 82, 83, 84],
[76, 77, 78, 85, 86, 87],
[79, 80, 81, 88, 89, 90]]])
but I know it's not ideal since it uses map, list and unbind, which disrupts memory. This is what I can offer until I figure out how to do it via view only (i.e., a real reshape).
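For comparison, a reshape/permute version (a sketch, not from the original answers) produces the same side-by-side layout without a Python-level loop; note the final reshape still copies because permute makes the tensor non-contiguous, so it is not a pure view either:
import torch

def side_by_side(x, group_size=2):
    # (filters, H, W) -> (filters // group_size, H, W * group_size)
    f, h, w = x.shape
    assert f % group_size == 0
    return (x.reshape(f // group_size, group_size, h, w)
             .permute(0, 2, 1, 3)  # move H in front of the group axis
             .reshape(f // group_size, h, w * group_size))

x = torch.arange(1, 17).reshape(4, 2, 2)
print(side_by_side(x))
# tensor([[[ 1,  2,  5,  6],
#          [ 3,  4,  7,  8]],
#
#         [[ 9, 10, 13, 14],
#          [11, 12, 15, 16]]])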

huggingface transformers: truncation strategy in encode_plus

encode_plus in huggingface's transformers library allows truncation of the input sequence. Two parameters are relevant: truncation and max_length. I'm passing a paired input sequence to encode_plus and need to truncate the input sequence simply in a "cut off" manner, i.e., if the whole sequence consisting of both inputs text and text_pair is longer than max_length it should just be truncated correspondingly from the right.
It seems that neither of the truncation strategies allows me to do this. Instead, longest_first removes tokens from the longest sequence (which could be either text or text_pair, but not simply from the right or end of the overall sequence; e.g., if text is longer than text_pair, it seems this would remove tokens from text first), only_first and only_second remove tokens from only the first or second sequence (hence, also not simply from the end), and do_not_truncate does not truncate at all. Or did I misunderstand this, and longest_first might actually be what I'm looking for?
No, longest_first is not the same as cutting from the right. When you set the truncation strategy to longest_first, the tokenizer compares the lengths of text and text_pair every time a token needs to be removed and removes a token from the longer one. This could, for example, mean that it first cuts 3 tokens from text_pair and then removes the remaining tokens that need to be cut alternately from text and text_pair. An example:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
seq1 = 'This is a long uninteresting text'
seq2 = 'What could be a second sequence to the uninteresting text'
print(len(tokenizer.tokenize(seq1)))
print(len(tokenizer.tokenize(seq2)))
print(tokenizer(seq1, seq2))
print(tokenizer(seq1, seq2, truncation= True, max_length = 15))
print(tokenizer.decode(tokenizer(seq1, seq2, truncation= True, max_length = 15)['input_ids']))
Output:
9
13
{'input_ids': [101, 2023, 2003, 1037, 2146, 4895, 18447, 18702, 3436, 3793, 102, 2054, 2071, 2022, 1037, 2117, 5537, 2000, 1996, 4895, 18447, 18702, 3436, 3793, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2023, 2003, 1037, 2146, 4895, 18447, 102, 2054, 2071, 2022, 1037, 2117, 5537, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] this is a long unint [SEP] what could be a second sequence [SEP]
As far as I can tell from your question you are actually looking for only_second because it cuts from the right (which is text_pair):
print(tokenizer(seq1, seq2, truncation= 'only_second', max_length = 15))
Output:
{'input_ids': [101, 2023, 2003, 1037, 2146, 4895, 18447, 18702, 3436, 3793, 102, 2054, 2071, 2022, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
It throws an exception when your text input alone is already longer than the specified max_length. That is correct in my opinion, because in that case it is no longer a sequence-pair input.
Just in case only_second doesn't meet your requirements, you can simply create your own truncation strategy. As an example, here is only_second by hand:
myMax_len = 15  # desired total length, matching max_length=15 in the example above
tok_seq1 = tokenizer.tokenize(seq1)
tok_seq2 = tokenizer.tokenize(seq2)
maxLengthSeq2 = myMax_len - len(tok_seq1) - 3  # 3 special tokens for a bert sequence pair
if len(tok_seq2) > maxLengthSeq2:
    tok_seq2 = tok_seq2[:maxLengthSeq2]
input_ids = [tokenizer.cls_token_id]
input_ids += tokenizer.convert_tokens_to_ids(tok_seq1)
input_ids += [tokenizer.sep_token_id]
token_type_ids = [0]*len(input_ids)
input_ids += tokenizer.convert_tokens_to_ids(tok_seq2)
input_ids += [tokenizer.sep_token_id]
token_type_ids += [1]*(len(tok_seq2)+1)
attention_mask = [1]*len(input_ids)
print(input_ids)
print(token_type_ids)
print(attention_mask)
Output:
[101, 2023, 2003, 1037, 2146, 4895, 18447, 18702, 3436, 3793, 102, 2054, 2071, 2022, 102]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
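As a sanity check (not part of the original answer, and assuming myMax_len = 15 as above), the hand-built input_ids are identical to the ids returned by the truncation='only_second' call shown earlier:
print(input_ids == tokenizer(seq1, seq2, truncation='only_second', max_length=15)['input_ids'])
# True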
