PyTorch tokenizers: how to truncate tokens from left? - pytorch

As we can see in the code snippet below, specifying max_length and truncation for a tokenizer cuts excess tokens from the right:
tokenizer("hello, my name", truncation=True, max_length=6).input_ids
> [0, 42891, 6, 127, 766, 2]
tokenizer("hello, my name", truncation=True, max_length=4).input_ids
> [0, 42891, 6, 2]
The different truncation strategies like only_second, only_first, and longest_first all seem to cut from the right. Is there a way to cut from the left instead, so that the tokens would be [0, 127, 766, 2] in the second example?
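One approach that seems to fit (a sketch, not verified against the exact checkpoint used above, which looks like roberta-base given the 0/2 special tokens): recent versions of transformers expose a truncation_side setting, so the tokenizer can be told to drop excess tokens from the left instead of the right.
from transformers import AutoTokenizer

# truncation_side is assumed to be supported by the installed transformers version
tokenizer = AutoTokenizer.from_pretrained("roberta-base", truncation_side="left")
print(tokenizer("hello, my name", truncation=True, max_length=4).input_ids)
# expected to keep the rightmost tokens, e.g. [0, 127, 766, 2]

# the side can also be flipped on an already-loaded tokenizer:
tokenizer.truncation_side = "left"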

Related

How does Huggingface's tokenizers tokenize non-English characters?

I use tokenizers to tokenize natural language sentences into tokens.
But I came up with some questions.
Here are some examples I tried:
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer("是")
# {'input_ids': [42468], 'attention_mask': [1]}
tokenizer("我说你倒是快点啊")
# {'input_ids': [22755, 239, 46237, 112, 19526, 254, 161, 222, 240, 42468, 33232, 104, 163, 224, 117, 161, 243, 232], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokenizer("東")
# {'input_ids': [30266, 109], 'attention_mask': [1, 1]}
tokenizer("東京")
# {'input_ids': [30266, 109, 12859, 105], 'attention_mask': [1, 1, 1, 1]}
tokenizer("東京メトロ")
# {'input_ids': [30266, 109, 12859, 105, 26998, 13298, 16253], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
tokenizer("メトロ")
# {'input_ids': [26998, 13298, 16253], 'attention_mask': [1, 1, 1]}
tokenizer("This is my fault")
{'input_ids': [1212, 318, 616, 8046], 'attention_mask': [1, 1, 1, 1]}
The code above shows the examples I tried.
The last example is an English sentence, and I can understand it: "This" corresponds to "This": 1212 in vocab.json, and "is" corresponds to "\u0120is": 318.
But I cannot understand why this tool tokenizes non-English sequences into tokens I cannot find in the vocab.
For example:
東 is tokenized into 30266 and 109. The corresponding entries in vocab.json are "æĿ": 30266 and "±": 109.
メ is tokenized into 26998. The corresponding entry in vocab.json is "ãĥ¡": 26998.
I searched the Huggingface documentation and website and found no clue.
And the source code is written in Rust, which is hard for me to understand.
So could you help me figure out why?
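A likely explanation, with a small check (the bytes_to_unicode import path is an assumption about the transformers module layout): GPT-2 uses byte-level BPE, so every character is first split into its UTF-8 bytes and each byte is remapped to a printable unicode character before the BPE merges are applied. That is why 東, whose UTF-8 encoding is three bytes, appears in vocab.json as the byte-level strings "æĿ" and "±".
from transformers import GPT2TokenizerFast
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode  # assumed import path

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
byte_encoder = bytes_to_unicode()  # maps each byte value 0-255 to a printable character

print("東".encode("utf-8"))                                    # b'\xe6\x9d\xb1' -- three UTF-8 bytes
print("".join(byte_encoder[b] for b in "東".encode("utf-8")))  # 'æĿ±' -- the byte-level form
print(tokenizer.convert_ids_to_tokens([30266, 109]))           # ['æĿ', '±'] -- the vocab entries
print(tokenizer.decode([30266, 109]))                          # '東' -- decoding reverses the mapping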

Is there a way to use Huggingface pretrained tokenizer with wordpiece prefix?

I'm doing a sequence labeling task with Bert. In order to align the word pieces with labels, I need some marker to identify them so I can get a single embedding for each word by either summing or averaging.
For example, I want the word New~york tokenized into New ##~ ##york. Looking at some old examples on the internet, that was what you used to get with BertTokenizer, but clearly not anymore (according to their documentation).
So when I run:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
batch_sentences = ["hello, i'm testing this efauenufefu"]
inputs = tokenizer(batch_sentences, return_tensors="pt")
decoded = tokenizer.decode(inputs["input_ids"][0])
print(decoded)
and I get:
[CLS] hello, i'm testing this efauenufefu [SEP]
But the encoding clearly suggests otherwise: the nonsense at the end was indeed broken up into pieces...
In [4]: inputs
Out[4]:
{'input_ids': tensor([[  101, 19082,   117,   178,   112,   182,  5193,  1142,   174,  8057,
         23404, 16205, 11470,  1358,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
I also tried BertTokenizerFast, which, unlike BertTokenizer, allows you to specify a wordpiece prefix:
tokenizer2 = BertTokenizerFast("bert-base-cased-vocab.txt", wordpieces_prefix = "##")
batch_sentences = ["hello, i'm testing this efauenufefu"]
inputs = tokenizer2(batch_sentences, return_tensors="pt")
decoded = tokenizer2.decode(inputs["input_ids"][0])
print(decoded)
Yet the decoder gave me exactly the same...
[CLS] hello, i'm testing this efauenufefu [SEP]
So, is there a way to use the pretrained Huggingface tokenizer with prefix, or must I train a custom tokenizer myself?
Maybe you are looking for tokenize:
from transformers import BertTokenizerFast
t = BertTokenizerFast.from_pretrained('bert-base-uncased')
t.tokenize("hello, i'm testing this efauenufefu")
Output:
['hello',
',',
'i',
"'",
'm',
'testing',
'this',
'e',
'##fa',
'##uen',
'##uf',
'##ef',
'##u']
You can also get a mapping of each token to its corresponding word, among other things:
o = t("hello, i'm testing this efauenufefu", add_special_tokens=False, return_attention_mask=False, return_token_type_ids=False)
o.words()
Output:
[0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7]
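If the goal is one embedding per word, that token-to-word mapping can be used to pool the subword vectors. A minimal sketch (assuming torch and a bert-base-uncased checkpoint, neither of which is part of the question): average the hidden states of the pieces that share a word id.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

enc = tokenizer("hello, i'm testing this efauenufefu", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]   # (seq_len, hidden_size)

word_ids = enc.word_ids(0)                       # token position -> word index, None for [CLS]/[SEP]
words = sorted({w for w in word_ids if w is not None})
word_vectors = torch.stack([
    hidden[[i for i, w in enumerate(word_ids) if w == word]].mean(dim=0)
    for word in words
])                                               # (num_words, hidden_size)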

How do I turn a large multi-dimensional numpy ndarray into a string without truncation and save it to .txt in python3? [duplicate]

When I print a numpy array, I get a truncated representation, but I want the full array.
>>> numpy.arange(10000)
array([ 0, 1, 2, ..., 9997, 9998, 9999])
>>> numpy.arange(10000).reshape(250,40)
array([[ 0, 1, 2, ..., 37, 38, 39],
[ 40, 41, 42, ..., 77, 78, 79],
[ 80, 81, 82, ..., 117, 118, 119],
...,
[9880, 9881, 9882, ..., 9917, 9918, 9919],
[9920, 9921, 9922, ..., 9957, 9958, 9959],
[9960, 9961, 9962, ..., 9997, 9998, 9999]])
Use numpy.set_printoptions:
import sys
import numpy
numpy.set_printoptions(threshold=sys.maxsize)
import numpy as np
np.set_printoptions(threshold=np.inf)
I suggest using np.inf instead of np.nan which is suggested by others. They both work for your purpose, but by setting the threshold to "infinity" it is obvious to everybody reading your code what you mean. Having a threshold of "not a number" seems a little vague to me.
Temporary setting
You can use the printoptions context manager:
with numpy.printoptions(threshold=numpy.inf):
    print(arr)
(of course, replace numpy by np if that's how you imported numpy)
The use of a context manager (the with-block) ensures that after the context manager is finished, the print options will revert to whatever they were before the block started. It ensures the setting is temporary, and only applied to code within the block.
See numpy.printoptions documentation for details on the context manager and what other arguments it supports. It was introduced in NumPy 1.15 (released 2018-07-23).
The previous answers are the correct ones, but as a weaker alternative you can convert the array to a list:
>>> numpy.arange(100).reshape(25,4).tolist()
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19], [20, 21,
22, 23], [24, 25, 26, 27], [28, 29, 30, 31], [32, 33, 34, 35], [36, 37, 38, 39], [40, 41,
42, 43], [44, 45, 46, 47], [48, 49, 50, 51], [52, 53, 54, 55], [56, 57, 58, 59], [60, 61,
62, 63], [64, 65, 66, 67], [68, 69, 70, 71], [72, 73, 74, 75], [76, 77, 78, 79], [80, 81,
82, 83], [84, 85, 86, 87], [88, 89, 90, 91], [92, 93, 94, 95], [96, 97, 98, 99]]
Here is a one-off way to do this, which is useful if you don't want to change your default settings:
def fullprint(*args, **kwargs):
    from pprint import pprint
    import numpy
    opt = numpy.get_printoptions()
    numpy.set_printoptions(threshold=numpy.inf)
    pprint(*args, **kwargs)
    numpy.set_printoptions(**opt)
This sounds like you're using numpy.
If that's the case, you can add:
import numpy as np
np.set_printoptions(threshold=np.nan)
That will disable the corner printing. For more information, see this NumPy Tutorial.
Using a context manager as Paul Price suggested
import numpy as np
class fullprint:
    'context manager for printing full numpy arrays'
    def __init__(self, **kwargs):
        kwargs.setdefault('threshold', np.inf)
        self.opt = kwargs
    def __enter__(self):
        self._opt = np.get_printoptions()
        np.set_printoptions(**self.opt)
    def __exit__(self, type, value, traceback):
        np.set_printoptions(**self._opt)

if __name__ == '__main__':
    a = np.arange(1001)
    with fullprint():
        print(a)
    print(a)
    with fullprint(threshold=None, edgeitems=10):
        print(a)
numpy.savetxt
import sys
import numpy
numpy.savetxt(sys.stdout, numpy.arange(10000))
or if you need a string:
import StringIO
sio = StringIO.StringIO()
numpy.savetxt(sio, numpy.arange(10000))
s = sio.getvalue()
print s
The default output format is:
0.000000000000000000e+00
1.000000000000000000e+00
2.000000000000000000e+00
3.000000000000000000e+00
...
and it can be configured with further arguments.
Note in particular how this also does not show the square brackets, and allows for a lot of customization, as mentioned at: How to print a Numpy array without brackets?
Tested on Python 2.7.12, numpy 1.11.1.
This is a slight modification of neok's answer (it removes the option to pass additional arguments to set_printoptions).
It shows how you can use contextlib.contextmanager to easily create such a contextmanager with fewer lines of code:
import numpy as np
from contextlib import contextmanager
@contextmanager
def show_complete_array():
    oldoptions = np.get_printoptions()
    np.set_printoptions(threshold=np.inf)
    try:
        yield
    finally:
        np.set_printoptions(**oldoptions)
In your code it can be used like this:
a = np.arange(1001)
print(a) # shows the truncated array
with show_complete_array():
    print(a) # shows the complete array
print(a) # shows the truncated array (again)
with np.printoptions(edgeitems=50):
    print(x)
Change 50 to however many lines you want to see.
Source: here
A slight modification: (since you are going to print a huge list)
import numpy as np
np.set_printoptions(threshold=np.inf, linewidth=200)
x = np.arange(1000)
print(x)
This will increase the number of characters per line (the default linewidth is 75). Use whatever linewidth value suits your coding environment; fitting more characters on each line saves you from scrolling through a huge number of output lines.
Complementary to the answer above about the maximum number of elements printed (fixed with numpy.set_printoptions(threshold=numpy.nan)), there is also a limit on the number of characters displayed per line. In some environments, such as when calling Python from bash (rather than an interactive session), this can be fixed by setting the linewidth parameter as follows.
import numpy as np
np.set_printoptions(linewidth=2000) # default = 75
Mat = np.arange(20000,20150).reshape(2,75) # 150 elements (75 columns)
print(Mat)
In this case, it is your window that wraps the lines, limited only by its own width.
For those out there using sublime text and wanting to see results within the output window, you should add the build option "word_wrap": false to the sublime-build file [source] .
To turn it off and return to the normal mode
np.set_printoptions(threshold=False)
Since NumPy version 1.16, for more details see GitHub ticket 12251.
from sys import maxsize
from numpy import set_printoptions
set_printoptions(threshold=maxsize)
Suppose you have a numpy array
arr = numpy.arange(10000).reshape(250,40)
If you want to print the full array in a one-off way (without toggling np.set_printoptions), but want something simpler (less code) than the context manager, just do
for row in arr:
    print(row)
If you're using a Jupyter notebook, I found this to be the simplest solution for one-off cases. Basically, convert the numpy array to a list, then to a string, then print. This has the benefit of keeping the comma separators in the array, whereas using np.printoptions(threshold=np.inf) does not:
import numpy as np
print(str(np.arange(10000).reshape(250,40).tolist()))
You won't always want all items printed, especially for large arrays.
A simple way to show more items:
In [349]: ar
Out[349]: array([1, 1, 1, ..., 0, 0, 0])
In [350]: ar[:100]
Out[350]:
array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
This works fine as long as the sliced portion stays under the default threshold of 1000 elements.
If you are using Jupyter, try the variable inspector extension. You can click each variable to see the entire array.
This is the hackiest solution, but it even prints nicely, just as numpy does:
import numpy as np
a = np.arange(10000).reshape(250,40)
b = [str(row) for row in a.tolist()]
print('\n'.join(b))
You can use the array2string function - docs.
a = numpy.arange(10000).reshape(250,40)
print(numpy.array2string(a, threshold=numpy.nan, max_line_width=numpy.nan))
# [Big output]
If you have pandas available,
import pandas
a = numpy.arange(10000).reshape(250,40)
print(pandas.DataFrame(a).to_string(header=False, index=False))
This avoids the side effect of requiring a reset of numpy.set_printoptions(threshold=sys.maxsize), and you don't get the numpy.array wrapper and brackets. I find this convenient for dumping a wide array into a log file.
If an array is too large to be printed, NumPy automatically skips the central part of the array and only prints the corners:
To disable this behaviour and force NumPy to print the entire array, you can change the printing options using set_printoptions.
>>> np.set_printoptions(threshold='nan')
or
>>> np.set_printoptions(edgeitems=3,infstr='inf',
... linewidth=75, nanstr='nan', precision=8,
... suppress=False, threshold=1000, formatter=None)
You can also refer to the numpy documentation for more help.

Not able to use Stratified-K-Fold on multi label classifier

The following code is used to do K-Fold validation, but I am unable to train the model because it throws the error:
ValueError: Error when checking target: expected dense_14 to have shape (7,) but got array with shape (1,)
My target variable has 7 classes. I am using LabelEncoder to encode the classes into numbers.
Seeing this error, I switched to MultiLabelBinarizer to encode the classes, but then I get the following error:
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.
The following is the code for KFold validation
skf = StratifiedKFold(n_splits=10, shuffle=True)
scores = np.zeros(10)
idx = 0
for index, (train_indices, val_indices) in enumerate(skf.split(X, y)):
    print("Training on fold " + str(index+1) + "/10...")
    # Generate batches from indices
    xtrain, xval = X[train_indices], X[val_indices]
    ytrain, yval = y[train_indices], y[val_indices]
    model = None
    model = load_model()  # defined above
    scores[idx] = train_model(model, xtrain, ytrain, xval, yval)
    idx += 1
print(scores)
print(scores.mean())
I don't know what to do. I want to use Stratified K Fold on my model. Please help me.
MultiLabelBinarizer returns a vector whose length is your number of classes.
If you look at how StratifiedKFold splits your dataset, you will see that it only accepts a one-dimensional target variable, whereas you are trying to pass a target variable with dimensions [n_samples, n_classes].
A stratified split basically preserves your class distribution, and if you think about it, that does not make a lot of sense for a multi-label classification problem.
If you want to preserve the distribution in terms of the different combinations of classes in your target variable, then the answer here explains two ways in which you can define your own stratified split function.
UPDATE:
The logic is something like this:
Assume you have n classes and your target variable is a combination of these n classes. Then you have (2^n) - 1 possible combinations (not counting all zeros). You can now create a new target variable by treating each combination as a new label.
For example, if n=3, you will have 7 unique combinations:
1. [1, 0, 0]
2. [0, 1, 0]
3. [0, 0, 1]
4. [1, 1, 0]
5. [1, 0, 1]
6. [0, 1, 1]
7. [1, 1, 1]
Map all your labels to this new target variable. You can now look at your problem as simple multi-class classification, instead of multi-label classification.
Now you can directly use StratifiedKFold with y_new as your target. Once the splits are done, you can map your labels back.
Code sample:
import numpy as np
np.random.seed(1)
y = np.random.randint(0, 2, (10, 7))
y = y[np.where(y.sum(axis=1) != 0)[0]]
OUTPUT:
array([[1, 1, 0, 0, 1, 1, 1],
[1, 1, 0, 0, 1, 0, 1],
[1, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 1, 1, 1],
[1, 1, 0, 0, 0, 1, 1],
[1, 1, 1, 1, 0, 1, 1],
[0, 0, 1, 0, 0, 1, 1],
[1, 0, 1, 0, 0, 1, 1],
[0, 1, 1, 1, 1, 0, 0]])
Label encode your class vectors:
from sklearn.preprocessing import LabelEncoder

def get_new_labels(y):
    y_new = LabelEncoder().fit_transform([''.join(str(l)) for l in y])
    return y_new

y_new = get_new_labels(y)
OUTPUT:
array([7, 6, 3, 3, 2, 5, 8, 0, 4, 1])
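A minimal sketch of the final step (X here is a hypothetical feature matrix invented for the example; in practice each combination label should appear at least n_splits times, otherwise scikit-learn will warn): split on y_new, then index back into the original multi-label y.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(len(y), 5)                    # hypothetical features aligned with y
skf = StratifiedKFold(n_splits=2, shuffle=True)  # small n_splits for this toy example

for train_idx, val_idx in skf.split(X, y_new):
    xtrain, xval = X[train_idx], X[val_idx]
    ytrain, yval = y[train_idx], y[val_idx]      # original multi-label targets for the model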

scikit-learn: Get selected features for prediction data

I have a training set of data. The Python script for creating the model also computes the attributes into a numpy array (it's a bit vector). I then want to use VarianceThreshold to eliminate all features that have zero variance (e.g. all 0 or all 1). I then run get_support(indices=True) to get the indices of the selected columns.
My issue now is how to get only the selected features for the data I want to predict. I first calculate all features and then use array indexing but it does not work:
x_predict_all = getAllFeatures(suppl_predict)
x_predict = x_predict_all[indices] #only selected features
indices is a numpy array.
The returned array x_predict has the correct length, len(x_predict), but the wrong shape: x_predict.shape[1] is still the original number of columns. My classifier then throws an error due to the wrong shape:
prediction = gbc.predict(x_predict)
File "C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py", li
ne 1032, in _init_decision_function
self.n_features, X.shape[1]))
ValueError: X.shape[1] should be 1855, not 2090.
How can I solve this issue?
You can do it like this:
Test data
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])
selector = VarianceThreshold()
Alternative 1
>>> selector.fit(X)
>>> idxs = selector.get_support(indices=True)
>>> X[:, idxs]
array([[2, 0],
[1, 4],
[1, 1]])
Alternative 2
>>> selector.fit_transform(X)
array([[2, 0],
[1, 4],
[1, 1]])
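To apply this to the prediction data from the question: select columns, not rows. A short sketch reusing the questioner's variable names (x_predict_all and gbc come from the question and are not defined here):
# x_predict_all[indices] selects rows, which is why the shape stayed wrong;
# columns are selected with [:, idxs] or with the fitted selector itself
x_predict = x_predict_all[:, idxs]   # equivalent to selector.transform(x_predict_all)
prediction = gbc.predict(x_predict)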
