Is it possible to lowercase given input ids without decoding and then encoding again?
for example
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
text1 = tokenizer.decode([713, 16, 10, 3645, 4])
print(text1)
>>> This is a sentence.
text2 = tokenizer.decode([9226, 16, 10, 3645, 4])
print(text2)
>>> this is a sentence.
I would like to know if there is some fast way to convert the id 713 to 9226 without decoding, lowercasing, and then encoding again.
Thanks,
Shon
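One possible shortcut (a sketch, not a built-in tokenizer feature; the helper name lowercase_ids is made up) is to stay at the token-string level: convert the ids to token strings, lowercase them while keeping RoBERTa's leading-space marker 'Ġ' intact, and look the results up again, falling back to the original id when the lowercased form is not in the vocabulary:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

def lowercase_ids(input_ids):
    """Hypothetical helper: map ids to the ids of their lowercased tokens."""
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    lowered = []
    for tok in tokens:
        # RoBERTa's BPE marks a leading space with 'Ġ'; keep that marker
        # intact and lowercase only the rest of the token string.
        if tok.startswith('Ġ'):
            lowered.append('Ġ' + tok[1:].lower())
        else:
            lowered.append(tok.lower())
    new_ids = tokenizer.convert_tokens_to_ids(lowered)
    # If a lowercased token is not in the vocabulary it comes back as <unk>;
    # keep the original id in that case (or re-encode just that token).
    return [new if new != tokenizer.unk_token_id else old
            for new, old in zip(new_ids, input_ids)]

print(lowercase_ids([713, 16, 10, 3645, 4]))  # should give [9226, 16, 10, 3645, 4]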
Related
I am following this post to extract embeddings for sentences, and for a single sentence the steps are described as follows:
text = "After stealing money from the bank vault, the bank robber was seen " \
"fishing on the Mississippi river bank."
# Add the special tokens.
marked_text = "[CLS] " + text + " [SEP]"
# Split the sentence into tokens.
tokenized_text = tokenizer.tokenize(marked_text)
# Mark each of the 22 tokens as belonging to sentence "1".
segments_ids = [1] * len(tokenized_text)
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
output_hidden_states = True,
)
# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()
with torch.no_grad():
outputs = model(tokens_tensor, segments_tensors)
hidden_states = outputs[2]
And I want to do this for a batch of sequences. Here is my example code:
seql = ['this is an example', 'today was sunny and', 'today was']
encoded = [tokenizer.encode(seq, max_length=5, pad_to_max_length=True) for seq in seql]
encoded
[[2, 2511, 1840, 3251, 3],
[2, 1663, 2541, 1957, 3],
[2, 1663, 2541, 3, 0]]
But since I'm working with batches, the sequences need to have the same length. So I introduce a padding token (third sentence), which raises several questions:
What should the segment id for the pad_token (0) be?
Should I use attention masking when feeding the tensors to the model so that padding is ignored? In the example only token and segment tensors are used.
outputs = model(tokens_tensor, segments_tensors)
If I don't work with batches but with individual sentences, then I might not need a padding token. Would it be better to do that compared to batches?
You can do all the work you need (padding, truncation) using one function:
encode_plus
Check the parameters in the docs.
You can do the same with a list of sequences using:
batch_encode_plus
docs
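For the batch example above, a sketch of how this could look (assuming a transformers version where batch_encode_plus still accepts max_length and pad_to_max_length; newer versions use padding and truncation instead). Padded positions get attention_mask 0 and segment id 0 from the tokenizer, so passing the attention mask to the model is what makes it ignore the padding:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

seql = ['this is an example', 'today was sunny and', 'today was']

# Pads every sequence to max_length and returns input ids, token type ids
# and the attention mask as PyTorch tensors.
batch = tokenizer.batch_encode_plus(seql,
                                    max_length=5,
                                    pad_to_max_length=True,
                                    return_tensors='pt')

with torch.no_grad():
    outputs = model(input_ids=batch['input_ids'],
                    attention_mask=batch['attention_mask'],
                    token_type_ids=batch['token_type_ids'])
    hidden_states = outputs[2]  # hidden states of all layers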
I want to apply the RoBERTa model for text similarity. Given a pair of sentences, the input should be in the format <s> A </s></s> B </s>. I have figured out two possible ways to generate the input ids, namely:
a)
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands')
list2 = tokenizer.encode('Numbness of upper limb')
sequence = list1+[2]+list2[1:]
In this case, sequence is [0, 12178, 3814, 2400, 11, 1420, 2, 2, 234, 4179, 1825, 9, 2853, 29654, 2]
b)
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands', add_special_tokens=False)
list2 = tokenizer.encode('Numbness of upper limb', add_special_tokens=False)
sequence = [0]+list1+[2,2]+list2+[2]
In this case, sequence is [0, 25101, 3814, 2400, 11, 1420, 2, 2, 487, 4179, 1825, 9, 2853, 29654, 2]
Here 0 represents the <s> token and 2 represents the </s> token. I'm not sure which is the correct way to encode the two sentences for calculating sentence similarity with the RoBERTa model.
The easiest way is probably to use the function provided by HuggingFace's tokenizers directly, namely the text_pair argument of the encode function (see here). This lets you feed in two sentences directly, which gives you the desired output:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
sequence = tokenizer.encode(text='Very severe pain in hands',
                            text_pair='Numbness of upper limb',
                            add_special_tokens=True)
This is especially convenient if you are dealing with very long sequences, as the encode function automatically truncates your inputs according to the truncation_strategy argument. You obviously don't have to worry about this if you are only dealing with short sequences.
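For example, a sketch of capping the pair length in the same call (max_length=12 is an arbitrary value for illustration; newer transformers versions use truncation= instead of truncation_strategy=):
sequence = tokenizer.encode(text='Very severe pain in hands',
                            text_pair='Numbness of upper limb',
                            add_special_tokens=True,
                            max_length=12,
                            truncation_strategy='longest_first')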
Alternatively, you can also make use of the more explicit build_inputs_with_special_tokens() function of the RobertaTokenizer, specifically, which could be added to your example like so:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
list1 = tokenizer.encode('Very severe pain in hands', add_special_tokens=False)
list2 = tokenizer.encode('Numbness of upper limb', add_special_tokens=False)
sequence = tokenizer.build_inputs_with_special_tokens(list1, list2)
Note that in this case you still have to generate the sequences list1 and list2 without any special tokens, as you have already done correctly.
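If you want to double-check the result, RoBERTa's build_inputs_with_special_tokens should reproduce the manual construction from variant b). A small sanity-check sketch reusing list1, list2 and sequence from above:
# For roberta-base, <s> is cls_token_id (0) and </s> is sep_token_id (2).
manual = ([tokenizer.cls_token_id] + list1
          + [tokenizer.sep_token_id] * 2
          + list2 + [tokenizer.sep_token_id])
assert sequence == manual
print(tokenizer.decode(sequence))  # shows the <s> A </s></s> B </s> layout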
I am using Hugging Face's Transformers library to work with different NLP models. The following code does masking with XLNet. It outputs a tensor of numbers. How do I convert the output back into words?
import torch
from transformers import XLNetModel, XLNetTokenizer, XLNetLMHeadModel
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetLMHeadModel.from_pretrained('xlnet-base-cased')
# We show how to setup inputs to predict a next token using a bi-directional context.
input_ids = torch.tensor(tokenizer.encode("I went to <mask> York and saw the <mask> <mask> building.")).unsqueeze(0) # We will predict the masked token
print(input_ids)
perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token
target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float) # Shape [1, 1, seq_length] => let's predict one token
target_mapping[0, 0, -1] = 1.0 # Our first (and only) prediction will be the last token of the sequence (the masked token)
outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
next_token_logits = outputs[0] # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
The current output I get is:
tensor([[[ -5.1466, -17.3758, -17.3392,  ..., -12.2839, -12.6421, -12.4505]]],
       grad_fn=<AddBackward0>)
The output you have is a tensor of size 1 by 1 by vocabulary size. The meaning of the nth number in this tensor is the estimated log-odds of the nth vocabulary item. So, if you want to get out the word that the model predicts to be most likely to come in the final position (the position you specified with target_mapping), all you need to do is find the word in the vocabulary with the maximum predicted log-odds.
Just add the following to the code you have:
predicted_index = torch.argmax(next_token_logits[0][0]).item()
predicted_token = tokenizer.convert_ids_to_tokens(predicted_index)
So predicted_token is the token the model predicts as most likely in that position.
Note that, by default, XLNetTokenizer.encode() adds the special tokens <sep> and <cls> to the end of a string of tokens when it encodes it. The code you have given masks and predicts the final token, which, after running through tokenizer.encode(), is the special token '<cls>', which is probably not what you want.
That is, when you run
tokenizer.encode("I went to <mask> York and saw the <mask> <mask> building.")
the result is a list of token ids,
[35, 388, 22, 6, 313, 21, 685, 18, 6, 6, 540, 9, 4, 3]
which, if you convert back to tokens (by calling tokenizer.convert_ids_to_tokens() on the above id list), you will see has two extra tokens added at the end,
['▁I', '▁went', '▁to', '<mask>', '▁York', '▁and', '▁saw', '▁the', '<mask>', '<mask>', '▁building', '.', '<sep>', '<cls>']
So, if the word you are meaning to predict is 'building', you should use perm_mask[:, :, -4] = 1.0 and target_mapping[0, 0, -4] = 1.0.
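Putting this together, a sketch of the adjusted setup that targets the '▁building' position instead of the trailing '<cls>' (it reuses model, tokenizer and input_ids from the code above):
# Target position -4 ('▁building') instead of the trailing '<cls>' token.
perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
perm_mask[:, :, -4] = 1.0                # no token may attend to the target position
target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)
target_mapping[0, 0, -4] = 1.0           # predict the token at position -4

outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
next_token_logits = outputs[0]           # shape [1, 1, vocab_size]

predicted_index = torch.argmax(next_token_logits[0][0]).item()
predicted_token = tokenizer.convert_ids_to_tokens(predicted_index)
print(predicted_token)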
I came across this code while learning Keras online.
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence
text = 'One hot encoding in Keras'
tokens = text_to_word_sequence(text)
length = len(tokens)
one_hot(text, length)
This returns integers like this...
[3, 1, 1, 2, 3]
I do not understand how unique words can end up with duplicate numbers. For example, 3 and 1 are repeated even though the words in the text are unique.
The documentation of one_hot describes it as a wrapper of hashing_trick:
This is a wrapper to the hashing_trick function using hash as the hashing function; unicity of word to index mapping non-guaranteed.
From the documentation of hashing_trick:
Two or more words may be assigned to the same index, due to possible collisions by the hashing function. The probability of a collision is in relation to the dimension of the hashing space and the number of distinct objects.
Since hashing is used, there is a probability that different words will be hashed to the same index. The probability of a collision depends on the number of distinct words relative to the size of the hashing space you select.
Jason Brownlee suggests using a vocabulary size about 25% larger than the number of distinct words to increase the uniqueness of the hashes.
Following Jason Brownlee's suggestion in your case results in:
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.random import set_random_seed
import math
set_random_seed(1)
text = 'One hot encoding in Keras'
tokens = text_to_word_sequence(text)
length = len(tokens)
print(one_hot(text, math.ceil(length*1.25)))
which returns the integers
[3, 4, 5, 1, 6]
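To make the collision behaviour explicit, you can call hashing_trick directly; one_hot(text, n) is essentially hashing_trick(text, n, hash_function=hash). A small sketch:
from tensorflow.keras.preprocessing.text import hashing_trick

text = 'One hot encoding in Keras'

# With a hashing space as small as the number of words, collisions are likely.
print(hashing_trick(text, n=5, hash_function=hash))

# A larger hashing space makes collisions less likely, but never impossible.
# (Note: Python's built-in hash is salted per process unless PYTHONHASHSEED
# is set, so the exact integers can vary between runs.)
print(hashing_trick(text, n=50, hash_function=hash))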
Can the CountVectorizer be used to identify if a set of words appear in the corpus regardless of order?
It can do ordered phrases: How can I use sklearn CountVectorizer with mutliple strings?
Yet in my case the set of words does not happen to fall next to each other, so tokenizing the whole phrase and then trying to find it in some text document results in zero matches.
What I dream of is for the following to happen:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
sentences = [ "The only cool Washington is DC",
"A cool city in Washington is Seattle",
"Moses Lake is the dirtiest water in Washington" ]
listOfStrings = ["Washington DC",
"Washington Seattle",
"Washington cool"]
vectorizer = CountVectorizer(vocabulary=listOfStrings)
bagowords = np.matrix(vectorizer.fit_transform(sentences).todense())
bagowords
matrix([[1, 0, 1],
        [0, 1, 1],
        [0, 0, 0]])
The actual problem entails more words in between and thus removing stop words here would not be a valid solution. Any advice would be awesome!
As discussed in the comments, since you only want to find out whether certain words are present in a document or not, you will need to change the vocabulary (listOfStrings) a little bit.
sentences = [ "The only cool Washington is DC",
"A cool city in Washington is Seattle",
"Moses Lake is the dirtiest water in Washington" ]
from sklearn.feature_extraction.text import CountVectorizer
listOfStrings = ["washington", "dc", "seattle", "cool"]
vectorizer = CountVectorizer(vocabulary=listOfStrings,
                             binary=True)
bagowords = vectorizer.fit_transform(sentences).toarray()
vectorizer.vocabulary
['washington', 'dc', 'seattle', 'cool']
bagowords
array([[1, 1, 0, 1],
       [1, 0, 1, 1],
       [1, 0, 0, 0]])
I have added binary=True to the CountVectorizer since you don't want the actual counts, only to check whether a word is present or not.
The output of bagowords matches the order of the vocabulary (listOfStrings) you supplied. So the first column indicates whether "washington" is present in each document, the second column checks for "dc", and so on.
Of course, you will need to pay attention to other parameters of CountVectorizer which can affect this. For example:
lowercase is True by default, so I used lowercase words in listOfStrings. Otherwise "DC", "Dc" and "dc" are considered separate words.
You should also study the effect of the token_pattern parameter, which by default only keeps alphanumeric strings of length 2 or more. So if you want to detect single-letter words like "a" or "I", you will need to change it, for instance as sketched below.
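A sketch of relaxing the default pattern (r"(?u)\b\w\w+\b") so that single-character tokens are kept as well:
vectorizer = CountVectorizer(vocabulary=listOfStrings,
                             binary=True,
                             token_pattern=r"(?u)\b\w+\b")  # also keep 1-character tokens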
Hope this helps. If anything is unclear, feel free to ask.