Tokens to words mapping in the tokenizer decode step in Hugging Face (PyTorch)?

Is there a way to know the mapping from the tokens back to the original words in the tokenizer.decode() function?
For example:
from transformers.tokenization_roberta import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)
str = "This is a tokenization example"
tokenized = tokenizer.tokenize(str)
## ['this', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample']
encoded = tokenizer.encode_plus(str)
## encoded['input_ids']=[0, 42, 16, 10, 19233, 1938, 1246, 2]
decoded = tokenizer.decode(encoded['input_ids'])
## '<s> this is a tokenization example</s>'
And the objective is to have a function that maps each token in the decode process back to the correct input word; here it would be:
desired_output = [[1], [2], [3], [4, 5], [6]]
since this corresponds to id 42, while token and ization correspond to ids [19233, 1938], which are at indexes 4 and 5 of the input_ids array.

As far as I know there is no built-in method for that, but you can create one yourself:
from transformers.tokenization_roberta import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)
example = "This is a tokenization example"
print({x : tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()})
Output:
{'This': [42], 'is': [16], 'a': [10], 'tokenization': [19233, 1938], 'example': [1246]}
To get exactly your desired output, you have to work with a list comprehension:
# start index, because the number of special tokens is fixed for each model (but be aware of single sentence input vs. pairwise sentence input)
idx = 1
enc = [tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()]
desired_output = []
for token in enc:
    tokenoutput = []
    for ids in token:
        tokenoutput.append(idx)
        idx += 1
    desired_output.append(tokenoutput)
print(desired_output)
Output:
[[1], [2], [3], [4, 5], [6]]

If you use the fast tokenizers, i.e. the Rust-backed versions from the tokenizers library, the encoding contains a word_ids method that can be used to map sub-words back to their original word. What constitutes a word vs. a subword depends on the tokenizer: a word is something generated by the pre-tokenization stage, i.e. split by whitespace, while a subword is generated by the actual model (BPE or Unigram, for example).
The code below should work in general, even if the pre-tokenization performs additional splitting. For example, I created my own custom step that splits based on PascalCase - the words here are Pascal and Case; the accepted answer won't work in this case, since it assumes words are whitespace-delimited.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-large', do_lower_case=True)
example = "This is a tokenization example"

encoded = tokenizer(example)
desired_output = []
for word_id in encoded.word_ids():
    if word_id is not None:
        start, end = encoded.word_to_tokens(word_id)
        if start == end - 1:
            tokens = [start]
        else:
            tokens = [start, end - 1]
        if len(desired_output) == 0 or desired_output[-1] != tokens:
            desired_output.append(tokens)
desired_output
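For reference, here is a minimal sketch of what word_ids() itself returns on the example sentence, with special tokens mapping to None; the exact values shown in the comment are an assumption based on the roberta-large fast tokenizer used above:
print(encoded.word_ids())
# expected to be something like: [None, 0, 1, 2, 3, 3, 4, None]
# i.e. token positions 4 and 5 ('token' and 'ization') both map back to word 3 ('tokenization')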

Related

Counting: How do I add a zero if a word does not occur in a list?

I would like to find keywords from a list, but return a zero if the word does not exist (in this case: part). In this example, collabor occurs 4 times and part 0 times.
My current output is
[['collabor', 4]]
But what I would like to have is
[['collabor', 4], ['part', 0]]
str1 = ["collabor", "part"]
x10 = []
for y in wordlist:
    for string in str1:
        if y.find(string) != -1:
            x10.append(y)

from collections import Counter
x11 = Counter(x10)
your_list = [list(i) for i in x11.items()]
rowssorted = sorted(your_list, key=lambda x: x[0])
print(rowssorted)
Although you have not clearly written your problem and requirements, I think I understood the task.
I assume that you have a set of words that may or may not occur in a given list, and you want to print the count of those words based on their occurrences in that list.
Code:
constants = ["part", "collabor"]
wordlist = ["collabor", "collabor"]
d = {}
for const in constants:
    d[const] = 0
for word in wordlist:
    if word in d:
        d[word] += 1
    else:
        d[word] = 0

from collections import Counter
x11 = Counter(d)
your_list = [list(i) for i in x11.items()]
rowssorted = sorted(your_list, key=lambda x: x[0])
print(rowssorted)
output:
[['collabor', 2], ['part', 0]]
This approach gives the required output.
In Python, a dictionary is the usual way to count occurrences.
Hope it helps!
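A more compact variant of the same idea is sketched below; the wordlist contents here are an assumption, chosen so that collabor appears 4 times as in the question's desired output, and note that, unlike the question's find()-based substring check, this counts exact word matches:
from collections import Counter

constants = ["part", "collabor"]
wordlist = ["collabor", "collabor", "collabor", "collabor"]

counts = Counter(wordlist)                                     # tally the occurrences
result = [[word, counts.get(word, 0)] for word in constants]   # missing words fall back to 0
print(sorted(result, key=lambda x: x[0]))                      # [['collabor', 4], ['part', 0]]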

How to efficiently batch-process in Hugging Face?

I am using the Hugging Face transformers library to find whether a sentence is well-formed or not. I am using a masked language model called XLMR. I first tokenize my sentence, then mask each word of the sentence one by one, then process the masked sentences and find the probability that the predicted masked word is right.
import copy
import math

import torch
from tqdm import tqdm

def calculate_scores(sent, model, tokenizer, device, print_pred=False, maskval=False):
    k = 0
    dic = {}
    ls = tokenizer.batch_encode_plus(sent)
    input_list = ls.input_ids
    h = 0
    with torch.no_grad():
        for i in tqdm(range(len(input_list))):
            item = input_list[i]
            real_input = item
            attmask = [1] * len(item)
            seg = [0] * len(item)
            seglist = [seg]
            masked_list = [real_input]
            attlist = [attmask]
            for j in range(1, len(item) - 1):
                input = copy.deepcopy(real_input)
                input[j] = 50264
                masked_list.append(input)
                attlist.append(attmask)
                seglist.append(seg)
            inid = torch.tensor(masked_list)
            segtensor = torch.tensor(seglist)
            atttensor = torch.tensor(attlist)
            inid = inid.to(device)
            segtensor = segtensor.to(device)
            output = model(inid, segtensor)
            predictions_logits = output.logits
            predictions = torch.softmax(predictions_logits, dim=2)
            ppscore = 0
            for j in range(1, len(item) - 1):
                ppscore = ppscore + math.log(predictions[j, j, item[j]], 2)
            try:
                score = math.pow(2, (-1 / (len(item) - 2)) * ppscore)
                dic[sent[i]] = score
            except:
                print(sent[i])
                dic[sent[i]] = 10000000
    return dic
I will explain my code quickly. The function calculate_scores takes sent as input, which is a list of sentences. I first batch-encode this list of sentences. Then, for each encoded sentence, I generate masked copies in which exactly one token is masked and the rest are unmasked. I feed these generated sentences to the model, get the probability that each masked token is predicted correctly, and then compute the perplexity.
But this is not a very good way of utilizing the GPU. I want to process multiple sentences at once, but at the same time I also need the perplexity score for each individual sentence. How would I go about doing this?

How to predict a character with a character-based RNN model?

I want to create a prediction function which completes part of a "sentence".
The model used here is a character-based RNN (LSTM). What are the steps we should follow?
I tried this, but I can't give the sentence as input:
def generate(self) -> Tuple[List[Token], torch.tensor]:
    start_symbol_idx = self.vocab.get_token_index(START_SYMBOL, 'tokens')
    # print(start_symbol_idx)
    end_symbol_idx = self.vocab.get_token_index(END_SYMBOL, 'tokens')
    padding_symbol_idx = self.vocab.get_token_index(DEFAULT_PADDING_TOKEN, 'tokens')

    log_likelihood = 0.
    words = []
    state = (torch.zeros(1, 1, self.hidden_size), torch.zeros(1, 1, self.hidden_size))

    word_idx = start_symbol_idx

    for i in range(self.max_len):
        tokens = torch.tensor([[word_idx]])
        embeddings = self.embedder({'tokens': tokens})
        output, state = self.rnn._module(embeddings, state)
        output = self.hidden2out(output)
        log_prob = torch.log_softmax(output[0, 0], dim=0)
        dist = torch.exp(log_prob)

        word_idx = start_symbol_idx
        while word_idx in {start_symbol_idx, padding_symbol_idx}:
            word_idx = torch.multinomial(
                dist, num_samples=1, replacement=False).item()

        log_likelihood += log_prob[word_idx]

        if word_idx == end_symbol_idx:
            break

        token = Token(text=self.vocab.get_token_from_index(word_idx, 'tokens'))
        words.append(token)

    return words, log_likelihood, start_symbol_idx
Here are two tutorials on how to use machine learning libraries to generate text: TensorFlow and PyTorch.
This code snippet is part of the AllenNLP "language model" tutorial, where the generate function is defined to compute the probability of tokens and to find the best token and sequence of tokens according to the maximum likelihood of the model output. The full code is in the Colab notebook below, which you can refer to: https://colab.research.google.com/github/mhagiwara/realworldnlp/blob/master/examples/generation/lm.ipynb#scrollTo=8AU8pwOWgKxE
After training the language model, you can use this function as follows:
for _ in range(50):
    tokens, _ = model.generate()
    print(''.join(token.text for token in tokens))
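To actually complete a partial sentence, one option is to first feed the prefix characters through the RNN to build up the hidden state and only then start sampling. The following is a rough sketch, not part of the tutorial: it assumes the same self.vocab, self.embedder, self.rnn._module, self.hidden2out, self.hidden_size and self.max_len attributes as the generate method above, and a character-level vocabulary.
def complete(self, prefix: str) -> str:
    # warm up the hidden state on the prefix, one character at a time
    state = (torch.zeros(1, 1, self.hidden_size), torch.zeros(1, 1, self.hidden_size))
    word_idx = self.vocab.get_token_index(START_SYMBOL, 'tokens')
    for ch in prefix:
        embeddings = self.embedder({'tokens': torch.tensor([[word_idx]])})
        _, state = self.rnn._module(embeddings, state)
        word_idx = self.vocab.get_token_index(ch, 'tokens')

    # sample the continuation, starting from the last prefix character
    end_symbol_idx = self.vocab.get_token_index(END_SYMBOL, 'tokens')
    generated = list(prefix)
    for _ in range(self.max_len):
        embeddings = self.embedder({'tokens': torch.tensor([[word_idx]])})
        output, state = self.rnn._module(embeddings, state)
        dist = torch.softmax(self.hidden2out(output)[0, 0], dim=0)
        word_idx = torch.multinomial(dist, num_samples=1).item()
        if word_idx == end_symbol_idx:
            break
        generated.append(self.vocab.get_token_from_index(word_idx, 'tokens'))
    return ''.join(generated)
Calling model.complete(prefix) would then return the prefix plus a sampled continuation.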

Python 3.x replace for loop with something faster

I am trying to produce a vector that represents how a string matches a list's elements. I have written a function in Python 3.x:
def vector_build(docs, var):
    vector = []
    features = docs.split(' ')
    for ngram in var:
        if ngram in features:
            vector.append(docs.count(ngram))
        else:
            vector.append(0)
    return vector
It works fine:
vector_build ('hi my name is peter',['hi', 'name', 'are', 'is'])
Out: [1, 1, 0, 1]
But this function does not scale to large data. When its string parameter 'docs' is larger than about 190 KB, it takes longer than it should. So I am trying to replace the for loop with a map function, like:
var = ['hi', 'name', 'are', 'is']
doc = 'hi my name is peter'
features = doc.split(' ')
vector = list(map(var,if ngram in var in features: vector.append(doc.count(ngram))))
But this return this error:
SyntaxError: invalid syntax
Is there a way to replace that for loop with map, lambda, itertools in order to make the execution faster?
You can use a list comprehension for this task. Also, doing the lookups against a set of features should speed the function up a bit as well.
var = ['hi', 'name', 'are', 'is']
doc = 'hi my name is peter'
features = doc.split(' ')
features_set = set(features) #faster lookups
vector = [doc.count(ngram) if ngram in features_set else 0 for ngram in var]
print(vector)
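Folded back into the original function signature, a sketch might look like this (it keeps the docs.count() behaviour of the question's code, which counts substring occurrences rather than whole-word matches):
def vector_build(docs, var):
    features_set = set(docs.split(' '))  # set membership tests are O(1)
    return [docs.count(ngram) if ngram in features_set else 0 for ngram in var]

print(vector_build('hi my name is peter', ['hi', 'name', 'are', 'is']))  # [1, 1, 0, 1]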

Why did NLTK NaiveBayes classifier misclassify one record?

This is the first time I am building a sentiment analysis machine learning model using the nltk NaiveBayesClassifier in Python. I know it is too simple of a model, but it is just a first step for me and I will try tokenized sentences next time.
The real issue I have with my current model is: I have clearly labeled the word 'bad' as negative in the training data set (as you can see from the 'negative_vocab' variable). However, when I ran the NaiveBayesClassifier on each sentence (lower case) in the list ['awesome movie', ' i like it', ' it is so bad'], the classifier mistakenly labeled 'it is so bad' as positive.
INPUT:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible', 'useless', 'hate', ':(' ]
neutral_vocab = [ 'movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not', 'it', 'so', 'really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set)

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad"
sentence = sentence.lower()
words = sentence.split('.')

def word_feat(word):
    return dict([(word, True)])
# NOTE: the function 'word_feat(word)' defined here is different from the 'word_feats(words)' function defined earlier. It is used to iterate over each of the three elements in the list ['awesome movie', ' i like it', ' it is so bad'].

for word in words:
    classResult = classifier.classify(word_feat(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
    print(str(word) + ' is ' + str(classResult))
    print()
OUTPUT:
awesome movie is pos
i like it is pos
it is so bad is pos
To make sure the function 'word_feat(word)' iterates over each sentence instead of each word or letter, I wrote some diagnostic code to see what each element passed to 'word_feat(word)' is:
for word in words:
    print(word_feat(word))
And it printed out:
{'awesome movie': True}
{' i like it': True}
{' it is so bad': True}
So it seems like the function 'word_feat(word)' is correct?
Does anyone know why the classifier classified 'It is so bad' as positive? As mentioned before, I had clearly labeled the word 'bad' as negative in my training data.
This particular failure is because your word_feats() function expects a list of words (a tokenized sentence), but you pass it each word separately... so word_feats() iterates over its letters. You've built a classifier that classifies strings as positive or negative on the basis of the letters they contain.
You're probably in this predicament because you pay no attention to what you name your variables. In your main loop, none of the variables sentence, words, or word contain what their name claims. To understand and improve your program, start by naming things properly.
Bugs aside, this is not how you build a sentiment classifier. The training data should be a list of tokenized sentences (each labeled with its sentiment), not a list of individual words. Similarly, you classify tokenized sentences.
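As a minimal sketch of that shape (the two training sentences below are hypothetical, and the feature function is the question's own word_feats):
train_set = [
    (word_feats(['awesome', 'movie']), 'pos'),
    (word_feats(['it', 'is', 'so', 'bad']), 'neg'),
]
classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(word_feats(['so', 'bad'])))  # expected to print 'neg'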
Here is the modified code for you
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
from nltk.corpus import stopwords

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible', 'useless', 'hate', ':(' ]
neutral_vocab = [ 'movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not', 'it', 'so', 'really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set)

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad."
sentence = sentence.lower()
sentences = sentence.split('.')  # this is actually a list of sentences

for sent in sentences:
    if sent != "":
        words = [word for word in sent.split(" ") if word not in stopwords.words('english')]
        classResult = classifier.classify(word_feats(words))
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
        print(str(sent) + ' --> ' + str(classResult))
        print()
I modified the part where you were passing a 'list of words' as the input to your classifier. Actually you need to pass the sentences one by one, which means you need to pass a 'list of sentences'.
Also, for each sentence, you need to pass its 'words as features', which means you need to split the sentence on the whitespace character.
Also, if you want your classifier to work properly for sentiment analysis, you need to give less weight to "stop words" like "it", "they", "is", etc., as these words are not sufficient to decide whether a sentence is positive, negative, or neutral.
The above code gives below output
awesome movie --> pos
i like it --> pos
it is so bad --> neg
So for any classifier, the input format for training and for prediction should be the same. While training you provided a list of words; try to use the same method to convert your test set as well.
Let me show a rewriting of your code. All I changed near the top was adding import re, as it is easier to tokenize with regexes. Everything else up to defining classifier is the same as your code.
I added one more test case (something really, really negative), but more importantly I used proper variable names - then it is much harder to get confused about what is going on:
test_data = "Awesome movie. I like it. It is so bad. I hate this terrible useless movie."
sentences = test_data.lower().split('.')
So sentences now contains 4 strings, each a single sentence.
I left your word_feat() function unchanged.
For using the classifier I did quite a big rewrite:
for sentence in sentences:
    if len(sentence) == 0:
        continue
    neg = 0
    pos = 0
    for word in re.findall(r"[\w']+", sentence):
        classResult = classifier.classify(word_feat(word))
        print(word, classResult)
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
    print("\n%s: %d vs -%d\n" % (sentence, pos, neg))
The outer loop again uses a descriptive name, so that sentence contains one sentence.
I then have an inner loop where we classify each word in the sentence; I am using a regex to split the sentence up on whitespace and punctuation marks:
for word in re.findall(r"[\w']+", sentence):
    classResult = classifier.classify(word_feat(word))
The rest is just basic adding up and reporting. I get this output:
awesome pos
movie neu
awesome movie: 1 vs -0
i pos
like pos
it pos
i like it: 3 vs -0
it pos
is neu
so pos
bad neg
it is so bad: 2 vs -1
i pos
hate neg
this pos
terrible neg
useless neg
movie neu
i hate this terrible useless movie: 2 vs -3
I still get the same as you - "it is so bad" is considered positive. And with the extra debug lines we can see it is because "it" and "so" are considered positive words, and "bad" is the only negative word, so overall it is positive.
I suspect this is because it hadn't seen those words in its training data.
...yes, if I add "it" and "so" to the list of neutral words, I get "it is so bad: 0 vs -1".
As next things to try, I'd suggest:
Try with more training data; toy examples like this carry the risk that the noise will swamp the signal.
Look into removing stop words.
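For the stop-word idea, a minimal sketch (it assumes the NLTK stopwords corpus has been downloaded, as in the earlier answer that imported nltk.corpus.stopwords):
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
words = [w for w in "it is so bad".split() if w not in stop]
print(words)  # 'it', 'is' and 'so' are English stop words, so this should leave just ['bad']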
You can try this code
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)', 'love' ]
negative_vocab = [ 'bad', 'terrible', 'useless', 'hate', ':(', 'kill', 'steal' ]
neutral_vocab = [ 'movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not' ]

positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]

train_set = negative_features + positive_features + neutral_features

classifier = NaiveBayesClassifier.train(train_set)

# Predict
neg = 0
pos = 0
sentence = " Awesome movie, I like it :)"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1

print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))
results are:
Positive: 0.7142857142857143
Negative: 0.14285714285714285
