I would like to encode a sentence with BOS and EOS tokens. When I load a pretrained tokenizer, there is no BOS token, so I added a BOS token to the tokenizer. After that, I encoded the sentence.
model_checkpoint = "facebook/wmt19-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.add_special_tokens({'bos_token' : '<s>'})
tokenizer.encode("Resumption of the session", add_special_tokens = True)
result: [2642, 4584, 636, 9, 6, 9485, 2] # 2642 is not the BOS token, and 2 is the EOS token.
However, the result shows that the BOS token does not appear in the encoded sentence. How can I include the BOS token when encoding?
Even though you've specified bos_token to be a string of your choosing, you still need to set the add_bos_token property of the tokenizer to True to get it to prepend the bos_token to its output.
You can do this when you instantiate AutoTokenizer:
tokenizer = AutoTokenizer.from_pretrained(
model_checkpoint,
bos_token = '<s>',
add_bos_token = True
)
tok_enc = tokenizer.encode("Resumption of the session")
print(tok_enc)
print(tokenizer.decode(tok_enc[0]))  # decode just the first id to confirm it is the BOS token
Output:
[50257, 4965, 24098, 286, 262, 6246]
<s>
... or you could use the add_special_tokens method and set the add_bos_token property after instantiation:
tokenizer = AutoTokenizer.from_pretrained(
model_checkpoint
)
tokenizer.add_special_tokens({'bos_token' : '<s>'})
tokenizer.add_bos_token = True
tok_enc = tokenizer.encode("Resumption of the session")
print(tok_enc)
print(tokenizer.decode(tok_enc[0]))  # decode just the first id to confirm it is the BOS token
Output:
[50257, 4965, 24098, 286, 262, 6246]
<s>
Note that the tokenizer you select may have a default bos_token, which means you could simply set add_bos_token = True without specifying bos_token = '<s>' (unless you want to customise the bos_token, of course).
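For example, here is a quick way to check whether a default bos_token already exists (gpt2 is used here purely as an illustration, not the checkpoint from the question):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.bos_token, tok.bos_token_id)  # GPT-2 reuses '<|endoftext|>' (id 50256) as its BOS token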
Related
I am following this post to extract embeddings for sentences and for a single sentence the steps are described as follows:
text = "After stealing money from the bank vault, the bank robber was seen " \
"fishing on the Mississippi river bank."
# Add the special tokens.
marked_text = "[CLS] " + text + " [SEP]"
# Split the sentence into tokens.
tokenized_text = tokenizer.tokenize(marked_text)
# Map the token strings to their vocabulary indices.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Mark each of the 22 tokens as belonging to sentence "1".
segments_ids = [1] * len(tokenized_text)
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
output_hidden_states = True,
)
# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()
with torch.no_grad():
    outputs = model(tokens_tensor, segments_tensors)
hidden_states = outputs[2]
And I want to do this for a batch of sequences. Here is my example code:
seql = ['this is an example', 'today was sunny and', 'today was']
encoded = [tokenizer.encode(seq, max_length=5, pad_to_max_length=True) for seq in seql]
encoded
[[2, 2511, 1840, 3251, 3],
[2, 1663, 2541, 1957, 3],
[2, 1663, 2541, 3, 0]]
But since I'm working with batches, sequences need to have same length. So I introduce a padding token (3rd sentence) which confuses me about several points:
What should the segment id for the pad_token (0) be?
Should I use attention masking when feeding the tensors to the model so that padding is ignored? In the example only token and segment tensors are used.
outputs = model(tokens_tensor, segments_tensors)
If I don't work with batches but with individual sentences, then I might not need a padding token. Would it be better to do that instead of using batches?
You can do all the work you need (padding, truncation) with one function: encode_plus. Check the parameters in the docs. You can do the same with a list of sequences using batch_encode_plus (see the docs).
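Here is a minimal sketch of that (assuming the bert-base-uncased checkpoint from the post you followed; the exact ids will differ for your tokenizer). Padded positions get attention_mask 0 and segment id 0, and passing the attention mask tells the model to ignore them:

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

seql = ['this is an example', 'today was sunny and', 'today was']
# One call handles padding/truncation and builds the attention mask for you.
encoded = tokenizer.batch_encode_plus(seql, max_length=5, padding='max_length',
                                      truncation=True, return_tensors='pt')
print(encoded['input_ids'])       # padded ids
print(encoded['token_type_ids'])  # segment ids: 0 everywhere, including pad positions
print(encoded['attention_mask'])  # 1 for real tokens, 0 for padding

model.eval()
with torch.no_grad():
    outputs = model(input_ids=encoded['input_ids'],
                    attention_mask=encoded['attention_mask'],
                    token_type_ids=encoded['token_type_ids'])
hidden_states = outputs[2]  # available because output_hidden_states=True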
During the generation phase in HuggingFace's code:
https://github.com/huggingface/transformers/blob/master/src/transformers/generation_utils.py#L88-L100
They pass in a decoder_start_token_id, and I'm not sure why they need it. In the BART config, the decoder_start_token_id is actually 2 (https://huggingface.co/facebook/bart-base/blob/main/config.json), which is the end-of-sentence token </s>.
And I tried a simple example:
from transformers import *
import torch
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
input_ids = torch.LongTensor([[0, 894, 213, 7, 334, 479, 2]])
res = model.generate(input_ids, num_beams=1, max_length=100)
print(res)
preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True).strip() for g in res]
print(preds)
The results I obtained:
tensor([[ 2, 0, 894, 213, 7, 334, 479, 2]])
['He go to school.']
This does not affect the final decoded results, but it seems weird to me that the first token we generate is actually 2 (</s>).
You can see in the code for encoder-decoder models that the input tokens for the decoder are right-shifted from the original (see function shift_tokens_right). This means that the first token to guess is always BOS (beginning of sentence). You can check that this is the case in your example.
For the decoder to understand this, we must choose a first token that is always followed by BOS, so which could it be? BOS? Obviously not because it must be followed by regular tokens. The padding token? Also not a good choice because it is followed by another padding token or by EOS (end of sentence). So, what about EOS then? Well, that makes sense because it is never followed by anything in the training set so there is no next token coming in conflict. And besides, isn't it natural that the beginning of sentence follows the end of another?
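For intuition, here is a simplified sketch of that right shift (an illustration, not the library's exact function), using the example ids from the question:

import torch

def shift_tokens_right(input_ids, decoder_start_token_id):
    # Prepend decoder_start_token_id and drop the last position,
    # so the decoder predicts token t from the tokens before t.
    shifted = input_ids.new_zeros(input_ids.shape)
    shifted[:, 1:] = input_ids[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    return shifted

labels = torch.LongTensor([[0, 894, 213, 7, 334, 479, 2]])  # <s> He go to school . </s>
print(shift_tokens_right(labels, decoder_start_token_id=2))
# tensor([[  2,   0, 894, 213,   7, 334, 479]])  -> decoder input starts with </s>, and the first target is <s>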
I have data which is already labelled in SpaCy format. For example:
("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]})
But I want to try training it with any other NER model, such as BERT-NER, which requires IOB tagging instead. Is there any conversion code from SpaCy data format to IOB?
Thanks!
This is closely related to and mostly copied from https://stackoverflow.com/a/59209377/461847, see the notes in the comments there, too:
import spacy
from spacy.gold import biluo_tags_from_offsets
TRAIN_DATA = [
("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    # then convert L->I and U->B to have IOB tags for the tokens in the doc
    iob_tags = [t.replace("L-", "I-").replace("U-", "B-") for t in tags]
    docs.append((doc, iob_tags))
I am afraid you will have to write your own conversion, because the IOB encoding depends on what tokenization the pre-trained representation model (BERT, RoBERTa, or whatever pre-trained model you choose) uses.
The SpaCy format specifies the character span of the entity, i.e.
"Who is Shaka Khan?"[7:17]
will return "Shaka Khan". You need to match that to tokens used by the pre-trained model.
Here are examples of how different models tokenize the example sentence when you use Huggingface's Transformers.
BERT: ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']
RoBERTa: ['Who', '_is', '_Sh', 'aka', '_Khan', '?']
XLNet: ['▁Who', '▁is', '▁Shak', 'a', '▁Khan', '?']
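A minimal sketch of how to inspect this yourself (the checkpoint names here are assumptions; substitute whichever pre-trained model you actually use):

from transformers import AutoTokenizer

sentence = "Who is Shaka Khan?"
for name in ["bert-base-cased", "roberta-base", "xlnet-base-cased"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.tokenize(sentence))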
Once you know how the tokenizer works, you can implement the conversion. Something like this can work for BERT tokenization:
entities = [(7, 17, "PERSON")]
tokenized = ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']

cur_start = 0
state = "O"  # Outside
tags = []
for token in tokenized:
    # Deal with BERT's way of encoding spaces
    if token.startswith("##"):
        token = token[2:]
    else:
        token = " " + token
    cur_end = cur_start + len(token)
    if state == "O" and cur_start < entities[0][0] < cur_end:
        tags.append("B-" + entities[0][2])
        state = "I-" + entities[0][2]
    elif state.startswith("I-") and cur_start < entities[0][1] < cur_end:
        tags.append(state)
        state = "O"
        entities.pop(0)
    else:
        tags.append(state)
    cur_start = cur_end
Note that the snippet would break if one BERT token contained the end of one entity and the start of the following one. The tokenizer also does not distinguish how many spaces (or other whitespace characters) there were in the original string, which is a potential source of errors as well.
First, you need to convert your annotated JSON file to CSV.
Then you can run the code below to convert it into the spaCy v2 training format:
import pandas as pd

df = pd.read_csv('SC_CSV.csv')
l1 = []
l2 = []

for i in range(0, len(df['ner'])):
    l1.append(df['ner'][i])
    l2.append({"entities": [(0, len(df['ner'][i]), df['label'][i])]})

TRAIN_DATA = list(zip(l1, l2))
TRAIN_DATA
Now TRAIN_DATA is in the spaCy v2 format.
This helps to convert data from the old spaCy v2 format to the new spaCy v3 format.
import pandas as pd
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin
nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object
for text, annot in tqdm(TRAIN_DATA):  # data in previous format
    doc = nlp.make_doc(text)  # create doc object from text
    ents = []
    for start, end, label in annot["entities"]:  # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy")  # save the docbin object
I have faced this kind of problem.
What I did was transform the data to the spaCy binary format, then load the data from the DocBin object using this code:
import spacy
from spacy.tokens import DocBin
db = DocBin().from_disk("your_docbin_name.spacy")
nlp = spacy.blank("language_used")
Documents = list(db.get_docs(nlp.vocab))
Then this code may help you extract the IOB format from it:
for elem in Documents[0]:
    if elem.ent_iob_ != "O":
        print(elem.text, elem.ent_iob_, "-", elem.ent_type_)
    else:
        print(elem.text, elem.ent_iob_)
Here is an example of my output:
عبرت O
الديناميكية B - POLITIQUE
النسوية I - POLITIQUE
التي O
تأسست O
بعد O
25 O
جويلية O
2021 O
عن O
رفضها O
القطعي O
لمشروع O
تنقيح B - POLITIQUE
المرسوم B - POLITIQUE
عدد O
88 O
لسنة O
import spacy

data = TRAIN_DATA  # your data in the spaCy format shown above

nlp = spacy.blank("en")
for text, labels in data:
    doc = nlp(text)
    ents = []
    for start, end, label in labels["entities"]:
        ents.append(doc.char_span(start, end, label))
    doc.ents = ents
    for tok in doc:
        label = tok.ent_iob_
        if tok.ent_iob_ != "O":
            label += '-' + tok.ent_type_
        print(tok, label, sep="\t")
If you get a NoneType error, add a try block (depending on your dataset) or clean your dataset.
I have a text file with a sentence on each line:
eg ""Have you registered your email ID with your Bank Account?"
I want to classify it into interrogative or not. FYI these are sentences from bank websites.
I've seen this answer
with this nltk code block:
import nltk
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features
featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
So I did some preprocessing on my text file, i.e. stemming words, removing stop words, etc., to turn each sentence into a bag of words. From the code above, I have a trained classifier. How do I apply it to my text file of sentences (either raw or preprocessed)?
Update: here is an example of my text file.
Assuming that you have preprocessed the document data as we discussed, you can do the following:
import nltk
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features
featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(featuresets)
print(nltk.classify.accuracy(classifier, test_set))
0.668
For your data, train the classifier once and then iterate over your lines to predict; doing this for all lines in the text file works:

classifier = nltk.NaiveBayesClassifier.train(featuresets)
for line in lines:  # lines read from your text file
    print(classifier.classify(dialogue_act_features(line)))
I've trained a sentiment classifier model using the Keras library by following the steps below (broadly):
Convert the text corpus into sequences using the Tokenizer object/class
Build and train a model using the model.fit() method
Evaluate the model
Now, for scoring with this model, I was able to save the model to a file and load it from a file. However, I've not found a way to save the Tokenizer object to a file. Without this, I'll have to process the corpus every time I need to score even a single sentence. Is there a way around this?
The most common way is to use either pickle or joblib. Here is an example of how to use pickle to save the Tokenizer:
import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
The Tokenizer class has a function to save data in JSON format:

import io
import json

tokenizer_json = tokenizer.to_json()
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))
The data can be loaded using tokenizer_from_json function from keras_preprocessing.text:
import json
from keras_preprocessing.text import tokenizer_from_json

with open('tokenizer.json') as f:
    data = json.load(f)
    tokenizer = tokenizer_from_json(data)
The accepted answer clearly demonstrates how to save the tokenizer. The following is a comment on the problem of (generally) scoring after fitting or saving. Suppose that a list texts is comprised of two lists Train_text and Test_text, where the set of tokens in Test_text is a subset of the set of tokens in Train_text (an optimistic assumption). Then fit_on_texts(Train_text) gives different results for texts_to_sequences(Test_text) compared with first calling fit_on_texts(texts) and then texts_to_sequences(Test_text).
Concrete Example:
from keras.preprocessing.text import Tokenizer
docs = ["A heart that",
"full up like",
"a landfill",
"no surprises",
"and no alarms"
"a job that slowly"
"Bruises that",
"You look so",
"tired happy",
"no alarms",
"and no surprises"]
docs_train = docs[:7]
docs_test = docs[7:]
# EXPERIMENT 1: FIT TOKENIZER ONLY ON TRAIN
T_1 = Tokenizer()
T_1.fit_on_texts(docs_train) # only train set
encoded_train_1 = T_1.texts_to_sequences(docs_train)
encoded_test_1 = T_1.texts_to_sequences(docs_test)
print("result for test 1:\n%s" %(encoded_test_1,))
# EXPERIMENT 2: FIT TOKENIZER ON BOTH TRAIN + TEST
T_2 = Tokenizer()
T_2.fit_on_texts(docs) # both train and test set
encoded_train_2 = T_2.texts_to_sequences(docs_train)
encoded_test_2 = T_2.texts_to_sequences(docs_test)
print("result for test 2:\n%s" %(encoded_test_2,))
Results:
result for test 1:
[[3], [10, 3, 9]]
result for test 2:
[[1, 19], [5, 1, 4]]
Of course, if the above optimistic assumption is not satisfied and the set of tokens in Test_text is disjoint from that of Train_text, then test 1 results in a list of empty brackets [].
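To make that concrete (a tiny sketch reusing T_1 from the experiment above): tokens never seen during fit_on_texts are silently dropped, so a completely unseen sentence encodes to an empty list unless an oov_token is set.

print(T_1.texts_to_sequences(["words never seen during fitting"]))  # [[]]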
I've created the issue https://github.com/keras-team/keras/issues/9289 in the keras Repo. Until the API is changed, the issue has a link to a gist that has code to demonstrate how to save and restore a tokenizer without having the original documents the tokenizer was fit on. I prefer to store all my model information in a JSON file (because reasons, but mainly mixed JS/Python environment), and this will allow for that, even with sort_keys=True
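A minimal sketch of that idea (the file name and dictionary keys here are illustrations, not the gist's actual code): bundle the tokenizer's JSON config alongside the rest of your model information and write it with sort_keys=True, so no original documents are needed to restore it.

import json
from keras_preprocessing.text import tokenizer_from_json

# Hypothetical single file holding all model information.
model_info = {
    "model_config": model.to_json(),          # assumes a Keras model object named model
    "tokenizer_config": tokenizer.to_json(),  # serialised tokenizer state
}
with open('model_info.json', 'w') as f:
    json.dump(model_info, f, sort_keys=True)

# Restoring the tokenizer later, without the original corpus:
with open('model_info.json') as f:
    info = json.load(f)
tokenizer = tokenizer_from_json(info["tokenizer_config"])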
I found the following snippet, provided at the following link by #thusv89.
Save objects:
import pickle

with open('data_objects.pickle', 'wb') as handle:
    pickle.dump(
        {'input_tensor': input_tensor,
         'target_tensor': target_tensor,
         'inp_lang': inp_lang,
         'targ_lang': targ_lang,
         }, handle, protocol=pickle.HIGHEST_PROTOCOL)
Load objects:
with open("dataset_fr_en.pickle", 'rb') as f:
data = pickle.load(f)
input_tensor = data['input_tensor']
target_tensor = data['target_tensor']
inp_lang = data['inp_lang']
targ_lang = data['targ_lang']
Quite easy, because the Tokenizer class provides two functions for saving and loading:
save: Tokenizer.to_json()
load: keras.preprocessing.text.tokenizer_from_json
The to_json() method calls the get_config method, which handles this:
json_word_counts = json.dumps(self.word_counts)
json_word_docs = json.dumps(self.word_docs)
json_index_docs = json.dumps(self.index_docs)
json_word_index = json.dumps(self.word_index)
json_index_word = json.dumps(self.index_word)
return {
'num_words': self.num_words,
'filters': self.filters,
'lower': self.lower,
'split': self.split,
'char_level': self.char_level,
'oov_token': self.oov_token,
'document_count': self.document_count,
'word_counts': json_word_counts,
'word_docs': json_word_docs,
'index_docs': json_index_docs,
'index_word': json_index_word,
'word_index': json_word_index
}
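Putting those two functions together, a minimal save/load round trip could look like this (the file name is just an illustration):

# save
with open('tokenizer.json', 'w') as f:
    f.write(tokenizer.to_json())

# load
from keras.preprocessing.text import tokenizer_from_json
with open('tokenizer.json') as f:
    tokenizer = tokenizer_from_json(f.read())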