Removing tokens from the GPT tokenizer - python-3.x

How can I remove unwanted sub-tokens from the GPT vocabulary or tokenizer? I have tried an existing approach that was used for a RoBERTa-like model, as shown below (https://github.com/huggingface/transformers/issues/15032). However, it fails at the point of re-initializing the "model" component of the backend_tokenizer with the new vocabulary.
#1. Get your tokenizer and the list of tokens you want to remove
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# sub-tokens we want to remove from the vocabulary
unwanted_words = ['ply', 'Ġmor', 'Ġprovide', 'IC', 'ung', 'Ġparty', 'Ġexist', 'Ġmag']

#2. Get the arguments that were used to initialize the "model" component of the backend_tokenizer
model_state = json.loads(tokenizer.backend_tokenizer.model.__getstate__())
print(len(model_state["vocab"]))

#3. Modify the initialization arguments, in particular the vocabulary, to remove the tokens we don't want
for word in unwanted_words:
    del model_state["vocab"][word]
print(len(model_state["vocab"]))

#4. Initialize the "model" component of the backend_tokenizer again with the new vocabulary
from tokenizers import models

model_class = getattr(models, model_state.pop("type"))
tokenizer.backend_tokenizer.model = model_class(**model_state)
print(len(tokenizer.vocab))
And below is the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-21-fa908d23c419> in <module>
30 model_class = getattr(models, model_state.pop("type"))
31
---> 32 tokenizer.backend_tokenizer.model = model_class(**model_state)
33
34 print(len(tokenizer.vocab))
TypeError: argument 'merges': failed to extract enum PyMerges ('Merges | Filename')
- variant Merges (Merges): TypeError: failed to extract field PyMerges::Merges.0, caused by TypeError: 'str' object cannot be converted to 'PyTuple'
- variant Filename (Filename): TypeError: failed to extract field PyMerges::Filename.0, caused by TypeError: 'list' object cannot be converted to 'PyString'
What other methods can I use or refer to? The original script I adapted was written for a RoBERTa-like model, which uses SentencePiece, but GPT uses BPE.
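For what it's worth, the TypeError suggests that the serialized merges come back as plain strings ("left right") while the BPE constructor expects (left, right) tuples. Below is a minimal sketch of a possible fix, reusing model_state and unwanted_words from above; it assumes a recent tokenizers version and is not a verified solution (depending on the version you may also need to drop state keys the constructor does not accept):
from tokenizers import models

# Re-run steps 2-3 first so model_state still contains the "type" key.
# Assumption: each serialized merge is a single string like "Ġ t";
# convert it to the (left, right) tuple form the BPE constructor expects.
merges = [tuple(m.split(" ")) for m in model_state["merges"]]

# Also drop merges whose result is one of the removed tokens, otherwise
# BPE could still try to produce tokens that are no longer in the vocab.
merges = [(left, right) for left, right in merges
          if left + right not in unwanted_words]
model_state["merges"] = merges

model_class = getattr(models, model_state.pop("type"))
tokenizer.backend_tokenizer.model = model_class(**model_state)
print(len(tokenizer.vocab))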

Related

Tokenizer can add padding without error, but data collator cannot

I'm trying to fine-tune a GPT-2-based model on my data using the run_clm.py example script from HuggingFace.
I have a .json data file that looks like this:
...
{"text": "some text"}
{"text": "more text"}
...
I had to change the script's default behavior of concatenating the input texts, because all my examples are separate demonstrations that should not be concatenated:
def add_labels(example):
    example['labels'] = example['input_ids'].copy()
    return example

with training_args.main_process_first(desc="grouping texts together"):
    lm_datasets = tokenized_datasets.map(
        add_labels,
        batched=False,
        # batch_size=1,
        num_proc=data_args.preprocessing_num_workers,
        load_from_cache_file=not data_args.overwrite_cache,
        desc=f"Grouping texts in chunks of {block_size}",
    )
This essentially only adds the appropriate 'labels' field required by CLM.
However, since GPT-2 has a 1024-token context window, the examples should be padded to that length.
I can achieve this by modifying the tokenization procedure like this:
def tokenize_function(examples):
    with CaptureLogger(tok_logger) as cl:
        output = tokenizer(
            examples[text_column_name], padding='max_length')  # added: padding='max_length'
    # ...
The training runs correctly.
However, I believe this should not be done by the tokenizer, but by the data collator instead. When I remove padding='max_length' from the tokenizer, I get the following error:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
And also, above that:
Traceback (most recent call last):
File "/home/jan/repos/text2task/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 716, in convert_to_tensors
tensor = as_tensor(value)
ValueError: expected sequence of length 9 at dim 1 (got 33)
During handling of the above exception, another exception occurred:
To fix this, I have created a data collator that should do the padding:
data_collator = DataCollatorWithPadding(tokenizer, padding='max_length')
This is what is passed to the trainer. However, the above error remains.
What's going on?
I managed to fix the error but I'm really unsure about my solution, details below. Will accept a better answer.
This seems to solve it:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)
Found in the documentation
It seems like DataCollatorWithPadding doesn't pad the labels?
My problem is about generating an output sequence from an input sequence, so I'm guessing that DataCollatorForSeq2Seq is what I actually want to use. However, my data does not have separate input and target columns, only a single text column (that contains a string "input => target"). I'm not really sure that this collator is intended to be used with GPT-2...
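For comparison, a more conventional setup for causal-LM padding (a sketch using the stock run_clm.py variable names, not taken from the original answer) is to drop the manual add_labels step and let DataCollatorForLanguageModeling with mlm=False pad the batch and build the labels; padded positions become -100 so they are ignored by the loss:
from transformers import DataCollatorForLanguageModeling

# GPT-2 has no pad token by default, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token

# mlm=False -> causal LM: the collator pads input_ids/attention_mask and
# builds labels from input_ids, masking padded positions with -100
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)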

ElasticSearch | TypeError: string indices must be integers

I'm using this Notebook, where section Apply DocumentClassifier is altered as below.
Jupyter Labs, kernel: conda_mxnet_latest_p37.
I understand the error means I'm indexing with a str where an int is expected. However, this should not be a problem, as it works with other .pdf/.txt files from the original Notebook.
Code Cell:
doc_dir = "GRIs/"  # contains 2 .pdfs

with open('filt_gri.txt', 'r') as filehandle:
    tags = [current_place.rstrip() for current_place in filehandle.readlines()]

doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
                                                task="zero-shot-classification",
                                                labels=tags,
                                                batch_size=2)

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]

# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)

# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())

all_docs = convert_files_to_dicts(dir_path=doc_dir)

preprocessor_sliding_window = PreProcessor(split_overlap=3,
                                           split_length=10,
                                           split_respect_sentence_boundary=False,
                                           split_by='passage')
Output Error:
INFO - haystack.modeling.utils - Using devices: CUDA
INFO - haystack.modeling.utils - Number of GPUs: 1
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-82b54cd162ff> in <module>
14
15 # classify using gpu, batch_size makes sure we do not run out of memory
---> 16 classified_docs = doc_classifier.predict(docs_to_classify)
17
18 # let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/haystack/nodes/document_classifier/transformers.py in predict(self, documents)
144 for prediction, doc in zip(predictions, documents):
145 if self.task == 'zero-shot-classification':
--> 146 prediction["label"] = prediction["labels"][0]
147 doc.meta["classification"] = prediction
148
TypeError: string indices must be integers
Please let me know if there is anything else I should add to post/ clarify.
I swapped out variable docs_sliding_window with my_dsw.
my_dsw keeps only the documents whose content is at most 1000 characters long. This helps the shape of my data fit better.
my_dsw = []
for d in docs_sliding_window:
    if len(d['content']) <= 1000:
        my_dsw.append(d)
Swapping it out in docs_to_classify line:
# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in my_dsw]
Admittedly, I'm not sure how this relates specifically to the error, but it does help the data fit better; I can now increase batch_size to 4.
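For reference, the same filter can be written as a single list comprehension (a sketch under the same assumption that each entry has a 'content' field):
# keep only sliding-window documents whose content is at most 1000 characters
my_dsw = [d for d in docs_sliding_window if len(d['content']) <= 1000]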

Get TypeError when using TfidfVectorizer in python

I'm new to Python and I need your help.
I'm working with NLP, and I want to classify a field that is a string.
I read the dataset
data = pd.read_csv("dataset.csv",sep=';',encoding='latin-1',error_bad_lines=False)
tokenize the field
data['campo']= data['campo'].str.split()
the output is:
1- [Su, inexperto, personal] 2- [Atención, al, cliente]
In most of the tutorials I've checked on the internet, tokenizing returns the separated words wrapped in apostrophes (quotes).
The problem is that when I want to vectorize (TfidfVectorizer), I get an error, and I think my problem is here.
Can you help me? Why don't I get the tokens with apostrophes?
After executing this, I try to vectorize the field:
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(data['campo'])
From here, I get the error:
AttributeError: 'list' object has no attribute 'lower'
I thought it was because of the lowercasing, so I added:
Tfidf_vect = TfidfVectorizer(lowercase=False, max_features=5000)
Tfidf_vect.fit(data['campo'])
and from there it throws:
TypeError: expected string or bytes-like object
Do you know what the problem is?
Do not tokenize your text before feeding it into TfidfVectorizer(), which means you have to remove the following line from your code:
data['campo'] = data['campo'].str.split()
TfidfVectorizer does the tokenization internally. Try the following lines of code directly:
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(data['campo'])
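If you do need to keep the pre-tokenized column for other steps, two possible workarounds (a sketch, not part of the original answer) are to rejoin the tokens before fitting, or to pass an identity analyzer so the vectorizer skips its own tokenization:
from sklearn.feature_extraction.text import TfidfVectorizer

# Option 1: rejoin the tokenized column into plain strings before fitting
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(data['campo'].str.join(' '))

# Option 2: tell the vectorizer the documents are already lists of tokens
Tfidf_vect = TfidfVectorizer(analyzer=lambda tokens: tokens, max_features=5000)
Tfidf_vect.fit(data['campo'])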

TypeError: '<' not supported between instances of 'NoneType' and 'str' using Pyner for Name entity recognition

I am trying to pass an email string to Pyner to pull out all the entities into a dictionary. I can verify my setup works with this example, which returns two PERSON entities:
import ner
tagger = ner.SocketNER(port=9191, output_format='slashTags')
t = "My daughter Sophia goes to the university of California. James also goes there"
print(type(t))
test = tagger.get_entities(t)
person_ents = test['PERSON']
for i in person_ents:
    print(i)
This outputs as expected
Sophia
James
The only difference here is that I have email text instead. I can verify it's a string:
print(type(firstEmail))
test = tagger.get_entities(firstEmail)
person_ents = test['PERSON']
print (type(person_ents))
for i in person_ents:
    print(i)
This returns the following error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-79-ff847452c8df> in <module>()
3
4
----> 5 test = tagger.get_entities(firstEmail)
6 person_ents = test['PERSON']
7 print (type(person_ents))
~/anaconda3/envs/nlp/lib/python3.6/site-packages/ner-0.1-py3.6.egg/ner/client.py in get_entities(self, text)
90 else: #inlineXML
91 entities = self.__inlineXML_parse_entities(tagged_text)
---> 92 return self.__collapse_to_dict(entities)
93
94 def json_entities(self, text):
~/anaconda3/envs/nlp/lib/python3.6/site-packages/ner-0.1-py3.6.egg/ner/client.py in __collapse_to_dict(self, pairs)
71 """
72 return dict((first, list(map(itemgetter(1), second))) for (first, second)
---> 73 in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)))
74
75 def get_entities(self, text):
TypeError: '<' not supported between instances of 'NoneType' and 'str'
Any idea what's wrong?
The issue here is that the NER client is set up so that when the output format is slashTags it returns a dictionary. However, the tagged text uses a slash character to mark where a named entity occurs, and that character is then used to split the entities out before the dictionary is built. As a result, if any slashes occur in your text data, you need to strip them out first.
Something like
#text is your string
text = text.replace('/', '-')
This shouldn't be an issue in NLP terms, as dates should still be picked out in this format. But if some key part of your analysis requires the slashes to be there, this solution might not be suitable. I can't verify whether this issue exists in the Java implementation, but it's possible.
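Applied to the example above, the workaround would look roughly like this (a sketch, reusing the tagger and firstEmail from the question):
# strip slashes so the slashTags output can be split back into entities
clean_text = firstEmail.replace('/', '-')

test = tagger.get_entities(clean_text)
person_ents = test.get('PERSON', [])
for name in person_ents:
    print(name)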

LdaModel - random_state parameter not recognized - gensim

I'm using gensim's LdaModel, which, according to the documentation, has the parameter random_state. However, I'm getting an error that says:
TypeError: __init__() got an unexpected keyword argument 'random_state'
Without the random_state parameter, the function works as expected. So, the workflow looks like this, for those who want to know what else is happening...
from gensim import corpora, models
import numpy as np
# pseudo code of text pre-processing all on "comments" variable
# stop words
# remove punctuation (optional)
# keep alpha only
# stemming
# get bigrams and integrate with corpus (gensim makes this very easy)
dictionary = corpora.Dictionary(comments)
corpus = [dictionary.doc2bow(comm) for comm in comments]
tfidf = models.TfidfModel(corpus) # change weights
corp_tfidf = tfidf[corpus] # apply them to corpus
# set random seed
random_seed = 135
state = np.random.RandomState(random_seed)
# train model
num_topics = 3
lda_mod = models.LdaModel(corp_tfidf,            # corpus
                          num_topics=num_topics, # number of topics we want back
                          id2word=dictionary,    # our id-word map
                          passes=10,             # how many passes to take over the data
                          random_state=state)    # reproduce the results
Which results in the error message above...
TypeError: __init__() got an unexpected keyword argument 'random_state'
I'd like to be able to recreate my results, if possible.
According to this, the random_state parameter was added in the latest version (0.13.2). You can update your gensim installation with pip install gensim --upgrade. You might need to update scipy first, because it caused problems for me.
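After upgrading, the call from the question should work as written; passing the integer seed directly is also accepted (a sketch, assuming gensim >= 0.13.2):
# random_state accepts either an int seed or a numpy RandomState object
lda_mod = models.LdaModel(corp_tfidf,
                          num_topics=num_topics,
                          id2word=dictionary,
                          passes=10,
                          random_state=random_seed)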
