Loading saved NER back into HuggingFace pipeline? - nlp

I am doing some research into HuggingFace's functionalities for transfer learning (specifically, for named entity recognition). To preface, I am a bit new to transformer architectures. I briefly walked through their example off of their website:
from transformers import pipeline
nlp = pipeline("ner")
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge which is visible from the window."
print(nlp(sequence))
What I would like to do is save and run this locally without having to download the "ner" model every time (which is over 1 GB in size). In their documentation, I see that you can save the pipeline using the "pipeline.save_pretrained()" function to a local folder. The results of this are various files which I am storing into a specific folder.
My question would be how can I load this model back up into a script to continue classifying as in the example above after saving? The output of "pipeline.save_pretrained()" is multiple files.
Here is what I have tried so far:
1: Following the documentation about pipeline
pipe = transformers.TokenClassificationPipeline(model="pytorch_model.bin", tokenizer='tokenizer_config.json')
The error I got was: 'str' object has no attribute "config"
2: Following HuggingFace example on ner:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model = AutoModelForTokenClassification.from_pretrained("path to folder following .save_pretrained()")
tokenizer = AutoTokenizer.from_pretrained("path to folder following .save_pretrained()")
label_list = [
"O", # Outside of a named entity
"B-MISC", # Beginning of a miscellaneous entity right after another miscellaneous entity
"I-MISC", # Miscellaneous entity
"B-PER", # Beginning of a person's name right after another person's name
"I-PER", # Person's name
"B-ORG", # Beginning of an organisation right after another organisation
"I-ORG", # Organisation
"B-LOC", # Beginning of a location right after another location
"I-LOC" # Location
]
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge."
# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)
print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])
This yields an error: list index out of range
I also tried printing out just predictions which is not returning the text format of the tokens along with their entities.
Any help would be much appreciated!

Loading a model like this has always worked for me:
from transformers import pipeline
pipe = pipeline('token-classification', model=model_folder, tokenizer=model_folder)
Have a look at here for further examples on how to use pipelines.

Related

How to download pipeline code from Pycaret AutoML into .py files?

PyCaret seems like a great AutoML tool. It works, fast and simple and I would like to download the generated pipeline code into .py files to double check and if needed to customize some parts. Unfortunately, I don't know how to make it real. Reading the documentation have not helped. Is it possible or not?
It is not possible to get the underlying code since PyCaret takes care of this for you. But it is up to you as the user to decide the steps that you want your flow to take e.g.
# Setup experiment with user-defined options for preprocessing, etc.
setup(...)
# Create a model (uses training split only)
model = create_model("lr")
# Tune hyperparameters (user can pass a custom tuning grid if needed)
# Again, uses training split only
tuned = tune_model(model, ...)
# Finalize model (so that the best hyperparameters are retrained on the entire dataset
finalize_model(tuned)
# Any other steps you would like to do.
...
Finally, you can save the entire pipeline as a pkl file for use later
# Saves the model + pipeline as a pkl file
save_model(final, "my_best_model")
You may get a partial answer: incomplete with 'get_config("prep_pipe")' in 2.6.10 or in 3.0.0rc1
Just run a setup like in examples, store as a cdf1, and try cdf.pipeline and you may get a text like this: Pipeline(..)
When working with pycaret=3.0.0rc4, you have two options.
Option 1:
get_config("pipeline")
Option 2:
lb = get_leaderboard()
lb.iloc[0]['Model']
Option 1 will give you the transformations done to the data whilst option 2 will give you the same plus the model and its parameters.
Here's some sample code (from a notebook, based on their documentation on the Binary Classification Tutorial (CLF101) - Level Beginner):
from pycaret.datasets import get_data
from pycaret.classification import *
dataset = get_data('credit')
data = dataset.sample(frac=0.95, random_state=786).reset_index(drop=True)
data_unseen = dataset.drop(data.index).reset_index(drop=True)
exp_clf101 = setup(data = data, target = 'default', session_id=123)
best = compare_models()
evaluate_model(best)
# OPTION 1
get_config("pipeline")
# OPTION 2
lb = get_leaderboard()
lb.iloc[0]['Model']

Pre-trained FastText hyperparameters

I'm using the pre-trained model:
import fasttext.util
fasttext.util.download_model('en', if_exists='ignore') # English
ft = fasttext.load_model('cc.en.300.bin')
Where can I find an exhaustive list of the values of the hyperparameters used to train the model?
https://fasttext.cc/docs/en/options.html list the default values, that differ from the used one: for example, the dimension of the word vectors is 300 and not 100 (citing https://fasttext.cc/docs/en/crawl-vectors.html that doesn't list them all).
From looking at the _FastText Python model class in Facebook's source...
https://github.com/facebookresearch/fastText/blob/a20c0d27cd0ee88a25ea0433b7f03038cd728459/python/fasttext_module/fasttext/FastText.py#L99
...it looks like, at least when creating a model, all the hyperparameters are added as attributes on the object.
Have you checked if that's the case on your loaded model? For example, does ft.dim report 300, and other parameters like ft.minCount report anything interesting?
Update: As that didn't seem to work, it also looks like the _FastText model wraps an internal instance of a native (not-in-Python) FastText model in its .f attribute. (See a few lines up from the source code I pointed to earlier.)
And that native-instance is set up by the module specified by fasttext_pybind.cc. That code looks like it specified a bunch of read-write class variable, associated with the metaparameters - see for example starting at:
https://github.com/facebookresearch/fastText/blob/a20c0d27cd0ee88a25ea0433b7f03038cd728459/python/fasttext_module/fasttext/pybind/fasttext_pybind.cc#L88
So: does ft.f.minCount or ft.f.dim return anything useful from a post-loaded model ft?
Citing NVS Abhilash from https://github.com/facebookresearch/fastText/issues/887#issuecomment-649018188 the right code to write is:
args_obj = ft.f.getArgs()
for hparam in dir(args_obj):
if not hparam.startswith('__'):
print(f"{hparam} -> {getattr(args_obj, hparam)}")
This will print all the hyperparameters of the trained model!

How to find back the architecture of a pytorch model having only the weight dictionnary?

I wanted to use the multilingual-codesearch model but first the code doesn't work and outputs the following error which suggest that it cannot load with only weights:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("ncoop57/multilingual-codesearch")
model = AutoModel.from_pretrained("ncoop57/multilingual-codesearch")
ValueError: Unrecognized model in ncoop57/multilingual-codesearch. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: gpt_neo, big_bird, speech_to_text, vit, wav2vec2, m2m_100, convbert, led, blenderbot-small, retribert, ibert, mt5, t5, mobilebert, distilbert, albert, bert-generation, camembert, xlm-roberta, pegasus, marian, mbart, mpnet, bart, blenderbot, reformer, longformer, roberta, deberta-v2, deberta, flaubert, fsmt, squeezebert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm-prophetnet, prophetnet, xlm, ctrl, electra, encoder-decoder, funnel, lxmert, dpr, layoutlm, rag, tapas
Then I downloaded the pytorch bin file but it only contains the weight dictionnary (state dictionnary as mentioned here), which means that if I want to use the model I have to initialize the good architecture and then load the weights.
But how am I supposed to find the architecture fitting the weight of a model that complex ? I saw that some method could find back the model based on the weight dictionnary but I didn't manage to make them work (I think about enter link description here).
How can one find back the architecture of a weight dictionnary in order to make the model work ? Is it even possible ?

Inception v3, train test and validation splits in retrain.py

I have been using tensorflow's retrain.py script for the inception v3 model to build a multilabel classification model, I found some code in the script that I don't fully understand, I'm asking for help clarifying what it does:
for file_name in file_list:
base_name = os.path.basename(file_name)
# We want to ignore anything after '_nohash_' in the file name when
# deciding which set to put an image in, the data set creator has a way of
# grouping photos that are close variations of each other. For example
# this is used in the plant disease data set to group multiple pictures of
# the same leaf.
hash_name = re.sub(r'_nohash_.*$', '', file_name)
# This looks a bit magical, but we need to decide whether this file should
# go into the training, testing, or validation sets, and we want to keep
# existing files in the same set even if more files are subsequently
# added.
# To do that, we need a stable way of deciding based on just the file name
# itself, so we do a hash of that and then use that to generate a
# probability value that we use to assign it.
hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
percentage_hash = ((int(hash_name_hashed, 16) %
(MAX_NUM_IMAGES_PER_CLASS + 1)) *
(100.0 / MAX_NUM_IMAGES_PER_CLASS))
if percentage_hash < validation_percentage:
validation_images.append(base_name)
elif percentage_hash < (testing_percentage + validation_percentage):
testing_images.append(base_name)
else:
training_images.append(base_name)
Is there a way to name the images to decide which ones go to the train set, validation set or test set? If someone has a link for the plant disease dataset mentioned here that would also be helpful, thanks!

Extract human name from his CV in Python

As you all know names of persons normally on the top of their resume, so i did NER(name entity recognition) tagging using spaCy library on CV's and then i extract the first tag of PERSON (hoping it should be Human Name). Some time it works for me fine but some time it gives me Other things which are not names(because spaCy don't even recognize some names with any NER tag), so it is giving me some other things which it recognize as a PERSON it may be like 'Curriculam vitae' obviously this i don't want.
Following is a Code for which i was talking above...
import spacy
import docx2txt
nlp = spacy.load('en_default')
my_text = docx2txt.process("/home/waqar/CV data/Adnan.docx")
doc_2 = nlp(my_text)
for ent in doc_2.ents:
if ent.label_ == "PERSON":
print('{}'.format(ent))
break
Is there any way through which i can add some name to NER for 'PERSON' tag in spaCy because then it will be able to recognize human names written in CV's
i think my logic is fine but something i am missing....
I would be very thankful if u peoples help me as i am Student and a beginer in python hope u peoples will definitely suggest some thing
OutPut
Abdul Ahad Ghous
but some time it giving me OutPuts like following as NER recognize it as a PERSON and don't even give any tag to the human name in this CV.
Curriculum Vitae

Resources