I am new to NLP and spaCy. I am using the en_core_web_md model. I am loading it using spacy.load()
Whenever I run my program, it loads the model all over again. Is there a way to load the model just once and reuse it for all subsequent runs?
Yes, you can. See the example code below:
import spacy

nlp = spacy.load('en_core_web_md')  # make sure to use larger model!
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
Keep the variable nlp around, since it contains the loaded model, and you can reuse nlp again and again by simply passing it to whatever function needs it.
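For example, here is a minimal sketch of that pattern (the helper functions are hypothetical), where the model is loaded once and the nlp object is passed to every function that needs it:

import spacy

nlp = spacy.load('en_core_web_md')  # load once at startup

def count_tokens(nlp, text):
    # Reuses the already-loaded pipeline instead of calling spacy.load() again
    return len(nlp(text))

def lemmatize(nlp, text):
    return [token.lemma_ for token in nlp(text)]

print(count_tokens(nlp, 'dog cat banana'))
print(lemmatize(nlp, 'dogs are running'))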
I want to indicate that certain parameters are spaCy return objects. For example, corpus is an item returned by calling nlp. Assume that hh is located in a different module.
import spacy

nlp = spacy.load("en_core_web_sm")

def hh(corpus):
    pass

result = hh(nlp(text))
How do I annotate parameters with spaCy's built-in objects, for instance something like corpus: spacy_token? I had a hard time figuring this out; I tried this and that, but with no result. Properties like lemma_ are not detected in VS Code.
You just need to import the spaCy classes and use them like regular type annotations. Note that calling nlp(text) returns a Doc (a Token is a single item within it), so Doc is the right annotation here:
from spacy.tokens import Doc

import spacy

nlp = spacy.load("en_core_web_sm")

def hh(corpus: Doc):
    pass

result = hh(nlp(text))
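As a side note on the VS Code issue: once the items you iterate over are known to be Token objects (iterating a Doc yields Tokens), attributes such as lemma_ should be detected. A minimal sketch building on the snippet above:

from spacy.tokens import Doc, Token

def lemmas(corpus: Doc):
    result = []
    for token in corpus:  # iterating a Doc yields Token objects
        result.append(token.lemma_)  # now resolvable by the editor
    return result

print(lemmas(nlp("The dogs were running")))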
spaCy doesn't support the Arabic language, but can I use spaCy with the pretrained AraBERT model? Is it possible to modify this code so it accepts bert-large-arabertv02 instead of en_core_web_lg?
!python -m spacy download en_core_web_lg
import spacy
nlp = spacy.load("en_core_web_lg")
Here is how we call AraBERTv02:
from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "aubmindlab/bert-large-arabertv02"
arabert_prep = ArabertPreprocessor(model_name=model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
spaCy actually does support Arabic, though only at an alpha level, which basically just means tokenization support (see here). That's enough for loading external models or training your own, though, so in this case you should be able to load this like any HuggingFace model - see this FAQ.
In this case this would look like:
import spacy

nlp = spacy.blank("ar")  # blank Arabic pipeline

# create the config with the name of your model
# values omitted will get default values
config = {
    "model": {
        "@architectures": "spacy-transformers.TransformerModel.v3",
        "name": "aubmindlab/bert-large-arabertv02"
    }
}

nlp.add_pipe("transformer", config=config)
nlp.initialize()  # XXX don't forget this step!

doc = nlp("فريك الذرة لذيذة")
print(doc._.trf_data)  # all the Transformer output is stored here
I don't speak Arabic, so I can't check the output thoroughly, but that code ran and produced an embedding for me.
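If you want the raw embeddings out of that, a small hedged follow-up (assuming spacy-transformers v1.x, where doc._.trf_data is a TransformerData object whose tensors attribute holds the model's output arrays):

# Inspect the transformer output stored on the Doc
trf_data = doc._.trf_data
for i, tensor in enumerate(trf_data.tensors):
    # tensors[0] is typically the per-wordpiece hidden states
    print(i, tensor.shape)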
PyCaret seems like a great AutoML tool. It works fast and simply, and I would like to export the generated pipeline code into .py files to double-check it and, if needed, customize some parts. Unfortunately, I don't know how to make that happen, and reading the documentation has not helped. Is it possible or not?
It is not possible to get the underlying code, since PyCaret takes care of this for you. But it is up to you as the user to decide the steps that you want your flow to take, e.g.:
# Setup experiment with user-defined options for preprocessing, etc.
setup(...)

# Create a model (uses training split only)
model = create_model("lr")

# Tune hyperparameters (user can pass a custom tuning grid if needed)
# Again, uses training split only
tuned = tune_model(model, ...)

# Finalize the model (so that the best hyperparameters are retrained on the entire dataset)
final = finalize_model(tuned)

# Any other steps you would like to do.
...
Finally, you can save the entire pipeline as a pkl file for use later
# Saves the model + pipeline as a pkl file
save_model(final, "my_best_model")
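Later you can reload that pkl and run it on new data; here is a minimal hedged sketch using PyCaret's load_model and predict_model (module shown for classification, and new_data is a hypothetical DataFrame with the same columns as the training data):

from pycaret.classification import load_model, predict_model

# Reload the pipeline + model saved as "my_best_model.pkl" above
loaded = load_model("my_best_model")

# Apply the full pipeline (preprocessing + model) to unseen data
# new_data is a hypothetical DataFrame with the training columns
predictions = predict_model(loaded, data=new_data)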
You may get a partial (incomplete) answer with get_config("prep_pipe") in 2.6.10 or in 3.0.0rc1. Just run a setup like in the examples, store the result in a variable such as cdf1, and try cdf1.pipeline; you should get output like Pipeline(...).
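A minimal hedged sketch of that idea (assuming PyCaret 2.x, where the config key is "prep_pipe"; in 3.x it is "pipeline", as the next answer shows, and data/target are placeholders matching the sample code below):

from pycaret.classification import setup, get_config

# cdf1 holds the setup result; data and target are placeholders
cdf1 = setup(data=data, target="default", session_id=123)

# PyCaret 2.x key; prints something like Pipeline(steps=[...])
print(get_config("prep_pipe"))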
When working with pycaret=3.0.0rc4, you have two options.
Option 1:
get_config("pipeline")
Option 2:
lb = get_leaderboard()
lb.iloc[0]['Model']
Option 1 will give you the transformations done to the data whilst option 2 will give you the same plus the model and its parameters.
Here's some sample code (from a notebook, based on their documentation on the Binary Classification Tutorial (CLF101) - Level Beginner):
from pycaret.datasets import get_data
from pycaret.classification import *
dataset = get_data('credit')
data = dataset.sample(frac=0.95, random_state=786).reset_index(drop=True)
data_unseen = dataset.drop(data.index).reset_index(drop=True)
exp_clf101 = setup(data = data, target = 'default', session_id=123)
best = compare_models()
evaluate_model(best)
# OPTION 1
get_config("pipeline")
# OPTION 2
lb = get_leaderboard()
lb.iloc[0]['Model']
I am doing some research into HuggingFace's functionalities for transfer learning (specifically, for named entity recognition). To preface, I am a bit new to transformer architectures. I briefly walked through their example from their website:
from transformers import pipeline
nlp = pipeline("ner")
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge which is visible from the window."
print(nlp(sequence))
What I would like to do is save and run this locally without having to download the "ner" model every time (which is over 1 GB in size). In their documentation, I see that you can save the pipeline using the "pipeline.save_pretrained()" function to a local folder. The results of this are various files which I am storing into a specific folder.
My question is: how can I load this model back into a script to continue classifying, as in the example above, after saving? The output of "pipeline.save_pretrained()" is multiple files.
Here is what I have tried so far:
1: Following the documentation about pipeline
pipe = transformers.TokenClassificationPipeline(model="pytorch_model.bin", tokenizer='tokenizer_config.json')
The error I got was: 'str' object has no attribute "config"
2: Following HuggingFace example on ner:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model = AutoModelForTokenClassification.from_pretrained("path to folder following .save_pretrained()")
tokenizer = AutoTokenizer.from_pretrained("path to folder following .save_pretrained()")
label_list = [
"O", # Outside of a named entity
"B-MISC", # Beginning of a miscellaneous entity right after another miscellaneous entity
"I-MISC", # Miscellaneous entity
"B-PER", # Beginning of a person's name right after another person's name
"I-PER", # Person's name
"B-ORG", # Beginning of an organisation right after another organisation
"I-ORG", # Organisation
"B-LOC", # Beginning of a location right after another location
"I-LOC" # Location
]
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge."
# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)
print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])
This yields an error: list index out of range
I also tried printing out just predictions, which does not return the text of the tokens along with their entities.
Any help would be much appreciated!
Loading a model like this has always worked for me:
from transformers import pipeline
pipe = pipeline('token-classification', model=model_folder, tokenizer=model_folder)
Have a look here for further examples of how to use pipelines.
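To make the round trip explicit, here is a hedged end-to-end sketch (the folder name ./ner_model is arbitrary): download the model once, save it locally, and load it from disk on every later run:

from transformers import pipeline

# First run: download the default NER model once and save it locally
ner = pipeline("ner")
ner.save_pretrained("./ner_model")

# Later runs: load from the local folder, no download needed
pipe = pipeline("token-classification", model="./ner_model", tokenizer="./ner_model")
print(pipe("Hugging Face Inc. is a company based in New York City."))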
After I call this function like this:
from tensorflow.contrib import learn

# ----------------------------------------
# Do some processing here
# ----------------------------------------

classifier = learn.Estimator(model_fn=bag_of_words_model, model_dir='F:/data')
classifier.fit(feature_train, target_train, steps=1000)
I will have some files in my folder F:/data (checkpoints, graph definition, and event files).
And I wonder: is there any way to reuse this model, e.g. move it to a new computer and use it to predict on new data? Sorry for my bad English. Thanks for all the answers! Hope you all have a nice day.
In a script where you want to reuse your model, redefine/import bag_of_words_model again and define
classifier_loaded = learn.Estimator(model_fn=bag_of_words_model, model_dir='F:/data')
TensorFlow will reload the graph, import the weights into it, and you can just use it as
classifier_loaded.predict(input_fn=input_fn)
or continue the training with
classifier_loaded.fit(feature_train, target_train)
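For completeness, a hedged sketch of a prediction input_fn, assuming TF 1.x with tf.contrib.learn and that bag_of_words_model reads a feature named "words" (both the feature name and the dummy shapes below are assumptions; adjust them to your model):

import numpy as np
from tensorflow.contrib import learn

# Hypothetical new data: a word-id matrix for two new documents
feature_new = np.zeros((2, 100), dtype=np.int64)

input_fn = learn.io.numpy_input_fn(
    x={"words": feature_new},  # "words" is an assumed feature name
    num_epochs=1,
    shuffle=False)

classifier_loaded = learn.Estimator(model_fn=bag_of_words_model, model_dir='F:/data')
predictions = list(classifier_loaded.predict(input_fn=input_fn))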