Using Arabert model with SpaCy - nlp

SpaCy doesn't support the Arabic language, but can I use spaCy with the pretrained AraBERT model?
Is it possible to modify this code so it can accept bert-large-arabertv02 instead of en_core_web_lg?
!python -m spacy download en_core_web_lg
import spacy
nlp = spacy.load("en_core_web_lg")
Here is how we can call AraBERTv02:
from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_name="aubmindlab/bert-large-arabertv02"
arabert_prep = ArabertPreprocessor(model_name=model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

spaCy actually does support Arabic, though only at an alpha level, which basically just means tokenization support (see here). That's enough for loading external models or training your own, though, so in this case you should be able to load this like any HuggingFace model - see this FAQ.
In this case this would look like:
import spacy
nlp = spacy.blank("ar") # blank Arabic pipeline
# create the config with the name of your model
# values omitted will get default values
config = {
    "model": {
        "@architectures": "spacy-transformers.TransformerModel.v3",
        "name": "aubmindlab/bert-large-arabertv02"
    }
}
nlp.add_pipe("transformer", config=config)
nlp.initialize() # XXX don't forget this step!
doc = nlp("فريك الذرة لذيذة")
print(doc._.trf_data) # all the Transformer output is stored here
I don't speak Arabic, so I can't check the output thoroughly, but that code ran and produced an embedding for me.
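If you want the raw vectors, here is a minimal sketch of how you might pull them out of that object, assuming spacy-transformers 1.1.x where doc._.trf_data is a TransformerData; the attribute layout differs in other versions:
trf_data = doc._.trf_data
last_hidden = trf_data.tensors[0]  # typically shape (n_spans, n_wordpieces, hidden_size)
print(last_hidden.shape)
print(trf_data.wordpieces.strings[0])  # wordpiece strings for this Doc
print(trf_data.align.lengths)  # how many wordpiece rows each spaCy token maps to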

Related

Using spaCy Specific Type Annotations (Python 3.11)

I want to indicate that certain parameters are spaCy return objects. For example, corpus is an item returned by calling nlp. Assume that hh is located in a different module.
import spacy
nlp = spacy.load("en_core_web_sm")
def hh(corpus):
    pass
result = hh(nlp(text))
How do I annotate parameters with spaCy's built-in types, for instance something like corpus: spacy_token?
I had a hard time figuring this out. I tried this and that, but with no result. Properties like lemma_ are not detected in VS Code.
You just need to import the spaCy classes and use them like regular type annotations. Since nlp(text) returns a Doc, annotate the parameter as Doc (Token is the type of the individual tokens inside it):
import spacy
from spacy.tokens import Doc
nlp = spacy.load("en_core_web_sm")
text = "This is an example."
def hh(corpus: Doc):
    pass
result = hh(nlp(text))
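The same pattern works for the other spaCy types. A short sketch with hypothetical helper names, annotating Doc, Span, and Token so that editors can resolve attributes such as lemma_:
from spacy.tokens import Doc, Span, Token
def first_sentence(doc: Doc) -> Span:
    # needs a pipeline that sets sentence boundaries (parser or sentencizer)
    return next(doc.sents)
def lemma_of(token: Token) -> str:
    # with the annotation in place, editors can autocomplete .lemma_
    return token.lemma_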

How to solve the import error when trying to import 'SentenceSegmenter' from the 'spacy.pipeline' package?

ImportError: cannot import name 'SentenceSegmenter' from 'spacy.pipeline'
spaCy version: 3.2.1
I know this class is from an earlier version of spaCy, but is there something similar in this version?
There are several methods to perform sentence segmentation in spacy. You can read about these in the docs here: https://spacy.io/usage/linguistic-features#sbd.
This example is copied as-is from the docs, showing how to segment sentences based on an English language model.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
assert doc.has_annotation("SENT_START")
for sent in doc.sents:
    print(sent.text)
You can also use the rule-based sentencizer, which splits on punctuation only and needs no trained pipeline, like so (also from the docs):
import spacy
from spacy.lang.en import English
nlp = English()  # just the language with no pipeline
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
This should work for spacy 3.0.5 and above.
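If what you actually need are the custom splitting rules the old SentenceSegmenter allowed, spaCy 3 handles that with a custom pipeline component that sets is_sent_start before the parser runs. A sketch adapted from the same docs page (the "..." rule is just an example):
import spacy
from spacy.language import Language
@Language.component("custom_sentencizer")
def custom_sentencizer(doc):
    for token in doc[:-1]:
        # force a sentence start after an ellipsis token (example rule)
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("custom_sentencizer", before="parser")  # must run before the parser
doc = nlp("This is. A sentence. ... Another one.")
for sent in doc.sents:
    print(sent.text)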

lemmatizer is not working in python spacy library

I am trying to create a small chatbot using the spaCy library. When I use the lemmatizer, the code gives incorrect output. Can someone help me?
Below is my code:
import spacy
from spacy.lang.en import English
lemmatizer = English.Defaults.create_lemmatizer()
nlp = spacy.load('en_core_web_sm')
lemmatizer = nlp.Defaults.create_lemmatizer()
lemmatizer(u'chuckles', 'Noun')
Output
['chuckles']
The expected output is "chuckle"
A recommended way of using spaCy is to create a Doc and read the lemma off each token:
doc = nlp("chuckles")
for word in doc:
    print(word.lemma_)
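The lemma also depends on the part-of-speech tag spaCy assigns in context, so a short sketch with a full sentence makes it easier to check what the model is doing (exact output varies with the model version):
import spacy
nlp = spacy.load("en_core_web_sm")
for word in nlp("She chuckles at the joke."):
    print(word.text, word.pos_, word.lemma_)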

SpaCy loading models

I am new to NLP and spaCy. I am using the en_core_web_md model. I am loading it using spacy.load()
Whenever I run my program it loads the model. Is there a way to load the model just once for all the subsequent runs?
Yes, you can; see the example code below:
import spacy
nlp = spacy.load('en_core_web_md')  # make sure to use the larger model!
tokens = nlp(u'dog cat banana')
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
Keep the variable nlp around, since it holds the loaded model, and pass nlp to whichever functions need it instead of calling spacy.load() again.
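Within a single process you can also make the "load only once" part explicit with a small cached helper (the helper names below are hypothetical); every call after the first reuses the already-loaded model:
from functools import lru_cache
import spacy
@lru_cache(maxsize=None)
def get_nlp(model_name="en_core_web_md"):
    # spacy.load() only runs on the first call for a given model name
    return spacy.load(model_name)
def similarity(a, b):
    nlp = get_nlp()
    return nlp(a).similarity(nlp(b))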

How to POS-tag a French sentence?

I'm looking for a way to pos_tag a French sentence like the following code is used for English sentences:
import nltk
def pos_tagging(sentence):
    var = sentence
    exampleArray = [var]
    for item in exampleArray:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
    return tagged
Here is the full source code; it works very well.
Download link for the Stanford POS tagger: https://nlp.stanford.edu/software/tagger.shtml#About
import os
from nltk.tag import StanfordPOSTagger
jar = 'C:/Users/m.ferhat/Desktop/stanford-postagger-full-2016-10-31/stanford-postagger-3.7.0.jar'
model = 'C:/Users/m.ferhat/Desktop/stanford-postagger-full-2016-10-31/models/french.tagger'
java_path = "C:/Program Files/Java/jdk1.8.0_121/bin/java.exe"
os.environ['JAVAHOME'] = java_path
pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')
res = pos_tagger.tag('je suis libre'.split())
print(res)
The NLTK doesn't come with pre-built resources for French. I recommend using the Stanford tagger, which comes with a trained French model. This code shows how you might set up the nltk for use with Stanford's French POS tagger. Note that the code is outdated (and for Python 2), but you could use it as a starting point.
Alternately, the NLTK makes it very easy to train your own POS tagger on a tagged corpus, and save it for later use. If you have access to a (sufficiently large) French corpus, you can follow the instructions in the nltk book and simply use your corpus in place of the Brown corpus. You're unlikely to match the performance of the Stanford tagger (unless you can train a tagger for your specific domain), but you won't have to install anything.
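As a rough illustration of that second route, here is a sketch in the style of the NLTK book, with the Brown corpus standing in for a French tagged corpus (the corpus choice and file name are placeholders):
import pickle
import nltk
from nltk.corpus import brown  # stand-in; substitute your French tagged corpus
nltk.download("brown")
tagged_sents = brown.tagged_sents(categories="news")
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]
# bigram tagger backing off to a unigram tagger, then to a default tag
t0 = nltk.DefaultTagger("NN")
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
print(t2.evaluate(test_sents))  # called accuracy() in newer NLTK versions
# save the trained tagger so later runs can just unpickle it
with open("my_tagger.pkl", "wb") as f:
    pickle.dump(t2, f)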
