Using spaCy-Specific Type Annotations (Python 3.11) - nlp

I want to indicate that certain parameters are spaCy return objects. For example, corpus is an item returned by calling nlp. Assume that hh is located in a different module.
import spacy

nlp = spacy.load("en_core_web_sm")

def hh(corpus):
    pass

result = hh(nlp(text))
How do I annotate parameters with spaCy built-in objects, for instance something like corpus: spacy_token?
I had a hard time figuring this out. I tried this and that but got no result. Properties like lemma_ are not detected in VS Code.

You just need to import the spaCy classes and use them like regular type annotations. The object returned by calling nlp(text) is a Doc; the individual tokens inside it are Token objects, and both live in spacy.tokens.
from spacy.tokens import Doc
import spacy

nlp = spacy.load("en_core_web_sm")

def hh(corpus: Doc):
    pass

text = "Some example text."  # whatever string you pass to nlp
result = hh(nlp(text))
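If a helper works on individual tokens rather than the whole document, annotate it with Token instead; with that annotation editors such as VS Code can resolve attributes like lemma_. A minimal sketch, assuming the same en_core_web_sm pipeline (the helper names lemmas and describe are only illustrations):
import spacy
from spacy.tokens import Doc, Token

nlp = spacy.load("en_core_web_sm")

def lemmas(corpus: Doc) -> list[str]:
    # iterating a Doc yields Token objects, so lemma_ is known to the editor
    return [token.lemma_ for token in corpus]

def describe(token: Token) -> str:
    return f"{token.text} -> {token.lemma_}"

print(lemmas(nlp("The cats are chasing mice")))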

Related

Using Arabert model with SpaCy

spaCy doesn't support the Arabic language, but can I use spaCy with the pretrained AraBERT model?
Is it possible to modify this code so it accepts bert-large-arabertv02 instead of en_core_web_lg?
!python -m spacy download en_core_web_lg
import spacy
nlp = spacy.load("en_core_web_lg")
Here is how we can call AraBERTv02:
from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_name="aubmindlab/bert-large-arabertv02"
arabert_prep = ArabertPreprocessor(model_name=model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
spaCy actually does support Arabic, though only at an alpha level, which basically just means tokenization support (see here). That's enough for loading external models or training your own, though, so in this case you should be able to load this like any HuggingFace model - see this FAQ.
In this case this would look like:
import spacy
nlp = spacy.blank("ar")  # empty Arabic pipeline
# create the config with the name of your model
# values omitted will get default values
config = {
    "model": {
        "@architectures": "spacy-transformers.TransformerModel.v3",
        "name": "aubmindlab/bert-large-arabertv02"
    }
}
nlp.add_pipe("transformer", config=config)
nlp.initialize() # XXX don't forget this step!
doc = nlp("فريك الذرة لذيذة")
print(doc._.trf_data) # all the Transformer output is stored here
I don't speak Arabic, so I can't check the output thoroughly, but that code ran and produced an embedding for me.
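If you want to inspect the raw output, the embeddings sit on the custom attribute shown above. The exact layout of doc._.trf_data depends on your spacy-transformers version, so treat this as a sketch rather than a fixed API:
doc = nlp("فريك الذرة لذيذة")
trf = doc._.trf_data
# in spacy-transformers 1.x this is a TransformerData object whose
# tensors attribute holds the transformer's output arrays
for tensor in trf.tensors:
    print(tensor.shape)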

How to solve the import problem when trying to import 'SentenceSegmenter' from the 'spacy.pipeline' package?

ImportError: cannot import name 'SentenceSegmenter' from 'spacy.pipeline'
Spacy version: 3.2.1
I know this class is from an earlier version of spaCy, but is there something similar in this version?
There are several methods to perform sentence segmentation in spaCy. You can read about them in the docs here: https://spacy.io/usage/linguistic-features#sbd.
This example is copied as-is from the docs, showing how to segment sentences based on an English language model.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
assert doc.has_annotation("SENT_START")
for sent in doc.sents:
    print(sent.text)
You can also use the rule-based sentencizer, which splits on punctuation alone without a trained model, like so (also from the docs):
import spacy
from spacy.lang.en import English

nlp = English()  # just the language with no pipeline
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
This should work for spaCy 3.0.5 and above.
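The sentencizer also accepts a punct_chars setting if the default punctuation list does not match your text; a minimal sketch, where the "|" character is just an illustrative custom delimiter:
from spacy.lang.en import English

nlp = English()
# rule-based splitter with an explicit list of sentence-final characters
nlp.add_pipe("sentencizer", config={"punct_chars": ["|", ".", "!", "?"]})
doc = nlp("First part | second part. Third part!")
for sent in doc.sents:
    print(sent.text)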

lemmatizer is not working in python spacy library

I am trying to create a small chatbot using the spaCy library; when I use the lemmatizer, the code gives incorrect output. Can someone help me?
Below is my code:
import spacy
from spacy.lang.en import English
lemmatizer = English.Defaults.create_lemmatizer()
nlp = spacy.load('en_core_web_sm')
lemmatizer = nlp.Defaults.create_lemmatizer()
lemmatizer(u'chuckles', 'Noun')
Output
['chuckles']
The expected output is "chuckle"
The recommended way of using spaCy is to create a document and read the lemma from each token:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("chuckles")
for word in doc:
    print(word.lemma_)  # chuckle
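In spaCy 3 the lemmatizer is an ordinary pipeline component rather than something you create from Defaults, so you can also verify it is loaded before relying on lemma_; a minimal sketch:
import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # the list should include 'lemmatizer'
doc = nlp("chuckles")
print([(token.text, token.lemma_) for token in doc])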

What is the difference between RegexpTokenizer and spacy tokenizer?

import spacy
from nltk.tokenize import RegexpTokenizer

EN = spacy.load('en')

def tokenize_docstring(text):
    "Apply tokenization using spacy to docstrings."
    tokens = EN.tokenizer(text)
    return [token.text.lower() for token in tokens if not token.is_space]

def tokenize_code(text):
    "A very basic procedure for tokenizing code strings."
    return RegexpTokenizer(r'\w+').tokenize(text)
spaCy has many more capabilities, including word relationships and adding named entities to words. From the official documentation:
Compared to using regular expressions on raw text, spaCy’s rule-based matcher engines and components not only let you find the words and phrases you’re looking for – they also give you access to the tokens within the document and their relationships. This means you can easily access and analyze the surrounding tokens, merge spans into single tokens or add entries to the named entities in doc.ents.
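To make the difference concrete, here is a small sketch (it assumes the en_core_web_sm model is installed, unlike the 'en' shortcut used above): RegexpTokenizer returns plain strings, while spaCy's tokenizer returns Token objects that carry linguistic attributes.
import spacy
from nltk.tokenize import RegexpTokenizer

text = "Dr. Smith doesn't work at O'Reilly."

# NLTK: plain strings, split purely by the regular expression
print(RegexpTokenizer(r'\w+').tokenize(text))

# spaCy: Token objects with attributes such as is_punct, is_space, lemma_
nlp = spacy.load("en_core_web_sm")
print([(token.text, token.is_punct) for token in nlp(text)])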

Spacy - custom stop words are not working

I am trying to add custom STOP_WORDS to spaCy.
The following code should add the custom stop word "Bestellung" to the standard set of STOP_WORDS.
The problem I have is that the adding works, i.e. the set contains "Bestellung" afterwards, but when testing the custom stop word "Bestellung" with .is_stop, Python returns False.
Another test with a default stop word "darunter" (i.e. one that is already in STOP_WORDS) returns True. I don't get it, because both words "Bestellung" and "darunter" are in the same set of STOP_WORDS.
Does anyone have an idea why it behaves like that?
Thank you
import spacy
from spacy.lang.de.stop_words import STOP_WORDS

STOP_WORDS.add("Bestellung")
print(STOP_WORDS)  # printing STOP_WORDS proves that "Bestellung" is in the set; both tested words "darunter" and "Bestellung" are part of it

nlp = spacy.load("de_core_news_sm")
print(nlp.vocab["Bestellung"].is_stop)  # returns False
print(nlp.vocab["darunter"].is_stop)    # returns True
This is related to a bug in previous spaCy models. It works correctly in the latest spaCy.
Example on English model:
>>> import spacy
>>> nlp = spacy.load('en')
>>> from spacy.lang.en.stop_words import STOP_WORDS
>>> STOP_WORDS.add("Bestellung")
>>> print(nlp.vocab["Bestellung"].is_stop)
True
In case you want to fix this on your existing spaCy, you can use this workaround, which alters the is_stop attribute of the words present in STOP_WORDS.
nlp.vocab.add_flag(lambda s: s.lower() in spacy.lang.en.stop_words.STOP_WORDS, spacy.attrs.IS_STOP)
This is mentioned in this spaCy issue on GitHub.
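In newer spaCy versions you can also flag the lexeme directly instead of patching the whole vocabulary; a minimal sketch using the German model from the question:
import spacy

nlp = spacy.load("de_core_news_sm")
# add the word to the defaults and mark the lexeme explicitly
nlp.Defaults.stop_words.add("Bestellung")
nlp.vocab["Bestellung"].is_stop = True
print(nlp.vocab["Bestellung"].is_stop)  # True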
