Is there any way with spaCy to replace an entity detected by spaCy NER with its label?
For example:
I am eating an apple while playing with my Apple Macbook.
I have trained an NER model with spaCy to detect a "FRUITS" entity, and the model successfully detects the first "apple" as "FRUITS", but not the second "Apple".
I want to post-process my data by replacing each entity with its label, so I want to replace the first "apple" with "FRUITS". The sentence would become "I am eating an FRUITS while playing with my Apple Macbook."
If I simply use regex, it will replace the second "Apple" with "FRUITS" as well, which is incorrect. Is there any smart way to do this?
Thanks!
The entity label is an attribute of the token (see here):
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_lg')
s = "His friend Nicolas is here."
doc = nlp(s)
print([t.text if not t.ent_type_ else t.ent_type_ for t in doc])
# ['His', 'friend', 'PERSON', 'is', 'here', '.']
print(" ".join([t.text if not t.ent_type_ else t.ent_type_ for t in doc]) )
# His friend PERSON is here .
Edit:
To handle cases where entities can span several words, the following code can be used instead:
s = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
doc = nlp(s)
newString = s
for e in reversed(doc.ents):  # reversed so substitution doesn't modify the offsets of other entities
    start = e.start_char
    end = start + len(e.text)
    newString = newString[:start] + e.label_ + newString[end:]
print(newString)
#His friend PERSON is here with PERSON and PERSON.
Update:
Jinhua Wang brought to my attention that there is now a more built-in and simpler way to do this using the merge_entities pipe.
See Jinhua's answer below.
A more elegant modification to @DBaker's solution above when entities can span several words:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_lg')
nlp.add_pipe("merge_entities")
s = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
doc = nlp(s)
print([t.text if not t.ent_type_ else t.ent_type_ for t in doc])
# ['His', 'friend', 'PERSON', 'is', 'here', 'with', 'PERSON', 'and', 'PERSON', '.']
print(" ".join([t.text if not t.ent_type_ else t.ent_type_ for t in doc]) )
# His friend PERSON is here with PERSON and PERSON .
You can check the spaCy documentation here. It uses the built-in pipeline for the job and has good support for multiprocessing. I believe this is the officially supported way to replace entities with their labels.
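For example, here is a small sketch (the texts and the n_process value are just illustrative) of applying the same replacement over many documents with nlp.pipe, which is where the multiprocessing support comes in:

texts = [
    "His friend Nicolas J. Smith is here with Bart Simpon and Fred.",
    "Apple is looking at buying a U.K. startup for $1 billion.",
]
# n_process is illustrative; pick a value that suits your machine.
for doc in nlp.pipe(texts, n_process=2):
    print(" ".join(t.ent_type_ or t.text for t in doc))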
A slightly shorter version of @DBaker's answer, which uses end_char instead of computing it:
for ent in reversed(doc.ents):
    text = text[:ent.start_char] + ent.label_ + text[ent.end_char:]
I am using spaCy version 2.3.2.
When predicting names like santosh12647578 or kadge16577, spaCy identifies them as PERSON entities.
How do I tell spaCy not to treat a span as a PERSON entity if it contains a number?
Can I use the EntityRuler for this? If so, how should I approach it?
Any help will be highly appreciated.
You can use rule-based components after the NER statistical model to correct common errors.
import spacy
nlp = spacy.load('en_core_web_lg')
def reduce_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and any(char.isdigit() for tok in ent for char in tok.text):
            pass
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

nlp.add_pipe(reduce_person_entities, after='ner')

doc = nlp('Some example usernames include kadge16577 (Kadge Smith).')
for ent in doc.ents:
    print(ent, ent.label_)
Output
Kadge Smith PERSON
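As a side note, if you later move to spaCy v3, custom functions can no longer be passed to add_pipe() directly; they have to be registered first. A minimal sketch of the same filtering component under v3 (assuming en_core_web_lg is installed):

import spacy
from spacy.language import Language

@Language.component("reduce_person_entities")
def reduce_person_entities(doc):
    # Keep every entity except PERSON spans that contain a digit.
    doc.ents = [
        ent for ent in doc.ents
        if not (ent.label_ == "PERSON" and any(ch.isdigit() for ch in ent.text))
    ]
    return doc

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("reduce_person_entities", after="ner")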
I wrote some code to find locations present in a string.
import spacy
nlp = spacy.load('en')
doc1='Pune, India'
doc2='India, Pune'
doc3='Pune India'
doc4='India Pune'
print([(X.text, X.label_) for X in nlp(doc1).ents])
print([(X.text, X.label_) for X in nlp(doc2).ents])
print([(X.text, X.label_) for X in nlp(doc3).ents])
print([(X.text, X.label_) for X in nlp(doc4).ents])
and my output is:
[('India', 'GPE')]
[('India', 'GPE'), ('Pune', 'GPE')]
[('India', 'GPE')]
[('India', 'GPE')]
How can I get the same output, [('India', 'GPE'), ('Pune', 'GPE')], for all of them?
spaCy is able to understand that 'Pune' is probably a GPE only because of the comma (,) after 'India' in doc2. This isn't the case for the other three examples, so it only detected the known GPE, 'India'.
You may have to write some custom code based on the cases in your dataset. You could look at the words before and after the detected GPEs and check whether they appear in an English word list (you can get one in spaCy); if a word doesn't, it can be considered a candidate GPE. For example, in your cases where only 'India' is detected, you would check 'Pune' against the word list.
Do let me know if you need help coding this.
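For instance, here is a rough sketch of that idea; it uses a simple capitalisation check on the neighbours of detected GPEs as a stand-in for a real English word list, so treat it as a starting point rather than a finished solution:

import spacy

nlp = spacy.load('en_core_web_lg')

def candidate_gpes(text):
    # Look at the token immediately before and after each detected GPE and
    # treat capitalised neighbours that are not already entities as candidates.
    doc = nlp(text)
    ent_texts = {e.text for e in doc.ents}
    candidates = []
    for ent in doc.ents:
        if ent.label_ != 'GPE':
            continue
        for i in (ent.start - 1, ent.end):
            if 0 <= i < len(doc):
                tok = doc[i]
                if tok.is_alpha and tok.text[0].isupper() and tok.text not in ent_texts:
                    candidates.append(tok.text)
    return candidates

print(candidate_gpes('Pune India'))  # ['Pune'], if 'India' is detected as a GPE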
This is because the model you are using does not know that 'Pune' is an entity with label 'GPE'.
You can add an extra component to your pipeline with add_pipe() to identify 'Pune' as an entity as well.
Note that the presence of a comma is not the reason why 'Pune' is classified an entity.
Since you have a list of texts, let's make a list so that nlp.pipe() can be used.
texts = ['Pune, India', 'India, Pune', 'Pune India', 'India Pune']
LOCATIONS = ["Pune", "India"]
First, you have to find the places where 'Pune' occurs in the text. A PhraseMatcher can be used for that.
from spacy.matcher import PhraseMatcher
Create a matcher object and add the patterns to be recognized to it.
nlp = spacy.load('en')
matcher = PhraseMatcher(nlp.vocab)
matcher.add("LOCATIONS", None, *list(nlp.pipe(LOCATIONS)))
Now define the component function.
def places_component(doc):
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matcher(doc)]
    return doc
Add the component to the pipeline.
nlp.add_pipe(places_component) #last=True
Now try
for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])
You'll get
[('Pune', 'GPE'), ('India', 'GPE')]
[('India', 'GPE'), ('Pune', 'GPE')]
[('Pune', 'GPE'), ('India', 'GPE')]
[('India', 'GPE'), ('Pune', 'GPE')]
The whole program would be something like
import spacy
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher
LOCATIONS = ["Pune", "India"]
texts = ['Pune, India', 'India, Pune', 'Pune India', 'India Pune']
nlp = spacy.load('en')
matcher = PhraseMatcher(nlp.vocab)
matcher.add("LOCATIONS", None, *list(nlp.pipe(LOCATIONS)))
def places_component(doc):
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matcher(doc)]
    return doc

nlp.add_pipe(places_component)  # last=True

for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])
I am using spaCy's sentencizer to split the sentences.
from spacy.lang.en import English
nlp = English()
sbd = nlp.create_pipe('sentencizer')
nlp.add_pipe(sbd)
text="Please read the analysis. (You'll be amazed.)"
doc = nlp(text)
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
print(sents_list)
print([token.text for token in doc])
OUTPUT
['Please read the analysis. (',
"You'll be amazed.)"]
['Please', 'read', 'the', 'analysis', '.', '(', 'You', "'ll", 'be',
'amazed', '.', ')']
Tokenization is done correctly, but I am not sure why it is splitting the second sentence and attaching the "(" to the end of the first sentence.
I have tested the code below with the en_core_web_lg and en_core_web_sm models; the results for the sm model are similar to using the sentencizer (the lg model takes a performance hit).
The custom boundaries below only work with the sm model and produce different splits with the lg model.
nlp=spacy.load('en_core_web_sm')
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ".(" or token.text == ").":
            doc[token.i+1].is_sent_start = True
        elif token.text == "Rs." or token.text == ")":
            doc[token.i+1].is_sent_start = False
    return doc
nlp.add_pipe(set_custom_boundaries, before="parser")
doc = nlp(text)
for sent in doc.sents:
    print(sent.text)
The sentencizer is a very fast but also very minimal sentence splitter that isn't going to perform well with punctuation like this. It's good for splitting texts into sentence-ish chunks, but if you need higher-quality sentence segmentation, use the parser component of an English model to do the segmentation instead.
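For example, a minimal sketch using the parser-based segmentation of en_core_web_sm (assuming the model is installed):

import spacy

# The parser of a full English model usually handles parenthesised sentences
# better than the rule-based sentencizer.
nlp = spacy.load("en_core_web_sm")
text = "Please read the analysis. (You'll be amazed.)"
doc = nlp(text)
print([sent.text for sent in doc.sents])
# Expected output is something like:
# ['Please read the analysis.', "(You'll be amazed.)"]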
I want to get the lemmas of the words in the list given below:
e.g.
words = ['Funnier','Funniest','mightiest','tighter']
When I use spaCy:
import spacy
nlp = spacy.load('en')
words = ['Funnier','Funniest','mightiest','tighter','biggify']
doc = spacy.tokens.Doc(nlp.vocab, words=words)
for items in doc:
    print(items.lemma_)
I got the lemmas like:
Funnier
Funniest
mighty
tight
When I use NLTK's WordNetLemmatizer:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' + lemmatizer.lemmatize(token))
I got:
Funnier --> Funnier
Funniest --> Funniest
mightiest --> mightiest
tighter --> tighter
Can anyone help with this?
Thanks.
Lemmatisation depends entirely on the part-of-speech tag you use when getting the lemma of a particular word.
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"

# Tokenize: split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']

# Lemmatize the list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)
#> The striped bat are hanging on their foot for best
The above code is a simple example of how to use the wordnet lemmatizer on words and sentences.
Notice it didn't do a good job: 'are' is not converted to 'be' and 'hanging' is not converted to 'hang' as expected. This can be corrected if we provide the correct part-of-speech (POS) tag as the second argument to lemmatize().
Sometimes the same word can have multiple lemmas depending on the meaning / context.
print(lemmatizer.lemmatize("stripes", 'v'))
#> strip
print(lemmatizer.lemmatize("stripes", 'n'))
#> stripe
For the above example, specify the corresponding POS tag:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' + lemmatizer.lemmatize(token, wordnet.ADJ_SAT))
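To avoid hard-coding the POS tag, a common pattern is to derive it from a tagger. The sketch below is only an illustration (it assumes NLTK's 'averaged_perceptron_tagger' and 'wordnet' data have been downloaded, and tagging isolated words without sentence context can still be unreliable):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to the WordNet POS categories used by lemmatize().
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

words = ['Funnier', 'Funniest', 'mightiest', 'tighter', 'biggify']
for token, tag in nltk.pos_tag(words):
    # WordNet lookups are case-sensitive, so lowercase before lemmatizing.
    print(token + ' --> ' + lemmatizer.lemmatize(token.lower(), wordnet_pos(tag)))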
How do I get the correct NER using SpaCy from text like "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site."
here "Criticized Trump" is recognized as person instead of "Trump" as person.
How to pre-process and lower case the text like "Criticized" or "Texts" from the above string to overcome above issue or any other technique to do so.
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
from pprint import pprint
sent = ("F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site")
doc = nlp(sent)
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])
Result from the above code:
"Criticized Trump" as 'PERSON' and "Texts" as 'GPE'
The expected result should be:
"Trump" as 'PERSON' instead of "Criticized Trump" as 'PERSON', and "Texts" as '' instead of "Texts" as 'GPE'
You can add more examples of Named Entities to tune the NER model. Here you have all the information needed for preparing training data: https://spacy.io/usage/training. You can use Prodigy (an annotation tool from the spaCy creators, https://prodi.gy) to mark Named Entities in your data.
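For reference, a minimal sketch of what annotated training examples look like in the spaCy v2 format (the texts and character offsets below are made up purely for illustration):

# Hypothetical training examples: (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("F.B.I. Agent Peter Strzok was fired.",
     {"entities": [(13, 25, "PERSON")]}),
    ("Trump wrote about it in several texts.",
     {"entities": [(0, 5, "PERSON")]}),
]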
Indeed, you can pre-process the text using POS tagging to lowercase words like "Criticized" or "Texts" which are not proper nouns.
Proper capitalization (lower vs. upper case) will help the NER tagger.
sent = "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site"
doc = nlp(sent)
words = []
spaces = []
for a in doc:
if a.pos_ != 'PROPN':
words.append( a.text.lower() )
else:
words.append(a.text)
spaces.append(a.whitespace_)
spaces = [len(sp) for sp in spaces]
docNew = Doc(nlp.vocab, words=words, spaces=spaces)
print(docNew)
# F.B.I. Agent Peter Strzok, who criticized Trump in texts, is fired - the New York Times SectionsSEARCHSkip to contentskip to site
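A quick follow-up (results will still depend on the model and version) is to re-run the pipeline on the re-cased text and inspect the entities:

# Re-run NER on the lowercased version; results depend on the model.
doc2 = nlp(docNew.text)
print([(ent.text, ent.label_) for ent in doc2.ents])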