Is there a way to remove people's names from noun chunks?
Here is the code:
import en_vectors_web_lg
nlp = en_vectors_web_lg.load()
text = "John Smith is lookin for Apple ipod"
doc = nlp(text)
for chunk in doc.noun_chunks:
    print(chunk.text)
Current output
John Smith
Apple ipod
I would like the output below, where people's names are ignored. How can I achieve this?
Apple ipod
You can use spaCy's named entities (doc.ents) for this:
import spacy
# loading the model
nlp = spacy.load('en_core_web_lg')
doc = nlp(u'"John Smith is lookin for Apple ipod"')
# creating the filter list for tokens that are identified as person
fil = [i for i in doc.ents if i.label_.lower() in ["person"]]
# looping through noun chunks
for chunk in doc.noun_chunks:
    # filtering the name of the person
    if chunk not in fil:
        print(chunk.text)
Output:
Apple ipod
Hope this helps.
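The filter above compares a noun chunk with an entity span directly, so whether it matches depends on how span objects compare for equality in the installed spaCy version. A more explicit variant, sketched below, compares character offsets and drops any chunk that overlaps a PERSON entity. It uses plain (text, start, end) tuples in place of spaCy spans so that it runs on its own; the helper name drop_person_chunks and the tuple layout are assumptions for illustration, not part of spaCy.

```python
def drop_person_chunks(chunks, person_spans):
    """Return the texts of chunks that do not overlap any person span.

    chunks: list of (text, start_char, end_char)
    person_spans: list of (start_char, end_char)
    """
    kept = []
    for text, start, end in chunks:
        # Two half-open ranges overlap when each starts before the other ends.
        overlaps = any(start < p_end and p_start < end
                       for p_start, p_end in person_spans)
        if not overlaps:
            kept.append(text)
    return kept


sentence = "John Smith is lookin for Apple ipod"
# Offsets are computed from the sentence rather than hard-coded.
john = sentence.index("John Smith")
apple = sentence.index("Apple ipod")
chunks = [
    ("John Smith", john, john + len("John Smith")),
    ("Apple ipod", apple, apple + len("Apple ipod")),
]
person_spans = [(john, john + len("John Smith"))]

print(drop_person_chunks(chunks, person_spans))
```

With spaCy, the tuples could be built from chunk.start_char and chunk.end_char for each noun chunk, and from ent.start_char and ent.end_char for each entity whose label_ is PERSON.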
Related
I am using spaCy version 2.3.2.
When predicting on names like santosh12647578 kadge16577, spaCy identifies them as PERSON entities.
How do I tell spaCy not to treat a span as a PERSON entity if it contains a number?
Can I use the entity ruler for this? If so, how should I approach it?
Any help will be highly appreciated.
You can use rule-based components after the NER statistical model to correct common errors.
import spacy
nlp = spacy.load('en_core_web_lg')
def reduce_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and any(char.isdigit() for tok in ent for char in tok.text):
            pass
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc
nlp.add_pipe(reduce_person_entities, after='ner')
doc = nlp('Some example usernames include kadge16577 (Kadge Smith).')
for ent in doc.ents:
    print(ent, ent.label_)
Output
Kadge Smith PERSON
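The decision made inside the component can be isolated as a small predicate, which makes it easy to check without loading a model. The sketch below assumes entities are given as (text, label) pairs; keep_entity is a hypothetical helper name and the predictions are made up for illustration.

```python
def keep_entity(text, label):
    """Return False for PERSON entities whose text contains a digit."""
    if label == "PERSON" and any(ch.isdigit() for ch in text):
        return False
    return True


# Hypothetical model predictions as (text, label) pairs.
predicted = [
    ("santosh12647578", "PERSON"),
    ("kadge16577", "PERSON"),
    ("Kadge Smith", "PERSON"),
    ("2020", "DATE"),
]
# Only PERSON entities are filtered; other labels may contain digits.
filtered = [(text, label) for text, label in predicted
            if keep_entity(text, label)]
print(filtered)
```

The component above is written for spaCy 2.x, where nlp.add_pipe accepts a function. In spaCy 3.x, a custom component is registered with the @Language.component decorator and then added by its registered name, for example nlp.add_pipe("reduce_person_entities", after="ner").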
I am looking for a way to singularize noun chunks with spaCy.
import spacy
S = 'There are multiple sentences that should include several parts and also make clear that studying Natural language Processing is not difficult '
nlp = spacy.load('en_core_web_sm')
doc = nlp(S)
[chunk.text for chunk in doc.noun_chunks]
# = ['multiple sentences', 'several parts', 'Natural language Processing']
You can also get the "root" of the noun chunk:
[chunk.root.text for chunk in doc.noun_chunks]
# = ['sentences', 'parts', 'Processing']
I am looking for a way to singularize those roots of the chunks.
Goal (singularized): ['sentence', 'part', 'Processing']
Is there an obvious way to do this? Does it always depend on the POS of each root word?
Thanks
Note:
I found this: https://www.geeksforgeeks.org/nlp-singularizing-plural-nouns-and-swapping-infinite-phrases/
but that approach seems to require many different rules, and of course different ones for every language (I am working in EN, FR and DE).
To get the base form of each word, you can use the .lemma_ attribute of a token or of a noun chunk.
I use spaCy version 2.x.
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
doc = nlp('did displaying words')
print(" ".join([token.lemma_ for token in doc]))
Output:
do display word
Hope it helps :)
There is! You can take the lemma of the head word in each noun chunk.
[chunk.root.lemma_ for chunk in doc.noun_chunks]
Out[82]: ['sentence', 'part', 'processing']
Is there any way in spaCy to replace an entity detected by spaCy's NER with its label?
For example:
I am eating an apple while playing with my Apple Macbook.
I have trained an NER model with spaCy to detect a "FRUITS" entity, and the model successfully detects the first "apple" as "FRUITS", but not the second "Apple".
I want to do post-processing of my data by replacing each entity with its label, so I want to replace the first "apple" with "FRUITS". The sentence will be "I am eating an FRUITS while playing with my Apple Macbook."
If I simply use regex, it will replace the second "Apple" with "FRUITS" as well, which is incorrect. Is there any smart way to do this?
Thanks!
The entity label is an attribute of the token (ent_type_):
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_lg')
s = "His friend Nicolas is here."
doc = nlp(s)
print([t.text if not t.ent_type_ else t.ent_type_ for t in doc])
# ['His', 'friend', 'PERSON', 'is', 'here', '.']
print(" ".join([t.text if not t.ent_type_ else t.ent_type_ for t in doc]) )
# His friend PERSON is here .
Edit:
To handle cases where entities span several words, the following code can be used instead:
s = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
doc = nlp(s)
newString = s
for e in reversed(doc.ents):  # reversed to not modify the offsets of other entities when substituting
    start = e.start_char
    end = start + len(e.text)
    newString = newString[:start] + e.label_ + newString[end:]
print(newString)
#His friend PERSON is here with PERSON and PERSON.
Update:
Jinhua Wang brought to my attention that there is now a more built-in and simpler way to do this using the merge_entities pipe.
See Jinhua's answer below.
A more elegant modification of @DBaker's solution above for entities that span several words:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_lg')
nlp.add_pipe("merge_entities")
s = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
doc = nlp(s)
print([t.text if not t.ent_type_ else t.ent_type_ for t in doc])
# ['His', 'friend', 'PERSON', 'is', 'here', 'with', 'PERSON', 'and', 'PERSON', '.']
print(" ".join([t.text if not t.ent_type_ else t.ent_type_ for t in doc]) )
# His friend PERSON is here with PERSON and PERSON .
See the spaCy documentation for the merge_entities pipeline component. It uses the built-in pipeline for the job and has good support for multiprocessing. I believe this is the officially supported way to replace entities with their labels.
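Joining tokens with a single space, as in the examples above, puts a space before punctuation ("here ."). One way around this, sketched below, is to append each token's own trailing whitespace instead. The (text, ent_type, whitespace) triples stand in for spaCy tokens and are an assumption made so that the example runs without a model; replace_entities is a hypothetical helper name.

```python
def replace_entities(tokens):
    """Rebuild the text, swapping entity tokens for their label.

    tokens: list of (text, ent_type, trailing_whitespace); ent_type is ''
    for tokens that are not part of an entity.
    """
    return "".join((ent_type or text) + ws for text, ent_type, ws in tokens)


# Hand-written stand-in for a tokenized, entity-tagged sentence.
tokens = [
    ("His", "", " "),
    ("friend", "", " "),
    ("Nicolas", "PERSON", " "),
    ("is", "", " "),
    ("here", "", ""),
    (".", "", ""),
]
print(replace_entities(tokens))
```

With spaCy, the triples correspond to token.text, token.ent_type_ and token.whitespace_. After merge_entities each entity is a single token, so a multi-word name is replaced by one label rather than one label per word.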
A slightly shorter version of @DBaker's answer, which uses end_char instead of computing it:
for ent in reversed(doc.ents):
    text = text[:ent.start_char] + ent.label_ + text[ent.end_char:]
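Applied to the sentence from the question, the offset-based replacement can be sketched as a standalone function. The (start_char, end_char, label) tuples are an assumed stand-in for the entities a trained model would return, and replace_spans is a hypothetical helper name.

```python
def replace_spans(text, spans):
    """Replace each (start_char, end_char, label) span in text by its label.

    Spans are processed from the end of the text backwards so that the
    offsets of earlier spans stay valid after each substitution.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + label + text[end:]
    return text


sentence = "I am eating an apple while playing with my Apple Macbook."
# Assumption: only the first, lowercase "apple" was tagged as FRUITS.
start = sentence.index("apple")
spans = [(start, start + len("apple"), "FRUITS")]

print(replace_spans(sentence, spans))
```

Because the replacement works on character offsets rather than on the matched string, the second "Apple" is left untouched, which a plain regex substitution would not guarantee.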
How do I get the correct NER using SpaCy from text like "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site."
Here, "Criticized Trump" is recognized as a person instead of just "Trump".
How can I pre-process the text, for example by lowercasing words like "Criticized" or "Texts" in the above string, to overcome this issue? Or is there another technique?
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
from pprint import pprint
sent = ("F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site")
doc = nlp(sent)
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])
Result from the above code:
"Criticized Trump" as 'PERSON' and "Texts" as 'GPE'
Expected result:
"Trump" as 'PERSON' instead of "Criticized Trump" as 'PERSON' and "Texts" as '' instead of "Texts" as 'GPE'
You can add more examples of named entities to tune the NER model. The spaCy training guide (https://spacy.io/usage/training) has the information needed to prepare training data. You can use Prodigy (an annotation tool from the spaCy creators, https://prodi.gy) to mark named entities in your data.
Indeed, you can pre-process the text using POS tagging to lowercase words like "Criticized" or "Texts" that are not proper nouns.
Proper capitalization (lower vs. upper case) will help the NER tagger.
from spacy.tokens import Doc

sent = "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site"
doc = nlp(sent)
words = []
spaces = []
for a in doc:
    # lowercase every token that is not tagged as a proper noun
    if a.pos_ != 'PROPN':
        words.append(a.text.lower())
    else:
        words.append(a.text)
    # record whether the token is followed by whitespace
    spaces.append(bool(a.whitespace_))
docNew = Doc(nlp.vocab, words=words, spaces=spaces)
print(docNew)
# F.B.I. Agent Peter Strzok, who criticized Trump in texts, is fired - the New York Times SectionsSEARCHSkip to contentskip to site
I have a sample text like:
"I'm travelling from Spain to India i.e on 23/09/2017 to 27/09/2017"
From this type of text, I want to extract the origin and destination countries and the dates.
How can I approach this?
To install spaCy, follow the steps at https://spacy.io/docs/usage/
string = "I'm travelling from Spain to India i.e on 23/09/2017 to 27/09/2017"
import re
import spacy
nlp = spacy.load('en')
doc = nlp(string)
sentence = doc.text
for ent in doc.ents:
    # GPE covers countries, cities and states
    if ent.label_ == 'GPE':
        print(ent.text)
Output
Spain
India
Reference
https://spacy.io/docs/usage/entity-recognition
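The entity loop above covers the countries. For the dates, a regular expression is one option, since they follow a fixed dd/mm/yyyy pattern; the first GPE can then be read as the origin and the second as the destination, assuming the text always follows the "from X to Y" order. The sketch below uses only the standard library, and extract_dates is a hypothetical helper name.

```python
import re
from datetime import datetime

# Two digits, slash, two digits, slash, four digits.
DATE_PATTERN = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")


def extract_dates(text):
    """Return all dd/mm/yyyy dates in text as datetime.date objects.

    Raises ValueError if a match is not a valid calendar date.
    """
    return [datetime.strptime(match, "%d/%m/%Y").date()
            for match in DATE_PATTERN.findall(text)]


string = "I'm travelling from Spain to India i.e on 23/09/2017 to 27/09/2017"
dates = extract_dates(string)
# Assumption: the first date is the departure, the second the return.
print(dates[0].isoformat(), dates[1].isoformat())
```

Parsing the matches into date objects, rather than keeping them as strings, makes it straightforward to compare them or compute the length of the trip.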