Separate 'from' and 'to' places and dates from text using NLP (Python)

I have a sample text like:
"I'm travelling from Spain to India i.e on 23/09/2017 to 27/09/2017"
From this type of text I want to separate the 'from' and 'to' countries and the dates.
How should I approach this?

To install spaCy, follow these steps: https://spacy.io/docs/usage/
string = "I'm travelling from Spain to India i.e on 23/09/2017 to 27/09/2017"
import re
import spacy
nlp = spacy.load('en')
doc = nlp(string)
sentence = doc.text
for ent in doc.ents:
if ent.label_ == 'GPE':
print ent.text
Output
Spain
India
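The snippet above only separates out the places. A minimal sketch for the dates as well, assuming they keep the dd/mm/yyyy format of the sample and that the 'from' place and date always precede the 'to' ones:

import re
import spacy

string = "I'm travelling from Spain to India i.e on 23/09/2017 to 27/09/2017"
nlp = spacy.load('en')
doc = nlp(string)

# first GPE is the origin, second the destination (a word-order assumption)
places = [ent.text for ent in doc.ents if ent.label_ == 'GPE']
# dates in the sample follow a fixed dd/mm/yyyy pattern
dates = re.findall(r'\d{2}/\d{2}/\d{4}', string)

print('from:', places[0], dates[0])   # from: Spain 23/09/2017
print('to:', places[1], dates[1])     # to: India 27/09/2017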
Reference
https://spacy.io/docs/usage/entity-recognition

Related

spaCy identifies names with digits in them as PERSON entities

I am using spaCy version 2.3.2.
When predicting names like santosh12647578 or kadge16577, spaCy identifies them as PERSON entities.
How do I tell spaCy not to consider a token a PERSON entity if it contains a number?
Can I use the EntityRuler for this? If so, how should I approach it?
Any help will be highly appreciated.
You can use rule-based components after the NER statistical model to correct common errors.
import spacy

nlp = spacy.load('en_core_web_lg')

def reduce_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        # drop PERSON entities that contain a digit in any of their tokens
        if ent.label_ == "PERSON" and any(char.isdigit() for tok in ent for char in tok.text):
            pass
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

nlp.add_pipe(reduce_person_entities, after='ner')

doc = nlp('Some example usernames include kadge16577 (Kadge Smith).')
for ent in doc.ents:
    print(ent, ent.label_)
Output
Kadge Smith PERSON
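Note that passing a function directly to nlp.add_pipe works in spaCy 2.x only. In spaCy 3.x custom components must be registered by name first; a sketch of the equivalent:

import spacy
from spacy.language import Language

@Language.component("reduce_person_entities")
def reduce_person_entities(doc):
    # keep every entity except PERSON spans containing a digit
    doc.ents = [
        ent for ent in doc.ents
        if not (ent.label_ == "PERSON"
                and any(ch.isdigit() for tok in ent for ch in tok.text))
    ]
    return doc

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("reduce_person_entities", after="ner")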

Replace specific text with a redacted version using Python

I am looking to do the opposite of what has been done here:
import re
text = '1234-5678-9101-1213 1415-1617-1819-hello'
re.sub(r"(\d{4}-){3}(?=\d{4})", "XXXX-XXXX-XXXX-", text)
output = 'XXXX-XXXX-XXXX-1213 1415-1617-1819-hello'
Partial replacement with re.sub()
My overall goal is to replace all XXXX within a text using a neural network. XXXX can represent names, places, numbers, dates, etc. that are in a .csv file.
The end result would look like:
XXXX went to XXXX XXXXXX
Sponge Bob went to Disney World.
In short, I am unmasking text and replacing it with a generated dataset using fuzzy.
You can do it using named-entity recognition (NER). It's fairly simple and there are off-the-shelf tools out there to do it, such as spaCy.
NER is an NLP task where a neural network (or other method) is trained to detect certain entities, such as names, places, dates and organizations.
Example:
Sponge Bob went to South beach, he payed a ticket of $200!
I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.
It returns the entities detected in each sentence (names, places, money amounts, and so on).
Just be aware that detection is not 100% accurate!
Here is a little snippet for you to try out:
import spacy

phrases = ['Sponge Bob went to South beach, he payed a ticket of $200!',
           'I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.']

nlp = spacy.load('en')

for phrase in phrases:
    doc = nlp(phrase)
    replaced = ""
    for token in doc:
        # token.ent_type_ is non-empty when the token belongs to an entity
        if token.ent_type_:
            replaced += "XXXX "
        else:
            replaced += token.text + " "
    print(replaced)
Read more here: https://spacy.io/usage/linguistic-features#named-entities
You could, instead of replacing with XXXX, replace based on the entity type of the token, like:
if token.ent_type_ == "PERSON":
    replaced += "<PERSON> "
Then:
import re, random

personames = ["Jack", "Mike", "Bob", "Dylan"]
# note: random.choice is evaluated once, so every <PERSON> gets the same name
phrase = re.sub(r"<PERSON>", random.choice(personames), phrase)
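Putting the fragments together, a runnable sketch (the personames list is illustrative; using a callable replacement draws a fresh name for each match):

import re
import random
import spacy

nlp = spacy.load('en')
personames = ["Jack", "Mike", "Bob", "Dylan"]

doc = nlp('Sponge Bob went to South beach, he payed a ticket of $200!')

replaced = ""
for token in doc:
    if token.ent_type_ == "PERSON":
        # emit one placeholder per entity, not per token ('Sponge Bob'
        # is two tokens but should become a single <PERSON>)
        if token.ent_iob_ == "B":
            replaced += "<PERSON> "
    else:
        replaced += token.text + " "

# a callable picks a (possibly different) random name for each match
replaced = re.sub(r"<PERSON>", lambda _: random.choice(personames), replaced)
print(replaced)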

How do I get the correct NER using SpaCy from text like "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired"?

How do I get the correct NER using SpaCy from text like "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site."
here "Criticized Trump" is recognized as person instead of "Trump" as person.
How to pre-process and lower case the text like "Criticized" or "Texts" from the above string to overcome above issue or any other technique to do so.
import en_core_web_sm
from pprint import pprint

nlp = en_core_web_sm.load()

sent = ("F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site")
doc = nlp(sent)
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])
Result from the above code:
"Criticized Trump" as PERSON and "Texts" as GPE.
Expected result:
"Trump" as PERSON instead of "Criticized Trump", and "Texts" with no entity label instead of GPE.
You can add more examples of Named Entities to tune the NER model. Everything needed to prepare training data is described at https://spacy.io/usage/training. You can use Prodigy (an annotation tool from the spaCy creators, https://prodi.gy) to mark Named Entities in your data.
Indeed, you can pre-process the text with POS tagging and lower-case words like "Criticized" or "Texts" that are not proper nouns.
Proper capitalization (lower vs. upper case) helps the NER tagger.
sent = "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site"
doc = nlp(sent)
words = []
spaces = []
for a in doc:
if a.pos_ != 'PROPN':
words.append( a.text.lower() )
else:
words.append(a.text)
spaces.append(a.whitespace_)
spaces = [len(sp) for sp in spaces]
docNew = Doc(nlp.vocab, words=words, spaces=spaces)
print(docNew)
# F.B.I. Agent Peter Strzok, who criticized Trump in texts, is fired - the New York Times SectionsSEARCHSkip to contentskip to site
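To check whether the pre-processing actually helps, one can feed the rebuilt text back through the pipeline; a quick sketch (the exact entities returned depend on the model version):

doc2 = nlp(docNew.text)
# with the casing fixed, 'criticized' and 'texts' are no longer
# capitalized, so they should no longer be absorbed into entities
print([(ent.text, ent.label_) for ent in doc2.ents])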

Removing names from noun chunks in spacy

Is there a way to remove the names of persons from noun chunks?
Here is the code:
import en_vectors_web_lg

nlp = en_vectors_web_lg.load()
text = "John Smith is lookin for Apple ipod"
doc = nlp(text)

for chunk in doc.noun_chunks:
    print(chunk.text)
Current output
John Smith
Apple ipod
I would like to have an output like below, where the names of people are ignored. How can I achieve this?
Apple ipod
Reference: spaCy ents
import spacy

# load the model
nlp = spacy.load('en_core_web_lg')
doc = nlp(u"John Smith is lookin for Apple ipod")

# build the filter list of entities identified as PERSON
fil = [i for i in doc.ents if i.label_.lower() in ["person"]]

# loop through the noun chunks
for chunk in doc.noun_chunks:
    # skip chunks that match a person entity
    if chunk not in fil:
        print(chunk.text)
Output:
Apple ipod
Hope this helps.
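Span-to-Span comparison like chunk not in fil can behave differently across spaCy versions; an alternative sketch filters on the entity type of the chunk's root token instead:

for chunk in doc.noun_chunks:
    # skip chunks whose head token belongs to a PERSON entity
    if chunk.root.ent_type_ != "PERSON":
        print(chunk.text)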

Relationship extraction between person and city/state

I'm trying to take a sentence and extract the relationship between Person(PER) and Place(GPE).
Sentence: "John is from Ohio, Michael is from Florida and Rebecca is from Nashville which is in Tennessee."
For the final person, she has both a city and a state that could get extracted as her place. So far, I've tried using nltk to do this, but have only been able to extract her city and not her state.
What I've tried:
import re
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.sem.relextract import extract_rels, rtuple

sentence = "John is from Ohio, Michael is from Florida and Rebecca is from Nashville which is in Tennessee."
chunked = ne_chunk(pos_tag(word_tokenize(sentence)))

ISFROM = re.compile(r'.*\bfrom\b.*')
rels = extract_rels('PER', 'GPE', chunked, corpus='ace', pattern=ISFROM)
for rel in rels:
    print(rtuple(rel))
My output is:
[PER: 'John/NNP'] 'is/VBZ from/IN' [GPE: 'Ohio/NNP']
[PER: 'Michael/NNP'] 'is/VBZ from/IN' [GPE: 'Florida/NNP']
[PER: 'Rebecca/NNP'] 'is/VBZ from/IN' [GPE: 'Nashville/NNP']
The problem is Rebecca. How can I extract that both Nashville and Tennessee are part of her location? Or even just Tennessee alone?
It seems to me that you first have to extract the intra-location relationship (Nashville is in Tennessee), then transitively assign all locations to Rebecca (if Rebecca is from Nashville and Nashville is in Tennessee, then Rebecca is from both Nashville and Tennessee).
That means one more relationship type and some logic for the above inference (things get complicated pretty quickly, but that is hard to avoid).
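A minimal sketch of that idea, adding an IN pattern for the GPE-to-GPE relation (the merging logic is illustrative and the extraction window may need tuning):

import re
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.sem.relextract import extract_rels

sentence = "John is from Ohio, Michael is from Florida and Rebecca is from Nashville which is in Tennessee."
chunked = ne_chunk(pos_tag(word_tokenize(sentence)))

ISFROM = re.compile(r'.*\bfrom\b.*')
IN = re.compile(r'.*\bin\b.*')

def clean(text):
    # extract_rels yields strings like 'Rebecca/NNP'; strip the POS tags
    return ' '.join(w.split('/')[0] for w in text.split())

# PER -> GPE ("is from") relations, as in the question
per_locs = {}
for rel in extract_rels('PER', 'GPE', chunked, corpus='ace', pattern=ISFROM):
    per_locs.setdefault(clean(rel['subjtext']), set()).add(clean(rel['objtext']))

# GPE -> GPE ("which is in") relations, e.g. Nashville -> Tennessee
gpe_in = {}
for rel in extract_rels('GPE', 'GPE', chunked, corpus='ace', pattern=IN):
    gpe_in[clean(rel['subjtext'])] = clean(rel['objtext'])

# transitive step: if Rebecca is from Nashville and Nashville is in
# Tennessee, then Tennessee is also one of Rebecca's locations
for person, locs in per_locs.items():
    for loc in list(locs):
        while loc in gpe_in:
            loc = gpe_in[loc]
            locs.add(loc)
    print(person, sorted(locs))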
