I'm trying to take a sentence and extract the relationship between a Person (PER) and a Place (GPE).
Sentence: "John is from Ohio, Michael is from Florida and Rebecca is from Nashville which is in Tennessee."
For the final person, both a city and a state could be extracted as her place. So far, I've tried using nltk to do this, but have only been able to extract her city and not her state.
What I've tried:
import re
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.sem.relextract import extract_rels, rtuple

sentence = "John is from Ohio, Michael is from Florida and Rebecca is from Nashville which is in Tennessee."

# NE-chunk the POS-tagged tokens so extract_rels can pair up entities
chunked = ne_chunk(pos_tag(word_tokenize(sentence)))

# match any filler text containing the word "from"
ISFROM = re.compile(r'.*\bfrom\b.*')

rels = extract_rels('PER', 'GPE', chunked, corpus='ace', pattern=ISFROM)
for rel in rels:
    print(rtuple(rel))
My output is:
[PER: 'John/NNP'] 'is/VBZ from/IN' [GPE: 'Ohio/NNP']
[PER: 'Michael/NNP'] 'is/VBZ from/IN' [GPE: 'Florida/NNP']
[PER: 'Rebecca/NNP'] 'is/VBZ from/IN' [GPE: 'Nashville/NNP']
The problem is Rebecca. How can I extract that both Nashville and Tennessee are part of her location? Or even just Tennessee alone?
It seems to me that you first have to extract the intra-location relationship (Nashville is in Tennessee), and then transitively assign all enclosing locations to Rebecca (if Rebecca is from Nashville and Nashville is in Tennessee, then Rebecca is from both Nashville and Tennessee).
That means one more relationship type plus some logic for the inference above (things get complicated pretty quickly, but that is hard to avoid).
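Here is a minimal sketch of that idea. It reuses extract_rels for a GPE-to-GPE "in" relation and then propagates locations transitively. It assumes ne_chunk tags both Nashville and Tennessee as GPE and that the filler "which is in" falls inside extract_rels' default window; the subjtext/objtext values keep their POS tags (e.g. 'John/NNP'), so a small helper strips them:

import re
from collections import defaultdict
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.sem.relextract import extract_rels

sentence = ("John is from Ohio, Michael is from Florida and "
            "Rebecca is from Nashville which is in Tennessee.")
chunked = ne_chunk(pos_tag(word_tokenize(sentence)))

ISFROM = re.compile(r'.*\bfrom\b.*')
ISIN = re.compile(r'.*\bin\b.*')

def clean(text):
    # strip POS tags: 'Nashville/NNP' -> 'Nashville'
    return ' '.join(tok.split('/')[0] for tok in text.split())

# person -> places stated directly ("Rebecca is from Nashville")
person_locs = defaultdict(set)
for rel in extract_rels('PER', 'GPE', chunked, corpus='ace', pattern=ISFROM):
    person_locs[clean(rel['subjtext'])].add(clean(rel['objtext']))

# place -> enclosing place ("Nashville ... in Tennessee")
contained_in = {}
for rel in extract_rels('GPE', 'GPE', chunked, corpus='ace', pattern=ISIN):
    contained_in[clean(rel['subjtext'])] = clean(rel['objtext'])

# transitive step: if X is from A and A is in B, then X is also from B
for person, places in person_locs.items():
    frontier = list(places)
    while frontier:
        parent = contained_in.get(frontier.pop())
        if parent and parent not in places:
            places.add(parent)
            frontier.append(parent)
    print(person, sorted(places))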
I'm new to Python. I'm using Python in a Jupyter notebook and have imported pandas and pypostal.
This is my code:
import numpy as np
import pandas as pd
from postal.parser import parse_address

df = pd.read_csv("./file.csv").head(20)

# parse each address into a list of (value, label) tuples
df['parse_addr'] = df['LongAddr'].apply(parse_address)

df.to_csv('./new_file.csv', index=False)
print("JOB DONE")
This is my file.csv:
customer_key Company_Code Name Address_Type LongAddr
0 CHIT000001 ZY1 Terry CHI Nathan Road, Kowloon, Hong Kong
1 ENGT000002 BH6 Mary ENG Flat E, 19/F, Blk A, Hilton building
2 RCHIT000003 EG9 John.G CHI Marble Road Tai Koo Hong Kong
I have tried outputting as csv, json, and xml, but the file format doesn't change at all, and I have no clue how to deal with this format. The output turns out like this:
0 [(Hong Kong, state),(Kowloon, city),(Nathan Road, Road)]
1 [(flat E, unit), (19, level), (blk a hilton building, House)]
2 [(Hong Kong, state),(Tai Koo, city),(Marble Road, Road)]
All I want is a .csv or .xlsx file, with output like this:
customer_key, state, city, road, house, level, unit
0 CHIT000001, Hong Kong, Kowloon, Nathan Road, , ,
1 ENGT000002, , , , Blk A Hilton building, 19/F, Flat E
2 RCHIT000003, Hong Kong, Tai Koo, Marble Road, , ,
Create a dictionary from the resulting list of tuples by extracting state, city, road, etc.
Create a new DataFrame from that dictionary; then you can use to_csv() to export the file.
Use the respective file extension in to_csv(). A sketch of these steps follows below.
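A minimal sketch of those steps, assuming parse_address returns (value, label) tuples as shown in your output, the labels match libpostal's usual names (state, city, road, house, level, unit), and the column names are those shown in your file.csv:

import pandas as pd
from postal.parser import parse_address

df = pd.read_csv("./file.csv").head(20)

def to_columns(addr):
    # invert each (value, label) tuple into a {label: value} dict
    return {label: value for value, label in parse_address(addr)}

# expand the dicts into one column per address component
parsed = df['LongAddr'].apply(to_columns).apply(pd.Series)
out = pd.concat([df[['customer_key']], parsed], axis=1)
out.to_csv('./new_file.csv', index=False)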
Please provide sample output next time; the steps to reproduce were not clear.
Refer to the link below:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
I am looking to do the opposite of what has been done here:
import re

text = '1234-5678-9101-1213 1415-1617-1819-hello'
re.sub(r"(\d{4}-){3}(?=\d{4})", "XXXX-XXXX-XXXX-", text)
# output: 'XXXX-XXXX-XXXX-1213 1415-1617-1819-hello'
Partial replacement with re.sub()
My overall goal is to replace all XXXX within a text using a neural network. XXXX can represent names, places, numbers, dates, etc. that are in a .csv file.
The end result would look like:
XXXX went to XXXX XXXXXX
Sponge Bob went to Disney World.
In short, I am unmasking text and replacing the masks with values from a generated dataset, using fuzzy.
You can do it using named-entity recognition (NER). It's fairly simple, and there are off-the-shelf tools out there for it, such as spaCy.
NER is an NLP task where a neural network (or another method) is trained to detect certain entities, such as names, places, dates and organizations.
Example:
Sponge Bob went to South beach, he payed a ticket of $200!
I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.
Returns: [highlighted-entity visualization from the original answer omitted]
Just be aware that this is not 100% accurate!
Here is a little snippet for you to try out:

import spacy

phrases = ['Sponge Bob went to South beach, he payed a ticket of $200!',
           'I know, Michael is a good person, he goes to McDonalds, but donates to charity at St. Louis street.']

nlp = spacy.load('en')

for phrase in phrases:
    doc = nlp(phrase)
    replaced = ""
    for token in doc:
        if token.ent_type_:  # the token is part of a named entity
            replaced += "XXXX "
        else:
            replaced += token.text + " "
    print(replaced)
Read more here: https://spacy.io/usage/linguistic-features#named-entities
You could, instead of replacing with XXXX, replace based on the entity type, like:

if token.ent_type_ == "PERSON":
    replaced += "<PERSON> "
Then:

import re, random

personames = ["Jack", "Mike", "Bob", "Dylan"]
# re has no replace(); use re.sub, drawing a random name for each match
phrase = re.sub("<PERSON>", lambda _: random.choice(personames), phrase)
How do I get the correct NER using spaCy from text like "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site."?
Here "Criticized Trump" is recognized as a person instead of just "Trump".
How can I pre-process and lower-case words like "Criticized" or "Texts" in the above string to overcome this issue, or is there another technique to do so?
import en_core_web_sm
from pprint import pprint

nlp = en_core_web_sm.load()

sent = ("F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired "
        "- The New York Times SectionsSEARCHSkip to contentSkip to site")
doc = nlp(sent)
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])
Result from the above code:
"Criticized Trump" is tagged as PERSON and "Texts" as GPE.
The expected result would be:
"Trump" as PERSON (instead of "Criticized Trump"), and "Texts" with no entity tag at all (instead of GPE).
You can add more examples of named entities to tune the NER model. Here you have all the information needed for preparing the training data: https://spacy.io/usage/training. You can use Prodigy (the annotation tool from the spaCy creators, https://prodi.gy) to mark named entities in your data.
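For illustration, training data in spaCy's v2-style format looks roughly like this; the offsets are character positions, and the example sentence and spans here are my own, not taken from the linked docs:

TRAIN_DATA = [
    ("F.B.I. Agent Peter Strzok criticized Trump in texts.",
     {"entities": [(13, 25, "PERSON"), (37, 42, "PERSON")]}),
]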
Indeed, you can pre-process the text using POS tagging in order to lower-case words like "Criticized" or "Texts" that are not proper nouns.
Proper capitalization (lower vs. upper case) helps the NER tagger.
sent = "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times SectionsSEARCHSkip to contentSkip to site"
doc = nlp(sent)
words = []
spaces = []
for a in doc:
if a.pos_ != 'PROPN':
words.append( a.text.lower() )
else:
words.append(a.text)
spaces.append(a.whitespace_)
spaces = [len(sp) for sp in spaces]
docNew = Doc(nlp.vocab, words=words, spaces=spaces)
print(docNew)
# F.B.I. Agent Peter Strzok, who criticized Trump in texts, is fired - the New York Times SectionsSEARCHSkip to contentskip to site
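To check the effect, you can then re-run the pipeline on the normalized text (a quick sketch; the resulting tags depend on the model):

from pprint import pprint

doc2 = nlp(docNew.text)
pprint([(X.text, X.ent_type_) for X in doc2 if X.ent_type_])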
I am trying to read a wiki page, and collect and enumerate all of its sentences.
# read the wiki page
import re
import wikipedia

eliz = wikipedia.page("Elizabeth II")
fullText2 = eliz.content

# try to split on sentence-ending punctuation, skipping abbreviations
m = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)', fullText2)

docs = []
for i in m:
    print(i)
    docs.append(i)
But it doesn't seem to split the sentences properly; for example, I get all of this printed as a single chunk:
"Elizabeth received private tuition in constitutional history from
Henry Marten, Vice-Provost of Eton College, and learned French from a
succession of native-speaking governesses. A Girl Guides company, the
1st Buckingham Palace Company, was formed specifically so she could
socialise with girls her own age. Later, she was enrolled as a Sea
Ranger.In 1939, Elizabeth's parents toured Canada and the United
States. As in 1927, when her parents had toured Australia and New
Zealand, Elizabeth remained in Britain, since her father thought her
too young to undertake public tours. Elizabeth "looked tearful" as her
parents departed. They corresponded regularly, and she and her parents
made the first royal transatlantic telephone call on 18 May."
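One likely culprit is visible in the quoted text itself: "Ranger.In 1939" has no space after the period, so any whitespace-based split misses that boundary. A hedged sketch that normalizes such cases first and then uses nltk's sent_tokenize (which needs the 'punkt' data downloaded) rather than a hand-rolled regex:

import re
import wikipedia
from nltk import sent_tokenize  # run nltk.download('punkt') first

fullText2 = wikipedia.page("Elizabeth II").content

# insert a missing space after a period glued to a capitalized word;
# note this may also split abbreviations like "U.S." too eagerly
normalized = re.sub(r'\.(?=[A-Z])', '. ', fullText2)

docs = sent_tokenize(normalized)
for i, s in enumerate(docs):
    print(i, s)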
I have a sample text like:
"I'm travelling from Spain to India i.e on 23/09/2017 to 27/09/2017"
From this type of text I want to separate the from/to countries and the dates. How can I approach this?
To install spaCy, follow these steps: https://spacy.io/docs/usage/
string = "I'm travelling from Spain to India i.e on 23/09/2017 to 27/09/2017"
import re
import spacy
nlp = spacy.load('en')
doc = nlp(string)
sentence = doc.text
for ent in doc.ents:
if ent.label_ == 'GPE':
print ent.text
Output
Spain
India
Reference
https://spacy.io/docs/usage/entity-recognition
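The dates can be picked up the same way. Whether the model tags "23/09/2017" as DATE varies, so this sketch falls back to a regex (the dd/mm/yyyy pattern is my own assumption, not part of the spaCy API):

import re

dates = [ent.text for ent in doc.ents if ent.label_ == 'DATE']
if not dates:
    # fall back to a plain dd/mm/yyyy pattern
    dates = re.findall(r'\d{2}/\d{2}/\d{4}', doc.text)
print(dates)  # expected: ['23/09/2017', '27/09/2017']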