NLTK Named Entity Category Labels - python-3.x

I keep hitting a wall with NLTK. I've been able to tokenize and categorize a single string of text; however, if I try to apply the script across multiple rows, I get the tokens but no category, and the category is the most important part for me.
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
SENT_DETECTOR = nltk.data.load('tokenizers/punkt/english.pickle')
Example:
ex = 'John'
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)
Output:
(S (PERSON John/NNP))
That is exactly what I'm looking for. I need the category, not just NNP.
When I apply this across a table, I just get the token and no category.
Example:
df = pd.read_csv('ex3.csv')
df
Input:
   Order    Text
0      0    John
1      1    Paul
2      2  George
3      3   Ringo
Code:
df['results'] = df.Text.apply(lambda x: nltk.ne_chunk(pos_tag(word_tokenize(x))))
df
Output:
print(df)
   Order    Text            results
0      0    John    [[(John, NNP)]]
1      1    Paul    [[(Paul, NNP)]]
2      2  George  [[(George, NNP)]]
3      3   Ringo    [[(Ringo, NN)]]
I'm getting the tokens and it works across all rows, but it is not giving me the category 'PERSON'.
I really need Categories.
Is this not possible? Thanks for the help.

Here we go...
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
df = pd.read_csv("ex3.csv")
# print(df)
text1 = df['text'].to_list()
text =[]
for i in text1:
text.append(i.capitalize())
# create a column for store resullts
df['results'] = ""
for i in range(len(text)):
SENT_DETECTOR = nltk.data.load('tokenizers/punkt/english.pickle')
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(text[i])))
df['results'][i] = ne_tree[0].label()
print(df)
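For reference, the same idea can be written as a single apply. This is only a sketch: the helper name ne_label is made up here, and the fallback assumes that when no entity is recognised, ne_tree[0] is a plain (word, tag) tuple rather than a labelled subtree.
def ne_label(text):
    # chunk the text and look at the first node of the resulting tree
    node = nltk.ne_chunk(pos_tag(word_tokenize(text.capitalize())))[0]
    # subtrees (e.g. PERSON) expose .label(); plain tokens fall back to the POS tag
    return node.label() if hasattr(node, 'label') else node[1]

df['results'] = df['Text'].apply(ne_label)
print(df)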

Related

Problems to extract the pair verb-noun

I'm interested in extracting the verb-noun pairs from my "task" column, so I first loaded the table using pandas
import pandas as pd
and then the file
DF = pd.read_excel(r'/content/contentdrive/MyDrive/extrac.xlsx')
After that I import nltk and some packages:
import nltk
I create a function to process each text:
def processa(Text_tasks):
    text = nltk.word_tokenize(Text_tasks)
    pos_tagged = nltk.pos_tag(text)
    NV = list(filter(lambda x: x[1] == "NN" or x[1] == "VB", pos_tagged))
    return NV
In the end, I try to generate a list with the results:
results = DF['task'].map(processa)
and this happens:
[screenshot of the result]
here is the data: https://docs.google.com/spreadsheets/d/1bRuTqpATsBglWMYIe-AmO5A2kq_i-0kg/edit?usp=sharing&ouid=115543599430411372875&rtpof=true&sd=true
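No answer is included above, so purely for context, here is a minimal runnable sketch of the same pipeline with made-up rows standing in for the Excel file (the toy `task` strings and the inline DataFrame are assumptions, not the asker's data):
import pandas as pd
import nltk

# models that word_tokenize and pos_tag rely on
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def processa(Text_tasks):
    text = nltk.word_tokenize(Text_tasks)
    pos_tagged = nltk.pos_tag(text)
    NV = list(filter(lambda x: x[1] == "NN" or x[1] == "VB", pos_tagged))
    return NV

# hypothetical stand-in for pd.read_excel(...)
DF = pd.DataFrame({'task': ["inspect the valve", "replace the filter"]})
results = DF['task'].map(processa)
print(results)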

How to add each element (sentence) of a list to a pandas column?

I am extracting information about chemical elements from Wikipedia. It contains sentences, and I want each sentence to be added as follows:
Molecule  Sentence1    Sentence1 and sentence2   All_sentence
MgO       this is s1.  this is s1. this is s2.   all_sentence
CaO       this is s1.  this is s1. this is s2.   all_sentence
What I've achieved so far
import spacy
import pandas as pd
import wikipediaapi
import csv
wiki_wiki = wikipediaapi.Wikipedia('en')
chemical = input("Write the name of molecule: ")
page_py = wiki_wiki.page(chemical)
sumary = page_py.summary[0:]
nlp = spacy.load('en_core_web_sm')
text_sentences = nlp(sumary)
sent_list = []
for sentence in text_sentences.sents:
    sent_list.append(sentence.text)
#print(sent_list)
df = pd.DataFrame(
    {'Molecule': chemical,
     'Description': sent_list})
print(df.head())
The output looks like:
Molecule  Description
MgO       All sentences are here
MgO       ...
The Molecule column is shown repeatedly for each sentence line, which is not correct.
Please suggest a solution.
It's not clear why you would want to repeat all sentences in each column, but you can get to the form you want with pivot:
import spacy
import pandas as pd
import wikipediaapi
import csv
wiki_wiki = wikipediaapi.Wikipedia('en')
chemical = input("Write the name of molecule: ")
page_py = wiki_wiki.page(chemical)
sumary = page_py.summary[0:]
nlp = spacy.load('en_core_web_sm')
sent_list = [sent.text for sent in nlp(sumary).sents]
#cumul_sent_list = [' '.join(sent_list[:i]) for i in range(1, len(sent_list)+1)] # uncomment to cumulate sentences in columns
df = pd.DataFrame(
    {'Molecule': chemical,
     'Description': sent_list})  # replace sent_list with cumul_sent_list if cumulating
df["Sentences"] = pd.Series([f"Sentence{i + 1}" for i in range(len(df))]) # replace "Sentence{i+1}" with "Sentence1-{i+1}" if cumulating
df = df.pivot(index="Molecule", columns="Sentences", values="Description")
print(df)
sent_list can be created using a list comprehension. Create cumul_sent_list if you want your sentences to be repeated in columns.
Output:
Sentences Sentence1 ... Sentence9
Molecule ...
MgO Magnesium oxide (MgO), or magnesia, is a white... ... According to evolutionary crystal structure pr...
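As a quick illustration of the cumulative variant mentioned in the comments above, a toy two-sentence list (made-up data mirroring the desired table in the question):
sent_list = ["this is s1.", "this is s2."]
# each entry joins all sentences up to and including that position
cumul_sent_list = [' '.join(sent_list[:i]) for i in range(1, len(sent_list) + 1)]
print(cumul_sent_list)
# ['this is s1.', 'this is s1. this is s2.']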

Select rows of a Pandas DataFrame based on Spacy Rule Matcher

I need to slice a pandas DataFrame based on spacy rule-based matcher results. The following is what I tried.
import pandas as pd
import numpy as np
import spacy
from spacy.matcher import Matcher
df = pd.DataFrame([['Eight people believed injured in serious SH1 crash involving truck and three cars at Hunterville',
                    'Fire and emergency responding to incident at Mataura, Southland ouvea premix site',
                    'Civil Defence Minister Peeni Henare heartbroken over Northland flooding',
                    'Far North flooding: New photos reveal damage to roads']]).T
df.columns = ['col1']
nlp = spacy.load("en_core_web_sm")
flood_pattern = [{'LOWER': 'flooding'}]
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("FLOOD_DIS", None, flood_pattern)
titles = (_ for _ in df['col1'])
g = (d for d in nlp.pipe(titles) if matcher(d))
x = list(g)
df2 = df[df['col1'].isin(x)]
df2
This produces an empty DataFrame. However, it should extract the following two rows from df.
Civil Defence Minister Peeni Henare heartbroken over Northland flooding
Far North flooding: New photos reveal damage to roads
You can do the following.
titles = (_ for _ in df['col1'])
A = []
for i in range(len(df)):
    # walk the titles generator and keep titles with exactly one match
    doc = nlp(next(titles))
    if len(matcher(doc)) == 1:
        A.append(str(doc))
df2 = df[df['col1'].isin(A)]
Try this:
matcher.add("FLOOD_DIS", None, flood_pattern)
matches = [True if matcher(doc) else False for doc in nlp.pipe(df['col1'])]
df2 = df[matches][['col1']]
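One caveat worth adding: on spaCy v3 the Matcher.add signature changed, so the callback is no longer a positional argument and the patterns are passed as a list:
# spaCy v3+: pass a list of patterns; on_match is an optional keyword argument
matcher.add("FLOOD_DIS", [flood_pattern])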

Python Itertools groupby - writing lines to csv file

I am new to Python, apologies if this is a stupid question.
I have a text file with the following input:
Apple Apple1
Apple Apple2
Aaron Aaron1
Aaron Aaron2
Aaron Aaron3
Tree Tree1
I have the following code:
import csv
import sys
from itertools import groupby
with open('File.txt', 'r', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    next(reader, None)
    a = [[k] + [x[1] for x in g] for k, g in groupby(reader, key=lambda row: row[0])]
sys.stdout = open('Out.txt', 'w', encoding='utf-8')
print(str(a))
What I want to achieve:
Apple Apple1,Apple2
Aaron Aaron1,Aaron2,Aaron3
Tree Tree1
However, the output I am now getting is in list form, while I want it to be printed line per line. How can I achieve this?
How about
import pandas as pd

df = pd.read_csv('File.txt', delimiter=' ', header=None)
# sort=False keeps the groups in their original file order (Apple, Aaron, Tree)
grouped = df.groupby(0, sort=False).agg(lambda col: ','.join(col)).to_records()
for group in grouped:
    print(group[0] + ' ' + group[1])
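If you would rather stay with the original csv/itertools.groupby approach, here is a minimal sketch that writes one line per group, keeping the tab delimiter and the header-skip from the question's code (it assumes the file really is tab-separated):
import csv
from itertools import groupby

with open('File.txt', 'r', encoding='utf-8') as csvfile, \
        open('Out.txt', 'w', encoding='utf-8') as out:
    reader = csv.reader(csvfile, delimiter='\t')
    next(reader, None)  # skip the first line, as in the original code
    for key, group in groupby(reader, key=lambda row: row[0]):
        # one line per key, e.g. "Apple Apple1,Apple2"
        out.write(key + ' ' + ','.join(row[1] for row in group) + '\n')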

Using Python Faker generate different data for 5000 rows

I would like to use the Python Faker library to generate 500 lines of data; however, I get repeated data with the code I came up with below. Can you please point out where I'm going wrong? I believe it has something to do with the for loop. Thanks in advance:
from faker import Factory
import pandas as pd
import random
def create_fake_stuff(fake):
    df = pd.DataFrame(columns=('name', 'email', 'bs', 'address', 'city',
                               'state', 'date_time', 'paragraph', 'Conrad',
                               'randomdata'))
    stuff = [fake.name(),
             fake.email(),
             fake.bs(),
             fake.address(),
             fake.city(),
             fake.state(),
             fake.date_time(),
             fake.paragraph(),
             fake.catch_phrase(),
             random.randint(1000, 2000)]
    for i in range(10):
        df.loc[i] = [item for item in stuff]
    print(df)

if __name__ == '__main__':
    fake = Factory.create()
    create_fake_stuff(fake)
Disclaimer: this answer was added long after the question and includes information that does not directly answer it.
There is now a fast new library, Mimesis - Fake Data Generator.
Upside: it is stated to run several times faster than Faker (see my timing below on data similar to that in the question).
Downside: it only works on Python 3.6 and above.
pip install mimesis
>>> from mimesis import Person
>>> from mimesis.enums import Gender
>>> person = Person('en')
>>> person.full_name(gender=Gender.FEMALE)
'Antonetta Garrison'
>>> personru = Person('ru')
>>> personru.full_name()
'Рената Черкасова'
The same with the earlier-developed Faker:
pip install faker
>>> from faker import Faker
>>> fake_ru = Faker('ru_RU')
>>> fake_jp = Faker('ja_JP')
>>> print (fake_ru.name())
Субботина Елена Наумовна
>>> print (fake_jp.name())
大垣 花子
Below is my recent timing of Mimesis vs. Faker, based on the code provided in the answer from forzer0eight:
from faker import Faker
import pandas as pd
import random
fake = Faker()
def create_rows_faker(num=1):
    output = [{"name": fake.name(),
               "address": fake.address(),
               "name": fake.name(),
               "email": fake.email(),
               #"bs": fake.bs(),
               "city": fake.city(),
               "state": fake.state(),
               "date_time": fake.date_time(),
               #"paragraph": fake.paragraph(),
               #"Conrad": fake.catch_phrase(),
               "randomdata": random.randint(1000, 2000)} for x in range(num)]
    return output
%%time
df_faker = pd.DataFrame(create_rows_faker(5000))
CPU times: user 3.51 s, sys: 2.86 ms, total: 3.51 s
Wall time: 3.51 s
from mimesis import Person
from mimesis import Address
from mimesis.enums import Gender
from mimesis import Datetime
person = Person('en')
import pandas as pd
import random
person = Person()
addess = Address()
datetime = Datetime()
def create_rows_mimesis(num=1):
    output = [{"name": person.full_name(gender=Gender.FEMALE),
               "address": addess.address(),
               "name": person.name(),
               "email": person.email(),
               #"bs": person.bs(),
               "city": addess.city(),
               "state": addess.state(),
               "date_time": datetime.datetime(),
               #"paragraph": person.paragraph(),
               #"Conrad": person.catch_phrase(),
               "randomdata": random.randint(1000, 2000)} for x in range(num)]
    return output
%%time
df_mimesis = pd.DataFrame(create_rows_mimesis(5000))
CPU times: user 178 ms, sys: 1.7 ms, total: 180 ms
Wall time: 179 ms
Below is the resulting data for comparison:
df_faker.head(2)
address city date_time email name randomdata state
0 3818 Goodwin Haven\nBrocktown, GA 06168 Valdezport 2004-10-18 20:35:52 joseph81@gomez-beltran.info Deborah Garcia 1218 Oklahoma
1 2568 Gonzales Field\nRichardhaven, NC 79149 West Rachel 1985-02-03 00:33:00 lbeck@wang.com Barbara Pineda 1536 Tennessee
df_mimesis.head(2)
address city date_time email name randomdata state
0 351 Nobles Viaduct Cedar Falls 2013-08-22 08:20:25.288883 chemotherapeutics1964@gmail.com Ernest 1673 Georgia
1 517 Williams Hill Malden 2008-01-26 18:12:01.654995 biochemical1972@yandex.com Jonathan 1845 North Dakota
The following script can remarkably improve the pandas performance: it builds all rows as a list of dicts and constructs the DataFrame in one go.
from faker import Faker
import pandas as pd
import random
fake = Faker()
def create_rows(num=1):
    output = [{"name": fake.name(),
               "address": fake.address(),
               "name": fake.name(),
               "email": fake.email(),
               "bs": fake.bs(),
               "address": fake.address(),
               "city": fake.city(),
               "state": fake.state(),
               "date_time": fake.date_time(),
               "paragraph": fake.paragraph(),
               "Conrad": fake.catch_phrase(),
               "randomdata": random.randint(1000, 2000)} for x in range(num)]
    return output
It takes 5.55s.
%%time
df = pd.DataFrame(create_rows(5000))
Wall time: 5.55 s
I placed the fake stuff array inside my for loop to achieve the desired result:
for i in range(10):
    stuff = [fake.name(),
             fake.email(),
             fake.bs(),
             fake.address(),
             fake.city(),
             fake.state(),
             fake.date_time(),
             fake.paragraph(),
             fake.catch_phrase(),
             random.randint(1000, 2000)]
    df.loc[i] = [item for item in stuff]
print(df)
Using the farsante and mimesis libraries is the easiest way to create Pandas DataFrames with fake data.
import random
import farsante
from mimesis import Person
from mimesis import Address
from mimesis import Datetime
person = Person()
address = Address()
datetime = Datetime()
def rand_int(min_int, max_int):
    def some_rand_int():
        return random.randint(min_int, max_int)
    return some_rand_int

df = farsante.pandas_df([
    person.full_name,
    address.address,
    person.name,
    person.email,
    address.city,
    address.state,
    datetime.datetime,
    rand_int(1000, 2000)], 5)
print(df)
full_name address name ... state datetime some_rand_int
0 Weldon Durham 1027 Nellie Square Bruna ... West Virginia 2030-06-10 09:21:29.179412 1453
1 Veta Conrad 932 Cragmont Arcade Betsey ... Iowa 2017-08-11 23:50:27.479281 1909
2 Vena Kinney 355 Edgar Highway Tyson ... New Hampshire 2002-12-21 05:26:45.723531 1735
3 Adam Sheppard 270 Williar Court Treena ... North Dakota 2011-03-30 19:16:29.015598 1503
4 Penney Allison 592 Oakdale Road Chas ... Maine 2009-12-14 16:31:37.714933 1175
This approach keeps your code clean.
