I need to slice a pandas DataFrame based on spaCy rule-based matcher results. The following is what I tried:
import pandas as pd
import numpy as np
import spacy
from spacy.matcher import Matcher
df = pd.DataFrame([['Eight people believed injured in serious SH1 crash involving truck and three cars at Hunterville',
'Fire and emergency responding to incident at Mataura, Southland ouvea premix site',
'Civil Defence Minister Peeni Henare heartbroken over Northland flooding',
'Far North flooding: New photos reveal damage to roads']]).T
df.columns = ['col1']
nlp = spacy.load("en_core_web_sm")
flood_pattern = [{'LOWER': 'flooding'}]
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("FLOOD_DIS", None, flood_pattern)
titles = (_ for _ in df['col1'])
g = (d for d in nlp.pipe(titles) if matcher(d))
x = list(g)
df2 = df[df['col1'].isin(x)]
df2
This produces an empty DataFrame. However, it should extract the following two rows from df:
Civil Defence Minister Peeni Henare heartbroken over Northland flooding
Far North flooding: New photos reveal damage to roads
You can do the following:
A = []
for doc in nlp.pipe(df['col1']):
    if matcher(doc):  # keep titles with at least one pattern match
        A.append(doc.text)
df2 = df[df['col1'].isin(A)]
Try this:
matcher.add("FLOOD_DIS", None, flood_pattern)
matches = [bool(matcher(doc)) for doc in nlp.pipe(df['col1'])]
df2 = df[matches][['col1']]
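Note: the question's snippet comes back empty because x ends up holding spaCy Doc objects while df['col1'] holds plain strings, so isin never finds a match; comparing on the Doc's text fixes it. Also, in spaCy v3 the Matcher.add signature changed to take a list of patterns instead of the old on_match positional argument. A minimal sketch combining both fixes (assuming the df, nlp and flood_pattern defined in the question):
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("FLOOD_DIS", [flood_pattern])  # spaCy v3 API: patterns passed as a list
x = [d.text for d in nlp.pipe(df['col1']) if matcher(d)]  # compare text, not Doc objects
df2 = df[df['col1'].isin(x)]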
I'm interested in extracting the verb-noun pairs from my "task" column, so I first loaded the table using pandas:
import pandas as pd
and then loaded the file:
DF = pd.read_excel(r'/content/contentdrive/MyDrive/extrac.xlsx')
After that, I import nltk and some packages:
import nltk
I create a function to process each text:
def processa(Text_tasks):
    text = nltk.word_tokenize(Text_tasks)
    pos_tagged = nltk.pos_tag(text)
    NV = list(filter(lambda x: x[1] == "NN" or x[1] == "VB", pos_tagged))
    return NV
In the end, I try to generate a list with the results:
results = DF['task'].map(processa)
and it fails.
Here is the data: https://docs.google.com/spreadsheets/d/1bRuTqpATsBglWMYIe-AmO5A2kq_i-0kg/edit?usp=sharing&ouid=115543599430411372875&rtpof=true&sd=true
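Without the error screenshot it's hard to say exactly what failed, but a common culprit with .map over an Excel column is non-string cells (e.g. NaN), which make word_tokenize raise. A minimal sketch that guards against that (the 'task' column name and file path are taken from the question):
import pandas as pd
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

DF = pd.read_excel(r'/content/contentdrive/MyDrive/extrac.xlsx')

def processa(text_task):
    # skip empty/non-string cells instead of letting word_tokenize raise
    if not isinstance(text_task, str):
        return []
    pos_tagged = nltk.pos_tag(nltk.word_tokenize(text_task))
    return [pair for pair in pos_tagged if pair[1] in ("NN", "VB")]

results = DF['task'].map(processa)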
I am extracting information about chemical compounds from Wikipedia. The summary contains sentences, and I want each sentence to be added to a table as follows:
Molecule  Sentence1    Sentence1 and sentence2   All_sentence
MgO       this is s1.  this is s1. this is s2.   all_sentence
CaO       this is s1.  this is s1. this is s2.   all_sentence
What I've achieved so far:
import spacy
import pandas as pd
import wikipediaapi
import csv
wiki_wiki = wikipediaapi.Wikipedia('en')
chemical = input("Write the name of molecule: ")
page_py = wiki_wiki.page(chemical)
summary = page_py.summary
nlp = spacy.load('en_core_web_sm')
text_sentences = nlp(summary)
sent_list = []
for sentence in text_sentences.sents:
    sent_list.append(sentence.text)
#print(sent_list)
df = pd.DataFrame(
    {'Molecule': chemical,
     'Description': sent_list})
print(df.head())
The output looks like:
  Molecule                 Description
0      MgO  first sentence of the summary...
1      MgO  second sentence of the summary...
...
The Molecule column is repeated for every sentence row, which is not what I want. Please suggest a solution.
It's not clear why you would want to repeat all sentences in each column, but you can get to the form you want with pivot:
import spacy
import pandas as pd
import wikipediaapi
import csv
wiki_wiki = wikipediaapi.Wikipedia('en')
chemical = input("Write the name of molecule: ")
page_py = wiki_wiki.page(chemical)
summary = page_py.summary
nlp = spacy.load('en_core_web_sm')
sent_list = [sent.text for sent in nlp(summary).sents]
#cumul_sent_list = [' '.join(sent_list[:i]) for i in range(1, len(sent_list)+1)]  # uncomment to cumulate sentences in columns
df = pd.DataFrame(
    {'Molecule': chemical,
     'Description': sent_list})  # replace sent_list with cumul_sent_list if cumulating
df["Sentences"] = pd.Series([f"Sentence{i + 1}" for i in range(len(df))])  # replace "Sentence{i+1}" with "Sentence1-{i+1}" if cumulating
df = df.pivot(index="Molecule", columns="Sentences", values="Description")
print(df)
sent_list is built with a list comprehension. Build cumul_sent_list instead if you want the sentences to accumulate across columns.
Output:
Sentences Sentence1 ... Sentence9
Molecule ...
MgO Magnesium oxide (MgO), or magnesia, is a white... ... According to evolutionary crystal structure pr...
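As a follow-up, DataFrame.pivot requires each (index, columns) pair to be unique, which holds here because every sentence of a molecule gets its own SentenceN label. To build the multi-molecule table from the question, with the cumulative "Sentence1 and sentence2" style columns, one sketch is to collect one frame per molecule and concatenate before pivoting (the helper below is illustrative, not from the original answer):
import pandas as pd

def molecule_frame(chemical, sent_list):
    # column i holds sentences 1..i joined together (cumulative)
    cumul = [' '.join(sent_list[:i]) for i in range(1, len(sent_list) + 1)]
    out = pd.DataFrame({'Molecule': chemical, 'Description': cumul})
    out['Sentences'] = [f"Sentence1-{i + 1}" for i in range(len(out))]
    return out

frames = [molecule_frame('MgO', ['this is s1.', 'this is s2.']),
          molecule_frame('CaO', ['this is s1.', 'this is s2.'])]
table = pd.concat(frames).pivot(index='Molecule', columns='Sentences', values='Description')
print(table)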
I keep hitting a wall with NLTK. I've been able to tokenize and categorize a single string of text; however, if I try to apply the script across multiple rows, I get the tokens, but it does not return a category, which is the most important part for me.
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
SENT_DETECTOR = nltk.data.load('tokenizers/punkt/english.pickle')
Example:
ex = 'John'
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)
Output:
(S (PERSON John/NNP))
That is exactly what I'm looking for: I need the category, not just NNP.
When I apply this across a table, I just get the token and no category.
Example:
df = pd.read_csv('ex3.csv')
df
Input:
Order Text
0 0 John
1 1 Paul
2 2 George
3 3 Ringo
Code:
df['results'] = df.Text.apply(lambda x: nltk.ne_chunk(pos_tag(word_tokenize(x))))
df
Output:
print(df)
Order Text results
0 0 John [[(John, NNP)]]
1 1 Paul [[(Paul, NNP)]]
2 2 George [[(George, NNP)]]
3 3 Ringo [[(Ringo, NN)]]
I'm getting the tokens, and it's working across all rows, but it is not giving me the PERSON category.
I really need the categories.
Is this not possible? Thanks for the help.
Here we go...
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
df = pd.read_csv("ex3.csv")
# print(df)
text = [t.capitalize() for t in df['Text'].to_list()]  # column is 'Text' in the sample input
# create a column to store results
df['results'] = ""
for i in range(len(text)):
    ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(text[i])))
    node = ne_tree[0]
    # ne_chunk wraps recognized entities in a labelled subtree;
    # unchunked tokens stay as plain (word, tag) tuples
    df.at[i, 'results'] = node.label() if isinstance(node, nltk.Tree) else node[1]
print(df)
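If you want every entity label in a row rather than just the first node, a small helper applied with apply keeps it tidy (a sketch, not from the original answer):
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

def ne_labels(text):
    # collect the label of every chunked entity in the sentence
    tree = nltk.ne_chunk(pos_tag(word_tokenize(text)))
    return [subtree.label() for subtree in tree if isinstance(subtree, nltk.Tree)]

df['results'] = df['Text'].apply(ne_labels)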
Below is the code:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
for w in Wrd_Freq:
    print(ps.stem(w))
Output:
read
peopl
say
work
I need the output as
['read',
'people',
'say',
'work']
Full code without Porter Stemmer:
lower = []
for item in df_text['job_description']:
    lower.append(item.lower())  # lowercase description
tokens = []
token_string = [str(i) for i in lower]
string = "".join(token_string)
string = string.replace("-", "")
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"\W+", gaps=True)
tokens = tokenizer.tokenize(string)
tokens
import nltk  # nltk.FreqDist lives here
import pandas as pd  # for the frequency table below
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
tokens = [token for token in tokens if token not in stopwords_list]
Wrd_Freq = nltk.FreqDist(tokens)
df_WrdFreq = pd.DataFrame.from_dict(Wrd_Freq, orient='Index')
df_WrdFreq.columns=['Word Frequency']
freq6000= df_WrdFreq[(df_WrdFreq['Word Frequency'] >= 6000)]
freq6000.sort_values(by=['Word Frequency'],ascending=False).head(10)
I need to use the Porter stemmer separately to check whether there is any change to the count list, i.e. repeat the same steps with the Porter stemmer included and compare the output.
Use a list comprehension (iterating a FreqDist yields its keys, i.e. the distinct words):
L= [ps.stem(w) for w in Wrd_Freq]
EDIT:
If need top values by counts:
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
tokens = [token for token in tokens if token not in stopwords_list]
Wrd_Freq = nltk.FreqDist(tokens)
from collections import Counter
c = Counter(tokens)
top = [x for x, y in c.most_common(10)]
print (top)
['data', 'experience', 'business', 'work', 'science',
'learning', 'analytics', 'team', 'analysis', 'machine']
df_WrdFreq = pd.DataFrame.from_dict(Wrd_Freq, orient='Index')
df_WrdFreq.columns=['Word Frequency']
freq6000= df_WrdFreq[(df_WrdFreq['Word Frequency'] >= 6000)]
df = freq6000.sort_values(by=['Word Frequency'],ascending=False).head(10)
print (df)
Word Frequency
data 124289
experience 59135
business 33528
work 28146
science 26864
learning 26850
analytics 21828
team 20825
analysis 20607
machine 20484
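To actually compare counts with and without stemming, stem the tokens before building the frequency counts rather than stemming the FreqDist keys afterwards. A minimal sketch (assumes the tokens list from the code above):
from collections import Counter
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed = [ps.stem(t) for t in tokens]  # stem every token, then count
print(Counter(stemmed).most_common(10))
print(Counter(tokens).most_common(10))  # unstemmed counts for comparison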
I am trying to find specific words in a pandas column and assign them to a new column; a row may contain two or more of them. Once I find them, I want to replicate the row so there is one row per matched word.
import pandas as pd
import numpy as np
import re
wizard = pd.read_excel(r'C:\Python\L\Book1.xlsx',
                       sheet_name='Sheet1',
                       header=0)
test_set = {'941', '942',}
test_set2={'MN','OK','33/3305'}
wizard['ZTYPE'] = wizard['Comment'].apply(lambda x: any(i in test_set for i in x.split()))
wizard['ZJURIS']=wizard['Comment'].apply(lambda x: any(i in test_set2 for i in x.split()))
wizard_new = pd.DataFrame(np.repeat(wizard.values,3,axis=0))
wizard_new.columns = wizard.columns
wizard_new.head()
I am getting True and False, but I am unable to split the rows out.
Above is how the sample data looks. I need to find anything like '33/3305'; the year could be entered as '19' or '2019', the quarter as 'Q1', '1Q', 'Q 1' or '1 Q', plus the values in my test sets.
ZJURIS = dict(list(itertools.chain(*[[(y_, x) for y_ in y] for x, y in wizard.comment()])))
def to_category(x):
    for w in x.lower().split(" "):
        if w in ZJURIS:
            return ZJURIS[w]
    return None
Finally, apply the method on the column and save the result to a new one:
wizard["ZJURIS"] = wizard["comment"].apply(to_category)
I tried the above solution as well, but it did not work.
Any suggestions on how to get the code to work?
Sample data:
data = {'ID': ['351362278576', '351539320880', '351582465214', '351609744560', '351708198604'],
        'BU': ['SBS', 'MAS', 'NAS', 'ET', 'SBS'],
        'Comment': ['940/941/w2-W3NYSIT/SUI33/3305/2019/1q',
                    'OK SUI 2Q19',
                    '941 - 3Q2019NJ SIT - 3Q2019NJ SUI/SDI - 3Q2019',
                    'IL,SUI,2016Q4,2017Q1,2017Q2',
                    '1Q2019 PA 39/5659 39/2476']}  # all columns must be the same length for pd.DataFrame
df = pd.DataFrame(data)
The expected output, based on the sample data set above, is attached.
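No answer is shown here, but one way to approach it: extract all matching tokens per comment with a regex, then use DataFrame.explode to replicate each row once per match. A sketch under the stated assumptions (the test-set values and the quarter/year spellings come from the question; the 'matches' column name is illustrative):
import re
import pandas as pd

df = pd.DataFrame(data)  # sample data from above

# one alternation covering form numbers, jurisdictions, and quarter/year spellings
pattern = re.compile(
    r"(941|942"              # form types from test_set
    r"|MN|OK|33/3305"        # jurisdictions from test_set2
    r"|[1-4]\s?Q|Q\s?[1-4]"  # 1Q, Q1, '1 Q', 'Q 1'
    r"|20\d{2}|\b\d{2}\b)"   # 2019-style or 19-style years
)

df['matches'] = df['Comment'].str.findall(pattern)
exploded = df.explode('matches')  # one row per matched token
print(exploded)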