How does one extract the verb phrase in Spacy? - nlp

For example:
Ultimate Swirly Ice Cream Scoopers are usually overrated when one considers all of the scoopers one could buy.
Here I'd like to pluck:
Subject: "Ultimate Swirly Ice Cream Scoopers"
Adverbial Clause: "When one considers all of the scoopers one could buy"
Verb Phrase: "are usually overrated"
I have the following functions for subject, object, and adverbial clause:
def get_subj(decomp):
for token in decomp:
if ("subj" in token.dep_):
subtree = list(token.subtree)
start = subtree[0].i
end = subtree[-1].i + 1
return str(decomp[start:end])
def get_obj(decomp):
for token in decomp:
if ("dobj" in token.dep_ or "pobr" in token.dep_):
subtree = list(token.subtree)
start = subtree[0].i
end = subtree[-1].i + 1
return str(decomp[start:end])
def get_advcl(decomp):
for token in decomp:
# print(f"pos: {token.pos_}; lemma: {token.lemma_}; dep: {token.dep_}")
if ("advcl" in token.dep_):
subtree = list(token.subtree)
start = subtree[0].i
end = subtree[-1].i + 1
return str(decomp[start:end])
phrase = "Ultimate Swirly Ice Cream Scoopers are usually overrated when one considers all of the scoopers one could buy."
nlp = spacy.load("en_core_web_sm")
decomp = nlp(phrase)
subj = get_subj(decomp)
obj = get_obj(decomp)
advcl = get_advcl(decomp)
print("subj: ", subj)
print("obj: ", obj)
print("advcl: ", advcl)
Output:
subj: Ultimate Swirly Ice Cream Scoopers
obj: all of the scoopers
advcl: when one considers all of the scoopers one could buy
However, the actual depenency type .dep_ for the final word of the VP, "are usually overrated", is "ROOT".
So, the subtree technique fails, as the subtree of ROOT returns the entire sentence.

You are wanting to construct something more like a “verb group” where you keep with the root verb only certain close dependents like aux, cop, and advmod but not ones like nsubj, obj, or advcl.

Related

fetch specific protobuf members

I want to get an array of all the lines which start by text: (till the first asset_performance_label)
I saw this post, but wasn't sure how to apply it.
Should I convert the proto to string, as I have tried?
text = extract_text_from_proto(r"(\w+)text:(\w+)asset_performance_label:", '''[pinned_field: HEADLINE_1
text: "5 Best Products"
asset_performance_label: PENDING
policy_summary_info
{
review_status: REVIEWED
approval_status: APPROVED
}
, pinned_field: HEADLINE_1
text: "10 Best Products 2021"
asset_performance_label: PENDING
policy_summary_info
{
review_status: REVIEWED
approval_status: APPROVED
}''')
def extract_text_from_proto(regex, proto_string):
regex = re.escape(regex)
result_array = [m.group() for m in re.finditer(regex, proto_string)]
return result_array
# return [extract_text(each_item, regex) for each_item in proto],
def extract_text(regex, item):
m = re.match(regex, str(item))
if m is None:
# text = "MISSING TEXT"
raise Exception("Ad is missing text")
else:
text = m.group(2)
return text
Expected result: ["5 Best Products","10 Best Products 2021"]
What if I want to match (optional) pinned_field: (word)? so the result could be: [HEADLINE_1: 5 Best Products', 'HEADLINE_1:10 Best Products 2021', 'some_text_without_pinned_field']` ?
You can use a single capture group, and match assert_performance_label in the next line. Use re.findall to return the group values.
\btext:\s*"([^"]+)"\n\s*asset_performance_label\b
The pattern matches
\btext:\s*" Match text: predeced by a word boundary \b to prevent a partial match
([^"]+) Capture group 1, match 1+ chars other than a double quote
"\n\s* Match a newline an optional whitespace chars
asset_performance_label\b Match `asset_performance_label followed by a word boundary
For example
import re
def extract_text_from_proto(regex, proto_string):
return re.findall(regex, proto_string)
text = extract_text_from_proto(r'\btext:\s*"([^"]+)"\n\s*asset_performance_label\b', '''[pinned_field: HEADLINE_1
text: "5 Best Products"
asset_performance_label: PENDING
policy_summary_info
{
review_status: REVIEWED
approval_status: APPROVED
}
, pinned_field: HEADLINE_1
text: "10 Best Products 2021"
asset_performance_label: PENDING
policy_summary_info
{
review_status: REVIEWED
approval_status: APPROVED
}''')
print(text)
Output
['5 Best Products', '10 Best Products 2021']

how can I get no more than three n-grams

I am finding the n-grams using the following function.
from nltk.util import ngrams
booksAfterRemovingStopWords = ['Zombies and Calculus by Colin Adams', 'Zone to Win: Organizing to Compete in an Age of Disruption', 'Zig Zag: The Surprising Path to Greater Creativity']
booksWithNGrams = list()
for line_no, line in enumerate(booksAfterRemovingStopWords):
tokens = line.split(" ")
output = list(ngrams(tokens, 3))
temp = list()
for x in output: # Adding n-grams
temp.append(' '.join(x))
booksWithNGrams.append(temp)
print(booksWithNGrams)
The output looks like:
[['Zombies and Calculus', 'and Calculus by', 'Calculus by Colin', 'by Colin Adams'], ['Zone to Win:', 'to Win: Organizing', 'Win: Organizing to', 'Organizing to Compete', 'to Compete in', 'Compete in an', 'in an Age', 'an Age of', 'Age of Disruption'], ['Zig Zag: The', 'Zag: The Surprising', 'The Surprising Path', 'Surprising Path to', 'Path to Greater', 'to Greater Creativity']]
However, I want no more that three n-grams. I mean I want the output to be like:
[['Zombies and Calculus', 'and Calculus by', 'Calculus by Colin'], ['Zone to Win:', 'to Win: Organizing', 'Win: Organizing to'], ['Zig Zag: The', 'Zag: The Surprising', 'The Surprising Path']]
How can I achieve that?
This is how you'd do it:
Logic:
Just count till three in the loop and break on i>2 (counts i=0,1,2 and breaks).
booksAfterRemovingStopWords = ['Zombies and Calculus by Colin Adams', 'Zone to Win: Organizing to Compete in an Age of Disruption', 'Zig Zag: The Surprising Path to Greater Creativity']
booksWithNGrams = list()
for line_no, line in enumerate(booksAfterRemovingStopWords):
tokens = line.split(" ")
output = list(ngrams(tokens, 3))
temp = list()
for i,x in enumerate(output):# Adding n-grams
if i>2:
break
temp.append(' '.join(x))
booksWithNGrams.append(temp)

Using AI service to recognize a free text search field question?

Is there an API service, paid or not paid (IBM Watson, Google Natural Language), that can accept a free text "ask a question" field and convert it into a set of keywords to be used for a regular keyword search?
For example if my website has a search field "Ask a question about our products", and a user types in "Do you have red dresses?", is there an API we can integrate into our code that can just convert this to "red dress" which we then simply feed into our regular keyword search for "red dress"?
Ideally it can handle variations of questions such as:
"How do you return a product?" -- return product
"Do you accept Mastercard?" -- mastercard
"Where can I find blue shoes?" -- blue shoes
You can extract noun chunks and then use those as keywords.
For example using Spacy, you can extract noun chunks as follows:
import spacy
nlp = spacy.load('en_core_web_md')
def getNounChunks(doc):
inc = ['NN', 'NNP', 'NNPS', 'NNS', 'JJ', 'HYPH']
incn = ['NN', 'NNP', 'NNPS' ,'NNS']
excl = ['other', 'some', 'many', 'certain', 'various']
lspans = []
chunk =[]
for t in doc:
if t.text.lower() in excl:
continue
if chunk:
if chunk[-1].tag_ == 'HYPH':
chunk.append(t)
continue
if t.tag_ in inc:
if t.tag_ != 'JJ':
chunk.append(t)
else:
if not any([t.tag_ in incn for t in chunk]):
chunk.append(t)
else:
if chunk:
if any([t.tag_ in incn for t in chunk]):
lspans.append(doc[chunk[0].i:chunk[-1].i + 1])
chunk = list()
return(lspans)
questions = [
"How do you return a product?" ,
"Do you accept Mastercard?" ,
"Where can I find blue shoes?",
"Do you have red dresses?",]
for q in questions:
doc = nlp(q)
print(getNounChunks(doc))
#output:
#[product]
#[Mastercard]
#[blue shoes]
#[red dresses]

Using specific elements from a list in different loops for a multiple choice test python 3.x

Basically i'm trying to create a multiple choice test that uses information stored inside of lists to change the questions/ answers by location.
so far I have this
import random
DATASETS = [["You first enter the car", "You start the car","You reverse","You turn",
"Coming to a yellow light","You get cut off","You run over a person","You have to stop short",
"in a high speed chase","in a stolen car","A light is broken","The car next to you breaks down",
"You get a text message","You get a call","Your out of gas","Late for work","Driving angry",
"Someone flips you the bird","Your speedometer stops working","Drinking"],
["Put on seat belt","Check your mirrors","Look over your shoulder","Use your turn signal",
"Slow to a safe stop","Relax and dont get upset","Call 911", "Thank your brakes for working",
"Pull over and give up","Ask to get out","Get it fixed","Offer help","Ignore it","Ignore it",
"Get gas... duh","Drive the speed limit","Don't do it","Smile and wave","Get it fixed","Don't do it"],
[''] * 20,
['B','D','A','A','C','A','B','A','C','D','B','C','D','A','D','C','C','B','D','A'],
[''] * 20]
def main():
questions(0)
answers(1)
def questions(pos):
for words in range(len(DATASETS[0])):
DATASETS[2][words] = input("\n" + str(words + 1) + ".)What is the proper procedure when %s" %DATASETS[0][words] +
'\nA.)'+random.choice(DATASETS[1]) + '\nB.)%s' %DATASETS[1][words] + '\nC.)'
+random.choice(DATASETS[1]) + '\nD.)'+random.choice(DATASETS[1])+
"\nChoose your answer carefully: ")
def answers(pos):
for words in range(len(DATASETS[0])):
DATASETS[4] = list(x is y for x, y in zip(DATASETS[2], DATASETS[3]))
print(DATASETS)
I apologize if the code is crude to some... i'm in my first year of classes and this is my first bout of programming.
list 3 is my key for the right answer's, I want my code in questions() to change the position of the correct answer so that it correlates to the key provided....
I've tried for loops, if statements and while loops but just cant get it to do what I envision. Any help is greatly appreciated
tmp = "\n" + str(words + 1) + ".)What is the proper procedure when %s" %DATASETS[0][words] + '\nA.)'
if DATASETS[3][words] == 'A': #if the answer key is A
tmp = tmp + DATASETS[1][words] #append the first choice as correct choice
else:
tmp = tmp + random.choice(DATASETS[1]) #if not, randomise the choice
Do similar if-else for 'B', 'C', and 'D'
Once your question is formulated, then you can use it:
DATASETS[2][words] = input(tmp)
This is a bit long but I am not sure if any shorter way exists.

Assign Class attributes from list elements

I'm not sure if the title accurately describes what I'm trying to do. I have a Python3.x script that I wrote that will issue flood warning to my facebook page when the river near my home has reached it's lowest flood stage. Right now the script works, however it only reports data from one measuring station. I would like to be able to process the data from all of the stations in my county (total of 5), so I was thinking that maybe a class method may do the trick but I'm not sure how to implement it. I've been teaching myself Python since January and feel pretty comfortable with the language for the most part, and while I have a good idea of how to build a class object I'm not sure how my flow chart should look. Here is the code now:
#!/usr/bin/env python3
'''
Facebook Flood Warning Alert System - this script will post a notification to
to Facebook whenever the Sabine River # Hawkins reaches flood stage (22.3')
'''
import requests
import facebook
from lxml import html
graph = facebook.GraphAPI(access_token='My_Access_Token')
river_url = 'http://water.weather.gov/ahps2/river.php?wfo=SHV&wfoid=18715&riverid=203413&pt%5B%5D=147710&allpoints=143204%2C147710%2C141425%2C144668%2C141750%2C141658%2C141942%2C143491%2C144810%2C143165%2C145368&data%5B%5D=obs'
ref_url = 'http://water.weather.gov/ahps2/river.php?wfo=SHV&wfoid=18715&riverid=203413&pt%5B%5D=147710&allpoints=143204%2C147710%2C141425%2C144668%2C141750%2C141658%2C141942%2C143491%2C144810%2C143165%2C145368&data%5B%5D=all'
def checkflood():
r = requests.get(river_url)
tree = html.fromstring(r.content)
stage = ''.join(tree.xpath('//div[#class="stage_stage_flow"]//text()'))
warn = ''.join(tree.xpath('//div[#class="current_warns_statmnts_ads"]/text()'))
stage_l = stage.split()
level = float(stage_l[2])
#check if we're at flood level
if level < 22.5:
pass
elif level == 37:
major_diff = level - 23.0
major_r = ('The Sabine River near Hawkins, Tx has reached [Major Flood Stage]: #', stage_l[2], 'Ft. ', str(round(major_diff, 2)), ' Ft. \n Please click the link for more information.\n\n Current Warnings and Alerts:\n ', warn)
major_p = ''.join(major_r)
graph.put_object(parent_object='me', connection_name='feed', message = major_p, link = ref_url)
<--snip-->
checkflood()
Each station has different 5 different catagories for flood stage: Action, Flood, Moderate, Major, each different depths per station. So for Sabine river in Hawkins it will be Action - 22', Flood - 24', Moderate - 28', Major - 32'. For the other statinos those depths are different. So I know that I'll have to start out with something like:
class River:
def __init__(self, id, stage):
self.id = id #station ID
self.stage = stage #river level'
#staticmethod
def check_flood(stage):
if stage < 22.5:
pass
elif stage.....
but from there I'm not sure what to do. Where should it be added in(to?) the code, should I write a class to handle the Facebook postings as well, is this even something that needs a class method to handle, is there any way to clean this up for efficiency? I'm not looking for anyone to write this up for me, but some tips and pointers would sure be helpful. Thanks everyone!
EDIT Here is what I figured out and is working:
class River:
name = ""
stage = ""
action = ""
flood = ""
mod = ""
major = ""
warn = ""
def checkflood(self):
if float(self.stage) < float(self.action):
pass
elif float(self.stage) >= float(self.major):
<--snip-->
mineola = River()
mineola.name = stations[0]
mineola.stage = stages[0]
mineola.action = "13.5"
mineola.flood = "14.0"
mineola.mod = "18.0"
mineola.major = "21.0"
mineola.alert = warn[0]
hawkins = River()
hawkins.name = stations[1]
hawkins.stage = stages[1]
hawkins.action = "22.5"
hawkins.flood = "23.0"
hawkins.mod = "32.0"
hawkins.major = "37.0"
hawkins.alert = warn[1]
<--snip-->
So from here I'm tring to stick all the individual river blocks into one block. What I have tried so far is this:
class River:
... name = ""
... stage = ""
... def testcheck(self):
... return self.name, self.stage
...
>>> for n in range(num_river):
... stations[n] = River()
... stations[n].name = stations[n]
... stations[n].stage = stages[n]
...
>>> for n in range(num_river):
... stations[n].testcheck()
...
<__main__.River object at 0x7fbea469bc50> 4.13
<__main__.River object at 0x7fbea46b4748> 20.76
<__main__.River object at 0x7fbea46b4320> 22.13
<__main__.River object at 0x7fbea46b4898> 16.08
So this doesn't give me the printed results that I was expecting. How can I return the string instead of the object? Will I be able to define the Class variables in this manner or will I have to list them out individually? Thanks again!
After reading many, many, many articles and tutorials on class objects I was able to come up with a solution for creating the objects using list elements.
class River():
def __init__(self, river, stage, flood, action):
self.river = river
self.stage = stage
self.action = action
self.flood = flood
self.action = action
def alerts(self):
if float(self.stage < self.flood):
#alert = "The %s is below Flood Stage (%sFt) # %s Ft. \n" % (self.river, self.flood, self.stage)
pass
elif float(self.stage > self.flood):
alert = "The %s has reached Flood Stage(%sFt) # %sFt. Warnings: %s \n" % (self.river, self.flood, self.stage, self.action)
return alert
'''this is the function that I was trying to create
to build the class objects automagically'''
def riverlist():
river_list = []
for n in range(len(rivers)):
station = River(river[n], stages[n], floods[n], warns[n])
river_list.append(station)
return river_list
if __name__ == '__main__':
for x in riverlist():
print(x.alerts())

Resources