Using AI service to recognize a free text search field question? - nlp

Is there an API service, paid or not paid (IBM Watson, Google Natural Language), that can accept a free text "ask a question" field and convert it into a set of keywords to be used for a regular keyword search?
For example if my website has a search field "Ask a question about our products", and a user types in "Do you have red dresses?", is there an API we can integrate into our code that can just convert this to "red dress" which we then simply feed into our regular keyword search for "red dress"?
Ideally it can handle variations of questions such as:
"How do you return a product?" -- return product
"Do you accept Mastercard?" -- mastercard
"Where can I find blue shoes?" -- blue shoes

You can extract noun chunks and then use those as keywords.
For example, using spaCy, you can extract noun chunks as follows:
import spacy

nlp = spacy.load('en_core_web_md')

def getNounChunks(doc):
    # Tags allowed inside a chunk, and the subset that counts as a noun
    inc = ['NN', 'NNP', 'NNPS', 'NNS', 'JJ', 'HYPH']
    incn = ['NN', 'NNP', 'NNPS', 'NNS']
    excl = ['other', 'some', 'many', 'certain', 'various']
    lspans = []
    chunk = []
    for t in doc:
        if t.text.lower() in excl:
            continue
        if chunk:
            if chunk[-1].tag_ == 'HYPH':
                chunk.append(t)
                continue
        if t.tag_ in inc:
            if t.tag_ != 'JJ':
                chunk.append(t)
            else:
                # Only accept an adjective before the first noun of the chunk
                if not any([t.tag_ in incn for t in chunk]):
                    chunk.append(t)
        else:
            if chunk:
                # Keep the chunk only if it contains at least one noun
                if any([t.tag_ in incn for t in chunk]):
                    lspans.append(doc[chunk[0].i:chunk[-1].i + 1])
                chunk = list()
    return(lspans)

questions = [
    "How do you return a product?",
    "Do you accept Mastercard?",
    "Where can I find blue shoes?",
    "Do you have red dresses?",
]
for q in questions:
    doc = nlp(q)
    print(getNounChunks(doc))
#output:
#[product]
#[Mastercard]
#[blue shoes]
#[red dresses]
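If loading a language model is too heavy for a first attempt, a plain stopword filter can serve as a baseline. The sketch below is not the spaCy approach above but a swapped-in, dependency-free technique; the `STOPWORDS` set and the `to_keywords` name are hypothetical and hand-tuned to the example questions.

```python
import re

# Hypothetical, hand-tuned stopword list (an assumption, not a standard list).
# Verbs such as "find" and "accept" are included only because they carry no
# search value for these example questions.
STOPWORDS = {
    "a", "an", "the", "do", "does", "you", "your", "i", "we", "our",
    "can", "could", "how", "where", "what", "is", "are", "have", "has",
    "find", "accept", "any", "of", "to", "for", "in",
}

def to_keywords(question):
    """Lowercase the question, split it into words, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

for q in ("How do you return a product?",
          "Do you accept Mastercard?",
          "Where can I find blue shoes?",
          "Do you have red dresses?"):
    print(to_keywords(q))
# return product
# mastercard
# blue shoes
# red dresses
```

This keeps "return product" intact, which the noun-chunk version reduces to "product", but it does not turn "dresses" into "dress"; a lemmatizer (for example spaCy's `token.lemma_`) would be needed for that step.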

Related

How does one extract the verb phrase in spaCy?

For example:
Ultimate Swirly Ice Cream Scoopers are usually overrated when one considers all of the scoopers one could buy.
Here I'd like to pluck:
Subject: "Ultimate Swirly Ice Cream Scoopers"
Adverbial Clause: "When one considers all of the scoopers one could buy"
Verb Phrase: "are usually overrated"
I have the following functions for subject, object, and adverbial clause:
import spacy

def get_subj(decomp):
    # Return the full subtree of the first subject token
    for token in decomp:
        if ("subj" in token.dep_):
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            return str(decomp[start:end])

def get_obj(decomp):
    # Return the full subtree of the first direct or prepositional object
    for token in decomp:
        if ("dobj" in token.dep_ or "pobj" in token.dep_):
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            return str(decomp[start:end])

def get_advcl(decomp):
    # Return the full subtree of the first adverbial clause
    for token in decomp:
        # print(f"pos: {token.pos_}; lemma: {token.lemma_}; dep: {token.dep_}")
        if ("advcl" in token.dep_):
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            return str(decomp[start:end])

phrase = "Ultimate Swirly Ice Cream Scoopers are usually overrated when one considers all of the scoopers one could buy."
nlp = spacy.load("en_core_web_sm")
decomp = nlp(phrase)
subj = get_subj(decomp)
obj = get_obj(decomp)
advcl = get_advcl(decomp)
print("subj: ", subj)
print("obj: ", obj)
print("advcl: ", advcl)
Output:
subj: Ultimate Swirly Ice Cream Scoopers
obj: all of the scoopers
advcl: when one considers all of the scoopers one could buy
However, the actual dependency type .dep_ for the final word of the VP, "are usually overrated", is "ROOT".
So, the subtree technique fails, as the subtree of ROOT returns the entire sentence.
What you want is something more like a "verb group": keep the root verb together with only certain close dependents such as aux, cop, and advmod, and leave out ones like nsubj, obj, or advcl.
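A minimal sketch of that idea follows. To stay runnable without a language model, it uses a hypothetical `Tok` class that mimics the spaCy Token attributes `i`, `text`, `dep_`, and `children`; the dependency labels in the example are hand-assigned for illustration and are not actual parser output. The `KEEP_DEPS` set and the `verb_group` name are assumptions too.

```python
from dataclasses import dataclass, field

# Dependents that stay with the root verb (an assumed, adjustable set).
KEEP_DEPS = {"aux", "auxpass", "cop", "advmod", "neg"}

@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy Token (same attribute names)."""
    i: int
    text: str
    dep_: str
    children: list = field(default_factory=list)

def verb_group(root):
    """Join the root with its direct children whose dependency is in KEEP_DEPS."""
    kept = [root] + [c for c in root.children if c.dep_ in KEEP_DEPS]
    kept.sort(key=lambda t: t.i)  # restore sentence order
    return " ".join(t.text for t in kept)

# Hand-built fragment of the example sentence (labels assumed, not parsed).
scoopers = Tok(4, "Scoopers", "nsubjpass")
are = Tok(5, "are", "auxpass")
usually = Tok(6, "usually", "advmod")
considers = Tok(10, "considers", "advcl")
root = Tok(7, "overrated", "ROOT", [scoopers, are, usually, considers])

print(verb_group(root))  # are usually overrated
```

With a real parse, the same function should work on the token found by `next(t for t in decomp if t.dep_ == "ROOT")`, since only direct children are inspected and the subject and adverbial clause are filtered out by their labels.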

fetch specific protobuf members

I want to get an array of all the lines which start by text: (till the first asset_performance_label)
I saw this post, but wasn't sure how to apply it.
Should I convert the proto to string, as I have tried?
text = extract_text_from_proto(r"(\w+)text:(\w+)asset_performance_label:", '''[pinned_field: HEADLINE_1
text: "5 Best Products"
asset_performance_label: PENDING
policy_summary_info
{
review_status: REVIEWED
approval_status: APPROVED
}
, pinned_field: HEADLINE_1
text: "10 Best Products 2021"
asset_performance_label: PENDING
policy_summary_info
{
review_status: REVIEWED
approval_status: APPROVED
}''')
def extract_text_from_proto(regex, proto_string):
    regex = re.escape(regex)
    result_array = [m.group() for m in re.finditer(regex, proto_string)]
    return result_array
    # return [extract_text(each_item, regex) for each_item in proto],
def extract_text(regex, item):
    m = re.match(regex, str(item))
    if m is None:
        # text = "MISSING TEXT"
        raise Exception("Ad is missing text")
    else:
        text = m.group(2)
    return text
Expected result: ["5 Best Products","10 Best Products 2021"]
What if I want to match an (optional) pinned_field: (word)? The result could then be: ['HEADLINE_1: 5 Best Products', 'HEADLINE_1: 10 Best Products 2021', 'some_text_without_pinned_field']
You can use a single capture group, and match asset_performance_label on the next line. Use re.findall to return the group values.
\btext:\s*"([^"]+)"\n\s*asset_performance_label\b
The pattern matches
\btext:\s*" Match text: preceded by a word boundary \b to prevent a partial match, then optional whitespace and an opening double quote
([^"]+) Capture group 1, match 1+ chars other than a double quote
"\n\s* Match the closing double quote, a newline, and optional whitespace chars
asset_performance_label\b Match asset_performance_label followed by a word boundary
For example
import re
def extract_text_from_proto(regex, proto_string):
    return re.findall(regex, proto_string)
text = extract_text_from_proto(r'\btext:\s*"([^"]+)"\n\s*asset_performance_label\b', '''[pinned_field: HEADLINE_1
text: "5 Best Products"
asset_performance_label: PENDING
policy_summary_info
{
review_status: REVIEWED
approval_status: APPROVED
}
, pinned_field: HEADLINE_1
text: "10 Best Products 2021"
asset_performance_label: PENDING
policy_summary_info
{
review_status: REVIEWED
approval_status: APPROVED
}''')
print(text)
Output
['5 Best Products', '10 Best Products 2021']
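For the follow-up about an optional pinned_field, one possible extension is to put the pinned_field part in an optional non-capturing group in front of the same pattern. This is a sketch only: the `extract_labeled_text` name and the third sample entry (an item without pinned_field) are assumptions made up for illustration.

```python
import re

# Optional "pinned_field: WORD" line, then the text line, then the label line.
PATTERN = re.compile(
    r'(?:\bpinned_field:\s*(\w+)\s+)?'
    r'\btext:\s*"([^"]+)"\n\s*asset_performance_label\b'
)

def extract_labeled_text(proto_string):
    """Return 'FIELD: text' when pinned_field is present, else just the text."""
    results = []
    for m in PATTERN.finditer(proto_string):
        pinned, text = m.group(1), m.group(2)
        results.append(f"{pinned}: {text}" if pinned else text)
    return results

# Shortened sample; the last entry is a made-up item without pinned_field.
sample = '''[pinned_field: HEADLINE_1
text: "5 Best Products"
asset_performance_label: PENDING
, pinned_field: HEADLINE_1
text: "10 Best Products 2021"
asset_performance_label: PENDING
, text: "some_text_without_pinned_field"
asset_performance_label: PENDING
]'''

print(extract_labeled_text(sample))
# ['HEADLINE_1: 5 Best Products', 'HEADLINE_1: 10 Best Products 2021', 'some_text_without_pinned_field']
```

Group 1 is None whenever the optional part does not match, which is what the conditional in the loop relies on.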

Regex Error and Improvement Driving Licence Data Extraction

I am trying to extract the Name, Licence No., Date of Issue and Validity from an image I processed using Pytesseract. I find regex quite confusing, but I went through some documentation and code examples on the web.
I got till here:
import pytesseract
import cv2
import re
from PIL import Image
import numpy as np
import datetime
from dateutil.relativedelta import relativedelta

def driver_license(filename):
    """
    This function will handle the core OCR processing of images.
    """
    i = cv2.imread(filename)
    newdata = pytesseract.image_to_osd(i)
    angle = re.search(r'(?<=Rotate: )\d+', newdata).group(0)
    angle = int(angle)
    i = Image.open(filename)
    if angle != 0:
        #with Image.open("ro2.jpg") as i:
        rot_angle = 360 - angle
        i = i.rotate(rot_angle, expand="True")
        i.save(filename)
    i = cv2.imread(filename)
    # Convert to gray
    i = cv2.cvtColor(i, cv2.COLOR_BGR2GRAY)
    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    i = cv2.dilate(i, kernel, iterations=1)
    i = cv2.erode(i, kernel, iterations=1)
    txt = pytesseract.image_to_string(i)
    print(txt)
    text = []
    data = {
        'firstName': None,
        'lastName': None,
        'age': None,
        'documentNumber': None
    }
    c = 0
    print(txt)
    #Splitting lines
    lines = txt.split('\n')
    for lin in lines:
        c = c + 1
        s = lin.strip()
        s = s.replace('\n','')
        if s:
            s = s.rstrip()
            s = s.lstrip()
            text.append(s)
            try:
                if re.match(r".*Name|.*name|.*NAME", s):
                    name = re.sub('[^a-zA-Z]+', ' ', s)
                    name = name.replace('Name', '')
                    name = name.replace('name', '')
                    name = name.replace('NAME', '')
                    name = name.replace(':', '')
                    name = name.rstrip()
                    name = name.lstrip()
                    nmlt = name.split(" ")
                    data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                    data['lastName'] = nmlt[-1]
                if re.search(r"[a-zA-Z][a-zA-Z]-\d{13}", s):
                    data['documentNumber'] = re.search(r'[a-zA-Z][a-zA-Z]-\d{13}', s)
                    data['documentNumber'] = data['documentNumber'].group().replace('-', '')
                    if not data['firstName']:
                        name = lines[c]
                        name = re.sub('[^a-zA-Z]+', ' ', name)
                        name = name.rstrip()
                        name = name.lstrip()
                        nmlt = name.split(" ")
                        data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                        data['lastName'] = nmlt[-1]
                if re.search(r"[a-zA-Z][a-zA-Z]\d{2} \d{11}", s):
                    data['documentNumber'] = re.search(r'[a-zA-Z][a-zA-Z]\d{2} \d{11}', s)
                    data['documentNumber'] = data['documentNumber'].group().replace(' ', '')
                    if not data['firstName']:
                        name = lines[c]
                        name = re.sub('[^a-zA-Z]+', ' ', name)
                        name = name.rstrip()
                        name = name.lstrip()
                        nmlt = name.split(" ")
                        data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                        data['lastName'] = nmlt[-1]
                if re.match(r".*DOB|.*dob|.*Dob", s):
                    yob = re.sub('[^0-9]+', ' ', s)
                    yob = re.search(r'\d\d\d\d', yob)
                    data['age'] = datetime.datetime.now().year - int(yob.group())
            except:
                pass
    print(data)
I need to extract the Validity and Issue Date as well, but I am not getting anywhere near it. Also, I have seen that regex can shorten code a lot, so is there a better way to do this?
My input data is a string somewhat like this:
Transport Department Government of NCT of Delhi
Licence to Drive Vehicles Throughout India
Licence No. : DL-0820100052000 (P) R
N : PARMINDER PAL SINGH GILL
: SHRI DARSHAN SINGH GILL
DOB: 10/05/1966 BG: U
Address :
104 SHARDA APPTT WEST ENCLAVE
PITAMPURA DELHI 110034
Auth to Drive Date of Issue
M.CYL. 24/02/2010
LMV-NT 24/02/2010
(Holder's Sig natu re)
Issue Date : 20/05/2016
Validity(NT) : 19/05/2021 : c
Validity(T) : NA Issuing Authority
InvCarrNo : NA NWZ-I, WAZIRPUR
Or like this:
in
Transport Department Government of NCT of Delhi
Licence to Drive Vehicles Throughout India
2
Licence No. : DL-0320170595326 () WN
Name : AZAZ AHAMADSIDDIQUIE
s/w/D : SALAHUDDIN ALI
____... DOB: 26/12/1992 BG: O+
\ \ Address:
—.~J ~—; ROO NO-25 AMK BOYS HOSTEL, J.
— NAGAR, DELHI 110025
Auth to Drive Date of Issue
M.CYL. 12/12/2017
4 wt 4
Iseue Date: 12/12/2017 a
falidity(NT) < 2037
Validity(T) +: NA /
Inv CarrNo : NA te sntian sana
Note: In the second example you wouldn't get the validity; I will improve the OCR later. Any proper guide that could help me with simpler regex would be good.
You can use this pattern: (?<=KEY\s*:\s*)\b[^\n]+ and replace KEY with the field name you want, such as Issue Date, Licence No., and so on.
This pattern uses a variable-width lookbehind, which the built-in re module does not support, so you need the third-party regex library.
Code:
import regex
text1 = """
Transport Department Government of NCT of Delhi
Licence to Drive Vehicles Throughout India
Licence No. : DL-0820100052000 (P) R
N : PARMINDER PAL SINGH GILL
: SHRI DARSHAN SINGH GILL
DOB: 10/05/1966 BG: U
Address :
104 SHARDA APPTT WEST ENCLAVE
PITAMPURA DELHI 110034
Auth to Drive Date of Issue
M.CYL. 24/02/2010
LMV-NT 24/02/2010
(Holder's Sig natu re)
Issue Date : 20/05/2016
Validity(NT) : 19/05/2021 : c
Validity(T) : NA Issuing Authority
InvCarrNo : NA NWZ-I, WAZIRPUR
"""
for key in ('Issue Date', r'Licence No\.', 'N', r'Validity\(NT\)'):
    print(regex.findall(fr"(?<={key}\s*:\s*)\b[^\n]+", text1, regex.IGNORECASE))
Output:
['20/05/2016']
['DL-0820100052000 (P) R']
['PARMINDER PAL SINGH GILL']
['19/05/2021 : c']
You can also use re with a single regex based on alternation that will capture your keys and values:
import re
text = "Transport Department Government of NCT of Delhi\nLicence to Drive Vehicles Throughout India\n\nLicence No. : DL-0820100052000 (P) R\nN : PARMINDER PAL SINGH GILL\n\n: SHRI DARSHAN SINGH GILL\n\nDOB: 10/05/1966 BG: U\nAddress :\n\n104 SHARDA APPTT WEST ENCLAVE\nPITAMPURA DELHI 110034\n\n\n\nAuth to Drive Date of Issue\nM.CYL. 24/02/2010\nLMV-NT 24/02/2010\n\n(Holder's Sig natu re)\n\nIssue Date : 20/05/2016\nValidity(NT) : 19/05/2021 : c\nValidity(T) : NA Issuing Authority\nInvCarrNo : NA NWZ-I, WAZIRPUR"
search_phrases = ['Issue Date', 'Licence No.', 'N', 'Validity(NT)']
reg = r"\b({})\s*:\W*(.+)".format( "|".join(sorted(map(re.escape, search_phrases), key=len, reverse=True)) )
print(re.findall(reg, text, re.IGNORECASE))
Output:
[('Licence No.', 'DL-0820100052000 (P) R'), ('N', 'PARMINDER PAL SINGH GILL'), ('Issue Date', '20/05/2016'), ('Validity(NT)', '19/05/2021 : c')]
The regex is
\b(Validity\(NT\)|Licence\ No\.|Issue\ Date|N)\s*:\W*(.+)
Details:
map(re.escape, search_phrases) - escapes all special chars in your search phrases to be used as literal texts in a regex (else, . will match any chars, ? won't match a ? char, etc.)
sorted(..., key=len, reverse=True) - sorts the search phrases by length in descending order (to get longer matches first)
"|".join(...) - creates an alternation pattern, a|b|c|...
r"\b({})\s*:\W*(.+)".format( ... ) - creates the final regex.
Regex details
\b - a word boundary (NOTE: replace with (?m)^ if your matches occur at the beginning of a line)
(Validity\(NT\)|Licence\ No\.|Issue\ Date|N) - Group 1: one of the search phrases
\s* - zero or more whitespaces
: - a colon
\W* - zero or more non-word chars
(.+) - (capturing) Group 2: one or more chars other than line break chars, as many as possible.
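Since the goal is the Issue Date and Validity, the key/value pairs can be collected into a dict and any dd/mm/yyyy value converted to a date. The following is a sketch under assumptions: the `parse_licence` name, the `FIELDS` list, and the shortened sample text are chosen for illustration, and the date format is assumed to be day/month/year as in the OCR output shown in the question.

```python
import re
from datetime import datetime

# Field labels to look for (an assumed subset of the licence fields).
FIELDS = ["Issue Date", "Licence No.", "Validity(NT)", "DOB"]

# Same alternation technique as above: escape, sort longest first, join.
PATTERN = re.compile(
    r"\b({})\s*:\W*(.+)".format(
        "|".join(sorted(map(re.escape, FIELDS), key=len, reverse=True))
    ),
    re.IGNORECASE,
)
DATE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")

def parse_licence(text):
    """Map each found label to a date (if the value holds one) or the raw string."""
    record = {}
    for key, value in PATTERN.findall(text):
        m = DATE.search(value)
        if m:
            record[key] = datetime.strptime(m.group(1), "%d/%m/%Y").date()
        else:
            record[key] = value.strip()
    return record

sample = (
    "Licence No. : DL-0820100052000 (P) R\n"
    "DOB: 10/05/1966 BG: U\n"
    "Issue Date : 20/05/2016\n"
    "Validity(NT) : 19/05/2021 : c\n"
)
record = parse_licence(sample)
print(record["Issue Date"], record["Validity(NT)"])  # 2016-05-20 2021-05-19
```

Searching for the date inside the captured value also discards trailing OCR noise such as the ": c" after the validity date.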

python3 RNG variables

I've been coding for 2 days and ran into a snag.
I wanted to make a fun little practice tool to help some people practice their English.
Basically they type in the words they want to practice and then the randomizer throws out sentences for them to read.
The issue is I want to get the grammar right. I have made two different "pronoun" categories and two different "verb" categories.
How do I link them together so the sentences follow the grammar rules, but still keep the random element of never knowing what combo you will get?
The code is below, any help would be awesome =D
# coding: utf-8
# In[73]:
noun1 = ("apple", "peach", "plum", "cat", "dog", "mouse")
verb1 = ("drink", "eat", "swim", "kill", "kick", "hit", "die")
# verb + S
verbS = ("drinks", "eats", "swims", "kills", "kicks", "hits", "dies")
adj1 = ("blue", "black", "red","big", "small", "tall", "short")
direction1 = ("up", "in", "out", "behind", "infront of", "over")
#pronouns Capital
Pronoun1 = ("I", "You", "They", "We")
PronounS = ("He", "She", "It")
#pronouns non Capital
pronoun1 = ("I", "you", "they", "we")
pronounS = ("he", "she","it")
# In[74]:
import random
# In[75]:
def sentence1():
    print(random.choice(Pronoun1),end=" ")
    print(random.choice(verb1),end=" ")
    print("the",end=" ")
    print(random.choice(adj1),end=" ")
    print(random.choice(noun1),end=" ")
    return"."
# In[82]:
print(sentence1())
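One way to link the categories, sketched here as a suggestion rather than a definitive answer, is to store each pronoun group together with the verb group it agrees with, pick one pair at random, and then pick one word from each half of the pair. The `AGREEMENT` name is an assumption, and the function builds and returns the whole sentence instead of printing it piece by piece.

```python
import random

noun1 = ("apple", "peach", "plum", "cat", "dog", "mouse")
verb1 = ("drink", "eat", "swim", "kill", "kick", "hit", "die")
verbS = ("drinks", "eats", "swims", "kills", "kicks", "hits", "dies")
adj1 = ("blue", "black", "red", "big", "small", "tall", "short")
Pronoun1 = ("I", "You", "They", "We")
PronounS = ("He", "She", "It")

# Each pronoun group is paired with the verb forms that agree with it.
AGREEMENT = [(Pronoun1, verb1), (PronounS, verbS)]

def sentence1():
    """Build one random sentence whose pronoun and verb agree."""
    pronouns, verbs = random.choice(AGREEMENT)  # pick a matching pair first
    return (f"{random.choice(pronouns)} {random.choice(verbs)} "
            f"the {random.choice(adj1)} {random.choice(noun1)}.")

print(sentence1())
```

Because the pair is chosen first, "He", "She" and "It" come up a little more often than the other pronouns; weighting the choice by group size would even that out if it matters.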

Using specific elements from a list in different loops for a multiple choice test python 3.x

Basically, I'm trying to create a multiple choice test that uses information stored inside lists to change the questions and answers by position.
So far I have this:
import random

DATASETS = [["You first enter the car", "You start the car","You reverse","You turn",
             "Coming to a yellow light","You get cut off","You run over a person","You have to stop short",
             "in a high speed chase","in a stolen car","A light is broken","The car next to you breaks down",
             "You get a text message","You get a call","Your out of gas","Late for work","Driving angry",
             "Someone flips you the bird","Your speedometer stops working","Drinking"],
            ["Put on seat belt","Check your mirrors","Look over your shoulder","Use your turn signal",
             "Slow to a safe stop","Relax and dont get upset","Call 911", "Thank your brakes for working",
             "Pull over and give up","Ask to get out","Get it fixed","Offer help","Ignore it","Ignore it",
             "Get gas... duh","Drive the speed limit","Don't do it","Smile and wave","Get it fixed","Don't do it"],
            [''] * 20,
            ['B','D','A','A','C','A','B','A','C','D','B','C','D','A','D','C','C','B','D','A'],
            [''] * 20]

def main():
    questions(0)
    answers(1)

def questions(pos):
    for words in range(len(DATASETS[0])):
        DATASETS[2][words] = input("\n" + str(words + 1) + ".)What is the proper procedure when %s" %DATASETS[0][words] +
                                   '\nA.)'+random.choice(DATASETS[1]) + '\nB.)%s' %DATASETS[1][words] + '\nC.)'
                                   +random.choice(DATASETS[1]) + '\nD.)'+random.choice(DATASETS[1])+
                                   "\nChoose your answer carefully: ")

def answers(pos):
    for words in range(len(DATASETS[0])):
        # Compare by value (==); "is" tests object identity and is unreliable for strings
        DATASETS[4] = list(x == y for x, y in zip(DATASETS[2], DATASETS[3]))
    print(DATASETS)
I apologize if the code is crude to some... I'm in my first year of classes and this is my first bout of programming.
List 3 is my key for the right answers. I want my code in questions() to change the position of the correct answer so that it matches the key provided.
I've tried for loops, if statements and while loops but just can't get it to do what I envision. Any help is greatly appreciated.
tmp = "\n" + str(words + 1) + ".)What is the proper procedure when %s" %DATASETS[0][words] + '\nA.)'
if DATASETS[3][words] == 'A': #if the answer key is A
    tmp = tmp + DATASETS[1][words] #append the first choice as correct choice
else:
    tmp = tmp + random.choice(DATASETS[1]) #if not, randomise the choice
Do similar if-else for 'B', 'C', and 'D'
Once your question is formulated, then you can use it:
DATASETS[2][words] = input(tmp)
This is a bit long but I am not sure if any shorter way exists.
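A shorter variant may be possible by building the four options as a list and inserting the correct answer at the index given by the key letter. This is only a sketch: `build_choices` is a hypothetical helper, and the two short lists stand in for DATASETS[1] and DATASETS[3] from the question.

```python
import random

def build_choices(answers, index, key_letter):
    """Return four options with the correct answer in the slot named by key_letter."""
    correct = answers[index]
    # Three distinct wrong answers; set() removes duplicates such as "Ignore it".
    pool = sorted(set(answers) - {correct})
    choices = random.sample(pool, 3)
    # "ABCD".index("C") == 2, so the correct answer lands in slot C.
    choices.insert("ABCD".index(key_letter), correct)
    return choices

# Shortened stand-ins for DATASETS[1] (answers) and DATASETS[3] (key).
answers = ["Put on seat belt", "Check your mirrors", "Look over your shoulder",
           "Use your turn signal", "Slow to a safe stop"]
key = ["B", "D", "A", "A", "C"]

for number, letter in enumerate(key):
    options = build_choices(answers, number, letter)
    print(f"\n{number + 1}.)")
    for slot, text in zip("ABCD", options):
        print(f"{slot}.) {text}")
```

Sampling the wrong answers from a pool that excludes the correct one also prevents the right answer from showing up twice in the same question, which random.choice on the full list can do.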

Resources