How do I use SpaCy to capture number to noun relationship here? - nlp

I use the following function to identify which token is connected to a number:
def get_number_relationships(sentence):
    doc = nlp(sentence)
    print(doc)
    for token in doc:
        #print(f"--- TOKEN {token} ---")
        for t in token.subtree:
            if t.like_num:
                pass
                #print(f"Numeric subtree token {t}")
        for c in token.children:
            if c.like_num and c.dep_ == "nummod":
                print(f"RELATED as NUMBERS: {c.text.upper()} and {token.text.upper()} with dependency {c.dep_}")
But it doesn't capture the relationship between "40" and "birds" here:
'40 of those birds were blue.'
This is the dependency graph:
I tried looking at ancestors but that has the same issue. How can I capture this dependency?
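One possible approach, sketched under an assumption: in "40 of those birds", the parser may attach "of" to "40" as a `prep` child and "birds" to "of" as a `pobj`, so no `nummod` edge exists. The actual parse depends on the model, so check it with displaCy first. The `Tok` class and `attach` helper below are hypothetical stand-ins that only mimic the spaCy `Token` attributes used (`text`, `dep_`, `like_num`, `head`, `children`), so the traversal can run without loading a model; `number_noun_pairs` should accept a real `Doc` the same way.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Tok:
    """Hypothetical stand-in exposing the spaCy Token attributes used below."""
    text: str
    dep_: str
    like_num: bool = False
    head: Optional["Tok"] = field(default=None, repr=False)
    children: List["Tok"] = field(default_factory=list, repr=False)


def attach(child, head):
    """Link child to head, mirroring token.head / token.children."""
    child.head = head
    head.children.append(child)


def number_noun_pairs(tokens):
    """Return (number, noun) pairs for 'N noun' and 'N of ... noun' patterns."""
    pairs = []
    for tok in tokens:
        if not tok.like_num:
            continue
        if tok.dep_ == "nummod":
            # Direct modifier, e.g. "40 birds": the head is the noun.
            pairs.append((tok.text, tok.head.text))
            continue
        # Assumed partitive parse: 40 --prep--> of --pobj--> birds
        for prep in tok.children:
            if prep.dep_ == "prep" and prep.text.lower() == "of":
                for obj in prep.children:
                    if obj.dep_ == "pobj":
                        pairs.append((tok.text, obj.text))
    return pairs


# Hand-built parse of "40 of those birds were blue" (an assumption, not model output).
were = Tok("were", "ROOT")
forty = Tok("40", "nsubj", like_num=True)
of = Tok("of", "prep")
those = Tok("those", "det")
birds = Tok("birds", "pobj")
blue = Tok("blue", "acomp")
attach(forty, were)
attach(of, forty)
attach(birds, of)
attach(those, birds)
attach(blue, were)
sentence = [forty, of, those, birds, were, blue]

print(number_noun_pairs(sentence))  # [('40', 'birds')]
```

With a real pipeline you would pass `nlp('40 of those birds were blue.')` instead of the hand-built list.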

Related

Using AI service to recognize a free text search field question?

Is there an API service, paid or not paid (IBM Watson, Google Natural Language), that can accept a free text "ask a question" field and convert it into a set of keywords to be used for a regular keyword search?
For example if my website has a search field "Ask a question about our products", and a user types in "Do you have red dresses?", is there an API we can integrate into our code that can just convert this to "red dress" which we then simply feed into our regular keyword search for "red dress"?
Ideally it can handle variations of questions such as:
"How do you return a product?" -- return product
"Do you accept Mastercard?" -- mastercard
"Where can I find blue shoes?" -- blue shoes
You can extract noun chunks and then use those as keywords.
For example, using spaCy, you can extract noun chunks as follows:
import spacy
nlp = spacy.load('en_core_web_md')

def getNounChunks(doc):
    # Tags allowed inside a chunk, tags that count as nouns, and words to skip.
    inc = ['NN', 'NNP', 'NNPS', 'NNS', 'JJ', 'HYPH']
    incn = ['NN', 'NNP', 'NNPS', 'NNS']
    excl = ['other', 'some', 'many', 'certain', 'various']
    lspans = []
    chunk = []
    for t in doc:
        if t.text.lower() in excl:
            continue
        if chunk:
            # Always keep the token that follows a hyphen.
            if chunk[-1].tag_ == 'HYPH':
                chunk.append(t)
                continue
        if t.tag_ in inc:
            if t.tag_ != 'JJ':
                chunk.append(t)
            else:
                # Adjectives are only allowed before the first noun.
                if not any([c.tag_ in incn for c in chunk]):
                    chunk.append(t)
        else:
            if chunk:
                # Emit the chunk only if it contains at least one noun.
                if any([c.tag_ in incn for c in chunk]):
                    lspans.append(doc[chunk[0].i:chunk[-1].i + 1])
                chunk = list()
    return lspans

questions = [
    "How do you return a product?",
    "Do you accept Mastercard?",
    "Where can I find blue shoes?",
    "Do you have red dresses?",
]
for q in questions:
    doc = nlp(q)
    print(getNounChunks(doc))
#output:
#[product]
#[Mastercard]
#[blue shoes]
#[red dresses]

When working with the Stripe API, is it better to sort each request or store locally and perform queries?

This is my first post, I've been lurking for a while.
Some context for my question:
I'm working with the Stripe API to pull transaction data and match it with booking numbers from another API source (property reservations --> funds received, for reconciliation).
I started by just making calls to the API and sorting the data in place using Python 3. However, it started to get very complicated, so I thought I should persist the data in a MongoDB instance on localhost. I began to do this, but storing the sorted data was still just as complicated and the request times were getting quite long. So I thought maybe I should pull all the Stripe data, store it locally, and then query whatever I need.
So here I am, with a bunch of code I've written for both approaches and still not a lot of progress. I'm a bit lost about the next move. I feel like I should probably pick a path and stick with it. I'm a little unsure what the "best practice" is when working with APIs. Usually I would turn to YouTube, but I haven't been able to find a video that covers this specific scenario. The amount of data being pulled from the API would be around 100 kB per request.
Here is the original code, which makes each query separately. Recently I've learnt I can use the expand method (I think this is what it's called) so I don't need to dig down so many levels in my for loop.
The goal was to get just the metadata, which contains the booking reference numbers that can then be matched against a response from my property management system's API. My code is a bit embarrassing; I've kinda just learnt it over the last little while in my downtime from work.
import csv
import datetime
import os
import pymongo
import stripe

"""
We need to find a Valid reservation_ref or reservation_id in the booking.com Metadata. Then we need to match this to a property ID from our list of properties in the book file.
"""
myclient = pymongo.MongoClient("mongodb://localhost:27017/")
mydb = myclient["mydatabase"]
stripe_payouts = mydb["stripe_payouts"]
stripe.api_key = "sk_live_thisismyprivatekey"
r = stripe.Payout.list(limit=4)
payouts = []
for data in r['data']:
    if data['status'] == 'paid':
        p_id = data['id']
        amount = data['amount']
        meta = []
        txn = stripe.BalanceTransaction.list(payout=p_id)
        amount_str = str(amount)
        amount_dollar = str(amount / 100)
        txn_len = len(txn['data'])
        for x in range(txn_len):
            if x != 0:
                charge = (txn['data'][x]['source'])
                if charge.startswith("ch_"):
                    meta_req = stripe.Charge.retrieve(charge)
                    meta = list(meta_req['metadata'])
                elif charge.startswith("re_"):
                    meta_req = stripe.Refund.retrieve(charge)
                    meta = list(meta_req['metadata'])
                if stripe_payouts.find({"_id": p_id}).count() == 0:
                    payouts.append(
                        {
                            "_id": str(p_id),
                            "payout": str(p_id),
                            "transactions": txn['data'],
                            "metadata": {
                                charge: [meta]
                            }
                        }
                    )
# TODO: Add error exception to check for po id already in the database.
if len(payouts) != 0:
    x = stripe_payouts.insert_many(payouts)
    print("Inserted into Database ", len(x.inserted_ids), x.inserted_ids)
else:
    print("No entries made")

    "_id": str(p_id),
    "payout": str(p_id),
    "transactions": txn['data'],
    "metadata": {
        charge: [meta]
This last section doesn't work properly. This is kinda where I stopped and started pulling all the data and storing it in MongoDB locally.
I appreciate if you've read this wall of text this far.
Thanks
EDIT:
I'm unsure what the best practice is for adding additional information, but I've adapted the code below per the answer given. I'm now getting a "Key error" when trying to insert the entries into the database. I feel like it's duplicating keys somehow.
payouts = []

def add_metadata(payout_id, transaction_type):
    transactions = stripe.BalanceTransaction.list(payout=payout_id, type=transaction_type, expand=['data.source'])
    for transaction in transactions.auto_paging_iter():
        meta = [transaction.source.metadata]
        if stripe_payouts.Collection.count_documents({"_id": payout_id}) == 0:
            payouts.append(
                {
                    transaction.id: transaction
                }
            )

for data in r['data']:
    p_id = data['id']
    add_metadata(p_id, 'charge')
    add_metadata(p_id, 'refund')
# TODO: Add error exception to check for po id already in the database.
if len(payouts) != 0:
    x = stripe_payouts.insert_many(payouts)
    #print(payouts)
    print("Inserted into Database ", len(x.inserted_ids), x.inserted_ids)
else:
    print("No entries made")
To answer your high-level question: if you're frequently accessing the same data and that data isn't changing much, then it can make sense to keep a local copy of the data in sync and run your frequent queries against that local copy.
No need to be embarrassed by your code :) we've all been new at something at some point.
Looking at your code I noticed a few things:
Rather than fetch all payouts, then use an if statement to skip all except paid, instead you can pass another filter to only query those paid payouts.
r = stripe.Payout.list(limit=4, status='paid')
You mentioned the expand [B] feature of the API, but didn't use it, so I wanted to share how you can do that here with an example. In this case, you're making 1 API call to get the list of payouts, then 1 API call per payout to get the transactions, then 1 API call per charge or refund to get its metadata. For n payouts with m charges or refunds each, that is roughly 1 + n + (n * m) calls, which adds up quickly. To cut this down, let's pass expand=['data.source'] when fetching transactions, which will include all of the metadata about the charge or refund along with the transaction.
transactions = stripe.BalanceTransaction.list(payout=p_id, expand=['data.source'])
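To make the saving concrete, here is a back-of-the-envelope count. It assumes n payouts with m charges or refunds each, and that every list fits on a single page; the function names and the numbers are only illustrative.

```python
def calls_without_expand(n_payouts, m_sources):
    """1 payout list + 1 transaction list per payout + 1 retrieve per charge/refund."""
    return 1 + n_payouts + n_payouts * m_sources


def calls_with_expand(n_payouts):
    """1 payout list + 1 expanded transaction list per payout."""
    return 1 + n_payouts


print(calls_without_expand(4, 10))  # 45
print(calls_with_expand(4))         # 5
```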
Fetching the BalanceTransaction list like this will only work as long as your results fit on one "page" of results. The API returns paginated [A] results, so if you have more than 10 transactions per payout, this will miss some. Instead, you can use an auto-pagination feature of the stripe-python library to iterate over all results from the BalanceTransaction list.
for transaction in transactions.auto_paging_iter():
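If it helps to see what auto-pagination does conceptually, here is a rough sketch of following a cursor by hand. Stripe list responses carry `data` and `has_more`, and the next page is requested with `starting_after` set to the last object's ID. The `fetch_page` callable and the fake endpoint below are hypothetical stand-ins so the loop can run without an API key; this is not how stripe-python is implemented internally.

```python
def iter_all(fetch_page, page_size=10):
    """Yield every item from a cursor-paginated list.

    fetch_page(limit, starting_after) is a hypothetical callable returning
    {"data": [...], "has_more": bool}, the shape of a Stripe list response.
    """
    starting_after = None
    while True:
        page = fetch_page(limit=page_size, starting_after=starting_after)
        yield from page["data"]
        if not page["has_more"] or not page["data"]:
            break
        # The cursor is the ID of the last item on the current page.
        starting_after = page["data"][-1]["id"]


# Fake endpoint holding 25 transactions, for illustration only.
ITEMS = [{"id": f"txn_{i}"} for i in range(25)]


def fake_fetch(limit, starting_after=None):
    start = 0
    if starting_after is not None:
        start = next(i for i, item in enumerate(ITEMS) if item["id"] == starting_after) + 1
    return {"data": ITEMS[start:start + limit], "has_more": start + limit < len(ITEMS)}


all_ids = [item["id"] for item in iter_all(fake_fetch)]
print(len(all_ids))  # 25
```

A single un-paginated call with the default page size would have returned only the first 10 of these.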
I'm not quite sure why we're skipping over index 0 with if x != 0: so that may need to be addressed elsewhere :D
I didn't see how or where amount_str or amount_dollar was actually used.
Rather than determining the type of the object by checking the ID prefix like ch_ or re_ you'll want to use the type attribute. Again in this case, it's better to filter by type so that you only get exactly the data you need from the API:
transactions = stripe.BalanceTransaction.list(payout=p_id, type='charge', expand=['data.source'])
I'm unable to test because I lack the same database that you have, but wanted to share a refactoring of your code that you may consider.
r = stripe.Payout.list(limit=4, status='paid')
payouts = []
for data in r['data']:
    p_id = data['id']
    amount = data['amount']
    meta = []
    amount_str = str(amount)
    amount_dollar = str(amount / 100)
    # Charges: expanding the source brings its metadata along with each transaction.
    transactions = stripe.BalanceTransaction.list(payout=p_id, type='charge', expand=['data.source'])
    for transaction in transactions.auto_paging_iter():
        meta = list(transaction.source.metadata)
        if stripe_payouts.count_documents({"_id": p_id}) == 0:
            payouts.append(
                {
                    "_id": str(p_id),
                    "payout": str(p_id),
                    "transactions": transactions,
                    "metadata": {
                        transaction.source.id: [meta]
                    }
                }
            )
    # Refunds: same pattern, filtered by type.
    transactions = stripe.BalanceTransaction.list(payout=p_id, type='refund', expand=['data.source'])
    for transaction in transactions.auto_paging_iter():
        meta = list(transaction.source.metadata)
        if stripe_payouts.count_documents({"_id": p_id}) == 0:
            payouts.append(
                {
                    "_id": str(p_id),
                    "payout": str(p_id),
                    "transactions": transactions,
                    "metadata": {
                        transaction.source.id: [meta]
                    }
                }
            )
# TODO: Add error exception to check for po id already in the database.
if len(payouts) != 0:
    x = stripe_payouts.insert_many(payouts)
    print("Inserted into Database ", len(x.inserted_ids), x.inserted_ids)
else:
    print("No entries made")
Here's a further refactoring using functions defined to encapsulate just the bit adding to the database:
r = stripe.Payout.list(limit=4, status='paid')
payouts = []

def add_metadata(payout_id, transaction_type):
    # Fetch one type of transaction for the payout, with the source expanded.
    transactions = stripe.BalanceTransaction.list(payout=payout_id, type=transaction_type, expand=['data.source'])
    for transaction in transactions.auto_paging_iter():
        meta = list(transaction.source.metadata)
        if stripe_payouts.count_documents({"_id": payout_id}) == 0:
            payouts.append(
                {
                    "_id": str(payout_id),
                    "payout": str(payout_id),
                    "transactions": transactions,
                    "metadata": {
                        transaction.source.id: [meta]
                    }
                }
            )

for data in r['data']:
    p_id = data['id']
    add_metadata(p_id, 'charge')
    add_metadata(p_id, 'refund')
# TODO: Add error exception to check for po id already in the database.
if len(payouts) != 0:
    x = stripe_payouts.insert_many(payouts)
    print("Inserted into Database ", len(x.inserted_ids), x.inserted_ids)
else:
    print("No entries made")
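One thing to watch for, which may be related to the key error mentioned in the question's edit: the loops above append one dictionary per transaction, all sharing the payout's `_id`, and MongoDB requires `_id` to be unique within a collection. A possible way around this is to group everything into one document per payout before inserting. The sketch below uses plain tuples and dictionaries in place of Stripe objects; `build_payout_docs`, the IDs, and the `existing_ids` argument are made up for illustration, and `reservation_ref` is taken from the docstring in the question.

```python
from collections import defaultdict


def build_payout_docs(rows, existing_ids=()):
    """Group (payout_id, source_id, metadata) rows into one document per payout.

    existing_ids holds payout IDs already stored, so they are skipped.
    """
    grouped = defaultdict(dict)
    for payout_id, source_id, metadata in rows:
        if payout_id in existing_ids:
            continue
        # Key the metadata by charge/refund ID inside the payout's document.
        grouped[payout_id][source_id] = dict(metadata)
    return [
        {"_id": payout_id, "payout": payout_id, "metadata": meta}
        for payout_id, meta in grouped.items()
    ]


# Illustrative rows; in practice these would come from the expanded transactions.
rows = [
    ("po_1", "ch_a", {"reservation_ref": "R100"}),
    ("po_1", "re_b", {"reservation_ref": "R100"}),
    ("po_2", "ch_c", {"reservation_ref": "R200"}),
]
docs = build_payout_docs(rows, existing_ids={"po_2"})
print(docs)
```

The resulting list has exactly one entry per new payout, so passing it to `insert_many` should not produce duplicate `_id` values.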
[A] https://stripe.com/docs/api/pagination
[B] https://stripe.com/docs/api/expanding_objects

Twitter API: How to make query keep running?

Novice programmer here seeking help.
I have already set up my code to use Twitter's premium API:
SEARCH_TERM = '#AAPL OR #FB OR #KO OR #ABT OR #PEPCO'
PRODUCT = 'fullarchive'
LABEL = 'my_label'
r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL),
                {'query':SEARCH_TERM, 'fromDate':201501010000, 'toDate':201812310000})
However, when I run it I obtain the maximum number of tweets per search which is 500.
My question is should I add to the query maxResults = 500? And how do I use the next parameter to keep the code running until all the tweets that correspond to my query are obtained?
To up the results from the default of 100 to 500, yes, add maxResults to the query like this:
r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL),
                {
                    'query':SEARCH_TERM,
                    'fromDate':201501010000, 'toDate':201812310000,
                    'maxResults':500
                })
You can make successive queries to get more results by using the next parameter. But, even easier, you can let TwitterAPI do this for you by using the TwitterPager class. Here is an example:
from TwitterAPI import TwitterAPI, TwitterPager

SEARCH_TERM = '#AAPL OR #FB OR #KO OR #ABT OR #PEPCO'
PRODUCT = 'fullarchive'
LABEL = 'my_label'
api = TwitterAPI(<consumer key>,
                 <consumer secret>,
                 <access token key>,
                 <access token secret>)
pager = TwitterPager(api, 'tweets/search/%s/:%s' % (PRODUCT, LABEL),
                     {
                         'query':SEARCH_TERM,
                         'fromDate':201501010000, 'toDate':201812310000
                     })
for item in pager.get_iterator():
    print(item['text'] if 'text' in item else item)
This example will keep making successive requests with the next parameter until no tweets can be downloaded.
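For reference, here is roughly what the manual approach with the next parameter looks like. The premium search response body contains a `results` list and, when more pages exist, a `next` token that you send back with the following request. The `search` callable and the canned pages below are hypothetical, so the loop can be run without credentials; with TwitterAPI you would wrap `api.request(...)` and its JSON body instead.

```python
def fetch_all(search, params):
    """Collect results across pages by resending the 'next' token.

    search(params) is a hypothetical callable returning the decoded JSON body,
    i.e. a dict with 'results' and, while more pages exist, 'next'.
    """
    params = dict(params)  # avoid mutating the caller's dict
    tweets = []
    while True:
        body = search(params)
        tweets.extend(body.get("results", []))
        token = body.get("next")
        if not token:
            return tweets
        params["next"] = token


# Canned pages keyed by the token that requests them, for illustration only.
PAGES = {
    None: {"results": [{"text": "a"}, {"text": "b"}], "next": "t1"},
    "t1": {"results": [{"text": "c"}], "next": "t2"},
    "t2": {"results": [{"text": "d"}]},
}


def fake_search(params):
    return PAGES[params.get("next")]


tweets = fetch_all(fake_search, {"query": "#AAPL", "maxResults": 500})
print([t["text"] for t in tweets])  # ['a', 'b', 'c', 'd']
```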
Use the count variable in a raw_query, for example:
results = api.GetSearch(
raw_query="q=twitter%20&result_type=recent&since=2014-07-19&count=100")

export data from service now rest API using python

I have to export incident data from the ServiceNow REST API. The incident state is one of new, in progress, pending not resolved and closed. I can fetch incidents that are active, but I am not able to apply the correct state filter. Also, each value in the output is prefixed with an extra character 'b'. How do I remove that extra character?
input:
import requests
URL = 'https://instance_name.service-now.com/incident.do?CSV&sysparm_query=active=true'
user = 'user_name'
password = 'password'
headers = {"Accept": "application/xml"}
response = requests.get(URL,auth=(user, password), headers=headers)
if response.status_code != 200:
    print('Status:', response.status_code, 'Headers:', response.headers, 'Error Response:', response.content)
    exit()
print(response.content.splitlines())
Output:
[b'"number","short_description","state"', b'"INC0010001","Test incident creation through REST","New"', b'"INC0010002","incident creation","Closed"', b'"INC0010004","test","In Progress"']
The 'b' prefix marks a bytes literal (for more information, see https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals).
To get rid of it, decode each bytes object into a string. Give this a try:
new_list = []
s = [b'"number","short_description","state"',
     b'"INC0010001","Test incident creation through REST","New"',
     b'"INC0010002","incident creation","Closed"',
     b'"INC0010004","test","In Progress"']
for ele in s:
    ele = ele.decode("utf-8")
    new_list.append(ele)
print(new_list)
Output:
['"number","short_description","state"', '"INC0010001","Test incident creation through REST","New"', '"INC0010002","incident creation","Closed"', '"INC0010004","test","In Progress"']
Hope this helps.
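Going one step further, you could decode the whole response body once and let the csv module split it into rows, which also makes it easy to filter on the state column client-side. This is only a sketch: `parse_csv_bytes` is a made-up helper, the sample bytes are the ones from the question, and UTF-8 is assumed as the encoding. With requests, `response.text` gives you an already decoded string as an alternative to decoding `response.content` yourself.

```python
import csv
import io


def parse_csv_bytes(content, encoding="utf-8"):
    """Decode a CSV response body and return one dict per data row."""
    text = content.decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))


# Sample body built from the output shown in the question.
content = (
    b'"number","short_description","state"\n'
    b'"INC0010001","Test incident creation through REST","New"\n'
    b'"INC0010002","incident creation","Closed"\n'
    b'"INC0010004","test","In Progress"\n'
)

rows = parse_csv_bytes(content)
not_closed = [row for row in rows if row["state"] != "Closed"]
print([row["number"] for row in not_closed])  # ['INC0010001', 'INC0010004']
```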

Obtain the desired value from the output of a method Python

I use a method from the Telethon Python 3 library:
"client(GetMessagesRequest(peers,[pinnedMsgId]))"
This returns:
ChannelMessages(pts=41065, count=0, messages=[Message(out=False, mentioned=False,
media_unread=False, silent=False, post=False, id=20465, from_id=111104071,
to_id=PeerChannel(channel_id=1111111111), fwd_from=None, via_bot_id=None,
reply_to_msg_id=None, date=datetime.utcfromtimestamp(1517325331),
message=' test message test', media=None, reply_markup=None,
entities=[], views=None, edit_date=None, post_author=None, grouped_id=None)],
chats=[Channel(creator=..............
I only need the text of the message ------> test message test
How can I get just that?
The Telethon team says:
"This is not related to the library. You just need more Python knowledge so ask somewhere else"
Thanks
Assuming you have saved the return value in some variable, say, result = client(...), you can access members of any instance through the dot operator:
result = client(...)
message = result.messages[0]
The [0] is a way to access the first element of a list (see the documentation for __getitem__). Since you want the text...:
text = message.message
Should do the trick.
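Put together, and with a guard for an empty list, it could look like the sketch below. `SimpleNamespace` objects stand in for the Telethon result here, so nothing in this snippet touches the library itself, and `first_message_text` is just an illustrative name. The `strip()` call is there because the text in the question starts with a space.

```python
from types import SimpleNamespace

# Stand-in with the same shape as the output shown in the question.
result = SimpleNamespace(
    count=0,
    messages=[SimpleNamespace(id=20465, message=" test message test")],
)


def first_message_text(result):
    """Return the text of the first message, or None if there are no messages."""
    if not result.messages:
        return None
    return result.messages[0].message.strip()


print(first_message_text(result))  # test message test
```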
