Downloading a CSV that requires authentication from an email link - python-3.x

I'm a very junior developer, tasked with automating the creation, download and transformation of a query from Stripe Sigma.
I've been able to get the bulk of my job done: I have daily scheduled queries that generate a report for the prior 24 hours, which is linked to a dummy account purely for those reports, and I've got the Transformation and reports done on the back half of this problem.
The roadblock I've run into though is getting this code to pull the csv that manually clicking the link generates.
import re
from imbox import Imbox # pip install imbox
import traceback
import requests
from bs4 import BeautifulSoup
mail = Imbox(host, username=username, password=password, ssl=True, ssl_context=None, starttls=False)
messages = mail.messages(unread=True)
message_list = []
for (uid, message) in messages:
body = str(message.body.get('html'))
message_list.append(body)
mail.logout()
def get_download_link(message):
print(message[0])
soup = BeautifulSoup(message, 'html.parser')
urls = []
for link in soup.find_all('a'):
print(link.get('href'))
urls.append(link.get('href'))
return urls[1]
# return urls
dl_urls = []
for m in message_list:
dl_urls.append(get_download_link(m))
for url in dl_urls:
print(url)
try:
s = requests.Session()
s.auth = (username, password)
response = s.get(url, allow_redirects=True, auth= (username, password))
# print(response.headers)
if (response.status_code == requests.codes.ok):
print('response headers', response.headers['content-type'])
response = requests.get(url, allow_redirects=True, auth= HTTPDigestAuth(username, password))
# print(response.text)
print(response.content)
# open(filename, 'wb').write(response.content)
else:
print("invalid status code",response.status_code)
except:
print('problem with url', url)
I'm working on this in jupyter notebooks, I've tried to just include relevant code detailing how I got into the email, how I extracted URLS from said email, and which one upon being clicked would download the csv.
All the way til the last step, I've had remarkably good luck, but now, the URL that I manually click downloads the csv as expected, however that same URL is being treated as the HTML for a stripe page by python/requests.
I've tried poking around in the headers, the one header that was suggested on another post ('Content-Disposition') wasn't present, and the printing the headers that are present takes up a good 20-25 lines.
Any suggestions on either headers that could contain the csv, or other approaches I would take would be appreciated.
I've included a (intentionally broken) URL to show the rough format of what is working for manual download, not working when kept in python entirely.
https://59.email.stripe.com/CL0/https:%2F%2Fdashboard.stripe.com%2Fscheduled_query_runs%xxxxxxxxxxxxxxxx%2Fdownload/1/xxxxxxxxxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-000000/xxxxxxxxx-xxxxx_xxxxxxxxxxxxxxxx=233

If you're using scheduled queries, you can receive notification about completion/availability as webhooks and then access the files programmatically using the url included in the event payload. You can also list/retrieve scheduled queries via the API and check their status before accessing the data at the file link.
There should be no need to parse these links out of an email.

You can also set up a Stripe webhook for the Stripe Sigma query run, pointing that webhook notification to a Zap webhook trigger. Then:
Catch the hook in Zapier
Filter by the Sigma query name
Use another Zapier webhook custom request to authenticate and grab the data object file URL
Use the utilities formatter by Zapier to transform the results into a CSV
file

Related

Python + Mechanize - Emulate Javascript button click using POST?

I'm trying to automate filling a car insurance quote form on a site:
(following the same format as the site URL lets call it: "https://secure.examplesite.com/css/car/step1#noBack")
I'm stuck on the rego section as once the rego has been added, a button needs to be clicked to perform the search and it seems this is heavy Javascript and I know mechanizer can't handle this. I'm not versed in JavaScript, but I can see that when clicking the button a POST request is made to this URL: ("https://secure.examplesite.com/css/car/step1/searchVehicleByRegNo") Please see image also.
How can I emulate this POST request in Mechanize to run the javascript? So I can see the response / interact with the response? Or is this not possible? Can I consider bs4/requests/robobrowser instead. I'm only ~4 months into learning! Thanks
# Mechanize test
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(False) # can sometimes hang without this
res = br.open("https://secure.examplesite.com/css/car/step1#noBack")
br.select_form(id = "quoteCollectForm")
br.set_all_readonly(False) # allow everything to be written to
controlDict = {}
# List all form controls
for control in br.form.controls:
controlDict[control.name] = control.value
print("type = %s, name = %s, value = %s" %(control.type, control.name, control.value))
# Enter Rego etc "example"
br.form["vehicle.searchRegNo"] = "example"
# Now for control name = vehicle.searchRegNo, value = example
# BUT Now how do I click the button?? Simulate POST? The post url is formatted like:
# https://secure.examplesite.com/css/car/step1/searchVehicleByRegNo
Javascript POST
Solved my own problem-
Steps:
open dev tools in browser
Go to network tab and clear
interact with form element (in my case car rego finder)
click on the event that occurs from interaction
copy the exact URL, Request header data, and payload
I used Postman to quickly test the request and responses were correct / the same as the Webform and found the relevant headers
in postman convert to python requests code
Now I can interact completely with the form

Facebook Python SDK - GET events

I am a newbie working with facebook python sdk. I found this file that enabled me to obtain event data from my facebook business page.
However, I am unable to fetch all of the fields I need. The missing field I am unable to obtain is the event videos. I am unable to retrieve the video source or video_url for the video that was posted with the event.
I have tried many different variations of fields =
to no avail. Thank You
import facebook
import requests
def some_action(post):
""" Here you might want to do something with each post. E.g. grab the
post's message (post['message']) or the post's picture (post['picture']).
In this implementation we just print the post's created time.
"""
print(post["id"])
# You'll need an access token here to do anything. You can get a temporary one
# here: https://developers.facebook.com/tools/explorer/
access_token = "My_Access_Token"
# Look at Bill Gates's profile for this example by using his Facebook id.
user = "397894520722082"
graph = facebook.GraphAPI(access_token)
profile = graph.get_object(user)
posts = graph.get_connections(profile['id'], 'posts')
print(graph.get_object(id=397894520722082,
fields='events{videos,id,category,name,description,cover,place,start_time,end_time,interested_count,attending_count,declined_count,event_times,comments}'))
# Wrap this block in a while loop so we can keep paginating requests until
# finished.
while True:
try:
# Perform some action on each post in the collection we receive from
# Facebook.
[some_action(post=post) for post in posts["data"]]
# Attempt to make a request to the next page of data, if it exists.
posts = requests.get(posts["paging"]["next"]).json()
except KeyError:
# When there are no more pages (['paging']['next']), break from the
# loop and end the script.
break

Python - Web scraping using HTML tags

I am trying to scrape a web-page to list out the jobs posted in URL: https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad
Refer to image for details of web-page inspect Web inspect
Following is observed through a web-page inspect:
Each job listed, is in a HTML li with class="jobs-list-item". The Li contains following html tag & data in parent Div within li
data-ph-at-job-title-text="Software Engineer II",
data-ph-at-job-category-text="Engineering",
data-ph-at-job-post-date-text="2018-03-19T16:33:00".
1st Child Div within parent Div with class="information" has HTML with url
href="https://careers.microsoft.com/us/en/job/406138/Software-Engineer-II"
3rd child Div with class="description au-target" within parent Div has short job description
My requirement is to extract below information for each job
Job Title
Job Category
Job Post Date
Job Post Time
Job URL
Job Short Description
I have tried following Python code to scrape the webpage, but unable to extract the required information. (Please ignore the indentation shown in code below)
import requests
from bs4 import BeautifulSoup
def ms_jobs():
url = 'https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad'
resp = requests.get(url)
if resp.status_code == 200:
print("Successfully opened the web page")
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup)
else:
print("Error")
ms_jobs()
If you want to do this via requests you need to reverse engineer the site. Open the dev tools in Chrome, select the networks tab and fill out the form.
This will show you how the site loads the data. If you dig in the site you'll see, that it grabs the data by doing a POST to this endpoint: https://careers.microsoft.com/widgets. It also shows you the payload that the site uses. The site uses cookies so all you have to do is create a session that keeps the cookie, get one and copy/paste the payload.
This way you'll be able to extract the same json-data, that the javascript fetches to populate the site dynamically.
Below is a working example of what that would look like. Left is only to parse out the json as you see fit.
import requests
from pprint import pprint
# create a session to grab a cookie from the site
session = requests.Session()
r = session.get("https://careers.microsoft.com/us/en/")
# these params are the ones that the dev tools show that site sets when using the website form
payload = {
"lang":"en_us",
"deviceType":"desktop",
"country":"us",
"ddoKey":"refineSearch",
"sortBy":"",
"subsearch":"",
"from":0,
"jobs":"true",
"counts":"true",
"all_fields":["country","state","city","category","employmentType","requisitionRoleType","educationLevel"],
"pageName":"search-results",
"size":20,
"keywords":"",
"global":"true",
"selected_fields":{"city":["Hyderabad"],"country":["India"]},
"sort":"null",
"locationData":{}
}
# this is the endpoint the site uses to fetch json
url = "https://careers.microsoft.com/widgets"
r = session.post(url, json=payload)
data = r.json()
job_list = data['refineSearch']['data']['jobs']
# the job_list will hold 20 jobs (you can se the parameter in the payload to a higher number if you please - I tested 100, that returned 100 jobs
job = job_list[0]
pprint(job)
Cheers.

How to make multiple API calls from multiple pages in single URL

So the title is a little confusing I guess..
I have a script that I've been writing that will display some random data and other non-essentials when I open my shell. I'm using grequests to make my API calls since I'm using more than one URL. For my weather data, I use WeatherUnderground's API since it will offer active alerts. The alerts and conditions data are on separate pages. What I can't figure out is how to insert the appropriate name in the grequests object when it is making requests. Here is the code that I have:
URLS = ['http://api.wunderground.com/api/'+api_id+'/conditions/q/autoip.json',
'http://www.ourmanna.com/verses/api/get/?format=json',
'http://quotes.rest/qod.json',
'http://httpbin.org/ip']
requests = (grequests.get(url) for url in URLS)
responses = grequests.map(requests)
data = [response.json() for response in responses]
#json parsing from here
In the URL 'http://api.wunderground.com/api/'+api_id+'/conditions/q/autoip.json' I need to make an API request to conditions and alerts to retrieve the data I need. How do I do this without rewriting a fourth URLS string?
I've tried
pages = ['conditions', 'alerts']
URL = ['http://api.wunderground.com/api/'+api_id+([p for p in pages])/q/autoip.json']
but, as I'm sure some of you more seasoned programmers know, threw and exception. So how can I iterate through these pages, or will I have to write out both complete URLS?
Thanks!
Ok I was actually able to figure out how to call each individual page within the grequests object by using a simple for loop. Here is the the code that I used to produced the expected results:
import grequests
pages = ['conditions', 'alerts']
api_id = 'myapikeyhere'
for p in pages:
URLS = ['http://api.wunderground.com/api/'+api_id+'/'+p+'/q/autoip.json',
'http://www.ourmanna.com/verses/api/get/?format=json',
'http://quotes.rest/qod.json',
'http://httpbin.org/ip']
#create grequest object and retrieve results
requests = (grequests.get(url) for url in URLS)
responses = grequests.map(requests)
data = [response.json() for response in responses]
#json parsing from here
I'm still not sure why I couldn't figure this out before.
Documentation for the grequests library here

Python Requests Refresh

I'm trying to use python's requests library to log in to a website. It's a pretty simple code, and you can really get the gist of requests just by going on its website. I, however, want to check if I'm successfully logged in via the url. The problem I've encountered is when I initiate the post requests and give it (the variable p) a url, whether the html has changed or not I'm still passed the same url when I type print(p.url). Is there any way for me to refresh the browser or update the url to whatever it's currently set at?
(I can add a line for checking the url against itself later, but for now I just want to get the correct url)
#!usr/bin/env python3
import requests
payload = {'login': 'USERNAME,
'password': 'PASSWORD'}
with requests.Session() as s:
p = s.post('WEBSITE', data=payload)
#print p.text
print(p.url)
The usuage of python-requests may not as complex as you think. It will automatically handle the redirect of your post ( or session.get()).
Here, session.post() method return a response object:
r = s.post('website', data=payload)
which means r.url is current url you are looking for.
If you still want to refresh current page, just use:
s.get(r.url)
To verify whether you has login successfully, one solution is to do the login in your browser.
Based on the title or content of the webpage returned (i.e, use the content in r.text), you can judge whether you have made it.
BTW, python-requests is a great library, enjoy it.

Resources