parallelizing a Python function? - python-3.x

I have a function that submits a search job to a REST API, waits for the API to respond, then downloads 2 sets of JSON data, converts the both JSON's into Pandas dataframes, and returns both dataframes. below is a very simplified version of the function(minus error handling, logging, data scrubbing, etc...)
def getdata(searchstring, url, uname, passwd):
headers = {'content-type': 'application/json'}
json_data = CreateJSONPayload(searchstring)
rPOST = requests.post(url, auth=(uname, passwd), data=json_data, headers=headers)
statusURL = (str(json.loads(rPOST.text)[u'link'][u'href']))
Processing = True
while Processing == True:
rGET = requests.get(statusURL, auth=(uname, passwd))
if rGET.status_code== 200:
url1 = url + "/dataset1"
url2 = url + "/dataset2"
rGET1 = requests.get(url1, auth=(uname, passwd))
rGET2 = requests.get(url2, auth=(uname, passwd))
dfData1 = pd.read_json(rGET1.text)
dfData2 = pd.read_json(rGET2.text)
Processing = False
elif StatusCode == "Other return code handling":
print("handle errors") # Not relevant to question.
else:
sleep(15)
return dfData1, dfData2
The function itself works as expected. However the API being called can take anywhere from a couple of minutes to an hour to return the data depending on the parameters I pass to it and I need to submit multiple searches to it, so I'd rather not submit each search one after the other.
What's the best way to parallelize the calling of a function like this so that I can submit multiple requests to it at the same time, wait for all calls of the function have returned data, and finally continue on with data processing in the script?
I also need to be able to throttle the requests too, as the API rate limits me to no more than 15 concurrent connections at a time.

Related

How to make multiple REST calls asynchronous in python3

I have the following code to make multiple REST calls. Basically I have a dictionary where key is a string and value is a JSON date that I need to use as payload to pass to a REST API POST method.
At the moment, the dictionary contains 10 entries, so I need to make 10 REST calls.
At the moment, I have implemented using requests package in python3 which is synchronous in nature. So after 1 REST call, it waits for its response and similarly for 10 REST calls, it will wait 10 times for the response from API.
def createCategories(BACKEND_URL, token, category):
url = os.path.join(BACKEND_URL, 'api/v1/category-creation')
category_dict = read_payloads(category)
headers = {
"token": f'{token}',
"Content-Type": "application/json",
"accept": "application/json"
}
for name, category_payload in category_dict.items():
json_payload = json.dumps(category_payload)
response = requests.request("POST", url, headers=headers, data=json_payload)
##########################
## Load as string and parsing
response_data = json.loads(response.text)
print(response_data)
category_id = response_data['id']
message = 'The entity with id: ' + str(category_id) + ' is created successfully. '
logging.info(message)
return "categories created successfully."
I read that we need to use asyncio to make these asynchronous. What code changes do I need to make?
You can continue using requests library. You need to use threading or concurrent.futures modules to make several requests simutaneoudly.
Another option is to use some async library like aiohttp or some others.
import requests
from threading import current_thread
from concurrent.futures import ThreadPoolExecutor, Future
from time import sleep, monotonic
URL = "https://api.github.com/events"
def make_request(url: str) -> int:
r = requests.get(url)
sleep(2.0) # wait n seconds
return r.status_code
def done_callback(fut: Future):
if fut.exception():
res = fut.exception()
print(f"{current_thread().name}. Error: {res}")
elif fut.cancelled():
print(f"Task was canceled")
else:
print(f"{current_thread().name}. Result: {fut.result()}")
if __name__ == '__main__':
urls = [URL for i in range(20)] # 20 tasks
start = monotonic()
with ThreadPoolExecutor(5) as pool:
for i in urls:
future_obj = pool.submit(make_request, i)
future_obj.add_done_callback(done_callback)
print(f"Time passed: {monotonic() - start}")

How to run a while loop to run a REST API call until no more results come back in Python

I'm writing a short Python program to request a JSON file using a Rest API call. The API limits me to a relatively small results set (50 or so) and I need to retrieve several thousand result sets. I've implemented a while loop to achieve this and it's working fairly well but I can't figure out the logic for 'continuing the while loop' until there are no more results to retrieve. Right now I've implemented a hard number value but would like to replace it with a conditional that stops the loop if no more results come back. The 'offset' field is the parameter that the API forces you to use to specify which set of results you want in your 50. My logic looks something like...
import requests
import json
from time import sleep
url = "https://someurl"
offsetValue = 0
PARAMS = {'limit':50, 'offset':offsetValue}
headers = {
"Accept": "application/json"
}
while offsetValue <= 1000:
response = requests.request(
"GET",
url,
headers=headers,
params = PARAMS
)
testfile = open("testfile.txt", "a")
testfile.write(json.dumps(json.loads(response.text), sort_keys=True, indent=4, separators=(",", ": ")))
testfile.close()
offsetValue = offsetValue + 1
sleep(1)
So I want to change the conditional the controls the while loop from a fixed number to a check to see if the result set for the getRequest is empty. Hopefully this makes sense.
Your loop can be while true. After each fetch, convert the payload to a dict. If the number of results is 0, then break.
Depending on how the API works, there may be other signals that there’s nothing more to fetch, e.g. some HTTP error, not necessarily the result count — you’ll have to discover the API’s logic for that.

Asyncio, the tasks are not finished properly, because of sentinel issues

I'm trying to do some web-scraping, as learning, using a predefined number of workers.
I'm using None as as sentinel to break out of the while loop and stop the worker.
The speed of each worker varies, and all workers are closed before the last
url is passed to gather_search_links to get the links.
I tried to use asyncio.Queue, but I had less control than with deque.
async def gather_search_links(html_sources, detail_urls):
while True:
if not html_sources:
await asyncio.sleep(0)
continue
data = html_sources.pop()
if data is None:
html_sources.appendleft(None)
break
data = BeautifulSoup(data, "html.parser")
result = data.find_all("div", {"data-component": "search-result"})
for record in result:
atag = record.h2.a
url = f'{domain_url}{atag.get("href")}'
detail_urls.appendleft(url)
print("apended data", len(detail_urls))
await asyncio.sleep(0)
async def get_page_source(urls, html_sources):
client = httpx.AsyncClient()
while True:
if not urls:
await asyncio.sleep(0)
continue
url = urls.pop()
print("url", url)
if url is None:
urls.appendleft(None)
break
response = await client.get(url)
html_sources.appendleft(response.text)
await asyncio.sleep(8)
html_sources.appendleft(None)
async def navigate(urls):
for i in range(2, 7):
url = f"https://www.example.com/?page={i}"
urls.appendleft(url)
await asyncio.sleep(0)
nav_urls.appendleft(None)
loop = asyncio.get_event_loop()
nav_html = deque()
nav_urls = deque()
products_url = deque()
navigate_workers = [asyncio.ensure_future(navigate(nav_urls)) for _ in range(1)]
page_source_workers = [asyncio.ensure_future(get_page_source(nav_urls, nav_html)) for _ in range(2)]
product_urls_workers = [asyncio.ensure_future(gather_search_links(nav_html, products_url)) for _ in range(1)]
workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])
loop.run_until_complete(workers)
I'm a bit of a newbie, so this could be wrong as can be, but I believe that the issue is that all three of the functions: navigate(), gather_search_links(), and get_page_source() are asynchronous tasks that can be completed in any order. However, your checks for empty deques and your use of appendleft to ensure None is the leftmost item in your deques, look like they would appropriately prevent this. For all intents and purposes the code looks like it should run correctly.
I think the issue arises at this line:
workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])
According to this post, the asyncio.wait function does not order these tasks according to the order they're written above, instead it fires them according to IO as coroutines. Again, your checks at the beginning of gather_search_links and get_page_source are ensuring that one function runs after the other and thus this code should work if there is only a single worker for each function. If there are multiple workers for each function, I can see issues arising where None doesn't wind up being the leftmost item in your deques. Perhaps a print statement at the end of each function to show the contents of your deques would be useful in troubleshooting this.
I guess my major question would be, why do these tasks asnychronously if you're going to write extra code because the steps must be completed synchronously? In order to get the HTML you must first have the URL. In order to scrape the HTML you must first have the HTML. What benefit does asyncio provide here? All three of these make more sense to me as synchronous tasks. Get URL, get HTML, scrape HTML, and in that order.
EDIT: It occurred to me that the main benefit of asynchronous code here is that you don't want to have to wait on each individual URL to respond back synchronously when you fetch the HTML from them. What I would do in this situation is gather my URLs synchronously first, and then combine the get and scrape functions into a single asynchronous function, which would be your only asynchronous function. Then you don't need a sentinel or a check for a "None" value or any of that extra code and you get the full value of the asynchronous fetch. You could then store your scraped data in a list (or deque or whatever) of futures. This would simplify your code and provide you with the fastest possible scrape time.
LAST EDIT:
Here's my quick and dirty rewrite. I liked your code so I decided to do my own spin. I have no idea if it works, I'm not a Python person.
import asyncio
from collections import deque
import httpx as httpx
from bs4 import BeautifulSoup
# Get or build URLs from config
def navigate():
urls = deque()
for i in range(2, 7):
url = f"https://www.example.com/?page={i}"
urls.appendleft(url)
return urls
# Asynchronously fetch and parse data for a single URL
async def fetchHTMLandParse(url):
client = httpx.AsyncClient()
response = await client.get(url)
data = BeautifulSoup(response.text, "html.parser")
result = data.find_all("div", {"data-component": "search-result"})
for record in result:
atag = record.h2.a
#Domain URL was defined elsewhere
url = f'{domain_url}{atag.get("href")}'
products_urls.appendleft(url)
loop = asyncio.get_event_loop()
products_urls = deque()
nav_urls = navigate()
fetch_and_parse_workers = [asyncio.ensure_future(fetchHTMLandParse(url)) for url in nav_urls]
workers = asyncio.wait([*fetch_and_parse_workers])
loop.run_until_complete(workers)

Ensuring unique timestamps generation in asyncio/aiohttp coroutines

I'm rewriting a web scraper with aiohttp. At some point, it has to make a POST request with a payload notably including a 'CURRENT_TIMESTAMP_ID'. These requests seem to always succeed, but they sometimes are redirected (302 status code) to another location, as additional details need to be fetched to be displayed on the page. Those redirections often fails ("A system error occurred" or "not authorized" error message is displayed on the page), and I don't know why.
I guess it's because they sometimes share the same value for 'CURRENT_TIMESTAMP_ID' (because headers and cookies are the same). Thus, I'd like to generate different timestamps in each request but I had no success doing that. I tried using some randomness with stuffs like asyncio.sleep(1+(randint(500, 2000) / 1000)). Also, note that doing the scraping with task_limit=1 succeeds (see code below).
Here is the relevant part of my code:
async def search(Number, session):
data = None
loop = asyncio.get_running_loop()
while data is None:
t = int(round(time() * 1000)) #often got the same value here
payload = {'Number': Number,
'CURRENT_TIMESTAMP_ID': t}
params = {'CURRENT_TIMESTAMP_ID': t}
try:
async with session.post(SEARCH_URL, data=payload, params=params) as resp:
resp.raise_for_status()
data = await resp.text()
return data
except aiohttp.ClientError as e:
print(f'Error with number{Number}: {e}')
It's called via:
async def main(username, password):
headers = {'User-Agent': UserAgent().random}
async with aiohttp.ClientSession(headers=headers) as session:
await login(session, username, password)
"""Perform the following operations:
1. Fetch a bunch of urls concurrently, with a limit of x tasks
2. Gather the results into chunks of size y
3. Process the chunks in parallel using z different processes
"""
partial_search = async_(partial(search, session=session)) #I'm using Python 3.7
urls = ['B5530'] * 3 #trying to scrape the same URL 3 times
results = await ( #I'm using aiostream cause I got a huge list of urls. Problem also occurs with gather.
stream.iterate(urls)
| pipe.map(partial_search, ordered=False, task_limit=100)
| pipe.chunks(100 // cpu_count())
| pipe.map(process_in_executor, ordered=False, task_limit=cpu_count() + 1)
)
Hope someone will see what I'm missing!

Scrapy does not respect LIFO

I use Scrapy 1.5.1
My Goal is to go through entire chain of requests for each variable before moving to the next variable. For some reason Scrapy takes 2 variables, then sends 2 requests, then takes another 2 variables and so on.
CONCURRENT_REQUESTS = 1
Here is my code sample:
def parsed ( self, response):
# inspect_response(response, self)
search = response.meta['search']
for idx, i in enumerate(response.xpath("//table[#id='ctl00_ContentPlaceHolder1_GridView1']/tr")[1:]):
__EVENTARGUMENT = 'Select${}'.format(idx)
data = {
'__EVENTARGUMENT': __EVENTARGUMENT,
}
yield scrapy.Request(response.url, method = 'POST', headers = self.headers, body = urlencode(data),callback = self.res_before_get,meta = {'search' : search}, dont_filter = True)
def res_before_get ( self, response):
# inspect_response(response, self)
url = 'http://www.moj-yemen.net/Search_detels.aspx'
yield scrapy.Request(url, callback = self.results, dont_filter = True)
My desired behavior is:
1 value from Parse is sent to res_before_get and then i do smth with it.
then another values from Parse is sent to res_before_get and so on.
Post
Get
Post
Get
But currently Scrapy takes 2 values from Parse and adds them to queue , then sends 2 requests from res_before_get. Thus im getting duplicate results.
Post
Post
Get
Get
What do I miss?
P.S.
This is asp.net site. Its logic is as follows:
makes POST request with search payload.
Make GET request to get actual data.
Both request share the same sessionID
Thats why it is important to preserve the order.
At the moment im getting POST1 and POST2. And since the sessionID is associated with POST2, both GET1 and GET2 return the same page.
Scrapy works asynchronously, so you cannot expect it to respect the order of your loops or anything.
If you need it to work sequentially, you'll have to accommodate the callbacks to work like that, for example:
def parse1(self, response):
...
yield Request(..., callback=self.parse2, meta={...(necessary information)...})
def parse2(self, response):
...
if (necessary information):
yield Request(...,
callback=self.parse2,
meta={...(remaining necessary information)...},
)

Resources