asyncio wait - process results as they come - python-3.x

This script should take a list of initial tasks (URLs) and asynchronously make requests with aiohttp, and that part works correctly. The problem is that since asyncio.wait doesn't return the actual results, only the done/pending task sets, I can't figure out where and how to process the results as they come in, to make more requests and write data to the DB. In this variant I placed the creation of a new task (make more requests...) inside the first one, which doesn't work.
PS. I am using wait because a book I am reading suggests using wait for more control over done and pending tasks and exceptions. Appreciate any help :)
async def fetch_content_2(session, url):
    async with session.get(url) as result:
        res = await result.text()
        try:
            new_link = BeautifulSoup(res, 'lxml').select_one('element on website 2')['href']
            # ***PROCESS AND WRITE SOME DATA TO DB***
        except:
            pass

async def fetch_content_1(session, url):
    async with session.get(url) as result:
        res = await result.text()
        try:
            link = BeautifulSoup(res, 'lxml').select_one('element on website 1')['href']
            # ***MAKE ANOTHER ASYNC REQUEST WITH NEW LINK***
            asyncio.create_task(fetch_content_2(session, link))
        except:
            pass

async def main(tasks):
    async with ClientSession() as session:
        pending = [asyncio.create_task(fetch_content_1(session, url)) for url in tasks]
        while pending:
            done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
            # print(f'Done count: {len(done)}')
            # print(f'Pending count: {len(pending)}')

asyncio.run(main([url1, url2, ...]))

done and pending are sets of asyncio.Task objects. If you want the result of a task or its state, you have to take the tasks out of the sets and call the method you need (check the docs). Specifically, you can get the result by calling the result() method.
async def main(tasks):
    async with ClientSession() as session:
        pending = [asyncio.create_task(fetch_content_1(session, url)) for url in tasks]
        while pending:
            done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
            res = done.pop().result()
            # do some stuff with the result
Check the documentation for the exceptions that result() and related methods can raise. An exception may occur if the task had an internal error or if the result is not ready yet (which shouldn't happen in this case).
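One way to do what the question asks (process each result as it arrives and schedule follow-up requests from main) is to have fetch_content_1 return the extracted link instead of creating the next task itself, and then add follow-up tasks to the pending set. A minimal sketch under that assumption; the error handling and the fetch_content_2 follow-up are illustrative:
async def main(tasks):
    async with ClientSession() as session:
        pending = {asyncio.create_task(fetch_content_1(session, url)) for url in tasks}
        while pending:
            done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                try:
                    link = task.result()  # re-raises if the task failed
                except Exception as exc:
                    print(f"task failed: {exc!r}")
                    continue
                if link is not None:
                    # assumes fetch_content_1 was changed to return the parsed link;
                    # fetch_content_2 then makes the second request and writes to the DB
                    pending.add(asyncio.create_task(fetch_content_2(session, link)))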

Related

asyncio.wait not returning on first exception

I have an AMQP publisher class with the following methods. on_response is the callback that is called when a consumer sends a message back to the RPC queue I set up, i.e. the self.callback_queue.name you see in the reply_to of the Message. publish publishes to a direct exchange with a routing key that has multiple consumers (very similar to a fanout), and multiple responses come back. I create a number of futures equal to the number of responses I expect, and asyncio.wait for those futures to complete. As I get responses back on the queue and consume them, I set the results on the futures.
async def on_response(self, message: IncomingMessage):
    if message.correlation_id is None:
        logger.error(f"Bad message {message!r}")
        await message.ack()
        return
    body = message.body.decode('UTF-8')
    future = self.futures[message.correlation_id].pop()
    if hasattr(body, 'error'):
        future.set_exception(body)
    else:
        future.set_result(body)
    await message.ack()
async def publish(self, routing_key, expected_response_count, msg, timeout=None, return_partial=False):
    if not self.connected:
        logger.info("Publisher not connected. Waiting to connect first.")
        await self.connect()
    correlation_id = str(uuid.uuid4())
    futures = [self.loop.create_future() for _ in range(expected_response_count)]
    self.futures[correlation_id] = futures
    await self.exchange.publish(
        Message(
            str(msg).encode(),
            content_type="text/plain",
            correlation_id=correlation_id,
            reply_to=self.callback_queue.name,
        ),
        routing_key=routing_key,
    )
    done, pending = await asyncio.wait(futures, timeout=timeout, return_when=asyncio.FIRST_EXCEPTION)
    if not return_partial and pending:
        raise asyncio.TimeoutError(f'Failed to return all results for publish to {routing_key}')
    for f in pending:
        f.cancel()
    del self.futures[correlation_id]
    results = []
    for future in done:
        try:
            results.append(json.loads(future.result()))
        except json.decoder.JSONDecodeError as e:
            logger.error(f'Client did not return JSON!! {e!r}')
            logger.info(future.result())
    return results
My goal is to wait until either all futures are finished or a timeout occurs. That part works nicely at the moment. What doesn't work is that when I added return_when=asyncio.FIRST_EXCEPTION, the asyncio.wait does not finish after the first call of future.set_exception(...) as I thought it would.
What do I need to do with the future so that when I get a response back and see that an error occurred on the consumer side (before the timeout, or even before other responses), the await asyncio.wait stops blocking? I was looking at the documentation and it says:
The function will return when any future finishes by raising an exception
when return_when=asyncio.FIRST_EXCEPTION. My first thought is that I'm not raising an exception in my future correctly, but I'm having trouble figuring out exactly how I should do that. From the API documentation for the Future class, it looks like I'm doing the right thing.
When I created a minimal example, I realized I was actually doing things MOSTLY right after all, and that I had glossed over other errors causing this not to work. The most important change I had to make was to actually pass an Exception object (a subclass of BaseException) to the set_exception method. Here is my minimal example:
import asyncio

async def set_after(future, t, body, raise_exception):
    await asyncio.sleep(t)
    if raise_exception:
        future.set_exception(Exception("problem"))
    else:
        future.set_result(body)
        print(body)

async def main():
    loop = asyncio.get_event_loop()
    futures = [loop.create_future() for _ in range(2)]
    asyncio.create_task(set_after(futures[0], 3, 'hello', raise_exception=True))
    asyncio.create_task(set_after(futures[1], 7, 'world', raise_exception=False))
    print(futures)
    done, pending = await asyncio.wait(futures, timeout=10, return_when=asyncio.FIRST_EXCEPTION)
    print(done)
    print(pending)

asyncio.run(main())
In this line of code, if hasattr(body, 'error'):, body was still a string. I thought it had already been parsed as JSON at that point. I should have been using "error" in body as my condition in any case. Whoops!
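For reference, a rough sketch of the check that was intended inside on_response, assuming the consumer reports failures under an "error" key in a JSON body (that key name is an assumption here):
body = message.body.decode('UTF-8')
payload = json.loads(body)          # body is only a str until it is parsed
if "error" in payload:              # membership test on the parsed dict, not hasattr()
    future.set_exception(Exception(payload["error"]))
else:
    future.set_result(body)         # keep the raw string; publish() parses it again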

Using asyncio for doing a/b testing in Python

Let's say there's some API that's already running in production, and you've created another API which you want to A/B test using the incoming requests that hit the production API. I was wondering whether it's possible to do something like this (I am aware of people doing traffic splits by keeping two different API versions for A/B testing, etc.):
As soon as you get the incoming request for your production API, you make an async request to your new API, then carry on with the rest of the production API code. Just before returning the final response to the caller, you check whether the result of that async task you created earlier is ready. If it's available, you return that instead of the current API's response.
What's the best way to do something like this? Do we write a decorator for it, or something else? I am a bit worried about the edge cases that can happen if we use async here. Does anyone have pointers on making the code or the whole approach better?
Thanks for your time!
Some pseudo-code for the approach above:
import asyncio

def call_old_api():
    pass

async def call_new_api():
    pass

async def main():
    task = asyncio.Task(call_new_api())
    oldResp = call_old_api()
    resp = await task
    if task.done():
        return resp
    else:
        task.cancel()  # maybe
        return oldResp

asyncio.run(main())
You can't just execute the blocking call_old_api() inside an asyncio coroutine. There's a detailed explanation of why here. Please make sure you understand it, because depending on how your server works, you may not be able to do what you want (for example, running an async API on a sync server defeats the point of writing async code).
Assuming you understand the above and you have an async server, you can call the old sync API in a thread and use a task to run the new API:
task = asyncio.Task(call_new_api())
oldResp = await in_thread(call_old_api)
if task.done():
    return task.result()  # keep in mind that task.result() may raise an exception if the new API request failed, but that's probably OK for you
else:
    task.cancel()  # yes, but you should take care of the cancelling, see https://stackoverflow.com/a/43810272/1113207
    return oldResp
I think you can go even further: instead of always waiting for the old API to complete, you can run both APIs concurrently and return whichever finishes first (in case the new API is faster than the old one). With all the checks and suggestions above, it should look something like this:
import asyncio
import random
import time
from contextlib import suppress

def call_old_api():
    time.sleep(random.randint(0, 2))
    return "OLD"

async def call_new_api():
    await asyncio.sleep(random.randint(0, 2))
    return "NEW"

async def in_thread(func):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, func)

async def ensure_cancelled(task):
    task.cancel()
    with suppress(asyncio.CancelledError):
        await task

async def main():
    old_api_task = asyncio.Task(in_thread(call_old_api))
    new_api_task = asyncio.Task(call_new_api())
    done, pending = await asyncio.wait(
        [old_api_task, new_api_task], return_when=asyncio.FIRST_COMPLETED
    )
    if pending:
        for task in pending:
            await ensure_cancelled(task)
    finished_task = done.pop()
    res = finished_task.result()
    print(res)

asyncio.run(main())

Sequence 2 asyncio function calls in python

Question: when I call generateCSVFromIncidentIdsWithArgs(list) twice with two different lists, say "list1" and "list2", the first list's response appears correctly, but the second list's response contains the results of list1 as well. I am not sure which variable to reset before making the second call so that the second call's results don't get mixed with the first list's results.
Function definition: fetches the response from a URL for each of the provided IDs in the list.
async def fetch(self, url, incident, session, csv):
    async with session.get(url) as response:
        self.format_output(incident, await response.read())

async def bound_fetch(self, sem, url, incident, session, csv):
    # Getter function with semaphore.
    async with sem:
        await self.fetch(url, incident, session, csv)

async def run(self, r, csv):
    url = self.conversations_url
    tasks = []
    # create instance of Semaphore
    sem = asyncio.Semaphore(1000)
    sslcontext = ssl.create_default_context(cafile=certifi.where())
    sslcontext.load_cert_chain('certificate.pem', 'plainkey.pem')
    # Create client session that will ensure we don't open a new connection
    # per each request.
    async with ClientSession(connector=aiohttp.TCPConnector(ssl=sslcontext)) as session:
        for i in r:
            # pass Semaphore and session to every GET request
            task = asyncio.ensure_future(self.bound_fetch(sem, url + str(i), i, session, csv))
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
        return responses
Function call:
def generateCSVFromIncidentIdsWithArgs(list):
    incident_list = list
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    future = asyncio.ensure_future(run(incident_list, True))
    loop.run_until_complete(future)

generateCSVFromIncidentIdsWithArgs(list1)
generateCSVFromIncidentIdsWithArgs(list2)

Handling ensure_future and its missing tasks

I have a streaming application that almost continuously takes the data given as input, sends an HTTP request using that value, and does something with the returned value.
To speed things up I've used the asyncio and aiohttp libraries in Python 3.7 to get the best performance, but it becomes hard to debug given how fast the data moves.
This is what my code looks like:
'''
Gets the final requests
'''
async def apiRequest(info, url, session, reqType, post_data=''):
    if reqType:
        async with session.post(url, data=post_data) as response:
            info['response'] = await response.text()
    else:
        async with session.get(url + post_data) as response:
            info['response'] = await response.text()
    logger.debug(info)
    return info

'''
Loops through the batches and sends it for request
'''
async def main(data, listOfData):
    tasks = []
    async with ClientSession() as session:
        for reqData in listOfData:
            try:
                task = asyncio.ensure_future(apiRequest(**reqData))
                tasks.append(task)
            except Exception as e:
                print(e)
                exc_type, exc_obj, exc_tb = sys.exc_info()
                fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
                print(exc_type, fname, exc_tb.tb_lineno)
        responses = await asyncio.gather(*tasks)
        return responses  # list of APIResponses

'''
Streams data in and prepares batches to send for requests
'''
async def Kconsumer(data, loop, batchsize=100):
    consumer = AIOKafkaConsumer(**KafkaConfigs)
    await consumer.start()
    dataPoints = []
    async for msg in consumer:
        try:
            sys.stdout.flush()
            consumedMsg = loads(msg.value.decode('utf-8'))
            if consumedMsg['tid']:
                dataPoints.append(loads(msg.value.decode('utf-8')))
            if len(dataPoints) == batchsize or time.time() - startTime > 5:
                '''
                #1: The task below goes and sends HTTP GET requests in bulk using aiohttp
                '''
                task = asyncio.ensure_future(getRequests(data, dataPoints))
                res = await asyncio.gather(*[task])
                if task.done():
                    outputs = []
                    '''
                    #2: Does some ETL on the returned values
                    '''
                    ids = await asyncio.gather(*[doSomething(**{'tid': x['tid'],
                                                                'cid': x['cid'], 'tn': x['tn'],
                                                                'id': x['id'], 'ix': x['ix'],
                                                                'ac': x['ac'], 'output': to_dict(xmltodict.parse(x['response'], encoding='utf-8')),
                                                                'loop': loop, 'option': 1}) for x in res[0]])
                    simplySaveDataIntoDataBase(id)  # This is where I see some missing data in the database
                dataPoints = []
        except Exception as e:
            logger.error(e)
            logger.error(traceback.format_exc())
            exc_type, exc_obj, exc_tb = sys.exc_info()
            fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
            logger.error(str(exc_type) + ' ' + str(fname) + ' ' + str(exc_tb.tb_lineno))

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    asyncio.ensure_future(Kconsumer(data, loop, batchsize=100))
    loop.run_forever()
Does the ensure_future need to be awaited?
How does aiohttp handle requests that come a little later than the others? Shouldn't it hold the whole batch back instead of forgetting about it altogether?
Does the ensure_future need to be awaited?
Yes, and your code is doing that already. await asyncio.gather(*tasks) awaits the provided tasks and returns their results in the same order.
Note that await asyncio.gather(*[task]) doesn't make sense, because it is equivalent to await asyncio.gather(task), which is again equivalent to await task. In other words, when you need the result of getRequests(data, dataPoints), you can write res = await getRequests(data, dataPoints) without the ceremony of first calling ensure_future() and then calling gather().
In fact, you almost never need to call ensure_future yourself (a short sketch follows the list below):
if you need to await multiple tasks, you can pass coroutine objects directly to gather, e.g. gather(coroutine1(), coroutine2()).
if you need to spawn a background task, you can call asyncio.create_task(coroutine(...))
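A minimal sketch of both patterns; fetch_one is a hypothetical stand-in for a real request coroutine, not part of the original code:
import asyncio

async def fetch_one(i):
    await asyncio.sleep(0.1)   # stand-in for an aiohttp request
    return i * 2

async def main():
    # await several coroutines at once: no ensure_future needed
    results = await asyncio.gather(fetch_one(1), fetch_one(2), fetch_one(3))
    print(results)             # [2, 4, 6]

    # spawn a true background task and await it later
    background = asyncio.create_task(fetch_one(42))
    print(await background)    # 84

asyncio.run(main())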
How does aiohttp handle requests that come a little later than the others? Shouldn't it hold the whole batch back instead of forgetting about it altogether?
If you use gather, all requests must finish before any of them return. (That is not aiohttp policy, it's how gather works.) If you need to implement a timeout, you can use asyncio.wait_for or similar.
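For example, here is a minimal sketch of capping a whole gathered batch with asyncio.wait_for; the 2-second limit and the slow_request coroutine are illustrative, not from the original code:
import asyncio

async def slow_request(delay):
    await asyncio.sleep(delay)   # stand-in for a slow HTTP request
    return delay

async def main():
    try:
        # cancel the whole batch if it takes longer than 2 seconds overall
        results = await asyncio.wait_for(
            asyncio.gather(slow_request(0.5), slow_request(1), slow_request(5)),
            timeout=2,
        )
    except asyncio.TimeoutError:
        results = None           # decide here how to handle a late batch
    print(results)

asyncio.run(main())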

How to gather task results in Trio?

I wrote a script that uses a nursery and the asks module to loop through and call an API based upon the loop variables. I get responses but don't know how to return the data like you would with asyncio.
I also have a question about limiting the API calls to 5 per second.
from datetime import datetime
import asks
import time
import trio

asks.init("trio")
s = asks.Session(connections=4)

async def main():
    start_time = time.time()
    api_key = 'API-KEY'
    org_id = 'ORG-ID'
    networkIds = ['id1', 'id2', 'idn']
    url = 'https://api.meraki.com/api/v0/networks/{0}/airMarshal?timespan=3600'
    headers = {'X-Cisco-Meraki-API-Key': api_key, 'Content-Type': 'application/json'}
    async with trio.open_nursery() as nursery:
        for i in networkIds:
            nursery.start_soon(fetch, url.format(i), headers)
    print("Total time:", time.time() - start_time)

async def fetch(url, headers):
    print("Start: ", url)
    response = await s.get(url, headers=headers)
    print("Finished: ", url, len(response.content), response.status_code)

if __name__ == "__main__":
    trio.run(main)
When I run nursery.start_soon(fetch, ...), I am printing data within fetch, but how do I return the data? I didn't see anything similar to the asyncio.gather(*tasks) function.
Also, I can limit the number of sessions to 1-4, which helps stay below the 5-calls-per-second limit, but I was wondering whether there is a built-in way to ensure that no more than 5 API calls are made in any given second?
Returning data: pass the networkID and a dict to the fetch tasks:
async def main():
    …
    results = {}
    async with trio.open_nursery() as nursery:
        for i in networkIds:
            nursery.start_soon(fetch, url.format(i), headers, results, i)
    ## results are available here

async def fetch(url, headers, results, i):
    print("Start: ", url)
    response = await s.get(url, headers=headers)
    print("Finished: ", url, len(response.content), response.status_code)
    results[i] = response
Alternatively, create a trio.Queue into which you put the results; your main task can then read the results from the queue.
API limit: create a trio.Queue(10) and start a task along these lines:
async def limiter(queue):
    while True:
        await trio.sleep(0.2)
        await queue.put(None)
Pass that queue to fetch, as another argument, and call await limit_queue.get() before each API call.
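A rough sketch of how that could be wired together. Since trio.Queue was later removed (see the note further down), this version expresses the same idea with a memory channel acting as the token source; it reuses s, url, headers and networkIds from the question, and keeps the 0.2-second interval, i.e. at most 5 calls per second:
async def limiter(token_send):
    # drip one token every 0.2 s, allowing at most 5 API calls per second
    while True:
        await trio.sleep(0.2)
        await token_send.send(None)

async def fetch(url, headers, token_receive):
    await token_receive.receive()   # wait for a token before hitting the API
    response = await s.get(url, headers=headers)
    print("Finished:", url, response.status_code)

async def main():
    token_send, token_receive = trio.open_memory_channel(0)
    async with trio.open_nursery() as outer:
        outer.start_soon(limiter, token_send)
        async with trio.open_nursery() as nursery:
            for i in networkIds:
                nursery.start_soon(fetch, url.format(i), headers, token_receive)
        # all fetches are done here; stop the limiter so the outer nursery can exit
        outer.cancel_scope.cancel()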
Based on this answer, you can define the following function:
async def gather(*tasks):

    async def collect(index, task, results):
        task_func, *task_args = task
        results[index] = await task_func(*task_args)

    results = {}
    async with trio.open_nursery() as nursery:
        for index, task in enumerate(tasks):
            nursery.start_soon(collect, index, task, results)
    return [results[i] for i in range(len(tasks))]
You can then use trio in the exact same way as asyncio by simply patching trio (adding the gather function):
import trio
trio.gather = gather
Here is a practical example:
async def child(x):
    print(f"Child sleeping {x}")
    await trio.sleep(x)
    return 2 * x

async def parent():
    tasks = [(child, t) for t in range(3)]
    return await trio.gather(*tasks)

print("results:", trio.run(parent))
Technically, trio.Queue has been deprecated in trio 0.9. It has been replaced by trio.open_memory_channel.
Short example:
sender, receiver = trio.open_memory_channel(len(networkIds))
async with trio.open_nursery() as nursery:
    for i in networkIds:
        nursery.start_soon(fetch, sender, url.format(i), headers)

    async for value in receiver:
        # Do your job here
        pass
And in your fetch function you should call await sender.send(value) somewhere.
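A slightly fuller sketch of that idea, reusing fetch, s, url, headers and networkIds from the question. Closing every clone of the send channel when its task finishes is what lets the async for loop terminate:
async def fetch(sender, url, headers):
    async with sender:                   # closes this clone when the task ends
        response = await s.get(url, headers=headers)
        await sender.send((url, response))

async def main():
    sender, receiver = trio.open_memory_channel(len(networkIds))
    async with trio.open_nursery() as nursery:
        async with sender:
            for i in networkIds:
                nursery.start_soon(fetch, sender.clone(), url.format(i), headers)
        # the loop below ends once every clone of the sender has been closed
        async for net_url, response in receiver:
            print("Finished:", net_url, response.status_code)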
When I run nursery.start_soon(fetch, ...), I am printing data within fetch, but how do I return the data? I didn't see anything similar to the asyncio.gather(*tasks) function.
You're asking two different questions, so I'll just answer this one. Matthias already answered your other question.
When you call start_soon(), you are asking Trio to run the task in the background and then keep going. This is why Trio is able to run fetch() several times concurrently. But because Trio keeps going, there is no way to "return" the result the way a Python function normally would. Where would it even return to?
You can use a queue to let fetch() tasks send results to another task for additional processing.
To create a queue:
response_queue = trio.Queue()
When you start your fetch tasks, pass the queue as an argument, and send a sentinel to the queue when you're done:
async with trio.open_nursery() as nursery:
    for i in networkIds:
        nursery.start_soon(fetch, url.format(i), headers, response_queue)
await response_queue.put(None)
After you download a URL, put the response into the queue:
async def fetch(url, headers, response_queue):
    print("Start: ", url)
    response = await s.get(url, headers=headers)
    # Add responses to queue
    await response_queue.put(response)
    print("Finished: ", url, len(response.content), response.status_code)
With the changes above, your fetch tasks will put responses into the queue. Now you need to read responses from the queue so you can process them. You might add a new function to do this:
async def process(response_queue):
    async for response in response_queue:
        if response is None:
            break
        # Do whatever processing you want here.
You should start this process function as a background task before you start any fetch tasks so that it will process responses as soon as they are received.
Read more in the Synchronizing and Communicating Between Tasks section of the Trio documentation.
