Handling ensure_future and its missing tasks - python-3.x

I have a streaming application that almost continuously takes the data given as input and sends an HTTP request using that value and does something with the returned value.
Obviously to speed things up I've used asyncio and aiohttp libraries in Python 3.7 to get the best performance, but it becomes hard to debug given how fast the data moves.
This is what my code looks like
'''
Gets the final requests
'''
async def apiRequest(info, url, session, reqType, post_data=''):
if reqType:
async with session.post(url, data = post_data) as response:
info['response'] = await response.text()
else:
async with session.get(url+post_data) as response:
info['response'] = await response.text()
logger.debug(info)
return info
'''
Loops through the batches and sends it for request
'''
async def main(data, listOfData):
tasks = []
async with ClientSession() as session:
for reqData in listOfData:
try:
task = asyncio.ensure_future(apiRequest(**reqData))
tasks.append(task)
except Exception as e:
print(e)
exc_type, exc_obj, exc_tb = sys.exc_info()
fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
print(exc_type, fname, exc_tb.tb_lineno)
responses = await asyncio.gather(*tasks)
return responses #list of APIResponses
'''
Streams data in and prepares batches to send for requests
'''
async def Kconsumer(data, loop, batchsize=100):
consumer = AIOKafkaConsumer(**KafkaConfigs)
await consumer.start()
dataPoints = []
async for msg in consumer:
try:
sys.stdout.flush()
consumedMsg = loads(msg.value.decode('utf-8'))
if consumedMsg['tid']:
dataPoints.append(loads(msg.value.decode('utf-8')))
if len(dataPoints)==batchsize or time.time() - startTime>5:
'''
#1: The task below goes and sends HTTP GET requests in bulk using aiohttp
'''
task = asyncio.ensure_future(getRequests(data, dataPoints))
res = await asyncio.gather(*[task])
if task.done():
outputs = []
'''
#2: Does some ETL on the returned values
'''
ids = await asyncio.gather(*[doSomething(**{'tid':x['tid'],
'cid':x['cid'], 'tn':x['tn'],
'id':x['id'], 'ix':x['ix'],
'ac':x['ac'], 'output':to_dict(xmltodict.parse(x['response'],encoding='utf-8')),
'loop':loop, 'option':1}) for x in res[0]])
simplySaveDataIntoDataBase(id) # This is where I see some missing data in the database
dataPoints = []
except Exception as e:
logger.error(e)
logger.error(traceback.format_exc())
exc_type, exc_obj, exc_tb = sys.exc_info()
fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
logger.error(str(exc_type) +' '+ str(fname) +' '+ str(exc_tb.tb_lineno))
if __name__ == '__main__':
loop = asyncio.get_event_loop()
asyncio.ensure_future(Kconsumer(data, loop, batchsize=100))
loop.run_forever()
Does the ensure_future need to be awaited ?
How does aiohttp handle requests that come a little later than the others? Shouldn't it hold the whole batch back instead of forgetting about it altoghter?

Does the ensure_future need to be awaited ?
Yes, and your code is doing that already. await asyncio.gather(*tasks) awaits the provided tasks and returns their results in the same order.
Note that await asyncio.gather(*[task]) doesn't make sense, because it is equivalent to await asyncio.gather(task), which is again equivalent to await task. In other words, when you need the result of getRequests(data, dataPoints), you can write res = await getRequests(data, dataPoints) without the ceremony of first calling ensure_future() and then calling gather().
In fact, you almost never need to call ensure_future yourself:
if you need to await multiple tasks, you can pass coroutine objects directly to gather, e.g. gather(coroutine1(), coroutine2()).
if you need to spawn a background task, you can call asyncio.create_task(coroutine(...))
How does aiohttp handle requests that come a little later than the others? Shouldn't it hold the whole batch back instead of forgetting about it altoghter?
If you use gather, all requests must finish before any of them return. (That is not aiohttp policy, it's how gather works.) If you need to implement a timeout, you can use asyncio.wait_for or similar.

Related

asyncio wait - process results as they come

This script should take a list of initial tasks (URLs) and asynchronously make requests with aiohttp. And this part is done correctly. The problem is, since asyncio wait doesn't return actual results but only done/pending task set, I cant figure out where and how to process the results as they come, to make more requests and write data to DB. In this variant I placed the creation for a new task (make more requests...) inside the first one, which doesn't work.
PS. I am using wait because a book I am reading suggests using wait for more control over done and pending tasks and exceptions. Appreciate any help:)
async def fetch_content_2(session, url):
async with session.get(url) as result:
res = await result.text()
try:
new_link = BeautifulSoup(res, 'lxml').select_one('element on website 2')['href'])
# ***PROCESS AND WRITE SOME DATA TO DB***
except:
pass
async def fetch_content_1(session, url):
async with session.get(url) as result:
res = await result.text()
try:
link = BeautifulSoup(res, 'lxml').select_one('element on website 1')['href'])
# ***MAKE ANOTHER ASYNC REQUEST WITH NEW LINK***
asyncio.create_task(fetch_content_1(session,link))
except:
pass
async def main(tasks):
async with ClientSession() as session:
pending = [asyncio.create_task(fetch_content_1(session, url)) for url in tasks]
while pending:
done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
# print(f'Done count: {len(done)}')
# print(f'Pending count: {len(pending)}')
asyncio.run(main([url1, url2, ...]))
done and pending are sets of asyncio.Task objects. If you want to get the result of the task or its state you must get the values of the sets and call the method you need, check the (docs). Specifically you can get the result invoking the result method.
async def main(tasks):
async with ClientSession() as session:
pending = [asyncio.create_task(fetch_content_1(session, url)) for url in tasks]
while pending:
done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
res = done.pop().result()
# do some stuff with the result
Check the documentation to see the possible exceptions of call the result method and related methods. A exception may occur if the task had an internal error or the result is not ready (in this case shouldn't happen).

python asycio RuntimeWarning: coroutine was never awaited

I am trying to send many requests to a url (~50) concurrently.
from asyncio import Queue
import yaml
import asyncio
from aiohttp import ClientSession, TCPConnector
async def http_get(url, cookie):
cookie = cookie.split('; ')
cookie1 = cookie[0].split('=')
cookie2 = cookie[1].split('=')
cookies = {
cookie1[0]: cookie1[1],
cookie2[0]: cookie2[1]
}
async with ClientSession(cookies=cookies) as session:
async with session.get(url, ssl=False) as response:
return await response.json()
class FetchUtil:
def __init__(self):
self.config = yaml.safe_load(open('../config.yaml'))
def fetch(self):
asyncio.run(self.extract_objects())
async def http_get_objects(self, object_type, limit, offset):
path = '/path' + \
'?query=&filter=%s&limit=%s&offset=%s' % (
object_type,
limit,
offset)
return await self.http_get_domain(path)
async def http_get_objects_limit(self, object_type, offset):
result = await self.http_get_objects(
object_type,
self.config['object_limit'],
offset
)
return result['result']
async def http_get_domain(self, path):
return await http_get(
f'https://{self.config["domain"]}{path}',
self.config['cookie']
)
async def call_objects(self, object_type, offset):
result = await self.http_get_objects_limit(
object_type,
offset
)
return result
async def extract_objects(self):
calls = []
object_count = (await self.http_get_objects(
'PV', '1', '0'))['result']['count']
for i in range(0, object_count, self.config['object_limit']):
calls.append(self.call_objects('PV', str(i)))
queue = Queue()
for i in range(0, len(calls), self.config['call_limit']):
results = await asyncio.gather(*calls[i:self.config['call_limit']])
await queue.put(results)
After running this code using fetch as the entry point i get the following error message:
/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/events.py:88: RuntimeWarning: coroutine 'FetchUtil.call_objects' was never awaited
self._context.run(self._callback, *self._args)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
The program that stops executing after asyncio.gather returns for the first time. I am having trouble understanding this message since I thought that I diligently made sure all functions were async tasks. The only function i didn't await was call_objects since i wanted it to run concurrently.
https://xinhuang.github.io/posts/2017-07-31-common-mistakes-using-python3-asyncio.html#org630d301
in this article gives the following explanation:
This runtime warning can happen in many scenarios, but the cause are
same: A coroutine object is created by the invocation of an async
function, but is never inserted into an EventLoop.
I believed that was what i was doing when i called the async tasks with asyncio.gather.
I should note that when i put a print('url') inside http_get it outputs the first 50 urls like i want, the problem seems to occur when asyncio.gather returns for the first time.
The posted code has a logic error: [i:self.config['call_limit']] should be [i:i + self.config['call_limit']].
It causes the error because the expression evaluates to an entry slice in iterations after the first one, causing some of the calls coroutines to never get passed to gather, and therefore never awaited)
i don't actually understand why it didn't just keep executing the same requests many time instead of stopping with an error.
Because your i would get incremented, making it larger than call_limit in every loop iteration except the first. For example, assuming call_limit is 10, i will be 0 in the first iteration, and you'll await calls[0:10], so far so good. But in the next iteration, i will be 10, and you'll be awaiting calls[10:10], an empty slice. In the iteration after that you'll await calls[20:10] (also an empty slice, although logically it should be an error), then calls[30:10], again empty, and so on. Only the first iteration picks up actual members of the list.

Sequence 2 asyncio function calls in python

Question: when I call generateCSVFromIncidentIdsWithArgs(list) twice with 2 different lists lets say "list1" and "list2", though the first list response appears correctly, the second list response has the results of list1 as well. I am not sure which variable to reset before making the second call so that the second list call appears without mixing the first list results.
function definition: function fetches response from a url with provided IDs in list
async def fetch(self, url, incident, session, csv):
async with session.get(url) as response:
self.format_output(incident, await response.read())
async def bound_fetch(self, sem, url, incident, session, csv):
# Getter function with semaphore.
async with sem:
await self.fetch(url, incident, session, csv)
async def run(self, r, csv):
url = self.conversations_url
tasks = []
# create instance of Semaphore
sem = asyncio.Semaphore(1000)
sslcontext = ssl.create_default_context(cafile=certifi.where())
sslcontext.load_cert_chain('certificate.pem',
'plainkey.pem')
# Create client session that will ensure we dont open new connection
# per each request.
async with ClientSession(connector=aiohttp.TCPConnector(ssl=sslcontext)) as session:
for i in r:
# pass Semaphore and session to every GET request
task = asyncio.ensure_future(self.bound_fetch(sem, url + str(i), i, session, csv))
tasks.append(task)
responses = await asyncio.gather(*tasks)
return responses
function call:
def generateCSVFromIncidentIdsWithArgs(list):
incident_list = list
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
future = asyncio.ensure_future(run(incident_list, True))
loop.run_until_complete(future)
generateCSVFromIncidentIdsWithArgs(list1)
generateCSVFromIncidentIdsWithArgs(list2)

Python async: Waiting for stdin input while doing other stuff

I'm trying to create a WebSocket command line client that waits for messages from a WebSocket server but waits for user input at the same time.
Regularly polling multiple online sources every second works fine on the server, (the one running at localhost:6789 in this example), but instead of using Python's normal sleep() method, it uses asyncio.sleep(), which makes sense because sleeping and asynchronously sleeping aren't the same thing, at least not under the hood.
Similarly, waiting for user input and asynchronously waiting for user input aren't the same thing, but I can't figure out how to asynchronously wait for user input in the same way that I can asynchronously wait for an arbitrary amount of seconds, so that the client can deal with incoming messages from the WebSocket server while simultaneously waiting for user input.
The comment below in the else-clause of monitor_cmd() hopefully explains what I'm getting at:
import asyncio
import json
import websockets
async def monitor_ws():
uri = 'ws://localhost:6789'
async with websockets.connect(uri) as websocket:
async for message in websocket:
print(json.dumps(json.loads(message), indent=2, sort_keys=True))
async def monitor_cmd():
while True:
sleep_instead = False
if sleep_instead:
await asyncio.sleep(1)
print('Sleeping works fine.')
else:
# Seems like I need the equivalent of:
# line = await asyncio.input('Is this your line? ')
line = input('Is this your line? ')
print(line)
try:
asyncio.get_event_loop().run_until_complete(asyncio.wait([
monitor_ws(),
monitor_cmd()
]))
except KeyboardInterrupt:
quit()
This code just waits for input indefinitely and does nothing else in the meantime, and I understand why. What I don't understand, is how to fix it. :)
Of course, if I'm thinking about this problem in the wrong way, I'd be very happy to learn how to remedy that as well.
You can use the aioconsole third-party package to interact with stdin in an asyncio-friendly manner:
line = await aioconsole.ainput('Is this your line? ')
Borrowing heavily from aioconsole, if you would rather avoid using an external library you could define your own async input function:
async def ainput(string: str) -> str:
await asyncio.get_event_loop().run_in_executor(
None, lambda s=string: sys.stdout.write(s+' '))
return await asyncio.get_event_loop().run_in_executor(
None, sys.stdin.readline)
Borrowing heavily from aioconsole, there are 2 ways to handle.
start a new daemon thread:
import sys
import asyncio
import threading
from concurrent.futures import Future
async def run_as_daemon(func, *args):
future = Future()
future.set_running_or_notify_cancel()
def daemon():
try:
result = func(*args)
except Exception as e:
future.set_exception(e)
else:
future.set_result(result)
threading.Thread(target=daemon, daemon=True).start()
return await asyncio.wrap_future(future)
async def main():
data = await run_as_daemon(sys.stdin.readline)
print(data)
if __name__ == "__main__":
asyncio.run(main())
use stream reader:
import sys
import asyncio
async def get_steam_reader(pipe) -> asyncio.StreamReader:
loop = asyncio.get_event_loop()
reader = asyncio.StreamReader(loop=loop)
protocol = asyncio.StreamReaderProtocol(reader)
await loop.connect_read_pipe(lambda: protocol, pipe)
return reader
async def main():
reader = await get_steam_reader(sys.stdin)
data = await reader.readline()
print(data)
if __name__ == "__main__":
asyncio.run(main())

How to gather task results in Trio?

I wrote a script that uses a nursery and the asks module to loop through and call an API based upon the loop variables. I get responses but don't know how to return the data like you would with asyncio.
I also have a question on limiting the APIs to 5 per second.
from datetime import datetime
import asks
import time
import trio
asks.init("trio")
s = asks.Session(connections=4)
async def main():
start_time = time.time()
api_key = 'API-KEY'
org_id = 'ORG-ID'
networkIds = ['id1','id2','idn']
url = 'https://api.meraki.com/api/v0/networks/{0}/airMarshal?timespan=3600'
headers = {'X-Cisco-Meraki-API-Key': api_key, 'Content-Type': 'application/json'}
async with trio.open_nursery() as nursery:
for i in networkIds:
nursery.start_soon(fetch, url.format(i), headers)
print("Total time:", time.time() - start_time)
async def fetch(url, headers):
print("Start: ", url)
response = await s.get(url, headers=headers)
print("Finished: ", url, len(response.content), response.status_code)
if __name__ == "__main__":
trio.run(main)
When I run nursery.start_soon(fetch...) , I am printing data within fetch, but how do I return the data? I didn't see anything similar to asyncio.gather(*tasks) function.
Also, I can limit the number of sessions to 1-4, which helps get down below the 5 API per second limit, but was wondering if there was a built in way to ensure that no more than 5 APIs get called in any given second?
Returning data: pass the networkID and a dict to the fetch tasks:
async def main():
…
results = {}
async with trio.open_nursery() as nursery:
for i in networkIds:
nursery.start_soon(fetch, url.format(i), headers, results, i)
## results are available here
async def fetch(url, headers, results, i):
print("Start: ", url)
response = await s.get(url, headers=headers)
print("Finished: ", url, len(response.content), response.status_code)
results[i] = response
Alternately, create a trio.Queue to which you put the results; your main task can then read the results from the queue.
API limit: create a trio.Queue(10) and start a task along these lines:
async def limiter(queue):
while True:
await trio.sleep(0.2)
await queue.put(None)
Pass that queue to fetch, as another argument, and call await limit_queue.get() before each API call.
Based on this answers, you can define the following function:
async def gather(*tasks):
async def collect(index, task, results):
task_func, *task_args = task
results[index] = await task_func(*task_args)
results = {}
async with trio.open_nursery() as nursery:
for index, task in enumerate(tasks):
nursery.start_soon(collect, index, task, results)
return [results[i] for i in range(len(tasks))]
You can then use trio in the exact same way as asyncio by simply patching trio (adding the gather function):
import trio
trio.gather = gather
Here is a practical example:
async def child(x):
print(f"Child sleeping {x}")
await trio.sleep(x)
return 2*x
async def parent():
tasks = [(child, t) for t in range(3)]
return await trio.gather(*tasks)
print("results:", trio.run(parent))
Technically, trio.Queue has been deprecated in trio 0.9. It has been replaced by trio.open_memory_channel.
Short example:
sender, receiver = trio.open_memory_channel(len(networkIds)
async with trio.open_nursery() as nursery:
for i in networkIds:
nursery.start_soon(fetch, sender, url.format(i), headers)
async for value in receiver:
# Do your job here
pass
And in your fetch function you should call async sender.send(value) somewhere.
When I run nursery.start_soon(fetch...) , I am printing data within fetch, but how do I return the data? I didn't see anything similar to asyncio.gather(*tasks) function.
You're asking two different questions, so I'll just answer this one. Matthias already answered your other question.
When you call start_soon(), you are asking Trio to run the task in the background, and then keep going. This is why Trio is able to run fetch() several times concurrently. But because Trio keeps going, there is no way to "return" the result the way a Python function normally would. where would it even return to?
You can use a queue to let fetch() tasks send results to another task for additional processing.
To create a queue:
response_queue = trio.Queue()
When you start your fetch tasks, pass the queue as an argument and send a sentintel to the queue when you're done:
async with trio.open_nursery() as nursery:
for i in networkIds:
nursery.start_soon(fetch, url.format(i), headers)
await response_queue.put(None)
After you download a URL, put the response into the queue:
async def fetch(url, headers, response_queue):
print("Start: ", url)
response = await s.get(url, headers=headers)
# Add responses to queue
await response_queue.put(response)
print("Finished: ", url, len(response.content), response.status_code)
With the changes above, your fetch tasks will put responses into the queue. Now you need to read responses from the queue so you can process them. You might add a new function to do this:
async def process(response_queue):
async for response in response_queue:
if response is None:
break
# Do whatever processing you want here.
You should start this process function as a background task before you start any fetch tasks so that it will process responses as soon as they are received.
Read more in the Synchronizing and Communicating Between Tasks section of the Trio documentation.

Resources