Is it more efficient to use create_task(), or gather()? - python-3.x

I'm still at the basics of asynchronous python, and some things confuse me.
import asyncio
loop=asyncio.get_event_loop()
for variation in args:
    loop.create_task(coroutine(variation))
loop.run_forever()
Seems very similar to this
import asyncio
loop=asyncio.get_event_loop()
loop.run_forever(
    asyncio.gather(
        coroutine(variation_1),
        coroutine(variation_2),
        ...))
They might do the same thing, but that doesn't seem useful, so what's the difference?

As mentioned in the comments, your second example should use run_until_complete, not run_forever.
They might do the same thing, but that doesn't seem useful, so what's the difference?
asyncio.gather is a higher-level construct.
create_task submits the coroutine to the event loop, effectively allowing it to run "in the background" (provided the event loop itself is active). As the name implies, it returns a task, a handle over the execution of the coroutine, most importantly providing the ability to cancel it. You can create any number of such tasks in an event loop, and they will all run until their respective completions.
asyncio.gather is for when you are actually interested in the results of the coroutines you have spawned. It spawns them as if with create_task, allowing them to run concurrently, but also waits for all of them to complete, and then returns their respective results in the order the coroutines were passed in (or raises an exception if any of them raised one).
For example, if you have a download coroutine that downloads a URL and returns its contents, and you are downloading a list of URLs, gather allows you to match URLs to their data:
url_list = [...]
data_list = await asyncio.gather(*[download(url) for url in url_list])
# url_list and data_list now have matching elements, so this works:
for url, data in zip(url_list, data_list):
    ...
Doing this with just create_task would be much more complicated.
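To make the distinction concrete, here is a minimal sketch (assuming Python 3.7+ for asyncio.run and asyncio.create_task). The download and heartbeat coroutines and the example URLs are hypothetical stand-ins, not part of the original question: create_task is used for a background task that is only ever cancelled, while gather is used where the results matter.

```python
import asyncio


async def download(url):
    """Hypothetical stand-in for a real download coroutine."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"contents of {url}"


async def heartbeat():
    """Hypothetical background job; we never need its result."""
    while True:
        await asyncio.sleep(0.005)


async def main(url_list):
    # create_task: fire-and-forget, but keep the handle so we can cancel it.
    ticker = asyncio.create_task(heartbeat())

    # gather: run the downloads concurrently and collect results in input order.
    data_list = await asyncio.gather(*[download(url) for url in url_list])

    ticker.cancel()
    try:
        await ticker  # let the cancellation propagate
    except asyncio.CancelledError:
        pass

    return dict(zip(url_list, data_list)), ticker.cancelled()


url_list = ["http://a.example", "http://b.example"]
pairs, was_cancelled = asyncio.run(main(url_list))
```

Here pairs maps each URL to its data because gather preserves the order of its arguments, and the task handle is what makes the background job cancellable.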

Related

Is there any linter that detects blocking calls in an async function?

https://www.aeracode.org/2018/02/19/python-async-simplified/
It's not going to ruin your day if you call a non-blocking synchronous
function, like this:
def get_chat_id(name):
    return "chat-%s" % name
async def main():
    result = get_chat_id("django")
However, if you call a blocking function, like the Django ORM, the
code inside the async function will look identical, but now it's
dangerous code that might block the entire event loop as it's not
awaiting:
def get_chat_id(name):
    return Chat.objects.get(name=name).id
async def main():
    result = get_chat_id("django")
You can see how it's easy to have a non-blocking function that
"accidentally" becomes blocking if a programmer is not super-aware of
everything that calls it. This is why I recommend you never call
anything synchronous from an async function without doing it safely,
or without knowing beforehand it's a non-blocking standard library
function, like os.path.join.
So I am looking for a way to automatically catch instances of this mistake. Are there any linters for Python which will report sync function calls from within an async function as a violation?
Can I configure Pylint or Flake8 to do this?
I don't necessarily mind if it catches the first case above too (which is harmless).
Update:
On one level I realise this is a stupid question, as pointed out in Mikhail's answer. What we need is a definition of a "dangerous synchronous function" that the linter should detect.
So for purpose of this question I give the following definition:
A "dangerous synchronous function" is one that performs IO operations. These are the same operations which have to be monkey-patched by gevent, for example, or which have to be wrapped in async functions so that the event loop can context switch.
(I would welcome any refinement of this definition)
So I am looking for a way to automatically catch instances of this
mistake.
Let's make a few things clear: the mistake discussed in the article is calling any long-running sync function inside an asyncio coroutine (it can be a blocking I/O call or just a pure CPU function with a lot of calculations). It's a mistake because it blocks the whole event loop, which leads to a significant performance downgrade (more about it here, including the comments below the answer).
Is there any way to catch this situation automatically? Before run time, no: no one except you can predict whether a particular function will take 10 seconds or 0.01 seconds to execute. At run time, it's already built into asyncio; all you have to do is enable debug mode.
If you are afraid that some sync function can vary between being long-running (detectable at run time in debug mode) and short-running (not detectable), just execute the function in a background thread using run_in_executor; it guarantees that the event loop will not be blocked.
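As a rough illustration of the debug-mode check, the sketch below runs a coroutine that makes a blocking call under asyncio.run(..., debug=True) and collects what the "asyncio" logger reports. The ListHandler class and the time.sleep stand-in for a database query are assumptions made for this example. In debug mode, asyncio logs a warning for any step that runs longer than loop.slow_callback_duration (0.1 seconds by default).

```python
import asyncio
import logging
import time


class ListHandler(logging.Handler):
    """Collects formatted log messages so they can be inspected afterwards."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def get_chat_id(name):
    time.sleep(0.3)  # stands in for a blocking database query
    return "chat-%s" % name


async def main():
    # Looks harmless, but blocks the event loop for 0.3 seconds.
    return get_chat_id("django")


handler = ListHandler()
asyncio_logger = logging.getLogger("asyncio")
asyncio_logger.addHandler(handler)
try:
    result = asyncio.run(main(), debug=True)
finally:
    asyncio_logger.removeHandler(handler)

# Debug mode reports slow steps with a message of the form
# "Executing <...> took N seconds".
slow_reports = [m for m in handler.messages if "took" in m]
```

In normal use you would not need the handler: the warning goes to the regular logging output, and debug mode can also be switched on with the PYTHONASYNCIODEBUG=1 environment variable.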

Creating non blocking restful service using aiohttp [duplicate]

I have tried the following code in Python 3.6 for asyncio:
Example 1:
import asyncio
import time
async def hello():
    print('hello')
    await asyncio.sleep(1)
    print('hello again')
tasks=[hello(),hello()]
loop=asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
Output is as expected:
hello
hello
hello again
hello again
Then I want to change the asyncio.sleep into another def:
async def sleep():
    time.sleep(1)
async def hello():
    print('hello')
    await sleep()
    print('hello again')
tasks=[hello(),hello()]
loop=asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
Output:
hello
hello again
hello
hello again
It seems it is not running in an asynchronous mode, but a normal sync mode.
The question is: Why is it not running in an asynchronous mode and how can I change the old sync module into an 'async' one?
Asyncio uses an event loop, which selects what task (an independent call chain of coroutines) in the queue to activate next. The event loop can make intelligent decisions as to what task is ready to do actual work. This is why the event loop also is responsible for creating connections and watching file descriptors and other I/O primitives; it gives the event loop insight into when there are I/O operations in progress or when results are available to process.
Whenever you use await, there is an opportunity to return control to the loop which can then pass control to another task. Which task then is picked for execution depends on the exact implementation; the asyncio reference implementation offers multiple choices, but there are other implementations, such as the very, very efficient uvloop implementation.
Your sample is still asynchronous. It just so happens that by replacing the await asyncio.sleep() call with a synchronous time.sleep() call, inside a new coroutine function, you introduced 2 coroutines into the task call chain that don't yield, and thus influenced the order in which they are executed. That they are executed in what appears to be synchronous order is a coincidence. If you switched event loops, or introduced more coroutines (especially some that use I/O), the order can easily be different again.
Moreover, your new coroutines use time.sleep(); this makes your coroutines uncooperative. The event loop is not notified that your code is waiting (time.sleep() will not yield!), so no other coroutine can be executed while time.sleep() is running. time.sleep() simply doesn't return or let any other code run until the requested amount of time has passed. Contrast this with the asyncio.sleep() implementation, which simply yields to the event loop with a call_later() hook; the event loop now knows that that task won't need any attention until a later time.
Also see asyncio: why isn't it non-blocking by default for a more in-depth discussion of how tasks and the event loop interact. And if you must run blocking, synchronous code that can't be made to cooperate, then use an executor pool to have the blocking code executed in a separate thread or child process to free up the event loop for other, better behaved tasks.
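A minimal sketch of that last suggestion, applied to the example from the question (assuming Python 3.7+): the blocking time.sleep() is handed to the default thread pool, so the coroutine yields while it waits. The name and log parameters are additions made here so the interleaving can be inspected; they are not part of the original code.

```python
import asyncio
import time


async def sleep(seconds):
    # Run the blocking call in the default executor (a thread pool);
    # awaiting the returned future yields control to the event loop.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, time.sleep, seconds)


async def hello(name, log):
    log.append(f"{name}: hello")
    await sleep(0.2)
    log.append(f"{name}: hello again")


async def main():
    log = []
    await asyncio.gather(hello("a", log), hello("b", log))
    return log


log = asyncio.run(main())
```

Both "hello" entries are recorded before either "hello again", which is the interleaving the asyncio.sleep() version produced.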

Infinite loop or "recursive" in Asyncio

I'm new to Python3 asyncio.
I have a function that constantly retrieves messages from a websocket connection.
I'm wondering whether I should use a while True loop or asyncio.ensure_future in a recursive manner.
Which is preferred or does it not matter?
Example:
async def foo(websocket):
    while True:
        msg = await websocket.recv()
        print(msg)
        await asyncio.sleep(0.0001)
or
async def foo(websocket):
    msg = await websocket.recv()
    print(msg)
    await asyncio.sleep(0.0001)
    asyncio.ensure_future(foo(websocket))
I would recommend the iterative variant, for two reasons:
It is easier to understand and extend. One of the benefits of coroutines compared to callback-based futures is that they permit the use of familiar control structures like if and while to model the code's execution. If you wanted to change your code to e.g. add an outer loop around the existing one, or to add some code (e.g. another loop) after the loop, that would be considerably easier in the non-recursive version.
It is more efficient. Calling asyncio.ensure_future(foo(websocket)) instantiates both a new coroutine object and a brand new task for each new iteration. While neither is particularly heavy-weight, all else being equal, it is better to avoid unnecessary allocation.
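As a small sketch of the first point, the iterative form makes it straightforward to add an exit condition and code that runs after the loop. FakeWebSocket is a hypothetical stand-in for a real connection, and its convention of returning None when the stream ends is an assumption for this example only (real websocket libraries typically signal closure with an exception instead).

```python
import asyncio


class FakeWebSocket:
    """Hypothetical stand-in for a websocket; recv() returns None once drained."""

    def __init__(self, messages):
        self._messages = list(messages)

    async def recv(self):
        await asyncio.sleep(0)  # yield to the event loop, like real I/O would
        return self._messages.pop(0) if self._messages else None


async def foo(websocket):
    received = []
    while True:
        msg = await websocket.recv()
        if msg is None:  # ordinary control flow decides when to stop
            break
        received.append(msg)
    # Code after the loop runs exactly once, when the stream ends.
    return received


received = asyncio.run(foo(FakeWebSocket(["a", "b", "c"])))
```

Expressing the same "stop, then do something with everything received" logic in the recursive ensure_future variant would require passing state between tasks.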

Wrapping synchronous requests into asyncio (async/await)?

I am writing a tool in Python 3.6 that sends requests to several APIs (with various endpoints) and collects their responses to parse and save them in a database.
The API clients that I use have a synchronous version of requesting a URL, for instance they use
urllib.request.Request('...
Or they use Kenneth Reitz' Requests library.
Since my API calls rely on synchronous versions of requesting a URL, the whole process takes several minutes to complete.
Now I'd like to wrap my API calls in async/await (asyncio). I'm using python 3.6.
All the examples / tutorials that I found want me to change the synchronous URL calls / requests to an async version of it (for instance aiohttp). Since my code relies on API clients that I haven't written (and I can't change) I need to leave that code untouched.
So is there a way to wrap my synchronous requests (blocking code) in async/await to make them run in an event loop?
I'm new to asyncio in Python. This would be a no-brainer in NodeJS. But I can't wrap my head around this in Python.
The solution is to wrap your synchronous code in a thread and run it that way. I used that exact system to make my asyncio code run boto3 (note: remove inline type-hints if running < python3.6):
async def get(self, key: str) -> bytes:
    s3 = boto3.client("s3")
    loop = asyncio.get_event_loop()
    try:
        response: typing.Mapping = \
            await loop.run_in_executor(  # type: ignore
                None, functools.partial(
                    s3.get_object,
                    Bucket=self.bucket_name,
                    Key=key))
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] == "NoSuchKey":
            raise base.KeyNotFoundException(self, key) from e
        elif e.response["Error"]["Code"] == "AccessDenied":
            raise base.AccessDeniedException(self, key) from e
        else:
            raise
    return response["Body"].read()
Note that this works because the vast majority of the time in s3.get_object() is spent waiting for I/O, and (generally) Python releases the GIL while waiting for I/O (the GIL is the reason threads in Python are generally not a good idea for CPU-bound work).
The first argument None in run_in_executor means that we run in the default executor. This is a thread pool executor, but passing a ThreadPoolExecutor explicitly may make this clearer.
Note that, where using pure async I/O you could easily have thousands of connections open concurrently, using a threadpool executor means that each concurrent call to the API needs a separate thread. Once you run out of threads in your pool, the threadpool will not schedule your new call until a thread becomes available. You can obviously raise the number of threads, but this will eat up memory; don't expect to be able to go over a couple of thousand.
Also see the python ThreadPoolExecutor docs for an explanation and some slightly different code on how to wrap your sync call in async code.
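For illustration, here is a minimal sketch of the explicit-executor variant (assuming Python 3.7+). blocking_api_call is a hypothetical stand-in for a synchronous API client and max_workers=4 is an arbitrary choice. Since run_in_executor only forwards positional arguments, functools.partial is used to bind keyword arguments.

```python
import asyncio
import functools
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_api_call(endpoint, *, delay=0.1):
    """Hypothetical stand-in for a synchronous API client call."""
    time.sleep(delay)  # simulate waiting on the network
    return f"response from {endpoint}"


async def fetch_all(endpoints, max_workers=4):
    loop = asyncio.get_running_loop()
    # At most max_workers calls run at once; the rest queue inside the pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            loop.run_in_executor(
                pool, functools.partial(blocking_api_call, endpoint, delay=0.1)
            )
            for endpoint in endpoints
        ]
        # All futures are awaited before the pool is shut down.
        return await asyncio.gather(*futures)


responses = asyncio.run(fetch_all(["/users", "/orders", "/items"]))
```

The three calls overlap in separate threads, and gather returns the responses in the same order as the endpoints.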

Should I use coroutines or another scheduling object here?

I currently have code in the form of a generator which calls an IO-bound task. The generator actually calls sub-generators as well, so a more general solution would be appreciated.
Something like the following:
def processed_values(list_of_io_tasks):
    for task in list_of_io_tasks:
        value = slow_io_call(task)
        yield postprocess(value)  # in real version, would iterate over
                                  # processed_values2(value) here
I have complete control over slow_io_call, and I don't care in which order I get the items from processed_values. Is there something like coroutines I can use to get the yielded results in the fastest order by turning slow_io_call into an asynchronous function and using whichever call returns fastest? I expect list_of_io_tasks to be at least thousands of entries long. I've never done any parallel work other than with explicit threading, and in particular I've never used the various forms of lightweight threading which are available.
I need to use the standard CPython implementation, and I'm running on Linux.
Sounds like you are in search of multiprocessing.Pool(), specifically the Pool.imap_unordered() method.
Here is a port of your function to use imap_unordered() to parallelize calls to slow_io_call().
import multiprocessing

def processed_values(list_of_io_tasks):
    pool = multiprocessing.Pool(4)  # num workers
    results = pool.imap_unordered(slow_io_call, list_of_io_tasks)
    while True:
        try:
            yield results.next(9999999)  # large time-out
        except StopIteration:
            return  # no more results; end the generator cleanly
Note that you could also iterate over results directly (i.e. for item in results: yield item) without a while True loop; however, calling results.next() with a time-out value works around a multiprocessing keyboard interrupt bug and allows you to kill the main process and all subprocesses with Ctrl-C. Also note that results.next() raises StopIteration when it has no more items to return. The function catches it and returns, because since Python 3.7 (PEP 479) a StopIteration that escapes a generator body is converted into a RuntimeError; returning from the generator makes Python raise StopIteration on its behalf.
To use threads in place of processes, replace
import multiprocessing
with
import multiprocessing.dummy as multiprocessing
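A self-contained sketch of the thread-backed variant, for trying the idea out: slow_io_call and postprocess are hypothetical stand-ins for the functions in the question, and this version iterates over the results directly rather than using the next(timeout) workaround described above.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed pool with the same API


def slow_io_call(task):
    """Hypothetical I/O-bound call; later tasks finish sooner."""
    time.sleep(0.05 * (3 - task))
    return task * 10


def postprocess(value):
    """Hypothetical post-processing step."""
    return value + 1


def processed_values(list_of_io_tasks, workers=4):
    with Pool(workers) as pool:
        # Results arrive in completion order, not submission order.
        for value in pool.imap_unordered(slow_io_call, list_of_io_tasks):
            yield postprocess(value)


values = list(processed_values([0, 1, 2]))
```

Because the order depends on which call finishes first, callers should not rely on the position of any item in values.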
