Wrapping synchronous requests into asyncio (async/await)? - python-3.x

I am writing a tool in Python 3.6 that sends requests to several APIs (with various endpoints) and collects their responses to parse and save them in a database.
The API clients that I use have a synchronous version of requesting a URL, for instance they use
urllib.request.Request('...
Or they use Kenneth Reitz' Requests library.
Since my API calls run synchronously, one after another, the whole process takes several minutes to complete.
Now I'd like to wrap my API calls in async/await (asyncio). I'm using Python 3.6.
All the examples / tutorials that I found want me to change the synchronous URL calls / requests to an async version (for instance aiohttp). Since my code relies on API clients that I haven't written (and can't change), I need to leave that code untouched.
So is there a way to wrap my synchronous requests (blocking code) in async/await to make them run in an event loop?
I'm new to asyncio in Python. This would be a no-brainer in NodeJS. But I can't wrap my head around this in Python.

The solution is to wrap your synchronous code in a thread and run it that way. I used that exact approach to make my asyncio code run boto3 (note: remove the inline type hints if running on a version older than Python 3.6):
import asyncio
import functools
import typing

import boto3
import botocore.exceptions

# base refers to the answer author's own module providing the custom exceptions.

async def get(self, key: str) -> bytes:
    s3 = boto3.client("s3")
    loop = asyncio.get_event_loop()
    try:
        # Run the blocking boto3 call on the default (thread pool) executor.
        response: typing.Mapping = \
            await loop.run_in_executor(  # type: ignore
                None, functools.partial(
                    s3.get_object,
                    Bucket=self.bucket_name,
                    Key=key))
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] == "NoSuchKey":
            raise base.KeyNotFoundException(self, key) from e
        elif e.response["Error"]["Code"] == "AccessDenied":
            raise base.AccessDeniedException(self, key) from e
        else:
            raise
    return response["Body"].read()
Note that this works because the vast majority of the time in the s3.get_object() code is spent waiting for I/O, and (generally) while waiting for I/O Python releases the GIL (the GIL is the reason that threads in Python are generally not a good idea).
The first argument None to run_in_executor means that we run in the default executor, which is a thread pool executor. It may make things clearer to explicitly pass your own thread pool executor there.
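For example, a minimal sketch of passing an explicitly sized pool instead of None (the function name, pool size, and the s3/bucket_name/key arguments are just placeholders mirroring the snippet above):

import asyncio
import concurrent.futures
import functools

executor = concurrent.futures.ThreadPoolExecutor(max_workers=32)

async def get_object_in_thread(s3, bucket_name, key):
    loop = asyncio.get_event_loop()
    # Same blocking call as above, but submitted to an explicitly sized pool.
    return await loop.run_in_executor(
        executor,
        functools.partial(s3.get_object, Bucket=bucket_name, Key=key))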
Note that, whereas with pure async I/O you could easily have thousands of connections open concurrently, using a thread pool executor means that each concurrent call to the API needs a separate thread. Once you run out of threads in your pool, the executor will not schedule your new call until a thread becomes available. You can obviously raise the number of threads, but this will eat up memory; don't expect to be able to go beyond a couple of thousand.
Also see the Python ThreadPoolExecutor docs for an explanation and some slightly different code on how to wrap your sync call in async code.

Related

asyncio only processes one file at a time

I'm working on a program to upload (large) files to a remote SFTP server, while also calculating each file's SHA256. The uploads are slow, and the program is supposed to open multiple SFTP connections.
Here is the main code:
async def bwrite(fd, buf):
    log.debug('Writing %d bytes to %s', len(buf), fd)
    fd.write(buf)

async def digester(digest, buf):
    log.debug('Updating digest %s with %d more bytes', digest, len(buf))
    digest.update(buf)

async def upload(fNames, SFTP, rename):
    for fName in fNames:
        inp = open(fName, "rb", 655360)
        log.info('Opened local %s', fName)
        digest = hashlib.sha256()
        rName = rename % os.path.splitext(fName)[0]
        out = SFTP.open(rName, "w", 655360)
        ...
        while True:
            buf = inp.read(bsize)
            if not buf:
                break
            await bwrite(out, buf)
            await digester(digest, buf)
        inp.close()
        out.close()

...
for i in range(0, len(clients)):
    fNames = args[(i * chunk):((i + 1) * chunk)]
    log.debug('Connection %s: %d files: %s',
              clients[i], len(fNames), fNames)
    uploads.append(upload(fNames, clients[i], Rename))

log.info('%d uploads initiated, awaiting completion', len(uploads))
results = asyncio.gather(*uploads)
loop = asyncio.get_event_loop()
loop.run_until_complete(results)
loop.close()
The idea is for multiple upload coroutines to run "in parallel" -- each using its own separate SFTP-connection -- pushing out one or more files to the server.
It even works, but only a single upload is running at any time. I expected multiple ones to get control while their siblings await bwrite and/or digester. What am I doing wrong?
(Using Python-3.6 on FreeBSD-11, if that matters... The program is supposed to run on RHEL7 eventually...)
If no awaiting is involved, then no parallelism can occur. bwrite and digester, while declared async, perform no asynchronous operations (they neither launch nor await any coroutines); if you removed async from their definitions and removed the await where they're called, the code would behave identically.
The only time asyncio can get you benefits is when:
There is a blocking operation involved, and
Said blocking operation is designed for asyncio use (or involves a file descriptor that can be rewrapped for said purposes)
Your bwrite isn't doing that: it performs normal blocking I/O on SFTP objects, which don't appear to be async-friendly (and if the write were async, failing to either return the future it produced or await it yourself would usually mean it does nothing, barring the off chance that it self-scheduled a task internally). Your reads from the input file aren't doing that either, which is fine; changing blocking I/O to asyncio for local file access normally isn't beneficial, because it's all buffered at the user and kernel level, so you almost never block on it. Nor is your digester: hashing operations are CPU-bound, and it never makes sense to make them async unless they are actually built on asynchronous work.
Since the two awaits in upload are effectively synchronous (they don't return anything that, when awaited, would block on real asynchronous tasks), upload itself is effectively synchronous (it will never, under any circumstances, return control to the event loop before it completes). So even though all the other tasks are in the event loop queue, raring to go, the event loop itself has to wait until the running task blocks in an async-friendly way (with an await on something that actually does background work while blocked), which never happens, and the tasks just get run sequentially, one after the other.
If an async-friendly version of your SFTP module exists, that might allow you to gain some benefit. But without it, you're probably better off using concurrent.futures.ThreadPoolExecutor or multiprocessing.pool.ThreadPool to do preemptive multitasking (which will swap out threads whenever they release the GIL, forcibly swapping between bytecodes if they don't release the GIL for a while). That will get you parallelism on any blocking I/O (async-friendly or not), and, if the data is large enough, on the hashing work as well (hashlib is one of the only Python built-ins I know of that releases the GIL for CPU-bound work, if the data to be hashed is large enough; extension modules releasing the GIL is the only way multithreaded CPython can do more than one core's worth of CPU-bound work in a single process, even on multicore systems).
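As a rough sketch of the thread-based approach (not a drop-in replacement: sync_upload is a hypothetical plain-function rewrite of your upload coroutine, and chunks stands for the per-connection file lists built from args):

from concurrent.futures import ThreadPoolExecutor

def sync_upload(fNames, sftp, rename):
    # Same body as upload(), with the async/await keywords removed.
    ...

# One worker thread per SFTP connection; each blocks independently on its
# own I/O, and hashlib releases the GIL for large buffers.
with ThreadPoolExecutor(max_workers=len(clients)) as pool:
    futures = [pool.submit(sync_upload, fNames, client, Rename)
               for fNames, client in zip(chunks, clients)]
    for fut in futures:
        fut.result()  # re-raises any exception from the worker thread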

Creating non blocking restful service using aiohttp [duplicate]

I have tried the following code in Python 3.6 for asyncio:
Example 1:
import asyncio
import time

async def hello():
    print('hello')
    await asyncio.sleep(1)
    print('hello again')

tasks = [hello(), hello()]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
Output is as expected:
hello
hello
hello again
hello again
Then I wanted to change the asyncio.sleep into another def:
async def sleep():
    time.sleep(1)

async def hello():
    print('hello')
    await sleep()
    print('hello again')

tasks = [hello(), hello()]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
Output:
hello
hello again
hello
hello again
It seems it is not running in an asynchronous mode, but a normal sync mode.
The question is: Why is it not running in an asynchronous mode and how can I change the old sync module into an 'async' one?
Asyncio uses an event loop, which selects what task (an independent call chain of coroutines) in the queue to activate next. The event loop can make intelligent decisions as to what task is ready to do actual work. This is why the event loop also is responsible for creating connections and watching file descriptors and other I/O primitives; it gives the event loop insight into when there are I/O operations in progress or when results are available to process.
Whenever you use await, there is an opportunity to return control to the loop which can then pass control to another task. Which task then is picked for execution depends on the exact implementation; the asyncio reference implementation offers multiple choices, but there are other implementations, such as the very, very efficient uvloop implementation.
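For instance, switching the reference implementation's loop out for uvloop is just a policy change before the loop is created (a minimal sketch, assuming uvloop is installed):

import asyncio
import uvloop

# Make asyncio produce uvloop-based event loops from here on.
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
loop = asyncio.get_event_loop()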
Your sample is still asynchronous. It just so happens that by replacing the await asyncio.sleep() with a synchronous time.sleep() call, inside a new coroutine function, you introduced two coroutines into the task call chain that don't yield, and thus influenced in what order they are executed. That they are executed in what appears to be synchronous order is a coincidence. If you switched event loops, or introduced more coroutines (especially some that use I/O), the order could easily be different again.
Moreover, your new coroutines use time.sleep(); this makes your coroutines uncooperative. The event loop is not notified that your code is waiting (time.sleep() will not yield!), so no other coroutine can be executed while time.sleep() is running. time.sleep() simply doesn't return or let any other code run until the requested amount of time has passed. Contrast this with the asyncio.sleep() implementation, which simply yields to the event loop with a call_later() hook; the event loop now knows that that task won't need any attention until a later time.
Also see asyncio: why isn't it non-blocking by default for a more in-depth discussion of how tasks and the event loop interact. And if you must run blocking, synchronous code that can't be made to cooperate, then use an executor pool to have the blocking code executed in a separate thread or child process, freeing up the event loop for other, better behaved tasks.
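A minimal sketch of that executor approach, as a variation on the second example above: the blocking time.sleep() call is pushed onto the default thread pool so the event loop stays free while it runs.

import asyncio
import time

async def sleep():
    loop = asyncio.get_event_loop()
    # time.sleep(1) runs in a worker thread; awaiting the resulting future
    # yields control back to the event loop in the meantime.
    await loop.run_in_executor(None, time.sleep, 1)

async def hello():
    print('hello')
    await sleep()
    print('hello again')

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(hello(), hello()))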

Is it more efficient to use create_task(), or gather()?

I'm still at the basics of asynchronous python, and some things confuse me.
import asyncio

loop = asyncio.get_event_loop()
for variation in args:
    loop.create_task(coroutine(variation))
loop.run_forever()
Seems very similar to this
import asyncio

loop = asyncio.get_event_loop()
loop.run_forever(
    asyncio.gather(
        coroutine(variation_1),
        coroutine(variation_2),
        ...))
They might do the same thing, but that doesn't seem useful, so what's the difference?
As mentioned in the comments, your second example should use run_until_complete, not run_forever.
They might do the same thing, but that doesn't seem useful, so what's the difference?
asyncio.gather is a higher-level construct.
create_task submits the coroutine to the event loop, effectively allowing it to run "in the background" (provided the event loop itself is active). As the name implies, it returns a task, a handle over the execution of the coroutine, most importantly providing the ability to cancel it. You can create any number of such tasks in an event loop, and they will all run until their respective completions.
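A minimal, self-contained sketch of that pattern (coroutine here is just a stand-in for the question's coroutine):

import asyncio

async def coroutine(variation):
    await asyncio.sleep(variation)  # stand-in for real work
    print('done with', variation)

loop = asyncio.get_event_loop()
# Each create_task() call returns a Task handle; the coroutines run
# concurrently once the loop runs, and each handle can be cancelled.
tasks = [loop.create_task(coroutine(v)) for v in (1, 2, 3)]
loop.run_until_complete(asyncio.wait(tasks))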
asyncio.gather is for when you are actually interested in the results of the coroutines you have spawned. It spawns them as if with create_task, allowing them to run in parallel, but also waits for all of them to complete, and then returns their respective results (or raises an exception if any of them raised one).
For example, if you have a download coroutine that downloads a URL and returns its contents, and you are downloading a list of URLs, gather allows you to match URLs to their data:
url_list = [...]
data_list = await asyncio.gather(*[download(url) for url in url_list])
# url_list and data_list now have matching elements, so this works:
for url, data in zip(url_list, data_list):
    ...
Doing this with just create_task would be much more complicated.

Threads or asyncio gather?

Which is the best method to do concurrent I/O operations?
threads, or
asyncio?
There will be a list of files.
I open each file, generate a graph from the .txt file, and store it on disk.
I have tried using threads, but it's time-consuming and sometimes it does not generate a graph for some files.
Is there any other method?
I tried the code below, with async on the load_instantel_ascii function, but it raises an exception:
for fl in self.finallist:
    k = randint(0, 9)
    try:
        task2.append(*[load_instantel_ascii(fleName=fl, columns=None,
                                            out=self.outdir,
                                            separator=',')])
    except:
        print("Error on Graph Generation")

event_loop.run_until_complete(asyncio.gather(yl1
                                             for kl1 in task2))
If I understood everything correctly and you want asynchronous file I/O, then asyncio itself doesn't support it out of the box. In the end, everything asyncio-related that provides async file I/O does it using a thread pool.
But that probably doesn't mean you shouldn't use asyncio: the library is valuable as a way to write asynchronous code in the first place, even if it's a wrapper over threads. I would give something like aiofiles a try.
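A minimal sketch of what that looks like with aiofiles (the file name is just a placeholder):

import asyncio
import aiofiles

async def read_file(path):
    # aiofiles delegates the blocking open/read calls to a thread pool
    # behind an async interface.
    async with aiofiles.open(path) as f:
        return await f.read()

loop = asyncio.get_event_loop()
contents = loop.run_until_complete(read_file('example.txt'))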

Should I use coroutines or another scheduling object here?

I currently have code in the form of a generator which calls an IO-bound task. The generator actually calls sub-generators as well, so a more general solution would be appreciated.
Something like the following:
def processed_values(list_of_io_tasks):
    for task in list_of_io_tasks:
        value = slow_io_call(task)
        yield postprocess(value)  # in real version, would iterate over
                                  # processed_values2(value) here
I have complete control over slow_io_call, and I don't care in which order I get the items from processed_values. Is there something like coroutines I can use to get the yielded results in the fastest order by turning slow_io_call into an asynchronous function and using whichever call returns fastest? I expect list_of_io_tasks to be at least thousands of entries long. I've never done any parallel work other than with explicit threading, and in particular I've never used the various forms of lightweight threading which are available.
I need to use the standard CPython implementation, and I'm running on Linux.
Sounds like you are in search of multiprocessing.Pool(), specifically the Pool.imap_unordered() method.
Here is a port of your function to use imap_unordered() to parallelize calls to slow_io_call().
def processed_values(list_of_io_tasks):
    pool = multiprocessing.Pool(4)  # num workers
    results = pool.imap_unordered(slow_io_call, list_of_io_tasks)
    while True:
        yield results.next(9999999)  # large time-out
Note that you could also iterate over results directly (i.e. for item in results: yield item) without a while True loop; however, calling results.next() with a time-out value works around this multiprocessing keyboard interrupt bug and allows you to kill the main process and all subprocesses with Ctrl-C. Also note that the StopIteration exceptions are not caught in this function, but one will be raised when results.next() has no more items to return. This is legal for generator functions, such as this one, which are expected to either raise StopIteration when there are no more values to yield, or just stop yielding, in which case a StopIteration exception is raised on their behalf.
To use threads in place of processes, replace
import multiprocessing
with
import multiprocessing.dummy as multiprocessing
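Putting the swap together, a minimal sketch of the thread-backed variant (slow_io_call and list_of_io_tasks are the question's placeholders):

import multiprocessing.dummy as multiprocessing  # thread-backed Pool API

def processed_values(list_of_io_tasks):
    pool = multiprocessing.Pool(4)  # 4 worker threads instead of processes
    results = pool.imap_unordered(slow_io_call, list_of_io_tasks)
    while True:
        yield results.next(9999999)  # large time-out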
