Threads or asyncio gather? - python-3.x

Which is the better method for doing concurrent I/O operations: threads or asyncio?
There is a list of files. I open each .txt file, generate a graph from it, and store the graph on disk.
I have tried using threads, but it is time consuming and sometimes it does not generate a graph for some files.
Is there any other method?
I tried the code below, with async on the load_instantel_ascii function, but it gives an exception:
for fl in self.finallist:
    k = randint(0, 9)
    try:
        task2.append( * [load_instantel_ascii(fleName = fl, columns = None,
                                               out = self.outdir,
                                               separator = ',')])
    except:
        print("Error on Graph Generation")

event_loop.run_until_complete(asyncio.gather(yl1
                                             for kl1 in task2)
                              )

If I understood everything correctly and you want asynchronous file I/O, then asyncio itself doesn't support it out of the box. In the end, every asyncio-related library that provides async file I/O does it using a thread pool.
But that probably doesn't mean you shouldn't use asyncio: this lib is cool as a way to write asynchronous code in the first place, even if it is a wrapper above threads. I would give something like aiofiles a try.
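For example, here is a minimal sketch of reading several files concurrently with aiofiles (assuming the third-party aiofiles package is installed; the file names and the read_text/read_all helpers are placeholders for your own loading and graph code):

import asyncio

import aiofiles  # third-party: pip install aiofiles

async def read_text(path):
    # aiofiles runs the blocking open/read in a thread pool behind the scenes
    async with aiofiles.open(path, mode='r') as f:
        return await f.read()

async def read_all(paths):
    # read every file concurrently; graph generation would still happen
    # synchronously on each returned string afterwards
    return await asyncio.gather(*(read_text(p) for p in paths))

loop = asyncio.get_event_loop()
contents = loop.run_until_complete(read_all(['a.txt', 'b.txt']))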

Related

asyncio only processes one file at a time

I'm working on a program to upload (large) files to a remote SFTP server, while also calculating each file's SHA256. The uploads are slow, and the program is supposed to open multiple SFTP connections.
Here is the main code:
async def bwrite(fd, buf):
    log.debug('Writing %d bytes to %s', len(buf), fd)
    fd.write(buf)

async def digester(digest, buf):
    log.debug('Updating digest %s with %d more bytes', digest, len(buf))
    digest.update(buf)

async def upload(fNames, SFTP, rename):
    for fName in fNames:
        inp = open(fName, "rb", 655360)
        log.info('Opened local %s', fName)
        digest = hashlib.sha256()
        rName = rename % os.path.splitext(fName)[0]
        out = SFTP.open(rName, "w", 655360)
        ...
        while True:
            buf = inp.read(bsize)
            if not buf:
                break
            await bwrite(out, buf)
            await digester(digest, buf)
        inp.close()
        out.close()

...
for i in range(0, len(clients)):
    fNames = args[(i * chunk):((i + 1) * chunk)]
    log.debug('Connection %s: %d files: %s',
              clients[i], len(fNames), fNames)
    uploads.append(upload(fNames, clients[i], Rename))

log.info('%d uploads initiated, awaiting completion', len(uploads))
results = asyncio.gather(*uploads)
loop = asyncio.get_event_loop()
loop.run_until_complete(results)
loop.close()
The idea is for multiple upload coroutines to run "in parallel" -- each using its own separate SFTP-connection -- pushing out one or more files to the server.
It even works -- but only a single upload is running at any time. I expected multiple ones to get control -- each while its siblings await the bwrite and/or the digester. What am I doing wrong?
(Using Python-3.6 on FreeBSD-11, if that matters... The program is supposed to run on RHEL7 eventually...)
If no awaiting is involved, then no parallelism can occur. bwrite and digester, while declared async, perform no asynchronous operations (they neither launch nor create coroutines that can be awaited); if you removed the async from their definitions and the await where they're called, the code would behave identically.
The only time asyncio can get you benefits is when:
There is a blocking operation involved, and
Said blocking operation is designed for asyncio use (or involves a file descriptor that can be rewrapped for said purposes)
Your bwrite isn't doing that: it's doing normal blocking I/O on SFTP objects, which don't appear to be async-friendly (and if they were async, your failure to either return the future produced or await it yourself would usually mean it does nothing, barring the off chance that it self-scheduled a task internally). Your reads from the input file aren't async either, which is fine; changing blocking I/O to asyncio for local file access normally isn't beneficial, since it's all buffered at user level and kernel level, so you almost never block on the writes. Nor is your digester: hashing is CPU-bound, and it never makes sense to make it async unless it is built on actual async machinery.
Since the two awaits in upload are effectively synchronous (they don't return anything that, when awaited, would actually block on actual asynchronous tasks), upload itself is effectively synchronous (it will never, under any circumstances, return control to the event loop before it completes). So even though all the other tasks are in the event loop queue, raring to go, the event loop itself has to wait until the running task blocks in an async-friendly way (with await on something that actually does background work while blocked), which never happens, and the tasks just get run sequentially, one after the other.
If an async-friendly version of your SFTP module exists, that might allow you to gain some benefit. But without it, you're probably better off using concurrent.futures.ThreadPoolExecutor or multiprocessing.pool.ThreadPool to do preemptive multitasking (which will swap out threads whenever they release the GIL, forcibly swapping between bytecodes if they don't release the GIL for a while). That will get you parallelism on any blocking I/O (async-friendly or not) and, if the data is large enough, on the hashing work as well (hashlib is one of the only Python built-ins I know of that releases the GIL for CPU-bound work, when the data to be hashed is large enough; extension modules releasing the GIL is the only way multithreaded CPython can do more than one core's worth of CPU-bound work in a single process, even on multicore systems).
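As a rough sketch of that thread-based approach (assuming a plain synchronous upload_sync with the same body as your upload, minus the async/await, and reusing the clients/args/chunk/Rename names from your snippet):

from concurrent.futures import ThreadPoolExecutor, as_completed

def upload_sync(fNames, SFTP, rename):
    # same logic as upload() above, but as ordinary blocking code:
    # read a chunk, SFTP.write it, update the digest, repeat
    ...

with ThreadPoolExecutor(max_workers=len(clients)) as pool:
    futures = [pool.submit(upload_sync,
                           args[i * chunk:(i + 1) * chunk],
                           clients[i],
                           Rename)
               for i in range(len(clients))]
    for fut in as_completed(futures):
        fut.result()  # re-raises any exception from that worker thread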

Wrapping synchronous requests into asyncio (async/await)?

I am writing a tool in Python 3.6 that sends requests to several APIs (with various endpoints) and collects their responses to parse and save them in a database.
The API clients that I use have a synchronous version of requesting a URL, for instance they use
urllib.request.Request('...
Or they use Kenneth Reitz' Requests library.
Since my API calls rely on synchronous versions of requesting a URL, the whole process takes several minutes to complete.
Now I'd like to wrap my API calls in async/await (asyncio). I'm using python 3.6.
All the examples/tutorials that I found want me to change the synchronous URL calls/requests to an async version (for instance aiohttp). Since my code relies on API clients that I haven't written (and can't change), I need to leave that code untouched.
So is there a way to wrap my synchronous requests (blocking code) in async/await to make them run in an event loop?
I'm new to asyncio in Python. This would be a no-brainer in NodeJS. But I can't wrap my head around this in Python.
The solution is to wrap your synchronous code in a thread and run it that way. I used that exact approach to make my asyncio code run boto3 (note: remove the inline type hints if running on Python < 3.6):
async def get(self, key: str) -> bytes:
    s3 = boto3.client("s3")
    loop = asyncio.get_event_loop()
    try:
        response: typing.Mapping = \
            await loop.run_in_executor(  # type: ignore
                None, functools.partial(
                    s3.get_object,
                    Bucket=self.bucket_name,
                    Key=key))
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] == "NoSuchKey":
            raise base.KeyNotFoundException(self, key) from e
        elif e.response["Error"]["Code"] == "AccessDenied":
            raise base.AccessDeniedException(self, key) from e
        else:
            raise
    return response["Body"].read()
Note that this works because the vast majority of the time in the s3.get_object() call is spent waiting for I/O, and (generally) Python releases the GIL while waiting for I/O (the GIL is the reason threads in Python are generally not a good idea).
The first argument None to run_in_executor means that we run in the default executor, which is a thread pool executor. It may make things clearer to assign a thread pool executor explicitly.
Note that, whereas with pure async I/O you could easily have thousands of connections open concurrently, using a thread pool executor means that each concurrent API call needs a separate thread. Once you run out of threads in the pool, new calls will not be scheduled until a thread becomes available. You can obviously raise the number of threads, but that eats up memory; don't expect to be able to go over a couple of thousand.
Also see the python ThreadPoolExecutor docs for an explanation and some slightly different code on how to wrap your sync call in async code.
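For completeness, a minimal sketch of the same pattern around a blocking requests call, with the executor assigned explicitly (requests.get and the example URLs are just stand-ins for whatever your API clients do internally):

import asyncio
import functools
from concurrent.futures import ThreadPoolExecutor

import requests  # stands in for your synchronous API client

async def fetch(loop, executor, url):
    # run_in_executor returns a future; awaiting it hands control back to
    # the event loop while a pool thread performs the blocking call
    return await loop.run_in_executor(
        executor, functools.partial(requests.get, url, timeout=30))

async def fetch_all(loop, urls):
    executor = ThreadPoolExecutor(max_workers=20)
    return await asyncio.gather(*(fetch(loop, executor, u) for u in urls))

loop = asyncio.get_event_loop()
responses = loop.run_until_complete(
    fetch_all(loop, ['https://httpbin.org/get', 'https://httpbin.org/uuid']))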

How to idiomatically end an Asyncio Operation in Python

I'm working on code where I have a long running shell command whose output is sent to disk. This command will generate hundreds of GBs per file. I have successfully written code that calls this command asynchronously and successfully yields control (awaits) for it to complete.
I also have code that can asynchronously read that file as it is being written to so that I can process the data contained therein. The problem I'm running into is that I can't find a way to stop the file reader once the shell command completes.
I guess I'm looking for some sort of interrupt I can pass into my reader function once the shell command ends, which I can use to tell it to close the file and wrap up the event loop.
Here is my reader function. Right now, it runs forever waiting for new data to be written to the file.
import asyncio

PERIOD = 0.5

async def readline(f):
    while True:
        data = f.readline()
        if data:
            return data
        await asyncio.sleep(PERIOD)

async def read_zmap_file():
    with open('/data/largefile.json'
              , mode = 'r+t'
              , encoding = 'utf-8'
              ) as f:
        i = 0
        while True:
            line = await readline(f)
            print('{:>10}: {!s}'.format(str(i), line.strip()))
            i += 1

loop = asyncio.get_event_loop()
loop.run_until_complete(read_zmap_file())
loop.close()
If my approach is off, please let me know. I'm relatively new to asynchronous programming. Any help would be appreciated.
So, I'd do something like
reader = loop.create_task(read_zmap_file())
Then, in your code that manages the shell process, once the shell process exits you can do
reader.cancel()
followed by
loop.run_until_complete(reader)
Alternatively, you could simply set a flag somewhere and use that flag in your while statement. You don't need to use asyncio primitives when something simpler works.
That said, I'd look into ways your reader can avoid the periodic sleep. If your reader will be able to keep up with the shell command, I'd recommend a pipe, because pipes can be used with select (and thus added to an event loop). Then, in your reader, you can write to a file if you need a permanent log. I realize the discussion of avoiding the periodic sleep is beyond the scope of this question, and I don't want to go into more detail than I have, but you did ask for hints on how best to approach async programming.
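A rough sketch of the cancellation idea, assuming the shell command is started through asyncio's own subprocess support (the command string is a placeholder for your real one):

import asyncio

async def main():
    loop = asyncio.get_event_loop()
    reader = loop.create_task(read_zmap_file())

    # placeholder for the real long-running shell command
    proc = await asyncio.create_subprocess_shell(
        'some_long_command > /data/largefile.json')
    await proc.wait()

    # let the reader drain the tail of the file, then stop it
    await asyncio.sleep(PERIOD)
    reader.cancel()
    try:
        await reader
    except asyncio.CancelledError:
        pass

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()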

using threading to call the main function again?

I am really trying to wrap my head around the concept of threading in practical applications. I am using the threading module in Python 3.4 and I am not sure if my logic is right for the program's functionality.
Here is the gist of my code:
def myMain():
    """ my main function actually uses sockets to
        send data over a network
    """
    # there will be an infinite loop call it loop-1 here
    while True:
        # perform encoding scheme
        # send data out... with all other exception handling

        # here is infinite loop 2 which waits for messages
        # from other devices
        while True:
            # Check for incoming messages
            callCheckFunction()  # ----> Should I call this on a thread?
The above-mentioned callCheckFunction() will do some comparison on the received data, and if the data values don't match I want to run the myMain() function again.
Here is the gist of the callCheckFunction():
def callCheckFunction():
    if data == 'same':
        # some work done here and then get out of the
        # function and return back to listening on the socket
        pass
    elif data == 'not same':
        myMain()  # ---------> Should I thread this one too??
This might be complicated, but I am not sure if threading is what I want. I did a nasty hack by calling the myMain() function in the above-mentioned fashion, which works great! But I assume there will definitely be some limit to calling a function within a function, and I want my code to be a bit professional, not hacky!
I have my mind set on threading since I am listening on the socket in an infinite loop; when some new data comes in, the whole myMain() is called back, creating a kind of hectic recursion which I want to control.
EDIT
So I have managed to make the code a bit more modular, i.e. I have split the two infinite loops into two different functions.
myMain() is now divided into
task1()
task2()
and the gist is as follows:
def task1():
    while True:
        # encoding and sending data
        # in the end I call task2() since it is the primary
        # state which decides things
        task2()  # ---------> still need to decide if I should thread or not

def task2():
    while True:
        # check for incoming messages
        checker = threading.Thread(target=callCheckFunction, daemon=True)
        checker.start()
        checker.join()
Now, since callCheckFunction() needs func1(), I decided to thread func1() inside that function. Note that func1() is actually kind of the main() of the code:
def callCheckFunction():
    ...
    elif data == 'not same':
        thready = threading.Thread(target=func1, daemon=True)
        thready.start()
        thready.join()
Results
With little understanding I did manage to get the code working. But I am not sure if this is really hacky or a professional way of doing things! I can of course share the code via GitHub, and also a finite state machine for the system. Also, I am not sure if this code is thread-safe! Help/suggestions are really needed.

Should I use coroutines or another scheduling object here?

I currently have code in the form of a generator which calls an IO-bound task. The generator actually calls sub-generators as well, so a more general solution would be appreciated.
Something like the following:
def processed_values(list_of_io_tasks):
    for task in list_of_io_tasks:
        value = slow_io_call(task)
        yield postprocess(value)  # in real version, would iterate over
                                  # processed_values2(value) here
I have complete control over slow_io_call, and I don't care in which order I get the items from processed_values. Is there something like coroutines I can use to get the yielded results in the fastest order by turning slow_io_call into an asynchronous function and using whichever call returns fastest? I expect list_of_io_tasks to be at least thousands of entries long. I've never done any parallel work other than with explicit threading, and in particular I've never used the various forms of lightweight threading which are available.
I need to use the standard CPython implementation, and I'm running on Linux.
Sounds like you are in search of multiprocessing.Pool(), specifically the Pool.imap_unordered() method.
Here is a port of your function to use imap_unordered() to parallelize calls to slow_io_call().
import multiprocessing

def processed_values(list_of_io_tasks):
    pool = multiprocessing.Pool(4)  # num workers
    results = pool.imap_unordered(slow_io_call, list_of_io_tasks)
    while True:
        yield results.next(9999999)  # large time-out
Note that you could also iterate over results directly (i.e. for item in results: yield item) without a while True loop; however, calling results.next() with a time-out value works around this multiprocessing keyboard-interrupt bug and allows you to kill the main process and all subprocesses with Ctrl-C. Also note that StopIteration exceptions are not caught in this function, but one will be raised when results.next() has no more items to return. This is legal for generator functions such as this one, which are expected to either raise StopIteration when there are no more values to yield, or simply stop yielding, in which case a StopIteration exception is raised on their behalf.
To use threads in place of processes, replace
import multiprocessing
with
import multiprocessing.dummy as multiprocessing
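For instance, here is a self-contained toy run of the thread-backed variant (the sleep-based slow_io_call and the trivial postprocess are stand-ins for your real I/O call and post-processing; on Python 3.7+ the StopIteration must be caught explicitly inside the generator because of PEP 479):

import multiprocessing.dummy as multiprocessing  # thread-backed Pool
import time

def slow_io_call(task):
    time.sleep(0.5)  # stand-in for the real I/O-bound call
    return task * 2

def postprocess(value):
    return value + 1

def processed_values(list_of_io_tasks):
    pool = multiprocessing.Pool(4)  # num worker threads
    results = pool.imap_unordered(slow_io_call, list_of_io_tasks)
    while True:
        try:
            yield postprocess(results.next(9999999))  # large time-out
        except StopIteration:
            return

if __name__ == '__main__':
    for item in processed_values(range(8)):
        print(item)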
