Async version of built-in print (stdout) - python-3.x

I have a problem understanding some of the limitations of using print inside an async function. Basically, this is my code:
#!/usr/bin/env python
import sys
import asyncio
import aiohttp

async def amain(loop):
    session = aiohttp.ClientSession(loop=loop)
    try:
        # using session to fetch a large json file which is stored
        # in obj
        print(obj)  # for debugging purposes
    finally:
        await session.close()

def main():
    loop = asyncio.get_event_loop()
    res = 1
    try:
        res = loop.run_until_complete(amain(loop))
    except KeyboardInterrupt:
        # silence traceback when pressing ctrl+c
        pass
    loop.close()
    return res

if __name__ == '__main__':
    sys.exit(main())
If I execute this, the JSON object is printed to stdout and then the program suddenly dies with this error:
$ dwd-get-sensor-file ; echo $?
Traceback (most recent call last):
File "/home/yanez/anaconda/py3/envs/mondas/bin/dwd-get-sensor-file", line 11, in <module>
load_entry_point('mondassatellite', 'console_scripts', 'dwd-get-sensor-file')()
File "/home/yanez/projects/mondassatellite/mondassatellite/mondassatellite/bin/dwd_get_sensor_file.py", line 75, in main
res = loop.run_until_complete(amain(loop, args))
File "/home/yanez/anaconda/py3/envs/mondas/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
return future.result()
File "/home/yanez/projects/mondassatellite/mondassatellite/mondassatellite/bin/dwd_get_sensor_file.py", line 57, in amain
print(obj)
BlockingIOError: [Errno 11] write could not complete without blocking
1
The funny thing is that when I execute my code redirecting stdout to a file like this
$ dwd-get-sensor-file > output.txt ; echo $?
0
the exception doesn't happen and the whole output is correctly redirected to output.txt.
For testing purposes I converted the JSON object to a string, and instead of doing print(obj) I do sys.stdout.write(obj_as_str); then I get this exception:
BlockingIOError: [Errno 11] write could not complete without blocking
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
I've searched for this BlockingIOError exception, but all the threads I find have something to do with network sockets or CI builds. However, I found one interesting GitHub comment:
The make: write error is almost certainly EAGAIN from stdout. Pretty much every command line tool expects stdout to be in blocking mode, and does not properly retry when in nonblocking mode.
So when I executed this
python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); print(flags&os.O_NONBLOCK);'
I get 2048, which means blocking (or is this the other way round? I'm confused). After executing this
python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); fcntl.fcntl(sys.stdout, fcntl.F_SETFL, flags&~os.O_NONBLOCK);'
I don't get the BlockingIOError exceptions anymore, but I don't like this solution.
So, my question is: how should we deal with writing to stdout inside an async function? If I know that I'm dealing with stdout, should I set stdout to non-blocking and revert it when my program exits? Is there a specific strategy for this?

Give aiofiles a try, using the stdout FD as the file object.
aiofiles helps with this by introducing asynchronous versions of files that support delegating operations to a separate thread pool.
In terms of actually using aiofiles with an FD directly, you could probably extend the aiofiles.os module, using wrap(os.write).
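If you instead go the route the question already sketches (forcing stdout back into blocking mode and restoring the original flags when the program exits), a small context manager keeps that reversible. This is only a sketch of that approach using the standard fcntl module, not part of the aiofiles suggestion:
import contextlib
import fcntl
import os
import sys

@contextlib.contextmanager
def blocking_stdout():
    # Save the current flags, clear O_NONBLOCK for the duration of the
    # block, and restore the original flags afterwards.
    fd = sys.stdout.fileno()
    old_flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, old_flags & ~os.O_NONBLOCK)
    try:
        yield
    finally:
        fcntl.fcntl(fd, fcntl.F_SETFL, old_flags)

# usage:
#     with blocking_stdout():
#         sys.exit(main())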

Related

pyspark streaming: failing to execute rdd.count() on workers

I have a pyspark streaming job doing something along these lines:
def printrddcount(rdd):
    c = rdd.count()
    print("{1}: Received an RDD of {0} rows".format("CANNOTCOUNT", datetime.now().isoformat()))
and then:
...
stream.foreachRDD(printrddcount)
From what I understand, the printrddcount function will be executed within the workers.
And, yes, I know it's a bad idea to do a print() within the worker. But that's not the point.
I'm pretty sure this very code was working until very recently.
(and it looked different then, because the content of 'c' was actually printed in the print statement, rather than just collected and then thrown away...)
But now it seems that (all of a sudden?) rdd.count() has stopped working and is making my worker process die, saying:
UnpicklingError: NEWOBJ class argument has NULL tp_new
full (well, python only) stacktrace:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/worker.py", line 163, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/worker.py", line 54, in read_command
command = serializer._read_with_length(file)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
return self.loads(obj)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/serializers.py", line 454, in loads
return pickle.loads(obj)
UnpicklingError: NEWOBJ class argument has NULL tp_new
The line where it fails is, indeed, the one saying rdd.count()
Any idea why rdd.count() would fail?
If something is supposed to be serialized, it should be the rdd, right?
Ok. I investigated a bit further.
There's nothing wrong with rdd.count()
The only thing wrong is that there is another transformation along the pipeline that somehow 'corrupts' (closes? invalidates? something along those lines) the rdd.
So, when it gets to the printrddcount function it cannot be serialized any more and gives the error.
The issue is within code that looks like this:
...
log = logging.getLogger(__name__)
...

def parse(parse_function):
    def parse_function_wrapper(event):
        try:
            log.info("parsing")
            new_event = parse_function(event)
        except ParsingFailedException as e:
            pass
        return new_event
    return parse_function_wrapper
and then:
stream = stream.map(parse(parse_event))
Now, the log.info call (I tried a lot of variations; in the beginning the logging was within an exception handler) is the one creating the issue.
Which leads me to say that, most probably, it is the logger object that cannot be serialized, for some reason.
I'm closing this thread myself, as it actually has nothing to do with RDD serialization, and most probably not even with pyspark.
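A common workaround (a sketch of my own, not from the original post; ParsingFailedException is the exception class from the snippet above) is to avoid capturing the module-level logger in the closure that is pickled and shipped to the workers, for example by looking the logger up inside the wrapper:
import logging

def parse(parse_function):
    def parse_function_wrapper(event):
        # Look up the logger inside the function so the pickled closure
        # does not capture a logger object (and its handlers), which may
        # not be picklable.
        log = logging.getLogger(__name__)
        try:
            log.info("parsing")
            return parse_function(event)
        except ParsingFailedException:
            return None
    return parse_function_wrapper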

Why is this exception immediately raised from an asyncio Task?

My understanding from the documentation is that asyncio.Tasks, as an asyncio.Future subclass, will store exceptions raised in them and they can be retrieved at my leisure.
However, in this sample code, the exception is raised immediately:
import asyncio

async def bad_task():
    raise Exception()

async def test():
    loop = asyncio.get_event_loop()
    task = loop.create_task(bad_task())
    await task
    # I would expect to get here
    exp = task.exception()
    # but we never do because the function exits on line 3

loop = asyncio.get_event_loop()
loop.run_until_complete(test())
loop.close()
Example output (Python 3.6.5):
python3 ./test.py
Traceback (most recent call last):
File "./test.py", line 15, in <module>
loop.run_until_complete(test())
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/base_events.py", line 468, in run_until_complete
return future.result()
File "./test.py", line 9, in test
await task
File "./test.py", line 4, in bad_task
raise Exception()
Exception
Is this a quirk of creating & calling tasks when already within async code?
await will raise any exception thrown by the task, because it's meant to make asynchronous code look almost exactly like synchronous code. If you want to catch them, you can use a normal try...except clause.
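For example, a minimal sketch of that try...except approach, reusing the question's bad_task():
async def test():
    loop = asyncio.get_event_loop()
    task = loop.create_task(bad_task())
    try:
        await task
    except Exception as exc:
        # The exception re-raised by `await` is the same object the task
        # stores, so task.exception() returns it as well.
        assert exc is task.exception()
        print("caught:", repr(exc))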
As Matti explained, exceptions raised by a coroutine are propagated to the awaiting site. This is intentional, as it ensures that errors do not pass silently by default. However, if one needs to do so, it is definitely possible to await a task's completion without immediately accessing its result/exception.
Here is a simple and efficient way to do so, by using a small intermediate Future:
async def test():
    loop = asyncio.get_event_loop()
    task = loop.create_task(bad_task())
    task_done = loop.create_future()  # you could also use asyncio.Event
    # Arrange for task_done to complete once task completes.
    task.add_done_callback(task_done.set_result)
    # Wait for the task to complete. Since we're not obtaining its
    # result, this won't raise no matter what bad_task() does...
    await task_done
    # ...and this will work as expected.
    exp = task.exception()
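Another option (an alternative sketch, not from the answers above) is asyncio.wait(), which waits for tasks to finish but never consumes their results, so nothing is raised until you ask for it:
async def test():
    loop = asyncio.get_event_loop()
    task = loop.create_task(bad_task())
    # asyncio.wait() only waits; it does not re-raise the tasks' exceptions.
    done, pending = await asyncio.wait([task])
    exp = task.exception()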

Python - Using timeout while printing line by line in a subprocess with Popen

(in Python 3.5)
I am having difficulties printing stdout line by line (while running the program) while also maintaining the timeout function (to stop the program after some time).
I have:
import subprocess as sub
import io
file_name = 'program.exe'
dir_path = r'C:/directory/'
p = sub.Popen(file_name, cwd = dir_path, shell=True, stdout = sub.PIPE, stderr = sub.STDOUT)
And while running p, do these two things:
for line in io.TextIOWrapper(p.stdout, encoding="utf-8"):
    print(line)
And do:
try:
    outs = p.communicate(timeout=15)  # Just to use timeout
except Exception as e:
    print(str(e))
    p.kill()
The program should print every output line but should not run the simulation for more than 15 seconds.
If I use p.communicate before reading p.stdout, it will wait for the timeout or for the program to finish. If I use them the other way around, the timeout is never enforced.
I would like to do it without threading, and if possible without io too. It seems to be possible, but I don't know how (I need more practice and study). :-(
PS: The program I am running was written in Fortran and is used to simulate water flow. If I run the exe from Windows, it opens a cmd window and prints a line at each timestep. I am doing a sensitivity analysis, changing the inputs of the exe file.
That's because your process/child processes are not getting killed correctly. Just modify your try/except as below:
try:
    pid_id = p.pid
    outs = p.communicate(timeout=15)  # Just to use timeout
except Exception as e:
    print(str(e))
    # This will forcefully kill the process and all child processes associated with p
    sub.Popen('taskkill /F /T /PID %i' % pid_id)
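For completeness, here is a rough sketch of my own (assuming the simulation prints a line regularly) that combines line-by-line printing with a soft deadline and no extra threads. Note that reading the next line can still block until the child produces output, and on Windows you may still need the taskkill approach above to take down child processes:
import subprocess as sub
import time

def run_with_deadline(file_name, dir_path, timeout=15):
    p = sub.Popen(file_name, cwd=dir_path, shell=True,
                  stdout=sub.PIPE, stderr=sub.STDOUT,
                  universal_newlines=True)
    deadline = time.monotonic() + timeout
    try:
        for line in p.stdout:
            print(line, end='')
            if time.monotonic() > deadline:
                print('Timeout reached, killing the process.')
                p.kill()
                break
    finally:
        p.stdout.close()
        p.wait()
    return p.returncode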

Replacement for getstatusoutput in Python 3

Since the commands module has been deprecated since Python 2.6, I'm looking into the best way to replace commands.getstatusoutput, which returns a tuple of the command's return code and output. The subprocess module is the obvious choice; however, it doesn't offer a direct replacement for getstatusoutput. A potential solution is discussed in a related question concerning getstatusoutput; however, I'm not looking to rewrite the original function (which has less than 10 LOC anyway) but would like to know if there is a more straightforward way.
There is no direct replacement, because commands.getstatusoutput was a bad API; it combines stderr and stdout without giving an option to retrieve them separately.
The convenience API that you should be using is subprocess.check_output as it will throw an exception if the command fails.
Otherwise, it does appear somewhat of a deficiency that subprocess doesn't provide a method to retrieve output and status in a single call, but it's easy to work around; here's what the answer to the linked question should have been:
def get_status_output(*args, **kwargs):
    p = subprocess.Popen(*args, **kwargs)
    stdout, stderr = p.communicate()
    return p.returncode, stdout, stderr
If you want stdout and stderr together, use stderr=subprocess.STDOUT.
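For example, a hypothetical call (you have to pass stdout=subprocess.PIPE and stderr=subprocess.PIPE yourself for communicate() to capture anything):
import subprocess

status, out, err = get_status_output(['ls', '-l'],
                                     stdout=subprocess.PIPE,
                                     stderr=subprocess.PIPE)
print(status, out)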
getstatusoutput is back (from python 3.1) :) See: http://docs.python.org/3.3/library/subprocess.html#legacy-shell-invocation-functions
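So on Python 3 you can simply do, for instance:
import subprocess

# Returns (exit_status, output), with any trailing newline stripped.
status, output = subprocess.getstatusoutput('ls /nonexistent')
print(status, output)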
To answer the question from the title: here's an asyncio-based getstatusoutput() implementation. It is the code example from the docs, modified to follow the subprocess.getstatusoutput() interface more closely:
import asyncio
import locale
from asyncio.subprocess import PIPE, STDOUT

@asyncio.coroutine
def getstatusoutput(cmd):
    proc = yield from asyncio.create_subprocess_shell(cmd,
                                                      stdout=PIPE, stderr=STDOUT)
    try:
        stdout, _ = yield from proc.communicate()
    except:
        try:
            proc.kill()
        except ProcessLookupError:  # process is already dead
            pass
        raise
    finally:
        exitcode = yield from proc.wait()
    # return text
    output = stdout.decode(locale.getpreferredencoding(False))
    # universal newlines mode
    output = output.replace("\r\n", "\n").replace("\r", "\n")
    if output[-1:] == "\n":  # remove a trailing newline
        output = output[:-1]
    return (exitcode, output)
It works on both Windows and Unix. To run it (from the same example):
import os
from contextlib import closing

if os.name == 'nt':  # Windows
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)
else:
    loop = asyncio.get_event_loop()

with closing(loop):
    coro = getstatusoutput('python -m platform')
    exitcode, stdout = loop.run_until_complete(coro)
    if exitcode == 0:
        print("Platform:", stdout)
    else:
        print("Python failed with exit code %s: %s" % (exitcode, stdout))

Strange "BadZipfile: Bad CRC-32" problem

This code is a simplification of code in a Django app that receives an uploaded zip file via an HTTP multi-part POST and does read-only processing of the data inside:
#!/usr/bin/env python

import csv, sys, StringIO, traceback, zipfile
try:
    import io
except ImportError:
    sys.stderr.write('Could not import the `io` module.\n')

def get_zip_file(filename, method):
    if method == 'direct':
        return zipfile.ZipFile(filename)
    elif method == 'StringIO':
        data = file(filename).read()
        return zipfile.ZipFile(StringIO.StringIO(data))
    elif method == 'BytesIO':
        data = file(filename).read()
        return zipfile.ZipFile(io.BytesIO(data))

def process_zip_file(filename, method, open_defaults_file):
    zip_file = get_zip_file(filename, method)
    items_file = zip_file.open('items.csv')
    csv_file = csv.DictReader(items_file)
    try:
        for idx, row in enumerate(csv_file):
            image_filename = row['image1']
            if open_defaults_file:
                z = zip_file.open('defaults.csv')
                z.close()
        sys.stdout.write('Processed %d items.\n' % idx)
    except zipfile.BadZipfile:
        sys.stderr.write('Processing failed on item %d\n\n%s'
                         % (idx, traceback.format_exc()))

process_zip_file(sys.argv[1], sys.argv[2], int(sys.argv[3]))
Pretty simple. We open the zip file and one or two CSV files inside the zip file.
What's weird is that if I run this with a large zip file (~13 MB), have it instantiate the ZipFile from a StringIO.StringIO or an io.BytesIO (perhaps anything other than a plain filename? I had similar problems in the Django app when trying to create a ZipFile from a TemporaryUploadedFile, or even from a file object created by calling os.tmpfile() and shutil.copyfileobj()), and have it open TWO CSV files rather than just one, then it fails towards the end of processing. Here's the output that I see on a Linux system:
$ ./test_zip_file.py ~/data.zip direct 1
Processed 250 items.
$ ./test_zip_file.py ~/data.zip StringIO 1
Processing failed on item 242
Traceback (most recent call last):
File "./test_zip_file.py", line 26, in process_zip_file
for idx, row in enumerate(csv_file):
File ".../python2.7/csv.py", line 104, in next
row = self.reader.next()
File ".../python2.7/zipfile.py", line 523, in readline
return io.BufferedIOBase.readline(self, limit)
File ".../python2.7/zipfile.py", line 561, in peek
chunk = self.read(n)
File ".../python2.7/zipfile.py", line 581, in read
data = self.read1(n - len(buf))
File ".../python2.7/zipfile.py", line 641, in read1
self._update_crc(data, eof=eof)
File ".../python2.7/zipfile.py", line 596, in _update_crc
raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'
$ ./test_zip_file.py ~/data.zip BytesIO 1
Processing failed on item 242
Traceback (most recent call last):
File "./test_zip_file.py", line 26, in process_zip_file
for idx, row in enumerate(csv_file):
File ".../python2.7/csv.py", line 104, in next
row = self.reader.next()
File ".../python2.7/zipfile.py", line 523, in readline
return io.BufferedIOBase.readline(self, limit)
File ".../python2.7/zipfile.py", line 561, in peek
chunk = self.read(n)
File ".../python2.7/zipfile.py", line 581, in read
data = self.read1(n - len(buf))
File ".../python2.7/zipfile.py", line 641, in read1
self._update_crc(data, eof=eof)
File ".../python2.7/zipfile.py", line 596, in _update_crc
raise BadZipfile("Bad CRC-32 for file %r" % self.name)
BadZipfile: Bad CRC-32 for file 'items.csv'
$ ./test_zip_file.py ~/data.zip StringIO 0
Processed 250 items.
$ ./test_zip_file.py ~/data.zip BytesIO 0
Processed 250 items.
Incidentally, the code fails under the same conditions but in a different way on my OS X system. Instead of the BadZipfile exception, it seems to read corrupted data and gets very confused.
This all suggests to me that I am doing something in this code that you are not supposed to do, e.g. calling zip_file.open on a file while another file within the same ZipFile object is already open? This doesn't seem to be a problem when using ZipFile(filename), but perhaps it's problematic when passing ZipFile a file-like object, because of some implementation details in the zipfile module?
Perhaps I missed something in the zipfile docs? Or maybe it's not documented yet? Or (least likely), a bug in the zipfile module?
I might have just found the problem and the solution, but unfortunately I had to replace Python's zipfile module with a hacked one of my own (called myzipfile here).
$ diff -u ~/run/lib/python2.7/zipfile.py myzipfile.py
--- /home/msabramo/run/lib/python2.7/zipfile.py 2010-12-22 17:02:34.000000000 -0800
+++ myzipfile.py 2011-04-11 11:51:59.000000000 -0700
@@ -5,6 +5,7 @@
 import binascii, cStringIO, stat
 import io
 import re
+import copy
 try:
     import zlib # We may need its compression method
@@ -877,7 +878,7 @@
         # Only open a new file for instances where we were not
         # given a file object in the constructor
         if self._filePassed:
-            zef_file = self.fp
+            zef_file = copy.copy(self.fp)
         else:
             zef_file = open(self.filename, 'rb')
The problem in the standard zipfile module is that, when passed a file object (not a filename), it uses that same passed-in file object for every call to the open method. This means that tell and seek are called on the same file, so trying to open multiple files within the zip file causes the file position to be shared, and multiple open calls end up stepping all over each other. In contrast, when passed a filename, open opens a new file object. My solution is that when a file object is passed in, instead of using that file object directly, I create a copy of it.
This change to zipfile fixes the problems I was seeing:
$ ./test_zip_file.py ~/data.zip StringIO 1
Processed 250 items.
$ ./test_zip_file.py ~/data.zip BytesIO 1
Processed 250 items.
$ ./test_zip_file.py ~/data.zip direct 1
Processed 250 items.
but I don't know if it has other negative impacts on zipfile...
EDIT: I just found a mention of this in the Python docs that I had somehow overlooked before. At http://docs.python.org/library/zipfile.html#zipfile.ZipFile.open, it says:
Note: If the ZipFile was created by passing in a file-like object as the first argument to the
constructor, then the object returned by open() shares the ZipFile’s file pointer. Under these
circumstances, the object returned by open() should not be used after any additional operations
are performed on the ZipFile object. If the ZipFile was created by passing in a string (the
filename) as the first argument to the constructor, then open() will create a new file
object that will be held by the ZipExtFile, allowing it to operate independently of the ZipFile.
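Given that note, a workaround that does not require patching zipfile (a sketch of my own, assuming the member files fit comfortably in memory) is to read each member fully with ZipFile.read() instead of keeping several ZipExtFile objects open at once:
import csv
import StringIO
import zipfile

def process_zip_file(filename):
    zip_file = zipfile.ZipFile(filename)
    # Read each member completely into memory so that no two ZipExtFile
    # objects ever share the ZipFile's underlying file pointer.
    items_data = zip_file.read('items.csv')
    defaults_data = zip_file.read('defaults.csv')
    for idx, row in enumerate(csv.DictReader(StringIO.StringIO(items_data))):
        image_filename = row['image1']  # process the row as before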
What I did was update setuptools and then re-download it, and it works now:
https://pypi.python.org/pypi/setuptools/35.0.1
In my case, this solved the problem:
pip uninstall pillow
Could it be that you had it open on your desktop? That has happened to me sometimes, and the solution was just to run the code without having the files open outside of the Python session.