A Python script executes an IO-bound function many times (anywhere between 5,000 and 75,000 calls). This is still pretty performant by using
def _iterator(): ... # yields 5000-75000 different names
def _thread_function(name): ...
with concurrent.futures.ThreadPoolExecutor(max_workers=11) as executor:
    executor.map(_thread_function, _iterator(), timeout=44)
If a user presses CTRL-C, it just messes up a single thread. I want it to stop launching new threads, and either finish the currently running threads or kill them instantly, whichever.
How can I do that?
Exception handling in concurrent.futures.Executor.map might answer your question.
In essence, from the documentation of concurrent.futures.Executor.map
If a func call raises an exception, then that exception will be raised when its value is retrieved from the iterator.
As you are never retrieving the values from map(), the exception is never raised in your main thread.
Furthermore, from PEP 255
If an unhandled exception-- including, but not limited to, StopIteration --is raised by, or passes through, a generator function, then the exception is passed on to the caller in the usual way, and subsequent attempts to resume the generator function raise StopIteration. In other words, an unhandled exception terminates a generator's useful life.
Hence if you change your code to (notice the for loop):
def _iterator(): ... # yields 5000-75000 different names
def _thread_function(name): ...
with concurrent.futures.ThreadPoolExecutor(max_workers=11) as executor:
    for _ in executor.map(_thread_function, _iterator(), timeout=44):
        pass
The KeyboardInterrupt raised by CTRL-C will then be raised in the main thread, and by passing through the generator (executor.map(_thread_function, _iterator(), timeout=44)) it will terminate it, so no further work is submitted.
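If you also want to discard tasks that have been submitted but not started yet when the interrupt arrives, something along these lines might help (a sketch, not part of the original answer; Executor.shutdown(cancel_futures=True) requires Python 3.9+):
import concurrent.futures

def _iterator(): ...  # yields 5000-75000 different names
def _thread_function(name): ...

with concurrent.futures.ThreadPoolExecutor(max_workers=11) as executor:
    try:
        # Consuming the map() iterator lets exceptions surface in the main thread.
        for _ in executor.map(_thread_function, _iterator(), timeout=44):
            pass
    except KeyboardInterrupt:
        # Discard queued work that has not started yet; threads that are
        # already running their current call are allowed to finish.
        executor.shutdown(wait=False, cancel_futures=True)
        raise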
In essence my question is when and where is the asyncio.CancelledError
exception raised in the coroutine being cancelled?
I have an application with a couple of async tasks that run in a loop. At some
point I start those tasks like this:
async def connect(self):
    ...
    t1 = asyncio.create_task(task1())
    t2 = asyncio.create_task(task2())
    ...
    self._workers = [t1, t2, ...]
When disconnecting, I cancel the tasks like this:
async def disconnect(self):
    for task in self._workers:
        task.cancel()
This has been working fine. The documentation of Task.cancel says
The coroutine then has a chance to clean up or even deny the request by suppressing the exception with a
try … … except CancelledError … finally block. Therefore, unlike Future.cancel(), Task.cancel() does
not guarantee that the Task will be cancelled, although suppressing cancellation completely is
not common and is actively discouraged.
so in my workers I avoid doing stuff like this:
async def worker():
    while True:
        ...
        try:
            ...  # some work
        except:
            continue
but that means that now I have to explicitly put asyncio.CancelledError in the
except statement:
async def worker():
    while True:
        ...
        try:
            ...  # some work
        except asyncio.CancelledError:
            raise
        except:
            continue
which can be tedious, and I also have to make sure that anything I call from
my worker abides by this rule.
So now I'm not sure if this is a good practice at all. Now that I'm thinking
about it, I don't even know when exactly the exception is raised. I was
searching for a similar case here on SO and found this question, which also
raised the same question "When will this exception be thrown? And where?". The
answer says
This exception is thrown after task.cancel() is called. It is thrown inside the coroutine,
where it is caught in the example, and it is then re-raised to be thrown and caught
in the awaiting routine.
And while it makes sense, this got me thinking: this is async scheduling, the
tasks are not interrupted at any arbitrary place like with threads but they only
"give back control" to the event loop when a task does an await. Right?
So that means that checking everywhere whether
asyncio.CancelledError was raised might not be necessary. For example, let's
consider this example:
async def worker(interval=1):
    while True:
        try:
            # doing some work and no await is called in this block
            sync_call1()
            sync_call2()
            sync_call3()
        except asyncio.CancelledError:
            raise
        except:
            # deal with error
            pass
        await asyncio.sleep(interval)
So I think here the except asyncio.CancelledError is unnecessary, because this
error cannot "physically" be raised inside the try block at all, since the code
in the try block will never be interrupted by the event loop. The only place
where this task gives control back to the event loop is the sleep call, which
is not even inside a try block and hence doesn't suppress the exception. Is
my train of thought correct? If so, does that mean that I only have to account
for asyncio.CancelledError when I have an await in the try block? So would
this also be OK, knowing that worker() can be cancelled?
async def worker(interval=1):
    while True:
        try:
            # doing some work and no await is called in this block
            sync_call1()
            sync_call2()
            sync_call3()
        except:
            # deal with error
            pass
        await asyncio.sleep(interval)
And after reading the answer to the other SO question, I think I should also
wait for the cancelled tasks in my disconnect() function, shouldn't I? Like this?
async def disconnect(self):
    for task in self._workers:
        task.cancel()
    await asyncio.gather(*self._workers)
Is this correct?
Your reasoning is correct: if the code doesn't contain an awaiting construct, you can't get a CancelledError (at least not from task.cancel; someone could still raise it manually, but then you probably want to treat it as any other exception). Note that awaiting constructs include await, async for and async with.
Having said that, I would add that try: ... except: continue is an anti-pattern. You should always catch a more specific exception. If you do catch all exceptions, that should be only to perform some cleanup/logging before re-raising it. If you do so, you won't have a problem with CancelledError. If you absolutely must catch all exceptions, consider at least logging the fact that an exception was raised, so that it doesn't pass silently.
Python 3.8 made it much easier to catch exceptions other than CancelledError because it switched to deriving CancelledError from BaseException. In 3.8 except Exception won't catch it, resolving your issue.
To sum it up:
If you run Python 3.8 and later, use except Exception: traceback.print_exc(); continue.
In Python 3.7 and earlier you need to use the pattern put forth in the question. If it's a lot of typing, you can abstract it into a function, but that will still require some refactoring.
For example, you could define a utility function like this:
import asyncio
import traceback

def run_safe(thunk):
    try:
        thunk()
        return True
    except asyncio.CancelledError:
        raise
    except:
        traceback.print_exc()
        return False
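For illustration, reusing the question's hypothetical sync_call* functions, the worker loop could then be written as:
async def worker(interval=1):
    while True:
        # run_safe re-raises CancelledError and logs everything else
        run_safe(sync_call1)
        run_safe(sync_call2)
        run_safe(sync_call3)
        await asyncio.sleep(interval)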
I have a function that yields from a context manager:
def producer(pathname):
    with open(pathname) as f:
        while True:
            chunk = f.read(4)
            if not chunk:
                break
            yield chunk
It is not a problem when the generator is entirely consumed since, during the last iteration, the generator resumes execution after the yield statement, the loop breaks, and we cleanly exit the context manager.
However, if the generator is only partially consumed, and there are no more consumers to consume it entirely, will the generator remain suspended forever? In that case, we will never exit from the context manager. Would that mean the file will remain open for the rest of the program execution? Or at least until the generator is garbage collected? Is this a corner case I should take care of by myself, or can I rely on the Python runtime to close dangling context manager in time?
FWIW, I've seen Generator and context manager at the same time and How to use a python context manager inside a generator but I don't think they really answer the same question. Unless I missed something?
If you fail to consume the whole generator, the context manager won't be cleaned up until the generator is garbage collected, which may take quite a while if reference cycles are involved or you're running on a non-CPython interpreter.
You can work around this by close-ing the generator-iterator: every generator-iterator provides a close method that raises GeneratorExit at the suspended yield; the exception bubbles out of with statements and the like, making sure they're cleaned up deterministically.
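For instance, reusing the question's producer, a sketch of closing it by hand could look like this:
from itertools import islice

gen = producer(pathname)
try:
    first_chunks = list(islice(gen, 3))  # consume only part of the generator
finally:
    # close() raises GeneratorExit at the suspended yield, so the with block
    # inside producer() exits and the file is closed here.
    gen.close()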
To make the cleanup happen at a guaranteed point in time, you can use contextlib.closing to close the generator itself:
from contextlib import closing
with closing(producer(mypath)) as produced_items:
    for item in produced_items:
        ...  # Do stuff, maybe break loop early
Even if you break, return, or raise an exception, the with controlling produced_items will close it, which will in turn invoke cleanup for the with statements within it.
To accelerate a certain task, I'm subclassing Process to create a worker that will process data coming in samples. Some managing class will feed it data and read the outputs (using two Queue instances). For asynchronous operation I'm using put_nowait and get_nowait. At the end I'm sending a special exit code to my process, upon which it breaks its internal loop. However... it never happens. Here's a minimal reproducible example:
import multiprocessing as mp


class Worker(mp.Process):
    def __init__(self, in_queue, out_queue):
        super(Worker, self).__init__()
        self.input_queue = in_queue
        self.output_queue = out_queue

    def run(self):
        while True:
            received = self.input_queue.get(block=True)
            if received is None:
                break
            self.output_queue.put_nowait(received)
        print("\tWORKER DEAD")


class Processor():
    def __init__(self):
        # prepare
        in_queue = mp.Queue()
        out_queue = mp.Queue()
        worker = Worker(in_queue, out_queue)
        # get to work
        worker.start()
        in_queue.put_nowait(list(range(10**5)))  # XXX
        # clean up
        print("NOTIFYING")
        in_queue.put_nowait(None)
        #out_queue.get()  # XXX
        print("JOINING")
        worker.join()


Processor()
This code never completes, hanging permanently like this:
NOTIFYING
JOINING
WORKER DEAD
Why?
I've marked two lines with XXX. In the first one, if I send less data (say, 10**4), everything will finish normally (processes join as expected). Similarly in the second, if I get() after notifying the workers to finish. I know I'm missing something but nothing in the documentation seems relevant.
The documentation mentions that
When an object is put on a queue, the object is pickled and a background thread later flushes the pickled data to an underlying pipe. This has some consequences [...] After putting an object on an empty queue there may be an infinitesimal delay before the queue’s empty() method returns False and get_nowait() can return without raising queue.Empty.
https://docs.python.org/3.7/library/multiprocessing.html#pipes-and-queues
and additionally that
whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate.
https://docs.python.org/3.7/library/multiprocessing.html#multiprocessing-programming
This means that the behaviour you describe is probably caused by a race condition between self.output_queue.put_nowait(received) in the worker and joining the worker with worker.join() in the Processor's __init__. If joining is faster than feeding the item into the queue, everything finishes fine. If it is too slow, there is still an item in the queue, and the worker will not join.
Uncommenting the out_queue.get() in the main process would empty the queue, which allows joining. But since the get() call also needs to return if the queue happens to be empty already, using a time-out might be an option to wait out the race condition, e.g. out_queue.get(timeout=10).
It might also be important to protect the main routine, especially on Windows (python multiprocessing on windows, if __name__ == "__main__").
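Putting both points together, a sketch of the cleanup part (reusing the Worker class and the mp alias from the question; the timeout value is arbitrary) could look like this:
class Processor():
    def __init__(self):
        in_queue = mp.Queue()
        out_queue = mp.Queue()
        worker = Worker(in_queue, out_queue)
        worker.start()
        in_queue.put_nowait(list(range(10**5)))
        print("NOTIFYING")
        in_queue.put_nowait(None)
        # Drain the output queue before joining, so the worker's feeder
        # thread can flush its data and the process can terminate.
        print("DRAINING")
        out_queue.get(timeout=10)
        print("JOINING")
        worker.join()

if __name__ == "__main__":
    Processor()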
First I do apologize for posting code using a proprietary library similar to a subset of asyncio. Anyway the question is not related to that library and I'm sure it will be readable for everybody familiar with asynchronous Python.
I have a standard coroutine, e.g.:
async def tcoro():
    pass
that is used as a timeout handler. For that purpose it is wrapped into a Task object. Basically, the Task keeps the coroutine alive by repeatedly calling send on it. The task is scheduled to run when the operation times out:
timer = loop.call_later(Task(tcoro()).run, delay=timeout)
But when the operation succeeds, the pending timeout gets cancelled with:
timer.cancel()
which just sets a cancelled flag. Later, when the delay is over, the event loop will pop the timer object out of the queue and because it got cancelled in the meantime, it will be ignored. And that produces:
RuntimeWarning: coroutine 'tcoro' was never awaited
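For reference, the same warning can be reproduced with plain Python by creating a coroutine object that is never awaited or closed (a minimal sketch, independent of the proprietary library):
async def tcoro():
    pass

coro = tcoro()  # creates a coroutine object but never awaits it
del coro        # when it is garbage collected, CPython emits:
                # RuntimeWarning: coroutine 'tcoro' was never awaited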
My question is: how am I supposed to react to this warning when there is no problem with the program's functionality?
Should I ignore it? Should I suppress it somehow? Should I rewrite the code because it is badly designed?
The fix is very simple:
def on_timeout():
    Task(tcoro()).run()

timer = loop.call_later(on_timeout, delay=timeout)
But I'm not sure if the original code is broken.
I currently have code in the form of a generator which calls an IO-bound task. The generator actually calls sub-generators as well, so a more general solution would be appreciated.
Something like the following:
def processed_values(list_of_io_tasks):
    for task in list_of_io_tasks:
        value = slow_io_call(task)
        yield postprocess(value)  # in real version, would iterate over
                                  # processed_values2(value) here
I have complete control over slow_io_call, and I don't care in which order I get the items from processed_values. Is there something like coroutines I can use to get the yielded results in the fastest order by turning slow_io_call into an asynchronous function and using whichever call returns fastest? I expect list_of_io_tasks to be at least thousands of entries long. I've never done any parallel work other than with explicit threading, and in particular I've never used the various forms of lightweight threading which are available.
I need to use the standard CPython implementation, and I'm running on Linux.
Sounds like you are in search of multiprocessing.Pool(), specifically the Pool.imap_unordered() method.
Here is a port of your function to use imap_unordered() to parallelize calls to slow_io_call().
import multiprocessing

def processed_values(list_of_io_tasks):
    pool = multiprocessing.Pool(4)  # number of workers
    results = pool.imap_unordered(slow_io_call, list_of_io_tasks)
    while True:
        try:
            yield results.next(9999999)  # large time-out
        except StopIteration:
            return
Note that you could also iterate over results directly (i.e. for item in results: yield item, see the sketch below) without a while True loop; however, calling results.next() with a time-out value works around this multiprocessing keyboard interrupt bug and allows you to kill the main process and all subprocesses with Ctrl-C. Also note that results.next() raises StopIteration once there are no more items; since Python 3.7 (PEP 479), a StopIteration escaping a generator is converted into a RuntimeError, so the generator above catches it and simply returns instead of letting it propagate.
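For completeness, the direct-iteration variant mentioned above (which gives up the Ctrl-C workaround) might look like this:
def processed_values(list_of_io_tasks):
    pool = multiprocessing.Pool(4)  # number of workers
    for value in pool.imap_unordered(slow_io_call, list_of_io_tasks):
        yield value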
To use threads in place of processes, replace
import multiprocessing
with
import multiprocessing.dummy as multiprocessing
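For clarity, only the import line changes in the thread-backed variant; the rest of the generator stays the same (a sketch):
import multiprocessing.dummy as multiprocessing  # thread-backed Pool, same API

pool = multiprocessing.Pool(4)  # 4 worker threads instead of processes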