Generator function causing exceptions to be caught after all processes complete - python-3.x

I wrote this short POC to help illustrate the issue I'm having, in the hope that someone can explain what is going on and how I can fix it and/or make it more efficient.
My reason for using iterators, itertools and generators is that I didn't want to store a huge list in memory; as I scale up, the list will become unmanageable, and I didn't want to loop over the entire list every single time I need to do something. Note that I am fairly new to generators, iterators and multiprocessing and wrote this code today, so if it is clear that I am misunderstanding how these things are supposed to work, please educate me and help me improve my code.
You should be able to run the code as is and see the problem I am facing. I expect that as soon as the exception is caught it gets raised and the script dies, but what I see happening is that the exception gets caught and the other processes continue.
If I comment out the generateRange generator, create a dummy list, and pass it in with futures = (map(executor.submit, itertools.repeat(execute), mylist)), the exception is caught and the script exits as intended.
My guess is that the generator/iterator has to finish generating the range before the script can die, which, to my understanding, was not supposed to be the case.
The reason I opted for a generator function/iterators is that the objects are only produced when they are needed.
Is there a way for me to stop the generator from continuing and let the exception be raised appropriately?
Here is my POC:
import concurrent.futures
import time

PRIMES = [0] * 80

def is_prime(n):
    print("Enter")
    time.sleep(5)
    print("End")
    1/0

child = []

def main():
    with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
        for i in PRIMES:
            child.append(executor.submit(is_prime, i))
        for future in concurrent.futures.as_completed(child):
            if future.exception() is not None:
                print("Throw an exception")
                raise future.exception()

if __name__ == '__main__':
    main()
EDIT: I updated the POC with something simpler.

It is not possible to cancel futures that are already running, but this at least ensures that only a few more tasks run after the exception is raised:
import concurrent.futures
import time

PRIMES = [0] * 80

def is_prime(n):
    print("Enter")
    time.sleep(5)
    print("End")
    1/0

child = []

def main():
    with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
        for i in PRIMES:
            child.append(executor.submit(is_prime, i))
        for future in concurrent.futures.as_completed(child):
            if future.exception() is not None:
                for fut in child:
                    fut.cancel()
                print("Throw an exception")
                raise future.exception()

if __name__ == '__main__':
    main()
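On Python 3.9 and later there is also executor.shutdown(cancel_futures=True), which drops every future still waiting in the queue in one call instead of looping over them yourself. A minimal sketch of that variant, using the same dummy workload as above:

import concurrent.futures
import time

PRIMES = [0] * 80

def is_prime(n):
    print("Enter")
    time.sleep(5)
    print("End")
    1/0

def main():
    with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
        futures = [executor.submit(is_prime, i) for i in PRIMES]
        for future in concurrent.futures.as_completed(futures):
            if future.exception() is not None:
                # Python 3.9+: cancel everything still pending in the queue;
                # the future that is already running still finishes.
                executor.shutdown(wait=False, cancel_futures=True)
                print("Throw an exception")
                raise future.exception()

if __name__ == '__main__':
    main()

Futures whose worker process has already started still run to completion; only the pending ones are dropped.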

Related

Python script with Multithreading was killed due to out of memory

I have used the threading module to process the data faster. The Python program is killed because memory usage keeps increasing as time goes on. Here is a simple example that reproduces the issue. What is wrong with this code? Where is the memory leak happening? Thanks for your help.
import threading
import time

def f1():
    return

def f2():
    for i in range(1, 300):
        t = threading.Thread(target=f1)
        t.start()
    return

def main():
    while True:
        for i in range(1, 200):
            t = threading.Thread(target=f2)
            t.start()
        time.sleep(0.5)

if __name__ == '__main__':
    main()
You've got threads creating threads??? 200 threads, each creating 300 more, for a total of roughly 60,000 threads, and the while True loop starts another batch like that every half second.
I think any machine will likely run out of memory trying to do this.
Your code has no memory leak, and there is nothing 'wrong' with it in the narrow sense; it's just that what you are trying to do is, well, completely wrong.
So perhaps you should explain a bit of background about what you're trying to achieve and why.
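If the underlying goal is simply to run many small jobs concurrently, the usual fix is a bounded pool: a fixed number of worker threads pulls jobs from a queue, so memory use stays flat no matter how many jobs are submitted. A minimal sketch, assuming the real per-item work lives in a function like f1:

import concurrent.futures

def f1():
    # Stand-in for the real per-item work.
    return

def main():
    # A fixed-size pool reuses the same 20 worker threads instead of
    # spawning tens of thousands of short-lived ones.
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        futures = [executor.submit(f1) for _ in range(60000)]
        concurrent.futures.wait(futures)

if __name__ == '__main__':
    main()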

In Python asyncio, why does line-per-file processing blocks method?

I'm totally new to Python's asyncio. I understand the idea, but even the simplest task won't work due to a lack of understanding on my side.
Here's my code, which tries to read a file (and ultimately process each line of it) regularly:
#!/usr/bin/env python3
import asyncio
import aiofiles

async def main():
    async def work():
        while True:
            async with aiofiles.open('../v2.rst', 'r') as f:
                async for line in f:
                    # real work will happen here
                    pass
            print('loop')
            await asyncio.sleep(2)

    tasks = asyncio.gather(
        work(),
    )
    await asyncio.sleep(10)

    # Cancel tasks
    tasks.add_done_callback(lambda r: r.exception())
    tasks.cancel()

if __name__ == '__main__':
    asyncio.run(main())
The work function should read the file, do some line-by-line processing and then wait 2 seconds.
What happens is that the function does "nothing": it blocks, and I never see loop printed.
Where is my error in understanding asyncio?
The code hides the exception because the callback installed with add_done_callback retrieves the exception, only to immediately discard it. This prevents the (effectively unhandled) exception from getting logged by asyncio, which happens if you comment out the line with add_done_callback.
Also:
the code calls gather without awaiting it, either immediately after the call or later.
it unnecessarily invokes gather with a single coroutine. If the idea is to run the coroutine in the background, the idiomatic way to do so is with asyncio.create_task(work()).
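Putting those two fixes together, a corrected version might look something like the sketch below (the file name is kept from the question, and the 10-second cancel just mirrors the original structure):

#!/usr/bin/env python3
import asyncio
import aiofiles

async def work():
    while True:
        async with aiofiles.open('../v2.rst', 'r') as f:
            async for line in f:
                # real work will happen here
                pass
        print('loop')
        await asyncio.sleep(2)

async def main():
    # Run the worker in the background; create_task is the idiomatic way
    # to do that for a single coroutine.
    task = asyncio.create_task(work())
    await asyncio.sleep(10)
    task.cancel()
    try:
        await task  # surface the CancelledError here instead of discarding it
    except asyncio.CancelledError:
        pass

if __name__ == '__main__':
    asyncio.run(main())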

How to properly memoize when using a ProcessPoolExecutor?

I suspect that something like:
@memoize
def foo(arg):
    return something_expensive

def main():
    with ProcessPoolExecutor(10) as pool:
        futures = {pool.submit(foo, arg): arg for arg in args}
        for future in concurrent.futures.as_completed(futures):
            arg = futures[future]
            try:
                result = future.result()
            except Exception as e:
                sys.stderr.write("Failed to run foo() on {}\nGot {}\n".format(arg, e))
            else:
                print(result)
won't work (assuming @memoize is a typical dict-based cache), because I am using a multiprocessing pool and the processes don't share much state. At least it doesn't seem to work.
What is the correct way to memoize in this scenario? Ultimately I'd also like to pickle the cache to disk and load it on subsequent runs.
You can use a Manager().dict() from multiprocessing, which uses a Manager process to proxy access to a shared dict between processes; its contents can be copied into a plain dict and pickled. I decided to use multithreading instead, though, because it's an I/O-bound app, and since threads share one memory space I don't need all the manager machinery; I can just use a regular dict.
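A minimal sketch of the Manager approach, with a hypothetical expensive() function standing in for the real computation and the shared cache passed explicitly to each worker (the decorator form from the question doesn't translate directly, because the shared dict can only be created after the Manager has started):

import concurrent.futures
import multiprocessing

def expensive(x):
    # Stand-in for the real expensive computation.
    return x * x

def foo(cache, arg):
    # cache is a Manager dict proxy shared by all worker processes.
    if arg not in cache:
        cache[arg] = expensive(arg)
    return cache[arg]

def main():
    args = [1, 2, 3, 2, 1]
    with multiprocessing.Manager() as manager:
        cache = manager.dict()
        with concurrent.futures.ProcessPoolExecutor(4) as pool:
            futures = {pool.submit(foo, cache, arg): arg for arg in args}
            for future in concurrent.futures.as_completed(futures):
                print(futures[future], future.result())
        # dict(cache) is a plain dict that can be pickled to disk
        # and reloaded on the next run.
        print(dict(cache))

if __name__ == '__main__':
    main()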

Asyncio Queue waits until it is full before get returns something

I'm having a weird issue with asyncio.Queue: instead of returning an item as soon as one is available, the queue waits until it is full before returning anything. I noticed that, while using a queue to store frames collected from cv2.VideoCapture, the larger the maxsize of the queue, the longer it took to show anything on screen, and then it looked like a replay of all the frames that had accumulated in the queue.
Is that a feature, a bug, or am I just using this wrong?
Anyway, here is my code:
import asyncio
import cv2
import numpy as np

async def collecting_loop(queue):
    print("cl")
    cap = cv2.VideoCapture(0)
    while True:
        _, img = cap.read()
        await queue.put(img)

async def processing_loop(queue):
    print("pl")
    await asyncio.sleep(0.1)
    while True:
        img = await queue.get()
        cv2.imshow('img', img)
        cv2.waitKey(5)

async def main(e_loop):
    print("running main")
    queue = asyncio.Queue(loop=e_loop, maxsize=10)
    await asyncio.gather(collecting_loop(queue), processing_loop(queue))

loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(main(e_loop=loop))
except KeyboardInterrupt:
    pass
finally:
    loop.close()
Is [the queue getter not waking up until the queue fills up] a feature, a bug, or am i just using this wrong?
You're using it wrong, but in a subtle way. As Andrew explained, queue.put doesn't guarantee a task switch, and the collector coroutine only runs blocking code and queue.put. Although each blocking read is short, asyncio doesn't know that and just sees you invoking queue.put in a really tight loop. The queue getters simply don't get a chance to run until the queue fills up.
The correct way to integrate asyncio and OpenCV is to run the OpenCV code in a separate thread and have the asyncio event loop wait for it to finish. The run_in_executor method makes that really simple:
async def collecting_loop(queue):
    print("cl")
    loop = asyncio.get_event_loop()
    cap = cv2.VideoCapture(0)
    while True:
        _, img = await loop.run_in_executor(None, cap.read)
        await queue.put(img)
run_in_executor will automatically suspend the collector coroutine while waiting for a new frame, allowing for the queued frame(s) to be processed in time.
The problem is that await q.put() doesn't switch to another task on every call; it actually does so only when inserting a new value is suspended by the transition into the queue-full state.
Inserting await asyncio.sleep(0) forces a task switch, just as in multithreaded code file.read() doesn't force an OS thread switch but time.sleep(0) does.
Misunderstandings like this are pretty common for newcomers; I discussed a very similar problem yesterday, see the github issue.
P.S.
Your code actually has a much worse problem: you call blocking synchronous code from an async function, and that is just not how asyncio works. If no asynchronous OpenCV API exists (yet), you should run the OpenCV functions in a separate thread. A library like janus can help with passing data between sync and async code.
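For completeness, a minimal sketch of the asyncio.sleep(0) workaround applied to the collector from the question (using the question's imports); the run_in_executor version above is still the better fix, because cap.read() blocks the event loop:

async def collecting_loop(queue):
    print("cl")
    cap = cv2.VideoCapture(0)
    while True:
        _, img = cap.read()  # still a blocking call; see the threaded version above
        await queue.put(img)
        # Yield control explicitly so the getter can run even when the
        # queue is not yet full.
        await asyncio.sleep(0)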

Getting a thread to leave an (intentional) infinite loop?

I have a background thread that main starts. The background thread can open a number of different scripts, but occasionally it gets stuck in an infinite print loop like this.
In thing.py:
from threading import Thread

import foo

thread_list = []

def main():
    thr = Thread(target=background)
    thr.start()
    thread_list.append(thr)

def background():
    getattr(foo, 'bar')()
    return

And then in foo.py:
def bar():
    while True:
        print("stuff")
This is what it's supposed to do, but I want to be able to kill it when I need to. Is there a way for me to kill the background thread and all the functions it has called? I've tried putting flags in background to return when the flag goes high, but background is never able to check the flags since it's waiting for bar to return.
EDIT: foo.py is not my code, so I'm hesitant to edit it. Ideally I could do this without modifying foo.py, but if it's impossible to avoid, that's okay.
First of all, it is very difficult (if possible at all) to control threads from other threads, no matter what language you are using, because of potential safety issues. So what you do is create a shared object which both threads can freely access, and set a flag on it.
But luckily, in Python each thread has its own Thread object, which we can use:
from threading import Thread

import foo

thread_list = []

def main():
    thr = Thread(target=background)
    thr.exit_requested = False
    thr.start()
    thread_list.append(thr)

def background():
    getattr(foo, 'bar')()
    return
And in foo:
import threading

def bar():
    th = threading.current_thread()
    # What happens when bar() is called from the main thread?
    # The commented code is not thread safe.
    # if not hasattr(th, 'exit_requested'):
    #     th.exit_requested = False
    while not th.exit_requested:
        print("stuff")
This will probably be hard to maintain and debug, though; treat it more like a hack. A cleaner way would be to create a shared object and pass it around to all calls.
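A minimal sketch of that cleaner approach, assuming foo.bar can be changed to accept the shared object (here a threading.Event):

import threading
import time

# In this variant foo.bar takes the stop signal explicitly.
def bar(stop_event):
    while not stop_event.is_set():
        print("stuff")

def main():
    stop_event = threading.Event()
    thr = threading.Thread(target=bar, args=(stop_event,))
    thr.start()
    time.sleep(0.1)      # let the loop run for a moment
    stop_event.set()     # ask the loop to exit cleanly
    thr.join()

if __name__ == '__main__':
    main()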
