python concurrent.futures skip timeout processes - python-3.x

I am dealing with thousands of image urls and want to use concurrent.futures.ProcessPoolExecutor to speed things up.
Since some of the urls are broken or the images are large, the process function may hang or unexpectedly consume a lot of time. I want to add a timeout of, say, 10 seconds on the process function to get rid of these invalid images.
I tried setting the timeout param in futures.as_completed, and the TimeoutError is raised as expected. However, it seems that the main process still waits until the timed-out child process completes. Is there any approach to immediately kill the timed-out child process and put the next url into the pool?
from concurrent import futures

def process(url):
    ### Some time consuming operation
    return result

def main():
    urls = ['url1', 'url2', 'url3', ..., 'url100']
    with futures.ProcessPoolExecutor(max_workers=10) as executor:
        future_list = {executor.submit(process, url): url for url in urls}
        results = []
        try:
            for future in futures.as_completed(future_list, timeout=10):
                results.append(future.result())
        except futures.TimeoutError:
            print("timeout")
    print(results)

if __name__ == '__main__':
    main()
In the above example, suppose I have 100 urls and 10 of them are invalid and may take a lot of time. How do I get the list of processed results for the remaining 90 urls?

Not with the concurrent.futures library.
The pebble module was developed to overcome this limitation.
from pebble import ProcessPool
from concurrent.futures import TimeoutError

with ProcessPool() as pool:
    future = pool.schedule(function, args=(1, 2), timeout=5)

    try:
        result = future.result()  # blocks until results are ready
    except TimeoutError as error:
        print("Function took longer than %d seconds" % error.args[1])

Related

Best way to keep creating threads on variable list argument

I have an event that I am listening to every minute; it returns a list that could be empty, have 1 element, or more. For each element in that list, I'd like to run a function that monitors an event on that element every minute for 10 minutes.
For that I wrote this script:
from concurrent.futures import ThreadPoolExecutor
from time import sleep
import asyncio
import Client

client = Client()

def handle_event(event):
    for i in range(10):
        client.get_info(event)
        sleep(60)

async def main():
    while True:
        entires = client.get_new_entry()
        if len(entires) > 0:
            with ThreadPoolExecutor(max_workers=len(entires)) as executor:
                executor.map(handle_event, entires)
        await asyncio.sleep(60)

if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    loop.run_until_complete(main())
However, instead of continuing to monitor for new entries, it blocks while the previous entries are still being monitored.
Any idea how I could do that?
First let me explain why your program doesn't work the way you want it to: It's because you use the ThreadPoolExecutor as a context manager, which will not close until all the threads started by the call to map are finished. So main() waits there, and the next iteration of the loop can't happen until all the work is finished.
There are ways around this. Since you are using asyncio already, one approach is to move the creation of the Executor to a separate task. Each iteration of the main loop starts one copy of this task, which runs as long as it takes to finish. It's an async def function, so many copies of this task can run concurrently.
I changed a few things in your code. Instead of Client I just used some simple print statements. I pass a list of integers, of random length, to handle_event. I increment a counter each time through the while True: loop, and add 10 times the counter to every integer in the list. This makes it easy to see how old calls continue for a time, mixing with new calls. I also shortened your time delays. All of these changes were for convenience and are not important.
The important change is to move ThreadPoolExecutor creation into a task. To make it cooperate with other tasks, it must contain an await expression, and for that reason I use executor.submit rather than executor.map. submit returns a concurrent.futures.Future, which provides a convenient way to await the completion of all the calls. executor.map, on the other hand, returns an iterator; I couldn't think of any good way to convert it to an awaitable object.
To convert a concurrent.futures.Future to an asyncio.Future, an awaitable, there is a function asyncio.wrap_future. When all the futures are complete, I exit from the ThreadPoolExecutor context manager. That will be very fast since all of the Executor's work is finished, so it does not block other tasks.
import random
from concurrent.futures import ThreadPoolExecutor
from time import sleep
import asyncio

def handle_event(event):
    for i in range(10):
        print("Still here", event)
        sleep(2)

async def process_entires(counter, entires):
    print("Counter", counter, "Entires", entires)
    x = [counter * 10 + a for a in entires]
    with ThreadPoolExecutor(max_workers=len(entires)) as executor:
        futs = []
        for z in x:
            futs.append(executor.submit(handle_event, z))
        await asyncio.gather(*(asyncio.wrap_future(f) for f in futs))

async def main():
    counter = 0
    while True:
        entires = [0, 1, 2, 3, 4][:random.randrange(5)]
        if len(entires) > 0:
            counter += 1
            asyncio.create_task(process_entires(counter, entires))
        await asyncio.sleep(3)

if __name__ == "__main__":
    asyncio.run(main())
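Not part of the original answer, but worth noting: loop.run_in_executor can submit work to the same ThreadPoolExecutor and already returns awaitable asyncio futures, so wrap_future is unnecessary. A sketch of process_entires written that way, reusing handle_event and the imports from the snippet above:
async def process_entires(counter, entires):
    loop = asyncio.get_running_loop()
    x = [counter * 10 + a for a in entires]
    with ThreadPoolExecutor(max_workers=len(entires)) as executor:
        # run_in_executor wraps each call in an asyncio future that gather() can await directly
        await asyncio.gather(*(loop.run_in_executor(executor, handle_event, z) for z in x))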

Multithreading in Python within a for loop

Let's say I have a program in Python which looks like this:
import time

def send_message_realtime(s):
    print("Real Time: ", s)

def send_message_delay(s):
    time.sleep(5)
    print("Delayed Message ", s)

for i in range(10):
    send_message_realtime(str(i))
    time.sleep(1)
    send_message_delay(str(i))
What I am trying to do here is some sort of multithreading, so that the contents of my main for loop continues to execute without having to wait for the delay caused by time.sleep(5) in the delayed function.
Ideally, the piece of code that I am working on looks something like the snippet below. I get a message from some API endpoint which I want to send to a particular telegram channel in real time (paid subscribers), but I also want to send it to another channel delayed by exactly 10 minutes (600 seconds), since those are free members. The problem I am facing is that I want to keep sending messages in real time to my paid subscribers, and create a new thread/process for the delayed message that runs independently of the main while loop.
def send_message_realtime(my_realtime_message):
    telegram.send(my_realtime_message)

def send_message_delayed(my_realtime_message):
    time.sleep(600)
    telegram.send(my_realtime_message)

while True:
    my_realtime_message = api.get()
    send_message_realtime(my_realtime_message)
    send_message_delayed(my_realtime_message)
I think something like ThreadPoolExecutor does what you are looking for:
import time
from concurrent.futures.thread import ThreadPoolExecutor

def send_message_realtime(s):
    print("Real Time: ", s)

def send_message_delay(s):
    time.sleep(5)
    print("Delayed Message ", s)

def work_to_do(i):
    send_message_realtime(str(i))
    time.sleep(1)
    send_message_delay(str(i))

with ThreadPoolExecutor(max_workers=4) as executor:
    for i in range(10):
        executor.submit(work_to_do, i)
max_workers is the number of messages that can be processed in parallel at any given moment.
Instead of a multithreaded solution, you can also use multiprocessing, for instance:
from multiprocessing import Pool
...

with Pool(4) as p:
    print(p.map(work_to_do, range(10)))
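For the actual real-time/delayed split described in the question, one possible sketch (telegram and api are the hypothetical objects from the question) is to send the real-time message synchronously and submit only the delayed send to the executor, so the main loop never sleeps for the free-tier delay:
import time
from concurrent.futures import ThreadPoolExecutor

def send_message_delayed(msg):
    time.sleep(600)     # free members get the message 10 minutes later
    telegram.send(msg)  # hypothetical telegram client from the question

executor = ThreadPoolExecutor(max_workers=8)
while True:
    my_realtime_message = api.get()          # hypothetical API from the question
    telegram.send(my_realtime_message)       # paid subscribers get it immediately
    executor.submit(send_message_delayed, my_realtime_message)
The executor is deliberately not used as a context manager here, because shutting it down would wait for every pending delayed send and block the loop.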

How to set timeout for a block of code which is not a function python3

After spending many hours looking for a solution on Stack Overflow, I did not find a good way to set a timeout for a block of code. There are approaches for setting a timeout on a function. Nevertheless, I would like to know how to set a timeout without having a function. Let's take the following code as an example:
print("Doing different things")
for i in range(0,10)
# Doing some heavy stuff
print("Done. Continue with the following code")
So, how would you break out of the for loop if it has not finished after x seconds? Just continue with the code (maybe setting a bool variable to record that the timeout was reached), even though the for loop did not finish properly.
I don't think this can be implemented efficiently without using functions. Look at this code:
import datetime as dt

print("Doing different things")

# Store the start time and how long to allow
time_out_after = dt.timedelta(seconds=60)
start_time = dt.datetime.now()

for i in range(10):
    if dt.datetime.now() > start_time + time_out_after:
        break
    # Doing some heavy stuff

print("Done. Continue with the following code")
The problem: the timeout is only checked at the beginning of each loop cycle, so it may take longer than the specified timeout period to break out of the loop. In the worst case it may never interrupt the loop, because it cannot interrupt code that never finishes an iteration.
Update:
As the OP replied that they want a more efficient way, here is a proper way to do it, but it uses functions.
import asyncio

async def test_func():
    print('doing thing here , it will take long time')
    await asyncio.sleep(3600)  # this emulates a heavy task with an actual sleep of one hour
    return 'yay!'  # this will not be executed because the timeout occurs earlier

async def main():
    # Wait for at most 1 second
    try:
        result = await asyncio.wait_for(test_func(), timeout=1.0)  # call your function with a specific timeout
        # do something with the result
    except asyncio.TimeoutError:
        # when the timeout happens, the program breaks out of the test function and executes the code here
        print('timeout!')
    print('lets continue to do other things')

asyncio.run(main())
Expected output:
doing thing here , it will take long time
timeout!
lets continue to do other things
Note:
The timeout will now happen after exactly the time you specify; in this example code, after one second.
You would replace this line:
await asyncio.sleep(3600)
with your actual task code.
Try it and let me know what you think. Thank you.
Read the asyncio docs:
link
Update 24/2/2019:
As the OP noted, asyncio.run was introduced in Python 3.7, and they asked for an alternative for Python 3.6.
asyncio.run alternative for Python older than 3.7:
replace
asyncio.run(main())
with this code for older versions (I think 3.4 to 3.6):
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
You may try the following way:
import time

start = time.time()
for val in range(10):
    # some heavy stuff
    time.sleep(.5)
    if time.time() - start > 3:  # 3 is the timeout in seconds
        print('loop stopped at', val)
        break  # stop the loop, or sys.exit() to stop the script
else:
    print('successfully completed')
I guess it is a viable approach. The actual timeout is greater than 3 seconds and depends on the execution time of a single step.
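If the same elapsed-time check is needed in several places, it can be factored into a small helper; a minimal sketch (run_with_deadline is just an illustrative name, and do_heavy_stuff is a placeholder for the loop body):
import time

def run_with_deadline(items, step, timeout):
    """Call step(item) for each item, stopping once timeout seconds have elapsed.

    Returns True if every item was processed before the deadline.
    """
    start = time.monotonic()
    for item in items:
        step(item)
        if time.monotonic() - start > timeout:
            return False
    return True

# Example: completed = run_with_deadline(range(10), do_heavy_stuff, timeout=3)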

Python Fire dynamic urls using multithreading

I'm new to Python threading, and I've gone through multiple posts but I really did not understand how to use it. However, I tried to complete my task, and I want to check if I'm doing it with the right approach.
The task is:
Read a big CSV containing around 20K records, fetch the id from each record, and fire an HTTP API call for each record of the CSV.
import csv
import threading
import time

import requests

t1 = time.time()
file_data_obj = csv.DictReader(open(file_path, 'rU'))
threads = []
for record in file_data_obj:
    apiurl = "https://www.api-server.com?id=" + record.get("acc_id", "")
    thread = threading.Thread(target=requests.get, args=(apiurl,))
    thread.start()
    threads.append(thread)

t2 = time.time()
for thread in threads:
    thread.join()

print("Total time required to process a file - {} Secs".format(t2 - t1))
As there are 20K records, would it start 20K threads? Or will the OS/Python handle it? If so, can we restrict it?
How can I collect the response returned by requests.get?
Would t2 - t1 really give me the time required to process the whole file?
As there are 20K records, would it start 20K threads? Or will the OS/Python handle it? If so, can we restrict it?
Yes - it will start a thread for each iteration. The maximum number of threads depends on your OS.
How can I grab the response returned by requests.get?
If you want to use the threading module only, you'll have to make use of a Queue. Threads return None by design, hence you'll have to implement a line of communication between the threads and your main loop yourself.
from queue import Queue
from threading import Thread
import time

import requests

# A queue onto which the threads put their responses
q = Queue()
threads = []

def return_get(q, apiurl):
    q.put(requests.get(apiurl))

for record in file_data_obj:
    apiurl = "https://www.api-server.com?id=" + record.get("acc_id", "")
    t = Thread(target=return_get, args=(q, apiurl))
    t.start()
    threads.append(t)

for thread in threads:
    thread.join()

while not q.empty():
    r = q.get()  # Fetches the first item from the queue
    print(r.text)
An alternative is to use a worker pool.
from concurrent.futures import ThreadPoolExecutor

import requests

threads = []
pool = ThreadPoolExecutor(10)

# Submit work to the pool
for record in file_data_obj:
    apiurl = "https://www.api-server.com?id=" + record.get("acc_id", "")
    t = pool.submit(requests.get, apiurl)
    threads.append(t)

for t in threads:
    print(t.result())
You can use ThreadPoolExecutor
Retrieve a single page and report the URL and contents
import concurrent.futures
import urllib.request

def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()
Create pool executor with N workers
with concurrent.futures.ThreadPoolExecutor(max_workers=N_workers) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
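Tying this back to the CSV question, the same pattern can key each future by the record id so responses can be matched afterwards; a sketch that assumes the requests library and the file_data_obj reader from the question:
import concurrent.futures

import requests

results = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    # Map each future to the record id it belongs to
    future_to_id = {
        executor.submit(requests.get, "https://www.api-server.com?id=" + record.get("acc_id", "")):
            record.get("acc_id", "")
        for record in file_data_obj
    }
    for future in concurrent.futures.as_completed(future_to_id):
        acc_id = future_to_id[future]
        try:
            results[acc_id] = future.result().status_code
        except Exception as exc:
            print('%r generated an exception: %s' % (acc_id, exc))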

Monitoring the asyncio event loop

I am writing an application using python3 and am trying out asyncio for the first time. One issue I have encountered is that some of my coroutines block the event loop for longer than I like. I am trying to find something along the lines of top for the event loop that will show how much wall/cpu time is being spent running each of my coroutines. If there isn't anything already existing does anyone know of a way to add hooks to the event loop so that I can take measurements?
I have tried using cProfile which gives some helpful output, but I am more interested in time spent blocking the event loop, rather than total execution time.
The event loop can already track whether coroutines take too much CPU time to execute. To see it, you should enable debug mode with the set_debug method:
import asyncio
import time

async def main():
    time.sleep(1)  # Block event loop

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.set_debug(True)  # Enable debug
    loop.run_until_complete(main())
In the output you'll see:
Executing <Task finished coro=<main() [...]> took 1.016 seconds
By default it shows warnings for coroutines that block for more than 0.1 sec. It's not documented, but based on the asyncio source code, it looks like you can change the slow_callback_duration attribute to modify this value.
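For example, a small sketch that lowers the threshold (the 0.05 value is arbitrary, chosen only for illustration):
import asyncio
import time

async def main():
    time.sleep(0.2)  # blocks the event loop for 200 ms

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.set_debug(True)
    loop.slow_callback_duration = 0.05  # warn about anything that blocks for 50 ms or more
    loop.run_until_complete(main())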
You can use call_later. Periodically run a callback that logs/notifies the difference between the loop's time and the expected interval time.
import asyncio
import logging

class EventLoopDelayMonitor:

    def __init__(self, loop=None, start=True, interval=1, logger=None):
        self._interval = interval
        self._log = logger or logging.getLogger(__name__)
        self._loop = loop or asyncio.get_event_loop()
        if start:
            self.start()

    def run(self):
        self._loop.call_later(self._interval, self._handler, self._loop.time())

    def _handler(self, start_time):
        latency = (self._loop.time() - start_time) - self._interval
        self._log.error('EventLoop delay %.4f', latency)
        if not self.is_stopped():
            self.run()

    def is_stopped(self):
        return self._stopped

    def start(self):
        self._stopped = False
        self.run()

    def stop(self):
        self._stopped = True
example
import time

async def main():
    EventLoopDelayMonitor(interval=1)
    await asyncio.sleep(1)
    time.sleep(2)
    await asyncio.sleep(1)
    await asyncio.sleep(1)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
output
EventLoop delay 0.0013
EventLoop delay 1.0026
EventLoop delay 0.0014
EventLoop delay 0.0015
For anyone reading this in 2019, this might be a better answer: yappi. With yappi version >= 1.2.1, you can natively profile coroutines and see exactly how much wall or CPU time is spent inside a coroutine.
See here for details on this coroutine profiling.
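A sketch of what that looks like (it assumes yappi >= 1.2.1 is installed; wall-clock mode is used so time spent blocking the loop is included):
import asyncio
import time

import yappi

async def main():
    time.sleep(1)           # blocking call, shows up as wall time inside main()
    await asyncio.sleep(1)  # awaited sleep, does not block the loop

yappi.set_clock_type("wall")  # measure wall time rather than CPU time
yappi.start()
asyncio.run(main())
yappi.stop()
yappi.get_func_stats().print_all()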
To expand a bit on one of the answers, if you want to monitor your loop and detect hangs, here's a snippet to do just that. It launches a separate thread that checks whether the loop's tasks yielded execution recently enough.
import threading

def monitor_loop(loop, delay_handler):
    last_call = loop.time()
    INTERVAL = .5  # How often to poll the loop and check the current delay.

    def run_last_call_updater():
        loop.call_later(INTERVAL, last_call_updater)

    def last_call_updater():
        nonlocal last_call
        last_call = loop.time()
        run_last_call_updater()

    run_last_call_updater()

    def last_call_checker():
        threading.Timer(INTERVAL / 2, last_call_checker).start()
        if loop.time() - last_call > INTERVAL:
            delay_handler(loop.time() - last_call)

    threading.Thread(target=last_call_checker).start()
