Why do I need to cancel tasks in this queue example? - python-3.x

I am working through the Python documentation to study for my job. I am new to Python and to programming in general, and I do not yet understand programming concepts like async operations very well.
I am using Fedora 29 with Python 3.7.3 to try the examples for queues and the asyncio library.
Here is the example of a queue with async operations:
import asyncio
import random
import time


async def worker(name, queue):
    while True:
        # Get a "work item" out of the queue.
        sleep_for = await queue.get()

        # Sleep for the "sleep_for" seconds.
        await asyncio.sleep(sleep_for)

        # Notify the queue that the "work item" has been processed.
        queue.task_done()

        print(f'{name} has slept for {sleep_for:.2f} seconds')


async def main():
    # Create a queue that we will use to store our "workload".
    queue = asyncio.Queue()

    # Generate random timings and put them into the queue.
    total_sleep_time = 0
    for _ in range(20):
        sleep_for = random.uniform(0.05, 1.0)
        total_sleep_time += sleep_for
        queue.put_nowait(sleep_for)

    # Create three worker tasks to process the queue concurrently.
    tasks = []
    for i in range(3):
        task = asyncio.create_task(worker(f'worker-{i}', queue))
        tasks.append(task)

    # Wait until the queue is fully processed.
    started_at = time.monotonic()
    await queue.join()
    total_slept_for = time.monotonic() - started_at

    # Cancel our worker tasks.
    for task in tasks:
        task.cancel()
    # Wait until all worker tasks are cancelled.
    await asyncio.gather(*tasks, return_exceptions=True)

    print('====')
    print(f'3 workers slept in parallel for {total_slept_for:.2f} seconds')
    print(f'total expected sleep time: {total_sleep_time:.2f} seconds')


asyncio.run(main())
Why do I need to cancel the tasks in this example? Can I remove this part of the code

    # Cancel our worker tasks.
    for task in tasks:
        task.cancel()
    # Wait until all worker tasks are cancelled.
    await asyncio.gather(*tasks, return_exceptions=True)

and still have the example work fine?

Why do I need to cancel the tasks in this example?
Because they will otherwise remain hanging indefinitely, waiting for a new item in the queue that will never arrive. In that particular example you are exiting the event loop anyway, so there's no harm from them "hanging", but if you did that as part of a utility function, you would create a coroutine leak.
In other words, canceling the workers tells them to exit because their services are no longer necessary, and is needed to ensure that resources associated with them get freed.
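To make the effect of cancellation concrete, here is a minimal sketch (not taken from the documentation example) of a worker blocked on queue.get() being cancelled: the pending await is interrupted with CancelledError, the coroutine unwinds, and its resources are freed.

import asyncio

async def worker(queue):
    try:
        while True:
            item = await queue.get()  # suspends here while the queue is empty
            queue.task_done()
    except asyncio.CancelledError:
        print('worker cancelled while waiting on the queue')
        raise  # re-raise so the task is marked as cancelled

async def main():
    queue = asyncio.Queue()
    task = asyncio.create_task(worker(queue))
    await asyncio.sleep(0.1)   # give the worker time to block on get()
    task.cancel()              # interrupt the pending queue.get()
    await asyncio.gather(task, return_exceptions=True)

asyncio.run(main())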

Related

Is it possible for two coroutines running in different threads to communicate with each other via asyncio.Queue?

The two coroutines in the code below, running in different threads, cannot communicate with each other via asyncio.Queue. After the producer inserts a new item into the asyncio.Queue, the consumer cannot get this item from that asyncio.Queue; it stays blocked in await self.n_queue.get().
I tried printing the ids of the asyncio.Queue in both the consumer and the producer, and I found that they are the same.
import asyncio
import threading
import time


class Consumer:
    def __init__(self):
        self.n_queue = None
        self._event = None

    def run(self, loop):
        loop.run_until_complete(asyncio.run(self.main()))

    async def consume(self):
        while True:
            print("id of n_queue in consumer:", id(self.n_queue))
            data = await self.n_queue.get()
            print("get data ", data)
            self.n_queue.task_done()

    async def main(self):
        loop = asyncio.get_running_loop()
        self.n_queue = asyncio.Queue(loop=loop)
        task = asyncio.create_task(self.consume())
        await asyncio.gather(task)

    async def produce(self):
        print("id of queue in producer ", id(self.n_queue))
        await self.n_queue.put("This is a notification from server")


class Producer:
    def __init__(self, consumer, loop):
        self._consumer = consumer
        self._loop = loop

    def start(self):
        while True:
            time.sleep(2)
            self._loop.run_until_complete(self._consumer.produce())


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    print(id(loop))
    consumer = Consumer()
    threading.Thread(target=consumer.run, args=(loop,)).start()
    producer = Producer(consumer, loop)
    producer.start()
id of n_queue in consumer: 2255377743176
id of queue in producer 2255377743176
id of queue in producer 2255377743176
id of queue in producer 2255377743176
I tried to debug step by step into asyncio.Queue, and I found that after self._getters.append(getter) is invoked, the getter future is added to the self._getters queue. The following snippets are all from asyncio.Queue.
async def get(self):
    """Remove and return an item from the queue.

    If queue is empty, wait until an item is available.
    """
    while self.empty():
        getter = self._loop.create_future()
        self._getters.append(getter)
        try:
            await getter
        except:
            # ...
            raise
    return self.get_nowait()
When a new item is inserted into the asyncio.Queue in the producer, the methods below are invoked. The variable self._getters has no items, even though the queue has the same id in put_nowait() and get().
def put_nowait(self, item):
    """Put an item into the queue without blocking.

    If no free slot is immediately available, raise QueueFull.
    """
    if self.full():
        raise QueueFull
    self._put(item)
    self._unfinished_tasks += 1
    self._finished.clear()
    self._wakeup_next(self._getters)

def _wakeup_next(self, waiters):
    # Wake up the next waiter (if any) that isn't cancelled.
    while waiters:
        waiter = waiters.popleft()
        if not waiter.done():
            waiter.set_result(None)
            break
Does anyone know what's wrong with the demo code above? If the two coroutines are running in different threads, how can they communicate with each other via asyncio.Queue?
Short answer: no!
Because the asyncio.Queue needs to share the same event loop, but
An event loop runs in a thread (typically the main thread) and executes all callbacks and Tasks in its thread. While a Task is running in the event loop, no other Tasks can run in the same thread. When a Task executes an await expression, the running Task gets suspended, and the event loop executes the next Task.
see
https://docs.python.org/3/library/asyncio-dev.html#asyncio-multithreading
Even though you can pass the event loop to threads, it can be dangerous to mix the different concurrency concepts. Note that passing the loop just means that you can add tasks to the loop from different threads; they will still be executed in the main thread. However, adding tasks from threads can lead to race conditions in the event loop, because
Almost all asyncio objects are not thread safe, which is typically not a problem unless there is code that works with them from outside of a Task or a callback. If there’s a need for such code to call a low-level asyncio API, the loop.call_soon_threadsafe() method should be used
see
https://docs.python.org/3/library/asyncio-dev.html#asyncio-multithreading
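For completeness, here is a minimal sketch (an illustration, not part of the quoted docs) of the thread-safe route: keep the queue owned by a single event loop and let the producer thread hand items to that loop with asyncio.run_coroutine_threadsafe.

import asyncio
import contextlib
import threading
import time

async def consumer(queue):
    while True:
        data = await queue.get()
        print("consumed:", data)
        queue.task_done()

def producer_thread(loop, queue):
    # Plain thread: schedule queue.put() on the loop that owns the queue.
    for i in range(3):
        time.sleep(1)
        future = asyncio.run_coroutine_threadsafe(queue.put(f"item {i}"), loop)
        future.result()  # optionally block until the item is actually queued

async def main():
    queue = asyncio.Queue()
    loop = asyncio.get_running_loop()
    threading.Thread(target=producer_thread, args=(loop, queue), daemon=True).start()
    consume = asyncio.create_task(consumer(queue))
    await asyncio.sleep(5)  # let a few items flow through, then shut down
    consume.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await consume

asyncio.run(main())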
Typically, you should not need to run async functions in different threads, because they should be IO bound and therefore a single thread should be sufficient to handle the workload. If you still have some CPU bound tasks, you can dispatch them to different threads and make the result awaitable using asyncio.to_thread, see https://docs.python.org/3/library/asyncio-task.html#running-in-threads.
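A brief sketch of asyncio.to_thread (available since Python 3.9), with blocking_io standing in for an arbitrary blocking call:

import asyncio
import time

def blocking_io():
    time.sleep(1)  # stand-in for a blocking call
    return "done"

async def main():
    # blocking_io runs in a worker thread; the event loop stays responsive meanwhile
    result = await asyncio.to_thread(blocking_io)
    print(result)

asyncio.run(main())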
There are many questions already about this topic, see e.g. Send asyncio tasks to loop running in other thread or How to combine python asyncio with threads?
If you want to learn more about the concurrency concepts, I recommend reading https://medium.com/analytics-vidhya/asyncio-threading-and-multiprocessing-in-python-4f5ff6ca75e8

Asyncio Tasks running sequentially

I've recently started working with Python and its concurrency features, and I'm banging my head against asyncio.
Structure of data: a list of companies, each with a users-list in it.
Goal: I want to execute gRPC calls in parallel, with one task always running per company. The API call operates on the users-list and is a batch call [not a single call per company].
Ref I followed: https://docs.python.org/3/library/asyncio-queue.html [modified a bit according to my use-case]
What I've done: the three small functions below, where process_cname_vs_users takes company_id vs users-list as input
async def update_data(req_id, user_ids, company_id):  # <-- THIS IS THE ASYNC CALL ON CHUNK OF SOME USERS
    # A gRPC call to server here.

async def worker(worker_id, queue, cname_vs_user_ids):
    while True:
        company_id = await queue.get()
        user_ids = cname_vs_user_ids.get(company_id)
        user_ids_chunks = get_data_in_chunks(user_ids, 20)
        for user_id_chunk in user_ids_chunks:
            try:
                await update_data(user_id_chunk, company_id)
            except Exception as e:
                print("error: {}".format(e))
        # Notify the queue that the "work item" has been processed.
        queue.task_done()

async def process_cname_vs_users(cname_vs_user_ids):
    queue = asyncio.Queue()
    for company_id in cname_vs_user_ids:
        queue.put_nowait(company_id)

    tasks = []
    for i in range(5):  # <- number of workers
        task = asyncio.create_task(
            worker(i, queue, cname_vs_user_ids))
        tasks.append(task)

    # Wait until the queue is fully processed.
    await queue.join()

    # Cancel our worker tasks.
    for task in tasks:
        task.cancel()
    try:
        # Wait until all worker tasks are cancelled.
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        print('Results:', responses)
    except Exception as e:
        print('Got an exception:', e)
Expectation: The tasks should process 5 companies (the number of workers) concurrently.
Reality: Only the first task is doing the work, for all companies, sequentially.
Any help/suggestions will be helpful. Thanks in advance :)
So, finally, I figured it out after reading more about concurrency in Python.
For now I have used concurrent.futures.ThreadPoolExecutor to achieve the desired output.
Solution:
async def update_data(req_id, user_ids, company_id):  # <-- THIS IS THE ASYNC CALL ON CHUNK OF SOME USERS
    # A gRPC call to server here.

async def worker(cname_vs_user_ids):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_summary = {executor.submit(update_data, req_id, user_items, company_id) for
                          company_id, user_items in
                          cname_vs_user_ids.items()}
        for future in concurrent.futures.as_completed(future_summary):
            try:
                response = future.result()
            except Exception as error:
                print("ReqId: {}, Error occurred: {}".format(req_id, str(error)))
                executor.shutdown()
                raise error

async def process_cname_vs_users(cname_vs_user_ids):
    loop = asyncio.get_event_loop()
    loop.run_until_complete(worker(cname_vs_user_ids))
The above solution worked wonders for me.

Asyncio with two loops, best practice

I have two infinite loops. Their processing is lightweight. I don't want them to block each other. Is using await asyncio.sleep(0) a good practice?
This is my code
import asyncio

async def loop1():
    while True:
        print("loop1")
        # pull data from kafka
        await asyncio.sleep(0)

async def loop2():
    while True:
        print("loop2")
        # send data to all clients using asyncio stream api
        await asyncio.sleep(0)

async def main():
    await asyncio.gather(loop1(), loop2())

asyncio.run(main())
Two (or many more) asyncio tasks will not block each other unless one of the tasks has a long synchronous operation inside.
Both of your tasks contain only network operations (Kafka and API requests), so neither of them will block the other.
When should you use asyncio.sleep(0)?
Imagine you have some long synchronous operation, such as calculations. Calculations are not I/O operations.
The example below is more of a good-to-know: if you have such operations in a real app, you should move them into loop.run_in_executor and use concurrent.futures.ProcessPoolExecutor as the executor. The example:
import asyncio

async def long_calc():
    """
    Some Heavy CPU bound task.
    Better make it sync function and move to ProcessPoolExecutor
    """
    s = 0
    for _ in range(100):
        for i in range(1_000_000):
            s += i**2
        # comment the line and watch result
        # you'll get no working messages
        # that's why I use sleep(0.0) here
        await asyncio.sleep(0.0)
    return s

async def pinger():
    """Task which shows that app is alive"""
    n = 0
    while True:
        await asyncio.sleep(1)
        print(f"Working {n}")
        n += 1

async def amain():
    """Main async function in this app"""
    # run in asyncio.create_task since we want the task
    # to run in parallel with long_calc +
    # we do not want to wait till it will be finished
    # If it were thread it would be called daemon thread
    asyncio.create_task(pinger())
    # await results of long task
    s = await long_calc()
    print(f"Done: {s}")

if __name__ == '__main__':
    asyncio.run(amain())
If you need me to provide a run_in_executor example, let me know.
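For reference, a minimal sketch of the run_in_executor + ProcessPoolExecutor approach mentioned above (an illustration under the assumption that the heavy calculation is moved into a plain sync function):

import asyncio
import concurrent.futures

def long_calc_sync():
    # CPU-bound work as a regular (non-async) function
    s = 0
    for _ in range(100):
        for i in range(1_000_000):
            s += i ** 2
    return s

async def pinger():
    n = 0
    while True:
        await asyncio.sleep(1)
        print(f"Working {n}")
        n += 1

async def amain():
    asyncio.create_task(pinger())
    loop = asyncio.get_running_loop()
    with concurrent.futures.ProcessPoolExecutor() as pool:
        # run_in_executor returns a future the loop can await without blocking
        s = await loop.run_in_executor(pool, long_calc_sync)
    print(f"Done: {s}")

if __name__ == '__main__':
    asyncio.run(amain())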

what happens when consumers have no tasks in the queue when using asyncio.Queue()

I have this script
import asyncio
import random

q = asyncio.Queue()

async def producer(num):
    while True:
        await q.put(num + random.random())
        await asyncio.sleep(random.random())

async def consumer(num):
    while True:
        value = await q.get()
        print('Consumed', num, value)

loop = asyncio.get_event_loop()
for i in range(6):
    loop.create_task(producer(i))
for i in range(3):
    loop.create_task(consumer(i))
loop.run_forever()
that uses asyncio.Queue().
I am running my script forever; it produces items at random times and adds them to the queue. In the event that there are no items to consume in the queue, will this use CPU for nothing, or is it harmless and produces no error?
In the event that there are no tasks to consume in the queue, will this produce an error?
Since the consumer calls get() to consume the next queued item, if the queue is empty, it will simply wait for the next item to arrive. No CPU will be wasted on the wait, and no error reported - the consumer coroutine will just be suspended until an item is produced, after which it will be immediately woken up. During the wait the event loop is free to run other coroutines, if any.
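A tiny sketch (separate from the script above) that makes this visible: the consumer suspends on get() while the queue is empty and is woken as soon as an item arrives.

import asyncio

async def consumer(q):
    print('waiting on an empty queue...')
    value = await q.get()  # suspends here; no busy-waiting
    print('woken up with', value)

async def main():
    q = asyncio.Queue()
    task = asyncio.create_task(consumer(q))
    await asyncio.sleep(1)   # the queue stays empty; the consumer just sleeps
    await q.put('hello')     # the consumer is woken immediately
    await task

asyncio.run(main())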

Sharing a dynamically started worker among several consumers

I am building a worker class that connects to an external event stream using asyncio. It is a single stream, but several consumers may enable it. The goal is to maintain the connection only while one or more consumers require it.
My requirements are as follow:
The worker instance is created dynamically first time a consumer requires it.
When other consumers then require it, they re-use the same worker instance.
When the last consumer closes the stream, it cleans up its resources.
This sounds easy enough. However, the startup sequence is causing me issues, because it is itself asynchronous. Thus, assuming this interface:
class Stream:
    async def start(self, *, timeout=DEFAULT_TIMEOUT):
        pass

    async def stop(self):
        pass
I have the following scenarios:
Scenario 1 - exception at startup
Consumer 1 requests worker to start.
Worker startup sequence begins
Consumer 2 requests worker to start.
Worker startup sequence raises an exception.
Both consumer should see the exception as the result of their call to start().
Scenario 2 - partial asynchronous cancellation
Consumer 1 requests worker to start.
Worker startup sequence begins
Consumer 2 requests worker to start.
Consumer 1 gets cancelled.
Worker startup sequence completes.
Consumer 2 should see a successful start.
Scenario 3 - complete asynchronous cancellation
Consumer 1 requests worker to start.
Worker startup sequence begins
Consumer 2 requests worker to start.
Consumer 1 gets cancelled.
Consumer 2 gets cancelled.
Worker startup sequence must be cancelled as a result.
I am struggling to cover all scenarios without race conditions or a spaghetti mess of bare Future and Event objects.
Here is an attempt at writing start(). It relies on _worker() setting an asyncio.Event named self._worker_ready when it completes the startup sequence:
async def start(self, timeout=None):
    assert not self.closing

    if not self._task:
        self._task = asyncio.ensure_future(self._worker())

    # Wait until worker is ready, has failed, or timeout triggers
    try:
        self._waiting_start += 1
        wait_ready = asyncio.ensure_future(self._worker_ready.wait())
        done, pending = await asyncio.wait(
            [self._task, wait_ready],
            return_when=asyncio.FIRST_COMPLETED, timeout=timeout
        )
    except asyncio.CancelledError:
        wait_ready.cancel()
        if self._waiting_start == 1:
            self.closing = True
            self._task.cancel()
            with suppress(asyncio.CancelledError):
                await self._task  # let worker shutdown
        raise
    finally:
        self._waiting_start -= 1

    # worker failed to start - either throwing or timeout triggering
    if not self._worker_ready.is_set():
        self.closing = True
        self._task.cancel()
        wait_ready.cancel()
        try:
            await self._task  # let worker shutdown
        except asyncio.CancelledError:
            raise FeedTimeoutError('stream failed to start within %ss' % timeout)
        else:
            assert False, 'worker must propagate the exception'
That seems to work, but it is too complex and really hard to test: the worker has many await points, leading to a combinatorial explosion if I try to cover all possible cancellation points and execution orders.
I need a better way. I am thus wondering:
Are my requirements reasonable?
Is there a common pattern to do this?
Does my question raise some code smell?
Your requirements sound reasonable. I would try to simplify start by replacing Event with a future (in this case a task), using it to both wait for the startup to finish and to propagate exceptions that occur during its course, if any. Something like:
class Stream:
    async def start(self, *, timeout=DEFAULT_TIMEOUT):
        loop = asyncio.get_event_loop()
        if self._worker_startup_task is None:
            self._worker_startup_task = \
                loop.create_task(self._worker_startup())
        self._add_user()
        try:
            await asyncio.shield(asyncio.wait_for(
                self._worker_startup_task, timeout))
        except:
            self._rm_user()
            raise

    async def _worker_startup(self):
        loop = asyncio.get_event_loop()
        await asyncio.sleep(1)  # ...
        self._worker_task = loop.create_task(self._worker())
In this code the worker startup is separated from the worker coroutine, and is also moved to a separate task. This separate task can be awaited and removes the need for a dedicated Event, but more importantly, it allows scenarios 1 and 2 to be handled by the same code. Even if someone cancels the first consumer, the worker startup task will not be canceled - the cancellation just means that there is one less consumer waiting for it.
Thus in case of consumer cancellation, await self._worker_startup_task will work just fine for other consumers, whereas in case of an actual exception in worker startup, all other waiters will see the same exception because the task will have completed.
Scenario 3 should work automatically because we always cancel the startup that can no longer be observed by a consumer, regardless of the reason. If the consumers are gone because the startup itself has failed, then self._worker_startup_task will have completed (with an exception) and its cancellation will be a no-op. If it is because all consumers have been themselves canceled while awaiting the startup, then self._worker_startup_task.cancel() will cancel the startup sequence, as required by scenario 3.
The rest of the code would look like this (untested):
def __init__(self):
    self._users = 0
    self._worker_startup_task = None
    self._worker_task = None

def _add_user(self):
    self._users += 1

def _rm_user(self):
    self._users -= 1
    if self._users:
        return
    self._worker_startup_task.cancel()
    self._worker_startup_task = None
    if self._worker_task is not None:
        self._worker_task.cancel()
        self._worker_task = None

async def stop(self):
    self._rm_user()

async def _worker(self):
    # actual worker...
    while True:
        await asyncio.sleep(1)
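A hypothetical usage sketch of the Stream pieces above (assuming they are combined into one class, DEFAULT_TIMEOUT is defined, and the consumer bodies are only illustrative), just to show the intended call pattern in which all consumers share a single startup:

import asyncio

async def consumer(stream, name):
    await stream.start(timeout=5)   # every consumer awaits the same startup task
    try:
        print(name, 'is using the stream')
        await asyncio.sleep(1)      # stand-in for actual use of the stream
    finally:
        await stream.stop()         # the last consumer out cancels the worker

async def main():
    stream = Stream()
    await asyncio.gather(*(consumer(stream, f'consumer-{i}') for i in range(3)))

asyncio.run(main())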
Based on my previous tests and integrating suggestions from @user4815162342, I came up with a re-usable solution:
st = SharedTask(test())
task1 = asyncio.ensure_future(st.wait())
task2 = asyncio.ensure_future(st.wait(timeout=15))
task3 = asyncio.ensure_future(st.wait())
This does the right thing: task2 cancels itself after 15s. Cancelling tasks has no effect on test() unless they all get cancelled. In that case, the last task to get cancelled will manually cancel test() and wait for cancellation handling to complete.
If passed a coroutine, it is only scheduled when the first task starts waiting.
Lastly, awaiting the shared task after it has completed simply yields its result immediately (seems obvious, but the initial version did not).
import asyncio
from contextlib import suppress


class SharedTask:
    __slots__ = ('_clients', '_task')

    def __init__(self, task):
        if not (asyncio.isfuture(task) or asyncio.iscoroutine(task)):
            raise TypeError('task must be either a Future or a coroutine object')
        self._clients = 0
        self._task = task

    @property
    def started(self):
        return asyncio.isfuture(self._task)

    async def wait(self, *, timeout=None):
        self._task = asyncio.ensure_future(self._task)
        self._clients += 1
        try:
            return await asyncio.wait_for(asyncio.shield(self._task), timeout=timeout)
        except:
            self._clients -= 1
            if self._clients == 0 and not self._task.done():
                self._task.cancel()
                with suppress(asyncio.CancelledError):
                    await self._task
            raise

    def cancel(self):
        if asyncio.iscoroutine(self._task):
            self._task.close()
        elif asyncio.isfuture(self._task):
            self._task.cancel()
The re-raising of the task's cancellation exception (mentioned in the comments) is intentional. It allows this pattern:
async def my_task():
    try:
        await do_stuff()
    except asyncio.CancelledError as exc:
        await flush_some_stuff()  # might raise an exception
        raise exc
The clients can cancel the shared task and handle any exception that might arise as a result; it works the same whether my_task is wrapped in a SharedTask or not.
