I have some code that looks like this:
sem = asyncio.Semaphore(max_concurrency)

async def run_concurrent(invocation):
    async with sem:
        return await _run_invocation_with_timeout(invocation, timeout_seconds)

return await asyncio.gather(
    *(run_concurrent(invocation) for invocation in invocations)
)
Behind the scenes, this gives me max_concurrency workers running in parallel. How can I get some unique identifier which distinguishes which "thread" the invocation is actually running on? The reason I want this is so that I can emit some timing information as JSON that can be loaded into chrome://tracing, so that I can visualize the parallelism of my application.
Is it sufficient to start a counter at 0, increment it every time a task starts, and decrement it when a task finishes? Will this accurately model the way the work is scheduled by the runtime?
As user481516342 pointed out in an earlier comment, this probably isn't possible to do 100% accurately. By taking a roundabout approach and sacrificing strict correctness, though, I came up with something that is close enough for my purposes, and it's surprisingly simple.
My original code now looks like this:
sem = asyncio.Semaphore(max_concurrency)
# Pool of available "worker" ids, 1..max_concurrency.
tasks = list(range(1, max_concurrency + 1))

async def run_concurrent(invocation):
    async with sem:
        # Take an id out of the pool for the duration of this invocation.
        invocation.task_id = tasks.pop()
        result = await _run_invocation_with_timeout(invocation, timeout_seconds)
        # Put the id back so a later invocation can reuse it.
        tasks.insert(0, invocation.task_id)
        return result

return await asyncio.gather(
    *(run_concurrent(invocation) for invocation in invocations)
)
This isn't 100% accurate because if max_concurrency is, say, 4, my algorithm might decide that tasks A, B, and C get ids 1, 2, and 3 whereas internally they run on 2, 3, and 4. For the purposes of visualizing the parallelism, though, I think this is sufficiently close to the real thing.
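For reference, here's a rough sketch of how the task_id can then be fed into the Chrome trace event format; the record_event helper and the trace.json file name are just illustrative, not part of the original code:

import json
import time

trace_events = []

def record_event(name, task_id, phase):
    # Chrome trace event format: "B" = begin, "E" = end, timestamps in microseconds.
    trace_events.append({
        "name": name,
        "ph": phase,
        "pid": 1,
        "tid": task_id,
        "ts": time.perf_counter() * 1_000_000,
    })

# Inside run_concurrent, bracket the awaited call:
#     record_event(str(invocation), invocation.task_id, "B")
#     result = await _run_invocation_with_timeout(invocation, timeout_seconds)
#     record_event(str(invocation), invocation.task_id, "E")

# After gather() returns, write a file that chrome://tracing can load:
with open("trace.json", "w") as f:
    json.dump({"traceEvents": trace_events}, f)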
This is probably only the second time I've asked for help here, so my apologies for any missing or extraneous detail/wording in my question.
The following code, though very basic is something I've written as an imitation (in its simplest form) of another piece of code written by an ex-employee at my firm. I am currently working on a project she was working on, and I do not understand how the following is executing without it being awaited or gathered.
In the original code, the 'wait_and_print' function is of course an async function that does a single RESTful web API call using aiohttp.ClientSession (via an async context manager); it returns nothing but appends/extends a list with the response it gets.
I've only been using (and trying to understand) asyncio and asynchronous programming for about two weeks, so I am not very savvy with it. I have, however, used Python on and off for three years. I understand what task creation does and how asyncio.gather can run multiple API calls concurrently. But this is something I do not get:
import asyncio
import time

L = []

async def wait_and_print(wait_time):
    print(f"starting function {wait_time}")
    for x in range(1, wait_time + 1):
        print("Sleeping for {} time.".format(x))
        await asyncio.sleep(1)
    print(f"ending function {wait_time}")
    L.append(wait_time)

async def main_loop():
    tasks = [asyncio.create_task(wait_and_print(x)) for x in [3, 1, 2]]
    while len(tasks) != 0:
        tasks = [t for t in tasks if not t.done()]
        await asyncio.sleep(0)  # HOW IS THIS MAKING IT WORK WITHOUT ACTUALLY AWAITING tasks?
    print("Main loop ended!")

def final(func):
    a = time.time()
    asyncio.run(func())
    b = time.time()
    print(b - a, "seconds taken to run all!")
    print(L)

final(main_loop)
I'm noticing that when I spawn an asyncio task using create_task, it's first completing the rest of the logic rather than starting that task. I'm forced to add an await asyncio.sleep(0) to get the task started, which seems a bit hacky and unclean to me.
Here is some example code:
async def make_rpc_calls(...some args...):
    val_1, val_2 = await asyncio.gather(rpc_call_1(...), rpc_call_2(...))
    return process(val_1, val_2)

def some_very_cpu_intensive_function(...some args...):
    # Does a lot of computation, can take 20 seconds to run

task_1 = asyncio.get_running_loop().create_task(make_rpc_calls(...))
intensive_result = some_very_cpu_intensive_function(...)
await task_1
process(intensive_result, task_1.result())
Any time I run the above, it runs some_very_cpu_intensive_function before kicking off the expensive RPCs. The only way I've gotten this to work is to do:
async def make_rpc_calls(...some args...):
    val_1, val_2 = await asyncio.gather(rpc_call_1(...), rpc_call_2(...))
    return process(val_1, val_2)

def some_very_cpu_intensive_function(...some args...):
    # Does a lot of computation, can take 20 seconds to run

task_1 = asyncio.get_running_loop().create_task(make_rpc_calls(...))
await asyncio.sleep(0)
intensive_result = some_very_cpu_intensive_function(...)
await task_1
process(intensive_result, task_1.result())
This feels like a hack to me - I'm forcing the event loop to context switch, and it doesn't feel like I'm using the asyncio framework correctly. Is there another way I should be approaching this?
sleep() always suspends the current task, allowing other tasks to run.
Setting the delay to 0 provides an optimized path to allow other tasks to run. This can be used by long-running functions to avoid blocking the event loop for the full duration of the function call.
Source: https://docs.python.org/3/library/asyncio-task.html
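To illustrate the behavior the docs describe, here is a minimal, self-contained sketch (my own example, not from the question): the task created by create_task only gets its first chance to run when the creating coroutine suspends, e.g. at await asyncio.sleep(0).

import asyncio

async def worker():
    print("worker started")
    return 42

async def main():
    task = asyncio.create_task(worker())
    print("task created, worker has not started yet")
    await asyncio.sleep(0)   # suspend main(); the event loop now runs the pending task
    print("back in main()")
    print(await task)        # prints 42; awaiting also guarantees the task has finished

asyncio.run(main())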
I have approximately the following code
import asyncio
...

async def query_loop():
    while connected:
        result = await asyncio.gather(get_value1, get_value2, get_value3)
        if True in result:
            connected = False

async def main():
    await query_loop()

asyncio.run(main())
The get_value functions query a device, receive values, and publish them to a server. If no problems occur they return False, otherwise True.
Now I need the get_value2 function to check whether it received the value 7. In that case the program needs to wait 3 minutes before sending a special command to the device. But in the meantime, and also afterwards, the query_loop should continue.
Does anybody have an idea how to do that?
Thanks in advance!
If I understand you correctly, you want to modify get_value2 so that it reacts to a value received from device by spawning additional work in the background, i.e. do something without the loop in query_loop having to wait for that new work to finish.
You can use asyncio.create_task() to spawn a background task. In fact, you can always combine create_task() and await to run things in the background; asyncio.gather is just a utility function that does it for you. In this case query_loop remains unchanged, and get_value2 gets modified like this:
async def get_value2():
    ...
    value = await receive_value_from_device()
    if value == 7:
        # schedule special_command() to run, but don't wait for it
        asyncio.create_task(special_command())
    ...
    return False

async def special_command():
    await asyncio.sleep(180)
    await send_command_to_device(...)
Note that if get_value1 and others are async functions, the correct invocation of gather must call them, so it should be await asyncio.gather(get_value1(), get_value2(), get_value3()) (note the extra parentheses).
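For clarity, a minimal sketch of query_loop with the corrected invocation (assuming connected is simply a local flag; the get_value functions are as in the question):

async def query_loop():
    connected = True
    while connected:
        # The functions must be called so gather receives coroutine objects.
        result = await asyncio.gather(get_value1(), get_value2(), get_value3())
        if True in result:
            connected = False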
I'm trying to fetch some data from OpenSubtitles using asyncio and then download a file whose information is contained in that data. I want to fetch that data and download the file at the same time using asyncio.
The problem is that I want to wait for 1 task from the list tasks to finish before commencing with the rest of the tasks in the list or the download_tasks. The reason for this is that in self._perform_query() I am writing information to a file and in self._download_and_save_file() I am reading that same information from that file. So in other words, the download_tasks need to wait for at least one task in tasks to finish before starting.
I found out I can do that with asyncio.wait(return_when=FIRST_COMPLETED) but for some reason it is not working properly:
payloads = [create_payloads(entry) for entry in retreive(table_in_database)]
tasks = [asyncio.create_task(self._perform_query(payload, proxy)) for payload in payloads]
download_tasks = [asyncio.create_task(self._download_and_save_file(url, proxy)) for url in url_list]
done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
print(done)
print(len(done))
print(pending)
print(len(pending))
await asyncio.wait(download_tasks)
The output is completely different from what I expected. It seems that all 3 tasks in the list tasks are being completed despite my passing asyncio.FIRST_COMPLETED. Why is this happening?
{<Task finished coro=<SubtitleDownloader._perform_query() done, defined at C:\Users\...\subtitles.py:71> result=None>, <Task finished coro=<SubtitleDownloader._perform_query() done, defined at C:\Users\...\subtitles.py:71> result=None>, <Task finished coro=<SubtitleDownloader._perform_query() done, defined at C:\Users\...\subtitles.py:71> result=None>}
3
set()
0
Exiting
As far as I can tell, the code in self._perform_query() shouldn't affect this problem. Here it is anyway just to make sure:
async def _perform_query(self, payload, proxy):
    try:
        query_result = proxy.SearchSubtitles(self.opensubs_token, [payload], {"limit": 25})
    except Fault as e:
        raise Exception("A fault has occurred:\n{}".format(e))
    except ProtocolError as e:
        raise Exception("A ProtocolError has occurred:\n{}".format(e))
    else:
        if query_result["status"] == "200 OK":
            with open("dl_links.json", "w") as dl_links_json:
                result = query_result["data"][0]
                subtitle_name = result["SubFileName"]
                download_link = result["SubDownloadLink"]
                download_data = {"download link": download_link,
                                 "file name": subtitle_name}
                json.dump(download_data, dl_links_json)
        else:
            print("Wrong status code: {}".format(query_result["status"]))
For now, I've been testing this without running download_tasks but I have included it here for context. Maybe I am going about this problem in a completely wrong manner. If so, I would much appreciate your input!
Edit:
The problem was very simple, as answered below: _perform_query didn't await anything, so it ran synchronously. I changed that by editing the file-writing part of _perform_query to be asynchronous with aiofiles:
async def _perform_query(self, payload, proxy):
    query_result = proxy.SearchSubtitles(self.opensubs_token, [payload], {"limit": 25})
    if query_result["status"] == "200 OK":
        async with aiofiles.open("dl_links.json", mode="w") as dl_links_json:
            result = query_result["data"][0]
            download_link = result["SubDownloadLink"]
            await dl_links_json.write(download_link)
return_when=FIRST_COMPLETED doesn't guarantee that only a single task will complete. It guarantees that the wait will complete as soon as a task completes, but it is perfectly possible that other tasks complete "at the same time", which for asyncio means in the same iteration of the event loop. Consider, for example, the following code:
async def noop():
    pass

async def main():
    done, pending = await asyncio.wait(
        [noop(), noop(), noop()], return_when=asyncio.FIRST_COMPLETED)
    print(len(done), len(pending))

asyncio.run(main())
This prints 3 0, just like your code. Why?
asyncio.wait does two things: it submits the coroutines to the event loop, and it sets up callbacks to notify it when any of them is complete. However, the noop coroutine doesn't contain an await, so none of the calls to noop() suspends, each just does its thing and immediately returns. As a result, all three coroutine instances finish within the same pass of the event loop. wait is then informed that all three coroutines have finished, a fact it dutifully reports.
If you change noop to await a random sleep, e.g. change pass to await asyncio.sleep(0.1 * random.random()), you get the expected behavior. With an await the coroutines no longer complete at the same time, and wait will report the first one as soon as it detects it.
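To make that concrete, here is a small runnable variant (the coroutines are wrapped in create_task so it also runs on newer Python versions, where asyncio.wait no longer accepts bare coroutines):

import asyncio
import random

async def noop():
    # With an await inside, the coroutines no longer all finish in the
    # same event-loop pass, so FIRST_COMPLETED typically reports one task.
    await asyncio.sleep(0.1 * random.random())

async def main():
    tasks = [asyncio.create_task(noop()) for _ in range(3)]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    print(len(done), len(pending))  # usually prints "1 2"
    await asyncio.gather(*pending)  # let the remaining tasks finish cleanly

asyncio.run(main())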
This reveals the true underlying issue with your code: _perform_query doesn't await. This indicates that you are not using an async underlying library, or that you are using it incorrectly. The call to SearchSubtitles likely simply blocks the event loop, which appears to work in trivial tests, but breaks essential asyncio features such as concurrent execution of tasks.
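One common way to deal with that, assuming you have to keep using the blocking SearchSubtitles call, is to push it into a thread with run_in_executor so the coroutine genuinely awaits; a rough sketch:

async def _perform_query(self, payload, proxy):
    loop = asyncio.get_running_loop()
    # Run the blocking SearchSubtitles call in the default thread pool;
    # the coroutine now truly awaits, so other tasks can make progress.
    query_result = await loop.run_in_executor(
        None,
        lambda: proxy.SearchSubtitles(self.opensubs_token, [payload], {"limit": 25}),
    )
    ...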
When using trio and nursery objects, how do you capture any value that was returned from a method?
Take this example from the trio website:
async def append_fruits():
    fruits = []
    fruits.append("Apple")
    fruits.append("Orange")
    return fruits

async def numbers():
    numbers = []
    numbers.append(1)
    numbers.append(2)
    return numbers

async def parent():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(append_fruits)
        nursery.start_soon(numbers)
I modified it so that each method returns a list. How would you capture the return values so that I could print them?
Currently, there is no built-in mechanism for this. Mostly because we haven't figured out how we would even want it to work, so if you have some suggestions that would be helpful :-).
The thing is, with regular functions, there's exactly one obvious place to access the return value – the caller is waiting, so you hand them the return value, done. With concurrent functions, the caller isn't waiting, so you also need some way to specify where to return it to, when to return it, if there are multiple functions you have to keep track of which one is returning a value, and so on. It's not as simple a concept.
What do you want to do with the return values? Do you want to, say, print them immediately when each function returns? In that case the simplest thing is to do it directly from the tasks:
async def print_fruits():
    print(await fruits())

async def print_numbers():
    print(await numbers())

async with trio.open_nursery() as nursery:
    nursery.start_soon(print_fruits)
    nursery.start_soon(print_numbers)
You could even factor this into a helper function:
async def call_then_print(fn):
    print(await fn())

async with trio.open_nursery() as nursery:
    nursery.start_soon(call_then_print, fruits)
    nursery.start_soon(call_then_print, numbers)
Or maybe you want to put them in a data structure to look at later?
results = {}

async def store_fruits_in_results_dict():
    results["fruits"] = await fruits()

async def store_numbers_in_results_dict():
    results["numbers"] = await numbers()

async with trio.open_nursery() as nursery:
    nursery.start_soon(store_fruits_in_results_dict)
    nursery.start_soon(store_numbers_in_results_dict)

# This is after the nursery block, so we know that the dict is fully filled in:
print(results["fruits"])
print(results["numbers"])
You can imagine fancier versions of those too – for example, sometimes when you run a lot of tasks in parallel you want to capture exceptions, not just return values, so that some tasks can still succeed even if some of them fail. For that you can use a try/except around each individual function, or the outcome library. Or when each operation finishes you could put its return value into a memory channel (trio.Queue in older trio versions), so that another task can process the results as they're finished. But hopefully this gives you a good starting point :-)
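For example, a small sketch of the try/except variant (my own illustration, reusing the results dict idea from above):

results = {}

async def store_result(key, async_fn):
    # Record either the return value or the exception, so one failing
    # task doesn't stop the others from reporting their results.
    try:
        results[key] = ("ok", await async_fn())
    except Exception as exc:
        results[key] = ("error", exc)

async with trio.open_nursery() as nursery:
    nursery.start_soon(store_result, "fruits", fruits)
    nursery.start_soon(store_result, "numbers", numbers)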
In this case, simply create the arrays in the parent and pass each to the child that needs it.
More generally, pass an object to the tasks; they can set an attribute on it. You might also add an Event so that the parent can wait for the results to be available.
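A small sketch of that approach, reusing the fruits/numbers example from the question (illustrative only):

import trio

async def append_fruits(out):
    # The child fills in a list created by the parent.
    out.extend(["Apple", "Orange"])

async def numbers(out):
    out.extend([1, 2])

async def parent():
    fruits, nums = [], []
    async with trio.open_nursery() as nursery:
        nursery.start_soon(append_fruits, fruits)
        nursery.start_soon(numbers, nums)
    # The nursery block only exits once both children are done,
    # so the lists are fully populated here.
    print(fruits, nums)

trio.run(parent)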