With access to a trio.Nursery instance nursery, how can I print the state of all nursery.child_tasks, specifically those that have not yet exited?
Having read the docs and the trio NurseryManager code, there are two things I don't understand:
how "nested child" tasks might be relevant. I see [direct] children removed when a task completes, via _child_finished(), but I don't understand the use of _nested_child_finished().
the window of time between one task failing (raising) and all tasks completing. Since tasks are cooperative, I would expect to be able to find "active" tasks, in the window soon after one failure, in both of these states:
"failed, exception captured"
and "running, has not handled Cancelled yet"

"Nested child" is our internal name for "the block of code that's actually part of the parent task, that you typed inside the async with open_nursery():". This code runs a bit differently than a real child task, but it has similar semantics (the nursery won't exit until it exits, if it raises an exception it cancels the real child tasks and vice-versa, etc.), so that's why we call it that.
You're correct that there's a window of time between one task raising and the other tasks completing. When a task raises, the other tasks get cancelled, but this means injecting trio.Cancelled exceptions, waiting for those exceptions to unwind, etc., so it might take some time. (You can check whether the nursery has been cancelled with nursery.cancel_scope.cancel_called.)
During this period, nursery.child_tasks will have only the tasks that are still running (i.e., still processing their cancellation). Currently Trio doesn't keep track of "failed tasks" – the nursery keeps a list of the exception objects themselves, so it can re-raise them, but it doesn't track which tasks those came from or anything, and there's currently no API to introspect the list of pending exceptions.
Zooming out: Trio's general philosophy is that when thinking about code organization, functions are more useful than tasks. So it really de-emphasizes tasks: outside of debugging/introspection/low-level-plumbing, you never encounter a "task object" or give a task a name. (See also Go's take on this.) Depending on what you're doing, you might find it helpful to step back and think if there's a better way to keep track of what operations you're doing and how they're progressing.


How can I monitor stalled tasks?

I am running a Rust app with Tokio in prod. In the last version I had a bug, and some requests caused my code to go into an infinite loop.
While the tasks that hit the loop were stuck, all the other tasks continued to work well and process requests. That went on until the number of stalled tasks was high enough to make my program unresponsive.
My problem is that it took our monitoring systems a long time to notice that something had gone wrong. For example, the task that answers Kubernetes' health check kept working, so I wasn't able to tell that I had stalled tasks in my system.
So my question is: is there a way to identify and alert on such cases?
If I could find a way to define a timeout on a task, so that a task that does not return to the scheduler within X seconds/millis is marked as stalled, that would be a good enough solution for me.
Using tracing might be an option here: following issue 2655, every tokio task should have a span. Together with tracing-futures, this means you should get a tracing event every time a task is entered or suspended (see this example). By adding the relevant data (e.g. task id / request id / ...), you should then be able to feed this information to an analysis tool in order to know:
that a task is blocked (was resumed, then never suspended again)
if you add your own spans, that a "userland" span was never exited / closed, which might mean it's stuck in a non-blocking loop (which is also an issue, though somewhat less so)
I think that's about the extent of it: as noted by issue 2510, tokio doesn't yet use the tracing information it generates, and so provides no "built-in" introspection facilities.
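The analysis side of this idea is independent of Rust, so here is a rough sketch of it in Python. It assumes the enter/exit events have already been exported into records with a timestamp, a task id and an event kind; the SpanEvent shape and the find_stalled helper are hypothetical and are not something tracing or tokio produces in this form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanEvent:
    # Hypothetical record shape; real tracing output will look different.
    timestamp: float  # seconds
    task_id: str
    kind: str  # "enter" when the task is resumed, "exit" when it yields


def find_stalled(events, now, threshold):
    """Return {task_id: seconds_inside} for every task that was entered
    and has not exited for more than `threshold` seconds as of `now`."""
    entered_at = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.kind == "enter":
            entered_at[event.task_id] = event.timestamp
        elif event.kind == "exit":
            entered_at.pop(event.task_id, None)
    return {
        task_id: now - started
        for task_id, started in entered_at.items()
        if now - started > threshold
    }
```

An alerting job could run a check like this periodically over the most recent events and raise an alarm whenever the result is non-empty. The threshold has to be chosen by hand, since a task that legitimately computes for a long time between yields looks the same as a stuck one.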

Camunda Engine behaviour with massive multi-instances processes and ready state

I wonder how Camunda manages multiple instances of a sub-process.
For example this BPMN:
Let's say the multi-instance process iterates over a big collection, 500 instances.
I have a function in a web app that calls the endpoint to complete the user common task and then, in the callback of that first API call, performs another call to the Camunda engine to get all tasks. I am supposed to get a list of 500 sub-process user tasks (the ones generated by the multi-instance process).
What if the get-tasks call is performed before the Camunda engine has successfully instantiated all sub-processes?
Do I get a partial list of tasks?
How can I detect that the main process and the sub-processes are ready?
I don't really know whether Camunda is able to handle this problem by itself, so I thought of the following solution, knowing that I can only use the Modeler environment with Groovy to add code (JavaScript as well, but all the code already added is Groovy):
Use a throw event in the sub-process that is caught in the main process, then, for each signal emitted, count the ready tasks and compare that with the expected number of tasks.
I would probably spawn the tasks as parallel processes (all 500 of them) and then go to a next step in which I send a signal, or otherwise set a state, indicating that the spawning is completed. I would then join all the parallel processes together again, and have a task there that signals, or otherwise sets a state, indicating that all the parallel processes are done. This way you know exactly at what point (after spawning is done and before the join) you have a chance of getting your 500 spawned sub-processes.

Exception handling for whether a web page exists or not in Blue Prism

I am launching a page. After that I wait 5 seconds for the page to load. Then I want to check whether the page exists, or has loaded, and throw an exception if it has not. How and when should I use exception handling in this scenario? See the image attached. I tried putting recover, resume and exception stages on the launch stage as well as on the wait stage, but I don't know where to put the exception.
First of all, don't use arbitrary (fixed) wait stages unless it's absolutely necessary. Use intelligent wait stages instead, which means waiting for something to happen and then either proceeding or throwing an exception if it times out. In your case, you can use an intelligent wait stage to check, for example, whether the website has been loaded.
When it comes to throwing an exception, in your case I would simply launch, then wait for the document to be loaded and throw an exception if it times out. See below diagram.
Also, I would leave retry logic (recover - resume) for the process layer. Objects should ideally contain small reusable actions and no business logic, so decisions about whether and how many times to retry should be taken in the Process.

"Automatically set exception at clean up" exception showing in Control Room

When reviewing cases in a work queue, the message:
Automatically set exception at clean up
appears as the exception reason.
Why has Blue Prism set the case as an exception?
The "Automatically set exception at clean up" happens when the process finishes or gets terminated without unlocking the queue item that is being processed.
I imagine that you are getting data from the Work Queue using an action like "Get next item". Every time you get an item from the queue, BP locks it to prevent other bots from processing it at the same time.
To solve your problem, use "Mark Completed" if you have finished processing that item, or "Unlock Item" if you want to keep working with it later.
"Automatically set exception at cleanup" appears when you have picked up a case and not declared it as completed or exception, while the process finishes without any further action on the queue item. In other words, if you leave the queue item in the locked state and your process execution finishes, the item will be marked as an exception with that reason.
Well, the clean up phase is the phase that happens after the process is done. There are two important things that are done then - cleaning of objects and cleaning of the queue.
For every object used in the process, the action "finalize" is executed. It's a rarely used option; I've never seen it used.
During the cleaning of the queue, all locked items are marked with the exception that you're asking about.
So, my advice is to investigate how it was possible that an item has been left behind.

Is it OK to use a non-zero return code for a process that executed successfully?

I'm implementing a simple job scheduler, which spawns a new process for every job to run. When a job exits, I'd like it to report the number of actions executed to the scheduler.
The simplest way I could find is to exit with the number of actions as a return code. The process would for example exit with return code 3 for "3 actions executed".
But since the standard (AFAIK) is to use return code 0 when a process exited successfully, and any other value when there was an error, does this approach risk creating any problems?
Note: the child process is not an executable script, but a fork of the parent, so not accessible from the outside world.
What you are looking for is inter-process communication, and there are plenty of ways to do it:
Shared memory
Exclusive file descriptors (to some extent; rather go for something else if you can)
The return-code convention is not something a regular programmer should dare to violate.
The only risk is confusing a calling script. What you describe makes sense, since what you want really is the count. As Joe said, use negative values for failures, and you should consider including a --help option that explains the return values ... so you can figure out what this code is doing when you try to use it next month.
I would use logs for it: log the number of actions executed to the scheduler. This way you can also log datetimes and other extra info.
I would not change the return convention...
If the scheduler spawns a child, and you are the one writing that child, you could also open a pipe per child, or a named pipe, or maybe a Unix domain socket, and use that for inter-process communication, writing the number of processed jobs there.
I would stick with conventions, namely returning 0 for success, especially if your program is visible to or usable by other people; otherwise, document those decisions well.
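As an illustration of the pipe-per-child idea, here is a minimal sketch for a POSIX system, assuming the job is a plain Python callable that returns its action count. The run_job helper is hypothetical. One practical reason to prefer this over the exit code: on POSIX the parent only sees the low 8 bits of the exit status, so a count of 256 would be reported as 0.

```python
import os


def run_job(job):
    """Fork a child that runs `job` and reports its action count through
    a pipe. Returns (exit_code, count); count is None if the job failed."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: run the job, write the count, exit 0 only on success.
        os.close(read_fd)
        status = 1
        try:
            count = job()
            os.write(write_fd, str(count).encode())
            status = 0
        finally:
            os._exit(status)
    # Parent: close the write end so that read() sees EOF when the child exits.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        payload = reader.read()
    _, wait_status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(wait_status) if os.WIFEXITED(wait_status) else -1
    count = int(payload) if exit_code == 0 and payload else None
    return exit_code, count
```

With this shape the exit code keeps its conventional meaning (0 for success, non-zero for failure), and the count travels separately, so it is not limited to the 0 to 255 range.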
Anyway apart from conventions there are also standards.
