How to kill parallel execution of Databricks notebooks? - multithreading

I am currently using Python's threading to parallelize the execution of multiple Databricks notebooks. These are long-running notebooks, and I need to add some logic for killing the threads in the case where I want to restart the execution with new changes. When re-executing the master notebook without killing the threads, the cluster quickly fills with computationally heavy, long-lived threads, leaving little room for the computations that are actually required.
I have tried these suggestions without luck. Furthermore, I have tried getting the runId from dbutils.notebook.run() and killing the thread with dbutils.notebook.exit(runId), but since the call to dbutils.notebook.run() is synchronous, I am unable to obtain the runId before the notebook has executed.
I would appreciate any suggestion on how to solve this issue!

Hello @sondrelv, and thank you for your question. I would like to clarify that dbutils.notebook.exit(value) is not used for killing other threads by runId; dbutils.notebook.exit(value) causes the current (calling) thread to exit and return a value.
I see the difficulty of managing this without an interrupt available inside the notebook code. Given this limitation, I have looked for other ways to cancel the threads.
One way is to use other utilities to kill the thread/run.
Part of the difficulty in solving this is that threads/runs created through dbutils.notebook.run() are ephemeral runs. The Databricks CLI command databricks runs get --run-id <ephemeral_run_id> can fetch details of an ephemeral run, and if details can be fetched, then cancelling should also work (databricks runs cancel ...).
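Equivalently to the CLI, a run can be cancelled through the Jobs REST API once you have its run id. A minimal sketch (the host, token, and run id below are placeholders, not values from the question):

import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder workspace URL
TOKEN = "<personal-access-token>"                                    # placeholder token

def cancel_run(run_id):
    """Cancel a (possibly ephemeral) run by id via the Jobs API."""
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/jobs/runs/cancel",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"run_id": run_id},
    )
    resp.raise_for_status()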
The remaining difficulty is getting the run ids. Ephemeral runs are excluded from the CLI runs list operation databricks runs list.
As you noted, dbutils.notebook.run() is synchronous and does not return a value to the calling code until the child notebook finishes.
However, in the notebook UI, the run ID and a link are printed as soon as the run starts. There must be a way to capture these, but I have not yet found how.
Another possible solution would be to create some endpoint or resource that the child notebooks check to decide whether they should continue execution or exit early using dbutils.notebook.exit(). Doing this check every few cells would be similar to the approaches in the article you linked, just applied to a notebook instead of a thread.
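As a rough sketch of that last idea (the flag path and helper name below are illustrative, not an existing Databricks facility), the master notebook could write a cancellation flag to DBFS and each child notebook could check it between heavy cells:

CANCEL_FLAG = "dbfs:/tmp/notebook_runs/cancel_flag"   # hypothetical shared path

def should_stop():
    """Return True if the master notebook has asked the children to stop."""
    try:
        # dbutils is available inside Databricks notebooks; head() raises if the file is missing
        dbutils.fs.head(CANCEL_FLAG, 1)
        return True
    except Exception:
        return False

# In each child notebook, between heavy cells:
if should_stop():
    dbutils.notebook.exit("cancelled")

# In the master notebook:
# dbutils.fs.put(CANCEL_FLAG, "stop", True)   # request cancellation of running children
# dbutils.fs.rm(CANCEL_FLAG)                  # clear the flag before a fresh run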

Related

In AzureML, start_logging will start asynchronous execution or synchronous execution?

The Microsoft AzureML documentation says, "A run represents a single trial of an experiment. Runs are used to monitor the asynchronous execution of a trial" and "A Run object is also created when you submit or start_logging with the Experiment class."
Regarding start_logging: as far as I know, once we have started a run by calling this method, we have to stop it, or mark it as finished with the complete() method, when the run is done. This would suggest start_logging is a synchronous way of creating an experiment run. However, the Run object created by start_logging is said to monitor the asynchronous execution of a trial.
Can anyone clarify whether start_logging will start asynchronous execution or synchronous execution?
start_logging is considered asynchronous execution, as it creates interactive run sessions. Within a single experiment there can be multiple interactive sessions working in parallel, and there is no requirement that they run sequentially.
Each individual run can be configured and identified through the parameters passed as args and kwargs.
When start_logging is called, an interactive run is created, for example from a Jupyter notebook session. The metrics and artifacts logged while that run is active are recorded against it. If an outputs directory is specified for an interactive run via the corresponding argument, that folder is tracked automatically.
The following code block illustrates how start_logging is used:
from azureml.core import Experiment

experiment = Experiment(your_workspace, "your_experiment_name")   # your_workspace is an existing Workspace object
run = experiment.start_logging(outputs=None, snapshot_directory=".", display_name="test")
...
run.log("Accuracy_Value", accuracy)   # Run.log() records a metric against this run
run.complete()                        # mark the interactive run as finished
The code below shows the basic signature of start_logging:
start_logging(*args, **kwargs)

How to set mode=reschedule in SubdagOperator to get rid of deadlock

I see in the Airflow 2 SubDagOperator documentation that using mode=reschedule we can get rid of a potential deadlock.
To my understanding it is not a param that can simply be passed along with the list of other params. If anyone has used this, let me know how to incorporate it in SubDagOperator.
Technically, a SubDagOperator is a sensor, which can take an argument mode="reschedule". The default mode poke keeps a slot open, which can potentially lead to a deadlock situation in case you're using lots of sensors. Mode reschedule instead stops the process and creates a new process on every check, not causing a situation where all slots are occupied by sensors waiting on each other.
SubDagOperator(task_id="foobar", ..., mode="reschedule")
With that said, the SubDagOperator is deprecated since Airflow 2.0 and it is advised to use TaskGroups. TaskGroups are a visual way to group together tasks within a DAG (tutorial here: https://www.astronomer.io/guides/task-groups).
Alternatively, you can use the TriggerDagRunOperator to trigger another DAG (tutorial: https://www.astronomer.io/guides/cross-dag-dependencies).
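For illustration, here is a minimal TaskGroup sketch following the suggestion above (the DAG name, task ids, and schedule are made up; Airflow 2.3+ is assumed for EmptyOperator, use DummyOperator on earlier 2.x versions):

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup

with DAG("taskgroup_example", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    start = EmptyOperator(task_id="start")

    # The group replaces the SubDAG: its tasks are scheduled directly in the parent DAG,
    # so no sensor slot is held open and the deadlock situation cannot occur.
    with TaskGroup(group_id="former_subdag") as former_subdag:
        step_1 = EmptyOperator(task_id="step_1")
        step_2 = EmptyOperator(task_id="step_2")
        step_1 >> step_2

    start >> former_subdag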

How can I monitor stalled tasks?

I am running a Rust app with Tokio in prod. In the last version I had a bug, and some requests caused my code to go into an infinite loop.
What happened is that while the task that got into the loop was stuck, all the other tasks continued to work well and process requests; that went on until the number of stalled tasks was high enough to make my program unresponsive.
My problem is that it took a long time for our monitoring systems to identify that something had gone wrong. For example, the task that answers Kubernetes' health check kept working fine, so I wasn't able to tell that I had stalled tasks in my system.
So my question is: is there a way to identify such cases and alert on them?
If I could find a way to define a timeout on a task, and mark the task as stalled if it does not return to the scheduler after X seconds/milliseconds, that would be a good enough solution for me.
Using tracing might be an option here: following issue 2655, every Tokio task should have a span. Alongside tracing-futures, this means you should get a tracing event every time a task is entered or suspended (see this example). By adding the relevant data (e.g. task id / request id / ...), you should then be able to feed this information to an analysis tool in order to know:
that a task is blocked (was resumed then never suspended again)
if you add your own spans, that a "userland" span was never exited / closed, which might mean it's stuck in a non-blocking loop (which is also an issue though somewhat less so)
I think that's about the extent of it: as noted in issue 2510, Tokio doesn't yet use the tracing information it generates and so provides no "built-in" introspection facilities.

Camunda Engine behaviour with massive multi-instances processes and ready state

I wonder how Camunda manages multiple instances of a sub-process.
For example, this BPMN:
Let's say the multi-instance process iterates over a big collection, 500 instances.
I have a function in a web app that calls the endpoint to complete the common user task, and then performs another call to the Camunda engine to get all tasks (in the callback of the first API call). I am supposed to get a list of 500 sub-process user tasks (the ones generated by the multi-instance process).
What if the get-tasks call is performed before the Camunda engine has successfully instantiated all sub-processes?
Do I get a partial list of tasks?
How can I detect that the main and sub-processes are ready?
I don't really know if Camunda is able to handle this by itself, so I thought of the following solution, knowing I can only use the Modeler environment with Groovy to add code (JavaScript as well, but the code parts already added are all Groovy):
Use a sub-process throw event that is caught in the main process, then, for each signal emitted, count the tasks that are ready and compare against the expected number of tasks.
Thanks
I would most likely spawn the tasks as parallel processes (500 of them) and then go to a next step in which I signal or otherwise set a state that indicates the spawning is completed. I would then join the parallel processes all together again and have there a task that signals or otherwise sets a state indicating all the parallel processes are done. See https://docs.camunda.org/manual/7.12/reference/bpmn20/gateways/parallel-gateway/. This way you know exactly at what point (after spawning is done and before the join) you have a chance of getting your 500 spawned sub-processes.
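If you also want to check readiness from the web-app side, one option (not part of the answer above) is to poll the engine's REST API until the expected number of user tasks exists before fetching the full list. A minimal Python sketch, assuming the standard engine-rest endpoints and a placeholder base URL and process instance id:

import time
import requests

BASE_URL = "http://localhost:8080/engine-rest"   # placeholder engine location
EXPECTED_TASKS = 500                             # assumed collection size from the question

def wait_for_tasks(process_instance_id, expected=EXPECTED_TASKS, poll_seconds=1.0, timeout=120):
    """Poll /task/count until all multi-instance user tasks exist, then fetch the full list."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        count = requests.get(
            f"{BASE_URL}/task/count",
            params={"processInstanceId": process_instance_id},
        ).json()["count"]
        if count >= expected:
            # All tasks have been created, so the list can now be fetched safely.
            return requests.get(
                f"{BASE_URL}/task",
                params={"processInstanceId": process_instance_id},
            ).json()
        time.sleep(poll_seconds)
    raise TimeoutError("Engine did not create all sub-process tasks in time")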

Timeout a pyspark job

TL;DR
Is there a way to timeout a pyspark job? I want a spark job running in cluster mode to be killed automatically if it runs longer than a pre-specified time.
Longer Version:
The cryptic timeouts listed in the documentation are at most 120s, except one which is infinity; but that one is only used if spark.dynamicAllocation.enabled is set to true, and by default (I haven't touched any config parameters on this cluster) it is false.
I want to know because I have code that, for a particular pathological input, will run extremely slowly. For expected input the job will terminate in under an hour. Detecting the pathological input is as hard as trying to solve the problem, so I don't have the option of doing clever preprocessing. The details of the code are boring and irrelevant, so I'm going to spare you from having to read them =)
I'm using PySpark, so I was going to decorate the function causing the hang-up like this, but it seems that this solution doesn't work in cluster mode. I call my Spark code via spark-submit from a bash script, but as far as I know bash "goes to sleep" while the Spark job is running and only gets control back once the Spark job terminates, so I don't think this is an option.
Actually, the bash approach might be a solution if I did something clever, but I'd have to get the driver ID for the job like this, and by now I'm thinking "this is too much thought and typing for something as simple as a timeout, which ought to be built in."
You can set a classic Python alarm. Then, in the handler function, you can raise an exception or call sys.exit() to finish the driver code. As the driver finishes, YARN kills the whole application.
You can find example usage in documentation: https://docs.python.org/3/library/signal.html#example
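To make that concrete, here is a minimal sketch of the alarm approach (run_job() is a placeholder for your actual driver code, and signal.alarm only works on Unix-like systems, which is the normal case for a YARN driver):

import signal
import sys

TIMEOUT_SECONDS = 3600   # pre-specified wall-clock limit for the whole job

def timeout_handler(signum, frame):
    # Raising here aborts whatever the driver is currently doing.
    raise TimeoutError(f"Driver exceeded {TIMEOUT_SECONDS}s, aborting job")

signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(TIMEOUT_SECONDS)        # schedule SIGALRM

try:
    run_job()                        # placeholder for the actual PySpark driver code
except TimeoutError as exc:
    print(exc, file=sys.stderr)
    sys.exit(1)                      # non-zero exit -> YARN tears down the application
finally:
    signal.alarm(0)                  # cancel the alarm if the job finished in time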
