Timeout a pyspark job - apache-spark

TL;DR
Is there a way to timeout a pyspark job? I want a spark job running in cluster mode to be killed automatically if it runs longer than a pre-specified time.
Longer Version:
The cryptic timeouts listed in the documentation are at most 120s, except one which is infinity; that one is only used if spark.dynamicAllocation.enabled is set to true, but by default (I haven't touched any config parameters on this cluster) it is false.
I want to know because I have code that, for a particular pathological input, will run extremely slowly. For expected input the job will terminate in under an hour. Detecting the pathological input is as hard as trying to solve the problem, so I don't have the option of doing clever preprocessing. The details of the code are boring and irrelevant, so I'm going to spare you having to read them =)
I'm using pyspark, so I was going to decorate the function causing the hang-up like this, but it seems that this solution doesn't work in cluster mode. I call my spark code via spark-submit from a bash script, but as far as I know bash "goes to sleep" while the spark job is running and only gets control back once the spark job terminates, so I don't think this is an option.
Actually, the bash thing might be a solution if I did something clever, but I'd have to get the driver id for the job like this, and by now I'm thinking "this is too much thought and typing for something as simple as a timeout, which ought to be built in."

You can set a classic Python alarm. In the handler function you can then raise an exception or call sys.exit() to finish the driver code. Once the driver finishes, YARN kills the whole application.
You can find example usage in documentation: https://docs.python.org/3/library/signal.html#example
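For example, a minimal sketch of that idea in the driver script (run_job and the two-hour limit are placeholders for your own code and deadline; signal.alarm is Unix-only, which is fine for a driver running under YARN):

import signal
import sys

TIMEOUT_SECONDS = 2 * 60 * 60  # hypothetical two-hour limit, adjust as needed

def timeout_handler(signum, frame):
    # Exiting (or raising) here ends the driver; YARN then kills the whole application.
    sys.exit("Job exceeded {} seconds, aborting.".format(TIMEOUT_SECONDS))

signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(TIMEOUT_SECONDS)  # schedule SIGALRM

try:
    run_job()  # placeholder for the actual pyspark driver code
finally:
    signal.alarm(0)  # cancel the alarm if the job finishes in time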

Related

Do submitted jobs take a copy of the source? Queued jobs?

When submitting jobs with sbatch, is a copy of my executable taken to the compute node? Or does it just execute the file from /home/user/? Sometimes when I am unorganised I will submit a job, then change the source and re-compile to submit another job. This does not seem like a good idea, especially if the job is still in the queue. At the same time it seems like it should be allowed, and it would be much safer if at the moment of calling sbatch a copy of the source was made.
I ran some tests which confirmed (unsurprisingly) that once a job is running, recompiling the source code has no effect. But when the job is in the queue, I am not sure. It is difficult to test.
edit: man sbatch does not seem to give much insight, other than to say that the job is submitted to the Slurm controller "immediately".
The sbatch command creates a copy of the submission script and a snapshot of the environment and saves it in the directory listed as the StateSaveLocation configuration parameter. It can therefore be changed after submission without effect.
But that is not the case for the files used in the submission script. If your submission script starts an executable, it will see the "version" of the executable at the time it starts.
Modifying the program before the job starts will lead to the new version being run; modifying it during the run (i.e. after it has already been read from disk and loaded into memory) will leave the old version running.

HPC SLURM and batch calls to MPI-enabled application in Master-Worker system

I am trying to implement some sort of Master-Worker system on an HPC cluster with the SLURM resource manager, and I am looking for advice on how to implement such a system.
I have to use some python code that plays the role of the Master, in the sense that between batches of calculations the Master will run 2 seconds of its own calculations, before sending a new batch of work to the Workers. Each Worker must run an external executable over a single node of the HPC. The external executable (Gromacs) is itself MPI-enabled. There will be ~25 Workers and many batches of calculations.
What I have in mind atm (also see EDIT further below):
What I'm currently trying:
Allocate via SLURM as many MPI tasks as the number of nodes I want to use, within a bash script that I call via sbatch run.sh:
#!/bin/bash -l
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
module load required_env_module_for_external_executable
srun python my_python_code.py
Within my_python_code.py, get the current MPI rank, and use rank/node 0 to run the Master python code:
from mpi4py import MPI
name = MPI.Get_processor_name()
rank = MPI.COMM_WORLD.Get_rank()
size = MPI.COMM_WORLD.Get_size()
if rank == 0:  # Master
    run_initialization_and_distribute_work_to_Workers()
else:  # Workers
    start_Worker_waiting_for_work()
Within the python code of the Workers, start the external (MPI-enabled) application using MPI.COMM_SELF.Spawn()
def start_Worker_waiting_for_work():
    # here we are on a single node
    executable = 'gmx_mpi'
    # Spawn expects the arguments as a sequence of strings, not one string
    exec_args = ['mdrun', '-deffnm', 'calculation_n']
    # create some relationship between current MPI rank
    # and the one the executable should use ?
    mpi_info = MPI.Info.Create()
    mpi_info.Set('host', MPI.Get_processor_name())
    commspawn = MPI.COMM_SELF.Spawn(executable, args=exec_args,
                                    maxprocs=1, info=mpi_info)
    commspawn.Barrier()
    commspawn.Disconnect()
    res_analysis = do_some_analysis()  # check what the executable produced
    return res_analysis
What I would like some explanations on:
Can someone confirm that this approach seems valid for implementing the desired system? Or is it obvious it has no chance of working? If so, why?
I am not sure that MPI.COMM_SELF.Spawn() will make the executable inherit the SLURM resource allocation. If not, how can I fix this? I think that MPI.COMM_SELF.Spawn() is what I am looking for, but I'm not sure.
The external executable requires some environment modules to be loaded. If they are loaded in sbatch run.sh, are they still loaded when I invoke MPI.COMM_SELF.Spawn() from my_python_code.py?
As a slightly different approach, is it possible to have something like pre-allocations/reservations to book resources for the Workers, then use MPI.COMM_WORLD.Spawn() together with the pre-allocations/reservations? The goal is also to avoid entering the SLURM queue at each new batch, as this may waste a lot of clock time (hence the wish to book all required resources at the very beginning).
Since the python Master always has to stay alive anyway, SLURM job dependencies cannot be useful here, can they?
Thank you so much for any help you may provide !
EDIT: Simplification of the workflow
In an attempt to keep my question simple, I first omitted the fact that I actually had the Workers doing some analysis. But this work can be done on the Master using OpenMP multiprocessing, as Gilles Gouillardet suggested. It executes fast enough.
The Workers are indeed necessary, because each task takes about 20-25 min on a single Worker/Node.
I also added some bits about maintaining my own queue of tasks to be sent to the SLURM queue and ultimately to the Workers, in case the number of tasks exceeds a few tens or hundreds of jobs. This should also provide some flexibility in the future, when re-using this code for different applications.
This is probably fine as it is. I will try to go this way and update these lines. EDIT: It works fine.
At first glance, this looks overly convoluted to me:
there is no communication between a slave and GROMACS
there is some master/slave communication, but is MPI really necessary?
are the slaves really necessary? (e.g. can the master process simply serialize the computation and then directly start GROMACS?)
A much simpler architecture would be to have one process on your frontend, that will do:
prepare the GROMACS inputs
sbatch gromacs (start several jobs in a row)
wait for the GROMACS jobs to complete
analyze the GROMACS outputs
re-iterate or exit
If the slaves do some work you do not want to serialize on the master, can you replace the MPI communications with files on a shared filesystem? In that case, you can do the computation on the compute nodes within a GROMACS job, before and after executing GROMACS. If not, maybe TCP/IP-based communications can do the trick.
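As a rough sketch of that frontend loop (prepare_gromacs_inputs, analyze_gromacs_outputs and converged are hypothetical placeholders for your own code; it assumes sbatch --wait is available so each submission blocks until the job finishes):

import subprocess

def run_batch(job_scripts):
    # Submit one GROMACS job per script; 'sbatch --wait' only returns
    # once the corresponding job has completed.
    procs = [subprocess.Popen(['sbatch', '--wait', script]) for script in job_scripts]
    for p in procs:
        p.wait()

# Hypothetical driver loop running on the frontend.
while True:
    job_scripts = prepare_gromacs_inputs()  # placeholder: write inputs + job scripts
    run_batch(job_scripts)                  # start several jobs in a row, wait for all of them
    results = analyze_gromacs_outputs()     # placeholder: post-process the outputs
    if converged(results):                  # placeholder stopping criterion
        break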

How to kill parallel execution of Databricks notebooks?

I am currently using Python's Threading to parallelize the execution of multiple Databricks notebooks. These are long-running notebooks, and I need to add some logic for killing the threads in the case where I want to restart the execution with new changes. When re-executing the master notebook without killing the threads, the cluster is quickly filled with computationally heavy, long-lived threads, leaving little space for the actually required computations.
I have tried these suggestions without luck. Furthermore, I have tried getting the runId from dbutils.notebook.run() and killing the thread with dbutils.notebook.exit(runId), but since the call to dbutils.notebook.run() is synchronous, I am unable to obtain the runId before the notebook has executed.
I would appreciate any suggestion on how to solve this issue!
Hello @sondrelv and thank you for your question. I would like to clarify that dbutils.notebook.exit(value) is not used for killing other threads by runId. dbutils.notebook.exit(value) is used to make the current thread exit and return a value.
I see the difficulty of managing this without an interrupt available inside the notebook code. Given this limitation, I have tried to look for other ways to cancel the threads.
One way is to use other utilities to kill the thread/run.
Part of the difficulty in solving this is that threads/runs created through dbutils.notebook.run() are ephemeral runs. The Databricks CLI command databricks runs get --run-id <ephemeral_run_id> can fetch the details of an ephemeral run. If the details can be fetched, then cancelling should also work (databricks runs cancel ...).
The remaining difficulty is getting the run ids. Ephemeral runs are excluded from the CLI runs list operation databricks runs list.
As you noted, the dbutils.notebook.run() is synchronous, and does not return a value to code until it finishes.
However, in the notebook UI, the run ID and a link are printed when the run starts. There must be a way to capture these; I have not yet found how.
Another possible solution would be to create some endpoint or resource that the child notebooks check to decide whether they should continue execution or exit early using dbutils.notebook.exit(). Doing this check every few cells would be similar to the approaches in the article you linked, just applied to a notebook instead of a thread.
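As a sketch of that last idea (the flag path and helper are assumptions, not an existing API; it only relies on dbutils.fs.ls and dbutils.notebook.exit):

# Hypothetical cooperative-cancellation check, repeated between heavy cells
# of each child notebook. STOP_FLAG is an arbitrary path chosen for this example.
STOP_FLAG = "dbfs:/tmp/my_pipeline/stop"

def should_stop():
    try:
        dbutils.fs.ls(STOP_FLAG)  # raises if the flag file does not exist
        return True
    except Exception:
        return False

if should_stop():
    dbutils.notebook.exit("cancelled")  # end this run early and return a value

The master notebook could create the flag, e.g. with dbutils.fs.put(STOP_FLAG, "", True), before re-executing, and remove it with dbutils.fs.rm(STOP_FLAG) once the new run has started.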

Automatically rerun jobs submitted with sbatch --array upon error

I am submitting jobs in an array. Occasionally one job will error because of a difficult-to-diagnose GPU memory issue. Simply rerunning the job results in success.
What I would like to do is catch this error, log it, and put the job back into Slurm's queue to be rerun. If this is not possible to do with an array job, that's fine; it's not essential to use arrays (though it is preferred).
I've tried playing around with sbatch --rerun, but this doesn't seem to do what I want (I think this option is for rerunning after a hardware error detected by Slurm, or when a node is restarted while a job is running; this isn't the case for my jobs).
Any advice well received.
If you can detect the GPU memory issue, you can end your submission job with a construct like this:
if <gpu memory issue>; then
scontrol requeue $SLURM_JOBID
fi
This will put the job back in the scheduling queue and it will be restarted as is. Interestingly, the SLURM_RESTART_COUNT environment variable holds the number of times the job was re-queued.
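If the failure is easier to detect inside the Python code itself, here is a sketch of the same idea from within the job (run_gpu_job and the retry cap are placeholders; depending on site configuration you may also need to submit with #SBATCH --requeue for the job to be requeue-able):

import os
import subprocess
import sys

MAX_REQUEUES = 3  # arbitrary cap so a persistent failure does not requeue forever

def main():
    try:
        run_gpu_job()  # placeholder for the real work
    except RuntimeError as err:  # placeholder for the GPU memory error you can detect
        restarts = int(os.environ.get("SLURM_RESTART_COUNT", "0"))
        print("GPU memory issue ({}), restart count = {}".format(err, restarts), file=sys.stderr)
        if restarts < MAX_REQUEUES:
            # Ask Slurm to put this job back in the queue; it will be restarted as-is.
            subprocess.run(["scontrol", "requeue", os.environ["SLURM_JOB_ID"]], check=True)
        else:
            raise

if __name__ == "__main__":
    main()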

qsub: What is the standard way to get occasional updates on a submitted job?

I have just begun using an HPC, and I'm having trouble adjusting my workflow.
I submit a job using qsub myjob.sh. Then I can view the status of the job by typing qstat -u myusername. This gives me some details about my job, such as how long it has been running for.
My job is a python script that occasionally prints out an update to indicate how things are going in the program. I know that this output will instead appear in the output file once the job is done, but how can I go about monitoring the program as it runs? One way is to print the output to a file instead of printing to the screen, but this seems like a bit of a hack.
Any other tips on improving this process would be great.
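One common workaround along those lines is to write progress to a file in the job's working directory and flush it immediately, so it can be followed with tail -f from a login node while the job runs; a minimal sketch (do_one_step is a placeholder for the real work):

import time

# Write progress to a file in the job's working directory and flush immediately,
# so 'tail -f progress.log' on a login node shows updates while the job runs
# (stdout is typically spooled by the scheduler until the job ends).
with open("progress.log", "a", buffering=1) as log:  # line-buffered text file
    for step in range(100):
        do_one_step(step)  # placeholder for the real work
        log.write("finished step {} at {}\n".format(step, time.ctime()))
        log.flush()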
