I have a simulation that consists of N steps, run sequentially. Each of these steps modifies a global state in memory, until the final step which is the result. It is possible, after a step has run, to write to disk the intermediate state that this step just computed, and to load such an intermediate state instead of starting from scratch. Writing and loading intermediate states has a non-negligible cost.
I want to run many variations of a simulation on a Slurm cluster. Each variation will change the parameter of some of the steps.
Example
Simulation steps
S1 --> S2 --> S3 --> S4
Variations
run1: S2.speed=2, S3.height=12
run2: S2.speed=2, S3.height=20
run3: S2.speed=2, S3.height=40
run4: S2.speed=5, S3.height=12
run5: S2.speed=5, S3.height=80
What I want to do is for the various runs to share common computations, by dumping the intermediate state of the shared steps. This will form a tree of step runs:
S1
├─ S2 (speed=2)
│  ├─ S3 (height=12)
│  │  └─ S4
│  ├─ S3 (height=20)
│  │  └─ S4
│  └─ S3 (height=40)
│     └─ S4
└─ S2 (speed=5)
   ├─ S3 (height=12)
   │  └─ S4
   └─ S3 (height=80)
      └─ S4
I know I can get the result of the 5 runs by running 5 processes:
run1: S1 --> S2 (speed=2) --> S3 (height=12) --> S4
run2: (dump of run1.S2) --> S3 (height=20) --> S4
run3: (dump of run1.S2) --> S3 (height=40) --> S4
run4: (dump of run1.S1) --> S2 (speed=5) --> S3 (height=12) --> S4
run5: (dump of run4.S2) --> S3 (height=80) --> S4
This reduces the computation from 20 steps using a naive approach, to 13 steps with 3 dumps and 4 loads.
Now, my question is how to model this with Slurm, to make the best use of the scheduler?
One solution I can think of is that each run is responsible for submitting the jobs of the runs that depend on it, after dumping the intermediate state. Run1 would submit run4 after dumping S1, then submit run2 and run3 after dumping S2, and run4 would submit run5 after dumping its S2. With this solution, is there any point in declaring the dependency when submitting the jobs to Slurm?
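A minimal sketch of what run1's job could look like under this scheme; run_step(), dump_state(), and the run*.sh scripts are placeholders I made up, only sbatch itself is real Slurm:

# Sketch of run1's job: compute steps, dump the states that other runs share,
# and submit those runs as soon as their starting state is on disk.
import subprocess

def run_step(name, **params):
    ...  # placeholder for the actual simulation step

def dump_state(path):
    ...  # placeholder: write the in-memory state to disk

def submit(script, *args):
    # a child run is submitted only once the state it needs exists on disk
    subprocess.run(["sbatch", script, *args], check=True)

run_step("S1")
dump_state("S1.state")
submit("run4.sh", "S1.state")           # run4 restarts from S1's dump

run_step("S2", speed=2)
dump_state("S2_speed2.state")
submit("run2.sh", "S2_speed2.state")    # run2 and run3 restart from S2's dump
submit("run3.sh", "S2_speed2.state")

run_step("S3", height=12)
run_step("S4")                          # run1 finishes its own chain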
Another solution I can see is to break the long chains of computation into multiple, dependent jobs. The list of jobs to submit and their dependencies would basically be the tree I drew above (except that the S3/S4 pairs would be merged into the same job). That is 8 jobs to submit instead of 5, but I can submit them all at once from the beginning, with the right dependencies. However, I am not sure what the advantages of this approach would be. Will Slurm do a better job as a scheduler if it knows the full list of jobs and their dependencies right from the start? Are there advantages from a user point of view to having all the jobs submitted and linked with dependencies (e.g. to cancel all the jobs that depend on the root job)? I know I can submit many jobs at once with a job array, but I don't see a way to declare dependencies between jobs of the same array. Is it possible, or even advisable?
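For reference, a minimal sketch of submitting the whole tree up front; the s*.sh job scripts and their arguments are placeholders, while sbatch --parsable and --dependency=afterok are real Slurm options:

# Submit the full tree of jobs at once; each child declares a dependency on
# its parent with --dependency=afterok. The s*.sh scripts are placeholders.
import subprocess

def submit_job(script, *args, after=None):
    cmd = ["sbatch", "--parsable"]           # --parsable prints only the job id
    if after:
        cmd.append("--dependency=afterok:" + ":".join(after))
    cmd += [script, *args]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return out.strip().split(";")[0]

s1 = submit_job("s1.sh")
s2_fast = submit_job("s2.sh", "--speed=2", after=[s1])
s2_slow = submit_job("s2.sh", "--speed=5", after=[s1])
for height in (12, 20, 40):                  # merged S3/S4 jobs per variation
    submit_job("s3_s4.sh", f"--height={height}", after=[s2_fast])
for height in (12, 80):
    submit_job("s3_s4.sh", f"--height={height}", after=[s2_slow])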
Finally, are there other approaches I did not think about?
Edit
The example I gave is of course simplified a lot. The real simulations will contain hundreds of steps, with about a thousand variations to try. The scalability of the chosen solution is important.
One solution I can think of is that each run is responsible for submitting the jobs of the runs that depend on it, after dumping the intermediate state. With this solution, is there any point in declaring the dependency when submitting the jobs to Slurm?
This is an approach often followed with simple workflows that involve long-running jobs that must checkpoint and restart.
Another solution I can see is to break the long chains of computation into multiple, dependent jobs. Will Slurm do a better job as a scheduler if it knows the full list of jobs and their dependencies right from the start?
No. Slurm will simply ignore the jobs that are not yet eligible to start because the jobs they depend on have not finished.
Are there advantages from a user point of view to having all the jobs submitted and linked with dependencies (e.g. to cancel all the jobs that depend on the root job)?
Yes, but that is marginally useful.
I know I can submit many jobs at once with a job array, but I don't see a way to declare dependencies between jobs of the same array. Is it possible, or even advisable?
No, you cannot set dependencies between jobs of the same array.
Finally, are there other approaches I did not think about?
You could use a workflow management system.
One of the simplest solutions is Makeflow. It uses files that look like classical Makefiles to describe the dependencies between jobs. Running the workflow on Slurm is then as simple as something like makeflow -T slurm makefile.mf.
Another option is Bosco. Bosco offers a few more possibilities and is good for personal use. It is easy to set up and can submit jobs to multiple clusters.
Finally, Fireworks is a very powerful solution. It requires a MongoDB database and is more suited for lab-wide use, but it can implement very complex logic for job submission/resubmission based on the outputs of jobs, and can handle errors in a clever way. You can, for instance, implement a workflow where a job is submitted with a given value for a given parameter, have Fireworks monitor the convergence based on the output file, and cancel and resubmit with another value if the convergence is not satisfactory.
Another possible solution is to make use of pipeline tools. In the field of bioinformatics, Snakemake is becoming really popular. Snakemake is based on GNU Make but implemented in Python, hence the name. With Snakemake you specify which output you want, and Snakemake deduces which rules it has to run to produce that output. One of the nice things about Snakemake is that it scales really easily from personal laptops to bigger computers and even clusters (for instance Slurm clusters). Your example would look something like this:
rule all:
    input:
        ['S4_speed_2_height_12.out',
         'S4_speed_2_height_20.out',
         'S4_speed_2_height_40.out',
         'S4_speed_5_height_12.out',
         'S4_speed_5_height_80.out']

rule S1:
    output:
        "S1.out"
    shell:
        "touch {output}"  # do your heavy computations here

rule S2:
    input:
        "S1.out"
    output:
        "S2_speed_{speed}.out"
    shell:
        "touch {output}"

rule S3:
    input:
        "S2_speed_{speed}.out"
    output:
        "S3_speed_{speed}_height_{height}.out"
    shell:
        "touch {output}"

rule S4:
    input:
        "S3_speed_{speed}_height_{height}.out"
    output:
        "S4_speed_{speed}_height_{height}.out"
    shell:
        "touch {output}"
We can then ask Snakemake to make a pretty image of how it would perform these computations (for example with snakemake --dag piped through Graphviz's dot):
Snakemake automatically figures out which output can be used by different rules.
Running this on your local machine is as simple as executing snakemake, and submitting the jobs to Slurm is just snakemake --cluster "sbatch". The example I gave is obviously an oversimplification, but Snakemake is highly customizable (number of threads per rule, memory usage, etc.), and has the advantage that it is based on Python. It takes a bit of figuring out how everything works in Snakemake, but I can definitely recommend it.
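As a hedged illustration of that per-rule customization (the thread and memory values below are made up, but threads: and resources: are real Snakemake directives):

# Illustrative only: give S3 its own CPU and memory requirements so the
# cluster submission can request matching resources.
rule S3:
    input:
        "S2_speed_{speed}.out"
    output:
        "S3_speed_{speed}_height_{height}.out"
    threads: 4
    resources:
        mem_mb=8000
    shell:
        "touch {output}"  # heavy computation would go here

These values can then be forwarded to the scheduler with something like snakemake --jobs 100 --cluster "sbatch --cpus-per-task={threads} --mem={resources.mem_mb}".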
Related
I want to perform hyperparameter search using AzureML. My models are small (around 1GB) thus I would like to run multiple models on the same GPU/node to save costs but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
from azureml.core import Experiment, ScriptRunConfig

experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment="env_name",
                         arguments=["--args args"])
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried to use MpiConfiguration there but if this is done the run fails due to an MPI error that reads as if the cluster is configured to only allow one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run to one GPU, and additionally providing an MpiConfiguration leads to the same error as shown above.
I guess I could always rewrite my train script to train multiple models in parallel, such that each run wraps multiple trainings. I would like to avoid this option, though, because logging and checkpoint writes become increasingly messy and it would require a large refactor of the train pipeline. Also, this functionality seems so basic that I hope there is a way to do it gracefully. Any ideas?
Use the Run.create_children method, which will start child runs that are “local” to the parent run and don’t need authentication.
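A hedged sketch of how this could look inside train.py (train_one_model and param_grid are placeholders I invented; check the create_children signature of your SDK version):

# Inside train.py: one submitted run per node, several child runs sharing
# the node's GPU. train_one_model() and param_grid are placeholders.
from azureml.core import Run

def train_one_model(params):
    ...  # placeholder for training a single small model

parent_run = Run.get_context()                     # the run submitted via ScriptRunConfig
param_grid = [{"lr": 1e-3}, {"lr": 1e-4}, {"lr": 1e-5}]

children = parent_run.create_children(count=len(param_grid))
for child, params in zip(children, param_grid):
    child.log("lr", params["lr"])                  # per-child metrics stay separate
    train_one_model(params)
    child.complete()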
For AmlCompute, max_concurrent_runs maps to the maximum number of nodes that will be used for a hyperparameter tuning run, so there would be one execution per node.
You can also deploy a single service but load multiple model versions in init(); the score function then, depending on the request's parameters, uses a particular model version to score.
Or use the new ML Endpoints (preview):
What are endpoints (preview) - Azure Machine Learning | Microsoft Docs
I have a question for a very specific use case. I'll start by giving a bit of background:
I am trying to train a deep learning model in Keras and want to do 10-fold cross-validation to check the training stability of the model. Usually I create Snakemake workflows and execute them on a Slurm cluster. Due to limited GPU nodes, I would like to checkpoint my model, stop the job, and resubmit it once in a while so as not to block the GPUs. The goal of this would be to train the model iteratively with short-running jobs.
Now to my questions:
Is there a way to resubmit a job a certain number of times/until a condition is met?
Is there another clever way to train a model iteratively without having to manually submit the job?
For this, you need to submit the job with the command
llsubmit job.sh
The shell script or batch job file should be submitted as many times as needed. Once a job finishes and its resources become available, the same script (already submitted and waiting in the queue) restarts automatically.
Here are a few suggestions:
Just train your network. It's up to the scheduler to try not to block the GPUs, and running 10 short jobs vs. 1 long job will probably lead to the same priority.
You can specify --restart-times to re-run a job that has failed, multiple times. The trick is that Snakemake will also remove the outputs of failed jobs. The workaround is to checkpoint your model to a temporary file (not in the output directive of the rule) and exit your training with an error to signal to Snakemake that it needs to run again. The inelegant part is that you have to set the restart count to a large value, or make sure your training code knows it is running the final attempt and needs to save the actual output. You can access the attempt number as a resource; I'm not sure the parameter is available in other directives. Also, any job that fails will be resubmitted, which is not a great option during development.
You can make your checkpoint files outputs. This again assumes you want to run a set number of times. Your rule all will look for a file like final.checkpoint, which depends on 10.checkpoint, which depends on 9.checkpoint, and so on. With a fancy enough input function this can be implemented in one rule where 1.checkpoint depends on nothing (or your training data, perhaps); see the sketch below.
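A minimal sketch of that chained-checkpoint idea, assuming a hypothetical train.py with --resume-from/--save-to options and a data/train.csv input that are not part of the original answer:

N_ROUNDS = 10  # number of short training rounds

def previous_checkpoint(wildcards):
    i = int(wildcards.i)
    # round 1 starts from the training data; later rounds from the previous checkpoint
    return "data/train.csv" if i == 1 else f"checkpoints/{i - 1}.checkpoint"

rule all:
    input:
        f"checkpoints/{N_ROUNDS}.checkpoint"

rule train_round:
    input:
        previous_checkpoint
    output:
        "checkpoints/{i}.checkpoint"
    shell:
        "python train.py --resume-from {input} --save-to {output}"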
There may be an obvious answer to this, but I couldn't find any after a lot of googling.
In a typical program, I'd normally add log messages to time different parts of the code and find out where the bottleneck is. With Spark/PySpark, however, transformations are evaluated lazily, which means most of the code is executed in almost constant time (not a function of the dataset's size at least) until an action is called at the end.
So how would one go about timing individual transformations and perhaps making some parts of the code more efficient by doing things differently where necessary and possible?
You can use the Spark UI to see the execution plan of your jobs and the time spent in each phase of them. Then you can optimize your operations using those statistics. Here is a very good presentation about monitoring Spark applications using the Spark UI: https://youtu.be/mVP9sZ6K__Y (Spark Summit Europe 2016, by Jacek Laskowski)
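If you also want rough timings in the driver log, a minimal sketch (not from the answer above; the DataFrame and steps are made up for illustration) is to force each intermediate result with an action and time it:

# Rough, illustrative timing of individual transformations by forcing an
# action after each one. The steps below are placeholders.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timing-sketch").getOrCreate()
df = spark.range(10_000_000)

def timed(label, frame):
    start = time.time()
    frame.count()                       # an action forces the lazy plan to run
    print(f"{label}: {time.time() - start:.2f}s")

step1 = df.selectExpr("id", "id * 2 AS doubled")
timed("step1", step1)

step2 = step1.filter("doubled % 3 = 0")
timed("step1+step2", step2)             # re-runs step1 unless it is cached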
Any job troubleshooting should follow the steps below.
Step 1: Gather data about the issue
Step 2: Check the environment
Step 3: Examine the log files
Step 4: Check cluster and instance health
Step 5: Review configuration settings
Step 6: Examine input data
From the Hadoop admin perspective, basic troubleshooting of a long-running Spark job goes as follows. Go to RM > Application ID.
a) Check for AM & non-AM preemption. This can happen if more than the required memory is assigned to either the driver or the executors, which can then get preempted by a high-priority job/YARN queue.
b) Click on the AppMaster URL. Review the environment variables.
c) Check the Jobs section and review the event timeline. Check whether executors start immediately after the driver or take time.
d) If the driver process is taking time, see whether collect()/collectAsList() is running on the driver, as these methods tend to take time because they retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes) to the driver node.
e) If there is no issue in the event timeline, go to the incomplete task > stages and check Shuffle Read Size/Records for any data-skewness issue.
f) If all tasks are complete and the Spark job is still running, go to the Executors page > driver process thread dump and search for the driver, then look at the operation the driver is working on. Typical NameNode operation methods you may see there (if any) are:
getFileInfo()
getFileList()
rename()
merge()
getblockLocation()
commit()
I am running a dummy Spark job that does exactly the same set of operations in every iteration. The following figure shows 30 iterations, where each job corresponds to one iteration. It can be seen that the duration is always around 70 ms except for jobs 0, 4, 16, and 28. The behavior of job 0 is expected, as it is when the data is first loaded.
But when I click on job 16 to enter its detailed view, the duration is only 64 ms, which is similar to the other jobs; the screenshot of this duration is as follows:
I am wondering where Spark spends the (2000 - 64) ms on job 16?
Gotcha! That's exactly the same question I asked myself a few days ago. I'm glad to share the findings with you (hoping that where I'm lacking understanding, others chime in and fill the gaps).
The difference between what you can see in Jobs and Stages pages is the time required to schedule the stage for execution.
In Spark, a single job can have one or many stages with one or many tasks. That creates an execution plan.
By default, a Spark application runs in FIFO scheduling mode which is to execute one Spark job at a time regardless of how many cores are in use (you can check it in the web UI's Jobs page).
Quoting Scheduling Within an Application:
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
You should then see how many tasks a single job will execute and divide it by the number of cores the Spark application has assigned (you can check it in the web UI's Executors page).
That will give you the estimate on how many "cycles" you may need to wait before all tasks (and hence the jobs) complete.
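For example (a made-up illustration, not from the original answer): a job with 200 tasks on an application that has been assigned 16 cores needs roughly ceil(200 / 16) = 13 such waves of tasks before it completes.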
NB: That's where dynamic allocation comes into play, as you may sometimes want more cores later while starting with very few upfront. That's the conclusion I offered to my client when we noticed a similar behaviour.
I can see that all the jobs in your example have 1 stage with 1 task (which makes them very simple and highly unrealistic for a production environment). That tells me that your machine could have been busier at different intervals, so the time Spark took to schedule a job was longer, but once scheduled the corresponding stage finished like the stages of the other jobs. I'd say it's a beauty of profiling that it may sometimes (often?) get very unpredictable and hard to reason about.
Just to shed more light on the internals of how the web UI works: the web UI uses a bunch of Spark listeners that collect the current status of the running Spark application. There is at least one Spark listener per page in the web UI. They intercept different execution times depending on their role.
Read about the org.apache.spark.scheduler.SparkListener interface and review the different callbacks to learn about the variety of events they can intercept.
When I run a U-SQL script from the portal/Visual Studio, it goes through stages like preparing, queued, running, finalizing. What exactly happens behind the scenes in all these stages? Will there be any difference in execution time when the job is run from Visual Studio/the portal in a dev versus a production environment? We need to clock the speeds and record the time the script would take in production. Ultimately, the goal is to run these scripts as Data Factory activities in production.
I assume there will be differences, since your dev environment would probably run at lower resource usage (a lower degree of parallelism, both between jobs and inside a job) than your production environment. Otherwise there should be no difference.
Note that we are still working on performance so if you are running into particular issues, please let us know.
The phases roughly do the following (I am probably missing some parts):
preparing: includes compilation, optimization, codegen, preparing the execution graph and required resources, and putting the job into the queue.
queueing: the job sits in the queue until it reaches the top of the queue and resources are available to start it. This can be impacted by setting the maximal number of jobs that can run in parallel (a setting you can change by "calling" support/us).
running: actual job execution. This will be affected by resources: the maximal degree of parallelism specified on the job, network bandwidth, and store access (throttling, bandwidth).
finalizing: cleanup and stitching results into files, "sealing" table files. This can be more expensive depending on where you write the data (ADL is faster than WASB, for example).