SLURM job taking up entire node when using just one GPU - slurm

I am submitting multiple jobs to a SLURM queue. Each job uses 1 GPU. We have 4 GPUs per node. However once a job is running, it takes up the entire node, leaving 3 GPUs idle. Is there any way to avoid this, so that I can send multiple jobs to one node, using one GPU each?
My script looks like this:
#SLURM --gres=gpu:1
#SLURM --ntasks-per-node 1
#SLURM -p ghp-queue
myprog.exe

I was also unable to run multiple jobs on different GPUs. What helped was adding OverSubscribe=FORCE to the partition configuration in slurm.conf, like this:
PartitionName=compute Nodes=ALL ... OverSubscribe=FORCE
After that, I was able to run four jobs with --gres=gpu:1, and each one took a different GPU (a fifth job is queued, as expected).

Related

Dataflow exceeds number_of_worker_harness_threads

I deployed Dataflow job with param --number_of_worker_harness_threads=5 (streaming mode).
Next I send 20x PubSub messages triggering 20x loading big CSV files from GCS and start processing.
In the logs I see that job took 10 messages and process it in parallel on 6-8 threads (I checked several times, sometimes it was 6, sometimes 8).
Nevertheless all the time it was more than 5.
Any idea how it works? It does not seem to be expected behavior.
Judging from the flag name, you are using Beam Python SDK.
For Python streaming, the total number of threads running DoFns on 1 worker VM in current implementation may be up to the value provided in --number_of_worker_harness_threads times the number of SDK processes running on the worker, which by default is the number of vCPU cores. There is a way to limit number of processes to 1 regardless of # of vCPUs. To do so, set --experiments=no_use_multiple_sdk_containers.
For example, if you are using --machine_type=n1-standard-2 and --number_of_worker_harness_threads=5, you may have up to 10 DoFn instances in different threads running concurrently on the same machine.
If --number_of_worker_harness_threads is not specified, up to 12 threads per process are used. See also: https://cloud.google.com/dataflow/docs/resources/faq#how_many_instances_of_dofn_should_i_expect_dataflow_to_spin_up_

Slurm: can i create e sub-queue using a subset of resources in a single node?

I have a use case with slurm and I wonder if there is a way to handle it.
Constraints:
I would like to run several jobs (say 60 jobs).
Each one takes a few hours, e.g. 3h/job.
In the cluster managed by slurm, I use a queue with 2 nodes with 4 gpus each (so I can restrict my batch script to one node).
Each job takes 1 gpu.
Problem: if I put everything in the queue, I will block 4 gpus even if I specify only 1 node.
Desired solution: avoid blocking a whole machine by taking, say, 2 gpus only.
How can I put them in the queue without them taking all 4 gpus?
Could I create a kind of sub-file that would be limited to a subset of resources of a node for example?
You can use the Slurm consumable trackable resources plug-in (cons_tres enabled in your slurm.conf file- more info here: https://slurm.schedmd.com/cons_res.html#using_cons_tres) to:
Specify the --gpus-per-task=X
-or-
Bind a specific number of gpus to the task with --gpus=X
-or-
Bind the task to a specific gpu by its ID with --gpu-bind=GPUID

How to submit jobs across multiple partitions at the same time (Slurm)

After I submit a job to node/partition cn430 today, I find that the node is keeping obsessed,
After the previous job finished, my job still didn't get running due to priority. Then I noticed that all of these jobs have the same prefix, namely 4988443, which is ahead of my job id 4988560.
It seems that the user has submitted about 1000 jobs together with the same priority across multiple partitions,
I am wondering how to implement it.
Firstoff, cn430 really looks like a node rather than a partition. The partition to which it belongs seems to be named shared-gp.
What you see is a job array. It is a way to submit a large number of jobs that only differ in a specific parameter. Each job in the array is scheduled independently, so if you do not request a specific node (e.g. with -wor --nodelist), Slurm will broadcast them to the nodes that are available.
Note that the job priorities will decay overtime if faishare is being implemented so the jobs that are currently pending will have their priority decrease because of those currently running.

Spark tasks stuck at RUNNING

I'm trying to run a Spark ML pipeline (load some data from JDBC, run some transformers, train a model) on my Yarn cluster but each time I run it, a couple - sometimes one, sometimes 3 or 4 - of my executors get stuck running their first task set (that'd be 3 tasks for each of their 3 cores), while the rest run normally, checking off 3 at a time.
In the UI, you'd see something like this:
Some things I have observed so far:
When I set up my executors to use 1 core each with spark.executor.cores (i.e. run 1 task at a time), the issue does not occur;
The stuck executors always seem to be them ones that had to get some partitions shuffled to them in order to run the task;
The stuck tasks would ultimately get successfully speculatively executed by another instance;
Occasionally, a single task would get stuck in an executor that is otherwise normal, the other 2 cores would keep working fine, however;
The stuck executor instances look like everything is normal: CPU is at ~100%, plenty of memory to spare, the JVM processes are alive, neither Spark or Yarn log anything out of the ordinary and they can still receive instructions from the driver, such as "drop this task, someone else speculatively executed it already" -- though, for some reason, they don't drop it;
Those executors never get killed off by the driver, so I imagine they keep sending their heartbeats just fine;
Any ideas as to what may be causing this or what I should try?
TLDR: Make sure your code is threadsafe and race condition-free before you blame Spark.
Figured it out. For posterity: was using an thread-unsafe data structure (a mutable HashMap). Since executors on the same machine share a JVM, this was resulting in data races that were locking up the separate threads/tasks.
The upshot: when you have spark.executor.cores > 1 (and you probably should), make sure your code is threadsafe.

Why does web UI show different durations in Jobs and Stages pages?

I am running a dummy spark job that does the exactly same set of operations in every iteration. The following figure shows 30 iterations, where each job corresponds to one iteration. It can be seen the duration is always around 70 ms except for job 0, 4, 16, and 28. The behavior of job 0 is expected as it is when the data is first loaded.
But when I click on job 16 to enter its detailed view, the duration is only 64 ms, which is similar to the other jobs, the screen shot of this duration is as follows:
I am wondering where does Spark spend the (2000 - 64) ms on job 16?
Gotcha! That's exactly the very same question I asked myself few days ago. I'm glad to share the findings with you (hoping that when I'm lucking understanding others chime in and fill the gaps).
The difference between what you can see in Jobs and Stages pages is the time required to schedule the stage for execution.
In Spark, a single job can have one or many stages with one or many tasks. That creates an execution plan.
By default, a Spark application runs in FIFO scheduling mode which is to execute one Spark job at a time regardless of how many cores are in use (you can check it in the web UI's Jobs page).
Quoting Scheduling Within an Application:
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
You should then see how many tasks a single job will execute and divide it by the number of cores the Spark application have assigned (you can check it in the web UI's Executors page).
That will give you the estimate on how many "cycles" you may need to wait before all tasks (and hence the jobs) complete.
NB: That's where dynamic allocation comes to the stage as you may sometimes want more cores later and start with a very few upfront. That's what the conclusion I offered to my client when we noticed a similar behaviour.
I can see that all the jobs in your example have 1 stage with 1 task (which make them very simple and highly unrealistic in production environment). That tells me that your machine could have got busier at different intervals and so the time Spark took to schedule a Spark job was longer but once scheduled the corresponding stage finished as the other stages from other jobs. I'd say it's a beauty of profiling that it may sometimes (often?) get very unpredictable and hard to reason about.
Just to shed more light on the internals of how web UI works. web UI uses a bunch of Spark listeners that collect current status of the running Spark application. There is at least one Spark listener per page in web UI. They intercept different execution times depending on their role.
Read about org.apache.spark.scheduler.SparkListener interface and review different callback to learn about the variety of events they can intercept.

Resources