Slurm: can I create a sub-queue using a subset of resources in a single node? - linux

I have a use case with Slurm and I wonder if there is a way to handle it.
Constraints:
I would like to run several jobs (say 60 jobs).
Each one takes a few hours, e.g. 3h/job.
In the cluster managed by Slurm, I use a queue with 2 nodes with 4 GPUs each (so I can restrict my batch script to one node).
Each job takes 1 GPU.
Problem: if I put everything in the queue, I will block 4 GPUs even if I specify only 1 node.
Desired solution: avoid blocking a whole machine by taking, say, only 2 GPUs.
How can I put the jobs in the queue without them taking all 4 GPUs?
Could I create a kind of sub-queue that would be limited to a subset of a node's resources, for example?

You can use the Slurm consumable trackable resources plugin (cons_tres, enabled in your slurm.conf file; more info here: https://slurm.schedmd.com/cons_res.html#using_cons_tres) to:
Specify the number of GPUs per task with --gpus-per-task=X
-or-
Bind a specific number of GPUs to the task with --gpus=X
-or-
Bind the task to a specific GPU by its ID with --gpu-bind=GPUID
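For example, a minimal batch script along these lines (a sketch only: the partition name gpu-queue, the CPU/memory sizes, and myprog.exe are placeholders, not taken from the question) requests a single GPU, so four such jobs can share one 4-GPU node:
#!/bin/bash
# Placeholder partition name; cons_tres must be enabled for --gpus-per-task.
#SBATCH --partition=gpu-queue
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
# Assumption: size CPUs and memory so that several jobs fit on one node.
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
srun ./myprog.exe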

Related

Dataflow exceeds number_of_worker_harness_threads

I deployed a Dataflow job with the parameter --number_of_worker_harness_threads=5 (streaming mode).
Next I sent 20 Pub/Sub messages, each triggering the loading of a big CSV file from GCS and the start of processing.
In the logs I see that the job took 10 messages and processed them in parallel on 6-8 threads (I checked several times; sometimes it was 6, sometimes 8).
Nevertheless, it was always more than 5.
Any idea how it works? It does not seem to be expected behavior.
Judging from the flag name, you are using the Beam Python SDK.
For Python streaming, the total number of threads running DoFns on one worker VM in the current implementation may be up to the value of --number_of_worker_harness_threads times the number of SDK processes running on the worker, which by default is the number of vCPU cores. There is a way to limit the number of processes to 1 regardless of the number of vCPUs: set --experiments=no_use_multiple_sdk_containers.
For example, if you are using --machine_type=n1-standard-2 and --number_of_worker_harness_threads=5, you may have up to 10 DoFn instances in different threads running concurrently on the same machine.
If --number_of_worker_harness_threads is not specified, up to 12 threads per process are used. See also: https://cloud.google.com/dataflow/docs/resources/faq#how_many_instances_of_dofn_should_i_expect_dataflow_to_spin_up_
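For illustration, a launch command along these lines (a sketch: the pipeline module, project, region, and bucket are placeholders, not from the original post) caps each worker at 5 threads in a single SDK process:
# Sketch only: my_pipeline.py, my-project, us-central1 and gs://my-bucket are placeholders.
python my_pipeline.py \
  --runner=DataflowRunner \
  --streaming \
  --project=my-project \
  --region=us-central1 \
  --temp_location=gs://my-bucket/tmp \
  --machine_type=n1-standard-2 \
  --number_of_worker_harness_threads=5 \
  --experiments=no_use_multiple_sdk_containers
With these options, up to 5 DoFn threads should run concurrently per worker VM, instead of up to 10 (5 per SDK process on an n1-standard-2).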

Is there any python wrapper to submit my single job to High Performance Cluster managed by SGE?

I have already used the multiprocessing package to fill all the CPUs on one node. I would like to use another 10 nodes to complete my job, so I need 10 * 10 task threads for the calculation. Is there some example code? I found this post, "How to use multiple nodes/cores on a cluster with parellelized Python code",
but I am still confused. For instance, the implemented interface:
job_task = perform_job(job_params, nodes, cpus)
Any suggestion is greatly appreciated.

SLURM job taking up entire node when using just one GPU

I am submitting multiple jobs to a SLURM queue. Each job uses 1 GPU. We have 4 GPUs per node. However, once a job is running, it takes up the entire node, leaving 3 GPUs idle. Is there any way to avoid this, so that I can send multiple jobs to one node, using one GPU each?
My script looks like this:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH -p ghp-queue
myprog.exe
I was also unable to run multiple jobs on different GPUs. What helped was adding OverSubscribe=FORCE to the partition configuration in slurm.conf, like this:
PartitionName=compute Nodes=ALL ... OverSubscribe=FORCE
After that, I was able to run four jobs with --gres=gpu:1, and each one took a different GPU (a fifth job is queued, as expected).
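For illustration, assuming the partition name from the question and a batch script like the one above (job.sh here is a placeholder, not something from the original answer), the four jobs could be submitted in a loop:
# Submit four independent one-GPU jobs; with OverSubscribe=FORCE and
# --gres=gpu:1 they can all land on the same 4-GPU node.
for i in 1 2 3 4; do
    sbatch -p ghp-queue --gres=gpu:1 --ntasks=1 job.sh
done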

Apache Spark DAGScheduler Flow of Data

I am trying to understand how exactly the Apache Spark scheduler works. To do so, I've set up a local cluster with one master and two workers. I only submit one application, which simply reads 4 files (2 small, ~10 MB, and 2 big, ~1.1 GB), joins them, and collects the result. In addition, I cache the two small files in memory.
I am running the standalone cluster mode with FIFO. I've understood how the stages are formed, but I cannot figure out how the flow of data is determined (the arrows). When I look at the Spark UI, I notice that each time, even though the stages are formed in the same way, the arrows (flow of data and control, I guess) are different. It's like the scheduler works non-deterministically.
I've read the relevant chapters (about the DAG and Task Scheduler) from Jacek Laskowski's book, but it still isn't clear in my head how the flow of control is determined. Thanks in advance for the help.
Cheers,
Jim
It's like the scheduler works non-deterministically.
Yes, there's some randomness in scheduling tasks to make it more "fair". In that sense Spark scheduler does work "non-deterministically", but within acceptable limits of execution placement (i.e. assigning tasks with lesser location preferences to executors).
The component in Apache Spark that does the work of selecting a task for a task set (that corresponds to a stage) is TaskSetManager:
Schedules the tasks within a single TaskSet in the TaskSchedulerImpl. This class keeps track of each task, retries tasks if they fail (up to a limited number of times), and handles locality-aware scheduling for this TaskSet via delay scheduling. The main interfaces to it are resourceOffer, which asks the TaskSet whether it wants to run a task on one node, and statusUpdate, which tells it that one of its tasks changed state (e.g. finished).
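As a rough illustration (not from the original answer; the master URL, class, and jar names are placeholders), the delay scheduling mentioned above can be tuned through spark.locality.wait when submitting the application. A shorter wait makes the scheduler give up on preferred locations sooner, which is one source of the run-to-run variation in task placement:
# Sketch: spark://master-host:7077, com.example.JoinApp and join-app.jar are placeholders.
spark-submit \
  --master spark://master-host:7077 \
  --conf spark.locality.wait=3s \
  --class com.example.JoinApp \
  join-app.jar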

Requesting integer multiple of "M" cores per node on SGE

I want to submit a multi-threaded MPI job to SGE, and the cluster I am running on has nodes with different numbers of cores. Let's say the number of threads per process is M (M == OMP_NUM_THREADS for OpenMP). How can I request that a job submitted to an SGE queue be run in such a way that on every node, an integer multiple of M cores is allocated to my job?
Let's say M=8, and the number of MPI tasks is 5 (so a total of 40 cores requested). And in this cluster, there are nodes with 4, 8, 12, and 16 cores. Then this combination is OK:
2*(8-core nodes) + 1*(16-core nodes) + 0.5*(16-core nodes)
but of course not any of these ones:
2*(4-core nodes) + 2*(8-core nodes) + 1*(16-core node)
2*(12-core nodes) + 1*(16-core node)
(3/8)*(8-core nodes) + (5/8)*(8-core nodes) + 2*(16-core node)
PS: There was another similar question, like this one: ( MPI & pthreads: nodes with different numbers of cores ), but mine is different since I have to run exactly M threads per MPI process (think hybrid MPI+OpenMP).
The best scenario is to run this job exclusively on the same kind of nodes. But to speed up the start time, I want to allow this job to run on different kinds of nodes, provided that each node has an integer multiple of M cores allocated to the job.
The allocation policy in SGE is specified on a per parallel environment (PE) basis. Each PE can be configured to fill the slots available on the cluster nodes in a specific way. One requests a specific PE with the -pe pe_name num_slots parameter, and then SGE tries to find num_slots slots following the allocation policy of the pe_name PE. Unfortunately, there is no easy way to request slots in integer multiples per node.
In order to be able to request exactly M slots per host (and not a multiple of M), your SGE administrator (or you, in case you are the SGE administrator) must first create a new PE, let's call it mpi8ppn, set its allocation_rule to 8, and then assign the PE to each cluster queue. Then you have to submit the job to that PE with -pe mpi8ppn 40 and instruct the MPI runtime to start only one process per host, e.g. with -npernode 1 for Open MPI.
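As a sketch (the PE name mpi8ppn comes from the answer above; the slot count, the omitted PE fields, and ./hybrid_app are assumptions), the PE definition and job script could look roughly like this:
# PE definition, added with "qconf -Ap mpi8ppn.conf"; only the relevant
# fields are shown here, the remaining attributes keep their defaults.
pe_name            mpi8ppn
slots              999
allocation_rule    8
control_slaves     TRUE
job_is_first_task  FALSE

# Job script: 40 slots = 5 hosts x 8 slots, one MPI rank per host
# (Open MPI's -npernode 1), 8 OpenMP threads per rank.
#$ -pe mpi8ppn 40
#$ -cwd
export OMP_NUM_THREADS=8
mpirun -npernode 1 ./hybrid_app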
If the above is unlikely to happen, your other (unreliable) solution would be to request a very high amount of memory per slot, close to what each node has, e.g. -l h_vmem=23.5G. Assuming that the nodes are configured with h_vmem of 24 GiB, this request will ensure that SGE won't be able to fit more than one slot on each host. So, if you would like to start a hybrid job on 5 nodes, you will simply ask SGE for 5 slots and 23.5G vmem for each slot with:
qsub -pe whatever 5 -l h_vmem=23.5G <other args> jobscript
or
#$ -pe whatever 5
#$ -l h_vmem=23.5G
This method is unreliable since it does not allow you to select cluster nodes that have a specific number of cores and only works if all nodes are configured with h_vmem of less than 47 GB. h_vmem serves just as an example here - any other per-slot consumable attribute should do. The following command should give you an idea of what host complexes are defined and what their values are across the cluster nodes:
qhost -F | egrep '(^[^ ])|(hc:)'
The method works best for clusters where node_mem = k * #cores with k being constant across all nodes. If a node provides twice the number of cores but also has twice the memory, e.g. 48 GiB, then the above request will give you two slots on such nodes.
I don't claim to fully understand SGE, and my knowledge dates back to the SGE 6.2u5 era, so simpler solutions might exist nowadays.
