How to run binary executables on a multi-threaded HPC cluster?

I have this tool called cgatools from Complete Genomics (http://cgatools.sourceforge.net/docs/1.8.0/). I need to run some genome analyses on a High-Performance Computing cluster. I tried to run the job allocating more than 50 cores and 250 GB of memory, but it only uses one core and stays under 2 GB of memory. What would be my best option in this case? Is there a way to run binary executables on an HPC cluster so that they use all the allocated resources?

The scheduler just runs the binary you provide on the first node of the allocation. The onus of splitting the work and running it in parallel is on the binary itself. Hence you see that you are using one core out of the fifty allocated.
Parallelising at the code level
You will need to make sure that the binary you submit as a job to the cluster has some mechanism to discover the nodes that were allocated (interaction with the job scheduler) and a mechanism to utilize the allocated resources (MPI, PGAS, etc.).
If it is parallelized in this way, submitting the binary through a job submission script (via a wrapper like mpirun/mpiexec) should utilize all the allocated resources.
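As a rough sketch, assuming a Slurm scheduler, 16-core nodes, and a hypothetical MPI-enabled binary called my_mpi_binary, the submission script could look like this:

    #!/bin/bash
    #SBATCH --nodes=4              # four whole nodes
    #SBATCH --ntasks-per-node=16   # one MPI rank per core (16 cores/node assumed)
    #SBATCH --mem=60G              # memory per node

    # mpirun reads the allocation from the scheduler and starts one rank
    # per task across all the allocated nodes
    mpirun ./my_mpi_binary input.dat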
Running black box serial binaries in parallel
If not, the only remaining way to distribute the workload across the resources is data parallelism: use the cluster to feed multiple inputs to the same binary and run the processes in parallel, to effectively reduce the overall time to solution.
You can set the granularity based on the memory required for each run. For example, if each process needs 1 GB of memory, you can run 16 processes per node (assuming a node with 16 cores and 16 GB of memory).
The parallel execution of multiple inputs on a single node can be done with the tool GNU Parallel. You can then submit multiple jobs to the cluster, each requesting one node (with exclusive access) and working on a different subset of the inputs.
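A minimal sketch of that pattern with GNU Parallel inside a single-node Slurm job (the binary name and input paths are placeholders):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --exclusive          # exclusive access to the node
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=16

    # one serial process per core, one input file per process;
    # -j 16 caps the number of concurrent processes at 16
    parallel -j 16 ./my_serial_binary {} ::: inputs/*.txt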
If you do not want to launch n separate jobs, you can use mechanisms provided by the scheduler, such as blaunch (LSF), to specify at run time the machine a process should run on. You can parse the names of the machines allocated by the scheduler and use a blaunch-like script to emulate the submission of n jobs from the first node.
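A hedged sketch of that idea on an LSF-style scheduler (blaunch and the LSB_HOSTS variable are LSF-specific, and the binary name is a placeholder):

    #!/bin/bash
    # LSB_HOSTS lists one entry per allocated slot, so this starts
    # one serial process per slot, each on its own input file
    i=0
    for host in $LSB_HOSTS; do
        blaunch "$host" ./my_serial_binary "input_${i}.txt" &
        i=$((i + 1))
    done
    wait   # block until every remotely launched process finishes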
Note: this class of applications is often better off on a cloud-like setup than on a typical HPC system, since effective utilization of the cluster at all levels of available parallelism (cluster, thread and SIMD) is a key part of HPC.

Related

SLURM nodes, tasks, cores, and cpus

Would someone be able to clarify what each of these things actually are? From what I gathered, nodes are computing points within the cluster, essentially a single computer. Tasks are processes that can be executed either on a single node or on multiple nodes. And cores are basically how much of a CPU on a single node you want to be allocated to executing the task assigned to that CPU. Is this correct? Am I confusing something?
The terms can have different meanings in different contexts, but if we stick to a Slurm context:
A (compute) node is a computer that is part of a larger set of nodes (a cluster). Besides compute nodes, a cluster comprises one or more login nodes, file server nodes, management nodes, etc. A compute node offers resources such as processors, volatile memory (RAM), permanent disk space (e.g. SSD), accelerators (e.g. GPU), etc.
A core is the part of a processor that does the computations. A processor comprises multiple cores, as well as a memory controller, a bus controller, and possibly many other components. A processor in the Slurm context is referred to as a socket, which actually is the name of the slot on the motherboard that hosts the processor. A single core can have one or two hardware threads. This is a technology that allows virtually doubling the number of cores the operating system perceives while only doubling part of the core components -- typically the components related to memory and I/O and not the computation components. Hardware multi-threading is very often disabled in HPC.
A CPU in a general context refers to a processor, but in the Slurm context, a CPU is a consumable resource offered by a node. It can refer to a socket, a core, or a hardware thread, depending on the Slurm configuration.
The role of Slurm is to match those resources to jobs. A job comprises one or more (sequential) steps, and each step has one or more (parallel) tasks. A task is an instance of a running program, i.e. a process, possibly along with its subprocesses or software threads.
Multiple tasks are dispatched over possibly multiple nodes, depending on how many cores each task needs. The number of cores a task needs depends on the number of subprocesses or software threads in the instance of the running program. The idea is to map each software thread to one core, and to make sure that all the cores assigned to a task are on the same node.
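To make the mapping concrete, here is a hypothetical job script for a hybrid MPI/OpenMP program (all numbers and the program name are illustrative only):

    #!/bin/bash
    #SBATCH --nodes=2            # two compute nodes
    #SBATCH --ntasks=8           # eight tasks (processes) in total
    #SBATCH --cpus-per-task=4    # each task runs four software threads and
                                 # gets four CPUs on one and the same node

    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
    srun ./hybrid_program        # srun starts the eight tasks across the nodes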

SLURM Schedule Tasks Without Node Constraints

I have to schedule jobs on a very busy GPU cluster. I don't really care about nodes, more about GPUs. The way my code is structured, each job can only use a single GPU at a time and then they communicate to use multiple GPUs. The way we generally schedule something like this is by doing gpus_per_task=1, ntasks_per_node=8, nodes=<number of GPUs you want / 8> since each node has 8 GPUs.
Since not everyone needs 8 GPUs, there are often nodes that have a few (<8) GPUs lying around, which, with my parameters, wouldn't be schedulable. Since I don't care about nodes, is there a way to tell Slurm I want 32 tasks and I don't care how many nodes it uses to do it?
For example, it could give me 2 tasks on one machine with 2 GPUs left over and split the remaining 30 between completely free nodes, or anything else feasible to make better use of the cluster.
I know there's an ntasks parameter which may do this, but the documentation is kind of confusing about it. It states:
The default is one task per node, but note that the --cpus-per-task option will change this default.
What does cpus_per_task have to do with this?
I also saw
If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node
but I'm also confused about this interaction. Does this mean that if I ask for --ntasks=32 --ntasks-per-node=8, it will put at most 8 tasks on a single machine, but could put fewer if it decides to (basically, this is what I want)?
Try --gpus-per-task 1 and --ntasks 32, with no tasks per node or number of nodes specified. This allows Slurm to distribute the tasks across the nodes however it wants and to use leftover GPUs on nodes that are not fully utilized.
And it won't place more than 8 tasks on a single node, as there are no more than 8 GPUs available per node.
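In script form, the suggestion would look roughly like this (everything except the two options is a placeholder):

    #!/bin/bash
    #SBATCH --ntasks=32          # 32 tasks in total, no node count given
    #SBATCH --gpus-per-task=1    # one GPU per task (needs Slurm >= 19.05)

    # Slurm is free to pack the tasks onto however many nodes it needs,
    # picking up leftover GPUs on partially used nodes
    srun ./my_gpu_program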
Regarding ntasks vs. cpus-per-task: this should not matter in your case. By default a task gets one CPU. If you use --cpus-per-task=x, it is guaranteed that the x CPUs are on one node. This is not the case if you just use --ntasks, where the tasks are spread however Slurm decides. There is an example of this in the documentation.
Caveat: this requires a version of Slurm >= 19.05, as all the --gpu options were added in that release.

Reserve memory per task in SLURM

We're using SLURM to manage job scheduling on our computing cluster, and we are experiencing a problem with memory management. Specifically, we can't find out how to allocate memory for a specific task.
Consider the following setup:
Each node has 32GB memory
We have a SLURM job that sets --mem=24GB
Now, assume we want to run that SLURM job twice, concurrently. What I expect (or want) to happen when I queue it twice by calling sbatch runscript.sh twice is that one of the two jobs will run on one node, and the other will run on another node. However, as it currently stands, SLURM schedules both jobs on the same node.
One of the possible causes we've identified is that SLURM appears to check only whether the 24 GB of memory is currently available (i.e., not actively used by other jobs), instead of checking whether it has been requested/allocated.
The question here is: is it possible to allocate/reserve memory per task in SLURM?
Thanks for your help!
In order to be able to manage memory, Slurm needs the SelectTypeParameters option to include MEMORY. So just changing that parameter to CR_Core_Memory should be enough for Slurm to start managing memory.
If that is not set, --mem will not reserve memory and will only ensure that the node has enough memory configured.
More information is available in the slurm.conf documentation.
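A sketch of the relevant slurm.conf lines (the node definition is an example; RealMemory must match your actual hardware):

    # slurm.conf (excerpt)
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory   # treat cores AND memory as consumable
    NodeName=node[01-10] CPUs=16 RealMemory=32000 State=UNKNOWN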
@CarlesFenoy's answer is good, but to answer
"The question here is: is it possible to allocate/reserve memory per task in SLURM?"
the parameter you are looking for is --mem-per-cpu, used in combination with --cpus-per-task.
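For example, a hypothetical job asking for 4 CPUs with 6 GB each, i.e. 24 GB reserved in total:

    #!/bin/bash
    #SBATCH --cpus-per-task=4
    #SBATCH --mem-per-cpu=6G     # 4 x 6 GB = 24 GB reserved for this job

    srun ./my_program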

Number of CPUs per Task in Spark

I don't quite understand the spark.task.cpus parameter. It seems to me that a "task" corresponds to a "thread" or a "process", if you will, within the executor. Suppose that I set spark.task.cpus to 2.
How can a thread utilize two CPUs simultaneously? Couldn't it require locks and cause synchronization problems?
I'm looking at the launchTask() function in deploy/executor/Executor.scala, and I don't see any notion of "number of CPUs per task" there. So where/how does Spark eventually allocate more than one CPU to a task in standalone mode?
To the best of my knowledge, spark.task.cpus controls the parallelism of tasks in your cluster in the case where some particular tasks are known to have their own internal (custom) parallelism.
In more detail:
We know that spark.cores.max defines how many threads (aka cores) your application needs. If you leave spark.task.cpus = 1, then you will have spark.cores.max concurrent Spark tasks running at the same time.
You will only want to change spark.task.cpus if you know that your tasks are themselves parallelized (maybe each of your tasks spawns two threads, interacts with external tools, etc.). By setting spark.task.cpus accordingly, you become a good "citizen". Now if you have spark.cores.max=10 and spark.task.cpus=2, Spark will only create 10/2=5 concurrent tasks. Given that your tasks need (say) 2 threads internally, the total number of executing threads will never exceed 10. This means that you never go above your initial contract (defined by spark.cores.max).
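As an illustration (the master URL and application file are placeholders):

    # 10 cores in total, 2 cores reserved per task -> at most 5 concurrent tasks
    spark-submit \
      --master spark://master-host:7077 \
      --conf spark.cores.max=10 \
      --conf spark.task.cpus=2 \
      my_app.py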

How to control the number of threads/cores used?

I am running Spark on a local machine with 8 cores, and I understand that I can use "local[num_threads]" as the master and use "num_threads" in the brackets to specify the number of threads used by Spark.
However, it seems that Spark often uses more threads than I requested. For example, if I specify only 1 thread for Spark, using the top command on Linux I can still observe that the CPU usage is often more than 100% and even 200%, implying that more than one thread is actually used by Spark.
This may be a problem if I need to run multiple programs concurrently. How can I strictly control the number of threads/cores used by Spark?
Spark uses one thread for its scheduler, which explains the usage pattern you see. If you launch n threads in parallel, you'll get n+1 cores used.
For details, see the scheduling doc.
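If you need a hard cap rather than a request, one option on Linux is to pin the whole JVM to specific cores with taskset (a sketch; the application file is a placeholder):

    # "local[1]" asks Spark for one worker thread, but the JVM still runs
    # extra threads (scheduler, GC, ...). taskset -c 0 confines all of them
    # to core 0, strictly limiting the process to one core.
    taskset -c 0 spark-submit --master "local[1]" my_app.py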
