Reserve memory per task in SLURM

We're using SLURM to manage job scheduling on our computing cluster, and we are experiencing a problem with memory management. Specifically, we can't figure out how to allocate memory for a specific task.
Consider the following setup:
Each node has 32GB memory
We have a SLURM job that sets --mem=24GB
Now, assume we want to run that SLURM job twice, concurrently. What I expect (or want) to happen when I queue it twice by calling sbatch runscript.sh is that one of the two jobs will run on one node and the other will run on another node. However, as it currently stands, SLURM schedules both jobs on the same node.
One of the possible causes we've identified is that SLURM appears to check only whether 24GB of memory is available on the node (i.e., not actively in use), instead of checking whether it has been requested/allocated by other jobs.
The question here is: is it possible to allocate/reserve memory per task in SLURM?
Thanks for your help!

In order to be able to manage memory, Slurm needs the SelectTypeParameters parameter to include memory. So changing that parameter to CR_Core_Memory should be enough for Slurm to start managing the memory.
If that is not set, --mem will not reserve memory; it will only ensure that the node has enough memory configured.
More information is available in the slurm.conf documentation.
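As a rough sketch, the relevant slurm.conf lines might look like the following (node names, CPU counts, and memory sizes here are placeholders, not values from the question):

```shell
# Hypothetical slurm.conf excerpt: use the consumable-resources selection
# plugin and make memory a tracked resource alongside cores.
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Each node must also advertise its real memory (in MB) so Slurm can
# account for it when scheduling.
NodeName=node[01-04] CPUs=16 RealMemory=32000 State=UNKNOWN
```

With memory tracked as a consumable resource, two jobs each requesting --mem=24GB can no longer be packed onto one 32GB node.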

@CarlesFenoy's answer is good, but to answer
The question here is: is it possible to allocate/reserve memory per
task in SLURM?
the parameter you are looking for is --mem-per-cpu, used in combination with --cpus-per-task.
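A minimal sbatch script using that combination might look like this (the job name and program are placeholders; the sizes are chosen so that 4 CPUs × 6G reserves 24G per job):

```shell
#!/bin/bash
#SBATCH --job-name=memtest       # placeholder job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=6G         # 4 CPUs x 6G = 24G reserved per task

srun ./my_program                # placeholder for the actual workload
```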

Related

SLURM Schedule Tasks Without Node Constraints

I have to schedule jobs on a very busy GPU cluster. I don't really care about nodes, more about GPUs. The way my code is structured, each task can only use a single GPU at a time, and the tasks then communicate to use multiple GPUs. The way we generally schedule something like this is by setting gpus_per_task=1, ntasks_per_node=8, nodes=<number of GPUs you want / 8>, since each node has 8 GPUs.
Since not everyone needs 8 GPUs, there are often nodes that have a few (<8) GPUs lying around, which, with my parameters, wouldn't be schedulable. Since I don't care about nodes, is there a way to tell Slurm I want 32 tasks and I don't care how many nodes it uses to do it?
For example if it wants to give me 2 tasks on one machine with 2 GPUs left and the remaining 30 split up between completely free nodes or anything else feasible to make better use of the cluster.
I know there's an ntasks parameter which may do this, but the documentation is kind of confusing about it. It states:
The default is one task per node, but note that the --cpus-per-task option will change this default.
What does cpus_per_task have to do with this?
I also saw
If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node
but I'm also confused about this interaction. Does this mean that if I ask for --ntasks=32 --ntasks-per-node=8 it will put at most 8 tasks on a single machine, but could put fewer if it decides to (basically, this is what I want)?
Try --gpus-per-task 1 and --ntasks 32, with no tasks per node or number of nodes specified. This allows Slurm to distribute the tasks across the nodes however it wants and to use leftover GPUs on nodes that are not fully utilized.
And it won't place more than 8 tasks on a single node, as there are no more than 8 GPUs available per node.
Regarding ntasks vs cpus-per-task: this should not matter in your case. By default, a task gets one CPU. If you use --cpus-per-task x, it is guaranteed that the x CPUs are on one node. This is not the case if you just use --ntasks, where the tasks are spread however Slurm decides. There is an example of this in the documentation.
Caveat: this requires Slurm >= 19.05, as the --gpu* options were added in that release.
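A minimal submission script along these lines might be (the workload script is a placeholder):

```shell
#!/bin/bash
#SBATCH --ntasks=32
#SBATCH --gpus-per-task=1
# No --nodes or --ntasks-per-node: Slurm is free to pack tasks onto
# partially used nodes, and a node with 8 GPUs can hold at most 8 tasks.

srun ./train_step.sh             # placeholder for the actual workload
```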

How can I get memory usage of a process in FreeRTOS

As we all know, we can get the RAM currently used by a process in Linux using commands like ps, top, and vmstat, or by reading the pseudo-filesystem /proc. But how can I get the same information in FreeRTOS, where we cannot use such commands and there is no file system?
First, there is no process context in an RTOS. In FreeRTOS there are tasks (which are analogous to threads in Linux) and the main context, which is lost once the scheduler is started. The stack memory occupied by each task is configured at task creation.
However, once the system is running, you can query how close each task's stack has come to its limit by using the following API:
UBaseType_t uxTaskGetStackHighWaterMark( TaskHandle_t xTask );
Please refer to https://www.freertos.org/uxTaskGetStackHighWaterMark.html
Remember that INCLUDE_uxTaskGetStackHighWaterMark must be set to 1 in FreeRTOSConfig.h to use this feature.
For heap memory, I assume you're using one of the FreeRTOS heap allocation schemes (heap_1, heap_2, etc.). In that case, if you've globally overridden your malloc/free/new/new[]/delete/delete[] to use the FreeRTOS pvPortMalloc, there is a way to register a hook function that gets called when the system runs out of heap.
Refer https://www.freertos.org/a00016.html
At the same time, it is possible to retrieve run-time statistics from the scheduler by using the following API:
void vTaskGetRunTimeStats( char *pcWriteBuffer );
Of course, this will suspend/unsuspend the scheduler frequently, so will not be a real solution for your production code, but is still a good debugging aid.
Refer https://www.freertos.org/rtos-run-time-stats.html.
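Putting these pieces together, a sketch of how the hooks and queries above might be wired up (assuming INCLUDE_uxTaskGetStackHighWaterMark, configUSE_MALLOC_FAILED_HOOK, configGENERATE_RUN_TIME_STATS, and configUSE_STATS_FORMATTING_FUNCTIONS are enabled in FreeRTOSConfig.h; the buffer size is a placeholder):

```c
#include "FreeRTOS.h"
#include "task.h"

/* Called automatically by pvPortMalloc() when the FreeRTOS heap is
 * exhausted (requires configUSE_MALLOC_FAILED_HOOK == 1). */
void vApplicationMallocFailedHook( void )
{
    /* Log, assert, or reset here. */
    taskDISABLE_INTERRUPTS();
    for( ;; );
}

void vReportMemory( TaskHandle_t xTask )
{
    /* Smallest stack headroom the task has ever had, in words: the
     * closer to zero, the closer the task came to overflowing. */
    UBaseType_t uxHighWater = uxTaskGetStackHighWaterMark( xTask );
    ( void ) uxHighWater;

    /* Bytes still free in the FreeRTOS heap (heap_1/2/4/5 schemes). */
    size_t xFreeHeap = xPortGetFreeHeapSize();
    ( void ) xFreeHeap;

    /* Human-readable per-task run-time stats; 512 is a placeholder size. */
    static char pcStats[ 512 ];
    vTaskGetRunTimeStats( pcStats );
}
```

This cannot run outside a FreeRTOS target, so treat it as a sketch of where the calls fit rather than a drop-in module.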

Spark tuning issues

Why has this stage been running with only 1 thread at the end? Because of this, it is taking much longer to finish; I guess it is not achieving parallelism here.
Can anyone explain this?
As you haven't provided any more specific information about what exactly you are trying to do, there can only be a broad answer.
The most common cause of one (or just a few) tasks hanging in a larger pool of tasks is skewed data.
Another option is that the task is simply taking longer to compute its data (CPU heavy).
Or your task is hanging on IO, which might indicate network/IO channel saturation.
The question is pretty generic. The Spark documentation says that it is not easy to find bottlenecks directly or indirectly, even for the smallest of programs (such as WordCount). The bottleneck can be IO, memory, CPU (e.g., while garbage collection is going on), the network, or other factors internal to Spark (such as scheduler delays, buffer memory overflows, etc.).
So, you might need to dig deeper keeping the below in mind:
a. Do you have enough cores freely available to share the load of the stage?
b. How many executors are configured for this job?
c. Is the 200GB of data read/write justified for the job you are doing?
d. How much RAM is free on the server before the job is triggered?
e. Check the YARN resource manager to see the resources around memory and CPU cores (in case you are using YARN).
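As an illustration of the knobs behind points a, b, and e, a hypothetical spark-submit invocation might look like this (all values and the job file are placeholders, not recommendations):

```shell
# Hypothetical invocation: the things to check are the executor count,
# cores per executor, memory per executor, and shuffle partition count.
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8G \
  --conf spark.sql.shuffle.partitions=200 \
  my_job.py                      # placeholder job file
```

Comparing these settings against what the YARN resource manager actually granted is often the quickest way to spot under-allocation.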

How to run binary executables in multi-thread HPC cluster?

I have this tool called cgatools from Complete Genomics (http://cgatools.sourceforge.net/docs/1.8.0/). I need to run some genome analyses on a High-Performance Computing cluster. I tried to run the job allocating more than 50 cores and 250GB of memory, but it only uses one core and limits the memory to less than 2GB. What would be my best option in this case? Is there a way to run binary executables on an HPC cluster making them use all the allocated memory?
The scheduler just runs the binary provided by you on the first node allocated. The onus of splitting the job and running it in parallel is on the binary. Hence, you see that you are using one core out of the fifty allocated.
Parallelising at the code level
You will need to make sure that the binary that you are submitting as a job to the cluster has some mechanism to understand the nodes that are allocated (interaction with the Job Scheduler) and a mechanism to utilize the allocated resources (MPI, PGAS etc.).
If it is parallelized, submitting the binary through a job submission script (through a wrapper like mpirun/mpiexec) should utilize all the allocated resources.
Running black box serial binaries in parallel
If not, the only other possible workload distribution mechanism across the resources is data-parallel mode, wherein you use the cluster to supply multiple inputs to the same binary and run the processes in parallel, effectively reducing the time taken to solve the problem.
You can set the granularity based on the memory required for each run. For example, if each process needs 1GB of memory, you can run 16 processes per node (assuming, say, 16 cores and 16GB of memory per node).
The parallel submission of multiple inputs on a single node can be done with the tool GNU Parallel. You can then submit multiple jobs to the cluster, with each job requesting 1 node (exclusive access, running the parallel tool) and working on a different set of input elements.
If you do not want to launch 'n' separate jobs, you can use mechanisms provided by the scheduler, like blaunch, to specify the machine on which the job is supposed to run dynamically. You can parse the names of the machines allocated by the scheduler and further use a blaunch-like script to emulate the submission of n jobs from the first node.
Note: This class of applications is better off being run on a cloud-like setup instead of typical HPC systems [effective utilization of the cluster at all levels of available parallelism (cluster, thread, and SIMD) is a key part of HPC].
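A sketch of the data-parallel approach with GNU Parallel (the wrapper script and input list are placeholders; the binary itself stays serial):

```shell
# inputs.txt: one input path per line. The wrapper runs cgatools on a
# single input; -j 16 keeps at most 16 processes running at a time,
# sized so 16 x 1GB fits a hypothetical 16-core, 16GB node.
parallel -j 16 ./run_one_input.sh {} < inputs.txt
```

Each cluster job then needs only one exclusively allocated node, and the speedup comes from how many such jobs you submit.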

How to control the number of threads/cores used?

I am running Spark on a local machine, with 8 cores, and I understand that I can use "local[num_threads]" as the master, and use "num_threads" in the bracket to specify the number of threads used by Spark.
However, it seems that Spark often uses more threads than I requested. For example, if I specify only 1 thread for Spark, by using the top command on Linux I can still observe that the CPU usage is often more than 100% and even 200%, implying that more than one thread is actually used by Spark.
This may be a problem if I need to run multiple programs concurrently. How can I control the number of threads/cores used strictly by Spark?
Spark uses one thread for its scheduler, which explains the usage pattern you see. If you run with n worker threads, you'll see n+1 cores used.
For details, see the scheduling doc.
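A sketch of both levels of control (my_job.py is a placeholder; taskset is Linux-specific):

```shell
# Cap Spark's worker threads via the master URL. Even with local[1],
# expect roughly one extra scheduler/driver thread on top of the single
# worker thread.
spark-submit --master "local[1]" my_job.py

# For a hard limit at the OS level, one option is to pin the whole JVM
# to a specific core with taskset, regardless of how many threads it spawns.
taskset -c 0 spark-submit --master "local[1]" my_job.py
```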
