I have to schedule jobs on a very busy GPU cluster. I don't really care about nodes, more about GPUs. The way my code is structured, each job can only use a single GPU at a time and then they communicate to use multiple GPUs. The way we generally schedule something like this is by doing gpus_per_task=1, ntasks_per_node=8, nodes=<number of GPUs you want / 8> since each node has 8 GPUs.
Since not everyone needs 8 GPUs, there are often nodes with a few (<8) GPUs lying around, which with my parameters wouldn't be schedulable. Since I don't care about nodes, is there a way to tell Slurm I want 32 tasks and I don't care how many nodes it uses to run them?
For example, if it wants to give me 2 tasks on one machine with 2 GPUs left and the remaining 30 split up between completely free nodes, or anything else feasible, that would make better use of the cluster.
I know there's an ntasks parameter which may do this but the documentation is kind of confusing about it. It states
The default is one task per node, but note that the --cpus-per-task option will change this default.
What does cpus_per_task have to do with this?
I also saw
If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node
but I'm also confused about this interaction. Does this mean that if I ask for --ntasks=32 --ntasks-per-node=8, it will put at most 8 tasks on a single machine but may put fewer if it decides to (which is basically what I want)?
Try --gpus-per-task=1 and --ntasks=32, with no tasks per node or number of nodes specified. This allows Slurm to distribute the tasks across the nodes however it wants, and to use leftover GPUs on nodes that are not fully utilized.
And it won't place more than 8 tasks on a single node, as there are no more than 8 GPUs available per node.
Regarding ntasks vs cpus-per-task: this should not matter in your case. By default a task gets one CPU. If you use --cpus-per-task=x, it is guaranteed that the x CPUs are on one node. This is not the case if you just use --ntasks, where the tasks are spread however Slurm decides. There is an example of this in the documentation.
Caveat: this requires Slurm >= 19.05, as all the --gpu options were added in that release.
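Put together, a minimal batch script sketch might look like this (the job name, script, and binary are placeholders for your own):

```shell
#!/bin/bash
#SBATCH --job-name=my_gpu_job     # placeholder name
#SBATCH --ntasks=32               # 32 tasks total; node count is left to Slurm
#SBATCH --gpus-per-task=1         # one GPU per task (needs Slurm >= 19.05)
# Deliberately no --nodes or --ntasks-per-node, so Slurm is free to pack
# tasks onto partially used nodes with leftover GPUs.

srun ./my_gpu_program             # placeholder binary; srun launches all 32 tasks
```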
Related
I don't quite understand spark.task.cpus parameter. It seems to me that a “task” corresponds to a “thread” or a "process", if you will, within the executor. Suppose that I set "spark.task.cpus" to 2.
How can a thread utilize two CPUs simultaneously? Couldn't it require locks and cause synchronization problems?
I'm looking at launchTask() function in deploy/executor/Executor.scala, and I don't see any notion of "number of cpus per task" here. So where/how does Spark eventually allocate more than one cpu to a task in the standalone mode?
To the best of my knowledge, spark.task.cpus controls the parallelism of tasks in your cluster in the case where some particular tasks are known to have their own internal (custom) parallelism.
In more detail:
We know that spark.cores.max defines how many threads (aka cores) your application needs. If you leave spark.task.cpus = 1 then you will have #spark.cores.max number of concurrent Spark tasks running at the same time.
You will only want to change spark.task.cpus if you know that your tasks are themselves parallelized (maybe each of your tasks spawns two threads, interacts with external tools, etc.). By setting spark.task.cpus accordingly, you become a good "citizen". Now if you have spark.cores.max=10 and spark.task.cpus=2, Spark will only create 10/2=5 concurrent tasks. Given that your tasks need (say) 2 threads internally, the total number of executing threads will never be more than 10. This means that you never go above your initial contract (defined by spark.cores.max).
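For example, that contract can be expressed at submit time; a sketch, where the master URL and application jar are placeholders:

```shell
# 10 cores in total, 2 per task: at most 10/2 = 5 concurrent Spark tasks
spark-submit \
  --master spark://master:7077 \
  --conf spark.cores.max=10 \
  --conf spark.task.cpus=2 \
  my_app.jar
```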
In my SparkConf, I can set the number of cores to use. I have 4 physical, 8 logical cores on my laptop. What does Spark do if I specify a number that is not possible on the machine, like, say, 100 cores?
The number of cores doesn't describe physical cores but the number of running threads. It means that nothing really strange happens if the number is higher than the number of available cores.
Depending on your setup it can actually be a preferred configuration, with a value around twice the number of available cores being a commonly recommended setting. Obviously, if the number is too high, your application will spend more time switching between threads than doing actual processing.
It heavily depends on your cluster manager. I assume that you're asking about local[n] run mode.
If so, the driver and the one and only executor run in the same JVM with n threads.
DAGScheduler, the Spark execution planner, will use those n threads to run as many tasks concurrently as you've told it to.
If you have more tasks, i.e. threads, than cores, your OS has to time-slice those threads across the available cores and schedule them appropriately.
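In local mode, n is fixed when you set the master; a sketch, where the application name is a placeholder:

```shell
# One JVM acting as both driver and executor, with 4 threads for task execution.
# The OS will multiplex these threads if the machine has fewer than 4 free cores.
spark-submit --master 'local[4]' my_app.py
```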
I have this tool called cgatools from Complete Genomics (http://cgatools.sourceforge.net/docs/1.8.0/). I need to run some genome analyses on a high-performance computing cluster. I tried to run the job allocating more than 50 cores and 250 GB of memory, but it only uses one core and limits the memory to less than 2 GB. What would be my best option in this case? Is there a way to run binary executables on an HPC cluster and make them use all the allocated memory?
The scheduler just runs the binary provided by you on the first node allocated. The onus of splitting the job and running it in parallel is on the binary. Hence, you see that you are using one core out of the fifty allocated.
Parallelising at the code level
You will need to make sure that the binary that you are submitting as a job to the cluster has some mechanism to understand the nodes that are allocated (interaction with the Job Scheduler) and a mechanism to utilize the allocated resources (MPI, PGAS etc.).
If it is parallelized, submitting the binary through a job submission script (through a wrapper like mpirun/mpiexec) should utilize all the allocated resources.
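For illustration, if the tool were MPI-aware (cgatools is not), the launch would look roughly like this; the binary and input names are hypothetical:

```shell
# One MPI rank per allocated core; the scheduler's wrapper (mpirun/mpiexec or
# srun, depending on the site) places ranks on the allocated nodes.
mpirun -np 50 ./mpi_aware_tool input.dat
```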
Running black box serial binaries in parallel
If not, the only other workload-distribution mechanism available is the data-parallel mode, wherein you use the cluster to supply multiple inputs to the same binary and run the processes in parallel, effectively reducing the total time taken to solve the problem.
You can set the granularity based on the memory required for each run. For example, if each process needs 1 GB of memory, you can run 16 processes per node (assuming a node with 16 cores and 16 GB of memory).
The parallel submission of multiple inputs on a single node can be done with GNU Parallel. You can then submit multiple jobs to the cluster, each requesting 1 node (exclusive access), and run the parallel tool on a different subset of the inputs in each job.
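GNU Parallel does this directly; the same idea can also be sketched portably with xargs -P. Here the wrapper script and input list are hypothetical names:

```shell
# Run at most 16 concurrent copies of the serial binary, one input per process.
# run_cgatools.sh (a placeholder) wraps a single cgatools invocation;
# inputs.txt (a placeholder) lists one input per line.
xargs -P 16 -n 1 ./run_cgatools.sh < inputs.txt
```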
If you do not want to launch n separate jobs, you can use mechanisms provided by the scheduler, such as blaunch, to dynamically specify the machine on which each process should run. You can parse the names of the machines allocated by the scheduler and use a blaunch-style script to emulate the submission of n jobs from the first node.
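Under LSF (where blaunch lives), the allocated host names are exposed in the LSB_HOSTS environment variable, so the emulation could be sketched like this; the wrapper script is hypothetical:

```shell
# Run from the first allocated node: start one copy of the tool per allocated
# slot (LSB_HOSTS repeats a host name once per slot on that host).
for host in $LSB_HOSTS; do
  blaunch "$host" ./run_cgatools.sh "$host" &   # placeholder wrapper script
done
wait   # block until all remote copies finish
```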
Note: this class of applications is better off being run on a cloud-like setup instead of typical HPC systems [effective utilization of the cluster at all the levels of available parallelism (cluster, thread and SIMD) is a key part of HPC.]
I am running Spark on a local machine, with 8 cores, and I understand that I can use "local[num_threads]" as the master, and use "num_threads" in the bracket to specify the number of threads used by Spark.
However, it seems that Spark often uses more threads than I requested. For example, if I specify only 1 thread for Spark, the top command on Linux still shows CPU usage that is often more than 100% and even 200%, implying that more than 1 thread is actually used by Spark.
This may be a problem if I need to run multiple programs concurrently. How can I control the number of threads/cores used strictly by Spark?
Spark uses one extra thread for its scheduler, which explains the usage pattern you see. If you launch n task threads in parallel, you'll see n+1 cores in use.
For details, see the scheduling doc.
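If you need a hard cap rather than a hint, one option is to pin the whole JVM to specific cores at the OS level with taskset (Linux); the application name here is a placeholder:

```shell
# Confine Spark, scheduler thread included, to cores 0-1 regardless of how
# many threads the JVM actually spawns.
taskset -c 0-1 spark-submit --master 'local[1]' my_app.py
```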
I'm running SGE (6.2u5p2) on our Beowulf cluster. I've got a couple of users who submit tens of thousands of short (<15 minute) jobs at low priority (i.e. they've set their jobs to low priority so anyone can jump ahead of them). This works really well for other users running single-core jobs, but anyone wishing to run something with multiple threads isn't able to: the single-core jobs keep skipping ahead, never allowing (say) 6 cores to become available.
I don't really want to separate the users into two queues (i.e. single-core and multicore), since the multicore users only run jobs briefly and then multiple cores are left unused.
Is there a way in SGE to allow multi core jobs to reserve slots?
Many thanks,
Rudiga
As "High Performance Mark" alludes to, using the -R option may help. See:
http://www.ace-net.ca/wiki/Scheduling_Policies_and_Mechanics#Reservation
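Concretely, the multicore users would submit with reservation turned on, e.g. as follows; the "smp" parallel environment and the script name are site-specific placeholders:

```shell
# -R y asks SGE to reserve the 6 slots as they free up, instead of letting
# short backfill jobs keep grabbing them.
qsub -pe smp 6 -R y myjob.sh
```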