List of resources available for a given configuration on Slurm

Is there a command or a way to list the resources available for a given configuration on Slurm?
For example, if I need the configuration --time=4-0 --mem=128G --nodes=1, how can I get the list of nodes in the cluster that could satisfy it?
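As a rough sketch, two standard Slurm commands get close to this: sinfo lists per-node resources, and sbatch --test-only dry-runs a job with the requested resources without submitting it (the --wrap "hostname" payload below is just a placeholder):

    # Per-node CPU count, memory (MB), partition time limit, and state
    sinfo -N -o "%N %c %m %l %t"

    # Validate the requested configuration without submitting it; Slurm prints
    # an estimate of when and where such a job could be scheduled
    sbatch --test-only --time=4-0 --mem=128G --nodes=1 --wrap "hostname"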

Related

Per-node default partition in SLURM

I'm configuring a small cluster controlled by SLURM.
The cluster has one master node and two partitions.
Users submit their jobs from the worker nodes; I've restricted their access to the master node.
Each partition in the cluster is dedicated to one team in our company.
I'd like members of different teams to submit their jobs to different partitions without bothering with additional command-line switches.
That is, I'd like the default partition for srun or sbatch to depend on the node from which these commands are run.
For example: all jobs submitted from the host worker1 should go to partition1,
and all jobs submitted from the hosts worker[2-4] should go to partition2.
No invocation of sbatch or srun should need the -p (or --partition) switch.
I've tried setting Default=YES on different partition lines in the slurm.conf files on different computers, but this did not help.
This can be solved by putting the SLURM_PARTITION and SBATCH_PARTITION environment variables in the /etc/environment file.
Details on these environment variables are in the manual pages for sbatch and srun.
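As a minimal sketch, using the partition names from the question, each group of submit hosts would get its own /etc/environment entries (srun reads SLURM_PARTITION, sbatch reads SBATCH_PARTITION):

    # /etc/environment on worker1
    SLURM_PARTITION=partition1
    SBATCH_PARTITION=partition1

    # /etc/environment on worker[2-4]
    SLURM_PARTITION=partition2
    SBATCH_PARTITION=partition2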

Is it possible to assign hosts for Spark tasks with regex

Running Spark in YARN mode, the cluster has nodes named like nodeA-xx and nodeB-xx. Is there any configuration to launch tasks only on hosts named nodeA-*?
If you're using the Capacity Scheduler, you need to enable the feature called Node Labels. YARN node labels won't let you specify a regex; instead you have to label the nodes, specify that label as a resource for a queue, and finally run the job against that specific queue.
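A rough sketch of that flow, assuming a label named nodeA, a queue named nodeA_queue that capacity-scheduler.xml allows to use that label, and made-up host and class names:

    # Create the label and attach it to the nodeA machines
    yarn rmadmin -addToClusterNodeLabels "nodeA"
    yarn rmadmin -replaceLabelsOnNode "nodeA-01=nodeA nodeA-02=nodeA"

    # Submit against the queue that is allowed to use the label, pinning the
    # executors to labelled nodes
    spark-submit --master yarn --queue nodeA_queue \
      --conf spark.yarn.executor.nodeLabelExpression=nodeA \
      --class com.example.MyApp my-app.jar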

How does spark choose nodes to run executors? (spark on yarn)

We use Spark in YARN mode, on a cluster of 120 nodes.
Yesterday one Spark job created 200 executors: 11 executors on node1,
10 executors on node2, and the remaining executors distributed evenly across the other nodes.
With so many executors on node1 and node2, the job ran slowly.
How does Spark select the nodes on which to run executors?
Does it follow the YARN ResourceManager?
Since you mention Spark on YARN:
YARN chooses the executor nodes for a Spark job based on the availability of cluster resources. Look into YARN's queue system and dynamic allocation; the best documentation is https://blog.cloudera.com/blog/2016/01/untangling-apache-hadoop-yarn-part-3/
The cluster manager allocates resources across all the running applications.
I think the issue is a poorly optimized configuration. You need to configure Spark for dynamic allocation; Spark will then analyze cluster resources and make adjustments to optimize the work.
You can find all the information about Spark resource allocation and how to configure it here: http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
Do all 120 nodes have identical capacity?
Moreover, jobs are placed on a suitable NodeManager based on that NodeManager's health and resource availability.
To optimise a Spark job you can use dynamic resource allocation, where you do not need to define the number of executors required to run the job. By default it starts the application with the configured minimum CPU and memory, and later acquires resources from the cluster for executing tasks. It releases the resources back to the cluster manager once the job has completed, and also if the job stays idle up to the configured idle timeout; it reclaims resources from the cluster once the job becomes active again.
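A minimal sketch of enabling dynamic allocation on YARN; the min/max/timeout values and class name are illustrative rather than recommendations, and the external shuffle service must be running on the NodeManagers:

    spark-submit --master yarn \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=2 \
      --conf spark.dynamicAllocation.maxExecutors=50 \
      --conf spark.dynamicAllocation.executorIdleTimeout=60s \
      --class com.example.MyApp my-app.jar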

AWS EMR - YARN Container

I was running an application on AWS EMR-Spark. Here is the spark-submit job:
Arguments : spark-submit --deploy-mode cluster --class com.amazon.JavaSparkPi s3://spark-config-test/SWALiveOrderModelSpark-1.0.assembly.jar s3://spark-config-test/2017-08-08
AWS uses YARN for resource management. I was looking at the metrics (screenshot below) and have a question about the YARN 'container' metric.
Here, the number of containers allocated is shown as 2. However, I was using 4 nodes (3 slaves + 1 master), all with 8 CPU cores. So how are only 2 containers allocated?
A couple of things you need to do. First of all, you need to set the following configuration in capacity-scheduler.xml:
"yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
otherwise YARN will not use all the cores you specify. Secondly, you need to actually specify the number of executors you need, the number of cores per executor, and the amount of memory you want allocated to the executors (and possibly to the driver as well, if you either have many shuffle partitions or collect data to the driver).
YARN is designed to manage clusters running many different jobs at a time, so by default it will not assign all resources to a single job unless you force it to with the setting mentioned above. Furthermore, the default settings for Spark are not sufficient for most jobs, so you need to set them explicitly. Please have a read through this blog post to get a better understanding of how to tune Spark settings for optimal performance.
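As a sketch of the second point, an explicit submission for the question's job might look like the following; the executor count and sizes are illustrative values for 3 worker nodes with 8 cores each, not a recommendation:

    spark-submit --deploy-mode cluster \
      --num-executors 6 \
      --executor-cores 4 \
      --executor-memory 8G \
      --driver-memory 4G \
      --class com.amazon.JavaSparkPi \
      s3://spark-config-test/SWALiveOrderModelSpark-1.0.assembly.jar \
      s3://spark-config-test/2017-08-08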

Resource allocation with Apache Spark and Mesos

So I've deployed a cluster with Apache Mesos and Apache Spark, and I have several jobs that I need to execute on the cluster. I would like to be sure that a job has enough resources to be executed successfully; if that is not the case, it must return an error.
Spark provides several settings, like spark.cores.max and spark.executor.memory, to limit the resources used by a job, but there is no setting for a lower bound (e.g. setting the minimum number of cores to 8 for the job).
I'm looking for a way to be sure that a job has enough resources before it is executed (during resource allocation, for instance). Do you know if it is possible to get this information with Apache Spark on Mesos?
