Monitor memory usage of each node in a Slurm job

My Slurm job uses several nodes, and I want to know the maximum memory usage of each node for a running job. What can I do?
Right now, I can ssh into each node and do free -h -s 30 > memory_usage, but I think there must be a better way to do this.

The Slurm accounting will give you the maximum memory usage over time, across all tasks, directly. If that information is not sufficient, you can set up profiling following this documentation, and Slurm will give you the full memory usage of each process as a time series for the duration of the job. You can then aggregate per node, find the maximum, etc.
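For example, you can query the peak resident set size per step with sstat while the job is running, or with sacct afterwards (field names may vary slightly across Slurm versions; <jobid> is a placeholder):
# per-step peak RSS and the node/task where it occurred, while the job is running
sstat --allsteps -j <jobid> --format=JobID,MaxRSS,MaxRSSNode,MaxRSSTask
# the same information from the accounting database once the steps have completed
sacct -j <jobid> --format=JobID,MaxRSS,MaxRSSNode,NodeList,Elapsed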

Related

How to find the max number of tasks that I can run using Slurm?

I have access to a supercomputer that uses Slurm, but there is one piece of information I cannot find: how many parallel tasks can I run? I know I can use --ntasks to set the number, e.g. if I have a parallel problem and I want to check it running 1000 processes, I can run it with --ntasks 1000. But what sets the maximum? The number of nodes, the number of CPUs, or something else?
There is a physical limitation which is the total number of cores available in the cluster. You can check that with sinfo -o%C; the last number in the output will be the total number of CPUs.
There can also be limits defined in the "Quality of Services". You can see them with sacctmgr show qos. Look for the MaxTRES column.
But there can also be administrative limits specific to your user or your account. You can see them with sacctmgr show user $USER withassoc. Look for the MaxCPUMins column.
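Putting those three checks together (the exact column names can vary a bit between Slurm versions):
# total CPUs in the cluster: the last figure of the Allocated/Idle/Other/Total column
sinfo -o "%C"
# QOS-level limits; look at MaxTRES
sacctmgr show qos format=Name,MaxTRES
# user/association-level limits; look at MaxCPUMins
sacctmgr show user $USER withassoc format=User,Account,MaxCPUMins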

How to get the aggregated CPU usage in Grafana when hyperthreading is enabled

We are running Grafana/Prometheus to monitor our CPU metrics and display the aggregated CPU usage of all CPUs. The problem is that hyperthreading is enabled, so when we stress the CPU the percentage exceeds 100%. My question is how to cap the displayed CPU usage at 100%, even when the CPU is highly utilized.
P.S. I have tried setting the max and min limits in Grafana, but the graph spikes still go above that limit.
Kindly give me the right query for this problem.
The queries I have tried are given below.
sum(irate(node_cpu_seconds_total{instance="localhost",job="node", mode!="idle"}[5m]))*100
100 - avg(irate(node_cpu_seconds_total{instance="localhost",job="node", mode!="idle"}[5m]))*100
and other similar queries we have tried.
If all you want is to cap a variable or expression result at a maximum value (here, 100), you can simply use the Prometheus function clamp_max.
Thus, you could do:
clamp_max(<expr>, 100)
This is probably the most helpful query.
(1 - avg(irate(node_cpu_seconds_total{instance="$instance",job="$job",mode!="idle"}[5m])))*100
Replace $instance with your instance IP and $job with your node exporter job name.
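Combining the two suggestions, you can wrap the averaged expression in clamp_max so that hyperthreading-induced spikes never render above 100% (same placeholders as above):
clamp_max((1 - avg(irate(node_cpu_seconds_total{instance="$instance",job="$job",mode!="idle"}[5m]))) * 100, 100)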

Cores assigned to SLURM job

Let's say I want to submit a Slurm job by assigning only the total number of tasks (--ntasks=someNumber), without specifying the number of nodes or the tasks per node. Is there a way to know, within the launched Slurm script, how many cores Slurm assigned on each of the reserved nodes? I need this info to properly create a machine file for the program I'm launching, which must be structured like this:
node02:7
node06:14
node09:3
Once the job is launched, the only way I figured out to see what cores have been allocated on the nodes is using the command:
scontrol show jobid -dd
Its output contains the above-mentioned info (together with plenty of other details).
Is there a better way to get this info?
The way the srun documentation illustrates creating a machine file is by running srun hostname. To get the output you want, you could run
srun hostname -s | sort | uniq -c | awk '{print $2":"$1}' > $MACHINEFILE
You should check the documentation of your program to see whether it accepts a machine file with repeated hostnames rather than a count suffix. If so, you can simplify the command to
srun hostname -s > $MACHINEFILE
And of course, the first step is to make sure you actually need a machine file at all: many parallel programs/libraries have Slurm support and can gather the needed information from the environment variables set by Slurm at job start.
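As a minimal sketch, the first variant could sit inside an sbatch script like this (the --ntasks value, myprog, and its -machinefile flag are only placeholders for illustration):
#!/bin/bash
#SBATCH --ntasks=24

# build a "host:count" machine file from the current allocation
MACHINEFILE="machinefile.$SLURM_JOB_ID"
srun hostname -s | sort | uniq -c | awk '{print $2":"$1}' > "$MACHINEFILE"

# hand the file to the program being launched
myprog -machinefile "$MACHINEFILE"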

Getting the load for each job

Where can I find the load (used/claimed CPUs) per job? I know how to get it per host using sinfo, but that does not directly tell me which job causes a possibly 'incorrect' load of anything unequal to 1.
(I want to get this for all jobs, i.e. logging in to the node and running top is not my objective.)
You can use
sacct --format='jobid,ReqCPUS,elapsed,AveCPU'
and compare Elapsed with AveCPU. The latter will only be available for job steps, not for the whole job.
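For example, restricted to one job (<jobid> is a placeholder; TotalCPU, the summed user+system CPU time, is a useful extra column, since a fully busy step has TotalCPU close to ReqCPUS multiplied by Elapsed):
sacct -j <jobid> --format='jobid,ReqCPUS,elapsed,AveCPU,TotalCPU'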

Why does my Spark job fail with "too many open files"?

I get "too many open files" during the shuffle phase of my Spark job. Why is my job opening so many files? What steps can I take to try to make my job succeed.
This has been answered on the Spark user list:
The best way is definitely just to increase the ulimit if possible,
this is sort of an assumption we make in Spark that clusters will be
able to move it around.
You might be able to hack around this by decreasing the number of
reducers [or cores used by each node] but this could have some performance implications for your
job.
In general if a node in your cluster has C assigned cores and you run
a job with X reducers then Spark will open C*X files in parallel and
start writing. Shuffle consolidation will help decrease the total
number of files created but the number of file handles open at any
time doesn't change so it won't help the ulimit problem.
-Patrick Wendell
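To act on the "decrease the number of reducers or cores" suggestion, those knobs are typically set at submit time; a hedged sketch (the values and my_job.py are purely illustrative):
spark-submit \
  --executor-cores 2 \
  --conf spark.default.parallelism=64 \
  my_job.py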
The default ulimit is 1024, which is ridiculously low for large-scale applications. HBase recommends up to 64K; modern Linux systems don't seem to have trouble with this many open files.
Use
ulimit -n
to see your current maximum number of open files (ulimit -a lists all limits). Running
ulimit -n <new limit>
temporarily changes the limit for the current shell; to make it permanent you need to update the system configuration files and per-user limits. On CentOS and Red Hat systems, those can be found in
/etc/sysctl.conf
/etc/security/limits.conf
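For instance, a permanent configuration could look like this (the user name and the numbers are only illustrative):
# /etc/security/limits.conf: per-user cap on open file descriptors
sparkuser  soft  nofile  65536
sparkuser  hard  nofile  65536
# /etc/sysctl.conf: system-wide cap
fs.file-max = 2097152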
Another solution for this error is to reduce the number of partitions.
Check whether you have a lot of partitions with:
someBigSDF.rdd.getNumPartitions()
Out[]: 200
# count() is pyspark.sql.functions.count
from pyspark.sql.functions import count

# if you need to persist the repartition, do it like this
someBigSDF = someBigSDF.repartition(20)
# if you just need it for one transformation/action,
# you can do the repartition inline like this
someBigSDF.repartition(20).groupBy("SomeDt").agg(count("SomeQty")).orderBy("SomeDt").show()
