Is there any reason Slurm will not run jobs on more than a certain number of nodes? - slurm

I want to run about 400 jobs at a time on GCP Slurm from a job array of about 2,000 tasks.
The #SBATCH settings in my batch script and the relevant slurm.conf settings are as follows.
run.sh
#SBATCH -o ./out/vs.%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -W
slurm.conf
MaxArraySize=50000
MaxJobCount=50000
#COMPUTE NODE
NodeName=DEFAULT CPUs=16 RealMemory=63216 State=UNKNOWN
NodeName=node-0-[0-599] State=CLOUD
Currently, 100 nodes are being used for work other than this task.
When I submit this job array, only about 130-150 tasks (one per node) run in total and the rest never start.
Are there any additional parameters that need to be set?
-- additional error log
[2022-06-20T01:18:41.294] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2022-06-20T01:18:41.294] error: slurm_set_addr: Unable to resolve "node-333"
[2022-06-20T01:18:41.294] error: fwd_tree_thread: can't find address for host node-333, check slurm.conf

I found a workaround for the additional error:
https://groups.google.com/g/slurm-users/c/y-QZKDbYfIk
Following that article, I edited slurmctld.service / slurmd.service / slurmdbd.service, changing
network.target -> network-online.target
However, the cap on the number of nodes actually running jobs remains.
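For reference, the change from that thread amounts to making the Slurm daemons start only once name resolution is actually available. A minimal sketch of applying it as a systemd drop-in (paths and file names are illustrative; editing the unit files directly, as in the thread, achieves the same thing):
sudo mkdir -p /etc/systemd/system/slurmd.service.d
sudo tee /etc/systemd/system/slurmd.service.d/override.conf <<'EOF'
[Unit]
# wait for the network to be fully up (DNS usable), not merely configured,
# so compute-node hostnames can be resolved when the daemon starts
After=network-online.target
Wants=network-online.target
EOF
sudo systemctl daemon-reload && sudo systemctl restart slurmd
# repeat for slurmctld.service and slurmdbd.service on the hosts that run them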

Related

Running parallel jobs in slurm

I was wondering if I could ask something about running Slurm jobs in parallel. (Please note that I am new to Slurm and Linux and have only started using them 2 days ago...)
As per the instructions in the linked guide (source: https://hpc.nmsu.edu/discovery/slurm/serial-parallel-jobs/),
I have designed the following bash script:
#!/bin/bash
#SBATCH --job-name fmriGLM # to give each job a different name
#SBATCH --nodes=1
#SBATCH -t 16:00:00 # Time for running job
#SBATCH -o /scratch/connectome/dyhan316/fmri_preprocessing/FINAL_loop_over_all/output_fmri_glm.o%j # %j : job id
#SBATCH -e /scratch/connectome/dyhan316/fmri_preprocessing/FINAL_loop_over_all/error_fmri_glm.e%j
pwd; hostname; date
#SBATCH --ntasks=30
#SBATCH --mem-per-cpu=3000MB
#SBATCH --cpus-per-task=1
for num in {0..29}
do
srun --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
done
wait
Then, I ran sbatch as follows: sbatch test_bash
However, when I view the outputs, it is apparent that only one of the sruns in the bash script is being executed... Could anyone tell me where I went wrong and how I can fix it?
**update: when I look at the error file I get the following: srun: Job 43969 step creation temporarily disabled, retrying. I searched the internet and it says that this could be caused by not specifying the memory and hence not having enough memory for the second job... but I thought that I had already specified the memory when I did --mem_per_cpu=300MB?
**update: I have tried changing the code as described here: Why are my slurm job steps not launching in parallel?, but it still didn't work.
**potentially pertinent information: our node has about 96 cores, which seems odd when compared to tutorials that say one node has like 4 cores or something.
Thank you!!
Try adding --exclusive to the srun command line:
srun --exclusive --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
This will instruct srun to use a sub-allocation and work as you intended.
Note that the --exclusive option has a different meaning in this context than if used with sbatch.
Note also that different versions of Slurm have a distinct canonical way of doing this, but using --exclusive should work across most versions.
Even though you have solved your problem, which turned out to be something else, and you have already specified --mem_per_cpu=300MB in your sbatch script, I would like to add that in my case my Slurm setup doesn't allow --mem-per-cpu in sbatch, only --mem. So the srun command will still allocate all the memory and block the subsequent steps. The key, for me, is to specify --mem-per-cpu (or --mem) in the srun command.
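Putting the two answers together, a minimal sketch of how the loop could look (keeping the script name and the 30-task / 3000MB-per-CPU figures from the question; on newer Slurm releases the step-level --exact flag plays a similar role to --exclusive here):
#!/bin/bash
#SBATCH --job-name=fmriGLM
#SBATCH --nodes=1
#SBATCH --ntasks=30
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3000MB
#SBATCH -t 16:00:00
for num in {0..29}
do
    # --exclusive makes the step take only its own slice of the allocation, and
    # repeating --mem-per-cpu stops the first step from claiming all the job's memory
    srun --exclusive --ntasks=1 --mem-per-cpu=3000MB \
        python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
done
wait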

How do I dynamically assign the number of cores per node?

#!/bin/bash
#SBATCH --job-name=Parallel # Job name
#SBATCH --output=slurmdiv.out # Output file name
#SBATCH --error=slurmdiv.err # Error file name
#SBATCH --partition=hadoop # Queue
#SBATCH --nodes=1
#SBATCH --time=01:00:00 # Time limit
The above script does not work without specifying the --ntasks-per-node directive. The number of cores per node depends on the queue being used. I would like to use the maximum number of cores per node without having to specify it ahead of time in the Slurm script. I'm using this to run an R script that uses detectCores() and mclapply.
You can try adding the #SBATCH --exclusive parameter to your submission script so that Slurm allocates a full node for your job, without you needing to explicitly specify a number of tasks. Then you can use detectCores() in your script.
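For illustration, a minimal sketch of that suggestion (the Rscript call and the script name sweep.R are placeholders for the actual R driver):
#!/bin/bash
#SBATCH --job-name=Parallel
#SBATCH --output=slurmdiv.out
#SBATCH --error=slurmdiv.err
#SBATCH --partition=hadoop
#SBATCH --nodes=1
#SBATCH --exclusive # claim the whole node, whatever its core count
#SBATCH --time=01:00:00
# parallel::detectCores() inside the R script now sees every core on the
# allocated node, so mclapply can use them all without hard-coding a number
Rscript sweep.R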

Slurm can't run more than one sbatch task

I've installed Slurm on a 2-node cluster. Both nodes are compute nodes, one is the controller also. I am able to successfully run srun with multiple jobs at once. I am running GPU jobs and have confirmed I can get multiple jobs running on multiple GPUs with srun, up to the number of GPUs in the systems.
However, when I try running sbatch with the same test file, it will only run one batch job, and it only runs on the compute node which is also the controller. The others fail, with an ExitCode of 1:0 in the sacct summary. If I try forcing it to run on the compute node that's not the controller, it won't run and shows the 1:0 exit code. However, just using srun will run on any compute node.
I've made sure the /etc/slurm/slurm.conf files are correct with the specs of the machines. Here is the sbatch .job file I am using:
#!/bin/bash
#SBATCH --job-name=tf_test1
#SBATCH --output=/storage/test.out
#SBATCH --error=/storage/test.err
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2000
##SBATCH --mem=10gb
#SBATCH --gres=gpu:1
~/anaconda3/bin/python /storage/tf_test.py
Maybe there is some limitation with sbatch I don't know about?
sbatch creates a job allocation and launches what is called the 'batch step'.
If you aren't familiar with what a job step is, I recommend this page: https://slurm.schedmd.com/quickstart.html
The batch step runs the script passed to it from sbatch. The only way to launch additional job steps is to invoke srun inside the batch step. In your case, it would be
srun ~/anaconda3/bin/python /storage/tf_test.py
This will create a job step running tf_test.py on each task in the allocation. Note that while the command is the same as when you run srun directly, it detects that it is inside an allocation via environment variables set by sbatch. You can split the allocation into multiple job steps by running srun with flags like -n [num tasks] instead, i.e.
#!/bin/bash
#SBATCH --ntasks=2
srun --ntasks=1 something.py
srun --ntasks=1 somethingelse.py
I don't know if you're having any other problems because you didn't post any other error messages or logs.
If using srun on the second node works and using sbatch with the submission script you mention fails without any output written, the most probable reason would be that /storage does not exist, or is not writable by the user, on the second node.
The slurmd logs on the second node should be explicit about this. The default location is /var/log/slurm/slurmd.log, but check the output of scontrol show config | grep Log for definitive information.
Another probable cause that would lead to the same behaviour is that the user is not defined or has a different UID on the second node (but then srun would fail too).
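A quick way to check both hypotheses on the second node (jobuser is a placeholder for the submitting user; the log path may differ on your install):
# can the submitting user write to /storage on this node?
sudo -u jobuser touch /storage/test.out
# where does slurmd log on this node, and what does it say?
scontrol show config | grep -i log
tail -n 50 /var/log/slurm/slurmd.log
# does the user exist here with the same UID as on the controller?
id jobuser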
@damienfrancois' answer was closest and maybe even correct. After making sure the /storage location was available on all nodes, things run with sbatch. The biggest issue was that the /storage location is shared via NFS, but it was read-only for the compute nodes. This had to be changed in /etc/exports to look more like:
/storage *(rw,sync,no_root_squash)
Before it was ro...
The job file I have that works is also a bit different. Here is the current .job file:
#!/bin/bash
#SBATCH -N 1 # nodes requested
#SBATCH --job-name=test
#SBATCH --output=/storage/test.out
#SBATCH --error=/storage/test.err
#SBATCH --time=2-00:00
#SBATCH --mem=36000
#SBATCH --qos=normal
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@nothing.com
#SBATCH --gres=gpu
srun ~/anaconda3/bin/python /storage/tf_test.py

Slurm: select nodes with specified number of CPUs

I'm using Slurm on a cluster where a single partition contains dissimilar nodes. Specifically, the nodes have varying numbers of CPUs. My code is a single-core application being used for a parameter sweep, and thus I want to fully use an (e.g.) 32-CPU node by sending it 32 jobs.
How can I select nodes (within a named partition) that have a specified number of CPUs?
I know my Partition configuration via
sinfo -e -p <partition_name> -o "%9P %3c %.5D %6t " -t idle,mix
PARTITION CPU NODES STATE
<partition_name> 16 63 mix
<partition_name> 32 164 mix
But if I use a submissions script like
[snip preamble]
#SBATCH --partition <partition_name> # resource to be used
#SBATCH --nodes 1 # Num nodes
#SBATCH -N 1 # Num cores per job
#SBATCH --cores-per-socket=32 # Cores per node
the slurm scheduler says
sbatch: error: Socket, core and/or thread specification can not be satisfied
PS. A minor correction: my code to get partition info isn't the best. Just in case anyone looks up this question later, here is a better query (using X and Y for socket and core counts) that helps identify the problem that damien's excellent answer solved:
sinfo -e -p <partition_name> -o "%9P %3c %.3D %6t %2X %2Y %N" -t idle,mix
To strictly answer your question: With
#SBATCH --cores-per-socket=32
you request 32 cores per socket, which is per physical CPU. I guess those machines have two CPUs, so you should request something like
#SBATCH --sockets-per-node=2
#SBATCH --cores-per-socket=16
Another way of requesting the same is to ask for
#SBATCH --nodes 1
#SBATCH --tasks-per-node 32
But please note that, if your cluster allows node sharing, what you are doing seems more suited for job arrays:
#SBATCH --ntasks 1
#SBATCH --array 0-31
# RUN_ID_FIRST and RUN_ID_LAST are placeholders for the first and last parameter values;
# SLURM_ARRAY_TASK_ID (0..31 here) indexes into the zero-based IDS array
IDS=($(seq RUN_ID_FIRST RUN_ID_LAST))
RUN_ID=${IDS[$SLURM_ARRAY_TASK_ID]}
matlab -nojvm -singleCompThread -r "try myscript(${RUN_ID}); catch me; disp(' *** error'); end; exit" > ./result_${RUN_ID}
This will launch 32 independent jobs, each taking care of running the Matlab script for one value of the parameter sweep.
To answer your additional question: if a 32-process job is scheduled on a 16-CPU node, the node will be overloaded, and depending on the containment solution set up by the administrators, your processes might impact other users' jobs and slow them down.

How to gather processed information from nodes SLURM/PBS

I am new to parallel computing and cannot quite understand the use of PBS systems. I have successfully installed SLURM and set up processing nodes, but I cannot get the idea of how to distribute a task between multiple nodes.
There are a lot of simple examples, but they just run simple "Hello World" programs and that's all.
Consider the following example I've found on the internet.
#!/bin/bash
#SBATCH -N 4
#SBATCH -c 1
#SBATCH --time=0-00:15:00 # 15 minutes
#SBATCH --job-name="just_a_test"
module load python
python --version
A simple script that just prints the Python version.
When I run it using sbatch python.slurm, the result is saved only on the first node even though I set the node count to 4. But srun -N4 /bin/hostname works fine, on the other hand.
But this is not the main question.
I cannot understand how I have to write my parallel algorithm.
Any example of a parallel algorithm, like array sorting, matrix multiplication or whatever, would help.
These are the steps used, for example, in Hadoop or in a plain multithreaded environment:
Get input from a source.
Divide the input into chunks; the number of chunks should be related to the node count.
Send these chunks to each processing node/thread.
Wait for all threads to complete.
Gather the processed information and show it to the user after merging.
How can I do the same using SLURM or any PBS?
#!/bin/bash
#SBATCH -N 4
#SBATCH -c 1
#SBATCH --time=0-00:15:00 # 15 minutes
#SBATCH --job-name="just_a_test"
What do I have to write here?
Please explain this or give a good article to read about it, because I haven't found any.
Thanks
The most basic way to do this is to use pbsdsh:
pbsdsh hostname
will make the hostname command execute once for each execution slot (core or thread) in your job. I will also point out that pbsdsh is a PBS/Torque tool, so you'll need to translate your #SBATCH directives to their #PBS equivalents.
The more universal way to do this is through MPI implementations.
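Since the question is about Slurm, here is a minimal sketch of that divide/process/gather workflow using srun job steps inside one sbatch allocation; input.txt and process_chunk.sh are placeholders for your own data and per-chunk program:
#!/bin/bash
#SBATCH -N 4
#SBATCH --ntasks=4
#SBATCH -c 1
#SBATCH --time=0-00:15:00 # 15 minutes
#SBATCH --job-name="just_a_test"
# 1. get the input and divide it into one chunk per task (without splitting lines)
split -n l/4 input.txt chunk_
# 2. send each chunk to its own job step, one task on one node each
i=0
for chunk in chunk_*; do
    srun --ntasks=1 --nodes=1 --exclusive ./process_chunk.sh "$chunk" > "partial_$i" &
    i=$((i+1))
done
# 3. wait for all steps to complete
wait
# 4. gather the partial results and merge them for the user
cat partial_* > final_result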
