Change salloc behavior to run all commands on remote node - slurm

By default, salloc runs any shell-related commands on the node where salloc was called from, while srun commands called from that salloc job shell run on the node that was allocated. Does anyone know of a way to get salloc to interactively run all commands in a job shell on the remote node?
Below is an example of the default behaviour I'm seeing. Ideally, the first hostname command would run on slurm-node02 and return that hostname. Thanks!
[testyboi@slurm-node01 ~]$ salloc --nodelist=slurm-node02
salloc: Granted job allocation 890
[testyboi@slurm-node01 ~]$ hostname
slurm-node01
[testyboi@slurm-node01 ~]$ cat salloctest.sh
#!/bin/bash
echo "I am running on "; hostname;
[testyboi@slurm-node01 ~]$ srun -N1 salloctest.sh
I am running on
slurm-node02

The former FAQ, relevant for versions prior to 20.11, suggested setting
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --cpu-bind=no --mpi=none $SHELL"
in slurm.conf. For the current version, 20.11, there is a new option. You can set
LaunchParameters=use_interactive_step
in slurm.conf.
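With that parameter set (and slurmctld/slurmd having re-read slurm.conf, e.g. via scontrol reconfigure), salloc should drop you into a shell on the allocated node. An illustrative session, reusing the hostnames from the question (not a verified transcript; the job number is made up):
[testyboi@slurm-node01 ~]$ salloc --nodelist=slurm-node02
salloc: Granted job allocation 891
[testyboi@slurm-node02 ~]$ hostname
slurm-node02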

Related

Run the same bash script in multiple processes

I have a bash script that basically translates an argument to a kubectl command. For example:
$ ./file.sh service1 (will run the command kubectl logs service1- -n namespace -f)
or
$ ./file.sh service2 (will run the command kubectl logs service2- -n anothernamespace -f)
My problem is that when I run it to see the logs live with the -f (follow) option and then open another terminal tab to see the logs from another service with the same script, the first process is killed. How can I run the same script from multiple terminals without them stopping each other, and see the output from all of them?
Run the program in the background and output the log to out.file
command >> out.file 2>&1 &
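Applied to the script from the question, that would look something like this (the log file names are arbitrary):
./file.sh service1 >> service1.log 2>&1 &   # follow service1 in the background
./file.sh service2 >> service2.log 2>&1 &   # follow service2 alongside it
tail -f service1.log service2.log           # watch both outputs live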
I did manage to get it working with the help of ChatGPT (it does wonders). The solution is this:
osascript -e 'tell application "Terminal" to do script "clear; ./your_script.sh arg1 arg2" in the front window'

How to run slurm in the background?

To use the resources allocated by Slurm interactively and in the background, I use salloc -n 12 -t 20:00:00&. The problem is that this command does not redirect me to the compute node, and if I run a program it uses the resources of the login node. Could you please help me find the right command? I also tried
salloc -n 12 -t 20:00:00 a.out </dev/null&
but it fails:
salloc: error: _fork_command: Unable to find command "a.out"
Any help is highly appreciated.
Is a.out in your PATH? E.g., what does which a.out return?
You only need to execute salloc -n 12 -t 20:00:00&. Then use ssh to connect to the allocated node (for example, ssh node013).
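A minimal sketch of that workflow (node013 is just a placeholder; use the node name squeue reports for your job):
salloc -n 12 -t 20:00:00 &      # hold the allocation in the background
squeue -u "$USER" -o "%i %N"    # list your job IDs and the nodes they were given
ssh node013                     # connect to the allocated node and run your programs there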

"rejoin" a bash SLURM job

Currently, I can use srun [variety of settings] bash to create a shell on a compute node. However, if my ssh disconnects for whatever reason and I want to re-access the shell, how can I do that?
Get your job ID
squeue -u $USERNAME
Get your NodeID
scontrol show job $JOBID | grep NodeList
Attach (you can assign step = 0)
sattach $JOBID.0
This did not work for me, but in theory this should work:
srun --pty --jobid $JOBID -w $NODEID /bin/bash
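Putting the steps together, a sketch with placeholder values (substitute the job ID and node name obtained from squeue and scontrol above):
JOBID=1234567      # from squeue
NODEID=node013     # from scontrol show job
sattach $JOBID.0                                  # re-attach to step 0, or alternatively:
srun --pty --jobid $JOBID -w $NODEID /bin/bash    # open a new shell inside the job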
Assuming the SSH connection from your laptop to the login node of the cluster is unstable, you can use a terminal multiplexer such as screen or tmux, depending on what is already installed on the login node.
Typically, a session would look like this:
[you@yourlaptop ~]$ ssh cluster-frontend
[you@cluster ~]$ tmux # to enter a persistent tmux session
[you@cluster ~]$ srun [...] bash # to get a shell on a compute node
[you@computenode ~]$ # some work, then...
some SSH error (e.g. Write failed: Broken pipe)
[you@yourlaptop ~]$ ssh cluster-frontend
[you@cluster ~]$ tmux a # to re-attach to the persistent tmux session
[you@computenode ~]$ # resume work
With screen, you would use screen -r rather than tmux a. Otherwise the process is the same.
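For completeness, the same flow with screen looks roughly like this:
[you@cluster ~]$ screen             # start a persistent screen session
[you@cluster ~]$ srun [...] bash    # get a shell on a compute node and work as usual
# ...after a disconnect, ssh back to the login node and run:
[you@cluster ~]$ screen -r          # re-attach to the screen session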
If you want to join a job from a second terminal, you can use Slurm's sattach command.
In the first terminal:
[you@yourlaptop ~]$ ssh cluster-frontend
[you@cluster ~]$ srun [...] bash
srun: job ******* queued and waiting for resources
srun: job ******* has been allocated resources
[you@computenode ~]$
In a second terminal:
[you@yourlaptop ~]$ ssh cluster-frontend
[you@cluster ~]$ sattach --pty ********
[you@computenode ~]$
The original terminal and the one in which sattach was run are now entirely synchronised; a command typed in either one appears, along with its output, in both:
[you@computenode ~]$ echo OK
OK
Note that the above does not protect against an accidental termination of srun; whenever srun terminates, the job is terminated as well.

Get PID of the jobs submitted by nohup in Linux

I'm using nohup to submit jobs in the background on the machines I got through BSUB and SSH.
My primary machine is on RHEL; from there I am picking up an AIX machine through BSUB (which submits a job to LSF) and also doing an SSH login to another server.
After getting these two machines, I execute a script (inner.sh) there through nohup.
I'm capturing the respective PIDs through echo $$ in the script I am executing (inner.sh).
After submitting the nohup execution in the background, I exit both machines and land back on the primary RHEL machine.
Now, from this RHEL machine, I'm trying to get the status of the nohup executions by running ps -p PID with the two previously captured PIDs, but no processes are listed.
Top-level wrapper script wrapper.sh:
#!/bin/bash
# log in to a remote server
ssh -k xyz@abc < env_setup.sh
# picking up an AIX machine from LSF
bsub -q night -Is ksh -i env_setup.sh
ps -p process-<AIX_machine>.pid
# got no output
ps -p process-<server_machine>.pid
# got no output
Script passed to the machines picked up by BSUB/SSH to run nohup, env_setup.sh:
#!/bin/bash
nohup sh /path/to/dir/inner.sh > /path/to/dir/log-<hostname>.out &
exit
The actual script I am trying to execute on the machines picked up by BSUB/SSH, inner.sh:
#!/bin/bash
echo $$ > /path/to/dir/process-<hostname>.pid
# hoping this gives us the correct PID on the remote machine
# execute some other commands
Now, I am getting two process-<hostname>.pid files, each updated with the PID from one of the two machines.
But ps -p in the wrapper script gives no output.
I am picking up the process IDs from the remote machines and running ps -p on my local RHEL machine.
Is that the reason I am not getting any status for those two processes?
Can I do anything else to get the status?
ps gives the status of local processes. bsub can be used to get the status of processes on each remote machine.
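In other words, the ps -p has to run on the machine where inner.sh is actually running. A sketch, reusing the placeholders from the question (substitute the real host and file names); on the LSF side, bjobs reports the status of the submitted job:
# check the process on the SSH server from the RHEL box
ssh xyz@abc 'ps -p "$(cat /path/to/dir/process-<hostname>.pid)"'
# check the status of the LSF job
bjobs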

Run a script in the same shell (bash)

My problem is specific to running SPECCPU2006 (a benchmark suite).
After installing the benchmark, I can invoke a command called "specinvoke" in the terminal to run a specific benchmark. I have another script, part of which looks like the following:
cd (specific benchmark directory)
specinvoke &
pid=$!
My goal is to get the PID of the running task. However, with what is shown above, what I get is the PID of the "specinvoke" shell command, and the real running task has another PID.
By running specinvoke -n, though, the real commands run by the specinvoke shell are printed to stdout. For example, for one benchmark it looks like this:
# specinvoke r6392
# Invoked as: specinvoke -n
# timer ticks over every 1000 ns
# Use another -n on the command line to see chdir commands and env dump
# Starting run for copy #0
../run_base_ref_gcc43-64bit.0000/milc_base.gcc43-64bit < su3imp.in > su3imp.out 2>> su3imp.err
Inside, it is running a binary. The command differs from benchmark to benchmark (depending on which benchmark directory specinvoke is invoked in). And because "specinvoke" is an installed program and not just a script, I cannot use "source specinvoke".
So, any clues? Is there a way to invoke the command directly in the same shell (so it has the same PID), or should I dump the specinvoke -n output and run that instead?
You can still do something like:
cd (specific benchmark directory)
specinvoke &
pid=$(pgrep milc_base.gcc43-64bit)
If there are several invocations of the milc_base.gcc43-64bit binary, you can still use
pid=$(pgrep -n milc_base.gcc43-64bit)
Which according to the man page:
-n
Select only the newest (most recently started) of the matching
processes
When the process is a direct child of the subshell:
ps -o pid= -C milc_base.gcc43-64bit --ppid $!
When it is not a direct child, you could get the info from pstree:
pstree -p $! | grep -o 'milc_base.gcc43-64bit(.*)'
Output from the above (the PID is in brackets): milc_base.gcc43-64bit(9837)
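A combined sketch along those lines; benchmark_dir is a hypothetical variable for the benchmark directory, and the sleep assumes the real binary starts within a second of specinvoke (pgrep -P limits the match to children of the given parent PID):
cd "$benchmark_dir"
specinvoke &
invoker=$!
sleep 1                                               # give specinvoke time to launch the real binary
pid=$(pgrep -n -P "$invoker" milc_base.gcc43-64bit)   # newest matching child of specinvoke
echo "benchmark PID: $pid"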
