How to run slurm in the background?

To use resources allocated by Slurm interactively and in the background, I run salloc -n 12 -t 20:00:00&. The problem is that this command does not put me on the compute node, so any program I run uses the resources of the login node. Could you please help me find the right command?
I also tried
salloc -n 12 -t 20:00:00 a.out </dev/null&
but it fails:
salloc: error: _fork_command: Unable to find command "a.out"
Any help is highly appreciated.

Is a.out in your PATH? For example, what does which a.out return? If a.out is only in the current directory, try ./a.out instead.

You only need to run salloc -n 12 -t 20:00:00&. Then use ssh to connect to the allocated node (for example, ssh node013); squeue -u $USER shows which node you were given.
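A minimal sketch of two ways to keep the work in the background while making sure it runs on the compute node rather than the login node. It assumes a.out sits in the current directory (hence ./a.out, which also avoids the "Unable to find command" error when a.out is not on PATH) and that it is meant to run as 12 tasks, e.g. an MPI program. The function names are hypothetical helpers; only salloc, srun and sbatch are real Slurm commands.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; assumes a Slurm cluster and a program given as "$@",
# for example ./a.out in the current directory.

# Option 1: salloc in the background. salloc itself runs its command on the
# login node, so the program is launched through srun, which places the
# tasks on the allocated compute node(s).
run_in_background_allocation() {
  salloc -n 12 -t 20:00:00 srun "$@" </dev/null &
}

# Option 2: a batch job. sbatch returns as soon as the job is queued, so no &
# is needed; --wrap turns the given command line into a one-line batch script.
submit_batch_job() {
  sbatch -n 12 -t 20:00:00 --wrap="srun $*"
}
```

For example, run_in_background_allocation ./a.out or submit_batch_job ./a.out; squeue -u "$USER" then shows the job and its node. For a 20-hour run the batch route is usually the more robust choice, since it does not depend on the login session staying alive.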

Related

Bash script results in different output when running from a cron job

I'm puzzled by a problem I'm having on Ubuntu 20.04: cron is able to run a bash script, but the overall outcome is different from when I run the same script from the shell.
I've looked through all the questions I could find here and on Google, but couldn't find anyone who had the same problem.
Background:
I'm using Pushgateway to store metrics I'm generating through a bash script, and afterwards it's being imported automatically to Prometheus.
The end goal is to export a list of running processes with their CPU%, Mem%, etc., similar to the top command.
This is the bash script:
#!/bin/bash
z=$(top -n 1 -bi)
while read -r z
do
var=$var$(awk 'FNR>7{print "cpu_usage{process=\""$12"\", pid=\""$1"\"}", $9z} FNR>7{print "memory_usage{process=\""$12"\", pid=\""$1"\"}", $10z}')
done <<< "$z"
curl -X POST -H "Content-Type: text/plain" --data "$var
" http://localhost:9091/metrics/job/top/instance/machine
I used to have a version that used ps aux, but then I found out that it only shows the average CPU% per process.
As you can see, the command I'm running is top -n 1 -bi, which gives me a snapshot of active processes and their metrics.
I'm using awk to format the data, with FNR>7 because I need to ignore the first 7 lines, which are the summary presented by top.
The bash script is copied to /bin, /usr/bin and /usr/local/bin.
When checking http://localhost:9091/metrics, which is supposed to show me the information gathered, this is some of the information I get when running the script from the shell:
cpu_usage{instance="machine",job="top",pid="114468",process="php-fpm74"} 17.6
cpu_usage{instance="machine",job="top",pid="114483",process="php-fpm74"} 11.8
cpu_usage{instance="machine",job="top",pid="126305",process="ffmpeg"} 64.7
And this is the same information when cron is running the same script:
cpu_usage{instance="machine",job="top",pid="114483",process="php-fpm+"} 5
cpu_usage{instance="machine",job="top",pid="126305",process="ffmpeg"} 60
cpu_usage{instance="machine",job="top",pid="128777",process="php"} 15
So, for some reason, when I run it from cron, it truncates the process name after 7 characters and appends a +.
I initially thought it was related to FNR>7, but even after changing it to 8 or 9 (and using exec bash to re-register the command) it gives the same results, while running it manually works just fine.
Any help would be appreciated!!
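A simplified sketch of the same pipeline, offered as a starting point rather than a confirmed fix for the cron difference. In the loop above, awk reads the rest of the here-string itself, so the while loop runs only once and awk's line count starts at top's second line; FNR>7 therefore also skips the first process line. Piping top straight into awk with NR > 7 skips exactly the 7 header lines. The -w 512 flag (procps-ng top's output-width override) is an assumption worth testing: cron runs without a terminal and with a minimal environment, and pinning the width removes one way the two runs can format columns differently. It does not cover other differences, such as cron running as a user with a different ~/.toprc.

```shell
#!/usr/bin/env bash
# Sketch: convert `top -b` output read on stdin into Pushgateway text-format
# lines. Assumes top's default field order ($1 PID, $9 %CPU, $10 %MEM,
# $12 COMMAND) and a 7-line header (5 summary lines, a blank line and the
# column header).
top_to_metrics() {
  awk 'NR > 7 {
    printf "cpu_usage{process=\"%s\", pid=\"%s\"} %s\n", $12, $1, $9
    printf "memory_usage{process=\"%s\", pid=\"%s\"} %s\n", $12, $1, $10
  }'
}

# Push one snapshot. -w 512 (assumed available, as in procps-ng top) fixes the
# output width instead of relying on COLUMNS or terminal settings.
push_top_snapshot() {
  top -b -n 1 -i -w 512 | top_to_metrics |
    curl -X POST -H "Content-Type: text/plain" --data-binary @- \
      http://localhost:9091/metrics/job/top/instance/machine
}
```

Each printf ends in a newline, so the request body also ends with the trailing newline the Pushgateway text format expects.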

Change salloc behavior to run all commands on remote node

The default behaviour of salloc is to run any shell commands on the node where salloc was called from, while any srun commands called from that salloc job shell run on the allocated node. Does anyone know of a way to get salloc to run all commands interactively in a job shell on the remote node?
Below is an example of the current default behaviour I'm seeing. Ideally, the first hostname command would run on slurm-node02 and return that hostname. Thanks!
[testyboi@slurm-node01 ~]$ salloc --nodelist=slurm-node02
salloc: Granted job allocation 890
[testyboi@slurm-node01 ~]$ hostname
slurm-node01
[testyboi@slurm-node01 ~]$ cat salloctest.sh
#!/bin/bash
echo "I am running on "; hostname;
[testyboi@slurm-node01 ~]$ srun -N1 salloctest.sh
I am running on
slurm-node02
The former FAQ, relevant for versions prior to 20.11, suggested setting
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --cpu-bind=no --mpi=none $SHELL"
in slurm.conf. For version 20.11, there is a new option instead. You can set
LaunchParameters=use_interactive_step
in slurm.conf.
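If you cannot edit slurm.conf, a sketch of doing by hand what SallocDefaultCommand did automatically: from inside the salloc session, start a shell on the allocated node with srun, using the same flags as the FAQ setting above. The function name is a hypothetical helper; SLURM_JOB_ID is the variable salloc sets in the shell it starts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: open an interactive shell on the allocated node from
# inside an existing salloc session, reusing the flags of the pre-20.11
# SallocDefaultCommand setting.
shell_on_allocated_node() {
  # Without SLURM_JOB_ID, srun would request a new allocation of its own
  # instead of using the one salloc already holds, so refuse to run.
  if [ -z "${SLURM_JOB_ID:-}" ]; then
    echo "not inside a Slurm allocation (SLURM_JOB_ID is unset)" >&2
    return 1
  fi
  srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --cpu-bind=no --mpi=none "${SHELL:-/bin/bash}"
}
```

After salloc --nodelist=slurm-node02, calling shell_on_allocated_node should leave you at a prompt on slurm-node02, where hostname reports slurm-node02; exit returns you to the original shell, still inside the allocation.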

Why am I not able to use -o or --format with ps command to control the output format?

I want to print only certain columns from the ps output: the PID, PPID, command, memory utilization and CPU utilization columns.
When I run the ps command, I get the following output.
Now I only want some columns from this output, so I use the -o flag as mentioned in this tutorial.
But I am getting this error.
I don't understand where the problem is. I have also tried using --help and it does not show the -o flag, so I am confused.
I am using the Windows operating system, with the Git Bash terminal to run all these Linux commands.
Git Bash is a terminal for Windows that emulates the Linux bash (shell) functionality. It is not 100% compatible with a "real" bash shell. As you've empirically seen, its ps executable doesn't support all the flags you're used to from Linux. The --help option shows which flags are supported.
Maybe put two things together, ps and grep? Then try this:
ps | grep -o -E "^[ 0-9]{1,9}"
Does this work on your system?
(The space inside [ ] is important: it matches the padding in front of each PID.)
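Another option is to pick columns by position with awk, which Git Bash also ships. A minimal sketch, assuming the default layout of Git Bash's ps (PID and PPID first, COMMAND last, one header line); check the header on your system and adjust the field numbers if it differs. Blanking a leading one-letter status flag is likewise an assumption about the layout, and a command path containing spaces will be cut at its last space.

```shell
#!/usr/bin/env bash
# Sketch: keep only PID, PPID and COMMAND from ps-style output read on stdin.
# Assumes PID is field 1, PPID field 2 and COMMAND the last field.
select_ps_columns() {
  awk 'NR > 1 {
    # Blank a leading one-letter status flag, if present, so the field
    # numbers stay the same on every line (assumed layout).
    sub(/^[A-Z]/, " ")
    print $1, $2, $NF
  }'
}
```

Usage: ps | select_ps_columns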

Run Hydra (mpiexec) locally gives strange SSH error

I am trying to run the example code from this question: "MPI basic example doesn't work", but when I do:
$ mpirun -np 2 mpi_test
I get this:
ssh: Could not resolve hostname wvxvw-laptop: Name or service not known
And then the program hangs until interrupted.
wvxvw-laptop is the host name of my laptop, which is just that, really, a laptop...
All I want is to try to run the example code, not to set up a network cluster or anything like that.
What did I miss? I'm reading the wiki page http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager, but I can't figure out the reason.
Sorry, I'm very new to this.
Some more verbose output:
/usr/bin/ssh -x wvxvw-laptop "/usr/lib64/mpich/bin/hydra_pmi_proxy" \
--control-port wvxvw-laptop:54320 --debug --rmk user --launcher ssh \
--demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
(Formatted for readability.) I'm not quite sure why this is even supposed to work (I've never used ssh -x, so I'm not sure what it is supposed to do :/)
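mpirun executes your program on all nodes registered in your MPI cluster, here launching the processes through ssh, as the verbose output shows (-x merely disables X11 forwarding).
MPI uses the computer name, so you can edit your /etc/hosts to add an entry for wvxvw-laptop.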
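A quick check of whether the local host name resolves, which is exactly what the ssh error says fails. A sketch: check_hostname_resolves is a hypothetical helper built on getent (the glibc name-service lookup tool), and the 127.0.0.1 mapping it suggests is one common choice for a single machine, not the only one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a host name (default: this machine's)
# resolves via the system resolver, and print a candidate /etc/hosts line
# if it does not.
check_hostname_resolves() {
  local name=${1:-$(hostname)}
  if getent hosts "$name" >/dev/null; then
    echo "$name resolves"
  else
    echo "$name does not resolve; a line like this in /etc/hosts should fix that:"
    echo "127.0.0.1 $name"
    return 1
  fi
}
```

After adding the suggested line (which needs root privileges), rerun check_hostname_resolves and then mpirun -np 2 mpi_test.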

Invocation command using SSH getting failed?

As per the project requirements, I need to check the contents of a zip file that has been generated on a remote machine. This entire activity is done using automation framework suites, which are written in shell scripts. I perform it by using the ssh command to execute unzip with the -l and -q switches. But this command fails and shows the error messages below.
[SOMEUSER@MACHINE_IP Function]$ ./TESTS.sh
ssh SOMEUSER@MACHINE_IP unzip -l -q SOME_PATH/20130409060734*.zip | grep -i XML |wc -l
unzip: cannot find or open SOME_PATH/20130409060734*.zip, SOME_PATH/20130409060734*.zip.zip or SOME_PATH/20130409060734*.zip.ZIP.
No zipfiles found.
0
When I type the same command manually, it works properly. I really have no idea why it fails whenever I execute it via the shell script.
[SOMEUSER@MACHINE_IP Function]$ ssh SOMEUSER@MACHINE_IP unzip -l -q SOME_PATH/20130409060734*.zip | grep -i XML |wc -l
2
Please help me resolve this issue.
Thanks in Advance,
Priyank Shah
When you run the command from your local machine, the asterisk is expanded by your local shell before it is passed on to the remote ssh command. So your command expects to find SOME_PATH/20130409060734*.zip files on your own machine and insert them into the ssh command passed to the other machine, whereas you (I'm assuming) mean the SOME_PATH/20130409060734*.zip files on the remote machine.
To prevent that, precede the * character with a backslash (\) and see if it helps. In some shells the escape character might be defined differently; if yours is one of them, find its escape character and use that instead. Also, put quotes around the command being passed to the other server. In my opinion, your command line should look something like this:
ssh SOMEUSER@MACHINE_IP "/usr/bin/unzip -l -q SOME_PATH/20130409060734\*.zip | grep -i XML |wc -l"
Hope this helps
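A small local demonstration of the answer's point, with no ssh involved: show_args is a hypothetical stand-in that prints each argument it receives, so you can see what the local shell hands to ssh. When nothing matches locally, bash passes the pattern through unchanged (unless nullglob is set), which is one way the same command line can behave differently depending on the directory it runs from.

```shell
#!/usr/bin/env bash
# show_args is a hypothetical stand-in for ssh: it prints every argument it
# receives in brackets, one per line.
show_args() { printf '[%s]\n' "$@"; }

orig_dir=$PWD
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
mkdir SOME_PATH
touch SOME_PATH/20130409060734_local.zip   # exists only on the "local" side

# Unquoted: the local shell expands the * before show_args (or ssh) runs.
unquoted=$(show_args unzip -l -q SOME_PATH/20130409060734*.zip)
# Quoted: the pattern reaches show_args untouched, as a single argument,
# so with ssh it would be expanded on the remote side instead.
quoted=$(show_args "unzip -l -q SOME_PATH/20130409060734*.zip")

printf 'unquoted:\n%s\nquoted:\n%s\n' "$unquoted" "$quoted"

cd "$orig_dir" && rm -rf "$demo_dir"
```

The unquoted form arrives as four separate arguments ending in [SOME_PATH/20130409060734_local.zip], the locally matched file, while the quoted form arrives as one argument with the * intact.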
