Open RTE was unable to open the hostfile - openmpi

After running the job.sh script I receive the following error:
Open RTE was unable to open the hostfile:
-np
Check to make sure the path and filename are correct.
Here is job.sh
#!/bin/bash

#PBS -l nodes=${NODES}:ppn=${PPN},walltime=00:01:00
#PBS -N lab
#PBS -q batch

cd $PBS_O_WORKDIR
mpirun --hostfile $PBS_NODEFILE -np $((NODES * PPN)) main ${ARG1} ${ARG2}
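One thing worth checking (an assumption on my part, not something stated in the post): whether $PBS_NODEFILE is actually set when the script runs, because if it expands to nothing, mpirun's --hostfile option consumes the next token, -np, as the hostfile name, which matches the error above. A small diagnostic sketch that could go just before the mpirun line:
# Hypothetical debugging lines, not part of the original script:
echo "PBS_NODEFILE=${PBS_NODEFILE:-<unset>}"
echo "NODES=${NODES:-<unset>} PPN=${PPN:-<unset>}"
[ -f "$PBS_NODEFILE" ] && cat "$PBS_NODEFILE"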

Related

Is there any way to run more than one parallel job simultaneously using a single job script?

Is there any way to run more than one parallel job simultaneously using a single job script? I have written a script like this, but it is not processing the four jobs simultaneously. Only 12 cores out of 48 are busy, running a single job, and the four codes (from four different directories) run one after another.
#!/bin/sh
#SBATCH --job-name=my_job_name # Job name
#SBATCH --ntasks-per-node=48
#SBATCH --nodes=1
#SBATCH --time=24:00:00 # Time limit hrs:min:sec
#SBATCH -o cpu_srun_new.out
#SBATCH --partition=medium
module load compiler/intel/2019.5.281
cd a1
mpirun -np 12 ./a.out > output.txt
cd ../a2
mpirun -np 12 ./a.out > output.txt
cd ../a3
mpirun -np 12 ./a.out > output.txt
cd ../a4
mpirun -np 12 ./a.out > output.txt
Commands in sh (as in any other shell) are blocking, meaning that once you run one, the shell waits for it to complete before moving on to the next command, unless you append an ampersand & to the end of the command.
Your script should look like this:
#!/bin/sh
#SBATCH --job-name=my_job_name # Job name
#SBATCH --ntasks-per-node=48
#SBATCH --nodes=1
#SBATCH --time=24:00:00 # Time limit hrs:min:sec
#SBATCH -o cpu_srun_new.out
#SBATCH --partition=medium
module load compiler/intel/2019.5.281
cd a1
mpirun -np 12 ./a.out > output1.txt &
cd ../a2
mpirun -np 12 ./a.out > output2.txt &
cd ../a3
mpirun -np 12 ./a.out > output3.txt &
cd ../a4
mpirun -np 12 ./a.out > output4.txt &
wait
Note the & at the end of the mpirun lines, and the addition of the wait command at the end of the script. That command is necessary to make sure the script does not end before the mpirun commands are completed.
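As a minimal, cluster-independent illustration of the same mechanism (nothing here is specific to Slurm or MPI):
# Three commands started in the background run concurrently;
# wait blocks until every background job started by this shell has finished.
sleep 3 &
sleep 2 &
sleep 1 &
wait
echo "all done"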

How to submit multiple jobs using a single .sh file

I am using the WinSCP SSH client to connect my Windows PC to an HPC system. I created a file called RUN1.sh containing the following code:
#PBS -S /bin/bash
#PBS -q batch
#PBS -N Rjobname
#PBS -l nodes=1:ppn=1:AMD
#PBS -l walltime=400:00:00
#PBS -l mem=20gb
cd $PBS_O_WORKDIR
module load mplus/7.4
mplus -i mplus1.inp -o mplus1.out > output_${PBS_JOBID}.log
It simply calls the mplus1.inp file and saves the results of the analysis in the mplus1.out file. I can run these on Linux one by one. I can do the same for the mplus2.inp file by running the RUN2.sh file below:
#PBS -S /bin/bash
#PBS -q batch
#PBS -N Rjobname
#PBS -l nodes=1:ppn=1:AMD
#PBS -l walltime=400:00:00
#PBS -l mem=20gb
cd $PBS_O_WORKDIR
module load mplus/7.4
mplus -i mplus2.inp -o mplus2.out > output_${PBS_JOBID}.log
However, I have 400 files like this (RUN1.sh, RUN2.sh, ... RUN400.sh). I was wondering if there is a way to create a single file that runs all 400 jobs on Linux.
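One common pattern, sketched below under the assumption that the scripts are named RUN1.sh through RUN400.sh and sit in the current directory, is a small wrapper script that submits each of them with qsub:
#!/bin/bash
# Submit RUN1.sh ... RUN400.sh as 400 separate jobs (a sketch; adjust names and paths as needed).
for i in $(seq 1 400); do
    qsub "RUN${i}.sh"
done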

Shared Library Error in SGE qsub script

I am trying to run a parallel job on a cluster, but every time I do, it gives me the error "error while loading shared libraries: liblammps.so: cannot open shared object file: No such file or directory". I know that I need to export the library path, i.e. "export LD_LIBRARY_PATH=/path/to/library", and when I do that locally and then run the program, everything is fine. It's only when I submit the job to the cluster that I run into any issues. My script looks like this:
#!/bin/bash
LD_LIBRARY_PATH=/path/to/library
for i in 1 2 3
do
qsub -l h_rt=43200 -l mem=1G -pe single 1 -cwd ./Project-Serial.sh
qsub -l h_rt=43200 -l mem=1G -pe mpi-spread 2 -cwd ./Project-MPI.sh 2
qsub -l h_rt=28800 -l mem=1G -pe mpi-spread 4 -cwd ./Project-MPI.sh 4
qsub -l h_rt=19200 -l mem=1G -pe mpi-spread 8 -cwd ./Project-MPI.sh 8
qsub -l h_rt=12800 -l mem=1G -pe mpi-spread 16 -cwd ./Project-MPI.sh 16
qsub -l h_rt=8540 -l mem=1G -pe mpi-spread 32 -cwd ./Project-MPI.sh 32
done
I'm not sure whether I am simply setting the path in the wrong place, or whether there is some other way to use the library? Any help is appreciated.
You can use qsub -v Variable_List to pass specific environment variables, or qsub -V to pass the full environment. From there, you may need to use mpiexec -x if the sub-tasks rely on the library file (or -env for MPICH).
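For example (a sketch only; the library path and binary name are placeholders):
export LD_LIBRARY_PATH=/path/to/library
# Pass just this variable into the job environment ...
qsub -v LD_LIBRARY_PATH -l h_rt=43200 -l mem=1G -pe mpi-spread 2 -cwd ./Project-MPI.sh 2
# ... or forward the full submission environment instead:
qsub -V -l h_rt=43200 -l mem=1G -pe mpi-spread 2 -cwd ./Project-MPI.sh 2
# Inside Project-MPI.sh, with Open MPI, the variable can be re-exported to every rank:
# mpirun -x LD_LIBRARY_PATH -np "$NSLOTS" ./my_mpi_binary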

Job submitted with qsub does not write output and enters E status

I have a job called test.sh:
#!/bin/sh -e
#PBS -S /bin/sh
#PBS -V
#PBS -o /my/many/directories/file.log
#PBS -e /my/many/directories/fileerror.log
#PBS -r n
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
#PBS -V
#############################################################################
echo 'hello'
date
sleep 10
date
I submit it with qsub test.sh
It sleeps for 10 seconds, but it doesn't write hello to file.log or anywhere else. If I include a call to another script that I programmed (and that runs outside the cluster), the job just goes to Exiting status after those 10 seconds and plainly ignores the call.
Help, please?
Thanks, Ott Toomet, for your suggestion! I found the problem elsewhere: the .tcshrc file had "bash" written in it. Don't ask me why. I deleted it and now the jobs happily run.

PBS (Torque) fails to consider quad-core processors as 4 processors

I have a Debian cluster with 2 nodes, each with two quad-core processors. I use Torque and Maui as the scheduler. When I try to run an MPI job with 16 processes, the scheduler is not able to run the job: either it leaves it in the queue (although no other job is running at that moment), or it runs and the resulting output file says that I was trying to run a 16-process job with only 4 processors.
My .../pbs/server_priv/nodes file looks as follows:
node1 np=8
node2 np=8
An example of the script I'm using to run the program is the following:
#!/bin/sh
#PBS -d /home/bellman/
#PBS -N output
#PBS -k oe
#PBS -j oe
#PBS -l nodes=2:ppn=8,walltime=10000:00:00
#PBS -V
ulimit -s 536870912
# How many procs do I have?
NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
echo Number of processors is $NP
mpiexec -np 16 /home/bellman/AAA
I tried lots of combinations of nodes and ppn, but one of the two errors always happens. Any ideas on what is going on?
Did you try:
#PBS -l nodes=2:ncpus=8,walltime=10000:00:00
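A related detail (my observation, not part of the original answer): the script above computes $NP from $PBS_NODEFILE but then hard-codes 16 on the mpiexec line. Deriving the rank count from the nodefile keeps the two consistent:
# Count the slots PBS actually allocated and launch that many ranks.
NP=$(wc -l < "$PBS_NODEFILE")
echo Number of processors is $NP
mpiexec -np "$NP" /home/bellman/AAA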
