Shared Library Error in SGE qsub script

I am trying to run a parallel job on a cluster, but every time I do it gives me the error "error while loading shared libraries: liblammps.so: cannot open shared object file: No such file or directory". I know that I need to export the library path (export LD_LIBRARY_PATH=/path/to/library), and when I do that locally and then run the program, everything is fine. It's only when I submit the job to the cluster that I run into any issues. My script looks like this:
#!/bin/bash
LD_LIBRARY_PATH=/path/to/library
for i in 1 2 3
do
qsub -l h_rt=43200 -l mem=1G -pe single 1 -cwd ./Project-Serial.sh
qsub -l h_rt=43200 -l mem=1G -pe mpi-spread 2 -cwd ./Project-MPI.sh 2
qsub -l h_rt=28800 -l mem=1G -pe mpi-spread 4 -cwd ./Project-MPI.sh 4
qsub -l h_rt=19200 -l mem=1G -pe mpi-spread 8 -cwd ./Project-MPI.sh 8
qsub -l h_rt=12800 -l mem=1G -pe mpi-spread 16 -cwd ./Project-MPI.sh 16
qsub -l h_rt=8540 -l mem=1G -pe mpi-spread 32 -cwd ./Project-MPI.sh 32
done
I'm not sure whether I'm simply setting the path in the wrong place, or whether there is some other way to use the library? Any help is appreciated.

You can use qsub -v Variable_List to pass specific environment variables, or qsub -V to pass the full environment. From there, you may need to use mpiexec -x if the sub-tasks rely on the library file (or -env for mpich).
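One way this could look, keeping the placeholder paths from the question (the binary name inside Project-MPI.sh is hypothetical):
#!/bin/bash
export LD_LIBRARY_PATH=/path/to/library
# pass only this variable into the job's environment...
qsub -v LD_LIBRARY_PATH -l h_rt=43200 -l mem=1G -pe single 1 -cwd ./Project-Serial.sh
# ...or forward the entire submitting environment
qsub -V -l h_rt=43200 -l mem=1G -pe mpi-spread 2 -cwd ./Project-MPI.sh 2
Inside Project-MPI.sh the variable still has to reach the remote MPI ranks; with Open MPI that would be something like the line below (MPICH uses -env NAME VALUE instead), assuming the script gets the rank count as its first argument as in the qsub lines above:
mpiexec -x LD_LIBRARY_PATH -np "$1" ./lmp_mpi    # lmp_mpi is just a placeholder binary name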

Related

qsub Job using GNU parallel not running

I am trying to execute a qsub job across 2 nodes with a PPN of 20 using GNU parallel, but it shows an error.
#!/bin/bash
#PBS -l nodes=2:ppn=20
#PBS -l walltime=02:00:00
#PBS -N down
cd $PBS_O_WORKDIR
module load gnu-parallel
for cdr in /scratch/data/v/mt/Downscale/*;do
(cp /scratch/data/v/mt/DWN_FILE_NEW/* $cdr/)
(cd $cdr && parallel -j20 --sshloginfile $PBS_NODEFILE 'echo {} | ./vari_1st_imge' ::: *.DS0 )
done
When I run the above code I get the following error (please note that all the paths are properly checked, and the same code runs fine without qsub on a normal computer):
$ ./down
parallel: Error: Cannot open echo {} | ./vari_1st_imge.
and for qsub down -- no output is created.
I am using:
parallel --version
GNU parallel 20140622
Please help me solve the problem.
First try adding --dryrun to parallel.
But my feeling is that $PBS_NODEFILE is not set for some reason, and that GNU Parallel tries to read the command as the --sshloginfile.
To test this:
echo $PBS_NODEFILE
(cd $cdr && parallel --sshloginfile $PBS_NODEFILE -j20 'echo {} | ./vari_1st_imge' ::: *.DS0 )
If GNU Parallel now tries to open -j20, then it is clear that $PBS_NODEFILE is empty.
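If it does turn out to be empty, a small guard at the top of the PBS script makes that failure obvious (just a sketch, not part of the original script):
if [ -z "$PBS_NODEFILE" ]; then
    echo "PBS_NODEFILE is not set" >&2
    exit 1
fi
cat "$PBS_NODEFILE"    # should list the allocated nodes (2 nodes x 20 slots)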

Job submitted with qsub does not write output and enters E status

I have a job called test.sh:
#!/bin/sh -e
#PBS -S /bin/sh
#PBS -V
#PBS -o /my/many/directories/file.log
#PBS -e /my/many/directories/fileerror.log
#PBS -r n
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
#PBS -V
#############################################################################
echo 'hello'
date
sleep 10
date
I submit it with qsub test.sh
It runs for the 10 seconds, but it doesn't write hello to file.log or anywhere else. If I include a call to another script that I wrote (and which runs fine outside the cluster), the job just goes to Exiting (E) status after those 10 seconds and plainly ignores the call.
Help, please?
Thanks Ott Toomet for your suggestion! I found the problem elsewhere. The .tcshrc file had "bash" written in it. Don't ask me why. I deleted it and now the jobs happily run.
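For anyone hitting the same thing: a stray command in a shell startup file can break batch jobs even when the job script itself is fine. A quick way to look for that kind of leftover (a sketch; these are just the usual csh/tcsh startup files):
grep -n bash ~/.tcshrc ~/.cshrc ~/.login 2>/dev/null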

Does awk run in parallel?

TASK - SSH to 650 servers, fetch a few details from each, and then write the completed server names to a different file. How can I do this faster? With plain ssh it takes 7 minutes, so I read about awk and wrote the following 2 pieces of code.
Could you please explain the difference between them?
Code 1 -
awk 'BEGIN{done_file="/home/sarafa/AWK_FASTER/done_status.txt"}
{
print "blah"|"ssh -o StrictHostKeyChecking=no -o BatchMode=yes -o ConnectTimeout=1 -o ConnectionAttempts=1 "$0" uname >/dev/null 2>&1";
print "$0" >> done_file
}' /tmp/linux
Code 2 -
awk 'BEGIN{done_file="/home/sarafa/AWK_FASTER/done_status.txt"}
{
"ssh -o StrictHostKeyChecking=no -o BatchMode=yes -o ConnectTimeout=1 -o ConnectionAttempts=1 "$0" uname 2>/dev/null"|getline output;
print output >> done_file
}' /tmp/linux
When I run these for 650 servers, Code 1 takes 30 seconds and Code 2 takes 7 minutes. Why is there such a big time difference?
The file /tmp/linux is the list of 650 servers.
Updated Answer - with thanks to @OleTange
This form is preferable to my suggestion:
parallel -j 0 --tag --slf /tmp/linux --nonall 'hostname;ls'
--tag Tag lines with arguments. Each output line will be prepended
with the arguments and TAB (\t). When combined with --onall or
--nonall the lines will be prepended with the sshlogin
instead.
--nonall --onall with no arguments. Run the command on all computers
given with --sshlogin but take no arguments. GNU parallel will
log into --jobs number of computers in parallel and run the
job on the computer. -j adjusts how many computers to log into
in parallel.
This is useful for running the same command (e.g. uptime) on a
list of servers.
Original Answer
I would recommend using GNU Parallel for this task, like this:
parallel -j 64 -k -a /tmp/linux 'echo ssh user@{} "hostname; ls"'
which will ssh into 64 hosts in parallel (you can change the number), run hostname and ls on each and then give you all the results in order (-k switch).
Obviously remove the echo when you see how it works.
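Since the original task was to record which servers completed, the --tag output can feed that list directly (the done_status.txt path is taken from the question; this is an untested sketch):
parallel -j 0 --tag --slf /tmp/linux --nonall uname \
    | awk '{print $1}' >> /home/sarafa/AWK_FASTER/done_status.txt
Hosts that fail to answer produce no output line, so only the servers that completed end up in the file.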

SGE qsub: define variables using bash?

I am trying to automatically set several variables for the SGE system, but with no luck.
#!/bin/bash
myname="test"
totaltask=10
#$ -N $myname
#$ -cwd
#$ -t 1-$totaltask
Apparently $myname will not be recognized in the #$ directives. Any solution?
Thanks a lot.
The #$ lines are parsed by qsub before the shell ever runs, so shell variables are never expanded in them. Consider making a wrapper script:
qsub_script.sh
#!/bin/bash
#$ -V
#$ -cwd
wrapper_script.sh
#!/bin/bash
myname="test"
totaltask=10
qsub -N "${myname}" -t "1-${totaltask}" qsub_script.sh
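If the job script itself also needs those values at run time, they can be passed into the job environment as well (a sketch; the variable names are just the ones from the question):
qsub -N "${myname}" -t "1-${totaltask}" -v myname="${myname}",totaltask="${totaltask}" qsub_script.sh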

PBS (torque) fails to consider quad core processors as 4 processors

I have a Debian cluster with 2 nodes, each with two quad-core processors. I use Torque and Maui as the scheduler. When I try to run an MPI job with 16 processes, the scheduler is not able to run it: either it leaves the job in the queue (although no other job is running at that moment), or it runs and the resulting output file says that I was trying to run a 16-process job with only 4 processors.
my .../pbs/server_priv/nodes file looks as follows:
node1 np=8
node2 np=8
and an example of the script I'm using to run the program is the following:
#!/bin/sh
#PBS -d /home/bellman/
#PBS -N output
#PBS -k oe
#PBS -j oe
#PBS -l nodes=2:ppn=8,walltime=10000:00:00
#PBS -V
ulimit -s 536870912
# How many procs do I have?
NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
echo Number of processors is $NP
mpiexec -np 16 /home/bellman/AAA
I tried lots of combinations of nodes and ppn, but one of the two errors always happens. Any ideas on what is going on?
Did you try:
#PBS -l nodes=2:ncpus=8,walltime=10000:00:00
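It can also help to double-check what the server actually thinks each node offers (a quick sanity check, not from the original post):
pbsnodes -a               # each node should report np = 8
qmgr -c 'print server'    # dump the server configuration and resource settings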
