Qsub job runs but doesn't write to file - qsub

I am running a parallelised code on an SGE cluster, via the qsub command.
The code (which compiled successfully on the system on which it is supposed to run) is meant to take a file of input values, minimise some function of those values, and then output the new values to the same input file.
The job executes successfully (exit code 0) and runs for about 40 minutes of walltime, but nothing is written to the input file.
This is my script to submit the jobs:
#!/bin/bash
#PBS -V
#PBS -l select=1:ncpus=20:mpiprocs=20,walltime=02:00:00
#PBS -o some/path
#PBS -e some/path
#PBS -q smp
#PBS -m ae
#PBS -M user#username.com
#PBS -P Name
#PBS -I
#PBS -N minMg-1
module load gcc/5.1.0
module load chpc/openmpi/1.10.2/gcc-5.1.0
mpirun -np 20 $SRCDIR/myexecutable args < inputfile.inp
I can't see why the thing executes successfully, but doesn't write to inputfile.inp. Strangely, I also don't get the standard ".o" and ".e" files, either. I am sure my mistake may be obvious to someone in the know! Any help would be deeply appreciated.
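A minimal diagnostic sketch, not a confirmed fix. It rests on two general PBS behaviours: the -I directive normally requests an interactive job, and a batch job starts in the home directory rather than the submission directory, so a relative name such as inputfile.inp may resolve somewhere unexpected. The helper name job_context is hypothetical; the directives are copied from the script above with the -I line left out.

```shell
#!/bin/bash
#PBS -l select=1:ncpus=20:mpiprocs=20,walltime=02:00:00
#PBS -q smp
#PBS -N minMg-1
# Assumption: the "#PBS -I" line is dropped so the job runs as a batch job.

# Report where the job is running and whether the input file is writable there.
job_context() {
    local input=$1
    echo "host=${HOSTNAME:-unknown} cwd=$PWD"
    if [ -w "$input" ]; then
        echo "input_writable=yes"
    else
        echo "input_writable=no"
    fi
}

# Move to the directory qsub was called from; fall back to $PWD outside PBS.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1
job_context inputfile.inp

# The original launch line would follow here, unchanged:
# mpirun -np 20 $SRCDIR/myexecutable args < inputfile.inp
```

If the report shows input_writable=no, the job is probably looking for the file in a different directory from the one you expect.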

Related

how to submit multiple jobs using a single .sh file

I am using the WinSCP SSH client to connect my Windows PC to an HPC system. I created a file called RUN1.sh containing the following code:
#PBS -S /bin/bash
#PBS -q batch
#PBS -N Rjobname
#PBS -l nodes=1:ppn=1:AMD
#PBS -l walltime=400:00:00
#PBS -l mem=20gb
cd $PBS_O_WORKDIR
module load mplus/7.4
mplus -i mplus1.inp -o mplus1.out > output_${PBS_JOBID}.log
It simply runs the mplus1.inp file and saves the results of the analysis in the mplus1.out file. I can do this on Linux one file at a time. I can do the same for the mplus2.inp file by running the RUN2.sh file below:
#PBS -S /bin/bash
#PBS -q batch
#PBS -N Rjobname
#PBS -l nodes=1:ppn=1:AMD
#PBS -l walltime=400:00:00
#PBS -l mem=20gb
cd $PBS_O_WORKDIR
module load mplus/7.4
mplus -i mplus2.inp -o mplus2.out > output_${PBS_JOBID}.log
However, I have 400 files like this (RUN1.sh, RUN2.sh, ..., RUN400.sh). Is there a way to create a single file that runs all 400 jobs on Linux?
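One possible approach, sketched under assumptions: generate each job script from a template and pipe it to qsub, which reads the script from standard input when no file name is given. The function names make_job and submit_all are hypothetical, and the job name mplus_N is an illustrative change from Rjobname so that the jobs can be told apart in the queue.

```shell
#!/bin/bash
# Print a PBS job script for input number $1 (mplusN.inp -> mplusN.out).
make_job() {
    local i=$1
    cat <<EOF
#PBS -S /bin/bash
#PBS -q batch
#PBS -N mplus_${i}
#PBS -l nodes=1:ppn=1:AMD
#PBS -l walltime=400:00:00
#PBS -l mem=20gb
cd \$PBS_O_WORKDIR
module load mplus/7.4
mplus -i mplus${i}.inp -o mplus${i}.out > output_\${PBS_JOBID}.log
EOF
}

# Submit jobs first..last, one qsub call per input file.
submit_all() {
    local first=$1 last=$2 i
    for (( i = first; i <= last; i++ )); do
        make_job "$i" | qsub
    done
}
```

Running make_job 1 on its own prints the generated script so it can be checked before anything is submitted; submit_all 1 400 would then submit the whole set.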

Job submitted with qsub does not write output and enters E status

I have a job called test.sh:
#!/bin/sh -e
#PBS -S /bin/sh
#PBS -V
#PBS -o /my/many/directories/file.log
#PBS -e /my/many/directories/fileerror.log
#PBS -r n
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
#PBS -V
#############################################################################
echo 'hello'
date
sleep 10
date
I submit it with qsub test.sh
It counts to 10 seconds, but it doesn't write hello to file.log or anywhere else. If I include a call to another script I need that I programmed (and runs outside the cluster), it just goes to Exiting status after said 10 seconds and plainly ignores the call.
Help, please?
Thanks Ott Toomet for your suggestion! I found the problem elsewhere. The .tcshrc file had "bash" written in it. Don't ask me why. I deleted it and now the jobs happily run.

qsub array job delay

#!/bin/bash
#PBS -S /bin/bash
#PBS -N garunsmodel
#PBS -l mem=2g
#PBS -l walltime=1:00:00
#PBS -t 1-2
#PBS -e error/error.txt
#PBS -o error/output.txt
#PBS -A improveherds_my
#PBS -m ae
set -x
c=$PBS_ARRAYID
nodeDir=`mktemp -d /tmp/phuong.XXXXX`
cp -r /group/dairy/phuongho/garuns $nodeDir
cp /group/dairy/phuongho/jo/parity1/my/simplex.bin $nodeDir/garuns/simplex.bin
cp /group/dairy/phuongho/jo/parity1/nttp.txt $nodeDir/garuns/my.txt
cp /group/dairy/phuongho/jo/parity1/delay_input.txt $nodeDir/garuns/delay_input.txt
cd $nodeDir/garuns
module load gcc vle
XXX=`pwd`
sed -i "s|/group/dairy/phuongho/garuns/out|$XXX/out/|" exp/garuns.vpz
awk -v i="$c" 'NR == 1 || $8==i' my.txt > simplex-observed.txt
awk -v i="$c" 'NR == 1 || $7==i {print $6}' delay_input.txt > afm_param.txt
cp "/group/dairy/phuongho/garuns_param.txt" "$nodeDir/garuns/garuns_param.txt"
while true
do
./simplex.bin &
sleep 5m
done
awk 'NR >1' < simplex-optimum-output.csv>> /group/dairy/phuongho/jo/parity1/my/finalresuls${c}.csv
cp simplex-all-output.csv "/group/dairy/phuongho/jo/parity1/my/simplex-all-output${c}.csv"
#awk '$28==1{print $1, $12,$26,$28,c}' c=$c out/exp_tempfile.csv > /group/dairy/phuongho/jo/parity1/my/simulated_my${c}.csv
cp /out/exp_tempfile.csv /group/dairy/phuongho/jo/parity1/my/exp_tempfile${c}.csv
rm simplex-observed.txt
rm garuns_param.txt
The bash script above submits multiple jobs at the same time via PBS_ARRAYID. My issue is that my model (simplex.bin) writes something to my home directory when it executes. If only one job runs at a time, or each job waits until the previous one has finished writing to home, everything is fine. However, I want more than 1000 jobs running at once, and when 1000 of them try to write the same files to home, they crash.
Is there a smart way to submit the second job only after the first one has been running for a certain amount of time (let's say 5 minutes)?
I already checked and found two options: starts 2nd job when 1st finished, or start at a specific date/time.
Thanks
You can try something like the following:
while [ yes ]
do
./simplex.bin &
sleep 2
done
It endlessly starts a ./simplex.bin process in the background, waits for 2 seconds, starts a new ./simplex.bin, and so on.
Please note that, depending on your exact requirements, you may also need nohup and standard input/output redirection for ./simplex.bin.
If you are using Torque, you can set a limit on the number of jobs that can run concurrently:
# Only allow 100 jobs to concurrently execute from this job array
qsub myscript.sh -t 0-10000%100
I know this isn't exactly what you're looking for, but I'm guessing you can find a slot limit that'll make it run without crashing.
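A rough sketch of another way to spread out the start times: each array task sleeps for a delay derived from its own index before launching the model. The function name stagger_delay and its three parameters are hypothetical. Indices are assumed to start at 1, and the third parameter is assumed to match the slot limit (the number after % in the -t option) so that the delay wraps around instead of growing without bound.

```shell
#!/bin/bash
# Seconds to wait for array index $1 (1-based), with $2 seconds between
# consecutive starts and at most $3 tasks running at the same time.
stagger_delay() {
    local index=$1 interval=$2 slots=$3
    echo $(( ((index - 1) % slots) * interval ))
}

# Inside the job script, before launching the model:
#   sleep "$(stagger_delay "$PBS_ARRAYID" 300 100)"
#   ./simplex.bin
```

The sleep counts against the job's walltime and keeps its slot idle, so a short interval or a small slot limit is probably the safer choice.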

Running samtools from a qsub

I'm trying to run some samtools commands from a qsub call (to run on a cluster). For some reason, the commands do not seem to be recognized. However, if I copy-paste the command and run it directly in a terminal on the cluster, it works fine. Has anybody experienced such issues, or does anyone have an idea what I'm doing wrong?
Thanks,
Patrick
My qsub (this doesn't work):
#!/bin/bash
#./etc/sysconfig/pssc
#PBS -S /bin/bash
#PBS JOB_NAME="QSH_$(whoami)"
#PBS NODE_NUM="1"
#PBS NODE_PPN="${NODE_NCPUS}"
#PBS HOURS="24"
#PBS MINUTES="00"
#PBS SECONDS="00"
#PBS WALLTIME=${HOURS}:${MINUTES}:${SECONDS}
#PBS RES_LIST="nodes=${NODE_NUM}:ppn=${NODE_PPN}"
#PBS DIR_WORK="${PBS_O_WORKDIR}"
#PBS QUEUE="high"
#PBS cd ${DIR_WORK}
samtools index /data/test.bam /data/test.bai
If I run the command directly from the terminal, it works:
samtools index /data/test.bam /data/test.bai
Did you remember to cd into your working dir? I do not believe that qsub expands the $ variables in e.g. PBS cd ${DIR_WORK}.
Try with this script:
#!/bin/bash
#./etc/sysconfig/pssc
#PBS JOB_NAME=test
#PBS WALLTIME=24:00:00
cd ${PBS_O_WORKDIR}
echo `pwd`
dir
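A hedged sketch of how the job script might look with standard directives. Lines of the form #PBS NAME=value are not qsub options; directives take the same flags as the qsub command line (-N, -l, -q). The helper require_cmd is hypothetical, and ppn=1 is an assumption because the original took its value from ${NODE_NCPUS}. A batch job can start with a different PATH from a login shell, so the helper reports PATH when samtools is not found.

```shell
#!/bin/bash
#PBS -S /bin/bash
#PBS -N samtools_index
#PBS -l nodes=1:ppn=1
#PBS -l walltime=24:00:00
#PBS -q high

# Succeed if command $1 is on PATH; otherwise print PATH to stderr and fail.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing command: $1 (PATH=$PATH)" >&2
    return 1
}

# Start from the submission directory; fall back to $PWD outside PBS.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# Only run the indexing step when samtools can actually be found.
if require_cmd samtools; then
    samtools index /data/test.bam /data/test.bai
fi
```

If the job's error file shows the "missing command" message, loading the relevant module or using the full path to samtools inside the script would be the next thing to try.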

PBS (torque) fails to consider quad core processors as 4 processors

I have a Debian cluster with 2 nodes, each with two quad-core processors. I use Torque with Maui as the scheduler. When I try to run an MPI job with 16 processes, the scheduler is not able to run the job: either it leaves the job in the queue (although no other job is running at that moment), or it runs the job and the resulting output file says that I was trying to run a 16-process job with only 4 processors.
My .../pbs/server_priv/nodes file looks as follows:
node1 np=8
node2 np=8
and an example of the script I'm using to run the program is the following:
#!/bin/sh
#PBS -d /home/bellman/
#PBS -N output
#PBS -k oe
#PBS -j oe
#PBS -l nodes=2:ppn=8,walltime=10000:00:00
#PBS -V
ulimit -s 536870912
# How many procs do I have?
NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
echo Number of processors is $NP
mpiexec -np 16 /home/bellman/AAA
I tried lots of combinations of nodes and ppn, but one of the two errors always happens. Any ideas on what is going on?
Did you try:
#PBS -l nodes=2:ncpus=8,walltime=10000:00:00
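To see what the scheduler actually granted, it may help to inspect the node file instead of hardcoding -np 16. The sketch below assumes the usual Torque layout, in which $PBS_NODEFILE holds one line per allocated process slot with the host name repeated. The helper names count_slots and count_nodes are hypothetical.

```shell
#!/bin/bash
# Number of process slots: one line per slot in the node file.
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Number of distinct hosts listed in the node file.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Inside the job script:
#   NP=$(count_slots "$PBS_NODEFILE")
#   echo "Running $NP processes on $(count_nodes "$PBS_NODEFILE") nodes"
#   mpiexec -np "$NP" /home/bellman/AAA
```

If the slot count comes out as 4 rather than 16, the resource request is not being interpreted as intended, which would match the error message in the output file.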
