qsub array job delay - linux

#!/bin/bash
#PBS -S /bin/bash
#PBS -N garunsmodel
#PBS -l mem=2g
#PBS -l walltime=1:00:00
#PBS -t 1-2
#PBS -e error/error.txt
#PBS -o error/output.txt
#PBS -A improveherds_my
#PBS -m ae
set -x
c=$PBS_ARRAYID
nodeDir=`mktemp -d /tmp/phuong.XXXXX`
cp -r /group/dairy/phuongho/garuns $nodeDir
cp /group/dairy/phuongho/jo/parity1/my/simplex.bin $nodeDir/garuns/simplex.bin
cp /group/dairy/phuongho/jo/parity1/nttp.txt $nodeDir/garuns/my.txt
cp /group/dairy/phuongho/jo/parity1/delay_input.txt $nodeDir/garuns/delay_input.txt
cd $nodeDir/garuns
module load gcc vle
XXX=`pwd`
sed -i "s|/group/dairy/phuongho/garuns/out|$XXX/out/|" exp/garuns.vpz
awk -v i="$c" 'NR == 1 || $8==i' my.txt > simplex-observed.txt
awk -v i="$c" 'NR == 1 || $7==i {print $6}' delay_input.txt > afm_param.txt
cp "/group/dairy/phuongho/garuns_param.txt" "$nodeDir/garuns/garuns_param.txt"
while true
do
./simplex.bin &
sleep 5m
done
awk 'NR >1' < simplex-optimum-output.csv>> /group/dairy/phuongho/jo/parity1/my/finalresuls${c}.csv
cp simplex-all-output.csv "/group/dairy/phuongho/jo/parity1/my/simplex-all-output${c}.csv"
#awk '$28==1{print $1, $12,$26,$28,c}' c=$c out/exp_tempfile.csv > /group/dairy/phuongho/jo/parity1/my/simulated_my${c}.csv
cp /out/exp_tempfile.csv /group/dairy/phuongho/jo/parity1/my/exp_tempfile${c}.csv
rm simplex-observed.txt
rm garuns_param.txt
I have the above bash script, which submits multiple jobs at once via PBS_ARRAYID. My issue is that my model (simplex.bin) writes something to my home directory when it executes. If only one job runs at a time, or each job waits until the previous one has finished writing to home, everything is fine. However, I want more than 1000 jobs running at a time, and when 1000 of them try to write the same thing to home, it crashes.
Is there a smart way to submit the second job only after the first one has been running for a certain amount of time (say, 5 minutes)?
I have already checked and found two options: start the second job when the first has finished, or start it at a specific date/time.
Thanks

You can try something like the following:
while [ yes ]
do
./simplex.bin &
sleep 2
done
It endlessly starts a ./simplex.bin process in the background, waits for 2 seconds, starts a new ./simplex.bin, and so on.
Please note that, depending on your exact requirements, you may also need nohup and standard input/output redirection for ./simplex.bin.
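Since the loop above never terminates, a bounded variant may be closer to what a job script needs. The following is only a sketch: n_jobs and delay are hypothetical names, and sleep 0 stands in for ./simplex.bin so that it can be tried anywhere.

```shell
n_jobs=3          # hypothetical: how many copies to start
delay=1           # hypothetical: seconds to wait between launches
launched=0

while [ "$launched" -lt "$n_jobs" ]; do
  sleep 0 &       # stand-in for ./simplex.bin
  launched=$((launched + 1))
  # No need to wait after the final launch.
  if [ "$launched" -lt "$n_jobs" ]; then
    sleep "$delay"
  fi
done

wait              # block until every background process has exited
echo "launched $launched processes"
```

The final wait keeps the script alive until all background processes have finished, so any post-processing placed after the loop only runs once the results exist.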

If you are using Torque, you can set a limit on the number of jobs that can run concurrently:
# Only allow 100 jobs to concurrently execute from this job array
qsub -t 0-10000%100 myscript.sh
I know this isn't exactly what you're looking for, but I'm guessing you can find a slot limit that'll make it run without crashing.
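Another way to approximate the delay asked for in the question is to let each array task sleep for a time proportional to its index before it launches the model. This is a sketch under assumptions: stagger_delay and STAGGER_STEP are made-up names, and PBS_ARRAYID falls back to 1 so the snippet also runs outside the scheduler.

```shell
# Seconds to wait before starting, given a 1-based array index and a step.
stagger_delay() {
  echo $(( ($1 - 1) * $2 ))
}

task_id=${PBS_ARRAYID:-1}     # falls back to 1 outside the scheduler
step=${STAGGER_STEP:-300}     # hypothetical variable: 300 s = 5 minutes
delay=$(stagger_delay "$task_id" "$step")

echo "task $task_id waits ${delay}s before launching"
sleep "$delay"
# ./simplex.bin would be started here in the real job script
```

Keep in mind that the sleep counts against the job's walltime: with a 300-second step, task 1000 would wait about 83 hours, so a smaller step or a combination with the slot limit above is probably needed for large arrays.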

Related

How to run shell script commands in an sh file in parallel?

I'm trying to take backup of tables in my database server.
I have around 200 tables. I have a shell script that contains commands to take backups of each table like:
backup.sh
psql -u username ..... table1 ... file1;
psql -u username ..... table2 ... file2;
psql -u username ..... table3 ... file3;
I can run the script and create backups on my machine. But as there are 200 tables, it runs the commands sequentially and takes a lot of time.
I want to run the backup commands in parallel. I have seen articles that suggest using && after each command, or using the nohup or wait commands.
But I don't want to edit the script and modify around 200 such commands.
Is there any way to run this list of shell commands in parallel, something like Node.js does? Is it possible, or am I looking at it wrong?
Sample command in the script:
psql --host=somehost --port=5490 --username=user --dbname=db -c '\copy dbo.tablename TO "/home/username/Desktop/PostgresFiles/tablename.csv" with DELIMITER ","';
You can leverage xargs to run commands in parallel AND control the number of concurrent jobs. Running 200 backup jobs at once might overwhelm your database and result in less than optimal performance.
Assuming you have backup.sh with one backup command per line:
xargs -P5 -I{} bash -c "{}" < backup.sh
The commands in backup.sh should be modified to allow for quoting (use single quotes where possible and escape double quotes):
psql --host=somehost --port=5490 --username=user --dbname=db -c '\copy dbo.tablename TO \"/home/username/Desktop/PostgresFiles/tablename.csv\" with DELIMITER \",\"';
Here -P5 controls the number of concurrent jobs. This approach can process command lines WITHOUT double quotes; for the script above, change "\copy ..." to '\copy ...'.
A simpler alternative is to use a helper script, backup-table.sh, which takes two parameters (table, file), and run:
xargs -P5 -I{} backup-table.sh "{}" < tables.txt
All the complex quoting then goes into backup-table.sh.
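Before pointing xargs at the database, the -P behaviour can be tried with harmless stand-in commands. This is a minimal sketch: the temporary file plays the role of backup.sh and its echo lines are placeholders, not real backup commands.

```shell
# Stand-in for backup.sh: one command per line (placeholder contents).
cmds=$(mktemp)
printf '%s\n' 'echo table1' 'echo table2' 'echo table3' 'echo table4' > "$cmds"

# Run at most 2 commands at a time. Completion order can differ between
# runs, so the output is sorted before it is stored and shown.
result=$(xargs -P2 -I{} bash -c "{}" < "$cmds" | sort)
echo "$result"

rm -f "$cmds"
```

Raising or lowering the number after -P is then a cheap way to find a concurrency level the database tolerates.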
doit() {
table=$1
psql --host=somehost --port=5490 --username=user --dbname=db -c '\copy dbo.'$table' TO "/home/username/Desktop/PostgresFiles/'$table'.csv" with DELIMITER ","';
}
export -f doit
sql --listtables -n postgresql://user:pass@host:5490/db | parallel -j0 doit
Is there any logic in the script other than individual commands (e.g. ifs or processing of output)?
If it's just a file with a list of commands, you could write a wrapper for the script (or a loop from the CLI), e.g.:
$ cat help.txt
echo 1
echo 2
echo 3
$ while read -r i;do bash -c "$i" &done < help.txt
[1] 18772
[2] 18773
[3] 18774
1
2
3
[1] Done bash -c "$i"
[2]- Done bash -c "$i"
[3]+ Done bash -c "$i"
$ while read -r i;do bash -c "$i" &done < help.txt
[1] 18820
[2] 18821
[3] 18822
2
3
1
[1] Done bash -c "$i"
[2]- Done bash -c "$i"
[3]+ Done bash -c "$i"
Each line of help.txt contains a command, and I run a loop that takes each command and runs it in a subshell. (This is a simple example where I just background each job. You could get more sophisticated with something like xargs -P or parallel, but this is a starting point.)
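If the loop lives in a script rather than an interactive shell, a wait after it is probably wanted so the script does not exit before the background commands finish. A small sketch, using temporary files in place of help.txt and the real output:

```shell
cmds=$(mktemp)    # plays the role of help.txt
out=$(mktemp)     # collects the output of every command
printf '%s\n' 'echo 1' 'echo 2' 'echo 3' > "$cmds"

# Start every command in the background, appending its output to one file.
while read -r line; do
  bash -c "$line" >> "$out" &
done < "$cmds"

wait              # returns once all background commands have exited

count=$(wc -l < "$out" | tr -d ' ')
echo "$count commands finished"
rm -f "$cmds"
```

As in the transcript above, the order of the lines in the output file is not guaranteed; only their number is.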

qsub Job using GNU parallel not running

I am trying to execute a qsub job on multiple nodes (2) with a PPN of 20 using GNU parallel. However, it shows an error.
#!/bin/bash
#PBS -l nodes=2:ppn=20
#PBS -l walltime=02:00:00
#PBS -N down
cd $PBS_O_WORKDIR
module load gnu-parallel
for cdr in /scratch/data/v/mt/Downscale/*;do
(cp /scratch/data/v/mt/DWN_FILE_NEW/* $cdr/)
(cd $cdr && parallel -j20 --sshloginfile $PBS_NODEFILE 'echo {} | ./vari_1st_imge' ::: *.DS0 )
done
When I run the above code I get the following error (please note that all the paths have been checked, and the same code runs properly without qsub on a normal computer):
$ ./down
parallel: Error: Cannot open echo {} | ./vari_1st_imge.
And with qsub down, no output is created.
parallel --version reports:
GNU parallel 20140622
Please help me solve the problem.
First try adding --dryrun to parallel.
But my feeling is that $PBS_NODEFILE is not set for some reason, and that GNU Parallel tries to read the command as the --sshloginfile.
To test this:
echo $PBS_NODEFILE
(cd $cdr && parallel --sshloginfile $PBS_NODEFILE -j20 'echo {} | ./vari_1st_imge' ::: *.DS0 )
If GNU Parallel now tries to open -j20, then it is clear that $PBS_NODEFILE is empty.
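The same check can be made explicit in the job script, so that it fails with a readable message instead of a confusing parallel error. This is only a sketch; check_nodefile is a hypothetical helper name, not part of PBS or GNU Parallel.

```shell
# Report whether $PBS_NODEFILE is usable as an --sshloginfile argument.
check_nodefile() {
  if [ -z "${PBS_NODEFILE:-}" ]; then
    echo "PBS_NODEFILE is empty or unset" >&2
    return 1
  fi
  if [ ! -r "$PBS_NODEFILE" ]; then
    echo "cannot read $PBS_NODEFILE" >&2
    return 1
  fi
  echo "using $(wc -l < "$PBS_NODEFILE" | tr -d ' ') node entries"
}

if check_nodefile; then
  echo "node file looks usable for --sshloginfile"
else
  echo "fix the node file before calling parallel"
fi
```

Running the script directly on a login node, as in ./down, would take the second branch, since the scheduler normally sets this variable only inside a job.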

Job submitted with qsub does not write output and enters E status

I have a job called test.sh:
#!/bin/sh -e
#PBS -S /bin/sh
#PBS -V
#PBS -o /my/many/directories/file.log
#PBS -e /my/many/directories/fileerror.log
#PBS -r n
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
#PBS -V
#############################################################################
echo 'hello'
date
sleep 10
date
I submit it with qsub test.sh
It counts to 10 seconds, but it doesn't write hello to file.log or anywhere else. If I include a call to another script that I wrote and need (and which runs fine outside the cluster), the job just goes to Exiting status after those 10 seconds and plainly ignores the call.
Help, please?
Thanks, Ott Toomet, for your suggestion! I found the problem elsewhere: the .tcshrc file had "bash" written in it. Don't ask me why. I deleted it and now the jobs run happily.

Running bash scripts parallel in Linux

I am trying to run a script (1.sh)
spin -a /home/files/1/1.pml;
gcc -O2 -DXUSAFE -DSAFETY -DNOCLAIM -w -o pan pan.c >log1.txt;
./pan -m100000 >log2.txt;
spin -p -s -r -X -v -n123 -l -g -k /home/files/1/1.pml.trail \
-u10000 /home/files/1/1.pml >log3.txt;
The command spin -a ... generates temporary files (pan.c, pan.h), which are used by the next command, gcc -O2 .... If I run the script in a terminal, it creates the temporary files in the same location.
I want to run multiple scripts in parallel. I tried two things. First, I wrote a script (parallel.sh) that runs them in a loop in the background:
for((i=1;i<1800;i++))
do
/home/files/$i/$i.sh &
done
Second, I used GNU parallel: parallel -j0 sh /home/files/{}/{}.sh ::: {1..1800}.
Both methods created the temp files in the location they were called from instead of the script's location.
For example, if I run the script parallel.sh from /home/files, the temp files are created in /home/files instead of /home/files/1, /home/files/2, etc.
Please suggest a method so that the temporary files generated by the scripts 1.sh, 2.sh, ... are created in the directories /home/files/1/, /home/files/2/, ... respectively when I run parallel.sh or GNU parallel in a terminal from /home.
The trick is to change the working directory for each command.
If your computer can really run up to 1800 such processes at the same time without heating up the climate:
for i in {1..1800}; do (cd $i && ./$i.sh) & done
When the processes are CPU-bound, you usually gain no throughput by running more of them in parallel than you have processors:
seq 1 1800 | xargs -n1 -P8 -I% sh -c 'cd % && ./%.sh'
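The effect of the subshell-plus-cd trick can be checked on a throwaway directory tree first. A minimal sketch, with a temporary base directory standing in for /home/files and a one-line echo standing in for each $i.sh:

```shell
base=$(mktemp -d)             # stand-in for /home/files
for i in 1 2 3; do
  mkdir -p "$base/$i"
done

for i in 1 2 3; do
  # The parentheses start a subshell, so the cd affects only this job
  # and relative paths resolve inside the job's own directory.
  ( cd "$base/$i" && echo "made in $i" > temp.txt ) &
done
wait

ls "$base"/*/temp.txt
```

Each temp.txt ends up in its own numbered directory, even though the loop itself never leaves the directory it was started from.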
Try:
parallel 'cd /home/files/{}; sh {}.sh' ::: {1..1800}
It will run one process per core, and may be faster than '-j0' (only testing can tell with certainty).
If your scripts only vary by the number, consider rewriting them as a general script or bash function that takes the number as an argument:
spinit() {
num=$1
spin -a /home/files/$num/$num.pml;
gcc -O2 -DXUSAFE -DSAFETY -DNOCLAIM -w -o pan pan.c >log1.txt;
./pan -m100000 >log2.txt;
spin -p -s -r -X -v -n123 -l -g -k /home/files/$num/$num.pml.trail \
-u10000 /home/files/$num/$num.pml >log3.txt;
}
export -f spinit
parallel 'cd /home/files/{}; spinit {}' ::: {1..1800}

How to pass an argument to a job and keep it unchanged in parallel fashion

I am trying to execute a series of jobs in different directories, and I want to pass the directory as an input argument to each job. So far I understand that I can use an environment variable to send an argument to a job. The problem is that, since the jobs run in parallel, the last value of this variable is used for all jobs. Let's look at my code:
for i in "${arr[@]}"
do
export dir=$i
qsub myBashFile.sh
done
In my job I use the variable dir to do some operations. I want each job to execute with its own input parameter.
Edit: this is my job
#!/bin/sh
#
#
#PBS -N Brownie
#PBS -o test.output.txt
#PBS -e test.error.txt
#PBS -l walltime=2:00:00
#PBS -m n
#PBS -V dir
cd $dir
./run_mycode.sh
I know this is not correct, but I am looking for an alternative way to keep the value of dir unchanged and unique for each job.
I also tried to modify a variable in the job file with the sed command, like below:
sed "s/dir/"'$i'"/g" my_job.sh > alljobs/my_jobNew.sh
but instead of inserting the actual value of $i, dir is replaced with the literal text $i, which is meaningless in my_job.sh.
Have you tried passing the directory as command_args as explained in the manpage qsub(1)? That would be:
for i in "${arr[@]}"
do
qsub myBashFile.sh -- "$i"
done
You should be able to access it as $1 inside myBashFile.sh.
I would use $PBS_O_WORKDIR for this. Change your submission script to this:
for i in "${arr[@]}"
do
cd /path/to/$i
qsub /path/to/myBashFile.sh
done
In your job you would then change 'cd $dir' to 'cd $PBS_O_WORKDIR'.
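A further option, assuming your qsub supports the -v variable_list option (Torque and PBS document it), is to attach the value to each submission instead of exporting it in the calling shell. The sketch below is a dry run: the directory names are invented, and the echo only prints the command lines that would be executed.

```shell
submit_lines=""
for d in runA runB runC; do           # invented directory names
  # Dry run: drop the leading "echo" to actually submit the job.
  line=$(echo qsub -v "dir=$d" myBashFile.sh)
  submit_lines="${submit_lines}${line}
"
done

printf '%s' "$submit_lines"
```

Each job would then see its own value in $dir, so the existing cd $dir line in the job script could stay as it is.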
