I have a job script compile.pbs which runs on a single CPU and compiles source code to create an executable. I then have a second job script, jobscript.pbs, which I submit with 32 CPUs to run that newly created executable with MPI. Both work perfectly when I call them manually in succession, but I would like to automate the process by having the first script submit the second just before it ends. Is there a way to properly nest qsub calls or have them called in succession?
My current attempt is to have the first script call the second right before it ends, but the nested qsub gives a strange error message:
qsub: Bad UID for job execution MSG=ruserok failed validating masterhd/masterhd from s59-16.local
I think the second script is being called properly, but maybe the permissions are not the same as when I call the original one. Obviously my username masterhd is allowed to run the job scripts, because everything works fine when I call them manually. Is there a way to accomplish what I am trying to do?
Here is a more detailed example of the procedure. First I submit the first job script and specify a variable with -v:
qsub -v outpath='/home/dest_folder/' compile.pbs
That outpath variable just specifies where to copy the new executable; compile.pbs copies the executable there, changes to that directory, and then attempts to submit jobscript.pbs.
compile.pbs:
#!/bin/bash
#PBS -N compile
#PBS -l walltime=0:05:00
#PBS -j oe
#PBS -o ocompile.txt
#Perform compiling stuff:
module load gcc-openmpi-1.2.7
rm *.o
make -f Makefile
#Copy the executable to the destination:
cp visct ${outpath}/visct
#Change to the output path before calling the next jobscript:
cd ${outpath}
qsub jobscript.pbs
jobscript.pbs:
#!/bin/bash
#PBS -N run_exe
#PBS -l nodes=32
#PBS -l walltime=96:00:00
#PBS -j oe
#PBS -o results.txt
cd $PBS_O_WORKDIR
module load gcc-openmpi-1.2.7
time mpiexec visct
You could make a submission script that qsubs both jobs, but makes the second one eligible to run only if, and after, the first has completed without errors:
JOB1OUT=$(qsub -v outpath='/home/dest_folder/' compile.pbs)  # qsub prints the new job's id, e.g. 12345.server
JOB1ID=${JOB1OUT%%.*}  # parse out the job id; adjust if your site's ids look different
JOB2OUT=$(qsub -W depend=afterok:$JOB1ID jobscript.pbs)
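With afterok, the scheduler keeps the second job on hold until the first exits with status 0, so a failed compile never releases the 96-hour run. You can confirm the dependency was registered by querying the held job (the depend attribute below is how Torque reports it; other PBS flavors may format it differently):
qstat -f ${JOB2OUT%%.*} | grep -i depend  # e.g. depend = afterok:12345.server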
It's possible that there are restrictions on your system against submitting jobs from inside jobs; the ruserok failure in your error message suggests the compute node you land on is not authorized as a submission host, which only your administrators can change. Note also the mismatch in limits: your first job runs for at most 5 minutes, while the second needs 96 hours, so the second job has to be a separate submission rather than work done inside the first.
Why can't you just put the compile part at the beginning of the second script?
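For example, a single combined script might look like this (a minimal sketch stitched together from the two scripts above; it does tie up the full 32-node allocation during the short compile step, but it sidesteps the submission-host restriction entirely):
#!/bin/bash
#PBS -N compile_and_run
#PBS -l nodes=32
#PBS -l walltime=96:00:00
#PBS -j oe
#PBS -o results.txt
cd $PBS_O_WORKDIR
module load gcc-openmpi-1.2.7
# Serial compile step (runs on the first allocated node)
rm -f *.o
make -f Makefile
cp visct ${outpath}/visct
# Parallel run step with the fresh executable
cd ${outpath}
time mpiexec visct
You would submit it the same way as before, e.g. qsub -v outpath='/home/dest_folder/' combined.pbs (the name combined.pbs is just a placeholder).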
I am running my program on a high-performance computer, usually with different parameters as input. Those parameters are given to the program via a parameter file, i.e. the qsub file looks like this:
#!/bin/bash
#PBS -N <job-name>
#PBS -A <name>
#PBS -l select=1:ncpus=20:mpiprocs=20
#PBS -l walltime=80:00:00
#PBS -M <mail-address>
#PBS -m bea
module load foss
cd $PBS_O_WORKDIR
mpirun main parameters.prm
# Append the job statistics to the std out file
qstat -f $PBS_JOBID
Now I usually run the same program multiple times, more or less simultaneously, with different parameter .prm files. Nevertheless, they all show up in the job list with the same name, which makes it difficult (though not impossible) to match a job in the list to the parameters it uses.
Is there a way to change the name of the job in the list dynamically, depending on the input parameters used (ideally from within main)? Or is there another way to change the job name without having to edit the job file every time I run
qsub job_script.pbs
?
Would a solution be to create a shell script that reads the data from the parameter file and then generates and submits the job script? Or are there easier ways?
Simply use the -N option on the command line:
qsub -N job1 job_script.pbs
You can then use a for loop to iterate over the *.prm files:
for prm in *.prm
do
    prmbase=$(basename "$prm" .prm)
    qsub -N "$prmbase" -v prm="$prm" job_script.pbs
done
This names each job after its parameter file, sans the .prm suffix, and hands the file name to the job through the prm environment variable.
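For this to work, job_script.pbs has to pick the file name up from that variable instead of hard-coding parameters.prm, i.e. its run line becomes:
mpirun main $prm
(Passing the name via -v is an assumption on my part; if your site's qsub supports script arguments directly, e.g. -F on recent Torque, that works too.)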
I'm running a bash script that submits some PBS jobs on a Linux-based cluster multiple times. Each submission calls MATLAB, reads some data, performs calculations, and writes the results back to my directory.
This process works fine with one exception. For some calculations the m-file starts, loads everything, then performs the calculation, but while printing the results to stdout the job terminates.
The PBS log file shows no error messages, and MATLAB shows no error messages. The code runs perfectly on my own computer. I am out of ideas.
If anyone has an idea what I could do, I would appreciate it.
Thanks in advance,
jbug
Edit: Is there a possibility to force MATLAB to reach the end of the file? Might that help?
Edit #18:00: As requested by HBHB in the comments below, here is how MATLAB is called by the external *.sh file:
#PBS -l nodes=1:ppn=2
#PBS -l pmem=1gb
#PBS -l mem=1gb
#PBS -l walltime=00:05:00
module load MATLAB/R2015b
cd $PBS_O_WORKDIR
matlab -nosplash -nodisplay -nojvm -r "addpath('./data/calc');myFunc("$a","$b"),quit()"
where $a and $b come from a loop within the calling bash file, and ./data/calc points to the directory where myFunc is located.
Edit #18:34: If I perform the calculation manually, everything runs fine, so the given data is fine; that seems to narrow it down to PBS?
Edit #21:27: I put an until loop around the MATLAB call that checks whether MATLAB returned the desired data; if not, it should restart MATLAB after some delay. But still, MATLAB stops after the finished calculation while printing the results (some matrices), and the job even finishes; the checking part that would trigger the restart is never reached.
What I don't understand: the job stays in the queue, as I planned with the small delay, so the sleep $w should be executed? But when I check the error files, they just show me the frozen MATLAB from its first round, recognizable by the echoed counter i. Here is that part of the code; maybe you can help me:
# w = wait time between retries (seconds)
i=1
until [[ -e ./temp/$b/As$a && -e ./temp/$b/Bs$a && -e ./temp/$b/Cs$a && -e ./temp/$b/lamb$a ]]
do
    echo $i
    matlab -nosplash -nodisplay -nojvm -r "addpath('./data/calc');myFunc("$a","$b"),quit()"
    sleep $w
    ((i=i+1))
done
You are most likely starving your MATLAB process of memory. Your PBS file:
#PBS -l nodes=1:ppn=2
#PBS -l pmem=1gb
#PBS -l mem=1gb
#PBS -l walltime=00:05:00
You are capping physical memory at 1 GB. MATLAB on its own, without loading any files, uses around 900 MB of virtual memory. Try:
#PBS -l nodes=1:ppn=1
#PBS -l pvmem=5gb
#PBS -l walltime=00:05:00
Additionally, this is something you should contact your local system administrators about. Without system logs I can't tell you for sure why your job is cut short (my guess is resource limits), but as a sysadmin at an HPC center I can tell you that they would be able to find out in about five minutes why your job is not working correctly. Different HPC centers also use different PBS configurations, so mem might not even be recognized on yours; this is something your local administrators can help you with much better than Stack Overflow.
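One quick check you can run yourself in the meantime: print the limits the batch system actually applied at the top of the job script (plain shell commands, nothing PBS-specific, so this should work on any cluster):
# before the MATLAB call in the PBS script
ulimit -a  # the memory/address-space limits imposed on the job
free -m    # what the node itself has available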
I am currently running multiple Java programs using qsub.
I wrote two scripts: 1) qsub.sh, 2) run.sh
qsub.sh
#! /bin/bash
echo cd `pwd` \; "$@" | qsub
run.sh
#! /bin/bash
for param in 1 2 3
do
./qsub.sh java -jar myProgram.jar -param ${param}
done
Given the two scripts above, I submit jobs by
sh run.sh
I want to redirect the messages generated by myProgram.jar -param ${param} to a file, so in run.sh I replaced the fourth line with the following:
./qsub.sh java -jar myProgram.jar -param ${param} > output-${param}.txt
but the message stored in output-${param}.txt is "Your job 730 ("STDIN") has been submitted", which is not what I intended.
I know that qsub has an option -o for specifying the location of output, but I cannot figure out how to use this option for my case.
Can anyone help me?
Thanks in advance.
The issue is that qsub doesn't return the output of your job; it returns the output of the qsub command itself, which simply informs the resource manager / scheduler that you want the job to run.
You want to use the qsub -o option, but remember that the output won't appear there until the job has run to completion. For Torque you would use qstat to check the status of your job; all other resource managers / schedulers have similar commands.
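For example, qsub.sh could take the output file as its first argument and forward it to qsub's -o option (a sketch based on your two scripts; on Torque you could also add -j oe to merge stderr into the same file, and the Grid Engine equivalent is -j y):
#! /bin/bash
# qsub.sh -- first argument is the output file, the rest is the command to run
out=$1; shift
echo cd `pwd` \; "$@" | qsub -o "$out"
And in run.sh: ./qsub.sh output-${param}.txt java -jar myProgram.jar -param ${param}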
I have the following lines at the beginning of my script:
#!/bin/bash
#PBS -j oe
#PBS -o ~/output/a
After submitting this script with qsub, and after the job is completed, there is no file a under ~/output/. What am I missing here?
I just tried a couple of test jobs, and it seems that Torque doesn't like the use of '~'. I received this in my email:
Aborted by PBS Server
Job cannot be executed
See Administrator for help
I would replace '~' with the full path to your home directory and try it again.
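For example, assuming your home directory is /home/yourname (a placeholder), the header would become the lines below. Note that #PBS directives are generally not shell-expanded, so $HOME typically won't work there either:
#!/bin/bash
#PBS -j oe
#PBS -o /home/yourname/output/a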
My problem is specific to running SPEC CPU2006 (a benchmark suite).
After I installed the benchmark, I can invoke a command called "specinvoke" in the terminal to run a specific benchmark. I have another script where part of the code looks like the following:
cd (specific benchmark directory)
specinvoke &
pid=$!
My goal is to get the PID of the running benchmark. However, with what is shown above, what I get is the PID of the "specinvoke" command itself, while the real running task has another PID. Running specinvoke -n, however, prints to stdout the actual command that specinvoke runs. For example, for one benchmark it looks like this:
# specinvoke r6392
# Invoked as: specinvoke -n
# timer ticks over every 1000 ns
# Use another -n on the command line to see chdir commands and env dump
# Starting run for copy #0
../run_base_ref_gcc43-64bit.0000/milc_base.gcc43-64bit < su3imp.in > su3imp.out 2>> su3imp.err
So internally it runs a binary, and the command differs from benchmark to benchmark (depending on which benchmark directory you invoke it in). And because "specinvoke" is an installed executable, not just a script, I cannot use "source specinvoke".
So, any clues? Is there a way to invoke the command in the same shell (so it has the same PID), or should I instead dump the output of specinvoke -n and run that directly?
You can still do something like:
cd (specific benchmark directory)
specinvoke &
sleep 1  # give specinvoke a moment to spawn the benchmark binary
pid=$(pgrep milc_base.gcc43-64bit)
If there are several invocations of the milc_base.gcc43-64bit binary, you can instead use
pid=$(pgrep -n milc_base.gcc43-64bit)
Which according to the man page:
-n     Select only the newest (most recently started) of the matching processes
When the process is a direct child of the shell, plain ps works:
ps -o pid= --ppid $!  # list the PID(s) of specinvoke's direct children
When it is not a direct child, you can get the info from pstree:
pstree -p $! | grep -o 'milc_base.gcc43-64bit(.*)'
Output from the above (the PID is in parentheses): milc_base.gcc43-64bit(9837)
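If you need the bare PID rather than the name(pid) string, you could strip it out with sed (a sketch reusing the binary name from the question):
pid=$(pstree -p $! | sed -n 's/.*milc_base.gcc43-64bit(\([0-9]*\)).*/\1/p')
echo $pid  # e.g. 9837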