No output from submitted job using qsub? - pbs

I have the following lines at the beginning of my script:
#!/bin/bash
#PBS -j oe
#PBS -o ~/output/a
After submitting this script with qsub, and after the job is completed, there is no file a under ~/output/. What am I missing here?

I just tried a couple of test jobs and it seems that Torque doesn't like the use of '~'. I received this in my email:
Aborted by PBS Server
Job cannot be executed
See Administrator for help
I would replace '~' with the path to your home directory and try it again.
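For example, a corrected header might look like this minimal sketch, where /home/username is a placeholder for your actual home directory (the '~' is not expanded in the directive, so the path has to be written out literally):
#!/bin/bash
#PBS -j oe
#PBS -o /home/username/output/a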

Related

Why doesn't Torque qsub create an output file?

I am trying to start a task on a cluster via Torque PBS with the command
qsub -o a.txt a.sh
The file a.sh contains a single line:
hostname
After running qsub I run qstat, which gives the following output:
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
302937.voms a.sh user 00:00:00 E long
After about 5 seconds qstat returns empty output (no jobs in the queue).
The command
qsub --version
gives the output: version: 2.5.13
The command
which qsub
gives the output: /usr/bin/qsub
The problem is that the file a.txt (from the command qsub -o a.txt a.sh) is not created! Only the job id is returned in the terminal; there are no errors. The command
qsub a.sh
behaves the same way. How can I fix this? Where are the qsub log files with errors?
If I use the command
qsub -l nodes=node36:ppn=1 -o a.txt a.sh
then I can find the output files in the folder
/var/spool/pbs/undelivered
on node36 (after logging in to it via ssh).
The output file contains the string "node36" and the error file is empty.
Why are my files "undelivered"?
The output and error log files are kept on the execution node in a spool directory and copied back to the head node after the job has completed. The location of the spool directory may vary, but you should look for it under /var/torque/spool on the first node from the list of nodes the job has been allocated.
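For example, while the job is still running you could peek at the spooled files on that node with something like the following (node36 is taken from the question, and the exact spool path can differ between installations):
ssh node36 'ls -l /var/torque/spool/'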
There are multiple reasons that might cause torque to fail to deliver the output files.
The user submitting the job might not exist on the node or their home directory might not be accessible, or there is a user ID mismatch between the nodes of the cluster.
Torque is using ssh to copy files to the head node, but passwordless public key authentication for the user to ssh across the cluster has not been set up consistently on all the nodes.
A node failed during the execution of the job.
This list is by no means complete; a number of questions here on Stack Overflow deal with such failures. Check whether any of the above applies to your case, for example with a quick check like the sketch below.
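A minimal sketch of how the first two points might be checked from the head node (node36 comes from the question; "headnode" is a placeholder for your head node's hostname):
# is the numeric user id the same on the head node and on node36?
id -u
ssh node36 id -u
# can node36 copy a file back to the head node without a password prompt?
ssh node36 'scp /etc/motd headnode:/tmp/copyback-test && echo copy-back ok'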
You (or anyone else finding this thread) should also check out the solution given here:
PBS, refresh stdout
If you have admin access, you can set
$spool_as_final_name true
which causes the output to be written directly to the final destination.
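With admin access, a sketch of how that might be applied on an execution node; the config path is a common Torque default and may differ on your installation, and the restart command depends on your init system:
# append the option to pbs_mom's config and restart the mom
echo '$spool_as_final_name true' >> /var/spool/torque/mom_priv/config
service pbs_mom restart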

Redirect output of my java program under qsub

I am currently running multiple Java executable programs using qsub.
I wrote two scripts: 1) qsub.sh, 2) run.sh
qsub.sh
#! /bin/bash
echo cd `pwd` \; "$@" | qsub
run.sh
#! /bin/bash
for param in 1 2 3
do
./qsub.sh java -jar myProgram.jar -param ${param}
done
Given the two scripts above, I submit jobs by
sh run.sh
I want to redirect the messages generated by myProgram.jar -param ${param}
So in run.sh, I replaced the 4th line with the following
./qsub.sh java -jar myProgram.jar -param ${param} > output-${param}.txt
but the message stored in output-${param}.txt is "Your job 730 ("STDIN") has been submitted", which is not what I intended.
I know that qsub has an option -o for specifying the location of output, but I cannot figure out how to use this option for my case.
Can anyone help me?
Thanks in advance.
The issue is that qsub doesn't return the output of your job; it returns the output of the qsub command itself, which simply informs your resource manager / scheduler that you want that job to run.
You want to use the qsub -o option, but you need to remember that the output won't appear there until the job has run to completion. For Torque, you'd use qstat to check the status of your job, and all other resource managers / schedulers have similar commands.
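For example, the wrapper could forward an output path to qsub -o. This sketch assumes qsub.sh is changed so that its first argument names the output file (an assumed interface, not the original one):
#!/bin/bash
# first argument: file for the job's stdout; remaining arguments: the command to run
outfile="$1"; shift
echo cd `pwd` \; "$@" | qsub -o "$outfile"
The loop in run.sh would then call ./qsub.sh output-${param}.txt java -jar myProgram.jar -param ${param}, and the program's messages would appear in output-${param}.txt once the job completes.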

Torque+MAUI PBS submitted job strange startup

I am using a Torque+MAUI cluster.
The cluster's current utilization is about 10 of the 40 available nodes, with a lot of jobs queued but unable to start.
I submitted the following PBS script using qsub:
#!/bin/bash
#
#PBS -S /bin/bash
#PBS -o STDOUT
#PBS -e STDERR
#PBS -l walltime=500:00:00
#PBS -l nodes=1:ppn=32
#PBS -q zone0
cd /somedir/workdir/
java -Xmx1024m -Xms256m -jar client_1_05.jar
The job gets R(un) status immediately, but I got this abnormal output from qstat -n:
8655.cluster.local user zone0 run.sh -- 1 32 -- 500:00:00 R 00:00:31
z0-1/0+z0-1/1+z0-1/2+z0-1/3+z0-1/4+z0-1/5+z0-1/6+z0-1/7+z0-1/8+z0-1/9
+z0-1/10+z0-1/11+z0-1/12+z0-1/13+z0-1/14+z0-1/15+z0-1/16+z0-1/17+z0-1/18
+z0-1/19+z0-1/20+z0-1/21+z0-1/22+z0-1/23+z0-1/24+z0-1/25+z0-1/26+z0-1/27
+z0-1/28+z0-1/29+z0-1/30+z0-1/31
The abnormal part is the -- in run.sh -- 1 32: the session id is missing, and evidently the script does not run at all, i.e. the java program never shows any trace of having started.
After running in this strange state for about 5 minutes, the job is set back to Q(ueue) status and seemingly never runs again (I monitored this for about a week and it did not run even after being queued as the topmost job).
I tried submitting the same job 14 times and monitored the allocated nodes with qstat -n: 7 copies ran successfully on various nodes, but every job allocated to z0-1/* got stuck with this strange startup behavior.
Does anyone know a solution to this issue?
As a temporary workaround, how can I specify in the PBS script NOT to use those problematic nodes?
It sounds like something is wrong with those nodes. One solution would be to offline the nodes that aren't working: pbsnodes -o <node name> and allow the cluster to continue to work. You may need to release the holds on any jobs. I believe you can run releasehold ALL to accomplish this in Maui.
Once you take care of that I'd investigate the logs on those nodes (start with the pbs_mom logs and the syslogs) and figure out what is wrong with them. Once you figure out and correct what is wrong with them, you can put the nodes back online: pbsnodes -c <node_name>. You may also want to look into setting up some node health scripts to proactively detect and handle these situations.
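For illustration, with the z0-1 node from the question's qstat output, that admin-side sequence would look like:
pbsnodes -o z0-1      # mark the suspect node offline so no new jobs are placed on it
# ...inspect the pbs_mom logs and syslog on z0-1 and fix the underlying problem...
pbsnodes -c z0-1      # clear the offline state and return the node to service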
For users: contact your administrator and, in the meantime, run the job using this workaround:
Use pbsnodes to check for free and healthy nodes.
Modify the PBS directive: #PBS -l nodes=<freenode1>:ppn=<ppn1>+<freenode2>:ppn=<ppn2>+... (see the sketch below).
Submit the job using qsub.
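For example, if pbsnodes reports z0-2 as free and healthy, the request from the question might become the following (z0-2 is only an illustrative node name; use whatever pbsnodes shows on your cluster):
#PBS -l nodes=z0-2:ppn=32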

Torque nested/successive qsub call

I have a jobscript compile.pbs which runs on a single CPU and compiles source code to create an executable. I then have a second job script jobscript.pbs which I call using 32 CPUs to run that newly created executable with MPI. They both work perfectly when I call them manually in succession, but I would like to automate the process by having the first script call the second jobscript just before it ends. Is there a way to properly nest qsub calls or have them be called in succession?
Currently my attempt is to have the first script call the 2nd script right before it ends, but when I try that I get a strange error message from the 2nd (nested) qsub:
qsub: Bad UID for job execution MSG=ruserok failed validating masterhd/masterhd from s59-16.local
I think the 2nd script is being called properly, but maybe the permissions are not the same as when I called the original one. Obviously my user name masterhd is allowed to run the jobscripts because it works fine when I call the jobscript manually. Is there a way to accomplish what I am trying to do?
Here is a more detailed example of the procedure. First I call the first jobscript and specify a variable with -v:
qsub -v outpath='/home/dest_folder/' compile.pbs
That outpath variable just specifies where to copy the new executable; compile.pbs then changes to that output directory and attempts to qsub jobscript.pbs.
compile.pbs:
#!/bin/bash
#PBS -N compile
#PBS -l walltime=0:05:00
#PBS -j oe
#PBS -o ocompile.txt
#Perform compiling stuff:
module load gcc-openmpi-1.2.7
rm *.o
make -f Makefile
#Copy the executable to the destination:
cp visct ${outpath}/visct
#Change to the output path before calling the next jobscript:
cd ${outpath}
qsub jobscript.pbs
jobscript.pbs:
#!/bin/bash
#PBS -N run_exe
#PBS -l nodes=32
#PBS -l walltime=96:00:00
#PBS -j oe
#PBS -o results.txt
cd $PBS_O_WORKDIR
module load gcc-openmpi-1.2.7
time mpiexec visct
You could make a submitting script that qsubs both jobs, but makes the second execute only if, and after, the first was completed without errors:
JOB1CMD="qsub -v outpath='/home/dest_folder/' compile.pbs -t" # -t for terse output
JOB1OUT=$(eval $JOB1CMD)
JOB1ID=${JOB1OUT%%.*} # parse to get job id, change accordingly
JOB2CMD="qsub jobscript.pbs -W depend=afterok:$JOB1ID"
eval $JOB2CMD
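If the last line is changed to capture the output as well, the dependency can be checked after submission; a sketch assuming Torque's qstat:
JOB2OUT=$(eval $JOB2CMD)
JOB2ID=${JOB2OUT%%.*}
# the second job should stay queued with a dependency hold until the first exits successfully
qstat -f $JOB2ID | grep -i depend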
It's possible that there are restrictions on your system against submitting jobs from inside other jobs. Your first job runs for only 5 minutes while the second job needs 96 hours; if the second job is requested inside the first job, that would violate the time limit of the first job.
Why can't you just put the compile part at the beginning of the second script?
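If you take that single-script route, a minimal sketch of the combined job, reusing the resource requests and module from the question, might look like this (whether compiling inside the 32-node allocation is acceptable is a trade-off to weigh):
#!/bin/bash
#PBS -N compile_and_run
#PBS -l nodes=32
#PBS -l walltime=96:00:00
#PBS -j oe
#PBS -o results.txt
cd $PBS_O_WORKDIR
module load gcc-openmpi-1.2.7
# build first, then launch the MPI run only if the build succeeded
make -f Makefile && time mpiexec visct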

Can I delete a shell script after it has been submitted using qsub without affecting the job?

I want to submit a bunch of jobs using qsub; the jobs are all very similar. I have a script with a loop that, on each iteration, overwrites a file tmpjob.sh and then runs qsub tmpjob.sh. Before the job has had a chance to run, tmpjob.sh may have been overwritten by the next iteration of the loop. Is another copy of tmpjob.sh stored while the job is waiting to run, or do I need to be careful not to change tmpjob.sh before the job has begun?
Assuming you're talking about Torque, then yes; Torque reads in the script at submission time. In fact the submission script need never exist as a file at all; as shown by an example in the Torque documentation, you can pipe commands to qsub (from the docs: cat pbs.cmd | qsub).
But several other batch systems (SGE/OGE, PBS PRO) use qsub as a queue submission command, so you'll have to tell us what queuing system you're using to be sure.
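Applied to the loop in the question, the pipe form removes the need for tmpjob.sh entirely; a sketch, where the generated command is a placeholder for whatever each iteration currently writes into tmpjob.sh:
for i in 1 2 3
do
    # generate each job's commands inline instead of rewriting tmpjob.sh
    echo "./myprogram --input data_${i}" | qsub -N job_${i}
done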
Yes. You can even create jobs and sub-jobs with HERE Documents. Below is an example of a test I was doing with a script initiated by a cron job:
#!/usr/bin/env bash
printenv
qsub -N testCron -l nodes=1:vortex:compute -l walltime=1:00:00 <<QSUB
cd \$PBS_O_WORKDIR
printenv
qsub -N testsubCron -l nodes=1:vortex:compute -l walltime=1:00:00 <<QSUBNEST
cd \$PBS_O_WORKDIR
pwd
date -Isec
QSUBNEST
QSUB
