"qsub script.sh" yielding "Unknown queue" error - pbs

Let's say I have two bash scripts (small.sh & super.sh).
small.sh
#!/bin/bash
cd /current_path/
chmod a+x *.sh
bash super.sh
super.sh
#!/bin/bash
qsub test.sh
When I submit my job to the PBS system with
qsub small.sh
super.sh is not executed. That means it never runs
qsub test.sh
Am I doing something wrong? How can I achieve this?

If your script has no #PBS directives, and you don't submit with something like qsub -q batch ..., then it seems like you either a) have no default queue defined, or b) the queue name being submitted to does not exist (or has a typo). Run this (as an admin) to see the default queue:
qmgr -c 'print server default_queue'
Run this to see the queue settings:
qmgr -c 'print queue <queue_name>'
If you have no default queue, then either set one, or make sure to always submit directly to a queue with qsub -q <queue_name> ... (and of course make sure the queue actually exists, which you can still check with print queue as mentioned above).
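If you do have admin access, here is a minimal sketch of setting a default queue; the queue name batch is only an example and must already exist on your server:
qmgr -c 'set server default_queue = batch'
# verify the change
qmgr -c 'print server default_queue'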

This is what I found out from here:
Queue is Unknown
Be sure to use the correct queue. For Pleiades jobs, use the common queue names normal, long, vlong, and debug. For Endeavour jobs, use the queue names e_normal, e_long, e_vlong, and e_debug. The PBS server pbspl1 recognizes the queue names for both Pleiades and Endeavour, and will route them appropriately. However, the pbspl3 server only recognizes the queue names for Endeavour jobs, as shown below:
pfe20% qsub -q normal@pbspl3 job_script
qsub: unknown queue

Related

How can I find out the "command" (batch script filename) of a finished SLURM job?

I often have lots of SLURM jobs running from different directories. Therefore, it is useful to query the workdir of the jobs. I can do this for jobs in the queue (e.g. pending, running, etc.) with something like this:
squeue -u $USER -o "%i %Z"
and I can do this for finished jobs (e.g. completed, timeout, cancelled, etc.) with something like this:
sacct -u $USER -o JobID,WorkDir
The problem is, sometimes I have a directory with two (or more) SLURM batch scripts in it, e.g. submit.sh and restart.sh. Therefore, it is also useful to query the "command" of the jobs, i.e. the filename of the batch script. I can do this for jobs in the queue with something like this:
squeue -u $USER -o "%i %o"
However, from checking the documentation of sacct and playing around with sacct, there appears to be no equivalent option for sacct so I cannot currently get the command for finished jobs. I also cannot use the squeue method for finished jobs - it just says slurm_load_jobs error: Invalid job id specified because finished jobs are not included in the squeue list. So, how can I find out the command of a finished SLURM job (using sacct or otherwise)?
Slurm indeed does not store the command in the accounting database. Two workarounds:
For a single user: use the JobName or Comment to store the script name upon submission (a sketch follows below). These are stored in the database, but this approach is error-prone;
Cluster-wide: enable the job completion plugin for Elasticsearch, as this stores not only the script name but the whole script contents as well.
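A minimal sketch of the first (single-user) workaround, assuming a batch script named submit.sh; the column widths (%30, %60) are arbitrary:
# store the script name in the job name at submission time
sbatch --job-name=submit.sh submit.sh
# after the job finishes, the name is still recoverable from the accounting database
sacct -u $USER -o JobID,JobName%30,WorkDir%60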

Why doesn't Torque qsub create an output file?

I'm trying to start a task on a cluster via Torque PBS with the command
qsub -o a.txt a.sh
The file a.sh contains a single line:
hostname
After the qsub command I run qstat, which gives the following output:
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
302937.voms a.sh user 00:00:00 E long
After 5 seconds the qstat command returns empty output (no jobs in the queue).
The command
qsub --version
gives the output: version: 2.5.13
The command
which qsub
Output: /usr/bin/qsub
The problem is that the file a.txt (from the command qsub -o a.txt a.sh) is not created! Only the job ID is returned in the terminal; there are no errors. The command
qsub a.sh
has the same behavior. How can I fix it? Where are the qsub log files with errors?
If I use the command
qsub -l nodes=node36:ppn=1 -o a.txt a.sh
then I can find the output files in the folder
/var/spool/pbs/undelivered
on node36 (after logging in to it via ssh).
The output file contains the string "node36"; the error file is empty.
Why are my files "undelivered"?
The output log and error log files are kept on the execution node in a spool directory and copied back to the head node after the job has completed. The location of the spool directory may vary, but you should look for it under /var/torque/spool on the first node from the list of nodes allocated to the job.
There are multiple reasons that might cause torque to fail to deliver the output files.
The user submitting the job might not exist on the node or their home directory might not be accessible, or there is a user ID mismatch between the nodes of the cluster.
Torque is using ssh to copy files to the head node, but passwordless public key authentication for the user to ssh across the cluster has not been set up consistently on all the nodes.
A node failed during the execution of the job.
This list is by no means complete. Already here on Stack Overflow one can find a number of questions dealing with such a failure. Try to check if any of the above applies to your case.
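A quick way to test the passwordless-ssh cause, assuming a head node named headnode (a hypothetical hostname; substitute your own); run this as the submitting user on the execution node:
# should print "ok" without any password prompt; BatchMode makes ssh fail instead of prompting
ssh -o BatchMode=yes headnode 'echo ok'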
You (or anyone else finding this thread) should also check out the solution given here:
PBS, refresh stdout
If you have admin access, you can set
$spool_as_final_name true
which causes the output to be written directly to the final destination.
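For reference, $spool_as_final_name is a pbs_mom parameter; a minimal sketch of enabling it, assuming the common Torque home of /var/spool/torque (the config path and service name vary by installation):
# on each execution node, append the setting to the MOM config and restart pbs_mom
echo '$spool_as_final_name true' >> /var/spool/torque/mom_priv/config
service pbs_mom restart   # or the equivalent systemctl/init command on your system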

Redirect output of my java program under qsub

I am currently running multiple Java executable programs using qsub.
I wrote two scripts: 1) qsub.sh, 2) run.sh
qsub.sh
#! /bin/bash
echo cd `pwd` \; "$@" | qsub
run.sh
#! /bin/bash
for param in 1 2 3
do
./qsub.sh java -jar myProgram.jar -param ${param}
done
Given the two scripts above, I submit jobs by
sh run.sh
I want to redirect the messages generated by myProgram.jar -param ${param}.
So in run.sh, I replaced the 4th line with the following:
./qsub.sh java -jar myProgram.jar -param ${param} > output-${param}.txt
but the message stored in output-${param}.txt is "Your job 730 ("STDIN") has been submitted", which is not what I intended.
I know that qsub has an option -o for specifying the location of output, but I cannot figure out how to use this option for my case.
Can anyone help me?
Thanks in advance.
The issue is that qsub doesn't return the output of your job; it returns the output of the qsub command itself, which simply informs your resource manager / scheduler that you want that job to run.
You want to use the qsub -o option, but you need to remember that the output won't appear there until the job has run to completion. For Torque, you'd use qstat to check the status of your job, and all other resource managers / schedulers have similar commands.
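A minimal sketch of one way to wire -o into the asker's pipe-to-qsub pattern (the first argument to qsub.sh becomes the output file; -j oe, which merges stderr into the same file, is optional):
#!/bin/bash
# qsub.sh: first argument is the output file, the rest is the command to run
out="$1"; shift
echo cd `pwd` \; "$@" | qsub -o "${out}" -j oe
and in run.sh:
./qsub.sh output-${param}.txt java -jar myProgram.jar -param ${param}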

Torque+MAUI PBS submitted job strange startup

I am using a Torque+MAUI cluster.
The cluster's current utilization is about 10 of the 40 available nodes, yet a lot of jobs are queued and cannot be started.
I submitted the following PBS script using qsub:
#!/bin/bash
#
#PBS -S /bin/bash
#PBS -o STDOUT
#PBS -e STDERR
#PBS -l walltime=500:00:00
#PBS -l nodes=1:ppn=32
#PBS -q zone0
cd /somedir/workdir/
java -Xmx1024m -Xms256m -jar client_1_05.jar
The job gets R(un) status immediately, but I get this abnormal output from qstat -n:
8655.cluster.local user zone0 run.sh -- 1 32 -- 500:00:00 R 00:00:31
z0-1/0+z0-1/1+z0-1/2+z0-1/3+z0-1/4+z0-1/5+z0-1/6+z0-1/7+z0-1/8+z0-1/9
+z0-1/10+z0-1/11+z0-1/12+z0-1/13+z0-1/14+z0-1/15+z0-1/16+z0-1/17+z0-1/18
+z0-1/19+z0-1/20+z0-1/21+z0-1/22+z0-1/23+z0-1/24+z0-1/25+z0-1/26+z0-1/27
+z0-1/28+z0-1/29+z0-1/30+z0-1/31
The abnormal part is the -- in run.sh -- 1 32: the session ID is missing, and evidently the script does not run at all, i.e. the java program never shows any trace of being started.
After this kind of strange running for ~5 minutes, the job is set back to Q(ueue) status and seemingly will not run again (I monitored this for ~1 week and it does not run even after being queued as the topmost job).
I tried submitting the same job 14 times and monitored the assigned nodes in qstat -n: 7 copies ran successfully on various nodes, but all jobs allocated to z0-1/* got stuck with this strange startup behavior.
Does anyone know a solution to this issue?
As a temporary workaround, how can I specify NOT to use those problematic nodes in the PBS script?
It sounds like something is wrong with those nodes. One solution would be to offline the nodes that aren't working: pbsnodes -o <node name> and allow the cluster to continue to work. You may need to release the holds on any jobs. I believe you can run releasehold ALL to accomplish this in Maui.
Once you take care of that I'd investigate the logs on those nodes (start with the pbs_mom logs and the syslogs) and figure out what is wrong with them. Once you figure out and correct what is wrong with them, you can put the nodes back online: pbsnodes -c <node_name>. You may also want to look into setting up some node health scripts to proactively detect and handle these situations.
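A minimal sketch of the admin-side steps described above (z0-1 is the node name from the asker's qstat output; releasehold is a Maui command):
# mark the suspect node offline so the scheduler stops placing jobs on it
pbsnodes -o z0-1
# (Maui) release any holds on queued jobs
releasehold ALL
# after fixing the node, bring it back online
pbsnodes -c z0-1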
For users: contact your administrator and, in the meantime, run the job using this workaround.
Use pbsnodes to check for free and healthy nodes
Modify PBS directive #PBS -l nodes=<freenode1>:ppn=<ppn1>+<freenode2>:ppn=<ppn2>+...
Submit the job using qsub
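A concrete sketch of these three steps, with a hypothetical node name (z0-2) standing in for whatever free nodes pbsnodes reports on your cluster:
# 1. list node names and states; pick nodes reporting "state = free"
pbsnodes -a | grep -E '^[^ ]|state ='
# 2. in the job script, replace "#PBS -l nodes=1:ppn=32" with an explicit host request, e.g.
#      #PBS -l nodes=z0-2:ppn=32
# 3. submit as before
qsub run.sh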

Can I delete a shell script after it has been submitted using qsub without affecting the job?

I want to submit a bunch of jobs using qsub - the jobs are all very similar. I have a script with a loop, and in each iteration it overwrites a file tmpjob.sh and then does qsub tmpjob.sh. Before the job has had a chance to run, tmpjob.sh may have been overwritten by the next iteration of the loop. Is another copy of tmpjob.sh stored while the job is waiting to run? Or do I need to be careful not to change tmpjob.sh before the job has begun?
Assuming you're talking about Torque, then yes; Torque reads in the script at submission time. In fact the submission script need never exist as a file at all; as shown in the Torque documentation, you can pipe commands straight into qsub (from the docs: cat pbs.cmd | qsub).
But several other batch systems (SGE/OGE, PBS Pro) also use qsub as the queue submission command, so you'll have to tell us which queuing system you're using to be sure.
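A minimal sketch of the asker's loop rewritten around that pipe-to-qsub idiom, so overwriting a file can never matter; the job body (a hypothetical ./myprogram) is only a placeholder:
#!/bin/bash
for param in 1 2 3
do
    # the script body is read from stdin at submission time,
    # so there is no on-disk file to overwrite between iterations
    echo "cd \$PBS_O_WORKDIR; ./myprogram --param ${param}" | qsub -N job-${param}
done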
Yes. You can even create jobs and sub-jobs with HERE Documents. Below is an example of a test I was doing with a script initiated by a cron job:
#!/usr/bin/env bash
printenv
qsub -N testCron -l nodes=1:vortex:compute -l walltime=1:00:00 <<QSUB
cd \$PBS_O_WORKDIR
printenv
qsub -N testsubCron -l nodes=1:vortex:compute -l walltime=1:00:00 <<QSUBNEST
cd \$PBS_O_WORKDIR
pwd
date -Isec
QSUBNEST
QSUB