Torque+MAUI PBS submitted job strange startup - pbs

I am using a Torque+MAUI cluster.
The cluster's current utilization is about 10 of its 40 available nodes, yet a lot of jobs are queued and cannot be started.
I submitted the following PBS script using qsub:
#!/bin/bash
#
#PBS -S /bin/bash
#PBS -o STDOUT
#PBS -e STDERR
#PBS -l walltime=500:00:00
#PBS -l nodes=1:ppn=32
#PBS -q zone0
cd /somedir/workdir/
java -Xmx1024m -Xms256m -jar client_1_05.jar
The job gets R(un) status immediately, but qstat -n shows this abnormal information:
8655.cluster.local user zone0 run.sh -- 1 32 -- 500:00:00 R 00:00:31
z0-1/0+z0-1/1+z0-1/2+z0-1/3+z0-1/4+z0-1/5+z0-1/6+z0-1/7+z0-1/8+z0-1/9
+z0-1/10+z0-1/11+z0-1/12+z0-1/13+z0-1/14+z0-1/15+z0-1/16+z0-1/17+z0-1/18
+z0-1/19+z0-1/20+z0-1/21+z0-1/22+z0-1/23+z0-1/24+z0-1/25+z0-1/26+z0-1/27
+z0-1/28+z0-1/29+z0-1/30+z0-1/31
The abnormal part is the -- in run.sh -- 1 32: the sessionId is missing, and evidently the script does not run at all, i.e. there is no trace of the java program ever being started.
After about 5 minutes of this strange "running", the job is set back to Q(ueue) status and seemingly never runs again (I monitored this for about a week, and it does not run even after it becomes the topmost queued job).
I submitted the same job 14 times and monitored the assigned nodes with qstat -n: 7 copies ran successfully on a variety of nodes, but every job allocated to z0-1/* got stuck with this strange startup behavior.
Anyone know a solution to this issue?
As a temporary workaround, how can I specify NOT to use those problematic nodes in the PBS script?

It sounds like something is wrong with those nodes. One solution would be to offline the nodes that aren't working: pbsnodes -o <node name> and allow the cluster to continue to work. You may need to release the holds on any jobs. I believe you can run releasehold ALL to accomplish this in Maui.
Once you take care of that I'd investigate the logs on those nodes (start with the pbs_mom logs and the syslogs) and figure out what is wrong with them. Once you figure out and correct what is wrong with them, you can put the nodes back online: pbsnodes -c <node_name>. You may also want to look into setting up some node health scripts to proactively detect and handle these situations.
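If it helps, here is a minimal sketch of that sequence; z0-1 stands in for whichever node is actually misbehaving, and this assumes Torque's pbsnodes plus Maui's releasehold as mentioned above:
# Mark the suspect node offline so the scheduler stops placing jobs on it
pbsnodes -o z0-1
# List nodes that are down/offline/unknown to confirm the state change
pbsnodes -l
# Release any holds Maui may have placed on queued jobs
releasehold ALL
# Once the node is repaired, bring it back online
pbsnodes -c z0-1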

For users: contact your administrator, and in the meantime run the job using this workaround (a sketch follows the steps below).
Use pbsnodes to check for free and healthy nodes.
Modify the PBS directive: #PBS -l nodes=<freenode1>:ppn=<ppn1>+<freenode2>:ppn=<ppn2>+...
Submit the job using qsub.
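For instance, a sketch of the original script pinned to a specific healthy node; z0-2 is a placeholder for a node that pbsnodes reports as free, and everything else is unchanged:
#!/bin/bash
#PBS -S /bin/bash
#PBS -o STDOUT
#PBS -e STDERR
#PBS -l walltime=500:00:00
# Request a specific healthy node instead of letting the scheduler pick one
#PBS -l nodes=z0-2:ppn=32
#PBS -q zone0
cd /somedir/workdir/
java -Xmx1024m -Xms256m -jar client_1_05.jar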

Related

How can I find out the "command" (batch script filename) of a finished SLURM job?

I often have lots of SLURM jobs running from different directories. Therefore, it is useful to query the workdir of the jobs. I can do this for jobs in the queue (e.g. pending, running, etc.) with something like this:
squeue -u $USER -o "%i %Z"
and I can do this for finished jobs (e.g. completed, timeout, cancelled, etc.) with something like this:
sacct -u $USER -o JobID,WorkDir
The problem is, sometimes I have a directory with two (or more) SLURM batch scripts in it, e.g. submit.sh and restart.sh. Therefore, it is also useful to query the "command" of the jobs, i.e. the filename of the batch script. I can do this for jobs in the queue with something like this:
squeue -u $USER -o "%i %o"
However, from checking the sacct documentation and experimenting with it, there appears to be no equivalent option, so I cannot currently get the command for finished jobs. I also cannot use the squeue method for finished jobs: it just says slurm_load_jobs error: Invalid job id specified, because finished jobs are no longer in the squeue list. So, how can I find out the command of a finished SLURM job (using sacct or otherwise)?
Slurm indeed does not store the command in the accounting database. Two workarounds:
For a single user: use the JobName or Comment field to store the script name upon submission (see the sketch below). These are stored in the database, but this approach is error-prone;
Cluster-wide: enable the Elasticsearch job completion plugin, as this stores not only the script name but the whole script contents as well.
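A minimal sketch of the single-user workaround, assuming a batch script named submit.sh (the filename and naming convention are up to you):
# Encode the script filename in the job name at submission time
sbatch --job-name="submit.sh" submit.sh
# Later, recover it for finished jobs from the accounting database
sacct -u $USER -o JobID,JobName%40,WorkDir%60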

"qsub script.sh" yielding "Unknown queue" error

Let's say I have two bash scripts (small.sh & super.sh).
small.sh
#!/bin/bash
cd /current_path/
chmod a+x *.sh
bash super.sh
super.sh
#!/bin/bash
qsub test.sh
When I submit my job to the PBS system with
qsub small.sh
super.sh is not executed. That means it never runs
qsub test.sh
Am I doing something wrong? How can I achieve this?
If your script has no #PBS directives, and you don't submit with something like qsub -q batch ..., then it seems like you either a) have no default queue defined, or b) the queue name being submitted to does not exist (or has a typo). Run this (as an admin) to see the default queue:
qmgr -c 'print server default_queue'
Run this to see the queue settings:
qmgr -c 'print queue <queue_name>'
If you have no default queue, then either set one (a sketch follows below), or make sure to always submit directly to a queue with qsub -q <queue_name>... (and of course make sure the queue actually exists, which you can still do with print queue as mentioned above).
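For reference, a minimal sketch of checking and setting a default queue with qmgr; 'batch' is a placeholder for a queue that already exists on your server:
# Show the current default queue (empty output means none is set)
qmgr -c 'print server default_queue'
# Point the server's default queue at an existing queue named 'batch'
qmgr -c 'set server default_queue = batch'
# Verify the queue's settings (it should be enabled and started)
qmgr -c 'print queue batch'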
This is what I found out from here:
Queue is Unknown
Be sure to use the correct queue. For Pleiades jobs, use the common queue names normal, long, vlong, and debug. For Endeavour jobs, use the queue names e_normal, e_long, e_vlong, and e_debug. The PBS server pbspl1 recognizes the queue names for both Pleiades and Endeavour, and will route them appropriately. However, the pbspl3 server only recognizes the queue names for Endeavour jobs, as shown below:
pfe20% qsub -q normal#pbspl3 job_script
qsub: unknown queue

Calling Matlab on a Linux-based cluster: Matlab session stops before m-file is completely executed

I'm running a bash script that submits some PBS jobs on a Linux-based cluster multiple times. Each submission calls Matlab, reads some data, performs calculations, and writes the results back to my directory.
This process works fine with one exception. For some calculations the m-file starts, loads everything, then performs the calculation, but while printing the results to stdout the job terminates.
The PBS log file shows no error messages, and Matlab shows no error messages.
The code runs perfectly on my computer. I am out of ideas.
If anyone has an idea what I could do, I would appreciate it.
Thanks in advance
jbug
edit:
Is there a possibility to force Matlab to run to the end of the file? Might that help?
edit #18:00:
As requested in the comment below by HBHB, here is how Matlab is called by an external *.sh file:
#PBS -l nodes=1:ppn=2
#PBS -l pmem=1gb
#PBS -l mem=1gb
#PBS -l walltime=00:05:00
module load MATLAB/R2015b
cd $PBS_O_WORKDIR
matlab -nosplash -nodisplay -nojvm -r "addpath('./data/calc');myFunc("$a","$b"),quit()"
Where $a and $b come from a loop within the calling bash file, and ./data/calc points to the directory where myFunc is located.
edit #18:34: If I perform the calculation manually, everything runs fine, so the given data is fine; this seems to narrow it down to PBS?
edit #21:27: I put an until loop around the Matlab call that checks whether Matlab returns the desired data; if not, it should restart Matlab again after some delay. But still: Matlab stops after the finished calculation, while printing the result (some matrices), and the job even finishes. The checking part of the restart is never reached.
What I don't understand: the job stays in the queue, as I planned with the small delay, so the sleep $w should be executed? But if I check the error files, they just show me the frozen Matlab in its first round, recognizable by i. Here is that part of the code; maybe you can help me:
#w=w wait
i=1
until [[ -e ./temp/$b/As$a && -e ./temp/$b/Bs$a && -e ./temp/$b/Cs$a && -e ./temp/$b/lamb$a ]]
do
echo $i
matlab -nosplash -nodisplay -nojvm -r "addpath('./data/calc');myFunc("$a","$b"),quit()"
sleep $w
((i=i+1))
done
You are most likely choking your Matlab process with limited memory. Your PBS file:
#PBS -l nodes=1:ppn=2
#PBS -l pmem=1gb
#PBS -l mem=1gb
#PBS -l walltime=00:05:00
You are setting your physical memory to 1gb. Matlab without any files runs around 900MB of virtual memory. Try:
#PBS -l nodes=1:ppn=1
#PBS -l pvmem=5gb
#PBS -l walltime=00:05:00
Additionally, this is something you should contact your local system administrator about. Without system logs, I can't tell you for sure why your job is cutting short (but my guess is resource limits). As an SA of an HPC center, I can tell you that they would be able to tell you in about 5 minutes why your job is not working correctly. Different HPC centers also use different PBS configurations, so mem might not even be recognized; this is something your local administrators can help you with much better than StackOverflow.
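In the meantime, a quick sketch of how you might check whether the job is hitting its limits, assuming Torque-style tooling (tracejob needs access to the server/MOM logs, so it may be admin-only on your site):
# For a running job (or a recently finished one the server still knows about),
# show recorded usage such as resources_used.mem and resources_used.vmem
qstat -f <jobid> | grep -i resources_used
# Search the server/MOM logs for the job's exit status and resource usage
tracejob <jobid>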

No output from submitted job using qsub?

I have the following lines at the beginning of my script:
#!/bin/bash
#PBS -j oe
#PBS -o ~/output/a
After submitting this script with qsub, and after the job is completed, there is no file a under ~/output/. What am I missing here?
I just tried a couple of test jobs and it seems that Torque doesn't like the use of '~'. I received this in my email:
Aborted by PBS Server
Job cannot be executed
See Administrator for help
I would replace '~' with the path to your home directory and try it again.
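For example, a minimal sketch of the corrected header, where /home/yourusername is a placeholder for your actual home directory:
#!/bin/bash
#PBS -j oe
# Use an absolute path instead of '~'; replace 'yourusername' with your login name
#PBS -o /home/yourusername/output/a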

Can I delete a shell script after it has been submitted using qsub without affecting the job?

I want to submit a bunch of jobs using qsub - the jobs are all very similar. I have a script that has a loop, and in each iteration it rewrites the file tmpjob.sh and then does qsub tmpjob.sh. Before the job has had a chance to run, tmpjob.sh may have been overwritten by the next iteration of the loop. Is another copy of tmpjob.sh stored while the job is waiting to run? Or do I need to be careful not to change tmpjob.sh before the job has begun?
Assuming you're talking about Torque, then yes; Torque reads in the script at submission time. In fact the submission script need never exist as a file at all: as shown in the Torque documentation, you can pipe commands into qsub (from the docs: cat pbs.cmd | qsub).
But several other batch systems (SGE/OGE, PBS PRO) use qsub as a queue submission command, so you'll have to tell us what queuing system you're using to be sure.
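If you are on Torque, here is a sketch of the loop rewritten to pipe each job script straight into qsub, so there is no tmpjob.sh to overwrite; the parameter values, resource requests, and ./my_program are placeholders for your actual workload:
#!/bin/bash
# Submit one job per parameter value without ever writing tmpjob.sh to disk
for p in 1 2 3; do
qsub -N "job_$p" <<EOF
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
# \$PBS_O_WORKDIR is escaped so it expands at job runtime, not at submission
cd \$PBS_O_WORKDIR
./my_program --param $p
EOF
done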
Yes. You can even create jobs and sub-jobs with HERE Documents. Below is an example of a test I was doing with a script initiated by a cron job:
#!/bin/env bash
printenv
qsub -N testCron -l nodes=1:vortex:compute -l walltime=1:00:00 <<QSUB
cd \$PBS_O_WORKDIR
printenv
qsub -N testsubCron -l nodes=1:vortex:compute -l walltime=1:00:00 <<QSUBNEST
cd \$PBS_O_WORKDIR
pwd
date -Isec
QSUBNEST
QSUB
