Slurm - Accessing stdout/stderr location of a completed job - slurm

I am trying to get the location of the stdout and stderr files of an already completed job.
While the job is running, I can do
scontrol show job $JobId
However, this does not work after the job has finished.
I am able to get information about previously completed jobs with sacct.
However, this command has no option to display the location of stderr and stdout.
The only information I found about this issue is https://groups.google.com/g/slurm-users/c/e4cZMbtrMM0 . However, it suggests changing slurm.conf so that scontrol show job $JobId retains the information. This is not possible in my case because I do not have access to slurm.conf.
So I was wondering whether Slurm offers a way to get the location of the stdout and stderr of a completed job.
Thanks for your help
---- edit ----
The jobs are submitted with a bash file containing
#SBATCH --output=...
#SBATCH --error=...
by running the command sbatch $submission_file
This means that retrieving the submission command does not help: it only gives sbatch $jobfile and no further information about the output and error directories.

Although Slurm does not seem to save that information in the accounting database explicitly, it does save the information of the working directory, which you can obtain with
sacct -j <JOBID> -o workdir%-100
Most of the time, the output and error files will be located relative to that directory.
Slurm also saves the submission command, which you can retrieve with
sacct -j <JOBID> -o SubmitLine%-100
which will reveal the output and error files if they were provided on the command line.
Finally, note that Slurm will also save the full submission script if configured to do so (which may not be the case on your cluster). If so, you can retrieve it with
sacct -j <JOBID> --batch-script
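If the batch script is stored, the #SBATCH directives can be read straight out of it. Below is a minimal sketch, assuming your Slurm version supports --batch-script and the cluster keeps the scripts; extract_sbatch_paths is a hypothetical helper name introduced here, not a Slurm command.

```shell
# Hypothetical helper: read a batch script on stdin and print the values of
# its --output/--error (or -o/-e) #SBATCH directives, one per line.
# Values containing spaces are not handled by this sketch.
extract_sbatch_paths() {
  sed -n -E 's/^#SBATCH[[:space:]]+(--output|--error|-o|-e)[=[:space:]]+([^[:space:]]+).*/\1 \2/p'
}

# Intended use (needs a cluster that stores batch scripts):
#   sacct -j <JOBID> --batch-script | extract_sbatch_paths
```

The values are printed exactly as written in the script, so filename patterns such as %j are not expanded, and relative paths still have to be resolved against the working directory reported by sacct.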

Related

How can I find out the "command" (batch script filename) of a finished SLURM job?

I often have lots of SLURM jobs running from different directories. Therefore, it is useful to query the workdir of the jobs. I can do this for jobs in the queue (e.g. pending, running, etc.) something like this:
squeue -u $USER -o "%i %Z"
and I can do this for finished jobs (e.g. completed, timeout, cancelled, etc.) something like this:
sacct -u $USER -o JobID,WorkDir
The problem is, sometimes I have a directory with two (or more) SLURM batch scripts in it, e.g. submit.sh and restart.sh. Therefore, it is also useful to query the "command" of the jobs, i.e. the filename of the batch script. I can do this for jobs in the queue something like this:
squeue -u $USER -o "%i %o"
However, from checking the documentation of sacct and playing around with sacct, there appears to be no equivalent option for sacct so I cannot currently get the command for finished jobs. I also cannot use the squeue method for finished jobs - it just says slurm_load_jobs error: Invalid job id specified because finished jobs are not included in the squeue list. So, how can I find out the command of a finished SLURM job (using sacct or otherwise)?
Slurm indeed does not store the command in the accounting database. Two workarounds:
For a single user: use the JobName or Comment to store the script name upon submission. These are stored in the database, but this approach is error-prone.
Cluster-wide: enable the Elasticsearch job completion plugin, as it stores not only the script name but the whole contents of the script as well.
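The single-user workaround could look like the sketch below. submit_named and the SBATCH_CMD variable are hypothetical names introduced here for illustration; only sbatch --job-name and the JobName field of sacct come from Slurm itself.

```shell
# Hypothetical wrapper: submit a batch script and record its file name as the
# job name, so it can be read back from the accounting database later.
submit_named() {
  script=$1
  # SBATCH_CMD is an assumption of this sketch: set it to "echo sbatch"
  # to see the command without submitting anything.
  ${SBATCH_CMD:-sbatch} --job-name="$(basename "$script")" "$script"
}

# Example: submit_named ./restart.sh
# Later:   sacct -u $USER -o JobID,JobName%30,WorkDir%60
```

Keep in mind that a --job-name given on the command line takes precedence over a #SBATCH -J line inside the script, so any name set in the script itself is replaced by the file name.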

Check the path of a running script on PBS job scheduler

I'd like to check where the script that is currently running on my HPC platform is located. I know I can get the names of the jobs with the command qstat, which prints:
                                                                               Req'd  Req'd
Job id               Username Queue    Name                 SessID NDS   TSK   Memory Time  Us
-------------------- -------- -------- -------------------- ------ ----- ----- ------ ----- -
123456               xxxxx    xxxxxx   script_name          --     5     120   --     72:00 R
Update: I managed to do it with the UNIX commands find, cat and grep, but I'd like to know if there is a way to get it directly from PBS commands.
Old question: (for your information)
Also, I know that the name is set inside the file with the directive
#SBATCH -J script_name
So I tried to look for it using find, cat and grep:
$ cat $(find . -name '*.sh') | grep script_name
#SBATCH -J script_name
But, of course, that command only gives me the matching line.
Is there a way to get the name/path of the file that contains the matched line?
Is there a way to do it with the PBS commands? (I'd love to know how to get it with the find, cat and grep commands too.)
I'm not too sure which workload manager you are running. The commands you mentioned and the tag you added point to PBS, but "SBATCH" in your script means that you are using SLURM.
I can tell you that in the case of PBS, the job script is almost always in some directory under the PBS_HOME directory on the execution host. You can find the path of PBS_HOME in /etc/pbs.conf. The job directory is root-protected, so unless you are root, you cannot see its contents.
The script does not keep the name of the file you passed to qsub on the command line. Rather, it is named after the job id you get out of qsub, suffixed with ".SC".
For example, if qsub job.sh gives you the job id "7557.host1", then the job script is named "7557.host1.SC".
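As for the find/grep part of the question: piping everything through cat loses the file names, whereas grep -l prints the names of the matching files instead of the matching lines. A minimal sketch, where find_job_script is a hypothetical helper name:

```shell
# Hypothetical helper: print the path of every *.sh file under a directory
# (default: current directory) that contains the given job name.
find_job_script() {
  name=$1
  dir=${2:-.}
  # -l: print file names only; -F: treat the name as a fixed string;
  # -e: keeps names starting with a dash from being read as options.
  find "$dir" -type f -name '*.sh' -exec grep -l -F -e "$name" {} +
}

# Example: find_job_script script_name .
```

This only tells you which files mention the name; if several scripts use the same job name, all of them are listed.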

Why doesn't Torque qsub create an output file?

I am trying to start a task on a cluster via Torque PBS with the command
qsub -o a.txt a.sh
The file a.sh contains a single line:
hostname
After the qsub command I run qstat, which gives the following output:
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
302937.voms               a.sh             user            00:00:00 E long
After 5 seconds, qstat returns empty output (no jobs in the queue).
The command
qsub --version
gives the output: version: 2.5.13
The command
which qsub
gives the output: /usr/bin/qsub
The problem is that the file a.txt (from the command qsub -o a.txt a.sh) is not created! Only the job id is printed in the terminal, and there are no errors. The command
qsub a.sh
behaves the same way. How can I fix this? Where are the qsub log files with errors?
If I use the command
qsub -l nodes=node36:ppn=1 -o a.txt a.sh
then I can find the output files in the folder
/var/spool/pbs/undelivered
on node36 (after logging in to it via ssh).
The output file contains the string "node36", and the error file is empty.
Why are my files "undelivered"?
The output log and error log files are kept on the execution node in a spool directory and copied back to the head node after the job has completed. The location of the spool directory may vary, but you should look for it under /var/torque/spool on the first node from the list of nodes the job has been allocated.
There are multiple reasons that might cause torque to fail to deliver the output files.
The user submitting the job might not exist on the node or their home directory might not be accessible, or there is a user ID mismatch between the nodes of the cluster.
Torque is using ssh to copy files to the head node, but passwordless public key authentication for the user to ssh across the cluster has not been set up consistently on all the nodes.
A node failed during the execution of the job.
This list is by no means complete. Already here on Stack Overflow one can find a number of questions dealing with such a failure. Try to check if any of the above applies to your case.
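The ssh cause in the list above can be checked by hand from the execution node. The sketch below is only an assumption about how you might test it: check_passwordless_ssh and the SSH_CMD variable are hypothetical names, and the host you pass in is whatever your head node is called. BatchMode=yes makes ssh fail instead of prompting for a password.

```shell
# Hypothetical helper: report whether the current user can ssh to a host
# without being asked for a password.
check_passwordless_ssh() {
  host=$1
  # SSH_CMD is an assumption of this sketch; it defaults to the real ssh.
  if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
    echo "ok: $host"
  else
    echo "FAILED: $host"
    return 1
  fi
}

# Example, run on the execution node as the user who submitted the job:
#   check_passwordless_ssh <head_node>
```

A failure here does not prove that this is why the files were undelivered, but it is a quick way to rule the ssh setup in or out before looking at user IDs and home directories.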
You (or anyone else finding this thread) should also check out the solution given here:
PBS, refresh stdout
If you have admin access, you can set
$spool_as_final_name true
which causes the output to be written directly to the final destination.

Torque+MAUI PBS submitted job strange startup

I am using a Torque+MAUI cluster.
The cluster's utilization is now ~10 of 40 available nodes; a lot of jobs are queued but cannot be started.
I submitted the following PBS script using qsub:
#!/bin/bash
#
#PBS -S /bin/bash
#PBS -o STDOUT
#PBS -e STDERR
#PBS -l walltime=500:00:00
#PBS -l nodes=1:ppn=32
#PBS -q zone0
cd /somedir/workdir/
java -Xmx1024m -Xms256m -jar client_1_05.jar
The job gets R(un) status immediately, but I got this abnormal information from qstat -n:
8655.cluster.local user zone0 run.sh -- 1 32 -- 500:00:00 R 00:00:31
z0-1/0+z0-1/1+z0-1/2+z0-1/3+z0-1/4+z0-1/5+z0-1/6+z0-1/7+z0-1/8+z0-1/9
+z0-1/10+z0-1/11+z0-1/12+z0-1/13+z0-1/14+z0-1/15+z0-1/16+z0-1/17+z0-1/18
+z0-1/19+z0-1/20+z0-1/21+z0-1/22+z0-1/23+z0-1/24+z0-1/25+z0-1/26+z0-1/27
+z0-1/28+z0-1/29+z0-1/30+z0-1/31
The abnormal part is the -- in run.sh -- 1 32: the session ID is missing, and evidently the script does not run at all, i.e. there is never any trace of the java program being started.
After about 5 minutes of this strange running, the job is set back to Q(ueue) status and seemingly will not be run again (I monitored this for ~1 week and it did not run even after reaching the top of the queue).
I tried submitting the same job 14 times and monitored the nodes in qstat -n: 7 copies ran successfully on various nodes, but all jobs allocated to z0-1/* got stuck with this strange startup behavior.
Does anyone know a solution to this issue?
As a temporary workaround, how can I tell the PBS script NOT to use those strange nodes?
It sounds like something is wrong with those nodes. One solution would be to offline the nodes that aren't working with pbsnodes -o <node_name> and allow the cluster to continue to work. You may need to release the holds on any jobs. I believe you can run releasehold ALL to accomplish this in Maui.
Once you take care of that, I'd investigate the logs on those nodes (start with the pbs_mom logs and the syslogs) to figure out what is wrong with them. Once you have corrected the problem, you can put the nodes back online with pbsnodes -c <node_name>. You may also want to look into setting up some node health scripts to proactively detect and handle these situations.
For users: contact your administrator and, in the meantime, run the job using this workaround.
Use pbsnodes to check for free and healthy nodes.
Modify the PBS directive: #PBS -l nodes=<freenode1>:ppn=<ppn1>+<freenode2>:ppn=<ppn2>+...
Submit the job using qsub.
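The user-side steps above can be sketched as a small filter. It assumes that pbsnodes -l free prints one "nodename state" pair per line, as Torque does; build_nodes_directive is a hypothetical helper name, and the excluded node is whichever node misbehaves on your cluster.

```shell
# Hypothetical helper: read "nodename state" lines on stdin and print a
# #PBS directive requesting the first COUNT nodes that are not EXCLUDE.
# Usage: build_nodes_directive PPN COUNT EXCLUDE
build_nodes_directive() {
  ppn=$1
  count=$2
  exclude=$3
  awk -v ppn="$ppn" -v count="$count" -v exclude="$exclude" '
    BEGIN { n = 0; spec = "" }
    NF >= 2 && $1 != exclude && n < count + 0 {
      sep = (n > 0) ? "+" : ""
      spec = spec sep $1 ":ppn=" ppn
      n++
    }
    END { if (n > 0) print "#PBS -l nodes=" spec }
  '
}

# Example: pbsnodes -l free | build_nodes_directive 32 1 z0-1
```

A node listed as free is not guaranteed to have all of its slots available, so the generated directive is a starting point rather than a guarantee that the job starts immediately.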

Can I delete a shell script after it has been submitted using qsub without affecting the job?

I want to submit a bunch of jobs using qsub - the jobs are all very similar. I have a script that has a loop, and in each iteration it overwrites a file tmpjob.sh and then does qsub tmpjob.sh . Before the job has had a chance to run, tmpjob.sh may have been overwritten by the next iteration of the loop. Is another copy of tmpjob.sh stored while the job is waiting to run? Or do I need to be careful not to change tmpjob.sh before the job has begun?
Assuming you're talking about torque, then yes; torque reads in the script at submission time. In fact the submission script need never exist as a file at all; as given as an example in the documentation for torque, you can pipe in commands to qsub (from the docs: cat pbs.cmd | qsub.)
But several other batch systems (SGE/OGE, PBS Pro) use qsub as a queue submission command, so you'll have to tell us what queuing system you're using to be sure.
Yes. You can even create jobs and sub-jobs with here-documents. Below is an example of a test I was doing with a script initiated by a cron job:
#!/usr/bin/env bash
printenv
qsub -N testCron -l nodes=1:vortex:compute -l walltime=1:00:00 <<QSUB
cd \$PBS_O_WORKDIR
printenv
qsub -N testsubCron -l nodes=1:vortex:compute -l walltime=1:00:00 <<QSUBNEST
cd \$PBS_O_WORKDIR
pwd
date -Isec
QSUBNEST
QSUB
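Applied to the loop from the question, the same idea removes the need for tmpjob.sh altogether: generate the script text in a function and pipe it into qsub. This is only a sketch under the assumption that your qsub reads the script from stdin, as Torque's does; make_job, the job name prefix and my_program are hypothetical placeholders.

```shell
# Hypothetical generator: print a job script for one parameter value.
# Nothing is written to disk, so there is no file to overwrite.
make_job() {
  param=$1
  cat <<EOF
#!/bin/bash
#PBS -N job_$param
cd \$PBS_O_WORKDIR
./my_program --input "$param"
EOF
}

# Submit one job per value by piping the generated text into qsub:
#   for p in 1 2 3; do make_job "$p" | qsub; done
```

The backslash in \$PBS_O_WORKDIR keeps that variable unexpanded until the job runs, while $param is filled in at submission time.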
