Is there a way to check the output of a submitted job? The output files are written with quite a big delay, so I want to be able to see if anything is going wrong.
I saw that PBS has the option -k oe to write the qsub output directly to a file in the home directory, but I could not find a similar solution for my case.
Based on what I found about Torque, it seems this is not doable without configuring the server:
Check real time output after qsub a job on cluster
PBS, refresh stdout
You may want to consider writing output to a separate file from within the program and flushing stdout accordingly.
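As a minimal sketch of that idea, assuming GNU coreutils' stdbuf is available on the compute nodes and using myprogram as a placeholder for your actual command:
#!/bin/bash
#PBS -l nodes=1:ppn=1
# Force line-buffered output and tee a copy into $HOME so it can be
# followed while the job runs; PBS_JOBID is set by PBS/Torque inside the job.
stdbuf -oL -eL myprogram 2>&1 | tee "$HOME/job_${PBS_JOBID}.log"
You can then watch the log from a login node with tail -f while the job is still running.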
Is there any way to determine how many threads are available to a program when I run it from a PBS job script?
In the header of my PBS job script I set
#PBS -l nodes=1:ppn=8
Is there a command I can use that will detect the number of threads, so that I can set a variable equal to this number (for downstream processes)?
This way I can pass the thread count as $k to downstream processes, instead of going through the code line by line every time I change #PBS -l nodes=1:ppn=_.
Thanks all!
I found a workaround:
If using a single node, the variable I am looking for is $PBS_NUM_PPN.
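For example, a minimal sketch of how this can be used inside the job script (somecommand is a placeholder; the fallback to 1 is just defensive):
#!/bin/bash
#PBS -l nodes=1:ppn=8
# PBS_NUM_PPN is set by Torque inside the job to the requested ppn value
k=${PBS_NUM_PPN:-1}    # fall back to 1 if the variable is unset
somecommand --threads "$k"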
By default, PBS doesn't expose the ppn setting to the running job. And there is no way that a shell script can read its own comments ... without knowing and parsing its source (and that's probably not going to work here, for a couple of reasons).
But here are a couple of ideas:
You could pass an arbitrary variable from the qsub command line using the -v option. (You might be able to do the same thing using #PBS -v ..., but that would be equivalent to setting a variable in your script in the normal way.)
You should be able to specify the resources (using -l) on the qsub command line instead of in the job script.
Put them together like this:
qsub ... -l nodes=1:ppn=8 -v NOSTHREADS=8 myscript.pbs
where myscript.pbs is:
#!/bin/bash
#PBS directives ... without the "-l" !!!
# ordinary shell commands.
somecommand --someoption "$NOSTHREADS" ...
Note: I recommend that you don't mix specifying resources on the command line and in the script. Put the "-l" options in one place only. If you put them in both places AND your Torque / PBS installation uses job submission filters, things can get rather confused.
Alternatively, you could write a shell (or python or whatever) launcher that generates the PBS script with matching values of the ppn (etc) resource(s) and the corresponding variable(s) embedded in the generated script.
This approach can have the advantage of being more reproducible ... if you do a few other things as well. (Ask a local eResearch analyst about reproducibility in your scientific computing.)
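A minimal sketch of such a launcher, assuming a single resource to keep in sync (generated_job.pbs and somecommand are placeholders):
#!/bin/bash
# Usage: ./launch.sh <ppn>
# Generate a job script in which the ppn resource and the thread count
# passed to the program are guaranteed to match, then submit it.
PPN=${1:-8}
cat > generated_job.pbs <<EOF
#!/bin/bash
#PBS -l nodes=1:ppn=${PPN}
somecommand --someoption ${PPN}
EOF
qsub generated_job.pbs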
If neither of the above can be made to work, you might be able to check the ulimit settings within the job script. However, my understanding is that the PBS MOM daemon will typically not use ulimit restrictions as the means of enforcing thread / process limits; instead, it will monitor the number of cores that are active. (The ppn resource limits the number of processors, not the number of threads or processes.)
I know little about how the CPU communicates with memory, so I'm not sure whether this is a 'correct' question to ask.
In a job script I submit to a Slurm cluster, the script needs to read data from a database stored in the working directory. I want to monitor the memory used while running this script.
How can I write a bash script to do this? I have tried @CoffeeNerd's script. However, while the job is running, there is only one line of output in the file:
AveCPU|AveRSS|MaxRSS
How can I modify this script to output the real-time memory usage?
I know about the sstat command, but I'm not sure whether something like sstat -j $JOBID.batch --format=MaxVMSize is the solution to my problem.
Slurm has a plugin that records a 'profile' of a job (CPU usage, memory usage, etc.) into an HDF5 file. It holds a time series for each item measured.
Use
#SBATCH --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
to activate it.
See the Slurm documentation on profiling with HDF5 for details.
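For example (a sketch, assuming the acct_gather_profile/hdf5 plugin is configured on your cluster; <jobid> is a placeholder):
sbatch --profile=task job.sh
# after the job finishes, merge the per-node HDF5 files into one
sh5util -j <jobid>
sh5util is the companion tool that merges the per-node profile files into a single job_<jobid>.h5 file containing the time series, which you can then inspect.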
Is there something like a .slurmrc for SLURM that would allow each user to set their own defaults for parameters that they would normally specify on the command line?
For example, I run 95% of my jobs on what I'll call our HighMem partition. Since my routine jobs can easily go over the default of 1GB, I almost always request 10GB of RAM. To make the best use of my time, I would like to put the partition and RAM requests in a configuration file so that I don't have to type them in all the time. So, instead of typing the following:
sbatch --partition=HighMem --mem=10G script.sh
I could just type this:
sbatch script.sh
I tried searching for multiple variations on "SLURM user-level configuration" and it seemed that all SLURM-related hits dealt with slurm.conf (a global-level configuration file).
I even tried creating slurm.conf and .slurmrc in my home directory, just in case that worked, but they didn't have any effect on the partition used.
update 1
Yes, I thought about scontrol, but the only configuration file it deals with is global and most parameters in it aren't even relevant for a normal user.
update 2
My supervisor pointed out the SLURM Perl API to me. The last time I looked at it, it seemed too complicated to me, but this time, upon looking at the code for https://github.com/SchedMD/slurm/blob/master/contribs/perlapi/libslurm/perl/t/06-complete.t, it would seem that it wouldn't be too hard to create a script that behaves similarly to sbatch, reads in a default configuration file, and sets the desired parameters. However, I haven't had any success in setting 'std_out' to a file name that gets written to.
If your example is representative, defining an alias
alias sbatch='sbatch --partition=HighMem --mem=10G'
could be the easiest way. Alternatively, a Bash function could also be used
sbatch() {
    command sbatch --partition=HighMem --mem=10G "$@"
}
Put either of these in your .bash_profile for persistence.
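If you want something closer to an actual per-user configuration file, here is a hedged sketch of a wrapper function that reads defaults from a hypothetical ~/.sbatch_defaults file, one option per line (this is a made-up convention, not something SLURM itself reads):
# in ~/.bash_profile
sbatch() {
    local defaults=()
    if [[ -f "$HOME/.sbatch_defaults" ]]; then
        # read default options, skipping blank lines and comments
        while IFS= read -r line; do
            [[ -z "$line" || "$line" == \#* ]] && continue
            defaults+=("$line")
        done < "$HOME/.sbatch_defaults"
    fi
    # defaults go first, so options typed on the command line take precedence
    command sbatch "${defaults[@]}" "$@"
}
With --partition=HighMem and --mem=10G in that file, a plain sbatch script.sh picks them up, while an explicit sbatch --mem=20G script.sh should still win, since sbatch generally honours the last occurrence of an option.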
I am running a pipeline on a SLURM cluster, and for some reason a lot of small files (between 500 and 2000 bytes in size) named along the lines of slurm-XXXXXX.out (where XXXXXX is a number) are being created. I've tried to find out what these files are on the SLURM website, but I can't find any mention of them. I assume they are some sort of in-progress files that the system uses while parsing my pipeline?
If it matters, the pipeline I'm running uses Snakemake. I know I've seen these types of files before, without Snakemake, but they weren't a big problem back then. I'm afraid that clearing the working directory of these files after each step of the workflow will interrupt in-progress steps, so I'm not doing anything with them at the moment.
What are these files, and how can I suppress their output or, alternatively, delete them after their corresponding job is finished? Did I mess up my workflow somehow, and that's why they are created?
You might want to take a look at the sbatch documentation. The files that you are referring to are essentially SLURM logs, as explained there:
By default both standard output and standard error are directed to a
file of the name "slurm-%j.out", where the "%j" is replaced with the
job allocation number.
You can change the filename with the --error=<filename pattern> and --output=<filename pattern> command line options. The filename pattern can contain one or more symbols that will be replaced as explained in the documentation. According to the FAQ, you should be able to suppress standard output and standard error by using the following command line options:
sbatch --output=/dev/null --error=/dev/null [...]
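If you would rather keep the logs but organize them, the filename patterns can be combined, for example (a sketch; %x expands to the job name and %j to the job ID, and the logs/ directory must already exist):
sbatch --output=logs/%x-%j.out --error=logs/%x-%j.err script.sh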
I am very new to Linux and am sorry for the newbie questions.
I had a homework extra credit question that I was trying to do but failed to get it.
Q. Write a security shell script that logs the following information
for every process: User ID, time started, time ended (0 if process is
still running), whether the process has tried to access a secure file
(stored as either yes or no) The log created is called
process_security_log where each of the above pieces of information is
stored on a separate line and each entry follows immediately (that is,
there are no blank lines). Write a shell script that will examine
this log and output the User ID of any process that is still running
that has tried to access a secure file.
I started by trying to just capture the user and echo it, but failed.
output=`ps -ef | grep [*]`
set -- $output
User=$1
echo $User
The output of ps is insufficient and incapable of producing the data required by this question.
You need something like auditd, SELinux, or straight-up kernel hacks (e.g. modifying fork.c) to do anything remotely in the realm of security logging.
Update
Others have made suggestions to use shell command logging, ps, and friends (procfs or sysfs). They can be useful, and do have their place (obviously). I would argue that they shouldn't be relied on for this purpose, especially in an educational context.
... whether the process has tried to access a secure file (stored as either yes or no)
Seems to be the one that the other answers are ignoring. I stand by my original answer, but as Daniel points out, there are other interesting ways to gather this data.
systemtap
perf
LTTng
For an educational exercise these tools will help provide a more complete answer.
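As a rough illustration of the kind of data these tools expose, a sketch using perf to watch open-type syscalls system-wide (assumes perf is installed and you have root; /path/to/secure_file is a placeholder):
# trace openat() syscalls across the system and watch for the secure file;
# perf trace writes its output to stderr, hence the redirect
sudo perf trace -e openat 2>&1 | grep /path/to/secure_file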
Since this is homework, I'm assuming that this isn't a real-world scenario, merely a learning exercise. The shell is not really the right place to do security auditing or process accounting. However, here are some pointers that may help you discover what you can do at the shell prompt.
You might set the bash PROMPT_COMMAND to do your process logging.
You can tail or grep your command history for use in logging.
You can use /usr/bin/script (usually found in the bsdutils package) to create a typescript of your session.
You can run ps in a loop, using subshells or the watch utility, to see what processes are currently running.
You can use pidof or pgrep to find processes more easily.
You can modify your .bashrc or other shell startup file to set up your environment or start your logging tools.
As a starting point, you might begin with something trivial like this:
$ export PROMPT_COMMAND='history | tail -n1'
56 export PROMPT_COMMAND='history | tail -n1'
$ ls /etc/passwd
/etc/passwd
57 ls /etc/passwd
and build in any additional logging data or process information that you think necessary. Hope that gets you pointed in the right direction!
Take a look at the /proc pseudo-filesystem.
Inside of this, there is a subdirectory for every process that is currently running - process [pid] has its information available in /proc/[pid]/. Inside of that directory, you might make use of /proc/[pid]/stat or /proc/[pid]/status to get information about which user started the process and when.
I'm not sure what the assignment means by a "secure file," but if you have some way of determining which files are secure, you can get information about open files (including their names) through /proc/[pid]/fd/ and /proc/[pid]/fdinfo/.
Is /proc enough for true security logging? No, but /proc is enough to get information about which processes are currently running on the system, which is probably what you need for a homework assignment about shell scripting. Also, outside of this class you'll probably find /proc useful later for other purposes, such as seeing the mapped pages for a process. This can come in handy if you're writing a stack trace utility or want to know how they work, or if you're debugging code that uses memory-mapped files.
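As a concrete sketch of what such a homework script might do with /proc (SECURE_FILE is an assumed stand-in for whatever the assignment calls a secure file):
#!/bin/bash
# For every running process, print its PID, UID, and whether it currently
# has the "secure" file open, by reading the /proc pseudo-filesystem.
SECURE_FILE="/etc/passwd"    # placeholder for the secure file

for piddir in /proc/[0-9]*; do
    # the Uid: line of /proc/[pid]/status lists real/effective/saved/fs UIDs
    uid=$(awk '/^Uid:/ {print $2}' "$piddir/status" 2>/dev/null) || continue
    access=no
    for fd in "$piddir"/fd/*; do
        if [[ $(readlink "$fd" 2>/dev/null) == "$SECURE_FILE" ]]; then
            access=yes
            break
        fi
    done
    echo "${piddir#/proc/} $uid $access"
done
Note that reading other users' /proc/[pid]/fd entries typically requires root, so run this with appropriate privileges.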