How to get the count of failed and completed jobs in a SLURM array job

I am running multiple array jobs using SLURM. For a given array job ID, say 885881, I want to list the counts of failed and completed jobs. Something like this:
Input:
<some-command> -j 885881
Output: Let's say we have 200 jobs in the array.
count | status
120 | failed
80 | completed
Secondly, it would be great if I could get the list of unique reasons for which tasks failed.
Input:
`<some-command> -j 885881`
Output:
count | reason
80 | OUT_OF_MEMORY
40 | TIMED_OUT
I believe the sacct command can be used to get these results somehow, but I am not sure how.

With a one-liner like this one, you can get both pieces of information at the same time:
$ sacct -n -X -j 885881 -o state%20 | sort | uniq -c
16 COMPLETED
99 FAILED
32 OUT_OF_MEMORY
1 PENDING
The sacct command digs into the accounting information. The -n -X parameters simplify the output and remove unnecessary lines, and the -o parameter requests that only the STATE column be displayed. The output is then fed into the sort and uniq commands, which do the counting.
If you really need two separate commands, you can adapt the above one-liner easily, and you can turn it into a script or a Bash function for ease of use.
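For example, a pair of small Bash functions could look like this (a sketch; the function names are made up, and the sacct options are the same as in the one-liner above):
job_state_counts() {
    # count the array tasks per state for the given job ID
    sacct -n -X -j "$1" -o state%20 | sort | uniq -c
}
job_fail_reasons() {
    # treat the non-COMPLETED state values as the failure "reasons"
    sacct -n -X -j "$1" -o state%20 | grep -v COMPLETED | sort | uniq -c
}
They would be called as job_state_counts 885881 and job_fail_reasons 885881.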
If you would like a more elaborate solution, you can have a look at smanage and atools.

Related

Make sacct not truncate the SLURM_ARRAY_TASK_ID

I need to find the taskID of my job array that got a timeout. So I use sacct as follows.
sacct -u <UserID> -j <jobID> -s TIMEOUT
and I get this as output.
User JobID start
---- ----- -----
<UserID> <JobID>_90+ ....
My taskID is a four-digit number that has been truncated by sacct and displayed as 90+ instead. How can I get the full taskID?
The sacct command has a --format parameter that allows customising the columns shown, along with their size.
The following will show the same three columns as your example, with a 30-character wide column for jobid:
sacct -u <UserID> -j <jobID> -s TIMEOUT --format user,jobid%-30,start
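If you only need the list of task IDs that timed out, you can combine this with the -n -X flags from the answer above and strip the job ID prefix (a sketch; the awk split assumes the usual <jobID>_<taskID> form of the JobID column):
sacct -n -X -u <UserID> -j <jobID> -s TIMEOUT --format jobid%-30 | awk -F_ '{print $2}'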

Failed Autosys job names within a list of jobs

I am looking for all the Autosys jobs that are in a failed state by using the command
autorep -J %<<name>>% | grep "FA"
However, I want to get only those failed jobs whose names match a list of IDs.
For example I have 4 failed jobs - job_1, job_2, job_3, job_4
I only want to return the jobs whose names contain 1 and 2. How would I do that?
Please note that the starting part of the job name is not the same in the actual scenario.
I tried to match all failed jobs with characters ABC or DEF in them:
autorep -J ALL | grep -e '\(ABC\|DEF\).*FA'
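Building on that pattern, restricting the output to failed jobs whose names contain 1 or 2 might look like this (a sketch; it assumes autorep prints the job name in the first column and that the FA status appears as its own field, which can vary between Autosys versions):
autorep -J ALL | awk '/ FA / && $1 ~ /(1|2)/ {print $1}'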

Get job number from non-interactive queuing of "at" job

Background
I'm writing a script that occasionally queues jobs via the at command. In order to accomplish this in an automated, non-interactive way, I echo the commands to be executed to a file, i.e.:
echo "ls -la" > cmd.txt
I then schedule the command to run 2 minutes later via:
at -f cmd.txt now + 2 min
Problem
I would like to determine, in an automated, non-interactive, deterministic way, the job number associated with the task my script just queued up. Unfortunately, there doesn't appear to be anything supplied in the return code (i.e. echo $?), nor a CLI command I can issue, that provides me with this. I can always scrape the stdout data, i.e.:
$> A=$(at -f cmd.txt now + 1 min)
warning: commands will be executed using /bin/sh
job 6 at Fri Mar 8 07:18:00 2019
However, I would like to, if possible, use something more canonical/direct than parsing the stdout data, as I want to avoid cases where the stdout varies from one platform to the next (i.e. Linux, BSD, OSX).
Question
How can I directly acquire the job number (in a script) for an at job my script just queued up?
Edit
I have to account for other processes also using the at command concurrently.
Record the at queue state before scheduling, schedule your job, then find the additions:
$ date
Fri Mar 8 10:33:34 EST 2019
$ atq
3 2019-03-08 10:34 a bishop
$ atq > atq.1
$ echo "ls -l" > cmd.txt
$ at -f cmd.txt now + 2 min
job 4 at 2019-03-08 10:36
$ atq > atq.2
$ comm atq.1 atq.2
3 2019-03-08 10:34 a bishop
4 2019-03-08 10:36 a bishop
$ comm -23 atq.1 atq.2 | awk '{print $1}' # completed jobs
3
$ comm -13 atq.1 atq.2 | awk '{print $1}' # added jobs
4
As demonstrated, this is impervious to jobs finishing under you. Of course, if jobs are added in separate processes simultaneously, and you want to exclude those, a different solution would be called for (perhaps by grepping for the user submitting the job, or having different processes submit into separate at -q queues).
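Wrapped into a script, the same idea might look like this (a sketch; it assumes atq prints the job number in the first column, as in the session above, and that no other job is added or removed between the two snapshots):
# snapshot the queue, submit, snapshot again, then diff the job numbers
before=$(atq | awk '{print $1}' | sort)
at -f cmd.txt now + 2 min
after=$(atq | awk '{print $1}' | sort)
new_job=$(comm -13 <(echo "$before") <(echo "$after"))
echo "queued job number: $new_job"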
You could use atq to see the job queue after each submission, and get the job ID from the first column of the last line for any newly submitted job.

Sun Grid Engine: jobs submitted by the qsub command

I am using Sun Grid Engine queuing system.
Assume I submitted multiple jobs using a script that looks like:
#! /bin/bash
for i in 1 2 3 4 5
do
sh qsub.sh python run.py ${i}
done
qsub.sh looks like:
#! /bin/bash
echo cd `pwd` \; "$*" | qsub
Assuming that 5 jobs are running, I want to find out which command each job is executing.
By using qstat -f, I can see which node is running which job ID, but not which specific command each job ID corresponds to.
So, for example, I want to check which jobID=xxxx is running python run.py 3, and so on.
How can I do this?
I think you'll see it if you use qstat -j *. See https://linux.die.net/man/1/qstat-ge .
You could try running array jobs. Array jobs are useful when you have multiple inputs to process in the same way. qstat will identify each task of the array job. See the docs for more information:
http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/commands/qsub.htm#-t
http://wiki.gridengine.info/wiki/index.php/Simple-Job-Array-Howto
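As a sketch of the array-job suggestion under SGE (assuming run.py takes the task index as its only argument; array.sh is a hypothetical file name), the loop could be replaced by a single submission script:
#!/bin/bash
#$ -t 1-5
#$ -cwd
# each task receives its index in SGE_TASK_ID, so one submission covers run.py 1..5
python run.py "$SGE_TASK_ID"
Submit it with qsub array.sh; qstat then shows one job with task IDs 1-5.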

WGET - Simultaneous connections are SLOW

I use the following command to append the responses from a list of URLs to a corresponding output file:
wget -i /Applications/MAMP/htdocs/data/urls.txt -O - \
>> /Applications/MAMP/htdocs/data/export.txt
This works fine and when finished it says:
Total wall clock time: 1h 49m 32s
Downloaded: 9999 files, 3.5M in 0.3s (28.5 MB/s)
In order to speed this up I used:
cat /Applications/MAMP/htdocs/data/urls.txt | \
tr -d '\r' | \
xargs -P 10 $(which wget) -i - -O - \
>> /Applications/MAMP/htdocs/data/export.txt
This opens simultaneous connections, making it a little faster:
Total wall clock time: 1h 40m 10s
Downloaded: 3943 files, 8.5M in 0.3s (28.5 MB/s)
As you can see, it somehow omits more than half of the files and takes approximately the same time to finish. I cannot guess why. What I want to do here is download 10 files at once (parallel processing) using xargs and move on to the next URL when one finishes. Am I missing something, or can this be done another way?
On the other hand, can someone tell me what limit I can set on the number of connections? It would really help to know how many connections my machine can handle without slowing down my system too much, or even risking some kind of system failure.
My API rate limiting is as follows:
Number of requests per minute: 100
Number of mapping jobs in a single request: 100
Total number of mapping jobs per minute: 10,000
Have you tried GNU Parallel? It will be something like this:
parallel -a /Applications/MAMP/htdocs/data/urls.txt wget -O - > result.txt
You can use this to see what it will do without actually doing anything:
parallel --dry-run ...
And either of these to see progress:
parallel --progress ...
parallel --bar ...
As your input file seems to be a bit of a mess, you can strip carriage returns like this:
tr -d '\r' < /Applications/MAMP/htdocs/data/urls.txt | parallel wget {} -O - > result.txt
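If you want to match the 10 simultaneous connections from the xargs attempt, you can cap the number of concurrent jobs with -j (by default GNU Parallel runs one job per CPU core):
parallel -j 10 -a /Applications/MAMP/htdocs/data/urls.txt wget -O - > result.txt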
A few things:
I don't think you need the tr, unless there's something weird about your input file. xargs expects one item per line.
man xargs advises you to "Use the -n option with -P; otherwise chances are that only one exec will be done."
You are using wget -i -, telling wget to read URLs from stdin, but xargs will be supplying the URLs as arguments to wget.
To debug, substitute echo for wget and check how it batches the parameters (see the example after the command below).
So this should work:
cat urls.txt | \
xargs --max-procs=10 --max-args=100 wget --output-document=-
(I've preferred the long options: --max-procs is -P, --max-args is -n.)
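Following the debugging suggestion above, substituting echo for wget shows how xargs groups the URLs into batches of up to 100 before any download happens:
cat urls.txt | xargs --max-procs=10 --max-args=100 echo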
See wget download with multiple simultaneous connections for alternative ways of doing the same thing, including GNU parallel and some dedicated multi-threading HTTP clients.
However, in most circumstances I would not expect parallelising to significantly increase your download rate.
In a typical use case, the bottleneck is likely to be your network link to the server. During a single-threaded download, you would expect to saturate the slowest link in that route. You may get very slight gains with two threads, because one thread can be downloading while the other is sending requests. But this will be a marginal gain.
So this approach is only likely to be worthwhile if you're fetching from multiple servers, and the slowest link in the route to some servers is not at the client end.
