What is the difference between Run all at once and Run in sequence when creating a Cognos job? - cognos

What is the difference between the Run all at once and Run in sequence modes?
If a job fails in either mode, what happens? Will the scheduled report email still be sent?

Let's say your job has 10 reports in it.
In sequence mode, the first report will run. When it's done, the second will run. Then the 3rd, etc.
In "all at once" mode, the job doesn't wait for anything to finish. All reports are submitted at once and begin running.
Provided the server has the appropriate resources to do so, running the job all at once will complete everything sooner, but it does put a much larger strain on the system. I have jobs where running in sequence ends up being faster because running them all at once is such a bottleneck that it slows down everything to a crawl.
Failure depends on what you select. In either case you can have the job stop if any component fails, or continue running the other reports if one fails. However, if any component fails, the job status in your schedule list will still show the overall status of the job as "Failed", and "run history details" will show you exactly where the failure was.
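To make the two modes and the failure options concrete, here is a minimal Python sketch of the idea. It is not Cognos code: the function names, the stop_on_failure flag and the "Succeeded" label are assumptions made for illustration, and each report is modelled as a plain callable that raises on failure.

```python
from concurrent.futures import ThreadPoolExecutor


def run_step(step):
    """Run one step; return True on success, False if it raised."""
    try:
        step()
        return True
    except Exception:
        return False


def run_in_sequence(steps, stop_on_failure=True):
    """Run steps one after another, optionally stopping at the first failure."""
    results = []
    for step in steps:
        ok = run_step(step)
        results.append(ok)
        if not ok and stop_on_failure:
            break  # later steps are never started
    return results


def run_all_at_once(steps):
    """Submit every step immediately; nothing waits for an earlier step."""
    with ThreadPoolExecutor(max_workers=max(1, len(steps))) as pool:
        return list(pool.map(run_step, steps))


def job_status(results):
    """The job as a whole counts as failed if any single step failed."""
    return "Succeeded" if all(results) else "Failed"
```

In this sketch, stopping on failure only makes a difference in sequence mode, because in the concurrent version every step has already been submitted by the time one of them fails.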

Related

Timeout including time in queue (JCL, IBM z/OS)

I need to set a timeout in a JCL step that calls a Unix script through BPXBATCH. I did it with
//STEPX EXEC PGM=BPXBATCH, PARM='sh /x.sh',TIME=(,10)
However, after some time I realized that this does not include the time in the queue. The documentation says: "This run time refers to actual execution time only, and does not include the time that the job spends in the INPUT or INPUT HOLD queues" https://supportline.microfocus.com/documentation/books/rd60/cbwjto.htm
That is Micro Focus JCL, but I verified that the behavior is the same on IBM Z.
So even if I set the timeout to 10 seconds, the step can take several minutes if the queue is busy with other work. I need a timeout that kills the step no matter why it took so long. I haven't been able to find what I need. Please help.
z/OS batch really isn't the best choice for time-critical work. As you figured out, the JCL "TIME" parameter is about CPU time consumption, not an elapsed time control. If this is a business-critical need, then by all means talk to your z/OS administrators - they can certainly configure your system such that your job is very likely to run without delay, but this isn't usually default behavior.
You don't provide a lot of detail as to what else your job might be doing and how it gets submitted. If you have the ability to control how your job is submitted, one option might be to spawn your shell script directly rather than submitting a batch process to run your script.
For example, what you've described is submitting JCL that spawns BPXBATCH, then BPXBATCH spawns your shell script. Instead, you might write a small C program that simply calls "spawn()" to run the shell as a distinct UNIX process - that's not difficult, depending on how you're submitting the JCL you shared. You cut out the need for the batch job - just run your script directly.
If you're running in a TSO environment, the OSHELL command lets you interactively run your script. You can even automate the whole process with a simple REXX script, and none of this requires a pass through a batch initiator.
If your site runs SSH or similar, you might consider launching your script through an SSH command - this even works across a network. SSH lets you launch a shell session and pass a command for execution...again, there's no JCL or input queue here.
If your administrators would allow it, another alternative would be to run your JCL via a "START" command. Unlike batch JCL, when a START command is encountered, the work you're starting runs immediately - there's no input queue for started tasks. Start commands can be issued from JCL too, and since they're issued as the JCL is scanned and not when the job starts, these are fairly immediate too.
Inside your shell script, it's pretty easy to set up an elapsed time limit - there are examples here.
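As an illustration of an elapsed-time (wall-clock) limit, here is a minimal sketch in Python rather than shell, assuming a Python interpreter is available where the script runs. The function name run_with_deadline is made up for this example. Note that it only limits how long the command runs once it has started; it cannot account for time spent waiting in an input queue beforehand.

```python
import subprocess
import sys


def run_with_deadline(command, seconds):
    """Run command; return its exit code, or None if the time limit was hit.

    subprocess.run kills the child process when the timeout expires.
    Processes that the child itself started may need separate handling.
    """
    try:
        completed = subprocess.run(command, timeout=seconds)
        return completed.returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # Hypothetical demo: a child that would sleep 5 seconds, limited to 1.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print("exit code:", run_with_deadline(slow, 1))
```

The limit here is measured in clock time, unlike the TIME parameter on the EXEC statement, which measures CPU time.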
I see a couple of problems in your code...
//STEPX EXEC PGM=BPXBATCH, PARM='sh /x.sh',TIME=(,10)
First, you have a space between BPXBATCH, and PARM= which will not execute your shell script and may result in a JCL error.
Second, you are using the TIME parameter of the EXEC statement, which limits CPU time, yet you reference a desire to cancel the job step if it waits more than some amount of time in the input queue, which is a clock time limitation.
There is no way to cancel the job from the job itself via JCL parameters based on clock time, either including or excluding time spent in the input queue.
If you really need to do this, I suggest you look into capabilities of your shop's job scheduler package. You might want to reexamine why you need to cancel a job if it doesn't run to completion within 10 clock seconds after you submit it.

Run job repeatedly, but with no overlap and not at precise scheduled times

I have a background task that needs to be run repeatedly, every hour or so, sending me an email whenever the task emitted non-trivial output.
I'm currently using cron for that, but it's somewhat ill-suited: it forces me to choose exact times at which the command is run, and it doesn't prevent overlap.
An alternative would be to run the script in a loop with sleep 3600 at the end of each iteration, but this then needs extra work to make sure the script is always restarted after boot and so on.
Ideally, I'd like a cron-like tool where I can give a set of commands to run repeatedly with approximate execution rates and the tool will run them "when convenient" and without overlapping execution of different iterations of a command (or even without overlapping execution of any command).
Short of writing such a tool myself, what would be the recommended approach?
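For the overlap half of the problem, one common pattern is to keep cron and guard the command with a non-blocking file lock, so that a run which starts while the previous one is still going simply exits. The sketch below assumes a POSIX system (it uses fcntl.flock); the function names and the lock path are hypothetical, and it does not address the "approximate times" requirement.

```python
import fcntl
import os


def try_lock(path):
    """Return an open file descriptor holding an exclusive lock, or None."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # LOCK_NB makes this fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def run_exclusively(path, task):
    """Run task only if no other holder of the lock exists.

    Returns True if the task ran, False if it was skipped.
    """
    fd = try_lock(path)
    if fd is None:
        return False
    try:
        task()
        return True
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

Using one lock file per command prevents overlapping iterations of that command; sharing a single lock file across all commands would prevent any two of them from running at the same time.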

Threshold for allowed amount of failed Hyperdrive runs

Because "reasons", we know that when we use azureml-sdk's HyperDriveStep we expect a number of HyperDrive runs to fail -- normally around 20%. How can we handle this without failing the entire HyperDriveStep (and then all downstream steps)? Below is an example of the pipeline.
I thought there would be a HyperDriveRunConfig param to allow for this, but it doesn't seem to exist. Perhaps this is controlled on the Pipeline itself with the continue_on_step_failure param?
The workaround we're considering is to catch the failed run within our train.py script and manually log the primary_metric as zero.
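A rough sketch of that workaround might look like the following. The helper name, the log_metric callable and the metric name "accuracy" are assumptions for illustration; in a real train.py the logger would presumably be something like the log method of the run returned by Run.get_context(). Reporting zero only makes sense if the primary metric is being maximized and zero is a clearly bad value.

```python
def train_or_report_zero(train_fn, log_metric, primary_metric="accuracy"):
    """Run training; on any exception, log the primary metric as 0.0.

    train_fn   -- callable returning the metric value (hypothetical)
    log_metric -- callable taking (name, value), e.g. a run's log method
    """
    try:
        value = train_fn()
    except Exception as exc:
        # Swallow the failure so the child run still completes,
        # but leave a trace of what went wrong in the output.
        print(f"training failed, reporting 0.0: {exc!r}")
        value = 0.0
    log_metric(primary_metric, value)
    return value
```

The trade-off is that failed runs now look like completed runs with a poor score, so they need to be identified from the logs rather than from the run status.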
Thanks for your question.
I'm assuming that HyperDriveStep is one of the steps in your Pipeline and that you want the remaining Pipeline steps to continue when HyperDriveStep fails. Is that correct?
Enabling continue_on_step_failure should allow the rest of the pipeline steps to continue when any single step fails.
Additionally, the HyperDrive run consists of multiple child runs, controlled by the HyperDriveConfig. If the first 3 child runs explored by HyperDrive fail (e.g. with user script errors), the system automatically cancels the entire HyperDrive run, in order to avoid further wasting resources.
Are you looking to continue other Pipeline steps when the HyperDriveStep fails? Or are you looking to continue other child runs within the HyperDrive run when the first 3 child runs fail?
Thanks!

Automatically rerun jobs submitted with sbatch --array upon error

I am submitting jobs in an array. Occasionally one job will error because of a difficult-to-diagnose GPU memory issue. Simply rerunning the job results in success.
What I would like to do is catch this error, log it, and put the job back into Slurm's queue to be rerun. If this is not possible to do with an array job, that's fine; it's not essential to use arrays (though it is preferred).
I've tried playing around with sbatch --rerun, but this doesn't seem to do what I want (I think this option is for rerunning after a hardware error detected by slurm, or a node is restarted when a job is running - this isn't the case for my jobs).
Any advice well received.
If you can detect the GPU memory issue, you can end your submission job with a construct like this:
if <gpu memory issue>; then
    scontrol requeue $SLURM_JOBID
fi
This will put the job back in the scheduling queue and it will be restarted as is. Interestingly, the SLURM_RESTART_COUNT environment variable holds the number of times the job was re-queued.
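Since a job that always hits the error would otherwise be requeued forever, the restart count can be used to cap the retries. The sketch below shows one way to do that in Python; the function names and the max_restarts limit are assumptions, and it treats a missing SLURM_RESTART_COUNT as zero restarts.

```python
import os
import subprocess


def should_requeue(environ, max_restarts=3):
    """Return True while the job has been restarted fewer than max_restarts times."""
    restarts = int(environ.get("SLURM_RESTART_COUNT", "0"))
    return restarts < max_restarts


def requeue_current_job(environ=os.environ, max_restarts=3):
    """Requeue the running job via scontrol unless the retry cap is reached.

    Returns True if a requeue was requested, False if the cap was hit.
    """
    if not should_requeue(environ, max_restarts):
        return False
    subprocess.run(["scontrol", "requeue", environ["SLURM_JOBID"]], check=True)
    return True
```

A job script would call requeue_current_job() at the point where the GPU memory issue has been detected, and log the failure itself when the function returns False.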

How to define frequency of a job in application by users?

I have an application that has to launch jobs repeatedly. But (yes, that would have been too easy without a but...) I would like users to define their backup frequency in the application.
In the worst case, they would have to choose between:
weekly,
daily,
every 12 hours,
every 6 hours,
hourly
In the best case, they should be able to use crontab expressions (see documentation for example).
How do I do this? Do I launch a job every minute that checks the last execution time and the frequency, and then launches another job if needed? Do I create a sort of queue that will be executed by a master job?
Any clues, ideas, opinions, best practices, or experiences are welcome!
EDIT: Solved this problem using the Akka scheduler. OK, this is a technical solution, not a design answer, but still, everything works great.
Each user-defined repetition is an actor that sends a message every period to a new actor, which executes the actual job.
There may be two ways to do this depending on your requirements/architecture:
If you can only use Play:
The user creates the job and the frequency it will run (crontab, whatever).
On saving the job, you calculate the first time it will have to be run. You then add an entry to a table JOBS with the execution time, job id, and any other information required. This is required as Play is stateless and information must be stored in the DB for later retrieval.
You have a job that queries the table for entries whose execution date is earlier than now. It retrieves the first, runs it, removes it from the table and adds a new entry for the next execution. You should keep some execution counter so that if a task fails (which means the entry is not removed from the DB), it won't block execution of the other tasks by having the job try it again and again.
The frequency of this job is set to run every second. That way, while there is information in the table, you should execute the requests about as often as they are required. As Play won't spawn a new job while the current one is working, if you have enough tasks this one job will serve all of them. If not, it will be killed at some point and restored when required.
Of course, the crons of the users will not be too precise, as you have to account for your own cron delays plus execution delays on all the tasks in the queue, which will be run sequentially. This is not the best approach, unless you somehow disallow crons which run every second or more often than every minute (to be safe). Checking the execution time of the crons and killing them if they run over a certain amount of time would be a good idea.
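The table-polling approach above can be sketched as follows. This is an illustration in Python with an in-memory SQLite database, not Play code; the table layout, the function names, the fixed period (instead of a crontab expression) and the max_failures cutoff are all assumptions. It also updates the row in place rather than deleting it and inserting a new one.

```python
import sqlite3


def create_jobs_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " job_id TEXT PRIMARY KEY,"
        " next_run REAL NOT NULL,"
        " period REAL NOT NULL,"
        " failures INTEGER NOT NULL DEFAULT 0)"
    )


def schedule(conn, job_id, first_run, period):
    """Store a job with its first execution time and repeat period (seconds)."""
    conn.execute(
        "INSERT INTO jobs (job_id, next_run, period) VALUES (?, ?, ?)",
        (job_id, first_run, period),
    )


def run_due_jobs(conn, handlers, now, max_failures=3):
    """Run every job whose time has come; return the ids that succeeded.

    Jobs that have failed max_failures times are skipped so that they
    cannot block the others by being retried again and again.
    """
    due = conn.execute(
        "SELECT job_id, period FROM jobs"
        " WHERE next_run <= ? AND failures < ? ORDER BY next_run",
        (now, max_failures),
    ).fetchall()
    ran = []
    for job_id, period in due:
        try:
            handlers[job_id]()
        except Exception:
            conn.execute(
                "UPDATE jobs SET failures = failures + 1 WHERE job_id = ?",
                (job_id,),
            )
            continue
        # Success: plan the next execution and reset the failure counter.
        conn.execute(
            "UPDATE jobs SET next_run = ?, failures = 0 WHERE job_id = ?",
            (now + period, job_id),
        )
        ran.append(job_id)
    conn.commit()
    return ran
```

The polling job would call run_due_jobs with the current time on every tick. Because due jobs run one after another inside that call, a slow task delays the ones behind it, which is the imprecision described above.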
If you can use more than Play:
The better alternative, I believe, is to use Quartz (see this) to create a future execution when the user creates the job, and reprogram it once the execution is over.
There was a discussion on google-groups about it. As far as I remember, you must define a job which starts every 6 hours and checks which backups must be done. So you must remember when the last backup job finished and do the check yourself. I'm unsure if Quartz can handle such a requirement.
I looked in the source code (always a good source ;-)) and found a method every, which I think should do what you want. However, I'm unsure if this is a clever design, because if you have 1000 users you will then have 1000 jobs. I'm unsure if Play was built to handle such a large number of jobs.
[Update] For cron-expressions you should have a look into JobPlugin.scheduleForCRON()
There are several ways to solve this.
If you don't have a really huge load of jobs, I'd just persist them to a table using the required flexibility. Then check all of them every hour (or the lowest interval you support) and run those eligible. Simple.
Or, if you prefer to use cron syntax anyway, just write (export) jobs to a user crontab using a wrapper which calls back to your running app, or starts the job in a standalone process if that's possible.
