JCL job dependency without scheduler - mainframe

I'm trying to implement a JCL, in a JES2 environment, that launches a set of jobs with dependencies in it, for example:
JOB_A -> JOB_B )
JOB_C -> JOB_D ) -> JOB_E
In other words, JOB_E is only launched when JOB_B and JOB_D are finished.
I can launch JOB_B and JOB_D through job internal reader in JOB_A and JOB_C but I can't not create the dependencies for JOB_E.
I tried to explore JCL resource lock so that I could lock a data set in JOB_B and JOB_D that JOB_E needed so that JOB_E would only start when all data set's are available but the JCL only requests data set in STEP level and release them afterwards. If JCL could request all data set before start I could implement some sort of mutex in the JOBs, for example:
JOB_A locks data set DSN_A
JOB_B waits to get data set DSN_A
JOB_C locks data set DSN_C
JOB_D waits to get data set DSN_C
JOB_E waits to get data set DSN_A and DSN_C
How to do this?
I need this to test set of JCL's in a development environment without access to a scheduler.

Your comment that you need this to test in a development environment without access to a scheduler makes me wonder if your shop has a scheduler for the production environment. If it does, then your testing will not actually test what will be used in your production environment. Just something to think about if you haven't already.
In answer to your question, One technique is to use a utility such as IEBGENER in the last step of one job to submit a subsequent job.
For example, The last step of JOB_A would execute IEBGENER with SYSUT1 containing the execution JCL for JOB_B and SYSUT2 pointing at INTRDR. This is one technique you could use, though getting JOB_E to run so that it doesn't interfere with any of the other jobs might be tricky, as JOB_E needs to run after both JOB_B and JOB_D complete.
Another technique would be to use Rexx in batch mode to submit your jobs using the internal reader and then use the SDSF Rexx interface to watch for when they complete. Essentially you will be writing a special-purpose job scheduler, specific to your set of jobs.
Update, ten years later...
As of z/OS 2.2 IBM has added JES2 Execution Control Statements which "define the execution sequencing of a group of jobs and the jobs themselves". Prior to use of this feature, some configuration must be done my your z/OS Systems Programmer.

I'm wondering why to invest precious time to test a set of jobs, where the PROD set is entirely different and will be handled by some xyz scheduler. Don't mind, if I sound crazy but lemme propose mine too:
Assumption: Your jobs take manageable CPU and NO NEED to be run parallely.
A triggers B triggers C triggers D
triggers E (I know its not worthy but
your testing goes fine) I just put it
here by thinking what I would do if I
were you. I mainly need my testing to
go quick and fine. Lemme know your cliche.
Now, lemme appreciate you both for such resolution that we can manage submission of jobs by means of REXX that too creating a virtual and subjective scheduling.

Related

Timeout including time in queue JCL Z os IBM

I need to set a Timeout, in a JCL step that calls a Unix script through bpxbtach. I did it with
//STEPX EXEC PGM=BPXBATCH, PARM='sh /x.sh',TIME=(,10)
However, After some time I realized that does not include the time in the queue. they say " This run time refers to actual execution time only, and does not include the time that the job spends in the INPUT or INPUT HOLD queues" https://supportline.microfocus.com/documentation/books/rd60/cbwjto.htm
That is microfocus JCL, but I verified the behavior is that on IBM Z too.
So even if I set the timeout to 10 seconds, the step can take several minutes if the queue is attending other things. I need a timeout that kills the step no matter the reason it took so long. I haven't been able to find what I need. Please help.
z/OS batch really isn't the best choice for time-critical work. As you figured out, the JCL "TIME" parameter is about CPU time consumption, not an elapsed time control. If this is a business-critical need, then by all means talk to your z/OS administrators - they can certainly configure your system such that your job is very likely to run without delay, but this isn't usually default behavior.
You don't provide a lot of detail as to what else your job might be doing and how it gets submitted. If you have the ability to control how your job is submitted, one option might be to spawn your shell script directly rather than submitting a batch process to run your script.
For example, what you've described is submitting JCL that spawns BPXBATCH, then BPXBATCH spawns your shell script. Instead, you might write a small C program that simply calls "spawn()" to run the shell as a distinct UNIX process - that's not difficult, depending on how you're submitting the JCL you shared. You cut out the need for the batch job - just run your script directly.
If you're running in a TSO environment, the OSHELL command lets you interactively run your script. You can even automate the whole process with a simple REXX script, and none of this requires a pass through a batch initiator.
If your site runs SSH or similar, you might consider launching your script through an SSH command - this even works across a network. SSH lets you launch a shell session and pass a command for execution...again, there's no JCL or input queue here.
If your administrators would allow it, another alternative would be to run your JCL via a "START" command. Unlike batch JCL, when a START command is encountered, the work you're starting runs immediately - there's no input queue for started tasks. Start commands can be issued from JCL too, and since they're issued as the JCL is scanned and not when the job starts, these are fairly immediate too.
Inside your shell script, it's pretty easy to setup an elapsed time limit - there are examples here.
I see a couple of problems in your code...
//STEPX EXEC PGM=BPXBATCH, PARM='sh /x.sh',TIME=(,10)
First, you have a space between BPXBATCH, and PARM= which will not execute your shell script and may result in a JCL error.
Second, you are using the TIME parameter of the EXEC statement, which limits CPU time, yet you reference a desire to cancel the job step if it waits more than some amount of time in the input queue, which is a clock time limitation.
There is no way to cancel the job from the job itself via JCL parameters based on clock time, either including or excluding time spent in the input queue.
If you really need to do this, I suggest you look into capabilities of your shop's job scheduler package. You might want to reexamine why you need to cancel a job if it doesn't run to completion within 10 clock seconds after you submit it.

Which reader, writer, processor should I use for restartable multithreaded spring batch?

I want to create multi threaded step in my restartable job. As I saw many of Spring Batch readers are not thread-safe.
I have some questions related with that;
Is there any readers/writer/processor that can i use for restartable multi threaded step ?
If not, how can I do this without adding processors or status column to table ? Because our all tables are stable and can not be change for this.
I'm doing some researches but I want to ask here also.
Restartability is not possible with a multi-threaded step, because threads can (and will) override each others state in the execution context. That's why the javadoc of stateful readers mentions to disable state management in multi-threaded steps. Here is an excerpt from the JdbcPagingItemReader:
The implementation is thread-safe in between calls to open(ExecutionContext),
but remember to use saveState=false if used in a multi-threaded client
(no restart available)
Same mention in the Javadoc of JpaPagingItemReader and others. The process indicator pattern is the way to go, but you said you can't add the column in your table..
What you can probably try is create a temporary table with the required data + column (in a first step) and remove it (in a final step) once the job is successfully executed. The lifecycle of the temporary table will be the same as the job instance (ie it can survive a job execution failure and be available for the next run). The successful job execution would remove it (this can be done automatically in a final step of the job to avoid manual house-keeping of temporary tables in your db).

Airflow - Locking between tasks so that only one parallel task runs at a time?

I have one DAG that has three task streams (licappts, agents, agentpolicy):
For simplicity I'm calling these three distinct streams. The streams are independent in the sense that just because agentpolicy failed doesn't mean the other two (liceappts and agents) should be affected by the other streams failure.
But for the sourceType_emr_task_1 tasks (i.e., licappts_emr_task_1, agents_emr_task_1, and agentpolicy_emr_task_1) I can only run one of these tasks at a time. For example I can't run agents_emr_task_1 and agentpolicy_emr_task_1 at the same time even though they are two independent tasks that don't necessarily care about each other.
How can I achieve this functionality in Airflow? For now the only thing I can think of is to wrap that task in a script that somehow locks a global variable, then if the variable is locked I'll have the script do a Thread.sleep(60 seconds) or something, and then retry. But that seems very hacky and I'm curious if Airflow offers a solution for this.
I'm open to restructuring the ordering of my DAG if needed to achieve this. One thing I thought about doing was to make a hard coded ordering of
Dag Starts -> ... -> licappts_emr_task_1 -> agents_emr_task_1 -> agentpolicy_emr_task_1 -> DAG Finished
But I don't think combining the streams this way because then for example agentpolicy_emr_task_1 has to wait for the other two to finish before it can start and there could be times when agentpolicy_emr_task_1 is ready to go before the other two have finished their other tasks.
So ideally I want whatever sourceType_emr_task_1 task to start that's ready first and then block the other tasks from running their sourceType_emr_task_1 task until it's finished.
Update:
Another solution I just thought of is if there is a way for me to check on the status of another task I could create a script for sourceType_emr_task_1 that checks to see if any of the other two sourceType_emr_task_1 tasks have a status of running, and if they do it'll sleep and periodically check to see if none of the other's are running, in which case it'll start it's process. I'm not a big fan of this way though because I feel like it could cause a race condition where both read (at the same time) that none are running and both start running.
You could use a pool to ensure the parallelism for those tasks is 1.
For each of the *_emr_task_1 tasks, set a pool kwarg to to be something like pool=emr_task.
Then just go into the webserver -> admin -> pools -> create:
Set the name Pool to match the pool used in your operator, and the Slots to be 1.
This will ensure the scheduler will only allow tasks to be queued for that pool up to the number of slots configured, regardless of the parallelism of the rest of Airflow.

Why in kubernetes cron job two jobs might be created, or no job might be created?

In k8s Cron Job Limitations mentioned that there is no guarantee that a job will executed exactly once:
A cron job creates a job object about once per execution time of its
schedule. We say “about” because there are certain circumstances where
two jobs might be created, or no job might be created. We attempt to
make these rare, but do not completely prevent them. Therefore, jobs
should be idempotent
Could anyone explain:
why this could happen?
what are the probabilities/statistic this could happen?
will it be fixed in some reasonable future in k8s?
are there any workarounds to prevent such a behavior (if the running job can't be implemented as idempotent)?
do other cron related services suffer with the same issue? Maybe it is a core cron problem?
The controller:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/cronjob_controller.go
starts with a comment that lays the groundwork for an explanation:
I did not use watch or expectations. Those add a lot of corner cases, and we aren't expecting a large volume of jobs or scheduledJobs. (We are favoring correctness over scalability.)
If we find a single controller thread is too slow because there are a lot of Jobs or CronJobs, we we can parallelize by Namespace. If we find the load on the API server is too high, we can use a watch and UndeltaStore.)
Just periodically list jobs and SJs, and then reconcile them.
Periodically means every 10 seconds:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/cronjob_controller.go#L105
The documentation following the quoted limitations also has some useful color on some of the circumstances under which 2 jobs or no jobs may be launched on a particular schedule:
If startingDeadlineSeconds is set to a large value or left unset (the default) and if concurrentPolicy is set to AllowConcurrent, the jobs will always run at least once.
Jobs may fail to run if the CronJob controller is not running or broken for a span of time from before the start time of the CronJob to start time plus startingDeadlineSeconds, or if the span covers multiple start times and concurrencyPolicy does not allow concurrency. For example, suppose a cron job is set to start at exactly 08:30:00 and its startingDeadlineSeconds is set to 10, if the CronJob controller happens to be down from 08:29:00 to 08:42:00, the job will not start. Set a longer startingDeadlineSeconds if starting later is better than not starting at all.
Higher level, solving for only-once in a distributed system is hard:
https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/
Clocks and time synchronization in a distributed system is also hard:
https://8thlight.com/blog/rylan-dirksen/2013/10/04/synchronization-in-a-distributed-system.html
To the questions:
why this could happen?
For instance- the node hosting the CronJobController fails at the time a job is supposed to run.
what are the probabilities/statistic this could happen?
Very unlikely for any given run. For a large enough number of runs, very unlikely to escape having to face this issue.
will it be fixed in some reasonable future in k8s?
There are no idemopotency-related issues under the area/batch label in the k8s repo, so one would guess not.
https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Aarea%2Fbatch
are there any workarounds to prevent such a behavior (if the running job can't be implemented as idempotent)?
Think more about the specific definition of idempotent, and the particular points in the job where there are commits. For instance, jobs can be made to support more-than-once execution if they save state to staging areas, and then there is an election process to determine whose work wins.
do other cron related services suffer with the same issue? Maybe it is a core cron problem?
Yes, it's a core distributed systems problem.
For most users, the k8s documentation gives perhaps a more precise and nuanced answer than is necessary. If your scheduled job is controlling some critical medical procedure, it's really important to plan for failure cases. If it's just doing some system cleanup, missing a scheduled run doesn't much matter. By definition, nearly all users of k8s CronJobs fall into the latter category.

How to define frequency of a job in application by users?

I have an application that has to launch jobs repeatingly. But (yes, that would have been to easy without a but...) I would like users to define their backup frequency in application.
In worst case, they would have to choose between :
weekly,
daily,
every 12 hours,
every 6 hours,
hourly
In best case, they should be able to use crontab expressions (see documentation for example)
How to do this? Do I launch a job every minutes that check for last execution time, frequency and then launches another job if needed? Do I create a sort of queue that will be executed by a masterjob?
Any clues, ideas, opinions, best pratices, experiences are welcome!
EDIT : Solved this problem using Akka scheduler. Ok, this is a technical solution not a design answer but still everything works great.
Each user defined repetition is an actor that send messages every period to a new actor to execute the actual job.
There may be two ways to do this depending on your requirements/architecture:
If you can only use Play:
The user creates the job and the frequency it will run (crontab, whatever).
On saving the job, you calculate the first time it will have to be run. You then add an entry to a table JOBS with the execution time, job id, and any other information required. This is required as Play is stateless and information must be stored in the DB for later retrieval.
You have a job that queries the table for entries whose execution date is less than now. Retrieves the first, runs it, removes it from the table and adds a new entry for next execution. You should keep some execution counter so if a task fails (which means the entry is not removed from DB) it won't block execution of the other tasks by the job trying again and again.
The frequency of this job is set to run every second. That way while there is information in the table, you should execute the request around as often as they are required. As Play won't spawn a new job while the current one is working if you have enough tasks this one job will serve all. If not, it will be killed at some point and restored when required.
Of course, the crons of the users will not be too precise, as you have to account for you own cron delays plus execution delays on all the tasks in queue, which will be run sequentially. Not the best approach, unless you somehow disallow crons which run every second or more often than every minute (to be safe). Doing a check on execution time of the crons to kill them if they are over a certain amount of time would be a good idea.
If you can use more than Play:
The better alternative I believe is to use Quartz (see this) to create a future execution when the user creates the job, and reproram it once the execution is over.
There was a discussion on google-groups about it. As far as I remember you must define a job which start every 6 hours and check which backups must be done. So you must remember when the last backup job was finished and make the control yourself. I'm unsure if Quartz can handle such a requirement.
I looked in the source-code (always a good source ;-)) and found a method every, where I think this should be do what you want. How ever I'm unsure if this is a clever design, because if you have 1000 user you will have then 1000 Jobs. I'm unsure if Play was build to handle such a large number of jobs.
[Update] For cron-expressions you should have a look into JobPlugin.scheduleForCRON()
There are several ways to solve this.
If you don't have a really huge load of jobs, I'd just persist them to a table using the required flexibility. Then check all of them every hour (or the lowest interval you support) and run those eligible. Simple.
Or, if you prefer to use cron syntax anyway, just write (export) jobs to a user crontab using a wrapper which calls back to your running app, or starts the job in a standalone process if that's possible.

Resources