Scheduling for Spark jobs on Bluemix - apache-spark

I'm trying to run my Spark application on Bluemix on a schedule. For now I schedule the spark-submit.sh script locally on my machine, but I'd like to use Bluemix for this purpose. Is there any way to set up scheduling directly inside the Bluemix infrastructure for running Spark notebooks or Spark applications?

The Bluemix OpenWhisk offering provides an easy way to run actions on a periodic schedule, similar to cron jobs.
Overview of OpenWhisk-based solution
OpenWhisk provides a programming model based on actions, triggers, and rules. For this use case, you would:
Create an action that kicks off your spark job.
Use the /whisk.system/alarms package to arrange for triggers to arrive periodically according to your schedule.
Create a rule that declares that your action should fire whenever a trigger event occurs.
Your action can be coded in JavaScript if it's easy to kick off your job from a JavaScript function. If not, and you'd like your action to be implemented by a shell script, you can use whisk Docker actions to manage your shell script as an action.
Using the /whisk.system/alarms package to generate events on a schedule
This page in the whisk docs includes a detailed description of how to accomplish this. Briefly:
The /whisk.system/alarms/alarm feed configures the Alarm service to fire a trigger event at a specified frequency. The parameters are as follows:
cron: A string, based on the Unix crontab syntax, that indicates when to fire the trigger in Coordinated Universal Time (UTC). The string is a sequence of six fields separated by spaces: X X X X X X. For more details on using cron syntax, see: https://github.com/ncb000gt/node-cron. Here are some examples of the frequency indicated by the string:
* * * * * *: every second.
0 * * * * *: top of every minute.
0 0 * * * *: top of every hour.
0 0 9 8 * *: at 9:00:00AM (UTC) on the eighth day of every month
trigger_payload: The value of this parameter becomes the content of the trigger every time the trigger is fired.
maxTriggers: Stop firing triggers when this limit is reached. Defaults to 1000.
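Because the seconds field comes first in this six-field format, it is easy to misread an expression. The following is a minimal sketch of a matcher for such expressions, written only for illustration: it handles `*`, `*/n`, and plain numbers, requires every field to match, and is not the parser the Alarm service uses. The function names are hypothetical.

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """Match one cron field supporting '*', '*/n', and a plain number."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value == int(field)


def matches(expr: str, when: datetime) -> bool:
    """Check a six-field expression: second minute hour day-of-month month day-of-week."""
    sec, minute, hour, dom, month, dow = expr.split()
    # cron counts Sunday as 0; Python's isoweekday() counts Sunday as 7.
    weekday = when.isoweekday() % 7
    return (
        field_matches(sec, when.second)
        and field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(dom, when.day)
        and field_matches(month, when.month)
        and field_matches(dow, weekday)
    )
```

With this sketch, `0 0 * * * *` matches only at second 0 of minute 0 of each hour, whereas `* 0 * * * *` matches every second during minute 0.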
Here is an example of creating a trigger that will be fired once every 20 seconds with name and place values in the trigger event.
$ wsk trigger create periodic --feed /whisk.system/alarms/alarm --param cron '*/20 * * * * *' --param trigger_payload '{"name":"Odin","place":"Asgard"}'
Each generated event will include as parameters the properties specified in the trigger_payload value. In this case, each trigger event will have parameters name=Odin and place=Asgard.
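For illustration, here is a minimal sketch of an action that reads those parameters, assuming a Python runtime is available in your OpenWhisk deployment (the answer above only mentions JavaScript and Docker actions). The default values and the returned message are made up for the example, and the step that actually starts the Spark job is left out because the source does not say how the job is submitted.

```python
def main(params: dict) -> dict:
    """Entry point OpenWhisk calls with the trigger parameters as a dict."""
    name = params.get("name", "stranger")
    place = params.get("place", "somewhere")
    # A real action would start the Spark job here (for example via an HTTP call);
    # that part is omitted because the submission API is not given in the source.
    return {"message": f"Triggered by {name} from {place}"}
```

When a rule connects the periodic trigger to this action, each firing would call main with the trigger_payload values as params.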

Related

Bull MQ repeatable job not triggering

This question is a continuation of the thread "Repeatable jobs not getting triggered at given cron timing in Bull".
I am also facing the same problem. How should I specify the timezone? I tried to specify as
repeat: { cron: '* 7 14 * * *', tz: 'Europe/Berlin'}
Meaning: trigger the job at 14:07 in the German time zone. The job is listed in the queue, but it is not triggered.
I also tried repeat:
{
cron: '* 50 15 * * *',
offset: datetime.getTimezoneOffset(),
tz: 'Europe/Berlin'
}
I finally figured out the solution.
One thing to note is that I had not initialized a QueueScheduler instance. Of course, the timezone also plays a crucial role. But without a QueueScheduler instance (which has the same name as the Queue), the jobs don't get added to the queue. The QueueScheduler instance acts as a bookkeeper. Also take care with one more important parameter, "limit". If you don't set the limit to 1, the job scheduled at a particular time will be triggered an unlimited number of times. (This is likely because the leading * in the six-field expression matches every second of that minute; putting 0 in the seconds field is another way to avoid it.)
For example, to run a job at 22:30 German time every day, the configuration would look like:
repeat: {
cron: '* 30 22 * * *',
offset: datetime.getTimezoneOffset(),
tz: 'Europe/Berlin',
limit: 1
}
Reference: https://docs.bullmq.io/guide/queuescheduler. The documentation at this link clearly mentions that the QueueScheduler instance does the bookkeeping of the jobs.
At https://docs.bullmq.io/guide/jobs/repeatable, the documentation specifically warns us to make sure that we instantiate a QueueScheduler instance.
You need to manage repeatable queues with the help of QueueScheduler. QueueScheduler takes the queue name as its first parameter and the connection as its second. The code will be as follows:
const queueScheduler = new QueueScheduler(yourQueue.name, { connection });

Combine two cron-scheduling intervals in a single DAG

Rewrite of the question:
Using airflow, I would like to schedule a process to run every two hours from 2 till 10 am and a single time at 22:30. The schedule_interval parameter accepts a cron-expression, but it is not possible to define a single cron-expression to achieve the above scheduling. Currently, I did:
dag = DAG(process_name, schedule_interval='30 2,4,6,8,10,12,14,16,18,20,22,23 * * *', default_args=default_args)
But this executes the process at 30 minutes past the hour, every two hours from 2 until 22 and again at 23.
Is there a way I can combine two cron-schedules in Airflow?
0 2-10/2 * * *
30 22 * * *
Original question:
I have 2,4,6,10,12,14,16,18,20,22 00 * *
I need to have 23, 30 in my schedule, but I don't want 2-22 to be run at the 30 min interval.
So, I realized, it is not possible!
You can't use two cron expressions for the same DAG (Might change in the future if PR: Add support for multiple cron expressions in schedule_interval is accepted)
Starting with Airflow 2.2.0:
It is possible to get custom time-based triggering by writing a custom Timetable that schedules the DAG the way you expect.
To do so, you need to define the scheduling logic by implementing the next_dagrun_info and infer_manual_data_interval functions. Airflow will use this logic to schedule your DAG.
An example can be found here.
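The core of such a timetable is deciding which run comes next across both schedules. The sketch below shows only that decision in plain Python, for the two expressions in the question (0 2-10/2 * * * and 30 22 * * *). It does not use the Airflow Timetable API, the names RUN_TIMES and next_run are hypothetical, and it works on naive datetimes without time zone handling.

```python
from datetime import datetime, time, timedelta

# Daily run times: every two hours from 02:00 to 10:00, plus 22:30.
RUN_TIMES = [time(h, 0) for h in range(2, 11, 2)] + [time(22, 30)]


def next_run(after: datetime) -> datetime:
    """Return the first scheduled run strictly after the given moment."""
    for day_offset in (0, 1):
        day = (after + timedelta(days=day_offset)).date()
        for run_time in RUN_TIMES:
            candidate = datetime.combine(day, run_time)
            if candidate > after:
                return candidate
    raise AssertionError("unreachable: the following day always has a later run")
```

A real implementation would wrap logic like this inside next_dagrun_info and return the data interval objects Airflow expects.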

Can you programmatically switch serverless cron functions on/off

Scenario:
I want function A to run every minute, but not 24/7. More like 5-10 hours per week. However, a simple cron outlining these times will not do here because the 5-10 hours per week are dynamic and keep changing.
Function B will run e.g. every 30 minutes and determine whether Function A should be running or not. If so, it will switch it 'on', if not, it will switch it 'off'
Is this doable using Serverless.com (or any of the FAAS providers it uses)?
Thanks in advance!
Solution #1: Use s3 to save switch state
You can have the second function write the state of the switch (ON or OFF) to a file on S3.
Schedule the first function to run every minute, but make sure it checks the content of the "switch file" on S3 before it starts executing its logic.
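A minimal sketch of that check, assuming the switch file contains just the text ON or OFF. The read_switch callable is a hypothetical stand-in for whatever code fetches the file from S3, so the example stays independent of any SDK.

```python
from typing import Callable


def run_if_enabled(read_switch: Callable[[], str], job: Callable[[], None]) -> bool:
    """Run job only when the stored switch state is ON; return whether it ran."""
    state = read_switch().strip().upper()
    if state != "ON":
        return False  # switch is off: exit early without doing any work
    job()
    return True
```

Function A would call run_if_enabled at the top of its handler, so an OFF switch costs only the read.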
Cost
It won't cost you a lot: 60 times an hour * 24 hours a day * 31 days a month = 44,640 calls / month. If reading the flag takes an extra 100 ms and you've set the memory to 1 GB, this translates to 44,640 * (0.00001667 per GB-second / 10, since 100 ms is a tenth of a second) = $0.07441488 / month.
In addition, 44,640 S3 GET requests (0.001 per 1,000 requests) cost 44,640 * (0.001 / 1000) = $0.04464 / month.
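The arithmetic can be checked with a short sketch. The prices are the ones quoted in this answer, not current list prices, and the function name is made up for the example.

```python
CALLS_PER_MONTH = 60 * 24 * 31        # one call per minute for 31 days
PRICE_PER_GB_SECOND = 0.00001667      # compute price quoted in the answer
PRICE_PER_1000_GETS = 0.001           # S3 GET price quoted in the answer


def extra_monthly_cost(read_seconds: float = 0.1, memory_gb: float = 1.0) -> float:
    """Added cost of reading the switch file on every invocation."""
    compute = CALLS_PER_MONTH * read_seconds * memory_gb * PRICE_PER_GB_SECOND
    requests = CALLS_PER_MONTH / 1000 * PRICE_PER_1000_GETS
    return compute + requests
```

With the default arguments this adds up to roughly $0.12 per month for compute and requests combined.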
Solution #2: Control the cron of func1 from func2
In function 2, using the AWS CloudWatch Events API, you can create or update the rule's ScheduleExpression (e.g. "cron(* * * * ? *)") that triggers function 1. Read more here
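A related option, which is a variation on this answer and not what it describes literally, is to leave the schedule expression alone and enable or disable the rule. The sketch below assumes events_client is a boto3 CloudWatch Events client (boto3.client("events")) passed in by the caller, and the rule name is hypothetical.

```python
def set_schedule_enabled(events_client, rule_name: str, enabled: bool) -> None:
    """Enable or disable the scheduled rule that triggers function 1."""
    if enabled:
        events_client.enable_rule(Name=rule_name)
    else:
        events_client.disable_rule(Name=rule_name)
```

Function 2 would call this every 30 minutes with the result of its own decision logic. Passing the client in keeps the function easy to test with a stub.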

Azure function CRON Schedule expression

I want my function app to run from 11:15 PM until 1:15 AM the next day (that is, start Monday at 11:15 PM and end Tuesday at 1:15 AM), triggering every minute during that time.
Can anyone help me to write this CRON expression?
The timer triggers are designed for a single repeating interval. The only way to do this completely within a Function is to run the trigger once per minute, then abort if the current time isn't in the desired target time period.
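The "abort if outside the window" check can be sketched as follows. This is only an illustration of the time comparison, not Azure Functions code. It assumes the window is 23:15 to 01:15 every night in whatever time zone now is expressed in, and the names are hypothetical.

```python
from datetime import datetime, time

WINDOW_START = time(23, 15)
WINDOW_END = time(1, 15)


def in_window(now: datetime) -> bool:
    """True when now falls in the 23:15-01:15 window, which wraps past midnight."""
    current = now.time()
    # The window crosses midnight, so it is the union of two same-day ranges.
    return current >= WINDOW_START or current <= WINDOW_END
```

The once-per-minute function would return immediately whenever in_window is false. Restricting the window to specific weekdays would need an additional check on the date.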
Alternately, put your logic into an HTTP trigger configured to act as a webhook, then use an Azure scheduler to configure start and stop times and intervals.
You won't be able to use the scheduler free plan since it can only run once per hour, but the standard plan can run once per minute. Scheduler pricing here.
I think you'll have to do this in three lines:
15-59 23 * * *
* 0 * * *
0-15 1 * * *
This will run it from 23:15-23:59, then 00:00-00:59, then 1:00-1:15.

How do I run the camel scheduled jobs with quartz

I'm using the Camel framework to declare some scheduled jobs with Quartz. I want to execute my class every two seconds.
So, I have mentioned this:
quartz2://quartzScheduler/Processor?cron=0/2+*+*+*+*+?
But it's not executing.
The first six fields in a Quartz cron expression are not optional, so your URL should probably be:
quartz2://quartzScheduler/Processor?cron=0/2 * * * * *
