Schedule jobs dynamically with Flask APScheduler

I'm trying the code given at advanced.py with the following modification for adding new jobs.
I created a route for adding a new job
@app.route('/add', methods=['POST'])
def add_to_schedule():
    data = request.get_json()
    print(data)
    d = {
        'id': 'job' + str(random.randint(0, 100)),
        'func': 'flask_aps_code:job1',
        'args': (random.randint(200, 300), random.randint(200, 300)),
        'trigger': 'interval',
        'seconds': 10
    }
    print(app.config['JOBS'])
    scheduler.add_job(id='job' + str(random.randint(0, 100)), func=job1)
    # app.config.from_object(app.config['JOBS'].append(d))
    return str(app.config['JOBS']), 200
I tried adding the jobs to config['JOBS'] as well as via scheduler.add_job, but none of my new jobs get executed. Additionally, my first scheduled job doesn't get executed until I do a Ctrl+C in the terminal, after which it seems to execute twice. What am I missing?
Edit: The job running twice seems to be caused by Flask reloading, so ignore that.
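A hedged sketch of the usual fix, reusing the question's app, scheduler and job1 (so not a verified answer): jobs added at runtime need their trigger details passed to scheduler.add_job directly, since app.config['JOBS'] is normally only read when the scheduler is initialised, and disabling the reloader avoids things being started twice.
import random

@app.route('/add', methods=['POST'])
def add_to_schedule():
    job_id = 'job' + str(random.randint(0, 100))
    # pass the trigger information directly; appending to app.config['JOBS']
    # has no effect once the scheduler is already running
    scheduler.add_job(
        id=job_id,
        func=job1,  # the callable itself, or 'flask_aps_code:job1'
        args=(random.randint(200, 300), random.randint(200, 300)),
        trigger='interval',
        seconds=10,
    )
    return job_id, 200

if __name__ == '__main__':
    scheduler.start()
    # use_reloader=False keeps Flask's reloader from starting the scheduler twice
    app.run(use_reloader=False)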

Related

How do I check if all my tasks in an Airflow DAG were successful?

I need to check if all the tasks of my DAG were marked as successful, so that the last task of the DAG can send me an email notifying me whether all were successful or any failed.
Here is a piece of code that I tried:
dag_runs = DagRun.find(dag_id=self.dagId)
for dag_run in dag_runs:
    if dag_run.state == 'success':
        body = f'\nHello , \nHere is the values for the pipeline {self.dagId}: \ncount of lines is {new_lines}, \nMax date is {new_date}. \nRegards!'
    else:
        body = f'\nHello \nYour dag {self.dagId} has been Failed'

email_text = """\
Subject: %s
\nFrom: %s
\nTo: %s
\n%s
""" % (subject, sent_from, self.to, body)

try:
    smtp_server = smtplib.SMTP_SSL('smtp.gmail.com', 465)
    smtp_server.ehlo()
    smtp_server.login(self.gmail_user, self.gmail_password)
    smtp_server.sendmail(sent_from, self.to, email_text)
    smtp_server.close()
    print("Email sent successfully!")
except Exception as ex:
    print("Something went wrong….", ex)
I'm unable to check whether the DAG state is success, so I want to check whether the state of all tasks is success.
Thanks in advance for the help and advice.
We had a similar use case where we wanted to identify whether all the tasks were successful. In Airflow, if a task fails and we have a trigger_rule of one_failed, the DAG run can end up being marked as successful because there was a recovery from the failure.
The solution we implemented, with a single email to track all the task instances:
from airflow.models.dagrun import DagRun
from airflow.models.taskinstance import TaskInstance
from airflow.operators.python_operator import PythonOperator


def check_all_success(**context):
    dr: DagRun = context["dag_run"]
    ti: TaskInstance = context["ti"]
    # here we remove the task currently executing this logic
    ti_summary = {task.state for task in dr.get_task_instances() if task.task_id != ti.task_id}
    # Remove the success state (discard avoids a KeyError if no task succeeded)
    ti_summary.discard('success')
    # If the summary has any state left other than success, there was an issue in the run
    if ti_summary:
        # Send email: All tasks in DAG: {dr.dag_id} did not complete successfully
        pass
    else:
        # Send email: All tasks in DAG: {dr.dag_id} completed successfully
        pass


check_all_tasks = PythonOperator(
    task_id='check_all_tasks',
    python_callable=check_all_success,
    provide_context=True
)
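One assumption worth making explicit (it is not stated in the answer above): for this pattern to report failures, check_all_tasks has to run even when an upstream task fails, which usually means giving it trigger_rule='all_done' instead of the default all_success. A hedged variant:
check_all_tasks = PythonOperator(
    task_id='check_all_tasks',
    python_callable=check_all_success,
    provide_context=True,
    # run even if an upstream task failed, so the failure email can be sent
    trigger_rule='all_done',
)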
By default, every upstream task in Airflow must succeed for the next task to start running. So if your email task is the last task in your DAG, that automatically means all previous tasks have succeeded.
Alternatively, you could configure on_success_callback and on_failure_callback on your DAG, which executes a given callable. This passes in arguments to determine whether the DAG run failed or succeeded:
def email(dagrun: DagRun, success: bool, reason: str, session: Session):
    # send email here...
success is a boolean value which indicates DAG Run success/failure.

Run a Databricks job from a notebook

I want to know if it is possible to run a Databricks job from a notebook using code, and how to do it.
I have a job with multiple tasks and many contributors, and we have a job created to execute it all. Now we want to run the job from a notebook to test new features without creating a new task in the job, and also to run the job multiple times in a loop, for example:
for i in [1, 2, 3]:
    run job with parameter i
Regards
What you need to do is the following:
Install the databricksapi package: %pip install databricksapi==1.8.1
Create your job and return an output. You can do that by exiting the notebook like this:
import json
dbutils.notebook.exit(json.dumps({"result": f"{_result}"}))
If you want to pass a DataFrame, you have to pass it as a JSON dump too; there is official Databricks documentation about that, check it out.
Get the job id; you will need it later. You can get it from the job details in Databricks.
In the executor notebook you can use the following code:
import json
import time

# Jobs comes from the databricksapi package installed above


def run_ks_job_and_return_output(params):
    context = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
    # pull the workspace URL and API token from the notebook context
    url = context['extraContext']['api_url']
    token = context['extraContext']['api_token']

    jobs_instance = Jobs.Jobs(url, token)  # initialize a jobs_instance
    runs_job_id = jobs_instance.runJob(****************, 'notebook',
                                       params)  # **** is the job id

    run_is_not_completed = True
    while run_is_not_completed:
        current_run = [run for run in jobs_instance.runsList('completed')['runs']
                       if run['run_id'] == runs_job_id['run_id']
                       and run['number_in_job'] == runs_job_id['number_in_job']]
        if len(current_run) == 0:
            time.sleep(30)
        else:
            run_is_not_completed = False
            current_run = current_run[0]
            print(f"Result state: {current_run['state']['result_state']}, "
                  f"You can check the resulted output in the following link: {current_run['run_page_url']}")
            note_output = jobs_instance.runsGetOutput(runs_job_id['run_id'])['notebook_output']
    return note_output
run_ks_job_and_return_output({'parm1': 'george',
                              'variable': "values1"})
If you want to run the job many times in parallel, you can do the following. (First make sure that you have increased the maximum concurrent runs in the job settings.)
from multiprocessing.pool import ThreadPool

pool = ThreadPool(1000)
results = pool.map(
    lambda j: run_ks_job_and_return_output({'table': 'george',
                                            'variable': "values1",
                                            'j': j}),
    [str(x) for x in range(2, len(snapshots_list))]
)
There is also the possibility to save the whole HTML output, but maybe you are not interested in that. In any case, I will answer that in another post on Stack Overflow.
Hope it helps.
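For reference, the same run-and-poll idea can also be sketched directly against the Databricks Jobs REST API (2.1) with plain requests, if you prefer not to depend on the databricksapi package. This is only a sketch: host, token and job_id are placeholders you must supply, and error handling is omitted.
import time
import requests

def run_job_via_rest(host, token, job_id, notebook_params):
    """Trigger a job run and poll until it reaches a terminal state."""
    headers = {"Authorization": f"Bearer {token}"}
    # trigger the run
    resp = requests.post(
        f"{host}/api/2.1/jobs/run-now",
        headers=headers,
        json={"job_id": job_id, "notebook_params": notebook_params},
    )
    resp.raise_for_status()
    run_id = resp.json()["run_id"]
    # poll until the run finishes
    while True:
        run = requests.get(
            f"{host}/api/2.1/jobs/runs/get",
            headers=headers,
            params={"run_id": run_id},
        ).json()
        state = run["state"]
        if state.get("life_cycle_state") in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            # result_state is e.g. SUCCESS or FAILED; run_page_url links to the run
            return state.get("result_state"), run.get("run_page_url")
        time.sleep(30)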
You can use the following steps:
Note-01:
dbutils.widgets.text("foo", "fooDefault", "fooEmptyLabel")
dbutils.widgets.text("foo2", "foo2Default", "foo2EmptyLabel")
result = dbutils.widgets.get("foo")+"-"+dbutils.widgets.get("foo2")
def display():
print("Function Display: "+result)
dbutils.notebook.exit(result)
Note-02:
thislist = ["apple", "banana", "cherry"]
for x in thislist:
    dbutils.notebook.run("Note-01 path", 60, {"foo": x, "foo2": 'Azure'})
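A small hedged extension of Note-02: dbutils.notebook.run returns whatever string Note-01 passed to dbutils.notebook.exit, so the loop can also collect the results.
thislist = ["apple", "banana", "cherry"]
results = []
for x in thislist:
    # the return value is the string passed to dbutils.notebook.exit in Note-01
    results.append(dbutils.notebook.run("Note-01 path", 60, {"foo": x, "foo2": "Azure"}))
print(results)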

Python asyncio wait() with cumulative timeout

I am writing a job scheduler where I schedule M jobs across N co-routines (N < M). As soon as one job finishes, I add a new job so that it can start immediately and run in parallel with the other jobs. Additionally, I would like to ensure that no single job takes more than a certain fixed amount of time. Any jobs that take too long should be cancelled. I have something pretty close, like this:
def update_run_set(waiting, running, max_concurrency):
    number_to_add = min(len(waiting), max_concurrency - len(running))
    for i in range(0, number_to_add):
        next_one = waiting.pop()
        running.add(next_one)


async def _run_test_invocations_asynchronously(jobs: List[MyJob], max_concurrency: int, timeout_seconds: int):
    running = set()  # These tasks are actively being run
    waiting = set()  # These tasks have not yet started
    waiting = {_run_job_coroutine(job) for job in jobs}
    update_run_set(waiting, running, max_concurrency)
    while len(running) > 0:
        done, running = await asyncio.wait(running, timeout=timeout_seconds,
                                           return_when=asyncio.FIRST_COMPLETED)
        if not done:
            timeout_count = len(running)
            [r.cancel() for r in running]  # Start cancelling the timed out jobs
            done, running = await asyncio.wait(running)  # Wait for cancellation to finish
            assert(len(done) == timeout_count)
            assert(len(running) == 0)
        else:
            for d in done:
                job_return_code = await d
        if len(waiting) > 0:
            update_run_set(waiting, running, max_concurrency)
            assert(len(running) > 0)
The problem here is that say my timeout is 5 seconds, and I'm scheduling 3 jobs across 4 cores. Job A takes 2 seconds, Job B takes 6 seconds and job C takes 7 seconds.
We have something like this:
t=0 t=1 t=2 t=3 t=4 t=5 t=6 t=7
-------|-------|-------|-------|-------|-------|-------|-------|
AAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
However, at t=2 the asyncio.wait() call returns because A completed. It then loops back up to the top and runs again. At this point B has already been running for 2 seconds, but since the countdown starts over and B only has 4 seconds remaining until it completes, B will appear to be successful. So after 4 seconds we return again, B is successful; then we start the loop over and now C completes.
How do I make it so that B and C both fail? I somehow need the time to be preserved across calls to asyncio.wait().
One idea that I had is to do my own bookkeeping of how much time each job is allowed to continue running, and pass the minimum of these into asyncio.wait(). Then when something times out, I can cancel only those jobs whose time remaining was equal to the value I passed in for timeout_seconds.
This requires a lot of manual bookkeeping on my part though, and I can't help but wonder about floating-point problems (which could cause me to decide that it's not time to cancel a job even though it really is). So I can't help but think that there's something easier. I would appreciate any ideas.
You can wrap each job into a coroutine that checks its timeout, e.g. using asyncio.wait_for. Limiting the number of parallel invocations could be done in the same coroutine using an asyncio.Semaphore. With those two combined, you only need one call to wait() or even just gather(). For example (untested):
import asyncio

# Run the job, limiting concurrency and time. This code could likely
# be part of _run_job_coroutine, omitted from the question.
async def _run_job_with_limits(job, sem, timeout):
    async with sem:
        try:
            await asyncio.wait_for(_run_job_coroutine(job), timeout)
        except asyncio.TimeoutError:
            # timed out and canceled, decide what you want to return
            pass


async def _run_test_invocations_async(jobs, max_concurrency, timeout):
    sem = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(
        *(_run_job_with_limits(job, sem, timeout) for job in jobs)
    )
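A hedged usage sketch of the above (jobs and _run_job_coroutine come from the question and are not defined here): each job now carries its own timeout budget, so B and C in the example are cancelled after 5 seconds regardless of when A finishes.
import asyncio

async def main(jobs):
    # at most 4 jobs in flight, with a 5-second cap per job
    return await _run_test_invocations_async(jobs, max_concurrency=4, timeout=5)

# results = asyncio.run(main(jobs))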

Scheduling a task at multiple timings (with different parameters) using Celery beat, but the task runs only once (with random parameters)

What I am trying to achieve
Write a scheduler that uses a database to schedule similar tasks at different timings.
For this I am using Celery beat; the code snippet below gives an idea:
try:
    reader = MongoReader()
except:
    raise

try:
    tasks = reader.get_scheduled_tasks()
except:
    raise

celerybeat_schedule = dict()
for task in tasks:
    celerybeat_schedule[task["task_id"]] = dict()
    celerybeat_schedule[task["task_id"]]["task"] = task["task_name"]
    celerybeat_schedule[task["task_id"]]["args"] = (task,)
    celerybeat_schedule[task["task_id"]]["schedule"] = get_task_schedule(task)

app.conf.update(BROKER_URL=rabbit_mq_endpoint, CELERY_TASK_SERIALIZER='json',
                CELERY_ACCEPT_CONTENT=['json'], CELERYBEAT_SCHEDULE=celerybeat_schedule)
So these are the three steps:
- reading all tasks from the datastore
- creating a dictionary, the Celery beat schedule, populated with each task's properties: task_name (the method to run), args (the data to pass to the method), and schedule (when to run)
- updating the Celery configuration with this schedule
Expected scenario
Given that all entries run the same Celery task (which just prints), share the same schedule (run every 5 minutes), and differ only in the parameter that says what to print, let's say the DB has:
task name       parameter   schedule
regular_print   Hi          {"minutes": 5}
regular_print   Hello       {"minutes": 5}
regular_print   Bye         {"minutes": 5}
I expect all three to print every 5 minutes.
What happens
Only one of Hi, Hello, Bye prints (possibly at random, certainly not in sequence).
Please help,
Thanks a lot in advance :)
I was able to resolve this using Celery version 4. Below is a sample similar to what worked for me; you can also find it in the Celery documentation for version 4.
import os
from datetime import timedelta

from celery import Celery

# taking the address and user/pass from the environment (you can use direct values instead)
ex_host_queue = os.environ["EX_HOST_QUEUE"]
ex_port_queue = os.environ["EX_PORT_QUEUE"]
ex_user_queue = os.environ["EX_USERID_QUEUE"]
ex_pass_queue = os.environ["EX_PASSWORD_QUEUE"]
broker = "amqp://" + ex_user_queue + ":" + ex_pass_queue + "@" + ex_host_queue + ":" + ex_port_queue + "//"

# celery initialization
app = Celery(__name__, backend=broker, broker=broker)
app.conf.task_default_queue = 'scheduler_queue'
app.conf.update(
    task_serializer='json',
    accept_content=['json'],  # Ignore other content
    result_serializer='json'
)

task = {"task_id": 1, "a": 10, "b": 20}


# method to update the scheduler
def add_scheduled_task(task):
    print("scheduling task")
    del task["_id"]
    print("adding task_id")
    name = task["task_name"]
    # schedule scheduler_task (defined below) to run every minute with this task's data
    app.add_periodic_task(timedelta(minutes=1), scheduler_task.s(task), name=task["task_id"])


@app.task(name='scheduler_task')
def scheduler_task(data):
    print(str(data["a"] + data["b"]))
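For completeness, a hedged sketch of how this is typically wired in with Celery 4's on_after_configure signal, reusing the question's MongoReader and the scheduler_task defined above; giving each entry a unique name is what keeps the three regular_print entries from overwriting one another.
from datetime import timedelta

@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    # MongoReader and get_scheduled_tasks come from the question's code
    for task in MongoReader().get_scheduled_tasks():
        sender.add_periodic_task(
            timedelta(minutes=5),
            scheduler_task.s(task),
            name=str(task["task_id"]),  # unique name per entry, so none is overwritten
        )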

How to trigger several downstream builds in groovy and then finish the upstream job without waiting for results

I would like to trigger several downstream builds in groovy and then make the upstream job finish without waiting for the results of the downstream jobs.
With the following code:
hudson.model.queue.QueueTaskFuture build(String fullName) {
    def p = jenkins.model.Jenkins.instance.getItemByFullName(fullName)
    def thisR = Thread.currentThread().executable
    def f = p.scheduleBuild2(p.quietPeriod, new hudson.model.Cause.UpstreamCause(thisR))
    return f
}

def f1 = build('job1')
def f2 = build('job2')

// wait for both builds to finish
def b1 = f1.get()
def b2 = f2.get()
The downstream builds must finish before the upstream job can finish. Can I force the upstream job to finish with a build status of SUCCESS while the downstream jobs continue to run?
I think you could call the URL that starts the job using the remote access API:
You merely need to perform an HTTP POST on JENKINS_URL/job/JOBNAME/build?token=TOKEN, where TOKEN is set up in the job configuration.
Which can be done in Groovy:
def data = new URL(feedUrl).getText()
