I have a scheduled Databricks job that runs 5 different notebooks sequentially, and each notebook contains, let's say, 5 different command cells. When the job fails in notebook 3, cmd cell 3, I can recover from the failure, but I'm not sure whether there's any way of restarting the scheduled job from notebook 3, cell 4, or even from the beginning of notebook 4 if I've manually completed the remaining cells in notebook 3. Here's an example of one of my jobs:
%python
import sys
try:
    dbutils.notebook.run("/01. SMETS1Mig/" + dbutils.widgets.get("env_parent_directory") + "/02 Processing Curated Staging/02 Build - Parameterised/Load CS Feedback Firmware STG", 6000, {
        "env_ingest_db": dbutils.widgets.get("env_ingest_db")
        , "env_stg_db": dbutils.widgets.get("env_stg_db")
        , "env_tech_db": dbutils.widgets.get("env_tech_db")
    })
except Exception as error:
    sys.exit(f'Failure in Load CS Feedback Firmware STG ({error})')
try:
    dbutils.notebook.run("/01. SMETS1Mig/" + dbutils.widgets.get("env_parent_directory") + "/03 Processing Curated Technical/02 Build - Parameterised/Load CS Feedback Firmware TECH", 6000, {
        "env_ingest_db": dbutils.widgets.get("env_ingest_db")
        , "env_stg_db": dbutils.widgets.get("env_stg_db")
        , "env_tech_db": dbutils.widgets.get("env_tech_db")
    })
except Exception as error:
    sys.exit(f'Failure in Load CS Feedback Firmware TECH ({error})')
try:
    dbutils.notebook.run("/01. SMETS1Mig/" + dbutils.widgets.get("env_parent_directory") + "/02 Processing Curated Staging/02 Build - Parameterised/STA_6S - CS Firmware Success", 6000, {
        "env_ingest_db": dbutils.widgets.get("env_ingest_db")
        , "env_stg_db": dbutils.widgets.get("env_stg_db")
        , "env_tech_db": dbutils.widgets.get("env_tech_db")
    })
except Exception as error:
    sys.exit(f'Failure in STA_6S - CS Firmware Success ({error})')
You should not use sys.exit, because it quits the Python interpreter. Just let the exception bubble up if it happens.
You must change the architecture of your application and add some sort of idempotency to the ETL, which would mean propagating a date to the child notebooks or something like that.
Run %pip install retry at the beginning of the notebook to install the retry package.
from retry import retry, retry_call
@retry(Exception, tries=3)
def idempotent_run(notebook, timeout=6000, **args):
    # this is only approximate code to be used for inspiration and you should adjust it to your needs. It's not guaranteed to work for your case.
    did_it_run_before = spark.sql(f"SELECT COUNT(*) FROM meta.state WHERE notebook = '{notebook}' AND args = '{sorted(args.items())}'").first()[0]
    if did_it_run_before > 0:
        return
    result = dbutils.notebook.run(notebook, timeout, args)
    spark.sql(f"INSERT INTO meta.state SELECT '{notebook}' AS notebook, '{sorted(args.items())}' AS args")
    return result
pd = dbutils.widgets.get("env_parent_directory")
# call this within respective cells.
idempotent_run(
    f"/01. SMETS1Mig/{pd}/03 Processing Curated Technical/02 Build - Parameterised/Load CS Feedback Firmware TECH",
    # set it to something, that would define the frequency of the job
    this_date='2020-09-28',
    env_ingest_db=dbutils.widgets.get("env_ingest_db"),
    env_stg_db=dbutils.widgets.get("env_stg_db"),
    env_tech_db=dbutils.widgets.get("env_tech_db"))
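For the original question (resuming after a failure without redoing the notebooks that already finished), the same skip-if-already-done idea can be sketched in plain Python. This is only an illustration: run_pipeline, run_step and the in-memory completed set are hypothetical names, with the set standing in for the meta.state table and run_step standing in for dbutils.notebook.run.

```python
from typing import Callable, Iterable, List, Set, Tuple


def run_pipeline(
    steps: Iterable[str],
    run_key: str,
    completed: Set[Tuple[str, str]],
    run_step: Callable[[str], None],
) -> List[str]:
    """Run steps in order, skipping any already recorded for this run_key.

    `completed` stands in for a persistent state table (hypothetical);
    `run_step` stands in for dbutils.notebook.run (hypothetical).
    Returns the steps that were actually executed in this attempt.
    """
    executed = []
    for step in steps:
        if (run_key, step) in completed:
            continue  # finished in an earlier attempt, so do not redo it
        run_step(step)  # any exception propagates and stops the pipeline
        completed.add((run_key, step))  # record success only after the step returns
        executed.append(step)
    return executed
```

Rerunning with the same run_key (for example the job's date) after a failure then starts at the first step that was not recorded. A step you finished by hand would need its row added to the state store manually before the rerun.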
I would like to migrate my data into another database. To do that I use Airflow to run a DAG composed of ETL flows. Each run covers one day of data, and I have 3 years to catch up. The problem is that each run has an unknown execution time, and it's important to wait for the end of a run before starting the next one.
I tried setting a minimal schedule_interval and max_active_runs=1, but in the end not all the data was loaded: a lot of days were skipped. I noticed this Airflow behavior: at the end of the schedule interval, my program is partially executed (so my variable that holds the date of the day to catch up is incremented) and then stopped because of max_active_runs.
with DAG(
    dag_id='catch_up',
    default_args={
        'owner': 'airflow',
        'start_date': datetime.now(),
        'depends_on_past': False,
        'retries': 1,
    },
    description=' ',
    schedule_interval='*/2 * * * *',
    max_active_runs=1,
    catchup=False
) as dag:
    date = launch_date()
    start_etl = PythonOperator(
        task_id='flow',
        python_callable=launch_flow,
        op_args=[date]
    )
    ...
My question is: how can I really make Airflow wait for the end of one execution before starting another?
What you are after is a combination of max_active_runs, depends_on_past and wait_for_downstream (see code base):
depends_on_past - when set to true, task instances will run sequentially and only if the previous instance has succeeded or has been skipped. The task instance for the start_date is allowed to run.
wait_for_downstream - when set to true, an instance of task X will wait for tasks immediately downstream of the previous instance of task X to finish successfully or be skipped before it runs. This is useful if the different instances of a task X alter the same asset, and this asset is used by tasks downstream of task X. Note that depends_on_past is forced to True wherever wait_for_downstream is used. Also note that only tasks immediately downstream of the previous task instance are waited for; the statuses of any tasks further downstream are ignored.
I think having:
default_args = {
    'wait_for_downstream': True,
    'depends_on_past': True,  # (*) see note below
    ...
}
with DAG(
    default_args=default_args,
    max_active_runs=1,
    ...
) as dag:
    ...
Will give you the logic you are after: only one DagRun can be active at any given time, and each task must wait for its predecessor in the previous run. On top of that, if a specific task fails it will block everything, and no new runs will be created until you address the issue manually.
Note that since you didn't provide many details, it may be that your use case can be solved by setting only max_active_runs=1 and having depends_on_past=True on the last task, but once you understand the parameters it's easier to find the right config for you.
(*) Technically there is no need to set depends_on_past=True when setting wait_for_downstream=True, because Airflow will override it for you, but being explicit helps readability.
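To make the two flags quoted above more concrete, here is a toy model of the gating rule in plain Python. It is not Airflow code and not how the scheduler is implemented; may_run, the state strings and the task names in the usage note are made up for illustration, and the logic only mirrors the documentation text quoted in this answer.

```python
from typing import Dict, List, Optional

OK_STATES = {"success", "skipped"}


def may_run(
    task: str,
    previous_run: Optional[Dict[str, str]],
    downstream: Dict[str, List[str]],
    depends_on_past: bool = False,
    wait_for_downstream: bool = False,
) -> bool:
    """Toy version of the gating rule (hypothetical helper, not Airflow API).

    previous_run maps task name -> state in the previous DagRun,
    or is None for the very first run.
    downstream maps task name -> its immediate downstream tasks.
    """
    if wait_for_downstream:
        depends_on_past = True  # forced, as the quoted docs describe
    if previous_run is None:
        return True  # the instance for start_date is always allowed to run
    if depends_on_past and previous_run.get(task) not in OK_STATES:
        return False
    if wait_for_downstream:
        # Only immediate children of the previous instance are checked.
        for child in downstream.get(task, []):
            if previous_run.get(child) not in OK_STATES:
                return False
    return True
```

For example, with downstream = {"extract": ["load"]} and a previous run of {"extract": "success", "load": "running"}, the call for "extract" returns True with only depends_on_past=True but False with wait_for_downstream=True, because the previous "load" has not finished yet.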
I want to know whether it is possible to run a Databricks job from a notebook using code, and how to do it.
I have a job with multiple tasks and many contributors, and we have a job created to execute it all. Now we want to run the job from a notebook, both to test new features without creating a new task in the job and to run the job multiple times in a loop, for example:
for i in [1,2,3]:
    run job with parameter i
Regards
What you need to do is the following:
Install the databricksapi package: %pip install databricksapi==1.8.1
Create your job and return an output. You can do that by exiting the notebook like this:
import json
dbutils.notebook.exit(json.dumps({"result": f"{_result}"}))
If you want to pass a DataFrame, you have to pass it as a JSON dump too; there is some official documentation about that from Databricks, check it out.
Get the job ID; you will need it later. You can get it from the job details in Databricks.
In the executor notebook you can use the following code.
import json
import time
from databricksapi import Jobs  # assumed import path for databricksapi==1.8.1

def run_ks_job_and_return_output(params):
    context = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
    # context
    url = context['extraContext']['api_url']
    token = context['extraContext']['api_token']
    jobs_instance = Jobs.Jobs(url, token)  # initialize a jobs_instance
    runs_job_id = jobs_instance.runJob(****************, 'notebook',
                                       params)  # **** is the job id
    run_is_not_completed = True
    # poll the list of completed runs until this run shows up in it
    while run_is_not_completed:
        current_run = [run for run in jobs_instance.runsList('completed')['runs'] if run['run_id'] == runs_job_id['run_id'] and run['number_in_job'] == runs_job_id['number_in_job']]
        if len(current_run) == 0:
            time.sleep(30)
        else:
            run_is_not_completed = False
    current_run = current_run[0]
    print(f"Result state: {current_run['state']['result_state']}, You can check the resulted output in the following link: {current_run['run_page_url']}")
    note_output = jobs_instance.runsGetOutput(runs_job_id['run_id'])['notebook_output']
    return note_output

run_ks_job_and_return_output({'parm1': 'george',
                              'variable': "values1"})
If you want to run the job many times in parallel you can do the following (first make sure that you have increased the maximum concurrent runs in the job settings):
from multiprocessing.pool import ThreadPool
pool = ThreadPool(1000)
results = pool.map(lambda j: run_ks_job_and_return_output({'table': 'george',
                                                           'variable': "values1",
                                                           'j': j}),
                   [str(x) for x in range(2, len(snapshots_list))])
There is also the possibility of saving the whole HTML output, but maybe you are not interested in that. In any case, I will cover that in another post on StackOverflow.
Hope it helps.
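As a small illustration of the point above about returning tabular data as a JSON dump, the sketch below serializes a list of row dictionaries on the child side and parses it again on the caller side. The helper names and the sample rows are made up for illustration; in a real notebook the string would be handed to dbutils.notebook.exit and read back from the run's notebook output.

```python
import json
from typing import Dict, List


def rows_to_payload(rows: List[Dict]) -> str:
    """Child side: wrap the rows in a JSON string, ready for dbutils.notebook.exit."""
    return json.dumps({"result": rows})


def payload_to_rows(payload: str) -> List[Dict]:
    """Caller side: parse the string that came back as the notebook output."""
    return json.loads(payload)["result"]


# Hypothetical rows, e.g. collected from a small DataFrame.
rows = [{"id": 1, "name": "george"}, {"id": 2, "name": "values1"}]
payload = rows_to_payload(rows)
restored = payload_to_rows(payload)
```

This only suits small results. For anything larger, writing to a table and returning its name is usually the safer route.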
You can use the following steps:
Note-01:
dbutils.widgets.text("foo", "fooDefault", "fooEmptyLabel")
dbutils.widgets.text("foo2", "foo2Default", "foo2EmptyLabel")
result = dbutils.widgets.get("foo")+"-"+dbutils.widgets.get("foo2")
def display():
    print("Function Display: "+result)
dbutils.notebook.exit(result)
Note-02:
thislist = ["apple", "banana", "cherry"]
for x in thislist:
    dbutils.notebook.run("Note-01 path", 60, {"foo": x,"foo2":'Azure'})
Scenario:
I have a job created in Azure Batch against a pool. I have now created another pool and want to point my job to the newly created pool.
I used the azure-batch SDK to write the following piece of code:
import azure.batch.batch_service_client as batch
batch_service_client = batch.BatchServiceClient(credentials, batch_url = account_url)
job_id="LinuxTrainingJob"
pool_id="linux-e6a63ad4-9e52-4b9a-8b09-2a0249802981"
pool_info = batch.models.PoolInformation(pool_id=pool_id)
job_patch_param = batch.models.JobPatchParameter(pool_info=pool_info)
batch_service_client.job.patch(job_id, job_patch_param)
This gives me the following error
BatchErrorException Traceback (most recent call last)
<ipython-input-104-ada32b24d6a0> in <module>
2 pool_info = batch.models.PoolInformation(pool_id=pool_id)
3 job_patch_param = batch.models.JobPatchParameter(pool_info=pool_info)
----> 4 batch_service_client.job.patch(job_id, job_patch_param)
~/anaconda3/lib/python3.8/site-packages/azure/batch/operations/job_operations.py in patch(self, job_id, job_patch_parameter, job_patch_options, custom_headers, raw, **operation_config)
452
453 if response.status_code not in [200]:
--> 454 raise models.BatchErrorException(self._deserialize, response)
455
456 if raw:
BatchErrorException: {'additional_properties': {}, 'lang': 'en-US', 'value': 'The specified operation is not valid for the current state of the resource.\nRequestId:46074112-9a99-4569-a078-30a7f4ad2b91\nTime:2020-10-06T17:52:43.6924378Z'}
The credentials are set above and are working properly, as I was able to create pools and jobs using the same client.
Environment details
azure-batch==9.0.0
python 3.8.3
Ubuntu 18.04
To assign a Job to another Pool you must first call the disableJob API to drain the currently running tasks from the Pool. Then you can call updateJob to assign a new poolId for the Job to run on. Once it is updated, you can call enableJob to continue the job's execution.
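A rough sketch of that sequence with the Python SDK might look like the following. It assumes client is the BatchServiceClient from the question and that job.disable, job.get, job.patch and job.enable behave as I understand them in azure-batch 9.x; move_job_to_pool itself, the polling interval and the timeout are illustrative choices of mine, not part of the SDK.

```python
import time


def move_job_to_pool(client, job_id, job_patch_param,
                     poll_seconds=5.0, timeout_seconds=600.0):
    """Disable a job, point it at another pool, then re-enable it.

    client: expected to be a BatchServiceClient (assumption).
    job_patch_param: a JobPatchParameter whose pool_info names the new pool.
    """
    # "requeue" puts running tasks back in the queue; the other documented
    # options are "terminate" and "wait".
    client.job.disable(job_id, "requeue")

    # The job passes through a "disabling" state first, so wait until it is
    # actually "disabled" before patching it.
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = client.job.get(job_id).state
        if getattr(state, "value", state) == "disabled":
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"job {job_id} did not reach the disabled state")
        time.sleep(poll_seconds)

    client.job.patch(job_id, job_patch_param)
    client.job.enable(job_id)
```

With the names from the question, this would be called as move_job_to_pool(batch_service_client, job_id, job_patch_param).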
I'm trying the code given at advanced.py with the following modification for adding new jobs.
I created a route for adding a new job
@app.route('/add', methods = ['POST'])
def add_to_schedule():
    data = request.get_json()
    print(data)
    d = {
        'id': 'job'+str(random.randint(0, 100)),
        'func': 'flask_aps_code:job1',
        'args': (random.randint(200,300),random.randint(200, 300)),
        'trigger': 'interval',
        'seconds': 10
    }
    print(app.config['JOBS'])
    scheduler.add_job(id = 'job'+str(random.randint(0, 100)), func = job1)
    #app.config.from_object(app.config['JOBS'].append(d))
    return str(app.config['JOBS']), 200
I tried adding the jobs to config['JOBS'] as well as via scheduler.add_job, but none of my new jobs are getting executed. Additionally, my first scheduled job doesn't get executed until I press Ctrl+C in the terminal, after which the first scheduled job seems to execute twice. What am I missing?
Edit: Seemingly the job running twice is because of flask reloading, so ignore that.
What I am trying to achieve
Write a scheduler that uses a database to schedule similar tasks at different times.
For this I am using celery beat; the code snippet below should give an idea:
try:
    reader = MongoReader()
except:
    raise
try:
    tasks = reader.get_scheduled_tasks()
except:
    raise
celerybeat_schedule = dict()
for task in tasks:
    celerybeat_schedule[task["task_id"]] = dict()
    celerybeat_schedule[task["task_id"]]["task"] = task["task_name"]
    celerybeat_schedule[task["task_id"]]["args"] = (task,)
    celerybeat_schedule[task["task_id"]]["schedule"] = get_task_schedule(task)
app.conf.update(BROKER_URL=rabbit_mq_endpoint, CELERY_TASK_SERIALIZER='json', CELERY_ACCEPT_CONTENT=['json'], CELERYBEAT_SCHEDULE=celerybeat_schedule)
So these are the three steps:
- reading all tasks from the datastore
- creating a dictionary (the celery beat schedule) populated with all tasks, each having the properties task_name (the method to run), parameters (the data to pass to the method) and schedule (when to run)
- updating the celery configuration with it
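The dictionary-building step can be sketched in plain Python as follows. The field names task_id, task_name and schedule follow the snippet above, but this get_task_schedule is only a guess at what that helper might do, and build_beat_schedule with its duplicate check is an addition for illustration: a dict keeps one entry per key, so rows sharing a task_id would silently overwrite each other.

```python
from datetime import timedelta
from typing import Dict, List


def get_task_schedule(task: Dict) -> timedelta:
    """Turn a stored schedule such as {"minutes": 5} into a timedelta (assumed format)."""
    return timedelta(**task["schedule"])


def build_beat_schedule(tasks: List[Dict]) -> Dict[str, Dict]:
    """Build one beat entry per task, keyed by a unique entry name."""
    schedule = {}
    for task in tasks:
        entry_name = str(task["task_id"])
        if entry_name in schedule:
            # Without this check the later row would replace the earlier one.
            raise ValueError(f"duplicate task_id: {entry_name}")
        schedule[entry_name] = {
            "task": task["task_name"],
            "args": (task,),
            "schedule": get_task_schedule(task),
        }
    return schedule
```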
Expected scenario
Given that all entries run the same celery task (which just prints), share the same schedule (every 5 minutes), and differ only in the parameter specifying what to print, let's say the db has:
task name , parameter , schedule
regular_print , Hi , {"minutes" : 5}
regular_print , Hello , {"minutes" : 5}
regular_print , Bye , {"minutes" : 5}
I expect all three to be printed every 5 minutes.
What happens
Only one of Hi, Hello, Bye prints (possibly at random, certainly not in sequence).
Please help,
Thanks a lot in advance :)
I was able to resolve this using version 4 of Celery. Below is a sample similar to what worked for me; you can also find this in the Celery documentation for version 4.
import os
from datetime import timedelta
from celery import Celery

# taking address and user-pass from environment (you can mention direct values)
ex_host_queue = os.environ["EX_HOST_QUEUE"]
ex_port_queue = os.environ["EX_PORT_QUEUE"]
ex_user_queue = os.environ["EX_USERID_QUEUE"]
ex_pass_queue = os.environ["EX_PASSWORD_QUEUE"]
broker = "amqp://"+ex_user_queue+":"+ex_pass_queue+"@"+ex_host_queue+":"+ex_port_queue+"//"
# celery initialization
app = Celery(__name__, backend=broker, broker=broker)
app.conf.task_default_queue = 'scheduler_queue'
app.conf.update(
    task_serializer='json',
    accept_content=['json'],  # Ignore other content
    result_serializer='json'
)
task = {"task_id":1,"a":10,"b":20}
## method to update scheduler
def add_scheduled_task(task):
    print("scheduling task")
    del task["_id"]
    print("adding task_id")
    name = task["task_name"]
    # `add` is not defined in this sample; presumably a Celery task such as scheduler_task below
    app.add_periodic_task(timedelta(minutes=1), add.s(task), name = task["task_id"])
@app.task(name='scheduler_task')
def scheduler_task(data):
    print(str(data["a"]+data["b"]))