Run databricks job from notebook - python-3.x

I want to know whether it is possible to run a Databricks job from a notebook using code, and how to do it.
I have a job with multiple tasks and many contributors, and one job was created to execute it all. Now we want to run that job from a notebook, both to test new features without creating a new task in the job and to run the job multiple times in a loop, for example:
for i in [1,2,3]:
    run job with parameter i
Regards

What you need to do is the following:
Install the databricksapi package: %pip install databricksapi==1.8.1
Create your job and return an output. You can do that by exiting the notebook like this:
import json
dbutils.notebook.exit(json.dumps({"result": f"{_result}"}))
If you want to pass a DataFrame, you have to pass it as a JSON dump too; the official Databricks documentation covers this.
Get the job ID; you will need it later. You can find it in the job details in Databricks.
In the executor notebook you can use the following code.
import json
import time
from databricksapi import Jobs  # assumed import path for the databricksapi package
def run_ks_job_and_return_output(params):
    context = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
    # the notebook context holds the workspace API URL and a token
    url = context['extraContext']['api_url']
    token = context['extraContext']['api_token']
    jobs_instance = Jobs.Jobs(url, token)  # initialize a jobs_instance
    runs_job_id = jobs_instance.runJob(****************, 'notebook',
                                       params)  # **** is the job id
    run_is_not_completed = True
    # poll every 30 seconds until this run shows up among the completed runs
    while run_is_not_completed:
        current_run = [run for run in jobs_instance.runsList('completed')['runs'] if run['run_id'] == runs_job_id['run_id'] and run['number_in_job'] == runs_job_id['number_in_job']]
        if len(current_run) == 0:
            time.sleep(30)
        else:
            run_is_not_completed = False
    current_run = current_run[0]
    print(f"Result state: {current_run['state']['result_state']}, You can check the resulted output in the following link: {current_run['run_page_url']}")
    note_output = jobs_instance.runsGetOutput(runs_job_id['run_id'])['notebook_output']
    return note_output
run_ks_job_and_return_output({'parm1': 'george',
                              'variable': "values1"})
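The job notebook exits with a JSON string and the caller gets it back inside notebook_output, so both sides have to agree on the format. A minimal sketch of that round trip, assuming the rows are small enough to fit in the exit value and that notebook_output is a dict with a "result" string, as the Jobs API runs/get-output endpoint returns it. make_exit_payload and parse_notebook_output are hypothetical helper names, not part of dbutils or databricksapi.

```python
import json


def make_exit_payload(rows):
    """Job side: build the string to hand to dbutils.notebook.exit()."""
    # rows is a plain list of dicts, e.g. collected from a small DataFrame
    return json.dumps({"result": rows})


def parse_notebook_output(note_output):
    """Caller side: turn the returned notebook output back into Python objects."""
    # Assumed shape of note_output: {"result": "<json string>", ...}
    return json.loads(note_output["result"])["result"]


# Simulated round trip, no cluster needed
payload = make_exit_payload([{"id": 1, "name": "george"}, {"id": 2, "name": "anna"}])
rows = parse_notebook_output({"result": payload, "truncated": False})
print(rows[0]["name"])
```

In a real notebook the job side would call dbutils.notebook.exit(make_exit_payload(rows)) and the caller would pass the value returned by run_ks_job_and_return_output to parse_notebook_output.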
If you want to run the job many times in parallel, you can do the following (first make sure that you have increased the max concurrent runs in the job settings):
from multiprocessing.pool import ThreadPool
pool = ThreadPool(1000)
results = pool.map(lambda j: run_ks_job_and_return_output({'table': 'george',
                                                           'variable': "values1",
                                                           'j': j}),
                   [str(x) for x in range(2, len(snapshots_list))])
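The pool probably does not need more threads than the job's max concurrent runs setting. A sketch of the same fan-out with the pool sized to that limit; fake_run_job is a stand-in for run_ks_job_and_return_output, and MAX_CONCURRENT_RUNS is an assumed value you would copy from the job settings.

```python
from multiprocessing.pool import ThreadPool

MAX_CONCURRENT_RUNS = 4  # assumption: copy this from the job settings


def fake_run_job(params):
    """Stand-in for run_ks_job_and_return_output; echoes its parameters."""
    return f"{params['table']}-{params['j']}"


# One parameter dict per run
param_sets = [{"table": "george", "variable": "values1", "j": str(j)} for j in range(2, 6)]

# The context manager tears down the pool's worker threads when the block exits
with ThreadPool(MAX_CONCURRENT_RUNS) as pool:
    results = pool.map(fake_run_job, param_sets)

print(results)
```

pool.map returns the results in the same order as the inputs, so each result can be matched back to its parameter set.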
It is also possible to save the whole HTML output, but maybe you are not interested in that. In any case, I will answer that in another post on StackOverflow.
Hope it helps.

You can use the following steps:
Note-01:
dbutils.widgets.text("foo", "fooDefault", "fooEmptyLabel")
dbutils.widgets.text("foo2", "foo2Default", "foo2EmptyLabel")
result = dbutils.widgets.get("foo")+"-"+dbutils.widgets.get("foo2")
def display():
    print("Function Display: "+result)
dbutils.notebook.exit(result)
Note-02:
thislist = ["apple", "banana", "cherry"]
for x in thislist:
    dbutils.notebook.run("Note-01 path", 60, {"foo": x,"foo2":'Azure'})

Related

Restarting nested notebook runs in Databricks Job Workflow

I have a Databricks scheduled job which runs 5 different notebooks sequentially, and each notebook contains, let's say 5 different command cells. When the job fails in notebook 3, cmd cell 3, I can properly recover from failure, though I'm not sure if there's any way of either restarting the scheduled job from notebook 3, cell 4, or even from the beginning of notebook 4, if I've manually completed the remaining cmd's in notebook 3. Here's an example of one of my jobs
%python
import sys
try:
    dbutils.notebook.run("/01. SMETS1Mig/" + dbutils.widgets.get("env_parent_directory") + "/02 Processing Curated Staging/02 Build - Parameterised/Load CS Feedback Firmware STG", 6000, {
        "env_ingest_db": dbutils.widgets.get("env_ingest_db")
        , "env_stg_db": dbutils.widgets.get("env_stg_db")
        , "env_tech_db": dbutils.widgets.get("env_tech_db")
    })
except Exception as error:
    sys.exit(f'Failure in Load CS Feedback Firmware STG ({error})')
try:
    dbutils.notebook.run("/01. SMETS1Mig/" + dbutils.widgets.get("env_parent_directory") + "/03 Processing Curated Technical/02 Build - Parameterised/Load CS Feedback Firmware TECH", 6000, {
        "env_ingest_db": dbutils.widgets.get("env_ingest_db")
        , "env_stg_db": dbutils.widgets.get("env_stg_db")
        , "env_tech_db": dbutils.widgets.get("env_tech_db")
    })
except Exception as error:
    sys.exit(f'Failure in Load CS Feedback Firmware TECH ({error})')
try:
    dbutils.notebook.run("/01. SMETS1Mig/" + dbutils.widgets.get("env_parent_directory") + "/02 Processing Curated Staging/02 Build - Parameterised/STA_6S - CS Firmware Success", 6000, {
        "env_ingest_db": dbutils.widgets.get("env_ingest_db")
        , "env_stg_db": dbutils.widgets.get("env_stg_db")
        , "env_tech_db": dbutils.widgets.get("env_tech_db")
    })
except Exception as error:
    sys.exit(f'Failure in STA_6S - CS Firmware Success ({error})')
You should not use sys.exit, because it quits the Python interpreter. Just let the exception bubble up if it happens.
You must change the architecture of your application and add some sort of idempotency to the ETL (online course), which would mean propagating a date to child notebooks or something like that.
Run %pip install retry at the beginning of the notebook to install the retry package.
from retry import retry, retry_call
@retry(Exception, tries=3)
def idempotent_run(notebook, timeout=6000, **args):
    # this is only approximate code to be used for inspiration and you should adjust it to your needs. It's not guaranteed to work for your case.
    did_it_run_before = spark.sql(f"SELECT COUNT(*) FROM meta.state WHERE notebook = '{notebook}' AND args = '{sorted(args.items())}'").first()[0]
    if did_it_run_before > 0:
        return
    result = dbutils.notebook.run(notebook, timeout, args)
    spark.sql(f"INSERT INTO meta.state SELECT '{notebook}' AS notebook, '{sorted(args.items())}' AS args")
    return result
pd = dbutils.widgets.get("env_parent_directory")
# call this within respective cells.
idempotent_run(
    f"/01. SMETS1Mig/{pd}/03 Processing Curated Technical/02 Build - Parameterised/Load CS Feedback Firmware TECH",
    # set it to something, that would define the frequency of the job
    this_date='2020-09-28',
    env_ingest_db=dbutils.widgets.get("env_ingest_db"),
    env_stg_db=dbutils.widgets.get("env_stg_db"),
    env_tech_db=dbutils.widgets.get("env_tech_db"))
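The idea behind idempotent_run can be tried outside Databricks. A minimal sketch, assuming an in-memory set in place of the meta.state table and a plain callable in place of dbutils.notebook.run; run_once, completed, and flaky_notebook are hypothetical names used only for illustration.

```python
completed = set()  # stands in for the meta.state table


def run_once(run_fn, notebook, **args):
    """Run `notebook` via run_fn unless this exact call already succeeded."""
    key = (notebook, tuple(sorted(args.items())))
    if key in completed:
        return None  # already done, skip it on restart
    result = run_fn(notebook, args)
    completed.add(key)  # recorded only after a successful run
    return result


calls = []


def flaky_notebook(notebook, args):
    """Fails on the first call, succeeds afterwards."""
    calls.append(notebook)
    if len(calls) == 1:
        raise RuntimeError("transient failure")
    return "ok"


# The first attempt fails and records nothing, so a restart runs it again
try:
    run_once(flaky_notebook, "Load TECH", this_date="2020-09-28")
except RuntimeError:
    pass

second = run_once(flaky_notebook, "Load TECH", this_date="2020-09-28")
third = run_once(flaky_notebook, "Load TECH", this_date="2020-09-28")
print(second, third, len(calls))
```

Because the state is written only after success, restarting the whole job from the top skips the notebooks that already finished and resumes at the one that failed.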

HOW-TO push/pull to/from Airflow X_COM with spark_task and pythonOperator?

I have a dag that creates a spark-task and executes a certain script located in a particular directory. There are two tasks like this. Both of these tasks need to receive the same ID generated in the DAG file before these tasks are executed. If I simply store and pass a value solely via the python script, the IDs are different, which is normal. So I am trying to push the value to XCOM with a PythonOperator and task.
I need to pull the values from XCOM and update a 'params' dictionary with that information in order to be able to pass it to my spark task.
Could you please help me? I am banging my head against the wall and just can't figure it out.
I tried the following:
Creating a function just to retrieve the data from XCom and return it. I assigned this function to the params variable, but it doesn't work: I cannot return a value from a Python function inside the DAG that uses xcom_pull.
Assigning an empty list, appending to it from the Python function, and then providing the final list directly to my spark task. That doesn't work either. Please help!
Thanks a lot in advance for any help related to this. I will need this value the same for this and multiple other spark tasks that may come into the same DAG file.
DAG FILE
import..
from common.base_tasks import spark_task
default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
    'email_on_failure': True,
    'email_on_retry': False,
}
dag = DAG(
    dag_id='dag',
    default_args=default_args,
    schedule_interval=timedelta(days=1)
)
log_level = "info"
id_info = {
    "id": str(uuid.uuid1()),
    "start_time": str(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S,%f'))
}
# this stores the value to XCOM successfully
def store_id(**kwargs):
    kwargs['ti'].xcom_push(key='id_info', value=id_info)
store_trace_task = PythonOperator(task_id='store_id', provide_context=True, python_callable=store_id, dag=dag)
extra_config = {'log_level': log_level}
config = '''{"config":"data"}'''
params = {'config': config, 'extra_config': json.dumps(extra_config)}
# ---------- this doesn't work ----------
pars = []
pars.append(params)
def task1_pull_params(**kwargs):
    tracing = kwargs['ti'].xcom_pull(task_ids='store_trace_task')
    pars.append(tracing)
    # params = {
    #     'parsed_config': parsed_config,
    #     'extra_config': json.dumps(extra_config),
    #     'trace_data': tracing
    # }
    # return params # return pushes to xcom, xcom_push does the same
task1_pull_params = PythonOperator(task_id='task1_pull_params', provide_context=True, python_callable=task1_pull_params, dag=dag)
store_trace_task >> task1_pull_params
# returning value from the function and calling it to assign res to the params variable below also doesn't work
# params = task1_pull_params
# this prints only what's outside of the function, i.e. params
print("===== pars =====> ", pars)
pipeline_task1 = spark_task(
    name='task1',
    script='app.py',
    params=params,
    dag=dag
)
task1_pull_params >> pipeline_task1
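One likely reason the list approach shows nothing: the module-level code of a DAG file runs when the file is parsed, while a python_callable runs later, when the task executes. A plain-Python analogy of that ordering, with no Airflow involved; fake_xcom and pull_params are hypothetical stand-ins, not Airflow objects.

```python
fake_xcom = {"id_info": {"id": "abc-123"}}  # stand-in for the XCom store

pars = [{"config": "data"}]


def pull_params():
    """Plays the role of the python_callable: runs only when the task executes."""
    pars.append(fake_xcom["id_info"])


# "Parse time": the callable has been defined but not called yet
seen_at_parse_time = list(pars)

# "Run time": the function is called later, as a task would be
pull_params()
seen_at_run_time = list(pars)

print(len(seen_at_parse_time), len(seen_at_run_time))
```

Anything that needs the pulled value therefore has to read it inside code that runs at task execution time, not in the top-level body of the DAG file.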

Schedule jobs dynamically with Flask APScheduler

I'm trying the code given at advanced.py with the following modification for adding new jobs.
I created a route for adding a new job
@app.route('/add', methods = ['POST'])
def add_to_schedule():
    data = request.get_json()
    print(data)
    d = {
        'id': 'job'+str(random.randint(0, 100)),
        'func': 'flask_aps_code:job1',
        'args': (random.randint(200,300),random.randint(200, 300)),
        'trigger': 'interval',
        'seconds': 10
    }
    print(app.config['JOBS'])
    scheduler.add_job(id = 'job'+str(random.randint(0, 100)), func = job1)
    #app.config.from_object(app.config['JOBS'].append(d))
    return str(app.config['JOBS']), 200
I tried adding the jobs to config['JOBS'] as well as with scheduler.add_job, but none of my new jobs get executed. Additionally, my first scheduled job doesn't get executed until I press Ctrl+C in the terminal, after which it seems to execute twice. What am I missing?
Edit: Seemingly the job running twice is because of flask reloading, so ignore that.

scheduling a task at multiple timings(with different parameters) using celery beat but task run only once(with random parameters)

What I am trying to achieve
Write a scheduler that uses a database to schedule similar tasks at different times.
For this I am using celery beat; the code snippet below should give an idea:
try:
    reader = MongoReader()
except:
    raise
try:
    tasks = reader.get_scheduled_tasks()
except:
    raise
celerybeat_schedule = dict()
for task in tasks:
    celerybeat_schedule[task["task_id"]] = dict()
    celerybeat_schedule[task["task_id"]]["task"] = task["task_name"]
    celerybeat_schedule[task["task_id"]]["args"] = (task,)
    celerybeat_schedule[task["task_id"]]["schedule"] = get_task_schedule(task)
app.conf.update(BROKER_URL=rabbit_mq_endpoint, CELERY_TASK_SERIALIZER='json', CELERY_ACCEPT_CONTENT=['json'], CELERYBEAT_SCHEDULE=celerybeat_schedule)
so these are three steps
- reading all tasks from datastore
- creating a dictionary, celery scheduler which is populated by all tasks having properties, task_name(method that would run), parameters(data to pass to the method), schedule(stores when to run)
- updating this with celery configurations
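The three steps can be sketched without a broker. get_task_schedule is not shown in the question, so the version here is a guess that turns a stored {"minutes": 5} into a timedelta; the task rows and their field names are made up for illustration.

```python
from datetime import timedelta


def get_task_schedule(task):
    """Hypothetical: convert a stored {"minutes": 5} into a timedelta."""
    return timedelta(**task["schedule"])


# Step 1: rows as they might come back from the datastore
tasks = [
    {"task_id": "t1", "task_name": "regular_print", "parameter": "Hi", "schedule": {"minutes": 5}},
    {"task_id": "t2", "task_name": "regular_print", "parameter": "Hello", "schedule": {"minutes": 5}},
    {"task_id": "t3", "task_name": "regular_print", "parameter": "Bye", "schedule": {"minutes": 5}},
]

# Step 2: one schedule entry per row, keyed by task_id
celerybeat_schedule = {
    task["task_id"]: {
        "task": task["task_name"],
        "args": (task,),
        "schedule": get_task_schedule(task),
    }
    for task in tasks
}

# Step 3 would pass this dict to app.conf.update(CELERYBEAT_SCHEDULE=...)
print(sorted(celerybeat_schedule))
```

Entries that share a key overwrite each other in the dict, so each task_id has to be unique for all three rows to survive into the schedule.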
Expected scenario
Suppose all entries run the same celery task, which just prints, share the same schedule (every 5 minutes), and differ only in the parameter that says what to print. Let's say the db has:
task name , parameter , schedule
regular_print , Hi , {"minutes" : 5}
regular_print , Hello , {"minutes" : 5}
regular_print , Bye , {"minutes" : 5}
I expect all three to print every 5 minutes.
What happens
Only one of Hi, Hello, Bye prints (possibly at random, certainly not in sequence).
Please help,
Thanks a lot in advance :)
I was able to resolve this using version 4 of Celery. Below is a sample similar to what worked for me; it can also be found in the Celery documentation for version 4.
import os
from datetime import timedelta
from celery import Celery
# taking address and user-pass from environment (you can mention direct values)
ex_host_queue = os.environ["EX_HOST_QUEUE"]
ex_port_queue = os.environ["EX_PORT_QUEUE"]
ex_user_queue = os.environ["EX_USERID_QUEUE"]
ex_pass_queue = os.environ["EX_PASSWORD_QUEUE"]
broker = "amqp://"+ex_user_queue+":"+ex_pass_queue+"@"+ex_host_queue+":"+ex_port_queue+"//"
# celery initialization
app = Celery(__name__,backend=broker, broker=broker)
app.conf.task_default_queue = 'scheduler_queue'
app.conf.update(
    task_serializer='json',
    accept_content=['json'],  # Ignore other content
    result_serializer='json'
)
task = {"task_id":1,"a":10,"b":20}
# method to update scheduler
def add_scheduled_task(task):
    print("scheduling task")
    del task["_id"]
    print("adding task_id")
    name = task["task_name"]
    app.add_periodic_task(timedelta(minutes=1),add.s(task), name = task["task_id"])
@app.task(name='scheduler_task')
def scheduler_task(data):
    print(str(data["a"]+data["b"]))

Jenkins Python API returns HTML

I'm trying to write a Python script to talk to my instance of Jenkins. I am using the newest version of the jenkinsapi module and querying Jenkins 1.509.3.
I can get a job list like follows:
l=j.get_jobs_list()
where j is an instance of jenkinsapi.Jenkins (I used the requester from jenkinsapi.utils.requester to skip ssl verification)
However, when I try to get more information on an individual job with
j.get_job(l[0])
it fails with this error: Inappropriate content found at [some_address] and what is returned is a bunch of HTML (that looks like the starting page for my instance, the one you see when you log in) instead of anything that should look like the response. Pasting [some_address] into the browser gives me what I expect as a response.
While I can get some information on the Jenkins instance, what I am really interested in is info on individual jobs. Any ideas how to fix it and get the job info?
Using Python 3.6, python-jenkins 1.0.1 and Jenkins 2.121.1, the following works nicely:
from pprint import pprint
import jenkins
IP = 'localhost'
USERNAME = 'my_username'
PW = 'my_password'
def get_version(server):
    user = server.get_whoami()
    version = server.get_version()
    print('Hello %s from Jenkins %s' % (user['fullName'], version))
def get_jobs(server):
    jobs = server.get_jobs()  # List[dict]
    print("Here are top 5 jobs")
    pprint(jobs[:5])
    return jobs
def get_job(server, job_name):
    job_config = server.get_job_config(job_name)  # XML
    job_info = server.get_job_info(job_name)  # dict
    print("\n --- JOB CONFIG --- ")
    print(job_config)
    print("\n --- JOB INFO --- ")
    pprint(job_info)
if __name__ == "__main__":
    _server = jenkins.Jenkins(IP, username=USERNAME, password=PW)
    get_version(_server)
    _jobs = get_jobs(_server)
    get_job(_server, _jobs[0]['name'])
The Jenkins API I was using is documented here: https://python-jenkins.readthedocs.io/en/latest/index.html

Resources