Update Databricks job status through API

We need to execute a long-running EXE on a Windows machine and are thinking of ways to integrate it with our workflow. The plan is to include the EXE as a task in the Databricks workflow.
We are considering a couple of approaches:
Create a database table and insert a row when this particular task starts in the workflow. The EXE running on the Windows machine polls the table for new records. Once a new record is found, the EXE proceeds with the actual execution and updates the status after completion. Databricks queries this table constantly for the status and, once it shows completed, the task finishes.
Using the Databricks API, the EXE checks whether the task has started and then continues with execution. Until the application finishes and the task status is updated to complete, the Databricks task runs like while (true). But the current API doesn't appear to support updating the task execution status (to Complete) (not 100% sure).
Please share your thoughts or alternate solutions.

This is an interesting problem. Is there a reason you must use Databricks to execute an EXE?
Regardless, I think you have the right kind of idea. Here is how I would do this with the Jobs API:
Have your EXE process write a status file to a staging location, probably in DBFS, since this will be locally accessible inside of Databricks.
Build a notebook to load this file; having a table is optional but may give you additional logging capabilities if needed. The output of your notebook should use the dbutils.notebook.exit method, which allows you to return a value (a string, or a serialized array). You could return "In Progress" and "Success", or the latest line from the file you've written.
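A minimal sketch of such a status notebook, assuming (this path and file layout are my assumptions, not part of the original setup) that the EXE appends its latest status line to /dbfs/FileStore/exe_status/status.txt:

# Databricks notebook cell: surface the EXE's latest status line as the notebook's exit value.
status_path = "/dbfs/FileStore/exe_status/status.txt"  # assumed staging location written by the EXE

try:
    with open(status_path, "r") as f:
        lines = [line.strip() for line in f if line.strip()]
    latest = lines[-1] if lines else "In Progress"
except FileNotFoundError:
    latest = "In Progress"  # the EXE has not produced the file yet

# dbutils.notebook.exit makes this value retrievable via the Jobs get-output endpoint
dbutils.notebook.exit(latest)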
Wrap that notebook in a Databricks job and execute it on an interval with a cron schedule (you said 1 minute); you can then retrieve the output value of your job via the get-output endpoint.
Additional note: the benefit of abstracting this into return values from a notebook is that you could orchestrate this via other workflow tools, e.g. Databricks Workflows or Azure Data Factory inside an Until condition. There are no limits so long as you can orchestrate a notebook in that tool.
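For a tool that only speaks HTTP, a rough sketch of polling the latest completed run's exit value over the Jobs REST API (the workspace URL, token and job id below are placeholders, and the 60-second sleep mirrors the 1-minute schedule above):

import time
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                      # placeholder token
JOB_ID = 123                                           # the scheduled status-check job
headers = {"Authorization": f"Bearer {TOKEN}"}

while True:
    # Latest completed run of the status-check job
    runs = requests.get(f"{HOST}/api/2.1/jobs/runs/list", headers=headers,
                        params={"job_id": JOB_ID, "completed_only": "true", "limit": 1}
                        ).json().get("runs", [])
    if runs:
        out = requests.get(f"{HOST}/api/2.1/jobs/runs/get-output", headers=headers,
                           params={"run_id": runs[0]["run_id"]}).json()
        result = out.get("notebook_output", {}).get("result")
        print("latest status:", result)
        if result == "Success":
            break
    time.sleep(60)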

Related

Databricks Jobs - Is there a way to restart a failed task instead of the whole pipeline?

If I have, for example, a (multi-task) Databricks job with 3 tasks in series and the second one fails, is there a way to start from the second task instead of running the whole pipeline again?
Right now this is not possible, but if you refer to Databricks' Q3 2021 public roadmap, there were some items around improving multi-task jobs.
Update (September 2022): this functionality was released back in May 2022 under the name Repair & Rerun.
If you are running Databricks on Azure it is possible via Azure Data Factory.
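For reference, a hedged sketch of triggering such a repair through the Jobs 2.1 REST API so only the failed task reruns (the workspace URL, token, run id and task key are placeholders):

import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                      # placeholder
headers = {"Authorization": f"Bearer {TOKEN}"}

# Rerun only the failed second task of an existing multi-task job run
resp = requests.post(f"{HOST}/api/2.1/jobs/runs/repair", headers=headers,
                     json={"run_id": 456789,            # the failed job run
                           "rerun_tasks": ["task_2"]})  # task_key(s) to rerun
resp.raise_for_status()
print("repair_id:", resp.json().get("repair_id"))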

How to read Azure Databricks output using API or class library

I have an Azure Databricks notebook which contains a SQL command. I need to capture the output of the SQL command and use it in .NET Core.
Need help.
You cannot capture the results of an Azure Databricks notebook directly in .NET Core.
Also, there are no .NET SDKs available, so you need to rely on the Databricks REST APIs from your .NET code for all your operations. You could try the following:
Update your notebook to export the result of your SQL query as a CSV file to the file store using df.write. For example:
df.write.format("com.databricks.spark.csv").option("header","true").save("sqlResults.csv")
You can set up a job with the above notebook and then invoke the job using the Jobs API run-now endpoint from .NET.
You need to poll the job status using the runs list method to check the job completion state from your .NET code.
Once the job is completed, you need to use the DBFS API read endpoint to read the content of the CSV file your notebook generated in step 1.
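The same sequence sketched in Python for brevity; from .NET you would issue identical HTTP calls with HttpClient. The workspace URL, token, job id and the /sqlResults.csv output directory are placeholders/assumptions based on the example above:

import base64
import time
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                      # placeholder token
headers = {"Authorization": f"Bearer {TOKEN}"}

# 1. Trigger the job that runs the notebook (job_id is a placeholder)
run_id = requests.post(f"{HOST}/api/2.1/jobs/run-now", headers=headers,
                       json={"job_id": 123}).json()["run_id"]

# 2. Poll until the run reaches a terminal state
while True:
    state = requests.get(f"{HOST}/api/2.1/jobs/runs/get", headers=headers,
                         params={"run_id": run_id}).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        break
    time.sleep(30)

# 3. Find the part file Spark wrote under the assumed output directory and read it
files = requests.get(f"{HOST}/api/2.0/dbfs/list", headers=headers,
                     params={"path": "/sqlResults.csv"}).json()["files"]
part = next(f["path"] for f in files if f["path"].endswith(".csv"))
data = requests.get(f"{HOST}/api/2.0/dbfs/read", headers=headers,
                    params={"path": part, "offset": 0, "length": 1000000}).json()["data"]
print(base64.b64decode(data).decode("utf-8"))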

Stop Azure Databricks cluster after threshold time of job execution

I need to know how to stop an Azure Databricks cluster through configuration (without manual stopping) when it is running indefinitely while executing a job, and also how to create an email alert for it when the job's running time exceeds its usual running time.
You can do this in the Jobs UI: select your job and, under Advanced, edit the Alerts and Timeout values.
This Databricks docs page may help you: https://docs.databricks.com/jobs.html
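If you would rather configure this programmatically, the same settings can be applied through the Jobs API; a hedged sketch (the workspace URL, token, job id, timeout and e-mail address are placeholders):

import requests

HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                      # placeholder

resp = requests.post(f"{HOST}/api/2.1/jobs/update",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json={"job_id": 123,
                           "new_settings": {
                               "timeout_seconds": 7200,  # stop the run after 2 hours
                               "email_notifications": {
                                   # mailed when a run fails, e.g. after hitting the timeout
                                   "on_failure": ["alerts@example.com"]
                               }}})
resp.raise_for_status()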

Automatically spawn an Azure Batch AI job periodically

I want to automatically start a job on an Azure Batch AI cluster once a week. The jobs are all identical except for the starting time. I thought of writing a PowerShell Azure Function that does this, but Azure Functions v2 doesn't support PowerShell and I don't want to use v1 in case it will be phased out. I would prefer not to do this in C# or Java. How can I do this?
Currently, there's no option available to trigger a job on an Azure Batch AI cluster on a schedule. Maybe you want to run a shell script which in turn can create a regular schedule using the system's task scheduler. Please see if this doc by Said Bleik helps:
https://github.com/saidbleik/batchai_mm_ad#scheduling-jobs
I assume this way you can add multiple schedules for the job!
The Azure Batch portal has a "Job schedules" tab. You can go there, add a job, and set a schedule for it. You can specify the recurrence in the schedule.
Scheduled jobs
Job schedules enable you to create recurring jobs within the Batch service. A job schedule specifies when to run jobs and includes the specifications for the jobs to be run. You can specify the duration of the schedule--how long and when the schedule is in effect--and how frequently jobs are created during the scheduled period.
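A rough sketch of creating such a weekly job schedule with the azure-batch Python SDK, assuming an existing pool; the account details, pool id and command line are placeholders, and exact parameter names can differ slightly between SDK versions:

import datetime
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("<account-name>", "<account-key>")  # placeholders
client = BatchServiceClient(credentials,
                            batch_url="https://<account>.<region>.batch.azure.com")

client.job_schedule.add(batchmodels.JobScheduleAddParameter(
    id="weekly-training-schedule",
    # Recur once a week
    schedule=batchmodels.Schedule(recurrence_interval=datetime.timedelta(days=7)),
    job_specification=batchmodels.JobSpecification(
        pool_info=batchmodels.PoolInformation(pool_id="<existing-pool-id>"),
        job_manager_task=batchmodels.JobManagerTask(
            id="weekly-task",
            command_line="/bin/bash -c 'python train.py'"))))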

How do I share Databricks Spark Notebook report/dashboard with customers?

I have been using Zeppelin for a few months now. It is a great tool for internal data analytics. I am looking for more features for sharing reports with customers. I need to send weekly/monthly/quarterly reports to customers and am looking for a way to automate this process.
Please let me know if Databricks Spark Notebook or any other tool has features to help me do this.
You can use a Databricks dashboard for this. Once you have the dashboard, you can do an HTML export of it and share the HTML file publicly.
If you're interested in automating the reporting process, you may want to look into the Databricks REST API: https://docs.databricks.com/api/latest/jobs.html#runs-export. You need to pass the run_id of the notebook job and the desired views_to_export (this value should be DASHBOARDS) as query parameters. Note that this run export only supports notebook job exports, which is fine because dashboards are usually generated from notebook jobs.
If your Databricks HTML dashboard export is successful, you'll get a "views" JSON response consisting of a list of key-value objects; the HTML string will be available under the "content" key in each object. You can then do anything with this HTML string, for example send it directly to email/Slack for automatic reporting.
In order to generate a run_id, you first need to create a notebook job, which you can do via the Databricks UI. Then you can get the run_id by triggering the notebook job to run by either:
using the Databricks scheduler, or
using the Databricks run-now REST API: https://docs.databricks.com/api/latest/jobs.html#run-now.
I preferred the 2nd method, running the job programmatically via the REST API, because I can always capture the run_id when I trigger the job, unlike the first method where I have to look at the Databricks UI each time the job is scheduled to run. Either way, you must wait for the notebook job run to finish before calling the run export in order to get the complete Databricks dashboard HTML.
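Putting the above together, a hedged sketch of the export flow (the workspace URL, token and job id are placeholders):

import time
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                       # placeholder token
headers = {"Authorization": f"Bearer {TOKEN}"}

# 1. Trigger the notebook job and remember the run_id
run_id = requests.post(f"{HOST}/api/2.0/jobs/run-now", headers=headers,
                       json={"job_id": 123}).json()["run_id"]

# 2. Wait for the run to finish so the dashboard is fully rendered
while True:
    state = requests.get(f"{HOST}/api/2.0/jobs/runs/get", headers=headers,
                         params={"run_id": run_id}).json()["state"]
    if state["life_cycle_state"] == "TERMINATED":
        break
    time.sleep(30)

# 3. Export the dashboard views as HTML
views = requests.get(f"{HOST}/api/2.0/jobs/runs/export", headers=headers,
                     params={"run_id": run_id,
                             "views_to_export": "DASHBOARDS"}).json()["views"]
for view in views:
    html = view["content"]  # send this to email/Slack, save to disk, etc.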
