I have a requirement to monitor an S3 bucket for (zip) files being placed. As soon as a file lands in the S3 bucket, the pipeline should start processing it. Currently I have a Workflow Job with multiple tasks that perform the processing. In the job parameters I have configured the S3 bucket file path and am able to trigger the pipeline. But I need to automate the monitoring through Auto Loader. I have set up Databricks Auto Loader in another notebook and managed to get the list of files arriving at the S3 path by querying the checkpoint:
checkpoint_query = f"SELECT * FROM cloud_files_state('{checkpoint_path}') ORDER BY create_time DESC LIMIT 1"
But I want to integrate this notebook with my job, and I have no clue how to wire it into the pipeline job. Any pointers would be much appreciated.
You need to create a workflow job with the pipeline as the upstream task and your notebook as the downstream task. Currently there is no way to run custom notebooks within a DLT pipeline.
Check this for how to create a workflow: https://docs.databricks.com/workflows/jobs/jobs.html#job-create
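The upstream/downstream arrangement above can be sketched as a Jobs API 2.1 create-job payload. This is a minimal sketch, assuming a DLT pipeline and a notebook in your workspace; the pipeline ID and notebook path below are placeholders, not values from the question.

```python
# Sketch of a Jobs API 2.1 payload: the DLT pipeline runs first, and the
# file-listing notebook runs only after the pipeline task succeeds.
# pipeline_id and notebook_path are placeholder values.

def build_job_payload(pipeline_id: str, notebook_path: str) -> dict:
    """Return a create-job payload with the notebook depending on the pipeline."""
    return {
        "name": "autoloader-pipeline-with-notebook",
        "tasks": [
            {
                "task_key": "ingest_pipeline",
                "pipeline_task": {"pipeline_id": pipeline_id},
            },
            {
                "task_key": "list_arrived_files",
                "depends_on": [{"task_key": "ingest_pipeline"}],
                "notebook_task": {"notebook_path": notebook_path},
            },
        ],
    }

payload = build_job_payload("1234-pipeline-id", "/Repos/me/list_files")
# POST this payload to /api/2.1/jobs/create with a workspace token, e.g.
# requests.post(f"{host}/api/2.1/jobs/create", headers=auth, json=payload)
```

The same two-task structure can also be created through the Workflows UI; the payload just makes the upstream/downstream dependency explicit.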
Related
We need to execute a long-running EXE on a Windows machine and are thinking of ways to integrate it with the workflow. The plan is to include the EXE as a task in the Databricks workflow.
We are considering a couple of approaches:
Create a database table and insert a row when this particular task starts in the workflow. The EXE running on the Windows machine polls the table for new records. Once a new record is found, the EXE proceeds with the actual execution and updates the status after completion. Databricks queries this table repeatedly for the status, and once it reads a completed status, the task finishes.
Using the Databricks API, check whether the task has started and continue with execution in the EXE. After the application finishes, update the task status to complete; until then, the Databricks task spins in a while (true) loop. However, the current API doesn't appear to support updating a task's execution status to complete (not 100% sure).
Please share thoughts or alternate solutions.
This is an interesting problem. Is there a reason you must use Databricks to execute an EXE?
Regardless, I think you have the right idea. Here is how I would do this with the Jobs API:
Have your EXE process write a status file to a staging location, probably in DBFS, since that is locally accessible inside Databricks.
Build a notebook to load this file; a table is optional but may give you additional logging capabilities if needed. Your notebook should finish with the dbutils.notebook.exit method, which lets you return a value as a string. You could return "In Progress" and "Success", or the latest line from the file you've written.
Wrap that notebook in a Databricks job, run it on an interval with a cron schedule (you said 1 minute), and retrieve the output value of each run via the get-output endpoint.
Additional note: the benefit of abstracting this into return values from a notebook is that you can orchestrate it via other workflow tools, e.g. Databricks Workflows or Azure Data Factory inside an Until condition. There are no limits as long as you can orchestrate a notebook in that tool.
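The polling notebook described above can be sketched as follows. This is a minimal sketch under the stated assumptions: the EXE appends status lines to a file at an agreed staging path (the path below is hypothetical), and the notebook surfaces the latest line via dbutils.notebook.exit.

```python
# Sketch of the status-polling notebook: read the status file the EXE
# writes to a staging path and surface the last line. The staging path
# is a placeholder; in a real notebook you would end with
# dbutils.notebook.exit(status) so the Jobs get-output endpoint sees it.
import os

def read_latest_status(status_path: str) -> str:
    """Return the last non-empty line of the EXE's status file,
    or 'In Progress' if the file does not exist yet."""
    if not os.path.exists(status_path):
        return "In Progress"
    with open(status_path) as f:
        lines = [line.strip() for line in f if line.strip()]
    return lines[-1] if lines else "In Progress"

# In the notebook itself (only works inside Databricks):
# dbutils.notebook.exit(read_latest_status("/dbfs/staging/exe_status.txt"))
```

Treating "file missing" as "In Progress" keeps the first scheduled run from failing before the EXE has produced any output.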
I have an Azure Databricks notebook that contains a SQL command. I need to capture the output of the SQL command and use it in .NET Core.
Need help.
You cannot capture the results of an Azure Databricks notebook directly in .NET Core.
Also, there is no .NET SDK available, so you need to rely on the Databricks REST APIs from your .NET code for all operations. You could try the following:
Update your notebook to export the result of your SQL query as a CSV file to the file store using df.write. For example:
df.write.format("csv").option("header", "true").save("/FileStore/sqlResults.csv")
You can set up a job with the above notebook and then invoke it from .NET using the Jobs API run-now endpoint.
Poll the job status from your .NET code using the runs list/get endpoints to check for completion.
Once the job completes, use the DBFS API read endpoint to fetch the content of the CSV file your notebook generated in step 1.
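The run-now / poll / read flow above can be sketched as follows. Shown in Python for brevity, since there is no .NET SDK; the same calls translate directly to HttpClient in .NET. The endpoint paths follow the Databricks Jobs 2.1 and DBFS 2.0 REST APIs; host, token, and job ID are placeholders, and the network calls themselves are left as comments.

```python
# Sketch of the three-step REST flow: trigger the job, poll until the
# run reaches a terminal state, then read the CSV from DBFS. Only the
# payload/state helpers are implemented; HTTP calls are sketched in
# comments with placeholder host and job_id values.

def run_now_payload(job_id: int) -> dict:
    """Request body for POST /api/2.1/jobs/run-now."""
    return {"job_id": job_id}

def is_terminal(run_state: dict) -> bool:
    """True once a run returned by GET /api/2.1/jobs/runs/get has finished."""
    return run_state.get("life_cycle_state") in {
        "TERMINATED", "SKIPPED", "INTERNAL_ERROR",
    }

# Typical loop (network calls omitted):
#   run_id = post(f"{host}/api/2.1/jobs/run-now", json=run_now_payload(job_id))["run_id"]
#   while not is_terminal(get(f"{host}/api/2.1/jobs/runs/get?run_id={run_id}")["state"]):
#       time.sleep(30)
#   csv_bytes = read_dbfs("/FileStore/sqlResults.csv")  # via /api/2.0/dbfs/read
```

Checking life_cycle_state rather than result_state distinguishes "still running" from "finished"; you would then inspect result_state to see whether the finished run actually succeeded.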
I'm using S3 event-based triggers to invoke Lambda functions. A Lambda function fires every time a _SUCCESS file is written to a specific location in S3. Data is written to the source location by Databricks Spark jobs. I've observed that once the job writes data to the source location, the Lambda function is triggered twice, consistently.
This behavior only occurs when _SUCCESS is written by a Databricks job; when I write the file from the CLI, the Lambda fires just once.
It would be helpful to know the reason behind this behavior in Databricks jobs.
I have been using Zeppelin for a few months now. It is a great tool for internal data analytics, but I'm looking for better ways to share reports with customers. I need to send weekly/monthly/quarterly reports to customers and am looking for a way to automate this process.
Please let me know if Databricks Spark notebooks or any other tool has features that can help me do this.
You can use a Databricks dashboard for this. Once you have the dashboard, you can do an HTML export of it and share the HTML file publicly.
If you're interested in automating the reporting process, look into the Databricks REST API: https://docs.databricks.com/api/latest/jobs.html#runs-export. You need to pass the run_id of the notebook job and the desired views_to_export (this value should be DASHBOARD) as query parameters. Note that run export only supports notebook job runs, which is fine because dashboards are usually generated from notebook jobs.
If your dashboard export succeeds, you'll get a "views" JSON response consisting of a list of key-value objects; the HTML string is available under the "content" key of each object. You can then do anything with this HTML string, e.g. send it directly to email or Slack for automated reporting.
To generate a run_id, first create a notebook job, which you can do via the Databricks UI. Then get a run_id by triggering the job to run either:
using the Databricks scheduler, or
using the run-now REST API: https://docs.databricks.com/api/latest/jobs.html#run-now
I prefer the second method, running the job programmatically via the REST API, because I can always capture the run_id when I trigger the job, unlike the first method, where I have to check the Databricks UI each time the job runs on schedule. Either way, you must wait for the notebook job run to finish before calling the run export in order to get the complete dashboard HTML.
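The export step above returns a "views" list from which the HTML must be pulled out. A minimal sketch, assuming the response shape described in the answer ("views" is a list of objects with a "content" key); the surrounding HTTP call and the delivery helper are placeholders.

```python
# Sketch: extract the dashboard HTML strings from a runs/export JSON
# response. Only the "content" field of each view object is collected;
# the HTTP call and email delivery below are hypothetical placeholders.

def extract_dashboard_html(export_response: dict) -> list:
    """Collect HTML strings from a Jobs runs/export response."""
    return [
        view["content"]
        for view in export_response.get("views", [])
        if "content" in view
    ]

# Usage against GET /api/2.0/jobs/runs/export?run_id=<id>&views_to_export=DASHBOARD:
#   resp = requests.get(export_url, headers=auth).json()
#   for html in extract_dashboard_html(resp):
#       send_report_email(html)  # hypothetical delivery helper
```

Returning a list (rather than a single string) accounts for exports that contain more than one view object.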
Is there a way to load different data into several BigQuery tables in one load job, using the Node.js gcloud library or the bq command line?
No. It's one load job per table.
https://cloud.google.com/bigquery/loading-data
If you're feeling adventurous, you could write a Dataflow pipeline which reads from multiple sources and writes to multiple sinks in BigQuery.
https://cloud.google.com/dataflow/model/bigquery-io
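Since each load job targets exactly one destination table, the straightforward workaround is to submit one job per table in a loop. A minimal sketch using plain load-job configuration dicts (the same fields the client libraries and `bq load` accept); shown in Python for illustration, and all dataset, table, and URI names are placeholders.

```python
# Sketch: build one load-job configuration per destination table, since
# a BigQuery load job can only write to a single table. The dataset,
# table names, and GCS URIs below are placeholder values.

def build_load_jobs(dataset: str, table_to_uri: dict) -> list:
    """Return one load-job configuration dict per destination table."""
    return [
        {
            "configuration": {
                "load": {
                    "sourceUris": [uri],
                    "destinationTable": {"datasetId": dataset, "tableId": table},
                    "sourceFormat": "CSV",
                }
            }
        }
        for table, uri in table_to_uri.items()
    ]

jobs = build_load_jobs("analytics", {
    "events": "gs://my-bucket/events/*.csv",
    "users": "gs://my-bucket/users/*.csv",
})
# Submit each configuration with the client of your choice, e.g. the
# Node.js library's table.load(...) or one `bq load` invocation per table.
```

The jobs are independent, so they can be submitted concurrently; each succeeds or fails on its own.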