I have an Azure Databricks notebook which contains a SQL command. I need to capture the output of the SQL command and use it in .NET Core.
Need help.
You cannot capture the results of an Azure Databricks notebook directly in .NET Core.
There is also no .NET SDK available, so you need to rely on the Databricks REST APIs from your .NET code for all operations. You could try the following -
Update your notebook to export the result of your SQL query as a CSV file to the file store using df.write. For example -
df.write.format("com.databricks.spark.csv").option("header","true").save("sqlResults.csv")
You can set up a job with the above notebook and then invoke that job from .NET using the Jobs API run-now endpoint.
You then need to poll the run status (e.g. via the Jobs API runs list method) from your .NET code to check for job completion.
Once the job has completed, use the DBFS API read endpoint to fetch the content of the CSV file your notebook generated in step 1. A rough sketch of this sequence follows.
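The question asks about .NET Core, but since everything here is plain REST, below is a minimal sketch of the sequence in Python for illustration; the same three calls translate directly to HttpClient calls in .NET Core. The workspace URL, token, job_id and output path are placeholders, and because df.write produces a directory of part files, the exact file name under sqlResults.csv would first have to be discovered (e.g. via the DBFS list endpoint).

    import base64
    import time
    import requests

    HOST = "https://<your-databricks-instance>"                    # placeholder workspace URL
    HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # placeholder token

    # 1. Trigger the job that runs the notebook (Jobs API: run-now)
    run = requests.post(f"{HOST}/api/2.0/jobs/run-now",
                        headers=HEADERS, json={"job_id": 123}).json()
    run_id = run["run_id"]

    # 2. Poll until the run reaches a terminal state
    while True:
        state = requests.get(f"{HOST}/api/2.0/jobs/runs/get",
                             headers=HEADERS, params={"run_id": run_id}).json()["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            break
        time.sleep(30)

    # 3. Read the CSV the notebook wrote (DBFS API: read returns base64-encoded data)
    resp = requests.get(f"{HOST}/api/2.0/dbfs/read", headers=HEADERS,
                        params={"path": "/sqlResults.csv/<part-file>",  # hypothetical part-file name
                                "offset": 0, "length": 1048576}).json()
    csv_text = base64.b64decode(resp["data"]).decode("utf-8")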
I have a CSV file with a comma delimiter. I tried to insert this file's data into a Cosmos database.
This is my expression builder:
#(A=A,B=B,C=C,D=D,E=E,F=F,G=G,H=H,I=I,J=J,K=K,L=L,M=M,N=N,O=O,P=P,Q=Q,R=R,S=S,T=T,U=U,V=V)
When I use up to 15 values (up to O) it works. If I use all the values it does not: the pipeline runs indefinitely, and after checking for up to 4 hours it was still running. The file contains only one row.
I reproduced this and faced the same issue despite increasing the data flow runtime cores, but you can try the following workaround.
First, transform the CSV file into a JSON file in a new blob storage container, then use a copy activity to copy it to Azure Cosmos DB.
Create a JSON dataset pointing at the new container but without any filename, since the data flow will create a part JSON file in the container, and use it as the data flow sink.
Now, in the pipeline, use a copy activity with that same JSON dataset as its source. Use a wildcard path (*.json) to pick up the JSON files, since the only file present will be the JSON generated by the data flow.
Give the Cosmos DB dataset as the sink of the copy activity. On execution, it creates a JSON file in the blob container and then copies its content to Cosmos DB.
(Screenshot: the resulting JSON file in blob storage)
(Screenshot: the data copied into Cosmos DB)
NOTE: This approach might also get stuck in an InProgress state if you use the same Azure integration runtime, with a small number of cores, for the data flow debug session.
In that case, try creating a new Azure integration runtime with more cores and use it for the data flow debug.
Please check How to create and configure Azure Integration Runtime.
After creation, go to Monitor, open the integration runtime, increase the cores, and try the data flow again.
Switch to this runtime in the data flow settings of the pipeline and in the data flow debug session as well.
We need to execute a long-running exe on a Windows machine and are thinking about how to integrate it with the workflow. The plan is to include the exe as a task in the Databricks workflow.
We are considering a couple of approaches:
Create a DB table and insert a row when this particular task starts in the workflow. The exe running on the Windows machine polls the table for new records; once a new record is found, it proceeds with the actual execution and updates the status after completion. Databricks queries this table continually for the status, and once it shows completed, the task finishes.
Using the Databricks API, check whether the task has started, let the exe continue with its execution, and after the application finishes update the task status to completed; until then the Databricks task would spin in a while (true) loop. However, the current API does not appear to support updating a task's execution status to completed (not 100% sure).
Please share thoughts or alternative solutions.
This is an interesting problem. Is there a reason you must use Databricks to execute an EXE?
Regardless, I think you have the right kind of idea. Here is how I would do this with the Jobs API:
Have your EXE process write a file to a staging location, probably in DBFS, since this will be locally accessible inside of Databricks.
Build a notebook to load this file; having a table is optional but may give you additional logging capabilities if needed. The notebook should end with dbutils.notebook.exit, which lets you output any value, string or array. You could return "In Progress" and "Success", or the latest line from the file you've written (see the sketch after this list).
Wrap that notebook in a Databricks job and execute it on an interval with a cron schedule (you said 1 minute); you can then retrieve the output value of the job via the get-output endpoint.
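As a rough illustration of the notebook in steps 2 and 3, it could be as simple as the sketch below; the staging path and the DONE sentinel are assumptions, so use whatever convention your EXE actually writes.

    # Runs inside the Databricks notebook that the scheduled job executes.
    # /dbfs/... is the local filesystem mount of DBFS on the driver.
    status_path = "/dbfs/exe-staging/status.txt"   # hypothetical file written by the EXE

    content = ""
    try:
        with open(status_path) as f:
            content = f.read().strip()
    except FileNotFoundError:
        pass  # the EXE has not produced anything yet

    # dbutils.notebook.exit makes this value retrievable via the get-output endpoint
    dbutils.notebook.exit("Success" if content.endswith("DONE") else "In Progress")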
An additional note: the benefit of abstracting this into return values from a notebook is that you can orchestrate it via other workflow tools, e.g. Databricks Workflows or Azure Data Factory, inside an Until condition. There are no limits so long as you can orchestrate a notebook in that tool.
Using Python 3, I am trying to compare an Excel (xlsx) sheet to an identical Spark table in Databricks. I want to avoid doing the comparison in Databricks, so I am looking for a way to read the Spark table via the Databricks API. Is this possible? How can I go about reading a table such as DB.TableName?
As far as I am aware, there is no way to read the table from the Databricks API unless you run it as a job, as LaTreb already mentioned. However, if you really wanted to, you could use either the ODBC or JDBC drivers to get the data through your Databricks cluster.
Information on how to set this up can be found here.
Once you have the DSN set up, you can use pyodbc to connect to Databricks and run a query, for example as in the sketch below. At this time the ODBC driver will only allow you to run Spark SQL commands.
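A minimal sketch, assuming you have already created a DSN (here called "Databricks") for the Databricks/Simba Spark ODBC driver:

    import pyodbc

    # "Databricks" is whatever DSN name you configured for the ODBC driver
    conn = pyodbc.connect("DSN=Databricks", autocommit=True)
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM DB.TableName")   # the Spark SQL runs on the cluster
    rows = cursor.fetchall()
    conn.close()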
All that being said, it will probably still be easier to just load the data into Databricks, unless you have some sort of security concern.
I can recommend writing PySpark code in a notebook, calling the notebook from a previously defined job, and establishing a connection between your local machine and the Databricks workspace.
You could perform the comparison directly in Spark, or convert the data frames to pandas if you wish. When the notebook finishes the comparison, it can return the result from that particular job run (see the sketch after the documentation excerpt below). Sending entire Databricks tables back is likely impossible because of API limitations: you have a Spark cluster to perform complex operations, and the API should be used to send small messages.
Official documentation:
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs#--runs-get-output
Retrieve the output and metadata of a run. When a notebook task returns a value through the dbutils.notebook.exit() call, you can use this endpoint to retrieve that value. Azure Databricks restricts this API to return the first 5 MB of the output. For returning a larger result, you can store job results in a cloud storage service.
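For example, a comparison notebook along these lines could return a small verdict through dbutils.notebook.exit instead of shipping whole tables over the API (the DBFS path is hypothetical, and reading xlsx with pandas requires openpyxl on the cluster):

    import pandas as pd

    # Hypothetical DBFS location where the Excel file was uploaded
    excel_df = pd.read_excel("/dbfs/FileStore/compare/source.xlsx")

    # Pull the Spark table down to pandas for the comparison
    table_df = spark.table("DB.TableName").toPandas()

    # Return only a small message, not the data itself, as the job output
    dbutils.notebook.exit("MATCH" if excel_df.equals(table_df) else "MISMATCH")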
I have been using Zeppelin for a few months now. It is a great tool for internal data analytics, but I am looking for more features for sharing reports with customers. I need to send weekly/monthly/quarterly reports to the customers and am looking for a way to automate this process.
Please let me know if Databricks Spark notebooks or any other tool have features to help me do this.
You can use a Databricks dashboard for this. Once you have the dashboard, you can do an HTML export of it and share the HTML file.
If you're interested in automating the reporting process, you may want to look into the Databricks REST API: https://docs.databricks.com/api/latest/jobs.html#runs-export. You need to pass the run_id of the notebook job and the desired views_to_export (this value should be DASHBOARD) as query parameters. Note that this run export only supports notebook job exports, which is fine because dashboards are usually generated from notebook jobs.
If the dashboard HTML export is successful, you'll get a "views" JSON response consisting of a list of key-value objects; the HTML string is available under the "content" key in each object. You can then do anything with this HTML string, such as sending it directly to email or Slack for automatic reporting.
In order to generate a run_id, you first need to create a notebook job, which you can do via the Databricks UI. Then you can get the run_id by triggering the notebook job to run, either:
using the Databricks scheduler, or
using the Databricks run-now REST API: https://docs.databricks.com/api/latest/jobs.html#run-now.
I preferred the second method, running the job programmatically via the REST API, because I can always capture the run_id when I trigger the job, unlike the first method where I have to look at the Databricks UI each time the job is scheduled to run. Either way, you must wait for the notebook job run to finish before calling the export, in order to get the complete dashboard HTML (a rough sketch follows).
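Putting that together, here is a minimal sketch in Python; the workspace URL, token and job_id are placeholders, and the run must have finished before the export call:

    import requests

    HOST = "https://<your-databricks-instance>"    # placeholder workspace URL
    HEADERS = {"Authorization": "Bearer <token>"}  # placeholder token

    # Trigger the notebook job and capture the run_id
    run_id = requests.post(f"{HOST}/api/2.0/jobs/run-now",
                           headers=HEADERS, json={"job_id": 42}).json()["run_id"]

    # ... poll /api/2.0/jobs/runs/get until the run has finished ...

    # Export the dashboard views of the finished run as HTML
    views = requests.get(f"{HOST}/api/2.0/jobs/runs/export", headers=HEADERS,
                         params={"run_id": run_id, "views_to_export": "DASHBOARD"}).json()["views"]
    dashboard_html = "".join(v["content"] for v in views)
    # dashboard_html can now be sent by email or posted to Slack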
I am using Apache Spark in Bluemix.
I want to implement a scheduler for SparkSQL jobs. I saw this link to a blog that describes scheduling, but it is not clear how I should update the manifest. Maybe there is some other way to schedule my jobs.
The manifest file guides the deployment of Cloud Foundry (cf) apps. So in your case, it sounds like you want to deploy a cf app that acts as a SparkSQL scheduler, and use the manifest file to declare that your app doesn't need any of the web-app routing or other user-facing features, because you just want to run a background scheduler. This is all well and good, and the cf docs will help you make that happen.
However, you cannot run a SparkSQL scheduler against the Bluemix Spark service today, because it only supports Jupyter notebooks through the Data & Analytics section of Bluemix, i.e. only a notebook UI. You would need a Spark API you could drive from your scheduler cf app, e.g. something like spark-submit, where you can create your Spark context and then run programs, like the SparkSQL you mention. Such an API is supposed to be coming to the Apache Spark Bluemix service.
UPDATE: spark-submit was made available around the end of 1Q16. It is a shell script, but internally it makes REST calls via curl. The REST API itself does not yet seem to be officially supported, so you can either call the script from your scheduler, or take the risk of calling the REST API directly and hope it doesn't change and break you.