CDAP with Azure Databricks - databricks

Has anyone tried using Azure Databricks as the Spark cluster for CDAP job processing? The CDAP documentation details how to add it to Azure HDInsight, but I'm wondering whether there is a way to configure CDAP to point to a Databricks Spark cluster. Is it even possible, or does this kind of integration need a specific Databricks client connector jar? If anyone has any insights, that would be helpful.

There is no out-of-the-box support for Databricks Spark on Azure. That said, you can develop a new cloud runtime that is capable of submitting jobs to a Databricks Spark cluster. Here is an example of how to write a runtime extension for Cloud Dataproc and EMR.

Related

Why is there no Spark connector for Databricks?

I would like to read data stored in Databricks with Spark running outside of Databricks, but it looks like there is no Spark connector for Databricks available. The Snowflake Connector for Spark is an example; I am looking for something similar for Databricks.
May I know what you mean by connectors?
Databricks itself runs the Spark clusters.
You can attach notebooks or run a Spark job on top of them.
https://docs.databricks.com/notebooks/index.html
If you want to connect your local machine to a Databricks cluster and do development, you can try Databricks Connect, described below.
https://docs.databricks.com/dev-tools/databricks-connect.html
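For instance, once databricks-connect is installed and configured against your workspace, local code can obtain a Spark session that executes on the remote cluster. A minimal sketch, assuming the configuration step has already been completed (no workspace-specific values appear in the code itself):

```python
# Assumes `pip install databricks-connect` and `databricks-connect configure`
# have been run, so the session below resolves to the remote Databricks cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(100)   # planned locally
print(df.count())       # executed on the Databricks cluster, result returned locally
```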

Databricks notebooks lineage in Azure Purview

If I read a file from ADLS into a PySpark data frame and write it back to another ADLS folder in a different file format, will that lineage be captured in the Hive metastore? Can lineage be shown for this kind of operation?
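For reference, the kind of operation being asked about might look like the following minimal PySpark sketch (the storage account, container, and folder names are hypothetical, and storage credentials are assumed to be configured on the cluster):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS Gen2 paths.
src = "abfss://raw@mystorageaccount.dfs.core.windows.net/input/events.csv"
dst = "abfss://curated@mystorageaccount.dfs.core.windows.net/output/events_parquet"

df = spark.read.option("header", "true").csv(src)  # read CSV from one ADLS folder
df.write.mode("overwrite").parquet(dst)            # write back as Parquet to another folder
```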
Currently this lineage won't show up out of the box. However, Purview uses Apache Atlas behind the scenes, so you can probably capture this lineage using the API.
Here's an example of where Spline was used to track lineage from notebooks:
https://intellishore.dk/data-lineage-from-databricks-to-azure-purview/
This article talks about how to get started with the Purview REST API:
https://techcommunity.microsoft.com/t5/azure-architecture-blog/exploring-purview-s-rest-api-with-python/ba-p/2208058
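As a rough illustration of that approach, recording custom lineage boils down to authenticating against Purview and POSTing an Atlas Process entity that links the input and output assets. This is only a sketch under assumptions: the service principal values, account name, qualified names, and the azure_datalake_gen2_path type name are placeholders, and the exact type names depend on how your assets were scanned.

```python
import requests
from azure.identity import ClientSecretCredential

# Placeholder service principal and account details (assumptions, not real values).
credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
token = credential.get_token("https://purview.azure.net/.default").token

endpoint = "https://<purview-account>.purview.azure.com/catalog/api/atlas/v2/entity"

# An Atlas Process entity whose inputs/outputs reference existing ADLS path
# entities by qualified name.
process_entity = {
    "entity": {
        "typeName": "Process",
        "attributes": {
            "qualifiedName": "notebooks://my-notebook/transform-step",
            "name": "my-notebook transform step",
            "inputs": [{
                "typeName": "azure_datalake_gen2_path",
                "uniqueAttributes": {"qualifiedName": "https://<storage>.dfs.core.windows.net/raw/input.csv"},
            }],
            "outputs": [{
                "typeName": "azure_datalake_gen2_path",
                "uniqueAttributes": {"qualifiedName": "https://<storage>.dfs.core.windows.net/curated/output.parquet"},
            }],
        },
    }
}

resp = requests.post(
    endpoint,
    json=process_entity,
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
print(resp.json())
```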
You can use the OpenLineage based Databricks to Purview Solution Accelerator to ingest the lineage provided by Databricks. By deploying the solution accelerator, you'll have a set of Azure Functions and a Databricks cluster that can extract the logical plan from a Databricks notebook / job and transform it automatically to Apache Atlas / Microsoft Purview entities.
Supports table level lineage from Spark Notebooks and jobs for the following data sources:
Azure SQL
Azure Synapse Analytics
Azure Data Lake Gen 2
Azure Blob Storage
Delta Lake
Supports Spark 3.1 and 3.0 (Interactive and Job clusters) / Spark 2.x (Job clusters)
Databricks Runtimes between 6.4 and 10.3 are currently supported
Can be configured per cluster or for all clusters as a global configuration
Once configured, does not require any code changes to notebooks or jobs

How to get job/run level logs in Databricks?

Databricks only provides cluster-level logs in the UI or in the API. Is there a way to configure Spark or log4j in Databricks so that we get run/job-level logs?
You can find a guide on monitoring Azure Databricks on the Azure Architecture Center; it explains the concepts used in this article: Monitoring and Logging in Azure Databricks with Azure Log Analytics and Grafana.
To provide full data collection, we combine the Spark monitoring library with a custom log4j.properties configuration. The build of the monitoring library for Spark 2.4 and its installation in Databricks are automated through the scripts referenced in the tutorial, available at https://github.com/algattik/databricks-monitoring-tutorial/.
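Separately from the Log Analytics pipeline described above, if the goal is just to retrieve the output of a specific run after the fact, the Databricks Jobs API exposes a per-run output endpoint. A minimal sketch, where the workspace URL, token, and run ID are placeholders (this is a complementary approach, not the one from the article):

```python
import requests

# Placeholder workspace details (assumptions).
host = "https://<workspace>.azuredatabricks.net"
token = "<personal-access-token>"
run_id = 12345  # the run_id of a single task run

# Fetch the output of one run; for jar/spark-submit tasks the response
# includes captured logs, for notebook tasks the notebook output.
resp = requests.get(
    f"{host}/api/2.1/jobs/runs/get-output",
    headers={"Authorization": f"Bearer {token}"},
    params={"run_id": run_id},
)
resp.raise_for_status()
output = resp.json()
print(output.get("logs") or output.get("notebook_output"))
```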

How to submit custom spark application on Azure Databricks?

I have created a small application that submits a Spark job at certain intervals and creates some analytical reports. These jobs can read data from a local filesystem or a distributed filesystem (the fs could be HDFS, ADLS, or WASB). Can I run this application on an Azure Databricks cluster?
The application works fine on an HDInsight cluster, as I was able to access the nodes. I kept my deployable jar in one location and started it using a start script; similarly, I could stop it using a stop script that I prepared.
One thing I found is that Azure Databricks has its own file system, ADFS. I can add support for this file system, but will I then be able to deploy and run my application as I did on the HDInsight cluster? If not, is there a way I can submit jobs from an edge node, my HDInsight cluster, or any other on-prem cluster to an Azure Databricks cluster?
Have you looked at Jobs? https://docs.databricks.com/user-guide/jobs.html. You can submit jars to be run with spark-submit just like on HDInsight.
The Databricks file system is DBFS; ABFS is used for Azure Data Lake. You should not need to modify your application for either of these, as the file paths will be handled by Databricks.
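For example, a one-time run of a jar can be submitted through the Jobs API without any access to the cluster nodes. A rough sketch, where the workspace URL, token, jar path, main class, node type, and runtime version are all placeholder assumptions:

```python
import requests

# Placeholder workspace and application details (assumptions).
host = "https://<workspace>.azuredatabricks.net"
token = "<personal-access-token>"

payload = {
    "run_name": "analytics-report",
    "tasks": [{
        "task_key": "main",
        "new_cluster": {
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
        "libraries": [{"jar": "dbfs:/apps/analytics-report.jar"}],
        "spark_jar_task": {
            "main_class_name": "com.example.ReportJob",
            "parameters": ["--date", "2023-01-01"],
        },
    }],
}

# Submit a one-time run; the response contains a run_id that can be polled for status.
resp = requests.post(
    f"{host}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["run_id"])
```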

Spark Deployment in Azure cloud

Is it possible to deploy Spark code in the Azure cloud without the YARN component? Thanks in advance.
Yes, you can deploy an Apache Spark cluster in Azure HDInsight without YARN.
Spark clusters in HDInsight include the following components, which are available on the clusters by default:
1) Spark Core. Includes Spark Core, Spark SQL, Spark Streaming APIs, GraphX, and MLlib.
2) Anaconda
3) Livy
4) Jupyter notebook
5) Zeppelin notebook
Spark clusters on HDInsight also provide an ODBC driver for connectivity from BI tools such as Microsoft Power BI and Tableau.
Refer to the following sites for more information:
Create an Apache Spark cluster in Azure HDInsight
Introduction to Spark on HDInsight
I don't think it is possible to deploy an HDInsight cluster without YARN. Refer to the HDInsight documentation:
https://learn.microsoft.com/en-sg/azure/hdinsight/hdinsight-hadoop-introduction
https://learn.microsoft.com/en-sg/azure/hdinsight/hdinsight-component-versioning
YARN is the resource manager for Hadoop. Is there any particular reason you would not want to use YARN while working with an HDInsight Spark cluster?
If you want to use standalone mode, you can change the master URL when submitting the job with the spark-submit command (see the PySpark sketch after the links below).
I have some examples in my repo using spark-submit, both in local mode and on an HDInsight cluster.
https://github.com/NileshGule/learning-spark
You can refer to:
local mode: https://github.com/NileshGule/learning-spark/blob/master/src/main/java/com/nileshgule/movielens/MovieLens.md
HDInsight Spark cluster: https://github.com/NileshGule/learning-spark/blob/master/Azure.md
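As a minimal PySpark illustration of the master URL point above (the application name is arbitrary, and on a real standalone cluster you would point the master at that cluster instead of local mode):

```python
from pyspark.sql import SparkSession

# Run without YARN by targeting local mode; on a standalone cluster you would
# use spark://<master-host>:7077 here, or pass --master to spark-submit instead.
spark = (
    SparkSession.builder
    .appName("movielens-local")
    .master("local[*]")
    .getOrCreate()
)

print(spark.range(10).count())
spark.stop()
```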
