SparkSession Connect to Databricks Azure - apache-spark

I'm using Maven and Scala to create a Spark application that needs to connect to a cluster on Azure Databricks.
How can I point my SparkSession at the Databricks cluster?
I saw databricks-connect, but it loads some jar files using sbt.
I don't understand exactly how it achieves that connectivity.
My use case requires running a Spark job programmatically on the Databricks cluster upon request, so I need to be able to connect to it.

Related

Why is there no Spark connector for Databricks?

I would like to read data in Databricks with Spark from outside Databricks, but it looks like there is no Spark connector for Databricks available. The Snowflake Connector for Spark is an example; I am looking for something similar for Databricks.
May I know what you mean by connectors?
Databricks itself has Spark clusters running in it.
You can attach notebooks to them or run a Spark job on top of them.
https://docs.databricks.com/notebooks/index.html
If you want to connect your local machine to a Databricks cluster and do development, you can try Databricks Connect:
https://docs.databricks.com/dev-tools/databricks-connect.html

How to run Spark SQL queries on a Databricks cluster from Linux?

I want to execute Spark SQL commands from a Linux machine on a Databricks cluster. Is there any way to achieve this?
I have a set of Spark SQL commands in a .sql file and want to execute this file on the Databricks cluster from the Linux machine.
I am looking for something analogous to SQLPLUS, where we make a connection to the DB and execute SQL. Is there a similar utility or solution for executing Spark SQL on a Databricks cluster?
You can connect to a Databricks cluster using ODBC, JDBC, HTTP or the Thrift protocol. In every case you will need an access token with enough permissions.
I am using DataGrip to connect via JDBC. I had to configure the Databricks driver and used this URI:
jdbc:spark://mycompany.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<MY-DATABRICKS-ORGANIZATION-ID>/<MY-DATABRICKS-CLUSTER-ID>;AuthMech=3;UID=token;PWD=<MY-DATABRICKS-TOKEN>
I believe any modern SQL client should be able to connect, as Databricks exposes standard interfaces.
This is the official documentation from Databricks:
https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html
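As a rough illustration of how the JDBC URI in the answer above is put together, here is a minimal Python sketch that assembles the same string from its parts. The function name and parameter names are hypothetical, made up for this example; only the URI layout comes from the answer. It builds the string and nothing more: it does not open a connection, and the driver you use may expect different settings.

```python
def build_databricks_jdbc_url(host, org_id, cluster_id, token,
                              schema="default", port=443):
    """Assemble a JDBC URL in the same shape as the one quoted in the answer.

    All argument names are illustrative; the values come from your own
    workspace (organization ID, cluster ID, personal access token).
    """
    # httpPath identifies the cluster inside the workspace.
    http_path = f"sql/protocolv1/o/{org_id}/{cluster_id}"
    # Settings are kept in the same order as in the quoted URI.
    settings = [
        ("transportMode", "http"),
        ("ssl", "1"),
        ("httpPath", http_path),
        ("AuthMech", "3"),   # token-based login, as in the quoted URI
        ("UID", "token"),    # literal user name "token"
        ("PWD", token),      # the access token itself
    ]
    joined = ";".join(f"{key}={value}" for key, value in settings)
    return f"jdbc:spark://{host}:{port}/{schema};{joined}"


if __name__ == "__main__":
    # Placeholder values only; substitute your own.
    print(build_databricks_jdbc_url(
        "mycompany.cloud.databricks.com", "<ORG-ID>", "<CLUSTER-ID>", "<TOKEN>"))
```

Keeping the token as a separate argument makes it easier to read it from an environment variable instead of hard-coding it in a client configuration.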

How to submit a custom Spark application on Azure Databricks?

I have created a small application that submits a Spark job at certain intervals and creates some analytical reports. These jobs can read data from a local filesystem or a distributed filesystem (HDFS, ADLS or WASB). Can I run this application on an Azure Databricks cluster?
The application works fine on an HDInsight cluster, where I was able to access the nodes. I kept my deployable jar in one location and started it using a start script; similarly, I could stop it using a stop script that I prepared.
One thing I found is that Azure Databricks has its own file system: ADFS. I can add support for this file system, but will I then be able to deploy and run my application as I did on the HDInsight cluster? If not, is there a way I can submit jobs from an edge node, my HDInsight cluster or any other on-prem cluster to an Azure Databricks cluster?
Have you looked at Jobs? https://docs.databricks.com/user-guide/jobs.html. You can submit jars to spark-submit just like on HDInsight.
The Databricks file system is DBFS - ABFS is used for Azure Data Lake. You should not need to modify your application for these - the file paths will be handled by Databricks.
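To make the Jobs suggestion a little more concrete, here is a hedged Python sketch of what submitting a one-off JAR run from an external machine could look like over REST. The endpoint path (`/api/2.1/jobs/runs/submit`) and the payload fields (`existing_cluster_id`, `spark_jar_task`, `libraries`) reflect my understanding of the Databricks Jobs API and are not taken from the answer, so check them against the current documentation. The function name, run name, and all example values are hypothetical. The sketch only builds the request object; it does not send it.

```python
import json
import urllib.request


def build_submit_run_request(workspace_url, token, cluster_id,
                             jar_path, main_class, parameters=()):
    """Build (but do not send) a request that submits a one-time JAR run.

    Assumption: field names follow the Databricks Jobs API 2.1 as I
    understand it; verify against the official API reference.
    """
    payload = {
        "run_name": "on-demand-spark-job",  # hypothetical label
        "tasks": [
            {
                "task_key": "main",
                "existing_cluster_id": cluster_id,
                "spark_jar_task": {
                    "main_class_name": main_class,
                    "parameters": list(parameters),
                },
                # The jar is expected to be uploaded already, e.g. to DBFS.
                "libraries": [{"jar": jar_path}],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.1/jobs/runs/submit",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would actually send it. An application that triggers jobs "upon request" could call a function like this from an edge node without needing shell access to the cluster nodes.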

CDAP with Azure Databricks

Has anyone tried using Azure Databricks as the Spark cluster for CDAP job processing? The CDAP documentation details how to add it to Azure HDInsight, but is there a way to configure CDAP to point to a Databricks Spark cluster? Is it even possible, or does this kind of integration need a specific Databricks client connector jar? Any insights would be helpful.
There is no out-of-the-box support for Databricks Spark on Azure. That said, you can develop a new Cloud Runtime that is capable of submitting jobs to a Databricks Spark cluster. Here is an example of how to write a runtime extension for Cloud Dataproc and EMR.

Spark deployment in the Azure cloud

Is it possible to deploy Spark code in the Azure cloud without the YARN component? Thanks in advance.
Yes, you can deploy an Apache Spark cluster in Azure HDInsight without YARN.
Spark clusters in HDInsight include the following components that are available on the clusters by default.
1) Spark Core. Includes Spark Core, Spark SQL, Spark streaming APIs, GraphX, and MLlib.
2) Anaconda
3) Livy
4) Jupyter notebook
5) Zeppelin notebook
Spark clusters on HDInsight also provide an ODBC driver for connectivity to Spark clusters in HDInsight from BI tools such as Microsoft Power BI and Tableau.
Refer to the following pages for more information:
Create an Apache Spark cluster in Azure HDInsight
Introduction to Spark on HDInsight
I don't think it is possible to deploy an HDInsight cluster without YARN. Refer to the HDInsight documentation:
https://learn.microsoft.com/en-sg/azure/hdinsight/hdinsight-hadoop-introduction
https://learn.microsoft.com/en-sg/azure/hdinsight/hdinsight-component-versioning
YARN is the resource manager for Hadoop. Is there any particular reason you would not want to use YARN while working with an HDInsight Spark cluster?
If you want to use standalone mode, you can change the master URL when submitting the job with the spark-submit command.
I have some examples in my repo that use spark-submit both in local mode and on an HDInsight cluster:
https://github.com/NileshGule/learning-spark
You can refer to:
Local mode: https://github.com/NileshGule/learning-spark/blob/master/src/main/java/com/nileshgule/movielens/MovieLens.md
HDInsight Spark cluster: https://github.com/NileshGule/learning-spark/blob/master/Azure.md
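The point about changing the master URL can be sketched as follows. This small Python helper assembles a spark-submit command line using the standard `--master` and `--class` options. The helper name, host name, class name and jar name are all hypothetical placeholders, and the helper only builds the argument list without running anything.

```python
import shlex


def spark_submit_command(master, main_class, jar, app_args=()):
    """Return the spark-submit argument list for the given master URL.

    Typical master values: "local[*]" for local mode,
    "spark://<host>:7077" for a standalone cluster, "yarn" for YARN.
    """
    return ["spark-submit",
            "--master", master,
            "--class", main_class,
            jar,
            *app_args]


if __name__ == "__main__":
    # Same application, two different masters; names are placeholders.
    for master in ("local[*]", "spark://master-host:7077"):
        cmd = spark_submit_command(master, "com.example.App", "app.jar")
        print(shlex.join(cmd))  # shell-quoted form, safe to paste
```

Only the value after `--master` differs between the two invocations, which is why the same jar can be moved between local, standalone and YARN deployments without code changes, provided the application does not hard-code a master itself.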
