Azure Databricks: How to add Spark configuration in Databricks cluster

I am using a Databricks Spark cluster and want to add a customized Spark configuration.
There is Databricks documentation on this, but it doesn't give me any clue about what changes I should make or how. Can someone please share an example of configuring a Databricks cluster?
Is there any way to see the default Spark configuration in a Databricks cluster?

You can set the cluster's Spark config in the Compute section of your Databricks workspace.
Go to Compute (select your cluster) > Configuration > Advanced options > Spark.
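In the Spark config box there, properties are entered as plain text, one per line, with the key and value separated by a space. For example (the property values here are illustrative, not recommendations):
spark.sql.shuffle.partitions 200
spark.speculation true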
Or, you can set configs from a notebook:
%python
# Set a property by name; spark.sql.shuffle.partitions is shown as a concrete example
spark.conf.set("spark.sql.shuffle.partitions", "200")
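To see the configuration the cluster is actually running with (the second part of the question), a minimal sketch from a notebook; spark.conf.get falls back to the built-in default when a property has not been set explicitly:
%python
# List every Spark property explicitly set on this cluster
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)

# Look up a single property; returns the default if it was never set
print(spark.conf.get("spark.sql.shuffle.partitions"))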

Related

Why is there no Spark connector for Databricks?

I would like to read Databricks data with Spark from outside of Databricks, but it looks like there is no Spark connector for Databricks available. The Snowflake Connector for Spark is an example; I am looking for something similar for Databricks.
May I know what you mean by connectors?
Databricks itself runs the Spark clusters; you can attach notebooks to them or run Spark jobs on top of them.
https://docs.databricks.com/notebooks/index.html
If you want to connect your local machine to a Databricks cluster for development, you can try Databricks Connect:
https://docs.databricks.com/dev-tools/databricks-connect.html
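As a minimal sketch of that workflow, assuming classic Databricks Connect is installed and configured against your workspace (pip install databricks-connect, then databricks-connect configure); the table name is a placeholder:
from pyspark.sql import SparkSession

# With Databricks Connect configured, the ordinary SparkSession builder
# attaches to the remote Databricks cluster instead of starting a local one
spark = SparkSession.builder.getOrCreate()

df = spark.read.table("default.my_table")  # placeholder table name
df.show(5)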

How to set Spark configuration for Databricks SQL Endpoint

I know how to set Spark configuration on a regular Databricks compute cluster, but I didn't see any place to set it on a Databricks SQL endpoint.

How to access a Hive table located in HDInsight from a local Spark server built in IntelliJ

I am not able to access and read data from a Hive table located in HDInsight from my local instance, where the application is built with IntelliJ and Maven.
Could someone tell me the prerequisites for writing data from Spark to Hive when Hive is located on HDInsight and Spark runs on a local native instance?
Note: I don't have a Spark cluster on HDInsight, I only have a Hive cluster on HDInsight.
Please share your comments.
Add the cluster's hive-site.xml to your resources folder. Also, make sure the required network ports are open from your local network. Please refer to the link below.
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
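As a minimal sketch of the local side, assuming the cluster's hive-site.xml is on the classpath (e.g. src/main/resources) and the metastore port is reachable; the thrift host and table name are placeholders:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("local-to-hdinsight-hive")
    .master("local[*]")
    # Normally read from hive-site.xml; shown explicitly with a placeholder host
    .config("hive.metastore.uris", "thrift://<hdinsight-metastore-host>:9083")
    .enableHiveSupport()
    .getOrCreate())

spark.sql("SHOW DATABASES").show()
spark.read.table("default.my_table").show(5)  # placeholder table name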

CDAP with Azure Databricks

Has anyone tried using Azure Databricks as the Spark cluster for CDAP job processing? The CDAP documentation details how to add it to Azure HDInsight, but is there a way to configure CDAP to point to a Databricks Spark cluster? Is that even possible, or does this kind of integration need a specific Databricks client connector JAR? If anyone has any insights, that would be helpful.
There is no out-of-the-box support for Databricks Spark on Azure. That said, you can develop a new cloud runtime that is capable of submitting jobs to a Databricks Spark cluster; there are examples of how to write such a runtime extension for Cloud Dataproc and EMR.
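At its core, such a runtime would hand jobs to Databricks through the Jobs API. A minimal sketch using the runs-submit endpoint; the workspace URL, token, node type, Spark version, main class, and JAR path are all placeholders:
import requests

HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

payload = {
    "run_name": "cdap-pipeline-run",
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
    # Submit the pipeline as a Spark JAR task on a one-off jobs cluster
    "spark_jar_task": {"main_class_name": "com.example.PipelineMain"},
    "libraries": [{"jar": "dbfs:/jars/pipeline.jar"}],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print("run_id:", resp.json()["run_id"])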

Spark Deployment in Azure cloud

Is it possible to deploy Spark code in the Azure cloud without the YARN component? Thanks in advance.
Yes, you can deploy an Apache Spark cluster in Azure HDInsight without YARN.
Spark clusters in HDInsight include the following components, available on the clusters by default:
1) Spark Core, which includes Spark Core, Spark SQL, the Spark Streaming APIs, GraphX, and MLlib
2) Anaconda
3) Livy
4) Jupyter notebook
5) Zeppelin notebook
Spark clusters on HDInsight also provide an ODBC driver for connecting to them from BI tools such as Microsoft Power BI and Tableau.
Refer to the following for more information:
Create an Apache Spark cluster in Azure HDInsight
Introduction to Spark on HDInsight
I don't think it is possible to deploy an HDInsight cluster without YARN. Refer to the HDInsight documentation:
https://learn.microsoft.com/en-sg/azure/hdinsight/hdinsight-hadoop-introduction
https://learn.microsoft.com/en-sg/azure/hdinsight/hdinsight-component-versioning
YARN is the resource manager for Hadoop. Is there any particular reason you would not want to use YARN while working with an HDInsight Spark cluster?
If you want to use standalone mode, you can change the master URL when submitting the job with the spark-submit command, as sketched below.
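For instance (the class name, JAR, and master host are placeholders):
# Local mode: run driver and executors inside a single local JVM
spark-submit --class com.example.Main --master "local[*]" app.jar

# Standalone mode: point at a Spark standalone master instead of YARN
spark-submit --class com.example.Main --master spark://<master-host>:7077 app.jar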
I have some examples in my repo with spark-submit both in local mode and on an HDInsight cluster.
https://github.com/NileshGule/learning-spark
You can refer to:
Local mode: https://github.com/NileshGule/learning-spark/blob/master/src/main/java/com/nileshgule/movielens/MovieLens.md
HDInsight Spark cluster: https://github.com/NileshGule/learning-spark/blob/master/Azure.md
