Databricks-connect Java connects to local instead of remote - apache-spark

I have a Java application that connects to an Apache Spark cluster and performs some operations. I'm trying to connect to a Databricks cluster on Azure using databricks-connect 7.3. If I run databricks-connect test from the terminal, everything works perfectly. I followed their documentation: I included the jars in IntelliJ, added spark.databricks.service.server.enabled true to the cluster configuration in Databricks, and used the following to create the SparkSession:
SparkSession spark = SparkSession
.builder()
.master("local")
.getOrCreate();
The problem is that this command connects to a local cluster that is instantiated at runtime, and does not connect to the remote Databricks cluster. Am I missing something?

Related

Not able to connect to Snowflake from EMR Cluster using Pyspark

I am trying to connect to Snowflake from an EMR cluster using pyspark.
I am using these two jars in spark-submit.
snowflake-jdbc-3.5.2.jar
spark-snowflake_2.11-2.7.0-spark_2.4.jar
But it is failing with a connection timeout error.
I have the correct proxy configured for the EMR cluster. From the same EC2 instance (the EMR master), I am able to connect to Snowflake using snowsql and the Python connector.
I am not sure why it times out for pyspark.
You can use our SnowCD tool to check the connectivity diagnostics. This is related to network issues.
https://docs.snowflake.com/en/user-guide/snowcd.html
Below are the commands I used when running from the EMR shell.
pyspark --packages net.snowflake:snowflake-jdbc:3.6.27,net.snowflake:spark-snowflake_2.12:2.4.14-spark_2.4
spark-submit --packages net.snowflake:snowflake-jdbc:3.8.0,net.snowflake:spark-snowflake_2.11:2.4.14-spark_2.4 SparkConnPythonWithCert.py
spark-shell --packages net.snowflake:snowflake-jdbc:3.8.0,net.snowflake:spark-snowflake_2.11:2.4.14-spark_2.4

Apache livy error - "Unable to connect to server" spark://[hostname]:7077

I'm trying to submit a Spark batch job using Apache Livy.
I am running my Spark cluster and the Livy service on OpenShift.
The Spark cluster is in standalone mode, and Livy runs Spark in cluster deploy mode.
I've configured Livy to the Spark cluster master service:
livy.spark.master = "spark://spark.spark.svc:7077
When I try to run a Scala Spark job using the POST /batches method, I get the exception:
Unable to connect to server spark://spark.spark.svc:7077
Has anyone experienced this issue?

How to set apache spark config to run in cluster mode as a databricks job

I have developed an Apache Spark app, compiled it into a jar, and I want to run it as a Databricks job. So far I have been setting master=local to test. What should I set this property (or others) to in the Spark config so that it runs in cluster mode in Databricks? Note that I do not have a cluster created in Databricks; I only have a job that will run on demand, so I do not have the URL of the master node.
For a Databricks job, you do not need to set master to anything.
You only need to do the following:
val spark = SparkSession.builder().getOrCreate()

Submitting pyspark script to a remote Spark server?

This is probably a really silly question, but I can't find the answer with Google. I've written a simple pyspark ETL script that reads in a CSV and writes it to Parquet, something like this:
spark = SparkSession.builder.getOrCreate()
sqlContext = SQLContext(spark.sparkContext)
df = sqlContext.read.csv(input_filename)
df.write.parquet(output_path)
To run it, I start up a local Spark cluster in Docker:
$ docker run --network=host jupyter/pyspark-notebook
I run the Python script and it connects to this local Spark cluster and all works as expected.
Now I'd like to run the same script on a remote Spark cluster (AWS EMR). Can I just specify a remote IP address somewhere when initialising the Spark context? Or am I misunderstanding how Spark works?
You can create a spark session by specifying the IP address of the remote master.
spark = SparkSession.builder.master("spark://<ip>:<port>").getOrCreate()
On AWS EMR, standalone mode is not supported. You need to use yarn, in either client or cluster mode, and point HADOOP_CONF_DIR to a directory on your local machine that contains a copy of all the files from /etc/hadoop/conf on the cluster. Then set up dynamic port forwarding to connect to the EMR cluster. A session created directly from the script has to use client deploy mode, because cluster deploy mode is only available through spark-submit:
spark = SparkSession.builder.master('yarn').config('spark.submit.deployMode', 'client').getOrCreate()
See https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/
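For the spark-submit route, here is a minimal sketch of how the command line and environment could be assembled from Python. The helper name build_yarn_submit, the script name etl.py and the directory /opt/emr-conf are illustrative assumptions, not part of Spark or EMR; only the --master and --deploy-mode flags and the HADOOP_CONF_DIR variable come from Spark itself.

```python
import os


def build_yarn_submit(script, hadoop_conf_dir, deploy_mode="client"):
    """Return (argv, env) for a spark-submit run against YARN.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")

    # spark-submit locates the YARN ResourceManager through the Hadoop
    # config files found in HADOOP_CONF_DIR, not through a master IP.
    env = dict(os.environ)
    env["HADOOP_CONF_DIR"] = hadoop_conf_dir

    argv = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        script,
    ]
    return argv, env


# Assumed example values; replace with your own script and config directory.
argv, env = build_yarn_submit("etl.py", "/opt/emr-conf", deploy_mode="cluster")
print(" ".join(argv))
```

The resulting pair can be handed to subprocess.run(argv, env=env, check=True) on a machine where spark-submit is installed and the cluster is reachable.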

what to specify as spark master when running on amazon emr

EMR supports Spark natively. When using the EMR web interface to create a new cluster, it is possible to add a custom step that executes a Spark application when the cluster starts, which is basically an automated spark-submit after cluster startup.
I've been wondering how to specify the master node to the SparkConf within the application, when starting the EMR cluster and submitting the jar file through the designated EMR step?
It is not possible to know the IP of the cluster master beforehand, as would be the case if I started the cluster manually and then used the information to build into my application before calling spark-submit.
Code snippet:
SparkConf conf = new SparkConf().setAppName("myApp").setMaster("spark://???:7077");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
Note that I am asking about the "cluster" execution mode, so the driver program runs on the cluster as well.
Short answer: don't.
Longer answer: A master URL like "spark://..." is for Spark Standalone, but EMR uses Spark on YARN, so the master URL should be just "yarn". This is already configured for you in spark-defaults.conf, so when you run spark-submit, you don't even have to include "--master ...".
However, since you are asking about cluster execution mode (actually, it's called "deploy mode"), you may specify either "--master yarn-cluster" (deprecated) or "--deploy-mode cluster" (preferred). This will make the Spark driver run on one of the cluster nodes rather than on the EMR master.
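To make the distinction between master URLs concrete, here is a small sketch that maps a master URL to the cluster manager it selects. The function cluster_manager is a hypothetical helper written for illustration and is not part of the Spark API; the URL prefixes it checks are the ones Spark documents for its master setting.

```python
def cluster_manager(master):
    """Map a Spark master URL to the cluster manager it selects.

    Hypothetical helper for illustration only; not a Spark API.
    """
    if master == "local" or master.startswith("local["):
        return "local"          # driver and executors in one local JVM
    if master.startswith("spark://"):
        return "standalone"     # Spark Standalone master, e.g. spark://host:7077
    if master == "yarn":
        return "yarn"           # resolved via HADOOP_CONF_DIR, no host in the URL
    if master.startswith("k8s://"):
        return "kubernetes"
    if master.startswith("mesos://"):
        return "mesos"
    raise ValueError(f"unrecognised master URL: {master!r}")


print(cluster_manager("spark://spark.spark.svc:7077"))  # standalone
print(cluster_manager("yarn"))                          # yarn
print(cluster_manager("local"))                         # local
```

Because "yarn" carries no host name, there is nothing cluster-specific to hard-code in the application, which is why the master can be left out of SparkConf on EMR.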

Resources