what to specify as spark master when running on amazon emr - apache-spark

Spark is natively supported on EMR. When using the EMR web interface to create a new cluster, it is possible to add a custom step that executes a Spark application when the cluster starts, essentially an automated spark-submit after cluster startup.
How do I specify the master node in the SparkConf within the application, when starting the EMR cluster and submitting the jar file through the designated EMR step?
It is not possible to know the IP of the cluster master beforehand, as would be the case if I started the cluster manually and then built that information into my application before calling spark-submit.
Code snippet:
SparkConf conf = new SparkConf().setAppName("myApp").setMaster("spark://???:7077");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
Note that I am asking about the "cluster" execution mode, so the driver program runs on the cluster as well.

Short answer: don't.
Longer answer: A master URL like "spark://..." is for Spark Standalone, but EMR uses Spark on YARN, so the master URL should be just "yarn". This is already configured for you in spark-defaults.conf, so when you run spark-submit, you don't even have to include "--master ...".
However, since you are asking about cluster execution mode (strictly speaking, it is called "deploy mode"), you may specify either "--master yarn-cluster" (deprecated) or "--deploy-mode cluster" (preferred). This will make the Spark driver run on a random cluster node rather than on the EMR master node.
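As a sketch of what the EMR step boils down to (the bucket, jar path, and class name here are placeholders):

```sh
# Hypothetical spark-submit as executed by an EMR step; --master yarn is
# already set in spark-defaults.conf on EMR, so only the deploy mode is
# needed for the driver to run on a cluster node.
spark-submit \
  --deploy-mode cluster \
  --class com.example.MyApp \
  s3://my-bucket/myApp.jar
```

In the application itself, setMaster(...) can then be dropped entirely, and the value from spark-defaults.conf takes effect.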

Related

How to set apache spark config to run in cluster mode as a databricks job

I have developed an Apache Spark app, compiled it into a jar, and I want to run it as a Databricks job. So far I have been setting master=local to test. What should I set this property (or others in the Spark config) to so that it runs in cluster mode in Databricks? Note that I do not have a cluster created in Databricks, I only have a job that will run on demand, so I do not have the URL of the master node.
For a Databricks job, you do not need to set the master to anything.
You only need to do the following:
val spark = SparkSession.builder().getOrCreate()

Possible to add extra jars to master/worker nodes AFTER spark submit at runtime?

I'm writing a service that runs as a long-running Spark application launched via spark-submit. The service won't know which jars to put on the classpaths at the time of the initial spark-submit, so I can't include them using --jars. The service will then listen for requests that can include extra jars, which I then want to load onto my Spark nodes so work can be done using those jars.
My goal is to call spark submit only once, being at the very beginning to launch my service. Then I'm trying to add jars from requests to the spark session by creating a new SparkConf and building a new SparkSession out of it, something like
SparkConf conf = new SparkConf();
conf.set("spark.driver.extraClassPath", "someClassPath");
conf.set("spark.executor.extraClassPath", "someClassPath");
SparkSession.builder().config(conf).getOrCreate();
I tried this approach, but it looks like the jars aren't getting loaded onto the executor classpaths, as my jobs don't recognize the UDFs from the jars. I'm running this in Spark client mode right now.
Is there a way to add these jars AFTER a spark-submit has been called and just update the existing Spark application, or is it only possible with another spark-submit that includes these jars using --jars?
Would using cluster mode vs client mode matter in this kind of situation?

Airflow and Spark/Hadoop - Unique cluster or one for Airflow and other for Spark/Hadoop

I'm trying to figure out which is the best way to work with Airflow and Spark/Hadoop.
I already have a Spark/Hadoop cluster, and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to the Spark/Hadoop cluster.
Any advice about it? It looks like it's a little complicated to deploy Spark remotely from another cluster, and it will create some configuration file duplication.
You really only need to configure a yarn-site.xml file, I believe, for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver managed by Airflow isn't a bad idea.)
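As an illustrative fragment (the hostname is a placeholder), the client-side yarn-site.xml mainly needs to point at the ResourceManager:

```xml
<!-- Minimal client-side yarn-site.xml sketch; yarn-rm.example.com is a
     placeholder for the cluster's ResourceManager host. -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>yarn-rm.example.com</value>
  </property>
</configuration>
```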
Once an Application Master is deployed within YARN, then Spark is running locally to the Hadoop cluster.
If you really want, you could also ship an hdfs-site.xml and hive-site.xml from Airflow (if that's possible), but otherwise at least the hdfs-site.xml file should be picked up from the YARN container classpath (not all NodeManagers may have a Hive client installed on them).
I prefer submitting Spark jobs using the SSHOperator and running the spark-submit command there, which saves you from copying yarn-site.xml around. Also, I would not create a separate cluster for Airflow if the only task you perform is running Spark jobs; a single VM with the LocalExecutor should be fine.
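The SSHOperator approach boils down to running something like the following against an edge node of the Hadoop cluster (the host, user, jar path, and class name are placeholders):

```sh
# Hypothetical command an Airflow SSHOperator would run on the cluster's
# edge node, where yarn-site.xml already exists, so nothing needs copying.
ssh hadoop@edge-node.example.com \
  spark-submit --master yarn --deploy-mode cluster \
    --class com.example.MyJob /opt/jobs/my-job.jar
```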
There are a variety of options for remotely performing spark-submit via Airflow.
EMR Steps
Apache Livy (see this for a hint)
SSH
Do note that none of these are plug-and-play ready, and you'll have to write your own operators to get things done.

Mesos Configuration with existing Apache Spark standalone cluster

I am a beginner with Apache Spark!
I have set up a Spark standalone cluster using 4 PCs.
I want to use Mesos with my existing Spark standalone cluster. But I have read that I need to install Mesos first and then configure Spark.
I have also seen the Spark documentation on running with Mesos, but it was not helpful for me.
So how do I configure Mesos with an existing Spark standalone cluster?
Mesos is an alternative cluster manager to the standalone Spark manager. You don't use it with Spark Standalone; you use it instead of it.
To create a Mesos cluster, follow https://mesos.apache.org/gettingstarted/
Make sure the Mesos native library is available on the machine you use to submit jobs.
For cluster mode, start the Mesos dispatcher (sbin/start-mesos-dispatcher.sh).
Submit the application using the Mesos master URI (client mode) or the dispatcher URI (cluster mode).
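To sketch that last step (host names are placeholders; the ports shown are the Mesos master and dispatcher defaults, yours may differ):

```sh
# Client mode: point spark-submit at the Mesos master (default port 5050).
spark-submit --master mesos://mesos-master.example.com:5050 ...

# Cluster mode: point it at the MesosClusterDispatcher (default port 7077).
spark-submit --master mesos://dispatcher.example.com:7077 \
  --deploy-mode cluster ...
```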

What is the entry point of Spark container in YARN cluster mode?

What is the main entry point of a Spark executor when a Spark job is being run in Yarn cluster mode (for Spark 1.2.0+)?
What I am looking for is the Scala class name for the entry point of an Executor (which will be the process executing one of the tasks on a slave machine).
I think what you're asking about is org.apache.spark.executor.Executor or perhaps org.apache.spark.executor.Executor$TaskRunner. It is TaskRunner that will ultimately run a task.
This holds regardless of the deploy mode (client vs cluster) and the cluster manager used, i.e. Hadoop YARN, Spark Standalone, or Apache Mesos.
spark-submit --class [FULLY QUALIFIED CLASS NAME] \
  --master yarn-cluster \
  [JAR_TO_USE]
So, given the above, the class to be used is the one specified with --class, which is loaded from the given jar, and Spark searches within that class for a static main method.
From SparkSubmit.scala:
val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
