Purpose of SparkConf and sparkContext - apache-spark

What’s a purpose of sparkContext and sparkConf ? Looking for detailed difference.
More than below definition:
Spark Context was the entry point of any spark application and used to access all spark features and needed a sparkConf which had all the cluster configs and parameters to create a Spark Context object.

The first step of any Spark driver application is to create a SparkContext. The SparkContext allows your Spark driver application to access the cluster through a resource manager. The resource manager can be YARN, or Spark's cluster manager. In order to create a SparkContext you should first create a SparkConf. The SparkConf stores configuration parameters that your Spark driver application will pass to SparkContext. Some of these parameters define properties of your Spark driver application and some are used by Spark to allocate resources on the cluster. Such as, the number, memory size and cores uses by the executors running on the workernodes. setAppName() gives your Spark driver application a name so you can identify it in the Spark or Yarn UI.
SparkConf is passed into SparkContext so our driver application knows how to access the cluster.
Now that your Spark driver application has a SparkContext it knows what resource manager to use and can ask it for resources on the cluster. If you are using YARN, Hadoop's resourcemanager (headnode) and nodemanager (workernode) will work to allocate a container for the executors. If the resources are available on the cluster the executors will allocate memory and cores based your configuration parameters. If you are using Sparks cluster manager, the SparkMaster (headnode) and SparkSlave (workernode) will be used to allocate the executors.
Each Spark driver application has its own executors on the cluster which remain running as long as the Spark driver application has a SparkContext. The executors run user code, run computations and can cache data for your application. The SparkContext will create a job that is broken into stages. The stages are broken into tasks which are scheduled by the SparkContext on an executor.

Related

Does executors in spark nd application master in yarn do the same job?

Does executors in spark nd application master in yarn do the same
In Spark, there is a Driver and Executors. I'm not gonna go into detail of what driver and executors are but in a one-liner, the driver manages the job flow and schedules tasks, and Executors are worker nodes processes in charge of running individual tasks.
YARN is basically a resource manager which allocates memory to compute engines. Now, this compute engine can be Spark/Tez/Map-reduce. What you need to understand here is when YARN successfully allocates memory they are called containers.
Now when Spark Job is deployed in YARN, Assuming that YARN has sufficient memory for the spark job to run, Yarn first allocates resources as containers for Spark Application Master which will have the driver program (in case of cluster mode). This Application Master will further requests resources for spark executors which YARN will further allocate as containers. So spark job will have multiple containers, one for the driver program and n containers for n executors. So you see in the computing sense the fundamental difference between spark running in a spark cluster and spark running in YARN is the use of containers.
So executors and application master in YARN run inside containers and do the same thing as spark on spark clusters.

What if driver in spark job fails?

I am exploring spark job recovery mechanism and I have a queries related to it,
How spark recovers from driver node failure
recovery from executor node failures
what are the ways to handle such scenarios ?
Driver node Failure:: If driver node which is running our spark Application is down, then Spark Session details will be lost and all the executors with their in-memory data will get lost. If we restart our application, getorCreate() method will reinitialize spark sesssion from the checkpoint directory and resume processing.
On most cluster managers, Spark does not automatically relaunch the driver if it crashes, so we need to monitor it using a tool like monit and restart it. The best way to do this is probably specific to environment. One place where Spark provides more support is the Standalone cluster manager, which supports a --supervise flag when submitting driver that lets Spark restart it. We will also need to pass --deploy-mode cluster to make the driver run within the cluster and not on your local machine like:
./bin/spark-submit --deploy-mode cluster --supervise --master spark://... App.jar
Imp Point: When the driver crashes, executors in Spark will also restart.
Executor Node Failure: Any of the worker nodes running executor can fail, thus resulting in loss of in-memory.
For failure of a executor node, Spark uses the same techniques as Spark for its fault tolerance. All the data received from external sources is replicated among the worker nodes. All RDDs created through transformations of this replicated input data are tolerant to failure of a worker node, as the RDD lineage allows the system to recompute the lost data all the way from the surviving replica of the input data.
I hope I covered third question in the above points itself

Spark session/context lifecycle

I don't understand how the spark session/context lifecycle works. The documentation says that you can have multiple SparkSessions that share an underlying SparkContext. But how/when are those created and destroyed? For example, if I have a production cluster and I spark-submit 10 ETLs, will these 10 jobs share the same SparkContext? Does it matter if I do this in cluster/client mode?
To the best of my understanding, the SparkContext lives in the driver so I assume the above would result in one SparkContext shared by 10 SparkSessions, but I'm not at all sure I got this correctly... Any clarification will be much appreciated.
Let's understand sparkSession and sparkContext
SparkContext is a channel to access all Spark functionality.The Spark driver program uses it to connect to the cluster manager to communicate, submit Spark jobs and knows what resource manager (YARN) to communicate to.And through SparkContext, the driver can access other contexts such as SQLContext, HiveContext, and StreamingContext to program Spark.
With Spark 2.0, SparkSession can access all aforementioned Spark’s functionality through a single-unified point of entry.
It means SparkSession Encapsulates SparkContext.
Let’s say we have multiple users accessing the same notebook which had shared sparkContext and the requirement was to have an isolated environment sharing the same spark context. Prior to 2.0, the solution to this was to create multiple sparkContexts ie sparkContext per isolated environment or users and is an expensive operation(a single sparkContext exists per JVM). But with the introduction of the spark session, this issue has been addressed.
I spark-submit 10 ETLs, will these 10 jobs share the same
SparkContext? Does it matter if I do this in cluster/client mode? To
the best of my understanding, the SparkContext lives in the driver so
I assume the above would result in one SparkContext shared by 10
SparkSessions,
If you submit 10 ETL spark-submit jobs whether it is cluster/client they all are different application and they have their own sparkContext and sparkSession.In native spark you can't share objects between different application but if you want share objects you have to use share contexts(spark-jobserver).Multiple option are available like Apache Ivy, apache-ignite
There is one SparkContext per spark application.
The documentation says that you can have multiple SparkSessions that share an underlying SparkContext. But how/when are those created and destroyed?
private val conf: SparkConf = new SparkConf()
.setMaster(master)
.setAppName(appName)
.set("spark.ui.enabled", "false")
val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
If you have an existing spark session and want to create new one, use the newSession method on the existing SparkSession.
import org.apache.spark.sql.{SQLContext, SparkSession}
val newSparkSession1 = spark.newSession()
val newSparkSession2 = spark.newSession()
The newSession method creates a new spark session with isolated SQL configurations, temporary tables.The new session will share the underlying SparkContext and cached data.
Then you can use these different sessions to submit different jobs/sql queries.
newSparkSession1.sql("<ETL-1>")
newSparkSession2.sql("<ETL-2>")
Does it matter if I do this in cluster/client mode?
Client/Cluster mode doesn't matter.

Role of master in Spark standalone cluster

In a Spark standalone cluster, what is exactly the role of the master (a node started with start_master.sh script)?
I understand that is the node that receives the jobs from the submit-job.sh script, but what is its role when processing a job?
I'm seeing in the web UI that always delivers the job to a slave (a node started with start_slave.sh) and is not participating from processing, Am I right? In that case, should I also run also the script start_slave.sh in the same machine than master to to take advantage of its resources (cpu and memory)?
Thanks in advance.
Spark runs in the following cluster modes:
Local
Standalone
Mesos
Yarn
The above are cluster modes which offer resources to Spark Applications
Spark standalone mode is master slave architecture, we have Spark Master and Spark Workers. Spark Master runs in one of the cluster nodes and Spark Workers run on the Slave nodes of the cluster.
Spark Master (often written standalone Master) is the resource manager
for the Spark Standalone cluster to allocate the resources (CPU, Memory, Disk etc...) among the
Spark applications. The resources are used to run the Spark Driver and Executors.
Spark Workers report to Spark Master about resources information on the Slave nodes.
[apache-spark]
Spark standalone comes with its own resource manager. Think about Spark Master/Worker as YARN ResourceManager/NodeManager.

what to specify as spark master when running on amazon emr

Spark has native support by EMR. When using the EMR web interface to create a new cluster, it is possible to add a custom step that would execute a Spark application when the cluster starts, basically an automated spark-submit after cluster startup.
I've been wondering how to specify the master node to the SparkConf within the application, when starting the EMR cluster and submitting the jar file through the designated EMR step?
It is not possible to know the IP of the cluster master beforehand, as would be the case if I started the cluster manually and then used the information to build into my application before calling spark-submit.
Code snippet:
SparkConf conf = new SparkConf().setAppName("myApp").setMaster("spark:\\???:7077");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
Note that I am asking about the "cluster" execution mode, so the driver program runs on the cluster as well.
Short answer: don't.
Longer answer: A master URL like "spark://..." is for Spark Standalone, but EMR uses Spark on YARN, so the master URL should be just "yarn". This is already configured for you in spark-defaults.conf, so when you run spark-submit, you don't even have to include "--master ...".
However, since you are asking about cluster execution mode (actually, it's called "deploy mode"), you may specify either "--master yarn-cluster" (deprecated) or "--deploy-mode cluster" (preferred). This will make the Spark driver run on a random cluster mode rather than on the EMR master.

Resources