What if driver in spark job fails? - apache-spark

I am exploring spark job recovery mechanism and I have a queries related to it,
How spark recovers from driver node failure
recovery from executor node failures
what are the ways to handle such scenarios ?

Driver node Failure:: If driver node which is running our spark Application is down, then Spark Session details will be lost and all the executors with their in-memory data will get lost. If we restart our application, getorCreate() method will reinitialize spark sesssion from the checkpoint directory and resume processing.
On most cluster managers, Spark does not automatically relaunch the driver if it crashes, so we need to monitor it using a tool like monit and restart it. The best way to do this is probably specific to environment. One place where Spark provides more support is the Standalone cluster manager, which supports a --supervise flag when submitting driver that lets Spark restart it. We will also need to pass --deploy-mode cluster to make the driver run within the cluster and not on your local machine like:
./bin/spark-submit --deploy-mode cluster --supervise --master spark://... App.jar
Imp Point: When the driver crashes, executors in Spark will also restart.
Executor Node Failure: Any of the worker nodes running executor can fail, thus resulting in loss of in-memory.
For failure of a executor node, Spark uses the same techniques as Spark for its fault tolerance. All the data received from external sources is replicated among the worker nodes. All RDDs created through transformations of this replicated input data are tolerant to failure of a worker node, as the RDD lineage allows the system to recompute the lost data all the way from the surviving replica of the input data.
I hope I covered third question in the above points itself

Related

Spark client memory configuration

I'm trying to run multiple spark clients on Airflow(ETL scheduler).
I'm running in cluster mode on YARN, therefore ApplicationMaster Executor and Driver are all running on executor in Yarn context.
However, my spark client which sample the process and monitor the state is running in airflow worker.
The problem is that the Spark client take lot's of memory ~500 MB per job. It may sound as not much in terms of executors or drivers but for the role of spark client it sounds crazy.
My question is, how can I configure/manipulate spark client memory/cpu requirements can I limit it's intervals ? can I limit it's memory with flags?
So in spark code it make a distinction if it's running in standalone mode or cluster mode. For standalone it set a default of -Xmx 1G and in cluster mode it doesn't have default but it trying to read java options from environment variable called SPARK_SUBMIT_OPTS.
So if you wanna set any java opts for the client java process only use SPARK_SUBMIT_OPTS

Spark Standalone vs YARN

What features of YARN make it better than Spark Standalone mode for multi-tenant cluster running only Spark applications? Maybe besides authentication.
There are a lot of answers at Google, pretty much of them sounds wrong to me, so I'm not sure where is the truth.
For example:
DZone, Deep Dive Into Spark Cluster Management
Standalone is good for small Spark clusters, but it is not good for
bigger clusters (there is an overhead of running Spark daemons —
master + slave — in cluster nodes)
But other cluster managers also require running agents on cluster nodes. I.e. YARN's slaves are called node managers. They may consume even more memory than Spark's slaves (Spark default is 1 GB).
This answer
The Spark standalone mode requires each application to run an executor
on every node in the cluster; whereas with YARN, you choose the number
of executors to use
agains Spark Standalone # executor/cores control, that shows how you can specify number of consumed resources at Standalone mode.
Spark Standalone Mode documentation
The standalone cluster mode currently only supports a simple FIFO
scheduler across applications.
Against the fact Standalone mode can use Dynamic Allocation, and you can specify spark.dynamicAllocation.minExecutors & spark.dynamicAllocation.maxExecutors. Also I haven't found a note about Standalone doesn't support FairScheduler.
This answer
YARN directly handles rack and machine locality
How does YARN may know anything about data locality in my job? Suppose, I'm storing file locations at AWS Glue (used by EMR as Hive metastore). Inside Spark job I'm querying some-db.some-table. How YARN may know what executor is better for job assignment?
UPD: found another mention about YARN and data locality https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-data-locality.html. Still doesn't matter in case of S3 for example.

Kafka Spark Streaming

I was trying to build Kafka and spark streaming use case. In that, Spark Streaming is consuming streaming from Kafka. And we are enhancing stream and storing enhanced stream into some target system.
My question here is that does it make sense to run spark streaming job in yarn-cluster or yarn-client mode? (Hadoop is not involved here)
What I think Spark streaming job should run only local mode but another question is how to improve the performance of spark streaming job.
Thanks,
local[*]
This is specific to run the job in local mode
Usually we use this to perform POC's and on a very small data.
You can debug the job to understand how each line of code is working.
But, you need to be aware that since the job is running in your local you cannot get the most out of sparks distributed architecture.
yarn-client
your driver program is running on the yarn client where you type the command to submit the spark application . But, the tasks are still executed on the Executors.
yarn-cluster
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. This is the finest way of running the spark job to be benefited by the advantages provided by a cluster manager
I hope this gives you a clarity on how you may want to deploy your spark job.
Infact, Spark provides you a very clean documentation explaining various deployment strategies with examples.
https://spark.apache.org/docs/latest/running-on-yarn.html
the difference will be with yarn-client, you will force the spark job to choose the host where you run spark-submit as the driver , because in yarn-cluster , the choice won't be the same host everytime you run it
so the best choice is to always choose yarn-cluster to avoide overloading the same host if you are going to submit multi job in the same host with yarn-client

can someone let me know how to decide --executor memory and --num-of-executors in spark submit job . What is the concept of -number-of-cores

How to decide the --executor memory and --num-of-executors in spark submit job . What is the concept of -number-of-cores.
Also the clear difference between cluster and client deploy mode. How to choose the deploy mode
The first part of your question where you ask about --executor-memory, --num-executors and --num-executor-cores usually depends on the variety of task your Spark application is going to perform.
Executor Memory indicates the amount of physical memory you want to allocate to the JVM that runs the executor. The value will depend on your requirement. For example, if you're just going to parse a large text file you'll require much less memory than what you need for, say, Image Processing.
The number of executors variable is the number of Executor JVMs you want to spawn on your cluster. Again, it depends on a lot of factors like your cluster size, type of machines in the cluster etc.
Each executor splits the code and performs the instructions in tasks. These tasks are performed in executor cores (or processors). This helps you to achieve parallelism within a certain executor but make sure you don't allocate all the cores of a machine to its executor because some are needed for normal functioning of it.
On to your second part of the question, we have two --deploy-mode in Spark that you have already named i.e. cluster and client.
client mode is when you connect an external machine to a cluster and you run a spark job from that external machine. Like when you connect your laptop to a cluster and run spark-shell from it. The driver JVM is invoked in your laptop and the session is killed as soon as you disconnect your laptop. Similar is the case for a spark-submit job, if you run a job with --deploy-mode client, your laptop acts like the master but the job is killed as soon as it is disconnected (not sure about this one).
cluster mode: When you specify --deploy-mode cluster in your job then even if you run it using your laptop or any other machine, the job (JAR) is taken care of by the ResourceManager and ApplicationMaster, just like any other application in YARN. You won't be able to see the output on your screen but anyway most complex Spark jobs write to a FS so that's taken care of that way.

running multiple Spark jobs on a Mesos cluster

I would like to run multiple spark jobs on my Mesos cluster, and have all spark jobs share the same spark framework. Is this possible?
I have tried running the MesosClusterDispatcher and have the spark jobs connect to the dispatcher, but each spark job launches its own "Spark Framework" (I have tried running both client-mode and cluster-mode).
Is this the expected behaviour?
Is it possible to share the same spark-framework among multiple spark jobs?
It is normal and it's the expected behaviour.
In Mesos as far as I know, SparkDispatcher is in charge of allocate resources for your Spark Driver which will act as a framework. Once Spark driver has been allocated, it is responsible for talk to Mesos and accept offers to allocate the executors where tasks will be executed.

Resources