Spark: executor.CoarseGrainedExecutorBackend: Driver Disassociated - apache-spark

I am learning how to use Spark and I have a simple program. When I run the jar file it gives me the right result, but there are some errors in the stderr file, like this:
15/05/18 18:19:52 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@localhost:51976] -> [akka.tcp://sparkDriver@172.31.34.148:60060] disassociated! Shutting down.
15/05/18 18:19:52 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@172.31.34.148:60060] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
You can see the whole stderr file here:
http://172.31.34.148:8081/logPage/?appId=app-20150518181945-0026&executorId=0&logType=stderr
I searched for this problem and found this:
Why spark application fail with "executor.CoarseGrainedExecutorBackend: Driver Disassociated"?
I turned up spark.yarn.executor.memoryOverhead as it suggested, but it didn't work.
I just have one master node (8G memory), and in Spark's slaves file there is only one slave node: the master itself. I submit like this:
./bin/spark-submit --class .... --master spark://master:7077 --executor-memory 6G --total-executor-cores 8 /path/..jar hdfs://myfile
I don't know what the executor is and what the driver is... lol... sorry about that.
Can anybody help me?

If the Spark driver fails, it gets disassociated (from the YARN ApplicationMaster). Try the following to make it more fault-tolerant:
spark-submit with the --supervise flag on a Spark Standalone cluster
yarn-cluster mode on YARN
the spark.yarn.driver.memoryOverhead parameter for increasing the driver's memory allocation on YARN
Note: Driver supervision (spark.driver.supervise) is not supported on a YARN cluster (yet).
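For illustration, hedged sketches of both submission styles (the class name and jar path below are placeholders, not from the question; the overhead value is in MB and arbitrary):
./bin/spark-submit --master spark://master:7077 --deploy-mode cluster --supervise --class com.example.MyApp /path/to/app.jar
./bin/spark-submit --master yarn-cluster --conf spark.yarn.driver.memoryOverhead=1024 --class com.example.MyApp /path/to/app.jar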

An overview of driver vs. executor (and others) can be found at http://spark.apache.org/docs/latest/cluster-overview.html or https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html
They are Java processes that can run on different machines or on the same machine, depending on your configuration. The driver contains the SparkContext and declares the RDD transformations (and, if I'm not mistaken, the execution plan), then communicates that to the Spark master, which creates task definitions and asks the cluster manager (its own, YARN, Mesos) for resources (worker nodes); those tasks in turn get sent to executors (for execution).
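For intuition, a minimal Scala driver sketch (the input path is the question's own placeholder): the SparkContext and the RDD declarations live in the driver JVM, while the functions passed to the RDD operations are shipped to the executors when an action runs.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)              // created in the driver JVM

    val counts = sc.textFile("hdfs://myfile")    // placeholder input, as in the question
      .flatMap(_.split("\\s+"))                  // transformations only build lineage
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)             // action: tasks are shipped to executors
    sc.stop()
  }
}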
Executors communicate certain information back to the master, and as far as I understand, if the driver encounters a problem or crashes, the master takes note and tells the executor (which in turn logs) what you see: "driver is disassociated". This could happen for a lot of reasons, but the most common one is that the Java process (the driver) runs out of memory (try increasing spark.driver.memory).
There are some differences when running on YARN vs. Standalone vs. Mesos, but I hope this helps. If the driver is disassociated, the Java process running as the driver likely encountered an error; the master logs might have something, and I'm not sure if there are driver-specific logs. Hopefully someone more knowledgeable than me can provide more info.
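If you want to try the spark.driver.memory suggestion above, a hedged example (the 4G value is arbitrary; class name and jar path are placeholders); the same setting can also go into spark-defaults.conf as spark.driver.memory 4g:
./bin/spark-submit --master spark://master:7077 --driver-memory 4G --class com.example.MyApp /path/to/app.jar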

Related

Spark driver failure recovery

I want to know how Spark restarts the driver in case of failure. My understanding is that since the driver node has failed, all the computations will be lost, so the restart will mean re-submitting the application. I want to know how the driver program is restarted with YARN as the resource manager; I know that Mesos has a standalone driver node, and in standalone mode we have the --supervise flag, but I'm not too sure about YARN. Any explanation will help. The answer I need is with respect to a non-streaming application. Sorry for the big question.

What if the driver in a Spark job fails?

I am exploring the Spark job recovery mechanism and I have a few queries related to it:
How does Spark recover from driver node failure?
How does it recover from executor node failures?
What are the ways to handle such scenarios?
Driver node failure: If the driver node that is running our Spark application goes down, the SparkSession details will be lost, and all the executors, with their in-memory data, will be lost. If we restart our application, the getOrCreate() method will reinitialize the session from the checkpoint directory and resume processing.
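The getOrCreate()-from-checkpoint recovery referred to here is the pattern Spark Streaming provides; a minimal Scala sketch, assuming a hypothetical checkpoint directory and batch interval:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableApp {
  val checkpointDir = "hdfs:///checkpoints/myApp"   // hypothetical path

  // Called only when no checkpoint exists yet; on restart, getOrCreate
  // rebuilds the context (and its DStream lineage) from the checkpoint.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("RecoverableApp")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // define input streams and transformations here
    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}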
On most cluster managers, Spark does not automatically relaunch the driver if it crashes, so we need to monitor it using a tool like monit and restart it. The best way to do this is probably specific to your environment. One place where Spark provides more support is the Standalone cluster manager, which supports a --supervise flag when submitting the driver that lets Spark restart it. We will also need to pass --deploy-mode cluster to make the driver run within the cluster and not on your local machine, like:
./bin/spark-submit --deploy-mode cluster --supervise --master spark://... App.jar
Important point: when the driver crashes, executors in Spark will also restart.
Executor node failure: Any of the worker nodes running an executor can fail, resulting in a loss of in-memory data.
For failure of an executor node, Spark relies on its usual fault-tolerance techniques. All the data received from external sources is replicated among the worker nodes. All RDDs created through transformations of this replicated input data are tolerant to the failure of a worker node, as the RDD lineage allows the system to recompute the lost data all the way from the surviving replica of the input data.
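A minimal Scala sketch of this idea, assuming a hypothetical socket source (host and port are placeholders); the replicated storage level is what leaves a surviving copy to recompute from:
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("LineageDemo"), Seconds(10))

    // Received data is stored with a replicated storage level (two copies),
    // so losing one worker node does not lose the input blocks.
    val lines = ssc.socketTextStream("source-host", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)

    // Transformations only record lineage; if a worker dies, lost partitions
    // are recomputed from the surviving replica of the input data.
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}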
I hope the third question is covered by the points above.

Spark Standalone cluster, memory per executor issue

Hi, I am launching my Spark application with the spark-submit script as such:
spark-submit --master spark://Maatari-xxxxxxx.local:7077 --class EstimatorApp /Users/sul.maatari/IdeaProjects/Workshit/target/scala-2.11/Workshit-assembly-1.0.jar --deploy-mode cluster --executor-memory 15G num-executors 2
I have a Spark standalone cluster deployed on two nodes (my 2 laptops). The cluster is running fine. By default it sets 15G for the workers and 8 cores for the executors. Now I am experiencing the following strange behavior: although I am explicitly setting the memory, and this can also be seen in the environment variables of the SparkConf UI, in the cluster UI it says that my application is limited to 1024MB of executor memory. This makes me think of the default 1G parameter. I wonder why that is.
My application indeed fails because of the memory issue. I know that I need a lot of memory for that application.
One last point of confusion is the driver program. Given that I am in cluster mode, why does spark-submit not return immediately? I thought that since the driver is executed on the cluster, the client (i.e. the submitting application) should return immediately. This further suggests to me that something is not right with my conf and how things are being executed.
Can anyone help diagnose that?
Two possibilities:
Given that your command line has --num-executors mis-specified, it may be that Spark "gives up" on the other settings as well.
How much memory does your laptop have? Most of us use Macs, and then you would not be able to run it with more than about 8GB in my experience.
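One more thing worth checking (this is my assumption, not something confirmed in the question): spark-submit treats everything after the application jar as arguments to the application itself, so the --deploy-mode, --executor-memory and num-executors values in the posted command may never reach Spark, which would explain the 1024MB default. A reordered version keeping the asker's own values would look like:
spark-submit --master spark://Maatari-xxxxxxx.local:7077 --deploy-mode cluster --executor-memory 15G --num-executors 2 --class EstimatorApp /Users/sul.maatari/IdeaProjects/Workshit/target/scala-2.11/Workshit-assembly-1.0.jar
(As far as I know, --num-executors only applies on YARN; on a standalone cluster the usual knob is --total-executor-cores.)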

Can someone let me know how to decide --executor-memory and --num-executors in a spark-submit job? What is the concept of number-of-cores?

How do you decide the --executor-memory and --num-executors in a spark-submit job? What is the concept of number-of-cores?
Also, what is the clear difference between cluster and client deploy mode? How do you choose the deploy mode?
The first part of your question, where you ask about --executor-memory, --num-executors and --executor-cores, usually depends on the variety of tasks your Spark application is going to perform.
Executor Memory indicates the amount of physical memory you want to allocate to the JVM that runs the executor. The value will depend on your requirement. For example, if you're just going to parse a large text file you'll require much less memory than what you need for, say, Image Processing.
The number of executors variable is the number of Executor JVMs you want to spawn on your cluster. Again, it depends on a lot of factors like your cluster size, type of machines in the cluster etc.
Each executor executes its share of the work as tasks. These tasks are performed on the executor's cores (or processors). This helps you achieve parallelism within a given executor, but make sure you don't allocate all the cores of a machine to its executor, because some are needed for its normal functioning.
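As a hedged illustration of these three knobs together (all values are arbitrary and depend on your cluster; class name and jar path are placeholders):
./bin/spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-memory 8G --executor-cores 4 --class com.example.MyJob /path/to/job.jar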
On to the second part of your question: Spark has two --deploy-mode values, which you have already named, i.e. cluster and client.
client mode is when you connect an external machine to a cluster and run a Spark job from that external machine, like when you connect your laptop to a cluster and run spark-shell from it. The driver JVM is invoked on your laptop, and the session is killed as soon as you disconnect your laptop. The case is similar for a spark-submit job: if you run a job with --deploy-mode client, your laptop acts as the driver host, but the job is killed as soon as it is disconnected (not sure about this one).
cluster mode: when you specify --deploy-mode cluster for your job, then even if you run it from your laptop or any other machine, the job (JAR) is taken care of by the ResourceManager and ApplicationMaster, just like any other application in YARN. You won't be able to see the output on your screen, but most complex Spark jobs write to a filesystem anyway, so that's taken care of that way.
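To make the difference concrete, a hedged pair of commands (same placeholder job as above): only the flag changes, but in client mode the driver runs on the submitting machine, while in cluster mode it runs inside the cluster.
./bin/spark-submit --master yarn --deploy-mode client --class com.example.MyJob /path/to/job.jar
./bin/spark-submit --master yarn --deploy-mode cluster --class com.example.MyJob /path/to/job.jar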

spark-submit error: failed in initializing SparkContext for non-driver-program VMs

Cluster specifications: Apache Spark on top of Mesos with 5 VMs and HDFS as storage.
spark-env.sh
export SPARK_LOCAL_IP=192.168.xx.xxx #to set the IP address Spark binds to on this node
export MESOS_NATIVE_JAVA_LIBRARY="/home/xyz/tools/mesos-1.0.0/build/src/.libs/libmesos-1.0.0.so" #to point to your libmesos.so if you use Mesos
export SPARK_EXECUTOR_URI="hdfs://vm8:9000/spark-2.0.0-bin-hadoop2.7.tgz"
HADOOP_CONF_DIR="/usr/local/tools/hadoop" #To point Spark towards Hadoop configuration files
spark-defaults.conf
spark.executor.uri hdfs://vm8:9000/spark-2.0.0-bin-hadoop2.7.tgz
spark.driver.host 192.168.xx.xxx
spark.rpc netty
spark.rpc.numRetries 5
spark.ui.port 48888
spark.driver.port 48889
spark.port.maxRetries 32
I did some experiments with submitting a word-count Scala application in cluster mode, and I observed that it executes successfully only when it finds the driver program (containing the main method) on the VM from which it was submitted. As far as I know, scheduling of resources (VMs) is handled by Mesos. For example, if I submit my application from vm12 and coincidentally Mesos also schedules vm12 for executing the application, then it will execute successfully. In contrast, it will fail if the Mesos scheduler decides to allocate, let's say, vm15. I checked the logs in stderr of the Mesos UI and found this error:
16/09/27 11:15:49 ERROR SparkContext: Error initializing SparkContext.
Besides, I tried looking at the configuration aspects of Spark at the following link:
http://spark.apache.org/docs/latest/configuration.html
I tried setting the rpc options, as it seemed necessary to keep the driver program near the worker nodes on the LAN, but I couldn't get much insight.
I also tried uploading my code (application) to HDFS and submitting the application jar file from HDFS. I got the same result.
I connected Apache Spark with Mesos according to the documentation at the following link: http://spark.apache.org/docs/latest/running-on-mesos.html
I also tried configuring spark-defaults.conf and spark-env.sh on the other VMs in order to check whether it would run successfully from at least 2 VMs. That didn't work out either.
Am I missing some conceptual clarity here?
So how can I make my application run successfully regardless of which VM I submit it from?
