PySpark Glue Error: Remote RPC Client Disassociated - apache-spark

I have a PySpark job running in Glue which fails with the error "Remote RPC Client Disassociated. Likely due to containers exceeding thresholds, or network issues". This seems to be a common issue with Spark, and I have gone through a number of Stack Overflow posts and other forums. I tried tweaking several parameters in the Spark config, but nothing worked, hence posting here. I would appreciate any input. Below are a few details about my job. I have played with different values of the Spark config for RPC and memory: executor memory of 1g, 20g, 30g, and 64g; driver memory of 20g, 30g, and 64g. Please advise.
Glue version: 3.0
Spark version: 3.1
Spark Config:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("jkTestApp")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .config("spark.rpc.message.maxSize", "512")
    .config("spark.driver.memory", "64g")
    .config("spark.executor.memory", "1g")
    .config("spark.executor.memoryOverhead", "512")
    .config("spark.rpc.numRetries", "10")
    .getOrCreate()
)
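For what it's worth, the config above requests 64g for the driver but only 1g per executor with 512m of overhead, and the "containers exceeding thresholds" wording usually points at executors being killed for exceeding their memory allocation, so that imbalance may matter. Also, Glue generally fixes driver and executor sizing by worker type, so some of these keys may be silently ignored (worth verifying for your Glue version). A quick sketch to check what is actually in effect at runtime:

for key in ("spark.driver.memory", "spark.executor.memory", "spark.executor.memoryOverhead"):
    # "<not set>" means the key was ignored or never applied
    print(key, "=", spark.conf.get(key, "<not set>"))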

Related

What if the driver in a Spark job fails?

I am exploring Spark's job recovery mechanism and I have a few queries related to it:
How does Spark recover from a driver node failure?
How does it recover from executor node failures?
What are the ways to handle such scenarios?
Driver node failure: If the driver node running our Spark application goes down, the SparkSession details are lost, and all the executors, along with their in-memory data, are lost as well. If we restart the application, the getOrCreate() method will reinitialize the Spark session from the checkpoint directory and resume processing.
On most cluster managers, Spark does not automatically relaunch the driver if it crashes, so we need to monitor it with a tool like monit and restart it. The best way to do this is probably specific to your environment. One place where Spark provides more support is the Standalone cluster manager, which supports a --supervise flag when submitting the driver that lets Spark restart it. We also need to pass --deploy-mode cluster so the driver runs within the cluster and not on your local machine, like:
./bin/spark-submit --deploy-mode cluster --supervise --master spark://... App.jar
Important point: when the driver crashes, the executors in Spark are restarted as well.
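For completeness, the getOrCreate() recovery mentioned above is the Spark Streaming checkpoint pattern, where the context is rebuilt from the checkpoint directory on restart. A minimal PySpark sketch (the path, app name, and batch interval are illustrative):

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

checkpoint_dir = "hdfs:///checkpoints/my-app"  # illustrative path

def create_context():
    # Only called when no checkpoint exists yet; otherwise state is restored
    sc = SparkContext(conf=SparkConf().setAppName("RecoverableApp"))
    ssc = StreamingContext(sc, 10)  # 10-second batches
    ssc.checkpoint(checkpoint_dir)
    # ... define the input streams and transformations here ...
    return ssc

# On a restart after driver failure, this rebuilds the context from the checkpoint
ssc = StreamingContext.getOrCreate(checkpoint_dir, create_context)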
Executor node failure: Any of the worker nodes running executors can fail, resulting in a loss of in-memory data.
For the failure of an executor node, Spark uses the same lineage-based techniques it relies on for ordinary RDD fault tolerance. All data received from external sources is replicated among the worker nodes, and all RDDs created through transformations of this replicated input data are tolerant to the failure of a worker node, since the RDD lineage lets the system recompute the lost data all the way from the surviving replica of the input data.
I hope the points above also cover the third question.

Purpose of SparkConf and SparkContext

What's the purpose of SparkContext and SparkConf? I am looking for the detailed difference, beyond the definition below:
SparkContext was the entry point of any Spark application, used to access all Spark features, and it needed a SparkConf, which held all the cluster configs and parameters, to create a SparkContext object.
The first step of any Spark driver application is to create a SparkContext. The SparkContext allows your Spark driver application to access the cluster through a resource manager. The resource manager can be YARN or Spark's own cluster manager. To create a SparkContext you first create a SparkConf. The SparkConf stores the configuration parameters that your Spark driver application will pass to the SparkContext. Some of these parameters define properties of your Spark driver application, and some are used by Spark to allocate resources on the cluster, such as the number, memory size, and cores of the executors running on the worker nodes. setAppName() gives your Spark driver application a name so you can identify it in the Spark or YARN UI.
SparkConf is passed into SparkContext so our driver application knows how to access the cluster.
Now that your Spark driver application has a SparkContext, it knows which resource manager to use and can ask it for resources on the cluster. If you are using YARN, Hadoop's ResourceManager (head node) and NodeManagers (worker nodes) will work to allocate containers for the executors. If the resources are available on the cluster, the executors will allocate memory and cores based on your configuration parameters. If you are using Spark's cluster manager, the Spark master (head node) and Spark workers (worker nodes) will be used to allocate the executors.
Each Spark driver application has its own executors on the cluster, which remain running as long as the Spark driver application has a SparkContext. The executors run user code, run computations, and can cache data for your application. The SparkContext will create a job that is broken into stages; the stages are broken into tasks, which are scheduled by the SparkContext on the executors.
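To make the split concrete, here is a minimal PySpark sketch (the app name, master, and resource values are illustrative):

from pyspark import SparkConf, SparkContext

# SparkConf: holds the parameters the driver will hand to the cluster
conf = (SparkConf()
        .setAppName("MyDriverApp")           # name shown in the Spark/YARN UI
        .setMaster("yarn")                   # which resource manager to talk to
        .set("spark.executor.memory", "2g")  # per-executor resources
        .set("spark.executor.cores", "2"))

# SparkContext: the entry point that uses the conf to request executors
sc = SparkContext(conf=conf)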

OutOfMemory error for Spark streaming job

I have a Spark streaming job running in a Hortonworks cluster.
I am running it in cluster mode through YARN. The job shows as running in the UI, but it is hitting the exception below in the driver logs:
Exception in thread "JobGenerator" java.lang.OutOfMemoryError: Java heap space
I fixed the issue by specifying --driver-memory in the spark-submit command, because the memory issue was in the driver.
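For anyone hitting the same error: since the JobGenerator thread runs inside the driver, it is the driver heap that needs the increase, along these lines (the 4g value is illustrative):
./bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 4g --class ... App.jar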

Controlling log size in Spark streaming job

We have a Spark streaming job running in an HDInsight Spark cluster (YARN mode), and we are seeing the streaming job stop after a few weeks because it appears to run out of disk space due to log volume.
Is there a way to set a limit on log size for a Spark streaming job and enable rolling logs? I have tried setting the Spark executor log properties below in code, but the settings don't seem to be honored.
val sparkConfiguration: SparkConf = EventHubsUtils.initializeSparkStreamingConfigurations
sparkConfiguration.set("spark.executor.logs.rolling.maxRetainedFiles", "2")
sparkConfiguration.set("spark.executor.logs.rolling.maxSize", "107374182")
val spark = SparkSession
  .builder
  .config(sparkConfiguration)
  .getOrCreate()
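One likely explanation, worth verifying on HDInsight: the spark.executor.logs.rolling.* properties only govern the executor logs written by Spark's own standalone/Mesos worker, while under YARN the executor output goes through log4j and YARN's log handling. A common workaround is to ship a custom log4j.properties with a rolling file appender to the executors (the file name below is illustrative):
./bin/spark-submit --files my_log4j.properties --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=my_log4j.properties" ... App.jar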

Spark: executor.CoarseGrainedExecutorBackend: Driver Disassociated

I am learning how to use Spark and I have a simple program. When I run the jar file it gives me the right result, but there are some errors in the stderr file, like this:
15/05/18 18:19:52 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor#localhost:51976] -> [akka.tcp://sparkDriver#172.31.34.148:60060] disassociated! Shutting down.
15/05/18 18:19:52 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver#172.31.34.148:60060] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
You can get the whole stderr file here:
http://172.31.34.148:8081/logPage/?appId=app-20150518181945-0026&executorId=0&logType=stderr
I searched for this problem and found this:
Why spark application fail with "executor.CoarseGrainedExecutorBackend: Driver Disassociated"?
And I turned up spark.yarn.executor.memoryOverhead as it said, but it doesn't work.
I just have one master node (8G memory), and in Spark's slaves file there is only one slave node: the master itself. I submit like this:
./bin/spark-submit --class .... --master spark://master:7077 --executor-memory 6G --total-executor-cores 8 /path/..jar hdfs://myfile
I don't know what the executor is and what the driver is... lol...
Sorry about that.
Can anybody help me?
If the Spark driver fails, it gets disassociated (from the YARN ApplicationMaster). Try the following to make it more fault-tolerant:
spark-submit with the --supervise flag on a Spark Standalone cluster
yarn-cluster mode on YARN
the spark.yarn.driver.memoryOverhead parameter for increasing the driver's memory allocation on YARN
Note: driver supervision (spark.driver.supervise) is not supported on a YARN cluster (yet).
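Putting the YARN suggestions together, a submission might look like this (the 1024 MB overhead is illustrative; on newer Spark versions the property is spark.driver.memoryOverhead):
./bin/spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.driver.memoryOverhead=1024 --class ... App.jar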
An overview of driver vs. executor (and others) can be found at http://spark.apache.org/docs/latest/cluster-overview.html or https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html
They are Java processes that can run on different machines or on the same one, depending on your configuration. The driver contains the SparkContext and declares the RDD transformations (and, if I'm not mistaken, the execution plan), then communicates that to the Spark master, which creates task definitions, asks the cluster manager (its own, YARN, Mesos) for resources (worker nodes), and those tasks in turn get sent to executors for execution.
Executors communicate certain information back to the master, and as far as I understand, if the driver encounters a problem or crashes, the master will take note and tell the executor (which in turn logs it), which is the "driver is disassociated" you see. This could happen for a lot of reasons, but the most common one is the Java process (the driver) running out of memory (try increasing spark.driver.memory).
There are some differences when running on YARN vs. standalone vs. Mesos, but I hope this helps. If the driver is disassociated, the Java process running as the driver likely encountered an error; the master logs might have something, and I'm not sure whether there are driver-specific logs. Hopefully someone more knowledgeable than me can provide more info.
