OutOfMemory error for Spark streaming job - apache-spark

I have a Spark streaming job running on a Hortonworks cluster.
I am running it in cluster mode through YARN; the job shows as running in the UI, but the driver logs contain the exception below:
Exception in thread "JobGenerator" java.lang.OutOfMemoryError: Java heap space

I fixed the issue by specifying --driver-memory in the spark-submit command, because the memory issue was in the driver.
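A minimal sketch of such a command, assuming YARN cluster mode; the class name, jar path, and the 4g value are placeholders rather than the original poster's actual settings:
# Raise the driver heap above the default when submitting the streaming job
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --class com.example.MyStreamingJob \
  /path/to/my-streaming-job.jar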

Related

PySpark Glue Error: Remote RPC Client Disassociated

I have PySpark code running in Glue which fails with the error "Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues". It seems to be a common issue with Spark, and I have gone through a bunch of Stack Overflow posts and other forums. I tried tweaking a number of parameters in the Spark config but nothing worked, hence posting here. I would appreciate any input. Below are a few details about my job; I have played with different values of the Spark config for RPC and memory. I tried executor memory of 1g, 20g, 30g, 64g and driver memory of 20g, 30g, 64g. Please advise.
Glue version: 3.0
Spark version: 3.1
Spark Config:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("jkTestApp")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .config("spark.rpc.message.maxSize", "512")
    .config("spark.driver.memory", "64g")
    .config("spark.executor.memory", "1g")
    .config("spark.executor.memoryOverhead", "512")
    .config("spark.rpc.numRetries", "10")
    .getOrCreate()
)

Spark Job failing with out of memory exception

Spark job failing with:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Spark was taking the default of 1g of driver memory. I increased the driver memory to 4g.
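One way to apply that, sketched here with illustrative paths and names, is via spark-defaults.conf or directly on the submit command:
# In conf/spark-defaults.conf, raise the driver heap:
spark.driver.memory    4g
# Or equivalently on the command line (class and jar are placeholders):
./bin/spark-submit --driver-memory 4g --class com.example.MyJob /path/to/my-job.jar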

Spark Thriftserver stops or freezes due to tableau queries

The Spark cluster (Spark 2.2) is used by around 30 people via spark-shell and Tableau (10.4). Once a day the thriftserver gets killed or freezes because the JVM has too much garbage to collect. These are the error messages that I can find in the thriftserver log file:
ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING, java.lang.OutOfMemoryError: GC overhead limit exceeded
ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING, java.lang.OutOfMemoryError: GC overhead limit exceeded
ERROR TaskSchedulerImpl: Lost executor 2 on XXX.XXX.XXX.XXX: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Exception in thread "HiveServer2-Handler-Pool: Thread-152" java.lang.OutOfMemoryError: Java heap space
General information:
The Thriftserver is started with the following options (copied from the web-ui of the master -> sun.java.command):
org.apache.spark.deploy.SparkSubmit --master spark://bd-master:7077 --conf spark.driver.memory=6G --conf spark.driver.extraClassPath=--hiveconf --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --executor-memory 12G --total-executor-cores 12 --supervise --driver-cores 2 spark-internal hive.server2.thrift.bind.host bd-master --hiveconf hive.server2.thrift.port 10001
The Spark standalone cluster has 48 cores and 240 GB of memory across 6 machines. Every machine has 8 cores and 64 GB of memory. Two of them are virtual machines.
The users are querying a Hive table, which is a 1.6 GB CSV file replicated on all machines.
Is there something I have done wrong that allows Tableau to kill the thriftserver? Is there any other information I could provide that would help you to help me?
We were able to bypass this issue by setting:
spark.sql.thriftServer.incrementalCollect=true
With this parameter set to true, the thriftserver sends results to the requester one partition at a time, which reduces the peak memory the thriftserver needs when sending the results back.
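A minimal sketch of passing this flag when starting the thriftserver with the stock sbin/start-thriftserver.sh script; the master URL, driver memory, and hive settings are taken from the startup command shown above:
# Restart the thriftserver with incremental collection of results enabled
./sbin/start-thriftserver.sh \
  --master spark://bd-master:7077 \
  --conf spark.driver.memory=6g \
  --conf spark.sql.thriftServer.incrementalCollect=true \
  --hiveconf hive.server2.thrift.bind.host=bd-master \
  --hiveconf hive.server2.thrift.port=10001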

Spark executor lost because of time out even after setting quite long time out value 1000 seconds

I have written a Spark job which seems to work fine for almost an hour, and after that executors start getting lost because of a timeout. I see the following in the logs:
15/08/16 12:26:46 WARN spark.HeartbeatReceiver: Removing executor 10 with no recent heartbeats: 1051638 ms exceeds timeout 1000000 ms
I don't see any errors, just the warning above, and because of it the executor gets removed by YARN; I then see an "RPC client disassociated" error, IOException: connection refused, and FetchFailedException.
After an executor gets removed I see it being added again and starting to work, and then some other executor fails. My question is: is it normal for executors to get lost? What happens to the tasks the lost executors were working on? My Spark job keeps running since it is long, around 4-5 hours, and I have a very good cluster with 1.2 TB of memory and a good number of CPU cores.
To solve the timeout issue above I tried increasing spark.akka.timeout to 1000 seconds, but no luck. I am using the following command to run my Spark job. I am new to Spark, and I am using Spark 1.4.1.
./spark-submit --class com.xyz.abc.MySparkJob --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" --driver-java-options -XX:MaxPermSize=512m --driver-memory 4g --master yarn-client --executor-memory 25G --executor-cores 8 --num-executors 5 --jars /path/to/spark-job.jar
What might be happening is that the slaves can no longer launch executors due to a memory issue. Look for the following messages in the master logs:
15/07/13 13:46:50 INFO Master: Removing executor app-20150713133347-0000/5 because it is EXITED
15/07/13 13:46:50 INFO Master: Launching executor app-20150713133347-0000/9 on worker worker-20150713153302-192.168.122.229-59013
15/07/13 13:46:50 DEBUG Master: [actor] handled message (2.247517 ms) ExecutorStateChanged(app-20150713133347-0000,5,EXITED,Some(Command exited with code 1),Some(1)) from Actor[akka.tcp://sparkWorker#192.168.122.229:59013/user/Worker#-83763597]
You might find some detailed java errors in the worker's log directory, and maybe this type of file: work/app-id/executor-id/hs_err_pid11865.log.
See http://pastebin.com/B4FbXvHR
This issue might be resolved by your application's management of RDDs, rather than by increasing the size of the JVM heap.
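If the timeouts themselves need to be raised, here is a hedged sketch of the usual knob added on top of the question's own command; note this only masks the symptom if the executors are actually dying from memory pressure, as suggested above:
# spark.network.timeout governs the heartbeat timeout in Spark 1.4+;
# the class, paths, and resource numbers below are the question's own values
./spark-submit --class com.xyz.abc.MySparkJob \
  --conf spark.network.timeout=1000s \
  --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" \
  --driver-java-options -XX:MaxPermSize=512m \
  --driver-memory 4g --master yarn-client \
  --executor-memory 25G --executor-cores 8 --num-executors 5 \
  --jars /path/to/spark-job.jar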

Spark: executor.CoarseGrainedExecutorBackend: Driver Disassociated

I am learning how to use Spark and I have a simple program. When I run the jar file it gives me the right result, but I have some errors in the stderr file, like this:
15/05/18 18:19:52 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor#localhost:51976] -> [akka.tcp://sparkDriver#172.31.34.148:60060] disassociated! Shutting down.
15/05/18 18:19:52 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver#172.31.34.148:60060] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
You can get the whole stderr file in there:
http://172.31.34.148:8081/logPage/?appId=app-20150518181945-0026&executorId=0&logType=stderr
I searched for this problem and found this:
Why spark application fail with "executor.CoarseGrainedExecutorBackend: Driver Disassociated"?
I turned up spark.yarn.executor.memoryOverhead as it suggested, but it doesn't work.
I have just one master node (8G memory), and in Spark's slaves file there is only one slave node: the master itself. I submit like this:
./bin/spark-submit --class .... --master spark://master:7077 --executor-memory 6G --total-executor-cores 8 /path/..jar hdfs://myfile
I don't know what the executor is and what the driver is... sorry about that.
Can anybody help me?
If the Spark driver fails, it gets disassociated (from the YARN AM). Try the following to make it more fault-tolerant (a sketch of the submit variants follows below):
spark-submit with the --supervise flag on a Spark Standalone cluster
yarn-cluster mode on YARN
the spark.yarn.driver.memoryOverhead parameter for increasing the driver's memory allocation on YARN
Note: driver supervision (spark.driver.supervise) is not supported on a YARN cluster (yet).
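A minimal sketch of the two submit variants, with placeholder class and jar names; the memoryOverhead value (in MB) is illustrative:
# On YARN, run the driver inside the cluster and give it extra off-heap headroom
./bin/spark-submit --master yarn-cluster \
  --conf spark.yarn.driver.memoryOverhead=1024 \
  --class com.example.MyApp /path/to/my-app.jar

# On a Spark Standalone cluster, ask the master to restart the driver if it dies
./bin/spark-submit --master spark://master:7077 --deploy-mode cluster \
  --supervise \
  --class com.example.MyApp /path/to/my-app.jar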
An overview of driver vs. executor (and others) can be found at http://spark.apache.org/docs/latest/cluster-overview.html or https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html
They are Java processes that can run on different machines or the same machine, depending on your configuration. The driver contains the SparkContext and declares the RDD transformations (and, if I'm not mistaken, the execution plan), then communicates that to the Spark master, which creates task definitions and asks the cluster manager (its own, YARN, Mesos) for resources (worker nodes); those tasks in turn get sent to executors for execution.
Executors communicate certain information back to the master, and as far as I understand, if the driver encounters a problem or crashes, the master takes note and tells the executor (which in turn logs) what you see: "driver is disassociated". This could be due to a lot of things, but the most common one is that the Java process (the driver) runs out of memory (try increasing spark.driver.memory).
There are some differences when running on YARN vs. Standalone vs. Mesos, but I hope this helps. If the driver is disassociated, the Java process running as the driver likely encountered an error; the master logs might have something, and I'm not sure whether there are driver-specific logs. Hopefully someone more knowledgeable than me can provide more info.
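To make the driver/executor split concrete, here is a hedged sketch against the question's standalone master; the memory values and class name are placeholders, not a recommendation:
# --driver-memory sizes the JVM that runs main() and holds the SparkContext;
# --executor-memory sizes each worker-side JVM that actually runs the tasks
./bin/spark-submit --master spark://master:7077 \
  --driver-memory 2G \
  --executor-memory 4G --total-executor-cores 8 \
  --class com.example.MyApp /path/to/my-app.jar hdfs://myfile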
