I am using Spark 1.3 for a Spark Streaming application. When I start the application, I can see in the Spark UI that a few of the jobs have failed tasks. On investigating the job details, I see that a few tasks failed because an executor was lost, with either ExecutorLostFailure (executor 11 lost) or Resubmitted (resubmitted due to lost executor).
In the application logs from YARN, the only error shown is Lost executor 11 on <machineip>: remote Akka client disassociated. I don't see any other exception or error being thrown.
The application stops after a couple of hours. The logs show that all the executors are lost when the application fails.
Can anyone suggest how to resolve this issue, or point me to a link that does?
There are many potential reasons why you're seeing executor loss. One thing I have observed in the past is that Java garbage collection pauses can become very long under heavy load. As a result, the executor is 'lost' while the GC pause drags on, and it comes back shortly thereafter.
You can determine if this is the issue by turning on executor GC logging. Simply add the following configuration:
--conf "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy"
See this great guide from Intel and Databricks for more details on GC tuning: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
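If you would rather set this programmatically, a rough Scala equivalent is the sketch below; it just sets the same property on the SparkConf before the context is created (the app name is a placeholder):
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: the same executor GC logging flags, set on the SparkConf instead of via --conf
val conf = new SparkConf()
  .setAppName("gc-logging-example")   // placeholder app name
  .set("spark.executor.extraJavaOptions",
    "-XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc " +
      "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy")
val sc = new SparkContext(conf)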
Related
How to avoid Executor Failures while Spark jobs are executing.
We are using Spark 1.6 as part of Cloudera CDH 5.10.
I am usually getting the error below.
ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 127100 ms
There could be various reasons behind slow task execution that then leads to the timeout; you need to drill down to find the root cause.
Sometimes tuning the default timeout configuration parameters also helps. Go to the Spark UI configuration tab, check the current values of the parameters below, and then increase the timeout parameters in spark-submit.
spark.worker.timeout
spark.network.timeout
spark.akka.timeout
Running the job with speculative execution (spark.speculation=true) also helps: if one or more tasks are running slowly in a stage, they will be re-launched.
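For example, here is a rough sketch of setting these programmatically; the values are placeholders and should be chosen based on your workload:
import org.apache.spark.SparkConf

// Sketch: raise the timeouts and enable speculative execution (placeholder values)
val conf = new SparkConf()
  .set("spark.network.timeout", "600s")   // default is 120s
  .set("spark.akka.timeout", "600s")      // Spark 1.x only; removed in Spark 2.x
  .set("spark.worker.timeout", "600")     // seconds; applies to standalone workers
  .set("spark.speculation", "true")       // re-launch slow tasks speculatively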
Explore the Spark 1.6.0 configuration properties for more options.
We are encountering a problem with a Spark 1.6 job (on YARN) that never ends when several jobs are launched simultaneously.
We found that when launching the Spark job in yarn-client mode we do not have this problem, unlike when launching it in yarn-cluster mode.
That could be a lead for finding the cause.
We changed the code to add a sparkContext.stop().
Indeed, the SparkContext was created (val sparkContext = createSparkContext) but never stopped. This change allowed us to decrease the number of jobs that remain blocked, but we still have some blocked jobs.
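For reference, the change looks roughly like this (just a sketch; createSparkContext and runJob stand in for our own code):
// Sketch of the change: make sure the context is always stopped, even when the job fails,
// otherwise the YARN application can stay registered and appear to never end
val sparkContext = createSparkContext   // our own factory method
try {
  runJob(sparkContext)                  // placeholder for the actual job logic
} finally {
  sparkContext.stop()
}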
By analyzing the logs, we found this message repeating endlessly:
17/09/29 11:04:37 DEBUG SparkEventPublisher: Enqueue SparkListenerExecutorMetricsUpdate(1,WrappedArray())
17/09/29 11:04:41 DEBUG ApplicationMaster: Sending progress
17/09/29 11:04:41 DEBUG ApplicationMaster: Number of pending allocations is 0. Sleeping for 5000.
It seems that the job blocks when we call newAPIHadoopRDD to get data from HBase. That may be the issue!
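For context, the HBase read is roughly like this (a simplified sketch of our code; the table name is a placeholder):
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Simplified sketch of the call that seems to block
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")   // placeholder table name

val hbaseRDD = sparkContext.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])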
Does anyone have any idea about this issue?
Thank you in advance.
I've been running a Spark application and one of the stages failed with a FetchFailedException. At roughly the same time, a log line similar to the following appeared in the Resource Manager logs.
<date> <time>,988 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: User=<user> OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=<appid> CONTAINERID=<containerid>
My application was using more memory than YARN had allocated to it; however, it had been running for several days. What I expect happened is that other applications started and wanted to use the cluster, and the Resource Manager killed one of my containers to give the resources to them.
Can anyone help me verify my assumption and/or point me to the documentation that describes the log messages that the Resource Manager outputs?
Edit:
If it helps, the YARN version I'm running is 2.6.0-cdh5.4.9.
The INFO message about OPERATION=AM Released Container comes from org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt. My vague understanding of the code tells me that it is logged for a successful container release, meaning that the container for the ApplicationMaster of your Spark application finished successfully.
I've just answered a similar question Why Spark application on YARN fails with FetchFailedException due to Connection refused? (yours was almost a duplicate).
A FetchFailedException is thrown when a reducer task (for a ShuffleDependency) could not fetch shuffle blocks.
The root cause of the FetchFailedException is usually that the executor (with the BlockManager holding the shuffle blocks) is lost, i.e. no longer available, because:
an OutOfMemoryError (a.k.a. the executor OOMed) or some other unhandled exception was thrown, or
the cluster manager that manages the workers hosting the executors of your Spark application (e.g. YARN) enforces container memory limits and eventually decided to kill the executor due to excessive memory usage.
You should review the logs of the Spark application using the web UI, the Spark History Server, or cluster-specific tools like yarn logs -applicationId for Hadoop YARN (which is your case).
A solution is usually to tune the memory of your Spark application.
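For example (only a sketch; the right values depend on your job and cluster, and the property names below are the Spark 1.x names for YARN):
import org.apache.spark.SparkConf

// Sketch: more executor heap and more off-heap headroom on YARN (placeholder values)
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")                  // executor JVM heap
  .set("spark.yarn.executor.memoryOverhead", "2048")   // extra MB that YARN reserves per executor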
I have written a Spark job which seems to work fine for almost an hour, and after that executors start getting lost because of a timeout. I see the following log statement:
15/08/16 12:26:46 WARN spark.HeartbeatReceiver: Removing executor 10 with no recent heartbeats: 1051638 ms exceeds timeout 1000000 ms
I don't see any errors, only the above warning, and because of it the executor gets removed by YARN; then I see an "Rpc client disassociated" error, an IOException (connection refused), and a FetchFailedException.
After an executor gets removed, I see it being added again and starting to work, and then some other executor fails again. My question is: is it normal for executors to get lost? What happens to the tasks the lost executors were working on? My Spark job keeps running since it is long, around 4-5 hours, and I have a very good cluster with 1.2 TB of memory and a good number of CPU cores.
To solve the above timeout issue, I tried to increase spark.akka.timeout to 1000 seconds, but no luck. I am using the following command to run my Spark job. I am new to Spark, and I am using Spark 1.4.1.
./spark-submit --class com.xyz.abc.MySparkJob --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" --driver-java-options -XX:MaxPermSize=512m --driver-memory 4g --master yarn-client --executor-memory 25G --executor-cores 8 --num-executors 5 --jars /path/to/spark-job.jar
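For reference, this is roughly how I tried raising the timeout (a sketch; I applied it on the SparkConf inside the job):
import org.apache.spark.SparkConf

// Sketch: the timeout increase I tried (1000 seconds), applied on the SparkConf
val sparkConf = new SparkConf()
  .set("spark.akka.timeout", "1000")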
What might be happening is that the slaves cannot launch executors anymore due to a memory issue. Look for the following messages in the master logs:
15/07/13 13:46:50 INFO Master: Removing executor app-20150713133347-0000/5 because it is EXITED
15/07/13 13:46:50 INFO Master: Launching executor app-20150713133347-0000/9 on worker worker-20150713153302-192.168.122.229-59013
15/07/13 13:46:50 DEBUG Master: [actor] handled message (2.247517 ms) ExecutorStateChanged(app-20150713133347-0000,5,EXITED,Some(Command exited with code 1),Some(1)) from Actor[akka.tcp://sparkWorker@192.168.122.229:59013/user/Worker#-83763597]
You might find some detailed java errors in the worker's log directory, and maybe this type of file: work/app-id/executor-id/hs_err_pid11865.log.
See http://pastebin.com/B4FbXvHR
This issue might be resolved by how your application manages its RDDs, not by increasing the size of the JVM heap.
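For instance, explicitly releasing cached RDDs once they are no longer needed can make a big difference (a sketch; someRDD is a placeholder):
// Sketch: unpersist cached RDDs instead of letting them accumulate in executor memory
val cached = someRDD.cache()   // someRDD is a placeholder for an RDD you reuse
cached.count()                 // ... run the jobs that actually need the cached data ...
cached.unpersist()             // then free the cached blocks on the executors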
I'm running a Spark job with Spark version 1.4 and Cassandra 2.18. I can telnet from the master to the Cassandra machine and it works. Sometimes the job runs fine and sometimes I get the following exception. Why would this happen only sometimes?
"Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, 172.28.0.162): java.io.IOException: Failed to open native connection to Cassandra at {172.28.0.164}:9042
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:155)
"
It sometimes also gives me this exception along with the one above:
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.28.0.164:9042 (com.datastax.driver.core.TransportException: [/172.28.0.164:9042] Connection has been closed))
I had the second error, NoHostAvailableException, happen to me quite a few times this week as I was porting Python Spark code to Java Spark.
I was having issues with the driver being nearly out of memory, and the GC was taking up all my cores (98% of all 8 cores), pausing the JVM all the time.
In Python, when this happens it's much more obvious (to me), so it took me a bit of time to realize what was going on here, which is why I hit this error quite a few times.
I had two theories on the root cause, but either way the solution was to stop the GC from going crazy.
The first theory was that because the JVM was pausing so often, I just couldn't connect to Cassandra.
The second theory: Cassandra was running on the same machine as Spark, and the JVM was taking 100% of the CPU, so Cassandra just couldn't answer in time and it looked to the driver as if there were no Cassandra hosts available.
Hope this helps!