when I am trying to run it on this folder it is throwing me ExecutorLostFailure everytime
Hi I am a beginner in Spark. I am trying to run a job on Spark 1.4.1 with 8 slave nodes with 11.7 GB memory each 3.2 GB Disk . I am running the Spark task from one of the slave node (from 8 nodes) (so with 0.7 storage fraction approx 4.8 gb only is available on each node )and using Mesos as the Cluster Manager. I am using this configuration :
spark.master mesos://uc1f-bioinfocloud-vamp-m-1:5050
spark.eventLog.enabled true
spark.driver.memory 6g
spark.storage.memoryFraction 0.7
spark.core.connection.ack.wait.timeout 800
spark.akka.frameSize 50
spark.rdd.compress true
I am trying to Run Spark MLlib Naive Bayes Algorithm on a Folder around 14 GB of data. (There is no issue when I am running the task on a 6 GB folder) I am reading this folder from google storage as RDD and giving 32 as partition parameter.(I have tried increasing the partition as well). Then using TF to create feature vector and predict on basis of that.
But when I am trying to run it on this folder it is throwing me ExecutorLostFailure everytime. I tried different configurations but nothing is helping. May be I am missing something very basic but not able to figure out. Any help or suggestion will be highly valuable.
Log is:
15/07/21 01:18:20 ERROR TaskSetManager: Task 3 in stage 2.0 failed 4 times; aborting job
15/07/21 01:18:20 INFO TaskSchedulerImpl: Cancelling stage 2
15/07/21 01:18:20 INFO TaskSchedulerImpl: Stage 2 was cancelled
15/07/21 01:18:20 INFO DAGScheduler: ResultStage 2 (collect at /opt/work/V2ProcessRecords.py:213) failed in 28.966 s
15/07/21 01:18:20 INFO DAGScheduler: Executor lost: 20150526-135628-3255597322-5050-1304-S8 (epoch 3)
15/07/21 01:18:20 INFO BlockManagerMasterEndpoint: Trying to remove executor 20150526-135628-3255597322-5050-1304-S8 from BlockManagerMaster.
15/07/21 01:18:20 INFO DAGScheduler: Job 2 failed: collect at /opt/work/V2ProcessRecords.py:213, took 29.013646 s
Traceback (most recent call last):
File "/opt/work/V2ProcessRecords.py", line 213, in <module>
secondPassRDD = firstPassRDD.map(lambda ( name, title, idval, pmcId, pubDate, article, tags , author, ifSigmaCust, wclass): ( str(name), title, idval, pmcId, pubDate, article, tags , author, ifSigmaCust , "Yes" if ("PMC" + pmcId) in rddNIHGrant else ("No") , wclass)).collect()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 745, in collect
File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 (TID 12, vamp-m-2.c.quantum-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/07/21 01:18:20 INFO BlockManagerMaster: Removed 20150526-135628-3255597322-5050-1304-S8 successfully in removeExecutor
15/07/21 01:18:20 INFO DAGScheduler: Host added was in lost list earlier:vamp-m-2.c.quantum-854.internal
Jul 21, 2015 1:01:15 AM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
15/07/21 01:18:20 INFO SparkContext: Invoking stop() from shutdown hook
{"Event":"SparkListenerTaskStart","Stage ID":2,"Stage Attempt ID":0,"Task Info":{"Task ID":11,"Index":6,"Attempt":2,"Launch Time":1437616381852,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}
{"Event":"SparkListenerExecutorRemoved","Timestamp":1437616389696,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Removed Reason":"Lost executor"}
{"Event":"SparkListenerTaskEnd","Stage ID":2,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"20150526-135628-3255597322-5050-1304-S8"},"Task Info":{"Task ID":11,"Index":6,"Attempt":2,"Launch Time":1437616381852,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1437616389697,"Failed":true,"Accumulables":[]}}
{"Event":"SparkListenerExecutorAdded","Timestamp":1437616389707,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Executor Info":{"Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Total Cores":1,"Log Urls":{}}}
{"Event":"SparkListenerTaskStart","Stage ID":2,"Stage Attempt ID":0,"Task Info":{"Task ID":12,"Index":6,"Attempt":3,"Launch Time":1437616389702,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}
{"Event":"SparkListenerExecutorRemoved","Timestamp":1437616397743,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Removed Reason":"Lost executor"}
{"Event":"SparkListenerTaskEnd","Stage ID":2,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"20150526-135628-3255597322-5050-1304-S8"},"Task Info":{"Task ID":12,"Index":6,"Attempt":3,"Launch Time":1437616389702,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1437616397743,"Failed":true,"Accumulables":[]}}
{"Event":"SparkListenerStageCompleted","Stage Info":{"Stage ID":2,"Stage Attempt ID":0,"Stage Name":"collect at /opt/work/V2ProcessRecords.py:215","Number of Tasks":72,"RDD Info":[{"RDD ID":6,"Name":"PythonRDD","Parent IDs":[0],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":72,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0},{"RDD ID":0,"Name":"gs://uc1f-bioinfocloud-vamp-m/literature/xml/P*/*.nxml","Scope":"{\"id\":\"0\",\"name\":\"wholeTextFiles\"}","Parent IDs":[],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":72,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0}],"Parent IDs":[],"Details":"","Submission Time":1437616365566,"Completion Time":1437616397753,"Failure Reason":"Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure: Lost task 6.3 in stage 2.0 (TID 12, uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)\nDriver stacktrace:","Accumulables":[]}}
{"Event":"SparkListenerJobEnd","Job ID":2,"Completion Time":1437616397755,"Job Result":{"Result":"JobFailed","Exception":{"Message":"Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure: Lost task 6.3 in stage 2.0 (TID 12, uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)\nDriver stacktrace:","Stack Trace":[{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages","File Name":"DAGScheduler.scala","Line Number":1266},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":1257},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":1256},{"Declaring Class":"scala.collection.mutable.ResizableArray$class","Method Name":"foreach","File Name":"ResizableArray.scala","Line Number":59},{"Declaring Class":"scala.collection.mutable.ArrayBuffer","Method Name":"foreach","File Name":"ArrayBuffer.scala","Line Number":47},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"abortStage","File Name":"DAGScheduler.scala","Line Number":1256},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"scala.Option","Method Name":"foreach","File Name":"Option.scala","Line Number":236},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"handleTaskSetFailed","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop","Method Name":"onReceive","File Name":"DAGScheduler.scala","Line Number":1450},{"Declaring Class":"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop","Method Name":"onReceive","File Name":"DAGScheduler.scala","Line Number":1411},{"Declaring Class":"org.apache.spark.util.EventLoop$$anon$1","Method Name":"run","File Name":"EventLoop.scala","Line Number":48}]}}}
The Most common cause of ExecutorLostFailure as per my understanding is OOM in executor.
In order to resolve the OOM issue, one needs to figure out what exactly is causing it. Simply increasing the default parallelism or increasing the executor memory is not a strategic solution.
If you look at what increasing parallelism do is it tries to create more executors so that each executor can work on less and less data. But if your data is skewed such that the key on which data partitioning happens (for parallelism) has more data, simply increasing parallelism will be of no effect.
Similarly just by increasing Executor memory will be a very inefficient way of handing such a scenario as if only one executor is failing with ExecutorLostFailure , requesting increased memory for all the executors will make your application require much more memory then actually expected.
This error is occurring because a task failed more than four times.
Try increase the parallelism in your cluster using the following parameter.
--conf "spark.default.parallelism=100"
Set the parallelism value to 2 to 3 time the number of cores available on your cluster. If that doesn't work. try increase the parallelism in an exponential fashion. i.e if your current parallelism doesn't work multiply it by two and so on. Also I have observed that it helps if your level of parallelism is a prime number especially if you are using groupByKkey.
It is hard to say what the problem is without the log of the failed executor and not the driver's but most likely it is a memory problem. Try increasing the partition number significantly (if your current is 32 try 200)
I was having this issue, and the problem for me was very high incidence of one key in a reduceByKey task. This was (I think) causing a massive list to collect on one of the executors, which would then throw OOM errors.
The solution for me was to just filter out keys with high population before doing the reduceByKey, but I appreciate that this may or may not be possible depending on your application. I didn't need all my data anyway.