JavaPairRDD - mapToPair() throws OutOfMemoryError - apache-spark

I am trying to iterate over a JavaPairRDD (the key is a String, the value is a Java model class), apply a transformation to the value, and return the same key/value pairs as a JavaPairRDD.
Before throwing the OutOfMemoryError it says: Marking Stage 5 (saveAsTextFile at AppDaoImpl.java:219) as failed due to a fetch failure from Stage 1 (mapToPair at AppDataUtil.java:221).
Is there any way to optimise the code below? It looks very simple to me, but when I process a huge file I hit this OutOfMemoryError.
I am also passing the following parameters:
--num-executors 20 --executor-memory 12288M --executor-cores 5 --driver-memory 6G --conf spark.yarn.executor.memoryOverhead=1332
Sample code:
return parquetFileContent.mapToPair(tuple -> {
    SimpleGroup simpleGroup = tuple._2();
    // Apply the transformation to the value and re-key the pair by a field of the model.
    Model inputData = applyTransformationLogic(simpleGroup);
    return new Tuple2<String, Model>(inputData.getSomeStringField(), inputData);
});
Before calling saveAsTextFile(), I combine three RDDs with union and then call this method:
javaSparkCtx.union(rdd1, rdd2, rdd3).saveAsTextFile("hdfs filepath");
I wanted to write all the RDDs to the same location, which is why I use union.
Is it possible to write each RDD separately to the same location?
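For illustration only (a sketch with made-up sub-directory names): saveAsTextFile() refuses to write into a directory that already exists, so one way to keep everything under the same base location without union is to give each RDD its own sub-directory under a common base path.
String basePath = "hdfs:///some/output/path";   // hypothetical base location
rdd1.saveAsTextFile(basePath + "/rdd1");
rdd2.saveAsTextFile(basePath + "/rdd2");
rdd3.saveAsTextFile(basePath + "/rdd3");
// A downstream job can still read all three at once with a glob, e.g. sc.textFile(basePath + "/*").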
The log trace is:
15/12/18 15:47:39 INFO scheduler.DAGScheduler: Marking Stage 5 (saveAsTextFile at AppDaoImpl.java:219) as failed due to a fetch failure from Stage 1 (mapToPair at AppDataUtil.java:221)
15/12/18 15:47:39 INFO scheduler.DAGScheduler: Stage 5 (saveAsTextFile at AppDaoImpl.java:219) failed in 78.951 s
15/12/18 15:47:39 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (mapToPair at AppDataUtil.java:221) and Stage 5 (saveAsTextFile at AppDaoImpl.java:219) due to fetch failure
15/12/18 15:47:39 INFO scheduler.DAGScheduler: Executor lost: 2 (epoch 5)
15/12/18 15:47:39 INFO storage.BlockManagerMasterActor: Trying to remove executor 2 from BlockManagerMaster.
15/12/18 15:47:39 INFO storage.BlockManagerMasterActor: Removing block manager BlockManagerId(2, lpdn0185.com, 37626)
15/12/18 15:47:39 INFO storage.BlockManagerMaster: Removed 2 successfully in removeExecutor
15/12/18 15:47:39 INFO scheduler.Stage: Stage 1 is now unavailable on executor 2 (26/56, false)
15/12/18 15:47:39 INFO scheduler.Stage: Stage 2 is now unavailable on executor 2 (25/56, false)
15/12/18 15:47:39 WARN scheduler.TaskSetManager: Lost task 2.1 in stage 5.0 (TID 119, lpdn0185.com): FetchFailed(BlockManagerId(2, lpdn0185.com, 37626), shuffleId=4, mapId=0, reduceId=2, message=
org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer{file=/hdfs1/yarn/nm/usercache/phdpentcustcdibtch/appcache/application_1449986083135_60217/blockmgr-34a2e882-6b36-42c6-bcff-03d9bc5ef80b/2c/shuffle_4_0_0.data, offset=3038022, length=2959077}
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)

Related

BlockManager removing RDD from a dead executor | Streaming query failure "RpcTimeoutException: Cannot receive any reply from null in 120 seconds"

I'm new to Spark and am running a Spark Structured Streaming application with multiple streaming queries.
During the run, an executor was lost and removed by the driver.
168124731 [dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Executor lost: 1 (epoch 6091)
168124731 [dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 15.1 in stage 11779.0 (TID 967663, ZMLPRAFMSAPP02, executor 2, partition 15, PROCESS_LOCAL, 7903 bytes)
168124732 [dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 33.1 in stage 11780.0 (TID 967664, ZMLPRAFMSAPP02, executor 2, partition 33, PROCESS_LOCAL, 7981 bytes)
168124732 [dispatcher-event-loop-1] INFO org.apache.spark.storage.BlockManagerMasterEndpoint - Trying to remove executor 1 from BlockManagerMaster.
168124733 [dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 16.1 in stage 11780.0 (TID 967665, zmplhdpdtapp02, executor 17, partition 16, PROCESS_LOCAL, 7981 bytes)
168124736 [dispatcher-event-loop-1] INFO org.apache.spark.storage.BlockManagerMasterEndpoint - Removing block manager BlockManagerId(1, zmplhdpdtapp01, 37025, None)
168124736 [dag-scheduler-event-loop] INFO org.apache.spark.storage.BlockManagerMaster - **Removed 1 successfully in removeExecutor**
Still, the driver tries to remove an RDD from this dead executor multiple times during the application's runtime.
The logs show the driver removing an RDD from executor 1 (the dead executor).
185819600 [block-manager-slave-async-thread-pool-17473] INFO org.apache.spark.storage.BlockManager - Removing RDD 64401
185819601 [block-manager-ask-thread-pool-19918] WARN org.apache.spark.storage.BlockManagerMasterEndpoint - **Error trying to remove RDD 64401 from block manager BlockManagerId(1, zmplhdpdtapp01, 37025, None)**
java.io.IOException: Failed to send RPC RPC 8704532677775782991 to /172.27.173.100:55376: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:357)
at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:334)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
This operation times out after 120 seconds and eventually, somehow, leads to the streaming query failing:
185939603 [block-manager-ask-thread-pool-20024] WARN org.apache.spark.storage.BlockManagerMaster - **Failed to remove RDD 64401 - Cannot receive any reply from null in 120 seconds. This timeout is controlled by spark.rpc.askTimeout**
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from null in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Failure.recover(Try.scala:216)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
...
Streaming query failure: the query is terminated along with the failed RDD-removal operation.
185939607 [stream execution thread for <Streaming_Query_Name> [id = 42bfda3d-5fd9-4202-9a15-b6f69288eef5, runId = d7032d32-58d8-4369-b8a6-e1d71816ba6e]] **ERROR org.apache.spark.sql.execution.streaming.MicroBatchExecution - Query <Streaming_Query_Name> [id = 42bfda3d-5fd9-4202-9a15-b6f69288eef5, runId = d7032d32-58d8-4369-b8a6-e1d71816ba6e] terminated with error
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout**
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.storage.BlockManagerMaster.removeRdd(BlockManagerMaster.scala:131)
at org.apache.spark.SparkContext.unpersistRDD(SparkContext.scala:1830)
at org.apache.spark.rdd.RDD.unpersist(RDD.scala:217)
at org.apache.spark.sql.execution.columnar.CachedRDDBuilder.clearCache(InMemoryRelation.scala:70)
...
My questions are:
Why is the Spark driver sending "Remove RDD" requests to a dead executor, and repeatedly, after the executor has already been removed?
Why does it lead to the failure of the streaming query?
How can we avoid this issue?
My expectations were:
The request should not have been sent to a dead executor.
If it was sent and the operation timed out, it should not have impacted the streaming queries.
Can someone help me understand this behaviour?
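As a hedged note on the third question (avoiding the impact): the timeout that actually kills the query is the one named in the log, spark.rpc.askTimeout, and both it and spark.network.timeout are ordinary Spark settings. Raising them makes the unpersist call less likely to abort the query, although it does not explain why a dead executor is still being contacted; the values below are arbitrary examples.
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("streaming-app")                  // hypothetical application name
        .config("spark.network.timeout", "300s")   // default is 120s
        .config("spark.rpc.askTimeout", "300s")    // the timeout reported in the log above
        .getOrCreate();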

Strange non-critical exception when using Spark 2.4.3 (EMR 5.25.0) with Delta Lake IO 0.6.0

I have been successfully using Spark 2.4.3 (Scala) on EMR 5.25.0 together with Delta Lake IO 0.6.0. My jobs run fine, but while doing some optimisation and housekeeping I noticed this strange exception. Although it does not appear to involve my code and does not affect the successful completion of the Spark application, it makes eyebrows raise :) I have searched through the Spark issues and elsewhere but found no explanation or further hints. It happens during this job:
20/05/13 23:34:28 INFO SparkContext: Starting job: apply at DatabricksLogging.scala:77
20/05/13 23:34:28 INFO DAGScheduler: Registering RDD 81 (apply at DatabricksLogging.scala:77)
20/05/13 23:34:28 INFO DAGScheduler: Registering RDD 96 (apply at DatabricksLogging.scala:77)
20/05/13 23:34:28 INFO DAGScheduler: Registering RDD 88 (apply at DatabricksLogging.scala:77)
20/05/13 23:34:28 INFO DAGScheduler: Registering RDD 101 (apply at DatabricksLogging.scala:77)
20/05/13 23:34:28 INFO DAGScheduler: Registering RDD 104 (apply at DatabricksLogging.scala:77)
20/05/13 23:34:28 INFO DAGScheduler: Got job 205 (apply at DatabricksLogging.scala:77) with 1 output partitions
20/05/13 23:34:28 INFO DAGScheduler: Final stage: ResultStage 1216 (apply at DatabricksLogging.scala:77)
20/05/13 23:34:28 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1215)
20/05/13 23:34:28 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1215)
20/05/13 23:34:28 INFO DAGScheduler: Submitting ShuffleMapStage 1212 (MapPartitionsRDD[96] at apply at DatabricksLogging.scala:77), which has no missing parents
20/05/13 23:34:29 INFO MemoryStore: Block broadcast_220 stored as values in memory (estimated size 55.2 KB, free 4.6 GB)
20/05/13 23:34:29 INFO MemoryStore: Block broadcast_220_piece0 stored as bytes in memory (estimated size 20.7 KB, free 4.6 GB)
20/05/13 23:34:29 INFO BlockManagerInfo: Added broadcast_220_piece0 in memory on ip-10-10-175-231.eu-west-1.compute.internal:43215 (size: 20.7 KB, free: 4.6 GB)
20/05/13 23:34:29 INFO SparkContext: Created broadcast 220 from broadcast at DAGScheduler.scala:1201
20/05/13 23:34:29 INFO DAGScheduler: Submitting 521 missing tasks from ShuffleMapStage 1212 (MapPartitionsRDD[96] at apply at DatabricksLogging.scala:77) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
20/05/13 23:34:29 INFO YarnClusterScheduler: Adding task set 1212.0 with 521 tasks
The exception:
20/05/13 23:36:20 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 10 to 10.10.175.48:33590
20/05/13 23:36:20 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 10 to 10.10.162.50:55798
20/05/13 23:36:20 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 10 to 10.10.174.108:42382
20/05/13 23:36:23 INFO TaskSetManager: Starting task 188.0 in stage 1214.0 (TID 22247, ip-10-10-175-231.eu-west-1.compute.internal, executor 3, partition 188, PROCESS_LOCAL, 8073 bytes)
20/05/13 23:36:23 INFO TaskSetManager: Finished task 95.0 in stage 1214.0 (TID 22154) in 4006 ms on ip-10-10-175-231.eu-west-1.compute.internal (executor 3) (1/200)
20/05/13 23:36:23 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.lang.ClassCastException: java.util.Collections$SynchronizedSet cannot be cast to java.util.List
at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:348)
at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:324)
at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:324)
at scala.Option.map(Option.scala:146)
at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:324)
at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:317)
at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:317)
at scala.collection.immutable.List.map(List.scala:288)
at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:317)
at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:309)
at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:149)
at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138)
at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
20/05/13 23:36:24 INFO TaskSetManager: Starting task 189.0 in stage 1214.0 (TID 22248, ip-10-10-175-231.eu-west-1.compute.internal, executor 19, partition 189, PROCESS_LOCAL, 8073 bytes)
20/05/13 23:36:24 INFO TaskSetManager: Finished task 39.0 in stage 1214.0 (TID 22098) in 4276 ms on ip-10-10-175-231.eu-west-1.compute.internal (executor 19) (2/200)
Note:
I noticed that these exceptions do not happen when we first load the Delta table, because the initial load obviously does not use Delta Lake IO's .merge functionality. That leads me to believe it is related to something that is logged during the merge operation. Again, it does not seem to affect the results, which are as expected.
It would be nice if anyone has an idea about this behaviour, to confirm whether or not it is an issue in Delta Lake IO 0.6.0.
Thanks!
This error doesn't impact your jobs, except that it may hamper debugging when you look at the Spark UI on the Spark History Server: you may see an active stage that should have finished.
This issue is fixed in Apache Spark 2.4.7/3.0.1/3.1.0. Please check the following links for more details:
https://github.com/delta-io/delta/issues/439
https://issues.apache.org/jira/browse/SPARK-31923
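As a stop-gap before upgrading, one assumption-laden workaround (acceptable only if you can live without this application's entry in the Spark History Server) is to disable the event-logging listener that throws the ClassCastException:
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("delta-merge-job")                 // hypothetical application name
        .config("spark.eventLog.enabled", "false")  // the failing listener is the event-log writer
        .getOrCreate();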

What is the difference between two DAG scheduler times in Spark run log?

I run a Spark job and it logs what is going on during the process.
At the end it prints two times that both refer to completion.
What is the difference between the two?
Is it the difference between reading and writing, added aggregation overhead, or something else?
DAGScheduler:54 - ResultStage 1 (runJob at SparkHadoopWriter.scala:78) finished in 41.988 s
DAGScheduler:54 - Job 0 finished: runJob at SparkHadoopWriter.scala:78, took 67.610115 s
LONGER OUTPUT:
...
2019-01-15 21:25:32 INFO TaskSetManager:54 - Finished task 2974.0 in stage 1.0 (TID 5956) in 898 ms on 172.17.6.100 (executor 8) (2982/2982)
2019-01-15 21:25:32 INFO TaskSchedulerImpl:54 - Removed TaskSet 1.0, whose tasks have all completed, from pool
2019-01-15 21:25:32 INFO DAGScheduler:54 - ResultStage 1 (runJob at SparkHadoopWriter.scala:78) finished in 41.988 s
2019-01-15 21:25:32 INFO DAGScheduler:54 - Job 0 finished: runJob at SparkHadoopWriter.scala:78, took 67.610115 s
2019-01-15 21:25:45 INFO SparkHadoopWriter:54 - Job job_20190115212425_0001 committed.
2019-01-15 21:25:45 INFO AbstractConnector:318 - Stopped Spark#4d4d8fcf{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-01-15 21:25:45 INFO SparkUI:54 - Stopped Spark web UI at http://node-100.iris-cluster.uni.lux:4040
2019-01-15 21:25:45 INFO StandaloneSchedulerBackend:54 - Shutting down all executors
2019-01-15 21:25:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asking each executor to shut down
2019-01-15 21:25:45 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-01-15 21:25:45 INFO MemoryStore:54 - MemoryStore cleared
2019-01-15 21:25:45 INFO BlockManager:54 - BlockManager stopped
2019-01-15 21:25:45 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2019-01-15 21:25:45 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-01-15 21:25:45 INFO SparkContext:54 - Successfully stopped SparkContext
2019-01-15 21:25:45 INFO ShutdownHookManager:54 - Shutdown hook called
What is the correct way to interpret such an output log?
Suppose the DAG scheduler maintains a hash map to collect the sorted lists from its n partitions. On receiving the list from the last partition, the result stage would be over; however, the numbers in that last partition still have to be inserted into the hash map, which takes about log(total number of elements / number of partitions) time, call it log(nip), where nip is the number of elements per partition. Furthermore, reading back the entire sorted list (to write it to a file) takes another log N, so in total we need roughly 2 log N of extra time.
So if you increase the number of partitions (i.e. worker nodes) from 2 to 2^4, the final lag would change from, say, 250 units to approximately 31 units.
Hope this helps!
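For what it is worth, the example's numbers are consistent with the lag shrinking in proportion to the number of partitions:
250 units × (2 / 2^4) = 250 / 8 ≈ 31 units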

Spark SQL: Why two jobs for one query?

Experiment
I tried the following snippet on Spark 1.6.1.
val soDF = sqlContext.read.parquet("/batchPoC/saleOrder") // This has 45 files
soDF.registerTempTable("so")
sqlContext.sql("select dpHour, count(*) as cnt from so group by dpHour order by cnt").write.parquet("/out/")
The Physical Plan is:
== Physical Plan ==
Sort [cnt#59L ASC], true, 0
+- ConvertToUnsafe
+- Exchange rangepartitioning(cnt#59L ASC,200), None
+- ConvertToSafe
+- TungstenAggregate(key=[dpHour#38], functions=[(count(1),mode=Final,isDistinct=false)], output=[dpHour#38,cnt#59L])
+- TungstenExchange hashpartitioning(dpHour#38,200), None
+- TungstenAggregate(key=[dpHour#38], functions=[(count(1),mode=Partial,isDistinct=false)], output=[dpHour#38,count#63L])
+- Scan ParquetRelation[dpHour#38] InputPaths: hdfs://hdfsNode:8020/batchPoC/saleOrder
For this query, I got two Jobs: Job 9 and Job 10
The DAGs for Job 9 and Job 10 were shown as screenshots (not reproduced here).
Observations
Apparently, there are two jobs for one query.
Stage-16 (marked as Stage-14 in Job 9) is skipped in Job 10.
Stage-15's last RDD[48] is the same as Stage-17's last RDD[49]. How? I saw in the logs that after Stage-15's execution, RDD[48] is registered as RDD[49].
Stage-17 is shown in the driver logs but never got executed on the executors. The driver logs show the task execution, but when I looked at the YARN containers' logs, there was no evidence of any task from Stage-17 being received.
Logs supporting these observations (driver logs only; I lost the executor logs due to a later crash). It can be seen that before Stage-17 starts, RDD[49] is registered:
16/06/10 22:11:22 INFO TaskSetManager: Finished task 196.0 in stage 15.0 (TID 1121) in 21 ms on slave-1 (199/200)
16/06/10 22:11:22 INFO TaskSetManager: Finished task 198.0 in stage 15.0 (TID 1123) in 20 ms on slave-1 (200/200)
16/06/10 22:11:22 INFO YarnScheduler: Removed TaskSet 15.0, whose tasks have all completed, from pool
16/06/10 22:11:22 INFO DAGScheduler: ResultStage 15 (parquet at <console>:26) finished in 0.505 s
16/06/10 22:11:22 INFO DAGScheduler: Job 9 finished: parquet at <console>:26, took 5.054011 s
16/06/10 22:11:22 INFO ParquetRelation: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
16/06/10 22:11:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
16/06/10 22:11:22 INFO DefaultWriterContainer: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
16/06/10 22:11:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
16/06/10 22:11:22 INFO SparkContext: Starting job: parquet at <console>:26
16/06/10 22:11:22 INFO DAGScheduler: Registering RDD 49 (parquet at <console>:26)
16/06/10 22:11:22 INFO DAGScheduler: Got job 10 (parquet at <console>:26) with 25 output partitions
16/06/10 22:11:22 INFO DAGScheduler: Final stage: ResultStage 18 (parquet at <console>:26)
16/06/10 22:11:22 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 17)
16/06/10 22:11:22 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 17)
16/06/10 22:11:22 INFO DAGScheduler: Submitting ShuffleMapStage 17 (MapPartitionsRDD[49] at parquet at <console>:26), which has no missing parents
16/06/10 22:11:22 INFO MemoryStore: Block broadcast_25 stored as values in memory (estimated size 17.4 KB, free 512.3 KB)
16/06/10 22:11:22 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes in memory (estimated size 8.9 KB, free 521.2 KB)
16/06/10 22:11:22 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 172.16.20.57:44944 (size: 8.9 KB, free: 517.3 MB)
16/06/10 22:11:22 INFO SparkContext: Created broadcast 25 from broadcast at DAGScheduler.scala:1006
16/06/10 22:11:22 INFO DAGScheduler: Submitting 200 missing tasks from ShuffleMapStage 17 (MapPartitionsRDD[49] at parquet at <console>:26)
16/06/10 22:11:22 INFO YarnScheduler: Adding task set 17.0 with 200 tasks
16/06/10 22:11:23 INFO TaskSetManager: Starting task 0.0 in stage 17.0 (TID 1125, slave-1, partition 0,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 1.0 in stage 17.0 (TID 1126, slave-2, partition 1,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 2.0 in stage 17.0 (TID 1127, slave-1, partition 2,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 3.0 in stage 17.0 (TID 1128, slave-2, partition 3,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 4.0 in stage 17.0 (TID 1129, slave-1, partition 4,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 5.0 in stage 17.0 (TID 1130, slave-2, partition 5,NODE_LOCAL, 1988 bytes)
Questions
Why two jobs? What is the intention of breaking one DAG into two jobs?
Job 10's DAG looks complete for the query execution. Is there anything specific that Job 9 is doing?
Why is Stage-17 not skipped? It looks like dummy tasks are created; do they serve any purpose?
Later, I tried another, rather simpler query. Unexpectedly, it created 3 jobs.
sqlContext.sql("select dpHour from so order by dphour").write.parquet("/out2/")
When you are using the high-level dataframe/dataset APIs, you leave it up to Spark to determine the execution plan, including the job/stage chunking. These depend on many factors such as execution parallelism, cached/persisted data structures, etc. In future versions of Spark, as the optimizer sophistication increases, you may see even more jobs per query as, for example, some data sources are sampled to parameterize cost-based execution optimization.
For example, I have frequently, but not always, seen writes generate jobs separate from the processing that involves shuffles.
Bottom line, if you are using the high-level APIs, unless you have to do extremely detailed optimization with huge data volumes, it rarely pays to dig into the specific chunking. Job startup costs are extremely low compared to processing/output.
If, on the other hand, you are curious about the Spark internals, read the optimizer code and engage on the Spark developer mailing list.
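If you are curious which jobs a single action actually spawns, here is a minimal sketch of watching the job boundaries programmatically (it uses the SparkListener abstract class as it exists in Spark 2.x+; the Spark 1.6 Java listener API differs slightly, so treat this as illustrative rather than exact for the version above):
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerJobStart;
import org.apache.spark.sql.SparkSession;

public class JobBoundaryLogger extends SparkListener {
    @Override
    public void onJobStart(SparkListenerJobStart jobStart) {
        // One line per job Spark decides to run, with the number of stages it plans.
        System.out.println("Job " + jobStart.jobId()
                + " started with " + jobStart.stageInfos().size() + " stage(s)");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("job-boundary-demo").getOrCreate();
        spark.sparkContext().addSparkListener(new JobBoundaryLogger());
        // Running the question's query here would now print one line per job, e.g.
        // spark.read().parquet("/batchPoC/saleOrder")...write().parquet("/out/");
    }
}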

Error ExecutorLostFailure when running a task in Spark

Hi, I am a beginner in Spark. I am trying to run a job on Spark 1.4.1 with 8 slave nodes, each with 11.7 GB of memory and 3.2 GB of disk. I am running the Spark task from one of those 8 slave nodes (so with a storage fraction of 0.7, only approximately 4.8 GB is available on each node), and I am using Mesos as the cluster manager. I am using this configuration:
spark.master mesos://uc1f-bioinfocloud-vamp-m-1:5050
spark.eventLog.enabled true
spark.driver.memory 6g
spark.storage.memoryFraction 0.7
spark.core.connection.ack.wait.timeout 800
spark.akka.frameSize 50
spark.rdd.compress true
I am trying to run the Spark MLlib Naive Bayes algorithm on a folder of around 14 GB of data. (There is no issue when I run the task on a 6 GB folder.) I read this folder from Google Storage as an RDD and pass 32 as the partition parameter (I have tried increasing the partition count as well). Then I use TF to create feature vectors and predict on the basis of them.
But when I try to run it on this folder it throws ExecutorLostFailure every time. I have tried different configurations but nothing helps. Maybe I am missing something very basic, but I am not able to figure it out. Any help or suggestion will be highly valuable.
The log is:
15/07/21 01:18:20 ERROR TaskSetManager: Task 3 in stage 2.0 failed 4 times; aborting job
15/07/21 01:18:20 INFO TaskSchedulerImpl: Cancelling stage 2
15/07/21 01:18:20 INFO TaskSchedulerImpl: Stage 2 was cancelled
15/07/21 01:18:20 INFO DAGScheduler: ResultStage 2 (collect at /opt/work/V2ProcessRecords.py:213) failed in 28.966 s
15/07/21 01:18:20 INFO DAGScheduler: Executor lost: 20150526-135628-3255597322-5050-1304-S8 (epoch 3)
15/07/21 01:18:20 INFO BlockManagerMasterEndpoint: Trying to remove executor 20150526-135628-3255597322-5050-1304-S8 from BlockManagerMaster.
15/07/21 01:18:20 INFO DAGScheduler: Job 2 failed: collect at /opt/work/V2ProcessRecords.py:213, took 29.013646 s
Traceback (most recent call last):
File "/opt/work/V2ProcessRecords.py", line 213, in <module>
secondPassRDD = firstPassRDD.map(lambda ( name, title, idval, pmcId, pubDate, article, tags , author, ifSigmaCust, wclass): ( str(name), title, idval, pmcId, pubDate, article, tags , author, ifSigmaCust , "Yes" if ("PMC" + pmcId) in rddNIHGrant else ("No") , wclass)).collect()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 745, in collect
File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 (TID 12, vamp-m-2.c.quantum-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/07/21 01:18:20 INFO BlockManagerMaster: Removed 20150526-135628-3255597322-5050-1304-S8 successfully in removeExecutor
15/07/21 01:18:20 INFO DAGScheduler: Host added was in lost list earlier:vamp-m-2.c.quantum-854.internal
Jul 21, 2015 1:01:15 AM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
15/07/21 01:18:20 INFO SparkContext: Invoking stop() from shutdown hook
{"Event":"SparkListenerTaskStart","Stage ID":2,"Stage Attempt ID":0,"Task Info":{"Task ID":11,"Index":6,"Attempt":2,"Launch Time":1437616381852,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}
{"Event":"SparkListenerExecutorRemoved","Timestamp":1437616389696,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Removed Reason":"Lost executor"}
{"Event":"SparkListenerTaskEnd","Stage ID":2,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"20150526-135628-3255597322-5050-1304-S8"},"Task Info":{"Task ID":11,"Index":6,"Attempt":2,"Launch Time":1437616381852,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1437616389697,"Failed":true,"Accumulables":[]}}
{"Event":"SparkListenerExecutorAdded","Timestamp":1437616389707,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Executor Info":{"Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Total Cores":1,"Log Urls":{}}}
{"Event":"SparkListenerTaskStart","Stage ID":2,"Stage Attempt ID":0,"Task Info":{"Task ID":12,"Index":6,"Attempt":3,"Launch Time":1437616389702,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}
{"Event":"SparkListenerExecutorRemoved","Timestamp":1437616397743,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Removed Reason":"Lost executor"}
{"Event":"SparkListenerTaskEnd","Stage ID":2,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"20150526-135628-3255597322-5050-1304-S8"},"Task Info":{"Task ID":12,"Index":6,"Attempt":3,"Launch Time":1437616389702,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1437616397743,"Failed":true,"Accumulables":[]}}
{"Event":"SparkListenerStageCompleted","Stage Info":{"Stage ID":2,"Stage Attempt ID":0,"Stage Name":"collect at /opt/work/V2ProcessRecords.py:215","Number of Tasks":72,"RDD Info":[{"RDD ID":6,"Name":"PythonRDD","Parent IDs":[0],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":72,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0},{"RDD ID":0,"Name":"gs://uc1f-bioinfocloud-vamp-m/literature/xml/P*/*.nxml","Scope":"{\"id\":\"0\",\"name\":\"wholeTextFiles\"}","Parent IDs":[],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":72,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0}],"Parent IDs":[],"Details":"","Submission Time":1437616365566,"Completion Time":1437616397753,"Failure Reason":"Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure: Lost task 6.3 in stage 2.0 (TID 12, uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)\nDriver stacktrace:","Accumulables":[]}}
{"Event":"SparkListenerJobEnd","Job ID":2,"Completion Time":1437616397755,"Job Result":{"Result":"JobFailed","Exception":{"Message":"Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure: Lost task 6.3 in stage 2.0 (TID 12, uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)\nDriver stacktrace:","Stack Trace":[{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages","File Name":"DAGScheduler.scala","Line Number":1266},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":1257},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":1256},{"Declaring Class":"scala.collection.mutable.ResizableArray$class","Method Name":"foreach","File Name":"ResizableArray.scala","Line Number":59},{"Declaring Class":"scala.collection.mutable.ArrayBuffer","Method Name":"foreach","File Name":"ArrayBuffer.scala","Line Number":47},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"abortStage","File Name":"DAGScheduler.scala","Line Number":1256},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"scala.Option","Method Name":"foreach","File Name":"Option.scala","Line Number":236},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"handleTaskSetFailed","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop","Method Name":"onReceive","File Name":"DAGScheduler.scala","Line Number":1450},{"Declaring Class":"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop","Method Name":"onReceive","File Name":"DAGScheduler.scala","Line Number":1411},{"Declaring Class":"org.apache.spark.util.EventLoop$$anon$1","Method Name":"run","File Name":"EventLoop.scala","Line Number":48}]}}}
The most common cause of ExecutorLostFailure, in my understanding, is an OOM in the executor.
To resolve an OOM issue, you need to figure out what exactly is causing it. Simply increasing the default parallelism or the executor memory is not a strategic solution.
Increasing parallelism splits the data across more partitions so that each task works on less data. But if your data is skewed, such that the key the data is partitioned on (for parallelism) carries most of the records, simply increasing parallelism will have no effect.
Similarly, just increasing executor memory is a very inefficient way of handling such a scenario: if only one executor is failing with ExecutorLostFailure, requesting more memory for all the executors will make your application require much more memory than it actually needs.
This error occurs because a task failed more than four times.
Try increasing the parallelism in your cluster using the following parameter:
--conf "spark.default.parallelism=100"
Set the parallelism value to 2 to 3 times the number of cores available in your cluster. If that doesn't work, try increasing the parallelism exponentially, i.e. if your current parallelism doesn't work, multiply it by two, and so on. I have also observed that it helps if your level of parallelism is a prime number, especially if you are using groupByKey.
It is hard to say what the problem is without the logs of the failed executor (rather than the driver's), but most likely it is a memory problem. Try increasing the partition count significantly (if your current value is 32, try 200).
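For illustration (Java API, whereas the original job is PySpark; the path, app name, and the value 200 are placeholders), the partition count can be raised either when reading or by repartitioning an existing RDD:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("naive-bayes-demo");   // hypothetical app name
JavaSparkContext sc = new JavaSparkContext(conf);
// Read with far more partitions than the original 32 so each task holds less data at once.
JavaPairRDD<String, String> docs =
        sc.wholeTextFiles("gs://bucket/literature/xml/P*/*.nxml", 200);
// An already-existing RDD can likewise be spread out:
// docs = docs.repartition(200);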
I was having this issue, and the problem for me was a very high incidence of one key in a reduceByKey task. This was (I think) causing a massive list to collect on one of the executors, which would then throw OOM errors.
The solution for me was to just filter out keys with a high population before doing the reduceByKey, but I appreciate that this may or may not be possible depending on your application. I didn't need all my data anyway.
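A minimal sketch of that idea in the Java API (the class name, key/value types, and threshold are made up; note that computing the per-key counts runs an extra job, so treat this as illustrative rather than a drop-in fix):
import org.apache.spark.api.java.JavaPairRDD;

import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class SkewFilterSketch {
    // Drop keys whose record count exceeds maxRecordsPerKey, then reduce the rest.
    static JavaPairRDD<String, Long> reduceWithoutHotKeys(JavaPairRDD<String, Long> pairs,
                                                          long maxRecordsPerKey) {
        // Count records per key and pull the (small) map of counts back to the driver.
        Map<String, Long> keyCounts = pairs
                .mapValues(v -> 1L)
                .reduceByKey(Long::sum)
                .collectAsMap();
        Set<String> hotKeys = keyCounts.entrySet().stream()
                .filter(e -> e.getValue() > maxRecordsPerKey)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
        // Filter the hot keys out before the expensive shuffle, as described above.
        return pairs
                .filter(t -> !hotKeys.contains(t._1()))
                .reduceByKey(Long::sum);
    }
}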
