Dataproc: update log level in Spark shell

I use a Jupyter terminal to access the driver of my Dataproc cluster. This is my gateway to the cluster, and I do not have direct SSH enabled for the driver machine.
When I launch spark-shell, I keep getting these INFO, DEBUG, and ContextCleaner messages throughout my session, and they disrupt my coding. Is there a way to turn these off?
scala> 22/10/11 15:47:31 INFO org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.22.86.219:43504) with ID 2
22/10/11 15:47:31 INFO org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.22.86.217:54770) with ID 1
22/10/11 15:47:31 INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Registering block manager cluster:39607 with 5.6 GB RAM, BlockManagerId(2, cluster, 39607, None)
22/10/11 15:47:31 INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Registering block manager cluster.internal:36731 with 5.6 GB RAM, BlockManagerId(1, cluster, 36731, None)
22/10/11 15:47:31 WARN com.google.cloud.hadoop.fs.gcs.GoogleHadoopSyncableOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://bucket/application_1665502930299_0001.lz4.inprogress
22/10/11 15:47:31 WARN com.google.cloud.hadoop.fs.gcs.GoogleHadoopSyncableOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://bucket/application_1665502930299_0001.lz4.inprogress
22/10/11 15:47:31 WARN com.google.cloud.hadoop.fs.gcs.GoogleHadoopSyncableOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://bucket/application_1665502930299_0001.lz4.inprogress
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 56
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 31
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 63
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 30
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 44
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 32
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 35
22/10/11 15:54:53 INFO org.apache.spark.storage.memory.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 23.1 KB, free 3.8 GB)
22/10/11 15:54:53 INFO org.apache.spark.storage.memory.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 7.6 KB, free 3.8 GB)
22/10/11 15:54:53 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on clusterurl:33625 (size: 7.6 KB, free: 3.8 GB)
22/10/11 15:54:53 INFO org.apache.spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1184
22/10/11 15:54:53 INFO org.apache.spark.scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[9] at show at <console>:39) (first 15 tasks are for partitions Vector(1))
22/10/11 15:54:53 INFO org.apache.spark.scheduler.cluster.YarnScheduler: Adding task set 4.0 with 1 tasks
22/10/11 15:54:53 INFO org.apache.spark.scheduler.FairSchedulableBuilder: Added task set TaskSet_4.0 tasks to pool default
22/10/11 15:54:53 INFO org.apache.spark.scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 7, cluster.internal, executor 1, partition 1, PROCESS_LOCAL, 7908 bytes)
22/10/11 15:54:53 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on cluster.internal:36731 (size: 7.6 KB, free: 5.6 GB)
22/10/11 15:54:54 INFO org.apache.spark.scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 7) in 558 ms on cluster.internal (executor 1) (1/1)
22/10/11 15:54:54 INFO org.apache.spark.scheduler.cluster.YarnScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool default
22/10/11 15:54:54 INFO org.apache.spark.scheduler.DAGScheduler: ResultStage 4 (show at <console>:39) finished in 0.571 s
22/10/11 15:54:54 INFO org.apache.spark.scheduler.DAGScheduler: Job 4 finished: show at <console>:39, took 0.575517 s

The logs are controlled by /etc/spark/conf/log4j.properties. The default root log level is INFO, but in spark-shell the root level is overridden to WARN. I guess the reason you see logs like INFO org.apache.spark.scheduler.DAGScheduler is that your cluster has settings like log4j.logger.org.apache.spark=INFO in the file.
There are several ways you can change log settings for spark-shell:
Session level
Run sc.setLogLevel("WARN") in spark-shell which will update the root log level for the whole process. It has the same effect as
scala> import org.apache.log4j.{Level, Logger}
scala> Logger.getRootLogger().setLevel(Level.WARN)
Get the specific logger and set level, e.g.:
scala> import org.apache.log4j.{Level, Logger}
scala> Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Make a copy of /etc/spark/conf/log4j.properties to /tmp/spark-log4j.properties, edit it with the desired log settings, then run spark-shell --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///tmp/spark-log4j.properties.
Cluster level
Edit /etc/spark/conf/log4j.properties and set higher log levels for the spammy packages, then run spark-shell.
When creating the cluster, add --properties ^#^spark-log4j:log4j.logger.org.apache.spark=WARN#..., which will update the config file under the hood.
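As an illustration of what the edited copy of the file might contain, the fragment below sketches overrides for the loggers that appear in the question's output. It assumes the log4j 1.x properties syntax used by the log4j.logger.org.apache.spark=INFO example above; the levels chosen here are examples, not recommendations.

```properties
# Example overrides (levels are illustrative); the rest of the copied file stays unchanged.
log4j.logger.org.apache.spark=WARN
# Logger name taken from the WARN lines in the question's output.
log4j.logger.com.google.cloud.hadoop.fs.gcs.GoogleHadoopSyncableOutputStream=ERROR
```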

Related

Strange non-critical exception when using spark 2.4.3 (emr 5.25.0) with delta lake io 0.6.0

I have been successfully using Spark 2.4.3 (Scala, on EMR 5.25.0) together with Delta Lake IO 0.6.0. My jobs run fine, but while doing some optimisation and housekeeping I noticed this strange exception. It does not appear to involve my code and does not affect the successful completion of the Spark application, but it does raise eyebrows :) I have searched through the Spark issues and elsewhere but did not find any explanation or further tips. It happens during this job:
20/05/13 23:34:28 INFO SparkContext: Starting job: apply at DatabricksLogging.scala:77
20/05/13 23:34:28 INFO DAGScheduler: Registering RDD 81 (apply at DatabricksLogging.scala:77)
20/05/13 23:34:28 INFO DAGScheduler: Registering RDD 96 (apply at DatabricksLogging.scala:77)
20/05/13 23:34:28 INFO DAGScheduler: Registering RDD 88 (apply at DatabricksLogging.scala:77)
20/05/13 23:34:28 INFO DAGScheduler: Registering RDD 101 (apply at DatabricksLogging.scala:77)
20/05/13 23:34:28 INFO DAGScheduler: Registering RDD 104 (apply at DatabricksLogging.scala:77)
20/05/13 23:34:28 INFO DAGScheduler: Got job 205 (apply at DatabricksLogging.scala:77) with 1 output partitions
20/05/13 23:34:28 INFO DAGScheduler: Final stage: ResultStage 1216 (apply at DatabricksLogging.scala:77)
20/05/13 23:34:28 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1215)
20/05/13 23:34:28 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1215)
20/05/13 23:34:28 INFO DAGScheduler: Submitting ShuffleMapStage 1212 (MapPartitionsRDD[96] at apply at DatabricksLogging.scala:77), which has no missing parents
20/05/13 23:34:29 INFO MemoryStore: Block broadcast_220 stored as values in memory (estimated size 55.2 KB, free 4.6 GB)
20/05/13 23:34:29 INFO MemoryStore: Block broadcast_220_piece0 stored as bytes in memory (estimated size 20.7 KB, free 4.6 GB)
20/05/13 23:34:29 INFO BlockManagerInfo: Added broadcast_220_piece0 in memory on ip-10-10-175-231.eu-west-1.compute.internal:43215 (size: 20.7 KB, free: 4.6 GB)
20/05/13 23:34:29 INFO SparkContext: Created broadcast 220 from broadcast at DAGScheduler.scala:1201
20/05/13 23:34:29 INFO DAGScheduler: Submitting 521 missing tasks from ShuffleMapStage 1212 (MapPartitionsRDD[96] at apply at DatabricksLogging.scala:77) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
20/05/13 23:34:29 INFO YarnClusterScheduler: Adding task set 1212.0 with 521 tasks
The exception:
20/05/13 23:36:20 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 10 to 10.10.175.48:33590
20/05/13 23:36:20 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 10 to 10.10.162.50:55798
20/05/13 23:36:20 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 10 to 10.10.174.108:42382
20/05/13 23:36:23 INFO TaskSetManager: Starting task 188.0 in stage 1214.0 (TID 22247, ip-10-10-175-231.eu-west-1.compute.internal, executor 3, partition 188, PROCESS_LOCAL, 8073 bytes)
20/05/13 23:36:23 INFO TaskSetManager: Finished task 95.0 in stage 1214.0 (TID 22154) in 4006 ms on ip-10-10-175-231.eu-west-1.compute.internal (executor 3) (1/200)
20/05/13 23:36:23 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.lang.ClassCastException: java.util.Collections$SynchronizedSet cannot be cast to java.util.List
at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:348)
at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:324)
at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:324)
at scala.Option.map(Option.scala:146)
at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:324)
at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:317)
at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:317)
at scala.collection.immutable.List.map(List.scala:288)
at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:317)
at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:309)
at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:149)
at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138)
at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
20/05/13 23:36:24 INFO TaskSetManager: Starting task 189.0 in stage 1214.0 (TID 22248, ip-10-10-175-231.eu-west-1.compute.internal, executor 19, partition 189, PROCESS_LOCAL, 8073 bytes)
20/05/13 23:36:24 INFO TaskSetManager: Finished task 39.0 in stage 1214.0 (TID 22098) in 4276 ms on ip-10-10-175-231.eu-west-1.compute.internal (executor 19) (2/200)
Note:
I noticed that these exceptions do not happen when we first load the Delta table, because the initial load does not use the .merge functionality of Delta Lake IO. That leads me to believe it is related to something being logged during the merge operation. Again, this does not seem to affect the results, which are as expected.
It would be nice if anyone has an idea about this behaviour, to check whether or not it is an issue in Delta Lake IO 0.6.0.
Thanks!
This error doesn't affect your jobs. It may, however, affect debugging when you look at the Spark UI on the Spark History Server: you may see a stage shown as active that should have finished.
This issue will be fixed in Apache Spark 2.4.7/3.0.1/3.1.0. Please check the following links for more details regarding this issue:
https://github.com/delta-io/delta/issues/439
https://issues.apache.org/jira/browse/SPARK-31923
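The message at the top of the stack trace can be reproduced outside Spark. The sketch below is only an illustration of the failing cast (a synchronized set treated as a java.util.List); it is not Spark's or Delta's actual code, and the variable names are made up for the example.

```scala
import java.util.{Collections, HashSet}
import scala.util.Try

// A synchronized set, the runtime type named in the exception message.
val value: AnyRef = Collections.synchronizedSet(new HashSet[String]())

// Casting it to java.util.List fails at runtime with a ClassCastException,
// because a Set is not a List regardless of its element type.
val attempt = Try(value.asInstanceOf[java.util.List[String]].size)
```

Wrapping the cast in Try keeps the failure as a value, so it can be inspected instead of terminating the session.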

Spark - Writing large dataframe problems

In Spark 2.2 (via YARN), I am trying to write a pretty large dataframe to HDFS via an overnight batch job. We first have two source tables, which we join, and then we write the joined result. The output is compressed parquet, but the write is failing due to an out of memory error.
We're providing 12 executors each with 20g of memory and 4 cores, plus the driver with 32g.
In a write operation like this, what runs out of memory? The executors? Short of blindly throwing more memory at it, what steps can we take to resolve this?
The write code is simple:
joined.write.option("header", "true").parquet(destPath)
Here are the final logs before a bunch of "heap dump" spam:
18/06/15 14:25:27 INFO BlockManagerInfo: Added broadcast_1385_piece0 in memory on company02.host.comp.com:43201 (size: 36.8 KB, free: 10.5 GB)
18/06/15 14:25:38 INFO TaskSetManager: Finished task 5982.0 in stage 1008.0 (TID 41953) in 63870 ms on company02.host.comp.com (executor 7) (5979/12136)
18/06/15 14:25:39 INFO TaskSetManager: Finished task 5984.0 in stage 1008.0 (TID 41955) in 65189 ms on company02.host.comp.com (executor 7) (5980/12136)
JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2018/06/15 14:25:48 - please wait.

Single long running task in each executor

Sorry if this question looks invalid. I tried to find general guidance on debugging task processing times but have found nothing yet. I think my problem is a known one, so any help in debugging or understanding it (a related discussion or blog post) would answer my question.
I have built multiple Spark Streaming jobs and almost all of them suffer from the same problem: one task in each executor takes much longer than all the other tasks:
But the input sizes of the tasks are not that different:
My workflow flat-maps (mapParitionsWithPair ( flatMap )) over a direct Kafka stream source with forty partitions to generate more objects from events, then reduces them (reduceByKey) and saves the aggregated values to a DB:
The task timeline figure is for the reduce stage.
It's an Apache Mesos based cluster with two nodes and two cores per node, and the second stage of every job has this uneven distribution of task processing times.
Update:
I replaced reduceByKey with a Java reduce operation (actually Kotlin Sequence operations) and the same problem still occurs.
After replaying the job I realized this problem does not hurt as much for bigger inputs. It processes 160K events in 1.8 to 4.8 minutes (worst case 580 events per second), and while some tasks still take much longer, the overall effect is much less harmful than for small inputs, whose processing rate is between 660 and 54. Interestingly, in both cases the long-running tasks take the same amount of time (about 41 seconds).
The problem persists even after increasing RAM. Executors now have 30% free RAM.
Update:
I changed the workflow to avoid shuffling data by using a Java 8 Stream reduce in each partition. Here is the changed job's DAG:
I increased the batch interval to 20 seconds and added more nodes. Now there is not just one slow task but several slow tasks and a few faster ones, but:
Overall it is now much faster than the previous version with shorter intervals.
I expect CPU usage to always be high, especially for the operation in mapPartition, but that is not always the case.
I put some logging around the actual operation in each partition and, strangely, tasks are sometimes slow and sometimes fast. While a task is running slowly the CPU is idle, and I can't see any blocking on network or CPU I/O. Memory usage is constant at 50%. Here are the executor logs:
started processing partitioned input: thread 99
started processing partitioned input: thread 98
finished processing partitioned input: thread 99 took 40615ms
finished processing partitioned input: thread 98 took 40469ms
started processing partitioned input: thread 98
started processing partitioned input: thread 99
finished processing partitioned input: thread 98 took 40476ms
finished processing partitioned input: thread 99 took 40523ms
started processing partitioned input: thread 98
started processing partitioned input: thread 99
finished processing partitioned input: thread 98 40465ms
finished processing partitioned input: thread 99 40379ms
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 468
started processing partitioned input: thread 99
finished processing partitioned input: thread 99 525
started processing partitioned input: thread 99
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 738
finished processing partitioned input: thread 99 790
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 took 558
started processing partitioned input: thread 99
finished processing partitioned input: thread 99 took 461
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 took 483
started processing partitioned input: thread 99
finished processing partitioned input: thread 99 took 513
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 took 485
started processing partitioned input: thread 99
finished processing partitioned input: thread 99 took 454
The logs above cover only the mapping of incoming inputs to objects to be saved in Cassandra and do not include the time spent saving to Cassandra. Here are the logs for the save operation, which is always fast and does not leave the CPU idle:
18/02/07 07:41:47 INFO Executor: Running task 17.0 in stage 5.0 (TID 207)
18/02/07 07:41:47 INFO TorrentBroadcast: Started reading broadcast variable 5
18/02/07 07:41:47 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 7.8 KB, free 1177.1 MB)
18/02/07 07:41:47 INFO TorrentBroadcast: Reading broadcast variable 5 took 33 ms
18/02/07 07:41:47 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 16.4 KB, free 1177.1 MB)
18/02/07 07:41:47 INFO BlockManager: Found block rdd_30_2 locally
18/02/07 07:41:47 INFO BlockManager: Found block rdd_30_17 locally
18/02/07 07:42:02 INFO TableWriter: Wrote 28926 rows to keyspace.table in 15.749 s.
18/02/07 07:42:02 INFO Executor: Finished task 17.0 in stage 5.0 (TID 207). 923 bytes result sent to driver
18/02/07 07:42:02 INFO CoarseGrainedExecutorBackend: Got assigned task 209
18/02/07 07:42:02 INFO Executor: Running task 18.0 in stage 5.0 (TID 209)
18/02/07 07:42:02 INFO BlockManager: Found block rdd_30_18 locally
18/02/07 07:42:03 INFO TableWriter: Wrote 29288 rows to keyspace.table in 16.042 s.
18/02/07 07:42:03 INFO Executor: Finished task 2.0 in stage 5.0 (TID 203). 1713 bytes result sent to driver
18/02/07 07:42:03 INFO CoarseGrainedExecutorBackend: Got assigned task 211
18/02/07 07:42:03 INFO Executor: Running task 21.0 in stage 5.0 (TID 211)
18/02/07 07:42:03 INFO BlockManager: Found block rdd_30_21 locally
18/02/07 07:42:19 INFO TableWriter: Wrote 29315 rows to keyspace.table in 16.308 s.
18/02/07 07:42:19 INFO Executor: Finished task 21.0 in stage 5.0 (TID 211). 923 bytes result sent to driver
18/02/07 07:42:19 INFO CoarseGrainedExecutorBackend: Got assigned task 217
18/02/07 07:42:19 INFO Executor: Running task 24.0 in stage 5.0 (TID 217)
18/02/07 07:42:19 INFO BlockManager: Found block rdd_30_24 locally
18/02/07 07:42:19 INFO TableWriter: Wrote 29422 rows to keyspace.table in 16.783 s.
18/02/07 07:42:19 INFO Executor: Finished task 18.0 in stage 5.0 (TID 209). 923 bytes result sent to driver
18/02/07 07:42:19 INFO CoarseGrainedExecutorBackend: Got assigned task 218
18/02/07 07:42:19 INFO Executor: Running task 25.0 in stage 5.0 (TID 218)
18/02/07 07:42:19 INFO BlockManager: Found block rdd_30_25 locally
18/02/07 07:42:35 INFO TableWriter: Wrote 29427 rows to keyspace.table in 16.509 s.
18/02/07 07:42:35 INFO Executor: Finished task 24.0 in stage 5.0 (TID 217). 923 bytes result sent to driver
18/02/07 07:42:35 INFO CoarseGrainedExecutorBackend: Got assigned task 225

Spark SQL: Why two jobs for one query?

Experiment
I tried the following snippet on Spark 1.6.1.
val soDF = sqlContext.read.parquet("/batchPoC/saleOrder") // This has 45 files
soDF.registerTempTable("so")
sqlContext.sql("select dpHour, count(*) as cnt from so group by dpHour order by cnt").write.parquet("/out/")
The Physical Plan is:
== Physical Plan ==
Sort [cnt#59L ASC], true, 0
+- ConvertToUnsafe
   +- Exchange rangepartitioning(cnt#59L ASC,200), None
      +- ConvertToSafe
         +- TungstenAggregate(key=[dpHour#38], functions=[(count(1),mode=Final,isDistinct=false)], output=[dpHour#38,cnt#59L])
            +- TungstenExchange hashpartitioning(dpHour#38,200), None
               +- TungstenAggregate(key=[dpHour#38], functions=[(count(1),mode=Partial,isDistinct=false)], output=[dpHour#38,count#63L])
                  +- Scan ParquetRelation[dpHour#38] InputPaths: hdfs://hdfsNode:8020/batchPoC/saleOrder
For this query, I got two Jobs: Job 9 and Job 10
For Job 9, the DAG is:
For Job 10, the DAG is:
Observations
Apparently, there are two jobs for one query.
Stage-16 (marked as Stage-14 in Job 9) is skipped in Job 10.
Stage-15's last RDD[48] is the same as Stage-17's last RDD[49]. How? I saw in the logs that after Stage-15 executed, RDD[48] was registered as RDD[49].
Stage-17 is shown in the driver logs but never got executed on the executors. The driver logs show the task execution, but when I looked at the YARN containers' logs, there was no evidence of any task from Stage-17 being received.
Logs supporting these observations (driver logs only; I lost the executor logs in a later crash). They show that RDD[49] is registered before Stage-17 starts:
16/06/10 22:11:22 INFO TaskSetManager: Finished task 196.0 in stage 15.0 (TID 1121) in 21 ms on slave-1 (199/200)
16/06/10 22:11:22 INFO TaskSetManager: Finished task 198.0 in stage 15.0 (TID 1123) in 20 ms on slave-1 (200/200)
16/06/10 22:11:22 INFO YarnScheduler: Removed TaskSet 15.0, whose tasks have all completed, from pool
16/06/10 22:11:22 INFO DAGScheduler: ResultStage 15 (parquet at <console>:26) finished in 0.505 s
16/06/10 22:11:22 INFO DAGScheduler: Job 9 finished: parquet at <console>:26, took 5.054011 s
16/06/10 22:11:22 INFO ParquetRelation: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
16/06/10 22:11:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
16/06/10 22:11:22 INFO DefaultWriterContainer: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
16/06/10 22:11:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
16/06/10 22:11:22 INFO SparkContext: Starting job: parquet at <console>:26
16/06/10 22:11:22 INFO DAGScheduler: Registering RDD 49 (parquet at <console>:26)
16/06/10 22:11:22 INFO DAGScheduler: Got job 10 (parquet at <console>:26) with 25 output partitions
16/06/10 22:11:22 INFO DAGScheduler: Final stage: ResultStage 18 (parquet at <console>:26)
16/06/10 22:11:22 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 17)
16/06/10 22:11:22 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 17)
16/06/10 22:11:22 INFO DAGScheduler: Submitting ShuffleMapStage 17 (MapPartitionsRDD[49] at parquet at <console>:26), which has no missing parents
16/06/10 22:11:22 INFO MemoryStore: Block broadcast_25 stored as values in memory (estimated size 17.4 KB, free 512.3 KB)
16/06/10 22:11:22 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes in memory (estimated size 8.9 KB, free 521.2 KB)
16/06/10 22:11:22 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 172.16.20.57:44944 (size: 8.9 KB, free: 517.3 MB)
16/06/10 22:11:22 INFO SparkContext: Created broadcast 25 from broadcast at DAGScheduler.scala:1006
16/06/10 22:11:22 INFO DAGScheduler: Submitting 200 missing tasks from ShuffleMapStage 17 (MapPartitionsRDD[49] at parquet at <console>:26)
16/06/10 22:11:22 INFO YarnScheduler: Adding task set 17.0 with 200 tasks
16/06/10 22:11:23 INFO TaskSetManager: Starting task 0.0 in stage 17.0 (TID 1125, slave-1, partition 0,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 1.0 in stage 17.0 (TID 1126, slave-2, partition 1,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 2.0 in stage 17.0 (TID 1127, slave-1, partition 2,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 3.0 in stage 17.0 (TID 1128, slave-2, partition 3,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 4.0 in stage 17.0 (TID 1129, slave-1, partition 4,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 5.0 in stage 17.0 (TID 1130, slave-2, partition 5,NODE_LOCAL, 1988 bytes)
Questions
Why two jobs? What is the intention behind breaking a DAG into two jobs?
Job 10's DAG looks complete for the query execution. Is Job 9 doing anything specific?
Why is Stage-17 not skipped? It looks like dummy tasks are created; do they have any purpose?
Later, I tried another, simpler query. Unexpectedly, it created 3 jobs.
sqlContext.sql("select dpHour from so order by dphour").write.parquet("/out2/")
When you are using the high-level dataframe/dataset APIs, you leave it up to Spark to determine the execution plan, including the job/stage chunking. These depend on many factors such as execution parallelism, cached/persisted data structures, etc. In future versions of Spark, as the optimizer sophistication increases, you may see even more jobs per query as, for example, some data sources are sampled to parameterize cost-based execution optimization.
For example, I have frequently, but not always, seen writing generate separate jobs from processing that involves shuffles.
Bottom line, if you are using the high-level APIs, unless you have to do extremely detailed optimization with huge data volumes, it rarely pays to dig into the specific chunking. Job startup costs are extremely low compared to processing/output.
If, on the other hand, you are curious about the Spark internals, read the optimizer code and engage on the Spark developer mailing list.

spark saveAsTextFile last partition (almost?) never finishes

I have a very simple word-count-like program that generates (Long, Double) counts like this:
val lines = sc.textFile(directory)
lines.repartition(600).mapPartitions{lineIterator =>
// Generate iterator of (Long,Double) counts
}
.reduceByKey(new HashPartitioner(30), (v1, v2) => v1 + v2).saveAsTextFile(outDir, classOf[GzipCodec])
My problem: The last of the 30 partitions never gets written.
Here are a few details:
My input is 5 GB gz-compressed and I expect about 1B unique Long keys.
I run on a 32-core machine with 1.5 TB of RAM. Input and output are on a local disk with 2 TB free. Spark is assigned to use all the RAM and happily does so. This application occupies about 0.5 TB.
I can observe the following:
For 29 partitions the reduce and repartition (because of the HashPartitioner) take about 2h. The last one does not finish, not even after a day. Two to four threads stay at 100%.
No error or warning appears in the log
Spark occupies about 100GB in /tmp which aligns with what the UI reports for shuffle write.
In the UI I can see the number of "shuffle read records" growing very, very slowly for the remaining task. After one day, still one magnitude away from what all the finished tasks show.
The last log looks like this:
15/08/03 23:26:43 INFO SparkHadoopWriter: attempt_201508031748_0002_m_000020_748: Committed
15/08/03 23:26:43 INFO Executor: Finished task 20.0 in stage 2.0 (TID 748). 865 bytes result sent to driver
15/08/03 23:27:50 INFO FileOutputCommitter: Saved output of task 'attempt_201508031748_0002_m_000009_737' to file:/output-dir/_temporary/0/task_201508031748_0002_m_000009
15/08/03 23:27:50 INFO SparkHadoopWriter: attempt_201508031748_0002_m_000009_737: Committed
15/08/03 23:27:50 INFO Executor: Finished task 9.0 in stage 2.0 (TID 737). 865 bytes result sent to driver
15/08/04 02:44:54 INFO BlockManager: Removing broadcast 3
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_3_piece0
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_3_piece0 of size 2009 dropped from memory (free 611091153849)
15/08/04 02:44:54 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_3
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_3 of size 3336 dropped from memory (free 611091157185)
15/08/04 02:44:54 INFO BlockManager: Removing broadcast 4
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_4_piece0
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_4_piece0 of size 2295 dropped from memory (free 611091159480)
15/08/04 02:44:54 INFO BlockManagerMaster: Updated info of block broadcast_4_piece0
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_4
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_4 of size 4016 dropped from memory (free 611091163496)
Imagine the first five lines repeated for 28 other partitions within a two minute time frame.
I have tried several things:
Spark 1.3.0 and 1.4.0
nio instead of netty
flatMap instead of mapPartitions
Just 30 instead of 600 input partitions
Still, I never get the last 1/30 of my data out of spark. Did anyone ever observe something similar? These two posts here and here seem to describe similar problems but no solution.
UPDATE
The task that never finishes is always the first task of the reduceKey+writeToTextFile. I have also removed the HashPartitioner and even tried on a bigger cluster with 400 cores and 6000 partitions. Only 5999 finish successfully, the last runs forever.
The UI shows for all tasks something like
Shuffle Read Size / Records: 20.0 MB / 1954832
but for the first it shows (at the moment)
Shuffle Read Size / Records: 150.1 MB / 711836
Numbers still growing....
It might be that your keys are very skewed. Depending on how they are distributed (or if you have a null or default key), a significant amount of the data might be going to a single executor, which is no different from running on your local machine (plus the overhead of a distributed platform). It might even be causing that machine to swap to disk, becoming intolerably slow.
Try using aggregateByKey instead of reduceByKey, since it will attempt to get partial sums distributed across executors instead of shuffling all the (potentially large) set of key-value pairs to a single executor. And maybe avoid fixing the number of output partitions to 30 just in case.
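One way to check the skew hypothesis before changing the job is to take a sample of the keys and count how many records would land in each output partition. The sketch below is plain Scala with made-up data; partitionOf and partitionSizes are hypothetical helper names, and the non-negative modulo of the key's hashCode mirrors what Spark's HashPartitioner does.

```scala
// Mirrors the non-negative modulo that Spark's HashPartitioner applies to key.hashCode.
def partitionOf(key: Long, numPartitions: Int): Int = {
  val raw = java.lang.Long.hashCode(key) % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

// Counts how many records of a (sampled) key list would land in each partition.
def partitionSizes(keys: Seq[Long], numPartitions: Int): Map[Int, Int] =
  keys.groupBy(partitionOf(_, numPartitions)).map { case (p, ks) => p -> ks.size }

// Made-up sample: 90 records share key 0, ten other keys appear once each.
val sampleKeys: Seq[Long] = Seq.fill(90)(0L) ++ (1L to 10L)
val sizes = partitionSizes(sampleKeys, 5)
```

With this sample, partition 0 receives 92 of the 100 records while the other four partitions receive 2 each, which is the kind of imbalance that would leave a single task running long after the rest have finished.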
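Edit: It is hard to diagnose a problem when "it just does not finish". One thing you can do is to introduce a timeout:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
val result = Await.result(Future {
// Your normal computation
}, timeout) // timeout is a Duration, e.g. 30.minutes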
That way, whatever task is taking too long, you can detect it and gather some metrics on the spot.
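A slightly fuller sketch of the timeout idea, assuming it is pasted into a Scala REPL such as spark-shell: runWithTimeout is a hypothetical helper name, and the durations are example values. Note that Await.result only stops waiting; the wrapped computation is not cancelled when the timeout fires.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical helper: runs `body` on another thread and returns
// Left(message) if no result arrives within `limit`.
def runWithTimeout[T](limit: FiniteDuration)(body: => T): Either[String, T] =
  try Right(Await.result(Future(body), limit))
  catch {
    case _: TimeoutException =>
      // The computation itself keeps running; this only stops waiting for it.
      // This is the place to gather metrics or dump thread stacks.
      Left(s"no result after $limit")
  }

// A fast computation returns its value; a slow one is reported as timed out.
val fast = runWithTimeout(2.seconds)(21 * 2)
val slow = runWithTimeout(100.millis) { Thread.sleep(1000); 0 }
```

Returning an Either keeps the timeout case explicit at the call site, so the caller decides whether to log, retry, or abort.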
