Single long running task in each executor - apache-spark

Sorry if this question looks invalid. I tried to find general guidance on debugging task processing times but have found nothing yet. I think my problem is a known one, so any help with debugging or understanding it (a related discussion or blog post) would answer my question.
I have written multiple Spark streaming jobs and almost all of them suffer from the same problem: one task in each executor takes much longer than all the other tasks:
But the input sizes of the tasks are not that different:
My workflow flat-maps (mapPartitionsToPair with flatMap) over a direct Kafka stream source with forty partitions to generate more objects from events, then reduces them (reduceByKey) and saves the aggregated values to a DB:
The task timeline figure is for the reduce stage.
It's an Apache Mesos based cluster with two nodes and two cores per node, and the second stage of all jobs has this uneven distribution of task processing times.
Update:
I replaced reduceByKey with a Java reduce operation (actually Kotlin Sequence operations) and the same problem still occurs.
After replaying the job I realized this problem does not harm that much for bigger inputs. It processes 160K events in 1.8 to 4.8 minutes (worst case 580 events per second), and while some tasks still take much longer, the overall effect is much less harmful than for small inputs, whose processing rate is between 54 and 660 events per second. Interestingly, in both cases the long-running tasks take the same amount of time (about 41 seconds).
The problem persists even after increasing RAM. Executors now have 30% free RAM.
Update:
I changed the workflow to avoid shuffling data by using a Java 8 Stream reduce in each partition. Here is the changed job's DAG:
I increased the batch interval to 20 seconds and added more nodes. Now there is not just one slow task but several slow tasks and a few faster ones, but:
Overall it now runs much faster than the previous version with shorter intervals.
I expected CPU usage to always be high, especially for the operation in mapPartition, but that is not always the case.
I put some logging around the actual operation in each partition and, strangely, tasks are sometimes slow and sometimes fast. While a task is running slowly the CPU is idle, and I can't see any blocking on network or I/O. Memory usage is constant at 50%. Here are the executor logs:
started processing partitioned input: thread 99
started processing partitioned input: thread 98
finished processing partitioned input: thread 99 took 40615ms
finished processing partitioned input: thread 98 took 40469ms
started processing partitioned input: thread 98
started processing partitioned input: thread 99
finished processing partitioned input: thread 98 took 40476ms
finished processing partitioned input: thread 99 took 40523ms
started processing partitioned input: thread 98
started processing partitioned input: thread 99
finished processing partitioned input: thread 98 40465ms
finished processing partitioned input: thread 99 40379ms
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 468
started processing partitioned input: thread 99
finished processing partitioned input: thread 99 525
started processing partitioned input: thread 99
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 738
finished processing partitioned input: thread 99 790
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 took 558
started processing partitioned input: thread 99
finished processing partitioned input: thread 99 took 461
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 took 483
started processing partitioned input: thread 99
finished processing partitioned input: thread 99 took 513
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 took 485
started processing partitioned input: thread 99
finished processing partitioned input: thread 99 took 454
The logs above cover only mapping incoming inputs to objects for saving in Cassandra and do not include the time for saving to Cassandra. Here are the logs for the save operation, which is always fast and doesn't leave the CPU idle:
18/02/07 07:41:47 INFO Executor: Running task 17.0 in stage 5.0 (TID 207)
18/02/07 07:41:47 INFO TorrentBroadcast: Started reading broadcast variable 5
18/02/07 07:41:47 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 7.8 KB, free 1177.1 MB)
18/02/07 07:41:47 INFO TorrentBroadcast: Reading broadcast variable 5 took 33 ms
18/02/07 07:41:47 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 16.4 KB, free 1177.1 MB)
18/02/07 07:41:47 INFO BlockManager: Found block rdd_30_2 locally
18/02/07 07:41:47 INFO BlockManager: Found block rdd_30_17 locally
18/02/07 07:42:02 INFO TableWriter: Wrote 28926 rows to keyspace.table in 15.749 s.
18/02/07 07:42:02 INFO Executor: Finished task 17.0 in stage 5.0 (TID 207). 923 bytes result sent to driver
18/02/07 07:42:02 INFO CoarseGrainedExecutorBackend: Got assigned task 209
18/02/07 07:42:02 INFO Executor: Running task 18.0 in stage 5.0 (TID 209)
18/02/07 07:42:02 INFO BlockManager: Found block rdd_30_18 locally
18/02/07 07:42:03 INFO TableWriter: Wrote 29288 rows to keyspace.table in 16.042 s.
18/02/07 07:42:03 INFO Executor: Finished task 2.0 in stage 5.0 (TID 203). 1713 bytes result sent to driver
18/02/07 07:42:03 INFO CoarseGrainedExecutorBackend: Got assigned task 211
18/02/07 07:42:03 INFO Executor: Running task 21.0 in stage 5.0 (TID 211)
18/02/07 07:42:03 INFO BlockManager: Found block rdd_30_21 locally
18/02/07 07:42:19 INFO TableWriter: Wrote 29315 rows to keyspace.table in 16.308 s.
18/02/07 07:42:19 INFO Executor: Finished task 21.0 in stage 5.0 (TID 211). 923 bytes result sent to driver
18/02/07 07:42:19 INFO CoarseGrainedExecutorBackend: Got assigned task 217
18/02/07 07:42:19 INFO Executor: Running task 24.0 in stage 5.0 (TID 217)
18/02/07 07:42:19 INFO BlockManager: Found block rdd_30_24 locally
18/02/07 07:42:19 INFO TableWriter: Wrote 29422 rows to keyspace.table in 16.783 s.
18/02/07 07:42:19 INFO Executor: Finished task 18.0 in stage 5.0 (TID 209). 923 bytes result sent to driver
18/02/07 07:42:19 INFO CoarseGrainedExecutorBackend: Got assigned task 218
18/02/07 07:42:19 INFO Executor: Running task 25.0 in stage 5.0 (TID 218)
18/02/07 07:42:19 INFO BlockManager: Found block rdd_30_25 locally
18/02/07 07:42:35 INFO TableWriter: Wrote 29427 rows to keyspace.table in 16.509 s.
18/02/07 07:42:35 INFO Executor: Finished task 24.0 in stage 5.0 (TID 217). 923 bytes result sent to driver
18/02/07 07:42:35 INFO CoarseGrainedExecutorBackend: Got assigned task 225
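For reference, the per-partition timing described above can be sketched roughly as follows. This is plain Python rather than the Kotlin job itself, and `timed_partition` and `process` are hypothetical names. The one detail worth noting is that the output is materialized inside the timed region, since a lazy iterator would otherwise push the real work outside the measurement.

```python
import threading
import time


def timed_partition(process, rows, log=print):
    """Run `process` over one partition's rows and log how long it took.

    `process` is a hypothetical function that takes an iterable of input
    rows and returns an iterable of output objects.
    """
    thread_id = threading.get_ident()
    log(f"started processing partitioned input: thread {thread_id}")
    start = time.monotonic()
    # Materialize the output here so that lazy work is included in the timing.
    result = list(process(rows))
    elapsed_ms = int((time.monotonic() - start) * 1000)
    log(f"finished processing partitioned input: thread {thread_id} took {elapsed_ms}ms")
    return result


# Usage with a trivial, made-up transformation.
doubled = timed_partition(lambda rows: (r * 2 for r in rows), range(5))
print(doubled)
```

If a task still shows about 40 seconds with an idle CPU under this kind of measurement, the wait is inside the processing function itself (for example a blocking call) rather than in scheduling.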

Related

Dataproc: update log level in Spark shell

I use a Jupyter terminal to access the driver of a Dataproc cluster. This is my gateway to the cluster, and I do not have direct SSH enabled for the driver machine.
When I launch spark-shell, I keep getting these INFO, debug, and ContextCleaner messages throughout my session, and they disturb my coding. Is there a way to turn them off?
scala> 22/10/11 15:47:31 INFO org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.22.86.219:43504) with ID 2
22/10/11 15:47:31 INFO org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.22.86.217:54770) with ID 1
22/10/11 15:47:31 INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Registering block manager cluster:39607 with 5.6 GB RAM, BlockManagerId(2, cluster, 39607, None)
22/10/11 15:47:31 INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Registering block manager cluster.internal:36731 with 5.6 GB RAM, BlockManagerId(1, cluster, 36731, None)
22/10/11 15:47:31 WARN com.google.cloud.hadoop.fs.gcs.GoogleHadoopSyncableOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://bucket/application_1665502930299_0001.lz4.inprogress
22/10/11 15:47:31 WARN com.google.cloud.hadoop.fs.gcs.GoogleHadoopSyncableOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://bucket/application_1665502930299_0001.lz4.inprogress
22/10/11 15:47:31 WARN com.google.cloud.hadoop.fs.gcs.GoogleHadoopSyncableOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://bucket/application_1665502930299_0001.lz4.inprogress
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 56
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 31
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 63
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 30
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 44
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 32
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 35
22/10/11 15:54:53 INFO org.apache.spark.storage.memory.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 23.1 KB, free 3.8 GB)
22/10/11 15:54:53 INFO org.apache.spark.storage.memory.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 7.6 KB, free 3.8 GB)
22/10/11 15:54:53 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on clusterurl:33625 (size: 7.6 KB, free: 3.8 GB)
22/10/11 15:54:53 INFO org.apache.spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1184
22/10/11 15:54:53 INFO org.apache.spark.scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[9] at show at <console>:39) (first 15 tasks are for partitions Vector(1))
22/10/11 15:54:53 INFO org.apache.spark.scheduler.cluster.YarnScheduler: Adding task set 4.0 with 1 tasks
22/10/11 15:54:53 INFO org.apache.spark.scheduler.FairSchedulableBuilder: Added task set TaskSet_4.0 tasks to pool default
22/10/11 15:54:53 INFO org.apache.spark.scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 7, cluster.internal, executor 1, partition 1, PROCESS_LOCAL, 7908 bytes)
22/10/11 15:54:53 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on cluster.internal:36731 (size: 7.6 KB, free: 5.6 GB)
22/10/11 15:54:54 INFO org.apache.spark.scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 7) in 558 ms on cluster.internal (executor 1) (1/1)
22/10/11 15:54:54 INFO org.apache.spark.scheduler.cluster.YarnScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool default
22/10/11 15:54:54 INFO org.apache.spark.scheduler.DAGScheduler: ResultStage 4 (show at <console>:39) finished in 0.571 s
22/10/11 15:54:54 INFO org.apache.spark.scheduler.DAGScheduler: Job 4 finished: show at <console>:39, took 0.575517 s
The logs are controlled by /etc/spark/conf/log4j.properties. The default root log level is INFO, but in spark-shell the root level is overridden to WARN. I guess the reason you see logs like INFO org.apache.spark.scheduler.DAGScheduler is that your cluster has settings like log4j.logger.org.apache.spark=INFO in the file.
There are several ways you can change the log settings for spark-shell:
Session level
Run sc.setLogLevel("WARN") in spark-shell, which updates the root log level for the whole process. It has the same effect as:
scala> import org.apache.log4j.{Level, Logger}
scala> Logger.getRootLogger().setLevel(Level.WARN)
Get the specific logger and set level, e.g.:
scala> import org.apache.log4j.{Level, Logger}
scala> Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Make a copy of /etc/spark/conf/log4j.properties to /tmp/spark-log4j.properties, edit it with the desired log settings, then run spark-shell --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///tmp/spark-log4j.properties.
Cluster level
Edit /etc/spark/conf/log4j.properties and set higher log levels for the spammy packages, then run spark-shell.
When creating the cluster, add --properties ^#^spark-log4j:log4j.logger.org.apache.spark=WARN#..., which will update the config file under the hood.
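As a rough illustration of the "copy and edit the properties file" options above, here is a small plain-Python helper that rewrites log4j 1.x properties text so a given logger has a given level. The function name and the sample text are made up, and it only handles the simple `log4j.logger.<name>=<LEVEL>` form, so treat it as a sketch rather than a general properties editor.

```python
def set_logger_level(properties_text, logger, level):
    """Return properties text in which `log4j.logger.<logger>` is set to `level`.

    An existing entry for the logger is replaced; otherwise one is appended.
    Commented-out lines are left untouched.
    """
    key = f"log4j.logger.{logger}"
    new_line = f"{key}={level}"
    lines = properties_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue
        name, sep, _ = stripped.partition("=")
        if sep and name.strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Made-up sample resembling a cluster config that forces Spark loggers to INFO.
sample = "log4j.rootCategory=INFO, console\nlog4j.logger.org.apache.spark=INFO\n"
print(set_logger_level(sample, "org.apache.spark", "WARN"))
```

The edited text could then be written to /tmp/spark-log4j.properties and passed to spark-shell as described in the session-level option.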

Why Pyspark jobs are dying out in the middle of process without any particular error

Experts, I am noticing one peculiar thing with one of the PySpark jobs in production (running in YARN cluster mode). After executing for a little over an hour (around 65-75 mins), it just dies without throwing any particular error message. We have analyzed the YARN logs for around 2 weeks now and there is no particular error in them; it just dies in the middle of ETL operations (reading/writing Hive tables, simple maps, trim, lambda operations, etc.), with no particular piece of code to point to. Sometimes rerunning fixes it, sometimes it takes more than one rerun.
The code is optimized, and the spark-submit --conf options are all tuned. The same options run perfectly well for around 30 other applications with very good performance stats. These are all the options we have:
spark-submit --conf spark.yarn.maxAppAttempts=1 --conf spark.sql.broadcastTimeout=36000 --conf spark.dynamicAllocation.executorIdleTimeout=1800 --conf spark.dynamicAllocation.minExecutors=8 --conf spark.dynamicAllocation.initialExecutors=8 --conf spark.dynamicAllocation.maxExecutors=32 --conf spark.yarn.executor.memoryOverhead=4096 --conf spark.kryoserializer.buffer.max=512m --driver-memory 2G --executor-memory 8G --executor-cores 2 --deploy-mode cluster --master yarn
We want to check whether there is some driver configuration I need to change to address this issue.
Or is there some automatic timeout in Spark cluster mode that can be increased? We are using Spark 1.6 with Python 2.7.
The error looks like this (there are several messages where it says):
ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
But it fails when it encounters the driver error (which happens at the end):
ERROR executor.CoarseGrainedExecutorBackend: Driver XX.XXX.XXX.XXX:XXXXX disassociated! Shutting down
Here is the log-
19/10/24 16:17:03 INFO compress.CodecPool: Got brand-new compressor [.gz]
19/10/24 16:17:03 INFO output.FileOutputCommitter: Saved output of task 'attempt_201910241617_0152_m_000323_0' to hdfs://myserver/production/out/TBL/_temporary/0/task_201910241617_0152_m_000323
19/10/24 16:17:03 INFO mapred.SparkHadoopMapRedUtil: attempt_201910241617_0152_m_000323_0: Committed
19/10/24 16:17:03 INFO executor.Executor: Finished task 323.0 in stage 152.0 (TID 27419). 2163 bytes result sent to driver
19/10/24 16:17:03 INFO output.FileOutputCommitter: Saved output of task 'attempt_201910241617_0152_m_000135_0' to hdfs://myserver/production/out/TBL/_temporary/0/task_201910241617_0152_m_000135
19/10/24 16:17:03 INFO mapred.SparkHadoopMapRedUtil: attempt_201910241617_0152_m_000135_0: Committed
19/10/24 16:17:03 INFO executor.Executor: Finished task 135.0 in stage 152.0 (TID 27387). 2163 bytes result sent to driver
19/10/24 16:18:04 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
19/10/24 16:18:04 INFO storage.DiskBlockManager: Shutdown hook called
19/10/24 16:18:04 INFO util.ShutdownHookManager: Shutdown hook called
19/10/24 16:21:12 INFO executor.Executor: Finished task 41.0 in stage 163.0 (TID 29954). 2210 bytes result sent to driver
19/10/24 16:21:12 INFO executor.Executor: Finished task 170.0 in stage 163.0 (TID 29986). 2210 bytes result sent to driver
19/10/24 16:21:13 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 30047
19/10/24 16:21:13 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 30079
19/10/24 16:21:13 INFO executor.Executor: Running task 10.0 in stage 165.0 (TID 30047)
19/10/24 16:21:13 INFO executor.Executor: Running task 42.0 in stage 165.0 (TID 30079)
19/10/24 16:21:13 INFO spark.MapOutputTrackerWorker: Updating epoch to 56 and clearing cache
19/10/24 16:21:13 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 210
19/10/24 16:21:13 INFO storage.MemoryStore: Block broadcast_210_piece0 stored as bytes in memory (estimated size 29.4 KB, free 3.8 GB)
19/10/24 16:21:13 INFO broadcast.TorrentBroadcast: Reading broadcast variable 210 took 3 ms
19/10/24 16:21:13 INFO storage.MemoryStore: Block broadcast_210 stored as values in memory (estimated size 83.4 KB, free 3.8 GB)
19/10/24 16:21:13 INFO executor.Executor: Finished task 10.0 in stage 165.0 (TID 30047). 931 bytes result sent to driver
19/10/24 16:21:13 INFO executor.Executor: Finished task 42.0 in stage 165.0 (TID 30079). 931 bytes result sent to driver
19/10/24 16:21:15 WARN executor.CoarseGrainedExecutorBackend: An unknown (rxxxxxx1.hadoop.com:XXXXX) driver disconnected.
19/10/24 16:21:15 ERROR executor.CoarseGrainedExecutorBackend: Driver XX.XXX.XXX.XXX:XXXXX disassociated! Shutting down.
19/10/24 16:21:15 INFO storage.DiskBlockManager: Shutdown hook called
19/10/24 16:21:15 INFO util.ShutdownHookManager: Shutdown hook called
Thanks,
Sid
Without any apparent stack trace, it's a good idea to think of the problem from two angles: it's either a code issue or a data issue.
In either case, start by giving the driver abundant memory to rule that out as a probable cause. Increase spark.driver.memory and the driver memory overhead (spark.yarn.driver.memoryOverhead on Spark 1.6) until you've diagnosed the problem.
Common code issues:
Too many transformations cause the lineage to get too big. If any kind of iterative operation is happening on the dataframe, it's a good idea to truncate the DAG by doing a checkpoint in between. In Spark 2.x you can call dataFrame.checkpoint() directly without having to access the RDD. @Sagar's answer describes how to do this for Spark 1.6.
Trying to broadcast dataframes that are too big. This will usually result in an OOM exception but can sometimes just cause the job to seem stuck. The resolution is to not call broadcast if you are explicitly doing so. Otherwise check whether you've set spark.sql.autoBroadcastJoinThreshold to a custom value and try lowering it or disabling broadcast altogether (by setting it to -1).
Not enough partitions can cause every task to run hot. The easiest way to diagnose this is to check the stages view in the Spark UI and look at the size of data read and written per task. Ideally this should be in the 100MB-500MB range. Otherwise increase spark.sql.shuffle.partitions (default 200) and spark.default.parallelism.
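The per-task size guideline in the last point translates into a partition count by simple division. A minimal sketch in plain Python, where the function name, the 200 MiB target, and the 500 GiB example are all illustrative assumptions rather than recommended values:

```python
def suggested_partitions(total_bytes, target_bytes_per_task=200 * 1024**2):
    """Partition count so that each task handles about `target_bytes_per_task`."""
    # Integer ceiling division; always return at least one partition.
    return max(1, -(-total_bytes // target_bytes_per_task))


# Hypothetical stage that shuffles 500 GiB, aiming for roughly 200 MiB per task.
print(suggested_partitions(500 * 1024**3))
```

The result is a starting point for spark.sql.shuffle.partitions; the per-task sizes in the Spark UI remain the thing to check after changing it.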
Common data issues:
Data skew. Since your job is failing for a specific workload, that job could have data skew. Diagnose this on the stage view in the Spark UI by checking that the median task completion time is comparable to the 75th percentile, which in turn is comparable to the 90th percentile. There are many ways to redress data skew, but the one I find best is to write a custom join function that salts the join keys prior to the join. This splits the skewed partition into several smaller partitions at the expense of a constant-size data explosion.
Input file format or number of files. If your input file isn't partitioned and you're only doing narrow transforms (those that do not cause a data shuffle), then all of your data will run through a single executor and not really benefit from the distributed cluster setup. Diagnose this from the Spark UI by checking how many tasks are created in each stage of the pipeline. It should be of the order of your spark.default.parallelism value. If not, do a .repartition(<some value>) immediately after the data read step, prior to any transforms. If the file format is CSV (not ideal), verify that multiLine is disabled unless your specific case requires it; otherwise a single executor is forced to read the entire CSV file.
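The key-salting idea from the data skew point can be sketched without Spark. The example below is plain Python over lists of (key, value) pairs; `salted_join` and the sample data are hypothetical, and a real job would express the same steps with dataframe columns.

```python
import random
from collections import defaultdict


def salted_join(large, small, num_salts, seed=0):
    """Inner-join two lists of (key, value) pairs using salted keys.

    Each row of `large` (the skewed side) gets a random salt, so a hot key
    is spread over `num_salts` distinct join keys. Each row of `small` is
    replicated once per salt so every salted key still finds its match.
    """
    rng = random.Random(seed)
    salted_large = [((k, rng.randrange(num_salts)), v) for k, v in large]
    salted_small = [((k, s), v) for k, v in small for s in range(num_salts)]

    # Hash join on the salted key; in a cluster this is where the shuffle happens.
    lookup = defaultdict(list)
    for salted_key, v in salted_small:
        lookup[salted_key].append(v)
    return [(k, lv, sv)
            for (k, salt), lv in salted_large
            for sv in lookup.get((k, salt), [])]


large = [("hot", i) for i in range(6)] + [("cold", 99)]
small = [("hot", "H"), ("cold", "C")]
joined = salted_join(large, small, num_salts=3)
print(len(joined))
```

The replication of `small` by `num_salts` is the constant-size data explosion mentioned above; the join result itself is the same as for an unsalted join.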
Happy debugging!
Are you breaking the lineage? If not, the issue might be with the lineage. Can you try breaking the lineage somewhere in the code and see whether it helps?
# Spark 1.6 code
sc.setCheckpointDir('.')
# df is the original dataframe you are performing transformations on
dfrdd = df.rdd
dfrdd.checkpoint()  # only marks the RDD; it is saved when an action runs
df = sqlContext.createDataFrame(dfrdd)
print df.count()
Let me know if it helps.

Spark - Writing large dataframe problems

In Spark 2.2 (via YARN), I am trying to write a pretty large dataframe to HDFS via an overnight batch job. We first have two source tables, which we join, and then we write the joined result. The output is compressed parquet, but the write is failing due to an out of memory error.
We're providing 12 executors each with 20g of memory and 4 cores, plus the driver with 32g.
In a write operation like this, what runs out of memory? The executors? Short of blindly throwing more memory at it, what steps can we take to resolve this?
The join code is simple:
joined.write.option("header", "true").parquet(destPath)
Here are the final logs before a bunch of "heap dump" spam:
18/06/15 14:25:27 INFO BlockManagerInfo: Added broadcast_1385_piece0 in memory on company02.host.comp.com:43201 (size: 36.8 KB, free: 10.5 GB)
18/06/15 14:25:38 INFO TaskSetManager: Finished task 5982.0 in stage 1008.0 (TID 41953) in 63870 ms on company02.host.comp.com (executor 7) (5979/12136)
18/06/15 14:25:39 INFO TaskSetManager: Finished task 5984.0 in stage 1008.0 (TID 41955) in 65189 ms on company02.host.comp.com (executor 7) (5980/12136)
JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2018/06/15 14:25:48 - please wait.

Spark SQL: Why two jobs for one query?

Experiment
I tried the following snippet on Spark 1.6.1.
val soDF = sqlContext.read.parquet("/batchPoC/saleOrder") // This has 45 files
soDF.registerTempTable("so")
sqlContext.sql("select dpHour, count(*) as cnt from so group by dpHour order by cnt").write.parquet("/out/")
The Physical Plan is:
== Physical Plan ==
Sort [cnt#59L ASC], true, 0
+- ConvertToUnsafe
   +- Exchange rangepartitioning(cnt#59L ASC,200), None
      +- ConvertToSafe
         +- TungstenAggregate(key=[dpHour#38], functions=[(count(1),mode=Final,isDistinct=false)], output=[dpHour#38,cnt#59L])
            +- TungstenExchange hashpartitioning(dpHour#38,200), None
               +- TungstenAggregate(key=[dpHour#38], functions=[(count(1),mode=Partial,isDistinct=false)], output=[dpHour#38,count#63L])
                  +- Scan ParquetRelation[dpHour#38] InputPaths: hdfs://hdfsNode:8020/batchPoC/saleOrder
For this query, I got two Jobs: Job 9 and Job 10
For Job 9, the DAG is:
For Job 10, the DAG is:
Observations
Apparently, there are two jobs for one query.
Stage-16 (marked as Stage-14 in Job 9) is skipped in Job 10.
Stage-15's last RDD[48] is the same as Stage-17's last RDD[49]. How? I saw in the logs that after Stage-15 executes, RDD[48] is registered as RDD[49].
Stage-17 is shown in the driver logs but never got executed on the executors. The driver logs show task execution, but when I looked at the YARN containers' logs, there was no evidence of receiving any task from Stage-17.
Logs supporting these observations (driver logs only; I lost the executor logs in a later crash). They show that RDD[49] is registered before Stage-17 starts:
16/06/10 22:11:22 INFO TaskSetManager: Finished task 196.0 in stage 15.0 (TID 1121) in 21 ms on slave-1 (199/200)
16/06/10 22:11:22 INFO TaskSetManager: Finished task 198.0 in stage 15.0 (TID 1123) in 20 ms on slave-1 (200/200)
16/06/10 22:11:22 INFO YarnScheduler: Removed TaskSet 15.0, whose tasks have all completed, from pool
16/06/10 22:11:22 INFO DAGScheduler: ResultStage 15 (parquet at <console>:26) finished in 0.505 s
16/06/10 22:11:22 INFO DAGScheduler: Job 9 finished: parquet at <console>:26, took 5.054011 s
16/06/10 22:11:22 INFO ParquetRelation: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
16/06/10 22:11:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
16/06/10 22:11:22 INFO DefaultWriterContainer: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
16/06/10 22:11:22 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
16/06/10 22:11:22 INFO SparkContext: Starting job: parquet at <console>:26
16/06/10 22:11:22 INFO DAGScheduler: Registering RDD 49 (parquet at <console>:26)
16/06/10 22:11:22 INFO DAGScheduler: Got job 10 (parquet at <console>:26) with 25 output partitions
16/06/10 22:11:22 INFO DAGScheduler: Final stage: ResultStage 18 (parquet at <console>:26)
16/06/10 22:11:22 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 17)
16/06/10 22:11:22 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 17)
16/06/10 22:11:22 INFO DAGScheduler: Submitting ShuffleMapStage 17 (MapPartitionsRDD[49] at parquet at <console>:26), which has no missing parents
16/06/10 22:11:22 INFO MemoryStore: Block broadcast_25 stored as values in memory (estimated size 17.4 KB, free 512.3 KB)
16/06/10 22:11:22 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes in memory (estimated size 8.9 KB, free 521.2 KB)
16/06/10 22:11:22 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 172.16.20.57:44944 (size: 8.9 KB, free: 517.3 MB)
16/06/10 22:11:22 INFO SparkContext: Created broadcast 25 from broadcast at DAGScheduler.scala:1006
16/06/10 22:11:22 INFO DAGScheduler: Submitting 200 missing tasks from ShuffleMapStage 17 (MapPartitionsRDD[49] at parquet at <console>:26)
16/06/10 22:11:22 INFO YarnScheduler: Adding task set 17.0 with 200 tasks
16/06/10 22:11:23 INFO TaskSetManager: Starting task 0.0 in stage 17.0 (TID 1125, slave-1, partition 0,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 1.0 in stage 17.0 (TID 1126, slave-2, partition 1,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 2.0 in stage 17.0 (TID 1127, slave-1, partition 2,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 3.0 in stage 17.0 (TID 1128, slave-2, partition 3,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 4.0 in stage 17.0 (TID 1129, slave-1, partition 4,NODE_LOCAL, 1988 bytes)
16/06/10 22:11:23 INFO TaskSetManager: Starting task 5.0 in stage 17.0 (TID 1130, slave-2, partition 5,NODE_LOCAL, 1988 bytes)
Questions
Why two jobs? What is the intention of breaking a DAG into two jobs?
Job 10's DAG looks complete for the query execution. Is there anything specific Job 9 is doing?
Why is Stage-17 not skipped? It looks like dummy tasks are created; do they have any purpose?
Later, I tried another, simpler query. Unexpectedly, it created 3 jobs.
sqlContext.sql("select dpHour from so order by dphour").write.parquet("/out2/")
When you are using the high-level dataframe/dataset APIs, you leave it up to Spark to determine the execution plan, including the job/stage chunking. These depend on many factors such as execution parallelism, cached/persisted data structures, etc. In future versions of Spark, as the optimizer sophistication increases, you may see even more jobs per query as, for example, some data sources are sampled to parameterize cost-based execution optimization.
For example, I have frequently, but not always, seen writing generate separate jobs from processing that involves shuffles.
Bottom line, if you are using the high-level APIs, unless you have to do extremely detailed optimization with huge data volumes, it rarely pays to dig into the specific chunking. Job startup costs are extremely low compared to processing/output.
If, on the other hand, you are curious about the Spark internals, read the optimizer code and engage on the Spark developer mailing list.
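One possible contributor, offered as an assumption rather than something the answer above states: the plan in the question contains Exchange rangepartitioning, and a range partitioner has to look at the data to choose its split points before the sorted output can be produced. The plain-Python sketch below is not Spark's implementation, and `range_bounds` and `partition_for` are made-up names; it only shows why range partitioning naturally takes two passes over the keys.

```python
import bisect
import random


def range_bounds(keys, num_partitions, sample_size=100, seed=0):
    """First pass: sample the keys and pick num_partitions - 1 split points."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, bounds):
    """Second pass: route a key to the partition whose range contains it."""
    return bisect.bisect_left(bounds, key)


keys = list(range(1000))
bounds = range_bounds(keys, num_partitions=4)
sizes = [0] * 4
for k in keys:
    sizes[partition_for(k, bounds)] += 1
print(bounds, sizes)
```

In a distributed engine the first pass would run as its own job over the input, which would be consistent with seeing an extra job for queries that contain an order by.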

spark saveAsTextFile last partition (almost?) never finishes

I have a very simple word-count-like program that generates (Long, Double) counts like that:
val lines = sc.textFile(directory)
lines.repartition(600).mapPartitions{lineIterator =>
// Generate iterator of (Long,Double) counts
}
.reduceByKey(new HashPartitioner(30), (v1, v2) => v1 + v2).saveAsTextFile(outDir, classOf[GzipCodec])
My problem: The last of the 30 partitions never gets written.
Here are a few details:
My input is 5 GB gz-compressed and I expect about 1B unique Long keys.
I run on a 32 core 1.5TB machine. Input and output come from a local disk with 2TB free. Spark is assigned to use all the ram and happily does so. This application occupies about 0.5 TB.
I can observe the following:
For 29 partitions the reduce and repartition (because of the HashPartitioner) take about 2h. The last one does not finish, not even after a day. Two to four threads stay at 100%.
No error or warning appears in the log
Spark occupies about 100GB in /tmp which aligns with what the UI reports for shuffle write.
In the UI I can see the number of "shuffle read records" growing very, very slowly for the remaining task. After one day it is still an order of magnitude away from what all the finished tasks show.
The last log looks like this:
15/08/03 23:26:43 INFO SparkHadoopWriter: attempt_201508031748_0002_m_000020_748: Committed
15/08/03 23:26:43 INFO Executor: Finished task 20.0 in stage 2.0 (TID 748). 865 bytes result sent to driver
15/08/03 23:27:50 INFO FileOutputCommitter: Saved output of task 'attempt_201508031748_0002_m_000009_737' to file:/output-dir/_temporary/0/task_201508031748_0002_m_000009
15/08/03 23:27:50 INFO SparkHadoopWriter: attempt_201508031748_0002_m_000009_737: Committed
15/08/03 23:27:50 INFO Executor: Finished task 9.0 in stage 2.0 (TID 737). 865 bytes result sent to driver
15/08/04 02:44:54 INFO BlockManager: Removing broadcast 3
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_3_piece0
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_3_piece0 of size 2009 dropped from memory (free 611091153849)
15/08/04 02:44:54 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_3
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_3 of size 3336 dropped from memory (free 611091157185)
15/08/04 02:44:54 INFO BlockManager: Removing broadcast 4
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_4_piece0
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_4_piece0 of size 2295 dropped from memory (free 611091159480)
15/08/04 02:44:54 INFO BlockManagerMaster: Updated info of block broadcast_4_piece0
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_4
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_4 of size 4016 dropped from memory (free 611091163496)
Imagine the first five lines repeated for 28 other partitions within a two minute time frame.
I have tried several things:
Spark 1.3.0 and 1.4.0
nio instead of netty
flatMap instead of mapPartitions
Just 30 instead of 600 input partitions
Still, I never get the last 1/30 of my data out of spark. Did anyone ever observe something similar? These two posts here and here seem to describe similar problems but no solution.
UPDATE
The task that never finishes is always the first task of the reduceByKey + saveAsTextFile stage. I have also removed the HashPartitioner and even tried a bigger cluster with 400 cores and 6000 partitions. Only 5999 finish successfully; the last one runs forever.
The UI shows for all tasks something like
Shuffle Read Size / Records: 20.0 MB / 1954832
but for the first it shows (at the moment)
Shuffle Read Size / Records: 150.1 MB / 711836
Numbers still growing....
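A quick calculation on the two figures above suggests the straggler's records are much larger on average than those of the finished tasks, roughly 220 bytes versus 11 bytes per record. The sketch is plain Python and assumes the UI's MB means 2^20 bytes; it is only arithmetic on the reported numbers, not a diagnosis.

```python
MIB = 1024 * 1024  # assumption: the UI's "MB" is 2**20 bytes


def bytes_per_record(size_mb, records):
    """Average shuffle-read bytes per record."""
    return size_mb * MIB / records


typical = bytes_per_record(20.0, 1954832)    # a finished task
straggler = bytes_per_record(150.1, 711836)  # the task that never finishes
print(f"typical:   {typical:.1f} bytes/record")
print(f"straggler: {straggler:.1f} bytes/record")
print(f"ratio:     {straggler / typical:.1f}x")
```

A gap of this size hints that the first partition holds different kinds of records than the others, not merely more of them.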
It might be that your keys are very skewed. Depending on how they are distributed (or if you have a null or default key), a significant amount of the data might be going to a single executor and be no different than running in your local machine (plus overhead of a distributed platform). It might even be causing that machine to swap to disk, becoming intolerably slow.
Try using aggregateByKey instead of reduceByKey, since it will attempt to get partial sums distributed across executors instead of shuffling all the (potentially large) set of key-value pairs to a single executor. And maybe avoid fixing the number of output partitions to 30 just in case.
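To check whether key skew is plausible before changing the job, one can count how many records each output partition would receive. This is a minimal plain-Python sketch, assuming Long keys are hashed the way java.lang.Long.hashCode does and assigned by non-negative modulo, which is how a hash partitioner typically works. The function names and the sample keys are made up.

```python
from collections import Counter


def java_long_hash(value):
    """Mimic java.lang.Long.hashCode: (int)(value ^ (value >>> 32))."""
    unsigned = value & 0xFFFFFFFFFFFFFFFF  # view as 64-bit two's complement
    low32 = (unsigned ^ (unsigned >> 32)) & 0xFFFFFFFF
    # Reinterpret the low 32 bits as a signed int.
    return low32 - 0x100000000 if low32 >= 0x80000000 else low32


def hash_partition(key, num_partitions):
    """Non-negative modulo of the hash; Python's % is already non-negative here."""
    return java_long_hash(key) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition."""
    return Counter(hash_partition(k, num_partitions) for k in keys)


# Hypothetical skewed input: one default key (0) dominates the data set.
keys = [0] * 900 + list(range(1, 101))
sizes = partition_sizes(keys, 30)
print(sizes.most_common(3))
```

Running the same count over a sample of the real keys would show whether one partition, such as the first, receives far more than its share.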
Edit: It is hard to diagnose an "it just does not finish" problem. One thing you can do is to introduce a timeout:
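import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
val result = Await.result(Future {
  // Your normal computation
}, timeout) // timeout is a Duration, e.g. 2.hours; a TimeoutException is thrown when it elapses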
That way, whatever task is taking too long, you can detect it and gather some metrics on the spot.
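For a job driven from Python, the same pattern as the Scala snippet above can be sketched with the standard library. `run_with_timeout` is a hypothetical helper, and, as with Await.result, the timeout only stops the waiting; it does not cancel the computation.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def run_with_timeout(computation, timeout_seconds):
    """Run `computation` in a worker thread; return (finished, result)."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(computation)
    try:
        return True, future.result(timeout=timeout_seconds)
    except TimeoutError:
        # The computation is still running at this point; this is the place
        # to gather metrics or dump state before deciding what to do next.
        return False, None
    finally:
        pool.shutdown(wait=False)


print(run_with_timeout(lambda: sum(range(1000)), timeout_seconds=5))
print(run_with_timeout(lambda: time.sleep(0.5), timeout_seconds=0.05))
```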

Resources