spark saveAsTextFile last partition (almost?) never finishes - apache-spark

I have a very simple word-count-like program that generates (Long, Double) counts like this:
val lines = sc.textFile(directory)
lines.repartition(600)
  .mapPartitions { lineIterator =>
    // Generate an iterator of (Long, Double) counts
  }
  .reduceByKey(new HashPartitioner(30), (v1, v2) => v1 + v2)
  .saveAsTextFile(outDir, classOf[GzipCodec])
My problem: The last of the 30 partitions never gets written.
Here are a few details:
My input is 5 GB gz-compressed and I expect about 1B unique Long keys.
I run on a 32-core machine with 1.5 TB of RAM. Input and output come from a local disk with 2 TB free. Spark is assigned all of the RAM and happily uses it; this application occupies about 0.5 TB.
I can observe the following:
For 29 of the 30 partitions, the reduce and repartition (because of the HashPartitioner) take about two hours. The last one does not finish, not even after a day. Two to four threads stay at 100%.
No error or warning appears in the log
Spark occupies about 100GB in /tmp which aligns with what the UI reports for shuffle write.
In the UI I can see the number of "shuffle read records" growing very, very slowly for the remaining task. After one day it is still an order of magnitude away from what all the finished tasks show.
The last log lines look like this:
15/08/03 23:26:43 INFO SparkHadoopWriter: attempt_201508031748_0002_m_000020_748: Committed
15/08/03 23:26:43 INFO Executor: Finished task 20.0 in stage 2.0 (TID 748). 865 bytes result sent to driver
15/08/03 23:27:50 INFO FileOutputCommitter: Saved output of task 'attempt_201508031748_0002_m_000009_737' to file:/output-dir/_temporary/0/task_201508031748_0002_m_000009
15/08/03 23:27:50 INFO SparkHadoopWriter: attempt_201508031748_0002_m_000009_737: Committed
15/08/03 23:27:50 INFO Executor: Finished task 9.0 in stage 2.0 (TID 737). 865 bytes result sent to driver
15/08/04 02:44:54 INFO BlockManager: Removing broadcast 3
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_3_piece0
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_3_piece0 of size 2009 dropped from memory (free 611091153849)
15/08/04 02:44:54 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_3
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_3 of size 3336 dropped from memory (free 611091157185)
15/08/04 02:44:54 INFO BlockManager: Removing broadcast 4
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_4_piece0
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_4_piece0 of size 2295 dropped from memory (free 611091159480)
15/08/04 02:44:54 INFO BlockManagerMaster: Updated info of block broadcast_4_piece0
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_4
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_4 of size 4016 dropped from memory (free 611091163496)
Imagine the first five lines repeated for the 28 other partitions within a two-minute time frame.
I have tried several things:
Spark 1.3.0 and 1.4.0
nio instead of netty
flatMap instead of mapPartitions
Just 30 instead of 600 input partitions
Still, I never get the last 1/30 of my data out of Spark. Has anyone ever observed something similar? These two posts (here and here) seem to describe similar problems, but offer no solution.
UPDATE
The task that never finishes is always the first task of the reduceByKey + saveAsTextFile stage. I have also removed the HashPartitioner and even tried on a bigger cluster with 400 cores and 6000 partitions. Only 5999 finish successfully; the last one runs forever.
The UI shows for all tasks something like
Shuffle Read Size / Records: 20.0 MB / 1954832
but for the first it shows (at the moment)
Shuffle Read Size / Records: 150.1 MB / 711836
Numbers still growing....

It might be that your keys are very skewed. Depending on how they are distributed (or if you have a null or default key), a significant amount of the data may be going to a single executor, which is no different from running on your local machine (plus the overhead of a distributed platform). It might even be causing that machine to swap to disk and become intolerably slow.
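One quick way to check that hypothesis (a hedged sketch, not part of the original answer; pairs stands for the (Long, Double) RDD produced by mapPartitions in the question):

// Sample the pairs and count occurrences per key; a key that dominates the sample
// will also dominate one reduce partition.
val heaviestKeys = pairs
  .sample(withReplacement = false, fraction = 0.001) // sample to keep the check cheap
  .map { case (k, _) => (k, 1L) }
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .take(10)                                          // ten most frequent keys in the sample

heaviestKeys.foreach(println)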
Try using aggregateByKey instead of reduceByKey, since it will attempt to compute partial sums distributed across the executors instead of shuffling the whole (potentially large) set of key-value pairs to a single executor. And maybe avoid fixing the number of output partitions to 30, just in case.
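For reference, a minimal sketch of that suggestion against the (Long, Double) pairs from the question; pairs, outDir, and the zero value are assumptions, not code from the original post:

import org.apache.hadoop.io.compress.GzipCodec

// Hedged sketch: aggregateByKey with a neutral zero value. seqOp folds values into the
// per-partition accumulator; combOp merges the per-partition accumulators after the shuffle.
// `pairs` stands for the RDD[(Long, Double)] produced by mapPartitions above.
val summed = pairs.aggregateByKey(0.0)(
  (acc, v) => acc + v, // seqOp: runs inside each partition
  (a, b)   => a + b    // combOp: merges partial results across partitions
)
summed.saveAsTextFile(outDir, classOf[GzipCodec])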
Edit: It is hard to diagnose a problem that manifests as "it just does not finish". One thing you can do is introduce a timeout:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global

val result = Await.result(Future {
  // Your normal computation
}, timeout) // timeout is a scala.concurrent.duration.Duration, e.g. 2.hours (import scala.concurrent.duration._)
That way, whatever task is taking too long, you can detect it and gather some metrics on the spot.

Related

Spark standalone application implements PCA, then hangs for 10-12 minutes and only then removes the RDD from memory

I have a 16-node cluster where every node has Spark and Cassandra installed, with a replication factor of 3, spark.sql.shuffle.partitions set to 96, and Spark-Cassandra-Connector 3.1.0. I am doing a Spark join (broadcastHashJoin) between a dataset and a Cassandra table and then running a PCA from the SparkML library. In between, I persist a dataset and only unpersist it after the PCA computations are finished. According to the Stages tab of the Spark UI, everything finishes in less than 10 minutes and generally no executor is doing anything:
but the persisted dataset is still persisted and stays like that for another 10-12 minutes, as shown below in the Storage tab of the Spark UI:
These are the last lines of stderr from one of the nodes, where you can see there is a difference of 10 minutes between the last 2 lines:
22/09/15 11:41:09 INFO MemoryStore: Block taskresult_1436 stored as bytes in memory (estimated size 89.3 MiB, free 11.8 GiB)
22/09/15 11:41:09 INFO Executor: Finished task 3.0 in stage 33.0 (TID 1436). 93681153 bytes result sent via BlockManager)
22/09/15 11:51:49 INFO BlockManager: Removing RDD 20
22/09/15 12:00:24 INFO BlockManager: Removing RDD 20
While in the main console where the application runs I only get:
1806703 [dispatcher-BlockManagerMaster] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_1_piece0 on 192.168.100.237:46523 in memory (size: 243.7 KiB, free: 12.1 GiB)
1806737 [block-manager-storage-async-thread-pool-75] INFO org.apache.spark.storage.BlockManager - Removing RDD 20
If I try to print the dataset after the PCA is complete and before I unpersist it, it still takes ~20 minutes, then prints it and then unpersists it. Why? Could that have to do with the query and the Cassandra table?
I have not enabled MLlib Linear Algebra Acceleration since I am on Ubuntu 20.04, which has incompatibility issues with libgfortran5, etc., but I am also not sure it would help. I am not sure where to look or what to look for in order to reduce these 20 minutes to 10. Any ideas what might be happening? Let me know if you want any more information.
It seems that activating the Linear Algebra Acceleration libraries of Apache Spark ML does make a difference! It reduced the PCA calculation time by 10 minutes, so no more Spark hanging!

Spark - Writing large dataframe problems

In Spark 2.2 (via YARN), I am trying to write a pretty large dataframe to HDFS in an overnight batch job. We first have two source tables, which we join, and then we write the joined result. The output is compressed Parquet, but the write is failing with an out-of-memory error.
We're providing 12 executors each with 20g of memory and 4 cores, plus the driver with 32g.
In a write operation like this, what runs out of memory? The executors? Short of blindly throwing more memory at it, what steps can we take to resolve this?
The code for the write is simple:
joined.write.option("header", "true").parquet(destPath)
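For context, a hypothetical, self-contained sketch of the pipeline described above; the table names, join key, destination path, and compression codec are assumptions, not details from the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("overnight-batch").getOrCreate()

// Two source tables, joined, then written as compressed Parquet (as described above).
val left   = spark.table("db.source_a")  // hypothetical table names
val right  = spark.table("db.source_b")
val joined = left.join(right, Seq("id")) // "id" is an assumed join key

joined.write
  .option("compression", "snappy")       // Parquet compression codec
  .parquet("/data/out/joined")           // hypothetical destination path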
Here are the final logs before a bunch of "heap dump" spam:
18/06/15 14:25:27 INFO BlockManagerInfo: Added broadcast_1385_piece0 in memory on company02.host.comp.com:43201 (size: 36.8 KB, free: 10.5 GB)
18/06/15 14:25:38 INFO TaskSetManager: Finished task 5982.0 in stage 1008.0 (TID 41953) in 63870 ms on company02.host.comp.com (executor 7) (5979/12136)
18/06/15 14:25:39 INFO TaskSetManager: Finished task 5984.0 in stage 1008.0 (TID 41955) in 65189 ms on company02.host.comp.com (executor 7) (5980/12136)
JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2018/06/15 14:25:48 - please wait.

Single long running task in each executor

Sorry if this question seems invalid; I tried to find general guidance on debugging task processing times but have found nothing yet. I think my problem is a known one, so any help with debugging or understanding the problem (a related discussion or blog post) would answer my question.
I have made multiple Spark streaming jobs and almost all of them suffer from the same problem: one task in each executor takes much longer than all the other tasks:
But the input sizes of the tasks are not that different:
My workflow flat-maps (mapPartitionsWithPair(flatMap)) over a direct Kafka stream source with forty partitions to generate more objects from the events, then reduces them (reduceByKey) and saves the aggregated values to a DB:
The task timeline figure is for the reduce stage.
It's an Apache Mesos-based cluster with two nodes and two cores per node, and the second stage of all jobs shows this uneven distribution of task processing times.
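For reference, a minimal Scala sketch of the shape of that workflow; the asker's code is Kotlin/Java, and the broker address, topic name, and aggregation here are assumptions:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val ssc = new StreamingContext(new SparkConf().setAppName("sketch"), Seconds(20))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",             // assumed broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "sketch")

// Direct stream over the (forty-partition) topic, flatMap to generate more objects
// per event, reduceByKey, then save each partition of the result to the DB.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

stream.flatMap(record => record.value.split(" "))
  .map(token => (token, 1L))
  .reduceByKey(_ + _)                                 // the stage with the slow task
  .foreachRDD(rdd => rdd.foreachPartition(_ => () /* write the partition to the DB here */))

ssc.start()
ssc.awaitTermination()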
Update:
I replaced reduceByKey with a plain Java reduce operation (actually Kotlin Sequence operations) and the same problem still occurs.
After replaying the job I realized the problem hurts less for bigger inputs: 160K events are processed in 1.8 to 4.8 minutes (worst case about 580 events per second), and while some tasks still take much longer, the overall effect is much less harmful than for small inputs, whose processing rate ranges from 54 to 660 events per second. Interestingly, in both cases the long-running tasks take about the same amount of time (around 41 seconds).
The problem persists even after increasing RAM. The executors now have 30% free RAM.
Update:
I changed the workflow so that it does not shuffle data, by using a Java 8 Stream reduce inside each partition. Here is the changed job's DAG:
I increased the batch interval to 20 seconds and added more nodes; now there is not just one slow task but several slow tasks and a few faster ones, but:
Overall it now runs much faster than the previous version with shorter intervals.
I expect CPU usage to always be high, especially for the operation in mapPartitions, but that's not always the case.
I just put some logging around the actual operation in each partition, and strangely tasks are sometimes slow and sometimes fast. When a task is running slowly the CPU is idle, and I can't see any blocking on the network or on I/O. Memory usage is constant at 50%. Here are the executor logs I mentioned:
started processing partitioned input: thread 99
started processing partitioned input: thread 98
finished processing partitioned input: thread 99 took 40615ms
finished processing partitioned input: thread 98 took 40469ms
started processing partitioned input: thread 98
started processing partitioned input: thread 99
finished processing partitioned input: thread 98 took 40476ms
finished processing partitioned input: thread 99 took 40523ms
started processing partitioned input: thread 98
started processing partitioned input: thread 99
finished processing partitioned input: thread 98 40465ms
finished processing partitioned input: thread 99 40379ms
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 468
started processing partitioned input: thread 99
finished processing partitioned input: thread 99 525
started processing partitioned input: thread 99
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 738
finished processing partitioned input: thread 99 790
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 took 558
started processing partitioned input: thread 99
finished processing partitioned input: thread 99 took 461
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 took 483
started processing partitioned input: thread 99
finished processing partitioned input: thread 99 took 513
started processing partitioned input: thread 98
finished processing partitioned input: thread 98 took 485
started processing partitioned input: thread 99
finished processing partitioned input: thread 99 took 454
The logs above are just for mapping the incoming inputs to objects to be saved in Cassandra, and do not include the time for saving to Cassandra; here are the logs for the save operation, which is always fast and does not leave the CPU idle:
18/02/07 07:41:47 INFO Executor: Running task 17.0 in stage 5.0 (TID 207)
18/02/07 07:41:47 INFO TorrentBroadcast: Started reading broadcast variable 5
18/02/07 07:41:47 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 7.8 KB, free 1177.1 MB)
18/02/07 07:41:47 INFO TorrentBroadcast: Reading broadcast variable 5 took 33 ms
18/02/07 07:41:47 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 16.4 KB, free 1177.1 MB)
18/02/07 07:41:47 INFO BlockManager: Found block rdd_30_2 locally
18/02/07 07:41:47 INFO BlockManager: Found block rdd_30_17 locally
18/02/07 07:42:02 INFO TableWriter: Wrote 28926 rows to keyspace.table in 15.749 s.
18/02/07 07:42:02 INFO Executor: Finished task 17.0 in stage 5.0 (TID 207). 923 bytes result sent to driver
18/02/07 07:42:02 INFO CoarseGrainedExecutorBackend: Got assigned task 209
18/02/07 07:42:02 INFO Executor: Running task 18.0 in stage 5.0 (TID 209)
18/02/07 07:42:02 INFO BlockManager: Found block rdd_30_18 locally
18/02/07 07:42:03 INFO TableWriter: Wrote 29288 rows to keyspace.table in 16.042 s.
18/02/07 07:42:03 INFO Executor: Finished task 2.0 in stage 5.0 (TID 203). 1713 bytes result sent to driver
18/02/07 07:42:03 INFO CoarseGrainedExecutorBackend: Got assigned task 211
18/02/07 07:42:03 INFO Executor: Running task 21.0 in stage 5.0 (TID 211)
18/02/07 07:42:03 INFO BlockManager: Found block rdd_30_21 locally
18/02/07 07:42:19 INFO TableWriter: Wrote 29315 rows to keyspace.table in 16.308 s.
18/02/07 07:42:19 INFO Executor: Finished task 21.0 in stage 5.0 (TID 211). 923 bytes result sent to driver
18/02/07 07:42:19 INFO CoarseGrainedExecutorBackend: Got assigned task 217
18/02/07 07:42:19 INFO Executor: Running task 24.0 in stage 5.0 (TID 217)
18/02/07 07:42:19 INFO BlockManager: Found block rdd_30_24 locally
18/02/07 07:42:19 INFO TableWriter: Wrote 29422 rows to keyspace.table in 16.783 s.
18/02/07 07:42:19 INFO Executor: Finished task 18.0 in stage 5.0 (TID 209). 923 bytes result sent to driver
18/02/07 07:42:19 INFO CoarseGrainedExecutorBackend: Got assigned task 218
18/02/07 07:42:19 INFO Executor: Running task 25.0 in stage 5.0 (TID 218)
18/02/07 07:42:19 INFO BlockManager: Found block rdd_30_25 locally
18/02/07 07:42:35 INFO TableWriter: Wrote 29427 rows to keyspace.table in 16.509 s.
18/02/07 07:42:35 INFO Executor: Finished task 24.0 in stage 5.0 (TID 217). 923 bytes result sent to driver
18/02/07 07:42:35 INFO CoarseGrainedExecutorBackend: Got assigned task 225

Why is the Spark job executed on the master server only during the "Cleaned accumulator" steps?

I have launched a Spark cluster using 3 EC2 instances of type c4.2xlarge (15 GB RAM / 8 cores); let's name them A, B, and C.
Configuring A:
I started it as the master server:
start-master.sh
And on this instance I have launched only 3 executors, with the following command:
start-slave.sh <master-uri> -c 3
Configuring B and C:
I have created 8 executors on each of the two instances by running the following command on each instance:
start-slave.sh <master-uri> -c 8
Now, my code is the following:
# Loading wiki dumps files.
lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0].encode("utf-8"))
# Running the word count algorithm and selecting words with count 1.
counts = lines.flatMap(lambda x: x.lower().split(' ')) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(add) \
    .filter(lambda x: x[1] == 1) \
    .map(lambda (x, y): x)
# Making Dataframe from RDD.
df = lines.map(lambda x: (x, )).toDF(['raw_sentence'])
# Tokenizing using spark ml API.
fl = Tokenizer(inputCol="raw_sentence", outputCol="words")
df = fl.transform(df).select("words")
# Removing stopwords. Note that I am converting counts to a list via a local iterator.
fl = StopWordsRemover(inputCol="words", outputCol="filtered")
fl.setStopWords(fl.getStopWords() + list(counts.toLocalIterator()))
df = fl.transform(df).select("filtered")
Initially, when I started the job, my servers A, B, and C were all utilising all their cores. But after some time B and C would not use any memory or cores, and at this stage the logs were the following:
17/09/08 20:31:54 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 172.31.35.55:45288 in memory (size: 25.0 KB, free: 6.2 GB)
17/09/08 20:31:54 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 172.31.44.209:39094 in memory (size: 25.0 KB, free: 6.2 GB)
17/09/08 20:31:54 INFO ContextCleaner: Cleaned accumulator 51
17/09/08 21:13:51 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 232069 ms exceeds timeout 120000 ms
17/09/08 21:26:15 ERROR TaskSchedulerImpl: Lost executor 2 on 172.31.44.209: Executor heartbeat timed out after 232069 ms
17/09/08 21:27:09 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=8270848140270032673, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]}} to /172.31.44.209:33418; closing connection
java.io.IOException: Broken pipe
Line 47 of my code is the second-to-last line of the code above, which is the following:
fl.setStopWords(fl.getStopWords() + list(counts.toLocalIterator()))
The custom configuration is:
SPARK_EXECUTOR_MEMORY=12G
The rest were defaults.
So why were the tasks at line 47 not running in a distributed way?
And why did it crash even though I had extra resources available, especially RAM?
RDD.toLocalIterator fetches a single partition at a time. So the execution pattern will be similar to this:
A single partition is computed. This may require activity from a single executor (no wide dependencies) or multiple executors.
Data is fetched to the driver and a local thread starts to iterate. The driver is active; the rest of the cluster is idle.
Once the end of the chunk is reached and there are more partitions to follow, the driver requests the next partition (go to 1).
Since you convert the iterator to a list anyway, you might as well collect. Memory consumption will be the same (and can likewise lead to failure), but all nodes will compute their parts without pauses.
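To make the difference concrete, here is a small self-contained Scala sketch of the two access patterns (the semantics are the same in PySpark); the data and partition count are arbitrary:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("iterator-vs-collect").master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 96)

// toLocalIterator: partitions are computed and shipped to the driver one at a time,
// so executors sit idle while the driver consumes each chunk.
val viaIterator = rdd.toLocalIterator.toList

// collect: all partitions are computed in parallel and fetched in one go.
// Peak driver memory is comparable, since the full list ends up on the driver either way.
val viaCollect = rdd.collect().toList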

cassandra spark connector read performance

I have some Spark experience but am just starting out with Cassandra. I am trying to do a very simple read and am getting really bad performance -- I can't tell why. Here is the code I am using:
sc.cassandraTable("nt_live_october","nt")
.where("group_id='254358'")
.where("epoch >=1443916800 and epoch<=1444348800")
.first
all 3 params are part of the key on the table:
PRIMARY KEY (group_id, epoch, group_name, auto_generated_uuid_field)
) WITH CLUSTERING ORDER BY (epoch ASC, group_name ASC, auto_generated_uuid_field ASC)
And the output I see from my driver is like this:
15/10/07 15:05:02 INFO CassandraConnector: Connected to Cassandra cluster: shakassandra
15/10/07 15:07:02 ERROR Session: Error creating pool to attila./198.xxx:9042
com.datastax.driver.core.ConnectionException: [attila./198.xxx:9042] Unexpected error during transport initialization (com.datastax.driver.core.OperationTimedOutException: [attila/198.xxx:9042] Operation timed out)
15/10/07 15:07:02 INFO SparkContext: Starting job: take at CassandraRDD.scala:121
15/10/07 15:07:03 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on osd09:39903 (size: 4.8 KB, free: 265.4 MB)
15/10/07 15:08:23 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 8) in 80153 ms on osd09 (1/1)
15/10/07 15:08:23 INFO DAGScheduler: ResultStage 6 (take at CassandraRDD.scala:121) finished in 80.958 s
15/10/07 15:08:23 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
15/10/07 15:08:23 INFO DAGScheduler: Job 5 finished: take at CassandraRDD.scala:121, took 81.043413 s
I expect this query to be really fast, yet it's taking over a minute. A few things jump out at me:
It takes almost two minutes to get the session error -- I pass the IPs of 3 nodes to Spark Cassandra connector -- is there a way to tell it to skip failed connections faster?
The task gets sent to a Spark worker which is not a Cassandra node -- this seems pretty strange to me -- is there a way to get information as to why the scheduler chose to send the task to a remote node?
Even if the task was sent to a remote node, the Input Size(Max) on that worker shows up as 334.0 B / 1 but the executor time is 1.3 min (see picture). This seems really slow -- I would expect time to be spent on deserialization, not compute...
Any tips on how to debug this, or where to look for potential problems, are much appreciated. I am using Spark 1.4.1 with connector 1.4.0-M3, Cassandra ReleaseVersion 2.1.9, and all defaults on the tunable connector params.
I think the problem lies in the distribution of data between partitions. Your table has a single partition key, group_id; epoch is only a clustering column. Data is distributed across the cluster nodes only by group_id, so you have one huge partition with group_id='254358' on one node of the cluster.
When you run your query, Cassandra reaches the partition with group_id='254358' very quickly, but then has to filter all of its rows to find the records with epoch between 1443916800 and 1444348800. If there are a lot of rows, the query will be really slow. In fact the query is not distributed at all; it will always run on a single node.
The better practice is to extract the date (or even the hour) and add it as part of the partition key; in your case something like:
PRIMARY KEY ((group_id, date), epoch, group_name, auto_generated_uuid_field)
WITH CLUSTERING ORDER BY (epoch ASC, group_name ASC, auto_generated_uuid_field ASC)
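With that composite partition key, the Spark-side query would restrict both partition-key columns; a hypothetical sketch (the date column name and its value are assumptions):

import com.datastax.spark.connector._ // same implicit used by the question's snippet

// Both partition-key columns (group_id, date) are restricted by equality, so Cassandra
// can go straight to a single, much smaller partition; epoch remains a clustering filter.
sc.cassandraTable("nt_live_october", "nt")
  .where("group_id = '254358' and date = '2015-10-04'")
  .where("epoch >= 1443916800 and epoch <= 1444348800")
  .first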
To verify my hypothesis you can run your current query in cqlsh with tracing turned on (read here how to do it). So the problem has nothing to do with Spark.
About the error and the time it takes to appear: everything is fine there, because you receive the error only after the timeout has elapsed.
Also, I recall the spark-cassandra-connector recommendation to place Spark workers on the Cassandra nodes themselves, precisely so that queries are distributed by partition key.
