cassandra spark connector read performance


I have some Spark experience but am just starting out with Cassandra. I am trying to do a very simple read and getting really bad performance, and I can't tell why. Here is the code I am using:
sc.cassandraTable("nt_live_october","nt")
.where("group_id='254358'")
.where("epoch >=1443916800 and epoch<=1444348800")
.first
All three parameters are part of the table's primary key:
PRIMARY KEY (group_id, epoch, group_name, auto_generated_uuid_field)
) WITH CLUSTERING ORDER BY (epoch ASC, group_name ASC, auto_generated_uuid_field ASC)
And the output I see from my driver is like this:
15/10/07 15:05:02 INFO CassandraConnector: Connected to Cassandra cluster: shakassandra
15/10/07 15:07:02 ERROR Session: Error creating pool to attila./198.xxx:9042
com.datastax.driver.core.ConnectionException: [attila./198.xxx:9042] Unexpected error during transport initialization (com.datastax.driver.core.OperationTimedOutException: [attila /198.xxx:9042] Operation timed out)
15/10/07 15:07:02 INFO SparkContext: Starting job: take at CassandraRDD.scala:121
15/10/07 15:07:03 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on osd09:39903 (size: 4.8 KB, free: 265.4 MB)
15/10/07 15:08:23 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 8) in 80153 ms on osd09 (1/1)
15/10/07 15:08:23 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 8) in 80153 ms on osd09 (1/1)
15/10/07 15:08:23 INFO DAGScheduler: ResultStage 6 (take at CassandraRDD.scala:121) finished in 80.958 s
15/10/07 15:08:23 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
15/10/07 15:08:23 INFO DAGScheduler: Job 5 finished: take at CassandraRDD.scala:121, took 81.043413 s
I expect this query to be really fast, yet it is taking over a minute. A few things jump out at me:
It takes almost two minutes to get the session error -- I pass the IPs of 3 nodes to Spark Cassandra connector -- is there a way to tell it to skip failed connections faster?
The task gets sent to a Spark worker which is not a Cassandra node -- this seems pretty strange to me -- is there a way to get information as to why the scheduler chose to send the task to a remote node?
Even if the task was sent to a remote node, the Input Size(Max) on that worker shows up as 334.0 B / 1 but the executor time is 1.3 min (see picture). This seems really slow -- I would expect time to be spent on deserialization, not compute...
Any tips on how to debug this or where to look for potential problems would be much appreciated. I am using Spark 1.4.1 with connector 1.4.0-M3 and Cassandra ReleaseVersion 2.1.9, with all defaults on the tunable connector parameters.

I think the problem lies in how the data is distributed across partitions. Your table has a single partition key, group_id; epoch is only a clustering column. Data is distributed across the cluster nodes by group_id alone, so all rows with group_id='254358' sit in one huge partition on one node of the cluster.
When you run your query, Cassandra locates the partition with group_id='254358' very quickly and then filters its rows to find the records with epoch between 1443916800 and 1444348800. If the partition holds a lot of rows, the query will be really slow. In fact, this query is not distributed at all; it will always run on a single node.
A better practice is to extract the date, or even the hour, and add it to the partition key. In your case, something like:
PRIMARY KEY ((group_id, date), epoch, group_name, auto_generated_uuid_field)
WITH CLUSTERING ORDER BY (epoch ASC, group_name ASC, auto_generated_uuid_field ASC)
To verify my hypothesis, you can run your current query in cqlsh with tracing turned on (TRACING ON). If the trace confirms it, the problem has nothing to do with Spark.
As for the error and the time it takes to appear, that is expected: you only receive the error once the timeout has elapsed.
I also remember the spark-cassandra-connector recommending that Spark workers be co-located with the Cassandra nodes, precisely so that queries can be distributed by partition key.
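A minimal sketch of how the extra date component of the partition key could be derived, assuming epoch is in seconds and that one bucket per UTC day is wanted. DateBuckets, dayBucket and dayBuckets are hypothetical names for illustration, not part of Cassandra or the connector. With a key like ((group_id, date), ...), a range read has to visit every bucket the range touches, so the sketch also lists those buckets.

```scala
import java.time.{Instant, LocalDate, ZoneOffset}

object DateBuckets {
  // UTC day that an epoch-seconds timestamp falls into (assumed bucket size: one day).
  def dayBucket(epochSeconds: Long): LocalDate =
    Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC).toLocalDate

  // All day buckets touched by the inclusive range [from, to], as ISO dates.
  def dayBuckets(from: Long, to: Long): List[String] = {
    val first = dayBucket(from)
    val last = dayBucket(to)
    Iterator.iterate(first)(_.plusDays(1))
      .takeWhile(d => !d.isAfter(last))
      .map(_.toString)
      .toList
  }
}
```

For the epoch range in the question, DateBuckets.dayBuckets(1443916800L, 1444348800L) yields six buckets, 2015-10-04 through 2015-10-09, and each one could then be read as its own partition.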

Related

Spark standalone application implements PCA, then hangs for 10-12 minutes and only then removes RDD from memory

I have a 16-node cluster where every node has Spark and Cassandra installed, with a replication factor of 3, spark.sql.shuffle.partitions set to 96, and Spark-Cassandra-Connector 3.1.0. I am doing a Spark join (broadcastHashJoin) between a dataset and a Cassandra table and then running PCA from the Spark ML library. In between, I persist a dataset, and I unpersist it only after the PCA computations have finished. According to the Stages tab of the Spark UI, everything finishes in less than 10 minutes and generally no executor is doing anything,
but the persisted dataset stays persisted for another 10-12 minutes, as the Storage tab of the Spark UI shows.
These are the last lines of stderr from one of the nodes, where you can see a gap of roughly 10 minutes between the final lines:
22/09/15 11:41:09 INFO MemoryStore: Block taskresult_1436 stored as bytes in memory (estimated size 89.3 MiB, free 11.8 GiB)
22/09/15 11:41:09 INFO Executor: Finished task 3.0 in stage 33.0 (TID 1436). 93681153 bytes result sent via BlockManager)
22/09/15 11:51:49 INFO BlockManager: Removing RDD 20
22/09/15 12:00:24 INFO BlockManager: Removing RDD 20
While in the main console where the application runs I only get:
1806703 [dispatcher-BlockManagerMaster] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_1_piece0 on 192.168.100.237:46523 in memory (size: 243.7 KiB, free: 12.1 GiB)
1806737 [block-manager-storage-async-thread-pool-75] INFO org.apache.spark.storage.BlockManager - Removing RDD 20
If I try to print the dataset after PCA is complete and before I unpersist it, then it still takes ~20 minutes, then it prints it and then unpersists it. Why? Would that have to do maybe with the query and the Cassandra table?
I have not enabled MLlib Linear Algebra Acceleration because I am on Ubuntu 20.04, which has incompatibility issues with libgfortran5 and so on, and I am also not sure it would help. I am not sure where to look, or what to look for, in order to reduce these 20 minutes to 10. Any ideas what might be happening? Let me know if you want any more information.

It seems that activating the Linear Algebra Acceleration libraries of Apache Spark ML does make a difference! It reduced the PCA calculation time by 10 minutes, so no more Spark hanging!

Spark: driver logs showing "thread spilling sort data to disk"

Could somebody help me understand the possible reasons for the lines below appearing in the Spark job logs?
2018-06-11T05:35:46,181 - INFO [Executor task launch worker for task 328:Logging$class#54] - TID 328 waiting for at least 1/2N of on-heap execution pool to be free
2018-06-11T05:35:46,182 - INFO [Executor task launch worker for task 329:UnsafeExternalSorter#202] - Thread 151 spilling sort data of 50.0 MB to disk (20 times so far)
2018-06-11T05:35:46,188 - INFO [Executor task launch worker for task 322:UnsafeExternalSorter#202] - Thread 176 spilling sort data of 33.0 MB to disk (27 times so far)
How the Spark program works:
Query the database and cache the whole table (2 GB is cached).
Select the records for one country at a time, out of three (Denmark, India, New Zealand).
Break the dataframe into 500 pieces and pass each piece to a map function, which builds the JSON for the set of records in that piece and sends it to the search server.
The map is applied to a parallel collection (Vector) so that the pieces are processed in parallel and can be sent to the search server for indexing concurrently.
I am a newbie in Spark, so please help me understand which part of the configuration I should look at to stop this spilling. The Spark version is 2.1.1.

Based on the log, you sort the data.
During the sort there is not enough memory to store auxiliary data structures for shuffle in memory.
Therefore Spark spills data to disk.
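The "waiting for at least 1/2N of on-heap execution pool" line reflects how execution memory is shared among running tasks: as far as I understand it, with N active tasks each task is guaranteed at least 1/(2N) of the pool before it is forced to spill, and can hold at most 1/N. A small arithmetic sketch of those bounds follows; ExecutionMemoryShare is a hypothetical helper, and the pool size and task count are made-up inputs, not values read from your job.

```scala
object ExecutionMemoryShare {
  // Per-task bounds on execution memory when `activeTasks` tasks share one pool.
  final case class Bounds(minBytes: Long, maxBytes: Long)

  def bounds(poolBytes: Long, activeTasks: Int): Bounds = {
    require(activeTasks > 0, "need at least one active task")
    // Guaranteed share is 1/(2N) of the pool; the cap is 1/N.
    Bounds(poolBytes / (2L * activeTasks), poolBytes / activeTasks)
  }
}
```

For example, a 1200 MB pool shared by 8 concurrent tasks gives each task between 75 MB and 150 MB, so running fewer tasks per executor or giving the executor more memory raises both bounds.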

This log means there isn't enough memory for the task's computation, so data is spilled to disk, which is an expensive operation.
If you see this log in only one or a few executor tasks, it indicates data skew; you may need to find the skewed keys and preprocess them.
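A rough sketch of how skewed keys could be spotted, assuming you can pull a sample of the keys to the driver. SkewCheck and heavyKeys are hypothetical names, and the threshold is an arbitrary choice.

```scala
object SkewCheck {
  // Keys whose share of all sampled records exceeds `threshold` (0.0 to 1.0), largest share first.
  def heavyKeys[K](keys: Seq[K], threshold: Double): Seq[(K, Double)] = {
    val total = keys.size.toDouble
    keys.groupBy(identity)
      .map { case (k, occurrences) => k -> occurrences.size / total }
      .filter { case (_, share) => share > threshold }
      .toSeq
      .sortBy { case (_, share) => -share }
  }
}
```

On a pair RDD the input could come from something like rdd.keys.sample(false, 0.01).collect(), where the sampling fraction is again an assumption. A key holding most of the sample is a candidate for separate handling.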

org.apache.spark.shuffle.FetchFailedException

I am running this query on a data size of 4 billion rows and getting an org.apache.spark.shuffle.FetchFailedException error.
select adid,position,userid,price
from (
select adid,position,userid,price,
dense_rank() OVER (PARTITION BY adlocationid ORDER BY price DESC) as rank
FROM trainInfo) as tmp
WHERE rank <= 2
I have attached the error logs from the spark-sql terminal. Please suggest what causes this kind of error and how I can resolve it.

The problem is that you lost an executor:
15/08/25 10:08:13 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 165758 ms exceeds timeout 120000 ms
15/08/25 10:08:13 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.1.223: Executor heartbeat timed out after 165758 ms
The exception occurs when trying to read shuffle data from that node. The node may be in a very long GC pause (maybe try using a smaller heap size for the executors), or there may have been a network failure or an outright crash. Normally Spark should recover from lost nodes like this one, and indeed it starts resubmitting the first stage to another node. Depending on how big your cluster is, it may or may not succeed.
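To check the GC hypothesis, one option is to turn on GC logging for the executors and try a smaller heap at the same time. This is only a sketch: the 8g heap is an assumed value to adjust for your cluster, and the GC flags shown are the Java 8 style ones (newer JVMs use -Xlog:gc instead).

```scala
import org.apache.spark.SparkConf

// Sketch only: the memory size is an assumption, not a recommendation.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g") // smaller heap, so full GCs have less to scan
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps") // GC activity goes to executor stdout
```

If the executor logs then show multi-minute pauses around the time the heartbeat was lost, GC is the likely culprit rather than the network.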

Severe straggler tasks due to Locality Level being "Any" and a Network Fetch on cached RDD

A cached dataset that has been completely read through - successfully - is being reprocessed. A small number (typically 2/204 tasks - 1%) of the tasks may fail on a subsequent pass over the same (still cached) dataset. We are on spark 1.3.1.
The following screenshot shows that - of 204 tasks - the last two seem to have been 'forgotten' by the scheduler.
Is there any way to get more information about these tasks that are in limbo?
All of the other tasks completed within a reasonable fraction of similar time: in particular, the 75th percentile is still within 50% of the median. It is just these last two stragglers that are killing the entire job completion time. Notice also that these are not due to record count skew.
Update: The two stragglers did finally finish, at over 7 minutes (more than 3x longer than any of the other 202 tasks)!
15/08/15 20:04:54 INFO TaskSetManager: Finished task 201.0 in stage 2.0 (TID 601) in 133583 ms on x125 (202/204)
15/08/15 20:09:53 INFO TaskSetManager: Finished task 189.0 in stage 2.0 (TID 610) in 423230 ms on i386 (203/204)
15/08/15 20:10:05 INFO TaskSetManager: Finished task 190.0 in stage 2.0 (TID 611) in 435459 ms on i386 (204/204)
15/08/15 20:10:05 INFO DAGScheduler: Stage 2 (countByKey at MikeFilters386.scala:76) finished in 599.028 s
Suggestions on what to look for or review are appreciated.
Another update: The TYPE has turned out to be Network for those two. What does that mean?

I had a similar issue. Try increasing spark.locality.wait.
If that works, the following might apply to you:
https://issues.apache.org/jira/browse/SPARK-13718#
** ADDED **
Some extra information that I found helpful.
Spark will always initially try to assign a task to the executor that holds the respective cached RDD partition.
If the task is not accepted within the locality timeouts defined in the Spark config, it will then try NODE_LOCAL, RACK_LOCAL and ANY, in that order.
Regardless of whether the underlying data is available locally (HDFS replicas), Spark will always fetch the cached partition from the node that holds it. It will only recompute the partition if that executor has crashed and the RDD is therefore no longer cached. In many cases this causes a network bottleneck on the original straggler node as well.

Have you tried using Spark speculation (spark.speculation true)? Spark will identify these stragglers and relaunch them on another node.
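A minimal sketch of how the two settings mentioned in these answers could be applied when building the context. The 10-second wait is an assumed value for illustration; it is written as plain milliseconds because, as far as I know, that form is accepted by older and newer Spark versions alike.

```scala
import org.apache.spark.SparkConf

// Sketch only: tune the wait to your own task durations.
val conf = new SparkConf()
  .set("spark.locality.wait", "10000") // milliseconds to wait for a more local slot (assumed value)
  .set("spark.speculation", "true")    // relaunch unusually slow tasks on another executor
```

The same keys can also be passed on the command line with --conf key=value when submitting the job.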

spark saveAsTextFile last partition (almost?) never finishes

I have a very simple word-count-like program that generates (Long, Double) counts like this:
val lines = sc.textFile(directory)
lines.repartition(600).mapPartitions{lineIterator =>
// Generate iterator of (Long,Double) counts
}
.reduceByKey(new HashPartitioner(30), (v1, v2) => v1 + v2).saveAsTextFile(outDir, classOf[GzipCodec])
My problem: The last of the 30 partitions never gets written.
Here are a few details:
My input is 5 GB gz-compressed and I expect about 1B unique Long keys.
I run on a 32 core 1.5TB machine. Input and output come from a local disk with 2TB free. Spark is assigned to use all the ram and happily does so. This application occupies about 0.5 TB.
I can observe the following:
For 29 partitions the reduce and repartition (because of the HashPartitioner) takes about 2h. The last one does not finish, not even after a day. Two to four threads stay on 100%.
No error or warning appears in the log
Spark occupies about 100GB in /tmp which aligns with what the UI reports for shuffle write.
In the UI I can see the number of "shuffle read records" growing very, very slowly for the remaining task. After one day, it is still an order of magnitude away from what all the finished tasks show.
The last log looks like this:
15/08/03 23:26:43 INFO SparkHadoopWriter: attempt_201508031748_0002_m_000020_748: Committed
15/08/03 23:26:43 INFO Executor: Finished task 20.0 in stage 2.0 (TID 748). 865 bytes result sent to driver
15/08/03 23:27:50 INFO FileOutputCommitter: Saved output of task 'attempt_201508031748_0002_m_000009_737' to file:/output-dir/_temporary/0/task_201508031748_0002_m_000009
15/08/03 23:27:50 INFO SparkHadoopWriter: attempt_201508031748_0002_m_000009_737: Committed
15/08/03 23:27:50 INFO Executor: Finished task 9.0 in stage 2.0 (TID 737). 865 bytes result sent to driver
15/08/04 02:44:54 INFO BlockManager: Removing broadcast 3
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_3_piece0
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_3_piece0 of size 2009 dropped from memory (free 611091153849)
15/08/04 02:44:54 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_3
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_3 of size 3336 dropped from memory (free 611091157185)
15/08/04 02:44:54 INFO BlockManager: Removing broadcast 4
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_4_piece0
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_4_piece0 of size 2295 dropped from memory (free 611091159480)
15/08/04 02:44:54 INFO BlockManagerMaster: Updated info of block broadcast_4_piece0
15/08/04 02:44:54 INFO BlockManager: Removing block broadcast_4
15/08/04 02:44:54 INFO MemoryStore: Block broadcast_4 of size 4016 dropped from memory (free 611091163496)
Imagine the first five lines repeated for 28 other partitions within a two minute time frame.
I have tried several things:
Spark 1.3.0 and 1.4.0
nio instead of netty
flatMap instead of mapPartitions
Just 30 instead of 600 input partitions
Still, I never get the last 1/30 of my data out of spark. Did anyone ever observe something similar? These two posts here and here seem to describe similar problems but no solution.
UPDATE
The task that never finishes is always the first task of the reduceKey+writeToTextFile. I have also removed the HashPartitioner and even tried on a bigger cluster with 400 cores and 6000 partitions. Only 5999 finish successfully, the last runs forever.
The UI shows for all tasks something like
Shuffle Read Size / Records: 20.0 MB / 1954832
but for the first it shows (at the moment)
Shuffle Read Size / Records: 150.1 MB / 711836
Numbers still growing....

It might be that your keys are very skewed. Depending on how they are distributed (or if you have a null or default key), a significant amount of the data might be going to a single executor and be no different than running in your local machine (plus overhead of a distributed platform). It might even be causing that machine to swap to disk, becoming intolerably slow.
Try using aggregateByKey instead of reduceByKey, since it will attempt to compute partial sums distributed across executors instead of shuffling the whole (potentially large) set of key-value pairs to a single executor. And maybe avoid fixing the number of output partitions at 30, just in case.
Edit: It is hard to diagnose a problem that amounts to "it just does not finish". One thing you can do is to introduce a timeout:
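import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
val result = Await.result(Future {
  // Your normal computation
}, timeout) // timeout: a Duration of your choosing, e.g. 2.hours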
That way, whatever task is taking too long, you can detect it and gather some metrics on the spot.
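To check the skew hypothesis without waiting for the job, here is a sketch that mimics the rule Spark's HashPartitioner uses (non-negative key.hashCode modulo the partition count, with null keys going to partition 0) and counts how many sampled keys land in each partition. PartitionHistogram and its methods are hypothetical names, and the keys would have to come from a sample of your own data.

```scala
object PartitionHistogram {
  // Partition index for a key: non-negative hashCode modulo numPartitions; null maps to 0.
  def partitionOf(key: Any, numPartitions: Int): Int =
    if (key == null) 0
    else {
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
    }

  // Number of sampled records that would be routed to each partition.
  def recordsPerPartition(keys: Seq[Any], numPartitions: Int): Map[Int, Int] =
    keys.groupBy(k => partitionOf(k, numPartitions)).map { case (p, ks) => p -> ks.size }
}
```

Since the stuck task is reported to be the first one, it may be worth checking whether a default key such as 0 dominates the sample: a Long key of 0 hashes to partition 0 under this rule. That is only a guess to test, not a diagnosis.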
