spark-cassandra-connector performance: executors seem to be idle

On our 40-node cluster (33 Spark executors / 5 Cassandra nodes), we are inserting with Spark Streaming about 20,000 rows per minute (among other things) into a Cassandra table (with .saveToCassandra).
The result we get is:
If I understand things correctly, executors S3, S14 and S19 are idle 75% of the time and prevent the stage from finishing... Such a waste of resources! And a performance loss.
Here are my conf options for my SparkContext:
.set("spark.cassandra.output.batch.size.rows", "5120")
.set("spark.cassandra.output.concurrent.writes", "100")
.set("spark.cassandra.output.batch.size.bytes", "100000")
.set("spark.cassandra.connection.keep_alive_ms","60000")
Is this behavior normal? If not, should I tune the above settings to avoid it?
Does the problem come from the spark-cassandra-connector writes or is it something else?

At first glance I doubt this is a Cassandra connector problem. We are currently doing .saveToCassandra with 300,000 records per minute and smaller clusters.
If it were .saveToCassandra taking a long time, you'd tend to see long tasks. What you're seeing is unexplained(?) gaps between tasks.
It's going to take a good bit more information to track this down. Start on the Jobs tab - do you see any jobs taking a long time? Drill down, what do you see?

Related

Spark -Optimizing long running Job

We have a Spark job that's taking a long time to complete. I looked at the Spark Web UI and I see a lot of shuffling. A couple of things I tried, but no luck so far: increased spark.sql.shuffle.partitions (tried 320, 640 and 1600), the number of executors (8) and memory (10/12 GB) with 4 cores, but no significant improvement. I'd appreciate any guidance on the below:
1) When I look at the event timeline in the Spark Web UI, only one executor is doing most of the processing and I don't see any significant activity on the rest.
2) In the metrics, there is a big difference in shuffle spill between the 75th percentile and the max.
Any pointers on how to investigate further would be of great help! I'm basically looking for documentation on the event timeline, since a single executor is performing the bulk of the work, and on how to use the metrics to fix the performance issue by adjusting the Spark configuration parameters, if that's an option.
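One way to confirm whether this is data skew rather than a configuration problem is to count rows per partition and per join key before changing spark.sql.shuffle.partitions again. A minimal sketch, assuming a placeholder table name and join key ("customer_id"):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical names: the table and the key "customer_id" stand in for
// whatever the real query uses.
val spark = SparkSession.builder().appName("skew-check").getOrCreate()
val df = spark.table("my_source_table")

// Rows per partition: one busy executor usually means a few partitions
// hold most of the rows.
df.groupBy(spark_partition_id().alias("partition"))
  .count()
  .orderBy(desc("count"))
  .show(20)

// Rows per join/group key: a single hot key gives the same symptom, and
// raising spark.sql.shuffle.partitions will not help in that case.
df.groupBy(col("customer_id")).count().orderBy(desc("count")).show(20)

// If the data is merely unevenly partitioned (no single hot key), an explicit
// repartition on the key spreads the work across executors.
val balanced = df.repartition(640, col("customer_id"))

If one key dominates, raising shuffle partitions alone will not help; if the partitions are merely uneven, an explicit repartition on the key usually evens out the timeline.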

long scheduler Delay in Spark UI

I am running PySpark jobs on a 2.3.0 cluster on YARN.
I see that all the stages have a very long scheduler delay.
BUT it is just the max time; the 75th percentile is 28 ms...
All the other time metrics are very low (GC time, task deserialization, etc.).
There is almost no shuffle write size.
The locality changes between mostly node local, process local and rack local.
What can be the reason for such a long scheduler delay?
Is it YARN, or just missing resources to run the tasks?
Will increasing/decreasing partitions help this issue?
Answering my own question in case somebody has the same issue: it appeared to be related to skewed data that caused long delays. That was caused by using coalesce instead of repartition on the data, which divided the data unevenly.
On top of that I also cached the data frame after partitioning, so the processing ran locally (PROCESS_LOCAL) and not NODE_LOCAL or RACK_LOCAL.
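For anyone hitting the same thing, a minimal sketch of the fix described above, with a placeholder DataFrame and partition count (the question used PySpark; the equivalent Scala calls are shown here to match the rest of the thread):

import org.apache.spark.sql.SparkSession

// `df` is a stand-in for the DataFrame produced earlier in the job.
val spark = SparkSession.builder().appName("repartition-vs-coalesce").getOrCreate()
val df = spark.table("my_table")

// coalesce(n) only merges existing partitions (no shuffle) and can leave
// them very uneven; repartition(n) does a full shuffle and balances them.
val lopsided = df.coalesce(50)
val balanced = df.repartition(50)

// Caching after the repartition pins the balanced layout in executor memory,
// so downstream tasks tend to run PROCESS_LOCAL rather than NODE_LOCAL.
val ready = df.repartition(50).cache()
ready.count()  // materialize the cache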

Processing Pipeline using Spark SQL- jobs, stages and DAG sizes

I have a processing pipeline that is built using Spark SQL. The objective is to read data from Hive in the first step and apply a series of functional operations (using Spark SQL) in order to achieve the functional output. Now, these operations are quite numerous (more than 100), which means I am running around 50 to 60 Spark SQL queries in a single pipeline. While the application completes successfully without any issues, my focus has shifted to optimizing the overall process. I have been able to speed up the executions using spark.sql.shuffle.partitions, changing the executor memory and reducing spark.memory.fraction from the default 0.6 to 0.2. I got great benefits from all these changes and the overall execution time reduced from 20-25 minutes to around 10 minutes. Data volume is around 100k rows (source side).
The observations that I have from the Cluster are:
-The number of jobs triggered as part of the application ID is 235.
-The total number of stages across all the jobs created are around 600.
-8 executors are used in a two node cluster (64 GB RAM in total with 10 cores).
-The YARN ResourceManager UI (for an application ID) becomes very slow when retrieving the details of jobs/stages.
In one of the videos on Spark tuning, I heard that we should try to reduce the number of stages to a bare minimum and that the DAG should be kept small. What are the guidelines for doing this? How can I find the number of shuffles that are happening (my SQLs have many joins and group-by clauses)?
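One rough way to count the shuffles is to look at each query's physical plan: every Exchange node is a shuffle, and joins or GROUP BYs on data that is not already co-partitioned each introduce one. A minimal sketch, with placeholder table names and query:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plan-inspection").getOrCreate()

// Hypothetical query standing in for one step of the pipeline.
val step = spark.sql(
  """SELECT o.customer_id, COUNT(*) AS orders
     FROM orders o JOIN customers c ON o.customer_id = c.id
     GROUP BY o.customer_id""")

// Every "Exchange" line in the physical plan is a shuffle.
step.explain()

// Rough programmatic count (string match on the formatted plan).
val numShuffles = step.queryExecution.executedPlan.toString()
  .split("\n").count(_.contains("Exchange"))
println(s"shuffles in this step: $numShuffles")

The SQL tab of the Spark UI shows the same Exchange nodes graphically per query, which is often easier than the jobs/stages view when there are hundreds of stages.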
I would like suggestions on the above scenario about what I can do to improve the performance and handle the data skews in the SQL queries that are JOIN/GROUP BY heavy.
Thanks

Is my understanding of spark partitioning correct?

I'd like to know If my understanding of the partitioning in Spark is correct.
I always thought about the number of partitions and their size and never about the worker they were processed by.
Yesterday, as I was playing a bit with partitioning, I found out that I was able to track the cached partitions' location using the WEB UI (Storage -> Cached RDD -> Data Distribution) and it surprised me.
I have a cluster of 30 cores (3 cores * 10 executors) and I had an RDD of about 10 partitions. I tried to expand it to 100 partitions to increase the parallelism, only to find out that almost 90% of the partitions were on the same worker node, and thus my parallelism was not limited by the total number of CPUs in my cluster but by the number of CPUs on the node containing 90% of the partitions.
I tried to find answers on stackoverflow and the only answer I could come by was about data locality. Spark detected that most of my files were on this node so it decided to keep most of the partitions on this node.
Is my understanding correct?
And if it is, is there a way to tell Spark to really shuffle the data?
So far this "data locality" has led to heavy underutilization of my cluster...
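If the goal is simply to force an even spread regardless of where the input blocks live, repartition() always performs a full shuffle; a minimal sketch, assuming a placeholder input path:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical input path; stands in for whatever produced the original 10 partitions.
val sc = new SparkContext(new SparkConf().setAppName("spread-partitions"))
val rdd = sc.textFile("hdfs:///data/input")

// repartition(n) always does a full shuffle, so the 100 partitions get
// distributed across executors instead of staying where the input blocks live.
val spread = rdd.repartition(100).persist(StorageLevel.MEMORY_ONLY)
spread.count()  // materialize, then re-check Storage -> Data Distribution

// By contrast, coalesce() without a shuffle only merges partitions in place,
// so it keeps the data on the node(s) that held the original blocks.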

Does Spark incur the same amount of overhead as Hadoop for vnodes?

I just read https://stackoverflow.com/a/19974621/260805. Does Spark (specifically Datastax's Cassandra Spark connector) incur the same amount of overhead as Hadoop when reading from a Cassandra cluster? I know Spark uses threads more heavily than Hadoop does.
Performance with vnodes and without should be basically the same in the connector. With Hadoop, each vnode split generated its own task, which created a large amount of overhead.
With Spark, the token ranges from multiple vnodes are merged into a single task, so the overall task overhead is lower. There is a slight locality issue where it becomes difficult to get a balanced number of tasks across all the nodes in the C* cluster with smaller data sizes. This issue is being worked on in SPARKC-43.
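For completeness, the read-side knob that controls how many token ranges get packed into one Spark partition is the input split size. A minimal sketch with placeholder keyspace/table names; the exact property name (spark.cassandra.input.split.size_in_mb here) has varied across connector versions, so check the docs for the version in use:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Placeholder names; the split-size property is an assumption based on the
// documented setting in recent connector versions.
val conf = new SparkConf()
  .setAppName("cassandra-read")
  .set("spark.cassandra.connection.host", "cassandra-host")
  .set("spark.cassandra.input.split.size_in_mb", "64")

val sc = new SparkContext(conf)

// One Spark partition here typically covers many vnode token ranges,
// which is why the per-vnode overhead is much lower than with Hadoop.
val rows = sc.cassandraTable("my_keyspace", "my_table")
println(rows.partitions.length)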
I'll give three separate answers. I apologize for the rather unstructured answer, but it's been building up over time:
A previous answer:
Here's one potential answer: Why not enable virtual node in an Hadoop node? I quote:
Does this also apply to Spark?
No, if you're using the official DataStax spark-cassandra-connector. It can process multiple token ranges in a single Spark task. There is still some minor performance hit, but not as huge as with Hadoop.
A production benchmark
We ran a Spark job against a vnode-enabled Cassandra (DataStax Enterprise) datacenter with 3 nodes. The job took 9.7 hours. Running the same job on slightly less data, using 5 non-vnode nodes, a couple of weeks back took 8.8 hours.
A controlled benchmark
To further test the overhead we ran a controlled benchmark on a DataStax Enterprise node in a single-node cluster. For both vnodes enabled and disabled, the node was 1) reset, 2) X rows were written and then 3) SELECT COUNT(*) FROM emp was executed in Shark a couple of times to get cold vs. hot cache times. The values of X tested were 10^0 through 10^8.
Assuming that Shark is not dealing with vnodes in any way, the average (quite stable) overhead for vnodes was ~28 seconds for cold Shark query executions and 17 seconds for hot executions. The latency difference generally did not vary with data size.
All the numbers for the benchmark can be found here. All scripts used to run the benchmark (see output.txt for usage) can be found here.
My only guess why there was a difference between "Cold diff" and "Hot diff" (see spreadsheet) is that it took Shark some time to create metadata, but this is simply speculation.
Conclusion
Our conclusion is that the overhead of vnodes is a constant time between 13 and 30 seconds, independent of data size.
