Does Spark incur the same amount of overhead as Hadoop for vnodes? - apache-spark

I just read https://stackoverflow.com/a/19974621/260805. Does Spark (specifically Datastax's Cassandra Spark connector) incur the same amount of overhead as Hadoop when reading from a Cassandra cluster? I know Spark uses threads more heavily than Hadoop does.

Performance with and without vnodes in the connector should be basically the same. With Hadoop, each vnode split generated its own task, which created a large amount of overhead.
With Spark, the token ranges from multiple vnodes are merged into a single task, so the overall task overhead is lower. There is a slight locality issue where it becomes difficult to get a balanced number of tasks across all the nodes in the C* cluster with smaller data sizes. This issue is being worked on in SPARKC-43.
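For illustration, here is a minimal sketch of reading a table through the DataStax connector while tuning how many token ranges are grouped into a single Spark task. The keyspace, table and host are hypothetical, and the exact name of the split-size property varies between connector versions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setAppName("vnode-read-sketch")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  // Larger splits mean fewer Spark tasks, each covering more token ranges.
  .set("spark.cassandra.input.split.sizeInMB", "128")

val sc = new SparkContext(conf)

// A single Spark task can now cover token ranges from several vnodes.
val rows = sc.cassandraTable("my_keyspace", "my_table")
println(rows.count())
```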

I'll give three separate answers. Apologies for the rather unstructured answer; it has been building up over time:
A previous answer:
Here's one potential answer: Why not enable virtual node in an Hadoop node? I quote:
Does this also apply to Spark?
No, if you're using the official DataStax spark-cassandra-connector. It can process multiple token ranges in a single Spark task. There is still some minor performance hit, but not as huge as with Hadoop.
A production benchmark
We ran a Spark job against a vnode-enabled Cassandra (DataStax Enterprise) datacenter with 3 nodes. The job took 9.7 hours. Running the same job for slightly less data, using 5 non-vnode nodes, a couple of weeks earlier took 8.8 hours.
A controlled benchmark
To further test the overhead, we ran a controlled benchmark on a DataStax Enterprise node in a single-node cluster. For both vnodes enabled and disabled, the node was 1) reset, 2) X rows were written and 3) SELECT COUNT(*) FROM emp was executed in Shark a couple of times to get cold and hot cache times. The values of X tested were 10^0 through 10^8.
Assuming that Shark does not deal with vnodes in any way, the average (quite stable) overhead for vnodes was ~28 seconds for cold Shark query executions and 17 seconds for hot executions. The latency difference generally did not vary with data size.
All the numbers for the benchmark can be found here. All scripts used to run the benchmark (see output.txt for usage) can be found here.
My only guess as to why there was a difference between "Cold diff" and "Hot diff" (see the spreadsheet) is that it took Shark some time to create metadata, but this is pure speculation.
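For reference, a minimal sketch of the cold-vs-hot timing pattern, expressed in present-day Spark SQL rather than Shark (assumes a spark-shell style SparkSession named spark and the emp table from the benchmark; everything else is illustrative):

```scala
// Time an arbitrary block and print the elapsed wall-clock seconds.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

// The first run hits a cold cache, the second a warm one.
time("cold") { spark.sql("SELECT COUNT(*) FROM emp").collect() }
time("hot")  { spark.sql("SELECT COUNT(*) FROM emp").collect() }
```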
Conclusion
Our conclusion is that the vnode overhead is a constant 13 to 30 seconds, independent of data size.

Related

How do you efficiently bucket/partition on a shared cluster that autoscales?

Edit: Using Spark with Databricks
As far as I understand, effective partitioning should be based on the number of executors available, ideally partitions % executors = 0
But if you work on a shared Spark cluster that autoscales according to activity, and in which people may be keeping some executors busy with their own work, is it possible to efficiently partition and bucket in this way?
Say I notice there are 8 executors active on the cluster, so I make 8 partitions or buckets to distribute the workload more easily. While that's happening, Alice and Jane log on and start running big queries, so the cluster upscales to, say, 12 executors.
Now I'm no longer efficiently partitioned. Or what if the cluster doesn't upscale, but Alice and Jane take up some executors? Now my partitions will be skewed, right?
Or... will Spark recognise that I have 8 partitions, and upscale as needed to match that if enough aren't immediately available?
The rule partitions % executors = 0 is about efficient processing, so that you don't end up with fewer partitions than executors at some point in time. In reality, things are more complicated: partitions can be small and are then automatically coalesced when Adaptive Query Execution (AQE) kicks in, combining multiple small partitions into bigger logical partitions, etc. This is one of the recommended "optimizations" on Spark 3.x: set shuffle partitions to some big number and let AQE optimize it, instead of ending up with partitions that are too big.
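A minimal sketch of that Spark 3.x approach (the values are illustrative, not recommendations):

```scala
// Enable AQE and let it coalesce small shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
// Deliberately high; AQE merges small partitions so you don't hand-tune this.
spark.conf.set("spark.sql.shuffle.partitions", "2000")
```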
Yes, on a shared cluster some resources may be consumed by other users, but that just means fewer cores are allocated to your processing; it does not skew your partitions. Skewed partitions are primarily a matter of partitions of different sizes, and that, too, should be handled by AQE, which is enabled on DBR 7.3+.
Overall: yes, on shared clusters some resources will be taken by other users, but otherwise it's better to rely on the Spark 3.x improvements around automatic optimization. Previous versions required a lot of manual tuning that isn't needed in newer versions.

How to decide the number of partitions in Spark (running on YARN) based on executors, cores and memory

How do you decide the number of partitions in Spark (running on YARN) based on executors, cores and memory?
As I am new to Spark, I don't have much hands-on experience with real scenarios.
I know there are many things to consider when deciding on partitioning, but a detailed explanation of a general production scenario would still be very helpful.
Thanks in advance
One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster.
The number of partitions is recommended to be 2-4 times the number of cores. So if you have 7 executors with 5 cores each, you can repartition to between 7*5*2 = 70 and 7*5*4 = 140 partitions.
https://spark.apache.org/docs/latest/rdd-programming-guide.html
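As a sketch of that rule of thumb (the executor and core counts are just the example's assumed values, and df stands for any existing DataFrame):

```scala
val executors = 7
val coresPerExecutor = 5
val minPartitions = executors * coresPerExecutor * 2   // 70
val maxPartitions = executors * coresPerExecutor * 4   // 140

// Pick something in that range, e.g. 3 partitions per core:
val repartitioned = df.repartition(executors * coresPerExecutor * 3)  // 105
```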
IMO, with Spark 3.0 and AWS EMR 2.4.x with adaptive query execution, you're often better off letting Spark handle it. If you do want to hand-tune it, the answer can often be complicated. One good option is to have 2 or 4 times the number of CPUs available. While this is useful for most data sizes, it becomes problematic with very large and very small datasets. In those cases it's useful to aim for ~128 MB per partition.
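And a sketch of the ~128 MB-per-partition heuristic for those very-large or very-small cases (the dataset size is assumed knowledge about your input, not something Spark measures here; df again stands for any existing DataFrame):

```scala
val totalSizeBytes = 50L * 1024 * 1024 * 1024        // e.g. a 50 GB input
val targetPartitionBytes = 128L * 1024 * 1024        // ~128 MB per partition
val numPartitions = math.max(1, (totalSizeBytes / targetPartitionBytes).toInt)

val sized = df.repartition(numPartitions)
```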

How to speed up the node joining process in a Cassandra cluster

I have a cluster of 4 Cassandra nodes. I have recently added a new node, but data processing is taking too long. Is there a way to make this process faster? Output of nodetool:
Less data per node. Your screenshot shows 80TB per node, which is insanely high.
The recommendation is 1TB per node, 2TB at most. The logic behind this is that bootstrap times get too high (as you have noticed). A good Cassandra ring should be able to recover rapidly from node failure. What happens if other nodes fail while the first one is rebuilding?
Keep in mind that the typical model for Cassandra is lots of smaller nodes, in contrast to SQL where you would have a few really powerful servers. (Scale out vs scale up)
So, I would fix the problem by growing your cluster to have 10X - 20X the number of nodes.
https://groups.google.com/forum/m/#!topic/nosql-databases/FpcSJcN9Opw

Spark task duration difference

I'm running an application that loads data (.csv) from S3 into DataFrames and then registers those DataFrames as temp tables. After that, I use Spark SQL to join those tables and finally write the result into a database. The issue that is currently a bottleneck for me is that tasks are not evenly split, so I get no benefit from parallelization across the multiple nodes in the cluster. More precisely, this is the distribution of task durations in the problematic stage:
task duration distribution
Is there a way for me to enforce a more balanced distribution? Maybe by manually writing map/reduce functions?
Unfortunately, this stage has 6 more tasks that are still running (1.7 hours at the moment), which will show an even greater deviation.
There are two likely possibilities: one is under your control and, unfortunately, one likely is not.
Skewed data. Check that the partitions are of relatively similar size, say within a factor of three or four (see the sketch after this answer).
Inherent variability of Spark task runtimes. I have seen large delays from stragglers on Spark Standalone, YARN, and Mesos without an apparent reason. The symptoms are:
extended periods (minutes) during which little or no CPU or disk activity occurs on the nodes hosting the straggler tasks
no apparent correlation of data size to the stragglers
different nodes/workers may experience the delays on subsequent runs of the same job
One thing to check: run hdfs dfsadmin -report and hdfs fsck to see whether HDFS is healthy.
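A quick sketch of the partition-balance check from the first point (df stands for the DataFrame feeding the slow stage; row counts are used as a rough proxy for partition size):

```scala
val counts = df.rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()

// Largest partitions first; a max/min ratio much above 3-4 suggests skew.
counts.sortBy(-_._2).take(10).foreach { case (idx, n) =>
  println(s"partition $idx: $n rows")
}
val sizes = counts.map(_._2)
println(s"max/min ratio: ${sizes.max.toDouble / math.max(1, sizes.min)}")
```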

Cassandra vnodes performance overhead and changing the number of vnodes

We have a test cluster of 4 nodes, and we've turned on vnodes. It seems that reading is somewhat slower than with the old method (initial_token). Is there some performance overhead to using vnodes? Do we have to increase/decrease the default num_tokens (256) if we only have 4 physical nodes?
Another scenario we would like to test is to change the num_tokens of the cluster on the fly. Is it possible, or do we have to recreate the whole cluster? If possible, how can we accomplish that?
We're using Cassandra 2.0.4.
It really depends on your application, but if you are running Spark queries on top of Cassandra, then a high number of vnodes can significantly slow down your queries, by at least 2x to 5x. This is because Spark cannot subdivide queries across vnodes, so each vnode results in one Spark partition, and a high number of partitions slows down low-latency queries.
The recommended number of vnodes is more like 16. In theory this lets you grow a two-node cluster to at most 32 nodes, which is more than enough of an expansion ratio for most folks.
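If you want to see the effect from the Spark side, here is a small sketch that counts the partitions the connector creates for a table (the keyspace and table names are hypothetical, and the exact behaviour depends on the connector version):

```scala
import com.datastax.spark.connector._

// With many vnodes and no token-range merging, this number grows with num_tokens.
val rdd = sc.cassandraTable("my_keyspace", "my_table")
println(s"Spark partitions for the table: ${rdd.partitions.length}")
```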

Resources