What is a performant partitioning strategy for key-agnostic mapping? - apache-spark

First of all, I'm working with PySpark on Glue and I'm reading several very large CSV files. They are bzip2-compressed and inflate to several GB each once decompressed.
At this stage of processing I'm only performing a simple map over all rows. No joins, group bys, filtering. Just a map.
Let's say I am working on 10 nodes. Generally speaking, would it be preferable to have a rather high number of partitions or a rather low number?
I would guess that, independent of the number of cores available across those nodes, that number should be pretty high so that every executor stays busy at all times, each working on small chunks of data.
So, let's say there are 20 cores across those 10 nodes. If the partitioning were key-based, then anything much larger than 40 partitions would likely not be a good idea. But in the key-agnostic mapping case I'd tend towards something like 1000 partitions or more.
Does that make sense? I'm especially interested in the thought process here.
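For concreteness, here is a minimal PySpark sketch of the scenario above; the paths are hypothetical and the factor of 3 is only a starting point to benchmark from, assuming the 10-node / 20-core cluster described in the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("key-agnostic-map").getOrCreate()

# Assumed cluster from the question: 10 nodes with 20 cores in total.
total_cores = 20

# Hypothetical input location; bzip2 CSVs are read like plain CSVs.
df = spark.read.csv("s3://my-bucket/input/*.csv.bz2", header=True)

# For a pure map, a small multiple of the total core count (2-4x) is a common
# starting point. Note that repartition() itself shuffles, so it only pays off
# when the per-row map work dominates the cost of redistributing the data.
df = df.repartition(total_cores * 3)

# Placeholder row-level map, then write out (hypothetical output path).
result = df.rdd.map(lambda row: row.asDict())
result.saveAsTextFile("s3://my-bucket/output/")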

Related

How do you determine shuffle partitions for Spark application?

I am new to spark so am following this amazing tutorial from sparkbyexamples.com and while reading I found this section:
Shuffle partition size & Performance
Based on your dataset size, number of cores, and memory, PySpark shuffling can benefit or harm your jobs. When you are dealing with a small amount of data, you should typically reduce the number of shuffle partitions; otherwise you will end up with many partitions holding only a few records each, which results in running many tasks that each process very little data.
On the other hand, when you have a lot of data and too few partitions, you get fewer, longer-running tasks and may sometimes even run into out-of-memory errors.
Getting the shuffle partition size right is always tricky and takes many runs with different values to find the optimal number. This is one of the key properties to look at when you have performance issues in PySpark jobs.
Can someone help me understand how you determine how many shuffle partitions you will need for your job?
As you quoted, it’s tricky, but this is my strategy:
If you’re using “static allocation”, meaning you tell Spark how many executors you want to allocate for the job, then it’s easy: the number of partitions can be executors * cores per executor * factor. factor = 1 means each core handles one task, factor = 2 means each core handles two tasks, and so on.
If you’re using “dynamic allocation”, then it’s trickier. You can read the long description here: https://databricks.com/blog/2021/03/17/advertising-fraud-detection-at-scale-at-t-mobile.html. The general idea is that you need to answer many questions, such as: what is the size of your data (how many gigabytes), what its structure looks like (how many files, folders, rows, etc.), how you read it (from HDFS, Hive, or JDBC), and how many resources you have (cores, executors, memory). Then you run and benchmark over and over to find the sweet spot that is “just right” for your circumstances.
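As a rough sketch of the static-allocation formula above (all of the numbers are illustrative, and an existing SparkSession named spark is assumed):

# Rule of thumb: executors * cores per executor * factor.
num_executors = 10        # illustrative
cores_per_executor = 4    # illustrative
factor = 2                # roughly two shuffle tasks per core

spark.conf.set("spark.sql.shuffle.partitions",
               str(num_executors * cores_per_executor * factor))   # "80"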
Update #1:
So what is the general industry practice: will a company simply use the first tactic and allocate more hardware, or will it use dynamic allocation?
Usually, if you have an on-premise Hadoop environment, you can choose between static allocation (the default mode) and dynamic allocation (the advanced mode). I often start with dynamic because I have no idea how big the data and its transformations are, so sticking with dynamic gives me the flexibility to expand my work without thinking too much about Spark configuration. But you can also start with static if you want to; nothing prevents you from doing so.
Eventually, when it comes to productionizing the process, you can also choose between static (very stable, but consumes more resources) and dynamic (less stable, i.e. it fails sometimes due to resource allocation, but saves resources).
Finally, most Hadoop cloud solutions (like Databricks) come with dynamic allocation by default, which is less costly.
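For reference, a sketch of the core dynamic-allocation settings; the executor bounds are purely illustrative, and whether the external shuffle service is required depends on your cluster manager and Spark version:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dynamic-allocation-sketch")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "2")    # illustrative bounds
         .config("spark.dynamicAllocation.maxExecutors", "20")
         .config("spark.shuffle.service.enabled", "true")        # typically needed on YARN
         .getOrCreate())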

Spark map-side aggregation: Per partition only?

I have been reading on map-side reduce/aggregation and there is one thing I can't seem to understand clearly. Does it happen per partition only or is it broader in scope? I mean does it also reduce across partitions if the same key appears in multiple partitions processed by the same Executor?
Now I have a few more questions depending on whether the answer is "per partition only" or not.
Assuming it's per partition:
What are good ways to deal with a situation where I know my dataset lends itself well to further reducing across local partitions before a shuffle? E.g. I process 10 partitions per Executor and I know they all include many overlapping keys, so they could potentially be reduced to just 1/10th. Basically I'm looking for a local reduce() (like so many others). coalesce()ing them comes to mind; are there any common methods to deal with this?
Assuming it reduces across partitions:
Does it happen per Executor? How about Executors assigned to the same Worker node: do they have the ability to reduce across each other's partitions, recognizing that they are co-located?
Does it happen per core (Thread) within the Executor? The reason I'm asking this is that some of the diagrams I looked at seem to show a Mapper per core/Thread of the Executor, and it looks like the results of all tasks coming out of that core go to a single Mapper instance (which does the shuffle writes, if I am not mistaken).
Is it deterministic? E.g. if I have a record, let's say A=1 in 10 partitions processed by the same Executor, can I expect to see A=10 for the task reading the shuffle output? Or is it best-effort, e.g. it still reduces but there are some constraints (buffer size etc.) so the shuffle read may encounter A=4 and A=6.
Map-side aggregation is similar to Hadoop's combiner approach. Reducing locally makes sense for Spark as well, since it means less shuffling. So it works per partition, as you state.
When applying reducing functionality, e.g. a groupBy & sum, shuffling occurs first so that the keys end up in the same partition, allowing the above to happen (with DataFrames this is automatic). But a simple count, say, will also reduce locally, and the overall count is then computed by taking the intermediate results back to the driver.
So, results from Executors are combined on the Driver, depending on what is actually requested, e.g. a collect or printing a count. But if you are writing out after an aggregation of some kind, then the reducing is limited to the Executor on a Worker.
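A small PySpark sketch related to the determinism question above, assuming an existing SparkContext sc: reduceByKey combines within each partition before the shuffle, and the final result is always the full sum regardless of how the records were spread across partitions.

# Ten partitions, each holding a single ("A", 1) record.
rdd = sc.parallelize([("A", 1)] * 10, 10)

# reduceByKey performs a map-side combine per partition, then merges the
# partial results after the shuffle.
print(rdd.reduceByKey(lambda a, b: a + b).collect())   # [('A', 10)]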

drawbacks to large spark partition sizes

I have read that too many small partitions hurt performance because of overhead, e.g. sending a very large number of tasks to executors.
What are the downside of using maximally large partitions, e.g. why do I see recommendations in the 100s of MB range?
I can see a few potential issues:
If you lose a partition, there's a large amount of work to recompute. With many smaller partitions you may lose more often, but you will have less variance in your runtime.
If one of your few tasks on large partitions takes longer to compute than the others, this would leave the other cores under-utilized; with smaller partitions, the work can be better distributed across the cluster.
Do these issues make sense, and are there others? Thanks!
These two potential issues are correct.
For better cluster usage, one should define partitions large enough to hold an HDFS block (128/256 MB in general), but avoid exceeding that size so that the data stays well distributed, allowing horizontal scaling for performance (maximizing CPU usage).
As for the first point, you cannot assume that the variance in runtime will be lower if you have a larger number of smaller partitions. Say one of the nodes crashes, which results in the recomputation of its RDD partitions; since you now have one node fewer to process the data, your runtime will increase irrespective of the number of partitions.
"If one of your few tasks on large partitions takes longer to compute than the others": this happens if you have skewed data, and increasing the number of partitions can help, but simply increasing the number of partitions isn't always sufficient.
The maximum partition size should not be greater than 128 MB, which is the default block size in HDFS. But you should also not have very small partitions, as they add the overhead of scheduling many tasks and maintaining a large amount of metadata. As with any multithreaded application, increasing the parallelism doesn't always increase performance. In the end it comes down to finding the optimal value at which you get maximum performance.
By having large partition size you will have:
Less concurrency,
Increased memory pressure for transformations which involve a shuffle,
More susceptibility to data skew.
Please refer here to find the optimal number of partitions.
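To act on the 128 MB guideline when reading files, the input split size can be capped and the resulting layout inspected; the dataset path below is hypothetical and an existing SparkSession spark is assumed:

# Cap file-based input partitions at 128 MB (Spark SQL's default).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))

df = spark.read.parquet("/data/events")      # hypothetical dataset
print(df.rdd.getNumPartitions())             # how many partitions were created
print(df.rdd.glom().map(len).collect())      # rows per partition; only for small data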

How is task distributed in spark

I am trying to understand how work is distributed in Spark when a job is submitted via spark-submit and I have a Spark deployment with 4 nodes. If there is a large dataset to operate on, I want to understand exactly how many stages the tasks are divided into and how many executors run for the job. I also want to understand how this is decided for every stage.
It's hard to answer this question exactly, because there are many uncertainties.
The number of stages depends only on the described workflow, which includes different kinds of maps, reduces, joins, etc. If you understand it, you can basically read that right from the code. Most importantly, that understanding helps you write more performant algorithms, because it is generally known that one has to avoid shuffles. For example, a join requires a shuffle - it is a stage boundary. This is pretty simple to see: print rdd.toDebugString() and then look at the indentation (look here), because a change in indentation marks a shuffle.
But the number of executors is a completely different story, because it depends on the number of partitions. With 2 partitions only 2 executors are needed, but with 40 all 4 are used, since you only have 4. Additionally, the number of partitions might depend on a few properties you can provide at spark-submit:
the spark.default.parallelism parameter, or
the data source you use (e.g. it is different for HDFS and Cassandra).
It would be good to keep all of the cores in the cluster busy, but not much more than that (meaning each core processes just one partition at a time), because processing each partition carries a bit of overhead. On the other hand, if your data is skewed, some cores will require more time to process their bigger partitions than others - in this case it helps to split the data into more partitions so that all cores stay busy for roughly the same amount of time. This helps with balancing the cluster and throughput at the same time.
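A quick way to see those stage boundaries yourself is toDebugString(); each change in indentation marks a shuffle. The path here is hypothetical and an existing SparkContext sc is assumed:

rdd = sc.textFile("/data/input.txt")                    # hypothetical input
pairs = rdd.map(lambda line: (line.split(",")[0], 1))
counts = pairs.reduceByKey(lambda a, b: a + b)          # reduceByKey forces a shuffle

debug = counts.toDebugString()
# PySpark may return bytes here, so decode before printing.
print(debug.decode() if isinstance(debug, bytes) else debug)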

Settle the right number of partition on RDD

I read some comments which say that a good number of partitions for an RDD is 2-3 times the number of cores. I have 8 nodes, each with two 12-core processors, so I have 192 cores. I set the number of partitions between 384 and 576, but it doesn't seem to work efficiently; I tried 8 partitions with the same result. Maybe I have to set other parameters so that my job works better on the cluster rather than on my machine. I should add that the file I analyse has 150k lines.
val data = sc.textFile("/img.csv",384)
The primary effects come from specifying either too few partitions or far too many partitions.
Too few partitions: you will not utilize all of the cores available in the cluster.
Too many partitions: there will be excessive overhead in managing many small tasks.
Between the two, the first one has a far bigger impact on performance. Scheduling too many small tasks has a relatively small impact for partition counts below 1000. If you have on the order of tens of thousands of partitions, then Spark gets very slow.
Now, considering your case: you are getting the same results with 8 partitions and with 384-576. Generally the rule of thumb says:
NoOfPartitions = (NumberOfWorkerNodes*NoOfCoresPerWorkerNode)-1
As we know, tasks are processed by CPU cores, so we should set the number of partitions to the total number of cores in the cluster minus 1 (reserved for the Application Master of the driver). That means each core processes one partition at a time.
So 191 partitions may improve the performance. Otherwise, the impact of setting too few or too many partitions is explained at the beginning.
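Applied to the cluster in the question (8 nodes, two 12-core processors each), the rule of thumb works out as in this PySpark sketch; it is only a starting point to benchmark against, and an existing SparkContext sc is assumed:

# Rule of thumb from above: nodes * cores per node - 1.
worker_nodes = 8
cores_per_node = 2 * 12                               # two 12-core processors per node
num_partitions = worker_nodes * cores_per_node - 1    # 191

# PySpark equivalent of the textFile call in the question.
data = sc.textFile("/img.csv", minPartitions=num_partitions)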
Hope this will help!!!
