Spark repartition does not distribute records evenly - apache-spark

I have an rdd which I re-partition by one field
rdd = rdd.repartition( new Column("block_id"));
and save it to hdfs.
I would expect that if there are 20 different block_id's, the repartitioning would produce 20 new partitions each holding a different block_id.
But in fact after repartitioning there are 19 partitions, each holding exactly one block_id and one partition holding two block_id's.
This means that the core writing the partition with the two block_id's to disk takes twice as long as the other cores, which doubles the overall time.

Spark Dataset uses hash partitioning. There is no guarantee that there will be no hash collisions, so you cannot expect:
that if there are 20 different block_id's, the repartitioning would produce 20 new partitions each holding a different block_id
You can try to increase the number of partitions, but a number large enough to offer good guarantees is rather impractical.
With RDDs you can design your own partitioner; see "How to Define Custom partitioner for Spark RDDs of equally sized partition where each partition has equal number of elements?"
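To get a feel for how many partitions "good guarantees" would take, here is a rough back-of-the-envelope sketch. It assumes an idealized uniform hash (an assumption; Spark's actual hash only approximates this), and prob_all_distinct is a hypothetical helper name, not a Spark API.

```python
from math import prod

def prob_all_distinct(num_keys, num_partitions):
    """Chance that num_keys keys land in num_keys different partitions,
    assuming every key is hashed uniformly and independently (idealized)."""
    if num_keys > num_partitions:
        return 0.0  # pigeonhole: at least two keys must share a partition
    # The i-th key must avoid the i partitions that are already taken.
    return prod(1 - i / num_partitions for i in range(num_keys))

# 20 distinct block_id values hashed into a growing number of partitions.
for n in (20, 200, 2000, 20000):
    print(n, prob_all_distinct(20, n))
```

Under that assumption, 20 keys in 20 partitions almost never end up collision-free, and the partition count has to grow by orders of magnitude before a collision becomes unlikely.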

Related

Is it possible to coalesce Spark partitions "evenly"?

Suppose we have a PySpark dataframe with data spread evenly across 2048 partitions, and we want to coalesce to 32 partitions to write the data back to HDFS. Using coalesce is nice for this because it does not require an expensive shuffle.
But one of the downsides of coalesce is that it typically results in an uneven distribution of data across the new partitions. I assume that this is because the original partition IDs are hashed to the new partition ID space, and the number of collisions is random.
However, in principle it should be possible to coalesce evenly, so that the first 64 partitions from the original dataframe are sent to the first partition of the new dataframe, the next 64 are sent to the second partition, and so on, resulting in an even distribution of partitions. The resulting dataframe would often be more suitable for further computations.
Is this possible, while preventing a shuffle?
I can force the relationship I would like between initial and final partitions using a trick like in this question, but Spark doesn't know that everything from each original partition is going to a particular new partition. Thus it can't optimize away the shuffle, and it runs much slower than coalesce.
In your case you can safely coalesce the 2048 partitions into 32 and assume that Spark is going to evenly assign the upstream partitions to the coalesced ones (64 for each in your case).
Here is an extract from the Scaladoc of RDD#coalesce:
This results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
Also consider that the way your partitions are physically spread across the cluster influences how the coalescing happens. The following is an extract from CoalescedRDD's ScalaDoc:
If there is no locality information (no preferredLocations) in the parent, then the coalescing is very simple: chunk parents that are close in the Array in chunks.
If there is locality information, it proceeds to pack them with the following four goals:
(1) Balance the groups so they roughly have the same number of parent partitions
(2) Achieve locality per partition, i.e. find one machine which most parent partitions prefer
(3) Be efficient, i.e. O(n) algorithm for n parent partitions (problem is likely NP-hard)
(4) Balance preferred machines, i.e. avoid as much as possible picking the same preferred machine
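The no-locality case ("chunk parents that are close in the Array in chunks") can be sketched in plain Python. This is a simulation of the idea, not Spark's code: chunk_parents is a hypothetical name, and the integer-division boundaries are an assumption about how the chunks are cut.

```python
def chunk_parents(num_parents, num_groups):
    """Assign parent partition indices to coalesced partitions by position.

    Group i takes the contiguous range [i*n/m, (i+1)*n/m) of parent indices,
    so neighbouring parents end up in the same coalesced partition.
    """
    groups = []
    for i in range(num_groups):
        start = i * num_parents // num_groups
        end = (i + 1) * num_parents // num_groups
        groups.append(list(range(start, end)))
    return groups

groups = chunk_parents(2048, 32)
print({len(g) for g in groups})        # number of parents per coalesced partition
print(groups[0][0], groups[0][-1])     # first and last parent of the first group
```

With 2048 parents and 32 groups every group receives 64 consecutive parents, which matches the even assignment described above. When the parent has locality information the packing follows the four goals instead, so the grouping can differ.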

Does spark's coalesce function try to create partitions of uniform size?

I want to even out the partition size of rdds/dataframes in Spark to get rid of straggler tasks that slow my job down. I can do so using repartition(n_partition), which creates partitions of quite uniform size. However, that involves an expensive shuffle.
I know that coalesce(n_desired_partitions) is a cheaper alternative that avoids shuffling, and instead merges partitions on the same executor. However, it's not clear to me whether this function tries to create partitions of roughly uniform size, or simply merges input partitions without regard to their sizes.
For example, let's say that we have an RDD of the integers in the range [1,12] in three partitions as follows: [(1,2,3,4,5,6,7,8),(9,10),(11,12)]. Let's say these are all on the same executor.
Now I call rdd.coalesce(2). Will the algorithm that powers coalesce know to merge the two small partitions (because they're smaller and we want balanced partition sizes), rather than just merging two arbitrary partitions?
Discussion of this topic elsewhere
According to this presentation (skip to 7:27), the Netflix big data team needed to implement a custom coalesce function to balance partition sizes. See also SPARK-14042.
Why this question's not a duplicate
There is a more general question about the differences between partition and coalesce here, but nobody there explains whether the algorithm that powers coalesce tries to balance partition size.
repartition is nothing special; its definition looks like this:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
So it is simply coalesce with a shuffle. When you call coalesce directly, shuffle defaults to false, so the data is not shuffled unless you ask for it.
For example, suppose you have 2 cluster nodes with 2 partitions each. If you call rdd.coalesce(2), Spark merges the partitions local to each node. If you call coalesce(1), the 2 partitions on the other node have to be moved across the network. So in your case Spark probably merges node-local partitions, and if one node holds fewer partitions than another, the resulting partition sizes are not uniform.
Following your edit to the question, I tried the same thing as follows:
val data = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10,11,12))
data.getNumPartitions
res2: Int = 4
data.mapPartitionsWithIndex{case (a,b)=>println("partitionssss"+a);b.map(y=>println("dataaaaaaaaaaaa"+y))}.count
The code above prints the elements held by each partition.
Now I coalesce the 4 partitions to 2 and run the same code on that RDD to check how Spark's coalesce distributes the data:
val coal=data.coalesce(2)
coal.getNumPartitions
res4: Int = 2
coal.mapPartitionsWithIndex{case (a,b)=>println("partitionssss"+a);b.map(y=>println("dataaaaaaaaaaaa"+y))}.count
You can see that Spark distributes the data equally across both partitions, 6 and 6, even though before the coalesce the number of elements was not the same in all partitions.
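To connect this with the [(1,...,8),(9,10),(11,12)] example from the question, here is a small pure-Python simulation. It assumes coalesce groups parent partitions purely by position and ignores their sizes (the no-locality behaviour; with locality information Spark may group differently). coalesce_by_position is a hypothetical helper, not a Spark function.

```python
def coalesce_by_position(partitions, num_groups):
    """Merge neighbouring partitions into num_groups, looking only at position."""
    n = len(partitions)
    merged = []
    for i in range(num_groups):
        start = i * n // num_groups
        end = (i + 1) * n // num_groups
        # Concatenate the elements of the parents assigned to group i.
        merged.append([x for part in partitions[start:end] for x in part])
    return merged

question_example = [[1, 2, 3, 4, 5, 6, 7, 8], [9, 10], [11, 12]]
reordered = [[9, 10], [1, 2, 3, 4, 5, 6, 7, 8], [11, 12]]
four_equal = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

for name, parts in [("question", question_example),
                    ("reordered", reordered),
                    ("four_equal", four_equal)]:
    print(name, [len(p) for p in coalesce_by_position(parts, 2)])
```

In this model the two small partitions are merged only because they happen to sit next to each other. Moving the large partition into the middle produces a lopsided result, and four equal partitions give the 6 and 6 split. So an even outcome here does not show that coalesce balances by size.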

How does spark determine the preferredLocation of an RDD in repartition and coalesce?

When does an RDD get its preferred location? How is the preferred location determined?
I've seen some weird behaviors in repartition and coalesce that I could not quite make sense of:
1. When coalescing from n to n-1 partitions, I see Spark just merge one partition into another single partition. (I think the ideal behavior would be to distribute the data evenly across all n-1 nodes.)
2. When I run repartition, I see Spark repartition such that one node holds multiple partitions of the RDD.
Does the above behavior have something to do with preferredLocations?
Note that rdd.repartition(n) just calls rdd.coalesce(n, shuffle = true), so we're just comparing shuffle true vs false.
shuffle = false
In this mode, Spark constructs a new RDD whose partitions contain one or more partitions of the parent RDD -- if you coalesce from n partitions -> n/2 partitions, then each partition consists of the elements from two semi-random partitions in the parent. This mode is appropriate when you want to reduce partitioning and the partitions are already balanced, like when you've done a filter that affects elements in each partition roughly equally. The overhead is very low. Also, note that it's impossible to increase number of partitions with this mode.
shuffle = true
For some background, I recommend this blog post for learning a bit more about how and why we shuffle. The fundamental differences in this execution mode are:
higher overhead (all data is transmitted over network)
good for rebalancing partitions (if you perform a filter that drops out either all elements in a partition or none, then shuffle=false will produce imbalanced partitions, but shuffle=true will resolve the issue)
can increase the number of partitions
Preferred locations don't have much to do with it -- you're seeing preferred locations only in the shuffle = false mode because the locality is preserved without shuffles, but after a shuffle the original preferredLocations are irrelevant (replaced with new preferred locations about shuffle destinations).
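The filter case mentioned above can be illustrated with a toy simulation in plain Python. Both functions are hypothetical stand-ins: the first merges neighbouring partitions as shuffle = false would in the simplest case, the second deals elements out round-robin, which is a simplification of what the shuffle does (an assumption, not Spark's exact algorithm).

```python
def coalesce_no_shuffle(partitions, num_groups):
    """Merge neighbouring partitions; elements never leave their group."""
    n = len(partitions)
    return [
        [x for part in partitions[i * n // num_groups:(i + 1) * n // num_groups]
         for x in part]
        for i in range(num_groups)
    ]

def coalesce_with_shuffle(partitions, num_groups):
    """Redistribute every element round-robin over the new partitions."""
    out = [[] for _ in range(num_groups)]
    position = 0
    for part in partitions:
        for x in part:
            out[position % num_groups].append(x)
            position += 1
    return out

# Four partitions of 100 elements; the filter empties the last two completely.
partitions = [list(range(i * 100, (i + 1) * 100)) for i in range(4)]
filtered = [[x for x in part if x < 200] for part in partitions]

print([len(p) for p in coalesce_no_shuffle(filtered, 2)])
print([len(p) for p in coalesce_with_shuffle(filtered, 2)])
```

Without the shuffle, all surviving elements stay in one of the two new partitions; with it, they are split evenly, at the cost of moving every element.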

RDD and partition in Apache Spark

So, in Spark when an application is started then an RDD containing the dataset for the application (e.g. words dataset for WordCount) is created.
So far what I understand is that an RDD is a collection of those words in WordCount plus the operations that have been applied to that dataset (e.g. map, reduceByKey, etc...)
However, afaik, Spark also has HadoopPartition (or in general: partition) which is read by every executor from HDFS. And I believe that an RDD in the driver also contains all of these partitions.
So, what is getting divided among executors in Spark? Does every executor get a sub-dataset as a single RDD which contains less data than the RDD in the driver, or does every executor only deal with these partitions and read them directly from HDFS? Also, when are the partitions created? On RDD creation?
Partitions are configurable provided the RDD is key-value based.
Partitions have 3 main properties:
Tuples in the same partition are guaranteed to be on the same machine.
Each node in a cluster can contain more than one partition.
The total number of partitions is configurable; by default it is set to the total number of cores on all the executor nodes.
Spark supports two types of partitioning:
Hash Partitioning
Range Partitioning
When Spark reads a file from HDFS, it creates a single partition for a single input split. Input split is set by the Hadoop InputFormat used to read this file.
When you call rdd.repartition(x), Spark shuffles the data from the N partitions of rdd into the x partitions you asked for, assigning records on a round-robin basis.
Please see more details here and here
Your RDD has rows in it. If it is a text file, it has lines separated by \n.
Those rows are divided into partitions across the different nodes in the Spark cluster.

Spark : how can evenly distribute my records in all partition

I have an RDD with 30 records (key/value pairs: the key is a timestamp and the value is a JPEG byte array)
and I am running 30 executors. I want to repartition this RDD into 30 partitions so that every partition gets one record and is assigned to one executor.
When I used rdd.repartition(30) it repartitioned my RDD into 30 partitions, but some partitions got 2 records, some got 1 record and some did not get any records.
Is there any way in Spark to distribute my records evenly across all partitions?
The salting technique can be used here: it involves adding a new "fake" key and using it alongside the current key to get a better distribution of the data.
(here is link for salting)
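A minimal sketch of the salting idea in plain Python, not tied to any Spark API. partition_of is a made-up stand-in for a hash partitioner (an assumption; Spark's real hash differs), and the salt is simply cycled through a small range.

```python
from collections import Counter
from itertools import cycle

def partition_of(key, salt, num_partitions):
    # Hypothetical toy hash over the (key, salt) pair, for illustration only.
    return (key * 31 + salt) % num_partitions

def partition_loads(keys, num_partitions, num_salts):
    """Count records per partition when each record gets a cycling salt."""
    salts = cycle(range(num_salts))
    loads = Counter()
    for key in keys:
        loads[partition_of(key, next(salts), num_partitions)] += 1
    return loads

hot_keys = [7] * 1000                        # 1000 records sharing one key
unsalted = partition_loads(hot_keys, 8, 1)   # a single salt value = no salting
salted = partition_loads(hot_keys, 8, 10)    # ten salt values spread the key

print(max(unsalted.values()), max(salted.values()))
```

Without the salt every record of the hot key goes to the same partition; with it the records are spread over several partitions. In a real job the salt would be dropped again once the data has been redistributed.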
Below is an example of repartitioning an RDD into N partitions so that the items are evenly distributed across the partitions. The number of items in any two partitions differs by at most 1.
evenly_repartitioned = (
    rdd
    .zipWithIndex()               # (item, index)
    .map(lambda p: (p[1], p[0]))  # (index, item)
    .partitionBy(N, lambda p: p)  # partition = index % N
    .values()                     # drop the index
)
It does:
Make a tuple of (item, index) where the index is over the entire RDD
Swap the key and the value, so now the RDD contains (index, item)
Repartition to N partitions using an identity partitionFunc, moving item to partition index % N
Take only the values, dropping the index in the tuple.
Note that this is slower than the default hash-based repartitioning, because it requires another Spark stage during zipWithIndex() to count the size of each partition.
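The steps above can be checked without a cluster. The following plain-Python sketch mimics only the index % N assignment (it does not use Spark, and evenly_partition is a hypothetical helper name).

```python
def evenly_partition(items, n):
    """Place the item with global index i into partition i % n."""
    partitions = [[] for _ in range(n)]
    for index, item in enumerate(items):   # plays the role of zipWithIndex()
        partitions[index % n].append(item)
    return partitions

sizes = [len(p) for p in evenly_partition(range(32), 5)]
print(sizes, max(sizes) - min(sizes))

# The 30 records / 30 partitions case from the question.
print([len(p) for p in evenly_partition(range(30), 30)])
```

Because consecutive indices go to consecutive partitions, partition sizes can differ by at most one, and 30 records over 30 partitions give exactly one record each.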
You can force a new partitioning by using the partitionBy command and providing a number of partitions. By default the partitioner is hash-based, but you can switch to a range-based one for a better distribution. If you really want to force a repartitioning, you can use a random number generator as the partition function (in PySpark).
import numpy as np
my_rdd.partitionBy(pCount, partitionFunc=lambda x: np.random.randint(pCount))
This will, however, frequently cause inefficient shuffles (lots of data transferred between nodes), but if your process is compute limited then it can make sense.

Resources