How to repartition evenly in Spark? - apache-spark

To test how .repartition() works, I ran the following code:
rdd = sc.parallelize(range(100))
rdd.getNumPartitions()
rdd.getNumPartitions() resulted in 4. Then I ran:
rdd = rdd.repartition(10)
rdd.getNumPartitions()
rdd.getNumPartitions() this time resulted in 10, so there were now 10 partitions.
However, I checked the partitions by:
rdd.glom().collect()
The result gave 4 non-empty lists and 6 empty lists. Why haven't any elements been distributed to the other 6 lists?

The algorithm behind repartition() uses logic to optimize the most effective way to redistribute data across partitions. In this case, your range is very small and it doesn't find it optimal to actually break the data down further. If you were to use a much bigger range like 100000, you will find that it does in fact redistribute the data.
If you want to force a certain amount of partitions, you could specify the number of partitions upon the intial load of the data. At this point, it will try to evenly distribute the data across partitions even if it's not necessarily optimal. The parallelize function takes a second argument for partitions
rdd = sc.parallelize(range(100), 10)
The same thing would work if you were to say read from a text file
rdd = sc.textFile('path/to/file/, numPartitions)

Related

What is the difference between spark.shuffle.partition and spark.repartition in spark?

What I understand is
When we repartition any dataframe with value n, data will continue to remain on those n partitions, until you hit any shuffle stages or other value of repartition or coalesce.
For Shuffle, it only comes into the play when you hit any shuffle stages and data will continue to remain on those partitions until you hit coalesce or repartition.
I am right ?
If yes then, can any one point out a striking difference?
TLDR - Repartition is invoked as per developer's need but shuffle is done when there is a logical demand
I assume you're talking about config property spark.sql.shuffle.partitions and method .repartition.
As data distribution is an important aspect in any distributed environment, which not only governs parallelism but can also create adverse impacts if the distribution is uneven. However, repartitioning itself is a costly operation as it involves heavy movement of data (i.e. Shuffling). The .repartition method is used to explicitly repartition the data into new partitions - meaning to increase or decrease the number of partitions in the program based on your need. You can invoke this whenever you want.
As opposed to this, spark.sql.shuffle.partitions is a configuration property that governs the number of partitions created when a data movement happens as a result of operations like aggregations and joins.
Configures the number of partitions to use when shuffling data for
joins or aggregations.
When you're performing transformations other than join or aggregation, the above configuration won't have any impact on the number of partitions the new Dataframe will have.
Your confusion between the two is due to both operations involving shuffling. While that is true, the former (i.e. repartition) is an explicit operation where the user is dictating the framework to increase or decrease the number of partitions - which in turn causes shuffling, while in case of joins/aggregation - the shuffling is caused by the operation itself.
Basically -
Joins/Aggregations cause shuffling which causes repartitioning
repartition is asked thus, shuffling has to be done
Another method coalesce make the difference clearer.
For reference, coalesce is a variant of repartition which can only lower the number of partitions, not necessarily equal in size. As it already knows the number of partitions are only to be decreased, it can perform it with minimal shuffling (just join two adjacent partitions until the number is met).
Consider your dataframe has 4 partitions but has data only in 2 of them, thus you decide to reduce the number of partitions to 2. When using coalesce spark tries to achieve this without shuffling or with minimal shuffling.
df.rdd().getNumPartitions(); // Returns 4 with size 0, 0, 2, 4
df=df.coalesce(2); // Decrease partitions to 2
df.rdd().getNumPartitions(); // Returns 2 now with size 2, 4
So there was no shuffling involved. While the following
df1.rdd().getNumPartitions() // Returns 4
df2.rdd().getNumPartitions() // Returns 8
df1.join(df2).rdd().getNumPartitions() // Returns 200
As you've performed a join it'll always return the number of partitions based on spark.sql.shuffle.partitions

Splitting spark data into partitions and writing those partitions to disk in parallel

Problem outline: Say I have 300+ GB of data being processed with spark on an EMR cluster in AWS. This data has three attributes used to partition on the filesystem for use in Hive: date, hour, and (let's say) anotherAttr. I want to write this data to a fs in such a way that minimizes the number of files written.
What I'm doing right now is getting the distinct combinations of date, hour, anotherAttr, and a count of how many rows make up combination. I collect them into a List on the driver, and iterate over the list, building a new DataFrame for each combination, repartitioning that DataFrame using the number of rows to guestimate file size, and writing the files to disk with DataFrameWriter, .orc finishing it off.
We aren't using Parquet for organizational reasons.
This method works reasonably well, and solves the problem that downstream teams using Hive instead of Spark don't see performance issues resulting from a high number of files. For example, if I take the whole 300 GB DataFrame, do a repartition with 1000 partitions (in spark) and the relevant columns, and dumped it to disk, it all dumps in parallel, and finishes in ~9 min with the whole thing. But that gets up to 1000 files for the larger partitions, and that destroys Hive performance. Or it destroys some kind of performance, honestly not 100% sure what. I've just been asked to keep the file count as low as possible. With the method I'm using, I can keep the files to whatever size I want (relatively close anyway), but there is no parallelism and it takes ~45 min to run, mostly waiting on file writes.
It seems to me that since there's a 1-to-1 relationship between some source row and some destination row, and that since I can organize the data into non-overlapping "folders" (partitions for Hive), I should be able to organize my code/DataFrames in such a way that I can ask spark to write all the destination files in parallel. Does anyone have suggestions for how to attack this?
Things I've tested that did not work:
Using a scala parallel collection to kick off the writes. Whatever spark was doing with the DataFrames, it didn't separate out the tasks very well and some machines were getting massive garbage collection problems.
DataFrame.map - I tried to map across a DataFrame of the unique combinations, and kickoff writes from inside there, but there's no access to the DataFrame of the data that I actually need from within that map - the DataFrame reference is null on the executor.
DataFrame.mapPartitions - a non-starter, couldn't come up with any ideas for doing what I want from inside mapPartitions
The word 'partition' is also not especially helpful here because it refers both to the concept of spark splitting up the data by some criteria, and to the way that the data will be organized on disk for Hive. I think I was pretty clear in the usages above. So if I'm imagining a perfect solution to this problem, it's that I can create one DataFrame that has 1000 partitions based on the three attributes for fast querying, then from that create another collection of DataFrames, each one having exactly one unique combination of those attributes, repartitioned (in spark, but for Hive) with the number of partitions appropriate to the size of the data it contains. Most of the DataFrames will have 1 partition, a few will have up to 10. The files should be ~3 GB, and our EMR cluster has more RAM than that for each executor, so we shouldn't see a performance hit from these "large" partitions.
Once that list of DataFrames is created and each one is repartitioned, I could ask spark to write them all to disk in parallel.
Is something like this possible in spark?
One thing I'm conceptually unclear on: say I have
val x = spark.sql("select * from source")
and
val y = x.where(s"date=$date and hour=$hour and anotherAttr=$anotherAttr")
and
val z = x.where(s"date=$date and hour=$hour and anotherAttr=$anotherAttr2")
To what extent is y is a different DataFrame than z? If I repartition y, what effect does the shuffle have on z, and on x for that matter?
We had the same problem (almost) and we ended up by working directly with RDD (instead of DataFrames) and implementing our own partitioning mechanism (by extending org.apache.spark.Partitioner)
Details: we are reading JSON messages from Kafka. The JSON should be grouped by customerid/date/more fields and written in Hadoop using Parquet format, without creating too many small files.
The steps are (simplified version):
a)Read the messages from Kafka and transform them to a structure of RDD[(GroupBy, Message)]. GroupBy is a case class containing all the fields that are used for grouping.
b)Use a reduceByKeyLocally transformation and obtain a map of metrics (no of messages/messages size/etc) for each group - eg Map[GroupBy, GroupByMetrics]
c)Create a GroupPartitioner that's using the previously collected metrics (and some input parameters like the desired Parquet size etc) to compute how many partitions should be created for each GroupBy object. Basically we are extending org.apache.spark.Partitioner and overriding numPartitions and getPartition(key: Any)
d)we partition the RDD from a) using the previously defined partitioner: newPartitionedRdd = rdd.partitionBy(ourCustomGroupByPartitioner)
e)Invoke spark.sparkContext.runJob with two parameters: the first one is the RDD partitioned at d), the second one is a custom function (func: (TaskContext, Iterator[T]) that will write the messages taken from Iterator[T] into Hadoop/Parquet
Let's say that we have 100 mil messages, grouped like that
Group1 - 2 mil
Group2 - 80 mil
Group3 - 18 mil
and we decided that we have to use 1.5 mil messages per partition to obtain Parquet files greater than 500MB. We'll end up with 2 partitions for Group1, 54 for Group2, 12 for Group3.
This statement:
I collect them into a List on the driver, and iterate over the list,
building a new DataFrame for each combination, repartitioning that
DataFrame using the number of rows to guestimate file size, and
writing the files to disk with DataFrameWriter, .orc finishing it off.
is completely off-beam where Spark is concerned. Collecting to driver is never a good approach, volumes and OOM issues and latency in your approach is high.
Use so the below so as to simplify and get parallelism of Spark benefits saving time and money for your boss:
df.repartition(cols...)...write.partitionBy(cols...)...
shuffle occurs via repartition, no shuffling ever with partitionBy.
That simple, with Spark's default parallelism utilized.

Empty Files in output spark

I am writing my dataframe like below
df.write().format("com.databricks.spark.avro").save("path");
However I am getting around 200 files where around 30-40 files are empty.I can understand that it might be due to empty partitions. I then updated my code like
df.coalesce(50).write().format("com.databricks.spark.avro").save("path");
But I feel it might impact performance. Is there any other better approach to limit number of output files and remove empty files
You can remove the empty partitions in your RDD before writing by using repartition method.
The default partition is 200.
The suggested number of partition is number of partitions = number of cores * 4
repartition your dataframe using this method. To eliminate skew and ensure even distribution of data choose column(s) in your dataframe with high cardinality (having unique number of values in the columns) for the partitionExprs argument to ensure even distribution.
As default no. of RDD partitions is 200; you have to do shuffle to remove skewed partitions.
You can either use repartition method on the RDD; or make use of DISTRIBUTE BY clause on dataframe - which will repartition along with distributing data among partitions evenly.
def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
Returns dataset instance with proper partitions.
You may use repartitionAndSortWithinPartitions - which can improve compression ratio.

Understanding Shuffle and rePartitioning in spark

I would greatly appreciate if someone could answer these few spark shuffle related questions in simplified terms .
In spark, when loading a data-set ,we specify the number of partitions, which tells how many block the input data(RDD) should be divided into ,and based on the number of partitions, equal number of tasks are launched (correct me, if the assumption is wrong).for X number of cores in worker node.corresponding X number of task run at one time.
Along similar lines ,here are the few questions.
Since,All byKey operations along with coalesce, repartition,join and cogroup, causes data shuffle.
Is data shuffle another name for repartitiong operation?
What happens to the initial partitions(number of partitions declared)when repartitions happens.
Can someone give example(explain) how data movement across the cluster happens.i have seen couple of examples where
random arrow movement of keys is shown (but dont know how the movement is being driven),for example if we have already have data in 10 partitions,does the re partitioning operation combine all data first ,and then send the particular key to the particular partition based on the hash-code%numberofpartitions.
First of all, HDFS blocks is divided into number of partition not in the blocks. These petitions resides in the work of memory. These partitions resides in the worker memory.
Q- Is data shuffle another name for repartitiong operation?
A- No. Generally repartition means increasing the existing partition in which the data is divided into into. So whenever we increase the partition, we are actually trying to “move” the data in number of new partitions set in code not “Shuffling” . Shuffling is somewhat when we move the data of particular key in one partition.
Q- What happens to the initial partitions(number of partitions declared)when repartitions happens?
A- Covered above
One more underlying thing is rdd.repartition(n) will not do change the no. Of partitions of rdd, its a tranformation, which will work when some other rdd is created like
rdd1=rdd.repartition(n)
Now it will create new rdd1 that have n number of partition.To do this, we can call coalesce function like rdd.coalesce(n) Being an action function, this will change the partitions of rdd itself.
Q- Can someone give example(explain) how data movement across across the cluster happens.i have seen couple of examples where random arrow movement of keys is shown (but dont know how the movement is being driven),for example if we have already have data in 10 partitions,does the re partitioning operation combine all data first ,and then send the particular key to the particular partition based on the hash-code%numberofpartitions.
Ans- partition and partitioning at two different different concept so partition is something in which the data is divided evenly in the number of partitions set by the user but in partitioning, data is shuffled among those partitions according to algorithms set by user like HashPartitioning & RangePartitioning.
Like rdd= sc.textFile(“../path”,5) rdd.partitions.size/length
O/p: Int: 5(No.of partitions)
rdd.partitioner.isDefined
O/p: Boolean= false
rdd.partitioner
O/p: None(partitioning scheme)
But,
rdd=sc.textFile(“../path”,5).partitionBy(new org.apache.spark.HashPartition(10).cache()
rdd.partitions.size
O/p: Int: 10
rdd.partitioner.isDefined
O/p: Boolean: true
rdd.partitioner
O/p: HashPartitioning#
Hope this will help!!!

Does spark's coalesce function try to create partitions of uniform size?

I want to even out the partition size of rdds/dataframes in Spark to get rid of straggler tasks that slow my job down. I can do so using repartition(n_partition), which creates partitions of quite uniform size. However, that involves an expensive shuffle.
I know that coalesce(n_desired_partitions) is a cheaper alternative that avoids shuffling, and instead merges partitions on the same executor. However, it's not clear to me whether this function tries to create partitions of roughly uniform size, or simply merges input partitions without regard to their sizes.
For example, let's say that the following we have an Rdd of the integers in the range [1,12] in three partitions as follows: [(1,2,3,4,5,6,7,8),(9,10),(11,12)]. Let's say these are all on the same executor.
Now I call rdd.coalesce(2). Will the algorithm that powers coalesce know to merge the two small partitions (because they're smaller and we want balanced partition sizes), rather than just merging two arbitrary partitions?
Discussion of this topic elsewhere
According to this presentation (skip to 7:27) Netflix big data team needed to implement a custom coalese function to balance partition sizes. See also SPARK-14042.
Why this question's not a duplicate
There is a more general question about the differences between partition and coalesce here, but nobody gets there explains whether the algorithm that powers coalesce tries to balance partition size.
So actually repartition is nothing its def is look like below
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
coalesce(numPartitions, shuffle = true)
}
So its simply coalesce with shuffle but when call coalesce its shuffle will be by default false so it will not shuffle the data till its will not needed.
Example you have 2 cluster node and each have 2 partitions and now u call rdd.coalesce(2) so it will merge the local partitions of the node or if you call the coalesce(1) then it will need the shuffle because other 2 partition will be on another node so may be in your case it will join local node partitions and that node have less number of partitions so ur partition size is not uniform.
ok according to your editing of question i also try to do the same as follows
val data = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10,11,12))
data.getNumPartitions
res2: Int = 4
data.mapPartitionsWithIndex{case (a,b)=>println("partitionssss"+a);b.map(y=>println("dataaaaaaaaaaaa"+y))}.count
the output of above code will be
And now i coalesce the 4 partition to 2 and run the same code on that rdd to check how optimize spark coalesce the data so the output will be
Now you can easily see that the spark equally distribute the data to both the partitions 6-6 even before coalesce it the number of elements are not same in all partitions.
val coal=data.coalesce(2)
coal.getNumPartitions
res4: Int = 2
coal.mapPartitionsWithIndex{case (a,b)=>println("partitionssss"+a);b.map(y=>println("dataaaaaaaaaaaa"+y))}.count

Resources