Repartition followed by coalesce is not honored - apache-spark

I would like to spin up a lot of tasks when doing my calculation but coalesce into a smaller set of partitions when writing to the table.
A simple example for demonstration is given below, where the repartition is NOT honored during execution.
My expected output is that the map operation happens in 100 partitions and finally collect happens in only 10 partitions.
It seems Spark has optimized the execution by ignoring the repartition. It would be helpful if someone can explain how to achieve my expected behavior.
sc.parallelize(range(1,1000)).repartition(100).map(lambda x: x*x).coalesce(10).collect()

Instead of coalesce, using repartition helps to achieve the expected behavior.
sc.parallelize(range(1,1000)).repartition(100).map(lambda x: x*x).cache().repartition(10).collect()
This helps to solve my problem, but I would still appreciate an explanation of this behavior.
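For reference, a minimal sketch (assuming the same sc as in the question's snippets) of the RDD-level alternative: coalesce also accepts a shuffle flag, and passing shuffle=True inserts a stage boundary just like repartition does, so the map still runs in 100 tasks before the result is reduced to 10 partitions.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Narrow coalesce: the map stage itself is collapsed into 10 tasks.
narrow = sc.parallelize(range(1, 1000)).repartition(100).map(lambda x: x * x).coalesce(10)

# coalesce with shuffle=True adds a shuffle boundary, so the map runs in 100 tasks
# and only the stage after the shuffle uses 10 partitions (same cost as repartition).
shuffled = sc.parallelize(range(1, 1000)).repartition(100).map(lambda x: x * x).coalesce(10, shuffle=True)

print(narrow.getNumPartitions())    # 10
print(shuffled.getNumPartitions())  # 10, but fed by a 100-task map stage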

"Returns a new Dataset that has exactly numPartitions partitions, when (sic) the fewer partitions are requested. If a larger number of partitions is requested, it will stay at the current number of partitions. "
Source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset#coalesce(numPartitions:Int):org.apache.spark.sql.Dataset[T]

Related

Why can coalesce lead to too few nodes for processing?

I am trying to understand Spark partitions, and in a blog I came across this passage:
However, you should understand that you can drastically reduce the parallelism of your data processing — coalesce is often pushed up further in the chain of transformation and can lead to fewer nodes for your processing than you would like. To avoid this, you can pass shuffle = true. This will add a shuffle step, but it also means that the reshuffled partitions will be using full cluster resources if possible.
I understand that coalesce means to take the data on some of the executors containing the least data and shuffle it to already existing executors via a hash partitioner. I am not able to understand what the author is trying to say in this paragraph, though. Can somebody please explain what is being said here?
Coalesce has some not-so-obvious effects due to Spark Catalyst.
E.g., let's say you had a parallelism of 1000, but you only wanted to write 10 files at the end. You might think you could do:
load().map(…).filter(…).coalesce(10).save()
However, Spark will effectively push down the coalesce operation to as early a point as possible, so this will execute as:
load().coalesce(10).map(…).filter(…).save()
You can read the details in an excellent article, which I quote from and chanced upon some time ago: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908
In summary: Catalyst's treatment of coalesce can reduce concurrency early in the pipeline. I think this is what is being alluded to, though of course every case is different, and joins and aggregations are in general not subject to such effects because of the default of 200 shuffle partitions that applies to those Spark operations.
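To make that concrete, here is a hedged DataFrame sketch; the table name, column and output paths are illustrative and not from the original example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")  # "events" is a hypothetical wide table with an "id" column

# Narrow coalesce: the select/filter upstream of it also runs in only 10 tasks.
df.selectExpr("id * 2 AS doubled").filter("doubled > 0") \
  .coalesce(10).write.mode("overwrite").parquet("/tmp/out_coalesce")

# repartition adds a shuffle boundary: upstream work keeps its parallelism,
# and only the final write stage runs with 10 tasks.
df.selectExpr("id * 2 AS doubled").filter("doubled > 0") \
  .repartition(10).write.mode("overwrite").parquet("/tmp/out_repartition")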
As you said in your question, "coalesce means to take the data on some of the least-data-containing executors and shuffle them to already existing executors via a hash partitioner". This effectively means the following:
The number of partitions has been reduced.
The main difference between repartition and coalesce is that coalesce moves less data across partitions than repartition, reducing the amount of shuffling and therefore being more efficient.
Adding shuffle=true just distributes the data evenly across the nodes, which is the same as using repartition(). You can use shuffle=true if you feel that your data might become skewed across the nodes after performing coalesce.
Hope this answers your question
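One way to see the skew point is to compare per-partition row counts with and without the shuffle flag; a small sketch using an assumed SparkContext (exact counts will vary):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Build an RDD whose partitions are deliberately uneven in size.
rdd = sc.parallelize(range(1000), 8).filter(lambda x: x % 7 == 0 or x < 200)

# glom() turns each partition into a list, so map(len) gives per-partition counts.
print(rdd.coalesce(2).glom().map(len).collect())               # merges existing partitions, can stay skewed
print(rdd.coalesce(2, shuffle=True).glom().map(len).collect()) # shuffle spreads rows roughly evenly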

How to distribute data into X partitions on read with Spark?

I’m trying to read data from Hive with a Spark DataFrame and distribute it into a specific, configurable number of partitions (in correlation to the number of cores). My job is pretty straightforward and does not contain any joins or aggregations. I’ve read about the spark.sql.shuffle.partitions property, but the documentation says:
Configures the number of partitions to use when shuffling data for joins or aggregations.
Does this mean that configuring this property would be irrelevant for me? Or is the read operation considered a shuffle? If not, what is the alternative? Repartition and coalesce seem a bit like overkill for that matter.
To verify my understanding of your problem: you want to increase the number of partitions in the RDD/DataFrame that is created immediately after reading the data.
In this case the property you are after is spark.sql.files.maxPartitionBytes, which controls the maximum amount of data that can be packed into a single partition (please refer to https://spark.apache.org/docs/2.4.0/sql-performance-tuning.html).
The default value is 128 MB, which can be overridden to improve parallelism.
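A minimal sketch of overriding it, assuming the Hive table is backed by files read through Spark's native data source; the table name and the 32 MB value are illustrative:
from pyspark.sql import SparkSession

# Lowering maxPartitionBytes makes Spark split the input into more, smaller partitions.
spark = (SparkSession.builder
         .config("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)  # 32 MB instead of the 128 MB default
         .getOrCreate())

df = spark.table("my_db.my_table")  # hypothetical Hive table
print(df.rdd.getNumPartitions())    # more partitions than with the default setting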
Read is not a shuffle as such. You need to get the data in at some stage.
The property in the other answer can be used; otherwise an algorithm within Spark sets the number of partitions upon a read.
You do not state whether you are using an RDD or a DataFrame. With an RDD you can set the number of partitions at read time; with a DataFrame you generally need to repartition after the read, as shown below.
Your point on controlling parallelism is less relevant when joining or aggregating, as you note.
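A brief sketch of both routes, with illustrative paths and partition counts:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# RDD route: ask for a minimum number of partitions directly at read time.
rdd = sc.textFile("/path/to/input", minPartitions=64)   # hypothetical path

# DataFrame route: read first, then repartition to the desired count
# (or tune spark.sql.files.maxPartitionBytes as described in the other answer).
df = spark.read.parquet("/path/to/input").repartition(64)
print(rdd.getNumPartitions(), df.rdd.getNumPartitions())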

Difference between repartition(1) and coalesce(1)

In our project we are using repartition(1) to write data into a table. I am interested to know why coalesce(1) cannot be used here, given that repartition is a costly operation compared to coalesce.
I know repartition distributes data evenly across partitions, but when the output is a single part file, why can't we use coalesce(1)?
coalesce has an issue: if you call it with a number smaller than your current number of executors, the number of executors used to process that step will be limited by the number you passed to the coalesce function.
The repartition function avoids this issue by shuffling the data. In any scenario where you're reducing the data down to a single partition (or really, less than half your number of executors), you should almost always use repartition over coalesce because of this. The shuffle caused by repartition is a small price to pay compared to the single-threaded operation of a call to coalesce(1).
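A rough sketch of the contrast with a synthetic DataFrame and illustrative output paths: under repartition(1) the heavy part keeps its 200-way parallelism, while coalesce(1) collapses it into a single task.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Synthetic workload spread over 200 partitions.
heavy = spark.range(0, 10_000_000, numPartitions=200).selectExpr("id * id AS sq")

# A single task does all the work, including the squaring, before writing one file.
heavy.coalesce(1).write.mode("overwrite").parquet("/tmp/one_file_coalesce")

# 200 tasks do the squaring; only the post-shuffle write runs as a single task.
heavy.repartition(1).write.mode("overwrite").parquet("/tmp/one_file_repartition")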
You state nothing else in terms of logic, so:
coalesce will use existing partitions to minimize shuffling. In the case of coalesce(1) versus its counterpart it may not be a big deal, but one can take the guiding principle that repartition creates new partitions and hence does a full shuffle, whereas coalesce minimizes the amount of shuffling.
In my spare time I chanced upon this excellent article: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908. Look for the quote: "Coalesce sounds useful in some cases, but has some problems."

Repartitioning dataframe from Spark limit() function

I need to use the limit function to get n entries/rows from a dataframe. I do know it's not advisable, but this is meant as a pre-processing step which will not be required when actually implementing the code. However, I've read elsewhere that the resulting dataframe from using the limit function has only 1 partition.
I want to measure the processing time for my job, which should not be limited by this. I actually tried repartitioning but the performance improvement is minimal (if any at all). I checked the partitioning by printing out df.rdd.getNumPartitions() and it's still 1. Is there some way to force repartitioning to happen?
EDIT: Note that the getNumPartitions() was run after a count action.
EDIT2: Sample code
df = random_data.groupBy("col").count().sort(F.desc("count")).limit(100).repartition(10)
df.count()
print("No. of partitions: {0}".format(df.rdd.getNumPartitions())) # Prints 1
Calling cache() then count() worked.
I think Spark's lazy evaluation is not executing the repartition for some reason, but I'm not sure why since count is supposed to be an action.
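For reference, a sketch of the working version; it reuses the question's random_data DataFrame, so the column name and data are assumed:
from pyspark.sql import functions as F

df = (random_data.groupBy("col").count()
      .sort(F.desc("count"))
      .limit(100)
      .repartition(10)
      .cache())                       # mark the repartitioned result for caching

df.count()                            # action that materializes the cached, repartitioned data
print(df.rdd.getNumPartitions())      # now reports 10 instead of 1, per the edit above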

Why hardcode repartition value

Looking at some example Spark code, I see that the numbers in repartition or coalesce calls are hardcoded:
val resDF = df.coalesce(16)
What's the best approach to managing this parameter, given that a hardcoded value becomes irrelevant when the cluster can be resized dynamically in a matter of seconds?
Well, in examples it is common to see hardcoded values, so you do not need to worry; feel free to modify the example. The Partitions documentation is full of hardcoded values, but those values are just examples.
The rule of thumb about the number of partitions is:
one would want his RDD to have as many partitions as the product of the number of executors by the number of used cores by 3 (or maybe 4). Of course, that's a heuristic and it really depends on your application, dataset and cluster configuration.
However, notice that repartition doesn't come for free, so in a highly dynamic environment you have to be sure that the overhead of repartitioning is negligible compared to the gains you get from the operation.
Coalesce and repartition might have different costs, as I mention in my answer.
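A PySpark sketch of deriving the count from the running cluster instead of hardcoding it, following the rule of thumb quoted above; the factor of 3 and the DataFrame df are illustrative:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# defaultParallelism reflects the total cores currently available to the application,
# so the target adapts when the cluster is resized.
num_partitions = spark.sparkContext.defaultParallelism * 3
resDF = df.repartition(num_partitions)  # df is assumed to already exist, as in the example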
