Spark repartitioning by column with dynamic number of partitions per column - apache-spark

How can a DataFrame be partitioned based on the number of rows that share a value in a column? Suppose we have a DataFrame with 100 people (the columns are first_name and country) and we'd like to create a partition for every 10 people in a country.
If our dataset contains 80 people from China, 15 people from France, and 5 people from Cuba, then we'll want 8 partitions for China, 2 partitions for France, and 1 partition for Cuba.
Here is code that will not work:
df.repartition($"country"): This will create 1 partition for China, one partition for France, and one partition for Cuba
df.repartition(8, $"country", rand): This will create up to 8 partitions for each country, so it should create 8 partitions for China, but the France & Cuba partitions are unknown. France could be in 8 partitions and Cuba could be in up to 5 partitions. See this answer for more details.
When I look at the repartition() documentation, I don't even see a method that takes three arguments, so it looks like some of this behavior isn't documented.
Is there any way to dynamically set the number of partitions for each column? It would make creating partitioned data sets way easier.

You're not going to be able to accomplish that exactly, due to the way Spark partitions data. Spark hashes the values of the columns you specified in repartition and takes that hash modulo the number of partitions, so the number of partitions is deterministic. The reason it works this way is that joins need a matching number of partitions on the left and right side of the join, in addition to the guarantee that the hashing is the same on both sides.
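Here's a minimal sketch (not from the original answer) of what that means in practice, assuming a local spark-shell-style session: the bucket computed as pmod(hash(cols), numPartitions) matches the partition Spark actually places each row in.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash, lit, pmod, spark_partition_id}

val spark = SparkSession.builder().master("local[*]").appName("hash-partition-sketch").getOrCreate()
import spark.implicits._

val df = Seq("China", "China", "France", "Cuba").toDF("country")

val numPartitions = 8

df.repartition(numPartitions, col("country"))
  .withColumn("expected_bucket", pmod(hash(col("country")), lit(numPartitions)))
  .withColumn("actual_partition", spark_partition_id())
  .show()
// expected_bucket and actual_partition agree row by row: every row of a given
// country lands in exactly one of the 8 partitions, determined by its hash.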
"we'd like to create a partition for every 10 people in a country."
What exactly are you trying to accomplish here? Having only 10 rows in a partition is likely terrible for performance. Are you trying to create a partitioned table where each of the files in a partition is guaranteed to have only x rows?
"df.repartition($"country"): This will create 1 partition for China, one partition for France, and one partition for Cuba"
This will actually create a DataFrame with the default number of shuffle partitions, hashed by country:
def repartition(partitionExprs: Column*): Dataset[T] = {
  repartition(sparkSession.sessionState.conf.numShufflePartitions, partitionExprs: _*)
}
"df.repartition(8, $"country", rand): This will create up to 8 partitions for each country, so it should create 8 partitions for China, but the France & Cuba partitions are unknown. France could be in 8 partitions and Cuba could be in up to 5 partitions. See this answer for more details."
Likewise, this is subtly wrong. There are only 8 partitions in total, with the countries essentially randomly shuffled among those 8 partitions.
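To make both points concrete, here is a small hedged sketch (assuming the spark-shell session and implicits from the sketch above): repartitioning by country alone uses spark.sql.shuffle.partitions, while repartition(8, $"country", rand()) produces exactly 8 partitions overall, with each country's rows scattered among them.

import org.apache.spark.sql.functions.{col, countDistinct, rand, spark_partition_id}

val people = (Seq.fill(80)("China") ++ Seq.fill(15)("France") ++ Seq.fill(5)("Cuba")).toDF("country")

// Default shuffle partitions (200 unless configured otherwise), hashed by country:
println(people.repartition(col("country")).rdd.getNumPartitions)

// Exactly 8 partitions in total, not 8 per country:
println(people.repartition(8, col("country"), rand()).rdd.getNumPartitions)

// How many of those 8 partitions each country's rows end up in:
people.repartition(8, col("country"), rand())
  .withColumn("pid", spark_partition_id())
  .groupBy("country")
  .agg(countDistinct("pid").as("partitions_used"))
  .show()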

Here's the code (Spark 2.2+) that'll create at most ten rows per data file (sample dataset is here):
val outputPath = new java.io.File("./tmp/partitioned_lake5/").getCanonicalPath

df
  .repartition(col("person_country"))
  .write
  .option("maxRecordsPerFile", 10)
  .partitionBy("person_country")
  .csv(outputPath)
Here's the pre-Spark 2.2 code that'll create roughly ten rows per data file:
import org.apache.spark.sql.functions.{col, rand}
import org.apache.spark.sql.types.IntegerType

val desiredRowsPerPartition = 10

// rows per country (countDF isn't defined in the original snippet; this is the assumed definition)
val countDF = df.groupBy("person_country").count()

val joinedDF = df
  .join(countDF, Seq("person_country"))
  .withColumn(
    "my_secret_partition_key",
    (rand(10) * col("count") / desiredRowsPerPartition).cast(IntegerType)
  )

val outputPath = new java.io.File("./tmp/partitioned_lake6/").getCanonicalPath

joinedDF
  .repartition(col("person_country"), col("my_secret_partition_key"))
  .drop("count", "my_secret_partition_key")
  .write
  .partitionBy("person_country")
  .csv(outputPath)
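A hedged way to sanity-check either approach (assuming the output written above) is to read the files back and count rows per physical file with input_file_name():

import org.apache.spark.sql.functions.input_file_name

spark.read
  .csv(outputPath)
  .groupBy(input_file_name().as("file"))
  .count()
  .orderBy("count")
  .show(truncate = false)
// Each file should hold roughly (or at most) desiredRowsPerPartition rows.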

Related

If multiple DataFrames are repartitioned on the same column with the same values, will they be colocated?

I have 10 DataFrames and they all share a common column comm_col, with a distinct count of, let's say, 100 for that column in each DataFrame.
If I repartition those 10 tables on comm_col, what are the chances that the partitions with the same values in each DataFrame are colocated on the same executor?
I am interested in colocation because I want to join all 10 tables, and repartitioning would break each table down into common chunks, making other parts of the table irrelevant.
Let's say I have 5 executors with 3 cores and 10 GB of memory each.
Each DataFrame has roughly 20 million rows.

How to repartition into a fixed number of partitions per column in Spark?

I need to read data from one Hive table and insert it into another Hive table. The schema of both tables is the same. The table is partitioned by date and country. The size of each partition is ~500 MB. I want to insert this data into a new table where the files inside each partition are roughly 128 MB (i.e., 4 files).
Step 1: Read data from the source table in Spark.
Step 2: Repartition by columns (country, date) with the number of partitions set to 4.
df.repartition(4, col("country_code"), col("record_date"))
I am getting only 1 partition per country_code & record_date.
Whatever you are doing in step 2 will repartition your data into 4 partitions in memory, but it won't save 4 files if you do df.write.
In order to do that you can use the code below:
df.repartition(4, col("country_code"), col("record_date"))
  .write
  .partitionBy("country_code", "record_date")
  .mode(SaveMode.Append)
  .saveAsTable("TableName")
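As a side note, repartition(4, col("country_code"), col("record_date")) gives 4 shuffle partitions in total, and because every row of a given (country_code, record_date) hashes to the same one, each output directory can still end up with a single file, which matches the behaviour described in the question. If the goal is roughly 4 files per directory, a hedged alternative (mirroring the rand-based salt trick earlier on this page; the salt column name is made up) would be:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, rand}

val filesPerPartition = 4  // target number of files per (country_code, record_date)

df.withColumn("salt", (rand() * filesPerPartition).cast("int"))
  .repartition(col("country_code"), col("record_date"), col("salt"))
  .drop("salt")
  .write
  .partitionBy("country_code", "record_date")
  .mode(SaveMode.Append)
  .saveAsTable("TableName")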

Spark Generates a Lot of Tasks Although Partition Number Is 1 (PySpark)

My code:
df = self.sql_context.sql(f"select max(id) as id from {table}")
return df.collect()[0][0]
My table is partitioned by id - it has 100M records but only 3 distinct ids.
I expected this query to work with 1 task and scan just the partition column (id).
I don't understand how I have 691 tasks for the collect line with just 3 partitions.
I guess the query is executing a full scan on the table, but I can't figure out why it doesn't just scan the metadata.
Your df contains the result of an aggregation over the entire table; it contains only one row (with a single field, max(id)), which is why it has only 1 partition.
But the original table's DataFrame may have many partitions (or only 1 partition whose computation needs ~600 stages, each triggering 1 task, which is not that common).
Without details on your parallelism configuration, input source type, and transformations, it is not easy to help more!
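If the table really is a Hive-style table partitioned by id, one hedged workaround (a Scala sketch for consistency with the rest of this page; the table name my_db.my_table is made up) is to read the partition values from the metastore instead of scanning rows:

// Sketch: take the max over partition values listed by the metastore, not over rows.
// Assumes partition specs of the form "id=<number>".
val maxId = spark.sql("SHOW PARTITIONS my_db.my_table")
  .collect()
  .map(_.getString(0))              // e.g. "id=42"
  .map(_.stripPrefix("id=").toLong)
  .max

println(maxId)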

Does using multiple columns when partitioning a Spark DataFrame make reads slower?

I wonder whether using multiple columns when writing a Spark DataFrame makes future reads slower?
I know partitioning on columns that are critical for future filtering improves read performance, but what would be the effect of having multiple partition columns, even ones that aren't used for filtering?
A sample would be:
(ordersDF
.write
.format("parquet")
.mode("overwrite")
.partitionBy("CustomerId", "OrderDate", .....) # <----------- add many columns
.save("/storage/Orders_parquet"))
Yes, because Spark has to shuffle and sort the data to create that many partitions, and there will be many combinations of the partition keys.
For example:
suppose CustomerId has 10 unique values
suppose OrderDate has 10 unique values
suppose Order has 10 unique values
The number of partitions will be 10 * 10 * 10.
In this small scenario 1,000 buckets need to be created, so there is a lot of shuffling and sorting, which means more time.
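A small hedged sketch (Scala, for consistency with the rest of this page) of how to gauge that cost up front: count the distinct combinations of the intended partition columns before writing, since that is roughly the number of directories partitionBy will create.

import org.apache.spark.sql.functions.countDistinct

// One row with the number of distinct (CustomerId, OrderDate) combinations,
// i.e. approximately the number of output directories partitionBy would produce.
ordersDF
  .agg(countDistinct("CustomerId", "OrderDate").as("expected_directories"))
  .show()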

Spark DataFrame RangePartitioner

[New to Spark] Language - Scala
As per the docs, RangePartitioner sorts and divides the elements into chunks and distributes the chunks to different machines. How would it work for the example below?
Let's say we have a DataFrame with 2 columns, and one column (say 'A') has continuous values from 1 to 1000. There is another DataFrame with the same schema, but the corresponding column has only 4 values: 30, 250, 500, 900. (These could be any values, randomly selected from 1 to 1000.)
If I partition both using RangePartitioner,
df_a.repartitionByRange($"A")
df_b.repartitionByRange($"A")
how will the data from both DataFrames be distributed across the nodes?
Assuming that the number of partitions is 5.
Also, if I know that the second DataFrame has fewer distinct values, will reducing the number of partitions for it make any difference?
What I am struggling to understand is how Spark maps one partition of df_a to a partition of df_b, and how it sends (if it does) both those partitions to the same machine for processing.
A very detailed explanation of how RangePartitioner works internally is described here
Specific to your question, RangePartitioner samples the RDD at runtime, collects the statistics, and only then evaluates the ranges (boundaries). Note that there are 2 parameters here: ranges (logical) and partitions (physical). The number of partitions can be affected by many factors - the number of input files, the number inherited from the parent RDD, 'spark.sql.shuffle.partitions' in case of shuffling, etc. The ranges are evaluated according to the sampling. In any case, RangePartitioner ensures every range is contained in a single partition.
"how will the data from both DataFrames be distributed across the nodes?" / "how Spark maps one partition of df_a to a partition of df_b"
I assume you implicitly mean joining 'A' and 'B'; otherwise the question does not make sense. In that case, Spark would make sure to match partitions with ranges on both DataFrames, according to their statistics.
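Here is a hedged sketch (assuming Spark 2.3+'s Dataset.repartitionByRange) of how to see the ranges in practice: repartition each DataFrame into 5 range partitions and inspect the min/max of A inside each physical partition.

import org.apache.spark.sql.functions.{col, count, max, min, spark_partition_id}

df_a.repartitionByRange(5, col("A"))
  .withColumn("pid", spark_partition_id())
  .groupBy("pid")
  .agg(min("A").as("low"), max("A").as("high"), count("*").as("rows"))
  .orderBy("pid")
  .show()
// Running the same on df_b shows which of its 4 values fall into which range; the
// boundaries come from sampling each DataFrame independently, so the two sets of
// ranges are not guaranteed to be identical.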
