How to auto calculate numRepartition while using spark dataframe write - apache-spark

When I write a DataFrame to a Hive Parquet partitioned table
df.write.partitionBy("key").mode("append").format("hive").saveAsTable("db.table")
it creates a lot of blocks in HDFS, and each block holds only a small amount of data.
I understand how this happens: each Spark sub-task creates a block and writes data to it.
I also understand that more blocks can improve Hadoop performance up to a point, but beyond a threshold performance starts to degrade.
If I want to set numPartition automatically, does anyone have a good idea?
val numPartition = ??? // auto-calculated from the DataFrame size or similar
df.repartition(numPartition).write
.partitionBy("key")
.format("hive")
.saveAsTable("db.table")

First of all, why do you want an extra repartition step when you are already using partitionBy("key")? Your data will already be partitioned based on the key.
Generally, repartitioning by a column value is a common scenario; it helps with operations like reduceByKey, filtering on a column value, etc. For example,
import spark.implicits._ // required for toDF on a local collection

val birthYears = List(
  (2000, "name1"),
  (2000, "name2"),
  (2001, "name3"),
  (2000, "name4"),
  (2001, "name5")
)
val df = birthYears.toDF("year", "name")
df.repartition($"year")

By default, Spark creates 200 partitions for shuffle operations, so 200 files/blocks (each possibly quite small) will be written to HDFS.
Configure the number of partitions created after a shuffle to match your data using the following configuration:
spark.conf.set("spark.sql.shuffle.partitions", <number of partitions>)
For example, spark.conf.set("spark.sql.shuffle.partitions", "5") makes Spark create 5 partitions, and 5 files will be written to HDFS.
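To tie this back to the numPartition placeholder in the question, below is a minimal Scala sketch of one way to derive the repartition count from an estimated DataFrame size. The 128 MB target file size and the avgRowBytes figure are assumptions to tune, not values taken from the question, and note that count() triggers an extra job.
// Assumptions: avgRowBytes approximates the serialized size of one row,
// and targetFileBytes is the desired output file size (~128 MB).
val targetFileBytes = 128L * 1024 * 1024
val avgRowBytes     = 100L
val numPartition    = ((df.count() * avgRowBytes) / targetFileBytes + 1).toInt

df.repartition(numPartition)
  .write
  .partitionBy("key")
  .mode("append")
  .format("hive")
  .saveAsTable("db.table")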

Related

Avoid data shuffle and coalesce-numPartitions is not applied to individual partition while doing left anti-join in spark dataframe

I have two DataFrames, target_df and reference_df. I need to remove the account_ids in target_df that are present in reference_df.
target_df is created from a Hive table and has hundreds of partitions. It is partitioned by date (20220101 to 20221101).
I am doing a left anti-join and writing the data to an HDFS location.
val numPartitions = 10
val df_purge = spark.sql(s"SELECT /*+ BROADCASTJOIN(ref) */ target.* FROM input_table target LEFT ANTI JOIN ${reference_table} ref ON target.${Customer_ID} = ref.${Customer_ID}")
df_purge.coalesce(numPartitions).write.partitionBy("date").mode("overwrite").parquet("hdfs_path")
I need the numPartitions value to apply to each date partition, but it is being applied to the entire DataFrame. For example, if there are 100 date partitions, I need 100 * 10 = 1000 part files. This code is not working as expected. I tried repartitioning by "date", but that causes a huge data shuffle.
Can anyone please provide an optimized solution? Thanks!
I am afraid you cannot skip the shuffle in this case. repartition/coalesce/partitionBy all work at the dataset level, and I don't think there is a way to just split each partition into 10 without a shuffle.
You tried coalesce, which indeed does not cause a shuffle, but coalesce can only be used to decrease the number of partitions, so it is not going to help you here.
You can try to achieve what you want with a combination of repartition and partitionBy. Here is a description of both functions (the same applies to Scala; source: https://sparkbyexamples.com):
PySpark repartition() is a DataFrame method that is used to increase
or reduce the partitions in memory; when written to disk, it creates
all part files in a single directory.
PySpark partitionBy() is a method of the DataFrameWriter class which is
used to write the DataFrame to disk in partitions, one sub-directory
for each unique value in the partition columns.
If you first repartition your dataset with repartition(1000), Spark creates 1000 partitions in memory. Later, when you call partitionBy, Spark creates a sub-directory for each value and one part file for every in-memory partition that contains the given key.
So if, after the repartition, date X appears in 500 of the 1000 partitions, you will find 500 files in the sub-directory for that date.
In the article I mentioned previously you can find a simple example of this behaviour; check chapter 1.3, partitionBy(colNames : String*) Example:
#Use repartition() and partitionBy() together
dfRepart.repartition(2) \
    .write.option("header", True) \
    .partitionBy("state") \
    .mode("overwrite") \
    .csv("c:/tmp/zipcodes-state-more")

Why does Spark crossJoin take so long for a tiny dataframe?

I'm trying to do the following crossJoin on two DataFrames with 5 rows each, but Spark spawns 40000 tasks on my machine and it takes 30 seconds to complete. Any idea why that is happening?
df = spark.createDataFrame([['1','1'],['2','2'],['3','3'],['4','4'],['5','5']]).toDF('a','b')
df = df.repartition(1)
df.select('a').distinct().crossJoin(df.select('b').distinct()).count()
You call .distinct before the join; it requires a shuffle, so the data is repartitioned based on the spark.sql.shuffle.partitions property. Thus, df.select('a').distinct() and df.select('b').distinct() each result in a new DataFrame with 200 partitions, and 200 x 200 = 40000.
Two things: it looks like you cannot directly control the number of partitions a DataFrame is created with, so we can first create an RDD instead (where you can specify the number of partitions) and convert it to a DataFrame. You can also set the shuffle partitions to 1. Together these ensure you have just 1 partition during the whole execution, which should speed things up.
Just note that this shouldn't be an issue at all for the larger datasets Spark is designed for (it would be faster to get the same result on a dataset of this size without using Spark at all). So in the general case you won't really need to do this; instead, tune the number of partitions to your resources and data.
spark.conf.set("spark.default.parallelism", "1")
spark.conf.set("spark.sql.shuffle.partitions", "1")
df = sc.parallelize([['1','1'],['2','2'],['3','3'],['4','4'],['5','5']], 1).toDF(['a','b'])
df.select('a').distinct().crossJoin(df.select('b').distinct()).count()
spark.conf.set changes the configuration for the current session only; if you want more permanent changes, make them in the actual Spark conf file.
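For completeness, a small Scala sketch (assuming a SparkSession named spark with default settings) that makes the partition arithmetic above visible:
import spark.implicits._

val df    = Seq(("1", "1"), ("2", "2"), ("3", "3"), ("4", "4"), ("5", "5")).toDF("a", "b")
val left  = df.select("a").distinct()
val right = df.select("b").distinct()

// distinct forces a shuffle, so each side ends up with
// spark.sql.shuffle.partitions (200 by default) partitions
println(left.rdd.getNumPartitions)   // 200
println(right.rdd.getNumPartitions)  // 200
// a cartesian join of the two sides then schedules roughly 200 * 200 = 40000 tasks,
// which is what the question observed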

Is there a way to maintain appropriate number of files while doing insert overwrite using spark?

I am constantly using "insert overwrite table table_name partition(partition_column) query" to write data into my table, but the problem here is the number of files generated.
So I started using the spark.sql.shuffle.partitions property to fix the number of files.
Now the problem is that some partitions hold very little data while others hold a huge amount. When I choose my shuffle partitions based on the large partitions, unnecessary small files get created for the small ones; and if I choose shuffle partitions based on the partitions with little data, the job starts failing with memory issues.
Is there a good way to solve this?
You need to consider .repartition() in this case, as repartition produces partitions of almost the same size, which keeps processing times more even.
You need to come up with a number of partitions for the DataFrame, based on its row count, size, etc., and apply repartition, as sketched below.
Refer to this link to dynamically choose the repartition count based on the number of rows in the DataFrame.
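A minimal Scala sketch of that idea, assuming a target number of rows per output file; rowsPerFile is only an illustrative value, and table_name stands in for the actual table from the insert overwrite statement.
// Assumption: rowsPerFile is tuned to your row width; 1,000,000 is illustrative.
val rowsPerFile   = 1000000L
val numPartitions = (df.count() / rowsPerFile + 1).toInt   // +1 avoids ending up with 0 partitions

df.repartition(numPartitions)
  .write
  .mode("overwrite")
  .insertInto("table_name")   // hypothetical table name, mirroring the insert overwrite in the question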
The function you are looking for is SizeEstimator; it returns the number of bytes your object occupies. Spark is horrendous when it comes to files and the number of files. To control the number of files being output, you will want to run the repartition command, because the number of output files from Spark is directly tied to the number of partitions the object has. In my example below I take the size of an arbitrary input DataFrame and find the "true" number of partitions (the reason for the + 1 is that integer division on longs and ints rounds down, so without it 0 partitions would be possible).
Hope this helps!
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.SizeEstimator

// estimated in-memory size of the input DataFrame, in bytes
val inputDF2 : Long = SizeEstimator.estimate(inputDF.rdd)
//find its appropriate number of partitions (134217728 bytes = 128 MB per partition)
val numPartitions : Long = (inputDF2 / 134217728L) + 1
//write it out with that many partitions
val outputDF = inputDF.repartition(numPartitions.toInt)
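As a usage note, writing the repartitioned frame then yields roughly numPartitions part files (per partitionBy sub-directory, if one is used); the output path below is hypothetical.
outputDF.write.mode("overwrite").parquet("hdfs:///tmp/output")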

How to set multiple spark configurations for same spark job

I am dealing with a weird situation where I have small tables and big tables to process using Spark, and it must be a single Spark job.
To achieve the best performance, I need to set
spark.sql.shuffle.partitions = 12 for small tables and
spark.sql.shuffle.partitions = 500 for bigger tables
I want to know how I can change these properties dynamically in Spark.
Can I have multiple configuration files and load them within the program?
The apache-spark documentation describes this property:
spark.sql.shuffle.partitions (default: 200) Configures the number of partitions to use when shuffling data for joins or aggregations.
According to the description above, you just need to load your data into RDDs and, when you want to apply a join or aggregation on them, change the number of partitions using the repartition method.
As mentioned in the documentation, the default value is 200, so you need to increase the partition count in one case and decrease it in the other. For that you can use the repartition and coalesce methods. repartition is useful for both decreasing and increasing, but it has a shuffle overhead; for decreasing, coalesce does not have this overhead and is more optimized than repartition.
If the tables are stored as Parquet files, for instance, you could inspect the files in advance, determine whether each table is small or big, and change the value of your shuffle partitions accordingly:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)

// total size on disk of the table's files, in bytes
val total = hdfs.getContentSummary(new Path(pathTable)).getLength

// x is your size threshold in bytes; apply the chosen value before the join/aggregation runs
val shufflePartitions = if (total < x) "12" else "500"
spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions)
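A hedged usage sketch follows; smallDf, bigDf, and outputPath are hypothetical names. The point is only the ordering: set the configuration first, then run the shuffle-producing operation for that table.
// Hypothetical DataFrames and output path; each conf change takes effect
// for the queries planned after it within the same session.
spark.conf.set("spark.sql.shuffle.partitions", "12")
val smallResult = smallDf.groupBy("key").count()
smallResult.write.mode("overwrite").parquet(outputPath + "/small")

spark.conf.set("spark.sql.shuffle.partitions", "500")
val bigResult = bigDf.groupBy("key").count()
bigResult.write.mode("overwrite").parquet(outputPath + "/big")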

Number of Partitions of Spark Dataframe

Can anyone explain how the number of partitions is determined for a Spark DataFrame?
I know that for an RDD we can specify the number of partitions at creation time, like below.
val RDD1 = sc.textFile("path" , 6)
But for a Spark DataFrame there does not seem to be an option to specify the number of partitions at creation time, as there is for an RDD.
The only possibility I can think of is to use the repartition API after creating the DataFrame.
df.repartition(4)
So can anyone please let me know if we can specify the number of partitions while creating a DataFrame?
You cannot, or at least not in the general case, but it is not that different from an RDD. For example, the textFile code you've provided sets only the minimum number of partitions.
In general:
Datasets generated locally using methods like range or toDF on a local collection will use spark.default.parallelism.
Datasets created from an RDD inherit the number of partitions from their parent.
Datasets created using the data source API:
In Spark 1.x this typically depends on the Hadoop configuration (min / max split size).
In Spark 2.x a Spark SQL specific configuration is used.
Some data sources may provide additional options which give more control over partitioning. For example, the JDBC source allows you to set the partitioning column, a range of values, and the desired number of partitions.
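A quick Scala sketch for checking these cases yourself; the Parquet path is hypothetical.
import spark.implicits._

// Locally generated Dataset: partition count follows spark.default.parallelism
val local = spark.range(0, 1000000)
println(local.rdd.getNumPartitions)

// Data source read: partition count is driven by file splits / Spark SQL settings
val fromFiles = spark.read.parquet("hdfs:///some/path")   // hypothetical path
println(fromFiles.rdd.getNumPartitions)

// Created from an RDD: inherits the parent's partitioning
val fromRdd = sc.textFile("path", 6).toDF("line")
println(fromRdd.rdd.getNumPartitions)   // at least 6 (minPartitions)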
Default number of shuffle partitions for a Spark DataFrame: 200 (spark.sql.shuffle.partitions).
Default number of partitions for an RDD: given by spark.default.parallelism (e.g. 10, depending on the environment).
