Is there a way to maintain appropriate number of files while doing insert overwrite using spark? - apache-spark

I am constantly using "insert overwrite table table_name partition(partition_column) query" to write data into my table, but the problem is the number of files it generates.
So I started using the spark.sql.shuffle.partitions property to fix the number of files.
Now the problem is that some partitions contain very little data while others contain a huge amount. If I choose the shuffle partitions based on the large partitions, the small partitions end up as lots of unnecessary small files; if I choose them based on the partitions with little data, the job starts failing with memory issues.
Is there a good way to solve this?

You need to consider .repartition() in this case, as repartition produces partitions of almost equal size (at the cost of an extra shuffle, which adds some processing time).
You need to come up with a number of partitions for the dataframe, based on the dataframe's row count etc., and then apply repartition.
Refer to this link to dynamically derive the repartition count from the number of rows in the dataframe.
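For illustration, here is a minimal sketch of that idea, assuming a hypothetical target of about one million rows per output file (the helper name and the threshold are my own, not from the linked article):
import org.apache.spark.sql.DataFrame

// Hypothetical helper: derive a partition count from the dataframe's row count.
// targetRowsPerFile is an assumed tuning knob, not a Spark setting.
def repartitionByRowCount(df: DataFrame, targetRowsPerFile: Long = 1000000L): DataFrame = {
  val rows = df.count()                                   // note: this triggers an extra job
  val numPartitions = ((rows / targetRowsPerFile) + 1).toInt
  df.repartition(numPartitions)
}

// usage: repartitionByRowCount(resultDF).write.mode("overwrite").insertInto("table_name")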

The function you are looking for is SizeEstimator; it will return the number of bytes your data occupies. Spark is horrendous when it comes to files and the number of files. To control the number of files being output, you will want to run the repartition command, because the number of output files from Spark is directly associated with the number of partitions the object has. For my example below I take the size of an arbitrary input dataframe and find the "true" number of partitions (the reason for the + 1 is that integer division on longs rounds down, so without it you could end up with 0 partitions).
Hope this helps!
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.SizeEstimator
val inputDF2 : Long = SizeEstimator.estimate(inputDF.rdd)
//find an appropriate number of partitions, targeting ~128 MB (134217728 bytes) per partition
val numPartitions : Long = (inputDF2/134217728) + 1
//write it out with that many partitions
val outputDF = inputDF.repartition(numPartitions.toInt)
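If the goal is to land this back into the partitioned table from the original question, one possible follow-up (assuming Spark 2.3+ so that dynamic partition overwrite is available, and reusing the question's placeholder table name) is:
// overwrite only the partitions present in outputDF rather than the whole table
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
outputDF.write.mode("overwrite").insertInto("table_name")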

Related

Avoid data shuffle and coalesce-numPartitions is not applied to individual partition while doing left anti-join in spark dataframe

I have two dataframes - target_df and reference_df. I need to remove the account_ids in target_df that are present in reference_df.
target_df is created from a Hive table and will have hundreds of partitions. It is partitioned by date (20220101 to 20221101).
I am doing a left anti-join and writing the data to an HDFS location.
val numPartitions = 10
val df_purge = spark.sql(s"SELECT /*+ BROADCASTJOIN(ref) */ target.* FROM input_table target LEFT ANTI JOIN ${reference_table} ref ON target.${Customer_ID} = ref.${Customer_ID}")
df_purge.coalesce(numPartitions).write.partitionBy("date").mode("overwrite").parquet("hdfs_path")
I need the same numPartitions value applied to each partition, but it is being applied to the entire dataframe. For example, if there are 100 date partitions, I need 100 * 10 = 1000 part files. This code is not working as expected. I tried repartitioning by "date", but that causes a huge data shuffle.
Can anyone please suggest an optimized solution? Thanks!
I am afraid that you cannot skip the shuffle in this case. repartition/coalesce/partitionBy all work at the dataset level, and I don't think there is a way to just split each partition into 10 without a shuffle.
You tried to use coalesce, which indeed does not cause a shuffle, but coalesce can only be used to decrease the number of partitions, so it is not going to help you here.
You can try to achieve what you want by using a combination of repartition and partitionBy. Here is a description of both functions (the same applies to Scala; source: https://sparkbyexamples.com):
PySpark repartition() is a DataFrame method that is used to increase
or reduce the partitions in memory and, when written to disk, it creates
all part files in a single directory.
PySpark partitionBy() is a method of DataFrameWriter class which is
used to write the DataFrame to disk in partitions, one sub-directory
for each unique value in partition columns.
If you first repartition your dataset with repartition(1000), Spark is going to create 1000 partitions in memory. Later, when you call partitionBy, Spark is going to create a sub-directory for each value and one part file for every in-memory partition which contains the given key.
So if, after the repartition, date X is present in 500 partitions out of 1000, you will find 500 files in the sub-directory for this date.
In the article which I mentioned previously you can find a simple example of this behaviour; check chapter 1.3 partitionBy(colNames : String*) Example:
# Use repartition() and partitionBy() together
dfRepart.repartition(2) \
    .write.option("header", True) \
    .partitionBy("state") \
    .mode("overwrite") \
    .csv("c:/tmp/zipcodes-state-more")

How to control number of files generated while setting large partitions in spark?

Because of the large amount of input data, I set a large number of shuffle partitions in Spark (spark.sql.shuffle.partitions=1000). However, the output is small (~1 GB in total), yet it is written as lots of small files (3000 files, each smaller than 1 MB). How can I combine these small files into one big file?
Another question: why is the number of output files 3 times the number of shuffle partitions?
As per the Spark docs, the spark.sql.shuffle.partitions parameter "configures the number of partitions to use when shuffling data for joins or aggregations". To control the number of output files, use the repartition() method before writing the output. So something like this:
df
.filter(...) // some transformations
.join(...)
.repartition(1) // move data into a single partition
.write
.format(...)
.save(...)
The snippet above would result in a single output file.
You are not limited to repartitioning your data once - you can repartition as often as you need, but bear in mind that this is a costly operation:
df
.filter(...) // some transformations
.repartition(...) // repartition to improve join performance
.join(...)
.repartition(1) // move data into a single partition
.write
.format(...)
.save(...)
If you want a good explanation of how repartition works, here is a great answer:
Spark - repartition() vs coalesce()
For more information on how to improve the performance of the joins, refer to the Spark docs:
https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints-for-sql-queries
Since you have a large number of partitions, you may need to coalesce your dataframe. coalesce will decrease the number of partitions.
val df_res = df.coalesce(10)
This should decrease the number of output files from 1000 to just 10, or you can use coalesce(1) to create one big file.
coalesce uses the existing partitions and minimizes data shuffling, but the resulting partitions may end up with different sizes.
The number of output files is equal to the number of partitions. The spark.sql.shuffle.partitions property is only used when shuffling data for joins or aggregations.
You can call df.repartition() on your dataframe to increase or decrease the number of partitions.
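To make the difference concrete, here is a small sketch (the target of 10 files and the output path are placeholders):
// coalesce: merges existing partitions without a full shuffle; resulting files may be uneven in size
val dfCoalesced = df.coalesce(10)

// repartition: full shuffle; produces roughly evenly sized partitions, and therefore files
val dfRepartitioned = df.repartition(10)

dfRepartitioned.write.mode("overwrite").parquet("output_path")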

Spark coalescing on the number of objects in each partition

We are starting to experiment with Spark on our team.
After we do a reduce job in Spark, we would like to write the result to S3; however, we would like to avoid collecting the result on the driver.
For now, we are writing the files using foreachPartition on the RDD, but this results in a lot of small files. We would like to be able to aggregate the data into a few files, split by the number of objects written to each file.
So for example, our total data is 1M objects (this is constant), and we would like to produce files of 400K objects each, while our current partitioning produces files of around 20K objects (this varies a lot from job to job). Ideally we want to produce 3 files containing 400K, 400K and 200K objects instead of 50 files of 20K objects.
Does anyone have a good suggestion?
My thought process is to let each partition work out which file it should write to, by assuming that each partition produces roughly the same number of objects.
So for example, partition 0 will write to the first file, while partition 21 will write to the second file, since it will assume that the starting index of its objects is 20000 * 21 = 420000, which is bigger than the file size limit.
Partition 41 will write to the third file, since its starting index is bigger than 2 * the file size limit.
This will not always hit the 400K file size limit exactly, though; it is more of an approximation.
I understand that there is coalescing, but as I understand it coalesce reduces the number of partitions based on the number of partitions wanted. What I want is to coalesce the data based on the number of objects in each partition. Is there a good way to do that?
What you want to do is repartition the data into three partitions; the data will be split into approximately 333K records per partition. The split is approximate; it will not be exactly 333,333 per partition. I do not know of a way to get the 400K/400K/200K split you want.
If you have a DataFrame df, you can repartition it into n partitions with
df.repartition(n)
Since you want a maximum number of records per partition, I would recommend this (you don't specify Scala or PySpark, so I'm going with Scala; you can do the same in PySpark):
val maxRecordsPerPartition: Long = ???   // your per-file record limit
val numPartitions = (df.count() / maxRecordsPerPartition).toInt + 1
df
  .repartition(numPartitions)
  .write
  .format("json")
  .save("/path/file_name.json")
This will keep each partition at roughly maxRecordsPerPartition records or fewer.
We decided to just go with the number of files being generated and simply make sure that each file contains fewer than 1 million line items.

How to auto calculate numRepartition while using spark dataframe write

When I try to write a dataframe to a Hive Parquet partitioned table
df.write.partitionBy("key").mode("append").format("hive").saveAsTable("db.table")
It creates a lot of blocks in HDFS, and each block holds only a small amount of data.
I understand how this happens: each Spark sub-task creates a block and then writes data to it.
I also understand that more blocks can increase Hadoop performance, but performance decreases again after the number of blocks passes a threshold.
If I want to set numPartition automatically, does anyone have a good idea?
val numPartition = ???   // auto-calculated based on df size or something
df.repartition(numPartition).write
.partitionBy("key")
.format("hive")
.saveAsTable("db.table")
First of all, why do you want an extra repartition step when you are already using partitionBy(key)? Your data would be partitioned based on the key.
Generally, you could repartition by a column value; that is a common scenario and helps with operations like reduceByKey, filtering based on a column value, etc. For example:
import spark.implicits._   // needed for toDF

val birthYears = List(
  (2000, "name1"),
  (2000, "name2"),
  (2001, "name3"),
  (2000, "name4"),
  (2001, "name5")
)
val df = birthYears.toDF("year", "name")
df.repartition($"year")
By default, Spark creates 200 partitions for shuffle operations, so 200 files/blocks (if the data volume is small) will be written to HDFS.
Configure the number of partitions created after a shuffle, based on your data, using the configuration below:
spark.conf.set("spark.sql.shuffle.partitions", <Number of paritions>)
For example, spark.conf.set("spark.sql.shuffle.partitions", "5") makes Spark create 5 partitions, and 5 files will be written to HDFS.
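If you want numPartition computed from the data instead of hard-coded, here is a rough sketch in the spirit of the SizeEstimator answer earlier on this page (the ~128 MB target per file is an assumption, and SizeEstimator measures in-memory size rather than on-disk size):
import org.apache.spark.util.SizeEstimator

val targetBytesPerFile = 134217728L                      // assumed target of ~128 MB per output file
val estimatedBytes = SizeEstimator.estimate(df.rdd)      // rough in-memory size estimate
val numPartition = (estimatedBytes / targetBytesPerFile + 1).toInt

df.repartition(numPartition)
  .write
  .partitionBy("key")
  .mode("append")
  .format("hive")
  .saveAsTable("db.table")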

Empty Files in output spark

I am writing my dataframe like below
df.write().format("com.databricks.spark.avro").save("path");
However, I am getting around 200 files, of which around 30-40 are empty. I understand that this might be due to empty partitions. I then updated my code like this:
df.coalesce(50).write().format("com.databricks.spark.avro").save("path");
But I feel this might impact performance. Is there a better approach to limit the number of output files and remove the empty files?
You can remove the empty partitions before writing by using the repartition method.
The default number of partitions is 200.
A suggested value is: number of partitions = number of cores * 4.
Repartition your dataframe using this method. To eliminate skew and ensure an even distribution of the data, choose column(s) with high cardinality (a large number of unique values) for the partitionExprs argument.
By default the number of shuffle partitions is 200; you have to shuffle to remove skewed partitions.
You can either use the repartition method on the dataframe, or make use of the DISTRIBUTE BY clause in SQL, which repartitions while distributing data evenly among partitions.
def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
Returns a Dataset instance with the requested partitioning.
You may also use repartitionAndSortWithinPartitions, which can improve the compression ratio.
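For example, here is a small sketch of both suggestions (the column name user_id and the view name events are hypothetical):
import org.apache.spark.sql.functions.col

// Option 1: Dataset API - repartition into 50 partitions, distributed by a high-cardinality column
df.repartition(50, col("user_id"))
  .write.format("com.databricks.spark.avro").save("path")

// Option 2: SQL DISTRIBUTE BY (uses spark.sql.shuffle.partitions partitions)
df.createOrReplaceTempView("events")
spark.sql("SELECT * FROM events DISTRIBUTE BY user_id")
  .write.format("com.databricks.spark.avro").save("path")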
