Why is repartition faster than partitionBy in Spark? - apache-spark

I am attempting to use Spark for a very simple use case: given a large set of files (90k) with device time-series data for millions of devices group all of the time-series reads for a given device into a single set of files (partition). For now let’s say we are targeting 100 partitions, and it is not critical that a given devices data shows up in the same output file, just the same partition.
Given this problem we’ve come up with two ways to do this - repartition then write or write with partitionBy applied to the Writer. The code for either of these is very simple:
repartition (hash column is added to ensure that comparison to partitionBy code below is one-to-one):
df = spark.read.format("xml") \
.options(rowTag="DeviceData") \
.load(file_path, schema=meter_data) \
.withColumn("partition", hash(col("_DeviceName")).cast("Long") % num_partitions) \
.repartition("partition") \
.write.format("json") \
.option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
.mode("overwrite") \
.save(output_path)
partitionBy:
df = spark.read.format("xml") \
.options(rowTag="DeviceData") \
.load(file_path, schema=meter_data) \
.withColumn("partition", hash(col("_DeviceName")).cast("Long") % num_partitions) \
.write.format("json") \
.partitionBy(“partition”) \
.option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
.mode("overwrite") \
.save(output_path)
In our testing repartition is 10x faster than partitionBy. Why is this?
Based on my understanding repartition will incur a shuffle which my Spark learnings have told me to try to avoid whenever possible. On the other hand, partitionBy (based on my understanding) only incurs an sort operation local to each node - no shuffle is needed. Am I misunderstanding something that is causing me to think partitionBy would be faster?

TLDR: Spark triggers a sort when you call partitionBy, and not a hash re-partitioning. This is why it is much slower in your case.
We can check that with a toy example:
spark.range(1000).withColumn("partition", 'id % 100)
.repartition('partition).write.csv("/tmp/test.csv")
Don't pay attention to the grey stage, it is skipped because it was computed in a previous job.
Then, with partitionBy:
spark.range(1000).withColumn("partition", 'id % 100)
.write.partitionBy("partition").csv("/tmp/test2.csv")
You can check that you can add repartition before partitionBy, the sort will still be there. So what's happening? Notice that the sort in the second DAG does not trigger a shuffle. It is a map partition. In fact, when you call partitionBy, spark does not shuffle the data as one would expect at first. Spark sorts each partition individually and then each executor writes his data in the according partition, in a separate file. Therefore, note that with partitionBy you are not writing num_partitions files but something between num_partitions and num_partitions * num_executors files. Each partition has one file per executor containing data belonging to that partition.

I think #Oli has explained the issue perfectly in his comments to the main answer. I just want to add my 2 cents and try to explain the same.
Let's say when you are reading the XML files [90K files], spark reads it into N partitions. This is decided based on the number of factors like spark.sql.files.maxPartitionBytes, file format, compression type etc.
Let's assume it to be 10K partitions. This is happening in the below part of the code.
df = spark.read.format("xml") \
.options(rowTag="DeviceData") \
.load(file_path, schema=meter_data) \
Assuming you are using num_partitions = 100, you are adding a new column called partition with values 0-99. Spark is just adding a new column to the existing dataframe [or rdd] which is split across the 10K partitions.
.withColumn("partition", hash(col("_DeviceName")).cast("Long") % num_partitions) \
Till this point, both the codes are the same.
Now, let's compare what is happening with repartition v/s partitionBy
Case 1: repartition
.repartition("partition") \
.write.format("json") \
Here, you are repartitioning the existing dataframe based on the column "partition" which has 100 distinct values. So the existing dataframe will incur a full shuffle bringing down the number of partitions from 10K to 100. This stage will be compute-heavy since a full shuffle is involved. This could also fail if the size of one particular partition is really huge [skewed partition].
But the advantage here is that in the next stage where write happens, Spark has to write only 100 files to the output_path. Each file will only have data corresponding to only one value of column "partition"
Case 2: partitionBy
.write.format("json") \
.partitionBy("partition") \
Here, you are asking spark to write the existing dataframe into output_path partitioned by the distinct values of the column "partition". You are nowhere asking spark to reduce the existing partition count of the dataframe.
So spark will create new folders inside the output_path
and write data corresponding to each partitions inside it.
output_path + "\partition=0\"
output_path + "\partition=1\"
output_path + "\partition=99\"
Now since you have 10K spark partitions on the existing data frame and assuming the worst case where each of these 10K partitions has all the distinct values of the column "partition", Spark will have to write 10K * 100 = 1M files.
ie, some part of all the 10K partitions will be written to all of the 100 folders created by the column "partition". This way spark will be writing 1M files to the output_path by creating sub-directories inside. The advantage is that we are skipping a full-shuffle using this method.
Now compared to the in-memory compute-intensive shuffle in Case 1, this will be much slower since Spark has to create 1M files and write them to persistent storage.
That too, initially to a temporary folder and then to the output_path.
This will be much more slower if the write is happening to an object-store like AWS S3 or GCP Blob
Case 3: coalesce + partitionBy
.coalesce(num_partitions) \
.write.format("json") \
.partitionBy("partition") \
In this case, you will be reducing the number of spark partitions from 10K to 100 with coalesce() and writing it to output_path partitioned by column "partition".
So, assuming the worst case where each of these 100 partitions has all the distinct values of the column "partition", spark will have to write 100 * 100 = 10K files.
This will still be faster than Case 2, but will be slower than Case 1.
This is because you are doing a partial shuffle with coalesce() but still end up writing 10K files to output_path.
Case 4: repartition+ partitionBy
.repartition("partition") \
.write.format("json") \
.partitionBy("partition") \
In this case, you will be reducing the number of spark partitions from 10K to 100 [distinct values of column "partition"] with repartition() and writing it to output_path partitioned by column "partition".
So, each of these 100 partitions has only one distinct value of the column "partition", spark will have to write 100 * 1 = 100 files. Each sub-folder created by partitionBy() will only have 1 file inside it.
This will take the same time as Case 1 since both the cases involve a full-shuffle and then writing 100 files. The only difference here will be that 100 files will be inside sub-folders under the output_path.
This setup will be useful for predicate push-down of filters while reading the output_path via spark or hive.
Conclusion:
Even though partitionBy is faster than repartition, depending on the number of dataframe partitions and distribution of data inside those partitions, just using partitionBy alone might end up costly.

Related

Avoid data shuffle and coalesce-numPartitions is not applied to individual partition while doing left anti-join in spark dataframe

I have two dataframe - target_df and reference_df. I need to remove account_id's in target_df which is present in reference_df.
target_df is created from hive table, will have hundreds of partitions. It is partitioned based on date(20220101 to 20221101).
I am doing left anti-join and writing data in hdfs location.
val numPartitions = 10
val df_purge = spark.sql(s"SELECT /*+ BROADCASTJOIN(ref) */ target.* FROM input_table target LEFT ANTI JOIN ${reference_table} ref ON target.${Customer_ID} = ref.${Customer_ID}")
df_purge.coalesce(numPartitions).write.partitionBy("date").mode("overwrite").parquet("hdfs_path")
I need to apply same numPartitions value to each partition. But it is applying to numPartitions value to entire dataframe. For example: If it has 100 date partitions, i need to have 100 * 10 = 1000 part files. These code is not working as expected. I tried repartitionby("date") but this is causing huge data shuffle.
Can anyone please provide an optimized solution. Thanks!
I am afraid that you can not skip shuffle in this case. All repartition/coalesce/partitionBy are working on dataset level and i dont think that there is a way to just split partitions into 10 without shuffle
You tried to use coalesce which is not causing shuffle and this is true, but coalesce can only be used to decrese number of partitions so its not going to help you
You can try to achieve what you want by using combination of raprtition and repartitionBy. Here is description of both functions (same applies to Scala source: https://sparkbyexamples.com:
PySpark repartition() is a DataFrame method that is used to increase
or reduce the partitions in memory and when written to disk, it create
all part files in a single directory.
PySpark partitionBy() is a method of DataFrameWriter class which is
used to write the DataFrame to disk in partitions, one sub-directory
for each unique value in partition columns.
If you first repartition your dataset with repartition = 1000 Spark is going to create 1000 partitions in memory. Later, when you call repartitionBy, Spark is going to create sub-directory forr each value and create one part file for each in-memory partition which contains given key
So if after repartition you have date X in 500 partitions out of 1000 you will find 500 file in sub-directory for this date
In article which i mentioned previously you can find simple example of this behaviourm, chech chapter 1.3 partitionBy(colNames : String*) Example
#Use repartition() and partitionBy() together
dfRepart.repartition(2)
.write.option("header",True) \
.partitionBy("state") \
.mode("overwrite") \
.csv("c:/tmp/zipcodes-state-more")

In Apache Spark's `bucketBy`, how do you generate 1 file per bucket instead of 1 file per bucket per partition?

I am trying to use Spark's bucketBy feature on a pretty large dataset.
dataframe.write()
.format("parquet")
.bucketBy(500, bucketColumn1, bucketColumn2)
.mode(SaveMode.Overwrite)
.option("path", "s3://my-bucket")
.saveAsTable("my_table");
The problem is that my Spark cluster has about 500 partitions/tasks/executors (not sure the terminology), so I end up with files that look like:
part-00001-{UUID}_00001.c000.snappy.parquet
part-00001-{UUID}_00002.c000.snappy.parquet
...
part-00001-{UUID}_00500.c000.snappy.parquet
part-00002-{UUID}_00001.c000.snappy.parquet
part-00002-{UUID}_00002.c000.snappy.parquet
...
part-00002-{UUID}_00500.c000.snappy.parquet
part-00500-{UUID}_00001.c000.snappy.parquet
part-00500-{UUID}_00002.c000.snappy.parquet
...
part-00500-{UUID}_00500.c000.snappy.parquet
That's 500x500=250000 bucketed parquet files! It takes forever for the FileOutputCommitter to commit that to S3.
Is there a way to generate one file per bucket, like in Hive? Or is there a better way to deal with this problem? As of now it seems like I have to choose between lowering the parallelism of my cluster (reduce number of writers) or reducing the parallelism of my parquet files (reduce number of buckets).
Thanks
In order to get 1 file per final bucket do the following. Right before writing the dataframe as table repartition it using exactly same columns as ones you are using for bucketing and set the number of new partitions to be equal to number of buckets you will use in bucketBy (or a smaller number which is a divisor of number of buckets, though I don't see a reason to use a smaller number here).
In your case that would probably look like this:
dataframe.repartition(500, bucketColumn1, bucketColumn2)
.write()
.format("parquet")
.bucketBy(500, bucketColumn1, bucketColumn2)
.mode(SaveMode.Overwrite)
.option("path", "s3://my-bucket")
.saveAsTable("my_table");
In the cases when you're saving to an existing table you need to make sure the types of columns are matching exactly (e.g. if your column X is INT in dataframe, but BIGINT in the table you're inserting into your repartitioning by X into 500 buckets won't match repartitioning by X treated as BIGINT and you'll end up with each of 500 executors writing 500 files again).
Just to be 100% clear - this repartitioning will add another step into your execution which is to gather the data for each bucket on 1 executor (so one full data reshuffle if the data was not partitioned same way before). I'm assuming that is exactly what you want.
It was also mentioned in comments to another answer that you'll need to be prepared for possible issues if your bucketing keys are skewed. It is true, but default Spark behavior doesn't exactly help you much if the first thing you do after loading the table is to aggregate/join on the same columns you bucketed by (which seems like a very possible scenario for someone who chose to bucket by these columns). Instead you will get a delayed issue and only see the skewness when try to load the data after the writing.
In my opinion it would be really nice if Spark offered a setting to always repartition your data before writing a bucketed table (especially when inserting into existing tables).
This should solve it.
dataframe.write()
.format("parquet")
.bucketBy(1, bucketColumn1, bucketColumn2)
.mode(SaveMode.Overwrite)
.option("path", "s3://my-bucket")
.saveAsTable("my_table");
Modify the Input Parameter for the BucketBy Function to 1.
You can look at the code of bucketBy from spark's git repository - https://github.com/apache/spark/blob/f8d59572b014e5254b0c574b26e101c2e4157bdd/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
The first split part-00001, part-00002 is based on the number of parallel tasks running when you save the bucketed table. In your case you had 500 parallel tasks running. The number of files inside each part file is decided based on the input you provide for the bucketBy function.
To learn more about Spark tasks, partitions, executors, view my Medium articles - https://medium.com/#tharun026

How to auto calculate numRepartition while using spark dataframe write

When I tried to write dataframe to Hive Parquet Partitioned Table
df.write.partitionBy("key").mode("append").format("hive").saveAsTable("db.table")
It will create a lots of blocks in HDFS, each of the block only have small size of data.
I understand how it goes as each spark sub-task will create a block, then write data to it.
I also understand, num of blocks will increase the Hadoop performance, but it will also decrease the performance after reaching a threshold.
If i want to auto set numPartition, does anyone have a good idea?
numPartition = ??? // auto calc basing on df size or something
df.repartition("numPartition").write
.partitionBy("key")
.format("hive")
.saveAsTable("db.table")
First of all, why do you want to have an extra repartition step when you are already using partitionBy(key)- your data would be partitioned based on the key.
Generally, you could re-partition by a column value, that's a common scenario, helps in operations like reduceByKey, filtering based on column value etc. For example,
val birthYears = List(
(2000, "name1"),
(2000, "name2"),
(2001, "name3"),
(2000, "name4"),
(2001, "name5")
)
val df = birthYears.toDF("year", "name")
df.repartition($"year")
By Default spark will create 200 Partitions for shuffle operations. so, 200 files/blocks (if the file size is less) will be written to HDFS.
Configure the number of partitions to be created after shuffle based on your data in Spark using below configuration:
spark.conf.set("spark.sql.shuffle.partitions", <Number of paritions>)
ex: spark.conf.set("spark.sql.shuffle.partitions", "5"), so Spark will create 5 partitions and 5 files will be written to HDFS.

How to avoid writing empty json files in Spark [duplicate]

I am reading from Kafka queue using Spark Structured Streaming. After reading from Kafka I am applying filter on the dataframe. I am saving this filtered dataframe into a parquet file. This is generating many empty parquet files. Is there any way I can stop writing an empty file?
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KafkaServer) \
.option("subscribe", KafkaTopics) \
.load()
Transaction_DF = df.selectExpr("CAST(value AS STRING)")
decompDF = Transaction_DF.select(zip_extract("value").alias("decompress"))
filterDF = decomDF.filter(.....)
query = filterDF .writeStream \
.option("path", outputpath) \
.option("checkpointLocation", RawXMLCheckpoint) \
.start()
Is there any way I can stop writing an empty file.
Yes, but you would rather not do it.
The reason for many empty parquet files is that Spark SQL (the underlying infrastructure for Structured Streaming) tries to guess the number of partitions to load a dataset (with records from Kafka per batch) and does this "poorly", i.e. many partitions have no data.
When you save a partition with no data you will get an empty file.
You can use repartition or coalesce operators to set the proper number of partitions and reduce (or even completely avoid) empty files. See Dataset API.
Why would you not do it? repartition and coalesce may incur performance degradation due to the extra step of shuffling the data between partitions (and possibly nodes in your Spark cluster). That can be expensive and not worth doing it (and hence I said that you would rather not do it).
You may then be asking yourself, how to know the right number of partitions? And that's a very good question in any Spark project. The answer is fairly simple (and obvious if you understand what and how Spark does the processing): "Know your data" so you can calculate how many is exactly right.
I recommend using repartition(partitioningColumns) on the Dataframe resp. Dataset and after that partitionBy(partitioningColumns) on the writeStream operation to avoid writing empty files.
Reason:
The bottleneck if you have a lot of data is often the read performance with Spark if you have a lot of small (or even empty) files and no partitioning. So you should definitely make use of the file/directory partitioning (which is not the same as RDD partitioning).
This is especially a problem when using AWS S3.
The partitionColumns should fit your common queries when reading the data like timestamp/day, message type/Kafka topic, ...
See also the partitionBy documentation on http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
year=2016/month=01/, year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
This is applicable for all file-based data sources (e.g. Parquet, JSON) staring Spark 2.1.0.
you can try with repartitionByRange(column)..
I used this while writing dataframe to HDFS .. It solved my empty file creation issue.
If you are using yarn client mode, then setting the num of executor cores to 1 will solve the problem. This means that only 1 task will be run at any time per executor.

How to avoid empty files while writing parquet files?

I am reading from Kafka queue using Spark Structured Streaming. After reading from Kafka I am applying filter on the dataframe. I am saving this filtered dataframe into a parquet file. This is generating many empty parquet files. Is there any way I can stop writing an empty file?
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KafkaServer) \
.option("subscribe", KafkaTopics) \
.load()
Transaction_DF = df.selectExpr("CAST(value AS STRING)")
decompDF = Transaction_DF.select(zip_extract("value").alias("decompress"))
filterDF = decomDF.filter(.....)
query = filterDF .writeStream \
.option("path", outputpath) \
.option("checkpointLocation", RawXMLCheckpoint) \
.start()
Is there any way I can stop writing an empty file.
Yes, but you would rather not do it.
The reason for many empty parquet files is that Spark SQL (the underlying infrastructure for Structured Streaming) tries to guess the number of partitions to load a dataset (with records from Kafka per batch) and does this "poorly", i.e. many partitions have no data.
When you save a partition with no data you will get an empty file.
You can use repartition or coalesce operators to set the proper number of partitions and reduce (or even completely avoid) empty files. See Dataset API.
Why would you not do it? repartition and coalesce may incur performance degradation due to the extra step of shuffling the data between partitions (and possibly nodes in your Spark cluster). That can be expensive and not worth doing it (and hence I said that you would rather not do it).
You may then be asking yourself, how to know the right number of partitions? And that's a very good question in any Spark project. The answer is fairly simple (and obvious if you understand what and how Spark does the processing): "Know your data" so you can calculate how many is exactly right.
I recommend using repartition(partitioningColumns) on the Dataframe resp. Dataset and after that partitionBy(partitioningColumns) on the writeStream operation to avoid writing empty files.
Reason:
The bottleneck if you have a lot of data is often the read performance with Spark if you have a lot of small (or even empty) files and no partitioning. So you should definitely make use of the file/directory partitioning (which is not the same as RDD partitioning).
This is especially a problem when using AWS S3.
The partitionColumns should fit your common queries when reading the data like timestamp/day, message type/Kafka topic, ...
See also the partitionBy documentation on http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
year=2016/month=01/, year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
This is applicable for all file-based data sources (e.g. Parquet, JSON) staring Spark 2.1.0.
you can try with repartitionByRange(column)..
I used this while writing dataframe to HDFS .. It solved my empty file creation issue.
If you are using yarn client mode, then setting the num of executor cores to 1 will solve the problem. This means that only 1 task will be run at any time per executor.

Resources