I have a performance issue when querying data after partitioning it.
I have a daily parquet file of around 30 million rows and 20 columns. For example, the file data_20210721.parquet looks like:
+-----------+---------------------+---------------------+------------+-----+
| reference | date_from | date_to | daytime | ... |
+-----------+---------------------+---------------------+------------+-----+
| A | 2021-07-21 17:30:25 | 2021-07-22 02:21:57 | 2021-07-22 | ... |
| A | 2021-07-21 12:10:10 | 2021-07-21 13:00:00 | 2021-07-21 | ... |
| A | ... | ... | ... | ... |
+-----------+---------------------+---------------------+------------+-----+
We have code that processes it so that each row covers a single day, splitting rows at midnight, which gives:
+-----------+---------------------+---------------------+------------+-----+
| reference | date_from | date_to | daytime | ... |
+-----------+---------------------+---------------------+------------+-----+
| A | 2021-07-21 17:30:25 | 2021-07-22 00:00:00 | 2021-07-21 | ... | <- split at midnight + daytime update
| A | 2021-07-22 00:00:00 | 2021-07-22 02:21:57 | 2021-07-22 | ... | <- residual
| A | 2021-07-21 12:10:10 | 2021-07-21 13:00:00 | 2021-07-21 | ... |
| A | ... | ... | ... | ... |
+-----------+---------------------+---------------------+------------+-----+
Line 2 can be called a residual because it does not belong to the same day as the file.
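To make the midnight split concrete, here is a minimal pure-Python sketch of the rule described above (it is not the actual processing code, which is not shown in the question). The function name split_at_midnight is hypothetical, and the sketch assumes date_from is strictly before date_to.

```python
from datetime import datetime, timedelta

def split_at_midnight(date_from, date_to):
    """Split [date_from, date_to) into pieces that each stay within one calendar day.

    Returns a list of (date_from, date_to, daytime) tuples, where daytime is the
    calendar date the piece belongs to.
    """
    pieces = []
    start = date_from
    while start < date_to:
        # Midnight that ends the day containing `start`
        next_midnight = datetime.combine(start.date(), datetime.min.time()) + timedelta(days=1)
        end = min(date_to, next_midnight)
        pieces.append((start, end, start.date()))
        start = end
    return pieces

rows = split_at_midnight(datetime(2021, 7, 21, 17, 30, 25), datetime(2021, 7, 22, 2, 21, 57))
for row in rows:
    print(row)
```

The first piece ends at 2021-07-22 00:00:00 and keeps daytime 2021-07-21; the second piece is the residual that belongs to 2021-07-22.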
Then we wanted to generate one parquet file per daytime, so the default solution was to process each file and save the dataframe with:
df.write.partitionBy(["id", "daytime"]).mode("append").parquet("hdfs/path")
The mode is set to append because the next day, we may have residuals from past / future days.
There are also other levels of partitioning, such as:
ID: it is fixed for around a year (quite good for saving storage ;) )
weeknumber
country
Even though the partitions are quite "balanced" in terms of rows, the processing time becomes incredibly slow.
For example, to count the number of rows per day for a given set of dates:
Original df (7 seconds):
spark.read.parquet("path/to/data_2021071[0-5].parquet")\
.groupBy("DayTime")\
.count()\
.show()
Partitioned data (several minutes):
spark.read.parquet("path/to/data")\
.filter( (col("DayTime") >= "2021-07-10") & (col("DayTime") <= "2021-07-15") )\
.groupBy("DayTime")\
.count()\
.show()
We thought that there were too many small files at the final level (because of the append, there are around 600 very small files of a few KB/MB), so we tried to coalesce them for each partition, but there was no improvement. We also tried to partition only on daytime (in case having too many levels of partitioning creates issues).
Is there any solution to improve the performance (or to understand where the bottleneck is)?
Can it be linked to the fact that we are partitioning on a date column? I saw a lot of examples with partitioning by year/month/day, which are 3 integers, but that does not fit our needs.
This solution was perfect for solving a lot of problems we had, but the loss of performance is far too large to keep it as is. Any suggestion is welcome :)
EDIT 1 :
The issue comes from the fact that the plan is not the same between:
spark.read.parquet("path/to/data/DayTime=2021-07-10")
and
spark.read.parquet("path/to/data/").filter(col("DayTime")=="2021-07-10")
Here is the plan for a small example where DayTime has been converted to a "long" as I thought maybe the slowness was due to the datatype:
spark.read.parquet("path/to/test/").filter(col("ts") == 20200103).explain(extended=True)
== Parsed Logical Plan ==
'Filter ('ts = 20200103)
+- AnalysisBarrier
+- Relation[date_from#4297,date_to#4298, ....] parquet
== Analyzed Logical Plan ==
date_from: timestamp, date_to: timestamp, ts: int, ....
Filter (ts#4308 = 20200103)
+- Relation[date_from#4297,date_to#4298,ts#4308, ....] parquet
== Optimized Logical Plan ==
Filter (isnotnull(ts#4308) && (ts#4308 = 20200103))
+- Relation[date_from#4297,date_to#4298,ts#4308, ....] parquet
== Physical Plan ==
*(1) FileScan parquet [date_from#4297,date_to#4298,ts#4308, ....] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://.../test_perf], PartitionCount: 1, PartitionFilters: [isnotnull(ts#4308), (ts#4308 = 20200103)], PushedFilters: [], ReadSchema: struct<date_from:timestamp,date_to:timestamp, ....
vs
spark.read.parquet("path/to/test/ts=20200103").explain(extended=True)
== Parsed Logical Plan ==
Relation[date_from#2086,date_to#2087, ....] parquet
== Analyzed Logical Plan ==
date_from: timestamp, date_to: timestamp,, ....] parquet
== Optimized Logical Plan ==
Relation[date_from#2086,date_to#2087, ....] parquet
== Physical Plan ==
*(1) FileScan parquet [date_from#2086,date_to#2087, .....] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://.../test_perf/ts=20200103], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<date_from:timestamp,date_to:timestamp, ....
Thanks in advance,
Nicolas
You have to ensure that your filter actually uses the partitioned structure, pruning at disk level rather than bringing all the data into memory and then applying the filter.
Check your physical plan:
spark.read.parquet("path/to/data")\
.filter( (col("DayTime") >= "2021-07-10") & (col("DayTime") <= "2021-07-15") )\
.explain()
The FileScan node should contain something similar to PartitionFilters: [isnotnull(DayTime#123), (DayTime#123 = your condition)].
My guess is that in your case it is not using PartitionFilters, so the whole dataset is scanned.
I would suggest experimenting with your syntax / repartition strategy on a small dataset until PartitionFilters appears in the plan.
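As a conceptual illustration of what pruning on a partition column means, the following pure-Python sketch selects files by looking only at the column=value part of their path, before any file would be opened. It is not Spark's implementation; the file listing and the helper names partition_value and prune are hypothetical.

```python
def partition_value(path, column):
    """Extract the value of a Hive-style `column=value` directory from a path."""
    for part in path.split("/"):
        name, sep, value = part.partition("=")
        if sep and name == column:
            return value
    return None

def prune(paths, column, low, high):
    """Keep only the files whose partition value lies in [low, high]."""
    kept = []
    for p in paths:
        v = partition_value(p, column)
        if v is not None and low <= v <= high:
            kept.append(p)
    return kept

# Hypothetical file listing, for illustration only
files = [
    "data/DayTime=2021-07-09/part-0.parquet",
    "data/DayTime=2021-07-10/part-0.parquet",
    "data/DayTime=2021-07-12/part-0.parquet",
    "data/DayTime=2021-07-16/part-0.parquet",
]
selected = prune(files, "DayTime", "2021-07-10", "2021-07-15")
print(selected)
```

Note that even in this toy version the full listing has to be enumerated before it can be filtered, so a layout with a very large number of directories and small files may still be slow to list even when pruning applies.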
Related
Seems like I'm missing something about repartition in spark.
AFAIK, you can repartition with a key:
df.repartition("key"), in which case Spark will use a hash partitioning method.
And you can repartition by setting only the number of partitions:
df.repartition(10), in which case Spark will use a round-robin partitioning method.
In which case would round-robin partitioning produce a data skew that requires salting to spread the rows evenly, given that repartitioning with only a number of partitions is done in a round-robin manner?
With df.repartition(10) you cannot have a skew. As you mention, Spark uses a round-robin partitioning method, so partitions have the same size.
We can check that:
spark.range(100000).repartition(5).explain
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange RoundRobinPartitioning(5), REPARTITION_BY_NUM, [id=#1380]
+- Range (0, 100000, step=1, splits=16)
spark.range(100000).repartition(5).groupBy(spark_partition_id).count.show
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 0|20000|
| 1|20000|
| 2|20000|
| 3|20000|
| 4|20000|
+--------------------+-----+
If you use df.repartition("key"), something different happens:
// let's specify the number of partitions as well
spark.range(100000).repartition(5, 'id).explain
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange hashpartitioning(id#352L, 5), REPARTITION_BY_NUM, [id=#1424]
+- Range (0, 100000, step=1, splits=16)
Let's try:
spark.range(100000).repartition(5, 'id).groupBy(spark_partition_id).count.show
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 0|20128|
| 1|20183|
| 2|19943|
| 3|19940|
| 4|19806|
+--------------------+-----+
Each element of the column is hashed and hashes are split between partitions. Therefore partitions have similar sizes but they don't have exactly the same size. However, two rows with the same key necessarily end up in the same partition. So if your key is skewed (one or more particular keys are over-represented in the dataframe), your partitioning will be skewed as well:
spark.range(100000)
.withColumn("key", when('id < 1000, 'id).otherwise(lit(0)))
.repartition(5, 'key)
.groupBy(spark_partition_id).count.show
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 0|99211|
| 1| 196|
| 2| 190|
| 3| 200|
| 4| 203|
+--------------------+-----+
I went through the documentation here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
It says:
for repartition: resulting DataFrame is hash partitioned.
for repartitionByRange: resulting DataFrame is range partitioned.
And a previous question also mentions it. However, I still don't understand how exactly they differ and what the impact will be when choosing one over the other?
More importantly, if repartition does hash partitioning, what impact does providing columns as its argument have?
I think it is best to look into the difference with some experiments.
Test Dataframes
For this experiment, I am using the following two Dataframes (I am showing the code in Scala but the concept is identical to Python APIs):
// Dataframe with one column "value" containing the values ranging from 0 to 1000000
val df = Seq(0 to 1000000: _*).toDF("value")
// Dataframe with one column "value" containing the number 0 repeated 1000001 times, plus the numbers 5000, 10000 and 100000
val df2 = Seq((0 to 1000000).map(_ => 0) :+ 5000 :+ 10000 :+ 100000: _*).toDF("value")
Theory
repartition applies the HashPartitioner when one or more columns are provided and the RoundRobinPartitioner when no column is provided. If one or more columns are provided (HashPartitioner), those values will be hashed and used to determine the partition number by calculating something like partition = hash(columns) % numberOfPartitions. If no column is provided (RoundRobinPartitioner) the data gets evenly distributed across the specified number of partitions.
repartitionByRange will partition the data based on a range of the column values. This is usually used for continuous (not discrete) values such as any kind of numbers. Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition.
It is also worth mentioning that for both methods if numPartitions is not given, by default it partitions the Dataframe data into spark.sql.shuffle.partitions configured in your Spark session, and could be coalesced by Adaptive Query Execution (available since Spark 3.x).
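The difference between the two repartition strategies can be sketched in a few lines of plain Python. This is only an illustration of the principle: zlib.crc32 stands in for Spark's hash function, and the helper names are hypothetical.

```python
import zlib
from collections import Counter

def hash_partition(values, num_partitions):
    """Assign each value to hash(value) mod num_partitions (equal values always collide)."""
    return [zlib.crc32(str(v).encode("utf-8")) % num_partitions for v in values]

def round_robin_partition(values, num_partitions, start=0):
    """Assign rows to partitions in turn, ignoring their content."""
    return [(start + i) % num_partitions for i, _ in enumerate(values)]

# Mostly zeros plus three other values, a scaled-down version of df2
skewed = [0] * 1000 + [5000, 10000, 100000]

by_hash = Counter(hash_partition(skewed, 4))
by_round_robin = Counter(round_robin_partition(skewed, 4))

print("hash       :", sorted(by_hash.items()))
print("round robin:", sorted(by_round_robin.items()))
```

With hashing, all the zeros necessarily land in one partition; with round robin, the partition sizes differ by at most one row regardless of the values.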
Test Setup
Based on the given test data, I always apply the same code:
val testDf = df
// here I will insert the partition logic
.withColumn("partition", spark_partition_id()) // applying SQL built-in function to determine actual partition
.groupBy(col("partition"))
.agg(
count(col("value")).as("count"),
min(col("value")).as("min_value"),
max(col("value")).as("max_value"))
.orderBy(col("partition"))
testDf.show(false)
Test Results
df.repartition(4, col("value"))
As expected, we get 4 partitions, and because the values of df range from 0 to 1000000, their hashed values result in a well-distributed Dataframe.
+---------+------+---------+---------+
|partition|count |min_value|max_value|
+---------+------+---------+---------+
|0 |249911|12 |1000000 |
|1 |250076|6 |999994 |
|2 |250334|2 |999999 |
|3 |249680|0 |999998 |
+---------+------+---------+---------+
df.repartitionByRange(4, col("value"))
In this case we also get 4 partitions, but this time the min and max values clearly show the ranges of values within each partition. The data is almost equally distributed, with about 250000 values per partition.
+---------+------+---------+---------+
|partition|count |min_value|max_value|
+---------+------+---------+---------+
|0 |244803|0 |244802 |
|1 |255376|244803 |500178 |
|2 |249777|500179 |749955 |
|3 |250045|749956 |1000000 |
+---------+------+---------+---------+
df2.repartition(4, col("value"))
Now, we are using the other Dataframe df2. Here, the hashing algorithm is hashing the values, which are only 0, 5000, 10000 or 100000. Of course, the hash of the value 0 will always be the same, so all zeros end up in the same partition (in this case partition 3). The other three partitions each contain only one value.
+---------+-------+---------+---------+
|partition|count |min_value|max_value|
+---------+-------+---------+---------+
|0 |1 |100000 |100000 |
|1 |1 |10000 |10000 |
|2 |1 |5000 |5000 |
|3 |1000001|0 |0 |
+---------+-------+---------+---------+
df2.repartition(4)
Without using the content of the column "value", the repartition method distributes the rows on a round-robin basis. All partitions have almost the same amount of data.
+---------+------+---------+---------+
|partition|count |min_value|max_value|
+---------+------+---------+---------+
|0 |250002|0 |5000 |
|1 |250002|0 |10000 |
|2 |249998|0 |100000 |
|3 |250002|0 |0 |
+---------+------+---------+---------+
df2.repartitionByRange(4, col("value"))
This case shows that the Dataframe df2 is not well suited to repartitioning by range, as almost all values are 0. Therefore, we end up with only two partitions, where partition 0 contains all the zeros.
+---------+-------+---------+---------+
|partition|count |min_value|max_value|
+---------+-------+---------+---------+
|0 |1000001|0 |0 |
|1 |3 |5000 |100000 |
+---------+-------+---------+---------+
By using df.explain you can get a lot of information about these operations.
I'm using this DataFrame for the example:
df = spark.createDataFrame([(i, f"value {i}") for i in range(1, 22, 1)], ["id", "value"])
Repartition
Depending on whether a key expression (column) is specified or not, the partitioning method will be different. It is not always hash partitioning as you said.
df.repartition(3).explain(True)
== Parsed Logical Plan ==
Repartition 3, true
+- LogicalRDD [id#0L, value#1], false
== Analyzed Logical Plan ==
id: bigint, value: string
Repartition 3, true
+- LogicalRDD [id#0L, value#1], false
== Optimized Logical Plan ==
Repartition 3, true
+- LogicalRDD [id#0L, value#1], false
== Physical Plan ==
Exchange RoundRobinPartitioning(3)
+- Scan ExistingRDD[id#0L,value#1]
We can see in the generated physical plan that RoundRobinPartitioning is used:
Represents a partitioning where rows are distributed evenly across
output partitions by starting from a random target partition number
and distributing rows in a round-robin fashion. This partitioning is
used when implementing the DataFrame.repartition() operator.
When using repartition by column expression:
df.repartition(3, "id").explain(True)
== Parsed Logical Plan ==
'RepartitionByExpression ['id], 3
+- LogicalRDD [id#0L, value#1], false
== Analyzed Logical Plan ==
id: bigint, value: string
RepartitionByExpression [id#0L], 3
+- LogicalRDD [id#0L, value#1], false
== Optimized Logical Plan ==
RepartitionByExpression [id#0L], 3
+- LogicalRDD [id#0L, value#1], false
== Physical Plan ==
Exchange hashpartitioning(id#0L, 3)
+- Scan ExistingRDD[id#0L,value#1]
Now the picked partitioning method is hashpartitioning.
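With the hash partitioning method, a hash is computed over the key expressions of every row, and the destination partition_id is that hash modulo the number of partitions. For DataFrames, Spark SQL uses a Murmur3 hash with a non-negative modulo, roughly pmod(hash(key), numPartitions); the RDD-level HashPartitioner is the one that relies on Java's Object.hashCode.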
RepartitionByRange
This partitioning method creates numPartitions consecutive, non-overlapping ranges of values based on the partitioning key. Thus, at least one key expression is required, and it needs to be orderable.
df.repartitionByRange(3, "id").explain(True)
== Parsed Logical Plan ==
'RepartitionByExpression ['id ASC NULLS FIRST], 3
+- LogicalRDD [id#0L, value#1], false
== Analyzed Logical Plan ==
id: bigint, value: string
RepartitionByExpression [id#0L ASC NULLS FIRST], 3
+- LogicalRDD [id#0L, value#1], false
== Optimized Logical Plan ==
RepartitionByExpression [id#0L ASC NULLS FIRST], 3
+- LogicalRDD [id#0L, value#1], false
== Physical Plan ==
Exchange rangepartitioning(id#0L ASC NULLS FIRST, 3)
+- Scan ExistingRDD[id#0L,value#1]
Looking at the generated physical plan, we can see that rangepartitioning differs from the two others described above by the presence of the ordering clause in the partitioning expression. When no explicit sort order is specified in the expression, it uses ascending order by default.
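The idea behind range partitioning can be sketched in pure Python: estimate the range boundaries from a sample, then place each value in the first range whose upper bound is not smaller than it. This is a simplified illustration, not Spark's algorithm; the helper names, the sample size and the boundary rule are assumptions, and the input is assumed to be non-empty.

```python
import bisect
import random

def range_bounds(values, num_partitions, sample_size=100, seed=0):
    """Estimate num_partitions - 1 upper bounds from a random sample of the values."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    step = len(sample) / num_partitions
    bounds = [sample[max(int(step * i) - 1, 0)] for i in range(1, num_partitions)]
    # Duplicate bounds would create empty ranges, so keep each bound once
    return sorted(set(bounds))

def range_partition(value, bounds):
    """Index of the first range whose upper bound is >= value."""
    return bisect.bisect_left(bounds, value)

values = list(range(1, 22))  # ids 1..21, as in the example DataFrame
bounds = range_bounds(values, 3)
partitions = {}
for v in values:
    partitions.setdefault(range_partition(v, bounds), []).append(v)
for pid in sorted(partitions):
    print(pid, partitions[pid])
```

Because the bounds come from a sample, a larger dataset sampled only partially could produce slightly different ranges from one run to the next, and a column dominated by one value collapses several bounds into one, which leaves fewer non-empty ranges than requested.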
Some interesting links:
Repartition Logical Operators — Repartition and RepartitionByExpression
Range partitioning in Apache SparkSQL
hash vs range partitioning
I have a series of ~30 datasets that all need to be joined together for making a wide final table. This final table takes ~5 years of individual tables (one table per year) and unions them together, then joins this full history with the full history of other tables (similarly unioned) to make a big, historical, wide table.
The layout of these first, per-year tables is as follows:
table_type_1:
| primary_key | year |
|-------------|------|
| key_1 | 0 |
| key_2 | 0 |
| key_3 | 0 |
With other year tables like this:
table_type_1:
| primary_key | year |
|-------------|------|
| key_1 | 1 |
| key_2 | 1 |
These are then unioned together to create:
table_type_1:
| primary_key | year |
|-------------|------|
| key_1 | 0 |
| key_2 | 0 |
| key_3 | 0 |
| key_1 | 1 |
| key_2 | 1 |
Similarly, a second type of table when unioned results in the following:
table_type_2:
| primary_key | year |
|-------------|------|
| key_1 | 0 |
| key_2 | 0 |
| key_3 | 0 |
| key_1 | 1 |
| key_2 | 1 |
I now want to join table_type_1 with table_type_2 on primary_key and year to yield a much wider table. I notice that this final join takes a very long time and shuffles a lot of data.
How can I make this faster?
You can use bucketing on the per-year tables over the primary_key and year columns into the exact same number of buckets to avoid an expensive exchange when computing the final join.
- output: table_type_1_year_0
input: raw_table_type_1_year_0
hive_partitioning: none
bucketing: BUCKET_COUNT by (PRIMARY_KEY, YEAR)
- output: table_type_1_year_1
input: raw_table_type_1_year_1
hive_partitioning: none
bucketing: BUCKET_COUNT by (PRIMARY_KEY, YEAR)
...
- output: table_type_2_year_0
input: raw_table_type_2_year_0
hive_partitioning: none
bucketing: BUCKET_COUNT by (PRIMARY_KEY, YEAR)
- output: table_type_2_year_1
input: raw_table_type_2_year_1
hive_partitioning: none
bucketing: BUCKET_COUNT by (PRIMARY_KEY, YEAR)
...
- output: all_tables
input:
- table_type_1_year_0
- table_type_1_year_1
...
- table_type_2_year_0
- table_type_2_year_1
...
hive_partitioning: none
bucketing: BUCKET_COUNT by (PRIMARY_KEY, YEAR)
Note: When you are picking the BUCKET_COUNT value, it's important to understand it should be optimized for the final all_tables output, not for the intermediate tables. This will mean you likely will end up with files that are quite small for the intermediate tables. This is likely to be inconsequential compared to the efficiency gains of the all_tables output since you won't have to compute a massive exchange when joining everything up; your buckets will be pre-computed and you can simply SortMergeJoin on the input files.
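To illustrate why matching bucket layouts remove the need for an exchange, here is a pure-Python sketch under stated assumptions: both tables are split into the same number of buckets using the same function of (primary_key, year), so bucket i of one table can only match bucket i of the other. zlib.crc32 stands in for the real bucketing hash, and the columns a and b, the BUCKET_COUNT value and the helper names are hypothetical.

```python
import zlib
from collections import defaultdict

BUCKET_COUNT = 4  # assumption: any value works as long as both sides use the same one

def bucket_of(primary_key, year):
    """Bucket index derived from the join columns; identical for both tables."""
    token = f"{primary_key}|{year}".encode("utf-8")
    return zlib.crc32(token) % BUCKET_COUNT

def write_buckets(rows):
    """Group rows (dicts with primary_key and year) by bucket, as a bucketed write would."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row["primary_key"], row["year"])].append(row)
    return buckets

table_1 = [{"primary_key": f"key_{k}", "year": y, "a": k * 10 + y} for k in (1, 2, 3) for y in (0, 1)]
table_2 = [{"primary_key": f"key_{k}", "year": y, "b": k * 100 + y} for k in (1, 2, 3) for y in (0, 1)]

buckets_1 = write_buckets(table_1)
buckets_2 = write_buckets(table_2)

# Join bucket i of table 1 only with bucket i of table 2: no cross-bucket traffic needed
joined = []
for i in range(BUCKET_COUNT):
    right = {(r["primary_key"], r["year"]): r for r in buckets_2.get(i, [])}
    for left in buckets_1.get(i, []):
        match = right.get((left["primary_key"], left["year"]))
        if match is not None:
            joined.append({**left, **match})

print(len(joined))
```

If the two sides used different bucket counts or different bucketing columns, matching rows could sit in different bucket indexes and the per-bucket join would miss them, which is why the bucket specification has to be identical everywhere.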
For an explicit example on how to write the transform writing out a specified number of buckets, my answer over here is probably useful.
What I advise is to first union the small datasets, then broadcast the result of that union. Spark will ship that dataset to its different nodes, which reduces the number of shuffles. Union in Spark is well optimized, so what you need to think about is the process: select only the columns you need from the beginning, and avoid expensive operations such as groupByKey before the union, because Spark will run those operations when it executes the final job. I also advise you to avoid Hive, because it uses the MapReduce strategy, which is not worth it compared to Spark SQL. You can use this example function; just change the key. Use Scala if you can, as it interacts directly with Spark:
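import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

def map_To_cells(df1: DataFrame, df2: DataFrame): DataFrame = {
  // Rename the join key on the small side to avoid an ambiguous column name
  val df0 = df2.withColumnRenamed("key", "key0")
  df1.as("main").join(
    broadcast(df0),             // ship the small DataFrame to every executor
    df0("key0") <=> df1("key")  // null-safe equality on the key
  ) // then add .select(...) with only the columns you need
}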
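For readers unfamiliar with what a broadcast join does, this pure-Python sketch shows the principle: the small side is turned into an in-memory lookup table once, and every partition of the large side is joined against it locally. The data and the function name broadcast_hash_join are hypothetical, and this is an inner join for simplicity.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Join every partition of the large side against one in-memory copy of the small side."""
    # Build the lookup table once; in a cluster each executor would receive a copy
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for partition in large_partitions:  # each partition is processed where it already lives
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined

# Hypothetical data: two partitions of a large table and one small dimension table
large = [
    [{"key": 1, "amount": 10}, {"key": 2, "amount": 20}],
    [{"key": 1, "amount": 30}, {"key": 3, "amount": 40}],
]
small = [{"key": 1, "label": "one"}, {"key": 2, "label": "two"}]

result = broadcast_hash_join(large, small, "key")
print(result)
```

This only pays off when the broadcast side comfortably fits in memory on every executor; if both sides are large, the lookup table itself becomes the problem.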
Followup to this question
I have JSON streaming data in the format below
| A | B |
|-------|------------------------------------------|
| ABC | [{C:1, D:1}, {C:2, D:4}] |
| XYZ | [{C:3, D :6}, {C:9, D:11}, {C:5, D:12}] |
I need to transform it to the format below
| A | C | D |
|-------|-----|------|
| ABC | 1 | 1 |
| ABC | 2 | 4 |
| XYZ | 3 | 6 |
| XYZ | 9 | 11 |
| XYZ | 5 | 12 |
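The target shape can be described as a plain row-by-row flattening. The following pure-Python sketch shows that mapping on the sample data above; it is not Spark code, and the function name explode_rows is hypothetical.

```python
def explode_rows(records, array_field, keep_field):
    """Turn one record holding a list of dicts into one flat row per list element."""
    flat = []
    for record in records:
        for element in record[array_field]:
            row = {keep_field: record[keep_field]}
            row.update(element)  # copy C and D next to A
            flat.append(row)
    return flat

records = [
    {"A": "ABC", "B": [{"C": 1, "D": 1}, {"C": 2, "D": 4}]},
    {"A": "XYZ", "B": [{"C": 3, "D": 6}, {"C": 9, "D": 11}, {"C": 5, "D": 12}]},
]

flat_rows = explode_rows(records, "B", "A")
for row in flat_rows:
    print(row)
```

Each output row depends only on one input record, so the mapping itself needs no grouping across records.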
To achieve this, I performed the transformations suggested in the previous question.
val df1 = df0.select($"A", explode($"B")).toDF("A", "Bn")
val df2 = df1.withColumn("SeqNum", monotonically_increasing_id()).toDF("A", "Bn", "SeqNum")
val df3 = df2.select($"A", explode($"Bn"), $"SeqNum").toDF("A", "B", "C", "SeqNum")
val df4 = df3.withColumn("dummy", concat( $"SeqNum", lit("||"), $"A"))
val df5 = df4.select($"dummy", $"B", $"C").groupBy("dummy").pivot("B").agg(first($"C"))
val df6 = df5.withColumn("A", substring_index(col("dummy"), "||", -1)).drop("dummy")
Now I am trying to save the result to a csv file in HDFS
df6.withWatermark("event_time", "0 seconds")
.writeStream
.trigger(Trigger.ProcessingTime("0 seconds"))
.queryName("query_db")
.format("parquet")
.option("checkpointLocation", "/path/to/checkpoint")
.option("path", "/path/to/output")
// .outputMode("complete")
.start()
Now I get the below error.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
EventTimeWatermark event_time#223: timestamp, interval
My doubt is that I am not performing any aggregation that would require storing the aggregated value beyond the processing time for that row. Why do I get this error? Can I keep the watermark at 0 seconds?
Any help on this will be deeply appreciated.
As per my understanding, watermarking is required only when you are performing a window operation on event time. Spark uses watermarking to handle late data, and for that purpose it needs to keep older aggregation state.
The following link explains this very well with example:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
I don't see any window operations in your transformation, and if that is the case, I think you can try running the streaming query without watermarking.
When grouping a Spark streaming DataFrame, the watermark must already be defined on the DataFrame, and you have to take it into account while grouping by including a window on the watermarked column in your aggregation:
df.groupBy(col("dummy"), window(col("event_time"), "1 day"))
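As a rough mental model of how a watermark interacts with such a window, the sketch below assigns event times to one-day tumbling windows and reports which windows lie entirely behind the watermark (the maximum event time seen minus the allowed delay). It is a simplification in pure Python, not Spark's implementation; aligning windows to the epoch and treating a window as closed once its end is at or before the watermark are assumptions made for the example.

```python
from datetime import datetime, timedelta

def window_start(event_time, size):
    """Start of the tumbling window of length `size` that contains event_time."""
    epoch = datetime(1970, 1, 1)
    offset = (event_time - epoch) // size  # number of whole windows since the epoch
    return epoch + offset * size

def closed_windows(event_times, size, delay):
    """Windows whose end is at or before the watermark (max event time minus delay)."""
    watermark = max(event_times) - delay
    starts = {window_start(t, size) for t in event_times}
    return watermark, sorted(s for s in starts if s + size <= watermark)

events = [
    datetime(2021, 7, 21, 23, 50),
    datetime(2021, 7, 22, 0, 10),
    datetime(2021, 7, 23, 0, 5),
]
watermark, closed = closed_windows(events, timedelta(days=1), timedelta(seconds=0))
print(watermark, closed)
```

In this model only closed windows could be written in append mode, because an open window might still receive rows and change its aggregate.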
I'm using Spark-2.2.
I'm POCing Spark's bucketing.
I've created a bucketed table, here's the desc formatted my_bucketed_tbl output:
+--------------------+--------------------+-------+
| col_name| data_type|comment|
+--------------------+--------------------+-------+
| bundle| string| null|
| ifa| string| null|
| date_| date| null|
| hour| int| null|
| | | |
|# Detailed Table ...| | |
| Database| default| |
| Table| my_bucketed_tbl|
| Owner| zeppelin| |
| Created|Thu Dec 21 13:43:...| |
| Last Access|Thu Jan 01 00:00:...| |
| Type| EXTERNAL| |
| Provider| orc| |
| Num Buckets| 16| |
| Bucket Columns| [`ifa`]| |
| Sort Columns| [`ifa`]| |
| Table Properties|[transient_lastDd...| |
| Location|hdfs:/user/hive/w...| |
| Serde Library|org.apache.hadoop...| |
| InputFormat|org.apache.hadoop...| |
| OutputFormat|org.apache.hadoop...| |
| Storage Properties|[serialization.fo...| |
+--------------------+--------------------+-------+
When I'm executing an explain of a group by query, I can see that we've spared the exchange phase :
sql("select ifa,max(bundle) from my_bucketed_tbl group by ifa").explain
== Physical Plan ==
SortAggregate(key=[ifa#932], functions=[max(bundle#920)])
+- SortAggregate(key=[ifa#932], functions=[partial_max(bundle#920)])
+- *Sort [ifa#932 ASC NULLS FIRST], false, 0
+- *FileScan orc default.level_1[bundle#920,ifa#932] Batched: false, Format: ORC, Location: InMemoryFileIndex[hdfs://ip-10-44-9-73.ec2.internal:8020/user/hive/warehouse/level_1/date_=2017-1..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<bundle:string,ifa:string>
But when I replace Spark's max function with collect_set, I can see that the execution plan is the same as for a non-bucketed table, meaning the exchange phase is not avoided:
sql("select ifa,collect_set(bundle) from my_bucketed_tbl group by ifa").explain
== Physical Plan ==
ObjectHashAggregate(keys=[ifa#1010], functions=[collect_set(bundle#998, 0, 0)])
+- Exchange hashpartitioning(ifa#1010, 200)
+- ObjectHashAggregate(keys=[ifa#1010], functions=[partial_collect_set(bundle#998, 0, 0)])
+- *FileScan orc default.level_1[bundle#998,ifa#1010] Batched: false, Format: ORC, Location: InMemoryFileIndex[hdfs://ip-10-44-9-73.ec2.internal:8020/user/hive/warehouse/level_1/date_=2017-1..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<bundle:string,ifa:string>
Is there any configuration that I missed, or is it a limitation that Spark's bucketing has at the moment?
The issue was fixed in version 2.2.1.
You can find the Jira issue here