How to enable storage-partitioned join in Spark/Iceberg?

How do I use the storage partitioned join feature in Spark 3.3.0? I've tried it out, and my query plan still shows the expensive ColumnarToRow and Exchange steps. My setup is as follows:
joining two Iceberg tables, both partitioned on hours(ts), bucket(20, id)
join attempted both on a.id = b.id AND a.ts = b.ts, and on a.id = b.id alone
tables are large, 100+ partitions used, 100+ GB of data to join
spark: 3.3.0
iceberg: org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.1
set my spark session config with spark.sql.sources.v2.bucketing.enabled=true
I read through all the docs I could find on the storage partitioned join feature:
tracker
SPIP
PR
YouTube demo
I'm wondering if there are other things I need to configure, if there needs to be something implemented in Iceberg still, or if I've set up something wrong. I'm super excited about this feature. It could really speed up some of our large joins.

Support for this hasn't been implemented in Iceberg yet. In fact, it looks like the work is proceeding as I'm typing: https://github.com/apache/iceberg/issues/430#issuecomment-1283014666
This answer should be updated when there's a release of Iceberg that supports Spark storage-partitioned joins.

Support for storage-partitioned joins (SPJ) has been added to Iceberg in PR #6371 and will be released in 1.2.0. Keep in mind Spark added support for SPJ for v2 sources only in 3.3, so earlier versions can't benefit from this feature.
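With those versions in place, SPJ still needs a couple of session settings beyond the one mentioned in the question. A sketch, assuming Iceberg 1.2.0+ on Spark 3.3+ (property names taken from the Spark and Iceberg docs; verify them against your exact versions):

```scala
// From the question: enable v2 bucketing in Spark
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
// Iceberg side: report the tables' data grouping (partitioning) to Spark
spark.conf.set("spark.sql.iceberg.planning.preserve-data-grouping", "true")
// The join condition must cover all partition transforms of both tables
// (here hours(ts) and bucket(20, id)), i.e. join on a.id = b.id AND a.ts = b.ts;
// joining on a.id = b.id alone cannot use SPJ with this partition spec.
```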

Related

Migration from Spark 2.4.0 to Spark 3.1.1 caused SortMergeJoin to change to BroadcastHashJoin

I'm currently working on a Spark migration project that aims to migrate all Spark SQL pipelines to Spark 3.x and take advantage of all its performance improvements. My company is using Spark 2.4.0, but we are targeting 3.1.1 as the official version for all Spark SQL data pipelines, without AQE enabled yet. The primary goal is to keep everything the same while using the newer version; later on, we can easily enable AQE for all data pipelines.
For a specific case, right after the spark version change, we faced the following error:
org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
We investigated this issue and looking at Spark UI logs, we noticed a change in the query plan as follows:
Spark 2.4.0:
Spark 2.4.0 uses the default SortMergeJoin to join tbl_a and tbl_b, but when we look at the query plan from the new Spark 3.1.1:
We can see that instead of SortMergeJoin it is using BroadcastHashJoin to join tbl_a and tbl_b. Not only that, but if I'm not wrong, the BroadcastExchange operation is occurring on the big-table side, which seems strange from my perspective.
As additional information, we have the following properties regarding the execution of both jobs:
spark.sql.autoBroadcastJoinThreshold = 10MB
spark.sql.adaptive.enabled = false # AQE is disabled
spark.sql.shuffle.partitions = 200
and other non-relevant properties.
Do you have any clue why this is happening? My questions are:
Why has Spark 3 changed the join approach in this situation, given that AQE is disabled and spark.sql.autoBroadcastJoinThreshold is much smaller than the dataset size?
Is this the expected behavior, or could it represent a potential bug in Spark 3.x?
Please, let me know your thoughts. I appreciate all the help in advance.
UPDATE - 2022-07-27
After digging into the Spark code for some days and debugging it, I was able to understand what is happening. Basically, the retrieved statistics are the problem. Apparently, Spark 3 gets the statistics from a Hive table attribute called rawDataSize. If this isn't defined, then it looks for the totalSize table property, as we can see in the following source code:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala#L69
During my tests, this property held a very small number (way lower than the autoBroadcastJoinThreshold property), making the Spark optimizer think it was safe to broadcast the right relation; but when the actual broadcast operation happened, the relation turned out to be much bigger, causing the timeout error.
I fixed the issue for my test by running the following command on Hive for a specific partition set:
ANALYZE TABLE table_b PARTITION(ds='PARTITION_VALUE', hr='PARTITION_VALUE') COMPUTE STATISTICS;
The rawDataSize is now zero, so Spark 3 uses totalSize (which has a reasonable value) as the relation size and, consequently, no longer uses BHJ for this situation.
Now the issue is figuring out why rawDataSize was so small (or even zero) in the first place, given that the Hive property hive.stats.autogather is true by default (auto-calculating statistics for every DML command), but that seems to be a separate problem.
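To see which of the two statistics Spark will pick up, the stored table parameters can be inspected directly; a sketch using the question's table and partition spec (NOSCAN fills in size-based stats such as totalSize without reading the data):

```sql
-- Look for rawDataSize and totalSize under "Table Parameters"
DESCRIBE FORMATTED table_b;
DESCRIBE FORMATTED table_b PARTITION(ds='PARTITION_VALUE', hr='PARTITION_VALUE');

-- Recompute size-only stats, no data scan:
ANALYZE TABLE table_b PARTITION(ds='PARTITION_VALUE', hr='PARTITION_VALUE')
COMPUTE STATISTICS NOSCAN;
```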
Spark has made many improvements around joins.
One of them is :
AQE converts sort-merge join to broadcast hash join when the runtime statistics of any join side is smaller than the broadcast hash join threshold. This is not as efficient as planning a broadcast hash join in the first place, but it’s better than keep doing the sort-merge join, as we can save the sorting of both the join sides, and read shuffle files locally to save network traffic (if spark.sql.adaptive.localShuffleReader.enabled is true)
https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html#converting-sort-merge-join-to-broadcast-join

What is best approach to join data in spark streaming application?

Question: essentially, rather than running a join against the C* table for each streaming record, is there any way to run a join per micro-batch of records in Spark Streaming?
We have almost finalized on spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x.
But have one fundamental question regarding the efficiency in the below scenario.
For the streaming data records (i.e. streamingDataSet), I need to look up existing records (i.e. cassandraDataset) from a Cassandra (C*) table.
i.e.
Dataset<Row> streamingDataSet = //kafka read dataset
Dataset<Row> cassandraDataset = // records loaded earlier from the C* table
To look up data i need to join above datasets
i.e.
Dataset<Row> joinDataSet = streamingDataSet.join(cassandraDataset).where(//somelogic)
process further the joinDataSet to implement the business logic ...
In the above scenario, my understanding is that for each record received from the Kafka stream it would query the C* table, i.e. a database call. Wouldn't that take huge time and network bandwidth if the C* table contains billions of records? What approach should be followed to improve the C* table lookup?
What is the best solution in this scenario? I cannot load the C* table once and look it up, as data keeps being added to the C* table, i.e. new lookups might need newly persisted data.
How to handle this kind of scenario? Any advice, please.
If you're using Apache Cassandra, then you have only one possibility for an effective join with data in Cassandra - the RDD API's joinWithCassandraTable. The open-source version of the Spark Cassandra Connector (SCC) supports only that, while the DSE version has code that performs an effective join against Cassandra for Spark SQL as well - the so-called DSE Direct Join. If you use a Spark SQL join against a Cassandra table, Spark will need to read all data from Cassandra and then perform the join - that's very slow.
I don't have an example for OSS SCC for doing the join for Spark Structured Streaming, but I have some examples for "normal" join, like this:
// Static helpers (someColumns, mapRowToTuple, ...) come from the connector's Java API:
import static com.datastax.spark.connector.japi.CassandraJavaUtil.*;

// trdd was built earlier via javaFunctions(...) over an RDD of Tuple1<Integer> keys
CassandraJavaPairRDD<Tuple1<Integer>, Tuple2<Integer, String>> joinedRDD =
    trdd.joinWithCassandraTable("test", "jtest",
        someColumns("id", "v"), someColumns("id"),
        mapRowToTuple(Integer.class, String.class), mapTupleToRow(Integer.class));
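For Structured Streaming, one hedged way to get a per-micro-batch lookup is foreachBatch (available since Spark 2.4), converting each batch to an RDD and joining it with the Cassandra table. A sketch in Scala, reusing the test.jtest table from the example; the exact row mapping depends on your schema:

```scala
import com.datastax.spark.connector._

streamingDataSet.writeStream
  .foreachBatch { (batch: org.apache.spark.sql.DataFrame, batchId: Long) =>
    // Only this micro-batch's keys are sent to Cassandra, not the whole table
    val joined = batch.select("id").rdd
      .map(row => Tuple1(row.getInt(0)))
      .joinWithCassandraTable("test", "jtest")
    // joined: RDD[(Tuple1[Int], CassandraRow)] - continue business logic here
  }
  .start()
```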

Can join with Cassandra table get pushdown?

I'm using Structured Streaming. I need to left-join a huge (billions of rows) Cassandra table to know whether the source data in a micro-batch is new or already exists, in terms of the id column. If I do something like:
val src = spark.read.cassandraFormat("src", "ks").load().select("id")
val query= some_dataset
.join(src, expr("src.id=some_dataset.id"), joinType = "leftOuter")
.withColumn("flag", expr("case when src.id is null then 0 else 1 end"))
.writeStream
.outputMode("update")
.foreach(...)
.start
Can Cassandra push down the left join and look up by the join column values from the source delta? Is there a way to tell whether the pushdown happened or not?
Thanks
Not in the open-source version of the Spark Cassandra Connector. There is support for it as DSE Direct Join in DSE Analytics, so if you use DataStax Enterprise, you'll get it. If you're using the OSS connector, you're limited to the RDD API only.
Update, May 2020: an optimized join on dataframes is supported since SCC 2.5.0, together with other previously commercial features. See this blog post for details.
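In SCC 2.5.0+ the dataframe-level direct join is activated through the connector's Catalyst extensions. A sketch of enabling it (the directJoinSetting key is my recollection of the SCC configuration; verify against the SCC docs for your version):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // registers the rules that can rewrite a join into a direct join against Cassandra
  .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  // optional: "on"/"off"/"auto" - force or disable the rewrite instead of the size heuristic
  .config("directJoinSetting", "auto")
  .getOrCreate()
```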

update query in Spark SQL

I wonder whether I can use an UPDATE query in Spark SQL, like:
sqlContext.sql("update users set name = '*' where name is null")
I got the error:
org.apache.spark.sql.AnalysisException:
Unsupported language features in query:update users set name = '*' where name is null
Does Spark SQL not support the UPDATE query, or am I writing the code incorrectly?
Spark SQL doesn't support UPDATE statements yet.
Hive has supported UPDATE since version 0.14, but even Hive supports updates/deletes only on tables that support transactions, as mentioned in the Hive documentation.
See the answers in the Databricks forums confirming that UPDATE/DELETE are not supported in Spark SQL, as it doesn't support transactions. If you think about it, supporting random updates is very complex with most big-data storage formats: it requires scanning huge files, updating specific records, and rewriting potentially TBs of data. It is not normal SQL.
Now it's possible, with Databricks Delta Lake
Spark SQL now supports UPDATE, DELETE, and similar data-modification operations if the underlying table is in Delta format.
Check this out:
https://docs.delta.io/0.4.0/delta-update.html#update-a-table
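With the table stored in Delta format, the exact statement from the question becomes valid; for example (assuming `users` was written as a Delta table registered in the metastore):

```sql
UPDATE users SET name = '*' WHERE name IS NULL;
```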

Does Spark SQL include a table streaming optimization for joins?

Does Spark SQL include a table streaming optimization for joins and, if so, how does it decide which table to stream?
When doing joins, Hive assumes the last table is the largest one. As a join optimization, it will attempt to buffer the smaller join tables and stream the last one through. If the last table in the join list is not the largest one, Hive has the /*+ STREAMTABLE(tbl) */ hint which tells it the table that should be streamed. As of v1.4.1, Spark SQL does not support the STREAMTABLE hint.
This question has been asked for normal RDD processing, outside of Spark SQL, here. The answer does not apply to Spark SQL where the developer has no control of explicit cache operations.
I looked for an answer to this question some time ago, and all I could come up with was the spark.sql.autoBroadcastJoinThreshold parameter, which is 10 MB by default. Spark will then attempt to automatically broadcast any table smaller than that limit. Join order plays no role for this setting.
If you are interested in further improving join performance, I highly recommend this presentation.
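For completeness: while there is no STREAMTABLE hint, Spark SQL does let you pick the broadcast side explicitly, which is the closest equivalent. A sketch (largeDf/smallDf are placeholder DataFrames):

```scala
import org.apache.spark.sql.functions.broadcast

// DataFrame API: mark the small side; the large side is then streamed through the join
val joined = largeDf.join(broadcast(smallDf), Seq("id"))

// SQL hint form (available since Spark 2.2):
spark.sql("SELECT /*+ BROADCAST(s) */ * FROM large l JOIN small s ON l.id = s.id")
```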
This covers the upcoming Spark 2.3 (RC2 is being voted on as the next release).
As of v1.4.1, Spark SQL does not support the STREAMTABLE hint.
It does not in the latest (and voted to be released soon) Spark 2.3 either.
There is no support for the STREAMTABLE hint, but given the recent change (SPARK-20857 Generic resolved hint node) to build a hint framework, one should be fairly easy to write.
You'd have to write some Spark optimizations and possibly physical plan(s) to support STREAMTABLE (which seems like a lot of work), but it's possible. The tools are there.
Regarding join optimizations, in the upcoming Spark 2.3 there are two main logical optimizations:
ReorderJoin
CostBasedJoinReorder (exclusively for cost-based optimization)
