BROADCASTJOIN hint is not working in PySpark SQL - apache-spark

I am trying to apply a broadcast hint to the smaller table, but the physical plan still shows a SortMergeJoin.
spark.sql('select /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.serial_id = c.serial_id').explain()
Output :
Note :
The tables are only a few KB in size (test data)
The join column 'serial_id' is not a partition column
Using glue catalog as metastore (AWS)
Spark Version - Spark 2.4.4
I have tried the BROADCASTJOIN and MAPJOIN hints as well
When I use created_date [a partition column] instead of serial_id in the join condition, the plan shows a broadcast join -
spark.sql('select /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.created_date = c.created_date').explain()
Output -
Why does Spark behave this way with the AWS Glue Catalog as my metastore?

The BROADCAST hint needs the alias of the table, because your SQL statement gives the table an alias.
Try /*+ BROADCAST(c) */ * instead of /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ *
spark.sql('select /*+ BROADCAST(c) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.serial_id = c.serial_id').explain()
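As a hedged illustration of the alias rule, the sketch below builds the hinted query from its parts and offers a way to check the resulting plan. The helper names broadcast_join_sql and uses_broadcast_join are hypothetical, spark is assumed to be an existing SparkSession, and the plan check goes through PySpark's internal _jdf handle, which is not a public API.

```python
def broadcast_join_sql(left_table, right_table, key, alias="c"):
    """Build a join query whose BROADCAST hint names the alias, not the table."""
    return (
        f"select /*+ BROADCAST({alias}) */ * "
        f"from {left_table} t join {right_table} {alias} "
        f"on t.{key} = {alias}.{key}"
    )


def uses_broadcast_join(spark, query):
    """Return True if the executed plan contains a broadcast hash join.

    Assumption: `spark` is a live SparkSession. `_jdf` is an internal
    handle to the JVM DataFrame and may change between Spark versions.
    """
    plan = spark.sql(query)._jdf.queryExecution().executedPlan().toString()
    return "BroadcastHashJoin" in plan


# Table and column names are the ones used in the question.
query = broadcast_join_sql(
    "pratik_test_staging.crosswalk2016",
    "pratik_test_temp.crosswalk2016",
    "serial_id",
)
```

With a running session, uses_broadcast_join(spark, query) can stand in for reading the explain() output by eye.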

Related

Spark - Remove broadcast variable declared in sql hint

Is there a way in spark to remove broadcast variables from the executor memory if it has been declared in sql hint?
I've seen this How to remove / dispose a broadcast variable from heap in Spark? but in my case I want to destroy that broadcast if it has been declared in a sql sentence like
val dfResult = spark.sql("""
select /*+ BROADCAST(b) */ a.id, a.name
from tableA a
join tableB b
on a.id = b.id
""")
Is it possible somehow? maybe exploring the execution plan of the dataframe?
Thanks

Hive hql to Spark sql conversion

I have a requirement to convert HQL to Spark SQL. I am using the approach below, but I am not seeing much change in performance. If anybody has a better suggestion, please let me know.
hive-
create table temp1 as select * from Table1 T1 join (select id , min(activity_date) as dt from Table1 group by id) T2 on T1.id=T2.id and T1.activity_date=T2.dt ;
create table temp2 as select * from temp1 join diff_table
I have around 70 such intermediate Hive temp tables, and the source Table1 holds around 1.8 billion records with no partitioning, stored in 200 HDFS files.
Spark code, running with 20 executors, 5 cores per executor, 10G executor memory, yarn-client mode, and a 4G driver:
import org.apache.spark.sql.{Row,SaveMode,SparkSession}
val spark=SparkSession.builder().appName("test").config("spark.sql.warehouse.dir","/usr/hive/warehouse").enableHiveSupport().getOrCreate()
import spark.implicits._
import spark.sql
val id_df=sql("select id , min(activity_date) as dt from Table1 group by id")
val all_id_df=sql("select * from Table1")
id_df.createOrReplaceTempView("min_id_table")
all_id_df.createOrReplaceTempView("all_id_table")
val temp1_df=sql("select * from all_id_table T1 join min_id_table T2 on T1.id=T2.id and T1.activity_date=T2.dt")
temp1_df.createOrReplaceTempView("temp2")
sql("create or replace table temp as select * from temp2")

spark cassandra connector problem using catalogs

I am following the instructions found here to connect my spark program to read data from Cassandra. Here is how I have configured spark:
val configBuilder = SparkSession.builder
.config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
.config("spark.cassandra.connection.host", cassandraUrl)
.config("spark.cassandra.connection.port", 9042)
.config("spark.sql.catalog.myCatalogName", "com.datastax.spark.connector.datasource.CassandraCatalog")
According to the documentation, once this is done I should be able to query Cassandra like this:
spark.sql("select * from myCatalogName.myKeyspace.myTable where myPartitionKey = something")
however when I do so I get the following error message:
mismatched input '.' expecting <EOF>(line 1, pos 43)
== SQL ==
select * from myCatalog.myKeyspace.myTable where myPartitionKey = something
----------------------------------^^^
When I try in the following format I am successful at retrieving entries from Cassandra:
val frame = spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "myKeyspace", "table" -> "myTable"))
.load()
.filter(col("timestamp") > startDate && col("timestamp") < endDate)
However, this query requires a full table scan. The table contains a few million entries, and I would prefer to use predicate pushdown, which seems to be available only via the SQL API.
I am using spark-core_2.11:2.4.3, spark-cassandra-connector_2.11:2.5.0 and Cassandra 3.11.6
Thanks!
The Catalogs API is available only in SCC 3.0, which has not been released yet. It will be released together with Spark 3.0, so it is not available in SCC 2.5.0. With 2.5.0 you need to register your table explicitly with create or replace temporary view..., as described in the docs:
spark.sql("""CREATE TEMPORARY VIEW myTable
USING org.apache.spark.sql.cassandra
OPTIONS (
table "myTable",
keyspace "myKeyspace",
pushdown "true")""")
Regarding pushdown (it works the same way for all DataFrame APIs: SQL, Scala, Python, ...): this kind of filter is pushed down only when your timestamp is the first clustering column. Even then, a typical problem is that startDate and endDate are specified as strings rather than timestamps. You can check by running frame.explain and verifying that the predicate is pushed down; it should have a * marker next to the predicate name.
For example,
val data = spark.read.cassandraFormat("sdtest", "test").load()
val filtered = data.filter("ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp) AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)")
val not_filtered = data.filter("ts >= '2019-03-10T14:41:34.373+0000' AND ts <= '2019-03-10T19:01:56.316+0000'")
The first filter expression (filtered) pushes the predicate down, while the second (not_filtered) requires a full scan.
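To make the difference concrete, here is a small sketch that builds the cast-based filter string from datetime values, so the bounds cannot silently stay plain strings. The helper name timestamp_range_filter is hypothetical; the column name ts and the two timestamps are taken from the example above. Whether the predicate is actually pushed down should still be confirmed with explain.

```python
from datetime import datetime, timezone


def timestamp_range_filter(column, start, end):
    """Return a SQL filter string with both bounds cast to timestamp.

    Expects timezone-aware datetimes; they are rendered in UTC with
    millisecond precision, matching the literal format used above.
    """
    def literal(ts):
        text = ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")
        # %f yields microseconds (6 digits); keep the first 3 for milliseconds.
        return f"cast('{text[:-3]}+0000' as timestamp)"

    return f"{column} >= {literal(start)} AND {column} <= {literal(end)}"


start = datetime(2019, 3, 10, 14, 41, 34, 373000, tzinfo=timezone.utc)
end = datetime(2019, 3, 10, 19, 1, 56, 316000, tzinfo=timezone.utc)
expr = timestamp_range_filter("ts", start, end)
```

The resulting string can be passed to a DataFrame's filter method in the same way as the hand-written expression above.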

Spark structured streaming broadcast join hint

I'm using Spark 2.2.0. With the following SQL statement, the broadcast hint does not seem to work.
// table dim is some static table
// table s is some stream table
spark.sql("""select /*+ BROADCAST(dim) */ s.a, dim.b from s left outer join dim
on s.b = dim.b""")
When I check the physical plan, it shows a SortMergeJoin.

How to stop Spark from loading the whole table?

I have read permission on a table that is partitioned by year, month, and day, but I do not have permission to read the data for 2016/04/24.
When I run this in the Hive shell:
hive>select * from table where year="2016" and month="06" and day="01";
I can read the data for every day except 2016/04/24.
But when I run the same query in Spark:
sqlContext.sql("select * from table where year='2016' and month='06' and day='01'")
an exception is thrown saying that I do not have permission on hdfs/.../2016/04/24.
Does this mean Spark SQL loads the whole table first and then filters?
How can I avoid loading the whole table?
You can use JdbcRDD directly. It bypasses the Spark SQL engine, so your queries are sent directly to Hive.
To use JdbcRDD you first need to load the Hive JDBC driver class so that it registers itself (unless it is already registered).
val driver = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driver)
Then you can create a JdbcRDD:
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD
val connUrl = "jdbc:hive2://..."
val query = """select * from table where year="2016" and month="06" and day="01" and ? = ?"""
val lowerBound = 0
val upperBound = 0
val numOfPartitions = 1
new JdbcRDD(
sc,
() => DriverManager.getConnection(connUrl),
query,
lowerBound,
upperBound,
numOfPartitions,
(r: ResultSet) => (r.getString(1) /** get data here or with a function**/)
)
The JdbcRDD query must contain two ? placeholders, which are filled with each partition's lower and upper bound in order to partition your data. So you should write a better query than mine; this one just creates a single partition to demonstrate how it works.
However, before doing this I recommend checking HiveContext, which supports HiveQL as well. Check this.
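The role of the two placeholders may be easier to see with a small sketch of how the bound range is divided. The function below reproduces, in plain Python, the arithmetic JdbcRDD is understood to use when it splits [lowerBound, upperBound] into numPartitions inclusive ranges; each (start, end) pair is then bound to the two ? placeholders of one partition's query. The name partition_bounds is hypothetical, and the formula is an assumption based on a reading of the Spark source, not something stated in this answer.

```python
def partition_bounds(lower_bound, upper_bound, num_partitions):
    """Split an inclusive [lower, upper] range into inclusive sub-ranges.

    Assumption: this mirrors JdbcRDD's partitioning arithmetic for
    non-negative bounds; it is an illustration, not Spark code.
    """
    length = 1 + upper_bound - lower_bound
    bounds = []
    for i in range(num_partitions):
        start = lower_bound + (i * length) // num_partitions
        end = lower_bound + ((i + 1) * length) // num_partitions - 1
        bounds.append((start, end))
    return bounds


# Values from the answer: one partition covering the single value 0.
single = partition_bounds(0, 0, 1)
# A hypothetical numeric key ranging from 1 to 100, split four ways.
four = partition_bounds(1, 100, 4)
```

With the values from the answer, the only range is (0, 0), so the trailing condition ? = ? becomes 0 = 0 and every row passes. A query meant for real parallelism would more typically compare a numeric column against the two placeholders.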

Resources