Spark - Remove broadcast variable declared in sql hint - apache-spark

Is there a way in Spark to remove a broadcast variable from executor memory if it has been declared in a SQL hint?
I've seen How to remove / dispose a broadcast variable from heap in Spark?, but in my case I want to destroy the broadcast when it has been declared in a SQL statement like
val dfResult = spark.sql("""
select /*+ BROADCAST(b) */ a.id, a.name
from tableA a
join tableB b
on a.id = b.id
""")
Is it possible somehow, maybe by exploring the execution plan of the DataFrame?
Thanks
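
A minimal sketch of the "explore the execution plan" idea, assuming Spark 2.x physical-plan classes: it walks the executed plan and collects the broadcast nodes. This only confirms that the hint produced a broadcast; it does not by itself free executor memory.

import org.apache.spark.sql.execution.exchange.BroadcastExchangeExec
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

// Walk the executed (physical) plan and collect the broadcast-related nodes.
val plan = dfResult.queryExecution.executedPlan

val broadcastJoins     = plan.collect { case j: BroadcastHashJoinExec => j }
val broadcastExchanges = plan.collect { case e: BroadcastExchangeExec => e }

// Non-empty results confirm that the /*+ BROADCAST(b) */ hint actually
// produced a broadcast join in the plan.
println(s"broadcast joins: ${broadcastJoins.size}, exchanges: ${broadcastExchanges.size}")

Unlike an explicit sparkContext.broadcast(...) handle, a broadcast created by the SQL hint is managed internally by Spark and is normally released by the ContextCleaner once the plan is no longer referenced, so locating it like this is mostly useful for confirming that the hint took effect.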

Related

BROADCASTJOIN hint is not working in PySpark SQL

I am trying to provide a broadcast hint to the table which is smaller in size, but the physical plan is still showing me a SortMergeJoin.
spark.sql('select /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.serial_id = c.serial_id').explain()
Output: the physical plan still shows a SortMergeJoin.
Note:
Sizes of the tables are in KBs (test data)
The joining column 'serial_id' is not a partitioned column
Using AWS Glue Catalog as the metastore
Spark version - Spark 2.4.4
I have tried the BROADCASTJOIN and MAPJOIN hints as well
When I try to use created_date [a partitioned column] instead of serial_id as my joining condition, it shows me a broadcast join -
spark.sql('select /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.created_date = c.created_date').explain()
Output: the physical plan now shows a broadcast join.
Why does Spark behave this way with AWS Glue Catalog as my metastore?
In the BROADCAST hint you need to pass the alias of the table (since you have aliased it in your SQL statement).
Try /*+ BROADCAST(c) */ * instead of /*+ BROADCAST(pratik_test_temp.crosswalk2016) */ *:
spark.sql('select /*+ BROADCAST(c) */ * from pratik_test_staging.crosswalk2016 t join pratik_test_temp.crosswalk2016 c on t.serial_id = c.serial_id').explain()
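
A minimal alternative sketch (in Scala, although the question uses PySpark), not from the original answer: mark the smaller table explicitly with the broadcast() function from the DataFrame API instead of relying on how the hint resolves the table name.

import org.apache.spark.sql.functions.broadcast

// Read both tables and mark the smaller one explicitly for broadcast.
val staging = spark.table("pratik_test_staging.crosswalk2016")
val temp    = spark.table("pratik_test_temp.crosswalk2016")

val joined = staging.join(broadcast(temp), Seq("serial_id"))
joined.explain()  // should now show a BroadcastHashJoin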

Spark structured streaming broadcast join hint

I'm using Spark 2.2.0, and with the following SQL statement the broadcast hint does not seem to work.
// table dim is some static table
// table s is some stream table
spark.sql("select /*+ BROADCAST(dim) */ s.a, dim.b from s left outer join dim
on s.b = dim.b")
And when I check the physical plan, it shows a SortMergeJoin.
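
For comparison only, a sketch of the DataFrame-level alternative: wrapping the static side in the broadcast() function instead of relying on the SQL hint. Whether the planner honors this for a stream-static join in 2.2.0 is exactly what the question is asking, so this is not a confirmed fix; streamDf and dimDf are hypothetical handles for the tables s and dim.

import org.apache.spark.sql.functions.broadcast

// streamDf and dimDf are hypothetical DataFrame handles for the registered
// tables s and dim from the question.
val joined = streamDf
  .join(broadcast(dimDf), streamDf("b") === dimDf("b"), "left_outer")
  .select(streamDf("a"), dimDf("b"))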

Getting OutOfMemoryError - GC overhead limit exceeded in PySpark

In the middle of the project I am getting the below error after invoking a function in my Spark SQL query.
I have written a user-defined function which takes two strings and concatenates them; after concatenation it takes the rightmost substring of length 5, depending on the total string length (an alternative to SQL Server's right(string, integer)).
from pyspark.sql.types import StringType

def concatstring(xstring, ystring):
    # Concatenate the two strings, then keep the rightmost 5 characters
    # depending on the total length; otherwise return a sentinel value.
    newvalstring = xstring + ystring
    print(newvalstring)
    if len(newvalstring) == 6:
        return newvalstring[1:6]
    elif len(newvalstring) == 7:
        return newvalstring[2:7]
    else:
        return '99999'

spark.udf.register('rightconcat', lambda x, y: concatstring(x, y), StringType())
It works fine individually. Now, when I use it as a column in my Spark SQL query, this exception occurs.
The query is:
spark.sql("select d.BldgID,d.LeaseID,d.SuiteID,coalesce(BLDG.BLDGNAME,('select EmptyDefault from EmptyDefault')) as LeaseBldgName,coalesce(l.OCCPNAME,('select EmptyDefault from EmptyDefault'))as LeaseOccupantName, coalesce(l.DBA, ('select EmptyDefault from EmptyDefault')) as LeaseDBA, coalesce(l.CONTNAME, ('select EmptyDefault from EmptyDefault')) as LeaseContact,coalesce(l.PHONENO1, '')as LeasePhone1,coalesce(l.PHONENO2, '')as LeasePhone2,coalesce(l.NAME, '') as LeaseName,coalesce(l.ADDRESS, '') as LeaseAddress1,coalesce(l.ADDRESS2,'') as LeaseAddress2,coalesce(l.CITY, '')as LeaseCity, coalesce(l.STATE, ('select EmptyDefault from EmptyDefault'))as LeaseState,coalesce(l.ZIPCODE, '')as LeaseZip, coalesce(l.ATTENT, '') as LeaseAttention,coalesce(l.TTYPID, ('select EmptyDefault from EmptyDefault'))as LeaseTenantType,coalesce(TTYP.TTYPNAME, ('select EmptyDefault from EmptyDefault'))as LeaseTenantTypeName,l.OCCPSTAT as LeaseCurrentOccupancyStatus,l.EXECDATE as LeaseExecDate, l.RENTSTRT as LeaseRentStartDate,l.OCCUPNCY as LeaseOccupancyDate,l.BEGINDATE as LeaseBeginDate,l.EXPIR as LeaseExpiryDate,l.VACATE as LeaseVacateDate,coalesce(l.STORECAT, (select EmptyDefault from EmptyDefault)) as LeaseStoreCategory ,rightconcat('00000',cast(coalesce(SCAT.SORTSEQ,99999) as string)) as LeaseStoreCategorySortID from Dim_CMLease_primer d join LEAS l on l.BLDGID=d.BldgID and l.LEASID=d.LeaseID left outer join SUIT on SUIT.BLDGID=l.BLDGID and SUIT.SUITID=l.SUITID left outer join BLDG on BLDG.BLDGID= l.BLDGID left outer join SCAT on SCAT.STORCAT=l.STORECAT left outer join TTYP on TTYP.TTYPID = l.TTYPID").show()
I have uploaded the query and the post-query state here.
How could I solve this problem? Kindly guide me.
The simplest thing to try would be increasing the Spark executor memory:
spark.executor.memory=6g
Make sure you're using all the available memory. You can check that in UI.
UPDATE 1
With --conf spark.executor.extraJavaOptions="Option" you can pass -Xmx1024m as an option.
What's your current spark.driver.memory and spark.executor.memory?
Increasing them should resolve the problem.
Bear in mind that, according to the Spark documentation:
Note that it is illegal to set Spark properties or heap size settings with this option. Spark properties should be set using a SparkConf object or the spark-defaults.conf file used with the spark-submit script. Heap size settings can be set with spark.executor.memory.
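
A minimal sketch of what that note means in practice (shown in Scala; the PySpark builder is analogous, and the appName and 4g/6g values are placeholders): the memory settings are Spark properties, not JVM options.

import org.apache.spark.sql.SparkSession

// Memory is set as Spark properties, not as -Xmx in extraJavaOptions.
// The appName and the 4g/6g values are placeholders.
val spark = SparkSession.builder()
  .appName("example-job")
  .config("spark.executor.memory", "6g")
  // spark.driver.memory has no effect here in client mode, because the driver
  // JVM is already running; set it via spark-submit or spark-defaults.conf.
  .config("spark.driver.memory", "4g")
  .getOrCreate()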
UPDATE 2
As the GC overhead error is a garbage collection problem, I would also recommend reading this great answer.

Why is Spark SQL in Spark 1.6.1 not using broadcast join in CTAS?

I have a query in Spark SQL which is using broadcast join as expected as my table b is smaller than spark.sql.autoBroadcastJoinThreshold.
However, if I put the exact same select query into a CTAS query then it's NOT using a broadcast join for some reason.
The select query looks like this:
select id,name from a join b on a.name = b.bname;
And the explain for this looks like this:
== Physical Plan ==
Project [id#1,name#2]
+- BroadcastHashJoin [name#2], [bname#3], BuildRight
   :- Scan ParquetRelation: default.a[id#1,name#2] InputPaths: ...
   +- ConvertToUnsafe
      +- HiveTableScan [bname#3], MetastoreRelation default, b, Some(b)
Then my CTAS looks like this:
create table c as select id,name from a join b on a.name = b.bname;
And the explain for this one returns:
== Physical Plan ==
ExecutedCommand CreateTableAsSelect [Database:default}, TableName: c, InsertIntoHiveTable]
+- Project [id#1,name#2]
   +- Join Inner, Some((name#2 = bname#3))
      :- Relation[id#1,name#2] ParquetRelation: default.a
      +- MetastoreRelation default, b, Some(b)
Is it expected to NOT use broadcast join for the select query that's part of a CTAS query? If not, is there a way to force CTAS to use broadcast join?
If your question is about the reason why Spark creates two different physical plans then this answer won't be helpful. I have observed plenty of sensitivity in Spark's optimizer where the same SQL snippets result in meaningfully different physical plans even if it is not obvious why that is the case.
However, if your question is ultimately about how to execute the CTAS with a broadcast join then here is a simple workaround I have used many times: register the query with the plan you like as a temporary table (or view if you are using the SQL console) and then use SELECT * from tmp_tbl as the query to feed the CTAS.
In other words, something like:
sql("select id, name from a join b on a.name = b.bname").registerTempTable("tmp_joined")
sql("create table c as select * from tmp_joined")

Spark SQL self join reading twice from HDFS blocks

Below is a sample snippet of the fairly complex query that I am executing with Spark 1.3.1 (window functions are not an option in this version). This query is reading around 18K blocks from HDFS twice and then doing a shuffle with 18K partitions.
Since it is a self join and both tables are grouped by and joined on the same keys, I was assuming that all the keys would be co-located on the same partition for the join, possibly avoiding the shuffle.
Is there a way to avoid reading twice and also to avoid the shuffle? Can I repartition the input sets with the default partitioner, or run the group by separately on a DataFrame rather than executing it as a single query? Thanks.
val df = hiveContext.sql("""SELECT
EVNT.COL1
,EVNT.COL2
,EVNT.COL3
,MAX(CASE WHEN (EVNT.COL4 = EVNT_DRV.min_COL4) THEN EVNT.COL5
ELSE -2147483648 END) AS COL5
FROM
TRANS_EVNT EVNT
INNER JOIN (SELECT
COL1
,COL2
,COL3
,COL6
,MIN(COL4) AS min_COL4
FROM
TRANS_EVNT
WHERE partition_key between '2015-01-01' and '2015-01-31'
GROUP BY
COL1
,COL2
,COL3
,COL6) EVNT_DRV
ON
EVNT.COL1 = EVNT_DRV.COL1
AND EVNT.COL2 = EVNT_DRV.COL2
AND EVNT.COL3 = EVNT_DRV.COL3
AND EVNT.COL6 = EVNT_DRV.COL6
WHERE partition_key between '2015-01-01' and '2015-01-31'
GROUP BY
EVNT.COL1
,EVNT.COL2
,EVNT.COL3
,EVNT.COL6""")
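
One possible direction, sketched here under assumptions rather than as a verified answer from the original thread: materialize the filtered partition range once, cache it, and run both the aggregation and the join against the cached table, so HDFS is read only once. Whether the shuffle can also be avoided depends on the partitioning, which Spark 1.3.1 gives little control over from SQL.

// Read the month of data once and cache it (Spark 1.3.1-era API).
val base = hiveContext.sql("""SELECT COL1, COL2, COL3, COL4, COL5, COL6
  FROM TRANS_EVNT
  WHERE partition_key BETWEEN '2015-01-01' AND '2015-01-31'""")
base.cache()
base.registerTempTable("TRANS_EVNT_MONTH")

// Both the driver aggregate and the outer query can now read the cached table.
val drv = hiveContext.sql("""SELECT COL1, COL2, COL3, COL6, MIN(COL4) AS min_COL4
  FROM TRANS_EVNT_MONTH
  GROUP BY COL1, COL2, COL3, COL6""")
drv.registerTempTable("EVNT_DRV")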
