Spark SQL query with IN operator in CASE WHEN cannot be cast to SparkPlan - apache-spark

I'm trying to execute the test query like this:
SELECT COUNT(CASE WHEN name IN (SELECT name FROM requiredProducts) THEN name END)
FROM myProducts
which throws the following exception:
java.lang.ClassCastException:
org.apache.spark.sql.execution.datasources.LogicalRelation cannot be cast to
org.apache.spark.sql.execution.SparkPlan
I suspect that the IN operator cannot be used in CASE WHEN. Is that really so? The Spark documentation is silent about this.

The IN operator with a subquery does not work in a projection, regardless of whether it is contained in a CASE WHEN; it only works in filters. It works fine if you specify the values in the IN clause directly rather than using a subquery.
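For example, a literal list in the projection works fine (the values here are just placeholders):
SELECT COUNT(CASE WHEN name IN ('a', 'b') THEN name END)
FROM myProducts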
I am not sure how to generate the exact exception you got above, but when I attempt to run a similar query in Spark Scala, it returns a more descriptive error:
org.apache.spark.sql.AnalysisException: IN/EXISTS predicate sub-queries can only be used in a Filter: Project [CASE WHEN agi_label#5 IN (list#96 []) THEN 1 ELSE 0 END AS CASE WHEN (agi_label IN (listquery())) THEN 1 ELSE 0 END#97]
I have run into this issue in the past. Your best bet is probably to restructure it to use a left join to requiredProducts and then check for a null in the case statement. For example, something like this might work:
SELECT COUNT(CASE WHEN rp.name is not null THEN mp.name END)
FROM myProducts mp
LEFT JOIN requiredProducts rp ON mp.name = rp.name
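Alternatively, Spark SQL's LEFT SEMI JOIN keeps the IN semantics more closely (it will not duplicate rows if requiredProducts contains the same name more than once); a sketch of that variant:
SELECT COUNT(mp.name)
FROM myProducts mp
LEFT SEMI JOIN requiredProducts rp ON mp.name = rp.name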

Related

Databricks- Spark SQL Update statement error

This is a pretty straightforward update statement that works on a SQL Server DB. I have re-written it in Databricks, where it is not working. Can you provide your suggestions?
update
a
set
composite_account_key=nvl(e.account_key,0)
from
edw.account_fact a
join edw.account_dim b on (a.account_key=b.account_key)
join vw_account_hier c on (b.accountcode=c.accountcode)
join edw.analysis_codes_dim d on (d.anlys_code_dimkey=a.anlys_code_dimkey and c.atomic_anlys_appl_cde=d.anlys_appl_cde)
join vw_composite e on (c.edw_c_account_code=e.edw_c_account_code)
where
a.timekey='95'
ParseException:[PARSE_SYNTAX_ERROR] Syntax error at or near 'from'(line 5, pos 0)
The UPDATE statement syntax in Databricks SQL does not support a FROM clause.
You can instead create a temporary view from the result of all the join operations and use that view in the update statement directly.
The following demonstrates the same approach. When I try to use a FROM clause directly in the update statement (to update the id value to 10 wherever it is 1 in the join result), I get the same error.
So, I created a view first and then used it in the update query to get the result.
%sql
--CREATE TEMPORARY VIEW for_updt as (select a.id,a.gname,b.team from demo as a join demo1 as b on a.id=b.id );
update demo set id=10 where id in (select id from for_updt) and demo.id=1
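If demo happens to be a Delta table (an assumption, not stated in the question), the same change can also be expressed with MERGE INTO, which does accept a source relation; a sketch:
MERGE INTO demo
USING for_updt ON demo.id = for_updt.id
WHEN MATCHED AND demo.id = 1 THEN UPDATE SET id = 10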

Alternative to count in Spark sql to check if a query return empty result

I know the count action can be expensive in Spark, so to improve performance I'd like a different way just to check whether a query returns any results.
Here is what I did
var df = spark.sql("select * from table_name where condition = 'blah' limit 1");
var dfEmpty = df.head(1).isEmpty;
Is this a valid solution, or is there any potential uncaught error if I use the above to check the query result? It is a lot faster, though.
isEmpty is based on head of the data. This is quite reasonable for checking whether the result is empty or not; it is provided by the Spark API and is optimized, hence I'd prefer it.
Also, in the query I think the limit 1 is not required.
/**
* Returns true if the `Dataset` is empty.
*
* @group basic
* @since 2.4.0
*/
def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
plan.executeCollect().head.getLong(0) == 0
}
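Expressed as plain SQL, the check this implementation performs is roughly the following (a sketch, reusing the table and condition from the question):
SELECT count(*) = 0 AS is_empty
FROM (SELECT * FROM table_name WHERE condition = 'blah' LIMIT 1) t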
I think this is OK. I guess you could also omit the limit(1), because that is already part of the implementation of df.isEmpty. See also How to check if spark dataframe is empty?
Note that the solution with df.isEmpty may not evaluate all columns. E.g. if you have a UDF for one column, it will probably not execute and could throw exceptions on a real query. df.head(1).isEmpty, on the other hand, will evaluate all columns for one row.

RedShift Correlated Sub-query

Need your help. I am trying to convert the SQL query below to Redshift, but I am getting the error message "Invalid operation: This type of correlated subquery pattern is not supported yet".
SELECT
Comp_Key,
Comp_Reading_Key,
Row_Num,
Prev_Reading_Date,
( SELECT MAX(X) FROM (
SELECT CAST(dateadd(day, 1, Prev_Reading_Date) AS DATE) AS X
UNION ALL
SELECT dim_date.calendar_date
) a
) as start_dt
FROM stage5
JOIN dim_date ON calendar_date BETWEEN '2020-04-01' and '2020-04-15'
WHERE Comp_Key =50906055
The same query works fine in SQL Server. Could you please help me to run it in RedShift?
Regards,
Kiru
Kiru - you need to convert the correlated query into a join structure. Not knowing the data content of your tables or the exact expected output, I'm just guessing, but here's a swag:
SELECT
Comp_Key,
Comp_Reading_Key,
Row_Num,
Prev_Reading_Date,
Max_X
FROM stage5
JOIN dim_date ON calendar_date BETWEEN '2020-04-01' and '2020-04-15'
JOIN ( SELECT MAX(X) as Max_X, MAX(calendar_date) as date FROM (
SELECT CAST(dateadd(day, 1, Prev_Reading_Date) AS DATE) AS X, dim_date.calendar_date
FROM stage5
CROSS JOIN dim_date
) a
) as start_dt ON start_dt.date = dim_date.calendar_date
WHERE Comp_Key =50906055
This is just a starting guess but might get you started.
However, you are likely better off rewriting this query to use window functions as they are the fastest way to perform these types of looping queries in Redshift.
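If the correlated subquery is only there to pick the later of the two dates for each row, a plain GREATEST (rather than a window function) may already be enough to remove the correlation; a sketch under that assumption:
SELECT
Comp_Key,
Comp_Reading_Key,
Row_Num,
Prev_Reading_Date,
GREATEST(CAST(dateadd(day, 1, Prev_Reading_Date) AS DATE), calendar_date) AS start_dt
FROM stage5
JOIN dim_date ON calendar_date BETWEEN '2020-04-01' and '2020-04-15'
WHERE Comp_Key = 50906055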
Thanks Bill. It won't work in Redshift as it still has a correlated sub-query.
However, I have modified the query using another method and it works fine.
I am closing the ticket.

How to change query plan before execution (possibly turning an optimization off)?

I have a simple Spark SQL query:
SELECT x, y
FROM t1 INNER JOIN t2 ON t1.key = t2.key
WHERE expensiveFunction(t1.key)
where expensiveFunction is a Spark UDF (user-defined function).
When I look at the query plan generated by Spark, I see that it has two filter operations instead of just one: it checks not only expensiveFunction(t1.key), but also expensiveFunction(t2.key).
In general, this optimization is not a bad thing, because it reduces the number of records to join, and joining is an expensive operation. But in my case expensiveFunction(t2.key) always returns true, so I would like to remove it.
Is there a way to change the query plan before executing a query? Is there a way to indicate to Spark that I don't want a given optimization to be applied to my query?
Is there a way to change the query plan before executing a query?
In general, yes. There are a few extension points in the Spark SQL query planner and optimizer that make this doable.
Is there a way to indicate to Spark that I don't want a given optimization to be applied to my query?
That's nearly impossible unless the optimization allows for it. In other words, you'd have to find out whether the rule has an option to turn it off, e.g. CostBasedJoinReorder with the spark.sql.cbo.enabled or spark.sql.cbo.joinReorder.enabled configuration properties (when either is off, CostBasedJoinReorder does nothing).
You could write a custom logical operator that would make the optimization void (as it would not be matched, given the unknown logical operator) and remove it at the optimization phase.
Use extendedOperatorOptimizationRules to register custom optimizations.
This is happening because of the optimizer rule org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints
The code comment is as follows (from GitHub):
/**
* Infers an additional set of constraints from a given set of equality constraints.
* For e.g., if an operator has constraints of the form (`a = 5`, `a = b`), this returns an
* additional constraint of the form `b = 5`.
*/
def inferAdditionalConstraints(constraints: Set[Expression]): Set[Expression]
You could disable this optimizer rule using spark.sql.optimizer.excludedRules:
val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
.doc("Configures a list of rules to be disabled in the optimizer, in which the rules are " +
"specified by their rule names and separated by comma. It is not guaranteed that all the " +
"rules in this configuration will eventually be excluded, as some rules are necessary " +
"for correctness. The optimizer will log the rules that have indeed been excluded.")
.stringConf
.createOptional
That way the filter will not get propagated to both sides of the join.
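For example, to exclude the rule for the current session, something like the following should work (the config is available since Spark 2.4; the value is the fully qualified rule name shown above):
SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints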
You can rewrite this query like below to avoid the extra function call.
SELECT x, y
FROM (SELECT <required-columns> FROM t1 WHERE expensiveFunction(t1.key)) t0 INNER JOIN t2 ON t0.key = t2.key
To be extra sure, you can persist this query (SELECT <required-columns> FROM t1 WHERE expensiveFunction(t1.key)) as a separate DataFrame, and then join table t2 with that DataFrame.
For example, let's say we have DataFrames df1 and df2 for tables t1 and t2 respectively. We can do something like the following to avoid calling expensiveFunction twice.
val df3 = df1.filter("col1 == 1")
df3.persist() // cache df3 so the expensive filter on df1 is computed only once
df3.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2") // register df2 as t2 so the SQL below can see it
spark.sql("""SELECT t1.col1, t2.col2
FROM t1 INNER JOIN t2 ON t1.col2 = t2.col1""") // this query now has no reference to expensiveFunction

Postgres SQL Joins for Many To Many Relationship

Right now I am "learning" Postgres SQL. I have 3 tables:
1) User: userId
2) Stack: stackId
3) User_Stack: userId, stackId
Now I want to fetch all stacks belonging to one user, given the userId. I understand I need to use joins, but that's where I get stuck... I try it like this:
SELECT * FROM "Stack" LEFT OUTER JOIN "User_Stack" ON ('User_Stack.stackId' = 'Stack.stackId') WHERE "userId" = '590855';
Error: The returned data is empty.
PS: Is there any GUI query builder out there? Or do you have any other tips on how to systematically create queries?
EDIT: If I change the query to this:
SELECT * FROM "Stack" INNER JOIN "User_Stack" ON (User_Stack.stackId = Stack.stackId) WHERE "userId" = '590855';
I get the following error:
Kernel error: ERROR: missing FROM-clause entry for table "user_stack"
LINE 1: SELECT * FROM "Stack" INNER JOIN "User_Stack" ON (User_Stack...
Your main error is in the join. If you do 'something' = 'other' you're comparing string literals, not getting anything from the database. So this will always return false. You will want to compare table1.field1 = table2.field2
Another thing is the LEFT OUTER JOIN. I'm pretty sure you want an INNER JOIN, since you only want rows that have a match in the other table.
Also, don't use double quotes around field and table names, since that makes the database treat the names as case sensitive, and it's usually not good to have case-sensitive names. If you must quote, use lowercase names and always create them in lowercase.
Numbers also don't need to be quoted; quoting just causes more processing when the system has to convert them from text to numbers.
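Putting those fixes together with the schema as quoted (the mixed-case identifiers still need double quotes, and this assumes userId is a numeric column), a sketch of the corrected query:
SELECT s.*
FROM "Stack" s
INNER JOIN "User_Stack" us ON us."stackId" = s."stackId"
WHERE us."userId" = 590855;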
