Spark SQL: how to use Filter

I have written the following SQL:
select count(value) as total, name, window
from `event`
where count(value) > 1
group by window(event_time, '2 minutes'), name
Spark is giving me the following error:
Aggregate/Window/Generate expressions are not valid in where clause of the query.
Expression in where clause: [(count(event.`value`) > CAST(1 AS BIGINT))]
Invalid expressions: [count(event.`value`)]
What's the correct syntax?

You need to use HAVING instead (documentation), and it should be put after GROUP BY:
select count(value) as total, name, window
from `event`
group by window(event_time, '2 minutes'), name
having total > 1
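If your Spark version does not resolve the alias total inside the HAVING clause, repeating the aggregate works as well:
select count(value) as total, name, window
from `event`
group by window(event_time, '2 minutes'), name
having count(value) > 1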

Related

Databricks- Spark SQL Update statement error

This is a pretty straightforward UPDATE statement that works on SQL Server, and I have rewritten it in Databricks, where it is not working. Can you provide your suggestions?
update
a
set
composite_account_key=nvl(e.account_key,0)
from edw.account_fact a
join edw.account_dim b on (a.account_key=b.account_key)
join vw_account_hier c on (b.accountcode=c.accountcode)
join edw.analysis_codes_dim d on (d.anlys_code_dimkey=a.anlys_code_dimkey and c.atomic_anlys_appl_cde=d.anlys_appl_cde)
join vw_composite e on (c.edw_c_account_code=e.edw_c_account_code)
where
a.timekey='95'
ParseException: [PARSE_SYNTAX_ERROR] Syntax error at or near 'from' (line 5, pos 0)
The UPDATE statement in Databricks SQL does not support a FROM clause.
Instead, you can create a temporary view from the result of all the join operations and use that view directly in the UPDATE statement.
The following demonstrates this with sample tables. When I try to use a FROM clause directly in the UPDATE statement (updating id to 10 wherever the join result has id 1), I get the same error.
So I created a view first and then used it in the UPDATE query to get the result:
%sql
create temporary view for_updt as (select a.id, a.gname, b.team from demo as a join demo1 as b on a.id = b.id);
update demo set id = 10 where id in (select id from for_updt) and demo.id = 1
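If the target is a Delta table, MERGE INTO is another way to express the original multi-join update. A sketch, reusing the table and column names from the question (the view name composite_src is illustrative, and this assumes edw.account_fact is a Delta table):
CREATE OR REPLACE TEMPORARY VIEW composite_src AS
SELECT a.account_key,
       nvl(e.account_key, 0) AS new_composite_account_key
FROM edw.account_fact a
JOIN edw.account_dim b ON a.account_key = b.account_key
JOIN vw_account_hier c ON b.accountcode = c.accountcode
JOIN edw.analysis_codes_dim d
  ON d.anlys_code_dimkey = a.anlys_code_dimkey
 AND c.atomic_anlys_appl_cde = d.anlys_appl_cde
JOIN vw_composite e ON c.edw_c_account_code = e.edw_c_account_code
WHERE a.timekey = '95';

-- MERGE allows at most one source row per target row; deduplicate
-- composite_src first if the joins can fan out.
MERGE INTO edw.account_fact t
USING composite_src s
ON t.account_key = s.account_key AND t.timekey = '95'
WHEN MATCHED THEN
  UPDATE SET t.composite_account_key = s.new_composite_account_key;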

Spark SQL query with IN operator in CASE WHEN cannot be cast to SparkPlan

I'm trying to execute a test query like this:
SELECT COUNT(CASE WHEN name IN (SELECT name FROM requiredProducts) THEN name END)
FROM myProducts
which throws the following exception:
java.lang.ClassCastException:
org.apache.spark.sql.execution.datasources.LogicalRelation cannot be cast to
org.apache.spark.sql.execution.SparkPlan
I suspect that the IN operator cannot be used inside CASE WHEN. Is it really so? The Spark documentation is silent about this.
An IN operator using a subquery does not work in a projection, regardless of whether it is contained in a CASE WHEN; it only works in filters. It works fine if you specify the values in the IN clause directly rather than using a subquery.
I am not sure how to reproduce the exact exception you got above, but when I attempt to run a similar query in Spark with Scala, it returns a more descriptive error:
org.apache.spark.sql.AnalysisException: IN/EXISTS predicate sub-queries can only be used in a Filter: Project [CASE WHEN agi_label#5 IN (list#96 []) THEN 1 ELSE 0 END AS CASE WHEN (agi_label IN (listquery())) THEN 1 ELSE 0 END#97]
I have run into this issue in the past. Your best bet is probably to restructure it to use a left join to requiredProducts and then check for a null in the case statement. For example, something like this might work:
SELECT COUNT(CASE WHEN rp.name is not null THEN mp.name END)
FROM myProducts mp
LEFT JOIN requiredProducts rp ON mp.name = rp.name
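One caveat: the LEFT JOIN will count a product twice if requiredProducts contains the same name twice. If that can happen, a LEFT SEMI JOIN counts each row of myProducts at most once (a sketch of the same idea):
-- counts myProducts rows that have at least one match in requiredProducts
SELECT COUNT(*)
FROM myProducts mp
LEFT SEMI JOIN requiredProducts rp ON mp.name = rp.name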

Getting an error while running an SQL script in ADW

I am getting an error that goes like this:
Insert values statement can contain only constant literal values or variable references.
These are the statements in which I am getting the errors:
INSERT INTO val.summary_numbers (metric_name, metric_val, dt_create)
VALUES ('Total IP Enconters',
        (SELECT COUNT(DISTINCT encounter_id)
         FROM prod.encounter
         WHERE encounter_type = 'Inpatient'),
        (SELECT MIN(mod_loadidentifier)
         FROM ccsm.stg_demographics_baseline));

INSERT INTO val.summary_numbers (metric_name, metric_val, dt_create)
VALUES ('Total 30d Readmits',
        (SELECT COUNT(DISTINCT encounter_id)
         FROM prod.encounter_attr
         WHERE attr_name = 'day_30_readmit' AND attr_value = 1),
        (SELECT MIN(mod_loadidentifier)
         FROM ccsm.stg_demographics_baseline));
Change your query like this:
insert into val.summary_numbers
select
'Total IP Enconters',
(select count(distinct encounter_id)
from prod.encounter
where encounter_type = 'Inpatient'),
(select min(mod_loadidentifier)
from ccsm.stg_demographics_baseline)
When using the ADW service, I would recommend considering a CTAS operation, possibly combined with a RENAME. The RENAME is a metadata operation, so it is fast, and the CTAS runs in parallel, whereas the INSERT INTO will run row by row.
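A sketch of that pattern for the first metric above, assuming val.summary_numbers is a regular user table (the ROUND_ROBIN distribution and the _new/_old names are illustrative):
-- Illustrative CTAS + RENAME; rebuilds the table with the new row appended.
CREATE TABLE val.summary_numbers_new
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT metric_name, metric_val, dt_create
FROM val.summary_numbers
UNION ALL
SELECT 'Total IP Enconters',
       (SELECT COUNT(DISTINCT encounter_id)
        FROM prod.encounter
        WHERE encounter_type = 'Inpatient'),
       (SELECT MIN(mod_loadidentifier)
        FROM ccsm.stg_demographics_baseline);

RENAME OBJECT val.summary_numbers TO summary_numbers_old;
RENAME OBJECT val.summary_numbers_new TO summary_numbers;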
You may still have a data-related issue that is hard to determine without the CREATE TABLE statement.
Thanks

Spotfire: Syntax issue or window functions not allowed?

I am using Data functions in Spotfire.
I have the sqldf package installed.
Here is the query:
#Package to run sqls
library(sqldf)
#Input data frame
op1 <- sqldf("SELECT Prod_parnt,prodct_grop,year,month,week,
count(distinct id) as prd_cnt,
Sum(Count(distinct id))
over (partition by modlty,prodct_grop order by year,month,week
rows between 12 preceding and current row) as cumu_prd_cnt,
avg(rate) as sal_rate
FROM ip1
group by Prod_parnt,prodct_grop,year,month,week")
The error I am facing:
"TIBCO Spotfire Statistics Services returned an error: 'Error: error in statement: near "(": syntax error'."
The point to note here is that when I remove the window function (the cumu_prd_cnt field), the code works fine.
Need your help.

Spark SQL: how to count duplicate values

I have a table with 100 million rows, and I would like to know how many distinct values there are in the CTAC column.
I tried:
SELECT COUNT(*)
FROM ( SELECT CTAC
FROM my_table
GROUP BY CTAC
HAVING COUNT(*) > 1)
but this gives me an error:
sql.AnalysisException : cannot recognize input near '<EOF>' in subquery source
Can we do a subquery in Spark? If so, how?
Which query should I try to solve my problem?
Try it differently, as:
println(dataFrame.select("CTAC").distinct.count)
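For the record, the original query most likely fails only because this SQL dialect requires an alias on a derived table; adding one makes the subquery parse. Note, though, that the HAVING version counts only values that appear more than once, while the stated question (number of distinct values) matches COUNT(DISTINCT ...), which avoids the subquery entirely. Both as a sketch:
SELECT COUNT(*)
FROM (SELECT CTAC
      FROM my_table
      GROUP BY CTAC
      HAVING COUNT(*) > 1) t  -- alias t added

SELECT COUNT(DISTINCT CTAC) FROM my_table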
