Spark SQL: how to count duplicate values

I have a table with 100 million rows, and I would like to know how many unique values there are in the CTAC column.
I tried:
SELECT COUNT(*)
FROM ( SELECT CTAC
FROM my_table
GROUP BY CTAC
HAVING COUNT(*) > 1)
but this gives me an error:
sql.AnalysisException : cannot recognize input near '<EOF>' in subquery source
Can we do a subquery in Spark? If so, how?
Which query should I use to solve my problem?

Try it differently, using the DataFrame API:
println(dataFrame.select("CTAC").distinct.count)
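If you want to keep the pure-SQL approach, the parse error is most likely the missing alias on the subquery: Spark SQL (like HiveQL) requires derived tables in the FROM clause to be named. A minimal sketch, reusing the table and column names from the question:
SELECT COUNT(*)
FROM (
    SELECT CTAC
    FROM my_table
    GROUP BY CTAC
    HAVING COUNT(*) > 1
) t -- the alias 't' is required; without it the parser fails at '<EOF>'
Note that this counts values occurring more than once, whereas the distinct count above counts all unique values.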

Related

Spark SQL how to use Filter

I have written the following SQL:
select count(value) as total, name , window from `event`
where count(value) > 1 group by window(event_time,'2 minutes'),name
Spark is giving me the following error:
Aggregate/Window/Generate expressions are not valid in where clause of the query.
Expression in where clause: [(count(event.`value`) > CAST(1 AS BIGINT))]
Invalid expressions: [count(event.`value`)]
What's the correct syntax?
You need to use HAVING instead (documentation), and it should be put after GROUP BY:
select count(value) as total, name , window from `event`
group by window(event_time,'2 minutes'),name
having total > 1
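Referencing the select-list alias total in HAVING works here, but not every engine resolves aliases in HAVING; a more portable form simply repeats the aggregate:
select count(value) as total, name , window from `event`
group by window(event_time,'2 minutes'),name
having count(value) > 1 -- repeat the aggregate instead of the alias total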

Snowflake interprets boolean values in parquet as NULL?

Parquet Entry Example (All entries have is_active_entity as true)
{
"is_active_entity": true,
"is_removed": false
}
Query that demonstrates all values are taken as NULL
select $1:IS_ACTIVE_ENTITY::boolean, count(*) from @practitioner_delta_stage/part-00000-49224c02-150b-493b-8036-54ab30a8ff40-c000.snappy.parquet group by $1:IS_ACTIVE_ENTITY::boolean;
Output has only one group for NULL
$1:IS_ACTIVE_ENTITY::BOOLEAN | COUNT(*)
NULL                         | 4930277
I don't know where I am going wrong. Spark writes the correct schema to Parquet, as the example shows, but Snowflake reads the column as NULL.
How do I fix this?
The column names in your file are quoted. As a consequence, "is_active_entity" is not the same as "IS_ACTIVE_ENTITY".
Please try this query:
select $1:is_active_entity::boolean, count(*) from @practitioner_delta_stage/part-00000-49224c02-150b-493b-8036-54ab30a8ff40-c000.snappy.parquet group by $1:is_active_entity::boolean;
More info: https://docs.snowflake.com/en/sql-reference/identifiers-syntax.html
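The case sensitivity is easy to verify directly: Snowflake's path lookup into semi-structured data ($1:field) is case-sensitive, so only the spelling that matches the Parquet schema returns values. A hedged sketch against the same staged file (assuming, as in the question, that the stage's file format is Parquet):
select
    $1:is_active_entity::boolean as lower_case_path, -- matches the Parquet field name, returns true/false
    $1:IS_ACTIVE_ENTITY::boolean as upper_case_path -- no such field in the file, returns NULL
from @practitioner_delta_stage/part-00000-49224c02-150b-493b-8036-54ab30a8ff40-c000.snappy.parquet
limit 10;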

Redshift Correlated Subquery

I need your help. I am trying to convert the SQL query below to Redshift, but I am getting the error message "Invalid operation: This type of correlated subquery pattern is not supported yet".
SELECT
Comp_Key,
Comp_Reading_Key,
Row_Num,
Prev_Reading_Date,
( SELECT MAX(X) FROM (
SELECT CAST(dateadd(day, 1, Prev_Reading_Date) AS DATE) AS X
UNION ALL
SELECT dim_date.calendar_date
) a
) as start_dt
FROM stage5
JOIN dim_date ON calendar_date BETWEEN '2020-04-01' and '2020-04-15'
WHERE Comp_Key =50906055
The same query works fine in SQL Server. Could you please help me get it running in Redshift?
Regards,
Kiru
Kiru - you need to convert the correlated query into a join structure. Not knowing the data content of your tables or the exact expected output, I'm just guessing, but here's a swag:
SELECT
    Comp_Key,
    Comp_Reading_Key,
    Row_Num,
    Prev_Reading_Date,
    Max_X
FROM stage5
JOIN dim_date ON calendar_date BETWEEN '2020-04-01' AND '2020-04-15'
JOIN (
    SELECT MAX(X) AS Max_X, MAX(calendar_date) AS date
    FROM (
        SELECT CAST(DATEADD(day, 1, Prev_Reading_Date) AS DATE) AS X,
               d.calendar_date
        FROM stage5
        CROSS JOIN (SELECT calendar_date FROM dim_date) d
    ) a
) AS start_dt ON start_dt.date = dim_date.calendar_date
WHERE Comp_Key = 50906055
This is just a guess, but it might get you started.
However, you are likely better off rewriting this query to use window functions, as they are the fastest way to perform these kinds of looping queries in Redshift.
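For this particular pattern there may be an even simpler rewrite: the correlated subquery is the classic row-wise maximum idiom (MAX over a two-row UNION ALL), which Redshift can express directly with GREATEST, avoiding the correlation entirely. A hedged sketch, reusing the names from the question:
SELECT
    Comp_Key,
    Comp_Reading_Key,
    Row_Num,
    Prev_Reading_Date,
    -- row-wise max of (Prev_Reading_Date + 1 day) and calendar_date
    GREATEST(CAST(DATEADD(day, 1, Prev_Reading_Date) AS DATE), calendar_date) AS start_dt
FROM stage5
JOIN dim_date ON calendar_date BETWEEN '2020-04-01' AND '2020-04-15'
WHERE Comp_Key = 50906055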
Thanks Bill. It won't work in Redshift as it still has a correlated subquery.
However, I have modified the query another way and it works fine.
I am closing the ticket.

Getting an error while running an SQL script in ADW

I am getting an error that goes like this:
Insert values statement can contain only constant literal values or variable references.
These are the statements in which I am getting the errors:
INSERT INTO val.summary_numbers (metric_name, metric_val, dt_create) VALUES ('Total IP Enconters',
(SELECT
count(DISTINCT encounter_id)
FROM prod.encounter
WHERE encounter_type = 'Inpatient')
,
(SELECT min(mod_loadidentifier)
FROM ccsm.stg_demographics_baseline)
);
INSERT INTO val.summary_numbers (metric_name, metric_val, dt_create) VALUES ('Total 30d Readmits',
(SELECT
count(DISTINCT encounter_id)
FROM prod.encounter_attr
WHERE
attr_name = 'day_30_readmit' AND attr_value = 1)
,
(SELECT min(mod_loadidentifier)
FROM ccsm.stg_demographics_baseline));
Change your query like this:
insert into val.summary_numbers
select
'Total IP Enconters',
(select count(distinct encounter_id)
from prod.encounter
where encounter_type = 'Inpatient'),
(select min(mod_loadidentifier)
from ccsm.stg_demographics_baseline)
When using the ADW service, I would recommend that you consider using the CTAS operation, possibly combined with a RENAME. The RENAME is a metadata operation, so it is fast, and CTAS runs in parallel, whereas the INSERT INTO will be row by row.
You may still have a data-related issue that is hard to diagnose without the CREATE TABLE statement.
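A hedged sketch of that CTAS-plus-RENAME pattern (the _new/_old table names and the ROUND_ROBIN distribution are illustrative assumptions; the other names come from the question):
CREATE TABLE val.summary_numbers_new
WITH (DISTRIBUTION = ROUND_ROBIN) -- CTAS needs a table option; ROUND_ROBIN is an assumption
AS
SELECT
    'Total IP Enconters' AS metric_name,
    (SELECT COUNT(DISTINCT encounter_id)
     FROM prod.encounter
     WHERE encounter_type = 'Inpatient') AS metric_val,
    (SELECT MIN(mod_loadidentifier)
     FROM ccsm.stg_demographics_baseline) AS dt_create;

RENAME OBJECT val.summary_numbers TO summary_numbers_old; -- metadata-only, fast
RENAME OBJECT val.summary_numbers_new TO summary_numbers;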
Thanks

Getting ValidationFailureSemanticException on 'INSERT OVERWRITE'

I am creating a DataFrame and registering it as a temp view using df.createOrReplaceTempView('mytable'). After that I try to write the content of 'mytable' into a Hive table (which has a partition) using the following query:
insert overwrite table
myhivedb.myhivetable
partition(testdate) -- (1): note that I have a partition named 'testdate'
select
Field1,
Field2,
...
TestDate -- (2): note that I have a field named 'TestDate'; (1) and (2) have the same name
from
mytable
When I execute this query, I get the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.Table$ValidationFailureSemanticException: Partition spec
{testdate=, TestDate=2013-01-01}
It looks like I am getting this error because of the matching field names, i.e. testdate (the partition in Hive) and TestDate (the field in the temp view 'mytable').
Whereas if my partition name is different from the field name (i.e. not 'testdate'), the query executes successfully. Example:
insert overwrite table
myhivedb.myhivetable
partition(my_partition) -- note that the partition name here is not 'testdate'
select
Field1,
Field2,
...
TestDate
from
mytable
My guess is that this is a bug in Spark, but I would like a second opinion. Am I missing something here?
@DuduMarkovitz @dhee: apologies for being so late with the response. I was finally able to resolve the issue. Earlier I was creating the table using camelCase (in the CREATE statement), which seems to have been the reason for the exception. Now I have created the table using DDL where the field names are in lower case, and this has resolved my issue.
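A hedged illustration of that fix (the DDL below is hypothetical, modeled on the question): Hive stores identifiers in lower case, so declaring the table with lower-case column and partition names keeps the partition spec and the select list consistent.
-- hypothetical DDL: all-lowercase column and partition names
CREATE TABLE myhivedb.myhivetable (
    field1 STRING,
    field2 STRING
)
PARTITIONED BY (testdate STRING);

-- the dynamic-partition insert then resolves 'testdate' unambiguously
INSERT OVERWRITE TABLE myhivedb.myhivetable
PARTITION (testdate)
SELECT
    field1,
    field2,
    testdate -- lower case, matching the partition column
FROM mytable;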
