Aliasing different WINDOW clauses in Spark SQL - apache-spark

Is it possible to have aliases for multiple windows in the same query?
For example:
select
cust_id,
eff_dt,
row_number() over w AS rec1
from cust
WINDOW w AS (PARTITION BY cust_id ORDER BY eff_dt desc);
The above runs fine. But it fails when I try to add another Window alias:
select
cust_id,
eff_dt,
row_number() over w AS rec1,
rank() over w2 AS rec2
from cust
WINDOW w AS (PARTITION BY cust_id ORDER BY eff_dt desc),
WINDOW w2 AS (PARTITION BY cust_id ORDER BY version asc);
Can anyone please explain how to use both of the window aliases above?
Thanks

You can do that with a nested query:
select
cust_id,
eff_dt,
rec1,
rank() over w2 AS rec2
from (
select
cust_id,
eff_dt,
version,
row_number() over w AS rec1
from cust
WINDOW w AS (PARTITION BY cust_id ORDER BY eff_dt desc))
WINDOW w2 AS (PARTITION BY cust_id ORDER BY version asc);
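Depending on the Spark version, the nested query may not even be needed: in standard SQL the WINDOW keyword appears only once, with the named windows separated by commas, and recent Spark releases accept that form. A hedged sketch, worth trying before falling back to the subquery:
select
cust_id,
eff_dt,
row_number() over w AS rec1,
rank() over w2 AS rec2
from cust
-- single WINDOW clause, comma-separated aliases; whether this parses depends on your Spark version
WINDOW w AS (PARTITION BY cust_id ORDER BY eff_dt desc),
       w2 AS (PARTITION BY cust_id ORDER BY version asc);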

Related

Cassandra select order by

I created a table like this:
CREATE TABLE sm.data (
did int,
tid int,
ts timestamp,
aval text,
dval decimal,
PRIMARY KEY (did, tid, ts)
) WITH CLUSTERING ORDER BY (tid ASC, ts DESC);
Until now I did all my SELECT queries with ts DESC, so it was fine. Now I also need SELECT queries with ts ASC in some cases. How do I accomplish that? Thank you
You can simply use ORDER BY ts ASC
Example :
SELECT * FROM data WHERE did = ? and tid = ? ORDER BY ts ASC
If you run this select:
select * from data where did=1 and tid=2 order by ts asc;
you will end up with an error:
InvalidRequest: Error from server: code=2200 [Invalid query] message="Order by currently only support the ordering of columns following their declared order in the PRIMARY KEY"
I have tested it against my local Cassandra DB.
I would suggest altering the order of the primary key columns. The reason is that:
"Querying compound primary keys and sorting results: ORDER BY clauses can select a single column only. That column has to be the second column in a compound PRIMARY KEY."
CREATE TABLE data2 (
did int,
tid int,
ts timestamp,
aval text,
dval decimal,
PRIMARY KEY (did, ts, tid)
) WITH CLUSTERING ORDER BY (ts DESC, tid ASC)
Now we are free to choose the type of ordering for TS
cassandra@cqlsh:airline> SELECT * FROM data2 WHERE did = 1 and ts=2 order by ts DESC;
did | ts | tid | aval | dval
-----+----+-----+------+------
(0 rows)
cassandra@cqlsh:airline> SELECT * FROM data2 WHERE did = 1 and ts=2 order by ts ASC;
did | ts | tid | aval | dval
-----+----+-----+------+------
(0 rows)
Another way would be either to create a new table or a materialized view; the latter would lead to data duplication behind the scenes anyway.
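For completeness, a minimal sketch of the materialized-view route (assuming Cassandra 3.0+ for materialized views; the view name is made up). It keeps the same partition key but flips the clustering order of ts, so ascending reads come from the view while descending reads stay on the base table:
CREATE MATERIALIZED VIEW sm.data_ts_asc AS   -- hypothetical view name
    SELECT * FROM sm.data
    WHERE did IS NOT NULL AND tid IS NOT NULL AND ts IS NOT NULL
    PRIMARY KEY (did, tid, ts)
    WITH CLUSTERING ORDER BY (tid ASC, ts ASC);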
Hope that's clear enough.

equivalent percentile_cont function in apache spark sql

I am new to the Spark environment. I have a dataset with the following column names:
user_id, Date_time, order_quantity
I want to calculate the 90th percentile of order_quantity for each user_id.
If it were SQL, I would have used the following query:
%sql
SELECT user_id, PERCENTILE_CONT ( 0.9 ) WITHIN GROUP (ORDER BY order_quantity) OVER (PARTITION BY user_id)
However, Spark doesn't have built-in support for the percentile_cont function.
Any suggestions on how I can implement this in Spark on the above dataset?
Please let me know if more information is needed.
I have a solution for PERCENTILE_DISC(0.9), which returns the discrete order_quantity closest to percentile 0.9 (without interpolation).
The idea is to calculate PERCENT_RANK, subtract 0.9, take the absolute value, and then pick the row with the minimal value:
%sql
WITH temp1 AS (
  SELECT
    user_id,
    order_quantity,
    ABS(PERCENT_RANK() OVER
      (PARTITION BY user_id ORDER BY order_quantity) - 0.9) AS perc_90_temp
  FROM my_table  -- placeholder; the question does not name the table
)
SELECT
  user_id,
  FIRST_VALUE(order_quantity) OVER
    (PARTITION BY user_id ORDER BY perc_90_temp) AS perc_disc_90
FROM
  temp1;
I was dealing with a similar issue too. I worked in SAP HANA and then I moved to Spark SQL on Databricks. I have migrated the following SAP HANA query:
SELECT
DISTINCT ITEM_ID,
LOCATION_ID,
PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY VENTAS) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY PRECIO) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM MY_TABLE
to
SELECT DISTINCT
ITEM_ID,
LOCATION_ID,
PERCENTILE(VENTAS,0.8) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
PERCENTILE(PRECIO,0.5) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM
delta.`MY_TABLE`
In your particular case it should be as follows:
SELECT DISTINCT user_id, PERCENTILE(order_quantity,0.9) OVER (PARTITION BY user_id)
I hope this helps.
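As an alternative sketch (not from the answers above): Spark SQL also ships an approx_percentile aggregate, which sidesteps the window entirely when one row per user_id is enough. It is approximate rather than interpolated like PERCENTILE_CONT, and the table name below is a placeholder since the question never names one.
SELECT user_id,
       approx_percentile(order_quantity, 0.9) AS p90_order_quantity
FROM my_table  -- placeholder table name
GROUP BY user_id;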

Filling in NULLS with previous records - Netezza SQL

I am using Netezza SQL on Aginity Workbench and have the following data:
id DATE1 DATE2
1 2013-07-27 NULL
2 NULL NULL
3 NULL 2013-08-02
4 2013-09-10 2013-09-23
5 2013-12-11 NULL
6 NULL 2013-12-19
I need to fill in all the NULL values in DATE1 with preceding values in the DATE1 field that are filled in. With DATE2, I need to do the same, but in reverse order. So my desired output would be the following:
id DATE1 DATE2
1 2013-07-27 2013-08-02
2 2013-07-27 2013-08-02
3 2013-07-27 2013-08-02
4 2013-09-10 2013-09-23
5 2013-12-11 2013-12-19
6 2013-12-11 2013-12-19
I only have read access to the data, so creating tables or views is out of the question.
How about this?
select
id
,last_value(date1 ignore nulls) over (
order by id
rows between unbounded preceding and current row
) date1
,first_value(date2 ignore nulls) over (
order by id
rows between current row and unbounded following
) date2
from your_table  -- placeholder; the question does not name the table
You can manually calculate this as well, rather than relying on the windowing functions.
with chain as (
select
this.*,
prev.date1 prev_date1,
case when prev.date1 is not null then abs(this.id - prev.id) else null end prev_distance,
next.date2 next_date2,
case when next.date2 is not null then abs(this.id - next.id) else null end next_distance
from
Table1 this
left outer join Table1 prev on this.id >= prev.id
left outer join Table1 next on this.id <= next.id
), min_distance as (
select
id,
min(prev_distance) min_prev_distance,
min(next_distance) min_next_distance
from
chain
group by
id
)
select
chain.id,
chain.prev_date1,
chain.next_date2
from
chain
join min_distance on
min_distance.id = chain.id
and chain.prev_distance = min_distance.min_prev_distance
and chain.next_distance = min_distance.min_next_distance
order by chain.id
If you're unable to calculate the distance between IDs by subtraction, just replace the ordering scheme by a row_number() call.
I think Netezza supports the order by clause for max() and min(). So, you can do:
select max(date1) over (order by date1) as date1,
min(date2) over (order by date2 desc) as date2
. . .
EDIT:
In Netezza, you may be able to do this with last_value() and first_value():
select last_value(date1 ignore nulls) over (order by id rows between unbounded preceding and 1 preceding) as date1,
first_value(date2 ignore nulls) over (order by id rows between 1 following and unbounded following) as date2
Netezza doesn't seem to support IGNORE NULLs on LAG(), but it does on these functions.
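A small caveat on the EDIT above (my note, not the answerer's): both frames exclude the current row, so a row that already has a value would get its neighbour's value instead. Wrapping the window calls in COALESCE keeps existing values; roughly, with a placeholder table name:
select id,
       coalesce(date1, last_value(date1 ignore nulls) over (order by id rows between unbounded preceding and 1 preceding)) as date1,
       coalesce(date2, first_value(date2 ignore nulls) over (order by id rows between 1 following and unbounded following)) as date2
from your_table  -- placeholder table name
order by id;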
I've only tested this in Oracle so hopefully it works in Netezza:
Fiddle:
http://www.sqlfiddle.com/#!4/7533f/1/0
select id,
coalesce(date1, t1_date1, t2_date1) as date1,
coalesce(date2, t3_date2, t4_date2) as date2
from (select t.*,
t1.date1 as t1_date1,
t2.date1 as t2_date1,
t3.date2 as t3_date2,
t4.date2 as t4_date2,
row_number() over(partition by t.id order by t.id) as rn
from tbl t
left join tbl t1
on t1.id < t.id
and t1.date1 is not null
left join tbl t2
on t2.id > t.id
and t2.date1 is not null
left join tbl t3
on t3.id < t.id
and t3.date2 is not null
left join tbl t4
on t4.id > t.id
and t4.date2 is not null
order by t.id) x
where rn = 1
Here's a way to fill in NULL dates with the most recent min/max non-null dates using self-joins. This query should work on most databases:
select t1.id, max(t2.date1), min(t3.date2)
from tbl t1
join tbl t2 on t1.id >= t2.id
join tbl t3 on t1.id <= t3.id
group by t1.id
http://www.sqlfiddle.com/#!4/acc997/2

Unpivot and Pivot does not return data

I'm trying to return data as columns.
I've written this unpivot and pivot query:
select StockItemCode, barcode, barcode2 from
(
select StockItemCode, col+cast(seq as varchar(20)) col, value
from
(
select
(select min(StockItemCode) from RTLBarCode t2 where t.StockItemCode = t2.StockItemCode) StockItemCode,
cast(BarCode as varchar(20)) barcode,
row_number() over(partition by StockItemCode order by StockItemCode) seq
from RTLBarCode t
) d
unpivot(value for col in (barcode)) unpiv
) src
pivot (max(value) for col in (barcode, barcode2)) piv;
But the problem is that only the "Barcode2" field returns a value (the barcode field returns NULL when in fact there is a value).
SAMPLE DATA
I have a table called RTLBarCode.
It has a field called Barcode and a field called StockItemCode.
For StockItemCode = 10 I have 2 rows with a Barcode value of 5014721112824 and 0000000019149.
Can anyone see where I am going wrong?
Many thanks
You are appending an index to your barcode column in unpiv (col + seq).
This results in col values barcode1 and barcode2.
But then you are pivoting on barcode instead of barcode1, so no value is found and the aggregate returns NULL.
The correct statement would be:
select StockItemCode, barcode1, barcode2 from
(
select StockItemCode, col+cast(seq as varchar(20)) col, value
from
(
select
(select min(StockItemCode)from RTLBarCode t2 where t.StockItemCode = t2.StockItemCode) StockItemCode,
cast(BarCode as varchar(20)) barcode,
row_number() over(partition by StockItemCode order by StockItemCode) seq
from RTLBarCode t
) d
unpivot(value for col in (barcode)) unpiv
) src
pivot (max(value) for col in (barcode1, barcode2)) piv
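A side note on the corrected query (my observation, not part of the original answer): the ROW_NUMBER() window orders by StockItemCode, which is constant within each partition, so which barcode ends up as barcode1 and which as barcode2 is effectively arbitrary. If a stable assignment matters, ordering by the barcode itself pins it down, e.g. in the innermost select:
select
(select min(StockItemCode) from RTLBarCode t2 where t.StockItemCode = t2.StockItemCode) StockItemCode,
cast(BarCode as varchar(20)) barcode,
row_number() over(partition by StockItemCode order by BarCode) seq  -- order by the barcode for a deterministic seq
from RTLBarCode t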

SQL query for SQL Server Compact Edition 3.5 - GROUP BY issue

SELECT BabyInformation.* , t1.*
FROM BabyInformation
LEFT JOIN
(SELECT * FROM BabyData
GROUP BY BabyID
ORDER By Date DESC ) AS t1 ON BabyInformation.BabyID=t1.BabyID
This is my query. I want to get the single most recent BabyData row per baby, based on Date.
BabyInformation should be left joined with BabyData, but with one row per baby.
I tried TOP(1), but that worked only for the first baby.
Here is one way to do it; there are other ways which can be faster, but I believe this one is the clearest for a beginner.
SELECT BabyInformation.*, BabyData.*
FROM BabyInformation
JOIN
(SELECT BabyID, Max(Date) as maxDate FROM BabyData
GROUP BY BabyID
) AS t1
ON BabyInformation.BabyID=t1.BabyID
Join BabyData ON BabyData.BabyID = t1.BabyID and BabyData.Date = t1.maxDate
This should do it:
SELECT bi.* , bd.*
FROM BabyInformation [bi]
LEFT JOIN BabyData [bd]
on bd.BabyDataId = (select top 1 sub.BabyDataId from BabyData [sub] where sub.BabyId = bi.BabyId order by sub.Date desc)
I've assumed that there is a column called 'BabyDataId' in the BabyData table.
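A follow-up sketch (mine, not from either answer): the first answer uses inner joins, which drops babies that have no BabyData rows, while the question asks for a left join. Assuming SQL Server CE accepts derived tables in the FROM clause (which the first answer already relies on), the same max-date technique can be kept while preserving all babies:
SELECT bi.*, bd.*
FROM BabyInformation bi
LEFT JOIN (SELECT BabyID, MAX(Date) AS maxDate
           FROM BabyData
           GROUP BY BabyID) t1 ON bi.BabyID = t1.BabyID
LEFT JOIN BabyData bd ON bd.BabyID = t1.BabyID AND bd.Date = t1.maxDate
As with the original answer, ties on the maximum Date would still produce multiple rows per baby.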
