Correct way to get the last value for a field in Apache Spark or Databricks using SQL (correct behavior of last and last_value)?

What is the correct behavior of the last and last_value functions in Apache Spark/Databricks SQL? The way I'm reading the documentation (here: https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html), it sounds like they should return the last value of whatever is in the expression.
So if I have a select statement that does something like
select
    person,
    last(team)
from
    (select * from person_team order by date_joined)
group by person
I should get the last team a person joined, yes/no?
The actual query I'm running is shown below. It is returning a different number each time I execute the query.
select count(distinct patient_id) from (
    select
        patient_id,
        org_patient_id,
        last_value(data_lot) data_lot
    from
        (select * from my_table order by data_lot)
    where 1=1
        and org = 'my_org'
    group by 1,2
    order by 1,2
)
where data_lot in ('2021-01','2021-02');
What is the correct way to get the last value for a given field (for either the team example or my specific example)?
--- EDIT -------------------
I'm thinking collect_set might be useful here, but I get the error shown when I try to run this:
select
    patient_id,
    last_value(collect_set(data_lot)) data_lot
from
    covid.demo
group by patient_id;
Error in SQL statement: AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
Aggregate [patient_id#89338], [patient_id#89338, last_value(collect_set(data_lot#89342, 0, 0), false) AS data_lot#91848]
+- SubqueryAlias spark_catalog.covid.demo
The posts shown below discuss how to get max values, which is not the same as the last value in a list ordered by a different field. I want the last team a player joined: if the player joined the Reds, the A's, the Zebras, and the Yankees, in that order timewise, I'm looking for the Yankees. Those posts also arrive at the solution procedurally using Python/R; I'd like to do this in SQL.
Getting last value of group in Spark
Find maximum row per group in Spark DataFrame
--- SECOND EDIT -------------------
I ended up using something like this based upon the accepted answer.
select
row_number() over (order by provided_date, data_lot) as row_num,
demo.*
from demo

You can assign row numbers based on an ordering on data_lot if you want to get its last value:
select count(distinct patient_id) from (
    select * from (
        select *,
            row_number() over (partition by patient_id, org_patient_id, org order by data_lot desc) as rn
        from my_table
        where org = 'my_org'
    )
    where rn = 1
)
where data_lot in ('2021-01','2021-02');
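
For completeness: if you want last_value itself, note that as a window function its default frame ends at the current row, so without an explicit frame each row just gets its own value back. A minimal sketch, reusing the table and column names from the question:
select distinct
    patient_id,
    org_patient_id,
    last_value(data_lot) over (
        partition by patient_id, org_patient_id
        order by data_lot
        -- the explicit frame is required; the default frame stops at the current row
        rows between unbounded preceding and unbounded following
    ) as data_lot
from my_table
where org = 'my_org';
As an aggregate (the way the question uses it), last/last_value is documented as non-deterministic because row order after a shuffle is not guaranteed, which is why the count changes between runs.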

Related

How "stable" is monotonically_increasing_id() in Spark?

I'm looking for an inexpensive way to distinguish duplicates and/or uniquely identify rows. I've been looking at the Spark built-ins monotonically_increasing_id() and uuid().
The problem with uuid() is that it does not retain its value and seems to be evaluated on the spot. For example
with uuids as (select uuid() as uuid)
select * from uuids join uuids
produces two different UUIDs.
If I use monotonically_increasing_id(), I get two identical values, but can I trust that to always work? In other words, if I have a CTE, where I have an id column generated by monotonically_increasing_id(), will any later rows derived from a row from that CTE have a consistent value of the id column within the same query?
In pseudo-SQL:
with /* ... */
with_ids as (select monotonically_increasing_id() as id, * from /* ... */),
/* ... */
derived_a as (/* Somehow derived from with_ids */),
derived_b as (/* Somehow derived from with_ids */)
select
(a.id == b.id) as are_same,
(a.id != b.id) as are_different
from derived_a as a
join derived_b as b
Will rows derived from the exact same rows of with_ids have are_same == true? Is it guaranteed that if the original rows were different, then are_different == true? The former is definitely false for uuid().
[Updated] Another example, involving a join and group by:
with
with_ids as (
select
monotonically_increasing_id() as id
,*
from table_a)
joined as (
select struct(a.*) as packed_a, a.id
from with_ids as a
left join table_b as b
on /* whatever */
)
select collect_set(packed_a) as should_be_singular
from joined
group by id
Is the row count in the above equal to the number of rows in table_a and is should_be_singular a single element array?
The documentation for both functions states that they are non-deterministic, but doesn't really offer any details on when the functions are evaluated or how they should be used.
The issue seems to be mentioned in SPARK-14241 and this question, but it's not clear if and under what conditions monotonically_increasing_id() is consistent within a single SQL statement.
From my past experience working with row identifiers (uuid, row_number, or monotonically_increasing_id), I cache the DataFrame.
Every subsequent calculation using the cached DataFrame will then see stable row identifiers.
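A minimal sketch of that approach in Spark SQL, reusing table_a from the question; materializing the ids once means later references cannot re-evaluate them:
-- Cache the id-generating query so its result is computed once.
CACHE TABLE with_ids AS
SELECT monotonically_increasing_id() AS id, *
FROM table_a;
-- Both sides of the self-join now read the same cached rows,
-- so ids derived from the same source row always match.
SELECT a.id = b.id AS are_same
FROM with_ids AS a
JOIN with_ids AS b ON a.id = b.id;
Bear in mind the cache is best-effort: if cached blocks are evicted and recomputed, the ids can change, so this reduces rather than eliminates the risk.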

Getting records based on latest date [duplicate]

This question already has answers here:
How to select the first row of each group?
I'm quite new to SQL and I'm trying to filter the latest date record (DateTime column) for each unique ID present in the table.
Sample data: there are 2 unique IDs (16512) and (76513).
DateTime                    | ID    | Notes
----------------------------+-------+--------------------------
2021-03-26T10:39:54.9770238 | 16512 | Still a work in Progress
2021-04-29T12:46:12.8277807 | 16512 | Still working on it
2021-03-21T10:39:54.9770238 | 76513 | Still a work in Progress
2021-04-20T12:46:12.8277800 | 76513 | Still working on project
Desired result (get last row of each ID based on the DateTime column):
DateTime                    | ID    | Notes
----------------------------+-------+--------------------------
2021-04-29T12:46:12.8277807 | 16512 | Still working on it
2021-04-20T12:46:12.8277800 | 76513 | Still working on project
My query:
SELECT MAX(DateTime), ID
FROM Table1
GROUP BY DateTime, ID
Thanks in advance for your help.
SELECT max(DateTime), ID
FROM Table1
GROUP BY ID
You can use row_number here:
with d as (
    select *, row_number() over (partition by Id order by DateTime desc) as rn
    from Table1
)
select DateTime, Id, Notes
from d
where rn = 1;
You didn't state a particular database, but if you are using Postgres you can use its DISTINCT ON, which is often the fastest solution if your groups are not too big (in your case, the number of tasks sharing the same id).
Here's an example. Note that I've excluded your Notes column for brevity, but the query works if you include it and will give you the output you want.
create temporary table tasks (
    id int,
    created_at date
);
insert into tasks(id, created_at) values
(16512, '2021-03-26'),
(16512, '2021-04-29'),
(76513, '2021-03-21'),
(76513, '2021-04-20')
;
select
distinct on (id)
id,
created_at
from tasks
order by id, created_at desc
/*
id | created_at
-------+------------
16512 | 2021-04-29
76513 | 2021-04-20
*/
The row_number mentioned above is one method of solving your problem. You tagged databricks in your question, so let me show you another option that you can implement with Spark SQL, using the last function from the aggregate functions pool.
In reference to the Spark documentation:
last(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If isIgnoreNull is true, returns only non-null values.
Note that:
The function is non-deterministic because its results depends on the order of the rows which may be non-deterministic after a shuffle.
In your example:
%sql
WITH cte AS (
SELECT *
FROM my_table
ORDER BY DateTime asc
)
SELECT Id, last(DateTime) AS DateTime, last(Notes) as Notes
FROM cte
GROUP BY Id
Similarly, you can use the first function to obtain the first record in a sorted dataset.
Check if that works for you.
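Note that this relies on the CTE's ORDER BY surviving the aggregation, which Spark does not guarantee (see the non-determinism note above). A sketch of a deterministic alternative, assuming Spark 3.0+ where max_by is available:
-- max_by(Notes, DateTime) returns the Notes value from the row
-- holding the maximum DateTime within each group.
SELECT Id, max(DateTime) AS DateTime, max_by(Notes, DateTime) AS Notes
FROM my_table
GROUP BY Id;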

sqlite combine 2 queries from different tables to make one

I recently took to using SQL again; the last time I used it was in Microsoft Access 2000, so please bear with me if I'm a little behind the times.
I have 2 pointless virtual currencies on my discord server for my players to play pointless games with. Both of these currencies' transactions are currently stored in individual tables.
I wish to sum up all the transactions for each player to give them a single current amount for each currency. Individually I can do this:
SELECT
tblPlayers.PlayerID AS PlayerID,
tblPlayers.Name AS Name,
SUM(tblGorillaTears.Amount)
FROM
tblPlayers
INNER JOIN
tblGorillaTears
ON
tblPlayers.PlayerID = tblGorillaTears.PlayerID
GROUP BY
tblPlayers.PlayerID;
and
SELECT
tblPlayers.PlayerID AS PlayerID,
tblPlayers.Name AS Name,
SUM(tblKebabs.Amount)
FROM
tblPlayers
INNER JOIN
tblKebabs
ON
tblPlayers.PlayerID = tblKebabs.PlayerID
GROUP BY
tblPlayers.PlayerID;
What I need is a table that outputs the user name, the ID, and the total for each currency on one row, but when I do this:
SELECT
tblPlayers.PlayerID AS PlayerID,
tblPlayers.Name AS Name,
SUM(tblGorillaTears.Amount) AS GT,
0 as Kebabs
FROM
tblPlayers
INNER JOIN
tblGorillaTears
ON
tblPlayers.PlayerID = tblGorillaTears.PlayerID
GROUP BY
tblPlayers.PlayerID
UNION
SELECT
tblPlayers.PlayerID AS PlayerID,
tblPlayers.Name AS Name,
0 as GP,
SUM(tblKebabs.Amount)
FROM
tblPlayers
INNER JOIN
tblKebabs
ON
tblPlayers.PlayerID = tblKebabs.PlayerID
GROUP BY
tblPlayers.PlayerID;
the results end up as a row for each player for each currency. How can I make it so both currencies appear in the same row?
Previously in MS Access I was able to create two queries and then query those two queries as if they were tables, but I cannot figure out how to do that in this instance. Thanks <3
UNION will add new rows for sure; you can try a query like the following instead.
SELECT TP.playerid AS PlayerID,
TP.NAME AS NAME,
(SELECT Sum(TG.amount)
FROM tblgorillatears TG
WHERE TG.playerid = TP.playerid) AS GT,
(SELECT Sum(TG.amount)
FROM tblkebabs TG
WHERE TG.playerid = TP.playerid) AS Kebabs
FROM tblplayers TP
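Alternatively, here is a sketch of the "query of two queries" pattern you remember from Access: aggregate each currency in its own subquery, then join both onto the players table. COALESCE (my own addition, to cover players with no transactions in one of the tables) turns missing totals into 0.
SELECT
    p.PlayerID,
    p.Name,
    COALESCE(gt.total, 0) AS GT,
    COALESCE(kb.total, 0) AS Kebabs
FROM tblPlayers AS p
-- each subquery plays the role of one of your saved Access queries
LEFT JOIN (SELECT PlayerID, SUM(Amount) AS total
           FROM tblGorillaTears GROUP BY PlayerID) AS gt
    ON gt.PlayerID = p.PlayerID
LEFT JOIN (SELECT PlayerID, SUM(Amount) AS total
           FROM tblKebabs GROUP BY PlayerID) AS kb
    ON kb.PlayerID = p.PlayerID;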

Left join in subquery

I am trying to do a left join on my main table using this code
select distinct VBen.BENF_NO_INDIV_BEN_BANLS as benbanls,
VBen.BENF_COD_SEXE AS Sexe,
VBen.BENF_DAT_NAISS AS DatNaiss,
VBen.BENF_DAT_DECES AS DatDec,
A.date_ch as date_chsld
from PROD.V_FICH_ID_BEN_CM AS VBen
left join (select distinct VAss.BENF_NO_INDIV_BEN_BANLS as benbanls,
vass.BENF_DD_ADMIS_ASSU_MED as date_ch
from Prod.V_ADMIS_ASSU_MED_PLAN_PRIOR_CM as vass ) as A
on VBen.BENF_NO_INDIV_BEN_BANLS =A. benbanls
where Vben.BENF_DAT_NAISS>'2016-04-01' or Vben.BENF_DAT_DECES>'2011-04-01'
The problem is that the query returns more rows than the main table does with the same WHERE condition. I don't understand what I am missing.
Thanks for your help
Why is it a problem?
The results simply indicate you have a 1:M (one-to-many) relationship between VBen and Vass (A).
If you don't have a 1:M relationship and it should be 1:1, then either:
you're missing join criteria between the tables, or
you should be taking a min/max on your date instead of keeping all dates per benbanls.
To better understand and answer, we would need to know what VBen and Vass actually represent; but to put it simply, you have multiple Vass (A) rows per VBen row.
To illustrate with an example, think about Order_Header and Order_Line tables (see the sketch after this paragraph):
Order_Header contains (Order_Number PK)
Order_Line contains (Order_Number, Order_Line PK)
An order can have multiple lines, and each line could have its own ship date: several items may have gone out on the same shipment/day, while some that were backordered went out on a different day. In this situation an order would still have multiple lines even after a DISTINCT on order_number and ship date in a subquery. I would guess your situation is similar.
So 1 row in the base table * 2 rows in the derived/lines table gives us 2 records; 1 < 2, which is the situation you have now, and that to me is perfectly fine and expected for a 1:M relationship.
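A toy sketch of that duplication, using hypothetical tables rather than your schema:
CREATE TABLE order_header (order_number INT);
CREATE TABLE order_line (order_number INT, order_line INT, ship_date DATE);
INSERT INTO order_header VALUES (1);
INSERT INTO order_line VALUES (1, 1, '2021-01-05'), (1, 2, '2021-01-09');
-- One header row joined to two line rows yields two result rows;
-- DISTINCT cannot collapse them because ship_date differs.
SELECT DISTINCT h.order_number, l.ship_date
FROM order_header AS h
LEFT JOIN order_line AS l ON l.order_number = h.order_number;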
Maybe you need to take a min or max on the date instead of a distinct? If not, you're missing the join criteria needed to make a 1:1 relationship, or maybe your expectation is just flawed.
The below will give you a 1:1 relationship but I'm not sure it's what you're after.
SELECT distinct VBen.BENF_NO_INDIV_BEN_BANLS as benbanls,
VBen.BENF_COD_SEXE AS Sexe,
VBen.BENF_DAT_NAISS AS DatNaiss,
VBen.BENF_DAT_DECES AS DatDec,
A.date_ch as date_chsld
FROM PROD.V_FICH_ID_BEN_CM AS VBen
LEFT JOIN (SELECT VAss.BENF_NO_INDIV_BEN_BANLS as benbanls,
Max(vass.BENF_DD_ADMIS_ASSU_MED) as date_ch
FROM Prod.V_ADMIS_ASSU_MED_PLAN_PRIOR_CM as vass
GROUP BY VAss.BENF_NO_INDIV_BEN_BANLS) as A
on VBen.BENF_NO_INDIV_BEN_BANLS = A. benbanls
WHERE (Vben.BENF_DAT_NAISS > '2016-04-01'
    or Vben.BENF_DAT_DECES > '2011-04-01')
It is likely that a record in the main table has more than one counterpart in the detail table.
I tried your scenario on my DB and got a correct result.
In my DB:
select distinct p.PollId as PollId,
p.Title AS Title,
p.InsertDate AS DatDec,
ps.date_ch as date_chsld
from dbo.Poll AS p
left join (select distinct pSt.PollId as pollId,
Max(pSt.InsertDate) as date_ch
from dbo.PollStore as pSt
Group by pSt.PollId ) as ps
on p.PollId =ps.pollId
Applied to your query, please try this:
select distinct VBen.BENF_NO_INDIV_BEN_BANLS as benbanls,
    VBen.BENF_COD_SEXE AS Sexe,
    VBen.BENF_DAT_NAISS AS DatNaiss,
    VBen.BENF_DAT_DECES AS DatDec,
    A.date_ch as date_chsld
from PROD.V_FICH_ID_BEN_CM AS VBen
left join (select VAss.BENF_NO_INDIV_BEN_BANLS as benbanls,
        Max(vass.BENF_DD_ADMIS_ASSU_MED) as date_ch
    from Prod.V_ADMIS_ASSU_MED_PLAN_PRIOR_CM as vass
    group by VAss.BENF_NO_INDIV_BEN_BANLS) as A
on VBen.BENF_NO_INDIV_BEN_BANLS = A.benbanls
where Vben.BENF_DAT_NAISS > '2016-04-01' or Vben.BENF_DAT_DECES > '2011-04-01'

How to debug "Each GROUP BY expression must contain at least one column that is not an outer reference error"

Since SSRS doesn't allow filters on aggregates, I found some code which helped me come up with the below query. However, when I run it I get:
Each GROUP BY expression must contain at least one column that is not an outer reference
I have searched everywhere but can't find how to fix this. I've even removed the two extra tables from the query so there were no joins at all. I need to exclude any order whose line total is less than $500 but greater than $0.
SELECT
tdsls041_sales_order_lines.company,
tdsls041_sales_order_lines.order_number,
tdsls041_sales_order_lines.amount,
tdsls041_sales_order_lines.item,
tdsls041_sales_order_lines.container
FROM
tdsls041_sales_order_lines AS tdsls041_sales_order_lines
WHERE
(tdsls041_sales_order_lines.company = 610) AND
(tdsls041_sales_order_lines.order_number IN
(SELECT
tdsls041_sales_order_lines.order_number
FROM
tdsls041_sales_order_lines AS tdsls041_sales_order_lines_1
GROUP BY
tdsls041_sales_order_lines.order_number
HAVING
(SUM(tdsls041_sales_order_lines.amount) <= 500) OR
SUM(tdsls041_sales_order_lines.amount) > 0))
The error SQL Server is raising arises because every column in your subquery still references the outer alias tdsls041_sales_order_lines rather than the subquery's own alias tdsls041_sales_order_lines_1, so the entire GROUP BY consists of outer references. Since you want to use IN, you need a self-contained list of order numbers.
Reference the inner alias throughout your subquery, and add another layer to select just the order numbers from the grouped result.
SELECT T1.company, T1.order_number, T1.amount, T1.item, T1.container
FROM tdsls041_sales_order_lines AS T1
WHERE (T1.company = 610) AND (T1.order_number IN
(SELECT order_number FROM
(SELECT TSOL.order_number, SUM(TSOL.amount) AS TTL
FROM tdsls041_sales_order_lines AS TSOL
GROUP BY TSOL.order_number
HAVING (SUM(TSOL.amount) <= 500) OR
SUM(TSOL.amount) > 0) AS T2) )
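A sketch of an alternative that avoids the nested grouping entirely: compute the per-order total with a windowed SUM and filter on it. The filter below encodes the stated requirement (keep orders totalling at least $500, or not positive); adjust it if your HAVING logic was intended differently.
SELECT company, order_number, amount, item, container
FROM (
    SELECT t.*,
           -- total of all lines on the same order, repeated on every line
           SUM(t.amount) OVER (PARTITION BY t.order_number) AS order_total
    FROM tdsls041_sales_order_lines AS t
    WHERE t.company = 610
) AS x
WHERE order_total >= 500 OR order_total <= 0;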
You can filter on aggregates in charts and tables. You have to put the aggregate filter on your group instead of on the table itself (Group Properties -> Filters tab).
