How "stable" is monotonically_increasing_id() in Spark? - apache-spark

I'm looking for an inexpensive way to distinguish duplicates and/or uniquely identify rows. I've been looking at the Spark built-ins monotonically_increasing_id() and uuid().
The problem with uuid() is that it does not retain its value and seems to be evaluated on the spot. For example
with uuids as (select uuid() as uuid)
select * from uuids join uuids
produces two different UUIDs.
If I use monotonically_increasing_id(), I get two identical values, but can I trust that to always work? In other words, if I have a CTE, where I have an id column generated by monotonically_increasing_id(), will any later rows derived from a row from that CTE have a consistent value of the id column within the same query?
In pseudo-SQL:
with /* ... */
with_ids as (select monotonically_increasing_id() as id, * from /* ... */),
/* ... */
derived_a as (/* Somehow derived from with_ids */),
derived_b as (/* Somehow derived from with_ids */)
select
(a.id == b.id) as are_same,
(a.id != b.id) as are_different
from derived_a as a
join derived_b as b
Will rows derived from the exact same rows of with_ids have are_same == true? Is it guaranteed that if the original rows were different, then are_different == true? The former is definitely false for uuid().
[Updated] Another example, involving a join and group by:
with
with_ids as (
select
monotonically_increasing_id() as id
,*
from table_a)
joined as (
select struct(a.*) as packed_a, a.id
from with_ids as a
left join table_b as b
on /* whatever */
)
select collect_set(packed_a) as should_be_singular
from joined
group by id
Is the row count in the above equal to the number of rows in table_a, and is should_be_singular a single-element array?
The documentation for both functions states that they are non-deterministic, but it doesn't really offer any details on when the functions are evaluated or how they should be used.
The issue seems to be mentioned in SPARK-14241 and this question, but it's not clear whether, and under what conditions, monotonically_increasing_id() is consistent within a single SQL statement.

From my past experience working with row identifiers (uuid, row_number, or monotonically_increasing_id), I cache the DataFrame.
Then every subsequent calculation using the DataFrame will have static row identifiers.
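A minimal sketch of that idea in Spark SQL, using CACHE TABLE as the SQL counterpart of calling .cache() on the DataFrame (some_table is a placeholder name; caching is best-effort, so evicted partitions can in principle be recomputed):
-- Materialize the ids once; later queries reuse the cached rows
CACHE TABLE with_ids AS
SELECT monotonically_increasing_id() AS id, *
FROM some_table;
-- Derived queries now see the same id for the same source row
SELECT a.id, b.id
FROM with_ids AS a
JOIN with_ids AS b
ON a.id = b.id;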

Related

Reduce results to first match for each pattern with spark sql

I have a spark sql query, where I have to search for multiple identifiers:
SELECT * FROM my_table WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
Now I get hundreds of results for each of these matches, but I am only interested in the first match for each identifier, i.e. one row where identifier == 'abc', one where identifier == 'cde', and so on.
What is the best way to reduce my result to only the first row for each match?
The best approach certainly depends a bit on your data and also on what you mean by first. Is that any random row that happens to be returned first? Or first by some particular sort order?
A general flexible approach is using window functions. row_number() allows you to easily filter for the first row by window.
SELECT * FROM (
SELECT *, row_number() OVER (PARTITION BY identifier ORDER BY ???) as row_num
FROM my_table
WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')) tmp
WHERE
row_num = 1
That said, aggregations like first or max_by are often more efficient, but they quickly become inconvenient when dealing with lots of columns.
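For illustration, a hedged sketch with min_by/max_by (Spark 3.0+), assuming a hypothetical created_at column defines what "first" means; every column needs its own call, which is where this becomes inconvenient on wide tables:
SELECT
  identifier,
  min_by(col1, created_at) AS col1,  -- value of col1 on the earliest row per identifier
  min_by(col2, created_at) AS col2
FROM my_table
WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
GROUP BY identifier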
You can use the first() aggregation function (after grouping by identifier) to only get the first row in each group.
But I don't think you'll be able to select * with this approach. Instead, you can list every individual column you want to get:
SELECT identifier, first(col1), first(col2), first(col3), ...
FROM my_table
WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
GROUP BY identifier
Another approach would be to fire a query for each identifier value with a limit of 1 and then union all the results.
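A rough sketch of that approach (one branch per identifier, each limited to a single row; without an ORDER BY inside the branches, which row comes back is arbitrary):
SELECT * FROM (SELECT * FROM my_table WHERE identifier = 'abc' LIMIT 1) a
UNION ALL
SELECT * FROM (SELECT * FROM my_table WHERE identifier = 'cde' LIMIT 1) b
UNION ALL
SELECT * FROM (SELECT * FROM my_table WHERE identifier = 'efg' LIMIT 1) c
UNION ALL
SELECT * FROM (SELECT * FROM my_table WHERE identifier = 'ghi' LIMIT 1) d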
With the DataFrame API, you can use your original query and then use .dropDuplicates(["identifier"]) on the result to only keep a single row for each identifier value.

How to avoid key column name duplication in join?

I'm trying to join two tables in Spark SQL. Each table has 50+ columns, and both have a column id as the key.
spark.sql("select * from tbl1 join tbl2 on tbl1.id = tbl2.id")
The joined table has a duplicated id column.
We can of course specify which id column to keep like below:
spark.sql("select tbl1.id, .....from tbl1 join tbl2 on tbl1.id = tbl2.id")
But since we have so many columns in both tables, I do not want to type out all the other column names in the query above. (Other than the id column, there are no duplicated column names.)
What should I do? Thanks.
If id is the only column name in common, you can take advantage of the USING clause:
spark.sql("select * from tbl1 join tbl2 using (id) ")
The using clause matches columns that have the same name in both tables. When using select *, the column appears only once.
Assuming you want to preserve the "duplicates", you can use the internal row id or an equivalent to help. This helped me in the past when I had to delete exactly one of two identical rows.
select *, ctid from table;
In PostgreSQL this also outputs the internal counter id, so rows that were previously identical become distinguishable. I don't know about spark.sql, but I assume you can access a similar attribute there.
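In Spark there is no ctid, but a rough stand-in (the subject of the first question above, with the stability caveats discussed there) is monotonically_increasing_id(), which at least makes previously identical rows distinguishable within a query:
SELECT *, monotonically_increasing_id() AS row_id
FROM tbl1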
val joined = spark
.sql("select * from tbl1")
.join(
spark.sql("select * from tbl2"),
Seq("id"),
"inner" // optional
)
joined should have only one id column. Tested with Spark 2.4.8

Correct way to get the last value for a field in Apache Spark or Databricks Using SQL (Correct behavior of last and last_value)?

What is the correct behavior of the last and last_value functions in Apache Spark/Databricks SQL? The way I'm reading the documentation (here: https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html) it sounds like they should return the last value of whatever is in the expression.
So if I have a select statement that does something like
select
person,
last(team)
from
(select * from person_team order by date_joined)
group by person
I should get the last team a person joined, yes/no?
The actual query I'm running is shown below. It is returning a different number each time I execute the query.
select count(distinct patient_id) from (
select
patient_id,
org_patient_id,
last_value(data_lot) data_lot
from
(select * from my_table order by data_lot)
where 1=1
and org = 'my_org'
group by 1,2
order by 1,2
)
where data_lot in ('2021-01','2021-02')
;
What is the correct way to get the last value for a given field (for either the team example or my specific example)?
--- EDIT -------------------
I'm thinking collect_set might be useful here, but I get the error shown when I try to run this:
select
patient_id,
last_value(collect_set(data_lot)) data_lot
from
covid.demo
group by patient_id
;
Error in SQL statement: AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
Aggregate [patient_id#89338], [patient_id#89338, last_value(collect_set(data_lot#89342, 0, 0), false) AS data_lot#91848]
+- SubqueryAlias spark_catalog.covid.demo
The posts shown below discuss how to get max values, which is not the same as the last value in a list ordered by a different field: I want the last team a player joined. The player may have joined the Reds, the A's, the Zebras, and the Yankees, in that order timewise, and I'm looking for the Yankees. Those posts also get to the solution procedurally using Python/R, and I'd like to do this in SQL.
Getting last value of group in Spark
Find maximum row per group in Spark DataFrame
--- SECOND EDIT -------------------
I ended up using something like this based upon the accepted answer.
select
row_number() over (order by provided_date, data_lot) as row_num,
demo.*
from demo
You can assign row numbers based on an ordering on data_lot if you want to get its last value:
select count(distinct patient_id) from (
select * from (
select *,
row_number() over (partition by patient_id, org_patient_id, org order by data_lot desc) as rn
from my_table
where org = 'my_org'
)
where rn = 1
)
where data_lot in ('2021-01','2021-02');
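Since the ordering in the original query is on data_lot itself, the "last" value per group is simply the greatest one, so under that assumption a plain max() aggregate is a deterministic equivalent:
select count(distinct patient_id) from (
  select patient_id, org_patient_id, max(data_lot) as data_lot
  from my_table
  where org = 'my_org'
  group by patient_id, org_patient_id
)
where data_lot in ('2021-01','2021-02');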

MsAccess Delete all values but one in column by condition

I have a table in MS ACCESS 2013 that looks like this:
Id    Department    Status       FollowingDept  ActualArea
1000  Thinkerers    Thinking     Thinkerer      Thinkerer
1000  Drawers       OnDrawBoard  Drawers        Drawers
1000  MaterialPlan  To Plan      MaterialPlan   MaterialPlan
1000  Painters      MatNeeded    MaterialPlan
1000  Builders      DrawsNeeded  Drawers
The table follows an ID which has to pass through five departments, each department with at least 5 different statuses.
Each status has a FollowingDept value; for example, Department Thinkerers has the status MoreCoffeeNow, which means the FollowingDept is Drawers.
All columns except ActualArea get their values from the feed of a query.
ActualArea is an Expr where I inserted this logic:
Iif(FollowingDept = Department, FollowingDept, "")
My logic is simple: if FollowingDept and Department coincide, then the ID's ActualArea gets the value from FollowingDept.
But as you can see, there can be rare cases like my example above, where 3 departments coincide with the FollowingDept. These cases are rare, but I would like to add something like a priority to Access.
Thinkerers has the top priority, then MaterialPlan, then Drawers, then Builders, and lastly Painters. So, following the same example, after ActualArea gets 3 values, Access will execute another query or subquery or whatever, where it will evaluate each value's priority and leave behind only the one with the top priority. So in this example, Thinkerers gets the top priority, and the other two values are eliminated from the ActualArea column.
Please keep in mind there are over 500 different IDs, and each ID is repeated 5 times, so the total number of records to evaluate will be 2500.
You need another table, with the possible values for actualArea and the priorities as numbers, and then you can select with a JOIN and order on the priority:
SELECT TOP 1 d.*, p.priority
FROM departments d
LEFT JOIN priorities p ON d.actualArea = p.dept
WHERE d.id = 1000
AND p.priority IS NOT NULL
ORDER BY p.priority ASC
The IS NOT NULL clause eliminates all of the rows where actualArea is empty. The TOP condition leaves only the row with the top priority.
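Using the priority order from the question, the priorities table could simply contain the department names and their rank (the dept values must match whatever ends up in actualArea):
dept          priority
Thinkerers    1
MaterialPlan  2
Drawers       3
Builders      4
Painters      5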
You don't seem to have a primary key for your table. If you don't, then I'll give another query in a minute, but I would strongly advise you to go back and add a primary key to the table. It will save you an incredible amount of headache later. I did add such a key to my test table, and it's called pID. This query makes use of that pID to remove the records you need:
DELETE FROM departments WHERE pID NOT IN (
SELECT TOP 1 d.pID
FROM departments d
LEFT JOIN priorities p ON d.actualArea = p.dept
WHERE id = 1000
AND p.priority IS NOT NULL
ORDER BY p.priority ASC
)
If you can't add a primary key to the data and actualArea is assumed to be unique, then you can just use the actualArea values to perform the delete:
DELETE FROM departments WHERE actualArea NOT IN (
SELECT TOP 1 d.actualArea
FROM departments d
LEFT JOIN priorities p ON d.actualArea = p.dept
WHERE id = 1000
AND p.priority IS NOT NULL
ORDER BY p.priority ASC
) AND id = 1000
If actualArea is not going to be unique, then we'll need to revisit this answer. This answer also assumes that you already have the id number.

How to debug "Each GROUP BY expression must contain at least one column that is not an outer reference error"

Since SSRS doesn't allow filters on aggregates, I found some code which helped me come up with the below query. However, when I run it I get:
Each GROUP BY expression must contain at least one column that is not an outer reference
I have searched everywhere but can't find how to fix this. I've even removed the two extra tables from the query so there were no joins at all. I need to not return any order where the total of the lines on the order is less than $500 and greater than 0.
SELECT
tdsls041_sales_order_lines.company,
tdsls041_sales_order_lines.order_number,
tdsls041_sales_order_lines.amount,
tdsls041_sales_order_lines.item,
tdsls041_sales_order_lines.container
FROM
tdsls041_sales_order_lines AS tdsls041_sales_order_lines
WHERE
(tdsls041_sales_order_lines.company = 610) AND
(tdsls041_sales_order_lines.order_number IN
(SELECT
tdsls041_sales_order_lines.order_number
FROM
tdsls041_sales_order_lines AS tdsls041_sales_order_lines_1
GROUP BY
tdsls041_sales_order_lines.order_number
HAVING
(SUM(tdsls041_sales_order_lines.amount) <= 500) OR
SUM(tdsls041_sales_order_lines.amount) > 0))
The issue that SQL Server is complaining about is that the grouping wants an aggregate function in the SELECT statement. Unfortunately, you want to use IN, which needs a list of Order Numbers.
You just need to add an aggregate function to your subquery and then add another layer to select just the Order Numbers from that.
SELECT T1.company, T1.order_number, T1.amount, T1.item, T1.container
FROM tdsls041_sales_order_lines AS T1
WHERE (T1.company = 610) AND (T1.order_number IN
(SELECT order_number FROM
(SELECT TSOL.order_number, SUM(TSOL.amount) AS TTL
FROM tdsls041_sales_order_lines AS TSOL
GROUP BY TSOL.order_number
HAVING (SUM(TSOL.amount) <= 500) OR
SUM(TSOL.amount) > 0) AS T2) )
You can filter on aggregates in Charts and Tables. You have to put the aggregate filter on your GROUP instead of on the table itself (Group Properties -> Filters tab).
