Presto: is MAX_BY() deterministic?

Is the function MAX_BY() deterministic?
If I use MAX_BY() for two different columns, both depending on a third one, will I get values from the same row for both?
The Presto documentation doesn't mention this.
This documentation about MySQL mentions that it is not, so I'm not sure where to find this info for Presto.
I tested quickly with the following:
WITH my_table(id, arr, something) AS (
    VALUES
        (1, ARRAY['one'], 0.0),
        (2, ARRAY['two'], 0.0),
        (3, ARRAY['three'], 0.0),
        (4, ARRAY['four'], 0.0),
        (5, ARRAY['five'], 0.0),
        (6, ARRAY[''], 0.0)
)
SELECT
    MAX_BY(id, something),
    MAX_BY(arr, something)
FROM my_table
It returned the first row, so the behavior doesn't feel arbitrary, but that doesn't prove anything.
Anyone out there able to help?
There is a related question about returning multiple columns from a single MAX_BY(), so I'm thinking I need to use that solution to guarantee that attributes of the same row are selected:
max_by with multiple return columns

No, in the case of ties, the result of max_by and min_by is arbitrary. It may appear to be deterministic, but that's not defined behavior and may change at some point.
If you want all the values to be consistent, you have to use the trick you referred to, where you pack all the columns of interest into a single value of type ROW:
SELECT max_by((x1, x2, x3), y) r
FROM (...) t(y, x1, x2, x3)
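Applied to the example table from the question, a minimal sketch might look like this (the CAST to a named ROW type and the field names are assumptions for illustration; the exact field-access syntax can vary between Presto/Trino versions):
SELECT
    MAX_BY(
        CAST(ROW(id, arr) AS ROW(id INTEGER, arr ARRAY(VARCHAR))),
        something
    ) AS r
FROM my_table
-- r.id and r.arr are now guaranteed to come from the same row.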

It is probably safer, and more efficient as well, to use window functions:
select *
from (
    select t.*, row_number() over (order by something desc) rn
    from my_table t
) t
where rn = 1
For this simple case, a row-limiting clause is actually good enough:
select *
from my_table
order by something desc
limit 1
Both queries guarantee that the returned values all belong to the same row.
Neither, however, is deterministic, in the sense that consecutive executions of the same query might return a different row. If you want a stable result, you need a column (or a set of columns) that uniquely identifies each row: adding id to the order by clause would be just fine here.
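For instance, a minimal sketch of the stable variant, using id from the example table as the tiebreaker:
select *
from my_table
order by something desc, id
limit 1
Since id is unique, the full sort order is unambiguous, so repeated executions return the same row.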

Related

drop_duplicates after unionByName

I am trying to stack two dataframes (with unionByName()) and then drop duplicate entries (with drop_duplicates()).
Can I trust that unionByName() will preserve the order of the rows, i.e., that df1.unionByName(df2) will always produce a dataframe whose first N rows are df1's? Because, if so, when applying drop_duplicates(), df1's rows would always be preserved, which is the behaviour I want.
unionByName() will not guarantee that the records from df1 come before those from df2. These are distributed, parallel tasks, so you definitely can't build on that.
One solution is to add a technical priority column to each DataFrame, then unionByName() them and use the row_number() analytical function to rank rows by priority within each ID, keeping the one with the higher priority (below, 1 ranks higher than 2).
Take a look at the Scala code below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

// Tag each DataFrame with a priority (1 ranks higher than 2)
val df1WithPriority = df1.withColumn("priority", lit(1))
val df2WithPriority = df2.withColumn("priority", lit(2))

df1WithPriority
  .unionByName(df2WithPriority)
  // Rank the rows within each ID by priority
  .withColumn(
    "row_num",
    row_number().over(Window.partitionBy("ID").orderBy(col("priority").asc))
  )
  // Keep only the highest-priority row per ID
  .where(col("row_num") === lit(1))

Reduce results to first match for each pattern with spark sql

I have a spark sql query, where I have to search for multiple identifiers:
SELECT * FROM my_table WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
Now I get hundreds of results for each of these matches, but I am only interested in the first match for each identifier, i.e. one row with identifier == 'abc', one where identifier == 'cde', and so on.
What is the best way to reduce my result to only the first row for each match?
The best approach certainly depends a bit on your data and also on what you mean by first. Is that any random row that happens to be returned first? Or first by some particular sort order?
A general, flexible approach is to use window functions: row_number() lets you easily filter for the first row per window.
SELECT * FROM (
    SELECT *, row_number() OVER (PARTITION BY identifier ORDER BY ???) AS row_num
    FROM my_table
    WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
) tmp
WHERE row_num = 1
Aggregations like first or max_by are often more efficient, though they quickly become inconvenient when dealing with lots of columns.
You can use the first() aggregation function (after grouping by identifier) to only get the first row in each group.
But I don't think you'll be able to select * with this approach. Instead, you can list every individual column you want to get:
SELECT identifier, first(col1), first(col2), first(col3), ...
FROM my_table
WHERE identifier IN ('abc', 'cde', 'efg', 'ghi')
GROUP BY identifier
Another approach would be to fire a query for each identifier value with a limit of 1 and then union all the results.
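A rough sketch of that approach, reusing the table from the question (the parentheses keep each LIMIT scoped to its own sub-select rather than to the whole union):
(SELECT * FROM my_table WHERE identifier = 'abc' LIMIT 1)
UNION ALL
(SELECT * FROM my_table WHERE identifier = 'cde' LIMIT 1)
-- ... and so on for 'efg' and 'ghi'
Note that "first" here again means an arbitrary row unless each sub-select also gets an ORDER BY.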
With the DataFrame API, you can run your original query and then call .dropDuplicates(["identifier"]) on the result to keep only a single row for each identifier value.

how to sort field with char and decimal value

I am stuck with a scenario where I have to sort
'a-2.3',
'a-1.1' and
'a-1.02'.
How do we do this using JPQL in Spring Data JPA, or using a SQL query? I would appreciate your personal experience and ideas.
The expected sort is ascending order based on the numerical value after a-.
This will depend on the database you're using, but e.g. in Oracle, you could call TO_NUMBER(SUBSTR(col, 3)) with SQL:
WITH t (col) AS (
    SELECT 'a-2.3' FROM dual UNION ALL
    SELECT 'a-1.1' FROM dual UNION ALL
    SELECT 'a-1.02' FROM dual
)
SELECT col
FROM t
ORDER BY to_number(substr(col, 3))
This yields:
a-1.02
a-1.1
a-2.3
Of course, you'll have to adapt the parsing in case your prefix isn't always exactly a-, but something dynamic.
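For instance, a hedged sketch using Oracle's REGEXP_SUBSTR to extract the first numeric token regardless of the prefix (assuming NLS settings where the decimal separator is a point):
SELECT col
FROM t
-- Pull out the first run of digits, optionally with a decimal part:
ORDER BY to_number(regexp_substr(col, '[0-9]+(\.[0-9]+)?'))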
In JPQL, this could be feasible: CAST(SUBSTRING(col, 3, LENGTH(col) - 2) AS NUMBER)

How to make the query work?

I have Cassandra version 2.0, and I am totally new to it, so here is the question...
I have table T1, with columns named 1, 2, 3...14 (for simplicity);
the partition key is columns 1 and 2;
the clustering key is columns 3, 1 and 5.
I need to perform the following query:
SELECT 1,2,7 FROM T1 where 2='A';
Column 2 is a flag, so values are repeating.
I get the following error:
Unable to execute CQL query: Partitioning column 2 cannot be restricted because the preceding column 1 is either not restricted or is restricted by a non-EQ relation
So what is the right way to do it? I really need to get the data already filtered. Thanks.
So, to make sure I understand your schema, you have defined a table T1:
CREATE TABLE T1 (
    1 INT,
    2 INT,
    3 INT,
    ...
    14 INT,
    PRIMARY KEY ((1, 2), 3, 1, 5)
);
Correct?
If this is the case, then Cassandra cannot find the data to answer your CQL query:
SELECT 1,2,7 FROM T1 where 2 = 'A';
because your query has not provided a value for column "1", without which Cassandra cannot compute the partition key (which, per your composite PRIMARY KEY definition, requires both columns "1" and "2"). Without that, it cannot determine which nodes in the ring to read from. By including "2" in your partition key, you are telling Cassandra that that data is required to determine where to store (and thus, where to read) that data.
For example, given your schema, this query should work:
SELECT 7 FROM T1 WHERE 1 = 'X' AND 2 = 'A';
since you are providing both values of your partition key.
@Caleb Rockcliffe has good advice, though, regarding the need for other, secondary/supplemental lookup mechanisms if the above table definition is a big part of your workload. You may need to find some way to first look up the values for "1" and "2" and then issue your query. E.g.:
CREATE TABLE T1_index (
    1 INT,
    2 INT,
    PRIMARY KEY (1, 2)
);
Given a value for "1", the above will provide all of the possible "2" values, through which you can then iterate:
SELECT 2 FROM T1_index WHERE 1 = 'X';
And then, for each "1" and "2" combination, you can then issue your query against table T1:
SELECT 7 FROM T1 WHERE 1 = 'X' AND 2 = 'A';
Hope this helps!
Your WHERE clause needs to restrict every column of the partition key, starting with the first.

PostgreSQL - Returning the results of multiple arbitrary sub-queries

As the title of the question suggests, I'm attempting to take a number of arbitrary sub-queries and combine them into a single, large query.
Ideally, I'd like the data to be returned as a single record, with each column holding the result of one of the sub-queries. E.g.
| sub-query 1 | sub-query 2 | ...
|-----------------|-----------------|-----------------
| (array of rows) | (array of rows) | ...
The sub-queries themselves are built using Knex.js in a Node app and are completely arbitrary. I've come fairly close to a proper solution, but I've hit a snag.
My current implementation has the final query like so:
SELECT
    array_agg(sub0.*) as s0,
    array_agg(sub1.*) as s1,
    ...
FROM
    (...) as sub0,
    (...) as sub1,
    ...
;
This mostly works, but causes huge numbers of duplicates in the output. During my testing, I found that each record is duplicated a number of times equal to how many records would have been returned without the duplicates. For example, a sub-query that should return 10 records would, instead, return 100 (each record being duplicated 10 times).
I've yet to figure out why this occurs or how to fix the query to avoid the issue.
So far, I've only been able to determine that:
The number of records returned by the sub-queries is correct when queried separately
The duplicates are not caused by intersections between the sub-queries (i.e., by sub-queries containing rows that also exist in other sub-queries)
Thanks in advance.
Just place the arbitrary queries in the select list:
with sq1 as (
    values (1, 'x'), (2, 'y')
), sq2 as (
    values ('a', 3), ('b', 4), ('c', 5)
)
select
    (select array_agg(s.*) from (select * from sq1) s) as s0,
    (select array_agg(s.*) from (select * from sq2) s) as s1
;
s0 | s1
-------------------+---------------------------
{"(1,x)","(2,y)"} | {"(a,3)","(b,4)","(c,5)"}
Also note that the comma-separated FROM list in your original query is an implicit cross join, which is where the duplicates come from: every row of each sub-query gets paired with every row of the others. Instead, you can add a row_number to the sub-queries and use that column to outer join the tables:
SELECT
    array_agg(sub0.*) as s0,
    array_agg(sub1.*) as s1
FROM
    (SELECT row_number() OVER (), * FROM (VALUES (1, 'x'), (2, 'y')) t) as sub0
    FULL OUTER JOIN
    (SELECT row_number() OVER (), * FROM (VALUES ('a', 3), ('b', 4), ('c', 5)) t1) as sub1
    ON sub0.row_number = sub1.row_number
;
s0 | s1
----------------------------+---------------------------------
{"(1,1,x)","(2,2,y)",NULL} | {"(1,a,3)","(2,b,4)","(3,c,5)"}
(1 row)
