Potential optimization for GROUP BY? - apache-spark

Let's say I have three tables with the following structure.
table_1
- user_id BIGINT
- country STRING
table_2
- user_id BIGINT
- gender STRING
table_3
- user_id BIGINT
- age_bucket STRING
Let's assume all three tables are at the user_id level. Now I want to create a new table with the following structure.
output_table
- user_id
- country
- gender
- age_bucket
One approach is table_1 FULL OUTER JOIN table_2 FULL OUTER JOIN table_3. The downside of this approach is that the second join can't start until the join of table_1 and table_2 finishes.
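For concreteness, the join version would look something like this (COALESCE keeps user_id populated whichever side matches):
SELECT
    COALESCE(t1.user_id, t2.user_id, t3.user_id) AS user_id,
    t1.country,
    t2.gender,
    t3.age_bucket
FROM table_1 t1
FULL OUTER JOIN table_2 t2
    ON t1.user_id = t2.user_id
FULL OUTER JOIN table_3 t3
    ON COALESCE(t1.user_id, t2.user_id) = t3.user_id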
Alternatively, we can do something like the following.
SELECT
    user_id,
    MAX(country) AS country,   -- MAX ignores NULLs, so it picks the one non-null value per user
    MAX(gender) AS gender,
    MAX(age_bucket) AS age_bucket
FROM
(
    SELECT user_id, country, NULL AS gender, NULL AS age_bucket FROM table_1
    UNION ALL
    SELECT user_id, NULL AS country, gender, NULL AS age_bucket FROM table_2
    UNION ALL
    SELECT user_id, NULL AS country, NULL AS gender, age_bucket FROM table_3
) v
GROUP BY
    user_id
In the execution plan, this will trigger an exchange across all three tables.
Now, let's say all three tables are bucketed and sorted the same way on user_id. In theory, this UNION ALL query could then be a map-only operation. However, that doesn't appear to be how Spark works today.
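For reference, the layout I have in mind is Spark's native bucketing, something like the following (the file format and bucket count are just illustrative):
CREATE TABLE table_1 (user_id BIGINT, country STRING)
USING PARQUET
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 256 BUCKETS;
-- table_2 and table_3 bucketed and sorted the same way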
My question is whether Spark can do this UNION ALL without an exchange. If not, what would it take to develop such a feature?
Thanks.

Related

Query by Interleaved table fields using Spring Data Spanner

I'm trying to query by a field of an interleaved table using Spring Data Spanner. The id comparison is done automatically by Spring Data Spanner when it performs the ARRAY STRUCT inner join, but I'm not able to add a WHERE clause on the interleaved table to the query.
Considering the example below:
CREATE TABLE Singers (
Id INT64 NOT NULL,
FirstName STRING(1024),
LastName STRING(1024),
SingerInfo BYTES(MAX),
) PRIMARY KEY (Id);
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
Id INT64 NOT NULL,
AlbumTitle STRING(MAX),
) PRIMARY KEY (SingerId, Id),
INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
Let's suppose I want to query all Singers where the AlbumTitle is "Fear of the Dark". How can I write a repository method to achieve that using Spring Data Spanner?
Your example seems to either contain a couple of typos, or it is otherwise not completely correct:
The Singers table has a column Id which is the primary key. That is in itself fine, but when creating a hierarchy of interleaved tables, it is recommended to prefix the primary key column with the table name. So it would be better to name it SingerId.
The Albums table has a SingerId column and an Id column, and these two columns form the primary key of the Albums table. This is technically incorrect (and confusing), and it is also the reason I think your example is not completely correct. Because Albums is interleaved in Singers, Albums must contain the same primary key columns as the Singers table, in addition to any columns that extend the primary key of Albums. As written, Id references the Singers table, and SingerId is an additional column in the Albums table that has nothing to do with the Singers table. The primary key columns of the parent table must also appear in the same order as in the parent table.
The example data model should therefore be changed to:
CREATE TABLE Singers (
SingerId INT64 NOT NULL,
FirstName STRING(1024),
LastName STRING(1024),
SingerInfo BYTES(MAX),
) PRIMARY KEY (SingerId);
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
AlbumTitle STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
From this point on you can treat the SingerId column in the Albums table as a foreign key to a Singer, just as you would in any other database system. Note also that there can be multiple albums per singer, so a query for "all Singers where the AlbumTitle is 'Fear of the Dark'" is slightly ambiguous. I would rather phrase it as:
Give me all singers that have at least one album with the title "Fear of the Dark"
A valid query for that would be:
SELECT *
FROM Singers
WHERE SingerId IN (
SELECT SingerId
FROM Albums
WHERE AlbumTitle='Fear of the Dark'
)
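An equivalent formulation uses EXISTS, which expresses the "at least one album" condition directly. A sketch against the corrected schema above:
SELECT *
FROM Singers s
WHERE EXISTS (
    SELECT 1
    FROM Albums a
    WHERE a.SingerId = s.SingerId
    AND a.AlbumTitle = 'Fear of the Dark'
)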

Correct way to get the last value for a field in Apache Spark or Databricks Using SQL (Correct behavior of last and last_value)?

What is the correct behavior of the last and last_value functions in Apache Spark/Databricks SQL? The way I read the documentation (here: https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html), it sounds like they should return the last value of whatever is in the expression.
So if I have a select statement that does something like
select
person,
last(team)
from
(select * from person_team order by date_joined)
group by person
I should get the last team a person joined, yes/no?
The actual query I'm running is shown below. It is returning a different number each time I execute the query.
select count(distinct patient_id) from (
select
patient_id,
org_patient_id,
last_value(data_lot) data_lot
from
(select * from my_table order by data_lot)
where 1=1
and org = 'my_org'
group by 1,2
order by 1,2
)
where data_lot in ('2021-01','2021-02')
;
What is the correct way to get the last value for a given field (for either the team example or my specific example)?
--- EDIT -------------------
I'm thinking collect_set might be useful here, but I get the error shown below when I try to run this:
select
patient_id,
last_value(collect_set(data_lot)) data_lot
from
covid.demo
group by patient_id
;
Error in SQL statement: AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
Aggregate [patient_id#89338], [patient_id#89338, last_value(collect_set(data_lot#89342, 0, 0), false) AS data_lot#91848]
+- SubqueryAlias spark_catalog.covid.demo
The posts shown below discuss how to get max values, which is not the same as the last value in a list ordered by a different field. I want the last team a player joined: if the player joined the Reds, the A's, the Zebras, and the Yankees, in that order, I'm looking for the Yankees. Those posts also reach the solution procedurally using Python/R; I'd like to do this in SQL.
Getting last value of group in Spark
Find maximum row per group in Spark DataFrame
--- SECOND EDIT -------------------
I ended up using something like this based upon the accepted answer.
select
row_number() over (order by provided_date, data_lot) as row_num,
demo.*
from demo
You can assign row numbers based on an ordering on data_lot if you want to get its last value:
select count(distinct patient_id) from (
select * from (
select *,
row_number() over (partition by patient_id, org_patient_id, org order by data_lot desc) as rn
from my_table
where org = 'my_org'
)
where rn = 1
)
where data_lot in ('2021-01','2021-02');
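As a side note, if your Spark version is 3.0 or later, the max_by aggregate expresses "the value of one column at the maximum of another" directly. A sketch for the team example from the question, assuming date_joined is the ordering column:
SELECT
    person,
    max_by(team, date_joined) AS last_team
FROM person_team
GROUP BY person;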

How should I design the schema to get the last 2 records of each clustering key in Cassandra?

Each row in my table has 4 values: product_id, user_id, updated_at, rating.
I'd like to create a table to find out how many users changed their rating during a given period.
Currently my schema looks like:
CREATE TABLE IF NOT EXISTS ratings_by_product (
product_id int,
updated_at timestamp,
user_id int,
rating int,
PRIMARY KEY ((product_id ), updated_at , user_id ))
WITH CLUSTERING ORDER BY (updated_at DESC, user_id ASC);
but I couldn't figure out a way to get only the last 2 rows per user in a given time window.
Any advice on query or changing the schema would be appreciated.
Cassandra requires a query-based approach to table design, which means that typically one table serves one query. So to serve the query you are describing (the last two updated rows per user), you should build a table specifically designed for it:
CREATE TABLE ratings_by_user_by_time (
product_id int,
updated_at timestamp,
user_id int,
rating int,
PRIMARY KEY ((user_id ), updated_at, product_id ))
WITH CLUSTERING ORDER BY (updated_at DESC, product_id ASC );
Then you will be able to get the last two updated ratings for a user by doing the following:
SELECT * FROM ratings_by_user_by_time
WHERE user_id = 1234 LIMIT 2;
Note that you'll need to keep the two ratings tables in sync yourself, and a batch statement is a good way to accomplish that.
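A minimal sketch of such a batch (all values here are illustrative):
BEGIN BATCH
    INSERT INTO ratings_by_product (product_id, updated_at, user_id, rating)
    VALUES (10, '2023-01-15 10:00:00+0000', 42, 4);
    INSERT INTO ratings_by_user_by_time (product_id, updated_at, user_id, rating)
    VALUES (10, '2023-01-15 10:00:00+0000', 42, 4);
APPLY BATCH;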

equivalent percentile_cont function in apache spark sql

I am new to the Spark environment. I have a dataset with the following column names:
user_id, Date_time, order_quantity
I want to calculate the 90th percentile of order_quantity for each user_id.
If it were SQL, I would have used the following query:
%sql
SELECT user_id, PERCENTILE_CONT ( 0.9 ) WITHIN GROUP (ORDER BY order_quantity) OVER (PARTITION BY user_id)
However, Spark doesn't have built-in support for the percentile_cont function.
Any suggestions on how I can implement this in Spark on the above dataset? Please let me know if more information is needed.
I have a solution for PERCENTILE_DISC(0.9), which returns the discrete order_quantity closest to the 0.9 percentile (without interpolation).
The idea is to compute PERCENT_RANK, subtract 0.9, take the absolute value, and then pick the minimal value:
%sql
WITH temp1 AS (
    SELECT
        user_id,
        order_quantity,
        ABS(PERCENT_RANK() OVER
            (PARTITION BY user_id ORDER BY order_quantity) - 0.9) AS perc_90_temp
    FROM my_input_table -- placeholder name; the question doesn't name the dataset
)
SELECT DISTINCT
    user_id,
    FIRST_VALUE(order_quantity) OVER
        (PARTITION BY user_id ORDER BY perc_90_temp) AS perc_disc_90
FROM temp1;
I was dealing with a similar issue. I worked in SAP HANA and then moved to Spark SQL on Databricks, where I migrated the following SAP HANA query:
SELECT
DISTINCT ITEM_ID,
LOCATION_ID,
PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY VENTAS) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY PRECIO) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM MY_TABLE
to
SELECT DISTINCT
ITEM_ID,
LOCATION_ID,
PERCENTILE(VENTAS,0.8) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
PERCENTILE(PRECIO,0.5) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM
delta.`MY_TABLE`
In your particular case it should be as follows:
SELECT DISTINCT user_id, PERCENTILE(order_quantity, 0.9) OVER (PARTITION BY user_id) AS perc_90
FROM my_input_table -- placeholder name; the question doesn't name the dataset
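One more note, worth verifying on your runtime: Spark 3.3 and later also ship a native PERCENTILE_CONT with the WITHIN GROUP syntax, so a grouped form of the original query may work directly (my_input_table is again a placeholder):
SELECT
    user_id,
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY order_quantity) AS perc_90
FROM my_input_table
GROUP BY user_id;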
I hope this helps.

How to retrieve a date range from cassandra

I have a very simple table to store a collection of IDs by a date range:
CREATE TABLE schedule_range (
start_date timestamp,
end_date timestamp,
schedules set<text>,
PRIMARY KEY ((start_date, end_date)));
I was hoping to be able to query it by a date range
SELECT *
FROM schedule_range
WHERE start_date >= 'xxx'
AND end_date < 'yyy'
Unfortunately it doesn't work this way. I've tried a few different approaches, and they always fail for different reasons.
How should I store IDs to be able to get them all by a date range?
In Cassandra you can only use the > and < operators on the last primary key field referenced in the query, in your case end_date; for the preceding fields you must use the equality operator. With that schema alone you may need to consider other options.
One option is to use Apache Spark. There are projects that build an abstraction layer over Cassandra in Spark and let you run operations on Cassandra data such as joins, arbitrary filters, and group-bys (a sketch follows the list below).
Check out these projects:
Stratio Deep
Datastax Connector
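For example, with the DataStax connector you can expose a Cassandra table to Spark SQL and then filter freely. A sketch, where the keyspace name and dates are illustrative and the exact syntax depends on the connector version:
CREATE TEMPORARY VIEW schedule_range_v
USING org.apache.spark.sql.cassandra
OPTIONS (table "schedule_range", keyspace "my_keyspace");
SELECT *
FROM schedule_range_v
WHERE start_date >= '2014-02-01' AND end_date < '2014-03-01';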
Using the table below with a query that somewhat resembles yours works because: 1) it doesn't put a range condition on the partition key start_date, as only EQ and IN relations are supported on the partition key; and 2) greater-than and less-than comparisons on a clustering column are restricted to filters that select a contiguous ordering of rows; pinning the clustering column id (the 2nd component of the compound key) with an equality makes the end_date range contiguous.
create table schedule_range2 (
start_date timestamp,
end_date timestamp,
id int,
schedules set<text>,
primary key (start_date, id, end_date));
insert into schedule_range2 (start_date, id, end_date, schedules) VALUES ('2014-02-03 04:05', 1, '2014-02-04 04:00', {'event1', 'event2'});
insert into schedule_range2 (start_date, id, end_date, schedules) VALUES ('2014-02-05 04:05', 1, '2014-02-06 04:00', {'event3', 'event4'});
select * from schedule_range2 where id=1 and end_date >='2014-02-04 04:00' and end_date < '2014-02-06 04:00' ALLOW FILTERING;
