equivalent percentile_cont function in apache spark sql - apache-spark

I am new to the Spark environment. I have a dataset with the following columns:
user_id, Date_time, order_quantity
I want to calculate the 90th percentile of order_quantity for each user_id.
If this were plain SQL, I would have used the following query:
%sql
SELECT user_id, PERCENTILE_CONT ( 0.9 ) WITHIN GROUP (ORDER BY order_quantity) OVER (PARTITION BY user_id)
However, Spark SQL doesn't have built-in support for the percentile_cont function.
Any suggestions on how I can implement this in Spark on the above dataset?
Please let me know if more information is needed.

I have a solution for PERCENTILE_DISC(0.9), which returns the discrete order_quantity closest to the 0.9 percentile (without interpolation).
The idea is to calculate PERCENT_RANK, subtract 0.9, take the absolute value, and then pick the row with the minimal value:
%sql
WITH temp1 AS (
SELECT
user_id,
order_quantity,
ABS(PERCENT_RANK() OVER
(PARTITION BY user_id ORDER BY order_quantity) - 0.9) AS perc_90_temp
FROM my_table  -- your source table
)
SELECT
user_id,
FIRST_VALUE(order_quantity) OVER
(PARTITION BY user_id ORDER BY perc_90_temp) AS perc_disc_90
FROM
temp1;

I was dealing with a similar issue too. I worked in SAP HANA and then moved to Spark SQL on Databricks, where I migrated the following SAP HANA query:
SELECT
DISTINCT ITEM_ID,
LOCATION_ID,
PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY VENTAS) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY PRECIO) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM MY_TABLE
to
SELECT DISTINCT
ITEM_ID,
LOCATION_ID,
PERCENTILE(VENTAS,0.8) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
PERCENTILE(PRECIO,0.5) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM
delta.`MY_TABLE`
In your particular case it should be as follows:
SELECT DISTINCT user_id, PERCENTILE(order_quantity, 0.9) OVER (PARTITION BY user_id) AS perc_90 FROM my_table  -- your source table
I hope this helps.
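If an approximate value is good enough, Spark SQL's built-in percentile_approx aggregate can also be used without a window function; a minimal sketch, assuming a source table named my_table:
SELECT
user_id,
PERCENTILE_APPROX(order_quantity, 0.9) AS approx_perc_90  -- approximate 90th percentile per user
FROM my_table
GROUP BY user_id;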

Related

Potential optimization for GROUP BY?

Let's say I have three tables with the following structure.
table_1
- user_id BIGINT
- country STRING
table_2
- user_id BIGINT
- gender STRING
table_3
- user_id BIGINT
- age_bucket STRING
Let's assume all three tables are at the user_id level. Now I want to create a new table with the following structure.
output_table
- user_id
- country
- gender
- age_bucket
One approach is table_1 FULL OUTER JOIN table_2 FULL OUTER JOIN table_3 (sketched below); the downside of this approach is that the second join won't start until the join of table_1 and table_2 finishes.
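For reference, a sketch of that join approach, coalescing the join key so user_ids missing from one table still appear:
SELECT
COALESCE(t1.user_id, t2.user_id, t3.user_id) AS user_id,
t1.country,
t2.gender,
t3.age_bucket
FROM table_1 t1
FULL OUTER JOIN table_2 t2 ON t1.user_id = t2.user_id
FULL OUTER JOIN table_3 t3 ON COALESCE(t1.user_id, t2.user_id) = t3.user_id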
Alternatively, we can do something like the following.
SELECT
user_id,
ARBITRARY(country) AS country,
ARBITRARY(gender) AS gender,
ARBITRARY(age_bucket) AS age_bucket
FROM
(
SELECT user_id, country, NULL AS gender, NULL AS age_bucket FROM table_1
UNION ALL
SELECT user_id, NULL AS country, gender, NULL AS age_bucket FROM table_2
UNION ALL
SELECT user_id, NULL AS country, NULL AS gender, age_bucket FROM table_3
) v
GROUP BY
user_id
In the execution plan, this triggers an exchange for all three tables.
Now, let's say all three tables are bucketed and sorted the same way (a sketch of such DDL follows). In theory, we could make this UNION ALL query a map-only operation. However, it looks like that's not how Spark works today.
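A minimal sketch of the bucketing being assumed, in Spark SQL DDL (the bucket count 16 is illustrative):
CREATE TABLE table_1 (user_id BIGINT, country STRING)
USING PARQUET
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 16 BUCKETS;
-- table_2 and table_3 would be bucketed and sorted the same way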
My question is whether Spark can do this UNION ALL without an exchange. If not, what would it take to develop a feature like this?
Thanks.

How do I filter and use GROUP BY in Cassandra?

I have a table in Cassandra -
CREATE TABLE orders.orders (
customer text,
ordered_at date,
orders bigint,
PRIMARY KEY (customer, ordered_at)
) WITH CLUSTERING ORDER BY (ordered_at ASC)
I have to do something like -
SELECT customer, count(1) from orders.orders GROUP BY customer HAVING count(1) > 4;
But it doesn't seem to work. Is there a way I can do that?
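For what it's worth, CQL has no HAVING clause, so the filter on the count can't be expressed server-side. A sketch of the part that does work in recent Cassandra versions (grouping on the partition key), with the count > 4 filter applied on the client:
SELECT customer, count(*) AS order_days
FROM orders.orders
GROUP BY customer;
-- filter the returned rows for order_days > 4 in the application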

Aliasing different WINDOW clauses in Spark SQL

Is it possible to have aliases for multiple windows in the same query?
For example:
select
cust_id,
eff_dt,
row_number() over w AS rec1
from cust
WINDOW w AS (PARTITION BY cust_id ORDER BY eff_dt desc);
The above runs fine. But it fails when I try to add another Window alias:
select
cust_id,
eff_dt,
row_number() over w AS rec1,
rank() over w2 AS rec2
from cust
WINDOW w AS (PARTITION BY cust_id ORDER BY eff_dt desc),
WINDOW w2 AS (PARTITION BY cust_id ORDER BY version asc);
Can anyone please help with how to use both window aliases above?
Thanks
You can do that with a nested query:
select
cust_id,
eff_dt,
rec1,
rank() over w2 AS rec2
from (
select
cust_id,
eff_dt,
version,
row_number() over w AS rec1
from cust
WINDOW w AS (PARTITION BY cust_id ORDER BY eff_dt desc))
WINDOW w2 AS (PARTITION BY cust_id ORDER BY version asc);
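Depending on the Spark version, the failing query may also work without nesting by using a single WINDOW clause with comma-separated definitions (the WINDOW keyword appears only once); a sketch, assuming the version column exists on cust:
select
cust_id,
eff_dt,
row_number() over w AS rec1,
rank() over w2 AS rec2
from cust
WINDOW w AS (PARTITION BY cust_id ORDER BY eff_dt desc),
w2 AS (PARTITION BY cust_id ORDER BY version asc);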

COUNT(*) vs. COUNT(1) vs COUNT(column) with WHERE condition performance in Cassandra

I have a query in Cassandra
select count(pk1) from tableA where pk1=xyz
Table is :
create table tableA
(
pk1 uuid,
pk2 int,
pk3 text,
pk4 int,
fc1 int,
fc2 bigint,
....
fcn blob,
primary key (pk1, pk2, pk3, pk4)
);
The query is executed often and takes up to 2s to execute.
I am wondering if there will be any performance gain from refactoring it to:
select count(1) from tableA where pk1 = xyz
Based on the documentation here, there is no difference between count(1) and count(*).
Generally speaking COUNT(1) and COUNT(*) will both return the number of rows that match the condition specified in your query
This is in line with how traditional SQL databases are implemented.
COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )
COUNT(1) counts every row, because the constant 1 can never be NULL.
Also, COUNT(column_name) only counts non-NULL values.
Since in your case the counted column is constrained by the WHERE condition and can never be NULL, I don't think there will be any difference in performance between the two. This answer tried to confirm the same using some performance tests.
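To illustrate the non-NULL point with the columns from the table above (a generic SQL sketch; fc1 is just an example of a nullable column):
SELECT
count(*)   AS all_rows,       -- counts every matching row
count(fc1) AS non_null_fc1    -- counts only rows where fc1 is not NULL
FROM tableA
WHERE pk1 = xyz;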
In general, COUNT is not recommended in Cassandra at all, as it may have to scan multiple nodes to get the answer back, and I'm not sure the count you get is really consistent.

Row_Number Query for Groupby

My question is about getting a row number as SN in a SQL query against an Access database, in which I get the total sales for each day grouped by the [Bill Date] clause.
My working query is:
sql = "SELECT [Bill Date] as [Date], Sum(Purchase) + Sum(Returns) as [Total Sales] FROM TableName Group By [Bill Date];"
I found the ROW_NUMBER clause on the Internet and tried it like this:
sql = "SELECT ROW_NUMBER() OVER (ORDER BY [Bill Date]) AS [SN], [Bill Date] as [Date], Sum(Purchase) + Sum(Returns) as [Total Sales] FROM TableName Group By [Bill Date];"
When I run the above code I get this error:
-2147217900 Syntax error (missing operator) in query expression ROW_NUMBER() OVER (ORDER BY [Bill Date]);"
I am using Excel VBA to connect to the Access database.
Could anyone help me get it in the correct order?
Looks like you didn't define DESC or ASC in (ORDER BY [Bill Date]); it needs to be something like (ORDER BY [Bill Date] DESC) or (ORDER BY [Bill Date] ASC).
