I have a dataframe of products with a nested field (categories), which is an array of structs. Some product properties change over time as they get corrected:
+---------+------------+-------+------------------------+
| product | date | brand | categories |
+=========+============+=======+========================+
| 1 | 01.01.2020 | b1 | name: ca1, taxonomy: a |
| | | +------------------------+
| | | | name: cb1, taxonomy: b |
+---------+------------+-------+------------------------+
| 2 | 01.01.2020 | b3 | name: ca3, taxonomy: a |
+---------+------------+-------+------------------------+
| 1 | 02.01.2020 | b2 | name: ca2, taxonomy: a |
| | | +------------------------+
| | | | name: cb2, taxonomy: b |
+---------+------------+-------+------------------------+
| 1 | 03.01.2020 | | |
+---------+------------+-------+------------------------+
I would like to get, per product, the last set (non-null) brand, category_a (based on taxonomy a), and category_b (based on taxonomy b). So the expected outcome should look like:
+---------+-------+------------+------------+
| product | brand | category_a | category_b |
+=========+=======+============+============+
| 1 | b2 | ca2 | cb2 |
+---------+-------+------------+------------+
| 2 | b3 | ca3 | |
+---------+-------+------------+------------+
Assuming a view is created for this dataframe and named products, I have tried the following query:
SELECT DISTINCT
p.product AS product,
LAST_VALUE(p.brand) IGNORE NULLS OVER (PARTITION BY p.product ORDER BY p.date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS brand,
LAST_VALUE((SELECT name FROM LATERAL VIEW EXPLODE(p.categories) WHERE taxonomy = "a" LIMIT 1)) IGNORE NULLS OVER (PARTITION BY p.product ORDER BY p.date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS category_a,
LAST_VALUE((SELECT name FROM LATERAL VIEW EXPLODE(p.categories) WHERE taxonomy = "b" LIMIT 1)) IGNORE NULLS OVER (PARTITION BY p.product ORDER BY p.date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS category_b
FROM
products AS p
This query leads to the following exception:
pyspark.sql.utils.AnalysisException: Accessing outer query column is not allowed in:
Generate explode(outer(categories..))
Although the exception is clear, I don't think this use case is unique, and there should be some solution to this problem that I unfortunately haven't found so far.
I know I can get the outcome I am expecting using BigQuery's Standard SQL:
SELECT DISTINCT
p.product AS product,
LAST_VALUE(p.brand IGNORE NULLS) OVER (PARTITION BY p.product ORDER BY p.date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS brand,
LAST_VALUE((SELECT name FROM UNNEST(p.categories) WHERE taxonomy = "a") IGNORE NULLS) OVER (PARTITION BY p.product ORDER BY p.date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS category_a,
LAST_VALUE((SELECT name FROM UNNEST(p.categories) WHERE taxonomy = "b") IGNORE NULLS) OVER (PARTITION BY p.product ORDER BY p.date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS category_b
FROM
products AS p
I can also get this outcome if I break the query into two (one per category taxonomy) and cross join the "exploded" categories back to the original products view, but is there a way to get this in one query?
Given your df:
import json
df = spark.read.json(spark.sparkContext.parallelize([
{'product': 1, 'date': '2020-01-01', 'brand': 'b1', 'categories': [{'name': 'ca1', 'taxonomy': 'a'}, {'name': 'cb1', 'taxonomy': 'b'}]},
{'product': 2, 'date': '2020-01-01', 'brand': 'b3', 'categories': [{'name': 'ca3', 'taxonomy': 'a'}]},
{'product': 1, 'date': '2020-01-02', 'brand': 'b2', 'categories': [{'name': 'ca2', 'taxonomy': 'a'}, {'name': 'cb2', 'taxonomy': 'b'}]},
{'product': 1, 'date': '2020-01-03', 'brand': None, 'categories': None}
]).map(json.dumps))
df.show(truncate=False)
+-----+--------------------+----------+-------+
|brand|categories |date |product|
+-----+--------------------+----------+-------+
|b1 |[{ca1, a}, {cb1, b}]|2020-01-01|1 |
|b3 |[{ca3, a}] |2020-01-01|2 |
|b2 |[{ca2, a}, {cb2, b}]|2020-01-02|1 |
|null |null |2020-01-03|1 |
+-----+--------------------+----------+-------+
you can register the dataframe as a temp view and then use the following SQL:
df.createOrReplaceTempView("df")
spark.sql("""
with df_with_maps as (
select product, brand, date, map_from_arrays(categories.taxonomy, categories.name) as category_map from df
)
select DISTINCT(product),
last(brand, true) over(PARTITION by product order by date asc rows BETWEEN UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING) as brand,
last(category_map['a'], true) over(PARTITION by product order by date asc rows BETWEEN UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING) as category_a,
last(category_map['b'], true) over(PARTITION by product order by date asc rows BETWEEN UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING) as category_b
from df_with_maps
""").show(truncate=False)
+-------+-----+----------+----------+
|product|brand|category_a|category_b|
+-------+-----+----------+----------+
|1 |b2 |ca2 |cb2 |
|2 |b3 |ca3 |null |
+-------+-----+----------+----------+
This can obviously be written with the Python API as well; I just assume you prefer the raw SQL.
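For reference, here is a rough sketch of the same approach with the DataFrame API (same df and column names as above; treat it as an untested translation rather than the canonical version):

from pyspark.sql import functions as F, Window

# full-partition frame so last(..., ignorenulls=True) sees every row of the product
w = (
    Window.partitionBy("product")
    .orderBy("date")
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

result = (
    df
    # taxonomy -> name lookup per row, e.g. {"a": "ca1", "b": "cb1"}
    .withColumn("category_map", F.map_from_arrays("categories.taxonomy", "categories.name"))
    .select(
        "product",
        F.last("brand", ignorenulls=True).over(w).alias("brand"),
        F.last(F.col("category_map")["a"], ignorenulls=True).over(w).alias("category_a"),
        F.last(F.col("category_map")["b"], ignorenulls=True).over(w).alias("category_b"),
    )
    .distinct()
)
result.show()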
This is the logic in SQL (pseudo-code):
coalesce(if effc_dt <= tran_dt then select max(effc_dt), if effc_dt >= tran_dt then select min(effc_dt))
I want similar logic in PySpark: when effc_date is earlier than tran_date, select the effc_date closest to tran_date; if no earlier date is present, check the later dates and select the effc_date closest to tran_date.
Input dataframe:
|id|tran_date |effc_date |
|--|-----------|-----------|
|12|2020-02-01 |2019-02-01 |
|12|2020-02-01 |2018-02-01 |
|34|2020-02-01 |2021-02-15 |
|34|2020-02-01 |2020-02-15 |
|40|2020-02-01 |2019-02-15 |
|40|2020-02-01 |2020-03-15 |
Expected Output:
|id|tran_date |effc_date |
|--|-----------|-----------|
|12|2020-02-01 |2019-02-01 |
|34|2020-02-01 |2020-02-15 |
|40|2020-02-01 |2019-02-15 |
You can rank the rows per id, preferring effc_date values on or before tran_date and, within that, the smallest date difference, then keep only the first row:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'rn',
    F.row_number().over(
        Window.partitionBy('id')
        .orderBy(
            # rows with effc_date after tran_date sort last
            (F.col('effc_date') > F.col('tran_date')).cast('int'),
            # then the closest date wins
            F.abs(F.datediff('tran_date', 'effc_date'))
        )
    )
).filter('rn = 1').drop('rn')
df2.show()
+---+----------+----------+
| id| tran_date| effc_date|
+---+----------+----------+
| 12|2020-02-01|2019-02-01|
| 34|2020-02-01|2020-02-15|
| 40|2020-02-01|2019-02-15|
+---+----------+----------+
I have the following two tables, for which I have to check the existence of values between them using a correlated sub-query.
The requirement is: for each record in the orders table, check whether the corresponding custid is present in the customer table, and output a field (named FLAG) with the value Y if the custid exists, otherwise N.
orders:
orderid | custid
12345 | XYZ
34566 | XYZ
68790 | MNP
59876 | QRS
15620 | UVW
customer:
id | custid
1 | XYZ
2 | UVW
Expected Output:
orderid | custid | FLAG
12345 | XYZ | Y
34566 | XYZ | Y
68790 | MNP | N
59876 | QRS | N
15620 | UVW | Y
I tried something like the following but couldn't get it to work:
select
o.orderid,
o.custid,
case when o.custid EXISTS (select 1 from customer c on c.custid = o.custid)
then 'Y'
else 'N'
end as flag
from orders o
Can this be solved with a correlated scalar sub-query? If not, what is the best way to implement this requirement?
Please advise.
Note: using Spark SQL v2.4.0
Thanks.
IN/EXISTS predicate sub-queries can only be used in a filter in Spark.
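To illustrate the constraint: a correlated EXISTS is fine when it appears inside WHERE (i.e. as a filter); it is the sub-query inside the SELECT list / CASE expression that the analyzer rejects. A minimal sketch, assuming the orders and customer temp views registered further down:

# Runs, because the correlated sub-query is only used to filter rows.
# It returns just the orders with a matching customer, so it cannot
# produce the Y/N flag for every order by itself.
spark.sql("""
    select o.orderid, o.custid
    from orders o
    where exists (select 1 from customer c where c.custid = o.custid)
""").show()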
The following works in a locally recreated copy of your data:
select orderid, custid, case when existing_customer is null then 'N' else 'Y' end existing_customer
from (select o.orderid, o.custid, c.custid existing_customer
from orders o
left join customer c
on c.custid = o.custid)
Here's how it works with recreated data:
import spark.implicits._
def textToView(csv: String, viewName: String) = {
spark.read
.option("ignoreLeadingWhiteSpace", "true")
.option("ignoreTrailingWhiteSpace", "true")
.option("delimiter", "|")
.option("header", "true")
.csv(spark.sparkContext.parallelize(csv.split("\n")).toDS)
.createOrReplaceTempView(viewName)
}
textToView("""id | custid
1 | XYZ
2 | UVW""", "customer")
textToView("""orderid | custid
12345 | XYZ
34566 | XYZ
68790 | MNP
59876 | QRS
15620 | UVW""", "orders")
spark.sql("""
select orderid, custid, case when existing_customer is null then 'N' else 'Y' end existing_customer
from (select o.orderid, o.custid, c.custid existing_customer
from orders o
left join customer c
on c.custid = o.custid)""").show
Which returns:
+-------+------+-----------------+
|orderid|custid|existing_customer|
+-------+------+-----------------+
| 59876| QRS| N|
| 12345| XYZ| Y|
| 34566| XYZ| Y|
| 68790| MNP| N|
| 15620| UVW| Y|
+-------+------+-----------------+
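If you prefer the DataFrame API over raw SQL, a rough equivalent of the same left-join idea might look like this (orders_df and customer_df are hypothetical names for dataframes holding the two tables):

from pyspark.sql import functions as F

flagged = (
    orders_df.alias("o")
    .join(customer_df.alias("c"), F.col("o.custid") == F.col("c.custid"), "left")
    .select(
        F.col("o.orderid"),
        F.col("o.custid"),
        # a NULL on the customer side means no match, hence flag = N
        F.when(F.col("c.custid").isNull(), "N").otherwise("Y").alias("flag"),
    )
)
flagged.show()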
I want to use Spark SQL window functions to do some aggregations and windowing.
Suppose I'm using the example table provided here: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
I want to run a query that gives me the top 2 revenues for each category and also the count of products in each category.
After I run this query:
SELECT
product,
category,
revenue,
count
FROM (
SELECT
product,
category,
revenue,
dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
count(*) OVER (PARTITION BY category ORDER BY revenue DESC) as count
FROM productRevenue) tmp
WHERE
rank <= 2
I got the table like this:
product category revenue count
pro2 tablet 6500 1
mini tablet 5500 2
instead of
product category revenue count
pro2 tablet 6500 5
mini tablet 5500 5
which is what I expected.
How should I write my code to get the right count for each category (instead of using another separate Group By statement)?
In Spark, when a window clause has an ORDER BY, the window frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
For your case, add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to the count(*) window clause.
Try with:
SELECT
product,
category,
revenue, count
FROM (
SELECT
product,
category,
revenue,
dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
count(*) OVER (PARTITION BY category ORDER BY revenue DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as count
FROM productRevenue) tmp
WHERE
rank <= 2
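For completeness, a rough DataFrame API sketch of the same fix, with the frame spelled out explicitly (product_revenue is a hypothetical name for the dataframe behind the productRevenue view):

from pyspark.sql import functions as F, Window

rank_window = Window.partitionBy("category").orderBy(F.desc("revenue"))
# without the explicit frame, the ordered window would stop at the current row
count_window = (
    Window.partitionBy("category")
    .orderBy(F.desc("revenue"))
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

result = (
    product_revenue
    .withColumn("rank", F.dense_rank().over(rank_window))
    .withColumn("count", F.count(F.lit(1)).over(count_window))
    .filter(F.col("rank") <= 2)
    .select("product", "category", "revenue", "count")
)
result.show()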
Change count(*) OVER (PARTITION BY category ORDER BY revenue DESC) as count to count(*) OVER (PARTITION BY category ORDER BY category DESC) as count. Because every row in the partition shares the same category value, the default RANGE frame then treats all rows as peers of the current row, so the count covers the whole partition and you will get the expected result.
Try the code below.
scala> spark.sql("""SELECT
| product,
| category,
| revenue,
| rank,
| count
| FROM (
| SELECT
| product,
| category,
| revenue,
| dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
| count(*) OVER (PARTITION BY category ORDER BY category DESC) as count
| FROM productRevenue) tmp
| WHERE
| tmp.rank <= 2 """).show(false)
+----------+----------+-------+----+-----+
|product |category |revenue|rank|count|
+----------+----------+-------+----+-----+
|Pro2 |tablet |6500 |1 |5 |
|Mini |tablet |5500 |2 |5 |
|Thin |cell phone|6000 |1 |5 |
|Very thin |cell phone|6000 |1 |5 |
|Ultra thin|cell phone|5000 |2 |5 |
+----------+----------+-------+----+-----+
Say I have the following spark dataframe:
| Node_id | Parent_id |
|---------|-----------|
| 1 | NULL |
| 2 | 1 |
| 3 | 1 |
| 4 | NULL |
| 5 | 4 |
| 6 | NULL |
| 7 | 6 |
| 8 | 3 |
This dataframe represents a tree structure consisting of several disjoint trees. Now, say that we have a list of nodes [8, 7], and we want to get a dataframe containing just the nodes that are roots of the trees containing the nodes in the list. The output looks like:
| Node_id | Parent_id |
|---------|-----------|
| 1 | NULL |
| 6 | NULL |
What would be the best (fastest) way to do this with spark queries and pyspark?
If I were doing this in plain SQL I would just do something like this:
CREATE TABLE #Tmp (
    Node_id int,
    Parent_id int
)
INSERT INTO #Tmp SELECT Node_id, Parent_id FROM Nodes WHERE Node_id IN (8, 7)  -- the child nodes
SELECT @num = COUNT(*) FROM #Tmp WHERE Parent_id IS NOT NULL
WHILE @num > 0
BEGIN
    INSERT INTO #Tmp
    SELECT
        p.Node_id,
        p.Parent_id
    FROM
        #Tmp t
        LEFT JOIN Nodes p
            ON t.Parent_id = p.Node_id
    SELECT @num = COUNT(*) FROM #Tmp WHERE Parent_id IS NOT NULL
END
SELECT Node_id FROM #Tmp WHERE Parent_id IS NULL
I just wanted to know if there's a more Spark-centric way of doing this with PySpark, beyond the obvious method of simply looping over the dataframe in Python.
parent_nodes = spark.sql("select Parent_id from table_name where Node_id in (8, 7)").distinct()
You can join the above dataframe back with the table to get the Parent_id of those parent nodes as well, and repeat until you reach the roots (Parent_id is null).
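A more complete, Spark-centric pattern is to keep joining the tracked nodes to their parents on the driver until every one of them has reached a root. A minimal sketch, assuming the tree is in a dataframe called nodes_df (hypothetical name) with columns Node_id and Parent_id, and that the trees are fairly shallow:

from pyspark.sql import functions as F

# start from the nodes of interest
current = nodes_df.filter(F.col("Node_id").isin([8, 7]))

# climb one level per iteration until no tracked node has a parent left
while current.filter(F.col("Parent_id").isNotNull()).count() > 0:
    current = (
        current.alias("c")
        .join(nodes_df.alias("p"), F.col("c.Parent_id") == F.col("p.Node_id"), "left")
        .select(
            # replace a node that still has a parent with that parent,
            # keep roots (no match on the left join) as they are
            F.coalesce(F.col("p.Node_id"), F.col("c.Node_id")).alias("Node_id"),
            F.col("p.Parent_id").alias("Parent_id"),
        )
        .distinct()
    )

current.show()  # the roots: Node_id 1 and 6, Parent_id null

Each pass moves every non-root node one level up, so the loop runs at most as many times as the deepest tree.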
I have the following table:
--------------------
|bookname |author |
--------------------
|book1 |author1 |
|book1 |author2 |
|book2 |author3 |
|book2 |author4 |
|book3 |author5 |
|book3 |author6 |
|book4 |author7 |
|book4 |author8 |
---------------------
but I want the book names as columns and the authors as rows, for example:
----------------------------------
|book1 |book2 |book3 |book4 |
----------------------------------
|author1|author3 |author5|author7|
|author2|author4 |author6|author8|
----------------------------------
Is this possible in Postgres? How can I do it?
I tried crosstab but couldn't get it to work.
You can get the result using an aggregate function with a CASE expression, but I would first use row_number() so you have a value that can be used to group the data.
If you use row_number(), then the query could be:
select
max(case when bookname = 'book1' then author end) book1,
max(case when bookname = 'book2' then author end) book2,
max(case when bookname = 'book3' then author end) book3,
max(case when bookname = 'book4' then author end) book4
from
(
select bookname, author,
row_number() over(partition by bookname
order by author) seq
from yourtable
) d
group by seq;
See SQL Fiddle with Demo. I added the row_number() so that each distinct value for the books is returned. If you exclude the row_number(), then using an aggregate with a CASE will return only one value for each book.
This query gives the result:
| BOOK1 | BOOK2 | BOOK3 | BOOK4 |
-----------------------------------------
| author1 | author3 | author5 | author7 |
| author2 | author4 | author6 | author8 |