; WITH Hierarchy as
(
select distinct PersonnelNumber
, Email
, ManagerEmail
from dimstage
union all
select e.PersonnelNumber
, e.Email
, e.ManagerEmail
from dimstage e
join Hierarchy as h on e.Email = h.ManagerEmail
)
select * from Hierarchy
Can you help me achieve the same in Spark SQL?
This is quite late, but today I tried to implement a recursive CTE query using PySpark SQL.
Here, I have this simple dataframe. What I want to do is to find the NEWEST ID of each ID.
The original dataframe:
+-----+-----+
|OldID|NewID|
+-----+-----+
| 1| 2|
| 2| 3|
| 3| 4|
| 4| 5|
| 6| 7|
| 7| 8|
| 9| 10|
+-----+-----+
The result I want:
+-----+-----+
|OldID|NewID|
+-----+-----+
| 1| 5|
| 2| 5|
| 3| 5|
| 4| 5|
| 6| 8|
| 7| 8|
| 9| 10|
+-----+-----+
Here is my code:
from pyspark.sql.functions import broadcast

df = sqlContext.createDataFrame([(1, 2), (2, 3), (3, 4), (4, 5), (6, 7), (7, 8), (9, 10)], "OldID integer,NewID integer").checkpoint().cache()

dfcheck = df.drop('NewID')
dfdistinctID = df.select('NewID').distinct()
dfidfinal = dfdistinctID.join(dfcheck, [dfcheck.OldID == dfdistinctID.NewID], how="left_anti")  # Find the IDs that have never been replaced
dfcurrent = df.join(dfidfinal, [dfidfinal.NewID == df.NewID], how="left_semi").checkpoint().cache()  # Find the rows that point to those final IDs and assign them to the dfcurrent dataframe
dfresult = dfcurrent
dfdifferentalias = df.select(df.OldID.alias('id1'), df.NewID.alias('id2')).checkpoint().cache()

while dfcurrent.count() > 0:
    dfcurrent = dfcurrent.join(broadcast(dfdifferentalias), [dfcurrent.OldID == dfdifferentalias.id2], how="inner").select(dfdifferentalias.id1.alias('OldID'), dfcurrent.NewID.alias('NewID')).cache()
    dfresult = dfresult.unionAll(dfcurrent)

display(dfresult.orderBy('OldID'))
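Note that df.checkpoint() requires a checkpoint directory to be configured; on Databricks one may already be set, but on a plain Spark setup you would need something like the following first (assuming a SparkSession named spark; the path is only an example):

# Set a durable location for reliable checkpoints; the path here is just an example.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")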
(Databricks notebook screenshot of the output omitted.)
I know that the performance is quite bad, but at least it gives the answer I need.
This is the first time I have posted an answer on Stack Overflow, so forgive me if I made any mistakes.
This is not possible using Spark SQL. The WITH clause exists, but not CONNECT BY as in, say, Oracle, or recursive WITH as in DB2.
The Spark documentation provides a "CTE in CTE definition". This is reproduced below:
-- CTE in CTE definition
WITH t AS (
WITH t2 AS (SELECT 1)
SELECT * FROM t2
)
SELECT * FROM t;
+---+
| 1|
+---+
| 1|
+---+
You can extend this to multiple nested queries, but the syntax can quickly become awkward. My suggestion is to use comments to make it clear where the next select statement is pulling from. Essentially, start with the first query and place additional CTE statements above and below as needed:
WITH t3 AS (
WITH t2 AS (
WITH t1 AS (SELECT distinct b.col1
FROM data_a as a, data_b as b
WHERE a.col2 = b.col2
AND a.col3 = b.col3
-- select from t1
)
SELECT distinct b.col1, b.col2, b.col3
FROM t1 as a, data_b as b
WHERE a.col1 = b.col1
-- select from t2
)
SELECT distinct b.col1
FROM t2 as a, data_b as b
WHERE a.col2 = b.col2
AND a.col3 = b.col3
-- select from t3
)
SELECT distinct b.col1, b.col2, b.col3
FROM t3 as a, data_b as b
WHERE a.col1 = b.col1;
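If the nesting gets hard to follow, Spark SQL also accepts several CTEs declared at the same level and separated by commas, where each CTE can refer to the ones before it; this often reads more clearly than nesting. A sketch of the same shape, using the hypothetical tables from the example above and wrapped in spark.sql for concreteness:

# Sibling CTEs instead of nested ones; data_a/data_b and the columns are the
# hypothetical names from the example above.
result = spark.sql("""
    WITH t1 AS (
        SELECT DISTINCT b.col1
        FROM data_a AS a, data_b AS b
        WHERE a.col2 = b.col2 AND a.col3 = b.col3
    ),
    t2 AS (
        SELECT DISTINCT b.col1, b.col2, b.col3
        FROM t1 AS a, data_b AS b
        WHERE a.col1 = b.col1
    ),
    t3 AS (
        SELECT DISTINCT b.col1
        FROM t2 AS a, data_b AS b
        WHERE a.col2 = b.col2 AND a.col3 = b.col3
    )
    SELECT DISTINCT b.col1, b.col2, b.col3
    FROM t3 AS a, data_b AS b
    WHERE a.col1 = b.col1
""")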
You can use createOrReplaceTempView iteratively to build a recursive query. It's not going to be fast, nor pretty, but it works. Following @Pblade's example, in PySpark:
from pyspark.sql import functions as F

def recursively_resolve(df):
    rec = df.withColumn('level', F.lit(0))

    sql = """
    select this.oldid
         , coalesce(next.newid, this.newid) as newid
         , this.level + case when next.newid is not null then 1 else 0 end as level
         , next.newid is not null as is_resolved
    from rec this
    left outer
    join rec next
      on next.oldid = this.newid
    """

    find_next = True
    while find_next:
        rec.createOrReplaceTempView("rec")
        rec = spark.sql(sql)
        # check if any rows resolved in this iteration
        # go deeper if they did
        find_next = rec.selectExpr("ANY(is_resolved = True)").collect()[0][0]

    return rec.drop('is_resolved')
Then:
src = spark.createDataFrame([(1, 2), (2, 3), (3, 4), (4, 5), (6, 7), (7, 8),(9, 10)], "OldID integer,NewID integer")
result = recursively_resolve(src)
result.show()
Prints:
+-----+-----+-----+
|oldid|newid|level|
+-----+-----+-----+
| 2| 5| 2|
| 4| 5| 0|
| 3| 5| 1|
| 7| 8| 0|
| 6| 8| 1|
| 9| 10| 0|
| 1| 5| 2|
+-----+-----+-----+
Using the example in this question, how do I create rows with a count of 0 when aggregating over all possible combinations? When using cube, rows with a count of 0 are not produced.
This is the code and output:
df.cube($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// |null| 1| 1| <- count of records where y = 1
// |null| 2| 3| <- count of records where y = 2
// | foo|null| 2| <- count of records where x = foo
// | bar| 2| 2| <- count of records where x = bar AND y = 2
// | foo| 1| 1| <- count of records where x = foo AND y = 1
// | foo| 2| 1| <- count of records where x = foo AND y = 2
// |null|null| 4| <- total count of records
// | bar|null| 2| <- count of records where x = bar
// +----+----+-----+
But this is the desired output (with the bar/1 row, row 4, added):
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// |null| 1| 1| <- count of records where y = 1
// |null| 2| 3| <- count of records where y = 2
// | foo|null| 2| <- count of records where x = foo
// | bar| 1| 0| <- count of records where x = bar AND y = 1
// | bar| 2| 2| <- count of records where x = bar AND y = 2
// | foo| 1| 1| <- count of records where x = foo AND y = 1
// | foo| 2| 1| <- count of records where x = foo AND y = 2
// |null|null| 4| <- total count of records
// | bar|null| 2| <- count of records where x = bar
// +----+----+-----+
Is there another function that could do that?
I agree that crossJoin is the correct approach here, but afterwards it may be a bit more versatile to use a join instead of a union and groupBy, especially if there is more than one aggregation besides the count.
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('foo', 1),
('foo', 2),
('bar', 2),
('bar', 2)],
['x', 'y'])
df_cartesian = df.select('x').distinct().crossJoin(df.select("y").distinct())
df_cubed = df.cube('x', 'y').count()
df_cubed.join(df_cartesian, ['x', 'y'], 'full').fillna(0, ['count']).show()
# +----+----+-----+
# | x| y|count|
# +----+----+-----+
# |null|null| 4|
# |null| 1| 1|
# |null| 2| 3|
# | bar|null| 2|
# | bar| 1| 0|
# | bar| 2| 2|
# | foo|null| 2|
# | foo| 1| 1|
# | foo| 2| 1|
# +----+----+-----+
First, let's see why you do not get combinations that do not appear in your dataset.
def cube(col1: String, cols: String*): RelationalGroupedDataset
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
As the doc states, cube is just a fancy group by. You can also check that by running explain on your result (a quick sketch follows): you would see that cube is basically an expand (to obtain the nulls) plus a group by. Therefore it cannot show you combinations that are not in your dataset. A join is needed for that, so that values that never occur in the same record can "meet".
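For instance, a quick sketch in PySpark, using the question's toy data, that shows the plan is an expand plus an aggregate rather than a join:

# The physical plan of a cube should show an Expand feeding an aggregate,
# and no join operator anywhere.
toy_df = spark.createDataFrame([("foo", 1), ("foo", 2), ("bar", 2), ("bar", 2)], ["x", "y"])
toy_df.cube("x", "y").count().explain()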
So let's construct a solution:
import org.apache.spark.sql.functions._
import spark.implicits._

// A dataset in which (2, 1) does not exist
val df = Seq((1, 1), (1, 2), (2, 2)).toDF("x", "y")

// This contains one line per possible combination, even those that are not
// in the dataset. Note that we set the count to 0.
val cartesian = df
  .select("x").distinct
  .crossJoin(df.select("y").distinct)
  .withColumn("count", lit(0))

// Let's now union the cube with the Cartesian product (CP) and
// re-run the group by.
// Since the counts were set to zero in the CP, this does not impact the
// counts of the cube. It simply adds the "missing" combinations with a count of 0.
df.cube("x", "y").count
  .union(cartesian)
  .groupBy("x", "y")
  .agg(sum('count) as "count")
  .show
which yields:
+----+----+-----+
| x| y|count|
+----+----+-----+
| 2| 2| 1|
| 1| 2| 1|
| 1| 1| 1|
| 2| 1| 0|
|null|null| 3|
| 1|null| 2|
|null| 1| 1|
|null| 2| 2|
| 2|null| 1|
+----+----+-----+
I have a use case where I need to deduplicate a dataframe using a column (it's a GUID column). But instead of discarding the duplicates, I need to store them in a separate location. So, for example, if we have the following data, with schema (name, GUID):
(a, 1), (b, 2), (a, 2), (a, 3), (c, 1), (c, 4). I want to split the dataset such that I have:
(a, 1), (b, 2), (a, 3), (c, 4) in one part and (a, 2), (c, 1) in the second part. If I use dropDuplicates(col("GUID")), the second part is lost. What would be an efficient way to do this?
You can assign a row number, and split the dataframe into two parts based on whether the row number is equal to 1.
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
'rn',
F.row_number().over(Window.partitionBy('GUID').orderBy(F.monotonically_increasing_id()))
)
df2.show()
+----+----+---+
|name|GUID| rn|
+----+----+---+
| a| 1| 1|
| c| 1| 2|
| a| 3| 1|
| b| 2| 1|
| a| 2| 2|
| c| 4| 1|
+----+----+---+
df2_part1 = df2.filter('rn = 1').drop('rn')
df2_part2 = df2.filter('rn != 1').drop('rn')
df2_part1.show()
+----+----+
|name|GUID|
+----+----+
| a| 1|
| a| 3|
| b| 2|
| c| 4|
+----+----+
df2_part2.show()
+----+----+
|name|GUID|
+----+----+
| c| 1|
| a| 2|
+----+----+
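One caveat: monotonically_increasing_id() only provides an arbitrary tie-breaker, so which duplicate lands in part 1 is not deterministic across runs. If you have a column that defines which row should win (say, a hypothetical updated_at timestamp), order the window by that instead:

# Hypothetical: keep the most recent row per GUID rather than an arbitrary one.
w = Window.partitionBy('GUID').orderBy(F.col('updated_at').desc())
df2 = df.withColumn('rn', F.row_number().over(w))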
I have the two DataFrames below:
MasterDF
NumberDF (created using a Hive load)
Desired output:
Logic to populate:
For Field1, pick sch_id where CAT='PAY' and SUB_CAT='client'
For Field2, pick sch_id where CAT='PAY' and SUB_CAT='phr'
For Field3, pick pay_id where CAT='credit' and SUB_CAT='spGrp'
Currently, before joining, I filter NumberDF and then pick the value, for example:
masterDF.as("master").join(NumberDF.filter(col("CAT")==="PAY" && col("SUB_CAT")==="phr").as("number"), $"master.id" === $"number.id", "leftouter")
  .select($"master.*", $"number.sch_id".as("field1"))
The above approach would need multiple joins. I looked into the pivot function but it does not solve my problem.
Note: please ignore the syntax errors in the code.
A better solution is to pivot the DataFrame (numberDF) on the column (subject) before joining with studentDF.
The PySpark code looks like this:
numberDF = spark.createDataFrame([(1, "Math", 80), (1, "English", 60), (1, "Science", 80)], ["id", "subject", "marks"])
studentDF = spark.createDataFrame([(1, "Vikas")],["id","name"])
>>> numberDF.show()
+---+-------+-----+
| id|subject|marks|
+---+-------+-----+
| 1| Math| 80|
| 1|English| 60|
| 1|Science| 80|
+---+-------+-----+
>>> studentDF.show()
+---+-----+
| id| name|
+---+-----+
| 1|Vikas|
+---+-----+
pivotNumberDF = numberDF.groupBy("id").pivot("subject").sum("marks")
>>> pivotNumberDF.show()
+---+-------+----+-------+
| id|English|Math|Science|
+---+-------+----+-------+
| 1| 60| 80| 80|
+---+-------+----+-------+
>>> studentDF.join(pivotNumberDF, "id").show()
+---+-----+-------+----+-------+
| id| name|English|Math|Science|
+---+-----+-------+----+-------+
| 1|Vikas| 60| 80| 80|
+---+-----+-------+----+-------+
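If you already know the set of subjects, you can pass them to pivot explicitly; this is optional, but it saves Spark a separate job to compute the distinct pivot values (the data in the result is the same, only the column order follows the list you give):

# Providing the pivot values up front avoids an extra distinct() pass over the data.
pivotNumberDF = numberDF.groupBy("id").pivot("subject", ["Math", "English", "Science"]).sum("marks")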
ref: http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html
Finally, I implemented it using pivot:
flights.groupBy("ID", "CAT")
.pivot("SUB_CAT", Seq("client", "phr", "spGrp")).agg(avg("SCH_ID").as("SCH_ID"), avg("pay_id").as("pay_id"))
.groupBy("ID")
.pivot("CAT", Seq("credit", "price"))
.agg(
avg("client_SCH_ID").as("client_sch_id"), avg("client_pay_id").as("client_pay_id")
, avg("phr_SCH_ID").as("phr_SCH_ID"), avg("phr_pay_id").as("phr_pay_id")
, avg("spGrp_SCH_ID").as("spGrp_SCH_ID"), avg("spGrp_pay_id").as("spGrp_pay_id")
)
The first pivot would return a table like:
+---+------+-------------+--------------+-----------+------------+-------------+--------------+
| ID| CAT|client_SCH_ID|client_pay_id |phr_SCH_ID |phr_pay_id |spnGrp_SCH_ID|spnGrp_pay_id |
+---+------+-------------+--------------+-----------+------------+-------------+--------------+
| 1|credit| 5.0| 105.0| 4.0| 104.0| 6.0| 106.0|
| 1| pay | 2.0| 102.0| 1.0| 101.0| 3.0| 103.0|
+---+------+-------------+--------------+-----------+------------+-------------+--------------+
After the second pivot it would look like:
+---+--------------------+---------------------+------------------+-------------------+--------------------+---------------------+-----------------+------------------+-----------------+------------------+-----------------+------------------+
| ID|credit_client_sch_id|credit_client_pay_id | credit_phr_SCH_ID| credit_phr_pay_id |credit_spnGrp_SCH_ID|credit_spnGrp_pay_id |pay_client_sch_id|pay_client_pay_id | pay_phr_SCH_ID| pay_phr_pay_id |pay_spnGrp_SCH_ID|pay_spnGrp_pay_id |
+---+--------------------+---------------------+------------------+-------------------+--------------------+---------------------+-----------------+------------------+-----------------+------------------+-----------------+------------------+
| 1| 5.0| 105.0| 4.0| 104.0| 6.0| 106.0| 2.0| 102.0| 1.0| 101.0| 3.0| 103.0|
+---+--------------------+---------------------+------------------+-------------------+--------------------+---------------------+-----------------+------------------+-----------------+------------------+-----------------+------------------+
Though I am not sure about performance.
df.createOrReplaceTempView("NumberDF")
df.createOrReplaceTempView("MasterDf")
val sqlDF = spark.sql("""
  select m.id, t1.fld1, t2.fld2, t3.fld3, m.otherfields
  from
    (select id, (case when n.cat='pay' and n.sub_cat='client' then n.sch_id end) fld1
       from NumberDF n where case when n.cat='pay' and n.sub_cat='client' then n.sch_id end is not null) t1,
    (select id, (case when n.cat='pay' and n.sub_cat='phr' then n.sch_id end) fld2
       from NumberDF n where case when n.cat='pay' and n.sub_cat='phr' then n.sch_id end is not null) t2,
    (select id, (case when n.cat='credit' and n.sub_cat='spGrp' then n.pay_id end) fld3
       from NumberDF n where case when n.cat='credit' and n.sub_cat='spGrp' then n.pay_id end is not null) t3,
    MasterDf m
""")
sqlDF.show()
The code below works fine, but if any one of the five columns SAL1, SAL2, SAL3, SAL4, SAL5 is NULL, the corresponding TOTAL_SALARY comes out as NULL.
It looks like some null-handling condition or a Spark UDF needs to be created; could you please help with that?
input:
NO NAME ADDR SAL1 SAL2 SAL3 SAL4 SAL5
1 ABC IND 100 200 300 null 400
2 XYZ USA 200 333 209 232 444
The second record's sum comes out fine, but in the first record, because of the null in SAL4, the output also comes out as null.
from pyspark.shell import spark
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
sc = spark.sparkContext
df = spark.read.option("header","true").option("delimiter", ",").csv("C:\\TEST.txt")
df.createOrReplaceTempView("table1")
df1 = spark.sql( "select * from table1" )
df2 = df1.groupBy('NO', 'NAME', 'ADDR').agg(F.sum(df1.SAL1 + df1.SAL2 + df1.SAL3 + df1.SAL4 + df1.SAL5).alias("TOTAL_SALARY"))
df2.show()
Thanks in advance
Just put a na.fill(0) in your code. This would replace the NULL values with 0 and you should be able to perform the operation.
So your last line should look like:
df2 = df1.na.fill(0).groupBy('NO', 'NAME', 'ADDR').agg(F.sum(df1.SAL1 + df1.SAL2 + df1.SAL3 + df1.SAL4 + df1.SAL5).alias("TOTAL_SALARY"))
It also seems that the aggregate sum function handles NULL values correctly on its own. I just tested the following code:
from pyspark.sql.functions import col, sum

df_new = spark.createDataFrame([
    (1, 4), (2, None), (3, None), (4, None),
    (5, 5), (6, None), (7, None), (1, 4), (2, 8), (3, 9), (4, 1), (1, 2), (2, 1), (3, 3), (4, 7),
], ("customer_id", "balance"))

df_new.groupBy("customer_id").agg(sum(col("balance"))).show()
df_new.na.fill(0).groupBy("customer_id").agg(sum(col("balance"))).show()
Output:
+-----------+------------+
|customer_id|sum(balance)|
+-----------+------------+
| 7| null|
| 6| null|
| 5| 5|
| 1| 10|
| 3| 12|
| 2| 9|
| 4| 8|
+-----------+------------+
+-----------+------------+
|customer_id|sum(balance)|
+-----------+------------+
| 7| 0|
| 6| 0|
| 5| 5|
| 1| 10|
| 3| 12|
| 2| 9|
| 4| 8|
+-----------+------------+
Version 1 only contains NULL values when all values in the sum are NULL.
Version 2 returns 0 instead, since all NULL values are replaced with 0s.
(In the original question the NULLs come from the row-wise SAL1 + SAL2 + ... expression, which propagates NULL, not from the aggregate sum itself.)
Basically, the code below checks all five SAL fields and, if a value is null, replaces it with 0; otherwise it keeps the original value.
from pyspark.sql.functions import when, lit

df1 = df.withColumn("SAL1", when(df.SAL1.isNull(), lit(0)).otherwise(df.SAL1))\
        .withColumn("SAL2", when(df.SAL2.isNull(), lit(0)).otherwise(df.SAL2))\
        .withColumn("SAL3", when(df.SAL3.isNull(), lit(0)).otherwise(df.SAL3))\
        .withColumn("SAL4", when(df.SAL4.isNull(), lit(0)).otherwise(df.SAL4))\
        .withColumn("SAL5", when(df.SAL5.isNull(), lit(0)).otherwise(df.SAL5))
I have a requirement where a dataframe is sorted by col1 (timestamp) and I need to filter by col2.
I need to filter out any row whose col2 value is less than the col2 value of a previous row; the result should have monotonically increasing col2 values.
Note that this is not just about comparing two adjacent rows.
For example, say the col2 values of 4 rows are 4, 2, 3, 5. The result should be 4, 5, as both the 2nd and 3rd rows are less than 4 (the first row's value).
val input = Seq(
(1,4), (2,2), (3,3), (4,5), (5, 1), (6, 9), (7, 6)
).toDF("timestamp", "value")
scala> input.show
+---------+-----+
|timestamp|value|
+---------+-----+
| 1| 4|
| 2| 2|
| 3| 3|
| 4| 5|
| 5| 1|
| 6| 9|
| 7| 6|
+---------+-----+
val expected = Seq((1,4), (4,5), (6, 9)).toDF("timestamp", "value")
scala> expected.show
+---------+-----+
|timestamp|value|
+---------+-----+
| 1| 4|
| 4| 5|
| 6| 9|
+---------+-----+
Please note that:
rows 2 and 3 are filtered out as their values are less than the value in row 1, i.e. 4
row 5 is filtered out as its value is less than the value in row 4, i.e. 5
row 7 is filtered out as its value is less than the value in row 6, i.e. 9
Generally speaking, is there a way to filter rows based on a comparison of one row's value with the values in the previous rows?
I think what you're after is called a running maximum (by analogy with a running total). That always leads me to windowed aggregation.
// I made the input a bit more tricky
val input = Seq(
(1,4), (2,2), (3,3), (4,5), (5, 1), (6, 9), (7, 6)
).toDF("timestamp", "value")
scala> input.show
+---------+-----+
|timestamp|value|
+---------+-----+
| 1| 4|
| 2| 2|
| 3| 3|
| 4| 5|
| 5| 1|
| 6| 9|
| 7| 6|
+---------+-----+
I'm aiming at the following expected result. Correct me if I'm wrong.
val expected = Seq((1,4), (4,5), (6, 9)).toDF("timestamp", "value")
scala> expected.show
+---------+-----+
|timestamp|value|
+---------+-----+
| 1| 4|
| 4| 5|
| 6| 9|
+---------+-----+
The trick to use for "running" problems is to use rangeBetween when defining a window specification.
import org.apache.spark.sql.expressions.Window
val ts = Window
.orderBy("timestamp")
.rangeBetween(Window.unboundedPreceding, Window.currentRow)
With the window spec you filter out what you want to get rid of from the result and you're done.
val result = input
.withColumn("running_max", max("value") over ts)
.where($"running_max" === $"value")
.select("timestamp", "value")
scala> result.show
18/05/29 22:09:18 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---------+-----+
|timestamp|value|
+---------+-----+
| 1| 4|
| 4| 5|
| 6| 9|
+---------+-----+
As you can see it's not very efficient since it only uses a single partition (that leads to a poor single-threaded execution and so not much difference from running the experiment on a single machine).
I think we could partition the input, calculate the running maximum per partition, then union the partial results and re-run the running maximum calculation. Just a thought; I have not tried it out myself.
Checking equality with the running maximum should do the trick:
val input = Seq((1,4), (2,2), (3,3), (4,5), (5, 1), (6, 9), (7, 6)).toDF("timestamp", "value")
input.show()
+---------+-----+
|timestamp|value|
+---------+-----+
| 1| 4|
| 2| 2|
| 3| 3|
| 4| 5|
| 5| 1|
| 6| 9|
| 7| 6|
+---------+-----+
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

input
  .withColumn("max", max($"value").over(Window.orderBy($"timestamp")))
  .where($"value" === $"max").drop($"max")
  .show()
+---------+-----+
|timestamp|value|
+---------+-----+
| 1| 4|
| 4| 5|
| 6| 9|
+---------+-----+
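For completeness, the same running-maximum filter in PySpark, as a direct translation of the Scala above (the input DataFrame is called input_df here only to avoid shadowing Python's built-in input):

from pyspark.sql import functions as F, Window

# Running maximum over all previous rows (ordered by timestamp), then keep
# only the rows whose value equals that running maximum.
w = Window.orderBy("timestamp").rangeBetween(Window.unboundedPreceding, Window.currentRow)
result = (input_df
          .withColumn("running_max", F.max("value").over(w))
          .where(F.col("value") == F.col("running_max"))
          .drop("running_max"))
result.show()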