I have the below two DataFrames:
MasterDF
NumberDF (created using a Hive load)
Desired output:
Logic to populate:
For Field1, pick sch_id where CAT='PAY' and SUB_CAT='client'
For Field2, pick sch_id where CAT='PAY' and SUB_CAT='phr'
For Field3, pick pay_id where CAT='credit' and SUB_CAT='spGrp'
Currently, before joining, I filter NumberDF and then pick the value.
EX:
masterDF.as("master")
  .join(
    NumberDF.filter(col("CAT") === "PAY" && col("SUB_CAT") === "phr").as("number"),
    $"master.id" === $"number.id",
    "leftouter"
  )
  .select($"master.*", $"number.sch_id".as("field1"))
The above approach would need multiple joins. I looked into the pivot function, but it does not solve my problem.
Note: please ignore any syntax errors in the code.
A better solution is to pivot the DataFrame (numberDF) on the subject column before joining it with studentDF.
The PySpark code looks like this:
numberDF = spark.createDataFrame([(1, "Math", 80), (1, "English", 60), (1, "Science", 80)], ["id", "subject", "marks"])
studentDF = spark.createDataFrame([(1, "Vikas")],["id","name"])
>>> numberDF.show()
+---+-------+-----+
| id|subject|marks|
+---+-------+-----+
| 1| Math| 80|
| 1|English| 60|
| 1|Science| 80|
+---+-------+-----+
>>> studentDF.show()
+---+-----+
| id| name|
+---+-----+
| 1|Vikas|
+---+-----+
pivotNumberDF = numberDF.groupBy("id").pivot("subject").sum("marks")
>>> pivotNumberDF.show()
+---+-------+----+-------+
| id|English|Math|Science|
+---+-------+----+-------+
| 1| 60| 80| 80|
+---+-------+----+-------+
>>> studentDF.join(pivotNumberDF, "id").show()
+---+-----+-------+----+-------+
| id| name|English|Math|Science|
+---+-----+-------+----+-------+
| 1|Vikas| 60| 80| 80|
+---+-----+-------+----+-------+
ref: http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html
Finally, I implemented it using pivot:
flights.groupBy("ID", "CAT")
  .pivot("SUB_CAT", Seq("client", "phr", "spGrp"))
  .agg(avg("SCH_ID").as("SCH_ID"), avg("pay_id").as("pay_id"))
  .groupBy("ID")
  .pivot("CAT", Seq("credit", "pay"))
  .agg(
    avg("client_SCH_ID").as("client_sch_id"), avg("client_pay_id").as("client_pay_id"),
    avg("phr_SCH_ID").as("phr_SCH_ID"), avg("phr_pay_id").as("phr_pay_id"),
    avg("spGrp_SCH_ID").as("spGrp_SCH_ID"), avg("spGrp_pay_id").as("spGrp_pay_id")
  )
The first pivot would return a table like:
+---+------+-------------+-------------+----------+----------+------------+------------+
| ID|   CAT|client_SCH_ID|client_pay_id|phr_SCH_ID|phr_pay_id|spGrp_SCH_ID|spGrp_pay_id|
+---+------+-------------+-------------+----------+----------+------------+------------+
|  1|credit|          5.0|        105.0|       4.0|     104.0|         6.0|       106.0|
|  1|   pay|          2.0|        102.0|       1.0|     101.0|         3.0|       103.0|
+---+------+-------------+-------------+----------+----------+------------+------------+
After the second pivot it would look like this:
+---+--------------------+--------------------+-----------------+-----------------+-------------------+-------------------+-----------------+-----------------+--------------+--------------+----------------+----------------+
| ID|credit_client_sch_id|credit_client_pay_id|credit_phr_SCH_ID|credit_phr_pay_id|credit_spGrp_SCH_ID|credit_spGrp_pay_id|pay_client_sch_id|pay_client_pay_id|pay_phr_SCH_ID|pay_phr_pay_id|pay_spGrp_SCH_ID|pay_spGrp_pay_id|
+---+--------------------+--------------------+-----------------+-----------------+-------------------+-------------------+-----------------+-----------------+--------------+--------------+----------------+----------------+
|  1|                 5.0|               105.0|              4.0|            104.0|                6.0|              106.0|              2.0|            102.0|           1.0|         101.0|             3.0|           103.0|
+---+--------------------+--------------------+-----------------+-----------------+-------------------+-------------------+-----------------+-----------------+--------------+--------------+----------------+----------------+
Though I am not sure about performance.
numberDF.createOrReplaceTempView("NumberDF")
masterDF.createOrReplaceTempView("MasterDF")

val sqlDF = spark.sql("""
  select m.id, t1.fld1, t2.fld2, t3.fld3, m.otherfields
  from
    (select id, (case when n.cat = 'pay' and n.sub_cat = 'client' then n.sch_id end) fld1
       from NumberDF n
      where case when n.cat = 'pay' and n.sub_cat = 'client' then n.sch_id end is not null) t1,
    (select id, (case when n.cat = 'pay' and n.sub_cat = 'phr' then n.sch_id end) fld2
       from NumberDF n
      where case when n.cat = 'pay' and n.sub_cat = 'phr' then n.sch_id end is not null) t2,
    (select id, (case when n.cat = 'credit' and n.sub_cat = 'spGrp' then n.pay_id end) fld3
       from NumberDF n
      where case when n.cat = 'credit' and n.sub_cat = 'spGrp' then n.pay_id end is not null) t3,
    MasterDF m
  -- join each derived table back to the master row on id
  where m.id = t1.id
    and m.id = t2.id
    and m.id = t3.id
""")

sqlDF.show()
I have three DataFrames, df1 (EMPLOYEE_INFO), df2 (DEPARTMENT_INFO), and df3 (COMPANY_INFO), and I want to update a column in df1 by joining all three DataFrames. The column is FLAG_DEPARTMENT, which is in df1, and I need to set FLAG_DEPARTMENT='POLITICS'. As a SQL query it would look like this:
UPDATE [COMPANY_INFO] INNER JOIN ([DEPARTMENT_INFO]
INNER JOIN [EMPLOYEE_INFO] ON [DEPARTMENT_INFO].DEPT_ID = [EMPLOYEE_INFO].DEPT_ID)
ON [COMPANY_INFO].[COMPANY_DEPT_ID] = [DEPARTMENT_INFO].[DEP_COMPANYID]
SET EMPLOYEE_INFO.FLAG_DEPARTMENT = "POLITICS";
If the values in the columns of these three tables match, I need to set FLAG_DEPARTMENT='POLITICS' in my EMPLOYEE_INFO table.
How can I achieve the same thing in PySpark? I have just started learning PySpark and don't have much in-depth knowledge.
You can use a chain of joins with a select on top of it.
Suppose that you have the following pyspark DataFrames:
employee_df
+---------+-------+
| Name|dept_id|
+---------+-------+
| John| dept_a|
| Liù| dept_b|
| Luke| dept_a|
| Michail| dept_a|
| Noe| dept_e|
|Shinchaku| dept_c|
| Vlad| dept_e|
+---------+-------+
department_df
+-------+----------+------------+
|dept_id|company_id| description|
+-------+----------+------------+
| dept_a| company1|Department A|
| dept_b| company2|Department B|
| dept_c| company5|Department C|
| dept_d| company3|Department D|
+-------+----------+------------+
company_df
+----------+-----------+
|company_id|description|
+----------+-----------+
| company1| Company 1|
| company2| Company 2|
| company3| Company 3|
| company4| Company 4|
+----------+-----------+
Then you can run the following code to add the flag_department column to your employee_df:
from pyspark.sql import functions as F
employee_df = (
employee_df.alias('a')
.join(
department_df.alias('b'),
on='dept_id',
how='left',
)
.join(
company_df.alias('c'),
on=F.col('b.company_id') == F.col('c.company_id'),
how='left',
)
.select(
*[F.col(f'a.{c}') for c in employee_df.columns],
F.when(
F.col('b.dept_id').isNotNull() & F.col('c.company_id').isNotNull(),
F.lit('POLITICS')
).alias('flag_department')
)
)
The new employee_df will be:
+---------+-------+---------------+
| Name|dept_id|flag_department|
+---------+-------+---------------+
| John| dept_a| POLITICS|
| Liù| dept_b| POLITICS|
| Luke| dept_a| POLITICS|
| Michail| dept_a| POLITICS|
| Noe| dept_e| null|
|Shinchaku| dept_c| null|
| Vlad| dept_e| null|
+---------+-------+---------------+
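A possible follow-up, as a sketch rather than part of the answer above: the question says FLAG_DEPARTMENT already exists in df1, so you may want to keep the old value for non-matching rows instead of leaving null. Assuming employee_df already carries a flag_department column, and writing joined_df for the result of the two left joins shown above (aliases a, b, c), it could look like this:

from pyspark.sql import functions as F

# Sketch only: joined_df is assumed to be the two-left-join result built above,
# and employee_df is assumed to already contain a flag_department column.
employee_df = joined_df.select(
    *[F.col(f'a.{c}') for c in employee_df.columns if c != 'flag_department'],
    F.when(
        F.col('b.dept_id').isNotNull() & F.col('c.company_id').isNotNull(),
        F.lit('POLITICS'),
    ).otherwise(F.col('a.flag_department')).alias('flag_department'),
)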
I'm using pyspark on a 2.X Spark version for this.
I have 2 SQL DataFrames, df1 and df2. df1 is a union of multiple small DataFrames with the same header names:
df1 = (
df1_1.union(df1_2)
.union(df1_3)
.union(df1_4)
.union(df1_5)
.union(df1_6)
.union(df1_7)
.distinct()
)
df2 does not have the same header names.
What I'm trying to achieve is to create a new column and fill it with one of two values, depending on a condition. The condition is roughly: if the column of df1 contains an element of a column of df2, then write A, else B.
So I tried something like this:
df1 = df1.withColumn(
"new_col",
when(df1["ColA"].substr(0, 4).contains(df2["ColA_a"]), "A").otherwise(
"B"
),
)
All fields are string types.
I also tried using isin, but the error is the same.
Note: substr(0, 4) is because in df1["ColA"] I only need the first 4 characters of my field to match df2["ColA_a"].
The error I get is:
py4j.protocol.Py4JJavaError: An error occurred while calling o660.select. :
org.apache.spark.sql.AnalysisException: Resolved attribute(s) ColA_a#444 missing from
ColA#438,ColB#439 in operator !Project [Contains(ColA#438, ColA_a#444) AS contains(ColA, ColA_a)#451].;;
The solutions I've read about on the Internet and tried:
Cloning the DataFrames
Collecting a DataFrame and creating a new one from it (here we lose the performance of Spark, and that's very sad)
Renaming the columns to have the same name, or different names (ambiguous naming?)
EDIT:
Here is some input/output, as requested.
df1
+-----+-----+-----+
| Col1| ColA| ColB|
+-----+-----+-----+
|value|3062x|value|
|value|2156x|value|
|value|3059x|value|
|value|3044x|value|
|value|2661x|value|
|value|2400x|value|
|value|1907x|value|
|value|4384x|value|
|value|4427x|value|
|value|2091x|value|
+-----+-----+-----+
df2
+------+------+
|ColA_a|ColB_b|
+------+------+
| 2156| GMVT7|
| 2156| JQL71|
| 2156| JZDSQ|
| 2050| GX8PH|
| 2050| G67CV|
| 2050| JFFF7|
| 2031| GCT5C|
| 2170| JN0LB|
| 2129| J2PRG|
| 2091| G87WT|
+------+------+
output
+-----+-----+-----+-------+
| Col1| ColA| ColB|new_col|
+-----+-----+-----+-------+
|value|3062x|value| B |
|value|2156x|value| A |
|value|3059x|value| B |
|value|3044x|value| B |
|value|2661x|value| B |
|value|2400x|value| B |
|value|1907x|value| B |
|value|4384x|value| B |
|value|4427x|value| B |
|value|2091x|value| A |
+-----+-----+-----+-------+
You can use an rlike join to determine whether the value exists in the other column:
from pyspark.sql import functions as F

df1 = sqlContext.createDataFrame([
('value',3062,'value'),
('value',2156,'value'),
('value',3059,'value'),
('value',3044,'value'),
('value',2661,'value'),
('value',2400,'value'),
('value',1907,'value'),
('value',4384,'value'),
('value',4427,'value'),
('value',2091,'value')
],schema=['Col1', 'ColA', 'ColB'])
df2 =sqlContext.createDataFrame([
(2156, 'GMVT7'),
( 2156, 'JQL71'),
( 2156, 'JZDSQ'),
( 2050, 'GX8PH'),
( 2050, 'G67CV'),
( 2050, 'JFFF7'),
( 2031, 'GCT5C'),
( 2170, 'JN0LB'),
( 2129, 'J2PRG'),
( 2091, 'G87WT')],schema=['ColA_a','ColB_b'])
#%%
df_join = df1.join(df2.select('ColA_a').distinct(),F.expr("""ColA rlike ColA_a"""),how = 'left')
df_fin = df_join.withColumn("new_col",F.when(F.col('ColA_a').isNull(),'B').otherwise('A'))
df_fin.show()
+-----+----+-----+------+-------+
| Col1|ColA| ColB|ColA_a|new_col|
+-----+----+-----+------+-------+
|value|3062|value| null| B|
|value|2156|value| 2156| A|
|value|3059|value| null| B|
|value|3044|value| null| B|
|value|2661|value| null| B|
|value|2400|value| null| B|
|value|1907|value| null| B|
|value|4384|value| null| B|
|value|4427|value| null| B|
|value|2091|value| 2091| A|
+-----+----+-----+------+-------+
If you prefer not to use an rlike join, you can use the isin() method in your join condition:
df_join = df1.join(df2.select('ColA_a').distinct(),F.col('ColA').isin(F.col('ColA_a')),how = 'left')
df_fin = df_join.withColumn("new_col",F.when(F.col('ColA_a').isNull(),'B').otherwise('A'))
The results will be the same
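In either case, if you want the output to have exactly the shape shown in the question, you can drop the helper lookup column afterwards. A minimal sketch:

# Drop the joined lookup column so only Col1, ColA, ColB and new_col remain.
df_fin.drop('ColA_a').show()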
I have the following sample dataframe
df = spark.createDataFrame([('start','end'), ('start1','end1')] ,["start", "end"])
and I want to explode the values in each row and associate alternating 1-0 values in the generated rows. This way I can identify the start/end entries in each row.
I am able to achieve the desired result this way
from pyspark.sql import functions as fn
from pyspark.sql.functions import lit
from pyspark.sql.window import Window

w = Window().orderBy(lit('A'))
df = (df.withColumn('start_end', fn.array('start', 'end'))
        .withColumn('date', fn.explode('start_end'))
        .withColumn('row_num', fn.row_number().over(w)))
df = (df.withColumn('is_start', fn.when(fn.col('row_num') % 2 == 0, 0).otherwise(1))
        .select('date', 'is_start'))
which gives
| date | is_start |
|--------|----------|
| start | 1 |
| end | 0 |
| start1 | 1 |
| end1 | 0 |
but it seems overly complicated for such a simple task.
Is there any better/cleaner way without using UDFs?
You can use pyspark.sql.functions.posexplode along with pyspark.sql.functions.array.
First create an array out of your start and end columns, then explode this with the position:
from pyspark.sql.functions import array, posexplode
df.select(posexplode(array("end", "start")).alias("is_start", "date")).show()
#+--------+------+
#|is_start| date|
#+--------+------+
#| 0| end|
#| 1| start|
#| 0| end1|
#| 1|start1|
#+--------+------+
You can try union:
from pyspark.sql import functions as F

df = spark.createDataFrame([('start', 'end'), ('start1', 'end1')], ["start", "end"])
df = df.withColumn('startv', F.lit(1))
df = df.withColumn('endv', F.lit(0))
df = df.select(['start', 'startv']).union(df.select(['end', 'endv']))
df.show()
+------+------+
| start|startv|
+------+------+
| start| 1|
|start1| 1|
| end| 0|
| end1| 0|
+------+------+
You can rename the columns and re-order the rows starting here.
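For example, a minimal sketch of the renaming step, assuming the target column names date and is_start from the question:

# Rename to the target column names; re-ordering the rows would need an
# explicit ordering key, which this sample data does not carry.
df = (df.withColumnRenamed('start', 'date')
        .withColumnRenamed('startv', 'is_start'))
df.show()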
I had a similar situation in my use case. I had a huge dataset (~50 GB), and any self join or heavy transformation resulted in more memory pressure and unstable execution.
I went one level lower, to the underlying RDD, and used flatMap. This uses a map-side transformation and is cost-effective in terms of shuffle, CPU, and memory.
df = spark.createDataFrame([('start','end'), ('start1','end1')] ,["start", "end"])
df.show()
+------+----+
| start| end|
+------+----+
| start| end|
|start1|end1|
+------+----+
final_df = df.rdd.flatMap(lambda row: [(row.start, 1), (row.end, 0)]).toDF(['date', 'is_start'])
final_df.show()
+------+--------+
| date|is_start|
+------+--------+
| start| 1|
| end| 0|
|start1| 1|
| end1| 0|
+------+--------+
I am working on Spark using Java, where I download data from an API and compare it with MongoDB data; the downloaded JSON has 15-20 fields, but the database has 300 fields.
Now my task is to compare the downloaded JSONs with the MongoDB data and get whichever fields changed relative to the past data.
Sample data set
Downloaded data from API
StudentId,Name,Phone,Email
1,tony,123,a#g.com
2,stark,456,b#g.com
3,spidy,789,c#g.com
Mongodb data
StudentId,Name,Phone,Email,State,City
1,tony,1234,a#g.com,NY,Nowhere
2,stark,456,bg#g.com,NY,Nowhere
3,spidy,789,c#g.com,OH,Nowhere
I can't use except, because the number of columns differs.
Expected output
StudentId,Name,Phone,Email,Past_Phone,Past_Email
1,tony,1234,a#g.com,1234, //phone number only changed
2,stark,456,b#g.com,,bg#g.com //Email only changed
3,spidy,789,c#g.com,,
Consider that your data is in two DataFrames. We can create temporary views for them, as shown below:
api_df.createOrReplaceTempView("api_data")
mongo_df.createOrReplaceTempView("mongo_data")
Next we can use Spark SQL. Here, we join both these views using the StudentId column and then use a case statement on top of them to compute the past phone number and email.
spark.sql("""
select a.*
, case when a.Phone = b.Phone then '' else b.Phone end as Past_phone
, case when a.Email = b.Email then '' else b.Email end as Past_Email
from api_data a
join mongo_data b
on a.StudentId = b.StudentId
order by a.StudentId""").show()
Output:
+---------+-----+-----+-------+----------+----------+
|StudentId| Name|Phone| Email|Past_phone|Past_Email|
+---------+-----+-----+-------+----------+----------+
| 1| tony| 123|a#g.com| 1234| |
| 2|stark| 456|b#g.com| | bg#g.com|
| 3|spidy| 789|c#g.com| | |
+---------+-----+-----+-------+----------+----------+
Please find the sample source code below. Here I am taking only the phone number condition as an example.
val list = List((1,"tony",123,"a#g.com"), (2,"stark",456,"b#g.com"),
  (3,"spidy",789,"c#g.com"))
val df1 = list.toDF("StudentId","Name","Phone","Email")
.select('StudentId as "StudentId_1", 'Name as "Name_1",'Phone as "Phone_1",
'Email as "Email_1")
df1.show()
val list1 = List((1,"tony",1234,"a#g.com","NY","Nowhere"),
(2,"stark",456,"bg#g.com", "NY", "Nowhere"),
(3,"spidy",789,"c#g.com","OH","Nowhere"))
val df2 = list1.toDF("StudentId","Name","Phone","Email","State","City")
.select('StudentId as "StudentId_2", 'Name as "Name_2", 'Phone as "Phone_2",
'Email as "Email_2", 'State as "State_2", 'City as "City_2")
df2.show()
val df3 = df1.join(df2, df1("StudentId_1") ===
df2("StudentId_2")).where(df1("Phone_1") =!= df2("Phone_2"))
df3.withColumnRenamed("Phone_1", "Past_Phone").show()
+-----------+------+-------+-------+
|StudentId_1|Name_1|Phone_1|Email_1|
+-----------+------+-------+-------+
| 1| tony| 123|a#g.com|
| 2| stark| 456|b#g.com|
| 3| spidy| 789|c#g.com|
+-----------+------+-------+-------+
+-----------+------+-------+--------+-------+-------+
|StudentId_2|Name_2|Phone_2| Email_2|State_2| City_2|
+-----------+------+-------+--------+-------+-------+
| 1| tony| 1234| a#g.com| NY|Nowhere|
| 2| stark| 456|bg#g.com| NY|Nowhere|
| 3| spidy| 789| c#g.com| OH|Nowhere|
+-----------+------+-------+--------+-------+-------+
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
|StudentId_1|Name_1|Past_Phone|Email_1|StudentId_2|Name_2|Phone_2|Email_2|State_2| City_2|
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
| 1| tony| 123|a#g.com| 1| tony| 1234|a#g.com| NY|Nowhere|
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
We have:
df1.show
+-----------+------+-------+-------+
|StudentId_1|Name_1|Phone_1|Email_1|
+-----------+------+-------+-------+
| 1| tony| 123|a#g.com|
| 2| stark| 456|b#g.com|
| 3| spidy| 789|c#g.com|
+-----------+------+-------+-------+
df2.show
+-----------+------+-------+--------+-------+-------+
|StudentId_2|Name_2|Phone_2| Email_2|State_2| City_2|
+-----------+------+-------+--------+-------+-------+
| 1| tony| 1234| a#g.com| NY|Nowhere|
| 2| stark| 456|bg#g.com| NY|Nowhere|
| 3| spidy| 789| c#g.com| OH|Nowhere|
+-----------+------+-------+--------+-------+-------+
After the join:
var jn = df2.join(df1,df1("StudentId_1")===df2("StudentId_2"))
Then
var ans = jn.withColumn("Past_Phone", when(jn("Phone_2").notEqual(jn("Phone_1")),jn("Phone_1")).otherwise("")).withColumn("Past_Email", when(jn("Email_2").notEqual(jn("Email_1")),jn("Email_1")).otherwise(""))
Reference : Spark: Add column to dataframe conditionally
Next:
ans.select(ans("StudentId_2") as "StudentId",ans("Name_2") as "Name",ans("Phone_2") as "Phone",ans("Email_2") as "Email",ans("Past_Email"),ans("Past_Phone")).show
+---------+-----+-----+--------+----------+----------+
|StudentId| Name|Phone| Email|Past_Email|Past_Phone|
+---------+-----+-----+--------+----------+----------+
| 1| tony| 1234| a#g.com| | 123|
| 2|stark| 456|bg#g.com| b#g.com| |
| 3|spidy| 789| c#g.com| | |
+---------+-----+-----+--------+----------+----------+
; WITH Hierarchy as
(
select distinct PersonnelNumber
, Email
, ManagerEmail
from dimstage
union all
select e.PersonnelNumber
, e.Email
, e.ManagerEmail
from dimstage e
join Hierarchy as h on e.Email = h.ManagerEmail
)
select * from Hierarchy
Can you help me achieve the same in Spark SQL?
This is quite late, but today I tried to implement a recursive CTE query using PySpark SQL.
Here, I have this simple dataframe. What I want to do is to find the NEWEST ID of each ID.
The original dataframe:
+-----+-----+
|OldID|NewID|
+-----+-----+
| 1| 2|
| 2| 3|
| 3| 4|
| 4| 5|
| 6| 7|
| 7| 8|
| 9| 10|
+-----+-----+
The result I want:
+-----+-----+
|OldID|NewID|
+-----+-----+
| 1| 5|
| 2| 5|
| 3| 5|
| 4| 5|
| 6| 8|
| 7| 8|
| 9| 10|
+-----+-----+
Here is my code:
from pyspark.sql.functions import broadcast

df = sqlContext.createDataFrame([(1, 2), (2, 3), (3, 4), (4, 5), (6, 7), (7, 8), (9, 10)], "OldID integer,NewID integer").checkpoint().cache()

dfcheck = df.drop('NewID')
dfdistinctID = df.select('NewID').distinct()
dfidfinal = dfdistinctID.join(dfcheck, [dfcheck.OldID == dfdistinctID.NewID], how="left_anti")  # We find the IDs that have not been replaced
dfcurrent = df.join(dfidfinal, [dfidfinal.NewID == df.NewID], how="left_semi").checkpoint().cache()  # We find the rows related to the IDs that have not been replaced and assign them to the dfcurrent dataframe
dfresult = dfcurrent
dfdifferentalias = df.select(df.OldID.alias('id1'), df.NewID.alias('id2')).checkpoint().cache()

while dfcurrent.count() > 0:
    dfcurrent = dfcurrent.join(broadcast(dfdifferentalias), [dfcurrent.OldID == dfdifferentalias.id2], how="inner").select(dfdifferentalias.id1.alias('OldID'), dfcurrent.NewID.alias('NewID')).cache()
    dfresult = dfresult.unionAll(dfcurrent)

display(dfresult.orderBy('OldID'))
Databricks notebook screenshot
I know that the performance is quite bad, but at least it gives the answer I need.
This is the first time I have posted an answer on Stack Overflow, so forgive me if I made any mistakes.
This is not possible using Spark SQL. The WITH clause exists, but not with CONNECT BY like in, say, Oracle, or recursion like in DB2.
The Spark documentation provides a "CTE in CTE definition". This is reproduced below:
-- CTE in CTE definition
WITH t AS (
WITH t2 AS (SELECT 1)
SELECT * FROM t2
)
SELECT * FROM t;
+---+
| 1|
+---+
| 1|
+---+
You can extend this to multiple nested queries, but the syntax can quickly become awkward. My suggestion is to use comments to make it clear where the next select statement is pulling from. Essentially, start with the first query and place additional CTE statements above and below as needed:
WITH t3 AS (
WITH t2 AS (
WITH t1 AS (SELECT distinct b.col1
FROM data_a as a, data_b as b
WHERE a.col2 = b.col2
AND a.col3 = b.col3
-- select from t1
)
SELECT distinct b.col1, b.col2, b.col3
FROM t1 as a, data_b as b
WHERE a.col1 = b.col1
-- select from t2
)
SELECT distinct b.col1
FROM t2 as a, data_b as b
WHERE a.col2 = b.col2
AND a.col3 = b.col3
-- select from t3
)
SELECT distinct b.col1, b.col2, b.col3
FROM t3 as a, data_b as b
WHERE a.col1 = b.col1;
You can recursively use createOrReplaceTempView to build a recursive query. It's not going to be fast, nor pretty, but it works. Following #Pblade's example, PySpark:
from pyspark.sql import functions as F

def recursively_resolve(df):
    rec = df.withColumn('level', F.lit(0))
    sql = """
    select this.oldid
         , coalesce(next.newid, this.newid) as newid
         , this.level + case when next.newid is not null then 1 else 0 end as level
         , next.newid is not null as is_resolved
      from rec this
      left outer
      join rec next
        on next.oldid = this.newid
    """
    find_next = True
    while find_next:
        rec.createOrReplaceTempView("rec")
        rec = spark.sql(sql)
        # check if any rows resolved in this iteration
        # go deeper if they did
        find_next = rec.selectExpr("ANY(is_resolved = True)").collect()[0][0]
    return rec.drop('is_resolved')
Then:
src = spark.createDataFrame([(1, 2), (2, 3), (3, 4), (4, 5), (6, 7), (7, 8),(9, 10)], "OldID integer,NewID integer")
result = recursively_resolve(src)
result.show()
Prints:
+-----+-----+-----+
|oldid|newid|level|
+-----+-----+-----+
| 2| 5| 2|
| 4| 5| 0|
| 3| 5| 1|
| 7| 8| 0|
| 6| 8| 1|
| 9| 10| 0|
| 1| 5| 2|
+-----+-----+-----+
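If you only need the OldID/NewID pairs in the shape shown in the question, a possible final cleanup (a sketch, not part of the answer above) is:

# Drop the helper level column and order by oldid to match the desired output.
result.drop('level').orderBy('oldid').show()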