How to compare two dataframes and add a new flag column in PySpark?

I have created two data frames by executing the commands below.
test1 = sc.parallelize([
    ("a", 1, 1),
    ("b", 2, 2),
    ("d", 4, 2),
    ("e", 4, 1),
    ("c", 3, 4)]).toDF(['SID', 'SSection', 'SRank'])
test1.show()
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 1| 1|
| b| 2| 2|
| d| 4| 2|
| e| 4| 1|
| c| 3| 4|
+---+--------+-----+
test2 = sc.parallelize([
    ("a", 1, 1),
    ("b", 2, 3),
    ("f", 4, 2),
    ("e", 4, 1),
    ("c", 3, 4)]).toDF(['SID', 'SSection', 'SRank'])
test2.show()
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 1| 1|
| b| 2| 3|
| f| 4| 2|
| e| 4| 1|
| c| 3| 4|
+---+--------+-----+
Using the test1 and test2 data frames, I need to produce a new dataframe that contains the result below.
+---+--------+----------+----------+------------+
|SID|SSection|test1SRank|test2SRank|        flag|
+---+--------+----------+----------+------------+
|  a|       1|         1|         1|   same_rank|
|  b|       2|         2|         3|rank_changed|
|  d|       4|         2|         0|     No_rank|
|  e|       4|         1|         1|   same_rank|
|  c|       3|         4|         4|   same_rank|
|  f|       4|         0|         2|    new_rank|
+---+--------+----------+----------+------------+
I want to produce the result above by comparing test1 and test2 on the combination of the SID and SSection columns and then comparing the ranks.
For example:
1) SID (a) and SSection (1): the rank in test1 is 1 and in test2 is 1, so the flag value should be same_rank.
2) SID (b) and SSection (2): the rank in test1 is 2 and in test2 is 3; the rank changed, so the flag value should be rank_changed.
3) SID (d) and SSection (4): the rank in test1 is 2 and in test2 the entry is missing (the rank was lost), so the flag value should be No_rank.
4) SID (f) and SSection (4): there is no entry in test1 (no rank), and in test2 the rank is 2, so the flag value should be new_rank.

This should give you what you want:
from pyspark.sql import functions as f
test3 = test1.withColumnRenamed('SRank', 'test1SRank')\
    .join(test2.drop('SSection').withColumnRenamed('SRank', 'test2SRank'),
          on='SID', how='outer')\
    .fillna(0)
test3 = test3.withColumn('flag', f.expr(
    "case when test1SRank=0 and test2SRank>0 then 'new_rank' "
    "when test1SRank>0 and test2SRank=0 then 'No_rank' "
    "when test1SRank=test2SRank then 'same_rank' "
    "else 'rank_changed' end"))
test3.orderBy('SID').show()
Explanation: outer join the data frames so you have test1 and test2 ranks for all SIDs, then fill the nulls with 0 and compute the flag with a SQL CASE WHEN expression.
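If you want to join on both SID and SSection (as the question describes) and prefer Python column expressions over a SQL string, here is a roughly equivalent sketch of the same flag logic (only checked mentally against the sample data):
from pyspark.sql import functions as f

t1 = test1.withColumnRenamed('SRank', 'test1SRank')
t2 = test2.withColumnRenamed('SRank', 'test2SRank')

# Outer join on the key combination, then fill missing ranks with 0
joined = t1.join(t2, on=['SID', 'SSection'], how='outer').fillna(0)

result = joined.withColumn(
    'flag',
    f.when((f.col('test1SRank') == 0) & (f.col('test2SRank') > 0), 'new_rank')
     .when((f.col('test1SRank') > 0) & (f.col('test2SRank') == 0), 'No_rank')
     .when(f.col('test1SRank') == f.col('test2SRank'), 'same_rank')
     .otherwise('rank_changed')
)
result.orderBy('SID').show()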

Related

Add specific rows in PySpark

I'm trying to transform a Python notebook into a PySpark pipeline and I'm blocked by what seems to be a simple problem.
I have this dataframe after a count aggregation grouped by Id:
| Id | count |
| 0 | 5 |
| 1 | 3 |
| 4 | 6 |
And I want this :
| Id | count |
| 0 | 5 |
| 1 | 3 |
| 2 | 0 |
| 3 | 0 |
| 4 | 6 |
| 5 | 0 |
I have tried to add a [0,1,3,4,5] array to each row, then explode_outer this array, then tried to find a way to keep the rows I need, but it seems a bit complicated for this simple case.
Do you have any tips?
Thanks in advance.
original.show()
+---+-----+
| id|count|
+---+-----+
| 1| 12|
| 3| 15|
+---+-----+
df = spark.createDataFrame([(0,0),(1,0),(2,0),(3,0),(4,0),(5,0)],['id', 'default_count'])
df.show()
+---+-------------+
| id|default_count|
+---+-------------+
| 0| 0|
| 1| 0|
| 2| 0|
| 3| 0|
| 4| 0|
| 5| 0|
+---+-------------+
from pyspark.sql import functions as F
result = (original.join(df, on='id', how='right')
          .withColumn('count', F.coalesce(F.col('count'), F.col('default_count')))
          .orderBy(F.col('id'))
          .drop(F.col('default_count')))
result.show()
+---+-----+
| id|count|
+---+-----+
| 0| 0|
| 1| 12|
| 2| 0|
| 3| 15|
| 4| 0|
| 5| 0|
+---+-----+
df.show()
+---+-----+
| Id|count|
+---+-----+
| 0| 5|
| 1| 3|
| 4| 6|
+---+-----+
extra_rows = spark.createDataFrame([(2, 0),
                                    (3, 0),
                                    (5, 0)],
                                   ['Id', 'count'])
df.unionByName(extra_rows).orderBy('Id').show()
+---+-----+
| Id|count|
+---+-----+
| 0| 5|
| 1| 3|
| 2| 0|
| 3| 0|
| 4| 6|
| 5| 0|
+---+-----+
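Both answers hard-code the missing ids. If the id range is known only by its bounds, a sketch (assuming the ids should run from 0 to 5) that generates the default rows with spark.range instead:
from pyspark.sql import functions as F

# All ids expected in the result (assumed to be 0..5 here)
all_ids = spark.range(0, 6).withColumnRenamed('id', 'Id')

result = (all_ids.join(df, on='Id', how='left')
                 .withColumn('count', F.coalesce(F.col('count'), F.lit(0)))
                 .orderBy('Id'))
result.show()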

Pyspark: Stitching multiple event rows in windows

I am trying to stitch a few event rows in a dataframe together based on the time difference between them. I have created a new column in the dataframe which represents the time difference from the previous row, computed using lag. The dataframe looks as follows:
sc = spark.sparkContext
df = spark.createDataFrame(
    sc.parallelize(
        [['x', 1, "9999"], ['x', 2, "120"], ['x', 3, "102"],
         ['x', 4, "3000"], ['x', 5, "299"], ['x', 6, "100"]]
    ),
    ['id', "row_number", "time_diff"]
)
I want to stitch the rows together if the time_diff from the previous event is less than 160.
For this, I was planning to assign a new row number to all the events that are within 160 time units of each other and then group by the new row number.
For the above dataframe I wanted the output to be:
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
|  x|         1|     9999|             1|
|  x|         2|      120|             1|
|  x|         3|      102|             1|
|  x|         4|     3000|             4|
|  x|         5|      299|             5|
|  x|         6|      100|             5|
+---+----------+---------+--------------+
I wrote a program as follows:
from pyspark.sql import functions as f
from pyspark.sql.functions import when, col
from pyspark.sql.window import Window

window = Window.partitionBy('id').orderBy('row_number')
df2 = df.withColumn('new_row_number', col('id'))
df3 = df2.withColumn('new_row_number', when(col('time_diff') >= 160, col('id'))\
    .otherwise(f.lag(col('new_row_number')).over(window)))
but the output I got was as follows:
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
|  x|         1|     9999|             1|
|  x|         2|      120|             1|
|  x|         3|      102|             2|
|  x|         4|     3000|             4|
|  x|         5|      299|             5|
|  x|         6|      100|             5|
+---+----------+---------+--------------+
Can someone help me out in resolving this?
Thanks
You want the previous value of the column currently being populated, which is not possible; to achieve this we can do the following:
from pyspark.sql import functions as f
from pyspark.sql.window import Window
window = Window.partitionBy('id').orderBy('row_number')
df3 = df.withColumn('new_row_number', f.when(f.col('time_diff') >= 160, f.col('row_number')))\
    .withColumn("new_row_number", f.last(f.col("new_row_number"), ignorenulls=True).over(window))
df3.show()
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
| x| 1| 9999| 1|
| x| 2| 120| 1|
| x| 3| 102| 1|
| x| 4| 3000| 4|
| x| 5| 299| 5|
| x| 6| 100| 5|
+---+----------+---------+--------------+
To explain:
First we keep the row_number for every row whose time_diff is at least 160, and null otherwise:
df2=df.withColumn('new_row_number', f.when(f.col('time_diff')>=160, f.col('row_number')))
df2.show()
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
| x| 1| 9999| 1|
| x| 2| 120| null|
| x| 3| 102| null|
| x| 4| 3000| 4|
| x| 5| 299| 5|
| x| 6| 100| null|
+---+----------+---------+--------------+
Then we forward-fill the nulls with the last non-null value:
df3=df2.withColumn("new_row_number", f.last(f.col("new_row_number"), ignorenulls=True).over(window))
df3.show()
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
| x| 1| 9999| 1|
| x| 2| 120| 1|
| x| 3| 102| 1|
| x| 4| 3000| 4|
| x| 5| 299| 5|
| x| 6| 100| 5|
+---+----------+---------+--------------+
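If you then want to actually stitch the events, as the question plans, here is a sketch of the follow-up aggregation on the new row number (the aggregates and the column names start_row, end_row, n_events are only illustrative; pick whatever your use case needs):
stitched = (df3.groupBy('id', 'new_row_number')
               .agg(f.min('row_number').alias('start_row'),
                    f.max('row_number').alias('end_row'),
                    f.count('*').alias('n_events'))
               .orderBy('new_row_number'))
stitched.show()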
Hope this answers your question.

How does Spark rangeBetween work with descending order?

I thought rangeBetween(start, end) looks at values in range(cur_value - start, cur_value + end): https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/expressions/WindowSpec.html
But I saw an example where a descending orderBy() was used on a timestamp, followed by (unboundedPreceding, 0) with rangeBetween. This led me to explore the following example:
dd = spark.createDataFrame(
    [(1, "a"), (3, "a"), (3, "a"), (1, "b"), (2, "b"), (3, "b")],
    ['id', 'category']
)
dd.show()
# output
+---+--------+
| id|category|
+---+--------+
| 1| a|
| 3| a|
| 3| a|
| 1| b|
| 2| b|
| 3| b|
+---+--------+
It seems to include preceding rows whose value is higher by 1.
from pyspark.sql import Window
from pyspark.sql.functions import desc, sum as Fsum

byCategoryOrderedById = Window.partitionBy('category')\
    .orderBy(desc('id'))\
    .rangeBetween(-1, Window.currentRow)
dd.withColumn("sum", Fsum('id').over(byCategoryOrderedById)).show()
# output
+---+--------+---+
| id|category|sum|
+---+--------+---+
| 3| b| 3|
| 2| b| 5|
| 1| b| 3|
| 3| a| 6|
| 3| a| 6|
| 1| a| 1|
+---+--------+---+
And with start set to -2, it includes values that are greater by up to 2, but again only in preceding rows.
byCategoryOrderedById = Window.partitionBy('category')\
.orderBy(desc('id'))\
.rangeBetween(-2, Window.currentRow)
dd.withColumn("sum", Fsum('id').over(byCategoryOrderedById)).show()
# output
+---+--------+---+
| id|category|sum|
+---+--------+---+
| 3| b| 3|
| 2| b| 5|
| 1| b| 6|
| 3| a| 6|
| 3| a| 6|
| 1| a| 7|
+---+--------+---+
So, what is the exact behavior of rangeBetween with desc orderBy?
It's not well documented, but when using range (value-based) frames, the ascending or descending sort order determines which values are included in the frame.
Let's take the example you provided:
RANGE BETWEEN 1 PRECEDING AND CURRENT ROW
Depending on the order by direction, 1 PRECEDING means:
current_row_value - 1 if ASC
current_row_value + 1 if DESC
Consider the row with value 1 in partition b.
With the descending order, the frame includes:
the current value and all preceding values x where current_value <= x <= current_value + 1, i.e. (1, 2)
With the ascending order, the frame includes:
the current value and all preceding values x where current_value - 1 <= x <= current_value, i.e. (1)
P.S.: using rangeBetween(-1, Window.currentRow) with descending ordering is simply equivalent to rangeBetween(Window.currentRow, 1) with ascending ordering.
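To see that equivalence concretely, here is a small sketch (reusing the dd dataframe from the question) that computes both frames side by side; the two sum columns should match row for row:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Descending order, frame reaching one unit towards "preceding" (i.e. larger) values
desc_win = Window.partitionBy('category').orderBy(F.desc('id')).rangeBetween(-1, Window.currentRow)
# Ascending order, frame reaching one unit towards "following" (i.e. larger) values
asc_win = Window.partitionBy('category').orderBy('id').rangeBetween(Window.currentRow, 1)

dd.withColumn("sum_desc", F.sum('id').over(desc_win)) \
  .withColumn("sum_asc", F.sum('id').over(asc_win)) \
  .orderBy('category', 'id') \
  .show()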

How do I calculate the start/end of an interval (set of rows) containing identical values?

Assume we have a Spark DataFrame that looks like the following (ordered by time):
+------+-------+
| time | value |
+------+-------+
| 1    | A     |
| 2    | A     |
| 3    | A     |
| 4    | B     |
| 5    | B     |
| 6    | A     |
+------+-------+
I'd like to calculate the start/end times of each sequence of uninterrupted values. The expected output from the above DataFrame would be:
+-------+-------+-----+
| value | start | end |
+-------+-------+-----+
| A     | 1     | 3   |
| B     | 4     | 5   |
| A     | 6     | 6   |
+-------+-------+-----+
(The end value for the final row could also be null.)
Doing this with a simple group aggregation:
.groupBy("value")
.agg(
F.min("time").alias("start"),
F.max("time").alias("end")
)
doesn't take into account the fact that the same value can appear in multiple different intervals.
The idea is to create an identifier for each group of consecutive values and use it to group by and compute your min and max time.
Assuming df is your dataframe:
from pyspark.sql import functions as F, Window

# Flag rows where the value differs from the previous row (1 = a new group starts)
df = df.withColumn(
    "fg",
    F.when(
        F.lag('value').over(Window.orderBy("time")) == F.col("value"),
        0
    ).otherwise(1)
)
# Running sum of the flags yields a distinct identifier per consecutive group
df = df.withColumn(
    "rn",
    F.sum("fg").over(
        Window
        .orderBy("time")
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
)
From that point, you have your dataframe with an identifier for each consecutive group.
df.show()
+----+-----+---+---+
|time|value| rn| fg|
+----+-----+---+---+
| 1| A| 1| 1|
| 2| A| 1| 0|
| 3| A| 1| 0|
| 4| B| 2| 1|
| 5| B| 2| 0|
| 6| A| 3| 1|
+----+-----+---+---+
Then you just have to do the aggregation:
df.groupBy(
'value',
"rn"
).agg(
F.min('time').alias("start"),
F.max('time').alias("end")
).drop("rn").show()
+-----+-----+---+
|value|start|end|
+-----+-----+---+
| A| 1| 3|
| B| 4| 5|
| A| 6| 6|
+-----+-----+---+
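A different common technique for the same problem is the "difference of row numbers" (gaps-and-islands) trick; a sketch, starting again from the original df with just time and value (grp is a hypothetical helper column):
from pyspark.sql import functions as F, Window

w_time = Window.orderBy("time")
w_value = Window.partitionBy("value").orderBy("time")

# Rows in the same uninterrupted run share the same difference of the two row numbers
(df.withColumn("grp", F.row_number().over(w_time) - F.row_number().over(w_value))
   .groupBy("value", "grp")
   .agg(F.min("time").alias("start"), F.max("time").alias("end"))
   .drop("grp")
   .orderBy("start")
   .show())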

Keep track of the previous row values with additional condition using pyspark

I'm using PySpark to generate a dataframe where I need to update the 'amt' column with the previous row's 'amt' value, but only when amt = 0.
For example, below is my dataframe:
+---+-----+
| id|amt |
+---+-----+
| 1| 5|
| 2| 0|
| 3| 0|
| 4| 6|
| 5| 0|
| 6| 3|
+---+-----+
Now, I want the following DF to be created: whenever amt = 0, the modi_amt column should contain the previous row's non-zero value; otherwise there is no change.
+---+-----+----------+
| id|amt |modi_amt |
+---+-----+----------+
| 1| 5| 5|
| 2| 0| 5|
| 3| 0| 5|
| 4| 6| 6|
| 5| 0| 6|
| 6| 3| 3|
+---+-----+----------+
I'm able to get the previous row's value but need help for the rows where multiple 0 amt values appear consecutively (for example, id = 2, 3).
The code I'm using:
from pyspark.sql import functions as F
from pyspark.sql.functions import when
from pyspark.sql.window import Window

my_window = Window.partitionBy().orderBy("id")
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
DF = DF.withColumn("modi_amt", when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt)).drop('prev_amt')
I'm getting the DF below:
+---+-----+----------+
| id|amt |modi_amt |
+---+-----+----------+
| 1| 5| 5|
| 2| 0| 5|
| 3| 0| 0|
| 4| 6| 6|
| 5| 0| 6|
| 6| 3| 3|
+---+-----+----------+
Basically, id 3 should also have modi_amt = 5.
I've used the approach below to get the output and it's working fine:
from pyspark.sql import functions as F
from pyspark.sql.functions import when, lit, last
from pyspark.sql.window import Window

my_window = Window.partitionBy().orderBy("id")
# this will hold the previous row's amt value
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
# this replaces amt 0 with the previous row's value, but not for consecutive rows having 0 amt
DF = DF.withColumn("amt_adjusted", when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt))
# set null for the rows where both amt and amt_adjusted are 0 (logic for consecutive rows having 0 amt)
DF = DF.withColumn('zeroNonZero', when((DF.amt == 0) & (DF.amt_adjusted == 0), lit(None)).otherwise(DF.amt_adjusted))
# replace all null values with the previous non-zero amt row value
DF = DF.withColumn('modi_amt', last("zeroNonZero", ignorenulls=True).over(Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)))
Is there any better approach?
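One simpler alternative, sketched here under the same data assumptions: treat 0 as missing and carry the last non-zero amt forward with last(..., ignorenulls=True), which collapses the intermediate columns into a single forward-fill and also handles consecutive zeros.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)

# Replace 0 with null, then forward-fill with the last non-null amt seen so far
# (note: if the very first amt is 0, modi_amt will be null for that row)
DF = DF.withColumn(
    "modi_amt",
    F.last(F.when(F.col("amt") != 0, F.col("amt")), ignorenulls=True).over(w)
)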
