Window function with lag based on another column - apache-spark

I have the following Spark DataFrame:
id  month  column_1  column_2
A   1      100       0
A   2      200       1
A   3      800       2
A   4      1500      3
A   5      1200      0
A   6      1600      1
A   7      2500      2
A   8      2800      3
A   9      3000      4
I would like to create a new column, let's call it 'dif_column1', based on a dynamic lag that is given by column_2. The desired output would be:
id  month  column_1  column_2  dif_column1
A   1      100       0         0
A   2      200       1         100
A   3      800       2         700
A   4      1500      3         1400
A   5      1200      0         0
A   6      1600      1         400
A   7      2500      2         1300
A   8      2800      3         1600
A   9      3000      4         1800
I have tried to use the lag function, but it only accepts a constant integer offset, so the following does not work:
w = Window.partitionBy("id")
sdf = sdf.withColumn("dif_column1", F.col("column_1") - F.lag("column_1",F.col("column_2")).over(w))

You can add a row number column, and do a self join based on the row number and the lag defined in column_2:
from pyspark.sql import functions as F, Window
w = Window.partitionBy("id").orderBy("month")
df1 = df.withColumn('rn', F.row_number().over(w))
df2 = df1.alias('t1').join(
    df1.alias('t2'),
    F.expr('(t1.id = t2.id) and (t1.rn = t2.rn + t1.column_2)'),
    'left'
).selectExpr(
    't1.*',
    't1.column_1 - t2.column_1 as dif_column1'
).drop('rn')
df2.show()
+---+-----+--------+--------+-----------+
| id|month|column_1|column_2|dif_column1|
+---+-----+--------+--------+-----------+
| A| 1| 100| 0| 0|
| A| 2| 200| 1| 100|
| A| 3| 800| 2| 700|
| A| 4| 1500| 3| 1400|
| A| 5| 1200| 0| 0|
| A| 6| 1600| 1| 400|
| A| 7| 2500| 2| 1300|
| A| 8| 2800| 3| 1600|
| A| 9| 3000| 4| 1800|
+---+-----+--------+--------+-----------+
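As an alternative to the self join (this is not part of the original answer), on Spark 2.4+ you could collect the running history of column_1 into an array and index it dynamically; a minimal sketch, assuming column_2 never exceeds the number of rows already seen within the partition:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("id").orderBy("month")

# running history of column_1 per id (default frame: unbounded preceding to current row)
df2 = df.withColumn("hist", F.collect_list("column_1").over(w))

# element_at is 1-based, so size(hist) - column_2 points "column_2 rows back";
# when column_2 is 0 it points at the current row, giving a difference of 0
df2 = df2.withColumn(
    "dif_column1",
    F.col("column_1") - F.expr("element_at(hist, size(hist) - column_2)")
).drop("hist")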

Related

add rows in pyspark dataframe and adjust the column sequence accordingly

We have a dataframe like below say DF1
col_name    col_seq  Hash_enc_ind
abc         1        0
first_name  2        1
last_name   3        1
full_name   4        1
XYZ         5        0
sal         6        1
AAA         7        0
Now I want to add 2 rows for each row where Hash_enc_ind = 1 and adjust col_seq accordingly, so that the output would be:
DF1:
col_name      col_seq  Hash_enc_ind
abc           1        0
first_name_h  2        1
first_name_e  3        1
last_name_h   4        1
last_name_e   5        1
full_name_h   6        1
full_name_e   7        1
XYZ           8        0
sal_h         9        1
sal_e         10       1
AAA           11       0
You can explode an array of suffixes constructed with a case when expression:
import pyspark.sql.functions as F
df2 = df.withColumn(
    'col_name',
    F.expr("explode(transform(case when Hash_enc_ind = 1 then array('_h', '_e') else array('') end, x -> col_name || x))")
)
df2.show()
+------------+-------+------------+
| col_name|col_seq|Hash_enc_ind|
+------------+-------+------------+
| abc| 1| 0|
|first_name_h| 2| 1|
|first_name_e| 2| 1|
| last_name_h| 3| 1|
| last_name_e| 3| 1|
| full_name_h| 4| 1|
| full_name_e| 4| 1|
| XYZ| 5| 0|
| sal_h| 6| 1|
| sal_e| 6| 1|
| AAA| 7| 0|
+------------+-------+------------+
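The show() above keeps the original col_seq values. If you also need to renumber col_seq from 1 to N as in the desired output, here is a minimal sketch (assuming the intended order is the original col_seq with the '_h' row placed before the '_e' row; df3 is a hypothetical name):
from pyspark.sql import functions as F, Window

# order by the original col_seq, breaking ties so that *_h comes before *_e;
# a global (unpartitioned) window is acceptable here because the table is small
df3 = df2.withColumn(
    "col_seq",
    F.row_number().over(
        Window.orderBy("col_seq", F.when(F.col("col_name").endswith("_e"), 1).otherwise(0))
    )
)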

How to Filter Record Grouped by a field in PySpark Dataframe Based on Rank and Values

I have a PySpark dataframe (Spark 2.2 / Python 2.7) which has multiple records for each customer, received on multiple days over a period of time. Here is how a simplified version of the data looks. The records are ranked in order of the dates (YYYY-MM-DD) on which they were received within each group. The data is guaranteed to have multiple instances of each CUST_ID.
CUST_ID Date_received rank
1 2015-01-01 1
1 2021-01-12 2
1 2021-01-20 3
2 2015-01-01 1
2 2017-12-31 2
2 2021-02-15 3
3 2018-01-01 1
3 2019-07-31 2
4 2015-01-01 1
4 2021-01-01 2
4 2021-01-15 3
I want to split this data into 2 separate dataframes. The first dataframe should only have records fulfilling the criteria below:
CUST_ID was received for the first time (rank 1) on 2015-01-01 and the next time it was received (rank 2) was on or after 2021-01-01. From the data example above, the first dataframe should have only these rows. This should apply to each group of CUST_ID:
CUST_ID Date_received rank
1 2015-01-01 1
1 2021-01-12 2
4 2015-01-01 1
4 2021-01-01 2
And the 2nd dataframe should have the rest:
CUST_ID Date_received rank
1 2021-01-20 3
2 2015-01-01 1
2 2017-12-31 2
2 2021-02-15 3
3 2018-01-01 1
3 2019-07-31 2
4 2021-01-15 3
You can calculate the conditions on each row and then propagate them to every row of the same CUST_ID with a window max:
from pyspark.sql import functions as F, Window
df0 = df.withColumn(
    'flag1',
    (F.col('rank') == 1) & (F.col('Date_received') == '2015-01-01')
).withColumn(
    'flag2',
    (F.col('rank') == 2) & (F.col('Date_received') >= '2021-01-01')
).withColumn(
    'grp',
    F.max('flag1').over(Window.partitionBy('CUST_ID')) &
    F.max('flag2').over(Window.partitionBy('CUST_ID'))
)
df0.show()
+-------+-------------+----+-----+-----+-----+
|CUST_ID|Date_received|rank|flag1|flag2| grp|
+-------+-------------+----+-----+-----+-----+
| 3| 2018-01-01| 1|false|false|false|
| 3| 2019-07-31| 2|false|false|false|
| 1| 2015-01-01| 1| true|false| true|
| 1| 2021-01-12| 2|false| true| true|
| 1| 2021-01-20| 3|false|false| true|
| 4| 2015-01-01| 1| true|false| true|
| 4| 2021-01-01| 2|false| true| true|
| 4| 2021-01-15| 3|false|false| true|
| 2| 2015-01-01| 1| true|false|false|
| 2| 2017-12-31| 2|false|false|false|
| 2| 2021-02-15| 3|false|false|false|
+-------+-------------+----+-----+-----+-----+
Then you can divide the dataframe using the grp column:
df1 = df0.filter('grp and rank <= 2').select(df.columns)
df2 = df0.filter('not (grp and rank <= 2)').select(df.columns)
df1.show()
+-------+-------------+----+
|CUST_ID|Date_received|rank|
+-------+-------------+----+
| 1| 2015-01-01| 1|
| 1| 2021-01-12| 2|
| 4| 2015-01-01| 1|
| 4| 2021-01-01| 2|
+-------+-------------+----+
df2.show()
+-------+-------------+----+
|CUST_ID|Date_received|rank|
+-------+-------------+----+
| 3| 2018-01-01| 1|
| 3| 2019-07-31| 2|
| 1| 2021-01-20| 3|
| 4| 2021-01-15| 3|
| 2| 2015-01-01| 1|
| 2| 2017-12-31| 2|
| 2| 2021-02-15| 3|
+-------+-------------+----+

How to add an auto-incrementing column in a dataframe based on another column?

I have a PySpark dataframe similar to below:
order_id item qty
123 abc 1
123 abc1 4
234 abc2 5
234 abc3 2
234 abc4 7
123 abc5 5
456 abc6 9
456 abc7 8
456 abc8 9
I want to add an auto-incrementing column based on the column 'order_id' and the expected result is:
order_id item qty AutoIncrementingColumn_orderID
123 abc 1 1
123 abc1 4 2
234 abc2 5 1
234 abc3 2 2
234 abc4 7 3
123 abc5 5 3
456 abc6 9 1
456 abc7 8 2
456 abc8 9 3
I couldn't find a solution that generates the column based on another column. Any idea how to achieve this?
You can use row_number:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'AutoIncrementingColumn_orderID',
    F.row_number().over(Window.partitionBy('order_id').orderBy('item'))
)
df2.show()
+--------+----+---+------------------------------+
|order_id|item|qty|AutoIncrementingColumn_orderID|
+--------+----+---+------------------------------+
| 234|abc2| 5| 1|
| 234|abc3| 2| 2|
| 234|abc4| 7| 3|
| 456|abc6| 9| 1|
| 456|abc7| 8| 2|
| 456|abc8| 9| 3|
| 123| abc| 1| 1|
| 123|abc1| 4| 2|
| 123|abc5| 5| 3|
+--------+----+---+------------------------------+
A couple of ways of doing it:
Here is the SQL way:
df = Ss.sql("""
select order_id, item, qty,
       row_number() over (partition by order_id order by qty) as autoInc
from (
    select order_id, item, qty
    from ( values
        (123, 'abc',  1),
        (123, 'abc1', 4),
        (234, 'abc2', 5),
        (234, 'abc3', 2),
        (234, 'abc4', 7),
        (123, 'abc5', 5),
        (456, 'abc6', 9),
        (456, 'abc7', 8),
        (456, 'abc8', 9)
    ) as T(order_id, item, qty))""")
df.show()
Output:
+--------+----+---+-------+
|order_id|item|qty|autoInc|
+--------+----+---+-------+
| 456|abc7| 8| 1|
| 456|abc6| 9| 2|
| 456|abc8| 9| 3|
| 234|abc3| 2| 1|
| 234|abc2| 5| 2|
| 234|abc4| 7| 3|
| 123| abc| 1| 1|
| 123|abc1| 4| 2|
| 123|abc5| 5| 3|
+--------+----+---+-------+

Get the previous row value using spark sql

I have a table like this.
Id prod val
1 0 0
2 0 0
3 1 1000
4 0 0
5 1 2000
6 0 0
7 0 0
I want to add a new column new_val. The condition for this column is: if prod = 0, then new_val should come from the preceding row where prod = 1; if prod = 1, it should have the same value as the val column. How do I achieve this using Spark SQL?
Id prod val new_val
1 0 0 1000
2 0 0 1000
3 1 1000 1000
4 0 0 2000
5 1 2000 2000
6 1 4000 4000
7 1 3000 3000
Any help is greatly appreciated
You can use something like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.orderBy("id")
df = df.withColumn("new_val", F.when(F.col("prod") == 0, F.lag("val").over(w)).otherwise(F.col("val")))
What we are basically doing is using an if-else condition:
When prod == 0, we take the lag of val, which is the value of the previous row (over a window ordered by the id column); when prod == 1, we use the present value of the column.
You can achieve that with:
val w = Window.orderBy("id").rowsBetween(0, Window.unboundedFollowing)
df
  .withColumn("new_val", when($"prod" === 0, null).otherwise($"val"))
  .withColumn("new_val", first("new_val", ignoreNulls = true).over(w))
First, it creates a new column with null values wherever prod is 0:
+---+----+----+-------+
| id|prod| val|new_val|
+---+----+----+-------+
| 1| 0| 0| null|
| 2| 0| 0| null|
| 3| 1|1000| 1000|
| 4| 0| 0| null|
| 5| 1|2000| 2000|
| 6| 1|4000| 4000|
| 7| 1|3000| 3000|
+---+----+----+-------+
Then it replaces each null with the first non-null value among the following records:
+---+----+----+-------+
| id|prod| val|new_val|
+---+----+----+-------+
| 1| 0| 0| 1000|
| 2| 0| 0| 1000|
| 3| 1|1000| 1000|
| 4| 0| 0| 2000|
| 5| 1|2000| 2000|
| 6| 1|4000| 4000|
| 7| 1|3000| 3000|
+---+----+----+-------+
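Since the question asks for Spark SQL, here is a hedged sketch of the same forward-fill logic in plain SQL, assuming the data has been registered as a temporary view named t (a hypothetical name):
df.createOrReplaceTempView("t")

spark.sql("""
select
    Id, prod, val,
    -- keep val only on prod = 1 rows, then take the first non-null value
    -- from the current row onward
    first_value(case when prod = 1 then val end, true) over (
        order by Id
        rows between current row and unbounded following
    ) as new_val
from t
""").show()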

In pyspark, group based on a variable field, and add a counter for particular values (which resets when variable changes)

Create a Spark dataframe from a pandas dataframe:
import pandas as pd
df = pd.DataFrame({"b": ['A','A','A','A','B', 'B','B','C','C','D','D', 'D','D','D','D','D','D','D','D','D'],"Sno": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],"a": [3,-4,2, -1, -3, -1,-7,-6, 1, 1, -1, 1,4,5,-3,2,3,4, -1, -2]})
df2=spark.createDataFrame(df)
Next I use a window partitioned on the field 'b':
from pyspark.sql import window
win_spec = (window.Window.partitionBy(['b']).orderBy("Sno").rowsBetween(window.Window.unboundedPreceding, 0))
I add a field pos_neg (positive/negative) based on the values and create a lambda function:
df2 = df2.withColumn("pos_neg",col("a") <0)
pos_neg_func =udf(lambda x: ((x) & (x != x.shift())).cumsum())
I tried creating a new column which is a counter for negative values, but within the variable 'b', so the counter restarts when the field 'b' changes. Also, if there are consecutive negative values, they should be treated as a single run and the counter should only change by 1:
df3 = (df2.select('pos_neg',pos_neg_func('pos_neg').alias('val')))
I want the output as:
b Sno a pos_neg val
0 A 1 3 False 0
1 A 2 -4 True 1
2 A 3 2 False 1
3 A 4 -1 True 2
4 B 5 -3 True 1
5 B 6 -1 True 1
6 B 7 -7 True 1
7 C 8 -6 True 1
8 C 9 1 False 1
9 D 10 1 False 0
10 D 11 -1 True 1
11 D 12 1 False 1
12 D 13 4 False 1
13 D 14 5 False 1
14 D 15 -3 True 2
15 D 16 2 False 2
16 D 17 3 False 2
17 D 18 4 False 2
18 D 19 -1 True 3
19 D 20 -2 True 3
In Python, a simple function like the following works:
df['val'] = df.groupby('b')['pos_neg'].transform(lambda x: ((x) & (x != x.shift())).cumsum())
josh-friedlander provided help with the above code.
PySpark doesn't have a shift function, but you can work with the lag window function, which gives you the row before the current row. The first window (called w) sets the value of the val column to 1 if the value of the pos_neg column is True and the value of the previous pos_neg is False, and to 0 otherwise.
With the second window (called w2) we calculate the cumulative sum to get your desired result:
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import Window
df = pd.DataFrame({"b": ['A','A','A','A','B', 'B','B','C','C','D','D', 'D','D','D','D','D','D','D','D','D'],"Sno": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],"a": [3,-4,2, -1, -3, -1,-7,-6, 1, 1, -1, 1,4,5,-3,2,3,4, -1, -2]})
df2=spark.createDataFrame(df)
w = Window.partitionBy('b').orderBy('Sno')
w2 = Window.partitionBy('b').orderBy('Sno').rowsBetween(Window.unboundedPreceding, 0)
df2 = df2.withColumn("pos_neg", F.col("a") < 0)
df2 = df2.withColumn('val', F.when((df2.pos_neg == True) & (F.lag('pos_neg', default=False).over(w) == False), 1).otherwise(0))
df2 = df2.withColumn('val', F.sum('val').over(w2))
df2.show()
Output:
+---+---+---+-------+---+
|Sno| a| b|pos_neg|val|
+---+---+---+-------+---+
| 5| -3| B| true| 1|
| 6| -1| B| true| 1|
| 7| -7| B| true| 1|
| 10| 1| D| false| 0|
| 11| -1| D| true| 1|
| 12| 1| D| false| 1|
| 13| 4| D| false| 1|
| 14| 5| D| false| 1|
| 15| -3| D| true| 2|
| 16| 2| D| false| 2|
| 17| 3| D| false| 2|
| 18| 4| D| false| 2|
| 19| -1| D| true| 3|
| 20| -2| D| true| 3|
| 8| -6| C| true| 1|
| 9| 1| C| false| 1|
| 1| 3| A| false| 0|
| 2| -4| A| true| 1|
| 3| 2| A| false| 1|
| 4| -1| A| true| 2|
+---+---+---+-------+---+
You may wonder why it was necessary to have a column which allows us to order the dataset. Let me try to explain this with an example. The following data was read by pandas and got an index assigned (left column). You want to count the occurrences of True in pos_neg, and you don't want to count consecutive True's more than once. This logic leads to the val_2 column as shown below:
b Sno a pos_neg val_2
0 A 1 3 False 0
1 A 2 -4 True 1
2 A 3 2 False 1
3 A 4 -1 True 2
4 A 5 -5 True 2
...but it depends on the index it got from pandas (the order of the rows). When you change the order of the rows (and the corresponding pandas index), you will get a different result when you apply your logic to the same rows, just because the order is different:
b Sno a pos_neg val_2
0 A 1 3 False 0
1 A 3 2 False 0
2 A 2 -4 True 1
3 A 4 -1 True 1
4 A 5 -5 True 1
You see that the order of the rows is important. You might now wonder why PySpark doesn't create an index like pandas does. That is because Spark keeps your data in several partitions which are distributed across your cluster, and depending on your data source it may even read the data in a distributed way. An index therefore cannot be added while the data is being read. You can add one after the data has been read with the monotonically_increasing_id function, but your data could already be in a different order than the data source due to the read process.
Your Sno column avoids this problem and guarantees that you will always get the same result for the same data (deterministic).
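For illustration only (this is not part of the original answer), a minimal sketch of adding such an index after the read with monotonically_increasing_id; the generated ids are increasing and unique but not consecutive, and they reflect the order in which Spark happened to read the data:
from pyspark.sql import functions as F

# df2_indexed is a hypothetical name; df2 is the Spark dataframe created above
df2_indexed = df2.withColumn("idx", F.monotonically_increasing_id())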
