RowNumber with Reset - apache-spark

I am trying to achieve the expected output shown here:
+---+-----+--------+--------+--------+----+
| ID|State| Time|Expected|lagState|rank|
+---+-----+--------+--------+--------+----+
| 1| P|20220722| 1| null| 1|
| 1| P|20220723| 2| P| 2|
| 1| P|20220724| 3| P| 3|
| 1| P|20220725| 4| P| 4|
| 1| D|20220726| 1| P| 1|
| 1| O|20220727| 1| D| 1|
| 1| D|20220728| 1| O| 1|
| 1| P|20220729| 2| D| 1|
| 1| P|20220730| 3| P| 9|
| 1| P|20220731| 4| P| 10|
+---+-----+--------+--------+--------+----+
# imports used below
import pyspark.sql.functions as F
from pyspark.sql import Window as w

# create df
df = spark.createDataFrame(sc.parallelize([
    [1, 'P', 20220722, 1],
    [1, 'P', 20220723, 2],
    [1, 'P', 20220724, 3],
    [1, 'P', 20220725, 4],
    [1, 'D', 20220726, 1],
    [1, 'O', 20220727, 1],
    [1, 'D', 20220728, 1],
    [1, 'P', 20220729, 2],
    [1, 'P', 20220730, 3],
    [1, 'P', 20220731, 4],
]), ['ID', 'State', 'Time', 'Expected'])

# lag
df = df.withColumn('lagState', F.lag('State').over(w.partitionBy('ID').orderBy('Time')))

# rn
df = df.withColumn('rank',
                   F.when(F.col('State') == F.col('lagState'),
                          F.rank().over(w.partitionBy('ID').orderBy('Time', 'State')))
                    .otherwise(1))

# view
df.show()
The general problem is that the tail of the DF is not resetting to the expected value as hoped.

import sys
import pyspark.sql.functions as func
from pyspark.sql import Window as wd

# data_sdf is the question's dataframe (df above)
data_sdf. \
    withColumn('st_notsame',
               func.coalesce(func.col('state') != func.lag('state').over(wd.partitionBy('id').orderBy('time')),
                             func.lit(False)).cast('int')
               ). \
    withColumn('rank_temp',
               func.sum('st_notsame').over(wd.partitionBy('id').orderBy('time').rowsBetween(-sys.maxsize, 0))
               ). \
    withColumn('rank',
               func.row_number().over(wd.partitionBy('id', 'rank_temp').orderBy('time'))
               ). \
    show()
# +---+-----+--------+--------+----------+---------+----+
# | id|state| time|expected|st_notsame|rank_temp|rank|
# +---+-----+--------+--------+----------+---------+----+
# | 1| P|20220722| 1| 0| 0| 1|
# | 1| P|20220723| 2| 0| 0| 2|
# | 1| P|20220724| 3| 0| 0| 3|
# | 1| P|20220725| 4| 0| 0| 4|
# | 1| D|20220726| 1| 1| 1| 1|
# | 1| O|20220727| 1| 1| 2| 1|
# | 1| D|20220728| 1| 1| 3| 1|
# | 1| P|20220729| 2| 1| 4| 1|
# | 1| P|20220730| 3| 0| 4| 2|
# | 1| P|20220731| 4| 0| 4| 3|
# +---+-----+--------+--------+----------+---------+----+
Your "Expected" field looks a little incorrect - I believe the rank for 20220729 should be 1.
The approach is the following (a compact sketch using the question's own variable names follows this list):
- first flag all consecutive occurrences of the state as 0 and everything else as 1 - this enables a running sum
- take a running sum of that flag with an unbounded lookback per id to get a temporary rank
- use the temporary rank as an additional partition column for row_number()
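For reference, here is a compact sketch of the same three steps applied directly to the question's df, reusing the F and w aliases from the question's snippet (Window.unboundedPreceding is used here instead of -sys.maxsize; both mean an unbounded lookback):
df2 = (df
       # 1 where the state changed vs the previous row, else 0
       .withColumn('st_notsame',
                   F.coalesce(F.col('State') != F.lag('State').over(w.partitionBy('ID').orderBy('Time')),
                              F.lit(False)).cast('int'))
       # running count of changes = id of the current "streak"
       .withColumn('rank_temp',
                   F.sum('st_notsame').over(w.partitionBy('ID').orderBy('Time')
                                             .rowsBetween(w.unboundedPreceding, w.currentRow)))
       # restart the numbering inside each streak
       .withColumn('rank',
                   F.row_number().over(w.partitionBy('ID', 'rank_temp').orderBy('Time'))))
df2.show()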

Related

Running sum / cumulative sum with floor and ceiling in PySpark

I'm new to Spark, and I am trying to calculate a window running sum that is floored at 0 and capped at 8.
A toy example is given below (note that the actual data is closer to millions of rows):
import pyspark.sql.functions as F
from pyspark.sql import Window
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

pdf = pd.DataFrame({'aIds': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                    'day': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                    'eCounts': [-3, 3, -6, 3, 3, 6, -3, -6, 3, 3, 3, -3]})
sdf = spark.createDataFrame(pdf)
sdf = sdf.orderBy(sdf.aIds, sdf.day)
This creates the table
+----+---+-------+
|aIds|day|eCounts|
+----+---+-------+
| 1| 1| -3|
| 1| 2| 3|
| 1| 3| -6|
| 1| 4| 3|
| 2| 1| 3|
| 2| 2| 6|
| 2| 3| -3|
| 2| 4| -6|
| 3| 1| 3|
| 3| 2| 3|
| 3| 3| 3|
| 3| 4| -3|
+----+---+-------+
Below is an example of the result of doing a running sum, and the expected output runSumCap
+----+---+-------+------+---------+
|aIds|day|eCounts|runSum|runSumCap|
+----+---+-------+------+---------+
| 1| 1| -3| -3| 0| <-- reset to 0
| 1| 2| 3| 0| 3|
| 1| 3| -6| -6| 0| <-- reset to 0
| 1| 4| 3| -3| 3|
| 2| 1| 3| 3| 3|
| 2| 2| 6| 9| 8| <-- reset to 8
| 2| 3| -3| 6| 5|
| 2| 4| -6| 0| 0| <-- reset to 0
| 3| 1| 3| 3| 3|
| 3| 2| 3| 6| 6|
| 3| 3| 3| 9| 8| <-- reset to 8
| 3| 4| -3| 6| 5|
+----+---+-------+------+---------+
I know I can calculate the running sum as:
partition = Window.partitionBy('aIds').orderBy('aIds', 'day').rowsBetween(Window.unboundedPreceding, Window.currentRow)
sdf1 = sdf.withColumn('runSum',F.sum(sdf.eCounts).over(partition))
sdf1.orderBy('aIds','day').show()
To achieve the expected output I have tried looking into a pandas_udf to modify the sum:
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def runSumCap(counts):
    # the counts column is passed in as a pandas Series
    floor = 0
    cap = 8
    runSum = 0
    runSumList = []
    for count in counts.tolist():
        runSum = runSum + count
        if runSum > cap:
            runSum = cap
        elif runSum < floor:
            runSum = floor
        runSumList += [runSum]
    return pd.Series(runSumList)

partition = Window.partitionBy('aIds').orderBy('aIds', 'day').rowsBetween(Window.unboundedPreceding, Window.currentRow)
sdf1 = sdf.withColumn('runSum', runSumCap(sdf['eCounts']).over(partition))
However this does not work, and it does not seem like the most efficient way to do this.
How can I make this work? Is there a way to keep it parallel, or do I have to go to pandas dataframes?
EDIT:
Added some clarifications about which columns are present to order the dataset by, and some more insight into what I am trying to achieve.
EDIT2:
The answer that was provided by @DrChess almost yields the correct result, but the series isn't matching the correct day for some reason:
+----+---+-------+------+
|aIds|day|eCounts|runSum|
+----+---+-------+------+
| 1| 1| -3| 0|
| 1| 2| 3| 0|
| 1| 3| -6| 3|
| 1| 4| 3| 3|
| 2| 1| 3| 3|
| 2| 2| 6| 8|
| 2| 3| -3| 0|
| 2| 4| -6| 5|
| 3| 1| 3| 6|
| 3| 2| 3| 3|
| 3| 3| 3| 8|
| 3| 4| -3| 5|
+----+---+-------+------+
I found a way to do this by first making an array in each row (using collect_list as a window function) containing the values used to make the running sum up until that point.
I then defined a UDF (I couldn't make this work with pandas_udf), and it worked.
Below is a full reproducible example:
import pyspark.sql.functions as F
from pyspark.sql import Window
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import numpy as np

def accumulate(iterable):
    total = 0
    ceil = 8
    floor = 0
    for element in iterable:
        total = total + element
        if total > ceil:
            total = ceil
        elif total < floor:
            total = floor
    return total

pdf = pd.DataFrame({'aIds': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                    'day': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                    'eCounts': [-3, 3, -6, 3, 3, 6, -3, -6, 3, 3, 3, -3]})
sdf = spark.createDataFrame(pdf)
sdf = sdf.orderBy(sdf.aIds, sdf.day)

runSumCap = F.udf(accumulate, LongType())
partition = Window.partitionBy('aIds').orderBy('aIds', 'day').rowsBetween(Window.unboundedPreceding, Window.currentRow)
sdf1 = sdf.withColumn('splitWindow', F.collect_list(sdf.eCounts).over(partition))
sdf2 = sdf1.withColumn('runSumCap', runSumCap(sdf1.splitWindow))
sdf2.orderBy('aIds', 'day').show()
This yields the expected result:
+----+---+-------+--------------+---------+
|aIds|day|eCounts| splitWindow|runSumCap|
+----+---+-------+--------------+---------+
| 1| 1| -3| [-3]| 0|
| 1| 2| 3| [-3, 3]| 3|
| 1| 3| -6| [-3, 3, -6]| 0|
| 1| 4| 3|[-3, 3, -6, 3]| 3|
| 2| 1| 3| [3]| 3|
| 2| 2| 6| [3, 6]| 8|
| 2| 3| -3| [3, 6, -3]| 5|
| 2| 4| -6|[3, 6, -3, -6]| 0|
| 3| 1| 3| [3]| 3|
| 3| 2| 3| [3, 3]| 6|
| 3| 3| 3| [3, 3, 3]| 8|
| 3| 4| -3| [3, 3, 3, -3]| 5|
+----+---+-------+--------------+---------+
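As a side note, on Spark 2.4+ the same capped accumulation over the collected array can be expressed without a Python UDF, using the aggregate higher-order function in a SQL expression. A sketch, assuming sdf1 with the splitWindow column from the example above:
from pyspark.sql import functions as F

# fold the collected array with a SQL lambda: floor at 0, cap at 8
sdf2 = sdf1.withColumn(
    'runSumCap',
    F.expr("aggregate(splitWindow, 0L, (acc, x) -> least(greatest(acc + x, 0L), 8L))"))
sdf2.orderBy('aIds', 'day').show()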
Unfortunately, window functions with a pandas_udf of type GROUPED_AGG do not work with bounded window frames (i.e. .rowsBetween(Window.unboundedPreceding, Window.currentRow)). They currently only work with unbounded windows, namely .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing). Additionally, the input is a pandas.Series but the output has to be a single constant of the provided type, so you won't be able to achieve partial aggregations with it.
Instead you can use a GROUPED_MAP pandas_udf, which works with df.groupBy().apply().
Here is some code:
@pandas_udf('aIds integer, day integer, eCounts integer, runSum integer', PandasUDFType.GROUPED_MAP)
def runSumCap(pdf):
    def _apply_on_series(counts):
        floor = 0
        cap = 8
        runSum = 0
        runSumList = []
        for count in counts.tolist():
            runSum = runSum + count
            if runSum > cap:
                runSum = cap
            elif runSum < floor:
                runSum = floor
            runSumList += [runSum]
        return pd.Series(runSumList)

    pdf.sort_values(by=['day'], inplace=True)
    pdf['runSum'] = _apply_on_series(pdf['eCounts'])
    return pdf

sdf1 = sdf.groupBy('aIds').apply(runSumCap)
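On Spark 3.x the GROUPED_MAP pandas_udf style is deprecated; the same logic can be written with groupBy().applyInPandas(). A sketch under the same column names as above:
import pandas as pd

def run_sum_cap(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values('day')
    run, out = 0, []
    for c in pdf['eCounts']:
        run = min(max(run + c, 0), 8)   # floor at 0, cap at 8
        out.append(int(run))
    pdf['runSum'] = out
    return pdf

sdf1 = sdf.groupBy('aIds').applyInPandas(
    run_sum_cap, schema='aIds long, day long, eCounts long, runSum long')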

Pyspark apply function to column value if condition is met [duplicate]

Marked as a duplicate of: Spark Equivalent of IF Then ELSE.
Given a pyspark dataframe, for example:
ls = [
    ['1', 2],
    ['2', 7],
    ['1', 3],
    ['2', -6],
    ['1', 3],
    ['1', 5],
    ['1', 4],
    ['2', 7]
]
df = spark.createDataFrame(pd.DataFrame(ls, columns=['col1', 'col2']))
df.show()
+----+-----+
|col1| col2|
+----+-----+
| 1| 2|
| 2| 7|
| 1| 3|
| 2| -6|
| 1| 3|
| 1| 5|
| 1| 4|
| 2| 7|
+----+-----+
How can I apply a function to col2 values where col1 == '1' and store the result in a new column?
For example, the function is:
f = x**2
The result should look like this:
+----+-----+-----+
|col1| col2| y|
+----+-----+-----+
| 1| 2| 4|
| 2| 7| null|
| 1| 3| 9|
| 2| -6| null|
| 1| 3| 9|
| 1| 5| 25|
| 1| 4| 16|
| 2| 7| null|
+----+-----+-----+
I tried defining a separate function and using df.withColumn(y).when(condition, function), but it wouldn't work.
So what is a way to do this?
I hope this helps:
from pyspark.sql.functions import when
from pyspark.sql.types import IntegerType

def myFun(x):
    return (x**2).cast(IntegerType())

df2 = df.withColumn("y", when(df.col1 == 1, myFun(df.col2)).otherwise(None))
df2.show()
+----+----+----+
|col1|col2| y|
+----+----+----+
| 1| 2| 4|
| 2| 7|null|
| 1| 3| 9|
| 2| -6|null|
| 1| 3| 9|
| 1| 5| 25|
| 1| 4| 16|
| 2| 7|null|
+----+----+----+
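The snippet above works because x**2 can be expressed with built-in Column arithmetic. If the function is arbitrary Python with no Column equivalent, a regular udf can be used inside when() in exactly the same way. A minimal sketch; my_py_fun is a hypothetical stand-in for your function:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

@F.udf(IntegerType())
def my_py_fun(x):
    # any plain-Python logic can go here
    return x ** 2

df3 = df.withColumn(
    'y', F.when(F.col('col1') == '1', my_py_fun(F.col('col2'))).otherwise(F.lit(None)))
df3.show()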

PySpark - Dynamically update a col depending on the value of a cell

This is a modified version of the problem posted here https://stackoverflow.com/questions/35898687/adding-column-to-pyspark-dataframe-depending-on-whether-column-value-is-in-anoth
I am using PySpark (Spark 1.6).
I have the following data:
myDict
{'1': 'c1', '3': 'c3', '2': 'c2', '5': 'c5', '4': 'c4', '6': 'c6'}
I have the following df:
+----+----+---------+---+---+---+---+---+---+
|user|item|fav_items| c1| c2| c3| c4| c5| c6|
+----+----+---------+---+---+---+---+---+---+
| u1| 1| 1,1,3| 0| 0| 0| 0| 0| 0|
| u1| 4|4,4,4,5,6| 0| 0| 0| 0| 0| 0|
+----+----+---------+---+---+---+---+---+---+
The output should be:
+----+----+---------+---+---+---+---+---+---+
|user|item|fav_items| c1| c2| c3| c4| c5| c6|
+----+----+---------+---+---+---+---+---+---+
| u1| 1| 1,1,3| 2| 0| 1| 0| 0| 0|
| u1| 4|4,4,4,5,6| 0| 0| 0| 3| 1| 1|
+----+----+---------+---+---+---+---+---+---+
Depending on the counts and values in fav_items, look up myDict to get the column mapping and update that column. For instance, in the first row 1 occurs twice, and 1 maps to 'c1' in myDict, so the value of c1 for row 1 should be 2.
I got the following method working, where we iterate over the columns, but this approach is inefficient since the number of columns exceeds 2k+.
for key in myDict.keys():
    contains_event = udf(lambda x: x.count(key), IntegerType())
    df = df.withColumn(myDict[key], contains_event('fav_items'))
I'm looking for a more efficient method for this problem.
Thanks in advance.
I just tried it my way, hope it helps.
>>> from pyspark.sql.types import *
>>> from pyspark.sql import functions as F
>>> from collections import Counter
>>> d = {'1': 'c1', '3': 'c3', '2': 'c2', '5': 'c5', '4': 'c4', '6': 'c6'}
>>> df = spark.createDataFrame([('u1',1,'1,1,3',0,0,0,0,0,0),('u1',4,'4,4,4,5,6',0,0,0,0,0,0),('u1',1,'3,6,2',0,0,0,0,0,0)],['user','item','fav_items','c1','c2','c3','c4','c5','c6'])
>>> df.show()
+----+----+---------+---+---+---+---+---+---+
|user|item|fav_items| c1| c2| c3| c4| c5| c6|
+----+----+---------+---+---+---+---+---+---+
| u1| 1| 1,1,3| 0| 0| 0| 0| 0| 0|
| u1| 4|4,4,4,5,6| 0| 0| 0| 0| 0| 0|
| u1| 1| 3,6,2| 0| 0| 0| 0| 0| 0|
+----+----+---------+---+---+---+---+---+---+
>>> udf1 = F.udf(lambda c: Counter(c).most_common(),ArrayType(ArrayType(StringType())))
>>> df1 = df.select('user','item','fav_items',udf1(F.split(df.fav_items,',')).alias('item_counter'))
>>> df1.show(3,False)
+----+----+---------+------------------------------------------------------------+
|user|item|fav_items|item_counter |
+----+----+---------+------------------------------------------------------------+
|u1 |1 |1,1,3 |[WrappedArray(1, 2), WrappedArray(3, 1)] |
|u1 |4 |4,4,4,5,6|[WrappedArray(4, 3), WrappedArray(5, 1), WrappedArray(6, 1)]|
|u1 |1 |3,6,2 |[WrappedArray(3, 1), WrappedArray(6, 1), WrappedArray(2, 1)]|
+----+----+---------+------------------------------------------------------------+
>>> # explode the per-item counters into one row per (value, count) pair
>>> df2 = df1.select('user','item','fav_items',F.explode(df1.item_counter).alias('val'))
>>> df2 = df2.select('user','item','fav_items','val',df2.val[0].alias('val1'),df2.val[1].alias('val2'))
>>> df2.show()
+----+----+---------+------+----+----+
|user|item|fav_items| val|val1|val2|
+----+----+---------+------+----+----+
| u1| 1| 1,1,3|[1, 2]| 1| 2|
| u1| 1| 1,1,3|[3, 1]| 3| 1|
| u1| 4|4,4,4,5,6|[4, 3]| 4| 3|
| u1| 4|4,4,4,5,6|[5, 1]| 5| 1|
| u1| 4|4,4,4,5,6|[6, 1]| 6| 1|
| u1| 1| 3,6,2|[3, 1]| 3| 1|
| u1| 1| 3,6,2|[6, 1]| 6| 1|
| u1| 1| 3,6,2|[2, 1]| 2| 1|
+----+----+---------+------+----+----+
>>> udf2 = F.udf(lambda x : d[x],StringType())
>>> df2 = df2.withColumn('d_col',udf2(df2.val1))
>>> df2.show()
+----+----+---------+------+----+----+-----+
|user|item|fav_items| val|val1|val2|d_col|
+----+----+---------+------+----+----+-----+
| u1| 1| 1,1,3|[1, 2]| 1| 2| c1|
| u1| 1| 1,1,3|[3, 1]| 3| 1| c3|
| u1| 4|4,4,4,5,6|[4, 3]| 4| 3| c4|
| u1| 4|4,4,4,5,6|[5, 1]| 5| 1| c5|
| u1| 4|4,4,4,5,6|[6, 1]| 6| 1| c6|
| u1| 1| 3,6,2|[3, 1]| 3| 1| c3|
| u1| 1| 3,6,2|[6, 1]| 6| 1| c6|
| u1| 1| 3,6,2|[2, 1]| 2| 1| c2|
+----+----+---------+------+----+----+-----+
>>> pvtdf = df2.groupby(['user','item','fav_items']).pivot('d_col').agg(F.first('val2')).na.fill({'c1':0,'c2':0,'c3':0,'c4':0,'c5':0,'c6':0})
>>> pvtdf.show()
+----+----+---------+---+---+---+---+---+---+
|user|item|fav_items| c1| c2| c3| c4| c5| c6|
+----+----+---------+---+---+---+---+---+---+
| u1| 1| 1,1,3| 2| 0| 1| 0| 0| 0|
| u1| 1| 3,6,2| 0| 1| 1| 0| 0| 1|
| u1| 4|4,4,4,5,6| 0| 0| 0| 3| 1| 1|
+----+----+---------+---+---+---+---+---+---+
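On Spark 2.x and later the same counts can also be produced without any Python UDF, by exploding the split string and pivoting on the mapped column name. A sketch, reusing the dict d and the df defined above; key_to_col and pvt are hypothetical names of mine:
from itertools import chain
from pyspark.sql import functions as F

# build a Spark map literal: '1' -> 'c1', '2' -> 'c2', ...
key_to_col = F.create_map(*[F.lit(x) for x in chain(*d.items())])

pvt = (df
       .select('user', 'item', 'fav_items',
               F.explode(F.split('fav_items', ',')).alias('key'))
       .withColumn('col_name', key_to_col[F.col('key')])
       .groupBy('user', 'item', 'fav_items')
       .pivot('col_name', sorted(d.values()))
       .count()
       .na.fill(0))
pvt.show()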

[Py]Spark SQL: Constrain each frame of a Window using the frame's input row

I would like to constrain what rows in a Window frame are used by the aggregate function based on the current input row. For example, given a DataFrame df and a Window w, I want to be able to do something like:
df2 = df.withColumn("foo", first(col("bar").filter(...)).over(w))
where .filter would remove rows from the current Window frame based on the frame's input row.
My specific use case is as follows: Given a DataFrame df
+-----+--+--+
|group|n1|n2|
+-----+--+--+
| 1| 1| 6|
| 1| 0| 3|
| 1| 2| 2|
| 1| 3| 5|
| 2| 0| 5|
| 2| 0| 7|
| 2| 3| 2|
| 2| 5| 9|
+-----+--+--+
window
w = Window.partitionBy("group")\
    .orderBy("n1", "n2")\
    .rowsBetween(Window.currentRow + 1, Window.unboundedFollowing)
and some positive Long i, how would you find the first row (fr) in each input row r's frame such that r.n1 < fr.n1, r.n2 < fr.n2, and max(fr.n1 - r.n1, fr.n2 - r.n2) < i? The value returned can be either fr.n1 or fr's row index in df. So, for i = 6, the output for the example df would be
+-----+--+--+-----+
|group|n1|n2|fr.n1|
+-----+--+--+-----+
| 1| 1| 6| null|
| 1| 0| 3| 1|
| 1| 2| 2| 3|
| 1| 3| 5| null|
| 2| 0| 5| 5|
| 2| 0| 7| 5|
| 2| 3| 2| null|
| 2| 5| 9| null|
+-----+--+--+-----+
I've been studying the Spark API and looking at examples of Window, first, and when, but I can't seem to piece it together. Is this even possible with Window and aggregate functions or am I completely off the mark?
You won't be able to do it with just window functions and aggregations, you'll need a self join:
For the join:
import pyspark.sql.functions as psf
from pyspark.sql import Window

df = sc.parallelize([
    [1, 1, 6], [1, 0, 3], [1, 2, 2], [1, 3, 5],
    [2, 0, 5], [2, 0, 7], [2, 3, 2], [2, 5, 9]
]).toDF(["group", "n1", "n2"])

i = 6  # the threshold from the example

df_r = df.select([df[c].alias("r_" + c) for c in df.columns])
df_join = df_r\
    .join(df, (df_r.r_group == df.group)
          & (df_r.r_n1 < df.n1)
          & (df_r.r_n2 < df.n2)
          & (psf.greatest(df.n1 - df_r.r_n1, df.n2 - df_r.r_n2) < i), "leftouter")\
    .drop("group")
Now we can apply the window function to only keep the first row:
w = Window.partitionBy("r_group", "r_n1", "r_n2").orderBy("n1", "n2")
res = df_join\
    .withColumn("rn", psf.row_number().over(w))\
    .filter("rn = 1").drop("rn")
+-------+----+----+----+----+
|r_group|r_n1|r_n2| n1| n2|
+-------+----+----+----+----+
| 1| 0| 3| 1| 6|
| 1| 1| 6|null|null|
| 1| 2| 2| 3| 5|
| 1| 3| 5|null|null|
| 2| 0| 5| 5| 9|
| 2| 0| 7| 5| 9|
| 2| 3| 2|null|null|
| 2| 5| 9|null|null|
+-------+----+----+----+----+
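To get the shape the asker expects, a small follow-up select on res renames the columns and keeps only the first row's n1 (fr_n1 is my name for the "fr.n1" column, since dots are awkward in column names):
out = res.select(
    psf.col('r_group').alias('group'),
    psf.col('r_n1').alias('n1'),
    psf.col('r_n2').alias('n2'),
    psf.col('n1').alias('fr_n1'))
out.orderBy('group', 'n1', 'n2').show()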

Difference between dense_rank and row_number in Spark

I am trying to understand the difference between dense_rank and row_number. Both start from 1 in each new window partition. Doesn't the rank of a row always start from 1? Any help would be appreciated.
The difference is when there are "ties" in the ordering column. Check the example below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df = Seq(("a", 10), ("a", 10), ("a", 20)).toDF("col1", "col2")
val windowSpec = Window.partitionBy("col1").orderBy("col2")
df
  .withColumn("rank", rank().over(windowSpec))
  .withColumn("dense_rank", dense_rank().over(windowSpec))
  .withColumn("row_number", row_number().over(windowSpec)).show
+----+----+----+----------+----------+
|col1|col2|rank|dense_rank|row_number|
+----+----+----+----------+----------+
| a| 10| 1| 1| 1|
| a| 10| 1| 1| 2|
| a| 20| 3| 2| 3|
+----+----+----+----------+----------+
Note that the value "10" exists twice in col2 within the same window (col1 = "a"). That's when you see a difference between the three functions.
I'm showing @Daniel's answer in Python, and I'm adding a comparison with count('*') that can be used if you want to get at most the top-n rows per group.
from pyspark.sql.session import SparkSession
from pyspark.sql import Window
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    ['a', 10], ['a', 20], ['a', 30],
    ['a', 40], ['a', 40], ['a', 40], ['a', 40],
    ['a', 50], ['a', 50], ['a', 60]], ['part_col', 'order_col'])
window = Window.partitionBy("part_col").orderBy("order_col")
df = (df
      .withColumn("rank", F.rank().over(window))
      .withColumn("dense_rank", F.dense_rank().over(window))
      .withColumn("row_number", F.row_number().over(window))
      .withColumn("count", F.count('*').over(window))
      )
df.show()
+--------+---------+----+----------+----------+-----+
|part_col|order_col|rank|dense_rank|row_number|count|
+--------+---------+----+----------+----------+-----+
| a| 10| 1| 1| 1| 1|
| a| 20| 2| 2| 2| 2|
| a| 30| 3| 3| 3| 3|
| a| 40| 4| 4| 4| 7|
| a| 40| 4| 4| 5| 7|
| a| 40| 4| 4| 6| 7|
| a| 40| 4| 4| 7| 7|
| a| 50| 8| 5| 8| 9|
| a| 50| 8| 5| 9| 9|
| a| 60| 10| 6| 10| 10|
+--------+---------+----+----------+----------+-----+
For example, if you want to take at most 4 rows per group without arbitrarily picking among the four "40" values of the sorting column:
df.where("count <= 4").show()
+--------+---------+----+----------+----------+-----+
|part_col|order_col|rank|dense_rank|row_number|count|
+--------+---------+----+----------+----------+-----+
| a| 10| 1| 1| 1| 1|
| a| 20| 2| 2| 2| 2|
| a| 30| 3| 3| 3| 3|
+--------+---------+----+----------+----------+-----+
In summary, if you filter each of those columns with <= n you will get the following (a quick row_number example follows the list):
rank: at least n rows
dense_rank: at least n different order_col values
row_number: exactly n rows
count: at most n rows
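For comparison, filtering on row_number returns exactly n rows per group, but which of the tied "40" rows survive is arbitrary:
df.where("row_number <= 4").show()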
