I came across the window functions PySpark offers and they seem quite useful. Unfortunately, when trying to solve problems with them, I often can't get them to work. Now I wonder whether my problem can be solved with a window function at all...
Here's my task:
Starting with a dataframe mockup like below:
values = [(0,"a",True,True),(1,"a",True,True),(2,"a",True,True),(3,"a",True,True),(4,"a",True,True),
(0,"b",False,True),(1,"b",True,True),(2,"b",True,True),(3,"b",False,True),(4,"b",True,True),
(0,"c",False,True),(1,"c",True,True),(2,"c",True,True),(3,"c",False,True),(4,"c",False,True)]
columns = ['index', 'name', 'Res','solution']
mockup = spark.createDataFrame(values, columns)
mockup.show()
+-----+----+-----+--------+
|index|name| Res|solution|
+-----+----+-----+--------+
| 0| a| true| true|
| 1| a| true| true|
| 2| a| true| true|
| 3| a| true| true|
| 4| a| true| true|
| 0| b|false| true|
| 1| b| true| true|
| 2| b| true| true|
| 3| b|false| true|
| 4| b| true| true|
| 0| c|false| true|
| 1| c| true| true|
| 2| c| true| true|
| 3| c|false| true|
| 4| c|false| true|
+-----+----+-----+--------+
I now want to update the solution column using multiple conditions.
If there are more than two false values per group (name), OR if there are exactly two false values in a group but neither of them is at index = 0, the solution column should be false for the whole group; otherwise true.
See the desired outcome:
+-----+----+-----+--------+
|index|name| Res|solution|
+-----+----+-----+--------+
| 0| a| true| true|
| 1| a| true| true|
| 2| a| true| true|
| 3| a| true| true|
| 4| a| true| true|
| 0| b|false| true|
| 1| b| true| true|
| 2| b| true| true|
| 3| b|false| true|
| 4| b| true| true|
| 0| c|false| false|
| 1| c| true| false|
| 2| c| true| false|
| 3| c|false| false|
| 4| c|false| false|
+-----+----+-----+--------+
I managed to solve the problem with the solution below, but I hope there is a more elegant way to do this - maybe with windows. With window functions I always struggle with where to put the window and how to use it inside a more complex "when" condition.
My not so great solution :0)
df = mockup.filter(mockup.Res == False).groupby(mockup.name).count()
false_filter_1 = df.filter(F.col('count') > 2) \
    .select('name').collect()
false_filter_2 = df.filter(F.col('count') == 2) \
    .select('name').collect()
array_false_1 = [str(row['name']) for row in false_filter_1]
array_false_2 = [str(row['name']) for row in false_filter_2]
false_filter_3 = mockup.filter((mockup['index'] == 0) & (mockup['Res'] == False)) \
    .select('name').collect()
array_false_3 = [str(row['name']) for row in false_filter_3]
mockup = mockup.withColumn("over_2",
        F.when(F.col('name').isin(array_false_1), True).otherwise(False)) \
    .withColumn("eq_2",
        F.when(F.col('name').isin(array_false_2), True).otherwise(False)) \
    .withColumn("at0",
        F.when(F.col('name').isin(array_false_3), True).otherwise(False)) \
    .withColumn("solution",
        F.when(((F.col('eq_2') == True) & (F.col('at0') == True)) |
               ((F.col('over_2') == False) & (F.col('eq_2') == False)),
               True).otherwise(False)) \
    .drop('over_2', 'eq_2', 'at0')
mockup.show()
Here's my attempt at coding up your description with window functions: count the false values per group with one window, and check for a false at index 0 with another.
from pyspark.sql import functions as F, Window
df2 = mockup.withColumn(
'false_count',
F.count(F.when(F.col('Res') == False, 1)).over(Window.partitionBy('name'))
).withColumn(
'false_at_0',
F.count(F.when((F.col('Res') == False) & (F.col('index') == 0), 1)).over(Window.partitionBy('name'))
).withColumn(
'solution',
~((F.col('false_count') > 2) | ((F.col('false_count') == 2) & (F.col('false_at_0') != 1)))
)
df2.show()
+-----+----+-----+--------+-----------+----------+
|index|name| Res|solution|false_count|false_at_0|
+-----+----+-----+--------+-----------+----------+
| 0| c|false| false| 3| 1|
| 1| c| true| false| 3| 1|
| 2| c| true| false| 3| 1|
| 3| c|false| false| 3| 1|
| 4| c|false| false| 3| 1|
| 0| b|false| true| 2| 1|
| 1| b| true| true| 2| 1|
| 2| b| true| true| 2| 1|
| 3| b|false| true| 2| 1|
| 4| b| true| true| 2| 1|
| 0| a| true| true| 0| 0|
| 1| a| true| true| 0| 0|
| 2| a| true| true| 0| 0|
| 3| a| true| true| 0| 0|
| 4| a| true| true| 0| 0|
+-----+----+-----+--------+-----------+----------+
Another, perhaps more useful, example:
values = [(0,"a",True,True),(1,"a",True,True),(2,"a",True,True),(3,"a",True,True),(4,"a",True,True),
(0,"b",False,True),(1,"b",True,True),(2,"b",True,True),(3,"b",False,True),(4,"b",True,True),
(0,"c",True,True),(1,"c",False,True),(2,"c",True,True),(3,"c",False,True),(4,"c",True,True),
(0,"d",True,True),(1,"d",False,True),(2,"d",False,True),(3,"d",False,True),(4,"d",True,True)]
columns = ['index', 'name', 'Res','solution']
mockup = spark.createDataFrame(values, columns)
which, after being processed by the code above, gives:
+-----+----+-----+--------+-----------+----------+
|index|name| Res|solution|false_count|false_at_0|
+-----+----+-----+--------+-----------+----------+
| 0| d| true| false| 3| 0|
| 1| d|false| false| 3| 0|
| 2| d|false| false| 3| 0|
| 3| d|false| false| 3| 0|
| 4| d| true| false| 3| 0|
| 0| c| true| false| 2| 0|
| 1| c|false| false| 2| 0|
| 2| c| true| false| 2| 0|
| 3| c|false| false| 2| 0|
| 4| c| true| false| 2| 0|
| 0| b|false| true| 2| 1|
| 1| b| true| true| 2| 1|
| 2| b| true| true| 2| 1|
| 3| b|false| true| 2| 1|
| 4| b| true| true| 2| 1|
| 0| a| true| true| 0| 0|
| 1| a| true| true| 0| 0|
| 2| a| true| true| 0| 0|
| 3| a| true| true| 0| 0|
| 4| a| true| true| 0| 0|
+-----+----+-----+--------+-----------+----------+
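For what it's worth, the two helper columns are only kept for inspection; the same logic can be folded into a single withColumn without materializing them. A minimal sketch under the same column names (untested):
from pyspark.sql import functions as F, Window

w = Window.partitionBy('name')

# number of false Res values in the group
false_count = F.count(F.when(~F.col('Res'), 1)).over(w)
# true if the group has a false at index 0
false_at_0 = F.count(F.when(~F.col('Res') & (F.col('index') == 0), 1)).over(w) > 0

result = mockup.withColumn(
    'solution',
    ~((false_count > 2) | ((false_count == 2) & ~false_at_0))
)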
I'm struggling to figure this out. For each run of consecutive records with reason backfill, I need to find the last such record and copy its app_data onto the preceding non-backfill record with the greatest timestamp (i.e., the last non-backfill record before the run).
Here is what I've tried -
w = Window.orderBy("idx")
w1 = Window.partitionBy('reason').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df_uahr.withColumn('idx',F.monotonically_increasing_id()).withColumn("app_data_new",F.last(F.lead("app_data").over(w)).over(w1)).orderBy("idx").show()
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
|upstart_application_id| reason| created_at| updated_at| app_data|idx|app_data_new|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
| 2|disqualified |2018-07-12 15:57:26|2018-07-12 15:57:26| app_data_a| 0| app_data_c|
| 2| backfill|2020-05-29 17:47:09|2021-05-29 17:47:09| app_data_c| 1| null|
| 2| backfill|2022-03-09 09:47:09|2022-03-09 09:47:09| app_data_d| 2| null|
| 2| test|2022-04-09 09:47:09|2022-04-09 09:47:09| app_data_e| 3| app_data_f|
| 2| test|2022-04-19 09:47:09|2022-04-19 09:47:09|app_data_e_a| 4| app_data_f|
| 2| backfill|2022-05-09 09:47:09|2022-05-09 09:47:09| app_data_f| 5| null|
| 2| after|2023-04-09 09:47:09|2023-04-09 09:47:09| app_data_g| 6| app_data_h|
| 2| backfill|2023-05-09 09:47:09|2023-05-09 09:47:09| app_data_h| 7| null|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
Expected output:
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
|upstart_application_id| reason| created_at| updated_at| app_data|idx|app_data_new|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
| 2|disqualified |2018-07-12 15:57:26|2018-07-12 15:57:26| app_data_a| 0| app_data_d|
| 2| backfill|2020-05-29 17:47:09|2021-05-29 17:47:09| app_data_c| 1| null|
| 2| backfill|2022-03-09 09:47:09|2022-03-09 09:47:09| app_data_d| 2| null|
| 2| test|2022-04-09 09:47:09|2022-04-09 09:47:09| app_data_e| 3| null|
| 2| test|2022-04-19 09:47:09|2022-04-19 09:47:09|app_data_e_a| 4| app_data_f|
| 2| backfill|2022-05-09 09:47:09|2022-05-09 09:47:09| app_data_f| 5| null|
| 2| after|2023-04-09 09:47:09|2023-04-09 09:47:09| app_data_g| 6| app_data_h|
| 2| backfill|2023-05-09 09:47:09|2023-05-09 09:47:09| app_data_h| 7| null|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
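One possible approach (a sketch, not a definitive answer; column names follow the question, and the reason values are assumed to be free of trailing whitespace): flag the row that closes each backfill run, then let the non-backfill row immediately before the run pick that value up through a forward-looking window. As with your attempt, the global Window.orderBy pulls everything into a single partition, so this only suits small data:
from pyspark.sql import functions as F, Window

w = Window.orderBy("idx")
w_fwd = w.rowsBetween(1, Window.unboundedFollowing)

result = (
    df_uahr
    .withColumn("idx", F.monotonically_increasing_id())
    # app_data of the row that closes each consecutive run of backfills
    .withColumn(
        "bf_run_end",
        F.when(
            (F.col("reason") == "backfill")
            & (F.lead("reason").over(w).isNull()
               | (F.lead("reason").over(w) != "backfill")),
            F.col("app_data"),
        ),
    )
    # only the non-backfill row directly before a backfill run receives it
    .withColumn(
        "app_data_new",
        F.when(
            (F.col("reason") != "backfill")
            & (F.lead("reason").over(w) == "backfill"),
            F.first("bf_run_end", ignorenulls=True).over(w_fwd),
        ),
    )
    .drop("bf_run_end")
)
result.orderBy("idx").show()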
I have a Spark dataframe that looks like this:
+---+-----------+-------------------------+---------------+
| id| Phase | Switch | InputFileName |
+---+-----------+-------------------------+---------------+
| 1| 2| 1| fileA|
| 2| 2| 1| fileA|
| 3| 2| 1| fileA|
| 4| 2| 0| fileA|
| 5| 2| 0| fileA|
| 6| 2| 1| fileA|
| 11| 2| 1| fileB|
| 12| 2| 1| fileB|
| 13| 2| 0| fileB|
| 14| 2| 0| fileB|
| 15| 2| 1| fileB|
| 16| 2| 1| fileB|
| 21| 4| 1| fileB|
| 22| 4| 1| fileB|
| 23| 4| 1| fileB|
| 24| 4| 1| fileB|
| 25| 4| 1| fileB|
| 26| 4| 0| fileB|
| 31| 1| 0| fileC|
| 32| 1| 0| fileC|
| 33| 1| 0| fileC|
| 34| 1| 0| fileC|
| 35| 1| 0| fileC|
| 36| 1| 0| fileC|
+---+-----------+-------------------------+---------------+
For each group (a combination of InputFileName and Phase) I need to run a validation function which checks that Switch equals 1 at the very start and end of the group, and that it transitions to 0 at some point in between. The function should add the validation result as a new column. The expected output is below (gaps are just to highlight the different groups):
+---+-----------+-------------------------+---------------+--------+
| id| Phase | Switch | InputFileName | Valid |
+---+-----------+-------------------------+---------------+--------+
| 1| 2| 1| fileA| true |
| 2| 2| 1| fileA| true |
| 3| 2| 1| fileA| true |
| 4| 2| 0| fileA| true |
| 5| 2| 0| fileA| true |
| 6| 2| 1| fileA| true |
| 11| 2| 1| fileB| true |
| 12| 2| 1| fileB| true |
| 13| 2| 0| fileB| true |
| 14| 2| 0| fileB| true |
| 15| 2| 1| fileB| true |
| 16| 2| 1| fileB| true |
| 21| 4| 1| fileB| false|
| 22| 4| 1| fileB| false|
| 23| 4| 1| fileB| false|
| 24| 4| 1| fileB| false|
| 25| 4| 1| fileB| false|
| 26| 4| 0| fileB| false|
| 31| 1| 0| fileC| false|
| 32| 1| 0| fileC| false|
| 33| 1| 0| fileC| false|
| 34| 1| 0| fileC| false|
| 35| 1| 0| fileC| false|
| 36| 1| 0| fileC| false|
+---+-----------+-------------------------+---------------+--------+
I have previously solved this using Pyspark and a Pandas UDF:
df = df.groupBy("InputFileName", "Phase").apply(validate_profile)
#pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def validate_profile(df: pd.DataFrame):
first_valid = True if df["Switch"].iloc[0] == 1 else False
during_valid = (df["Switch"].iloc[1:-1] == 0).any()
last_valid = True if df["Switch"].iloc[-1] == 1 else False
df["Valid"] = first_valid & during_valid & last_valid
return df
However, now I need to rewrite this in Scala. I just want to know the best way of accomplishing this.
I'm currently trying window functions to get the first and last ids of each group:
val minIdWindow = Window.partitionBy("InputFileName", "Phase").orderBy("id")
val maxIdWindow = Window.partitionBy("InputFileName", "Phase").orderBy(col("id").desc)
I can then add the min and max ids as separate columns and use when to get the start and end values of Switch:
df.withColumn("MinId", min("id").over(minIdWindow))
.withColumn("MaxId", max("id").over(maxIdWindow))
.withColumn("Valid", when(
col("id") === col("MinId"), col("Switch")
).when(
col("id") === col("MaxId"), col("Switch")
))
This gets me the start and end values, but I'm not sure how to check if Switch equals 0 in between. Am I on the right track using window functions? Or would you recommend an alternative solution?
Try this:
val wind = Window.partitionBy("InputFileName", "Phase").orderBy("id")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val df1 = df.withColumn("Valid",
  when(first("Switch").over(wind) === 1
    && last("Switch").over(wind) === 1
    && min("Switch").over(wind) === 0, true)
    .otherwise(false))

df1.orderBy("id").show() // ordering for display purposes only
Output:
+---+-----+------+-------------+-----+
| id|Phase|Switch|InputFileName|Valid|
+---+-----+------+-------------+-----+
| 1| 2| 1| fileA| true|
| 2| 2| 1| fileA| true|
| 3| 2| 1| fileA| true|
| 4| 2| 0| fileA| true|
| 5| 2| 0| fileA| true|
| 6| 2| 1| fileA| true|
| 11| 2| 1| fileB| true|
| 12| 2| 1| fileB| true|
| 13| 2| 0| fileB| true|
| 14| 2| 0| fileB| true|
| 15| 2| 1| fileB| true|
| 16| 2| 1| fileB| true|
| 21| 4| 1| fileB|false|
| 22| 4| 1| fileB|false|
| 23| 4| 1| fileB|false|
| 24| 4| 1| fileB|false|
| 25| 4| 1| fileB|false|
| 26| 4| 0| fileB|false|
| 31| 1| 0| fileC|false|
| 32| 1| 0| fileC|false|
+---+-----+------+-------------+-----+
only showing top 20 rows
I have created two DataFrames by executing the commands below. I want to join the two DataFrames so that the result contains the non-duplicate items from both, in PySpark.
df1 = sc.parallelize([
("a",1,1),
("b",2,2),
("d",4,2),
("e",4,1),
("c",3,4)]).toDF(['SID','SSection','SRank'])
df1.show()
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 1| 1|
| b| 2| 2|
| d| 4| 2|
| e| 4| 1|
| c| 3| 4|
+---+--------+-----+
df2 is
df2 = sc.parallelize([
("a",2,1),
("b",2,3),
("f",4,2),
("e",4,1),
("c",3,4)]).toDF(['SID','SSection','SRank'])
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 2| 1|
| b| 2| 3|
| f| 4| 2|
| e| 4| 1|
| c| 3| 4|
+---+--------+-----+
I want to join the two tables above like this:
+---+--------+----------+----------+
|SID|SSection|test1SRank|test2SRank|
+---+--------+----------+----------+
| f| 4| 0| 2|
| e| 4| 1| 1|
| d| 4| 2| 0|
| c| 3| 4| 4|
| b| 2| 2| 3|
| a| 1| 1| 0|
| a| 2| 0| 1|
+---+--------+----------+----------+
This doesn't look like something that can be achieved with a single join. Here's a solution involving multiple joins:
from pyspark.sql.functions import col
d1 = df1.unionAll(df2).select("SID", "SSection").distinct()
t1 = d1.join(df1, ["SID", "SSection"], "leftOuter").select(d1.SID, d1.SSection, col("SRank").alias("test1Srank"))
t2 = d1.join(df2, ["SID", "SSection"], "leftOuter").select(d1.SID, d1.SSection, col("SRank").alias("test2Srank"))
t1.join(t2, ["SID", "SSection"]).na.fill(0).show()
+---+--------+----------+----------+
|SID|SSection|test1Srank|test2Srank|
+---+--------+----------+----------+
| b| 2| 2| 3|
| c| 3| 4| 4|
| d| 4| 2| 0|
| e| 4| 1| 1|
| f| 4| 0| 2|
| a| 1| 1| 0|
| a| 2| 0| 1|
+---+--------+----------+----------+
You can simply rename the SRank columns, do an outer join, and fill the resulting nulls with the na.fill function:
df1.withColumnRenamed("SRank", "test1SRank") \
    .join(df2.withColumnRenamed("SRank", "test2SRank"), ["SID", "SSection"], "outer") \
    .na.fill(0)
I want to find the IDs of groups (or blocks) of trues in a Spark DataFrame. That is, I want to go from this:
>>> df.show()
+---------+-----+
|timestamp| bool|
+---------+-----+
| 1|false|
| 2| true|
| 3| true|
| 4|false|
| 5| true|
| 6| true|
| 7| true|
| 8| true|
| 9|false|
| 10|false|
| 11|false|
| 12|false|
| 13|false|
| 14| true|
| 15| true|
| 16| true|
+---------+-----+
to this:
>>> df.show()
+---------+-----+-----+
|timestamp| bool|block|
+---------+-----+-----+
| 1|false| 0|
| 2| true| 1|
| 3| true| 1|
| 4|false| 0|
| 5| true| 2|
| 6| true| 2|
| 7| true| 2|
| 8| true| 2|
| 9|false| 0|
| 10|false| 0|
| 11|false| 0|
| 12|false| 0|
| 13|false| 0|
| 14| true| 3|
| 15| true| 3|
| 16| true| 3|
+---------+-----+-----+
(the zeros are optional, could be Null or -1 or whatever is easier to implement)
I have a solution in Scala; it should be easy to adapt to PySpark. Consider the following dataframe df:
+---------+-----+
|timestamp| bool|
+---------+-----+
| 1|false|
| 2| true|
| 3| true|
| 4|false|
| 5| true|
| 6| true|
| 7| true|
| 8| true|
| 9|false|
| 10|false|
| 11|false|
| 12|false|
| 13|false|
| 14| true|
| 15| true|
| 16| true|
+---------+-----+
then you could do:
df
.withColumn("prev_bool",lag($"bool",1).over(Window.orderBy($"timestamp")))
.withColumn("block",sum(when(!$"prev_bool" and $"bool",1).otherwise(0)).over(Window.orderBy($"timestamp")))
.drop($"prev_bool")
.withColumn("block",when($"bool",$"block").otherwise(0))
.show()
+---------+-----+-----+
|timestamp| bool|block|
+---------+-----+-----+
| 1|false| 0|
| 2| true| 1|
| 3| true| 1|
| 4|false| 0|
| 5| true| 2|
| 6| true| 2|
| 7| true| 2|
| 8| true| 2|
| 9|false| 0|
| 10|false| 0|
| 11|false| 0|
| 12|false| 0|
| 13|false| 0|
| 14| true| 3|
| 15| true| 3|
| 16| true| 3|
+---------+-----+-----+
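Since the question is in PySpark, here is a sketch of the same idea in Python (untested; note that giving lag a default of False also handles the edge case of a dataframe that starts with true, which the Scala version above would label 0):
from pyspark.sql import functions as F, Window

w = Window.orderBy("timestamp")

result = (
    df
    # previous value of bool; default False so a leading run of trues starts block 1
    .withColumn("prev_bool", F.lag("bool", 1, False).over(w))
    # a false -> true transition starts a new block; the running sum numbers the blocks
    .withColumn(
        "block",
        F.sum(F.when(~F.col("prev_bool") & F.col("bool"), 1).otherwise(0)).over(w),
    )
    .drop("prev_bool")
    # zero out the false rows (could equally be None or -1)
    .withColumn("block", F.when(F.col("bool"), F.col("block")).otherwise(0))
)
result.show()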
I have the following dataframe showing the revenue of purchases.
+-------+--------+-------+
|user_id|visit_id|revenue|
+-------+--------+-------+
| 1| 1| 0|
| 1| 2| 0|
| 1| 3| 0|
| 1| 4| 100|
| 1| 5| 0|
| 1| 6| 0|
| 1| 7| 200|
| 1| 8| 0|
| 1| 9| 10|
+-------+--------+-------+
Ultimately I want the new column purch_revenue to show the revenue generated by the purchase in every row.
As a workaround, I have also tried to introduce a purchase identifier purch_id, which is incremented each time a purchase is made; it is listed here just as a reference.
+-------+--------+-------+-------------+--------+
|user_id|visit_id|revenue|purch_revenue|purch_id|
+-------+--------+-------+-------------+--------+
| 1| 1| 0| 100| 1|
| 1| 2| 0| 100| 1|
| 1| 3| 0| 100| 1|
| 1| 4| 100| 100| 1|
| 1| 5| 0| 200| 2|
| 1| 6| 0| 200| 2|
| 1| 7| 200| 200| 2|
| 1| 8| 0| 10| 3|
| 1| 9| 10| 10| 3|
+-------+--------+-------+-------------+--------+
I've tried to use the lag/lead function like this:
user_timeline = Window.partitionBy("user_id").orderBy("visit_id")
find_rev = fn.when(fn.col("revenue") > 0,fn.col("revenue"))\
.otherwise(fn.lead(fn.col("revenue"), 1).over(user_timeline))
df.withColumn("purch_revenue", find_rev)
This duplicates the revenue column if revenue > 0 and also pulls it up by one row. Clearly, I can chain this for a finite N, but that's not a solution.
Is there a way to apply this recursively until revenue > 0?
Alternatively, is there a way to increment a value based on a condition? I've tried to figure out a way to do that but struggled to find one.
Window functions don't support recursion, but it is not required here. This type of sessionization can be easily handled with a cumulative sum:
from pyspark.sql.functions import col, sum, when, lag
from pyspark.sql.window import Window
w = Window.partitionBy("user_id").orderBy("visit_id")
purch_id = sum(lag(when(
col("revenue") > 0, 1).otherwise(0),
1, 0
).over(w)).over(w) + 1
df.withColumn("purch_id", purch_id).show()
+-------+--------+-------+--------+
|user_id|visit_id|revenue|purch_id|
+-------+--------+-------+--------+
| 1| 1| 0| 1|
| 1| 2| 0| 1|
| 1| 3| 0| 1|
| 1| 4| 100| 1|
| 1| 5| 0| 2|
| 1| 6| 0| 2|
| 1| 7| 200| 2|
| 1| 8| 0| 3|
| 1| 9| 10| 3|
+-------+--------+-------+--------+
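To get from purch_id to the purch_revenue column you ultimately want, one way (a sketch reusing the names above) is to broadcast each purchase's revenue over its group with a second, unordered window:
from pyspark.sql.functions import max as spark_max

# each (user_id, purch_id) group contains exactly one row with revenue > 0,
# so the group's max(revenue) is that purchase's revenue
w_purch = Window.partitionBy("user_id", "purch_id")

df.withColumn("purch_id", purch_id) \
    .withColumn("purch_revenue", spark_max("revenue").over(w_purch)) \
    .show()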