I have a dataframe with Customer_ID and Invoice_date, and I want to put each customer into one of four categories: Active, New, Loss or Lapsed. The data covers July 2021 to June 2022 (12 months).
The criteria for each bucket are:
Active customer = Customer present once in (Apr, May, Jun 22) & once
in (Jul 21 to Mar 22)
New customer = Customer present just for (Apr, May, Jun 22) and no
other month
Lapsed customer = Customer present just for (Jan, Feb, Mar 22) and
not for (Apr, May, Jun 22)
Lost customer = Customer present from (Jul to Dec 21) and not for
(Jan to Jun 22)
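In plain Python terms the rules amount to the following (a sketch only, assuming each customer is reduced to the set of "YYYY-MM" months they appear in; note that a customer seen only in Jul-Dec 21 and Jan-Mar 22 is not covered by the rules as stated):

```python
def bucket(months):
    # Sketch of the stated rules over a set of "YYYY-MM" month strings.
    q4 = {"2022-04", "2022-05", "2022-06"}          # Apr, May, Jun 22
    q3 = {"2022-01", "2022-02", "2022-03"}          # Jan, Feb, Mar 22
    h1 = {"2021-%02d" % m for m in range(7, 13)}    # Jul to Dec 21
    if months & q4:
        return "active" if months & (q3 | h1) else "new"
    if months & q3:
        # "just for Jan-Mar 22": a customer seen in both earlier periods
        # is not covered by the stated criteria, so it stays unclassified.
        return "lapsed" if not months & h1 else None
    if months & h1:
        return "lost"
    return None

bucket({"2022-05", "2021-08"})   # 'active'
```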
So far I have tried to create a function using the code below:
max_date = F.max(more_cust.INVOICE_DATE)
two_months = F.date_sub(more_cust.INVOICE_DATE, 60)
three_months = F.date_sub(more_cust.INVOICE_DATE, 90)
six_months = F.date_sub(more_cust.INVOICE_DATE, 180)
one_year = F.date_sub(more_cust.INVOICE_DATE, 360)
def recency_bucket(df1):
customer = dict()
df1 = df1.sort("INVOICE_DATE", ascending=False)
var_date = df1.rdd.map(lambda x: x.INVOICE_DATE).collect()
cust_list = df1.rdd.map(lambda x: x.CUST_ID).collect()
customer = customer.withColumn("CUST_ID", df1.collect[0]["cust_list"])
I want the output to look like this:
You can categorise your invoice dates into quarters, say 1 (Jul-Sep 21), 2 (Oct-Dec 21), 3 (Jan-Mar 22), 4 (Apr-Jun 22).
Invoice data
cust_id invoice_date
c1 2021-07-05
c2 2022-02-01
c2 2022-05-10
c3 2022-02-01
c4 2022-04-10
Invoice data with quarter
df = df.withColumn("quarter", F.quarter("invoice_date")).withColumn("quarter", F.when((F.col("quarter")+2) > 4,
(F.col("quarter")+2) % 4).otherwise(F.col("quarter")+2))
+-------+------------+-------+
|cust_id|invoice_date|quarter|
+-------+------------+-------+
| c1| 2021-07-05| 1|
| c2| 2022-02-01| 3|
| c2| 2022-05-10| 4|
| c3| 2022-02-01| 3|
| c4| 2022-04-10| 4|
+-------+------------+-------+
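The arithmetic in the withColumn above shifts calendar quarters so the year starts in July: calendar Q3 (Jul-Sep) maps to 1, Q4 to 2, Q1 to 3 and Q2 to 4. A plain-Python check of the same expression:

```python
def fiscal_quarter(calendar_q):
    # Same expression as the F.when above: add 2, wrap values above 4.
    q = calendar_q + 2
    return q % 4 if q > 4 else q

[fiscal_quarter(q) for q in (1, 2, 3, 4)]   # [3, 4, 1, 2]
```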
Create a pivot table, then define rules based on the bucket criteria and categorise the customers:
cust_quarter = df.groupBy("cust_id").pivot("quarter", [1,2,3,4]).count().fillna(0)
cust_quarter.show()
+-------+---+---+---+---+
|cust_id| 1| 2| 3| 4|
+-------+---+---+---+---+
| c1| 1| 0| 0| 0|
| c4| 0| 0| 0| 1|
| c3| 0| 0| 1| 0|
| c2| 0| 0| 1| 1|
+-------+---+---+---+---+
new = ((F.col("4") > 0) & (F.col("1") + F.col("2") + F.col("3") == 0))
active = ((F.col("4") > 0) & (F.col("1") + F.col("2") + F.col("3") > 0))
loss = ((F.col("1") + F.col("2") > 0) & (F.col("3") + F.col("4") == 0))
lapsed = ((F.col("3") > 0) & (F.col("1") + F.col("2") + F.col("4") == 0))
bucket_rules = F.when(new, "new").when(active, "active").when(loss, "loss").when(lapsed, "lapsed")
cust_quarter = cust_quarter.withColumn("bucket", bucket_rules)
cust_quarter.show()
+-------+---+---+---+---+------+
|cust_id| 1| 2| 3| 4|bucket|
+-------+---+---+---+---+------+
| c1| 1| 0| 0| 0| loss|
| c4| 0| 0| 0| 1| new|
| c3| 0| 0| 1| 0|lapsed|
|     c2|  0|  0|  1|  1|active|
+-------+---+---+---+---+------+
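The same when-chain as plain Python over the pivoted quarter counts (a sketch; the rules are evaluated in the same order, so the first match wins):

```python
def bucket_from_quarters(q1, q2, q3, q4):
    # Mirrors the bucket_rules when-chain over per-quarter invoice counts.
    if q4 > 0 and q1 + q2 + q3 == 0:
        return "new"
    if q4 > 0:
        return "active"
    if q1 + q2 > 0 and q3 + q4 == 0:
        return "loss"
    if q3 > 0 and q1 + q2 + q4 == 0:
        return "lapsed"
    return None

# The four customers from the pivot table above:
[bucket_from_quarters(*q)
 for q in [(1, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 1, 1)]]
# ['loss', 'new', 'lapsed', 'active']
```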
I have following dataset. I want to group all variables and split the data based on the conditions below.
However, I get an error when I try the code below.
CUST_ID NAME GENDER AGE
id_01 MONEY F 43
id_02 BAKER F 32
id_03 VOICE M 31
id_04 TIME M 56
id_05 TIME F 24
id_06 TALENT F 28
id_07 ISLAND F 21
id_08 ISLAND F 27
id_09 TUME F 24
id_10 TIME F 75
id_11 SKY M 35
id_12 VOICE M 70
from pyspark.sql.functions import *
df.groupBy("CUST_ID", "NAME", "GENDER", "AGE").agg(
CUST_ID.count AS TOTAL
SUM(WHEN ((AGE >= 18 AND AGE <= 34) AND GENDER = 'M') THEN COUNT(CUST_ID) ELSE 0 END AS "M18-34")
SUM(WHEN ((AGE >= 18 AND AGE <= 34) AND GENDER = 'F') THEN COUNT(CUST_ID) ELSE 0 END AS "F18-34")
SUM(WHEN ((AGE >= 18 AND AGE <= 34 THEN COUNT(CUST_ID) ELSE 0 END AS "18-34")
SUM(WHEN ((AGE >= 25 AND AGE <= 54 THEN COUNT(CUST_ID) ELSE 0 END AS "25-54")
SUM(WHEN ((AGE >= 25 AND AGE <= 54) AND GENDER = 'F') THEN COUNT(CUST_ID) ELSE 0 END AS "F25-54")
SUM(WHEN ((AGE >= 25 AND AGE <= 54) AND GENDER = 'M') THEN COUNT(CUST_ID) ELSE 0 END AS "M25-54")
)
I would appreciate your help/suggestions. Thanks in advance.
Your code is neither valid pyspark nor valid Spark SQL; there are many syntax problems. I attempted to fix them below, though I'm not sure that's exactly what you wanted. With this many SQL-like expressions, it's easier to use Spark SQL directly rather than the pyspark API:
df.createOrReplaceTempView('df')
result = spark.sql("""
SELECT NAME,
COUNT(CUST_ID) AS TOTAL,
SUM(CASE WHEN ((AGE >= 18 AND AGE <= 34) AND GENDER = 'M') THEN 1 ELSE 0 END) AS `M18-34`,
SUM(CASE WHEN ((AGE >= 18 AND AGE <= 34) AND GENDER = 'F') THEN 1 ELSE 0 END) AS `F18-34`,
SUM(CASE WHEN (AGE >= 18 AND AGE <= 34) THEN 1 ELSE 0 END) AS `18-34`,
SUM(CASE WHEN (AGE >= 25 AND AGE <= 54) THEN 1 ELSE 0 END) AS `25-54`,
SUM(CASE WHEN ((AGE >= 25 AND AGE <= 54) AND GENDER = 'F') THEN 1 ELSE 0 END) AS `F25-54`,
SUM(CASE WHEN ((AGE >= 25 AND AGE <= 54) AND GENDER = 'M') THEN 1 ELSE 0 END) AS `M25-54`
FROM df
GROUP BY NAME
""")
result.show()
+------+-----+------+------+-----+-----+------+------+
| NAME|TOTAL|M18-34|F18-34|18-34|25-54|F25-54|M25-54|
+------+-----+------+------+-----+-----+------+------+
|ISLAND| 2| 0| 2| 2| 1| 1| 0|
| MONEY| 1| 0| 0| 0| 1| 1| 0|
| TIME| 3| 0| 1| 1| 0| 0| 0|
| VOICE| 2| 1| 0| 1| 1| 0| 1|
| TUME| 1| 0| 1| 1| 0| 0| 0|
| BAKER| 1| 0| 1| 1| 1| 1| 0|
|TALENT| 1| 0| 1| 1| 1| 1| 0|
| SKY| 1| 0| 0| 0| 1| 0| 1|
+------+-----+------+------+-----+-----+------+------+
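The CASE WHEN pattern above is conditional aggregation: per group, add 1 for each row that matches the condition. The same idea in plain Python, sketched over a list of dicts rather than the Spark API (only two of the buckets are shown):

```python
rows = [
    {"NAME": "ISLAND", "GENDER": "F", "AGE": 21},
    {"NAME": "ISLAND", "GENDER": "F", "AGE": 27},
]

def agg(rows):
    out = {}
    for r in rows:
        g = out.setdefault(r["NAME"], {"TOTAL": 0, "F18-34": 0, "F25-54": 0})
        g["TOTAL"] += 1
        if 18 <= r["AGE"] <= 34 and r["GENDER"] == "F":
            g["F18-34"] += 1
        if 25 <= r["AGE"] <= 54 and r["GENDER"] == "F":
            g["F25-54"] += 1
    return out

agg(rows)["ISLAND"]   # {'TOTAL': 2, 'F18-34': 2, 'F25-54': 1}
```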
If you want a pyspark solution, here's an example of how to do it for the first column. You can work out the rest straightforwardly.
import pyspark.sql.functions as F
result = df.groupBy('Name').agg(
F.count('CUST_ID').alias('TOTAL'),
F.count(F.when(F.expr("(AGE >= 18 AND AGE <= 34) AND GENDER = 'M'"), 1)).alias("M18-34")
)
result.show()
+------+-----+------+
| Name|TOTAL|M18-34|
+------+-----+------+
|ISLAND| 2| 0|
| MONEY| 1| 0|
| TIME| 3| 0|
| VOICE| 2| 1|
| TUME| 1| 0|
| BAKER| 1| 0|
|TALENT| 1| 0|
| SKY| 1| 0|
+------+-----+------+
I have a PySpark dataframe-
df = spark.createDataFrame([
("u1", 0),
("u2", 0),
("u3", 1),
("u4", 2),
("u5", 3),
("u6", 2),],
['user_id', 'medals'])
df.show()
Output-
+-------+------+
|user_id|medals|
+-------+------+
| u1| 0|
| u2| 0|
| u3| 1|
| u4| 2|
| u5| 3|
| u6| 2|
+-------+------+
I want to get the distribution of the medals column for all the users. So if there are n unique values in the medals column, I want n columns in the output dataframe with corresponding number of users who received that many medals.
The output for the data given above should look like-
+--------+--------+--------+--------+
|medals_0|medals_1|medals_2|medals_3|
+--------+--------+--------+--------+
| 2| 1| 2| 1|
+--------+--------+--------+--------+
How do I achieve this?
It's a simple pivot:
df.groupBy().pivot("medals").count().show()
+---+---+---+---+
| 0| 1| 2| 3|
+---+---+---+---+
| 2| 1| 2| 1|
+---+---+---+---+
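The pivot here is just a frequency count; the plain-Python equivalent is a collections.Counter over the medals column:

```python
from collections import Counter

medals = [0, 0, 1, 2, 3, 2]     # the medals column from the example
dist = {"medals_%d" % k: v for k, v in sorted(Counter(medals).items())}
dist   # {'medals_0': 2, 'medals_1': 1, 'medals_2': 2, 'medals_3': 1}
```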
If you need a cosmetic touch to add the word "medals" to the column names, you can do this:
medals_df = df.groupBy().pivot("medals").count()
for col in medals_df.columns:
medals_df = medals_df.withColumnRenamed(col, "medals_{}".format(col))
medals_df.show()
+--------+--------+--------+--------+
|medals_0|medals_1|medals_2|medals_3|
+--------+--------+--------+--------+
| 2| 1| 2| 1|
+--------+--------+--------+--------+
I thought rangeBetween(start, end) defines a frame over values in the range (cur_value + start, cur_value + end). https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/expressions/WindowSpec.html
But I saw an example that used a descending orderBy() on a timestamp and then (Window.unboundedPreceding, 0) with rangeBetween, which led me to explore the following example:
dd = spark.createDataFrame(
[(1, "a"), (3, "a"), (3, "a"), (1, "b"), (2, "b"), (3, "b")],
['id', 'category']
)
dd.show()
# output
+---+--------+
| id|category|
+---+--------+
| 1| a|
| 3| a|
| 3| a|
| 1| b|
| 2| b|
| 3| b|
+---+--------+
It seems to include preceding rows whose values are higher by up to 1.
from pyspark.sql.window import Window
from pyspark.sql.functions import desc, sum as Fsum

byCategoryOrderedById = Window.partitionBy('category')\
    .orderBy(desc('id'))\
    .rangeBetween(-1, Window.currentRow)
dd.withColumn("sum", Fsum('id').over(byCategoryOrderedById)).show()
# output
+---+--------+---+
| id|category|sum|
+---+--------+---+
| 3| b| 3|
| 2| b| 5|
| 1| b| 3|
| 3| a| 6|
| 3| a| 6|
| 1| a| 1|
+---+--------+---+
And with start set to -2, it includes values greater by up to 2, again from preceding rows.
byCategoryOrderedById = Window.partitionBy('category')\
.orderBy(desc('id'))\
.rangeBetween(-2,Window.currentRow)
dd.withColumn("sum", Fsum('id').over(byCategoryOrderedById)).show()
# output
+---+--------+---+
| id|category|sum|
+---+--------+---+
| 3| b| 3|
| 2| b| 5|
| 1| b| 6|
| 3| a| 6|
| 3| a| 6|
| 1| a| 7|
+---+--------+---+
So, what is the exact behavior of rangeBetween with desc orderBy?
It's not well documented, but with range (value-based) frames, the ascending or descending order affects which values are included in the frame.
Let's take the example you provided:
RANGE BETWEEN 1 PRECEDING AND CURRENT ROW
Depending on the order by direction, 1 PRECEDING means:
current_row_value - 1 if ASC
current_row_value + 1 if DESC
Consider the row with value 1 in partition b.
With the descending order, the frame includes:
(current_value and all preceding values x where x <= current_value + 1) = (1, 2)
With the ascending order, the frame includes:
(current_value and all preceding values x where x >= current_value - 1) = (1)
PS: using rangeBetween(-1, Window.currentRow) with desc ordering is just equivalent to rangeBetween(Window.currentRow, 1) with asc ordering.
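To make the frame determination concrete, here is a plain-Python emulation (a sketch, not how Spark implements it): with descending order, the frame of rangeBetween(-n, Window.currentRow) for a row with value cur contains every partition value x with cur <= x <= cur + n.

```python
def range_sum_desc(values, preceding=1):
    # Emulates sum over rangeBetween(-preceding, currentRow) with
    # ORDER BY ... DESC: for each row, sum all partition values in
    # [cur, cur + preceding].
    return [sum(x for x in values if cur <= x <= cur + preceding)
            for cur in values]

range_sum_desc([3, 2, 1])       # [3, 5, 3]  (partition b, start = -1)
range_sum_desc([3, 2, 1], 2)    # [3, 5, 6]  (partition b, start = -2)
```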
I have a table like below
id week count
A100 201008 2
A100 201009 9
A100 201010 16
A100 201011 23
A100 201012 30
A100 201013 36
A100 201015 43
A100 201017 50
A100 201018 57
A100 201019 63
A100 201023 70
A100 201024 82
A100 201025 88
A100 201026 95
A100 201027 102
Here, we can see that the following weeks are missing:
First, 201014 is missing.
Second, 201016 is missing.
Third, weeks 201020, 201021 and 201022 are missing.
My requirement is that whenever a week is missing, we show the count of the previous week.
In this case the output should be:
id week count
A100 201008 2
A100 201009 9
A100 201010 16
A100 201011 23
A100 201012 30
A100 201013 36
A100 201014 36
A100 201015 43
A100 201016 43
A100 201017 50
A100 201018 57
A100 201019 63
A100 201020 63
A100 201021 63
A100 201022 63
A100 201023 70
A100 201024 82
A100 201025 88
A100 201026 95
A100 201027 102
How can I achieve this requirement using hive/pyspark?
Although this answer is in Scala, the Python version will look almost the same and can be easily converted.
Step 1:
Find the rows that are followed by missing week(s).
Sample Input:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
//sample input
val input = sc.parallelize(List(("A100",201008,2), ("A100",201009,9),("A100",201014,4), ("A100",201016,45))).toDF("id","week","count")
scala> input.show
+----+------+-----+
| id| week|count|
+----+------+-----+
|A100|201008| 2|
|A100|201009| 9|
|A100|201014| 4| //missing 4 rows
|A100|201016| 45| //missing 1 row
+----+------+-----+
To find them, we can use the lead() function on week and compute the difference between leadWeek and week. If leadWeek - week is greater than 1, there are missing rows after that row.
val diffDF = input
.withColumn("leadWeek", lead($"week", 1).over(Window.partitionBy($"id").orderBy($"week"))) // partitioning by id & computing lead()
.withColumn("diff", ($"leadWeek" - $"week") -1) // finding difference between leadWeek & week
scala> diffDF.show
+----+------+-----+--------+----+
| id| week|count|leadWeek|diff|
+----+------+-----+--------+----+
|A100|201008| 2| 201009| 0| // diff -> 0 represents that no rows needs to be added
|A100|201009| 9| 201014| 4| // diff -> 4 represents 4 rows are to be added after this row.
|A100|201014| 4| 201016| 1| // diff -> 1 represents 1 row to be added after this row.
|A100|201016| 45| null|null|
+----+------+-----+--------+----+
Step 2:
If the diff is >= 1: create and add that many rows (InputWithDiff, see the case class below), incrementing the week value for each, and return the newly created rows along with the original row.
If the diff is 0, no additional computation is required; return the original row as it is.
Convert diffDF to a Dataset for ease of computation.
case class InputWithDiff(id: Option[String], week: Option[Int], count: Option[Int], leadWeek: Option[Int], diff: Option[Int])
val diffDS = diffDF.as[InputWithDiff]
val output = diffDS.flatMap(x => {
val diff = x.diff.getOrElse(0)
diff match {
case n if n >= 1 => x :: (1 to diff).map(y => InputWithDiff(x.id, Some(x.week.get + y), x.count,x.leadWeek, x.diff)).toList // create and append new Rows
case _ => List(x) // return as it is
}
}).drop("leadWeek", "diff").toDF // drop unnecessary columns & convert to DF
final output:
scala> output.show
+----+------+-----+
| id| week|count|
+----+------+-----+
|A100|201008| 2|
|A100|201009| 9|
|A100|201010| 9|
|A100|201011| 9|
|A100|201012| 9|
|A100|201013| 9|
|A100|201014| 4|
|A100|201015| 4|
|A100|201016| 45|
+----+------+-----+
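The flatMap step can be sketched in plain Python as follows (like the Scala code, this assumes week numbers are consecutive integers, so it does not handle year boundaries such as 201052 -> 201101):

```python
def fill_weeks(rows):
    # rows: (id, week, count) tuples sorted by week; fills each gap by
    # repeating the previous row's count with an incremented week.
    out = []
    for (id_, week, cnt), nxt in zip(rows, rows[1:] + [None]):
        out.append((id_, week, cnt))
        gap = (nxt[1] - week - 1) if nxt else 0
        out.extend((id_, week + i, cnt) for i in range(1, gap + 1))
    return out

fill_weeks([("A100", 201008, 2), ("A100", 201009, 9),
            ("A100", 201014, 4), ("A100", 201016, 45)])
```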
PySpark solution
Sample Data
df = spark.createDataFrame([(1,201901,10),
(1,201903,9),
(1,201904,21),
(1,201906,42),
(1,201909,3),
(1,201912,56)
],['id','weeknum','val'])
df.show()
+---+-------+---+
| id|weeknum|val|
+---+-------+---+
| 1| 201901| 10|
| 1| 201903| 9|
| 1| 201904| 21|
| 1| 201906| 42|
| 1| 201909| 3|
| 1| 201912| 56|
+---+-------+---+
1) The basic idea is to create all combinations of ids and weeks (from the minimum possible value to the maximum) with a cross join.
from pyspark.sql.functions import min,max,sum,when
from pyspark.sql import Window
min_max_week = df.agg(min(df.weeknum),max(df.weeknum)).collect()
#Generate all weeks using range
all_weeks = spark.range(min_max_week[0][0],min_max_week[0][1]+1)
all_weeks = all_weeks.withColumnRenamed('id','weekno')
#all_weeks.show()
id_all_weeks = df.select(df.id).distinct().crossJoin(all_weeks).withColumnRenamed('id','aid')
#id_all_weeks.show()
2) Thereafter, left-joining the original dataframe onto these combinations helps identify the missing values.
res = id_all_weeks.join(df,(df.id == id_all_weeks.aid) & (df.weeknum == id_all_weeks.weekno),'left')
res.show()
+---+------+----+-------+----+
|aid|weekno| id|weeknum| val|
+---+------+----+-------+----+
| 1|201911|null| null|null|
| 1|201905|null| null|null|
| 1|201903| 1| 201903| 9|
| 1|201904| 1| 201904| 21|
| 1|201901| 1| 201901| 10|
| 1|201906| 1| 201906| 42|
| 1|201908|null| null|null|
| 1|201910|null| null|null|
| 1|201912| 1| 201912| 56|
| 1|201907|null| null|null|
| 1|201902|null| null|null|
| 1|201909| 1| 201909| 3|
+---+------+----+-------+----+
3) Then, use a combination of window functions: sum to assign groups, and max to fill in the missing values once the groups are assigned.
w1 = Window.partitionBy(res.aid).orderBy(res.weekno)
groups = res.withColumn("grp",sum(when(res.id.isNull(),0).otherwise(1)).over(w1))
w2 = Window.partitionBy(groups.aid,groups.grp)
missing_values_filled = groups.withColumn('filled',max(groups.val).over(w2)) #select required columns as needed
missing_values_filled.show()
+---+------+----+-------+----+---+------+
|aid|weekno| id|weeknum| val|grp|filled|
+---+------+----+-------+----+---+------+
| 1|201901| 1| 201901| 10| 1| 10|
| 1|201902|null| null|null| 1| 10|
| 1|201903| 1| 201903| 9| 2| 9|
| 1|201904| 1| 201904| 21| 3| 21|
| 1|201905|null| null|null| 3| 21|
| 1|201906| 1| 201906| 42| 4| 42|
| 1|201907|null| null|null| 4| 42|
| 1|201908|null| null|null| 4| 42|
| 1|201909| 1| 201909| 3| 5| 3|
| 1|201910|null| null|null| 5| 3|
| 1|201911|null| null|null| 5| 3|
| 1|201912| 1| 201912| 56| 6| 56|
+---+------+----+-------+----+---+------+
A Hive query with the same logic as described above (assuming a table with all weeks can be created):
select id,weeknum,max(val) over(partition by id,grp) as val
from (select i.id
,w.weeknum
,t.val
,sum(case when t.id is null then 0 else 1 end) over(partition by i.id order by w.weeknum) as grp
from (select distinct id from tbl) i
cross join weeks_table w
left join tbl t on t.id = i.id and w.weeknum = t.weeknum
) t
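The sum/max trick in step 3 and in the Hive query is a forward fill: the running sum of "row present" indicators puts each missing week into the group of the last present week, and max then propagates that group's single non-null value. In plain Python the net effect is:

```python
def forward_fill(vals):
    # vals: the val column ordered by week, None where the week was missing;
    # each None is replaced by the last non-None value seen so far.
    filled, last = [], None
    for v in vals:
        last = v if v is not None else last
        filled.append(last)
    return filled

forward_fill([10, None, 9, 21, None, 42, None, None, 3, None, None, 56])
# [10, 10, 9, 21, 21, 42, 42, 42, 3, 3, 3, 56]
```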
I need to write some custom code using multiple columns within a group of my data.
My custom code sets a flag if a value is over a threshold, but suppresses the flag if it is within a certain time of a previous flag.
Here is some sample code:
df = spark.createDataFrame(
[
("a", 1, 0),
("a", 2, 1),
("a", 3, 1),
("a", 4, 1),
("a", 5, 1),
("a", 6, 0),
("a", 7, 1),
("a", 8, 1),
("b", 1, 0),
("b", 2, 1)
],
["group_col","order_col", "flag_col"]
)
df.show()
+---------+---------+--------+
|group_col|order_col|flag_col|
+---------+---------+--------+
| a| 1| 0|
| a| 2| 1|
| a| 3| 1|
| a| 4| 1|
| a| 5| 1|
| a| 6| 0|
| a| 7| 1|
| a| 8| 1|
| b| 1| 0|
| b| 2| 1|
+---------+---------+--------+
from pyspark.sql.functions import udf, col, asc
from pyspark.sql.window import Window
def _suppress(dates=None, alert_flags=None, window=2):
    sup_alert_flag = alert_flag
    last_alert_date = None
    for i, alert_flag in enumerate(alert_flag):
        current_date = dates[i]
        if alert_flag == 1:
            if not last_alert_date:
                sup_alert_flag[i] = 1
                last_alert_date = current_date
            elif (current_date - last_alert_date) > window:
                sup_alert_flag[i] = 1
                last_alert_date = current_date
            else:
                sup_alert_flag[i] = 0
        else:
            alert_flag = 0
    return sup_alert_flag
suppress_udf = udf(_suppress, DoubleType())
df_out = df.withColumn("supressed_flag_col", suppress_udf(dates=col("order_col"), alert_flags=col("flag_col"), window=4).Window.partitionBy(col("group_col")).orderBy(asc("order_col")))
df_out.show()
The above fails, but my expected output is the following:
+---------+---------+--------+------------------+
|group_col|order_col|flag_col|supressed_flag_col|
+---------+---------+--------+------------------+
| a| 1| 0| 0|
| a| 2| 1| 1|
| a| 3| 1| 0|
| a| 4| 1| 0|
| a| 5| 1| 0|
| a| 6| 0| 0|
| a| 7| 1| 1|
| a| 8| 1| 0|
| b| 1| 0| 0|
| b| 2| 1| 1|
+---------+---------+--------+------------------+
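For reference, the intended suppression logic can be sketched in plain Python (assuming order_col stands in for the date and window is the minimum gap before a new flag is kept):

```python
def suppress(orders, flags, window=4):
    # Keep a flag only if no previously kept flag is within `window` of it.
    out, last = [], None
    for o, f in zip(orders, flags):
        keep = f == 1 and (last is None or o - last > window)
        if keep:
            last = o
        out.append(1 if keep else 0)
    return out

suppress([1, 2, 3, 4, 5, 6, 7, 8], [0, 1, 1, 1, 1, 0, 1, 1])
# [0, 1, 0, 0, 0, 0, 1, 0]
```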
Editing answer after more thought.
The general problem seems to be that the result of the current row depends upon result of the previous row. In effect, there is a recurrence relationship. I haven't found a good way to implement a recursive UDF in Spark. There are several challenges that result from the assumed distributed nature of the data in Spark which would make this difficult to achieve. At least in my mind. The following solution should work but may not scale for large data sets.
from pyspark.sql import Row
import pyspark.sql.functions as F
import pyspark.sql.types as T
suppress_flag_row = Row("order_col", "flag_col", "res_flag")
def suppress_flag(date_alert_flags, window_size):
    sorted_alerts = sorted(date_alert_flags, key=lambda x: x["order_col"])
    res_flags = []
    last_alert_date = None
    for row in sorted_alerts:
        current_date = row["order_col"]
        aflag = row["flag_col"]
        if aflag == 1 and (not last_alert_date or (current_date - last_alert_date) > window_size):
            res = suppress_flag_row(current_date, aflag, True)
            last_alert_date = current_date
        else:
            res = suppress_flag_row(current_date, aflag, False)
        res_flags.append(res)
    return res_flags
in_fields = [T.StructField("order_col", T.IntegerType(), nullable=True )]
in_fields.append( T.StructField("flag_col", T.IntegerType(), nullable=True) )
out_fields = in_fields
out_fields.append(T.StructField("res_flag", T.BooleanType(), nullable=True) )
out_schema = T.StructType(out_fields)
suppress_udf = F.udf(suppress_flag, T.ArrayType(out_schema) )
window_size = 4
tmp = df.groupBy("group_col").agg( F.collect_list( F.struct( F.col("order_col"), F.col("flag_col") ) ).alias("date_alert_flags"))
tmp2 = tmp.select(F.col("group_col"), suppress_udf(F.col("date_alert_flags"), F.lit(window_size)).alias("suppress_res"))
expand_fields = [F.col("group_col")] + [F.col("res_expand")[f.name].alias(f.name) for f in out_fields]
final_df = tmp2.select(F.col("group_col"), F.explode(F.col("suppress_res")).alias("res_expand")).select( expand_fields )
I think you don't need a custom function for this. You can use the rowsBetween option along with a window to look at the previous 5 rows. Please check and let me know if I missed something.
>>> from pyspark.sql import functions as F
>>> from pyspark.sql import Window
>>> w = Window.partitionBy('group_col').orderBy('order_col').rowsBetween(-5,-1)
>>> df = df.withColumn('supr_flag_col',F.when(F.sum('flag_col').over(w) == 0,1).otherwise(0))
>>> df.orderBy('group_col','order_col').show()
+---------+---------+--------+-------------+
|group_col|order_col|flag_col|supr_flag_col|
+---------+---------+--------+-------------+
| a| 1| 0| 0|
| a| 2| 1| 1|
| a| 3| 1| 0|
| b| 1| 0| 0|
| b| 2| 1| 1|
+---------+---------+--------+-------------+