Aggregating columns conditionally with pyspark? - apache-spark

I have the following dataset. I want to group by all the variables and split the data based on the conditions below.
However, I get an error when I try the code below.
CUST_ID NAME GENDER AGE
id_01 MONEY F 43
id_02 BAKER F 32
id_03 VOICE M 31
id_04 TIME M 56
id_05 TIME F 24
id_06 TALENT F 28
id_07 ISLAND F 21
id_08 ISLAND F 27
id_09 TUME F 24
id_10 TIME F 75
id_11 SKY M 35
id_12 VOICE M 70
from pyspark.sql.functions import *
df.groupBy("CUST_ID", "NAME", "GENDER", "AGE").agg(
CUST_ID.count AS TOTAL
SUM(WHEN ((AGE >= 18 AND AGE <= 34) AND GENDER = 'M') THEN COUNT(CUST_ID) ELSE 0 END AS "M18-34")
SUM(WHEN ((AGE >= 18 AND AGE <= 34) AND GENDER = 'F') THEN COUNT(CUST_ID) ELSE 0 END AS "F18-34")
SUM(WHEN ((AGE >= 18 AND AGE <= 34 THEN COUNT(CUST_ID) ELSE 0 END AS "18-34")
SUM(WHEN ((AGE >= 25 AND AGE <= 54 THEN COUNT(CUST_ID) ELSE 0 END AS "25-54")
SUM(WHEN ((AGE >= 25 AND AGE <= 54) AND GENDER = 'F') THEN COUNT(CUST_ID) ELSE 0 END AS "F25-54")
SUM(WHEN ((AGE >= 25 AND AGE <= 54) AND GENDER = 'M') THEN COUNT(CUST_ID) ELSE 0 END AS "M25-54")
)
I would appreciate your help/suggestions
Thanks in advance

Your code is neither valid pyspark nor valid Spark SQL; there are many syntax problems. I attempted to fix them below, though I'm not sure that's exactly what you wanted. When you have this many SQL-like expressions, it's easier to use Spark SQL directly rather than the pyspark API:
df.createOrReplaceTempView('df')
result = spark.sql("""
SELECT NAME,
COUNT(CUST_ID) AS TOTAL,
SUM(CASE WHEN ((AGE >= 18 AND AGE <= 34) AND GENDER = 'M') THEN 1 ELSE 0 END) AS `M18-34`,
SUM(CASE WHEN ((AGE >= 18 AND AGE <= 34) AND GENDER = 'F') THEN 1 ELSE 0 END) AS `F18-34`,
SUM(CASE WHEN (AGE >= 18 AND AGE <= 34) THEN 1 ELSE 0 END) AS `18-34`,
SUM(CASE WHEN (AGE >= 25 AND AGE <= 54) THEN 1 ELSE 0 END) AS `25-54`,
SUM(CASE WHEN ((AGE >= 25 AND AGE <= 54) AND GENDER = 'F') THEN 1 ELSE 0 END) AS `F25-54`,
SUM(CASE WHEN ((AGE >= 25 AND AGE <= 54) AND GENDER = 'M') THEN 1 ELSE 0 END) AS `M25-54`
FROM df
GROUP BY NAME
""")
result.show()
+------+-----+------+------+-----+-----+------+------+
| NAME|TOTAL|M18-34|F18-34|18-34|25-54|F25-54|M25-54|
+------+-----+------+------+-----+-----+------+------+
|ISLAND| 2| 0| 2| 2| 1| 1| 0|
| MONEY| 1| 0| 0| 0| 1| 1| 0|
| TIME| 3| 0| 1| 1| 0| 0| 0|
| VOICE| 2| 1| 0| 1| 1| 0| 1|
| TUME| 1| 0| 1| 1| 0| 0| 0|
| BAKER| 1| 0| 1| 1| 1| 1| 0|
|TALENT| 1| 0| 1| 1| 1| 1| 0|
| SKY| 1| 0| 0| 0| 1| 0| 1|
+------+-----+------+------+-----+-----+------+------+
If you want a pyspark solution, here's an example of how to do it for the first count column; the rest follow the same pattern (a full sketch appears after the output below).
import pyspark.sql.functions as F
result = df.groupBy('Name').agg(
F.count('CUST_ID').alias('TOTAL'),
F.count(F.when(F.expr("(AGE >= 18 AND AGE <= 34) AND GENDER = 'M'"), 1)).alias("M18-34")
)
result.show()
+------+-----+------+
| Name|TOTAL|M18-34|
+------+-----+------+
|ISLAND| 2| 0|
| MONEY| 1| 0|
| TIME| 3| 0|
| VOICE| 2| 1|
| TUME| 1| 0|
| BAKER| 1| 0|
|TALENT| 1| 0|
| SKY| 1| 0|
+------+-----+------+
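For completeness, here's one way the remaining columns could be written in pyspark, mirroring the Spark SQL above. This is just a sketch; the bucket helper is an illustrative name, not an API function:
import pyspark.sql.functions as F

# count() ignores nulls, so rows failing the condition (where when()
# yields null) are not counted
def bucket(cond, name):
    return F.count(F.when(F.expr(cond), 1)).alias(name)

result = df.groupBy('NAME').agg(
    F.count('CUST_ID').alias('TOTAL'),
    bucket("AGE BETWEEN 18 AND 34 AND GENDER = 'M'", 'M18-34'),
    bucket("AGE BETWEEN 18 AND 34 AND GENDER = 'F'", 'F18-34'),
    bucket("AGE BETWEEN 18 AND 34", '18-34'),
    bucket("AGE BETWEEN 25 AND 54", '25-54'),
    bucket("AGE BETWEEN 25 AND 54 AND GENDER = 'F'", 'F25-54'),
    bucket("AGE BETWEEN 25 AND 54 AND GENDER = 'M'", 'M25-54'))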

Related

Categorize customers into buckets based on criteria

I have a dataframe with Customer_ID and Invoice_date, and I want to classify each customer as Active, New, Lapsed, or Lost. The data covers July 2021 to June 2022 (12 months).
The criteria for each split are:
Active customer = present at least once in (Apr, May, Jun 22) and at least once in (Jul 21 to Mar 22)
New customer = present only in (Apr, May, Jun 22) and in no other month
Lapsed customer = present in (Jan, Feb, Mar 22) but not in (Apr, May, Jun 22)
Lost customer = present in (Jul to Dec 21) but not in (Jan to Jun 22)
So far, I have tried to create a function using the code below:
max_date = F.max(more_cust.INVOICE_DATE)
two_months = F.date_sub(more_cust.INVOICE_DATE, 60)
three_months = F.date_sub(more_cust.INVOICE_DATE, 90)
six_months = F.date_sub(more_cust.INVOICE_DATE, 180)
one_year = F.date_sub(more_cust.INVOICE_DATE, 360)
def recency_bucket(df1):
    customer = dict()
    df1 = df1.sort("INVOICE_DATE", ascending=False)
    var_date = df1.rdd.map(lambda x: x.INVOICE_DATE).collect()
    cust_list = df1.rdd.map(lambda x: x.CUST_ID).collect()
    customer = customer.withColumn("CUST_ID", df1.collect[0]["cust_list"])
I want the output to look like this:
You can categorise your invoice dates into quarters, say 1 (Jul to Sep 21), 2 (Oct to Dec 21), 3 (Jan to Mar 22), 4 (Apr to Jun 22).
Invoice data
cust_id invoice_date
c1 2021-07-05
c2 2022-02-01
c2 2022-05-10
c3 2022-02-01
c4 2022-04-10
Invoice data with quarter
df = df.withColumn("quarter", F.quarter("invoice_date")).withColumn("quarter", F.when((F.col("quarter")+2) > 4,
(F.col("quarter")+2) % 4).otherwise(F.col("quarter")+2))
+-------+------------+-------+
|cust_id|invoice_date|quarter|
+-------+------------+-------+
| c1| 2021-07-05| 1|
| c2| 2022-02-01| 3|
| c2| 2022-05-10| 4|
| c3| 2022-02-01| 3|
| c4| 2022-04-10| 4|
+-------+------------+-------+
Create a pivot table, then define rules based on the bucket criteria to categorise the customers:
cust_quarter = df.groupBy("cust_id").pivot("quarter", [1,2,3,4]).count().fillna(0)
cust_quarter.show()
+-------+---+---+---+---+
|cust_id| 1| 2| 3| 4|
+-------+---+---+---+---+
| c1| 1| 0| 0| 0|
| c4| 0| 0| 0| 1|
| c3| 0| 0| 1| 0|
| c2| 0| 0| 1| 1|
+-------+---+---+---+---+
new = ((F.col("4") > 0) & (F.col("1") + F.col("2") + F.col("3") == 0))
active = ((F.col("4") > 0) & (F.col("1") + F.col("2") + F.col("3") > 0))
loss = ((F.col("1") + F.col("2") > 0) & (F.col("3") + F.col("4") == 0))
lapsed = ((F.col("3") > 0) & (F.col("1") + F.col("2") + F.col("4") == 0))
bucket_rules = F.when(new, "new").when(active, "active").when(loss, "loss").when(lapsed, "lapsed")
cust_quarter = cust_quarter.withColumn("bucket", bucket_rules)
cust_quarter.show()
+-------+---+---+---+---+------+
|cust_id| 1| 2| 3| 4|bucket|
+-------+---+---+---+---+------+
| c1| 1| 0| 0| 0| loss|
| c4| 0| 0| 0| 1| new|
| c3| 0| 0| 1| 0|lapsed|
| c2| 0| 0| 1| 1|active|
+-------+---+---+---+---+------+
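One caveat: a customer matching none of the rules ends up with a null bucket. If you prefer an explicit label, the when chain accepts a default via otherwise (the "other" label here is just illustrative):
bucket_rules = (F.when(new, "new").when(active, "active")
                .when(loss, "loss").when(lapsed, "lapsed")
                .otherwise("other"))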

Get the previous row value using spark sql

I have a table like this.
Id prod val
1 0 0
2 0 0
3 1 1000
4 0 0
5 1 2000
6 1 4000
7 1 3000
I want to add a new column new_val with this condition: if prod = 0, new_val should take its value from the next row where prod = 1;
if prod = 1, it should have the same value as the val column. How do I achieve this using Spark SQL?
Id prod val new_val
1 0 0 1000
2 0 0 1000
3 1 1000 1000
4 0 0 2000
5 1 2000 2000
6 1 4000 4000
7 1 3000 3000
Any help is greatly appreciated
You can use something like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.orderBy("id")
df = df.withColumn("new_val", F.when(F.col("prod") == 0, F.lag("val").over(w)).otherwise(F.col("val")))
What we are basically doing is using an if-else condition:
when prod == 0, take the lag of val, i.e. the value of the previous row (over a window ordered by the id column); when prod == 1, keep the present value of the column.
You can achieve that with:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.orderBy("id").rowsBetween(0, Window.unboundedFollowing)
df
.withColumn("new_val", when($"prod" === 0, null).otherwise($"val"))
.withColumn("new_val", first("new_val", ignoreNulls = true).over(w))
First, it creates a new column that keeps val where prod = 1 and is null otherwise:
+---+----+----+-------+
| id|prod| val|new_val|
+---+----+----+-------+
| 1| 0| 0| null|
| 2| 0| 0| null|
| 3| 1|1000| 1000|
| 4| 0| 0| null|
| 5| 1|2000| 2000|
| 6| 1|4000| 4000|
| 7| 1|3000| 3000|
+---+----+----+-------+
Then it replaces each null with the first non-null value among the following rows:
+---+----+----+-------+
| id|prod| val|new_val|
+---+----+----+-------+
| 1| 0| 0| 1000|
| 2| 0| 0| 1000|
| 3| 1|1000| 1000|
| 4| 0| 0| 2000|
| 5| 1|2000| 2000|
| 6| 1|4000| 4000|
| 7| 1|3000| 3000|
+---+----+----+-------+
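For a pyspark version of the same approach (a sketch, assuming the same df with id, prod and val columns):
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# forward-looking frame: from the current row to the end of the partition
w = Window.orderBy("id").rowsBetween(0, Window.unboundedFollowing)
df = (df
    .withColumn("new_val", F.when(F.col("prod") == 0, None).otherwise(F.col("val")))
    .withColumn("new_val", F.first("new_val", ignorenulls=True).over(w)))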

Hive query to find the count for the weeks in middle

I have a table like below
id week count
A100 201008 2
A100 201009 9
A100 201010 16
A100 201011 23
A100 201012 30
A100 201013 36
A100 201015 43
A100 201017 50
A100 201018 57
A100 201019 63
A100 201023 70
A100 201024 82
A100 201025 88
A100 201026 95
A100 201027 102
Here we can see that the following weeks are missing:
First, 201014 is missing.
Second, 201016 is missing.
Third, weeks 201020, 201021 and 201022 are missing.
My requirement is that whenever weeks are missing, we show the count of the previous week.
In this case output should be :
id week count
A100 201008 2
A100 201009 9
A100 201010 16
A100 201011 23
A100 201012 30
A100 201013 36
A100 201014 36
A100 201015 43
A100 201016 43
A100 201017 50
A100 201018 57
A100 201019 63
A100 201020 63
A100 201021 63
A100 201022 63
A100 201023 70
A100 201024 82
A100 201025 88
A100 201026 95
A100 201027 102
How can I achieve this using Hive/PySpark?
Although this answer is in Scala, the Python version will look almost the same and can be easily converted.
Step 1:
Find the rows that have missing week(s) prior to them.
Sample Input:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
//sample input
val input = sc.parallelize(List(("A100",201008,2), ("A100",201009,9),("A100",201014,4), ("A100",201016,45))).toDF("id","week","count")
scala> input.show
+----+------+-----+
| id| week|count|
+----+------+-----+
|A100|201008| 2|
|A100|201009| 9|
|A100|201014| 4| //missing 4 rows
|A100|201016| 45| //missing 1 row
+----+------+-----+
To find them, we can use the lead() function on week and compute the difference between leadWeek and week. If leadWeek - week is greater than 1, rows are missing in between.
val diffDF = input
.withColumn("leadWeek", lead($"week", 1).over(Window.partitionBy($"id").orderBy($"week"))) // partitioning by id & computing lead()
.withColumn("diff", ($"leadWeek" - $"week") -1) // finding difference between leadWeek & week
scala> diffDF.show
+----+------+-----+--------+----+
| id| week|count|leadWeek|diff|
+----+------+-----+--------+----+
|A100|201008| 2| 201009| 0| // diff -> 0 represents that no rows needs to be added
|A100|201009| 9| 201014| 4| // diff -> 4 represents 4 rows are to be added after this row.
|A100|201014| 4| 201016| 1| // diff -> 1 represents 1 row to be added after this row.
|A100|201016| 45| null|null|
+----+------+-----+--------+----+
Step 2:
If diff is >= 1: create and add n new rows (InputWithDiff, see the case class below) as specified by diff, incrementing the week value accordingly, and return the newly created rows along with the original row.
If diff is 0: no additional computation is required; return the original row as it is.
Convert diffDF to a Dataset for ease of computation.
case class InputWithDiff(id: Option[String], week: Option[Int], count: Option[Int], leadWeek: Option[Int], diff: Option[Int])
val diffDS = diffDF.as[InputWithDiff]
val output = diffDS.flatMap(x => {
  val diff = x.diff.getOrElse(0)
  diff match {
    case n if n >= 1 => x :: (1 to diff).map(y => InputWithDiff(x.id, Some(x.week.get + y), x.count, x.leadWeek, x.diff)).toList // create and append new rows
    case _ => List(x) // return as it is
  }
}).drop("leadWeek", "diff").toDF // drop unnecessary columns & convert to DF
final output:
scala> output.show
+----+------+-----+
| id| week|count|
+----+------+-----+
|A100|201008| 2|
|A100|201009| 9|
|A100|201010| 9|
|A100|201011| 9|
|A100|201012| 9|
|A100|201013| 9|
|A100|201014| 4|
|A100|201015| 4|
|A100|201016| 45|
+----+------+-----+
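As noted above, the Python version looks almost the same. A rough pyspark sketch of the flatMap step (assuming a diffDF built with lead() exactly as above) could be:
from pyspark.sql import Row

def fill_gaps(row):
    # emit the original row, then one row per missing week, carrying this
    # row's count forward (row['count'] is used because Row.count is
    # shadowed by the tuple method)
    yield Row(id=row.id, week=row.week, count=row['count'])
    for y in range(1, (row.diff or 0) + 1):
        yield Row(id=row.id, week=row.week + y, count=row['count'])

output = diffDF.rdd.flatMap(fill_gaps).toDF()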
PySpark solution
Sample Data
df = spark.createDataFrame([(1,201901,10),
(1,201903,9),
(1,201904,21),
(1,201906,42),
(1,201909,3),
(1,201912,56)
],['id','weeknum','val'])
df.show()
+---+-------+---+
| id|weeknum|val|
+---+-------+---+
| 1| 201901| 10|
| 1| 201903| 9|
| 1| 201904| 21|
| 1| 201906| 42|
| 1| 201909| 3|
| 1| 201912| 56|
+---+-------+---+
1) The basic idea is to create all combinations of ids and weeks (from the minimum week value to the maximum) with a cross join.
from pyspark.sql.functions import min,max,sum,when
from pyspark.sql import Window
min_max_week = df.agg(min(df.weeknum),max(df.weeknum)).collect()
#Generate all weeks using range
all_weeks = spark.range(min_max_week[0][0],min_max_week[0][1]+1)
all_weeks = all_weeks.withColumnRenamed('id','weekno')
#all_weeks.show()
id_all_weeks = df.select(df.id).distinct().crossJoin(all_weeks).withColumnRenamed('id','aid')
#id_all_weeks.show()
2) Thereafter, left joining the original dataframe onto these combinations helps identify the missing values:
res = id_all_weeks.join(df,(df.id == id_all_weeks.aid) & (df.weeknum == id_all_weeks.weekno),'left')
res.show()
+---+------+----+-------+----+
|aid|weekno| id|weeknum| val|
+---+------+----+-------+----+
| 1|201911|null| null|null|
| 1|201905|null| null|null|
| 1|201903| 1| 201903| 9|
| 1|201904| 1| 201904| 21|
| 1|201901| 1| 201901| 10|
| 1|201906| 1| 201906| 42|
| 1|201908|null| null|null|
| 1|201910|null| null|null|
| 1|201912| 1| 201912| 56|
| 1|201907|null| null|null|
| 1|201902|null| null|null|
| 1|201909| 1| 201909| 3|
+---+------+----+-------+----+
3) Then, use a combination of window functions: a running sum to assign groups, and max to fill in the missing values once the groups are classified.
w1 = Window.partitionBy(res.aid).orderBy(res.weekno)
groups = res.withColumn("grp",sum(when(res.id.isNull(),0).otherwise(1)).over(w1))
w2 = Window.partitionBy(groups.aid,groups.grp)
missing_values_filled = groups.withColumn('filled',max(groups.val).over(w2)) #select required columns as needed
missing_values_filled.show()
+---+------+----+-------+----+---+------+
|aid|weekno| id|weeknum| val|grp|filled|
+---+------+----+-------+----+---+------+
| 1|201901| 1| 201901| 10| 1| 10|
| 1|201902|null| null|null| 1| 10|
| 1|201903| 1| 201903| 9| 2| 9|
| 1|201904| 1| 201904| 21| 3| 21|
| 1|201905|null| null|null| 3| 21|
| 1|201906| 1| 201906| 42| 4| 42|
| 1|201907|null| null|null| 4| 42|
| 1|201908|null| null|null| 4| 42|
| 1|201909| 1| 201909| 3| 5| 3|
| 1|201910|null| null|null| 5| 3|
| 1|201911|null| null|null| 5| 3|
| 1|201912| 1| 201912| 56| 6| 56|
+---+------+----+-------+----+---+------+
Hive query with the same logic as described above (assuming a table with all weeks, weeks_table, can be created):
select id,weeknum,max(val) over(partition by id,grp) as val
from (select i.id
,w.weeknum
,t.val
,sum(case when t.id is null then 0 else 1 end) over(partition by i.id order by w.weeknum) as grp
from (select distinct id from tbl) i
cross join weeks_table w
left join tbl t on t.id = i.id and w.weeknum = t.weeknum
) t
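If you need to materialise that weeks_table, here is a minimal pyspark sketch (assuming, as in the sample data above, that all week numbers fall within a single year so the range is contiguous):
from pyspark.sql.functions import min as min_, max as max_

# derive the week range from the data itself
bounds = df.agg(min_('weeknum'), max_('weeknum')).collect()[0]
weeks = spark.range(bounds[0], bounds[1] + 1).withColumnRenamed('id', 'weeknum')
weeks.createOrReplaceTempView('weeks_table')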

Add a priority column in PySpark dataframe

I have a PySpark dataframe (input_dataframe) which looks like below:
id col1 col2 col3 col4 col_check
101 1 0 1 1 -1
102 0 1 1 0 -1
103 1 1 0 1 -1
104 0 0 1 1 -1
I want a PySpark function (update_col_check) that updates the col_check column of this dataframe. I will pass one column name as an argument to this function. The function should check whether the value of that column is 1 and, if so, set col_check to that column name. Let us say I am passing col2 as an argument to this function:
output_dataframe = update_col_check(input_dataframe, 'col2')
So, my output_dataframe should look like below:
id col1 col2 col3 col4 col_check
101 1 0 1 1 -1
102 0 1 1 0 col2
103 1 1 0 1 col2
104 0 0 1 1 -1
Can I achieve this using PySpark? Any help will be appreciated.
You can do this fairly straightforwardly with the functions when and otherwise:
from pyspark.sql.functions import when, lit
def update_col_check(df, col_name):
    return df.withColumn('col_check', when(df[col_name] == 1, lit(col_name)).otherwise(df['col_check']))
update_col_check(df, 'col1').show()
+---+----+----+----+----+---------+
| id|col1|col2|col3|col4|col_check|
+---+----+----+----+----+---------+
|101| 1| 0| 1| 1| col1|
|102| 0| 1| 1| 0| -1|
|103| 1| 1| 0| 1| col1|
|104| 0| 0| 1| 1| -1|
+---+----+----+----+----+---------+
update_col_check(df, 'col2').show()
+---+----+----+----+----+---------+
| id|col1|col2|col3|col4|col_check|
+---+----+----+----+----+---------+
|101| 1| 0| 1| 1| -1|
|102| 0| 1| 1| 0| col2|
|103| 1| 1| 0| 1| col2|
|104| 0| 0| 1| 1| -1|
+---+----+----+----+----+---------+
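Since update_col_check returns a new dataframe, you can chain calls when several columns need checking; later matches overwrite earlier ones:
result = update_col_check(update_col_check(df, 'col1'), 'col2')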

pyspark dataframe -- Why are null values recognized differently in below scenarios?

Why does isNull() behave differently in the scenarios below?
PySpark 1.6
Python 2.6.6
Definition of two dataframes:
df_t1 = sqlContext.sql("select 1 id, 9 num union all select 1 id, 2 num union all select 2 id, 3 num")
df_t2 = sqlContext.sql("select 1 id, 1 start, 3 stop union all select 3 id, 1 start, 9 stop")
Scenario 1:
df_t1.join(df_t2, (df_t1.id == df_t2.id) & (df_t1.num >= df_t2.start) & (df_t1.num <= df_t2.stop), "left").select([df_t2.start, df_t2.start.isNull()]).show()
Output 1:
+-----+-------------+
|start|isnull(start)|
+-----+-------------+
| null| false|
| 1| false|
| null| false|
+-----+-------------+
Scenario 2:
df_new = df_t1.join(df_t2, (df_t1.id == df_t2.id) & (df_t1.num >= df_t2.start) & (df_t1.num <= df_t2.stop), "left")
df_new.select([df_new.start, df_new.start.isNull()]).show()
Output 2:
+-----+-------------+
|start|isnull(start)|
+-----+-------------+
| null| true|
| 1| false|
| null| true|
+-----+-------------+
Scenario 3:
df_t1.join(df_t2, (df_t1.id == df_t2.id) & (df_t1.num >= df_t2.start) & (df_t1.num <= df_t2.stop), "left").filter("start is null").show()
Output 3:
+---+---+----+-----+----+
| id|num| id|start|stop|
+---+---+----+-----+----+
| 1| 9|null| null|null|
| 2| 3|null| null|null|
+---+---+----+-----+----+
Thank you.
