PySpark - Select rows where the column has non-consecutive values after grouping

I have a dataframe of the form:
|user_id| action | day |
------------------------
| d25as | AB | 2 |
| d25as | AB | 3 |
| d25as | AB | 5 |
| m3562 | AB | 1 |
| m3562 | AB | 7 |
| m3562 | AB | 9 |
| ha42a | AB | 3 |
| ha42a | AB | 4 |
| ha42a | AB | 5 |
I want to remove users who are seen only on consecutive days, and keep every user who is seen on at least one non-consecutive day. The resulting dataframe should be:
|user_id| action | day |
------------------------
| d25as | AB | 2 |
| d25as | AB | 3 |
| d25as | AB | 5 |
| m3562 | AB | 1 |
| m3562 | AB | 7 |
| m3562 | AB | 9 |
where the last user has been removed, since they appeared only on consecutive days.
Does anyone know how this can be done in Spark?

This can be done with Spark SQL window functions and without any UDFs. The DataFrame is built in Scala here, but the SQL part is the same in Python:
val df = Seq(("d25as","AB",2),("d25as","AB",3),("d25as","AB",5),("m3562","AB",1),("m3562","AB",7),("m3562","AB",9),("ha42a","AB",3),("ha42a","AB",4),("ha42a","AB",5)).toDF("user_id","action","day")
df.createOrReplaceTempView("qubix")
spark.sql(
""" with t1( select user_id, action, day, row_number() over(partition by user_id order by day)-day diff from qubix),
t2( select user_id, action, day, collect_set(diff) over(partition by user_id) diff2 from t1)
select user_id, action, day from t2 where size(diff2) > 1
""").show(false)
Results:
+-------+------+---+
|user_id|action|day|
+-------+------+---+
|d25as |AB |2 |
|d25as |AB |3 |
|d25as |AB |5 |
|m3562 |AB |1 |
|m3562 |AB |7 |
|m3562 |AB |9 |
+-------+------+---+
PySpark version:
>>> from pyspark.sql.functions import *
>>> values = [('d25as','AB',2),('d25as','AB',3),('d25as','AB',5),
... ('m3562','AB',1),('m3562','AB',7),('m3562','AB',9),
... ('ha42a','AB',3),('ha42a','AB',4),('ha42a','AB',5)]
>>> df = spark.createDataFrame(values,['user_id','action','day'])
>>> df.show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
| ha42a| AB| 3|
| ha42a| AB| 4|
| ha42a| AB| 5|
+-------+------+---+
>>> df.createOrReplaceTempView("qubix")
>>> spark.sql(
... """ with t1( select user_id, action, day, row_number() over(partition by user_id order by day)-day diff from qubix),
... t2( select user_id, action, day, collect_set(diff) over(partition by user_id) diff2 from t1)
... select user_id, action, day from t2 where size(diff2) > 1
... """).show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
+-------+------+---+
>>>
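Why the query works: within a run of consecutive days, row_number() and day both grow by 1 per row, so their difference stays constant, and any gap in the days changes it. A user with more than one distinct difference therefore has at least one gap. As a rough illustration outside Spark, the same check can be sketched in plain Python (the function name non_consecutive_users and the in-memory rows list are assumptions made for this sketch, not part of the answer above):

```python
from collections import defaultdict

rows = [("d25as", "AB", 2), ("d25as", "AB", 3), ("d25as", "AB", 5),
        ("m3562", "AB", 1), ("m3562", "AB", 7), ("m3562", "AB", 9),
        ("ha42a", "AB", 3), ("ha42a", "AB", 4), ("ha42a", "AB", 5)]

def non_consecutive_users(rows):
    """Return the users whose days contain at least one gap."""
    days_by_user = defaultdict(list)
    for user_id, _action, day in rows:
        days_by_user[user_id].append(day)

    keep = set()
    for user_id, days in days_by_user.items():
        # Same idea as row_number() over (partition by user_id order by day) - day
        diffs = {rank - day for rank, day in enumerate(sorted(days), start=1)}
        # More than one distinct difference means the days are not one unbroken run
        if len(diffs) > 1:
            keep.add(user_id)
    return keep

kept_users = non_consecutive_users(rows)
filtered = [row for row in rows if row[0] in kept_users]
```

Note that a repeated day for the same user also changes the difference, so such a user would be kept by this check.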

The comments in the code below explain each step.
from pyspark.sql.functions import udf, collect_list, explode, col
#Creating the DataFrame
values = [('d25as','AB',2),('d25as','AB',3),('d25as','AB',5),
('m3562','AB',1),('m3562','AB',7),('m3562','AB',9),
('ha42a','AB',3),('ha42a','AB',4),('ha42a','AB',5)]
df = sqlContext.createDataFrame(values,['user_id','action','day'])
df.show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
| ha42a| AB| 3|
| ha42a| AB| 4|
| ha42a| AB| 5|
+-------+------+---+
# Grouping together the days in one list.
df = df.groupby(['user_id','action']).agg(collect_list('day'))
df.show()
+-------+------+-----------------+
|user_id|action|collect_list(day)|
+-------+------+-----------------+
| ha42a| AB| [3, 4, 5]|
| m3562| AB| [1, 7, 9]|
| d25as| AB| [2, 3, 5]|
+-------+------+-----------------+
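# Creating a UDF to check whether the days are consecutive. Keep only the False rows.
# The return type is declared as boolean; without it, udf defaults to a string column.
check_consecutive = udf(lambda row: sorted(row) == list(range(min(row), max(row)+1)), 'boolean')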
df = df.withColumn('consecutive',check_consecutive(col('collect_list(day)')))\
.where(col('consecutive')==False)
df.show()
+-------+------+-----------------+-----------+
|user_id|action|collect_list(day)|consecutive|
+-------+------+-----------------+-----------+
| m3562| AB| [1, 7, 9]| false|
| d25as| AB| [2, 3, 5]| false|
+-------+------+-----------------+-----------+
# Finally, exploding the DataFrame from above to get the result.
df = df.withColumn("day", explode(col('collect_list(day)')))\
.drop('consecutive','collect_list(day)')
df.show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
+-------+------+---+
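The check inside the UDF is ordinary Python, so it can be tried without a cluster. A small sketch (the name is_consecutive is used only for this illustration):

```python
def is_consecutive(days):
    """True when the days form one unbroken run with no gaps and no repeats."""
    return sorted(days) == list(range(min(days), max(days) + 1))

# The three collected lists from the grouped DataFrame above
results = {
    "ha42a": is_consecutive([3, 4, 5]),
    "m3562": is_consecutive([1, 7, 9]),
    "d25as": is_consecutive([2, 3, 5]),
}
```

A list with a repeated day, such as [3, 3, 4], also compares unequal to the range, so that user would be treated as non-consecutive.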

Related

PySpark percentile for multiple columns

I want to convert multiple numeric columns of a PySpark DataFrame into their percentile values, without changing the row order.
For example, given an array of column names arr = [Salary, Age, Bonus], convert those columns into percentiles.
Input
+----------+-------------+---------+--------+-----+-------+
| Empl. No | Dept | Pincode | Salary | Age | Bonus |
+----------+-------------+---------+--------+-----+-------+
| 1 | HR | 111 | 1000 | 45 | 100 |
| 2 | Sales | 596 | 500 | 30 | 50 |
| 3 | Manufacture | 895 | 600 | 50 | 400 |
| 4 | HR | 212 | 700 | 26 | 60 |
| 5 | Business | 754 | 350 | 18 | 22 |
+----------+-------------+---------+--------+-----+-------+
Expected output
+----------+-------------+---------+--------+-----+-------+
| Empl. No | Dept | Pincode | Salary | Age | Bonus |
+----------+-------------+---------+--------+-----+-------+
| 1 | HR | 111 | 100 | 80 | 80 |
| 2 | Sales | 596 | 40 | 60 | 40 |
| 3 | Manufacture | 895 | 60 | 100 | 100 |
| 4 | HR | 212 | 80 | 40 | 60 |
| 5 | Business | 754 | 20 | 20 | 20 |
+----------+-------------+---------+--------+-----+-------+
The formula for the percentile of a given element 'x' in the list = (number of elements less than or equal to 'x' / total number of elements) * 100.
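As a quick check of the arithmetic: the expected table above is reproduced by counting the elements less than or equal to x (a strict "less than" would give 0 rather than 20 for the smallest value in each column). A minimal plain-Python sketch, where the helper name percentile_of is an assumption for illustration:

```python
def percentile_of(values):
    """Percentile of each element: 100 * (count of elements <= x) / total."""
    n = len(values)
    return [100 * sum(v <= x for v in values) / n for x in values]

# Column values from the input table, in row order
columns = {
    "Salary": [1000, 500, 600, 700, 350],
    "Age": [45, 30, 50, 26, 18],
    "Bonus": [100, 50, 400, 60, 22],
}
percentiles = {name: percentile_of(vals) for name, vals in columns.items()}
```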
You can use percentile_approx for this, in conjunction with groupBy on the desired columns for which you want the percentile to be calculated.
It is built into pyspark.sql.functions as of Spark 3.1:
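from pyspark.sql import functions as F
# `sql` below is assumed to be an existing SparkSession (or SQLContext)
input_list = [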
(1,"HR",111,1000,45,100)
,(2,"Sales",112,500,30,50)
,(3,"Manufacture",127,600,50,500)
,(4,"Hr",821,700,26,60)
,(5,"Business",754,350,18,22)
]
sparkDF = sql.createDataFrame(input_list,['emp_no','dept','pincode','salary','age','bonus'])
sparkDF.groupBy(['emp_no','dept']).agg(
*[ F.first(F.col('pincode')).alias('pincode') ]
,*[ F.percentile_approx(F.col(col),0.95).alias(col) for col in ['salary','age','bonus'] ]
).show()
+------+-----------+-------+------+---+-----+
|emp_no| dept|pincode|salary|age|bonus|
+------+-----------+-------+------+---+-----+
| 3|Manufacture| 127| 600| 50| 500|
| 1| HR| 111| 1000| 45| 100|
| 2| Sales| 112| 500| 30| 50|
| 5| Business| 754| 350| 18| 22|
| 4| Hr| 821| 700| 26| 60|
+------+-----------+-------+------+---+-----+
Spark has a window function for calculating percentiles which is called percent_rank.
Test df:
from pyspark.sql import SparkSession, functions as F, Window as W
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[(1, "HR", 111, 1000, 45, 100),
(2, "Sales", 596, 500, 30, 50),
(3, "Manufacture", 895, 600, 50, 400),
(4, "HR", 212, 700, 26, 60),
(5, "Business", 754, 350, 18, 22)],
['Empl_No', 'Dept', 'Pincode', 'Salary', 'Age', 'Bonus'])
df.show()
# +-------+-----------+-------+------+---+-----+
# |Empl_No| Dept|Pincode|Salary|Age|Bonus|
# +-------+-----------+-------+------+---+-----+
# | 1| HR| 111| 1000| 45| 100|
# | 2| Sales| 596| 500| 30| 50|
# | 3|Manufacture| 895| 600| 50| 400|
# | 4| HR| 212| 700| 26| 60|
# | 5| Business| 754| 350| 18| 22|
# +-------+-----------+-------+------+---+-----+
percent_rank is computed as (rank - 1) / (number of rows - 1), so the smallest value gets 0 and the biggest value gets 1.
arr = ['Salary', 'Age', 'Bonus']
df = df.select(
*[c for c in df.columns if c not in arr],
*[F.percent_rank().over(W.orderBy(c)).alias(c) for c in arr]
).sort('Empl_No')
df.show()
# +-------+-----------+-------+------+----+-----+
# |Empl_No| Dept|Pincode|Salary| Age|Bonus|
# +-------+-----------+-------+------+----+-----+
# | 1| HR| 111| 1.0|0.75| 0.75|
# | 2| Sales| 596| 0.25| 0.5| 0.25|
# | 3|Manufacture| 895| 0.5| 1.0| 1.0|
# | 4| HR| 212| 0.75|0.25| 0.5|
# | 5| Business| 754| 0.0| 0.0| 0.0|
# +-------+-----------+-------+------+----+-----+
However, your expectation is somewhat different. You expect it to assume 0 as the smallest value even though it does not exist in the columns.
To solve this I will add a row with 0 values and later it will be deleted.
arr = ['Salary', 'Age', 'Bonus']
# Adding a row containing 0 values
df = df.limit(1).withColumn('Dept', F.lit('_tmp')).select(
*[c for c in df.columns if c not in arr],
*[F.lit(0).alias(c) for c in arr]
).union(df)
# Calculating percentiles
df = df.select(
*[c for c in df.columns if c not in arr],
*[F.percent_rank().over(W.orderBy(c)).alias(c) for c in arr]
).sort('Empl_No')
# Removing the fake row
df = df.filter("Dept != '_tmp'")
df.show()
# +-------+-----------+-------+------+---+-----+
# |Empl_No| Dept|Pincode|Salary|Age|Bonus|
# +-------+-----------+-------+------+---+-----+
# | 1| HR| 111| 1.0|0.8| 0.8|
# | 2| Sales| 596| 0.4|0.6| 0.4|
# | 3|Manufacture| 895| 0.6|1.0| 1.0|
# | 4| HR| 212| 0.8|0.4| 0.6|
# | 5| Business| 754| 0.2|0.2| 0.2|
# +-------+-----------+-------+------+---+-----+
You can multiply the percentile by 100 if you like:
*[(100 * F.percent_rank().over(W.orderBy(c))).alias(c) for c in arr]
Then you get...
+-------+-----------+-------+------+-----+-----+
|Empl_No| Dept|Pincode|Salary| Age|Bonus|
+-------+-----------+-------+------+-----+-----+
| 1| HR| 111| 100.0| 80.0| 80.0|
| 2| Sales| 596| 40.0| 60.0| 40.0|
| 3|Manufacture| 895| 60.0|100.0|100.0|
| 4| HR| 212| 80.0| 40.0| 60.0|
| 5| Business| 754| 20.0| 20.0| 20.0|
+-------+-----------+-------+------+-----+-----+
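To see why the temporary 0 row shifts the results, here is a rough plain-Python sketch of the percent_rank arithmetic, (rank - 1) / (n - 1), applied to the Salary column. The function below only mimics the window function for a single unpartitioned column and is an illustration, not Spark code:

```python
def percent_rank(values):
    """Mimic percent_rank over one ordered window: (rank - 1) / (n - 1)."""
    n = len(values)
    if n <= 1:
        return [0.0 for _ in values]
    # rank - 1 equals the number of strictly smaller values (ties share a rank)
    return [sum(v < x for v in values) / (n - 1) for x in values]

salary = [1000, 500, 600, 700, 350]

# Without the extra row: the smallest value gets 0.0
plain = percent_rank(salary)

# With a temporary 0 prepended, then dropped again: the smallest value gets 0.2
with_zero = percent_rank([0] + salary)[1:]
```

With the extra row there are 6 values, so each real salary is divided by 5 instead of 4, which gives the 0.2 to 1.0 scale the question asks for.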

How to execute many expressions in selectExpr

Is it possible to apply several expressions in the same selectExpr?
For example, if I have this DataFrame:
+---+
| i|
+---+
| 10|
| 15|
| 11|
| 56|
+---+
How do I multiply by 2 and rename the column, like this:
df.selectExpr("i*2 as multiplication")
selectExpr accepts a variable number of expression strings:
def selectExpr(exprs: String*): org.apache.spark.sql.DataFrame
If you have several expressions, pass them as separate, comma-separated string arguments. See the code below.
scala> val df = (1 to 10).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> df.selectExpr("id*2 as twotimes", "id * 3 as threetimes").show
+--------+----------+
|twotimes|threetimes|
+--------+----------+
| 2| 3|
| 4| 6|
| 6| 9|
| 8| 12|
| 10| 15|
| 12| 18|
| 14| 21|
| 16| 24|
| 18| 27|
| 20| 30|
+--------+----------+
Yes, you can pass multiple expressions inside the df.selectExpr. https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset#selectExpr(exprs:String*):org.apache.spark.sql.DataFrame
scala> case class Person(name: String, age: Int)
scala> val personDS = Seq(Person("Max", 1), Person("Adam", 2), Person("Muller", 3)).toDS()
scala> personDS.show(false)
+------+---+
|name |age|
+------+---+
|Max |1 |
|Adam |2 |
|Muller|3 |
+------+---+
scala> personDS.selectExpr("age*2 as multiple","name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+
Or else you can also use withColumn to achieve the same results
scala> personDS.withColumn("multiple",$"age"*2).select($"multiple",$"name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+

PySpark: pivot month-level records into a single quarter-level row

I have a dataframe in the form:
+---------+-------+------------+------------------+-------+-------+
| quarter | month | month_rank | unique_customers | units | sales |
+---------+-------+------------+------------------+-------+-------+
| 1       | 1     | 1          | 15               | 30    | 1000  |
| 1       | 2     | 2          | 20               | 35    | 1200  |
| 1       | 3     | 3          | 18               | 40    | 1500  |
| 2       | 4     | 1          | 10               | 25    | 800   |
| 2       | 5     | 2          | 25               | 50    | 2000  |
| 2       | 6     | 3          | 28               | 45    | 1800  |
...
I am trying to group on quarter and track the monthly sales in a columnar fashion such as the following:
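+---------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+
| quarter | month_rank1 | rank1_unique_customers | rank1_units | rank1_sales | month_rank2 | rank2_unique_customers | rank2_units | rank2_sales | month_rank3 | rank3_unique_customers | rank3_units | rank3_sales |
+---------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+
| 1       | 1           | 15                     | 30          | 1000        | 2           | 20                     | 35          | 1200        | 3           | 18                     | 40          | 1500        |
| 2       | 4           | 10                     | 25          | 800         | 5           | 25                     | 50          | 2000        | 6           | 28                     | 45          | 1800        |
+---------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+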
Is this achievable with multiple pivots? I have had no luck creating multiple columns from a pivot. I am thinking I might be able to achieve this result with windowing, but if anyone has run into a similar problem, any suggestions would be greatly appreciated. Thank you!
Pivot on the month_rank column, then aggregate the other columns.
Example:
df=spark.createDataFrame([(1,1,1,15,30,1000),(1,2,2,20,35,1200),(1,3,3,18,40,1500),(2,4,1,10,25,800),(2,5,2,25,50,2000),(2,6,3,28,45,1800)],["quarter","month","month_rank","unique_customers","units","sales"])
df.show()
#+-------+-----+----------+----------------+-----+-----+
#|quarter|month|month_rank|unique_customers|units|sales|
#+-------+-----+----------+----------------+-----+-----+
#| 1| 1| 1| 15| 30| 1000|
#| 1| 2| 2| 20| 35| 1200|
#| 1| 3| 3| 18| 40| 1500|
#| 2| 4| 1| 10| 25| 800|
#| 2| 5| 2| 25| 50| 2000|
#| 2| 6| 3| 28| 45| 1800|
#+-------+-----+----------+----------------+-----+-----+
from pyspark.sql.functions import *
df1=df.\
groupBy("quarter").\
pivot("month_rank").\
agg(first(col("month")),first(col("unique_customers")),first(col("units")),first(col("sales")))
cols=["quarter","month_rank1","rank1_unique_customers","rank1_units","rank1_sales","month_rank2","rank2_unique_customers","rank2_units","rank2_sales","month_rank3","rank3_unique_customers","rank3_units","rank3_sales"]
df1.toDF(*cols).show()
#+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
#|quarter|month_rank1|rank1_unique_customers|rank1_units|rank1_sales|month_rank2|rank2_unique_customers|rank2_units|rank2_sales|month_rank3|rank3_unique_customers|rank3_units|rank3_sales|
#+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
#| 1| 1| 15| 30| 1000| 2| 20| 35| 1200| 3| 18| 40| 1500|
#| 2| 4| 10| 25| 800| 5| 25| 50| 2000| 6| 28| 45| 1800|
#+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
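The hard-coded cols list above could also be generated, which may help if there are more ranks or metrics. A small sketch in plain Python, assuming (as in the output above) that the pivoted columns come out grouped by month_rank in ascending order, with the aggregates in the order they were passed to agg; the names ranks and metrics are introduced here only for illustration:

```python
# Metrics aggregated after first(col("month")), in the same order as in agg(...)
metrics = ["unique_customers", "units", "sales"]
# Distinct month_rank values that the pivot is assumed to produce
ranks = [1, 2, 3]

cols = ["quarter"]
for rank in ranks:
    # first(month) for this rank becomes month_rank<rank>
    cols.append(f"month_rank{rank}")
    # then one column per remaining metric
    cols.extend(f"rank{rank}_{metric}" for metric in metrics)
```

The resulting list can be passed to df1.toDF(*cols) exactly like the hand-written one.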

How do I calculate the start/end of an interval (set of rows) containing identical values?

Assume we have a spark DataFrame that looks like the following (ordered by time):
+------+-------+
| time | value |
+------+-------+
| 1 | A |
| 2 | A |
| 3 | A |
| 4 | B |
| 5 | B |
| 6 | A |
+------+-------+
I'd like to calculate the start/end times of each sequence of uninterrupted values. The expected output from the above DataFrame would be:
+-------+-------+-----+
| value | start | end |
+-------+-------+-----+
| A | 1 | 3 |
| B | 4 | 5 |
| A | 6 | 6 |
+-------+-------+-----+
(The end value for the final row could also be null.)
Doing this with a simple group aggregation:
.groupBy("value")
.agg(
F.min("time").alias("start"),
F.max("time").alias("end")
)
doesn't take into account the fact that the same value can appear in multiple different intervals.
The idea is to create an identifier for each run of consecutive identical values, then group by it to compute the min and max time.
Assuming df is your DataFrame:
from pyspark.sql import functions as F, Window
df = df.withColumn(
"fg",
F.when(
F.lag('value').over(Window.orderBy("time"))==F.col("value"),
0
).otherwise(1)
)
df = df.withColumn(
"rn",
F.sum("fg").over(
Window
.orderBy("time")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
)
At this point, the DataFrame has an identifier (rn) for each consecutive group; fg is 1 on the first row of each group and 0 otherwise.
df.show()
+----+-----+---+---+
|time|value| fg| rn|
+----+-----+---+---+
|   1|    A|  1|  1|
|   2|    A|  0|  1|
|   3|    A|  0|  1|
|   4|    B|  1|  2|
|   5|    B|  0|  2|
|   6|    A|  1|  3|
+----+-----+---+---+
Then you just have to do the aggregation:
df.groupBy(
'value',
"rn"
).agg(
F.min('time').alias("start"),
F.max('time').alias("end")
).drop("rn").show()
+-----+-----+---+
|value|start|end|
+-----+-----+---+
| A| 1| 3|
| B| 4| 5|
| A| 6| 6|
+-----+-----+---+
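The flag-then-running-sum trick can be followed step by step in plain Python. This is only a sketch of the logic on the six sample rows, assumed to be already sorted by time; it is not Spark code, and the variable names are chosen for illustration:

```python
from itertools import accumulate

rows = [(1, "A"), (2, "A"), (3, "A"), (4, "B"), (5, "B"), (6, "A")]
values = [value for _, value in rows]

# fg: 1 when the value differs from the previous row (or there is none), else 0
flags = [1] + [0 if cur == prev else 1 for prev, cur in zip(values, values[1:])]

# rn: running sum of fg, which gives one identifier per uninterrupted run
group_ids = list(accumulate(flags))

# Aggregate start/end per run; rows are sorted, so the last time seen is the max
intervals = {}
for (time, value), gid in zip(rows, group_ids):
    if gid not in intervals:
        intervals[gid] = [value, time, time]
    else:
        intervals[gid][2] = time

result = [tuple(entry) for entry in intervals.values()]
```

Grouping by the run identifier rather than by value alone is what keeps the two separate runs of A apart.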

Splitting the content of a PySpark DataFrame column and aggregating it into new columns

I am trying to extract and split the data within a PySpark DataFrame column and then aggregate it into new columns.
Input Table.
+--+-----------+
|id|description|
+--+-----------+
|1 | 3:2,3|2:1|
|2 | 2 |
|3 | 2:12,16 |
|4 | 3:2,4,6 |
|5 | 2 |
|6 | 2:3,7|2:3|
+--------------+
Desired Output.
+--+-----------+-------+-----------+
|id|description|sum_emp|org_changed|
+--+-----------+-------+-----------+
|1 | 3:2,3|2:1| 5 | 3 |
|2 | 2 | 2 | 0 |
|3 | 2:12,16 | 2 | 2 |
|4 | 3:2,4,6 | 3 | 3 |
|5 | 2 | 2 | 0 |
|6 | 2:3,7|2:3| 4 | 3 |
+--------------+-------+-----------+
The values before each ":" are to be summed. The values after each ":" are to be counted. The | marks the start of a new record (it can otherwise be ignored).
Some data points are as long as 2:3,4,5|3:4,6,3|4:3,7,8
Any help would be greatly appreciated
Scenario Explained:
Consider the 6th id as an example. The 6 refers to a biz unit id. The 'description' column describes the teams within that unit.
The value 2:3,7|2:3 means the following:
1) The first 2, followed by 3 and 7: there are 2 folks in the team, and one of them has been in other orgs for 3 years and for 7 years (perhaps it's the second guy's first company).
2) The second 2, followed by 3: there are again 2 folks in a sub-team, and 1 person has spent 3 years in another org.
Desired output:
sum_emp = total number of employees in that given biz unit.
org_changed = total number of organizations folks in that biz unit have changed.
First let's create our dataframe:
df = spark.createDataFrame(
sc.parallelize([[1,"3:2,3|2:1"],
[2,"2"],
[3,"2:12,16"],
[4,"3:2,4,6"],
[5,"2"],
[6,"2:3,7|2:3"]]),
["id","description"])
+---+-----------+
| id|description|
+---+-----------+
| 1| 3:2,3|2:1|
| 2| 2|
| 3| 2:12,16|
| 4| 3:2,4,6|
| 5| 2|
| 6| 2:3,7|2:3|
+---+-----------+
Next, we'll split the records and explode the resulting array so that each record gets its own row:
import pyspark.sql.functions as psf
df = df.withColumn(
"record",
psf.explode(psf.split("description", r'\|'))
)
+---+-----------+-------+
| id|description| record|
+---+-----------+-------+
| 1| 3:2,3|2:1| 3:2,3|
| 1| 3:2,3|2:1| 2:1|
| 2| 2| 2|
| 3| 2:12,16|2:12,16|
| 4| 3:2,4,6|3:2,4,6|
| 5| 2| 2|
| 6| 2:3,7|2:3| 2:3,7|
| 6| 2:3,7|2:3| 2:3|
+---+-----------+-------+
Now we'll split records into the number of players and a list of years:
df = df.withColumn(
"record",
psf.split("record", ':')
).withColumn(
"nb_players",
psf.col("record")[0]
).withColumn(
"years",
psf.split(psf.col("record")[1], ',')
)
+---+-----------+----------+----------+---------+
| id|description| record|nb_players| years|
+---+-----------+----------+----------+---------+
| 1| 3:2,3|2:1| [3, 2,3]| 3| [2, 3]|
| 1| 3:2,3|2:1| [2, 1]| 2| [1]|
| 2| 2| [2]| 2| null|
| 3| 2:12,16|[2, 12,16]| 2| [12, 16]|
| 4| 3:2,4,6|[3, 2,4,6]| 3|[2, 4, 6]|
| 5| 2| [2]| 2| null|
| 6| 2:3,7|2:3| [2, 3,7]| 2| [3, 7]|
| 6| 2:3,7|2:3| [2, 3]| 2| [3]|
+---+-----------+----------+----------+---------+
Finally, for each id, we sum the number of players and the number of entries in years (show() returns None, so the result is not assigned back to df):
df.withColumn(
"years_size",
psf.when(psf.size("years") > 0, psf.size("years")).otherwise(0)
).groupby("id").agg(
psf.first("description").alias("description"),
psf.sum("nb_players").alias("sum_emp"),
psf.sum("years_size").alias("org_changed")
).sort("id").show()
+---+-----------+-------+-----------+
| id|description|sum_emp|org_changed|
+---+-----------+-------+-----------+
| 1| 3:2,3|2:1| 5.0| 3|
| 2| 2| 2.0| 0|
| 3| 2:12,16| 2.0| 2|
| 4| 3:2,4,6| 3.0| 3|
| 5| 2| 2.0| 0|
| 6| 2:3,7|2:3| 4.0| 3|
+---+-----------+-------+-----------+
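The parsing rule can be checked on individual strings without Spark. A minimal plain-Python sketch, where the function name summarize is an assumption for illustration:

```python
def summarize(description):
    """Return (sum_emp, org_changed) for one description string."""
    sum_emp = 0
    org_changed = 0
    # Each record is separated by "|"
    for record in description.split("|"):
        # Text before ":" is a head count; text after it is a comma-separated list
        head, _, tail = record.partition(":")
        sum_emp += int(head)
        if tail:
            org_changed += len(tail.split(","))
    return sum_emp, org_changed

descriptions = ["3:2,3|2:1", "2", "2:12,16", "3:2,4,6", "2", "2:3,7|2:3"]
summary = [summarize(d) for d in descriptions]
```

Here the head counts are converted to int, so the sums are integers. In the Spark version above, sum_emp prints as 5.0 because nb_players is still a string column when it is summed.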
