I have the following dataframe:
| Timestamp | info |
+-------------------+----------+
|2016-01-01 17:54:30| 8 |
|2016-02-01 12:16:18| 2 |
|2016-03-01 12:17:57| 1 |
|2016-04-01 10:05:21| 2 |
|2016-05-11 18:58:25| 7 |
|2016-06-11 11:18:29| 6 |
|2016-07-01 12:05:21| 3 |
|2016-08-11 11:58:25| 2 |
|2016-09-11 15:18:29| 9 |
I would like to create a new column named count that counts, over a window of (-2, 0) (the current row and the previous two), how many values are > 5. For the first two rows, where I cannot perform the full operation, I would put 0.
The resulting table should be:
| Timestamp | info | count |
+-------------------+----------+----------+
|2016-01-01 17:54:30| 8 | 0 |
|2016-02-01 12:16:18| 2 | 0 |
|2016-03-01 12:17:57| 1 | 1 |
|2016-04-01 10:05:21| 2 | 0 |
|2016-05-11 18:58:25| 7 | 1 |
|2016-06-11 11:18:29| 6 | 2 |
|2016-07-01 12:05:21| 3 | 2 |
|2016-08-11 11:58:25| 2 | 1 |
|2016-09-11 15:18:29| 9 | 1 |
I tried to do this but it didn't work:
w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn("count", F.when((F.count("info").over(w) > 5), F.count("info").over(w) > 5).otherwise(0))
The following works if you don't mind the calculation also being performed for the first 2 rows.
w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn('count', F.count(F.when(F.col('info') > 5, 1)).over(w))
df_input.show()
# +-------------------+----+-----+
# | Timestamp|info|count|
# +-------------------+----+-----+
# |2016-01-01 17:54:30| 8| 1|
# |2016-02-01 12:16:18| 2| 1|
# |2016-03-01 12:17:57| 1| 1|
# |2016-04-01 10:05:21| 2| 0|
# |2016-05-11 18:58:25| 7| 1|
# |2016-06-11 11:18:29| 6| 2|
# |2016-07-01 12:05:21| 3| 2|
# |2016-08-11 11:58:25| 2| 1|
# |2016-09-11 15:18:29| 9| 1|
# +-------------------+----+-----+
If you need the first 2 rows to be 0 without changing the window, you can use this when condition:
w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn(
    'count',
    F.when(
        F.size(F.collect_list('info').over(w)) == 3,
        F.count(F.when(F.col('info') > 5, 1)).over(w)
    ).otherwise(0)
)
df_input.show()
# +-------------------+----+-----+
# | Timestamp|info|count|
# +-------------------+----+-----+
# |2016-01-01 17:54:30| 8| 0|
# |2016-02-01 12:16:18| 2| 0|
# |2016-03-01 12:17:57| 1| 1|
# |2016-04-01 10:05:21| 2| 0|
# |2016-05-11 18:58:25| 7| 1|
# |2016-06-11 11:18:29| 6| 2|
# |2016-07-01 12:05:21| 3| 2|
# |2016-08-11 11:58:25| 2| 1|
# |2016-09-11 15:18:29| 9| 1|
# +-------------------+----+-----+
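Another option, as a minimal sketch of my own (not part of the answer above), is to keep the same counting expression and blank the first two rows with a row_number check instead of collect_list, assuming df_input holds the Timestamp and info columns shown above:
from pyspark.sql import functions as F, Window

# Count values > 5 over the (-2, 0) frame, but only from the third row onward.
w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn(
    'count',
    F.when(
        F.row_number().over(Window.orderBy('Timestamp')) >= 3,
        F.count(F.when(F.col('info') > 5, 1)).over(w)
    ).otherwise(0)
)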
I am currently using PySpark to do a moving average calculation over the last 12 months for different company groups. The data looks like this:
| CALENDAR_DATE| COMPANY | VALUE
| 2021-11-01 | a | 31
| 2021-10-01 | a | 31
| 2021-09-01 | a | 33
| 2021-08-01 | a | 21
| 2021-07-01 | a | 25
| 2021-06-01 | a | 28
| 2021-05-01 | a | 31
| 2021-04-01 | a | 31
| 2021-03-01 | a | 33
| 2021-04-01 | a | 31
| 2021-03-01 | a | 33
| 2021-04-01 | a | 10
| 2021-03-01 | a | 25
| 2021-04-01 | a | 30
| 2021-03-01 | a | 27
| 2021-02-01 | a | 18
| 2021-01-01 | a | 15
| 2021-11-01 | b | 31
| 2021-10-01 | b | 30
| 2021-09-01 | b | 31
| 2021-08-01 | b | 32
I would like to get an extra column called rolling_average for each company (a and b).
My code looks like this, but it doesn't give me the right answer, and I really don't know what the problem is.
from pyspark.sql.functions import *
from pyspark.sql.window import *
w = Window().partitionBy('COMPANY').orderBy('CALENDAR_DATE').rowsBetween(-11, 0)
df = df.withColumn('ROLLING_AVERAGE', round(avg('VALUE').over(w), 1))
You need to use Window rangeBetween instead of rowsBetween. But first, convert the CALENDAR_DATE column into a timestamp:
from pyspark.sql import Window
from pyspark.sql import functions as F
df = df.withColumn('calendar_timestamp', F.to_timestamp('CALENDAR_DATE').cast("long"))
# 2629800 is the average number of seconds in a month (365.25 days / 12)
w = Window().partitionBy('COMPANY').orderBy('calendar_timestamp').rangeBetween(-11 * 2629800, 0)
df1 = df.withColumn(
'ROLLING_AVERAGE',
F.round(F.avg('VALUE').over(w), 1)
).drop('calendar_timestamp')
df1.show()
#+-------------+-------+-----+---------------+
#|CALENDAR_DATE|COMPANY|VALUE|ROLLING_AVERAGE|
#+-------------+-------+-----+---------------+
#| 2021-08-01| b| 32| 32.0|
#| 2021-09-01| b| 31| 31.5|
#| 2021-10-01| b| 30| 31.0|
#| 2021-11-01| b| 31| 31.0|
#| 2021-01-01| a| 15| 15.0|
#| 2021-02-01| a| 18| 16.5|
#| 2021-03-01| a| 33| 25.2|
#| 2021-03-01| a| 33| 25.2|
#| 2021-03-01| a| 25| 25.2|
#| 2021-03-01| a| 27| 25.2|
#| 2021-04-01| a| 31| 25.3|
#| 2021-04-01| a| 31| 25.3|
#| 2021-04-01| a| 10| 25.3|
#| 2021-04-01| a| 30| 25.3|
#| 2021-05-01| a| 31| 25.8|
#| 2021-06-01| a| 28| 26.0|
#| 2021-07-01| a| 25| 25.9|
#| 2021-08-01| a| 21| 25.6|
#| 2021-09-01| a| 33| 26.1|
#| 2021-10-01| a| 31| 26.4|
#+-------------+-------+-----+---------------+
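An alternative sketch of my own (not part of the answer above): instead of approximating a month in seconds, build a month index and range over whole months. This assumes CALENDAR_DATE parses as a date (e.g. yyyy-MM-dd strings, as in the sample data):
from pyspark.sql import functions as F, Window

# Month index = year * 12 + month, so a range of -11 to 0 covers the last 12 months.
df = df.withColumn('month_index', F.year('CALENDAR_DATE') * 12 + F.month('CALENDAR_DATE'))
w = Window.partitionBy('COMPANY').orderBy('month_index').rangeBetween(-11, 0)
df1 = df.withColumn('ROLLING_AVERAGE', F.round(F.avg('VALUE').over(w), 1)).drop('month_index')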
I have a pyspark dataframe:
date | cust | amount | is_delinquent
---------------------------------------
1/1/20 | A | 5 | 0
13/1/20 | A | 1 | 0
15/1/20 | A | 3 | 1
19/1/20 | A | 4 | 0
20/1/20 | A | 4 | 1
27/1/20 | A | 2 | 0
1/2/20 | A | 2 | 0
5/2/20 | A | 1 | 0
1/1/20 | B | 7 | 0
9/1/20  | B    | 5        | 0
Now I want to calculate the average of amount over a 30-day window, filtering on the column IS_DELINQUENT being equal to 0. Rows where IS_DELINQUENT equals 1 should be skipped and shown as null.
My expected final dataframe is:
date | cust | amount | is_delinquent | avg_amount
----------------------------------------------------------
1/1/20 | A | 5 | 0 | null
13/1/20 | A | 1 | 0 | 5
15/1/20 | A | 3 | 1 | null
19/1/20 | A | 4 | 0 | 3
20/1/20 | A | 4 | 1 | null
27/1/20 | A | 2 | 0 | 3.333
1/2/20 | A | 2 | 0 | null
5/2/20 | A | 1 | 0 | 2
1/1/20 | B | 7 | 0 | null
9/1/20 | B | 5 | 0 | 7
Without the filtering, my code would look like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
days = lambda i: i * 86400
w_pay_30x = (Window.partitionBy("cust")
             .orderBy(F.col("date").cast("timestamp").cast("long"))
             .rangeBetween(-days(30), -days(1)))
data.withColumn("avg_amount", F.avg("amount").over(w_pay_30x))
Any idea how I can add this filter?
You can use when to calculate and show the average only when is_delinquent is equal to 0. You may also want to include the month in the window's partitionBy clause.
from pyspark.sql import functions as F, Window
days = lambda i: i * 86400
w_pay_30x = (
    Window.partitionBy("cust", F.month(F.to_timestamp('date', 'd/M/yy')))
    .orderBy(F.to_timestamp('date', 'd/M/yy').cast('long'))
    .rangeBetween(-days(30), -days(1))
)
data2 = data.withColumn(
    'avg_amount',
    F.when(
        F.col('is_delinquent') == 0,
        F.avg(
            F.when(F.col('is_delinquent') == 0, F.col('amount'))
        ).over(w_pay_30x)
    )
).orderBy('cust', F.to_timestamp('date', 'd/M/yy'))
data2.show()
+-------+----+------+-------------+------------------+
| date|cust|amount|is_delinquent| avg_amount|
+-------+----+------+-------------+------------------+
| 1/1/20| A| 5| 0| null|
|13/1/20| A| 1| 0| 5.0|
|15/1/20| A| 3| 1| null|
|19/1/20| A| 4| 0| 3.0|
|20/1/20| A| 4| 1| null|
|27/1/20| A| 2| 0|3.3333333333333335|
| 1/2/20| A| 2| 0| null|
| 5/2/20| A| 1| 0| 2.0|
| 1/1/20| B| 7| 0| null|
| 9/1/20| B| 5| 0| 7.0|
+-------+----+------+-------------+------------------+
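For reference, here is a small reproduction of the sample data that the snippet above can be run against. It assumes a SparkSession named spark, keeps the dates as d/M/yy strings, and takes the second B row as 9/1/20 so it matches the expected output:
# Rebuild the input dataframe from the rows listed in the question.
data = spark.createDataFrame(
    [('1/1/20', 'A', 5, 0), ('13/1/20', 'A', 1, 0), ('15/1/20', 'A', 3, 1),
     ('19/1/20', 'A', 4, 0), ('20/1/20', 'A', 4, 1), ('27/1/20', 'A', 2, 0),
     ('1/2/20', 'A', 2, 0), ('5/2/20', 'A', 1, 0),
     ('1/1/20', 'B', 7, 0), ('9/1/20', 'B', 5, 0)],
    ['date', 'cust', 'amount', 'is_delinquent']
)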
Assume we have a spark DataFrame that looks like the following (ordered by time):
+------+-------+
| time | value |
+------+-------+
| 1 | A |
| 2 | A |
| 3 | A |
| 4 | B |
| 5 | B |
| 6 | A |
+------+-------+
I'd like to calculate the start/end times of each sequence of uninterrupted values. The expected output from the above DataFrame would be:
+-------+-------+-----+
| value | start | end |
+-------+-------+-----+
| A | 1 | 3 |
| B | 4 | 5 |
| A | 6 | 6 |
+-------+-------+-----+
(The end value for the final row could also be null.)
Doing this with a simple group aggregation:
.groupBy("value")
.agg(
F.min("time").alias("start"),
F.max("time").alias("end")
)
doesn't take into account the fact that the same value can appear in multiple different intervals.
The idea is to create an identifier for each group and use it to group by and compute your min and max time.
Assuming df is your dataframe:
from pyspark.sql import functions as F, Window
df = df.withColumn(
"fg",
F.when(
F.lag('value').over(Window.orderBy("time"))==F.col("value"),
0
).otherwise(1)
)
df = df.withColumn(
"rn",
F.sum("fg").over(
Window
.orderBy("time")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
)
From that point, you have your dataframe with an identifier for each consecutive group.
df.show()
+----+-----+---+---+
|time|value| rn| fg|
+----+-----+---+---+
| 1| A| 1| 1|
| 2| A| 1| 0|
| 3| A| 1| 0|
| 4| B| 2| 1|
| 5| B| 2| 0|
| 6| A| 3| 1|
+----+-----+---+---+
Then you just have to do the aggregation:
df.groupBy(
'value',
"rn"
).agg(
F.min('time').alias("start"),
F.max('time').alias("end")
).drop("rn").show()
+-----+-----+---+
|value|start|end|
+-----+-----+---+
| A| 1| 3|
| B| 4| 5|
| A| 6| 6|
+-----+-----+---+
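If you prefer the end value of the final row to be null, as the question mentions, one possible sketch (assuming the aggregated dataframe above is kept in a variable named result, which is my own naming) is to blank out the end that matches the overall last time:
from pyspark.sql import functions as F

# Null out "end" for the group containing the chronologically last row.
last_time = df.agg(F.max('time')).collect()[0][0]
result = result.withColumn(
    'end',
    F.when(F.col('end') == F.lit(last_time), F.lit(None)).otherwise(F.col('end'))
)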
I have a dataframe of the form:
|user_id| action | day |
------------------------
| d25as | AB | 2 |
| d25as | AB | 3 |
| d25as | AB | 5 |
| m3562 | AB | 1 |
| m3562 | AB | 7 |
| m3562 | AB | 9 |
| ha42a | AB | 3 |
| ha42a | AB | 4 |
| ha42a | AB | 5 |
I want to filter out users that are seen only on consecutive days, i.e. users that are never seen on at least one nonconsecutive day. The resulting dataframe should be:
|user_id| action | day |
------------------------
| d25as | AB | 2 |
| d25as | AB | 3 |
| d25as | AB | 5 |
| m3562 | AB | 1 |
| m3562 | AB | 7 |
| m3562 | AB | 9 |
where the last user has been removed, since they appeared only on consecutive days.
Does anyone know how this can be done in Spark?
Using Spark SQL window functions and without any UDFs. The df construction is done in Scala, but the SQL part will be the same in Python. Check this out:
val df = Seq(("d25as","AB",2),("d25as","AB",3),("d25as","AB",5),("m3562","AB",1),("m3562","AB",7),("m3562","AB",9),("ha42a","AB",3),("ha42a","AB",4),("ha42a","AB",5)).toDF("user_id","action","day")
df.createOrReplaceTempView("qubix")
spark.sql(
""" with t1( select user_id, action, day, row_number() over(partition by user_id order by day)-day diff from qubix),
t2( select user_id, action, day, collect_set(diff) over(partition by user_id) diff2 from t1)
select user_id, action, day from t2 where size(diff2) > 1
""").show(false)
Results:
+-------+------+---+
|user_id|action|day|
+-------+------+---+
|d25as |AB |2 |
|d25as |AB |3 |
|d25as |AB |5 |
|m3562 |AB |1 |
|m3562 |AB |7 |
|m3562 |AB |9 |
+-------+------+---+
PySpark version:
>>> from pyspark.sql.functions import *
>>> values = [('d25as','AB',2),('d25as','AB',3),('d25as','AB',5),
... ('m3562','AB',1),('m3562','AB',7),('m3562','AB',9),
... ('ha42a','AB',3),('ha42a','AB',4),('ha42a','AB',5)]
>>> df = spark.createDataFrame(values,['user_id','action','day'])
>>> df.show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
| ha42a| AB| 3|
| ha42a| AB| 4|
| ha42a| AB| 5|
+-------+------+---+
>>> df.createOrReplaceTempView("qubix")
>>> spark.sql(
... """ with t1( select user_id, action, day, row_number() over(partition by user_id order by day)-day diff from qubix),
... t2( select user_id, action, day, collect_set(diff) over(partition by user_id) diff2 from t1)
... select user_id, action, day from t2 where size(diff2) > 1
... """).show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
+-------+------+---+
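The same logic can also be expressed with the DataFrame API; here is a rough sketch of that translation (the column names diff and diff2 are just illustrative):
from pyspark.sql import functions as F, Window

# row_number - day is constant within a run of consecutive days, so users whose
# set of differences has more than one element were seen on nonconsecutive days.
w = Window.partitionBy('user_id').orderBy('day')
result = (df
          .withColumn('diff', F.row_number().over(w) - F.col('day'))
          .withColumn('diff2', F.collect_set('diff').over(Window.partitionBy('user_id')))
          .where(F.size('diff2') > 1)
          .drop('diff', 'diff2'))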
Read the comments in between; the code will be self-explanatory then.
from pyspark.sql.functions import udf, col, collect_list, explode
# Creating the DataFrame
values = [('d25as','AB',2),('d25as','AB',3),('d25as','AB',5),
('m3562','AB',1),('m3562','AB',7),('m3562','AB',9),
('ha42a','AB',3),('ha42a','AB',4),('ha42a','AB',5)]
df = sqlContext.createDataFrame(values,['user_id','action','day'])
df.show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
| ha42a| AB| 3|
| ha42a| AB| 4|
| ha42a| AB| 5|
+-------+------+---+
# Grouping together the days in one list.
df = df.groupby(['user_id','action']).agg(collect_list('day'))
df.show()
+-------+------+-----------------+
|user_id|action|collect_list(day)|
+-------+------+-----------------+
| ha42a| AB| [3, 4, 5]|
| m3562| AB| [1, 7, 9]|
| d25as| AB| [2, 3, 5]|
+-------+------+-----------------+
# Creating a UDF to check if the days are consecutive or not. Only keep False ones.
check_consecutive = udf(lambda row: sorted(row) == list(range(min(row), max(row)+1)))
df = df.withColumn('consecutive',check_consecutive(col('collect_list(day)')))\
.where(col('consecutive')==False)
df.show()
+-------+------+-----------------+-----------+
|user_id|action|collect_list(day)|consecutive|
+-------+------+-----------------+-----------+
| m3562| AB| [1, 7, 9]| false|
| d25as| AB| [2, 3, 5]| false|
+-------+------+-----------------+-----------+
# Finally, exploding the DataFrame from above to get the result.
df = df.withColumn("day", explode(col('collect_list(day)')))\
.drop('consecutive','collect_list(day)')
df.show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| m3562| AB| 1|
| m3562| AB| 7|
| m3562| AB| 9|
| d25as| AB| 2|
| d25as| AB| 3|
| d25as| AB| 5|
+-------+------+---+
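One optional tweak (my own suggestion, not needed to reproduce the output above): give the UDF an explicit BooleanType so the consecutive column is a real boolean instead of relying on the default string return type:
from pyspark.sql.types import BooleanType

# Same consecutive-days check as above, but with an explicit boolean return type.
check_consecutive = udf(
    lambda row: sorted(row) == list(range(min(row), max(row) + 1)),
    BooleanType()
)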