pyspark: rolling average using timeseries data - apache-spark

I have a dataset consisting of a timestamp column and a dollars column. I would like to find the average number of dollars per week ending at the timestamp of each row. I was initially looking at the pyspark.sql.functions.window function, but that bins the data by week.
Here's an example:
%pyspark
import datetime
from pyspark.sql import functions as F
df1 = sc.parallelize([(17,"2017-03-11T15:27:18+00:00"), (13,"2017-03-11T12:27:18+00:00"), (21,"2017-03-17T11:27:18+00:00")]).toDF(["dollars", "datestring"])
df2 = df1.withColumn('timestampGMT', df1.datestring.cast('timestamp'))
w = df2.groupBy(F.window("timestampGMT", "7 days")).agg(F.avg("dollars").alias('avg'))
w.select(w.window.start.cast("string").alias("start"), w.window.end.cast("string").alias("end"), "avg").collect()
This results in two records:
| start | end | avg |
|---------------------|----------------------|-----|
|'2017-03-16 00:00:00'| '2017-03-23 00:00:00'| 21.0|
|'2017-03-09 00:00:00'| '2017-03-16 00:00:00'| 15.0|
The window function binned the time series data rather than performing a rolling average.
Is there a way to perform a rolling average where I'll get back a weekly average for each row with a time period ending at the timestampGMT of the row?
EDIT:
Zhang's answer below is close to what I want, but not exactly what I'd like to see.
Here's a better example to show what I'm trying to get at:
%pyspark
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00"),
(13, "2017-03-15T12:27:18+00:00"),
(25, "2017-03-18T11:27:18+00:00")],
["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
df = df.withColumn('rolling_average', F.avg("dollars").over(Window.partitionBy(F.window("timestampGMT", "7 days"))))
This results in the following dataframe:
dollars timestampGMT rolling_average
25 2017-03-18 11:27:18.0 25
17 2017-03-10 15:27:18.0 15
13 2017-03-15 12:27:18.0 15
I'd like the average to be over the week preceding the date in the timestampGMT column, which would result in this:
dollars timestampGMT rolling_average
17 2017-03-10 15:27:18.0 17
13 2017-03-15 12:27:18.0 15
25 2017-03-18 11:27:18.0 19
In the above results, the rolling_average for 2017-03-10 is 17, since there are no preceding records. The rolling_average for 2017-03-15 is 15 because it averages the 13 from 2017-03-15 and the 17 from 2017-03-10, which falls within the preceding 7-day window. The rolling_average for 2017-03-18 is 19 because it averages the 25 from 2017-03-18 and the 13 from 2017-03-15, which falls within the preceding 7-day window; it does not include the 17 from 2017-03-10 because that falls outside the preceding 7-day window.
Is there a way to do this rather than the binning window where the weekly windows don't overlap?

I figured out the correct way to calculate a moving/rolling average using this Stack Overflow question:
Spark Window Functions - rangeBetween dates
The basic idea is to convert your timestamp column to seconds, and then you can use the rangeBetween function in the pyspark.sql.Window class to include the correct rows in your window.
Here's the solved example:
%pyspark
from pyspark.sql import functions as F
from pyspark.sql.window import Window
#function to calculate number of seconds from number of days
days = lambda i: i * 86400
df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00"),
(13, "2017-03-15T12:27:18+00:00"),
(25, "2017-03-18T11:27:18+00:00")],
["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
#create window by casting timestamp to long (number of seconds)
w = (Window.orderBy(F.col("timestampGMT").cast('long')).rangeBetween(-days(7), 0))
df = df.withColumn('rolling_average', F.avg("dollars").over(w))
This results in the exact column of rolling averages that I was looking for:
dollars timestampGMT rolling_average
17 2017-03-10 15:27:18.0 17.0
13 2017-03-15 12:27:18.0 15.0
25 2017-03-18 11:27:18.0 19.0
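As a cross-check on the expected numbers, the same frame can be sketched in plain Python without Spark. This is only an illustration of the window semantics, assuming (as rangeBetween does) that both ends of the range are inclusive; the helper name range_rolling_average is made up for this example.

```python
from datetime import datetime, timedelta

# Same three rows as the Spark example: (dollars, timestamp)
rows = [
    (17, datetime(2017, 3, 10, 15, 27, 18)),
    (13, datetime(2017, 3, 15, 12, 27, 18)),
    (25, datetime(2017, 3, 18, 11, 27, 18)),
]

def range_rolling_average(rows, days=7):
    """For each row, average dollars over [ts - days, ts], both bounds inclusive."""
    span = timedelta(days=days)
    out = []
    for _, ts in rows:
        in_frame = [d for d, t in rows if ts - span <= t <= ts]
        out.append(sum(in_frame) / len(in_frame))
    return out

averages = range_rolling_average(rows)
print(averages)  # [17.0, 15.0, 19.0]
```

The quadratic loop is fine for checking a handful of rows by hand, but it is not a replacement for the window function on real data.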

I will add a variation which I personally found very useful. I hope someone else will find it useful as well:
If you want to group the rows and then calculate the moving average within each group:
Example dataframe:
from pyspark.sql.window import Window
from pyspark.sql import functions as func
df = spark.createDataFrame([("tshilidzi", 17.00, "2018-03-10T15:27:18+00:00"),
("tshilidzi", 13.00, "2018-03-11T12:27:18+00:00"),
("tshilidzi", 25.00, "2018-03-12T11:27:18+00:00"),
("thabo", 20.00, "2018-03-13T15:27:18+00:00"),
("thabo", 56.00, "2018-03-14T12:27:18+00:00"),
("thabo", 99.00, "2018-03-15T11:27:18+00:00"),
("tshilidzi", 156.00, "2019-03-22T11:27:18+00:00"),
("thabo", 122.00, "2018-03-31T11:27:18+00:00"),
("tshilidzi", 7000.00, "2019-04-15T11:27:18+00:00"),
("ash", 9999.00, "2018-04-16T11:27:18+00:00")
],
["name", "dollars", "timestampGMT"])
# cast the string column to a timestamp; the window below casts it to seconds
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
df.show(10000, False)
Output:
+---------+-------+---------------------+
|name |dollars|timestampGMT |
+---------+-------+---------------------+
|tshilidzi|17.0 |2018-03-10 17:27:18.0|
|tshilidzi|13.0 |2018-03-11 14:27:18.0|
|tshilidzi|25.0 |2018-03-12 13:27:18.0|
|thabo |20.0 |2018-03-13 17:27:18.0|
|thabo |56.0 |2018-03-14 14:27:18.0|
|thabo |99.0 |2018-03-15 13:27:18.0|
|tshilidzi|156.0 |2019-03-22 13:27:18.0|
|thabo |122.0 |2018-03-31 13:27:18.0|
|tshilidzi|7000.0 |2019-04-15 13:27:18.0|
|ash |9999.0 |2018-04-16 13:27:18.0|
+---------+-------+---------------------+
To calculate the moving average based on the name and still maintain all rows:
# create the window by casting the timestamp to long (number of seconds)
from pyspark.sql import functions as F
days = lambda i: i * 86400  # number of seconds in i days
w = (Window
     .partitionBy(F.col("name"))
     .orderBy(F.col("timestampGMT").cast('long'))
     .rangeBetween(-days(7), 0))
df2 = df.withColumn('rolling_average', F.avg("dollars").over(w))
df2.show(100, False)
Output:
+---------+-------+---------------------+------------------+
|name |dollars|timestampGMT |rolling_average |
+---------+-------+---------------------+------------------+
|ash |9999.0 |2018-04-16 13:27:18.0|9999.0 |
|tshilidzi|17.0 |2018-03-10 17:27:18.0|17.0 |
|tshilidzi|13.0 |2018-03-11 14:27:18.0|15.0 |
|tshilidzi|25.0 |2018-03-12 13:27:18.0|18.333333333333332|
|tshilidzi|156.0 |2019-03-22 13:27:18.0|156.0 |
|tshilidzi|7000.0 |2019-04-15 13:27:18.0|7000.0 |
|thabo |20.0 |2018-03-13 17:27:18.0|20.0 |
|thabo |56.0 |2018-03-14 14:27:18.0|38.0 |
|thabo |99.0 |2018-03-15 13:27:18.0|58.333333333333336|
|thabo |122.0 |2018-03-31 13:27:18.0|122.0 |
+---------+-------+---------------------+------------------+

It's worth noting that if you don't care about the exact dates and only want the average over the most recent rows, you can use the rowsBetween function as follows:
w = Window.orderBy('timestampGMT').rowsBetween(-7, 0)
df = df.withColumn('rolling_average', F.avg('dollars').over(w))
Since you order by the dates, the frame takes the current row plus the 7 rows before it (8 rows in total; use rowsBetween(-6, 0) for exactly 7). This matches a 7-day window only if there is exactly one row per day.
You save all the casting.
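To make the size of a row-based frame concrete, here is a small plain-Python sketch. It assumes the rows are already ordered by date and that `preceding` plays the role of the first argument of rowsBetween (negated); the helper row_frames is hypothetical and only mimics the frame, it is not Spark code.

```python
def row_frames(values, preceding=7):
    """Return, for each position, the slice covering `preceding` rows back plus the current row."""
    return [values[max(0, i - preceding): i + 1] for i in range(len(values))]

daily = list(range(1, 11))  # ten rows, already ordered by date
frames = row_frames(daily, preceding=7)

sizes = [len(frame) for frame in frames]
averages = [sum(frame) / len(frame) for frame in frames]

print(sizes)         # [1, 2, 3, 4, 5, 6, 7, 8, 8, 8]
print(averages[-1])  # 6.5
```

Once enough history exists, every frame holds 8 rows no matter how far apart their dates are, which is the main difference from the range-based window.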

Do you mean this:
from pyspark.sql import functions as f
from pyspark.sql.window import Window
df = spark.createDataFrame([(17, "2017-03-11T15:27:18+00:00"),
(13, "2017-03-11T12:27:18+00:00"),
(21, "2017-03-17T11:27:18+00:00")],
["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
df = df.withColumn('rolling_average', f.avg("dollars").over(Window.partitionBy(f.window("timestampGMT", "7 days"))))
Output:
+-------+-------------------+---------------+
|dollars|timestampGMT |rolling_average|
+-------+-------------------+---------------+
|21 |2017-03-17 19:27:18|21.0 |
|17 |2017-03-11 23:27:18|15.0 |
|13 |2017-03-11 20:27:18|15.0 |
+-------+-------------------+---------------+

Related

pandas computation on rolling 1 calendar month

I have a pandas DataFrame with date as the index and a column, 'spendings'. I intend to get the rolling max() of the 'spendings' column for the trailing 1 calendar month (not 30 days or 4 weeks).
I tried to capture a snippet with custom data for addressing the problem, below (borrowed from Pandas monthly rolling operation):
import pandas as pd
from io import StringIO
data = StringIO(
"""\
date spendings
20210325 15
20210405 20
20210415 10
20210425 40
20210505 3
20210515 2
20210525 2
20210527 1
"""
)
df = pd.read_csv(data, sep=r"\s+", parse_dates=True)
df.index = pd.to_datetime(df.date, format='%Y%m%d')
del(df['date'])
Now, to create a column 'max' to hold rolling last 1 calendar month's max() val, I use:
df['max'] = df.loc[(df.index - pd.tseries.offsets.DateOffset(months=1)):df.index, 'spendings'].max()
This raises an exception like:
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [DatetimeIndex(['2021-02-25', '2021-03-05', '2021-03-15', '2021-03-25',
'2021-04-05', '2021-04-15', '2021-04-25'],
dtype='datetime64[ns]', name='date', freq=None)] of type DatetimeIndex
However, if I manually access a random month window like below, it works without exception:
>>> df['2021-04-16':'2021-05-15']
spendings
date
2021-04-25 40
2021-05-05 3
2021-05-15 2
(I could have followed the method using list comprehension here: https://stackoverflow.com/a/47199274/235415, but I would like to use panda's vectorized method. I have many DataFrames and each is very large - using list comprehension is very slow here).
Q: How to get the vectorized method of performing rolling 1 calendar month's max()?
The expected o/p, ie primarily the 'max' column (holding the max value of 'spendings' for last 1 calendar month) will be something like this:
>>> df
spendings max
date
2021-03-25 15 15
2021-04-05 20 20
2021-04-15 10 20
2021-04-25 40 40
2021-05-05 3 40
2021-05-15 2 40
2021-05-25 2 40
2021-05-27 1 3
One way is a list comprehension over the index:
df['max'] = [df.loc[x - pd.tseries.offsets.DateOffset(months=1):x, 'spendings'].max() for x in df.index]
This yields [15, 20, 20, 40, 40, 40, 40, 3].
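A possible way to cut the per-row cost is to find all window start positions in one call and then slice by position instead of by label. The sketch below is an assumption-laden variant, not a fully vectorized solution: it still loops once per row, it assumes the index is sorted, and the names window_starts and first_pos are introduced here for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(
    ["20210325", "20210405", "20210415", "20210425",
     "20210505", "20210515", "20210525", "20210527"],
    format="%Y%m%d",
)
df = pd.DataFrame({"spendings": [15, 20, 10, 40, 3, 2, 2, 1]}, index=dates)

# Left edge of each window: the row's own date minus one calendar month.
window_starts = df.index - pd.DateOffset(months=1)

# Position of the first row on or after each left edge (requires a sorted index).
first_pos = np.searchsorted(df.index.values, window_starts.values, side="left")

# Positional slices on a NumPy array are much cheaper than label-based .loc slices.
values = df["spendings"].to_numpy()
df["max"] = [int(values[start:i + 1].max()) for i, start in enumerate(first_pos)]

print(df["max"].tolist())  # [15, 20, 20, 40, 40, 40, 40, 3]
```

Both window edges are inclusive here, matching the label-based slice in the list comprehension above.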

Filter DataFrame to delete duplicate values in pyspark

I have the following dataframe
date | value | ID
--------------------------------------
2021-12-06 15:00:00 25 1
2021-12-06 15:15:00 35 1
2021-11-30 00:00:00 20 2
2021-11-25 00:00:00 10 2
I want to join this DF with another one like this:
idUser | Name | Gender
-------------------
1 John M
2 Anne F
My expected output is:
ID | Name | Gender | Value
---------------------------
1 John M 35
2 Anne F 20
What I need is to get only the most recent value from the first dataframe and join only this value with my second dataframe. However, my Spark script joins both values:
My code:
df = df1.select(
    col("date"),
    col("value"),
    col("ID"),
).orderBy(
    col("ID").asc(),
    col("date").desc(),
).groupBy(
    col("ID"), col("date").cast(StringType()).substr(0, 10).alias("date")
).agg(
    max(col("value")).alias("value")
)
final_df = df2.join(
    df,
    (col("idUser") == col("ID")),
    how="left"
)
When I perform this join (the column formatting is omitted in this post), I get the following output:
ID | Name | Gender | Value
---------------------------
1 John M 35
2 Anne F 20
2 Anne F 10
I use substr to remove the hours and minutes so that I filter only by date. But when the same ID appears on different days, my output df has both values instead of only the most recent one. How can I fix this?
Note: I'm using only pyspark functions to do this (I don't want to use spark.sql(...)).
You can use a window and the row_number function in PySpark:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
# number the rows within each ID, newest date first
windowSpec = Window.partitionBy("ID").orderBy(col("date").desc())
df1_latest_val = df1.withColumn("row_number", row_number().over(windowSpec)).filter(
    col("row_number") == 1
)
The output of table df1_latest_val will look something like this
date | value | ID | row_number |
-----------------------------------------------------
2021-12-06 15:15:00 35 1 1
2021-11-30 00:00:00 20 2 1
Now you will have df with the latest val, which you can directly join with another table.

Pandas rolling slope on groupby objects

I would like to estimate a rolling slope on a grouped dataframe.
Let's say that I have the following df:
Date tags weight
22 2004-05-12 a 0.000081
23 2004-05-13 a 0.000073
24 2004-05-14 a 0.000085
25 2004-05-17 a 0.000089
26 2004-05-18 b 0.000034
27 2004-05-19 b 0.000048
......
1000 2004-05-20 b 0.000034
1001 2004-05-21 b 0.000037
1002 2004-05-24 c 0.000043
1003 2004-05-25 c 0.000038
1004 2004-05-26 c 0.000029
How could I calculate a rolling slope over 10 dates and for each group?
I tried:
from scipy.stats import linregress
df['rolling_slope'] = df.groupby('tags').rolling(window=10,
min_periods=2).apply(lambda v: linregress(v.Date, v.weight))
but it seems that I can't apply the function to a Series
Try:
import numpy as np
from scipy.stats import linregress
df['rolling_slope'] = (df.groupby('tags')['weight']
    .rolling(window=10, min_periods=2)
    .apply(lambda v: linregress(np.arange(len(v)), v).slope)
    .reset_index(level=0, drop=True)
)
But this is rolling on number of rows only, not really looking back 10 days. There's also an option rolling('10D') but you would need to set date as index.
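The time-based option mentioned above could look roughly like the following. This is a sketch under assumptions: the dates are set as the index and are sorted within each group, the toy data is invented, and np.polyfit is swapped in for linregress so the example needs only NumPy and pandas. The helper slope_per_day is a made-up name.

```python
import numpy as np
import pandas as pd

# Toy data: group "a" rises by 2 per day, group "b" falls by 1 per day.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2004-05-12", "2004-05-13", "2004-05-14",
                            "2004-05-18", "2004-05-19", "2004-05-21"]),
    "tags": ["a", "a", "a", "b", "b", "b"],
    "weight": [1.0, 3.0, 5.0, 10.0, 9.0, 7.0],
})

def slope_per_day(window):
    # `window` is a Series indexed by date; x is the elapsed time in days.
    x = (window.index - window.index[0]) / pd.Timedelta(days=1)
    return np.polyfit(np.asarray(x, dtype=float), window.to_numpy(), 1)[0]

rolling_slope = (df.set_index("Date")
                   .groupby("tags")["weight"]
                   .rolling("10D", min_periods=2)
                   .apply(slope_per_day, raw=False))
print(rolling_slope)
```

Because raw=False passes a Series with its date index, the slope is measured per calendar day, so gaps such as weekends are accounted for. The result is indexed by (tags, Date) rather than by the original row labels.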

How to iterate over columns of "spark" dataframe?

I have the following Spark dataframe that is created dynamically
| name| number |
+--------+---------+
| Andy | (20,10,30)|
|Berta | (30,40,20)|
| Joe | (40,90,60)|
+-------+---------+
Now, I need to iterate each row and column in Spark to print the following output, How to do this?
Andy 20
Andy 10
Andy 30
Berta 30
Berta 40
Berta 20
Joe 40
Joe 90
Joe 60
Assuming the number column is of string Data Type, you can achieve the desired results by following below steps.
Original Data Frame:
val df = Seq(("Andy", "20,10,30"), ("Berta", "30,40,20"), ("Joe", "40,90,60"))
.toDF("name", "number")
Then Create an intermediate Data Frame having 3 number columns by splitting the number column with comma.
val Interim_Df = df.withColumn("n1", split(col("number"), ",").getItem(0))
.withColumn("n2", split(col("number"), ",").getItem(1))
.withColumn("n3", split(col("number"), ",").getItem(2))
.drop("number")
Then generate the final result data frame by taking the union of onlyOneIndexDfs.
val columnIndexes = Seq(1, 2, 3)
val onlyOneIndexDfs = columnIndexes.map(x =>
Interim_Df.select(
$"name",
col(s"n$x").alias("number")))
val resultDF = onlyOneIndexDfs.reduce(_ union _)
You need the explode function, which turns each element of an array column into its own row (split the number string on commas first).

Populating pandas column based on moving date range (efficiently)

I have 2 pandas dataframes, one of them contains dates with measurements, and the other contains dates with an event ID.
df1
from datetime import datetime as dt
from datetime import timedelta
import pandas as pd
import numpy as np
today = dt.now()
ndays = 10
df1 = pd.DataFrame({'Date': [today + timedelta(days = x) for x in range(ndays)], 'measurement': pd.Series(np.random.randint(1, high = 10, size = ndays))})
df1.Date = df1.Date.dt.date
Date measurement
2018-01-10 8
2018-01-11 2
2018-01-12 7
2018-01-13 3
2018-01-14 1
2018-01-15 1
2018-01-16 6
2018-01-17 9
2018-01-18 8
2018-01-19 4
df2
df2 = pd.DataFrame({'Date': ['2018-01-11', '2018-01-14', '2018-01-16', '2018-01-19'], 'event_id': ['event_a', 'event_b', 'event_c', 'event_d']})
df2.Date = pd.to_datetime(df2.Date, format = '%Y-%m-%d')
df2.Date = df2.Date.dt.date
Date event_id
2018-01-11 event_a
2018-01-14 event_b
2018-01-16 event_c
2018-01-19 event_d
I want to give each date in df1 an event_id from df2 only if it falls between two event dates. The resulting dataframe would look something like:
df3
today = dt.now()
ndays = 10
df3 = pd.DataFrame({'Date': [today + timedelta(days = x) for x in range(ndays)], 'measurement': pd.Series(np.random.randint(1, high = 10, size = ndays)), 'event_id': ['event_a', 'event_a', 'event_b', 'event_b', 'event_b', 'event_c', 'event_c', 'event_d', 'event_d', 'event_d']})
df3.Date = df3.Date.dt.date
Date event_id measurement
2018-01-10 event_a 4
2018-01-11 event_a 2
2018-01-12 event_b 1
2018-01-13 event_b 5
2018-01-14 event_b 5
2018-01-15 event_c 4
2018-01-16 event_c 6
2018-01-17 event_d 6
2018-01-18 event_d 9
2018-01-19 event_d 6
The code I use to achieve this is:
n = 1
while n <= len(list(df2.Date)) - 1:
    for date in list(df1.Date):
        if date <= df2.iloc[n].Date and (date > df2.iloc[n-1].Date):
            df1.loc[df1.Date == date, 'event_id'] = df2.iloc[n].event_id
    n += 1
The dataset that I am working with is significantly larger than this (a few million rows) and this method runs far too long. Is there a more efficient way to accomplish this?
So there are quite a few things to improve performance.
The first question I have is: does it have to be a pandas frame to begin with? Meaning can't df1 and df2 just be lists of tuples or list of lists?
The thing is that pandas adds a significant overhead when accessing items but especially when setting values individually.
Pandas excels when it comes to vectorized operations but I don't see an efficient alternative right now (maybe someone comes up with such an answer, that would be ideal).
Now what I'd do is:
Convert your df1 and df2 to records, e.g. d1 = df1.to_records(index=False). What you get is an array of tuples with basically the same structure as the dataframe (index=False keeps the index out, so the first field is the date).
Now run your algorithm but instead of operating on pandas dataframes you operate on the arrays of tuples d1 and d2
Use a third list of tuples d3 where you store the newly created data (each tuple is a row)
Now if you want you can convert d3 back to a pandas dataframe:
df3 = pd.DataFrame.from_records(d3, **myKwArgs)
This will speed up your code significantly I'd assume by more than 100-1000%. It does increase memory usage though, so if you are low on memory try to avoid the pandas dataframes all-together or dereference unused pandas frames df1, df2 once you used them to create the records (and if you run into problems call gc manually).
EDIT: Here is a version of your code using the procedure above:
d3 = []
n = 1
while n < len(d2):
    for i in range(len(d1)):
        date = d1[i][0]
        # keep dates in the interval (previous event date, current event date]
        if date <= d2[n][0] and date > d2[n-1][0]:
            d3.append((date, d2[n][1], d1[i][1]))
    n += 1
You can try the df.apply() method to achieve this; see pandas.DataFrame.apply. I think my code will run faster than yours.
My approach:
Merge the two dataframes df1 and df2 into a new one, df3:
df3 = pd.merge(df1, df2, on='Date', how='outer')
Sort df3 by date to make it easy to traverse:
df3['Date'] = pd.to_datetime(df3.Date)
df3 = df3.sort_values(by='Date')
Create a set_event_date() function to apply to each row of df3:
new_event_id = np.nan
def set_event_date(row):
    global new_event_id
    # remember the most recent non-missing event_id and carry it forward
    if pd.notna(row.event_id):
        new_event_id = row.event_id
    return new_event_id
Apply set_event_date() to each row of df3:
df3['new_event_id'] = df3.apply(set_event_date, axis=1)
Final Output will be:
Date Measurement New_event_id
0 2018-01-11 2 event_a
1 2018-01-12 1 event_a
2 2018-01-13 3 event_a
3 2018-01-14 6 event_b
4 2018-01-15 3 event_b
5 2018-01-16 5 event_c
6 2018-01-17 7 event_c
7 2018-01-18 9 event_c
8 2018-01-19 7 event_d
9 2018-01-20 4 event_d
Let me know once you tried my solution and it works faster than yours.
Thanks.
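Another option worth trying is pd.merge_asof, which matches each row to the nearest key in another frame without a Python-level loop. The sketch below assumes both Date columns are real datetime64 values (not the datetime.date objects produced by .dt.date) and that "between two event dates" means each measurement takes the next event on or after its own date; the measurement values are copied from the df1 listing in the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": pd.to_datetime([f"2018-01-{day}" for day in range(10, 20)]),
    "measurement": [8, 2, 7, 3, 1, 1, 6, 9, 8, 4],
})
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-11", "2018-01-14", "2018-01-16", "2018-01-19"]),
    "event_id": ["event_a", "event_b", "event_c", "event_d"],
})

# direction="forward" picks, for each df1 date, the first df2 date that is >= it.
# Both frames must be sorted on the key column.
df3 = pd.merge_asof(
    df1.sort_values("Date"),
    df2.sort_values("Date"),
    on="Date",
    direction="forward",
)
print(df3)
```

Rows dated after the last event would get a missing event_id, so that case may need separate handling depending on what the data should look like.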
