How to calculate a 6-month moving average from daily data using pyspark - apache-spark

I am trying to calculate the moving average of price over the last six months in pyspark.
Currently my table has a 6-month lagged date column.
id dates lagged_6month price
1 2017-06-02 2016-12-02 14.8
1 2017-08-09 2017-02-09 16.65
2 2017-08-16 2017-02-16 16
2 2018-05-14 2017-11-14 21.05
3 2017-09-01 2017-03-01 16.75
Desired Results
id dates avg6mprice
1 2017-06-02 20.6
1 2017-08-09 21.5
2 2017-08-16 16.25
2 2018-05-14 25.05
3 2017-09-01 17.75
Sample code
from pyspark.sql.functions import col
from pyspark.sql import functions as F
from pyspark.sql import Window
df = sqlContext.table("price_table")
# fails: rangeBetween expects numeric offsets, not columns
w = Window.partitionBy([col('id')]).rangeBetween(col('dates'), col('lagged_6month'))
rangeBetween does not seem to accept columns as arguments in the window function.
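A common workaround (a sketch, assuming dates is a date/timestamp column and approximating six months as 182 days) is to order the window by the date cast to unix seconds, so rangeBetween can take plain numeric offsets:
from pyspark.sql import functions as F
from pyspark.sql import Window
seconds_per_day = 86400
# order by the date as unix seconds; -182 days back to the current row approximates six months
w = (Window.partitionBy('id')
     .orderBy(F.col('dates').cast('timestamp').cast('long'))
     .rangeBetween(-182 * seconds_per_day, 0))
df = sqlContext.table("price_table")
result = df.withColumn('avg6mprice', F.avg('price').over(w))
With this approach the lagged_6month column is not needed, since the window itself defines the six-month lookback.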

Related

How to update rows based on the previous row of a dataframe in Python

I have time series data, given below:
date product price amount
11/01/2019 A 10 20
11/02/2019 A 10 20
11/03/2019 A 25 15
11/04/2019 C 40 50
11/05/2019 C 50 60
My data is high-dimensional; I have included a simplified version with just two columns, {price, amount}. I am trying to transform it into relative changes based on the time index, as illustrated below:
date product price amount
11/01/2019 A NaN NaN
11/02/2019 A 0 0
11/03/2019 A 15 -5
11/04/2019 C NaN NaN
11/05/2019 C 10 10
I am trying to get the relative change of each product based on the time index. If a previous date does not exist for a given product, I add NaN.
Is there a function to do this?
Group by product and use .diff():
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
Output:
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
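Note that .diff() compares each row with the previous row of its group in the current row order, so if the frame is not already sorted by date a minimal sketch of the full flow (assuming the date column is parseable) would be:
import pandas as pd
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(["product", "date"])  # diff() works on consecutive rows, so order matters
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()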

Fill missing values in different columns of a dataframe using the mean or median of the last n values

I have a dataframe which contains time series data. What I want to do is efficiently fill all the missing values in the different columns with the median of the values within a timedelta of, say, N minutes. E.g. if a column has data for 10:20, 10:21, 10:22, 10:23, 10:24, ... and the value at 10:22 is missing, then with a timedelta of 2 minutes I would want it filled with the median of the values at 10:20, 10:21, 10:23 and 10:24.
One way I can do it is:
for each column in the dataframe:
    find the indexes which have NaN values
    for each index with a NaN value:
        extract all values using between_time with (index - timedelta) and (index + timedelta)
        find the median of the extracted values
        set the value at that index to the extracted median
This amounts to two nested for loops and is not very efficient. Is there a more efficient way to do it?
Thanks
IIUC you can resample your time column, then fillna with a centered rolling window:
# dummy data setup
np.random.seed(500)
n = 2
df = pd.DataFrame({"time":pd.to_timedelta([f"10:{i}:00" for i in range(15)]),
"value":np.random.randint(2, 10, 15)})
df = df.drop(df.index[[5,10]]).reset_index(drop=True)
print (df)
time value
0 10:00:00 4
1 10:01:00 9
2 10:02:00 3
3 10:03:00 3
4 10:04:00 8
5 10:06:00 9
6 10:07:00 2
7 10:08:00 9
8 10:09:00 9
9 10:11:00 7
10 10:12:00 3
11 10:13:00 3
12 10:14:00 7
s = df.set_index("time").resample("60S").asfreq()
print (s.fillna(s.rolling(n*2+1, min_periods=1, center=True).mean()))
value
time
10:00:00 4.0
10:01:00 9.0
10:02:00 3.0
10:03:00 3.0
10:04:00 8.0
10:05:00 5.5
10:06:00 9.0
10:07:00 2.0
10:08:00 9.0
10:09:00 9.0
10:10:00 7.0
10:11:00 7.0
10:12:00 3.0
10:13:00 3.0
10:14:00 7.0
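The question asks for a median rather than a mean; the same pattern should work by swapping the rolling aggregation (a sketch reusing the s and n defined above):
print (s.fillna(s.rolling(n*2+1, min_periods=1, center=True).median()))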

Resampling with a Multiindex and several columns

I have a pandas dataframe with the following structure:
ID date m_1 m_2
1 2016-01-03 10 3.4
2016-02-07 11 3.3
2016-02-07 10.4 2.8
2 2016-01-01 10.9 2.5
2016-02-04 12 2.3
2016-02-04 11 2.7
2016-02-04 12.1 2.1
Both ID and date form a MultiIndex. The data represent measurements made by some sensors (two sensors in the example). Those sensors sometimes produce several measurements per day (as shown in the example).
My questions are:
How can I resample this so I have one row per day per sensor, but with one column for the mean, another for the max, another for the min, etc.?
How can I "align" (maybe this is not the correct word) the two time series, so both begin and end at the same time (from 2016-01-01 to 2016-02-07), adding the missing days as NAs?
You can use groupby with DataFrameGroupBy.resample, aggregate with a dict of functions, and then reindex with MultiIndex.from_product:
df = df.reset_index(level=0).groupby('ID').resample('D').agg({'m_1':'mean', 'm_2':'max'})
df = df.reindex(pd.MultiIndex.from_product(df.index.levels, names = df.index.names))
#alternative for adding missing start and end datetimes
#df = df.unstack().stack(dropna=False)
print (df.head())
m_2 m_1
ID date
1 2016-01-01 NaN NaN
2016-01-02 NaN NaN
2016-01-03 3.4 10.0
2016-01-04 NaN NaN
2016-01-05 NaN NaN
For a PeriodIndex in the second level, use set_levels with to_period:
df.index = df.index.set_levels(df.index.get_level_values('date').to_period('d'), level=1)
print (df.index.get_level_values('date'))
PeriodIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
'2016-01-09', '2016-01-10', '2016-01-11', '2016-01-12',
'2016-01-13', '2016-01-14', '2016-01-15', '2016-01-16',
'2016-01-17', '2016-01-18', '2016-01-19', '2016-01-20',
'2016-01-21', '2016-01-22', '2016-01-23', '2016-01-24',
'2016-01-25', '2016-01-26', '2016-01-27', '2016-01-28',
'2016-01-29', '2016-01-30', '2016-01-31', '2016-02-01',
'2016-02-02', '2016-02-03', '2016-02-04', '2016-02-05',
'2016-02-06', '2016-02-07', '2016-01-01', '2016-01-02',
'2016-01-03', '2016-01-04', '2016-01-05', '2016-01-06',
'2016-01-07', '2016-01-08', '2016-01-09', '2016-01-10',
'2016-01-11', '2016-01-12', '2016-01-13', '2016-01-14',
'2016-01-15', '2016-01-16', '2016-01-17', '2016-01-18',
'2016-01-19', '2016-01-20', '2016-01-21', '2016-01-22',
'2016-01-23', '2016-01-24', '2016-01-25', '2016-01-26',
'2016-01-27', '2016-01-28', '2016-01-29', '2016-01-30',
'2016-01-31', '2016-02-01', '2016-02-02', '2016-02-03',
'2016-02-04', '2016-02-05', '2016-02-06', '2016-02-07'],
dtype='period[D]', name='date', freq='D')
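If you also want the mean, max and min of every measurement as separate columns (as asked in the first question), you can pass lists of functions in the agg dict instead of a single function per column. A sketch starting again from the original, un-aggregated frame (called original_df here, a hypothetical name; the exact agg behaviour may depend on the pandas version):
stats = (original_df.reset_index(level=0)
           .groupby('ID')
           .resample('D')
           .agg({'m_1': ['mean', 'max', 'min'], 'm_2': ['mean', 'max', 'min']}))
stats.columns = ['_'.join(col) for col in stats.columns]  # flatten ('m_1', 'mean') -> 'm_1_mean'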

Determining the number of unique entries left after experiencing a specific item in pandas

I have a data frame with three columns: timestamp, lecture_id, and userid.
I am trying to write a loop that will count the number of students who dropped (were never seen again) after experiencing a specific lecture. The goal is ultimately to have a fourth column that shows the number of students remaining after exposure to a specific lecture.
I'm having trouble writing this in Python; I tried a for loop, which never finished (I have 13M rows).
import pandas as pd
import numpy as np
ids = list(np.random.randint(0,5,size=(100, 1)))
users = list(np.random.randint(0,10,size=(100, 1)))
dates = list(pd.date_range('20130101',periods=100, freq = 'H'))
dft = pd.DataFrame(
{'lecture_id': ids,
'userid': users,
'timestamp': dates
})
I want to make a new data frame that shows, for each lecture, how many of the users who experienced it never came back (dropped).
Not sure if this is what you want, and it may be possible to do this more simply, but this could be one way to do it:
import pandas as pd
import numpy as np
np.random.seed(42)
ids = list(np.random.randint(0, 5, size=100))
users = list(np.random.randint(0, 10, size=100))
dates = list(pd.date_range('20130101', periods=100, freq='H'))
df = pd.DataFrame({'lecture_id': ids, 'userid': users, 'timestamp': dates})
# Get the index of the last row (latest timestamp) for each user
last_seen = df.groupby('userid').timestamp.idxmax()
df['remaining'] = df.userid.nunique()
tmp = np.zeros(len(df))
tmp[last_seen.values] = 1  # mark each user's final appearance
df['remaining'] = (df['remaining'] - tmp.cumsum()).astype(int)
df[-10:]
where the last 10 entries are:
lecture_id timestamp userid remaining
90 2 2013-01-04 18:00:00 9 6
91 0 2013-01-04 19:00:00 5 6
92 2 2013-01-04 20:00:00 6 6
93 2 2013-01-04 21:00:00 3 5
94 0 2013-01-04 22:00:00 6 4
95 2 2013-01-04 23:00:00 7 4
96 4 2013-01-05 00:00:00 0 3
97 1 2013-01-05 01:00:00 5 2
98 1 2013-01-05 02:00:00 7 1
99 0 2013-01-05 03:00:00 4 0
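If you only need the count of users who dropped after each lecture (i.e. whose final recorded row was for that lecture), a more direct sketch using idxmax:
# row of each user's final appearance, then count those rows per lecture
last_rows = df.loc[df.groupby('userid')['timestamp'].idxmax()]
dropped_per_lecture = last_rows.groupby('lecture_id')['userid'].nunique()
print(dropped_per_lecture)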

Python correlation matrix 3d dataframe

I have in SQL Server a historical return table by date and asset Id like this:
[Date] [Asset] [1DRet]
jan asset1 0.52
jan asset2 0.12
jan asset3 0.07
feb asset1 0.41
feb asset2 0.33
feb asset3 0.21
...
So I need to calculate the correlation matrix, for a given date range, for all asset combinations: A1,A2; A1,A3; A2,A3.
I'm using pandas, and in my SQL SELECT WHERE clause I'm filtering the date range and ordering by date.
I'm trying to do it using pandas df.corr(), numpy.corrcoef and SciPy, but I am not able to do it for my n-variable dataframe.
I have seen some examples, but they are always for a dataframe with one asset per column and one row per day.
This my code block where I'm doing it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result),columns=result.keys())
df1d = df[['Date','Id_RiskFactor','1DReturn']]
corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
For this I'm receiving the following message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
print(df1d.head())
Date Id_RiskFactor 1DReturn
0 2010-09-02 149 0E-12
1 2010-09-02 150 -0.004242875148
2 2010-09-02 33 0.000590000011
3 2010-09-02 28 0.000099999997
4 2010-09-02 34 -0.000010000000
print(df.head())
Date Id_RiskFactor Value 1DReturn 5DReturn
0 2010-09-02 149 0.040096000000 0E-12 0E-12
1 2010-09-02 150 1.736700000000 -0.004242875148 -0.013014321215
2 2010-09-02 33 2.283000000000 0.000590000011 0.001260000048
3 2010-09-02 28 2.113000000000 0.000099999997 0.000469999999
4 2010-09-02 34 0.615000000000 -0.000010000000 0.000079999998
print(corr.columns)
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'daily_return': np.random.random(15),
'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
daily_return date symbol
0 0.011467 2015-01-01 A
1 0.613518 2015-01-02 A
2 0.334343 2015-01-03 A
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
5 0.431729 2015-01-01 B
6 0.474905 2015-01-02 B
7 0.372366 2015-01-03 B
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
10 0.946504 2015-01-01 C
11 0.337204 2015-01-02 C
12 0.798704 2015-01-03 C
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2 A B C
symbol_1
A 1.000000 0.188065 -0.745115
B 0.188065 1.000000 -0.688808
C -0.745115 -0.688808 1.000000
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
daily_return date symbol
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
If you want to flatten your correlation matrix:
corr.stack().reset_index()
symbol_1 symbol_2 0
0 A A 1.000000
1 A B 0.188065
2 A C -0.745115
3 B A 0.188065
4 B B 1.000000
5 B C -0.688808
6 C A -0.745115
7 C B -0.688808
8 C C 1.000000
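If you prefer a named value column in the flattened output, a small follow-up sketch:
flat = corr.stack().reset_index()
flat.columns = ['symbol_1', 'symbol_2', 'correlation']
print(flat)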
