Rolling window percentile rank over a multi-index Pandas DataFrame - python-3.x

I am creating a percentile rank over a rolling window of time and would like help refining my approach.
My DataFrame has a multi-index with the first level set to datetime and the second set to an identifier. Ultimately, I’d like the rolling window to evaluate the trailing n periods, including the current period, and produce the corresponding percentile ranks.
I referenced the posts shown below but found they were working with the data a bit differently than how I intend to. In those posts, the final functions group results by identifier and then by datetime, whereas I'm looking to use rolling panels of data in my function (dates and identifiers).
using rolling functions on multi-index dataframe in pandas
Panda rolling window percentile rank
This is an example of what I am after.
Create a sample DataFrame:
num_days = 5
np.random.seed(8675309)
stock_data = {
"AAPL": np.random.randint(1, max_value, size=num_days),
"MSFT": np.random.randint(1, max_value, size=num_days),
"WMT": np.random.randint(1, max_value, size=num_days),
"TSLA": np.random.randint(1, max_value, size=num_days)
}
dates = pd.date_range(
start="2013-01-03",
periods=num_days,
freq=BDay()
)
sample_df = pd.DataFrame(stock_data, index=dates)
sample_df = sample_df.stack().to_frame(name='data')
sample_df.index.names = ['date', 'ticker']
Which outputs:
date ticker
2013-01-03 AAPL 2
MSFT 93
TSLA 39
WMT 21
2013-01-04 AAPL 141
MSFT 43
TSLA 205
WMT 20
2013-01-07 AAPL 256
MSFT 93
TSLA 103
WMT 25
2013-01-08 AAPL 233
MSFT 60
TSLA 13
WMT 104
2013-01-09 AAPL 19
MSFT 120
TSLA 282
WMT 293
The code below breaks out the sample_df into 2 day increments and produces a rank vs. ranking over a rolling window of time. So it's close, but not what I'm after.
sample_df.reset_index(level=1, drop=True)[['data']] \
.apply(
lambda x: x.groupby(pd.Grouper(level=0, freq='2d')).rank()
)
I then tried what's shown below without much luck either.
from scipy.stats import rankdata
def rank(x):
return rankdata(x, method='ordinal')[-1]
sample_df.reset_index(level=1, drop=True) \
.rolling(window="2d", min_periods=1) \
.apply(
lambda x: rank(x)
)
I finally arrived at the output I'm looking for but the formula seems a bit contrived, so I'm hoping to identify a more elegant approach if one exists.
import numpy as np
import pandas as pd
from pandas.tseries.offsets import BDay
window_length = 1
target_column = "data"
def rank(df, target_column, ids, window_length):
percentile_ranking = []
list_of_ids = []
date_index = df.index.get_level_values(0).unique()
for date in date_index:
rolling_start_date = date - BDay(window_length)
first_date = date_index[0] + BDay(window_length)
trailing_values = df.loc[rolling_start_date:date, target_column]
# Only calc rolling percentile after the rolling window has lapsed
if date < first_date:
pass
else:
percentile_ranking.append(
df.loc[date, target_column].apply(
lambda x: stats.percentileofscore(trailing_values, x, kind="rank")
)
)
list_of_ids.append(df.loc[date, ids])
ranks, output_ids = pd.concat(percentile_ranking), pd.concat(list_of_ids)
df = pd.DataFrame(
ranks.values, index=[ranks.index, output_ids], columns=["percentile_rank"]
)
return df
ranks = rank(
sample_df.reset_index(level=1),
window_length=1,
ids='ticker',
target_column="data"
)
sample_df.join(ranks)
I get the feeling that my rank function is more than what's needed here. I appreciate any ideas/feedback to help in simplifying this code to arrive at the output below. Thank you!
data percentile_rank
date ticker
2013-01-03 AAPL 2 NaN
MSFT 93 NaN
TSLA 39 NaN
WMT 21 NaN
2013-01-04 AAPL 141 87.5
MSFT 43 62.5
TSLA 205 100.0
WMT 20 25.0
2013-01-07 AAPL 256 100.0
MSFT 93 50.0
TSLA 103 62.5
WMT 25 25.0
2013-01-08 AAPL 233 87.5
MSFT 60 37.5
TSLA 13 12.5
WMT 104 75.0
2013-01-09 AAPL 19 25.0
MSFT 120 62.5
TSLA 282 87.5
WMT 293 100.0

Edited: The original answer was taking 2d groups without the rolling effect, and just grouping the first two days that appeared. If you want rolling by every 2 days:
Dataframe pivoted to keep the dates as index and ticker as columns
pivoted = sample_df.reset_index().pivot('date','ticker','data')
Output
ticker AAPL MSFT TSLA WMT
date
2013-01-03 2 93 39 21
2013-01-04 141 43 205 20
2013-01-07 256 93 103 25
2013-01-08 233 60 13 104
2013-01-09 19 120 282 293
Now we can apply a rolling function and consider all stocks in the same window within that rolling
from scipy.stats import rankdata
def pctile(s):
wdw = sample_df.loc[s.index,:].values.flatten() ##get all stock values in the period
ranked = rankdata(wdw) / len(wdw)*100 ## their percentile
return ranked[np.where(wdw == s[len(s)-1])][0] ## return this value's percentile
pivoted_pctile = pivoted.rolling('2D').apply(pctile, raw=False)
Output
ticker AAPL MSFT TSLA WMT
date
2013-01-03 25.0 100.0 75.0 50.0
2013-01-04 87.5 62.5 100.0 25.0
2013-01-07 100.0 50.0 75.0 25.0
2013-01-08 87.5 37.5 12.5 75.0
2013-01-09 25.0 62.5 87.5 100.0
To get the original format back, we just melt the results:
pd.melt(pivoted_pctile.reset_index(),'date')\
.sort_values(['date', 'ticker']).reset_index()
Output
value
date ticker
2013-01-03 AAPL 25.0
MSFT 100.0
TSLA 75.0
WMT 50.0
2013-01-04 AAPL 87.5
MSFT 62.5
TSLA 100.0
WMT 25.0
2013-01-07 AAPL 100.0
MSFT 50.0
TSLA 75.0
WMT 25.0
2013-01-08 AAPL 87.5
MSFT 37.5
TSLA 12.5
WMT 75.0
2013-01-09 AAPL 25.0
MSFT 62.5
TSLA 87.5
WMT 100.0
If you prefer in one execution:
pd.melt(
sample_df\
.reset_index()\
.pivot('date','ticker','data')\
.rolling('2D').apply(pctile, raw=False)\
.reset_index(),'date')\
.sort_values(['date', 'ticker']).set_index(['date','ticker'])
Note that on day 7 this is different than what you displayed. This is actually rolling, so in day 7, because there is no day 6, the values are ranked only for that day, as the window of data is only 4 values and windows don't look forward. This differs from your result for that day.
Original
Is this something you might be looking for? I combined the groupby on the date (2 days) with transform so the number of observations is the same as the series provided. As you can see I kept the first observation of the window group.
df = sample_df.reset_index()
df['percentile_rank'] = df.groupby([pd.Grouper(key='date',freq='2D')]['data']\
.transform(lambda x: x.rank(ascending=True)/len(x)*100)
Output
Out[19]:
date ticker data percentile_rank
0 2013-01-03 AAPL 2 12.5
1 2013-01-03 MSFT 93 75.0
2 2013-01-03 WMT 39 50.0
3 2013-01-03 TSLA 21 37.5
4 2013-01-04 AAPL 141 87.5
5 2013-01-04 MSFT 43 62.5
6 2013-01-04 WMT 205 100.0
7 2013-01-04 TSLA 20 25.0
8 2013-01-07 AAPL 256 100.0
9 2013-01-07 MSFT 93 50.0
10 2013-01-07 WMT 103 62.5
11 2013-01-07 TSLA 25 25.0
12 2013-01-08 AAPL 233 87.5
13 2013-01-08 MSFT 60 37.5
14 2013-01-08 WMT 13 12.5
15 2013-01-08 TSLA 104 75.0
16 2013-01-09 AAPL 19 25.0
17 2013-01-09 MSFT 120 50.0
18 2013-01-09 WMT 282 75.0
19 2013-01-09 TSLA 293 100.0

Related

How to calculate percentile in a pd dataframe?

Trying to calculate the percentile of a value in a pd column but only for x number of values:
Column name = 'Signal'
50
20
30
20
102
22
1
2
43
49
38
173
371
312
Next to each value would like to calculate on a rolling basis the percentile of that value among X numbers preceding that one (inclusive).
Thanks!
You can use scipy.stats.rankdata() to calculate your data's rank and pd.rolling() to calculate the rank on a rolling basis.
import pandas as pd
import numpy as np
from scipy.stats import rankdata
df = pd.DataFrame({'Signal':[50,20,30,20,102,22,1,2,43,49,38,173,371,312]})
def last_value_rank(arr):
"""
Calculates the last value's rank.
Rankings start from 1.
Duplicates are considered (e.g. [1,3,2,2] ranks are [1,4,2.5,2.5], result = 2.5)
"""
return rankdata(arr)[-1]
def last_value_rank_normalized(arr):
return (last_value_rank(arr) - 1) * 100 / (len(arr) - 1)
print(df.join(df['Signal'].rolling(5).agg([last_value_rank, last_value_rank_normalized])))
output:
Signal last_value_rank last_value_rank_normalized
0 50 NaN NaN
1 20 NaN NaN
2 30 NaN NaN
3 20 NaN NaN
4 102 5.0 100.0
5 22 3.0 50.0
6 1 1.0 0.0
7 2 2.0 25.0
8 43 4.0 75.0
9 49 5.0 100.0
10 38 3.0 50.0
11 173 5.0 100.0
12 371 5.0 100.0
13 312 4.0 75.0

Interpolate above and below a range of values in a column - Pandas

I was looking for the way to extend the range values inside a Pandas column by interpolation, but I still don't know how to set the 'limits' of the interpolation, I mean, it's something like:
[Distance] [Radiation]
12 120
13 130
14 140
15 150
16 160
17 170
So, what I'm trying to get is the full range of column [Radiation] according to the complete secuence of column [Distance] by interpolation.
[Distance] [Radiation]
1 10
2 20
. .
. .
12 120
13 130
14 140
15 150
16 160
. .
. .
20 200
I was looking in the documentation of pandas and scipy methods but I think I couldn't find it yet.
Thanks for your insights.
One idea is use DataFrame.reindex for add all not existing values of distance and then use DataFrame.interpolate with barycentric method:
df = (df.set_index('Distance')
.reindex(range(1, 21))
.interpolate(method='barycentric', limit_direction='both')
.reset_index())
print (df)
Distance Radiation
0 1 10.0
1 2 20.0
2 3 30.0
3 4 40.0
4 5 50.0
5 6 60.0
6 7 70.0
7 8 80.0
8 9 90.0
9 10 100.0
10 11 110.0
11 12 120.0
12 13 130.0
13 14 140.0
14 15 150.0
15 16 160.0
16 17 170.0
17 18 180.0
18 19 190.0
19 20 200.0

In the output I am not getting the complete table from Excel

I just started using pandas, i wanted to import one Excel file with 31 rows and 11 columns, but in the output only some columns are displayed, the middle columns are represented by "....", and the first column 'EST' the starting few elements are displayed "00:00:00".
Code
import pandas as pd
df = pd.read_excel("C:\\Users\daryl\PycharmProjects\pandas\Book1.xlsx")
print(df)
Output
C:\Users\daryl\AppData\Local\Programs\Python\Python37\python.exe "C:/Users/daryl/PycharmProjects/pandas/1. Introduction.py"
EST Temperature ... Events WindDirDegrees
0 2016-01-01 00:00:00 38 ... NaN 281
1 2016-02-01 00:00:00 36 ... NaN 275
2 2016-03-01 00:00:00 40 ... NaN 277
3 2016-04-01 00:00:00 25 ... NaN 345
4 2016-05-01 00:00:00 20 ... NaN 333
5 2016-06-01 00:00:00 33 ... NaN 259
6 2016-07-01 00:00:00 39 ... NaN 293
7 2016-08-01 00:00:00 39 ... NaN 79
8 2016-09-01 00:00:00 44 ... Rain 76
9 2016-10-01 00:00:00 50 ... Rain 109
10 2016-11-01 00:00:00 33 ... NaN 289
11 2016-12-01 00:00:00 35 ... NaN 235
12 1-13-2016 26 ... NaN 284
13 1-14-2016 30 ... NaN 266
14 1-15-2016 43 ... NaN 101
15 1-16-2016 47 ... Rain 340
16 1-17-2016 36 ... Fog-Snow 345
17 1-18-2016 25 ... Snow 293
18 1/19/2016 22 ... NaN 293
19 1-20-2016 32 ... NaN 302
20 1-21-2016 31 ... NaN 312
21 1-22-2016 26 ... Snow 34
22 1-23-2016 26 ... Fog-Snow 42
23 1-24-2016 28 ... Snow 327
24 1-25-2016 34 ... NaN 286
25 1-26-2016 43 ... NaN 244
26 1-27-2016 41 ... Rain 311
27 1-28-2016 37 ... NaN 234
28 1-29-2016 36 ... NaN 298
29 1-30-2016 34 ... NaN 257
30 1-31-2016 46 ... NaN 241
[31 rows x 11 columns]
Process finished with exit code 0
To answer your question about the display of only a few columns and "..." :
All of the columns have been properly ingested, but your screen / the console is not wide enough to output all of the columns at once in a "print" fashion. This is normal/expected behavior.
Pandas is not a spreadsheet visualization tool like Excel. Maybe someone can suggest a tool for visualizing dataframes in a spreadsheet format for Python, like in Excel. I think I've seen people visualizing spreadsheets in Spyder but I don't use that myself.
If you want to make sure all of the columns are there, try using list(df) or print(list(df)).
To answer your question about the EST format:
It looks like you have some data cleaning to do. This is typical work in data science. I am not sure how to best do this - I haven't worked much with dates/datetime yet. However, here is what I see:
The first few items have timestamps as well, likely formatted in HH:MM:SS
They are formatted YYYY-MM-DD
On index row 18, there are / instead of - in the date
The remaining rows are formatted M-DD-YYYY
There's an option on read_csv's documentation that may take care of those automatically. It's called "parse_dates". If you turn that option on like pd.read_csv('file location', parse_dates='EST'), that could turn on the date parser for the EST column and maybe solve your problem.
Hope this helps! This is my first answer to anyone who sees it feel free to edit and improve it.

Pandas Pivot table without aggregating

I have a dataframe df as:
Acct_Id Acct_Nm Srvc_Id Phone_Nm Phone_plan_value Srvc_Num
51 Roger 789 Pixel 30 1
51 Roger 800 iPhone 25 2
51 Roger 945 Galaxy 40 3
78 Anjay 100 Nokia 50 1
78 Anjay 120 Oppo 30 2
32 Rafa 456 HTC 35 1
I want to transform the dataframe so I can have 1 row per Acct_Id and Acct_Nm as:
Acct_Id Acct_Nm Srvc_Num_1 Srvc_Num_2 Srvc_Num_3
Srvc_Id Phone_Nm Phone_plan_value Srvc_Id Phone_Nm Phone_plan_value Srvc_Id Phone_Nm Phone_plan_value
51 Roger 789 Pixel 30 800 iPhone 25 945 Galaxy 40
78 Anjay 100 Nokia 50 120 Oppo 30
32 Rafa 456 HTC 35
I am not sure how to achieve the same in pandas.
More like a pivot problem , but need swaplevel and sort_index
df.set_index(['Acct_Id','Acct_Nm','Srvc_Num']).\
unstack().\
swaplevel(1,0,axis=1).\
sort_index(level=0,axis=1).add_prefix('Srvc_Num_')
Out[289]:
Srvc_Num Srvc_Num_1 \
Srvc_Num_Phone_Nm Srvc_Num_Phone_plan_value Srvc_Num_Srvc_Id
Acct_Id Acct_Nm
32 Rafa HTC 35.0 456.0
51 Roger Pixel 30.0 789.0
78 Anjay Nokia 50.0 100.0
Srvc_Num Srvc_Num_2 \
Srvc_Num_Phone_Nm Srvc_Num_Phone_plan_value Srvc_Num_Srvc_Id
Acct_Id Acct_Nm
32 Rafa None NaN NaN
51 Roger iPhone 25.0 800.0
78 Anjay Oppo 30.0 120.0
Srvc_Num Srvc_Num_3
Srvc_Num_Phone_Nm Srvc_Num_Phone_plan_value Srvc_Num_Srvc_Id
Acct_Id Acct_Nm
32 Rafa None NaN NaN
51 Roger Galaxy 40.0 945.0
78 Anjay None NaN NaN
And here is pivot_table
pd.pivot_table(df,index=['Acct_Id','Acct_Nm'],columns=['Srvc_Num'],values=['Phone_Nm','Phone_plan_value','Srvc_Id'],aggfunc='first')
How about:
df.set_index(['Acct_Id', 'Acct_Nm', 'Srvc_Num']).unstack().swaplevel(0, 1, axis = 1).sort_index(axis = 1)

why am I getting a too many indexers error?

cars_df = pd.DataFrame((car.iloc[:[1,3,4,6]].values), columns = ['mpg', 'dip', 'hp', 'wt'])
car_t = car.iloc[:9].values
target_names = [0,1]
car_df['group'] = pd.series(car_t, dtypre='category')
sb.pairplot(cars_df)
I have tried using .iloc(axis=0)[xxxx] and making a slice into a list and a tuple. no dice. Any thoughts? I am trying to make a scatter plot from a lynda.com video but in the video, the host is using .ix which is deprecated. So I am using .iloc[]
car = a dataframe
a few lines of data
"Car_name","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
"Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
"Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
"Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
"Valiant",18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
"Duster 360",14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
"Merc 240D",24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
"Merc 230",22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
"Merc 280",19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
"Merc 280C",17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
"Merc 450SE",16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
I think you want select multiple columns by iloc:
cars_df = car.iloc[:, [1,3,4,6]]
print (cars_df)
mpg disp hp wt
0 21.0 160.0 110 2.620
1 21.0 160.0 110 2.875
2 22.8 108.0 93 2.320
3 21.4 258.0 110 3.215
4 18.7 360.0 175 3.440
5 18.1 225.0 105 3.460
6 14.3 360.0 245 3.570
7 24.4 146.7 62 3.190
8 22.8 140.8 95 3.150
9 19.2 167.6 123 3.440
10 17.8 167.6 123 3.440
11 16.4 275.8 180 4.070
sb.pairplot(cars_df)
Not 100% sure with another code, it seems need:
#select also 9. column
cars_df = car.iloc[:, [1,3,4,6,9]]
#rename 9. column
cars_df = cars_df.rename(columns={'am':'group'})
#convert it to categorical
cars_df['group'] = pd.Categorical(cars_df['group'])
print (cars_df)
mpg disp hp wt group
0 21.0 160.0 110 2.620 1
1 21.0 160.0 110 2.875 1
2 22.8 108.0 93 2.320 1
3 21.4 258.0 110 3.215 0
4 18.7 360.0 175 3.440 0
5 18.1 225.0 105 3.460 0
6 14.3 360.0 245 3.570 0
7 24.4 146.7 62 3.190 0
8 22.8 140.8 95 3.150 0
9 19.2 167.6 123 3.440 0
10 17.8 167.6 123 3.440 0
11 16.4 275.8 180 4.070 0
#add parameetr hue for different levels of a categorical variable
sb.pairplot(cars_df, hue='group')

Resources