Create some features based on the average growth rate of y for the month over the past few years - python-3.x

Assuming we have a dataset df (which can be downloaded from this link), I want to create some features based on the average growth rate of y for the same month over the past several years, for example: y_agr_last2, y_agr_last3, y_agr_last4, etc.
The formula is: y_agr_lastN = ((1 + y[t-12]/100) * (1 + y[t-24]/100) * ... * (1 + y[t-12N]/100))^(1/N) - 1, i.e. the geometric mean of the growth factors for the same month in each of the past N years, minus 1.
For example, for September 2022, y_agr_last2 = ((1 + 3.85/100)*(1 + 1.81/100))^(1/2) -1, y_agr_last3 = ((1 + 3.85/100)*(1 + 1.81/100)*(1 + 1.6/100))^(1/3) -1.
The code I use is as follows, which is fairly repetitive and clumsy:
df['y_shift12'] = df['y'].shift(12)  # y for the same month one year back
df['y_shift24'] = df['y'].shift(24)  # two years back
df['y_shift36'] = df['y'].shift(36)  # three years back
df['y_agr_last2'] = pow((1 + df['y_shift12']/100) * (1 + df['y_shift24']/100), 1/2) - 1
df['y_agr_last3'] = pow((1 + df['y_shift12']/100) * (1 + df['y_shift24']/100) * (1 + df['y_shift36']/100), 1/3) - 1
df.drop(['y_shift12', 'y_shift24', 'y_shift36'], axis=1, inplace=True)
df
How can the desired result be achieved more concisely?
References:
Create some features based on the mean of y for the month over the past few years

Following is one way to generalise it:
import functools
import operator

num_yrs = 3
for n in range(1, num_yrs + 1):
    df[f"y_shift{n*12}"] = df["y"].shift(n * 12)
    df[f"y_agr_last{n}"] = pow(
        functools.reduce(operator.mul,
                         [1 + df[f"y_shift{i*12}"] / 100 for i in range(1, n + 1)],
                         1),
        1 / n
    ) - 1
# drop the helper columns and the trivial 1-year feature
df = df.drop(["y_agr_last1"] + [f"y_shift{n*12}" for n in range(1, num_yrs + 1)], axis=1)
Output:
date y x1 x2 y_agr_last2 y_agr_last3
0 2018/1/31 -13.80 1.943216 3.135839 NaN NaN
1 2018/2/28 -14.50 0.732108 0.375121 NaN NaN
...
22 2019/11/30 4.00 -0.273262 -0.021146 NaN NaN
23 2019/12/31 7.60 1.538851 1.903968 NaN NaN
24 2020/1/31 -11.34 2.858537 3.268478 -0.077615 NaN
25 2020/2/29 -34.20 -1.246915 -0.883807 -0.249940 NaN
26 2020/3/31 46.50 -4.213756 -4.670146 0.221816 NaN
...
33 2020/10/31 -1.00 1.967062 1.860070 -0.035569 NaN
34 2020/11/30 12.99 2.302166 2.092842 0.041998 NaN
35 2020/12/31 5.54 3.814303 5.611199 0.030017 NaN
36 2021/1/31 -6.41 4.205601 4.948924 -0.064546 -0.089701
37 2021/2/28 -22.38 4.185913 3.569100 -0.342000 -0.281975
38 2021/3/31 17.64 5.370519 3.130884 0.465000 0.298025
...
54 2022/7/31 0.80 -6.259455 -6.716896 0.057217 0.052793
55 2022/8/31 -5.30 1.302754 1.412277 0.015121 -0.000492
56 2022/9/30 NaN -2.876968 -3.785964 0.028249 0.024150
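For reference, the helper columns can be avoided entirely by keeping the shifted growth factors in a plain list. A minimal alternative sketch, assuming the same df as above with y holding monthly percent growth:
import numpy as np

num_yrs = 3
# growth factors for the same month 1, 2, ..., num_yrs years back
factors = [1 + df['y'].shift(12 * i) / 100 for i in range(1, num_yrs + 1)]
for n in range(2, num_yrs + 1):
    # geometric mean of the first n factors, minus 1
    df[f'y_agr_last{n}'] = np.prod(factors[:n], axis=0) ** (1 / n) - 1
This produces the same y_agr_last2 and y_agr_last3 columns without creating and dropping the y_shift* columns.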

Related

Convert one dataframe's format and check if each row exists in another dataframe in Python

Given a small dataset df1 as follow:
city year quarter
0 sh 2019 q4
1 bj 2020 q3
2 bj 2020 q2
3 sh 2020 q4
4 sh 2020 q1
5 bj 2021 q1
I would like to create a quarterly date range from 2019-q2 to 2021-q1 as df2's column names, then check for each city whether each row's year and quarter in df1 occurs there.
If they exist, return 'y' for that cell; otherwise, return NaN.
The final result will look like:
city 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN y y NaN y
1 sh NaN NaN y y NaN NaN y NaN
To create column names for df2:
pd.date_range('2019-04-01', '2021-04-01', freq = 'Q').to_period('Q')
How could I achieve this in Python? Thanks.
We can use crosstab on city and the string concatenation of the year and quarter columns:
new_df = pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
new_df:
col_0 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
city
bj 0 0 1 1 0 1
sh 1 1 0 0 1 0
We can convert to bool, replace False and True with the desired values, reindex to add the missing columns, and clean up the axes and index to get the exact output:
col_names = pd.date_range('2019-01-01', '2021-04-01', freq='Q').to_period('Q')
new_df = (
    pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
      .astype(bool)                                   # counts to boolean
      .replace({False: np.nan, True: 'y'})            # fill values
      .reindex(columns=col_names.strftime('%Y-q%q'))  # add missing columns
      .rename_axis(columns=None)                      # clean up axis name
      .reset_index()                                  # reset index
)
new_df:
city 2019-q1 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN NaN y y NaN y
1 sh NaN NaN NaN y y NaN NaN y NaN
DataFrame and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'city': ['sh', 'bj', 'bj', 'sh', 'sh', 'bj'],
    'year': [2019, 2020, 2020, 2020, 2020, 2021],
    'quarter': ['q4', 'q3', 'q2', 'q4', 'q1', 'q1']
})
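For comparison, pivot_table can replace crosstab here. A sketch under the same df and col_names as above (yq is a hypothetical helper column added for illustration):
yq = df['year'].astype(str) + '-' + df['quarter']
alt = (
    df.assign(yq=yq, flag='y')
      .pivot_table(index='city', columns='yq', values='flag', aggfunc='first')
      .reindex(columns=col_names.strftime('%Y-q%q'))
      .rename_axis(columns=None)
      .reset_index()
)
Because aggfunc='first' keeps the literal 'y' and missing combinations come out as NaN, no replace step is needed.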

How to multiply values with a group of data from a pandas series without loop iteration

I have two pandas time series with different lengths and indexes, plus a Boolean series. series_1 holds the last value of each month, indexed by the month's last day; series_2 is daily data with a daily index; the Boolean series is True on the last day of each month and False otherwise.
I want to multiply a value from series_1 (s1[0]) by the values from series_2 (s2[1:n]) that make up the daily data of one month. Is there a way to do this without a loop?
series_1 = 2010-06-30 1
2010-07-30 2
2010-08-31 5
2010-09-30 7
series_2 = 2010-07-01 2
2010-07-02 3
2010-07-03 5
2010-07-04 6
.....
2010-07-30 7
2010-08-01 6
2010-08-02 7
2010-08-03 5
.....
2010-08-31 6
Boolean = False
False
....
True
False
False
....
True
(with only the end of each month True)
I want a result series s = series_1[i] * series_2[j:j+n] (the n values from the same month).
How can I do that?
Thanks in advance.
Not sure if I got your question completely right but this should get you there:
import pandas as pd

series_1 = pd.Series({
    '2010-07-30': 2,
    '2010-08-31': 5
})
series_2 = pd.Series({
    '2010-07-01': 2,
    '2010-07-02': 3,
    '2010-07-03': 5,
    '2010-07-04': 6,
    '2010-07-30': 7,
    '2010-08-01': 6,
    '2010-08-02': 7,
    '2010-08-03': 5,
    '2010-08-31': 6
})
Make the series datetime-aware and resample them to daily frequency:
series_1.index = pd.DatetimeIndex(series_1.index)
series_1 = series_1.resample('1D').asfreq()
series_2.index = pd.DatetimeIndex(series_2.index)
series_2 = series_2.resample('1D').asfreq()
Put them in a dataframe and perform basic multiplication:
df = pd.DataFrame()
df['1'] = series_1
df['2'] = series_2
df['product'] = df['1'] * df['2']
Result:
>>> df
1 2 product
2010-07-30 2.0 7.0 14.0
2010-07-31 NaN NaN NaN
2010-08-01 NaN 6.0 NaN
2010-08-02 NaN 7.0 NaN
2010-08-03 NaN 5.0 NaN
[...]
2010-08-27 NaN NaN NaN
2010-08-28 NaN NaN NaN
2010-08-29 NaN NaN NaN
2010-08-30 NaN NaN NaN
2010-08-31 5.0 6.0 30.0
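If the goal is instead to scale every daily value by its month's series_1 entry (not just the overlapping month-end dates), aligning both series on monthly periods avoids the loop. A minimal sketch, assuming the month-end-stamped series_1 from above:
s1 = series_1.copy()
s1.index = pd.DatetimeIndex(s1.index).to_period('M')  # 2010-07-30 -> 2010-07, ...

s2 = series_2.copy()
s2.index = pd.DatetimeIndex(s2.index)

# look up each daily value's month in s1 and multiply;
# use s1.shift(1) instead of s1 if the multiplier should be the prior month-end value
s = s2 * s2.index.to_period('M').map(s1)
Months absent from series_1 simply come out as NaN.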

Python: Extract dimension data from dataframe string column and create columns with values for each of them

Hi,
I have a source file with 2 columns: ID and all_dimensions. all_dimensions is a string of "key: value" pairs, which are not the same for each ID.
I want to turn the keys into column headers and, where present, parse the respective value into the right cell.
Example:
ID all_dimensions
12 Height:2 cm,Volume: 4cl,Weight:100g
34 Length: 10cm, Height: 5 cm
56 Depth: 80cm
78 Weight: 2 kg, Length: 7 cm
90 Diameter: 4 cm, Volume: 50 cl
Desired result:
ID Height Volume Weight Length Depth Diameter
12 2 cm 4cl 100g - - -
34 5 cm - - 10cm - -
56 - - - - 80cm -
78 - - 2 kg 7 cm - -
90 - 50 cl - - - 4 cm
I have over 100 dimensions, so ideally I would like to write a for loop or something similar so that I don't have to specify each column header (see the code examples below).
I am using Python 3.7.3 and pandas 0.24.2.
What have I tried already:
1) I tried to split the data into separate columns but wasn't sure how to proceed to get each value assigned under the right header:
df.set_index('ID',inplace=True)
newdf = df["all_dimensions"].str.split(",|:",expand = True)
2) Using the initial df, I used "str.extract" to create new columns (but then I would need to specify each header):
df['Volume']=df.all_dimensions.str.extract(r'Volume:([\w\s.]*)').fillna('')
3) To resolve the problem in 2) of specifying each header, I created a list of all dimension attributes and thought to use it in a for loop to extract the values:
columns_list = df.all_dimensions.str.extract(r'^([\D]*):', expand=True).drop_duplicates()
columns_list = columns_list[0].str.strip().values.tolist()
for dimension in columns_list:
    df.dimension = df.all_dimensions.str.extract(r'dimension([\w\s.]*)').fillna('')
Here, Jupyter gives me a UserWarning ("Pandas doesn't allow columns to be created via a new attribute name") and the df looks the same as before.
Option 1: I prefer splitting several times:
new_series = (df.set_index('ID')
                .all_dimensions
                .str.split(',', expand=True)
                .stack()
                .reset_index(level=-1, drop=True)
              )

# split a second time for the individual measurements
new_df = (new_series.str
                    .split(':', expand=True)
                    .reset_index()
          )

# strip leading/trailing spaces
new_df[0] = new_df[0].str.strip()
new_df[1] = new_df[1].str.strip()

# unstack to get the desired table
new_df.set_index(['ID', 0])[1].unstack()
Option 2: Use split(',|:') as you tried:
# splitting
new_series = (df.set_index('ID')
                .all_dimensions
                .str.split(',|:', expand=True)
                .stack()
                .reset_index(level=-1, drop=True)
              )

# concat along axis=1 to get a dataframe with two columns;
# new_df.columns == ('ID', 0, 1) where column 0 holds the measurement names
new_df = (pd.concat((new_series[::2].str.strip(),
                     new_series[1::2]), axis=1)
            .reset_index())
new_df.set_index(['ID', 0])[1].unstack()
Output:
Depth Diameter Height Length Volume Weight
ID
12 NaN NaN 2 cm NaN 4cl 100g
34 NaN NaN 5 cm 10cm NaN NaN
56 80cm NaN NaN NaN NaN NaN
78 NaN NaN NaN 7 cm NaN 2 kg
90 NaN 4 cm NaN NaN 50 cl NaN
This is a hard question: the string needs to be split, each item after the split needs to be converted to a dict, and then we can rebuild those columns with the DataFrame constructor.
from collections import ChainMap

d = [[{y.split(':')[0]: y.split(':')[1]} for y in x.split(',')] for x in df.all_dimensions]
data = list(map(lambda x: dict(ChainMap(*x)), d))
s = pd.DataFrame(data)
df = pd.concat([df, s.groupby(s.columns.str.strip(), axis=1).first()], axis=1)
df
Out[26]:
ID all_dimensions Depth ... Length Volume Weight
0 12 Height:2 cm,Volume: 4cl,Weight:100g NaN ... NaN 4cl 100g
1 34 Length: 10cm, Height: 5 cm NaN ... 10cm NaN NaN
2 56 Depth: 80cm 80cm ... NaN NaN NaN
3 78 Weight: 2 kg, Length: 7 cm NaN ... 7 cm NaN 2 kg
4 90 Diameter: 4 cm, Volume: 50 cl NaN ... NaN 50 cl NaN
[5 rows x 8 columns]
Check one of the columns:
df['Height']
Out[28]:
0 2 cm
1 5 cm
2 NaN
3 NaN
4 NaN
Name: Height, dtype: object
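For comparison, a single dict comprehension per row also gets there without ChainMap (a sketch assuming every pair has the "key: value" shape of the sample data):
parsed = df['all_dimensions'].apply(
    lambda s: {k.strip(): v.strip()
               for k, v in (item.split(':', 1) for item in s.split(','))}
)
result = df[['ID']].join(pd.DataFrame(parsed.tolist(), index=df.index))
Splitting on the first ':' only (split(':', 1)) guards against stray colons in values, and stripping both sides handles the inconsistent spacing.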

DataFrame difference between rows based on multiple columns

I am trying to calculate the difference between rows based on multiple columns. The dataset is very large, so I am pasting dummy data below that describes the problem.
I want to calculate the daily difference in weight at a pet+name level. So far I have only come up with concatenating these columns and creating a multiindex from the new column and the date column, but I think there should be a better way. In the real dataset I have more than 3 columns to use when calculating the row difference.
df['pet_name'] = df.pet + df.name
df.set_index(['pet_name', 'date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan
for idx in df.index.levels[0]:
    df.diffs[idx] = df.weight[idx].diff()
Based on your description, you can try groupby:
df['pet_name']=df.pet + df.name
df.groupby('pet_name')['weight'].diff()
Use groupby with 2 columns:
df.groupby(['pet', 'name'])['weight'].diff()
All together:
#convert dates to datetimes
df['date'] = pd.to_datetime(df['date'])
#sorting
df = df.sort_values(['pet', 'name','date'])
#get differences per groups
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
Sample:
np.random.seed(123)
N = 100
L = list('abc')
df = pd.DataFrame({'pet': np.random.choice(L, N),
                   'name': np.random.choice(L, N),
                   'date': pd.Series(pd.date_range('2015-01-01', periods=int(N/10)))
                            .sample(N, replace=True),
                   'weight': np.random.rand(N)})

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['pet', 'name', 'date'])
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()

df['pet_name'] = df.pet + df.name
df = df.sort_values(['pet_name', 'date'])
df['diffs1'] = df.groupby(['pet_name', 'date'])['weight'].diff()

print(df.head(20))
date name pet weight diffs pet_name diffs1
1 2015-01-02 a a 0.105446 NaN aa NaN
2 2015-01-03 a a 0.845533 NaN aa NaN
2 2015-01-03 a a 0.980582 0.135049 aa 0.135049
2 2015-01-03 a a 0.443368 -0.537214 aa -0.537214
3 2015-01-04 a a 0.375186 NaN aa NaN
6 2015-01-07 a a 0.715601 NaN aa NaN
7 2015-01-08 a a 0.047340 NaN aa NaN
9 2015-01-10 a a 0.236600 NaN aa NaN
0 2015-01-01 b a 0.777162 NaN ab NaN
2 2015-01-03 b a 0.871683 NaN ab NaN
3 2015-01-04 b a 0.988329 NaN ab NaN
4 2015-01-05 b a 0.918397 NaN ab NaN
4 2015-01-05 b a 0.016119 -0.902279 ab -0.902279
5 2015-01-06 b a 0.095530 NaN ab NaN
5 2015-01-06 b a 0.894978 0.799449 ab 0.799449
5 2015-01-06 b a 0.365719 -0.529259 ab -0.529259
5 2015-01-06 b a 0.887593 0.521874 ab 0.521874
7 2015-01-08 b a 0.792299 NaN ab NaN
7 2015-01-08 b a 0.313669 -0.478630 ab -0.478630
7 2015-01-08 b a 0.281235 -0.032434 ab -0.032434
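Note that grouping by ['pet', 'name', 'date'] only diffs rows that share the same day, which is why most diffs above are NaN. If the difference should instead run across days within each pet+name series, drop date from the grouping keys (same columns as above):
df = df.sort_values(['pet', 'name', 'date'])
df['diffs'] = df.groupby(['pet', 'name'])['weight'].diff()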

Python correlation matrix 3d dataframe

I have in SQL Server a historical return table by date and asset Id like this:
[Date] [Asset] [1DRet]
jan asset1 0.52
jan asset2 0.12
jan asset3 0.07
feb asset1 0.41
feb asset2 0.33
feb asset3 0.21
...
So I need to calculate the correlation matrix for a given date range over all asset combinations: A1,A2; A1,A3; A2,A3.
I'm using pandas, and in my SQL SELECT's WHERE clause I'm filtering the date range and ordering by date.
I'm trying to do it using pandas df.corr(), numpy.corrcoef and SciPy, but I am not able to make it work for my n-variable dataframe.
I see some examples, but they are always for a dataframe with one asset per column and one row per day.
This is the code block where I'm doing it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result),columns=result.keys())
df1d = df[['Date','Id_RiskFactor','1DReturn']]
corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
For this I'm receiving this message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
print(df1d.head()):
Date Id_RiskFactor 1DReturn
0 2010-09-02 149 0E-12
1 2010-09-02 150 -0.004242875148
2 2010-09-02 33 0.000590000011
3 2010-09-02 28 0.000099999997
4 2010-09-02 34 -0.000010000000
print(df.head()):
Date Id_RiskFactor Value 1DReturn 5DReturn
0 2010-09-02 149 0.040096000000 0E-12 0E-12
1 2010-09-02 150 1.736700000000 -0.004242875148 -0.013014321215
2 2010-09-02 33 2.283000000000 0.000590000011 0.001260000048
3 2010-09-02 28 2.113000000000 0.000099999997 0.000469999999
4 2010-09-02 34 0.615000000000 -0.000010000000 0.000079999998
print(corr.columns):
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'daily_return': np.random.random(15),
                   'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
                   'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
daily_return date symbol
0 0.011467 2015-01-01 A
1 0.613518 2015-01-02 A
2 0.334343 2015-01-03 A
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
5 0.431729 2015-01-01 B
6 0.474905 2015-01-02 B
7 0.372366 2015-01-03 B
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
10 0.946504 2015-01-01 C
11 0.337204 2015-01-02 C
12 0.798704 2015-01-03 C
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2 A B C
symbol_1
A 1.000000 0.188065 -0.745115
B 0.188065 1.000000 -0.688808
C -0.745115 -0.688808 1.000000
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
daily_return date symbol
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
If you want to flatten your correlation matrix:
corr.stack().reset_index()
symbol_1 symbol_2 0
0 A A 1.000000
1 A B 0.188065
2 A C -0.745115
3 B A 0.188065
4 B B 1.000000
5 B C -0.688808
6 C A -0.745115
7 C B -0.688808
8 C C 1.000000
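As for the original error: the empty Index([], dtype='object') suggests the SQL driver returned 1DReturn as non-numeric objects (values like 0E-12 look like Decimals), so corr() silently dropped every column and droplevel() then failed on the flat result. Converting to float first should fix it; a sketch using the asker's column names, assuming one row per date and asset:
wide = (df1d.astype({'1DReturn': float})
            .pivot(index='Date', columns='Id_RiskFactor', values='1DReturn'))
corr = wide.corr()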
