Group By Quarterly Avg and Get Values That Were Used in Avg Calculation - pandas - python-3.x

I have a df like this:
          time   value
0   2019-07-30  124.00
1   2019-07-19  123.00
2   2019-08-28  191.46
3   2019-10-25  181.13
4   2019-11-01   24.23
5   2019-11-13  340.00
6   2020-01-01   36.12
7   2020-01-25   56.12
8   2020-01-30  121.00
9   2020-02-04  115.62
10  2020-02-06   63.62
I want to group by quarterly average and get the values that were used in the average calculation. Something like below:
Year  Quarter  Values                            Avg
2019  Q3       124, 123, 191                     146
2019  Q4       181.13, 24.23, 340                181.78
2020  Q1       36.12, 26.12, 121, 115.62, 63.62  72.96
How can I achieve my desired result?

Use GroupBy.agg with year and quarter columns created by Series.dt.year and Series.dt.quarter, joining the values converted to strings and taking the mean, via named aggregations:
import pandas as pd

df['time'] = pd.to_datetime(df['time'])

df1 = (df.assign(Year=df['time'].dt.year,
                 Q='Q' + df['time'].dt.quarter.astype(str),
                 vals=df['value'].astype(str))
         .groupby(['Year', 'Q'])
         .agg(Values=('vals', ', '.join), Avg=('value', 'mean'))
         .reset_index())
print(df1)
Year Q Values Avg
0 2019 Q3 124.0, 123.0, 191.46 146.153333
1 2019 Q4 181.13, 24.23, 340.0 181.786667
2 2020 Q1 36.12, 56.12, 121.0, 115.62, 63.62 78.496000
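If you want Avg rounded to two decimals like in the desired output, one small extra step (my addition, not part of the original answer):
df1['Avg'] = df1['Avg'].round(2)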
EDIT:
df['time'] = pd.to_datetime(df['time'])

df1 = (df.groupby(df['time'].dt.to_period('Q').rename('YearQ'))['value']
         .agg([('Values', lambda x: ', '.join(x.astype(str))), ('Avg', 'mean')])
         .reset_index()
         .assign(Year=lambda x: x['YearQ'].dt.year,
                 Q=lambda x: 'Q' + x['YearQ'].dt.quarter.astype(str))
         .reindex(['Year', 'Q', 'Values', 'Avg'], axis=1))
print(df1)
Year Q Values Avg
0 2019 Q3 124.0, 123.0, 191.46 146.153333
1 2019 Q4 181.13, 24.23, 340.0 181.786667
2 2020 Q1 36.12, 56.12, 121.0, 115.62, 63.62 78.496000
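If a single combined label such as 2019Q3 is acceptable, the quarter period can also be cast to a string directly - my variation on the EDIT above, not part of the original answer:
df1 = (df.groupby(df['time'].dt.to_period('Q').astype(str).rename('YearQ'))['value']
         .agg([('Values', lambda x: ', '.join(x.astype(str))), ('Avg', 'mean')])
         .reset_index())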

Create a grouper, group by it, and reshape the index into year and quarter:
import numpy as np

grouper = pd.Grouper(key='time', freq='Q')
res = (df.assign(temp=df.value.astype(str))
         .groupby(grouper)
         .agg(Values=('temp', ','.join),
              Avg=('value', np.mean)))
res.index = [res.index.year, 'Q' + res.index.quarter.astype(str)]
res.index = res.index.set_names(['Year','Quarter'])
Values Avg
Year Quarter
2019 Q3 123.0,124.0,191.46 146.153333
Q4 181.13,24.23,340.0 181.786667
2020 Q1 36.12,56.12,121.0,115.62,63.62 78.496000
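If you would rather keep the raw values as Python lists instead of joined strings - an assumption about the desired output on my part, not from the original answers - named aggregation accepts list directly:
res = (df.groupby([df['time'].dt.year.rename('Year'),
                   ('Q' + df['time'].dt.quarter.astype(str)).rename('Quarter')])['value']
         .agg(Values=list, Avg='mean'))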

Related

How to sum by month in a timestamp DataFrame?

I have a dataframe like this:
trx_date    trx_amount
2013-02-11          35
2014-03-10          26
2011-02-09          10
2013-02-12           5
2013-01-11          21
How do I group that by month and year so that I can sum the trx_amount?
Example expected output:
trx_monthly  trx_sum
2013-02           40
2013-01           21
2014-02           35
You can convert values to month periods by Series.dt.to_period and then aggregate sum:
df['trx_date'] = pd.to_datetime(df['trx_date'])

df1 = (df.groupby(df['trx_date'].dt.to_period('m').rename('trx_monthly'))['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print(df1)
trx_monthly trx_sum
0 2011-02 10
1 2013-01 21
2 2013-02 40
3 2014-03 26
Or convert datetimes to strings in format YYYY-MM by Series.dt.strftime:
df2 = (df.groupby(df['trx_date'].dt.strftime('%Y-%m').rename('trx_monthly'))['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print(df2)
trx_monthly trx_sum
0 2011-02 10
1 2013-01 21
2 2013-02 40
3 2014-03 26
Or group by year and month separately; the output is then different - 3 columns:
df2 = (df.groupby([df['trx_date'].dt.year.rename('year'),
                   df['trx_date'].dt.month.rename('month')])['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print(df2)
year month trx_sum
0 2011 2 10
1 2013 1 21
2 2013 2 40
3 2014 3 26
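If you prefer one sortable period column instead of the separate year and month columns, the two fields can be recombined - a small addition on my part, not from the original answer:
df2['trx_monthly'] = pd.to_datetime(df2['year'].astype(str) + '-' + df2['month'].astype(str)).dt.to_period('M')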
You can try this - note that it groups by month number only, so the same month from different years is summed together:
df['trx_date'] = pd.to_datetime(df['trx_date'])
df['trx_month'] = df['trx_date'].dt.month
df_agg = df.groupby('trx_month')['trx_amount'].sum()

Pandas: How to create a DateTime index

There is a Pandas DataFrame like this:
year month count
0 2014 Jan 12
1 2014 Feb 10
2 2015 Jan 12
3 2015 Feb 10
How do I create a DateTime index from 'year' and 'month', so the result would be:
count
2014.01.31 12
2014.02.28 10
2015.01.31 12
2015.02.28 10
Use to_datetime with DataFrame.pop to consume and remove the columns in one step, then add offsets.MonthEnd:
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
df.index = dates + pd.offsets.MonthEnd()
print(df)
count
2014-01-31 12
2014-02-28 10
2015-01-31 12
2015-02-28 10
Or:
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
df.index = dates + pd.to_timedelta(dates.dt.daysinmonth - 1, unit='d')
print(df)
count
2014-01-31 12
2014-02-28 10
2015-01-31 12
2015-02-28 10
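Another option - my variation, not part of the original answer - is to go through a monthly Period and take its end, which avoids the offset arithmetic:
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
# month-end timestamp via Period; normalize() drops the 23:59:59 end-of-period time
df.index = dates.dt.to_period('M').dt.to_timestamp(how='end').dt.normalize()
print(df)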

Get the last date before an nth date for each month in Python

I am using a CSV with an accumulative number that changes daily.
Day Accumulative Number
0 9/1/2020 100
1 11/1/2020 102
2 18/1/2020 98
3 11/2/2020 105
4 24/2/2020 95
5 6/3/2020 120
6 13/3/2020 100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. I want to check the balance on the 11th of each month, but for some months there is no activity on that specific day. As a result, I am trying to get the latest day before the 12th of each month. So, the above would be:
Day Accumulative Number
0 11/1/2020 102
1 11/2/2020 105
2 6/3/2020 120
What I managed to do so far is to just get the latest day of each month:
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, "%d/%m/%Y")
df = pd.read_csv("Accumulative.csv", quotechar="'", usecols=["Day", "Accumulative Number"],
                 index_col=False, parse_dates=["Day"], date_parser=dateparse,
                 na_values=['.', '??'])
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print(df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
Day Accumulative Number
1 2020-01-18 98
2 2020-02-24 95
3 2020-03-13 100
Is there a way to achieve this in Pandas, Python or do I have to use SQL logic in my script? Is there an easier way I am missing out in order to get the "balance" as per the 11th day of each month?
You can do groupby with factorize:
n = 12
df = df.sort_values('Day')
m = df.groupby(df.Day.dt.strftime('%Y-%m')).Day.transform(lambda x: x.factorize()[0]) == n
df_sub = df[m].copy()
You can try filtering the dataframe to rows where the day is less than 12, then taking the last row of each group (grouped by year and month):
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)

(df[df['Day'].dt.day.lt(12)]
   .groupby([df['Day'].dt.year, df['Day'].dt.month], sort=False).last()
   .reset_index(drop=True))
Day Accumulative_Number
0 2020-01-11 102
1 2020-02-11 105
2 2020-03-06 120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# select day before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
Day Accumulative Number
1 2020-01-11 102
3 2020-02-11 105
5 2020-03-06 120
Here's another way, expanding the date range:
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# replace NA with forward fill ('Number' here assumes the 'Accumulative Number' column was renamed)
df2['Number'] = df2['Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
Date Number
0 2020-01-11 102.0
1 2020-02-11 105.0
2 2020-03-11 120.0

Indicate whether datetime of row is in a daterange

I'm trying to get dummy variables for holidays in a dataset. I have a couple of dateranges (pd.date_range()) with holidays and a dataframe to which I would like to append a dummy indicating whether the datetime of that row falls in one of the specified holiday ranges.
Small example:
ChristmasBreak = list(pd.date_range('2014-12-20', '2015-01-04').date)
dates = pd.date_range('2015-01-03', '2015-01-06', freq='H')
d = {'Date': dates, 'Number': np.random.rand(len(dates))}
df = pd.DataFrame(data=d)
df.set_index('Date', inplace=True)
for i, row in df.iterrows():
    if i in ChristmasBreak:
        df.loc[i, 'Christmas'] = 1
The if condition is never entered, so matching the dates doesn't work. Is there any way to do this? Alternative methods to get to dummies for this case are welcome as well!
First, don't use iterrows, because it is really slow.
Better is to use dt.date with Series.isin, and last convert the boolean mask to integers - Trues become 1:
df = pd.DataFrame(data=d)
df['Christmas'] = df['Date'].dt.date.isin(ChristmasBreak).astype(int)
Or use between:
df['Christmas'] = df['Date'].between('2014-12-20', '2015-01-04').astype(int)
If you want to compare with a DatetimeIndex:
df = pd.DataFrame(data=d)
df.set_index('Date', inplace=True)
df['Christmas'] = df.index.date.isin(ChristmasBreak).astype(int)
df['Christmas'] = ((df.index > '2014-12-20') & (df.index < '2015-01-04')).astype(int)
Sample:
ChristmasBreak = pd.date_range('2014-12-20','2015-01-04').date
dates = pd.date_range('2014-12-19 20:00', '2014-12-20 05:00', freq='H')
d = {'Date': dates, 'Number': np.random.randint(10, size=len(dates))}
df = pd.DataFrame(data=d)
df['Christmas'] = df['Date'].dt.date.isin(ChristmasBreak).astype(int)
print(df)
Date Number Christmas
0 2014-12-19 20:00:00 6 0
1 2014-12-19 21:00:00 7 0
2 2014-12-19 22:00:00 0 0
3 2014-12-19 23:00:00 9 0
4 2014-12-20 00:00:00 1 1
5 2014-12-20 01:00:00 3 1
6 2014-12-20 02:00:00 1 1
7 2014-12-20 03:00:00 8 1
8 2014-12-20 04:00:00 2 1
9 2014-12-20 05:00:00 1 1
This should do what you want:
df['Christmas'] = df.index.isin(ChristmasBreak).astype(int)
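The question mentions a couple of dateranges, which none of the snippets above handles directly; a minimal sketch (the second range here is made up for illustration) is to union all holiday dates into one set before the isin lookup, assuming the 'Date' column variant from the Sample above:
holidays = set()
for rng in [pd.date_range('2014-12-20', '2015-01-04'),
            pd.date_range('2015-04-03', '2015-04-06')]:  # second range is hypothetical
    holidays.update(rng.date)
df['Holiday'] = df['Date'].dt.date.isin(holidays).astype(int)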

Python correlation matrix 3d dataframe

I have a historical return table in SQL Server, by date and asset Id, like this:
[Date] [Asset] [1DRet]
jan asset1 0.52
jan asset2 0.12
jan asset3 0.07
feb asset1 0.41
feb asset2 0.33
feb asset3 0.21
...
So I need to calculate the correlation matrix for a given date range for all asset combinations: A1,A2; A1,A3; A2,A3
I'm using pandas, and in my SQL SELECT WHERE clause I'm filtering the date range and ordering by date.
I'm trying to do it using pandas df.corr(), numpy.corrcoef and SciPy, but I'm not able to do it for my n-variable dataframe.
I've seen some examples, but they are always for a dataframe with one asset per column and one row per day.
This is my code block where I'm doing it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result), columns=result.keys())
df1d = df[['Date', 'Id_RiskFactor', '1DReturn']]
corr = df1d.set_index(['Date', 'Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
For it I'm receiving this message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
print(df1d.head()):
Date Id_RiskFactor 1DReturn
0 2010-09-02 149 0E-12
1 2010-09-02 150 -0.004242875148
2 2010-09-02 33 0.000590000011
3 2010-09-02 28 0.000099999997
4 2010-09-02 34 -0.000010000000
print(df.head()):
Date Id_RiskFactor Value 1DReturn 5DReturn
0 2010-09-02 149 0.040096000000 0E-12 0E-12
1 2010-09-02 150 1.736700000000 -0.004242875148 -0.013014321215
2 2010-09-02 33 2.283000000000 0.000590000011 0.001260000048
3 2010-09-02 28 2.113000000000 0.000099999997 0.000469999999
4 2010-09-02 34 0.615000000000 -0.000010000000 0.000079999998
print(corr.columns):
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'daily_return': np.random.random(15),
                   'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
                   'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
daily_return date symbol
0 0.011467 2015-01-01 A
1 0.613518 2015-01-02 A
2 0.334343 2015-01-03 A
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
5 0.431729 2015-01-01 B
6 0.474905 2015-01-02 B
7 0.372366 2015-01-03 B
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
10 0.946504 2015-01-01 C
11 0.337204 2015-01-02 C
12 0.798704 2015-01-03 C
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2 A B C
symbol_1
A 1.000000 0.188065 -0.745115
B 0.188065 1.000000 -0.688808
C -0.745115 -0.688808 1.000000
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
daily_return date symbol
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
If you want to flatten your correlation matrix:
corr.stack().reset_index()
symbol_1 symbol_2 0
0 A A 1.000000
1 A B 0.188065
2 A C -0.745115
3 B A 0.188065
4 B B 1.000000
5 B C -0.688808
6 C A -0.745115
7 C B -0.688808
8 C C 1.000000
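To give the flattened correlation column a readable name - a small addition on my part, not from the original answer:
corr.stack().reset_index(name='corr')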
