To find the sum and percentage from columns of two different dataframes and append the result to a third dataframe - python-3.x

I have made two identically structured dataframes, which look like below:
df1:
date id email Count
4/22/2019 1 abc#xyz.com 10
4/22/2019 1 def#xyz.com 4
4/23/2019 1 abc#xyz.com 5
4/23/2019 1 def#xyz.com 10
df2:
date id Email_ID Count
4/22/2019 1 fgh#xyz.com 5
4/22/2019 1 ijk#xyz.com 6
4/23/2019 1 fgh#xyz.com 7
4/23/2019 1 ijk#xyz.com 8
I want to make a dataframe3 which has the sum and percentage of the 'Count' column of each dataframe (df1 and df2), calculating the individual percentage [like df1_count% = (df1_count / (df1_count + df2_count)) * 100] according to the date. Output df3 should be something like this below:
df3:
Count Count%
date df1_count df2_count df1_count% df2_count%
4/22/2019 14 11 56% 44%
4/23/2019 15 15 50% 50%
How can this be done with pandas? I am able to do it using a 'for' loop but not with pandas functionality; any leads will help.
Output as per the solution by @jezrael:
Count Count count% count%
df1_count df2_count df1_count% df2_count%
Date
4/22/2019 14 11 56% 44%
4/23/2019 15 15 50% 50%

Use concat with sum aggregation per date:
df = pd.concat([df1.groupby('date')['Count'].sum(),
                df2.groupby('date')['Count'].sum()],
               axis=1, keys=('df1_count', 'df2_count'))
And then add new columns:
s = (df['df1_count'] + df['df2_count'])
df['df1_count%'] = df['df1_count'] / s * 100
df['df2_count%'] = df['df2_count'] / s * 100
df = df.reset_index()
print (df)
date df1_count df2_count df1_count% df2_count%
0 4/22/2019 14 11 56.0 44.0
1 4/23/2019 15 15 50.0 50.0
If you need the percentages as strings with a '%' sign, first round the values with Series.round to truncate decimals and then convert to strings:
s = (df['df1_count'] + df['df2_count'])
df['df1_count%'] = (df['df1_count'] / s * 100).round().astype(str) + '%'
df['df2_count%'] = (df['df2_count'] / s * 100).round().astype(str) + '%'
df = df.reset_index()
print (df)
date df1_count df2_count df1_count% df2_count%
0 4/22/2019 14 11 56.0% 44.0%
1 4/23/2019 15 15 50.0% 50.0%
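The rounded floats keep a trailing .0 (56.0%, 44.0%). If you want whole-number percentages like the desired 56% in the question, a small variation (a sketch) is to cast to int before appending the sign:
# cast the rounded percentage to int so the string becomes '56%' instead of '56.0%'
df['df1_count%'] = (df['df1_count'] / s * 100).round().astype(int).astype(str) + '%'
df['df2_count%'] = (df['df2_count'] / s * 100).round().astype(int).astype(str) + '%'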
EDIT:
df = pd.concat([df1.groupby('date')['Count'].sum(),
                df2.groupby('date')['Count'].sum()],
               axis=1, keys=('Count_df1_count', 'Count_df2_count'))
s = (df['Count_df1_count'] + df['Count_df2_count'])
df['Count%_df1_count%'] = (df['Count_df1_count'] / s * 100).round().astype(str) + '%'
df['Count%_df2_count%'] = (df['Count_df2_count'] / s * 100).round().astype(str) + '%'
df.columns = df.columns.str.split('_', expand=True, n=1)
print (df)
Count Count%
df1_count df2_count df1_count% df2_count%
date
4/22/2019 14 11 56.0% 44.0%
4/23/2019 15 15 50.0% 50.0%
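As an alternative to splitting the underscore-joined names, a sketch that sets the two-level columns explicitly with pd.MultiIndex.from_tuples on the frame from the first snippet (before reset_index), whose column order is df1_count, df2_count, df1_count%, df2_count%:
# build the (level 0, level 1) column pairs directly
df.columns = pd.MultiIndex.from_tuples(
    [('Count', 'df1_count'), ('Count', 'df2_count'),
     ('Count%', 'df1_count%'), ('Count%', 'df2_count%')])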

Related

How to sum by month in timestamp Data Frame?

I have a dataframe like this:
trx_date    trx_amount
2013-02-11  35
2014-03-10  26
2011-02-9   10
2013-02-12  5
2013-01-11  21
How do I group that by month and year so that I can sum the trx_amount?
Example expected output:
trx_monthly  trx_sum
2013-02      40
2013-01      21
2014-02      35
You can convert the values to month periods with Series.dt.to_period and then aggregate with sum:
df['trx_date'] = pd.to_datetime(df['trx_date'])
df1 = (df.groupby(df['trx_date'].dt.to_period('m').rename('trx_monthly'))['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print (df1)
trx_monthly trx_sum
0 2011-02 10
1 2013-01 21
2 2013-02 40
3 2014-03 26
Or convert the datetimes to strings in YYYY-MM format with Series.dt.strftime:
df2 = (df.groupby(df['trx_date'].dt.strftime('%Y-%m').rename('trx_monthly'))['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print (df2)
trx_monthly trx_sum
0 2011-02 10
1 2013-01 21
2 2013-02 40
3 2014-03 26
Or group by year and month separately; the output is then different, with 3 columns:
df2 = (df.groupby([df['trx_date'].dt.year.rename('year'),
                   df['trx_date'].dt.month.rename('month')])['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print (df2)
year month trx_sum
0 2011 2 10
1 2013 1 21
2 2013 2 40
3 2014 3 26
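Another option (a sketch) is pd.Grouper, which behaves like resample on the trx_date column; note that the group key is then a month-end timestamp rather than a period, and months with no transactions appear with a sum of 0:
# group on month-end timestamps derived from trx_date
df3 = (df.groupby(pd.Grouper(key='trx_date', freq='M'))['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))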
You can try this -
df['trx_date'] = pd.to_datetime(df['trx_date'])  # make sure the column is datetime
df['trx_month'] = df['trx_date'].dt.month
df_agg = df.groupby('trx_month')['trx_amount'].sum()
Note this groups by calendar month only, so the same month from different years is combined.

Moving aggregate within a specified date range

Using the sample credit card transactions data below:
import random
from datetime import datetime, timedelta  # timedelta is used further below
import pandas as pd

df = pd.DataFrame({
    'card_id': [1, 1, 1, 2, 2],
    'date': [datetime(2020, 6, random.randint(1, 14)) for i in range(5)],
    'amount': [random.randint(1, 100) for i in range(5)]})
df
card_id date amount
0 1 2020-06-07 11
1 1 2020-06-11 45
2 1 2020-06-14 87
3 2 2020-06-04 48
4 2 2020-06-12 76
I'm trying to take, at the point of each transaction, the total amount spent on that card in the past 7 days. For example, if card_id 1 made a transaction on June 8, I want the total of its transactions from June 1 to June 7. This is what I was hoping to get:
card_id date amount sum_past_7d
0 1 2020-06-07 11 0
1 1 2020-06-11 45 11
2 1 2020-06-14 87 56
3 2 2020-06-04 48 0
4 2 2020-06-12 76 48
I'm currently using this function and pd.apply to generate my desired column but it's taking too long on the actual data (> 1 million rows).
df['past_week'] = df['date'].apply(lambda x: x - timedelta(days=7))

def myfunction(x):
    return df.loc[(df['card_id'] == x.card_id) &
                  (df['date'] >= x.past_week) &
                  (df['date'] < x.date), :]['amount'].sum()
Is there a faster and more efficient way to do this?
Let's try rolling on date with groupby:
# make sure the data is sorted properly
# your sample is already sorted, so you can skip this
df = df.sort_values(['card_id', 'date'])
df['sum_past_7D'] = (df.set_index('date').groupby('card_id')['amount']
                       .rolling('7D').sum()
                       .groupby('card_id').shift(fill_value=0)
                       .values)
Output:
card_id date amount sum_past_7D
0 1 2020-06-07 11 0.0
1 1 2020-06-11 45 11.0
2 1 2020-06-14 87 56.0
3 2 2020-06-04 48 0.0
4 2 2020-06-12 76 48.0
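The rolling sum comes back as floats (0.0, 11.0, ...). If you want integers like the expected output, a small follow-up (a sketch):
# safe here because shift used fill_value=0, so there are no NaNs left
df['sum_past_7D'] = df['sum_past_7D'].astype(int)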

pandas calculate scores for each group based on multiple functions

I have the following df:
group_id code amount date
1 100 20 2017-10-01
1 100 25 2017-10-02
1 100 40 2017-10-03
1 100 25 2017-10-03
2 101 5 2017-11-01
2 102 15 2017-10-15
2 103 20 2017-11-05
I'd like to group by group_id and then compute a score for each group based on the following criteria:
if the code values are all the same in a group, score 0, and 10 otherwise;
if the amount sum is > 100, score 20, and 0 otherwise;
sort_values by date in descending order and sum the differences between the dates; if the sum is < 5 days, score 30, otherwise 0.
So the resulting df looks like:
group_id code amount date score
1 100 20 2017-10-01 50
1 100 25 2017-10-02 50
1 100 40 2017-10-03 50
1 100 25 2017-10-03 50
2 101 5 2017-11-01 10
2 102 15 2017-10-15 10
2 103 20 2017-11-05 10
Here are the functions that correspond to each criterion above:
import numpy as np

def amount_score(df, amount_col, thold=100):
    if df[amount_col].sum() > thold:
        return 20
    else:
        return 0

def col_uniq_score(df, col_name):
    if df[col_name].nunique() == 1:
        return 0
    else:
        return 10

def date_diff_score(df, col_name):
    df.sort_values(by=[col_name], ascending=False, inplace=True)
    if df[col_name].diff().dropna().sum() / np.timedelta64(1, 'D') < 5:
        return 30
    else:
        return 0
I am wondering how to apply these functions to each group and calculate the sum of all the functions to give a score.
You can use groupby.transform, which returns a Series the same length as the original DataFrame, together with numpy.where for the if-else logic of each condition:
grouped = df.groupby('group_id')
a = np.where(grouped['code'].transform('nunique') == 1, 0, 10)
print (a)
[ 0  0  0  0 10 10 10]
b = np.where(grouped['amount'].transform('sum') > 100, 20, 0)
print (b)
[20 20 20 20  0  0  0]
c = np.where(grouped['date'].transform(lambda x: x.sort_values().diff().dropna().sum()).dt.days < 5, 30, 0)
print (c)
[30 30 30 30  0  0  0]
df['score'] = a + b + c
print (df)
group_id code amount date score
0 1 100 20 2017-10-01 50
1 1 100 25 2017-10-02 50
2 1 100 40 2017-10-03 50
3 1 100 25 2017-10-03 50
4 2 101 5 2017-11-01 10
5 2 102 15 2017-10-15 10
6 2 103 20 2017-11-05 10

grouping by weekdays pandas

I have a dataframe, df, containing
Index Date & Time eventName eventCount
0 2017-08-09 ABC 24
1 2017-08-09 CDE 140
2 2017-08-10 CDE 150
3 2017-08-11 DEF 200
4 2017-08-11 ABC 20
5 2017-08-16 CDE 10
6 2017-08-16 ABC 15
7 2017-08-17 CDE 10
8 2017-08-17 DEF 50
9 2017-08-18 DEF 80
...
I want to sum the eventCount for each weekday occurrence and plot the total events for each weekday (from MON to SUN), i.e. for example:
Summation of the eventCount values of:
2017-08-09 and 2017-08-16 (Wednesdays) = 189
2017-08-10 and 2017-08-17 (Thursdays) = 210
2017-08-11 and 2017-08-18 (Fridays) = 300
I have tried
dailyOccurenceSum=df['eventCount'].groupby(lambda x: x.weekday).sum()
and I get this error: AttributeError: 'int' object has no attribute 'weekday'
Starting with df -
df
Index Date & Time eventName eventCount
0 0 2017-08-09 ABC 24
1 1 2017-08-09 CDE 140
2 2 2017-08-10 CDE 150
3 3 2017-08-11 DEF 200
4 4 2017-08-11 ABC 20
5 5 2017-08-16 CDE 10
6 6 2017-08-16 ABC 15
7 7 2017-08-17 CDE 10
8 8 2017-08-17 DEF 50
9 9 2017-08-18 DEF 80
First, convert Date & Time to a datetime column -
df['Date & Time'] = pd.to_datetime(df['Date & Time'])
Next, call groupby + sum on the weekday name.
df = df.groupby(df['Date & Time'].dt.weekday_name)['eventCount'].sum()
df
Date & Time
Friday 300
Thursday 210
Wednesday 189
Name: eventCount, dtype: int64
If you want to sort by weekday, convert the index to categorical and call sort_index -
cat = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday', 'Sunday']
df.index = pd.Categorical(df.index, categories=cat, ordered=True)
df = df.sort_index()
df
Wednesday 189
Thursday 210
Friday 300
Name: eventCount, dtype: int64
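Note that Series.dt.weekday_name was removed in later pandas releases; Series.dt.day_name() is the replacement, so on a recent version the groupby key becomes (a sketch):
# dt.day_name() returns the weekday name ('Monday', 'Tuesday', ...) like weekday_name did
df = df.groupby(df['Date & Time'].dt.day_name())['eventCount'].sum()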

Python correlation matrix 3d dataframe

I have a historical return table in SQL Server, by date and asset Id, like this:
[Date] [Asset] [1DRet]
jan asset1 0.52
jan asset2 0.12
jan asset3 0.07
feb asset1 0.41
feb asset2 0.33
feb asset3 0.21
...
So I need to calculate the correlation matrix, for a given date range, for all asset combinations: A1,A2 ; A1,A3 ; A2,A3.
I'm using pandas, and in my SQL SELECT WHERE clause I'm filtering the date range and ordering by date.
I'm trying to do it using pandas df.corr(), numpy.corrcoef and SciPy, but I'm not able to do it for my n-variable dataframe.
I see some examples, but they are always for a dataframe with one asset per column and one row per day.
This is the code block where I'm doing it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result),columns=result.keys())
df1d = df[['Date','Id_RiskFactor','1DReturn']]
corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
For it I'm receiving this message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
print(df1d.head()):
Date Id_RiskFactor 1DReturn
0 2010-09-02 149 0E-12
1 2010-09-02 150 -0.004242875148
2 2010-09-02 33 0.000590000011
3 2010-09-02 28 0.000099999997
4 2010-09-02 34 -0.000010000000
print(df.head()):
Date Id_RiskFactor Value 1DReturn 5DReturn
0 2010-09-02 149 0.040096000000 0E-12 0E-12
1 2010-09-02 150 1.736700000000 -0.004242875148 -0.013014321215
2 2010-09-02 33 2.283000000000 0.000590000011 0.001260000048
3 2010-09-02 28 2.113000000000 0.000099999997 0.000469999999
4 2010-09-02 34 0.615000000000 -0.000010000000 0.000079999998
print(corr.columns):
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'daily_return': np.random.random(15),
                   'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
                   'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
daily_return date symbol
0 0.011467 2015-01-01 A
1 0.613518 2015-01-02 A
2 0.334343 2015-01-03 A
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
5 0.431729 2015-01-01 B
6 0.474905 2015-01-02 B
7 0.372366 2015-01-03 B
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
10 0.946504 2015-01-01 C
11 0.337204 2015-01-02 C
12 0.798704 2015-01-03 C
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2 A B C
symbol_1
A 1.000000 0.188065 -0.745115
B 0.188065 1.000000 -0.688808
C -0.745115 -0.688808 1.000000
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
daily_return date symbol
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
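Putting the two steps together, a sketch of the correlation matrix restricted to that date range, reusing the names defined above (the droplevel/renaming steps from earlier still apply afterwards):
# filter the rows first, then pivot and correlate as before
mask = df.date.between(start_date, end_date)
corr = df.loc[mask].set_index(['date', 'symbol']).unstack().corr()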
If you want to flatten your correlation matrix:
corr.stack().reset_index()
symbol_1 symbol_2 0
0 A A 1.000000
1 A B 0.188065
2 A C -0.745115
3 B A 0.188065
4 B B 1.000000
5 B C -0.688808
6 C A -0.745115
7 C B -0.688808
8 C C 1.000000
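If you prefer a named value column instead of the default 0, a small follow-up (a sketch):
# name the stacked correlation values explicitly
flat = corr.stack().rename('correlation').reset_index()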
