Adding total rows to a Pandas DataFrame - python-3.x

I define a Pandas DataFrame containing several deposit/withdrawal rows for different owners. I want to add a total row for each owner that totals the deposits/withdrawals as well as the yield amounts generated by each capital amount.
Here's my current code, which produces the desired result:
import pandas as pd
OWNER = 'OWNER'
DEPWITHDR = 'DEP/WITHDR'
DATEFROM = 'DATE FROM'
DATETO = 'DATE TO'
CAPITAL = 'CAPITAL'
YIELD = 'YIELD AMT'
TOTAL = 'TOTAL'
df = pd.DataFrame({
OWNER: 2*['JOE']+3*['ROB'],
DEPWITHDR: [10000, 20000, 4000, 20000, -8000],
CAPITAL: [10000, 30000, 4000, 24000, 16000],
DATEFROM: ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-03', '2021-01-04'],
DATETO: ['2021-01-01', '2021-01-05', '2021-01-02', '2021-01-03', '2021-01-05'],
YIELD: [100, 1200, 80, 240, 320]
})
print('SOURCE DATAFRAME\n')
print(df)
print()
newDf = pd.DataFrame(columns=[OWNER, DEPWITHDR, CAPITAL, DATEFROM, DATETO, YIELD])
currentOwner = df.loc[1, OWNER]
# using groupby function to compute the two columns totals
dfTotal = df.groupby([OWNER]).agg({DEPWITHDR:'sum', YIELD:'sum'}).reset_index()
totalIndex = 0
# deactivating SettingWithCopyWarning caused by totalRow[OWNER] += ' total'
pd.set_option('mode.chained_assignment', None)
for index, row in df.iterrows():
    if currentOwner == row[OWNER]:
        newDf = newDf.append({OWNER: row[OWNER],
                              DEPWITHDR: row[DEPWITHDR],
                              CAPITAL: row[CAPITAL],
                              DATEFROM: row[DATEFROM],
                              DATETO: row[DATETO],
                              YIELD: row[YIELD]}, ignore_index=True)
    else:
        totalRow = dfTotal.loc[totalIndex]
        totalRow[OWNER] += ' total'
        newDf = newDf.append(totalRow, ignore_index=True)
        totalIndex += 1
        newDf = newDf.append({OWNER: '',
                              DEPWITHDR: '',
                              CAPITAL: '',
                              DATEFROM: '',
                              DATETO: '',
                              YIELD: ''}, ignore_index=True)
        newDf = newDf.append({OWNER: row[OWNER],
                              DEPWITHDR: row[DEPWITHDR],
                              CAPITAL: row[CAPITAL],
                              DATEFROM: row[DATEFROM],
                              DATETO: row[DATETO],
                              YIELD: row[YIELD]}, ignore_index=True)
        currentOwner = row[OWNER]
totalRow = dfTotal.loc[totalIndex]
totalRow[OWNER] += ' total'
newDf = newDf.append(totalRow, ignore_index=True)
print('TARGET DATAFRAME\n')
print(newDf.fillna(''))
My question is: what is a better, more Pandas-friendly way to obtain the desired result?

You can use groupby and concat:
df_total = pd.concat((
    df,
    df.replace({o: o + ' total' for o in df[OWNER].unique()})
      .groupby(OWNER)
      .agg({DEPWITHDR: sum, YIELD: sum})
      .reset_index()
)).fillna('').reset_index().sort_values([OWNER, DATEFROM, DATETO])
In detail:
df.replace({o: o + ' total' for o in df[OWNER].unique()}): replace each occurrence of an owner's name with the name plus the string ' total' (e.g., 'JOE' -> 'JOE total'), so that the result of the groupby will have those values in the OWNER column.
groupby(OWNER).agg({DEPWITHDR: sum, YIELD: sum}): get the sum of the DEPWITHDR and YIELD columns for each owner.
pd.concat(...).fillna('').reset_index().sort_values([OWNER, DATEFROM, DATETO]): concatenate the original DataFrame with the totals DataFrame, then sort the rows by OWNER, then DATEFROM, then DATETO, so that the total row for each owner ends up right after that owner's rows (because its OWNER value ends with ' total') and the rows are chronologically ordered by DATEFROM and DATETO.
Here is df_total:
index OWNER DEP/WITHDR CAPITAL DATE FROM DATE TO YIELD AMT
0 0 JOE 10000 10000 2021-01-01 2021-01-01 100
1 1 JOE 20000 30000 2021-01-02 2021-01-05 1200
5 0 JOE total 30000 1300
2 2 ROB 4000 4000 2021-01-01 2021-01-02 80
3 3 ROB 20000 24000 2021-01-03 2021-01-03 240
4 4 ROB -8000 16000 2021-01-04 2021-01-05 320
6 1 ROB total 16000 640
IMHO, I'd create a separate DataFrame for each owner, with only his/her data, and then a summary DataFrame with totals per owner (a sketch of that idea follows). But maybe, in your use case, this is the best solution.
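A minimal sketch of that per-owner idea, reusing df and the column constants (OWNER, DEPWITHDR, YIELD) defined in the question:
# one DataFrame per owner, plus a summary DataFrame with the per-owner totals
owner_frames = {owner: grp for owner, grp in df.groupby(OWNER)}
summary = df.groupby(OWNER, as_index=False).agg({DEPWITHDR: 'sum', YIELD: 'sum'})
print(owner_frames['JOE'])
print(summary)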

Related

Update dataframe cells according to match cells within another dataframe in pandas [duplicate]

I have two dataframes in python. I want to update rows in first dataframe using matching values from another dataframe. Second dataframe serves as an override.
Here is an example with sample data and code:
DataFrame 1:
DataFrame 2:
I want to update dataframe 1 based on matching Code and Name. In this example, dataframe 1 should be updated as below:
Note: the row with Code = 2 and Name = Company2 is updated with the value 1000 (coming from dataframe 2).
import pandas as pd
data1 = {
'Code': [1, 2, 3],
'Name': ['Company1', 'Company2', 'Company3'],
'Value': [200, 300, 400],
}
df1 = pd.DataFrame(data1, columns= ['Code','Name','Value'])
data2 = {
'Code': [2],
'Name': ['Company2'],
'Value': [1000],
}
df2 = pd.DataFrame(data2, columns= ['Code','Name','Value'])
Any pointers or hints?
Using DataFrame.update, which aligns on indices (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html):
>>> df1.set_index('Code', inplace=True)
>>> df1.update(df2.set_index('Code'))
>>> df1.reset_index() # to recover the initial structure
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
You can use concat + drop_duplicates, which updates the common rows and adds the new rows from df2:
pd.concat([df1,df2]).drop_duplicates(['Code','Name'],keep='last').sort_values('Code')
Out[1280]:
Code Name Value
0 1 Company1 200
0 2 Company2 1000
2 3 Company3 400
Update based on the comments below:
df1.set_index(['Code', 'Name'], inplace=True)
df1.update(df2.set_index(['Code', 'Name']))
df1.reset_index(drop=True, inplace=True)
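Note that reset_index(drop=True) discards the Code and Name levels entirely; if you want them back as ordinary columns, reset the index without drop=True, e.g.:
df1.reset_index(inplace=True)  # Code and Name become regular columns again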
You can merge the data first and then use numpy.where to keep the new value wherever one exists:
import numpy as np

updated = df1.merge(df2, how='left', on=['Code', 'Name'], suffixes=('', '_new'))
updated['Value'] = np.where(pd.notnull(updated['Value_new']), updated['Value_new'], updated['Value'])
updated.drop('Value_new', axis=1, inplace=True)
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
There is an update function available
example:
df1.update(df2)
for more info:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
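Keep in mind that update aligns on the index, so with the default RangeIndex a bare df1.update(df2) overwrites df1's row 0 with df2's row 0 instead of matching on Code. Setting the key column as the index first (as in the accepted answer) gives the intended result, for example:
df1 = df1.set_index('Code')
df1.update(df2.set_index('Code'))  # only the matching Code rows are overwritten
df1 = df1.reset_index()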
You can align indices and then use combine_first:
res = df2.set_index(['Code', 'Name'])\
.combine_first(df1.set_index(['Code', 'Name']))\
.reset_index()
print(res)
# Code Name Value
# 0 1 Company1 200.0
# 1 2 Company2 1000.0
# 2 3 Company3 400.0
Assuming company and code are redundant identifiers, you can also do
import pandas as pd
vdic = pd.Series(df2.Value.values, index=df2.Name).to_dict()
df1.loc[df1.Name.isin(vdic.keys()), 'Value'] = df1.loc[df1.Name.isin(vdic.keys()), 'Name'].map(vdic)
# Code Name Value
#0 1 Company1 200
#1 2 Company2 1000
#2 3 Company3 400
You can use pd.Series.where on the result of left-joining df1 and df2
merged = df1.merge(df2, on=['Code', 'Name'], how='left')
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value)
>>> df1
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
You can change the line to
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value).astype(int)
so that the value is returned as an integer.
There's something I often do.
I merge 'left' first:
df_merged = pd.merge(df1, df2, how = 'left', on = 'Code')
Pandas will create columns with the suffix '_x' (for your left dataframe) and '_y' (for your right dataframe). You want the ones that came from the right, so just remove any columns with '_x' and rename the '_y' ones:
for col in df_merged.columns:
    if '_x' in col:
        df_merged.drop(columns=col, inplace=True)
    if '_y' in col:
        new_name = col[:-len('_y')]  # slice off the suffix; strip('_y') would also eat trailing 'y' characters
        df_merged.rename(columns={col: new_name}, inplace=True)
Append df2 to df1, drop the duplicates by Code, and sort the values:
combined_df = df1.append(df2).drop_duplicates(['Code'], keep='last').sort_values('Code')
None of the above solutions worked for my particular example, which I think is rooted in the dtype of my columns, but I eventually came to this solution:
indexes = df1.loc[df1.Code.isin(df2.Code.values)].index
df1.loc[indexes, 'Value'] = df2['Value'].values  # .loc rather than .at, since .at only takes a single label

Filter dataframe on multiple conditions within different columns

I have a sample of the dataframe given below:
import pandas as pd

data = {'ID': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'Date': ['2021-2-13', '2021-2-14', '2021-2-14', '2021-2-14', '2021-2-15',
                 '2021-2-14', '2021-2-14', '2021-2-15', '2021-2-15'],
        'Modified_Date': ['3/19/2021 6:34:20 PM', '3/20/2021 4:57:39 PM', '3/21/2021 4:57:40 PM',
                          '3/22/2021 4:57:57 PM', '3/23/2021 4:57:41 PM', '3/25/2021 11:44:15 PM',
                          '3/26/2021 2:16:09 PM', '3/20/2021 2:16:04 PM', '3/21/2021 4:57:40 PM'],
        'Steps': [1000, 1200, 1500, 2000, 1400, 4000, 5000, 1000, 3500]}
df1 = pd.DataFrame(data)
df1
The data has to be filtered so that, for each 'ID' and then each 'Date', the row with the latest 'Modified_Date' is selected.
For example: for ID = 'A' and Date = '2021-2-14', the latest Modified_Date is '3/22/2021 4:57:57 PM', so that row has to be selected.
I have attached a snippet of how the final dataframe should look.
I have been stuck on this for a while.
Try:
df1["Date"] = pd.to_datetime(df1["Date"])
df1["Modified_Date"] = pd.to_datetime(df1["Modified_Date"])
df_out = df1.groupby(["ID", "Date"], as_index=False).apply(
lambda x: x.loc[x["Modified_Date"].idxmax()]
)
print(df_out)
Prints:
ID Date Modified_Date Steps
0 A 2021-02-13 2021-03-19 18:34:20 1000
1 A 2021-02-14 2021-03-22 16:57:57 2000
2 A 2021-02-15 2021-03-23 16:57:41 1400
3 B 2021-02-14 2021-03-26 14:16:09 5000
4 B 2021-02-15 2021-03-21 16:57:40 3500
Or: .sort_values + .groupby:
df_out = (
df1.sort_values(["ID", "Date", "Modified_Date"])
.groupby(["ID", "Date"], as_index=False)
.last()
)
The easiest/most straightforward way is to sort by date and take the last row per group:
(df1.sort_values(by='Modified_Date')
.groupby(['ID', 'Date'], as_index=False).last()
)
output:
ID Date Modified_Date Steps
0 A 2021-2-13 3/19/2021 6:34:20 PM 1000
1 A 2021-2-14 3/22/2021 4:57:57 PM 2000
2 A 2021-2-15 3/23/2021 4:57:41 PM 1400
3 B 2021-2-14 3/26/2021 2:16:09 PM 5000
4 B 2021-2-15 3/21/2021 4:57:40 PM 3500
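One caveat: 'Modified_Date' is still a string here ('M/D/YYYY h:mm:ss AM/PM'), so the sort is lexicographic rather than chronological and can misorder dates such as 3/9 vs 3/10. Converting to datetime first (as the other answers do) is safer, for example:
df1['Modified_Date'] = pd.to_datetime(df1['Modified_Date'])
out = df1.sort_values(by='Modified_Date').groupby(['ID', 'Date'], as_index=False).last()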
You can also sort_values and drop_duplicates:
First convert the 2 series to dates (since they are strings in the example):
df1["Date"] = pd.to_datetime(df1["Date"])
df1["Modified_Date"] = pd.to_datetime(df1["Modified_Date"])
Then sort values on Modified_date and drop_duplicates keeping the last values:
out = df1.sort_values('Modified_Date').drop_duplicates(['ID','Date'],keep='last')\
.sort_index()
print(out)
ID Date Modified_Date Steps
0 A 2021-02-13 2021-03-19 18:34:20 1000
3 A 2021-02-14 2021-03-22 16:57:57 2000
4 A 2021-02-15 2021-03-23 16:57:41 1400
6 B 2021-02-14 2021-03-26 14:16:09 5000
8 B 2021-02-15 2021-03-21 16:57:40 3500

Using Pandas groupby with total column and total row

The DataFrame used in my code lists capital and yield amounts belonging to owners. The purpose is to group the values by owner, add a total column to the grouped DataFrame, and then add a global total row.
Here's the code:
import pandas as pd
OWNER = 'OWNER'
CAPITAL = 'CAPITAL'
YIELD = 'YIELD AMT'
TOTAL = 'TOTAL'
# defining the dataframe
df = pd.DataFrame({OWNER: 2 * ['Joe'] + 3 * ['Carla'] + ['Rob'],
CAPITAL: [10000, 5000, 20000, 3000, -4000, 2000],
YIELD: [1000, 500, 2000, 300, 400, 200]})
'''
OWNER CAPITAL YIELD AMT
0 Joe 10000 1000
1 Joe 5000 500
2 Carla 20000 2000
3 Carla 3000 300
4 Carla -4000 400
5 Rob 2000 200
'''
print(df)
print()
# grouping the rows by owner
dfg = df.groupby([OWNER]).sum().reset_index()
'''
OWNER CAPITAL YIELD AMT
0 Carla 19000 2700
1 Joe 15000 1500
2 Rob 2000 200
'''
print(dfg)
print()
# adding a TOTAL column
for index in range(0, len(dfg)):
    dfg.loc[index, TOTAL] = dfg.loc[index, CAPITAL] + dfg.loc[index, YIELD]
'''
OWNER CAPITAL YIELD AMT TOTAL
0 Carla 19000 2700 21700.0
1 Joe 15000 1500 16500.0
2 Rob 2000 200 2200.0
'''
print(dfg)
print()
# resetting index to OWNER column
dfg = dfg.set_index(OWNER)
'''
CAPITAL YIELD AMT TOTAL
OWNER
Carla 19000 2700 21700.0
Joe 15000 1500 16500.0
Rob 2000 200 2200.0
'''
print(dfg)
print()
# finally, adding a TOTAL row
dfg.loc[TOTAL] = dfg.sum(numeric_only=True, axis=0)[[CAPITAL, YIELD, TOTAL]]
'''
CAPITAL YIELD AMT TOTAL
OWNER
Carla 19000.0 2700.0 21700.0
Joe 15000.0 1500.0 16500.0
Rob 2000.0 200.0 2200.0
TOTAL 36000.0 4400.0 40400.0
'''
print(dfg.fillna(''))
My question is: is there a more concise way of coding the total column or row computation using Pandas agg() or aggregate() and a lambda expression?
df[TOTAL] = df[CAPITAL] + df[YIELD]
output = df.groupby(by=[OWNER]).sum()
is what you are looking for; output is the dataframe you need.
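If you also want the global TOTAL row in a similarly compact form, a small sketch building on this answer (reusing the column constants defined in the question):
dfg = df.assign(**{TOTAL: df[CAPITAL] + df[YIELD]}).groupby(OWNER).sum()
dfg.loc[TOTAL] = dfg.sum()  # global total row
print(dfg)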

merging varying number of rows by multiple conditions in python

Problem: merging varying number of rows by multiple conditions
Here is a stylized example of what the dataset looks like:
"index" "connector" "type" "q_text" "a_text" "varx" ...
1 1111 1 aa NA xx
2 9999 2 NA tt NA
3 1111 2 NA uu NA
4 9999 1 bb NA yy
5 9999 1 cc NA zz
Goal: what the dataset should look like
"index" "connector" "type" "type.1" "q_text" "q_text.1" "a_text" "a_text.1 " "varx" "varx.1" ...
1 1111 1 2 aa NA NA uu xx NA
2 9999 1 2 bb NA NA tt yy NA
3 9999 1 2 cc NA NA tt zz NA
Logic: the column "type" has either the value 1 or 2; multiple rows can have value 1, but only one row (with the same value in "connector") has value 2.
If rows share the same value in "connector", the row with "type" = 2 has to be merged into the rows with "type" = 1. Because several rows of "type" = 1 can share the same "connector" value, the corresponding "type" = 2 row has to be duplicated and merged into every one of those "type" = 1 rows.
My results so far: not all rows are merged, because multiple rows of "type" = 1 are associated with a UNIQUE row of "type" = 2.
Most similar questions are answered using SQL, which I cannot use here.
df2 = df.copy()
df.index.astype(str)
df2.index.astype(str)
pd.merge(df,df2, how='left', on='connector',right_index=True, left_index=True)
df3 = pd.merge(df.set_index('connector'),df2.set_index('connector'), right_index=True, left_index=True).reset_index()
dfNew = df.merge(df2, how='left', left_on=['connector'], right_on = ['connector'])
Can I achieve my goal with groupby()?
Solution by #victor__von__doom
if __name__ == '__main__':
    df = df.groupby('connector', sort=True).apply(lambda c: list(zip(*c.values[:,2:].tolist()))).reset_index(name='merged')
    df[['here', 'are', 'all', 'columns', 'except', 'for', 'the', 'connector', 'column']] = pd.DataFrame(df.merged.tolist())
    df = df.drop(['merged'], axis=1)
First off, it is really messy to just keep concatenating new columns onto your original DataFrame when rows are merged, especially when the number of columns is very large. Furthermore, if you end up merging 3 rows for 1 connector value and 4 rows for another (for example), the only way to include all values is to make empty columns for some rows, which is never a good idea. Instead, I've made it so that the merged rows get combined into tuples, which can then be parsed efficiently while keeping the size of your DataFrame manageable:
import numpy as np
import pandas as pd
if __name__ == '__main__':
    data = np.array([[1, 2, 3, 4, 5], [1111, 9999, 1111, 9999, 9999],
                     [1, 2, 2, 1, 1], ['aa', 'NA', 'NA', 'bb', 'cc'],
                     ['NA', 'tt', 'uu', 'NA', 'NA'],
                     ['xx', 'NA', 'NA', 'yy', 'zz']])
    df = pd.DataFrame(data.T, columns=["index", "connector",
                                       "type", "q_text", "a_text", "varx"])
    df = df.groupby("connector", sort=True).apply(lambda c: list(zip(*c.values[:,2:].tolist()))).reset_index(name='merged')
    df[["type", "q_text", "a_text", "varx"]] = pd.DataFrame(df.merged.tolist())
    df = df.drop(['merged'], axis=1)
The final DataFrame looks like:
connector type q_text a_text varx ...
0 1111 (1, 2) (aa, NA) (NA, uu) (xx, NA) ...
1 9999 (2, 1, 1) (NA, bb, cc) (tt, NA, NA) (NA, yy, zz) ...
Which is much more compact and readable.
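If you later need the merged values back as individual rows, one option (assuming pandas >= 1.3, where explode accepts a list of columns) is:
expanded = df.explode(['type', 'q_text', 'a_text', 'varx'], ignore_index=True)
print(expanded)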

roll off profile stacking data frames

I have a dataframe that looks like:
import pandas as pd
import datetime as dt
df= pd.DataFrame({'date':['2017-12-31','2017-12-31'],'type':['Asset','Liab'],'Amount':[100,-100],'Maturity Date':['2019-01-02','2018-01-01']})
df
I am trying to build a roll-off profile by checking if the 'Maturity Date' is greater than a 'date' in the future. I am trying to achieve something like:
#First Month
df1=df[df['Maturity Date']>'2018-01-31']
df1['date']='2018-01-31'
#Second Month
df2=df[df['Maturity Date']>'2018-02-28']
df2['date']='2018-02-28'
#third Month
df3=df[df['Maturity Date']>'2018-03-31']
df3['date']='2018-03-31'
#first quarter
qf1=df[df['Maturity Date']>'2018-06-30']
qf1['date']='2018-06-30'
#concatenate
df=pd.concat([df,df1,df2,df3,qf1])
df
I was wondering if there is a way to allow an arbitrary number of dates without repeating code.
I think you need numpy.tile to repeat the indices and assign them to a new column, then filter by boolean indexing and sort with sort_values:
import numpy as np

d = '2017-12-31'
df['Maturity Date'] = pd.to_datetime(df['Maturity Date'])
#generate first month and next quarters
c1 = pd.date_range(d, periods=4, freq='M')
c2 = pd.date_range(c1[-1], periods=2, freq='Q')
#join together
c = c1.union(c2[1:])
#repeat rows be indexing repeated index
df1 = df.loc[np.tile(df.index, len(c))].copy()
#assign column by datetimes
df1['date'] = np.repeat(c, len(df))
#filter by boolean indexing
df1 = df1[df1['Maturity Date'] > df1['date']]
print (df1)
Amount Maturity Date date type
0 100 2019-01-02 2017-12-31 Asset
1 -100 2018-01-01 2017-12-31 Liab
0 100 2019-01-02 2018-01-31 Asset
0 100 2019-01-02 2018-02-28 Asset
0 100 2019-01-02 2018-03-31 Asset
0 100 2019-01-02 2018-06-30 Asset
You could use a nifty tool in the Pandas arsenal called pd.merge_asof. It works similarly to pd.merge, except that it matches on "nearest" keys rather than equal keys. Furthermore, you can tell pd.merge_asof to look for nearest keys in only the backward or forward direction.
To make things interesting (and help check that things are working properly), let's add another row to df:
df = pd.DataFrame({'date':['2017-12-31', '2017-12-31'],'type':['Asset', 'Asset'],'Amount':[100,200],'Maturity Date':['2019-01-02', '2018-03-15']})
for col in ['date', 'Maturity Date']:
    df[col] = pd.to_datetime(df[col])
df = df.sort_values(by='Maturity Date')
print(df)
# Amount Maturity Date date type
# 1 200 2018-03-15 2017-12-31 Asset
# 0 100 2019-01-02 2017-12-31 Asset
Now define some new dates:
dates = (pd.date_range('2018-01-31', periods=3, freq='M')
.union(pd.date_range('2018-01-1', periods=2, freq='Q')))
result = pd.DataFrame({'date': dates})
# date
# 0 2018-01-31
# 1 2018-02-28
# 2 2018-03-31
# 3 2018-06-30
Now we can merge rows, matching nearest dates from result with Maturity Dates from df:
result = pd.merge_asof(result, df.drop('date', axis=1),
left_on='date', right_on='Maturity Date', direction='forward')
In this case we want to "match" dates with Maturity Dates which are greater
so we use direction='forward'.
Putting it all together:
import pandas as pd
df = pd.DataFrame({'date':['2017-12-31', '2017-12-31'],'type':['Asset', 'Asset'],'Amount':[100,200],'Maturity Date':['2019-01-02', '2018-03-15']})
for col in ['date', 'Maturity Date']:
    df[col] = pd.to_datetime(df[col])
df = df.sort_values(by='Maturity Date')
dates = (pd.date_range('2018-01-31', periods=3, freq='M')
.union(pd.date_range('2018-01-1', periods=2, freq='Q')))
result = pd.DataFrame({'date': dates})
result = pd.merge_asof(result, df.drop('date', axis=1),
left_on='date', right_on='Maturity Date', direction='forward')
result = pd.concat([df, result], axis=0)
result = result.sort_values(by=['Maturity Date', 'date'])
print(result)
yields
Amount Maturity Date date type
1 200 2018-03-15 2017-12-31 Asset
0 200 2018-03-15 2018-01-31 Asset
1 200 2018-03-15 2018-02-28 Asset
0 100 2019-01-02 2017-12-31 Asset
2 100 2019-01-02 2018-03-31 Asset
3 100 2019-01-02 2018-06-30 Asset
