python 3.6 pandas conditionally filling missing values

If there is a dataframe:
import pandas as pd
import numpy as np

users = pd.DataFrame(
    [
        {'id': 1, 'date': '01/01/2019', 'transaction_total': -1, 'balance_total': 102},
        {'id': 1, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 1, 'date': '01/03/2019', 'transaction_total': np.nan, 'balance_total': np.nan},
        {'id': 1, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': np.nan},
        {'id': 1, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': np.nan},
        {'id': 2, 'date': '01/01/2019', 'transaction_total': -2, 'balance_total': 200},
        {'id': 2, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 2, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': np.nan},
        {'id': 2, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': np.nan}
    ]
)
print(users[['id', 'date', 'balance_total', 'transaction_total']])
Dataframe:
id date balance_total transaction_total
0 1 01/01/2019 102.0 -1.0
1 1 01/02/2019 100.0 -2.0
2 1 01/03/2019 NaN NaN
3 1 01/04/2019 NaN NaN
4 1 01/05/2019 NaN -4.0
5 2 01/01/2019 200.0 -2.0
6 2 01/02/2019 100.0 -2.0
7 2 01/04/2019 NaN NaN
8 2 01/05/2019 NaN -4.0
How can I do the following?
If both transaction_total and balance_total are NaN, just fill in the previous date's balance_total (e.g. in row 3 where id=1, since user 1's transaction_total and balance_total are both NaN, fill in 100 from 01/02/2019; the same goes for row 4, which fills in 100 from 01/03/2019).
If transaction_total is NOT NaN but balance_total is NaN, compute the previous date's balance_total plus the current row's transaction_total.
Taking user 1 on 01/05/2019 as an example: the balance total will be 100 + (-4), where 100 is 01/04/2019's balance total and (-4) is 01/05/2019's transaction total.
Desired output:
id date balance_total transaction_total
0 1 01/01/2019 102.0 -1.0
1 1 01/02/2019 100.0 -2.0
2 1 01/03/2019 100.0 NaN
3 1 01/04/2019 100.0 NaN
4 1 01/05/2019 96.0 -4.0
5 2 01/01/2019 200.0 -2.0
6 2 01/02/2019 100.0 -2.0
7 2 01/04/2019 100.0 NaN
8 2 01/05/2019 96.0 -4.0
Here is my code, but it doesn't work. I think I couldn't figure out how to express "when a row is null, do something" logic in pandas.
for i, row in df.iterrows():
    if pd.isnull(row['transaction_total']):
        if pd.isnull(row['balance_total']):
            df.loc[i, 'transaction_total'] = df.loc[i-1, 'transaction_total']
Could someone enlighten me?

IIUC, first create a dummy series with ffill, and then use np.where:
s = df["balance_total"].ffill()
df["balance_total"] = np.where(df["balance_total"].isnull() & df["transaction_total"].notnull(),
                               s.add(df["transaction_total"]), s)
print(df)
id date transaction_total balance_total
0 1 01/01/2019 -1.0 102.0
1 1 01/02/2019 -2.0 100.0
2 1 01/03/2019 NaN 100.0
3 1 01/04/2019 NaN 100.0
4 1 01/05/2019 -4.0 96.0
5 2 01/01/2019 -2.0 200.0
6 2 01/02/2019 -2.0 100.0
7 2 01/04/2019 NaN 100.0
8 2 01/05/2019 -4.0 96.0
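One caveat worth noting: the plain ffill above runs over the whole column, so if a user's very first balance_total were NaN it would inherit the previous user's balance. A sketch of a grouped variant (small made-up frame, not the answer's original code) that forward-fills within each id:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'transaction_total': [-1, np.nan, -4, -2, np.nan],
    'balance_total': [102, np.nan, np.nan, 200, np.nan],
})

# Forward-fill the balance within each id so one user's balance
# cannot leak into the next user's rows.
s = df.groupby('id')['balance_total'].ffill()
df['balance_total'] = np.where(
    df['balance_total'].isna() & df['transaction_total'].notna(),
    s + df['transaction_total'],  # carried-forward balance + this transaction
    s)                            # otherwise just carry the balance forward
print(df['balance_total'].tolist())  # [102.0, 102.0, 98.0, 200.0, 200.0]
```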

Related

Python Pandas DataFrame adding a fixed value from list to cloumn and Generating new Column output for each of this list values

This may differ a bit in requirements.
I have a data frame:
A B  C
1 2  4
2 4  6
8 10 12
1 3  5
and a dynamic list (the length may vary):
dynamic_vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
I wish to add the C column's value to each value in the list and generate a new dataframe column for each sum. How do I do this?
A B  C  C_1 C_2 .......................... C_11
1 2  4   5   6                              15
2 4  6   7   8                              17
8 10 12 13  14                              23
1 3  5   6   7                              16
Thank you for your support
You can use a dict comprehension to create a simple dataframe.
dynamic_vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
df2 = pd.concat(
    [df, pd.DataFrame({f'C_{val}': [0] for val in dynamic_vals})],
    axis=1).fillna(0)
print(df2)
A B C C_1 C_2 C_3 C_4 C_5 C_6 C_7 C_8 C_9 C_10 C_11
0 1 2 4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2 4 6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 8 10 12 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 1 3 5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
or you could use assign again, following the suggestion of @piRSquared:
df2 = df.assign(**dict((f'C_{i}', np.nan) for i in dynamic_vals))
A B C C_1 C_2 C_3 C_4 C_5 C_6 C_7 C_8 C_9 C_10 C_11
0 1 2 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 8 10 12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1 3 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
or a better and simpler solution suggested by @piRSquared:
df.join(pd.DataFrame(np.nan, df.index, dynamic_vals).add_prefix('C_'))
A B C C_1 C_2 C_3 C_4 C_5 C_6 C_7 C_8 C_9 C_10 C_11
0 1 2 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 8 10 12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1 3 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Edit:
Using df.join with a dictionary comprehension.
df.join(pd.DataFrame({f'C_{val}' : df['C'].values + val for val in dynamic_vals }))
A B C C_1 C_2 C_3 C_4 C_5 C_6 C_7 C_8 C_9 C_10 C_11
0 1 2 4 5 6 7 8 9 10 11 12 13 14 15
1 2 4 6 7 8 9 10 11 12 13 14 15 16 17
2 8 10 12 13 14 15 16 17 18 19 20 21 22 23
3 1 3 5 6 7 8 9 10 11 12 13 14 15 16
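The same idea also fits into assign directly, since the dict comprehension can compute the new columns rather than just name them. A minimal sketch on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 8, 1], 'B': [2, 4, 10, 3], 'C': [4, 6, 12, 5]})
dynamic_vals = [1, 2, 3]

# assign unpacks the dict as keyword arguments; each new column is C + val
df2 = df.assign(**{f'C_{v}': df['C'] + v for v in dynamic_vals})
print(df2)
```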

Transpose DF columns based on column values - Pandas

My df looks like this,
param per per_date per_num
0 XYZ 1.0 2018-10-01 11.0
1 XYZ 2.0 2017-08-01 15.25
2 XYZ 1.0 2019-10-01 11.25
3 XYZ 2.0 2019-08-01 15.71
4 XYZ 3.0 2020-10-01 11.50
5 XYZ NaN NaN NaN
6 MMG 1.0 2021-10-01 11.75
7 MMG 2.0 2014-01-01 14.00
8 MMG 3.0 2021-10-01 12.50
9 MMG 1.0 2014-01-01 15.00
10 LKG NaN NaN NaN
11 LKG NaN NaN NaN
I need my output like this,
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 per_date_3 per_num_3
0 XYZ 1 2018-10-01 11.0 2 2017-08-01 15.25 NaN NaN NaN
1 XYZ 1 2019-10-01 11.25 2 2019-08-01 15.71 3 2020-10-01 11.50
2 XYZ NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 MMG 1 2021-10-01 11.75 2 2014-01-01 14.00 3 2021-10-01 12.50
5 MMG 1 2014-01-01 15.00 NaN NaN NaN NaN NaN NaN
6 LKG NaN NaN NaN NaN NaN NaN NaN NaN NaN
If you see, the param column has repeating values, and the transposed column names are created from those values. Also, a new record is created whenever the per value starts again at 1. How can I achieve this?
The main problem here is the NaNs in the last LKG group: first replace the missing values with a per-group counter (a cumulative sum of the isna mask, grouped by param) and assign it to a new column per1:
s = df['per'].isna().groupby(df['param']).cumsum()
df = df.assign(per1=df['per'].fillna(s).astype(int))
print(df)
param per per_date per_num per1
0 XYZ 1.0 2018-10-01 11.00 1
1 XYZ 2.0 2017-08-01 15.25 2
2 XYZ 1.0 2019-10-01 11.25 1
3 XYZ 2.0 2019-08-01 15.71 2
4 XYZ 3.0 2020-10-01 11.50 3
5 XYZ NaN NaN NaN 1
6 MMG 1.0 2021-10-01 11.75 1
7 MMG 2.0 2014-01-01 14.00 2
8 MMG 3.0 2021-10-01 12.50 3
9 MMG 1.0 2014-01-01 15.00 1
10 LKG NaN NaN NaN 1
11 LKG NaN NaN NaN 2
Then create a MultiIndex whose extra level splits the rows into blocks (compare per1 to 1 and take the cumulative sum), and reshape with unstack:
g = df['per1'].eq(1).cumsum()
df = df.set_index(['param', 'per1', g]).unstack(1).sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(level=1, drop=True).reset_index()
print(df)
param per_1 per_date_1 per_num_1 per_2 per_date_2 per_num_2 per_3 \
0 LKG NaN NaN NaN NaN NaN NaN NaN
1 MMG 1.0 2021-10-01 11.75 2.0 2014-01-01 14.00 3.0
2 MMG 1.0 2014-01-01 15.00 NaN NaN NaN NaN
3 XYZ 1.0 2018-10-01 11.00 2.0 2017-08-01 15.25 NaN
4 XYZ 1.0 2019-10-01 11.25 2.0 2019-08-01 15.71 3.0
5 XYZ NaN NaN NaN NaN NaN NaN NaN
per_date_3 per_num_3
0 NaN NaN
1 2021-10-01 12.5
2 NaN NaN
3 NaN NaN
4 2020-10-01 11.5
5 NaN NaN
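The block-splitting step above relies on a small idiom worth isolating: every time the counter restarts at 1, eq(1).cumsum() opens a new block. A minimal sketch on a toy counter:

```python
import pandas as pd

per1 = pd.Series([1, 2, 1, 2, 3, 1])
# True where a new block starts; the cumulative sum labels the blocks 1, 2, 3, ...
blocks = per1.eq(1).cumsum()
print(blocks.tolist())  # [1, 1, 2, 2, 2, 3]
```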

Combine text from multiple rows in pandas

I want to merge rows' content only where some specific conditions are met.
Here is the test dataframe I am working on
Date Desc Debit Credit Bal
0 04-08-2019 abcdef 45654 NaN 345.0
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.06
4 06-08-2019 xyz NaN 350.0 695.06
in which I want to join the rows whose Date is NaN into the previous row.
Output required:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654 NaN 345.06
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.0
4 06-08-2019 xyz NaN 350.0 695.06
Can anybody help me out with this? I have tried the following:
for j in [x for x in range(lst[0], lst[-1]+1) if x not in lst]:
    print(test.loc[j-1:j, ].apply(lambda x: ''.join(str(x)), axis=1))
But could not get the expected result.
You can use
d = df["Date"].ffill()
df.update(df.groupby(d).transform('sum'))
print(df)
output
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 0.0 351.0
1 NaN abcdefjklmn 45654.0 0.0 351.0
2 05-08-2019 abd 45.0 0.0 345.0
3 06-08-2019 xyz 0.0 345.0 54645.0
idx = test.loc[test["Date"].isna()].index
test.loc[idx-1, "Desc"] = test.loc[idx-1]["Desc"].str.cat(test.loc[idx]["Desc"])
test.loc[idx-1, "Bal"] = (test.loc[idx-1]["Bal"].astype(str)
                          .str.cat(test.loc[idx]["Bal"].astype(str)))
## I tried to add the two values, but it didn't work as expected, giving 351.0
# test.loc[idx-1, "Bal"] = test.loc[idx-1]["Bal"].values + test.loc[idx]["Bal"].values
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 NaN 345.06.0
1 NaN jklmn NaN NaN 6
2 05-08-2019 abd 45.0 NaN 345
3 06-08-2019 xyz NaN 345.0 54645
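An alternative sketch (assumed, not from the original answers, with a reduced column set) that concatenates Desc while summing the numeric columns in one pass, grouping on the forward-filled Date:

```python
import pandas as pd
import numpy as np

test = pd.DataFrame({
    'Date': ['04-08-2019', np.nan, '05-08-2019'],
    'Desc': ['abcdef', 'jklmn', 'abd'],
    'Debit': [45654.0, np.nan, 23.0],
    'Bal': [345.0, 6.0, 345.06],
})

# Rows with a NaN Date belong to the previous row's group.
g = test['Date'].ffill()
out = test.groupby(g, sort=False).agg(
    {'Date': 'first', 'Desc': ''.join, 'Debit': 'sum', 'Bal': 'sum'}
).reset_index(drop=True)
print(out)
```

Here ''.join concatenates the Desc strings of each group, while 'sum' skips NaN in the numeric columns.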

diagonally subtract different columns in python

I want to create another column in the dataframe that holds a difference value. The difference is calculated by subtracting across rows of different columns, within each unique date value.
I tried looking through various stackoverflow links but didn't find the answer.
The difference should be the value obtained by subtracting the ATD of the 1st row from the ATA of the 2nd row, and so on, within each unique date. For example, the ATA of 1st January cannot be subtracted from the ATD of 2nd January.
For example:
The difference column's first value should be NaN.
The second value should be 50 mins (17:13:00 - 16:23:00).
But the ATD of 02-01-2019 should not be subtracted from the ATA of 01-01-2019.
You want to apply a shift grouped by Date and then subtract the shifted values from ATD:
>>> df = pd.DataFrame({'ATA':range(0,365),'ATD':range(10,375),'Date':pd.date_range(start="2018-01-01",end="2018-12-31")})
>>> df['ATD'] = df['ATD']/6.0
>>> df = pd.concat([df,df,df,df])
>>> df['shifted_ATA'] = df.groupby('Date')['ATA'].transform('shift')
>>> df['result'] = df['ATD'] - df['shifted_ATA']
>>> df = df.sort_values(by='Date', ascending=[1])
>>> df.head(20)
ATA ATD Date shifted_ATA result
0 0 1.666667 2018-01-01 NaN NaN
0 0 1.666667 2018-01-01 0.0 1.666667
0 0 1.666667 2018-01-01 0.0 1.666667
0 0 1.666667 2018-01-01 0.0 1.666667
1 1 1.833333 2018-01-02 NaN NaN
1 1 1.833333 2018-01-02 1.0 0.833333
1 1 1.833333 2018-01-02 1.0 0.833333
1 1 1.833333 2018-01-02 1.0 0.833333
2 2 2.000000 2018-01-03 2.0 0.000000
2 2 2.000000 2018-01-03 NaN NaN
2 2 2.000000 2018-01-03 2.0 0.000000
2 2 2.000000 2018-01-03 2.0 0.000000
3 3 2.166667 2018-01-04 3.0 -0.833333
3 3 2.166667 2018-01-04 3.0 -0.833333
3 3 2.166667 2018-01-04 NaN NaN
3 3 2.166667 2018-01-04 3.0 -0.833333
4 4 2.333333 2018-01-05 4.0 -1.666667
4 4 2.333333 2018-01-05 4.0 -1.666667
4 4 2.333333 2018-01-05 4.0 -1.666667
4 4 2.333333 2018-01-05 NaN NaN
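The same grouped-shift idea in a minimal, self-contained form (toy values, not the question's timetable):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['01-01', '01-01', '01-02', '01-02'],
    'ATA': [10, 20, 30, 40],
    'ATD': [15, 25, 35, 45],
})

# shift within each Date group: the first row per date has no previous ATA,
# so its difference stays NaN; rows never see values from another date
df['diff'] = df['ATD'] - df.groupby('Date')['ATA'].shift()
print(df['diff'].tolist())  # [nan, 15.0, nan, 15.0]
```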

Calculating rolling sum in a pandas dataframe on the basis of 2 variable constraints

I want to create a variable, SumOfPrevious5OccurencesAtIDLevel, which is the sum of the previous 5 values (ordered by the Date variable) of Var1 at an ID level (column 1); otherwise it takes a value of NA.
Sample Data and Output:
ID Date Var1 SumOfPrevious5OccurencesAtIDLevel
1 1/1/2018 0 NA
1 1/2/2018 1 NA
1 1/3/2018 2 NA
1 1/4/2018 3 NA
2 1/1/2018 4 NA
2 1/2/2018 5 NA
2 1/3/2018 6 NA
2 1/4/2018 7 NA
2 1/5/2018 8 NA
2 1/6/2018 9 30
2 1/7/2018 10 35
2 1/8/2018 11 40
Use groupby with transform and the rolling and shift functions:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
# sort if the data is not already ordered by ID and datetime
df = df.sort_values(['ID', 'Date'])
df['new'] = df.groupby('ID')['Var1'].transform(lambda x: x.rolling(5).sum().shift())
print(df)
ID Date Var1 SumOfPrevious5OccurencesAtIDLevel new
0 1 2018-01-01 0 NaN NaN
1 1 2018-01-02 1 NaN NaN
2 1 2018-01-03 2 NaN NaN
3 1 2018-01-04 3 NaN NaN
4 2 2018-01-01 4 NaN NaN
5 2 2018-01-02 5 NaN NaN
6 2 2018-01-03 6 NaN NaN
7 2 2018-01-04 7 NaN NaN
8 2 2018-01-05 8 NaN NaN
9 2 2018-01-06 9 30.0 30.0
10 2 2018-01-07 10 35.0 35.0
11 2 2018-01-08 11 40.0 40.0
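The rolling(5).sum().shift() combination on its own shows why the first five rows per group stay NaN (toy series, ungrouped for brevity):

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5, 6])
# rolling(5).sum() needs a full window of 5 values; shift() then moves each
# window total down one row, so every row sees only *previous* values
prev5 = s.rolling(5).sum().shift()
print(prev5.tolist())  # [nan, nan, nan, nan, nan, 10.0, 15.0]
```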
