Pandas: filtering recurring payments - python-3.x

I've been stuck on this problem for a while. I want to filter out regular monthly payments from a table where the Beneficient and the Payer are the same and the Amount is equal. Essentially, I am filtering out salaries.
Date Beneficient Payer Amount
2014-09-10 X A 3000
2014-09-15 X A 4000
2014-10-10 X A 3000
2014-10-11 X A 5500
2014-11-10 X A 3000
2014-09-11 Y B 7000
2014-09-14 Y B 8500
2014-10-11 Y B 7000
2014-10-16 Y B 8900
2014-11-11 Y B 7000
2014-11-17 Y B 8200
The desired result:
Date Beneficient Payer Amount
2014-09-10 X A 3000
2014-10-10 X A 3000
2014-11-10 X A 3000
2014-09-11 Y B 7000
2014-10-11 Y B 7000
2014-11-11 Y B 7000

Use duplicated, specifying the columns to check for duplicates and keep=False so that all duplicated rows are marked True in the boolean mask, then filter by boolean indexing:
df = df[df.duplicated(subset=['Beneficient','Payer','Amount'], keep=False)]
print (df)
Date Beneficient Payer Amount
0 2014-09-10 X A 3000
2 2014-10-10 X A 3000
4 2014-11-10 X A 3000
5 2014-09-11 Y B 7000
7 2014-10-11 Y B 7000
9 2014-11-11 Y B 7000
Detail:
print (df.duplicated(subset=['Beneficient','Payer','Amount'], keep=False))
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 True
8 False
9 True
10 False
dtype: bool
More general solution:
The idea is to take the differences between consecutive datetimes per group, replace the leading NaNs with 30, and compare.
There is a slight problem here: months have different numbers of days, with February being the worst case, so the difference can be less than 30 or 31.
So in my opinion a general solution that tolerates a +-1 day offset is not so easy.
df = df[df.duplicated(subset=['Beneficient','Payer','Amount'], keep=False)]
df = df.sort_values(['Beneficient','Payer','Amount','Date'])
cols = [df['Beneficient'], df['Payer'], df['Amount']]
df = df[df['Date'].groupby(cols).diff().dt.days.fillna(30).isin([30,31])]
print (df)
Date Beneficient Payer Amount
0 2014-09-10 X A 3000
2 2014-10-10 X A 3000
4 2014-11-10 X A 3000
5 2014-09-11 Y B 7000
7 2014-10-11 Y B 7000
9 2014-11-11 Y B 7000
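If the goal is "one payment per calendar month" regardless of how many days the months have, a way to sidestep the February problem entirely is to compare month numbers instead of day differences. This is my own sketch, not part of the answer above; it encodes each date as a running month count so that consecutive months always differ by exactly 1:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2014-09-10', '2014-09-15', '2014-10-10', '2014-10-11', '2014-11-10',
             '2014-09-11', '2014-09-14', '2014-10-11', '2014-10-16', '2014-11-11',
             '2014-11-17'],
    'Beneficient': ['X']*5 + ['Y']*6,
    'Payer': ['A']*5 + ['B']*6,
    'Amount': [3000, 4000, 3000, 5500, 3000, 7000, 8500, 7000, 8900, 7000, 8200],
})
df['Date'] = pd.to_datetime(df['Date'])

df = df[df.duplicated(subset=['Beneficient', 'Payer', 'Amount'], keep=False)]
df = df.sort_values(['Beneficient', 'Payer', 'Amount', 'Date'])

# encode each date as a running month number, so consecutive months
# always differ by exactly 1, even across February
month = df['Date'].dt.year * 12 + df['Date'].dt.month
cols = [df['Beneficient'], df['Payer'], df['Amount']]
df = df[month.groupby(cols).diff().fillna(1).eq(1)]
print(df)
```

Note this keeps payments that recur in consecutive months even if the day of the month drifts, which may or may not be what you want.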

To pull those rows into their own dataframe while leaving the original records intact, use duplicated():
sub_df = df[df.duplicated(subset=['Beneficient','Payer','Amount'], keep=False)]

Divide values of rows based on condition which are of running count

A sample of the table for one id; multiple ids exist in the original df.
id    legend    date          running_count
101   X         24-07-2021    3
101   Y         24-07-2021    5
101   X         25-07-2021    4
101   Y         25-07-2021    6
I want to create a new column by dividing the running_count values on the basis of id, legend and date: (X/Y) for the date 24-07-2021 for a particular id, and so on.
How should I perform this calculation?
If the order X, Y is the same for each id, it is possible to use:
df['new'] = df['running_count'].div(df.groupby(['id','date'])['running_count'].shift(-1))
print (df)
id legend date running_count new
0 101 X 24-07-2021 3 0.600000
1 101 Y 24-07-2021 5 NaN
2 101 X 25-07-2021 4 0.666667
3 101 Y 25-07-2021 6 NaN
If changing the output shape is acceptable:
df1 = df.pivot(index=['id','date'], columns='legend', values='running_count')
df1['new'] = df1['X'].div(df1['Y'])
df1 = df1.reset_index()
print (df1)
legend id date X Y new
0 101 24-07-2021 3 5 0.600000
1 101 25-07-2021 4 6 0.666667
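For completeness, the ratio from the pivoted frame can also be merged back onto the original long-shaped frame. This sketch is my own extension of the answer above; unlike the shift approach, it keeps every row and does not depend on the X/Y row order:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [101, 101, 101, 101],
    'legend': ['X', 'Y', 'X', 'Y'],
    'date': ['24-07-2021', '24-07-2021', '25-07-2021', '25-07-2021'],
    'running_count': [3, 5, 4, 6],
})

# put X and Y side by side per (id, date), divide, then merge back
wide = df.set_index(['id', 'date', 'legend'])['running_count'].unstack()
ratio = wide['X'].div(wide['Y']).rename('new').reset_index()
df = df.merge(ratio, on=['id', 'date'])
print(df)
```

Every row of a given (id, date) pair gets the same X/Y ratio, which may be more convenient than the NaN-on-Y-rows output of the shift approach.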

How can I find the highest and the lowest value between rows, depending on a condition being met in another column, in a pd.DataFrame?

I have the following DataFrame, which is generated every time I run the script; it looks like this:
df=
index time value status
0 2020-11-20 20:10:00 10 X
1 2020-11-20 20:20:00 11 X
2 2020-11-20 20:45:00 9 X
3 2020-11-20 20:45:00 5 Y
4 2020-11-20 21:00:00 4 X
5 2020-11-20 21:05:00 2 Y
6 2020-11-20 21:15:00 4 Y
7 2020-11-20 21:20:00 9 X
8 2020-11-20 21:25:00 5 X
The desired output would be :
index time value status
0 2020-11-20 20:20:00 11 X
1 2020-11-20 20:45:00 5 Y
2 2020-11-20 21:00:00 4 X
3 2020-11-20 21:05:00 2 Y
4 2020-11-20 21:20:00 9 X
So my goal here would be to create a new pd.DataFrame with the lowest values of Y and the highest values of X.
Thanks to everyone in advance for all the assistance and support.
You can do a groupby on consecutive values of your DataFrame where the status is the same, sort each grouped DataFrame by value, and keep either the first or last value of the sorted DataFrame depending on whether the grouped DataFrame has status equal to X or Y.
Note: I noticed the time column of your DataFrame has no impact on the answer, so I didn't include it when I recreated your DataFrame.
import pandas as pd

## the time column doesn't matter in your problem
df = pd.DataFrame({
    'value': [10, 11, 9, 5, 4, 2, 4, 9, 5],
    'status': ['X']*3 + ['Y'] + ['X'] + ['Y']*2 + ['X']*2
})
df_new = pd.DataFrame(columns=df.columns)
## perform a groupby on consecutive values
for _, g in df.groupby((df.status != df.status.shift()).cumsum()):
    g = g.sort_values(by='value')
    ## keep the highest value for X
    if g.status.values[0] == 'X':
        g = g.drop_duplicates(subset=['status'], keep='last')
    ## keep the lowest value for Y
    elif g.status.values[0] == 'Y':
        g = g.drop_duplicates(subset=['status'], keep='first')
    df_new = pd.concat([df_new, g])
df_new = df_new.reset_index(drop=True)
Output:
>>> df_new
value status
0 11 X
1 5 Y
2 4 X
3 2 Y
4 9 X
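The same consecutive-run grouping can be written more compactly with nlargest/nsmallest inside apply; this is an alternative sketch I'm adding on top of the answer above, using the same run-labelling trick:

```python
import pandas as pd

df = pd.DataFrame({
    'value': [10, 11, 9, 5, 4, 2, 4, 9, 5],
    'status': ['X']*3 + ['Y'] + ['X'] + ['Y']*2 + ['X']*2
})

# label each run of consecutive identical statuses
grp = (df['status'] != df['status'].shift()).cumsum()

# keep the max of each X run and the min of each Y run
out = (df.groupby(grp, group_keys=False)
         .apply(lambda g: g.nlargest(1, 'value') if g['status'].iat[0] == 'X'
                else g.nsmallest(1, 'value'))
         .reset_index(drop=True))
print(out)
```

This avoids the repeated pd.concat in the loop, which gets slow on large frames.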

Unable to understand the DataFrame method "loc" logic when using incorrect label names

I am using the loc method to extract columns by their labels. While experimenting with incorrect label names I got the output below. Please help me understand the logic behind the loc method in terms of label use.
import pandas as pd
Dic={'empno':(101,102,103,104),'name':('a','b','c','d'),'salary':(3000,5000,8000,9000)}
df=pd.DataFrame(Dic)
print(df)
print()
print(df.loc[0:2,'empsfgsdzfsdfsdaf':'salary'])
print(df.loc[0:2,'empno':'salarysadfsa'])
print(df.loc[0:2,'name':'asdfsdafsdaf'])
print(df.loc[0:2,'sadfsadfsadf':'sasdfsdflasdfsdfsdry'])
print(df.loc[0:2,'':'nasdfsd'])
OUTPUT:
empno name salary
0 101 a 3000
1 102 b 5000
2 103 c 8000
3 104 d 9000
name salary
0 a 3000
1 b 5000
2 c 8000
empno name salary
0 101 a 3000
1 102 b 5000
2 103 c 8000
Empty DataFrame
Columns: []
Index: [0, 1, 2]
salary
0 3000
1 5000
2 8000
empno name
0 101 a
1 102 b
2 103 c
.loc[A : B, C : D] will select:
index (row) labels from (and including) A to (and including) B; and
column labels from (and including) C to (and including) D.
Let's look at the column label slice 'a' : 'salary'. Since a is before the first column label, we get empno, name, salary.
print(df.loc[0:2, 'a':'salary'])
empno name salary
0 101 a 3000
1 102 b 5000
2 103 c 8000
It works the same way at the upper end of the slice:
print(df.loc[0:2, 'name':'z'])
name salary
0 a 3000
1 b 5000
2 c 8000
Here is a list comprehension that shows how the second slice works:
# code
[col for col in df.columns if 'name' <= col <= 'z']
# result
['name', 'salary']
There is a good description of the most commonly used subsetting methods here:
https://www.kdnuggets.com/2019/06/select-rows-columns-pandas.html
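Under the hood, this only works because the column Index is lexicographically sorted: pandas can then place a missing label by binary search. On an unsorted index, slicing with a missing label raises a KeyError instead. A small sketch of both behaviours (my own illustration, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'empno': [101], 'name': ['a'], 'salary': [3000]})

# sorted columns: missing labels are positioned by binary search
print(df.columns.is_monotonic_increasing)     # True
print(df.columns.slice_locs('a', 'salary'))   # positional bounds of the slice

# shuffle the columns so they are no longer sorted
df2 = df[['salary', 'empno', 'name']]
try:
    df2.loc[:, 'a':'z']
except KeyError as err:
    print('KeyError:', err)
```

So the "strange" outputs in the question are consistent: each missing label is silently mapped to the position it would occupy in the sorted column order.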

pandas create a flag when merging two dataframes

I have two DataFrames, df_a and df_b:
# df_a
number cur
1000 USD
2000 USD
3000 USD
# df_b
number amount deletion
1000 0.0 L
1000 10.0 X
1000 10.0 X
2000 20.0 X
2000 20.0 X
3000 0.0 L
3000 0.0 L
I want to left merge df_a with df_b,
df_a = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df_a.fillna(value={'amount':0}, inplace=True)
but also create a flag called deleted in the result df_a, with three possible values - full, partial and none:
full - if all rows associated with a particular number value have deletion = L;
partial - if some rows associated with a particular number value have deletion = L;
none - if no rows associated with a particular number value have deletion = L.
Also when doing the merge, rows from df_b with deletion = L should not be considered; so the result looks like,
number amount deletion deleted cur
1000 10.0 X partial USD
1000 10.0 X partial USD
2000 20.0 X none USD
2000 20.0 X none USD
3000 0.0 NaN full USD
I am wondering how to achieve that.
The idea is to compare the deletion column, aggregate with all and any, create a helper dictionary, and finally map it to the new column:
g = df_b['deletion'].eq('L').groupby(df_b['number'])
m1 = g.any()
m2 = g.all()
d1 = dict.fromkeys(m1.index[m1 & ~m2], 'partial')
d2 = dict.fromkeys(m2.index[m2], 'full')
# join dictionaries together
d = {**d1, **d2}
print (d)
{1000: 'partial', 3000: 'full'}
df = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df['deleted'] = df['number'].map(d).fillna('none')
print (df)
number cur amount deletion deleted
0 1000 USD 10.0 X partial
1 1000 USD 10.0 X partial
2 2000 USD 20.0 X none
3 2000 USD 20.0 X none
4 3000 USD NaN NaN full
To make the none value explicit, create a dictionary entry for it as well:
d1 = dict.fromkeys(m1.index[m1 & ~m2], 'partial')
d2 = dict.fromkeys(m2.index[m2], 'full')
d3 = dict.fromkeys(m2.index[~m1], 'none')
d = {**d1, **d2, **d3}
print (d)
{1000: 'partial', 3000: 'full', 2000: 'none'}
df = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df['deleted'] = df['number'].map(d)
print (df)
number cur amount deletion deleted
0 1000 USD 10.0 X partial
1 1000 USD 10.0 X partial
2 2000 USD 20.0 X none
3 2000 USD 20.0 X none
4 3000 USD NaN NaN full
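The flag can also be built without the dictionary step: the mean of the boolean deletion == 'L' per number is 1.0 for full, 0.0 for none, and strictly in between otherwise. A sketch of that variant (my own, using np.select rather than the answer's dictionaries):

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'number': [1000, 2000, 3000], 'cur': ['USD']*3})
df_b = pd.DataFrame({
    'number': [1000, 1000, 1000, 2000, 2000, 3000, 3000],
    'amount': [0.0, 10.0, 10.0, 20.0, 20.0, 0.0, 0.0],
    'deletion': ['L', 'X', 'X', 'X', 'X', 'L', 'L'],
})

# share of 'L' rows per number: 1.0 -> full, 0.0 -> none, else partial
share = df_b['deletion'].eq('L').groupby(df_b['number']).mean()
flag = pd.Series(np.select([share.eq(1), share.eq(0)], ['full', 'none'], 'partial'),
                 index=share.index)

df = df_a.merge(df_b[df_b['deletion'] != 'L'], how='left', on='number')
df['deleted'] = df['number'].map(flag)
print(df)
```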

Pandas - create column with aggregate results

I have a dataset which has a row for each loan, and a borrower can have multiple loans. The 'Property' flag shows if there is any security behind the loan. I am trying to aggregate this flag on a borrower level, so for each borrower, if one of the Property flags is 'Y', I want to add an additional column where it is 'Y' for each of the borrowers.
The short example below shows what the end result should look like. Any help would be appreciated.
import pandas as pd
data = {'Borrower': [1,2,2,2,3,3,4,5,6,6],
'Loan' : [1,2,3,4,5,6,7,8,9,10],
'Property': ["Y","N","Y","Y","N","Y","N","Y","N","N"],
'Result': ['Y','Y','Y','Y','Y','Y','N','Y','N','N']}
df = pd.DataFrame.from_dict(data)
You can use transform on Property after grouping by Borrower. Because 'Y' sorts after 'N' in string ordering, if any Property value is 'Y' for a borrower, max(Property) will give 'Y'.
df['Result2'] = df.groupby('Borrower')['Property'].transform('max')
df
Out[202]:
Borrower Loan Property Result Result2
0 1 1 Y Y Y
1 2 2 N Y Y
2 2 3 Y Y Y
3 2 4 Y Y Y
4 3 5 N Y Y
5 3 6 Y Y Y
6 4 7 N N N
7 5 8 Y Y Y
8 6 9 N N N
9 6 10 N N N
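If relying on 'Y' sorting after 'N' feels fragile (it breaks as soon as the flag values change), a boolean version of the same groupby idea is a safer sketch (my own variant, not from the answer above):

```python
import pandas as pd

df = pd.DataFrame({
    'Borrower': [1, 2, 2, 2, 3, 3, 4, 5, 6, 6],
    'Loan': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Property': ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N'],
})

# does any loan of the borrower have Property == 'Y'?
has_y = df['Property'].eq('Y').groupby(df['Borrower']).transform('any')
df['Result'] = has_y.map({True: 'Y', False: 'N'})
print(df)
```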