Find recurring transactions every month for at least "x" months every year - python-3.x

df = pd.DataFrame.from_dict({"Card_No": [1234, 1234, 1234, 4321, 4321, 4321],
                             "Merchant_Country": ["USA", "USA", "USA", "USA", "USA", "USA"],
                             "Merchant_Name": ["BestBuy", "BestBuy", "BestBuy", "BestBuy", "BestBuy", "BestBuy"],
                             "Date": ["2021-01-15", "2021-02-15", "2021-03-15", "2021-04-15", "2021-05-15", "2021-07-15"],
                             "TrxAmount": [99.99, 99.99, 99.99, 89.99, 89.99, 89.99]})
Find card numbers that had at least 3 recurring payments at the same merchant during all of 2021. So while card 1234 qualifies as having had recurring payments, card 4321 does not fulfill the criteria.
A recurring transaction is at the same merchant, for the same amount, and in consecutive months.
I have been trying to break my head to find a solution but can't seem to solve this either in SQL or in Python.
One approach that might work is creating groups and then seeing how many transactions occurred in each group but I can't seem to crack the consecutive transaction piece of the problem.
Any help is appreciated. Also, the solution should scale to 3,4,5,..n recurring transactions.
groups = df.groupby(["Card_No", "Merchant_Name", "TrxAmount"])
for grp, data in groups:
    lst_of_dates = data["Date"].unique()
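One way to crack the consecutive-month piece (a sketch, not from the original post, with hypothetical helper names): map each date to an integer month number, start a new run id whenever the gap to the previous month is not exactly 1, and compare the longest run per (card, merchant, amount) group against n.

```python
import pandas as pd

df = pd.DataFrame({
    "Card_No": [1234, 1234, 1234, 4321, 4321, 4321],
    "Merchant_Name": ["BestBuy"] * 6,
    "TrxAmount": [99.99, 99.99, 99.99, 89.99, 89.99, 89.99],
    "Date": ["2021-01-15", "2021-02-15", "2021-03-15",
             "2021-04-15", "2021-05-15", "2021-07-15"],
})

dates = pd.to_datetime(df["Date"])
df["month_no"] = dates.dt.year * 12 + dates.dt.month  # calendar month as an integer

def max_consecutive_months(months):
    # Longest run of strictly consecutive months in this group
    months = months.sort_values().drop_duplicates()
    run_id = (months.diff() != 1).cumsum()   # new run whenever the gap is not one month
    return run_id.value_counts().max()

runs = (df.groupby(["Card_No", "Merchant_Name", "TrxAmount"])["month_no"]
          .apply(max_consecutive_months))

n = 3  # scales to any n
qualifying_cards = runs[runs >= n].index.get_level_values("Card_No").unique().tolist()
print(qualifying_cards)  # [1234]
```

Card 1234 has the run Jan–Mar (length 3) and qualifies; card 4321's longest run is Apr–May (length 2), so it does not.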

If you mean one payment per month for a merchant with the same amount on the same day of the month, then you can use lead() or lag(). So, to get the first of three "recurring" payments:
select t.*
from (select t.*,
             lead(date, 1) over (partition by card, merchant, amount order by date) as date_1,
             lead(date, 2) over (partition by card, merchant, amount order by date) as date_2
      from t
     ) t
where date_2 = date + interval '2 month' and
      date_1 = date + interval '1 month';

A pandas answer:
df = pd.DataFrame({'Card Number': {0: 1234, 1: 1234, 2: 1234, 3: 4321, 4: 4321, 5: 4321},
                   'Merchant Country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA', 5: 'USA'},
                   'Merchant Name': {0: 'Best Buy', 1: 'Best Buy', 2: 'Best Buy', 3: 'Best Buy', 4: 'Best Buy', 5: 'Best Buy'},
                   'Date': {0: '2020-02-15', 1: '2020-03-15', 2: '2020-04-15', 3: '2020-05-15', 4: '2020-06-15', 5: '2020-08-15'},
                   'Transaction Amount': {0: 99.99, 1: 99.99, 2: 99.99, 3: 99.99, 4: 99.99, 5: 99.99}})
df['Date'] = pd.to_datetime(df['Date'])
# Create monthly date for indexing
df['Month'] = df['Date']
# Set index as monthly & fill missing months
df = df.set_index('Month').resample('M').min()
# Fill card numbers for grouping
df['Card Number'] = df['Card Number'].ffill()
# Identify recurring transaction - same in month t and t+1, or month t and t-1
df['cond'] = df.groupby('Card Number')['Transaction Amount'].transform(
    lambda x: (x == x.shift(1)) | (x == x.shift(-1)))
# Flag recurring transactions with total number of recurrences in sequence
df['Recurrences'] = df.groupby([df['Card Number'], df['cond'].eq(True)])['cond'].transform(
    lambda x: x.cumsum().max())
Card Number Merchant Country Merchant Name Date Transaction Amount cond Recurrences
Month
2020-02-29 1234.0 USA Best Buy 2020-02-15 99.99 True 3
2020-03-31 1234.0 USA Best Buy 2020-03-15 99.99 True 3
2020-04-30 1234.0 USA Best Buy 2020-04-15 99.99 True 3
2020-05-31 4321.0 USA Best Buy 2020-05-15 99.99 True 2
2020-06-30 4321.0 USA Best Buy 2020-06-15 99.99 True 2
2020-07-31 4321.0 NaN NaN NaT NaN False 0
2020-08-31 4321.0 USA Best Buy 2020-08-15 99.99 False 0
From there, you can find recurrences of N months with df.loc[df['Recurrences'].eq(N)].

Related

How to calculate YTD (Year to Date) value using Pandas Dataframe?

I want to calculate YTD using a pandas dataframe for each month. There are two measurements, named sales and sales rate. For the sales measurement, YTD is calculated by taking the cumulative sum. Code is given below:
report_table['ytd_value'] = report_table.groupby(['financial_year', 'measurement', 'place', 'market', 'product'], sort=False)['value'].cumsum()
But for the sales rate measurement, YTD is calculated in a different way.
YTD calculation explanation (sales rate) given below:
First month (April) YTD value of the financial year = first month (April) value of the financial year.
From the second month of the financial year onwards, the YTD value is calculated using a formula:
May YTD value = ((April YTD value (sales) * April YTD value (sales rate)) + (April value (sales) * April value (sales rate))) / (April value (sales) + April value (sales rate))
Similarly for the other months. The dataframe is given below.
import pandas as pd
data = {'Month': ['April', 'May', 'April', 'June', 'April', 'May'],
        'Year': [2022, 2022, 2022, 2022, 2022, 2022],
        'Financial_Year': [2023, 2023, 2023, 2023, 2023, 2023],
        'Measurement': ['sales', 'sales', 'sales', 'sales', 'sales rate', 'sales rate'],
        'Place': ['Delhi', 'Delhi', 'Delhi', 'Delhi', 'Delhi', 'Delhi'],
        'Market': ['Domestic', 'Domestic', 'Export', 'Domestic', 'Domestic', 'Domestic'],
        'Product': ['Biscuit', 'Biscuit', 'Chocolate', 'Biscuit', 'Biscuit', 'Biscuit'],
        'Value': ['10', '10', '20', '25', '10', '20']}
# Create DataFrame
df = pd.DataFrame(data)
df['Value'] = df['Value'].astype(float)
df['ytd_value'] = df.groupby(['Financial_Year', 'Measurement', 'Place', 'Market', 'Product'], sort=False)['Value'].cumsum()
It will calculate ytd_value for both sales and sales rate measurement.But I want to calculate ytd_value for sales rate in the above mentioned format.
I have tried below code, but it shows an error:
grp_cols = ['Financial_Year', 'Measurement', 'Place', 'Market', 'Product']
rslt_df = df[df['Measurement'] == 'sales']
df.loc[df['Measurement'] == "sales rate", 'ytd_value'] = (
    (df.groupby(grp_cols, sort=False)['ytd_value'] * rslt_df.groupby(grp_cols, sort=False)['ytd_value']
     + df.groupby(grp_cols, sort=False)['Value'] * rslt_df.groupby(grp_cols, sort=False)['Value'])
    / (rslt_df.groupby(grp_cols, sort=False)['ytd_value'] + rslt_df.groupby(grp_cols, sort=False)['Value']))
Expected output:
Month Year Financial_Year ... Product Value ytd_value
0 April 2022 2023 ... Biscuit 10.0 10.0
1 May 2022 2023 ... Biscuit 10.0 20.0
2 April 2022 2023 ... Chocolate 20.0 20.0
3 June 2022 2023 ... Biscuit 25.0 45.0
4 April 2022 2023 ... Biscuit 10.0 10.0
5 May 2022 2023 ... Biscuit 20.0 10.0
Can anyone help me to solve this calculation?
I recommend you change your dataframe around a bit:
Month Year Financial_Year Place Market Product Sales Sales Rate
0 April 2022 2023 Delhi Domestic Biscuit 10.0 10.0
1 May 2022 2023 Delhi Domestic Biscuit 10.0 20.0
2 June 2022 2023 Delhi Domestic Biscuit 25.0 0.0
You may be able to get here by aggregating the sales values across each month, but the point is that you have a single Sales value and Sales Rate value for each month.
Once you have this, you can set the YTD value for April, and then iterate through the following months to calculate their values.
I think there's an error in the formula you posted for YTD calculations, but using that as is, here's some sample code:
import pandas as pd
data = {'Month': ['April', 'May', 'June'],
        'Year': [2022, 2022, 2022],
        'Financial_Year': [2023, 2023, 2023],
        'Place': ['Delhi', 'Delhi', 'Delhi'],
        'Market': ['Domestic', 'Domestic', 'Domestic'],
        'Product': ['Biscuit', 'Biscuit', 'Biscuit'],
        'Sales': [10, 10, 25],
        'Sales Rate': [10, 20, 0]}
# Create DataFrame
df = pd.DataFrame(data)
df['Sales'] = df['Sales'].astype(float)
df['Sales Rate'] = df['Sales Rate'].astype(float)
df['YTD'] = 0.0
df.at[0, 'YTD'] = df.iloc[0]['Sales']
for rowidx in range(1, len(df)):
    prevrow = df.iloc[rowidx - 1]
    tmp = prevrow['Sales'] * prevrow['Sales Rate']
    df.at[rowidx, 'YTD'] = tmp + tmp / tmp
print(df)
This outputs, for example:
Month Year Financial_Year Place Market Product Sales Sales Rate YTD
0 April 2022 2023 Delhi Domestic Biscuit 10.0 10.0 10.0
1 May 2022 2023 Delhi Domestic Biscuit 10.0 20.0 101.0
2 June 2022 2023 Delhi Domestic Biscuit 25.0 0.0 201.0
You should be able to use this as an example to implement the correct function to calculate the YTD values.
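If the intended rule is the sales-weighted average of the monthly rates (one plausible reading of the posted formula, which is ambiguous as written; this is an assumption, not the asker's confirmed definition), the loop can be replaced with a vectorized cumulative form:

```python
import pandas as pd

df = pd.DataFrame({
    "Month": ["April", "May", "June"],
    "Sales": [10.0, 10.0, 25.0],
    "Sales Rate": [10.0, 20.0, 0.0],
})

# Assumed rule: YTD rate = cumulative sum(sales * rate) / cumulative sum(sales)
df["YTD Sales"] = df["Sales"].cumsum()
df["YTD Rate"] = (df["Sales"] * df["Sales Rate"]).cumsum() / df["YTD Sales"]
print(df)
```

Under this reading, the first-month rule (April YTD = April value) falls out automatically: the April row gives 100 / 10 = 10.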

Pandas filter a column based on another column

Given this dataframe
df = pd.DataFrame(
    {'name': {0: 'Peter', 1: 'Anna', 2: 'Anna', 3: 'Peter', 4: 'Simon'},
     'Status': {0: 'Finished',
                1: 'Registered',
                2: 'Cancelled',
                3: 'Finished',
                4: 'Registered'},
     'Modified': {0: '2019-03-11',
                  1: '2019-03-19',
                  2: '2019-05-22',
                  3: '2019-10-31',
                  4: '2019-04-05'}})
How can I compare and filter based on the Status? I want the Modified column to keep the date where the Status is either "Finished" or "Cancelled", and fill in blank where the condition is not met.
Wanted output:
name Status Modified
0 Peter Finished 2019-03-11
1 Anna Registered
2 Anna Cancelled 2019-05-22
3 Peter Finished 2019-10-31
4 Simon Registered
Check with where + isin
df['Modified'] = df['Modified'].where(df['Status'].isin(['Finished', 'Cancelled']), '')
df
Out[68]:
name Status Modified
0 Peter Finished 2019-03-11
1 Anna Registered
2 Anna Cancelled 2019-05-22
3 Peter Finished 2019-10-31
4 Simon Registered
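An equivalent without in-place mutation is numpy.where, which picks element-wise between the original date and the blank string (a sketch on the same sample data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Peter', 'Anna', 'Anna', 'Peter', 'Simon'],
    'Status': ['Finished', 'Registered', 'Cancelled', 'Finished', 'Registered'],
    'Modified': ['2019-03-11', '2019-03-19', '2019-05-22', '2019-10-31', '2019-04-05'],
})

# Keep the date where Status matches, blank it out otherwise
df['Modified'] = np.where(df['Status'].isin(['Finished', 'Cancelled']),
                          df['Modified'], '')
```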

How can I create an aggregate/summary pandas dataframe based on overlapping dates derived from a more specific dataframe?

The following dataframe shows trips made by employees of different companies:
source:
import pandas as pd
emp_trips = {'Name': ['Bob', 'Joe', 'Sue', 'Jack', 'Henry', 'Frank', 'Lee', 'Jack'],
             'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
             'Depart': ['01/01/2020', '01/01/2020', '01/06/2020', '01/01/2020', '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
             'Return': ['01/31/2020', '02/15/2020', '02/20/2020', '03/01/2020', '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
             'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87]}
df = pd.DataFrame(emp_trips, columns = ['Name', 'Company', 'Depart', 'Return', 'Charges'])
# Convert to date format
df['Return']= pd.to_datetime(df['Return'])
df['Depart']= pd.to_datetime(df['Depart'])
output:
Name Company Depart Return Charges
0 Bob ABC 2020-01-01 2020-01-31 10.10
1 Joe ABC 2020-01-01 2020-02-15 20.25
2 Sue ABC 2020-01-06 2020-02-20 30.32
3 Jack HIJ 2020-01-01 2020-03-01 40.00
4 Henry HIJ 2020-05-01 2020-05-05 50.01
5 Frank DEF 2020-01-13 2020-01-15 60.32
6 Lee DEF 2020-01-12 2020-01-30 70.99
7 Jack DEF 2020-01-14 2020-02-02 80.87
How can I create another dataframe based on the following aspects:
The original dataframe is based on employee names/trips.
The generated dataframe will be based on companies grouped by overlapping dates.
The 'Name' column will not be included as it is no longer needed.
The 'Company' column will remain.
The 'Depart' date will be of the earliest date of any overlapping trip dates.
The 'Return' date will be of the latest date of any overlapping trip dates.
Any company trips that do not have overlapping dates will be its own entry/row.
The 'Charges' for each trip will be totaled for the new company entry.
Here is the desired output of the new dataframe:
Company Depart Return Charges
0 ABC 01/01/2020 02/20/2020 60.67
1 HIJ 01/01/2020 03/01/2020 40.00
2 HIJ 05/01/2020 05/05/2020 50.01
3 DEF 01/12/2020 02/02/2020 212.18
I've looked into the following as possible solutions:
Create a hierarchical index based on the company and date. As I worked through this, I realized that this only creates a hierarchical index on the specified columns; it won't aggregate the individual rows into summary rows.
df1 = df.set_index(['Company', not exactly sure how to say overlapping dates])
I also tried using timedelta, but it resulted in True/False values in a separate column, and I'm not sure how that would be used to combine rows into a single row based on overlapping dates and company. Also, I don't think groupby('Company') alone works, since a company can have non-overlapping trips that need their own rows.
from datetime import timedelta

df['trips_overlap'] = (df.groupby('Company')
                         .apply(lambda x: (x['Return'].shift() - x['Depart']) > timedelta(0))
                         .reset_index(level=0, drop=True))
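No answer was posted in this thread, but one common approach (a sketch, assuming "overlap" means the trip departs no later than the latest return seen so far for that company) is the interval-merge trick: sort by company and departure, carry the running maximum return date from earlier trips, and start a new group whenever a trip departs after everything before it has returned.

```python
import pandas as pd

emp_trips = {
    'Name': ['Bob', 'Joe', 'Sue', 'Jack', 'Henry', 'Frank', 'Lee', 'Jack'],
    'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
    'Depart': ['01/01/2020', '01/01/2020', '01/06/2020', '01/01/2020',
               '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
    'Return': ['01/31/2020', '02/15/2020', '02/20/2020', '03/01/2020',
               '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
    'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87],
}
df = pd.DataFrame(emp_trips)
df['Depart'] = pd.to_datetime(df['Depart'])
df['Return'] = pd.to_datetime(df['Return'])

df = df.sort_values(['Company', 'Depart'])
# Running max of Return over *earlier* trips of the same company
prev_max_return = df.groupby('Company')['Return'].transform(lambda s: s.cummax().shift())
# New group whenever this trip starts after every earlier trip has ended
df['grp'] = (df['Depart'] > prev_max_return).cumsum()

summary = (df.groupby(['Company', 'grp'], sort=False)
             .agg(Depart=('Depart', 'min'),
                  Return=('Return', 'max'),
                  Charges=('Charges', 'sum'))
             .reset_index()
             .drop(columns='grp'))
```

On this data, ABC and DEF each collapse to one row (Charges 60.67 and 212.18), while HIJ keeps two rows because Henry's May trip does not overlap Jack's January–March trip.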

How can I merge rows if a column data is same and change a value of another specific column on merged column efficiently in pandas?

I am trying to merge rows if value of certain column are same. I have been using groupby first and replace the data the value of column based on specific condition. I was wondering if there is a better option to do what I am trying to do.
This is what I have been doing
import numpy as np
import pandas as pd

data = {'Name': {0: 'Sam', 1: 'Amy', 2: 'Cat', 3: 'Sam', 4: 'Kathy'},
        'Subject1': {0: 'Math', 1: 'Science', 2: 'Art', 3: np.nan, 4: 'Science'},
        'Subject2': {0: np.nan, 1: np.nan, 2: np.nan, 3: 'English', 4: np.nan},
        'Result': {0: 'Pass', 1: 'Pass', 2: 'Fail', 3: 'TBD', 4: 'Pass'}}
df = pd.DataFrame(data)
df = df.groupby('Name').agg({
    'Subject1': 'first',
    'Subject2': 'first',
    'Result': ', '.join}).reset_index()
df['Result'] = df['Result'].apply(lambda x: 'RESULT_FAILED' if x == 'Pass, TBD' else x)
Starting: df looks like:
Name Subject1 Subject2 Result
0 Sam Math NaN Pass
1 Amy Science NaN Pass
2 Cat Art NaN Fail
3 Sam NaN English TBD
4 Kathy Science NaN Pass
Final result I want is :
Name Subject1 Subject2 Result
0 Amy Science NaN Pass
1 Cat Art NaN Fail
2 Kathy Science NaN Pass
3 Sam Math English RESULT_FAILED
I believe this might not be a good solution if there are more than 100 columns, since I would have to change the aggregation dictionary by hand.
I also tried df.groupby('Name')['Result'].agg(' '.join).reset_index(), but then I only get 2 columns.
Your sample indicates that for each duplicated Name, every SubjectX column has exactly one non-NaN value. You may try this way:
import numpy as np
df_final = (df.fillna('').groupby('Name', as_index=False).agg(''.join)
              .replace({'': np.nan, 'PassTBD': 'RESULT_FAILED'}))
Out[16]:
Name Subject1 Subject2 Result
0 Amy Science NaN Pass
1 Cat Art NaN Fail
2 Kathy Science NaN Pass
3 Sam Math English RESULT_FAILED
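To address the 100-column worry directly, the original groupby/agg approach also works with an aggregation dictionary built programmatically instead of by hand (a sketch on the same data; GroupBy's 'first' skips NaN, so each subject column collapses to its one real value):

```python
import numpy as np
import pandas as pd

data = {'Name': {0: 'Sam', 1: 'Amy', 2: 'Cat', 3: 'Sam', 4: 'Kathy'},
        'Subject1': {0: 'Math', 1: 'Science', 2: 'Art', 3: np.nan, 4: 'Science'},
        'Subject2': {0: np.nan, 1: np.nan, 2: np.nan, 3: 'English', 4: np.nan},
        'Result': {0: 'Pass', 1: 'Pass', 2: 'Fail', 3: 'TBD', 4: 'Pass'}}
df = pd.DataFrame(data)

# Build the aggregation spec for every column except the key and Result
agg_spec = {c: 'first' for c in df.columns if c not in ('Name', 'Result')}
agg_spec['Result'] = ', '.join
out = df.groupby('Name').agg(agg_spec).reset_index()
out['Result'] = out['Result'].replace({'Pass, TBD': 'RESULT_FAILED'})
```

This scales to any number of SubjectX columns without editing the dictionary.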

Pandas Group-By and Calculate Ratio of Two Columns

I'm trying to use pandas groupby to calculate the ratio of two columns. In the example below I want the ratio of staff Status per Department (number of each Status in a department / total number of employees in that department). For example, the Sales department has 3 employees in total, 2 of whom have status Employee, giving a ratio of 2/3, or 66.67%. I managed to hack my way through to get this, but there must be a more elegant and simpler way. How can I get the desired output below more efficiently?
Original DataFrame:
Department Name Status
0 Sales John Employee
1 Sales Steve Employee
2 Sales Sara Contractor
3 Finance Allen Contractor
4 Marketing Robert Employee
5 Marketing Lacy Contractor
Code:
mydict = {
    'Name': ['John', 'Steve', 'Sara', 'Allen', 'Robert', 'Lacy'],
    'Department': ['Sales', 'Sales', 'Sales', 'Finance', 'Marketing', 'Marketing'],
    'Status': ['Employee', 'Employee', 'Contractor', 'Contractor', 'Employee', 'Contractor']
}
df = pd.DataFrame(mydict)
# Create column with total number of staff per Department
df['total_dept'] = df.groupby(['Department'])['Name'].transform('count')
print(df)
print('\n')
# Create column with Status ratio per department
for k, v in df.iterrows():
    df.loc[k, 'Status_Ratio'] = (df.groupby(['Department', 'Status']).count()
                                   .xs(v['Status'], level=1)['total_dept'][v['Department']]
                                 / v['total_dept']) * 100
print(df)
print('\n')
# Final Groupby with Status Ratio. Size NOT needed
print(df.groupby(['Department', 'Status', 'Status_Ratio']).size())
Desired Output:
Department Status Status_Ratio
Finance Contractor 100.00
Marketing Contractor 50.00
Employee 50.00
Sales Contractor 33.33
Employee 66.67
Try (with the original df):
df.groupby("Department")["Status"].value_counts(normalize=True).mul(100)
Outputs:
Department Status
Finance Contractor 100.000000
Marketing Contractor 50.000000
Employee 50.000000
Sales Employee 66.666667
Contractor 33.333333
Name: Status, dtype: float64
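To match the two-decimal layout of the desired output, or to get a flat frame with a named ratio column, the same one-liner extends naturally (renaming the result avoids the name clash between the series and its index levels on reset_index):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Steve', 'Sara', 'Allen', 'Robert', 'Lacy'],
    'Department': ['Sales', 'Sales', 'Sales', 'Finance', 'Marketing', 'Marketing'],
    'Status': ['Employee', 'Employee', 'Contractor', 'Contractor', 'Employee', 'Contractor'],
})

ratios = (df.groupby('Department')['Status']
            .value_counts(normalize=True)  # per-department share of each Status
            .mul(100).round(2)
            .rename('Status_Ratio')
            .reset_index())
```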
