Pandas filter a column based on another column - python-3.x

Given this dataframe
import pandas as pd

df = pd.DataFrame(
    {'name': {0: 'Peter', 1: 'Anna', 2: 'Anna', 3: 'Peter', 4: 'Simon'},
     'Status': {0: 'Finished',
                1: 'Registered',
                2: 'Cancelled',
                3: 'Finished',
                4: 'Registered'},
     'Modified': {0: '2019-03-11',
                  1: '2019-03-19',
                  2: '2019-05-22',
                  3: '2019-10-31',
                  4: '2019-04-05'}})
How can I filter based on the Status column? I want the Modified column to keep the date where Status is either "Finished" or "Cancelled", and to be blank where the condition is not met.
Wanted output:
name Status Modified
0 Peter Finished 2019-03-11
1 Anna Registered
2 Anna Cancelled 2019-05-22
3 Peter Finished 2019-10-31
4 Simon Registered

Check with where + isin; assign the result back rather than relying on inplace chained assignment, which is unreliable in newer pandas:
df['Modified'] = df['Modified'].where(df['Status'].isin(['Finished', 'Cancelled']), '')
df
Out[68]:
name Status Modified
0 Peter Finished 2019-03-11
1 Anna Registered
2 Anna Cancelled 2019-05-22
3 Peter Finished 2019-10-31
4 Simon Registered
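An equivalent one-liner (a sketch using numpy.where instead of the Series.where approach above) that produces the same result:
import numpy as np

# Keep the date where Status qualifies, otherwise substitute an empty string
df['Modified'] = np.where(df['Status'].isin(['Finished', 'Cancelled']), df['Modified'], '')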

Related

How to maintain column names while pivoting data, with no aggregation and without an artificial index?

I have a dataframe like this, which has about 40 unique values in the key column, some of which are integers. I do not want anything aggregated if possible:
hash ft timestamp id key value
1ab 1 2022-01-02 12:21:11 1 aaa 121289
1ab 1 2022-01-02 12:21:11 1 bbb 13ADF
1ab 1 2022-01-02 12:21:11 2 aaa 13ADH
1ab 1 2022-01-02 12:21:11 2 bbb 13ADH
2ab 2 2022-01-02 12:21:11 3 aaa 121382
2ab 2 2022-01-02 12:21:11 3 bbb 121381
2ab 2 2022-01-02 12:21:11 3 ccc 121389
I am trying to pivot the data on only the two columns key and value, while keeping the remaining columns and index the same. Example:
When I run the code below, the column names become grouped tuples spanning the columns id, ft, and value. One of the actual column names, with parentheses, is ('id', '1', '121289'), and I am forced to select an index, which I don't want to do.
Code:
df.pivot_table(index='hash',columns=['ft','key'])
I am not sure what I am doing wrong that prevents me from using the value column for values; I get an empty dataframe:
df.pivot_table(index='hash',columns=['ft','key'], values='value')
A possible solution, using pandas.DataFrame.pivot:
(df.pivot(index=['hash', 'ft', 'id', 'timestamp'],
          columns=['key'], values='value')
   .reset_index().rename_axis(None, axis=1))
Output:
hash ft id timestamp aaa bbb ccc
0 1ab 1 1 2022-01-02 12:21:11 121289 13ADF NaN
1 1ab 1 2 2022-01-02 12:21:11 13ADH 13ADH NaN
2 2ab 2 3 2022-01-02 12:21:11 121382 121381 121389
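Note the design choice here: every column that should survive unaggregated ('hash', 'ft', 'id', 'timestamp') is moved into the index, so pivot never has to aggregate anything, and reset_index then restores them as ordinary columns.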
Data:
df = pd.DataFrame.from_dict(
{'hash': {0: '1ab',
1: '1ab',
2: '1ab',
3: '1ab',
4: '2ab',
5: '2ab',
6: '2ab'},
'ft': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2},
'timestamp': {0: '2022-01-02 12:21:11',
1: '2022-01-02 12:21:11',
2: '2022-01-02 12:21:11',
3: '2022-01-02 12:21:11',
4: '2022-01-02 12:21:11',
5: '2022-01-02 12:21:11',
6: '2022-01-02 12:21:11'},
'id': {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3, 6: 3},
'key': {0: 'aaa', 1: 'bbb', 2: 'aaa', 3: 'bbb', 4: 'aaa', 5: 'bbb', 6: 'ccc'},
'value': {0: '121289',
1: '13ADF',
2: '13ADH',
3: '13ADH',
4: '121382',
5: '121381',
6: '121389'}}
)
EDIT
To overcome an error the OP reported in a comment, the OP suggests the following solution:
(pd.pivot_table(df, index=['hash', 'ft', 'id'],
                columns=['key'],
                values='value', aggfunc='sum')
   .reset_index().rename_axis(None, axis=1))
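For context (an aside, not part of the original answers): DataFrame.pivot raises a ValueError when the (index, columns) pairs are not unique, while pivot_table resolves duplicates by aggregating them, which is presumably the error this works around. A minimal sketch of the difference on the sample df:
# pivot demands unique (index, columns) pairs; with index='hash' alone,
# ('1ab', 'aaa') appears twice, so this raises a ValueError:
# df.pivot(index='hash', columns='key', values='value')

# pivot_table tolerates the duplicates by aggregating them:
pd.pivot_table(df, index='hash', columns='key', values='value', aggfunc='first')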

Find recurring transactions every month for at least "x" months every year

import pandas as pd

df = pd.DataFrame.from_dict({"Card_No": [1234, 1234, 1234, 4321, 4321, 4321],
                             "Merchant_Country": ["USA", "USA", "USA", "USA", "USA", "USA"],
                             "Merchant_Name": ["BestBuy", "BestBuy", "BestBuy", "BestBuy", "BestBuy", "BestBuy"],
                             "Date": ["2021-01-15", "2021-02-15", "2021-03-15", "2021-04-15", "2021-05-15", "2021-07-15"],
                             "TrxAmount": [99.99, 99.99, 99.99, 89.99, 89.99, 89.99]})
Find card numbers that had at least 3 recurring payments at the same merchant during all of 2020. So while 1234 qualifies as having had recurring payments, card number 4321 does not fulfill the criteria.
A recurring transaction is at the same merchant, for the same amount, and in consecutive months.
I have been racking my brain for a solution but can't seem to solve this in either SQL or Python.
One approach that might work is creating groups and then counting how many transactions occurred in each group, but I can't seem to crack the consecutive-transaction piece of the problem.
Any help is appreciated. Also, the solution should scale to 3, 4, 5, ..., n recurring transactions.
groups = df.groupby(["Card_No", "Merchant_Name", "TrxAmount"])
for grp, data in groups:
    lst_of_dates = data["Date"].unique()
If you mean one payment per month for a merchant with the same amount on the same day of the month, then you can use lead() or lag(). So, to get the first of three "recurring" payments:
select t.*
from (select t.*,
             lead(date, 1) over (partition by card, merchant, amount order by date) as date_1,
             lead(date, 2) over (partition by card, merchant, amount order by date) as date_2
      from t
     ) t
where date_2 = date + interval '2 month' and
      date_1 = date + interval '1 month';
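This pattern scales to n recurring payments by adding lead(date, k) columns up to k = n - 1 and checking that each one lands exactly k months after date.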
A pandas answer:
df = pd.DataFrame({'Card Number': {0: 1234, 1: 1234, 2: 1234, 3: 4321, 4: 4321, 5: 4321},
'Merchant Country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA', 5: 'USA'},
'Merchant Name': {0: 'Best Buy', 1: 'Best Buy', 2: 'Best Buy', 3: 'Best Buy', 4: 'Best Buy', 5: 'Best Buy'},
'Date': {0: '2020-02-15', 1: '2020-03-15', 2: '2020-04-15', 3: '2020-05-15', 4: '2020-06-15', 5: '2020-08-15'},
'Transaction Amount': {0: 99.99, 1: 99.99, 2: 99.99, 3: 99.99, 4: 99.99, 5: 99.99}})
df['Date'] = pd.to_datetime(df['Date'])
# Create monthly date for indexing
df['Month'] = df['Date']
# Set index as monthly & fill missing months
df = df.set_index('Month').resample('M').min()
# Fill card numbers for grouping
df['Card Number'] = df['Card Number'].ffill()
# Identify recurring transaction - same in month t and t+1, or month t and t-1
df['cond'] = df.groupby('Card Number')['Transaction Amount'].transform(
    lambda x: (x == x.shift(1)) | (x == x.shift(-1)))
# Flag recurring transactions with the total number of recurrences in the sequence
df['Recurrences'] = df.groupby([df['Card Number'], df['cond'].eq(True)])['cond'].transform(
    lambda x: x.cumsum().max())
Card Number Merchant Country Merchant Name Date Transaction Amount cond Recurrences
Month
2020-02-29 1234.0 USA Best Buy 2020-02-15 99.99 True 3
2020-03-31 1234.0 USA Best Buy 2020-03-15 99.99 True 3
2020-04-30 1234.0 USA Best Buy 2020-04-15 99.99 True 3
2020-05-31 4321.0 USA Best Buy 2020-05-15 99.99 True 2
2020-06-30 4321.0 USA Best Buy 2020-06-15 99.99 True 2
2020-07-31 4321.0 NaN NaN NaT NaN False 0
2020-08-31 4321.0 USA Best Buy 2020-08-15 99.99 False 0
From there, you can find recurrences of N months with df.loc[df['Recurrences'].eq(N)].
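An alternative sketch (not from either answer; it assumes the question's original column names Card_No, Merchant_Name, TrxAmount, and Date) that measures runs of consecutive months directly and scales to any n:
import pandas as pd

# Encode each date as a running month number so consecutive months differ by exactly 1
dates = pd.to_datetime(df['Date'])
df['month_num'] = dates.dt.year * 12 + dates.dt.month

def longest_run(m):
    # A new run starts wherever the gap to the previous month is not exactly 1
    m = m.drop_duplicates().sort_values()
    run_id = (m.diff() != 1).cumsum()
    return run_id.value_counts().max()

run_len = df.groupby(['Card_No', 'Merchant_Name', 'TrxAmount'])['month_num'].apply(longest_run)
qualifying = run_len[run_len >= 3]  # set the threshold to any n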

How can I efficiently merge rows in pandas when a column's values match, and change the value of another specific column in the merged row?

I am trying to merge rows when the values of a certain column are the same. I have been using groupby with a 'first' aggregation and then replacing the value of a column based on a specific condition. I was wondering if there is a better way to do what I am trying to do.
This is what I have been doing
import numpy as np
import pandas as pd

data = {'Name': {0: 'Sam', 1: 'Amy', 2: 'Cat', 3: 'Sam', 4: 'Kathy'},
        'Subject1': {0: 'Math', 1: 'Science', 2: 'Art', 3: np.nan, 4: 'Science'},
        'Subject2': {0: np.nan, 1: np.nan, 2: np.nan, 3: 'English', 4: np.nan},
        'Result': {0: 'Pass', 1: 'Pass', 2: 'Fail', 3: 'TBD', 4: 'Pass'}}
df = pd.DataFrame(data)
df = df.groupby('Name').agg({
    'Subject1': 'first',
    'Subject2': 'first',
    'Result': ', '.join}).reset_index()
df['Result'] = df['Result'].apply(lambda x: 'RESULT_FAILED' if x == 'Pass, TBD' else x)
Starting df looks like:
Name Subject1 Subject2 Result
0 Sam Math NaN Pass
1 Amy Science NaN Pass
2 Cat Art NaN Fail
3 Sam NaN English TBD
4 Kathy Science NaN Pass
Final result I want is :
Name Subject1 Subject2 Result
0 Amy Science NaN Pass
1 Cat Art NaN Fail
2 Kathy Science NaN Pass
3 Sam Math English RESULT_FAILED
I believe this might not be a good solution if there are more than 100 columns, since I would have to manually change the aggregation dictionary.
I tried using:
df.groupby('Name')['Result'].agg(' '.join).reset_index()
but I only get 2 columns.
Your sample indicates that each SubjectX column has at most one non-NaN value per unique Name, even when the Name is duplicated. You may try this:
import numpy as np

df_final = (df.fillna('').groupby('Name', as_index=False).agg(''.join)
              .replace({'': np.nan, 'PassTBD': 'RESULT_FAILED'}))
df_final
Out[16]:
Name Subject1 Subject2 Result
0 Amy Science NaN Pass
1 Cat Art NaN Fail
2 Kathy Science NaN Pass
3 Sam Math English RESULT_FAILED
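To address the scaling concern (100+ columns), one option, sketched here under the assumption that every column other than Name and Result should simply take its first non-null value, is to build the aggregation dictionary programmatically from the original df:
# Build the agg spec from the columns themselves instead of writing it by hand
agg_map = {c: 'first' for c in df.columns if c not in ('Name', 'Result')}
agg_map['Result'] = ', '.join

out = df.groupby('Name', as_index=False).agg(agg_map)
out['Result'] = out['Result'].replace('Pass, TBD', 'RESULT_FAILED')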

What's the most Pythonic/efficient way to conditionally split lengthy rows?

I have a data frame with a column Results which can get pretty lengthy. Since the data frame ends up in an Excel report, this is problematic: Excel will only make a row so tall before it simply stops displaying all of the data. Instead, I'd like to split rows whose results exceed a certain length into multiple rows.
I wrote some code on a small-scale data frame to split the results into chunks of 2, but I haven't worked out how to put each chunk in a new row. Additionally, I'm not sure this will be the most efficient method once my data frame grows from 6 rows to 35k+. What is the most efficient/Pythonic way to achieve what I want?
Original data frame
Result Date
0 [SUCCESS] 10/10/2019
1 [SUCCESS] 10/09/2019
2 [FAILURE] 10/08/2019
3 [Pending, Pending, SUCCESS] 10/07/2019
4 [FAILURE] 10/06/2019
5 [Pending, SUCCESS] 10/05/2019
Goal output
Result Date
0 [SUCCESS] 10/10/2019
1 [SUCCESS] 10/09/2019
2 [FAILURE] 10/08/2019
3 [Pending, Pending] 10/07/2019
4 [SUCCESS] 10/07/2019
5 [FAILURE] 10/06/2019
6 [Pending, SUCCESS] 10/05/2019
Code
import pandas as pd
import numpy as np

data = {'Result': [['SUCCESS'], ['SUCCESS'], ['FAILURE'], ['Pending', 'Pending', 'SUCCESS'], ['FAILURE'], ['Pending', 'SUCCESS']],
        'Date': ['10/10/2019', '10/09/2019', '10/08/2019', '10/07/2019', '10/06/2019', '10/05/2019']}
df = pd.DataFrame(data)
df['Length of Results'] = df['Result'].str.len()

def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]

for i in range(len(df)):
    if df['Length of Results'][i] > 2:
        df['Result'][i] = list(chunks(df['Result'][i], 2))

df['Chunks'] = 1
for i in range(len(df)):
    if df['Length of Results'][i] > 2:
        df['Chunks'][i] = len(df['Result'][i])

df = df.loc[np.repeat(df.index.values, df.Chunks)]
df = df.reset_index(drop=True)
Code currently produces
Result Date Length of Results Chunks
0 [SUCCESS] 10/10/2019 1 1
1 [SUCCESS] 10/09/2019 1 1
2 [FAILURE] 10/08/2019 1 1
3 [[Pending, Pending], [SUCCESS]] 10/07/2019 3 2
4 [[Pending, Pending], [SUCCESS]] 10/07/2019 3 2
5 [FAILURE] 10/06/2019 1 1
6 [Pending, SUCCESS] 10/05/2019 2 1
df.to_dict()
{'Result': {0: ['SUCCESS'], 1: ['SUCCESS'], 2: ['FAILURE'], 3: [['Pending', 'Pending'], ['SUCCESS']], 4: [['Pending', 'Pending'], ['SUCCESS']], 5: ['FAILURE'], 6: ['Pending', 'SUCCESS']}, 'Date': {0: '10/10/2019', 1: '10/09/2019', 2: '10/08/2019', 3: '10/07/2019', 4: '10/07/2019', 5: '10/06/2019', 6: '10/05/2019'}, 'Length of Results': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 2}, 'Chunks': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 1, 6: 1}}
I would recommend putting each Pending/SUCCESS/FAILURE on its own row, using df.explode if you are on pandas 0.25+:
df = df.explode('Result')
df.groupby(['Date', 'Result']).size().reset_index(name='Length')
Output:
Date Result Length
0 10/05/2019 Pending 1
1 10/05/2019 SUCCESS 1
2 10/06/2019 FAILURE 1
3 10/07/2019 Pending 2
4 10/07/2019 SUCCESS 1
5 10/08/2019 FAILURE 1
6 10/09/2019 SUCCESS 1
7 10/10/2019 SUCCESS 1
You can do
s=df[['Date','Result']].explode('Result')
t=s.groupby(['Date','Result'])['Result'].transform('size')>1
s.groupby([s.Date,t]).Result.agg(list).reset_index(level=0).reset_index(drop=True)
Out[65]:
Date Result
0 10/05/2019 [Pending, SUCCESS]
1 10/06/2019 [FAILURE]
2 10/07/2019 [SUCCESS]
3 10/07/2019 [Pending, Pending]
4 10/08/2019 [FAILURE]
5 10/09/2019 [SUCCESS]
6 10/10/2019 [SUCCESS]
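Neither answer reproduces the goal output exactly (rows holding at most two results each). A short sketch that does, assuming the original df built in the question's Code section:
def chunk(lst, n=2):
    # Split a list into pieces of at most n elements
    return [lst[i:i + n] for i in range(0, len(lst), n)]

out = (df.assign(Result=df['Result'].apply(chunk))
         .explode('Result')
         .reset_index(drop=True))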

Using duplicate values in one column to remove entire rows in a pandas dataframe

I have the data in the .csv file uploaded in the following link
Click here for the data
In this file, I have the following columns:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
There will be duplicates in the Team column. Another column is SimStage, which contains values from 0 to N (in this case 0 to 4).
I would like to keep one row for each Team at each SimStage value (the rest will be removed). When removing, the duplicate row with the lower value in the Points column will be removed for each Team and SimStage.
Since it is slightly difficult to explain using words alone, I attached a picture here. In this picture, the rows highlighted in red boxes will be removed.
I used df.drop_duplicates() but it does not work.
It looks like you want to keep only the highest value from the 'Points' column. Therefore, use the 'first' aggregation function in pandas after sorting.
Create the dataframe and call it df
import pandas as pd

data = {'Team': {0: 'Brazil', 1: 'Brazil', 2: 'Brazil', 3: 'Brazil', 4: 'Brazil', 5: 'Brazil', 6: 'Brazil', 7: 'Brazil', 8: 'Brazil', 9: 'Brazil'},
'Group': {0: 'Group E', 1: 'Group E', 2: 'Group E', 3: 'Group E', 4: 'Group E', 5: 'Group E', 6: 'Group E', 7: 'Group E', 8: 'Group E', 9: 'Group E'},
'Model': {0: 'ELO', 1: 'ELO', 2: 'ELO', 3: 'ELO', 4: 'ELO', 5: 'ELO', 6: 'ELO', 7: 'ELO', 8: 'ELO', 9: 'ELO'},
'SimStage': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4},
'Points': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 1, 6: 2, 7: 4, 8: 4, 9: 1},
'GpWinner': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.0, 6: 0.2, 7: 0.2, 8: 0.2, 9: 0.0},
'GpRunnerup': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.2, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.2},
'3rd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
'4th': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}}
df = pd.DataFrame(data)
# To be able to output the dataframe in your original order
columns_order = ['Team', 'Group', 'Model', 'SimStage', 'Points', 'GpWinner', 'GpRunnerup', '3rd', '4th']
Method 1
# Sort by 'SimStage' ascending and, within each stage, by 'Points' descending
df = df.sort_values(['SimStage', 'Points'], ascending=[True, False])
# Group by the key columns and take the first (highest-Points) row of each group
df = df.groupby(['Team', 'SimStage'], as_index=False).agg('first')
# Output the dataframe in the original order
df[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
1 Brazil Group E ELO 1 4 0.2 0.0 0 0
2 Brazil Group E ELO 2 4 0.2 0.0 0 0
3 Brazil Group E ELO 3 4 0.2 0.0 0 0
4 Brazil Group E ELO 4 4 0.2 0.0 0 0
Method 2
df.sort_values('Points', ascending=False).drop_duplicates(['Team', 'SimStage'])[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
2 Brazil Group E ELO 1 4 0.2 0.0 0 0
4 Brazil Group E ELO 2 4 0.2 0.0 0 0
7 Brazil Group E ELO 3 4 0.2 0.0 0 0
8 Brazil Group E ELO 4 4 0.2 0.0 0 0
I am just creating a mini-dataset based on your dataset, with Team, SimStage and Points.
import pandas as pd
namesDf = pd.DataFrame()
namesDf['Team'] = ['Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil']
namesDf['SimStage'] = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
namesDf['Points'] = [4, 4, 4, 4, 4, 1, 2, 4, 4, 1]
Now, for each SimStage, you want the highest Points value, so I first group by Team and SimStage and then sort by Points.
namesDf = (namesDf.groupby(['Team', 'SimStage'], as_index=False)
                  .apply(lambda x: x.sort_values(['Points'], ascending=False))
                  .reset_index(drop=True))
This makes my dataframe look like this; notice the change in the rows with SimStage 3:
Team SimStage Points
0 Brazil 0 4
1 Brazil 0 4
2 Brazil 1 4
3 Brazil 1 4
4 Brazil 2 4
5 Brazil 2 1
6 Brazil 3 4
7 Brazil 3 2
8 Brazil 4 4
9 Brazil 4 1
And now I remove the duplicates by keeping the first instance of every team and sim stage.
namesDf = namesDf.drop_duplicates(subset=['Team', 'SimStage'], keep = 'first')
Final result:
Team SimStage Points
0 Brazil 0 4
2 Brazil 1 4
4 Brazil 2 4
6 Brazil 3 4
8 Brazil 4 4
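A more direct option (not from either answer; it assumes Points is numeric and the dataframe index is unique) is to select the row holding the maximum Points in each group:
# idxmax returns the index label of the max Points within each (Team, SimStage) group
df.loc[df.groupby(['Team', 'SimStage'])['Points'].idxmax()]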
