Filter dataframe on multiple conditions within different columns - python-3.x

I have a sample of the dataframe as given below.
import pandas as pd

data = {'ID':['A', 'A', 'A', 'A', 'A', 'B','B','B','B'],
'Date':['2021-2-13', '2021-2-14', '2021-2-14', '2021-2-14', '2021-2-15', '2021-2-14', '2021-2-14', '2021-2-15', '2021-2-15'],
'Modified_Date':['3/19/2021 6:34:20 PM','3/20/2021 4:57:39 PM', '3/21/2021 4:57:40 PM', '3/22/2021 4:57:57 PM', '3/23/2021 4:57:41 PM',
'3/25/2021 11:44:15 PM','3/26/2021 2:16:09 PM', '3/20/2021 2:16:04 PM', '3/21/2021 4:57:40 PM'],
'Steps': [1000, 1200, 1500, 2000, 1400, 4000, 5000,1000, 3500]}
df1 = pd.DataFrame(data)
df1
This data has to be filtered in such a way that, first for each 'ID' and then for each 'Date', the row with the latest 'Modified_Date' is selected.
Ex: For ID='A' and Date='2021-2-14', the latest/last modified date is '3/22/2021 4:57:57 PM', so this row has to be selected.
I have attached a snippet of how the final dataframe should look.
I have been stuck on this for a while.

Try:
df1["Date"] = pd.to_datetime(df1["Date"])
df1["Modified_Date"] = pd.to_datetime(df1["Modified_Date"])
df_out = df1.groupby(["ID", "Date"], as_index=False).apply(
    lambda x: x.loc[x["Modified_Date"].idxmax()]
)
print(df_out)
Prints:
ID Date Modified_Date Steps
0 A 2021-02-13 2021-03-19 18:34:20 1000
1 A 2021-02-14 2021-03-22 16:57:57 2000
2 A 2021-02-15 2021-03-23 16:57:41 1400
3 B 2021-02-14 2021-03-26 14:16:09 5000
4 B 2021-02-15 2021-03-21 16:57:40 3500
Or: .sort_values + .groupby:
df_out = (
    df1.sort_values(["ID", "Date", "Modified_Date"])
    .groupby(["ID", "Date"], as_index=False)
    .last()
)

The easiest/most straightforward is to sort by date and take the last row per group:
(df1.sort_values(by='Modified_Date')
.groupby(['ID', 'Date'], as_index=False).last()
)
output:
ID Date Modified_Date Steps
0 A 2021-2-13 3/19/2021 6:34:20 PM 1000
1 A 2021-2-14 3/22/2021 4:57:57 PM 2000
2 A 2021-2-15 3/23/2021 4:57:41 PM 1400
3 B 2021-2-14 3/26/2021 2:16:09 PM 5000
4 B 2021-2-15 3/21/2021 4:57:40 PM 3500
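One caveat worth noting (my addition, not from the original answer): this sorts Modified_Date as plain strings, which happens to work on the sample but can misorder timestamps in general; converting with pd.to_datetime first is safer. A minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['A', 'A'], 'Date': ['2021-2-14', '2021-2-14'],
                    'Modified_Date': ['3/9/2021 4:00:00 PM', '3/22/2021 4:57:57 PM'],
                    'Steps': [1200, 2000]})

# As plain strings, '3/22/...' sorts before '3/9/...' ('2' < '9'), so a string
# sort would make .last() pick the 3/9 row. Converting first avoids that.
df1['Modified_Date'] = pd.to_datetime(df1['Modified_Date'])
out = (df1.sort_values('Modified_Date')
          .groupby(['ID', 'Date'], as_index=False).last())
```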

You can also sort_values and drop_duplicates:
First convert the two series to datetimes (they are strings in the example):
df1["Date"] = pd.to_datetime(df1["Date"])
df1["Modified_Date"] = pd.to_datetime(df1["Modified_Date"])
Then sort values on Modified_Date and drop_duplicates, keeping the last value per group:
out = (df1.sort_values('Modified_Date')
          .drop_duplicates(['ID', 'Date'], keep='last')
          .sort_index())
print(out)
ID Date Modified_Date Steps
0 A 2021-02-13 2021-03-19 18:34:20 1000
3 A 2021-02-14 2021-03-22 16:57:57 2000
4 A 2021-02-15 2021-03-23 16:57:41 1400
6 B 2021-02-14 2021-03-26 14:16:09 5000
8 B 2021-02-15 2021-03-21 16:57:40 3500
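An equivalent approach (my addition, not from the answers above) that avoids sorting entirely: take groupby(...).idxmax() on the converted timestamps and index with .loc. A sketch using the question's sample data:

```python
import pandas as pd

data = {'ID': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'Date': ['2021-2-13', '2021-2-14', '2021-2-14', '2021-2-14', '2021-2-15',
                 '2021-2-14', '2021-2-14', '2021-2-15', '2021-2-15'],
        'Modified_Date': ['3/19/2021 6:34:20 PM', '3/20/2021 4:57:39 PM',
                          '3/21/2021 4:57:40 PM', '3/22/2021 4:57:57 PM',
                          '3/23/2021 4:57:41 PM', '3/25/2021 11:44:15 PM',
                          '3/26/2021 2:16:09 PM', '3/20/2021 2:16:04 PM',
                          '3/21/2021 4:57:40 PM'],
        'Steps': [1000, 1200, 1500, 2000, 1400, 4000, 5000, 1000, 3500]}
df1 = pd.DataFrame(data)
df1['Modified_Date'] = pd.to_datetime(df1['Modified_Date'])

# idxmax returns the row label of the latest Modified_Date in each group;
# .loc then pulls exactly those rows, keeping the original dtypes and columns.
out = df1.loc[df1.groupby(['ID', 'Date'])['Modified_Date'].idxmax()]
```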

Related

Making few columns into one column if certain conditions are fulfilled

I have an exercise in which I need to turn a few or several rows into one row if they have the same data in three columns.
substances = pd.DataFrame({'id': ['id_1', 'id_1', 'id_1', 'id_2', 'id_3'],
'part': ['1', '1', '2', '2', '3'],
'sub': ['paracetamolum', 'paracetamolum', 'ibuprofenum', 'dienogestum', 'etynyloestradiol'],
'strength': ['150', '50', '50', '20', '30'],
'unit' : ['mg', 'mg', 'mg', 'mg', 'mcg'],
'other irrelevant columns for this task' : ['sth1' , 'sth2', 'sth3', 'sth4', 'sth5']
})
Now, provided that id, part and sub are the same, I am supposed to merge them into one row, so the end result is:
id    part  strength  substance         unit
id_1  1     150 # 50  paracetamolum     mg
id_1  2     50        ibuprofenum       mg
id_2  2     20        dienogestum       mg
id_3  3     30        etynyloestradiol  mcg
The issue I have is joining these rows into one row so the possible strengths show like '150 # 50'. I have tried something like this, but it is not going great:
substances = substances.groupby('id', 'part', 'sub', 'strength').id.apply(lambda x: str(substances['strength']) + ' # ' + str(next(substances['strength'])))
df = substances.groupby(['id','part','sub','unit']).agg({'strength':' # '.join}).reset_index()
df = df[['id','part','strength', 'sub','unit']]
print(df)
output:
id part strength sub unit
0 id_1 1 150 # 50 paracetamolum mg
1 id_1 2 50 ibuprofenum mg
2 id_2 2 20 dienogestum mg
3 id_3 3 30 etynyloestradiol mcg
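The ' # '.join aggregation relies on strength holding strings (as it does in the question's data). If strength were numeric, the join would raise a TypeError; a sketch of the cast you would add in that case:

```python
import pandas as pd

substances = pd.DataFrame({
    'id': ['id_1', 'id_1', 'id_1', 'id_2', 'id_3'],
    'part': ['1', '1', '2', '2', '3'],
    'sub': ['paracetamolum', 'paracetamolum', 'ibuprofenum', 'dienogestum', 'etynyloestradiol'],
    'strength': [150, 50, 50, 20, 30],  # numeric here, unlike the question's strings
    'unit': ['mg', 'mg', 'mg', 'mg', 'mcg'],
})

# ' # '.join only accepts strings, so cast strength before aggregating
out = (substances
       .assign(strength=substances['strength'].astype(str))
       .groupby(['id', 'part', 'sub', 'unit'], as_index=False)
       .agg({'strength': ' # '.join}))
```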

Filter rows of 1st Dataframe from the 2nd Dataframe having different starting dates

I have two dataframes from which a new dataframe has to be created.
The first one is given below.
data = {'ID':['A', 'A', 'A', 'A', 'A', 'B','B','B','B', 'C','C','C','C','C','C', 'D','D','D'],
'Date':['2021-2-13', '2021-2-14', '2021-2-15', '2021-2-16', '2021-2-17', '2021-2-16', '2021-2-17', '2021-2-18', '2021-2-19',
'2021-2-12', '2021-2-13', '2021-2-14', '2021-2-15', '2021-2-16','2021-2-17', '2021-2-14', '2021-2-15', '2021-2-16'],
'Steps': [1000, 1200, 1500, 2000, 1400, 4000,3400, 5000,1000, 3500,4000,5000,5300,2000,3500, 5000,5500,5200 ]}
df1 = pd.DataFrame(data)
df1
The image of this is also attached.
The 2nd dataframe contains the starting date of each participant as given and shown below.
data1 = {'ID':['A', 'B', 'C', 'D'],
'Date':['2021-2-15', '2021-2-17', '2021-2-16', '2021-2-15']}
df2 = pd.DataFrame(data1)
df2
The snippet of it is given below.
Now, the resulting dataframe has to be such that, for each participant in Dataframe 1, the rows start from the date given in the 2nd dataframe. The rows prior to that starting date have to be deleted.
The final dataframe as in how it should look is given below.
Any help is greatly appreciated.
Thanks
You can use .merge + boolean-indexing:
df1["Date"] = pd.to_datetime(df1["Date"])
df2["Date"] = pd.to_datetime(df2["Date"])
x = df1.merge(df2, on="ID", suffixes=("", "_y"))
print(x.loc[x.Date >= x.Date_y, df1.columns].reset_index(drop=True))
Prints:
ID Date Steps
0 A 2021-02-15 1500
1 A 2021-02-16 2000
2 A 2021-02-17 1400
3 B 2021-02-17 3400
4 B 2021-02-18 5000
5 B 2021-02-19 1000
6 C 2021-02-16 2000
7 C 2021-02-17 3500
8 D 2021-02-15 5500
9 D 2021-02-16 5200
Or, if some ID is missing in df2:
x = df1.merge(df2, on="ID", suffixes=("", "_y"), how="outer").fillna(pd.Timestamp(0))
print(x.loc[x.Date >= x.Date_y, df1.columns].reset_index(drop=True))
If the ID in df2 is unique, you could map df2 to df1, compare the dates, and use the boolean series to index df1:
df1.loc[df1.Date >= df1.ID.map(df2.set_index('ID').squeeze())]
ID Date Steps
2 A 2021-02-15 1500
3 A 2021-02-16 2000
4 A 2021-02-17 1400
6 B 2021-02-17 3400
7 B 2021-02-18 5000
8 B 2021-02-19 1000
13 C 2021-02-16 2000
14 C 2021-02-17 3500
16 D 2021-02-15 5500
17 D 2021-02-16 5200
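A reduced, self-contained sketch of the map-based filter (my addition); note it selects the Date column explicitly instead of using .squeeze(), which only works while df2 has exactly one non-index column:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['A', 'A', 'B'],
                    'Date': pd.to_datetime(['2021-02-14', '2021-02-15', '2021-02-17']),
                    'Steps': [1200, 1500, 3400]})
df2 = pd.DataFrame({'ID': ['A', 'B'],
                    'Date': pd.to_datetime(['2021-02-15', '2021-02-17'])})

# For every row of df1, look up the start date of that row's ID in df2,
# then keep only rows on or after that start date
start = df1['ID'].map(df2.set_index('ID')['Date'])
out = df1.loc[df1['Date'] >= start]
```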

Adding total rows to a Pandas DataFrame

I define a Pandas DataFrame containing several deposit/withdrawal rows for different owners. I want to add a total row for each owner totalizing the deposits/withdrawals as well as the yield amounts generated by each capital amount.
Here's the result of the code below:
Here's my code:
import pandas as pd
OWNER = 'OWNER'
DEPWITHDR = 'DEP/WITHDR'
DATEFROM = 'DATE FROM'
DATETO = 'DATE TO'
CAPITAL = 'CAPITAL'
YIELD = 'YIELD AMT'
TOTAL = 'TOTAL'
df = pd.DataFrame({
OWNER: 2*['JOE']+3*['ROB'],
DEPWITHDR: [10000, 20000, 4000, 20000, -8000],
CAPITAL: [10000, 30000, 4000, 24000, 16000],
DATEFROM: ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-03', '2021-01-04'],
DATETO: ['2021-01-01', '2021-01-05', '2021-01-02', '2021-01-03', '2021-01-05'],
YIELD: [100, 1200, 80, 240, 320]
})
print('SOURCE DATAFRAME\n')
print(df)
print()
newDf = pd.DataFrame(columns=[OWNER, DEPWITHDR, CAPITAL, DATEFROM, DATETO, YIELD])
currentOwner = df.loc[1, OWNER]
# using groupby function to compute the two columns totals
dfTotal = df.groupby([OWNER]).agg({DEPWITHDR:'sum', YIELD:'sum'}).reset_index()
totalIndex = 0
# deactivating SettingWithCopyWarning caused by totalRow[OWNER] += ' total'
pd.set_option('mode.chained_assignment', None)
for index, row in df.iterrows():
if currentOwner == row[OWNER]:
newDf = newDf.append({OWNER: row[OWNER],
DEPWITHDR: row[DEPWITHDR],
CAPITAL: row[CAPITAL],
DATEFROM: row[DATEFROM],
DATETO: row[DATETO],
YIELD: row[YIELD]}, ignore_index=True)
else:
totalRow = dfTotal.loc[totalIndex]
totalRow[OWNER] += ' total'
newDf = newDf.append(totalRow, ignore_index=True)
totalIndex += 1
newDf = newDf.append({OWNER: '',
DEPWITHDR: '',
CAPITAL: '',
DATEFROM: '',
DATETO: '',
YIELD: ''}, ignore_index=True)
newDf = newDf.append({OWNER: row[OWNER],
DEPWITHDR: row[DEPWITHDR],
CAPITAL: row[CAPITAL],
DATEFROM: row[DATEFROM],
DATETO: row[DATETO],
YIELD: row[YIELD]}, ignore_index=True)
currentOwner = row[OWNER]
totalRow = dfTotal.loc[totalIndex]
totalRow[OWNER] += ' total'
newDf = newDf.append(totalRow, ignore_index=True)
print('TARGET DATAFRAME\n')
print(newDf.fillna(''))
My question is: what is a better, more pandas-friendly way to obtain the desired result?
You can use groupby and concat:
df_total = pd.concat((
df,
df.replace({o: o + ' total' for o in df[OWNER].unique()}).groupby(OWNER).agg({DEPWITHDR: sum, YIELD: sum}).reset_index())
).fillna('').reset_index().sort_values([OWNER, DATEFROM, DATETO])
In detail:
df.replace({o: o + ' total' for o in df[OWNER].unique()}): replace each occurrence of every owner's name with the name itself plus the string ' total' (e.g., 'JOE' -> 'JOE total'), so that the result of the groupby will carry those values in the OWNER column.
groupby(OWNER).agg({DEPWITHDR: sum, YIELD: sum}): get the sum of the columns DEPWITHDR and YIELD for each owner.
pd.concat(...).fillna('').reset_index().sort_values([OWNER, DATEFROM, DATETO]): concatenate the original DataFrame with the totals, then sort rows by OWNER, then DATEFROM, then DATETO, so that each owner's total row lands after that owner's data rows (because it ends with ' total') and the rows are chronologically sorted by DATEFROM, DATETO.
Here df_total:
index OWNER DEP/WITHDR CAPITAL DATE FROM DATE TO YIELD AMT
0 0 JOE 10000 10000 2021-01-01 2021-01-01 100
1 1 JOE 20000 30000 2021-01-02 2021-01-05 1200
5 0 JOE total 30000 1300
2 2 ROB 4000 4000 2021-01-01 2021-01-02 80
3 3 ROB 20000 24000 2021-01-03 2021-01-03 240
4 4 ROB -8000 16000 2021-01-04 2021-01-05 320
6 1 ROB total 16000 640
IMHO, I'd create a separate DataFrame for each owner, with only his/her data, plus a summary DataFrame with the totals per owner. But maybe, in your use case, this is the best solution.
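Worth adding (not in the original answers): DataFrame.append was removed in pandas 2.0, so the question's loop would need rewriting anyway. A per-owner pd.concat sketch with a reduced set of columns:

```python
import pandas as pd

df = pd.DataFrame({
    'OWNER': 2 * ['JOE'] + 3 * ['ROB'],
    'DEP/WITHDR': [10000, 20000, 4000, 20000, -8000],
    'YIELD AMT': [100, 1200, 80, 240, 320],
})

# Per owner: keep the original rows, then append one total row
pieces = []
for owner, grp in df.groupby('OWNER'):
    total = grp[['DEP/WITHDR', 'YIELD AMT']].sum().to_frame().T
    total.insert(0, 'OWNER', owner + ' total')
    pieces.append(pd.concat([grp, total], ignore_index=True))
out = pd.concat(pieces, ignore_index=True)
```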

Pandas Dataframe: Reduce the value of a 'Days' by 1 if the corresponding 'Year' is a leap year

If 'Days' is greater than e.g. 10 and the corresponding 'Year' is a leap year, reduce 'Days' by 1 in that particular row only. I tried some operations but couldn't do it. I am new to pandas. Appreciate any help.
sample data:
data = [['1', '2005'], ['2', '2006'], ['3', '2008'], ['50', '2009'], ['70', '2008']]
df = pd.DataFrame(data, columns=['Days', 'Year'])
I want 'Days' of row 5 to become 69 and everything else remains the same.
In [98]: import calendar

In [99]: data = [['1', '2005'], ['2', '2006'], ['3', '2008'], ['50', '2009'], ['70', '2008']]; df = pd.DataFrame(data, columns=['Days', 'Year'])

In [100]: df = df.astype(int)

In [102]: df["New_Days"] = df.apply(lambda x: x["Days"] - 1 if (x["Days"] > 10 and calendar.isleap(x["Year"])) else x["Days"], axis=1)

In [103]: df
Out[103]:
Days Year New_Days
0 1 2005 1
1 2 2006 2
2 3 2008 3
3 50 2009 50
4 70 2008 69
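The apply-based answer works row by row; as an optional vectorized alternative (my addition), the calendar.isleap test can be written as boolean arithmetic so no Python-level loop is needed:

```python
import pandas as pd

data = [['1', '2005'], ['2', '2006'], ['3', '2008'], ['50', '2009'], ['70', '2008']]
df = pd.DataFrame(data, columns=['Days', 'Year']).astype(int)

# Leap year: divisible by 4, except century years not divisible by 400
leap = (df['Year'] % 4 == 0) & ((df['Year'] % 100 != 0) | (df['Year'] % 400 == 0))
# Subtract 1 exactly where both conditions hold (True casts to 1, False to 0)
df['New_Days'] = df['Days'] - ((df['Days'] > 10) & leap).astype(int)
```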

In Python pandas, how to use an outer join with a where condition?

Table 1
S.No BusNo Timings People
1 1234 3:05 pm 55
2 3456 3:30 pm 45
3 8945 3:45 pm 50
Table 2
BusNo Model
1234 Leyland
3456 Viking
Join the tables using pandas with this condition: keep BusNo, Model and People for People between 50 and 55, and group by Model.
Expected Output:
Table3
S.No BusNo Timings People Model
1 1234 3:05 pm 55 Leyland
3 8945 3:45 pm 50 NaN
You can do a simple merge on those two dataframes and a simple condition check inside loc to get the desired output, as shown below.
df = pd.DataFrame()
df['S.No'] = [1, 2, 3]
df['BusNo'] = [1234, 3456, 8945]
df['Timings'] = ['3:05 pm', '3:30 pm', '3:45 pm']
df['People'] = [55, 45, 50]
df_ = pd.DataFrame()
df_['BusNo'] = [1234, 8945]
df_['Model'] = ['Leyland', 'viking']
merged = pd.merge(df, df_, on='BusNo', how='outer')
merged.loc[(merged['People'] >= 50) & (merged['People'] <= 55), :]
I think you need pd.merge and pd.Series.between:
df=pd.merge(df1,df2,on='BusNo',how='outer')
df.loc[df['People'].between(50, 55, inclusive='both'), :]
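Putting the second answer together as a self-contained sketch (my addition) with data matching Table 1 and Table 2; the outer merge keeps bus 8945 with a NaN model, and between() then filters on People:

```python
import pandas as pd

df1 = pd.DataFrame({'S.No': [1, 2, 3],
                    'BusNo': [1234, 3456, 8945],
                    'Timings': ['3:05 pm', '3:30 pm', '3:45 pm'],
                    'People': [55, 45, 50]})
df2 = pd.DataFrame({'BusNo': [1234, 3456], 'Model': ['Leyland', 'Viking']})

# Outer merge keeps bus 8945 even though it has no Model; between() (inclusive
# on both ends by default) then keeps only rows with 50 <= People <= 55
out = df1.merge(df2, on='BusNo', how='outer')
out = out.loc[out['People'].between(50, 55)]
```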