In Python pandas, how to use an outer join with a where condition? - python-3.x

Table 1
S.No BusNo Timings People
1 1234 3:05 pm 55
2 3456 3:30 pm 45
3 8945 3:45 pm 50
Table 2
BusNo Model
1234 Leyland
3456 Viking
Join these tables using pandas with this condition: BusNo, Model, and People count, for People between 50 and 55, grouped by Model.
Expected Output:
Table3
S.No BusNo Timings People Model
1 1234 3:05 pm 55 Leyland
3 8945 3:45 pm 50 NaN

You can do a simple merge on those two dataframes and a condition check inside loc to get the desired output, as shown below.
import pandas as pd

df = pd.DataFrame()
df['S.No'] = [1, 2, 3]
df['BusNo'] = [1234, 3456, 8945]
df['Timings'] = ['3:05 pm', '3:30 pm', '3:45 pm']
df['People'] = [55, 45, 50]

df_ = pd.DataFrame()
df_['BusNo'] = [1234, 3456]          # matches Table 2 (1234 -> Leyland, 3456 -> Viking)
df_['Model'] = ['Leyland', 'Viking']

merged = pd.merge(df, df_, on='BusNo', how='outer')
merged.loc[(merged['People'] >= 50) & (merged['People'] <= 55), :]

I think you need pd.merge and pd.Series.between:
df = pd.merge(df1, df2, on='BusNo', how='outer')
# note: in newer pandas versions, `inclusive` takes a string ('both') rather than a bool
df.loc[df['People'].between(50, 55, inclusive='both'), :]
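The question also asks to group by Model, which neither answer shows. Here is a minimal sketch on top of the merged result above, assuming a count of People per Model is what is wanted:
filtered = merged.loc[merged['People'].between(50, 55), :]
# dropna=False keeps buses with no matching Model (NaN) as their own group
filtered.groupby('Model', dropna=False)['People'].count()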

Related

Filter dataframe on multiple conditions within different columns

I have a sample of the dataframe as given below.
import pandas as pd

data = {'ID': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'Date': ['2021-2-13', '2021-2-14', '2021-2-14', '2021-2-14', '2021-2-15',
                 '2021-2-14', '2021-2-14', '2021-2-15', '2021-2-15'],
        'Modified_Date': ['3/19/2021 6:34:20 PM', '3/20/2021 4:57:39 PM', '3/21/2021 4:57:40 PM',
                          '3/22/2021 4:57:57 PM', '3/23/2021 4:57:41 PM', '3/25/2021 11:44:15 PM',
                          '3/26/2021 2:16:09 PM', '3/20/2021 2:16:04 PM', '3/21/2021 4:57:40 PM'],
        'Steps': [1000, 1200, 1500, 2000, 1400, 4000, 5000, 1000, 3500]}
df1 = pd.DataFrame(data)
df1
The data has to be filtered so that, for each 'ID' and then for each 'Date', the row with the latest 'Modified_Date' is selected.
Example: for ID='A' and Date='2021-02-14', the latest modified date is '3/22/2021 4:57:57 PM', so that row has to be selected.
I have attached a snippet of how the final dataframe should look.
I have been stuck on this for a while.
Try:
df1["Date"] = pd.to_datetime(df1["Date"])
df1["Modified_Date"] = pd.to_datetime(df1["Modified_Date"])
df_out = df1.groupby(["ID", "Date"], as_index=False).apply(
    lambda x: x.loc[x["Modified_Date"].idxmax()]
)
print(df_out)
Prints:
ID Date Modified_Date Steps
0 A 2021-02-13 2021-03-19 18:34:20 1000
1 A 2021-02-14 2021-03-22 16:57:57 2000
2 A 2021-02-15 2021-03-23 16:57:41 1400
3 B 2021-02-14 2021-03-26 14:16:09 5000
4 B 2021-02-15 2021-03-21 16:57:40 3500
Or: .sort_values + .groupby:
df_out = (
    df1.sort_values(["ID", "Date", "Modified_Date"])
    .groupby(["ID", "Date"], as_index=False)
    .last()
)
The easiest/most straightforward is to sort by date and take the last row per group:
(df1.sort_values(by='Modified_Date')
    .groupby(['ID', 'Date'], as_index=False).last()
)
output:
ID Date Modified_Date Steps
0 A 2021-2-13 3/19/2021 6:34:20 PM 1000
1 A 2021-2-14 3/22/2021 4:57:57 PM 2000
2 A 2021-2-15 3/23/2021 4:57:41 PM 1400
3 B 2021-2-14 3/26/2021 2:16:09 PM 5000
4 B 2021-2-15 3/21/2021 4:57:40 PM 3500
You can also sort_values and drop_duplicates:
First convert the 2 series to dates (since they are strings in the example):
df1["Date"] = pd.to_datetime(df1["Date"])
df1["Modified_Date"] = pd.to_datetime(df1["Modified_Date"])
Then sort values on Modified_Date and drop_duplicates keeping the last value:
out = df1.sort_values('Modified_Date').drop_duplicates(['ID','Date'],keep='last')\
.sort_index()
print(out)
ID Date Modified_Date Steps
0 A 2021-02-13 2021-03-19 18:34:20 1000
3 A 2021-02-14 2021-03-22 16:57:57 2000
4 A 2021-02-15 2021-03-23 16:57:41 1400
6 B 2021-02-14 2021-03-26 14:16:09 5000
8 B 2021-02-15 2021-03-21 16:57:40 3500
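Note on the sort-and-take-last approach above: sorting the raw 'Modified_Date' strings only happens to work for this sample, since lexicographic order can differ from chronological order (e.g. '3/26/2021 ...' sorts before '3/9/2021 ...'). Converting to datetime first, as the other answers do, is safer:
df1['Modified_Date'] = pd.to_datetime(df1['Modified_Date'])
out = (df1.sort_values('Modified_Date')
          .groupby(['ID', 'Date'], as_index=False).last())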

Group By Quarterly Avg and Get Values That Were Used in Avg Calculation - pandas

I have a df like this,
time value
0 2019-07-30 124.00
1 2019-07-19 123.00
2 2019-08-28 191.46
3 2019-10-25 181.13
4 2019-11-01 24.23
5 2019-11-13 340.00
6 2020-01-01 36.12
7 2020-01-25 56.12
8 2020-01-30 121.00
9 2020-02-04 115.62
10 2020-02-06 63.62
I want to group by quarterly average and get the values that were used in average calculation. Something like below.
Year Quarter Values Avg
2019 Q3 124, 123, 191 146
2019 Q4 181.13, 24.23, 340 181.78
2020 Q1 36.12, 26.12, 121, 115.62, 63.62 72.96
How can I achieve my desired result?
Use GroupBy.agg with quarter periods created by Series.dt.quarter, joining the values converted to strings and taking the mean, both as named aggregations:
df['time'] = pd.to_datetime(df['time'])
df1 = (df.assign(Year = df['time'].dt.year,
                 Q = 'Q' + df['time'].dt.quarter.astype(str),
                 vals = df['value'].astype(str))
         .groupby(['Year','Q'])
         .agg(Values=('vals', ', '.join), Avg=('value','mean'))
         .reset_index())
print (df1)
Year Q Values Avg
0 2019 Q3 124.0, 123.0, 191.46 146.153333
1 2019 Q4 181.13, 24.23, 340.0 181.786667
2 2020 Q1 36.12, 56.12, 121.0, 115.62, 63.62 78.496000
EDIT:
df['time'] = pd.to_datetime(df['time'])
df1 = (df.groupby(df['time'].dt.to_period('Q').rename('YearQ'))['value']
         .agg([('Values', lambda x: ', '.join(x.astype(str))), ('Avg', 'mean')])
         .reset_index()
         .assign(Year = lambda x: x['YearQ'].dt.year,
                 Q = lambda x: 'Q' + x['YearQ'].dt.quarter.astype(str))
         .reindex(['Year','Q','Values','Avg'], axis=1))
print (df1)
Year Q Values Avg
0 2019 Q3 124.0, 123.0, 191.46 146.153333
1 2019 Q4 181.13, 24.23, 340.0 181.786667
2 2020 Q1 36.12, 56.12, 121.0, 115.62, 63.62 78.496000
Create a grouper, groupby and reshape the index to year and quarter:
import numpy as np

grouper = pd.Grouper(key='time', freq='Q')
res = (df
       .assign(temp = df.value.astype(str))
       .groupby(grouper)
       .agg(Values=('temp', ','.join),
            Avg=('value', np.mean))
      )
res.index = [res.index.year, 'Q' + res.index.quarter.astype(str)]
res.index = res.index.set_names(['Year','Quarter'])
Values Avg
Year Quarter
2019 Q3 123.0,124.0,191.46 146.153333
Q4 181.13,24.23,340.0 181.786667
2020 Q1 36.12,56.12,121.0,115.62,63.62 78.496000
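If flat Year/Quarter columns are preferred, as in the expected output, a reset_index on the result above should be enough (a small follow-up sketch):
res = res.reset_index()
print(res)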

Apply a function to every row of a dataframe and store the data to a list/Dataframe in Python

I have the following simplified version of the code:
import pandas as pd
def myFunction(portf, Val):
    mydata = {portf: [Val, Val * 2, Val * 3, Val * 4]}
    df = pd.DataFrame(mydata, columns=[portf])
    return df
data = {'Portfolio': ['Book1', 'Book2', 'Book1', 'Book2'],
        'Value': [10, 5, 6, 11]}
df_input = pd.DataFrame(data, columns=['Portfolio', 'Value'])
df_output = myFunction(df_input['Portfolio'][0], df_input['Value'][0])
df_output1 = myFunction(df_input['Portfolio'][1], df_input['Value'][1])
df_output2 = myFunction(df_input['Portfolio'][2], df_input['Value'][2])
df_output3 = myFunction(df_input['Portfolio'][3], df_input['Value'][3])
What I would like is to concatenate all the df_output frames into a single list, or even better into a dataframe, in an efficient way, since the df_input dataframe will have 100+ columns.
I tried to apply the following:
df_input.apply(lambda row : myFunction(row['Portfolio'], row['Value']), axis = 1)
However, all the results end up in a single column.
Any idea how to achieve that?
Thanks
You can use pd.concat to store all results in a single dataframe:
pd.concat([myFunction(row['Portfolio'], row['Value'])
           for _, row in df_input.iterrows()], axis=1)
First you build a list of pd.DataFrames with a list comprehension (you could also use a normal loop). Then you concat all DataFrames along axis=1.
Output:
Book1 Book2 Book1 Book2
0 10 5 6 11
1 20 10 12 22
2 30 15 18 33
3 40 20 24 44
You mentioned df_input has many more columns in the original dataframe. To account for this you need another loop (minimal example):
data = {'Portfolio': ['Book1', 'Book2', 'Book1', 'Book2'],
        'Value': [10, 5, 6, 11]}
df_input = pd.DataFrame(data, columns=['Portfolio', 'Value'])
df_input['Value2'] = df_input['Value'] * 100
pd.concat([myFunction(row['Portfolio'], row[col])
           for col in df_input.columns if col != 'Portfolio'
           for (_, row) in df_input.iterrows()], axis=1)
Output:
Book1 Book2 Book1 Book2 Book1 Book2 Book1 Book2
0 10 5 6 11 1000 500 600 1100
1 20 10 12 22 2000 1000 1200 2200
2 30 15 18 33 3000 1500 1800 3300
3 40 20 24 44 4000 2000 2400 4400
You might want to rename the columns or aggregate the resulting dataframe in some other way. But for this I had to guess (and I try not to guess in the face of ambiguity).
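For example, if unique column names are wanted, one option (a sketch, assuming the row position in df_input is an acceptable suffix) is to pass keys to pd.concat and then flatten the resulting MultiIndex:
out = pd.concat([myFunction(row['Portfolio'], row['Value'])
                 for _, row in df_input.iterrows()],
                keys=df_input.index, axis=1)
# columns are now (row index, portfolio) pairs; flatten them into labels like 'Book1_0'
out.columns = [f"{book}_{i}" for i, book in out.columns]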

Groupby and calculate count and means based on multiple conditions in Pandas

Given a dataframe as follows:
id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
# copy the data above and run the code below to reproduce the dataframe
df = pd.read_clipboard(sep='|')
I would like to group by id and address and calculate mean_ratio and result_count based on the following conditions:
mean_ratio: group by id and address and calculate the mean ratio for the rows where status is finished and start_date falls within 2019-09 or 2019-10
result_count: group by id and address and count the rows where status is either finished or failed, and start_date falls within 2019-09 or 2019-10
The desired output will like this:
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0
1 2 888 Foster Street 1.32 1
2 3 5 Pawnee Avenue 1.25 1
3 4 916 W. Mill Pond St. 1.44 3
4 5 68 Henry Drive NaN 2
I have tried so far:
# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
In order to filter start_date to the range 2019-09 to 2019-10:
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]
To filter rows where status is finished or failed, I use:
mask = df['status'].str.contains('finished|failed')
df[mask]
But I don't know how to combine those to get the final result. Thanks for your help in advance.
I think you need GroupBy.agg, but because some rows are excluded (like id=1), add them back with DataFrame.join on all unique id and address pairs in df2, and finally replace missing values in the result_count column:
df2 = df[['id','address']].drop_duplicates()
print (df2)
id address
0 1 7552 Atlantic Lane
2 2 888 Foster Street
5 3 5 Pawnee Avenue
9 4 916 W. Mill Pond St.
12 5 68 Henry Drive
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
df = df[mask1 & mask]
df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio', 'mean'),
                                        result_count=('ratio', 'size'))
df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0.0
2 2 888 Foster Street 1.320000 1.0
5 3 5 Pawnee Avenue 1.250000 1.0
9 4 916 W. Mill Pond St. 1.436667 3.0
12 5 68 Henry Drive NaN 2.0
Some helpers
def mean_ratio(idf):
    # filtering data
    idf = idf[
        (idf['start_date'].between('2019-09-01', '2019-10-31')) &
        (idf['mean_ratio'].notnull())]
    return np.round(idf['mean_ratio'].mean(), 2)

def result_count(idf):
    idf = idf[
        (idf['status'].isin(['finished', 'failed'])) &
        (idf['start_date'].between('2019-09-01', '2019-10-31'))]
    return idf.shape[0]
# We can calculate `mean_ratio` beforehand
df['mean_ratio'] = df['sell_price'] / df['market_price']
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
# Group the df
g = df.groupby(['id', 'address'])
mean_ratio = g.apply(lambda idf: mean_ratio(idf)).to_frame('mean_ratio')
result_count = g.apply(lambda idf: result_count(idf)).to_frame('result_count')
# Final result
pd.concat((mean_ratio, result_count), axis=1)
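To get id and address back as ordinary columns, matching the desired output, a reset_index on the concatenated result should be enough (a small sketch):
final = pd.concat((mean_ratio, result_count), axis=1).reset_index()
print(final)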

How do we add dataframes with same id?

I'm a beginner learning data science. I have gone through the pandas topic and found a task where I can't understand what is going wrong. Let me explain the problem.
I have three data frames:
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]})
Here, I need to add up all the medals into one column, with the country in another. When I added them it was showing NaN, so I filled the NaN with zero values, but I'm still unable to get the desired output.
Code:
gold.set_index('Country', inplace = True)
silver.set_index('Country',inplace = True)
bronze.set_index('Country', inplace = True)
Total = silver.add(gold,fill_value = 0)
Total = bronze.add(silver,fill_value = 0)
Total = gold + silver + bronze
print(Total)
Actual Output:
Medals
Country
France NaN
Germany NaN
Russia NaN
UK NaN
USA 72.0
Expected:
Medals
Country
USA 72.0
France 53.0
UK 27.0
Russia 25.0
Germany 20.0
Let me know what is wrong.
Just do concat with groupby and sum:
pd.concat([gold,silver,bronze]).groupby('Country').sum()
Out[1306]:
Medals
Country
France 53
Germany 20
Russia 25
UK 27
USA 72
Fixing your code: the final line Total = gold + silver + bronze throws away the earlier add results and uses plain +, which produces NaN for any country missing from one of the frames. Chain the adds with fill_value instead:
silver.add(gold, fill_value=0).add(bronze, fill_value=0)
If we expect floating point:
pd.concat([gold, silver, bronze]).groupby('Country').sum().astype(float)
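If the descending order of the expected output also matters, the same result can be sorted (a small sketch):
pd.concat([gold, silver, bronze]).groupby('Country').sum().sort_values('Medals', ascending=False)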
# For a video solution of the code, copy-paste the following link on your browser:
# https://youtu.be/p0cnApQDotA
import numpy as np
import pandas as pd
# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
                       'Medals': [40, 28, 27]})
# Set the index of the dataframes to 'Country' so that you can get the countrywise
# medal count
gold.set_index('Country', inplace = True)
silver.set_index('Country', inplace = True)
bronze.set_index('Country', inplace = True)
# Add the three dataframes and set the fill_value argument to zero to avoid getting
# NaN values
total = gold.add(silver, fill_value = 0).add(bronze, fill_value = 0)
# Sort the resultant dataframe in a descending order
total = total.sort_values(by = 'Medals', ascending = False)
# Print the sorted dataframe
print(total)
