Groupby and calculate count and mean based on multiple conditions in Pandas

For the given dataframe as follows:
id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
#copy above data and run below code to reproduce dataframe
df = pd.read_clipboard(sep='|')
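If you'd rather not rely on the clipboard, the same frame can be built from an inline string (a small reproducibility tweak, not part of the original question):
import io
import pandas as pd

data = '''id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1'''  # ...paste the remaining rows from above
df = pd.read_csv(io.StringIO(data), sep='|')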
I would like to group by id and address and calculate mean_ratio and result_count based on the following conditions:
mean_ratio: group by id and address and take the mean of the sell_price/market_price ratio over the rows that meet these conditions: status is finished and start_date falls in 2019-09 or 2019-10
result_count: group by id and address and count the rows that meet these conditions: status is either finished or failed, and start_date falls in 2019-09 or 2019-10
The desired output will look like this:
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0
1 2 888 Foster Street 1.32 1
2 3 5 Pawnee Avenue 1.25 1
3 4 916 W. Mill Pond St. 1.44 3
4 5 68 Henry Drive NaN 2
What I have tried so far:
# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
In order to filter rows whose start_date falls in 2019-09 or 2019-10:
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]
To filter row status is finished or failed, I use:
mask = df['status'].str.contains('finished|failed')
df[mask]
But I don't know how to combine these pieces to get the final result. Thanks for your help in advance.

I think you need GroupBy.agg, but because some groups are excluded entirely (like id=1), add them back with DataFrame.join against all unique id and address pairs stored in df2, and finally replace the missing values in the result_count column:
df2 = df[['id','address']].drop_duplicates()
print (df2)
id address
0 1 7552 Atlantic Lane
2 2 888 Foster Street
5 3 5 Pawnee Avenue
9 4 916 W. Mill Pond St.
12 5 68 Henry Drive
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
df = df[mask1 & mask]
df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio', 'mean'),
                                        result_count=('ratio', 'size'))
df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0.0
2 2 888 Foster Street 1.320000 1.0
5 3 5 Pawnee Avenue 1.250000 1.0
9 4 916 W. Mill Pond St. 1.436667 3.0
12 5 68 Henry Drive NaN 2.0
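If you prefer result_count as an integer (the join upcasts it to float because of the introduced NaN), a final cast should do it:
df1['result_count'] = df1['result_count'].astype(int)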

Some helpers
def mean_ratio(idf):
    # filtering data; note: status == 'finished' is implied here because
    # the ratio is NaN for the non-finished rows in this data
    idf = idf[
        (idf['start_date'].between('2019-09-01', '2019-10-31')) &
        (idf['mean_ratio'].notnull())]
    return np.round(idf['mean_ratio'].mean(), 2)

def result_count(idf):
    idf = idf[
        (idf['status'].isin(['finished', 'failed'])) &
        (idf['start_date'].between('2019-09-01', '2019-10-31'))]
    return idf.shape[0]

# We can calculate `mean_ratio` beforehand
df['mean_ratio'] = df['sell_price'] / df['market_price']
df = df.astype({'start_date': 'datetime64[ns]', 'end_date': 'datetime64[ns]'})
# Group the df
g = df.groupby(['id', 'address'])
mean_ratio = g.apply(lambda idf: mean_ratio(idf)).to_frame('mean_ratio')
result_count = g.apply(lambda idf: result_count(idf)).to_frame('result_count')
# Final result
pd.concat((mean_ratio, result_count), axis=1)
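If you want id and address back as ordinary columns rather than the index, a reset_index at the end should work (my addition, not part of the original answer):
result = pd.concat((mean_ratio, result_count), axis=1).reset_index()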

Related

How to merge data with duplicates using panda python

I have the two dataframes below, and I'd like to merge them to get Id onto df1. However, with my merge I cannot get the Id when a name appears more than once. df2 has unique names; df1 and df2 differ in rows and columns. My code is below:
df1: Name Region
0 P Asia
1 Q Eur
2 R Africa
3 S NA
4 R Africa
5 R Africa
6 S NA
df2: Name Id
0 P 1234
1 Q 1244
2 R 1233
3 S 1111
code:
x = df1.assign(temp1=df1.groupby('Name').cumcount())
y = df2.assign(temp1=df2.groupby('Name').cumcount())
xy = x.merge(y, on=['Name', 'temp1'], how='left').drop(columns=['temp1'])
the output is:
df1: Name Region Id
0 P Asia 1234
1 Q Eur 1244
2 R Africa 1233
3 S NA 1111
4 R Africa NAN
5 R Africa NAN
6 S NA NAN
How do I find all the id for these duplicate names?
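A minimal sketch of one approach, assuming df2 really does have one row per Name: drop the cumcount helper columns entirely and do a plain left merge, so every duplicate name in df1 picks up the same Id:
xy = df1.merge(df2, on='Name', how='left')
The cumcount trick only matches the k-th duplicate in df1 with the k-th duplicate in df2; since df2 has each name once, every duplicate beyond the first finds no partner and gets NaN.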

pandas: search column values from one df in another df column that contains lists

I need to search for the values from the df1['numsearch'] column in the lists in df2['Numbers']. If a number appears in one of those lists, I want to add the corresponding value from the df2['Score'] column to df1. See the desired output below.
df1 = pd.DataFrame(
    {'Day': ['M', 'Tu', 'W', 'Th', 'Fr', 'Sa', 'Su'],
     'numsearch': ['1', '20', '14', '99', '19', '6', '101']})
df2 = pd.DataFrame(
    {'Letters': ['a', 'b', 'c', 'd'],
     'Numbers': [['1', '2', '3', '4'], ['5', '6', '7', '8'], ['10', '20', '30', '40'], ['11', '12', '13', '14']],
     'Score': ['1.1', '2.2', '3.3', '4.4']})
desired output
Day numsearch Score
0 M 1 1.1
1 Tu 20 3.3
2 W 14 4.4
3 Th 99 "No score"
4 Fr 19 "No score"
5 Sa 6 2.2
6 Su 101 "No score"
I have written a for loop that works with the test data (the column names below come from my real data):
scores = []
for s, ns in enumerate(ppr_data['SN']):
    match = ''
    for k, q in enumerate(jcr_data['All_ISSNs']):
        if ns in q:
            scores.append(jcr_data['Journal Impact Factor'][k])
            match = 1
        else:
            continue
    if match == '':
        scores.append('No score')
    match = ''
df1['Score'] = np.array(scores)
In my small test the above code works, but with larger data files it creates duplicates, so this clearly isn't the best way to do it.
I'm sure there's a more pandas-proper line of code that ends in .fillna("No score") .
I tried to use a loc statement, but I get hung up on searching the values of one dataframe in a column that contains lists.
Can anyone shed some light?
df2 = df2.explode('Numbers')  # explode df2 on Numbers
d = dict(zip(df2.Numbers, df2.Score))  # dict mapping Numbers to Scores
df1['Score'] = df1.numsearch.map(d).fillna('No Score')  # map dict onto df1, filling NaN with No Score
You can shorten it as follows:
df2 = df2.explode('Numbers')  # explode df2 on Numbers
df1['Score'] = df1.numsearch.map(dict(zip(df2.Numbers, df2.Score))).fillna('No Score')
Day numsearch Score
0 M 1 1.1
1 Tu 20 3.3
2 W 14 4.4
3 Th 99 No Score
4 Fr 19 No Score
5 Sa 6 2.2
6 Su 101 No Score
You can try left join and fillna:
df1.merge(df2.explode('Numbers'),
          left_on='numsearch',
          right_on='Numbers', how='left')[['Day', 'numsearch', 'Score']].fillna("No score")
Output:
Day numsearch Score
0 M 1 1.1
1 Tu 20 3.3
2 W 14 4.4
3 Th 99 No score
4 Fr 19 No score
5 Sa 6 2.2
6 Su 101 No score
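One caveat: fillna("No score") above fills NaN in every column of the merged selection. If other columns could also contain NaN, you may prefer to target Score only (a defensive tweak, my assumption):
out = df1.merge(df2.explode('Numbers'),
                left_on='numsearch',
                right_on='Numbers', how='left')[['Day', 'numsearch', 'Score']]
out['Score'] = out['Score'].fillna('No score')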

Adding values to a new column in Pandas depending on values in an existing column

I have a pandas dataframe as follows:
Name Age City Country percentage
a Jack 34 Sydney Australia 0.23
b Riti 30 Delhi India 0.45
c Vikas 31 Mumbai India 0.55
d Neelu 32 Bangalore India 0.73
e John 16 New York US 0.91
f Mike 17 las vegas US 0.78
I am planning to add one more column called bucket whose definition depends on the percentage column as follows:
less than 0.25 = 1
between 0.25 and 0.5 = 2
between 0.5 and 0.75 = 3
greater than 0.75 = 4
I tried np.select with conditions and choices as follows:
conditions = [(df_obj['percentage'] < .25),
              (df_obj['percentage'] >= .25 & df_obj['percentage'] < .5),
              (df_obj['percentage'] >= .5 & df_obj['percentage'] < .75),
              (df_obj['percentage'] >= .75)]
choices = [1, 2, 3, 4]
df_obj['bucket'] = np.select(conditions, choices)
However, this gives me the following error on the line where I create the conditions:
TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
A quick fix to your code is that you need more parentheses, for example:
((df_obj['percentage'] >=.25) & (df_obj['percentage'] < .5) )
^ ^ ^ ^
However, I think it's cleaner with pd.cut:
pd.cut(df['percentage'], bins=[0,0.25, 0.5, 0.75, 1],
include_lowest=True, right=False,
labels=[1,2,3,4])
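To get the integer bucket column shown in the expected output, the result can be assigned directly (the astype(int) is my addition, assuming no percentage is 1.0 or above):
df['bucket'] = pd.cut(df['percentage'], bins=[0, 0.25, 0.5, 0.75, 1],
                      include_lowest=True, right=False,
                      labels=[1, 2, 3, 4]).astype(int)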
Or since your buckets are linear:
df['bucket'] = (df['percentage']//0.25).add(1).astype(int)
Output
Name Age City Country percentage bucket
a Jack 34 Sydney Australia 0.23 1
b Riti 30 Delhi India 0.45 2
c Vikas 31 Mumbai India 0.55 3
d Neelu 32 Bangalore India 0.73 3
e John 16 New York US 0.91 4
f Mike 17 las vegas US 0.78 4
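One caveat on the floor-division trick: a percentage of exactly 1.0 or above would land in bucket 5. If that can occur in your data, clipping keeps it in bucket 4 (my addition, not part of the original answer):
df['bucket'] = df['percentage'].floordiv(0.25).add(1).clip(upper=4).astype(int)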
I think the easiest/most readable way to do this is to use the apply function:
def percentage_to_bucket(percentage):
    if percentage < .25:
        return 1
    elif percentage >= .25 and percentage < .5:
        return 2
    elif percentage >= .5 and percentage < .75:
        return 3
    else:
        return 4

df["bucket"] = df["percentage"].apply(percentage_to_bucket)
Pandas apply will take each value of a given column and apply the passed function to this value, returning a pandas series with the results, which you can then assign to your new column.

How to compare values in a data frame in pandas [duplicate]

I am trying to calculate the biggest difference between summer gold medal counts and winter gold medal counts relative to their total gold medal count. The problem is that I need to consider only countries that have won at least 1 gold medal in both summer and winter.
Gold: Count of summer gold medals
Gold.1: Count of winter gold medals
Gold.2: Total Gold
This a sample of my data:
Gold Gold.1 Gold.2 ID diff gold %
Afghanistan 0 0 0 AFG NaN
Algeria 5 0 5 ALG 1.000000
Argentina 18 0 18 ARG 1.000000
Armenia 1 0 1 ARM 1.000000
Australasia 3 0 3 ANZ 1.000000
Australia 139 5 144 AUS 0.930556
Austria 18 59 77 AUT 0.532468
Azerbaijan 6 0 6 AZE 1.000000
Bahamas 5 0 5 BAH 1.000000
Bahrain 0 0 0 BRN NaN
Barbados 0 0 0 BAR NaN
Belarus 12 6 18 BLR 0.333333
This is the code that I have but it is giving the wrong answer:
def answer():
    Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
    df2['difference'] = (df2['Gold'] - df2['Gold.1']).abs() / df2['Gold.2']
    return df2['diff gold %'].idxmax()
answer()
Try this code after substituting in your own function and variable names. I'm new to Python, but I think the issue was that you had to use the same column in the return line (df1['difference']) and just add .idxmax() to the end. I don't think you need the first line of the function either, since you don't use the local variable (Gold_Y). FYI, I don't think we're working with the same dataset.
def answer_three():
    df1['difference'] = (df1['Gold'] - df1['Gold.1']).abs() / df1['Gold.2']
    return df1['difference'].idxmax()
answer_three()
def answer_three():
    # '> 0' rather than '> 1': the requirement is at least one gold in both seasons
    atleast_one_gold = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    return ((atleast_one_gold['Gold'] - atleast_one_gold['Gold.1']) / atleast_one_gold['Gold.2']).idxmax()
answer_three()
def answer_three():
    _df = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    return ((_df['Gold'] - _df['Gold.1']) / _df['Gold.2']).argmax()

answer_three()
This looks like a question from the programming assignment of the Coursera course "Introduction to Data Science in Python". Having said that, if you are not cheating, maybe the bug is here:
Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
You should use the & operator. The | operator means you keep countries that have won gold in either the Summer or the Winter Olympics. With &, you should not get a NaN in your diff gold %.
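For example, the corrected mask (using & and, arguably, > 0 rather than > 1, since the requirement is at least one gold in both seasons) would be:
Gold_Y = df2[(df2['Gold'] > 0) & (df2['Gold.1'] > 0)]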
def answer_three():
    diff = df['Gold'] - df['Gold.1']
    relativegold = diff.abs() / df['Gold.2']
    df['relativegold'] = relativegold
    x = df[(df['Gold.1'] > 0) & (df['Gold'] > 0)]
    return x['relativegold'].idxmax(axis=0)
answer_three()
I am pretty new to Python and to programming as a whole, so my solution is about as novice as it gets! I love to create variables, so you'll see a lot of them in the solution.
def answer_three():
    a = df.loc[df['Gold'] > 0, 'Gold']
    # boolean masking that keeps only the Gold values matching the condition stated
    # in the question: countries with at least one gold medal in the summer Olympics
    b = df.loc[df['Gold.1'] > 0, 'Gold.1']
    # same comment as above, but 'Gold.1' is gold medals in the winter Olympics
    dif = abs(a - b)
    # the absolute value of the difference between a and b
    tots = a + b
    # I only realised later that this step wasn't essential, because the dataframe
    # already has the sum in the column 'Gold.2'
    result = dif.dropna() / tots.dropna()
    # dropna() drops all NaN values before dividing
    return result.idxmax()
    # returns the index value of the max result
def answer_two():
    df2 = pd.Series.max(df['Gold'] - df['Gold.1'])
    df2 = df[df['Gold'] - df['Gold.1'] == df2]
    return df2.index[0]
answer_two()
def answer_three():
    both = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    return ((both['Gold'] - both['Gold.1']) / both['Gold.2']).argmax()

How to compare row by row in a dataframe

I have a data frame that has a name and the URL ID of the name. For example:
Abc 123
Abc.com 123
Def 345
Pqr 123
PQR.com 123
Here, due to a data extraction error, different names sometimes have the same ID. I want to clean the table so that if the names are different and the ID is the same, the records remain as they are, but if the names are similar and the ID is also the same, they are collapsed to a single name. To be clear, the expected output should be:
Abc.com 123
Abc.com 123
Def 345
PQR.com 123
PQR.com 123
That is, the last pair was a data entry error: both were the same name (the first word of the string was the same), so they are changed to one name based on the ID. But the first and second records, even though they share an ID with the last ones, have names that do not match the last ones at all, so the two pairs are kept separate.
I am not able to understand how to achieve this.
Request some guidance here. Thanks in advance.
Note: The size of the dataset is almost 16 million of such records.
The idea is to use the fuzzy-matching library fuzzywuzzy to compute a ratio for all combinations of names: cross join by DataFrame.merge, remove rows with the same name in both columns by DataFrame.query, and also add a new column with the length of each name by Series.str.len:
from fuzzywuzzy import fuzz
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
df1['len'] = df1['Name_x'].str.len()
print (df1)
Name_x ID Name_y ratio len
1 Abc 123 BCD 0 3
2 BCD 123 Abc 0 3
6 Pqr 789 PQR.com 20 3
7 PQR.com 789 Pqr 20 7
Then filter rows by threshold with boolean indexing. Next it is necessary to choose which value to keep; one possible solution is to take the longer text. So it uses DataFrameGroupBy.idxmax with DataFrame.loc and then DataFrame.set_index to get a Series:
N = 15
df2 = df1[df1['ratio'].gt(N)]
s = df2.loc[df2.groupby('ID')['len'].idxmax()].set_index('ID')['Name_x']
print (s)
ID
789 PQR.com
Name: Name_x, dtype: object
Last, map by ID with Series.map and replace non-matched values with the original names by Series.fillna:
df['Name'] = df['ID'].map(s).fillna(df['Name'])
print (df)
Name ID
0 Abc 123
1 BCD 123
2 Def 345
3 PQR.com 789
4 PQR.com 789
EDIT: If there are multiple valid strings per ID, it is more complicated:
print (df)
Name ID
0 Air Ordnance 1578013421
1 Air-Ordnance.com 1578013421
2 Garreett 1578013421
3 Garrett 1578013421
First get fuzz.ratio like in solution before:
from fuzzywuzzy import fuzz
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
print (df1)
Name_x ID Name_y ratio
1 Air Ordnance 1578013421 Air-Ordnance.com 79
2 Air Ordnance 1578013421 Garreett 30
3 Air Ordnance 1578013421 Garrett 32
4 Air-Ordnance.com 1578013421 Air Ordnance 79
6 Air-Ordnance.com 1578013421 Garreett 25
7 Air-Ordnance.com 1578013421 Garrett 26
8 Garreett 1578013421 Air Ordnance 30
9 Garreett 1578013421 Air-Ordnance.com 25
11 Garreett 1578013421 Garrett 93
12 Garrett 1578013421 Air Ordnance 32
13 Garrett 1578013421 Air-Ordnance.com 26
14 Garrett 1578013421 Garreett 93
Then filter by threshold:
N = 50
df2 = df1[df1['ratio'].gt(N)]
print (df2)
Name_x ID Name_y ratio
1 Air Ordnance 1578013421 Air-Ordnance.com 79
4 Air-Ordnance.com 1578013421 Air Ordnance 79
11 Garreett 1578013421 Garrett 93
14 Garrett 1578013421 Garreett 93
But for more precision it is necessary to specify which strings are valid in a list L, and filter by that list:
L = ['Air-Ordnance.com','Garrett']
df2 = df2.loc[df2['Name_x'].isin(L),['Name_x','Name_y','ID']].rename(columns={'Name_y':'Name'})
print (df2)
Name_x Name ID
4 Air-Ordnance.com Air Ordnance 1578013421
14 Garrett Garreett 1578013421
Last, merge with a left join back to the original and replace missing values:
df = df.merge(df2, on=['Name','ID'], how='left')
df['Name'] = df.pop('Name_x').fillna(df['Name'])
print (df)
Name ID
0 Air-Ordnance.com 1578013421
1 Air-Ordnance.com 1578013421
2 Garrett 1578013421
3 Garrett 1578013421
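Given the roughly 16 million rows mentioned in the question, note that the within-ID cross join can grow quadratically with the number of rows per ID. One possible mitigation (a sketch under that assumption, not part of the original answer) is to deduplicate Name/ID pairs before merging, so each distinct pair is compared only once:
pairs = df.drop_duplicates(['Name', 'ID'])
df1 = pairs.merge(pairs, on='ID').query('Name_x != Name_y')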
