Change column value based on calculated mean of other column in python pandas - python-3.x

I am a newbie to pandas. I have gone through many questions, but haven't found an answer.
I have the following dataset:
Name || Price || Cuisine Category || City || Region || Cuisine Types || Rating Types || Rating
Pizza || 600 || Fast Food,Pizza || Ajmer || Ana Saga || Quick Bites || Good || 3.9
... ... ... ... ... ... ... ... ...
Chawla's || 300 || Beverages || Ajmer || Sagar Lake || Cafe || Average || 3.3
Masala || 0 || North,South Indian || Ajmer || Ram Ganj || Mess || None || NEW
I want to change the value of:
Rating where it is NEW, based on the average Rating of that particular Cuisine Type, and then the Rating Type based on the calculated Rating
Price where it is 0, based on the average Price of that particular Region
My try for changing Price:
# read the CSV file
data = pd.read_csv('/content/Ajmer.csv')
# calculate the Region-wise mean of Price
gregion = round(data.groupby('Region')['Price'].mean())
# try to replace 0 in the Price column
data['Price'] = data['Price'].replace(0, gregion[data['Region']])
But my price column is unchanged.
My try for changing Rating:
# read the CSV file
data2 = pd.read_csv('/content/Ajmer.csv')
# create a separate data frame so that it won't affect the mean value
filtered_rating = data2[(data2['Rating'] == 'NEW') | (data2['Rating'] == '-') | (data2['Rating'] == 'Opening')]
# drop those rows from the original data2
data2.drop(data2.loc[data2['Rating']=='NEW'].index, inplace=True)
data2.drop(data2.loc[data2['Rating']=='-'].index, inplace=True)
data2.drop(data2.loc[data2['Rating']=='Opening'].index, inplace=True)
# calculate the Cuisine Types-wise mean of Rating
c = round(data2.groupby('Cuisine Types')['Rating'].mean(), 1)
which gives me output as follows:
Cuisine Types
Bakery 3.4
Confectionery 3.4
Dessert Parlor 3.5
...
Quick Bites 3.4
Sweet Shop 3.4
Name: Rating, dtype: float64
# try to replace the values
filtered_rating['Rating'].replace('NEW', c[data2['Region']], inplace=True)
filtered_rating['Rating'].replace('-', c[data2['Region']], inplace=True)
filtered_rating['Rating'].replace('Opening', c[data2['Region']], inplace=True)
But my Rating column is unchanged.
Expected output:
The mean Price of that particular Region in rows where the Price is zero
The mean Rating of that particular Cuisine Type in rows where the Rating is NEW
Can anyone help me out with this?
Thanks in advance! Any help is much appreciated.

Let's say you have data like the following.
data
name region price cuisine_type rating_type rating
0 pizza NY 500 fast food average 3.3
1 burger NY 350 fast food good 4.1
2 lobster LA 1500 seafood good 4.5
3 mussels LA 1000 seafood average 3.9
4 shawarma NY 300 mediterranean average 3.4
5 kabab LA 600 mediterranean good 4
6 pancake NY 250 breakfast average 3.7
7 waffle LA 450 breakfast good 4.2
8 fries NY 0 fast food None NEW
9 crab LA 0 seafood None Opening
10 tuna sandwich NY 0 seafood None NEW
11 onion rings LA 0 fast food None Opening
Now, according to your question, we need to replace the rating when it is NEW or Opening with the mean rating of the respective cuisine_type, replace the price when it is 0 with the mean price of the respective region, and update the rating type for the None rows at the end.
#get a list of cuisine types
cuisine_type_list=data.cuisine_type.unique().tolist()
cuisine_type_list
['fast food', 'seafood', 'mediterranean', 'breakfast']
#get a list of regions
region_list=data.region.unique().tolist()
region_list
['NY', 'LA']
#replace the ratings
for i in cuisine_type_list:
    data.loc[(data.cuisine_type==i) & (data.rating.isin(['NEW', 'Opening'])), 'rating'] = round(data.loc[(data.cuisine_type==i) & (data.rating.isin(['NEW', 'Opening'])==False)].rating.mean(), 2)
#replace price when 0
for i in region_list:
    data.loc[(data.region==i) & (data.price==0), 'price'] = round(data.loc[(data.region==i) & (data.price!=0)].price.mean(), 2)
#function to assign rating type (assuming good for rating>=4)
def calculate_rating_type(row):
    if row['rating'] >= 4:
        return 'good'
    else:
        return 'average'
#update rating type
data.loc[data.rating_type.isnull(), 'rating_type'] = data.loc[data.rating_type.isnull()].apply(lambda row: calculate_rating_type(row), axis=1)
This is the data after updating:
data
name region price cuisine_type rating_type rating
0 pizza NY 500 fast food average 3.3
1 burger NY 350 fast food good 4.1
2 lobster LA 1500 seafood good 4.5
3 mussels LA 1000 seafood average 3.9
4 shawarma NY 300 mediterranean average 3.4
5 kabab LA 600 mediterranean good 4
6 pancake NY 250 breakfast average 3.7
7 waffle LA 450 breakfast good 4.2
8 fries NY 350 fast food average 3.7
9 crab LA 887.5 seafood good 4.2
10 tuna sandwich NY 350 seafood good 4.2
11 onion rings LA 887.5 fast food average 3.7
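As a side note, the same replacements can be done without explicit loops by treating the placeholders as missing values and using groupby().transform('mean'). A minimal sketch, assuming the same lowercase column names as the example data above:
import numpy as np
import pandas as pd

# turn the placeholder strings into NaN so the rating column becomes numeric
data['rating'] = pd.to_numeric(data['rating'], errors='coerce')
data['price'] = data['price'].replace(0, np.nan)

# fill every missing value with the mean of its own group
data['rating'] = data['rating'].fillna(
    data.groupby('cuisine_type')['rating'].transform('mean').round(2))
data['price'] = data['price'].fillna(
    data.groupby('region')['price'].transform('mean').round(2))

# assign the missing rating types from the freshly computed ratings
mask = data['rating_type'].isnull()
data.loc[mask, 'rating_type'] = np.where(data.loc[mask, 'rating'] >= 4, 'good', 'average')
The group means here automatically ignore the NaN rows, so they match the loop-based means that excluded the NEW and Opening rows.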

You can try the following code:
gregion = round(data.groupby('Region')['Price'].mean())
# convert your group by to DataFrame
gregion = pd.DataFrame(gregion)
gregion.reset_index(inplace=True)
# merge the datas and drop the new column that is created
data = data.merge(gregion, left_on='Region', right_on='Region', suffixes=('_x', ''))
data = data.drop(columns={'Price_x'})
filtered_rating = data[(data['Rating'] == 'NEW') | (data['Rating'] == '-') | (data['Rating'] == 'Opening')]
# you don't need to re-upload the file
data2 = data.copy()
data2.drop(data2.loc[data2['Rating']=='NEW'].index, inplace=True)
data2.drop(data2.loc[data2['Rating']=='-'].index, inplace=True)
data2.drop(data2.loc[data2['Rating']=='Opening'].index, inplace=True)
# do the same with c
c = round(data2.groupby('Cuisine Types')['Rating'].mean(),1)
c = pd.DataFrame(c)
c.reset_index(inplace=True)
filtered_rating = filtered_rating.merge(c, left_on='Cuisine Types', right_on='Cuisine Types', how='left', suffixes=('_x', ''))
filtered_rating = filtered_rating.drop(columns={'Rating_x'})
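To end up with one full frame again, the re-rated rows can be put back together with the rows that were kept; a small sketch, assuming the data2 and filtered_rating built above:
# append the rows with the imputed Rating back to the untouched rows
result = pd.concat([data2, filtered_rating], ignore_index=True, sort=False)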
Hope this helps.

Related

Adding values to a new column in Pandas depending on values in an existing column

I have a pandas dataframe as follows:
Name Age City Country percentage
a Jack 34 Sydney Australia 0.23
b Riti 30 Delhi India 0.45
c Vikas 31 Mumbai India 0.55
d Neelu 32 Bangalore India 0.73
e John 16 New York US 0.91
f Mike 17 las vegas US 0.78
I am planning to add one more column called bucket whose definition depends on the percentage column as follows:
less than 0.25 = 1
between 0.25 and 0.5 = 2
between 0.5 and 0.75 = 3
greater than 0.75 = 4
I tried the inbuilt conditions and choices properties of pandas as follows:
conditions = [(df_obj['percentage'] < .25),
(df_obj['percentage'] >=.25 & df_obj['percentage'] < .5),
(df_obj['percentage'] >=.5 & df_obj['percentage'] < .75),
(df_obj['percentage'] >= .75)]
choices = [1,2,3,4]
df_obj['bucket'] = np.select(conditions, choices)
However, this gives me a random error as follows in the line where I create the conditions:
TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
A quick fix to your code is that you need more parentheses, for example:
((df_obj['percentage'] >=.25) & (df_obj['percentage'] < .5) )
^ ^ ^ ^
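For completeness, the fully parenthesised version of the original attempt might look like this (a sketch that keeps the same df_obj, conditions/choices names, and np.select call):
import numpy as np

conditions = [
    (df_obj['percentage'] < .25),
    (df_obj['percentage'] >= .25) & (df_obj['percentage'] < .5),
    (df_obj['percentage'] >= .5) & (df_obj['percentage'] < .75),
    (df_obj['percentage'] >= .75),
]
choices = [1, 2, 3, 4]
df_obj['bucket'] = np.select(conditions, choices)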
However, I think it's cleaner with pd.cut:
pd.cut(df['percentage'], bins=[0,0.25, 0.5, 0.75, 1],
include_lowest=True, right=False,
labels=[1,2,3,4])
Or since your buckets are linear:
df['bucket'] = (df['percentage']//0.25).add(1).astype(int)
Output
Name Age City Country percentage bucket
a Jack 34 Sydney Australia 0.23 1
b Riti 30 Delhi India 0.45 2
c Vikas 31 Mumbai India 0.55 3
d Neelu 32 Bangalore India 0.73 3
e John 16 New York US 0.91 4
f Mike 17 las vegas US 0.78 4
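One caveat on the floor-division shortcut: a percentage of exactly 1.0 would land in bucket 5, since 1.0 // 0.25 is 4. Clipping keeps such edge values in the intended range (a sketch):
df['bucket'] = (df['percentage'] // 0.25).add(1).clip(upper=4).astype(int)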
I think the easiest/most readable way to do this is to use the apply function:
def percentage_to_bucket(percentage):
    if percentage < .25:
        return 1
    elif percentage >= .25 and percentage < .5:
        return 2
    elif percentage >= .5 and percentage < .75:
        return 3
    else:
        return 4
df["bucket"] = df["percentage"].apply(percentage_to_bucket)
Pandas apply will take each value of a given column and apply the passed function to this value, returning a pandas series with the results, which you can then assign to your new column.

How to compare values in a data frame in pandas [duplicate]

I am trying to calculate the biggest difference between summer gold medal counts and winter gold medal counts relative to their total gold medal count. The problem is that I need to consider only countries that have won at least 1 gold medal in both summer and winter.
Gold: Count of summer gold medals
Gold.1: Count of winter gold medals
Gold.2: Total Gold
This is a sample of my data:
Gold Gold.1 Gold.2 ID diff gold %
Afghanistan 0 0 0 AFG NaN
Algeria 5 0 5 ALG 1.000000
Argentina 18 0 18 ARG 1.000000
Armenia 1 0 1 ARM 1.000000
Australasia 3 0 3 ANZ 1.000000
Australia 139 5 144 AUS 0.930556
Austria 18 59 77 AUT 0.532468
Azerbaijan 6 0 6 AZE 1.000000
Bahamas 5 0 5 BAH 1.000000
Bahrain 0 0 0 BRN NaN
Barbados 0 0 0 BAR NaN
Belarus 12 6 18 BLR 0.333333
This is the code that I have but it is giving the wrong answer:
def answer():
    Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
    df2['difference'] = (df2['Gold']-df2['Gold.1']).abs()/df2['Gold.2']
    return df2['diff gold %'].idxmax()
answer()
Try this code after subbing in the correct (your) function and variable names. I'm new to Python, but I think the issue was that you had to use the same variable in Line 4 (df1['difference']), and just add the method (.idxmax()) to the end. I don't think you need the first line of code for the function, either, as you don't use the local variable (Gold_Y). FYI - I don't think we're working with the same dataset.
def answer_three():
    df1['difference'] = (df1['Gold']-df1['Gold.1']).abs()/df1['Gold.2']
    return df1['difference'].idxmax()
answer_three()
def answer_three():
    atleast_one_gold = df[(df['Gold']>1) & (df['Gold.1']> 1)]
    return ((atleast_one_gold['Gold'] - atleast_one_gold['Gold.1'])/atleast_one_gold['Gold.2']).idxmax()
answer_three()
def answer_three():
    _df = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    return ((_df['Gold'] - _df['Gold.1']) / _df['Gold.2']).argmax()
answer_three()
This looks like a question from the programming assignment of the Coursera course
"Introduction to Data Science in Python".
Having said that, if you are not cheating, "maybe" the bug is here:
Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
You should use the & operator. The | operator means you have countries that have won Gold in either the Summer or Winter olympics.
You should not get a NaN in your diff gold.
def answer_three():
    diff = df['Gold'] - df['Gold.1']
    relativegold = diff.abs()/df['Gold.2']
    df['relativegold'] = relativegold
    x = df[(df['Gold.1']>0) & (df['Gold']>0)]
    return x['relativegold'].idxmax(axis=0)
answer_three()
I am pretty new to Python and to programming as a whole.
So my solution would be the most novice ever!
I love to create variables, so you'll see a lot of them in the solution.
def answer_three():
    a = df.loc[df['Gold'] > 0, 'Gold']
    # Boolean masking that only keeps the values of Gold that match the condition stated in the question; in this case countries who had at least one Gold medal in the summer season Olympics.
    b = df.loc[df['Gold.1'] > 0, 'Gold.1']
    # Same comment as above, but 'Gold.1' is Gold medals in the winter seasons.
    dif = abs(a-b)
    # the abs value of the difference between a and b
    dif.dropna()
    # drops all 'NaN' values in the column
    tots = a + b
    # I only realised that this step wasn't essential because the data frame had already summed it up in the column 'Gold.2'
    tots.dropna()
    result = dif.dropna()/tots.dropna()
    return result.idxmax()
    # returns the index value of the max result
def answer_two():
    df2 = pd.Series.max(df['Gold']-df['Gold.1'])
    df2 = df[df['Gold']-df['Gold.1']==df2]
    return df2.index[0]
answer_two()
def answer_three():
    return ((df[(df['Gold']>0) & (df['Gold.1']>0 )]['Gold'] - df[(df['Gold']>0) & (df['Gold.1']>0 )]['Gold.1'])/df[(df['Gold']>0) & (df['Gold.1']>0 )]['Gold.2']).argmax()

Groupby and calculate count and means based on multiple conditions in Pandas

For the given dataframe as follows:
id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|Nan|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
#copy above data and run below code to reproduce dataframe
df = pd.read_clipboard(sep='|')
I would like to groupby id and address and calculate mean_ratio and result_count based on the following conditions:
mean_ratio: group by id and address and calculate the mean ratio for the rows that meet the following conditions: status is finished and start_date is in the range of 2019-09 and 2019-10
result_count: group by id and address and count the rows that meet the following conditions: status is either finished or failed, and start_date is in the range of 2019-09 and 2019-10
The desired output will like this:
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0
1 2 888 Foster Street 1.32 1
2 3 5 Pawnee Avenue 1.25 1
3 4 916 W. Mill Pond St. 1.44 3
4 5 68 Henry Drive NaN 2
I have tried so far:
# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
In order to filter rows where start_date is in the range of 2019-09 and 2019-10:
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]
To filter row status is finished or failed, I use:
mask = df['status'].str.contains('finished|failed')
df[mask]
But I don't know how to use those to get the final result. Thanks for your help in advance.
I think you need GroupBy.agg, but because some rows are excluded (like id=1), add them back with DataFrame.join on all unique pairs of id and address in df2, and last replace missing values in the result_count column:
df2 = df[['id','address']].drop_duplicates()
print (df2)
id address
0 1 7552 Atlantic Lane
2 2 888 Foster Street
5 3 5 Pawnee Avenue
9 4 916 W. Mill Pond St.
12 5 68 Henry Drive
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
df = df[mask1 & mask]
df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio','mean'),
result_count=('ratio','size'))
df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0.0
2 2 888 Foster Street 1.320000 1.0
5 3 5 Pawnee Avenue 1.250000 1.0
9 4 916 W. Mill Pond St. 1.436667 3.0
12 5 68 Henry Drive NaN 2.0
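If the integer counts from the expected output are wanted, the filled column can be cast back afterwards (a small sketch on the df1 built above):
# fillna turned result_count into float, so cast it back to integers
df1['result_count'] = df1['result_count'].astype(int)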
Some helpers
def mean_ratio(idf):
    # filtering data
    idf = idf[
        (idf['start_date'].between('2019-09-01', '2019-10-31')) &
        (idf['mean_ratio'].notnull()) ]
    return np.round(idf['mean_ratio'].mean(), 2)

def result_count(idf):
    idf = idf[
        (idf['status'].isin(['finished', 'failed'])) &
        (idf['start_date'].between('2019-09-01', '2019-10-31')) ]
    return idf.shape[0]
# We can calculate `mean_ratio` beforehand
df['mean_ratio'] = df['sell_price'] / df['market_price']
df = df.astype({'start_date': np.datetime64, 'end_date': np.datetime64})
# Group the df
g = df.groupby(['id', 'address'])
mean_ratio = g.apply(lambda idf: mean_ratio(idf)).to_frame('mean_ratio')
result_count = g.apply(lambda idf: result_count(idf)).to_frame('result_count')
# Final result
pd.concat((mean_ratio, result_count), axis=1)
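To match the shape of the desired output, the grouped index can be flattened back into regular columns (a sketch on the result above):
# bring id and address back as ordinary columns
result = pd.concat((mean_ratio, result_count), axis=1).reset_index()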

How to fill in between rows gap comparing with other dataframe using pandas?

I want to compare df1 with df2 and fill only the blanks, without overwriting other values. I have no idea how to achieve this without overwriting or creating extra columns.
Can I do this by converting df2 into a dictionary and mapping it onto df1?
df1 = pd.DataFrame({'players name':['ram', 'john', 'ismael', 'sam', 'karan'],
'hobbies':['jog','','photos','','studying'],
'sports':['cricket', 'basketball', 'chess', 'kabadi', 'volleyball']})
df1:
players name hobbies sports
0 ram jog cricket
1 john basketball
2 ismael photos chess
3 sam kabadi
4 karan studying volleyball
And, df2:
df2 = pd.DataFrame({'players name':['jagan', 'mohan', 'john', 'sam', 'karan'],
'hobbies':['riding', 'tv', 'sliding', 'jumping', 'studying']})
df2:
players name hobbies
0 jagan riding
1 mohan tv
2 john sliding
3 sam jumping
4 karan studying
I want output like this:
Try this:
df1['hobbies'] = (df1['players name'].map(df2.set_index('players name')['hobbies'])
.fillna(df1['hobbies']))
df1
Output:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball
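Note that this map also overwrites the hobby of any player who appears in df2 with a non-blank hobby in df1; if only the blank cells should be filled, one possible variant is to mask on the empty strings (a sketch with the same two frames):
# look up each player's hobby in df2, then use it only where df1's hobby is blank
mapped = df1['players name'].map(df2.set_index('players name')['hobbies'])
df1['hobbies'] = df1['hobbies'].mask(df1['hobbies'] == '', mapped)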
If the blank space is a NaN value:
df1 = pd.DataFrame({"players name":["ram","john","ismael","sam","karan"],
"hobbies":["jog",pd.np.NaN,"photos",pd.np.NaN,"studying"],
"sports":["cricket","basketball","chess","kabadi","volleyball"]})
then
dicts = df2.set_index("players name")['hobbies'].to_dict()
df1['hobbies'] = df1['hobbies'].fillna(df1['players name'].map(dicts))
output:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball

Complex pandas aggregation

I have a table as below :
User_ID Cricket Football Chess Video_ID Category Time
1 200 150 100 111 A Morning
1 200 150 100 222 B Morning
1 200 150 100 111 A Afternoon
1 200 150 100 333 A Morning
2 100 160 80 444 C Evening
2 100 160 80 222 C Evening
2 100 160 80 333 A Morning
2 100 160 80 333 A Morning
The above table is a transactional table; each entry represents the transaction of a user watching a video.
For e.g., "User_ID" 1 has watched videos 4 times.
The videos watched are given in "Video_ID": 111, 222, 111, 333.
NOTE :
Video_ID 111 was watched twice by this user.
Cricket, Football, Chess: the values are duplicated for each row, i.e. the number of times "User_ID" 1 played cricket, football, chess is 200, 150, 100 (they are duplicated in the other rows for that particular "User_ID").
Category: which Category that particular Video_ID belongs to.
Time: what time the Video_ID was watched.
I am trying to get the below information from the table :
User_ID Top_1_Game Top_2_Game Top_1_Cat Top_2_Cat Top_Time
1 Cricket Football A B Morning
2 Football Cricket C A Evening
NOTE : If the count of Category is the same, then either one can be kept as Top_1_Cat.
It's a bit complex though; can anyone help with this?
First get top values per groups by User_ID and Video_ID with Series.value_counts and index[0]:
df1 = df.groupby(['User_ID','Video_ID']).agg(lambda x: x.value_counts().index[0])
Then get second top Category by GroupBy.nth:
s = df1.groupby(level=0)['Category'].nth(1)
Remove duplicates by User_ID with DataFrame.drop_duplicates:
df1 = df1.reset_index().drop_duplicates('User_ID').drop('Video_ID', axis=1)
cols = ['User_ID','Category','Time']
cols1 = df1.columns.difference(cols)
Get top2 games by this solution:
df2 = pd.DataFrame((cols1[np.argsort(-df1[cols1].values, axis=1)[:,:2]]),
columns=['Top_1_Game','Top_2_Game'],
index=df1['User_ID'])
Filter Category and Time and rename the column names:
df3 = (df1[cols].set_index('User_ID')
.rename(columns={'Category':'Top_1_Cat','Time':'Top_Time'}))
Join together with DataFrame.join and insert the Top_2_Cat values with DataFrame.insert:
df = df2.join(df3).reset_index()
df.insert(4, 'Top_2_Cat', s.values)
print (df)
User_ID Top_1_Game Top_2_Game Top_1_Cat Top_2_Cat Top_Time
0 1 Cricket Football A B Morning
1 2 Football Cricket C A Evening
