How to categorize salary into high/med/low groups in Python? - python-3.x

I have an employee dataset with salary details. I would like to add an additional column showing each employee's salary group (high/med/low):
Data:
Empno   Sal  Deptno
    1   800      20
    2  1600      30
    3  2975      20
    4  1250      30
    5  2850      30
    6  2450      10
    7  3000      20
Expected Output:
Empno   Sal  Deptno  Sal_Group
    1   800      20  low
    2  1600      30  med
    3  2975      20  ...
    4  1250      30  ...
    5  2850      30  ...
    6  2450      10  ...
    7  3000      20  high

You can try this:
import pandas as pd
import numpy as np

df = pd.read_csv("file.csv")

# Three equal-width bins spanning the salary range (4 edges -> 3 bins)
bins = np.linspace(df['Sal'].min(), df['Sal'].max(), 4)
groupNames = ["low", "med", "high"]
df['Sal_Group'] = pd.cut(df['Sal'], bins, labels=groupNames, include_lowest=True)
print(df)
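If you want three groups of roughly equal size instead of three equal-width salary bands, pd.qcut is a drop-in alternative; a minimal sketch reusing the df and labels above:
# Quantile-based binning: each group gets about the same number of employees
df['Sal_Group'] = pd.qcut(df['Sal'], q=3, labels=["low", "med", "high"])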

Related

Calculate Percentage using Pandas DataFrame

Of all the medals won by these 5 countries across all Olympics, what is the percentage of medals won by each of them?
I have combined all the Excel files into one using a pandas DataFrame, but now I am stuck on finding the percentage.
Country Gold Silver Bronze Total
0 USA 10 13 11 34
1 China 2 2 4 8
2 UK 1 0 1 2
3 Germany 12 16 8 36
4 Australia 2 0 0 2
0 USA 9 9 7 25
1 China 2 4 5 11
2 UK 0 1 0 1
3 Germany 11 12 6 29
4 Australia 1 0 1 2
0 USA 9 15 13 37
1 China 5 2 4 11
2 UK 1 0 0 1
3 Germany 10 13 7 30
4 Australia 2 1 0 3
Combined data sheet
Code that I have tried till now:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame()
for f in ['E:\\olympics\\Olympics-2002.xlsx', 'E:\\olympics\\Olympics-2006.xlsx',
          'E:\\olympics\\Olympics-2010.xlsx', 'E:\\olympics\\Olympics-2014.xlsx',
          'E:\\olympics\\Olympics-2018.xlsx']:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
df.to_excel("E:\\olympics\\combineddata.xlsx")
data = pd.read_excel("E:\\olympics\\combineddata.xlsx")
print(data)
final_Data = {}
for i in data['Country']:
    x = i
    t1 = (data[(data.Country == x)].Total).tolist()
    print("Name of Country=", i, int(sum(t1)))
    final_Data.update({i: int(sum(t1))})
t3 = data.groupby('Country').Total.sum()
t2 = df['Total'].sum()
t4 = t3 / t2 * 100
print(t3)
print(t2)
print(t4)
This is how I got the answer. Now I need to plot it; I want to put it in a pie chart.
Let's assume you have created the DataFrame as 'df'. Then you can do the following to first group by and then calculate percentages.
df = df.groupby('Country').sum()
df['Gold_percent'] = (df['Gold'] / df['Gold'].sum()) * 100
df['Silver_percent'] = (df['Silver'] / df['Silver'].sum()) * 100
df['Bronze_percent'] = (df['Bronze'] / df['Bronze'].sum()) * 100
df['Total_percent'] = (df['Total'] / df['Total'].sum()) * 100
df = df.round(2)  # round does not modify in place, so reassign
print (df)
The output will be as follows:
Gold Silver Bronze ... Silver_percent Bronze_percent Total_percent
Country ...
Australia 5 1 1 ... 1.14 1.49 3.02
China 9 8 13 ... 9.09 19.40 12.93
Germany 33 41 21 ... 46.59 31.34 40.95
UK 2 1 1 ... 1.14 1.49 1.72
USA 28 37 31 ... 42.05 46.27 41.38
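Since the original poster also asked how to put the result in a pie chart, here is a minimal matplotlib sketch using the Total_percent column computed above (column names as in the answer):
import matplotlib.pyplot as plt

# One slice per country, sized by its share of the total medal count
df['Total_percent'].plot(kind='pie', autopct='%1.1f%%')
plt.ylabel('')
plt.title('Share of total medals by country')
plt.show()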
I don't have the exact dataset you have, so I am explaining with a similar one. Try adding a column with the sum of medals across rows, then find the percentage by dividing each row by the sum of the entire column. I am posting this as a model; check it:
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'ExshowroomPrice': [21000, 26000, 28000, 34000],
        'RTOPrice': [2200, 250, 2700, 3500]}
df = pd.DataFrame(cars, columns=['Brand', 'ExshowroomPrice', 'RTOPrice'])
print(df)
Brand ExshowroomPrice RTOPrice
0 Honda Civic 21000 2200
1 Toyota Corolla 26000 250
2 Ford Focus 28000 2700
3 Audi A4 34000 3500
df['percentage'] = ((df.ExshowroomPrice + df.RTOPrice) * 100
                    / (df.ExshowroomPrice.sum() + df.RTOPrice.sum()))
print(df)
Brand ExshowroomPrice RTOPrice percentage
0 Honda Civic 21000 2200 19.719507
1 Toyota Corolla 26000 250 22.311942
2 Ford Focus 28000 2700 26.094348
3 Audi A4 34000 3500 31.874203
Hope it's clear.
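As a quick sanity check on the model above, the percentage column should sum to 100, up to floating-point rounding:
print(df['percentage'].sum())   # expected: ~100.0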

Creating aggregate columns in a pandas dataframe

I have a pandas dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ORDER':["A", "A", "B", "B"], 'var1':[2, 3, 1, 5],'a1_bal':[1,2,3,4], 'a1c_bal':[10,22,36,41], 'b1_bal':[1,2,33,4], 'b1c_bal':[11,22,3,4], 'm1_bal':[15,2,35,4]})
df
ORDER var1 a1_bal a1c_bal b1_bal b1c_bal m1_bal
0 A 2 1 10 1 11 15
1 A 3 2 22 2 22 2
2 B 1 3 36 33 3 35
3 B 5 4 41 4 4 4
I want to create new columns as below:
a1_final_bal = sum(a1_bal, a1c_bal)
b1_final_bal = sum(b1_bal, b1c_bal)
m1_final_bal = m1_bal (since we only have an m1_bal field and no m1c_bal, it will remain as it is)
I don't want to hardcode this step, because there might be more such columns, e.g. "c_bal", "m2_bal", "m2c_bal", etc.
My final data should look something like below
ORDER var1 a1_bal a1c_bal b1_bal b1c_bal m1_bal a1_final_bal b1_final_bal m1_final_bal
0 A 2 1 10 1 11 15 11 12 15
1 A 3 2 22 2 22 2 24 24 2
2 B 1 3 36 33 3 35 38 36 35
3 B 5 4 41 4 4 4 45 8 4
You could try something like this. I am not sure if it's exactly what you are looking for, but I think it should work.
dfforgroup = df.set_index(['ORDER', 'var1'])       # creates a MultiIndex
dfforgroup.columns = dfforgroup.columns.str[:2]    # takes the first two letters of the remaining columns
# group the columns by their first two letters and sum them up
df2 = (dfforgroup.groupby(dfforgroup.columns, axis=1).sum()
       .reset_index().drop(columns=['ORDER', 'var1'])
       .add_suffix('_final_bal'))
df = pd.concat([df, df2], axis=1)                  # concatenate the new columns to the original df
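One caveat: taking the first two letters only works while every prefix is exactly two characters long, and the question also mentions columns like "c_bal". A sketch of a more length-tolerant variant, assuming the naming pattern is <prefix>[c]_bal where a trailing "c" marks a companion column:
bal_cols = [c for c in df.columns if c.endswith('_bal')]

def prefix(col):
    p = col[:-len('_bal')]                       # e.g. 'a1c' -> 'a1c', 'm1' -> 'm1'
    return p[:-1] if p.endswith('c') and len(p) > 1 else p

# group the balance columns by prefix and sum them row-wise
sums = df[bal_cols].T.groupby([prefix(c) for c in bal_cols]).sum().T
df = pd.concat([df, sums.add_suffix('_final_bal')], axis=1)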

pandas calculate scores for each group based on multiple functions

I have the following df,
group_id code amount date
1 100 20 2017-10-01
1 100 25 2017-10-02
1 100 40 2017-10-03
1 100 25 2017-10-03
2 101 5 2017-11-01
2 102 15 2017-10-15
2 103 20 2017-11-05
I'd like to group by group_id and then compute a score for each group based on the following rules:
if the code values are all the same in a group, score 0, and 10 otherwise;
if the amount sum is > 100, score 20, and 0 otherwise;
sort_values by date in descending order and sum the differences between the dates; if the sum is < 5 days, score 30, otherwise 0.
so the result df looks like,
group_id code amount date score
1 100 20 2017-10-01 50
1 100 25 2017-10-02 50
1 100 40 2017-10-03 50
1 100 25 2017-10-03 50
2 101 5 2017-11-01 10
2 102 15 2017-10-15 10
2 103 20 2017-11-05 10
here are the functions that correspond to each feature above:
import numpy as np

def amount_score(df, amount_col, thold=100):
    if df[amount_col].sum() > thold:
        return 20
    else:
        return 0

def col_uniq_score(df, col_name):
    if df[col_name].nunique() == 1:
        return 0
    else:
        return 10

def date_diff_score(df, col_name):
    # sort descending and sum the day differences between consecutive dates
    df = df.sort_values(by=[col_name], ascending=False)
    if df[col_name].diff().dropna().sum() / np.timedelta64(1, 'D') < 5:
        return 30
    else:
        return 0
I am wondering how to apply these functions to each group and calculate the sum of all the functions to give a score.
You can use groupby.transform, which returns a Series the same size as the original DataFrame, combined with numpy.where for the if-else logic on a Series:
df_sorted = df.sort_values('date', ascending=False)   # assumes 'date' is a datetime column
grouped = df_sorted.groupby('group_id', sort=False)
a = np.where(grouped['code'].transform('nunique') == 1, 0, 10)
print (a)
[10 10 10  0  0  0  0]
b = np.where(grouped['amount'].transform('sum') > 100, 20, 0)
print (b)
[ 0  0  0 20 20 20 20]
c = np.where(grouped['date'].transform(lambda x: x.diff().dropna().sum()).dt.days < 5, 30, 0)
print (c)
[30 30 30 30 30 30 30]
df_sorted['score'] = a + b + c     # assign on the sorted frame so positions line up
df = df_sorted.sort_index()        # restore the original row order
print (df)
   group_id  code  amount        date  score
0         1   100      20  2017-10-01     50
1         1   100      25  2017-10-02     50
2         1   100      40  2017-10-03     50
3         1   100      25  2017-10-03     50
4         2   101       5  2017-11-01     40
5         2   102      15  2017-10-15     40
6         2   103      20  2017-11-05     40
Note that assigning the scores back on the unsorted frame would misalign them, and that under the stated rules group 2 also earns the 30-point date score, since the summed differences of a descending date sort are negative and therefore < 5.
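If you would rather reuse the three functions from the question directly, here is a sketch that applies them per group and maps the summed score back (it assumes the date column already has datetime dtype and the corrected date_diff_score above):
def group_score(g):
    return (col_uniq_score(g, 'code')
            + amount_score(g, 'amount')
            + date_diff_score(g, 'date'))

# one score per group_id, broadcast back to every row of that group
df['score'] = df['group_id'].map(df.groupby('group_id').apply(group_score))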

grouping by weekly days pandas

I have a dataframe, df, containing:
Index Date & Time eventName eventCount
0 2017-08-09 ABC 24
1 2017-08-09 CDE 140
2 2017-08-10 CDE 150
3 2017-08-11 DEF 200
4 2017-08-11 ABC 20
5 2017-08-16 CDE 10
6 2017-08-16 ABC 15
7 2017-08-17 CDE 10
8 2017-08-17 DEF 50
9 2017-08-18 DEF 80
...
I want to sum the eventCount for each weekday occurrence and plot the total events for each weekday (from MON to SUN), i.e. for example:
Summation of the eventCount values of:
2017-08-09 and 2017-08-16 (Wednesdays) = 189
2017-08-10 and 2017-08-17 (Thursdays) = 210
2017-08-11 and 2017-08-18 (Fridays) = 300
I have tried
dailyOccurenceSum=df['eventCount'].groupby(lambda x: x.weekday).sum()
and I get this error: AttributeError: 'int' object has no attribute 'weekday'
Starting with df -
df
Index Date & Time eventName eventCount
0 0 2017-08-09 ABC 24
1 1 2017-08-09 CDE 140
2 2 2017-08-10 CDE 150
3 3 2017-08-11 DEF 200
4 4 2017-08-11 ABC 20
5 5 2017-08-16 CDE 10
6 6 2017-08-16 ABC 15
7 7 2017-08-17 CDE 10
8 8 2017-08-17 DEF 50
9 9 2017-08-18 DEF 80
First, convert Date & Time to a datetime column -
df['Date & Time'] = pd.to_datetime(df['Date & Time'])
Next, call groupby + sum on the weekday name (current pandas exposes this as .dt.day_name(); older versions used .dt.weekday_name).
df = df.groupby(df['Date & Time'].dt.day_name())['eventCount'].sum()
df
Date & Time
Friday 300
Thursday 210
Wednesday 189
Name: eventCount, dtype: int64
If you want to sort by weekday, convert the index to categorical and call sort_index -
cat = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday', 'Sunday']
df.index = pd.Categorical(df.index, categories=cat, ordered=True)
df = df.sort_index()
df
Wednesday 189
Thursday 210
Friday 300
Name: eventCount, dtype: int64
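For the plotting part of the question, a minimal sketch that turns the sorted Series above into a bar chart (a bar plot is a natural fit for per-weekday totals):
import matplotlib.pyplot as plt

df.plot(kind='bar', rot=0)          # df here is the weekday-indexed Series
plt.ylabel('eventCount')
plt.title('Total event count per weekday')
plt.show()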

Pandas assign value of one column based on another

Given the following data frame:
import pandas as pd
df = pd.DataFrame(
{'A':[10,20,30,40,50,60],
'B':[1,2,1,4,5,4]
})
df
A B
0 10 1
1 20 2
2 30 1
3 40 4
4 50 5
5 60 4
I would like a new column 'C' whose values equal those in 'A' where the corresponding value in 'B' is less than 3, and 0 otherwise.
The desired result is as follows:
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Thanks in advance!
Use np.where:
import numpy as np

df['C'] = np.where(df['B'] < 3, df['A'], 0)
>>> df
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Here you can use the pandas where method directly on the column:
In [3]:
df['C'] = df['A'].where(df['B'] < 3,0)
df
Out[3]:
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Timings
In [4]:
%timeit df['A'].where(df['B'] < 3,0)
%timeit np.where(df['B'] < 3, df['A'], 0)
1000 loops, best of 3: 1.4 ms per loop
1000 loops, best of 3: 407 µs per loop
np.where is faster here, but the pandas where method does more checking and has more options, so it depends on the use case.
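A small sketch of what "more options" means in practice (the new column names below are hypothetical): without an other argument, pandas where keeps NaN where the condition is False, and other can itself be a Series rather than a scalar.
df['C_nan'] = df['A'].where(df['B'] < 3)            # 10, 20, 30, NaN, NaN, NaN
df['C_neg'] = df['A'].where(df['B'] < 3, -df['A'])  # negate instead of zeroing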
