How to categorize salary into high/med/low groups in Python? - python-3.x

I have an employee dataset with salary details. I would like to add an additional column showing each employee's salary group (high/med/low):
Data:
Empno   Sal  Deptno
    1   800      20
    2  1600      30
    3  2975      20
    4  1250      30
    5  2850      30
    6  2450      10
    7  3000      20
Expected Output:
Empno   Sal  Deptno  Sal_Group
    1   800      20  low
    2  1600      30  med
    3  2975      20  ...
    4  1250      30  ...
    5  2850      30  ...
    6  2450      10  ...
    7  3000      20  high

You can try this:
import pandas as pd
import numpy as np

df = pd.read_csv("file.csv")

# Three equal-width bins spanning the salary range (4 edges -> 3 bins)
bins = np.linspace(df['Sal'].min(), df['Sal'].max(), 4)
groupNames = ["low", "med", "high"]
df['Sal_Group'] = pd.cut(df['Sal'], bins, labels=groupNames, include_lowest=True)
print(df)
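If you want three groups of roughly equal size instead of three equal-width salary bands, pd.qcut is a drop-in alternative; a minimal sketch reusing the df and labels above:
# Quantile-based binning: each group gets about the same number of employees
df['Sal_Group'] = pd.qcut(df['Sal'], q=3, labels=["low", "med", "high"])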

Related

Calculate Percentage using Pandas DataFrame

Of all the medals won by these 5 countries across all Olympics, what is the percentage of medals won by each of them?
I have combined all the Excel files into one using a pandas DataFrame, but now I am stuck on finding the percentage.
Country Gold Silver Bronze Total
0 USA 10 13 11 34
1 China 2 2 4 8
2 UK 1 0 1 2
3 Germany 12 16 8 36
4 Australia 2 0 0 2
0 USA 9 9 7 25
1 China 2 4 5 11
2 UK 0 1 0 1
3 Germany 11 12 6 29
4 Australia 1 0 1 2
0 USA 9 15 13 37
1 China 5 2 4 11
2 UK 1 0 0 1
3 Germany 10 13 7 30
4 Australia 2 1 0 3
Combined data sheet
Code that I have tried till now:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame()
for f in ['E:\\olympics\\Olympics-2002.xlsx', 'E:\\olympics\\Olympics-2006.xlsx',
          'E:\\olympics\\Olympics-2010.xlsx', 'E:\\olympics\\Olympics-2014.xlsx',
          'E:\\olympics\\Olympics-2018.xlsx']:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
df.to_excel("E:\\olympics\\combineddata.xlsx")
data = pd.read_excel("E:\\olympics\\combineddata.xlsx")
print(data)
final_Data = {}
for i in data['Country']:
    x = i
    t1 = (data[(data.Country == x)].Total).tolist()
    print("Name of Country=", i, int(sum(t1)))
    final_Data.update({i: int(sum(t1))})
t3 = data.groupby('Country').Total.sum()
t2 = df['Total'].sum()
t4 = t3 / t2 * 100
print(t3)
print(t2)
print(t4)
This is how I got the answer. Now I need to plot it; I want to put it in a pie chart.
Let's assume you have created the DataFrame as 'df'. Then you can do the following to first group by and then calculate percentages.
df = df.groupby('Country').sum()
df['Gold_percent'] = (df['Gold'] / df['Gold'].sum()) * 100
df['Silver_percent'] = (df['Silver'] / df['Silver'].sum()) * 100
df['Bronze_percent'] = (df['Bronze'] / df['Bronze'].sum()) * 100
df['Total_percent'] = (df['Total'] / df['Total'].sum()) * 100
df = df.round(2)  # round does not modify in place, so reassign
print (df)
The output will be as follows:
Gold Silver Bronze ... Silver_percent Bronze_percent Total_percent
Country ...
Australia 5 1 1 ... 1.14 1.49 3.02
China 9 8 13 ... 9.09 19.40 12.93
Germany 33 41 21 ... 46.59 31.34 40.95
UK 2 1 1 ... 1.14 1.49 1.72
USA 28 37 31 ... 42.05 46.27 41.38
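Since the original poster also asked how to put the result in a pie chart, here is a minimal matplotlib sketch using the Total_percent column computed above (column names as in the answer):
import matplotlib.pyplot as plt

# One slice per country, sized by its share of the total medal count
df['Total_percent'].plot(kind='pie', autopct='%1.1f%%')
plt.ylabel('')
plt.title('Share of total medals by country')
plt.show()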
I don't have the exact dataset you have, so I am explaining with a similar one. Try adding a column with the sum of medals across rows, then find the percentage by dividing each row by the sum of the entire column. I am posting this as a model; check it:
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'ExshowroomPrice': [21000, 26000, 28000, 34000],
        'RTOPrice': [2200, 250, 2700, 3500]}
df = pd.DataFrame(cars, columns=['Brand', 'ExshowroomPrice', 'RTOPrice'])
print(df)
Brand ExshowroomPrice RTOPrice
0 Honda Civic 21000 2200
1 Toyota Corolla 26000 250
2 Ford Focus 28000 2700
3 Audi A4 34000 3500
df['percentage'] = ((df.ExshowroomPrice + df.RTOPrice) * 100
                    / (df.ExshowroomPrice.sum() + df.RTOPrice.sum()))
print(df)
Brand ExshowroomPrice RTOPrice percentage
0 Honda Civic 21000 2200 19.719507
1 Toyota Corolla 26000 250 22.311942
2 Ford Focus 28000 2700 26.094348
3 Audi A4 34000 3500 31.874203
Hope it's clear.
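As a quick sanity check on the model above, the percentage column should sum to 100, up to floating-point rounding:
print(df['percentage'].sum())   # expected: ~100.0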

Creating aggregate columns in a pandas dataframe

I have a pandas dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ORDER':["A", "A", "B", "B"], 'var1':[2, 3, 1, 5],'a1_bal':[1,2,3,4], 'a1c_bal':[10,22,36,41], 'b1_bal':[1,2,33,4], 'b1c_bal':[11,22,3,4], 'm1_bal':[15,2,35,4]})
df
ORDER var1 a1_bal a1c_bal b1_bal b1c_bal m1_bal
0 A 2 1 10 1 11 15
1 A 3 2 22 2 22 2
2 B 1 3 36 33 3 35
3 B 5 4 41 4 4 4
I want to create new columns as below:
a1_final_bal = sum(a1_bal, a1c_bal)
b1_final_bal = sum(b1_bal, b1c_bal)
m1_final_bal = m1_bal (since we only have an m1_bal field and no m1c_bal, it will remain as it is)
I don't want to hardcode this step, because there might be more such columns, e.g. "c_bal", "m2_bal", "m2c_bal", etc.
My final data should look something like below
ORDER var1 a1_bal a1c_bal b1_bal b1c_bal m1_bal a1_final_bal b1_final_bal m1_final_bal
0 A 2 1 10 1 11 15 11 12 15
1 A 3 2 22 2 22 2 24 24 2
2 B 1 3 36 33 3 35 38 36 35
3 B 5 4 41 4 4 4 45 8 4
You could try something like this. I am not sure if it's exactly what you are looking for, but I think it should work.
dfforgroup = df.set_index(['ORDER', 'var1'])       # creates a MultiIndex
dfforgroup.columns = dfforgroup.columns.str[:2]    # takes the first two letters of the remaining columns
# group the columns by their first two letters and sum them up
df2 = (dfforgroup.groupby(dfforgroup.columns, axis=1).sum()
       .reset_index().drop(columns=['ORDER', 'var1'])
       .add_suffix('_final_bal'))
df = pd.concat([df, df2], axis=1)                  # concatenate the new columns to the original df
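One caveat: taking the first two letters only works while every prefix is exactly two characters long, and the question also mentions columns like "c_bal". A sketch of a more length-tolerant variant, assuming the naming pattern is <prefix>[c]_bal where a trailing "c" marks a companion column:
bal_cols = [c for c in df.columns if c.endswith('_bal')]

def prefix(col):
    p = col[:-len('_bal')]                       # e.g. 'a1c' -> 'a1c', 'm1' -> 'm1'
    return p[:-1] if p.endswith('c') and len(p) > 1 else p

# group the balance columns by prefix and sum them row-wise
sums = df[bal_cols].T.groupby([prefix(c) for c in bal_cols]).sum().T
df = pd.concat([df, sums.add_suffix('_final_bal')], axis=1)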

pandas calculate scores for each group based on multiple functions

I have the following df,
group_id code amount date
1 100 20 2017-10-01
1 100 25 2017-10-02
1 100 40 2017-10-03
1 100 25 2017-10-03
2 101 5 2017-11-01
2 102 15 2017-10-15
2 103 20 2017-11-05
I'd like to group by group_id and then compute a score for each group based on the following rules:
if the code values are all the same in a group, score 0, and 10 otherwise;
if the amount sum is > 100, score 20, and 0 otherwise;
sort_values by date in descending order and sum the differences between the dates; if the sum is < 5 days, score 30, otherwise 0.
so the result df looks like,
group_id code amount date score
1 100 20 2017-10-01 50
1 100 25 2017-10-02 50
1 100 40 2017-10-03 50
1 100 25 2017-10-03 50
2 101 5 2017-11-01 10
2 102 15 2017-10-15 10
2 103 20 2017-11-05 10
here are the functions that correspond to each feature above:
import numpy as np

def amount_score(df, amount_col, thold=100):
    if df[amount_col].sum() > thold:
        return 20
    else:
        return 0

def col_uniq_score(df, col_name):
    if df[col_name].nunique() == 1:
        return 0
    else:
        return 10

def date_diff_score(df, col_name):
    # sort descending and sum the day differences between consecutive dates
    df = df.sort_values(by=[col_name], ascending=False)
    if df[col_name].diff().dropna().sum() / np.timedelta64(1, 'D') < 5:
        return 30
    else:
        return 0
I am wondering how to apply these functions to each group and calculate the sum of all the functions to give a score.
You can use groupby.transform, which returns a Series the same size as the original DataFrame, combined with numpy.where for the if-else logic on a Series:
df_sorted = df.sort_values('date', ascending=False)   # assumes 'date' is a datetime column
grouped = df_sorted.groupby('group_id', sort=False)
a = np.where(grouped['code'].transform('nunique') == 1, 0, 10)
print (a)
[10 10 10  0  0  0  0]
b = np.where(grouped['amount'].transform('sum') > 100, 20, 0)
print (b)
[ 0  0  0 20 20 20 20]
c = np.where(grouped['date'].transform(lambda x: x.diff().dropna().sum()).dt.days < 5, 30, 0)
print (c)
[30 30 30 30 30 30 30]
df_sorted['score'] = a + b + c     # assign on the sorted frame so positions line up
df = df_sorted.sort_index()        # restore the original row order
print (df)
   group_id  code  amount        date  score
0         1   100      20  2017-10-01     50
1         1   100      25  2017-10-02     50
2         1   100      40  2017-10-03     50
3         1   100      25  2017-10-03     50
4         2   101       5  2017-11-01     40
5         2   102      15  2017-10-15     40
6         2   103      20  2017-11-05     40
Note that assigning the scores back on the unsorted frame would misalign them, and that under the stated rules group 2 also earns the 30-point date score, since the summed differences of a descending date sort are negative and therefore < 5.
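If you would rather reuse the three functions from the question directly, here is a sketch that applies them per group and maps the summed score back (it assumes the date column already has datetime dtype and the corrected date_diff_score above):
def group_score(g):
    return (col_uniq_score(g, 'code')
            + amount_score(g, 'amount')
            + date_diff_score(g, 'date'))

# one score per group_id, broadcast back to every row of that group
df['score'] = df['group_id'].map(df.groupby('group_id').apply(group_score))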

grouping by weekly days pandas

I have a dataframe, df, containing:
Index Date & Time eventName eventCount
0 2017-08-09 ABC 24
1 2017-08-09 CDE 140
2 2017-08-10 CDE 150
3 2017-08-11 DEF 200
4 2017-08-11 ABC 20
5 2017-08-16 CDE 10
6 2017-08-16 ABC 15
7 2017-08-17 CDE 10
8 2017-08-17 DEF 50
9 2017-08-18 DEF 80
...
I want to sum the eventCount for each weekday occurrence and plot the total events for each weekday (from MON to SUN), i.e. for example:
Summation of the eventCount values of:
2017-08-09 and 2017-08-16 (Wednesdays) = 189
2017-08-10 and 2017-08-17 (Thursdays) = 210
2017-08-11 and 2017-08-18 (Fridays) = 300
I have tried
dailyOccurenceSum=df['eventCount'].groupby(lambda x: x.weekday).sum()
and I get this error: AttributeError: 'int' object has no attribute 'weekday'
Starting with df -
df
Index Date & Time eventName eventCount
0 0 2017-08-09 ABC 24
1 1 2017-08-09 CDE 140
2 2 2017-08-10 CDE 150
3 3 2017-08-11 DEF 200
4 4 2017-08-11 ABC 20
5 5 2017-08-16 CDE 10
6 6 2017-08-16 ABC 15
7 7 2017-08-17 CDE 10
8 8 2017-08-17 DEF 50
9 9 2017-08-18 DEF 80
First, convert Date & Time to a datetime column -
df['Date & Time'] = pd.to_datetime(df['Date & Time'])
Next, call groupby + sum on the weekday name (current pandas exposes this as .dt.day_name(); older versions used .dt.weekday_name).
df = df.groupby(df['Date & Time'].dt.day_name())['eventCount'].sum()
df
Date & Time
Friday 300
Thursday 210
Wednesday 189
Name: eventCount, dtype: int64
If you want to sort by weekday, convert the index to categorical and call sort_index -
cat = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday', 'Sunday']
df.index = pd.Categorical(df.index, categories=cat, ordered=True)
df = df.sort_index()
df
Wednesday 189
Thursday 210
Friday 300
Name: eventCount, dtype: int64
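For the plotting part of the question, a minimal sketch that turns the sorted Series above into a bar chart (a bar plot is a natural fit for per-weekday totals):
import matplotlib.pyplot as plt

df.plot(kind='bar', rot=0)          # df here is the weekday-indexed Series
plt.ylabel('eventCount')
plt.title('Total event count per weekday')
plt.show()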

Pandas assign value of one column based on another

Given the following data frame:
import pandas as pd
df = pd.DataFrame(
{'A':[10,20,30,40,50,60],
'B':[1,2,1,4,5,4]
})
df
A B
0 10 1
1 20 2
2 30 1
3 40 4
4 50 5
5 60 4
I would like a new column 'C' whose values equal those in 'A' where the corresponding value in 'B' is less than 3, and 0 otherwise.
The desired result is as follows:
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Thanks in advance!
Use np.where:
import numpy as np

df['C'] = np.where(df['B'] < 3, df['A'], 0)
>>> df
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Here you can use the pandas where method directly on the column:
In [3]:
df['C'] = df['A'].where(df['B'] < 3,0)
df
Out[3]:
A B C
0 10 1 10
1 20 2 20
2 30 1 30
3 40 4 0
4 50 5 0
5 60 4 0
Timings
In [4]:
%timeit df['A'].where(df['B'] < 3,0)
%timeit np.where(df['B'] < 3, df['A'], 0)
1000 loops, best of 3: 1.4 ms per loop
1000 loops, best of 3: 407 µs per loop
np.where is faster here, but the pandas where method does more checking and has more options, so it depends on the use case.
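A small sketch of what "more options" means in practice (the new column names below are hypothetical): without an other argument, pandas where keeps NaN where the condition is False, and other can itself be a Series rather than a scalar.
df['C_nan'] = df['A'].where(df['B'] < 3)            # 10, 20, 30, NaN, NaN, NaN
df['C_neg'] = df['A'].where(df['B'] < 3, -df['A'])  # negate instead of zeroing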
