Calculate Percentage using Pandas DataFrame - python-3.x

Of all the medals won by these 5 countries across all Olympics, what is the percentage of medals won by each of them?
I have combined all the Excel files into one using a pandas DataFrame, but I am now stuck on finding the percentage.
Country Gold Silver Bronze Total
0 USA 10 13 11 34
1 China 2 2 4 8
2 UK 1 0 1 2
3 Germany 12 16 8 36
4 Australia 2 0 0 2
0 USA 9 9 7 25
1 China 2 4 5 11
2 UK 0 1 0 1
3 Germany 11 12 6 29
4 Australia 1 0 1 2
0 USA 9 15 13 37
1 China 5 2 4 11
2 UK 1 0 0 1
3 Germany 10 13 7 30
4 Australia 2 1 0 3
Combined data sheet
The code I have tried so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame()
for f in ['E:\\olympics\\Olympics-2002.xlsx', 'E:\\olympics\\Olympics-2006.xlsx',
          'E:\\olympics\\Olympics-2010.xlsx', 'E:\\olympics\\Olympics-2014.xlsx',
          'E:\\olympics\\Olympics-2018.xlsx']:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
df.to_excel("E:\\olympics\\combineddata.xlsx")
data = pd.read_excel("E:\\olympics\\combineddata.xlsx")
print(data)

final_Data = {}
for i in data['Country']:
    x = i
    t1 = (data[(data.Country == x)].Total).tolist()
    print("Name of Country=", i, int(sum(t1)))
    final_Data.update({i: int(sum(t1))})

t3 = data.groupby('Country').Total.sum()
t2 = df['Total'].sum()
t4 = t3 / t2 * 100
print(t3)
print(t2)
print(t4)
This is how I got the answer. Now I need to plot it; I want to show it as a pie chart.
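A side note on the loop above: `DataFrame.append` was deprecated and removed in pandas 2.0, so collecting the yearly frames in a list and calling `pd.concat` once is the usual replacement. A minimal sketch, with small in-memory frames standing in for the Excel files:

```python
import pandas as pd

# Stand-ins for the per-year Excel files
frames = [
    pd.DataFrame({'Country': ['USA', 'China'], 'Total': [34, 8]}),
    pd.DataFrame({'Country': ['USA', 'China'], 'Total': [25, 11]}),
]

# One concat call instead of repeated append in a loop
df = pd.concat(frames, ignore_index=True)
t3 = df.groupby('Country').Total.sum()
```

With real files you would build `frames` by calling `pd.read_excel(f, 'Sheet1')` for each path in the list, exactly as the loop does.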

Let's assume you have created the DataFrame as 'df'. Then you can do the following to first group by and then calculate percentages.
df = df.groupby('Country').sum()
df['Gold_percent'] = (df['Gold'] / df['Gold'].sum()) * 100
df['Silver_percent'] = (df['Silver'] / df['Silver'].sum()) * 100
df['Bronze_percent'] = (df['Bronze'] / df['Bronze'].sum()) * 100
df['Total_percent'] = (df['Total'] / df['Total'].sum()) * 100
df = df.round(2)
print (df)
The output will be as follows:
Gold Silver Bronze ... Silver_percent Bronze_percent Total_percent
Country ...
Australia 5 1 1 ... 1.14 1.49 3.02
China 9 8 13 ... 9.09 19.40 12.93
Germany 33 41 21 ... 46.59 31.34 40.95
UK 2 1 1 ... 1.14 1.49 1.72
USA 28 37 31 ... 42.05 46.27 41.38
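Since the question ends by asking for a pie chart, here is a minimal sketch building on the Total_percent column above. The country totals are simply the summed sample data, and matplotlib is assumed to be available for the plotting step:

```python
import pandas as pd

# Grouped totals from the sample data above
df = pd.DataFrame({'Country': ['Australia', 'China', 'Germany', 'UK', 'USA'],
                   'Total': [7, 30, 95, 4, 96]}).set_index('Country')
df['Total_percent'] = df['Total'] / df['Total'].sum() * 100

try:
    import matplotlib
    matplotlib.use('Agg')  # non-interactive backend for scripts
    import matplotlib.pyplot as plt
    # One slice per country, labelled with its percentage share
    df['Total_percent'].plot.pie(autopct='%.2f%%')
    plt.savefig('medal_share.png')
except ImportError:
    pass  # plotting step skipped when matplotlib is unavailable
```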

I don't have the exact dataset you have, so I'm explaining with a similar one. Try adding a column with the sum of medals across each row, then find the percentage by dividing each row by the sum of the entire column.
I am posting this as a model; check it:
import pandas as pd
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
'ExshowroomPrice': [21000,26000,28000,34000],'RTOPrice': [2200,250,2700,3500]}
df = pd.DataFrame(cars, columns = ['Brand', 'ExshowroomPrice','RTOPrice'])
Brand ExshowroomPrice RTOPrice
0 Honda Civic 21000 2200
1 Toyota Corolla 26000 250
2 Ford Focus 28000 2700
3 Audi A4 34000 3500
df['percentage'] = (df.ExshowroomPrice + df.RTOPrice) * 100 / (df.ExshowroomPrice.sum() + df.RTOPrice.sum())
print(df)
Brand ExshowroomPrice RTOPrice percentage
0 Honda Civic 21000 2200 19.719507
1 Toyota Corolla 26000 250 22.311942
2 Ford Focus 28000 2700 26.094348
3 Audi A4 34000 3500 31.874203
Hope it's clear.
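An equivalent way to write the same percentage, summing the two price columns row-wise first and normalising by the grand total (a sketch on the same sample data):

```python
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'ExshowroomPrice': [21000, 26000, 28000, 34000],
        'RTOPrice': [2200, 250, 2700, 3500]}
df = pd.DataFrame(cars)

# Row-wise total, divided by the grand total of all rows
row_total = df[['ExshowroomPrice', 'RTOPrice']].sum(axis=1)
df['percentage'] = row_total / row_total.sum() * 100
```

This produces the same percentages as the explicit column-by-column version above, and the percentages sum to 100.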

Related

Inner merge in python with tables having duplicate values in key column

I am struggling to replicate a SAS inner merge in Python. The Python inner merge does not match the SAS inner merge when duplicate key values are present. Below is an example:
zw = pd.DataFrame({"ID":[1,0,0,1,0,0,1],
"Name":['Shivansh','Shivansh','Shivansh','Amar','Arpit','Ranjeet','Priyanka'],
"job_profile":['DataS','SWD','DataA','DataA','AndroidD','PythonD','fullstac'],
"salary":[22,15,10,9,16,18,22],
"city":['noida','bangalore','hyderabad','noida','pune','gurugram','bangalore'],
"ant":[10,15,15,10,16,17,18]})
zw1 = pd.DataFrame({"ID-":[1,0,0,1,0,0,1],
"Name":['Shivansh','Shivansh','Swati','Amar','Arpit','Ranjeet','Priyanka'],
"job_profile_":['DataS','SWD','DataA','DataA','AndroidD','PythonD','fullstac'],
"salary_":[2,15,10,9,16,18,22],
"city_":['noida','kochi','hyderabad','noida','pune','gurugram','bangalore'],
"ant_":[1,15,15,10,16,17,18]})
zw and zw1 are the input tables. Both tables need to be inner merged on the key column Name. The issue is that both tables have duplicate values in the Name column. Python generates all possible combinations of the duplicate rows.
Below is the expected output:
I tried a normal inner merge and tried dropping duplicate rows on the ID and Name columns, but I am still not getting the desired output:
df1=pd.merge(zw,zw1,on=['Name'],how='inner')
df1.drop_duplicates(['Name','ID'])
Use df.combine_first + df.sort_values combination:
df = zw.combine_first(zw1).sort_values('Name')
print(df)
ID ID- Name ant ant_ city city_ job_profile \
3 1 1 Amar 10 10 noida noida DataA
4 0 0 Arpit 16 16 pune pune AndroidD
6 1 1 Priyanka 18 18 bangalore bangalore fullstac
5 0 0 Ranjeet 17 17 gurugram gurugram PythonD
0 1 1 Shivansh 10 1 noida noida DataS
1 0 0 Shivansh 15 15 bangalore kochi SWD
2 0 0 Shivansh 15 15 hyderabad hyderabad DataA
job_profile_ salary salary_
3 DataA 9 9
4 AndroidD 16 16
6 fullstac 22 22
5 PythonD 18 18
0 DataS 22 2
1 SWD 15 15
2 DataA 10 10
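A different way to avoid the cartesian blow-up on duplicate keys (not from the answer above, just a common sketch): SAS pairs duplicate key rows by position within the group, which can be imitated by adding a within-group counter with `groupby(...).cumcount()` and merging on the key plus the counter. The frame and column names below follow the question, trimmed to two columns for brevity:

```python
import pandas as pd

zw = pd.DataFrame({"Name": ['Shivansh', 'Shivansh', 'Shivansh', 'Amar',
                            'Arpit', 'Ranjeet', 'Priyanka'],
                   "salary": [22, 15, 10, 9, 16, 18, 22]})
zw1 = pd.DataFrame({"Name": ['Shivansh', 'Shivansh', 'Swati', 'Amar',
                             'Arpit', 'Ranjeet', 'Priyanka'],
                    "salary_": [2, 15, 10, 9, 16, 18, 22]})

# Position of each row within its Name group, so duplicates pair up 1:1
zw['_seq'] = zw.groupby('Name').cumcount()
zw1['_seq'] = zw1.groupby('Name').cumcount()

merged = pd.merge(zw, zw1, on=['Name', '_seq'], how='inner').drop(columns='_seq')
# A plain inner merge on Name alone would produce 3 * 2 = 6 Shivansh rows;
# pairing by position keeps only the 2 that line up.
```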

how to categorize salary into high/med/low group in python?

I have an employee dataset with salary details. I would like to add a column showing each employee's salary group (high/med/low):
Data:
Empno Sal Deptno
1 800 20
2 1600 30
3 2975 20
4 1250 30
5 2850 30
6 2450 10
7 3000 20
Expected Output:
Empno Sal Deptno Sal_Group
1 800 20 low
2 1600 30 mid
3 2975 20 ...
4 1250 30 ...
5 2850 30 ...
6 2450 10 ...
7 3000 20 high
You can try this:
import pandas as pd
import numpy as np
df = pd.read_csv("file.csv")
bins = np.linspace(min(df['Sal']), max(df['Sal']),4)
groupNames = ["low", "med", "high"]
df['SalGroup'] = pd.cut(df['Sal'], bins, labels = groupNames, include_lowest = True)
print(df)
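The same approach is runnable without the CSV by building the frame inline (data copied from the question); pd.cut splits the salary range into three equal-width bins and labels them:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Empno': [1, 2, 3, 4, 5, 6, 7],
                   'Sal': [800, 1600, 2975, 1250, 2850, 2450, 3000],
                   'Deptno': [20, 30, 20, 30, 30, 10, 20]})

# Three equal-width bins between the min and max salary:
# edges at 800, 1533.33, 2266.67, 3000
bins = np.linspace(df['Sal'].min(), df['Sal'].max(), 4)
df['Sal_Group'] = pd.cut(df['Sal'], bins, labels=['low', 'med', 'high'],
                         include_lowest=True)
```

Note the groups are equal-width salary ranges, not equal-sized groups of employees; `pd.qcut` would give the latter.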

Groupby and create a new column by randomly assign multiple strings into it in Pandas

Let's say I have student info with id, age and class as follows:
id age class
0 1 23 a
1 2 24 a
2 3 25 b
3 4 22 b
4 5 16 c
5 6 16 d
I want to group by class and create a new column named major by randomly assigning math, art, business, science to it, so that rows in the same class get the same major string.
We may need to use apply(lambda x: random.choice(...)) for this, but I don't know how. Thanks for your help.
Output expected:
id age major class
0 1 23 art a
1 2 24 art a
2 3 25 science b
3 4 22 science b
4 5 16 business c
5 6 16 math d
Use numpy.random.choice, generating as many values as the length of the DataFrame:
df['major'] = np.random.choice(['math', 'art', 'business', 'science'], size=len(df))
print (df)
id age major
0 1 23 business
1 2 24 art
2 3 25 science
3 4 22 math
4 5 16 science
5 6 16 business
EDIT: for same major values per groups use Series.map with dictionary:
c = df['class'].unique()
vals = np.random.choice(['math', 'art', 'business', 'science'], size=len(c))
df['major'] = df['class'].map(dict(zip(c, vals)))
print (df)
id age class major
0 1 23 a business
1 2 24 a business
2 3 25 b art
3 4 22 b art
4 5 16 c science
5 6 16 d math
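The per-group assignment can also be written with `groupby().transform`, which broadcasts one random pick per group back to every row of that group (a sketch, not from the answer above; data copied from the question):

```python
import random
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                   'age': [23, 24, 25, 22, 16, 16],
                   'class': ['a', 'a', 'b', 'b', 'c', 'd']})
majors = ['math', 'art', 'business', 'science']

# transform calls the lambda once per class group; the scalar it returns
# is broadcast to every row of that group
df['major'] = df.groupby('class')['class'].transform(
    lambda g: random.choice(majors))
```

Unlike the `map`-over-unique-values version, this needs no intermediate dictionary, at the cost of one function call per group.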

How to count number of records in each group and add them to main dataset?

Given that I have a dataset as below:
import pandas as pd
import numpy as np
dt = {
"facility":["Ann Arbor","Ann Arbor","Detriot","Detriot","Detriot"],
"patient_ID":[4388,4388,9086,9086,9086],
"year":[2004,2007,2007,2008,2011],
"month":[8,9,9,6,2],
"Nr_Small":[0,0,5,12,10],
"Nr_Medium":[3,1,1,4,3],
"Nr_Large":[2,0,0,0,0]
}
dt = pd.DataFrame(dt)
dt.head()
I need to add a column that shows the number of records in each group of patients. Here is what I am doing:
dt["NumberOfVisits"] = dt.groupby(['patient_ID']).size()
I also tried another variant, but it adds a column of NaNs to my dataset. However, my desired output is as below:
Use transform here:
dt["NumberOfVisits"] = dt.groupby(['patient_ID'])['patient_ID'].transform('size')
print(dt)
facility patient_ID year month Nr_Small Nr_Medium Nr_Large \
0 Ann Arbor 4388 2004 8 0 3 2
1 Ann Arbor 4388 2007 9 0 1 0
2 Detriot 9086 2007 9 5 1 0
3 Detriot 9086 2008 6 12 4 0
4 Detriot 9086 2011 2 10 3 0
NumberOfVisits
0 2
1 2
2 3
3 3
4 3
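An equivalent spelling, shown here as a sketch on a trimmed copy of the question's data: count rows per patient with `value_counts`, then `map` the counts back onto each row.

```python
import pandas as pd

dt = pd.DataFrame({'facility': ['Ann Arbor', 'Ann Arbor', 'Detriot',
                                'Detriot', 'Detriot'],
                   'patient_ID': [4388, 4388, 9086, 9086, 9086]})

# value_counts gives one count per patient_ID;
# map looks up that count for every row
dt['NumberOfVisits'] = dt['patient_ID'].map(dt['patient_ID'].value_counts())
```

This is why the original attempt produced NaNs: `groupby(...).size()` is indexed by patient_ID, not by row position, so assigning it directly cannot align with the original index; `transform` or `map` does the alignment.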

grouping by weekly days pandas

I have a dataframe, df, containing:
Index Date & Time eventName eventCount
0 2017-08-09 ABC 24
1 2017-08-09 CDE 140
2 2017-08-10 CDE 150
3 2017-08-11 DEF 200
4 2017-08-11 ABC 20
5 2017-08-16 CDE 10
6 2017-08-16 ABC 15
7 2017-08-17 CDE 10
8 2017-08-17 DEF 50
9 2017-08-18 DEF 80
...
I want to sum the eventCount for each weekday's occurrences and plot the total events for each weekday (from Mon to Sun), i.e. for example, the summation of the eventCount values of:
2017-08-09 and 2017-08-16 (Wednesdays) = 189
2017-08-10 and 2017-08-17 (Thursdays) = 210
2017-08-11 and 2017-08-18 (Fridays) = 300
I have tried
dailyOccurenceSum=df['eventCount'].groupby(lambda x: x.weekday).sum()
and I get this error: AttributeError: 'int' object has no attribute 'weekday' (the lambda receives the integer index, not a date).
Starting with df -
df
Index Date & Time eventName eventCount
0 0 2017-08-09 ABC 24
1 1 2017-08-09 CDE 140
2 2 2017-08-10 CDE 150
3 3 2017-08-11 DEF 200
4 4 2017-08-11 ABC 20
5 5 2017-08-16 CDE 10
6 6 2017-08-16 ABC 15
7 7 2017-08-17 CDE 10
8 8 2017-08-17 DEF 50
9 9 2017-08-18 DEF 80
First, convert Date & Time to a datetime column -
df['Date & Time'] = pd.to_datetime(df['Date & Time'])
Next, call groupby + sum on the weekday name. (In modern pandas use dt.day_name(); the dt.weekday_name attribute used here was removed in pandas 1.0.)
df = df.groupby(df['Date & Time'].dt.day_name())['eventCount'].sum()
df
Date & Time
Friday 300
Thursday 210
Wednesday 189
Name: eventCount, dtype: int64
If you want to sort by weekday, convert the index to categorical and call sort_index -
cat = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday', 'Sunday']
df.index = pd.Categorical(df.index, categories=cat, ordered=True)
df = df.sort_index()
df
Wednesday 189
Thursday 210
Friday 300
Name: eventCount, dtype: int64
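Putting the steps together as one runnable sketch on the question's sample data, using `dt.day_name()` (the modern spelling of `weekday_name`):

```python
import pandas as pd

df = pd.DataFrame({
    'Date & Time': ['2017-08-09', '2017-08-09', '2017-08-10', '2017-08-11',
                    '2017-08-11', '2017-08-16', '2017-08-16', '2017-08-17',
                    '2017-08-17', '2017-08-18'],
    'eventCount': [24, 140, 150, 200, 20, 10, 15, 10, 50, 80]})

# Parse the strings into real datetimes
df['Date & Time'] = pd.to_datetime(df['Date & Time'])

# Sum events per weekday name, then order Monday..Sunday
out = df.groupby(df['Date & Time'].dt.day_name())['eventCount'].sum()
cat = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
       'Saturday', 'Sunday']
out.index = pd.Categorical(out.index, categories=cat, ordered=True)
out = out.sort_index()
```

`out.plot.bar()` (or `.plot.pie()`) would then plot the totals in weekday order.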
