Pandas Group-By and Calculate Ratio of Two Columns - python-3.x

I'm trying to use pandas and groupby to calculate the ratio of two columns. In the example below I want to calculate the ratio of staff Status per Department (number of each Status in a Department / total number of employees in that Department). For example, the Sales department has 3 employees in total and 2 of them have Employee status, which gives a ratio of 2/3, or 66.67%. I managed to hack my way through to get this, but there must be a more elegant and simpler way. How can I get the desired output below more efficiently?
Original DataFrame:
Department Name Status
0 Sales John Employee
1 Sales Steve Employee
2 Sales Sara Contractor
3 Finance Allen Contractor
4 Marketing Robert Employee
5 Marketing Lacy Contractor
Code:
import pandas as pd

mydict = {
    'Name': ['John', 'Steve', 'Sara', 'Allen', 'Robert', 'Lacy'],
    'Department': ['Sales', 'Sales', 'Sales', 'Finance', 'Marketing', 'Marketing'],
    'Status': ['Employee', 'Employee', 'Contractor', 'Contractor', 'Employee', 'Contractor']
}
df = pd.DataFrame(mydict)
# Create column with total number of staff per Department
df['total_dept'] = df.groupby(['Department'])['Name'].transform('count')
print(df)
print('\n')
# Create column with Status ratio per Department
for k, v in df.iterrows():
    df.loc[k, 'Status_Ratio'] = (df.groupby(['Department', 'Status']).count()
                                   .xs(v['Status'], level=1)['total_dept'][v['Department']]
                                 / v['total_dept']) * 100
print(df)
print('\n')
# Final Groupby with Status Ratio. Size NOT needed
print(df.groupby(['Department', 'Status', 'Status_Ratio']).size())
Desired Output:
Department Status Status_Ratio
Finance Contractor 100.00
Marketing Contractor 50.00
Employee 50.00
Sales Contractor 33.33
Employee 66.67

Try (with the original df):
df.groupby("Department")["Status"].value_counts(normalize=True).mul(100)
Outputs:
Department Status
Finance Contractor 100.000000
Marketing Contractor 50.000000
Employee 50.000000
Sales Employee 66.666667
Contractor 33.333333
Name: Status, dtype: float64
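If you want the result back as a flat DataFrame shaped like the desired output rather than a MultiIndex Series, renaming the Series before reset_index avoids a name clash with the existing Status column. A minimal sketch of that post-processing step:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Steve', 'Sara', 'Allen', 'Robert', 'Lacy'],
    'Department': ['Sales', 'Sales', 'Sales', 'Finance', 'Marketing', 'Marketing'],
    'Status': ['Employee', 'Employee', 'Contractor', 'Contractor', 'Employee', 'Contractor'],
})

# Per-department share of each Status, as a percentage.
# Rename before reset_index() so the values column doesn't
# collide with the 'Status' index level.
out = (df.groupby('Department')['Status']
         .value_counts(normalize=True)
         .mul(100)
         .round(2)
         .rename('Status_Ratio')
         .reset_index())
print(out)
```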

Related

Pandas - I have a dataset where the columns are country, company and total employees. I need a dataframe for total employees in each company by country

There are 8 companies in total and around 30-40 countries. I need a dataframe showing the total number of employees in each company by country.
Sounds like you want to use pandas' groupby feature. I'm not sure what type of data you have or what result you want, so here are some toy examples:
import pandas as pd

df = pd.DataFrame({'company': ["A", "A", "B"], 'country': ["USA", "USA", "USA"], 'employees': [10, 20, 50]})
dfg = df.groupby(['company', 'country'], as_index=False)['employees'].sum()
print(dfg)
# company country employees
# 0 A USA 30
# 1 B USA 50
df = pd.DataFrame({'company': ["A", "A", "A"], 'country': ["USA", "USA", "Japan"], 'employees': ['Art', 'Bob', 'Chris']})
dfg = df.groupby(['company', 'country'], as_index=False)['employees'].count()
print(dfg)
# company country employees
# 0 A Japan 1
# 1 A USA 2
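If all you need is a headcount grid of company by country, pd.crosstab is a one-liner alternative; a sketch on a toy frame like the ones above:

```python
import pandas as pd

df = pd.DataFrame({'company': ['A', 'A', 'A', 'B'],
                   'country': ['USA', 'USA', 'Japan', 'USA']})

# Rows are companies, columns are countries, values are row counts
ct = pd.crosstab(df['company'], df['country'])
print(ct)
```

Missing combinations come out as 0 rather than being dropped, which is often handy for a summary table.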

Python - Transpose/Pivot a column based based on a different column

I searched and indeed found a lot of similar questions, but none of them seemed to answer my case.
I have a pandas DataFrame which is a joined table consisting of products and the countries in which they are sold.
It's 3000 rows and 50 columns in size.
I'm uploading a photo (only part of the df) of the current situation and the expected result I want to achieve.
I want to transpose the 'Country name' column into columns grouped by the 'Product code name'. Please note that the new country columns are not limited to a certain number of countries (some products have 3, some 40).
Thank you!
Use .cumcount() to count the number of countries that a product has.
Then use .pivot() to get your dataframe in the right shape:
df = pd.DataFrame({
    'Country': ['NL', 'Poland', 'Spain', 'Sweden', 'China', 'Egypt'],
    'Product Code': ['123', '123', '115', '115', '117', '118'],
    'Product Name': ['X', 'X', 'Y', 'Y', 'Z', 'W'],
})
df['cumcount'] = df.groupby(['Product Code', 'Product Name'])['Country'].cumcount() + 1
df_pivot = df.pivot(
    index=['Product Code', 'Product Name'],
    columns='cumcount',
    values='Country',
).add_prefix('country_')
Resulting dataframe:
cumcount country_1 country_2
Product Code Product Name
115 Y Spain Sweden
117 Z China NaN
118 W Egypt NaN
123 X NL Poland
Try this:
df_out = df.set_index(['Product code',
                       'Product name',
                       df.groupby('Product code').cumcount() + 1]).unstack()
df_out.columns = [f'Country_{j}' for _, j in df_out.columns]
df_out.reset_index()
Output:
Product code Product name Country_1 Country_2 Country_3
0 AAA115 Y Sweden China NaN
1 AAA117 Z Egypt Greece NaN
2 AAA118 W France Italy NaN
3 AAA123 X Netherlands Poland Spain
Details:
Reshape the dataframe with set_index and unstack, using cumcount to create the country columns, then flatten the MultiIndex header with a list comprehension.

How can I create an aggregate/summary pandas dataframe based on overlapping dates derived from a more specific dataframe?

The following dataframe shows trips made by employees of different companies:
source:
import pandas as pd
emp_trips = {'Name': ['Bob', 'Joe', 'Sue', 'Jack', 'Henry', 'Frank', 'Lee', 'Jack'],
             'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
             'Depart': ['01/01/2020', '01/01/2020', '01/06/2020', '01/01/2020', '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
             'Return': ['01/31/2020', '02/15/2020', '02/20/2020', '03/01/2020', '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
             'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87]
             }
df = pd.DataFrame(emp_trips, columns = ['Name', 'Company', 'Depart', 'Return', 'Charges'])
# Convert to date format
df['Return']= pd.to_datetime(df['Return'])
df['Depart']= pd.to_datetime(df['Depart'])
output:
Name Company Depart Return Charges
0 Bob ABC 2020-01-01 2020-01-31 10.10
1 Joe ABC 2020-01-01 2020-02-15 20.25
2 Sue ABC 2020-01-06 2020-02-20 30.32
3 Jack HIJ 2020-01-01 2020-03-01 40.00
4 Henry HIJ 2020-05-01 2020-05-05 50.01
5 Frank DEF 2020-01-13 2020-01-15 60.32
6 Lee DEF 2020-01-12 2020-01-30 70.99
7 Jack DEF 2020-01-14 2020-02-02 80.87
How can I create another dataframe based on the following aspects:
The original dataframe is based on employee names/trips.
The generated dataframe will be based on companies grouped by overlapping dates.
The 'Name' column will not be included as it is no longer needed.
The 'Company' column will remain.
The 'Depart' date will be of the earliest date of any overlapping trip dates.
The 'Return' date will be of the latest date of any overlapping trip dates.
Any company trips that do not have overlapping dates will be its own entry/row.
The 'Charges' for each trip will be totaled for the new company entry.
Here is the desired output of the new dataframe:
Company Depart Return Charges
0 ABC 01/01/2020 02/20/2020 60.67
1 HIJ 01/01/2020 03/01/2020 40.00
2 HIJ 05/01/2020 05/05/2020 50.01
3 DEF 01/12/2020 02/02/2020 212.18
I've looked into the following as possible solutions:
Create a hierarchical index based on the company and date. As I worked through this, I realized that all this really does is create a hierarchical index but that's based on the specific columns. Also, this method won't aggregate the individual rows into summary rows.
df1 = df.set_index(['Company', not exactly sure how to say overlapping dates])
I also tried using timedelta, but it resulted in True/False values in a separate column, and I'm not entirely sure how those would be used to combine rows based on overlapping dates and company. Also, I don't think a plain groupby('Company') works, since a company can have separate, non-overlapping trips that need their own rows.
df['trips_overlap'] = (df.groupby('Company')
                         .apply(lambda x: (x['Return'].shift() - x['Depart']) > timedelta(0))
                         .reset_index(level=0, drop=True))
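No answer is recorded above, but one common pattern for merging overlapping intervals would fit here: sort each company's trips by Depart, track the running maximum Return, and start a new group whenever a trip departs after every earlier trip has already returned. A sketch under those assumptions (the grp helper series is my own naming, not from the question):

```python
import pandas as pd

emp_trips = {'Name': ['Bob', 'Joe', 'Sue', 'Jack', 'Henry', 'Frank', 'Lee', 'Jack'],
             'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
             'Depart': ['01/01/2020', '01/01/2020', '01/06/2020', '01/01/2020',
                        '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
             'Return': ['01/31/2020', '02/15/2020', '02/20/2020', '03/01/2020',
                        '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
             'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87]}
df = pd.DataFrame(emp_trips)
df['Depart'] = pd.to_datetime(df['Depart'])
df['Return'] = pd.to_datetime(df['Return'])

df = df.sort_values(['Company', 'Depart'])
# Latest Return seen so far among *earlier* trips of the same company
prev_max = df.groupby('Company')['Return'].cummax().groupby(df['Company']).shift()
# A trip starts a new group when it departs after all earlier trips returned
new_group = df['Depart'] > prev_max
grp = new_group.groupby(df['Company']).cumsum().rename('grp')

result = (df.groupby(['Company', grp])
            .agg(Depart=('Depart', 'min'),
                 Return=('Return', 'max'),
                 Charges=('Charges', 'sum'))
            .reset_index()
            .drop(columns='grp'))
print(result)
```

On the sample data this yields one row each for ABC and DEF and two for HIJ, matching the desired output (modulo row order).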

Use my custom row order with pandas .describe() function

Assuming I have the following test DataFrame df:
Car Sold make profit
Honda 100 Accord 10
Honda 20 Fit 5
Toyota 300 Corolla 20
Hyundai 150 Elantra 20
BMW 20 Z-class 100
Toyota 45 Lexus 7
BMW 50 X-class 30
JEEP 150 cherokee 2
Honda 20 CRV 5
Toyota 30 Yaris 3
I need a summary statistic table for number of cars sold, by type of car.
I can do that this way:
df.groupby('Car')['Sold'].describe()
this gives me something like the following:
Car count mean std min 25% 50% 75% max
BMW 2
Honda 3
Hyundai 1
JEEP 1
Toyota 3
The 'Car' column values are listed in the summary statistic table in ascending alphabetical order. I am looking for a way to sort it in my own pre-specified way. I want the summary statistic table to be listed as "Toyota, Hyundai, JEEP, BMW, Honda"
df.groupby('Car')['Sold'].describe().loc[["Toyota", "Hyundai", "JEEP", "BMW", "Honda"]]
helps me put it in order, but I am not able to do it for multi-level indexing. For instance, if I want the summary statistics table by 'Car', and further by the make, .loc does not give me the desired solution.
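One way to get a custom order that also survives multi-level grouping is to make the grouping column an ordered categorical: groupby then sorts groups by category order instead of alphabetically. A sketch, assuming the toy data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Car': ['Honda', 'Honda', 'Toyota', 'Hyundai', 'BMW',
            'Toyota', 'BMW', 'JEEP', 'Honda', 'Toyota'],
    'Sold': [100, 20, 300, 150, 20, 45, 50, 150, 20, 30],
    'make': ['Accord', 'Fit', 'Corolla', 'Elantra', 'Z-class',
             'Lexus', 'X-class', 'cherokee', 'CRV', 'Yaris'],
    'profit': [10, 5, 20, 20, 100, 7, 30, 2, 5, 3],
})

order = ['Toyota', 'Hyundai', 'JEEP', 'BMW', 'Honda']
df['Car'] = pd.Categorical(df['Car'], categories=order, ordered=True)

# Single level: rows come out in the category order, not alphabetically
stats = df.groupby('Car', observed=True)['Sold'].describe()

# Multi-level: the outer 'Car' level keeps the custom order too
stats2 = df.groupby(['Car', 'make'], observed=True)['Sold'].describe()
print(stats2)
```

observed=True keeps groupby from emitting the full cross-product of categories and makes.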

How to fill in between rows gap comparing with other dataframe using pandas?

I want to compare df1 with df2 and fill only the blanks, without overwriting other values. I have no idea how to achieve this without overwriting values or creating extra columns.
Can I do this by converting df2 into a dictionary and mapping it onto df1?
df1 = pd.DataFrame({'players name': ['ram', 'john', 'ismael', 'sam', 'karan'],
                    'hobbies': ['jog', '', 'photos', '', 'studying'],
                    'sports': ['cricket', 'basketball', 'chess', 'kabadi', 'volleyball']})
df1:
players name hobbies sports
0 ram jog cricket
1 john basketball
2 ismael photos chess
3 sam kabadi
4 karan studying volleyball
And df2:
df2 = pd.DataFrame({'players name': ['jagan', 'mohan', 'john', 'sam', 'karan'],
                    'hobbies': ['riding', 'tv', 'sliding', 'jumping', 'studying']})
df2:
players name hobbies
0 jagan riding
1 mohan tv
2 john sliding
3 sam jumping
4 karan studying
I want output like this:
Try this:
df1['hobbies'] = (df1['players name'].map(df2.set_index('players name')['hobbies'])
                                     .fillna(df1['hobbies']))
df1
Output:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball
If the blank space is a NaN value:
import numpy as np

df1 = pd.DataFrame({"players name": ["ram", "john", "ismael", "sam", "karan"],
                    "hobbies": ["jog", np.nan, "photos", np.nan, "studying"],  # pd.np was removed in pandas 2.0; use numpy directly
                    "sports": ["cricket", "basketball", "chess", "kabadi", "volleyball"]})
then
dicts = df2.set_index("players name")['hobbies'].to_dict()
df1['hobbies'] = df1['hobbies'].fillna(df1['players name'].map(dicts))
output:
players name hobbies sports
0 ram jog cricket
1 john sliding basketball
2 ismael photos chess
3 sam jumping kabadi
4 karan studying volleyball
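Note that the map + fillna approach above replaces a player's hobby whenever the name exists in df2. If you want to fill only the blank cells and never touch an existing value, .mask on the empty strings is one option (the _new suffix is my own naming, not from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'players name': ['ram', 'john', 'ismael', 'sam', 'karan'],
                    'hobbies': ['jog', '', 'photos', '', 'studying'],
                    'sports': ['cricket', 'basketball', 'chess', 'kabadi', 'volleyball']})
df2 = pd.DataFrame({'players name': ['jagan', 'mohan', 'john', 'sam', 'karan'],
                    'hobbies': ['riding', 'tv', 'sliding', 'jumping', 'studying']})

# Bring df2's hobbies alongside df1's rows, then substitute only where df1 is blank
merged = df1.merge(df2, on='players name', how='left', suffixes=('', '_new'))
df1['hobbies'] = df1['hobbies'].mask(df1['hobbies'] == '', merged['hobbies_new'])
print(df1)
```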
