Pandas - I have a dataset where the columns are country, company and total employees. I need a dataframe for total employees in each company by country - python-3.x

There are 8 companies in total and around 30-40 countries. I need a dataframe that shows the total number of employees in each company by country.

Sounds like you want pandas' groupby feature. I'm not sure what type of data you have or what result you want, so here are some toy examples:
import pandas as pd

df = pd.DataFrame({'company': ["A", "A", "B"], 'country': ["USA", "USA", "USA"], 'employees': [10, 20, 50]})
dfg = df.groupby(['company', 'country'], as_index=False)['employees'].sum()
print(dfg)
# company country employees
# 0 A USA 30
# 1 B USA 50
df = pd.DataFrame({'company': ["A", "A", "A"], 'country': ["USA", "USA", "Japan"], 'employees': ['Art', 'Bob', 'Chris']})
dfg = df.groupby(['company', 'country'], as_index=False)['employees'].count()
print(dfg)
# company country employees
# 0 A Japan 1
# 1 A USA 2
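If the desired result is a company-by-country matrix rather than a long table, the same groupby sum can be unstacked. Another toy sketch (the column names follow the question, the data is made up):

```python
import pandas as pd

df = pd.DataFrame({'company': ["A", "A", "B", "B"],
                   'country': ["USA", "Japan", "USA", "USA"],
                   'employees': [10, 20, 50, 5]})
# Sum per (company, country), then pivot countries into columns;
# fill_value=0 marks combinations with no rows.
table = df.groupby(['company', 'country'])['employees'].sum().unstack(fill_value=0)
print(table)
# e.g. table.loc['A', 'Japan'] is 20, table.loc['B', 'Japan'] is 0
```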

Related

Python - Transpose/Pivot a column based on a different column

I searched it and indeed I found a lot of similar questions but none of those seemed to answer my case.
I have a pandas DataFrame which is a joined table consisting of products and the countries in which they are sold.
It's 3000 rows and 50 columns in size.
I'm uploading a photo (only part of the df) of the current situation I'm in now and the expected result I want to achieve.
I want to transpose the 'Country name' column into rows grouped by the 'Product code' column. Please note that the new country columns are not limited to a certain number of countries (some products have 3, some 40).
Thank you!
Use .cumcount() to count the number of countries that a product has.
Then use .pivot() to get your dataframe in the right shape:
import pandas as pd

df = pd.DataFrame({
    'Country': ['NL', 'Poland', 'Spain', 'Sweden', 'China', 'Egypt'],
    'Product Code': ['123', '123', '115', '115', '117', '118'],
    'Product Name': ['X', 'X', 'Y', 'Y', 'Z', 'W'],
})
df['cumcount'] = df.groupby(['Product Code', 'Product Name'])['Country'].cumcount() + 1
df_pivot = df.pivot(
    index=['Product Code', 'Product Name'],
    columns='cumcount',
    values='Country',
).add_prefix('country_')
Resulting dataframe:
cumcount                   country_1 country_2
Product Code Product Name
115          Y             Spain     Sweden
117          Z             China     NaN
118          W             Egypt     NaN
123          X             NL        Poland
Try this:
df_out = df.set_index(['Product code',
                       'Product name',
                       df.groupby('Product code').cumcount() + 1]).unstack()
df_out.columns = [f'Country_{j}' for _, j in df_out.columns]
df_out.reset_index()
Output:
Product code Product name Country_1 Country_2 Country_3
0 AAA115 Y Sweden China NaN
1 AAA117 Z Egypt Greece NaN
2 AAA118 W France Italy NaN
3 AAA123 X Netherlands Poland Spain
Details:
Reshape the dataframe with set_index and unstack, using cumcount to create the country columns, then flatten the MultiIndex header with a list comprehension.

How can I create an aggregate/summary pandas dataframe based on overlapping dates derived from a more specific dataframe?

The following dataframe shows trips made by employees of different companies:
source:
import pandas as pd
emp_trips = {'Name': ['Bob', 'Joe', 'Sue', 'Jack', 'Henry', 'Frank', 'Lee', 'Jack'],
             'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
             'Depart': ['01/01/2020', '01/01/2020', '01/06/2020', '01/01/2020', '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
             'Return': ['01/31/2020', '02/15/2020', '02/20/2020', '03/01/2020', '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
             'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87]
             }
df = pd.DataFrame(emp_trips, columns = ['Name', 'Company', 'Depart', 'Return', 'Charges'])
# Convert to date format
df['Return']= pd.to_datetime(df['Return'])
df['Depart']= pd.to_datetime(df['Depart'])
output:
Name Company Depart Return Charges
0 Bob ABC 2020-01-01 2020-01-31 10.10
1 Joe ABC 2020-01-01 2020-02-15 20.25
2 Sue ABC 2020-01-06 2020-02-20 30.32
3 Jack HIJ 2020-01-01 2020-03-01 40.00
4 Henry HIJ 2020-05-01 2020-05-05 50.01
5 Frank DEF 2020-01-13 2020-01-15 60.32
6 Lee DEF 2020-01-12 2020-01-30 70.99
7 Jack DEF 2020-01-14 2020-02-02 80.87
How can I create another dataframe based on the following aspects:
The original dataframe is based on employee names/trips.
The generated dataframe will be based on companies grouped by overlapping dates.
The 'Name' column will not be included as it is no longer needed.
The 'Company' column will remain.
The 'Depart' date will be the earliest date among any set of overlapping trips.
The 'Return' date will be the latest date among any set of overlapping trips.
Any company trips that do not have overlapping dates will be its own entry/row.
The 'Charges' for each trip will be totaled for the new company entry.
Here is the desired output of the new dataframe:
Company Depart Return Charges
0 ABC 01/01/2020 02/20/2020 60.67
1 HIJ 01/01/2020 03/01/2020 40.00
2 HIJ 05/01/2020 05/05/2020 50.01
3 DEF 01/12/2020 02/02/2020 212.18
I've looked into the following as possible solutions:
Create a hierarchical index based on the company and date. As I worked through this, I realized that all this really does is create a hierarchical index but that's based on the specific columns. Also, this method won't aggregate the individual rows into summary rows.
df1 = df.set_index(['Company', not exactly sure how to say overlapping dates])
I also tried using timedelta, but it resulted in True/False values in a separate column, and I'm not entirely sure how that would be used to combine rows based on overlapping dates and company. Also, I don't think groupby('Company') alone works, since a company can have separate, non-overlapping trips that would require their own rows.
from datetime import timedelta

df['trips_overlap'] = (df.groupby('Company')
                         .apply(lambda x: (x['Return'].shift() - x['Depart']) > timedelta(0))
                         .reset_index(level=0, drop=True))
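The thread shows no answer, but one possible approach (my own sketch, not from the thread) is: sort each company's trips by departure, start a new block whenever a trip departs after every earlier trip of that company has already returned, then take min/max/sum within each block.

```python
import pandas as pd

df = pd.DataFrame({
    'Name':    ['Bob', 'Joe', 'Sue', 'Jack', 'Henry', 'Frank', 'Lee', 'Jack'],
    'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
    'Depart':  ['01/01/2020', '01/01/2020', '01/06/2020', '01/01/2020',
                '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
    'Return':  ['01/31/2020', '02/15/2020', '02/20/2020', '03/01/2020',
                '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
    'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87],
})
df['Depart'] = pd.to_datetime(df['Depart'])
df['Return'] = pd.to_datetime(df['Return'])

df = df.sort_values(['Company', 'Depart'])
# Latest return seen so far among the company's earlier trips
prev_max_return = (df.groupby('Company')['Return']
                     .transform(lambda s: s.cummax().shift()))
# A trip starts a new block when it departs after all earlier trips returned
df['block'] = ((df['Depart'] > prev_max_return)
               .astype(int).groupby(df['Company']).cumsum())

out = (df.groupby(['Company', 'block'], as_index=False)
         .agg(Depart=('Depart', 'min'),
              Return=('Return', 'max'),
              Charges=('Charges', 'sum'))
         .drop(columns='block'))
print(out)
# ABC collapses to one row (Charges 60.67), DEF to one row (212.18),
# HIJ keeps two rows because its trips do not overlap
```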

Pandas Group-By and Calculate Ratio of Two Columns

I'm trying to use pandas and groupby to calculate the ratio of two columns. In the example below I want to calculate the ratio of staff Status per Department (number of each Status in a department divided by the total number of employees in that department). For example, the Sales department has 3 employees in total and 2 of them have Employee status, which gives a ratio of 2/3, or 66.67%. I managed to hack my way through to get this, but there must be a more elegant and simple way. How can I get the desired output below more efficiently?
Original DataFrame:
Department Name Status
0 Sales John Employee
1 Sales Steve Employee
2 Sales Sara Contractor
3 Finance Allen Contractor
4 Marketing Robert Employee
5 Marketing Lacy Contractor
Code:
import pandas as pd

mydict = {
    'Name': ['John', 'Steve', 'Sara', 'Allen', 'Robert', 'Lacy'],
    'Department': ['Sales', 'Sales', 'Sales', 'Finance', 'Marketing', 'Marketing'],
    'Status': ['Employee', 'Employee', 'Contractor', 'Contractor', 'Employee', 'Contractor']
}
df = pd.DataFrame(mydict)
# Create column with total number of staff Status per Department
df['total_dept'] = df.groupby(['Department'])['Name'].transform('count')
print(df)
print('\n')
# Create column with Status ratio per department
for k, v in df.iterrows():
    df.loc[k, 'Status_Ratio'] = (df.groupby(['Department', 'Status']).count()
                                   .xs(v['Status'], level=1)['total_dept'][v['Department']]
                                 / v['total_dept']) * 100
print(df)
print('\n')
# Final Groupby with Status Ratio. Size NOT needed
print(df.groupby(['Department', 'Status', 'Status_Ratio']).size())
Desired Output:
Department Status Status_Ratio
Finance Contractor 100.00
Marketing Contractor 50.00
Employee 50.00
Sales Contractor 33.33
Employee 66.67
Try (with the original df):
df.groupby("Department")["Status"].value_counts(normalize=True).mul(100)
Outputs:
Department Status
Finance Contractor 100.000000
Marketing Contractor 50.000000
Employee 50.000000
Sales Employee 66.666667
Contractor 33.333333
Name: Status, dtype: float64
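If you want the percentages back as regular columns (as in the desired output) rather than a MultiIndexed Series, renaming the Series before reset_index avoids a name clash with the 'Status' index level. A small sketch building on the same idea:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Steve', 'Sara', 'Allen', 'Robert', 'Lacy'],
    'Department': ['Sales', 'Sales', 'Sales', 'Finance', 'Marketing', 'Marketing'],
    'Status': ['Employee', 'Employee', 'Contractor', 'Contractor', 'Employee', 'Contractor'],
})
ratios = (df.groupby('Department')['Status']
            .value_counts(normalize=True)   # share of each Status per Department
            .mul(100)
            .round(2)
            .rename('Status_Ratio')         # avoid clashing with the 'Status' level
            .reset_index())
print(ratios)
# e.g. Sales/Employee comes out as 66.67, Finance/Contractor as 100.0
```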

Handling duplicate data with pandas

Hello everyone, I'm having some issues with the pandas Python library. Basically I'm reading a CSV file with pandas and want to remove duplicates. I've tried everything and the problem is still there.
import sqlite3
import pandas as pd
import numpy
connection = sqlite3.connect("test.db")
## pandas dataframe
dataframe = pd.read_csv('Countries.csv')
##dataframe.head(3)
countries = dataframe.loc[:, ['Retailer country', 'Continent']]
countries.head(6)
Output of this will be:
Retailer country Continent
-----------------------------
0 United States North America
1 Canada North America
2 Japan Asia
3 Italy Europe
4 Canada North America
5 United States North America
6 France Europe
I want to drop duplicate values based on the columns in the dataframe above, so that I am left with the unique country/continent pairs. The desired output is:
Retailer country Continent
-----------------------------
0 United States North America
1 Canada North America
2 Japan Asia
3 Italy Europe
4 France Europe
I have tried some methods mentioned there (Using pandas for duplicate values), looked around the net, and realized I could use the df.drop_duplicates() function. But when I use the code below and the df.head(3) function, it displays only one row. What can I do to get those unique rows and finally loop through them?
countries.head(4)
country = countries['Retailer country']
continent = countries['Continent']
df = pd.DataFrame({'a':[country], 'b':[continent]})
df.head(3)
It seems like a simple group-by could solve your problem.
import pandas as pd
na = 'North America'
a = 'Asia'
e = 'Europe'
df = pd.DataFrame({'Retailer': [0, 1, 2, 3, 4, 5, 6],
                   'country': ['United States', 'Canada', 'Japan', 'Italy', 'Canada', 'United States', 'France'],
                   'continent': [na, na, a, e, na, na, e]})
df.groupby(['country', 'continent']).agg('count').reset_index()
The Retailer column now shows a count of the number of times each country/continent combination occurs. You can drop it with df = df[['country', 'continent']].
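For what it's worth, the asker's drop_duplicates attempt failed because pd.DataFrame({'a': [country], 'b': [continent]}) wraps each whole Series in a one-element list, producing a single-row frame. Applied directly to the original two-column frame, drop_duplicates does exactly what was asked. A sketch with the sample data:

```python
import pandas as pd

countries = pd.DataFrame({
    'Retailer country': ['United States', 'Canada', 'Japan', 'Italy',
                         'Canada', 'United States', 'France'],
    'Continent': ['North America', 'North America', 'Asia', 'Europe',
                  'North America', 'North America', 'Europe'],
})
# Keep the first occurrence of each (country, continent) pair
unique_pairs = countries.drop_duplicates().reset_index(drop=True)
for _, row in unique_pairs.iterrows():
    print(row['Retailer country'], '-', row['Continent'])
```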

group by to create multiple files

I have written code using pandas groupby and it is working.
My question is: how can I save each group to a separate Excel file?
For example, if you have groups of fruits ['apple', 'grapes', ..., 'mango'],
I want to save apple in one Excel file and grapes in a different one.
import pandas as pd
df = pd.read_excel('C://Desktop/test/file.xlsx')
g = df.groupby('fruits')
for fruits, fruits_g in g:
    print(fruits)
    print(fruits_g)
Mango
name id purchase fruits
1 john 877 2 Mango
apple
name id purchase fruits
0 ram 654 5 apple
3 Sam 546 5 apple
BlueB
name id purchase fruits
7 david 767 9 black
grapes
name id purchase fruits
2 Dan 454 1 grapes
4 sys 890 7 grapes
mango
name id purchase fruits
5 baka 786 6 mango
strawB
name id purchase fruits
6 silver 887 9 straw
How can I create an Excel file for each group of fruit?
This can be accomplished using pandas.DataFrame.to_excel:
import pandas as pd
df = pd.DataFrame({
    "Fruit": ["apple", "orange", "banana", "apple", "orange"],
    "Name": ["John", "Sam", "David", "Rebeca", "Sydney"],
    "ID": [877, 546, 767, 887, 890],
    "Purchase": [1, 2, 5, 6, 4]
})
grouped = df.groupby("Fruit")
# run this to generate separate Excel files
for fruit, group in grouped:
    group.to_excel(excel_writer=f"{fruit}.xlsx", sheet_name=fruit, index=False)
# run this to generate a single Excel file with separate sheets
with pd.ExcelWriter("fruits.xlsx") as writer:
    for fruit, group in grouped:
        group.to_excel(excel_writer=writer, sheet_name=fruit, index=False)
