group by to create multiple files - pandas-groupby

I have written code using pandas groupby and it is working.
My question is: how can I save each group to a separate Excel file?
For example, if you have a group of fruits ['apple', 'grapes', ..., 'mango'],
I want to save apple in one Excel file and grapes in a different one.
import pandas as pd

df = pd.read_excel('C://Desktop/test/file.xlsx')
g = df.groupby('fruits')
for fruits, fruits_g in g:
    print(fruits)
    print(fruits_g)
Mango
   name   id  purchase fruits
1  john  877         2  Mango
apple
  name   id  purchase fruits
0  ram  654         5  apple
3  Sam  546         5  apple
BlueB
    name   id  purchase fruits
7  david  767         9  black
grapes
  name   id  purchase  fruits
2  Dan  454         1  grapes
4  sys  890         7  grapes
mango
   name   id  purchase fruits
5  baka  786         6  mango
strawB
     name   id  purchase fruits
6  silver  887         9  straw
How can I create an Excel file for each group of fruit?

This can be accomplished using pandas.DataFrame.to_excel:
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["apple", "orange", "banana", "apple", "orange"],
    "Name": ["John", "Sam", "David", "Rebeca", "Sydney"],
    "ID": [877, 546, 767, 887, 890],
    "Purchase": [1, 2, 5, 6, 4]
})
grouped = df.groupby("Fruit")

# run this to generate separate Excel files
for fruit, group in grouped:
    group.to_excel(excel_writer=f"{fruit}.xlsx", sheet_name=fruit, index=False)

# run this to generate a single Excel file with separate sheets
with pd.ExcelWriter("fruits.xlsx") as writer:
    for fruit, group in grouped:
        group.to_excel(excel_writer=writer, sheet_name=fruit, index=False)
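As a quick check (assuming an Excel engine such as openpyxl is installed, which writing .xlsx files also requires), the multi-sheet workbook can be read back with pandas.read_excel:
# sheet_name=None returns a dict of DataFrames keyed by sheet name
sheets = pd.read_excel("fruits.xlsx", sheet_name=None)
print(list(sheets))      # ['apple', 'banana', 'orange']
print(sheets["apple"])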

Related

Pandas - I have a dataset where the columns are country, company and total employees. I need a dataframe of total employees in each company by country

There are 8 companies in total and around 30-40 countries. I need a dataframe showing the total number of employees in each company by country.
Sounds like you want to use pandas' groupby feature. I'm not sure what type of data you have or what result you want, so here are some toy examples:
import pandas as pd

df = pd.DataFrame({'company': ["A", "A", "B"], 'country': ["USA", "USA", "USA"], 'employees': [10, 20, 50]})
dfg = df.groupby(['company', 'country'], as_index=False)['employees'].sum()
print(dfg)
#   company country  employees
# 0       A     USA         30
# 1       B     USA         50

df = pd.DataFrame({'company': ["A", "A", "A"], 'country': ["USA", "USA", "Japan"], 'employees': ['Art', 'Bob', 'Chris']})
dfg = df.groupby(['company', 'country'], as_index=False)['employees'].count()
print(dfg)
#   company country  employees
# 0       A   Japan          1
# 1       A     USA          2
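If a wide, country-by-company view is preferred instead of the long format above, a pivot_table sketch on the same kind of toy data could look like this:
# re-using the first (numeric) toy dataset from above
df = pd.DataFrame({'company': ["A", "A", "B"], 'country': ["USA", "USA", "USA"], 'employees': [10, 20, 50]})
wide = df.pivot_table(index='country', columns='company', values='employees', aggfunc='sum', fill_value=0)
print(wide)
# company   A   B
# country
# USA      30  50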

list of visited interval

Given the following dataframe
import pandas as pd
df = pd.DataFrame({'visited': ['2015-3-1', '2015-3-5','2015-3-6','2016-3-4', '2016-3-6', '2016-3-8'],'name':['John','John','John','Mary','Mary','Mary']})
df['visited']=pd.to_datetime(df['visited'])
visited name
0 2015-03-01 John
1 2015-03-05 John
2 2015-03-06 John
3 2016-03-04 Mary
4 2016-03-06 Mary
5 2016-03-08 Mary
I wish to get the list of visit intervals for the two people; in this example, the outcome should be
  avg_visited_interval  name
0               [4, 1]  John
1               [2, 2]  Mary
How should I achieve this?
(e.g., for John there are 4 days between rows 0 and 1 and 1 day between rows 1 and 2, which results in [4, 1])
Use a custom lambda function with Series.diff, remove the first value by position, and convert to integers and lists:
df = (df.groupby('name')['visited']
        .apply(lambda x: x.diff().iloc[1:].dt.days.astype(int).tolist())
        .reset_index(name='intervals'))
print(df)
   name intervals
0  John    [4, 1]
1  Mary    [2, 2]
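If the average interval (as the avg_visited_interval column name in the question suggests) were wanted instead of the list, the same diff could be averaged; a small sketch, applied to the original df from the question before the reassignment above:
avg = (df.groupby('name')['visited']
         .apply(lambda x: x.diff().dt.days.mean())
         .reset_index(name='avg_visited_interval'))
print(avg)
#    name  avg_visited_interval
# 0  John                   2.5
# 1  Mary                   2.0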

pandas help: map and match tab delimited strings in a column and print into a new column

I have a dataframe data whose last column contains a bunch of strings and digits, and I have another dataframe info that explains what those strings and digits mean. I want to map the user input (items) against info, match and print how many of them are present in the last column of data, count them, and prioritize the dataframe data based on the number of matches.
import pandas as pd

# data
data = {'id': [123, 456, 789, 1122, 3344],
        'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
        'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890', 'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005']}
test_data = pd.DataFrame(data)

# info
info = {'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'], 'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango']}
test_info = pd.DataFrame(info)
user input example:
run.py apple mango
desired output:
  id    Name  MP-ID                   match          count
  3344  nop   MP:08597|MP:001|MP:005  MP:001|MP:005  2
  123   abc   MP:001|MP:0085|MP:0985  MP:001         1
  456   def   MP:005|MP:0258          MP:005         1
  789   hij   MP:025|MP:5890                         0
  1122  klm   MP:0589|MP:02546                       0
Thank you for your help in advance
First get all command-line arguments into the variable vals, filter MP-ID with Series.isin and DataFrame.loc, extract the matches with Series.str.findall and Series.str.join, and finally use Series.str.count with DataFrame.sort_values:
import sys

vals = sys.argv[1:]
# vals = ['apple', 'mango']

s = test_info.loc[test_info['Item'].isin(vals), 'MP-ID']
test_data['MP-ID match'] = test_data['MP-ID'].str.findall('|'.join(s)).str.join('|')
test_data['count'] = test_data['MP-ID match'].str.count('MP')
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print(test_data)
     id Name                   MP-ID    MP-ID match  count
0  3344  nop  MP:08597|MP:001|MP:005  MP:001|MP:005      2
1   123  abc  MP:001|MP:0085|MP:0985         MP:001      1
2   456  def          MP:005|MP:0258         MP:005      1
3   789  hij          MP:025|MP:5890                     0
4  1122  klm        MP:0589|MP:02546                     0
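With the script saved as run.py, it would be invoked as in the question, e.g. python run.py apple mango. Note that if no arguments are passed, vals is empty and '|'.join(s) becomes an empty pattern, so a guard against an empty vals may be worth adding.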

How can I create an aggregate/summary pandas dataframe based on overlapping dates derived from a more specific dataframe?

The following dataframe shows trips made by employees of different companies:
source:
import pandas as pd

emp_trips = {'Name': ['Bob', 'Joe', 'Sue', 'Jack', 'Henry', 'Frank', 'Lee', 'Jack'],
             'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
             'Depart': ['01/01/2020', '01/01/2020', '01/06/2020', '01/01/2020', '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
             'Return': ['01/31/2020', '02/15/2020', '02/20/2020', '03/01/2020', '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
             'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87]
             }
df = pd.DataFrame(emp_trips, columns=['Name', 'Company', 'Depart', 'Return', 'Charges'])
# Convert to date format
df['Return'] = pd.to_datetime(df['Return'])
df['Depart'] = pd.to_datetime(df['Depart'])
output:
Name Company Depart Return Charges
0 Bob ABC 2020-01-01 2020-01-31 10.10
1 Joe ABC 2020-01-01 2020-02-15 20.25
2 Sue ABC 2020-01-06 2020-02-20 30.32
3 Jack HIJ 2020-01-01 2020-03-01 40.00
4 Henry HIJ 2020-05-01 2020-05-05 50.01
5 Frank DEF 2020-01-13 2020-01-15 60.32
6 Lee DEF 2020-01-12 2020-01-30 70.99
7 Jack DEF 2020-01-14 2020-02-02 80.87
How can I create another dataframe based on the following aspects:
The original dataframe is based on employee names/trips.
The generated dataframe will be based on companies grouped by overlapping dates.
The 'Name' column will not be included as it is no longer needed.
The 'Company' column will remain.
The 'Depart' date will be the earliest date among any overlapping trip dates.
The 'Return' date will be the latest date among any overlapping trip dates.
Any company trips that do not have overlapping dates will be its own entry/row.
The 'Charges' for each trip will be totaled for the new company entry.
Here is the desired output of the new dataframe:
Company Depart Return Charges
0 ABC 01/01/2020 02/20/2020 60.67
1 HIJ 01/01/2020 03/01/2020 40.00
2 HIJ 05/01/2020 05/05/2020 50.01
3 DEF 01/12/2020 02/02/2020 212.18
I've looked into the following as possible solutions:
Create a hierarchical index based on the company and date. As I worked through this, I realized that all this really does is create a hierarchical index but that's based on the specific columns. Also, this method won't aggregate the individual rows into summary rows.
df1 = df.set_index(['Company', not exactly sure how to say overlapping dates])
I also tried using timedelta, but it resulted in True/False values in a separate column, and I'm not entirely sure how that would be used to combine rows by overlapping dates and company. Also, I don't think a plain groupby('Company') works, since there could be separate, non-overlapping trips that would require their own rows.
from datetime import timedelta

df['trips_overlap'] = (df.groupby('Company')
                         .apply(lambda x: (x['Return'].shift() - x['Depart']) > timedelta(0))
                         .reset_index(level=0, drop=True))
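One possible approach (a sketch, not a verified solution): sort each company's trips by Depart, track the latest Return seen so far within the company, and start a new group whenever a trip departs after that running maximum, then aggregate each group:
import pandas as pd

df = df.sort_values(['Company', 'Depart'])
# running latest return date within each company (includes the current row)
running_return = df.groupby('Company')['Return'].cummax()
# a new block starts when the current trip departs after every earlier trip has returned
new_block = df['Depart'] > running_return.groupby(df['Company']).shift()
block_id = new_block.groupby(df['Company']).cumsum()

summary = (df.groupby(['Company', block_id], sort=False)
             .agg(Depart=('Depart', 'min'), Return=('Return', 'max'), Charges=('Charges', 'sum'))
             .reset_index(level=1, drop=True)
             .reset_index())
print(summary)
#   Company     Depart     Return  Charges
# 0     ABC 2020-01-01 2020-02-20    60.67
# 1     DEF 2020-01-12 2020-02-02   212.18
# 2     HIJ 2020-01-01 2020-03-01    40.00
# 3     HIJ 2020-05-01 2020-05-05    50.01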

pandas data frame: efficiently remove duplicates and keep the record with the largest int value

I have a data frame with two columns, NAME and VALUE, where NAME contains duplicates and VALUE contains ints. I would like to efficiently drop duplicate records of column NAME while keeping the record with the largest VALUE. I figured out how to do it in two steps, sort and drop duplicates, but I am new to pandas and am curious if there is a more efficient way to achieve this with the query function?
import pandas
import io
import json
input = """
KEY VALUE
apple 0
apple 1
apple 2
bannana 0
bannana 1
bannana 2
pear 0
pear 1
pear 2
pear 3
orange 0
orange 1
orange 2
orange 3
orange 4
"""
df = pandas.read_csv(io.StringIO(input), delim_whitespace=True, header=0)
df = df[['KEY', 'VALUE']].sort_values(by=['VALUE']).drop_duplicates(subset='KEY', keep='last')
dicty = dict(zip(df['KEY'], df['VALUE']))
print(json.dumps(dicty, indent=4))
Running this yields the expected output:
{
    "apple": 2,
    "bannana": 2,
    "pear": 3,
    "orange": 4
}
Is there a more efficient way to achieve this transformation with pandas?
df = pandas.read_csv(io.StringIO(input), delim_whitespace=True, header=0)
df.groupby('KEY')['VALUE'].max()
If your input needs to be a dictionary, just add to_dict():
df.groupby('KEY')['VALUE'].max().to_dict()
You can also try:
[*df.groupby('KEY',sort=False).last().to_dict().values()][0]
{'apple': 2, 'bannana': 2, 'pear': 3, 'orange': 4}
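If more columns than VALUE had to be preserved, one common idiom (a sketch, not needed for the two-column case above) is to keep the full row holding each group's maximum via idxmax:
# selects, for each KEY, the row whose VALUE is largest
best = df.loc[df.groupby('KEY')['VALUE'].idxmax()]
print(best)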
