Pandas Top n % of grouped sum - python-3.x

I work for a company and am trying to calculate which products produced the top 80% of Gross Revenue in different years.
Here is a short example of my data:
Part_no Revision Gross_Revenue Year
1 a 1 2014
2 a 2 2014
3 c 2 2014
4 c 2 2014
5 d 2 2014
I've been looking through various answers, and here's the best code I've come up with, but it isn't working:
df1 = df[['Year', 'Part_No', 'Revision', 'Gross_Revenue']]
df1 = df1.groupby(['Year', 'Part_No', 'Revision']).agg({'Gross_Revenue': 'sum'})
# print(df1.head())
a = 0.8
df2 = (df1.sort_values('Gross_Revenue', ascending=False)
          .groupby(['Year', 'Part_No', 'Revision'], group_keys=False)
          .apply(lambda x: x.head(int(len(x) * a)))
          .reset_index(drop=True))
print(df2)
I'm trying to have the code return, for each year, all the top products that brought in 80% of our company's revenue.
I suspect it's the old 80/20 rule.
Thank you for your help,
Me

You can use cumsum:
df[df.groupby('Year').Gross_Revenue.cumsum()
     .div(df.groupby('Year').Gross_Revenue.transform('sum'), axis=0) < 0.8]
Out[589]:
Part_no Revision Gross_Revenue Year
1 2 a 2 2014
2 3 c 2 2014
3 4 c 2 2014
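Note that the cumulative-sum trick relies on rows being ordered from highest to lowest Gross_Revenue within each year (the question's own attempt sorts this way). A self-contained sketch that sorts first and reproduces the output above, using the sample data from the question:
import pandas as pd

df = pd.DataFrame({'Part_no': [1, 2, 3, 4, 5],
                   'Revision': ['a', 'a', 'c', 'c', 'd'],
                   'Gross_Revenue': [1, 2, 2, 2, 2],
                   'Year': [2014, 2014, 2014, 2014, 2014]})

# sort the highest-grossing parts first within each year
df_sorted = df.sort_values(['Year', 'Gross_Revenue'], ascending=[True, False])

# running share of each year's total revenue; keep rows while the share is under 80%
share = (df_sorted.groupby('Year')['Gross_Revenue'].cumsum()
                  .div(df_sorted.groupby('Year')['Gross_Revenue'].transform('sum')))
print(df_sorted[share < 0.8])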

Related

pandas: search column values from one df in another df column that contains lists

I need to search for the values from the df1['numsearch'] column in the lists in df2['Numbers']. If a number is in one of those lists, I want to add the corresponding value from the df2['Score'] column to df1. See the desired output below.
import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {'Day': ['M', 'Tu', 'W', 'Th', 'Fr', 'Sa', 'Su'],
     'numsearch': ['1', '20', '14', '99', '19', '6', '101']
     })
df2 = pd.DataFrame(
    {'Letters': ['a', 'b', 'c', 'd'],
     'Numbers': [['1', '2', '3', '4'], ['5', '6', '7', '8'], ['10', '20', '30', '40'], ['11', '12', '13', '14']],
     'Score': ['1.1', '2.2', '3.3', '4.4']})
desired output
Day numsearch Score
0 M 1 1.1
1 Tu 20 3.3
2 W 14 4.4
3 Th 99 "No score"
4 Fr 19 "No score"
5 Sa 6 2.2
6 Su 101 "No score"
I have written a for loop that works with the test data.
scores = []
for s, ns in enumerate(ppr_data['SN']):
    match = ''
    for k, q in enumerate(jcr_data['All_ISSNs']):
        if ns in q:
            scores.append(jcr_data['Journal Impact Factor'][k])
            match = 1
        else:
            continue
    if match == "":
        scores.append('No score')
        match = ""
df1['Score'] = np.array(scores)
The above code works in my small test, but with larger data files it creates duplicates, so this clearly isn't the best way to do it.
I'm sure there's a more pandas-proper line of code that ends in .fillna("No score").
I tried to use a loc statement, but I get hung up on searching the values of one dataframe in a column that contains lists.
Can anyone shed some light?
df2=df2.explode('Numbers')#Explode df2 on Numbers
d=dict(zip(df2.Numbers, df2.Score))#dict Numbers and Scores
df1['Score']=df1.numsearch.map(d).fillna('No Score')#Map dict to df1 filling NaN with No Score
You can shorten it as follows:
df2=df2.explode('Numbers')#Explode df2 on Numbers
df1['Score']=df1.numsearch.map(dict(zip(df2.Numbers, df2.Score))).fillna('No Score')
Day numsearch Score
0 M 1 1.1
1 Tu 20 3.3
2 W 14 4.4
3 Th 99 No Score
4 Fr 19 No Score
5 Sa 6 2.2
6 Su 101 No Score
You can try a left join and fillna:
df1.merge(df2.explode('Numbers'),
          left_on='numsearch',
          right_on='Numbers',
          how='left')[['Day', 'numsearch', 'Score']].fillna("No score")
Output:
Day numsearch Score
0 M 1 1.1
1 Tu 20 3.3
2 W 14 4.4
3 Th 99 No score
4 Fr 19 No score
5 Sa 6 2.2
6 Su 101 No score
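As in the original loop, a numsearch value that appears in more than one of the df2['Numbers'] lists can cause trouble: the merge approach produces duplicate rows, and a dict built with zip silently keeps the last match. A minimal guard, assuming each number should map to only one score, is to de-duplicate the exploded frame first:
df2_flat = df2.explode('Numbers').drop_duplicates(subset='Numbers', keep='first')
df1['Score'] = df1['numsearch'].map(dict(zip(df2_flat.Numbers, df2_flat.Score))).fillna('No Score')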

Read excel and reformat the multi-index headers in Pandas

Given an Excel file with the following format:
Reading with pd.read_clipboard, I get:
year 2018 Unnamed: 2 2019 Unnamed: 4
0 city quantity price quantity price
1 bj 10 2 4 7
2 sh 6 8 3 4
Just wondering if it's possible to convert to the following format with Pandas:
year city quantity price
0 2018 bj 10 2
1 2019 bj 4 7
2 2018 sh 6 8
3 2019 sh 3 4
I think the best approach here is to read the Excel file into a DataFrame with a MultiIndex in the columns and the first column as the index:
df = pd.read_excel(file, header=[0,1], index_col=[0])
print (df)
year 2018 2019
city quantity price quantity price
bj 10 2 4 7
sh 6 8 3 4
print (df.columns)
MultiIndex([('2018', 'quantity'),
('2018', 'price'),
('2019', 'quantity'),
('2019', 'price')],
names=['year', 'city'])
Then reshape with DataFrame.stack, change the order of the levels with DataFrame.swaplevel, set the index and column names with DataFrame.rename_axis, convert the index back to columns, and, if necessary, convert year to integers:
df1 = (df.stack(0)
         .swaplevel(0, 1)
         .rename_axis(index=['year', 'city'], columns=None)
         .reset_index()
         .assign(year=lambda x: x['year'].astype(int)))
print (df1)
year city price quantity
0 2018 bj 2 10
1 2019 bj 7 4
2 2018 sh 8 6
3 2019 sh 4 3
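Note that the value columns come out sorted after stack, which is why price appears before quantity in the printed result. If you want the original quantity/price order from the question, an optional extra step:
df1 = df1[['year', 'city', 'quantity', 'price']]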

pandas remove records conditionally based on records count of groups

I have a dataframe like this
import pandas as pd
import numpy as np
raw_data = {'Country':['UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK'],
'Product':['A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','D','D','D','D','D','D'],
'Week': [1,2,3,4,1,2,3,4,5,6,7,8,1,2,3,1,2,3,4,5,6],
'val': [5,4,3,1,5,6,7,8,9,10,11,12,5,5,5,5,6,7,8,9,10]
}
df2 = pd.DataFrame(raw_data, columns = ['Country','Product','Week', 'val'])
print(df2)
and mapping dataframe
mapping = pd.DataFrame({'Product':['A','C'],'Product1':['B','D']}, columns = ['Product','Product1'])
I want to compare products according to the mapping: product A's data should be matched with product B's data. The logic is that product A has 4 records, so product B should also keep 4 records, and those records should come from the weeks before and after product A's last week number, including that week itself. Product A's last week is 4, so product B keeps one week before (week 3), week 4, and the two weeks after (weeks 5 and 6).
Similarly, product C has 3 records, so product D should also keep 3 records, taken from the weeks around product C's last week. Product C's last week is 3, so product D keeps weeks 2, 3 and 4.
The desired data frame was shown in a screenshot (omitted here), with the records I want to remove highlighted in yellow.
Define the following function selecting rows from df, for products from
the current row in mapping:
def selRows(row, df):
    rows_1 = df[df.Product == row.Product]
    nr_1 = rows_1.index.size
    lastWk_1 = rows_1.Week.iat[-1]
    rows_2 = df[df.Product.eq(row.Product1) & df.Week.ge(lastWk_1 - 1)].iloc[:nr_1]
    return pd.concat([rows_1, rows_2])
Then call it the following way:
result = pd.concat([selRows(row, grp)
                    for _, grp in df2.groupby(['Country'])
                    for _, row in mapping.iterrows()])
The list comprehension above creates a list of DataFrames - the results of calling selRows on:
each group of rows from df2, for consecutive countries (the outer loop),
each row from mapping (the inner loop).
Then concat combines all of them into a single DataFrame.
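With the sample df2 and mapping above, result should contain product A weeks 1 to 4, product B weeks 3 to 6, product C weeks 1 to 3 and product D weeks 2 to 4, matching the desired frame. A quick way to view it in row order, as a sketch:
print(result.sort_values(['Country', 'Product', 'Week']))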
Solution: first create a mapped column using the mapping DataFrame, then build dictionaries holding, per Country and Product group, the last (maximal) Week value minus one and the group length:
df2['mapp'] = df2['Product'].map(mapping.set_index('Product1')['Product'])
df1 = df2.groupby(['Country','Product'])['Week'].agg(['max','size'])
# subtract 1 to get the week just before the last one
dprev = df1['max'].sub(1).to_dict()
dlen = df1['size'].to_dict()
print(dlen)
{('UK', 'A'): 4, ('UK', 'B'): 8, ('UK', 'C'): 3, ('UK', 'D'): 6}
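For reference, dprev built from the same sample data would look like this (the last week per group minus one):
print(dprev)
{('UK', 'A'): 3, ('UK', 'B'): 7, ('UK', 'C'): 2, ('UK', 'D'): 5}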
Then map the first dictionary with Series.map to filter out rows with smaller Week values, and limit each group to the required length via the second dictionary with DataFrame.head:
df3 = (df2[df2[['Country','mapp']].apply(tuple, 1).map(dprev) <= df2['Week']]
          .groupby(['Country','mapp'])
          .apply(lambda x: x.head(dlen.get(x.name))))
print(df3)
Country Product Week val mapp
Country mapp
UK A 6 UK B 3 7 A
7 UK B 4 8 A
8 UK B 5 9 A
9 UK B 6 10 A
C 16 UK D 2 6 C
17 UK D 3 7 C
18 UK D 4 8 C
Then keep the original rows whose Product does not appear in mapping['Product1'], append the new df3, and sort:
df = (df2[~df2['Product'].isin(mapping['Product1'])]
        .append(df3, ignore_index=True)
        .sort_values(['Country', 'Product'])
        .drop('mapp', axis=1))
print(df)
Country Product Week val
0 UK A 1 5
1 UK A 2 4
2 UK A 3 3
3 UK A 4 1
7 UK B 3 7
8 UK B 4 8
9 UK B 5 9
10 UK B 6 10
4 UK C 1 5
5 UK C 2 5
6 UK C 3 5
11 UK D 2 6
12 UK D 3 7
13 UK D 4 8

Alternative to looping? Vectorisation, cython?

I have a pandas dataframe something like the below:
Total Yr_to_Use First_Year_Del Del_rate 2019 2020 2021 2022 2023 etc
ref1 100 2020 5 10 0 0 0 0 0
ref2 20 2028 2 5 0 0 0 0 0
ref3 30 2021 7 16 0 0 0 0 0
ref4 40 2025 9 18 0 0 0 0 0
ref5 10 2022 4 30 0 0 0 0 0
The 'Total' column shows how many of a product need to be delivered.
'First_Year_Del' tells you how many will be delivered in the first year. After that, the delivery rate reverts to 'Del_rate', a flat rate applied each year until all products are delivered.
The 'Yr_to_Use' column tells you the first year column to begin delivery from.
EXAMPLE: ref1 has 100 to deliver. It will start delivering in 2020, delivering 5 in the first year and 10 each year after that until all 100 are accounted for.
Any ideas how to go about this?
I thought I might use something like the below to reference which columns to use in turn, but I'm not even sure whether that's helpful, as it will depend on the solution (in the full version, base_date.year is defined as the first column in the table, 2019):
start_index_for_slice = df.columns.get_loc(base_date.year)
end_index_for_slice = start_index_for_slice+no_yrs_to_project
df.columns[start_index_for_slice:end_index_for_slice]
I'm pretty new to Python and am not sure if I'm getting ahead of myself a bit...
My instinct would be to use a for loop, or something with iterrows, but other posts seem to say this is a bad idea and that I should be using vectorisation, cython or lambdas. Of those three I've only managed a very simple lambda so far. The others are a bit of a mystery to me, since the solution seems to require doing one action after another until complete.
Any and all help appreciated!
Thanks
EDIT: Example expected output below (I edited some of the dates so you can better see the logic):
Total Yr_to_Use First_Year_Del Del_rate 2019 2020 2021 2022 2023 etc
ref1 100 2020 5 10 0 5 10 10 10
ref2 20 2021 2 5 0 0 2 5 5
ref3 30 2021 7 16 0 0 7 16 7
ref4 40 2019 9 18 9 18 13 0 0
ref5 10 2020 4 30 0 4 6 0 0
Here's another option, which separates the calculation of the rates/years matrix and appends it to the input df later on. Still does looping in the script itself (not "externalized" to some numpy / pandas function). Should be fine for 5k rows I'd guesstimate.
import pandas as pd
import numpy as np

# create the initial df without years/rates
df = pd.DataFrame({'Total': [100, 20, 30, 40, 10],
                   'Yr_to_Use': [2020, 2021, 2021, 2019, 2020],
                   'First_Year_Del': [5, 2, 7, 9, 10],
                   'Del_rate': [10, 5, 16, 18, 30]})

# get the number of full-rate years + remainder
n, r = np.divmod((df['Total'] - df['First_Year_Del']), df['Del_rate'])
# get the year of the last rate considering all rows
max_year = np.max(n + r.astype(bool) + df['Yr_to_Use'])
# get the offsets for the start of delivery; year zero is 2019
# (subtracting the year zero lets you use this as an index)
offset = df['Yr_to_Use'] - 2019
# get a year index; this determines the columns that will be created
yrs = np.arange(2019, max_year + 1)
# prepare an n*m array to hold the rates for all years, initialized with zeros
# (n: number of rows of the df, m: number of years where rates will have to be paid)
out = np.zeros((df['Total'].shape[0], yrs.shape[0]))
# calculate the rates for each year and insert them into the output array
for i in range(df['Total'].shape[0]):
    # concatenate: the first-year rate, all yearly rates, and a final rate if there was a remainder
    # (int() so the list of yearly rates is repeated n[i] times)
    if r[i]:  # if the remainder is not zero, append it as well
        rates = np.concatenate([[df['First_Year_Del'][i]], int(n[i]) * [df['Del_rate'][i]], [r[i]]])
    else:  # remainder is zero, skip it
        rates = np.concatenate([[df['First_Year_Del'][i]], int(n[i]) * [df['Del_rate'][i]]])
    # insert the rates at the appropriate location of the output array
    out[i, offset[i]:offset[i] + rates.shape[0]] = rates

# add the years/rates matrix to the original df
df = pd.concat([df, pd.DataFrame(out, columns=yrs.astype(str))], axis=1, sort=False)
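With the sample frame above this appends string-labelled year columns '2019' through '2030' (ref1 starts in 2020 with 5, delivers 10 per year through 2029, and the final 5 lands in 2030). A quick way to look at just those columns, as a sketch:
# the year columns were created as strings via yrs.astype(str)
print(df[yrs.astype(str)])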
You can accomplish this using a few user-defined functions and the apply method:
import pandas as pd
import numpy as np

df = pd.DataFrame(data={'id': ['ref1', 'ref2', 'ref3', 'ref4', 'ref5'],
                        'Total': [100, 20, 30, 40, 10],
                        'Yr_to_Use': [2020, 2028, 2021, 2025, 2022],
                        'First_Year_Del': [5, 2, 7, 9, 4],
                        'Del_rate': [10, 5, 16, 18, 30]})

def f(r):
    '''
    Computes the values per year and the respective years
    '''
    n = (r['Total'] - r['First_Year_Del']) // r['Del_rate']
    leftover = (r['Total'] - r['First_Year_Del']) % r['Del_rate']
    r['values'] = [r['First_Year_Del']] + [r['Del_rate'] for _ in range(n)] + [leftover]
    r['years'] = np.arange(r['Yr_to_Use'], r['Yr_to_Use'] + len(r['values']))
    return r

df = df.apply(f, axis=1)

def get_year_range(r):
    '''
    Computes the min and max year for each row
    '''
    r['y_min'] = min(r['years'])
    r['y_max'] = max(r['years'])
    return r

df = df.apply(get_year_range, axis=1)

y_min = df['y_min'].min()
y_max = df['y_max'].max()

# initialize each year column to zero
for year in range(y_min, y_max + 1):
    df[year] = 0

def expand(r):
    '''
    Updates the value for each year
    '''
    for v, y in zip(r['values'], r['years']):
        r[y] = v
    return r

# apply and drop the temporary columns
df = df.apply(expand, axis=1).drop(['values', 'years', 'y_min', 'y_max'], axis=1)
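With the sample data above this creates integer-labelled year columns from 2020 through 2032 (for example, ref2 starts in 2028 and takes five years to deliver its 20 units: 2, 5, 5, 5, 3). A quick check of a few of them, as a sketch:
# the year columns here are integer labels created by the range() loop above
print(df[[2020, 2021, 2022, 2023]])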

binning with months column

I have a data frame with the fields casenumber, count and CREATEDDATE. CREATEDDATE contains the month. I want to bin the count values into ranges and arrange those ranges according to the CREATEDDATE column.
My data frame looks like this:
casenumber count CREATEDDATE
3820516 1 jan
3820547 1 jan
3820554 2 feb
3820562 1 feb
3820584 1 march
4226616 1 april
4226618 2 may
4226621 2 may
4226655 1 june
4226663 1 june
Here is the code I used, but it does not match my requirement:
import pandas as pd
import numpy as np
df = pd.read_excel(r"")
bins = [0, 1 ,4,8,15, np.inf]
names = ['0-1','1-4','4-8','8-15','15+']
df1 = df.groupby(pd.cut(df['CREATEDDATE'],bins,labels=names))['casenumber'].size().reset_index(name='No_of_times_statuschanged')
CREATEDDATE No_of_times_statuschanged
0 0-1 2092
1 1-4 9062
2 4-8 12578
3 8-15 3858
4 15+ 0
I got the data above as output, but what I expect is the ranges broken down month by month, based on the cases per month.
The expected output should look like this:
CREATEDDATE jan feb march april may june
0-1 1 2 3 4 5 6
1-4 3 0 6 7 8 9
4-8 4 6 3 0 9 2
8-15 0 3 4 5 8 9
Use crosstab, passing pd.cut of count (not CREATEDDATE), and change the column order by subsetting with the list of month names:
#add another months if necessary
months = ["jan", "feb", "march", "april", "may", "june"]
bins = [0, 1 ,4,8,15, np.inf]
names = ['0-1','1-4','4-8','8-15','15+']
df1 = pd.crosstab(pd.cut(df['count'],bins,labels=names), df['CREATEDDATE'])[months]
print (df1)
CREATEDDATE jan feb march april may june
count
0-1 2 1 1 1 0 2
1-4 0 1 0 0 2 0
Another idea is use ordered categoricals:
df1 = pd.crosstab(pd.cut(df['count'], bins, labels=names),
                  pd.Categorical(df['CREATEDDATE'], ordered=True, categories=months))
print (df1)
col_0 jan feb march april may june
count
0-1 2 1 1 1 0 2
1-4 0 1 0 0 2 0
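For completeness, a self-contained sketch that rebuilds the sample frame from the question (values copied from the table above) and runs the crosstab:
import pandas as pd
import numpy as np

# sample data copied from the question's table
df = pd.DataFrame({
    'casenumber': [3820516, 3820547, 3820554, 3820562, 3820584,
                   4226616, 4226618, 4226621, 4226655, 4226663],
    'count': [1, 1, 2, 1, 1, 1, 2, 2, 1, 1],
    'CREATEDDATE': ['jan', 'jan', 'feb', 'feb', 'march',
                    'april', 'may', 'may', 'june', 'june']})

months = ["jan", "feb", "march", "april", "may", "june"]
bins = [0, 1, 4, 8, 15, np.inf]
names = ['0-1', '1-4', '4-8', '8-15', '15+']

# bin the per-case counts and tabulate them against the month column
df1 = pd.crosstab(pd.cut(df['count'], bins, labels=names), df['CREATEDDATE'])[months]
print(df1)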
