Optimizing pandas iteration - python-3.x

Customer Year Customer Lost/Retained
A 2009 Retained
A 2010 Retained
A 2011 Lost
B 2008 Lost
C 2008 Retained
C 2009 lost
I have used iterrows() to create the Customer Lost/Retained column based on the above logic:
if a customer appears again in the consecutive year, he is retained; otherwise he is lost.
for i, row in df.iterrows():
    if (df[df['Year'] == row['Year'] + 1]['Customer']).str.contains(df['Customer'].iloc[i]).any():
        df['Customer Lost/Retained'].iloc[i] = 'Retained'
    else:
        df['Customer Lost/Retained'].iloc[i] = 'Lost'
Can this code be optimized further?

import numpy as np

# group the years by customer
g = df.groupby('Customer')['Year']
# build a boolean mask with shift: True where the same customer also has a row for the following year
mask = (g.shift(0) == g.shift(-1) - 1)
# use np.where to turn the mask into the two labels
df['Retained/lost'] = np.where(mask, 'Retained', 'Lost')
Customer Year Retained/lost
0 A 2009 Retained
1 A 2010 Retained
2 A 2011 Lost
3 B 2008 Lost
4 C 2008 Retained
5 C 2009 Lost
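A minimal, self-contained sketch of this approach (an illustration, not part of the original answer); note that the shift logic assumes the frame is sorted by Customer and Year, so sort first if that is not guaranteed:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Customer': ['A', 'A', 'A', 'B', 'C', 'C'],
                   'Year': [2009, 2010, 2011, 2008, 2008, 2009]})
# shift(-1) must see the *next* year of the same customer
df = df.sort_values(['Customer', 'Year'])
g = df.groupby('Customer')['Year']
df['Retained/lost'] = np.where(g.shift(-1).eq(df['Year'] + 1), 'Retained', 'Lost')
print(df)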

You could do this as a merge of the frame with itself, but with the year shifted:
In [83]: df['retained'] = pd.notnull(df.merge(
    ...:     df,
    ...:     how="left",
    ...:     left_on=["Customer", "Year"],
    ...:     right_on=["Customer", df["Year"].sub(1)],
    ...:     suffixes=['', "_match"]
    ...: )["Year_match"]).map({True: 'Retained', False: 'Lost'})
In [84]: df
Out[84]:
Customer Year Customer Lost/Retained retained
0 A 2009 Retained Retained
1 A 2010 Retained Retained
2 A 2011 Lost Lost
3 B 2008 Lost Lost
4 C 2008 Retained Retained
5 C 2009 lost Lost
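Not part of the original answer, but to see why this works you can inspect the intermediate merge on its own: each row is matched against the same customer's row for Year + 1, so a non-null Year_match means the customer reappears the following year.
matched = df.merge(df, how="left",
                   left_on=["Customer", "Year"],
                   right_on=["Customer", df["Year"].sub(1)],
                   suffixes=['', "_match"])
print(matched[['Customer', 'Year', 'Year_match']])  # NaN where there is no next-year row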

We add a column which says 'Retained':
df['Customer Lost/Retained'] = 'Retained'
Except for the rows with the highest year per customer, which get the value 'Lost' (note that idxmax returns index labels, not a boolean mask, so we select them with .loc):
idx = df.groupby('Customer')['Year'].idxmax()
df.loc[idx, 'Customer Lost/Retained'] = 'Lost'
Customer Year Customer Lost/Retained
0 A 2009 Retained
1 A 2010 Retained
2 A 2011 Lost
3 B 2008 Lost
4 C 2008 Retained
5 C 2009 Lost
Or, alternatively, insert the 'Lost' first and then .fillna():
df.loc[df.groupby('Customer')['Year'].idxmax(), 'Customer Lost/Retained'] = 'Lost'
df['Customer Lost/Retained'] = df['Customer Lost/Retained'].fillna('Retained')
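One caveat, not raised in the original answers: idxmax only marks each customer's single latest year as 'Lost', so unlike the shift/merge solutions it would label a gap year as 'Retained'. A small hypothetical sketch illustrating the difference:
# hypothetical customer present in 2008 and 2010 but not 2009
gap = pd.DataFrame({'Customer': ['D', 'D'], 'Year': [2008, 2010]})
gap.loc[gap.groupby('Customer')['Year'].idxmax(), 'status'] = 'Lost'
gap['status'] = gap['status'].fillna('Retained')
print(gap)   # 2008 is labelled 'Retained' even though 2009 is missing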

Related

Merging two pandas dataframes with date variable

I want to merge two pandas dataframes based on a common date variable. Below is my code:
import pandas as pd
data = pd.DataFrame({'date' : pd.to_datetime(['2010-12-31', '2012-12-31']), 'val' : [1,2]})
datarange = pd.DataFrame(pd.period_range('2009-12-31', '2012-12-31', freq='A'), columns = ['date'])
pd.merge(datarange, data, how = 'left', on = 'date')
With this I get the result below:
date val
0 2009 NaN
1 2010 NaN
2 2011 NaN
3 2012 NaN
Could you please help me merge these two dataframes correctly?
Use right_on with the same annual periods as in the datarange['date'] column:
df = pd.merge(datarange,
              data,
              how='left',
              left_on='date',
              right_on=data['date'].dt.to_period('A'))
print (df)
date date_x date_y val
0 2009 2009 NaT NaN
1 2010 2010 2010-12-31 1.0
2 2011 2011 NaT NaN
3 2012 2012 2012-12-31 2.0
Or create a helper column:
df = pd.merge(datarange,
              data.assign(datetimes=data['date'], date=data['date'].dt.to_period('A')),
              how='left',
              on='date')
print (df)
date val datetimes
0 2009 NaN NaT
1 2010 1.0 2010-12-31
2 2011 NaN NaT
3 2012 2.0 2012-12-31
You need to merge on a common type.
For example, you can use the year as the merge key on each side:
pd.merge(datarange, data, how='left',
         left_on=datarange['date'].dt.year,
         right_on=data['date'].dt.year)
output:
key_0 date_x date_y val
0 2009 2009 NaT NaN
1 2010 2010 2010-12-31 1.0
2 2011 2011 NaT NaN
3 2012 2012 2012-12-31 2.0
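The underlying issue in all of these variants is a dtype mismatch: datarange['date'] holds annual periods while data['date'] holds datetimes, and pandas will not match the two. A small sketch (not from the original answers) that converts up front so both frames share the period dtype and a plain on='date' merge works:
data2 = data.copy()
data2['date'] = data2['date'].dt.to_period('A')   # now the same dtype as datarange['date']
print(pd.merge(datarange, data2, how='left', on='date'))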

pandas remove records conditionally based on records count of groups

I have a dataframe like this
import pandas as pd
import numpy as np
raw_data = {'Country':['UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK'],
'Product':['A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','D','D','D','D','D','D'],
'Week': [1,2,3,4,1,2,3,4,5,6,7,8,1,2,3,1,2,3,4,5,6],
'val': [5,4,3,1,5,6,7,8,9,10,11,12,5,5,5,5,6,7,8,9,10]
}
df2 = pd.DataFrame(raw_data, columns = ['Country','Product','Week', 'val'])
print(df2)
and a mapping dataframe:
mapping = pd.DataFrame({'Product':['A','C'],'Product1':['B','D']}, columns = ['Product','Product1'])
I want to compare products according to the mapping: product A's data should be matched with product B's data. The logic: product A has 4 records, so product B should also keep 4 records, and those records should come from the weeks around (and including) product A's last week number. Product A's last week is 4, so product B keeps week 3 (one week before), week 4 itself, and weeks 5 and 6 (two weeks after).
Similarly, product C has 3 records, so product D should also keep 3 records around product C's last week number. Product C's last week is 3, so product D keeps weeks 2, 3 and 4.
The desired dataframe keeps only those matching records; I want to remove the extra (highlighted) ones.
Define the following function, which selects rows from df for the products in the current row of mapping:
def selRows(row, df):
    # rows of the source product (e.g. A), their count and their last week number
    rows_1 = df[df.Product == row.Product]
    nr_1 = rows_1.index.size
    lastWk_1 = rows_1.Week.iat[-1]
    # rows of the mapped product (e.g. B): weeks from lastWk_1 - 1 onwards, limited to nr_1 rows
    rows_2 = df[df.Product.eq(row.Product1) & df.Week.ge(lastWk_1 - 1)].iloc[:nr_1]
    return pd.concat([rows_1, rows_2])
Then call it the following way:
result = pd.concat([selRows(row, grp)
                    for _, grp in df2.groupby(['Country'])
                    for _, row in mapping.iterrows()])
The list comprehension above creates a list of DataFrames, the results of calling selRows on:
each group of rows from df2, one group per country (the outer loop),
each row from mapping (the inner loop).
Then concat concatenates all of them into a single DataFrame.
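A small robustness note (an assumption, not part of the original answer): selRows reads each product's last week with .iat[-1] and trims with .iloc[:nr_1], so it relies on the rows within every product being ordered by Week. If that is not guaranteed, sorting first keeps the logic intact:
df2 = df2.sort_values(['Country', 'Product', 'Week'])
result = pd.concat([selRows(row, grp)
                    for _, grp in df2.groupby(['Country'])
                    for _, row in mapping.iterrows()])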
Solution: first create a mapped column using the mapping DataFrame, then build dictionaries of the last (maximal) Week and of the group size, per Country and Product:
df2['mapp'] = df2['Product'].map(mapping.set_index('Product1')['Product'])
df1 = df2.groupby(['Country','Product'])['Week'].agg(['max','size'])
# subtract 1 to get the week just before the last week
dprev = df1['max'].sub(1).to_dict()
dlen = df1['size'].to_dict()
print(dlen)
{('UK', 'A'): 4, ('UK', 'B'): 8, ('UK', 'C'): 3, ('UK', 'D'): 6}
Then map the (Country, mapp) pairs through the first dictionary and keep only rows whose Week is at least that value, then trim each group to the required length from the second dictionary with DataFrame.head:
df3 = (df2[df2[['Country','mapp']].apply(tuple, 1).map(dprev) <= df2['Week']]
           .groupby(['Country','mapp'])
           .apply(lambda x: x.head(dlen.get(x.name))))
print(df3)
                Country Product  Week  val mapp
Country mapp
UK      A    6       UK       B     3    7    A
             7       UK       B     4    8    A
             8       UK       B     5    9    A
             9       UK       B     6   10    A
        C    16      UK       D     2    6    C
             17      UK       D     3    7    C
             18      UK       D     4    8    C
Then keep the original rows whose Product is not in mapping['Product1'], append the new df3 and sort:
df = (df2[~df2['Product'].isin(mapping['Product1'])]
         .append(df3, ignore_index=True)
         .sort_values(['Country','Product'])
         .drop('mapp', axis=1))
print(df)
Country Product Week val
0 UK A 1 5
1 UK A 2 4
2 UK A 3 3
3 UK A 4 1
7 UK B 3 7
8 UK B 4 8
9 UK B 5 9
10 UK B 6 10
4 UK C 1 5
5 UK C 2 5
6 UK C 3 5
11 UK D 2 6
12 UK D 3 7
13 UK D 4 8
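A side note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the last step can be written with pd.concat instead (a sketch, assuming df2, df3 and mapping from above):
df = (pd.concat([df2[~df2['Product'].isin(mapping['Product1'])], df3],
                ignore_index=True)
      .sort_values(['Country', 'Product'])
      .drop('mapp', axis=1))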

Pandas Top n % of grouped sum

I work for a company and am trying to calculate which products produced the top 80% of Gross Revenue in different years.
Here is a short example of my data:
Part_no Revision Gross_Revenue Year
1 a 1 2014
2 a 2 2014
3 c 2 2014
4 c 2 2014
5 d 2 2014
I've been looking through various answers and here's the best code I can come up with but it is not working:
df1 = df[['Year', 'Part_No', 'Revision', 'Gross_Revenue']]
df1 = df1.groupby(['Year', 'Part_No','Revision']).agg({'Gross_Revenue':'sum'})
# print(df1.head())
a = 0.8
df2 = (df1.sort_values('Gross_Revenue', ascending=False)
          .groupby(['Year', 'Part_No', 'Revision'], group_keys=False)
          .apply(lambda x: x.head(int(len(x) * a)))
          .reset_index(drop=True))
print(df2)
I'm trying to have the code return, for each year, all the top products that brought in 80% of our company's revenue.
I suspect it's the old 80/20 rule.
Thank you for your help,
Me
You can use cumsum:
df[df.groupby('Year').Gross_Revenue.cumsum().div(df.groupby('Year').Gross_Revenue.transform('sum'),axis=0)<0.8]
Out[589]:
Part_no Revision Gross_Revenue Year
1 2 a 2 2014
2 3 c 2 2014
3 4 c 2 2014
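One caveat worth adding: the cumulative share only picks out the top products if the rows are already sorted by Gross_Revenue in descending order within each year; on unsorted data the filter just keeps whichever rows come first. A sketch of the same idea with the sort made explicit:
df = df.sort_values(['Year', 'Gross_Revenue'], ascending=[True, False])
share = (df.groupby('Year')['Gross_Revenue'].cumsum()
           .div(df.groupby('Year')['Gross_Revenue'].transform('sum')))
top80 = df[share < 0.8]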

Making a list from a pandas column containing multiple values

Let's use this as an example data set:
Year Breeds
0 2009 Collie
1 2010 Shepherd
2 2011 Collie, Shepherd
3 2012 Shepherd, Retriever
4 2013 Shepherd
5 2014 Shepherd, Bulldog
6 2015 Collie, Retriever
7 2016 Retriever, Bulldog
I want to create a list dogs in which dogs contains the unique dog breeds Collie, Shepherd, Retriever, Bulldog. I know it is as simple as calling .unique() on the appropriate column, but I am running into the issue of having more than one value in the Breeds column. Any ideas to circumvent that?
Thanks!
If you need to extract all possible values, use split:
df['new'] = df['Breeds'].str.split(', ')
For unique values per row, convert to sets:
df['new'] = df['Breeds'].str.split(', ').apply(lambda x: list(set(x)))
Or use list comprehension:
df['new'] = [list(set(x.split(', '))) for x in df['Breeds']]
Use findall with a regex alternation built from the list (| means OR) if you want to extract only some specific values:
L = ["Collie", "Shepherd", "Retriever", "Bulldog"]
df['new'] = df['Breeds'].str.findall('|'.join(L))
If duplicates are possible:
df['new'] = df['Breeds'].str.findall('|'.join(L)).apply(lambda x: list(set(x)))
print (df)
Year Breeds new
0 2009 Collie [Collie]
1 2010 Shepherd [Shepherd]
2 2011 Collie, Shepherd [Collie, Shepherd]
3 2012 Shepherd, Retriever [Shepherd, Retriever]
4 2013 Shepherd [Shepherd]
5 2014 Shepherd, Bulldog [Shepherd, Bulldog]
6 2015 Collie, Retriever [Collie, Retriever]
7 2016 Retriever, Bulldog [Retriever, Bulldog]
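Since the question asked for a flat list dogs of unique breeds rather than a per-row column, here is a short sketch building it from the same split (it assumes the ', ' separator used above):
dogs = sorted(set(df['Breeds'].str.split(', ').explode()))
print(dogs)   # ['Bulldog', 'Collie', 'Retriever', 'Shepherd']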

assigning a column to be index and dropping it

As the title suggests, I am reassigning a column as the index, but I want the column to appear only as the index.
df.set_index(df['col_name'], drop = True, inplace = True)
The documentation, as I understand it, says that the above will reassign the column as the index of the df and drop the original column. But when I print the df, the column is now duplicated (as the index and still as a column). Can anyone point out what I am missing?
You only need to pass the column name 'col_name' and the inplace parameter to set_index. If you pass not the column name but the column itself, like df['a'], set_index doesn't drop the column, it only copies it:
print df
col_name a b
0 1.255 2003 1
1 3.090 2003 2
2 3.155 2003 3
3 3.115 2004 1
4 3.010 2004 2
5 2.985 2004 3
df.set_index('col_name', inplace = True)
print df
a b
col_name
1.255 2003 1
3.090 2003 2
3.155 2003 3
3.115 2004 1
3.010 2004 2
2.985 2004 3
df.set_index(df['a'], inplace = True)
print df
a b
a
2003 2003 1
2003 2003 2
2003 2003 3
2004 2004 1
2004 2004 2
2004 2004 3
This works for me in Python 2.x, should work for you too (3.x)!
pandas.__version__
u'0.17.0'
df.set_index('col_name', inplace = True)
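And if you really do want to index by a Series rather than by a column label, you have to drop the duplicated column yourself; a short sketch of that variant:
df.set_index(df['a'], inplace=True)    # index taken from the Series, column 'a' is kept
df.drop('a', axis=1, inplace=True)     # remove the now-duplicated column manually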
