assigning a column to be index and dropping it - python-3.x

As the title suggests, I am reassigning a column as the index, but I want the column to appear only as the index.
df.set_index(df['col_name'], drop = True, inplace = True)
As I understand the documentation, the above should set the column as the index and drop the original column. But when I print the df, the column is now duplicated (as the index and still as a column). Can anyone point out what I am missing?

You only need to pass the column name 'col_name' and the inplace parameter to set_index. If you pass the column itself, like df['a'], instead of its name, set_index does not drop the column; it only copies it into the index:
print(df)
   col_name     a  b
0     1.255  2003  1
1     3.090  2003  2
2     3.155  2003  3
3     3.115  2004  1
4     3.010  2004  2
5     2.985  2004  3

df.set_index('col_name', inplace=True)
print(df)
             a  b
col_name
1.255     2003  1
3.090     2003  2
3.155     2003  3
3.115     2004  1
3.010     2004  2
2.985     2004  3

df.set_index(df['a'], inplace=True)
print(df)
         a  b
a
2003  2003  1
2003  2003  2
2003  2003  3
2004  2004  1
2004  2004  2
2004  2004  3

This works for me in Python 2.x, should work for you too (3.x)!
pandas.__version__
u'0.17.0'
df.set_index('col_name', inplace = True)
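For Python 3, a minimal sketch contrasting the two calls (the sample values below are just illustrative, not the asker's data):
import pandas as pd

df = pd.DataFrame({'col_name': [1.255, 3.090], 'a': [2003, 2003], 'b': [1, 2]})

# Passing the column name drops the column by default (drop=True)
print(df.set_index('col_name').columns.tolist())        # ['a', 'b']

# Passing the Series only copies it into the index, the column stays
print(df.set_index(df['col_name']).columns.tolist())    # ['col_name', 'a', 'b']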

Related

Merging two pandas dataframes with date variable

I want to merge two pandas dataframes based on a common date variable. Below is my code:
import pandas as pd
data = pd.DataFrame({'date' : pd.to_datetime(['2010-12-31', '2012-12-31']), 'val' : [1,2]})
datarange = pd.DataFrame(pd.period_range('2009-12-31', '2012-12-31', freq='A'), columns = ['date'])
pd.merge(datarange, data, how = 'left', on = 'date')
With this I get the result below:
   date  val
0  2009  NaN
1  2010  NaN
2  2011  NaN
3  2012  NaN
Could you please help me merge these two dataframes correctly?
Use right_on to match against the same annual periods as in the datarange['date'] column:
df = pd.merge(datarange,
              data,
              how='left',
              left_on='date',
              right_on=data['date'].dt.to_period('A'))
print (df)
   date date_x     date_y  val
0  2009   2009        NaT  NaN
1  2010   2010 2010-12-31  1.0
2  2011   2011        NaT  NaN
3  2012   2012 2012-12-31  2.0
Or create a helper column:
df = pd.merge(datarange,
              data.assign(datetimes=data['date'], date=data['date'].dt.to_period('A')),
              how='left',
              on='date')
print (df)
   date  val  datetimes
0  2009  NaN        NaT
1  2010  1.0 2010-12-31
2  2011  NaN        NaT
3  2012  2.0 2012-12-31
You need to merge on a common type.
For example, you can use the year as the merging key on each side:
pd.merge(datarange, data, how='left',
         left_on=datarange['date'].dt.year,
         right_on=data['date'].dt.year
         )
output:
  key_0 date_x     date_y  val
0  2009   2009        NaT  NaN
1  2010   2010 2010-12-31  1.0
2  2011   2011        NaT  NaN
3  2012   2012 2012-12-31  2.0
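The underlying issue is a dtype mismatch: datarange['date'] holds annual Period values while data['date'] holds Timestamp values, so merging directly on 'date' never finds equal keys. A quick check, assuming the two dataframes from the question:
print(datarange['date'].dtype)   # period[A-DEC] (period[Y-DEC] in newer pandas)
print(data['date'].dtype)        # datetime64[ns]
# A Period never equals a Timestamp, hence the all-NaN 'val' column above;
# converting one side, e.g. data['date'].dt.to_period('A'), fixes the join.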

Optimizing pandas iteration

Customer  Year  Customer Lost/Retained
A         2009  Retained
A         2010  Retained
A         2011  Lost
B         2008  Lost
C         2008  Retained
C         2009  lost
I have used iterrows() to create the Customer Lost/Retained column based on the logic above: if a customer appears again in the consecutive year, they are retained, otherwise lost.
for i, row in df.iterrows():
    if (df[df['Year'] == row['Year'] + 1]['Customer']).str.contains(df['Customer'].iloc[i]).any():
        df['Customer Lost/Retained'].iloc[i] = 'Retained'
    else:
        df['Customer Lost/Retained'].iloc[i] = 'Lost'
Can this code be optimized further?
import numpy as np

# group Year by Customer
g = df.groupby('Customer')['Year']
# create a mask of conditions by using shift
mask = (g.shift(0) == g.shift(-1) - 1)
# use np.where to create the result column based on the mask
df['Retained/lost'] = np.where(mask, 'Retained', 'Lost')
  Customer  Year Retained/lost
0        A  2009      Retained
1        A  2010      Retained
2        A  2011          Lost
3        B  2008          Lost
4        C  2008      Retained
5        C  2009          Lost
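Note that the shift-based comparison assumes the rows are already ordered by Year within each Customer, as in the sample data; if that is not guaranteed, a small safeguard (my addition, not part of the answer above) is to sort first:
df = df.sort_values(['Customer', 'Year']).reset_index(drop=True)
g = df.groupby('Customer')['Year']
df['Retained/lost'] = np.where(g.shift(0) == g.shift(-1) - 1, 'Retained', 'Lost')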
You could do this as a merge with itself but modifying the year:
In [83]: df['retained'] = pd.notnull(df.merge(
    ...:     df,
    ...:     how="left",
    ...:     left_on=["Customer", "Year"],
    ...:     right_on=["Customer", df["Year"].sub(1)],
    ...:     suffixes=['', "_match"]
    ...: )["Year_match"]).map({True: 'Retained', False: 'Lost'})

In [84]: df
Out[84]:
  Customer  Year Customer Lost/Retained  retained
0        A  2009               Retained  Retained
1        A  2010               Retained  Retained
2        A  2011                   Lost      Lost
3        B  2008                   Lost      Lost
4        C  2008               Retained  Retained
5        C  2009                   lost      Lost
We add a column which says 'Retained':
df['Customer Lost/Retained'] = 'Retained'
Except for the rows with the highest year per customer, which get the value 'Lost':
mask = df.groupby('Customer')['Year'].idxmax()
df.loc[mask, 'Customer Lost/Retained'] = 'Lost'
  Customer  Year Customer Lost/Retained
0        A  2009               Retained
1        A  2010               Retained
2        A  2011                   Lost
3        B  2008                   Lost
4        C  2008               Retained
5        C  2009                   Lost
Or, alternatively, insert the 'Lost' first and then .fillna():
df.loc[df.groupby('Customer')['Year'].idxmax(), 'Customer Lost/Retained'] = 'Lost'
df['Customer Lost/Retained'] = df['Customer Lost/Retained'].fillna('Retained')
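A minimal end-to-end sketch of this alternative on the question's sample data (the dataframe construction is mine, for illustration only):
import pandas as pd

df = pd.DataFrame({'Customer': ['A', 'A', 'A', 'B', 'C', 'C'],
                   'Year': [2009, 2010, 2011, 2008, 2008, 2009]})
df.loc[df.groupby('Customer')['Year'].idxmax(), 'Customer Lost/Retained'] = 'Lost'
df['Customer Lost/Retained'] = df['Customer Lost/Retained'].fillna('Retained')
print(df)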

copy column names into first row of data in pandas

I have a pandas dataframe like the following. When I write this dataframe to Google Sheets I found that the header is missing. My question is how to make this work, or how to copy the column names into the first row of data while leaving the rest of the data unchanged?
import pandas as pd

year = [2005, 2006, 2007]
A = [4, 5, 7]
B = [3, 3, 9]
C = [1, 7, 6]
df_old = pd.DataFrame({'year': year, 'A': A, 'B': B, 'C': C}, columns=['A', 'B', 'C', 'year'])

Out[25]:
   A  B  C  year
0  4  3  1  2005
1  5  3  7  2006
2  7  9  6  2007

# desired output
Out[25]:
   A  B  C  year
0  A  B  C  year
1  4  3  1  2005
2  5  3  7  2006
3  7  9  6  2007
You can also check the following answer: https://stackoverflow.com/a/24284680/11127365
For your case, the first line will be df_old.loc[-1] = df_old.columns
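Following that linked approach, a minimal sketch (this is my reconstruction of the remaining lines, not code quoted from the link):
df_old.loc[-1] = df_old.columns      # header values as a new row labelled -1
df_old.index = df_old.index + 1      # shift the index so the new row becomes 0
df_old = df_old.sort_index()         # move the header row to the top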
IIUC
df = pd.DataFrame(df_old.columns.values[None, :], columns=df_old.columns).\
     append(df_old).\
     reset_index(drop=True)
df
   A  B  C  year
0  A  B  C  year
1  4  3  1  2005
2  5  3  7  2006
3  7  9  6  2007
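Note that DataFrame.append was removed in pandas 2.0; an equivalent with pd.concat (same idea, different concatenation call) would be:
header_row = pd.DataFrame([df_old.columns.tolist()], columns=df_old.columns)
df = pd.concat([header_row, df_old], ignore_index=True)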

Pandas Top n % of grouped sum

I work for a company and am trying to calculate which products produced the top 80% of gross revenue in different years.
Here is a short example of my data:
Part_no Revision  Gross_Revenue  Year
      1        a              1  2014
      2        a              2  2014
      3        c              2  2014
      4        c              2  2014
      5        d              2  2014
I've been looking through various answers and here's the best code I can come up with but it is not working:
df1 = df[['Year', 'Part_No', 'Revision', 'Gross_Revenue']]
df1 = df1.groupby(['Year', 'Part_No', 'Revision']).agg({'Gross_Revenue': 'sum'})
# print(df1.head())

a = 0.8
df2 = (df1.sort_values('Gross_Revenue', ascending=False)
          .groupby(['Year', 'Part_No', 'Revision'], group_keys=False)
          .apply(lambda x: x.head(int(len(x) * a)))
          .reset_index(drop=True))
print(df2)
I'm trying to have the code return, for each year, all the top products that brought in 80% of our company's revenue.
I suspect it's the old 80/20 rule.
Thank you for your help,
Me
You can use cumsum. Note that the output below assumes the rows are sorted by Gross_Revenue in descending order first:
df = df.sort_values('Gross_Revenue', ascending=False)
df[df.groupby('Year').Gross_Revenue.cumsum().div(df.groupby('Year').Gross_Revenue.transform('sum'), axis=0) < 0.8]
Out[589]:
   Part_no Revision  Gross_Revenue  Year
1        2        a              2  2014
2        3        c              2  2014
3        4        c              2  2014
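One caveat (my addition, not part of the answer above): the strict < 0.8 test drops the product whose cumulative share crosses the 80% line. A sketch that keeps that crossing product by looking at the share accumulated before each row:
shares = (df.groupby('Year').Gross_Revenue.cumsum()
            .div(df.groupby('Year').Gross_Revenue.transform('sum')))
# share accumulated before each row within its year (0 for the first row)
prev_share = shares.groupby(df['Year']).shift(fill_value=0)
top80 = df[prev_share < 0.8]   # includes the product that pushes the total past 80%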

Making a list from a pandas column containing multiple values

Let's use this as an example data set:
   Year               Breeds
0  2009               Collie
1  2010             Shepherd
2  2011     Collie, Shepherd
3  2012  Shepherd, Retriever
4  2013             Shepherd
5  2014    Shepherd, Bulldog
6  2015    Collie, Retriever
7  2016   Retriever, Bulldog
I want to create a list dogs containing the unique dog breeds Collie, Shepherd, Retriever, Bulldog. I know it would be as simple as calling .unique() on the appropriate column, but I am running into the issue of having more than one value in the Breeds column. Any ideas to circumvent that?
Thanks!
EDIT: If you need to extract all possible values, use split:
df['new'] = df['Breeds'].str.split(', ')
For unique values per row, convert to sets:
df['new'] = df['Breeds'].str.split(', ').apply(lambda x: list(set(x)))
Or use a list comprehension:
df['new'] = [list(set(x.split(', '))) for x in df['Breeds']]
Use findall with a regex built from the list (| means OR) if you want to extract only some values:
L = ["Collie", "Shepherd", "Retriever", "Bulldog"]
df['new'] = df['Breeds'].str.findall('|'.join(L))
If duplicates are possible:
df['new'] = df['Breeds'].str.findall('|'.join(L)).apply(lambda x: list(set(x)))
print (df)
   Year               Breeds                    new
0  2009               Collie               [Collie]
1  2010             Shepherd             [Shepherd]
2  2011     Collie, Shepherd     [Collie, Shepherd]
3  2012  Shepherd, Retriever  [Shepherd, Retriever]
4  2013             Shepherd             [Shepherd]
5  2014    Shepherd, Bulldog    [Shepherd, Bulldog]
6  2015    Collie, Retriever    [Collie, Retriever]
7  2016   Retriever, Bulldog   [Retriever, Bulldog]
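If the goal is a single flat list dogs of unique breeds rather than a per-row column, a minimal sketch built on the same split (the sorting is my choice, any order works):
dogs = sorted(set(df['Breeds'].str.split(', ').explode()))
print(dogs)   # ['Bulldog', 'Collie', 'Retriever', 'Shepherd']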
