Break dataframe header into multiheader - python-3.x

Names  ABCBaseCIP00  ABCBaseCIP01  ABCBaseCIP02  ABC1CIP00  ABC1CIP01  ABC1CIP02  ABC2CIP00  ABC2CIP01  ABC2CIP02
X                 1             2             3          4          5          6          7          8          9
Y                 1             2             3          4          5          6          7          8          9
Z                 1             2             3          4          5          6          7          8          9
I have the above dataframe, and I am looking to break the column headers into a name level (ABCBase|ABC1|ABC2) and a code level (CIP00|CIP01|CIP02) to get the table below as output.
Can anyone suggest how that can be done in pandas? This is dynamic data, so I do not want to hardcode anything.
       ABCBase              ABC1                 ABC2
       CIP00  CIP01  CIP02  CIP00  CIP01  CIP02  CIP00  CIP01  CIP02
Names
X          1      2      3      4      5      6      7      8      9
Y          1      2      3      4      5      6      7      8      9
Z          1      2      3      4      5      6      7      8      9

Here's a way using string manipulation and pd.MultiIndex with from_arrays:
df = df.set_index('Names')
cols = df.columns.str.extract(r'(ABC(?:Base|\d))(.*)')
df.columns = pd.MultiIndex.from_arrays([cols[0], cols[1]], names=[None, None])
df
Output:
ABCBase ABC1 ABC2
CIP00 CIP01 CIP02 CIP00 CIP01 CIP02 CIP00 CIP01 CIP02
Names
X 1 2 3 4 5 6 7 8 9
Y 1 2 3 4 5 6 7 8 9
Z 1 2 3 4 5 6 7 8 9
Or, in one statement (the extracted rows need to be materialized as a list before from_arrays can transpose them):
df.columns = pd.MultiIndex.from_arrays(
    list(zip(*df.columns.str.extract(r'(ABC(?:Base|\d))(.*)').to_numpy())))
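Putting the extract-based approach together as a runnable sketch (the frame below is reconstructed from the question's data, so treat the exact values as illustrative):

```python
import pandas as pd

# Reconstruction of the question's frame; values 1..9 are illustrative.
pairs = [(name, code)
         for name in ("ABCBase", "ABC1", "ABC2")
         for code in ("CIP00", "CIP01", "CIP02")]
df = pd.DataFrame(
    {f"{name}{code}": [v] * 3 for v, (name, code) in enumerate(pairs, start=1)},
    index=pd.Index(list("XYZ"), name="Names"),
)

# Split each header into (name, code) without hardcoding any label;
# \d+ also covers multi-digit names like ABC10.
parts = df.columns.str.extract(r"(ABC(?:Base|\d+))(.*)")
df.columns = pd.MultiIndex.from_arrays([parts[0], parts[1]], names=[None, None])

print(df["ABC1"]["CIP01"].tolist())  # [5, 5, 5]
```

The only structural assumption is that every header starts with `ABC` followed by `Base` or digits; the code level is whatever remains, so new names and codes are picked up automatically.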

import pandas as pd
data = {'names': ['x', 'y', 'z'],
        'ABCBaseCIP00': [1, 1, 1],
        'ABCBaseCIP01': [2, 2, 2],
        'ABCBaseCIP02': [3, 3, 3],
        'ABC1CIP00': [4, 4, 4],
        'ABC1CIP01': [5, 5, 5]}
df = pd.DataFrame(data)
gives
names ABCBaseCIP00 ABCBaseCIP01 ABCBaseCIP02 ABC1CIP00 ABC1CIP01
0 x 1 2 3 4 5
1 y 1 2 3 4 5
2 z 1 2 3 4 5
Now do the work
df1 = df.T
df1.reset_index(inplace=True)
df1['name']=df1['index'].str[-5:]
df1['subname']=df1['index'].str[0:-5]
df1 = df1.drop('index',axis=1)
df1 = df1.T
which gives
             0        1        2        3      4      5
0            x        1        2        3      4      5
1            y        1        2        3      4      5
2            z        1        2        3      4      5
name     names    CIP00    CIP01    CIP02  CIP00  CIP01
subname         ABCBase  ABCBase  ABCBase   ABC1   ABC1
Which is not quite what you want but is it close enough?
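To close the remaining gap, here is a sketch of one way to finish the job without the double transpose: the same fixed-width slicing (this assumes every code is exactly 5 characters, as in the sample) used to build a MultiIndex directly.

```python
import pandas as pd

data = {'names': ['x', 'y', 'z'],
        'ABCBaseCIP00': [1, 1, 1],
        'ABCBaseCIP01': [2, 2, 2],
        'ABCBaseCIP02': [3, 3, 3],
        'ABC1CIP00': [4, 4, 4],
        'ABC1CIP01': [5, 5, 5]}
df = pd.DataFrame(data).set_index('names')

# The last 5 characters are the code, the rest is the name
# (assumes codes like CIP00 are always 5 characters wide).
df.columns = pd.MultiIndex.from_arrays(
    [df.columns.str[:-5], df.columns.str[-5:]])
# df.columns is now a 2-level MultiIndex: ('ABCBase', 'CIP00'), ...
```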

a one-line solution to this problem:
df.columns = df.columns.str.split('(CIP.+)', expand=True).droplevel(2)
full example:
from pandas import DataFrame, Index
df = DataFrame(
    {'ABCBaseCIP00': [1,1,1],
     'ABCBaseCIP01': [2,2,2],
     'ABCBaseCIP02': [3,3,3],
     'ABC1CIP00': [4,4,4],
     'ABC1CIP01': [5,5,5]},
    index=Index(list('XYZ'), name='Names')
)
df.columns = df.columns.str.split('(CIP.+)', expand=True).droplevel(2)
# df outputs:
ABCBase ABC1
CIP00 CIP01 CIP02 CIP00 CIP01
Names
X 1 2 3 4 5
Y 1 2 3 4 5
Z 1 2 3 4 5
how it works:
the regex CIP.+ matches from the start of the second-level label; the parentheses () create a capture group, so the matched text is returned by .str.split
splitting an index with expand=True creates a MultiIndex
the resulting MultiIndex has an extra, empty level, which is dropped with .droplevel(2)

Related

Removing Suffix From Dataframe Column Names - Python

I am trying to remove a suffix from all columns in a dataframe; however, I am getting error messages. Any suggestions would be appreciated.
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df.add_suffix('_x')
def strip_right(df.columns, _x):
    if not text.endswith("_x"):
        return text
    # else
    return text[:len(df.columns)-len("_x")]
Error:
def strip_right(tmp, "_x"):
^
SyntaxError: invalid syntax
I've also tried removing the quotations.
def strip_right(df.columns, _x):
    if not text.endswith(_x):
        return text
    # else
    return text[:len(df.columns)-len(_x)]
Error:
def strip_right(df.columns, _x):
^
SyntaxError: invalid syntax
Here is a more concrete example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df = df.add_suffix('_x')
print ("With Suffix")
print(df.head())
def strip_right(df, suffix='_x'):
    df.columns = df.columns.str.rstrip(suffix)

strip_right(df)
print ("\n\nWithout Suffix")
print(df.head())
Output:
With Suffix
A_x B_x C_x D_x
0 0 7 0 2
1 5 1 8 5
2 6 2 0 1
3 6 6 5 6
4 8 6 5 8
Without Suffix
A B C D
0 0 7 0 2
1 5 1 8 5
2 6 2 0 1
3 6 6 5 6
4 8 6 5 8
I found a bug in the implementation of the accepted answer. The docs for pandas.Series.str.rstrip() reference str.rstrip(), which states:
"The chars argument is not a suffix; rather, all combinations of its values are stripped."
Instead I had to use pandas.Series.str.replace to remove the actual suffix from my column names. See the modified example below.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df = df.add_suffix('_x')
df['Ex_'] = np.random.randint(0,10,size=(10, 1))
df1 = pd.DataFrame(df, copy=True)
print ("With Suffix")
print(df1.head())
def strip_right(df, suffix='_x'):
    df.columns = df.columns.str.rstrip(suffix)

strip_right(df1)
print("\n\nAfter .rstrip()")
print(df1.head())

def replace_right(df, suffix='_x'):
    df.columns = df.columns.str.replace(suffix + '$', '', regex=True)
print ("\n\nWith Suffix")
print(df.head())
replace_right(df)
print ("\n\nAfter .replace()")
print(df.head())
Output:
With Suffix
A_x B_x C_x D_x Ex_
0 4 9 2 3 4
1 1 6 5 8 6
2 2 5 2 3 6
3 1 4 7 6 4
4 3 9 3 5 8
After .rstrip()
A B C D E
0 4 9 2 3 4
1 1 6 5 8 6
2 2 5 2 3 6
3 1 4 7 6 4
4 3 9 3 5 8
After .replace()
A B C D Ex_
0 4 9 2 3 4
1 1 6 5 8 6
2 2 5 2 3 6
3 1 4 7 6 4
4 3 9 3 5 8
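On pandas 1.4 and newer there is a third option that sidesteps both the rstrip character-set pitfall and the regex escaping: .str.removesuffix, which strips only an exact trailing match. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)),
                  columns=list('ABCD')).add_suffix('_x')
df['Ex_'] = np.random.randint(0, 10, size=10)

# removesuffix strips only an exact trailing '_x', so 'Ex_' is untouched
# (requires pandas >= 1.4).
df.columns = df.columns.str.removesuffix('_x')
print(list(df.columns))  # ['A', 'B', 'C', 'D', 'Ex_']
```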

Pandas: How to extract only the latest date in a pivot table dataframe

How do I create a new dataframe which includes as index only the latest date in the 'txn_date' column for each 'day', based on the pivot table in the picture?
Thank you
d1 = pd.to_datetime(['2016-06-25'] *2 + ['2016-06-28']*4)
df = pd.DataFrame({'txn_date': pd.date_range('2012-03-05 10:20:03', periods=6),
                   'B': [4,5,4,5,5,4],
                   'C': [7,8,9,4,2,3],
                   'D': [1,3,5,7,1,0],
                   'E': [5,3,6,9,2,4],
                   'day': d1}).set_index(['day','txn_date'])
print (df)
B C D E
day txn_date
2016-06-25 2012-03-05 10:20:03 4 7 1 5
2012-03-06 10:20:03 5 8 3 3
2016-06-28 2012-03-07 10:20:03 4 9 5 6
2012-03-08 10:20:03 5 4 7 9
2012-03-09 10:20:03 5 2 1 2
2012-03-10 10:20:03 4 3 0 4
1.
I think you first need sort_index if necessary, then groupby by level day and aggregate with last:
df1 = df.sort_index().reset_index(level=1).groupby(level='day').last()
print (df1)
txn_date B C D E
day
2016-06-25 2012-03-06 10:20:03 5 8 3 3
2016-06-28 2012-03-10 10:20:03 4 3 0 4
2.
Filter by boolean indexing with duplicated:
#if necessary
df = df.sort_index()
df2 = df[~df.index.get_level_values('day').duplicated(keep='last')]
print(df2)
B C D E
day txn_date
2016-06-25 2012-03-06 10:20:03 5 8 3 3
2016-06-28 2012-03-10 10:20:03 4 3 0 4
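A third variant (a sketch on a flat, non-indexed copy of the same data) keeps the last transaction per day with groupby(...).tail(1), which avoids touching the index entirely:

```python
import pandas as pd

# Flat version of the question's data; only one value column kept for brevity.
d1 = pd.to_datetime(['2016-06-25'] * 2 + ['2016-06-28'] * 4)
df = pd.DataFrame({'txn_date': pd.date_range('2012-03-05 10:20:03', periods=6),
                   'B': [4, 5, 4, 5, 5, 4],
                   'day': d1})

# After sorting by txn_date, tail(1) keeps the last transaction of each day.
latest = df.sort_values('txn_date').groupby('day').tail(1)
print(latest['txn_date'].tolist())
```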

Repeating elements in a dataframe

Hi all, I have the following dataframe:
A | B | C
1 2 3
2 3 4
3 4 5
4 5 6
And I am trying to only repeat the last two rows of the data so that it looks like this:
A | B | C
1 2 3
2 3 4
3 4 5
3 4 5
4 5 6
4 5 6
I have tried using append, concat and repeat to no avail.
repeated = lambda x:x.repeat(2)
df.append(df[-2:].apply(repeated),ignore_index=True)
This returns the following dataframe, which is incorrect:
A | B | C
1 2 3
2 3 4
3 4 5
4 5 6
3 4 5
3 4 5
4 5 6
4 5 6
You can use numpy.repeat to repeat the last 2 index values and create df1 with loc; then append df1 to the original, but first filter out the last 2 rows with iloc:
df1 = df.loc[np.repeat(df.index[-2:].values, 2)]
print (df1)
A B C
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
print (df.iloc[:-2])
A B C
0 1 2 3
1 2 3 4
df = df.iloc[:-2].append(df1,ignore_index=True)
print (df)
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 3 4 5
4 4 5 6
5 4 5 6
If you want to use your code, add iloc to filter only the last 2 rows:
repeated = lambda x:x.repeat(2)
df = df.iloc[:-2].append(df.iloc[-2:].apply(repeated),ignore_index=True)
print (df)
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 3 4 5
4 4 5 6
5 4 5 6
Use pd.concat and index slicing with .iloc:
pd.concat([df,df.iloc[-2:]]).sort_values(by='A')
Output:
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
I'm partial to manipulating the index into the pattern we are aiming for then asking the dataframe to take the new form.
Option 1
Use pd.DataFrame.reindex
df.reindex(df.index[:-2].append(df.index[-2:].repeat(2)))
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
Same thing in multiple lines
i = df.index
idx = i[:-2].append(i[-2:].repeat(2))
df.reindex(idx)
Could also use loc
i = df.index
idx = i[:-2].append(i[-2:].repeat(2))
df.loc[idx]
Option 2
Reconstruct from values. Only do this if all dtypes are the same.
i = np.arange(len(df))
idx = np.append(i[:-2], i[-2:].repeat(2))
pd.DataFrame(df.values[idx], df.index[idx])
0 1 2
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
Option 3
Can also use np.array in iloc
i = np.arange(len(df))
idx = np.append(i[:-2], i[-2:].repeat(2))
df.iloc[idx]
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
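The same index-manipulation idea generalizes to arbitrary per-row repeat counts, since Index.repeat accepts an array. A sketch using the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [2, 3, 4, 5],
                   'C': [3, 4, 5, 6]})

# Repeat counts per row: 1 for every row except the last two, which get 2.
repeats = np.ones(len(df), dtype=int)
repeats[-2:] = 2
out = df.loc[df.index.repeat(repeats)].reset_index(drop=True)
print(out['A'].tolist())  # [1, 2, 3, 3, 4, 4]
```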

Pandas use variable for column names part 2

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1,2,3],
                   'B': [4,5,6],
                   'C': [7,8,9],
                   'D': [1,3,5],
                   'E': [5,3,6],
                   'F': [7,4,3]})
df
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
How can one assign column names to variables for use in referring to said column names?
For example, if I do this:
cols=['A','B']
cols2=['C','D']
I then want to do something like this:
df[cols,'F',cols2]
But the result is this:
TypeError: unhashable type: 'list'
I think you need to add column F to the list:
allcols = cols + ['F'] + cols2
print df[allcols]
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
Or:
print df[cols + ['F'] +cols2]
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
You need to give a list of columns for the reference.
In [48]: df[cols+['F']+cols2]
Out[48]:
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
and consider using df.loc[:, cols+['F']+cols2] for slicing (df.ix is deprecated in modern pandas).
Python 3 solution:
In [154]: df[[*cols,'F',*cols2]]
Out[154]:
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
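If the pieces to combine are a mix of lists and single names, itertools.chain flattens them in one pass; a small sketch (lone column names just need to be wrapped in a list):

```python
import pandas as pd
from itertools import chain

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
                   'D': [1, 3, 5], 'E': [5, 3, 6], 'F': [7, 4, 3]})
cols, cols2 = ['A', 'B'], ['C', 'D']

# chain concatenates any number of column lists into one flat selection.
selection = list(chain(cols, ['F'], cols2))
print(df[selection].columns.tolist())  # ['A', 'B', 'F', 'C', 'D']
```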

pandas moving aggregate string

import pandas as pd
from io import StringIO  # Python 3; the question originally used the Python 2 StringIO module

df = pd.read_csv(StringIO('''id months state
1 1 C
1 2 3
1 3 6
1 4 9
2 1 C
2 2 C
2 3 3
2 4 6
2 5 9
2 6 9
2 7 9
2 8 C
'''), delimiter= '\t')
I want to create a column show the cumulative state of column state, by id.
id months state result
1  1      C     C
1  2      3     C3
1  3      6     C36
1  4      9     C369
2  1      C     C
2  2      C     CC
2  3      3     CC3
2  4      6     CC36
2  5      9     CC369
2  6      9     CC3699
2  7      9     CC36999
2  8      C     CC36999C
Basically the cumulative concatenation of a string column. What is the best way to do it?
So long as the dtype is str, you can do the following:
In [17]:
df['result']=df.groupby('id')['state'].apply(lambda x: x.cumsum())
df
Out[17]:
id months state result
0 1 1 C C
1 1 2 3 C3
2 1 3 6 C36
3 1 4 9 C369
4 2 1 C C
5 2 2 C CC
6 2 3 3 CC3
7 2 4 6 CC36
8 2 5 9 CC369
9 2 6 9 CC3699
10 2 7 9 CC36999
11 2 8 C CC36999C
Essentially we groupby on the 'id' column and then apply a lambda that returns the cumsum. For strings, cumsum performs a cumulative concatenation and returns a Series with its index aligned to the original df, so you can add it as a column.
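An equivalent that doesn't rely on cumsum being defined for strings is itertools.accumulate, whose default operation is +, i.e. concatenation for strings. A sketch on a trimmed version of the data:

```python
import pandas as pd
from itertools import accumulate

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'state': ['C', '3', '6', 'C', 'C']})

# accumulate concatenates the strings step by step within each group;
# transform keeps the result aligned with the original index.
df['result'] = df.groupby('id')['state'].transform(
    lambda s: list(accumulate(s)))
print(df['result'].tolist())  # ['C', 'C3', 'C36', 'C', 'CC']
```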
