Removing Suffix From Dataframe Column Names

I am trying to remove a suffix from all columns in a dataframe, however I am getting error messages. Any suggestions would be appreciated.
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df.add_suffix('_x')
def strip_right(df.columns, _x):
    if not text.endswith("_x"):
        return text
    # else
    return text[:len(df.columns)-len("_x")]
Error:
def strip_right(tmp, "_x"):
^
SyntaxError: invalid syntax
I've also tried removing the quotations.
def strip_right(df.columns, _x):
    if not text.endswith(_x):
        return text
    # else
    return text[:len(df.columns)-len(_x)]
Error:
def strip_right(df.columns, _x):
^
SyntaxError: invalid syntax

Here is a more concrete example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df = df.add_suffix('_x')
print ("With Suffix")
print(df.head())
def strip_right(df, suffix='_x'):
    df.columns = df.columns.str.rstrip(suffix)

strip_right(df)
print ("\n\nWithout Suffix")
print(df.head())
Output:
With Suffix
A_x B_x C_x D_x
0 0 7 0 2
1 5 1 8 5
2 6 2 0 1
3 6 6 5 6
4 8 6 5 8
Without Suffix
A B C D
0 0 7 0 2
1 5 1 8 5
2 6 2 0 1
3 6 6 5 6
4 8 6 5 8

I found a bug in the implementation of the accepted answer. The docs for pandas.Series.str.rstrip() reference str.rstrip(), which states:
"The chars argument is not a suffix; rather, all combinations of its values are stripped."
Instead I had to use pandas.Series.str.replace to remove the actual suffix from my column names. See the modified example below.
import re
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df = df.add_suffix('_x')
df['Ex_'] = np.random.randint(0,10,size=10)
df1 = pd.DataFrame(df, copy=True)
print ("With Suffix")
print(df1.head())

def strip_right(df, suffix='_x'):
    # rstrip removes any trailing '_' or 'x' characters, not the suffix as a whole
    df.columns = df.columns.str.rstrip(suffix)

strip_right(df1)
print ("\n\nAfter .rstrip()")
print(df1.head())

def replace_right(df, suffix='_x'):
    # anchor the pattern at the end of the name; re.escape keeps the suffix literal
    df.columns = df.columns.str.replace(re.escape(suffix) + '$', '', regex=True)

print ("\n\nWith Suffix")
print(df.head())
replace_right(df)
print ("\n\nAfter .replace()")
print(df.head())
Output:
With Suffix
A_x B_x C_x D_x Ex_
0 4 9 2 3 4
1 1 6 5 8 6
2 2 5 2 3 6
3 1 4 7 6 4
4 3 9 3 5 8
After .rstrip()
A B C D E
0 4 9 2 3 4
1 1 6 5 8 6
2 2 5 2 3 6
3 1 4 7 6 4
4 3 9 3 5 8
After .replace()
A B C D Ex_
0 4 9 2 3 4
1 1 6 5 8 6
2 2 5 2 3 6
3 1 4 7 6 4
4 3 9 3 5 8
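The difference between stripping a set of characters and removing a real suffix can be sketched without any regex. This is a minimal illustration, not code from either answer: the helper name strip_suffix is hypothetical, and the small frame is made up for the example.

```python
import pandas as pd


def strip_suffix(name, suffix):
    """Return name without a trailing suffix; other names are left unchanged.

    Hypothetical helper, written for illustration only.
    """
    if suffix and name.endswith(suffix):
        return name[:-len(suffix)]
    return name


# str.rstrip treats its argument as a set of characters, not as a suffix:
print("Ex_".rstrip("_x"))  # E

# rename() applies the helper to every column label.
df = pd.DataFrame([[1, 2, 3]], columns=["A_x", "B_x", "Ex_"])
df = df.rename(columns=lambda name: strip_suffix(name, "_x"))
print(list(df.columns))  # ['A', 'B', 'Ex_']
```

On Python 3.9+ the built-in str.removesuffix does the same job as the helper, and pandas 1.4+ offers df.columns.str.removesuffix('_x') directly, if those versions are available to you.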

Related

Create new column with a list of max frequency values for each row of a pandas dataframe

Given this Dataframe:
df2 = pd.DataFrame([[3,3,3,3,3,3,5,5,5,5],[2,2,2,2,8,8,8,8,6,6]], columns=list('ABCDEFGHIJ'))
A B C D E F G H I J
0 3 3 3 3 3 3 5 5 5 5
1 2 2 2 2 8 8 8 8 6 6
I created 2 new columns which give, for each row, the most frequent value(s) (max_freq_val) and how often they occur (max_freq):
df2["max_freq_val"] = df2.apply(lambda x: x.mode().agg(list), axis=1)
df2["max_freq"] = df2.loc[:, df2.columns != "max_freq_val"].apply(lambda x: x.value_counts().max(), axis=1)
A B C D E F G H I J max_freq_val max_freq
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
EDIT: I've edited my code, inspired by the answer given by @rhug123.
Thanks to all of you for your answers.
Try this, it uses mode()
df2.assign(max_freq_value=pd.Series(df2.mode(axis=1).stack().groupby(level=0).agg(list)),
           max_freq=df2.eq(df2.mode(axis=1)[0].squeeze(),axis=0).sum(axis=1))
or
df2.assign(freq = df2.eq((s := df2.mode(axis=1).stack().groupby(level=0).agg(list)).str[0],axis=0).sum(axis=1),val = s)
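We can stack the frame, count the values in each row, keep the most frequent ones, and collect ties into a list (the per-row maximum is computed with groupby().transform('max'), which also works on pandas 2.x):
s = df2.stack().groupby(level=0).value_counts().rename('fre').rename_axis(['row', 'val'])
s = s[s.eq(s.groupby(level=0).transform('max'))].reset_index(level='val').groupby(level=0).agg(val=('val', list), fre=('fre', 'first'))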
df2 = df2.join(s)
df2
Out[156]:
A B C D E F G H I J val fre
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
Perhaps you could use this function:
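import numpy as np

def give_back_maximums(a=[2,2,2,2,8,8,8,8,6,6]):
    # count each distinct value, then keep every value that reaches the top count
    values, counts = np.unique(a, return_counts=True)
    return values[counts >= counts.max()].tolist()
The order of the two assignments below can affect the result, because each apply sees every column that exists at that point. Restricting both to the original columns avoids that:
value_cols = list('ABCDEFGHIJ')
df2["max_freq_value"] = df2[value_cols].apply(lambda x: give_back_maximums(x), axis=1)
df2["max_freq"] = df2[value_cols].apply(lambda x: x.value_counts().max(), axis=1)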
print(df2)
A B C D E F G H I J max_freq_value max_freq
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
Hope it helps : )
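For comparison, here is a compact sketch that builds both columns from a fixed list of the original columns, so neither helper column feeds back into the other calculation. The variable name value_cols is an assumption made for this example; the data is the df2 from the question.

```python
import pandas as pd

df2 = pd.DataFrame(
    [[3, 3, 3, 3, 3, 3, 5, 5, 5, 5], [2, 2, 2, 2, 8, 8, 8, 8, 6, 6]],
    columns=list("ABCDEFGHIJ"),
)

# Remember the original columns before adding any helper columns.
value_cols = list(df2.columns)

# Series.mode() returns every value tied for the highest count, sorted.
df2["max_freq_val"] = df2[value_cols].apply(lambda row: row.mode().tolist(), axis=1)

# value_counts().max() is the count of the most frequent value in the row.
df2["max_freq"] = df2[value_cols].apply(lambda row: row.value_counts().max(), axis=1)

print(df2[["max_freq_val", "max_freq"]])
```

Because value_cols is captured up front, the two assignments can be swapped without changing the result.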

Break dataframe header into multiheader

Names  ABCBaseCIP00  ABCBaseCIP01  ABCBaseCIP02  ABC1CIP00  ABC1CIP01  ABC1CIP02  ABC2CIP00  ABC2CIP01  ABC2CIP02
X      1             2             3             4          5          6          7          8          9
Y      1             2             3             4          5          6          7          8          9
Z      1             2             3             4          5          6          7          8          9
I have the above dataframe. I am looking to break the column headers into a name (ABCBase|ABC1|ABC2) and a code (CIP00|CIP01|CIP02) to get the table below as output.
Can anyone suggest how that can be done in pandas? This is dynamic data, so I do not want to hardcode anything.
       ABCBase  ABCBase  ABCBase  ABC1   ABC1   ABC1   ABC2   ABC2   ABC2
Names  CIP00    CIP01    CIP02    CIP00  CIP01  CIP02  CIP00  CIP01  CIP02
X      1        2        3        4      5      6      7      8      9
Y      1        2        3        4      5      6      7      8      9
Z      1        2        3        4      5      6      7      8      9
Here's a way using string manipulation and pd.MultiIndex with from_arrays:
df = df.set_index('Names')
cols = df.columns.str.extract(r'(ABC(?:Base|\d))(.*)')
df.columns = pd.MultiIndex.from_arrays([cols[0], cols[1]], names=[None, None])
df
Output:
ABCBase ABC1 ABC2
CIP00 CIP01 CIP02 CIP00 CIP01 CIP02 CIP00 CIP01 CIP02
Names
X 1 2 3 4 5 6 7 8 9
Y 1 2 3 4 5 6 7 8 9
Z 1 2 3 4 5 6 7 8 9
Or,
df.columns = pd.MultiIndex\
    .from_arrays(zip(*df.columns.str.extract(r'(ABC(?:Base|\d))(.*)')\
    .to_numpy()))
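Since the question asks to avoid hardcoding, the extract approach can also be sketched with a pattern that does not name the prefixes. This assumes (it is not stated in the question) that every header ends in a code of the form CIP followed by digits; the sample frame is made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    columns=["ABCBaseCIP00", "ABCBaseCIP01", "ABC1CIP00", "ABC2CIP01"],
    index=pd.Index(["X", "Y"], name="Names"),
)

# Assumption: each header is "<name>CIP<digits>". The lazy first group takes
# everything before the code, the second group takes the code itself.
parts = df.columns.str.extract(r"^(.*?)(CIP\d+)$")

# With two capture groups, extract returns a DataFrame with columns 0 and 1.
df.columns = pd.MultiIndex.from_arrays([parts[0], parts[1]], names=[None, None])
print(df)
```

A header that does not match the pattern would come back as NaN in both levels, so it is worth checking parts.isna().any() on real data.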
import pandas as pd
data = { 'names' : ['x','y','z'],
         'ABCBaseCIP00' : [1,1,1],
         'ABCBaseCIP01' : [2,2,2],
         'ABCBaseCIP02' : [3,3,3],
         'ABC1CIP00' : [4,4,4],
         'ABC1CIP01' : [5,5,5]}
df = pd.DataFrame(data)
gives
names ABCBaseCIP00 ABCBaseCIP01 ABCBaseCIP02 ABC1CIP00 ABC1CIP01
0 x 1 2 3 4 5
1 y 1 2 3 4 5
2 z 1 2 3 4 5
Now do the work
df1 = df.T
df1.reset_index(inplace=True)
df1['name']=df1['index'].str[-5:]
df1['subname']=df1['index'].str[0:-5]
df1 = df1.drop('index',axis=1)
df1 = df1.T
which gives
0 1 2 3 4 5
0 x 1 2 3 4 5
1 y 1 2 3 4 5
2 z 1 2 3 4 5
name names CIP00 CIP01 CIP02 CIP00 CIP01
subname ABCBase ABCBase ABCBase ABC1 ABC1 ABC1
Which is not quite what you want but is it close enough?
a one-line solution to this problem:
df.columns = df.columns.str.split('(CIP.+)', expand=True).droplevel(2)
full example:
from pandas import DataFrame, Index
df = DataFrame(
    { 'ABCBaseCIP00': [1,1,1],
      'ABCBaseCIP01': [2,2,2],
      'ABCBaseCIP02': [3,3,3],
      'ABC1CIP00': [4,4,4],
      'ABC1CIP01': [5,5,5] },
    index=Index(list('XYZ'), name='Names')
)
df.columns = df.columns.str.split('(CIP.+)', expand=True).droplevel(2)
# df outputs:
ABCBase ABC1
CIP00 CIP01 CIP02 CIP00 CIP01
Names
X 1 2 3 4 5
Y 1 2 3 4 5
Z 1 2 3 4 5
how it works:
the regex CIP.+ matches from the start of the level-2 label (the CIP code) to the end of the name. The parentheses () create a capture group, so the matched text is returned by .str.split instead of being discarded
splitting and expanding an index creates a multi-index
the resulting multi-index has an extra level (the empty string left after the match), which is dropped with .droplevel(2)
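The effect of the capture group can be seen with the standard re module, which follows the same splitting rule. This is only an illustration of the steps above: it builds the MultiIndex from tuples rather than using .str.split, and the names list is a shortened sample.

```python
import re

import pandas as pd

names = ["ABCBaseCIP00", "ABC1CIP01"]

# A pattern with a capture group makes split keep the matched text.
# Each result has three pieces: the prefix, the code, and a trailing ''.
parts = [re.split("(CIP.+)", name) for name in names]
print(parts[0])  # ['ABCBase', 'CIP00', '']

# Dropping the empty third piece corresponds to .droplevel(2).
columns = pd.MultiIndex.from_tuples([(p[0], p[1]) for p in parts])
print(list(columns))  # [('ABCBase', 'CIP00'), ('ABC1', 'CIP01')]
```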

Slicing a pandas dataframe

import pandas as pd
x = pd.DataFrame([[1,2,3],[4,5,6]])
x[::2]
What does the above command mean and how does it work?
It is easier to see with more data. The slice returns only the rows at even positions (every second row, starting with the first):
x = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[0,1,2]])
print (x)
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
3 0 1 2
print (x[::2])
0 1 2
0 1 2 3
2 7 8 9
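The slice follows the usual Python start:stop:step form, applied to the rows by position. A short sketch, using the same four-row frame, of how it relates to .iloc and how a start offset changes the selection:

```python
import pandas as pd

x = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [0, 1, 2]])

# start and stop are omitted, step is 2: rows at positions 0, 2, ...
every_second = x[::2]

# The same selection written explicitly as a positional row slice.
same_rows = x.iloc[::2]

# Starting at position 1 picks the other half of the rows.
odd_rows = x[1::2]

print(every_second.index.tolist())  # [0, 2]
print(odd_rows.index.tolist())      # [1, 3]
```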

Pandas use variable for column names part 2

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
df
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
How can one assign column names to variables for use in referring to said column names?
For example, if I do this:
cols=['A','B']
cols2=['C','D']
I then want to do something like this:
df[cols,'F',cols2]
But the result is this:
TypeError: unhashable type: 'list'
I think you need to add column F to the list:
allcols = cols + ['F'] + cols2
print(df[allcols])
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
Or:
print(df[cols + ['F'] + cols2])
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
You need to pass a single flat list of column names.
In [48]: df[cols+['F']+cols2]
Out[48]:
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
Also consider using df.loc[:, cols+['F']+cols2] for slicing. (df.ix[:, cols+['F']+cols2] did the same in older versions, but .ix was removed in pandas 1.0.)
Python 3 solution:
In [154]: df[[*cols,'F',*cols2]]
Out[154]:
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
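To see why the original attempt fails, note that df[cols,'F',cols2] hands pandas one tuple containing lists rather than a flat list of labels. The sketch below shows that tuple failing to hash in plain Python and the two flat-list forms from the answers side by side; how pandas reports the failure may differ between versions.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
                   'D': [1, 3, 5], 'E': [5, 3, 6], 'F': [7, 4, 3]})
cols = ['A', 'B']
cols2 = ['C', 'D']

# Inside [], comma-separated items form a tuple: (['A', 'B'], 'F', ['C', 'D']).
key = (cols, 'F', cols2)
try:
    hash(key)
except TypeError as err:
    print(err)  # unhashable type: 'list'

# Both of these build the same flat list of labels.
concatenated = cols + ['F'] + cols2
unpacked = [*cols, 'F', *cols2]

subset = df[concatenated]
print(subset.columns.tolist())  # ['A', 'B', 'F', 'C', 'D']
```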

Pandas use variable for column names [duplicate]

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})
df
A B C
0 1 4 7
1 2 5 8
2 3 6 9
How can I access columns via a variable?
I tried this:
cols='A','B'
df[cols]
...which resulted in this:
KeyError: ('A', 'B')
Bonus Question:
What if my data frame were like this?:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
df
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
and I wanted to do this?:
cols=['A','B']
cols2=['C','D']
df[cols,'F',cols2]
Thanks in advance!
You can subset by a list of column names:
cols=['A','B']
print(df[cols])
A B
0 1 4
1 2 5
2 3 6
It is the same as:
print(df[['A','B']])
A B
0 1 4
1 2 5
2 3 6
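The KeyError in the question comes from cols='A','B' creating a tuple, which pandas looks up as one column label instead of two. A minimal sketch of the difference, using the three-column frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Without brackets, the comma creates a tuple, not a list.
cols = 'A', 'B'
print(type(cols).__name__)  # tuple

# Converting the tuple to a list selects the two columns.
subset = df[list(cols)]
print(subset.columns.tolist())  # ['A', 'B']
```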
Bonus answer:
cols=['A','B']
cols2=['C','D']
allcols = cols + ['F'] + cols2
print(df[allcols])
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
