How to remove header from pandas Styler?

I want to remove the header from the pandas Styler so that I can render it.
What I have tried:
import pandas as pd

def highlight(x):
    c1 = 'background-color: #f5f5dc'
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    df1.loc[['A'], :] = c1
    return df1

temp = {'col1': ['abc', 'def'], 'col2': [1.0, 2.0]}
df = pd.DataFrame(temp)
df.index = ['A', 'B']
print(df)
df.style.apply(highlight, axis=None).hide_index()
Output:

col1      col2
abc   1.000000
def   2.000000
But I also want to remove the col1 and col2 headers, since they appear on my rendered page where I don't need them. Is there any way to do that?

In pandas 1.4.0 both hide_index and hide_columns were deprecated (GH43758) in favour of hide with an axis argument.
With labels:
df.style.apply(highlight, axis=None).hide(axis='index').hide(axis='columns')
With axis numbers:
df.style.apply(highlight, axis=None).hide(axis=0).hide(axis=1)
In versions prior to 1.4.0, the hide_columns function can be used to hide the column headers, just as hide_index does for the index:
df.style.apply(highlight, axis=None).hide_index().hide_columns()
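Putting it together, a minimal end-to-end sketch for pandas >= 1.4 (Styler.to_html has been available since 1.3):

import pandas as pd

df = pd.DataFrame({'col1': ['abc', 'def'], 'col2': [1.0, 2.0]}, index=['A', 'B'])

def highlight(x):
    c1 = 'background-color: #f5f5dc'
    styles = pd.DataFrame('', index=x.index, columns=x.columns)
    styles.loc[['A'], :] = c1
    return styles

# Hide both axes, then export the HTML that would be rendered.
styler = df.style.apply(highlight, axis=None).hide(axis='index').hide(axis='columns')
print(styler.to_html())  # the emitted table should contain no index or column header cells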

Related

Python Pandas groupby with agg() nth() and/or iloc()

Given this DF:
df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Col2': ['i', 'j', 'k', 'l', 'm', 'n', 'o'],
                   'Col3': ['Apple', 'Peach', 'Apricot', 'Dog', 'Cat', 'Mouse', 'Horse']})
df
And then using this code:
df1 = df.groupby('Col1').agg({'Col2':'count', 'Col3': lambda x: x.iloc[2]})
df1
I got this result:

      Col2     Col3
Col1
A        3  Apricot
B        4    Mouse
What I would like now:
I would like the lambda 'Col3': lambda x: x.iloc[0] to print 'Not enough data' when it runs into an error, for example if I change x.iloc[0] to x.iloc[3], which raises an error because group 'A' of Col1 has fewer rows than group 'B'.
!! I don't want to use 'last', because this is a simplified and shortened DF for demonstration purposes !!
You can use nth, which will give you a NaN if the value is missing. Unfortunately, nth is not handled by agg, so you need to compute it separately and join:
g = df.groupby('Col1')
df1 = g.agg({'Col2':'count'}).join(g['Col3'].nth(3))
output:

      Col2   Col3
Col1
A        3    NaN
B        4  Horse
You can try a slice, which returns an empty Series instead of raising when the position is out of range:
df1 = df.groupby('Col1').agg({'Col2': 'count',
                              'Col3': lambda x: x.iloc[3:4] if len(x.iloc[3:4]) else pd.NA})
print(df1)
      Col2   Col3
Col1
A        3   <NA>
B        4  Horse
You can save typing with an assignment expression (the walrus operator) if your Python version is 3.8 or later:
df1 = df.groupby('Col1').agg({'Col2': 'count',
                              'Col3': lambda x: v if len(v := x.iloc[3:4]) else pd.NA})
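If you specifically want the 'Not enough data' fallback the question asks for, one sketch is to wrap the positional lookup in try/except (nth_or_message is an illustrative helper name, used with the question's df):

def nth_or_message(x, n, message='Not enough data'):
    try:
        return x.iloc[n]  # raises IndexError if the group has fewer than n+1 rows
    except IndexError:
        return message

df1 = df.groupby('Col1').agg({'Col2': 'count',
                              'Col3': lambda x: nth_or_message(x, 3)})
print(df1)
#       Col2             Col3
# Col1
# A        3  Not enough data
# B        4            Horse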

Python using apply function to skip Nan

I am trying to preprocess a dataset to use for XGBoost by mapping the classes in each column to numerical values. A working example looks like this:
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
import pandas as pd
df1 = pd.DataFrame(data={'col1': ['A', 'B', 'C', 'B', 'A'],
                         'col2': ['Z', 'X', 'Z', 'Z', 'Y'],
                         'col3': ['I', 'J', 'I', 'J', 'J']})
d = defaultdict(LabelEncoder)
encodedDF = df1.apply(lambda x: d[x.name].fit_transform(x))
inv = encodedDF.apply(lambda x: d[x.name].inverse_transform(x))
Where encodedDF gives the output:
   col1  col2  col3
0     0     2     0
1     1     0     1
2     2     2     0
3     1     2     1
4     0     1     1
And inv just reverts it back to the original dataframe. My issue is when null values get introduced:
df2 = pd.DataFrame(data={'col1': ['A', 'B', None, 'B', 'A'],
                         'col2': ['Z', 'X', 'Z', None, 'Y'],
                         'col3': ['I', 'J', 'I', 'J', 'J']})
encodedDF = df2.apply(lambda x: d[x.name].fit_transform(x))
Running the above will throw the error:
"TypeError: ('argument must be a string or number', 'occurred at index col1')"
Basically, I want to apply the encoding, but skip over the individual cell values that are null to get an output like this:
   col1  col2  col3
0     0     2     0
1     1     0     1
2   NaN     2     0
3     1   NaN     1
4     0     1     1
I can't use dropna() before applying the encoding, because then I lose data that I will be trying to impute down the line with XGBoost. I also can't use a conditional to skip x if it is null (e.g. x.notnull() in the lambda), because fit_transform(x) takes a pandas Series as its argument, and none of the logical operators I could use in such a conditional do what I'm trying to do here. I'm not sure what else to try to get this to work. I hope what I'm trying to do makes sense; let me know if I need to clarify.
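One way to skip the null cells without dummy values is to fit each encoder only on the non-null values and write the codes back by position; a sketch (encode_non_null is an illustrative helper, not a sklearn function):

import numpy as np
import pandas as pd
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder

df2 = pd.DataFrame({'col1': ['A', 'B', None, 'B', 'A'],
                    'col2': ['Z', 'X', 'Z', None, 'Y'],
                    'col3': ['I', 'J', 'I', 'J', 'J']})
d = defaultdict(LabelEncoder)

def encode_non_null(s):
    # Fit only on the non-null values; the NaN positions stay NaN.
    out = pd.Series(np.nan, index=s.index)
    mask = s.notnull()
    out[mask] = d[s.name].fit_transform(s[mask])
    return out

encodedDF = df2.apply(encode_non_null)
print(encodedDF)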
I think I figured out a workaround. I probably should have been using sklearn's OneHotEncoder class from the beginning instead of the LabelEncoder/defaultdict combo. I'm brand new to all this. I replaced NaNs with dummy values, and then dropped those dummy values once I encoded the dataframe.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(data={'col1': ['A', 'B', 'C', None, 'A'],
                        'col2': ['Z', 'X', None, 'Z', 'Y'],
                        'col3': ['I', 'J', None, 'J', 'J'],
                        'col4': [45, 67, None, 32, 94]})

# Replace NaNs with per-column dummy values first
replaceVals = {'col1': 'missing', 'col2': 'missing', 'col3': 'missing', 'col4': -1}
df = df.fillna(value=replaceVals)

# drop expects one category per feature (array-like of shape (n_features,)),
# so the dummy categories are excluded from the encoding
drop = ['missing', 'missing', 'missing', -1]
enc = OneHotEncoder(drop=drop)
encodeDF = enc.fit_transform(df)
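As a quick check (get_feature_names_out assumes sklearn >= 1.0; fit_transform returns a sparse matrix), the dummy categories should be absent from the encoded feature names:

print(enc.get_feature_names_out())  # no 'missing' / -1 columns
print(encodeDF.toarray())           # dense view of the one-hot matrix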

Remove consecutive duplicate entries from pandas in each cell

I have a data frame that looks like
import pandas as pd

d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
pd.DataFrame(data=d)
expected output
d = {'col1': ['a,b', 'a,c,b'], 'col2': ['a,b', 'a,b,a']}
I have tried this:

from itertools import groupby

arr = ['a', 'a', 'b', 'a', 'a', 'c', 'c']
print([x[0] for x in groupby(arr)])
How do I remove the consecutive duplicate entries in each cell of the dataframe? For example, a,a,b,c should become a,b,c.
From what I understand, you don't want to keep values that repeat consecutively. You can try this custom function:
def myfunc(x):
    s = pd.Series(x.split(','))
    res = s[s.ne(s.shift())]
    return ','.join(res.values)

print(df.applymap(myfunc))
    col1   col2
0    a,b    a,b
1  a,c,b  a,b,a
Another function can be created with itertools.groupby; it plugs into the same applymap call, as shown below:

from itertools import groupby

def myfunc(x):
    l = [k for k, _ in groupby(x.split(','))]
    return ','.join(l)
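With the df built above this prints the same result as the first function:

print(df.applymap(myfunc))
#     col1   col2
# 0    a,b    a,b
# 1  a,c,b  a,b,a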
You could define a function to help with this, then use .applymap to apply it to all columns (or .apply one column at a time):
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)

def remove_dups(string):
    split = string.split(',')  # split the string into a list
    uniques = set(split)       # remove duplicate list elements
    return ','.join(uniques)   # rejoin the list elements into a string

result = df.applymap(remove_dups)
This returns:
    col1 col2
0    a,b  a,b
1  a,c,b  a,b
Edit: this looks slightly different from your expected output; why do you expect a,b,a for the second row in col2?
Edit 2: to preserve the original order, you can replace the set() call with unique_everseen():

from more_itertools import unique_everseen

def remove_dups(string):
    split = string.split(',')
    uniques = unique_everseen(split)  # drops duplicates, keeps first-seen order
    return ','.join(uniques)

Equivalent of R's "attach"ing a dataframe in Python?

I saw a similar question on SO, but that one is different as it relates to functions rather than dataframes.
Imagine we have a dataframe df with a column x. In R, if you "attach" df, then you can directly use x for example in print(x), without having to reference df as in print(df['x']). Is there any equivalent in Python?
First, the caveat that you should not do this. That said, you can set global variables via a loop across the columns:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
for col in df.columns:
    globals()[col] = df[col]

>>> a
0    1
1    2
2    3
Name: a, dtype: int64
If you wanted it to be something you use regularly, perhaps you write a function (again, I strongly discourage this):
def attach(df):
    for col in df.columns:
        globals()[col] = df[col]

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
attach(df)
This sounds like aliasing:

x = df.x  # alias x to the Series df.x

One caveat: the alias sees in-place mutation of the shared data, but not reassignment. After df.x = 10 the column is rebound to a new Series, so x still refers to the old values.

Multi-index pandas dataframes: find an index related to the number of unique values a column has

# import pandas library
import pandas as pd

idx = pd.MultiIndex.from_product([['A001', 'B001', 'C001'],
                                  ['0', '1', '2']],
                                 names=['ID', 'Entries'])
col = ['A', 'B']
df = pd.DataFrame('-', idx, col)
df.loc['A001', 'A'] = [10, 10, 10]
df.loc['A001', 'B'] = [90, 84, 70]
df.loc['B001', 'A'] = [10, 20, 10]
df.loc['B001', 'B'] = [70, 86, 67]
df.loc['C001', 'A'] = [20, 20, 20]
df.loc['C001', 'B'] = [98, 81, 72]
# df is a dataframe
df
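For reference, the frame built above prints as:

              A   B
ID   Entries
A001 0       10  90
     1       10  84
     2       10  70
B001 0       10  70
     1       20  86
     2       10  67
C001 0       20  98
     1       20  81
     2       20  72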
The problem: how do I return the ID that has more than one unique value in column 'A'? In the above dataset, it should return B001.
I would appreciate it if anyone could help me with performing operations on multi-index pandas dataframes.
Use GroupBy.transform with nunique, filter with boolean indexing, and then get the values of the first level of the MultiIndex with get_level_values and unique:
a = df[df.groupby(level=0)['A'].transform('nunique') > 1].index.get_level_values(0).unique()
print(a)
Index(['B001'], dtype='object', name='ID')
Or use duplicated, but first move the MultiIndex into columns with reset_index:
m = df.reset_index().duplicated(subset=['ID','A'], keep=False).values
a = df[~m].index.get_level_values(0).unique()
print(a)
Index(['B001'], dtype='object', name='ID')
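A variant of the first approach, as a sketch: aggregate nunique per ID directly and filter the resulting Series:

counts = df.groupby(level=0)['A'].nunique()
print(counts[counts > 1].index)
# Index(['B001'], dtype='object', name='ID')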
