Python Pandas groupby with agg(), nth() and/or iloc()

Given this DF:
import pandas as pd

df = pd.DataFrame({
    'Col1': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Col2': ['i', 'j', 'k', 'l', 'm', 'n', 'o'],
    'Col3': ['Apple', 'Peach', 'Apricot', 'Dog', 'Cat', 'Mouse', 'Horse'],
})
df
And then using this code:
df1 = df.groupby('Col1').agg({'Col2':'count', 'Col3': lambda x: x.iloc[2]})
df1
I got this result:
Col2 Col3
Col1
A 3 Apricot
B 4 Mouse
What I would like now:
I would like the lambda for 'Col3' (for example lambda x: x.iloc[0]) to print or return 'Not enough data' instead of raising an error. For example, changing x.iloc[0] to x.iloc[3] raises an error because group 'A' in Col1 has fewer rows than group 'B'.
Note: I don't want to use 'last', because this is a deliberately simplified and shortened DataFrame.

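You can use nth, which leaves a NaN after the join when the value is missing. Unfortunately, agg does not handle nth, so you need to compute it separately and join. (This relies on nth returning one row per group, as it did before pandas 2.0; from 2.0 onward nth acts as a filter and keeps the original row index.)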
g = df.groupby('Col1')
df1 = g.agg({'Col2':'count'}).join(g['Col3'].nth(3))
output:
Col2 Col3
Col1
A 3 NaN
B 4 Horse

You can also use a slice, which returns an empty Series instead of raising an error when the position is out of range:
df1 = df.groupby('Col1').agg({'Col2': 'count',
                              'Col3': lambda x: x.iloc[3:4] if len(x.iloc[3:4]) else pd.NA})
print(df1)
Col2 Col3
Col1
A 3 <NA>
B 4 Horse
You can save some typing with an assignment expression (the := "walrus" operator) if your Python version is 3.8 or later:
df1 = df.groupby('Col1').agg({'Col2': 'count',
                              'Col3': lambda x: v if len(v := x.iloc[3:4]) else pd.NA})
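A variation on the answers above, as a minimal sketch: a small named helper checks the group length before indexing and returns the fallback text the question asks for. The name nth_or_default is hypothetical (it is not a pandas function), and the sketch assumes a non-negative position n and the df from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Col2': ['i', 'j', 'k', 'l', 'm', 'n', 'o'],
    'Col3': ['Apple', 'Peach', 'Apricot', 'Dog', 'Cat', 'Mouse', 'Horse'],
})

def nth_or_default(s, n, default='Not enough data'):
    """Hypothetical helper: n-th value of a group, or `default` if the group is too short."""
    # Assumes n >= 0; the length check avoids the IndexError from s.iloc[n]
    return s.iloc[n] if len(s) > n else default

df1 = df.groupby('Col1').agg({
    'Col2': 'count',
    'Col3': lambda x: nth_or_default(x, 3),
})
print(df1)
```

Because the helper always returns a scalar, each cell of the result holds a plain value rather than a one-element Series.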

Related

How to remove header from pandas Styler?

I want to remove the header from the pandas Styler so that I can render it.
What I have tried:
def highlight(x):
    c1 = 'background-color: #f5f5dc'
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    df1.loc[['A'], :] = c1
    return df1
temp = {'col1': ['abc', 'def'], 'col2': [1.0, 2.0]}
df = pd.DataFrame(temp)
df.index = ['A', 'B']
print(df)
df.style.apply(highlight, axis=None).hide_index()
Output
col1 col2
-------------
abc 1.000000
def 2.000000
But I want to remove the col1 and col2 header row, because it shows up in my rendered page and I don't need it.
Is there any way I can do it?
In pandas 1.4.0 both hide_index and hide_columns were deprecated (GH43758) in favour of hide with axis=...
With labels:
df.style.apply(highlight, axis=None).hide(axis='index').hide(axis='columns')
With axis numbers:
df.style.apply(highlight, axis=None).hide(axis=0).hide(axis=1)
In versions prior to 1.4.0, hide_columns can be used to hide the column headers, just as hide_index hides the index:
df.style.apply(highlight, axis=None).hide_index().hide_columns()

Python using apply function to skip Nan

I am trying to preprocess a dataset to use for XGBoost by mapping the classes in each column to numerical values. A working example looks like this:
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
import pandas as pd
df1 = pd.DataFrame(data = {'col1': ['A', 'B','C','B','A'], 'col2': ['Z', 'X','Z','Z','Y'], 'col3':['I','J','I','J','J']})
d = defaultdict(LabelEncoder)
encodedDF = df1.apply(lambda x: d[x.name].fit_transform(x))
inv = encodedDF.apply(lambda x: d[x.name].inverse_transform(x))
Where encodedDF gives the output:
col1 col2 col3
0 2 0
1 0 1
2 2 0
1 2 1
0 1 1
And inv just reverts it back to the original dataframe. My issue is when null values get introduced:
df2 = pd.DataFrame(data = {'col1': ['A', 'B',None,'B','A'], 'col2': ['Z', 'X','Z',None,'Y'], 'col3':['I','J','I','J','J']})
encodedDF = df2.apply(lambda x: d[x.name].fit_transform(x))
Running the above will throw the error:
"TypeError: ('argument must be a string or number', 'occurred at index col1')"
Basically, I want to apply the encoding, but skip over the individual cell values that are null to get an output like this:
col1 col2 col3
0 2 0
1 0 1
NaN 2 0
1 NaN 1
0 1 1
I can't use dropna() before applying the encoding because then I lose data that I will be trying to impute down the line with XGBoost. I can't use conditionals to skip x if null, (e.g. using x.notnull() in the lambda function) because fit_transform(x) uses a Pandas.Series object as the argument, and none of the logical operators that I could use in the conditional appear to do what I'm trying to do. I'm not sure what else to try in order to get this to work. I hope what I'm trying to do makes sense. Let me know if I need to clarify.
I think I figured out a workaround. I probably should have been using sklearn's OneHotEncoder class from the beginning instead of the LabelEncoder/defaultdict combo. I'm brand new to all this. I replaced NaNs with dummy values, and then dropped those dummy values once I encoded the dataframe.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame(data = {'col1': ['A', 'B','C',None,'A'], 'col2': ['Z', 'X',None,'Z','Y'], 'col3':['I','J',None,'J','J'], 'col4':[45,67,None,32,94]})
replaceVals = {'col1':'missing','col2':'missing','col3':'missing','col4':-1}
df = df.fillna(value = replaceVals,axis=0)
drop = [['missing'],['missing'],['missing'],[-1]]
enc = OneHotEncoder(drop=drop)
encodeDF = enc.fit_transform(df)
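For the output originally asked for (integer codes with NaN left in place), here is a minimal sketch that swaps in pandas categorical codes instead of LabelEncoder or OneHotEncoder. The function name encode_skip_null is hypothetical. It relies on pandas assigning the code -1 to missing values and ordering categories by sorted value.

```python
import pandas as pd

df2 = pd.DataFrame(data={'col1': ['A', 'B', None, 'B', 'A'],
                         'col2': ['Z', 'X', 'Z', None, 'Y'],
                         'col3': ['I', 'J', 'I', 'J', 'J']})

def encode_skip_null(col):
    """Hypothetical helper: integer-encode a column, keeping nulls as NaN."""
    # cat.codes uses -1 for missing values; cast to float so NaN can be stored
    codes = col.astype('category').cat.codes.astype('float64')
    return codes.mask(codes == -1)  # replace the -1 markers with NaN

encoded = df2.apply(encode_skip_null)
print(encoded)
```

Unlike the defaultdict of LabelEncoder objects, this sketch does not keep a fitted encoder for inverse_transform; the category labels would have to be stored separately if the mapping needs to be reversed.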

None of [Index(['a', 'c'], dtype='object')] are in the [columns] error

I have a main dataframe and want to create a sub-dataframe with specific columns:
df_main=
[a,b,c,d]
[1,3,6,0]
When I want to pick specific columns and create a new one, it throws me this ugly error:
df_new = df.loc[:, ['a','c']]
df_new.head()
Out: None of [Index(['a', 'c'], dtype='object')] are in the [columns]
What is the issue here?
If I understand correctly, selecting the columns this way works on a frame whose columns are really named 'a' and 'c':
import pandas as pd
df = pd.DataFrame({'a':[1], 'b':[3], 'c':[6], 'd':[0]})
df_new = df.loc[:, ['a','c']]
df_new:
a c
0 1 6
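Since the selection itself works, the error usually means the labels in the real frame are not exactly 'a' and 'c'. The question does not show the actual labels, so the following is only a sketch of one assumed cause: stray whitespace in the headers. The frame below is hypothetical. It is also worth checking that the selection runs on the intended frame, since the question defines df_main but selects from df.

```python
import pandas as pd

# Hypothetical frame whose headers carry stray spaces (assumed cause, not confirmed by the question)
df_main = pd.DataFrame([[1, 3, 6, 0]], columns=[' a', 'b', 'c ', 'd'])

# Inspect the exact labels; the repr shows any leading or trailing spaces
print(df_main.columns.tolist())

try:
    df_main.loc[:, ['a', 'c']]
except KeyError as err:
    print('KeyError:', err)

# Strip whitespace from the labels, then select as usual
df_main.columns = df_main.columns.str.strip()
df_new = df_main.loc[:, ['a', 'c']]
print(df_new)
```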

Remove consecutive duplicate entries from pandas in each cell

I have a data frame that looks like
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
pd.DataFrame(data=d)
expected output
d={'col1':['a,b','a,c,b'],'col2':['a,b','a,b,a']}
I have tried like this :
from itertools import groupby
arr = ['a', 'a', 'b', 'a', 'a', 'c','c']
print([x[0] for x in groupby(arr)])
How do I remove the duplicate entries in each row and column of dataframe?
a,a,b,c should be a,b,c
From what I understand, you want to drop values that repeat consecutively. You can try this custom function:
def myfunc(x):
    s = pd.Series(x.split(','))
    res = s[s.ne(s.shift())]  # keep values that differ from the previous one
    return ','.join(res.values)
print(df.applymap(myfunc))
col1 col2
0 a,b a,b
1 a,c,b a,b,a
Alternatively, the function can be written with itertools.groupby:
from itertools import groupby
def myfunc(x):
    l = [x[0] for x in groupby(x.split(','))]
    return ','.join(l)
You could define a function to help with this, then use .applymap to apply it to all columns (or .apply one column at a time):
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)
def remove_dups(string):
    split = string.split(',')  # split string into a list
    uniques = set(split)  # remove duplicate list elements (order is not preserved)
    return ','.join(uniques)  # rejoin the list elements into a string
result = df.applymap(remove_dups)
This returns (the order within each cell may vary, because sets are unordered):
col1 col2
0 a,b a,b
1 a,c,b a,b
Edit: This looks slightly different from your expected output. Why do you expect a,b,a for the second row in col2?
Edit2: to preserve the original order, you can replace the set() function with unique_everseen()
from more_itertools import unique_everseen
.
.
.
uniques = unique_everseen(split)
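The difference between the two answers can be sketched side by side: removing only consecutive repeats keeps a,b,a, while removing every repeat collapses it to a,b. The function names below are hypothetical, and dict.fromkeys is used here as a standard-library stand-in for unique_everseen (both keep first-seen order).

```python
from itertools import groupby

import pandas as pd

def drop_consecutive(s):
    """Hypothetical helper: collapse runs of equal neighbours only."""
    return ','.join(key for key, _ in groupby(s.split(',')))

def drop_all_repeats(s):
    """Hypothetical helper: keep the first occurrence of each value, in order."""
    return ','.join(dict.fromkeys(s.split(',')))

df = pd.DataFrame({'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']})

# Apply element-wise, one column at a time
consecutive = df.apply(lambda col: col.map(drop_consecutive))
everything = df.apply(lambda col: col.map(drop_all_repeats))

print(consecutive)
print(everything)
```

Only the first variant reproduces the expected output from the question, where the second row of col2 stays a,b,a.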

Pandas, concatenating values of columns.

I have found answers to this question on here before, but none of them seem to work for me. Right now I have a data frame with a list of clients and their addresses. However, each address is split across many columns and I'm trying to combine them into one.
The code I have so far reads as follows:
data1_df['Address'] = data1_df['Address 1'].map(str) + ", " + data1_df['Address 2'].map(str) + ", " + data1_df['Address 3'].map(str) + ", " + data1_df['city'].map(str) + ", " + data1_df['city'].map(str) + ", " + data1_df['Province/State'].map(str) + ", " + data1_df['Country'].map(str) + ", " + data1_df['Postal Code'].map(str)
However, the error I get is:
TypeError: Unary plus expects numeric dtype, not object
I'm not sure why it's not accepting the strings as they are and using the + operator. Shouldn't the plus accommodate objects?
Hopefully you'll find this example helpful:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1,2,3],
'B': list('ABC'),
'C': [4,5,np.nan],
'D': ['One', np.nan, 'Three']})
addColumns = ['B', 'C', 'D']
df['Address'] = df[addColumns].astype(str).apply(lambda x: ', '.join([i for i in x if i != 'nan']), axis=1)
df
# A B C D Address
#0 1 A 4.0 One A, 4.0, One
#1 2 B 5.0 NaN B, 5.0
#2 3 C NaN Three C, Three
The above works because astype(str) turns NaN into the string 'nan', which the list comprehension then filters out.
Alternatively, fill NaN with empty strings first and filter out the empty strings:
df['Address'] = df[addColumns].fillna('').astype(str).apply(lambda x: ', '.join([i for i in x if i]), axis=1)
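A third variation, as a minimal sketch on the same example frame: drop the missing values from each row before converting to strings, so no sentinel such as 'nan' or '' is needed. The variable name add_columns is just a renamed addColumns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': list('ABC'),
                   'C': [4, 5, np.nan],
                   'D': ['One', np.nan, 'Three']})
add_columns = ['B', 'C', 'D']

# For each row: drop missing values, convert the rest to str, then join
df['Address'] = df[add_columns].apply(
    lambda row: ', '.join(row.dropna().astype(str)), axis=1)
print(df)
```

One consequence of this design is that a cell that genuinely contains the text 'nan' is kept, whereas the string-comparison version would silently drop it.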
In the case of columns with NaN values that you need to add together, here's some logic:
def add_cols_w_nan(df, col_list, space_char, new_col_name):
    """ Add together multiple columns where some of the columns
    may contain NaN, with the appropriate amount of spacing between columns.
    Examples:
        'Mr.' + NaN + 'Smith' becomes 'Mr. Smith'
        'Mrs.' + 'J.' + 'Smith' becomes 'Mrs. J. Smith'
        NaN + 'J.' + 'Smith' becomes 'J. Smith'
    Args:
        df: pd.DataFrame
            DataFrame for which strings are added together.
        col_list: ORDERED list of column names, eg. ['first_name',
            'middle_name', 'last_name']. The columns will be added in order.
        space_char: str
            Character to insert between concatenation of columns.
        new_col_name: str
            Name of the new column after adding together strings.
    Returns: pd.DataFrame with a string addition column
    """
    df2 = df[col_list].copy()
    # Convert to strings, leave nulls alone
    df2 = df2.where(df2.isnull(), df2.astype('str'))
    # Add space character, NaN remains NaN, which is important
    df2.loc[:, col_list[1:]] = space_char + df2.loc[:, col_list[1:]]
    # Fix rows where leading columns are null
    to_fix = df2.notnull().idxmax(1)
    for col in col_list[1:]:
        m = to_fix == col
        df2.loc[m, col] = df2.loc[m, col].str.replace(space_char, '')
    # So that summation works
    df2[col_list] = df2[col_list].replace(np.nan, '')
    # Add together all columns
    df[new_col_name] = df2[col_list].sum(axis=1)
    # If all are missing replace with missing
    df[new_col_name] = df[new_col_name].replace('', np.nan)
    del df2
    return df
Sample Data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Address 1': ['AAA', 'ABC', np.nan, np.nan, np.nan],
                   'Address 2': ['foo', 'bar', 'baz', None, np.nan],
                   'Address 3': [np.nan, np.nan, 17, np.nan, np.nan],
                   'city': [np.nan, 'here', 'there', 'anywhere', np.nan],
                   'state': ['NY', 'TX', 'WA', 'MI', np.nan]})
# Address 1 Address 2 Address 3 city state
#0 AAA foo NaN NaN NY
#1 ABC bar NaN here TX
#2 NaN baz 17.0 there WA
#3 NaN None NaN anywhere MI
#4 NaN NaN NaN NaN NaN
df = add_cols_w_nan(
    df,
    col_list=['Address 1', 'Address 2', 'Address 3', 'city', 'state'],
    space_char=', ',
    new_col_name='full_address')
df.full_address.tolist()
#['AAA, foo, NY',
# 'ABC, bar, here, TX',
# 'baz, 17.0, there, WA',
# 'anywhere, MI',
# nan]
