How to display cross-tabs of all interactions? - python-3.x

I have a dataset which (in simplified form) looks like this:
import pandas as pd
df = pd.DataFrame({"target": [20, 30, 40], "x1": [1, 0, 1], "x2": [0, 1, 1], "x3": [0, 0, 1]})
And I want to find the average value of target for all possible two-variable (x_i, x_j) interactions.
How would I go about doing this in Pandas?

You can use pivot_table, and to add the missing combinations, reindex the columns with a MultiIndex created by from_product:
df = df.pivot_table(index='x1',columns=['x2','x3'], values='target')
mux = pd.MultiIndex.from_product(df.columns.levels, names=df.columns.names)
df = df.reindex(columns=mux)
print (df)
x2      0          1      
x3      0    1     0     1
x1                        
0     NaN  NaN  30.0   NaN
1    20.0  NaN   NaN  40.0
If you want to replace NaN with 0:
df = df.pivot_table(index='x1',columns=['x2','x3'], values='target', fill_value=0)
mux = pd.MultiIndex.from_product(df.columns.levels, names=df.columns.names)
df = df.reindex(columns=mux, fill_value=0)
print (df)
x2   0      1    
x3   0  1   0   1
x1               
0    0  0  30   0
1   20  0   0  40
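The pivot above crosses all three variables at once. If you literally want the mean of target for every two-variable pair (x_i, x_j), a minimal sketch (assuming the same column names as above) loops over itertools.combinations and groups by each pair:

import itertools
import pandas as pd

df = pd.DataFrame({"target": [20, 30, 40],
                   "x1": [1, 0, 1],
                   "x2": [0, 1, 1],
                   "x3": [0, 0, 1]})

# Mean target for each two-variable interaction (x_i, x_j)
for xi, xj in itertools.combinations(["x1", "x2", "x3"], 2):
    print(df.groupby([xi, xj])["target"].mean().unstack(), "\n")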

Related

Delete row from dataframe having "None" value in all the columns - Python

I need to delete rows that have the value "None" in all columns of a dataframe. I am using the following code:
df.dropna(axis=0,how='all',thresh=None,subset=None,inplace=True)
This makes no difference to the dataframe; the rows with "None" values are still there.
How to achieve this?
Those Nones are probably strings, not real missing values, so use replace first:
import numpy as np

df = df.replace('None', np.nan).dropna(how='all')
Sample:
df = pd.DataFrame({
    'a': ['None', 'a', 'None'],
    'b': ['None', 'g', 'None'],
    'c': ['None', 'v', 'b'],
})
print (df)
      a     b     c
0  None  None  None
1     a     g     v
2  None  None     b
df1 = df.replace('None', np.nan).dropna(how='all')
print (df1)
     a    b  c
1    a    g  v
2  NaN  NaN  b
Or test for values not equal to 'None' with ne and DataFrame.any:
df1 = df[df.ne('None').any(axis=1)]
print (df1)
      a     b  c
1     a     g  v
2  None  None  b
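The root cause is that the string 'None' is not a missing value to pandas; with real None/NaN cells the original dropna(how='all') would already work. A quick check with a small hypothetical frame:

df2 = pd.DataFrame({'a': [None, 'a'], 'b': [None, 'g']})
print(df2.dropna(how='all'))
   a  b
1  a  g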
If you want to drop columns instead, use axis=1 and the how keyword to drop columns with any or all NaN values. Check the docs.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3], 'b':[-1, 0, np.nan], 'c':[np.nan, np.nan, np.nan]})
df
   a    b   c
0  1 -1.0 NaN
1  2  0.0 NaN
2  3  NaN NaN
df.dropna(axis=1, how='any')
   a
0  1
1  2
2  3
df.dropna(axis=1, how='all')
   a    b
0  1 -1.0
1  2  0.0
2  3  NaN

counting NaNs in 'for' loop in python

I am trying to iterate over the rows in df and count consecutive rows where a certain value is NaN or 0, restarting the count whenever the value changes away from NaN or 0. I would like to get something like this:
Value  Period
0      1
0      2
0      3
NaN    4
21     NaN
4      NaN
0      1
0      2
NaN    3
I wrote the function which takes a dataframe as an argument and returns it with an additional column which denotes the count:
def calc_period(df):
    period_x = []
    sum_x = 0
    for i in range(1, df.shape[0]):
        if df.iloc[i, 0] == np.nan or df.iloc[i, 0] == 0:
            sum_x += 1
            period_x.append(sum_x)
        else:
            period_x.append(None)
            sum_x = 0
    period_x.append(sum_x)
    df['period_x'] = period_x
    return df
The function works well when the value is 0. But when the value is NaN the count is also NaN and I get the following result:
Value  Period
0      1
0      2
0      3
NaN    NaN
NaN    NaN
Here is a revised version of your code:
import pandas as pd
import numpy as np
import math

def is_nan_or_zero(val):
    return math.isnan(val) or val == 0

def calc_period(df):
    is_first_nan_or_zero = is_nan_or_zero(df.iloc[0, 0])
    period_x = [1 if is_first_nan_or_zero else np.nan]
    sum_x = 1 if is_first_nan_or_zero else 0
    for i in range(1, df.shape[0]):
        val = df.iloc[i, 0]
        if is_nan_or_zero(val):
            sum_x += 1
            period_x.append(sum_x)
        else:
            period_x.append(None)
            sum_x = 0
    df['period_x'] = period_x
    return df
There were two fixes:
1. Replacing df.iloc[i,0] == np.nan with math.isnan(val), since comparing against np.nan with == is always False.
2. Removing period_x.append(sum_x) at the end, and adding the first period value instead (since iteration starts from the second value).
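As an aside, the loop can usually be avoided entirely. A vectorized sketch (assuming the column is named Value, as in the example output) builds run identifiers with cumsum and counts within each run:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [0, 0, 0, np.nan, 21, 4, 0, 0, np.nan]})

# True for rows that are NaN or 0
mask = df['Value'].isna() | (df['Value'] == 0)
# Run identifier: constant within each consecutive NaN/0 run
group = (~mask).cumsum()
# Cumulative count inside each run; other rows become NaN
df['Period'] = mask.groupby(group).cumsum().where(mask)
print(df)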

Splitting Column Lists in Pandas DataFrame

I'm looking for a good way to solve the following problem. My current fix is not particularly clean, and I'm hoping to learn from your insight.
Suppose I have a Pandas DataFrame whose entries look like this:
>>> df=pd.DataFrame(index=[1,2,3],columns=['Color','Texture','IsGlass'])
>>> df['Color']=[np.nan,['Red','Blue'],['Blue', 'Green', 'Purple']]
>>> df['Texture']=[['Rough'],np.nan,['Silky', 'Shiny', 'Fuzzy']]
>>> df['IsGlass']=[1,0,1]
>>> df
                         Color                      Texture  IsGlass
1                          NaN                    ['Rough']        1
2              ['Red', 'Blue']                          NaN        0
3  ['Blue', 'Green', 'Purple']  ['Silky', 'Shiny', 'Fuzzy']        1
So each observation in the index corresponds to something I measured about its color, texture, and whether it's glass or not. What I'd like to do is turn this into a new "indicator" DataFrame, by creating a column for each observed value, and changing the corresponding entry to a one if I observed it, and NaN if I have no information.
>>> df
   Red  Blue  Green  Purple  Rough  Silky  Shiny  Fuzzy  IsGlass
1  NaN   NaN    NaN     NaN      1    NaN    NaN    NaN        1
2    1     1    NaN     NaN    NaN    NaN    NaN    NaN        0
3  NaN     1      1       1    NaN      1      1      1        1
I have a solution that loops over each column, looks at its values, and through a series of try/excepts for non-NaN values splits the lists, creates a new column, etc., and concatenates.
This is my first post to StackOverflow - I hope this post conforms to the posting guidelines. Thanks.
Stacking Hacks!
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = df.stack().unstack(fill_value=[])

def b(c):
    d = mlb.fit_transform(c)
    return pd.DataFrame(d, c.index, mlb.classes_)

pd.concat([b(df[c]) for c in ['Color', 'Texture']], axis=1).join(df.IsGlass)
   Blue  Green  Purple  Red  Fuzzy  Rough  Shiny  Silky  IsGlass
1     0      0       0    0      0      1      0      0        1
2     1      0       0    1      0      0      0      0        0
3     1      1       1    0      1      0      1      1        1
I am just using pandas get_dummies:
l = [pd.get_dummies(df[x].apply(pd.Series).stack(dropna=False)).sum(level=0)
     for x in ['Color', 'Texture']]
pd.concat(l, axis=1).assign(IsGlass=df.IsGlass)
Out[662]:
   Blue  Green  Purple  Red  Fuzzy  Rough  Shiny  Silky  IsGlass
1     0      0       0    0      0      1      0      0        1
2     1      0       0    1      0      0      0      0        0
3     1      1       1    0      1      0      1      1        1
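For completeness, a pure-pandas sketch without sklearn (assuming the same df as above) joins each list into a delimited string and one-hot encodes it with str.get_dummies; the reindex restores NaN for rows with no information, which matches the asker's desired output:

# Hypothetical alternative: encode each list column via string joining
parts = [df[c].dropna().str.join('|').str.get_dummies()
         for c in ['Color', 'Texture']]
pd.concat(parts, axis=1).reindex(df.index).join(df.IsGlass)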
For each texture/color in each row, I check if the value is null. If not, we add that value as a column = 1 for that row.
import numpy as np
import pandas as pd

df = pd.DataFrame(index=[1, 2, 3], columns=['Color', 'Texture', 'IsGlass'])
df['Color'] = [np.nan, ['Red', 'Blue'], ['Blue', 'Green', 'Purple']]
df['Texture'] = [['Rough'], np.nan, ['Silky', 'Shiny', 'Fuzzy']]
df['IsGlass'] = [1, 0, 1]

for row in df.itertuples():
    if not np.all(pd.isnull(row.Color)):
        for val in row.Color:
            df.loc[row.Index, val] = 1
    if not np.all(pd.isnull(row.Texture)):
        for val in row.Texture:
            df.loc[row.Index, val] = 1
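The np.all(pd.isnull(...)) guard is needed because a cell holds either a scalar NaN or a list: pd.isnull on a list returns an element-wise boolean array, and np.all reduces it to a single value. A quick check:

print(np.all(pd.isnull(np.nan)))     # True: a missing cell
print(np.all(pd.isnull(['Rough'])))  # False: a real list of labels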

Removing negative values in pandas column keeping NaN

I was wondering how I can remove rows which have a negative value but keep the NaNs. At the moment I am using:
DF = DF.ix[DF['RAF01Time'] >= 0]
But this removes the NaNs.
Thanks in advance.
You need boolean indexing with an additional isnull condition:
DF = DF[(DF['RAF01Time'] >= 0) | (DF['RAF01Time'].isnull())]
Sample:
DF = pd.DataFrame({'RAF01Time':[-1,2,3,np.nan]})
print (DF)
   RAF01Time
0       -1.0
1        2.0
2        3.0
3        NaN
DF = DF[(DF['RAF01Time'] >= 0) | (DF['RAF01Time'].isnull())]
print (DF)
   RAF01Time
1        2.0
2        3.0
3        NaN
Another solution with query:
DF = DF.query("~(RAF01Time < 0)")
print (DF)
   RAF01Time
1        2.0
2        3.0
3        NaN
You can just use < 0 and take the inverse of the condition. This keeps NaN because any comparison with NaN evaluates to False, so ~(DF['RAF01Time'] < 0) is True for NaN rows.
DF = DF[~(DF['RAF01Time'] < 0)]

Element-wise Maximum of Two DataFrames Ignoring NaNs

I have two dataframes (df1 and df2) that each have the same rows and columns. I would like to take the maximum of these two dataframes, element-by-element. In addition, the result of any element-wise maximum with a number and NaN should be the number. The approach I have implemented so far seems inefficient:
def element_max(df1, df2):
    import pandas as pd
    cond = df1 >= df2
    res = pd.DataFrame(index=df1.index, columns=df1.columns)
    res[(df1 == df1) & (df2 == df2) & (cond)] = df1[(df1 == df1) & (df2 == df2) & (cond)]
    res[(df1 == df1) & (df2 == df2) & (~cond)] = df2[(df1 == df1) & (df2 == df2) & (~cond)]
    res[(df1 == df1) & (df2 != df2) & (~cond)] = df1[(df1 == df1) & (df2 != df2)]
    res[(df1 != df1) & (df2 == df2) & (~cond)] = df2[(df1 != df1) & (df2 == df2)]
    return res
Any other ideas? Thank you for your time.
A more readable way to do this in recent versions of pandas is concat-and-max:
import numpy as np
import pandas as pd

A = pd.DataFrame([[1., 2., 3.]])
B = pd.DataFrame([[3., np.nan, 1.]])

pd.concat([A, B]).max(level=0)
#
#      0    1    2
# 0  3.0  2.0  3.0
#
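Note that the level argument to max was deprecated in pandas 1.3 and removed in 2.0, so on current versions the equivalent spelling groups the duplicated index labels explicitly:

# Equivalent on pandas >= 2.0
pd.concat([A, B]).groupby(level=0).max()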
You can use where to test your df against another df: where the condition is True, the values from df are returned; where it is False, the values from df1 are returned. Additionally, where df1 contains NaN, a further call to fillna(df) uses the values from df to fill those NaN and return the desired result:
In [178]:
df = pd.DataFrame(np.random.randn(5,3))
df.iloc[1,2] = np.NaN
print(df)
df1 = pd.DataFrame(np.random.randn(5,3))
df1.iloc[0,0] = np.NaN
print(df1)
          0         1         2
0  2.671118  1.412880  1.666041
1 -0.281660  1.187589       NaN
2 -0.067425  0.850808  1.461418
3 -0.447670  0.307405  1.038676
4 -0.130232 -0.171420  1.192321
          0         1         2
0       NaN -0.244273 -1.963712
1 -0.043011 -1.588891  0.784695
2  1.094911  0.894044 -0.320710
3 -1.537153  0.558547 -0.317115
4 -1.713988 -0.736463 -1.030797
In [179]:
df.where(df > df1, df1).fillna(df)
Out[179]:
          0         1         2
0  2.671118  1.412880  1.666041
1 -0.043011  1.187589  0.784695
2  1.094911  0.894044  1.461418
3 -0.447670  0.558547  1.038676
4 -0.130232 -0.171420  1.192321
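Another compact option, assuming both frames share the same index and columns, is numpy's fmax ufunc, which returns the non-NaN operand when exactly one side is missing:

import numpy as np
# Element-wise maximum that prefers the non-NaN value
res = np.fmax(df, df1)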
