Pandas groupby with dropna set to False generating wrong output - python-3.x

In the following snippet:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "b": [1, np.nan, 1, np.nan, 2, 1, 2, np.nan, 1]
    }
)
df_again = df.groupby("b", dropna=False).apply(lambda x: x)
I was expecting df and df_again to be identical. But they are not:
df
a b
0 1 1.0
1 2 NaN
2 3 1.0
3 4 NaN
4 5 2.0
5 6 1.0
6 7 2.0
7 8 NaN
8 9 1.0
df_again
a b
0 1 1.0
2 3 1.0
4 5 2.0
5 6 1.0
6 7 2.0
8 9 1.0
Now, if I slightly tweak the lambda expression to "see" what is going on with
df.groupby("b", dropna=False).apply(lambda x: print(x)), I can see that the portion of df where b is NaN is in fact processed.
What am I missing here?
(Using pandas 1.3.1 and numpy 1.20.3)

This can happen when the missing values are None rather than np.nan. None is a singleton, so it compares equal to itself:
>>> None == None
True
>>>
whereas np.nan never compares equal to itself:
>>> np.nan == np.nan
False
>>>
So make sure the column holds np.nan:
df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "b": [1, np.nan, 1, np.nan, 2, 1, 2, np.nan, 1]
    }
)
df_again = df.groupby("b", dropna=False).apply(lambda x: x)
Now df and df_again are the same:
>>> df
a b
0 1 1.0
1 2 NaN
2 3 1.0
3 4 NaN
4 5 2.0
5 6 1.0
6 7 2.0
7 8 NaN
8 9 1.0
>>> df_again
a b
0 1 1.0
1 2 NaN
2 3 1.0
3 4 NaN
4 5 2.0
5 6 1.0
6 7 2.0
7 8 NaN
8 9 1.0
>>> df.equals(df_again)
True
>>>

This was a bug introduced in pandas 1.2.0, as described here, and was solved here.
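If you are stuck on an affected version, iterating over the GroupBy object directly shows that the NaN group is produced even though apply() drops it. A minimal sketch, assuming the df defined above:
for key, group in df.groupby("b", dropna=False):
    # the nan key appears here even on affected versions
    print(key)
    print(group)

# workaround: reassemble the frame from the groups yourself
df_again = pd.concat(g for _, g in df.groupby("b", dropna=False)).sort_index()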

Related

Pandas create a new column based on existing column with chain rule

I have the code below:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [12, 4, 5, 3, 1],
                   "B": [7, 2, 54, 3, None],
                   "C": [20, 16, 11, 3, 8],
                   "D": [14, 3, None, 2, 6]})
df['A1'] = np.where(df['A'] > 10, 10, np.where(df['A'] < 3, 3, df['A']))
While this works, I want to create the final dataframe (i.e. the 2nd line of code) by chaining from the first line. I want to do this to improve readability.
Could you please help me achieve this?
You can use clip here:
df.assign(A1=df['A'].clip(upper=10,lower=3))
A B C D A1
0 12 7.0 20 14.0 10
1 4 2.0 16 3.0 4
2 5 54.0 11 NaN 5
3 3 3.0 3 2.0 3
4 1 NaN 8 6.0 3
If you really need to do this in one line (though I don't find this readable):
pd.DataFrame({"A": [12, 4, 5, 3, 1],
              "B": [7, 2, 54, 3, None],
              "C": [20, 16, 11, 3, 8],
              "D": [14, 3, None, 2, 6]}).assign(A1=lambda x: x['A'].clip(upper=10, lower=3))
You could use np.select() like the following. It makes the conditions and choices very readable.
conditions = [df['A'] > 10,
              df['A'] < 3]
choices = [10, 3]
df['A1'] = np.select(conditions, choices, default=df['A'])
print(df)
A B C D A1
0 12 7.0 20 14.0 10
1 4 2.0 16 3.0 4
2 5 54.0 11 NaN 5
3 3 3.0 3 2.0 3
4 1 NaN 8 6.0 3
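As a quick sanity check (my addition, not part of either answer), the two approaches agree on this data:
a1 = df['A'].clip(upper=10, lower=3)
a2 = np.select(conditions, choices, default=df['A'])
print((a1 == a2).all())  # True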

fill values after condition with NaN

I have a df like this:
df = pd.DataFrame(
    [
        ['A', 1],
        ['A', 1],
        ['A', 1],
        ['B', 2],
        ['B', 0],
        ['A', 0],
        ['A', 1],
        ['B', 1],
        ['B', 0]
    ], columns=['key', 'val'])
df
prints:
key val
0 A 1
1 A 1
2 A 1
3 B 2
4 B 0
5 A 0
6 A 1
7 B 1
8 B 0
I want to fill the val column with NaN from the first occurrence of 2 onward (in the example, all values in the val column from row 3 to row 8 are replaced with NaN).
I tried this:
df['val'] = np.where(df['val'].shift(-1) == 2, np.nan, df['val'])
and iterating over rows like this:
for row in df.iterrows():
    df['val'] = np.where(df['val'].shift(-1) == 2, np.nan, df['val'])
but I can't get it to fill NaN forward.
You can use boolean indexing with cummax to set those values to NaN:
df.loc[df['val'].eq(2).cummax(), 'val'] = np.nan
Alternatively you can also use Series.mask:
df['val'] = df['val'].mask(lambda x: x.eq(2).cummax())
key val
0 A 1.0
1 A 1.0
2 A 1.0
3 B NaN
4 B NaN
5 A NaN
6 A NaN
7 B NaN
8 B NaN
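To see why this works (a quick sketch of my own, run on a fresh copy of the original df): eq(2) is True only on the row holding the 2, and cummax() propagates that True to every later row, giving exactly the mask of rows to blank out.
mask = df['val'].eq(2)   # True only at row 3
print(mask.cummax())     # True from row 3 through the end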
You can try:
ind = df.loc[df['val']==2].index
df.iloc[ind[0]:,1] = np.nan
Once you get the index with df.index[df.val.shift(-1).eq(2)].item(), you can use slicing (note that shift(-1) flags the row before the 2, so here the NaNs start one row earlier than in the desired output):
idx = df.index[df.val.shift(-1).eq(2)].item()
df.iloc[idx:, 1] = np.nan
df
key val
0 A 1.0
1 A 1.0
2 A NaN
3 B NaN
4 B NaN
5 A NaN
6 A NaN
7 B NaN
8 B NaN

Pandas, how to dropna values using subset with multiindex dataframe?

I have a data frame with multi-index columns.
From this data frame I need to remove the rows with NaN values in a subset of columns.
I am trying to use the subset option of pd.dropna but I do not manage to find the way to specify the subset of columns. I have tried using pd.IndexSlice but this does not work.
In the example below I need to get rid of the last row.
import pandas as pd
# ---
a = [1, 1, 2, 2, 3, 3]
b = ["a", "b", "a", "b", "a", "b"]
col = pd.MultiIndex.from_arrays([a[:], b[:]])
val = [
    [1, 2, 3, 4, 5, 6],
    [None, None, 1, 2, 3, 4],
    [None, 1, 2, 3, 4, 5],
    [None, None, 5, 3, 3, 2],
    [None, None, None, None, 5, 7],
]
# ---
df = pd.DataFrame(val, columns=col)
# ---
print(df)
# ---
idx = pd.IndexSlice
df.dropna(axis=0, how="all", subset=idx[1:2, :])
# ---
print(df)
Using the thresh option is an alternative but if possible I would like to use subset and how='all'
When dealing with a MultiIndex, each column of the MultiIndex can be specified as a tuple:
In [67]: df.dropna(axis=0, how="all", subset=[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])
Out[67]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
Or, to select all columns whose first level equals 1 or 2 you could use:
In [69]: df.dropna(axis=0, how="all", subset=df.loc[[], [1,2]].columns)
Out[69]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
df[[1,2]].columns also works, but this returns a (possibly large) intermediate DataFrame. df.loc[[], [1,2]].columns is more memory-efficient since its intermediate DataFrame is empty.
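Another way to select those columns without building any intermediate frame at all (my addition, a sketch) is to filter the MultiIndex directly:
cols = df.columns[df.columns.get_level_values(0).isin([1, 2])]
df.dropna(axis=0, how="all", subset=cols)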
If you want to apply the dropna to the columns whose first level is 1 or 2, you can do it as follows:
cols= [(c0, c1) for (c0, c1) in df.columns if c0 in [1,2]]
df.dropna(axis=0, how="all", subset=cols)
If applied to your data, it results in:
Out[446]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
As you can see, the last row (index=4) is gone, because all columns under 1 and 2 were NaN in that row. If you instead want to remove every row where any NaN occurs in those columns, you need:
df.dropna(axis=0, how="any", subset=cols)
Which results in:
Out[447]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6

Pandas: How to build a column based on another column which is indexed by another one?

I have the dataframe presented below. I tried the solution shown, but I am not sure it is a good one.
import pandas as pd
import numpy as np

def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'Region': ['A', 'A', 'C', 'B', 'A', 'B'],
                'var-A': [2, 4, 6, 4, 6, 6],
                'var-B': [20, 30, 40, 50, 10, 20],
                'var-C': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'Region', 'var-A', 'var-B', 'var-C'])
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
    df['var'] = (np.where(df['Region'] == 'A', 1.0, 0.0) * df['var-A']
                 + np.where(df['Region'] == 'B', 1.0, 0.0) * df['var-B']
                 + np.where(df['Region'] == 'C', 1.0, 0.0) * df['var-C'])
I want the variable var to take the value of column 'var-A', 'var-B' or 'var-C' depending on the region given in the 'Region' column.
The result must be
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
You can try with lookup
df.columns=df.columns.str.split('-').str[-1]
df
Out[255]:
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
df.lookup(df.index,df.Region)
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
#df['var']=df.lookup(df.index,df.Region)
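Note (my addition): DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, an equivalent sketch using factorize and NumPy indexing is:
idx, cols = pd.factorize(df['Region'])
df['var'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]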

Filling in nans for numbers in a column-specific way

Given a DataFrame and a list of index labels (one per column), is there an efficient pandas function that puts NaN in all the values vertically preceding each of the listed entries?
For example, suppose we have the list [4,8] and the following DataFrame:
index 0 1
5 1 2
2 9 3
4 3.2 3
8 9 8.7
The desired output is simply:
index 0 1
5 nan nan
2 nan nan
4 3.2 nan
8 9 8.7
Any suggestions for such a function that does this fast?
Here's one NumPy approach based on np.searchsorted -
s = [4,8]
a = df.values
idx = df.index.values
sidx = np.argsort(idx)
matching_row_indx = sidx[np.searchsorted(idx, s, sorter = sidx)]
mask = np.arange(a.shape[0])[:,None] < matching_row_indx
a[mask] = np.nan
Sample run -
In [107]: df
Out[107]:
0 1
index
5 1.0 2.0
2 9.0 3.0
4 3.2 3.0
8 9.0 8.7
In [108]: s = [4,8]
In [109]: a = df.values
...: idx = df.index.values
...: sidx = np.argsort(idx)
...: matching_row_indx = sidx[np.searchsorted(idx, s, sorter = sidx)]
...: mask = np.arange(a.shape[0])[:,None] < matching_row_indx
...: a[mask] = np.nan
...:
In [110]: df
Out[110]:
0 1
index
5 NaN NaN
2 NaN NaN
4 3.2 NaN
8 9.0 8.7
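One caveat (my note, not the answerer's): mutating a = df.values edits df in place only when the frame is backed by a single homogeneous block, as in this all-float sample. To be safe, write the result back explicitly:
df[:] = a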
It was a bit tricky to recreate your example but this should do it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'index': [5, 2, 4, 8], 0: [1, 9, 3.2, 9], 1: [2, 3, 3, 8.7]})
df.set_index('index', inplace=True)
for i, item in enumerate([4, 8]):
    for index, row in df.iterrows():
        if index != item:
            # write back via .loc: mutating the row yielded by
            # iterrows() does not modify the frame itself
            df.loc[index, i] = np.nan
        else:
            break
