Filling in nans for numbers in a column-specific way - python-3.x

Given a DataFrame and a list of indexes, is there an efficient pandas function that puts NaN for all values vertically preceding each of the entries of the list?
For example, suppose we have the list [4,8] and the following DataFrame:
index 0 1
5 1 2
2 9 3
4 3.2 3
8 9 8.7
The desired output is simply:
index 0 1
5 nan nan
2 nan nan
4 3.2 nan
8 9 8.7
Any suggestions for such a function that does this fast?

Here's one NumPy approach based on np.searchsorted -
s = [4,8]
a = df.values
idx = df.index.values
sidx = np.argsort(idx)
matching_row_indx = sidx[np.searchsorted(idx, s, sorter = sidx)]
mask = np.arange(a.shape[0])[:,None] < matching_row_indx
a[mask] = np.nan
Sample run -
In [107]: df
Out[107]:
0 1
index
5 1.0 2.0
2 9.0 3.0
4 3.2 3.0
8 9.0 8.7
In [108]: s = [4,8]
In [109]: a = df.values
...: idx = df.index.values
...: sidx = np.argsort(idx)
...: matching_row_indx = sidx[np.searchsorted(idx, s, sorter = sidx)]
...: mask = np.arange(a.shape[0])[:,None] < matching_row_indx
...: a[mask] = np.nan
...:
In [110]: df
Out[110]:
0 1
index
5 NaN NaN
2 NaN NaN
4 3.2 NaN
8 9.0 8.7
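Note (my addition, not part of the original answer): this relies on df.values sharing memory with the DataFrame, which only holds for a homogeneous float frame and may not hold at all under copy-on-write in newer pandas. A more defensive sketch writes the masked array back explicitly:
s = [4, 8]
a = df.to_numpy(dtype=float)                    # work on a float ndarray
idx = df.index.values
sidx = np.argsort(idx)
matching_row_indx = sidx[np.searchsorted(idx, s, sorter=sidx)]
mask = np.arange(a.shape[0])[:, None] < matching_row_indx   # rows above each target, per column
a[mask] = np.nan
df.loc[:, :] = a                                # write the result back into df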

It was a bit tricky to recreate your example but this should do it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'index': [5, 2, 4, 8], 0: [1, 9, 3.2, 9], 1: [2, 3, 3, 8.7]})
df.set_index('index', inplace=True)
for i, item in enumerate([4, 8]):
    for index, row in df.iterrows():
        if index != item:
            row[i] = np.nan
        else:
            break
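Note that iterrows yields copies of the rows, so assignments through row may never reach df. A vectorized sketch that sidesteps the row loop (nan_above is a hypothetical helper name, not from the answer):
import numpy as np
import pandas as pd

def nan_above(df, labels):
    # for column j, blank out every row positioned above the row whose index label is labels[j]
    out = df.astype(float)
    positions = df.index.get_indexer(labels)        # positional row of each label
    row_pos = np.arange(len(df))[:, None]
    out[:] = np.where(row_pos < positions, np.nan, out.to_numpy())
    return out

print(nan_above(df, [4, 8]))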

Related

fill values after condition with NaN

I have a df like this:
df = pd.DataFrame(
    [
        ['A', 1],
        ['A', 1],
        ['A', 1],
        ['B', 2],
        ['B', 0],
        ['A', 0],
        ['A', 1],
        ['B', 1],
        ['B', 0]
    ], columns=['key', 'val'])
df
print:
key val
0 A 1
1 A 1
2 A 1
3 B 2
4 B 0
5 A 0
6 A 1
7 B 1
8 B 0
I want to fill the rows after the first 2 in the val column with NaN (in the example, all values in the val column from row 3 to 8 are replaced with NaN).
I tried this:
df['val'] = np.where(df['val'].shift(-1) == 2, np.nan, df['val'])
and iterating over rows like this:
for row in df.iterrows():
df['val'] = np.where(df['val'].shift(-1) == 2, np.nan, df['val'])
but can't get it to fill NaN forward.
You can use boolean indexing with cummax to fill nan values:
df.loc[df['val'].eq(2).cummax(), 'val'] = np.nan
Alternatively you can also use Series.mask:
df['val'] = df['val'].mask(lambda x: x.eq(2).cummax())
key val
0 A 1.0
1 A 1.0
2 A 1.0
3 B NaN
4 B NaN
5 A NaN
6 A NaN
7 B NaN
8 B NaN
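To see what the condition selects, the intermediate mask latches to True from the first 2 onward (a quick check against the original df):
mask = df['val'].eq(2).cummax()
print(mask.tolist())
# [False, False, False, True, True, True, True, True, True]
# cummax of a boolean Series stays True once the first True is seen,
# so every row from the first 2 onward is selected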
You can try:
ind = df.loc[df['val']==2].index
df.iloc[ind[0]:,1] = np.nan
Once you get the index with df.index[df.val.shift(-1).eq(2)].item(), you can use slicing:
idx = df.index[df.val.shift(-1).eq(2)].item()
df.iloc[idx:, 1] = np.nan
df
key val
0 A 1.0
1 A 1.0
2 A NaN
3 B NaN
4 B NaN
5 A NaN
6 A NaN
7 B NaN
8 B NaN

Loop over columns with df.shift in Python

Let's say you have a DataFrame like this:
df = pd.DataFrame({'A': [3, 1, 2, 3],
                   'B': [5, 6, 7, 8]})
df
A B
0 3 5
1 1 6
2 2 7
3 3 8
Now I want to shift and calculate on each column. I put the values I want them shifted by in the index:
range_span = range(4)
result = pd.DataFrame(index=range_span)
Then I try to populate result with the following:
for c in df.columns:
    for i in range_span:
        result.iloc[i][c] = df[c].shift(i).max()
result
This only returns the index. I expected something like this:
You've got 3 critical issues:
issue #1
This line
result.iloc[i][c] = df[c].shift(i).max()
raises a warning that helps explain why result stays empty:
...\pandas\core\indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
According to the documentation:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
Since iloc[i] returns a slice (i.e. a copy) of that row, the assignment never reaches the original result DataFrame. This is also why iloc didn't raise an issue when it was given a str index; that is explained in issue #2.
Instead, use iloc (or loc with str labels) like this:
>>> df
A B C
0 1 10 100
1 2 20 200
2 3 30 300
>>> df.iloc[1, 2]
200
>>> df.iloc[[1, 2], [1, 2]]
B C
1 20 200
2 30 300
>>> df.iloc[1:3, 1:3]
B C
1 20 200
2 30 300
>>> df.iloc[:, 1:3]
B C
0 10 100
1 20 200
2 30 300
# ..and so on
issue #2
If you fix issue #1, then you'll see the following error:
result.iloc[[i][c]] = df[c].shift(i).max()
TypeError: list indices must be integers or slices, not str
Also from their document:
property DataFrame.iloc: Purely integer-location based indexing for selection by position.
In for c in df.columns: you're passing the column names A, B, which are str, not int. Use loc instead for str column labels.
This didn't raise a TypeError before because of issue #1: c was passed as the argument of __setitem__().
Issue #3
Normally a DataFrame cannot be enlarged without special functions like combine.
# using same df from #1
>>> df.iloc[1, 3] = 300
Traceback (most recent call last):
File "~\pandas\core\indexing.py", line 1394, in _has_valid_setitem_indexer
raise IndexError("iloc cannot enlarge its target object")
IndexError: iloc cannot enlarge its target object
An easier fix would be to build a dict and convert it to a DataFrame once the manipulation is complete (see the sketch after the next example), or simply to create a DataFrame of matching or larger size up front:
>>> df2 = pd.DataFrame(index=range(4), columns=range(3))
>>> df2
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
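The dict route mentioned above could look roughly like this (my sketch, reusing the shift/max computation from the question):
data = {}
for col in df.columns:
    data[col] = [df[col].shift(i).max() for i in df.index]
result = pd.DataFrame(data, index=df.index)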
Combining all of this, a correct fix would be:
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2, 3],
                   'B': [5, 6, 7, 8]})
result = pd.DataFrame(index=df.index, columns=df.columns)

for col in df.columns:
    for index in df.index:
        result.loc[index, col] = df[col].shift(index).max()

print(result)
Output:
A B
0 3 8
1 3 7
2 3 6
3 3 5
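As a side note (my own observation, not part of the answer): since df[col].shift(i).max() is just the maximum of the first len(df) - i values of the column, the whole result is the column-wise cumulative maximum read in reverse row order, which avoids the loops entirely:
result = pd.DataFrame(df.cummax().to_numpy()[::-1],
                      index=df.index, columns=df.columns)
print(result)   # same A/B table as above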

Pairwise operations in Scikit-Learn and different filtering conditions on each pair

I have the following 2 data frames, say df1
a b c d
0 0 1 2 3
1 4 0 0 7
2 8 9 10 11
3 0 0 0 15
and df2
a b c d
0 5 1 2 3
What I am interested in doing is a pairwise operation on each row in df1 with the single row in df2. However, if a column in a row of df1 is 0, then that column is used in neither the df1 row nor the df2 row when performing the pairwise operation. So each pairwise operation will work on pairs of rows of different length. Let me break down how the 4 comparisons should work.
Comparison 1
0 1 2 3 vs 5 1 2 3
The pairwise operation is done on 1 2 3 vs 1 2 3 as column a has a 0
Comparison 2
4 0 0 7 vs 5 1 2 3 is done on 4 7 vs 5 3 as we have 2 columns that need to be dropped
Comparison 3
8 9 10 11 vs 5 1 2 3 is done on 8 9 10 11 vs 5 1 2 3 as no columns are dropped
Comparison 4
0 0 0 15 vs 5 1 2 3 is done on 15 vs 3 as all but one column is dropped
The result of each pairwise operation is a scalar so the result is some sort of structure whether it be list, array, data frame, whatever with 4 (or the number of rows in df1) values. Also, I should note that values in df2 are irrelevant and no filtering is done based upon the value of any column in df2.
For simplicity, you could try looping over each row in the dataframe and doing something like this:
import pandas as pd
import numpy as np
a = pd.DataFrame(data=[[0,1,2,3],[4,0,0,7],[8,9,10,11],[0,0,0,15]], columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame(data=[[5, 1, 2, 3]], columns=['a', 'b', 'c', 'd'])
# loop over each row in 'a'
for i in range(len(a)):
    # find indices of non-zero elements of the row
    non_zero = np.nonzero(a.iloc[i].to_numpy())[0]
    # perform pair-wise addition between non-zero elements in 'a' and the same elements in 'b'
    print(np.array(a.iloc[i])[non_zero] + np.array(b.iloc[0])[non_zero])
Here I used pair-wise addition but you could replace the addition with an operation of your choosing.
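For instance, to collect one scalar per row with a different operation (a dot product here, purely illustrative), the loop body barely changes:
results = []
for i in range(len(a)):
    non_zero = np.nonzero(a.iloc[i].to_numpy())[0]
    # any scalar-valued pairwise operation can go here
    results.append(np.dot(a.iloc[i].to_numpy()[non_zero],
                          b.iloc[0].to_numpy()[non_zero]))
print(results)   # one scalar per row of 'a'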
Edit:
We may want to vectorize this to avoid the loop if the dataframes are large. Here is an idea for that, where we convert zero values to nan so they are ignored in the row-wise operation:
import pandas as pd
import numpy as np
a = pd.DataFrame(data=[[0,1,2,3],[4,0,0,7],[8,9,10,11],[0,0,0,15]], columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame(data=[[5, 1, 2, 3]], columns=['a', 'b', 'c', 'd'])
# find indices of zeros
zeros = (a==0).values
# set zeros to nan
a[zeros] = np.nan
# tile and reshape 'b' so its the same shape as 'a'
b = pd.DataFrame(np.tile(b, len(a)).reshape(np.shape(a)), columns=b.columns)
# set the zero indices to nan
b[zeros] = np.nan
print('a:')
print(a)
print('b:')
print(b)
# now do some row-wise operation. For example take the sum of each row
print(np.sum(a+b, axis=1))
Output:
a:
a b c d
0 NaN 1.0 2.0 3
1 4.0 NaN NaN 7
2 8.0 9.0 10.0 11
3 NaN NaN NaN 15
b:
a b c d
0 NaN 1.0 2.0 3
1 5.0 NaN NaN 3
2 5.0 1.0 2.0 3
3 NaN NaN NaN 3
sum:
0 12.0
1 19.0
2 49.0
3 18.0
dtype: float64

Pandas, how to dropna values using subset with multiindex dataframe?

I have a data frame with multi-index columns.
From this data frame I need to remove the rows with NaN values in a subset of columns.
I am trying to use the subset option of pd.dropna but I do not manage to find the way to specify the subset of columns. I have tried using pd.IndexSlice but this does not work.
In the example below I need to get rid of the last row.
import pandas as pd
# ---
a = [1, 1, 2, 2, 3, 3]
b = ["a", "b", "a", "b", "a", "b"]
col = pd.MultiIndex.from_arrays([a[:], b[:]])
val = [
    [1, 2, 3, 4, 5, 6],
    [None, None, 1, 2, 3, 4],
    [None, 1, 2, 3, 4, 5],
    [None, None, 5, 3, 3, 2],
    [None, None, None, None, 5, 7],
]
# ---
df = pd.DataFrame(val, columns=col)
# ---
print(df)
# ---
idx = pd.IndexSlice
df.dropna(axis=0, how="all", subset=idx[1:2, :])
# ---
print(df)
Using the thresh option is an alternative but if possible I would like to use subset and how='all'
When dealing with a MultiIndex, each column of the MultiIndex can be specified as a tuple:
In [67]: df.dropna(axis=0, how="all", subset=[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])
Out[67]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
Or, to select all columns whose first level equals 1 or 2 you could use:
In [69]: df.dropna(axis=0, how="all", subset=df.loc[[], [1,2]].columns)
Out[69]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
df[[1,2]].columns also works, but this returns a (possibly large) intermediate DataFrame. df.loc[[], [1,2]].columns is more memory-efficient since its intermediate DataFrame is empty.
If you want to apply the dropna to the columns which have 1 or 2 in the first level, you can do it as follows:
cols= [(c0, c1) for (c0, c1) in df.columns if c0 in [1,2]]
df.dropna(axis=0, how="all", subset=cols)
If applied to your data, it results in:
Out[446]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
As you can see, the last line (index=4) is gone, because all columns under 1 and 2 were NaN in that row. If you would rather remove every row in which any NaN occurs in these columns, you need:
df.dropna(axis=0, how="any", subset=cols)
Which results in:
Out[447]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
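The pd.IndexSlice attempt from the question can also be made to work by using it to select the columns first and handing their labels to subset (a sketch; it assumes the first column level is sorted so the 1:2 slice is valid):
idx = pd.IndexSlice
cols = df.loc[:, idx[1:2, :]].columns    # every column whose first level is 1 or 2
df.dropna(axis=0, how="all", subset=cols)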

Locate rows with 0 value columns and set them to none pandas

Data:
f a b
5 0 1
5 1 3
5 1 3
5 6 3
5 0 0
5 1 5
5 0 0
I know how to locate the rows with both columns being 0; setting them to None, on the other hand, is a mystery.
df_o[(df_o['a'] == 0) & (df_o['b'] == 0)]
# set a and b to None
Expected result:
f a b
5 0 1
5 1 3
5 1 3
5 6 3
5 None None
5 1 5
5 None None
If working with numeric values, None is converted to NaN and integers are cast to float by design:
df_o.loc[(df_o['a'] == 0) & (df_o['b'] == 0), ['a','b']] = None
print (df_o)
a b
0 0.0 1.0
1 1.0 3.0
2 1.0 3.0
3 6.0 3.0
4 NaN NaN
5 1.0 5.0
6 NaN NaN
Another solution uses DataFrame.all with axis=1 to check whether all values in a row are True:
df_o.loc[(df_o[['a', 'b']] == 0).all(axis=1), ['a','b']] = None
print (df_o)
a b
0 0.0 1.0
1 1.0 3.0
2 1.0 3.0
3 6.0 3.0
4 NaN NaN
5 1.0 5.0
6 NaN NaN
Details:
print ((df_o[['a', 'b']] == 0))
a b
0 True False
1 False False
2 False False
3 False False
4 True True
5 False False
6 True True
print ((df_o[['a', 'b']] == 0).all(axis=1))
0 False
1 False
2 False
3 False
4 True
5 False
6 True
dtype: bool
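If the upcast to float is unwanted, a variant of the same idea (my addition, starting again from the original integer data) converts to pandas' nullable Int64 dtype first, so the remaining values stay integers:
df_o[['a', 'b']] = df_o[['a', 'b']].astype('Int64')           # nullable integer dtype
df_o.loc[(df_o[['a', 'b']] == 0).all(axis=1), ['a', 'b']] = pd.NA
print(df_o)
# rows 4 and 6 now show <NA> while the other values remain integers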
One way I could think of is this: create an extra copy of the DataFrame, check both columns against the copy, and set the values to None on the main DataFrame. Not the cleanest solution, but:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['f'] = [5,5,5,5,5,5,5]
df['a'] = [0,1,1,6,0,1,0]
df['b'] = [1,3,3,3,0,5,0]
df1 = df.copy()
df['a'] = np.where((df.a == 0) & (df.b == 0), None, df.a)
df['b'] = np.where((df1.a == 0) & (df1.b == 0), None, df.b)
print(df)
Output:
f a b
0 5 0 1
1 5 1 3
2 5 1 3
3 5 6 3
4 5 None None
5 5 1 5
6 5 None None
df.replace(0, np.nan) -- to get NaNs (possibly more useful)
df.replace(0, 'None') -- what you actually want
It is surely not the most elegant way to do this, but maybe this helps.
import pandas as pd
data = {'a': [0, 1, 1, 6, 0, 1, 0],
        'b': [1, 3, 3, 3, 0, 5, 0]}
df_o = pd.DataFrame.from_dict(data)
df_None = df_o[(df_o['a'] == 0) & (df_o['b'] == 0)]
df_o.loc[df_None.index,:] = None
print(df_o)
Out:
a b
0 0.0 1.0
1 1.0 3.0
2 1.0 3.0
3 6.0 3.0
4 NaN NaN
5 1.0 5.0
6 NaN NaN
This is how I would do it:
import pandas as pd
a = pd.Series([0, 1, 1, 6, 0, 1, 0])
b = pd.Series([1, 3, 3, 3, 0, 5 ,0])
data = pd.DataFrame({'a': a, 'b': b})
v = [[data[i][j] for i in data] == [0, 0] for j in range(len(data['a']))] # spot null rows
a = [None if v[i] else a[i] for i in range(len(a))]
b = [None if v[i] else b[i] for i in range(len(b))]
data = pd.DataFrame({'a': a, 'b': b})
print(data)
Output:
a b
0 0.0 1.0
1 1.0 3.0
2 1.0 3.0
3 6.0 3.0
4 NaN NaN
5 1.0 5.0
6 NaN NaN
