Dataframe sequence detection: Find groups where three rows in a row have negative values - python-3.x

Lets say I have a column df['test']:
-1, -2, -3, 2, -4, 3, -5, -4, -3, -7
So I would like to filter out the groups which have at least three negative values in a row. So
groups = my_grouping_function_by_sequence()
groups[0] = [-1,-2-3]
groups[1] = [-5,-4,-3,-7]
Are there some pre-defined checks on testing for sequences in numerical data for pandas? It does not need to be pandas, but I am searching for a fast and adaptable solution. Any advice would be helpful. Thanks!

Using GroupBy and cumsum to create groups of consecutive negative numbers.
grps = df['test'].gt(0).cumsum()
dfs = [d.dropna() for _, d in df.mask(df['test'].gt(0)).groupby(grps) if d.shape[0] >= 3]
Output
for df in dfs:
print(df)
test
0 -1.0
1 -2.0
2 -3.0
test
6 -5.0
7 -4.0
8 -3.0
9 -7.0
Explanation
Let's go through this step by step:
The first line, creates groups for consecutive negative numbers
print(grps)
0 0
1 0
2 0
3 1
4 1
5 2
6 2
7 2
8 2
9 2
Name: test, dtype: int32
But as we can see, it also includes the positive numbers, which we don't want to consider in our ouput. So we use DataFrame.mask to convert these values to NaN:
df.mask(df['test'].gt(0))
# same as df.mask(df['test'] > 0)
test
0 -1.0
1 -2.0
2 -3.0
3 NaN
4 -4.0
5 NaN
6 -5.0
7 -4.0
8 -3.0
9 -7.0
Then we groupby on this dataframe and only keep the groups which are >= 3 rows:
for _, d in df.mask(df['test'].gt(0)).groupby(grps):
if d.shape[0] >= 3:
print(d.dropna())
test
0 -1.0
1 -2.0
2 -3.0
test
6 -5.0
7 -4.0
8 -3.0
9 -7.0

Too acknowledge #erfan answer elegant but didn't easily understand. My attempt below.
df = pd.DataFrame({'test': [-1, -2, -3, 2, -4, 3, -5, -4, -3, -7]})
Conditionally select rows with negatives
df['j'] = np.where(df['test']<0,1,-1)
df['k']=df['j'].rolling(3, min_periods=1).sum()
df2=df[df['k']==3]
slice Iteratively the dataframe getting 3rd and 2 consecutive rows above
for index, row in df2.iterrows():
print(df.loc[index - 2 : index + 0, 'test'])

#Erfan your answer is brilliant and I'm still trying to understand the second line. Your first line got me started to try to write it in my own, less efficient way.
import pandas as pd
df = pd.DataFrame({'test': [-1, -2, -3, 2, -4, 3, -5, -4, -3, -7]})
df['+ or -'] = df['test'].gt(0)
df['group'] = df['+ or -'].cumsum()
df_gb = df.groupby('group').count().reset_index().drop('+ or -', axis=1)
df_new = pd.merge(df, df_gb, how='left', on='group').drop('+ or -', axis=1)
df_new = df_new[(df_new['test_x'] < 0) & (df_new['test_y'] >=3)].drop('test_y',
axis=1)
for i in df_new['group'].unique():
j = pd.DataFrame(df_new.loc[df_new['group'] == i, 'test_x'])
print(j)

Related

Drop a column in pandas if all values equal 1?

How do I drop columns in pandas where all values in that column are equal to a particular number? For instance, consider this dataframe:
df = pd.DataFrame({'A': [1, 1, 1, 1],
'B': [0, 1, 2, 3],
'C': [1, 1, 1, 1]})
print(df)
Output:
A B C
0 1 0 1
1 1 1 1
2 1 2 1
3 1 3 1
How would I drop the 1 columns so that the output is:
B
0 0
1 1
2 2
3 3
Use DataFrame.loc with test if at least one non 1 value by DataFrame.ne with DataFrame.any:
df1 = df.loc[:, df.ne(1).any()]
Or test for 1 by DataFrame.eq with DataFrame.all for all Trues per columns and inverted mask by ~:
df1 = df.loc[:, ~df.eq(1).all()]
print (df1)
B
0 0
1 1
2 2
3 3
EDIT:
One consideration is what do you want to happen if you have a column with Nan and 1 only?
Then replace NaNs to 0 by DataFrame.fillna and use same solution like before:
df1 = df.loc[:, df.fillna(0).ne(1).any()]
df1 = df.loc[:, ~df.fillna(0).eq(1).all()]
You can use any:
df.loc[:, df.ne(1).any()]
One consideration is what do you want to happen if you have a column with Nan and 1 only?
If you want to drop under this condition also, you will to either fillna with 1 or add or and new condition.
df = pd.DataFrame({'A': [1, 1, 1, 1],
'B': [0, 1, 2, 3],
'C': [1, 1, 1, np.nan]})
print(df)
A B C
0 1 0 1.0
1 1 1 1.0
2 1 2 1.0
3 1 3 NaN
All these leave that column with NaN and 1's.
df.loc[:, df.ne(1).any()]
df.loc[:, ~df.eq(1).all()]
So, you can add this addition to drop that column also.
df.loc[:, ~(df.eq(1) | df.isna()).all()]
Output:
B
0 0
1 1
2 2
3 3

Pandas, how to dropna values using subset with multiindex dataframe?

I have a data frame with multi-index columns.
From this data frame I need to remove the rows with NaN values in a subset of columns.
I am trying to use the subset option of pd.dropna but I do not manage to find the way to specify the subset of columns. I have tried using pd.IndexSlice but this does not work.
In the example below I need to get ride of the last row.
import pandas as pd
# ---
a = [1, 1, 2, 2, 3, 3]
b = ["a", "b", "a", "b", "a", "b"]
col = pd.MultiIndex.from_arrays([a[:], b[:]])
val = [
[1, 2, 3, 4, 5, 6],
[None, None, 1, 2, 3, 4],
[None, 1, 2, 3, 4, 5],
[None, None, 5, 3, 3, 2],
[None, None, None, None, 5, 7],
]
# ---
df = pd.DataFrame(val, columns=col)
# ---
print(df)
# ---
idx = pd.IndexSlice
df.dropna(axis=0, how="all", subset=idx[1:2, :])
# ---
print(df)
Using the thresh option is an alternative but if possible I would like to use subset and how='all'
When dealing with a MultiIndex, each column of the MultiIndex can be specified as a tuple:
In [67]: df.dropna(axis=0, how="all", subset=[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])
Out[67]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
Or, to select all columns whose first level equals 1 or 2 you could use:
In [69]: df.dropna(axis=0, how="all", subset=df.loc[[], [1,2]].columns)
Out[69]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
df[[1,2]].columns also works, but this returns a (possibly large) intermediate DataFrame. df.loc[[], [1,2]].columns is more memory-efficient since its intermediate DataFrame is empty.
If you want to apply the dropna to the columns which have 1 or 2 in level 1, you can do it as follows:
cols= [(c0, c1) for (c0, c1) in df.columns if c0 in [1,2]]
df.dropna(axis=0, how="all", subset=cols)
If applied to your data, it results in:
Out[446]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
As you can see, the last line (index=4) is gone, because all columns below 1 and 2 were NaN for this line. If you rather want all rows to be removed, where any NaN occured in the column, you need:
df.dropna(axis=0, how="any", subset=cols)
Which results in:
Out[447]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6

How to replace selected rows of pandas dataframe with a np array, sequentially?

I have a pandas dataframe
A B C
0 NaN 2 6
1 3.0 4 0
2 NaN 0 4
3 NaN 1 2
where I have a column A that has NaN values in some rows (not necessarily consecutive).
I want to replace these values not with a constant value (which pd.fillna does), but rather with the values from a numpy array.
So the desired outcome is:
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
I'm not sure the .replace method will help here as well, since that seems to replace value <-> value via dictionary. Whereas here I want to sequentially change NaN to its corresponding value (by index) in the np array.
I tried:
MWE:
huh = pd.DataFrame([[np.nan, 2, 6],
[3, 4, 0],
[np.nan, 0, 4],
[np.nan, 1, 2]],
columns=list('ABC'))
huh.A[huh.A.isnull()] = np.array([1,5,7]) # what i want to do, but this gives error
gives the error
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
'''
I read the docs but I can't understand how to do this with .loc.
How do I do this properly, preferably without a for loop?
Other info:
The number of elements in the np array will always match the number of NaN in the dataframe, so your answer does not need to check for this.
You are really close, need DataFrame.loc for avoid chained assignments:
huh.loc[huh.A.isnull(), 'A'] = np.array([1,5,7])
print (huh)
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
zip
This should account for uneven lengths
m = huh.A.isna()
a = np.array([1, 5, 7])
s = pd.Series(dict(zip(huh.index[m], a)))
huh.fillna({'A': s})
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2

Filling in nans for numbers in a column-specific way

Given a DataFrame and a list of indexes, is there an efficient pandas function that put nan value for all values vertically preceeding each of the entries of the list?
For example, suppose we have the list [4,8] and the following DataFrame:
index 0 1
5 1 2
2 9 3
4 3.2 3
8 9 8.7
The desired output is simply:
index 0 1
5 nan nan
2 nan nan
4 3.2 nan
8 9 8.7
Any suggestions for such a function that does this fast?
Here's one NumPy approach based on np.searchsorted -
s = [4,8]
a = df.values
idx = df.index.values
sidx = np.argsort(idx)
matching_row_indx = sidx[np.searchsorted(idx, s, sorter = sidx)]
mask = np.arange(a.shape[0])[:,None] < matching_row_indx
a[mask] = np.nan
Sample run -
In [107]: df
Out[107]:
0 1
index
5 1.0 2.0
2 9.0 3.0
4 3.2 3.0
8 9.0 8.7
In [108]: s = [4,8]
In [109]: a = df.values
...: idx = df.index.values
...: sidx = np.argsort(idx)
...: matching_row_indx = sidx[np.searchsorted(idx, s, sorter = sidx)]
...: mask = np.arange(a.shape[0])[:,None] < matching_row_indx
...: a[mask] = np.nan
...:
In [110]: df
Out[110]:
0 1
index
5 NaN NaN
2 NaN NaN
4 3.2 NaN
8 9.0 8.7
It was a bit tricky to recreate your example but this should do it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'index': [5, 2, 4, 8], 0: [1, 9, 3.2, 9], 1: [2, 3, 3, 8.7]})
df.set_index('index', inplace=True)
for i, item in enumerate([4,8]):
for index, row in df.iterrows():
if index != item:
row[i] = np.nan
else:
break

How to make pandas .mean(0) .std(0) etc. into a list?

I have a df and some of the columns contains numbers and I calculate mean, std, median etc on these columns using df.mean(0)..
How can I put these summary statistics in a list?? One list for mean, one for median etc..
I think you can use Series.tolist, because output of your functions is Series:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
#sum(0), std(0) is same as sum(), std() because 0 is by default
L1 = df.sum().tolist()
L2 = df.std().tolist()
print (L1)
print (L2)
[6, 15, 24, 9, 14, 14]
[1.0, 1.0, 1.0, 2.0, 1.5275252316519465, 2.0816659994661326]

Resources