Compare current value with n values above and below on Pandas DataFrame - python-3.x

I have this df:
x
0 2
1 2
2 2
3 1
4 1
5 2
6 2
I need to compare the current value in column x with the n previous and n next values using a defined condition. If the condition is met at least q times, add 1 in a new column; if not, add 0.
For instance, take n = 2, q = 3 and the condition current_value <= value / 2. The code will then do 7 comparisons:
1st comparison: compare current_value = 2 to the previous n = 2 numbers (there are none, because this is the first value in the column) and then to the next n = 2 values (both are 2, so the condition 2 <= 2/2 is not met for either). No conditions are met; since 0 < q = 3, the code adds 0 to the new column.
2nd comparison: compare current_value = 2 to the previous n = 2 numbers (there is just one number above, and the condition 2 <= 2/2 is not met) and then to the next n = 2 values (a 2 and then a 1, so the condition is not met: neither 2 <= 2/2 nor 2 <= 1/2 holds). No conditions are met; since 0 < q = 3, the code adds 0 to the new column.
3rd comparison: no conditions are met; since 0 < q = 3, the code adds 0 to the new column.
4th comparison: compare current_value = 1 to the previous n = 2 numbers (both are 2, and the condition 1 <= 2/2 is met for both) and then to the next n = 2 values (a 1 and then a 2, so the condition is met once: 1 <= 1/2 fails, 1 <= 2/2 holds). 3 conditions are met; since 3 >= q = 3, the code adds 1 to the new column.
5th comparison: 3 conditions are met; since 3 >= q = 3, the code adds 1 to the new column.
6th comparison: no conditions are met; since 0 < q = 3, the code adds 0 to the new column.
7th comparison: no conditions are met; since 0 < q = 3, the code adds 0 to the new column.
Desired result:
x comparison
0 2 0
1 2 0
2 2 0
3 1 1
4 1 1
5 2 0
6 2 0
I was thinking of using something like the shift function, but I'm not sure how to implement it. Any help?

I suggest using numpy here, to benefit from its sliding window view:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view as swv
n = 2
q = 3
# convert to numpy array
a = df['x'].astype(float).to_numpy()
# create a sliding window
# remove central value, divide by 2
# compare to original value
# count number of matches
count = (a[:,None] <= swv(np.pad(a, n, constant_values=np.nan), 2*n+1)[:, np.r_[:n,n+1:2*n+1]]/2).sum(1)
# array([0, 0, 0, 3, 3, 0, 0])
# compare number of matches to q
df['comparison'] = (count >= q).astype(int)
print(df)
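The one-liner above packs several steps together. As a rough sketch, the same computation can be unpacked into named intermediates; the names padded, windows, neighbours and flags are only illustrative and do not appear in the answer, and the sample frame is rebuilt from the question:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view as swv

# Sample data from the question
df = pd.DataFrame({"x": [2, 2, 2, 1, 1, 2, 2]})
n, q = 2, 3

a = df["x"].astype(float).to_numpy()

# Pad both ends with NaN so that every row gets a full window of 2*n+1 values
padded = np.pad(a, n, constant_values=np.nan)
windows = swv(padded, 2 * n + 1)          # shape: (len(a), 2*n+1)

# Keep the n columns before and the n columns after the central one
neighbours = windows[:, np.r_[:n, n + 1:2 * n + 1]]

# Comparisons against NaN are False, so padded positions never count
count = (a[:, None] <= neighbours / 2).sum(axis=1)
flags = (count >= q).astype(int)
print(count, flags)
```

Padding with NaN is what makes the edge rows behave like the question describes: missing neighbours simply never satisfy the condition.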
An alternative with only pandas requires computing two rolling windows (forward and backward), as it's not trivial to access the current index in a centered rolling window with min_periods=1:
n = 2
q = 3
s1 = df['x'].rolling(n+1, min_periods=2).apply(lambda x: sum(x.iloc[-1]<=x.iloc[:-1]/2))
s2 = df.loc[::-1, 'x'].rolling(n+1, min_periods=2).apply(lambda x: sum(x.iloc[-1]<=x.iloc[:-1]/2))
df['comparison'] = s1.add(s2, fill_value=0).ge(q).astype(int)
Output:
x comparison
0 2 0
1 2 0
2 2 0
3 1 1
4 1 1
5 2 0
6 2 0
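To cross-check either approach on small inputs, a plain-Python reference can help. This is only a sketch of the rule as stated in the question; the function name flag_neighbours is made up for illustration:

```python
import pandas as pd

def flag_neighbours(values, n, q):
    """Return 1 where at least q of the n previous and n next values v
    satisfy current <= v / 2, else 0 (hypothetical helper)."""
    out = []
    for i, cur in enumerate(values):
        before = values[max(0, i - n):i]      # up to n values above
        after = values[i + 1:i + 1 + n]       # up to n values below
        hits = sum(cur <= v / 2 for v in before + after)
        out.append(int(hits >= q))
    return out

df = pd.DataFrame({"x": [2, 2, 2, 1, 1, 2, 2]})
df["comparison"] = flag_neighbours(df["x"].tolist(), n=2, q=3)
print(df)
```

It is slower than the vectorized versions, but it is easy to read and convenient for testing them.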

Related

groupby and trim some rows based on condition

I have a data frame something like this:
df = pd.DataFrame({"ID":[1,1,2,2,2,3,3,3,3,3],
"IF_car":[1,0,0,1,0,0,0,1,0,1],
"IF_car_history":[0,0,0,1,0,0,0,1,0,1],
"observation":[0,0,0,1,0,0,0,2,0,3]})
I want to group by ID and trim the rows of each group based on the condition "IF_car_history" == 1. This is what I tried:
tried_df = df.groupby(['ID']).apply(lambda x: x.loc[:(x['IF_car_history'] == '1').idxmax(),:]).reset_index(drop = True)
Within each group, I want to drop the rows that come after "IF_car_history" == 1.
expected output:
Thanks
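First build the mask m by comparing with Series.eq, reversed by slicing with [::-1]. Then use GroupBy.cumsum per ID: because the order is reversed, rows after the last 1 in a group still have a running sum of 0, so comparing with ne(0) keeps only the rows up to and including the last 1. Finally restore the original order with [::-1] and filter by boolean indexing.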
m = df['IF_car_history'].eq(1).iloc[::-1]
df1 = df[m.groupby(df['ID']).cumsum().ne(0).iloc[::-1]]
print (df1)
ID IF_car IF_car_history observation
2 2 0 0 0
3 2 1 1 1
5 3 0 0 0
6 3 0 0 0
7 3 1 1 2
8 3 0 0 0
9 3 1 1 3
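For readers who want to run it end to end, here is a self-contained sketch of the same reversed-cumsum idea on the question's data. The names is_one, ones_remaining and trimmed are illustrative only, and the explicit astype(int) is an added precaution rather than part of the answer:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                   "IF_car": [1, 0, 0, 1, 0, 0, 0, 1, 0, 1],
                   "IF_car_history": [0, 0, 0, 1, 0, 0, 0, 1, 0, 1],
                   "observation": [0, 0, 0, 1, 0, 0, 0, 2, 0, 3]})

is_one = df["IF_car_history"].eq(1).astype(int)

# Number of 1s at or after each row within its ID group:
# reverse, take a per-group running sum, then restore the original order
ones_remaining = is_one.iloc[::-1].groupby(df["ID"]).cumsum().iloc[::-1]

# Rows with no 1 left at or after them are the ones to trim
trimmed = df[ones_remaining.gt(0)]
print(trimmed)
```

Note that a group with no 1 at all (ID 1 here) is dropped entirely, which matches the output shown above.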

Pandas remove group if difference between first and last row in group exceeds value

I have a dataframe df:
df = pd.DataFrame({})
df['X'] = [3,8,11,6,7,8]
df['name'] = [1,1,1,2,2,2]
X name
0 3 1
1 8 1
2 11 1
3 6 2
4 7 2
5 8 2
For each group within 'name', I want to remove that group if the absolute difference between the first and last row of the group is smaller than a specified value d_dif.
For example, when d_dif= 5, I want to get:
X name
0 3 1
1 8 1
2 11 1
If X is increasing within each group, you can use groupby().transform() with np.ptp (the max minus the min):
import numpy as np
threshold = 5
ranges = df.groupby('name')['X'].transform(np.ptp)
df[ranges > threshold]
If you only care about first and last, then transform just first and last:
threshold = 5
groups = df.groupby('name')['X']
ranges = groups.transform('last') - groups.transform('first')
df[ranges.abs() > threshold]
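Another way to express the same first/last rule is GroupBy.filter, which keeps or drops whole groups. The sketch below assumes the question's wording (remove a group when the gap is smaller than d_dif, so a gap equal to d_dif is kept), which differs slightly from the strict > used above; first_last_gap is a hypothetical helper name:

```python
import pandas as pd

df = pd.DataFrame({"X": [3, 8, 11, 6, 7, 8], "name": [1, 1, 1, 2, 2, 2]})
d_dif = 5

def first_last_gap(s):
    """Absolute difference between the last and first value of a Series."""
    return abs(s.iloc[-1] - s.iloc[0])

# Keep only the groups whose first/last gap is at least d_dif
kept = df.groupby("name").filter(lambda g: bool(first_last_gap(g["X"]) >= d_dif))
print(kept)
```

Here group 1 has a gap of 8 and is kept, while group 2 has a gap of 2 and is removed.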

How to recognize [1,X,X,X,1] repeating pattern in pandas Series

I have a boolean column in a csv file for example:
1 1
2 0
3 0
4 0
5 1
6 1
7 1
8 0
9 0
10 1
11 0
12 0
13 1
14 0
15 1
You can see here that 1 is repeating every 5 lines.
I want to recognize this repeating pattern [1,0,0,0] in Python as soon as it has repeated more than 10 times (I have ~20,000 rows per file).
The pattern can start at any position
How could I manage this in python avoiding if .....
import numpy as np
import pandas as pd
# Generate 20000 random 0s and 1s
data = pd.Series(np.random.randint(0, 2, 20000))
# Keep indices of 1s
idx = data[data > 0].index
# Check whether the distance from the current index to the next index is 4,
# e.g. if positions 2 and 6 both hold a 1, then 6 - 2 = 4
found = []
for i, v in enumerate(idx):
    if i == len(idx) - 1:
        break
    next_value = idx[i + 1]
    if (next_value - v) == 4:
        found.append(v)
print(found)
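The loop above finds single gaps of 4 but does not check how many times in a row the pattern repeats. A possible vectorized sketch for that part follows; it assumes "repetition" means consecutive gaps of exactly period rows between neighbouring 1s, and the function name periodic_positions and its parameters are invented for illustration:

```python
import numpy as np
import pandas as pd

def periodic_positions(data, period=4, min_repeats=10):
    """Positions of 1s that start a gap of `period` rows inside a run of
    at least `min_repeats` consecutive such gaps (hypothetical helper)."""
    pos = np.flatnonzero(data.to_numpy() > 0)
    if len(pos) < 2:
        return pos[:0]
    ok = pd.Series(np.diff(pos) == period)       # is each gap exactly `period`?
    run_id = ok.ne(ok.shift()).cumsum()          # label runs of equal values
    run_len = ok.groupby(run_id).transform("sum")
    return pos[:-1][(ok & (run_len >= min_repeats)).to_numpy()]

# Small deterministic example: noise at rows 0 and 1,
# then twelve 1s spaced 4 rows apart starting at row 3
values = np.zeros(60, dtype=int)
values[[0, 1]] = 1
values[3:3 + 4 * 12:4] = 1
print(periodic_positions(pd.Series(values)))
```

Labelling runs with a shifted comparison and a cumulative sum avoids the explicit if chain the question wanted to steer clear of.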

How to use nested while loops

I'm trying to make a function that uses a nested while loop that prints something like this.
ranges(5,2)
5
0 1 2 3 4
4
0 1 2 3
3
0 1 2
2
0 1
my code that i have so far looks like this
def ranges(high,low):
    while high >= low:
        print(high)
        high = high - 1
        y = 0
        x = high
        while x > y:
            print (y, end = " ")
            y = y + 1
The output is like this
5
0 1 2 3 4
0 1 2 3
0 1 2
0
I'm pretty sure I messed up in calling the nested while loop, because when I split up the code to just print 5,...,2 in a column it works, and so does the code for printing the numbers in a row. Any help would be cool.
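Add print("") right after the inner while loop so each row ends with a newline, and change the inner loop's condition to >=. Because high has already been decremented when the inner loop runs, x > y stops one number short: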
def ranges(high,low):
    while high >= low:
        print(high)
        high = high - 1
        y = 0
        x = high
        while x >= y: # <-- change the condition, otherwise you'll miss the last number in every line
            print (y, end = " ")
            y = y + 1
        print("") # <-- this
ranges(5, 2)
OUTPUT
5
0 1 2 3 4
4
0 1 2 3
3
0 1 2
2
0 1
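A variation that can be easier to reason about is to build each line as a string and decrement high only at the end of the outer loop, so the inner loop can count up to the value that was just printed. This is just a sketch; range_lines is a hypothetical name and returning a list instead of printing directly is a choice made here to make the result easy to check:

```python
def range_lines(high, low):
    """Build the lines of the expected output using nested while loops."""
    lines = []
    while high >= low:
        lines.append(str(high))
        row = []
        y = 0
        while y < high:            # 0 .. high-1, before high is decremented
            row.append(str(y))
            y += 1
        lines.append(" ".join(row))
        high -= 1                  # decrement last, after the row is built
    return lines

print("\n".join(range_lines(5, 2)))
```

Joining with " " also avoids the trailing space that print(y, end = " ") leaves at the end of each row.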

Pandas Flag Rows with Complementary Zeros

Given the following data frame:
import pandas as pd
df=pd.DataFrame({'A':[0,4,4,4],
'B':[0,4,4,0],
'C':[0,4,4,4],
'D':[4,0,0,4],
'E':[4,0,0,0],
'Name':['a','a','b','c']})
df
A B C D E Name
0 0 0 0 4 4 a
1 4 4 4 0 0 a
2 4 4 4 0 0 b
3 4 0 4 4 0 c
I'd like to add a new field called "Match_Flag" which labels unique combinations of rows if they have complementary zero patterns (as with rows 0, 1, and 2) AND have the same name (just for rows 0 and 1). It uses the name of the rows that match.
The desired result is as follows:
A B C D E Name Match_Flag
0 0 0 0 4 4 a a
1 4 4 4 0 0 a a
2 4 4 4 0 0 b NaN
3 4 0 4 4 0 c NaN
Caveat:
The patterns may vary, but should still be complementary.
Thanks in advance!
UPDATE
Sorry for the confusion.
Here is some clarification:
The reason why rows 0 and 1 are "complementary" is that they have opposite patterns of zeros in their columns; 0,0,0,4,4 vs, 4,4,4,0,0.
The number 4 is arbitrary; it could just as easily be 0,0,0,4,2 and 65,770,23,0,0. So if 2 such rows are indeed complementary and they have the same name, I'd like for them to be flagged with that same name under the "Match_Flag" column.
You can identify a complement if its element-wise product is zero everywhere and its element-wise sum is nowhere zero.
import numpy as np
def complements(df):
    v = df.drop('Name', axis=1).values
    n = v.shape[0]
    row, col = np.triu_indices(n, 1)
    # ensure two rows are complete
    # their sum contains no zeros
    c = ((v[row] + v[col]) != 0).all(1)
    complete = set(row[c]).union(col[c])
    # ensure two rows do not overlap
    # their product is zero everywhere
    o = (v[row] * v[col] == 0).all(1)
    non_overlap = set(row[o]).union(col[o])
    # we are a complement iff we do
    # not overlap and we are complete
    complement = list(non_overlap.intersection(complete))
    # return slice
    return df.Name.iloc[complement]
Then groupby('Name') and apply our function
df['Match_Flag'] = df.groupby('Name', group_keys=False).apply(complements)
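The complement test itself can also be written as a check on a single pair of rows, which may be easier to follow than the index arithmetic. The sketch below is an alternative, not the answer's method: it treats two rows as complementary when, in every column, exactly one of them is zero, and it loops over pairs within each name. The names is_complementary, value_cols and flags are illustrative:

```python
import itertools
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 4, 4, 4],
                   'B': [0, 4, 4, 0],
                   'C': [0, 4, 4, 4],
                   'D': [4, 0, 0, 4],
                   'E': [4, 0, 0, 0],
                   'Name': ['a', 'a', 'b', 'c']})

def is_complementary(a, b):
    """True when, in every column, exactly one of the two rows is zero."""
    return bool(np.all((a == 0) != (b == 0)))

value_cols = ['A', 'B', 'C', 'D', 'E']
flags = pd.Series(np.nan, index=df.index, dtype=object)

# Compare every pair of rows that share the same Name
for name, group in df.groupby('Name'):
    for i, j in itertools.combinations(group.index, 2):
        if is_complementary(df.loc[i, value_cols].to_numpy(),
                            df.loc[j, value_cols].to_numpy()):
            flags.loc[[i, j]] = name

df['Match_Flag'] = flags
print(df)
```

The pairwise loop is quadratic in the group size, so it is only meant for small groups or for checking results.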
