Pandas Flag Rows with Complementary Zeros - python-3.x

Given the following data frame:
import pandas as pd
df=pd.DataFrame({'A':[0,4,4,4],
'B':[0,4,4,0],
'C':[0,4,4,4],
'D':[4,0,0,4],
'E':[4,0,0,0],
'Name':['a','a','b','c']})
df
A B C D E Name
0 0 0 0 4 4 a
1 4 4 4 0 0 a
2 4 4 4 0 0 b
3 4 0 4 4 0 c
I'd like to add a new field called "Match_Flag" which labels unique combinations of rows if they have complementary zero patterns (as with rows 0, 1, and 2) AND have the same name (just for rows 0 and 1). It uses the name of the rows that match.
The desired result is as follows:
A B C D E Name Match_Flag
0 0 0 0 4 4 a a
1 4 4 4 0 0 a a
2 4 4 4 0 0 b NaN
3 4 0 4 4 0 c NaN
Caveat:
The patterns may vary, but should still be complementary.
Thanks in advance!
UPDATE
Sorry for the confusion.
Here is some clarification:
Rows 0 and 1 are "complementary" because they have opposite patterns of zeros in their columns: 0,0,0,4,4 vs. 4,4,4,0,0.
The number 4 is arbitrary; it could just as easily be 0,0,0,4,2 and 65,770,23,0,0. So if 2 such rows are indeed complementary and they have the same name, I'd like for them to be flagged with that same name under the "Match_Flag" column.

Two rows are complements if their element-wise product is zero everywhere (so, for non-negative values, their dot product is zero) and their element-wise sum is nowhere zero.
import numpy as np
def complements(df):
    v = df.drop('Name', axis=1).values
    n = v.shape[0]
    # all pairs of rows (i < j)
    row, col = np.triu_indices(n, 1)
    # a pair is complete if its sum contains no zeros
    c = ((v[row] + v[col]) != 0).all(1)
    # a pair does not overlap if its product is zero everywhere
    o = (v[row] * v[col] == 0).all(1)
    # a pair is complementary iff it is complete and does not
    # overlap; test both on the same pair, not on separate pairs
    pair = c & o
    complement = sorted(set(row[pair]).union(col[pair]))
    # return slice
    return df.Name.iloc[complement]
Then groupby('Name') and apply the function:
df['Match_Flag'] = df.groupby('Name', group_keys=False).apply(complements)
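The pairwise test can also be written out with explicit loops, which is slower but easier to check by hand and does not rely on how groupby.apply passes the grouping column. This is only a minimal sketch: the helper names is_complementary and match_flags are hypothetical, and the "complete" test uses "either value is non-zero" instead of the sum so that negative values cannot cancel out.

```python
import numpy as np
import pandas as pd

def is_complementary(u, v):
    """True if u and v never share a non-zero column and together fill every column."""
    u = np.asarray(u)
    v = np.asarray(v)
    no_overlap = (u * v == 0).all()
    complete = ((u != 0) | (v != 0)).all()
    return bool(no_overlap and complete)

def match_flags(df, name_col="Name"):
    """Return a Series holding the name for rows that have a complementary partner with the same name."""
    flags = pd.Series(np.nan, index=df.index, dtype=object)
    value_cols = [c for c in df.columns if c != name_col]
    for name, group in df.groupby(name_col):
        vals = group[value_cols].to_numpy()
        # test every pair of rows inside the group
        for i in range(len(vals)):
            for j in range(i + 1, len(vals)):
                if is_complementary(vals[i], vals[j]):
                    flags.loc[group.index[[i, j]]] = name
    return flags

df = pd.DataFrame({'A': [0, 4, 4, 4],
                   'B': [0, 4, 4, 0],
                   'C': [0, 4, 4, 4],
                   'D': [4, 0, 0, 4],
                   'E': [4, 0, 0, 0],
                   'Name': ['a', 'a', 'b', 'c']})
df["Match_Flag"] = match_flags(df)
print(df)
```

With the sample frame, rows 0 and 1 are flagged with 'a' and rows 2 and 3 stay NaN.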

Related

Compare current value with n values above and below on Pandas DataFrame

I have this df:
x
0 2
1 2
2 2
3 1
4 1
5 2
6 2
I need to compare the current value in column x with the n previous and n next values using a defined condition. If the condition is met at least q times, add 1 to a new column; otherwise add 0.
For instance, let n be 2, q be 3, and the condition be current_value <= value / 2. The code then does 7 comparisons, one per row:
1st comparison: compare current_value = 2 to the previous n = 2 values (there are none, because this is the first value in the column) and then to the next n = 2 values (both are 2, so the condition 2 <= 2/2 is met for neither). No conditions are met; since 0 < q = 3, the code adds 0 to the new column.
2nd comparison: compare current_value = 2 to the previous n = 2 values (there is just one value above, and the condition 2 <= 2/2 is not met) and then to the next n = 2 values (a 2 and then a 1, so the condition is not met: neither 2 <= 2/2 nor 2 <= 1/2 holds). No conditions are met; since 0 < q = 3, the code adds 0 to the new column.
3rd comparison: no conditions are met; since 0 < q = 3, the code adds 0 to the new column.
4th comparison: compare current_value = 1 to the previous n = 2 values (there are two 2s above, and the condition 1 <= 2/2 is met for both) and then to the next n = 2 values (a 1 and then a 2, so the condition is met once: 1 <= 2/2 holds, 1 <= 1/2 does not). 3 conditions are met; since 3 >= q = 3, the code adds 1 to the new column.
5th comparison: 3 conditions are met; since 3 >= q = 3, the code adds 1 to the new column.
6th comparison: no conditions are met; since 0 < q = 3, the code adds 0 to the new column.
7th comparison: no conditions are met; since 0 < q = 3, the code adds 0 to the new column.
Desired result:
x comparison
0 2 0
1 2 0
2 2 0
3 1 1
4 1 1
5 2 0
6 2 0
I was thinking of using something like the shift function, but I'm not sure how to implement it. Any help?
I suggest using numpy here to benefit from its sliding window view:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view as swv
n = 2
q = 3
# convert to numpy array
a = df['x'].astype(float).to_numpy()
# create a sliding window
# remove central value, divide by 2
# compare to original value
# count number of matches
count = (a[:,None] <= swv(np.pad(a, n, constant_values=np.nan), 2*n+1)[:, np.r_[:n,n+1:2*n+1]]/2).sum(1)
# array([0, 0, 0, 3, 3, 0, 0])
# compare number of matches to q
df['comparison'] = (count >= q).astype(int)
print(df)
An alternative with only pandas would require computing two rolling windows (forward and backward), as it's not trivial to access the current index in a centered rolling window with min_periods=1:
n = 2
q = 3
s1 = df['x'].rolling(n+1, min_periods=2).apply(lambda x: sum(x.iloc[-1]<=x.iloc[:-1]/2))
s2 = df.loc[::-1, 'x'].rolling(n+1, min_periods=2).apply(lambda x: sum(x.iloc[-1]<=x.iloc[:-1]/2))
df['comparison'] = s1.add(s2, fill_value=0).ge(q).astype(int)
Output:
x comparison
0 2 0
1 2 0
2 2 0
3 1 1
4 1 1
5 2 0
6 2 0
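To cross-check either vectorized version, the rule can be spelled out as a plain loop. This is a minimal reference sketch rather than a fast implementation; the function name flag_by_neighbours is hypothetical, and it assumes the condition is current_value <= value / 2 as in the question.

```python
import pandas as pd

def flag_by_neighbours(values, n, q):
    """Return 1 where at least q of the n previous and n next values satisfy cur <= value / 2."""
    out = []
    for i, cur in enumerate(values):
        # up to n values before and up to n values after position i
        neighbours = values[max(0, i - n):i] + values[i + 1:i + 1 + n]
        hits = sum(cur <= v / 2 for v in neighbours)
        out.append(int(hits >= q))
    return out

df = pd.DataFrame({"x": [2, 2, 2, 1, 1, 2, 2]})
df["comparison"] = flag_by_neighbours(df["x"].tolist(), n=2, q=3)
print(df)
```

For the sample column this gives 0, 0, 0, 1, 1, 0, 0, matching the desired result.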

groupby and trim some rows based on condition

I have a data frame something like this:
df = pd.DataFrame({"ID":[1,1,2,2,2,3,3,3,3,3],
"IF_car":[1,0,0,1,0,0,0,1,0,1],
"IF_car_history":[0,0,0,1,0,0,0,1,0,1],
"observation":[0,0,0,1,0,0,0,2,0,3]})
I want to trim rows within each ID group based on the condition "IF_car_history" == 1. This is what I tried:
tried_df = df.groupby(['ID']).apply(lambda x: x.loc[:(x['IF_car_history'] == '1').idxmax(),:]).reset_index(drop = True)
Within each group, I want to drop the rows that come after the last row where ['IF_car_history'] == 1.
expected output:
Thanks
First build the mask m by comparing values with Series.eq, then use GroupBy.cumsum and keep the rows where the cumulative sum is not 0, filtering by boolean indexing. Because the rows to remove are the ones after the last 1, the order is reversed with [::-1] before the cumulative sum and reversed back afterwards.
m = df['IF_car_history'].eq(1).iloc[::-1]
df1 = df[m.groupby(df['ID']).cumsum().ne(0).iloc[::-1]]
print (df1)
ID IF_car IF_car_history observation
2 2 0 0 0
3 2 1 1 1
5 3 0 0 0
6 3 0 0 0
7 3 1 1 2
8 3 0 0 0
9 3 1 1 3
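The same idea can be broken into named steps to make the intermediate values visible. This is a sketch under the assumption that IF_car_history holds integers; the variable names ones_from_here and trimmed are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                   "IF_car": [1, 0, 0, 1, 0, 0, 0, 1, 0, 1],
                   "IF_car_history": [0, 0, 0, 1, 0, 0, 0, 1, 0, 1],
                   "observation": [0, 0, 0, 1, 0, 0, 0, 2, 0, 3]})

# 1 where the history flag is set, 0 elsewhere
is_one = df["IF_car_history"].eq(1).astype(int)

# walk each ID group from the bottom up: number of 1s at or below each row
ones_from_here = is_one[::-1].groupby(df["ID"]).cumsum()[::-1]
print(ones_from_here.tolist())

# keep a row only if a 1 still occurs at or after it within its group
trimmed = df[ones_from_here.gt(0)]
print(trimmed)
```

A row with a count of 0 has no 1 at or after it in its group, so it is dropped; groups without any 1 (ID 1 here) disappear entirely.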

How to replace the values of 1's and 0's of various column into a single column of a data frame?

The 0's and 1's need to be mapped to their appropriate column headers in Python.
How can I achieve this and get the column final_list?
If there is always exactly one 1 per row, use DataFrame.dot:
df = pd.DataFrame({'a':[0,1,0],
'b':[1,0,0],
'c':[0,0,1]})
df['Final'] = df.dot(df.columns)
print (df)
a b c Final
0 0 1 0 b
1 1 0 0 a
2 0 0 1 c
If a row may contain multiple 1s, append a separator to the column names and then strip the trailing separator from the output Series with Series.str.rstrip:
df = pd.DataFrame({'a':[0,1,0],
'b':[1,1,0],
'c':[1,1,1]})
df['Final'] = df.dot(df.columns + ',').str.rstrip(',')
print (df)
a b c Final
0 0 1 1 b,c
1 1 1 1 a,b,c
2 0 0 1 c
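The dot trick works because multiplying a string by 0 gives an empty string and multiplying it by 1 gives the string itself, so the row-wise sum concatenates the names of the columns that hold a 1. A more explicit way to get the same labels is sketched below; the helper labels_for_row is hypothetical and assumes the columns contain only 0 and 1.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 0],
                   'b': [1, 1, 0],
                   'c': [1, 1, 1]})

# the mechanism behind dot: int * str repeats the string
print(repr(0 * 'a,'), repr(1 * 'b,'))

def labels_for_row(row, sep=','):
    """Join the names of the columns whose value is 1."""
    return sep.join(col for col, flag in row.items() if flag == 1)

final = df.apply(labels_for_row, axis=1)
print(final.tolist())
```

The apply version is slower than dot on large frames, but it needs no trailing-separator cleanup.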

Python Pandas: copy several columns at specific row from one dataframe to another with different names

I have dataframe1 with columns a, b, c, d and 5 rows.
I also have another dataframe2 with columns e, f, g, h.
Let's say I want to copy columns a, b in row 3 from dataframe1 to columns f, g in row 3 of dataframe2.
I tried to use this code:
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']]
The result was NaN in dataframe2.
Any ideas how I can solve it?
One idea is to convert to a numpy array to avoid aligning the data by column names (the right-hand side is labelled a, b, which does not match f, g, hence the NaN):
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']].values
Sample:
dataframe1 = pd.DataFrame({'a':list('abcdef'),
'b':[4,5,4,5,5,4],
'c':[7,8,9,4,2,3]})
print (dataframe1)
a b c
0 a 4 7
1 b 5 8
2 c 4 9
3 d 5 4
4 e 5 2
5 f 4 3
dataframe2 = pd.DataFrame({'f':list('HIJK'),
'g':[0,0,7,1],
'h':[0,1,0,1]})
print (dataframe2)
f g h
0 H 0 0
1 I 0 1
2 J 7 0
3 K 1 1
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']].values
print (dataframe2)
f g h
0 H 0 0
1 I 0 1
2 J 7 0
3 d 5 1
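If the source and target column names are known as pairs, the values can also be copied one scalar at a time, which sidesteps label alignment and keeps each target column's dtype. This is a minimal sketch; the mapping dict is a hypothetical way to state which source column feeds which target column.

```python
import pandas as pd

dataframe1 = pd.DataFrame({'a': list('abcdef'),
                           'b': [4, 5, 4, 5, 5, 4],
                           'c': [7, 8, 9, 4, 2, 3]})
dataframe2 = pd.DataFrame({'f': list('HIJK'),
                           'g': [0, 0, 7, 1],
                           'h': [0, 1, 0, 1]})

# hypothetical source -> target column mapping
mapping = {'a': 'f', 'b': 'g'}

for src, dst in mapping.items():
    # .at reads and writes a single cell, so no alignment by column name takes place
    dataframe2.at[3, dst] = dataframe1.at[3, src]

print(dataframe2)
```

For copying many rows at once, the .values approach above is likely the better fit; the loop is mainly useful for a handful of cells.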

Index Value of Last Matching Row Python Panda DataFrame

I have a dataframe with a value of either 0 or 1 in "column 2" and either 0 or 1 in "column 1". I would like to find, and append as a new column, the index value of the last row where column 1 = 1, but only for rows where column 2 = 1. This might be easier to see than read:
d = {'C1' : pd.Series([1, 0, 1,0,0], index=[1,2,3,4,5]),'C2' : pd.Series([0, 0,0,1,1], index=[1,2,3,4,5])}
df = pd.DataFrame(d)
print(df)
C1 C2
1 1 0
2 0 0
3 1 0
4 0 1
5 0 1
#I've left out my attempts as they don't even get close
df['C3'] = IF C2 = 1: Call Function that gives Index Value of last place where C1 = 1 Else 0 End
This would result in this result set:
C1 C2 C3
1 1 0 0
2 0 0 0
3 1 0 0
4 0 1 3
5 0 1 3
I was trying to get a function to do this as there are roughly 2million rows in my data set but only ~10k where C2 =1.
Thank you in advance for any help, I really appreciate it - I only started
programming with python a few weeks ago.
It is not entirely straightforward and takes a few steps to get this result. The key is forward filling with ffill (the same operation as the older fillna(method='ffill')), which propagates the last valid value downwards.
Pandas methods often do more than one thing, which can make it hard to work out which method to use for what.
So let me talk you through this code.
First we need to set C3 to NaN, otherwise there is nothing to forward fill later.
Then we set C3 to the index value, but only where C1 == 1 (the mask does this).
After this we use ffill to propagate the last observation forwards.
Then we mask away all the values where C2 == 0, the same way we set the index earlier, with a mask.
import numpy as np
df['C3'] = np.nan
mask = df['C1'] == 1
df.loc[mask, 'C3'] = df.index[mask].copy()
df['C3'] = df['C3'].ffill()
mask = df['C2'] == 0
df.loc[mask, 'C3'] = 0
df['C3'] = df['C3'].astype(int)  # whole numbers, as in the output below
df
C1 C2 C3
1 1 0 0
2 0 0 0
3 1 0 0
4 0 1 3
5 0 1 3
EDIT:
Added a .copy() to the index, otherwise we overwrite it and the index gets all full of zeroes.
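The same steps can also be chained with Series.where, which avoids assigning into C3 several times. This is a sketch, not the original answer's code; it assumes that rows with C2 == 1 and no earlier C1 == 1 should get 0, which the question does not specify.

```python
import pandas as pd

df = pd.DataFrame({'C1': [1, 0, 1, 0, 0],
                   'C2': [0, 0, 0, 1, 1]},
                  index=[1, 2, 3, 4, 5])

# index value where C1 == 1, NaN elsewhere, then carried forward
last_c1 = df.index.to_series().where(df['C1'].eq(1)).ffill()

# keep it only where C2 == 1; everything else (and any leftover NaN) becomes 0
df['C3'] = last_c1.where(df['C2'].eq(1), 0).fillna(0).astype(int)
print(df)
```

On the sample data C3 comes out as 0, 0, 0, 3, 3. Everything here is vectorized, so it should scale to a couple of million rows without a Python-level loop.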
