Index Value of Last Matching Row Python Pandas DataFrame - python-3.x

I have a dataframe with a value of either 0 or 1 in column "C2", and either 0 or 1 in column "C1". I would like to append a column holding the index value of the last row where C1 == 1, but only for rows where C2 == 1. This might be easier to see than read:
d = {'C1' : pd.Series([1, 0, 1,0,0], index=[1,2,3,4,5]),'C2' : pd.Series([0, 0,0,1,1], index=[1,2,3,4,5])}
df = pd.DataFrame(d)
print(df)
   C1  C2
1   1   0
2   0   0
3   1   0
4   0   1
5   0   1
# I've left out my attempts as they don't even get close
df['C3'] = IF C2 == 1 THEN (index of the last row so far where C1 == 1) ELSE 0   # pseudocode
This would produce the following result:
   C1  C2  C3
1   1   0   0
2   0   0   0
3   1   0   0
4   0   1   3
5   0   1   3
I was trying to write a function to do this, as there are roughly 2 million rows in my data set but only ~10k where C2 == 1.
Thank you in advance for any help, I really appreciate it - I only started programming with Python a few weeks ago.

It is not so straightforward; it takes a few steps to get this result. The key here is forward filling, which ffill (fillna's forward-fill mode) can do.
Pandas methods often do more than one thing, which makes it very hard to figure out which method to use for what.
So let me talk you through this code.
First we need to set C3 to NaN, otherwise we cannot forward-fill later.
Then we set C3 to the index value, but only where C1 == 1 (the mask does this).
After this we can forward-fill to propagate the last observation forwards.
Then we have to mask away all the values where C2 == 0, the same way we set the index earlier, with a mask.
import numpy as np

df['C3'] = np.nan                          # start from NaN so forward fill works (pd.np is removed in modern pandas)
mask = df['C1'] == 1
df.loc[mask, 'C3'] = df.index[mask]        # record the index value where C1 == 1
df['C3'] = df['C3'].ffill()                # propagate the last observation forwards
df.loc[df['C2'] == 0, 'C3'] = 0            # mask away rows where C2 == 0
df['C3'] = df['C3'].fillna(0).astype(int)  # NaN made the column float; restore ints
df
   C1  C2  C3
1   1   0   0
2   0   0   0
3   1   0   0
4   0   1   3
5   0   1   3
EDIT:
Assigning with df.loc[mask, 'C3'] (rather than the chained df['C3'].loc[mask]) copies the index values into the column, so the index itself is never overwritten or filled with zeroes, and no .copy() is needed.
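For reference, the same idea can be written more compactly. This is a sketch of an equivalent vectorized variant (not from the original answer); like the code above, it assumes every C2 == 1 row has some earlier C1 == 1 row, otherwise the final astype(int) would fail on a leftover NaN:

import pandas as pd

d = {'C1': pd.Series([1, 0, 1, 0, 0], index=[1, 2, 3, 4, 5]),
     'C2': pd.Series([0, 0, 0, 1, 1], index=[1, 2, 3, 4, 5])}
df = pd.DataFrame(d)

# Keep the index value only where C1 == 1, forward-fill it,
# then zero out every row where C2 != 1.
last_c1 = df.index.to_series().where(df['C1'].eq(1)).ffill()
df['C3'] = last_c1.where(df['C2'].eq(1), 0).astype(int)
print(df)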

Related

pandas - show column name + sum in which the sum is higher than zero

I read my dataframe in with:
dataframe = pd.read_csv("testFile.txt", sep = "\t", index_col= 0)
I got a dataframe like this:
cell    17472131  17472132  17472133  17472134  17472135  17472136
cell_0         1         0         1         0         1         0
cell_1         0         0         0         0         1         0
cell_2         0         1         1         1         0         0
cell_3         1         0         0         0         1         0
With pandas I would like to get all the column names for which the column sum is > 1, together with that sum.
So I would like:
17472131 2
17472133 2
17472135 3
I figured out how to get the sums of each column with
dataframe.sum(axis=0)
but this also returns the columns with a sum lower than 2. Is there a way to only show the columns with a value higher than, say, 1?
One pretty neat way is to use a lambda function inside loc:
df.set_index('cell').sum().loc[lambda x: x>1]
Output:
17472131 2
17472133 2
17472135 3
dtype: int64
Details: df.sum returns a pd.Series, and lambda x: x > 1 produces a boolean Series, which loc uses for boolean indexing to select only the True parts of the pd.Series.
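For completeness, a self-contained sketch reproducing the example (note that the question's read_csv uses index_col=0, in which case cell is already the index and the set_index('cell') step would be unnecessary):

import pandas as pd

df = pd.DataFrame({'cell': ['cell_0', 'cell_1', 'cell_2', 'cell_3'],
                   '17472131': [1, 0, 0, 1],
                   '17472132': [0, 0, 1, 0],
                   '17472133': [1, 0, 1, 0],
                   '17472134': [0, 0, 1, 0],
                   '17472135': [1, 1, 0, 1],
                   '17472136': [0, 0, 0, 0]})

# Sum each column, then keep only the sums greater than 1.
print(df.set_index('cell').sum().loc[lambda x: x > 1])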

groupby and trim some rows based on condition

I have a data frame something like this:
df = pd.DataFrame({"ID":[1,1,2,2,2,3,3,3,3,3],
"IF_car":[1,0,0,1,0,0,0,1,0,1],
"IF_car_history":[0,0,0,1,0,0,0,1,0,1],
"observation":[0,0,0,1,0,0,0,2,0,3]})
I want to trim rows within each ID group, with the cut-off determined by the condition IF_car_history == 1. My attempt:
tried_df = df.groupby(['ID']).apply(lambda x: x.loc[:(x['IF_car_history'] == 1).idxmax(), :]).reset_index(drop=True)
I want to drop, within each ID group, the rows that come after the last occurrence of IF_car_history == 1.
expected output:
Thanks
First create the mask m by comparing values with Series.eq and reversing with [::-1]. Then use GroupBy.cumsum on the reversed mask and compare it with 0: within each group the cumulative sum is still 0 for every row after the group's last 1. Finally, reverse back with [::-1] and filter by boolean indexing; the reversal is necessary because it is the rows after the last 1 that must be removed.
m = df['IF_car_history'].eq(1).iloc[::-1]                  # mask of 1s, in reverse order
df1 = df[m.groupby(df['ID']).cumsum().ne(0).iloc[::-1]]    # keep rows up to each group's last 1
print(df1)
   ID  IF_car  IF_car_history  observation
2   2       0               0            0
3   2       1               1            1
5   3       0               0            0
6   3       0               0            0
7   3       1               1            2
8   3       0               0            0
9   3       1               1            3
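To see why the double reversal works, it can help to print the intermediate cumulative sum. A small sketch (column values taken from the question):

import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                   "IF_car_history": [0, 0, 0, 1, 0, 0, 0, 1, 0, 1]})

m = df['IF_car_history'].eq(1).iloc[::-1]
# The cumulative sum runs bottom-up within each ID group, so it is
# still 0 for every row after the group's last 1 (and for whole
# groups containing no 1 at all) - exactly the rows to drop.
print(m.groupby(df['ID']).cumsum().iloc[::-1])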

pandas assign value in multiple columns based on value in one

I have a dataset like this,
sample = {'Theme': ['never give a ten', 'interaction speed', 'no feedback,premium'],
          'cat1': [0, 0, 0],
          'cat2': [0, 0, 0],
          'cat3': [0, 0, 0],
          'cat4': [0, 0, 0]}
pd.DataFrame(sample, columns=['Theme', 'cat1', 'cat2', 'cat3', 'cat4'])
                 Theme  cat1  cat2  cat3  cat4
0     never give a ten     0     0     0     0
1    interaction speed     0     0     0     0
2  no feedback,premium     0     0     0     0
Now, I need to replace the values in the cat columns based on the value in Theme. If the Theme column contains 'never give a ten', set cat1 to 1; similarly, if it contains 'interaction speed', set cat2 to 1; if it contains 'no feedback', set cat3 to 1; and for 'premium', set cat4 to 1.
In this sample I have provided 4 categories; in total I have 21. I could write if word in string 21 times for the 21 categories, but I am looking for an efficient way to write this as a function that goes through the logic for every row and updates the corresponding columns. Can anyone help, please?
Thanks in advance.
It is possible to set the column names by category with Series.str.get_dummies (the resulting column names come out sorted):
df1 = df['Theme'].str.get_dummies(',')
print (df1)
   interaction speed  never give a ten  no feedback  premium
0                  0                 1            0        0
1                  1                 0            0        0
2                  0                 0            1        1
If the original Theme column is needed in the output, add DataFrame.join:
df11 = df[['Theme']].join(df['Theme'].str.get_dummies(','))
print (df11)
                 Theme  interaction speed  never give a ten  no feedback  \
0     never give a ten                  0                 1            0
1    interaction speed                  1                 0            0
2  no feedback,premium                  0                 0            1

   premium
0        0
1        0
2        1
If the order of columns is important, add DataFrame.reindex:
# remove possible duplicates while preserving order
cols = list(dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]))
df2 = df['Theme'].str.get_dummies(',').reindex(cols, axis=1)
print(df2)
   never give a ten  interaction speed  no feedback  premium
0                 1                  0            0        0
1                 0                  1            0        0
2                 0                  0            1        1
cols = list(dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]))
df2 = df[['Theme']].join(df['Theme'].str.get_dummies(',').reindex(cols, axis=1))
print(df2)
                 Theme  never give a ten  interaction speed  no feedback  \
0     never give a ten                 1                  0            0
1    interaction speed                 0                  1            0
2  no feedback,premium                 0                  0            1

   premium
0        0
1        0
2        1
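If the target columns must keep the asker's cat1 through cat4 names, one possible sketch maps each phrase to its column and renames; the phrase-to-column mapping below is assumed from the question text:

import pandas as pd

df = pd.DataFrame({'Theme': ['never give a ten', 'interaction speed',
                             'no feedback,premium']})

# Assumed mapping (from the question): phrase -> target column.
mapping = {'never give a ten': 'cat1', 'interaction speed': 'cat2',
           'no feedback': 'cat3', 'premium': 'cat4'}

dummies = df['Theme'].str.get_dummies(',').rename(columns=mapping)
out = df.join(dummies.reindex(sorted(mapping.values()), axis=1, fill_value=0))
print(out)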

Pandas Flag Rows with Complementary Zeros

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'A': [0, 4, 4, 4],
                   'B': [0, 4, 4, 0],
                   'C': [0, 4, 4, 4],
                   'D': [4, 0, 0, 4],
                   'E': [4, 0, 0, 0],
                   'Name': ['a', 'a', 'b', 'c']})
df
   A  B  C  D  E Name
0  0  0  0  4  4    a
1  4  4  4  0  0    a
2  4  4  4  0  0    b
3  4  0  4  4  0    c
I'd like to add a new field called "Match_Flag" which labels unique combinations of rows that have complementary zero patterns (as rows 0, 1, and 2 do) AND share the same name (true only for rows 0 and 1). The flag uses the name of the matching rows.
The desired result is as follows:
   A  B  C  D  E Name Match_Flag
0  0  0  0  4  4    a          a
1  4  4  4  0  0    a          a
2  4  4  4  0  0    b        NaN
3  4  0  4  4  0    c        NaN
Caveat:
The patterns may vary, but should still be complementary.
Thanks in advance!
UPDATE
Sorry for the confusion.
Here is some clarification:
The reason why rows 0 and 1 are "complementary" is that they have opposite patterns of zeros in their columns; 0,0,0,4,4 vs, 4,4,4,0,0.
The number 4 is arbitrary; it could just as easily be 0,0,0,4,2 and 65,770,23,0,0. So if 2 such rows are indeed complementary and they have the same name, I'd like for them to be flagged with that same name under the "Match_Flag" column.
You can identify a complement: two rows are complementary if their element-wise product is zero everywhere and their element-wise sum is nowhere zero.
import numpy as np

def complements(df):
    v = df.drop('Name', axis=1).values
    n = v.shape[0]
    row, col = np.triu_indices(n, 1)
    # two rows are "complete" if their sum contains no zeros
    c = ((v[row] + v[col]) != 0).all(1)
    complete = set(row[c]).union(col[c])
    # two rows "do not overlap" if their product is zero everywhere
    o = (v[row] * v[col] == 0).all(1)
    non_overlap = set(row[o]).union(col[o])
    # a row is part of a complement iff it does not overlap and is complete
    complement = list(non_overlap.intersection(complete))
    # return the Name values at those (positional) row indices
    return df.Name.iloc[complement]
Then groupby('Name') and apply our function
df['Match_Flag'] = df.groupby('Name', group_keys=False).apply(complements)
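As a quick sanity check of the complement test itself, consider rows 0 and 1 from the frame above (a sketch of the two conditions, not part of the original answer):

import numpy as np

r1 = np.array([0, 0, 0, 4, 4])
r2 = np.array([4, 4, 4, 0, 0])

no_overlap = (r1 * r2 == 0).all()   # never nonzero in the same column
complete = ((r1 + r2) != 0).all()   # every column is nonzero in one of the rows
print(no_overlap and complete)      # True -> the rows are complementary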

Count values in a range between two values

In column A I have these values:
1
0
3
2
0
5
1
1
1
0
2
1
1
1
0
2
1
1
1
0
0
3
0
2
0
0
3
1
This list grows every day.
I need a formula to put in every cell of column B that counts, working upwards, how many values greater than 1 appear before the previous value equal to 1.
In other words, I need to count how many values larger than 1 sit between the 1's.
The intended result would be something like this:
1
0
3
2
0
5
1 3
1
0
2
1 1
1
0
2
1 1
1
0
0
3
0
2
0
0
3
1 3
Thanks in Advance
I would use a helper column, if this is acceptable.
To create a running count of numbers greater than one, which resets each time it encounters a '1', enter this starting in B2 and pull down (I'm assuming the data has a heading and the list starts with a 1):
=IF(A2=1,0,B1+(A2>1))
Then, to display the counts at each '1' value (but not for repeated ones), enter this in C2 and pull down:
=IF(AND(A2=1,A1<>1,ISNUMBER(A1)),B1,"")
It's also possible to do it with an array formula, though it may not be worth the effort:
=IF(AND(A2=1,A1<>1),
COUNTIF(
OFFSET(
A$1,
MAX(ROW(A1:A$2)*(A1:A$2=1))-ROW(A$1)+1,,
MAX(ROW(A1))-MAX(ROW(A1:A$2)*(A1:A$2=1))),
">"&0),
"")
to be entered in B2 with Ctrl Shift Enter and pulled down.
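For readers who want the same logic in pandas rather than Excel, here is a rough translation of the helper-column approach (my own sketch, not from the answer; it makes the same assumption that the list starts with a 1):

import pandas as pd

a = pd.Series([1, 0, 3, 2, 0, 5, 1, 1, 1, 0, 2, 1, 1, 1,
               0, 2, 1, 1, 1, 0, 0, 3, 0, 2, 0, 0, 3, 1])

# Column B: running count of values > 1, reset at every 1.
grp = a.eq(1).cumsum()                         # a new group starts at each 1
b = a.gt(1).astype(int).groupby(grp).cumsum()
# Column C: show the count at each 1 that follows a non-1 value.
c = b.shift().where(a.eq(1) & a.shift().ne(1))
print(pd.DataFrame({'A': a, 'B': b, 'C': c}))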
