find index of row element in pandas - python-3.x

If you have a df:
apple banana carrot
a 1 2 3
b 2 3 1
c 0 0 1
To find the row index where a column equals 0, you can use df[df['apple']==0].index,
but can you transpose this so you find the column labels of row c where the value is 0?
Basically I need to drop the columns where row c is 0, and would like to do this in one line, by row rather than column by column.

If you want to test row c and get all columns where the value is 0:
c = df.columns[df.loc['c'] == 0]
print (c)
Index(['apple', 'banana'], dtype='object')
If you want to test all rows at once:
c1 = df.columns[df.eq(0).any()]
print (c1)
Index(['apple', 'banana'], dtype='object')
If you need to remove the columns that contain 0 in any row:
df = df.loc[:, df.ne(0).all()]
print (df)
carrot
a 3
b 1
c 1
Detail/explanation:
First, compare all values of the DataFrame with ne (!=):
print (df.ne(0))
apple banana carrot
a True True True
b True True True
c False False True
Then test, per column, whether all values are True with all:
print (df.ne(0).all())
apple False
banana False
carrot True
dtype: bool
Last, filter with DataFrame.loc:
print (df.loc[:, df.ne(0).all()])
carrot
a 3
b 1
c 1
If you need to test only row c, the solution is similar: first select row c with loc and omit all:
df = df.loc[:, df.loc['c'].ne(0)]
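For reference, a minimal runnable sketch that rebuilds the example frame and applies the one-liners above:
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({'apple': [1, 2, 0],
                   'banana': [2, 3, 0],
                   'carrot': [3, 1, 1]},
                  index=['a', 'b', 'c'])

# Column labels where row 'c' equals 0
print (df.columns[df.loc['c'] == 0])
#Index(['apple', 'banana'], dtype='object')

# Drop the columns where row 'c' equals 0, in one line
print (df.loc[:, df.loc['c'].ne(0)])
#   carrot
#a       3
#b       1
#c       1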

Yes you can, df.T[df.T['c']==0]
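As a quick sketch of the transposed variant on the same frame: .index on the filtered transpose gives the column labels, and transposing back after the inverted filter drops them:
df.T[df.T['c'] == 0].index
#Index(['apple', 'banana'], dtype='object')
df.T[df.T['c'] != 0].T
#   carrot
#a       3
#b       1
#c       1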

Related

Given a column value, check if another column value is present in preceding or next 'n' rows in a Pandas data frame

I have the following data
jsonDict = {'Fruit': ['apple', 'orange', 'apple', 'banana', 'orange', 'apple','banana'], 'price': [1, 2, 1, 3, 2, 1, 3]}
Fruit price
0 apple 1
1 orange 2
2 apple 1
3 banana 3
4 orange 2
5 apple 1
6 banana 3
What I want to do is check if Fruit == 'banana', and if so, scan the preceding as well as the next n rows from the index position of the 'banana' row for an instance where Fruit == 'apple'. An example of the expected output, taking n=2, is shown below.
Fruit price
2 apple 1
5 apple 1
I have tried doing
position = df[df['Fruit'] == 'banana'].index
resultdf= df.loc[((df.index).isin(position)) & (((df['Fruit'].index+2).isin(['apple']))|((df['Fruit'].index-2).isin(['apple'])))]
# Output is an empty dataframe
Empty DataFrame
Columns: [Fruit, price]
Index: []
Preference will be given to vectorized approaches.
IIUC, you can use 2 masks and boolean indexing:
# df = pd.DataFrame(jsonDict)
n = 2
m1 = df['Fruit'].eq('banana')
# is the row ±n of a banana?
m2 = m1.rolling(2*n+1, min_periods=1, center=True).max().eq(1)
# is the row an apple?
m3 = df['Fruit'].eq('apple')
out = df[m2&m3]
output:
Fruit price
2 apple 1
5 apple 1
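As a quick check, printing the intermediate mask m2 (continuing from the snippet above) shows which rows fall within ±n of a 'banana'; only row 0 is out of reach of both bananas:
print (m2.tolist())
#[False, True, True, True, True, True, True]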

Collapse values from multiple rows of a column into an array when all other columns values are same

I have a table with 7 columns where, for every few rows, 6 columns remain the same and only the 7th changes. I would like to merge all these rows into one row, combining the values of the 7th column into a list.
So if I have this dataframe:
A B C
0 a 1 2
1 b 3 4
2 c 5 6
3 c 7 6
I would like to convert it to this:
A B C
0 a 1 2
1 b 3 4
2 c [5, 7] 6
Since the values of columns A and C are the same in rows 2 and 3, those rows get collapsed into a single row and the values of B are combined into a list.
Melt, explode, and pivot don't seem to offer such functionality. How can I achieve this using pandas?
Use GroupBy.agg with a custom lambda function, and last add DataFrame.reindex to restore the original column order:
f = lambda x: x.tolist() if len(x) > 1 else x.iat[0]
df = df.groupby(['A','C'])['B'].agg(f).reset_index().reindex(df.columns, axis=1)
You can also derive the column names dynamically:
changes = ['B']
cols = df.columns.difference(changes).tolist()
f = lambda x: x.tolist() if len(x) > 1 else x.iat[0]
df = df.groupby(cols)[changes].agg(f).reset_index().reindex(df.columns, axis=1)
print (df)
A B C
0 a 1 2
1 b 3 4
2 c [5, 7] 6
If lists are wanted for all values in the column, the solution is simpler:
changes = ['B']
cols = df.columns.difference(changes).tolist()
df = df.groupby(cols)[changes].agg(list).reset_index().reindex(df.columns, axis=1)
print (df)
A B C
0 a [1] 2
1 b [3] 4
2 c [5, 7] 6
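A self-contained sketch of the generic variant, rebuilding the sample frame for reproduction (here selecting the single column 'B' directly):
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'c'],
                   'B': [1, 3, 5, 7],
                   'C': [2, 4, 6, 6]})

changes = ['B']
cols = df.columns.difference(changes).tolist()   # ['A', 'C']
f = lambda x: x.tolist() if len(x) > 1 else x.iat[0]
out = df.groupby(cols)['B'].agg(f).reset_index().reindex(df.columns, axis=1)
print (out)
#   A       B  C
#0  a       1  2
#1  b       3  4
#2  c  [5, 7]  6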
Here is another approach using pivot_table and applymap:
(df.pivot_table(index='A', aggfunc=list)
   .applymap(lambda x: x[0] if len(set(x)) == 1 else x)
   .reset_index())
A B C
0 a 1 2
1 b 3 4
2 c [5, 7] 6
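On recent pandas (2.1+), DataFrame.applymap is deprecated in favor of DataFrame.map, so the same pipeline can be written as:
(df.pivot_table(index='A', aggfunc=list)
   .map(lambda x: x[0] if len(set(x)) == 1 else x)
   .reset_index())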

Drop whole groups over multiple columns if a specific value does not exist in another column in Pandas

How can I drop the whole group, by city and district, if the date value 2018/11/1 does not exist in it, in the following dataframe:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
3 b d 2018/9/1 3
4 b d 2018/10/1 7
The expected result will like this:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Thank you!
Create a helper column with DataFrame.assign, compare the dates, and test whether at least one value per group is True with GroupBy.transform('any'), so the result can be used as a mask for boolean indexing:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
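For reference, a quick check of the helper mask on the sample data: it is True for every row of the (a, c) group and False for the (b, d) group:
print (mask.tolist())
#[True, True, True, False, False]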
If this raises an error because of missing values, one possible idea is to replace the missing values in the columns used for the groups:
mask = (df.assign(new=df['date'].eq('2018/11/1'),
                  city=df['city'].fillna(-1),
                  district=df['district'].fillna(-1))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Another idea is to add possibly missing index values with reindex and also replace missing values with False:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask.reindex(df.index, fill_value=False).fillna(False)]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
There's a special GroupBy.filter() method for this. Assuming date is already datetime:
filter_date = pd.Timestamp('2018-11-01').date()
df = df.groupby(['city', 'district']).filter(lambda x: (x['date'].dt.date == filter_date).any())
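A minimal runnable sketch of the filter approach, rebuilding the frame and parsing the dates first (values taken from the question):
import pandas as pd

df = pd.DataFrame({'city': ['a', 'a', 'a', 'b', 'b'],
                   'district': ['c', 'c', 'c', 'd', 'd'],
                   'date': pd.to_datetime(['2018/9/1', '2018/10/1', '2018/11/1',
                                           '2018/9/1', '2018/10/1']),
                   'value': [12, 4, 5, 3, 7]})

filter_date = pd.Timestamp('2018-11-01').date()
print (df.groupby(['city', 'district'])
         .filter(lambda x: (x['date'].dt.date == filter_date).any()))
Note that filter calls the Python lambda once per group, so with many groups the transform-based approach above is usually faster.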

How to drop crossed pairs of records from a DataFrame in pandas?

I have a dataframe like this:
a b c
0 A B 1
4 B A 1
1 C D -1
3 D C 3
2 E F 3
The '0' row and the '4' row are a mirror pair, and the 'c' column decides how to deduplicate: if both rows of a mirror pair have the same value in c, I remove one of them; if the values differ, I remove both. The expected result:
a b c
0 A B 1
2 E F 3
I used a while loop, but my data set is huge. Any better ideas?
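For reproduction, the sample frame with its non-sequential index can be rebuilt like this (a sketch to test the answers below):
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'C', 'D', 'E'],
                   'b': ['B', 'A', 'D', 'C', 'F'],
                   'c': [1, 1, -1, 3, 3]},
                  index=[0, 4, 1, 3, 2])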
IIUC, using np.sort with duplicated:
df1=df.loc[~pd.DataFrame(np.sort(df[['a','b']].values,axis=1)).duplicated().values]
a b c
0 A B 1
1 C D -1
2 E F 3
You may use agg with frozenset, together with duplicated and slicing:
s = df[['a', 'b']].agg(frozenset, axis=1)
m = ~s.duplicated(keep=False) | (s.duplicated(keep=False) & df.c.duplicated())
df.loc[m]
Out[165]:
a b c
4 B A 1
2 E F 3
First select the non-duplicated rows using np.sort and Series.duplicated (see detail m1). Then use DataFrame.groupby to group according to columns a, b (see detail g), and perform boolean indexing with GroupBy.transform to eliminate pairs whose c values do not match:
df2=df.reset_index(drop=True)
m1=~pd.DataFrame(np.sort(df2[['a','b']])).duplicated()
g=m1.cumsum()
m2=~df2.groupby(g,sort=False)['c'].transform(lambda x: (x.nunique()==len(x))&(len(x)>1))
mask=m1&m2
print(mask)
0 True
1 False
2 False
3 False
4 True
dtype: bool
df_filtered=df2[mask]
print(df_filtered)
a b c
0 A B 1
4 E F 3
Details:
m1
0 True
1 False
2 True
3 False
4 True
dtype: bool
m2
0 True
1 True
2 False
3 False
4 True
dtype: bool
g
0 1
1 1
2 2
3 2
4 3
dtype: int64

Getting all rows where the entry in column 'C' is larger than the preceding entry in column 'C'

How can I select all rows of a data frame where a condition on a column is met, when the condition concerns the relationship between every 2 consecutive entries of that column? To give a specific example, let's say I have a DataFrame:
>>> df = pd.DataFrame({'A': [1, 2, 3, 4],
...                    'B': ['spam', 'ham', 'egg', 'foo'],
...                    'C': [4, 5, 3, 4]})
>>> df
A B C
0 1 spam 4
1 2 ham 5
2 3 egg 3
3 4 foo 4
>>> df2 = df[ return every row of df where C[i] > C[i-1] ]
>>> df2
A B C
1 2 ham 5
3 4 foo 4
There is plenty of great information about slicing and indexing in the pandas docs and here, but this is a bit more complicated, I think. I could also be going about it wrong. What I'm looking for are the rows where the value stored in C stops being monotonically decreasing.
Any help is appreciated!
Use boolean indexing, comparing against the shifted column values:
print (df[df['C'] > df['C'].shift()])
A B C
1 2 ham 5
3 4 foo 4
Detail (shift() makes the first value NaN, and a comparison with NaN evaluates to False, so the first row is never selected):
print (df['C'] > df['C'].shift())
0 False
1 True
2 False
3 True
Name: C, dtype: bool
The same rows can also be selected by comparing the column's consecutive differences with diff:
print (df[df['C'].diff() > 0])
A B C
1 2 ham 5
3 4 foo 4
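Both one-liners build the same boolean mask on this frame; a quick check, reusing df from the question:
print ((df['C'] > df['C'].shift()).equals(df['C'].diff() > 0))
#True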
