delete rows based on first N columns - python-3.x

I have a dataframe:
import pandas as pd
df= pd.DataFrame({'date':['2017-12-31','2018-02-01','2018-03-01'],'type':['Asset','Asset','Asset'],'Amount':[1,0,0],'Amount1':[1,0,0],'Ted':[1,0,0]})
df
I want to delete rows where the first three columns are 0. I don't want to use the column names, as they change. In this case, I want to delete the 2nd and 3rd rows.

Use boolean indexing:
df = df[df.iloc[:, :3].ne(0).any(axis=1)]
#alternative solution with inverting mask by ~
#df = df[~df.iloc[:, :3].eq(0).all(axis=1)]
print (df)
Amount Amount1 Ted date type
0 1 1 1 2017-12-31 Asset
Detail:
First select N columns by iloc:
print (df.iloc[:, :3])
Amount Amount1 Ted
0 1 1 1
1 0 0 0
2 0 0 0
Compare by ne (!=):
print (df.iloc[:, :3].ne(0))
Amount Amount1 Ted
0 True True True
1 False False False
2 False False False
Then keep rows with at least one True per row by any:
print (df.iloc[:, :3].ne(0).any(axis=1))
0 True
1 False
2 False
dtype: bool
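Assembled into one runnable sketch. Note: in current pandas the dict insertion order is preserved, so the numeric columns are listed first here to match the column layout shown above (older pandas versions sorted columns alphabetically):

```python
import pandas as pd

# Numeric columns first so that iloc[:, :3] selects them,
# matching the column order printed in the answer above
df = pd.DataFrame({'Amount': [1, 0, 0],
                   'Amount1': [1, 0, 0],
                   'Ted': [1, 0, 0],
                   'date': ['2017-12-31', '2018-02-01', '2018-03-01'],
                   'type': ['Asset', 'Asset', 'Asset']})

# keep rows where at least one of the first three columns is non-zero
df = df[df.iloc[:, :3].ne(0).any(axis=1)]
print(df)
```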

Related

Python: Compare 2 pandas dataframe with unequal number of rows

I need to compare two pandas dataframes with an unequal number of rows and generate a new df with True for matching records and False for non-matching or missing records.
df1:
date x y
0 2022-11-01 4 5
1 2022-11-02 12 5
2 2022-11-03 11 3
df2:
date x y
0 2022-11-01 4 5
1 2022-11-02 11 5
expected df_output:
date x y
0 True True True
1 False False False
2 False False False
Code:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date':['2022-11-01', '2022-11-02', '2022-11-03'],'x':[4,12,11],'y':[5,5,3]})
df2 = pd.DataFrame({'date':['2022-11-01', '2022-11-02'],'x':[4,11],'y':[5,5]})
df_output = pd.DataFrame(np.where(df1 == df2, True, False), columns=df1.columns)
print(df_output)
Error: ValueError: Can only compare identically-labeled DataFrame objects
You can use:
# cell to cell equality
# comparing by date
df3 = df1.eq(df1[['date']].merge(df2, on='date', how='left'))
# or to compare by index
# df3 = df1.eq(df2, axis=1)
# if you also want to turn a row to False if there is any False
df3 = (df3.T & df3.all(axis=1)).T
Output:
date x y
0 True True True
1 False False False
2 False False False
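A self-contained version of the merge-based approach, for anyone who wants to run it end to end:

```python
import pandas as pd

df1 = pd.DataFrame({'date': ['2022-11-01', '2022-11-02', '2022-11-03'],
                    'x': [4, 12, 11], 'y': [5, 5, 3]})
df2 = pd.DataFrame({'date': ['2022-11-01', '2022-11-02'],
                    'x': [4, 11], 'y': [5, 5]})

# align df2 to df1's dates; unmatched dates become NaN and compare as False
aligned = df1[['date']].merge(df2, on='date', how='left')
df3 = df1.eq(aligned)

# turn a whole row False if any cell in it is False
df3 = (df3.T & df3.all(axis=1)).T
print(df3)
```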

How to check if column is binary? (Pandas)

How to (efficiently!) check if a column is binary?
"col" "col2"
0 0 1
1 0 0
2 0 0
3 0 0
4 0 1
There might also be a problem with columns that aren't meant to be binary
but only contain zeros.
(I thought of keeping a list of their names, filled after each column is added to the DF,
but is there a way to directly mark a column as "binary" during creation?)
The purpose is feature scaling for machine learning (binary columns shouldn't be scaled).
If you want to filter the column names whose values are only 0 or 1:
c = df.columns[df.isin([0,1]).all()]
print (c)
Index(['col', 'col2'], dtype='object')
If you need to filter the columns themselves:
df1 = df.loc[:, df.isin([0,1]).all()]
print (df1)
col col2
0 0 1
1 0 0
2 0 0
3 0 0
4 0 1
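A runnable version rebuilding the example frame. Note that the all-zero column col still qualifies, which is exactly the caveat raised in the question:

```python
import pandas as pd

df = pd.DataFrame({'col': [0, 0, 0, 0, 0], 'col2': [1, 0, 0, 0, 1]})

# columns whose every value is 0 or 1 (an all-zero column also passes)
c = df.columns[df.isin([0, 1]).all()]
print(c)
```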
You can use this:
pd.unique(df[['col', 'col2']].values.ravel('K'))
and it returns:
array([0, 1], dtype=int64)
Or you can also apply pd.unique to each column.
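A per-column pd.unique check might look like the sketch below; the subset test treats an all-zero column as binary, so switch to an equality test (`== {0, 1}`) if you want to require both values:

```python
import pandas as pd

df = pd.DataFrame({'col': [0, 0, 0, 0, 0], 'col2': [1, 0, 0, 0, 1]})

def is_binary(ser):
    # binary if the unique values are a subset of {0, 1}
    return set(pd.unique(ser)) <= {0, 1}

bin_cols = [name for name in df.columns if is_binary(df[name])]
print(bin_cols)
```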
Here is what I use to also cover corner cases with mixed string/numeric types:
import numpy as np
import pandas as pd
def checkBinary(ser, dropna=False):
    try:
        if dropna:
            # errors="raise" so non-numeric values fall through to except
            ser = pd.to_numeric(ser.dropna(), errors="raise")
        else:
            ser = pd.to_numeric(ser, errors="raise")
    except (ValueError, TypeError):
        return False
    return {0, 1} == set(pd.unique(ser))
ser = pd.Series(["0",1,"1.000", np.nan])
checkBinary(ser, dropna = True)
>> True
ser = pd.Series(["0",0,"0.000"])
checkBinary(ser)
>> False

Update a pandas dataframe

I have a pandas dataframe with multiple columns, and I have to update a column with True or False based on a condition. For example, the column names are price and result; if the price column has 'promotion' as its value, the result column should be set to True, otherwise False.
Please help me with this.
Given this df:
price result
0 promotion 0
1 1 0
2 4 0
3 3 0
You can do so:
import numpy as np

df['result'] = np.where(df['price'] == 'promotion', True, False)
Output:
price result
0 promotion True
1 1 False
2 4 False
3 3 False
Let's suppose the dataframe looks like this:
price result
0 0 False
1 1 False
2 2 False
3 promotion False
4 3 False
5 promotion False
You can create two boolean arrays: the first has True at the indices where the result column should become True, and the second has True at the indices where it should become False.
Here is the code:
index_true = (df['price'] == 'promotion')
index_false = (df['price'] != 'promotion')
df.loc[index_true, 'result'] = True
df.loc[index_false, 'result'] = False
The resultant dataframe will look like this:
price result
0 0 False
1 1 False
2 2 False
3 promotion True
4 3 False
5 promotion True
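Both answers can be shortened: eq (or ==) already returns the boolean Series you want, so a minimal sketch is:

```python
import pandas as pd

df = pd.DataFrame({'price': ['promotion', 1, 4, 3],
                   'result': [False, False, False, False]})

# the comparison itself is already the True/False column
df['result'] = df['price'].eq('promotion')
print(df)
```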

find index of row element in pandas

If you have a df:
apple banana carrot
a 1 2 3
b 2 3 1
c 0 0 1
To find the row index where a column equals 0, you can use df[df['apple']==0].index,
but can you transpose this so you find the columns of row c where the value is 0?
Basically I need to drop the columns where c==0, and I'd like to do this in one line by row rather than column by column.
If you want to test row c and get all columns where it is 0:
c = df.columns[df.loc['c'] == 0]
print (c)
Index(['apple', 'banana'], dtype='object')
If you want to test all rows:
c1 = df.columns[df.eq(0).any()]
print (c1)
Index(['apple', 'banana'], dtype='object')
If you need to remove a column when 0 appears in any row:
df = df.loc[:, df.ne(0).all()]
print (df)
carrot
a 3
b 1
c 1
Detail/explanation:
First compare all values of the DataFrame by ne (!=):
print (df.ne(0))
apple banana carrot
a True True True
b True True True
c False False True
Then reduce with all to check whether every row is True in each column:
print (df.ne(0).all())
apple False
banana False
carrot True
dtype: bool
Last filter by DataFrame.loc:
print (df.loc[:, df.ne(0).all()])
carrot
a 3
b 1
c 1
If you only need to test row c, the solution is similar: first select row c by loc and omit all:
df = df.loc[:, df.loc['c'].ne(0)]
Yes you can: df.T[df.T['c']==0] transposes the DataFrame and then filters the transposed rows (the original columns) where c is 0.
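A self-contained sketch of the transpose approach, rebuilding the example frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'apple': [1, 2, 0], 'banana': [2, 3, 0], 'carrot': [3, 1, 1]},
                  index=['a', 'b', 'c'])

# transpose so columns become rows, then filter where row c is 0
zero_in_c = df.T[df.T['c'] == 0]
print(zero_in_c.index)  # the original columns where row c equals 0
```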

How to delete the entire row if any of its value is 0 in pandas

In the below example I only want to retain rows 1 and 2.
I want to delete all rows which have a 0 anywhere across the columns:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
3 3 3 0 3 3
4 0 4 0 0 0
5 5 5 5 5 0
the output should read like below:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
I have tried:
df.loc[(df!=0).any(axis=1)]
But it deletes a row only if all of its columns are 0.
You are really close; you need DataFrame.all to check that all values per row are True:
df = df.loc[(df!=0).all(axis=1)]
print (df)
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
Details:
print (df!=0)
kt b tt mky depth
1 True True True True True
2 True True True True True
3 True True False True True
4 False True False False False
5 True True True True False
print ((df!=0).all(axis=1))
1 True
2 True
3 False
4 False
5 False
dtype: bool
An alternative solution uses any to check for at least one True per row on the changed mask df == 0, then inverts it with ~:
df = df.loc[~(df==0).any(axis=1)]
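Both masks select the same rows; a runnable sketch rebuilding the example frame:

```python
import pandas as pd

df = pd.DataFrame({'kt': [1, 2, 3, 0, 5], 'b': [1, 2, 3, 4, 5],
                   'tt': [1, 2, 0, 0, 5], 'mky': [1, 2, 3, 0, 5],
                   'depth': [4, 2, 3, 0, 0]},
                  index=[1, 2, 3, 4, 5])

kept_all = df.loc[(df != 0).all(axis=1)]    # keep rows with no zeros
kept_any = df.loc[~(df == 0).any(axis=1)]   # same rows via the inverted mask
print(kept_all)
```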
