How to delete the entire row if any of its values is 0 in pandas - python-3.x

In the example below I only want to retain rows 1 and 2; I want to delete every row that has a 0 in any column:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
3 3 3 0 3 3
4 0 4 0 0 0
5 5 5 5 5 0
The output should look like this:
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
I have tried:
df.loc[(df!=0).any(axis=1)]
But this drops a row only when all of its columns are 0.

You are really close; you need DataFrame.all to check that all values in a row are True:
df = df.loc[(df!=0).all(axis=1)]
print (df)
kt b tt mky depth
1 1 1 1 1 4
2 2 2 2 2 2
Details:
print (df!=0)
kt b tt mky depth
1 True True True True True
2 True True True True True
3 True True False True True
4 False True False False False
5 True True True True False
print ((df!=0).all(axis=1))
1 True
2 True
3 False
4 False
5 False
dtype: bool
Alternative solution: use any to check for at least one True per row against the inverted mask df == 0, negating the result with ~:
df = df.loc[~(df==0).any(axis=1)]
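
For reference, a minimal runnable sketch of both variants on the sample data above (the index labels are assumed to match the table):
import pandas as pd

df = pd.DataFrame({'kt': [1, 2, 3, 0, 5],
                   'b': [1, 2, 3, 4, 5],
                   'tt': [1, 2, 0, 0, 5],
                   'mky': [1, 2, 3, 0, 5],
                   'depth': [4, 2, 3, 0, 0]},
                  index=[1, 2, 3, 4, 5])

# Keep only rows where every column is non-zero
kept_all = df.loc[(df != 0).all(axis=1)]

# Equivalent: drop rows where at least one column is zero
kept_any = df.loc[~(df == 0).any(axis=1)]

assert kept_all.equals(kept_any)
print(kept_all)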

Related

PANDAS/Python: check if the values from 2 datasets are equal and change the 1s & 0s to True or False

I want to check whether the values in both datasets are equal. But the datasets are not in the same order, so I need to loop through them.
Dataset 1 (contract):
Part number  H50  H51  H53
ID001        1    1    1
ID002        1    1    1
ID003        0    1    0
ID004        1    1    1
ID005        1    1    1
Dataset 2 (anx): the part numbers are not in the same order, so the rows must be matched on part number across the two files. When both the part number and the H column header match, compare the values.
Part number  H50  H51  H53
ID001        1    1    1
ID003        0    0    1
ID004        0    1    1
ID002        1    0    1
ID005        1    1    1
Expected outcome:
If the values are equal in both datasets (1 == 1 or 0 == 0), change the value to TRUE.
If the value is 1 in dataset 1 but 0 in dataset 2, or 0 in dataset 1 but 1 in dataset 2, change the value to FALSE, and save all rows containing a FALSE value to an Excel file named "Not in contract".
Example expected outcome:
Part number  H50    H51    H53
ID001        TRUE   TRUE   TRUE
ID002        TRUE   FALSE  TRUE
ID003        TRUE   FALSE  FALSE
ID004        FALSE  TRUE   TRUE
ID005        TRUE   TRUE   TRUE
# Align the two frames on part number; overlapping columns get _x/_y suffixes
df_merged = df1.merge(df2, on='Part number')
a = df_merged[df_merged.columns[df_merged.columns.str.contains('_x')]]
b = df_merged[df_merged.columns[df_merged.columns.str.contains('_y')]]
# Compare the two blocks element-wise and restore the original headers
out = pd.concat([df_merged['Part number'],
                 pd.DataFrame(a.values == b.values, columns=df1.columns[1:4])],
                axis=1)
out
Part number H50 H51 H53
0 ID001 True True True
1 ID002 True False True
2 ID003 True False False
3 ID004 False True True
4 ID005 True True True
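
The question also asks to save every row containing a FALSE to an Excel file; a minimal sketch continuing from out above (the exact file name is taken from the question, and an Excel writer engine such as openpyxl is assumed to be installed):
# Rows where any compared column is False
mask = ~out[['H50', 'H51', 'H53']].all(axis=1)

out[mask].to_excel('Not in contract.xlsx', index=False)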

Checking for specific value change between columns in pandas

I've got 4 columns with numeric values between 1 and 4, and I'm trying to see which rows change from a value of 1 to a value of 4 progressing from column a to column d within those 4 columns. Currently I'm pulling the difference between each of the columns and looking for a value of 3. Is there a better way to do this?
Here's what I'm looking for (with 0's in place of nan):
ID a b c d check
1 1 0 1 4 True
2 1 0 1 1 False
3 1 1 1 4 True
4 1 3 3 4 True
5 0 0 1 4 True
6 1 2 3 3 False
7 1 0 0 4 True
8 1 4 4 4 True
9 1 4 3 4 True
10 1 4 1 1 True
You can just use cummax:
col = ['a', 'b', 'c', 'd']
# Running maximum across each row, left to right
s = df[col].cummax(axis=1)
# True if a 1 is still the running max in a..c and the row reaches 4 by column d
df['new'] = s[col[:3]].eq(1).any(axis=1) & s[col[-1]].eq(4)
Out[523]:
0 True
1 False
2 True
3 True
4 True
5 False
6 True
7 True
8 True
dtype: bool
You can also compare the positions of 4 and 1 in each row with apply:
cols = ['a', 'b', 'c', 'd']

def get_index(lst, num):
    # Position of num in the row, or -1 if absent
    return lst.index(num) if num in lst else -1

# True when the first 4 appears after the first 1
df['Check'] = df[cols].apply(lambda row: get_index(row.tolist(), 4) > get_index(row.tolist(), 1), axis=1)
print(df)
ID a b c d check Check
0 1 1 0 1 4 True True
1 2 1 0 1 1 False False
2 3 1 1 1 4 True True
3 4 1 3 3 4 True True
4 5 0 0 1 4 True True
5 6 1 2 3 3 False False
6 7 1 0 0 4 True True
7 8 1 4 4 4 True True
8 9 1 4 3 4 True True
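
For completeness, a self-contained sketch of the cummax approach on the question's full sample table (the ID column is assumed to be an ordinary column, not the index):
import pandas as pd

df = pd.DataFrame({'ID': range(1, 11),
                   'a': [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'b': [0, 0, 1, 3, 0, 2, 0, 4, 4, 4],
                   'c': [1, 1, 1, 3, 1, 3, 0, 4, 3, 1],
                   'd': [4, 1, 4, 4, 4, 3, 4, 4, 4, 1]})

cols = ['a', 'b', 'c', 'd']
s = df[cols].cummax(axis=1)

# True when a 1 is still the running max before d and the row has reached 4 by d
df['check'] = s[cols[:3]].eq(1).any(axis=1) & s[cols[-1]].eq(4)
print(df)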

How to add a number to a group of rows in a column only when the rows are grouped and have the same value?

I have a dataframe with multiple columns. One of these columns contains binary values (0s and 1s). For example:
data = pd.DataFrame([0,0,0,0,1,1,1,0,0,0,0,0,1,1,0,0,0,0,1,1,1,1,0,0])
What I need to do is identify every consecutive group of 1s and replace it with its group number: 1 for the first group, 2 for the second, and so on.
The output should be a dataframe as follows:
0,0,0,0,1,1,1,0,0,0,0,0,2,2,0,0,0,0,3,3,3,3,0,0
Is there a way to do this without it getting messy and complicated?
Use a boolean mask:
# Look for current row = 1 and previous row = 0
m = df['A'].diff().eq(1)
df['G'] = m.cumsum().mask(df['A'].eq(0), 0)
print(df)
# Output
A G # m
0 0 0 # False
1 0 0 # False
2 0 0 # False
3 0 0 # False
4 1 1 # True <- Group 1
5 1 1 # False
6 1 1 # False
7 0 0 # False
8 0 0 # False
9 0 0 # False
10 0 0 # False
11 0 0 # False
12 1 2 # True <- Group 2
13 1 2 # False
14 0 0 # False
15 0 0 # False
16 0 0 # False
17 0 0 # False
18 1 3 # True <- Group 3
19 1 3 # False
20 1 3 # False
21 1 3 # False
22 0 0 # False
23 0 0 # False
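
A runnable sketch of the same idea on the question's data (the column is assumed to be named A, matching the answer's code):
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
                         1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]})

# A 0 -> 1 transition marks the start of a new group of 1s
start = df['A'].diff().eq(1)

# Number the groups cumulatively, then zero out rows where A is 0
df['G'] = start.cumsum().mask(df['A'].eq(0), 0)
print(df['G'].tolist())
# [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 3, 3, 3, 3, 0, 0]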

pandas dataframe groups check the number of unique values of a column is one but exclude empty strings

I have the following df,
id invoice_no
1 6636
1 6637
2 6639
2 6639
3
3
4 6635
4 6635
4 6635
The invoice_no values for id 3 are all empty strings or spaces. I want to use:
df['same_invoice_no'] = df.groupby("id")["invoice_no"].transform('nunique') == 1
but also treat groups whose invoice_no values are spaces or empty strings as same_invoice_no = False. I am wondering how to do that. The result should look like:
id invoice_no same_invoice_no
1 6636 False
1 6637 False
2 6639 True
2 6639 True
3 False
3 False
4 6635 True
4 6635 True
4 6635 True
Empty strings count as a value for nunique (so they would yield True), but NaNs do not. Replace the empty strings with NumPy NaN:
import numpy as np

df.replace('', np.nan, inplace=True)
df['same_invoice_no'] = df.groupby("id")["invoice_no"].transform('nunique') == 1
id invoice_no same_invoice_no
0 1 6636.0 False
1 1 6637.0 False
2 2 6639.0 True
3 2 6639.0 True
4 3 NaN False
5 3 NaN False
6 4 6635.0 True
7 4 6635.0 True
8 4 6635.0 True
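
If the cells can also hold whitespace-only strings, as the question mentions, a regex replace covers both cases; a sketch under that assumption:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3, 4, 4, 4],
                   'invoice_no': ['6636', '6637', '6639', '6639',
                                  '', ' ', '6635', '6635', '6635']})

# Turn empty and whitespace-only strings into NaN, which nunique ignores
cleaned = df['invoice_no'].replace(r'^\s*$', np.nan, regex=True)

# Groups whose only values were blank now have nunique 0, giving False
df['same_invoice_no'] = cleaned.groupby(df['id']).transform('nunique') == 1
print(df)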

Select from a DataFrame based on several levels of the MultiIndex

How to extend the logic of selecting from a DataFrame based on the first N-1 levels when N > 2?
As an example, consider a DataFrame:
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([[0, 1], [10, 20, 30], ["a", "b"]])
df = pd.DataFrame(1, columns=midx, index=np.arange(3))
In[11]: df
Out[11]:
0 1
10 20 30 10 20 30
a b a b a b a b a b a b
0 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1 1 1
Here, it is easy to select columns where 0 or 1 are in the first level:
df[[0, 1]]
But the same logic does not extend to selecting columns with 0 or 1 in the first and 10 or 20 in the second level:
In[13]: df[[(0, 10), (0, 20), (1, 10), (1, 20)]]
ValueError: operands could not be broadcast together with shapes (4,2) (3,) (4,2)
The following works:
df.loc[:, pd.IndexSlice[[0, 1], [10, 20], :]]
but is cumbersome, especially when the selector needs to be extracted from another DataFrame with a 2-level MultiIndex:
idx = df.columns.droplevel(2)
In[16]: idx
Out[16]:
MultiIndex(levels=[[0, 1], [10, 20, 30]],
labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, ... 1, 2, 2]])
In[17]: df[idx]
ValueError: operands could not be broadcast together with shapes (12,2) (3,) (12,2)
EDIT: Ideally, I would also like to be able to order columns this way, not just select them — again, in the spirit of df[[1, 0]] being able to order columns based on the first level.
You can filter with boolean indexing using get_level_values and isin:
m1 = df.columns.get_level_values(0).isin([0,1])
m2 = df.columns.get_level_values(1).isin([10,20])
print (m1)
[ True True True True True True True True True True True True]
print (m2)
[ True True True True False False True True True True False False]
print (m1 & m2)
[ True True True True False False True True True True False False]
df1 = df.loc[:, m1 & m2]
print (df1)
0 1
10 20 10 20
a b a b a b a b
0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1
The same mask still works after dropping the third level:
df.columns = df.columns.droplevel(2)
print (df)
0 1
10 10 20 20 30 30 10 10 20 20 30 30
0 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1 1 1
df2 = df.loc[:, m1 & m2]
print (df2)
0 1
10 10 20 20 10 10 20 20
0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1
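
To address the EDIT about ordering: a boolean mask always returns columns in their original order, but reindex with an explicitly constructed MultiIndex both selects and orders them; a sketch, assuming all labels of the third level should be kept:
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([[0, 1], [10, 20, 30], ['a', 'b']])
df = pd.DataFrame(1, columns=midx, index=np.arange(3))

# Columns come back in exactly this order: level 0 as [1, 0], level 1 as [20, 10]
wanted = pd.MultiIndex.from_product([[1, 0], [20, 10], ['a', 'b']])
df_ordered = df.reindex(columns=wanted)
print(df_ordered)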
