Filter a dataframe with NOT and AND condition - python-3.x

I know this question has been asked multiple times, but for some reason it is not working in my case.
I want to filter the dataframe using a NOT and an AND condition together.
For example, my dataframe df looks like:
col1  col2
a     1
a     2
b     3
b     4
b     5
c     6
Now I want to use a condition that removes rows where col1 is "a" AND col2 is 2.
My resulting dataframe should look like:
col1  col2
a     1
b     3
b     4
b     5
c     6
I tried the following, but even though I used &, it removes all the rows that have "a" in col1:
df = df[(df['col1'] != "a") & (df['col2'] != "2")]

To remove rows where col1 is "a" AND col2 is 2 means to keep rows where col1 isn't "a" OR col2 isn't 2 (by De Morgan's law, the negation of A AND B is NOT(A) OR NOT(B)):
df = df[(df['col1'] != "a") | (df['col2'] != 2)] # or "2", depending on whether the `2` is an int or a str
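For reference, a minimal runnable check of the fix (the sample data is rebuilt here from the question); the equivalent negated form df[~((df['col1'] == 'a') & (df['col2'] == 2))] gives the same result:

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'col2': [1, 2, 3, 4, 5, 6]})

# Keep rows where NOT (col1 == "a" AND col2 == 2)
df = df[(df['col1'] != 'a') | (df['col2'] != 2)]
print(df)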

Related

Creating a column or list comprehension with multiple column conditions

I have a dataframe (sample) as under:
   col0  col1  col2  col3
0   101     3     5
1   102     6     2     1
2   103     2
3   104     4     6     4
4   105     8     3
5   106     1
6   107
Now I need to add two new columns (col4 and col5) to the same dataframe:
To bring in the latest value per row according to the priority col3 > col2 > col1:
If col3 has a value, col3; elif col2 has a value, col2; elif col1 has a value, col1; else "Invalid".
To know whether that row has 1/2/3 or no values in these columns:
If col3 has a value, 3; elif col2 has a value, 2; elif col1 has a value, 1; else 0.
I have written list comprehensions in the format [x1 if condition1 else x2 if condition2 else x3 for val in df['col']].
However, I do not understand how to loop through three columns in a single list comprehension.
Or is there some way other than a list comprehension to do this?
I tried this:
df['col4'] = [df['col3'] if df['col3'].notna() else df['col2'] if df['col2'].notna() else df['col1'] if df['col1'].notna() else "Invalid" for x in df['col0']]
df['col5'] = [3 if df['col3'].notna() else 2 if df['col2'].notna() else 1 if df['col1'].notna() else 0]
But they do not work.
One solution that I tried is below, but it requires four lines of code for each column:
df.loc[df['col1'].notna(),['col5']] = 1
df.loc[df['col2'].notna(),['col5']] = 2
df.loc[df['col3'].notna(),['col5']] = 3
df['col5'] = df['col5'].fillna(0)
Please suggest whether any other approach is possible.
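For what it's worth, one vectorized way to express this chain (a sketch, not from the thread; it assumes the blank cells are actually NaN) is numpy.select, which evaluates whole-column conditions in priority order, so the if/elif ladder is written once:

import numpy as np

# Conditions in priority order: col3, then col2, then col1
conds = [df['col3'].notna(), df['col2'].notna(), df['col1'].notna()]

df['col4'] = np.select(conds, [df['col3'], df['col2'], df['col1']], default='Invalid')
df['col5'] = np.select(conds, [3, 2, 1], default=0)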

Pandas: Create different dataframes from a unique multiIndex dataframe

I would like to know how to go from a multiindex dataframe like this:
      A          B
   col1 col2  col1 col2
0     1    2    12   21
1     3    1     2    0
To two separate dfs. df_A:
   col1  col2
0     1     2
1     3     1
df_B:
   col1  col2
0    12    21
1     2     0
Thank you for the help
I think it is better here to use DataFrame.xs for selecting by the first level:
print(df.xs('A', axis=1, level=0))
   col1  col2
0     1     2
1     3     1
What you need is not recommended, but it is possible to create DataFrames by group:
for i, g in df.groupby(level=0, axis=1):
    globals()['df_' + str(i)] = g.droplevel(level=0, axis=1)
print(df_A)
   col1  col2
0     1     2
1     3     1
It is better to create a dictionary of DataFrames:
d = {i: g.droplevel(level=0, axis=1) for i, g in df.groupby(level=0, axis=1)}
print(d['A'])
   col1  col2
0     1     2
1     3     1
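For reference, the sample frame above can be reconstructed like this (a sketch; the labels and values are taken from the question). Note that newer pandas releases deprecate grouping with axis=1, so the xs approach ages better:

import pandas as pd

# Two-level columns: ('A', 'col1'), ('A', 'col2'), ('B', 'col1'), ('B', 'col2')
cols = pd.MultiIndex.from_product([['A', 'B'], ['col1', 'col2']])
df = pd.DataFrame([[1, 2, 12, 21], [3, 1, 2, 0]], columns=cols)

print(df.xs('A', axis=1, level=0))   # selects the A block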

How to append a column to a dataframe with values from a specified column index [duplicate]

This question was closed as a duplicate of: Vectorized look-up of values in Pandas dataframe.
I have a dataframe in which one of the columns is used to designate which of the other columns has the specific value I'm looking for.
df = pd.DataFrame({'COL1': ['X', 'Y', 'Z'],
                   'COL2': ['A', 'B', 'C'],
                   'X_SORT': ['COL1', 'COL2', 'COL1']})
I'm trying to add a new column called 'X_SORT_VALUE' and assign it the value of the column identified in the X_SORT column.
df = df.assign(X_SORT_VALUE=lambda x: (x['X_SORT']))
But I'm getting the value of the X_SORT column:
  COL1 COL2 X_SORT X_SORT_VALUE
0    X    A   COL1         COL1
1    Y    B   COL2         COL2
2    Z    C   COL1         COL1
Rather than the value of the column it names, which is what I want:
  COL1 COL2 X_SORT X_SORT_VALUE
0    X    A   COL1            X
1    Y    B   COL2            B
2    Z    C   COL1            Z
You need df.lookup here:
df['X_SORT_VALUE'] = df.lookup(df.index, df['X_SORT'])
  COL1 COL2 X_SORT X_SORT_VALUE
0    X    A   COL1            X
1    Y    B   COL2            B
2    Z    C   COL1            Z
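Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On current pandas, a minimal equivalent sketch uses plain numpy indexing over the underlying array:

import numpy as np

rows = np.arange(len(df))
cols = df.columns.get_indexer(df['X_SORT'])   # positional index of each row's target column
df['X_SORT_VALUE'] = df.to_numpy()[rows, cols]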

Grouping corresponding Rows based on One column

I have an Excel-sheet dataframe with no fixed number of rows and columns, e.g.:
Col1  Col2  Col3
A     1     -
A     -     2
B     3     -
B     -     4
C     5     -
I would like to group rows that have the same content in Col1, like the following:
Col1  Col2  Col3
A     1     2
B     3     4
C     5     -
I am using pandas groupby, but I am not getting what I want.
Try using groupby; first() takes the first non-NaN value per column within each group:
import numpy as np
print(df.replace('-', np.nan).groupby('Col1', as_index=False).first().fillna('-'))
Output:
  Col1  Col2 Col3
0    A     1    2
1    B     3    4
2    C     5    -
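A self-contained version of the answer, with the sample frame reconstructed from the question (a sketch):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'B', 'B', 'C'],
                   'Col2': [1, '-', 3, '-', 5],
                   'Col3': ['-', 2, '-', 4, '-']})

# '-' placeholders become NaN so groupby(...).first() can skip them,
# then any NaN left in the result is turned back into '-'
out = df.replace('-', np.nan).groupby('Col1', as_index=False).first().fillna('-')
print(out)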

Pandas add empty rows as filler

Given the following data frame:
import pandas as pd
DF = pd.DataFrame({'COL1': ['A', 'A', 'B'],
                   'COL2': [1, 2, 1],
                   'COL3': ['X', 'Y', 'X']})
DF
  COL1  COL2 COL3
0    A     1    X
1    A     2    Y
2    B     1    X
I would like an additional row for COL1 = 'B' so that both COL1 values (A and B) are represented by the COL3 values X and Y, with a 0 in COL2 for the generated row.
The desired result is as follows:
  COL1  COL2 COL3
0    A     1    X
1    A     2    Y
2    B     1    X
3    B     0    Y
This is just a simplified example, but I need a calculation that could handle many such instances (not just inserting the row of interest manually).
Thanks in advance!
UPDATE:
For a generalized scenario with many different combinations of values in 'COL1' and 'COL3', this works, but it is probably not nearly as efficient as it could be:
#Get unique set of COL3
COL3SET = set(DF['COL3'])
#Get unique set of COL1
COL1SET = set(DF['COL1'])
#Get all possible combinations of unique sets
import itertools
COMB = []
for combination in itertools.product(COL1SET, COL3SET):
    COMB.append(combination)
#Create dataframe from new set:
UNQ = pd.DataFrame({'COMB':COMB})
#Split tuples into columns
new_col_list = ['COL1unq','COL3unq']
for n, col in enumerate(new_col_list):
    UNQ[col] = UNQ['COMB'].apply(lambda COMB: COMB[n])
UNQ = UNQ.drop('COMB',axis=1)
#Merge original data frame with unique set data frame
DF = pd.merge(DF,UNQ,left_on=['COL1','COL3'],right_on=['COL1unq','COL3unq'],how='outer')
#Fill in empty values of COL1 and COL3 where they did not have records
DF['COL1'] = DF['COL1unq']
DF['COL3'] = DF['COL3unq']
#Replace 'NaN's in column 2 with zeros
DF['COL2'].fillna(0, inplace=True)
#Get rid of COL1unq and COL3unq
DF.drop(['COL1unq','COL3unq'],axis=1, inplace=True)
DF
Something like this?
col1_b_vals = set(DF.loc[DF.COL1 == 'B', 'COL3'])
col1_not_b_col3_vals = set(DF.loc[DF.COL1 != 'B', 'COL3'])
missing_vals = col1_not_b_col3_vals.difference(col1_b_vals)
# .copy() avoids SettingWithCopyWarning on the assignments below
missing_rows = DF.loc[(DF.COL1 != 'B') & (DF.COL3.isin(missing_vals)), :].copy()
missing_rows['COL1'] = 'B'
missing_rows['COL2'] = 0
pd.concat([DF, missing_rows], ignore_index=True)
  COL1  COL2 COL3
0    A     1    X
1    A     2    Y
2    B     1    X
3    B     0    Y
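As an aside, the full COL1 x COL3 grid that the asker's UPDATE builds by hand can also be produced in a couple of lines (a sketch, not from either post; it assumes each (COL1, COL3) pair occurs at most once):

idx = pd.MultiIndex.from_product([DF['COL1'].unique(), DF['COL3'].unique()],
                                 names=['COL1', 'COL3'])

# Missing (COL1, COL3) pairs appear as new rows with COL2 filled as 0
out = (DF.set_index(['COL1', 'COL3'])
         .reindex(idx, fill_value=0)
         .reset_index()[['COL1', 'COL2', 'COL3']])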
