Creating a column or list comprehension with multiple column conditions

I have a dataframe (sample) as below:
   col0  col1  col2  col3
0   101     3     5
1   102     6     2     1
2   103     2
3   104     4     6     4
4   105     8     3
5   106     1
6   107
Now I need to add two new columns (col4 and col5) to the same dataframe:
1. col4: the latest value per row, by the priority col3 > col2 > col1:
if col3 has a value, take col3; elif col2 has a value, col2; elif col1 has a value, col1; else "Invalid".
2. col5: which of these columns the value came from:
if col3 has a value, 3; elif col2 has a value, 2; elif col1 has a value, 1; else 0.
I have written list comprehensions in the format [x1 if condition1 else x2 if condition2 else x3 for val in df['col']].
However, I do not understand how to check three columns in a single list comprehension.
Or is there some other way than a list comprehension to do this?
I tried this:
df['col4'] = [df['col3'] if df['col3'].notna() else df['col2'] if df['col2'].notna() else df['col1'] if df['col1'].notna() else "Invalid" for x in df['col0']]
df['col5'] = [3 if df['col3'].notna() else 2 if df['col2'].notna() else 1 if df['col1'].notna() else 0]
But they do not work.

One solution that I tried is below, but it requires four lines of code for each new column:
df.loc[df['col1'].notna(),['col5']] = 1
df.loc[df['col2'].notna(),['col5']] = 2
df.loc[df['col3'].notna(),['col5']] = 3
df['col5'] = df['col5'].fillna(0)
Please suggest if there is any other way.
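For reference, one vectorized approach (a sketch, not from the original thread) is np.select, which evaluates a list of boolean conditions in order and takes the first match per row, exactly the col3 > col2 > col1 priority described above:

import numpy as np
import pandas as pd

# sample frame from the question; blank cells become NaN
df = pd.DataFrame({
    'col0': [101, 102, 103, 104, 105, 106, 107],
    'col1': [3, 6, 2, 4, 8, 1, None],
    'col2': [5, 2, None, 6, 3, None, None],
    'col3': [None, 1, None, 4, None, None, None],
})

conditions = [df['col3'].notna(), df['col2'].notna(), df['col1'].notna()]

# col4: first non-missing value by priority, else "Invalid"
df['col4'] = np.select(conditions, [df['col3'], df['col2'], df['col1']], default='Invalid')

# col5: which column supplied the value (3/2/1), else 0
df['col5'] = np.select(conditions, [3, 2, 1], default=0)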

Related

Group by and drop duplicates in pandas dataframe

I have a pandas dataframe as below. I want to group by all three columns and retain the group with the max of col3.
import pandas as pd
df = pd.DataFrame({'col1':['A', 'A', 'A', 'A', 'B', 'B'], 'col2':['1', '1', '1', '1', '2', '3'], 'col3':['5', '5', '2', '2', '2', '3']})
df
  col1 col2 col3
0    A    1    5
1    A    1    5
2    A    1    2
3    A    1    2
4    B    2    2
5    B    3    3
My expected output:
  col1 col2 col3
0    A    1    5
1    A    1    5
4    B    2    2
5    B    3    3
I tried the code below, but it returns the last row of each group; instead I want to sort by col3 and keep the group with the max col3:
df.drop_duplicates(keep='last', subset=['col1','col2','col3'])
  col1 col2 col3
1    A    1    5
3    A    1    2
4    B    2    2
5    B    3    3
For example: here I want to drop the (A, 1, 2) group because 2 < 5, so I want to keep the group with col3 as 5.
df.sort_values(by=['col1', 'col2', 'col3'], ascending=False)
a_group = df.groupby(['col1', 'col2', 'col3'])
for name, group in a_group:
    group = group.reset_index(drop=True)
    print(group)
  col1 col2 col3
0    A    1    2
1    A    1    2
  col1 col2 col3
0    A    1    5
1    A    1    5
  col1 col2 col3
0    B    2    2
  col1 col2 col3
0    B    3    3
You can't group on all columns, since the column whose max you wish to retain has different values. Instead, don't include that column in the group and consider the others:
col_to_max = 'col3'
i = df.columns.difference([col_to_max])  # every column except col_to_max
out = df[df[col_to_max] == df.groupby(list(i))[col_to_max].transform('max')]
print(out)
  col1 col2 col3
0    A    1    5
1    A    1    5
4    B    2    2
5    B    3    3
So we can do:
out = df[df['col3'] == df.groupby(['col1', 'col2'])['col3'].transform('max')]
  col1 col2 col3
0    A    1    5
1    A    1    5
4    B    2    2
5    B    3    3
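Note that every column in the sample frame holds strings, so 'max' above compares lexicographically; it happens to agree with the numeric answer here, but for real numeric data it is safer to convert first (a sketch):

# convert before taking the group-wise max so the comparison is numeric
df['col3'] = df['col3'].astype(int)
out = df[df['col3'] == df.groupby(['col1', 'col2'])['col3'].transform('max')]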
I believe you can use groupby with nlargest(2); this keeps the two largest rows per group, which matches here because the max value occurs exactly twice. Also make sure that your 'col3' is numeric:
>>> df['col3'] = df['col3'].astype(int)
>>> df.groupby(['col1','col2'])['col3'].nlargest(2).reset_index().drop('level_2',axis=1)
  col1 col2 col3
0    A    1    5
1    A    1    5
2    B    2    2
3    B    3    3
You can get the index of rows that don't hold the max col3 value and the index of duplicated rows, then drop the intersection:
ind = df.assign(max=df.groupby("col1")["col3"].transform("max")).query("max != col3").index
ind2 = df[df.duplicated(keep=False)].index
df.drop(set(ind).intersection(ind2))
  col1 col2 col3
0    A    1    5
1    A    1    5
4    B    2    2
5    B    3    3

Pandas: Create different dataframes from an unique multiIndex dataframe

I would like to know how to go from a multiindex dataframe like this:
     A         B
  col1 col2 col1 col2
     1    2   12   21
     3    1    2    0
to two separate dataframes. df_A:
  col1 col2
     1    2
     3    1
df_B:
  col1 col2
    12   21
     2    0
Thank you for the help
I think it is better here to use DataFrame.xs for selecting by the first level:
print (df.xs('A', axis=1, level=0))
   col1  col2
0     1     2
1     3     1
What you need is not recommended, but it is possible to create the DataFrames by groups:
for i, g in df.groupby(level=0, axis=1):
    globals()['df_' + str(i)] = g.droplevel(level=0, axis=1)
print (df_A)
   col1  col2
0     1     2
1     3     1
Better is to create a dictionary of DataFrames:
d = {i: g.droplevel(level=0, axis=1) for i, g in df.groupby(level=0, axis=1)}
print (d['A'])
   col1  col2
0     1     2
1     3     1
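For reference, the example frame used above can be reproduced with pd.MultiIndex.from_product (a minimal sketch):

import pandas as pd

# two top-level groups (A, B), each with the same second-level columns
cols = pd.MultiIndex.from_product([['A', 'B'], ['col1', 'col2']])
df = pd.DataFrame([[1, 2, 12, 21], [3, 1, 2, 0]], columns=cols)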

groupby column in pandas

I am trying to group by a column's values in pandas, but I'm not getting the result I need.
Example:
Col1  Col2  Col3
A        1     2
B        5     6
A        3     4
C        7     8
A       11    12
B        9    10
-----
Result needed, grouping by Col1:
Col1  Col2    Col3
A     1,3,11  2,4,12
B     5,9     6,10
C     7       8
but I am getting this output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000025BEB4D6E50>
I can get this using Excel Power Query with a group-by and "count all rows" function, but I can't get the same with Python and pandas. Any help?
Try this:
(
    df
    .groupby('Col1')
    .agg(lambda x: ','.join(x.astype(str)))
    .reset_index()
)
it outputs:
  Col1    Col2    Col3
0    A  1,3,11  2,4,12
1    B     5,9    6,10
2    C       7       8
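The example can be reproduced end to end with (a sketch):

import pandas as pd

df = pd.DataFrame({
    'Col1': ['A', 'B', 'A', 'C', 'A', 'B'],
    'Col2': [1, 5, 3, 7, 11, 9],
    'Col3': [2, 6, 4, 8, 12, 10],
})

# join each group's values into one comma-separated string per column
out = df.groupby('Col1').agg(lambda x: ','.join(x.astype(str))).reset_index()
print(out)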
Very good. I created a solution for grouping between 0 and 0:
df[df['A'] != 0].groupby((df['A'] == 0).cumsum()).sum()
It will group the column values between the 0 markers and sum them.
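A runnable version of that pattern, assuming a single column 'A' in which 0 acts as the separator (a sketch):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 0, 3, 4, 0, 5]})

# (df['A'] == 0).cumsum() labels each run between zeros;
# filtering the zeros out first keeps only the values to be summed
sums = df[df['A'] != 0].groupby((df['A'] == 0).cumsum())['A'].sum()
print(sums)  # 3 (1+2), 7 (3+4), 5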

Filter a dataframe with NOT and AND condition

I know this question has been asked multiple times, but for some reason it is not working for my case.
So I want to filter the dataframe using the NOT and AND condition.
For example, my dataframe df looks like:
col1  col2
a        1
a        2
b        3
b        4
b        5
c        6
Now, I want to use a condition to remove rows where col1 has "a" AND col2 has 2.
My resulting dataframe should look like:
col1  col2
a        1
b        3
b        4
b        5
c        6
I tried this, but even though I used &, it removes all the rows which have "a" in col1:
df = df[(df['col1'] != "a") & (df['col2'] != "2")]
To remove rows where col1 is "a" AND col2 is 2 means to keep rows where col1 isn't "a" OR col2 isn't 2 (the negation of A AND B is NOT(A) OR NOT(B)):
df = df[(df['col1'] != "a") | (df['col2'] != 2)] # or "2", depending on whether the `2` is an int or a str
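Equivalently, you can negate the combined mask directly instead of applying De Morgan's law by hand (a sketch; col2 is assumed to hold ints here):

# keep every row except those where both conditions hold
df = df[~((df['col1'] == "a") & (df['col2'] == 2))]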

Fetch column value based on dynamic input

I have a dataframe with a column that contains, for each row, the name of the column satisfying certain conditions.
Say the columns of the dataframe are Index, Col1, Col2, Col3, Col_Name, where Col_Name holds either 'Col1', 'Col2', or 'Col3' for each row.
Now, in a new column, say Col_New, I want, for each row, the value of the column named in Col_Name; e.g. if Col_Name in the 5th row says 'Col1', then Col_New should hold the value of Col1 in the 5th row.
I am sorry I cannot post the code I am working on, hence this hypothetical example.
Obliged for any help, thanks.
IIUC you could use:
df['col_new'] = df.reset_index().apply(lambda x: df.at[x['index'], x['col_name']], axis=1)
Example:
cols = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(np.random.rand(10, 3), columns=cols)
df['Col_Name'] = np.random.choice(cols, 10)
print(df)
       Col1      Col2      Col3 Col_Name
0  0.833988  0.939254  0.256450     Col2
1  0.675909  0.609494  0.641944     Col3
2  0.877474  0.971299  0.218273     Col3
3  0.201189  0.265742  0.800580     Col2
4  0.397945  0.135153  0.941313     Col2
5  0.666252  0.697983  0.164768     Col2
6  0.863377  0.839421  0.601316     Col2
7  0.138975  0.731359  0.379258     Col3
8  0.412148  0.541033  0.197861     Col2
9  0.980040  0.506752  0.823274     Col3
df['Col_New'] = df.reset_index().apply(lambda x: df.at[x['index'], x['Col_Name']], axis=1)
[out]
       Col1      Col2      Col3 Col_Name   Col_New
0  0.833988  0.939254  0.256450     Col2  0.939254
1  0.675909  0.609494  0.641944     Col3  0.641944
2  0.877474  0.971299  0.218273     Col3  0.218273
3  0.201189  0.265742  0.800580     Col2  0.265742
4  0.397945  0.135153  0.941313     Col2  0.135153
5  0.666252  0.697983  0.164768     Col2  0.697983
6  0.863377  0.839421  0.601316     Col2  0.839421
7  0.138975  0.731359  0.379258     Col3  0.379258
8  0.412148  0.541033  0.197861     Col2  0.541033
9  0.980040  0.506752  0.823274     Col3  0.823274
Example 2 (based on integer col references)
cols = [1, 2, 3]
np.random.seed(0)
df = pd.DataFrame(np.random.rand(10, 3), columns=cols)
df[13] = np.random.choice(cols, 10)
print(df)
          1         2         3  13
0  0.548814  0.715189  0.602763   3
1  0.544883  0.423655  0.645894   3
2  0.437587  0.891773  0.963663   1
3  0.383442  0.791725  0.528895   3
4  0.568045  0.925597  0.071036   1
5  0.087129  0.020218  0.832620   1
6  0.778157  0.870012  0.978618   1
7  0.799159  0.461479  0.780529   2
8  0.118274  0.639921  0.143353   2
9  0.944669  0.521848  0.414662   3
Instead use:
df['Col_New'] = df.reset_index().apply(lambda x: df.at[int(x['index']), int(x[13])], axis=1)
          1         2         3  13   Col_New
0  0.548814  0.715189  0.602763   3  0.602763
1  0.544883  0.423655  0.645894   3  0.645894
2  0.437587  0.891773  0.963663   1  0.437587
3  0.383442  0.791725  0.528895   3  0.528895
4  0.568045  0.925597  0.071036   1  0.568045
5  0.087129  0.020218  0.832620   1  0.087129
6  0.778157  0.870012  0.978618   1  0.778157
7  0.799159  0.461479  0.780529   2  0.461479
8  0.118274  0.639921  0.143353   2  0.639921
9  0.944669  0.521848  0.414662   3  0.414662
Using the example DataFrame from Chris A, you could do it like this:
cols = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(np.random.rand(10, 3), columns=cols)
df['Col_Name'] = np.random.choice(cols, 10)
print(df)
df['Col_New'] = [df.loc[df.index[i], j] for i, j in enumerate(df.Col_Name)]
print(df)
Pandas has the function DataFrame.lookup for this. It also seems to need the same type of values in the column labels and the lookup column, so it is possible to convert both to strings:
np.random.seed(123)
cols = [1, 2, 3]
df = pd.DataFrame(np.random.randint(10, size=(5, 3)), columns=cols).rename(columns=str)
df['Col_Name'] = np.random.choice(cols, 5)
df['Col_New'] = df.lookup(df.index, df['Col_Name'].astype(str))
print(df)
   1  2  3  Col_Name  Col_New
0  2  2  6         3        6
1  1  3  9         2        3
2  6  1  0         1        6
3  1  9  0         1        1
4  0  9  3         1        0
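Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; on modern pandas an equivalent is positional indexing into the underlying array (a sketch):

import numpy as np

# map each row's target column label to its position, then pick one value per row
idx = df.columns.get_indexer(df['Col_Name'].astype(str))
df['Col_New'] = df.to_numpy()[np.arange(len(df)), idx]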
