pandas if else only on specific rows - python-3.x

I have a pandas dataframe as below. I want to apply the following condition: only for rows where A == 2, update the columns 'C' and 'D' to -99. I have a function like the one below which updates the values of C and D to -99:
def func(df):
    for col in df.columns:
        if ("C" in col) or ("D" in col):
            df.loc[:, col] = -99
Now I just want to call that function if A == 2. I tried the code below, but it updates all the rows of C and D to -99:
import pandas as pd

data = [[0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0],
        [2, 0, 0, 0, 0],
        [2, 4, 0, 0, 0],
        [1, 8, 0, 0, 0],
        [3, 2, 0, 0, 0]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])

def func(df):
    for col in df.columns:
        if ("C" in col) or ("D" in col):
            df.loc[:, col] = -99

if (df['A'] == 2).any():
    func(df)
print(df)
My expected output:
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 -99 -99 0
3 2 4 -99 -99 0
4 1 8 0 0 0
5 3 2 0 0 0

You can do that by filtering:
df.loc[df['A'] == 2, ['C', 'D']] = -99
Here the first item in the indexer selects the rows: we keep only the rows where the value in column 'A' is 2. The second item selects the columns by a list of names ('C' and 'D'). We then assign -99 to the selected cells.
For the given sample data, we obtain:
>>> df = pd.DataFrame(data,columns=['A','B','C', 'D','E'])
>>> df
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 0 0 0
3 2 4 0 0 0
4 1 8 0 0 0
5 3 2 0 0 0
>>> df.loc[df['A'] == 2, ['C', 'D']] = -99
>>> df
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 -99 -99 0
3 2 4 -99 -99 0
4 1 8 0 0 0
5 3 2 0 0 0

Related

How do you maintain on Python the value of the last row in a column, like on excel?

I have looked around and haven't found an 'elegant' solution; it can't be that this is not doable.
What I need is a column ('col A') on a dataframe that is always 0; if the adjacent column ('col B') hits 1, the value changes to 1, and all further rows stay 1 (no matter what else happens in 'col B') until another column ('col C') hits 1, at which point 'col A' returns to 0, and so on. The data has thousands of rows, and it gets updated regularly.
Any ideas? I have tried shift, iloc and loops, but can't make it work.
The result should look something like this:
date col A col B col C
... 0 0 0
... 0 0 0
... 1 1 0
... 1 1 0
... 1 0 1
... 0 0 0
... 0 0 0
... 1 1 0
... 1 1 0
... 1 0 0
... 1 0 0
... 1 1 0
... 1 0 0
... 1 1 0
... 1 0 1
... 0 0 0
This is the base code I have been thinking about, but I can't get it to work:
df['B'] = df['A'].apply(lambda x: 1 if x == 1 else 0)
for i in range(1, len(df)):
    if df.loc[i, 'C'] == 1:
        df.loc[i, 'B'] = 0
    else:
        df.loc[i, 'B'] = df.loc[i-1, 'B']
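The behaviour described above is a set/reset latch: colB == 1 sets the flag, and colC == 1 clears it starting from the next row (the row where colC hits 1 still shows 1, as in the sample). A minimal sketch with a plain loop, using hypothetical column names colA/colB/colC (adjust to the real frame):

```python
import pandas as pd

# Hypothetical sample: colB sets the flag, colC clears it from the NEXT row on
df = pd.DataFrame({'colB': [0, 0, 1, 1, 0, 0, 1],
                   'colC': [0, 0, 0, 0, 1, 0, 0]})

flag = 0
values = []
for b, c in zip(df['colB'], df['colC']):
    if b == 1:
        flag = 1           # set: colB hit 1
    values.append(flag)    # the current row keeps the latched value
    if c == 1:
        flag = 0           # reset: takes effect from the next row
df['colA'] = values
print(df)
```

A loop is hard to avoid here because each row depends on the latched state of all previous rows; for thousands of rows this is still fast, and it can be moved into a plain function over NumPy arrays if needed.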

Call a function inside of an if-statement in Python


iterating over a list of columns in pandas dataframe

I have a dataframe like below. I want to update the values of columns C, D, E based on columns A and B: if column A < B, then C, D, E = A, else B. I tried the code below but I'm getting a ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
import pandas as pd

data = [[0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0],
        [2, 0, 0, 0, 0],
        [2, 4, 0, 0, 0],
        [1, 8, 0, 0, 0],
        [3, 2, 0, 0, 0]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])
df
Out[59]:
   A  B  C  D  E
0  0  1  0  0  0
1  1  2  0  0  0
2  2  0  0  0  0
3  2  4  0  0  0
4  1  8  0  0  0
5  3  2  0  0  0
list_1 = ['C', 'D', 'E']
for i in df[list_1]:
    if df['A'] < df['B']:
        df[i] = df['A']
    else:
        df['i'] = df['B']
I'm expecting the output below:
df
Out[59]:
A B C D E
0 0 1 0 0 0
1 1 2 1 1 1
2 2 0 0 0 0
3 2 4 2 2 2
4 1 8 1 1 1
5 3 2 2 2 2
Use np.where (returns elements chosen from A or B depending on a condition) together with df.assign (assigns new columns to a DataFrame, returning a new object with all original columns in addition to the new ones; existing columns that are re-assigned are overwritten):
nums = np.where(df.A < df.B, df.A, df.B)
df = df.assign(C=nums, D=nums, E=nums)
Use DataFrame.mask:
df.loc[:, df.columns != 'B'] = df.loc[:, df.columns != 'B'].mask(df['B'] > df['A'], df['A'], axis=0)
print(df)
A B C D E
0 0 1 0 0 0
1 1 2 1 1 1
2 2 0 0 0 0
3 2 4 2 2 2
4 1 8 1 1 1
5 3 2 0 0 0
Personally, I always use .apply to modify columns based on other columns:
list_1 = ['C', 'D', 'E']
for i in list_1:
    df[i] = df.apply(lambda x: x.A if x.A < x.B else x.B, axis=1)
I don't know what you are trying to achieve here, because the condition df['A'] < df['B'] will always produce the same result on every pass of your loop. Just for the sake of understanding: when you write if df['A'] < df['B']:, the if condition expects a single Boolean, but df['A'] < df['B'] gives a Series of Boolean values. That is why pandas tells you to use something like
if (df['A'] < df['B']).all():
OR
if (df['A'] < df['B']).any():
What I would do is create a DataFrame with only the columns 'A' and 'B', and then create column 'C' in the following way:
df['C'] = df.min(axis=1)
Columns 'D' and 'E' seem to be redundant.
If you have to start with all the columns and need to have all of them as output then you can do the following:
df['C'] = df[['A', 'B']].min(axis=1)
df['D'] = df['C']
df['E'] = df['C']
You can use numpy.where:
df.loc[:,'C':'E'] = np.where(df['A'] < df['B'], df['A'], df['B']).reshape(-1, 1)
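For reference, the np.where idea run end-to-end on the sample data from the question; here the computed row-wise minimum is simply assigned column by column:

```python
import numpy as np
import pandas as pd

data = [[0, 1, 0, 0, 0], [1, 2, 0, 0, 0], [2, 0, 0, 0, 0],
        [2, 4, 0, 0, 0], [1, 8, 0, 0, 0], [3, 2, 0, 0, 0]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])

# row-wise minimum of A and B, computed once and assigned to each target column
nums = np.where(df['A'] < df['B'], df['A'], df['B'])
for col in ['C', 'D', 'E']:
    df[col] = nums
print(df)
```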

Create new rows out of columns with multiple items in Python

I have the code below and I need to create a data frame similar to the picture attached. Thanks.
import pandas as pd

Product = [(100, 'Item1, Item2'),
           (101, 'Item1, Item3'),
           (102, 'Item4')]
labels = ['product', 'info']
ProductA = pd.DataFrame.from_records(Product, columns=labels)

Cust = [('A', 200),
        ('A', 202),
        ('B', 202),
        ('C', 200),
        ('C', 204),
        ('B', 202),
        ('A', 200),
        ('C', 204)]
labels = ['customer', 'product']
Cust1 = pd.DataFrame.from_records(Cust, columns=labels)
Merge with get_dummies (this answer uses dfA/dfB for the product and customer frames, and tags for the column of comma-separated items):
dfA.merge(dfB).set_index('customer').tags.str.get_dummies(', ').sum(level=0, axis=0)
Out[549]:
chocolate filled glazed sprinkles
customer
A 3 1 0 2
C 1 0 2 1
B 2 2 0 0
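Note that Series.sum(level=...) has been removed in recent pandas, and the product ids in the question's two frames don't actually overlap, so the snippet above cannot run on the posted data as-is. A self-contained sketch of the same merge + get_dummies idea, with hypothetical product ids that do match and groupby(level=0) in place of the level= argument:

```python
import pandas as pd

# Hypothetical data in the question's shape, with product ids that actually match
products = pd.DataFrame({'product': [100, 101, 102],
                         'info': ['Item1, Item2', 'Item1, Item3', 'Item4']})
customers = pd.DataFrame({'customer': ['A', 'A', 'B', 'C'],
                          'product': [100, 101, 101, 102]})

# one indicator column per item, then summed per customer
counts = (customers.merge(products, on='product')
                   .set_index('customer')['info']
                   .str.get_dummies(', ')
                   .groupby(level=0).sum())
print(counts)
```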
IIUC, this is possible with merge, split, melt and concat:
dfB = dfB.merge(dfA, on='product')
dfB = pd.concat([dfB.iloc[:,:-1], dfB.tags.str.split(',', expand=True)], axis=1)
dfB = dfB.melt(id_vars=['customer', 'product']).drop(columns = ['product', 'variable'])
dfB = pd.concat([dfB.customer, pd.get_dummies(dfB['value'])], axis=1)
dfB
Output:
customer filled sprinkles chocolate glazed
0 A 0 0 1 0
1 C 0 0 1 0
2 A 0 0 1 0
3 A 0 0 1 0
4 B 0 0 1 0
5 B 0 0 1 0
6 C 0 0 0 1
7 C 0 0 0 1
8 A 0 1 0 0
9 C 0 1 0 0
10 A 0 1 0 0
11 A 1 0 0 0
12 B 1 0 0 0
13 B 1 0 0 0

merging multiple columns in a dataframe

I have a data frame like this one:
dataf = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': ['c', 'c',np.nan]})
pd.get_dummies(dataf) gives:
A_a A_b B_a B_b B_c C_c
0 1 0 0 1 0 1
1 0 1 1 0 0 1
2 1 0 0 0 1 0
I want all common attributes of the dataframe to be in one column. Here, for attribute 'a' we have two columns, A_a and B_a. I want those combined into one column named 'a' whose values are the UNION of A_a and B_a, and the same for all similar attributes. It should look like:
a b c
0 1 1 1
1 1 1 1
2 1 0 1
In the original data I have hundreds of thousands of attributes across a million+ rows, so a generic solution is needed. Thanks.
You can add the parameters prefix and prefix_sep to get_dummies and then groupby the columns with sum:
import pandas as pd
import numpy as np

dataf = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': ['c', 'c', np.nan]})
print(dataf)
A B C
0 a b c
1 b a c
2 a c NaN
df = pd.get_dummies(dataf, prefix="", prefix_sep="")
print(df)
a b a b c c
0 1 0 0 1 0 1
1 0 1 1 0 0 1
2 1 0 0 0 1 0
print(df.groupby(df.columns, axis=1).sum())
a b c
0 1 1 1
1 1 1 1
2 1 0 1
EDIT from a comment, thank you John Galt:
If the values have length 1 (as in the sample):
df = pd.get_dummies(dataf)
print(df)
A_a A_b B_a B_b B_c C_c
0 1 0 0 1 0 1
1 0 1 1 0 0 1
2 1 0 0 0 1 0
print(df.groupby(df.columns.str[-1:], axis=1).any().astype(int))
a b c
0 1 1 1
1 1 1 1
2 1 0 1
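In recent pandas versions, groupby(..., axis=1) is deprecated. An equivalent that avoids it is to transpose, group the duplicate column labels on the index, and transpose back (a sketch on the same sample data):

```python
import numpy as np
import pandas as pd

dataf = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': ['c', 'c', np.nan]})
d = pd.get_dummies(dataf, prefix="", prefix_sep="")

# transpose so the duplicate column labels become the index, group them,
# take the max per group (the union of the indicators), transpose back
out = d.T.groupby(level=0).max().T.astype(int)
print(out)
```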
