Pandas If Statements (excel equivalent) - python-3.x

I'm trying to create a simple if statement in Pandas.
The excel version is as follows:
=IF(E2="ABC",C2,E2)
I'm stuck on how to assign it based on a string or partial string.
Here is what I have.
df['New Value'] = df['E'].map(lambda x: df['C'] if x == 'ABC' else df['E'])
I know I'm making a mistake here, as the outcome is the entire column's values in each cell.
Any help would be much appreciated!

Use np.where:
In [36]:
df = pd.DataFrame({'A':np.random.randn(5), 'B':0, 'C':np.arange(5),'D':1, 'E':['asdsa','ABC','DEF','ABC','DAS']})
df
Out[36]:
          A  B  C  D      E
0  0.831728  0  0  1  asdsa
1  0.734007  0  1  1    ABC
2 -1.032752  0  2  1    DEF
3  1.414198  0  3  1    ABC
4  1.042621  0  4  1    DAS
In [37]:
df['New Value'] = np.where(df['E'] == 'ABC', df['C'], df['E'])
df
Out[37]:
          A  B  C  D      E New Value
0  0.831728  0  0  1  asdsa     asdsa
1  0.734007  0  1  1    ABC         1
2 -1.032752  0  2  1    DEF       DEF
3  1.414198  0  3  1    ABC         3
4  1.042621  0  4  1    DAS       DAS
The syntax for np.where is:
np.where(<condition>, <value if True>, <value if False>)
So wherever the condition is True it returns the corresponding element of the second argument, and wherever it is False it returns the corresponding element of the third.
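Since the question also mentions matching on a partial string, here is a minimal sketch of that case. It assumes the same column names C and E as above and that "partial" means substring containment; the sample values are made up for illustration. Series.str.contains builds the boolean condition and np.where picks between the two columns:

```python
import numpy as np
import pandas as pd

# Illustrative data; only the column names C and E come from the question.
df = pd.DataFrame({'C': [0, 1, 2, 3],
                   'E': ['asdsa', 'ABC', 'xABCx', 'DEF']})

# Exact match: equivalent of =IF(E2="ABC",C2,E2)
df['Exact'] = np.where(df['E'] == 'ABC', df['C'], df['E'])

# Partial match: True wherever E contains the substring "ABC"
# (regex=False treats the pattern as a literal string)
contains_abc = df['E'].str.contains('ABC', regex=False)
df['Partial'] = np.where(contains_abc, df['C'], df['E'])

print(df)
```

Because the result mixes integers and strings, the new columns end up with object dtype.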

Related

value count for an attribute from the column when there are multiple values for the attribute

I am trying to count and visualize a Netflix dataset by the country column, but when I checked the dataset I found that some rows in the column contain multiple values for country.
The following is the code I use to count:
country_count=joint_data['country'].value_counts().sort_values(ascending=False)
country_count=pd.DataFrame(country_count)
topcountries=country_count[0:11]
topcountries.shape
I want to count each country in those rows individually to get the proper count per country.
You can split the country column on "," and then call .explode(). The next step is .groupby():
df = df['country'].apply(lambda x: x.split(',')).explode().to_frame()
print( df.groupby('country').agg('size') )
Prints:
country
Austria 1
Canada 1
Germany 1
India 2
United Kingdom 1
United States 1
dtype: int64
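The same idea as a self-contained sketch, assuming the countries are separated by commas that may be followed by a space. The sample rows are invented for illustration and are not taken from the Netflix dataset:

```python
import pandas as pd

# Made-up rows standing in for the 'country' column
joint_data = pd.DataFrame({'country': ['India',
                                       'United States, Canada',
                                       'India, Germany',
                                       'United Kingdom']})

country_count = (joint_data['country']
                 .str.split(',')      # each cell becomes a list of names
                 .explode()           # one row per country
                 .str.strip()         # drop the space left after each comma
                 .value_counts())     # counts, sorted in descending order

print(country_count.head(10))
```

Stripping whitespace matters here: without it, "Canada" and " Canada" would be counted as two different countries.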
You can compile all possible values from your 'country' column, make a set out of them and create a new column for each.
Then you can iterate over the rows and mark each column whose name appears in that row's 'country' value:
import pandas as pd
df = pd.DataFrame({"country":["A,B,C","A,D,E,F","G"]})
print(df)
# one new column per distinct country, initialised to None
for c in sorted(set(','.join(df["country"]).split(","))):
    df[c] = None
# write through df.loc: the row yielded by iterrows() is a copy
for idx, row in df.iterrows():
    df.loc[idx, row["country"].split(",")] = 1
print(df)
Output:
   country     A     B     C     D     E     F     G
0    A,B,C     1     1     1  None  None  None  None
1  A,D,E,F     1  None  None     1     1     1  None
2        G  None  None  None  None  None  None     1
If you'd rather have 0 instead of None, use df.fillna(0, inplace=True) to convert them:
# 0 instead of None
df.fillna(value=0, inplace=True)
print(df)
# print sums
for c in df.columns:
    if c == "country":
        continue
    print(f"{c} {df[c].sum()}")
Output:
   country  A  B  C  D  E  F  G
0    A,B,C  1  1  1  0  0  0  0
1  A,D,E,F  1  0  0  1  1  1  0
2        G  0  0  0  0  0  0  1
A 2
B 1
C 1
D 1
E 1
F 1
G 1
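A more compact alternative to the loop, sketched here on the same toy data: Series.str.get_dummies splits each string on a separator and returns one indicator column per distinct value, so the per-country totals are just the column sums. The names indicators, result and totals are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"country": ["A,B,C", "A,D,E,F", "G"]})

# One indicator column per distinct country
indicators = df["country"].str.get_dummies(sep=",")

# Attach the indicator columns to the original frame
result = df.join(indicators)
print(result)

# Number of rows in which each country appears
totals = indicators.sum()
print(totals)
```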

groupby and trim some rows based on condition

I have a data frame something like this:
df = pd.DataFrame({"ID":[1,1,2,2,2,3,3,3,3,3],
"IF_car":[1,0,0,1,0,0,0,1,0,1],
"IF_car_history":[0,0,0,1,0,0,0,1,0,1],
"observation":[0,0,0,1,0,0,0,2,0,3]})
I want to group by ID and trim rows within each group, based on the condition "IF_car_history" == 1. This is what I tried:
tried_df = df.groupby(['ID']).apply(lambda x: x.loc[:(x['IF_car_history'] == '1').idxmax(),:]).reset_index(drop = True)
Within each group, I want to drop the rows that come after ['IF_car_history'] == '1'.
expected output:
Thanks
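First build the mask m by comparing values with Series.eq, then use GroupBy.cumsum and compare the result with 0 (using ne) to flag the rows to keep, and finally filter by boolean indexing. Because it is necessary to remove the rows after the last 1, the values are reversed by slicing with [::-1] before the cumulative sum and reversed back afterwards.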
m = df['IF_car_history'].eq(1).iloc[::-1]
df1 = df[m.groupby(df['ID']).cumsum().ne(0).iloc[::-1]]
print (df1)
   ID  IF_car  IF_car_history  observation
2   2       0               0            0
3   2       1               1            1
5   3       0               0            0
6   3       0               0            0
7   3       1               1            2
8   3       0               0            0
9   3       1               1            3
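The reversal trick can be spelled out step by step. The following is a minimal sketch of the same technique on the question's data, written with an explicitly reversed frame; the names rev and keep are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                   "IF_car": [1, 0, 0, 1, 0, 0, 0, 1, 0, 1],
                   "IF_car_history": [0, 0, 0, 1, 0, 0, 0, 1, 0, 1],
                   "observation": [0, 0, 0, 1, 0, 0, 0, 2, 0, 3]})

# Walk each group from its last row upwards
rev = df.iloc[::-1]

# Running count of 1s seen so far (from the bottom of each group).
# It is 0 only for rows that lie below the last 1 of their group.
seen = rev["IF_car_history"].eq(1).astype(int).groupby(rev["ID"]).cumsum()
keep = seen.gt(0)

# Put the mask back in the original row order and filter
df1 = df[keep.reindex(df.index)]
print(df1)
```

Note that a group without any 1 (ID 1 here) is dropped entirely, because its running count never leaves 0.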

iterating over a list of columns in pandas dataframe

I have a dataframe like the one below. I want to update the values of columns C, D and E based on columns A and B.
If A < B, then C, D, E = A, else B. I tried the code below, but I'm getting this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
import pandas as pd
import math
import sys
import re
data=[[0,1,0,0, 0],
[1,2,0,0,0],
[2,0,0,0,0],
[2,4,0,0,0],
[1,8,0,0,0],
[3,2, 0,0,0]]
df = pd.DataFrame(data,columns=['A','B','C', 'D','E'])
df
Out[59]:
   A  B  C  D  E
0  0  1  0  0  0
1  1  2  0  0  0
2  2  0  0  0  0
3  2  4  0  0  0
4  1  8  0  0  0
5  3  2  0  0  0
list_1 = ['C', 'D', 'E']
for i in df[list_1]:
    if df['A'] < df['B']:
        df[i] = df['A']
    else:
        df['i'] = df['B']
I'm expecting below output:
df
Out[59]:
   A  B  C  D  E
0  0  1  0  0  0
1  1  2  1  1  1
2  2  0  0  0  0
3  2  4  2  2  2
4  1  8  1  1  1
5  3  2  2  2  2
np.where: return elements chosen from A or B depending on the condition.
df.assign: assign new columns to a DataFrame. It returns a new object with all original columns in addition to the new ones; existing columns that are re-assigned are overwritten.
import numpy as np
nums = np.where(df.A < df.B, df.A, df.B)
df = df.assign(C=nums, D=nums, E=nums)
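Use DataFrame.mask. Note that this only replaces values in rows where B > A; the other rows keep their existing values, so row 5 stays 0 in the output here rather than becoming 2 as in the expected output:
df.loc[:,df.columns != 'B']=df.loc[:,df.columns != 'B'].mask(df['B']>df['A'],df['A'],axis=0)
print(df)
   A  B  C  D  E
0  0  1  0  0  0
1  1  2  1  1  1
2  2  0  0  0  0
3  2  4  2  2  2
4  1  8  1  1  1
5  3  2  0  0  0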
Personally, I always use .apply to modify columns based on other columns:
list_1 = ['C', 'D', 'E']
for i in list_1:
    df[i] = df.apply(lambda x: x.A if x.A < x.B else x.B, axis=1)
I don't know what you are trying to achieve here, because the condition df['A'] < df['B'] gives the same result on every iteration of your loop. Just for the sake of understanding:
When you write if df['A'] < df['B']:
the if statement expects a single Boolean, but df['A'] < df['B'] gives a Series of Boolean values. So the error message tells you to use something like
if (df['A'] < df['B']).all():
OR
if (df['A'] < df['B']).any():
What I would do is I would only create a DataFrame with columns 'A' and 'B', and then create column 'C' in the following way:
df['C'] = df.min(axis=1)
Columns 'D' and 'E' seem to be redundant.
If you have to start with all the columns and need to have all of them as output then you can do the following:
df['C'] = df[['A', 'B']].min(axis=1)
df['D'] = df['C']
df['E'] = df['C']
You can use the function where in numpy:
df.loc[:,'C':'E'] = np.where(df['A'] < df['B'], df['A'], df['B']).reshape(-1, 1)
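To tie the answers together, here is a minimal runnable sketch on the question's data. It first shows why the if fails (a Series of booleans has no single truth value) and then fills C, D and E with the row-wise minimum of A and B, which is what "A if A < B else B" computes. The variable name smaller is only illustrative.

```python
import pandas as pd

data = [[0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0],
        [2, 0, 0, 0, 0],
        [2, 4, 0, 0, 0],
        [1, 8, 0, 0, 0],
        [3, 2, 0, 0, 0]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])

# The comparison is element-wise, so `if` cannot reduce it to True/False
try:
    if df['A'] < df['B']:
        pass
except ValueError as err:
    print("ValueError:", err)

# Row-wise minimum of A and B, i.e. A where A < B, otherwise B
smaller = df[['A', 'B']].min(axis=1)
for col in ['C', 'D', 'E']:
    df[col] = smaller

print(df)
```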

pandas if else only on specific rows

I have a pandas dataframe as shown below. I want to apply the following condition:
only for rows where A == 2, update columns 'C' and 'D' to -99.
I have a function like the one below, which updates the values of C and D to -99.
def func(df):
    for col in df.columns:
        if ("C" in col) or ("D" in col):
            df.loc[:,col] = -99
Now I just want to call that function if A == 2. I tried the code below, but it updates all the rows of C and D to -99.
import pandas as pd
import math
import sys
import re
data=[[0,1,0,0, 0],
[1,2,0,0,0],
[2,0,0,0,0],
[2,4,0,0,0],
[1,8,0,0,0],
[3,2, 0,0,0]]
df = pd.DataFrame(data,columns=['A','B','C', 'D','E'])
df
def func(df):
    for col in df.columns:
        if ("C" in col) or ("D" in col):
            df.loc[:,col] = -99
if (df['A'] == 2).any():
    func(df)
print(df)
My expected output:
   A  B    C    D  E
0  0  1    0    0  0
1  1  2    0    0  0
2  2  0  -99  -99  0
3  2  4  -99  -99  0
4  1  8    0    0  0
5  3  2    0    0  0
You can do that by filtering:
df.loc[df['A'] == 2, ['C', 'D']] = -99
Here the first indexer passed to .loc selects the rows: the boolean mask df['A'] == 2 keeps only the rows where column 'A' is 2. The second indexer selects the columns by a list of names ('C' and 'D'). We then assign -99 to the selected cells.
For the given sample data, we obtain:
>>> df = pd.DataFrame(data,columns=['A','B','C', 'D','E'])
>>> df
   A  B  C  D  E
0  0  1  0  0  0
1  1  2  0  0  0
2  2  0  0  0  0
3  2  4  0  0  0
4  1  8  0  0  0
5  3  2  0  0  0
>>> df.loc[df['A'] == 2, ['C', 'D']] = -99
>>> df
   A  B    C    D  E
0  0  1    0    0  0
1  1  2    0    0  0
2  2  0  -99  -99  0
3  2  4  -99  -99  0
4  1  8    0    0  0
5  3  2    0    0  0
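If the target columns have to be found by name, as in the question's func (any column whose name contains "C" or "D"), the same row filter can be combined with a computed column list. A minimal sketch; the variable names mask and cols are only illustrative:

```python
import pandas as pd

data = [[0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0],
        [2, 0, 0, 0, 0],
        [2, 4, 0, 0, 0],
        [1, 8, 0, 0, 0],
        [3, 2, 0, 0, 0]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])

# Rows to update: those where A equals 2
mask = df['A'] == 2

# Columns to update: names containing "C" or "D", as in func
cols = [col for col in df.columns if "C" in col or "D" in col]

# Assign only at the intersection of the selected rows and columns
df.loc[mask, cols] = -99
print(df)
```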

How to apply function to data frame column to created iterated column

I have IDs with system event times. I have grouped the event times by ID (individual systems) and made a new column, flag, whose value is 1 if eventtimes.diff() is greater than 1 day, else 0. Now that I have the flag, I am trying to write a function to apply with groupby('ID') so that a new column starts at 1 and keeps returning 1 for each row until the flag shows 1; then the new column goes up by 1, to 2, and keeps returning 2 until the flag shows 1 again.
I will apply this along with groupby('ID') since I need the new column to start over again at 1 for each ID.
I have tried the following:
def try(x):
    y = 1
    if row['flag']==0:
        y = y
    else:
        y += y+1
df['NewCol'] = df.groupby('ID')['flag'].apply(try)
I have tried differing variations of the above to no avail. Thanks in advance for any help you may provide.
Also, feel free to let me know if I messed up posting the question. Not sure if my title is great either.
Use boolean indexing for filtering, then cumcount and reindex, which is a much faster solution than a loopy apply.
I think you need to count only the 1s per group, and to fill the output with 1 wherever the flag is not 1:
df = pd.DataFrame({
'ID': ['a','a','a','a','b','b','b','b','b'],
'flag': [0,0,1,1,0,0,1,1,1]
})
df['new'] = (df[df['flag'] == 1].groupby('ID')['flag']
.cumcount()
.add(1)
.reindex(df.index, fill_value=1))
print (df)
  ID  flag  new
0  a     0    1
1  a     0    1
2  a     1    1
3  a     1    2
4  b     0    1
5  b     0    1
6  b     1    1
7  b     1    2
8  b     1    3
Detail:
#filter by condition
print (df[df['flag'] == 1])
ID flag
2 a 1
3 a 1
6 b 1
7 b 1
8 b 1
#count per group
print (df[df['flag'] == 1].groupby('ID')['flag'].cumcount())
2 0
3 1
6 0
7 1
8 2
dtype: int64
#add 1 for count from 1
print (df[df['flag'] == 1].groupby('ID')['flag'].cumcount().add(1))
2 1
3 2
6 1
7 2
8 3
dtype: int64
If you need to count the 0s instead, and fill -1 wherever the flag is not 0:
df['new'] = (df[df['flag'] == 0].groupby('ID')['flag']
.cumcount()
.add(1)
.reindex(df.index, fill_value=-1))
print (df)
  ID  flag  new
0  a     0    1
1  a     0    2
2  a     1   -1
3  a     1   -1
4  b     0    1
5  b     0    2
6  b     1   -1
7  b     1   -1
8  b     1   -1
Another two-step solution:
df['new'] = df[df['flag'] == 1].groupby('ID')['flag'].cumcount().add(1)
df['new'] = df['new'].fillna(1).astype(int)
print (df)
ID flag new
0 a 0 1
1 a 0 1
2 a 1 1
3 a 1 2
4 b 0 1
5 b 0 1
6 b 1 1
7 b 1 2
8 b 1 3
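Read literally, the question describes a counter that starts at 1 for every ID and goes up by one each time the flag is 1, keeping that value on the following rows. That reading is an assumption and gives different numbers from the outputs above; a minimal sketch of it uses a grouped cumulative sum:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
    'flag': [0, 0, 1, 1, 0, 0, 1, 1, 1]
})

# Running total of flags within each ID, shifted so it starts from 1
df['NewCol'] = df.groupby('ID')['flag'].cumsum().add(1)
print(df)
```

The cumulative sum restarts for every ID, so no explicit loop or apply is needed.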
