Python pandas sort dataframe by enum class values

If I have an enum class:
from enum import Enum
class Colors(Enum):
    RED = 1
    ORANGE = 2
    GREEN = 3
And I have a dataframe with a color column (the values may also be lowercase):
>>> import pandas as pd
>>> df = pd.DataFrame({'X':['A', 'B', 'C', 'A'], 'color' : ['GREEN', 'RED', 'ORANGE', 'ORANGE']})
>>> df
X color
0 A GREEN
1 B RED
2 C ORANGE
3 A ORANGE
How can I make the color column a categorical type that respects the Colors class values, and sort the dataframe by "color" and "X" (ascending)?
For example, the dataframe above should be sorted as:
X, color
--------
B, RED
A, ORANGE
C, ORANGE
A, GREEN

Combination of this answer and this one: use a pd.Categorical to sort by the Colors class (with a slight edit that overrides its __str__ so that str(member) returns the member name):
from enum import Enum
import pandas as pd
df = pd.DataFrame({'X':['A', 'B', 'C', 'A'], 'color' : ['GREEN', 'RED', 'ORANGE', 'ORANGE']})
class Colors(Enum):
    RED = 1
    ORANGE = 2
    GREEN = 3
    def __str__(self):
        return self.name
df['color'] = pd.Categorical(df['color'], [str(i) for i in Colors], ordered=True)
df = df.sort_values(['color','X'])
Result:
X color
1 B RED
3 A ORANGE
2 C ORANGE
0 A GREEN
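The question mentions that the color strings may be lowercase. A minimal sketch of one way to handle that, assuming it is acceptable to upper-case the column before building the categorical; the mixed-case input frame is hypothetical, and the categories are taken from the enum member names ordered by value rather than from a custom __str__:

```python
from enum import Enum

import pandas as pd


class Colors(Enum):
    RED = 1
    ORANGE = 2
    GREEN = 3


# Hypothetical input with mixed-case color names.
df = pd.DataFrame({'X': ['A', 'B', 'C', 'A'],
                   'color': ['green', 'Red', 'ORANGE', 'orange']})

# Member names ordered by their enum value.
categories = [c.name for c in sorted(Colors, key=lambda c: c.value)]

# Normalize the case first, then convert to an ordered categorical.
df['color'] = pd.Categorical(df['color'].str.upper(),
                             categories=categories, ordered=True)
result = df.sort_values(['color', 'X'])
print(result)
```

Any value that does not match a member name after upper-casing becomes NaN in the categorical, so it may be worth checking for missing values before sorting.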

Use getattr:
df["value"] = df["color"].apply(lambda x: getattr(Colors, x).value)
df.sort_values(by=['value',"X"])
Output:
X color value
1 B RED 1
3 A ORANGE 2
2 C ORANGE 2
0 A GREEN 3
In one line (and without creating a value column). Plain Enum members do not support ordering comparisons, so sort on their .value:
df.loc[pd.concat([df["X"], df["color"].apply(lambda x: getattr(Colors, x).value)], axis=1).sort_values(by=['color',"X"]).index]
Output:
X color
1 B RED
3 A ORANGE
2 C ORANGE
0 A GREEN
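Another option that avoids both the helper column and the concat is the key parameter of sort_values (available since pandas 1.1). This is a sketch rather than part of the original answer; the enum_rank helper is a hypothetical name, and it relies on sort_values calling the key function once per sort column:

```python
from enum import Enum

import pandas as pd


class Colors(Enum):
    RED = 1
    ORANGE = 2
    GREEN = 3


df = pd.DataFrame({'X': ['A', 'B', 'C', 'A'],
                   'color': ['GREEN', 'RED', 'ORANGE', 'ORANGE']})


def enum_rank(column):
    """Map the color column to enum values; leave other columns unchanged."""
    if column.name == 'color':
        return column.map(lambda name: Colors[name].value)
    return column


# The key only affects the sort order; the stored strings are untouched.
result = df.sort_values(by=['color', 'X'], key=enum_rank)
print(result)
```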

Related

Given a column value, check if another column value is present in preceding or next 'n' rows in a Pandas data frame

I have the following data
jsonDict = {'Fruit': ['apple', 'orange', 'apple', 'banana', 'orange', 'apple','banana'], 'price': [1, 2, 1, 3, 2, 1, 3]}
Fruit price
0 apple 1
1 orange 2
2 apple 1
3 banana 3
4 orange 2
5 apple 1
6 banana 3
What I want to do is check if Fruit == banana and if yes, I want the code to scan the preceding as well as the next n rows from the index position of the 'banana' row, for an instance where Fruit == apple. An example of the expected output is shown below taking n=2.
Fruit price
2 apple 1
5 apple 1
I have tried doing
position = df[df['Fruit'] == 'banana'].index
resultdf= df.loc[((df.index).isin(position)) & (((df['Fruit'].index+2).isin(['apple']))|((df['Fruit'].index-2).isin(['apple'])))]
# Output is an empty dataframe
Empty DataFrame
Columns: [Fruit, price]
Index: []
Preference will be given to vectorized approaches.
IIUC, you can use 2 masks and boolean indexing:
# df = pd.DataFrame(jsonDict)
n = 2
m1 = df['Fruit'].eq('banana')
# is the row ±n of a banana?
m2 = m1.rolling(2*n+1, min_periods=1, center=True).max().eq(1)
# is the row an apple?
m3 = df['Fruit'].eq('apple')
out = df[m2&m3]
output:
Fruit price
2 apple 1
5 apple 1
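For comparison, the "within n rows of a banana" mask can also be built by OR-ing shifted copies of the banana mask. This is a sketch of an alternative to the rolling window, not the answer's own method; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['apple', 'orange', 'apple', 'banana', 'orange', 'apple', 'banana'],
    'price': [1, 2, 1, 3, 2, 1, 3],
})
n = 2

is_banana = df['Fruit'].eq('banana')

# Mark every row that lies 1..n positions before or after a banana row.
near_banana = pd.Series(False, index=df.index)
for k in range(1, n + 1):
    near_banana |= is_banana.shift(k, fill_value=False)
    near_banana |= is_banana.shift(-k, fill_value=False)

out = df[near_banana & df['Fruit'].eq('apple')]
print(out)
```

The loop runs n times regardless of the number of rows, so each step stays vectorized.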

Python, pandas dataframe, groupby column and known in advance values

Consider this example:
>>> import pandas as pd
>>> df = pd.DataFrame(
... [
... ['X', 'R', 1],
... ['X', 'G', 2],
... ['X', 'R', 1],
... ['X', 'B', 3],
... ['X', 'R', 2],
... ['X', 'B', 2],
... ['X', 'G', 1],
... ],
... columns=['client', 'status', 'cnt']
... )
>>> df
client status cnt
0 X R 1
1 X G 2
2 X R 1
3 X B 3
4 X R 2
5 X B 2
6 X G 1
>>>
>>> df_gb = df.groupby(['client', 'status']).cnt.sum().unstack()
>>> df_gb
status B G R
client
X 5 3 4
>>>
>>> def color(row):
...     if 'R' in row:
...         red = row['R']
...     else:
...         red = 0
...     if 'B' in row:
...         blue = row['B']
...     else:
...         blue = 0
...     if 'G' in row:
...         green = row['G']
...     else:
...         green = 0
...     if red > 0:
...         return 'red'
...     elif blue > 0 and (red + green) == 0:
...         return 'blue'
...     elif green > 0 and (red + blue) == 0:
...         return 'green'
...     else:
...         return 'orange'
...
>>> df_gb.apply(color, axis=1)
client
X red
dtype: object
>>>
What this code does is group the rows in order to get the counts for each category (red, green, blue).
Then apply is used to implement the logic that determines the color of each client (in this case there is only one).
The problem is that the groupby result can contain any combination of RGB values.
For example, I can have R and G columns but not B, or just an R column, or none of the RGB columns.
Because of that, in the apply function I had to introduce if statements for each column in order to have a count for each color whether or not it is present in the groupby result.
Is there any other way to enforce the logic of the color function, using something other than apply in such an (ugly) way?
For example, in this case I know in advance that I need counts for exactly three categories: R, G and B. I need something like a group by column and these three values.
Can I group the dataframe by these three categories (series, dict, function?) and always get zero or a sum for all three categories, whether or not they exist in the group?
Use:
#changed data for more combinations
df = pd.DataFrame(
[
['W', 'R', 1],
['X', 'G', 2],
['Y', 'R', 1],
['Y', 'B', 3],
['Z', 'R', 2],
['Z', 'B', 2],
['Z', 'G', 1],
],
columns=['client', 'status', 'cnt']
)
print (df)
client status cnt
0 W R 1
1 X G 2
2 Y R 1
3 Y B 3
4 Z R 2
5 Z B 2
6 Z G 1
Then the fill_value=0 parameter is added to replace missing (non-matched) values with 0:
df_gb = df.groupby(['client', 'status']).cnt.sum().unstack(fill_value=0)
#alternative
df_gb = df.pivot_table(index='client',
                       columns='status',
                       values='cnt',
                       aggfunc='sum',
                       fill_value=0)
print (df_gb)
status B G R
client
W 0 0 1
X 0 2 0
Y 3 0 1
Z 2 1 2
Instead of a function, create a helper DataFrame with all combinations of 0 and 1, and add a new column holding the output label for each combination:
from itertools import product
df1 = pd.DataFrame(product([0,1], repeat=3), columns=['R','G','B'])
#change the labels as needed
df1['output'] = ['no','blue','green','color2','red','red1','red2','all']
print (df1)
R G B output
0 0 0 0 no
1 0 0 1 blue
2 0 1 0 green
3 0 1 1 color2
4 1 0 0 red
5 1 0 1 red1
6 1 1 0 red2
7 1 1 1 all
Then DataFrame.clip is used to replace values above 1 with 1:
print (df_gb.clip(upper=1))
status B G R
client
W 0 0 1
X 0 1 0
Y 1 0 1
Z 1 1 1
Finally, DataFrame.merge adds the new output column. No on parameter is given, so the join uses the columns common to both DataFrames, here R, G and B (note that the client index is not preserved by the merge):
df2 = df_gb.clip(upper=1).merge(df1)
print (df2)
B G R output
0 0 0 1 red
1 0 1 0 green
2 1 0 1 red1
3 1 1 1 all
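The question also asks how to always get all three known categories even when one never occurs in the data. A minimal sketch, assuming reindex on the columns is an acceptable way to do that; the sample frame (with no G rows at all) is hypothetical, and np.select is swapped in here to reproduce the branching of the original color function instead of the merge shown above:

```python
import numpy as np
import pandas as pd

# Hypothetical data in which status 'G' never appears.
df = pd.DataFrame(
    [['W', 'R', 1], ['X', 'B', 3], ['Y', 'R', 2], ['Y', 'B', 1]],
    columns=['client', 'status', 'cnt'],
)

# Force exactly the three known categories; absent ones are filled with 0.
df_gb = (df.groupby(['client', 'status'])['cnt'].sum()
           .unstack(fill_value=0)
           .reindex(columns=['R', 'G', 'B'], fill_value=0))

red, green, blue = df_gb['R'], df_gb['G'], df_gb['B']

# Same branch order as the color function in the question.
colors = pd.Series(
    np.select(
        [red > 0,
         (blue > 0) & ((red + green) == 0),
         (green > 0) & ((red + blue) == 0)],
        ['red', 'blue', 'green'],
        default='orange',
    ),
    index=df_gb.index,
)
print(colors)
```

Unlike the merge, this keeps the client index, because the result is built directly on df_gb.index.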

Compare returns incorrect result in pandas data frame [duplicate]

How do I add a color column to the following dataframe so that color='green' if Set == 'Z', and color='red' otherwise?
Type Set
1 A Z
2 B Z
3 B X
4 C Y
If you only have two choices to select from:
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)
yields
Set Type color
0 Z A green
1 Z B green
2 X B red
3 Y C red
If you have more than two conditions then use np.select. For example, if you want color to be
yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
otherwise purple when (df['Type'] == 'B')
otherwise black,
then use
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
(df['Set'] == 'Z') & (df['Type'] == 'A'),
(df['Set'] == 'Z') & (df['Type'] == 'B'),
(df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)
which yields
Set Type color
0 Z A yellow
1 Z B blue
2 X B purple
3 Y C black
List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.
Example list comprehension:
df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit tests:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')
1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop
The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.
Simple example using just the "Set" column:
def set_color(row):
    if row["Set"] == "Z":
        return "red"
    else:
        return "green"
df = df.assign(color=df.apply(set_color, axis=1))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B green
3 Y C green
Example with more colours and more columns taken into account:
def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"
df = df.assign(color=df.apply(set_color, axis=1))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B green
3 Y C blue
Edit (21/06/2019): Using plydata
It is also possible to use plydata for this kind of thing (this seems even slower than using assign and apply, though).
from plydata import define, if_else
Simple if_else:
df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B green
3 Y C green
Nested if_else:
df = define(df, color=if_else(
'Set=="Z"',
'"red"',
if_else('Type=="C"', '"green"', '"blue"')))
print(df)
Set Type color
0 Z A red
1 Z B red
2 X B blue
3 Y C green
Another way in which this could be achieved is
df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')
Here's yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:
def map_values(row, values_dict):
    return values_dict[row]
values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})
df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))
What's it look like:
df
Out[2]:
INDICATOR VALUE NEW_VALUE
0 A 10 1
1 B 9 2
2 C 8 3
3 D 7 4
This approach can be very powerful when you have many ifelse-type statements to make (i.e. many unique values to replace).
And of course you could always do this:
df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)
But that approach is more than three times as slow as the apply approach from above, on my machine.
And you could also do this, using dict.get:
df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]
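One detail worth illustrating is what happens when a value has no entry in the dictionary: Series.map yields NaN for it, which can then be replaced with a default. A small sketch, where the extra 'E' row and the -1 default are hypothetical:

```python
import pandas as pd

values_dict = {'A': 1, 'B': 2, 'C': 3}

# 'E' deliberately has no entry in values_dict.
df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'E']})

# Keys missing from the dictionary become NaN.
mapped = df['INDICATOR'].map(values_dict)

# Replace the NaN with a default and restore an integer dtype.
df['NEW_VALUE'] = mapped.fillna(-1).astype(int)
print(df)
```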
You can simply use the powerful .loc method and use one condition or several depending on your need (tested with pandas=1.0.5).
Code Summary:
df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"
#practice!
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"
Explanation:
df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
# df so far:
Type Set
0 A Z
1 B Z
2 B X
3 C Y
Add a 'Color' column and set all values to "red":
df['Color'] = "red"
Apply your single condition:
df.loc[(df['Set']=="Z"), 'Color'] = "green"
# df:
Type Set Color
0 A Z green
1 B Z green
2 B X red
3 C Y red
or multiple conditions if you want:
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"
You can read on Pandas logical operators and conditional selection here:
Logical operators for boolean indexing in Pandas
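When combining conditions as above, keep in mind that & binds more tightly than | in Python, so a & b | c is evaluated as (a & b) | c. A short sketch on the same data showing how explicit parentheses change the result; the mask names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

is_z = df['Set'] == 'Z'
is_b = df['Type'] == 'B'
is_c = df['Type'] == 'C'

# Without extra parentheses, & is applied before |.
implicit = is_z & is_b | is_c
grouped_and = (is_z & is_b) | is_c

# Grouping the | first gives a different mask.
grouped_or = is_z & (is_b | is_c)

print(implicit.tolist())
print(grouped_or.tolist())
```

Writing the grouping explicitly makes the intent clear to readers even when the default precedence happens to be what you want.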
You can use pandas methods where and mask:
df['color'] = 'green'
df['color'] = df['color'].where(df['Set']=='Z', other='red')
# Replace values where the condition is False
or
df['color'] = 'red'
df['color'] = df['color'].mask(df['Set']=='Z', other='green')
# Replace values where the condition is True
Alternatively, you can use the method transform with a lambda function:
df['color'] = df['Set'].transform(lambda x: 'green' if x == 'Z' else 'red')
Output:
Type Set color
1 A Z green
2 B Z green
3 B X red
4 C Y red
Performance comparison from @chai:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000})
%timeit df['color1'] = 'red'; df['color1'].where(df['Set']=='Z','green')
%timeit df['color2'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color3'] = np.where(df['Set']=='Z', 'red', 'green')
%timeit df['color4'] = df.Set.map(lambda x: 'red' if x == 'Z' else 'green')
397 ms ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
976 ms ± 241 ms per loop
673 ms ± 139 ms per loop
796 ms ± 182 ms per loop
If you have only 2 choices, use np.where():
df = pd.DataFrame({'A':range(3)})
df['B'] = np.where(df.A>2, 'yes', 'no')
If you have more than 2 choices, apply() could work.
Input:
arr = pd.DataFrame({'A':list('abc'), 'B':range(3), 'C':range(3,6), 'D':range(6, 9)})
and arr is:
A B C D
0 a 0 3 6
1 b 1 4 7
2 c 2 5 8
If you want column E to be arr.B if arr.A == 'a', else arr.C if arr.A == 'b', else arr.D if arr.A == 'c', else something else:
arr['E'] = arr.apply(lambda x: x['B'] if x['A']=='a' else(x['C'] if x['A']=='b' else(x['D'] if x['A']=='c' else 1234)), axis=1)
and finally arr is:
A B C D E
0 a 0 3 6 0
1 b 1 4 7 4
2 c 2 5 8 8
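The same column-picking logic can be written without a row-wise apply by passing whole columns as the choices of np.select. This is a sketch of that alternative on the same arr frame, not the answer's own approach:

```python
import numpy as np
import pandas as pd

arr = pd.DataFrame({'A': list('abc'), 'B': range(3),
                    'C': range(3, 6), 'D': range(6, 9)})

# One condition per branch, checked in order.
conditions = [arr['A'] == 'a', arr['A'] == 'b', arr['A'] == 'c']

# The choices are entire columns, so each row takes its value from the matching column.
choices = [arr['B'], arr['C'], arr['D']]

arr['E'] = np.select(conditions, choices, default=1234)
print(arr)
```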
A one-liner with the .apply() method:
df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')
After that, df data frame looks like this:
>>> print(df)
Type Set color
0 A Z green
1 B Z green
2 B X red
3 C Y red
The case_when function from pyjanitor is a wrapper around pd.Series.mask and offers a chainable/convenient form for multiple conditions:
For a single condition:
df.case_when(
df.Set == "Z", # condition
"green", # value if True
"red", # value if False
column_name = "color"
)
Type Set color
1 A Z green
2 B Z green
3 B X red
4 C Y red
For multiple conditions:
df.case_when(
df.Set.eq('Z') & df.Type.eq('A'), 'yellow', # condition, result
df.Set.eq('Z') & df.Type.eq('B'), 'blue', # condition, result
df.Type.eq('B'), 'purple', # condition, result
'black', # default if none of the conditions evaluate to True
column_name = 'color'
)
Type Set color
1 A Z yellow
2 B Z blue
3 B X purple
4 C Y black
More examples can be found here
If you're working with massive data, a memoized approach would be best:
# First create a dictionary of manually stored values
color_dict = {'Z':'red'}
# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}
# Next, merge the two
color_dict.update(color_dict_other)
# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)
This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when data_size > 10**4 and n_distinct < data_size/4.
E.g., memoize for 10,000 rows with 2,500 or fewer distinct values.
A less verbose approach using np.select:
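The idea behind the memoized approach is that the rule is evaluated once per distinct value instead of once per row. A small sketch that makes this visible by recording calls; pick_color is a hypothetical stand-in for a more expensive rule, and the frame size is arbitrary:

```python
import pandas as pd

df = pd.DataFrame({'Set': list('ZZXY') * 1000})

calls = []  # records every value the rule is evaluated for


def pick_color(value):
    """Hypothetical stand-in for a costly per-value rule."""
    calls.append(value)
    return 'red' if value == 'Z' else 'green'


# Evaluate the rule once per distinct value, then broadcast with map.
lookup = {value: pick_color(value) for value in df['Set'].unique()}
df['color'] = df['Set'].map(lookup)

print(len(df), len(calls))
```

Here the frame has 4,000 rows but the rule runs only for the 3 distinct values of Set.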
a = np.array([['A','Z'],['B','Z'],['B','X'],['C','Y']])
df = pd.DataFrame(a,columns=['Type','Set'])
conditions = [
df['Set'] == 'Z'
]
outputs = [
'Green'
]
# conditions Z is Green, Red Otherwise.
res = np.select(conditions, outputs, 'Red')
res
array(['Green', 'Green', 'Red', 'Red'], dtype='<U5')
df.insert(2, 'new_column',res)
df
Type Set new_column
0 A Z Green
1 B Z Green
2 B X Red
3 C Y Red
df.to_numpy()
array([['A', 'Z', 'Green'],
['B', 'Z', 'Green'],
['B', 'X', 'Red'],
['C', 'Y', 'Red']], dtype=object)
%%timeit conditions = [df['Set'] == 'Z']
outputs = ['Green']
np.select(conditions, outputs, 'Red')
134 µs ± 9.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
df2 = pd.DataFrame({'Type':list('ABBC')*1000000, 'Set':list('ZZXY')*1000000})
%%timeit conditions = [df2['Set'] == 'Z']
outputs = ['Green']
np.select(conditions, outputs, 'Red')
188 ms ± 26.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here is an easy one-liner you can use when you have one or several conditions:
df['color'] = np.select(condlist=[df['Set']=="Z", df['Set']=="Y"], choicelist=["green", "yellow"], default="red")
Easy and good to go!
See more here: https://numpy.org/doc/stable/reference/generated/numpy.select.html

Concatenate two rows based on the same value in the next row of a new column

I am creating a new column and trying to concatenate the rows where the column value is the same. The 1st row would have the initial value in that row, and the 2nd row would have the values of the 1st and 2nd rows. I have been able to make it work where the column has two values, but if the column has 3 or more values, only two values are concatenated in the final row.
data={ 'Fruit':['Apple','Apple','Mango','Mango','Mango','Watermelon'],
'Color':['Red','Green','Yellow','Green','Orange','Green']
}
df = pd.DataFrame(data)
df['length']=df['Fruit'].str.len()
df['Fruit_color']=df['Fruit']+df['length'].map(lambda x: ' '*x)
df['same_fruit']=np.where(df['Fruit']!=df['Fruit'].shift(1),df['Fruit_color'],df['Fruit_color'].shift(1)+" "+df['Fruit_color'])
Current output:
How do I get the expected output?
Below is the output that I am expecting:
Regards,
Ren.
Here is an answer:
In [1]:
import pandas as pd
data={ 'Fruit':['Apple','Apple','Mango','Mango','Mango','Watermelon'],
'Color':['Red','Green','Yellow','Green','Orange','Green']
}
df = pd.DataFrame(data)
df['length']=df['Fruit'].str.len()
df['Fruit_color']=df['Fruit'] + ' ' + df['Color']
df.sort_values(by=['Fruit_color'], inplace=True)
## Get the maximum of fruit occurrence
maximum = df[['Fruit', 'Color']].groupby(['Fruit']).count().max().tolist()[0]
## Iter shift as many times as the highest occurrence
new_cols = []
for i in range(maximum):
    temporary_col = 'Fruit_' + str(i)
    df[temporary_col] = df['Fruit'].shift(i+1)
    new_col = 'new_col_' + str(i)
    df[new_col] = df['Fruit_color'].shift(i+1)
    df.loc[df[temporary_col] != df['Fruit'], new_col] = ''
    df.drop(columns=[temporary_col], axis=1, inplace=True)
    new_cols.append(new_col)
## Use this shifted columns to create `same fruit` and drop useless columns
df['same_fruit'] = df['Fruit_color']
for col in new_cols:
    df['same_fruit'] = df['same_fruit'] + ' ' + df[col]
    df.drop(columns=[col], axis=1, inplace=True)
Out [1]:
Fruit Color length Fruit_color same_fruit
1 Apple Green 5 Apple Green Apple Green
0 Apple Red 5 Apple Red Apple Red Apple Green
3 Mango Green 5 Mango Green Mango Green
4 Mango Orange 5 Mango Orange Mango Orange Mango Green
2 Mango Yellow 5 Mango Yellow Mango Yellow Mango Orange Mango Green
5 Watermelon Green 10 Watermelon Green Watermelon Green
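If the goal is a running concatenation within each fruit in the original row order, as the question's description suggests (the expected output itself is not shown, so this is an assumption), a grouped cumulative join may be simpler than shifting columns. A sketch using itertools.accumulate; running_join is a hypothetical helper name:

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Apple', 'Mango', 'Mango', 'Mango', 'Watermelon'],
    'Color': ['Red', 'Green', 'Yellow', 'Green', 'Orange', 'Green'],
})
df['Fruit_color'] = df['Fruit'] + ' ' + df['Color']


def running_join(values):
    """Return the cumulative space-joined strings for one group."""
    joined = accumulate(values, lambda left, right: left + ' ' + right)
    return pd.Series(list(joined), index=values.index)


# Each row holds its own value preceded by all earlier values of the same fruit.
df['same_fruit'] = df.groupby('Fruit')['Fruit_color'].transform(running_join)
print(df['same_fruit'].tolist())
```

This works for any number of rows per fruit, since nothing depends on the size of the largest group.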

