Pandas create a new column based on existing column with chain rule - python-3.x

I have below code
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[12, 4, 5, 3, 1],"B":[7, 2, 54, 3, None],"C":[20, 16, 11, 3, 8],"D":[14, 3, None, 2, 6]})
df['A1'] = np.where(df['A'] > 10, 10, np.where(df['A'] < 3, 3, df['A']))
While this works, I want to create the final dataframe (i.e. the second line of code) by chaining off the first line, to improve readability.
Could you please help me with how I can achieve this?

You can use clip here:
df.assign(A1=df['A'].clip(upper=10, lower=3))

    A     B   C     D  A1
0  12   7.0  20  14.0  10
1   4   2.0  16   3.0   4
2   5  54.0  11   NaN   5
3   3   3.0   3   2.0   3
4   1   NaN   8   6.0   3
If you really need to do this in one line (note that I don't find this readable):
pd.DataFrame({"A": [12, 4, 5, 3, 1],
              "B": [7, 2, 54, 3, None],
              "C": [20, 16, 11, 3, 8],
              "D": [14, 3, None, 2, 6]}).assign(A1=lambda x: x['A'].clip(upper=10, lower=3))

You could use np.select() like the following. It makes the conditions and choices very readable.
conditions = [df['A'] > 10,
              df['A'] < 3]
choices = [10, 3]
df['A2'] = np.select(conditions, choices, default=df['A'])
print(df)

    A     B   C     D  A2
0  12   7.0  20  14.0  10
1   4   2.0  16   3.0   4
2   5  54.0  11   NaN   5
3   3   3.0   3   2.0   3
4   1   NaN   8   6.0   3
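If you prefer np.select but still want the chained style the question asks for, the same logic can go inside assign with a lambda. This is just a sketch combining the two answers above; the lambda parameter name is arbitrary:

```python
import pandas as pd
import numpy as np

df = (pd.DataFrame({"A": [12, 4, 5, 3, 1],
                    "B": [7, 2, 54, 3, None],
                    "C": [20, 16, 11, 3, 8],
                    "D": [14, 3, None, 2, 6]})
      # the lambda receives the dataframe built so far in the chain,
      # so the conditions can reference its columns
      .assign(A1=lambda d: np.select([d['A'] > 10, d['A'] < 3],
                                     [10, 3],
                                     default=d['A'])))
print(df['A1'].tolist())  # [10, 4, 5, 3, 3]
```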

Related

Pandas groupby with dropna set to True generating wrong output

In the following snippet:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "b": [1, np.nan, 1, np.nan, 2, 1, 2, np.nan, 1]
    }
)
df_again = df.groupby("b", dropna=False).apply(lambda x: x)
I was expecting df and df_again to be identical. But they are not:
df
   a    b
0  1  1.0
1  2  NaN
2  3  1.0
3  4  NaN
4  5  2.0
5  6  1.0
6  7  2.0
7  8  NaN
8  9  1.0
df_again
   a    b
0  1  1.0
2  3  1.0
4  5  2.0
5  6  1.0
6  7  2.0
8  9  1.0
Now, if I tweak the lambda expression slightly to "see" what is going on, e.g.
df.groupby("b", dropna=False).apply(lambda x: print(x))
I can see that the portion of df where b was NaN is also processed.
What am I missing here?
(Using pandas 1.3.1 and numpy 1.20.3)
It's because None == None evaluates to True (None is a singleton), whereas NaN never compares equal, not even to itself:
>>> None == None
True
>>> np.nan == np.nan
False
>>>
So use np.nan rather than None:
df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "b": [1, np.nan, 1, np.nan, 2, 1, 2, np.nan, 1]
    }
)
df_again = df.groupby("b", dropna=False).apply(lambda x: x)
Now df and df_again would be the same:
>>> df
a b
0 1 1.0
1 2 NaN
2 3 1.0
3 4 NaN
4 5 2.0
5 6 1.0
6 7 2.0
7 8 NaN
8 9 1.0
>>> df_again
a b
0 1 1.0
1 2 NaN
2 3 1.0
3 4 NaN
4 5 2.0
5 6 1.0
6 7 2.0
7 8 NaN
8 9 1.0
>>> df.equals(df_again)
True
>>>
This was a bug introduced in pandas 1.2.0 as described here and was solved here.
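As an aside, because NaN never compares equal to itself, equality checks are the wrong tool for detecting it; np.isnan (or pd.isna) is the reliable test. A minimal sketch:

```python
import numpy as np

x = np.nan
print(x == x)        # False: NaN compares unequal to everything, itself included
print(np.isnan(x))   # True: the correct way to detect NaN
print(None == None)  # True: None is a singleton and equals itself
```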

Pandas Min and Max Across Rows

I have a dataframe that looks like the one below. I want to get the min and max value per city, along with information about which products had the min and max for that city. Please help.
Dataframe
db.min(axis=0) - min value for each column
db.min(axis=1) - min value for each row
Use DataFrame.min and DataFrame.max:
DataFrame.min(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
matrix = [(22, 16, 23),
          (33, 50, 11),
          (44, 34, 11),
          (55, 35, 60),
          (66, 36, 13)
          ]
dfObj = pd.DataFrame(matrix, index=list('abcde'), columns=list('xyz'))

    x   y   z
a  22  16  23
b  33  50  11
c  44  34  11
d  55  35  60
e  66  36  13
Get a series containing the minimum value of each row
minValuesObj = dfObj.min(axis=1)
print('minimum value in each row : ')
print(minValuesObj)
output
minimum value in each row :
a    16
b    11
c    11
d    35
e    13
dtype: int64
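Since the question also asks which column produced each extreme, DataFrame.idxmin and DataFrame.idxmax return the label of the minimizing/maximizing column per row. A sketch on the same dfObj:

```python
import pandas as pd

matrix = [(22, 16, 23),
          (33, 50, 11),
          (44, 34, 11),
          (55, 35, 60),
          (66, 36, 13)]
dfObj = pd.DataFrame(matrix, index=list('abcde'), columns=list('xyz'))

# label of the column holding each row's minimum
print(dfObj.idxmin(axis=1).tolist())  # ['y', 'z', 'z', 'y', 'z']
# label of the column holding each row's maximum
print(dfObj.idxmax(axis=1).tolist())  # ['z', 'y', 'x', 'z', 'x']
```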
MMT Marathi, based on the answers provided by Danil and Sutharp777, you should be able to get to your answer. However, I see you have follow-up questions for them; I'm not sure whether you are looking for a column that holds the min/max value for each row.
Here's the full dataframe with the solution; I am merely compiling the answers they have already given:
import pandas as pd

d = [['20in Monitor', 2, 2, 1, 2, 2, 2, 2, 2, 2],
     ['27in 4k Gaming Monitor', 2, 1, 2, 2, 1, 2, 2, 2, 2],
     ['27in FHD Monitor', 2, 2, 2, 2, 2, 2, 2, 2, 2],
     ['34in Ultrawide Monitor', 2, 1, 2, 2, 2, 2, 2, 2, 2],
     ['AA Batteries (4-pack)', 5, 5, 6, 7, 6, 6, 6, 6, 5],
     ['AAA Batteries (4-pack)', 7, 7, 8, 8, 9, 7, 8, 9, 7],
     ['Apple Airpods Headphones', 2, 2, 3, 2, 2, 2, 2, 2, 2],
     ['Bose SoundSport Headphones', 2, 2, 2, 2, 3, 2, 2, 3, 2],
     ['Flatscreen TV', 2, 1, 2, 2, 2, 2, 2, 2, 2]]
c = ['Product', 'Atlanta', 'Austin', 'Boston', 'Dallas', 'Los Angeles',
     'New York City', 'Portland', 'San Francisco', 'Seattle']
df = pd.DataFrame(d, columns=c)
# numeric_only=True skips the string 'Product' column
df['min_value'] = df.min(axis=1, numeric_only=True)
df['max_value'] = df.max(axis=1, numeric_only=True)
print(df)
The output of this will be:
Product Atlanta Austin ... Seattle min_value max_value
0 20in Monitor 2 2 ... 2 1 2
1 27in 4k Gaming Monitor 2 1 ... 2 1 2
2 27in FHD Monitor 2 2 ... 2 2 2
3 34in Ultrawide Monitor 2 1 ... 2 1 2
4 AA Batteries (4-pack) 5 5 ... 5 5 7
5 AAA Batteries (4-pack) 7 7 ... 7 7 9
6 Apple Airpods Headphones 2 2 ... 2 2 3
7 Bose SoundSport Headphones 2 2 ... 2 2 3
8 Flatscreen TV 2 1 ... 2 1 2
If you want the min and max of each column, then you can do this:
print ('min of each column :', df.min(axis=0).to_list()[1:])
print ('max of each column :', df.max(axis=0).to_list()[1:])
This will give you:
min of each column : [2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2]
max of each column : [7, 7, 8, 8, 9, 7, 8, 9, 7, 7, 9]
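To get the product behind each city's min/max (rather than just the values), one option is to set Product as the index and use idxmin/idxmax per column. A sketch on a trimmed version of the data above; on ties, the first matching product is returned:

```python
import pandas as pd

d = [['20in Monitor', 2, 2, 1],
     ['AA Batteries (4-pack)', 5, 5, 6],
     ['AAA Batteries (4-pack)', 7, 7, 8]]
c = ['Product', 'Atlanta', 'Austin', 'Boston']
df = pd.DataFrame(d, columns=c).set_index('Product')

# product with the fewest orders in each city (first product on ties)
print(df.idxmin())
# product with the most orders in each city
print(df.idxmax())
```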

Pandas, how to dropna values using subset with multiindex dataframe?

I have a data frame with multi-index columns.
From this data frame I need to remove the rows with NaN values in a subset of columns.
I am trying to use the subset option of pd.dropna but I do not manage to find the way to specify the subset of columns. I have tried using pd.IndexSlice but this does not work.
In the example below I need to get rid of the last row.
import pandas as pd

# ---
a = [1, 1, 2, 2, 3, 3]
b = ["a", "b", "a", "b", "a", "b"]
col = pd.MultiIndex.from_arrays([a[:], b[:]])
val = [
    [1, 2, 3, 4, 5, 6],
    [None, None, 1, 2, 3, 4],
    [None, 1, 2, 3, 4, 5],
    [None, None, 5, 3, 3, 2],
    [None, None, None, None, 5, 7],
]
# ---
df = pd.DataFrame(val, columns=col)
# ---
print(df)
# ---
idx = pd.IndexSlice
df.dropna(axis=0, how="all", subset=idx[1:2, :])
# ---
print(df)
Using the thresh option is an alternative, but if possible I would like to use subset and how='all'.
When dealing with a MultiIndex, each column of the MultiIndex can be specified as a tuple:
In [67]: df.dropna(axis=0, how="all", subset=[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])
Out[67]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
Or, to select all columns whose first level equals 1 or 2 you could use:
In [69]: df.dropna(axis=0, how="all", subset=df.loc[[], [1,2]].columns)
Out[69]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
df[[1,2]].columns also works, but this returns a (possibly large) intermediate DataFrame. df.loc[[], [1,2]].columns is more memory-efficient since its intermediate DataFrame is empty.
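The pd.IndexSlice approach from the question can also be made to work, provided it is used to build the column labels rather than passed to subset directly. A sketch; note that slicing a MultiIndex like this requires the columns to be sorted, which they are here:

```python
import pandas as pd

a = [1, 1, 2, 2, 3, 3]
b = ["a", "b", "a", "b", "a", "b"]
col = pd.MultiIndex.from_arrays([a, b])
val = [[1, 2, 3, 4, 5, 6],
       [None, None, 1, 2, 3, 4],
       [None, 1, 2, 3, 4, 5],
       [None, None, 5, 3, 3, 2],
       [None, None, None, None, 5, 7]]
df = pd.DataFrame(val, columns=col)

idx = pd.IndexSlice
# slice the columns first (empty row selection, so no data is copied),
# then hand the resulting labels to subset
cols = df.loc[[], idx[1:2, :]].columns
out = df.dropna(axis=0, how="all", subset=list(cols))
print(out)  # the all-NaN last row (index 4) is dropped
```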
If you want to apply the dropna to the columns which have 1 or 2 in level 1, you can do it as follows:
cols= [(c0, c1) for (c0, c1) in df.columns if c0 in [1,2]]
df.dropna(axis=0, how="all", subset=cols)
If applied to your data, it results in:
Out[446]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
As you can see, the last row (index=4) is gone, because all columns under 1 and 2 were NaN in that row. If you would rather remove every row where any NaN occurred in these columns, you need:
df.dropna(axis=0, how="any", subset=cols)
Which results in:
Out[447]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6

How to NaN elements in a numpy array based on upper and lower boundaries

I have a NumPy array with the elements 0 to 10.
a = np.arange(0,11)
np.random.shuffle(a)
a
array([ 1, 7, 8, 0, 2, 3, 4, 10, 9, 5, 6])
I wanted to convert elements to NaN if they are between 4 and 8.
As a first step I tried getting the array with np.where like below:
np.where([a > 4] & [a < 8])
but got an error. Any help please?
You need:
import numpy as np
a = np.arange(0,11)
np.random.shuffle(a)
print(a)
# [ 7 4 2 3 6 10 1 9 5 0 8]
a = np.where((a > 4) & (a < 8), np.nan, a)
print(a)
# [nan 4. 2. 3. nan 10. 1. 9. nan 0. 8.]
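An equivalent approach is boolean-mask assignment, which modifies the array in place; note that the array must have a float dtype first, since an integer array cannot hold NaN. A minimal sketch (shuffle omitted for determinism):

```python
import numpy as np

a = np.arange(0, 11).astype(float)  # NaN requires a float dtype
a[(a > 4) & (a < 8)] = np.nan       # mask the open interval (4, 8)
print(a)
# elements 5, 6 and 7 become nan; everything else is unchanged
```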
If you want to know how np.where() works, refer to "numpy.where() detailed, step-by-step explanation / examples".

Pandas: How to build a column based on another column which is indexed by another one?

I have the dataframe presented below. I tried the solution below as well, but I am not sure it is a good one.
import numpy as np  # needed for np.where below
import pandas as pd

def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'Region': ['A', 'A', 'C', 'B', 'A', 'B'],
                'var-A': [2, 4, 6, 4, 6, 6],
                'var-B': [20, 30, 40, 50, 10, 20],
                'var-C': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'Region', 'var-A', 'var-B', 'var-C'])
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
    df['var'] = (np.where(df['Region'] == 'A', 1.0, 0.0) * df['var-A']
                 + np.where(df['Region'] == 'B', 1.0, 0.0) * df['var-B']
                 + np.where(df['Region'] == 'C', 1.0, 0.0) * df['var-C'])
I want the variable var to take the value of column 'var-A', 'var-B' or 'var-C', depending on the region given in column 'Region'.
The result must be
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
You can try with lookup (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0):
df.columns = df.columns.str.split('-').str[-1]
df
Out[255]:
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
df.lookup(df.index,df.Region)
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
#df['var']=df.lookup(df.index,df.Region)
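On pandas versions without DataFrame.lookup, the same one-cell-per-row pick can be done with NumPy fancy indexing via get_indexer. A sketch on the dataframe after the columns have been renamed as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'code': [1, 2, 3, 2, 3, 3],
                   'Region': ['A', 'A', 'C', 'B', 'A', 'B'],
                   'A': [2, 4, 6, 4, 6, 6],
                   'B': [20, 30, 40, 50, 10, 20],
                   'C': [3, 4, 5, 1, 2, 3]})

# position of each row's target column, then pick one cell per row
col_idx = df.columns.get_indexer(df['Region'])
df['var'] = df.to_numpy()[np.arange(len(df)), col_idx]
print(df['var'].tolist())  # [2, 4, 5, 50, 6, 20]
```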
