How to check whether a value is positive or negative and insert a new value in a column of a data frame - python-3.x

I am working with pandas and Python, and I have a task where I need to check the sign of the diff values in the following dataframe:
GDP diff sign
Quarter
1999q4 12323.3 NaN None
2000q1 12359.1 35.8 None
2000q2 12592.5 233.4 None
2000q3 12607.7 15.2 None
2000q4 12679.3 -71.6 None
Let the above dataframe be df. When I do
if df.iloc[2]['diff'] > 0:
    df.iloc[2]['sign'] = "Positive"
the value is not updated in the dataframe. Is there something I'm doing wrong? A direct assignment like df['something'] = 'some value' inserts that value into df under that column, but when I use the above to set positive or negative, the cell still shows None when I run
df.iloc[2]['sign']
I tried using apply with lambdas, but couldn't get what I wanted.
Some help would be appreciated.
Thank you.

You can use nested numpy.where: first filter the NaN values with isnull, and then apply the condition df['diff'] > 0:
import numpy as np

df['sign'] = np.where(df['diff'].isnull(), np.nan,
                      np.where(df['diff'] > 0, 'Positive', 'Negative'))
print (df)
Quarter GDP diff sign
0 1999q4 12323.3 NaN NaN
1 2000q1 12359.1 35.8 Positive
2 2000q2 12592.5 233.4 Positive
3 2000q3 12607.7 15.2 Positive
4 2000q4 12679.3 -71.6 Negative
because if you use only df['diff'] > 0, you get Negative for NaN values:
df['sign'] = np.where(df['diff'] > 0, 'Positive', 'Negative')
print (df)
Quarter GDP diff sign
0 1999q4 12323.3 NaN Negative
1 2000q1 12359.1 35.8 Positive
2 2000q2 12592.5 233.4 Positive
3 2000q3 12607.7 15.2 Positive
4 2000q4 12679.3 -71.6 Negative
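As an aside, the original assignment likely fails because chained indexing like df.iloc[2]['sign'] = ... writes to a temporary copy rather than to df itself. A minimal sketch of a single-step .loc assignment, assuming the Quarter index from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'GDP': [12323.3, 12359.1, 12592.5, 12607.7, 12679.3],
                   'diff': [np.nan, 35.8, 233.4, 15.2, -71.6]},
                  index=pd.Index(['1999q4', '2000q1', '2000q2',
                                  '2000q3', '2000q4'], name='Quarter'))

# Label-based assignment in one step writes through to the frame;
# rows where diff is NaN match neither mask and stay NaN.
df.loc[df['diff'] > 0, 'sign'] = 'Positive'
df.loc[df['diff'] < 0, 'sign'] = 'Negative'
print(df)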

I'd create a categorical column:
d = df['diff']
sign = np.where(d < 0, 'Negative',
                np.where(d == 0, 'UnChanged',
                         np.where(d > 0, 'Positive', np.nan)))
df['sign'] = pd.Categorical(sign,
                            categories=['Negative', 'UnChanged', 'Positive'],
                            ordered=True)
df
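A sketch of an alternative using np.select, which flattens the nested np.where calls; it reuses d and df from above, and the default of None becomes NaN in the Categorical:
import numpy as np
import pandas as pd

# Conditions are checked in order; rows matching none of them
# (here the NaN diff) fall back to the default.
sign = np.select([d < 0, d == 0, d > 0],
                 ['Negative', 'UnChanged', 'Positive'],
                 default=None)
df['sign'] = pd.Categorical(sign,
                            categories=['Negative', 'UnChanged', 'Positive'],
                            ordered=True)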

Related

Replace NULL or NA in a column with respect to another column in a pandas data frame [duplicate]

This question already has answers here: Pandas conditional creation of a series/dataframe column (13 answers). Closed 3 years ago.
I have a table:
df = pd.DataFrame([[0.1, 2, 55, 0, np.nan],
                   [0.2, 4, np.nan, 1, 99],
                   [0.6, np.nan, 22, 5, 88],
                   [1.4, np.nan, np.nan, 4, 77]],
                  columns=list('ABCDE'))
A B C D E
0 0.1 2.0 55.0 0 NaN
1 0.2 4.0 NaN 1 99.0
2 0.6 NaN 22.0 5 88.0
3 1.4 NaN NaN 4 77.0
I want to replace NaN values in column B based on a condition on column A.
Example:
When B is NULL and `column A > 0.2 and < 0.6`, replace the NaN in column B with 5
When B is NULL and `column A > 0.6 and < 2`, replace the NaN in column B with 10
I tried something like this:
if df["A"]>=val1 and pd.isnull(df['B']):
df["B"]=5
elif df["A"]>=val2 and df["A"]<val3 and pd.isnull(df['B']):
df["B"]=10
elif df["A"]<val4 and pd.isnull(df['B']):
df["B"]=15
The above code is not working.
Please let me know if there is an alternative approach, using a for loop or apply functions, to iterate over the pandas dataframe.
You can use mask:
df['B'] = df['B'].mask((df['A']>0.2) & (df['A']<0.6), df['B'].fillna(5))
df['B'] = df['B'].mask((df['A']>0.6) & (df['A']<2), df['B'].fillna(10))
or you can try np.where, but it will involve a longer condition, as sketched below.
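For reference, a sketch of that np.select variant, with the cut points taken from the question's conditions:
import numpy as np

conditions = [df['B'].isnull() & df['A'].gt(0.2) & df['A'].lt(0.6),
              df['B'].isnull() & df['A'].gt(0.6) & df['A'].lt(2)]
# Rows matching no condition keep their current B value.
df['B'] = np.select(conditions, [5, 10], default=df['B'])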

How to sum column values separated by semicolons in python

I have a dataframe with the values as below:
df = pd.DataFrame({'Column4': ['NaN;NaN;1;4','4;8','nan']} )
print (df)
Column4
0 NaN;NaN;1;4
1 4;8
2 nan
I tried with the code below to get the sum.
df['Sum'] = df['Column4'].apply(lambda x: sum(map(int, x.split(';'))))
I am getting the error message:
ValueError: invalid literal for int() with base 10: 'NaN'
Use Series.str.split with expand=True to get a DataFrame, convert to floats, and sum per row; pandas excludes missing values by default:
df['Sum'] = df['Column4'].str.split(';', expand=True).astype(float).sum(axis=1)
print (df)
Column4 Sum
0 NaN;NaN;1;4 5.0
1 4;8 12.0
2 nan 0.0
Your solution should be changed:
f = lambda x: sum(int(y) for y in x.split(';') if y not in ('nan', 'NaN'))
df['Sum'] = df['Column4'].apply(f)
because if you convert to float, any row containing NaN alongside other numbers sums to NaN:
df['Sum'] = df['Column4'].apply(lambda x: sum(map(float, x.split(';'))))
print (df)
Column4 Sum
0 NaN;NaN;1;4 NaN
1 4;8 12.0
2 nan NaN
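If the tokens can be arbitrary non-numeric strings rather than only 'nan'/'NaN' (an assumption beyond the original question), a more defensive sketch is to let pd.to_numeric coerce them:
split = df['Column4'].str.split(';', expand=True)
# errors='coerce' turns any unparseable token into NaN, which sum skips.
df['Sum'] = split.apply(pd.to_numeric, errors='coerce').sum(axis=1)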

Python - Calculating standard deviation (row level) of dataframe columns

I have created a Pandas Dataframe and am able to determine the standard deviation of one or more columns of this dataframe (column level). I need to determine the standard deviation for all the rows of a particular column. Below are the commands that I have tried so far
# Will determine the standard deviation of all the numerical columns by default.
inp_df.std()
salary 8.194421e-01
num_months 3.690081e+05
no_of_hours 2.518869e+02
# Same as above command. Performs the standard deviation at the column level.
inp_df.std(axis = 0)
# Determines the standard deviation over only the salary column of the dataframe.
inp_df[['salary']].std()
salary 8.194421e-01
# Determines Standard Deviation for every row present in the dataframe. But it
# does this for the entire row and it will output values in a single column.
# One std value for each row.
inp_df.std(axis=1)
0 4.374107e+12
1 4.377543e+12
2 4.374026e+12
3 4.374046e+12
4 4.374112e+12
5 4.373926e+12
When I execute the below command I am getting "NaN" for all the records. Is there a way to resolve this?
# Trying to determine standard deviation only for the "salary" column at the
# row level.
inp_df[['salary']].std(axis = 1)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
It is expected; check the documentation for DataFrame.std:
Normalized by N-1 by default. This can be changed using the ddof argument
With a single element you are dividing by 0, so if you select one column and ask for the sample standard deviation across columns, you get all missing values.
Sample:
inp_df = pd.DataFrame({'salary': [10, 20, 30],
                       'num_months': [1, 2, 3],
                       'no_of_hours': [2, 5, 6]})
print (inp_df)
salary num_months no_of_hours
0 10 1 2
1 20 2 5
2 30 3 6
Select one column with single [] to get a Series:
print (inp_df['salary'])
0 10
1 20
2 30
Name: salary, dtype: int64
Get the std of the Series, which returns a scalar:
print (inp_df['salary'].std())
10.0
Select one column with double [[]] to get a one-column DataFrame:
print (inp_df[['salary']])
salary
0 10
1 20
2 30
Get the std of the DataFrame per index (the default), which returns a one-element Series:
print (inp_df[['salary']].std())
#same like
#print (inp_df[['salary']].std(axis=0))
salary 10.0
dtype: float64
Get the std of the DataFrame across columns (axis=1), which returns all NaNs:
print (inp_df[['salary']].std(axis = 1))
0 NaN
1 NaN
2 NaN
dtype: float64
If you change the default ddof=1 to ddof=0:
print (inp_df[['salary']].std(axis = 1, ddof=0))
0 0.0
1 0.0
2 0.0
dtype: float64
If you want std by two or more columns:
#select 2 columns
print (inp_df[['salary', 'num_months']])
salary num_months
0 10 1
1 20 2
2 30 3
#std by index
print (inp_df[['salary','num_months']].std())
salary 10.0
num_months 1.0
dtype: float64
#std by columns
print (inp_df[['salary','no_of_hours']].std(axis = 1))
0 5.656854
1 10.606602
2 16.970563
dtype: float64
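To see the ddof mechanics concretely, one can compare with numpy, whose std defaults to the population formula (ddof=0), while pandas defaults to the sample formula (ddof=1); a quick check on the inp_df above:
import numpy as np

print(inp_df['salary'].std())         # 10.0, pandas default ddof=1
print(np.std(inp_df['salary']))       # ~8.165, numpy default ddof=0
print(inp_df['salary'].std(ddof=0))   # ~8.165, matches numpy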

subselect columns pandas with count

I have the table below.
I am trying to create an additional column that counts how many of Std_1, Std_2 and Std_3 are greater than the row's mean value.
For example, for the ACCMGR row, only Std_2 is greater than the average, so the new column should be 1.
Not sure how to do it.
You need to be a bit careful with how you specify the axes, but you can just use .gt + .mean + .sum
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({'APPL': ['ACCMGR', 'ACCOUNTS', 'ADVISOR', 'AUTH', 'TEST'],
                   'Std_1': [106.875, 121.703, np.nan, 116.8585, 1],
                   'Std_2': [130.1899, 113.4927, np.nan, 112.4486, 4],
                   'Std_3': [107.186, 114.5418, np.nan, 115.2699, np.nan]})
Code
df = df.set_index('APPL')
df['cts'] = df.gt(df.mean(axis=1), axis=0).sum(axis=1)
df = df.reset_index()
Output:
APPL Std_1 Std_2 Std_3 cts
0 ACCMGR 106.8750 130.1899 107.1860 1
1 ACCOUNTS 121.7030 113.4927 114.5418 1
2 ADVISOR NaN NaN NaN 0
3 AUTH 116.8585 112.4486 115.2699 2
4 TEST 1.0000 4.0000 NaN 1
Consider the dataframe:
quantity price
0 6 1.45
1 3 1.85
2 2 2.25
Apply a lambda function with axis=1; for each row Series, select the values greater than the row mean and take the column index of the first one:
df.apply(lambda x: df.columns.get_loc(x[x > np.mean(x)].index[0]), axis=1)
Out:
quantity price > than mean
0 6 1.45 0
1 3 1.85 0
2 2 2.25 1
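Note that this returns the position of the first above-mean column rather than a count. If a per-row count is what's needed, as in the original question, the same row-wise idea can be adapted (a sketch):
# Compare each row against its own mean and count the True values.
df['cts'] = df.apply(lambda x: (x > x.mean()).sum(), axis=1)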

Removing negative values in pandas column keeping NaN

I was wondering how I can remove rows which have a negative value but keep the NaNs. At the moment I am using:
DF = DF.ix[DF['RAF01Time'] >= 0]
But this removes the NaNs.
Thanks in advance.
You need boolean indexing with an additional condition using isnull:
DF = DF[(DF['RAF01Time'] >= 0) | (DF['RAF01Time'].isnull())]
Sample:
DF = pd.DataFrame({'RAF01Time':[-1,2,3,np.nan]})
print (DF)
RAF01Time
0 -1.0
1 2.0
2 3.0
3 NaN
DF = DF[(DF['RAF01Time'] >= 0) | (DF['RAF01Time'].isnull())]
print (DF)
RAF01Time
1 2.0
2 3.0
3 NaN
Another solution with query:
DF = DF.query("~(RAF01Time < 0)")
print (DF)
RAF01Time
1 2.0
2 3.0
3 NaN
You can just use < 0 and then take the inverse of the condition.
DF = DF[~(DF['RAF01Time'] < 0)]
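This works because any comparison with NaN evaluates to False, so DF['RAF01Time'] < 0 is False for NaN rows and the inversion keeps them. A quick check:
import numpy as np
import pandas as pd

print(np.nan < 0)                              # False
print(~(pd.Series([-1.0, 2.0, np.nan]) < 0))   # False, True, True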
