How to replace specific values of a particular column in a Pandas DataFrame based on a certain condition? - python-3.x

I have a Pandas DataFrame that contains students and the percentages of marks they obtained. Some students' marks are shown as greater than 100%. These values are obviously incorrect, and I would like to replace all percentage values greater than 100% with NaN.
I have tried some code, but I am not quite able to get what I want.
import numpy as np
import pandas as pd
new_DF = pd.DataFrame({'Student': ['S1', 'S2', 'S3', 'S4', 'S5'],
                       'Percentages': [85, 70, 101, 55, 120]})
# Percentages Student
#0 85 S1
#1 70 S2
#2 101 S3
#3 55 S4
#4 120 S5
new_DF[(new_DF.iloc[:, 0] > 100)] = np.NaN
# Percentages Student
#0 85.0 S1
#1 70.0 S2
#2 NaN NaN
#3 55.0 S4
#4 NaN NaN
As you can see, the code kind of works, but it actually replaces all the values in each row where Percentages is greater than 100 with NaN. I would only like to replace the value in the Percentages column with NaN where it is greater than 100. Is there any way to do that?
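For context (an editorial aside): indexing a DataFrame with a boolean Series selects whole rows, so the assignment above writes NaN into every column of each matching row. A minimal sketch, assuming new_DF as freshly constructed above (before the assignment):
mask = new_DF['Percentages'] > 100
print(new_DF[mask])   # returns the full S3 and S5 rows, both columns
new_DF[mask] = np.nan # hence both columns of those rows get overwritten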

Try using np.where:
new_DF.Percentages = np.where(new_DF.Percentages.gt(100), np.nan, new_DF.Percentages)
or
new_DF.loc[new_DF.Percentages.gt(100), 'Percentages'] = np.nan
print(new_DF)
Student Percentages
0 S1 85.0
1 S2 70.0
2 S3 NaN
3 S4 55.0
4 S5 NaN

Also,
new_DF.Percentages = new_DF.Percentages.apply(lambda x: np.nan if x > 100 else x)
or,
new_DF.Percentages = new_DF.Percentages.where(new_DF.Percentages <= 100, np.nan)
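A closely related option (an aside, not part of the original answer) is Series.mask, the complement of where: it replaces values where the condition is True, and NaN is its default replacement:
new_DF.Percentages = new_DF.Percentages.mask(new_DF.Percentages > 100)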

You can use .loc:
new_DF.loc[new_DF['Percentages']>100, 'Percentages'] = np.NaN
Output:
Student Percentages
0 S1 85.0
1 S2 70.0
2 S3 NaN
3 S4 55.0
4 S5 NaN
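A small side effect worth knowing (an editorial aside): NaN is a float, so the assignment upcasts the integer Percentages column to float64, which is why the output shows 85.0 rather than 85:
print(new_DF.dtypes)
# Student         object
# Percentages    float64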

import numpy as np
import pandas as pd
new_DF = pd.DataFrame({'Student': ['S1', 'S2', 'S3', 'S4', 'S5'],
                       'Percentages': [85, 70, 101, 55, 120]})
for index, value in enumerate(new_DF['Percentages']):
    if value > 100:
        # assign np.nan (not the string "nan") and write through .loc
        # to avoid chained assignment
        new_DF.loc[index, 'Percentages'] = np.nan
print(new_DF)

Related

Pandas change the dataframe shape column values to row

I have a dataframe that looks like the one below
df
school1 game1 game2 game3
school2 game1
school3 game2 game3
school4 game2
Desired output:
game1 school1 school2
game2 school1 school4 school3
game3 school1 school3
Can anyone suggest how I can get this output? I am new to pandas, so please help me.
Thank you
Here is one way that relies on the melt() method to first make a long table out of the original and then on the pivot() method to transform it to the new wide format:
import pandas as pd
import numpy as np
# Code that creates your input dataframe (replace column names as needed)
df = pd.DataFrame(
    {'school': ['school1', 'school2', 'school3', 'school4'],
     'g1': ['game1', 'game1', 'game2', None],
     'g2': ['game2', None, 'game3', 'game2'],
     'g3': ['game3', None, None, None],
    }
)
# Convert to long format (one row per school-game)
long_df = df.set_index('school').melt(ignore_index=False).reset_index()
# Remove null (non-existing) school-game combinations
# Also, convert index to column for next step
long_df = long_df[long_df.value.notnull()].reset_index(drop=True).reset_index()
# Convert to dataframe with one row per game ID
by_game_df = long_df.pivot(index='value', columns='index', values='school')
At this point, the dataframe will look like this:
index value 0 1 2 3 4 5 6
0 game1 school1 school2 NaN NaN NaN NaN NaN
1 game2 NaN NaN school3 school1 NaN school4 NaN
2 game3 NaN NaN NaN NaN school3 NaN school1
You can perform these additional steps to shift non-null school values to the left and to remove columns in which only NaNs remain:
# per https://stackoverflow.com/a/65596853:
idx = pd.isnull(by_game_df.values).argsort(axis=1)
squeezed_df = pd.DataFrame(
    by_game_df.values[np.arange(by_game_df.shape[0])[:, None], idx],
    index=by_game_df.index,
    columns=by_game_df.columns
)
result = squeezed_df.dropna(axis=1, how='all')
result
# index value 0 1 2
# 0 game1 school1 school2 NaN
# 1 game2 school3 school1 school4
# 2 game3 school3 school1 NaN
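A more compact variant of the same melt idea (an added sketch, assuming the df defined above; not part of the original answer) is to group the schools into lists per game:
long_df = df.melt(id_vars='school', value_name='game').dropna(subset=['game'])
per_game = long_df.groupby('game')['school'].apply(list)
print(per_game)
# game1             [school1, school2]
# game2    [school3, school1, school4]
# game3             [school3, school1]
# Name: school, dtype: object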
Or with a Series of lists and a much maligned loop:
d = {'School': ['s1','s2','s3','s4'], 'c1': ['g1','g1','g2',np.nan], 'c2': ['g2',np.nan,'g3','g2'], 'c3': ['g3',np.nan,np.nan,np.nan]}
df = pd.DataFrame(d)
df
School c1 c2 c3
0 s1 g1 g2 g3
1 s2 g1 NaN NaN
2 s3 g2 g3 NaN
3 s4 NaN g2 NaN
gg = pd.Series(dtype=object)
def add_gs(game, sch):
    if type(game) is str:
        if game in gg.keys():
            gg[game] += [sch]
        else:
            gg[game] = [sch]
cols = df.filter(regex='c[0-9]').columns
for i in range(len(df)):
    for col in cols:
        add_gs(df.loc[i, col], df.loc[i, 'School'])
gg
g1 [s1, s2]
g2 [s1, s3, s4]
g3 [s1, s3]
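If the wide layout from the question is needed, the Series of lists expands directly (a sketch; the DataFrame constructor pads the shorter lists with NaN):
wide = pd.DataFrame(gg.tolist(), index=gg.index)
print(wide)
#      0   1    2
# g1  s1  s2  NaN
# g2  s1  s3   s4
# g3  s1  s3  NaN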
A solution that relies on defaultdict() for reshaping the data:
from collections import defaultdict
import pandas as pd
# Code that creates your input dataframe (replace column names as needed)
df = pd.DataFrame(
    {'school': ['school1', 'school2', 'school3', 'school4'],
     'g1': ['game1', 'game1', 'game2', None],
     'g2': ['game2', None, 'game3', 'game2'],
     'g3': ['game3', None, None, None],
    }
)
# convert df to dictionary
d = df.set_index('school').to_dict(orient='index')
# reshape the dictionary
def_d = defaultdict(list)
for k, v in d.items():
    for i in v.values():
        if i is not None:
            def_d[i].append(k)
d_rs = dict(def_d)
# prepare dictionary for converting back to dataframe
dict_for_df = {
    k: pd.Series(
        v + [None] * (len(max(d_rs.values(), key=lambda x: len(x))) - len(v))
    ) for k, v in d_rs.items()
}
# convert dictionary to dataframe
final_df = pd.DataFrame.from_dict(dict_for_df, orient='index')
# 0 1 2
# game1 school1 school2 None
# game2 school1 school3 school4
# game3 school1 school3 None

Calculation of percentile and mean

I want to find the 3rd percentile of the following data and then average the data.
Given below is the data structure.
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
96927 NaN
96928 NaN
96929 NaN
96930 NaN
96931 NaN
The data of interest lies exactly in the row range 13240:61156.
Given below is my code:
import pandas as pd
import numpy as np
load_var=pd.read_excel(r'path\file name.xlsx')
load_var
a=pd.DataFrame(load_var['column whose percentile is to be found'])
print(a)
b=np.nanpercentile(a,3)
print(b)
Please suggest the changes in the code.
Thank you.
Use Series.quantile with mean in Series.agg:
df = pd.DataFrame({
    'col': [7, 8, 9, 4, 2, 3, np.nan],
})
f = lambda x: x.quantile(0.03)
f.__name__ = 'q'
s = df['col'].agg(['mean', f])
print (s)
mean 5.50
q 2.15
Name: col, dtype: float64
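Applied to the question's setup (a sketch: the placeholder column name must be replaced with the real one, and the 13240:61156 row range is taken from the question):
s = load_var['column whose percentile is to be found'].iloc[13240:61156]
b = s.quantile(0.03)  # 3rd percentile; NaNs are skipped by default
m = s.mean()          # mean of the same slice, also skipping NaNs
print(b, m)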

Replacing NaN with None fills the value from the previous row in a DataFrame (pandas 1.0.3)

Not sure if it is a bug, but I am unable to replace NaN values with None using the latest Pandas library. When I use the DataFrame.replace() method to replace NaN with None, the dataframe takes the value from the previous row instead of None. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [10, 20, np.nan], 'y': [30, 40, 50]})
print(df)
Outputs
x y
0 10.0 30
1 20.0 40
2 NaN 50
And if I apply the replace method:
print(df.replace(np.NaN, None))
Outputs the following. The cell at row 2 of column x should be None instead of 20.0.
x y
0 10.0 30
1 20.0 40
2 20.0 50
Any help is appreciated.
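A likely explanation (an editorial note, not from the original post): in pandas of that era, DataFrame.replace() with value=None switches to method-based replacement, and the default method is 'pad' (forward fill), which is exactly the behaviour shown above. Passing the mapping as a dict avoids that code path, as does DataFrame.where:
print(df.replace({np.nan: None}))    # dict form keeps None as the literal replacement
print(df.where(df.notnull(), None))  # keep non-null values, put None elsewhere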

subselect columns pandas with count

I have a table below
I am trying to create an additional column that counts how many of Std_1, Std_2 and Std_3 are greater than the row's mean value.
For example, for the ACCMGR row, only Std_2 is greater than the average, so the new column should be 1.
Not sure how to do it.
You need to be a bit careful with how you specify the axes, but you can just use .gt + .mean + .sum
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({'APPL': ['ACCMGR', 'ACCOUNTS', 'ADVISOR', 'AUTH', 'TEST'],
                   'Std_1': [106.875, 121.703, np.NaN, 116.8585, 1],
                   'Std_2': [130.1899, 113.4927, np.NaN, 112.4486, 4],
                   'Std_3': [107.186, 114.5418, np.NaN, 115.2699, np.NaN]})
Code
df = df.set_index('APPL')
df['cts'] = df.gt(df.mean(axis=1), axis=0).sum(axis=1)
df = df.reset_index()
Output:
APPL Std_1 Std_2 Std_3 cts
0 ACCMGR 106.8750 130.1899 107.1860 1
1 ACCOUNTS 121.7030 113.4927 114.5418 1
2 ADVISOR NaN NaN NaN 0
3 AUTH 116.8585 112.4486 115.2699 2
4 TEST 1.0000 4.0000 NaN 1
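To unpack the axis arguments (an aside, not from the original answer; the steps below assume df is still indexed by APPL, i.e. before the reset_index above):
row_means = df.mean(axis=1)       # one mean per row; NaNs are skipped by default
above = df.gt(row_means, axis=0)  # axis=0 aligns row_means with the row index
df['cts'] = above.sum(axis=1)     # booleans sum as 1/0, giving the count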
Consider the dataframe:
quantity price
0 6 1.45
1 3 1.85
2 2 2.25
Apply a lambda function along axis=1: for each row (a Series), find the columns whose values are greater than the row mean and take the position of the first such column:
df['> than mean'] = df.apply(lambda x: df.columns.get_loc(x[x > np.mean(x)].index[0]), axis=1)
Out:
quantity price > than mean
0 6 1.45 0
1 3 1.85 0
2 2 2.25 1

How to check if a value is positive or negative and insert a new value in a column of a data frame

I am working with pandas and Python, and I have a task where I need to check the following dataframe:
GDP diff sign
Quarter
1999q4 12323.3 NaN None
2000q1 12359.1 35.8 None
2000q2 12592.5 233.4 None
2000q3 12607.7 15.2 None
2000q4 12679.3 71.6 None
Let the above dataframe be df and when I do
if df.iloc[2]['diff'] > 0:
    df.iloc[2]['sign'] = "Positive"
The value is not getting updated in the dataframe. Is there something I am doing wrong? It is a direct assignment, just like df['something'] = 'some value', which inserts that value into df under that column. But when I do the above to determine positive or negative, the value still shows as None when I do
df.iloc[2]['sign']
I tried using apply with lambdas, but couldn't get what I wanted.
Some help would be appreciated
Thank you.
You can use a double numpy.where: first filter the NaN values with isnull, and then apply the condition df['diff'] > 0:
df['sign'] = np.where(df['diff'].isnull(), np.nan,
                      np.where(df['diff'] > 0, 'Positive', 'Negative'))
print (df)
Quarter GDP diff sign
0 1999q4 12323.3 NaN NaN
1 2000q1 12359.1 35.8 Positive
2 2000q2 12592.5 233.4 Positive
3 2000q3 12607.7 15.2 Positive
4 2000q4 12679.3 -71.6 Negative
because if you use only df['diff'] > 0, you get Negative for the NaN values:
df['sign'] = np.where(df['diff'] > 0, 'Positive', 'Negative')
print (df)
Quarter GDP diff sign
0 1999q4 12323.3 NaN Negative
1 2000q1 12359.1 35.8 Positive
2 2000q2 12592.5 233.4 Positive
3 2000q3 12607.7 15.2 Positive
4 2000q4 12679.3 -71.6 Negative
I'd create a categorical column
d = df['diff']
sign = np.where(d < 0, 'Negative',
                np.where(d == 0, 'UnChanged',
                         np.where(d > 0, 'Positive', np.nan)))
df['sign'] = pd.Categorical(sign,
                            categories=['Negative', 'UnChanged', 'Positive'],
                            ordered=True)
df
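Worth adding (an editorial aside): the question's df.iloc[2]['sign'] = "Positive" fails because it is chained indexing; df.iloc[2] may return a copy, so the write never reaches the original frame. Going through a single indexer fixes it (a sketch, assuming the default integer index shown in the answers above):
if df.loc[2, 'diff'] > 0:
    df.loc[2, 'sign'] = 'Positive'  # single indexer, so the write lands in df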
