I want to replace outliers with NaN so that I can concat that dataframe with the other dataframe where I don't want to remove the outliers. Following is the dataset. I want to perform outlier removal only on 'age', 'height', 'weight', 'ap_hi', 'ap_lo'.
id age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active cardio
988 22469 1 155 69.0 130 80 2 2 0 0 1 0
989 14648 1 163 71.0 110 70 1 1 0 0 1 1
990 21901 1 165 70.0 120 80 1 1 0 0 1 0
991 14549 2 165 85.0 120 80 1 1 1 1 1 0
992 23393 1 155 62.0 120 80 1 1 0 0 1 0
I tried the following method but it's taking all columns into consideration:
from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
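A minimal sketch of restricting the z-score test to the chosen columns only (the toy frame below is illustrative, shaped like the sample data above; in the real data `cols` would also include 'ap_hi' and 'ap_lo'):

```python
import numpy as np
import pandas as pd
from scipy import stats

# toy frame standing in for the real dataset (illustrative values)
df = pd.DataFrame({
    'age':    [22469, 14648, 21901, 14549, 23393],
    'height': [155, 163, 165, 165, 155],
    'weight': [69.0, 71.0, 70.0, 85.0, 62.0],
    'gender': [1, 1, 1, 2, 1],
})

cols = ['age', 'height', 'weight']   # only these columns are screened
z = np.abs(stats.zscore(df[cols]))   # z-scores computed on the subset only
df[cols] = df[cols].mask(z >= 3)     # outliers become NaN; other columns untouched
```

Because only `df[cols]` is passed to `zscore`, columns such as `gender` never enter the outlier test, and the frame keeps its shape so it can still be concatenated with the other dataframe.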
When I run the following code in Jupyter Lab
import numpy as np
from sklearn.feature_selection import SelectKBest,f_classif
import matplotlib.pyplot as plt
predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
selector = SelectKBest(f_classif,k=5)
selector.fit(titanic[predictors],titanic["Survived"])
Then it raised an error, ValueError: could not convert string to float: 'Mme'. The details are as follows:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
C:\Users\ADMINI~1\AppData\Local\Temp/ipykernel_17760/1637555559.py in <module>
5 predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
6 selector = SelectKBest(f_classif,k=5)
----> 7 selector.fit(titanic[predictors],titanic["Survived"])
......
ValueError: could not convert string to float: 'Mme'
I printed titanic[predictors] and titanic["Survived"]; the output is as follows:
Pclass Sex Age SibSp Parch Fare Embarked FamilySize Title NameLength
0 3 0 22.0 1 0 7.2500 0 1 1 23
1 1 1 38.0 1 0 71.2833 1 1 3 51
2 3 1 26.0 0 0 7.9250 0 0 2 22
3 1 1 35.0 1 0 53.1000 0 1 3 44
4 3 0 35.0 0 0 8.0500 0 0 1 24
... ... ... ... ... ... ... ... ... ... ...
886 2 0 27.0 0 0 13.0000 0 0 6 21
887 1 1 19.0 0 0 30.0000 0 0 2 28
888 3 1 28.0 1 2 23.4500 0 3 2 40
889 1 0 26.0 0 0 30.0000 1 0 1 21
890 3 0 32.0 0 0 7.7500 2 0 1 19
891 rows × 10 columns
0 0
1 1
2 1
3 1
4 0
..
886 0
887 1
888 0
889 1
890 0
Name: Survived, Length: 891, dtype: int64
How do I solve this problem?
When you fit an algorithm (in your case SelectKBest), you need to be aware of your data, and you almost always need to preprocess it.
Take a look at your data:
Are your features categorical, numerical, or a mix?
Do you have NaN values?
...
Most algorithms don't accept categorical features, so you will need to transform them into numerical ones (consider OneHotEncoder).
In your case it seems you have a categorical value, 'Mme', in the feature Title. Check it.
You will have the same problem with NaN values.
In short: before you start fitting, you have to preprocess your data.
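As an illustrative sketch of that transformation (the toy frame below stands in for the raw Titanic data, and pandas' get_dummies is used here as a simple stand-in for sklearn's OneHotEncoder):

```python
import pandas as pd

# toy frame standing in for the raw data; 'Title' still holds strings
df = pd.DataFrame({'Title': ['Mr', 'Mme', 'Mr', 'Miss'],
                   'Age': [22, 38, 26, 35]})

# one-hot encode the categorical column so every feature is numeric
encoded = pd.get_dummies(df, columns=['Title'])
print(encoded.dtypes)  # no string columns remain
```

After this step, `encoded` can be passed to SelectKBest without the conversion error.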
Is it printing the column labels in the first line?
If so, assign the data properly by starting from the second row: array[1:,:].
Otherwise, look into where the "Mme" string is located so you understand how the code is fetching it.
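To locate a stray string like 'Mme', one approach (a sketch, using a tiny stand-in for the real `titanic` frame) is to check which columns still hold non-numeric data:

```python
import pandas as pd

# toy stand-in for the real titanic frame
titanic = pd.DataFrame({'Pclass': [3, 1], 'Title': ['Mr', 'Mme']})

# columns that still contain Python objects (usually strings)
text_cols = titanic.select_dtypes(include='object').columns
print(list(text_cols))          # ['Title']

# per-column check for the offending value (Title is True here)
print(titanic.eq('Mme').any())
```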
I am a python beginner.
I have the following pandas DataFrame, with only two columns; "Time" and "Input".
I want to loop over the "Input" column with a window size w = 3 (three consecutive values): for every window whose items are all 1s, keep the first item as 1 and change the remaining values to 0s.
index Time Input
0 11 0
1 22 0
2 33 0
3 44 1
4 55 1
5 66 1
6 77 0
7 88 0
8 99 0
9 1010 0
10 1111 1
11 1212 1
12 1313 1
13 1414 0
14 1515 0
My intended output is as follows
index Time Input What_I_got What_I_Want
0 11 0 0 0
1 22 0 0 0
2 33 0 0 0
3 44 1 1 1
4 55 1 1 0
5 66 1 1 0
6 77 1 1 1
7 88 1 0 0
8 99 1 0 0
9 1010 0 0 0
10 1111 1 1 1
11 1212 1 0 0
12 1313 1 0 0
13 1414 0 0 0
14 1515 0 0 0
What should I do to get the desired output? Am I missing something in my code?
import pandas as pd
import re
# join the 0/1 column into one string, replace each non-overlapping '111' with '100',
# then split the characters back into an integer Series
pd.Series(list(re.sub('111', '100', ''.join(df.Input.astype(str))))).astype(int)
Out[23]:
0 0
1 0
2 0
3 1
4 0
5 0
6 1
7 0
8 0
9 0
10 1
11 0
12 0
13 0
14 0
dtype: int32
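Since the asker is a beginner, here is the same logic unpacked into an explicit loop (a sketch over a plain list of 0/1 values like the Input column; `collapse_runs` is a hypothetical helper name):

```python
def collapse_runs(values, w=3):
    """Wherever w consecutive 1s start, keep the first 1 and zero the next w-1."""
    out = list(values)
    i = 0
    while i <= len(out) - w:
        if all(out[i + j] == 1 for j in range(w)):
            for j in range(1, w):
                out[i + j] = 0
            i += w   # jump past the window, matching re.sub's non-overlapping behavior
        else:
            i += 1
    return out

print(collapse_runs([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0]))
# → [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
```

Note that a run of six 1s becomes 1,0,0,1,0,0, just as the regex substitution produces.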
I tried to sum multiple rows, excluding the Hour and Date columns, but I get the error:
"ValueError: cannot join with no overlapping index names"
Exact data
Hour Input Date Total DA DB CA CB X Y Z Z1 Z2
0 A 9/23/2021 14570 6816 636 6821 297 14213 335 9 13 0
0 B 9/23/2021 147864 63746 10186 63746 10186 147821 0 42 1 0
1 A 9/23/2021 126681 63180 191 63178 132 126606 34 36 5 0
1 B 9/23/2021 33119 1 16558 1 16559 33106 0 13 0 0
2 A 9/23/2021 11550 5398 653 5395 104 10991 549 2 8 0
2 B 9/23/2021 25197 0 12599 0 12598 25176 0 21 0 0
3 A 9/23/2021 259 0 157 0 102 204 55 0 0 0
3 B 9/23/2021 14379 0 7189 0 7190 14347 0 32 0 0
Required output
Hour Input Date Total DA DB CA CB X Y Z Z1 Z2
0 A 9/23/2021 162434 70562 10822 70567 10483 162034 335 51 14 0
1 A 9/23/2021 159800 63181 16749 63179 16691 159712 34 49 5 0
2 A 9/23/2021 36747 5398 13252 5395 12702 36167 549 23 8 0
3 A 9/23/2021 14638 0 7346 0 7292 14551 55 32 0 0
I used the following script:
column_list = list(df_output)
column_list.remove('Hour')
df_output = df[df_output].sum(axis=1)
IIUC use:
df_output = df.groupby(['Hour','Date'], as_index=False).sum(numeric_only=True)
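A runnable sketch that reproduces the required per-Hour totals (toy data mirroring the first columns of the table above; only the Total column is carried through for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'Hour':  [0, 0, 1, 1],
    'Input': ['A', 'B', 'A', 'B'],
    'Date':  ['9/23/2021'] * 4,
    'Total': [14570, 147864, 126681, 33119],
})

# one row per (Hour, Date); numeric columns such as Total are summed,
# non-numeric columns such as Input are dropped
out = df.groupby(['Hour', 'Date'], as_index=False).sum(numeric_only=True)
print(out)
```

Grouping by Hour (rather than Input) is what collapses the A and B rows of each hour into one, matching the required output's totals.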
Let's say I have a dataframe like this:
>> Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44 1 96 1 40 1 88 0 81
1 2017-05-01 State NY 0 42 0 55 1 92 1 82 0 38
2 2017-06-01 State NY 1 11 0 7 1 35 0 70 1 61
3 2017-07-01 State NY 1 12 1 80 1 83 1 47 1 44
4 2017-08-01 State NY 1 63 1 48 0 61 0 5 0 20
5 2017-09-01 State NY 1 56 1 92 0 55 0 45 1 17
I'd like to replace the values in the _rank columns with NaN wherever the corresponding _flag is zero, to get something like this:
>> Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44.0 1 96.0 1 40.0 1 88.0 0 NaN
1 2017-05-01 State NY 0 NaN 0 NaN 1 92.0 1 82.0 0 NaN
2 2017-06-01 State NY 1 11.0 0 NaN 1 35.0 0 NaN 1 61.0
3 2017-07-01 State NY 1 12.0 1 80.0 1 83.0 1 47.0 1 44.0
4 2017-08-01 State NY 1 63.0 1 48.0 0 NaN 0 NaN 0 NaN
5 2017-09-01 State NY 1 56.0 1 92.0 0 NaN 0 NaN 1 17.0
Which is fairly simple. This is my approach for the same:
for k in variables:
    dt[k+'_rank'] = np.where(dt[k+'_flag']==0, np.nan, dt[k+'_rank'])
Although this works fine for a smaller dataset, it takes a significant amount of time to process a dataframe with a very high number of columns and entries. So is there an optimized way of achieving the same without iteration?
P.S. There are other payloads apart from _rank and _flag in the data.
Thanks in advance
Use .str.endswith to select the columns that end with _flag, then strip the flag label and append rank to get the corresponding _rank column names. Finally, use np.where to set the _rank values to NaN wherever the corresponding _flag value is 0:
flags = df.columns[df.columns.str.endswith('_flag')]
ranks = flags.str.rstrip('flag') + 'rank'
df[ranks] = np.where(df[flags].eq(0), np.nan, df[ranks])
OR, it is also possible to use DataFrame.mask:
df[ranks] = df[ranks].mask(df[flags].eq(0).to_numpy())
Result:
# print(df)
Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44.0 1 96.0 1 40.0 1 88.0 0 NaN
1 2017-05-01 State NY 0 NaN 0 NaN 1 92.0 1 82.0 0 NaN
2 2017-06-01 State NY 1 11.0 0 NaN 1 35.0 0 NaN 1 61.0
3 2017-07-01 State NY 1 12.0 1 80.0 1 83.0 1 47.0 1 44.0
4 2017-08-01 State NY 1 63.0 1 48.0 0 NaN 0 NaN 0 NaN
5 2017-09-01 State NY 1 56.0 1 92.0 0 NaN 0 NaN 1 17.0
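One caveat worth noting: `str.rstrip('flag')` strips a *set* of characters from the right, not the literal suffix. It works here because the underscore separator stops the stripping, but an explicit suffix replacement states the intent directly. A sketch of the same mapping on a hypothetical two-column frame:

```python
import numpy as np
import pandas as pd

# hypothetical small frame with one flag/rank pair
df = pd.DataFrame({'a_flag': [1, 0], 'a_rank': [44, 42]})

flags = df.columns[df.columns.str.endswith('_flag')]
ranks = flags.str.replace('_flag', '_rank', regex=False)  # literal suffix swap
df[ranks] = df[ranks].mask(df[flags].eq(0).to_numpy())    # NaN where flag is 0
print(df)
```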
I have a pandas data frame dat as below:
0 1 0 1 0 1
0 A 23 0.1 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.2 235 87 5
3 D 13 0.8 567 42 6
4 E 5 0.9 356 12 2
As you can see above, the column labels are 0,1,0,1,0,1, etc. I want to rename them back so they start from 0,1,2,3,4,..., and I did the following:
dat = dat.reset_index(drop=True)
The column labels were not changed. How do I rename them in this case? Thanks in advance.
dat.columns = range(dat.shape[1])
There are quite a few ways:
dat = dat.rename(columns = lambda x: dat.columns.get_loc(x))
Or
dat = dat.rename(columns = dict(zip(dat.columns, range(dat.shape[1]))))
Or
dat = dat.set_axis(pd.RangeIndex(dat.shape[1]), axis=1)
Out[677]:
0 1 2 3 4 5
0 A 23 0.10 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.20 235 87 5
3 D 13 0.80 567 42 6
4 E 5 0.90 356 12 2