Handling missing data of age in classification [duplicate] - python-3.x

I have the Titanic dataset. It has many attributes, and I was working mainly on:
1. Age
2. Embarked (the port from which a passenger embarked; there are 3 ports in total: S, Q and C)
3. Survived (0 for did not survive, 1 for survived)
I was filtering out the useless data. Then I needed to fill the null values present in Age, so I counted how many passengers survived and didn't survive for each Embarked value, i.e. S, Q and C.
I found the mean age of passengers who survived and who did not survive after embarking from each of the S, Q and C ports. But now I have no idea how to fill these 6 values (3 for survived from each of S, Q and C, and 3 for those who did not survive from each; so 6 in total) into the original Titanic Age column. If I simply do titanic.Age.fillna('one of the six values'), it will fill all the null values of Age with that one value, which I don't want.
After spending some time, I tried this:
titanic[titanic.Survived==1][titanic.Embarked=='S'].Age.fillna(SurvivedS.Age.mean(),inplace=True)
titanic[titanic.Survived==1][titanic.Embarked=='Q'].Age.fillna(SurvivedQ.Age.mean(),inplace=True)
titanic[titanic.Survived==1][titanic.Embarked=='C'].Age.fillna(SurvivedC.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='S'].Age.fillna(DidntSurvivedS.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='Q'].Age.fillna(DidntSurvivedQ.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='C'].Age.fillna(DidntSurvivedC.Age.mean(),inplace=True)
This showed no error, but it still doesn't work. Any idea what I should do?

I think you need groupby with apply and fillna by the group mean. (Your original attempt runs without error because chained indexing like titanic[titanic.Survived==1][titanic.Embarked=='S'] returns a copy, so the inplace fillna fills that temporary copy, not the original DataFrame.)
titanic['age'] = (titanic.groupby(['survived','embarked'])['age']
                         .apply(lambda x: x.fillna(x.mean())))
import seaborn as sns
titanic = sns.load_dataset('titanic')
#check NaN rows in age
print (titanic[titanic['age'].isnull()].head(10))
    survived  pclass     sex  age  sibsp  parch      fare embarked   class  \
5          0       3    male  NaN      0      0    8.4583        Q   Third
17         1       2    male  NaN      0      0   13.0000        S  Second
19         1       3  female  NaN      0      0    7.2250        C   Third
26         0       3    male  NaN      0      0    7.2250        C   Third
28         1       3  female  NaN      0      0    7.8792        Q   Third
29         0       3    male  NaN      0      0    7.8958        S   Third
31         1       1  female  NaN      1      0  146.5208        C   First
32         1       3  female  NaN      0      0    7.7500        Q   Third
36         1       3    male  NaN      0      0    7.2292        C   Third
42         0       3    male  NaN      0      0    7.8958        C   Third

      who  adult_male deck  embark_town alive  alone
5     man        True  NaN   Queenstown    no   True
17    man        True  NaN  Southampton   yes   True
19  woman       False  NaN    Cherbourg   yes   True
26    man        True  NaN    Cherbourg    no   True
28  woman       False  NaN   Queenstown   yes   True
29    man        True  NaN  Southampton    no   True
31  woman       False    B    Cherbourg   yes  False
32  woman       False  NaN   Queenstown   yes   True
36    man        True  NaN    Cherbourg   yes   True
42    man        True  NaN    Cherbourg    no   True
idx = titanic[titanic['age'].isnull()].index
titanic['age'] = (titanic.groupby(['survived','embarked'])['age']
                         .apply(lambda x: x.fillna(x.mean())))
#check if the values were replaced
print (titanic.loc[idx].head(10))
    survived  pclass     sex        age  sibsp  parch      fare embarked  \
5          0       3    male  30.325000      0      0    8.4583        Q
17         1       2    male  28.113184      0      0   13.0000        S
19         1       3  female  28.973671      0      0    7.2250        C
26         0       3    male  33.666667      0      0    7.2250        C
28         1       3  female  22.500000      0      0    7.8792        Q
29         0       3    male  30.203966      0      0    7.8958        S
31         1       1  female  28.973671      1      0  146.5208        C
32         1       3  female  22.500000      0      0    7.7500        Q
36         1       3    male  28.973671      0      0    7.2292        C
42         0       3    male  33.666667      0      0    7.8958        C

     class    who  adult_male deck  embark_town alive  alone
5    Third    man        True  NaN   Queenstown    no   True
17  Second    man        True  NaN  Southampton   yes   True
19   Third  woman       False  NaN    Cherbourg   yes   True
26   Third    man        True  NaN    Cherbourg    no   True
28   Third  woman       False  NaN   Queenstown   yes   True
29   Third    man        True  NaN  Southampton    no   True
31   First  woman       False    B    Cherbourg   yes  False
32   Third  woman       False  NaN   Queenstown   yes   True
36   Third    man        True  NaN    Cherbourg   yes   True
42   Third    man        True  NaN    Cherbourg    no   True
#check mean values
print (titanic.groupby(['survived','embarked'])['age'].mean())
survived  embarked
0         C           33.666667
          Q           30.325000
          S           30.203966
1         C           28.973671
          Q           22.500000
          S           28.113184
Name: age, dtype: float64
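For completeness, the same fill can be done without apply by using transform, which broadcasts each group's mean back onto the original row index so fillna can align it row by row. A minimal sketch, assuming the same seaborn dataset as above:
import seaborn as sns

titanic = sns.load_dataset('titanic')

#transform('mean') returns a Series aligned with titanic's index,
#holding each row's (survived, embarked) group mean of age
group_means = titanic.groupby(['survived','embarked'])['age'].transform('mean')
titanic['age'] = titanic['age'].fillna(group_means)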

Related

Check whether one column's data type is number or NaN in Pandas

Given a dataframe df as follows:
id room area check
0 1 A-102 world NaN
1 2 NaN 24 room name is not valid
2 3 B309 NaN NaN
3 4 C·102 25 room name is not valid
4 5 E_1089 hello room name is not valid
5 6 27 NaN NaN
6 7 27 NaN NaN
I want to check whether the area column has a valid data format: if a value is either a number or NaN, then consider it valid; otherwise, update the check column with 'area is not a number'.
I tried df.loc[df.area.str.contains('^\d+$', na = True), 'check'] = 'area is not a number', but did not get what I needed.
How could I get an expected result like this:
id room area check
0 1 A-102 world area is not a number
1 2 NaN 24 room name is not valid
2 3 B309 NaN NaN
3 4 C·102 25 room name is not valid
4 5 E_1089 hello room name is not valid; area is not a number
5 6 27 NaN NaN
6 7 27 NaN NaN
Thanks for your help in advance.
You are close, you only need to invert the mask with ~:
df.loc[~df.area.str.contains('^\d+$', na = True), 'check'] = 'area is not a number'
print(df)
id room area check
0 1 A-102 world area is not a number
1 2 NaN 24 NaN
2 3 B309 NaN NaN
3 4 C·102 25 NaN
4 5 E_1089 hello area is not a number
5 6 27 NaN NaN
6 7 27 NaN NaN
Or use Series.where:
df['check'] = df['check'].where(df.area.str.contains('^\d+$', na = True),
                                'area is not a number')
EDIT:
import numpy as np

m1 = df.room.str.contains('([^a-zA-Z\d\-])', na = True)
m2 = df.area.str.contains('^\d+$', na = True)
v1 = 'room name is not valid'
v2 = 'area is not a number'
df['check'] = np.where(m1 & ~m2, v1 + ', ' + v2,
              np.where(m1, v1,
              np.where(~m2, v2, None)))
print(df)
id room area check
0 1 A-102 world area is not a number
1 2 NaN 24 room name is not valid
2 3 B309 NaN None
3 4 C 102 25 room name is not valid
4 5 E_1089 hello room name is not valid, area is not a number
5 6 27 NaN None
6 7 27 NaN None
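As an alternative to the regex test, pd.to_numeric with errors='coerce' turns anything non-numeric into NaN, so "valid area" reduces to "numeric or already NaN". A minimal sketch of the simple (non-combined) assignment, with the sample df rebuilt here so it runs standalone:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                   'room': ['A-102', np.nan, 'B309', 'C·102', 'E_1089', '27', '27'],
                   'area': ['world', '24', np.nan, '25', 'hello', np.nan, np.nan],
                   'check': [np.nan, 'room name is not valid', np.nan,
                             'room name is not valid', 'room name is not valid',
                             np.nan, np.nan]})

#non-numeric strings like 'world' coerce to NaN; real numbers survive
numeric = pd.to_numeric(df['area'], errors='coerce')

#valid means: parses as a number, or was NaN to begin with
valid_area = numeric.notna() | df['area'].isna()
df.loc[~valid_area, 'check'] = 'area is not a number'
Unlike '^\d+$', this also accepts decimals and negative numbers, which may or may not be what you want; combining both messages still needs the np.where chain from the EDIT above.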

Optimized way of modifying a column based on another column of a dataframe

Let's say I have a dataframe like this:
>> Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44 1 96 1 40 1 88 0 81
1 2017-05-01 State NY 0 42 0 55 1 92 1 82 0 38
2 2017-06-01 State NY 1 11 0 7 1 35 0 70 1 61
3 2017-07-01 State NY 1 12 1 80 1 83 1 47 1 44
4 2017-08-01 State NY 1 63 1 48 0 61 0 5 0 20
5 2017-09-01 State NY 1 56 1 92 0 55 0 45 1 17
I'd like to replace the values in the _rank columns with NaN wherever the corresponding _flag is zero, to get something like this:
>> Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44.0 1 96.0 1 40.0 1 88.0 0 NaN
1 2017-05-01 State NY 0 NaN 0 NaN 1 92.0 1 82.0 0 NaN
2 2017-06-01 State NY 1 11.0 0 NaN 1 35.0 0 NaN 1 61.0
3 2017-07-01 State NY 1 12.0 1 80.0 1 83.0 1 47.0 1 44.0
4 2017-08-01 State NY 1 63.0 1 48.0 0 NaN 0 NaN 0 NaN
5 2017-09-01 State NY 1 56.0 1 92.0 0 NaN 0 NaN 1 17.0
This is fairly simple; here is my approach:
for k in variables:
    dt[k+'_rank'] = np.where(dt[k+'_flag']==0, np.nan, dt[k+'_rank'])
Although this works fine for a smaller dataset, it takes a significant amount of time to process a dataframe with a very large number of columns and rows. So is there an optimized way of achieving the same without iteration?
P.S. There are other columns apart from the _rank and _flag ones in the data.
Thanks in advance
Use .str.endswith to select the columns that end with _flag, then use rstrip to strip the flag label and append the rank label, giving the corresponding _rank column names. Then use np.where to set NaN in the _rank columns wherever the corresponding value in the flag columns is 0:
flags = df.columns[df.columns.str.endswith('_flag')]
ranks = flags.str.rstrip('flag') + 'rank'
df[ranks] = np.where(df[flags].eq(0), np.nan, df[ranks])
Or, it is also possible to use DataFrame.mask:
df[ranks] = df[ranks].mask(df[flags].eq(0).to_numpy())
Result:
# print(df)
Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44.0 1 96.0 1 40.0 1 88.0 0 NaN
1 2017-05-01 State NY 0 NaN 0 NaN 1 92.0 1 82.0 0 NaN
2 2017-06-01 State NY 1 11.0 0 NaN 1 35.0 0 NaN 1 61.0
3 2017-07-01 State NY 1 12.0 1 80.0 1 83.0 1 47.0 1 44.0
4 2017-08-01 State NY 1 63.0 1 48.0 0 NaN 0 NaN 0 NaN
5 2017-09-01 State NY 1 56.0 1 92.0 0 NaN 0 NaN 1 17.0
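One caveat on the rename step: str.rstrip('flag') strips a set of characters, not the literal suffix, and only works here because each prefix ends in '_', which is not in that set. A suffix-safe sketch on a small stand-in frame (the column names here are illustrative):
import numpy as np
import pandas as pd

#a tiny stand-in frame with the same _flag/_rank naming convention
df = pd.DataFrame({'a_flag': [1, 0, 1], 'a_rank': [44, 42, 11],
                   'b_flag': [1, 0, 0], 'b_rank': [96, 55, 7]})

#replace the literal '_flag' suffix instead of stripping characters
flags = df.columns[df.columns.str.endswith('_flag')]
ranks = flags.str.replace(r'_flag$', '_rank', regex=True)

#same masking as above: NaN out a rank wherever its flag is 0
df[ranks] = df[ranks].mask(df[flags].eq(0).to_numpy())
print(df)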

Subtract from all columns in dataframe row by the value in a Series when indexes match

I am trying to subtract 1 from all columns in the rows of a DataFrame that have a matching index in a list.
For example, if I have a DataFrame like this one:
df = pd.DataFrame({'AMOS Admin': [1,1,0,0,2,2], 'MX Programs': [0,0,1,1,0,0], 'Material Management': [2,2,2,2,1,1]})
print(df)
AMOS Admin MX Programs Material Management
0 1 0 2
1 1 0 2
2 0 1 2
3 0 1 2
4 2 0 1
5 2 0 1
I want to subtract 1 from all columns where index is in [2, 3] so that the end result is:
AMOS Admin MX Programs Material Management
0 1 0 2
1 1 0 2
2 -1 0 1
3 -1 0 1
4 2 0 1
5 2 0 1
Having found no way to do this, I created a Series:
sr = pd.Series([1,1], index=['2', '3'])
print(sr)
2 1
3 1
dtype: int64
However, applying the sub method as per this question results in a DataFrame with all NaN and new rows at the bottom.
AMOS Admin MX Programs Material Management
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
Any help would be most appreciated.
Thanks,
Juan
Using reindex with your sr, then subtracting with values:
df.loc[:] = df.values - sr.reindex(df.index, fill_value=0).values[:, None]
df
Out[1117]:
AMOS Admin MX Programs Material Management
0 1 0 2
1 1 0 2
2 -1 0 1
3 -1 0 1
4 2 0 1
5 2 0 1
If what you want to do is that specific, why don't you just:
df.loc[[2, 3], :] = df.loc[[2, 3], :].subtract(1)
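For the record, the all-NaN result (and the duplicated rows 2 and 3 at the bottom) in the question comes from sr being built with the string labels '2' and '3' while df has an integer index, so nothing aligns during subtraction. A minimal sketch of the label-aligned version with DataFrame.sub(axis=0), using an integer-indexed sr:
import pandas as pd

df = pd.DataFrame({'AMOS Admin': [1, 1, 0, 0, 2, 2],
                   'MX Programs': [0, 0, 1, 1, 0, 0],
                   'Material Management': [2, 2, 2, 2, 1, 1]})

#integer labels matching df.index; with string labels nothing would align
sr = pd.Series([1, 1], index=[2, 3])

#align sr on the row axis and subtract; rows absent from sr subtract 0
result = df.sub(sr.reindex(df.index, fill_value=0), axis=0)
print(result)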

Get all the rows with and without NaN in pandas dataframe

What is the most efficient way of splitting a pandas dataframe into the rows with and without NaN?
Input:
ID Gender Dependants Income Education Married
1 Male 2 500 Graduate Yes
2 NaN 4 2500 Graduate No
3 Female 3 NaN NaN Yes
4 Male NaN 7000 Graduate Yes
5 Female 4 500 Graduate NaN
6 Female 2 4500 Graduate Yes
The expected output without NaN is:
ID Gender Dependants Income Education Married
1 Male 2 500 Graduate Yes
6 Female 2 4500 Graduate Yes
The expected output with NaN is:
ID Gender Dependants Income Education Married
2 NaN 4 2500 Graduate No
3 Female 3 NaN NaN Yes
4 Male NaN 7000 Graduate Yes
5 Female 4 500 Graduate NaN
Use boolean indexing: check for missing values and use any to check for at least one True per row:
mask = df.isnull().any(axis=1)
df1 = df[~mask]
df2 = df[mask]
print (df1)
ID Gender Dependants Income Education Married
0 1 Male 2.0 500.0 Graduate Yes
5 6 Female 2.0 4500.0 Graduate Yes
print (df2)
ID Gender Dependants Income Education Married
1 2 NaN 4.0 2500.0 Graduate No
2 3 Female 3.0 NaN NaN Yes
3 4 Male NaN 7000.0 Graduate Yes
4 5 Female 4.0 500.0 Graduate NaN
Details:
print (df.isnull())
ID Gender Dependants Income Education Married
0 False False False False False False
1 False True False False False False
2 False False False True True False
3 False False True False False False
4 False False False False False True
5 False False False False False False
print (mask)
0 False
1 True
2 True
3 True
4 True
5 False
dtype: bool
And you can always use a more readable variant of the previous code, with notna and all, where df1 needs no inverted mask:
mask = df.notna().all(axis=1)
df1 = df[mask]
df2 = df[~mask]
Same exact result.
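A dropna-based variant gets both frames in two lines; a minimal sketch with the sample data rebuilt so it runs standalone:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'Gender': ['Male', np.nan, 'Female', 'Male', 'Female', 'Female'],
                   'Dependants': [2, 4, 3, np.nan, 4, 2],
                   'Income': [500, 2500, np.nan, 7000, 500, 4500],
                   'Education': ['Graduate', 'Graduate', np.nan, 'Graduate', 'Graduate', 'Graduate'],
                   'Married': ['Yes', 'No', 'Yes', 'Yes', np.nan, 'Yes']})

#rows with no NaN at all
df1 = df.dropna()

#everything else, i.e. rows containing at least one NaN
df2 = df.drop(df1.index)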

Groupby and if condition on a data frame in pandas

I have the below data frame (some month values are blank):
df =
 city  code  qty1  qty2 month type
  hyd     1    10    12     1    x
  hyd     2    12    21          y
  hyd     2    15    36          x
  hyd     4    25    44     3    z
 pune     1    10     1          x
 pune     3    12     2     2    y
 pune     1    15     3          x
 pune     2    25     4          x
  ban     2    10     1     1    X
  ban     4    10     2          x
  ban     2    12     3          x
  ban     1    15     4     3    y
I want to group by city and code and find both res1 and res2 based on the conditions below.
The result data frame is
result =
 city  code  res1  res2
  hyd     1   NaN    12
  hyd     2    27   NaN
  hyd     4   NaN   NaN
 pune     1    25   NaN
 pune     3   NaN   NaN
 pune     2    25   NaN
  ban     2    12    10
  ban     4    10   NaN
  ban     1   NaN   NaN
I have tried grouping and iterating over the result of groupby with the conditions, but got no result. Any help would be appreciated. Thanks
You can groupby, then calculate what you need one by one, then concat back:
g = df.groupby(['city','code'])
res1 = g.apply(lambda x : sum(x['qty1'][x['month']=='']))
res2 = g.apply(lambda x : sum(x['qty2'][(x['month']!='')&(x['type']=='x')]))
pd.concat([res1, res2], axis=1)
Out[135]:
            0   1
city code
ban  1      0   0
     2     12   0
     4     10   0
hyd  1      0  12
     2     27   0
     4      0   0
pune 1     25   0
     2     25   0
     3      0   0
IIUC
df = df.set_index(['city', 'code'])
cond1 = df.month.isnull()
df['res1'] = df[cond1].groupby(['city', 'code']).qty1.sum()
cond2 = df.month.notnull() & (df.type=='x')
df['res2'] = df[cond2].groupby(['city', 'code']).qty2.sum()
           qty1  qty2  month type  res1  res2
city code
hyd  1       10    12    1.0    x   NaN  12.0
     2       12    21    NaN    y  27.0   NaN
     2       15    36    NaN    x  27.0   NaN
     4       25    44    3.0    z   NaN   NaN
pune 1       10     1    NaN    x  25.0   NaN
     3       12     2    2.0    y   NaN   NaN
     1       15     3    NaN    x  25.0   NaN
     2       25     4    NaN    x  25.0   NaN
ban  2       10     1    1.0    x  12.0   1.0
     4       10     2    NaN    x  10.0   NaN
     2       12     3    NaN    x  12.0   1.0
     1       15     4    3.0    y   NaN   NaN
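If the goal is the compact result frame from the question rather than extra columns on df, the two group sums can be renamed and concatenated. A sketch assuming empty strings mark missing months (as in the first answer) and normalizing the question's stray capital 'X' to 'x':
import pandas as pd

df = pd.DataFrame({
    'city': ['hyd','hyd','hyd','hyd','pune','pune','pune','pune','ban','ban','ban','ban'],
    'code': [1, 2, 2, 4, 1, 3, 1, 2, 2, 4, 2, 1],
    'qty1': [10, 12, 15, 25, 10, 12, 15, 25, 10, 10, 12, 15],
    'qty2': [12, 21, 36, 44, 1, 2, 3, 4, 1, 2, 3, 4],
    'month': ['1', '', '', '3', '', '2', '', '', '1', '', '', '3'],
    'type': ['x', 'y', 'x', 'z', 'x', 'y', 'x', 'x', 'x', 'x', 'x', 'y']})

#sum qty1 where month is missing, qty2 where month is present and type is 'x'
res1 = df[df['month'] == ''].groupby(['city','code'])['qty1'].sum()
res2 = df[(df['month'] != '') & (df['type'] == 'x')].groupby(['city','code'])['qty2'].sum()

#outer concat keeps groups present in only one sum, filling the other with NaN
result = pd.concat([res1.rename('res1'), res2.rename('res2')], axis=1).reset_index()
print(result)
Note that a (city, code) pair matching neither condition (e.g. hyd/4) drops out entirely here; reindexing on all unique pairs would restore it as an all-NaN row.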
