Get all the rows with and without NaN in pandas dataframe - python-3.x

What is the most efficient way to split a pandas DataFrame into the rows that contain NaN and the rows that do not?
Input:
ID Gender Dependants Income Education Married
1 Male 2 500 Graduate Yes
2 NaN 4 2500 Graduate No
3 Female 3 NaN NaN Yes
4 Male NaN 7000 Graduate Yes
5 Female 4 500 Graduate NaN
6 Female 2 4500 Graduate Yes
The expected output without NaN is,
ID Gender Dependants Income Education Married
1 Male 2 500 Graduate Yes
6 Female 2 4500 Graduate Yes
The expected output with NaN is,
ID Gender Dependants Income Education Married
2 NaN 4 2500 Graduate No
3 Female 3 NaN NaN Yes
4 Male NaN 7000 Graduate Yes
5 Female 4 500 Graduate NaN

Use boolean indexing: check for missing values with isnull(), then use any(axis=1) to test whether each row has at least one True:
mask = df.isnull().any(axis=1)
df1 = df[~mask]
df2 = df[mask]
print (df1)
ID Gender Dependants Income Education Married
0 1 Male 2.0 500.0 Graduate Yes
5 6 Female 2.0 4500.0 Graduate Yes
print (df2)
ID Gender Dependants Income Education Married
1 2 NaN 4.0 2500.0 Graduate No
2 3 Female 3.0 NaN NaN Yes
3 4 Male NaN 7000.0 Graduate Yes
4 5 Female 4.0 500.0 Graduate NaN
Details:
print (df.isnull())
ID Gender Dependants Income Education Married
0 False False False False False False
1 False True False False False False
2 False False False True True False
3 False False True False False False
4 False False False False False True
5 False False False False False False
print (mask)
0 False
1 True
2 True
3 True
4 True
5 False
dtype: bool
Alternatively, you can build the mask for the complete rows directly with notna() and all(), so that no inversion is needed:
mask = df.notna().all(axis=1)
df1 = df[mask]
This gives the same df1 as above.
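For reference, here is a minimal self-contained sketch of the split, using the sample data from the question. The variable names has_nan, complete and incomplete are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Gender": ["Male", np.nan, "Female", "Male", "Female", "Female"],
    "Dependants": [2, 4, 3, np.nan, 4, 2],
    "Income": [500, 2500, np.nan, 7000, 500, 4500],
    "Education": ["Graduate", "Graduate", np.nan, "Graduate", "Graduate", "Graduate"],
    "Married": ["Yes", "No", "Yes", "Yes", np.nan, "Yes"],
})

# True for every row that has at least one missing value
has_nan = df.isna().any(axis=1)

complete = df[~has_nan]    # rows without any NaN
incomplete = df[has_nan]   # rows with at least one NaN

print(complete)
print(incomplete)
```

If only the complete rows are needed, df.dropna() returns the same rows as df[~has_nan].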

Related

Check whether one column's data type is number or NaN in Pandas

Given a dataframe df as follows:
id room area check
0 1 A-102 world NaN
1 2 NaN 24 room name is not valid
2 3 B309 NaN NaN
3 4 C·102 25 room name is not valid
4 5 E_1089 hello room name is not valid
5 6 27 NaN NaN
6 7 27 NaN NaN
I want to check whether the area column has a valid format: a value that is either a number or NaN counts as valid; otherwise, the check column should be updated with area is not a number.
I tried with df.loc[df.area.str.contains('^\d+$', na = True), 'check'] = 'area is not a number', but not get what I needed.
How could I get an expected result like this:
id room area check
0 1 A-102 world area is not a number
1 2 NaN 24 room name is not valid
2 3 B309 NaN NaN
3 4 C·102 25 room name is not valid
4 5 E_1089 hello room name is not valid; area is not a number
5 6 27 NaN NaN
6 7 27 NaN NaN
Thanks in advance for your help.
You are close; you only need to invert the mask with ~:
df.loc[~df.area.str.contains(r'^\d+$', na=True), 'check'] = 'area is not a number'
print(df)
id room area check
0 1 A-102 world area is not a number
1 2 NaN 24 NaN
2 3 B309 NaN NaN
3 4 C·102 25 NaN
4 5 E_1089 hello area is not a number
5 6 27 NaN NaN
6 7 27 NaN NaN
Or use Series.where, which keeps the existing check value where the condition is True and replaces it elsewhere:
df['check'] = df['check'].where(df.area.str.contains(r'^\d+$', na=True),
                                'area is not a number')
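EDIT: To combine both checks:
import numpy as np
# True where the room name is missing or contains a character other than a letter, digit or hyphen
m1 = df.room.str.contains(r'[^a-zA-Z\d\-]', na=True)
# True where the area consists only of digits or is missing
m2 = df.area.str.contains(r'^\d+$', na=True)
v1 = 'room name is not valid'
v2 = 'area is not a number'
df['check'] = np.where(m1 & ~m2, v1 + ', ' + v2,
              np.where(m1, v1,
              np.where(~m2, v2, None)))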
print(df)
id room area check
0 1 A-102 world area is not a number
1 2 NaN 24 room name is not valid
2 3 B309 NaN None
3 4 C 102 25 room name is not valid
4 5 E_1089 hello room name is not valid, area is not a number
5 6 27 NaN None
6 7 27 NaN None
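As a self-contained variant of the combined check, the sketch below rebuilds the sample data (assuming room and area are stored as strings) and joins the messages with "; " as in the question's expected output. The helper build_check is a hypothetical name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question; room and area are assumed to be strings
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "room": ["A-102", np.nan, "B309", "C·102", "E_1089", "27", "27"],
    "area": ["world", "24", np.nan, "25", "hello", np.nan, np.nan],
})

# Room is invalid when missing or when it contains a character
# other than a letter, digit or hyphen
bad_room = df["room"].str.contains(r"[^a-zA-Z\d\-]", na=True)
# Area is invalid when it is present and not made of digits only
bad_area = ~df["area"].str.contains(r"^\d+$", na=True)

def build_check(room_is_bad, area_is_bad):
    """Hypothetical helper: join the applicable messages, or return NaN."""
    parts = []
    if room_is_bad:
        parts.append("room name is not valid")
    if area_is_bad:
        parts.append("area is not a number")
    return "; ".join(parts) if parts else np.nan

df["check"] = [build_check(r, a) for r, a in zip(bad_room, bad_area)]
print(df)
```

Building the message per row avoids the nested np.where calls and scales more easily if further checks are added.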

Combine text from multiple rows in pandas

I want to merge the content of rows, but only where specific conditions are met.
Here is the test dataframe I am working on:
Date Desc Debit Credit Bal
0 04-08-2019 abcdef 45654 NaN 345.0
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.06
4 06-08-2019 xyz NaN 350.0 695.06
I want to merge each row that has NaN in Date into the previous row.
Output required:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654 NaN 345.06
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.0
4 06-08-2019 xyz NaN 350.0 695.06
Can anybody help me with this? I have tried the following:
for j in [x for x in range(lst[0], lst[-1]+1) if x not in lst]:
print (test.loc[j-1:j, ].apply(lambda x: ''.join(str(x)), axis=1))
But could not get the expected result.
You can forward-fill Date and group by the result:
d = df["Date"].ffill()
df.update(df.groupby(d).transform('sum'))
print(df)
Output:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 0.0 351.0
1 NaN abcdefjklmn 45654.0 0.0 351.0
2 05-08-2019 abd 45.0 0.0 345.0
3 06-08-2019 xyz 0.0 345.0 54645.0
Alternatively, concatenate the values by index position:
idx = test.loc[test["Date"].isna()].index
# Pass plain arrays so that str.cat joins by position instead of aligning on the index
test.loc[idx-1, "Desc"] = test.loc[idx-1, "Desc"].str.cat(test.loc[idx, "Desc"].to_numpy())
test.loc[idx-1, "Bal"] = (test.loc[idx-1, "Bal"].astype(str)
                          .str.cat(test.loc[idx, "Bal"].astype(str).to_numpy()))
## I tried to add the two values instead, but it didn't work as expected, giving 351.0
# test.loc[idx-1, "Bal"] = test.loc[idx-1]["Bal"].values + test.loc[idx]["Bal"].values
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 NaN 345.06.0
1 NaN jklmn NaN NaN 6
2 05-08-2019 abd 45.0 NaN 345
3 06-08-2019 xyz NaN 345.0 54645
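One caveat with grouping on the forward-filled Date is that it also merges separate rows that happen to share a date (rows 0 and 2 in the question's data). A possible sketch that merges only the continuation rows is shown below. It assumes that each row with a Date starts a new block, and that for every column except Desc the first non-missing value of the block should be kept; that second assumption is mine, since the question's expected output is ambiguous for Bal. The names block and merged are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Date": ["04-08-2019", np.nan, "04-08-2019", "05-08-2019", "06-08-2019"],
    "Desc": ["abcdef", "jklmn", "pqr", "abd", "xyz"],
    "Debit": [45654, np.nan, np.nan, 23, np.nan],
    "Credit": [np.nan, np.nan, 23, np.nan, 350.0],
    "Bal": [345.0, 6, 368.06, 345.06, 695.06],
})

# Every row that has a Date starts a new block; rows with NaN in Date
# inherit the block number of the row above them
block = df["Date"].notna().cumsum().rename("block")

merged = (
    df.groupby(block)
      .agg({
          "Date": "first",
          "Desc": "".join,      # concatenate the text of the block
          "Debit": "first",     # assumption: keep the first non-missing value
          "Credit": "first",
          "Bal": "first",
      })
      .reset_index(drop=True)
)
print(merged)
```

Unlike the expected output in the question, this drops the continuation row after merging it; keep the original df if that row is still needed.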

Update Column based on another column and Delete data from the other

Let's assume the df looks like this:
import pandas as pd
df = pd.DataFrame(data={'fname':['Anky','Anky','Tom','Harry','Harry','Harry'],'lname':['sur1','sur1','sur2','sur3','sur3','sur3'],'role':['','abc','def','ghi','','ijk'],'mobile':['08511663451212','+4471123456','0851166346','','0851166347',''],'Pmobile':['085116634512','1234567890','8885116634','','+353051166347','0987654321']})
import numpy as np
df.replace('',np.nan,inplace=True)
df:
fname lname role mobile Pmobile
0 Anky sur1 NaN 08511663451212 085116634512
1 Anky sur1 abc +4471123456 1234567890
2 Tom sur2 def 0851166346 8885116634
3 Harry sur3 ghi NaN NaN
4 Harry sur3 NaN 0851166347 +353051166347
5 Harry sur3 ijk NaN 0987654321
I want to update the column mobile with the values from Pmobile where the value starts with '08', '8' or '+353', and at the same time delete the value from Pmobile wherever it was copied to mobile.
Presently I am doing this with:
df.mobile.update(df['Pmobile'][df['Pmobile'].str.startswith(('08','8','+353'),na=False)])
df.Pmobile[df.mobile==df.Pmobile] = np.nan
df:
fname lname role mobile Pmobile
0 Anky sur1 NaN 085116634512 NaN
1 Anky sur1 abc +4471123456 1234567890
2 Tom sur2 def 8885116634 NaN
3 Harry sur3 ghi NaN NaN
4 Harry sur3 NaN +353051166347 NaN
5 Harry sur3 ijk NaN 0987654321
Is there a way to do this on the fly?
Thanks in advance. :)
You can use shift to shift the two columns left:
In[50]:
df.loc[df['Pmobile'].str.startswith(('08','8','+353'),na=False), ['mobile','Pmobile']] = df[['mobile','Pmobile']].shift(-1,axis=1)
df
Out[50]:
fname lname role mobile Pmobile
0 Anky sur1 NaN 085116634512 NaN
1 Anky sur1 abc +4471123456 1234567890
2 Tom sur2 def 8885116634 NaN
3 Harry sur3 ghi NaN NaN
4 Harry sur3 NaN +353051166347 NaN
5 Harry sur3 ijk NaN 0987654321
So use your condition to mask the rows of interest, and then assign the result of those 2 columns shifted left by 1 where the condition is met.
This leaves NaN where the value was shifted out and does nothing where the condition isn't met.
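For comparison, here is a minimal sketch of the same update written as two explicit assignments instead of a shift. It reuses the sample data from the question; the variable name mask is illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with empty strings already replaced by NaN
df = pd.DataFrame({
    "fname": ["Anky", "Anky", "Tom", "Harry", "Harry", "Harry"],
    "lname": ["sur1", "sur1", "sur2", "sur3", "sur3", "sur3"],
    "role": [np.nan, "abc", "def", "ghi", np.nan, "ijk"],
    "mobile": ["08511663451212", "+4471123456", "0851166346", np.nan, "0851166347", np.nan],
    "Pmobile": ["085116634512", "1234567890", "8885116634", np.nan, "+353051166347", "0987654321"],
})

# Rows whose Pmobile starts with one of the wanted prefixes (NaN counts as no match)
mask = df["Pmobile"].str.startswith(("08", "8", "+353"), na=False)

# Copy the matching values into mobile, then clear them from Pmobile
df.loc[mask, "mobile"] = df.loc[mask, "Pmobile"]
df.loc[mask, "Pmobile"] = np.nan
print(df)
```

This takes two statements rather than one, but it makes the copy and the delete visible as separate steps.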

Calculating rolling sum in a pandas dataframe on the basis of 2 variable constraints

I want to create a variable SumOfPrevious5OccurencesAtIDLevel, which is the sum of the previous 5 values of Var1 (ordered by the Date variable) within each ID (column 1). Where fewer than 5 previous values exist, it should take the value NA.
Sample Data and Output:
ID Date Var1 SumOfPrevious5OccurencesAtIDLevel
1 1/1/2018 0 NA
1 1/2/2018 1 NA
1 1/3/2018 2 NA
1 1/4/2018 3 NA
2 1/1/2018 4 NA
2 1/2/2018 5 NA
2 1/3/2018 6 NA
2 1/4/2018 7 NA
2 1/5/2018 8 NA
2 1/6/2018 9 30
2 1/7/2018 10 35
2 1/8/2018 11 40
Use groupby with transform, combined with rolling and shift:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
# sort by ID and Date if the data is not already sorted
df = df.sort_values(['ID','Date'])
df['new'] = df.groupby('ID')['Var1'].transform(lambda x: x.rolling(5).sum().shift())
print (df)
ID Date Var1 SumOfPrevious5OccurencesAtIDLevel new
0 1 2018-01-01 0 NaN NaN
1 1 2018-01-02 1 NaN NaN
2 1 2018-01-03 2 NaN NaN
3 1 2018-01-04 3 NaN NaN
4 2 2018-01-01 4 NaN NaN
5 2 2018-01-02 5 NaN NaN
6 2 2018-01-03 6 NaN NaN
7 2 2018-01-04 7 NaN NaN
8 2 2018-01-05 8 NaN NaN
9 2 2018-01-06 9 30.0 30.0
10 2 2018-01-07 10 35.0 35.0
11 2 2018-01-08 11 40.0 40.0
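The following self-contained sketch rebuilds the sample data and applies the same idea. Note that rolling(5).sum() at a given row includes that row itself, so the shift() moves each sum down by one row to make it cover the previous five values only.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [1] * 4 + [2] * 8,
    "Date": ["1/1/2018", "1/2/2018", "1/3/2018", "1/4/2018",
             "1/1/2018", "1/2/2018", "1/3/2018", "1/4/2018",
             "1/5/2018", "1/6/2018", "1/7/2018", "1/8/2018"],
    "Var1": list(range(12)),
})

df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = df.sort_values(["ID", "Date"])

# Per ID: sum over a window of 5 rows, then shift down one row so the
# window ends at the previous row instead of the current one
df["new"] = df.groupby("ID")["Var1"].transform(
    lambda s: s.rolling(5).sum().shift()
)
print(df)
```

ID 1 has only four rows, so it never accumulates five previous values and stays NaN throughout.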

Handling missing data of age in classification [duplicate]

I have the Titanic dataset. I was working mainly on these attributes:
1. Age
2. Embark (the port from which passengers embarked; there are 3 ports in total: S, Q and C)
3. Survived (0 for did not survive, 1 for survived)
I was filtering out the useless data, and then I needed to fill the null values in Age. So I counted how many passengers survived and didn't survive for each Embark value, i.e. S, Q and C.
I found the mean age of the passengers who survived and who did not survive for each of the ports S, Q and C. But now I have no idea how to fill these 6 values (3 for survivors from S, Q and C, and 3 for non-survivors from S, Q and C) into the original titanic Age column. If I simply do titanic.Age.fillna('one of the six values'), it will fill all the null values of Age with that one value, which I don't want.
After spending some time on it, I tried this:
titanic[titanic.Survived==1][titanic.Embarked=='S'].Age.fillna(SurvivedS.Age.mean(),inplace=True)
titanic[titanic.Survived==1][titanic.Embarked=='Q'].Age.fillna(SurvivedQ.Age.mean(),inplace=True)
titanic[titanic.Survived==1][titanic.Embarked=='C'].Age.fillna(SurvivedC.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='S'].Age.fillna(DidntSurvivedS.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='Q'].Age.fillna(DidntSurvivedQ.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='C'].Age.fillna(DidntSurvivedC.Age.mean(),inplace=True)
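This showed no error, but it still doesn't work. Any idea what I should do?
I think you need groupby with apply, filling the NaNs with fillna and the group mean:
# group_keys=False keeps the original index so the result aligns with titanic
titanic['age'] = (titanic.groupby(['survived','embarked'], group_keys=False)['age']
                         .apply(lambda x: x.fillna(x.mean())))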
import seaborn as sns
titanic = sns.load_dataset('titanic')
#check NaN rows in age
print (titanic[titanic['age'].isnull()].head(10))
survived pclass sex age sibsp parch fare embarked class \
5 0 3 male NaN 0 0 8.4583 Q Third
17 1 2 male NaN 0 0 13.0000 S Second
19 1 3 female NaN 0 0 7.2250 C Third
26 0 3 male NaN 0 0 7.2250 C Third
28 1 3 female NaN 0 0 7.8792 Q Third
29 0 3 male NaN 0 0 7.8958 S Third
31 1 1 female NaN 1 0 146.5208 C First
32 1 3 female NaN 0 0 7.7500 Q Third
36 1 3 male NaN 0 0 7.2292 C Third
42 0 3 male NaN 0 0 7.8958 C Third
who adult_male deck embark_town alive alone
5 man True NaN Queenstown no True
17 man True NaN Southampton yes True
19 woman False NaN Cherbourg yes True
26 man True NaN Cherbourg no True
28 woman False NaN Queenstown yes True
29 man True NaN Southampton no True
31 woman False B Cherbourg yes False
32 woman False NaN Queenstown yes True
36 man True NaN Cherbourg yes True
42 man True NaN Cherbourg no True
idx = titanic[titanic['age'].isnull()].index
titanic['age'] = (titanic.groupby(['survived','embarked'], group_keys=False)['age']
                         .apply(lambda x: x.fillna(x.mean())))
# check that the values were replaced
print (titanic.loc[idx].head(10))
survived pclass sex age sibsp parch fare embarked \
5 0 3 male 30.325000 0 0 8.4583 Q
17 1 2 male 28.113184 0 0 13.0000 S
19 1 3 female 28.973671 0 0 7.2250 C
26 0 3 male 33.666667 0 0 7.2250 C
28 1 3 female 22.500000 0 0 7.8792 Q
29 0 3 male 30.203966 0 0 7.8958 S
31 1 1 female 28.973671 1 0 146.5208 C
32 1 3 female 22.500000 0 0 7.7500 Q
36 1 3 male 28.973671 0 0 7.2292 C
42 0 3 male 33.666667 0 0 7.8958 C
class who adult_male deck embark_town alive alone
5 Third man True NaN Queenstown no True
17 Second man True NaN Southampton yes True
19 Third woman False NaN Cherbourg yes True
26 Third man True NaN Cherbourg no True
28 Third woman False NaN Queenstown yes True
29 Third man True NaN Southampton no True
31 First woman False B Cherbourg yes False
32 Third woman False NaN Queenstown yes True
36 Third man True NaN Cherbourg yes True
42 Third man True NaN Cherbourg no True
#check mean values
print (titanic.groupby(['survived','embarked'])['age'].mean())
survived embarked
0 C 33.666667
Q 30.325000
S 30.203966
1 C 28.973671
Q 22.500000
S 28.113184
Name: age, dtype: float64
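The attempt in the question has no effect because titanic[...][...].Age selects a temporary copy, so fillna(inplace=True) modifies that copy and never touches titanic itself. As a small sketch of the group-wise fill, the example below uses transform instead of apply, which always returns a result aligned with the original index. The toy DataFrame is made up for illustration and is not the real Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up toy data with the same column names as the seaborn titanic dataset
toy = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 1],
    "embarked": ["S", "S", "S", "C", "C", "C"],
    "age": [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
})

# Fill each missing age with the mean age of its (survived, embarked) group
toy["age"] = (
    toy.groupby(["survived", "embarked"])["age"]
       .transform(lambda s: s.fillna(s.mean()))
)
print(toy)
```

Here the missing age in the (0, S) group becomes 30.0 and the one in the (1, C) group becomes 20.0, the means of the known ages in each group.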
