Merge 2 pandas dataframes - python

I have 2 pandas dataframes:
data1=
sample ID  name  sex
0          a     male
1          b     male
2          c     male
3          d     male
4          e     male
data2=
samples  Diabetic  age
0        yes       43
1        yes       50
2        no        63
3        no        21
4        yes       44
I want to merge both data frames to end up with the following data frame
samples  Diabetic  age  name  sex
0        yes       43   a     male
1        yes       50   b     male
2        no        63   c     male
3        no        21   d     male
4        yes       44   e     male
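One possible way to get this result is a left merge that matches the samples column of data2 against the sample ID column of data1. This is only a sketch: the frames below are rebuilt by hand from the tables in the question, and the variable name merged is made up for illustration.

```python
import pandas as pd

# Frames rebuilt from the tables in the question.
data1 = pd.DataFrame({
    "sample ID": [0, 1, 2, 3, 4],
    "name": ["a", "b", "c", "d", "e"],
    "sex": ["male"] * 5,
})
data2 = pd.DataFrame({
    "samples": [0, 1, 2, 3, 4],
    "Diabetic": ["yes", "yes", "no", "no", "yes"],
    "age": [43, 50, 63, 21, 44],
})

# The key columns have different names, so use left_on / right_on.
# Both key columns survive the merge; drop the redundant one afterwards.
merged = (
    data2.merge(data1, left_on="samples", right_on="sample ID", how="left")
         .drop(columns="sample ID")
)
print(merged)
```

A left merge keeps every row of data2 even if a sample has no match in data1; an inner merge would drop such rows instead.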

Related

add column from a dataframe to another dataframe with same rows

I have a dataframe (df) that contains 30 000 rows
id Name Age
1 Joey 22
2 Anna 34
3 Jon 33
4 Amy 30
5 Kay 22
And another dataframe (df2) that contains the same columns plus Sport, but with some ids missing
id Name Age Sport
Jon 33 Tennis
5 Kay 22 Football
Joey 22 Basketball
4 Amy 30 Running
Anna 42 Dancing
I want the missing ids to appear in df2 next to the corresponding name
df2:
id Name Age Sport
3 Jon 33 Tennis
5 Kay 22 Football
1 Joey 22 Basketball
4 Amy 30 Running
2 Anna 42 Dancing
Can someone help? I am new to pandas and dataframes.

You can use .map with .fillna (df1 here is the dataframe called df in the question):
df2['id'] = df2['id'].replace('',np.nan,regex=True)\
    .fillna(df2['Name'].map(df1.set_index('Name')['id'])).astype(int)
print(df2)
id Name Age Sport
0 3 Jon 33 Tennis
1 5 Kay 22 Football
2 1 Joey 22 Basketball
3 4 Amy 30 Running
4 2 Anna 42 Dancing

First, join the two dataframes with pd.merge on your keys. I assume the keys are 'Name' and 'Age' in this case. Then fill the null id values in df2, using np.where and .isnull() to find them. Note that in the question's sample data Anna's Age differs between the two frames (34 vs 42), so her row only finds a match if you merge on 'Name' alone.
df3 = pd.merge(df2, df1, on=['Name', 'Age'], how='left')
df2['id'] = np.where(df3.id_x.isnull(), df3.id_y, df3.id_x).astype(int)
id name age sport
0 1 Joey 22 Tennis
1 2 Anna 34 Football
2 3 Jon 33 Basketball
3 4 Amy 30 Running
4 5 Kay 22 Dancing
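As a rough, self-contained sketch of the merge-then-np.where idea on the question's data: the frames are rebuilt by hand, the missing ids are assumed to be NaN (not empty strings), names are assumed to be unique, and the merge uses 'Name' alone because Anna's Age differs between the two frames.

```python
import numpy as np
import pandas as pd

# Frames rebuilt from the question; missing ids are assumed to be NaN.
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Joey", "Anna", "Jon", "Amy", "Kay"],
    "Age": [22, 34, 33, 30, 22],
})
df2 = pd.DataFrame({
    "id": [np.nan, 5, np.nan, 4, np.nan],
    "Name": ["Jon", "Kay", "Joey", "Amy", "Anna"],
    "Age": [33, 22, 22, 30, 42],
    "Sport": ["Tennis", "Football", "Basketball", "Running", "Dancing"],
})

# Left merge on Name only; the clashing id columns become id_x (from df2)
# and id_y (from df1). Row order of df2 is preserved.
df3 = df2.merge(df1[["id", "Name"]], on="Name", how="left", suffixes=("_x", "_y"))

# Keep the id df2 already had, otherwise take the one looked up in df1.
df2["id"] = np.where(df3["id_x"].isnull(), df3["id_y"], df3["id_x"]).astype(int)
print(df2)
```

If a name in df2 had no match in df1, id_y would be NaN for that row and the final astype(int) would raise, so it may be worth checking for leftover nulls first.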

Compare two dataframes and remove rows from a df based on a matching column value

I have two pandas df which look like this:
df1:
pid Name score age
100 Ram 3 36
101 Tony 2 40
101 Jack 4 56
200 Jill 6 30
df2
pid Name score age
100 Ram 3 36
101 Tony 2 40
101 John 4 51
101 Jack 9 32
200 Jill 6 30
Both df's are indexed by 'pid'. I would like to compare df1 & df2 based on the column 'score', i.e. I need to keep only those rows in df2 that match df1 on both the index and the value of score.
My expected result should be
new df2:
pid Name score age
100 Ram 3 36
101 Tony 2 40
101 John 4 51
200 Jill 6 30
Any help in this regard is highly appreciated.

Use merge on the columns pid and score. First turn the index into a column with reset_index, then set pid as the index again, and finally reindex by df2.columns so the new DataFrame has the same columns as df2:
df = (pd.merge(df1.reset_index(),
               df2.reset_index(), on=['score', 'pid'], how='left', suffixes=('_',''))
        .set_index('pid')
        .reindex(columns=df2.columns))
print (df)
Name score age
pid
100 Ram 3 36
101 Tony 2 40
101 John 4 51
200 Jill 6 30
Inputs:
print (df1)
Name score age
pid
100 Ram 3 36
101 Tony 2 40
101 Jack 4 56
200 Jill 6 30
print (df2)
Name score age
pid
100 Ram 3 36
101 Tony 2 40
101 John 4 51
101 Jack 9 32
200 Jill 6 30
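A variant of the same idea, sketched here with an inner merge that starts from df2 and keeps only its rows whose (pid, score) pair also occurs in df1. The frames are rebuilt from the inputs above; the names keys and filtered are made up for illustration, and the drop_duplicates step is an added precaution, not part of the original answer.

```python
import pandas as pd

# Inputs rebuilt from the printed frames; both are indexed by pid.
df1 = pd.DataFrame(
    {"Name": ["Ram", "Tony", "Jack", "Jill"],
     "score": [3, 2, 4, 6],
     "age": [36, 40, 56, 30]},
    index=pd.Index([100, 101, 101, 200], name="pid"),
)
df2 = pd.DataFrame(
    {"Name": ["Ram", "Tony", "John", "Jack", "Jill"],
     "score": [3, 2, 4, 9, 6],
     "age": [36, 40, 51, 32, 30]},
    index=pd.Index([100, 101, 101, 101, 200], name="pid"),
)

# Only the (pid, score) pairs of df1 are needed; de-duplicate them so a
# repeated pair cannot multiply rows of df2 in the merge.
keys = df1.reset_index()[["pid", "score"]].drop_duplicates()

# An inner merge keeps just the df2 rows that have a matching pair.
filtered = (
    df2.reset_index()
       .merge(keys, on=["pid", "score"], how="inner")
       .set_index("pid")
)
print(filtered)
```

Because the merge only brings in key columns from df1, no suffixes or reindexing are needed in this variant.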

ValueError in onehotencoding

I'm not able to encode this column
Sex
male
female
female
female
male
male
male
male
female
female
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labelencoder=LabelEncoder()
X[:,2]=labelencoder.fit_transform(X[:,2])
ohe=OneHotEncoder(categorical_features=X[2])
ohe.fit_transform(X)
I'm getting this error.
could not convert string to float: 'male'
Can anyone help me with this?

Demo:
In [6]: df
Out[6]:
Sex
0 male
1 female
2 female
3 female
4 male
5 male
6 male
7 male
8 female
9 female
In [7]: le = LabelEncoder()
In [8]: df['Sex'] = le.fit_transform(df['Sex'])
In [9]: df
Out[9]:
Sex
0 1
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 0
9 0
In [10]: df.dtypes
Out[10]:
Sex int64
dtype: object
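The demo above yields integer codes, not one-hot columns. If indicator columns are what is needed, one option is pandas.get_dummies, which is swapped in here in place of scikit-learn's OneHotEncoder. A minimal sketch on the same Sex column (the variable name dummies is made up):

```python
import pandas as pd

# The Sex column from the question.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "male", "female", "female"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(df["Sex"], prefix="Sex", dtype=int)
print(dummies)
```

The result has one column per category (Sex_female and Sex_male here), and it works directly on the string values, so no LabelEncoder step is required first.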

Handling missing data of age in classification [duplicate]

I have a Titanic dataset. Of its attributes, I was working mainly on:
1. Age
2. Embarked (the port from which passengers embarked; there are 3 ports in total: S, Q and C)
3. Survived (0 for did not survive, 1 for survived)
I was filtering out the useless data. Then I needed to fill the null values in Age. So I counted how many passengers survived and didn't survive for each Embarked value, i.e. S, Q and C.
I found the mean age of the passengers who survived and who did not survive after embarking from each of the S, Q and C ports. But now I have no idea how to fill these 6 values (3 for those who survived from each of S, Q and C, and 3 for those who did not, so 6 in total) into the original titanic Age column. If I simply do titanic.Age.fillna('With one of the six values'), it will fill all the null values of Age with that one value, which I don't want.
After spending some time on it, I tried this.
titanic[titanic.Survived==1][titanic.Embarked=='S'].Age.fillna(SurvivedS.Age.mean(),inplace=True)
titanic[titanic.Survived==1][titanic.Embarked=='Q'].Age.fillna(SurvivedQ.Age.mean(),inplace=True)
titanic[titanic.Survived==1][titanic.Embarked=='C'].Age.fillna(SurvivedC.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='S'].Age.fillna(DidntSurvivedS.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='Q'].Age.fillna(DidntSurvivedQ.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='C'].Age.fillna(DidntSurvivedC.Age.mean(),inplace=True)
This showed no error, but it still doesn't work. Any idea what I should do?

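I think you need groupby with apply, filling each group's missing values with fillna and the group mean:
# group_keys=False keeps the original row index so the result aligns on assignment
titanic['age'] = (titanic.groupby(['survived','embarked'], group_keys=False)['age']
                         .apply(lambda x: x.fillna(x.mean())))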
import seaborn as sns
titanic = sns.load_dataset('titanic')
#check NaN rows in age
print (titanic[titanic['age'].isnull()].head(10))
survived pclass sex age sibsp parch fare embarked class \
5 0 3 male NaN 0 0 8.4583 Q Third
17 1 2 male NaN 0 0 13.0000 S Second
19 1 3 female NaN 0 0 7.2250 C Third
26 0 3 male NaN 0 0 7.2250 C Third
28 1 3 female NaN 0 0 7.8792 Q Third
29 0 3 male NaN 0 0 7.8958 S Third
31 1 1 female NaN 1 0 146.5208 C First
32 1 3 female NaN 0 0 7.7500 Q Third
36 1 3 male NaN 0 0 7.2292 C Third
42 0 3 male NaN 0 0 7.8958 C Third
who adult_male deck embark_town alive alone
5 man True NaN Queenstown no True
17 man True NaN Southampton yes True
19 woman False NaN Cherbourg yes True
26 man True NaN Cherbourg no True
28 woman False NaN Queenstown yes True
29 man True NaN Southampton no True
31 woman False B Cherbourg yes False
32 woman False NaN Queenstown yes True
36 man True NaN Cherbourg yes True
42 man True NaN Cherbourg no True
idx = titanic[titanic['age'].isnull()].index
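# group_keys=False keeps the original row index so the result aligns on assignment
titanic['age'] = (titanic.groupby(['survived','embarked'], group_keys=False)['age']
                         .apply(lambda x: x.fillna(x.mean())))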
#check if values was replaced
print (titanic.loc[idx].head(10))
survived pclass sex age sibsp parch fare embarked \
5 0 3 male 30.325000 0 0 8.4583 Q
17 1 2 male 28.113184 0 0 13.0000 S
19 1 3 female 28.973671 0 0 7.2250 C
26 0 3 male 33.666667 0 0 7.2250 C
28 1 3 female 22.500000 0 0 7.8792 Q
29 0 3 male 30.203966 0 0 7.8958 S
31 1 1 female 28.973671 1 0 146.5208 C
32 1 3 female 22.500000 0 0 7.7500 Q
36 1 3 male 28.973671 0 0 7.2292 C
42 0 3 male 33.666667 0 0 7.8958 C
class who adult_male deck embark_town alive alone
5 Third man True NaN Queenstown no True
17 Second man True NaN Southampton yes True
19 Third woman False NaN Cherbourg yes True
26 Third man True NaN Cherbourg no True
28 Third woman False NaN Queenstown yes True
29 Third man True NaN Southampton no True
31 First woman False B Cherbourg yes False
32 Third woman False NaN Queenstown yes True
36 Third man True NaN Cherbourg yes True
42 Third man True NaN Cherbourg no True
#check mean values
print (titanic.groupby(['survived','embarked'])['age'].mean())
survived embarked
0 C 33.666667
Q 30.325000
S 30.203966
1 C 28.973671
Q 22.500000
S 28.113184
Name: age, dtype: float64
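The same per-group fill can also be sketched with transform('mean'), which returns a Series aligned to the original rows and so avoids apply altogether. The small frame below is made-up toy data (not the real Titanic set), used only so the example runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the seaborn dataset.
toy = pd.DataFrame({
    "survived": [1, 1, 1, 0, 0, 0],
    "embarked": ["S", "S", "S", "C", "C", "C"],
    "age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Mean age of each (survived, embarked) group, broadcast back to every row.
group_mean = toy.groupby(["survived", "embarked"])["age"].transform("mean")

# Fill only the missing ages with the mean of their own group.
toy["age"] = toy["age"].fillna(group_mean)
print(toy)
```

Here the missing age in the (1, S) group becomes 25.0 and the one in the (0, C) group becomes 45.0, while the existing ages are left untouched.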

Create column on two conditions with pandas

I'm using pandas for an analysis exercise. I want to create a new column whose value is the sum of two rows. The original data set is as follows...
Admit Gender Dept Freq
0 Admitted Male A 512
1 Rejected Male A 313
2 Admitted Female A 89
3 Rejected Female A 19
4 Admitted Male B 353
5 Rejected Male B 207
6 Admitted Female B 17
7 Rejected Female B 8
8 Admitted Male C 120
9 Rejected Male C 205
10 Admitted Female C 202
11 Rejected Female C 391
12 Admitted Male D 138
13 Rejected Male D 279
14 Admitted Female D 131
15 Rejected Female D 244
16 Admitted Male E 53
17 Rejected Male E 138
18 Admitted Female E 94
19 Rejected Female E 299
20 Admitted Male F 22
21 Rejected Male F 351
22 Admitted Female F 24
23 Rejected Female F 317
I want to create a new column utilizing the following data frame...
Dept Gender Freq
0 A Female 108
1 A Male 825
2 B Female 25
3 B Male 560
4 C Female 593
5 C Male 325
6 D Female 375
7 D Male 417
8 E Female 393
9 E Male 191
10 F Female 341
11 F Male 373
I want to create a new column in the first data frame using the Freq column of the second data frame. For example, I need to insert the value 108 wherever Dept and Gender are the same in both data frames (here Dept A, Female). The new data frame should look like this...
Admit Gender Dept Freq Total
0 Admitted Male A 512 825
1 Rejected Male A 313 825
2 Admitted Female A 89 108
3 Rejected Female A 19 108
4 Admitted Male B 353 560
5 Rejected Male B 207 560
6 Admitted Female B 17 25
7 Rejected Female B 8 25
I have tried the following code...
for i in data.iterrows():
    for j in total_freq.iterrows():
        if i[1].Gender == total_freq.Gender & i[1].Dept == total_freq.Dept:
            data['Total'] = total_freq.Freq
I get the following error... TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]
Any help to create the column with the correct values?

You can use transform
df['Total'] = df.groupby(['Dept', 'Gender']).Freq.transform('sum')
You get
Admit Gender Dept Freq Total
0 Admitted Male A 512 825
1 Rejected Male A 313 825
2 Admitted Female A 89 108
3 Rejected Female A 19 108
4 Admitted Male B 353 560
5 Rejected Male B 207 560
6 Admitted Female B 17 25
7 Rejected Female B 8 25
8 Admitted Male C 120 325
9 Rejected Male C 205 325
10 Admitted Female C 202 593
11 Rejected Female C 391 593
12 Admitted Male D 138 417
13 Rejected Male D 279 417
14 Admitted Female D 131 375
15 Rejected Female D 244 375
16 Admitted Male E 53 191
17 Rejected Male E 138 191
18 Admitted Female E 94 393
19 Rejected Female E 299 393
20 Admitted Male F 22 373
21 Rejected Male F 351 373
22 Admitted Female F 24 341
23 Rejected Female F 317 341

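You can use pandas.DataFrame.merge() to left join the totals from the second dataframe (df1 here) onto the first (df). First, rename Freq in the totals df so it does not clash with Freq in the first one. Pass the whole totals frame to merge, since the key columns Gender and Dept must be present on both sides.
df1 = df1.rename(columns={'Freq':'Total'})
df_totals = pd.merge(df, df1, how='left', on=['Gender', 'Dept'])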
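A self-contained sketch of this left join, using only the Dept A rows from the question to keep it short. The variable names data and total_freq follow the question's own code; totals and result are made up for illustration.

```python
import pandas as pd

# Dept A rows of the first data frame in the question.
data = pd.DataFrame({
    "Admit": ["Admitted", "Rejected", "Admitted", "Rejected"],
    "Gender": ["Male", "Male", "Female", "Female"],
    "Dept": ["A", "A", "A", "A"],
    "Freq": [512, 313, 89, 19],
})
# Matching rows of the totals data frame.
total_freq = pd.DataFrame({
    "Dept": ["A", "A"],
    "Gender": ["Female", "Male"],
    "Freq": [108, 825],
})

# Rename so the totals do not collide with the per-row Freq column.
totals = total_freq.rename(columns={"Freq": "Total"})

# Left join on both keys: each row picks up the total of its Dept/Gender.
result = data.merge(totals, how="left", on=["Gender", "Dept"])
print(result)
```

Compared with the groupby/transform answer, the merge is useful when the totals already exist in a separate frame, as they do in the question.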
