ValueError in OneHotEncoder - python-3.x

I'm not able to encode this column
Sex
male
female
female
female
male
male
male
male
female
female
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labelencoder=LabelEncoder()
X[:,2]=labelencoder.fit_transform(X[:,2])
ohe=OneHotEncoder(categorical_features=X[2])
ohe.fit_transform(X)
I'm getting this error:
ValueError: could not convert string to float: 'male'
Can anyone help me with this?

Demo:
In [6]: df
Out[6]:
Sex
0 male
1 female
2 female
3 female
4 male
5 male
6 male
7 male
8 female
9 female
In [7]: le = LabelEncoder()
In [8]: df['Sex'] = le.fit_transform(df['Sex'])
In [9]: df
Out[9]:
Sex
0 1
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 0
9 0
In [10]: df.dtypes
Out[10]:
Sex int64
dtype: object
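The demo above covers the label-encoding step. As for the error: categorical_features expected column indices (e.g. [2]), not the row X[2], and the argument has been removed from recent scikit-learn, where OneHotEncoder accepts string columns directly. A minimal sketch, assuming scikit-learn 1.2+:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# Modern OneHotEncoder encodes string categories directly; no LabelEncoder needed.
ohe = OneHotEncoder(sparse_output=False)  # on scikit-learn < 1.2, use sparse=False
encoded = ohe.fit_transform(df[['Sex']])

print(ohe.categories_)  # [array(['female', 'male'], dtype=object)]
print(encoded)          # one 0/1 column per category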


How to add another column when looping over a dataframe [duplicate]

My pandas dataframe looks like this:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 32917 271 88172 Male
2 18273 552 90291 Female
I want to replicate every row 3 times and reset the index to get:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
I tried solutions such as:
pd.concat([df[:5]]*3, ignore_index=True)
And:
df.reindex(np.repeat(df.index.values, df['ID']), method='ffill')
But none of them worked.
Solutions:
Use np.repeat:
Version 1:
Try using np.repeat:
newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0))
newdf.columns = df.columns
print(newdf)
The above code will output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
np.repeat repeats each row of df.values 3 times.
Then we restore the column names by assigning newdf.columns = df.columns.
Version 2:
You could also assign the column names in the first line, like below:
newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
Version 3:
You could shorten it with loc and only repeat the index, like below:
newdf = df.loc[np.repeat(df.index, 3)].reset_index(drop=True)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
I use reset_index to replace the index with a monotonic one (0, 1, 2, 3, 4, ...).
Without np.repeat:
Version 4:
You could use the built-in Index.repeat method on df.index, like below:
newdf = df.loc[df.index.repeat(3)].reset_index(drop=True)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
Remember to add reset_index to line up the index.
Version 5:
Or by using concat with sort_index, like below:
newdf = pd.concat([df] * 3).sort_index().reset_index(drop=True)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
Version 6:
You could also use loc with Python list multiplication and sorted, like below:
newdf = df.loc[sorted([*df.index] * 3)].reset_index(drop=True)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
Timings:
Timing with the following code:
import timeit
import pandas as pd
import numpy as np
df = pd.DataFrame({'Person': {0: 12345, 1: 32917, 2: 18273}, 'ID': {0: 882, 1: 271, 2: 552}, 'ZipCode': {0: 38182, 1: 88172, 2: 90291}, 'Gender': {0: 'Female', 1: 'Male', 2: 'Female'}})
def version1():
    newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0))
    newdf.columns = df.columns

def version2():
    newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)

def version3():
    newdf = df.loc[np.repeat(df.index, 3)].reset_index(drop=True)

def version4():
    newdf = df.loc[df.index.repeat(3)].reset_index(drop=True)

def version5():
    newdf = pd.concat([df] * 3).sort_index().reset_index(drop=True)

def version6():
    newdf = df.loc[sorted([*df.index] * 3)].reset_index(drop=True)
print('Version 1 Speed:', timeit.timeit('version1()', 'from __main__ import version1', number=20000))
print('Version 2 Speed:', timeit.timeit('version2()', 'from __main__ import version2', number=20000))
print('Version 3 Speed:', timeit.timeit('version3()', 'from __main__ import version3', number=20000))
print('Version 4 Speed:', timeit.timeit('version4()', 'from __main__ import version4', number=20000))
print('Version 5 Speed:', timeit.timeit('version5()', 'from __main__ import version5', number=20000))
print('Version 6 Speed:', timeit.timeit('version6()', 'from __main__ import version6', number=20000))
Output:
Version 1 Speed: 9.879425965991686
Version 2 Speed: 7.752138633004506
Version 3 Speed: 7.078321029010112
Version 4 Speed: 8.01169377300539
Version 5 Speed: 19.853051771002356
Version 6 Speed: 9.801617017001263
We can see that Versions 2 and 3 are faster than the others because both rely on np.repeat, and NumPy functions are fast since they are implemented in C.
Version 3 wins against Version 2 marginally because it indexes with loc instead of constructing a new DataFrame from raw values.
Version 5 is significantly slower because concat must copy all three DataFrames into a new one, and sort_index then makes another full pass over the result.
Fastest Version: Version 3.
These will repeat the indices and preserve the columns, as the OP demonstrated:
iloc version 1
df.iloc[np.arange(len(df)).repeat(3)]
iloc version 2
df.iloc[np.arange(len(df) * 3) // 3]
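To see why the integer-division trick in iloc version 2 works, here is a quick sketch (my own illustration) of the row positions it generates:
import numpy as np

n, k = 3, 3                  # n rows, k copies of each
idx = np.arange(n * k) // k  # integer division maps positions 0..8 -> 0,0,0,1,1,1,2,2,2
print(idx)                   # [0 0 0 1 1 1 2 2 2] -> each row taken k times, grouped in order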
Using concat:
pd.concat([df]*3).sort_index()
Out[129]:
Person ID ZipCode Gender
0 12345 882 38182 Female
0 12345 882 38182 Female
0 12345 882 38182 Female
1 32917 271 88172 Male
1 32917 271 88172 Male
1 32917 271 88172 Male
2 18273 552 90291 Female
2 18273 552 90291 Female
2 18273 552 90291 Female
I'm not sure why this was never proposed, but you can easily use df.index.repeat in conjunction with .loc:
new_df = df.loc[df.index.repeat(3)]
Output:
>>> new_df
Person ID ZipCode Gender
0 12345 882 38182 Female
0 12345 882 38182 Female
0 12345 882 38182 Female
1 32917 271 88172 Male
1 32917 271 88172 Male
1 32917 271 88172 Male
2 18273 552 90291 Female
2 18273 552 90291 Female
2 18273 552 90291 Female
You can try the following code:
df = df.iloc[df.index.repeat(3), :].reset_index(drop=True)
df.index.repeat(3) creates an index in which each value is repeated 3 times, and df.iloc[df.index.repeat(3), :] generates a dataframe with the rows in exactly the order returned by that index; reset_index(drop=True) then rebuilds a clean 0..n-1 index.
You can do it like this:
import numpy as np
import pandas as pd

def do_things(df, n_times):
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
    ndf = df.append(pd.DataFrame({'name': np.repeat(df.name.values, n_times)}))
    ndf = ndf.sort_values(by='name')
    ndf = ndf.reset_index(drop=True)
    return ndf

if __name__ == '__main__':
    df = pd.DataFrame({'name': ['Peter', 'Quill', 'Jackson']})
    n_times = 3
    print(do_things(df, n_times))
And with explanation...
import pandas as pd
import numpy as np
n_times = 3
df = pd.DataFrame({'name' : ['Peter', 'Quill', 'Jackson']})
# name
# 0 Peter
# 1 Quill
# 2 Jackson
# Duplicating data.
df = df.append(pd.DataFrame({'name' : np.repeat(df.name.values, n_times) }))
# name
# 0 Peter
# 1 Quill
# 2 Jackson
# 0 Peter
# 1 Peter
# 2 Peter
# 3 Quill
# 4 Quill
# 5 Quill
# 6 Jackson
# 7 Jackson
# 8 Jackson
# The DataFrame is sorted by 'name' column.
df = df.sort_values(by=['name'])
# name
# 2 Jackson
# 6 Jackson
# 7 Jackson
# 8 Jackson
# 0 Peter
# 0 Peter
# 1 Peter
# 2 Peter
# 1 Quill
# 3 Quill
# 4 Quill
# 5 Quill
# Resetting the index.
# You can play with drop=True and drop=False as parameters of `reset_index()`.
df = df.reset_index()
# index name
# 0 2 Jackson
# 1 6 Jackson
# 2 7 Jackson
# 3 8 Jackson
# 4 0 Peter
# 5 0 Peter
# 6 1 Peter
# 7 2 Peter
# 8 1 Quill
# 9 3 Quill
# 10 4 Quill
# 11 5 Quill
If you need to index your repeats (e.g. for a multi-index) and also base the number of repeats on a value in a column, you can do this:
someDF["RepeatIndex"] = someDF["RepeatBasis"].fillna(value=0).apply(lambda x: list(range(int(x))) if x > 0 else [])
superDF = someDF.explode("RepeatIndex").dropna(subset="RepeatIndex")
This gives a DataFrame in which each record is repeated however many times is indicated in the "RepeatBasis" column. The DataFrame also gets a "RepeatIndex" column, which you can combine with the existing index to make into a multi-index, preserving index uniqueness.
If anyone's wondering why you'd want to do such a thing, in my case it's when I get data in which frequencies have already been summarized and for whatever reason, I need to work with singular observations. (think of reverse-engineering a histogram)
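A runnable sketch of that idea, with a made-up 'value' column (the RepeatBasis and RepeatIndex names come from the snippet above):
import pandas as pd

someDF = pd.DataFrame({'value': ['a', 'b', 'c'], 'RepeatBasis': [2, 0, 3]})

someDF["RepeatIndex"] = (someDF["RepeatBasis"].fillna(value=0)
                         .apply(lambda x: list(range(int(x))) if x > 0 else []))
# Rows with an empty list explode to NaN and are dropped (zero repeats).
superDF = someDF.explode("RepeatIndex").dropna(subset=["RepeatIndex"])

# Append RepeatIndex to the existing index to keep index uniqueness.
superDF = superDF.set_index("RepeatIndex", append=True)
print(superDF)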
This question doesn't have enough answers yet! Here are some more ways to do this that are still missing and that allow chaining :)
# SQL-style cross-join
# (one line and counts replicas)
(
data
.join(pd.DataFrame(range(3), columns=["replica"]), how="cross")
.drop(columns="replica") # remove if you want to count replicas
)
# DataFrame.apply + Series.repeat
# (most readable, but potentially slow)
(
data
.apply(lambda x: x.repeat(3))
.reset_index(drop=True)
)
# DataFrame.explode
# (fun to have explosions in your code)
(
data
.assign(replica=lambda df: [[x for x in range(3)]] * len(df))
.explode("replica", ignore_index=True)
.drop(columns="replica") # or keep if you want to know which copy it is
)
(Edit: On a more serious note, using explode is useful if you need to count replicas and have a dynamic replica count per row. For example, if you have per-customer usage data with a start and end date, you can use the above to transform the data into monthly per-customer usage data.)
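For instance, a sketch of that monthly-usage idea (all column names hypothetical):
import pandas as pd

usage = pd.DataFrame({
    'customer': ['A', 'B'],
    'start': pd.to_datetime(['2023-01-01', '2023-03-01']),
    'end': pd.to_datetime(['2023-03-31', '2023-04-30']),
})

# One row per month covered by each customer's usage period
# (a dynamic replica count per row).
monthly = (
    usage
    .assign(month=lambda df: [list(pd.period_range(s, e, freq='M'))
                              for s, e in zip(df['start'], df['end'])])
    .explode('month', ignore_index=True)
)
print(monthly)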
And of course here is the snippet to create the data for testing:
data = pd.DataFrame([
[12345, 882, 38182, "Female"],
[32917, 271, 88172, "Male"],
[18273, 552, 90291, "Female"],
],
columns=["Person", "ID", "ZipCode", "Gender"]
)
Use pd.concat: concatenate three copies of the same DataFrame; it doesn't take much code:
df = pd.concat([df]*3, ignore_index=True)
print(df)
Person ID ZipCode Gender
0 12345 882 38182 Female
1 32917 271 88172 Male
2 18273 552 90291 Female
3 12345 882 38182 Female
4 32917 271 88172 Male
5 18273 552 90291 Female
6 12345 882 38182 Female
7 32917 271 88172 Male
8 18273 552 90291 Female
Note: ignore_index=True makes the index reset. Also note that this tiles the frame whole (rows 0, 1, 2, 0, 1, 2, ...) rather than grouping the copies of each row together; add a sort, as in Version 5 above, if you need the grouped order.
Could also use np.tile(); unlike np.repeat, tile repeats the whole index block-wise ([0, 1, 2, 0, 1, 2, ...]), which is why sort_index is needed to group the copies:
df.loc[np.tile(df.index,3)].sort_index().reset_index(drop=True)
Output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female

Merge if two string columns are substring of one column from another dataframe in Python

Given two dataframes as follow:
df1:
id address price
0 1 8563 Parker Ave. Lexington, NC 27292 3
1 2 242 Bellevue Lane Appleton, WI 54911 3
2 3 771 Greenview Rd. Greenfield, IN 46140 5
3 4 93 Hawthorne Street Lakeland, FL 33801 6
4 5 8952 Green Hill Street Gettysburg, PA 17325 3
5 6 7331 S. Sherwood Dr. New Castle, PA 16101 4
df2:
state street quantity
0 PA S. Sherwood 12
1 IN Hawthorne Street 3
2 NC Parker Ave. 7
Let's say if both state and street from df2 are contained in address from df1, then merge df2 to df1.
How could I do that in Pandas? Thanks.
The expected result df:
id address ... street quantity
0 1 8563 Parker Ave. Lexington, NC 27292 ... Parker Ave. 7.00
1 2 242 Bellevue Lane Appleton, WI 54911 ... NaN NaN
2 3 771 Greenview Rd. Greenfield, IN 46140 ... NaN NaN
3 4 93 Hawthorne Street Lakeland, FL 33801 ... NaN NaN
4 5 8952 Green Hill Street Gettysburg, PA 17325 ... NaN NaN
5 6 7331 S. Sherwood Dr. New Castle, PA 16101 ... S. Sherwood 12.00
[6 rows x 6 columns]
My testing code:
df2['addr'] = df2['state'].astype(str) + df2['street'].astype(str)
pat = '|'.join(r'\b{}\b'.format(x) for x in df2['addr'])
df1['addr']= df1['address'].str.extract('\('+ pat + ')', expand=False)
df = df1.merge(df2, on='addr', how='left')
Output:
id address ... street_y quantity_y
0 1 8563 Parker Ave. Lexington, NC 27292 ... NaN nan
1 2 242 Bellevue Lane Appleton, WI 54911 ... NaN nan
2 3 771 Greenview Rd. Greenfield, IN 46140 ... NaN nan
3 4 93 Hawthorne Street Lakeland, FL 33801 ... NaN nan
4 5 8952 Green Hill Street Gettysburg, PA 17325 ... NaN nan
5 6 7331 S. Sherwood Dr. New Castle, PA 16101 ... NaN nan
[6 rows x 10 columns]
TRY:
# Build one alternation pattern per column, extract street and state
# separately from the address, then merge on the extracted keys.
pat_state = f"({'|'.join(df2['state'])})"
pat_street = f"({'|'.join(df2['street'])})"
df1['street'] = df1['address'].str.extract(pat=pat_street, expand=False)
df1['state'] = df1['address'].str.extract(pat=pat_state, expand=False)
# A row counts as a match only if both pieces were found.
df1.loc[df1['state'].isna(), 'street'] = np.nan
df1.loc[df1['street'].isna(), 'state'] = np.nan
df1 = df1.merge(df2, on=['state', 'street'], how='left')
k = "|".join(df2['street'].to_list())
df1 = df1.assign(temp=df1['address'].str.findall(k).str.join(', '),
                 # grab the state abbreviation after the comma (" NC 27292" -> "NC")
                 temp1=df1['address'].str.split(",").str[-1].str.strip().str.split().str[0])
dfnew = pd.merge(df1, df2, how='left', left_on=['temp', 'temp1'], right_on=['street', 'state'])
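One caveat worth flagging for both answers: the street values contain regex metacharacters ('.' in 'Parker Ave.'), so escaping the alternatives before joining is safer. A small sketch:
import re

# Escape each street so '.' matches a literal dot rather than any character.
pat_street = f"({'|'.join(re.escape(s) for s in df2['street'])})"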

Merge 2 pandas dataframes - python

I have 2 pandas dataframes:
data1 =
sample ID  name  sex
0          a     male
1          b     male
2          c     male
3          d     male
4          e     male
data2 =
samples  Diabetic  age
0        yes       43
1        yes       50
2        no        63
3        no        21
4        yes       44
I want to merge both data frames to end up with the following data frame:
samples  Diabetic  age  name  sex
0        yes       43   a     male
1        yes       50   b     male
2        no        63   c     male
3        no        21   d     male
4        yes       44   e     male
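A minimal sketch of one way to do this, assuming sample ID in data1 and samples in data2 identify the same records:
import pandas as pd

data1 = pd.DataFrame({'sample ID': [0, 1, 2, 3, 4],
                      'name': ['a', 'b', 'c', 'd', 'e'],
                      'sex': ['male'] * 5})
data2 = pd.DataFrame({'samples': [0, 1, 2, 3, 4],
                      'Diabetic': ['yes', 'yes', 'no', 'no', 'yes'],
                      'age': [43, 50, 63, 21, 44]})

# Merge on the matching key columns, then drop the now-redundant key.
merged = (data2.merge(data1, left_on='samples', right_on='sample ID', how='left')
               .drop(columns='sample ID'))
print(merged)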

Get all the rows with and without NaN in pandas dataframe

What is the most efficient way of splitting a pandas dataframe into the rows without NaN and the rows with NaN?
Input:
ID Gender Dependants Income Education Married
1 Male 2 500 Graduate Yes
2 NaN 4 2500 Graduate No
3 Female 3 NaN NaN Yes
4 Male NaN 7000 Graduate Yes
5 Female 4 500 Graduate NaN
6 Female 2 4500 Graduate Yes
The expected output without NaN is,
ID Gender Dependants Income Education Married
1 Male 2 500 Graduate Yes
6 Female 2 4500 Graduate Yes
The expected output with NaN is,
ID Gender Dependants Income Education Married
2 NaN 4 2500 Graduate No
3 Female 3 NaN NaN Yes
4 Male NaN 7000 Graduate Yes
5 Female 4 500 Graduate NaN
Use boolean indexing: check missing values with isnull and use any to check for at least one True per row:
mask = df.isnull().any(axis=1)
df1 = df[~mask]
df2 = df[mask]
print (df1)
ID Gender Dependants Income Education Married
0 1 Male 2.0 500.0 Graduate Yes
5 6 Female 2.0 4500.0 Graduate Yes
print (df2)
ID Gender Dependants Income Education Married
1 2 NaN 4.0 2500.0 Graduate No
2 3 Female 3.0 NaN NaN Yes
3 4 Male NaN 7000.0 Graduate Yes
4 5 Female 4.0 500.0 Graduate NaN
Details:
print (df.isnull())
ID Gender Dependants Income Education Married
0 False False False False False False
1 False True False False False False
2 False False False True True False
3 False False True False False False
4 False False False False False True
5 False False False False False False
print (mask)
0 False
1 True
2 True
3 True
4 True
5 False
dtype: bool
And you can always use a more readable variant of the first half, where you don't need to invert the mask:
mask = df.notna().all(axis=1)
df1 = df[mask]
Exactly the same result: all(axis=1) keeps only the rows in which every value is non-missing.
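For the no-NaN half specifically, dropna is the idiomatic shortcut (equivalent to df1 above):
df1 = df.dropna()                   # rows with no missing values
df2 = df[df.isnull().any(axis=1)]   # rows with at least one missing value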

Handing missing data of age in classification [duplicate]

I have a Titanic dataset. It has many attributes, and I was working mainly on:
1. Age
2. Embarked (the port from which passengers embarked; there are 3 ports in total: S, Q, and C)
3. Survived (0 for did not survive, 1 for survived)
I was filtering out the useless data. Then I needed to fill the null values present in Age, so I counted how many passengers survived and didn't survive for each embarkation port, i.e. S, Q, and C.
I found the mean age of passengers who survived and who did not survive after embarking from each of the S, Q, and C ports. But now I have no idea how to fill these 6 values (3 for survivors from each of S, Q, and C, and 3 for non-survivors from each, so 6 in total) into the original Age column. If I simply do titanic.Age.fillna(<one of the six values>), it will fill all the null values of Age with that one value, which I don't want.
After giving it some time, I tried this:
titanic[titanic.Survived==1][titanic.Embarked=='S'].Age.fillna(SurvivedS.Age.mean(),inplace=True)
titanic[titanic.Survived==1][titanic.Embarked=='Q'].Age.fillna(SurvivedQ.Age.mean(),inplace=True)
titanic[titanic.Survived==1][titanic.Embarked=='C'].Age.fillna(SurvivedC.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='S'].Age.fillna(DidntSurvivedS.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='Q'].Age.fillna(DidntSurvivedQ.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='C'].Age.fillna(DidntSurvivedC.Age.mean(),inplace=True)
This showed no error, but it still doesn't work. Any idea what I should do?
I think you need groupby with apply and fillna by the group mean. Your version fails silently because chained indexing such as titanic[...][...] returns a copy, so fillna(inplace=True) modifies a temporary object rather than the original DataFrame:
titanic['age'] = titanic.groupby(['survived','embarked'])['age'].apply(lambda x: x.fillna(x.mean()))
import seaborn as sns
titanic = sns.load_dataset('titanic')
#check NaN rows in age
print (titanic[titanic['age'].isnull()].head(10))
survived pclass sex age sibsp parch fare embarked class \
5 0 3 male NaN 0 0 8.4583 Q Third
17 1 2 male NaN 0 0 13.0000 S Second
19 1 3 female NaN 0 0 7.2250 C Third
26 0 3 male NaN 0 0 7.2250 C Third
28 1 3 female NaN 0 0 7.8792 Q Third
29 0 3 male NaN 0 0 7.8958 S Third
31 1 1 female NaN 1 0 146.5208 C First
32 1 3 female NaN 0 0 7.7500 Q Third
36 1 3 male NaN 0 0 7.2292 C Third
42 0 3 male NaN 0 0 7.8958 C Third
who adult_male deck embark_town alive alone
5 man True NaN Queenstown no True
17 man True NaN Southampton yes True
19 woman False NaN Cherbourg yes True
26 man True NaN Cherbourg no True
28 woman False NaN Queenstown yes True
29 man True NaN Southampton no True
31 woman False B Cherbourg yes False
32 woman False NaN Queenstown yes True
36 man True NaN Cherbourg yes True
42 man True NaN Cherbourg no True
idx = titanic[titanic['age'].isnull()].index
titanic['age'] = titanic.groupby(['survived','embarked'])['age'].apply(lambda x: x.fillna(x.mean()))
# check if the values were replaced
print (titanic.loc[idx].head(10))
survived pclass sex age sibsp parch fare embarked \
5 0 3 male 30.325000 0 0 8.4583 Q
17 1 2 male 28.113184 0 0 13.0000 S
19 1 3 female 28.973671 0 0 7.2250 C
26 0 3 male 33.666667 0 0 7.2250 C
28 1 3 female 22.500000 0 0 7.8792 Q
29 0 3 male 30.203966 0 0 7.8958 S
31 1 1 female 28.973671 1 0 146.5208 C
32 1 3 female 22.500000 0 0 7.7500 Q
36 1 3 male 28.973671 0 0 7.2292 C
42 0 3 male 33.666667 0 0 7.8958 C
class who adult_male deck embark_town alive alone
5 Third man True NaN Queenstown no True
17 Second man True NaN Southampton yes True
19 Third woman False NaN Cherbourg yes True
26 Third man True NaN Cherbourg no True
28 Third woman False NaN Queenstown yes True
29 Third man True NaN Southampton no True
31 First woman False B Cherbourg yes False
32 Third woman False NaN Queenstown yes True
36 Third man True NaN Cherbourg yes True
42 Third man True NaN Cherbourg no True
#check mean values
print (titanic.groupby(['survived','embarked'])['age'].mean())
survived embarked
0 C 33.666667
Q 30.325000
S 30.203966
1 C 28.973671
Q 22.500000
S 28.113184
Name: age, dtype: float64
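As a variant (a sketch of my own, not part of the original answer), groupby with transform('mean') broadcasts the group means back onto the original index, which avoids apply entirely:
titanic['age'] = titanic['age'].fillna(
    titanic.groupby(['survived', 'embarked'])['age'].transform('mean')
)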
