Scikit-learn ColumnTransformer + OneHotEncoder

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import numpy as np
X = [['male', 0, 3], ['male', 1, 0], ['female', 2, 1], ['female', 0, 2]]
# Encode the strings as integers
sex_enc = OrdinalEncoder(dtype=int)  # np.int was removed in NumPy 1.24; the builtin int is equivalent
# One-hot encoding (in scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`)
one_hot_enc = OneHotEncoder(sparse=False, handle_unknown='ignore', dtype=int)
# Convert the strings in column 0 to integers, then one-hot encode all columns
col_transformer = ColumnTransformer(transformers=[('sex_enc', sex_enc, [0]), ('one_hot_enc', one_hot_enc, [0])])
# Fit the encoders
col_transformer.fit(X)
X_trans = col_transformer.transform(X)
print(X_trans)
[[1 0 1]
 [1 0 1]
 [0 1 0]
 [0 1 0]]
Feature 0 only has the values male and female, so why does the one-hot encoding produce three columns when used with ColumnTransformer?
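
A possible explanation, sketched under the assumption that the three columns come from concatenation: ColumnTransformer applies each transformer to the original input columns independently and stacks the results side by side, so it does not chain the ordinal encoder into the one-hot encoder. The names `both` and `only_one_hot` are made up for this illustration, and `sparse_threshold=0` is only used here to force a dense result.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = np.array(
    [['male', 0, 3], ['male', 1, 0], ['female', 2, 1], ['female', 0, 2]],
    dtype=object,
)

# Both transformers read the ORIGINAL column 0; their outputs are concatenated.
both = ColumnTransformer(
    transformers=[
        ('sex_enc', OrdinalEncoder(dtype=int), [0]),                              # 1 output column
        ('one_hot_enc', OneHotEncoder(handle_unknown='ignore', dtype=int), [0]),  # 2 output columns
    ],
    sparse_threshold=0,  # always return a dense array
)
out_both = both.fit_transform(X)
print(out_both.shape)  # (4, 3): 1 ordinal column + 2 one-hot columns

# Using only the one-hot encoder on column 0 gives the two expected columns.
only_one_hot = ColumnTransformer(
    transformers=[('one_hot_enc', OneHotEncoder(handle_unknown='ignore', dtype=int), [0])],
    sparse_threshold=0,
)
out_one_hot = only_one_hot.fit_transform(X)
print(out_one_hot.shape)  # (4, 2)
```

Columns not listed in any transformer are dropped by default (`remainder='drop'`), which is why columns 1 and 2 do not appear in either result.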

Related

How come Verbose=True does not show any output with VotingClassifier?

It's in the documentation that verbose=True will output time elapsed, but it is not doing so for me:
from sklearn.ensemble import VotingClassifier
voting_c_all = VotingClassifier(
    estimators=[
        ('random_forest', gs_forest2),
        ('grid_search', gs),
    ],
    voting='soft',
    verbose=True,
    n_jobs=-1
)
voting_c_all.fit(X_res, y_res)

Using the example from the manual:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
clf1 = LogisticRegression(multi_class='multinomial', random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
eclf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2)], voting='soft',verbose=True)
eclf1 = eclf1.fit(X, y)
[Voting] ....................... (1 of 2) Processing lr, total= 0.0s
[Voting] ....................... (2 of 2) Processing rf, total= 0.1s
But once you set n_jobs to more than 1, the fitting is sent to other cores, so you don't see the printed progress messages and the elapsed time is not reported:
eclf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2)], voting='soft',verbose=True,n_jobs=2)
eclf1 = eclf1.fit(X, y)

python seaborn.heatmap annotation as symbols

I want the heatmap annotations to be symbols: '*' in place of 1 and blank in place of 0.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
x = pd.DataFrame({'a':[1,0,1,0]})
fig, (ax) = plt.subplots(ncols=1)
sns.heatmap(x, cmap="BuPu",annot=True,fmt='g',annot_kws={'size':10},ax=ax, yticklabels=[], cbar=False, linewidths=.5,robust=True, vmin=0, vmax=1)
plt.show()

With annot=True, the heatmap annotates each cell with its numeric value. To put other text (or Unicode symbols) in the cells, ax.text can be used. The center of each cell is at the column index plus 0.5 on the x-axis and the row index plus 0.5 on the y-axis.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
x = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [1, 1, 0, 1], 'c': [0, 1, 0, 0]})
fig, (ax) = plt.subplots(ncols=1)
sns.heatmap(x, cmap="BuPu", annot=False, ax=ax, yticklabels=[], cbar=False, linewidths=.5)
for i, c in enumerate(x.columns):
    for j, v in enumerate(x[c]):
        if v == 1:
            ax.text(i + 0.5, j + 0.5, '★', color='gold', size=20, ha='center', va='center')
plt.show()
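
As an alternative sketch (not part of the answer above): seaborn's heatmap also accepts an array with the same shape as the data for annot, and with fmt='' its entries are drawn as plain text. The `labels` array and the Agg backend line are assumptions added so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook or GUI session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

x = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [1, 1, 0, 1], 'c': [0, 1, 0, 0]})

# Build a string array of the same shape: '*' where the value is 1, blank elsewhere
labels = np.where(x.to_numpy() == 1, '*', '')

fig, ax = plt.subplots()
# fmt='' tells seaborn to use the label strings as they are instead of number formatting
sns.heatmap(x, cmap="BuPu", annot=labels, fmt='', annot_kws={'size': 20},
            ax=ax, yticklabels=False, cbar=False, linewidths=.5, vmin=0, vmax=1)
# plt.show() would display the figure in an interactive session
```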

heatmap: each column with different color and scaling in R/Python

I want to generate an annotated heatmap in which each column has its own color.
<my code>
```
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'clust': ['Clust 10','Clust 11','Clust 1','Clust 2','Clust 10','Clust 11','Clust 1','Clust 2','Clust 10','Clust 11','Clust 1','Clust 2'],'value': [4,2,0,0, 0,0,1,3, 1,0,0,0], 'category': ['A','A','A','A','B','B','B','B','C','C','C','C']})
result = df.pivot(index='clust', columns='category',values='value')
sns.heatmap(result, annot=True, fmt="g", cmap='viridis')
plt.show()
```
<Input file>
No        A  B  C
Clust 10  4  0  1
Clust 11  2  0  0
Clust 1   0  1  0
Clust 2   0  3  0
Clust 3   3  1  0
Clust 4   2  0  2

You can create a heatmap with the plotly module in Python. The code below generates an annotated heatmap.
import plotly.figure_factory as ff
a = [
    [4, 0, 1],
    [2, 0, 0],
    [0, 1, 0],
    [0, 3, 0],
    [3, 1, 0],
    [2, 0, 2]
]
fig = ff.create_annotated_heatmap(a)
fig.show()
See https://plot.ly/python/annotated-heatmap/ for more information on how to generate a heatmap.
Note: I have not tested this code; it is meant as a reference.

Working code.
```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'clust': ['Clust 10','Clust 11','Clust 1','Clust 2','Clust 10','Clust 11','Clust 1','Clust 2','Clust 10','Clust 11','Clust 1','Clust 2'],'value': [4,2,0,0, 0,0,1,3, 1,0,0,0], 'category': ['A','A','A','A','B','B','B','B','C','C','C','C']})
result = df.pivot(index='clust', columns='category',values='value')
print(result)
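# One colormap per column ('YlGn' is the valid matplotlib name; 'YlG' does not exist)
cm = ['Blues', 'Greens', 'YlGn']
# One subplot per column of the pivoted table, with no gap between them
f, axs = plt.subplots(1, result.columns.size, gridspec_kw={'wspace': 0})
for i, (s, a, c) in enumerate(zip(result.columns, axs, cm)):
    sns.heatmap(np.array([result[s].values]).T, yticklabels=result.index, xticklabels=[s], annot=True, fmt='.2f', ax=a, cmap=c, cbar=False)
    # Keep the row labels only on the first (leftmost) column
    if i > 0:
        a.yaxis.set_ticks([])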
plt.show()
```

sklearn train_test_split returns some elements in both test/train

I have a dataset X with 260 unique observations.
When running x_train, x_test, _, _ = train_test_split(X, y, test_size=0.2), I would assume that
[p for p in x_test if p in x_train] would be empty, but it is not. In fact, only two observations in x_test are not in x_train.
Is that intended, or...?
EDIT (posted the data I am using):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split as split
import numpy as np
DATA=load_breast_cancer()
X=DATA.data
y= DATA.target
y=np.array([1 if p==0 else 0 for p in DATA.target])
x_train,x_test,y_train,y_test=split(X,y,test_size=0.2,stratify=y,random_state=42)
len([p for p in x_test if p in x_train]) #is not 0
EDIT 2.0: Showing that the test works
a=np.array([[1,2,3],[4,5,6]])
b=np.array([[1,2,3],[11,12,13]])
len([p for p in a if p in b]) #1

This is not a bug with the implementation of train_test_split in sklearn, but a weird peculiarity of how the in operator works on numpy arrays. The in operator first does an elementwise comparison between two arrays, and returns True if ANY of the elements match.
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [5, 5, 5]])
a in b # True
The correct way to test for this kind of overlap is using the equality operator and np.all and np.any. As a bonus, you also get the indices that overlap for free.
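import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [5, 5, 5], [7, 8, 9]])
# `a in b` is not meaningful here: shapes (2, 3) and (3, 3) do not broadcast,
# and recent NumPy versions raise an error for the underlying comparison
z = np.any(np.all(a == b[:, None, :], -1)) # False
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [1, 2, 3], [7, 8, 9]])
overlap = np.all(a == b[:, None, :], -1) # shape (3, 2): rows of b compared with rows of a
z = np.any(overlap) # True
indices = np.nonzero(overlap) # (array([1]), array([0])): row 1 of b equals row 0 of a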
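
The same broadcasting idea can be wrapped in a small helper. This is a minimal sketch; the function name `rows_in` is made up for illustration, and it assumes both arrays are 2-D with the same number of columns and that exact equality is the right comparison (for float data a tolerance may be more appropriate).

```python
import numpy as np

def rows_in(a, b):
    """Return a boolean mask where mask[i] is True if row a[i] equals some row of b."""
    # a[:, None, :] has shape (n, 1, d) and b[None, :, :] has shape (1, m, d),
    # so the comparison pairs every row of a with every row of b.
    equal_rows = np.all(a[:, None, :] == b[None, :, :], axis=-1)  # shape (n, m)
    return equal_rows.any(axis=1)

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [1, 2, 3], [7, 8, 9]])
mask = rows_in(a, b)
print(mask)        # [ True False]
print(mask.sum())  # 1 row of a also appears in b
```

Applied to the question, `rows_in(x_test, x_train).sum()` would count the test rows that also occur in the training set.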

You need to check using the following:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split as split
import numpy as np
DATA=load_breast_cancer()
X=DATA.data
y= DATA.target
y=np.array([1 if p==0 else 0 for p in DATA.target])
x_train,x_test,y_train,y_test=split(X,y,test_size=0.2,stratify=y,random_state=42)
len([p for p in x_test.tolist() if p in x_train.tolist()])
0
After converting with x_test.tolist(), the in operator compares whole rows as Python lists, so it works as intended.
Reference: testing whether a Numpy array contains a given row
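
The list-based check scans all training rows for every test row. For larger arrays, one option is to put the training rows into a set of tuples first. This is a sketch under the assumption that exact row equality is what is wanted; the helper name `count_shared_rows` is hypothetical.

```python
import numpy as np

def count_shared_rows(x_test, x_train):
    """Count rows of x_test that appear exactly in x_train."""
    # Tuples are hashable, so each membership test is a set lookup
    train_rows = {tuple(row) for row in x_train.tolist()}
    return sum(tuple(row) in train_rows for row in x_test.tolist())

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 2, 3], [11, 12, 13]])
print(count_shared_rows(a, b))  # 1
```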

Replace a bunch of if-else conditions with scikit-learn

I'm trying to wrap my head around ML with scikit-learn
Here is what I'm trying to do:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
df = pd.DataFrame({
    "f1": [1, 1],
    "f2": [0, 0],
    "c": [1, 0]
})
# df
#   f1 f2 c   # f1, f2 - features / c - class
#   1  1  1   # for f1 = 1 and f2 = 1 > expected c = 1
#   0  0  0   # for f1 = 0 and f2 = 0 > expected c = 0
dtc_clf = DecisionTreeClassifier()
features = df[["f1", "f2"]]
labels = df[["c"]]
dtc_clf.fit(features, labels)
test_features = pd.DataFrame({"ft1": [1, 1],
                              "ft2": [0, 0]})
# test_features
#   ft1 ft2   # I added exactly the training data for the test
#   1   1
#   0   0
dtc_clf.predict(test_features)
#I'm getting this result:
#array([0, 0])
#I expected this result
#array([1, 0])
If '1, 1 => 1', then '0, 0 => 0'.
It should be 'array([1, 0])', right?
Each column is a condition: it is 1 if the condition is satisfied and 0 if not.
Basically, I'm trying to replace a lot of if-else conditions with ML.

Works with DecisionTreeRegressor
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
# "beer": 1
# "wine": 2
df = pd.DataFrame({
    "boy": [1, 0],
    "hetero": [1, 1],
    "drink": [1, 2]
})
X = df[["boy", "hetero"]]
y = df[["drink"]]
regr = DecisionTreeRegressor(random_state=0)
model = regr.fit(X, y)
# Make new observation
observation = [[1, 1]]
# Predict observation's value
model.predict(observation)
Result :
array([ 1.])
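
A possible reason for the classifier result in the question, offered as an assumption rather than a confirmed diagnosis: a dict passed to pd.DataFrame maps each column name to that column's values, so "f1": [1, 1] and "f2": [0, 0] produce two identical feature rows (1, 0) with labels 1 and 0, and the tree has nothing to split on. The sketch below lays the rows out the way the question's comments describe. It also gives the test frame the same column names as the training frame, since recent scikit-learn versions check feature names.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# One list entry per ROW: row 0 is (f1=1, f2=1, c=1), row 1 is (f1=0, f2=0, c=0)
df = pd.DataFrame({
    "f1": [1, 0],
    "f2": [1, 0],
    "c": [1, 0],
})
features = df[["f1", "f2"]]
labels = df["c"]  # 1-D target

clf = DecisionTreeClassifier(random_state=0)
clf.fit(features, labels)

# Same column names as in training, same two rows as the training data
test_features = pd.DataFrame({"f1": [1, 0], "f2": [1, 0]})
pred = clf.predict(test_features)
print(pred)  # [1 0]
```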
