The documentation says that verbose=True prints the time elapsed while fitting, but I get no output:
from sklearn.ensemble import VotingClassifier
voting_c_all = VotingClassifier(
    estimators=[
        ('random_forest', gs_forest2),
        ('grid_search', gs),
    ],
    voting='soft',
    verbose=True,
    n_jobs=-1
)
voting_c_all.fit(X_res, y_res)
Using the example from the manual:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
clf1 = LogisticRegression(multi_class='multinomial', random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
eclf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2)], voting='soft',verbose=True)
eclf1 = eclf1.fit(X, y)
[Voting] ....................... (1 of 2) Processing lr, total= 0.0s
[Voting] ....................... (2 of 2) Processing rf, total= 0.1s
But once you set n_jobs to more than 1, the estimators are fitted in worker processes on other cores. The messages are printed there, so you may not see them and the elapsed time is not reported:
eclf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2)], voting='soft',verbose=True,n_jobs=2)
eclf1 = eclf1.fit(X, y)
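If the per-estimator messages do not appear, the total fit time can still be measured in the parent process. The sketch below is not part of scikit-learn: `timed_fit` is a hypothetical helper, and `SlowEstimator` is only a stand-in for any object with a scikit-learn style fit method.

```python
import time


def timed_fit(estimator, X, y):
    """Fit `estimator` and return (fitted_estimator, elapsed_seconds)."""
    start = time.perf_counter()
    fitted = estimator.fit(X, y)
    return fitted, time.perf_counter() - start


class SlowEstimator:
    """Hypothetical stand-in with a scikit-learn style fit method."""

    def fit(self, X, y):
        time.sleep(0.01)  # pretend to do some work
        return self


model, seconds = timed_fit(SlowEstimator(), [[0], [1]], [0, 1])
print(f"fit took {seconds:.2f}s")
```

With the objects from the question this would be called as timed_fit(voting_c_all, X_res, y_res). It reports only the overall wall-clock time, not the time per estimator.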
I want to annotate a heatmap with symbols: '*' where the value is 1 and blank where it is 0.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
x = pd.DataFrame({'a':[1,0,1,0]})
fig, (ax) = plt.subplots(ncols=1)
sns.heatmap(x, cmap="BuPu",annot=True,fmt='g',annot_kws={'size':10},ax=ax, yticklabels=[], cbar=False, linewidths=.5,robust=True, vmin=0, vmax=1)
plt.show()
With annot=True, the heatmap annotates each cell with its numeric value. To put other text (or unicode symbols) in the cells, ax.text can be used. The center of each cell is at the column number plus 0.5 on the x axis and the row number plus 0.5 on the y axis.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
x = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [1, 1, 0, 1], 'c': [0, 1, 0, 0]})
fig, (ax) = plt.subplots(ncols=1)
sns.heatmap(x, cmap="BuPu", annot=False, ax=ax, yticklabels=[], cbar=False, linewidths=.5)
for i, c in enumerate(x.columns):
    for j, v in enumerate(x[c]):
        if v == 1:
            ax.text(i + 0.5, j + 0.5, '★', color='gold', size=20, ha='center', va='center')
plt.show()
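As an alternative to the ax.text loop, here is a sketch that relies on my understanding that the annot parameter of sns.heatmap also accepts an array of labels with the same shape as the data. The helper name `plot_symbol_heatmap` is made up for this example.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [1, 1, 0, 1], 'c': [0, 1, 0, 0]})

# Build a string label for every cell: '*' where the value is 1, blank otherwise.
labels = np.where(x.to_numpy() == 1, '*', '')


def plot_symbol_heatmap(data, labels):
    """Draw the heatmap using the string labels as annotations."""
    # Imported here so the label-building part above runs without a plotting backend.
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots()
    # fmt='' is needed because the default numeric format cannot be applied to strings.
    sns.heatmap(data, annot=labels, fmt='', cmap="BuPu", ax=ax,
                yticklabels=[], cbar=False, linewidths=.5)
    return fig
```

Calling plot_symbol_heatmap(x, labels) and then showing the figure should give the same layout as the loop version, with '*' in place of the star symbol.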
I want to generate an annotated heatmap in which each column has its own color map.
<my code>
```
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'clust': ['Clust 10','Clust 11','Clust 1','Clust 2','Clust 10','Clust 11','Clust 1','Clust 2','Clust 10','Clust 11','Clust 1','Clust 2'],'value': [4,2,0,0, 0,0,1,3, 1,0,0,0], 'category': ['A','A','A','A','B','B','B','B','C','C','C','C']})
result = df.pivot(index='clust', columns='category',values='value')
sns.heatmap(result, annot=True, fmt="g", cmap='viridis')
plt.show()
```
<Input file>
No        A  B  C
Clust 10  4  0  1
Clust 11  2  0  0
Clust 1   0  1  0
Clust 2   0  3  0
Clust 3   3  1  0
Clust 4   2  0  2
You can create an annotated heatmap with the plotly module in Python. The code below generates one:
import plotly.figure_factory as ff
a = [
    [4, 0, 1],
    [2, 0, 0],
    [0, 1, 0],
    [0, 3, 0],
    [3, 1, 0],
    [2, 0, 2]
]
fig = ff.create_annotated_heatmap(a)
fig.show()
See https://plot.ly/python/annotated-heatmap/ for more information on how to generate heatmap.
Note: I have not tested this code; it is meant as a reference.
Working code.
```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'clust': ['Clust 10','Clust 11','Clust 1','Clust 2','Clust 10','Clust 11','Clust 1','Clust 2','Clust 10','Clust 11','Clust 1','Clust 2'],'value': [4,2,0,0, 0,0,1,3, 1,0,0,0], 'category': ['A','A','A','A','B','B','B','B','C','C','C','C']})
result = df.pivot(index='clust', columns='category',values='value')
print(result)
cm = ['Blues', 'Greens', 'YlGn']  # one colormap per column
f, axs = plt.subplots(1, result.columns.size, gridspec_kw={'wspace': 0})
for i, (s, a, c) in enumerate(zip(result.columns, axs, cm)):
    # draw each column as its own single-column heatmap
    sns.heatmap(np.array([result[s].values]).T, yticklabels=result.index, xticklabels=[s], annot=True, fmt='.2f', ax=a, cmap=c, cbar=False)
    if i > 0:
        a.yaxis.set_ticks([])  # keep the row labels only on the first column
plt.show()
```
I have a data set X with 260 unique observations.
When running x_train,x_test,_,_=train_test_split(X,y,test_size=0.2) I would assume that
[p for p in x_test if p in x_train] would be empty, but it is not. In fact, it turns out that only two observations in x_test are not in x_train.
Is that intended or...?
EDIT (posted the data I am using):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split as split
import numpy as np
DATA=load_breast_cancer()
X=DATA.data
y= DATA.target
y=np.array([1 if p==0 else 0 for p in DATA.target])
x_train,x_test,y_train,y_test=split(X,y,test_size=0.2,stratify=y,random_state=42)
len([p for p in x_test if p in x_train]) #is not 0
EDIT 2.0: Showing that the test works
a=np.array([[1,2,3],[4,5,6]])
b=np.array([[1,2,3],[11,12,13]])
len([p for p in a if p in b]) #1
This is not a bug in sklearn's implementation of train_test_split, but a peculiarity of how the in operator works on numpy arrays. The in operator does an elementwise comparison between the two arrays (with broadcasting) and returns True if ANY of the elements match, not only when a whole row matches.
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [5, 5, 5]])
a in b # True
The correct way to test for this kind of overlap is to combine the equality operator with np.all (every element of a row matches) and np.any (at least one pair of rows matches). As a bonus, you also get the indices of the overlapping rows for free.
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [5, 5, 5], [7, 8, 9]])
a[1] in b # True, even though [4, 5, 6] is not a row of b
z = np.any(np.all(a == b[:, None, :], -1)) # False
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 7, 8], [1, 2, 3], [7, 8, 9]])
a[0] in b # True
overlap = np.all(a == b[:, None, :], -1)
z = np.any(overlap) # True
indices = np.nonzero(overlap) # (array([1]), array([0])): row 1 of b equals row 0 of a
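To make the difference concrete for a single row, here is a small sketch. The helper name `contains_row` is made up for this example; it only wraps the np.all / np.any combination described above.

```python
import numpy as np


def contains_row(rows, row):
    """Return True only if `row` equals an entire row of `rows`."""
    # Compare `row` against every row, require all elements of a row to match,
    # then ask whether at least one row passed.
    return bool(np.any(np.all(rows == row, axis=1)))


b = np.array([[6, 7, 8], [5, 5, 5], [7, 8, 9]])
row = np.array([4, 5, 6])

loose = row in b                 # behaves like (b == row).any(): one matching element is enough
strict = contains_row(b, row)    # requires a full-row match

a = np.array([[1, 2, 3], [7, 8, 9]])
n_shared = sum(contains_row(b, r) for r in a)  # number of rows of a that also appear in b
```

Here loose is True because the middle 5 of [5, 5, 5] lines up with the 5 in [4, 5, 6], while strict is False. n_shared is 1, since only [7, 8, 9] is a row of b.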
You need to check using the following:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split as split
import numpy as np
DATA=load_breast_cancer()
X=DATA.data
y= DATA.target
y=np.array([1 if p==0 else 0 for p in DATA.target])
x_train,x_test,y_train,y_test=split(X,y,test_size=0.2,stratify=y,random_state=42)
len([p for p in x_test.tolist() if p in x_train.tolist()])
0
After converting both arrays with tolist(), the in operator compares whole rows as plain Python lists, so it works as intended.
Reference: testing whether a Numpy array contains a given row
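A variation on the tolist() check, offered as a sketch rather than the answer's own method: converting the training rows to a set of tuples once avoids rebuilding the list for every test row. The small arrays below are made-up stand-ins for x_train and x_test.

```python
import numpy as np

# Made-up stand-ins for x_train and x_test
train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
test = np.array([[3.0, 4.0], [7.0, 8.0]])

# Tuples are hashable, so each training row can be stored in a set once.
train_rows = {tuple(r) for r in train.tolist()}

# Keep the test rows that appear as a whole row in the training data.
shared = [r for r in test.tolist() if tuple(r) in train_rows]
```

For these arrays shared contains only [3.0, 4.0]. Like the tolist() version, this compares floats for exact equality.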
I'm trying to wrap my head around ML with scikit-learn
Here is what I'm trying to do:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
df = pd.DataFrame({
    "f1": [1, 1],
    "f2": [0, 0],
    "c": [1, 0]
})
# df
# f1 f2 c   # f1, f2 - features / c - class/ classifier
# 1  1  1   # for f1 = 1 and f2 = 1 > expected c = 1
# 0  0  0   # for f1 = 0 and f2 = 0 > expected c = 0
dtc_clf = DecisionTreeClassifier()
features = df[["f1", "f2"]]
labels = df[["c"]]
dtc_clf.fit(features, labels)
test_features = pd.DataFrame({"ft1": [1, 1],
                              "ft2": [0, 0]})
# test_features
# ft1 ft2   # for the test I reused exactly the training data
# 1   1
# 0   0
dtc_clf.predict(test_features)
#I'm getting this result:
#array([0, 0])
#I expected this result
#array([1, 0])
If '1, 1 => 1', then '0, 0 => 0'.
It should be 'array([1, 0])', right?
Each column is a condition: 1 if it is met, 0 if not.
Basically, I'm trying to replace a lot of if/else conditions with ML.
It works with DecisionTreeRegressor:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
# "beer": 1
# "wine": 2
df = pd.DataFrame({
    "boy": [1, 0],
    "hetero": [1, 1],
    "drink": [1, 2]
})
X = df[["boy", "hetero"]]
y = df[["drink"]]
regr = DecisionTreeRegressor(random_state=0)
model = regr.fit(X, y)
# Make new observation
observation = [[1, 1]]
# Predict observation's value
model.predict(observation)
Result :
array([ 1.])
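Returning to the classifier example in the question, one thing worth checking, offered as a sketch rather than a definitive diagnosis: a dict passed to pd.DataFrame is read column by column, so "f1": [1, 1] fills the whole f1 column. The variable names `by_column` and `by_row` are made up for this illustration.

```python
import pandas as pd

# Dict keys become columns, so each list fills one column from top to bottom.
by_column = pd.DataFrame({
    "f1": [1, 1],
    "f2": [0, 0],
    "c": [1, 0],
})
# The feature rows are therefore (1, 0) and (1, 0), not (1, 1) and (0, 0).
print(by_column[["f1", "f2"]].values.tolist())

# Writing the samples row by row gives the table described in the question.
by_row = pd.DataFrame(
    [[1, 1, 1],
     [0, 0, 0]],
    columns=["f1", "f2", "c"],
)
print(by_row.values.tolist())
```

If that reading is right, both training samples in the question have identical features but different labels, so the tree cannot tell them apart, which would be consistent with the array([0, 0]) result. In the regressor example the two rows of boy and hetero differ, so the same problem does not arise there.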