gridsearch.predict_proba results in list rather than array - scikit-learn

I followed an example and used GridSearchCV with a random forest classifier to compute roc_auc_score; however, the y_prob = model.predict_proba(X_test) I generated was a list of two arrays rather than a single array. I was wondering what went wrong here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.metrics import roc_auc_score

X = np.random.rand(50, 10)
y = np.random.permutation([1] * 25 + [0] * 25)
y = label_binarize(y, classes=[0, 1])
y = np.hstack((1 - y, y))

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
index_split = sss.split(X, y)
train_index = []
test_index = []
for train_ind, test_ind in index_split:
    train_index.extend(train_ind)
    test_index.extend(test_ind)

data_train = X[train_index]
out_train = y[train_index]
data_test = X[test_index]
out_test = y[test_index]

rf = RandomForestClassifier()
grids = {
    'n_estimators': [10, 50, 100, 200],
    'max_features': ['auto', 'sqrt', 'log2'],
    'criterion': ['gini', 'entropy']
}
rf_grids_searched = GridSearchCV(rf,
                                 grids,
                                 scoring="roc_auc",
                                 n_jobs=-1,
                                 refit=True,
                                 cv=5,
                                 verbose=10)
rf_grids_searched.fit(data_train, out_train)
rf_best = rf_grids_searched.best_estimator_
y_prob = rf_best.predict_proba(data_test)
print(roc_auc_score(out_test, y_prob))
my result:
[array([[0.5, 0.5],
        [0.5, 0.5],
        [0.7, 0.3],
        [0.3, 0.7],
        [0.7, 0.3],
        [0.5, 0.5],
        [0.1, 0.9],
        [0.6, 0.4],
        [0.6, 0.4],
        [0.4, 0.6]]), array([[0.5, 0.5],
        [0.5, 0.5],
        [0.3, 0.7],
        [0.7, 0.3],
        [0.3, 0.7],
        [0.5, 0.5],
        [0.9, 0.1],
        [0.4, 0.6],
        [0.4, 0.6],
        [0.6, 0.4]])]
expected result, a single array of probabilities for classes [0, 1]:
array([[0.5, 0.5],
       [0.5, 0.5],
       [0.7, 0.3],
       [0.3, 0.7],
       [0.7, 0.3],
       [0.5, 0.5],
       [0.1, 0.9],
       [0.6, 0.4],
       [0.6, 0.4],
       [0.4, 0.6]])
I also tried not binarizing y in the first place, training the grid search on the 1-D y, which gives the y_prob array below. I then binarized out_test to match the dimensions of y_prob and computed the score. I was wondering if this sequence is correct?
code:
out_test1 = label_binarize(out_test, classes=[0, 1])
out_test1 = np.hstack((1 - out_test1, out_test1))
print(roc_auc_score(out_test1, y_prob))
array([[0.6, 0.4],
       [0.5, 0.5],
       [0.6, 0.4],
       [0.5, 0.5],
       [0.7, 0.3],
       [0.3, 0.7],
       [0.8, 0.2],
       [0.4, 0.6],
       [0.8, 0.2],
       [0.4, 0.6]])

The grid search's predict_proba method is just a dispatch to the best estimator's predict_proba. And from the docstring for RandomForestClassifier.predict_proba (emphasis added):
Returns
p : ndarray of shape (n_samples, n_classes), or a list of n_outputs such arrays if n_outputs > 1. ...
Since you've specified two outputs (two columns in y), you get predicted probabilities for each of the two classes for each of the two targets.
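In other words, the fix is to keep y one-dimensional so there is a single output. A minimal sketch of that approach (using train_test_split in place of the StratifiedShuffleSplit loop above and skipping the grid search for brevity): with a 1-D binary target, predict_proba returns one (n_samples, 2) array, and roc_auc_score takes the positive-class column.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = np.random.rand(50, 10)
y = np.random.permutation([1] * 25 + [0] * 25)  # 1-D target: a single output

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

rf = RandomForestClassifier().fit(X_train, y_train)
y_prob = rf.predict_proba(X_test)           # one (n_samples, 2) ndarray, not a list
print(roc_auc_score(y_test, y_prob[:, 1]))  # score against the positive-class column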

Related

Optimize classifier for multiclass Brier score instead of accuracy

I am more interested in optimizing my multiclass problem with Brier score instead of accuracy. To achieve that, I am evaluating my classifiers with the results of predict_proba() like:
import numpy as np

probs = np.array(
    [[1, 0, 0],
     [0, 1, 0],
     [1, 0, 0],
     [0, 1, 0],
     [0, 0, 1],
     [1, 0, 0],
     [0, 0, 1],
     [0, 0, 1]]
)
targets = np.array(
    [[0.9, 0.05, 0.05],
     [0.1, 0.8, 0.1],
     [0.7, 0.2, 0.1],
     [0.1, 0.9, 0],
     [0, 0, 1],
     [0.5, 0.3, 0.2],
     [0.1, 0.5, 0.4],
     [0.34, 0.33, 0.33]]
)

def brier_multi(targets, probs):
    return np.mean(np.sum((probs - targets) ** 2, axis=1))

brier_multi(targets, probs)
Is it possible to optimize scikit-learns classifier directly during training for multiclass Brier score instead of accuracy?
Edit:
...
pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("selector", None),
        ("classifier", model.get("classifier")),
    ]
)

def brier_multi(targets, probs):
    ohe_targets = OneHotEncoder().fit_transform(targets.reshape(-1, 1))
    return np.mean(np.sum(np.square(probs - ohe_targets), axis=1))

brier_multi_loss = make_scorer(
    brier_multi,
    greater_is_better=False,
    needs_proba=True,
)

search = GridSearchCV(
    estimator=pipe,
    param_grid=model.get("param_grid"),
    scoring=brier_multi_loss,
    cv=3,
    n_jobs=-1,
    refit=True,
    verbose=3,
)
search.fit(X_train, y_train)
...
which leads to nan as the score:
/home/andreas/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py:969: UserWarning: One or more of the test scores are non-finite: [nan nan nan nan nan nan nan nan nan]
warnings.warn(
You're already aware of the scoring parameter, so you just need to wrap your brier_multi into the format expected by GridSearchCV. There's a utility for that, make_scorer:
from sklearn.metrics import make_scorer

neg_mc_brier_score = make_scorer(
    brier_multi,
    greater_is_better=False,
    needs_proba=True,
)

GridSearchCV(..., scoring=neg_mc_brier_score)
See the User Guide and the docs for make_scorer.
Unfortunately, that won't run, because your version of the scorer expects a one-hot-encoded targets array, whereas sklearn multiclass will send y_true as a 1d array. As a hack to make sure the rest works, you can modify:
def brier_multi(targets, probs):
    # one-hot encode the 1-D y_true, converting the sparse result to a
    # dense array so the subtraction yields a plain ndarray
    ohe_targets = OneHotEncoder().fit_transform(targets.reshape(-1, 1)).toarray()
    return np.mean(np.sum(np.square(probs - ohe_targets), axis=1))
but I would encourage you to make this more robust (what if the classes aren't just 0, 1, ..., n_classes-1?).
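On that robustness point, one possible sketch (my illustration, not an sklearn recipe) pins the class ordering explicitly; make_scorer forwards extra keyword arguments to the metric function:
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.preprocessing import label_binarize

def brier_multi(y_true, y_proba, *, classes):
    # binarize against an explicit class ordering, so labels need not be
    # 0, 1, ..., n_classes-1 (assumes >= 3 classes; label_binarize returns
    # a single column in the binary case)
    onehot = label_binarize(y_true, classes=classes)
    return np.mean(np.sum((y_proba - onehot) ** 2, axis=1))

# classes must match the column order of predict_proba, i.e. clf.classes_
# (the sorted unique labels), so np.unique(y_train) gives the same ordering;
# y_train here is the hypothetical training target from the question
neg_mc_brier_score = make_scorer(
    brier_multi,
    greater_is_better=False,
    needs_proba=True,
    classes=np.unique(y_train),
)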
For what it's worth, sklearn has a PR in progress to add multiclass Brier score: https://github.com/scikit-learn/scikit-learn/pull/22046 (be sure to see the linked PR18699, as it has the beginning of development and review).

Matplotlib plot is not displaying all xticks and yticks

I am creating subplots in matplotlib, but not all xticks and yticks are being displayed. I have tried everything from setting xlim and ylim to changing the figure size. The thing is, this is a hands-on exercise on HackerRank, and they are evaluating my output against their expected output. The 0.0 on the x-axis and the 1.0 on the y-axis are simply not matching up. What am I doing wrong here?
Here is the code:
import matplotlib.pyplot as plt
import numpy as np

def test_generate_figure2():
    np.random.seed(1000)
    x = np.random.rand(10)
    y = np.random.rand(10)
    z = np.sqrt(x**2 + y**2)
    fig = plt.figure(figsize=(8, 6))
    axes1 = plt.subplot(2, 2, 1, title="Scatter plot with Upper Triangle Markers")
    axes1.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes1.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes1.set_ylim(-0.2, 1.0)  # doing this still does not get the expected output
    axes1.set_xlim(0.0, 1.2)
    print(axes1.get_yticks())
    axes1.scatter(x, y, marker="^", s=80, c=z)
    axes2 = plt.subplot(2, 2, 2, title="Scatter plot with Plus Markers")
    axes2.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes2.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes2.scatter(x, y, marker="+", s=80, c=z)
    axes3 = plt.subplot(2, 2, 3, title="Scatter plot with Circle Markers")
    axes3.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes3.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes3.scatter(x, y, marker="o", s=80, c=z)
    axes4 = plt.subplot(2, 2, 4, title="Scatter plot with Diamond Markers")
    axes4.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes4.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes4.scatter(x, y, marker="d", s=80, c=z)
    plt.tight_layout()
    plt.show()

test_generate_figure2()
My output: (figure omitted)
Expected output: (figure omitted)
Your set_xlim & set_ylim approach works. You just need to set it for every subplot:
https://akuiper.com/console/5vaLIq0ZC_KO
import matplotlib.pyplot as plt
import numpy as np

def test_generate_figure2():
    np.random.seed(1000)
    x = np.random.rand(10)
    y = np.random.rand(10)
    z = np.sqrt(x**2 + y**2)
    fig = plt.figure(figsize=(8, 6))
    axes1 = plt.subplot(2, 2, 1, title="Scatter plot with Upper Triangle Markers")
    axes1.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes1.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes1.set_ylim(-0.2, 1.0)  # set the same limits on every subplot
    axes1.set_xlim(0.0, 1.2)
    print(axes1.get_yticks())
    axes1.scatter(x, y, marker="^", s=80, c=z)
    axes2 = plt.subplot(2, 2, 2, title="Scatter plot with Plus Markers")
    axes2.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes2.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes2.set_ylim(-0.2, 1.0)
    axes2.set_xlim(0.0, 1.2)
    axes2.scatter(x, y, marker="+", s=80, c=z)
    axes3 = plt.subplot(2, 2, 3, title="Scatter plot with Circle Markers")
    axes3.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes3.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes3.set_ylim(-0.2, 1.0)
    axes3.set_xlim(0.0, 1.2)
    axes3.scatter(x, y, marker="o", s=80, c=z)
    axes4 = plt.subplot(2, 2, 4, title="Scatter plot with Diamond Markers")
    axes4.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes4.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes4.set_ylim(-0.2, 1.0)
    axes4.set_xlim(0.0, 1.2)
    axes4.scatter(x, y, marker="d", s=80, c=z)
    plt.tight_layout()
    plt.show()

test_generate_figure2()
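Since the four subplots differ only in their title and marker, the same figure can also be written as a loop; a compact sketch of that rewrite (mine, not part of the original answer):
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1000)
x, y = np.random.rand(10), np.random.rand(10)
z = np.sqrt(x**2 + y**2)

panels = [("Upper Triangle", "^"), ("Plus", "+"), ("Circle", "o"), ("Diamond", "d")]
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, (name, marker) in zip(axes.flat, panels):
    ax.set_title(f"Scatter plot with {name} Markers")
    ax.set_xticks([0.0, 0.4, 0.8, 1.2])
    ax.set_yticks([-0.2, 0.2, 0.6, 1.0])
    ax.set_xlim(0.0, 1.2)   # same limits on every subplot
    ax.set_ylim(-0.2, 1.0)
    ax.scatter(x, y, marker=marker, s=80, c=z)
plt.tight_layout()
plt.show()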

matplotlib shift pcolormesh plot to symmetrized coordinates

I have some 2D data with x and y coordinates both within [0,1], plotted using pcolormesh.
Now I want to symmetrize the plot to [-0.5, 0.5] for both x and y coordinates. In Matlab I was able to achieve this by changing x and y from e.g. [0, 0.2, 0.4, 0.6, 0.8] to [0, 0.2, 0.4, -0.4, -0.2], without rearranging the data. However, with pcolormesh I cannot get the desired result.
A minimal example is shown below, with the data represented simply by x+y:
import matplotlib.pyplot as plt
import numpy as np

x, y = np.mgrid[0:1:5j, 0:1:5j]
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(9, 3.3), constrained_layout=True)

# original plot spanning [0,1]
img1 = ax1.pcolormesh(x, y, x + y, shading='auto')

# shift x and y from [0,1] to [-0.5,0.5]
x = x*(x < 0.5) + (x - 1)*(x > 0.5)
y = y*(y < 0.5) + (y - 1)*(y > 0.5)
img2 = ax2.pcolormesh(x, y, x + y, shading='auto')  # similar code works in Matlab

# for this specific case, the following is close to the desired result: I could just
# relabel the x and y ticks to [-0.5, 0.5], but in general the data is not simply x+y
img3 = ax3.pcolormesh(x + y, shading='auto')
fig.colorbar(img1, ax=[ax1, ax2, ax3], orientation='horizontal')
The corresponding figure is below (omitted here); any suggestion on what I've missed would be appreciated!
Let's look at what you want to achieve in a 1D example.
You have x values between 0 and 1 and a dummy function f(x) = 20*x to produce some values.
# x  = [0, .2, .4, .6, .8] -> [0, .2, .4, -.4, -.2] -> [-.4, -.2, 0, .2, .4]
# fx = [0,  4,  8, 12, 16] -> [0,  4,  8,  12,  16] -> [ 12,  16, 0,  4,  8]
#       ^ only flip and shift x, not fx ^
You could use np.roll() to achieve the last operation.
I used n=14 to make the result better visible and show that this approach works for arbitrary n.
import numpy as np
import matplotlib.pyplot as plt

n = 14
x, y = np.meshgrid(np.linspace(0, 1, n, endpoint=False),
                   np.linspace(0, 1, n, endpoint=False))
z = x + y

x_sym = x*(x <= .5) + (x - 1)*(x > .5)
# array([[ 0. ,  0.2,  0.4, -0.4, -0.2], ...   (values shown for the n=5 case)
x_sym = np.roll(x_sym, n//2, axis=(0, 1))
# array([[-0.4, -0.2,  0. ,  0.2,  0.4], ...
y_sym = y*(y <= .5) + (y - 1)*(y > .5)
y_sym = np.roll(y_sym, n//2, axis=(0, 1))
z_sym = np.roll(z, n//2, axis=(0, 1))
# array([[1.2, 1.4, 0.6, 0.8, 1. ],
#        [1.4, 1.6, 0.8, 1. , 1.2],
#        [0.6, 0.8, 0. , 0.2, 0.4],
#        [0.8, 1. , 0.2, 0.4, 0.6],
#        [1. , 1.2, 0.4, 0.6, 0.8]])
fig, (ax1, ax2) = plt.subplots(1, 2)
img1 = ax1.imshow(z, origin='lower', extent=(.0, 1., .0, 1.))
img2 = ax2.imshow(z_sym, origin='lower', extent=(-.5, .5, -.5, .5))
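If you want to stay with pcolormesh rather than imshow, the rolled arrays should drop straight in, since x_sym and y_sym are monotonic again after the roll. A sketch appended to the script above (my addition, not part of the original answer):
# same x_sym, y_sym, z_sym as computed above: coordinates and data rolled together
fig2, ax = plt.subplots()
img = ax.pcolormesh(x_sym, y_sym, z_sym, shading='auto')
fig2.colorbar(img, ax=ax)
plt.show()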

Pytorch, sample given batch logits

Given logits like
# each row is a record of data
logits = np.array([ [0.1, 0.3, 0.5], [0.3, 0.1, 0.5], [0.1, 0.3, 0.0] ])
How can I use PyTorch to sample an index for each row of logits? The current distribution APIs do not seem to support such a function.
What I want is, for example
distribution = Categorical(logits=logits)
labels = distribution.sample(dim=1)
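For what it's worth, a minimal sketch (my illustration, assuming a tensor input rather than the numpy array above): torch.distributions.Categorical treats the leading dimension as a batch dimension, so a single sample() call already draws one index per row.
import torch
from torch.distributions import Categorical

logits = torch.tensor([[0.1, 0.3, 0.5],
                       [0.3, 0.1, 0.5],
                       [0.1, 0.3, 0.0]])

# batch_shape is (3,), so sample() returns one class index per row
distribution = Categorical(logits=logits)
labels = distribution.sample()  # e.g. tensor([2, 0, 1]), shape (3,)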

Using Pipeline with GridSearchCV

Suppose I have this Pipeline object:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('my_transform', my_transform()),
    ('estimator', SVC())
])
To pass the hyperparameters to my Support Vector Classifier (SVC) I could do something like this:
pipe_parameters = {
    'estimator__gamma': (0.1, 1),
    'estimator__kernel': ('rbf',)
}
Then, I could use GridSearchCV:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, pipe_parameters)
grid.fit(X_train, y_train)
We know that a linear kernel does not use gamma as a hyperparameter. So, how could I include the linear kernel in this GridSearch?
For example, In a simple GridSearch (without Pipeline) I could do:
param_grid = [
    {'C': [0.1, 1, 10, 100, 1000],
     'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
     'kernel': ['rbf']},
    {'C': [0.1, 1, 10, 100, 1000],
     'kernel': ['linear']},
    {'C': [0.1, 1, 10, 100, 1000],
     'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
     'degree': [2, 3],
     'kernel': ['poly']}
]
grid = GridSearchCV(SVC(), param_grid)
Therefore, I need a working version of this sort of code:
pipe_parameters = {
    'bag_of_words__max_features': (None, 1500),
    'estimator__kernel': ('rbf',),
    'estimator__gamma': (0.1, 1),
    'estimator__kernel': ('linear',),  # duplicate key: a Python dict keeps only the last one
    'estimator__C': (0.1, 1),
}
Meaning that I want to use as hyperparameters the following combinations:
kernel = rbf, gamma = 0.1
kernel = rbf, gamma = 1
kernel = linear, C = 0.1
kernel = linear, C = 1
You are almost there. Similar to how you created multiple dictionaries for SVC model, create a list of dictionaries for the pipeline.
Try this example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
remove = ('headers', 'footers', 'quotes')
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

pipe = Pipeline([
    ('bag_of_words', CountVectorizer()),
    ('estimator', SVC())])

pipe_parameters = [
    {'bag_of_words__max_features': (None, 1500),
     'estimator__C': [0.1],
     'estimator__gamma': [0.0001, 1],
     'estimator__kernel': ['rbf']},
    {'bag_of_words__max_features': (None, 1500),
     'estimator__C': [0.1, 1],
     'estimator__kernel': ['linear']}
]

from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, pipe_parameters, cv=2)
grid.fit(data_train.data, data_train.target)

grid.best_params_
# {'bag_of_words__max_features': None,
#  'estimator__C': 0.1,
#  'estimator__kernel': 'linear'}
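Since refit=True is the default, the fitted grid itself then acts as the best classifier. A short usage sketch (the held-out evaluation is my addition, not part of the original answer):
# evaluate the refit best estimator on the held-out test split
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42, remove=remove)
print(grid.score(data_test.data, data_test.target))  # accuracy of the best model
predictions = grid.predict(data_test.data)           # dispatches to best_estimator_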
