I followed an example and tried to use GridSearchCV with a random forest classifier to compute roc_auc_score; however, the y_prob = model.predict_proba(X_test) I generated was a list of two arrays rather than a single array. I am wondering what went wrong here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.metrics import roc_auc_score

X = np.random.rand(50, 10)
y = np.random.permutation([1] * 25 + [0] * 25)
y = label_binarize(y, classes=[0, 1])
y = np.hstack((1 - y, y))

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
index_split = sss.split(X, y)
train_index = []
test_index = []
for train_ind, test_ind in index_split:
    train_index.extend(train_ind)
    test_index.extend(test_ind)
data_train = X[train_index]
out_train = y[train_index]
data_test = X[test_index]
out_test = y[test_index]

rf = RandomForestClassifier()
grids = {
    'n_estimators': [10, 50, 100, 200],
    'max_features': ['auto', 'sqrt', 'log2'],
    'criterion': ['gini', 'entropy']
}
rf_grids_searched = GridSearchCV(rf,
                                 grids,
                                 scoring="roc_auc",
                                 n_jobs=-1,
                                 refit=True,
                                 cv=5,
                                 verbose=10)
rf_grids_searched.fit(data_train, out_train)
rf_best = rf_grids_searched.best_estimator_
y_prob = rf_best.predict_proba(data_test)
print(roc_auc_score(out_test, y_prob))
my result:
[array([[0.5, 0.5],
        [0.5, 0.5],
        [0.7, 0.3],
        [0.3, 0.7],
        [0.7, 0.3],
        [0.5, 0.5],
        [0.1, 0.9],
        [0.6, 0.4],
        [0.6, 0.4],
        [0.4, 0.6]]),
 array([[0.5, 0.5],
        [0.5, 0.5],
        [0.3, 0.7],
        [0.7, 0.3],
        [0.3, 0.7],
        [0.5, 0.5],
        [0.9, 0.1],
        [0.4, 0.6],
        [0.4, 0.6],
        [0.6, 0.4]])]
expected result, with the probabilities of classes [0, 1] in a single array:
array([[0.5, 0.5],
       [0.5, 0.5],
       [0.7, 0.3],
       [0.3, 0.7],
       [0.7, 0.3],
       [0.5, 0.5],
       [0.1, 0.9],
       [0.6, 0.4],
       [0.6, 0.4],
       [0.4, 0.6]])
I also tried not binarizing y in the first place and then training the grid search to get the following array y_prob. Later, I binarized y_test to match the dimensions of y_prob and computed the score. I am wondering whether this sequence is correct?
code:
out_test1 = label_binarize(out_test, classes=[0, 1])
out_test1 = np.hstack((1 - out_test1, out_test1))
print(roc_auc_score(out_test1, y_prob))
array([[0.6, 0.4],
[0.5, 0.5],
[0.6, 0.4],
[0.5, 0.5],
[0.7, 0.3],
[0.3, 0.7],
[0.8, 0.2],
[0.4, 0.6],
[0.8, 0.2],
[0.4, 0.6]])
The grid search's predict_proba method is just a dispatch to the best estimator's predict_proba. And from the docstring for RandomForestClassifier.predict_proba (emphasis added):
Returns
p : ndarray of shape (n_samples, n_classes), or a list of n_outputs
such arrays if n_outputs > 1. ...
Since you've specified two outputs (two columns in y), you get predicted probabilities for each of the two classes for each of the two targets.
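In other words, the list is the documented output for a multi-output y. A minimal sketch of the usual fix (assuming you simply want binary classification): keep y as a 1-D label vector, and score the positive-class column of predict_proba:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = np.random.rand(50, 10)
y = np.random.permutation([1] * 25 + [0] * 25)  # 1-D labels: no label_binarize/hstack

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=7)
rf = RandomForestClassifier().fit(X_tr, y_tr)

y_prob = rf.predict_proba(X_te)  # one ndarray of shape (n_samples, 2)
print(roc_auc_score(y_te, y_prob[:, 1]))  # binary AUC uses the positive-class column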
Related
I am more interested in optimizing my multiclass problem with the Brier score instead of accuracy. To achieve that, I am evaluating my classifiers with the results of predict_proba(), like:
import numpy as np

probs = np.array(
    [[1, 0, 0],
     [0, 1, 0],
     [1, 0, 0],
     [0, 1, 0],
     [0, 0, 1],
     [1, 0, 0],
     [0, 0, 1],
     [0, 0, 1]]
)
targets = np.array(
    [[0.9, 0.05, 0.05],
     [0.1, 0.8, 0.1],
     [0.7, 0.2, 0.1],
     [0.1, 0.9, 0],
     [0, 0, 1],
     [0.5, 0.3, 0.2],
     [0.1, 0.5, 0.4],
     [0.34, 0.33, 0.33]]
)

def brier_multi(targets, probs):
    return np.mean(np.sum((probs - targets) ** 2, axis=1))

brier_multi(targets, probs)
Is it possible to optimize scikit-learns classifier directly during training for multiclass Brier score instead of accuracy?
Edit:
...
pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("selector", None),
        ("classifier", model.get("classifier")),
    ]
)

def brier_multi(targets, probs):
    ohe_targets = OneHotEncoder().fit_transform(targets.reshape(-1, 1))
    return np.mean(np.sum(np.square(probs - ohe_targets), axis=1))

brier_multi_loss = make_scorer(
    brier_multi,
    greater_is_better=False,
    needs_proba=True,
)

search = GridSearchCV(
    estimator=pipe,
    param_grid=model.get("param_grid"),
    scoring=brier_multi_loss,
    cv=3,
    n_jobs=-1,
    refit=True,
    verbose=3,
)
search.fit(X_train, y_train)
...
This leads to nan as the score:
/home/andreas/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py:969: UserWarning: One or more of the test scores are non-finite: [nan nan nan nan nan nan nan nan nan]
warnings.warn(
You're already aware of the scoring parameter, so you just need to wrap your brier_multi into the format expected by GridSearchCV. There's a utility for that, make_scorer:
from sklearn.metrics import make_scorer

neg_mc_brier_score = make_scorer(
    brier_multi,
    greater_is_better=False,
    needs_proba=True,
)

GridSearchCV(..., scoring=neg_mc_brier_score)
See the User Guide and the docs for make_scorer.
Unfortunately, that won't run, because your version of the scorer expects a one-hot-encoded targets array, whereas sklearn multiclass will send y_true as a 1d array. As a hack to make sure the rest works, you can modify:
def brier_multi(targets, probs):
    ohe_targets = OneHotEncoder().fit_transform(targets.reshape(-1, 1))
    return np.mean(np.sum(np.square(probs - ohe_targets), axis=1))
but I would encourage you to make this more robust (what if the classes aren't just 0, 1, ..., n_classes-1?).
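For example, here is a sketch of a more robust variant (my own suggestion, assuming three or more classes, since label_binarize returns a single column in the binary case): fix the full class list up front, so every CV fold binarizes y_true into the same column order that predict_proba uses:

import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import make_scorer

def make_neg_brier_multi_scorer(classes):
    # classes: the full sorted label list, so every fold maps y_true onto the
    # same columns as the estimator's predict_proba output
    def brier_multi(y_true, y_prob):
        ohe = label_binarize(y_true, classes=classes)
        return np.mean(np.sum((y_prob - ohe) ** 2, axis=1))
    return make_scorer(brier_multi, greater_is_better=False, needs_proba=True)

# usage, assuming y_train contains every class at least once:
# scorer = make_neg_brier_multi_scorer(np.unique(y_train))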
For what it's worth, sklearn has a PR in progress to add a multiclass Brier score: https://github.com/scikit-learn/scikit-learn/pull/22046 (be sure to see the linked PR #18699, as it contains the beginning of the development and review).
I am creating subplots in matplotlib, but not all xticks and yticks are being displayed. I have tried everything from setting xlim and ylim to changing the figure size, etc. The thing is, this is a hands-on exercise on HackerRank, and they are evaluating my output against their expected output. The 0.0 on the x-axis and the 1.0 on the y-axis are simply not matching up. What am I doing wrong here?
Here is the code:
import matplotlib.pyplot as plt
import numpy as np

def test_generate_figure2():
    np.random.seed(1000)
    x = np.random.rand(10)
    y = np.random.rand(10)
    z = np.sqrt(x**2 + y**2)

    fig = plt.figure(figsize=(8, 6))

    axes1 = plt.subplot(2, 2, 1, title="Scatter plot with Upper Triangle Markers")
    axes1.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes1.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes1.set_ylim(-0.2, 1.0)  # doing this still does not get the expected output
    axes1.set_xlim(0.0, 1.2)
    print(axes1.get_yticks())
    axes1.scatter(x, y, marker="^", s=80, c=z)

    axes2 = plt.subplot(2, 2, 2, title="Scatter plot with Plus Markers")
    axes2.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes2.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes2.scatter(x, y, marker="+", s=80, c=z)

    axes3 = plt.subplot(2, 2, 3, title="Scatter plot with Circle Markers")
    axes3.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes3.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes3.scatter(x, y, marker="o", s=80, c=z)

    axes4 = plt.subplot(2, 2, 4, title="Scatter plot with Diamond Markers")
    axes4.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes4.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes4.scatter(x, y, marker="d", s=80, c=z)

    plt.tight_layout()
    plt.show()

test_generate_figure2()
My output: (screenshot not preserved)
Expected output: (screenshot not preserved)
Your set_xlim & set_ylim approach works. You just need to set it for every subplot:
https://akuiper.com/console/5vaLIq0ZC_KO
import matplotlib.pyplot as plt
import numpy as np

def test_generate_figure2():
    np.random.seed(1000)
    x = np.random.rand(10)
    y = np.random.rand(10)
    z = np.sqrt(x**2 + y**2)

    fig = plt.figure(figsize=(8, 6))

    axes1 = plt.subplot(2, 2, 1, title="Scatter plot with Upper Triangle Markers")
    axes1.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes1.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes1.set_ylim(-0.2, 1.0)  # set the limits on this subplot...
    axes1.set_xlim(0.0, 1.2)
    print(axes1.get_yticks())
    axes1.scatter(x, y, marker="^", s=80, c=z)

    axes2 = plt.subplot(2, 2, 2, title="Scatter plot with Plus Markers")
    axes2.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes2.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes2.set_ylim(-0.2, 1.0)  # ...and on every other subplot as well
    axes2.set_xlim(0.0, 1.2)
    axes2.scatter(x, y, marker="+", s=80, c=z)

    axes3 = plt.subplot(2, 2, 3, title="Scatter plot with Circle Markers")
    axes3.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes3.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes3.set_ylim(-0.2, 1.0)
    axes3.set_xlim(0.0, 1.2)
    axes3.scatter(x, y, marker="o", s=80, c=z)

    axes4 = plt.subplot(2, 2, 4, title="Scatter plot with Diamond Markers")
    axes4.set_xticks([0.0, 0.4, 0.8, 1.2])
    axes4.set_yticks([-0.2, 0.2, 0.6, 1.0])
    axes4.set_ylim(-0.2, 1.0)
    axes4.set_xlim(0.0, 1.2)
    axes4.scatter(x, y, marker="d", s=80, c=z)

    plt.tight_layout()
    plt.show()

test_generate_figure2()
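If you'd rather not repeat those four calls, the same settings can be applied in a loop over the axes (a sketch, reusing the axes objects from the function above):

# inside test_generate_figure2(), after creating the four subplots:
for ax in (axes1, axes2, axes3, axes4):
    ax.set_xticks([0.0, 0.4, 0.8, 1.2])
    ax.set_yticks([-0.2, 0.2, 0.6, 1.0])
    ax.set_xlim(0.0, 1.2)
    ax.set_ylim(-0.2, 1.0)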
I have some 2D data with x and y coordinates both within [0,1], plotted using pcolormesh.
Now I want to symmetrize the plot to [-0.5, 0.5] for both x and y coordinates. In Matlab I was able to achieve this by changing x and y from e.g. [0, 0.2, 0.4, 0.6, 0.8] to [0, 0.2, 0.4, -0.4, -0.2], without rearranging the data. However, with pcolormesh I cannot get the desired result.
A minimum example is shown below, with data represented simply by x+y:
import matplotlib.pyplot as plt
import numpy as np

x, y = np.mgrid[0:1:5j, 0:1:5j]
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(9, 3.3), constrained_layout=1)

# original plot spanning [0, 1]
img1 = ax1.pcolormesh(x, y, x + y, shading='auto')

# shift x and y from [0, 1] to [-0.5, 0.5]
x = x*(x < 0.5) + (x - 1)*(x > 0.5)
y = y*(y < 0.5) + (y - 1)*(y > 0.5)
img2 = ax2.pcolormesh(x, y, x + y, shading='auto')  # similar code works in Matlab

# for this specific case, the following is close to the desired result; I could just rename
# the x and y tick labels to [-0.5, 0.5], but in general the data is not simply x + y
img3 = ax3.pcolormesh(x + y, shading='auto')

fig.colorbar(img1, ax=[ax1, ax2, ax3], orientation='horizontal')
The corresponding figure is below; any suggestion on what I have missed would be appreciated!
Let's look at what you want to achieve in a 1D example.
You have x values between 0 and 1 and a dummy function f(x) = 20*x to produce some values.
# x  = [0, .2, .4, .6, .8] -> [0, .2, .4, -.4, -.2] -> [-.4, -.2, 0, .2, .4]
# fx = [0,  4,  8, 12, 16] -> [0,  4,  8,  12,  16] -> [ 12, 16,  0,  4,  8]
#                             ^ only flip and shift x, not fx ^
You could use np.roll() to achieve the last operation.
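Here is a minimal 1-D sketch of those three steps (my own illustration of the idea, using the dummy f(x) = 20*x from above):

import numpy as np

x = np.array([0, .2, .4, .6, .8])
fx = 20 * x                            # [0, 4, 8, 12, 16]

x_sym = np.where(x <= .5, x, x - 1)    # flip and shift x: [0, .2, .4, -.4, -.2]
x_sym = np.roll(x_sym, len(x) // 2)    # [-.4, -.2, 0, .2, .4]
fx_sym = np.roll(fx, len(x) // 2)      # roll fx the same way: [12, 16, 0, 4, 8]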
I used n=14 to make the result better visible and show that this approach works for arbitrary n.
import numpy as np
import matplotlib.pyplot as plt

n = 14
x, y = np.meshgrid(np.linspace(0, 1, n, endpoint=False),
                   np.linspace(0, 1, n, endpoint=False))
z = x + y

x_sym = x*(x <= .5) + (x - 1)*(x > .5)
# for n=5 this gives: array([[ 0. ,  0.2,  0.4, -0.4, -0.2], ...
x_sym = np.roll(x_sym, n//2, axis=(0, 1))
# for n=5: array([[-0.4, -0.2,  0. ,  0.2,  0.4], ...
y_sym = y*(y <= .5) + (y - 1)*(y > .5)
y_sym = np.roll(y_sym, n//2, axis=(0, 1))
z_sym = np.roll(z, n//2, axis=(0, 1))
# for n=5:
# array([[1.2, 1.4, 0.6, 0.8, 1. ],
#        [1.4, 1.6, 0.8, 1. , 1.2],
#        [0.6, 0.8, 0. , 0.2, 0.4],
#        [0.8, 1. , 0.2, 0.4, 0.6],
#        [1. , 1.2, 0.4, 0.6, 0.8]])

fig, (ax1, ax2) = plt.subplots(1, 2)
img1 = ax1.imshow(z, origin='lower', extent=(.0, 1., .0, 1.))
img2 = ax2.imshow(z_sym, origin='lower', extent=(-.5, .5, -.5, .5))
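If you want to stay with pcolormesh rather than imshow, a sketch of the same idea (my own variant, assuming the data is periodic, so that x = 0.5 and x = -0.5 describe the same point): build the shifted coordinates directly and roll only the data:

import numpy as np
import matplotlib.pyplot as plt

n = 14
x, y = np.meshgrid(np.linspace(0, 1, n, endpoint=False),
                   np.linspace(0, 1, n, endpoint=False))
z = x + y  # dummy data, as in the question

# coordinates now run over [-0.5, 0.5); rolling z by n//2 moves the values
# that belonged to x, y > 0.5 onto the negative side of the grid
x_s, y_s = np.meshgrid(np.linspace(-.5, .5, n, endpoint=False),
                       np.linspace(-.5, .5, n, endpoint=False))
z_s = np.roll(z, n // 2, axis=(0, 1))

fig, ax = plt.subplots()
img = ax.pcolormesh(x_s, y_s, z_s, shading='auto')
fig.colorbar(img, ax=ax)
plt.show()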
Given logits like
# each row is a record of data
logits = np.array([ [0.1, 0.3, 0.5], [0.3, 0.1, 0.5], [0.1, 0.3, 0.0] ])
How can I use PyTorch to sample a class index from the logits of each row? The current distribution APIs do not seem to support such a function.
What I want is, for example
distribution = Categorical(logits=logits)
labels = distribution.sample(dim=1)
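For what it's worth, torch.distributions.Categorical already treats each row of a 2-D logits tensor as its own distribution, so no dim argument is needed; a minimal sketch:

import torch
from torch.distributions import Categorical

logits = torch.tensor([[0.1, 0.3, 0.5],
                       [0.3, 0.1, 0.5],
                       [0.1, 0.3, 0.0]])

# batch_shape is (3,): one categorical distribution per row of logits
distribution = Categorical(logits=logits)
labels = distribution.sample()  # shape (3,): one sampled class index per row
print(labels)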
Suppose I have this Pipeline object:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('my_transform', my_transform()),
    ('estimator', SVC())
])
To pass the hyperparameters to my Support Vector Classifier (SVC) I could do something like this:
pipe_parameters = {
    'estimator__gamma': (0.1, 1),
    'estimator__kernel': ['rbf']
}
Then, I could use GridSearchCV:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, pipe_parameters)
grid.fit(X_train, y_train)
We know that a linear kernel does not use gamma as a hyperparameter. So, how could I include the linear kernel in this GridSearch?
For example, in a simple GridSearch (without a Pipeline) I could do:
param_grid = [
    {'C': [0.1, 1, 10, 100, 1000],
     'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
     'kernel': ['rbf']},
    {'C': [0.1, 1, 10, 100, 1000],
     'kernel': ['linear']},
    {'C': [0.1, 1, 10, 100, 1000],
     'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
     'degree': [2, 3],
     'kernel': ['poly']}
]
grid = GridSearchCV(SVC(), param_grid)
Therefore, I need a working version of this sort of code:
pipe_parameters = {
    'bag_of_words__max_features': (None, 1500),
    'estimator__kernel': ['rbf'],
    'estimator__gamma': (0.1, 1),
    'estimator__kernel': ['linear'],  # duplicate key: this is only a sketch of what I want
    'estimator__C': (0.1, 1),
}
Meaning that I want to use as hyperparameters the following combinations:
kernel = rbf, gamma = 0.1
kernel = rbf, gamma = 1
kernel = linear, C = 0.1
kernel = linear, C = 1
You are almost there. Similar to how you created multiple dictionaries for the SVC model, create a list of dictionaries for the pipeline.
Try this example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
remove = ('headers', 'footers', 'quotes')
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

pipe = Pipeline([
    ('bag_of_words', CountVectorizer()),
    ('estimator', SVC())])

pipe_parameters = [
    {'bag_of_words__max_features': (None, 1500),
     'estimator__C': [0.1],
     'estimator__gamma': [0.0001, 1],
     'estimator__kernel': ['rbf']},
    {'bag_of_words__max_features': (None, 1500),
     'estimator__C': [0.1, 1],
     'estimator__kernel': ['linear']}
]

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(pipe, pipe_parameters, cv=2)
grid.fit(data_train.data, data_train.target)

grid.best_params_
# {'bag_of_words__max_features': None,
#  'estimator__C': 0.1,
#  'estimator__kernel': 'linear'}
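Note that when param_grid is a list of dictionaries, GridSearchCV expands each dictionary separately, so gamma is only ever combined with the 'rbf' kernel and never with 'linear'.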