Hi, I want to combine a train/test split with cross-validation and get the results as AUC.
My first approach works, but it only gives me accuracy.
# split data into train+validation set and test set
X_trainval, X_test, y_trainval, y_test = train_test_split(dataset.data, dataset.target)
# split train+validation set into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval)
# train the classifier
clf.fit(X_train, y_train)
# evaluate the classifier on the validation set
score = clf.score(X_valid, y_valid)
# retrain on the combined training & validation set and evaluate on the test set
clf.fit(X_trainval, y_trainval)
test_score = clf.score(X_test, y_test)
I cannot figure out how to apply roc_auc here. Please help.
Using scikit-learn you can do:
import numpy as np
from sklearn import metrics
y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
Now we get:
print(fpr)
array([ 0. , 0.5, 0.5, 1. ])
print(tpr)
array([ 0.5, 0.5, 1. , 1. ])
print(thresholds)
array([ 0.8 , 0.4 , 0.35, 0.1 ])
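From these you can already compute the AUC of this toy example directly (0.75, as in the scikit-learn docs):
print(metrics.auc(fpr, tpr))
# 0.75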
In your code, after training your classifier, get the predictions with:
y_preds = clf.predict(X_test)
And then use them to calculate the AUC value:
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_preds, pos_label=1)
auc_roc = auc(fpr, tpr)
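Note that hard predictions from clf.predict only give a degenerate, few-point ROC curve. If the goal is to combine cross-validation with AUC directly, here is a minimal sketch (assuming a binary target and a classifier that exposes predict_proba):
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

# cross-validated ROC AUC on the train+validation portion;
# scoring='roc_auc' uses probability/decision scores internally
cv_auc = cross_val_score(clf, X_trainval, y_trainval, cv=5, scoring='roc_auc')
print(cv_auc.mean())

# AUC on the held-out test set, scored with probabilities rather than hard labels
clf.fit(X_trainval, y_trainval)
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])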
I'm trying to attack an MLP with DeepFool, but when I plot the results I see some strange behavior.
First of all the MLP structure is as follows:
Dense(16, activation='relu', input_shape=(512,))
Dense(16, activation='relu')
Dense(2, activation='softmax')
It is trained in the standard way, passing a training set and a validation set with labels and using the following parameters:
training_params = { 'optimizer': 'adam', 'loss': 'sparse_categorical_crossentropy', 'metrics': ['accuracy'] }
I know that DeepFool can only be used after removing the classification (softmax) layer from the network, so I deleted the last Dense layer and created a KerasClassifier:
logit_model = tf.keras.Model(MLP.input, MLP.layers[-2].output)
classifier = KerasClassifier(clip_values=(0, 8), model=logit_model)
N.B. clip_values=(0, 8) because the feature vectors take values between 0 and 8.
When I attack the MLP with DeepFool using different values of epsilon, the perturbation stays constant: even if I pass a maximum perturbation (epsilon) of 0.001 to the attack, the adversarial sample perturbation is 0.7. I show an example of the code and the corresponding output below.
The attack
import matplotlib.pyplot as plt
import numpy as np
from art.attacks.evasion import DeepFool  # DeepFool from the Adversarial Robustness Toolbox (ART)

epsilon_list = [0, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
max_iter = 10
acc = []
pert = []
for eps in epsilon_list:
    attack = DeepFool(classifier=classifier, epsilon=eps, max_iter=max_iter, verbose=False)
    test_samples_adv = attack.generate(X_test_copy)
    loss_test, accuracy_test = MLP.evaluate(test_samples_adv, Y_test)
    # average absolute perturbation over all samples and features
    perturbation = np.mean(np.abs(test_samples_adv - X_test_copy))
    print('Accuracy on adversarial test data: {:4.2f}%'.format(accuracy_test * 100))
    print('Average perturbation: {:4.2f}'.format(perturbation))
    acc.append(accuracy_test)
    pert.append(perturbation)

x = np.array(pert)
y = np.array(acc)
plotting_curves(x, y)
plotting_curves is just a function that plots a graph given x and y.
These are the results:
Results plot: https://i.stack.imgur.com/WMGBp.png
Can anybody explain whether this makes sense, and why?
scikit-learn has a quantile-regression-based confidence interval implementation for GBM (example from the docs).
Is there a reason why it doesn't provide a similar quantile based loss implementation for RandomForestRegressor?
There is a scikit-learn compatible/compliant Quantile Regression Forest implementation that can be used to generate confidence intervals here: https://github.com/zillow/quantile-forest
Setup should be as easy as:
pip install quantile-forest
Then, as an example, to generate CIs on a full dataset:
import matplotlib.pyplot as plt
import numpy as np
from quantile_forest import RandomForestQuantileRegressor
from sklearn import datasets
from sklearn.model_selection import KFold
X, y = datasets.fetch_california_housing(return_X_y=True)
qrf = RandomForestQuantileRegressor(n_estimators=100, random_state=0)
kf = KFold(n_splits=5)
kf.get_n_splits(X)
y_true = []
y_pred = []
y_pred_lower = []
y_pred_upper = []
for train_index, test_index in kf.split(X):
    X_train, X_test, y_train, y_test = (
        X[train_index], X[test_index], y[train_index], y[test_index]
    )
    qrf.set_params(max_features=X_train.shape[1] // 3)
    qrf.fit(X_train, y_train)
    # Get predictions at 95% prediction intervals and median.
    y_pred_i = qrf.predict(X_test, quantiles=[0.025, 0.5, 0.975])
    y_true = np.concatenate((y_true, y_test))
    y_pred = np.concatenate((y_pred, y_pred_i[:, 1]))
    y_pred_lower = np.concatenate((y_pred_lower, y_pred_i[:, 0]))
    y_pred_upper = np.concatenate((y_pred_upper, y_pred_i[:, 2]))
fig = plt.figure(figsize=(10, 4))
y_pred_interval = y_pred_upper - y_pred_lower
sort_idx = np.argsort(y_pred_interval)
y_true = y_true[sort_idx]
y_pred_lower = y_pred_lower[sort_idx]
y_pred_upper = y_pred_upper[sort_idx]
# Center data, with the mean of the prediction interval at 0.
mean = (y_pred_lower + y_pred_upper) / 2
y_true -= mean
y_pred_lower -= mean
y_pred_upper -= mean
plt.plot(y_true, marker=".", ms=5, c="r", lw=0)
plt.fill_between(
    np.arange(len(y_pred_upper)),
    y_pred_lower,
    y_pred_upper,
    alpha=0.2,
    color="gray",
)
plt.plot(np.arange(len(y)), y_pred_lower, marker="_", c="0.2", lw=0)
plt.plot(np.arange(len(y)), y_pred_upper, marker="_", c="0.2", lw=0)
plt.xlim([0, len(y)])
plt.xlabel("Ordered Samples")
plt.ylabel("Observed Values and Prediction Intervals (Centered)")
plt.show()
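As a quick sanity check one might add after the plot, the empirical coverage of the intervals can be computed from the arrays built above (the centering does not affect the comparison, since the same mean was subtracted from all three arrays):
# fraction of observations that fall inside their 95% prediction interval;
# this should be close to 0.95 if the intervals are well calibrated
coverage = np.mean((y_true >= y_pred_lower) & (y_true <= y_pred_upper))
print("Empirical coverage: {:.3f}".format(coverage))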
There is also a scikit-learn-contrib package for this (example copy-pasted from there for RandomForestRegressor).
I had to install the development version in order to have the correct paths for the current scikit-learn:
pip install git+git://github.com/scikit-learn-contrib/forest-confidence-interval.git
https://github.com/scikit-learn-contrib/forest-confidence-interval
Example (copy pasted from the link above):
# Regression Forest Example
import numpy as np
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestRegressor
import sklearn.model_selection as xval
from sklearn.datasets import fetch_openml
import forestci as fci
# retrieve mpg data from the OpenML machine learning library
mpg_data = fetch_openml('autompg')
# separate mpg data into predictors and outcome variable
mpg_X = mpg_data["data"]
mpg_y = mpg_data["target"]
# remove rows where the data is nan
not_null_sel = np.invert(
    np.sum(np.isnan(mpg_data["data"]), axis=1).astype(bool))
mpg_X = mpg_X[not_null_sel]
mpg_y = mpg_y[not_null_sel]
# split mpg data into training and test set
mpg_X_train, mpg_X_test, mpg_y_train, mpg_y_test = xval.train_test_split(
    mpg_X, mpg_y, test_size=0.25, random_state=42)
# Create RandomForestRegressor
n_trees = 2000
mpg_forest = RandomForestRegressor(n_estimators=n_trees, random_state=42)
mpg_forest.fit(mpg_X_train, mpg_y_train)
mpg_y_hat = mpg_forest.predict(mpg_X_test)
# Plot predicted MPG without error bars
plt.scatter(mpg_y_test, mpg_y_hat)
plt.plot([5, 45], [5, 45], 'k--')
plt.xlabel('Reported MPG')
plt.ylabel('Predicted MPG')
plt.show()
# Calculate the variance
mpg_V_IJ_unbiased = fci.random_forest_error(mpg_forest, mpg_X_train, mpg_X_test)
# Plot error bars for predicted MPG using unbiased variance
plt.errorbar(mpg_y_test, mpg_y_hat, yerr=np.sqrt(mpg_V_IJ_unbiased), fmt='o')
plt.plot([5, 45], [5, 45], 'k--')
plt.xlabel('Reported MPG')
plt.ylabel('Predicted MPG')
plt.show()
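If you want explicit interval bounds rather than error bars, one option (a sketch assuming a normal approximation of the prediction error, reusing the variance estimate above) is:
# approximate 95% confidence bounds from the unbiased variance estimate
ci_lower = mpg_y_hat - 1.96 * np.sqrt(mpg_V_IJ_unbiased)
ci_upper = mpg_y_hat + 1.96 * np.sqrt(mpg_V_IJ_unbiased)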
I am trying to run GradientBoostingClassifier() with the help of GridSearchCV.
For every combination of parameters, I also need "precision", "recall" and accuracy in tabular format.
Here is the code:
scoring= ['accuracy', 'precision','recall']
parameters = {#'nthread':[3,4], #when use hyperthread, xgboost may become slower
"criterion": ["friedman_mse", "mae"],
"loss":["deviance","exponential"],
"max_features":["log2","sqrt"],
'learning_rate': [0.01,0.05,0.1,1,0.5], #so called `eta` value
'max_depth': [3,4,5],
'min_samples_leaf': [4,5,6],
'subsample': [0.6,0.7,0.8],
'n_estimators': [5,10,15,20],#number of trees, change it to 1000 for better results
'scoring':scoring
}
# sorted(sklearn.metrics.SCORERS.keys()) # To see different loss functions
#clf_xgb = GridSearchCV(xgb_model, parameters, n_jobs=5,verbose=2, refit=True,cv = 8)
clf_gbm = GridSearchCV(gbm_model, parameters, n_jobs=5,cv = 8)
clf_gbm.fit(X_train,y_train)
print(clf_gbm.best_params_)
print(clf_gbm.best_score_)
feature_importances = pd.DataFrame(clf_gbm.best_estimator_.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)
depth=clf_gbm.cv_results_["param_max_depth"]
score=clf_gbm.cv_results_["mean_test_score"]
params=clf_gbm.cv_results_["params"]
I get this error:
ValueError: Invalid parameter seed for estimator GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.01, loss='deviance', max_depth=3,
max_features='log2', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=4, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=5, presort='auto',
random_state=None, subsample=1.0, verbose=0,
warm_start=False). Check the list of available parameters with `estimator.get_params().keys()`.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer
#creating Scoring parameter:
scoring = {'accuracy': make_scorer(accuracy_score),
           'precision': make_scorer(precision_score),
           'recall': make_scorer(recall_score)}
# A sample parameter
parameters = {
"loss":["deviance"],
"learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
"min_samples_split": np.linspace(0.1, 0.5, 12),
"min_samples_leaf": np.linspace(0.1, 0.5, 12),
"max_depth":[3,5,8],
"max_features":["log2","sqrt"],
"criterion": ["friedman_mse", "mae"],
"subsample":[0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
"n_estimators":[10]
}
#passing the scoring function in the GridSearchCV
clf = GridSearchCV(GradientBoostingClassifier(), parameters,scoring=scoring,refit=False,cv=2, n_jobs=-1)
clf.fit(trainX, trainY)
#converting the clf.cv_results to dataframe
df=pd.DataFrame.from_dict(clf.cv_results_)
# here cv=2, so there are two splits: split0 and split1
df[['split0_test_accuracy','split1_test_accuracy','split0_test_precision','split1_test_precision','split0_test_recall','split1_test_recall']]
Find the best parameters based on accuracy_score, precision_score or recall, then refit the model and predict on the test data:
#find the best parameter based on the accuracy_score
#taking the average of the accuracy_score
df['accuracy_score']=(df['split0_test_accuracy']+df['split1_test_accuracy'])/2
df.loc[df['accuracy_score'].idxmax()]['params']
Prediction on the test data
clf =GradientBoostingClassifier(criterion='mae',
learning_rate=0.1,
loss='deviance',
max_depth= 5,
max_features='sqrt',
min_samples_leaf= 0.1,
min_samples_split= 0.42727272727272736,
n_estimators=10,
subsample=0.8)
clf.fit(trainX, trainY)
correct_test = correct_data(test)
testX = correct_test[predictor].values
result = clf.predict(testX)
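Alternatively, rather than averaging the split scores by hand, GridSearchCV can refit on one of the metrics directly when several scorers are passed; a minimal sketch reusing the scoring dict, parameters, trainX/trainY and testX defined above:
# refit='accuracy' makes best_params_ and best_estimator_ available for that metric
clf = GridSearchCV(GradientBoostingClassifier(), parameters,
                   scoring=scoring, refit='accuracy', cv=2, n_jobs=-1)
clf.fit(trainX, trainY)
print(clf.best_params_)
result = clf.best_estimator_.predict(testX)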
I need to perform a grid search on the parameters listed below for a Logistic Regression classifier, using recall for scoring and cross-validation three times.
The data is in a CSV file (11.1 MB); the download link is: https://drive.google.com/file/d/1cQFp7HteaaL37CefsbMNuHqPzkINCVzs/view?usp=sharing
I have grid_values = {'gamma':[0.01, 0.1, 1, 10, 100]}
I need to apply the penalties L1 and L2 in a Logistic Regression.
I couldn't verify whether the scoring works because I get the following error:
Invalid parameter gamma for estimator LogisticRegression. Check the list of available parameters with estimator.get_params().keys().
This is my code:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('fraud_data.csv')
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
def LogisticR_penalty():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    grid_values = {'gamma':[0.01, 0.1, 1, 10, 100]}
    # train the model with many parameters for "C" and penalty='l1'
    lr_l1 = LogisticRegression(penalty='l1')
    grid_lr_l1 = GridSearchCV(lr_l1, param_grid = grid_values, cv=3, scoring = 'recall')
    grid_lr_l1.fit(X_train, y_train)
    y_decision_fn_scores_recall = grid_lr_l1.decision_function(X_test)
    lr_l2 = LogisticRegression(penalty='l2')
    grid_lr_l2 = GridSearchCV(lr_l2, param_grid = grid_values, cv=3, scoring = 'recall')
    grid_lr_l2.fit(X_train, y_train)
    y_decision_fn_scores_recall = grid_lr_l2.decision_function(X_test)
    # The precision, recall, and accuracy scores for every combination
    # of the parameters in param_grid are stored in cv_results_
    results = pd.DataFrame()
    results['l1_results'] = pd.DataFrame(grid_lr_l1.cv_results_)
    results['l1_results'] = results['l2_results'].sort_values(by='mean_test_precision_score', ascending=False)
    results['l2_results'] = pd.DataFrame(grid_lr_l2.cv_results_)
    results['l2_results'] = results['l2_results'].sort_values(by='mean_test_precision_score', ascending=False)
    return results
LogisticR_penalty()
I expected .cv_results_ to give the average test scores of each parameter combination, which I think should be available as mean_test_precision_score, but I'm not sure.
The output is: ValueError: Invalid parameter gamma for estimator LogisticRegression. Check the list of available parameters with estimator.get_params().keys().
The error message contains the answer to your question. You can use estimator.get_params().keys() to see all available parameters for your estimator:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
print(lr.get_params().keys())
Output:
dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])
From scikit-learn's documentation, the LogisticRegression has no parameter gamma, but a parameter C for the regularization weight.
If you change grid_values = {'gamma': [0.01, 0.1, 1, 10, 100]} to grid_values = {'C': [0.01, 0.1, 1, 10, 100]}, your code should work.
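For completeness, a minimal sketch of the corrected grid search (assuming the same X_train/y_train split as in the question):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid_values = {'C': [0.01, 0.1, 1, 10, 100]}
grid_lr = GridSearchCV(LogisticRegression(penalty='l2', max_iter=1000),
                       param_grid=grid_values, cv=3, scoring='recall')
grid_lr.fit(X_train, y_train)
print(grid_lr.best_params_)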
My code contained some errors; the main one was using param_grid incorrectly. I had to apply the L1 and L2 penalties with C values 0.01, 0.1, 1, 10, 100. The right way to do this is:
grid_values = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
Then it was necessary to correct the way I was training my logistic regression and to correct the way I retrieved the scores in cv_results_ and averaged those scores.
Follow my code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('fraud_data.csv')
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
def LogisticR_penalty():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    grid_values = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
    # train the model over the grid of "C" values and penalties
    lr = LogisticRegression()
    # We use GridSearchCV to find the value of the range that optimizes a given measurement metric.
    grid_lr_recall = GridSearchCV(lr, param_grid = grid_values, cv=3, scoring = 'recall')
    grid_lr_recall.fit(X_train, y_train)
    y_decision_fn_scores_recall = grid_lr_recall.decision_function(X_test)
    # The precision, recall, and accuracy scores for every combination
    # of the parameters in param_grid are stored in cv_results_
    CVresults = []
    CVresults = pd.DataFrame(grid_lr_recall.cv_results_)
    # test scores and their mean
    split_test_scores = np.vstack((CVresults['split0_test_score'], CVresults['split1_test_score'], CVresults['split2_test_score']))
    mean_scores = split_test_scores.mean(axis=0).reshape(5, 2)
    return mean_scores
LogisticR_penalty()
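Two small, version-dependent notes on the code above: cv_results_ already stores the per-combination averages, and newer scikit-learn versions require a solver that supports the l1 penalty. A sketch of both (inside the function):
# the averaging of split scores can be read directly from cv_results_
mean_scores = CVresults['mean_test_score'].values.reshape(5, 2)

# with recent scikit-learn the default solver (lbfgs) rejects penalty='l1',
# so pick a solver that supports both penalties
lr = LogisticRegression(solver='liblinear', max_iter=1000)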
I am trying to calculate the f1_score, but I get warnings for some cases when I use the sklearn f1_score method.
I have a multilabel prediction problem with 5 classes.
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support
y_true = np.zeros((1,5))
y_true[0,0] = 1 # => label = [[1, 0, 0, 0, 0]]
y_pred = np.zeros((1,5))
y_pred[:] = 1 # => prediction = [[1, 1, 1, 1, 1]]
result_1 = f1_score(y_true=y_true, y_pred=y_pred, labels=None, average="weighted")
print(result_1) # prints 1.0
result_2 = precision_recall_fscore_support(y_true=y_true, y_pred=y_pred, labels=None, average="weighted")
print(result_2) # prints: (1.0, 1.0, 1.0, None) for precision/recall/fbeta_score/support
When I use average="samples" instead of "weighted" I get (0.1, 1.0, 0.1818..., None). Is the "weighted" option not useful for a multilabel problem or how do I use the f1_score method correctly?
I also get a warning when using average="weighted":
"UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples."
It works if you add a bit more data:
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
y_true = np.array([[1,0,0,0], [1,1,0,0], [1,1,1,1]])
y_pred = np.array([[1,0,0,0], [1,1,1,0], [1,1,1,1]])
recall_score(y_true=y_true, y_pred=y_pred, average='weighted')
>>> 1.0
precision_score(y_true=y_true, y_pred=y_pred, average='weighted')
>>> 0.9285714285714286
f1_score(y_true=y_true, y_pred=y_pred, average='weighted')
>>> 0.95238095238095244
The data suggests we have not missed any true positives, i.e. there are no false negatives (recall_score equals 1). However, we have predicted one false positive in the second observation, which leads to a precision_score of ~0.93.
Since both precision_score and recall_score are non-zero with the weighted parameter, f1_score is well defined. I believe your case is degenerate due to the lack of information in your one-sample example.
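Depending on your scikit-learn version, you can also make the behaviour for labels with no true samples explicit instead of getting the warning, via the zero_division parameter (a sketch using the question's original one-sample arrays; available in newer scikit-learn versions):
# zero_division=0 sets the score to 0 for labels with no true samples
# (the cause of the UndefinedMetricWarning) without raising the warning
result = f1_score(y_true=y_true, y_pred=y_pred, average="weighted", zero_division=0)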