cross validation for logistic regression by setting up threshold probability - python-3.x

I have a dataset split into X_train, y_train, X_test and y_test. I want to train a logistic regression with 10-fold cross-validation (K=10) and, at the same time, get the F1 score and accuracy for each fold.
I would also like to set the probability threshold to, let's say, 0.65.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
How can this be done in a single line with sklearn.model_selection.cross_validate or sklearn.model_selection.cross_val_score?
Thanks in advance
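One possible approach, sketched here under assumptions rather than offered as the definitive answer: cross_validate accepts a dict of callables with the signature scorer(estimator, X, y), so the threshold can be applied inside the scorer. The helper names (predict_with_threshold, f1_at_threshold, accuracy_at_threshold) and the synthetic data from make_classification are illustrative stand-ins, and the sketch assumes binary labels 0/1 so that column 1 of predict_proba is the positive class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_validate

THRESHOLD = 0.65  # probability cut-off from the question


def predict_with_threshold(estimator, X, threshold=THRESHOLD):
    # Hypothetical helper: label 1 only when P(class 1) >= threshold.
    proba = estimator.predict_proba(X)[:, 1]
    return (proba >= threshold).astype(int)


def f1_at_threshold(estimator, X, y):
    # Scorer callable: cross_validate calls it with the fitted fold estimator.
    return f1_score(y, predict_with_threshold(estimator, X))


def accuracy_at_threshold(estimator, X, y):
    return accuracy_score(y, predict_with_threshold(estimator, X))


# Synthetic stand-in for X_train / y_train (an assumption, not the asker's data).
X_train, y_train = make_classification(n_samples=300, n_features=8, random_state=0)

clf = LogisticRegression(max_iter=1000)
results = cross_validate(
    clf, X_train, y_train, cv=10,
    scoring={"f1": f1_at_threshold, "accuracy": accuracy_at_threshold},
)
print(results["test_f1"])        # one F1 score per fold
print(results["test_accuracy"])  # one accuracy per fold
```

The cross_validate call itself is a single statement; the scorers are defined once and can be reused with other estimators that expose predict_proba.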

Related

Cross-validation when using the Gaussian Naive Bayes model

Well, I am trying to solve this classification problem that involves the Gaussian Naive Bayes algorithm.
Question:
Classification
Consider the data in the file - link below. Train the algorithm Gaussian Naive Bayes using the method of cross-validation holdout (Use the first 700 lines for the training set and the rest for the test set.) What is the accuracy of the training set? What is the accuracy of the test set? Do the same training with the method Leave-One-Out. What is the average accuracy for the training set? What is the average accuracy for the test set?
My solution that I am not sure about:
Basic code (full code in the Colab link below):
#Using Holdout
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_train = classifier.predict(X_train)
cm0 = confusion_matrix(y_train, y_pred_train )
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
print(accuracy_score(y_train, y_pred_train))
My Answer for the holdout:
[[ 23 51]
[ 21 205]]
0.76
0.7871428571428571
LOO:
#Using LOO
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import LeaveOneOut
#This is where I got the code: https://machinelearningmastery.com/loocv-for-evaluating-machine-learning-algorithms/
cv = LeaveOneOut()
accuracies = cross_val_score(estimator=classifier, X = X_train, y = y_pred_train, scoring='accuracy',cv=cv)
print(f"Accuracy Train {accuracies.mean()}")
print(f"Standard Deviation {accuracies.std()}")
accuraciestest = cross_val_score(estimator=classifier, X = X_test, y = y_test, scoring='accuracy', cv=cv)
print(f"Accuracy Test {accuraciestest.mean()}")
print(f"Standard Deviation Test {accuraciestest.std()}")
My Answer for the LeaveOneOut:
Accuracy Train 0.9771428571428571
Standard Deviation 0.1494479637785374
Accuracy Test 0.7433333333333333
Standard Deviation Test 0.43679387460092534
Data:
https://drive.google.com/file/d/1v9V-007yV3vVckPcQN0Q5VuNZYF_JjBW/view?usp=sharing
Colabs Link: https://colab.research.google.com/drive/1X68-Li6FacnAAQ4ASg3mmqdrU15v2ReP?usp=sharing
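A hedged observation on the Leave-One-Out code above: the first cross_val_score call passes y_pred_train (the model's own predictions) as the target instead of y_train, so the reported training figure probably measures agreement with earlier predictions rather than accuracy against the true labels. The sketch below shows one way to get the average train and test accuracy per Leave-One-Out split with cross_validate and return_train_score=True. The data comes from make_classification and is a synthetic stand-in, not the linked file.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the linked data set (an assumption, not the real file).
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

classifier = GaussianNB()
results = cross_validate(
    classifier, X, y,
    cv=LeaveOneOut(),         # one split per sample
    scoring="accuracy",
    return_train_score=True,  # also score each model on its own training part
)

# Each test score is 0 or 1 because every test fold holds a single sample.
print(f"Mean train accuracy: {results['train_score'].mean():.3f}")
print(f"Mean test accuracy:  {results['test_score'].mean():.3f}")
```

With Leave-One-Out the standard deviation of the test scores is large by construction, since each individual score is either 0 or 1.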

Prediction with linear regression is very inaccurate

This is the CSV that I'm using: https://gist.github.com/netj/8836201. Currently, I'm trying to predict the variety, which is categorical data, with linear regression, but the predictions are very inaccurate. The actual labels are just combinations of 0 and 1, but the predictions are values like 0.something and 1.something, and even negative numbers, which in my opinion is very inaccurate. Where did I make the mistake, and what is the solution for this inaccuracy? This is an assignment my teacher gave me; he said we could predict categorical data with linear regression, not only with logistic regression.
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
from sklearn import metrics
path= r"D:\python projects\iris.csv"
df = pd.read_csv(path)
array = df.values
X = array[:,0:3]
y = array[:,4]
le = preprocessing.LabelEncoder()
ohe = preprocessing.OneHotEncoder(categorical_features=[0])
y = le.fit_transform(y)
y = y.reshape(-1,1)
y = ohe.fit_transform(y).toarray()
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=0)
sc = preprocessing.StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
y_train = sc.fit_transform(y_train)
model = LinearRegression(n_jobs=-1).fit(X_train, y_train)
y_pred = model.predict(X_test)
df = pd.DataFrame({'Actual': X_test.flatten(), 'Predicted': y_pred.flatten()})
the output :
y_pred
Out[46]:
array([[-0.08676055, 0.43120144, 0.65555911],
[ 0.11735424, 0.72384335, 0.1588024 ],
[ 1.17081347, -0.24484483, 0.07403136],
X_test
Out[61]:
array([[-0.09544771, -0.58900572, 0.72247648],
[ 0.14071157, -1.98401928, 0.10361279],
[-0.44968663, 2.66602591, -1.35915595],
Linear Regression is used to predict continuous output data. As you correctly said, you are trying to predict categorical (discrete) output data. Essentially, you want to be doing classification instead of regression - linear regression is not appropriate for this.
As you also said, logistic regression can and should be used instead as it is applicable to classification tasks.
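As a minimal sketch of that suggestion (not the asker's exact setup): the example below fits LogisticRegression on scikit-learn's bundled copy of the iris data via load_iris, which is assumed here to stand in for the CSV from the question. The class labels are passed directly, so no one-hot encoding or scaling of the target is needed.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Bundled iris data, assumed to match the CSV used in the question.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale the features only; the class labels stay as integers 0, 1, 2.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)  # discrete class labels, not continuous values

print(y_pred)
print(accuracy_score(y_test, y_pred))
```

The predictions are now discrete class labels, so they can be compared with y_test directly.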

Using python 3 how to get co-variance/variance

I have a simple linear regression model and I need to compute the variance and the co-variance. How can I calculate the variance and co-variance using linear regression?
Variance, in the context of Machine Learning, is a type of error that occurs due to a model's sensitivity to small fluctuations in the training set.
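from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
# scikit-learn expects a 2-D feature matrix, so reshape the single feature into a column
X = np.array([2, 3, 4, 5]).reshape(-1, 1)
y = np.array([4, 3, 2, 9])
# train-test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Train the model using the training sets
model = LinearRegression()
model.fit(x_train, y_train)
# Predict on the held-out samples
y_predict = model.predict(x_test)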
Try this on the prediction vector to get the variance and the co-variance:
y_variance = np.mean((y_predict - np.mean(y_predict))**2)
y_covariance = np.mean(y_predict - y_test)
Note: "co-variance" here is the mean difference between the predictions and their true values (the mean signed error), not the statistical covariance between the two vectors.
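For comparison, here is a small sketch of the textbook definitions using NumPy's np.var and np.cov. The y_pred values are made up for illustration and do not come from the model above.

```python
import numpy as np

y_true = np.array([4.0, 3.0, 2.0, 9.0])
y_pred = np.array([3.5, 3.0, 2.5, 8.0])  # hypothetical predictions, for illustration only

# Population variance of the predictions (divides by N).
pred_variance = np.var(y_pred)

# 2x2 covariance matrix of (y_true, y_pred); bias=True also divides by N.
cov_matrix = np.cov(y_true, y_pred, bias=True)
covariance = cov_matrix[0, 1]  # off-diagonal entry: cov(y_true, y_pred)

print(pred_variance)
print(covariance)
```

The diagonal of cov_matrix holds the two variances, so cov_matrix[1, 1] matches pred_variance when both use the same normalisation.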

return parameters of best score of cross validation for linear regression in scikit learn

This is the code for cross-validation of the linear regression model. As you can see, the best score is 0.7, but how can I retrieve the parameters (coefficients) of the model with the best score?
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
clf = linear_model.LinearRegression()
scores = cross_val_score(clf, data_f[features], data_f['temperature'], cv=5)
scores
this is the result
array([ 0.61858698, 0.52880606, 0.70729139, 0.48306915, 0.68386676])
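cross_val_score discards the fitted models, so the coefficients cannot be read back from its return value. One option, sketched below, is cross_validate with return_estimator=True, which keeps the estimator fitted on each fold. The data from make_regression is a synthetic stand-in for data_f[features] and data_f['temperature'].

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the asker's features and target (an assumption).
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

results = cross_validate(LinearRegression(), X, y, cv=5, return_estimator=True)

# Index of the fold with the highest test score (R^2 by default for regressors).
best = int(np.argmax(results["test_score"]))
best_model = results["estimator"][best]

print(results["test_score"])
print(best_model.coef_, best_model.intercept_)
```

The per-fold scores are usually read as an estimate of how well the modelling procedure generalises, so it may be preferable to refit on all the data rather than keep the coefficients of a single lucky fold.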

scikit-learn cross_validation over-fitting or under-fitting

I'm using scikit-learn cross_validation (http://scikit-learn.org/stable/modules/cross_validation.html) and get, for example, a mean score of 0.82 (r2_scorer).
How can I tell whether I am over-fitting or under-fitting using scikit-learn functions?
Unfortunately, the cross_val_score tool only reports test scores, so on its own it cannot compare train and test scores in a CV setup.
You can set up your own loop with the train_test_split function as in Ando's answer, but you can also use any other CV scheme.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import get_scorer
# regressor, X and y are assumed to be defined already (X and y as NumPy arrays)
scorer = get_scorer('r2')
cv = KFold(n_splits=5)
train_scores, test_scores = [], []
for train, test in cv.split(X):
    regressor.fit(X[train], y[train])
    train_scores.append(scorer(regressor, X[train], y[train]))
    test_scores.append(scorer(regressor, X[test], y[test]))
mean_train_score = np.mean(train_scores)
mean_test_score = np.mean(test_scores)
If you compute the mean train and test scores with cross-validation, you can then find out whether you are:
Underfitting: the train score is far from the perfect score (which is 1.0 for r2).
Overfitting: the train and test scores are not close to one another (the mean test score is significantly lower than the mean train score).
Note: you can be both significantly underfitting and overfitting at the same time if your model is inadequate and your data is too noisy.
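In more recent scikit-learn releases, cross_validate with return_train_score=True can report both scores without a hand-written loop. A minimal sketch, assuming synthetic data from make_regression and a plain LinearRegression as the regressor:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression data (an assumption, used only to make the sketch runnable).
X, y = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=0)

regressor = LinearRegression()
results = cross_validate(
    regressor, X, y, cv=5, scoring="r2", return_train_score=True
)

mean_train_score = results["train_score"].mean()
mean_test_score = results["test_score"].mean()

# A large gap suggests overfitting; a low train score suggests underfitting.
print(f"train r2: {mean_train_score:.3f}, test r2: {mean_test_score:.3f}")
```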
You should compare your scores when testing on training and testing data. If the scores are close to equal and both are low, you are likely underfitting. If they are far apart, you are likely overfitting (unless using a method such as random forest).
To compute the scores for both train and test data, you can use something along the following lines (assuming your data is in variables X and Y):
from sklearn import svm
from sklearn.model_selection import train_test_split
# do five iterations
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4)
    # Your predictor, linear SVM in this example
    clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
    print("Test score", clf.score(X_test, y_test))
    print("Train score", clf.score(X_train, y_train))
