How can the coefficients of the multiclass logistic regression model be used to predict the probabilities of class membership of observations? - scikit-learn

I am trying to solve a problem that resembles Fisher's iris classification. The catch is that I can train the model on my computer, but the trained model has to predict class membership on a computer where it is impossible to install Python and scikit-learn. I want to understand how, given the coefficients of the logistic regression model, I can predict membership of a certain class without using the model's predict method.
Using the Fisher problem as an example, I do the following.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, f1_score
# data preparation
iris = load_iris()
data = pd.DataFrame(data=np.hstack([iris.data, iris.target[:, np.newaxis]]),
                    columns=iris.feature_names + ['target'])
names = data.columns
# split data
X_train, X_test, y_train, y_test = train_test_split(data[names[:-1]], data[names[-1]], random_state=42)
# train model
cls = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=2, random_state=42)
)
cls = cls.fit(X_train.to_numpy(), y_train)
preds_train = cls.predict(X_train)
# prediction
preds_test = cls.predict(X_test)
# scores
train_score = accuracy_score(preds_train, y_train), f1_score(preds_train, y_train, average='macro') # on train data
# train_score = (0.9642857142857143, 0.9653621232568601)
test_score = accuracy_score(preds_test, y_test), f1_score(preds_test, y_test, average='macro') # on test data
# test_score = (1.0, 1.0)
# model coefficients
cls[1].coef_, cls[1].intercept_
>>> (array([[-1.13948079, 1.30623841, -2.21496793, -2.05617771],
[ 0.66515676, -0.2541143 , -0.55819748, -0.86441227],
[ 0.47432404, -1.05212411, 2.77316541, 2.92058998]]),
array([-0.35860337, 2.43929019, -2.08068682]))
Now I have the coefficients of the model. And I want to use them to make predictions.
First, I predict the class probabilities with the predict_proba method for the first five observations in the test sample.
preds_test = cls.predict_proba(X_test)
preds_test[0:5]
>>>array([[5.66019001e-03, 9.18455687e-01, 7.58841233e-02],
[9.75854479e-01, 2.41455095e-02, 1.10881450e-08],
[1.18780156e-09, 6.53295166e-04, 9.99346704e-01],
[6.71574900e-03, 8.14174200e-01, 1.79110051e-01],
[6.98756622e-04, 8.09096425e-01, 1.90204818e-01]])
Then I manually calculate the predictions of the class probabilities for the observations using the coefficients of the model.
# define two functions for making predictions
def logit(x, w):
    return np.dot(x, w)

# from here: https://stackoverflow.com/questions/34968722/how-to-implement-the-softmax-function-in-python
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis]  # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis]  # ditto
    return e_x / div
n, k = X_test.shape
X_ = np.hstack((np.ones((n, 1)), X_test)) # add column with 1 for intercept
weights = np.hstack((cls[1].intercept_[:, np.newaxis], cls[1].coef_)) # create weights matrix
results = softmax(logit(X_, weights.T)) # calculate probabilities
results[0:5]
>>>array([[3.67343725e-14, 4.63938438e-06, 9.99995361e-01],
[2.81976786e-05, 8.63083152e-01, 1.36888650e-01],
[1.24572182e-22, 5.47800683e-11, 1.00000000e+00],
[3.32990060e-14, 3.08352323e-06, 9.99996916e-01],
[2.66415118e-15, 1.78252465e-06, 9.99998217e-01]])
If you compare the two results (preds_test[0:5] and results[0:5]), you can see that they do not coincide at all. Please explain what I am doing wrong and how I can use the model's coefficients to calculate predictions without using the predict method.

I forgot that a scaler was applied. If you change the code a little, then the results are the same.
scaler = StandardScaler()
scaler.fit(X_train)
X_test_transf = scaler.transform(X_test)
def logit(x, w):
    return np.dot(x, w)

def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis]  # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis]  # ditto
    return e_x / div
n, k = X_test_transf.shape
X_ = np.hstack((np.ones((n, 1)), X_test_transf))
weights = np.hstack((cls[1].intercept_[:, np.newaxis], cls[1].coef_))
results = softmax(logit(X_, weights.T))
np.allclose(preds_test, results)
>>>True
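As a follow-up for the original goal (scoring on a machine where scikit-learn cannot be installed), the same arithmetic can be packed into a NumPy-only function once the scaler statistics and coefficients are exported. This is a minimal sketch, assuming NumPy is available on the target machine; cls and X_test are the objects from the code above, and mean_ / scale_ are attributes of the fitted StandardScaler.
# export once, on the machine where the model was trained
mean_ = cls[0].mean_            # StandardScaler statistics
scale_ = cls[0].scale_
coef_ = cls[1].coef_            # LogisticRegression coefficients
intercept_ = cls[1].intercept_

# apply anywhere that has NumPy: standardize, compute logits, apply softmax
def predict_proba_manual(X):
    X_std = (np.asarray(X) - mean_) / scale_
    z = X_std @ coef_.T + intercept_
    z -= z.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

np.allclose(cls.predict_proba(X_test), predict_proba_manual(X_test))  # same computation as above, so this should be True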

For a binary problem there are two values in every predict_proba row: the first is the probability of the event not occurring and the second is the probability of the event occurring. Use predict_proba(X)[:, 1] to get the probability of the event occurring.
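As a small illustration of that convention on a synthetic binary problem (a sketch, not part of the iris example above; the data here are made up):
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(100, 3)                      # made-up features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # made-up binary target

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X[:5])
print(proba)          # shape (5, 2); column 0 is P(class 0), column 1 is P(class 1)
print(proba[:, 1])    # probability of the positive class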

Related

How to calculate Precision, Recall and F1 score using sklearn

I am trying to calculate the Precision, Recall and F1 score in this sample code. I have calculated the accuracy of the model on the train and test datasets. Kindly help me calculate these metrics.
Please look at the code; I have commented every important line for explanation.
# develop a classifier for the Faces Dataset
from numpy import load
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC
import pickle
# load dataset
data = load('faces-embeddings.npz')
trainX, trainy, testX, testy = data['arr_0'], data['arr_1'], data['arr_2'], data['arr_3']
print('Dataset: train=%d, test=%d' % (trainX.shape[0], testX.shape[0]))
# normalize input vectors
in_encoder = Normalizer(norm='l2')
trainX = in_encoder.transform(trainX)
testX = in_encoder.transform(testX)
# label encode targets
out_encoder = LabelEncoder()
out_encoder.fit(trainy)
trainy = out_encoder.transform(trainy)
testy = out_encoder.transform(testy)
# fit model
model = SVC(kernel='linear', probability=True)
model.fit(trainX, trainy)
#Saving Model
filename = 'finalized_model.sav'
pickle.dump(model, open(filename, 'wb'))
# predict
yhat_train = model.predict(trainX)
yhat_test = model.predict(testX)
# score
score_train = accuracy_score(trainy, yhat_train)
score_test = accuracy_score(testy, yhat_test)
# summarize
print('Accuracy: train=%.3f, test=%.3f' % (score_train*100, score_test*100))
Knowing the true value of Y (trainy here) and the predicted value of Y (yhat_train here), you can directly compute the precision, recall and F1 score, exactly as you did for the accuracy (thanks to sklearn.metrics):
sklearn.metrics.precision_score(trainy,yhat_train)
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
sklearn.metrics.recall_score(trainy,yhat_train)
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score
sklearn.metrics.f1_score(trainy,yhat_train)
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
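Note that trainy here is presumably multiclass (one label per person in the faces dataset), so these calls also need an average argument; a short sketch, assuming the variables from the question's code:
from sklearn.metrics import precision_score, recall_score, f1_score

# average='macro' treats all classes equally; 'weighted' weights by class frequency
print('Precision: %.3f' % precision_score(trainy, yhat_train, average='macro'))
print('Recall:    %.3f' % recall_score(trainy, yhat_train, average='macro'))
print('F1 score:  %.3f' % f1_score(trainy, yhat_train, average='macro'))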

Unable to calculate Model performance for Decision Tree Regressor

Although my code runs fine on repl and gives me results, it fails in the Katacoda testing environment.
I am attaching the repl file here for your review as well; it also contains the question, commented just above the code I have written.
Kindly review it and let me know what mistakes I am making here.
Repl Link
https://repl.it/repls/WarmRobustOolanguage
Also sharing code below
The question instructions are included as comments.
#Import two modules sklearn.datasets, and sklearn.model_selection.
#Import numpy and set random seed to 100.
#Load popular Boston dataset from sklearn.datasets module and assign it to variable boston.
#Split boston.data into two sets named X_train and X_test. Also, split boston.target into two sets Y_train and Y_test.
#Hint: Use train_test_split method from sklearn.model_selection; set random_state to 30.
#Print the shape of X_train dataset.
#Print the shape of X_test dataset.
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
import numpy as np
np.random.seed(100)
max_depth = range(2, 6)
boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)
print(X_train.shape)
print(X_test.shape)
#Import required module from sklearn.tree.
#Build a Decision tree Regressor model from X_train set and Y_train labels, with default parameters. Name the model as dt_reg.
#Evaluate the model accuracy on the training data set and print its score.
#Evaluate the model accuracy on the testing data set and print its score.
#Predict the housing price for the first two samples of X_test set and print them. (Hint: Use predict() function)
dt_reg = DecisionTreeRegressor(random_state=1)
dt_reg = dt_reg.fit(X_train, Y_train)
print('Accuracy of Train Data :', cross_val_score(dt_reg, X_train,Y_train, cv=10 ))
print('Accuracy of Test Data :', cross_val_score(dt_reg, X_test,Y_test, cv=10 ))
predicted = dt_reg.predict(X_test[:2])
print(predicted)
#Fit multiple Decision tree regressors on X_train data and Y_train labels with max_depth parameter value changing from 2 to 5.
#Evaluate each model accuracy on testing data set.
#Hint: Make use of for loop
#Print the max_depth value of the model with highest accuracy.
dt_reg = DecisionTreeRegressor()
random_grid = {'max_depth': max_depth}
dt_random = RandomizedSearchCV(estimator=dt_reg, param_distributions=random_grid,
                               n_iter=90, cv=3, verbose=2, random_state=42, n_jobs=-1)
dt_random.fit(X_train, Y_train)
dt_random.best_params_
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    return accuracy
best_random = dt_random.best_estimator_
random_accuracy = evaluate(best_random, X_test,Y_test)
print("Accuracy Scores of the Model ",random_accuracy)
best_parameters = (dt_random.best_params_['max_depth']);
print(best_parameters)
The question is asking for default values. Try removing random_state=1.
Current line:
dt_reg = DecisionTreeRegressor(random_state=1)
Updated line:
dt_reg = DecisionTreeRegressor()
I think it should work!
# ================================================================================
# Machine Learning Using Scikit-Learn | 3 | Decision Trees
# ================================================================================
import sklearn.datasets as datasets
import sklearn.model_selection as model_selection
import numpy as np
from sklearn.tree import DecisionTreeRegressor
np.random.seed(100)
# Load popular Boston dataset from sklearn.datasets module and assign it to variable boston.
boston = datasets.load_boston()
# print(boston)
# Split boston.data into two sets names X_train and X_test. Also, split boston.target into two sets Y_train and Y_test
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(boston.data, boston.target, random_state=30)
# Print the shape of X_train dataset
print(X_train.shape)
# Print the shape of X_test dataset.
print(X_test.shape)
# Build a Decision tree Regressor model from X_train set and Y_train labels, with default parameters. Name the model as dt_reg
dt_Regressor = DecisionTreeRegressor()
dt_reg = dt_Regressor.fit(X_train, Y_train)
print(dt_reg.score(X_train,Y_train))
print(dt_reg.score(X_test,Y_test))
predicted = dt_reg.predict(X_test[:2])
print(predicted)
# Get the max depth
maxdepth = 2
maxscore = 0
for x in range(2, 6):
    dt_Regressor = DecisionTreeRegressor(max_depth=x)
    dt_reg = dt_Regressor.fit(X_train, Y_train)
    score = dt_reg.score(X_test, Y_test)
    if maxscore < score:
        maxdepth = x
        maxscore = score
print(maxdepth)

Tensorflow classifier.evaluate running indefinitely?

I've started trying out some of the TensorFlow APIs. I am using the iris data set to experiment with TensorFlow's Estimators. I'm loosely following this tutorial, except that I load my data a little differently: https://www.tensorflow.org/guide/premade_estimators#top_of_page
My problem is that when the code below executes and I get to the section with:
# Evaluate the model.
eval_result = classifier.evaluate(
My computer just runs seemingly without end. I've been waiting for my Jupyter notebook to complete this step for an hour and a half now, but there is no end in sight. The last output of the notebook is:
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Problem statement: How can I adjust my code to make this evaluation more efficient? I'm obviously making it do much more work than I anticipated.
So far I have tried adjusting the batch size and the number of neurons in the layers, but with no luck.
#First we want to import what we need. Typically this will be some combination of:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
%matplotlib inline
#Extract the data from the iris dataset.
df = pd.read_csv('IRIS.csv')
le = LabelEncoder()
df['species'] = le.fit_transform(df['species'])
#Extract both into features and labels.
#features should be a dictionary.
#label can just be an array
def extract_features_and_labels(dataframe):
    #features and label for training
    x = dataframe.copy()
    y = dataframe.pop('species')
    return dict(x), y
#break the data up into train and test.
#split the overall df into training and testing data
train, test = train_test_split(df, test_size=0.2)
train_x, train_y = extract_features_and_labels(train)
test_x, test_y = extract_features_and_labels(test)
print(len(train_x), 'training examples')
print(len(train_y), 'testing examples')
my_feature_columns = []
for key in train_x.keys():
my_feature_columns.append(tf.feature_column.numeric_column(key=key))
def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    # Shuffle, repeat, and batch the examples.
    return dataset.shuffle(1000).repeat().batch(batch_size)
#Build the classifier!!!!
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    # Two hidden layers of 10 nodes each.
    hidden_units=[4, 4],
    # The model must choose between 3 classes.
    n_classes=3)
# Train the Model.
classifier.train(
    input_fn=lambda: train_input_fn(train_x, train_y, 10), steps=1000)
# Evaluate the model.
eval_result = classifier.evaluate(
    input_fn=lambda: train_input_fn(test_x, test_y, 100))
print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))
Had to switch a lot of things up but finally got the estimator working on the IRIS data set. Here is the code below for any who may find it useful in the future. Cheers.
#First we want to import what we need. Typically this will be some combination of:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
%matplotlib inline
#Extract the data from the iris dataset.
df = pd.read_csv('IRIS.csv')
#Grab only our categorial data
#categories = df.select_dtypes(include=[object])
le = LabelEncoder()
df['species'] = le.fit_transform(df['species'])
# use df.apply() to apply le.fit_transform to all columns
#X_2 = X.apply(le.fit_transform)
#Reshape as the one hot encoder doesnt like one row/column
#X_2 = X.reshape(-1, 1)
#features is everything but our label so makes sense to simply....
features = df.drop('species', axis=1)
print(features.head())
target = df['species']
print(target.head())
#trains and test
X_train, X_test, y_train, y_test = train_test_split(
features, target, test_size=0.33, random_state=42)
#Introduce Tensorflow feature column (numeric column)
numeric_column = ['sepal_length','sepal_width','petal_length','petal_width']
numeric_features = [tf.feature_column.numeric_column(key = column)
for column in numeric_column]
print(numeric_features[0])
#Build the input function for training
training_input_fn = tf.estimator.inputs.pandas_input_fn(x=X_train,
                                                        y=y_train,
                                                        batch_size=10,
                                                        shuffle=True,
                                                        num_epochs=None)
#Build the input function for testing input
eval_input_fn = tf.estimator.inputs.pandas_input_fn(x=X_test,
                                                    y=y_test,
                                                    batch_size=10,
                                                    shuffle=False,
                                                    num_epochs=1)
#Instantiate the model
dnn_classifier = tf.estimator.DNNClassifier(feature_columns=numeric_features,
                                            hidden_units=[3, 3],
                                            optimizer=tf.train.AdamOptimizer(1e-4),
                                            n_classes=3,
                                            dropout=0.1,
                                            model_dir="dnn_classifier")
dnn_classifier.train(input_fn = training_input_fn,steps=2000)
#Evaluate the trained model
dnn_classifier.evaluate(input_fn = eval_input_fn)
# print("Loss is " + str(loss))
pred = list(dnn_classifier.predict(input_fn = eval_input_fn))
for e in pred:
    print(e)
    print("\n")
#pred = [p['species'][0] for p in pred]
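For what it's worth, a likely reason the original evaluate() call never finished is that it reused train_input_fn, whose dataset calls .repeat() with no count, so evaluation never reaches the end of the input. A minimal sketch of a one-pass evaluation input function for the original Estimator code (an assumption added here, not part of the answer above):
def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation: no shuffle, no repeat, a single epoch."""
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    return dataset.batch(batch_size)

# evaluate() stops once the dataset is exhausted
eval_result = classifier.evaluate(
    input_fn=lambda: eval_input_fn(test_x, test_y, 100))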

ValueError: Can only tuple-index with a MultiIndex

For a multilabel classification problem I am trying to plot the precision-recall curve.
The sample code is taken from "https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#sphx-glr-auto-examples-model-selection-plot-precision-recall-py" under the section "Create multi-label data, fit, and predict".
I am trying to fit it into my code, but I get the error "ValueError: Can only tuple-index with a MultiIndex" when I try the code below.
train_df.columns.values
array(['DefId', 'DefectCount', 'SprintNo', 'ReqName', 'AreaChange',
'CodeChange', 'TestSuite'], dtype=object)
TestSuite is the value to be predicted.
X_train = train_df.drop("TestSuite", axis=1)
Y_train = train_df["TestSuite"]
X_test = test_df.drop("DefId", axis=1).copy()
classes --> I have hardcoded these with the TestSuite values
from sklearn.preprocessing import label_binarize
# Use label_binarize to be multi-label like settings
Y = label_binarize(Y_train, classes=np.array([0, 1, 2, 3, 4]))
n_classes = Y.shape[1]
# We use OneVsRestClassifier for multi-label prediction
from sklearn.multiclass import OneVsRestClassifier
# Run classifier
classifier = OneVsRestClassifier(svm.LinearSVC(random_state=3))
classifier.fit(X_train, Y_train)
y_score = classifier.decision_function(X_test)
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
import pandas as pd
# For each class
precision = dict()
recall = dict()
average_precision = dict()
#n_classes = Y.shape[1]
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(Y_train[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(Y_train[:, i], y_score[:, i])
Input Data -> Values has been categorised
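A note on the error itself (an observation added here, not part of the original post): Y_train is a pandas Series, and a Series does not support the Y_train[:, i] tuple indexing used in the loop; that is exactly what raises "ValueError: Can only tuple-index with a MultiIndex". A hedged sketch of the indexing side of a fix, assuming a binarized NumPy array for the same split as y_score is used as the target:
# Hypothetical sketch: index a binarized NumPy array (one column per class)
# instead of the raw Series. The labels and y_score must come from the same
# split, i.e. have the same number of rows.
Y_bin = label_binarize(Y_train, classes=np.array([0, 1, 2, 3, 4]))
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(Y_bin[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(Y_bin[:, i], y_score[:, i])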

Writing best GridSearch classifiers into a table

I found and successfully tested the following script, which applies Pipeline and GridSearchCV to classifier selection. The script outputs the best classifier and its accuracy.
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_test = iris.data[:10] # Augmenting test data
y_test = iris.target[:10] # Augmenting test data
#Create a pipeline
pipe = Pipeline([('classifier', LogisticRegression())])
# Create space of candidate learning algorithms and their hyperparameters
search_space = [{'classifier': [LogisticRegression()],
                 'classifier__penalty': ['l1', 'l2'],
                 'classifier__C': np.logspace(0, 4, 10)},
                {'classifier': [RandomForestClassifier()],
                 'classifier__n_estimators': [10, 100, 1000],
                 'classifier__max_features': [1, 2, 3]}]
# Create grid search
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0)
# Fit grid search
best_model = clf.fit(X_train, y_train)
print('Best training accuracy: %.3f' % best_model.best_score_)
print('Best estimator:', best_model.best_estimator_.get_params()['classifier'])
# Predict on test data with best params
y_pred = best_model.predict(X_test)
# Test data accuracy of model with best params
print(classification_report(y_test, y_pred, digits=4))
print('Test set accuracy score for best params: %.3f' % accuracy_score(y_test, y_pred))
from sklearn.metrics import precision_recall_fscore_support
print(precision_recall_fscore_support(y_test, y_pred,
                                      average='weighted'))
How can I adjust the script so that it not only outputs the overall best classifier (LogReg in our example), but also the best candidate from each of the other classifiers? Above, I would like to see the output for RandomForestClassifier() too.
Ideally, the best classifier for each algorithm (LogReg, RandomForest, ...) would be shown, and each of those best classifiers would be collected into a table. The first column or index should be the model, and the precision_recall_fscore_support values should be in the rows to the right. The table should then be sorted by F-score.
PS: Though the script works, I'm still unsure what the purpose of LogisticRegression() in the Pipeline is, since it is redefined in the search space later.
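(A side note, not part of the original thread: the LogisticRegression() in the Pipeline is only a placeholder for the 'classifier' step; GridSearchCV swaps it for each candidate listed under 'classifier' in search_space via set_params. A tiny illustrative sketch:)
# Illustration only: GridSearchCV performs this substitution internally
# for every candidate in search_space.
pipe.set_params(classifier=RandomForestClassifier(n_estimators=10))
print(pipe.named_steps['classifier'])   # now a RandomForestClassifier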
Solution (simplified):
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

iris = datasets.load_iris()
X_train = iris.data
y_train = iris.target
X_test = iris.data[:10]
y_test = iris.target[:10]
seed = 1
models = [
    'RFC',
    'logisticRegression'
]
clfs = [
    RandomForestClassifier(random_state=seed, n_jobs=-1),
    LogisticRegression()
]
params = {
    models[0]: {'n_estimators': [100]},
    models[1]: {'C': [1000]}
}
for name, estimator in zip(models, clfs):
    print(name)
    clf = GridSearchCV(estimator, params[name], scoring='accuracy', refit=True, n_jobs=-1, cv=5)
    clf.fit(X_train, y_train)
    print("best params: " + str(clf.best_params_))
    print("best scores: " + str(clf.best_score_))
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print("Accuracy: {:.4%}".format(acc))
    print(classification_report(y_test, y_pred, digits=4))
If I understood correctly, this should work fine.
import pandas as pd
import numpy as np
df = pd.DataFrame(list(best_model.cv_results_['params']))
ranking = best_model.cv_results_['rank_test_score']
# The sorting is done based on the test_score of the models.
sorting = np.argsort(best_model.cv_results_['rank_test_score'])
# Sort the lines based on the ranking of the models
df_final = df.iloc[sorting]
# The first line contains the best model and its parameters
df_final.to_csv('sorted_table.csv')
# OR to avoid the index in the writing
df_final.to_csv('sorted_table2.csv',index=False)
Results:
However, in this case the ordering is not based on the F values. To achieve that, set the scoring attribute in the GridSearchCV to 'f1_weighted' and repeat my code.
Example:
...
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0,scoring='f1_weighted')
best_model = clf.fit(X_train, y_train)
df = pd.DataFrame(list(best_model.cv_results_['params']))
ranking = best_model.cv_results_['rank_test_score']
# The sorting is done based on the F values of the models.
sorting = np.argsort(best_model.cv_results_['rank_test_score'])
# Sort the lines based on the ranking of the models
df_final = df.iloc[sorting]
df_final.to_csv('F_sorted_table.csv')
Results:
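To get the table described in the question (one row per algorithm's best classifier with precision/recall/F-score, sorted by F-score), here is a hedged sketch building on the simplified per-model loop above; the names models, clfs, params, X_train, y_train, X_test and y_test are taken from that snippet:
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV

rows = {}
for name, estimator in zip(models, clfs):
    # best classifier per algorithm, tuned for weighted F1
    clf = GridSearchCV(estimator, params[name], scoring='f1_weighted', cv=5)
    clf.fit(X_train, y_train)
    p, r, f, _ = precision_recall_fscore_support(y_test, clf.predict(X_test),
                                                 average='weighted')
    rows[name] = {'precision': p, 'recall': r, 'f1': f}

# one row per model, sorted by F-score
table = pd.DataFrame(rows).T.sort_values('f1', ascending=False)
table.to_csv('per_model_metrics.csv')
print(table)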
