Geneating Learning Curve for Logistic Regression - python-3.x

I have a dataset of 260 microscopic image.I want to generate a learning curve for logistic regression algorithm.But I am getting this error "'module' object is not iterable" .Perhaps I don't understand something basic, as I'm a begineer who freshly learning Python
from sklearn.cross_validation import train_test_split
from imutils import paths
from scipy import misc
import numpy as np
import argparse
import imutils
import cv2
import os
from matplotlib import pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.model_selection import cross_val_score
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
n_jobs=None, train_sizes=np.linspace(50, 80, 110)):
"""
Generate a simple plot of the test and training learning curve.
Parameters
----------
estimator : object type that implements the "fit" and "predict" methods
An object of that type which is cloned for each validation.
title : string
Title for the chart.
X : array-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and
n_features is the number of features.
y : array-like, shape (n_samples) or (n_samples, n_features), optional
Target relative to X for classification or regression;
None for unsupervised learning.
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the default 3-fold cross-validation,
- integer, to specify the number of folds.
- :term:`CV splitter`,
- An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if ``y`` is binary or multiclass,
:class:`StratifiedKFold` used. If the estimator is not a classifier
or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
Refer :ref:`User Guide <cross_validation>` for the various
cross-validators that can be used here.
n_jobs : int or None, optional (default=None)
Number of jobs to run in parallel.
``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
``-1`` means using all processors. See :term:`Glossary <n_jobs>`
for more details.
train_sizes : array-like, shape (n_ticks,), dtype float or int
Relative or absolute numbers of training examples that will be used to
generate the learning curve. If the dtype is float, it is regarded as a
fraction of the maximum size of the training set (that is determined
by the selected validation method), i.e. it has to be within (0, 1].
Otherwise it is interpreted as absolute sizes of the training sets.
Note that for classification the number of samples usually have to
be big enough to contain at least one sample from each class.
(default: np.linspace(0.1, 1.0, 5))
"""
plt.figure()
plt.title(title)
if ylim is not None:
plt.ylim(*ylim)
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.grid()
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
plt.legend(loc="best")
return plt
#training with logistic regression
clfLR = LogisticRegression(random_state=0, solver='lbfgs')
clfLR.fit(trainFeat,trainLabels)
acc = clfLR.score(testFeat, testLabels)
print("accuracy of Logistic regression ",acc)
I am facing this problem only when i Want to plot the curve.Rest of the code works fine.
#plotting the curve
estimator =LogisticRegression()
train_sizes, train_scores, valid_scores = plot_learning_curve(
estimator,'logistic learning curve ', trainFeat, trainLabels, cv=5, n_jobs=4,train_sizes=[50, 80, 110])
print(train_sizes)
plt.show()
Screenshot of the error
learning curve

Try running the code on Jupyter online IDE IDE. It plots automatically if you add "%matplotlib" line to import section.
If you want to keep working on this IDE, share your error message so maybe I can help you. You are missing one of the imports probably or it might be a Python2/3 problem.

Related

What does "n_features" and "centers" parameters mean in make_blobs in SciKit?

I have gone through the documents about n_features and centers parameters in make_blobs function in SciKit. However, every explanation I've seen doesn't sound so clear to me since I am new to SciKit and Mathematics. I am wondering what do these two parameters: n_features, centers do in make_blobs function as below.
make_blobs(n_samples=50, n_features=2, centers=2, random_state=75)
Thank you in advance.
The make_blobs function is a part of sklearn.datasets.samples_generator. All methods in the package, help us to generate data samples or datasets. In machine learning, which scikit-learn all about, datasets are used to evaluate performance of machine learning models. This is an example on how to evaluate a KNN classifier:
from sklearn.datasets.samples_generator import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = make_blobs(n_features=2, centers=3)
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = KNeighborsClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print('accuracy: {}%'.format(acc))
Now, as you mentioned, n_features determined how many columns or features the generated datasets will have. In machine learning, features correspond to numerical characteristics data. For example, in Iris Dataset, there are 4 features (Sepal Length, Sepal Width, Petal Length and Petal Width) so there are 4 numerical columns in the dataset. So by increasing n_features in make_blobs, we are adding more features hence increase the complexity of generated dataset.
As for the centers, it is easier to understand by visualizing the generated dataset. I use matplotlib to help us on that:
from sklearn.datasets.samples_generator import make_blobs
import matplot
# plot 1
X, y = make_blobs(n_features=2, centers=1)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.savefig('centers_1.png')
plt.title('centers = 1')
# plot 2
X, y = make_blobs(n_features=2, centers=2)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 2')
# plot 3
X, y = make_blobs(n_features=2, centers=3)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title('centers = 3')
plt.show()
If you run the code above you can easily see that centers corresponds to number of classes generated in the data. It uses centers as a term because samples that belong to same class, tend to gather close to a center (coordinate).

eli5: show_weights() with two labels

I'm trying eli5 in order to understand the contribution of terms to the prediction of certain classes.
You can run this script:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
#categories = ['alt.atheism', 'soc.religion.christian']
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics']
np.random.seed(1)
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=7)
test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=7)
bow_model = CountVectorizer(stop_words='english')
clf = LogisticRegression()
pipel = Pipeline([('bow', bow),
('classifier', clf)])
pipel.fit(train.data, train.target)
import eli5
eli5.show_weights(clf, vec=bow, top=20)
Problem:
When working with two labels, the output is unfortunately limited to only one table:
categories = ['alt.atheism', 'soc.religion.christian']
However, when using three labels, it also outputs three tables.
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics']
Is it a bug in the software that it misses y=0 in the first output, or do I miss a statistical point? I would expect to see two tables for the first case.
This has not to do with eli5 but with how scikit-learn (in this case LogisticRegression()) treats two categories. For only two categories, the problem turns into a binary one, so only a single column of attributes is returned everywhere from learned classifier.
Look at the attributes of LogisticRegression:
coef_ : array, shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary.
intercept_ : array, shape (1,) or (n_classes,)
Intercept (a.k.a. bias) added to the decision function.
If fit_intercept is set to False, the intercept is set to zero.
intercept_ is of shape(1,) when the problem is binary.
coef_ is of shape (1, n_features) when binary. This coef_ is used by the eli5.show_weights().
Hope this makes it clear.

Scikitlearn LinearSVC Bad input shape

I'm trying to use LinearSVC on my data! My code below:
from sklearn import svm
clf2 = svm.LinearSVC()
clf2.fit(X_train, y_train)
Results in the following error:
ValueError: bad input shape (2190, 9)
I've used one-hot encoding on my y value before splitting into y_test and y_train, and believe this to be the issue. I've tried implementing similar fixes (sklearn (Bad Input Shape) ValueError) but still get errors when I try and re-shape.
After one hot-encoding, I have a target variable (y) that has 9 classes, and there are a total of 2190 samples i'm running. It seems I need to reduce these 9 classes to 1 class in order to fit the SVM.
Any suggestions would be greatly appreciated!
LinearSVC dont accept 2-d values for y. As documented:
Parameters:
y : array-like, shape = [n_samples]
Target vector relative to X
So you don't need to convert into one-hot encoded matrix. Just supply them as is, even if its strings. They will be internally handled correctly.
Accoding to the document,
you may try sklearn.multiclass.OneVsRestClassifier as follows:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train, y_train)
You need to reshape the arrays. Here is an example using random data and as target variable a variable that contains 5 classes:
import numpy as np
from sklearn import svm
# 100 samples and 10 features
x = np.random.rand(100, 10)
#5 classes
y = [1,2,3,4,5] * 20
x = np.asarray(x)
y = np.asarray(y)
print(x.shape)
print(y.shape)
clf2 = svm.LinearSVC()
clf2.fit(x, y)
Results:
(100, 10)
(100,)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
verbose=0)

sample_weight parameter shape error in scikit-learn GridSearchCV

Passing the sample_weight parameter to GridSearchCV raises an error due to incorrect shape. My suspicion is that cross validation is not capable of handling the split of sample_weights accordingly with the dataset.
First part: Using sample_weight as a model parameter works beautifully
Let's consider a simple example, first without GridSearch:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
dataURL = 'https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sinusoidal_data.csv'
x = pd.read_csv(dataURL, usecols=["x"]).x
y = pd.read_csv(dataURL, usecols=["y"]).y
occurrences = pd.read_csv(dataURL, usecols=["Occurrences"]).Occurrences
my_sample_weights = (1 - occurrences/10000)**3
my_sample_weights contains the importance that I assign to each observation in x, y, as the following picture shows. The points of the sinusoidal curve get higher weights than those forming the background noise.
plt.scatter(x, y, c=my_sample_weights>0.9, cmap="cool")
Let's train a neural network, first without using the information contained in my_sample_weights:
def make_model(number_of_hidden_neurons=1):
model = Sequential()
model.add(Dense(number_of_hidden_neurons, input_shape=(1,), activation='tanh'))
model.add(Dense(1, activation='linear'))
model.compile(optimizer='sgd', loss='mse')
return model
net_Not_using_sample_weight = make_model(number_of_hidden_neurons=6)
net_Not_using_sample_weight.fit(x,y, epochs=1000)
plt.scatter(x, y, )
plt.scatter(x, net_Not_using_sample_weight.predict(x), c="green")
As the following picture shows, the neural network tries to fit the shape of the sinusoidal but the background noise prevents it from a good fit.
Now, using the information of my_sample_weights , the quality of the prediction is a much better one.
Second part: Using sample_weight as a GridSearchCV parameter raises an error
my_Regressor = KerasRegressor(make_model)
validator = GridSearchCV(my_Regressor,
param_grid={'number_of_hidden_neurons': range(4, 5),
'epochs': [500],
},
fit_params={'sample_weight': [ my_sample_weights ]},
n_jobs=1,
)
validator.fit(x, y)
Trying to pass the sample_weights as a parameter gives the following error:
...
ValueError: Found a sample_weight array with shape (1000,) for an input with shape (666, 1). sample_weight cannot be broadcast.
It seems that the sample_weight vector has not been split in a similar manner to the input array.
For what is worth:
import sklearn
print(sklearn.__version__)
0.18.1
import keras
print(keras.__version__)
2.0.5
The problem is that as a standard, the GridSearch uses 3-fold cross-validation, unless explicity stated otherwise. This means that 2/3 data points of the data are used as training data and 1/3 for cross-validation, which does fit the error message. The input shape of 1000 of the fit_params doesn't match the number of training examples used for training (666). Adjust the size and the code will run.
my_sample_weights = np.random.uniform(size=666)
We developed PipeGraph, an extension to Scikit-Learn Pipeline that allows you to get intermediate data, build graph like workflows, and in particular, solve this problem (see the examples in the gallery at http://mcasl.github.io/PipeGraph )

trying a custom computation of grid.best_score_ (obtained with GridSearchCV)

I'm trying to recompute grid.best_score_ I obtained on my own data without success...
So I tried it using a conventional dataset but no more success. Here is the code :
from sklearn import datasets
from sklearn import linear_model
from sklearn.cross_validation import ShuffleSplit
from sklearn import grid_search
from sklearn.metrics import r2_score
import numpy as np
lr = linear_model.LinearRegression()
boston = datasets.load_boston()
target = boston.target
param_grid = {'fit_intercept':[False]}
cv = ShuffleSplit(target.size, n_iter=5, test_size=0.30, random_state=0)
grid = grid_search.GridSearchCV(lr, param_grid, cv=cv)
grid.fit(boston.data, target)
# got cv score computed by gridSearchCV :
print grid.best_score_
0.677708680059
# now try a custom computation of cv score
cv_scores = []
for (train, test) in cv:
y_true = target[test]
y_pred = grid.best_estimator_.predict(boston.data[test,:])
cv_scores.append(r2_score(y_true, y_pred))
print np.mean(cv_scores)
0.703865991851
I can't see why it's different, GridSearchCV is supposed to use scorer from LinearRegression, which is r2 score. Maybe the way I code cv score is not the one used to compute best_score_... I'm asking here before going through GridSearchCV code.
Unless refit=False in the GridSearchCV constructor, the winning estimator is refit on the entire dataset at the end of fit. best_score_ is the estimator's average score using the cross-validation splits, while best_estimator_ is an estimator of the winning configuration fit on all the data.
lr2 = linear_model.LinearRegression(fit_intercept=False)
scores2 = [lr2.fit(boston.data[train,:], target[train]).score(boston.data[test,:], target[test])
for train, test in cv]
print np.mean(scores2)
Will print 0.67770868005943297.

Resources