eli5: show_weights() with two labels - scikit-learn

I'm trying eli5 in order to understand the contribution of terms to the prediction of certain classes.
You can run this script:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
#categories = ['alt.atheism', 'soc.religion.christian']
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics']
np.random.seed(1)
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=7)
test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=7)
bow = CountVectorizer(stop_words='english')
clf = LogisticRegression()
pipel = Pipeline([('bow', bow),
                  ('classifier', clf)])
pipel.fit(train.data, train.target)
import eli5
eli5.show_weights(clf, vec=bow, top=20)
Problem:
When working with two labels, the output is unfortunately limited to only one table:
categories = ['alt.atheism', 'soc.religion.christian']
However, when using three labels, it also outputs three tables.
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics']
Is it a bug in the software that it misses y=0 in the first output, or am I missing a statistical point? I would expect to see two tables in the first case.

This has nothing to do with eli5 but with how scikit-learn (in this case LogisticRegression()) treats two categories. With only two categories, the problem becomes a binary one, so the learned classifier returns only a single row of coefficients everywhere.
Look at the attributes of LogisticRegression:
coef_ : array, shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary.
intercept_ : array, shape (1,) or (n_classes,)
Intercept (a.k.a. bias) added to the decision function.
If fit_intercept is set to False, the intercept is set to zero.
intercept_ is of shape (1,) when the problem is binary.
coef_ is of shape (1, n_features) in the binary case, and this single-row coef_ is what eli5.show_weights() displays, hence the single table.
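To see this concretely, here is a minimal sketch (toy data, not the newsgroups corpus above) that fits LogisticRegression on two and on three classes and prints the resulting coef_ shapes:
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.RandomState(0).rand(30, 5)
y2 = np.array([0, 1] * 15)       # two classes
y3 = np.array([0, 1, 2] * 10)    # three classes
print(LogisticRegression().fit(X, y2).coef_.shape)  # (1, 5): binary -> a single weights table
print(LogisticRegression().fit(X, y3).coef_.shape)  # (3, 5): one row (and one table) per class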
Hope this makes it clear.

Related

W2VTransformer: Only works with one word as input?

The following reproducible script is used to compute the accuracy of a Word2Vec classifier built with the W2VTransformer wrapper in gensim:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from gensim.sklearn_api import W2VTransformer
from gensim.utils import simple_preprocess
# Load synthetic data
data = pd.read_csv('https://pastebin.com/raw/EPCmabvN')
data = data.head(10)
# Set random seed
np.random.seed(0)
# Tokenize text
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
# Get labels
y_train = data.label
train_input = [x[0] for x in X_train]
# Train W2V Model
model = W2VTransformer(size=10, min_count=1)
model.fit(X_train)
clf = LogisticRegression(penalty='l2', C=0.1)
clf.fit(model.transform(train_input), y_train)
text_w2v = Pipeline(
[('features', model),
('classifier', clf)])
score = text_w2v.score(train_input, y_train)
score
0.80000000000000004
The problem with this script is that it only works when train_input = [x[0] for x in X_train], which is essentially always the first word only.
Once changed to train_input = X_train (or with train_input simply substituted by X_train), the script returns:
ValueError: cannot reshape array of size 10 into shape (10,10)
How can I solve this issue, i.e. how can the classifier work with more than one word of input?
Edit:
Apparently, the W2V wrapper can't work with variable-length training input, unlike D2V. Here is a working D2V version:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess, lemmatize
from gensim.sklearn_api import D2VTransformer
data = pd.read_csv('https://pastebin.com/raw/bSGWiBfs')
np.random.seed(0)
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
y_train = data.label
model = D2VTransformer(dm=1, size=50, min_count=2, iter=10, seed=0)
model.fit(X_train)
clf = LogisticRegression(penalty='l2', C=0.1, random_state=0)
clf.fit(model.transform(X_train), y_train)
pipeline = Pipeline([
('vec', model),
('clf', clf)
])
y_pred = pipeline.predict(X_train)
score = accuracy_score(y_train,y_pred)
print(score)
This is technically not an answer, but it cannot be written in comments, so here it is. There are multiple issues here:
The LogisticRegression class (and most other scikit-learn models) works with 2-d data of shape (n_samples, n_features).
That means it needs a collection of 1-d arrays (one per row (sample), in which the elements of the array are the feature values).
In your data, a single word will be a 1-d array, which means that a single sentence (sample) will be a 2-d array, and the complete data (a collection of sentences) will be a collection of 2-d arrays. Even then, since each sentence can have a different number of words, it cannot be combined into a single 3-d array.
Secondly, the W2VTransformer in gensim looks like a scikit-learn compatible class, but it's not. It tries to follow the "scikit-learn API conventions" for defining the methods fit(), fit_transform() and transform(), but they are not compatible with a scikit-learn Pipeline.
You can see that the input parameter requirements of fit() and fit_transform() are different:
fit():
X (iterable of iterables of str) – The input corpus.
X can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from
disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec
module for such examples.
fit_transform():
X (numpy array of shape [n_samples, n_features]) – Training set.
If you want to use scikit-learn, then you will need the 2-d shape. You will need to "somehow merge" the word vectors of a single sentence into a 1-d array for that sentence, i.e. form a kind of sentence vector, for example by (see the sketch after this list):
- summing the individual word vectors
- averaging the individual word vectors
- weighted averaging of the individual word vectors based on frequency, tf-idf, etc.
- using other techniques like sent2vec, paragraph2vec, doc2vec, etc.
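As an illustration of the averaging option, here is a minimal, hedged sketch that reuses the fitted W2VTransformer (model) and the tokenized X_train from the question; the helper sentence_vectors is hypothetical and assumes the wrapper exposes gensim_model and size:
import numpy as np

def sentence_vectors(w2v_model, tokenized_sentences):
    # Average the per-word vectors returned by W2VTransformer.transform()
    # into one fixed-length vector per sentence; unknown words are skipped.
    vecs = []
    for sentence in tokenized_sentences:
        known = [w for w in sentence if w in w2v_model.gensim_model.wv]
        if known:
            vecs.append(np.mean(w2v_model.transform(known), axis=0))
        else:
            vecs.append(np.zeros(w2v_model.size))
    return np.vstack(vecs)

X_train_vec = sentence_vectors(model, X_train)  # 2-d array of shape (n_samples, size)
clf.fit(X_train_vec, y_train)                   # the classifier now sees all words, not just the first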
Note: I noticed just now that you were already doing this based on D2VTransformer. That should be the correct approach here if you want to use sklearn.
The issue in that question was this line (since that question is now deleted):
X_train = vectorizer.fit_transform(X_train)
Here, you overwrote your original X_train (a list of lists of words) with already calculated word vectors, hence that error.
Alternatively, you can use other tools/libraries (Keras, TensorFlow) which allow sequential input of variable size. For example, LSTMs can be configured to take variable-length input with an end token marking the end of a sentence (a sample).
Update:
In the solution given above, you can replace the lines:
model = D2VTransformer(dm=1, size=50, min_count=2, iter=10, seed=0)
model.fit(X_train)
clf = LogisticRegression(penalty='l2', C=0.1, random_state=0)
clf.fit(model.transform(X_train), y_train)
pipeline = Pipeline([
('vec', model),
('clf', clf)
])
y_pred = pipeline.predict(X_train)
with
pipeline = Pipeline([
('vec', model),
('clf', clf)
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_train)
No need to fit and transform separately, since pipeline.fit() will automatically do that.

Generating a Learning Curve for Logistic Regression

I have a dataset of 260 microscopic images. I want to generate a learning curve for the logistic regression algorithm, but I am getting the error "'module' object is not iterable". Perhaps I don't understand something basic, as I'm a beginner who is freshly learning Python.
from sklearn.cross_validation import train_test_split
from imutils import paths
from scipy import misc
import numpy as np
import argparse
import imutils
import cv2
import os
from matplotlib import pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.model_selection import cross_val_score
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(50, 80, 110)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.
        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` is used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.
    n_jobs : int or None, optional (default=None)
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.
    train_sizes : array-like, shape (n_ticks,), dtype float or int
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as a
        fraction of the maximum size of the training set (that is determined
        by the selected validation method), i.e. it has to be within (0, 1].
        Otherwise it is interpreted as absolute sizes of the training sets.
        Note that for classification the number of samples usually have to
        be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
#training with logistic regression
clfLR = LogisticRegression(random_state=0, solver='lbfgs')
clfLR.fit(trainFeat,trainLabels)
acc = clfLR.score(testFeat, testLabels)
print("accuracy of Logistic regression ",acc)
I am facing this problem only when I want to plot the curve. The rest of the code works fine.
#plotting the curve
estimator = LogisticRegression()
train_sizes, train_scores, valid_scores = plot_learning_curve(
    estimator, 'logistic learning curve ', trainFeat, trainLabels,
    cv=5, n_jobs=4, train_sizes=[50, 80, 110])
print(train_sizes)
plt.show()
[Screenshots of the error and of the learning curve omitted]
Try running the code in an online Jupyter IDE. It plots automatically if you add a "%matplotlib" line to the import section.
If you want to keep working in this IDE, share your error message so that maybe I can help you. You are probably missing one of the imports, or it might be a Python 2/3 problem.
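For reference, here is a minimal, self-contained sketch of learning_curve on synthetic data (make_classification is only a stand-in for the image features in the question):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the 260 image feature vectors
X, y = make_classification(n_samples=260, n_features=20, random_state=0)
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(solver='lbfgs', max_iter=1000), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label="Training score")
plt.plot(train_sizes, test_scores.mean(axis=1), 'o-', label="Cross-validation score")
plt.xlabel("Training examples")
plt.ylabel("Score")
plt.legend(loc="best")
plt.show()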

Why do people use second column of predict_proba() (i.e predict_proba(X)[:,1]) for roc_auc_score()?

I've seen many people use the values of the second column of predict_proba(X) (i.e. predict_proba(X)[:,1]) in order to compute the roc_auc_score() of a binary classification model, as follows:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
ypred = rfc.predict_proba(X_test)[:,1]
roc_auc_score(y_test, ypred)
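For context, the columns of predict_proba() follow the order of clf.classes_, so in a binary problem the second column is the probability of the positive class, which is the score roc_auc_score() expects. A minimal sketch on toy data (make_classification is just a stand-in):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rfc = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(rfc.classes_)                          # [0 1]: column order of predict_proba
proba_pos = rfc.predict_proba(X_test)[:, 1]  # probability of classes_[1], the positive class
print(roc_auc_score(y_test, proba_pos))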

Interpretation of coef_ attribute of LogisticRegression class

I have a (possibly silly) question regarding the coef_ attribute of sklearn.linear_model.LogisticRegression.
I fit a LogisticRegression model to the Iris dataset using only two features (petal width and length). To obtain the weights of each feature I use the coef_ attribute, and it returns a 3x2 array. I understand that I get 3 rows because there are 3 classes and the one-vs-rest rule is used.
However, I cannot understand why it includes only w_1 and w_2 (or theta_1 and theta_2, depending on which notation you use), the coefficients of features 1 and 2, but is missing w_0 (or theta_0), the intercept.
Code:
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
iris = datasets.load_iris()
X = iris.data[:, [2,3]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
lr = LogisticRegression(C = 1000, random_state=0)
lr.fit(X_train_std, y_train)
lr.coef_
The attribute coef_ gives only the coefficients of the features in the decision function.
You can get the intercept by using:
lr.intercept_
You might notice the strange-looking trailing underscore at the end
of coef_ and intercept_. Scikit-learn always stores anything
that is derived from the training data in attributes that end with a
trailing underscore. That is to separate them from the parameters that
are set by the user.
lr.coef_
coef_ : ndarray of shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary. In particular, when multi_class='multinomial', coef_ corresponds to outcome 1 (True) and -coef_ corresponds to outcome 0 (False).
lr.intercept_
intercept_ : ndarray of shape (1,) or (n_classes,)
Intercept (a.k.a. bias) added to the decision function.
If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the given problem is binary. In particular, when multi_class='multinomial', intercept_ corresponds to outcome 1 (True) and -intercept_ corresponds to outcome 0 (False).
SOURCE: Scikit learn Official Documentation
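As a quick check, here is a minimal sketch continuing the fitted lr from the code above (the shapes are for the two standardized iris features and three classes):
print(lr.coef_.shape)        # (3, 2): one row per class (one-vs-rest), one column per feature
print(lr.intercept_.shape)   # (3,): one intercept (w_0 / theta_0) per class
# For class k, the decision function is lr.intercept_[k] + X_train_std @ lr.coef_[k]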

sample_weight parameter shape error in scikit-learn GridSearchCV

Passing the sample_weight parameter to GridSearchCV raises an error due to an incorrect shape. My suspicion is that cross-validation is not capable of splitting sample_weight in step with the dataset.
First part: Using sample_weight as a model parameter works beautifully
Let's consider a simple example, first without GridSearch:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
dataURL = 'https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sinusoidal_data.csv'
x = pd.read_csv(dataURL, usecols=["x"]).x
y = pd.read_csv(dataURL, usecols=["y"]).y
occurrences = pd.read_csv(dataURL, usecols=["Occurrences"]).Occurrences
my_sample_weights = (1 - occurrences/10000)**3
my_sample_weights contains the importance that I assign to each observation in x, y, as the following picture shows. The points of the sinusoidal curve get higher weights than those forming the background noise.
plt.scatter(x, y, c=my_sample_weights>0.9, cmap="cool")
Let's train a neural network, first without using the information contained in my_sample_weights:
def make_model(number_of_hidden_neurons=1):
    model = Sequential()
    model.add(Dense(number_of_hidden_neurons, input_shape=(1,), activation='tanh'))
    model.add(Dense(1, activation='linear'))
    model.compile(optimizer='sgd', loss='mse')
    return model
net_Not_using_sample_weight = make_model(number_of_hidden_neurons=6)
net_Not_using_sample_weight.fit(x,y, epochs=1000)
plt.scatter(x, y, )
plt.scatter(x, net_Not_using_sample_weight.predict(x), c="green")
As the following picture shows, the neural network tries to fit the shape of the sinusoidal but the background noise prevents it from a good fit.
Now, using the information in my_sample_weights, the quality of the prediction is much better.
Second part: Using sample_weight as a GridSearchCV parameter raises an error
my_Regressor = KerasRegressor(make_model)
validator = GridSearchCV(my_Regressor,
                         param_grid={'number_of_hidden_neurons': range(4, 5),
                                     'epochs': [500],
                                     },
                         fit_params={'sample_weight': [my_sample_weights]},
                         n_jobs=1,
                         )
validator.fit(x, y)
Trying to pass the sample_weights as a parameter gives the following error:
...
ValueError: Found a sample_weight array with shape (1000,) for an input with shape (666, 1). sample_weight cannot be broadcast.
It seems that the sample_weight vector has not been split in a similar manner to the input array.
For what it's worth:
import sklearn
print(sklearn.__version__)
0.18.1
import keras
print(keras.__version__)
2.0.5
The problem is that, by default, GridSearchCV uses 3-fold cross-validation unless explicitly stated otherwise. This means that 2/3 of the data points are used as training data and 1/3 for cross-validation, which fits the error message: the shape (1000,) of the fit_params entry doesn't match the number of examples actually used for training (666). Adjust the size and the code will run.
my_sample_weights = np.random.uniform(size=666)
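To see where the 666 comes from, here is a quick sketch of the default 3-fold split sizes on 1000 samples (KFold is assumed here, as GridSearchCV falls back to it for regressors in this scikit-learn version):
import numpy as np
from sklearn.model_selection import KFold

X_dummy = np.zeros((1000, 1))
train_idx, test_idx = next(KFold(n_splits=3).split(X_dummy))
print(len(train_idx), len(test_idx))  # 666 334: only 666 samples reach fit() in each split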
We developed PipeGraph, an extension to the Scikit-Learn Pipeline that allows you to get intermediate data, build graph-like workflows, and, in particular, solve this problem (see the examples in the gallery at http://mcasl.github.io/PipeGraph ).
