Role of class_weight in loss functions for LinearSVC and LogisticRegression - scikit-learn

I am trying to figure out what exactly the loss function formula is and how I can manually calculate it when class_weight='auto' in the case of svm.SVC, svm.LinearSVC and linear_model.LogisticRegression.
For balanced data, say you have a trained classifier: clf_c. Logistic loss should be (am I correct?):
import numpy as np

def logistic_loss(x, y, w, b, b0):
    '''
    x: n x p data matrix, where n is the number of data points and p the number of features.
    y: n x 1 vector of true labels (-1 or 1).
    w: n x 1 vector of sample weights (a vector of 1./n for balanced data).
    b: p x 1 vector of feature weights.
    b0: intercept.
    '''
    s = y
    if 0 in np.unique(y):
        # labels are (0, 1); map them to (-1, 1)
        s = 2. * y - 1
    l = np.dot(w, np.log(1 + np.exp(-s * (np.dot(x, np.squeeze(b)) + b0))))
    return l
I realized that LogisticRegression has predict_log_proba(), which gives you exactly that when the data is balanced:
b, b0 = clf_c.coef_, clf_c.intercept_
w = np.ones(len(y)) / len(y)
-clf_c.predict_log_proba(x)[np.arange(len(x)), np.floor((y + 1) / 2).astype(np.int8)].mean() == logistic_loss(x, y, w, b, b0)
Note, np.floor((y+1)/2).astype(np.int8) simply maps y=(-1,1) to y=(0,1).
But this does not work when data is imbalanced.
What's more, you would expect the classifier (here, LogisticRegression) to perform similarly (in terms of loss function value) when the data is balanced and class_weight=None versus when the data is imbalanced and class_weight='auto'. I need a way to calculate the loss function (without the regularization term) for both scenarios and compare them.
In short, what exactly does class_weight='auto' mean? Does it mean class_weight = {-1 : (y==1).sum()/(y==-1).sum() , 1 : 1.} or rather class_weight = {-1 : 1./(y==-1).sum() , 1 : 1./(y==1).sum()}?
Any help is much appreciated. I tried going through the source code, but I am not a programmer and I got stuck.
Thanks a lot in advance.

class_weight heuristics
I am a bit puzzled by your first proposition for the class_weight='auto' heuristic, as:
class_weight = {-1 : (y == 1).sum() / (y == -1).sum(),
                 1 : 1.}
is the same as your second proposition if we normalize it so that the weights sum to one.
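To make that concrete, here is a tiny check with made-up labels showing that the two dictionaries differ only by a common scale factor:
import numpy as np

# hypothetical imbalanced labels: 10 samples of class -1, 40 of class 1
y = np.array([-1] * 10 + [1] * 40)

w1 = {-1: (y == 1).sum() / (y == -1).sum(), 1: 1.0}        # first proposition
w2 = {-1: 1.0 / (y == -1).sum(), 1: 1.0 / (y == 1).sum()}  # second proposition

# both give the same ratio between the two class weights,
# so after normalization they are the same weighting
print(w1[-1] / w1[1])   # 4.0
print(w2[-1] / w2[1])   # 4.0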
Anyway, to understand what class_weight="auto" does, see this question:
what is the difference between class weight = none and auto in svm scikit learn.
I am copying it here for later comparison:
This means that each class you have (in classes) gets a weight equal
to 1 divided by the number of times that class appears in your data
(y), so classes that appear more often will get lower weights. This is
then further divided by the mean of all the inverse class frequencies.
Note how this is not completely obvious ;).
This heuristic is deprecated and will be removed in 0.18. It will be replaced by another heuristic, class_weight='balanced'.
The 'balanced' heuristic weighs classes proportionally to the inverse of their frequency.
From the docs:
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data:
n_samples / (n_classes * np.bincount(y)).
np.bincount(y) is an array with the element i being the count of class i samples.
Here's a bit of code to compare the two:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight

n_classes = 3
n_samples = 1000

X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=10,
                           n_classes=n_classes, weights=[0.05, 0.4, 0.55])

print("Count of samples per class: ", np.bincount(y))

balanced_weights = n_samples / (n_classes * np.bincount(y))
# Equivalent to the following, using version 0.17+:
# compute_class_weight("balanced", [0, 1, 2], y)

print("Balanced weights: ", balanced_weights)
print("'auto' weights: ", compute_class_weight("auto", [0, 1, 2], y))
Output:
Count of samples per class: [ 57 396 547]
Balanced weights: [ 5.84795322 0.84175084 0.60938452]
'auto' weights: [ 2.40356854 0.3459682 0.25046327]
The loss functions
Now the real question is: how are these weights used to train the classifier?
I don't have a thorough answer here unfortunately.
For SVC and LinearSVC the docstring is pretty clear:
Set the parameter C of class i to class_weight[i]*C for SVC.
So high weights mean less regularization for that class and a higher incentive for the SVM to classify it properly.
I do not know how they work with logistic regression. I'll try to look into it but most of the code is in liblinear or libsvm and I'm not too familiar with those.
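My working assumption (I have not verified this against the liblinear code) is that class_weight simply becomes a per-sample weight that multiplies each term of the log-loss, i.e. a weighted version of the logistic_loss from the question:
import numpy as np

def weighted_logistic_loss(x, y, class_weight, b, b0):
    """Log-loss with per-sample weights built from a class_weight dict.
    Assumes y takes values in {-1, 1}. This is only my reading of how the
    class weights should enter the loss, not taken from the liblinear sources.
    """
    w = np.array([class_weight[label] for label in y], dtype=float)
    w = w / w.sum()  # normalize so the weights sum to one
    margins = y * (np.dot(x, np.squeeze(b)) + b0)
    return np.dot(w, np.log1p(np.exp(-margins)))
Under that assumption, the balanced case with class_weight=None is just the special case where every sample gets weight 1/n.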
However, note that the weights in class_weight do not directly influence methods such as predict_proba. They change its output because the classifier optimizes a different loss function.
Not sure this is clear, so here's a snippet to explain what I mean (you need to run the first snippet for the imports and variable definitions):
from sklearn.linear_model import LogisticRegression  # X, y come from the first snippet

lr = LogisticRegression(class_weight="auto")
lr.fit(X, y)
# We get some probabilities...
print(lr.predict_proba(X))
new_lr = LogisticRegression(class_weight={0: 100, 1: 1, 2: 1})
new_lr.fit(X, y)
# We get different probabilities...
print(new_lr.predict_proba(X))
# Let's cheat a bit and hand-modify our new classifier.
new_lr.intercept_ = lr.intercept_.copy()
new_lr.coef_ = lr.coef_.copy()
# Now we get the SAME probabilities.
np.testing.assert_array_equal(new_lr.predict_proba(X), lr.predict_proba(X))
Hope this helps.

Related

What is the difference between decision function and score_samples in isolation_forest in SKLearn

I have read the documentation of decision_function and score_samples here, but could not figure out what the difference between these two methods is and which one I should use for an outlier detection algorithm.
Any help would be appreciated.
See the documentation for the attribute offset_:
Offset used to define the decision function from the raw scores. We have the relation: decision_function = score_samples - offset_. offset_ is defined as follows. When the contamination parameter is set to “auto”, the offset is equal to -0.5 as the scores of inliers are close to 0 and the scores of outliers are close to -1. When a contamination parameter different than “auto” is provided, the offset is defined in such a way we obtain the expected number of outliers (samples with decision function < 0) in training.
The User Guide references the paper Isolation Forest by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou.
I did not read the paper, but I think you can use either output to detect outliers. The documentation says score_samples is the opposite of decision_function, so I thought they would be inversely related, but both outputs seem to have the exact same relationship with the target. The only difference is that they are on different ranges. In fact, they even have the same variance.
To see this, I fit the model to the breast cancer dataset available in sklearn and visualized the average of the target variable grouped by the deciles of each output. As you can see, they both have the exact same relationship.
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

# Load data
X = load_breast_cancer()['data']
y = load_breast_cancer()['target']

# Fit model
clf = IsolationForest()
clf.fit(X, y)

# Split the outputs into deciles to see their relationship with the target
t = pd.DataFrame({'target': y,
                  'decision_function': clf.decision_function(X),
                  'score_samples': clf.score_samples(X)})
t['bins_decision_function'] = pd.qcut(t['decision_function'], 10)
t['bins_score_samples'] = pd.qcut(t['score_samples'], 10)

# Visualize relationship
plt.plot(t.groupby('bins_decision_function')['target'].mean().values, lw=3, label='Decision Function')
plt.plot(t.groupby('bins_score_samples')['target'].mean().values, ls='--', label='Score Samples')
plt.legend()
plt.show()
Like I said, they even have the same variance:
t[['decision_function','score_samples']].var()
> decision_function 0.003039
> score_samples 0.003039
> dtype: float64
In conclusion, you can use them interchangeably as they both share the same relationship with the target.
As was previously stated in @Ben Reiniger's answer,
decision_function = score_samples - offset_. For further clarification...
If contamination = 'auto', then offset_ is fixed to -0.5.
If contamination is set to something other than 'auto', then
offset_ is no longer fixed.
This can be seen under the fit function in the source code:
def fit(self, X, y=None, sample_weight=None):
    ...
    if self.contamination == "auto":
        # 0.5 plays a special role as described in the original paper.
        # we take the opposite as we consider the opposite of their score.
        self.offset_ = -0.5
        return self

    # else, define offset_ wrt contamination parameter
    self.offset_ = np.percentile(self.score_samples(X),
                                 100. * self.contamination)
Thus, it's important to take note of what contamination is set to, as well as which anomaly scores you are using. score_samples returns what can be thought of as the "raw" scores, as it is unaffected by offset_, whereas decision_function depends on offset_.
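As a quick sanity check of that relation, here is a small sketch reusing the breast cancer data from the first snippet:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

X = load_breast_cancer()['data']

# With a non-'auto' contamination, offset_ comes from the percentile above,
# but the relation decision_function = score_samples - offset_ always holds.
clf = IsolationForest(contamination=0.1).fit(X)
np.testing.assert_allclose(clf.decision_function(X),
                           clf.score_samples(X) - clf.offset_)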

How to deal with imbalanced classes in Keras

I am working on a multi-label image classification problem with Keras, so I use the functions flow_from_dataframe() and fit_generator().
I have about 2000 classes and, as you can guess, they are highly skewed/imbalanced. After searching a bit I came across the arguments class_weight and classes and decided to give them a try. My problem is, I am not sure if I use them correctly. Here is an example:
Let's assume that I have flattened all class occurrences so that I get the following list of (duplicated) labels:
labels = ['classD', 'classA', 'classA', 'classC', 'classD', 'classD']
And this is the function that computes classes and class_weight:
from collections import Counter

def get_classes_weights(l, n):
    counter = Counter(l).most_common(n)
    classes = [cls for cls, ocu in counter]
    majority = max([ocu for cls, ocu in counter])
    weights = {idx: float(majority) / ocu for idx, (cls, ocu) in enumerate(counter)}
    return classes, weights
Let's also assume that I want to consider the top-2 classes only:
classes, class_weight = get_classes_weights(labels, 2)
This gives:
classes: ['classD', 'classA']
and:
class_weight: {0: 1.0, 1: 1.5}
And finally, this is how I use them within the functions:
generator_train.flow_from_dataframe(
    classes=classes,
)

model.fit_generator(
    class_weight=class_weight
)
So my questions are:
Is the above the right way to apply weights, given that I work on a multi-label image classification problem?
Does my validation set need to be balanced, or is it OK if it is taken from the same distribution as the training set (20% and 80% random selection, respectively)?

PySpark: Get Threshold (cutoff) values for each point in ROC curve

I'm starting with PySpark, building binary classification models (logistic regression), and I need to find the optimal threshold (cutoff) point for my models.
I want to use the ROC curve to find this point, but I don't know how to extract the threshold value for each point in the curve. Is there a way to find these values?
Things I've found:
This post shows how to extract the ROC curve, but only the values for the TPR and FPR. It's useful for plotting and for selecting the optimal point, but I can't find the threshold value.
I know I can find the threshold values for each point in the ROC curve using H2O (I've done it before), but I'm working in PySpark.
Here is a post describing how to do it with R... but, again, I need to do it with PySpark.
Other facts
I'm using Apache Spark 2.4.0.
I'm working with Data Frames (I really don't know - yet - how to work with RDDs, but I'm not afraid to learn ;) )
If you specifically need to generate ROC curves for different thresholds, one approach could be to generate a list of threshold values you're interested in and fit/transform on your dataset for each threshold. Or you could manually calculate the ROC curve for each threshold point using the probability field in the response from model.transform(test).
Alternatively, you can use BinaryClassificationMetrics to extract a curve plotting various metrics (F1 score, precision, recall) by threshold.
Unfortunately it appears the PySpark version doesn't implement most of the methods the Scala version does, so you'd need to wrap the class to do it in Python.
For example:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)

    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets
        # considering there may be one probability per datapoint (at most)
        # The Scala version takes a numBins parameter,
        # but it doesn't seem possible to pass this from Python to Java
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2,
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points

    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt

# predictions is the DataFrame returned by your fitted model's transform()
preds = predictions.select('label', 'probability').rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))

# Returns a list of (false positive rate, true positive rate) points
points = CurveMetrics(preds).get_curve('roc')

plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title('ROC curve')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.plot(x_val, y_val)
This produces a plot of the ROC curve.
You can build an F1 score curve by threshold value the same way (e.g. get_curve('fMeasureByThreshold')) if you aren't married to ROC.
One way is to use sklearn.metrics.roc_curve.
First use your fitted model to make predictions:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(trainingData)
predictions = model.transform(testData)
Then collect your scores and labels (see the note at the end):
preds = predictions.select('label','probability')\
.rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))\
.collect()
Now transform preds to work with roc_curve:
from sklearn.metrics import roc_curve
y_score, y_true = zip(*preds)
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label = 1)
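If you then need a single cutoff from this curve, one common choice (just an illustration; the right criterion depends on your costs) is the threshold that maximizes Youden's J statistic, i.e. tpr - fpr:
import numpy as np

# continuing from the fpr, tpr, thresholds computed above
j = tpr - fpr
best = np.argmax(j)
print("best threshold:", thresholds[best], "tpr:", tpr[best], "fpr:", fpr[best])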
Notes:
I am not 100% certain that the probabilities vector will always be ordered such that the positive label will be at index 1. However in a binary classification problem, you'll know right away if your AUC is less than 0.5. In that case, just take 1-p for the probabilities (since the class probabilities sum to 1).

scikit learn: how to check coefficients significance

I tried to fit a logistic regression with scikit-learn on a rather large dataset with ~600 dummy and only a few interval variables (and 300K rows), and the resulting confusion matrix looks suspicious. I wanted to check the significance of the returned coefficients and run an ANOVA, but I cannot find how to access them. Is it possible at all? And what is the best strategy for data that contains lots of dummy variables? Thanks a lot!
Scikit-learn deliberately does not support statistical inference. If you want out-of-the-box coefficient significance tests (and much more), you can use the Logit estimator from statsmodels. This package mimics the interface of glm models in R, so you may find it familiar.
If you still want to stick to scikit-learn's LogisticRegression, you can use an asymptotic approximation to the distribution of the maximum likelihood estimates. Precisely, for a vector of maximum likelihood estimates theta, its variance-covariance matrix can be estimated as inverse(H), where H is the Hessian matrix of the log-likelihood at theta. This is exactly what the function below does:
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

def logit_pvalue(model, x):
    """ Calculate z-scores for scikit-learn LogisticRegression.
    parameters:
        model: fitted sklearn.linear_model.LogisticRegression with intercept and large C
        x: matrix on which the model was fit
    This function uses asymptotics for maximum likelihood estimates.
    """
    p = model.predict_proba(x)
    n = len(p)
    m = len(model.coef_[0]) + 1
    coefs = np.concatenate([model.intercept_, model.coef_[0]])
    x_full = np.matrix(np.insert(np.array(x), 0, 1, axis=1))
    ans = np.zeros((m, m))
    for i in range(n):
        ans = ans + np.dot(np.transpose(x_full[i, :]), x_full[i, :]) * p[i, 1] * p[i, 0]
    vcov = np.linalg.inv(np.matrix(ans))
    se = np.sqrt(np.diag(vcov))
    t = coefs / se
    p = (1 - norm.cdf(abs(t))) * 2
    return p

# test p-values
x = np.arange(10)[:, np.newaxis]
y = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1])
model = LogisticRegression(C=1e30).fit(x, y)
print(logit_pvalue(model, x))

# compare with statsmodels
import statsmodels.api as sm
sm_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sm_model.pvalues)
sm_model.summary()
The outputs of print() are identical, and they happen to be coefficient p-values.
[ 0.11413093 0.08779978]
[ 0.11413093 0.08779979]
sm_model.summary() also prints a nicely formatted HTML summary.

The dimension of dual_coef_ in sklearn.SVC

In SVC() for multi-classification, one-vs-one classifiers are trained, so there are supposed to be n_class * (n_class - 1)/2 classifiers in total. But why does clf.dual_coef_ return only an (n_class - 1) x n_SV array? What does each row represent then?
The dual coefficients of a sklearn.svm.SVC in the multiclass setting are tricky to interpret. There is an explanation in the scikit-learn documentation. The sklearn.svm.SVC uses libsvm for the calculations and adopts the same data structure for the dual coefficients. Another explanation of the organization of these coefficients is in the FAQ. In the case of the coefficients you find in the fitted SVC classifier, interpretation goes as follows:
The support vectors identified by the SVC each belong to a certain class. In the dual coefficients, they are ordered according to the class they belong to.
Given a fitted SVC estimator, e.g.
from sklearn.svm import SVC
svc = SVC()
svc.fit(X, y)
you will find
svc.classes_ # represents the unique classes
svc.n_support_ # represents the number of support vectors per class
The support vectors are organized according to these two variables. Since each support vector is clearly identified with one class, it can appear in at most n_classes - 1 one-vs-one problems, namely the comparisons with every other class. But it is entirely possible that a given support vector will not appear in all of these one-vs-one problems.
Taking a look at
support_indices = np.cumsum(svc.n_support_)
svc.dual_coef_[0:support_indices[0]] # < ---
# weights on support vectors of class 0
# for problems 0v1, 0v2, ..., 0v(n-1)
# so n-1 columns for each of the
# svc.n_support_[0] support vectors
svc.dual_coef_[support_indices[1]:support_indices[2]]
# ^^^
# weights on support vectors of class 1
# for problems 0v1, 1v2, ..., 1v(n-1)
# so n-1 columns for each of the
# svc.n_support_[1] support vectors
...
svc.dual_coef_[support_indices[n_classes - 2]:support_indices[n_classes - 1]]
# ^^^
# weights on support vectors of class n-1
# for problems 0vs(n-1), 1vs(n-1), ..., (n-2)v(n-1)
# so n-1 columns for each of the
# svc.n_support_[-1] support vectors
gives you the weights of the support vectors for the classes 0, 1, ..., n-1 in their respective one-vs-one problems. Comparisons to all other classes except its own are made, resulting in n_classes - 1 columns. The order in which this happens follows the order of the unique classes exposed above. There are as many rows in each group as there are support vectors.
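To make the layout concrete, here is a small sketch on the iris dataset (3 classes); the exact support-vector counts depend on the data and parameters, so the shapes are the point of interest:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
svc = SVC(kernel="linear").fit(X, y)

print(svc.n_support_)        # number of support vectors per class
print(svc.dual_coef_.shape)  # (n_classes - 1, n_SV) = (2, svc.n_support_.sum())

support_indices = np.cumsum(svc.n_support_)
# coefficients of the class-0 support vectors in their two
# one-vs-one problems (0v1 and 0v2)
print(svc.dual_coef_.T[0:support_indices[0]].shape)  # (svc.n_support_[0], 2)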
Possibly what you are looking for are the primal weights, which live in feature space, so that you can inspect their "importance" for classification. This is only possible with a linear kernel. Try this:
from sklearn.svm import SVC
svc = SVC(kernel="linear")
svc.fit(X, y) # X is your data, y your labels
Then take a look at
svc.coef_
This is an array of shape ((n_class * (n_class -1) / 2), n_features) and represents the aforementioned weights.
According to the docs, the weights are ordered as follows (a quick shape check follows the list):
class 0 vs class 1
class 0 vs class 2
...
class 0 vs class n-1
class 1 vs class 2
class 1 vs class 3
...
...
class n-2 vs class n-1
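For instance, continuing the linear-kernel sketch on iris from above, the shape and row order can be checked directly:
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
svc = SVC(kernel="linear").fit(X, y)

# 3 classes -> 3 * 2 / 2 = 3 one-vs-one problems, 4 features on iris,
# with rows ordered 0v1, 0v2, 1v2 as listed above
print(svc.coef_.shape)  # (3, 4)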
