PySpark: Get Threshold (cuttoff) values for each point in ROC curve - apache-spark

I'm starting with PySpark, building binary classification models (logistic regression), and I need to find the optimal threshold (cuttoff) point for my models.
I want to use the ROC curve to find this point, but I don't know how to extract the threshold value for each point in this curve. Is there a way to find this values?
Things I've found:
This post shows how to extract the ROC curve, but only the values for the TPR and FPR. It's useful for plotting and for selecting the optimal point, but I can't find the threshold value.
I know I can find the threshold values for each point in the ROC curve using H2O (I've done it before), but I'm working on Pyspark.
Here is a post describing how to do it with R... but, again, I need to do it with Pyspark
Other facts
I'm using Apache Spark 2.4.0.
I'm working with Data Frames (I really don't know - yet - how to work with RDDs, but I'm not afraid to learn ;) )

If you specifically need to generate ROC curves for different thresholds, one approach could be to generate a list of threshold values you're interested in and fit/transform on your dataset for each threshold. Or you could manually calculate the ROC curve for each threshold point using the probability field in the response from model.transform(test).
Alternatively, you can use BinaryClassificationMetrics to extract a curve plotting various metrics (F1 score, precision, recall) by threshold.
Unfortunately it appears the PySpark version doesn't implement most of the methods the Scala version does, so you'd need to wrap the class to do it in Python.
For example:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
def __init__(self, *args):
super(CurveMetrics, self).__init__(*args)
def _to_list(self, rdd):
points = []
# Note this collect could be inefficient for large datasets
# considering there may be one probability per datapoint (at most)
# The Scala version takes a numBins parameter,
# but it doesn't seem possible to pass this from Python to Java
for row in rdd.collect():
# Results are returned as type scala.Tuple2,
# which doesn't appear to have a py4j mapping
points += [(float(row._1()), float(row._2()))]
return points
def get_curve(self, method):
rdd = getattr(self._java_model, method)().toJavaRDD()
return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt
preds = predictions.select('label','probability').rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))
# Returns as a list (false positive rate, true positive rate)
points = CurveMetrics(preds).get_curve('roc')
plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title(title)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.plot(x_val, y_val)
Results in:
Here's an example of an F1 score curve by threshold value if you aren't married to ROC:

One way is to use sklearn.metrics.roc_curve.
First use your fitted model to make predictions:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(trainingData)
predictions = model.transform(testData)
Then collect your scores and labels1:
preds = predictions.select('label','probability')\
.rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))\
.collect()
Now transform preds to work with roc_curve
from sklearn.metrics import roc_curve
y_score, y_true = zip(*preds)
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label = 1)
Notes:
I am not 100% certain that the probabilities vector will always be ordered such that the positive label will be at index 1. However in a binary classification problem, you'll know right away if your AUC is less than 0.5. In that case, just take 1-p for the probabilities (since the class probabilities sum to 1).

Related

Sklearn TruncatedSVD not showing explained variance ration in descending order, or first number means something else?

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
digits = datasets.load_digits()
X = digits.data
X = X - X.mean() # centering the data
#### svd
svd = TruncatedSVD(n_components=5)
svd.fit(X)
print(svd.explained_variance_ration)
#### PCA
pca = PCA(n_components=5)
pca.fit(X)
print(pca.explained_variance_ratio_)
svd output is:
array([0.02049911, 0.1489056 , 0.13534811, 0.11738598, 0.08382797])
pca output is:
array([0.14890594, 0.13618771, 0.11794594, 0.08409979, 0.05782415])
is there a bug in the TruncatedSVD implementation? or why is the first explained variance (0.02...) behaving like this? or what is the meaning
Summary:
That is because TruncatedSVD and PCA use different SVD functions!.
Note: Your case is due to Reason 2 below, yet I included another reason for future readers.
Details:
Reason 1: The solver set by user in each algorithm, is different:
PCA internally uses scipy.linalg.svd which sorts singular values, hence the explained_variance_ratio_ is sorted.
Part of Scikit Implementation of PCA:
# Center data
U, S, Vt = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, Vt = svd_flip(U, Vt)
components_ = Vt
# Get variance explained by singular values
explained_variance_ = (S ** 2) / (n_samples - 1)
total_var = explained_variance_.sum()
explained_variance_ratio_ = explained_variance_ / total_var
Screenshot from the above-mentioned scipy.linalg.svd link:
On the other hand, TruncatedSVD uses scipy.sparse.linalg.svds which relies on the ARPACK solver for decomposition.
Screenshot from the above-mentioned scipy.sparse.linalg.svds link:
Reason 2: The TruncatedSVD operates differently compared to PCA:
In your case you chose randomized as a solver (which is set by default) in both algorithms, yet you obtained different results with regards to the order of the variance.
That is because in PCA, the variance is obtained from the actual singular values (called Sigma or S in Scikit-Learn implementation), which are already sorted:
On the other hand, the variance in TruncatedSVD is obtained from X_transformed which results from multiplying the data matrix by the components. The latter does not necessarily preserve order because data are not centered, nor is it the purpose of TruncatedSVD which it is used in first place for sparse matrices:
Now if you center your data, you will get them sorted (note that you did not center data properly, because centering requires dividing by standard deviation):
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
digits = datasets.load_digits()
X = digits.data
sc = StandardScaler()
X = sc.fit_transform(X)
### SVD
svd = TruncatedSVD(n_components=5, algorithm='randomized', random_state=2021)
svd.fit(X)
print(svd.explained_variance_ratio_)
Output
[0.12033916 0.09561054 0.08444415 0.06498406 0.04860093]
Important: Further read.

What is the difference between decision function and score_samples in isolation_forest in SKLearn

I have read the documentation of the decision function and score_samples here, but could not figure out what is the difference between these two methods and which one should I use for an outlier detection algorithm.
Any help would be appreciated.
See the documentation for the attribute offset_:
Offset used to define the decision function from the raw scores. We have the relation: decision_function = score_samples - offset_. offset_ is defined as follows. When the contamination parameter is set to “auto”, the offset is equal to -0.5 as the scores of inliers are close to 0 and the scores of outliers are close to -1. When a contamination parameter different than “auto” is provided, the offset is defined in such a way we obtain the expected number of outliers (samples with decision function < 0) in training.
The User Guide references the paper Isolation forest written by Fei Tony, Kai Ming and Zhi-Hua.
I did not read the paper, but I think you can use either output to detect outliers. The documentation says score_samples is the opposite of decision_function, so I thought they would be inversely related, but both outputs seem to have the exact same relationship with the target. The only difference is that they are on different ranges. In fact, they even have the same variance.
To see this, I fit the model to the breast cancer dataset available in sklearn and visualized the average of the target variable grouped by the deciles of each output. As you can see, they both have the exact same relationship.
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest
# Load data
X = load_breast_cancer()['data']
y = load_breast_cancer()['target']
# Fit model
clf = IsolationForest()
clf.fit(X, y)
# Split the outputs into deciles to see their relationship with target
t = pd.DataFrame({'target':y,
'decision_function':clf.decision_function(X),
'score_samples':clf.score_samples(X)})
t['bins_decision_function'] = pd.qcut(t['decision_function'], 10)
t['bins_score_samples'] = pd.qcut(t['score_samples'], 10)
# Visualize relationship
plt.plot(t.groupby('bins_decision_function')['target'].mean().values, lw=3, label='Decision Function')
plt.plot(t.groupby('bins_score_samples')['target'].mean().values, ls='--', label='Score Samples')
plt.legend()
plt.show()
Like I said, they even have the same variance:
t[['decision_function','score_samples']].var()
> decision_function 0.003039
> score_samples 0.003039
> dtype: float64
In conclusion, you can use them interchangeably as they both share the same relationship with the target.
As was previously stated in #Ben Reiniger's answer,
decision_function = score_samples - offset_. For further clarification...
If contamination = 'auto', then offset_ is fixed to 0.5
If contamination is set to something other than 'auto', then
offset is no longer fixed.
This can be seen under the fit function in the source code:
def fit(self, X, y=None, sample_weight=None):
...
if self.contamination == "auto":
# 0.5 plays a special role as described in the original paper.
# we take the opposite as we consider the opposite of their score.
self.offset_ = -0.5
return self
# else, define offset_ wrt contamination parameter
self.offset_ = np.percentile(self.score_samples(X),
100. * self.contamination)
Thus, it's important to take note of what contamination is set to, as well as which anomaly scores you are using. score_samples returns what can be thought of as the "raw" scores, as it is unaffected by offset_, whereas decision_function is dependent on offset_

How do I use Theanets LSTM RNN's on my time series data?

I have a simple dataframe consisting of one column. In that column are 10320 observations (numerical). I'm simulating Time-Series data by inserting the data into a plot with a window of 200 observations each. Here is the code for plotting.
import matplotlib.pyplot as plt
from IPython import display
fig_size = plt.rcParams["figure.figsize"]
import time
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
fig, axes = plt.subplots(1,1, figsize=(19,5))
df = dframe.set_index(arange(0,len(dframe)))
std = dframe[0].std() * 6
window = 200
iterations = int(len(dframe)/window)
i = 0
dframe = dframe.set_index(arange(0,len(dframe)))
while i< iterations:
frm = window*i
if i == iterations:
to = len(dframe)
else:
to = frm+window
df = dframe[frm : to]
if len(df) > 100:
df = df.set_index(arange(0,len(df)))
plt.gca().cla()
plt.plot(df.index, df[0])
plt.axhline(y=std, xmin=0, xmax=len(df[0]),c='gray',linestyle='--',lw = 2, hold=None)
plt.axhline(y=-std , xmin=0, xmax=len(df[0]),c='gray',linestyle='--', lw = 2, hold=None)
plt.ylim(min(dframe[0])- 0.5 , max(dframe[0]) )
plt.xlim(-50,window+50)
display.clear_output(wait=True)
display.display(plt.gcf())
canvas = FigureCanvas(fig)
canvas.print_figure('fig.png', dpi=72, bbox_inches='tight')
i += 1
plt.close()
This simulates a flow of real-time data and visualizes it. What I want is to apply theanets RNN LSTM to the data to detect anomalies unsupervised. Because I am doing it unsupervised I don't think that I need to split my data into training and test sets. I haven't found much of anything that makes sense to me so far and have been googling for about 2 hours. Just hoping that you guys may be able to help. I want to put the prediction output of the RNN on the graph as well and define a threshold that, if the error is too large, the values will be identified as anomalous. If you need more information please comment and let me know. Thank you!
READING
Like neurons, LSTM networks are build of interconnected LSTM Blocks whose training is done via BackPropogation Through Time.
Classical anomaly detection using time series required prediction of time series output in future (at one or more points) and finding error on these points with true values. Prediction Error above a threshold will reflect and amomly
SOLUTION
Having said this
You've to train network so you need training sets and test sets both
Use N inputs to predict M outputs (decide upon N and M with experimentation - values for which training error is low)
Scroll a window of (N+M) elements in input data and use this data array of (N+M) items also termed as frame to train or test network.
Typically we use 90% of starting series for training and 10% for testing.
This scheme will fail as if training is not proper there will be false prediction errors which are not-anomaly. So make sure to provide enough training, and most important shuffle training frames and consider all variations.

Role of class_weight in loss functions for linearSVC and LogisticRegression

I am trying to figure out what exactly the loss function formula is and how I can manually calculate it when class_weight='auto' in case of svm.svc, svm.linearSVC and linear_model.LogisticRegression.
For balanced data, say you have a trained classifier: clf_c. Logistic loss should be (am I correct?):
def logistic_loss(x,y,w,b,b0):
'''
x: nxp data matrix where n is number of data points and p is number of features.
y: nx1 vector of true labels (-1 or 1).
w: nx1 vector of weights (vector of 1./n for balanced data).
b: px1 vector of feature weights.
b0: intercept.
'''
s = y
if 0 in np.unique(y):
print 'yes'
s = 2. * y - 1
l = np.dot(w, np.log(1 + np.exp(-s * (np.dot(x, np.squeeze(b)) + b0))))
return l
I realized that logisticRegression has predict_log_proba() which gives you exactly that when data is balanced:
b, b0 = clf_c.coef_, clf_c.intercept_
w = np.ones(len(y))/len(y)
-(clf_c.predict_log_proba(x[xrange(len(x)), np.floor((y+1)/2).astype(np.int8)]).mean() == logistic_loss(x,y,w,b,b0)
Note, np.floor((y+1)/2).astype(np.int8) simply maps y=(-1,1) to y=(0,1).
But this does not work when data is imbalanced.
What's more, you expect the classifier (here, logisticRegression) to perform similarly (in terms of loss function value) when data in balance and class_weight=None versus when data is imbalanced and class_weight='auto'. I need to have a way to calculate the loss function (without the regularization term) for both scenarios and compare them.
In short, what does class_weight = 'auto' exactly mean? Does it mean class_weight = {-1 : (y==1).sum()/(y==-1).sum() , 1 : 1.} or rather class_weight = {-1 : 1./(y==-1).sum() , 1 : 1./(y==1).sum()}?
Any help is much much appreciated. I tried going through the source code, but I am not a programmer and I am stuck.
Thanks a lot in advance.
class_weight heuristics
I am a bit puzzled by your first proposition for the class_weight='auto' heuristic, as:
class_weight = {-1 : (y == 1).sum() / (y == -1).sum(),
1 : 1.}
is the same as your second proposition if we normalize it so that the weights sum to one.
Anyway to understand what class_weight="auto" does, see this question:
what is the difference between class weight = none and auto in svm scikit learn.
I am copying it here for later comparison:
This means that each class you have (in classes) gets a weight equal
to 1 divided by the number of times that class appears in your data
(y), so classes that appear more often will get lower weights. This is
then further divided by the mean of all the inverse class frequencies.
Note how this is not completely obvious ;).
This heuristic is deprecated and will be removed in 0.18. It will be replaced by another heuristic, class_weight='balanced'.
The 'balanced' heuristic weighs classes proportionally to the inverse of their frequency.
From the docs:
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data:
n_samples / (n_classes * np.bincount(y)).
np.bincount(y) is an array with the element i being the count of class i samples.
Here's a bit of code to compare the two:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight
n_classes = 3
n_samples = 1000
X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=10,
n_classes=n_classes, weights=[0.05, 0.4, 0.55])
print("Count of samples per class: ", np.bincount(y))
balanced_weights = n_samples /(n_classes * np.bincount(y))
# Equivalent to the following, using version 0.17+:
# compute_class_weight("balanced", [0, 1, 2], y)
print("Balanced weights: ", balanced_weights)
print("'auto' weights: ", compute_class_weight("auto", [0, 1, 2], y))
Output:
Count of samples per class: [ 57 396 547]
Balanced weights: [ 5.84795322 0.84175084 0.60938452]
'auto' weights: [ 2.40356854 0.3459682 0.25046327]
The loss functions
Now the real question is: how are these weights used to train the classifier?
I don't have a thorough answer here unfortunately.
For SVC and linearSVC the docstring is pretty clear
Set the parameter C of class i to class_weight[i]*C for SVC.
So high weights mean less regularization for the class and a higher incentive for the svm to classify it properly.
I do not know how they work with logistic regression. I'll try to look into it but most of the code is in liblinear or libsvm and I'm not too familiar with those.
However, note that the weights in class_weight do not influence directly methods such as predict_proba. They change its ouput because the classifier optimizes a different loss function.
Not sure this is clear, so here's a snippet to explain what I mean (you need to run the first one for the imports and variable definition):
lr = LogisticRegression(class_weight="auto")
lr.fit(X, y)
# We get some probabilities...
print(lr.predict_proba(X))
new_lr = LogisticRegression(class_weight={0: 100, 1: 1, 2: 1})
new_lr.fit(X, y)
# We get different probabilities...
print(new_lr.predict_proba(X))
# Let's cheat a bit and hand-modify our new classifier.
new_lr.intercept_ = lr.intercept_.copy()
new_lr.coef_ = lr.coef_.copy()
# Now we get the SAME probabilities.
np.testing.assert_array_equal(new_lr.predict_proba(X), lr.predict_proba(X))
Hope this helps.

How to find key trees/features from a trained random forest?

I am using Scikit-Learn Random Forest Classifier and trying to extract the meaningful trees/features in order to better understand the prediction results.
I found this method which seems relevant in the documention (http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.get_params), but couldn't find an example how to use it.
I am also hoping to visualize those trees if possible, any relevant code would be great.
Thank you!
I think you're looking for Forest.feature_importances_. This allows you to see what the relative importance of each input feature is to your final model. Here's a simple example.
import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier
#Lets set up a training dataset. We'll make 100 entries, each with 19 features and
#each row classified as either 0 and 1. We'll control the first 3 features to artificially
#set the first 3 features of rows classified as "1" to a set value, so that we know these are the "important" features. If we do it right, the model should point out these three as important.
#The rest of the features will just be noise.
train_data = [] ##must be all floats.
for x in range(100):
line = []
if random.random()>0.5:
line.append(1.0)
#Let's add 3 features that we know indicate a row classified as "1".
line.append(.77)
line.append(.33)
line.append(.55)
for x in range(16):#fill in the rest with noise
line.append(random.random())
else:
#this is a "0" row, so fill it with noise.
line.append(0.0)
for x in range(19):
line.append(random.random())
train_data.append(line)
train_data = np.array(train_data)
# Create the random forest object which will include all the parameters
# for the fit. Make sure to set compute_importances=True
Forest = RandomForestClassifier(n_estimators = 100, compute_importances=True)
# Fit the training data to the training output and create the decision
# trees. This tells the model that the first column in our data is the classification,
# and the rest of the columns are the features.
Forest = Forest.fit(train_data[0::,1::],train_data[0::,0])
#now you can see the importance of each feature in Forest.feature_importances_
# these values will all add up to one. Let's call the "important" ones the ones that are above average.
important_features = []
for x,i in enumerate(Forest.feature_importances_):
if i>np.average(Forest.feature_importances_):
important_features.append(str(x))
print 'Most important features:',', '.join(important_features)
#we see that the model correctly detected that the first three features are the most important, just as we expected!
To get the relative feature importances, read the relevant section of the documentation along with the code of the linked examples in that same section.
The trees themselves are stored in the estimators_ attribute of the random forest instance (only after the call to the fit method). Now to extract a "key tree" one would first require you to define what it is and what you are expecting to do with it.
You could rank the individual trees by computing there score on held out test set but I don't know what expect to get out of that.
Do you want to prune the forest to make it faster to predict by reducing the number of trees without decreasing the aggregate forest accuracy?
Here is how I visualize the tree:
First make the model after you have done all of the preprocessing, splitting, etc:
# max number of trees = 100
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
Make predictions:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
Then make the plot of importances. The variable dataset is the name of the original dataframe.
# get importances from RF
importances = classifier.feature_importances_
# then sort them descending
indices = np.argsort(importances)
# get the features from the original data set
features = dataset.columns[0:26]
# plot them with a horizontal bar chart
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
This yields a plot as below:

Resources