How to get the threshold from a specific precision and recall - scikit-learn

I'm trying to get the threshold for a specific precision and recall. Let's say I want the threshold at a precision of 60% and a recall of 40%. Is there any straightforward way to do it using the sklearn package?
precision, recall, threshold = precision_recall_curve(y_val, y_e)
df_pr = pd.DataFrame()
df_pr['precision'] = precision
df_pr['recall'] = recall
df_pr['threshold'] = list(threshold) + [1]
   precision    recall  threshold
0   0.247543  1.000000   0.059483
1   0.247486  0.999692   0.059489
2   0.247504  0.999692   0.059512
3   0.247523  0.999692   0.059542

Provided that I've properly understood your question, the point to highlight is that precision and recall are not coupled the way you seem to imply: a given threshold fixes both, but you generally cannot dial in an arbitrary (precision, recall) pair. Here's a toy example:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=7)
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
y_scores = lr.predict_proba(X_test)
precision, recall, threshold = precision_recall_curve(y_test, y_scores[:, 1])
plt.plot(threshold, precision[:-1], 'b--', label='Precision')
plt.plot(threshold, recall[:-1], 'r--', label='Recall')
plt.xlabel('Threshold')
plt.legend(loc='lower left')
plt.ylim([0, 1])
plt.show()
That said, the problem becomes something you can easily solve with either numpy or pandas, depending on your setting. For instance, here's a toy function returning precision, recall and threshold at the first index where the condition is attained.
def prt(arr, value):
    # first index where `arr` exactly equals `value`
    # (precision, recall and threshold are the arrays computed above)
    array = np.asarray(arr)
    idx = np.where(array[:-1] == value)[0][0]
    return precision[idx], recall[idx], threshold[idx]
prt(precision, 0.6)  # I checked ex-ante that precision=0.6 is attained exactly; otherwise you'll need something custom (see the sketch below)
(0.6, 0.9622641509433962, 0.052229434776723364)
Otherwise, to resemble your setting with a pandas DataFrame:
df = pd.DataFrame()
df['precision'] = precision[:-1]
df['recall'] = recall[:-1]
df['threshold'] = threshold
df[df.loc[:, 'precision'] == 0.6]
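Since floats rarely compare equal, a tolerance-based lookup is often safer than ==. Here's a minimal sketch under that assumption; the helper name, the np.isclose usage and the atol value are my additions, not part of the original answer:
def prt_closest(target, atol=1e-3):
    # indices where precision is within atol of the target
    idx = np.where(np.isclose(precision[:-1], target, atol=atol))[0]
    if idx.size == 0:
        # fall back to the single closest precision
        idx = [np.argmin(np.abs(precision[:-1] - target))]
    i = idx[0]
    return precision[i], recall[i], threshold[i]
prt_closest(0.6)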
I would also suggest the posts sklearn precision_recall_curve and threshold, which explains how .precision_recall_curve() works under the hood, and Why does precision_recall_curve() return different values than confusion matrix?, which might be related.

Related

Discounted Cumulative Gain dcg_score sklearn

from sklearn.metrics import ndcg_score, dcg_score
import numpy as np
actual = [3, 2, 0, 0, 1]
ideal = sorted(actual, reverse=True)
# list to np array
actualarr = np.asarray([actual])
idealarr = np.asarray([ideal])
print("actual score as array", actualarr)
print("ideal score as array", idealarr)
# Discounted Cumulative Gain
dcg = dcg_score(idealarr, actualarr)
print("DCG: ", dcg)
I don't understand why dcg_score takes y_score as a parameter. When I work out DCG longhand (sum of relevance/log2(i+1)) I get the same answer, ~4.6, but I can achieve this with just the true scores [3,2,0,0,1], so why does the function also require the ideal score [3,2,1,0,0]?
As I understand it, sklearn.metrics.dcg_score computes its sum by taking values from y_true as if they were reordered according to y_score.
As explained inside the code: "Sum the true scores ranked in the order induced by the predicted scores"
This means the metric is computed on the induced ranking, using the true relevance values.
A small example:
import numpy as np
from sklearn.metrics import dcg_score
def naive_dcg(y_score):
    # exponential-gain DCG computed directly on the scores themselves
    score = 0
    for i, n in enumerate(y_score[0]):
        num = 2**n - 1
        den = np.log2(i + 1 + 1)  # ranks start at 1, so the discount is log2(rank + 1)
        score += num / den
    return score
y_true = [[1,0]]
y_score = [[0,1]]
print(f'sklearn: {dcg_score(y_true,y_score):.2}, naive: {naive_dcg(y_score):.2}')
y_score = [[0.1,0.2]]
print(f'sklearn: {dcg_score(y_true,y_score):.2}, naive: {naive_dcg(y_score):.2}')
outputs:
sklearn: 0.63, naive: 0.63
sklearn: 0.63, naive: 0.17
which shows that the naive version produces a different value for the same ranking order, because it plugs the raw scores into the gain instead of the true relevances.
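To make the "induced ranking" explicit, here is a sketch that reorders y_true by y_score and applies sklearn's default linear gain (the helper name is mine; ties are ignored for simplicity):
def reordered_dcg(y_true, y_score):
    # rank items by descending predicted score, then sum the TRUE
    # relevances discounted by log2(rank + 1), with ranks starting at 1
    order = np.argsort(y_score[0])[::-1]
    ranked_true = np.asarray(y_true[0])[order]
    return sum(rel / np.log2(rank + 2) for rank, rel in enumerate(ranked_true))
print(reordered_dcg(y_true, [[0, 1]]))      # 0.63, same as dcg_score
print(reordered_dcg(y_true, [[0.1, 0.2]]))  # 0.63 again: same ranking, same DCG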

probability difference between categorical target and one-hot encoding target using OneVsRestClassifier

A bit confused about the difference in probabilities between a categorical target and a one-hot encoded target from OneVsRestClassifier in sklearn. Using the iris data with a simple logistic regression as an example: when I use the original iris classes [0,1,2], the OneVsRestClassifier() probabilities for each observation always add up to 1. However, if I convert the target to dummies, this is not the case. I understand that OneVsRestClassifier() compares one vs the rest (class 0 vs non class 0, class 1 vs non class 1, etc.), so it makes sense that the sum of these probabilities has no relation to 1. Then why do I see the difference, and where does it come from?
import numpy as np
import pandas as pd
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
np.set_printoptions(suppress=True)
iris = datasets.load_iris()
rng = np.random.RandomState(0)
perm = rng.permutation(iris.target.size)
X = iris.data[perm]
y = iris.target[perm]
# categorical target with no conversion
X_train, y_train1 = X[:80], y[:80]
X_test, y_test1 = X[80:], y[80:]
m3 = LogisticRegression(random_state=0)
clf1 = OneVsRestClassifier(m3).fit(X_train, y_train1)
y_pred1 = clf1.predict(X_test)
print(np.sum(y_pred1 == y_test1))  # y_test1, not the undefined y_test
y_prob1 = clf1.predict_proba(X_test)
y_prob1[:5]
#output
array([[0.00014508, 0.17238549, 0.82746943],
       [0.03850173, 0.79646817, 0.1650301 ],
       [0.73981106, 0.26018067, 0.00000827],
       [0.00016332, 0.32231163, 0.67752505],
       [0.00029197, 0.2495404 , 0.75016763]])
# one hot encoding for categorical target
y2 = pd.get_dummies(y)
y_train2 = y2[:80]
y_test2 = y2[80:]
clf2 = OneVsRestClassifier(m3).fit(X_train, y_train2)
y_pred2 = clf2.predict(X_test)
y_prob2 = clf2.predict_proba(X_test)
y_prob2[:5]
#output
array([[0.00017194, 0.20430011, 0.98066319],
       [0.02152246, 0.44522562, 0.09225181],
       [0.96277892, 0.3385952 , 0.00001076],
       [0.00023024, 0.45436925, 0.95512082],
       [0.00036849, 0.31493725, 0.94676348]])
When you encode the targets, sklearn interprets your problem as a multilabel one instead of just multiclass; that is, it is possible for a point to have more than one true label. In that case, it is perfectly acceptable for the total sum of probabilities to be greater (or less) than 1. That's generally true for sklearn's multilabel support, but OneVsRestClassifier calls it out specifically in the docstring:
OneVsRestClassifier can also be used for multilabel classification. To use this feature, provide an indicator matrix for the target y when calling .fit.
As for the first approach, there are indeed three independent models, but the predictions are normalized; see the source code. Indeed, that's the only difference:
(y_prob2 / y_prob2.sum(axis=1)[:, None] == y_prob1).all()
# output
True
It's probably worth pointing out that LogisticRegression also natively supports multiclass. In that case, the weights for each class are independent, so it is similar to three separate models, but the resulting probabilities come from a softmax application, and the loss function minimizes the loss over all classes simultaneously. As a result, the coefficients, and hence the predictions, can differ from those obtained from OneVsRestClassifier:
m3.fit(X_train, y_train1)
y_prob0 = m3.predict_proba(X_test)
y_prob0[:5]
# output:
array([[0.00000494, 0.01381671, 0.98617835],
       [0.02569699, 0.88835451, 0.0859485 ],
       [0.95239985, 0.04759984, 0.00000031],
       [0.00001338, 0.04195642, 0.9580302 ],
       [0.00002815, 0.04230022, 0.95767163]])
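To see the softmax relationship concretely, here is a minimal sketch, assuming a scikit-learn version where LogisticRegression defaults to the multinomial formulation for multiclass targets:
from scipy.special import softmax
# for the multinomial formulation, predict_proba is a softmax over the
# per-class decision scores, so the two should agree up to float error
scores = m3.decision_function(X_test)
print(np.allclose(softmax(scores, axis=1), m3.predict_proba(X_test)))  # expected: True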

Why does numpy vectorization not improve the speed of my code

Here is the original code that does not use vectorize
import tensorflow as tf
import time
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(
    path='mnist.npz'
)
x_train = x_train.reshape(60000, -1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(x_train)
pca = PCA(n_components=16)  # or 12 -> 3, 4 filter_size=3
X_pca = pca.fit_transform(X_scaled).reshape(60000, 4, 4, 1)
start = time.time()
X_pca_zero = X_pca[0]
for i in range(1, 60000):
    X_pca_expanded = X_pca[i]
    print(tf.image.ssim(X_pca_zero, X_pca_expanded, 255, filter_size=4))
print(time.time() - start)
It essentially compares the similarity between a reference image and a set of images. I felt it could be sped up by vectorization (to avoid the time wasted by the for loop), so I used numpy's vectorize function:
import numpy as np
import tensorflow as tf
import time
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(
    path='mnist.npz'
)
def my_func(x_train):
    x_train = x_train.reshape(60000, -1)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(x_train)
    pca = PCA(n_components=16)  # or 12 -> 3, 4 filter_size=3
    X_pca = pca.fit_transform(X_scaled).reshape(60000, 4, 4, 1)
    start = time.time()
    X_pca_zero = X_pca[0]
    for i in range(1, 6000):
        X_pca_expanded = X_pca[i]
        print(tf.image.ssim(X_pca_zero, X_pca_expanded, 255, filter_size=4))
    print(time.time() - start)
    return 0
np.vectorize(my_func(x_train))
However, there doesn't seem to be any speed improvements.
In Python, function-call arguments are evaluated before they are passed to the function. np.vectorize is a Python function that expects a function as its argument and returns another function.
np.vectorize(my_func(x_train))
runs my_func(x_train) before passing the result to vectorize. That argument evaluation does all the prints and timing, and returns 0. I doubted whether it would even work, but:
In [194]: np.vectorize(0)
Out[194]: <numpy.vectorize at 0x7ff091c36310>
So it does run without error, but does nothing: all the timing happened before anything was passed to np.vectorize.
I suspect you read about the magic of "vectorization" and tried to use a similarly named function without reading its docs. Not only did you miss the performance disclaimer, but you also didn't learn how to use it (it does have its uses). It is not some sort of compiler or vectorizing magic.
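For reference, a minimal sketch of how np.vectorize is actually meant to be called: you pass the function object itself, not the result of calling it, and the docs note it is essentially a convenience loop, not a performance tool. The scalar_op example below is mine:
def scalar_op(a, b):
    # a scalar function that plain numpy broadcasting can't express directly
    return a + b if a > b else a - b
vec_op = np.vectorize(scalar_op)  # note: the function object, not scalar_op(...)
vec_op(np.arange(5), 2)  # array([-2, -1,  0,  5,  6])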
Most of your code is sklearn, while the only thing you time is
X_pca_zero = X_pca[0]
for i in range(1, 60000):
    X_pca_expanded = X_pca[i]
    print(tf.image.ssim(X_pca_zero, X_pca_expanded, 255, filter_size=4))
I'm not familiar with what tf.image.ssim does. But it is a tensorflow function and may be complex; calling it tens of thousands of times in a Python loop is bound to take noticeable time. If it doesn't let you provide all the X_pca values at once (as opposed to one by one), there's little you can do to speed it up.
I don't know what it returns, but usually we don't put a print call inside a timed block; each print adds to the measured time.
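For what it's worth, in the TensorFlow versions I'm aware of, tf.image.ssim accepts batched inputs of shape (batch, height, width, channels) and returns one SSIM value per pair, so a single call could replace the loop. A sketch, assuming X_pca from the question:
import numpy as np
import tensorflow as tf
# repeat the reference image along the batch axis so both arguments
# have shape (N, 4, 4, 1), then compute all similarities in one call
ref_batch = np.repeat(X_pca[0][np.newaxis], X_pca.shape[0], axis=0)
all_ssim = tf.image.ssim(ref_batch, X_pca, 255, filter_size=4)  # shape (N,)
print(all_ssim[1:])  # the same values the loop printed one at a time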

Fewer than expected purity scores in PCA analysis

I'm trying to plot a line graph of purity scores against the captured variance in PCA. The goal is to plot the purity scores against the captured variances of 89% and 99% only. In my code, when the number of components/dimensions is 2 it captures 89% of the variance, and when it is 4 it captures 99%.
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans  # needed for KMeans below
from sklearn import metrics  # needed for purity_score below
df = pd.read_csv("clustering.csv")
X10_df = df.drop("Class",axis = 1) #feature matrix
Y10_df = df["Class"] #Target vector
X10_df = np.array(X10_df)
Y10_df = np.array(Y10_df)
scaler = StandardScaler() # Standardizing the data
df_std = scaler.fit_transform(X10_df)
pca = PCA()
pca.fit(df_std)
purity = []
n_comp = range(2, 5)
for k in n_comp:
    pca = PCA(n_components=k)
    pca.fit(df_std)
    scores_pca = pca.transform(df_std)
    kmeans_pca = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
    pred_y12 = kmeans_pca.fit_predict(scores_pca)
    purity13 = purity_score(Y10_df, pred_y12)
    purity.append(purity13)
The function below calculates the purity score:
def purity_score(y_true, y_pred):
    # contingency matrix: rows indexed by true class, columns by predicted cluster
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
However, while I have four variance scores, I only have three purity scores. I expected to have four purity scores so that I could create a plot of the variance vs purity.
Why there are only three purity scores?
Here is the link to my dataset file : https://gofile.io/d/3CgFTi
This is simply because when you use a for loop with a range, the stop value of the range is excluded. So range(2, 5) yields 2, 3, 4 and then exits the loop; to cover components 2 through 5 you would need range(2, 6). Please read up on for loops and range in Python.
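A quick illustration of the off-by-one, in case it helps:
list(range(2, 5))  # [2, 3, 4] -> three purity scores
list(range(2, 6))  # [2, 3, 4, 5] -> four purity scores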

How to interpret the model once a set of coefficient is obtained for Multivariable polynomial regression?

I was solving a multivariable polynomial regression problem, as part of an online course, where one must obtain a model (in polynomial form) for determining the price of a car as a function of 'horsepower', 'curb-weight', 'engine-size' and 'highway-mpg'. The code given in the course slides didn't work for me, so I tried to solve the problem on my own using a slightly different approach, and (I'm not sure) I succeeded.
Now I want to determine which coefficient belongs to which variable and to what power.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
lm=LinearRegression()
pr=PolynomialFeatures(degree=2, include_bias=False)
zi=df[['horsepower','curb-weight','engine-size','highway-mpg']]
y=df["price"]
x_poly=pr.fit_transform(zi)
lm.fit(x_poly,y)
y_poly_pred=lm.predict(x_poly)
print(lm.intercept_)
print(lm.coef_)
The output of the 'print(lm.coef_)' is an array:
[ 3.76158683e+02, 1.09866844e+01, -1.15342835e+02, 2.20081486e+02,
1.67487147e+00, -1.85925420e-01, -1.27963440e+00, -1.97616945e+00,
5.93872420e-04, 1.11397083e-01, -2.12935236e-01, 1.04605018e-01,
2.69312438e-01, 4.36657298e+00]
How can I tell which variable and which power each of these coefficients corresponds to?
One way of doing this: you can get the PolynomialFeatures column names like this
pr.get_feature_names(zi.columns)
and
pd.DataFrame(zip(pr.get_feature_names(zi.columns),lm.coef_),columns=["feature","coef_"])
The above should print the coefficient for each feature.
Working example :
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np
data = pd.DataFrame.from_dict({
    'x': np.random.randint(low=1, high=10, size=5),
    'y': np.random.randint(low=-1, high=1, size=5),
})
lm=LinearRegression()
p = PolynomialFeatures(degree=2)
p_data = p.fit_transform(data)
lm.fit(p_data,data['y'])
print (p.get_feature_names(data.columns))
coefmapping = pd.DataFrame(zip(p.get_feature_names(data.columns),lm.coef_),columns=["feature","coef_"])
print(coefmapping)
output:
  feature         coef_
0       1 -1.204939e-14
1       x -1.165951e-15
2       y  5.000000e-01
3     x^2 -6.938894e-18
4     x y -3.156113e-16
5     y^2 -5.000000e-01
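Applied to the question's four features, the same idea shows the term ordering: linear terms first, then the degree-2 products. A sketch with a hypothetical stand-in for the question's zi (random data, same column names); note that in scikit-learn >= 1.0 the method is named get_feature_names_out:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
zi = pd.DataFrame(np.random.rand(5, 4),
                  columns=['horsepower', 'curb-weight', 'engine-size', 'highway-mpg'])
pr = PolynomialFeatures(degree=2, include_bias=False)
pr.fit(zi)
# 4 linear + 10 quadratic terms = 14 names, matching the 14 entries of lm.coef_
print(pr.get_feature_names(zi.columns))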
