Explanation of the coverage_error metric in scikit-learn

I don't understand how coverage_error, available in the sklearn.metrics module, is calculated in scikit-learn. The explanation in the docs is as follows:
The coverage_error function computes the average number of labels that have to be included in the final prediction such that all true labels are predicted.
For example:
import numpy as np
from sklearn.metrics import coverage_error
y_true = np.array([[1, 0, 0], [0, 1, 1]])
y_score = np.array([[1, 0, 0], [0, 1, 1]])
print(coverage_error(y_true, y_score))
1.5
As per my understanding, here we need to include 3 labels from the prediction to cover all the labels in y_true, so the coverage error is 3/2 = 1.5. But I am not able to understand what happens in the cases below:
>>> y_score = np.array([[1, 0, 0], [0, 0, 1]])
>>> print(coverage_error(y_true, y_score))
2.0
>>> y_score = np.array([[1, 0, 1], [0, 1, 1]])
>>> print(coverage_error(y_true, y_score))
2.0
How come the error is the same in both cases?

You can have a look at the User Guide, section 3.3.3 Multilabel ranking metrics, which defines the metric as

coverage(y, f) = (1/n_samples) * sum_i max_{j : y_ij = 1} rank_ij, where rank_ij = |{k : f_ik >= f_ij}|

One thing you need to take care of is how the ranks are computed and how ties in y_score are broken.
To be specific, the first case:
In [4]: y_true
Out[4]:
array([[1, 0, 0],
       [0, 1, 1]])

In [5]: y_score
Out[5]:
array([[1, 0, 0],
       [0, 0, 1]])
For the 1st sample, only the 1st label is true, and the rank of its score is 1.
For the 2nd sample, the 2nd and 3rd labels are true, and the ranks of their scores are 3 and 1 respectively, so the max rank is 3.
The average is (1 + 3) / 2 = 2.
The second case:
In [7]: y_score
Out[7]:
array([[1, 0, 1],
       [0, 1, 1]])
For the 1st sample, only the 1st label is true, and the rank of its score is 2.
For the 2nd sample, the 2nd and 3rd labels are true, and the ranks of their scores are 2 and 2 respectively, so the max rank is 2.
The average is (2 + 2) / 2 = 2.
Edit:
The rank is computed within one sample of y_score. The formula says the rank of a label is the number of labels (including itself) whose score is greater than or equal to its own score.
It is just like sorting the labels by y_score: the label with the largest score is ranked 1, the second largest 2, the third largest 3, and so on. But if the second and third largest labels have the same score, they are both ranked 3.
Notice that y_score is described in the docs as:
Target scores, can either be probability estimates of the positive class, confidence values, or binary decisions.
The goal is to have all true labels predicted, so for each true label we need to include every label whose score is higher than or equal to that true label's score.
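For completeness, here is a small sketch (not the library's implementation; the helper name manual_coverage_error is just for illustration) that applies the ranking rule above by hand and matches coverage_error on the first of the two puzzling cases above:

import numpy as np
from sklearn.metrics import coverage_error

def manual_coverage_error(y_true, y_score):
    per_sample = []
    for truth, scores in zip(y_true, y_score):
        # rank of a label = number of labels whose score is >= its own score
        ranks = np.array([(scores >= s).sum() for s in scores])
        # coverage for this sample = the worst (largest) rank among its true labels
        per_sample.append(ranks[truth == 1].max())
    return np.mean(per_sample)

y_true = np.array([[1, 0, 0], [0, 1, 1]])
y_score = np.array([[1, 0, 0], [0, 0, 1]])
print(manual_coverage_error(y_true, y_score))  # 2.0
print(coverage_error(y_true, y_score))         # 2.0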

Related

Precision / Recall for multi-label classification using torchmetrics

Is there a function or a set of arguments that I can use in order to calculate Precision and Recall for a multi-label problem?
Note that by multi-label I mean that each sample can be classified into more than one class.
The following is not returning what I would expect:
import torch
from torchmetrics import Precision

target = torch.tensor([
    [0, 0, 1, 1, 0],  # Sample 1 belongs to class 2 and 3 (zero-indexed)
    [0, 0, 1, 0, 0],  # Sample 2 belongs to class 2 (zero-indexed)
])
preds = torch.tensor([
    [0, 0, 0, 0, 0],  # Sample 1 predicted to belong to no class
    [0, 0, 0, 0, 0],  # Sample 2 predicted to belong to no class
])
metric = Precision(num_classes=5, mdmc_average="samplewise")
print(metric(preds, target))
It returns: tensor(0.7000), but it should be 0% since there are no True Positives.
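One way to get per-label precision and recall for a multi-label problem (a sketch assuming a recent torchmetrics version that provides the task-specific multilabel classes; not necessarily the version used in the question) is:

import torch
from torchmetrics.classification import MultilabelPrecision, MultilabelRecall

target = torch.tensor([[0, 0, 1, 1, 0],
                       [0, 0, 1, 0, 0]])
preds = torch.tensor([[0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0]])

# treat each of the 5 labels as its own binary problem and macro-average over labels
precision = MultilabelPrecision(num_labels=5, average="macro")
recall = MultilabelRecall(num_labels=5, average="macro")
print(precision(preds, target), recall(preds, target))  # with no positive predictions this should come out as 0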

Calculate cosine similarity and output without duplicates?

I have the following vectors in my toy example:
import pandas as pd

data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1]
})
I have calculated the similarity matrix (excluding the id column).
from sklearn.metrics.pairwise import cosine_similarity
# Calculate the pairwise cosine similarities
S = cosine_similarity(data.drop('id', axis=1))
T = S.tolist()
df = pd.DataFrame.from_records(T)
It returns a matrix/dataframe with all pairs, including self-similarities and duplicates.
Is there any efficient method to calculate the similarities without the self-similarities (a vector is 100% similar to itself) and duplicates (if vectors 1 and 2 have 89% similarity, I don't need the similarity of vectors 2 and 1, as it's the same)?
The best solution I found so far is to take the upper triangle above the diagonal:
[In] S[np.triu_indices_from(S, k=1)]
[Out] array([ 0.93420158, -0.93416293, 0.99856978, -0.81303909, -0.99999999,
0.91379242, -0.96724292, -0.91374841, 0.96727042, -0.78074903])
What this does is take only the values above the main diagonal, so it basically excludes the ones on the diagonal and the repeated values. This gives you a numpy array, too.
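If you would rather have the pairs labelled by id instead of a flat array, a small sketch (reusing data and S from above) could be:

import numpy as np
import pandas as pd

# row/column indices of the upper triangle, excluding the diagonal
i, j = np.triu_indices_from(S, k=1)
pairs = pd.DataFrame({
    'id_1': data['id'].to_numpy()[i],
    'id_2': data['id'].to_numpy()[j],
    'similarity': S[i, j],
})
print(pairs)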

Getting Concordance result of lifelines CoxPH model in a dataframe

I am using the CoxPH implementation of the lifelines package in Python. Currently, the results are a tabular view of coefficients and related stats that can be seen with print_summary(). Here is an example:
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({'duration': [4, 6, 5, 5, 4, 6],
                   'event': [0, 0, 0, 1, 1, 1],
                   'cat': [0, 1, 0, 1, 0, 1]})
cph = CoxPHFitter()
cph.fit(df, duration_col='duration', event_col='event', show_progress=True)
cph.print_summary()
Output: table of results from print_summary()
How can I get only the concordance index as a dataframe or list? cph.summary returns a dataframe of the main results, i.e. p-values and coefficients, but it does not include the concordance index and other surrounding information.
You can access the c-index with cph.concordance_index_, and you could put this into a list or dataframe if you wish.
You can also compute the concordance index for a Cox model using a small script available at this link. The code is given below.
from lifelines.utils import concordance_index

cph = CoxPHFitter().fit(df, 'duration', 'event')
Cindex = concordance_index(df['duration'], -cph.predict_partial_hazard(df), df['event'])
This code gives the C-index value, which also matches cph.concordance_index_.
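If you specifically want it as a dataframe, a tiny sketch (reusing the fitted cph from above) could be:

import pandas as pd

# a one-row dataframe holding just the model-level concordance index
cindex_df = pd.DataFrame({'concordance_index': [cph.concordance_index_]})

# or attach it as an extra column to the existing summary dataframe
summary_df = cph.summary.copy()
summary_df['concordance_index'] = cph.concordance_index_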

Scikit-learn R2 always zero

I'm trying to test my Scikit-learn machine learning algorithm with a simple R^2 score, but for some reason it always returns zero.
import numpy
from sklearn.metrics import r2_score
prediction = numpy.array([0.1567, 4.7528, 1.1260, 0.2294]).reshape(1, -1)
training = numpy.array([0, 3, 1, 0]).reshape(1, -1)
r2 = r2_score(training, prediction, multioutput="raw_values")
print(r2)
[ 0. 0. 0. 0.]
This is a single four-part value, not four separate values. How do I get proper R^2 scores?
If you are trying to calculate the R^2 value between two vectors, you should just pass two one-dimensional arrays. See the documentation.
In the example you provided, the first item is compared to the first item, but note you only have one list in each of the prediction and training arrays, so it is calculating R^2 between 0.1567 and 0 (which is 0), then between 4.7528 and 3 (also 0), and so on. With a single value per column the total sum of squares is zero, so R^2 is degenerate and comes out as 0 here. It sounds like you want the R^2 for the two vectors, like the following:
prediction = numpy.array([0.1567, 4.7528, 1.1260, 0.2294])
training = numpy.array([0, 3, 1, 0])
print(r2_score(training, prediction))
0.472439485
If you have multi-dimensional arrays you can use the multioutput flag to determine what the output should look like:
#modified from the scikit-learn example
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
print(r2_score(y_true, y_pred, multioutput='raw_values'))
array([ 0.96543779, 0.90816327])
Here the first item of each list in y_true is compared to the first item of each list in y_pred, the second to the second, and so on, giving one R^2 per output column.
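For reference, the default multioutput='uniform_average' simply averages these per-column scores; a quick check with the same toy arrays:

from sklearn.metrics import r2_score

y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]

per_column = r2_score(y_true, y_pred, multioutput='raw_values')
print(per_column.mean())         # ~0.9368
print(r2_score(y_true, y_pred))  # the default 'uniform_average' gives the same value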

scikit-learn: Get selected features for prediction data

I have a training set of data. The Python script for creating the model also calculates the attributes into a numpy array (it's a bit vector). I then want to use VarianceThreshold to eliminate all features that have zero variance (e.g. all 0 or all 1). I then run get_support(indices=True) to get the indices of the selected columns.
My issue now is how to get only the selected features for the data I want to predict. I first calculate all features and then use array indexing but it does not work:
x_predict_all = getAllFeatures(suppl_predict)
x_predict = x_predict_all[indices] #only selected features
indices is a numpy array.
The returned array x_predict has the correct length len(x_predict), but the wrong shape: x_predict.shape[1] is still the original number of features. My classifier then throws an error due to the wrong shape:
prediction = gbc.predict(x_predict)
  File "C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py", line 1032, in _init_decision_function
    self.n_features, X.shape[1]))
ValueError: X.shape[1] should be 1855, not 2090.
How can I solve this issue?
You can do it like this:
Test data
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])
selector = VarianceThreshold()
Alternative 1
>>> selector.fit(X)
>>> idxs = selector.get_support(indices=True)
>>> X[:, idxs]
array([[2, 0],
       [1, 4],
       [1, 1]])
Alternative 2
>>> selector.fit_transform(X)
array([[2, 0],
       [1, 4],
       [1, 1]])
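Applied to the question's code, the key point is to index columns rather than rows, or to reuse the fitted selector (a sketch; x_predict_all is the array from the question):

# wrong: x_predict_all[indices] selects rows
x_predict = x_predict_all[:, idxs]        # keep only the selected feature columns

# or, equivalently, let the selector fitted on the training data do it
x_predict = selector.transform(x_predict_all)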
