Why not look at the precision and recall of both classes combined in a classification report? - scikit-learn

I was looking at the classification report from sklearn. I am wondering, why did they omit a potential third row with precision and recall values for both classes together? Why were they split apart, and what's the disadvantage to considering these metrics with both classes combined?

"Precision and recall values for both classes together" is contained in the classification_report as macro averages and weighted averages for precision, recall, and f1-score.
Compare the bottom rows of the classification_report to the values computed by calling precision_score(y_true, y_pred) directly:
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 0]
print(classification_report(y_true, y_pred))
print(round(precision_score(y_true, y_pred, average='macro'), 2))
print(round(precision_score(y_true, y_pred, average='weighted'), 2))
Running this produces the following. Notice that the macro-averaged precision is 0.64 and the weighted-average precision is 0.67, both of which appear in the bottom rows of the table:
              precision    recall  f1-score   support

           0       0.43      0.60      0.50         5
           1       0.50      0.57      0.53         7
           2       1.00      0.57      0.73         7

    accuracy                           0.58        19
   macro avg       0.64      0.58      0.59        19
weighted avg       0.67      0.58      0.60        19

0.64
0.67
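For intuition, here is a minimal sketch of how those two averages are derived from the per-class precision values, using the rounded numbers read off the table above:
import numpy as np

# Per-class precision and support, read off the report above (rounded)
precision = np.array([0.43, 0.50, 1.00])
support = np.array([5, 7, 7])

# Macro average: unweighted mean across classes
print(round(precision.mean(), 2))                        # ~0.64

# Weighted average: mean weighted by each class's support
print(round(np.average(precision, weights=support), 2))  # ~0.67
So the macro average treats every class equally, while the weighted average counts each class in proportion to how many true samples it has.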

Related

Dividing 1.0 into three random proportions in Python

How can I divide 1.0 into three random proportions in Python? Below is an example of the expected output.
Iteration 0:
Output: 0.5, 0.25, 0.25
Iteration 1:
Output: 0.4, 0.35, 0.25
Iteration 2:
Output: 0.2, 0.25, 0.55
...
Generate two random numbers in [0, 1), e.g. with random.random(). Sort them in ascending order and call them a ≤ b. Then your proportions will be a, b - a, and 1 - b.
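A minimal sketch of this approach, using the standard library's random module:
import random

def three_proportions():
    # Draw two cut points in [0, 1) and sort them
    a, b = sorted(random.random() for _ in range(2))
    # The three segment lengths always sum to 1.0
    return a, b - a, 1 - b

for i in range(3):
    print(f"Iteration {i}:", three_proportions())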

How can I use Python to convert an adjacency matrix to a transition matrix?

I am trying to convert a matrix like
1 1 0
0 1 1
0 1 1
to become
1 ⅓ 0
0 ⅓ ½
0 ⅓ ½
I was thinking about summing the rows and then dividing by them, but I was wondering if there was a better way to accomplish this using numpy or any other way in Python.
You can do it with numpy as below:
import numpy as np
arr = np.array([[1, 1, 0],
[0, 1, 1],
[0, 1, 1]])
print(arr/arr.sum(axis=0))
[[1.         0.33333333 0.        ]
 [0.         0.33333333 0.5       ]
 [0.         0.33333333 0.5       ]]
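Note that arr.sum(axis=0) sums down each column, so the result is column-stochastic (each column sums to 1), which matches the expected output. If you instead want each row to sum to 1, one small variation works; keepdims=True keeps the sums as a column vector so broadcasting divides each row by its own sum:
import numpy as np

arr = np.array([[1, 1, 0],
                [0, 1, 1],
                [0, 1, 1]])

# Row-stochastic variant: divide each row by its own sum
print(arr / arr.sum(axis=1, keepdims=True))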

how to sort nested dictionaries based on multiple keys

I need to sort the dictionary dicti and display the results. The task is to compile the following statistics for each player:
Number of best-of-5 set matches won
Number of best-of-3 set matches won
Number of sets won
Number of games won
Number of sets lost
Number of games lost
You should print to the screen (standard output) a summary in decreasing order of ranking, where the ranking is according to criteria 1-6 in that order (compare item 1; if equal, compare item 2; if equal, compare item 3, and so on, noting that for items 5 and 6 the comparison is reversed).
I have stored the results in a dictionary, but I am not familiar with sorting dictionaries and have no clue how to do it.
dicti={'Federer': {'gameswon': 142, 'gameslost': 143, 'setswon': 13, 'setslost': 16, 'fivesetmatch': 3, 'threesetmatch': 1},
'Nadal': {'gameswon': 143, 'gameslost': 142, 'setswon': 16, 'setslost': 13, 'fivesetmatch': 2, 'threesetmatch': 2},
'Halep': {'gameswon': 15, 'gameslost': 12, 'setswon': 2, 'setslost': 1, 'fivesetmatch': 0, 'threesetmatch': 1},
'Wozniacki': {'gameswon': 12, 'gameslost': 15, 'setswon': 1, 'setslost': 2, 'fivesetmatch': 0, 'threesetmatch': 0}}
You can use pandas for data analysis and getting insights:
>>> import pandas as pd
>>> df = pd.DataFrame.from_dict(dicti)
>>> df
               Federer  Nadal  Halep  Wozniacki
gameswon           142    143     15         12
gameslost          143    142     12         15
setswon             13     16      2          1
setslost            16     13      1          2
fivesetmatch         3      2      0          0
threesetmatch        1      2      1          0
>>> df.describe()
          Federer       Nadal      Halep  Wozniacki
count    6.000000    6.000000   6.000000    6.00000
mean    53.000000   53.000000   5.166667    5.00000
std     69.561484   69.558608   6.554896    6.69328
min      1.000000    2.000000   0.000000    0.00000
25%      5.500000    4.750000   1.000000    0.25000
50%     14.500000   14.500000   1.500000    1.50000
75%    110.500000  110.500000   9.500000    9.50000
max    143.000000  143.000000  15.000000   15.00000
For example, for the number of games won you could do:
>>> df.loc['gameswon'].sum()
312
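That said, the ranking itself can be done directly on the dictionary with sorted() and a tuple-valued key: negating the "higher is better" items gives descending order while leaving the losses (items 5 and 6) ascending. A minimal sketch, assuming the six criteria map onto the keys shown above:
ranking = sorted(
    dicti,
    key=lambda p: (-dicti[p]['fivesetmatch'],   # 1. best-of-5 wins (desc)
                   -dicti[p]['threesetmatch'],  # 2. best-of-3 wins (desc)
                   -dicti[p]['setswon'],        # 3. sets won (desc)
                   -dicti[p]['gameswon'],       # 4. games won (desc)
                   dicti[p]['setslost'],        # 5. sets lost (asc)
                   dicti[p]['gameslost']))      # 6. games lost (asc)

for player in ranking:
    print(player, dicti[player])
With the data above this prints Federer, Nadal, Halep, Wozniacki, in that order.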

Why does xgboost produce the same predictions and nan values for features when using entire dataset?

Summary
I am using Python v3.7 and xgboost v0.81. I have a continuous target (y) at the US state level for each week from 2015 to 2019. I'm trying to regress y on the following features: year, month, week, and region (encoded). The training set is August 2018 and earlier, and the test set is September 2018 onward. When I train the model this way, two weird things happen:
feature_importances are all nan
predictions are all the same (0.5, 0.5....)
What I've tried
Fixing any one of the features to a single value (e.g. year == 2017 or region == 28) allows the model to train appropriately, and the two weird issues above disappear.
Code
(I know this is a temporal problem but this general case exhibits the problem as well)
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X = df[['year', 'month', 'week', 'region_encoded']]
display(X)
y = df.target
display(y)
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.1)
model = XGBRegressor(n_jobs=-1, n_estimators=1000).fit(X_train, y_train)
display(model.predict(X_test)[:20])
display(model.feature_importances_)
Results - some of the predictions and the feature importances
   year  month  week  region_encoded
0  2015     10    40               0
1  2015     10    40               1
2  2015     10    40               2
3  2015     10    40               3
4  2015     10    40               4

0    272.0
1     10.0
2    290.0
3     46.0
4    558.0
Name: target, dtype: float64
array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5], dtype=float32)
array([nan, nan, nan, nan], dtype=float32)
If the target variable contains even a single NaN, that is enough to break many machine learning algorithms. This is usually because an unhandled NaN in the target propagates through the update step, for example when computing derivatives, so one NaN poisons everything downstream; I cannot say exactly which step inside XGBoost does this. For example, consider the analytical solution for linear regression:
import numpy as np
import numpy.linalg as la
from scipy import stats

y = np.array([0, 1, 2, 3, np.nan, 5, 6, 7, 8, 9])
x = stats.norm().rvs((len(y), 3))

# Ordinary least squares estimate: (X^T X)^-1 X^T y
m_hat = la.inv(x.T @ x) @ x.T @ y
print(m_hat)  # [nan nan nan]
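A quick way to confirm and remove the problem before fitting; a minimal sketch, assuming df is the DataFrame from the question with a 'target' column:
# df is assumed to be the DataFrame from the question
# Count NaNs in the target; even one is enough to poison training
print(df['target'].isna().sum())

# Drop rows with a missing target before splitting and fitting
df = df.dropna(subset=['target'])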

Evaluate F-score for individual label by cross validation in multi-label classification

I have a multi-label dataset and I want to determine the F-score for each individual label with a cross-validation test. Is there any example code implemented in sklearn or skmultilearn? Their documentation seems to provide only a value for the entire dataset.
You can use scikit-learn's classification_report. Suppose you have y and y_pred:
from sklearn.metrics import classification_report
y = [0, 1, 2, 2, 2]
y_pred = [1, 0, 2, 2, 1]
classes = {'Banana': 0, 'Apple': 1, 'Orange': 2}
print(classification_report(y, y_pred, target_names=classes.keys()))
output
             precision    recall  f1-score   support

     Banana       0.00      0.00      0.00         1
      Apple       0.00      0.00      0.00         1
     Orange       1.00      0.67      0.80         3

avg / total       0.60      0.40      0.48         5
Alternatively, you can use f1_score (also from sklearn.metrics):
from sklearn.metrics import f1_score
print(f1_score(y, y_pred, average=None))
and you'll get the label scores in a list
[ 0.   0.   0.8]
Of course you can use a KFold iterator, go through all the folds, and get the per-label f1 for each one, but I don't see why you'd want to do that.
If you're using cross-validation, you get one f1 score per fold, because the scoring is used to evaluate the model and choose the best one. See the example below:
from sklearn import svm, datasets
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

clf = svm.SVC()
cross_val_score(clf, X, y, cv=10, scoring='f1_weighted')
This will output an array of 10 scores, one per fold:
array([ 1.        ,  0.93265993,  1.        ,  1.        ,  1.        ,
        0.93265993,  0.93265993,  1.        ,  1.        ,  1.        ])
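If the goal really is one F-score per label measured across a cross-validation test, one option is to collect the out-of-fold predictions with cross_val_predict and then compute a single per-label f1_score over them. A minimal sketch, reusing the iris setup above:
from sklearn import svm, datasets
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

iris = datasets.load_iris()
X, y = iris.data, iris.target

clf = svm.SVC()
# Out-of-fold prediction for every sample, across 10 folds
y_oof = cross_val_predict(clf, X, y, cv=10)

# One F-score per label, computed over all folds at once
print(f1_score(y, y_oof, average=None))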
