sklearn.metrics.ConfusionMatrixDisplay using scientific notation - python-3.x

I am generating a confusion matrix as follows:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(truth_labels, predicted_labels, labels=n_classes)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp = disp.plot(cmap="Blues")
plt.show()
However, some of my values for True Positive, True Negative, etc. are over 30,000, and they are displayed in scientific notation (e.g. 3e+04). I want to show all digits, and I found the values_format parameter in the ConfusionMatrixDisplay documentation. I have tried using it like this:
disp = ConfusionMatrixDisplay(confusion_matrix=cm, values_format='')
But I get a type error:
TypeError: __init__() got an unexpected keyword argument 'values_format'.
What am I doing wrong? Thanks in advance!

In case somebody runs into the same problem: I just found the answer. The values_format argument has to be passed to disp.plot, not to the ConfusionMatrixDisplay constructor, like so:
disp.plot(cmap="Blues", values_format='')
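For completeness, here is a minimal end-to-end sketch with the fix applied (assuming truth_labels and predicted_labels are defined as in the question):
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(truth_labels, predicted_labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
# values_format='' renders the raw counts, e.g. 30000 instead of 3e+04
disp.plot(cmap="Blues", values_format='')
plt.show()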

Related

Absolute value function not recognized as Disciplined Convex Program (CVXPY)

I am trying to run the following optimization using CVXPY:
import cvxpy as cp
import numpy as np
weights_vec = cp.Variable(10)
er_vec = cp.Parameter(10, value=np.random.randn(10))
prev_h_vec = cp.Parameter(10, value=np.ones(10))
tcost_vec = cp.Parameter(10, value=[0.03]*10)
objective = cp.Maximize(weights_vec @ er_vec - tcost_vec @ cp.abs(weights_vec - prev_h_vec))
prob = cp.Problem(objective)
prob.solve()
However, I get the following error:
cvxpy.error.DCPError: Problem does not follow DCP rules. Specifically:
The objective is not DCP. Its following subexpressions are not:
param516 @ abs(var513 + -param515)
The absolute function is convex. Hence, I am not quite sure why CVX is throwing an error for the absolute value function in the objective.
DCP-ness depends on the sign of tcost_vec.
As this is an (unconstrained, possibly negative) parameter, CVXPY cannot verify it.
Both of the following will work:
# we promise it's nonnegative
tcost_vec = cp.Parameter(10, value=[0.03]*10, nonneg=True)
# it's fixed and can be analyzed
tcost_vec = np.array([0.03]*10)
Given the code as posted, there is no reason to use parameters (yet).
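For reference, here is a sketch of the corrected model with the nonneg annotation (the unconstrained problem may still be unbounded, so this only checks DCP-compliance rather than solving):
import cvxpy as cp
import numpy as np

weights_vec = cp.Variable(10)
er_vec = cp.Parameter(10, value=np.random.randn(10))
prev_h_vec = cp.Parameter(10, value=np.ones(10))
# promising nonnegativity lets CVXPY verify the objective is concave
tcost_vec = cp.Parameter(10, value=[0.03] * 10, nonneg=True)

objective = cp.Maximize(weights_vec @ er_vec
                        - tcost_vec @ cp.abs(weights_vec - prev_h_vec))
prob = cp.Problem(objective)
print(prob.is_dcp())  # True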

shap.force_plot() raises Exception: In v0.20 force_plot now requires the base value as the first parameter

I'm using Catboost and would like to visualize shap_values:
import shap
from catboost import CatBoostClassifier, Pool

model = CatBoostClassifier(iterations=300)
model.fit(X, y, cat_features=cat_features)
pool1 = Pool(data=X, label=y, cat_features=cat_features)
shap_values = model.get_feature_importance(data=pool1, fstr_type='ShapValues', verbose=10000)
shap_values.shape
Output: (32769, 10)
X.shape
Output: (32769, 9)
Then I do the following and an exception is raised:
shap.initjs()
shap.force_plot(shap_values[0,:-1], X.iloc[0,:])
Exception: In v0.20 force_plot now requires the base value as the first parameter! Try shap.force_plot(explainer.expected_value, shap_values) or for multi-output models try shap.force_plot(explainer.expected_value[0], shap_values[0]).
The following works, but I would like to make force_plot() work:
shap.initjs()
shap.summary_plot(shap_values[:,:-1], X)
I read the documentation but can't make sense of the explainer. I tried:
explainer = shap.TreeExplainer(model, data=pool1)
# Also tried:
explainer = shap.TreeExplainer(model, data=X)
but I get: TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Can anyone point me in the right direction? THX
I had the same error as below:
Exception: In v0.20 force_plot now requires the base value as the
first parameter! Try shap.force_plot(explainer.expected_value,
shap_values) or for multi-output models try
shap.force_plot(explainer.expected_value[0], shap_values[0]).
This helped me resolve the issue:
import shap
explainer = shap.TreeExplainer(model, data=X)
shap.initjs()
shap.force_plot(explainer.expected_value[0], X.iloc[0, :])
Also, for the issue below:
TypeError: ufunc 'isnan' not supported for the input types, and the
inputs could not be safely coerced to any supported types according to
the casting rule ''safe''
Check whether your data contains any NaNs or missing values.
Hope this helps!
try this:
shap.force_plot(explainer.expected_value, shap_values.values[0, :], X.iloc[0, :])
Building on @Sparsha's answer, since I was still getting errors, what worked for me was:
explainer = shap.TreeExplainer(model, data=X)
shap_values = explainer.shap_values(X_train)
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0], feature_names=explainer.data_feature_names)
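Pulling the answers together, a minimal sketch of the whole flow (assuming X is a pandas DataFrame and y and cat_features are as in the question):
import shap
from catboost import CatBoostClassifier, Pool

model = CatBoostClassifier(iterations=300)
model.fit(X, y, cat_features=cat_features)

# for tree models, TreeExplainer needs no background data
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(Pool(X, y, cat_features=cat_features))

shap.initjs()
# base value first, then the SHAP values and the features of one sample
shap.force_plot(explainer.expected_value, shap_values[0, :], X.iloc[0, :])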

Find Max Value in a field of a shapefile

I have a shapefile (mich_co.shp) from which I am trying to find the county with the maximum population. My idea was to use the max() function, but it does not work. Here is my code so far:
from osgeo import ogr
import os
shapefile = "C:/Users/root/Python/mich_co.shp"
driver = ogr.GetDriverByName("ESRI Shapefile")
dataSource = driver.Open(shapefile, 0)
layer = dataSource.GetLayer()
for feature in layer:
    print(feature.GetField("pop"))
layer.ResetReading()
The code above, however, only prints all the values of the "pop" field, like this:
10635.0
9541.0
112039.0
29234.0
23406.0
15477.0
8683.0
58990.0
106935.0
17465.0
156067.0
43868.0
135099.0
I tried:
print(max(feature.GetField("pop")))
but it returns TypeError: 'float' object is not iterable. I've also tried:
for feature in range(layer):
and it returns TypeError: 'Layer' object cannot be interpreted as an integer.
Any help or hints would be much appreciated.
Thank you!
max() needs an iterable, such as a list. Try to build a list:
pops = [ feature.GetField("pop") for feature in layer ]
print(max(pops))
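If you also need to know which county has that population, not just the value, the same idea works with a key function (a sketch; the "NAME" field for the county name is a guess, so check your shapefile's schema):
layer.ResetReading()
# pick the feature whose "pop" field is largest
biggest = max(layer, key=lambda f: f.GetField("pop"))
print(biggest.GetField("NAME"), biggest.GetField("pop"))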

How can I generate classification report by removing this error?

I want to generate a classification report for the movie_reviews dataset from the corpus, which already has the target names [pos, neg], but I ran into an error.
Code:
movie_train_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                            ('tfidf', TfidfTransformer()),
                            ('clas', BernoulliNB(fit_prior=True))])
movie_train_clas = movie_train_clf.fit(movie_train.data, movie_train.target)
predict = movie_train_clas.predict(movie_train.data)
np.mean(predict == movie_train.target)
Now I use classification_report:
from sklearn.metrics import classification_report
print(classification_report(predict, movie_train_clas,target_names==target_names))
Error:
TypeError: iteration over a 0-d array.
Please help me with the correct syntax.
There are multiple errors in your code:
1) You have the wrong order of arguments in classification_report. As per the documentation:
classification_report(y_true, y_pred, ...
First argument is the true labels and second one is the predicted labels.
2) You are using movie_train_clas in place of the true labels. movie_train_clas, as per your code, is the return value of movie_train_clf.fit(); fit() returns the estimator itself, so you cannot use it in place of the ground-truth labels.
3) As @AmiTavory spotted, the current error is due to the comparison operator (==) being used in place of assignment (=). The correct call to classification_report should be:
classification_report(movie_train.target, predict, target_names=target_names)
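Putting it all together, a corrected sketch of the snippet (assuming movie_train was loaded with sklearn's load_files, so that movie_train.target_names holds the class names):
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report

movie_train_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                            ('tfidf', TfidfTransformer()),
                            ('clas', BernoulliNB(fit_prior=True))])
movie_train_clf.fit(movie_train.data, movie_train.target)
predict = movie_train_clf.predict(movie_train.data)

# true labels first, predictions second, and a single '=' for the keyword
print(classification_report(movie_train.target, predict,
                            target_names=movie_train.target_names))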

Sklearn LogisticRegression predict_proba result look weird

I am quite new to sklearn, machine learning, and related topics. I have searched for a day but still cannot figure out the answer.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1)
model.fit(X, y)
print(model.predict_proba(X_test))
# output
[[ 1.01555532e-08 2.61926230e-01 7.37740949e-01 3.32810963e-04]]
I am quite confused about whether this output is correct. When I tried an SVM on the same dataset, I got [[ 0.21071225 0.42531172 0.01024818 0.35372784]], which looks like probabilities, and this is what I want. How can I make the LogisticRegression model produce the same probability style as the SVM? What am I misunderstanding?
This is just printing-style!
Have a look at this demo:
Code:
import numpy as np
p = np.array([[ 1.01555532e-08, 2.61926230e-01, 7.37740949e-01, 3.32810963e-04]])
print('p: ', p)
print('sum: ', p.sum()) # approximately a probability-distribution?
np.set_printoptions(suppress=True)
print('p: ', p) # same print as above
# but printing-style was changed before!
Output:
p: [[1.01555532e-08 2.61926230e-01 7.37740949e-01 3.32810963e-04]]
sum: 1.0000000001185532
p: [[0.00000001 0.26192623 0.73774095 0.00033281]]
Numpy uses a lot of code to decide how to print your arrays, depending on the values inside! Here we changed something beforehand, using np.set_printoptions.
Your output looks different because the output of your SVM prediction has no small values, unlike the logistic-regression output!
suppress : bool, optional
Whether or not suppress printing of small floating point values using scientific notation (default False).
The use of scientific-notation also applies to python's types:
x = 0.00000001
print(x)
# 1e-08
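If you prefer not to change the global print options, you can also format a single array on the fly (a small sketch using numpy's array2string):
import numpy as np

p = np.array([[1.01555532e-08, 2.61926230e-01, 7.37740949e-01, 3.32810963e-04]])
# suppress_small=True prints the tiny entry as 0. instead of 1.01555532e-08
print(np.array2string(p, suppress_small=True))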
