Identify the best GridSearchCV scoring metric for food prediction in XGBoost - python-3.x

I am using GridSearchCV to find the best parameters for tuning XGBoost in a food prediction algorithm.
I am struggling to identify the scoring metric that results in the best profit (sales margin minus wastage costs), as this is ultimately what I am after. Running the script below and plugging it into the data (I reserved some data for testing only), I noticed that a better R2 seems to lead to a higher profit than a better RMSE does. But I am struggling to find an explanation that would guide me to the best scoring method.
Here is some information on the situation:
It costs me 6 USD to produce the product and I sell it for 9 USD, so my margin is 3 USD. My wastage cost is therefore 6 USD multiplied by (production minus sales quantity), whereas my earnings are the sales quantity multiplied by 3.
Example: I produce 100, sell 70 and waste 30, so my profit is 70*3 - 30*6 = 30.
So I have an imbalance between sales and wastage.
Main Question: Which scoring metric puts a higher penalty weight on over-prediction?
My current code:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

X = consumption[feature_names]
y = consumption['Meal1']
data_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'min_child_weight': [1, 2],
    'gamma': [0.05, 0.06],
    'colsample_bytree': [0.22, 0.23],
    'n_estimators': range(28, 29),
    'max_depth': range(3, 8),
    'reg_alpha': range(1, 2),
    'reg_lambda': range(1, 2),
    'subsample': [0.7, 0.8, 0.9],
    'learning_rate': [0.1, 0.2],
}
fixed_params = {'objective': 'reg:squarederror', 'booster': 'gbtree'}
# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(**fixed_params)
# Perform grid search: grid_mse
grid_mse = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid, scoring="r2", cv=5, verbose=1)
# Fit grid_mse to the data
grid_mse.fit(X,y)
# Print the best parameters and the best score found
print("Best parameters found: ", grid_mse.best_params_)
print("Best score found: ", np.sqrt(np.abs(grid_mse.best_score_)))  # sqrt/abs only makes sense when scoring is a (negated) MSE

Related

Multiclass classification per class recall equals per class accuracy?

I've got a multiclass problem. I'm using sklearn.metrics to calculate the confusion matrix, overall accuracy, per class precision, per class recall and per class F1-score.
Now I wanted to calculate the per-class accuracy. Since there is no method in sklearn for this, I used another one which I got from a Google search. I've now realised that the per-class recall equals the per-class accuracy. Can anyone explain to me if this holds true and, if yes, why?
I found an explanation here, but I'm not sure, since there the micro-recall equals the overall accuracy, if I'm understanding it correctly. And I'm looking for the per-class accuracy.
I experienced the same results too, because per-class recall = TP / (TP + FN). Here TP + FN equals all the samples of that class, so the formula becomes similar to accuracy.
This generally doesn't hold. Accuracy and recall are calculated using different formulas and are different measures explaining something else.
Recall is the percentage of actual positive data points that your classifier correctly predicts as positive.
Accuracy is the percentage of all examples that are classified correctly, including positive and negative.
If they are equal, this is either coincidence or you have an error in your method of calculating them. Most likely this will be coincidence.
EDIT:
I will show why it's not the case with an example that can be generalised to N classes.
Let's assume three classes: 0, 1, 2 with the following confusion matrix:
[[3 0 1]
[2 5 0]
[0 1 4]]
When we want to calculate measures per class, we do this in a binary, one-vs-rest fashion. For example, for class 0 we combine 1 and 2 into 'not 0'. This results in the following confusion matrix:
[[ 3  1]
 [ 2 10]]
Resulting in:
TP = 3
FP = 2
FN = 1
TN = 10
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall = TP / (TP + FN)
So you can already tell from these formulas that they are really not equal. To disprove a hypothesis in mathematics it suffices to show a counterexample, in this case an example showing that accuracy is not equal to recall.
Filling in this example we get:
Accuracy = (3 + 10) / 16 = 13/16
Recall = 3 / (3 + 1) = 3/4
And 13/16 is not equal to 3/4, thus disproving the hypothesis that per-class accuracy is equal to per-class recall.
It is, however, also possible to construct examples for which the hypothesis holds. But because it does not hold in general, it is disproven.
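A quick way to sanity-check the numbers above is to rebuild labels and predictions that produce the 3x3 confusion matrix and compare the binary accuracy and recall for class 0 (a small sketch, not part of the original answer):
from sklearn.metrics import accuracy_score, recall_score

# Labels/predictions reproducing the confusion matrix [[3 0 1], [2 5 0], [0 1 4]]
y_true = [0]*4 + [1]*7 + [2]*5
y_pred = [0, 0, 0, 2,            # actual 0
          0, 0, 1, 1, 1, 1, 1,   # actual 1
          1, 2, 2, 2, 2]         # actual 2
# Binarize for class 0 vs. the rest.
y_true_bin = [1 if t == 0 else 0 for t in y_true]
y_pred_bin = [1 if p == 0 else 0 for p in y_pred]
print(accuracy_score(y_true_bin, y_pred_bin))  # 13/16 = 0.8125
print(recall_score(y_true_bin, y_pred_bin))    # 3/4 = 0.75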
Not sure if you are looking for average per-class accuracy as a single metric or per-class accuracy as separate metrics for each class.
For per-class accuracy as a separate metric for each class, see the code below. It's the same as recall-micro per class.
For average per-class accuracy as a single metric, it is equivalent to recall-macro (which is equivalent to balanced accuracy in sklearn). See the code below.
Here is the empirical demonstration in code.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score
label_class1 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
label_class2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
labels = label_class1 + label_class2
pred_class1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_class2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
pred = pred_class1 + pred_class2
# 1. calculate accuracy scores per class
score_accuracy_class1 = accuracy_score(label_class1, pred_class1)
score_accuracy_class2 = accuracy_score(label_class2, pred_class2)
print(score_accuracy_class1) # 0.6
print(score_accuracy_class2) # 0.9
# 2. calculate recall scores per class
score_recall_class1 = recall_score(label_class1, pred_class1, average='micro')
score_recall_class2 = recall_score(label_class2, pred_class2, average='micro')
print(score_recall_class1) # 0.6
print(score_recall_class2) # 0.9
assert score_accuracy_class1 == score_recall_class1
assert score_accuracy_class2 == score_recall_class2
# 3. this also means that average per-class accuracy is equivalent to averaged recall and balanced accuracy
score_balanced_accuracy1 = (score_accuracy_class1 + score_accuracy_class2) / 2
score_balanced_accuracy2 = (score_recall_class1 + score_recall_class2) / 2
score_balanced_accuracy3 = balanced_accuracy_score(labels, pred)
score_balanced_accuracy4 = recall_score(labels, pred, average='macro')
print(score_balanced_accuracy1) # 0.75
print(score_balanced_accuracy2) # 0.75
print(score_balanced_accuracy3) # 0.75
print(score_balanced_accuracy4) # 0.75
# balanced accuracy, average per-class accuracy and recall-macro are equivalent
assert score_balanced_accuracy1 == score_balanced_accuracy2 == score_balanced_accuracy3 == score_balanced_accuracy4
These official docs say: "balanced accuracy ... is defined as the average of recall obtained on each class."
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html

What is the role of the OneVsRestClassifier wrapper around XGBClassifier?

I have a multiclass classification problem with 3 classes.
0 - on a given day (24h) my laptop battery did not die
1 - on a given day my laptop battery died before 12AM
2 - on a given day my laptop battery died at or after 12AM
(Note that these categories are mutually exclusive. The battery is not recharged once it died)
I am interested in the predicted probability for each of the 3 classes. More specifically, I intend to derive 2 types of warnings:
If the prediction for class 1 is higher than a threshold x: 'Your battery is at risk of dying in the morning.'
If the prediction for class 2 is higher than a threshold y: 'Your battery is at risk of dying in the afternoon.'
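In code, that warning logic is straightforward once a classifier is fitted (a sketch; clf stands for either of the fitted models shown below, and the thresholds are placeholders):
x_threshold, y_threshold = 0.3, 0.3              # placeholder thresholds
proba = clf.predict_proba([[-19, -20]])[0]       # [P(class 0), P(class 1), P(class 2)]
if proba[1] > x_threshold:
    print('Your battery is at risk of dying in the morning.')
if proba[2] > y_threshold:
    print('Your battery is at risk of dying in the afternoon.')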
I can generate the probabilities by using xgboost.XGBClassifier with the appropriate parameters for a multiclass problem.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from xgboost import XGBClassifier

X = np.array([
    [10, 10],
    [8, 10],
    [-5, 5.5],
    [-5.4, 5.5],
    [-20, -20],
    [-15, -20]
])
y = np.array([0, 1, 1, 1, 2, 2])

clf1 = XGBClassifier(objective='multi:softprob', num_class=3, seed=42)
clf1.fit(X, y)
clf1.predict_proba([[-19, -20]])
Results:
array([[0.15134096, 0.3304505 , 0.51820856]], dtype=float32)
But I can also wrap this with sklearn.multiclass.OneVsRestClassifier, which then produces slightly different results:
clf2 = OneVsRestClassifier(XGBClassifier(objective = 'multi:softprob', num_class = 3, seed = 42))
clf2.fit(X, y)
clf2.predict_proba([[-19, -20]])
Results:
array([[0.10356173, 0.34510303, 0.5513352 ]], dtype=float32)
I was expecting the two approaches to produce the same results. My understanding was that XGBClassifier is also based on a one-vs-rest approach in a multiclass case, since there are 3 probabilities in the output and they sum up to 1.
Can you tell me where the difference comes from, and how the respective results should be interpreted? And most importantly, which approach is better suited to my problem?
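For intuition, here is a rough sketch of what a one-vs-rest wrapper does: fit one binary model per class and normalize the positive-class probabilities so they sum to 1. This is an illustrative assumption about the wrapper's behavior, not sklearn's exact internals, and it reuses X and y from the snippet above:
per_class = []
for cls in [0, 1, 2]:
    binary_y = (y == cls).astype(int)            # current class vs. the rest
    m = XGBClassifier(seed=42)                   # default binary objective
    m.fit(X, binary_y)
    per_class.append(m.predict_proba([[-19, -20]])[:, 1])
proba = np.column_stack(per_class)
proba = proba / proba.sum(axis=1, keepdims=True)  # renormalize rows to sum to 1
print(proba)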

random forest regression low score

I'm trying to use random forest regression to predict a car's price. I got data from cars.com, cleaned it, and kept some features (year, mileage, exteriorColor, etc.). The categorical features didn't seem to work with the algorithm, so I created dummy variables for them (because only numerical features work with trees?). I got a low score.
The final data looks like this:
Year Model Price Mileage Engine CityFuelEconomy HighwayFuelEconomy ExteriorColor
2013 2 6900 37100 3.0 20 30 1
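For reference, the dummy-variable step described above might look like this (a minimal pandas sketch; the DataFrame name df and the choice of categorical columns are assumptions, not the asker's actual code):
import pandas as pd

# One-hot encode the categorical columns; numeric columns pass through unchanged.
categorical_cols = ['Model', 'ExteriorColor']   # hypothetical column names
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)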
I ran random forest with default parameters, and also with parameter tuning via GridSearch; neither result is ideal.
#by default
In: from sklearn.metrics import explained_variance_score
explained_variance_score(train_y, model.predict(train_x))
Out: 0.5569482176630063
In: model.score(test_x, test_y)
Out: 0.5299303064708601
Train MAE: 993.199536787152
Test MAE: 1094.8346295258416
#GridSearch
Best Score is: 0.5305298726822617
Best Parameters are: {'criterion': 'mse', 'max_depth': 15, 'max_features': 3,
'min_samples_leaf': 3, 'min_samples_split': 7, 'n_estimators': 500}
forest.score(X_val, y_val)
Score: 0.56
I'm new to machine learning and don't know which algorithm best fits which kind of dataset. Can anyone help me improve this, or suggest possible reasons for what might have happened? Thanks!

ALS model - predicted full_u * v^t * v ratings are very high

I'm predicting ratings in between processes that batch train the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v?
! rm -rf ml-1m.zip ml-1m
! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .
from pyspark.mllib.recommendation import Rating

ratingsRDD = sc.textFile('ratings.dat') \
    .map(lambda l: l.split("::")) \
    .map(lambda p: Rating(
        user=int(p[0]),
        product=int(p[1]),
        rating=float(p[2]),
    )).cache()
from pyspark.mllib.recommendation import ALS
rank = 50
numIterations = 20
lambdaParam = 0.1
model = ALS.train(ratingsRDD, rank, numIterations, lambdaParam)
Then extract the product features ...
import json
import numpy as np

pf = model.productFeatures()
pf_vals = pf.sortByKey().values().collect()
pf_keys = pf.sortByKey().keys().collect()
Vt = np.matrix(np.asarray(pf_vals))
full_u = np.zeros(len(pf_keys))

def set_rating(pf_keys, full_u, key, val):
    try:
        idx = pf_keys.index(key)
        full_u.itemset(idx, val)
    except ValueError:
        pass
set_rating(pf_keys, full_u, 260, 9)   # Star Wars (1977)
set_rating(pf_keys, full_u, 1, 8)     # Toy Story (1995)
set_rating(pf_keys, full_u, 16, 7)    # Casino (1995)
set_rating(pf_keys, full_u, 25, 8)    # Leaving Las Vegas (1995)
set_rating(pf_keys, full_u, 32, 9)    # Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
set_rating(pf_keys, full_u, 335, 4)   # Flintstones, The (1994)
set_rating(pf_keys, full_u, 379, 3)   # Timecop (1994)
set_rating(pf_keys, full_u, 296, 7)   # Pulp Fiction (1994)
set_rating(pf_keys, full_u, 858, 10)  # Godfather, The (1972)
set_rating(pf_keys, full_u, 50, 8)    # Usual Suspects, The (1995)
recommendations = full_u*Vt*Vt.T
top_ten_ratings = list(np.sort(recommendations)[:,-10:].flat)
print("predicted rating value", top_ten_ratings)
top_ten_recommended_product_ids = np.where(recommendations >= np.sort(recommendations)[:,-10:].min())[1]
top_ten_recommended_product_ids = list(np.array(top_ten_recommended_product_ids))
print("predict rating prod_id", top_ten_recommended_product_ids)
However the predicted ratings seem way too high:
('predicted rating value', [313.67320347694897, 315.30874327316576, 317.1563289268388, 317.45475214423948, 318.19788673744563, 319.93044594688428, 323.92448427140653, 324.12553531632761, 325.41052886977582, 327.12199687047649])
('predict rating prod_id', [49, 287, 309, 558, 744, 802, 1839, 2117, 2698, 3111])
This appears to be incorrect. Any tips appreciated.
I think the approach mentioned would work if you only care about the ranking of the movies. If you want an actual rating, there seems to be something off in terms of dimension/scaling.
The idea here is to guess the latent representation of your new user. Normally, for a user already in the factorization, user i, you have his latent representation u_i (the i-th row in model.userFeatures()), and you get his rating for a given movie (movie j) using model.predict, which basically multiplies u_i by the latent representation of the product, v_j. You can get all the predicted ratings at once if you multiply by the whole product-factor matrix: u_i*v.
For a new user, you have to guess his latent representation u_new from full_u_new.
Basically you want 50 coefficients that represent your new user's affinity towards each of the latent product factors.
For simplicity, and since it was enough for my implicit-feedback use case, I simply used the dot product, basically projecting the new user onto the product latent factors: full_u_new*V^t gives you 50 coefficients, coefficient i being how much your new user looks like product latent factor i. This works especially well with implicit feedback.
So, using the dot product will give you that, but it won't be scaled, which explains the high scores you are seeing.
To get usable scores you need a more accurately scaled u_new. I think you could get that using cosine similarity, like they did here:
https://github.com/apache/incubator-predictionio/blob/release/0.10.0/examples/scala-parallel-recommendation/custom-query/src/main/scala/ALSAlgorithm.scala
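A rough numpy sketch of that cosine-similarity scaling, reusing the full_u and Vt variables from the question (this is an assumption-laden illustration, not a translation of the linked Scala code):
import numpy as np

V = np.asarray(Vt)                          # item factors, shape (n_items, rank)
u_proj = np.asarray(full_u) @ V             # project the new user into the latent space, shape (rank,)
raw = V @ u_proj                            # unscaled scores per item (same values as full_u*Vt*Vt.T)
cos = raw / (np.linalg.norm(V, axis=1) * np.linalg.norm(u_proj))  # cosine similarity, bounded in [-1, 1]
top_ten = np.argsort(cos)[-10:]             # indices of the ten best-scored items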
The approach mentioned by @ScottEdwards2000 in the comment is interesting too, but rather different. You could indeed look for the most similar user(s) in your training set; if there is more than one, you could take the average. I don't think it would do too badly, but it is a really different approach, and you need the full rating matrix (to find the most similar user(s)). Getting one close user should definitely solve the scaling problem. If you manage to make both approaches work, you could compare the results!

Role of class_weight in loss functions for linearSVC and LogisticRegression

I am trying to figure out what exactly the loss function formula is and how I can manually calculate it when class_weight='auto', in the case of svm.SVC, svm.LinearSVC and linear_model.LogisticRegression.
For balanced data, say you have a trained classifier: clf_c. Logistic loss should be (am I correct?):
import numpy as np

def logistic_loss(x, y, w, b, b0):
    '''
    x: nxp data matrix where n is the number of data points and p is the number of features.
    y: nx1 vector of true labels (-1 or 1).
    w: nx1 vector of sample weights (vector of 1./n for balanced data).
    b: px1 vector of feature weights.
    b0: intercept.
    '''
    s = y
    if 0 in np.unique(y):
        print('yes')
        s = 2. * y - 1
    l = np.dot(w, np.log(1 + np.exp(-s * (np.dot(x, np.squeeze(b)) + b0))))
    return l
I realized that LogisticRegression has predict_log_proba(), which gives you exactly that when the data is balanced:
b, b0 = clf_c.coef_, clf_c.intercept_
w = np.ones(len(y))/len(y)
-clf_c.predict_log_proba(x)[np.arange(len(x)), np.floor((y + 1) / 2).astype(np.int8)].mean() == logistic_loss(x, y, w, b, b0)
Note, np.floor((y+1)/2).astype(np.int8) simply maps y=(-1,1) to y=(0,1).
But this does not work when data is imbalanced.
What's more, you expect the classifier (here, LogisticRegression) to perform similarly (in terms of loss function value) when the data is balanced and class_weight=None versus when the data is imbalanced and class_weight='auto'. I need a way to calculate the loss function (without the regularization term) for both scenarios and compare them.
In short, what does class_weight = 'auto' exactly mean? Does it mean class_weight = {-1 : (y==1).sum()/(y==-1).sum() , 1 : 1.} or rather class_weight = {-1 : 1./(y==-1).sum() , 1 : 1./(y==1).sum()}?
Any help is much much appreciated. I tried going through the source code, but I am not a programmer and I am stuck.
Thanks a lot in advance.
class_weight heuristics
I am a bit puzzled by your first proposition for the class_weight='auto' heuristic, as:
class_weight = {-1: (y == 1).sum() / (y == -1).sum(),
                 1: 1.}
is the same as your second proposition if we normalize it so that the weights sum to one.
Anyway to understand what class_weight="auto" does, see this question:
what is the difference between class weight = none and auto in svm scikit learn.
I am copying it here for later comparison:
This means that each class you have (in classes) gets a weight equal
to 1 divided by the number of times that class appears in your data
(y), so classes that appear more often will get lower weights. This is
then further divided by the mean of all the inverse class frequencies.
Note how this is not completely obvious ;).
This heuristic is deprecated and will be removed in 0.18. It will be replaced by another heuristic, class_weight='balanced'.
The 'balanced' heuristic weighs classes proportionally to the inverse of their frequency.
From the docs:
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data:
n_samples / (n_classes * np.bincount(y)).
np.bincount(y) is an array with the element i being the count of class i samples.
Here's a bit of code to compare the two:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight
n_classes = 3
n_samples = 1000
X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=10,
                           n_classes=n_classes, weights=[0.05, 0.4, 0.55])
print("Count of samples per class: ", np.bincount(y))
balanced_weights = n_samples /(n_classes * np.bincount(y))
# Equivalent to the following, using version 0.17+:
# compute_class_weight("balanced", [0, 1, 2], y)
print("Balanced weights: ", balanced_weights)
print("'auto' weights: ", compute_class_weight("auto", [0, 1, 2], y))
Output:
Count of samples per class: [ 57 396 547]
Balanced weights: [ 5.84795322 0.84175084 0.60938452]
'auto' weights: [ 2.40356854 0.3459682 0.25046327]
The loss functions
Now the real question is: how are these weights used to train the classifier?
I don't have a thorough answer here unfortunately.
For SVC and LinearSVC the docstring is pretty clear:
Set the parameter C of class i to class_weight[i]*C for SVC.
So high weights mean less regularization for the class and a higher incentive for the svm to classify it properly.
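As a small illustration of that statement (a sketch only; the weight values are roughly the 'balanced' weights printed above, and X, y come from the make_classification snippet):
from sklearn.svm import SVC

# With class_weight, the effective regularization constant for class i becomes class_weight[i] * C.
weighted_svc = SVC(C=1.0, class_weight={0: 5.85, 1: 0.84, 2: 0.61})
weighted_svc.fit(X, y)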
I do not know how they work with logistic regression. I'll try to look into it but most of the code is in liblinear or libsvm and I'm not too familiar with those.
However, note that the weights in class_weight do not directly influence methods such as predict_proba. They change its output because the classifier optimizes a different loss function.
Not sure this is clear, so here's a snippet to explain what I mean (you need to run the first one for the imports and variable definition):
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(class_weight="auto")
lr.fit(X, y)
# We get some probabilities...
print(lr.predict_proba(X))
new_lr = LogisticRegression(class_weight={0: 100, 1: 1, 2: 1})
new_lr.fit(X, y)
# We get different probabilities...
print(new_lr.predict_proba(X))
# Let's cheat a bit and hand-modify our new classifier.
new_lr.intercept_ = lr.intercept_.copy()
new_lr.coef_ = lr.coef_.copy()
# Now we get the SAME probabilities.
np.testing.assert_array_equal(new_lr.predict_proba(X), lr.predict_proba(X))
Hope this helps.
