Optimization of predictions from sklearn model (e.g. RandomForestRegressor)

Has anyone used optimization methods on fitted sklearn models?
What I'd like to do is fit a model on training data and then, using this fitted model, find the combination of input values for which the model predicts the largest value.
Here is some simplified example code:
import pandas as pd

df = pd.DataFrame({
    'temperature': [10, 15, 30, 20, 25, 30],
    'working_hours': [10, 12, 12, 10, 30, 15],
    'sales': [4, 7, 6, 7.3, 10, 8]
})

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
X = df.drop(['sales'], axis=1)
y = df['sales']
model.fit(X, y)
Our baseline is a simple loop that predicts over all combinations of the variables:
import numpy as np

results = pd.DataFrame(columns=['temperature', 'working_hours', 'sales_predicted'])
for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        results = pd.concat([
            results,
            pd.DataFrame({
                'temperature': temp,
                'working_hours': work_hours,
                'sales_predicted': model.predict(np.array([temp, work_hours]).reshape(1, -1))
            })
        ])
print(results.sort_values(by='sales_predicted', ascending=False))
With this approach it is difficult or impossible to:
* run it quickly (it is brute force)
* implement constraints involving dependencies between two or more variables
We tried the PuLP and Pyomo libraries, but neither allows using model.predict as the objective function; both return the error:
TypeError: float() argument must be a string or a number, not 'LpVariable'
Does anyone have an idea how we can get rid of the loop and use something more suitable?

When people talk about optimizing fitted sklearn models, they usually mean maximizing accuracy or other performance metrics. So if you are trying to maximize the predicted value instead, you can definitely make your code more efficient, as shown below.
You are collecting all the predictions in a big results DataFrame and then sorting it. Instead, you can track the running maximum of your target variable (sales_predicted) on the fly with a simple if check. Just change your loop to this:
max_sales_predicted = 0
for temp in np.arange(1, 100.01, 1):
    for work_hours in np.arange(1, 60.01, 1):
        sales_predicted = model.predict(np.array([temp, work_hours]).reshape(1, -1))
        if sales_predicted > max_sales_predicted:
            max_sales_predicted = sales_predicted
            desired_temp = temp
            desired_work_hours = work_hours
This way you only keep a combination when it produces a prediction that exceeds the current maximum, and do nothing otherwise.
The result of my code is the same as yours, i.e. a max_sales_predicted value of 9.2. Also, desired_temp and desired_work_hours now give you the combination that produces that maximum. Hope this helps.
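If you also want to avoid the brute-force grid entirely, here is a minimal sketch (my own suggestion, not part of the original answer) using scipy.optimize.differential_evolution on the fitted model from the question. A gradient-free global optimizer is a reasonable fit here because a random forest's prediction surface is piecewise constant and non-differentiable; SciPy minimizes, so the prediction is negated:
import numpy as np
from scipy.optimize import differential_evolution

def negative_sales(x):
    # x = [temperature, working_hours]; negated because SciPy minimizes
    return -model.predict(np.array(x).reshape(1, -1))[0]

bounds = [(1, 100), (1, 60)]  # same search ranges as the loops above
result = differential_evolution(negative_sales, bounds, seed=42)
print(result.x, -result.fun)  # best [temperature, working_hours] and its predicted sales
Newer SciPy versions also accept a constraints= argument here (e.g. scipy.optimize.NonlinearConstraint), which would let you express dependencies between two or more variables that the nested loop cannot.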

Related

Why does sklearn.model_selection.GridSearchCV not have a consistent result?

I changed the code from https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html slightly, so it looks like this:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [10, 20, 15, 4]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)
clf.best_params_
Then the result is:
{'C': 10, 'kernel': 'rbf'}
But if I change the code to:
parameters = {'kernel': ('linear', 'rbf'), 'C': [4, 10, 20, 15]}
You can see the only change is the order of the C list. But the result is:
{'C': 4, 'kernel': 'rbf'}
It looks like GridSearchCV just uses the first parameter combination.
So I have a few questions about this:
In this case, scoring is the default (None), so which scoring function is actually used here? And why does the above situation happen?
As far as I know, when we use LatentDirichletAllocation with GridSearchCV, the scoring function is the log-likelihood even when scoring=None. If I understand correctly, does GridSearchCV automatically pick a scoring function depending on the model?
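One way to check what is happening (a sketch of mine, continuing from the snippet above, not part of the original question): with scoring=None, GridSearchCV falls back to the estimator's own score method (mean accuracy for SVC), and ties on mean_test_score are broken by grid order, so reordering the C list changes which tied combination ends up as best_params_. Inspecting cv_results_ makes the ties visible:
import pandas as pd

# clf is the fitted GridSearchCV object from the code above
scores = pd.DataFrame(clf.cv_results_)
print(scores[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score'))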

What is the role of using the OneVsRestClassifier wrapper around XGBClassifier?

I have a multiclass classification problem with 3 classes.
0 - on a given day (24h) my laptop battery did not die
1 - on a given day my laptop battery died before 12AM
2 - on a given day my laptop battery died at or after 12AM
(Note that these categories are mutually exclusive. The battery is not recharged once it died)
I am interested in the predicted probability for each of the 3 classes. More specifically, I intend to derive 2 types of warning:
If the prediction for class 1 is higher than a threshold x: 'Your battery is at risk of dying in the morning.'
If the prediction for class 2 is higher than a threshold y: 'Your battery is at risk of dying in the afternoon.'
I can generate the probabilities by using xgboost.XGBClassifier with the appropriate parameters for a multiclass problem.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from xgboost import XGBClassifier

X = np.array([
    [10, 10],
    [8, 10],
    [-5, 5.5],
    [-5.4, 5.5],
    [-20, -20],
    [-15, -20]
])
y = np.array([0, 1, 1, 1, 2, 2])

clf1 = XGBClassifier(objective='multi:softprob', num_class=3, seed=42)
clf1.fit(X, y)
clf1.predict_proba([[-19, -20]])
Results:
array([[0.15134096, 0.3304505 , 0.51820856]], dtype=float32)
But I can also wrap this in sklearn.multiclass.OneVsRestClassifier, which then produces slightly different results:
clf2 = OneVsRestClassifier(XGBClassifier(objective='multi:softprob', num_class=3, seed=42))
clf2.fit(X, y)
clf2.predict_proba([[-19, -20]])
Results:
array([[0.10356173, 0.34510303, 0.5513352 ]], dtype=float32)
I was expecting the two approaches to produce the same results. My understanding was that XGBClassifier is also based on a one-vs-rest approach in a multiclass case, since there are 3 probabilities in the output and they sum up to 1.
Can you tell me where the difference comes from, and how the respective results should be interpreted? And most importantly, which approach is better suited to solve my problem?
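For illustration, here is a rough sketch (my own, not from the original post) of the conceptual difference: OneVsRestClassifier fits one binary classifier per class and renormalizes the positive-class probabilities, whereas objective='multi:softprob' trains a single model with a softmax across all 3 classes, so the two need not agree:
import numpy as np
from xgboost import XGBClassifier

# Reuses X and y from the snippet above; a simplified version of what
# OneVsRestClassifier does internally.
binary_probs = []
for k in [0, 1, 2]:
    clf_k = XGBClassifier(seed=42)      # binary problem: class k vs the rest
    clf_k.fit(X, (y == k).astype(int))
    binary_probs.append(clf_k.predict_proba([[-19, -20]])[0, 1])

probs = np.array(binary_probs)
probs /= probs.sum()                    # renormalize so the 3 values sum to 1
print(probs)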

fit_transform vs transform when doing inference

I have trained a Keras model and saved it. I now want to use the model in a web app for inference. I want to preprocess the inputs by scaling them with StandardScaler() from sklearn.
But whenever I run transform(inputs), an error occurs telling me to fit the scaler first. This was the code:
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()  # newly created, never fitted
inputs = [1, 8, 0, 0, 4, 18, 4, 3, 576, 9, 8, 8, 14, 1, 0, 4, 0, 0, 3, 6, 0, 1, 1]
inputs = scale.transform([inputs])  # raises NotFittedError: the scaler was never fitted
preds = model.predict(inputs, batch_size=1)
I then changed the code in order to do the fitting:
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()
inputs = [1, 8, 0, 0, 4, 18, 4, 3, 576, 9, 8, 8, 14, 1, 0, 4, 0, 0, 3, 6, 0, 1, 1]
inputs = scale.fit_transform([inputs])  # fits on a single row, so every column scales to 0
preds = model.predict(inputs, batch_size=1)
It worked, but the scaled data are all zeros regardless of the inputs I provide, which leads to wrong predictions. I'm certain I'm missing some key concept here, so I'm asking for help. Thank you.
The standard scaler uses the formula:
z = (x - u) / s
Here,
x: element
u: mean
s: standard deviation
This transformation is done column-wise.
Therefore, when you call fit, the mean and standard deviation of each column are calculated.
Eg:
from sklearn.preprocessing import StandardScaler
import numpy as np
x = np.random.randint(50,size = (10,2))
x
Output:
array([[26, 9],
[29, 39],
[23, 26],
[29, 22],
[28, 41],
[11, 6],
[42, 40],
[ 1, 25],
[ 0, 39],
[44, 45]])
Now, fitting the standard scaler
scale = StandardScaler()
scale.fit(x)
You can see the mean and standard deviation using the fitted attributes of the StandardScaler object:
# Mean
scale.mean_ # array([23.3, 29.2])
# Standard Deviation
scale.scale_ # array([14.36697602, 13.12859475])
You transform these values using the transform method.
scale.transform(x)
Output:
array([[ 0.18793099, -1.53862621],
[ 0.3967432 , 0.74646222],
[-0.02088122, -0.24374277],
[ 0.3967432 , -0.54842122],
[ 0.32713913, 0.89880145],
[-0.85613006, -1.76713506],
[ 1.3015961 , 0.82263184],
[-1.55217075, -0.31991238],
[-1.62177482, 0.74646222],
[ 1.44080424, 1.20347991]])
Calculation for 1st element:
z = (26 - 23.3) / 14.36697602
z = 0.18793099
How to use this?
The transformation should be done before training your model, and the training should be done on the transformed data. For prediction, the test data must be scaled with the same mean and standard deviation values as the training data, i.e. do not call fit on the test data. Use the same object that was fitted on the training data to transform your test data.
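A minimal sketch of that workflow, assuming the training data lives in a (hypothetical) X_train array and using joblib to persist the fitted scaler so the web app can reuse it at inference time:
import numpy as np
from joblib import dump, load
from sklearn.preprocessing import StandardScaler

# Training time: fit on the training data only, then save the fitted scaler.
scale = StandardScaler()
X_train_scaled = scale.fit_transform(X_train)
dump(scale, 'scaler.joblib')

# Inference time (e.g. in the web app): load the scaler and only transform.
scale = load('scaler.joblib')
inputs = np.array([[1, 8, 0, 0, 4, 18, 4, 3, 576, 9, 8, 8, 14, 1, 0,
                    4, 0, 0, 3, 6, 0, 1, 1]])   # one sample, shape (1, 23)
inputs_scaled = scale.transform(inputs)          # uses the training mean and std
preds = model.predict(inputs_scaled, batch_size=1)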

XGRegressor not fitting data

I would like to run a CV for an XGBoost tree regression on my X_train, y_train data. My target consists of integer values from 25 to 40. I tried to run this code on my training dataset:
# A parameter grid for XGBoost
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

cv_params = {
    'min_child_weight': [1, 3, 5],
    'gamma': [0.5, 1, 2, 3],
    'subsample': [i/10.0 for i in range(6, 11)],
    'colsample_bytree': [i/10.0 for i in range(6, 11)],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.02, 0.1]
}

# Initialize XGB
xgb_for_gridsearch = XGBRegressor(
    n_estimators=1000,
    objective='reg:logistic',
    seed=7
)

# Initialize GridSearch
xgb_grid = GridSearchCV(
    estimator=xgb_for_gridsearch,
    param_grid=cv_params,
    scoring='explained_variance',
    cv=5,
    n_jobs=-1
)

xgb_grid.fit(X_train, y_train)
xgb_grid.grid_scores_
I get an error at fit().
I kind of expected the CV to just take forever, but not an error. The error output is a couple of thousand lines long, so I will just include the part that relates to my code:
During handling of the above exception, another exception occurred:
JoblibXGBoostError Traceback (most recent call last)
<ipython-input-44-a5c1d517107d> in <module>()
25 )
26
---> 27 xgb_grid.fit(X_train, y_train)
Does anyone know what this relates to?
Am I using conflicting parameters?
Would it be better to use xgboost.cv()?
I can also add the whole error code if that would help; should I just add it at the bottom of this question?
UPDATE: added the error to a Gist, as suggested (XGRegressor_not_fitting_data), since it is too long.
Thanks for adding the full error code; it makes it easier to help you.
A GitHub repo is fine, but you may find it easier to use https://gist.github.com/ or https://pastebin.com/
Note that the most helpful line of a full error trace is generally the last one, which here contains:
label must be in [0,1] for logistic regression
It seems you have used logistic regression (objective = 'reg:logistic' in your code), which is a classification loss, and so it requires y_train to be an array of 0s and 1s.
You can easily fix it with something like
y_train_bin = (y_train == 1).astype(int)
xgb_grid.fit(X_train, y_train_bin)
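That said, since the target in the question consists of integer values from 25 to 40, this looks like a genuine regression problem, so switching to a regression objective (my suggestion, not part of the original answer) is probably closer to the intent than binarizing the labels:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Hypothetical alternative: keep the continuous target and use a regression loss.
xgb_for_gridsearch = XGBRegressor(
    n_estimators=1000,
    objective='reg:squarederror',  # standard regression objective in recent xgboost
    seed=7
)
xgb_grid = GridSearchCV(
    estimator=xgb_for_gridsearch,
    param_grid=cv_params,          # same grid as in the question
    scoring='explained_variance',
    cv=5,
    n_jobs=-1
)
xgb_grid.fit(X_train, y_train)     # y_train can stay as integers from 25 to 40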

How does sklearn compute the accuracy score step by step?

I was reading about the metrics used in sklearn, but I found the following pretty confusing:
In the documentation, sklearn provides an example of its usage as follows:
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
0.5
I understood that sklearn computes that metric as follows (the formula from the documentation):
accuracy(y, yhat) = (1 / n_samples) * sum over i of 1(yhat_i == y_i)
I am not sure about the process; I would appreciate it if someone could explain this result step by step, since I found it hard to understand. To understand more, I tried the following case:
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3,0]
y_true = [0, 1, 2, 3,0]
print(accuracy_score(y_true, y_pred))
0.6
And I supposed that the correct computation would be the following: matches at positions 0, 3 and 4, so (1 + 0 + 0 + 1 + 1) / 5 = 3/5 = 0.6,
but I am not sure about it. I would like someone to walk me through the computation rather than copy and paste sklearn's documentation.
I also doubt whether the i in the summation is the same as the i inside the indicator's parentheses; it is unclear to me whether the number of terms in the summation is just the number of elements in the sample or whether it also depends on the number of classes.
The indicator function equals one only if its two arguments are equal; otherwise its value is zero. Therefore, when y is equal to yhat, the indicator function produces a one, counting as a correct classification. There is a Python code example and a numerical example below.
import numpy as np

yhat = np.array([0, 2, 1, 3])
y = np.array([0, 1, 2, 3])
acc = np.mean(y == yhat)  # fraction of positions where the two arrays agree
print(acc)  # 0.5
Numerical example: 1(0=0) + 1(1=2) + 1(2=1) + 1(3=3) = 1 + 0 + 0 + 1 = 2 matches, and 2/4 = 0.5.
A simple way to understand the calculation of the accuracy is:
Given two lists, y_pred and y_true, for every position index i, compare the i-th element of y_pred with the i-th element of y_true and perform the following calculation:
Count the number of matches
Divide it by the number of samples
So using your own example:
y_pred = [0, 2, 1, 3, 0]
y_true = [0, 1, 2, 3, 0]
We see matches on indices 0, 3 and 4. Thus:
number of matches = 3
number of samples = 5
Finally, the accuracy calculation:
accuracy = matches/samples
accuracy = 3/5
accuracy = 0.6
And for your question about the i index, it is the sample index, so it is the same for both the summation index and the Y/Yhat index.
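For completeness, the same check written in plain NumPy for the 5-element case (a quick sketch):
import numpy as np

y_pred = np.array([0, 2, 1, 3, 0])
y_true = np.array([0, 1, 2, 3, 0])
matches = (y_true == y_pred)        # [True, False, False, True, True]
print(matches.sum(), matches.size)  # 3 matches out of 5 samples
print(matches.mean())               # 0.6, identical to accuracy_score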
