How to perform Grid Search on a multi-label classification problem? - scikit-learn

I am learning Scikit-Learn and I am trying to perform a grid search on a multi-label classification problem. This is what I wrote:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

param_grid = [
    {'randomforestclassifier__n_estimators': [3, 10, 30],
     'randomforestclassifier__max_features': [2, 4, 5, 8]},
    {'randomforestclassifier__bootstrap': [False],
     'randomforestclassifier__n_estimators': [3, 10],
     'randomforestclassifier__max_features': [2, 3, 4]}
]

rf_classifier = OneVsRestClassifier(
    make_pipeline(RandomForestClassifier(random_state=42))
)

grid_search = GridSearchCV(rf_classifier, param_grid=param_grid, cv=5, scoring='f1_micro')
grid_search.fit(X_train_prepared, y_train)
However, when I run it, I get the following error:
ValueError: Invalid parameter randomforestclassifier for estimator
OneVsRestClassifier(estimator=Pipeline(steps=[('randomforestclassifier',
RandomForestClassifier(random_state=42))])). Check the list of available parameters
with `estimator.get_params().keys()`.
I also tried running grid_search.estimator.get_params().keys(), but I just get a list of parameters containing the ones I have written, so I am not sure what I should do.
Would you be able to suggest what the issue is and how I can run the grid search properly?

You would have been able to define param_grid as you did if rf_classifier were a Pipeline object. Quoting the Pipeline docs:
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a '__'.
In your case, instead, rf_classifier is a OneVsRestClassifier instance. Therefore, before setting the parameters of the RandomForestClassifier, you need to reach the pipeline instance, which you can do via the OneVsRestClassifier's estimator parameter, as follows:
param_grid = [
    {'estimator__randomforestclassifier__n_estimators': [3, 10, 30],
     'estimator__randomforestclassifier__max_features': [2, 4, 5, 8]},
    {'estimator__randomforestclassifier__bootstrap': [False],
     'estimator__randomforestclassifier__n_estimators': [3, 10],
     'estimator__randomforestclassifier__max_features': [2, 3, 4]}
]
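For reference, here is a minimal, self-contained sketch of the corrected grid search. The synthetic dataset from make_multilabel_classification is an assumption for illustration and is not part of the original question:

from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Synthetic multi-label data (8 features, 3 labels) just to make the sketch runnable.
X, y = make_multilabel_classification(n_samples=200, n_features=8,
                                      n_classes=3, random_state=42)

rf_classifier = OneVsRestClassifier(
    make_pipeline(RandomForestClassifier(random_state=42))
)

# Note the extra 'estimator__' prefix needed to reach inside OneVsRestClassifier.
param_grid = [
    {'estimator__randomforestclassifier__n_estimators': [3, 10, 30],
     'estimator__randomforestclassifier__max_features': [2, 4, 5, 8]},
]

grid_search = GridSearchCV(rf_classifier, param_grid=param_grid,
                           cv=5, scoring='f1_micro')
grid_search.fit(X, y)
print(grid_search.best_params_)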

Related

ValueError: Class label 1 not present when specifying class_weight in RandomForestClassifier with k-fold cv

I'm doing a binary classification on time-series data. Since it's for an academic project, I want to test classical ML models such as RandomForestClassifier as well.
However, while using TimeSeriesSplit k-fold cross-validation, it is possible that a training fold contains only one class instead of both, which raises a ValueError.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(class_weight={1:10, 0:1})
rfc.fit([[0, 0, 1], [1, 0, 1]], [0, 0])
This gives,
ValueError: Class label 1 not present.
I know it doesn't make sense to train with only one label, but then it works fine if we don't specify class_weight. Is this a bug?
How do I get around this programmatically if I'm automating my testing?
I think this issue was fixed on 11 Mar 2022.
I am able to run the code below without any problem today:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(class_weight={1: 10, 0: 1})

# Both classes present in the labels:
rfc.fit([[0, 0, 1], [1, 0, 1]], [0, 1])
rfc.predict([[1, 1, 1]])
Output: array([1])

# Only one class present in the labels:
rfc.fit([[0, 0, 1], [1, 0, 1]], [0, 0])
rfc.predict([[1, 1, 1]])
Output: array([0])
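If you are stuck on an older scikit-learn release where this still fails, a minimal sketch of a programmatic workaround (my own suggestion; the helper name make_fold_weights is hypothetical, not a library function) is to keep only the weights for the classes that actually appear in each training fold:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

full_weights = {0: 1, 1: 10}

def make_fold_weights(y_fold):
    # Keep only the weights for classes that actually occur in this fold.
    present = np.unique(y_fold)
    return {cls: full_weights[cls] for cls in present}

y_fold = np.array([0, 0])
rfc = RandomForestClassifier(class_weight=make_fold_weights(y_fold))
rfc.fit([[0, 0, 1], [1, 0, 1]], y_fold)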

PyTorch - Change weights of Conv2d

For some reason, I cannot seem to assign all the weights of a Conv2d layer in PyTorch - I have to do it in two steps. Can anyone help me with what I am doing wrong?
layer = torch.nn.Conv2d(in_channels=1, out_channels=2, kernel_size=(2,2), stride=(2,2))
layer.state_dict()['weight']
gives me a tensor of size (2,1,2,2)
tensor([[[[ 0.4738, -0.2197],
[-0.3436, -0.0754]]],
[[[ 0.1662, 0.4098],
[-0.4306, -0.4828]]]])
When I try to assign weights like so
layer.state_dict()['weight'] = torch.tensor([
[[[ 1, 2],
[3, 4]]],
[[[-1, -2],
[-3, -4]]]
])
the weights don't change. However, if I do something like this
layer.state_dict()['weight'][0] = torch.tensor([
[[[1, 2],
[3, 4]]],
])
layer.state_dict()['weight'][1] = torch.tensor([
[[[-1, -2],
[-3, -4]]],
])
The weights change. Why is this so?
I'm not sure of the exact reason you can't assign them directly, but a likely explanation is that state_dict() builds and returns a new dict each time it is called: assigning a new tensor to a key of that temporary dict only rebinds that dict entry, whereas indexing into the tensor (['weight'][0] = ...) mutates the underlying parameter in place. Either way, a more proper way to achieve what you're trying to do would be
layer.load_state_dict({'weight': torch.tensor([[[[ 0.4738, -0.2197],
                                                 [-0.3436, -0.0754]]],
                                               [[[ 0.1662,  0.4098],
                                                 [-0.4306, -0.4828]]]])},
                      strict=False)
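For completeness, an alternative sketch (not from the original answer) is to modify the parameter tensor in place under torch.no_grad(), which is a common way to set weights by hand:

import torch

layer = torch.nn.Conv2d(in_channels=1, out_channels=2,
                        kernel_size=(2, 2), stride=(2, 2))

# Overwrite the (2, 1, 2, 2) weight tensor in place; no_grad avoids autograd tracking.
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[[[ 1.,  2.],
                                       [ 3.,  4.]]],
                                     [[[-1., -2.],
                                       [-3., -4.]]]]))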

Grid search and XGBClassifier using class weights

I am trying to use scikit-learn GridSearchCV together with XGBoost XGBClassifier wrapper for my unbalanced multi-class classification problem. So far I have used a list of class weights as an input for the scale_pos_weight argument, but this does not seem to work as all my predictions are for the majority class. This is probably because in the documentation of the XGBClassifier it is mentioned that scale_pos_weight can only be used for binary classification problems.
So my question is, how can I input sample/class weights for a multi-class classification task using scikit-learn GridSearchCV?
My code is below:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.utils.class_weight import compute_class_weight

class_weights = compute_class_weight('balanced', np.unique(training_targets),
                                     training_targets[target_label[0]])

random_state = np.random.randint(0, 1000)

parameters = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.2, 0.3],
    'n_estimators': [50, 100, 150],
    'gamma': [0, 0.1, 0.2],
    'min_child_weight': [0, 0.5, 1],
    'max_delta_step': [0],
    'subsample': [0.7, 0.8, 0.9, 1],
    'colsample_bytree': [0.6, 0.8, 1],
    'colsample_bylevel': [1],
    'reg_alpha': [0, 1e-2, 1, 1e1],
    'reg_lambda': [0, 1e-2, 1, 1e1],
    'base_score': [0.5]
}

xgb_model = xgb.XGBClassifier(scale_pos_weight=class_weights, silent=True,
                              random_state=random_state)
clf = GridSearchCV(xgb_model, parameters, scoring='f1_micro', n_jobs=-1, cv=5)
clf.fit(training_features, training_targets.values[:, 0])
model = clf.best_estimator_
scale_pos_weight is only for binary classification, so it won't work on multi-class classification tasks.
For your case, it's more advisable to use the weight parameter, as described here (https://xgboost.readthedocs.io/en/latest/python/python_api.html). The argument is an array in which each element represents the weight you assign to the corresponding data point.
The idea is essentially to manually assign different weights to different classes. There is no standard for how you assign the weights; it's up to you. The more weight a sample is assigned, the more it affects the objective function during training.
However, if you use the scikit-learn API, you cannot specify the weight parameter, nor can you use the DMatrix format. Thankfully, xgboost has its own cross-validation function, which you can find details of here: https://xgboost.readthedocs.io/en/latest/python/python_api.html
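To illustrate that route, here is a minimal sketch using the native xgboost API, where per-sample weights are attached to a DMatrix and xgb.cv runs the cross-validation; the synthetic data and the chosen weight values are assumptions for illustration, not part of the original answer:

import numpy as np
import xgboost as xgb

# Toy 3-class data; minority classes (1 and 2) get a larger weight.
X = np.random.rand(100, 5)
y = np.random.randint(0, 3, size=100)
weights = np.where(y == 0, 1.0, 4.0)

dtrain = xgb.DMatrix(X, label=y, weight=weights)
params = {'objective': 'multi:softmax', 'num_class': 3, 'max_depth': 3}

cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5)
print(cv_results.tail())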
I suggest that you use the compute_sample_weight() function and set a weight for each sample by looking at your labels. This solves your problem in the most elegant way. See below for 3 classes (-1, 0, 1):
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils.class_weight import compute_sample_weight

sample_weights = compute_sample_weight({-1: 4, 0: 1, 1: 4}, Train_Labels)

random_search = RandomizedSearchCV(model, param_distributions=params, n_iter=param_comb,
                                   return_train_score=True, scoring=score, cv=ps,
                                   n_jobs=-1, verbose=3, random_state=1001)
random_search.fit(Train, Train_Labels, sample_weight=sample_weights)
In a multi-class setup, we need to pass a sample_weight parameter, a list of values (weights) matching the number of data points (for example, the number of rows in X_train), to fit() of XGBClassifier. Check the docs.
While using XGBClassifier with scikit-learn's GridSearchCV, you can pass sample_weight directly to the fit() of GridSearchCV.
Note: tested in scikit-learn version 1.1.1; not sure from which version onwards this is supported.
For example:
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

def get_weights(cls):
    class_weights = {
        # class labels based on your dataset
        0: 1,
        1: 4,
        2: 1,
    }
    return [class_weights[cl] for cl in cls]

grid = {
    "max_depth": [3, 4, 5, 6],
    "n_estimators": range(20, 70, 10),
    "learning_rate": np.arange(0.25, 0.50, 0.05),
}

xgb_clf = XGBClassifier(random_state=42, n_jobs=-1)
xgb_cvm = GridSearchCV(estimator=xgb_clf, param_grid=grid, n_jobs=-1, cv=5)
xgb_cvm.fit(X, y, sample_weight=get_weights(y))

Linear Regression using RNN tensorflow python

Can anyone please cite an example of how linear regression can be implemented using an RNN in pure TensorFlow, rather than with the Keras API?
Eg:
x_train = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
I'll divide them into 5 batches of 2 elements each, so x_train will be of shape [5, 2]:
[1, 2]
[3, 4]
[5, 6]...
y_train = [3, 5, 7, ...]
The challenge is that I should train the model on 1 and 2, and from that it should infer 3; likewise, from 3 and 4 it should predict 5, and so on. I have tried a way around it, but I had problems with shapes and the losses were really high.
The same problem in a deep network works perfectly with around 100 epochs, 100 data points and a loss of 0.01 (MSE).
Can anyone please help me?
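As a starting point, here is a small sketch (my own, not from the question) of just the data layout described above, using NumPy only; the RNN model itself is not shown:

import numpy as np

# Reshape the flat series into 5 input pairs and their targets, as described above.
x_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.float32)
x_train = x_train.reshape(5, 2)       # shape [5, 2]: [[1, 2], [3, 4], ...]
y_train = x_train[:, 1] + 1           # [3, 5, 7, 9, 11]: the value following each pair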

What n_estimators and max_features means in RandomForestRegressor

I was reading about fine-tuning the model using GridSearchCV and I came across the parameter grid shown below:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)

# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
Here I am not getting the concept of n_estimators and max_features. Does n_estimators mean the number of records taken from the data, and max_features the number of attributes to be selected from the data?
After going further I got this result:
>> grid_search.best_params_
{'max_features': 8, 'n_estimators': 30}
So the thing is, I am not getting what this result actually wants to say.
After reading the documentation for RandomForestRegressor, you can see that n_estimators is the number of trees to be used in the forest. Since random forest is an ensemble method that builds multiple decision trees, this parameter controls how many trees are used in the process.
max_features, on the other hand, determines the maximum number of features to consider while looking for a split. For more information on max_features, read this answer.
n_estimators: the number of trees you want to build before taking the maximum vote or the average of the predictions (each tree works on a sample of the data, and their outputs are aggregated to give you the final answer). A higher number of trees gives you better performance but makes your code slower.
max_features: the number of features to consider when looking for the best split.
>> grid_search.best_params_ returns {'max_features': 8, 'n_estimators': 30}
This means that, among the candidates n_estimators {3, 10, 30} and max_features {2, 4, 6, 8}, the best hyperparameters found are n_estimators=30 and max_features=8, so you should run your model with those values.
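As a follow-up to that result, here is a minimal sketch (reusing the housing_prepared and housing_labels variables from the question) of how the selected hyperparameters are typically used after the search:

from sklearn.ensemble import RandomForestRegressor

# best_estimator_ is already refit on the full training set (refit=True by default).
best_model = grid_search.best_estimator_
predictions = best_model.predict(housing_prepared)   # illustrative only

# Equivalent: rebuild the regressor explicitly from the chosen hyperparameters.
forest_reg = RandomForestRegressor(random_state=42, **grid_search.best_params_)
forest_reg.fit(housing_prepared, housing_labels)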