How to optimize xgb regression model? - python-3.x

I am trying to make time series predictions using XGBoost. (XGBRegressor)
I used GrindSearchCV like this:
parameters = {'nthread': [4],
'objective': ['reg:linear'],
'learning_rate': [0.01, 0.03, 0.05],
'max_depth': [3, 4, 5, 6, 7, 7],
'min_child_weight': [4],
'silent': [1],
'subsample': [1],
'colsample_bytree': [0.7, 0.8],
'n_estimators': [500]}
xgb_grid = GridSearchCV(xgb, parameters, cv=2, n_jobs=5,
verbose=True)
xgb_grid.fit(x_train, y_train,
eval_set=[(x_train, y_train), (x_test, y_test)],
early_stopping_rounds=100,
verbose=True)
print(xgb_grid.best_score_)
print(xgb_grid.best_params_)
And got those :
0.307153826086191
{'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 4, 'min_child_weight': 4, 'n_estimators': 500, 'nthread': 4, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}
I tried implementing those parameters and calculate the error. I got this:
MSE: 4.579726929529167
MAE: 1.6753722069363144
I know that an error of 1.6 is not very good for predictions. It has to be < 0.9.
I tried to micro adjust the parameters but I have not managed to reduce error more than that.
I found something about the date format, maybe that is the problem ? My data is like this : yyyy-MM-dd HH:mm.
I am new to machine learning and that's what I managed to do after some examples and tutorials. What should I do to lower it, or what should I search for to learn ?
I mention that I found various examples like this one, but I didn't understood, and of course it did not work.

Related

Ensembling KNeighbours and Decision Tree Using Voting Classifier

I have a classification problem for which I am trying to build an ensemble using two classifiers, say for example KNeighbours, Decision Tree.In addition to this, I want to implement it using Pipeline. Now this is my attempt to the problem:
steps = [('scaler', StandardScaler()),
('regressor', VotingClassifier(estimators=[
('knn', KNeighborsClassifier()),
('clf', RandomForestClassifier())],voting='soft'))]
pipeline = Pipeline(steps)
parameters = [{'knn__n_neighbors': np.arange(1, 50)}, {
'clf__n_estimators': [10, 20, 30],
'clf__criterion': ['gini', 'entropy'],
'clf__max_features': [5, 10, 15],
'clf__max_depth': ['auto', 'log2', 'sqrt', None]}]
X_train, X_test, y_train, y_test = train_test_split(X, y.values.ravel(),
test_size=0.3, random_state=65)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
On running this following error pops up:
Invalid parameter knn for estimator
Pipeline(steps=[('scaler', StandardScaler()),
('regressor', VotingClassifier(
estimators=[('knn', KNeighborsClassifier()),
('clf', RandomForestClassifier())
]
)
)
]
).
Check the list of available parameters with `estimator.get_params().keys()`.
I belive their is some error in how I have defined the parameter grid. Please help me out in this.
Since it's nested, you'll need to specify both prefixes, like this:
parameters = [{'regressor__knn__n_neighbors': np.arange(1, 5), #} { And you'd probably want it to be a single grid?
'regressor__clf__n_estimators': [10, 20, 30],
'regressor__clf__criterion': ['gini', 'entropy'],
'regressor__clf__max_depth': [5, 10, 15],
'regressor__clf__max_features': ['log2', 'sqrt', None]}]
Also, your max_depth and max_features values switched their supposed places somehow, fixed that. (And 'auto' does the same as 'sqrt', at least for the recent versions.)

Invalid parameter learning_rate for estimator RegressorChain(base_estimator=XGBRegressor

I am trying to apply RandomizedSearchCV on a RegressorChain XGBoost model but I got an error : Invalid parameter learning_rate for estimator RegressorChain(base_estimator=XGBRegressor.
If I comment all the values in grid dict, it works otherwise it doesn't accept any param.
Same models (XGBRegressor and RegressorChain) are working fine alone. The RandomizedSearchCV is not accepting the the params in grid dict
# Setup the parameters grid
grid = {
'n_estimators': [100, 500, 1000],
'max_depth': [5, 10, 20, 30],
'max_features': ["auto", "sqrt"],
'eta': [0.09, 0.1, 0.2],
'booster': ["dart", "gblinear"]
}
clf = XGBRegressor(objective='reg:squarederror')
chain = RegressorChain(base_estimator=clf, order=[0, 1, 2, 3, 4,5])
# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=chain,
param_distributions=grid,
n_iter=10, # number of models to try
cv=5,
verbose=1,
random_state=42,
refit=True)
# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train) # 'rs' is short
Since the XGBRegressor is the base_estimator of RegressorChain, the parameters of XGBRegressor become nested and must be addressed with base_estimator__xxx:
grid = {
'base_estimator__n_estimators': [100, 500, 1000],
'base_estimator__max_depth': [5, 10, 20, 30],
'base_estimator__max_features': ["auto", "sqrt"],
'base_estimator__eta': [0.09, 0.1, 0.2],
'base_estimator__booster': ["dart", "gblinear"]
}

PyTorch - Change weights of Conv2d

For some reason, I cannot seem to assign all the weights of a Conv2d layer in PyTorch - I have to do it in two steps. Can anyone help me with what I am doing wrong?
layer = torch.nn.Conv2d(in_channels=1, out_channels=2, kernel_size=(2,2), stride=(2,2))
layer.state_dict()['weight']
gives me a tensor of size (2,1,2,2)
tensor([[[[ 0.4738, -0.2197],
[-0.3436, -0.0754]]],
[[[ 0.1662, 0.4098],
[-0.4306, -0.4828]]]])
When I try to assign weights like so
layer.state_dict()['weight'] = torch.tensor([
[[[ 1, 2],
[3, 4]]],
[[[-1, -2],
[-3, -4]]]
])
the weights don't change. However, if I do something like this
layer.state_dict()['weight'][0] = torch.tensor([
[[[1, 2],
[3, 4]]],
])
layer.state_dict()['weight'][1] = torch.tensor([
[[[-1, -2],
[-3, -4]]],
])
The weights change. Why is this so?
I'm not sure about why you can't directly assign them but the more proper way to achieve what you're trying to do would be
layer.load_state_dict({'weight': torch.tensor([[[[0.4738, -0.2197],
[-0.3436, -0.0754]]],
[[[0.1662, 0.4098],
[-0.4306, -0.4828]]]])}, strict=False)

GridSearchCV gives different result

I have applied a gridsearchCV with estimators of DecisionTreeClassifier, RandomForestClassifier, LogisticRegression, XGBClassifier used all of them in ensemble learning.
The result that gridSearchCV will give with all these estimators are different in my system and my friend's system with same data of testing and training, I don't know why?
We are using same data for training and testing but gridsearch is giving different result for these data in both the system, just want to know what should be changed to make the system to give same result on any system?
gs_dt = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42,class_weight={1:10, 0:1}),
param_grid=[{'max_depth': [ 2, 4, 6, 8, 10],
'criterion':['gini','entropy'],
"max_features":["auto", None],
"max_leaf_nodes":[10,20,30,40]}],
scoring=scoring,
cv=10,
refit='recall')
gs_rf = GridSearchCV(estimator=RandomForestClassifier(n_jobs=-1, oob_score = True,class_weight={1: 10/11, 0: 1/11}),
param_grid=[{'max_depth': [4, 6, 8, 10, 12, 16, 20, None],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [2, 4, 8],
'min_samples_split': [10, 20]}],
scoring=scoring,
cv=10,
n_jobs=4,
refit='recall')
gs_lr = GridSearchCV(estimator=LogisticRegression(multi_class='ovr',random_state=42,class_weight={1:10, 0:1}),
param_grid=[{'C': [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1 ,1],
'penalty':['l1','l2']}],
scoring=scoring,
cv=10,
refit='recall')
gs_gb = GridSearchCV(estimator=XGBClassifier(n_jobs=-1),
param_grid=[{'learning_rate': [0.01, 0.05, 0.1, 0.2],
'max_depth': [4, 6, 8, 10, 12, 16, 20],
'min_samples_leaf': [4, 8, 12, 16, 20],
'max_features': ['auto', 'sqrt']}],
scoring=scoring,
cv=10,
n_jobs=4,
refit='recall')
For example first gridsearchcv gives this result on my system:
DecisionTreeClassifier(class_weight={1: 10, 0: 1}, criterion='gini',
max_depth=8, max_features=None, max_leaf_nodes=10,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=42,
splitter='best')
And on my friend's system it gives:
DecisionTreeClassifier(class_weight={0: 1, 1: 10}, criterion='gini',
max_depth=10, max_features=None, max_leaf_nodes=10,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=42, splitter='best')
Similarly I got different result on my and my friend's system.

Extract best_params after running GridSearch within a pipeline

So I ran a very thorough GridSearch with 10-fold cross-val in an integrated pipeline in the following manner-
pipeline_rf = Pipeline([
('standardize', MinMaxScaler()),
('grid_search_lr', GridSearchCV(
RandomForestClassifier(),
param_grid={'bootstrap': [True],
'max_depth': [50, 100, 150, 200],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10],
'n_estimators': [100, 200, 500, 1000, 1500]},
cv=10,
n_jobs=-1,
scoring='roc_auc',
verbose=2,
refit=True
))
])
pipeline_rf.fit(X_train, y_train)
How should I go about extracting the best set of parameters?
You first need to get the gridSearchCV object from the pipeline, and then call best_params_ on it. This can be done by:
pipeline_rf.named_steps['grid_search_lr'].best_params_

Resources