I have been working on the script below for random forest classification and am running into problems with the performance of the randomized search: it is taking a very long time to complete, and I wonder whether I am doing something wrong or whether there is something I could do better to make it faster.
Would anybody be able to suggest speed/performance improvements I could make?
Thanks in advance!
import os
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

available_processor_count = os.cpu_count()  # number of CPU cores to parallelise over

forest_start_time = time.time()
model = RandomForestClassifier()
param_grid = {
    'bootstrap': [True, False],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [200, 300, 500, 1000]
}
bestforest = RandomizedSearchCV(estimator=model,
                                param_distributions=param_grid,
                                cv=3, n_iter=10,
                                n_jobs=available_processor_count)
bestforest.fit(train_features, train_labels.ravel())
forest_score = bestforest.score(test_features, test_labels.ravel())
print(forest_score)
forest_end_time = time.time()
forest_duration = forest_end_time - forest_start_time  # elapsed seconds
The only ways to speed this up are to 1) reduce the number of features, and/or 2) use all available CPU cores with n_jobs = -1:
bestforest = RandomizedSearchCV(estimator=model,
                                param_distributions=param_grid,
                                cv=3, n_iter=10,
                                n_jobs=-1)
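To illustrate the first suggestion, here is a hypothetical sketch of shrinking the feature set before the search; SelectKBest, f_classif, and k=20 are illustrative choices, not from the original post:
from sklearn.feature_selection import SelectKBest, f_classif

# illustrative: keep the 20 strongest features (k=20 is an arbitrary choice)
selector = SelectKBest(f_classif, k=20)
train_reduced = selector.fit_transform(train_features, train_labels.ravel())
test_reduced = selector.transform(test_features)
bestforest.fit(train_reduced, train_labels.ravel())
print(bestforest.score(test_reduced, test_labels.ravel()))
With fewer columns, each of the cv * n_iter forest fits gets cheaper, which is where most of the time goes.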
I tried to mimic the behavior of PyTorch's adaptive_avg_pool2d, but I found the results are not the same:
import numpy as np
import torch
import torch.nn.functional as F

def test_pool():
    a = np.fromfile("in.bin", dtype=np.float32)
    a = np.reshape(a, [1, 12, 25, 25])
    a = torch.as_tensor(a)
    # adaptive pooling: only the output size is given
    b = F.adaptive_avg_pool2d(a, [7, 7])
    print(b)
    print(b.shape)
    # fixed pooling: kernel [7, 7] with stride [3, 3]
    avg_pool = torch.nn.AvgPool2d([7, 7], [3, 3])
    c = avg_pool(a)
    print(c)
    print(c.shape)
What are the principles behind PyTorch's adaptive_avg_pool2d?
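For reference, a minimal NumPy sketch of the indexing rule adaptive_avg_pool2d is generally described as using: for output cell i over out cells along an axis of length size, the window is [floor(i * size / out), ceil((i + 1) * size / out)). For 25 -> 7 this gives windows of alternating size 4 and 5, so a fixed AvgPool2d([7, 7], [3, 3]) cannot reproduce it:
import math
import numpy as np

def adaptive_avg_pool2d_np(x, out_h, out_w):
    # x has shape (N, C, H, W); each output cell averages its own window
    n, c, h, w = x.shape
    out = np.zeros((n, c, out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        h0, h1 = (i * h) // out_h, math.ceil((i + 1) * h / out_h)
        for j in range(out_w):
            w0, w1 = (j * w) // out_w, math.ceil((j + 1) * w / out_w)
            out[:, :, i, j] = x[:, :, h0:h1, w0:w1].mean(axis=(2, 3))
    return out
Comparing adaptive_avg_pool2d_np(a.numpy(), 7, 7) against b above should match up to floating-point error, while the fixed-kernel output c will not.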
So I did the following:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

MLP = MLPRegressor()
parameter_space = {
    'hidden_layer_sizes': [(32,), (32, 16), (32, 16, 8), (32, 16, 8, 4),
                           (32, 16, 8, 4, 2), (32, 32), (32, 32, 32),
                           (32, 32, 32, 32), (32, 32, 32, 32, 32), (16, 8, 4, 2)],
    'activation': ['relu'],
    'solver': ['adam'],
    'learning_rate_init': [1, 0.1, 0.01, 0.001, 0.0001, 0.00001],
    'max_iter': [5000],
    'shuffle': [True, False],
    'random_state': [0],
    'early_stopping': [True, False],
    'n_iter_no_change': [50],
}
gs_MLP = GridSearchCV(estimator=MLP, param_grid=parameter_space, cv=7, n_jobs=-1)
gs_MLP_fit = gs_MLP.fit(X, y)
gs_MLP.score(X, y)
And I noticed that whenever I change the order within hidden_layer_sizes, it gives different answers: first it said (16,8,4,2) was best, and when I moved (16,8,4,2) to the end it said (32,32,32,32) was best.
I assume this has to do with random_state? Do I have to set it in MLPRegressor() instead, as in MLPRegressor(random_state=0)?
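For what it's worth, GridSearchCV applies each grid entry via set_params() on a clone of the estimator, so listing 'random_state': [0] in the grid and constructing MLPRegressor(random_state=0) pin the same parameter. One possible explanation for the order sensitivity is that when two candidates score identically, GridSearchCV reports the first one in grid order, and reordering hidden_layer_sizes changes that order. A minimal sketch of the estimator-level alternative:
# Equivalent to keeping 'random_state': [0] in parameter_space, since
# GridSearchCV sets grid entries through estimator.set_params()
MLP = MLPRegressor(random_state=0)
gs_MLP = GridSearchCV(estimator=MLP, param_grid=parameter_space, cv=7, n_jobs=-1)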
I have defined my parameter grid and grid search below. The weird thing is that the output does not include any of the parameter options I set; e.g. max_features is shown as auto.
Have I done something wrong?
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # replaces the removed sklearn.grid_search module

param_grid = {
    'bootstrap': [True],
    'max_depth': [90, 100, 110],
    'max_features': [2, 3, 10, 20],
    'min_samples_leaf': [3, 4, 5, 10],
    'min_samples_split': [2, 5, 8, 10, 12],
    'n_estimators': [10, 20, 50, 60, 70]
}
model = RandomForestClassifier()
# Instantiate the grid search model
best = GridSearchCV(estimator=model, param_grid=param_grid,
                    cv=3, n_jobs=-1, verbose=2)
best.fit(x, y.ravel())
The fitted search object exposes the results you are looking for; fit() returns the search object itself, so you can take its return value and read best_estimator_ and best_params_ from it:
fitted_grid = best.fit(x, y.ravel())
best_classifier = fitted_grid.best_estimator_
best_parameters = fitted_grid.best_params_
I did not see that part in your code snippet, so maybe that's what you were missing?
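A short follow-up on usage; best_score_ is also available on the fitted search:
print(best_parameters)          # the winning combination from param_grid
print(fitted_grid.best_score_)  # its mean cross-validated accuracy (the default scoring for a classifier)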
I'm trying to build a Bayesian inference model in PyMC3 and I'm getting the following error:
import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm

data = [[24, 38.7], [25, 38.6], [26, 38.9], [27, 41.4], [28, 39.7], [29, 41.1], [30, 38.7], [31, 37.6],
        [32, 36.3], [33, 36.9], [34, 35.7], [35, 33.8], [36, 33.2], [37, 30.1], [38, 27.8], [39, 22.8],
        [40, 21.4], [41, 15.4], [42, 11.2], [43, 9.2], [44, 5.4], [45, 3.0], [46, 1.6]]
data = np.array(data)
x = data[:, 0]
y = data[:, 1]
plt.scatter(x, y, color="red")

with pm.Model() as change_point_model:
    switchpoint = pm.DiscreteUniform('switchpoint', lower=x.min(), upper=x.max())
    beta0 = pm.Normal('beta0', mu=40, sd=10)
    beta1 = pm.Normal('beta1', mu=90, sd=10)
    gamma0 = pm.Normal('gamma0', mu=0, sd=5)
    gamma1 = pm.Normal('gamma1', mu=0, sd=5)
    epsilon = pm.Normal('epsilon', mu=0, sd=1)
    # piecewise regression: the intercept and slope switch at the changepoint
    intercept = pm.math.switch(switchpoint <= x, beta0, gamma0)
    x_coeff = pm.math.switch(switchpoint <= x, beta1, gamma1)
    y_pred = pm.Normal('y_pred', mu=intercept + x_coeff * x, sd=epsilon, observed=y)
    step1 = pm.NUTS([beta0, beta1, gamma0, gamma1])
    step2 = pm.Metropolis([switchpoint])
    # In this example we are deliberately choosing the Metropolis sampler for the switchpoint
    trace = pm.sample(2000, step=[step1, step2], progressbar=True)
pm.traceplot(trace[100:])
And the error that I am getting is the following:
ValueError: Bad initial energy: inf. The model might be misspecified.
After doing some reading, I found that model.logp(model.test_point) returns -inf. How do I solve this error? Any help is much appreciated!
You are modelling the standard deviation of a normal distribution with a Normal. The test point for epsilon is therefore 0.0, and a normal likelihood with sd = 0 has zero probability, which is where the -inf comes from.
If you change epsilon to pm.Gamma('epsilon', alpha=2.0, beta=0.5) or a similar strictly positive distribution, you should be fine.
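As an aside, PyMC3's Model.check_test_point() helps locate which term drives the log-probability to -inf; a quick diagnostic sketch, assuming the model defined above is in scope:
# prints the log-probability of each variable at the test point;
# the -inf entry identifies the offending term (here y_pred, via sd=0)
print(change_point_model.check_test_point())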
So I ran a very thorough grid search with 10-fold cross-validation in an integrated pipeline, in the following manner:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipeline_rf = Pipeline([
    ('standardize', MinMaxScaler()),
    ('grid_search_lr', GridSearchCV(
        RandomForestClassifier(),
        param_grid={'bootstrap': [True],
                    'max_depth': [50, 100, 150, 200],
                    'max_features': ['auto', 'sqrt'],
                    'min_samples_leaf': [1, 2, 4],
                    'min_samples_split': [2, 5, 10],
                    'n_estimators': [100, 200, 500, 1000, 1500]},
        cv=10,
        n_jobs=-1,
        scoring='roc_auc',
        verbose=2,
        refit=True
    ))
])
pipeline_rf.fit(X_train, y_train)
How should I go about extracting the best set of parameters?
You first need to get the GridSearchCV object from the pipeline and then read best_params_ from it. This can be done with:
pipeline_rf.named_steps['grid_search_lr'].best_params_
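The fitted search inside the pipeline also exposes the other results; since refit=True, best_estimator_ is a forest refit on the full (scaled) training data:
gs = pipeline_rf.named_steps['grid_search_lr']
print(gs.best_params_)        # best hyperparameter combination
print(gs.best_score_)         # mean 10-fold cross-validated roc_auc
best_rf = gs.best_estimator_  # refit on the full scaled training set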