GridSearchCV with RandomForest - python-3.x

I am tuning hyperparameters for a RandomForest with GridSearchCV. Here is my code.
#Import 'GridSearchCV' and 'make_scorer'
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
#Create the parameters list you wish to tune
parameters = {'n_estimators':[5,10,15]}
#Initialize the classifier
clf = GridSearchCV(RandomForestClassifier(), parameters)
#Make an f1 scoring function using 'make_scorer'
f1_scorer = make_scorer(f1_scorer)
#Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf, param_grid=parameters, scoring=f1_scorer,cv=5)
print(clf.get_params().keys())
#Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train_100,y_train_100)
So the issue is the following error: "ValueError: Invalid parameter max_features for estimator GridSearchCV. Check the list of available parameters with estimator.get_params().keys()."
I followed the advice given by the error, and the output of print(clf.get_params().keys()) is below. However, even when I copy and paste these names into my parameter dictionary, I still get an error. I've hunted around Stack Overflow, and most people are using parameter dictionaries really similar to mine. Does anyone have an idea how to iron out this issue? Thanks again!
dict_keys(['pre_dispatch', 'cv', 'estimator__max_features', 'param_grid', 'refit', 'estimator__min_impurity_split', 'n_jobs', 'estimator__random_state', 'error_score', 'verbose', 'estimator__min_samples_split', 'estimator__n_jobs', 'fit_params', 'estimator__min_weight_fraction_leaf', 'scoring', 'estimator__warm_start', 'estimator__criterion', 'estimator__verbose', 'estimator__bootstrap', 'estimator__class_weight', 'estimator__oob_score', 'iid', 'estimator', 'estimator__max_depth', 'estimator__max_leaf_nodes', 'estimator__min_samples_leaf', 'estimator__n_estimators', 'return_train_score'])

I think the problem is with the two lines:
clf = GridSearchCV(RandomForestClassifier(), parameters)
grid_obj = GridSearchCV(clf, param_grid=parameters, scoring=f1_scorer,cv=5)
What this is essentially doing is creating an object with a structure like:
grid_obj = GridSearchCV(GridSearchCV(RandomForestClassifier()))
which is probably one more GridSearchCV than you want.
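A minimal sketch of the un-nested version, assuming the intent was a single grid search over the forest. The synthetic data from make_classification stands in for X_train_100 / y_train_100, and f1_score is assumed to be the metric meant in make_scorer(f1_scorer):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Stand-in data (assumption): replace with X_train_100, y_train_100
X_train, y_train = make_classification(n_samples=100, n_features=8, random_state=0)

# Parameters to tune; the keys are plain RandomForestClassifier parameter names
parameters = {"n_estimators": [5, 10, 15]}

# The classifier itself, NOT wrapped in GridSearchCV
clf = RandomForestClassifier(random_state=0)

# Build the scorer from the metric function f1_score
f1_scorer = make_scorer(f1_score)

# Exactly one GridSearchCV around the classifier
grid_obj = GridSearchCV(clf, param_grid=parameters, scoring=f1_scorer, cv=5)
grid_obj.fit(X_train, y_train)

print(grid_obj.best_params_)
```

With only one GridSearchCV, the keys in the parameter dictionary no longer need the estimator__ prefix.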

Related

how to cross validate pca in sklearn pipeline without overfitting?

My input is time series data. I want to decompose the dataset with PCA (I don't want to do PCA on the entire dataset first because that would be overfitting) and then use feature selection on each component (fitted on a KNN Regressor model).
This is my code so far:
tscv = TimeSeriesSplit(n_splits=10)
pca = PCA(n_components=.5,svd_solver='full').fit_transform()
knn = KNeighborsRegressor(n_jobs=-1)
sfs = SequentialFeatureSelector(estimator=knn,n_features_to_select='auto',tol=.001,scoring=custom_scorer,n_jobs=-1)
pipe = Pipeline(steps=[("pca", pca), ("sfs", sfs), ("knn", knn)])
cv_score = cross_val_score(estimator=pipe,X=X,y=y,scoring=custom_scorer,cv=tscv,verbose=10)
print(np.average(cv_score),' +/- ',np.std(cv_score))
print(X.columns)
The problem is that I want to make sure PCA isn't looking at the entire dataset when it calculates each feature's variance. I also want the data to be fit-transformed, but that doesn't work. I get one of the following errors:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '<bound method PCA.fit_transform of PCA(svd_solver='full')>' (type <class 'method'>) doesn't
or
TypeError: fit_transform() missing 1 required positional argument: 'X'
You should not use pca = PCA(...).fit_transform nor pca = PCA(...).fit_transform() when defining your pipeline.
Instead, you should use pca = PCA(...). The fit_transform method is automatically called within the pipeline during the model fitting (in cross_val_score).
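A rough sketch of what that looks like, assuming synthetic data in place of the real X and y and the default scorer in place of custom_scorer. The SequentialFeatureSelector step is left out for brevity; it would go in as another unfitted step between "pca" and "knn":

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

# Stand-in time series data (assumption): 200 ordered samples, 6 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Pass unfitted estimator objects; the pipeline calls fit/transform itself
pipe = Pipeline(steps=[
    ("pca", PCA(n_components=0.5, svd_solver="full")),
    ("knn", KNeighborsRegressor()),
])

# For each split, PCA is fit on the training part only, so the
# held-out part never influences the components
tscv = TimeSeriesSplit(n_splits=5)
cv_score = cross_val_score(pipe, X, y, cv=tscv)

print(np.mean(cv_score), "+/-", np.std(cv_score))
```

Because the whole pipeline is refit inside every split, this addresses the concern about PCA seeing the entire dataset.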

confusion matrix for multiclass data

I have fit the following model to my 7 classes data and I want to create a confusion matrix for my model:
history1 = model1.fit(data_generator.flow(train_x, to_categorical(train_y), batch_size=BATCH_SIZE),
                      steps_per_epoch=len(train_x) / BATCH_SIZE,
                      validation_data=data_generator.flow(val_x, to_categorical(val_y), batch_size=BATCH_SIZE),
                      validation_steps=len(val_x) / BATCH_SIZE, epochs=NUM_EPOCHS)
When I predict on the training set and build the matrix like this:
y_train_pred = model1.predict(train_x)
cm_train = confusion_matrix(train_y, y_train_pred)
It gave me this error:
Classification metrics can't handle a mix of unknown and multiclass targets
Can you please guide me how to do it?
You seem to be using
sklearn.metrics.confusion_matrix
try using
from sklearn.metrics import multilabel_confusion_matrix
instead.
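One thing worth checking as well: if model1.predict returns one row of class probabilities per sample (typical for a softmax output layer, but an assumption here), the error likely comes from comparing integer labels against a 2-D float array, and multilabel_confusion_matrix would reject that input for the same reason. A sketch using random probabilities as a stand-in for the model output, converting them to class indices with argmax first:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
n_classes = 7

# Stand-ins (assumption): integer labels and predicted class probabilities
train_y = rng.integers(0, n_classes, size=50)
y_train_prob = rng.dirichlet(np.ones(n_classes), size=50)  # like model1.predict(train_x)

# Turn each probability row into a single predicted class index
y_train_pred = np.argmax(y_train_prob, axis=1)

# Both arguments are now 1-D integer label arrays
cm_train = confusion_matrix(train_y, y_train_pred, labels=np.arange(n_classes))
print(cm_train.shape)
```

Passing labels explicitly keeps the matrix 7x7 even if some class never shows up in the predictions.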

_score() missing 1 required argument 'y_true'

I need to run the AgglomerativeClustering method, and my code is:
model = AgglomerativeClustering()
params = {"n_clusters": [2,3,4]}
clf = GridSearchCV(model, params, n_jobs=1, cv=10, refit=False)
clf.fit(self.data, None)
Then I get the error saying "_score() missing 1 required argument 'y_true'". However for clustering there is no y. Any solution?
As you don't specify a scoring function in GridSearchCV, it will use the default (quoting the docs):
If None, the estimator’s score method is used.
Your estimator is model, i.e. AgglomerativeClustering, and if we check its docs, it has no score method. On top of that, a train/test split does not apply to this kind of algorithm, since it has no predict method for unseen data. GridSearchCV is currently not designed for clustering algorithms; you can check here for how to continue.
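One possible workaround, sketched here with a plain loop instead of GridSearchCV: fit the model once per candidate n_clusters and compare an internal metric. The silhouette score is my choice for illustration, not something the question specified, and make_blobs stands in for self.data:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in data (assumption): replace with self.data
data, _ = make_blobs(n_samples=150, centers=3, random_state=0)

scores = {}
for k in [2, 3, 4]:
    # No y and no train/test split: cluster the full data set
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(data)
    # Silhouette needs only the data and the assigned labels
    scores[k] = silhouette_score(data, labels)

# Higher silhouette is better
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Any other label-free metric, such as davies_bouldin_score, could be swapped in the same way (for that one, lower is better).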

Can RandomizedSearchCV output feature importance based on the best model?

After using RandomizedSearchCV to find the best hyperparameters, is there a way to get the following outputs?
1. save the best model as an object
2. output feature importance
gbm = GradientBoostingClassifier()
rand = RandomizedSearchCV(gbm, param_distributions=param_dist, cv=10,
                          scoring='roc_auc', n_iter=10, random_state=5)
rand.fit(X_train, y_train_num)
Use the best_params_ attribute and save it into a dictionary. Then retrain the model from that dictionary, looking up each value by its key.
top_params = rand.best_params_
gbm_model = GradientBoostingClassifier(learning_rate=top_params['learning_rate'], max_depth=top_params["max_depth"], ...)
gbm_model.fit(X_train, y_train_num)
gbm_model.feature_importances_
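As an alternative sketch: with the default refit=True, the search object already holds the best model, refit on the full training data, as best_estimator_, so retraining by hand may not be needed. The data and param_dist below are illustrative assumptions, and pickle is just one way to save the object:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Stand-in data and search space (assumptions)
X_train, y_train_num = make_classification(n_samples=200, n_features=6, random_state=0)
param_dist = {
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3],
    "n_estimators": [20, 40],
}

rand = RandomizedSearchCV(GradientBoostingClassifier(random_state=0),
                          param_distributions=param_dist, cv=3,
                          scoring="roc_auc", n_iter=4, random_state=5)
rand.fit(X_train, y_train_num)

# 1. The best model as an object (already refit on all of X_train)
best_model = rand.best_estimator_

# 2. Feature importances of that model
importances = best_model.feature_importances_
print(importances)

# Serialize and restore the model; write the bytes to a file to persist it
restored = pickle.loads(pickle.dumps(best_model))
```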

How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification

If I run a basic logistic regression with 4 classes, I can get the predict_proba array.
How can I manually calculate the probabilities using the coefficients and intercepts? What are the exact steps to get the same answers that predict_proba generates?
There seem to be multiple questions about this online and several suggestions which are either incomplete or don't match up anyway.
For example, I can't replicate this process from my sklearn model so what is missing?
https://stats.idre.ucla.edu/stata/code/manually-generate-predicted-probabilities-from-a-multinomial-logistic-regression-in-stata/
Thanks,
Because I had the same question but could not find an answer that gave the same results I had a look at the sklearn GitHub repository to find the answer. Using the functions from their own package I was able to create the same results I got from predict_proba().
It appears that sklearn uses its own softmax() function, a numerically stabilized implementation whose code differs from the textbook softmax formula.
Let's assume you build a model like this:
from sklearn.linear_model import LogisticRegression
X = ...
Y = ...
model = LogisticRegression(multi_class="multinomial", solver="saga")
model.fit(X, Y)
Then you can calculate the probabilities either with model.predict_proba(X) or use the sklearn function mentioned above to calculate them manually like this.
from sklearn.utils.extmath import softmax
import numpy as np
scores = np.dot(X, model.coef_.T) + model.intercept_
softmax(scores) # Sklearn implementation
In the documentation for their own softmax() function, they note that
The softmax function is calculated by
np.exp(X) / np.sum(np.exp(X), axis=1)
This will cause overflow when large values are exponentiated. Hence
the largest value in each row is subtracted from each data point to
prevent this.
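The same calculation can be sketched in plain NumPy, without the sklearn helper. This assumes the default lbfgs solver, which fits a multinomial model for a multiclass target in recent sklearn versions; the data is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in data (assumption): 4 classes, 8 features
X, Y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, Y)

# Linear scores: one column per class
scores = X @ model.coef_.T + model.intercept_

# Subtract the row maximum before exponentiating to avoid overflow;
# this leaves the resulting probabilities unchanged
shifted = scores - scores.max(axis=1, keepdims=True)
manual = np.exp(shifted)
manual /= manual.sum(axis=1, keepdims=True)

print(np.allclose(manual, model.predict_proba(X)))
```

Subtracting the row maximum cancels out in the ratio, so mathematically this is still the ordinary softmax.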
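To replicate the sklearn calculation (I saw this in a different post):
V = X_train.values.dot(model.coef_.transpose())
U = V + model.intercept_
A = np.exp(U)
P = A / (1 + A)
P /= P.sum(axis=1).reshape((-1, 1))
This looks slightly different from the softmax calculation and from the UCLA Stata example, but it works. It applies a sigmoid to each class score and then normalizes each row, which is the one-vs-rest form of the calculation, so it should match predict_proba when the model was fit one-vs-rest rather than multinomial.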
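A small check of the sigmoid-then-normalize recipe, assuming an explicit one-vs-rest setup via OneVsRestClassifier (my choice for illustration, since it makes the one-vs-rest fitting unambiguous) and synthetic data:

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Stand-in data (assumption): 4 classes, 8 features
X, Y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)

# One binary logistic regression per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# Linear score of each binary model, stacked into one column per class
U = np.column_stack([X @ est.coef_.ravel() + est.intercept_[0]
                     for est in ovr.estimators_])

# expit(U) equals exp(U) / (1 + exp(U)); then normalize each row
P = expit(U)
P /= P.sum(axis=1, keepdims=True)

print(np.allclose(P, ovr.predict_proba(X)))
```

expit is used instead of np.exp(U) / (1 + np.exp(U)) only because it avoids overflow for large scores.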
