I am new to random forest regression. I am trying to build a random forest regression model to predict a company's earnings. Do I need to worry about collinearity with random forest regression? Thanks!
If there is a collinearity problem, how can I deal with it?
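Before deciding how to deal with collinearity, it can help to check whether it is present at all. Below is a minimal sketch of one such check, assuming the features are already numeric. The synthetic data, the variable names and the 0.9 threshold are all illustrative assumptions, not part of the question.

```python
import numpy as np

# Synthetic data (illustrative only): column 1 is nearly a copy of column 0.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = x0 + rng.normal(scale=0.05, size=500)  # almost collinear with x0
x2 = rng.normal(size=500)                   # independent feature
X = np.column_stack([x0, x1, x2])

# Absolute pairwise correlations between the feature columns.
corr = np.abs(np.corrcoef(X, rowvar=False))

# Collect every pair of columns above a hypothetical threshold of 0.9.
threshold = 0.9
high_pairs = [
    (i, j)
    for i in range(corr.shape[0])
    for j in range(i + 1, corr.shape[1])
    if corr[i, j] > threshold
]
print(high_pairs)
```

Whether to drop one feature from each flagged pair is a judgment call. Pairwise correlation only catches linear relationships between two features at a time.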
With the sklearn RandomForest, you can get the prediction of each tree of the random forest like so:
per_tree_pred = [tree.predict(X) for tree in clf.estimators_]
xgboost has the option of fitting a random forest, which is quite nice since we can leverage the GPU, but is there any way to get the prediction from each tree like we can in sklearn? This can be very useful for getting a sense of the uncertainty of the predictions.
Thanks,
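For reference, the sklearn side of this can be sketched as follows. It only covers the sklearn approach described above, not xgboost. The synthetic data and the name reg (a regressor, used here instead of clf) are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# One row per tree, one column per sample.
per_tree_pred = np.stack([tree.predict(X) for tree in reg.estimators_])

# The mean over trees is the forest prediction; the standard deviation
# over trees is a rough indication of how much the trees disagree.
mean_pred = per_tree_pred.mean(axis=0)
spread = per_tree_pred.std(axis=0)
```

The spread across trees is only a heuristic measure of uncertainty, not a calibrated interval.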
I'm translating a random forest built with H2O and R into one built with scikit-learn's RandomForestClassifier in Python. H2O's randomForest model has an argument 'stopping_rounds'. Is there a way to do this in Python using the scikit-learn RandomForestClassifier? I've looked through the documentation, and I'm afraid I might have to hard-code this.
No, I don't believe scikit-learn's random forest has any sort of automatic early stopping mechanism (that's what stopping_rounds relates to in H2O algorithms). You will have to figure out the optimal number of trees manually.
Per the sklearn random forest classifier docs, the closest thing to early stopping is the min_impurity_split (deprecated) and min_impurity_decrease arguments, which stop an individual tree from splitting further. That is not the same functionality as H2O's stopping_rounds, which stops adding trees, but it might be what you're looking for.
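If the number of trees has to be found by hand, one possible approach is to grow the forest in steps with warm_start=True and watch the out-of-bag score. The sketch below is not an equivalent of H2O's stopping_rounds. The names step, max_trees, patience and tol are hypothetical choices, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data, for illustration only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = RandomForestClassifier(
    n_estimators=25, warm_start=True, oob_score=True, random_state=0
)

# Hypothetical stopping settings: add `step` trees at a time and stop after
# `patience` consecutive steps without an OOB improvement larger than `tol`.
step, max_trees, patience, tol = 25, 300, 2, 1e-3
best_score, best_n, stale = -1.0, 0, 0

while clf.n_estimators <= max_trees:
    clf.fit(X, y)  # with warm_start=True, only the newly added trees are fitted
    if clf.oob_score_ > best_score + tol:
        best_score, best_n, stale = clf.oob_score_, clf.n_estimators, 0
    else:
        stale += 1
        if stale >= patience:
            break
    clf.set_params(n_estimators=clf.n_estimators + step)

print(best_n, best_score)
```

After the loop the forest may hold more trees than best_n, so a final model would be refitted with n_estimators=best_n.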
I am working with a random forest regression model for my thesis. My data set is small (about 3000 samples and 20 features) and the model is overfitting the training data. Since the data set is small, I don't want to split it into train (train + OOB) and test sets, so I am using bagging regression to avoid the overfitting problem.
I'm trying to evaluate performance metrics for the regression model. I can calculate RMSE and MAE for the training set, but I don't know how to compute these metrics for the out-of-bag data. Any suggestions?
Thanks in advance
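One way to get out-of-bag metrics in sklearn is to fit with oob_score=True and score the oob_prediction_ attribute against the targets. Here is a minimal sketch using RandomForestRegressor on synthetic data that is smaller than the data set described above. BaggingRegressor exposes the same attribute when fitted with oob_score=True.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic data, for illustration only.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

# Each entry is predicted only by trees that did not see that sample.
oob_pred = rf.oob_prediction_
oob_rmse = np.sqrt(mean_squared_error(y, oob_pred))
oob_mae = mean_absolute_error(y, oob_pred)

# Training-set metrics for comparison; these are usually optimistic.
train_pred = rf.predict(X)
train_rmse = np.sqrt(mean_squared_error(y, train_pred))
train_mae = mean_absolute_error(y, train_pred)

print(f"OOB   RMSE={oob_rmse:.2f}  MAE={oob_mae:.2f}")
print(f"Train RMSE={train_rmse:.2f}  MAE={train_mae:.2f}")
```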
I am trying to train a RandomForestRegressor from sklearn. The features I want to train on are of different types: numeric continuous, numeric categorical, textual categorical (name/nationality), latitude and longitude.
What I want to know is: given all the features, how do I determine the most useful feature set to train my RandomForestRegressor?
First, fit your random forest model on the data.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(train_data, train_labels)
Then use the feature_importances_ attribute to see the importance of each feature, and filter out the unimportant features.
print(rf.feature_importances_)
Then fit your model again on the selected features.
There are many more techniques you can use, such as correlation analysis, PCA, etc. Domain knowledge also gives you an edge when building a model.
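The fit, rank and refit steps above can be sketched end to end as follows. This assumes the features have already been encoded as numbers, since categorical and text columns need encoding first. The "above mean importance" cutoff is an arbitrary choice for illustration, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with only a few informative features, for illustration only.
X, y = make_regression(
    n_samples=400, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Step 1: fit on all features.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

# Step 2: keep the features whose importance is above the mean importance
# (an assumed cutoff; any other threshold could be used instead).
importances = rf.feature_importances_
keep = np.flatnonzero(importances > importances.mean())

# Step 3: refit on the selected columns only.
rf_selected = RandomForestRegressor(n_estimators=100, random_state=0)
rf_selected.fit(X[:, keep], y)

print("kept feature indices:", keep)
```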
I have a new question about scikit for you.
Classification problem, logistic regression as estimator.
I have my X dataset, with my features.
I want to evaluate my algorithm with cross-validation, and I see two ways. The first is to manually split my dataset into 5 subsets and iterate 5 times, leaving out a different subset for testing each time. I obtain my scores, but what I want now is the average of the coefficients, to use with the estimator to predict on a new dataset. I read somewhere on Stack Overflow that it's possible to pass the coefficients to the scikit-learn logistic regression estimator.
The other way is to use cross_val_score:
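from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
lrmodel = LogisticRegression(penalty='l2', C=1)
cross_val_score(lrmodel, Xf, y, cv=5, scoring='neg_log_loss', verbose=0)  # 'log_loss' in older releases
This gives me the (negated) cross-entropy for each fold of the cross-validation. But what if I now want to use the average coefficients and use the estimator for a new prediction on my new, still unlabeled dataset?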
Thank you!
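A minimal sketch of one way to do what is asked here: cross_validate with return_estimator=True keeps the model fitted on each fold, so the coefficients can be averaged and written into a copy of one fitted model. The synthetic data, X_new and avg_model are placeholders for illustration. Averaging fold coefficients is taken from the question as stated; a common alternative is simply to refit on all the labeled data.

```python
import copy

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data, for illustration only.
Xf, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_new = Xf[:5]  # placeholder for the new, unlabeled dataset

lrmodel = LogisticRegression(C=1)  # L2 penalty is the default

# Keep the estimator fitted on each of the 5 training folds.
cv_results = cross_validate(
    lrmodel, Xf, y, cv=5, scoring="neg_log_loss", return_estimator=True
)
fold_models = cv_results["estimator"]

# Average the coefficients and intercepts over the folds.
mean_coef = np.mean([m.coef_ for m in fold_models], axis=0)
mean_intercept = np.mean([m.intercept_ for m in fold_models], axis=0)

# Copy one fitted model and overwrite its parameters with the averages.
avg_model = copy.deepcopy(fold_models[0])
avg_model.coef_ = mean_coef
avg_model.intercept_ = mean_intercept

proba = avg_model.predict_proba(X_new)
print(proba)
```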