I am doing the Kaggle in-class challenge on Boston housing prices and learned that XGBoost is supposed to be faster than Random Forest, but when I implemented it, it was slower. I want to ask: when is XGBoost faster and when is Random Forest faster? I am new to machine learning and need your help. Thanks in advance.
Mainly, the parameters you choose have a strong impact on the speed of your algorithm (e.g. learning rate, tree depth, number of features, etc.). There's a trade-off between accuracy and speed, so I suggest you post the parameters you've chosen for each model and look at how to change them to get faster training with reasonable accuracy.
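As a rough illustration, here is a minimal sketch (assuming xgboost and scikit-learn are installed, and using a synthetic dataset in place of the housing data; the parameter values are purely illustrative) that times both models under comparable settings. On small datasets with a few deep trees, Random Forest can easily win; XGBoost's shallower boosted trees and histogram-based split finding tend to pay off as the data grows.

```python
import time
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Synthetic regression data standing in for the housing dataset
X, y = make_regression(n_samples=50_000, n_features=50, noise=0.1, random_state=0)

models = {
    "RandomForest": RandomForestRegressor(
        n_estimators=200, max_depth=None, n_jobs=-1, random_state=0
    ),
    # Shallower trees plus the histogram tree method usually speed XGBoost up on larger data
    "XGBoost": XGBRegressor(
        n_estimators=200, max_depth=6, learning_rate=0.1,
        tree_method="hist", n_jobs=-1, random_state=0
    ),
}

for name, model in models.items():
    start = time.perf_counter()
    model.fit(X, y)
    print(f"{name}: {time.perf_counter() - start:.2f} s to fit")
```

Try varying max_depth and n_estimators for both models: the relative ordering often flips depending on those choices and on the dataset size.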
In the Keras official documentation for the ReduceLROnPlateau class (https://keras.io/api/callbacks/reduce_lr_on_plateau/)
they mention that
"Models often benefit from reducing the learning rate"
Why is that so?
It's counter-intuitive to me, at least, since from what I know a higher learning rate allows taking larger steps from my current position.
Thanks!
Neither a too-high nor a too-low learning rate should be used for training a NN. A large learning rate can overshoot the global minimum and, in extreme cases, cause the model to diverge completely from the optimal solution. On the other hand, a small learning rate can get stuck in a local minimum.
The purpose of ReduceLROnPlateau is to track your model's performance and reduce the learning rate when there has been no improvement for x number of epochs. The intuition is that the model has approached a sub-optimal solution with the current learning rate and is oscillating around the global minimum. Reducing the learning rate enables the model to take smaller steps towards the optimum of the cost function.
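As a minimal sketch of how the callback is typically wired up (the toy model, random data, and the factor/patience values here are illustrative assumptions, not recommendations):

```python
import numpy as np
from tensorflow import keras

# Toy data just to make the example runnable
x_train = np.random.rand(1000, 20)
y_train = np.random.rand(1000, 1)

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")

# Halve the learning rate whenever val_loss has not improved for 5 epochs
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=5, min_lr=1e-6, verbose=1
)

model.fit(x_train, y_train, validation_split=0.2,
          epochs=100, callbacks=[reduce_lr])
```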
I classify clients with many small XGBoost models built from different parts of a dataset.
Since it is hard to maintain many models manually, I decided to automate hyperparameter tuning via Hyperopt and feature selection via Boruta.
Could you please advise me: what should go first, hyperparameter tuning or feature selection? Or does the order not matter?
After feature selection, the number of features decreases from 2,500 to 100 (actually, I have 50 true features plus 5 categorical features that turned into 2,400 via OneHotEncoding).
If some code is needed, please let me know. Thank you very much.
Feature selection (FS) can be considered a preprocessing activity, wherein the aim is to identify features having low bias and low variance [1].
Meanwhile, the primary aim of hyperparameter optimization (HPO) is to automate the hyper-parameter tuning process and make it possible for users to apply Machine Learning (ML) models to practical problems effectively [2]. Some important reasons for applying HPO techniques to ML models are as follows [3]:
It reduces the human effort required, since many ML developers spend considerable time tuning the hyper-parameters, especially for large datasets or complex ML algorithms with a large number of hyper-parameters.
It improves the performance of ML models. Many ML hyper-parameters have different optimal values for different datasets or problems.
It makes the models and research more reproducible. Only when the same level of hyper-parameter tuning is applied can different ML algorithms be compared fairly; hence, using the same HPO method on different ML algorithms also helps to determine the most suitable ML model for a specific problem.
Given the above difference between the two, I think FS should be applied first, followed by HPO, for a given algorithm.
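For concreteness, here is a minimal sketch of that ordering on synthetic data, using scikit-learn's SelectFromModel as a simple stand-in for Boruta and Hyperopt's TPE search for the tuning step (the search space, cross-validation setup, and thresholds are illustrative assumptions, not a recipe):

```python
import numpy as np
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=500, n_informative=40, random_state=0)

# Step 1: feature selection with a baseline model (Boruta could be swapped in here)
selector = SelectFromModel(XGBClassifier(n_estimators=100, random_state=0), threshold="median")
X_sel = selector.fit_transform(X, y)

# Step 2: hyperparameter tuning on the reduced feature set
space = {
    "max_depth": hp.quniform("max_depth", 3, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
}

def objective(params):
    params["max_depth"] = int(params["max_depth"])
    model = XGBClassifier(n_estimators=200, random_state=0, **params)
    score = cross_val_score(model, X_sel, y, cv=3, scoring="roc_auc").mean()
    return {"loss": -score, "status": STATUS_OK}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=30, trials=Trials())
print("Best hyperparameters:", best)
```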
References
[1] C.F. Tsai, W. Eberle and C.Y. Chu, "Genetic algorithms in feature and instance selection", Knowledge-Based Systems, 39, pp. 240-247, 2013.
[2] M. Kuhn and K. Johnson, Applied Predictive Modeling, Springer, 2013. ISBN: 9781461468493.
[3] F. Hutter, L. Kotthoff and J. Vanschoren (Eds.), Automated Machine Learning: Methods, Systems, Challenges, Springer, 2019. ISBN: 9783030053185.
I am a novice in DS/ML. I am trying to solve the Titanic case study on Kaggle, but my approach has not been systematic so far. I have used correlation to find relationships between variables and have used KNN and Random Forest classification, yet my model's performance has not improved. I selected features based on the correlation between variables.
Please guide me: are there scikit-learn methods that can be used to identify features which contribute significantly to the prediction?
Various boosting techniques can improve accuracy considerably; I suggest you try Gradient Boosting.
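On the feature-identification part of the question, here is a minimal sketch (on synthetic data, since I don't have your preprocessed Titanic frame; feature names and the choice of k are placeholders) of two scikit-learn options: univariate scoring with SelectKBest and model-driven recursive feature elimination:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif

# Synthetic stand-in for a preprocessed Titanic feature matrix
X, y = make_classification(n_samples=800, n_features=12, n_informative=5, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]

# Option 1: univariate scoring (mutual information between each feature and the target)
skb = SelectKBest(mutual_info_classif, k=5).fit(X, y)
print("SelectKBest picks:", [n for n, keep in zip(feature_names, skb.get_support()) if keep])

# Option 2: recursive feature elimination driven by a gradient-boosting model
rfe = RFE(GradientBoostingClassifier(random_state=0), n_features_to_select=5).fit(X, y)
print("RFE picks:", [n for n, keep in zip(feature_names, rfe.get_support()) if keep])
```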
I am using random forests in scikit-learn. I used feature_importances_ to see how important each feature is for the prediction goal, but I don't understand what this score is. Googling feature_importances_ says it is the mean decrease in impurity, but I'm still confused about whether this is the same as the mean decrease in Gini impurity. If so, how is it calculated for trees and for random forests? Beyond the math, I want to really understand what it means.
The feature_importances_ attribute tells you how much each feature contributes towards the prediction (information gain).
A random forest splits on the independent variables (features) based on a criterion such as Gini impurity, information gain, chi-square, or entropy. The features that contribute the most to that criterion receive the highest scores.
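As a minimal sketch of reading those scores out of scikit-learn (synthetic data, purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Mean decrease in impurity, averaged over all trees and normalized to sum to 1
for i, imp in sorted(enumerate(clf.feature_importances_), key=lambda t: -t[1]):
    print(f"feature_{i}: {imp:.3f}")
```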
Can anyone explain the difference between RandomForestClassifier and ExtraTreesClassifier in scikit-learn? I've spent a good bit of time reading the paper:
P. Geurts, D. Ernst., and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006
It seems these are the differences for ET:
1) When choosing variables at a split, samples are drawn from the entire training set instead of a bootstrap sample of the training set.
2) Splits are chosen completely at random from the range of values in the sample at each split.
The result of these two things is many more "leaves".
Yes, both conclusions are correct, although the Random Forest implementation in scikit-learn makes it possible to enable or disable the bootstrap resampling.
In practice, RFs are often more compact than ETs. ETs are generally cheaper to train from a computational point of view but can grow much bigger. ETs can sometimes generalize better than RFs, but it's hard to guess when that is the case without trying both first (and tuning n_estimators, max_features and min_samples_split by cross-validated grid search).
The ExtraTrees classifier always tests random splits over a fraction of the features (in contrast to RandomForest, which tests all possible splits over a fraction of the features).
The main difference between random forests and extra trees (more precisely, extremely randomized trees) lies in the fact that, instead of computing the locally optimal feature/split combination (as the random forest does), a random value is selected as the split for each feature under consideration (in the extra trees). Here is a good resource to learn about their differences in more detail: Random forest vs extra tree.
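To make the practical comparison concrete, here is a minimal sketch (synthetic data, near-default settings; an illustration rather than a benchmark) that fits both classifiers and compares cross-validated accuracy and wall-clock time:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=40, n_informative=10, random_state=0)

for Model in (RandomForestClassifier, ExtraTreesClassifier):
    clf = Model(n_estimators=300, max_features="sqrt", n_jobs=-1, random_state=0)
    start = time.perf_counter()
    scores = cross_val_score(clf, X, y, cv=5)
    elapsed = time.perf_counter() - start
    print(f"{Model.__name__}: accuracy={scores.mean():.3f}, time={elapsed:.1f}s")
```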