Default Number of Trees in Random Forest Implementation of Apache Spark - apache-spark

I am using the Random Forest model of Apache Spark. However, the documentation does not mention the default number of trees used by the model. Is there some way to know the default value of the "numTrees" parameter?

https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html
It is set to 20, as per the documentation linked above.
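You can also confirm the default programmatically. A minimal PySpark sketch, assuming Spark 3.1.x, where the linked documentation gives 20 as the default:

from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.master("local[*]").getOrCreate()

rf = RandomForestClassifier()
# explainParam prints the parameter's documentation, including its default value
print(rf.explainParam("numTrees"))  # "... (default: 20)"
# getNumTrees returns the current value, which equals the default when unset
print(rf.getNumTrees())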

Related

Adjust the tree parameters for a specific tree node

I use DecisionTreeClassifier from sklearn.
I need to adjust the splitter (feature) and min_samples_leaf used at a particular tree node.
How can I do it?
You cannot define min_samples_leaf for a single node: to comply with a rule applied to that individual node, the model would probably end up assigning fewer samples to other nodes than the min_samples_leaf of the whole model.
If you are dealing with an imbalanced dataset, I suggest you oversample or undersample your data before feeding it to the model, or manually set the class weights (see the sketch after the quote below).
According to scikit-learn's user guide:
Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value.
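As a concrete illustration of the class-weight route, a minimal sketch with a synthetic imbalanced dataset (the parameter values are arbitrary; note that min_samples_leaf can only be set globally, never per node):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced dataset, for illustration only
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# min_samples_leaf is a global constraint on every leaf;
# class_weight="balanced" reweights samples inversely to class frequency
clf = DecisionTreeClassifier(min_samples_leaf=5, class_weight="balanced", random_state=0)
clf.fit(X, y)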

Incremental learning - Set Initial Weights or values for Parameters from previous model for ML algorithm in Spark 2.0

I am trying to set the initial weights or parameters for a machine learning (classification) algorithm in Spark 2.x. Unfortunately, except for the MultilayerPerceptron algorithm, no other algorithm provides a way to set the initial weights/parameter values.
I am trying to solve incremental learning using Spark: I need to load an old model and re-train it with the new data arriving in the system. How can I do this?
How can I do this for other algorithms like:
Decision Trees
Random Forest
SVM
Logistic Regression
I need to experiment with multiple algorithms and then choose the best performing one.
How can I do this for other algorithms like:
Decision Trees
Random Forest
You cannot. Tree-based algorithms are not well suited for incremental learning, as they look at the global properties of the data and have no "initial weights or values" that can be used to bootstrap the process.
Logistic Regression
You can use StreamingLogisticRegressionWithSGD, which implements exactly the required process, including setting the initial weights with setInitialWeights (see the sketch below).
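A minimal PySpark sketch of that approach; the two-feature batch and the zero initial weight vector are placeholders for illustration only:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext("local[2]", "incremental-lr")
ssc = StreamingContext(sc, 5)

# A queue of RDDs stands in for a real streaming source of LabeledPoints
batch = sc.parallelize([LabeledPoint(1.0, [1.0, 0.5]), LabeledPoint(0.0, [0.2, 0.1])])
training_stream = ssc.queueStream([batch])

model = StreamingLogisticRegressionWithSGD()
# Bootstrap from the weights of a previously trained model;
# a zero vector of matching dimension is used here as a placeholder
model.setInitialWeights([0.0, 0.0])
model.trainOn(training_stream)

ssc.start()
ssc.awaitTerminationOrTimeout(15)
ssc.stop(stopSparkContext=True)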
SVM
In theory it could be implemented similarly to the streaming regressions (StreamingLogisticRegressionWithSGD or StreamingLinearRegressionWithSGD) by extending StreamingLinearAlgorithm, but there is no such implementation built in, and since org.apache.spark.mllib is in maintenance mode, there won't be.
It's not based on Spark, but there is a C++ incremental decision tree: see gaenari.
Continuous chunks of data can be inserted and updated, and a rebuild can be run if concept drift reduces accuracy.

Metalearning algorithm issue in Super Learner Algorithm in h2o-ai

I have succeeded in implementing a super learner in H2O-ai and Spark, but as per the second step, the super learner utilizes a metalearning algorithm.
Super-learner algorithm
1. Set up the ensemble.
1.a Specify a list of L base algorithms (with a specific set of model parameters).
1.b Specify a metalearning algorithm.
The complete algorithm is available at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html
For the metalearning algorithm I used the function
val metaLearningModel = new H2ODeepLearning()(hc, spark.sqlContext)
It seems to be using a built-in package from h2o-ai, so I want to know which metalearning algorithm it uses by default.
The default metalearner algorithm is noted on the Stacked Ensemble User Guide page that you've linked above. There is also more information available at the metalearner_algorithm page.
The default metalearner is:
"AUTO" (GLM with non negative weights, and if validation_frame is present, lambda_search is set to True; may change over time). This is the default.

How to handle categorical features in the latest Random Forest in Spark?

In the MLlib version of Random Forest it was possible to specify the columns with nominal features (numerical but still categorical variables) with the parameter categoricalFeaturesInfo.
What about the ML Random Forest? In the user guide there is an example that uses VectorIndexer, which converts the categorical features into a vector as well, but it says "Automatically identify categorical features, and index them".
In another discussion of the same problem I found that numerical indexes are treated as continuous features anyway in random forest, and that one-hot encoding is recommended to avoid this, which seems to make no sense for this algorithm, especially given the official example mentioned above!
I also noticed that when a categorical column has many categories (>1000), once it is indexed with StringIndexer, the random forest algorithm asks me to raise the maxBins parameter, which is supposedly meant for continuous features. Does this mean that features with more categories than the number of bins will be treated as continuous, as specified in the official example, and so StringIndexer is OK for my categorical column? Or does it mean that the whole column of numerical-but-still-nominal features will be bucketized under the assumption that the variables are continuous?
In another discussion of the same problem I found that numerical indexes are treated as continuous features anyway in random forest,
This is actually incorrect. Tree models (including RandomForest) depend on column metadata to distinguish between categorical and numerical variables. The metadata can be provided by ML transformers (like StringIndexer or VectorIndexer) or added manually. The old mllib RDD-based API, which is used internally by ml models, uses a categoricalFeaturesInfo Map for the same purpose.
The current API simply takes the metadata and converts it to the format expected by categoricalFeaturesInfo (see the sketch below).
One-hot encoding is required only for linear models, and is recommended, although not required, for the multinomial naive Bayes classifier.
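A minimal PySpark sketch of how that metadata flows into the tree learner; the toy DataFrame and column names are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical DataFrame with one categorical and one numeric feature
df = spark.createDataFrame(
    [("red", 1.0, 0.0), ("blue", 0.5, 1.0), ("green", 2.0, 0.0), ("red", 1.5, 1.0)],
    ["color", "amount", "label"],
)

# StringIndexer attaches nominal metadata to the indexed column; the tree
# learner reads that metadata and treats the column as categorical, not continuous
indexer = StringIndexer(inputCol="color", outputCol="color_idx")
assembler = VectorAssembler(inputCols=["color_idx", "amount"], outputCol="features")
# maxBins must be at least the number of categories of any categorical feature
rf = RandomForestClassifier(featuresCol="features", labelCol="label", maxBins=32)

model = Pipeline(stages=[indexer, assembler, rf]).fit(df)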

How to set cutoff while training the data in Random Forest in Spark

I am using Spark MLlib to train data for classification using the Random Forest algorithm. MLlib provides a RandomForest class with a trainClassifier method that does what is required.
Can I set a threshold value while training on the data set, similar to the cutoff option provided in R's randomForest package?
http://cran.r-project.org/web/packages/randomForest/randomForest.pdf
I found that the RandomForest class of MLlib provides options only to pass the number of trees, impurity, number of classes, etc., but there is nothing like a threshold or cutoff option available. Can it be done in any way?
The short version is no; if we look at RandomForestClassifier.scala you can see that it always simply selects the max. You could override the predict function, but it's not super clean. I've added a JIRA to track adding this.
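As a workaround, if you can use the DataFrame-based spark.ml API (Spark 3.x shown here) rather than the RDD-based MLlib class, you can apply a cutoff to the predicted probabilities after the fact. A minimal sketch with made-up data and an arbitrary 0.7 cutoff:

from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.functions import vector_to_array

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical binary-classification data
df = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.2, 1.0), (0.5, 0.8, 0.0), (1.2, 0.1, 1.0)],
    ["f1", "f2", "label"],
)
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)

model = RandomForestClassifier(labelCol="label").fit(features)
pred = model.transform(features)

# Instead of the built-in argmax prediction, apply a custom cutoff (here 0.7)
# to the positive-class probability, mimicking R's cutoff option post hoc
pred = pred.withColumn(
    "thresholded_prediction",
    (vector_to_array("probability")[1] >= F.lit(0.7)).cast("double"),
)
pred.select("probability", "prediction", "thresholded_prediction").show(truncate=False)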
