Metalearning algorithm issue in the Super Learner algorithm in H2O.ai - apache-spark

I have succeeded in implementing a Super Learner in H2O.ai and Spark, but as per the second step, the Super Learner utilizes a metalearning algorithm.
Super Learner algorithm
1. Set up the ensemble.
1.a Specify a list of L base algorithms (with a specific set of model parameters).
1.b Specify a metalearning algorithm.
The complete algorithm is available at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html
So for the metalearning algorithm I used the following:
val metaLearningModel = new H2ODeepLearning()(hc, spark.sqlContext)
It seems that this uses a built-in package from H2O.ai, so I want to know which metalearning algorithm it uses by default.

The default metalearner algorithm is noted on the Stacked Ensembles User Guide page that you linked above. There is also more information on the metalearner_algorithm parameter page.
The default metalearner is:
"AUTO" (GLM with non-negative weights; if validation_frame is present, lambda_search is set to True; may change over time). This is the default.

Related

HuggingFace Summarization: effect of specifying both `do_sample` and `num_beams`

I am using a HuggingFace summarization pipeline to generate summaries using a fine-tuned model. The summarizer object is initialised as follows:
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model=model,
    tokenizer=tokenizer,
    num_beams=5,
    do_sample=True,
    no_repeat_ngram_size=3,
    max_length=1024,
    device=0,
    batch_size=8,
)
According to the documentation, setting num_beams=5 means that the top 5 choices are retained when a new token is generated, the model discards all other possibilities, and this repeats after every new token. However, this option seems incompatible with do_sample=True, which activates a behaviour where new tokens are picked by some random strategy (not necessarily uniformly random, but I don't know the details of the process). Could anyone explain clearly how num_beams=5 and do_sample=True work together? (No error is raised, so I assume this is a valid summarizer configuration.)
The first difference is that temperature is applied to the logits.
The second difference is that instead of taking the top token per beam, the next token for each beam is sampled from that beam's distribution:
https://github.com/huggingface/transformers/blob/main/src/transformers/generation_utils.py#L2626
I believe the rest stays the same, but you can continue reading the code to be 100% sure.
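As a concrete illustration of this "beam sample" mode, here is a rough sketch that calls generate() directly with both options (article_text, model and tokenizer are placeholders for your own input and fine-tuned model):
inputs = tokenizer(article_text, return_tensors="pt", truncation=True)

summary_ids = model.generate(
    **inputs,
    num_beams=5,             # keep 5 candidate sequences at every step
    do_sample=True,          # sample the next token within each beam instead of taking the argmax
    temperature=0.9,         # applied to the logits before sampling
    no_repeat_ngram_size=3,
    max_length=1024,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
No error is raised because the two options combine into beam-search multinomial sampling rather than conflicting.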

Incremental learning - Set Initial Weights or values for Parameters from previous model for ML algorithm in Spark 2.0

I am trying to set the initial weights or parameters for a machine learning (classification) algorithm in Spark 2.x. Unfortunately, except for the MultilayerPerceptron algorithm, no other algorithm provides a way to set the initial weights/parameter values.
I am trying to solve an incremental learning problem using Spark: I need to load an old model and re-train it with new data coming into the system. How can I do this?
How can I do this for other algorithms like:
Decision Trees
Random Forest
SVM
Logistic Regression
I need to experiment with multiple algorithms and then choose the best-performing one.
How can I do this for other algorithms like:
Decision Trees
Random Forest
You cannot. Tree-based algorithms are not well suited for incremental learning, as they look at the global properties of the data and have no "initial weights or values" that can be used to bootstrap the process.
Logistic Regression
You can use StreamingLogisticRegressionWithSGD, which implements exactly the required process, including setting the initial weights with setInitialWeights.
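A minimal sketch with the DStream-based pyspark.mllib API (the streams, the SparkContext sc and the initial weight vector are placeholders for your own pipeline):
from pyspark.streaming import StreamingContext
from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD

ssc = StreamingContext(sc, batchDuration=10)  # sc: an existing SparkContext

model = StreamingLogisticRegressionWithSGD(stepSize=0.1, numIterations=50)
model.setInitialWeights([0.0, 0.0, 0.0])  # e.g. weights taken from the previously trained model

# training_stream and test_stream: DStream[LabeledPoint] built elsewhere
model.trainOn(training_stream)
predictions = model.predictOnValues(test_stream.map(lambda lp: (lp.label, lp.features)))

ssc.start()
ssc.awaitTermination()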
SVM
In theory it could be implemented similarly to the streaming regressions StreamingLogisticRegressionWithSGD or StreamingLinearRegressionWithSGD, by extending StreamingLinearAlgorithm, but there is no such implementation built in, and since org.apache.spark.mllib is in maintenance mode, there won't be.
It's not based on Spark, but there is a C++ incremental decision tree: see gaenari.
Continuous chunks of data can be inserted and updated, and a rebuild can be run if concept drift reduces accuracy.

How does VectorSlicer work in Spark 2.0?

In the official Spark documentation,
VectorSlicer is a transformer that takes a feature vector and outputs a new feature vector with a sub-array of the original features. It is useful for extracting features from a vector column.
Does this select the important features from the set of features?
If that is the case how is it done without the mention of a dependent variable?
I am trying to perform data clustering and I need the important features that will contribute most to the clusters. Can I use VectorSlicer for this?
Does this select the important features from the set of features?
It doesn't. It literally slices the vector to select only the specified indices.
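For example, a quick sketch (assuming an existing SparkSession named spark):
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(Vectors.dense([0.0, 1.5, 2.3, 7.1]),)], ["features"])

slicer = VectorSlicer(inputCol="features", outputCol="sliced", indices=[1, 3])
slicer.transform(df).show(truncate=False)
# "sliced" is simply [1.5, 7.1], the elements at positions 1 and 3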
and I need the important features that will contribute most to the clusters.
If you have categorical data consider using ChiSqSelector.
Otherwise you can use dimensionality reduction like PCA. It won't be the same as feature selection but should provide similar benefits (keep only the most important signals, discard the rest).
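A short PCA sketch along those lines (assuming a DataFrame df with an assembled "features" vector column):
from pyspark.ml.feature import PCA

pca = PCA(k=3, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(df)

reduced = pca_model.transform(df).select("pca_features")
print(pca_model.explainedVariance)  # variance captured by each principal component
The reduced vectors can then be fed to the clustering step instead of the raw features.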

How to use scikit-learn to calculate the k-means feature importance

I use scikit-learn to do clustering by k-means:
from sklearn import cluster
k = 4
kmeans = cluster.KMeans(n_clusters=k)
But another question is:
How can I use scikit-learn to calculate the k-means feature importance?
Unfortunately, to my knowledge there is no such thing as "feature importance" in the context of a k-means algorithm - at least in the understanding that feature importance means "automatic relevance determination" (as in the link below).
In fact, the k-means algorithm treats all features equally, since the clustering procedure depends on the (unweighted) Euclidean distances between data points and cluster centers.
More generally, there exist clustering algorithms which perform automatic feature selection or automatic relevance determination, or generic feature selection methods for clustering. A specific (and arbitrary) example is
Roth and Lange, Feature Selection in Clustering Problems, NIPS 2003
I have answered this on Stack Exchange: you can partially estimate the most important features, not for the whole clustering problem, but rather for each cluster. Here is the answer:
I faced this problem before and developed two possible methods to find the most important features responsible for each K-Means cluster's sub-optimal solution:
1. Focusing on each centroid's position and the dimensions responsible for the highest Within-Cluster Sum of Squares minimization.
2. Converting the problem into a classification setting, inspired by the paper "A Supervised Methodology to Measure the Variables Contribution to a Clustering" (sketched below).
I have written a detailed article here: Interpretable K-Means: Clusters Feature Importances. The GitHub link is included as well if you want to try it.
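As a rough illustration of the second (classification-based) method, one can fit k-means, then train a supervised model to predict the cluster labels and read off its feature importances; the data here is a random placeholder and the result is an approximation, not an exact k-means diagnostic:
import numpy as np
from sklearn import cluster
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(500, 6)            # placeholder data: 500 samples, 6 features
kmeans = cluster.KMeans(n_clusters=4, random_state=0).fit(X)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, kmeans.labels_)            # the cluster labels act as the target

for i, imp in enumerate(clf.feature_importances_):
    print(f"feature {i}: importance {imp:.3f}")
For per-cluster importances, the same idea can be applied one-vs-rest, fitting one classifier per cluster label.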

How to control feature subsetting in random forest in scikit-learn?

I am trying to change the way the random forest algorithm subsets features for every node. The original algorithm, as implemented in scikit-learn, subsets randomly. I want to define which subset is used for every new node, chosen from several candidate subsets. Is there a direct way in scikit-learn to control this? If not, is there any way to modify the scikit-learn code, and if so, which function in the source code do you think should be updated?
Short version: This is all you.
I assume by "subsetting features for every node" you are referring to the random selection of a subset of samples and possibly features used to train individual trees in the forest. If that's what you mean, then you aren't building a random forest; you want to make a nonrandom forest of particular trees.
One way to do that is to build each DecisionTreeClassifier individually using your carefully specified subset of features, then use the VotingClassifier to combine the trees into a forest. (That feature is only available in 0.17/dev, so you may have to build your own, but it is super simple to build a voting classifier estimator class.)
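Following that per-tree interpretation, here is a sketch of such a "non-random forest": one decision tree per hand-picked feature subset, combined with VotingClassifier (the subsets and the iris data are arbitrary placeholders, and recent scikit-learn versions ship VotingClassifier out of the box):
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

feature_subsets = [[0, 1], [2, 3], [0, 2, 3]]   # your chosen subsets, by column index

estimators = []
for i, cols in enumerate(feature_subsets):
    tree = Pipeline([
        ("select", ColumnTransformer([("keep", "passthrough", cols)])),  # keep only this subset
        ("tree", DecisionTreeClassifier(random_state=i)),
    ])
    estimators.append((f"tree_{i}", tree))

forest = VotingClassifier(estimators=estimators, voting="hard")
forest.fit(X, y)
print(forest.score(X, y))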
