How to do multiple-target linear regression in Spark MLlib?

Spark ML LinearRegression seems to regress against a single label.
LabeledPoint(label: Double, features: Array[Double])
https://spark.apache.org/docs/0.8.1/api/mllib/org/apache/spark/mllib/regression/LabeledPoint.html
However, with my problem, I need to predict a vector
e.g.
LabeledPoint(label: Array[Double], features: Array[Double])
Is there a way for me to do this? (This is supported in scikit-learn, and I am trying to do it in Spark.)
PS 1: If this is not possible in MLlib directly, is there a tutorial on how to implement it from scratch using Spark?
PS 2: My output label is a 60-element vector. I could train a LinearRegression 60 times and then run 60 predictions, but that seems like a hack.

As far as I know, there is no native implementation. However, the scikit-learn documentation for multi-output regression says that the "strategy consists of fitting one regressor per target. Since each target is represented by exactly one regressor it is possible to gain knowledge about the target by inspecting its corresponding regressor".
This means that a potential implementation is to fit one regression per target. Since the per-target fits are independent of each other, they can be distributed and run at the same time to speed things up.
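The one-regressor-per-target strategy quoted above can be sketched with scikit-learn itself. This is only an illustration of the strategy, not a Spark implementation: the helper name `fit_per_target` and the synthetic data are assumptions made for the example, while `MultiOutputRegressor` and its `n_jobs` parameter are the scikit-learn wrapper that fits one estimator per target column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor


def fit_per_target(X, Y, n_jobs=None):
    """Fit one independent LinearRegression per column of Y.

    Hypothetical helper for illustration; n_jobs lets the
    independent per-target fits run in parallel.
    """
    model = MultiOutputRegressor(LinearRegression(), n_jobs=n_jobs)
    return model.fit(X, Y)


# Synthetic data (an assumption for the example):
# 200 rows, 5 features, 60 targets as in the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_W = rng.normal(size=(5, 60))
Y = X @ true_W + 0.01 * rng.normal(size=(200, 60))

model = fit_per_target(X, Y)
pred = model.predict(X[:3])  # one 60-element prediction per row
```

Each fitted regressor is available in `model.estimators_`, so a single target can still be inspected on its own. In Spark the same idea would amount to training one LinearRegression per label column, which is the approach the question calls a hack but which matches what scikit-learn does under the hood.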

Related

Parallelization of sklearn functions using MPI without cross-validation

I have a group of time series to which I want to apply a LASSO regression using sklearn. As the datasets are quite sparse, I need the whole length of each time series, so I can't cross-validate. The datasets are big and training is time-consuming, so I have to run it on a cluster.
To use different nodes I use MPI. As far as I know, it is possible to use sklearn functions on a cluster with MPI, but this basically works by distributing cross-validation chunks, as in the following project:
https://github.com/sebp/scikit-learn-mpi-grid-search
I was wondering if there is any other way to use MPI to parallelize the training process in sklearn without cross-validation. I think it would mean that the underlying algorithm of the sklearn function itself would have to be parallelized.

Incremental learning - Set Initial Weights or values for Parameters from previous model for ML algorithm in Spark 2.0

I am trying to set the initial weights or parameters for a machine learning (classification) algorithm in Spark 2.x. Unfortunately, except for the MultiLayerPerceptron algorithm, no other algorithm provides a way to set the initial weights/parameter values.
I am trying to implement incremental learning using Spark. I need to load an old model and re-train it with the new data in the system. How can I do this?
How can I do this for other algorithms like:
Decision Trees
Random Forest
SVM
Logistic Regression
I need to experiment with multiple algorithms and then choose the best-performing one.
How can I do this for other algorithms like:
Decision Trees
Random Forest
You cannot. Tree-based algorithms are not well suited for incremental learning, as they look at the global properties of the data and have no "initial weights or values" that can be used to bootstrap the process.
Logistic Regression
You can use StreamingLogisticRegressionWithSGD, which implements exactly the required process, including setting initial weights with setInitialWeights.
SVM
In theory it could be implemented similarly to the streaming regressions StreamingLogisticRegressionWithSGD or StreamingLinearRegressionWithSGD, by extending StreamingLinearAlgorithm, but there is no such implementation built in, and since org.apache.spark.mllib is in maintenance mode, there won't be.
It's not based on Spark, but there is a C++ incremental decision tree.
See gaenari.
Chunks of data can be inserted and updated continuously, and rebuilds can be run if concept drift reduces accuracy.

What does the CV stand for in sklearn.linear_model.LogisticRegressionCV?

scikit-learn has two logistic regression functions:
sklearn.linear_model.LogisticRegression
sklearn.linear_model.LogisticRegressionCV
I'm just curious what the CV stands for in the second one. The only acronym I know in ML that matches "CV" is cross-validation, but I'm guessing that's not it, since that would be achieved in scikit-learn with a wrapper function, not as part of the logistic regression function itself (I think).
You are right in guessing that the latter allows the user to perform cross-validation. The user can pass the number of folds as the cv argument to perform k-fold cross-validation (StratifiedKFold is used by default; the default number of folds is 5 in recent releases and was 3 before version 0.22).
I would recommend reading the documentation for LogisticRegression and LogisticRegressionCV.
Yes, it's cross-validation. Excerpt from the docs:
For the grid of Cs values (that are set by default to be ten values in a logarithmic scale between 1e-4 and 1e4), the best hyperparameter is selected by the cross-validator StratifiedKFold, but it can be changed using the cv parameter.
The point here is the following:
yes: sklearn has general model-selection wrappers providing CV functionality for all those classifiers/regressors
but: when the classifier/regressor (and sometimes even the CV scheme) is known or fixed a priori, at least to some extent, specialized code bound to that one classifier/regressor can exploit this to improve performance!
Typically:
CV already embedded in optimization-algorithm
Efficient warm-starting (instead of full re-optimization after just the change of one parameter like alpha)
It seems, at least the latter idea is used in sklearn's LogisticRegressionCV, as seen in this excerpt:
In the case of newton-cg and lbfgs solvers, we warm start along the path i.e guess the initial coefficients of the present fit to be the coefficients got after convergence in the previous fit, so it is supposed to be faster for high-dimensional dense data.
May I also refer you to this section in the scikit-learn documentation, which I believe explains it well:
Some models can fit data for a range of values of some parameter
almost as efficiently as fitting the estimator for a single value of
the parameter. This feature can be leveraged to perform a more
efficient cross-validation used for model selection of this parameter.
The most common parameter amenable to this strategy is the parameter
encoding the strength of the regularizer. In this case we say that we
compute the regularization path of the estimator.
And logistic regression is one such model. That's why scikit-learn has the dedicated LogisticRegressionCV class that does this.
Some things are left out of the other answers, e.g. the relation to grid search functionality. See the docs:
cross-validation estimator
An estimator that has built-in cross-validation capabilities to automatically select the best hyper-parameters (see the User Guide). Some example of cross-validation estimators are ElasticNetCV and LogisticRegressionCV. Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements. An exception is the RidgeCV class, which can instead perform efficient Leave-One-Out CV.
https://scikit-learn.org/stable/glossary.html#term-cross-validation-estimator
https://github.com/amueller/talks_odt/blob/master/2015/nyc-open-data-2015-andvanced-sklearn.pdf
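To make the answers above concrete, here is a minimal sketch of LogisticRegressionCV selecting C by cross-validation. The synthetic dataset from `make_classification` and the explicit choices `cv=5` and `max_iter=1000` are assumptions for the example; `Cs=10` is the documented default grid of ten values on a logarithmic scale between 1e-4 and 1e4.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic binary classification data (an assumption for the example).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Cs=10 builds the default logarithmic grid of ten C values;
# cv=5 requests 5-fold stratified cross-validation.
clf = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000).fit(X, y)

grid = clf.Cs_                         # the candidate C values that were tried
best_c = float(np.ravel(clf.C_)[0])    # the C selected by cross-validation
print(grid)
print(best_c)
```

This is roughly what `GridSearchCV(LogisticRegression(), {"C": ...})` would do, with the difference described in the glossary excerpt: the dedicated class can reuse results along the regularization path instead of refitting from scratch for every C.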

How to set cutoff while training the data in Random Forest in Spark

I am using Spark MLlib to train a classifier using the Random Forest algorithm. MLlib provides a RandomForest class with a trainClassifier method that does what is required.
Can I set a threshold value while training on the data set, similar to the cutoff option provided by R's randomForest package?
http://cran.r-project.org/web/packages/randomForest/randomForest.pdf
I found that MLlib's RandomForest class only provides options to pass the number of trees, impurity, number of classes, etc., but there is nothing like a threshold or cutoff option. Can it be done in any way?
The short version is no: if we look at RandomForestClassifier.scala, you can see that it always simply selects the max. You could override the predict function, but it's not super clean. I've added a JIRA to track adding this.
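For reference, the cutoff rule from R's randomForest package can be sketched as a post-processing step on per-class vote fractions: the winning class is the one with the largest ratio of vote fraction to cutoff. The function name `predict_with_cutoff` is hypothetical, and the sketch assumes the vote fractions have already been obtained from the forest's trees; it is not a Spark API.

```python
def predict_with_cutoff(vote_fractions, cutoff):
    """Return the index of the class with the largest votes/cutoff ratio.

    vote_fractions: fraction of trees voting for each class.
    cutoff: one positive cutoff per class; equal cutoffs reduce
            to a plain majority vote.
    (Hypothetical helper illustrating the randomForest cutoff rule.)
    """
    if len(vote_fractions) != len(cutoff):
        raise ValueError("need one cutoff per class")
    ratios = [v / c for v, c in zip(vote_fractions, cutoff)]
    return max(range(len(ratios)), key=ratios.__getitem__)


# Equal cutoffs: class 0 wins with 60% of the votes.
majority = predict_with_cutoff([0.6, 0.4], [0.5, 0.5])

# A lower cutoff for class 1 makes it win with only 40% of the votes.
shifted = predict_with_cutoff([0.6, 0.4], [0.7, 0.3])
```

Applying such a rule in Spark would require access to the individual trees' votes, which is why the answer describes overriding predict as possible but not clean.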

Getting probability of each new observation being an outlier when using scikit-learn OneClassSVM

I'm new to scikit-learn, and SVM methods in general. I've got my data set working well with scikit-learn's OneClassSVM in order to detect outliers; I train the OneClassSVM using observations all of which are 'inliers' and then use predict() to generate binary inlier/outlier predictions on my testing set of data.
However, to continue my analysis I'd like to get the probability of being an outlier for each new observation in my test set. I've noticed other classification methods in scikit-learn offer the ability to pass the parameter probability=True to compute this, but OneClassSVM does not offer this. Is there an easy way to get these results?
I was searching for an answer to the same question until I got to this page. After being stuck for some time, I went back to check the original LIBSVM package, since scikit-learn's OneClassSVM is based on the LIBSVM implementation, as stated here.
At the main page of LIBSVM, they state the following for option '-b' that is used to activate returning probability output scores for some variants of SVM:
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
In other words, the one-class SVM, which is neither an SVC nor an SVR, has no implementation of probability estimation.
If I go and try to force this option (i.e. -b) using the command line interface of LIBSVM, for example:
./svm-train -s 2 -t 2 -b 1 heart_scale
I receive the following error message:
ERROR: one-class SVM probability output not supported yet
In summary, this much-desired output is not yet supported by LIBSVM, and thus scikit-learn does not offer it for the moment. I hope they add this functionality in the near future and that this thread gets updated.
It provides decision function scores, which in theory are the signed distances from the decision boundary between normal points and anomalies. OCSVM does unsupervised classification, which means that what counts as an anomaly inside the algorithm is defined based on the distance to the origin (quoted from Schölkopf's NIPS paper, https://papers.nips.cc/paper/1999/file/8725fb777f25776ffa9076e44fcfd776-Paper.pdf).
TLDR: use
clf.decision_function(samples) * (-1)
as scores. You get a sparse distribution of scores.
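A minimal sketch of the TLDR above, assuming synthetic Gaussian training data and illustrative settings (`kernel="rbf"`, `gamma="scale"`, `nu=0.05`). The negated decision function is used as an outlier score: it is not a probability, but larger values mean "more outlying".

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Train on inliers only (synthetic data, an assumption for the example).
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 2))
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(train)

# One point in the middle of the training cloud, one far away from it.
samples = np.array([[0.0, 0.0], [6.0, 6.0]])

# decision_function is positive for inliers and negative for outliers,
# so negating it gives a score that grows with "outlierness".
scores = -clf.decision_function(samples)
labels = clf.predict(samples)  # +1 for inlier, -1 for outlier
```

If something that looks like a probability is needed, these scores would have to be calibrated or rescaled separately; neither LIBSVM nor scikit-learn does that for the one-class case.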

Resources