Spark Multi Label classification - apache-spark

I want to implement a multi-label classification algorithm with multiple outputs in Spark, but I am surprised that none of the models in Spark's machine learning libraries seems to support this.
How can I do this with Spark?
Scikit-learn's Logistic Regression does support multi-label classification for input/output, but it cannot handle a huge training data set.
The scikit-learn code is available at the following link:
https://gist.github.com/mkbouaziz/5bdb463c99ba9da317a1495d4635d0fc

Spark also has Logistic Regression; according to the API documentation it supports multiclass (multinomial) classification, which assigns one class per instance rather than a set of labels. See also this.
The problem you have with scikit-learn on huge training data goes away in Spark, given an appropriate Spark configuration.
Another approach is to train one binary classifier per label and obtain the multi-label output by running a relevant/irrelevant prediction for each label. You can easily do that in Spark with any binary classifier.
Indirectly, multi-label categorization with nearest neighbors might also help; it is also state of the art. There are nearest-neighbor extensions for Spark, such as Spark KNN or Spark KNN graphs.
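To make the one-binary-classifier-per-label idea concrete, here is a minimal, framework-agnostic sketch in plain Python. It is not Spark code: the tiny logistic regression, the function names and the toy data are all made up for illustration. In Spark, each per-label model would instead be one of the built-in binary classifiers trained on a 0/1 column for that label.

```python
import math

def train_binary_logreg(rows, targets, epochs=200, lr=0.5):
    """Fit a tiny logistic regression with plain SGD (illustrative stand-in
    for whatever binary classifier you would actually use)."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)  # last entry is the bias term
    for _ in range(epochs):
        for x, y in zip(rows, targets):
            # zip stops at len(x), so the bias entry is added separately
            z = weights[-1] + sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            for i, v in enumerate(x):
                weights[i] -= lr * err * v
            weights[-1] -= lr * err
    return weights

def predict_binary(weights, x):
    z = weights[-1] + sum(w * v for w, v in zip(weights, x))
    return 1 if z > 0 else 0

def train_binary_relevance(rows, label_sets, all_labels):
    """Train one relevant/irrelevant model per label."""
    models = {}
    for label in all_labels:
        targets = [1 if label in labels else 0 for labels in label_sets]
        models[label] = train_binary_logreg(rows, targets)
    return models

def predict_labels(models, x):
    """The multi-label output is the set of labels whose model says 'relevant'."""
    return {label for label, w in models.items() if predict_binary(w, x) == 1}

# Hypothetical toy data: feature 0 signals "sports", feature 1 signals "politics".
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
label_sets = [{"sports"}, {"politics"}, {"sports", "politics"}, set()]

models = train_binary_relevance(rows, label_sets, ["sports", "politics"])
print(predict_labels(models, [1.0, 1.0]))
```

The per-label models are independent of each other, so they can be trained in parallel, which is what makes this approach a natural fit for a cluster.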

Related

Scaling out sklearn models / xgboost

I wonder whether, and how, it is possible to run sklearn / xgboost model training on a large dataset.
If I use a dataframe of several gigabytes, the machine crashes during training.
Can you assist me, please?
The scikit-learn documentation has an in-depth discussion of different strategies for scaling models to bigger data.
Strategies include:
Streaming instances
Extracting features
Incremental learning (see also the partial_fit entry in the glossary)
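As a rough illustration of how streaming instances and incremental learning fit together, the sketch below reads rows in small chunks and updates a model one chunk at a time, so the full dataset never has to sit in memory. The `OnlinePerceptron` class, `stream_chunks` and `toy_rows` are hypothetical names written for this example; the `partial_fit` method only mimics the shape of the scikit-learn method of the same name and is not the library's implementation.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[List[float], int]  # (features, 0/1 label)

def stream_chunks(rows: Iterable[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Yield fixed-size chunks so only one chunk is held in memory at a time."""
    chunk: List[Row] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

class OnlinePerceptron:
    """Toy linear classifier that learns from one chunk at a time."""

    def __init__(self, n_features: int) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0

    def predict(self, x: List[float]) -> int:
        score = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return 1 if score > 0 else 0

    def partial_fit(self, chunk: List[Row]) -> None:
        # Classic perceptron rule: adjust the weights only on mistakes.
        for x, y in chunk:
            error = y - self.predict(x)
            if error != 0:
                self.weights = [w + error * v for w, v in zip(self.weights, x)]
                self.bias += error

def toy_rows(n: int) -> Iterator[Row]:
    """Stand-in for reading a large file row by row (hypothetical data)."""
    for i in range(n):
        scale = 1.0 + (i % 3)
        yield ([scale, 0.0], 1) if i % 2 == 0 else ([0.0, scale], 0)

model = OnlinePerceptron(n_features=2)
chunks_seen = 0
for chunk in stream_chunks(toy_rows(10), chunk_size=4):
    model.partial_fit(chunk)
    chunks_seen += 1

print(chunks_seen, model.weights, model.bias)
```

With a real dataset the generator would wrap a file or database cursor, and the toy model would be replaced by an estimator that actually supports incremental updates.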

Incremental learning - Set Initial Weights or values for Parameters from previous model for ML algorithm in Spark 2.0

I am trying to set the initial weights or parameters for a machine learning (classification) algorithm in Spark 2.x. Unfortunately, apart from the MultiLayerPerceptron algorithm, no other algorithm provides a way to set the initial weights or parameter values.
I am trying to do incremental learning with Spark: I need to load an old model and re-train it with the new data in the system. How can I do this?
How can I do this for other algorithms like:
Decision Trees
Random Forest
SVM
Logistic Regression
I need to experiment with multiple algorithms and then choose the best-performing one.
How can I do this for other algorithms like:
Decision Trees
Random Forest
You cannot. Tree-based algorithms are not well suited for incremental learning, as they look at the global properties of the data and have no "initial weights or values" that can be used to bootstrap the process.
Logistic Regression
You can use StreamingLogisticRegressionWithSGD, which implements exactly the required process, including setting initial weights with setInitialWeights.
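The idea behind starting from initial weights can be sketched without Spark. The plain-Python example below is only an illustration of warm-starting a logistic regression: `update_weights`, `log_loss` and the two toy batches are hypothetical and are not part of any Spark API. The old model's weights are passed in as the starting point, and gradient descent simply continues from there on the new batch.

```python
import math

def log_loss(weights, batch):
    """Average logistic loss of a weight vector on a batch of (features, label)."""
    total = 0.0
    for x, y in batch:
        z = sum(w * v for w, v in zip(weights, x))
        total += math.log1p(math.exp(-z if y == 1 else z))
    return total / len(batch)

def update_weights(initial_weights, batch, step_size=0.5, iterations=20):
    """Run full-batch gradient descent, starting from the given weights."""
    weights = list(initial_weights)  # copy, so the old model is left untouched
    for _ in range(iterations):
        grad = [0.0] * len(weights)
        for x, y in batch:
            z = sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i, v in enumerate(x):
                grad[i] += (p - y) * v / len(batch)
        weights = [w - step_size * g for w, g in zip(weights, grad)]
    return weights

# Each row is [constant 1.0 for the bias, feature a, feature b] (toy data).
old_batch = [([1.0, 1.0, 0.0], 1), ([1.0, 0.0, 1.0], 0)]
new_batch = [([1.0, 0.9, 0.1], 1), ([1.0, 0.2, 0.8], 0), ([1.0, 0.7, 0.3], 1)]

# "Old model": trained from zeros on the data available at the time.
old_weights = update_weights([0.0, 0.0, 0.0], old_batch)

# Incremental step: the old weights are the initial weights for the new data.
new_weights = update_weights(old_weights, new_batch)

print(log_loss(old_weights, new_batch), log_loss(new_weights, new_batch))
```

This only works for models whose state is a weight vector that gradient steps can refine, which is why the same trick has no direct counterpart for tree-based models.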
SVM
In theory it could be implemented similarly to the streaming regressions StreamingLogisticRegressionWithSGD or StreamingLinearRegressionWithSGD, by extending StreamingLinearAlgorithm, but there is no such built-in implementation, and since org.apache.spark.mllib is in maintenance mode, there won't be.
It's not based on Spark, but there is a C++ incremental decision tree; see gaenari.
Chunks of data can be inserted and updated continuously, and rebuilds can be run if concept drift reduces accuracy.

One Class Classification Models in Spark

Are there any implementations of one-class classifiers in Spark?
There doesn't appear to be anything in ML or MLlib, but I was hoping that there was an extension developed by someone in the community that would provide some way of producing a trained classification model where only one labeled class is available in the training data.
It's Java, not Spark, but LibSVM has a one-class SVM classifier, and calling it from Spark shouldn't be a problem.

In spark mllib, can LogisticRegressionWithSGD do multiple classification tasks?

I want to use LogisticRegressionWithSGD for multiclass classification, but there is no setNumClasses method in org.apache.spark.mllib.classification.LogisticRegressionWithSGD. I know that LogisticRegressionWithLBFGS can do multiclass classification, so why can't LogisticRegressionWithSGD?
Multiclass classification using LogisticRegressionWithSGD() is not supported, though it is a requested feature: https://issues.apache.org/jira/browse/SPARK-10179 . It was decided not to add this feature, since Spark ML, not Spark MLlib, will be the main machine learning API for Spark in the future.

How to use Mahout classifiers in action?

I would like to classify a bunch of documents with Apache Mahout, using a naive Bayes classifier. I do all the pre-processing, convert my training data set into feature vectors and then train the classifier. Now I want to pass a bunch of new instances (to-be-classified instances) to my model in order to classify them.
However, I'm under the impression that the pre-processing must be done for my to-be-classified instances and the training data set together. If so, how can I use the classifier in real-world scenarios where I don't have the to-be-classified instances at the time I'm building my model?
How about Apache Spark? How do things work there? Can I build a classification model and then use it to classify unseen instances later?
As of Mahout 0.10.0, Mahout provides a Spark-backed Naive Bayes implementation which can be run from the CLI, the Mahout shell or embedded into an application:
http://mahout.apache.org/users/algorithms/spark-naive-bayes.html
Regarding the classification of new documents outside of the training/testing sets, there is a tutorial here:
http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html
It explains how to tokenize (using trivial native Java String methods), vectorize and classify unseen text using the dictionary and the df-count from the training/testing sets.
Please note that the tutorial is meant to be used from the Mahout-Samsara environment's spark-shell; however, the basic idea can be adapted and embedded into an application.
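The key point, that unseen text is vectorized with the dictionary and document-frequency counts saved from training, can be sketched in a few lines of plain Python. This is not Mahout code: the function names, the whitespace tokenizer and the smoothed IDF formula are assumptions chosen for illustration, and Mahout's exact weighting may differ.

```python
import math
from collections import Counter

def tokenize(text):
    """Trivial tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def fit_vocabulary(training_docs):
    """Build the term dictionary and document-frequency counts once, at training time."""
    dictionary = {}
    df_count = Counter()
    for doc in training_docs:
        for term in sorted(set(tokenize(doc))):  # sorted for stable indices
            if term not in dictionary:
                dictionary[term] = len(dictionary)
            df_count[term] += 1
    return dictionary, df_count, len(training_docs)

def vectorize(text, dictionary, df_count, n_docs):
    """Turn unseen text into a sparse TF-IDF vector using only training statistics.
    Terms that were never seen during training are dropped."""
    tf = Counter(t for t in tokenize(text) if t in dictionary)
    vector = {}
    for term, count in tf.items():
        # Smoothed IDF, picked for this sketch only.
        idf = math.log((n_docs + 1) / (df_count[term] + 1)) + 1.0
        vector[dictionary[term]] = count * idf
    return vector

# Training time: the new document does not exist yet.
training_docs = ["spark runs fast", "mahout runs on spark"]
dictionary, df_count, n_docs = fit_vocabulary(training_docs)

# Later: an unseen document is mapped into the same feature space.
new_vector = vectorize("spark loves mahout mahout", dictionary, df_count, n_docs)
print(new_vector)
```

Because the dictionary and counts are saved artifacts of training, the to-be-classified instances never need to be pre-processed together with the training set.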
