Are there any implementations of one-class classifiers in Spark?
There doesn't appear to be anything in ML or MLlib, but I was hoping someone in the community had developed an extension that provides some way of producing a trained classification model when only one labeled class is available in the training data.
It's Java, not Spark, but LibSVM has a one-class SVM classifier, and calling it from Spark shouldn't be a problem.
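For illustration, here is a minimal sketch of that route, assuming the (single-class) training vectors are collected to the driver, since LibSVM itself is single-machine; the hyperparameters and the broadcast-based scoring are illustrative, not a definitive recipe:

    import libsvm.{svm, svm_model, svm_node, svm_parameter, svm_problem}
    import org.apache.spark.mllib.linalg.Vector

    // Convert an MLlib vector into LibSVM's node format.
    def toNodes(v: Vector): Array[svm_node] =
      v.toArray.zipWithIndex.map { case (value, i) =>
        val n = new svm_node; n.index = i + 1; n.value = value; n
      }

    // Train a one-class SVM on the driver (LibSVM is not distributed).
    def trainOneClass(features: Array[Vector]): svm_model = {
      val prob = new svm_problem
      prob.l = features.length
      prob.x = features.map(toNodes)
      prob.y = Array.fill(features.length)(1.0) // labels are ignored for one-class
      val param = new svm_parameter
      param.svm_type = svm_parameter.ONE_CLASS
      param.kernel_type = svm_parameter.RBF
      param.gamma = 0.1  // illustrative hyperparameters; tune for your data
      param.nu = 0.05
      param.cache_size = 100
      param.eps = 1e-3
      svm.svm_train(prob, param)
    }

    // Scoring distributed data: svm_predict returns +1 (inlier) or -1 (outlier).
    // val model = trainOneClass(trainingRdd.collect())
    // val bc = sc.broadcast(model) // assumes your LibSVM version's svm_model is serializable
    // val scored = dataRdd.map(v => svm.svm_predict(bc.value, toNodes(v)))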
Related
I want to use LogisticRegressionWithSGD for multiclass classification, but there is no setNumClasses method in org.apache.spark.mllib.classification.LogisticRegressionWithSGD. I know that LogisticRegressionWithLBFGS can do multiclass classification, but why can't LogisticRegressionWithSGD?
Multiclass classification using LogisticRegressionWithSGD() is not supported, though it has been requested: https://issues.apache.org/jira/browse/SPARK-10179. It was decided not to add this feature, since Spark ML (not Spark MLlib) will be the main machine learning API for Spark going forward.
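For reference, a minimal sketch of the LBFGS route, assuming an RDD[LabeledPoint] with labels 0 to numClasses - 1:

    import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Multiclass logistic regression: LogisticRegressionWithLBFGS exposes
    // setNumClasses; LogisticRegressionWithSGD is binary-only.
    def trainMulticlass(training: RDD[LabeledPoint], numClasses: Int): LogisticRegressionModel =
      new LogisticRegressionWithLBFGS()
        .setNumClasses(numClasses)
        .run(training)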
I am looking to implement, with Spark, a multi-label classification algorithm with multiple outputs, but I am surprised that there isn't any model in the Spark machine learning libraries that can do this.
How can I do this with Spark?
Alternatively, scikit-learn's LogisticRegression supports multi-label classification in input/output, but it doesn't scale to huge training data.
To view the scikit-learn code, see the following link:
https://gist.github.com/mkbouaziz/5bdb463c99ba9da317a1495d4635d0fc
Also, Spark's LogisticRegression supports multiclass (multinomial) classification, according to the API documentation.
The scaling problem you have with scikit-learn on a huge amount of training data will disappear with Spark, given an appropriate Spark configuration.
Another approach is binary relevance: use one binary classifier for each of the labels in your problem, and obtain multi-label output by running a relevant/irrelevant prediction for each label. You can easily do that in Spark using any binary classifier.
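A minimal sketch of that binary-relevance idea with the DataFrame-based spark.ml API, assuming a DataFrame with a "features" column and one 0.0/1.0 column per label (the column names are illustrative):

    import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
    import org.apache.spark.sql.DataFrame

    // Binary relevance: one independent binary classifier per label.
    def trainBinaryRelevance(train: DataFrame, labelCols: Seq[String]): Map[String, LogisticRegressionModel] =
      labelCols.map { label =>
        val lr = new LogisticRegression()
          .setFeaturesCol("features")
          .setLabelCol(label) // each label column holds 0.0 or 1.0
        label -> lr.fit(train)
      }.toMap

    // A document receives every label whose classifier predicts "relevant":
    // val predictions = models.map { case (label, m) => label -> m.transform(test) }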
Indirectly, multi-label categorization with nearest neighbors might also be of help; it is likewise a state-of-the-art approach. There are nearest-neighbor Spark extensions, such as Spark KNN or Spark KNN graphs, for instance.
I am currently working with Spark MLlib.
I have created a text classifier using the Gradient Boosting algorithm with the GradientBoostedTrees class (see Gradient Boosted Trees in the documentation).
Currently I obtain predictions for the class of new elements, but I would like to obtain the class probabilities (the value of the output before the hard decision).
In other MLlib algorithms, such as logistic regression, you can remove the threshold from the classifier to obtain the class probabilities, but I cannot find a way to do the same with GradientBoostedTrees.
As far as I know, it's not currently possible, but it is possible with random forest.
You can see this link, where I have explained a procedure: Predicting probabilities of classes in case of Gradient Boosting Trees in Spark using the tree output
In order to implement the predicted probabilities and thresholds, one needs to write a program using the trees from the
print(model.toDebugString)
output. I worked through how the trees produce a prediction, which is fairly simple to reproduce outside Spark.
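A minimal sketch of the same idea done programmatically, assuming a binary GradientBoostedTreesModel trained with log loss: sum the weighted raw tree outputs to get a margin, then map it through the logistic function. The factor of 2 follows the {-1, +1} label convention used by MLlib's LogLoss; treat this as a sketch to verify against your own model, not an official API:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

    // Recover the raw (pre-threshold) output and turn it into a probability.
    def probabilityOfPositive(model: GradientBoostedTreesModel, features: Vector): Double = {
      // Margin = weighted sum of the individual regression trees' outputs.
      val margin = model.trees.zip(model.treeWeights)
        .map { case (tree, weight) => weight * tree.predict(features) }
        .sum
      // Logistic transform; the 2x scaling matches log loss on {-1, +1} labels.
      1.0 / (1.0 + math.exp(-2.0 * margin))
    }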
It seems that in Spark MLlib it is not possible to obtain the class probabilities.
You can only obtain the final classification decision.
That's a pity, because that information would be very useful (classifying a sample as positive with 99.99% probability is not the same as with 51%), and it is not difficult to obtain once the model has been trained.
An alternative is to use different software, such as XGBoost: https://github.com/dmlc/xgboost
I would like to classify a bunch of documents using Apache Mahout with a naive Bayes classifier. I do all the pre-processing, convert my training data set into feature vectors, and then train the classifier. Now I want to pass a bunch of new instances (to-be-classified instances) to my model in order to classify them.
However, I'm under the impression that the pre-processing must be done for my to-be-classified instances and the training data set together. If so, how can I use the classifier in real-world scenarios, where I don't have the to-be-classified instances at the time I build my model?
How about Apache Spark? How do things work there? Can I build a classification model and then use it to classify unseen instances later?
As of Mahout 0.10.0, Mahout provides a Spark-backed naive Bayes implementation which can be run from the CLI, the Mahout shell, or embedded into an application:
http://mahout.apache.org/users/algorithms/spark-naive-bayes.html
Regarding the classification of new documents outside of the training/testing sets, there is a tutorial here:
http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html
It explains how to tokenize (using trivial native Java String methods), vectorize, and classify unseen text using the dictionary and the df-count from the training/testing sets.
Please note that the tutorial is meant to be run from the Mahout-Samsara environment's spark-shell; however, the basic idea can be adapted and embedded into an application.
I am trying to find the subject of a conversation from a predefined set of subjects stored in a file.
I would like to know if this is possible using Spark or MLlib?
Thanks for your help in advance.
Have a look at MLlib, which supports multinomial naive Bayes, typically used for document classification:
http://spark.apache.org/docs/latest/mllib-naive-bayes.html
Here are more details on how it is implemented:
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
To train the model you will have to prepare some training data: basically a file with the class (the subject) and the frequencies of the terms relevant to your classification.
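A minimal sketch of that pipeline in the spark-shell, using hashed term frequencies as features; the file name, the tab-separated "subjectId<TAB>text" format, and the numeric subject labels are illustrative assumptions:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.LabeledPoint

    val tf = new HashingTF(numFeatures = 10000)

    // One conversation per line: "subjectId<TAB>text".
    val training = sc.textFile("training.tsv").map { line =>
      val Array(label, text) = line.split("\t", 2)
      LabeledPoint(label.toDouble, tf.transform(text.toLowerCase.split("\\s+")))
    }

    val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

    // Classify a new conversation: returns the predicted subject id.
    val subject = model.predict(tf.transform("new conversation text".toLowerCase.split("\\s+")))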