How to use Mahout classifiers in action? - apache-spark

I would like to classify a bunch of documents using Apache Mahout and by using a naive bayes classifier. I do all the pre-processing and convert my training data set into feature vector and then train the classifier. Now I want to pass a bunch of new instances (to-be-classified instances) to my model in order to classify them.
However, I'm under the impression that the pre-processing must be done for my to-be-classified instances and the training data set together? If so, how come I can use the classifier in real world scenarios where I don't have the to-be-classified instances at the time I'm building my model?
How about Apache Spark? Howe thing work there? Can I make a classification model and the use it to classify unseen instances later?

As of Mahout 0.10.0, Mahout provides a Spark backed Naive Bayes implementation which can be run from the CLI, the Mahout shell or embedded into an application:
http://mahout.apache.org/users/algorithms/spark-naive-bayes.html
Regarding the classification of new documents outside of the training/testing sets, there is a tutorial here:
http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html
Which explains how to tokenize (using trival java native String methods), vectorize and classify unseen text using the dictionary and the df-count from the training/testing sets.
Please note that the tutorial is meant to be used from the Mahout-Samsara Environment's spark-shell, however the basic idea can be adapted and embedded into an application.

Related

Incremental learning - Set Initial Weights or values for Parameters from previous model for ML algorithm in Spark 2.0

I am trying for setting the initial weights or parameters for a machine learning (Classification) algorithm in Spark 2.x. Unfortunately, except for MultiLayerPerceptron algorithm, no other algorithm is providing a way to set the initial weights/parameter values.
I am trying to solve Incremental learning using spark. Here, I need to load old model re-train the old model with new data in the system. How can I do this?
How can I do this for other algorithms like:
Decision Trees
Random Forest
SVM
Logistic Regression
I need to experiment multiple algorithms and then need to choose the best performing one.
How can I do this for other algorithms like:
Decision Trees
Random Forest
You cannot. Tree based algorithms are not well suited for incremental learning, as they look at the global properties of the data and have no "initial weights or values" that can be used to bootstrap the process.
Logistic Regression
You can use StreamingLogisticRegressionWithSGD which exactly implements required process, including setting initial weights with setInitialWeights.
SVM
In theory it could be implemented similarly to streaming regression StreamingLogisticRegressionWithSGD or StreamingLinearRegressionWithSGD, by extending StreamingLinearAlgorithm, but there is no such implementation built-in, ans since org.apache.spark.mllib is in a maintanance mode, there won't be.
It's not based on spark, but there is a C++ incremental decision tree.
see gaenari.
Continuous chunking data can be inserted and updated, and rebuilds can be run if concept drift reduces accuracy.

Spark Multi Label classification

I am looking to implement with Spark, a multi label classification algorithm with multi output, but I am surprised that there isn’t any model in Spark Machine Learning libraries that can do this.
How can I do this with Spark ?
Otherwise Scikit Learn Logistic Regresssion support multi label classification in input/output , but doesn't support a huge data for training.
to view the code in scikit learn, please click on the following link:
https://gist.github.com/mkbouaziz/5bdb463c99ba9da317a1495d4635d0fc
Also in Spark there is Logistic Regression that supports multilabel classification based on the api documentation. See also this.
The problem that you have on scikitlearn for the huge amount of training data will disappear with spark, using an appropriate Spark configuration.
Another approach is to use binary classifiers for each of the labels that your problem has, and get multilabel by running relevant-irrelevant predictions for that label. You can easily do that in Spark using any binary classifier.
Indirectly, what might also be of help, is to use multilabel categorization with nearest-neighbors, which is also state-of-the-art. Some nearest neighbors Spark extensions, like Spark KNN or Spark KNN graphs, for instance.

How train a classifier on different feature types together? Like String,numeric,Categorical, timestamp etc

I am a newbie in field of machine Learning. I have taken Udacity's "Introduction to Machine Learning" course. So I know running basic classifiers using sklearn and python. But all the classifiers they taught in the course was trained on a single data type.
I have a problem wherein I want to classify a code commit as "clean" or "buggy".
I have a feature set which contains String data (like name of person), Categorical data (say "clean" vs "buggy"), numeric data (like no. of commits) and timestamp data (like time of commit). How can I train a classifier based on these three features simultaneously. Lets assuming that I plan on using a Naive Bayes classifier and sklearn. Please Help!
I am trying to implement the paper. Any help would really be appreciable.
Many machine learning classifiers like logistic regression, random forest, decision trees and SVM work fine with both continuous and categorical features. My guess is that you have two paths to follow. The first one is data pre-processing. For example, convert all string/cateogorical data (name of a person) to integers or you can use ensemble learning.
Ensemble learning is when you combine different classifiers (each one dealing with one kind of heterogeneous feature) using majority vote, for example, so they can find a consensus in classification. Hope it helps.

Extract CNN features using Caffe and train using SVM

I want to extract features using caffe and train those features using SVM. I have gone through this link: http://caffe.berkeleyvision.org/gathered/examples/feature_extraction.html. This links provides how we can extract features using caffenet. But I want to use Lenet architecture here. I am unable to change this line of command for Lenet:
./build/tools/extract_features.bin models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel examples/_temp/imagenet_val.prototxt fc7 examples/_temp/features 10 leveldb
And also, after extracting the features, how to train these features using SVM? I want to use python for this. For eg: If I get features from this code:
features = net.blobs['pool2'].data.copy()
Then, how can I train these features using SVM by defining my own classes?
You have two questions here:
Extracting features using LeNet
Training an SVM
Extracting features using LeNet
To extract the features from LeNet using the extract_features.bin script you need to have the model file (.caffemodel) and the model definition for testing (.prototxt).
The signature of extract_features.bin is here:
Usage: extract_features pretrained_net_param feature_extraction_proto_file extract_feature_blob_name1[,name2,...] save_feature_dataset_name1[,name2,...] num_mini_batches db_type [CPU/GPU] [DEVICE_ID=0]
So if you take as an example val prototxt file this one (https://github.com/BVLC/caffe/blob/master/models/bvlc_alexnet/train_val.prototxt), you can change it to the LeNet architecture and point it to your LMDB / LevelDB. That should get you most of the way there. Once you did that and get stuck, you can re-update your question or post a comment here so we can help.
Training SVM on top of features
I highly recommend using Python's scikit-learn for training an SVM from the features. It is super easy to get started, including reading in features saved from Caffe's format.
Very lagged reply, but should help.
Not 100% what you want, but I have used the VGG-16 net to extract face features using caffe and perform a accuracy test on a small subset of the LFW dataset. Exactly what you needed is in the code. The code creates classes for training and testing and pushes them into the SVM for classification.
https://github.com/wajihullahbaig/VGGFaceMatching

NLTK NER: Continuous Learning

I have been trying to use NER feature of NLTK. I want to extract such entities from the articles. I know that it can not be perfect in doing so but I wonder if there is human intervention in between to manually tag NEs, will it improve?
If yes, is it possible with present model in NLTK to continually train the model. (Semi-Supervised Training)
The plain vanilla NER chunker provided in nltk internally uses maximum entropy chunker trained on the ACE corpus. Hence it is not possible to identify dates or time, unless you train it with your own classifier and data(which is quite a meticulous job).
You could refer this link for performing he same.
Also, there is a module called timex in nltk_contrib which might help you with your needs.
If you are interested to perform the same in Java better look into Stanford SUTime, it is a part of Stanford CoreNLP.

Resources