Identifying the subject of a conversation using Spark MLlib - apache-spark

I am trying to find the subject of a conversation from a predefined set of subjects stored in a file.
I would like to know whether this is possible using Spark or MLlib?
Thanks for your help in advance.

Have a look at MLlib which supports multinomial naive Bayes, typically used for document classification:
http://spark.apache.org/docs/latest/mllib-naive-bayes.html
Here are more details on how it is implemented:
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
To train the model you will have to prepare some training data, basically a file with the class (the subject) and the frequency of the terms relevant to your classification.
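As an illustration, here is a minimal PySpark sketch. The file name, the tab-separated "subject<TAB>text" layout and the hashing-based term frequencies are assumptions about how you might prepare that training file, not a prescribed format:

    from pyspark import SparkContext
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.feature import HashingTF
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="SubjectClassifier")

    subjects = {"sports": 0.0, "politics": 1.0, "weather": 2.0}  # your predefined subjects
    tf = HashingTF(numFeatures=10000)                            # hashed term frequencies

    def to_labeled_point(line):
        # assumed line layout: "subject<TAB>conversation text"
        subject, text = line.split("\t", 1)
        return LabeledPoint(subjects[subject], tf.transform(text.lower().split()))

    training = sc.textFile("subjects_training.txt").map(to_labeled_point)
    model = NaiveBayes.train(training, lambda_=1.0)

    # classify a new conversation against the predefined subjects
    print(model.predict(tf.transform("who won the match yesterday".split())))

The multinomial model expects non-negative term counts as features, which is exactly what HashingTF produces.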

Related

opennlp sample training data for disease

I'm using OpenNLP for data classification. I could not find a TokenNameFinderModel for diseases here. I know I can create my own model, but I was wondering whether there is any large sample training data available for diseases?
You can easily create your own training data set using the modelbuilder addon and follow the rules mentioned here to create a good NER model.
You can find some help on using the modelbuilder addon here.
Basically, you put all the text in one file and the NER entities in another. The addon searches for each particular entity and replaces it with the required tag, producing the tagged data. The tool is pretty easy to use!
Hope this helps!
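For reference, the tagged data the addon produces follows OpenNLP's name-finder training format: one sentence per line, with each entity wrapped in <START:type> ... <END> markers. The sentences below are invented for illustration:

    The patient was diagnosed with <START:disease> type 2 diabetes <END> last year .
    Early symptoms of <START:disease> malaria <END> include fever and chills .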

How to train a classifier on different feature types together? Like string, numeric, categorical, timestamp, etc.

I am a newbie in the field of machine learning. I have taken Udacity's "Introduction to Machine Learning" course, so I know how to run basic classifiers using sklearn and Python. But all the classifiers they taught in the course were trained on a single data type.
I have a problem wherein I want to classify a code commit as "clean" or "buggy".
I have a feature set which contains string data (like the name of a person), categorical data (say "clean" vs "buggy"), numeric data (like the number of commits) and timestamp data (like the time of a commit). How can I train a classifier on these features simultaneously? Let's assume that I plan on using a Naive Bayes classifier with sklearn. Please help!
I am trying to implement the paper. Any help would really be appreciated.
Many machine learning classifiers like logistic regression, random forest, decision trees and SVM work fine with both continuous and categorical features. My guess is that you have two paths to follow. The first one is data pre-processing: for example, convert all string/categorical data (the name of a person) to integers. The second is ensemble learning.
Ensemble learning is when you combine different classifiers (each one dealing with one kind of heterogeneous feature) using, for example, majority vote, so they can reach a consensus on the classification. Hope it helps.
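As a minimal sketch of the pre-processing route (the commit features and values below are invented for illustration): sklearn's DictVectorizer one-hot encodes string/categorical values and passes numeric values through, and a timestamp can be reduced to numbers such as the hour of day beforehand. Any classifier that handles continuous features, such as a random forest, can then be trained on the result:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.ensemble import RandomForestClassifier

    # invented example commits: strings, numerics, and a timestamp reduced to hour-of-day
    commits = [
        {"author": "alice", "files_changed": 3, "lines_added": 120, "hour_of_day": 23},
        {"author": "bob",   "files_changed": 1, "lines_added": 5,   "hour_of_day": 10},
    ]
    labels = ["buggy", "clean"]

    vec = DictVectorizer(sparse=False)   # strings -> one-hot columns, numbers kept as-is
    X = vec.fit_transform(commits)

    clf = RandomForestClassifier(n_estimators=100).fit(X, labels)

    new_commit = {"author": "alice", "files_changed": 7, "lines_added": 300, "hour_of_day": 2}
    print(clf.predict(vec.transform([new_commit])))

A plain Naive Bayes model would also run on this matrix, but the one-hot columns sit awkwardly with a Gaussian assumption, which is why tree-based models or logistic regression are often a more natural fit for mixed features.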

How to use Mahout classifiers in action?

I would like to classify a bunch of documents using Apache Mahout with a naive Bayes classifier. I do all the pre-processing, convert my training data set into feature vectors and then train the classifier. Now I want to pass a bunch of new instances (to-be-classified instances) to my model in order to classify them.
However, I'm under the impression that the pre-processing must be done for my to-be-classified instances and the training data set together. If so, how can I use the classifier in real-world scenarios where I don't have the to-be-classified instances at the time I'm building my model?
How about Apache Spark? How do things work there? Can I make a classification model and then use it to classify unseen instances later?
As of Mahout 0.10.0, Mahout provides a Spark-backed Naive Bayes implementation which can be run from the CLI, the Mahout shell, or embedded into an application:
http://mahout.apache.org/users/algorithms/spark-naive-bayes.html
Regarding the classification of new documents outside of the training/testing sets, there is a tutorial here:
http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html
It explains how to tokenize (using trivial native Java String methods), vectorize and classify unseen text using the dictionary and the df-count from the training/testing sets.
Please note that the tutorial is meant to be used from the Mahout-Samsara Environment's spark-shell, however the basic idea can be adapted and embedded into an application.
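To the Apache Spark part of the question: yes, you can train a model, persist it, and classify unseen instances later, as long as new documents go through the same feature transformation as the training data. A minimal PySpark sketch (file paths, the tab-separated input layout and the hashing-based vectorization are assumptions for illustration; model save/load requires a reasonably recent Spark release):

    from pyspark import SparkContext
    from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
    from pyspark.mllib.feature import HashingTF
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="DocClassifier")
    tf = HashingTF(numFeatures=1 << 18)   # stateless hashing, so it can be reapplied to unseen docs

    # training time (assumed layout: "label<TAB>document text")
    training = sc.textFile("training.txt") \
        .map(lambda line: line.split("\t", 1)) \
        .map(lambda kv: LabeledPoint(float(kv[0]), tf.transform(kv[1].split())))
    model = NaiveBayes.train(training)
    model.save(sc, "nb_model")            # persist for later use

    # later, possibly in a different application
    loaded = NaiveBayesModel.load(sc, "nb_model")
    print(loaded.predict(tf.transform("text of a brand new document".split())))

If you use a dictionary-based vectorization instead of hashing, you need to persist the dictionary and df-counts alongside the model, which is exactly what the Mahout tutorial above does.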

How to get the probability per instance in classification models in spark.mllib

I'm using spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD} and spark.mllib.tree.RandomForest for classification. Using these packages I produce classification models, but these models only predict a specific class per instance. In Weka, we can get the exact probability of each instance belonging to each class. How can we do that using these packages?
In LogisticRegressionModel we can set the threshold, so I've created a function that checks the results for each point at different thresholds. But this cannot be done for RandomForest (see How to set cutoff while training the data in Random Forest in Spark).
Unfortunately, with MLlib you can't get the probabilities per instance for classification models, at least up to version 1.4.1.
There are JIRA issues (SPARK-4362 and SPARK-6885) concerning this exact topic, which are IN PROGRESS as I write this answer. Nevertheless, the issue seems to have been on hold since November 2014:
There is currently no way to get the posterior probability of a prediction with Naive Baye's model during prediction. This should be made available along with the label.
And here is a note from #sean-owen on the mailing list on a similar topic regarding the Naive Bayes classification algorithm:
This was recently discussed on this mailing list. You can't get the probabilities out directly now, but you can hack a bit to get the internal data structures of NaiveBayesModel and compute it from there.
Reference : source.
MAJOR EDIT: This issue has been resolved with Spark 1.5.0. Please refer to the JIRA issue for more details.
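For LogisticRegressionModel specifically, a workaround that works in recent 1.x releases is to clear the decision threshold so that predict returns the raw score (the sigmoid of the margin, i.e. a probability) instead of a 0/1 label. A minimal PySpark sketch with invented toy data:

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="PerInstanceProbability")

    # invented toy data: label plus two features
    data = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(1.0, [1.0, 0.0]),
        LabeledPoint(1.0, [0.9, 0.1]),
    ])

    model = LogisticRegressionWithSGD.train(data, iterations=100)
    model.clearThreshold()            # predict() now returns a probability in [0, 1]
    print(model.predict([0.8, 0.2]))  # per-instance probability instead of a hard 0/1 label

For RandomForest there is no such switch in the old mllib API, which is what the linked question is about.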

topic modeling on mallet

I'm currently doing topic modeling (beginner).
I was thinking of using MALLET as a tool to help me understand this area, but my problem is this: I'd like to train a model on, let's say, 1000 documents, and then use that model on a single new document to generate its potential topics.
But as far as I can tell from the MALLET tutorial, the tool or API is always applied to a corpus of texts, which means it's used to find topics across several documents.
Is there a way it can find the topics of a single document based on the model (or the inference parameters) it learned/constructed from the 1000 documents?
Is there any other tool that can do this?
Thanks a lot!
You can refer to the example code src/cc/mallet/examples/TopicModel.java, which shows how to cluster the corpus and infer topics for a new instance.
Actually, when you run simple LDA on a directory, the model assigns topic proportions to each document in that directory based on an already trained model built from part of your corpus. So topic proportions are assigned with a certain probability to each document (ranked by the probability that each topic appears in that specific document).
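If you drive MALLET from the command line, the usual route (a rough sketch; file names are placeholders) is to write out a topic inferencer at training time and to import the new document with the same pipe as the training corpus:

    # train on the 1000-document corpus and save an inferencer
    bin/mallet import-dir --input corpus/ --output corpus.mallet --keep-sequence --remove-stopwords
    bin/mallet train-topics --input corpus.mallet --num-topics 50 \
        --output-model lda.model --inferencer-filename inferencer.mallet

    # later: import the single new document with the same pipe, then infer its topics
    bin/mallet import-file --input new-doc.txt --output new-doc.mallet \
        --keep-sequence --use-pipe-from corpus.mallet
    bin/mallet infer-topics --input new-doc.mallet --inferencer inferencer.mallet \
        --output-doc-topics new-doc-topics.txt

The Java route in TopicModel.java is the programmatic equivalent: it obtains a TopicInferencer from the trained model and samples a topic distribution for the new instance.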
