Apache Spark Naive Bayes based Text Classification - apache-spark

I'm trying to use Apache Spark for document classification.
For example, I have two classes (C and J).
The training data is:
C, Chinese Beijing Chinese
C, Chinese Chinese Shanghai
C, Chinese Macao
J, Tokyo Japan Chinese
And the test data is:
Chinese Chinese Chinese Tokyo Japan // Is it J or C?
How can I train on and predict from data like the above? I have done Naive Bayes text classification with Apache Mahout, but not with Apache Spark.
How can I do this with Apache Spark?

Yes, it doesn't look like there is a simple tool for this in Spark yet. But you can do it manually: first create a dictionary of terms, then compute the IDF of each term, and finally convert each document into a vector using the TF-IDF scores.
There is a post on http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ that explains how to do it (with some code as well).
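The manual steps described above (dictionary, IDF per term, TF-IDF vectors) can be sketched in plain Python, without Spark, on the training data from the question. This is only an illustration: the helper names are made up for this example, and the smoothed IDF formula log((m + 1) / (df + 1)) is an assumption chosen to mirror the one documented for MLlib's IDF.

```python
import math
from collections import Counter

docs = [
    "Chinese Beijing Chinese",
    "Chinese Chinese Shanghai",
    "Chinese Macao",
    "Tokyo Japan Chinese",
]
tokenized = [d.split() for d in docs]

# Step 1: dictionary mapping each term to a column index
vocab = {}
for tokens in tokenized:
    for t in tokens:
        vocab.setdefault(t, len(vocab))

# Step 2: document frequency and smoothed IDF per term
# (assumed formula: log((m + 1) / (df + 1)), m = number of documents)
m = len(tokenized)
df = Counter(t for tokens in tokenized for t in set(tokens))
idf = {t: math.log((m + 1) / (df[t] + 1)) for t in vocab}

# Step 3: TF-IDF vector per document (raw term count times IDF)
def tfidf_vector(tokens):
    tf = Counter(tokens)
    vec = [0.0] * len(vocab)
    for t, count in tf.items():
        if t in vocab:  # terms outside the dictionary are skipped
            vec[vocab[t]] = count * idf[t]
    return vec

vectors = [tfidf_vector(tokens) for tokens in tokenized]
```

Note that "Chinese" occurs in every training document, so with this formula its IDF is zero and it contributes nothing to the vectors.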

Spark can do this in a very simple way. The key steps are: (1) use HashingTF to get the term frequencies; (2) convert the data into the form the Naive Bayes model needs.
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.classification.NaiveBayes
// Spark 1.x vector types; on Spark 2.x+ the "features" column holds org.apache.spark.ml.linalg.Vector instead
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.{Row, SQLContext}

def testBayesClassifier(hiveCnt: SQLContext): Unit = {
  val trainData = hiveCnt.createDataFrame(Seq((0, "aa bb aa cc"), (1, "aa dd ee"))).toDF("category", "text")
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val wordsData = tokenizer.transform(trainData)
  val hashTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
  val featureData = hashTF.transform(wordsData) // key step 1
  val trainDataRdd = featureData.select("category", "features").rdd.map {
    case Row(label: Int, features: Vector) => // key step 2
      LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
  }
  // train the model
  val model = NaiveBayes.train(trainDataRdd, lambda = 1.0, modelType = "multinomial")
  // same transformations for the test data (the label -1 is only a placeholder)
  val testData = hiveCnt.createDataFrame(Seq((-1, "aa bb"), (-1, "cc ee ff"))).toDF("category", "text")
  val testWordData = tokenizer.transform(testData)
  val testFeatureData = hashTF.transform(testWordData)
  val testDataRdd = testFeatureData.select("category", "features").rdd.map {
    case Row(label: Int, features: Vector) =>
      LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
  }
  // pairs of (predicted class, placeholder label)
  val testpredictionAndLabel = testDataRdd.map(p => (model.predict(p.features), p.label))
}

You can use MLlib's Naive Bayes classifier for this. An example is given at the link below.
http://spark.apache.org/docs/latest/mllib-naive-bayes.html

There are many classification methods (logistic regression, SVMs, neural networks, LDA, QDA...). You can either implement your own or use the MLlib classification methods (logistic regression and SVM are both implemented in MLlib).
What you need to do is transform your features into a vector, and your labels into doubles.
For example, your dataset will look like:
1, (2,1,0,0,0,0)
1, (2,0,1,0,0,0)
1, (1,0,0,1,0,0)
0, (1,0,0,0,1,1)
And your test vector:
(3,0,0,0,1,1)
Hope this helps
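To make the question's example concrete, here is a rough sketch of multinomial Naive Bayes with add-one (Laplace) smoothing in plain Python, run on the C/J data from the question. It does not use Spark; the function names `fit` and `predict` are invented for this illustration, and it is meant to show the arithmetic rather than how MLlib implements it.

```python
import math
from collections import Counter, defaultdict

train = [
    ("C", "Chinese Beijing Chinese"),
    ("C", "Chinese Chinese Shanghai"),
    ("C", "Chinese Macao"),
    ("J", "Tokyo Japan Chinese"),
]

def fit(samples, smoothing=1.0):
    class_docs = Counter()              # documents per class
    term_counts = defaultdict(Counter)  # term occurrences per class
    vocab = set()
    for label, text in samples:
        class_docs[label] += 1
        for term in text.split():
            term_counts[label][term] += 1
            vocab.add(term)
    total_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Smoothed estimate: (count + s) / (class term total + s * |vocab|)
        denom = sum(term_counts[c].values()) + smoothing * len(vocab)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + smoothing) / denom) for t in vocab
        }
    return log_prior, log_likelihood

def predict(model, text):
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Terms never seen in training are ignored
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in text.split() if t in log_likelihood[c]
        )
    return max(scores, key=scores.get), scores

model = fit(train)
label, scores = predict(model, "Chinese Chinese Chinese Tokyo Japan")
print(label)  # prints "C"
```

With this smoothing the test document is assigned to C: the three occurrences of "Chinese" outweigh the two terms that point to J.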

Related

Handling imbalanced class in Spark

I am experimenting with a credit card fraud detection dataset in Spark MLlib.
The dataset has many 0's (meaning non-fraud) compared to 1's (meaning fraud).
To solve a class imbalance problem like this, is there an algorithm such as SMOTE available in Spark?
I am using logistic regression as the model.
You can try weightCol within logistic regression, something like this:
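from pyspark.sql.functions import countDistinct
from pyspark.ml.classification import LogisticRegression
# Count the rows in each class and attach that count to every row
temp = train.groupby("LabelCol").count()
new_train = train.join(temp, "LabelCol", how="leftouter")
num_labels = train.select(countDistinct("LabelCol")).first()[0]
total = train.count()
# Balanced weight: total / (num_labels * class count), so rarer classes weigh more
train1 = new_train.withColumn("weight", total / (num_labels * new_train["count"]))
# Logistic regression initialization
lr = LogisticRegression(weightCol="weight", family="multinomial")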
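To see what the weight expression in that snippet works out to, here is a small plain-Python sketch of the same formula, total rows / (number of classes * rows in the class). The 98/2 split is made-up data for illustration only, and `balanced_class_weights` is a hypothetical helper, not a Spark function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    num_labels = len(counts)
    return {label: total / (num_labels * n) for label, n in counts.items()}

# Hypothetical data: 98 non-fraud rows (0) and 2 fraud rows (1)
labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)
```

Each fraud row gets a weight of 25.0 and each non-fraud row roughly 0.51, so both classes end up contributing the same total weight to the loss.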

ML Pipeline and metrics: Precision, Recall, AUC-ROC, F1Score

I'm using ML Pipeline, something like:
VectorAssembler assembler = new VectorAssembler()
.setInputCols(columns)
.setOutputCol("features");
LogisticRegression lr = new LogisticRegression().setLabelCol(targetColumn);
lr.setMaxIter(10).setRegParam(0.01).setFeaturesCol("features");
Pipeline logisticRegression = new Pipeline();
logisticRegression.setStages(new PipelineStage[] {assembler, lr});
PipelineModel logisticRegressionModel = logisticRegression.fit(learningData);
What I want is a way to get standard metrics like precision, recall, AUC-ROC, F1 score and accuracy for this model.
I've found BinaryClassificationMetrics - but not sure if it's compatible at all.
RegressionEvaluator seems to return only mse|rmse|r2|mae.
So what is the right way to extract Precision, Recall, etc with ML Pipeline?
A couple of things are missing from Ryan's answer above.
I can confirm the following works (note: my use case was multiclass classification):
val scoredTestSet = model.transform(testSet)
val predictionLabelsRDD = scoredTestSet.select("prediction", "label").rdd.map(r => (r.getDouble(0), r.getDouble(1)))
val multiModelMetrics = new MulticlassMetrics(predictionLabelsRDD)
Once you have scored data, get the prediction and label and pass them to BinaryClassificationMetrics,
something like below (though it's in Scala, I hope it helps):
val scoredTestSet = logisticRegressionModel.transform(testSet)
val predictionLabelsRDD = scoredTestSet.select("prediction", "label").rdd.map(r => (r.getDouble(0), r.getDouble(1)))
val binMetrics = new BinaryClassificationMetrics(predictionLabelsRDD)
// binMetrics.areaUnderROC
There are other examples at https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification
The prediction in this case is 1.0 or 0.0.
You can also extract the probability and use that instead of the prediction, so that binMetrics can report values for multiple thresholds.
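For reference, this is what those metrics mean when computed by hand from (prediction, label) pairs. It is a plain-Python sketch with made-up pairs, not the Spark API; `binary_metrics` is a hypothetical helper written for this illustration.

```python
def binary_metrics(prediction_and_labels, positive=1.0):
    """Precision, recall, F1 and accuracy from (prediction, label) pairs."""
    tp = fp = fn = tn = 0
    for prediction, label in prediction_and_labels:
        if prediction == positive and label == positive:
            tp += 1
        elif prediction == positive:
            fp += 1
        elif label == positive:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical scored rows: (prediction, label)
pairs = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0)]
metrics = binary_metrics(pairs)
print(metrics)
```

AUC-ROC is not covered here because it needs the probability scores rather than the thresholded 1.0/0.0 predictions.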

How to increase the accuracy of neural network model in spark?

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
// Load training data
val data = MLUtils.loadLibSVMFile(sc,"/home/.../neural.txt").toDF()
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
val layers = Array[Int](4, 5, 4, 4)
val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
val model = trainer.fit(train)
// compute precision on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision:" + evaluator.evaluate(predictionAndLabels))
I am using MultilayerPerceptronClassifier to build a neural network in Spark. I am getting 62.5% accuracy. Which parameters should I change to get better accuracy?
As some people have said, the question is too broad and can't be answered without more detail, but some advice (independent of the models/algorithms used or the tools and libraries for implementing them) would be:
Use a cross-validation set and perform some cross-validation with different network architectures.
Plot "learning curves".
Identify whether you have high bias or high variance.
See if you can or need to apply feature scaling and/or normalization.
Do some "error analysis" (manually check which examples failed and evaluate or categorize them to see if you can find a pattern).
Not necessarily in that order, but this can help you identify whether you are underfitting or overfitting, and whether you need more training data, need to add or remove features, add regularization, etc. In summary, perform machine learning debugging.
Hope that helps. You can find more details about this in Andrew Ng's series of videos, starting with this one:
https://www.youtube.com/watch?v=qIfLZAa32H0
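As a rough illustration of the cross-validation advice above, here is how k-fold train/validation splits can be produced in plain Python. The function name and the seed are arbitrary choices for this sketch; in Spark itself you would more likely reach for the CrossValidator in spark.ml.tuning rather than roll your own.

```python
import random

def k_fold_indices(n_samples, k, seed=1234):
    """Yield (training, validation) index lists for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes into the same fold
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_indices(10, 5))
for training, validation in splits:
    print(len(training), len(validation))
```

Each candidate architecture would be trained on `training` and scored on `validation` for every fold, and the averaged score compared across architectures.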

Spark and categorical string variables

I'm trying to understand how spark.ml handles string categorical independent variables. I know that in Spark I have to convert strings to doubles using StringIndexer.
E.g., "a"/"b"/"c" => 0.0/1.0/2.0.
But what I really would like to avoid is then having to use OneHotEncoder on that column of doubles. This seems to make the pipeline unnecessarily messy. Especially since Spark knows that the data is categorical. Hopefully the sample code below makes my question clearer.
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression
val df = sqlContext.createDataFrame(Seq(
Tuple2(0.0,"a"), Tuple2(1.0, "b"), Tuple2(1.0, "c"), Tuple2(0.0, "c")
)).toDF("y", "x")
// index the string column "x"
val indexer = new StringIndexer().setInputCol("x").setOutputCol("xIdx").fit(df)
val indexed = indexer.transform(df)
// build a data frame of label, vectors
val assembler = (new VectorAssembler()).setInputCols(List("xIdx").toArray).setOutputCol("features")
val assembled = assembler.transform(indexed)
// build a logistic regression model and fit it
val logreg = (new LogisticRegression()).setFeaturesCol("features").setLabelCol("y")
val model = logreg.fit(assembled)
The logistic regression sees this as a model with only one independent variable.
model.coefficients
res1: org.apache.spark.mllib.linalg.Vector = [0.7667490491775728]
But the independent variable is categorical with three categories = ["a", "b", "c"]. I know I never did a one of k encoding but the metadata of the data frame knows that the feature vector is nominal.
import org.apache.spark.ml.attribute.AttributeGroup
AttributeGroup.fromStructField(assembled.schema("features"))
res2: org.apache.spark.ml.attribute.AttributeGroup = {"ml_attr":{"attrs":
{"nominal":[{"vals":["c","a","b"],"idx":0,"name":"xIdx"}]},
"num_attrs":1}}
How do I pass this information to LogisticRegression? Is this not the whole point of keeping dataframe metadata? There does not seem to be a CategoricalFeaturesInfo in SparkML. Do I really need to do a 1 of k encoding for each categorical feature?
Maybe I am missing something, but this really looks like the job for RFormula (https://spark.apache.org/docs/latest/ml-features.html#rformula).
As the name suggests, it takes an "R-style" formula that describes how the feature vector is composed from the input data columns.
For each categorical input column (that is, a column of StringType), it adds a StringIndexer + OneHotEncoder to the final pipeline that implements the formula under the hood.
The output is a feature vector (of doubles) that can be used with any algorithm in the org.apache.spark.ml package, such as the one you are targeting.
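To show what the StringIndexer + OneHotEncoder combination amounts to, here is a plain-Python sketch on the "x" column from the question. The ordering rule (most frequent value first, ties broken alphabetically) and dropping the last category are assumptions made here to mimic what I understand to be Spark's defaults; the helper names are invented.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """One-of-k vector; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec

x = ["a", "b", "c", "c"]
index = fit_string_index(x)
encoded = [one_hot(index[v], len(index)) for v in x]
print(index)
print(encoded)
```

With three categories the model then sees two independent 0/1 features instead of a single numeric column, which is why the coefficient vector grows from one entry to two.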

Train Spark k-means with Mahout vectors

I have some Mahout vectors in my hdfs in sequence file format. Is it possible to use the same vectors in some way to train a KMeans model in Spark? I could just convert the existing Mahout vectors into Spark vectors (mllib) but I'd like to avoid that.
Mahout vectors are not directly supported by Spark. You would - along the lines of your concern - need to convert them to Spark Vectors.
val sc = new SparkContext("local[2]", "MahoutTest")
val sfData = sc.sequenceFile[NullWritable, MVector](dir)
val xformedVectors = sfData.map { case (label, vect) =>
import collection.JavaConversions._
(label, Vectors.dense(vect.all.iterator.map{ e => e.get}.toArray))
}

Resources