Train Spark k-means with Mahout vectors

I have some Mahout vectors on HDFS in sequence file format. Is it possible to use those same vectors in some way to train a KMeans model in Spark? I could just convert the existing Mahout vectors into Spark vectors (mllib), but I'd like to avoid that.

Mahout vectors are not directly supported by Spark. As you suspected, you will need to convert them to Spark Vectors, for example:
// imports assumed: Mahout's math Vector is aliased here as MVector
import collection.JavaConversions._
import org.apache.hadoop.io.NullWritable
import org.apache.mahout.math.{Vector => MVector}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext("local[2]", "MahoutTest")
val sfData = sc.sequenceFile[NullWritable, MVector](dir)
val xformedVectors = sfData.map { case (label, vect) =>
  (label, Vectors.dense(vect.all.iterator.map(e => e.get).toArray))
}
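If the end goal is just to train k-means on those converted vectors, a minimal follow-up sketch (the values of k and maxIterations here are arbitrary assumptions):

import org.apache.spark.mllib.clustering.KMeans

val kmeansModel = KMeans.train(xformedVectors.map(_._2), k = 3, maxIterations = 20)
// kmeansModel.clusterCenters holds the learned centroids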

Related

How to convert spark mllib word2vec model to glove txt format?

I use Spark MLlib to train a domain specific word2vec model and I need to use it in glove word2vec format.
How can I convert it to glove txt format?
After experimenting a bit in the Spark shell, I found the code below works for me:
import java.io.{BufferedWriter, FileWriter}

val vectors = model.getVectors // Map[String, Array[Float]]
val writer = new BufferedWriter(new FileWriter(file))
vectors.foreach { case (word, vec) => writer.write(word + " " + vec.mkString(" ") + "\n") }
writer.close()

ML Pipeline and metrics: Precision, Recall, AUC-ROC, F1Score

I'm using ML Pipeline, something like:
VectorAssembler assembler = new VectorAssembler()
.setInputCols(columns)
.setOutputCol("features");
LogisticRegression lr = new LogisticRegression().setLabelCol(targetColumn);
lr.setMaxIter(10).setRegParam(0.01).setFeaturesCol("features");
Pipeline logisticRegression = new Pipeline();
logisticRegression.setStages(new PipelineStage[] {assembler, lr});
PipelineModel logisticRegressionModel = logisticRegression.fit(learningData);
What I want is a way to get standard metrics like Precision, Recall, AUC-ROC, F1-score, and Accuracy for this model.
I've found BinaryClassificationMetrics, but I'm not sure whether it's compatible at all.
RegressionEvaluator seems to return only mse|rmse|r2|mae.
So what is the right way to extract Precision, Recall, etc. with an ML Pipeline?
A couple of things are missing from the other answer (below). I can confirm the following works (note: my use case was multiclass classification):

import org.apache.spark.mllib.evaluation.MulticlassMetrics

val scoredTestSet = model.transform(testSet)
val predictionAndLabelsRDD = scoredTestSet.select("prediction", "label").rdd.map(r => (r.getDouble(0), r.getDouble(1)))
val multiModelMetrics = new MulticlassMetrics(predictionAndLabelsRDD)
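From there the usual multiclass numbers can be read straight off the metrics object; a hedged usage note (these accessors come from the MulticlassMetrics API):

multiModelMetrics.weightedPrecision
multiModelMetrics.weightedRecall
multiModelMetrics.fMeasure(1.0) // F1 for the class labelled 1.0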
Once you have scored the data, select the prediction and label columns and pass them to BinaryClassificationMetrics.
Something like the code below (it's in Scala, but I hope it helps):

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val scoredTestSet = logisticRegressionModel.transform(testSet)
val predictionAndLabels = scoredTestSet.select("prediction", "label").rdd.map(r => (r.getDouble(0), r.getDouble(1)))
val binMetrics = new BinaryClassificationMetrics(predictionAndLabels)
// binMetrics.areaUnderROC
Other examples are in https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification
The prediction in this case is 1.0 or 0.0.
You can also extract the probability and use that instead of the prediction, so that binMetrics can report data for multiple thresholds.
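A minimal sketch of that probability-based variant, assuming the scored DataFrame has the usual "probability" column holding a vector of per-class probabilities (the Vector type import below matches the MLlib-style code used elsewhere in this thread):

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector

// score = probability of the positive class (index 1), paired with the true label
val scoreAndLabels = scoredTestSet.select("probability", "label").rdd.map { r =>
  (r.getAs[Vector](0)(1), r.getDouble(1))
}
val thresholdMetrics = new BinaryClassificationMetrics(scoreAndLabels)
thresholdMetrics.precisionByThreshold().collect()
thresholdMetrics.recallByThreshold().collect()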

Spark: use each Array[Vector] from RDD[(Int, Array[Vector])] in MLlib k-means clustering

I have vectors grouped by id in an RDD like this: RDD[(Int, Array[Vector])]. I want to cluster each group of vectors separately, one clustering per id.
The MLlib k-means algorithm requires an RDD[Vector] as its argument:
val kmean = new KMeans().setK(3)
  .setEpsilon(100)
  .setMaxIterations(10)
  .setInitializationMode("k-means||")
  .setSeed(System.currentTimeMillis())
But obviously, when I map over my RDD, I get an Array[Vector] that is not wrapped in an RDD:
// does not work, since e._2 is an Array[Vector], not an RDD[Vector]!
rdd.map(e => kmean.run(e._2))
So the question is: how can I perform such clustering?
Thanks in advance for any help!
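One hedged workaround sketch, assuming the number of distinct ids is small enough to loop over: run a separate (small) k-means job per id, since KMeans.run itself insists on an RDD[Vector].

// hedged sketch: one KMeans job per id; only sensible for a modest number of ids
val ids = rdd.keys.distinct().collect()
val modelsById = ids.map { id =>
  val groupVectors = rdd.filter(_._1 == id).flatMap(_._2) // RDD[Vector] for this id
  id -> kmean.run(groupVectors)
}.toMap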

Linear Regression on Apache Spark

We have a situation where we have to run linear regression on millions of small datasets and store the weights and intercept for each of them. I wrote the Scala code below to do so, feeding each dataset in as a row of an RDD and then trying to run the regression on each one (data is the RDD with a (label, features) pair in each row; in this case we have one feature per label):
val x = data.flatMap { line => line.split(' ') }.map { line =>
  val parts = line.split(',')
  val parsedData1 = LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  val model = LinearRegressionWithSGD.train(sc.parallelize(List(parsedData1)), 100) // using parallelize to get an RDD
  (model.intercept, model.weights)
}
The problem here is that LinearRegressionWithSGD expects an RDD as input, and nested RDDs are not supported in Spark. I chose this approach because all these datasets can be processed independently of each other, and I wanted to distribute the work (which rules out looping).
Can you suggest whether I can use other types (Arrays, Lists, etc.) as the input dataset to LinearRegressionWithSGD, or a better approach that still distributes such computations in Spark?
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

val modelList = for (item <- dataSet) yield {
  val data = MLUtils.loadLibSVMFile(context, item).cache()
  val model = LinearRegressionWithSGD.train(data, 100) // numIterations is a required argument
  model
}
Maybe you can split your input data into several files and store them in HDFS.
Using the directory of those files as input, you then get a list of models.
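If looping over files is too slow, a different hedged sketch: with a single feature per label, each tiny dataset can be fitted with the closed-form simple-regression formulas inside one map, which keeps the work distributed without nesting RDDs. The datasets RDD below, holding one dataset of (label, feature) pairs per row, is an assumption.

// hedged sketch, not LinearRegressionWithSGD: ordinary least squares per dataset
// datasets: RDD[Array[(Double, Double)]], each array being one dataset's (label, feature) pairs
val results = datasets.map { points =>
  val n = points.length.toDouble
  val meanX = points.map(_._2).sum / n
  val meanY = points.map(_._1).sum / n
  val slope = points.map { case (y, x) => (x - meanX) * (y - meanY) }.sum /
              points.map { case (_, x) => (x - meanX) * (x - meanX) }.sum
  val intercept = meanY - slope * meanX
  (intercept, slope) // the same quantities as model.intercept and model.weights(0)
}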

Apache Spark Naive Bayes based Text Classification

I'm trying to use Apache Spark for document classification.
For example, I have two classes (C and J).
Train data is :
C, Chinese Beijing Chinese
C, Chinese Chinese Shanghai
C, Chinese Macao
J, Tokyo Japan Chinese
And test data is :
Chinese Chinese Chinese Tokyo Japan // is it J or C?
How can I train a model and predict on data like the above? I have done Naive Bayes text classification with Apache Mahout, but not with Apache Spark.
How can I do this with Apache Spark?
It doesn't look like there is a simple, ready-made tool for that in Spark yet, but you can do it manually: first build a dictionary of terms, then compute an IDF for each term, and finally convert each document into a vector using the TF-IDF scores.
There is a post on http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ that explains how to do it (with some code as well).
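A minimal sketch of that manual route, using MLlib's HashingTF and IDF (the toy documents and the feature-space size are assumptions):

import org.apache.spark.mllib.feature.{HashingTF, IDF}

val docs = sc.parallelize(Seq("Chinese Beijing Chinese", "Tokyo Japan Chinese")).map(_.split(" ").toSeq)
val tf = new HashingTF(1000).transform(docs) // term frequencies hashed into 1000 buckets
tf.cache()
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf) // RDD[Vector], ready to wrap in LabeledPoints for NaiveBayes.train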
Spark can do this in a very simple way. The key steps are: 1) use HashingTF to get the term frequencies; 2) convert the data to the form the Naive Bayes model needs.
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.{Row, SQLContext}

def testBayesClassifier(hiveCnt: SQLContext) {
  val trainData = hiveCnt.createDataFrame(Seq((0, "aa bb aa cc"), (1, "aa dd ee"))).toDF("category", "text")
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val wordsData = tokenizer.transform(trainData)
  val hashTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
  val featureData = hashTF.transform(wordsData) // key step 1: term frequencies
  val trainDataRdd = featureData.select("category", "features").map {
    case Row(label: Int, features: Vector) => // key step 2: LabeledPoint form
      LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
  }
  // train the model
  val model = NaiveBayes.train(trainDataRdd, lambda = 1.0, modelType = "multinomial")
  // apply the same transformations to the test data
  val testData = hiveCnt.createDataFrame(Seq((-1, "aa bb"), (-1, "cc ee ff"))).toDF("category", "text")
  val testWordData = tokenizer.transform(testData)
  val testFeatureData = hashTF.transform(testWordData)
  val testDataRdd = testFeatureData.select("category", "features").map {
    case Row(label: Int, features: Vector) =>
      LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
  }
  val testPredictionAndLabel = testDataRdd.map(p => (model.predict(p.features), p.label))
}
You can use MLlib's Naive Bayes classifier for this. A sample example is given in the link:
http://spark.apache.org/docs/latest/mllib-naive-bayes.html
There are many classification methods (logistic regression, SVMs, neural networks, LDA, QDA, ...); you can either implement your own or use MLlib's classification methods (logistic regression and SVMs, for instance, are already implemented in MLlib).
What you need to do is transform your features into a vector and your labels into doubles.
For example, with 1 = class C, 0 = class J, and the columns ordered (Chinese, Beijing, Shanghai, Macao, Tokyo, Japan), your dataset will look like:
1, (2,1,0,0,0,0)
1, (2,0,1,0,0,0)
1, (1,0,0,1,0,0)
0, (1,0,0,0,1,1)
And your test vector:
(3,0,0,0,1,1)
Hope this helps
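A hedged end-to-end sketch of that encoding with MLlib's NaiveBayes (the column order and the label mapping 1.0 = C, 0.0 = J follow the example above; everything else is an assumption):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val train = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2, 1, 0, 0, 0, 0)),
  LabeledPoint(1.0, Vectors.dense(2, 0, 1, 0, 0, 0)),
  LabeledPoint(1.0, Vectors.dense(1, 0, 0, 1, 0, 0)),
  LabeledPoint(0.0, Vectors.dense(1, 0, 0, 0, 1, 1))
))
val model = NaiveBayes.train(train, lambda = 1.0)
model.predict(Vectors.dense(3, 0, 0, 0, 1, 1)) // 1.0 means class C, 0.0 means class J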
