Printing ClusterID and its elements using Spark KMeans algo. - apache-spark

I have this program, which prints the WSSSE (Within Set Sum of Squared Errors) of the KMeans algorithm on Apache Spark. There are 20 clusters generated. I am trying to print each cluster ID together with the elements that were assigned to it. How do I loop over the cluster IDs to print the elements?
Thank you guys!!
val sc = new SparkContext("local", "KMeansExample","/usr/local/spark/", List("target/scala-2.10/kmeans_2.10-1.0.jar"))
// Load and parse the data
val data = sc.textFile("kmeans.csv")
val parsedData = data.map( s => Vectors.dense(s.split(',').map(_.toDouble)))
// Cluster the data into 20 clusters using KMeans
val numIterations = 20
val numClusters = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)
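val clusterCenters = clusters.clusterCenters.map(_.toArray)
// Format each center explicitly; concatenating an Array directly only prints its object reference
println("The Cluster Centers are = " + clusterCenters.map(_.mkString("[", ", ", "]")).mkString(", "))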
// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

As far as I know, you should run predict on each element (Java example):
KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
List<Vector> vectors = parsedData.collect();
for(Vector vector: vectors){
System.out.println("cluster "+clusters.predict(vector) +" "+vector.toString());
}
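To print the elements grouped under each cluster ID, rather than one line per element, the per-element assignments can be grouped by ID. The following is a minimal, Spark-free Python sketch of that idea. `nearest_center` is a hypothetical stand-in for what `predict` does (it picks the index of the closest center by squared Euclidean distance), and the points and centers are made-up illustration data.

```python
from collections import defaultdict


def nearest_center(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def group_by_cluster(points, centers):
    """Map each cluster ID to the list of points assigned to it."""
    groups = defaultdict(list)
    for p in points:
        groups[nearest_center(p, centers)].append(p)
    return dict(groups)


# Made-up illustration data: two centers, four points.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (9.0, 9.5), (0.5, -0.5), (11.0, 10.0)]

clusters = group_by_cluster(points, centers)
for cluster_id in sorted(clusters):
    print("cluster", cluster_id, clusters[cluster_id])
```

In Spark itself the same grouping would typically be done on the RDD, for example by keying each vector with its predicted ID, so that the full data set does not have to be collected to the driver first.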

Related

Spark Find the Occurrences of Matched Strings

How can I find the number of occurrences of the matched string in the code snippet below? I am able to get the filtered strings as output, but not the occurrences.
import org.apache.spark._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object WordCount {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("wordCount")
val sc = new SparkContext(conf)
// Load our input data.
val input = sc.textFile("file:///tmp/ganesh/*")
val matched_pattern = input.filter(line => line.contains("Title"))
// Split it up into words.
val words = matched_pattern.flatMap(line => line.split(" "))
// Transform into pairs and count.
val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
// Save the word count back out to a text file, causing evaluation.
counts.saveAsTextFile("file:///tmp/sparkout")
}
}
Here is an example that uses a broadcast variable. Note that stopWords is, in fact, the set of words to include.
val dfsFilename = "/FileStore/tables/7dxa9btd1477497663691/Text_File_01-880f5.txt"
val readFileRDD = spark.sparkContext.textFile(dfsFilename)
// res4: Array[String] = Array(The the is Is a A to To OK ok I) //stopWords
val stopWordsInput = spark.sparkContext.textFile("/FileStore/tables/filter_words.txt")
val stopWords = stopWordsInput.flatMap(x => x.split(" ")).map(_.trim).collect.toSet
val broadcasted = spark.sparkContext.broadcast(stopWords)
val wcounts1 = readFileRDD.map(x => (x.replaceAll("[^A-Za-z0-9]", " ")
.trim.toLowerCase))
.flatMap(line=>line.split(" "))
.filter(broadcasted.value.contains(_))
.map(word=>(word, 1))
.reduceByKey(_ + _)
wcounts1.collect
returns:
res2: Array[(String, Int)] = Array((The,1), (I,3), (to,1), (the,1))
You can embellish this by broadcasting stopWords, which is what I did.
I saw your XML input and a replaceAll. You can adjust that to your liking. I also added a clause to convert everything to lower case.
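The counting logic itself is independent of Spark. As a rough illustration, here is a plain-Python sketch of the same steps (replace non-alphanumerics with spaces, lower-case, split, keep only the included words, count). The function name `count_included_words` and the sample lines are assumptions made up for this example, not part of the original answer.

```python
import re
from collections import Counter


def count_included_words(lines, include_words):
    """Count occurrences of each word from include_words across all lines."""
    counts = Counter()
    for line in lines:
        # Same cleaning idea as the replaceAll above: non-alphanumerics become spaces.
        cleaned = re.sub(r"[^A-Za-z0-9]", " ", line).strip().lower()
        for word in cleaned.split():
            if word in include_words:
                counts[word] += 1
    return counts


# Made-up sample input resembling XML-ish lines.
lines = [
    "<Title>The Spark Book</Title>",
    "no match here",
    "Another title: the end",
]

counts = count_included_words(lines, {"title", "the"})
print(counts["title"], counts["the"])
```

Because the text is lower-cased before filtering, the words in the include set need to be lower case as well.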

How can I select balanced sampling for binary classification?

Here is my code, which loads data from Hive and balances the sample:
// Load SubSet Data
val dataList = DataLoader.loadSubTrainTestData(hiveContext.sql(sampleDataHql))
// Split Data to Train and Test
val data = dataList.randomSplit(Array(0.7, 0.3), seed = 11L)
// Random balance train data
val sampleCount = data(0).map(rec => (rec.label, 1)).reduceByKey(_ + _)
val positiveSample = data(0).filter(_.label == 1).cache()
val positiveSize = positiveSample.count()
val negativeSample = data(0).filter(_.label == 0).cache()
val negativeSize = negativeSample.count()
// Build train data
val trainData = positiveSample ++
negativeSample.sample(withReplacement = false, 1.0 * positiveSize.toFloat / negativeSize, System.nanoTime())
// Data size
val trainDataSize = positiveSize + negativeSize
val testDataSize = trainDataSize * 3.0 / 7.0
I then calculate trainDataSize and testDataSize in order to evaluate the model's confidence.
OK, I haven't tested this code, but it should go like this:
val data: RDD[LabeledPoint] = ???
val fractions: Map[Double, Double] = Map(0.0 -> 0.5, 1.0 -> 0.5)
val sampledData: RDD[LabeledPoint] = data
.keyBy(_.label)
.sampleByKeyExact(false, fractions) // Optionally with seed
.values
You can convert your LabeledPoint RDD into a pair RDD keyed by label, then apply sampleByKeyExact with the fractions you wish to use.
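Note that the fractions passed to sampleByKeyExact are per-key sampling rates, so 0.5 for both labels keeps about half of each class and leaves the class ratio unchanged. To actually balance the classes, the rate for the larger class has to be derived from the class counts. A minimal plain-Python sketch of that arithmetic follows; the helper name `balancing_fractions` and the counts are made up for illustration.

```python
def balancing_fractions(class_counts):
    """Per-class sampling rates that downsample every class to the smallest class size."""
    smallest = min(class_counts.values())
    return {label: smallest / count for label, count in class_counts.items()}


# Made-up class counts: 9000 negatives, 1000 positives.
class_counts = {0.0: 9000, 1.0: 1000}

fractions = balancing_fractions(class_counts)
expected_sizes = {
    label: round(class_counts[label] * rate) for label, rate in fractions.items()
}

print(fractions)
print(expected_sizes)
```

The resulting map (all positives kept, roughly one in nine negatives kept) is the kind of value one would pass as the fractions argument, assuming downsampling of the majority class is the desired balancing strategy.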

Spark: How to get probabilities and AUC for Bernoulli Naive Bayes?

I'm running a Bernoulli Naive Bayes using code:
val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")
My question is: how can I get the probability of membership in class 0 (or 1) and compute the AUC? I want to get a result similar to LogisticRegressionWithSGD or SVMWithSGD, where I was using this code:
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
model.clearThreshold()
// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
val prediction = model.predict(point.features)
(prediction, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
Unfortunately this code isn't working for NaiveBayes.
Concerning the probabilities for Bernoulli Naive Bayes, here is an example:
// Building dummy data
val data = sc.parallelize(List("0,1 0 0", "1,0 1 0", "1,0 0 1", "0,1 0 1","1,1 1 0"))
// Transforming dummy data into LabeledPoint
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// Prepare data for training
val splits = parsedData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")
// labels
val labels = model.labels
// Probabilities for all feature vectors
val features = parsedData.map(lp => lp.features)
model.predictProbabilities(features).take(10) foreach println
// For one specific vector, I'm taking the first vector in the parsedData
val testVector = parsedData.first.features
println(s"For vector ${testVector} => probability : ${model.predictProbabilities(testVector)}")
As for the AUC :
// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
val prediction = model.predict(point.features)
(prediction, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
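One caveat: for Naive Bayes, predict returns hard class labels, so the AUC above is computed from a single operating point. Feeding the probability of class 1 as the score gives a ranking-based AUC instead. The plain-Python sketch below illustrates the difference on made-up data; `auc_from_scores` is a hypothetical helper that computes AUC as the probability that a random positive outranks a random negative, and it is not Spark's implementation.

```python
def auc_from_scores(score_label_pairs):
    """AUC as P(random positive scores higher than random negative); ties count 0.5."""
    pos = [s for s, y in score_label_pairs if y == 1.0]
    neg = [s for s, y in score_label_pairs if y == 0.0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up (P(class 1), true label) pairs.
scored = [(0.9, 1.0), (0.6, 1.0), (0.55, 0.0), (0.4, 1.0), (0.2, 0.0)]
# The same examples after thresholding the score at 0.5 into hard labels.
hard = [(1.0 if s >= 0.5 else 0.0, y) for s, y in scored]

auc_scores = auc_from_scores(scored)
auc_hard = auc_from_scores(hard)
print(auc_scores, auc_hard)
```

On this toy data the score-based AUC is higher than the one from hard labels, because thresholding discards the ordering among examples that fall on the same side of 0.5.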
Concerning the inquiry from the chat :
val results = parsedData.map { lp =>
val probs: Vector = model.predictProbabilities(lp.features)
(for (i <- 0 to (probs.size - 1)) yield ((lp.label, labels(i), probs(i))))
}.flatMap(identity)
results.take(10).foreach(println)
// (0.0,0.0,0.59728640251696)
// (0.0,1.0,0.40271359748304003)
// (1.0,0.0,0.2546873180388961)
// (1.0,1.0,0.745312681961104)
// (1.0,0.0,0.47086939671877026)
// (1.0,1.0,0.5291306032812298)
// (0.0,0.0,0.6496075621805428)
// (0.0,1.0,0.3503924378194571)
// (1.0,0.0,0.4158585282373076)
// (1.0,1.0,0.5841414717626924)
and if you are only interested in the argmax classes :
val results = training.map { lp =>
val probs: Vector = model.predictProbabilities(lp.features)
val bestClass = probs.argmax
(labels(bestClass), probs(bestClass))
}
results.take(10) foreach println
// (0.0,0.59728640251696)
// (1.0,0.745312681961104)
// (1.0,0.5291306032812298)
// (0.0,0.6496075621805428)
// (1.0,0.5841414717626924)
Note: Works with Spark 1.5+
EDIT: (for Pyspark users)
It seems some people are having trouble getting probabilities using pyspark and mllib. That is expected: spark-mllib doesn't expose that function for pyspark.
Thus you'll need to use the spark-ml DataFrame-based API:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import NaiveBayes
df = spark.createDataFrame([
Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
nb = NaiveBayes(smoothing=1.0, modelType="bernoulli")
model = nb.fit(df)
model.transform(df).show(truncate=False)
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |features |label|rawPrediction |probability |prediction|
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |[0.0,0.0]|0.0 |[-1.4916548767777167,-2.420368128650429] |[0.7168141592920354,0.28318584070796465]|0.0 |
# |[0.0,1.0]|0.0 |[-1.4916548767777167,-3.1135153092103742]|[0.8350515463917526,0.16494845360824742]|0.0 |
# |[1.0,0.0]|1.0 |[-2.5902671654458262,-1.7272209480904837]|[0.29670329670329676,0.7032967032967034]|1.0 |
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
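You then just need to select the relevant output column together with the label and compute your AUC (use probability or rawPrediction rather than the hard prediction if you want a score-based AUC).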
For more information about Naive Bayes in spark-ml, please refer to the official documentation.

Registering kmeans model as UDF

Hi, I am trying to use a Spark KMeans model to predict the cluster number, but when I register it as a UDF and use it in SQL it throws a
java.lang.reflect.InvocationTargetException
def findCluster(s:String):Int={
model.predict(feautarize(s))
}
I am calling it as follows:
%sql select findCluster((text)) from tweets
The same function works if I call it directly:
findCluster("hello am vishnu")
output 1
It is impossible to reproduce the problem with the code you've provided. Assuming that model is an org.apache.spark.mllib.clustering.KMeansModel, here is a step-by-step solution.
First, let's import the required libraries and set the RNG seed:
import scala.util.Random
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
Random.setSeed(0L)
Generate random train set:
// Generate random training set
val trainData = sc.parallelize((1 to 1000).map { _ =>
val off = if(Random.nextFloat > 0.5) 0.5 else -0.5
Vectors.dense(Random.nextFloat + off, Random.nextFloat + off)
})
Run KMeans
// Train KMeans with 2 clusters
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(trainData, numClusters, numIterations)
Create UDF
// Create broadcast variable with model and prediction function
val model = sc.broadcast(clusters)
def findCluster(v: org.apache.spark.mllib.linalg.Vector):Int={
model.value.predict(v)
}
// Register UDF
sqlContext.udf.register("findCluster", findCluster _)
Prepare test set
// Create test set
case class Coord(v: org.apache.spark.mllib.linalg.Vector)
val testData = sqlContext.createDataFrame(sc.parallelize((1 to 100).map { _ =>
val off = if(Random.nextFloat > 0.5) 0.5 else -0.5
Coord(Vectors.dense(Random.nextFloat + off, Random.nextFloat + off))
}))
// Register test set df
testData.registerTempTable("testData")
// Check if it works
sqlContext.sql("SELECT findCluster(v) FROM testData").take(1)
Result:
res3: Array[org.apache.spark.sql.Row] = Array([1])

Spark: How to perform prediction using trained data set (MLLIB: SVMWithSGD)

I am new to Spark. I am able to train a model on my data set, but I am not able to use the trained model to make predictions.
Here is the code to train on the data, which is an 1800x4000 matrix.
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/myfile.txt")
val parsedData = data.map { line =>
val parts = line.split(' ')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
val firstDataPoint = parsedData.take(1)(0)
// Building the model
val numIterations = 100
val model = SVMWithSGD.train(parsedData, numIterations)
//val model = LinearRegressionWithSGD.train(parsedData,numIterations)
val labelAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
println("Training Error = " + trainErr)
Now I load the data to be used for the prediction. The data is a vector of 1800 values:
val test = sc.textFile("data/mllib/ridge-data/data.txt")
But not sure how to perform prediction using this data. Please help.
First, load the labeled points from the text file (keep in mind that the RDD must have been saved with saveAsTextFile):
JavaRDD<LabeledPoint> test = MLUtils.loadLabeledPoints(init.context, "hdfs://../test/", 30).toJavaRDD();
JavaRDD<Tuple2<Object, Object>> scoreAndLabels = test.map(
new Function<LabeledPoint, Tuple2<Object, Object>>() {
public Tuple2<Object, Object> call(LabeledPoint p) {
Double score = model.predict(p.features());
return new Tuple2<Object, Object>(score, p.label());
}
}
);
Now collect the scores and iterate over them:
List<Tuple2<Object, Object>> scores = scoreAndLabels.collect();
for(Tuple2<Object, Object> score : scores){
System.out.println(score._1 + " \t" + score._2);
}
It is in Java, but maybe you can convert it :)
But the prediction values do not make sense:
-18.841544889249917 0.0
168.32916035523283 1.0
420.67763915879794 1.0
-974.1942589201286 0.0
71.73602841256813 1.0
233.13636224524993 1.0
-1000.5902168199027 0.0
Does anybody know what they mean?
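One possible reading, not confirmed in the thread: the first column looks like the raw margin of a linear model (weights · features + intercept) rather than a 0/1 class, which is what a linear SVM reports when no decision threshold is applied. In the listed output every negative value is paired with label 0.0 and every positive value with label 1.0, which is consistent with that reading. The plain-Python sketch below shows the relationship; the weights, intercept and feature values are made up, and `predict` here is a hypothetical helper, not Spark's API.

```python
def margin(weights, intercept, features):
    """Raw linear score: dot(weights, features) + intercept."""
    return sum(w * x for w, x in zip(weights, features)) + intercept


def predict(weights, intercept, features, threshold=0.0):
    """Return the raw margin if threshold is None, else 1.0 or 0.0 by comparing to it."""
    m = margin(weights, intercept, features)
    if threshold is None:
        return m
    return 1.0 if m > threshold else 0.0


# Made-up model parameters and two example feature vectors.
weights = [2.0, -1.0]
intercept = 0.5
examples = [[3.0, 1.0], [-1.0, 4.0]]

raw = [predict(weights, intercept, x, threshold=None) for x in examples]
classes = [predict(weights, intercept, x) for x in examples]
print(raw, classes)
```

Under this assumption, thresholding the raw values at 0 would turn the listed scores into class predictions.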
