generating vector from text data for KMeans using spark - apache-spark

I am new to Spark and machine learning. I am trying to cluster some data like the following using KMeans:
1::Hi How are you
2::I am fine, how about you
In this data the separator is :: and the actual text to cluster is in the second column.
After reading the official Spark documentation and numerous articles I have written the following code, but I am not able to generate the vector to provide as input to the KMeans.train step.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
val sc = new SparkContext("local", "test")
val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
val rawData = sc.textFile("data/mllib/KM.txt").map(line => line.split("::")(1))
val sentenceData = rawData.toDF("sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
val clusters = KMeans.train(featurizedData, 2, 10)
I am getting the following error:
<console>:27: error: type mismatch;
found : org.apache.spark.sql.DataFrame
(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
val clusters = KMeans.train(featurizedData, 2, 10)
Please suggest how to process the input data for KMeans.
Thanks in advance.
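The type mismatch is because mllib's KMeans.train expects an RDD[org.apache.spark.mllib.linalg.Vector], while the ml HashingTF returns a DataFrame whose feature column holds ml vectors. If you want to stay with the mllib API, one option is to pull the vector column out of the DataFrame and convert it; this is only a sketch, assuming Spark 2.x (where the ml and mllib vector types differ):
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
// Extract the hashed term-frequency vectors and convert ml -> mllib vectors
val vectorRdd = featurizedData
  .select("rawFeatures")
  .rdd
  .map(row => MLlibVectors.fromML(row.getAs[MLVector]("rawFeatures")))
val clusters = KMeans.train(vectorRdd, 2, 10)
The alternative, shown below, is to move entirely to the ml Pipeline API.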

Finally I got it working after replacing the following code:
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
val clusters = KMeans.train(featurizedData, 2, 10)
With
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val kmeans = new KMeans().setK(2).setFeaturesCol("features").setPredictionCol("prediction")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, kmeans))
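Note that the pipeline above still has to be fit and applied before you get cluster assignments; a minimal sketch (assuming org.apache.spark.ml.Pipeline and org.apache.spark.ml.clustering.KMeans are imported, and sentenceData is the DataFrame built earlier):
// Fit all stages (tokenizer -> hashingTF -> kmeans) in one go
val model = pipeline.fit(sentenceData)
// Each sentence gets a cluster id in the "prediction" column
model.transform(sentenceData).select("sentence", "prediction").show(false)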

Related

Spark ML - prediction in KMeans

I have created a KMeans model using Spark ML methods:
val kmeans = new KMeans()
val model = kmeans.fit(df)
I got my model ready, but how do I predict which cluster new data points will fall into? In MLlib, model.predict(Vector) predicts the cluster for new data points. I saw the transform method on the model, but it was not working for me.
Thanks to Jacek Laskowski and Oli for clarifying. It's working fine for me now; it was a simple mistake. Below is the whole code.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}

val conf = new SparkConf().setMaster("local").setAppName("ml Kmeans")
val spark = SparkSession.builder().config(conf).getOrCreate()
import spark.implicits._
// Read the training JSON and keep only the sensor fields
val trainingData = spark.read.json(spark.sparkContext.wholeTextFiles("file:/home/iot/data/traingJson.json").values)
val parsedData = trainingData.select("value.humidity", "value.speed", "value.temperature", "value.vibration")
// Assemble the four numeric columns into a single "features" vector
val assembler = new VectorAssembler().setInputCols(Array("humidity", "speed", "temperature", "vibration")).setOutputCol("features")
val df = assembler.transform(parsedData)
val kmeans = new KMeans()
val model = kmeans.fit(df)
model.write.save("file:/home/iot/data/model1")
//--------------------------------Testing the Model------------------------
val uploadModel = KMeansModel.load("file:/home/iot/data/model1")
val testData = spark.read.json(spark.sparkContext.wholeTextFiles("file:/home/iot/data/testJson.json").values).select("value.humidity", "value.speed", "value.temperature", "value.vibration")
val testDf = assembler.transform(testData)
// transform() adds a "prediction" column with the assigned cluster for each row
uploadModel.transform(testDf).show(false)
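If you also want to inspect what was learned, the cluster centres are available directly on the loaded model; for example:
// Print the centre of each cluster found during training
uploadModel.clusterCenters.foreach(println)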

Why does StandardScaler not attach metadata to the output column?

I noticed that the ml StandardScaler does not attach metadata to the output column:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._
val df = spark.read.option("header", true)
.option("inferSchema", true)
.csv("/path/to/cars.data")
val strId1 = new StringIndexer()
.setInputCol("v7")
.setOutputCol("v7_IDX")
val strId2 = new StringIndexer()
.setInputCol("v8")
.setOutputCol("v8_IDX")
val assmbleFeatures: VectorAssembler = new VectorAssembler()
.setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7_IDX"))
.setOutputCol("featuresRaw")
val scalerModel = new StandardScaler()
.setInputCol("featuresRaw")
.setOutputCol("scaledFeatures")
val plm = new Pipeline()
.setStages(Array(strId1, strId2, assmbleFeatures, scalerModel))
.fit(df)
val dft = plm.transform(df)
dft.schema("scaledFeatures").metadata
Gives:
res1: org.apache.spark.sql.types.Metadata = {}
This example works on this dataset (just adapt path in code above).
Is there a specific reason for this? Is it likely that this feature will be added to Spark in the future? Any suggestions for a workaround that does not include duplicating the StandardScaler?
While discarding metadata is probably not the most fortunate choice, scaling indexed categorical features doesn't make any sense. Values returned by the StringIndexer are just labels.
If you want to scale numerical features, it should be a separate stage:
val numericAssembler: VectorAssembler = new VectorAssembler()
.setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6"))
.setOutputCol("numericFeatures")
val scaler = new StandardScaler()
.setInputCol("numericFeatures")
.setOutputCol("scaledNumericFeatures")
val finalAssembler: VectorAssembler = new VectorAssembler()
.setInputCols(Array("scaledNumericFeatures", "v7_IDX"))
.setOutputCol("features")
new Pipeline()
.setStages(Array(strId1, strId2, numericAssembler, scaler, finalAssembler))
.fit(df)
Keeping in mind concerns raised at the beginning of this answer, you can also try copying the metadata:
val result = plm.transform(df).transform(df =>
  df.withColumn(
    "scaledFeatures",
    $"scaledFeatures".as(
      "scaledFeatures",
      df.schema("featuresRaw").metadata)))
result.schema("scaledFeatures").metadata
{"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"v0"},{"idx":1,"name":"v1"},{"idx":2,"name":"v2"},{"idx":3,"name":"v3"},{"idx":4,"name":"v4"},{"idx":5,"name":"v5"},{"idx":6,"name":"v6"}],"nominal":[{"vals":["ford","chevrolet","plymouth","dodge","amc","toyota","datsun","vw","buick","pontiac","honda","mazda","mercury","oldsmobile","peugeot","fiat","audi","chrysler","volvo","opel","subaru","saab","mercedes","renault","cadillac","bmw","triumph","hi","capri","nissan"],"idx":7,"name":"v7_IDX"}]},"num_attrs":8}}

Spark ML - create a features vector from new data element to predict on

tl;dr
I have fit a LinearRegression model in Spark 2.10 - after using StringIndexer and OneHotEncoder I have a ~44 element features vector. For a new bit of data I'd like to get a prediction on, how can I create a features vector from the new data element?
More Detail
First, this is completely contrived example to learn how to do this. Using logs with the fields:
"elapsed_time", "api_name", "method", and "status_code"
We will create a model of label elapsed_time and use the other fields as our feature set. The complete code will be shared below.
Steps - condensed
Read in our data to a DataFrame
Index each of our features using StringIndexer
OneHotEncode indexed features with OneHotEncoder
Create our features vector with VectorAssembler
Split data into training and testing sets
Fit the model & predict on test data
Results were horrible, but like I said this is a contrived exercise...
What I need to learn how to do
If a new log entry came in to a streaming application for example, how would I go about creating a feature vector from the new data and pass it in to predict()?
A new log entry might be:
{api_name":"/sample_api_1/v2","method":"GET","status_code":"200","elapsed_time":39}
Post VectorAssembler
status_code_vector
(14,[0],[1.0])
api_name_vector
(27,[0],[1.0])
method_vector
(3,[0],[1.0])
features vector
(44,[0,14,41],[1.0,1.0,1.0])
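(For reference, the (size,[indices],[values]) display above is Spark's sparse vector format; the assembled 44-element vector could be written out by hand like this, just to illustrate what the VectorAssembler produced.)
import org.apache.spark.ml.linalg.Vectors
// 44 slots in total, with 1.0 at positions 0, 14 and 41
val assembled = Vectors.sparse(44, Array(0, 14, 41), Array(1.0, 1.0, 1.0))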
Le Code
%spark
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler, StringIndexerModel, VectorSlicer}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame
val logs = sc.textFile("/Users/z001vmk/data/sample_102M.txt")
val dfLogsRaw: DataFrame = spark.read.json(logs)
val dfLogsFiltered = dfLogsRaw.filter("status_code != 314").drop("extra_column")
// Create DF with our fields of concern.
val dfFeatures: DataFrame = dfLogsFiltered.select("elapsed_time", "api_name", "method", "status_code")
// Contrived goal:
// Use elapsed time as our label given features api_name, status_code, & method.
// Train model on small (100Mb) dataset
// Be able to predict elapsed_time given a new record similar to this example:
// --> {"api_name":"/sample_api_1/v2","method":"GET","status_code":"200","elapsed_time":39}
// Indexers
val statusCodeIdxr: StringIndexer = new StringIndexer().setInputCol("status_code").setOutputCol("status_code_idx").setHandleInvalid("skip")
val apiNameIdxr: StringIndexer = new StringIndexer().setInputCol("api_name").setOutputCol("api_name_idx").setHandleInvalid("skip")
val methodIdxr: StringIndexer = new StringIndexer().setInputCol("method").setOutputCol("method_idx").setHandleInvalid("skip")
// Index features:
val dfIndexed0: DataFrame = statusCodeIdxr.fit(dfFeatures).transform(dfFeatures)
val dfIndexed1: DataFrame = apiNameIdxr.fit(dfIndexed0).transform(dfIndexed0)
val indexed: DataFrame = methodIdxr.fit(dfIndexed1).transform(dfIndexed1)
// OneHotEncoders
val statusCodeEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(statusCodeIdxr.getOutputCol).setOutputCol("status_code_vec")
val apiNameEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(apiNameIdxr.getOutputCol).setOutputCol("api_name_vec")
val methodEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(methodIdxr.getOutputCol).setOutputCol("method_vec")
// Encode feature vectors
val encoded0: DataFrame = statusCodeEncoder.transform(indexed)
val encoded1: DataFrame = apiNameEncoder.transform(encoded0)
val encoded: DataFrame = methodEncoder.transform(encoded1)
// Limit our dataset to necessary elements:
val dataset0 = encoded.select("elapsed_time", "status_code_vec", "api_name_vec", "method_vec").withColumnRenamed("elapsed_time", "label")
// Assemble feature vectors
val assembler: VectorAssembler = new VectorAssembler().setInputCols(Array("status_code_vec", "api_name_vec", "method_vec")).setOutputCol("features")
val dataset1 = assembler.transform(dataset0)
dataset1.show(5,false)
// Prepare the dataset for training (optional):
val dataset: DataFrame = dataset1.select("label", "features")
dataset.show(3,false)
val Array(training, test) = dataset.randomSplit(Array(0.8, 0.2))
// Create our Linear Regression Model
val lr: LinearRegression = new LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setLabelCol("label").setFeaturesCol("features")
val lrModel = lr.fit(training)
val predictions = lrModel.transform(test)
predictions.show(20,false)
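If you want a quick number on how badly the contrived model does, a RegressionEvaluator over the test predictions is one option (a sketch; it relies on the default "label"/"prediction" column names used above):
import org.apache.spark.ml.evaluation.RegressionEvaluator
// RMSE of the test-set predictions
val rmse = new RegressionEvaluator().setMetricName("rmse").evaluate(predictions)
println(s"RMSE on test data: $rmse")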
This can all be pasted into a Zeppelin notebook if you're interested.
Wrapping up
So, what I've been scouring around for is how to transform new data into the ~44-element feature vector shown above and use the model fit to the training data to get a prediction. I suspect there is metadata either held in the model itself, or that would need to be maintained from the StringIndexers in this case, but that's what I cannot find.
Very happy to be pointed to docs or examples - all help appreciated.
Thank you!
Short answer: Pipeline models.
Just to make sure you understand, though, you don't want to create your model when you start an app, if you don't have to. Unless you're going to use DataSets and feedback, it's just silly. Create your model in a Spark Submit session (or use a notebook session like Zeppelin) and save it down. That's doing your data science.
Most DS guys hand the model over, and let the DevOps/Data Engineers use it. All they have to do is call a .predict() on the object after it's been loaded into memory.
After going down the road of using a PipelineModel, this became quite simple. Hat tip to #tadamhicks for getting me to look at pipelines sooner rather than later.
Below is an updated code block that performs basically the same model creation, fit, and prediction as above but does so using pipelines and has an added bit where we predict on a newly created DataFrame to simulate how to predict on new data.
There is likely a cleaner way to rename/create our label column, but we'll leave that as a future enhancement.
%spark
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler, StringIndexerModel, VectorSlicer}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame
val logs = sc.textFile("/data/sample_102M.txt")
val dfLogsRaw: DataFrame = spark.read.json(logs)
val dfLogsFiltered = dfLogsRaw.filter("status_code != 314").drop("extra_column")
.select("elapsed_time", "api_name", "method", "status_code","cache_status")
.withColumnRenamed("elapsed_time", "label")
val Array(training, test) = dfLogsFiltered.randomSplit(Array(0.8, 0.2))
// Indexers
val statusCodeIdxr: StringIndexer = new StringIndexer().setInputCol("status_code").setOutputCol("status_code_idx").setHandleInvalid("skip")
val apiNameIdxr: StringIndexer = new StringIndexer().setInputCol("api_name").setOutputCol("api_name_idx").setHandleInvalid("skip")
val methodIdxr: StringIndexer = new StringIndexer().setInputCol("method").setOutputCol("method_idx").setHandleInvalid("skip")//"cache_status"
val cacheStatusIdxr: StringIndexer = new StringIndexer().setInputCol("cache_status").setOutputCol("cache_status_idx").setHandleInvalid("skip")
// OneHotEncoders
val statusCodeEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(statusCodeIdxr.getOutputCol).setOutputCol("status_code_vec")
val apiNameEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(apiNameIdxr.getOutputCol).setOutputCol("api_name_vec")
val methodEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(methodIdxr.getOutputCol).setOutputCol("method_vec")
val cacheStatusEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(cacheStatusIdxr.getOutputCol).setOutputCol("cache_status_vec")
// Vector Assembler
val assembler: VectorAssembler = new VectorAssembler().setInputCols(Array("status_code_vec", "api_name_vec", "method_vec", "cache_status_vec")).setOutputCol("features")
val lr: LinearRegression = new LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setLabelCol("label").setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(statusCodeIdxr, apiNameIdxr, methodIdxr, cacheStatusIdxr, statusCodeEncoder, apiNameEncoder, methodEncoder, cacheStatusEncoder, assembler, lr))
val plModel: PipelineModel = pipeline.fit(training)
plModel.write.overwrite().save("/tmp/spark-linear-regression-model")
plModel.transform(test).select("label", "prediction").show(5,false)
val dataElement: String = """{"api_name":"/sample_api/v2","method":"GET","status_code":"200","cache_status":"MISS","elapsed_time":39}"""
val newDataRDD = spark.sparkContext.makeRDD(dataElement :: Nil)
val newData = spark.read.json(newDataRDD).withColumnRenamed("elapsed_time", "label")
val loadedPlModel = PipelineModel.load("/tmp/spark-linear-regression-model")
loadedPlModel.transform(newData).select("label", "prediction").show
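In a streaming-style setup, the single predicted value can be pulled straight out of the transformed DataFrame; a small sketch, assuming exactly one row as in the example record:
// Extract the predicted elapsed_time for the single new record
val predictedElapsed = loadedPlModel.transform(newData)
  .select("prediction")
  .head()
  .getDouble(0)
println(s"Predicted elapsed_time: $predictedElapsed")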

How to cross validate RandomForest model?

I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same or do I have to perform cross validation manually?
Spark ML provides the CrossValidator class, which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed, you can add cross-validation as follows:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// [label: double, features: vector]
val trainingData: org.apache.spark.sql.DataFrame = ???
val nFolds: Int = ???
val numTrees: Int = ???
val metric: String = ???
val rf = new RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
.setNumTrees(numTrees)
val pipeline = new Pipeline().setStages(Array(rf))
val paramGrid = new ParamGridBuilder().build() // No parameter search
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
// "f1" (default), "weightedPrecision", "weightedRecall", "accuracy"
.setMetricName(metric)
val cv = new CrossValidator()
// ml.Pipeline with ml.classification.RandomForestClassifier
.setEstimator(pipeline)
// ml.evaluation.MulticlassClassificationEvaluator
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(nFolds)
val model = cv.fit(trainingData) // trainingData: DataFrame
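Once fitting finishes, the returned CrossValidatorModel exposes the winning model and the averaged metric per parameter map, which is usually what you want to report; for example:
// Best model found by cross-validation (here a PipelineModel)
val bestModel = model.bestModel
// Average value of the chosen metric for each ParamMap in the grid
println(model.avgMetrics.mkString(", "))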
Using PySpark:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
trainingData = ... # DataFrame[label: double, features: vector]
numFolds = ... # Integer
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
evaluator = MulticlassClassificationEvaluator() # + other params as in Scala
pipeline = Pipeline(stages=[rf])
paramGrid = (ParamGridBuilder()
.addGrid(rf.numTrees, [3, 10])
.addGrid(...) # Add other parameters
.build())
crossval = CrossValidator(
estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=evaluator,
numFolds=numFolds)
model = crossval.fit(trainingData)
To build on zero323's great answer using Random Forest Classifier, here is a similar example for Random Forest Regressor:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.regression.RandomForestRegressor // CHANGED
import org.apache.spark.ml.evaluation.RegressionEvaluator // CHANGED
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}
val numFolds = ??? // Integer
val data = ??? // DataFrame
// Training (80%) and test data (20%)
val Array(train, test) = data.randomSplit(Array(0.8,0.2))
// Use every column except the label as an input feature
val featuresCols = data.columns.filterNot(_ == "events")
val va = new VectorAssembler()
  .setInputCols(featuresCols)
  .setOutputCol("rawFeatures")
val vi = new VectorIndexer()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMaxCategories(5)
val regressor = new RandomForestRegressor()
regressor.setLabelCol("events")
val metric = "rmse"
val evaluator = new RegressionEvaluator()
.setLabelCol("events")
.setPredictionCol("prediction")
// "rmse" (default): root mean squared error
// "mse": mean squared error
// "r2": R2 metric
// "mae": mean absolute error
.setMetricName(metric)
// Wrap feature assembly, indexing and the regressor in a single Pipeline
// so CrossValidator applies every stage to each fold
val pipeline = new Pipeline().setStages(Array(va, vi, regressor))
val paramGrid = new ParamGridBuilder().build() // No parameter search
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(numFolds)
val model = cv.fit(train) // train: DataFrame
val predictions = model.transform(test)
predictions.show
val rmse = evaluator.evaluate(predictions)
println(rmse)
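If you also want to see which features drive the forest, the fitted regressor can be dug out of the best pipeline; a sketch based on the stage order used above:
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.RandomForestRegressionModel
// The last stage of the winning pipeline is the fitted forest
val bestPipeline = model.bestModel.asInstanceOf[PipelineModel]
val forest = bestPipeline.stages.last.asInstanceOf[RandomForestRegressionModel]
println(forest.featureImportances)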
Evaluator metric source:
https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.evaluation.RegressionEvaluator

Input format Problems with MLlib

I want to run an SVM regression, but I have problems with the input format. Right now my train and test set for one customer looks like this:
1 '12262064 |f offer_quantity:1
has_bought_brand_company:1 has_bought_brand_a:6.79 has_bought_brand_q_60:1.0
has_bought_brand:2.0 has_bought_company_a:1.95 has_bought_brand_180:1.0
has_bought_brand_q_180:1.0 total_spend:218.37 has_bought_brand_q:3.0 offer_value:1.5
has_bought_brand_a_60:2.79 has_bought_brand_60:1.0 has_bought_brand_q_90:1.0
has_bought_brand_a_90:2.79 has_bought_company_q:1.0 has_bought_brand_90:1.0
has_bought_company:1.0 never_bought_category:1 has_bought_brand_a_180:2.79
I tried to read this text file into Spark, but without success. What am I missing? Do I have to delete the feature names? Right now it is in Vowpal Wabbit format.
My code looks like this:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "mllib/data/train.txt")
// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// Run the training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
model.clearThreshold()
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
I get an answer, but my AUC value is 1, which shouldn't be the case:
scala> println("Area under ROC = " + auROC)
Area under ROC = 1.0
I think your file is not in LIBSVM format. If you can, convert the file to LIBSVM format; otherwise you will have to load it as a normal text file and then create the LabeledPoints yourself.
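For reference, LIBSVM format is one label per line followed by sorted, one-based index:value pairs for the non-zero features; the values below are purely illustrative:
0 1:1.0 3:6.79 10:218.37
1 2:1.95 5:1.5 7:2.79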
This is what I did for my file:
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
// tweetInput and numIterations are defined elsewhere in my job
val tf = new HashingTF(2)
val tweets = sc.textFile(tweetInput)
// First token of each line is the label; the remaining tokens are hashed into features
val labelPoint = tweets.map { l =>
  val parts = l.split(' ')
  val t = tf.transform(parts.tail.sliding(2).toSeq)
  LabeledPoint(parts(0).toDouble, t)
}.cache()
labelPoint.count()
val model = LinearRegressionWithSGD.train(labelPoint, numIterations)
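As a side note, once you have an RDD[LabeledPoint] you can also write it back out in LIBSVM format, so MLUtils.loadLibSVMFile works directly next time (the output path is illustrative):
import org.apache.spark.mllib.util.MLUtils
// Persist the labelled data in LIBSVM format for future runs
MLUtils.saveAsLibSVMFile(labelPoint, "mllib/data/train_libsvm")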
