How to prepare training data in MLlib - apache-spark

TL;DR
How do I use MLlib to train my wiki data (text & category) for prediction against tweets?
I have trouble figuring out how to convert my tokenized wiki data so that it can be trained with either NaiveBayes or LogisticRegression. My goal is to use the trained model for comparison against tweets*. I've tried using pipelines with LR, and HashingTF with IDF for NaiveBayes, but I keep getting wrong predictions. Here's what I've tried:
*Note that I would like to use the many categories in the wiki data as my labels...I've only seen binary classification (it's one category or another)...is it possible to do what I want?
Pipeline with LR
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.feature.RegexTokenizer
case class WikiData(category: String, text: String)
case class LabeledData(category: String, text: String, label: Double)
val wikiData = sc.parallelize(List(WikiData("Spark", "this is about spark"), WikiData("Hadoop","then there is hadoop")))
val categoryMap = wikiData.map(x=>x.category).distinct.zipWithIndex.mapValues(x=>x.toDouble/1000).collectAsMap
val labeledData = wikiData.map(x=>LabeledData(x.category, x.text, categoryMap.get(x.category).getOrElse(0.0))).toDF
val tokenizer = new RegexTokenizer()
.setInputCol("text")
.setOutputCol("words")
.setPattern("/W+")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.01)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(labeledData)
model.transform(labeledData).show
Naive Bayes
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documentsAsWordSequenceAlready)
import org.apache.spark.mllib.feature.IDF
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
// ...or, alternatively, ignore terms that appear in fewer than 2 documents:
tf.cache()
val idf = new IDF(minDocFreq = 2).fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
// to create tfidfLabeled (below) I ran a map to set the labels...but again it seems the labels have to be 1.0 or 0.0?
NaiveBayes.train(tfidfLabeled)
.predict(hashingTF.transform(tweet))
.collect

ML LogisticRegression doesn't support multinomial classification yet, but it is supported by both MLlib NaiveBayes and LogisticRegressionWithLBFGS. In the first case it should work by default:
import org.apache.spark.mllib.classification.NaiveBayes
val nbModel = new NaiveBayes()
.setModelType("multinomial") // This is default value
.run(train)
but for logistic regression you should provide the number of classes:
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
val model = new LogisticRegressionWithLBFGS()
.setNumClasses(n) // Set number of classes
.run(trainingData)
Regarding preprocessing steps, it is quite a broad topic and it is hard to give meaningful advice without access to your data, so everything you find below is just a wild guess:
as far as I understand, you use wiki data for training and tweets for testing. If that's true it is, generally speaking, a bad idea: you can expect both sets to use significantly different vocabulary, grammar and spelling
a simple regex tokenizer can perform pretty well on standardized text, but from my experience it won't work well on informal text like tweets
HashingTF can be a good way to obtain a baseline model, but it is an extremely simplified approach, especially if you don't apply any filtering steps. If you decide to use it you should at least increase the number of features or use the default value (2^20) - see the sketch below
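For example, a minimal sketch of a slightly more careful feature extraction step (assuming Spark 1.5+ for StopWordsRemover; the column names are the ones used above):
import org.apache.spark.ml.feature.{HashingTF, RegexTokenizer, StopWordsRemover}
// split on runs of non-word characters (note the escaped pattern)
val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setPattern("\\W+")
// drop common English stop words before hashing
val remover = new StopWordsRemover()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("filtered")
// use a much larger feature space than 1000 to limit hash collisions
val hashingTF = new HashingTF()
  .setNumFeatures(1 << 20)
  .setInputCol(remover.getOutputCol)
  .setOutputCol("features")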
EDIT (Preparing data for Naive Bayes with IDF)
using ML Pipelines:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row
val tokenizer = ???
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("rawFeatures")
val idf = new IDF()
.setInputCol(hashingTF.getOutputCol)
.setOutputCol("features")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf))
val model = pipeline.fit(labeledData)
model
.transform(labeledData)
.select($"label", $"features")
.map{case Row(label: Double, features: Vector) => LabeledPoint(label, features)}
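The resulting RDD[LabeledPoint] can then be fed straight to MLlib NaiveBayes - for example (just a sketch, reusing the model and labeledData from above):
import org.apache.spark.mllib.classification.NaiveBayes
// convert the pipeline output into an RDD[LabeledPoint] and train
val training = model
  .transform(labeledData)
  .select($"label", $"features")
  .map{case Row(label: Double, features: Vector) => LabeledPoint(label, features)}
val nbModel = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")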
using MLlib transformers:
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.{IDF, IDFModel}
val labeledData = wikiData.map(x =>
LabeledData(x.category, x.text, categoryMap.get(x.category).getOrElse(0.0)))
val p = "\\W+".r
val raw = labeledData.map{
case LabeledData(_, text, label) => (label, p.split(text))}
val hashingTF: org.apache.spark.mllib.feature.HashingTF = new HashingTF(1000)
val tf = raw.map{case (label, text) => (label, hashingTF.transform(text))}
val idf: org.apache.spark.mllib.feature.IDFModel = new IDF().fit(tf.map(_._2))
tf.map{
case (label, rawFeatures) => LabeledPoint(label, idf.transform(rawFeatures))}
Note: Since these transformers require JVM access, the MLlib version won't work in PySpark. If you prefer Python you have to split the data (labels and features), transform the features, and zip them back together.
EDIT (Preparing data for ML algorithms):
While the following piece of code looks valid at first glance
val categoryMap = wikiData
.map(x=>x.category)
.distinct
.zipWithIndex
.mapValues(x=>x.toDouble/1000)
.collectAsMap
val labeledData = wikiData.map(x=>LabeledData(
x.category, x.text, categoryMap.get(x.category).getOrElse(0.0))).toDF
it won't generate valid labels for ML algorithms.
First of all, ML expects labels to be consecutive doubles (0.0, 1.0, ..., n - 1.0) where n is the number of classes. In your example pipeline, where one of the classes gets label 0.001, you'll get an error like this:
ERROR LogisticRegression: Classification labels should be in {0 to 0 Found 1 invalid labels.
The obvious solution is to avoid division when you generate the mapping:
.mapValues(x=>x.toDouble)
While this will work for LogisticRegression, other ML algorithms will still fail. For example, with RandomForestClassifier you'll get:
RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.
What is interesting, the ML version of RandomForestClassifier, unlike its MLlib counterpart, doesn't provide a method to set the number of classes. It turns out it expects special metadata attributes to be set on the DataFrame column. The simplest approach is to use the StringIndexer mentioned in the error message:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("label")
val pipeline = new Pipeline()
.setStages(Array(indexer, tokenizer, hashingTF, idf, lr))
val model = pipeline.fit(wikiData.toDF)
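If you also want to map predictions back to the original category names, you can reverse the indexing with IndexToString (available since Spark 1.5) - a minimal sketch, assuming the pipeline above, where the fitted StringIndexerModel is the first stage:
import org.apache.spark.ml.feature.{IndexToString, StringIndexerModel}
// reuse the category <-> label mapping learned by the indexer
val indexerModel = model.stages(0).asInstanceOf[StringIndexerModel]
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedCategory")
  .setLabels(indexerModel.labels)
labelConverter
  .transform(model.transform(wikiData.toDF))
  .select("category", "predictedCategory")
  .show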

Related

Multiple Evaluators in CrossValidator - Spark ML

Is it possible to have more than 1 evaluator in a CrossValidator to get R2 and RMSE at the same time?
Instead of having two different CrossValidators:
val lr_evaluator_rmse = new RegressionEvaluator()
.setLabelCol("ArrDelay")
.setPredictionCol("predictionLR")
.setMetricName("rmse")
val lr_evaluator_r2 = new RegressionEvaluator()
.setLabelCol("ArrDelay")
.setPredictionCol("predictionLR")
.setMetricName("r2")
val lr_cv_rmse = new CrossValidator()
.setEstimator(lr_pipeline)
.setEvaluator(lr_evaluator_rmse)
.setEstimatorParamMaps(lr_paramGrid)
.setNumFolds(3)
.setParallelism(3)
val lr_cv_r2 = new CrossValidator()
.setEstimator(lr_pipeline)
.setEvaluator(lr_evaluator_r2)
.setEstimatorParamMaps(lr_paramGrid)
.setNumFolds(3)
.setParallelism(3)
Something like this:
val lr_cv = new CrossValidator()
.setEstimator(lr_pipeline)
.setEvaluator(lr_evaluator_rmse)
.setEvaluator(lr_evaluator_r2)
.setEstimatorParamMaps(lr_paramGrid)
.setNumFolds(3)
.setParallelism(3)
Thanks in advance
The PySpark documentation on CrossValidator indicates that the evaluator argument is a single entity --> evaluator: Optional[pyspark.ml.evaluation.Evaluator] = None
The solution I went with was to create separate pipelines for each evaluator. For example,
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator
# Convert inputs to vector assembler
vec_assembler = VectorAssembler(inputCols=[inputs], outputCol="features")
# Create Random Forest Classifier pipeline
rf = RandomForestClassifier(labelCol="label", seed=42)
multiclass_evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label", metricName="accuracy")
binary_evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label")
# Plop model objects into cross validator
cv1 = CrossValidator(estimator=rf, evaluator=multiclass_evaluator, numFolds=3, parallelism=4, seed=42)
cv2 = CrossValidator(estimator=rf, evaluator=binary_evaluator, numFolds=3, parallelism=4, seed=42)
# Put all steps in a pipeline
pipeline1 = Pipeline(stages=[vec_assembler, cv1])
pipeline2 = Pipeline(stages=[vec_assembler, cv2])

How to predict topics in unseen documents with an already trained LDA model in Spark 2.1.1?

I am training an LDA model in PySpark (Spark 2.1.1) on a customer reviews dataset. Now, based on that model, I want to predict the topics in new unseen text.
I am using the following code to build the model:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext, Row
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer, StopWordsRemover
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.ml.clustering import DistributedLDAModel, LocalLDAModel
from pyspark.mllib.linalg import Vector, Vectors
from pyspark.sql.functions import *
import pyspark.sql.functions as F
path = "D:/sparkdata/sample_text_LDA.txt"
sc = SparkContext("local[*]", "review")
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv("D:/sparkdata/customers_data.csv", header=True, inferSchema=True)
data = df.select("Reviews").rdd.map(list).map(lambda x: x[0]).zipWithIndex().map(lambda words: Row(idd= words[1], words = words[0].split(" "))).collect()
docDF = spark.createDataFrame(data)
remover = StopWordsRemover(inputCol="words",
outputCol="stopWordsRemoved")
stopWordsRemoved_df = remover.transform(docDF).cache()
Vector = CountVectorizer(inputCol="stopWordsRemoved", outputCol="vectors")
model = Vector.fit(stopWordsRemoved_df)
result = model.transform(stopWordsRemoved_df)
corpus = result.select("idd", "vectors").rdd.map(lambda x: [x[0],Vectors.fromML(x[1])]).cache()
# Cluster the documents topics using LDA
ldaModel = LDA.train(corpus, k=3,maxIterations=100,optimizer='online')
topics = ldaModel.topicsMatrix()
vocabArray = model.vocabulary
print(ldaModel.describeTopics())
wordNumbers = 10 # number of words per topic
topicIndices = sc.parallelize(ldaModel.describeTopics(maxTermsPerTopic = wordNumbers))
def topic_render(topic):  # specify vector id of words to actual words
    terms = topic[0]
    result = []
    for i in range(wordNumbers):
        term = vocabArray[terms[i]]
        result.append(term)
    return result
topics_final = topicIndices.map(lambda topic: topic_render(topic)).collect()
for topic in range(len(topics_final)):
    print("Topic" + str(topic) + ":")
    for term in topics_final[topic]:
        print(term)
    print('\n')
Now I have a dataframe with a column containing new customer reviews, and I want to predict which topic cluster they belong to.
I have searched for answers; mostly the following way is recommended, as in Spark MLlib LDA, how to infer the topics distribution of a new unseen document?:
newDocuments: RDD[(Long, Vector)] = ...
topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)
However, I get the following error:
'LDAModel' object has no attribute 'toLocal'.
Neither does it have a topicDistribution attribute.
So are these attributes not supported in Spark 2.1.1?
Is there any other way to infer topics from the unseen data?
You're going to need to pre-process the new data:
# import a new data set to be passed through the pre-trained LDA
import pandas as pd
import gensim
data_new = pd.read_csv('YourNew.csv', encoding="ISO-8859-1")
data_new = data_new.dropna()
data_text_new = data_new[['Your Target Column']]
data_text_new['index'] = data_text_new.index
documents_new = data_text_new
#documents_new = documents.dropna(subset=['Preprocessed Document'])
# process the new data set through the lemmatization, and stopwork functions
processed_docs_new = documents_new['Preprocessed Document'].map(preprocess)
# create a dictionary of individual words and filter the dictionary
dictionary_new = gensim.corpora.Dictionary(processed_docs_new[:])
dictionary_new.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# define the bow_corpus
bow_corpus_new = [dictionary_new.doc2bow(doc) for doc in processed_docs_new]
Then you can just pass it through the trained LDA as a function. All you need is that bow_corpus:
ldamodel[bow_corpus_new[:len(bow_corpus_new)]]
If you want it out in a CSV, try this:
a = ldamodel[bow_corpus_new[:len(bow_corpus_new)]]
b = data_text_new
topic_0=[]
topic_1=[]
topic_2=[]
for i in a:
    topic_0.append(i[0][1])
    topic_1.append(i[1][1])
    topic_2.append(i[2][1])
d = {'Your Target Column': b['Your Target Column'].tolist(),
'topic_0': topic_0,
'topic_1': topic_1,
'topic_2': topic_2}
df = pd.DataFrame(data=d)
df.to_csv("YourAllocated.csv", index=True, mode = 'a')
I hope this helps :)

Spark ML - create a features vector from new data element to predict on

tl;dr
I have fit a LinearRegression model in Spark 2.10 - after using StringIndexer and OneHotEncoder I have a ~44 element features vector. For a new bit of data I'd like to get a prediction on, how can I create a features vector from the new data element?
More Detail
First, this is a completely contrived example to learn how to do this. Using logs with the fields:
"elapsed_time", "api_name", "method", and "status_code"
We will create a model with elapsed_time as the label and use the other fields as our feature set. The complete code is shared below.
Steps - condensed
Read in our data to a DataFrame
Index each of our features using StringIndexer
OneHotEncode indexed features with OneHotEncoder
Create our features vector with VectorAssembler
Split data into training and testing sets
Fit the model & predict on test data
Results were horrible, but like I said this is a contrived exercise...
What I need to learn how to do
If a new log entry came in to a streaming application for example, how would I go about creating a feature vector from the new data and pass it in to predict()?
A new log entry might be:
{api_name":"/sample_api_1/v2","method":"GET","status_code":"200","elapsed_time":39}
Post VectorAssembler
status_code_vector
(14,[0],[1.0])
api_name_vector
(27,[0],[1.0])
method_vector
(3,[0],[1.0])
features vector
(44,[0,14,41],[1.0,1.0,1.0])
Le Code
%spark
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler, StringIndexerModel, VectorSlicer}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame
val logs = sc.textFile("/Users/z001vmk/data/sample_102M.txt")
val dfLogsRaw: DataFrame = spark.read.json(logs)
val dfLogsFiltered = dfLogsRaw.filter("status_code != 314").drop("extra_column")
// Create DF with our fields of concern.
val dfFeatures: DataFrame = dfLogsFiltered.select("elapsed_time", "api_name", "method", "status_code")
// Contrived goal:
// Use elapsed time as our label given features api_name, status_code, & method.
// Train model on small (100Mb) dataset
// Be able to predict elapsed_time given a new record similar to this example:
// --> {api_name":"/sample_api_1/v2","method":"GET","status_code":"200","elapsed_time":39}
// Indexers
val statusCodeIdxr: StringIndexer = new StringIndexer().setInputCol("status_code").setOutputCol("status_code_idx").setHandleInvalid("skip")
val apiNameIdxr: StringIndexer = new StringIndexer().setInputCol("api_name").setOutputCol("api_name_idx").setHandleInvalid("skip")
val methodIdxr: StringIndexer = new StringIndexer().setInputCol("method").setOutputCol("method_idx").setHandleInvalid("skip")
// Index features:
val dfIndexed0: DataFrame = statusCodeIdxr.fit(dfFeatures).transform(dfFeatures)
val dfIndexed1: DataFrame = apiNameIdxr.fit(dfIndexed0).transform(dfIndexed0)
val indexed: DataFrame = methodIdxr.fit(dfIndexed1).transform(dfIndexed1)
// OneHotEncoders
val statusCodeEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(statusCodeIdxr.getOutputCol).setOutputCol("status_code_vec")
val apiNameEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(apiNameIdxr.getOutputCol).setOutputCol("api_name_vec")
val methodEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(methodIdxr.getOutputCol).setOutputCol("method_vec")
// Encode feature vectors
val encoded0: DataFrame = statusCodeEncoder.transform(indexed)
val encoded1: DataFrame = apiNameEncoder.transform(encoded0)
val encoded: DataFrame = methodEncoder.transform(encoded1)
// Limit our dataset to necessary elements:
val dataset0 = encoded.select("elapsed_time", "status_code_vec", "api_name_vec", "method_vec").withColumnRenamed("elapsed_time", "label")
// Assemble feature vectors
val assembler: VectorAssembler = new VectorAssembler().setInputCols(Array("status_code_vec", "api_name_vec", "method_vec")).setOutputCol("features")
val dataset1 = assembler.transform(dataset0)
dataset1.show(5,false)
// Prepare the dataset for training (optional):
val dataset: DataFrame = dataset1.select("label", "features")
dataset.show(3,false)
val Array(training, test) = dataset.randomSplit(Array(0.8, 0.2))
// Create our Linear Regression Model
val lr: LinearRegression = new LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setLabelCol("label").setFeaturesCol("features")
val lrModel = lr.fit(training)
val predictions = lrModel.transform(test)
predictions.show(20,false)
This can all be pasted into a Zeppelin notebook if you're interested.
Wrapping up
So, what I've been scouring about for is how to transform new data into a ~35ish element feature vector and use the model fit to the training data to transform it and get a prediction. I suspect there is metadata either held in the model itself or that would need to be maintained from the StringIndexers in this case - but that's what I cannot find.
Very happy to be pointed to docs or examples - all help appreciated.
Thank you!
Short answer: Pipeline models.
Just to make sure you understand, though, you don't want to create your model when you start an app, if you don't have to. Unless you're going to use DataSets and feedback, it's just silly. Create your model in a Spark Submit session (or use a notebook session like Zeppelin) and save it down. That's doing your data science.
Most DS guys hand the model over, and let the DevOps/Data Engineers use it. All they have to do is call a .predict() on the object after it's been loaded into memory.
After going down the road of using a PipelineModel, this became quite simple. Hat tip to #tadamhicks for getting me to look at pipelines sooner rather than later.
Below is an updated code block that performs basically the same model creation, fit, and prediction as above but does so using pipelines and has an added bit where we predict on a newly created DataFrame to simulate how to predict on new data.
There is likely a cleaner way to rename/create our label column, but we'll leave that as a future enhancement.
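One possible alternative, just as a sketch (not part of the original solution): instead of renaming elapsed_time, point the regressor at the original column directly:
// tell the regressor to use elapsed_time as the label instead of renaming it to "label"
val lr: LinearRegression = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setLabelCol("elapsed_time")
  .setFeaturesCol("features")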
%spark
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler, StringIndexerModel, VectorSlicer}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame
val logs = sc.textFile("/data/sample_102M.txt")
val dfLogsRaw: DataFrame = spark.read.json(logs)
val dfLogsFiltered = dfLogsRaw.filter("status_code != 314").drop("extra_column")
.select("elapsed_time", "api_name", "method", "status_code","cache_status")
.withColumnRenamed("elapsed_time", "label")
val Array(training, test) = dfLogsFiltered.randomSplit(Array(0.8, 0.2))
// Indexers
val statusCodeIdxr: StringIndexer = new StringIndexer().setInputCol("status_code").setOutputCol("status_code_idx").setHandleInvalid("skip")
val apiNameIdxr: StringIndexer = new StringIndexer().setInputCol("api_name").setOutputCol("api_name_idx").setHandleInvalid("skip")
val methodIdxr: StringIndexer = new StringIndexer().setInputCol("method").setOutputCol("method_idx").setHandleInvalid("skip")//"cache_status"
val cacheStatusIdxr: StringIndexer = new StringIndexer().setInputCol("cache_status").setOutputCol("cache_status_idx").setHandleInvalid("skip")
// OneHotEncoders
val statusCodeEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(statusCodeIdxr.getOutputCol).setOutputCol("status_code_vec")
val apiNameEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(apiNameIdxr.getOutputCol).setOutputCol("api_name_vec")
val methodEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(methodIdxr.getOutputCol).setOutputCol("method_vec")
val cacheStatusEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(cacheStatusIdxr.getOutputCol).setOutputCol("cache_status_vec")
// Vector Assembler
val assembler: VectorAssembler = new VectorAssembler().setInputCols(Array("status_code_vec", "api_name_vec", "method_vec", "cache_status_vec")).setOutputCol("features")
val lr: LinearRegression = new LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setLabelCol("label").setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(statusCodeIdxr, apiNameIdxr, methodIdxr, cacheStatusIdxr, statusCodeEncoder, apiNameEncoder, methodEncoder, cacheStatusEncoder, assembler, lr))
val plModel: PipelineModel = pipeline.fit(training)
plModel.write.overwrite().save("/tmp/spark-linear-regression-model")
plModel.transform(test).select("label", "prediction").show(5,false)
val dataElement: String = """{"api_name":"/sample_api/v2","method":"GET","status_code":"200","cache_status":"MISS","elapsed_time":39}"""
val newDataRDD = spark.sparkContext.makeRDD(dataElement :: Nil)
val newData = spark.read.json(newDataRDD).withColumnRenamed("elapsed_time", "label")
val loadedPlModel = PipelineModel.load("/tmp/spark-linear-regression-model")
loadedPlModel.transform(newData).select("label", "prediction").show

Text classification using Naive Bayes (Hashing Term Frequency)

I am trying to build a text classification model using the Naive Bayes algorithm.
Here's my sample data (label and feature):
1|combusting [chemical]
1|industrial purposes
1|
2|salt for preserving,
2|other for foodstuffs
2|auxiliary
2|fluids for use with abrasives
3|vulcanisation
3|accelerators
3|anti-frothing solutions for batteries
4|anti-frothing solutions for accumulators
4|acetates
4|[chemicals]*
4|acetate of cellulose, unprocessed
Following is my sample code
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.feature.HashingTF
val rawData = sc.textFile("~/data.csv")
val rawData1 = rawData.map(x => x.replaceAll(",",""))
val htf = new HashingTF(1000)
val parsedData = rawData1.map { line =>
val values = (line.split("|").toSeq)
val featureVector = htf.transform(values(1).split(" "))
val label = values(0).toDouble
LabeledPoint(label, featureVector)
}
val splits = parsedData.randomSplit(Array(0.8, 0.2), seed = 11L)
val training = splits(0)
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 2.0, modelType = "multinomial")
val predictionAndLabels = test.map { point =>
val score = model.predict(point.features)
(score, point.label)
}
val metrics = new MulticlassMetrics(predictionAndLabels)
metrics.labels.foreach( l => println(metrics.fMeasure(l)))
val testData1 = htf.transform("salt")
val predictionAndLabels1 = model.predict(testData1)
I am getting approx 33% accuracy (very low), and the test data is predicted with wrong labels. I have printed parsedData, which contains the labels and features, as below:
(1.0,(1000,[48],[1.0]))
(3.0,(1000,[49],[1.0]))
(1.0,(1000,[48],[1.0]))
(3.0,(1000,[49],[1.0]))
(1.0,(1000,[48],[1.0]))
I am not able to figure out what's missing; the hashing term frequency function seems to be generating repeated term frequencies. Kindly suggest how to improve the model performance. Thanks in advance.
You have to ask yourself many questions before starting to implement your algorithm:
Your texts look very short; what is the size of your vocabulary? Answering this will help you tune the HashingTF dimensionality. In your case, you might need to use a lower value.
You might need to consider doing some pre-processing on your texts, e.g. using StopWordsRemover, stemming, or a Tokenizer.
A tokenizer will produce better tokens than the ad-hoc text processing you are doing.
Change your parameters, namely the numFeatures of the HashingTF and the lambda of the Naive Bayes.
Basically, in machine learning you will need to do cross-validation on a set of parameters in order to optimise your results. Check this example and try to do something similar by tuning your HashingTF and the lambda as follows:
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(naiveBayes.smoothing, Array(0.1, 0.01)) // ML NaiveBayes exposes Laplace smoothing as "smoothing" (MLlib calls it lambda)
  .build()
In general, using Pipelines and CrossValidator works with Naive Bayes for multi-class classification, so have a look here rather than hardcoding all the steps by hand; a sketch follows below.
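A minimal sketch of how those pieces could fit together with the ML API, assuming a DataFrame called data with "label" and "text" columns (the names here are illustrative, not taken from your post):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
// tokenize the raw text and hash the tokens into a feature vector
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val naiveBayes = new NaiveBayes()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, naiveBayes))
// grid over the hashing dimensionality and the smoothing parameter
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(naiveBayes.smoothing, Array(0.1, 1.0))
  .build()
// pick the combination with the best cross-validated F1 score
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("f1"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
val cvModel = cv.fit(data)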

Input format Problems with MLlib

I want to run an SVM regression, but have problems with the input format. Right now my train and test set for one customer looks like this:
1 '12262064 |f offer_quantity:1
has_bought_brand_company:1 has_bought_brand_a:6.79 has_bought_brand_q_60:1.0
has_bought_brand:2.0 has_bought_company_a:1.95 has_bought_brand_180:1.0
has_bought_brand_q_180:1.0 total_spend:218.37 has_bought_brand_q:3.0 offer_value:1.5
has_bought_brand_a_60:2.79 has_bought_brand_60:1.0 has_bought_brand_q_90:1.0
has_bought_brand_a_90:2.79 has_bought_company_q:1.0 has_bought_brand_90:1.0
has_bought_company:1.0 never_bought_category:1 has_bought_brand_a_180:2.79
I tried to read this text file into Spark, but without success. What am I missing? Do I have to delete the feature names? Right now it's in Vowpal Wabbit format.
My code looks like this:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "mllib/data/train.txt")
// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// Run training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
model.clearThreshold()
val scoreAndLabels = test.map { point =>
val score = model.predict(point.features)
(score, point.label)
}
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
I get an answer, but my AUC value is 1, which shouldn't be the case.
scala> println("Area under ROC = " + auROC)
Area under ROC = 1.0
I think your file is not in LIBSVM format. You can either convert the file to LIBSVM format, or load it as a normal text file and then create a LabeledPoint yourself. This is what I did for my file:
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
val tf = new HashingTF(2)
val tweets = sc.textFile(tweetInput)
val labelPoint = tweets.map(l => {
  val parts = l.split(' ')
  // hash sliding windows of the remaining tokens into a 2-dimensional feature vector
  val t = tf.transform(parts.tail.sliding(2).toSeq)
  LabeledPoint(parts(0).toDouble, t)
}).cache()
labelPoint.count()
val numIterations = 100
val model = LinearRegressionWithSGD.train(labelPoint, numIterations)
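If you would rather keep the named feature:value pairs from your Vowpal Wabbit style file instead of converting to LIBSVM, one option is to build a feature-name index yourself and create sparse vectors. This is just a sketch, assuming every record fits on one line of the form "label 'tag |f name:value name:value ...":
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// parse each line into (label, Seq(featureName -> value))
val parsed = sc.textFile("mllib/data/train.txt").map { line =>
  val label = line.split("\\s+")(0).toDouble
  val afterNamespace = line.substring(line.indexOf("|f") + 2).trim
  val pairs = afterNamespace.split("\\s+").filter(_.contains(":")).map { kv =>
    val Array(name, value) = kv.split(":")
    (name, value.toDouble)
  }
  (label, pairs)
}
// assign every feature name a fixed column index (fine for a small feature set)
val featureIndex = parsed.flatMap(_._2.map(_._1)).distinct().collect().zipWithIndex.toMap
val numFeatures = featureIndex.size
// build LabeledPoints with sparse vectors that keep the original values
val data = parsed.map { case (label, pairs) =>
  val indexed = pairs.map { case (name, value) => (featureIndex(name), value) }.toSeq
  LabeledPoint(label, Vectors.sparse(numFeatures, indexed))
}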
