spark naive bayes prediction analysis - apache-spark

I have used Naive Bayes for Text Classification
Below is the link I used for understanding Naive Bayes
https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/
Though I got good prediction results , I was not able to understand the reason for the failure cases
I measured the probability of the features using predictProbabilities to understand the reason for
in correct prediction
Below is my understanding based on which I am trying to find why predictions are wrong in some cases
Assume my test data is like below ( I have around 100000 records for training )
Text Classification
There is a murder in town - HIGH SEVERITY
The old women was murdered - HIGH SEVERITY
Boy was hit by ball in street - LOW SEVERITY
John sprained his ankle while playing - LOW SEVERITY
Now when I do a prediction for below sentence
"There is a murder in city" - I expect the model to predict HIGH SEVERITY.
But at times the model predicts LOW SEVERITY
I pulled up all text which has same words and tried to figure out why this is happening .
If I compute the probability manually using the formula in https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/, it should have been predicted correctly .
But I could not find any clue why the prediction is going wrong.
Kindly let me know if I am missing any critical information
Code Snippet Added Below
My Training Data Frame consists of three columns "id" , "risk", "label"
The text is already lemmetized using stanford NLP
// TOKENIZE DATA
regexTokenizer = new RegexTokenizer()
.setInputCol("text")
.setOutputCol("words")
.setPattern("\\W");
DataFrame tokenized = regexTokenizer.transform(trainingRiskData);
// REMOVE STOP WORDS
remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered");
DataFrame stopWordsRemoved = remover.transform(tokenized);
// COMPUTE TERM FREQUENCY USING HASHING
int numFeatures = 20;
hashingTF = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
.setNumFeatures(numFeatures);
DataFrame rawFeaturizedData = hashingTF.transform(stopWordsRemoved);
IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
idfModel = idf.fit(rawFeaturizedData);
DataFrame featurizedData = idfModel.transform(rawFeaturizedData);
JavaRDD<LabeledPoint> labelledJavaRDD = featurizedData.select("label", "features").toJavaRDD()
.map(new Function<Row, LabeledPoint>() {
#Override
public LabeledPoint call(Row arg0) throws Exception {
LabeledPoint labeledPoint = new LabeledPoint(new Double(arg0.get(0).toString()),
(org.apache.spark.mllib.linalg.Vector) arg0.get(1));
return labeledPoint;
}
});
NaiveBayes naiveBayes = new NaiveBayes(1.0, "multinomial");
NaiveBayesModel naiveBayesModel = naiveBayes.train(labelledJavaRDD.rdd(), 1.0);
Once the training model is built , test data is passed through the same transformations and prediction is done using below code
Column 3 is label in test data frame.
Column 7 is features in test data frame
LabeledPoint labeledPoint = new LabeledPoint(new Double(dataFrameRow.get(3).toString()),
(org.apache.spark.mllib.linalg.Vector) dataFrameRow.get(7));
double predictedLabel = naiveBayesModel.predict(labeledPoint.features());

Related

Latent Dirichlet allocation (LDA) in Spark - replicate model

I want to save the LDA model from pyspark ml-clustering package and apply the model to the training & test data-set after saving. However results diverge despite setting a seed. My code is the following:
1) Import packages
from pyspark.ml.clustering import LocalLDAModel, DistributedLDAModel
from pyspark.ml.feature import CountVectorizer , IDF
2) Preparing the dataset
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus = result_tfidf.select("id", "features")
3) Training the LDA model
lda = LDA(k=number_of_topics, maxIter=100, docConcentration = [alpha], topicConcentration = beta, seed = 123)
model = lda.fit(corpus)
model.save("LDA_model_saved")
topics = model.describeTopics(words_in_topic)
topics_rdd = topics.rdd
modelled_corpus = model.transform(corpus)
4) Replicate the model
#Prepare the data set
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus_new = result_tfidf.select("id", "features")
#Load the model to apply to new corpus
newModel = LocalLDAModel.load("LDA_model_saved")
topics_new = newModel.describeTopics(words_in_topic)
topics_rdd_new = topics_new.rdd
modelled_corpus_new = newModel.transform(corpus_new)
The following results are different despite my assumption to be equal:
topics_rdd != topics_rdd_new and modelled_corpus != modelled_corpus_new (also when inspecting the extracted topics they are different as well as the predicted classes on the dataset)
So I find it really strange that the same model predicts different classes ("topics") on the same dataset, even though I set a seed in the model generation. Can someone with experience in replicating LDA models help?
Thank you :)
I was facing similar kind of problem while implementing LDA in PYSPARK. Even though I was using seed, every time I re run the code on the same data with same parameters, results were different.
I came up with below solution after trying multitude of things:
Saved cv_model after running it once and loaded it in next iterations rather then re-fitting it.
This is more related to my data set. The size of some of the documents in the corpus that i was using was very small (around 3 words per document). I filtered out these documents and set a limit , such that only those documents will be included in corpus that have minimum 15 words (may be higher in yours). I am not sure why this one worked, may be something related underline complexity of model.
All in all now my results are same even after several iterations. Hope this helps.

Spark/Pyspark: SVM - How to get Area-under-curve?

I have been dealing with random forest and naive bayes lately. Now i want to use a Support vector machine.
After fitting the model i wanted to use the output columns "probability" and "label" to compute the AUC value. But now I have seen that there is no column "probability" for SVM?!
Here you can see how I have done so far:
from pyspark.ml.classification import LinearSVC
svm = LinearSVC(maxIter=5, regParam=0.01)
model = svm.fit(train)
scores = model.transform(train)
results = scores.select('probability', 'label')
# Create Score-Label Set for 'BinaryClassificationMetrics'
results_collect = results.collect()
results_list = [(float(i[0][0]), 1.0-float(i[1])) for i in results_collect]
scoreAndLabels = sc.parallelize(results_list)
metrics = BinaryClassificationMetrics(scoreAndLabels)
print("AUC-value: " + str(round(metrics.areaUnderROC,4)))
That was my approach how I have done this in the past for random forest and naive bayes. I thought I could do it with svm too... But that does not work because there is no output column "probability".
Does anyone know why the column "probability" does not exist? And how i can compute the AUC-value now?
Using the most recent spark/pyspark to the time of this answer:
If you use the pyspark.ml module (unlike mllib), you can work with Dataframe as the interface:
svm = LinearSVC(maxIter=5, regParam=0.01)
model = svm.fit(train)
test_prediction = model.transform(test)
Create the evaluator (see it's source code for settings):
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()
Apply evaluator to data (again, source code shows more options):
evaluation = evaluator.evaluate(test_prediction)
The result of evaluate is, by default, the "Area Under Curve":
print("evaluation (area under ROC): %f" % evaluation)
SVM algorithm doesn't provide probability estimates, but only some scores.
There is an algorithm proposed by Platt to compute probabilities given SVM scores, but it's criticized but some and apparently not implemented in Spark.
Btw, there was a similar question What does the score of the Spark MLLib SVM output mean?

LDA model prediction nonconsistance

I trained a LDA model and load it into the environment to transform the new data:
from pyspark.ml.clustering import LocalLDAModel
lda = LocalLDAModel.load(path)
df = lda.transform(text)
The model will add a new column called topicDistribution. In my opinion, this distribution should be same for the same input, otherwise this model is not consistent. However, it is not in practice.
May I ask the reason why and how to fix it?
LDA uses randomness when training and, depending on the implementation, when infering new data. The implementation in Spark is based on EM MAP inference so I believe it only uses randomness when training the model. This means that the results will be different each time the algorithm is trained and run.
To get the same results when running on the same input and same parameters, you can set the random seed when training the model. For example, to set the random seed to 1:
model = LDA.train(data, k=2, seed=1)
To set the seed when transforming new data, create a parameter map to overwrite the default value (None for seed).
lda = LocalLDAModel.load(path)
paramMap[lda.seed] = 1L
df = lda.transform(text, paramMap)
For more information about overwriting model parameters, see here.

Text Classification using Spark ML

I have a free text description based on which I need to perform a classification. For example the description can be that of an incident. Based on the description of the incident , I need to predict the risk associated with the event . For eg : "A murder in town" - this description is a candidate for "high" risk.
I tried logistic regression but realized that currently there is support only for binary classification. For Multi class classification ( there are only three possible values ) based on free text description , what would be the most suitable algorithm? ( Linear Regression or Naive Bayes )
Since you are using spark, I assume you have bigdata, so -I am no expert- but after reading your answer, I would like to make some points.
Create the Training (80%) and Testing Data Sets (20%)
I would partition my data to Training (60-70%), Testing (15-20%) and Evaluation (15-20%) sets..
The idea is that you can fine tune your classification algorithm w.r.t. the Training set, but we really want to do with with Classification tasks, is to have them classify unseen data. So fine tune your algorithm with the Testing set, and when you are done, use the Evaluation set, to get a real understanding of how things work!
Stop words
If your data are articles from Newspapers and such,I personally haven't seen any significant improvement by using more sophisticated stop words removal approaches...
But that's just a personal statement, but if I were you, I wouldn't focus on that step.
Term Frequency
How about using Term Frequency-Inverse Document Frequency (TF-IDF) term weighting instead? You may want to read: How can I create a TF-IDF for Text Classification using Spark?
I would try both and compare!
Multinomial
Do you have any particular reason to try the Multinomial Distribution? If no, since when n is 1 and k is 2 the multinomial distribution is the Bernoulli distribution, as stated in Wikipedia, which is supported.
Try both and compare ( this is something you have to get used to, if you wish to make your model better! :) )
I also see that apache-spark-mllib offers Random forests, which might worth a read, at least! ;)
If your data is not that big, I would also try Support vector machines (SVMs), from scikit-learn, which however supports python, so you should switch to pyspark or plain python, abandoning spark. BTW, if you are actually going for sklearn, this might come in handy: How to split into train, test and evaluation sets in sklearn?, since Pandas plays nicely along with sklearn.
Hope this helps!
Off-topic:
This is really not the way to ask a question in Stack Overflow. Read How to ask a good question?
Personally, if I were you, I would do all the things you have done in your answer first, and then post a question, summarizing my approach.
As for the bounty, you may want to read: How does the Bounty System work?
This is how I solved the above problem.
Though prediction accuracy is not bad ,the model has to be tuned further
for better results.
Experts , please revert back if you find anything wrong.
My input data frame has two columns "Text" and "RiskClassification"
Below are the sequence of steps to predict using Naive Bayes in Java
Add a new column "label" to the input dataframe . This column will basically decode the risk classification like below
sqlContext.udf().register("myUDF", new UDF1<String, Integer>() {
#Override
public Integer call(String input) throws Exception {
if ("LOW".equals(input))
return 1;
if ("MEDIUM".equals(input))
return 2;
if ("HIGH".equals(input))
return 3;
return 0;
}
}, DataTypes.IntegerType);
samplingData = samplingData.withColumn("label", functions.callUDF("myUDF", samplingData.col("riskClassification")));
Create the Training ( 80 % ) and Testing Data Sets ( 20 % )
For eg :
DataFrame lowRisk = samplingData.filter(samplingData.col("label").equalTo(1));
DataFrame lowRiskTraining = lowRisk.sample(false, 0.8);
Union All the dataframes to build the complete training data
Building test data is slightly tricky . Test Data should have all data which
is not present in the training data
Start transformation of training data and build the model
6 . Tokenize the text column in the training data set
Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
DataFrame tokenized = tokenizer.transform(trainingRiskData);
Remove Stop Words. (Here you can also do advanced operations like lemme, stemmer, POS etc using Stanford NLP library)
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered");
DataFrame stopWordsRemoved = remover.transform(tokenized);
Compute Term Frequency using HashingTF. CountVectorizer is another way to do this
int numFeatures = 20;
HashingTF hashingTF = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
.setNumFeatures(numFeatures);
DataFrame rawFeaturizedData = hashingTF.transform(stopWordsRemoved);
IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(rawFeaturizedData);
DataFrame featurizedData = idfModel.transform(rawFeaturizedData);
Convert the featurized input into JavaRDD . Naive Bayes works on LabeledPoint
JavaRDD<LabeledPoint> labelledJavaRDD = featurizedData.select("label", "features").toJavaRDD()
.map(new Function<Row, LabeledPoint>() {
#Override
public LabeledPoint call(Row arg0) throws Exception {
LabeledPoint labeledPoint = new LabeledPoint(new Double(arg0.get(0).toString()),
(org.apache.spark.mllib.linalg.Vector) arg0.get(1));
return labeledPoint;
}
});
Build the model
NaiveBayes naiveBayes = new NaiveBayes(1.0, "multinomial");
NaiveBayesModel naiveBayesModel = naiveBayes.train(labelledJavaRDD.rdd(), 1.0);
Run all the above transformations on the test data also
Loop through the test data frame and perform the below actions
Create a LabeledPoint using the "label" and "features" in the test data frame
For eg : If the test data frame has label and features in the third and seventh column , then
LabeledPoint labeledPoint = new LabeledPoint(new Double(dataFrameRow.get(3).toString()),
(org.apache.spark.mllib.linalg.Vector) dataFrameRow.get(7));
Use the Prediction Model to predict the label
double predictedLabel = naiveBayesModel.predict(labeledPoint.features());
Add the predicted label also as a column to the test data frame.
Now test data frame has the expected label and the predicted label.
You can export the test data to csv and do analysis or you can compute the accuracy programatically as well.

Why is reporting the log perplexity of an LDA model so slow in Spark mllib?

I am fitting a LDA model in Spark mllib, using the OnlineLDAOptimizer. It only takes ~200 seconds to fit 10 topics on 9M documents (tweets).
val numTopics=10
val lda = new LDA()
.setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(math.min(1.0, mbf)))
.setK(numTopics)
.setMaxIterations(2)
.setDocConcentration(-1) // use default symmetric document-topic prior
.setTopicConcentration(-1) // use default symmetric topic-word prior
val startTime = System.nanoTime()
val ldaModel = lda.run(countVectors)
/**
* Print results
*/
// Print training time
println(s"Finished training LDA model. Summary:")
println(s"Training time (sec)\t$elapsed")
println(s"==========")
numTopics: Int = 10
lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA#72678a91
startTime: Long = 11889875112618
ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel#351e2b4c
Finished training LDA model. Summary:
Training time (sec) 202.640775542
However when I request the log perplexity of this model (looks like I need to cast it back to LocalLDAModel first), it takes a very long time to evaluate. Why? (I'm trying to get the log perplexity out so I can optimize k, the # of topics).
ldaModel.asInstanceOf[LocalLDAModel].logPerplexity(countVectors)
res95: Double = 7.006006572908673
Took 1212 seconds.
In general, calculating the perplexity is not a straightforward matter:
https://stats.stackexchange.com/questions/18167/how-to-calculate-perplexity-of-a-holdout-with-latent-dirichlet-allocation
Also setting the number of topics by only looking at perplexity might not be the right approach: https://www.quora.com/What-are-good-ways-of-evaluating-the-topics-generated-by-running-LDA-on-a-corpus
LDAModels learned with the online optimizer are of type LocalLDAModel anyways, so there is no conversion happening. I calculated perplexity on both, local and distributed: they take quite some time. I mean looking at the code, they have nested map calls on the whole Dataset.
Calling:
docBound += count * LDAUtils.logSumExp(Elogthetad + localElogbeta(idx, ::).t)
for (9M * nonzero BOW entries) times can take quite some time. The Code is from:
https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala line 312
Training LDA is fast in your case because you train for just 2 iterations with 9m/mbf update calls.
Btw. the default for docConcentration is Vectors.dense(-1) and not just an Int.
Btw. number 2: Thanks for this question, I had trouble with my algorithm running it on a cluster, just because I had this stupid perplexity calculation in it and din't know it causes so much trouble.

Resources