K-Means clustering is biased to one center - apache-spark

I have a corpus of wiki pages (baseball, hockey, music, football) which I'm running through tfidf and then through kmeans. After a couple of issues to start (you can see my previous questions), I'm finally getting a KMeansModel...but when I try to predict, I keep getting the same center. Is this because of the small dataset, or because I'm comparing multi-word documents against a query of only a few words (1-20)? Or is there something else I'm doing wrong? See the code below:
//Preprocessing of data includes splitting into words
//and removing words with only 1 or 2 characters
val corpus: RDD[Seq[String]]
val hashingTF = new HashingTF(100000)
val tf = hashingTF.transform(corpus)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf).cache
val kMeansModel = KMeans.train(tfidf, 3, 10)
val queryTf = hashingTF.transform(List("music"))
val queryTfidf = idf.transform(queryTf)
kMeansModel.predict(queryTfidf) //Always the same, no matter the term supplied
More a checklist than an answer:
A single word query or a very short sentence is probably not a good choice, especially when combined with a large feature vector. I would start with significant fragments of the documents from the corpus.
Manually check the similarity between the query and each cluster. Is it even remotely similar to each cluster?
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import breeze.linalg.functions.cosineDistance
import org.apache.spark.mllib.linalg.{Vector, SparseVector, DenseVector}
def toBreeze(v: Vector): BV[Double] = v match {
  case DenseVector(values) => new BDV[Double](values)
  case SparseVector(size, indices, values) => new BSV[Double](indices, values, size)
}
val centers = kMeansModel.clusterCenters.map(toBreeze(_))
val query = toBreeze(queryTfidf)
centers.map(c => cosineDistance(query, c))
Does K-Means converge? Depending on the dataset and the initial centroids, ten or twenty iterations may not be enough. Try increasing this number to one thousand or so and see if the problem persists.
Is your corpus diverse enough to form meaningful clusters? Try to find the centroid for each document in your corpus. Do you get a relatively uniform distribution, or are almost all documents assigned to a single cluster?
Perform a visual inspection. Take your tfidf RDD, convert it to a matrix, apply PCA, plot, color by cluster and see if you get meaningful results.
Plot the centroids as well and check whether they cover the possible clusters. If not, check convergence once again.
You can also check similarities between centroids:
(0 until centers.size)
.flatMap(i => ((i + 1) until centers.size)
.map(j => (i, j, 1 - cosineDistance(centers(i), centers(j)))))
Is your pre-processing thorough enough? Simple removal of short words most likely won't suffice. I would at least extend it with stopword removal. Some stemming wouldn't hurt either.
K-Means results depend on the initial centroids. Try running the algorithm multiple times and see if the problem persists.
Try a more sophisticated algorithm like LDA.
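To make two of the checks above concrete (query-to-centroid similarity and the distribution of documents over clusters), here is a minimal sketch in plain Python. It is not Spark code: vectors are assumed to be sparse {index: weight} dicts, the toy centers and docs are made up for illustration, and nearest_center picks by cosine similarity, which is not the same rule KMeansModel.predict uses (that one is Euclidean).

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {index: weight} dicts."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

def nearest_center(vector, centers):
    """Index of the center with the highest cosine similarity to `vector`."""
    sims = [cosine_similarity(vector, c) for c in centers]
    return max(range(len(centers)), key=lambda k: sims[k])

def cluster_sizes(vectors, centers):
    """How many vectors end up at each center (a very skewed count is a warning sign)."""
    return Counter(nearest_center(v, centers) for v in vectors)

# Hypothetical toy data: three centroids, a one-term query, four documents.
centers = [
    {0: 0.9, 1: 0.1},   # mostly term 0
    {1: 0.8, 2: 0.2},   # mostly term 1
    {2: 0.7, 3: 0.7},   # terms 2 and 3
]
query = {3: 1.0}
docs = [{0: 1.0}, {0: 0.5, 1: 0.1}, {1: 1.0}, {2: 1.0, 3: 1.0}]

similarities = [cosine_similarity(query, c) for c in centers]
assignment = nearest_center(query, centers)
sizes = cluster_sizes(docs, centers)
```

If a one-word query has zero similarity to all but one centroid, or sizes shows nearly everything in one cluster, the "always the same center" symptom is probably coming from the data rather than from predict.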


precomputed matrix costs much memory in dbscan in cluster

There are 40 million data points in my scenario. Can DBSCAN in sklearn support such a large dataset? Below is my code:
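from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# tagid2idx (tag id -> column index) is assumed to be defined elsewhere
X = []  # assumed: the original snippet never shows how X is built
for line in open("./raw_data1"):
#for line in sys.stdin:
    tagid_result = [0]*10
    line = line.strip()
    fields = line.split("\t")
    if len(fields)<6:
        continue  # assumed: the body of this branch is missing in the original
    tagid = fields[3]
    tagids = tagid.split(":")
    for i in tagids:
        tagid_result[tagid2idx[i]] = 1
    X.append(tagid_result)  # assumed
distance_matrix = pairwise_distances(X, metric='jaccard')
#print (distance_matrix)
dbscan = DBSCAN(eps=0.1, min_samples=1200, metric="precomputed", n_jobs=-1)
db = dbscan.fit(distance_matrix)
for i in range(0,len(db.labels_)):
    print (db.labels_[i])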
How can I improve my code to support large datasets?
DBSCAN itself never requires your data to be available as a matrix, and will only need linear memory.
Unfortunately for you, the sklearn authors decided to implement DBSCAN a bit differently from the original article. This can cause their implementation to use much more memory, and in cases such as yours that decision becomes a real drawback.
For Jaccard distance, the neighborhood search of DBSCAN can be nicely accelerated for example with inverted indexes. But even so, you only need to compute one row at a time if you implement the "textbook" version of DBSCAN yourself.
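As a rough illustration of the "textbook" version, here is a minimal sketch in plain Python. The assumptions are mine: each point is a set of tag ids, region_query is a brute-force scan (no inverted index), and a point counts as its own neighbour. Only one row of distances exists at any time, so the n x n matrix is never materialised.

```python
NOISE = -1
UNVISITED = None

def jaccard_distance(a, b):
    """Jaccard distance between two sets of tag ids."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / union

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (one distance row at a time)."""
    return [j for j, p in enumerate(points)
            if jaccard_distance(points[i], p) <= eps]

def dbscan(points, eps, min_samples):
    """Textbook DBSCAN; returns one label per point, -1 meaning noise."""
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                # j is a core point: keep expanding through points not yet claimed
                seeds.extend(k for k in j_neighbors
                             if labels[k] is UNVISITED or labels[k] == NOISE)
    return labels

# Hypothetical toy data: each point is the set of tag ids of one record.
points = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3, 4}, {7, 8}, {7, 8}, {9}]
labels = dbscan(points, eps=0.3, min_samples=2)
```

The brute-force region_query is still quadratic in time, so for 40 million points it would have to be replaced by something like the inverted index mentioned above; the sketch only shows that the memory problem comes from the precomputed matrix, not from DBSCAN itself.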

Filtering Spark DataFrame on new column

Context: I have a dataset too large to fit in memory I am training a Keras RNN on. I am using PySpark on an AWS EMR Cluster to train the model in batches that are small enough to be stored in memory. I was not able to implement the model as distributed using elephas and I suspect this is related to my model being stateful. I'm not entirely sure though.
The dataframe has a row for every user and days elapsed from the day of install from 0 to 29. After querying the database I do a number of operations on the dataframe:
query = """WITH max_days_elapsed AS (
SELECT user_id,
max(days_elapsed) as max_de
FROM table
GROUP BY user_id
)
SELECT table.*
FROM table
LEFT OUTER JOIN max_days_elapsed USING (user_id)
WHERE max_de = 1
AND days_elapsed < 1"""
df = read_from_db(query) #this is just a custom function to query our database
#Create features vector column
assembler = VectorAssembler(inputCols=features_list, outputCol="features")
df_vectorized = assembler.transform(df)
#Split users into train and test and assign batch number
udf_randint = udf(lambda x: np.random.randint(0, x), IntegerType())
training_users, testing_users = df_vectorized.select("user_id").distinct().randomSplit([0.8,0.2],123)
training_users = training_users.withColumn("batch_number", udf_randint(lit(N_BATCHES)))
#Create and sort train and test dataframes
train = df_vectorized.join(training_users, ["user_id"], "inner").select(["user_id", "days_elapsed","batch_number","features", "kpi1", "kpi2", "kpi3"])
train = train.sort(["user_id", "days_elapsed"])
test = df_vectorized.join(testing_users, ["user_id"], "inner").select(["user_id","days_elapsed","features", "kpi1", "kpi2", "kpi3"])
test = test.sort(["user_id", "days_elapsed"])
The problem I am having is that I cannot seem to be able to filter on batch_number without caching train. I can filter on any of the columns that are in the original dataset in our database, but not on any column I have generated in pyspark after querying the database:
This: train.filter(train["days_elapsed"] == 0).select("days_elapsed").distinct().show() returns only 0.
But, all of these return all of the batch numbers between 0 and 9 without any filtering:
train.filter(train["batch_number"] == 0).select("batch_number").distinct().show()
train.filter(train.batch_number == 0).select("batch_number").distinct().show()
train.filter("batch_number = 0").select("batch_number").distinct().show()
train.filter(col("batch_number") == 0).select("batch_number").distinct().show()
This also does not work:
batch_df = spark.sql("SELECT * FROM train_table WHERE batch_number = 1")
All of these work if I do train.cache() first. Is that absolutely necessary or is there a way to do this without caching?
Spark >= 2.3 (? - depending on a progress of SPARK-22629)
It should be possible to disable certain optimizations using the asNondeterministic method.
Spark < 2.3
Don't use UDF to generate random numbers. First of all, to quote the docs:
The user-defined functions must be deterministic. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query.
Even if it wasn't for UDF, there are Spark subtleties, which make it almost impossible to implement this right, when processing single records.
Spark already provides rand:
Generates a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0].
and randn
Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
which can be used to build more complex generator functions.
There can be some other issues with your code, but this makes it unacceptable from the beginning (see the related questions "Random numbers generation in PySpark" and "pyspark. Transformer that generates a random number generates always the same number").
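To show what "building a more complex generator" from a uniform sample can look like, here is a small plain-Python sketch (not Spark code). The helper names batch_from_uniform and assign_batches are made up for illustration, and random.Random(seed) only stands in for a seeded U[0.0, 1.0] source.

```python
import math
import random

def batch_from_uniform(u, n_batches):
    """Map a sample u from U[0.0, 1.0] to an integer batch number in 0..n_batches-1."""
    # min() guards the edge case u == 1.0, which would otherwise give n_batches
    return min(int(math.floor(u * n_batches)), n_batches - 1)

def assign_batches(user_ids, n_batches, seed):
    """Assign every user a batch number, reproducibly for a given seed."""
    rng = random.Random(seed)
    return {user_id: batch_from_uniform(rng.random(), n_batches)
            for user_id in user_ids}

# Hypothetical user ids; 10 batches as in the question's output (0 to 9).
users = ["u%d" % i for i in range(1000)]
batches = assign_batches(users, n_batches=10, seed=123)
```

In PySpark the same idea would presumably be a column expression along the lines of floor(rand(seed) * N_BATCHES).cast("int"), with rand and floor taken from pyspark.sql.functions, in place of the Python UDF around np.random.randint.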

Text Classification using Spark ML

I have a free text description based on which I need to perform a classification. For example, the description can be that of an incident. Based on the description of the incident, I need to predict the risk associated with the event. E.g.: "A murder in town" - this description is a candidate for "high" risk.
I tried logistic regression but realized that currently there is support only for binary classification. For multi-class classification (there are only three possible values) based on a free text description, what would be the most suitable algorithm (Linear Regression or Naive Bayes)?
Since you are using Spark, I assume you have big data. I am no expert, but after reading your answer I would like to make some points.
Create the Training (80%) and Testing Data Sets (20%)
I would partition my data into Training (60-70%), Testing (15-20%) and Evaluation (15-20%) sets.
The idea is that you can fit your classification algorithm on the Training set, but what we really want from classification tasks is to have them classify unseen data. So fine-tune your algorithm with the Testing set, and when you are done, use the Evaluation set to get a real understanding of how things work!
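A minimal sketch of such a three-way split in plain Python, assuming the rows fit in a list; the function name and the 70/15/15 defaults are only illustrative.

```python
import random

def three_way_split(rows, train=0.7, test=0.15, seed=42):
    """Shuffle rows and cut them into training, testing and evaluation lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])  # whatever is left is the evaluation set

train_set, test_set, eval_set = three_way_split(range(100))
```

On a Spark DataFrame the same proportions could probably be passed to randomSplit with three weights instead of two.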
Stop words
If your data are articles from newspapers and such, I personally haven't seen any significant improvement from using more sophisticated stop word removal approaches...
That's just a personal observation, but if I were you, I wouldn't focus on that step.
Term Frequency
How about using Term Frequency-Inverse Document Frequency (TF-IDF) term weighting instead? You may want to read: How can I create a TF-IDF for Text Classification using Spark?
I would try both and compare!
Do you have any particular reason to try the Multinomial Distribution? If not, note that when n is 1 and k is 2 the multinomial distribution is the Bernoulli distribution, as stated in Wikipedia, and that one is supported.
Try both and compare ( this is something you have to get used to, if you wish to make your model better! :) )
I also see that apache-spark-mllib offers Random forests, which might be worth a read, at least! ;)
If your data is not that big, I would also try Support vector machines (SVMs) from scikit-learn. That is a Python library, however, so you would have to switch to pyspark or to plain Python, abandoning Spark. BTW, if you are actually going for sklearn, this might come in handy: How to split into train, test and evaluation sets in sklearn?, since Pandas plays nicely along with sklearn.
Hope this helps!
This is really not the way to ask a question in Stack Overflow. Read How to ask a good question?
Personally, if I were you, I would do all the things you have done in your answer first, and then post a question, summarizing my approach.
As for the bounty, you may want to read: How does the Bounty System work?
This is how I solved the above problem.
Though the prediction accuracy is not bad, the model has to be tuned further for better results.
Experts, please reply if you find anything wrong.
My input data frame has two columns "Text" and "RiskClassification"
Below are the sequence of steps to predict using Naive Bayes in Java
Add a new column "label" to the input dataframe. This column encodes the risk classification as a number, like below:
sqlContext.udf().register("myUDF", new UDF1<String, Integer>() {
    public Integer call(String input) throws Exception {
        if ("LOW".equals(input))
            return 1;
        if ("MEDIUM".equals(input))
            return 2;
        if ("HIGH".equals(input))
            return 3;
        return 0;
    }
}, DataTypes.IntegerType);
samplingData = samplingData.withColumn("label", functions.callUDF("myUDF", samplingData.col("riskClassification")));
Create the Training ( 80 % ) and Testing Data Sets ( 20 % )
For eg :
DataFrame lowRisk = samplingData.filter(samplingData.col("label").equalTo(1));
DataFrame lowRiskTraining = lowRisk.sample(false, 0.8);
Union All the dataframes to build the complete training data
Building test data is slightly tricky . Test Data should have all data which
is not present in the training data
Start transformation of training data and build the model
Tokenize the text column in the training data set
Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
DataFrame tokenized = tokenizer.transform(trainingRiskData);
Remove Stop Words. (Here you can also do more advanced operations like lemmatization, stemming, POS tagging etc. using the Stanford NLP library)
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered");
DataFrame stopWordsRemoved = remover.transform(tokenized);
Compute Term Frequency using HashingTF. CountVectorizer is another way to do this
int numFeatures = 20;
HashingTF hashingTF = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
    .setNumFeatures(numFeatures);
DataFrame rawFeaturizedData = hashingTF.transform(stopWordsRemoved);
IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(rawFeaturizedData);
DataFrame featurizedData = idfModel.transform(rawFeaturizedData);
Convert the featurized input into JavaRDD . Naive Bayes works on LabeledPoint
JavaRDD<LabeledPoint> labelledJavaRDD = featurizedData.select("label", "features").toJavaRDD()
    .map(new Function<Row, LabeledPoint>() {
        public LabeledPoint call(Row arg0) throws Exception {
            LabeledPoint labeledPoint = new LabeledPoint(new Double(arg0.get(0).toString()),
                (org.apache.spark.mllib.linalg.Vector) arg0.get(1));
            return labeledPoint;
        }
    });
Build the model
NaiveBayes naiveBayes = new NaiveBayes(1.0, "multinomial");
NaiveBayesModel naiveBayesModel = naiveBayes.train(labelledJavaRDD.rdd(), 1.0);
Run all the above transformations on the test data also
Loop through the test data frame and perform the below actions
Create a LabeledPoint using the "label" and "features" in the test data frame
For eg : If the test data frame has label and features in the third and seventh column , then
LabeledPoint labeledPoint = new LabeledPoint(new Double(dataFrameRow.get(3).toString()),
(org.apache.spark.mllib.linalg.Vector) dataFrameRow.get(7));
Use the Prediction Model to predict the label
double predictedLabel = naiveBayesModel.predict(labeledPoint.features());
Add the predicted label also as a column to the test data frame.
Now test data frame has the expected label and the predicted label.
You can export the test data to csv and do the analysis there, or you can compute the accuracy programmatically as well.
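For the last step, computing the accuracy programmatically can be as simple as the following plain-Python sketch. It assumes the expected and predicted labels have already been collected into two lists, with 1 = LOW, 2 = MEDIUM, 3 = HIGH as in the UDF above; the sample values are made up.

```python
from collections import Counter

def accuracy(expected, predicted):
    """Fraction of rows where the predicted label equals the expected label."""
    if not expected:
        return 0.0
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    return correct / len(expected)

def confusion_counts(expected, predicted):
    """Count of rows per (expected, predicted) label pair."""
    return Counter(zip(expected, predicted))

# Hypothetical labels for six test rows.
expected = [1, 2, 3, 3, 1, 2]
predicted = [1, 2, 3, 2, 1, 1]

acc = accuracy(expected, predicted)
confusion = confusion_counts(expected, predicted)
```

The confusion counts are often more useful than the single accuracy number, since they show which risk levels get mixed up (here one HIGH predicted as MEDIUM and one MEDIUM predicted as LOW).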

Spark: Normalising/Standardizing test-set using training set statistics

This is a very common process in Machine Learning.
I have a dataset and I split it into training set and test set.
Since I apply some normalizing and standardization to the training set,
I would like to use the same info of the training set (mean/std/min/max
values of each feature), to apply the normalizing and standardization
to the test set too. Do you know any optimal way to do that?
I am aware of the functions of MinMaxScaler, StandardScaler etc..
You can achieve this via a few lines of code on both the training and test set.
On the training side there are two approaches:
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each
Using SQL
from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
| AVG(uniform)| MIN(uniform)| MAX(uniform)|
On the testing data, you can then manually normalize the data using the statistics obtained above from the training data. You can decide in which sense you wish to normalize, e.g.:
Student's T
val normalized = testData.map { m =>
  (m - trainMean) / trainingSampleStddev
}
Feature Scaling
val normalized = testData.map { m =>
  (m - trainMean) / (trainMax - trainMin)
}
There are others: take a look at https://en.wikipedia.org/wiki/Normalization_(statistics)
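The whole pattern (fit the statistics on the training set only, then reuse them on the test set) can be sketched in plain Python for a single feature column. The function names are illustrative, and the sample standard deviation (n - 1 denominator) is an assumption on my part.

```python
import math

def fit_stats(column):
    """Mean, sample standard deviation, min and max of one training feature column."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((x - mean) ** 2 for x in column) / (n - 1)
    return {"mean": mean, "std": math.sqrt(variance),
            "min": min(column), "max": max(column)}

def standardize(values, stats):
    """(x - trainMean) / trainStd, using statistics from the training set."""
    return [(x - stats["mean"]) / stats["std"] for x in values]

def mean_normalize(values, stats):
    """(x - trainMean) / (trainMax - trainMin), the 'Feature Scaling' variant above."""
    span = stats["max"] - stats["min"]
    return [(x - stats["mean"]) / span for x in values]

# Hypothetical single-feature data.
train_column = [2.0, 4.0, 6.0, 8.0]
test_column = [5.0, 11.0]

stats = fit_stats(train_column)          # computed once, on training data only
test_standardized = standardize(test_column, stats)
test_scaled = mean_normalize(test_column, stats)
```

Note that test values can land outside the range seen in training (11.0 here), which is expected when the test set is scaled with training statistics. A constant training column would give a zero std or span and needs separate handling.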

Why is reporting the log perplexity of an LDA model so slow in Spark mllib?

I am fitting a LDA model in Spark mllib, using the OnlineLDAOptimizer. It only takes ~200 seconds to fit 10 topics on 9M documents (tweets).
val numTopics=10
val lda = new LDA()
.setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(math.min(1.0, mbf)))
.setDocConcentration(-1) // use default symmetric document-topic prior
.setTopicConcentration(-1) // use default symmetric topic-word prior
val startTime = System.nanoTime()
val ldaModel = lda.run(countVectors)
val elapsed = (System.nanoTime() - startTime) / 1e9
// Print results
// Print training time
println(s"Finished training LDA model. Summary:")
println(s"Training time (sec)\t$elapsed")
numTopics: Int = 10
lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@72678a91
startTime: Long = 11889875112618
ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel@351e2b4c
Finished training LDA model. Summary:
Training time (sec) 202.640775542
However when I request the log perplexity of this model (looks like I need to cast it back to LocalLDAModel first), it takes a very long time to evaluate. Why? (I'm trying to get the log perplexity out so I can optimize k, the # of topics).
res95: Double = 7.006006572908673
Took 1212 seconds.
In general, calculating the perplexity is not a straightforward matter:
Also setting the number of topics by only looking at perplexity might not be the right approach: https://www.quora.com/What-are-good-ways-of-evaluating-the-topics-generated-by-running-LDA-on-a-corpus
LDAModels learned with the online optimizer are of type LocalLDAModel anyways, so there is no conversion happening. I calculated perplexity on both, local and distributed: they take quite some time. I mean looking at the code, they have nested map calls on the whole Dataset.
docBound += count * LDAUtils.logSumExp(Elogthetad + localElogbeta(idx, ::).t)
Running this line (9M * nonzero BOW entries) times can take quite some time. The code is from:
https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala line 312
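To get a feeling for the cost, here is a plain-Python sketch of just that one term of the bound. It is not the Spark implementation: the names, the list/dict layout of elog_theta and elog_beta, and the toy values are all assumptions made for illustration.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def doc_bound_term(term_counts, elog_theta, elog_beta):
    """Sum of count * logSumExp(elog_theta + elog_beta[term]) over one document's non-zero terms."""
    total = 0.0
    for term, count in term_counts.items():
        # one pass over all k topics for every non-zero term of the document
        total += count * log_sum_exp(
            [theta_k + elog_beta[term][k] for k, theta_k in enumerate(elog_theta)])
    return total

# Hypothetical toy document with two distinct terms and k = 2 topics.
term_counts = {0: 2, 1: 1}
elog_theta = [0.0, 0.0]
elog_beta = {0: [0.0, 0.0], 1: [0.0, 0.0]}

bound_term = doc_bound_term(term_counts, elog_theta, elog_beta)
```

Each non-zero entry costs k exponentials and a logarithm, so the total work grows with (number of documents) x (non-zero terms per document) x (number of topics), and that is repeated for every value of k being compared.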
Training LDA is fast in your case because you train for just 2 iterations with 9m/mbf update calls.
Btw. the default for docConcentration is Vectors.dense(-1) and not just an Int.
Btw. number 2: Thanks for this question. I had trouble running my algorithm on a cluster just because I had this stupid perplexity calculation in it and didn't know it caused so much trouble.
