Spark GMM fails to divide points into correct clusters - apache-spark

I have a data set which looks like this (user names have been obfuscated; there are 40 users in total, I didn't want to show them all):
(a1,List(1.0, 1015.0))
(a2,List(2.0, 2015.0))
(a3,List(3.0, 3015.0))
(a4,List(1.0, 1015.0))
(a5,List(0.0, 15.0))
(a6,List(0.0, 15.0))
Basically, I want to create a sample app with 4 really obvious clusters (10 users in each) and show that users with the same characteristics fall into the same cluster.
The code creates vectors from the data, trains a model and predicts according to the model learned:
// Strip the "List(...)" wrapper and split each record into [user, feature1, feature2]
val cuttedData = data1.map(s => s.replace("List", "").replace("(", "").replace(")", "").trim.split(','))
// Drop the user name and build dense feature vectors
val parsedData1 = cuttedData.map(item => Vectors.dense(item.drop(1).map(t => t.toDouble))).cache()
parsedData1.foreach(println)
// Train a Gaussian mixture model with 4 components
val gmm1 = new GaussianMixture().setK(4).run(parsedData1)
// Predict a cluster for each user
cuttedData.foreach(item => {println(item(0) + " : " + gmm1.predict(Vectors.dense(item.drop(1).map(i => i.toDouble))))})
The issue is that the users are assigned to only 3 clusters, so the feature -> cluster relations are:
(1.0, 1015.0) -> 1
(0.0, 15.0) -> 0
(2.0, 2015.0), (3.0, 3015.0) -> 2
The clusters, as printed from the model, are:
weight=0.150000
mu=[2.4445803037860923E-11,15.000000024445805]
sigma=
5.7824573923325105E-11 5.782457392332511E-8
5.782457392332511E-8 5.782457377877571E-5
weight=0.150000
mu=[2.4445803037860923E-11,15.000000024445805]
sigma=
5.7824573923325105E-11 5.782457392332511E-8
5.782457392332511E-8 5.782457377877571E-5
weight=0.371495
mu=[2.1293846059560595,2144.3846059560597]
sigma=
0.4824411855092578 482.4411855092562
482.4411855092562 482441.185509255
weight=0.328505
mu=[1.8536826474572847,1868.6826474572852]
sigma=
0.4795182413266584 479.51824132665774
479.51824132665774 479518.24132665637
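For reference, per-component parameters in this format can be printed with a loop like the one in the standard MLlib GaussianMixture example (a sketch using the gmm1 model trained above):
// Print weight, mean and covariance of each Gaussian component
for (i <- 0 until gmm1.k) {
  println("weight=%f\nmu=%s\nsigma=\n%s\n".format(
    gmm1.weights(i), gmm1.gaussians(i).mu, gmm1.gaussians(i).sigma))
}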
I don't understand why the users are classified incorrectly. I tried increasing the K parameter to 100 and using Normalizer on the data, but it didn't help.
Another thing to notice is that when I use KMeans on the same data it works perfectly.
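For comparison, here is a minimal K-Means sketch on the same vectors (the iteration count is an arbitrary choice, not taken from the original post):
import org.apache.spark.mllib.clustering.KMeans
// Train K-Means with 4 clusters and 20 iterations on the same parsed vectors
val kmeansModel = KMeans.train(parsedData1, 4, 20)
// Print the cluster assigned to each user
cuttedData.foreach { item =>
  val v = Vectors.dense(item.drop(1).map(_.toDouble))
  println(item(0) + " : " + kmeansModel.predict(v))
}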

Related

XGBoost training fails if null values exist (setHandleInvalid "keep" used across the whole pipeline)

I'm training an XGBoostRegressor model using Spark (Scala), and I've noticed that the number of predicted values from model.transform(df) is less than the number of records given to the model.
The problem is that there are (and should be, per my use case) NULL values. I've handled those along the way by using setHandleInvalid at every stage I have (specifically: stringIndexer, oneHotEncoder, vectorAssembler).
Still, if I use "keep", the model fails to train; if I use "skip" (only on the vectorAssembler), the model manages to train, but it simply discards any record where even one field is null.
I've searched Google extensively but haven't found a solution.
I would appreciate anyone's input. Thanks in advance.
I've gone through the Spark, Scala, and XGBoost docs, seen several PRs that didn't help, and tried several strategies for dealing with null values, but none of them succeeded.
For the "keep" case (where training fails):
.setInputCol("country_code")
.setOutputCol("country_code_indexed")
.setHandleInvalid("keep")
val oneHotEncoder = new OneHotEncoderEstimator()
.setInputCol("user_country_code_indexed")
.setOutputCol("user_country_oneHotEncoded")
.setHandleInvalid("keep")
val assembler = new VectorAssembler()
.setInputCols(trainUpdated.drop("label",
"someCol1",
"someCol2",
"country_code",
"country_code_indexed").columns)
.setOutputCol("features")
.setHandleInvalid("keep")
val xgboostRegressor = new XGBoostRegressor(Map[String, Any](
"num_round" -> 100,
"num_workers" -> 10, //num of instances * num of cores is the max.
"objective" -> "reg:linear",
"eta" -> 0.1,
"gamma" -> 0.5,
"max_depth" -> 6,
"early_stopping_rounds" -> 9,
"seed" -> 1234,
"lambda" -> 0.4,
"alpha" -> 0.3,
"colsample_bytree" -> 0.6,
"subsample" -> 0.3
))
Then I get:
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
Expected result: the model trains with null values (as that is its default behavior...) and returns the exact number of records it was given, for both train and test (fit/transform, same strategy for both).
I should add that I've discussed this issue with the XGBoost creators and contributed back to the community by updating the documentation accordingly. The updated doc is here (missing values section) - https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html
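For completeness, a hedged sketch of the direction that missing-values section points in: keep handleInvalid("keep") on the VectorAssembler (so nulls surface as NaN rather than dropped rows) and, if a different sentinel marks missing features, declare it explicitly through the missing parameter. The values below are illustrative assumptions, not the original pipeline's settings:
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor
// Illustrative only: XGBoost treats NaN as missing by default; if the pipeline
// encodes missing values with another sentinel, declare it via "missing".
val xgbWithMissing = new XGBoostRegressor(Map[String, Any](
  "num_round" -> 100,
  "objective" -> "reg:linear",
  "missing" -> Float.NaN // assumption: nulls kept by the assembler show up as NaN
))
  .setFeaturesCol("features")
  .setLabelCol("label")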

Reconstructing k-means using pre-computed cluster centres

I'm using k-means for clustering with 60 clusters. Since some of the clusters come out as meaningless, I've deleted those cluster centers (8 of them) from the cluster center array and saved the remaining ones in clean_cluster_centers.
This time, I'm re-fitting the k-means model with init = clean_cluster_centers, n_clusters = 52, and max_iter = 1, because I want to avoid re-fitting as much as possible.
The basic idea is to recreate a new model with clean_cluster_centers. The problem is that, since we are removing a large number of clusters, the model quickly moves to new, more stable centers even with max_iter = 1. Is there any way to recreate the k-means model with exactly the centers I kept?
If you've fitted a KMeans object, it has a cluster_centers_ attribute. You can directly update it by doing something like this:
cls.cluster_centers_ = new_cluster_centers
So if you want a new object with the clean cluster centers, just do something like the following:
import copy
cls = KMeans().fit(X)
cls2 = copy.deepcopy(cls)  # sklearn estimators don't expose a .copy() method
cls2.cluster_centers_ = new_cluster_centers
And now, since the predict function only checks that your object has a fitted cluster_centers_ attribute, you can use the predict function:
def predict(self, X):
    """Predict the closest cluster each sample in X belongs to.

    In the vector quantization literature, `cluster_centers_` is called
    the code book and each value returned by `predict` is the index of
    the closest code in the code book.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        New data to predict.

    Returns
    -------
    labels : array, shape [n_samples,]
        Index of the cluster each sample belongs to.
    """
    check_is_fitted(self, 'cluster_centers_')

    X = self._check_test_data(X)
    x_squared_norms = row_norms(X, squared=True)
    return _labels_inertia(X, x_squared_norms, self.cluster_centers_)[0]
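As an aside, since most of this page is about Spark: in Spark MLlib the analogous trick is even more direct, because a KMeansModel can be constructed from an arbitrary array of centers, so no re-fitting is needed at all. A minimal sketch (the center values are placeholders):
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors
// Build a model directly from hand-picked cluster centers (values are placeholders)
val cleanCenters = Array(Vectors.dense(1.0, 2.0), Vectors.dense(5.0, 6.0))
val model = new KMeansModel(cleanCenters)
model.predict(Vectors.dense(1.1, 2.1)) // assigns the point to the nearest kept center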

Spark: Normalising/Standardizing test-set using training set statistics

This is a very common process in Machine Learning.
I have a dataset and I split it into training set and test set.
Since I apply some normalization and standardization to the training set, I would like to use the same statistics from the training set (mean/std/min/max of each feature) to normalize and standardize the test set too. Do you know an optimal way to do that?
I am aware of MinMaxScaler, StandardScaler, etc.
You can achieve this via a few lines of code on both the training and test set.
On the training side there are two approaches:
MultivariateStatisticalSummary
http://spark.apache.org/docs/latest/mllib-statistics.html
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each
Using SQL
from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
+------------------+-------------------+------------------+
|      AVG(uniform)|       MIN(uniform)|      MAX(uniform)|
+------------------+-------------------+------------------+
|0.5215336029384192|0.19657711634539565|0.9970412477032209|
+------------------+-------------------+------------------+
On the test data you can then manually "normalize" the data using the statistics obtained above from the training data. You can decide in which sense you wish to normalize, e.g.
Student's T
val normalized = testData.map { m =>
  (m - trainMean) / trainingSampleStddev
}
Feature Scaling
val normalized = testData.map { m =>
  (m - trainMean) / (trainMax - trainMin)
}
There are others: take a look at https://en.wikipedia.org/wiki/Normalization_(statistics)
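Putting the two pieces together, a minimal sketch of standardizing a test set with training-set statistics (assuming trainingData and testData are RDD[Vector]; the names are placeholders):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
// Column statistics computed on the TRAINING set only
val summary = Statistics.colStats(trainingData)
val trainMean = summary.mean.toArray
val trainStd = summary.variance.toArray.map(math.sqrt)
// Standardize the TEST set element-wise with the training statistics
val normalizedTest = testData.map { v =>
  val x = v.toArray
  Vectors.dense(x.indices.map(i => (x(i) - trainMean(i)) / trainStd(i)).toArray)
}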

Spark: Dimensions mismatch error with RDD[LabeledPoint] union

I would ideally like to do the following:
In essence, for my dataset, which is an RDD[LabeledPoint], I want to control the ratio of positive and negative labels.
val training_data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "training_data.tsv")
This dataset has both cases and controls included in it. I want to control the ratio of cases to controls (my dataset is skewed), so I want to do something like sampling training_data such that the ratio of cases to controls is 1:2 (instead of, say, 1:500).
I was not able to do that, so I separated the training data into cases and controls as below and then tried to combine them later using the union operator, which gave me the dimensions mismatch error.
I have two datasets (both in Libsvm format):
val positives: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "positives.tsv")
val negatives: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "negatives.tsv")
I want to combine these two to form training data. Note both are in libsvm format.
training = positives.union(negatives)
When I use the above training dataset for model building (such as logistic regression) I get an error, since positives and negatives can have different numbers of columns/dimensions. The error is: "Dimensions mismatch when merging with another summarizer". Any idea how to handle that?
In addition, I also want to do sampling, such as:
positives_subset = positives.sample()
I was able to solve this by working from a single loaded dataset (so both subsets share the same feature dimension) in the following way:
def create_subset(training: RDD[LabeledPoint], target_label: Double, sampling_ratio: Double): RDD[LabeledPoint] = {
  val training_filtered = training.filter { case LabeledPoint(label, features) => label == target_label }
  val training_subset = training_filtered.sample(false, sampling_ratio)
  training_subset
}
Then calling the above method as:
val positives = create_subset(training, 1.0, 1.0)
val negatives_sampled = create_subset(training, 0.0, sampling_ratio)
Then you can take the union as:
val training_subset_double = positives.union(negatives_sampled)
and then I was able to use the training_subset_double for model building.
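A related note, as a hedged sketch: the mismatch typically comes from loadLibSVMFile inferring the feature count separately for each file, so if you do want to keep separate positive/negative files, loading both with an explicit numFeatures keeps the dimensions aligned (the value below is a placeholder for the true dimensionality):
import org.apache.spark.mllib.util.MLUtils
// Pass the feature count explicitly so both RDDs share the same dimension
val numFeatures = 1000 // placeholder: set this to the real number of features
val positives = MLUtils.loadLibSVMFile(spark, "positives.tsv", numFeatures)
val negatives = MLUtils.loadLibSVMFile(spark, "negatives.tsv", numFeatures)
val training = positives.union(negatives)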

K-Means clustering is biased to one center

I have a corpus of wiki pages (baseball, hockey, music, football) which I'm running through tf-idf and then through k-means. After a couple of issues to start (you can see my previous questions), I'm finally getting a KMeansModel... but when I try to predict, I keep getting the same center. Is this because of the small dataset, or because I'm comparing multi-word documents against a much shorter query (1-20 words)? Or is there something else I'm doing wrong? See the code below:
//Preprocessing of data includes splitting into words
//and removing words with only 1 or 2 characters
val corpus: RDD[Seq[String]]
val hashingTF = new HashingTF(100000)
val tf = hashingTF.transform(corpus)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf).cache
val kMeansModel = KMeans.train(tfidf, 3, 10)
val queryTf = hashingTF.transform(List("music"))
val queryTfidf = idf.transform(queryTf)
kMeansModel.predict(queryTfidf) //Always the same, no matter the term supplied
This question seems somewhat related to this one
More a checklist than an answer:
A single-word query or a very short sentence is probably not a good choice, especially when combined with a large feature vector. I would start with significant fragments of the documents from the corpus.
Manually check the similarity between the query and each cluster. Is it even remotely similar to any cluster?
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import breeze.linalg.functions.cosineDistance
import org.apache.spark.mllib.linalg.{Vector, SparseVector, DenseVector}

def toBreeze(v: Vector): BV[Double] = v match {
  case DenseVector(values) => new BDV[Double](values)
  case SparseVector(size, indices, values) => new BSV[Double](indices, values, size)
}
val centers = kMeansModel.clusterCenters.map(toBreeze(_))
val query = toBreeze(queryTfidf)
centers.map(c => cosineDistance(query, c))
Does K-Means converge? Depending on the dataset and the initial centroids, ten or twenty iterations may not be enough. Try increasing this number to one thousand or so and see if the problem persists (see the sketch after this checklist).
Is your corpus diverse enough to form meaningful clusters? Try predicting a cluster for each document in your corpus. Do you get a relatively uniform distribution, or are almost all documents assigned to a single cluster?
Perform a visual inspection. Take your tfidf RDD, convert it to a matrix, apply PCA, plot, color by cluster, and see if you get meaningful results.
Plot the centroids as well and check whether they cover the possible clusters. If not, check convergence once again.
You can also check similarities between centroids:
(0 until centers.size)
  .toList
  .flatMap(i => ((i + 1) until centers.size)
    .map(j => (i, j, 1 - cosineDistance(centers(i), centers(j)))))
Is your pre-processing thorough enough? Simply removing short words most likely won't suffice. I would at least extend it with stopword removal. Some stemming wouldn't hurt either.
K-Means results depend on the initial centroids. Try running the algorithm multiple times and see if the problem persists.
Try a more sophisticated algorithm like LDA.
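A minimal sketch combining the convergence and distribution checks above (tfidf is the RDD from the question; k and the iteration count are illustrative):
import org.apache.spark.mllib.clustering.KMeans
// Train with many more iterations, then inspect how documents spread across clusters;
// a single dominant cluster reproduces the bias described in the question.
val model = KMeans.train(tfidf, 3, 1000)
val assignments = model.predict(tfidf)
assignments.countByValue().foreach { case (cluster, count) =>
  println(s"cluster $cluster -> $count documents")
}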
