Clustering with unknown number of clusters in Spark - apache-spark

I have a very large dataset (about 3.5M text messages). I am using tf-idf vectors to represent each message. I want to cluster the messages that belong to the same topic together, but I don't know the actual clusters or even how many there are.
So I searched a little and found that OPTICS, DBSCAN or HDBSCAN could do the job, but there is no implementation of them in Spark ML or MLlib. According to this, Spark MLlib provides implementations of K-means, Gaussian mixture, Power iteration clustering (PIC), Latent Dirichlet allocation (LDA), Bisecting k-means and Streaming k-means.
So my problem is that all of them need K as an input, and I don't have it. Is there any clustering algorithm implemented in Spark that finds the number of clusters on its own?

This got a little too long for a comment, so I'll try to explain it here.
Do you have data on which topic each message belongs to? Then you can simply group by that topic to collect all the messages on the same topic.
That's one thing. If instead you are trying to derive the topics (K) from the dataset itself, you need a bit more statistics to build a sound feature set to cluster them. You can then settle on K by varying it and picking the value with minimal error; a well-known way to do this is the elbow method.
Check this out. https://www.geeksforgeeks.org/ml-determine-the-optimal-value-of-k-in-k-means-clustering/#:~:text=There%20is%20a%20popular%20method,fewer%20elements%20in%20the%20cluster.
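For what it's worth, here is a rough sketch of the elbow method with Spark ML's KMeans, assuming your tf-idf vectors already sit in a DataFrame df in a column called "features" (the column name, the k range and the use of summary.trainingCost, which needs Spark 2.4+, are assumptions on my side):

    from pyspark.ml.clustering import KMeans

    costs = {}
    for k in range(2, 21):
        model = KMeans(k=k, seed=1, featuresCol="features").fit(df)
        # within-cluster sum of squared distances for this k
        costs[k] = model.summary.trainingCost

    # inspect (or plot) the costs and pick the k where the curve flattens (the "elbow")
    for k, cost in sorted(costs.items()):
        print(k, cost)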

Related

Efficient implementation of SOM (Self organizing map) on Pyspark

I am struggling to implement a performant version of the SOM batch algorithm on Spark / Pyspark for a huge dataset with > 100 features.
I have the feeling that I can either use RDDs, where I can/have to specify the parallelization myself, or I use DataFrames, which should be more performant, but then I see no way to use something like a local accumulation variable on each worker.
Ideas:
Use accumulators. Parallelize the calculations by creating a UDF which takes the observations as input, calculates the impacts on the net and sends those impacts to an accumulator on the driver. (I have already implemented this version, but it seems rather slow; I think the accumulator updates take too long.)
Store the results in a new column of the DataFrame and then sum it together at the end. (I would have to store a whole neural net in each row though, e.g. 20*20*130.) Do Spark's optimizations realize that it does not need to save each net but only sum them together?
Create a custom parallelized algorithm using RDDs, similar to this: https://machinelearningnepal.com/2018/01/22/apache-spark-implementation-of-som-batch-algorithm/ (but with more performant calculation algorithms). However, I would have to loop over each row and update the net, which sounds rather unperformant.
Any thoughts on the different options? Is there an even better option?
Or are all of these ideas not that good, and should I just preselect a maximum-variety subset of my dataset and train a SOM locally on that?
Thanks!
This is exactly what I did last year, so I might be in a good position to give you an answer.
First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in Pyspark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:
they did not really implement the batch SOM algorithm, but used a map averaging method that gave me strange results (abnormal symmetries in the output map)
they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit()/transform() API operating over DataFrames.
So I went on to code it myself: the batch SOM algorithm in Spark ML style. The first thing I did was look at how k-means was implemented in Spark ML, because, as you know, the batch SOM is very similar to the k-means algorithm. I could actually re-use a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
I can summarize quickly how the model is built:
A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
A SOM class, which inherits from Spark's Estimator and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame, where the features are stored as a spark.ml.linalg.Vector in a single column. fit() selects this column and unpacks the DataFrame to obtain the underlying RDD[Vector] of features, and calls the run() method on it. This is where all the computations happen, and as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
SOMModel is a trained SOM model, and inherits from Spark's Transformer/Model. It contains the map prototypes (center vectors) and a transform() method that can operate on DataFrames by taking an input feature column and adding a new column with the predictions (projection on the map). This is done by a prediction UDF.
There is also a SOMTrainingSummary that collects things such as the objective function.
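To give an idea of the structure (in Pyspark rather than Scala, since the question mentions it), a stripped-down sketch of this Estimator/Model split could look like the following. This is not my actual implementation: the class names, the sampled "prototypes" standing in for the real training loop, and the brute-force prediction UDF are all placeholders.

    import numpy as np
    from pyspark.ml import Estimator, Model
    from pyspark.ml.param.shared import HasFeaturesCol, HasPredictionCol
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    class SOMModel(Model, HasFeaturesCol, HasPredictionCol):
        def __init__(self, prototypes):
            super().__init__()
            self.prototypes = prototypes  # map prototypes (center vectors)

        def _transform(self, dataset):
            protos = self.prototypes
            # prediction UDF: project each feature vector on its closest prototype
            predict = udf(
                lambda v: int(np.argmin([float(np.linalg.norm(v.toArray() - p)) for p in protos])),
                IntegerType())
            return dataset.withColumn(self.getPredictionCol(),
                                      predict(dataset[self.getFeaturesCol()]))

    class SOM(Estimator, HasFeaturesCol, HasPredictionCol):
        def _fit(self, dataset):
            # unpack the DataFrame to the underlying RDD of feature vectors
            rdd = dataset.select(self.getFeaturesCol()).rdd.map(lambda row: row[0].toArray())
            # the real batch-SOM training loop (RDDs, accumulators, broadcast
            # variables) would go here; sampled rows are just a placeholder
            prototypes = rdd.takeSample(False, 4, seed=0)
            return SOMModel(prototypes)

The point is only to show where the DataFrame API stops (the fit()/transform() signatures) and where the RDD work starts (inside _fit()).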
Here are the take-aways:
There is not really an opposition between RDDs and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as an RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and an optimization of the execution plan (Catalyst optimizer).
For structured data and select/filter/aggregation operations, DO USE DataFrames, always.
...but for more complex tasks such as a machine learning algorithm, you NEED to come back to the RDD API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey and so on. Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations!
Hope this answers your question. Concerning performance: you asked for an efficient implementation, and I have not run benchmarks yet, but I use it at work and it crunches 500k/1M-row datasets in a couple of minutes on the production cluster.

How to use clustering to group sentences with similar intents?

I'm trying to develop a program in Python that can process raw chat data and cluster sentences with similar intents so they can be used as training examples to build a new chatbot. The goal is to make it as quick and automatic as possible (i.e. no parameters to enter manually).
1- For feature extraction, I tokenize each sentence, stem its words and vectorize it using Sklearn's TfidfVectorizer.
2- Then I perform clustering on those sentence vectors with Sklearn's DBSCAN. I chose this clustering algorithm because it doesn't require the user to specify the desired number of clusters (like the k parameter in k-means). It throws away a lot of sentences (considering them as outliers), but at least its clusters are homogeneous.
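Roughly, a minimal version of this pipeline looks like the snippet below (stemming omitted; the eps/min_samples values and the cosine metric are placeholders, not the exact settings I use):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import DBSCAN

    sentences = ["hi there", "hello", "I want to cancel my order", "please cancel the order"]

    vectors = TfidfVectorizer().fit_transform(sentences)
    labels = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit_predict(vectors)
    # label -1 means the sentence was treated as noise/outlier
    print(labels)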
The overall algorithm works on relatively small datasets (10000 sentences) and generates meaningful clusters, but there are a few issues:
On large datasets (e.g. 800000 sentences), DBSCAN fails because it requires too much memory, even with parallel processing on a powerful machine in the cloud. I need a less computationally-expensive method, but I can't find another algorithm that doesn't make weird and heterogeneous sentence clusters. What other options are there? What algorithm can handle large amounts of high-dimensional data?
The clusters that are generated by DBSCAN are sentences that have similar wording (due to my feature extraction method), but the targeted words don't always represent intents. How can I improve my feature extraction so it better captures the intent of a sentence? I tried Doc2vec but it didn't seem to work well with small datasets made of documents that are the size of a sentence...
A standard implementation of DBSCAN should need only O(n) memory; you cannot get below that requirement. But I have read somewhere that sklearn's DBSCAN actually uses O(n²) memory, so it is not the optimal implementation. You may then need to implement it yourself to use less memory.
Don't expect these methods to be able to cluster "by intent". There is no way an unsupervised algorithm can infer what is intended. Most likely, the clusters will just be based on a few key words. But this could be whether people say "hi" or "hello". From an unsupervised point of view, this distinction gives two nice clusters (and some noise, maybe also another cluster "hola").
I suggest training a supervised feature extractor on a subset where you label the "intent".
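As a sketch of what I mean (the labelled examples, the choice of LogisticRegression and the use of predict_proba outputs as intent-aware features are all assumptions, not a recipe):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    labelled   = ["hi there", "hello", "cancel my order", "please cancel the order"]
    intents    = ["greeting", "greeting", "cancel", "cancel"]
    unlabelled = ["good morning", "I would like to cancel"]

    vec = TfidfVectorizer().fit(labelled + unlabelled)
    clf = LogisticRegression().fit(vec.transform(labelled), intents)

    # class-probability vectors as features for the unlabelled sentences;
    # cluster these instead of the raw tf-idf vectors
    features = clf.predict_proba(vec.transform(unlabelled))
    print(features)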

General principles behind Spark MLlib parallelism

I'm new to Spark (and to cluster computing frameworks in general) and I'm wondering about the general principles followed by the parallel algorithms used for machine learning (MLlib). Are they essentially faster because Spark distributes the training data over multiple nodes? If so, I suppose that all nodes share the same set of parameters, right? And that they have to combine (e.g. sum) the intermediate calculations (e.g. the gradients) on a regular basis, am I wrong?
Secondly, suppose I want to fit my data with an ensemble of models (e.g. 10). Wouldn't it be simpler, in this particular context, to run my good old machine-learning program independently on 10 machines instead of having to write complicated code (for me, at least!) for training on a Spark cluster?
Corollary question: is Spark (or another cluster computing framework) only useful for big-data applications for which we could not afford to train more than one model, and for which training time on a single machine would be too long?
You are correct about the general principle. A typical MLlib algorithm is an iterative procedure with a local phase and a data exchange phase.
MLlib algorithms are not necessarily faster. They try to solve two problems:
disk latency.
memory limitations on a single machine.
If you can process data on a single node this can be orders of magnitude faster than using ML / MLlib.
The last question is hard to answer but:
It is not complicated to train ensembles:
    import numpy as np

    def train_model(iter):
        # each partition becomes one local training set
        items = np.array(list(iter))
        model = ...  # fit any local (non-distributed) model on `items`
        return [model]  # mapPartitions expects an iterable

    models = rdd.mapPartitions(train_model).collect()
There are projects which already do that (https://github.com/databricks/spark-sklearn)

How to determine the number of topics in the LDA (Latent Dirichlet Allocation) algorithm for text clustering?

I am using the LDA algorithm to cluster many documents into different topics. The LDA algorithm needs an input parameter: the number of topics. How can I determine this?
I am using the Reuters corpus to benchmark my solution, and the Reuters corpus already comes with its topics labelled. Should I input that same number of topics when I cluster the Reuters text, and compare my clustering result against Reuters'?
But in production, how can I know the number of topics before I have actually clustered by topic? It's kind of a chicken-and-egg problem.
One way you can approach this is through k-means. With the silhouette score (or the elbow curve, but I guess that requires manual intervention) you can get the optimal number of clusters, and you can use this number as the number of topics.
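A minimal sketch of that silhouette approach (the toy documents, tf-idf features and the k range are illustrative; the chosen k is then fed to LDA as the number of topics):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    docs = ["oil prices rise", "crude oil futures up", "wheat harvest strong", "corn and wheat exports"]
    X = TfidfVectorizer().fit_transform(docs)

    scores = {}
    for k in range(2, 4):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)

    best_k = max(scores, key=scores.get)  # use best_k as the number of LDA topics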

how to determine the number of topics for LDA?

I am new to LDA and I want to use it in my work. However, some problems have come up.
In order to get the best performance, I want to estimate the best number of topics. After reading "Finding Scientific Topics", I know that I can first calculate log P(w|z) and then use the harmonic mean of a series of P(w|z) values to estimate P(w|T).
My question is what does the "a series of" mean?
Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge, the hierarchical Dirichlet process (HDP) is quite possibly the best way to arrive at the optimal number of topics.
If you are looking for deeper analyses, this paper on HDP reports the advantages of HDP in determining the number of groups.
A reliable way is to compute the topic coherence for different numbers of topics and choose the model that gives the highest topic coherence. But sometimes the highest value may not fit the bill.
See this topic modeling example.
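A sketch of the coherence approach with gensim (the toy corpus and the range of topic numbers are illustrative):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, CoherenceModel

    texts = [["oil", "price", "rise"], ["crude", "oil", "futures"],
             ["wheat", "harvest"], ["corn", "wheat", "exports"]]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    scores = {}
    for k in range(2, 5):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0)
        scores[k] = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                                   coherence="c_v").get_coherence()

    best_k = max(scores, key=scores.get)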
First, some people use the harmonic mean for finding the optimal number of topics; I tried it as well, but the results were unsatisfactory. So my suggestion is: if you are using R, the package "ldatuning" will be useful. It has four metrics for calculating the optimal number of topics. Perplexity and log-likelihood based V-fold cross-validation are also very good options for topic modeling, although V-fold cross-validation is a bit time-consuming for large datasets. You can see "A heuristic approach to determine an appropriate number of topics in topic modeling".
Important links:
https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597325/
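Not the R packages linked above, but the same perplexity idea can be sketched in Python with scikit-learn (toy documents; the topic range is illustrative, and ideally the perplexity should be computed on a held-out split rather than the training data):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["oil prices rise", "crude oil futures", "wheat harvest", "corn and wheat exports"]
    X = CountVectorizer().fit_transform(docs)

    for k in range(2, 5):
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
        print(k, lda.perplexity(X))  # lower perplexity is better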
Let k = number of topics
There is no single best way, and I am not even sure whether there is any standard practice for this.
Method 1:
Try out different values of k, select the one that has the largest likelihood.
Method 2:
Instead of LDA, see if you can use HDP-LDA
Method 3:
If HDP-LDA is infeasible on your corpus (because of corpus size), then take a uniform sample of your corpus, run HDP-LDA on that, and take the value of k it gives you. For a small interval around this k, use Method 1.
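As a sketch of Methods 2/3 with gensim's HdpModel (toy corpus; deciding which of the truncated HDP topics actually "count" is a judgement call, not an exact recipe):

    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    texts = [["oil", "price", "rise"], ["crude", "oil", "futures"],
             ["wheat", "harvest"], ["corn", "wheat", "exports"]]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    hdp = HdpModel(corpus=corpus, id2word=dictionary, random_state=0)
    # HDP keeps a truncated list of topics; inspect their weights, keep only the
    # topics that receive noticeable mass, and use that count as k for LDA
    for topic in hdp.print_topics(num_topics=10, num_words=5):
        print(topic)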
Since I am working on the same problem, I just want to add the method proposed by Wang et al. (2019) in their paper "Optimization of Topic Recognition Model for News Texts Based on LDA". Besides giving a good overview, they suggest a new method: first you train a word2vec model (e.g. using the word2vec package), then you apply a clustering algorithm capable of finding density peaks (e.g. from the densityClust package), and then you use the number of clusters found as the number of topics in the LDA algorithm.
If time permits, I will try this out. I also wonder whether the word2vec model can make the LDA obsolete.
