How can I use KMeans to cluster tweets in Spark?

I'd like to cluster tweets based on topic (e.g. all Amazon tweets in one cluster, all Netflix tweets in another, etc.). The thing is, all the incoming tweets are already filtered on these keywords, but they're jumbled up, and I'm just categorizing them as they come in.
I'm using Spark streaming and am looking for a way to vectorize these tweets. Because this is batch processing, I don't have access to the entire corpus of tweets.

If you have a predefined vocabulary with potentially multiple terms selected simultaneously - e.g. a set of non-mutually-exclusive tweet categories that you are interested in - then you can have a binary vector in which each bit represents one of the categories.
If the categories are mutually exclusive, then what could you hope to achieve by clustering? Specifically, there would be no "gray area" in which some observations belong to CategorySet-A, others to CategorySet-B and others to some in-between combination. If every observation is hard-capped at one category, then you have discrete points, not clusters.
If instead you wish to cluster based on similar sets of words - then you might need to know the "vocabulary" up-front - which in this case means: "what are the tweet terms that I care about". In that case you can use a bag of words model to compare the tweets - and then cluster based on the generated vectors.
Now if you are uncertain of the vocabulary a priori - which is the likely case here, since you do not know what the content of the next tweet will be - then you will likely resort to re-clustering on a regular basis as you gain new words. You can then use an updated bag of words that includes the newly "seen" terms. Note that this incurs processing cost and latency. To avoid the cost/latency you have to decide ahead of time which terms to restrict your clustering to, which may be possible if you're interested in a targeted subject.
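As a rough illustration of the fixed-vocabulary option, here is a minimal sketch in plain Python. The vocabulary and the function names are made up for the example (they are not from the question), and in a real Spark job the same logic would run inside whatever transformation you apply to each batch.

```python
import re

# Hypothetical vocabulary decided ahead of time (an assumption for this sketch).
VOCABULARY = ["amazon", "netflix", "prime", "stream", "delivery", "show"]
INDEX = {term: i for i, term in enumerate(VOCABULARY)}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(text):
    """Count occurrences of each vocabulary term; all other tokens are ignored."""
    vector = [0] * len(VOCABULARY)
    for token in tokenize(text):
        if token in INDEX:
            vector[INDEX[token]] += 1
    return vector

def binary_vector(text):
    """One bit per vocabulary term: 1 if the term occurs in the text, else 0."""
    return [1 if count > 0 else 0 for count in bag_of_words(text)]
```

Because the vocabulary is fixed, every tweet maps to a vector of the same length no matter when it arrives, so the vectors can be fed to a clustering step without having seen the whole corpus.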


How to measure similarity between sentences inside a cluster after clustering?

I'm conducting topic modeling analysis on messages from public Telegram groups, super new to this area so just learning.
I've been following this example here, and tried swapping out the HDBSCAN clustering algorithm with the one in BERT's documentation, util.community_detection.
When I output the results of the clusters in this example (4899 Telegram messages), I get something that looks like this.
Topic: just a cluster label
Doc: all the messages in that cluster combined together
0: top keywords found via tf-idf
The problem I'm concerned with is that there are clearly a ton of messages that are basically identical to each other; I've marked them in yellow. A few examples:
Cluster 3: this is just a bunch of "hellos" and variations thereof
Cluster 5: this is just a bunch of "Ok"s, people saying yes / ok
Cluster 7: people just saying thanks and variations on that
Cluster 9: some variations and misspellings of the word "gas"
Cluster 19: just "siap" which I think means "sorry if I already posted"
To a human reader I feel like this type of text should just be excluded from the analysis altogether, the question is how do I detect it.
Since they're already grouped together by the clustering algorithm, the algorithm must have ways to measure the "similarity" between these messages within a cluster. But I don't seem to be able to find these values exposed anywhere or what it's called. Take the HDBSCAN algorithm, for example: I skimmed through the doc a few times and didn't find any such property or measure exposed. Am I missing something here?
My hypothesis is that for the cases where it's just a word or a short phrase repeated over and over again, this similarity value must be super super high, and I'd just say "clusters whose internal similarity is higher than this threshold are getting thrown out".
Any help & advice would be greatly appreciated, thanks!
Index the corpus you are interested in (e.g. with FAISS). Just to give an idea, example code is below:
import faiss
import numpy as np

def build_index(self):
    """Return a FAISS inner-product index over the encoded search documents."""
    vectors = [self.encode(document) for document in self.documents]
    index = faiss.IndexIDMap(faiss.IndexFlatIP(768))  # 768 = dimensionality of the vector space
    # FAISS expects float32 vectors and int64 IDs; IDs run from 0 to len(documents) - 1.
    matrix = np.array([vec.numpy() for vec in vectors], dtype=np.float32)
    index.add_with_ids(matrix, np.arange(len(self.documents), dtype=np.int64))
    return index
Then apply a similarity metric such as L2 (Euclidean) distance, or cosine similarity computed with dot products (the dot product equals the cosine only when the vectors are L2-normalized). Essentially, the concept is that once we transform texts into vectors in an n-dimensional space, vectors with similar semantics are grouped together. Therefore, computing similarity is just computing the angle between them and taking its cosine. Similar vectors have a smaller angle and therefore a higher cosine value, and vice versa.
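To connect this to the thresholding idea in the question, here is a minimal sketch of an "internal similarity" score: the mean pairwise cosine similarity of the vectors in one cluster. This is not something HDBSCAN or FAISS exposes under that name; the function names are my own, and the vectors are assumed to be plain lists of floats (e.g. sentence embeddings converted to lists).

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs in one cluster."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # a cluster with a single member is trivially self-similar
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

A cluster of near-duplicate messages should score close to 1.0, so you could drop clusters above some cutoff. The cutoff itself is something you would have to tune on your own data. Note that the number of pairs grows quadratically with cluster size, so for large clusters you may want to sample pairs instead.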
Check the following topics for your problem.
Cosine Similarity
Sentence Vectors (similar to word vectors, but are good for long documents)
Check this repository for a better understanding of sentence vectorization and computing similarity to retrieve top n sentences.
In short,
Create an index file using FAISS for your data of interest.
Compute similarity by calling one of its methods.
Get top n most similar results.
Removing stop words:
Essentially, your problem can be attributed to a finite list of stop words. If you can bound the number of such keywords (e.g. at most 25 or so), then the task becomes stop word removal. You can use the NLTK or spaCy libraries for easy stop word removal. You can also specify the words in a list of strings and write a condition so that any token matching one of those strings is dropped from downstream processing. Omitting stop words is a necessary pre-processing task in NLP. Your task on Telegram data is also similar to Twitter analysis. Check this & this.
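The hand-written list variant could look roughly like this. The filler list and function names are hypothetical and only for illustration; in practice you would build the list from the clusters you want to get rid of, or start from the NLTK / spaCy stop word lists.

```python
import re

# Hypothetical filler words (an assumption for this sketch, not a standard list).
FILLER_WORDS = {"hello", "hi", "ok", "yes", "thanks", "thank", "you"}

def remove_fillers(message):
    """Tokenize a message and drop every token that is in the filler list."""
    tokens = re.findall(r"\w+", message.lower())
    return [token for token in tokens if token not in FILLER_WORDS]

def keep_informative(messages):
    """Keep only the messages that still contain something after filtering."""
    return [message for message in messages if remove_fillers(message)]
```

Messages that consist only of greetings or acknowledgements end up empty after filtering and are dropped before they ever reach the clustering step.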

Simple Binary Text Classification

I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
Data is: title & abstract (mean=1300 characters)
Any approaches may be used or even combined, including supervised machine learning and/or establishing features that give rise to threshold values for inclusion, among others.
Approaches could draw on the key terms that describe the conceptual space, though a simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, etc.
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1=relevant, 0=irrelevant). Would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!
Several Ideas:
Run LDA and get the document-topic and topic-word distributions (say 20 topics, depending on how many different topics your dataset covers). Label the top r% of documents, ranked by the weight of the relevant topic, as relevant and the bottom nr% as non-relevant. Then train a classifier over those labelled documents.
Just use bag of words and retrieve the top r percent nearest neighbours to your query (your conceptual space) as relevant and the bottom nr percent as not relevant, and train a classifier over them.
If you had the citations you could run label propagation over the network graph by labelling very few papers.
Don't forget to make the title words different from your abstract words by changing the title words to title_word1, so that any classifier can put more weight on them.
Cluster the articles into, say, 100 clusters and then manually label those clusters. Choose the number 100 based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If it is the case that the number of relevant documents is way less than non-relevant ones, then the best way to go is to find the nearest neighbours to your conceptual space (e.g. using information retrieval implemented in Lucene). Then you can manually go down in your ranked results until you feel the documents are not relevant anymore.
Most of these methods are Bootstrapping or Weakly Supervised approaches to text classification, about which you can find more literature.
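To make the bag-of-words and title-prefix ideas concrete, here is a small sketch in plain Python. The function names and the "title_" prefix are illustrative assumptions, and cosine similarity over raw counts is just one possible ranking function (a real setup might use TF-IDF weights or a search engine such as Lucene instead).

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def featurize(title, abstract):
    """Bag of words in which title tokens carry a 'title_' prefix."""
    bag = Counter("title_" + token for token in tokens(title))
    bag.update(tokens(abstract))
    return bag

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_by_query(articles, query_terms):
    """Return article indices sorted from most to least similar to the query."""
    query = Counter()
    for term in query_terms:
        # Let each query term match both its abstract form and its title form.
        query[term] += 1
        query["title_" + term] += 1
    scores = [cosine(featurize(title, abstract), query) for title, abstract in articles]
    return sorted(range(len(articles)), key=lambda i: scores[i], reverse=True)
```

The top of the ranking gives candidate positives and the bottom gives candidate negatives, which can then serve as weak labels for training a classifier.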

Determine text similarity through cluster analysis

I am a senior bachelor student in CS and I am currently working on my thesis. For this thesis I wrote a program that uses a density-based clustering approach, more specifically the OPTICS algorithm. I have an idea of how to use it, but I don't know if it is valid.
I want to use this algorithm for text classification. Texts are points in the set that have to be clustered, so that the resulting hierarchy consists of categories and subcategories of texts. For example, one such set is "Scientific literature", consisting of subsets "Mathematics", "Biology" etc.
I came up with the idea that I can analyze texts for specific words that occur in a particular text more often than in the whole dataset, while also excluding insignificant words like prepositions. Perhaps I can use open source natural language parsers for that purpose, like the Stanford parser. After that the program combines these "characteristic words" from each text into one set, and a certain number of the most frequent words can be taken from this set. That number becomes the dimensionality for the clustering, and each word's frequency in a particular text is used as a coordinate of a point. Thus we can cluster them.
The question is, is that idea valid or complete nonsense? Can clustering in general and density-based clustering in particular be used for such classification? Maybe there is some kind of literature that can point me in the right direction?
Clustering != classification.
Run the clustering algorithm, and study the results. Most likely, there will not be a cluster "scientific literature" with subjects "mathematics" - what do you do then?
Also, clusters will only give you sets, which is too coarse for similarity search. On the contrary, you first need to solve the similarity problem before you can run clustering algorithms such as OPTICS.
The "idea" you described is pretty much what everybody has been trying for years already.

PredictionIO for Content Recommendation e.g. Tweets

I recently installed PredictionIO.
What I'd like to achieve is: I'd like to categorize content on the words included in the text. But how can I import data like raw Tweets to PredictionIO? Is it possible to let PredictionIO run over the content and find strong words and to sort them in categories?
What I'd like to get is something like this: Query for Boston Red Sox --> keywords that should appear would be: baseball, Boston, sports, ...
So I'll add on a little to what Thomas said. He's right, it all depends whether or not you have labels associated to your tweets. If your data is labeled then this will be a Text Classification problem. Look at this for more detailed info:
If you're instead looking to cluster, or group, a set of unlabeled observations then, as Thomas said, your best bet is to incorporate LDA into the works. Look at the latter documentation for more information, but basically once you run the LDA model you'll obtain an object of type DistributedLDAModel which has a method topicDistributions which gives you, for each tweet, a vector where each component is associated to a topic, and the component entry gives you the probability that the tweet belongs to that topic. You can cluster by assigning each tweet the topic with highest probability.
You also have access to a matrix of size MxN, where M is the number of words in your vocabulary, and N is the number of topics, or clusters, you wish to discover in your data. You can roughly interpret the ij-th entry of this Topics Matrix as the probability that word i appears in a document given that the document belongs to topic j. Another rule you could use for clustering is to treat each word vector associated with your tweets as a vector of counts. Then, you can interpret the ij-th entry of the product of your word matrix (tweets as rows, words as columns) and the Topics Matrix returned by LDA as a score for how strongly tweet i belongs to topic j, which can be read as a probability under certain assumptions (feel free to ask if you want more details). Again, you now assign tweet i to the topic associated with the largest numerical value in row i of the resulting matrix. You can even use this clustering rule for assigning topics to incoming observations once you have used your original set of tweets for topic discovery!
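Here is a toy sketch of that second rule in plain Python, independent of PredictionIO or Spark. The matrices are made-up numbers for a 3-word vocabulary and 2 topics, and the function names are mine. The products are unnormalized scores, so only the position of the largest value in each row is used.

```python
def matmul(a, b):
    """Multiply matrix a (rows x inner) by matrix b (inner x cols)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def assign_topics(word_counts, topics_matrix):
    """For each tweet (row of counts), return the index of the highest-scoring topic."""
    scores = matmul(word_counts, topics_matrix)
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

# Made-up Topics Matrix: rows are words, columns are topics.
topics_matrix = [
    [0.7, 0.1],
    [0.2, 0.2],
    [0.1, 0.7],
]
# Made-up word counts: rows are tweets, columns are words.
word_counts = [
    [2, 1, 0],
    [0, 1, 3],
]
topic_per_tweet = assign_topics(word_counts, topics_matrix)
```

New tweets can be scored the same way: build their count vector over the same vocabulary and multiply by the Topics Matrix learned from the original set.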
Now, for data processing, you can still use the Text Classification reference for transforming your Tweets to word count vectors via the DataSource and Preparator components. As for importing your data, if you have the tweets saved locally on a file, you can use PredictionIO's Python SDK to import your data. An example is also given in the classification reference.
Feel free to ask any questions if anything isn't clear, and good luck!
So, it really depends on whether you have labelled data.
For example:
Baseball :: "I love Boston Red Sox #GoRedSox"
Sports :: "Woohoo! I love sports #winning"
Boston :: "Baseball time at Fenway Park. Red Sox FTW!"
Then you would be able to train a model to classify Tweets against these keywords. You might be interested in the templates for MLlib Naive Bayes and Decision Trees.
If you don't have labelled data (really, who wants to manually label Tweets) you might be able to use approaches such as Topic Modeling (e.g., LDA).
I don't think there is a template for LDA but being an active open source project it wouldn't surprise me if someone has already implemented this so might be a good idea to ask on PredictionIO user or dev forums.

Distance dependent Chinese Restaurant Process maybe

I'm new to machine learning and want to implement the distance dependent Chinese Restaurant process in MATLAB for the clustering of audio tracks.
I'm looking to use the dd-CRP on 26 features. I'm guessing the process might go like this
Read in 1st feature vector and assign it a "table"
Read in 2nd feature vector and compare it to the 1st "table", maybe using the cosine angle (due to high dimension) of the two vectors, and if it agrees within some defined theta, join that table, else start a new one.
Read in next feature and repeat step 2 for the new feature vector for each existing table.
While this is occurring, I will be keeping track of how many tables there are.
I will be running the algorithm over, say, 16 audio tracks. The way the audio will be fed into the algorithm is that the first feature vector will be from the first frame of audio track 1, the second feature vector from the first frame of track 2, etc., as I'm trying to find out which audio tracks like to cluster together most, but I don't want to define how many centroids there are. Obviously I'll have to keep track of which audio track is at which "table".
Does this make sense?
This is not a Chinese Restaurant Process. This is a heuristic algorithm which has some similarity to a Chinese Restaurant Process. In a CRP everything is phrased in terms of priors over the assignments of items to clusters (the tables analogy), and these are combined with a likelihood function for each cluster (which formalises the similarity function you described). Inference is then done by Gibbs Sampling, which means non-deterministically sampling which cluster each track is assigned to in turn given all the other assignments. Variational methods for non-parametrics are still in a very preliminary state.
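For contrast with the threshold heuristic, here is a minimal sketch of drawing table assignments from the plain CRP prior, written in Python rather than MATLAB to keep it short. It deliberately leaves out everything else mentioned above: there is no likelihood term, no Gibbs sampling over existing assignments, and no distance dependence. The function name and the concentration parameter `alpha` are standard notation but are my additions, not something from the question.

```python
import random

def sample_crp(num_customers, alpha, rng):
    """Draw table assignments for num_customers from a CRP prior with concentration alpha."""
    table_sizes = []   # table_sizes[k] = number of customers currently at table k
    assignments = []
    for _ in range(num_customers):
        # An existing table k is chosen with weight n_k, a new table with weight alpha.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[table] += 1    # join an existing table
        assignments.append(table)
    return assignments

assignments = sample_crp(16, 1.0, random.Random(0))
```

The point to notice is that assignment is random and driven by table sizes, not by a similarity threshold. In a full model these prior weights would be multiplied by a per-cluster likelihood of the feature vector before sampling.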
Why do you want to use a CRP? Do you think you'll get something out of it beyond more conventional clustering methods? The bar to entry for the implementation and proper understanding of non-parametrics is pretty high, and they're often of little practical use at the moment because of the constraints on inference I mentioned.
You can use the X-means algorithm, which automatically determines the number of centroids (and hence the number of clusters) based on the Bayesian Information Criterion (BIC). In short, the algorithm looks at how dense each cluster is and how far each cluster is from the others.
