Which clustering algorithm should I choose - python-3.x

I am seeking a recommendation for a clustering algorithm. I'm trying to cluster items from a store inventory (35,000 items) based on their descriptions (string format).
After the text pre-processing phase I proceed as follows:
I tokenize the text (word_tokenize from nltk.tokenize);
I create a dictionary (gensim.corpora.Dictionary);
From the dictionary I create a corpus with doc2bow;
Then I create a TfidfModel with gensim.models from corpus;
Finally I create a similarity matrix with gensim Similarity;
When I try to run a clustering model (AgglomerativeClustering) on the similarity matrix, it runs for hours and does not finish. The matrix's dimensions are 35k x 35k floats.
Is there another approach to this clustering problem that avoids the curse of dimensionality?
Thanks.
Example code:
import gc
import gensim
import numpy as np
from nltk.tokenize import word_tokenize
from sklearn.cluster import AgglomerativeClustering

# Pre-process the product descriptions, then tokenize and build the corpus
data['Product'] = data.apply(lambda row: preprocess_text_prod(row['Product']), axis=1)
gen_docs = [[w for w in word_tokenize(text)] for text in data['Product']]
dictionary = gensim.corpora.Dictionary(gen_docs)
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]

# TF-IDF model and pairwise similarity matrix
tf_idf = gensim.models.TfidfModel(corpus)
sims_description = gensim.similarities.Similarity('./data/', tf_idf[corpus], num_features=len(dictionary))
sims = np.multiply(sims_factors, sims_description)  # please ignore this line
del sims_factors, sims_description
gc.collect()

# Agglomerative clustering on the 35k x 35k similarity matrix
clustering = AgglomerativeClustering(n_clusters=None,
                                     compute_full_tree=True,
                                     distance_threshold=0.25).fit(sims)
z = clustering.labels_

Related

Map BERTopic topic IDs back to the training dataframe

I have trained a BERTopic model on a dataframe of length 400k. I want to map the topic of each document to a new column in the dataframe. I could do that by looping over all the documents and calling topic_model.transform(doc) on each, but it takes more than a second to transform each document, which would take days for the whole dataset.
Is there a way to achieve this faster, since I want to map the topics onto the training data?
I tried:
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.reduce_topics(docs, nr_topics=200)

topics = []
for text in df.texts:
    tops = topic_model.transform(text)
    topics.append(tops)
df['topics'] = topics
There is no need to recalculate the topics as you already retrieved them when using .fit_transform. There, the topics that you retrieve are in the exact same order as the input documents. Therefore, you can perform the following:
# The `topics` that you get here are in the exact same order as `docs`
# `topics[0]` belongs to `docs[0]`, `topics[1]` to `docs[1]`, etc.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.reduce_topics(docs, nr_topics=200)
# When you used `.fit_transform`:
df = pd.DataFrame({"Document": docs, "Topic": topics})
For those using .fit instead of .fit_transform, you can also access the topics and their documents as follows:
# When you used `.fit`:
df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})
From the source code, the transform() function of the BERTopic class is able to accept a list of documents -- so you don't need to loop over your dataframe calling transform() separately for each document.
Secondly, it seems that if you don't pass your pre-computed document embeddings to the transform() function, embeddings will be set to None and _extract_embeddings() will be called on every single call, which is likely what is causing the poor performance. The solution is to pass the embeddings to your transform() call. In the dummy example shown below, this improves the speed of classifying 1,000 documents by approx. 1,555x (68.43 vs 0.044 seconds).
Example
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
import random
import pandas as pd

# Create dummy data
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
random.seed(756)
training_docs = random.sample(docs, 1000)
testing_docs = random.sample(docs, 1000)

# Instantiate and fit topic model to training docs
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
train_embeddings = sentence_model.encode(training_docs, show_progress_bar=True)
topic_model = BERTopic().fit(training_docs, train_embeddings)
topic_model.reduce_topics(training_docs, nr_topics=5)  # Reduce num of topics, default = 20

# Determine topics on testing docs, passing their pre-computed embeddings
test_embeddings = sentence_model.encode(testing_docs, show_progress_bar=True)
topics, probs = topic_model.transform(testing_docs, test_embeddings)
# topics, probs = topic_model.transform(testing_docs)  # ~1,555x slower

df = pd.DataFrame({"docs": testing_docs, "topics": topics})
print(df)
print(topic_model.get_topic_info())

Latent Dirichlet allocation (LDA) in Spark - replicate model

I want to save the LDA model from the pyspark ml-clustering package and apply the model to the training & test datasets after saving. However, the results diverge despite setting a seed. My code is the following:
1) Import packages
from pyspark.ml.clustering import LDA, LocalLDAModel, DistributedLDAModel
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.sql.functions import monotonically_increasing_id
2) Preparing the dataset
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus = result_tfidf.select("id", "features")
3) Training the LDA model
lda = LDA(k=number_of_topics, maxIter=100, docConcentration = [alpha], topicConcentration = beta, seed = 123)
model = lda.fit(corpus)
model.save("LDA_model_saved")
topics = model.describeTopics(words_in_topic)
topics_rdd = topics.rdd
modelled_corpus = model.transform(corpus)
4) Replicate the model
#Prepare the data set
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus_new = result_tfidf.select("id", "features")
#Load the model to apply to new corpus
newModel = LocalLDAModel.load("LDA_model_saved")
topics_new = newModel.describeTopics(words_in_topic)
topics_rdd_new = topics_new.rdd
modelled_corpus_new = newModel.transform(corpus_new)
The following results differ, although I expected them to be equal:
topics_rdd != topics_rdd_new and modelled_corpus != modelled_corpus_new (when inspecting the extracted topics they are different, as are the predicted classes on the dataset)
So I find it really strange that the same model predicts different classes ("topics") on the same dataset, even though I set a seed in the model generation. Can someone with experience in replicating LDA models help?
Thank you :)
I was facing a similar kind of problem while implementing LDA in PySpark. Even though I was using a seed, every time I re-ran the code on the same data with the same parameters, the results were different.
I came up with the solution below after trying a multitude of things:
1) Saved cv_model after running it once and loaded it in subsequent iterations rather than re-fitting it.
2) This one is more related to my dataset. Some of the documents in my corpus were very small (around 3 words per document). I filtered these out and set a limit such that only documents with a minimum of 15 words (it may be higher in your case) were included in the corpus. I am not sure why this worked; maybe it is something related to the underlying complexity of the model.
All in all, my results are now the same even after several iterations; a minimal sketch of both points follows below. Hope this helps.
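A sketch of both points, assuming the column names from the question above (the save path and the 15-word threshold are illustrative):
import pyspark.sql.functions as F
from pyspark.ml.feature import CountVectorizer, CountVectorizerModel

# Point 2: keep only documents with at least 15 tokens (threshold is data-dependent)
filtered_df = tokenized_stopwords_sample_df.filter(
    F.size("requester_instruction_words_filtered_complete") >= 15
)

# Point 1, first run: fit the vectorizer once and persist it
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete",
                               outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(filtered_df)
cv_model.save("cv_model_saved")

# Point 1, later runs: load the fitted vocabulary instead of re-fitting it
cv_model = CountVectorizerModel.load("cv_model_saved")
result_tf = cv_model.transform(filtered_df)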

One Hot Encoding a composite field

I want to transform multiple columns with the same categorical values using a OneHotEncoder. I created a composite field and tried to use OneHotEncoder on it as below (items 1-3 are from the same list of items):
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer, OneHotEncoder

def myConcat(*cols):
    return F.concat(*[F.coalesce(c, F.lit("*")) for c in cols])

df = df.withColumn("basket", myConcat("item1", "item2", "item3"))
indexer = StringIndexer(inputCol="basket", outputCol="basketIndex")
indexed = indexer.fit(df).transform(df)
encoder = OneHotEncoder(inputCol="basketIndex", outputCol="basketVec")
encoded = encoder.transform(indexed)
I am getting an out of memory error.
Does this approach work? How do I one hot encode a composite field or multiple columns with categorical values from same list?
If you have an array of categorical values, why not try CountVectorizer:
import pyspark.sql.functions as F
from pyspark.ml.feature import CountVectorizer

# CountVectorizer expects an array column, so build the basket as an array
df = df.withColumn("basket", F.array("item1", "item2", "item3"))
indexer = CountVectorizer(inputCol="basket", outputCol="basketIndex")
indexed = indexer.fit(df).transform(df)
Note: I can't comment yet (I'm a new user).
What is the cardinality of your "item1", "item2" and "item3" columns?
More specifically, what values does the following code print?
# df here is a Spark DataFrame; with pandas you would use df.item1.nunique() etc.
k1 = df.select("item1").distinct().count()
k2 = df.select("item2").distinct().count()
k3 = df.select("item3").distinct().count()
k = k1 * k2 * k3
print(k1, k2, k3, k)
One-hot encoding basically creates a very sparse matrix with the same number of rows as your original dataframe and k additional columns, where k is the product of the three cardinalities printed above.
Therefore, if your three cardinalities are large, you get an out-of-memory error.
The only solutions are to:
(1) increase your memory, or
(2) introduce a hierarchy among the categories and use the higher-level categories to limit k (a sketch follows below).
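A minimal sketch of option (2), assuming a hypothetical item_to_category lookup that maps each fine-grained item to a coarser category (the mapping and column names are illustrative, not from the question):
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Hypothetical lookup from fine-grained items to a handful of coarse categories
item_to_category = {"milk": "dairy", "cheese": "dairy", "apple": "produce"}
category_map = F.create_map(*[F.lit(x) for kv in item_to_category.items() for x in kv])

# Encode the coarse category instead of the raw item, which keeps k small
df = df.withColumn("item1_cat", category_map[F.col("item1")])
indexer = StringIndexer(inputCol="item1_cat", outputCol="item1CatIndex")
indexed = indexer.fit(df).transform(df)
encoder = OneHotEncoder(inputCol="item1CatIndex", outputCol="item1CatVec")
encoded = encoder.transform(indexed)  # in Spark 3+, OneHotEncoder is an Estimator: encoder.fit(indexed).transform(indexed)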

Reconstructing k-means using pre-computed cluster centres

I'm using k-means clustering with the number of clusters set to 60. Since some of the clusters came out meaningless, I deleted those cluster centers from the cluster center array (count = 8) and saved the rest in clean_cluster_centers.
This time, I'm re-fitting the k-means model with init = clean_cluster_centers, n_clusters = 52 and max_iter = 1, because I want to avoid re-fitting as much as possible.
The basic idea is to recreate a new model from clean_cluster_centers. The problem is that, since we are removing a large number of clusters, the model quickly converges to new, more stable centers even with max_iter = 1. Is there any way to recreate the k-means model?
If you've fitted a KMeans object, it has a cluster_centers_ attribute. You can directly update it by doing something like this:
cls.cluster_centers_ = new_cluster_centers
So if you want a new object with the clean cluster centers, just do something like the following:
import copy

cls = KMeans().fit(X)
cls2 = copy.deepcopy(cls)  # KMeans has no .copy() method, so deep-copy the fitted object
cls2.cluster_centers_ = new_cluster_centers
And now, since the predict function only checks that your object has a non-null attribute called cluster_centers_, you can use the predict function:
def predict(self, X):
    """Predict the closest cluster each sample in X belongs to.

    In the vector quantization literature, `cluster_centers_` is called
    the code book and each value returned by `predict` is the index of
    the closest code in the code book.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        New data to predict.

    Returns
    -------
    labels : array, shape [n_samples,]
        Index of the cluster each sample belongs to.
    """
    check_is_fitted(self, 'cluster_centers_')
    X = self._check_test_data(X)
    x_squared_norms = row_norms(X, squared=True)
    return _labels_inertia(X, x_squared_norms, self.cluster_centers_)[0]
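Putting the pieces together, a minimal end-to-end sketch (X and bad_ids, the indices of the 8 meaningless clusters, are hypothetical placeholders; recent scikit-learn versions likewise only read cluster_centers_ in predict, but since this relies on internals, verify against your version's source):
import copy
import numpy as np
from sklearn.cluster import KMeans

cls = KMeans(n_clusters=60, random_state=0).fit(X)

# Drop the meaningless centers; bad_ids holds the 8 cluster indices to remove
mask = np.ones(cls.cluster_centers_.shape[0], dtype=bool)
mask[bad_ids] = False
clean_cluster_centers = cls.cluster_centers_[mask]

# Copy the fitted model and overwrite its centers; no re-fitting happens
cls2 = copy.deepcopy(cls)
cls2.cluster_centers_ = clean_cluster_centers

labels = cls2.predict(X)  # each sample goes to the nearest remaining center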

Spark cosine distance between rows using Dataframe

I have to compute the cosine distance between each pair of rows, but I have no idea how to do it elegantly using the Spark DataFrame API. The idea is to compute the similarity of each row (item) to all other rows and take the top 10 similarities for each one. This is needed for an item-item recommender system.
All that I've read about it refers to computing similarity over columns (Apache Spark Python Cosine Similarity over DataFrames).
Can someone say whether it is possible to compute cosine distance between rows elegantly using PySpark's DataFrame API or RDDs, or do I have to do it manually?
That's just some code to show what I intend to do:
from numpy import asarray
from numpy import linalg as LA

def cosineSimilarity(vec1, vec2):
    return vec1.dot(vec2) / (LA.norm(vec1) * LA.norm(vec2))

# p.s. model is ALS
Pred_Factors = model.itemFactors.cache()  # DataFrame[id: int, features: array<float>]

sims = []
for _id, _feature in Pred_Factors.toLocalIterator():
    for id_, feature in Pred_Factors.toLocalIterator():
        sims.append((_id, id_, cosineSimilarity(asarray(_feature), asarray(feature))))

sims = sc.parallelize(sims)
sortedSims = sims.takeOrdered(10, key=lambda x: -x[2])
Thanks in Advance for all the help
You can use the columnSimilarities() function of pyspark.mllib.linalg.distributed.IndexedRowMatrix. It uses the cosine metric as its distance function. It computes similarities between columns, so you have to transpose the matrix before applying this function.
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

pred_ = IndexedRowMatrix(Pred_Factors.rdd.map(lambda x: IndexedRow(x[0], x[1]))).toBlockMatrix().transpose().toIndexedRowMatrix()
pred_sims = pred_.columnSimilarities()
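pred_sims is a CoordinateMatrix of pairwise cosine similarities. A sketch (assuming an active SparkSession) of one way to pull the top 10 neighbours per item out of it:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Flatten the MatrixEntry RDD into a DataFrame of (i, j, similarity).
# columnSimilarities() returns only the upper triangle (i < j), so mirror
# the entries to get the full neighbour list of every item.
entries = pred_sims.entries.map(lambda e: (e.i, e.j, e.value)).toDF(["i", "j", "similarity"])
mirrored = entries.union(entries.select(F.col("j").alias("i"), F.col("i").alias("j"), "similarity"))

# Rank the neighbours of each item by similarity and keep the top 10
w = Window.partitionBy("i").orderBy(F.desc("similarity"))
top10 = mirrored.withColumn("rank", F.row_number().over(w)).filter(F.col("rank") <= 10)
top10.show()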
