How to get all documents per topic in bertopic modeling - nlp

I have a dataset and trying to convert it to topics using berTopic modeling but the problem is, i cant get all the docoments of a topic. berTopic is only return 3 docoments per topic.
topic_model = BERTopic(verbose=True, embedding_model=embedding_model,
nr_topics = 'auto',
n_gram_range = (3,3),
top_n_words = 10,
calculate_probabilities=True,
seed_topic_list = topic_list,
)
topics, probs = topic_model.fit_transform(docs_test)
representative_doc = topic_model.get_representative_docs(topic#1)
representative_doc
this topic contain more then 300 documents but bertopic only shows 3 of them with .get_representative_docs

There are probably solutions that are more elegant because I am not an expert, but I can share what worked for me (as there are no answers yet):
"topics, probs = topic_model.fit_transform(docs_test)" returns the topics.
Therefore, you can combine this output and the documents.
For example, combine them into a (pandas.)dataframe using
df = pd.DataFrame({'topic': topics, 'document': docs_test})
Now you can filter this dataframe for each topic to identify the referring documents.
topic_0 = df[df.topic == 0]

There is an API from BERTopic get_document_info() which returns the dataframe for each document and associated topic for it. https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.get_document_info
The response from this API is shown below:
index
Document
Topic
Name
...
0
doc1_text
241
kw1_kw2_
...
1
doc2_text
-1
kw1_kw2_
...
You can use this dataframe to get all the documents associated for a particular topic using pandas groupby or however you prefer.
T = topic_model.get_document_info(docs)
docs_per_topics = T.groupby(["Topic"]).apply(lambda x: x.index).to_dict()
The code returns a dictionary shown as below:
{
-1: Int64Index([3,10,11,12,15,16,18,19,20,22,...365000], dtype='int64',length=149232),
0: Int64Index([907,1281,1335,1337,...308420,308560,308645],dtype='int64',length=5127),
...
}

Related

Map BERTopic topic IDs back to the training dataframe

I have trained a BERTopic model on a dataframe of length of 400k. I want to map the topics of each document in a new column inside the dataframe. I could do that by running a for loop on all the documents and do topic_model.transform(doc) on them. The only problem is, it takes more than a second to transform each document into its topic and it would take days for the whole dataset.
Is there a way to achieve this faster since I want to map the topics on the training data.
I tried:
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.reduce_topics(docs, nr_topics=200)
topics = []
for text in df.texts:
tops = topic_model.transform(text)
topics.append(tops)
df['topics'] = topics
There is no need to recalculate the topics as you already retrieved them when using .fit_transform. There, the topics that you retrieve are in the exact same order as the input documents. Therefore, you can perform the following:
# The `topics` that you get here are in the exact same order as `docs`
# `topics[0]` belongs to `docs[0]`, `topics[1]` to `docs[1]`, etc.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.reduce_topics(docs, nr_topics=200)
# When you used `.fit_transform`:
df = pd.DataFrame({"Document": docs, "Topic": topic})
For those using .fit instead of .fit_transform, you can also access the topics and their documents as follows:
# When you used `.fit`:
df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})
From the source code, the transform() function of the BERTopic class is able accept a list of documents -- so you don't need to loop over your dataframe calling transform() multiple times for each document.
Secondly, it seems that if you don't pass your pre-trained document embeddings to the transform() function, embeddings will be set to None and you'll be calling _extract_embeddings() every single time which is likely what is causing the poor performance. The solution is to pass the embeddings to your transform() call. In the dummy example shown below, this improves speed of classification of 1,000 documents by approx. 1,555x (68.43 vs 0.044 seconds).
Example
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
import random
import pandas as pd
# Create dummy data
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
random.seed(756)
training_docs = random.sample(docs, 1000)
testing_docs = random.sample(docs, 1000)
# Instantiate and fit topic model to training docs
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(training_docs, show_progress_bar=True)
topic_model = BERTopic().fit(training_docs, embeddings)
topic_model.reduce_topics(training_docs, nr_topics=5) # Reduce num of topics, default = 20
# Determine topics on testing docs
topics, probs = topic_model.transform(testing_docs, embeddings)
# topics, probs = topic_model.transform(testing_docs) # ~1,555x slower
df = pd.DataFrame({"docs": testing_docs, "topics": topics})
print(df)
print(topic_model.get_topic_info())

AWS Glue dynamic frames not getting updated

We are currrently facing an issue where we cannot insert more than 600K records in oracle db using AWS glue. We are getting connection reset error and DBA's are currently looking into it. As a temporary solution we thought of adding data in chunks by splitting a dataframe into multiple dataframe and looping this list of dataframe to add data. We are sure that splitting algorithm works fine and here is the code we use
def split_by_row_index(df, num_partitions=10):
# Let's assume you don't have a row_id column that has the row order
t = df.withColumn('_row_id', monotonically_increasing_id())
# Using ntile() because monotonically_increasing_id is discontinuous across partitions
t = t.withColumn('_partition', ntile(num_partitions).over(Window.orderBy(t._row_id)))
return [t.filter(t._partition == i + 1) for i in range(num_partitions)]
Here each DF have unique data but somehow when we convert this df in dynamic frame in loop it is we are getting common data in each dynamic frame. here is small snippet for this example
df_trns_details_list = split_by_row_index(df_trns_details, int(df_trns_details.count() / 100000))
trnsDetails1 = DynamicFrame.fromDF(df_trns_details_list[0], glueContext, "trnsDetails1")
trnsDetails2 = DynamicFrame.fromDF(df_trns_details_list[1], glueContext, "trnsDetails2")
print(df_trns_details_list[0].count())# counts are same
print(trnsDetails1.count())
print('-------------------------------')
print(df_trns_details_list[1].count()) # counts are same
print(trnsDetails2.count())
print('-------------------------------')
subDf1 = trnsDetails1.toDF().select(col("id"), col("details_id"))
subDf2 = trnsDetails2.toDF().select(col("id"), col("details_id"))
common = subDf1.intersect(subDf2)
# ------------------ common data exists----------------
print(common.count())
subDf3 = df_trns_details_list[0].select(col("id"), col("details_id"))
subDf4 = df_trns_details_list[1].select(col("id"), col("details_id"))
#------------------0 common data----------------
common1 = subDf3.intersect(subDf4)
print(common1.count())
here Id and details_id combination will be unique
We used this logic in multiple areas where it worked not sure why this is happening.
We are also quite new to Python and AWS Glue so any suggestion to improve it also welcomed. Thanks

Latent Dirichlet allocation (LDA) in Spark - replicate model

I want to save the LDA model from pyspark ml-clustering package and apply the model to the training & test data-set after saving. However results diverge despite setting a seed. My code is the following:
1) Import packages
from pyspark.ml.clustering import LocalLDAModel, DistributedLDAModel
from pyspark.ml.feature import CountVectorizer , IDF
2) Preparing the dataset
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus = result_tfidf.select("id", "features")
3) Training the LDA model
lda = LDA(k=number_of_topics, maxIter=100, docConcentration = [alpha], topicConcentration = beta, seed = 123)
model = lda.fit(corpus)
model.save("LDA_model_saved")
topics = model.describeTopics(words_in_topic)
topics_rdd = topics.rdd
modelled_corpus = model.transform(corpus)
4) Replicate the model
#Prepare the data set
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus_new = result_tfidf.select("id", "features")
#Load the model to apply to new corpus
newModel = LocalLDAModel.load("LDA_model_saved")
topics_new = newModel.describeTopics(words_in_topic)
topics_rdd_new = topics_new.rdd
modelled_corpus_new = newModel.transform(corpus_new)
The following results are different despite my assumption to be equal:
topics_rdd != topics_rdd_new and modelled_corpus != modelled_corpus_new (also when inspecting the extracted topics they are different as well as the predicted classes on the dataset)
So I find it really strange that the same model predicts different classes ("topics") on the same dataset, even though I set a seed in the model generation. Can someone with experience in replicating LDA models help?
Thank you :)
I was facing similar kind of problem while implementing LDA in PYSPARK. Even though I was using seed, every time I re run the code on the same data with same parameters, results were different.
I came up with below solution after trying multitude of things:
Saved cv_model after running it once and loaded it in next iterations rather then re-fitting it.
This is more related to my data set. The size of some of the documents in the corpus that i was using was very small (around 3 words per document). I filtered out these documents and set a limit , such that only those documents will be included in corpus that have minimum 15 words (may be higher in yours). I am not sure why this one worked, may be something related underline complexity of model.
All in all now my results are same even after several iterations. Hope this helps.

Cartesian product error on ALS recommendation

I'm trying to show a list of movie recommendations for an user. The model has been trained but when trying to show the prediction, I'm getting an error.
als = ALS(maxIter=5, regParam=0.01, userCol="userID",
itemCol="movieID", ratingCol="rating")
# ratings is a DataFrame of (movieID, rating, userID)
model = als.fit(ratings)
# allMovies is a DataFrame of (movieID, userID)
# it has userID=0 and all distinct movieID
recommendations = model.transform(allMovies)
recommendations.take(20)
Using the from pyspark.ml.recommendation.ALS library and
when running the last line, I'm getting the error
Detected cartesian product for LEFT OUTER join between logical plans.
Why is this happening? Thanks!
For answering my own question. It seems that you shouldn't use transform but the recommendForUserSubset method.
Before model.transform you have to define ALS something like ALS(input_col = 'something like input feature',output_col = predicted_rating) or you can have this way to work possibly.
rank = 10
numIterations = 100
model = ALS.train(ratings, rank, numIterations) #where ratings is dataframe
recommendation = model.predictAll(alMovies).map(lambda r: ((r[0], r[1]), r[2]))
Hope this helps.

How can I access computed metrics for each fold in a CrossValidatorModel

How can I get the computed metrics for each fold from a CrossValidatorModel in spark.ml? I know I can get the average metrics using model.avgMetrics but is it possible to get the raw results on each fold to look at eg. the variance of the results?
I am using Spark 2.0.0.
Studying the spark code here
For the folds, you can do the iteration yourself like this:
val splits = MLUtils.kFold(dataset.toDF.rdd, $(numFolds), $(seed))
//K-folding operation starting
//for each fold you have multiple models created cfm. the paramgrid
splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
val trainingDataset = sparkSession.createDataFrame(training, schema).cache()
val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
trainingDataset.unpersist()
var i = 0
while (i < numModels) {
val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
metrics(i) += metric
i += 1
}
This is in scala, but the ideas are very clearly outlined.
Take a look at this answer that outlines results per fold. Hope this helps.

Resources