Map BERTopic topic IDs back to the training dataframe - python-3.x

I have trained a BERTopic model on a dataframe of 400k documents. I want to map the topic of each document into a new column of the dataframe. I could do that by looping over all the documents and calling topic_model.transform(doc) on each one. The only problem is that it takes more than a second to transform each document, so the whole dataset would take days.
Is there a faster way to achieve this, given that I only want to map the topics onto the training data?
I tried:
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.reduce_topics(docs, nr_topics=200)
topics = []
for text in df.texts:
    tops = topic_model.transform(text)
    topics.append(tops)
df['topics'] = topics

There is no need to recalculate the topics as you already retrieved them when using .fit_transform. There, the topics that you retrieve are in the exact same order as the input documents. Therefore, you can perform the following:
# The `topics` that you get here are in the exact same order as `docs`
# `topics[0]` belongs to `docs[0]`, `topics[1]` to `docs[1]`, etc.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.reduce_topics(docs, nr_topics=200)
# When you used `.fit_transform`:
df = pd.DataFrame({"Document": docs, "Topic": topics})
For those using .fit instead of .fit_transform, you can also access the topics and their documents as follows:
# When you used `.fit`:
df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})
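If the documents came from an existing dataframe, the same alignment lets you attach the topics as a new column. A minimal sketch, assuming `docs` was built from `df["texts"]` in row order; the hard-coded `topics` list stands in for `topic_model.topics_`, and `attach_topics` is a hypothetical helper name:

```python
import pandas as pd

def attach_topics(df, topics, column="topic"):
    """Return a copy of df with one topic ID per row, in row order."""
    if len(topics) != len(df):
        raise ValueError(f"expected {len(df)} topics, got {len(topics)}")
    out = df.copy()
    # Assign positionally so a non-default index cannot misalign the values
    out[column] = list(topics)
    return out

# Stand-in data: in practice `topics` would be `topic_model.topics_`
df = pd.DataFrame({"texts": ["doc a", "doc b", "doc c"]}, index=[10, 20, 30])
topics = [2, -1, 0]
df_with_topics = attach_topics(df, topics)
print(df_with_topics)
```

The length check catches the common case where rows were dropped or filtered between building `docs` and fitting the model.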

From the source code, the transform() function of the BERTopic class is able to accept a list of documents, so you don't need to loop over your dataframe and call transform() once per document.
Secondly, it seems that if you don't pass your pre-trained document embeddings to the transform() function, embeddings will be set to None and you'll be calling _extract_embeddings() every single time which is likely what is causing the poor performance. The solution is to pass the embeddings to your transform() call. In the dummy example shown below, this improves speed of classification of 1,000 documents by approx. 1,555x (68.43 vs 0.044 seconds).
Example
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
import random
import pandas as pd
# Create dummy data
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
random.seed(756)
training_docs = random.sample(docs, 1000)
testing_docs = random.sample(docs, 1000)
# Instantiate and fit topic model to training docs
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(training_docs, show_progress_bar=True)
topic_model = BERTopic().fit(training_docs, embeddings)
topic_model.reduce_topics(training_docs, nr_topics=5) # Reduce num of topics, default = 20
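# Determine topics on testing docs; the embeddings passed to transform()
# must be those of the documents being transformed, not the training ones
testing_embeddings = sentence_model.encode(testing_docs, show_progress_bar=True)
topics, probs = topic_model.transform(testing_docs, testing_embeddings)
# topics, probs = topic_model.transform(testing_docs) # ~1,555x slower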
df = pd.DataFrame({"docs": testing_docs, "topics": topics})
print(df)
print(topic_model.get_topic_info())
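For a corpus of several hundred thousand documents, the same idea can be applied in batches, so that transform() is called once per batch instead of once per document. A rough sketch, assuming any model object whose `transform(docs, embeddings)` returns a `(topics, probs)` pair as BERTopic's does; `transform_in_batches` and the `StubModel` used to exercise it are hypothetical names, not part of BERTopic:

```python
def transform_in_batches(model, docs, embeddings, batch_size=1000):
    """Call model.transform once per batch instead of once per document."""
    if len(docs) != len(embeddings):
        raise ValueError("docs and embeddings must have the same length")
    all_topics = []
    for start in range(0, len(docs), batch_size):
        end = start + batch_size
        # Each batch's embeddings must belong to that batch's documents
        topics, _probs = model.transform(docs[start:end], embeddings[start:end])
        all_topics.extend(topics)
    return all_topics


class StubModel:
    """Hypothetical stand-in for a fitted topic model (illustration only)."""

    def __init__(self):
        self.calls = 0

    def transform(self, docs, embeddings):
        self.calls += 1
        # Pretend the topic is the index of the largest embedding component
        topics = [max(range(len(e)), key=e.__getitem__) for e in embeddings]
        return topics, None


model = StubModel()
docs = ["a", "b", "c", "d", "e"]
embeddings = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9], [0.7, 0.3]]
topics = transform_in_batches(model, docs, embeddings, batch_size=2)
print(topics, model.calls)
```

With a real model, `embeddings` would typically be the array returned by `sentence_model.encode(...)`, which can be sliced in the same way.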

Related

In PyTorch, how can I shuffle a DataLoader?

I have a dataset with 10000 samples, where the classes are present in an ordered manner. First I loaded the data into an ImageFolder, then into a DataLoader, and I want to split this dataset into a train-val-test set. I know the DataLoader class has a shuffle parameter, but that's no good for me, because it only shuffles the data when the loader is iterated. I know about RandomSampler, but with it I can only take n samples randomly from the dataset, and I have no control over what is taken out, so one sample might be present in the train, test and val sets at the same time.
Is there a way to shuffle the data in a DataLoader? The only thing I need is the shuffle; after that I can subset the data.
The Subset dataset class takes indices (https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset). You can probably exploit that to get this functionality, as below. Essentially, you can get away with shuffling the indices and then picking subsets of the dataset.
import numpy
from torch.utils.data import Subset
# suppose dataset is the variable pointing to the whole dataset
N = len(dataset)
# generate & shuffle indices
indices = numpy.arange(N)
indices = numpy.random.permutation(indices)
# there are many ways to do the above two operations (for example, numpy.random.choice can be used here too)
# select train/val/test; for demo I am using 70/15/15
train_indices = indices[:int(0.7*N)]
val_indices = indices[int(0.7*N):int(0.85*N)]
test_indices = indices[int(0.85*N):]
train_dataset = Subset(dataset, train_indices)
val_dataset = Subset(dataset, val_indices)
test_dataset = Subset(dataset, test_indices)
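Because each index appears exactly once in the permutation, the three subsets cannot share a sample. The sketch below shows that property using only the standard library; `split_indices` is a hypothetical helper, and the resulting lists could be passed to `Subset` in the same way as the NumPy arrays above:

```python
import random

def split_indices(n, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle range(n) once and cut it into train/val/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split
    train_end = int(train_frac * n)
    val_end = int((train_frac + val_frac) * n)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]

train_idx, val_idx, test_idx = split_indices(10000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed means the same split can be rebuilt later, for example when evaluating a saved model on the held-out test indices.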

How to get all documents per topic in bertopic modeling

I have a dataset and am trying to convert it to topics using BERTopic, but the problem is that I can't get all the documents of a topic. BERTopic only returns 3 documents per topic.
topic_model = BERTopic(verbose=True, embedding_model=embedding_model,
                       nr_topics='auto',
                       n_gram_range=(3, 3),
                       top_n_words=10,
                       calculate_probabilities=True,
                       seed_topic_list=topic_list,
                       )
topics, probs = topic_model.fit_transform(docs_test)
representative_doc = topic_model.get_representative_docs(1)  # topic #1
representative_doc
This topic contains more than 300 documents, but BERTopic only shows 3 of them with .get_representative_docs.
There are probably more elegant solutions because I am not an expert, but I can share what worked for me (as there are no answers yet):
topics, probs = topic_model.fit_transform(docs_test) returns the topics, one per document and in the same order as docs_test.
Therefore, you can combine this output with the documents.
For example, combine them into a pandas dataframe using
df = pd.DataFrame({'topic': topics, 'document': docs_test})
Now you can filter this dataframe by topic to find the documents assigned to it.
topic_0 = df[df.topic == 0]
BERTopic also provides get_document_info(), which returns a dataframe with one row per document and the topic assigned to it. https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.get_document_info
The returned dataframe looks like this:
index  Document   Topic  Name      ...
0      doc1_text  241    kw1_kw2_  ...
1      doc2_text  -1     kw1_kw2_  ...
You can use this dataframe to get all the documents associated with a particular topic using pandas groupby, or however you prefer.
T = topic_model.get_document_info(docs)
docs_per_topics = T.groupby(["Topic"]).apply(lambda x: x.index).to_dict()
The code returns a dictionary like the one shown below:
{
-1: Int64Index([3,10,11,12,15,16,18,19,20,22,...365000], dtype='int64',length=149232),
0: Int64Index([907,1281,1335,1337,...308420,308560,308645],dtype='int64',length=5127),
...
}
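If the documents themselves are wanted rather than their row indices, the same grouping can collect the `Document` column. A small sketch on a hand-made dataframe that mimics the `Document` and `Topic` columns of `get_document_info()`; the rows are made up for illustration:

```python
import pandas as pd

# Hand-made stand-in for the output of topic_model.get_document_info(docs)
info = pd.DataFrame({
    "Document": ["doc1_text", "doc2_text", "doc3_text", "doc4_text"],
    "Topic": [0, -1, 0, 1],
})

# Map each topic ID to the list of documents assigned to it
docs_per_topic = info.groupby("Topic")["Document"].apply(list).to_dict()
print(docs_per_topic)
```

Topic -1 is kept as its own key here, so outlier documents can be inspected or dropped explicitly.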

Latent Dirichlet allocation (LDA) in Spark - replicate model

I want to save the LDA model from pyspark ml-clustering package and apply the model to the training & test data-set after saving. However results diverge despite setting a seed. My code is the following:
1) Import packages
from pyspark.ml.clustering import LDA, LocalLDAModel, DistributedLDAModel
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.sql.functions import monotonically_increasing_id
2) Preparing the dataset
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus = result_tfidf.select("id", "features")
3) Training the LDA model
lda = LDA(k=number_of_topics, maxIter=100, docConcentration = [alpha], topicConcentration = beta, seed = 123)
model = lda.fit(corpus)
model.save("LDA_model_saved")
topics = model.describeTopics(words_in_topic)
topics_rdd = topics.rdd
modelled_corpus = model.transform(corpus)
4) Replicate the model
#Prepare the data set
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus_new = result_tfidf.select("id", "features")
#Load the model to apply to new corpus
newModel = LocalLDAModel.load("LDA_model_saved")
topics_new = newModel.describeTopics(words_in_topic)
topics_rdd_new = topics_new.rdd
modelled_corpus_new = newModel.transform(corpus_new)
The following results differ, although I expected them to be equal:
topics_rdd != topics_rdd_new and modelled_corpus != modelled_corpus_new (also when inspecting the extracted topics they are different as well as the predicted classes on the dataset)
So I find it really strange that the same model predicts different classes ("topics") on the same dataset, even though I set a seed in the model generation. Can someone with experience in replicating LDA models help?
Thank you :)
I was facing a similar problem while implementing LDA in PySpark. Even though I was using a seed, every time I re-ran the code on the same data with the same parameters, the results were different.
I came up with the solution below after trying a multitude of things:
1) I saved cv_model after running it once and loaded it in subsequent runs rather than re-fitting it.
2) This one is more related to my dataset. Some of the documents in the corpus I was using were very short (around 3 words per document). I filtered these out and set a limit so that only documents with at least 15 words are included in the corpus (the threshold may be higher for yours). I am not sure why this worked; it may be related to the underlying complexity of the model.
All in all, my results are now the same even after several runs. Hope this helps.
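One plausible reason that saving and reloading cv_model matters: describeTopics() reports topics as term indices, so the LDA model only makes sense together with the exact vocabulary it was trained on. The toy sketch below is plain Python, not Spark, and every name in it is hypothetical; it only assumes that a refitted vectorizer could assign the same terms to different positions:

```python
# Topic-term weights learned against the ORIGINAL vocabulary order
vocab_at_training = ["invoice", "refund", "delivery"]
topic_weights = [0.7, 0.2, 0.1]  # position i belongs to vocab_at_training[i]

def describe(weights, vocab):
    """Pair each weight with the term at the same vocabulary position."""
    return sorted(zip(vocab, weights), key=lambda pair: -pair[1])

# Same terms, but a refit happened to order them differently (assumed)
vocab_after_refit = ["refund", "delivery", "invoice"]

topic_original = describe(topic_weights, vocab_at_training)
topic_misaligned = describe(topic_weights, vocab_after_refit)
print(topic_original)
print(topic_misaligned)
```

The weights are identical in both calls, yet the topic reads differently once the positions point at other terms. Reusing one saved CountVectorizerModel keeps positions and terms in step.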

TensorFlow: extract data with a given feature, from NSynth Dataset

I have a data set of TFRecord files of serialized TensorFlow Example protocol buffers with one Example proto per note, downloaded from https://magenta.tensorflow.org/datasets/nsynth. I am using the test set, which is approximately 1 Gb, in case someone wants to download it, to check the code below. Each Example contains many features: pitch, instrument ...
The code that reads in this data is:
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()
# Reading input data
dataset = tf.data.TFRecordDataset('../data/nsynth-test.tfrecord')
# Convert features into tensors
features = {
    "pitch": tf.FixedLenFeature([1], dtype=tf.int64),
    "audio": tf.FixedLenFeature([64000], dtype=tf.float32),
    "instrument_family": tf.FixedLenFeature([1], dtype=tf.int64)}
parse_function = lambda example_proto: tf.parse_single_example(example_proto,features)
dataset = dataset.map(parse_function)
# Consuming TFRecord data.
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(batch_size=3)
dataset = dataset.repeat()
iterator = dataset.make_one_shot_iterator()
batch = iterator.get_next()
sess.run(batch)
Now, the pitch ranges from 21 to 108. But I want to consider data of a given pitch only, e.g. pitch = 51. How do I extract this "pitch=51" subset from the whole dataset? Or alternatively, what do I do to make my iterator go through this subset only?
What you have looks pretty good; all you're missing is a filter function.
For example, if you only want to extract pitch=51, add the following after your map call:
dataset = dataset.filter(lambda example: tf.equal(example["pitch"][0], 51))

How to iterate TfidfVectorizer() on pandas dataframe

I have a large pandas dataframe with 10 million records of news articles. So, this is how I have applied TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(df['articles'])
It took a lot of time to process all the documents. What I want is to iterate over the articles in the dataframe one at a time. Alternatively, is it possible to pass the documents in chunks so that it keeps updating the existing vocabulary without overwriting the old one?
I have gone through this SO post but don't exactly get how to apply it to pandas. I have also heard about Python generators but am not sure whether they are useful here.
You can iterate in chunks as below. The solution has been adapted from here
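import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
def ChunkIterator():
    # csvfilename: path to the CSV file that holds the 'articles' column
    for chunk in pd.read_csv(csvfilename, chunksize=1000):
        for doc in chunk['articles'].values:
            yield doc
corpus = ChunkIterator()
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(corpus)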
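A generator only avoids holding all the raw text in memory at once; TfidfVectorizer has no partial_fit, so the vocabulary is still built in a single fit_transform pass. The streaming pattern itself can be sketched without pandas; here `iter_column` is a hypothetical helper and the in-memory CSV stands in for a large file:

```python
import csv
import io

def iter_column(handle, column):
    """Yield one value of `column` per CSV row without loading the whole file."""
    for row in csv.DictReader(handle):
        yield row[column]

# In-memory stand-in for open("articles.csv", newline="")
csv_text = "id,articles\n1,first article\n2,second article\n3,third article\n"
corpus = iter_column(io.StringIO(csv_text), "articles")

first = next(corpus)      # rows are read lazily, one at a time
remaining = list(corpus)  # a generator can be consumed only once
print(first, remaining)
```

Since a generator is exhausted after one pass, it has to be recreated before any second call that needs to read the corpus again.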
