How to get all documents per topic after merging in BERTopic? - topic-modeling

I am using BERTopic to generate topics on my dataset. After the initial topics were created, I used hierarchical clustering to identify some topics I considered too specific, so I built a list of lists of topics to merge and applied .merge_topics, which works as intended. However, my topics list is not updated, since it was only defined when I applied fit_transform to my dataset.
The answer should be something like in this thread, How to get all documents per topic in BERTopic modeling, but in my case I first have to access the new (merged) topics. Any suggestions?
This is my code:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

docs = fetch_20newsgroups(subset='all')['data']

topic_model = BERTopic()
topics, probabilities = topic_model.fit_transform(docs)

topics_to_merge = [[1, 2], [3, 4]]
topic_model.merge_topics(docs, topics_to_merge)

# 'topics' still holds the pre-merge assignments, so this dataframe is stale
df = pd.DataFrame({'topic': topics, 'document': docs})

Apparently, the answer is very simple. BERTopic stores the topic assigned to each document in topic_model.topics_, and this attribute is updated when merge_topics is called.
So at the end, just create df using the following code:
import pandas as pd
df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})

In the v0.13 release of BERTopic, there is now the option to extract document meta-data, including topics and probabilities, as follows:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# Get your data
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# Train your model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
# Create a dataframe with document metadata
df = topic_model.get_document_info(docs)
The output will look something like this:
>>> topic_model.get_document_info(docs)
Document                              Topic  Name                       Top_n_words               Probability  ...
I am sure some bashers of Pens...         0  0_game_team_games_season   game - team - games...       0.200010  ...
My brother is in the market for...       -1  -1_can_your_will_any       can - your - will...         0.420668  ...
Finally you said what you dream...       -1  -1_can_your_will_any       can - your - will...         0.807259  ...
Think! It is the SCSI card doing...      49  49_windows_drive_dos_file  windows - drive - dos...     0.071746  ...
1) I have an old Jasmine drive...        49  49_windows_drive_dos_file  windows - drive - dos...     0.038983  ...
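Note that get_document_info reads the topic assignments currently stored on the model, so calling it again after merge_topics should give you the merged topics per document; a sketch combining this with the merging from the question:
# after merging, regenerate the document info so the Topic column
# reflects the merged topics
topic_model.merge_topics(docs, topics_to_merge)
df = topic_model.get_document_info(docs)

# all documents per merged topic
docs_per_topic = df.groupby("Topic")["Document"].apply(list)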

Related

How can I link this file to my .ipynb file to collect frequent data from the first dataset to the 9th dataset

(image of the data set) Please use Python. I'm a beginner in frequent data mining; I'm trying to understand, so please be as simple and detailed as possible.
I tried using a for loop to collect data from a range, but I'm still learning and don't know how to implement it (it keeps giving me the error "index 1 is out of bounds for axis 1 with size 1"). Please guide me.
NB: I was also trying to construct a data frame but don't know how to. Help me with that too.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv

# Read the transaction data
Data = pd.read_csv('retail.txt', header=None)

# Initializing the list
transacts = []

# populating a list of transactions
for i in range(1, 9):
    transacts.append([str(Data.values[i, j]) for j in range(1, 2000)])

df = pd.DataFrame()
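The "index 1 is out of bounds for axis 1 with size 1" error usually means the whole line ended up in a single column, for example because retail.txt is not comma-separated. A rough sketch of one way around it, assuming each line of retail.txt is one whitespace-separated transaction (an assumption, not stated above):
import pandas as pd

# read the raw lines and split each transaction into items ourselves
transacts = []
with open('retail.txt') as f:
    for line in f:
        items = line.split()  # assumes items are whitespace-separated
        if items:
            transacts.append(items)

# keep the first nine transactions, as in the loop above
first_nine = transacts[:9]

# one row per transaction; shorter transactions are padded with NaN
df = pd.DataFrame(first_nine)
print(df.head())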

doc2bow expects an array of unicode tokens on input, not a single string

How does this corpora.Dictionary tokenize the input, since I am using a DataFrame as my dataset?
from gensim import corpora, models
from gensim.models import CoherenceModel

# create a dictionary
dictionary = corpora.Dictionary([listed_reviews])

# create a corpus
corpus = [dictionary.doc2bow(text) for text in reviews['ReviewBody']]

# create a LDA model
lda_model = models.ldamodel.LdaModel(corpus=corpus,
                                     id2word=dictionary,
                                     num_topics=10,
                                     random_state=100,
                                     update_every=1,
                                     chunksize=100,
                                     passes=10,
                                     alpha='auto',
                                     per_word_topics=True)

# display the topics
lda_model.print_topics(num_topics=10, num_words=10)

# create a coherence model
coherence_model_lda = CoherenceModel(model=lda_model, texts=reviews['ReviewBody'],
                                     dictionary=dictionary, coherence='c_v')

# compute coherence score
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Hey guys, I want to build a topic model but I get this error. I know a little bit, but I am not sure how to fix it.
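The error itself points at the fix: doc2bow expects each document as a list of tokens, not a single string, and the Dictionary should be built from the same tokenised texts. A minimal sketch of that step, assuming reviews['ReviewBody'] holds raw review strings:
from gensim import corpora
from gensim.utils import simple_preprocess

# tokenise every review into a list of unicode tokens
tokenized_reviews = [simple_preprocess(str(text)) for text in reviews['ReviewBody']]

# build the dictionary and the corpus from the same tokenised texts
dictionary = corpora.Dictionary(tokenized_reviews)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_reviews]
The texts argument of CoherenceModel should then also receive tokenized_reviews rather than the raw column.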

How to evaluate NMF Topic Modeling by using Confusion Matrix?

I am doing topic modeling using the NMF model. I want to evaluate its performance with a confusion matrix, or if there are other, better methods to evaluate NMF, I am fine with those as well. I tried to find tutorials or other resources on the internet but couldn't find anything that helped me solve my problem. Below is the complete code I am using for NMF topic modeling.
import pandas as pd
import numpy as np

dataset = pd.read_csv(r'Preprocess_Data.csv')
dataset = dataset.head(20000)
dataset = dataset.dropna()

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics

tfidf_vect = TfidfVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = tfidf_vect.fit_transform(dataset['Text'].values.astype('U'))

from sklearn.decomposition import NMF
nmf = NMF(n_components=5, random_state=42)
nmf.fit(doc_term_matrix)

import random
# print a few random feature names
for i in range(10):
    random_id = random.randint(0, len(tfidf_vect.get_feature_names()) - 1)
    print(tfidf_vect.get_feature_names()[random_id])

# top 10 words of the first topic
first_topic = nmf.components_[0]
top_topic_words = first_topic.argsort()[-10:]
for i in top_topic_words:
    print(tfidf_vect.get_feature_names()[i])

# top 10 words for every topic
for i, topic in enumerate(nmf.components_):
    print(f'Top 10 words for topic #{i}:')
    print([tfidf_vect.get_feature_names()[i] for i in topic.argsort()[-10:]])
    print('\n')
Thanks in advance for the suggestions and advice.
If you have labels associated with the documents, then you can train a classifier using the topic-document representations as document features and test it on the topic-document representations of the test set; comparing its predictions with the true labels gives you the confusion matrix (see the sketch below).
Otherwise, you need to stick to unsupervised metrics; the most well-known is topic coherence, which measures how related the top-N words of each topic are.
You can find all these measures and many others here: https://github.com/mind-Lab/octis#available-metrics
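As a rough illustration of the first option, here is a hedged sketch: it assumes your CSV has a label column, called 'Label' here purely for illustration (it is not part of the original snippet), and uses the document-topic matrix from nmf.transform as features for a simple classifier.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# document-topic representations (n_documents x n_topics) as features
W = nmf.transform(doc_term_matrix)
y = dataset['Label']  # hypothetical label column, adjust to your data

X_train, X_test, y_train, y_test = train_test_split(W, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print(confusion_matrix(y_test, clf.predict(X_test)))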

Recover feature names from pipeline [duplicate]

This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems like for the most part, the answer is 'No.'
Right now, any method that uses the transformer api in sklearn returns a numpy array as its results. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, not having a clean way to track how they relate to the original column labels makes it difficult to use this section of the library to its fullest.
As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
cat_columns = train.select_dtypes(include=np.object).columns.tolist()

numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())

transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns)
]

combined_pipe = ColumnTransformer(transformers)

train_clean = combined_pipe.fit_transform(train)
test_clean = combined_pipe.transform(test)
In this example I split up my dataset using the ColumnTransformer and then added additional columns with the OneHotEncoder, so my arrangement of columns is no longer the same as what I started out with.
I could easily end up with yet other arrangements if I used different modules that follow the same API: OrdinalEncoder, SelectKBest, etc.
If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?
There's an extensive discussion about it here, but I don't think anything has been finalized yet.
Yes, you are right that there isn't complete support for tracking feature names in sklearn as of now. Initially, it was decided to keep the API generic at the level of numpy arrays. The latest progress on adding feature names to sklearn estimators can be tracked here.
Anyhow, we can create wrappers to get the feature names of the ColumnTransformer. I am not sure whether it captures all possible types of ColumnTransformers, but at least it should solve your problem.
From Documentation of ColumnTransformer:
Notes
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
Try this!
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import _VectorizerMixin
from sklearn.feature_selection._base import SelectorMixin
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer

train = pd.DataFrame({'age': [23, 12, 12, np.nan],
                      'Gender': ['M', 'F', np.nan, 'F'],
                      'income': ['high', 'low', 'low', 'medium'],
                      'sales': [10000, 100020, 110000, 100],
                      'foo': [1, 0, 0, 1],
                      'text': ['I will test this',
                               'need to write more sentence',
                               'want to keep it simple',
                               'hope you got that these sentences are junk'],
                      'y': [0, 1, 1, 1]})

numeric_columns = ['age']
cat_columns = ['Gender', 'income']

numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))

transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns),
    ('text', text_pipeline, 'text'),
    ('simple_transformer', MinMaxScaler(), ['sales']),
]

combined_pipe = ColumnTransformer(transformers, remainder='passthrough')

transformed_data = combined_pipe.fit_transform(train.drop('y', axis=1), train['y'])

def get_feature_out(estimator, feature_in):
    if hasattr(estimator, 'get_feature_names'):
        if isinstance(estimator, _VectorizerMixin):
            # handling all vectorizers
            return [f'vec_{f}' for f in estimator.get_feature_names()]
        else:
            return estimator.get_feature_names(feature_in)
    elif isinstance(estimator, SelectorMixin):
        return np.array(feature_in)[estimator.get_support()]
    else:
        return feature_in

def get_ct_feature_names(ct):
    # handles all estimators, pipelines inside ColumnTransformer
    # doesn't work when remainder == 'passthrough'
    # which requires the input column names.
    output_features = []

    for name, estimator, features in ct.transformers_:
        if name != 'remainder':
            if isinstance(estimator, Pipeline):
                current_features = features
                for step in estimator:
                    current_features = get_feature_out(step, current_features)
                features_out = current_features
            else:
                features_out = get_feature_out(estimator, features)
            output_features.extend(features_out)
        elif estimator == 'passthrough':
            output_features.extend(ct._feature_names_in[features])

    return output_features

pd.DataFrame(transformed_data,
             columns=get_ct_feature_names(combined_pipe))
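As a side note, newer scikit-learn releases (roughly 1.1 and later, once SimpleImputer also gained the method) expose get_feature_names_out() on ColumnTransformer, pipelines, and most transformers, which covers much of this use case without a custom wrapper. A short sketch, assuming such a version:
# with a recent scikit-learn (~1.1+), the transformed column names can be
# queried directly from the already-fitted ColumnTransformer
print(combined_pipe.get_feature_names_out())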

How to cluster "text document" with "spherical k-means" using Python?

I have finished implementing traditional k-means text clustering. However, I now need to revise my program to perform "spherical k-means" text clustering, and I have not succeeded yet.
I've searched for solutions online but still cannot revise my program successfully.
The following resources should be helpful for my project, but I still cannot figure out a way:
https://github.com/jasonlaska/spherecluster
https://github.com/khyatith/Clustering-newsgroup-dataset
Spherical k-means implementation in Python
This is my traditional K-means program:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.externals import joblib  # store model

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(tag_document)  # tag_document is a list that contains many strings

true_k = 3  # assume that I want to have 3 clusters
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

# store
joblib.dump(model, 'save/cluster.pkl')
# restore
clu2 = joblib.load('save/cluster.pkl')

order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
I expect to cluster text documents with "spherical k-means clustering".
First, check that cosine distance is a sensible measure for your texts, i.e. that two similar texts end up with a small cosine distance between their vectors. If so, you can simply normalize the vectors to unit length and cluster them with ordinary k-means.
I did something like this:
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

k = 20
kmeans = KMeans(n_clusters=k, init='random', random_state=0)
normalizer = Normalizer(copy=False)
sphere_kmeans = make_pipeline(normalizer, kmeans)
sphere_kmeans = sphere_kmeans.fit_transform(word2vec_tfidf_vectors)  # your document vectors
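To apply the same idea to the TF-IDF matrix X from the question and read off the cluster assignments, a rough sketch (names taken from the question's code):
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# normalize the TF-IDF rows to unit length, then run ordinary k-means,
# which approximates spherical (cosine-based) k-means
spherical_pipeline = make_pipeline(Normalizer(copy=False),
                                   KMeans(n_clusters=true_k, init='k-means++',
                                          max_iter=100, n_init=1))
spherical_pipeline.fit(X)

# cluster label for each document
labels = spherical_pipeline.named_steps['kmeans'].labels_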
