How does the number of iterations influence LDA topics modelled?

I am running LDA (sklearn library) on a corpus, and my model achieves a coherence score of 0.58 (K = 9 topics) with 500 iterations. However, when I reduced the number of iterations to 100, I was able to achieve a better coherence score of 0.60 (K = 10 topics).
I have performed the following pre-processing steps:
Lowercase
Removal of numbers and punctuation
Lemmatization
Bigrams
Could you please help me understand this behavior? How do the iterations influence the number of topics?
Are there additional metrics that are more reliable to finalise the number of topics?
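For reference, here is a minimal sketch (not the asker's exact pipeline) of how the (K, max_iter) settings can be compared side by side: it fits sklearn's LDA on a toy corpus and scores each run with gensim's c_v coherence. The corpus and settings are purely illustrative.

```python
# Minimal sketch: compare (K, max_iter) combinations by c_v coherence.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

docs = [
    "rent growth in the reit sector remains strong",
    "office rent and reit growth slowed this quarter",
    "retail vacancy rates and rent trends diverge",
    "reit dividends track rental income growth",
]  # toy corpus; replace with the preprocessed documents
texts = [d.split() for d in docs]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
vocab = vectorizer.get_feature_names_out()
dictionary = Dictionary(texts)

for k, n_iter in [(9, 500), (10, 100)]:
    lda = LatentDirichletAllocation(n_components=k, max_iter=n_iter, random_state=0)
    lda.fit(X)
    # top 10 words per topic, passed to gensim's coherence model
    topics = [[vocab[i] for i in comp.argsort()[-10:]] for comp in lda.components_]
    cm = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary, coherence="c_v")
    print(f"K={k}, max_iter={n_iter}, c_v coherence={cm.get_coherence():.3f}")
```

Fixing random_state and comparing runs this way helps separate the effect of max_iter from run-to-run variation, since LDA's variational inference is sensitive to initialization.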

Related

What is the best way to evaluate and compare Bayesian Optimisation surrogate models?

This question is not about hyperparameter tuning.
I am trying to find the best multidimensional combination of parameters for an application, based on a given metric, by using Bayesian optimisation to search the parameter space and efficiently find the optimal parameters with the fewest evaluations. The model proposes sets of parameters: some for which it predicts a high value of the metric, and others about which it is highly uncertain.
These 200-300 outputs per cycle are then experimentally validated, and the accumulated results are fed back into the model to obtain a better set of parameters for the next iteration, for a total of about 6 iterations (1,200-1,500 data points in total). The search space is large, and only a limited number of iterations can be performed.
Because of this, I need to evaluate several surrogate models on their performance within this search space. I need to evaluate search efficiency (how quickly each one finds the most optimal candidates, e.g. one takes 3 cycles, another 8, another 20) and the theoretical proportion of the search space that each model can cover given the same data, e.g. 20% of the search space given experimentally validated data covering 3% of it.
I am using the BoTorch library to build the Bayesian optimisation model. I also already have a set of real-world experimental data from several cycles of the first model I tried. At the moment I am using Gaussian Processes, but I want to benchmark different settings for the GP as well as different architectures such as Bayesian networks.
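For context, a rough sketch of a single proposal cycle of the kind described above, assuming inputs normalized to the unit cube and a synthetic objective; the dimensionality, batch size q, and acquisition settings are illustrative, not recommendations.

```python
# Sketch of one BoTorch proposal cycle on synthetic data (maximization).
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from gpytorch.mlls import ExactMarginalLogLikelihood
from botorch.acquisition import qExpectedImprovement
from botorch.optim import optimize_acqf

train_X = torch.rand(50, 4, dtype=torch.double)   # validated parameter sets so far
train_Y = train_X.sum(dim=1, keepdim=True)        # placeholder for the measured metric

gp = SingleTaskGP(train_X, train_Y)               # the surrogate to be benchmarked
fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))

acqf = qExpectedImprovement(gp, best_f=train_Y.max())
bounds = torch.stack([torch.zeros(4), torch.ones(4)]).double()
candidates, _ = optimize_acqf(acqf, bounds=bounds, q=8, num_restarts=10, raw_samples=256)
# `candidates` would go out for experimental validation in the next cycle;
# swapping `gp` for a different surrogate keeps the rest of the loop unchanged.
```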
I would like to know how to go about evaluating these models for the search efficiency and design space uncertainty. Any thoughts about how to benchmark and compare surrogate models generally are most welcome.
Thanks.

Truncate LDA topics

I am training an LDA model. While I obtain decently interpretable topics (based on the top words), particular documents tend to load heavily on very "generic" topics rather than specialized ones -- even though the most frequent words in the document are specialized.
For example, I have a real estate report as a document. Its top words by frequency are "rent", "reit", and "growth". I have a "specialized" topic whose top words are exactly those three; however, that topic's loading is only 9%, while 32% goes to a very diffuse topic whose top words are rather common.
How can I increase the weight of "specialized" topics? Is it possible to truncate topics such that I only include the top 10 words and assign zero probability to anything else? Is it desirable to do so?
I am using the gensim package. Thank you!
It seems that you want very precise control over the topics, which looks much more like clustering with a set of centroids chosen ahead of time than like LDA, which is generally not very deterministic and hence not very controllable.
One of the ways you can strive to achieve your goal with LDA is to filter more words out of the documents (the same as you do with stopwords). Then the "rather common" words that go into one of the topics stop obscuring the LDA model creation process, and you (hopefully) get more crisply delineated topics.
Removing the most common words is quite a common preprocessing practice in topic modeling: topics are usually generated from the most frequent words, but these words are often not very informative. You can also remove the most common words as a post-processing step (see Pulling Out the Stops: Rethinking Stopword Removal for Topic Models).
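A minimal sketch of this kind of frequency-based filtering with gensim; the thresholds and toy documents are illustrative and would need tuning on the real corpus.

```python
# Drop very rare and very common tokens before building the LDA corpus.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["rent", "reit", "growth", "market"],
         ["reit", "growth", "dividend", "market"],
         ["rent", "vacancy", "market", "report"]]   # toy tokenized documents

dictionary = Dictionary(texts)
# illustrative thresholds: keep tokens in at least 2 documents and at most 60% of them
dictionary.filter_extremes(no_below=2, no_above=0.6)

corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
print(lda.print_topics())
```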
For sparser word-topic distributions, you can use Non-negative Matrix Factorization (NMF) instead of LDA. If you adjust the sparsity parameters, you can get more spiked topic proportions. You can use scikit-learn's NMF implementation.
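Here is a rough sketch of how sparsity can be encouraged in scikit-learn's NMF via L1 regularization on the factor matrices; the regularization strengths and toy corpus are illustrative, not tuned values.

```python
# NMF with L1 regularization to push document-topic and topic-word
# weights toward sparser, more "spiked" distributions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["rent reit growth outlook", "reit dividend growth", "vacancy rates rent report"]
X = TfidfVectorizer().fit_transform(docs)

# alpha_W / alpha_H / l1_ratio are illustrative; larger values give sparser factors
nmf = NMF(n_components=2, alpha_W=0.1, alpha_H=0.1, l1_ratio=1.0,
          init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X)   # document-topic weights
H = nmf.components_        # topic-word weights
print(W.round(3))
```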

Number of keywords in text cluster

I'm working with a decently sized data set and wish to identify how many topics make sense. I used both NMF and LDA (sklearn implementation), but the key question is: what is a suitable measure of success? Visually, many of my topics have only a few high-weight keywords (the other weights are ~0), while a few topics have a more bell-shaped distribution of weights. What is the target: a topic with a few high-weight words and the rest low (a spike), or a bell-shaped distribution with a gradual reduction of weights over a large number of keywords?
NMF:
[figure: NMF topic keyword weights]
or the LDA method, which gives mostly a bell shape (not a curve, obviously):
[figure: LDA topic keyword weights]
I also use a weighted Jaccard similarity (set overlap of the keywords, weighted); there are no doubt better methods, but this is kind of intuitive.
Your thoughts on this?
best,
Andreas
code at https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html?highlight=document%20word%20matrix
There are a few commonly used evaluation metrics that can give a good intuition of the quality of your topic sets in general, as well as of your choice of k (the number of topics). A recent paper by Dieng et al. (Topic Modeling in Embedded Spaces) uses two of the best measures: coherence and diversity. In conjunction, coherence and diversity give an idea of how well-clustered the topics are. Coherence measures the similarity of the words within each topic using their co-occurrences in documents, and diversity measures how distinct the topics are from one another based on the overlap of their top words. If you score low on diversity, that means words are shared across topics, and you might want to increase k.
There's really no "best way to decide k," but these kinds of measures can help you decide whether to increase or decrease the number.
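For illustration, a minimal sketch of topic diversity computed as the fraction of unique words among the top-k words of all topics (roughly the definition used by Dieng et al.); the topic lists here are toy examples, e.g. the top words read off lda.components_.

```python
# Topic diversity: 1.0 means no top-word overlap between topics,
# values near 0 mean the topics share most of their top words.
def topic_diversity(topics, topk=25):
    top_words = [w for topic in topics for w in topic[:topk]]
    return len(set(top_words)) / len(top_words)

topics = [
    ["rent", "reit", "growth", "market", "dividend"],
    ["vacancy", "rent", "office", "retail", "market"],
    ["rates", "bond", "yield", "growth", "inflation"],
]  # toy top-word lists
print(topic_diversity(topics, topk=5))  # 12 unique / 15 total = 0.8
```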

Dynamic number of topics in topic models

I am new to topic modelling.
My aim is to find the key topics of a document. I am planning to use LDA for this purpose, but in LDA the number of topics has to be predefined. I believe that if a document comes from some other domain that was not in the training corpus, it will not give proper results. Is there any alternative solution? Is my thinking correct?
Two good candidates for learning the topics are Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP) topic models.
For LDA, the number of topics K is fixed and assumed to be known ahead of time. Fast inference algorithms, such as the online Variational Bayes (VB) algorithm implemented in scikit-learn and gensim, enable training on very large data sets (e.g. the New York Times or Wikipedia). By training on large corpora and setting K high, we can avoid over-fitting and learn meaningful topics for out-of-sample documents. For LDA, cross-validation is commonly used to set K: evaluate perplexity for different numbers of topics and choose the K that minimizes it.
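A minimal sketch of that perplexity-based selection with scikit-learn's online-VB LDA; the toy corpus, candidate K values, and split are all illustrative.

```python
# Pick K by held-out perplexity (lower is better).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

docs = ["rent reit growth", "bond yield inflation", "vacancy rent office",
        "reit dividend growth", "yield curve rates", "office retail vacancy"]
X = CountVectorizer().fit_transform(docs)
X_train, X_val = train_test_split(X, test_size=0.33, random_state=0)

for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, learning_method="online", random_state=0)
    lda.fit(X_train)
    print(k, lda.perplexity(X_val))   # choose the K with the lowest held-out perplexity
```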
Alternatively, the HDP topic model (implemented in gensim) learns the number of topics from the data automatically. Given the concentration parameters and the truncation levels, the number of topics is inferred by the model. Efficient inference algorithms, such as online variational inference for HDPs, enable training on massive datasets and discovery of meaningful topics.
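And a rough sketch of gensim's HDP model, which infers the set of active topics itself; the toy corpus and default truncation levels are purely illustrative.

```python
# HDP infers the number of topics (up to its truncation level) from the data.
from gensim.corpora import Dictionary
from gensim.models import HdpModel

texts = [["rent", "reit", "growth"], ["bond", "yield", "inflation"],
         ["vacancy", "rent", "office"], ["reit", "dividend", "growth"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

hdp = HdpModel(corpus=corpus, id2word=dictionary, random_state=0)
print(hdp.print_topics(num_topics=5, num_words=3))   # only the active topics carry weight
```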

How to determine the number of topics in the LDA (Latent Dirichlet Allocation) algorithm for text clustering?

I am using the LDA algorithm to cluster many documents into different topics. The LDA algorithm needs an input parameter: the number of topics. How could I determine this?
I am using the Reuters corpus to benchmark my solution, and the Reuters corpus already has its topics labelled. Should I input the same number of topics when clustering the Reuters text, and then compare my clustering result to Reuters'?
But in production, how could I know the number of topics before I actually cluster based on the topics? It's kind of like a chicken-and-egg problem.
One way you can approach this is through k-means. Using the silhouette score (or the elbow curve, though that requires manual inspection), you can get the optimal number of clusters and use that number as the number of topics.
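A minimal sketch of that idea, assuming a TF-IDF representation of the documents; the toy corpus and the range of k values are illustrative.

```python
# Estimate a reasonable number of clusters/topics via k-means + silhouette score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = ["rent reit growth", "bond yield inflation", "vacancy rent office",
        "reit dividend growth", "yield curve rates", "office retail vacancy"]
X = TfidfVectorizer().fit_transform(docs)

for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # higher = better-separated clusters
```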
