how to determine the number of topics for LDA? - nlp

I am a freshman in LDA and I want to use it in my work. However, some problems appear.
In order to get the best performance, I want to estimate the best topic number. After reading "Finding Scientific topics", I know that I can calculate logP(w|z) firstly and then use the harmonic mean of a series of P(w|z) to estimate P(w|T).
My question is what does the "a series of" mean?

Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge, hierarchical dirichlet process (HDP) is quite possibly the best way to arrive at the optimal number of topics.
If you are looking for deeper analyses, this paper on HDP reports the advantages of HDP in determining the number of groups.

A reliable way is to compute the topic coherence for different number of topics and choose the model that gives the highest topic coherence. But sometimes, the highest may not always fit the bill.
See this topic modeling example.

First some people use harmonic mean for finding optimal no.of topics and i also tried but results are unsatisfactory.So as per my suggestion ,if you are using R ,then package"ldatuning" will be useful.It has four metrics for calculating optimal no.of parameters. Again perplexity and log-likelihood based V-fold cross validation are also very good option for best topic modeling.V-Fold cross validation are bit time consuming for large dataset.You can see "A heuristic approach to determine an appropriate no.of topics in topic modeling".
Important links:
https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597325/

Let k = number of topics
There is no single best way and I am not even sure if there is any standard practices for this.
Method 1:
Try out different values of k, select the one that has the largest likelihood.
Method 2:
Instead of LDA, see if you can use HDP-LDA
Method 3:
If the HDP-LDA is infeasible on your corpus (because of corpus size), then take a uniform sample of your corpus and run HDP-LDA on that, take the value of k as given by HDP-LDA. For a small interval around this k, use Method 1.

Since I am working on that same problem, I just want to add the method proposed by Wang et al. (2019) in their paper "Optimization of Topic Recognition Model for News Texts Based on LDA". Besides giving a good overview, they suggest a new method. First you train a word2vec model (e.g. using the word2vec package), then you apply a clustering algorithm capable of finding density peaks (e.g. from the densityClust package), and then use the number of found clusters as number of topics in the LDA algorithm.
If time permits, I will try this out. I also wonder if the word2vec model can make the LDA obsolete.

Related

Word2Vec clustering: embed with low dimensionality or with high dimensionality and then reduce?

I am using K-means for topic modelling using Word2Vec and would like to understand the implications of vectorizing up to, let's say, 10 dimensions, against embedding it with 200 dimensions and then using PCA to get down to 10. Does the second approach make sense at all?
Which one worked better for your specific purposes, & your specific data, after trying both & comparing the end-results against each other, either in some ad-hoc ("eyeballing") or rigorous way?
There's no reason to prematurely reject any approach, given how many details about your data & ultimate end-goals are unstated.
It would be atypical to train a word2vec model to have only 10 dimensions. Published work most often shows the use of 100 to 1000 dimensions, often 300 or 400, assuming you've got enough bulk training data to make the algorithm worthwhile.
(Word2vec needs a lot of varied training text, with many contrasting usage examples for every word of interest, to generate good results. You may occasionally see toy-sized demos, on smaller amounts of data, just to quickly show steps, or some major qualities of the results. But good results, in the aspects for which word2vec is most appreciated, depend on plentiful training data.)
Also, whether or not your aims would be helped by the extra step of PCA to reduce the dimensionality of a larger word2vec model seems another separable question, to be determined experimentally by comparing results with and without that step, on your actual data/problem, rather than guessed at from intuitions from other projects that might not be comparable.

Truncate LDA topics

I am training an LDA model. While I obtain decently interpretable topics (based on the top words), particular documents tend to load heavily on very "generic" topics rather than specialized ones -- even though the most frequent words in the document are specialized.
For example, I have a real estate report as a document. Top words by frequency are "rent", "reit", "growth". Now, I have a "specialized" topic with top words being exactly those three. However, the loading of the specialized topic is 9%, and 32% goes to a topic which is very diffuse and the top words are rather common.
How can I increase the weight of "specialized" topics? Is it possible to truncate topics such that I only include the top 10 words and assign zero probability to anything else? Is it desirable to do so?
I am using the gensim package. Thank you!
It seems that you want a very precise control over the topics which looks much more like clustering with a set of centroids chosen ahead of time than LDA which is generally not very deterministic and hence controllable.
One of the ways you can strive to achieve your goal with LDA is to filter more words out of the documents (same as you do with stopwords). Then the "rather common" words that go into one of the topics stop obscuring the LDA model creation process and you get more crisply delineated topics (hopefully).
Removing the most common words is quite a common practice for preprocessing in topic modeling. Because topics are usually generated from the most frequent words, but usually these words are not very informative. You can also remove the most common words as a post-processing step (See Pulling Out the Stops: Rethinking Stopword Removal for Topic Models)
About having sparser word-topic distributions, you can use Non-negative Matrix Factorization (NMF) instead of LDA. If you adjust the sparsity parameters, you can get more spiked proportions of the topics. You can use scikit-learn NMF's implementation.

Number of keywords in text cluster

I'm working in a decently-sized data set, and wish to identify what # topics make sense. I used both NMF and LDA (sklearn implementation), but the key question: what is a suitable measure for success. Visually I have in many topics only a few height-weight keywords (the other weights ~ 0), and a few topics with more bell-shaped distribution of the topics. What is the target: a topic with a few words, high weight, rest low (a spike) or a bell-shape distribution, gradual reduction of weights over a large # keywords
NMF
or the LDA method
that gives mostly a bell-shape (not curve, obviously)
I also use a weighted jaccard (set overlap of the keywords, weighted; there are no doubt better methods, but this is kind-of intuitive
Your thoughts on this?
best,
Andreas
code at https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html?highlight=document%20word%20matrix
There are a few commonly used evaluation metrics that can give a good intuition of the quality of your topic sets in general, as well as your choice of k (number of topics). A recent paper by Dieng et al. (Topic Modeling in Embedded Spaces) uses two of the best measures: coherence and diversity. In conjunction, coherence and diversity give an idea of how well-clustered topics are. Coherence measures the similarities of words in each topic using their co-occurrences in documents, and diversity measures the similarity between topics based on the overlap of topics. If you score low in diversity, that means that words are overlapping in topics, and you might want to increase k.
There's really no "best way to decide k," but these kind of measures can help you decide whether to increase or decrease the number.

How to test a text clustering application?

I am developing an application to cluster documents according to their topics. I am using the LDA (Latent Dirichlet Allocation) algorithm. Now the prototype is ready and there are some results.
I am looking for a reasonable way to test it. My current approach is to print out the topics and some of their related documents respectively. And manually evaluate them. I can think of the following test points:
The documents within a topic are on that topic indeed.
The topics are substantially different from each other.
Is there any best practice to do this? Is there any objective metric for this rather than my subjective evaluation?
1.after training, we get the topic word matrix P(z|w) , every row is the word's prob assign to the topic, so you can print out the top N words of every topic,and eval them , it would be easy comparing to eval topic with document
2.I think the problem you are asking here is whether the training has converged,I simply eval the P(z|w) ,when the P(z|w) is stable , it means model converge at the param (alpha,beta,topic_num)we choose. and when we tune the topic num , we can get the stable P(z|w) respect to all the topic_num, we choose topic_num respect to the max P(z|w) . you can refer to the paper
http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf
3.as to how to tune alpha beta, and effcient way to tune topic_num , Hanna M. Wallach do a lot of research about that,I simply do this by intuition,since the dataset is too large http://people.cs.umass.edu/~wallach/

Topic modelling, but with known topics?

Okay, so usually topic models (such as LDA, pLSI, etc.) are used to infer topics that may be present in a set of documents, in an unsupervised fashion. I would like to know if anyone has any ideas as to how I can shoehorn my problem into an LDA framework, as there are very good tools available to solve LDA problems.
For the sake of being thorough, I have the following pieces of information as input:
A set of documents (segments of DNA from one organism, where each segment is a document)
A document can only have one topic in this scenario
A set of topics (segments of DNA from other organisms)
Words in this case are triplets of bases (for now)
The question I want to answer is: For the current document, what is its topic? In other words, for the given DNA segment, which other organism (same species) did it most likely come from? There could have been mutations and such since the exchange of segments occurred, so the two segments won't be identical.
The main difference between this and the classical LDA model is that I know the topics ahead of time.
My initial idea was to take a pLSA model (http://en.wikipedia.org/wiki/PLSA) and just set the topic nodes explicitly, then perform standard EM learning (if only there were a decent library that could handle Bayesian parameter learning with latent variables...), followed by inference using whatever algorithm (which shouldn't matter, because the model is a polytree anyway).
Edit: I think I've solved it, for anyone who might stumble across this. I figured out that you can use labelled LDA and just assign every label to every document. Since each label has a one-to-one correspondence with a topic, you're effectively saying to the algorithm: for each document, choose the topic from this given set of topics (the label set), instead of making up your own.
I have a similar problem, and just thought I'd add the solutions I'm going with for completeness's sake.
I also have a set of documents (pdf documents anywhere from 1 to 200
pages), though mine are regular English text data.
A set of known topics (mine include subtopics, but I won't address that here). Unlike the previous example, I may desire multiple topic labels.
Words (standard English, though named entities and acronyms are included in my corpus)
LDAesk approach: Guided LDA
Guided LDA lets you seed words for your LDA categories. If you have n-topics for your final decisions you just create your guidedLDA algorithm with n-seed topics, each of which contain the keywords that makeup their topic name. Eg: I want to cluster into known topics "biochemistry" and "physics". Then I seed my guidedLDA with d = {0: ['biochemsitry'], 1: ['physics']}. You can incorporate other guiding words if you can identify them, however the guidedLDA algorithm I'm using (python version) makes it relatively easy to identify the top n-words for a given topic. You can run guidedLDA once with only basic seed words then use the top n-words output to consider for more words to add to topics. These top n-words also are potentially helpful for the other approach I'm mentioning.
Non-LDAesk approach: ~KNN
What I've ended up doing is using a word embedding model (word2vec has been superior to alternatives for my case) to create a "topic vector" for every topic based on the words that make up the topic/subtopic. Eg: I have a category Biochemistry with a subcategory Molecular Biology. The most basic topic vector is just the word2vec vectors for Biochemistry, Molecular, and Biology all averaged together.
For every document I want to determine a topic for, I turn it into a "document vector" (same dimension & embedding model as how I made my topic vectors - I've found just averaging all the word2vec vectors in the doc has been the best solution for my so far, after a bit of preprocessing like removing stopwords). Then I just find the k-closest topic vectors to the input document vector.
I should note that there's some ability to hand tune this by changing the words that makeup the topic vectors. One way to potentially identify further keywords is to use the guidedLDA model I mentioned earlier.
I would note that when I was testing these two solutions on a different corpus with labeled data (which I didn't use aside from evaluating accuracy and such) this ~KNN approach proved better than the GuidedLDA approach.
Why not simply use a supervised topic model. Jonathan Chang's lda package in R has an slda function that is quite nice. There is also a very helpful demo. Just install the package and run demo(slda).

Resources