How to test a text clustering application? - nlp

I am developing an application to cluster documents according to their topics. I am using the LDA (Latent Dirichlet Allocation) algorithm. Now the prototype is ready and there are some results.
I am looking for a reasonable way to test it. My current approach is to print out the topics together with some of their related documents and evaluate them manually. I can think of the following test points:
The documents within a topic are on that topic indeed.
The topics are substantially different from each other.
Is there any best practice to do this? Is there any objective metric for this rather than my subjective evaluation?

1. After training, we get the topic-word matrix P(z|w), in which every row holds the probabilities of the words assigned to one topic. You can print the top N words of every topic and evaluate them, which is easier than evaluating topics against documents.
2. I think what you are really asking is whether training has converged. I simply evaluate P(z|w): when P(z|w) is stable, the model has converged for the parameters (alpha, beta, topic_num) we chose. When tuning the number of topics, we can obtain the stable P(z|w) for each topic_num and choose the topic_num that gives the maximum. You can refer to the paper:
http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf
3. As for how to tune alpha and beta, and an efficient way to tune topic_num, Hanna M. Wallach has done a lot of research on that. I simply do this by intuition, since the dataset is too large: http://people.cs.umass.edu/~wallach/
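A minimal sketch of point 1, plus one possible check for the "topics are substantially different" test point from the question. The matrix, the vocabulary and the helper names are hypothetical toy values, and the Jaccard overlap of top words is an illustrative measure that the answer itself does not mention.

```python
# Hypothetical toy data: rows are topics, columns follow the order of `vocab`.
vocab = ["rent", "reit", "growth", "gene", "protein", "cell"]
topic_word = [
    [0.40, 0.30, 0.20, 0.04, 0.03, 0.03],  # topic 0
    [0.02, 0.03, 0.05, 0.35, 0.30, 0.25],  # topic 1
]


def top_words(row, vocab, n):
    """Return the n highest-probability words of one topic row."""
    ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]


def jaccard(a, b):
    """Overlap of two word lists: 0.0 means disjoint, 1.0 means identical."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


tops = [top_words(row, vocab, 3) for row in topic_word]
for k, words in enumerate(tops):
    print(f"topic {k}: {', '.join(words)}")

# Pairwise overlap of the top words; low values suggest distinct topics.
overlaps = {
    (i, j): jaccard(tops[i], tops[j])
    for i in range(len(tops))
    for j in range(i + 1, len(tops))
}
print(overlaps)
```

With a real model, the rows would come from the trained topic-word matrix rather than from hand-written numbers.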

Related

Truncate LDA topics

I am training an LDA model. While I obtain decently interpretable topics (based on the top words), particular documents tend to load heavily on very "generic" topics rather than specialized ones -- even though the most frequent words in the document are specialized.
For example, I have a real estate report as a document. Top words by frequency are "rent", "reit", "growth". Now, I have a "specialized" topic with top words being exactly those three. However, the loading of the specialized topic is 9%, and 32% goes to a topic which is very diffuse and the top words are rather common.
How can I increase the weight of "specialized" topics? Is it possible to truncate topics such that I only include the top 10 words and assign zero probability to anything else? Is it desirable to do so?
I am using the gensim package. Thank you!
It seems that you want very precise control over the topics, which looks much more like clustering with a set of centroids chosen ahead of time than like LDA, which is generally not very deterministic and hence not very controllable.
One of the ways you can strive to achieve your goal with LDA is to filter more words out of the documents (same as you do with stopwords). Then the "rather common" words that go into one of the topics stop obscuring the LDA model creation process and you get more crisply delineated topics (hopefully).
Removing the most common words is quite a common preprocessing practice in topic modeling, because topics are usually generated from the most frequent words, and these words are usually not very informative. You can also remove the most common words as a post-processing step (see "Pulling Out the Stops: Rethinking Stopword Removal for Topic Models").
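As a rough illustration of the preprocessing variant, the sketch below drops every token that appears in more than a chosen fraction of the documents. The function name, the 0.5 threshold and the toy documents are assumptions made for the example, not values taken from the answers.

```python
from collections import Counter


def drop_common_words(docs, max_doc_fraction=0.5):
    """Remove tokens that occur in more than max_doc_fraction of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    limit = max_doc_fraction * len(docs)
    too_common = {tok for tok, df in doc_freq.items() if df > limit}
    filtered = [[tok for tok in doc if tok not in too_common] for doc in docs]
    return filtered, too_common


# Hypothetical tokenized documents.
docs = [
    ["market", "rent", "reit", "growth"],
    ["market", "gene", "protein"],
    ["market", "rent", "yield"],
    ["market", "cell", "protein"],
]
filtered, removed = drop_common_words(docs, max_doc_fraction=0.5)
print(removed)
print(filtered[0])
```

Since the question uses gensim, the `no_above` argument of `Dictionary.filter_extremes` should give a similar effect without custom code.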
To get sparser word-topic distributions, you can use Non-negative Matrix Factorization (NMF) instead of LDA. If you adjust the sparsity parameters, you can get more peaked topic proportions. You can use scikit-learn's NMF implementation.

word2vec: weighted How can I give negative training data?

I'm reusing word2vec for the products and users on my website. I would like to say that a user is NEGATIVELY associated with a product if he has visited the page for < 5 seconds and POSITIVELY associated if he spent > 30 seconds on the page. Is there a way to specify this in word2vec? Or is there some other tool that enables this?
Your question is not well defined, but I think you want to store the relation between a user and a product, which has nothing to do with word2vec. word2vec essentially gives you a mapping from strings to vectors in a continuous domain. For your problem, you should supply the user-product relationship (NEGATIVE or POSITIVE) as a separate new feature alongside the word2vec features, and let the model retrain the word embeddings according to this new feature while solving your particular task. This way the model will adjust the word embeddings and capture some of the desired effect of the POSITIVE/NEGATIVE features.
Please elaborate so that I can answer your question better.

how to determine the number of topics for LDA?

I am new to LDA and I want to use it in my work. However, I have run into some problems.
In order to get the best performance, I want to estimate the best number of topics. After reading "Finding Scientific Topics", I know that I can first calculate log P(w|z) and then use the harmonic mean of a series of P(w|z) to estimate P(w|T).
My question is what does the "a series of" mean?
Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge, the hierarchical Dirichlet process (HDP) is quite possibly the best way to arrive at the optimal number of topics.
If you are looking for deeper analyses, this paper on HDP reports the advantages of HDP in determining the number of groups.
A reliable way is to compute the topic coherence for different numbers of topics and choose the model that gives the highest topic coherence. However, the model with the highest coherence may not always fit the bill.
See this topic modeling example.
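To make the coherence idea concrete, here is a small sketch of one common variant, the UMass-style score based on document co-occurrence counts. The answer does not say which coherence measure it has in mind, so this choice is an assumption, as are the function name and the toy corpus. The code also assumes that every top word occurs in at least one document.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """UMass-style coherence of one topic's top words (higher is better)."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # top_words is ordered from most to least probable, so each pair is
    # (earlier, later) and the earlier word supplies the denominator.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score


# Hypothetical tokenized corpus.
docs = [
    ["rent", "reit", "growth"],
    ["rent", "reit"],
    ["gene", "protein"],
    ["gene", "protein", "cell"],
]
print(umass_coherence(["rent", "reit"], docs))  # words that co-occur
print(umass_coherence(["rent", "gene"], docs))  # words that never co-occur
```

To compare different numbers of topics, one would average this score over all topics of each trained model. gensim ships a `CoherenceModel` class that covers this and other coherence measures.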
First, some people use the harmonic mean to find the optimal number of topics. I tried it too, but the results were unsatisfactory. My suggestion: if you are using R, the package "ldatuning" will be useful. It has four metrics for estimating the optimal number of topics. Perplexity and log-likelihood based V-fold cross-validation are also very good options for selecting a topic model, although V-fold cross-validation is a bit time-consuming for large datasets. See "A heuristic approach to determine an appropriate number of topics in topic modeling".
Important links:
https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597325/
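For readers who want to try the harmonic mean estimate anyway, here is a sketch computed in log space. It assumes that the "series" in the question means the log-likelihoods log P(w|z) recorded at successive Gibbs samples after burn-in, which is my reading of "Finding Scientific Topics". The function name and the sample values are hypothetical.

```python
import math


def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of likelihoods, given their logarithms.

    Uses the log-sum-exp trick so that very negative log-likelihoods,
    typical for whole corpora, do not underflow.
    """
    negated = [-ll for ll in log_likelihoods]
    largest = max(negated)
    log_sum = largest + math.log(sum(math.exp(x - largest) for x in negated))
    return math.log(len(log_likelihoods)) - log_sum


# Hypothetical log P(w|z) values from successive Gibbs samples for one value of T.
samples = [-1000.0, -1002.0, -1001.5]
print(log_harmonic_mean(samples))
```

The result is an estimate of log P(w|T). Repeating it for each candidate T and comparing the values is the model selection step described in the paper.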
Let k = number of topics
There is no single best way, and I am not even sure there is any standard practice for this.
Method 1:
Try out different values of k, select the one that has the largest likelihood.
Method 2:
Instead of LDA, see if you can use HDP-LDA
Method 3:
If the HDP-LDA is infeasible on your corpus (because of corpus size), then take a uniform sample of your corpus and run HDP-LDA on that, take the value of k as given by HDP-LDA. For a small interval around this k, use Method 1.
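Methods 1 and 3 can be sketched as a simple search over candidate values of k. The scoring function below is a hypothetical stand-in for "train LDA with k topics and return its likelihood". The helper names and the lower bound of two topics are also assumptions.

```python
def best_k(candidates, score):
    """Method 1: return the candidate k with the highest score."""
    return max(candidates, key=score)


def refine_around(k_hint, radius, score):
    """Method 3: search a small interval around a k suggested by HDP-LDA."""
    low = max(2, k_hint - radius)  # assume at least two topics
    return best_k(range(low, k_hint + radius + 1), score)


def toy_score(k):
    """Hypothetical stand-in for the likelihood of a model trained with k topics."""
    return -(k - 12) ** 2


print(best_k([5, 10, 20, 40], toy_score))
print(refine_around(10, 3, toy_score))
```

In practice each call to the scoring function trains a model, so caching the scores is worthwhile when intervals overlap.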
Since I am working on that same problem, I just want to add the method proposed by Wang et al. (2019) in their paper "Optimization of Topic Recognition Model for News Texts Based on LDA". Besides giving a good overview, they suggest a new method. First you train a word2vec model (e.g. using the word2vec package), then you apply a clustering algorithm capable of finding density peaks (e.g. from the densityClust package), and then use the number of found clusters as number of topics in the LDA algorithm.
If time permits, I will try this out. I also wonder if the word2vec model can make the LDA obsolete.

Topic modelling, but with known topics?

Okay, so usually topic models (such as LDA, pLSI, etc.) are used to infer topics that may be present in a set of documents, in an unsupervised fashion. I would like to know if anyone has any ideas as to how I can shoehorn my problem into an LDA framework, as there are very good tools available to solve LDA problems.
For the sake of being thorough, I have the following pieces of information as input:
A set of documents (segments of DNA from one organism, where each segment is a document)
A document can only have one topic in this scenario
A set of topics (segments of DNA from other organisms)
Words in this case are triplets of bases (for now)
The question I want to answer is: For the current document, what is its topic? In other words, for the given DNA segment, which other organism (same species) did it most likely come from? There could have been mutations and such since the exchange of segments occurred, so the two segments won't be identical.
The main difference between this and the classical LDA model is that I know the topics ahead of time.
My initial idea was to take a pLSA model (http://en.wikipedia.org/wiki/PLSA) and just set the topic nodes explicitly, then perform standard EM learning (if only there were a decent library that could handle Bayesian parameter learning with latent variables...), followed by inference using whatever algorithm (which shouldn't matter, because the model is a polytree anyway).
Edit: I think I've solved it, for anyone who might stumble across this. I figured out that you can use labelled LDA and just assign every label to every document. Since each label has a one-to-one correspondence with a topic, you're effectively saying to the algorithm: for each document, choose the topic from this given set of topics (the label set), instead of making up your own.
I have a similar problem, and just thought I'd add the solutions I'm going with for completeness's sake.
I also have a set of documents (pdf documents anywhere from 1 to 200 pages), though mine are regular English text data.
A set of known topics (mine include subtopics, but I won't address that here). Unlike the previous example, I may desire multiple topic labels.
Words (standard English, though named entities and acronyms are included in my corpus)
LDAesk approach: Guided LDA
Guided LDA lets you seed words for your LDA categories. If you have n topics for your final decisions, you just create your guidedLDA algorithm with n seed topics, each of which contains the keywords that make up its topic name. E.g., I want to cluster into the known topics "biochemistry" and "physics". Then I seed my guidedLDA with d = {0: ['biochemistry'], 1: ['physics']}. You can incorporate other guiding words if you can identify them; the guidedLDA algorithm I'm using (Python version) makes it relatively easy to identify the top n words for a given topic. You can run guidedLDA once with only basic seed words, then use the top n words in the output to find more words to add to the topics. These top n words are also potentially helpful for the other approach I'm mentioning.
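A small sketch of preparing such seeds. It assumes the library wants a mapping from word id to topic id rather than from topic id to words, which is how I remember the Python guidedlda package working. The vocabulary and the extra seed words are hypothetical.

```python
# Hypothetical vocabulary; in practice it comes from the document-term matrix.
vocab = ["biochemistry", "enzyme", "physics", "quantum", "report"]
word2id = {word: i for i, word in enumerate(vocab)}

# Seed words per known topic, in the same spirit as d = {0: [...], 1: [...]}.
seed_words = {0: ["biochemistry", "enzyme"], 1: ["physics", "quantum"]}

# Invert to word id -> topic id, skipping seeds missing from the vocabulary.
seed_topics = {
    word2id[word]: topic_id
    for topic_id, words in seed_words.items()
    for word in words
    if word in word2id
}
print(seed_topics)
```

If memory serves, this mapping is then passed to the model's `fit` call together with a seed confidence value. Check the package documentation before relying on that.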
Non-LDAesk approach: ~KNN
What I've ended up doing is using a word embedding model (word2vec has been superior to alternatives for my case) to create a "topic vector" for every topic based on the words that make up the topic/subtopic. Eg: I have a category Biochemistry with a subcategory Molecular Biology. The most basic topic vector is just the word2vec vectors for Biochemistry, Molecular, and Biology all averaged together.
For every document I want to determine a topic for, I turn it into a "document vector" (same dimension and embedding model as for my topic vectors; I've found that just averaging all the word2vec vectors in the doc has been the best solution for me so far, after a bit of preprocessing like removing stopwords). Then I just find the k closest topic vectors to the input document vector.
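The procedure can be sketched as follows. The three-dimensional vectors are hypothetical stand-ins for a trained word2vec model. Cosine similarity is an assumed choice of closeness measure, since the answer does not name one.

```python
import math

# Hypothetical 3-dimensional embeddings standing in for a trained word2vec model.
embeddings = {
    "biochemistry": [0.9, 0.1, 0.0],
    "molecular": [0.8, 0.2, 0.0],
    "biology": [0.7, 0.3, 0.1],
    "physics": [0.0, 0.9, 0.2],
    "quantum": [0.1, 0.8, 0.3],
    "enzyme": [0.9, 0.2, 0.1],
    "protein": [0.8, 0.1, 0.1],
}


def average_vector(words):
    """Mean of the embeddings of the known words; unknown words are skipped."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("none of the words has an embedding")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


# One vector per known topic, built from the words in the topic name.
topic_vectors = {
    "Biochemistry / Molecular Biology": average_vector(
        ["biochemistry", "molecular", "biology"]
    ),
    "Physics": average_vector(["physics", "quantum"]),
}


def closest_topics(doc_words, k=1):
    """Return the k topic names whose vectors are most similar to the document."""
    doc_vector = average_vector(doc_words)
    ranked = sorted(
        topic_vectors,
        key=lambda name: cosine(doc_vector, topic_vectors[name]),
        reverse=True,
    )
    return ranked[:k]


print(closest_topics(["enzyme", "protein", "the"], k=1))
```

Adding or removing words in a topic's word list changes its vector, which is the hand-tuning knob the answer refers to.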
I should note that there's some ability to hand-tune this by changing the words that make up the topic vectors. One way to potentially identify further keywords is to use the guidedLDA model I mentioned earlier.
I would note that when I was testing these two solutions on a different corpus with labeled data (which I didn't use aside from evaluating accuracy and such) this ~KNN approach proved better than the GuidedLDA approach.
Why not simply use a supervised topic model? Jonathan Chang's lda package in R has an slda function that is quite nice. There is also a very helpful demo. Just install the package and run demo(slda).

Incrementally Trainable Entity Recognition Classifier

I'm doing some semantic-web/nlp research, and I have a set of sparse records, containing a mix of numeric and non-numeric data, representing entities labeled with various features extracted from simple English sentences.
e.g.
uid|features
87w39423|speaker=432, session=43242, sentence=34, obj_called=bob,favorite_color_is=blue
4535k3l535|speaker=512, session=2384, sentence=7, obj_called=tree,isa=plant,located_on=wilson_street
23432424|speaker=997, session=8945305, sentence=32, obj_called=salty,isa=cat,eats=mice
09834502|speaker=876, session=43242, sentence=56, obj_called=the monkey,ate=the banana
928374923|speaker=876, session=43242, sentence=57, obj_called=it,was=delicious
294234234|speaker=876, session=43243, sentence=58, obj_called=the monkey,ate=the banana
sd09f8098|speaker=876, session=43243, sentence=59, obj_called=it,was=hungry
...
A single entity may appear more than once (but with a different UID each time), and may have overlapping features with its other occurrences. A second data set represents which of the above UIDs are definitely the same.
e.g.
uid|sameas
87w39423|234k2j,234l24jlsd,dsdf9887s
4535k3l535|09d8fgdg0d9,l2jk34kl,sd9f08sf
23432424|io43po5,2l3jk42,sdf90s8df
09834502|294234234,sd09f8098
...
What algorithm(s) would I use to incrementally train a classifier that could take a set of features and instantly recommend the N most similar UIDs, along with the probability that those UIDs actually represent the same entity? Optionally, I'd also like to get a recommendation of missing features to populate, and then re-classify to get more certain matches.
I researched traditional approximate nearest neighbor algorithms, such as FLANN and ANN, and I don't think these would be appropriate, since they're not trainable (in a supervised learning sense), nor are they typically designed for sparse non-numeric input.
As a very naive first attempt, I was thinking about using a naive Bayes classifier, converting each SameAs relation into a set of training samples. So, for each entity A with B sameas relations, I would iterate over each and train the classifier like:
classifier = Classifier()
for entity, sameas_entities in sameas_dataset:
    entity_features = get_features(entity)
    for other_entity in sameas_entities:
        other_entity_features = get_features(other_entity)
        # Train in both directions so that each UID of the pair is learned as a class.
        classifier.train(['left_'+f for f in entity_features] + ['right_'+f for f in other_entity_features], cls=entity)
        classifier.train(['left_'+f for f in other_entity_features] + ['right_'+f for f in entity_features], cls=other_entity)
And then use it like:
>>> print classifier.findSameAs(dict(speaker=997, session=8945305, sentence=32, obj_called='salty',isa='cat',eats='mice'), n=7)
[(1.0, '23432424'), (0.999, 'io43po5'), (1.0, '2l3jk42'), (1.0, 'sdf90s8df'), (0.76, 'jerwljk'), (0.34, 'rlekwj32424'), (0.08, '09843jlk')]
>>> print classifier.findSameAs(dict(isa='cat',eats='mice'), n=7)
[(0.09, '23432424'), (0.06, 'jerwljk'), (0.03, 'rlekwj32424'), (0.001, '09843jlk')]
>>> print classifier.findMissingFeatures(dict(isa='cat',eats='mice'), n=4)
['obj_called','has_fur','has_claws','lives_at_zoo']
How viable is this approach? The initial batch training would be horribly slow, at least O(N^2), but incremental training support would allow updates to happen more quickly.
What are better approaches?
I think this is more of a clustering than a classification problem. Your entities are data points and the sameas data is a mapping of entities to clusters. In this case, clusters are the distinct 'things' your entities refer to.
You might want to take a look at semi-supervised clustering. A brief google search turned up the paper Active Semi-Supervision for Pairwise Constrained Clustering which gives pseudocode for an algorithm that is incremental/active and uses supervision in the sense that it takes training data indicating which entities are or are not in the same cluster. You could derive this easily from your sameas data, assuming that - for example - uids 87w39423 and 4535k3l535 are definitely distinct things.
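Deriving the pairwise supervision from the sameas data could look roughly like the sketch below. It assumes, as in the example above, that UIDs listed on different sameas rows are definitely distinct, and that the groups do not overlap. The function name is hypothetical and the data is a two-row excerpt of the question's example.

```python
from itertools import combinations

# Two rows from the question's sameas data: uid -> UIDs known to be the same entity.
sameas = {
    "87w39423": ["234k2j", "234l24jlsd", "dsdf9887s"],
    "09834502": ["294234234", "sd09f8098"],
}


def pairwise_constraints(sameas):
    """Build must-link and cannot-link UID pairs from sameas groups."""
    groups = [frozenset([uid, *others]) for uid, others in sameas.items()]
    # Every pair inside one group refers to the same entity.
    must_link = {
        frozenset(pair) for group in groups for pair in combinations(sorted(group), 2)
    }
    # Assumption: members of different groups are different entities.
    cannot_link = {
        frozenset((a, b))
        for first, second in combinations(groups, 2)
        for a in first
        for b in second
    }
    return must_link, cannot_link


must_link, cannot_link = pairwise_constraints(sameas)
print(len(must_link), len(cannot_link))
```

If the cannot-link assumption does not hold for the real data, only the must-link pairs should be used.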
However, to get this to work you need to come up with a distance metric based on the features in the data. You have a lot of options here, for example you could use a simple Hamming distance on the features, but the choice of metric function here is a little bit arbitrary. I'm not aware of any good ways of choosing the metric, but perhaps you have already looked into this when you were considering nearest neighbour algorithms.
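One possible reading of "a simple Hamming distance on the features", sketched on records from the question: count the features that are absent on one side or that carry different values. Both helper names are hypothetical, and treating a missing feature as a mismatch is an assumption.

```python
def parse_features(text):
    """Turn 'a=1, b=2' into {'a': '1', 'b': '2'}."""
    pairs = (item.split("=", 1) for item in text.split(",") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def hamming_distance(a, b):
    """Count features that are missing on one side or differ in value."""
    return sum(1 for key in a.keys() | b.keys() if a.get(key) != b.get(key))


first = parse_features(
    "speaker=876, session=43242, sentence=56, obj_called=the monkey,ate=the banana"
)
second = parse_features(
    "speaker=876, session=43243, sentence=58, obj_called=the monkey,ate=the banana"
)
third = parse_features(
    "speaker=997, session=8945305, sentence=32, obj_called=salty,isa=cat,eats=mice"
)

print(hamming_distance(first, second))
print(hamming_distance(first, third))
```

Bookkeeping fields such as session and sentence dominate the count here, so a real metric would probably weight or exclude them.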
You can come up with confidence scores using the distance from the centres of the clusters. If you want an actual probability of membership, then you would want to use a probabilistic clustering model, such as a Gaussian mixture model. There's quite a lot of software for Gaussian mixture modelling, but I don't know of any that is semi-supervised or incremental.
There may be other suitable approaches if the question you wanted to answer was something like "given an entity, which other entities are likely to refer to the same thing?", but I don't think that is what you are after.
You may want to take a look at this method:
"Large Scale Online Learning of Image Similarity Through Ranking" Gal Chechik, Varun Sharma, Uri Shalit and Samy Bengio, Journal of Machine Learning Research (2010).
More thoughts:
What do you mean by 'entity'? Is entity the thing that is referred by 'obj_called'? Do you use the content of 'obj_called' to match different entities, e.g. 'John' is similar to 'John Doe'? Do you use proximity between sentences to indicate similar entities? What is the greater goal (task) of the mapping?
