In MALLET topic modelling, the --output-topic-keys [FILENAME] option outputs, beside each topic, a parameter that the tutorial on the MALLET site calls the "Dirichlet parameter" of the topic.
I want to know what this parameter represents. Is it β in the LDA model? If not, what is it, and what are its meaning and use?
I noticed that when I don't use the hyperparameter optimization option while generating the topic model, this parameter differs between version 2.0.7 and version 2.0.8. I want to know why this difference happens.
Here's the version 2.0.7 output:
and here's 2.0.8:
I know that the output differs from run to run, but I am only concerned with this parameter.
The topic model inference algorithm used in Mallet involves repeatedly sampling a new topic assignment for each word, holding the assignments of all other words fixed. The factors that control this process are (1) how often the current word type appears in each topic and (2) how many times each topic appears in the current document. The smoothing parameters ensure that these values are never zero for any topic: beta for the first factor, alpha for the second.
You can think of the alpha parameter displayed here as a number of "imaginary" words of each topic added to every document. In the first case, topic 0 has 2.5 imaginary words of weight in every document. The default value for this parameter was initially 50 / numTopics. Larger values encourage models to have more uniform topic distributions over documents; smaller values encourage more sparsity. The general experience was that 50 was too large and that 5 is a better default, so this was changed in 2.0.8.
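As a rough sketch (using the usual LDA notation rather than Mallet's internal variable names), the collapsed Gibbs sampling weight for assigning topic k to a word w in document d combines those two smoothed factors:

P(z = k \mid \text{rest}) \;\propto\; \frac{n_{k,w} + \beta}{n_{k,\cdot} + V\beta}\,\big(n_{d,k} + \alpha_k\big)

where n_{k,w} is how often word type w is currently assigned to topic k, n_{d,k} is how often topic k appears in document d, and V is the vocabulary size. The value printed beside each topic in the topic-keys file is the α_k in the second factor.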
The default is to make the alpha weight equal for all topics. With hyperparameter optimization on, these values can vary. Usually what you will find is that a topic with a large value will contain "near stopwords" that are frequent in most documents and don't have much content. Topics with very small values often correspond to unusual, distinctive documents. Topics in the middle are often the most interesting.
If I understand it correctly, the parameter is alpha, not beta.
You can use an asymmetric alpha using the flag
--optimize-interval INTEGER
to reestimate the hyperparameters every INTEGER iterations.
I have a follow-up question to the one asked here: Mallet topic modeling - topic keys output parameter
I hope I can still get a more detailed explanation of this subject because I have trouble understanding these numbers in the output files.
What can the summation of the output numbers tell us? For example, with 20 topics, an optimization interval of 20, and 2000 iterations, the sum of the output values is approximately 2. With the same corpus but 15 topics / 1000 iterations / optimization interval 10, the sum is 0.77, and with 10 topics / 1000 iterations / optimization interval 10 it's 0.72. What does this mean? Does it even mean anything?
Also, people refer to these results as parameters, but to my understanding the parameter is the optimization interval, not the values in the output. So what is the correct way to refer to the values in the output? The frequency of the topic? A percentage of something? What part did I get wrong?
You're correct that "parameter" is being used to mean two different things here.
Parameters of the statistical model are values that determine the properties of that model. In this case they determine which topics we expect to occur more often, and how confident we are of that. In some cases these are set by the user, in other cases they are set by the inference algorithm.
Parameters of the inference algorithm are settings that determine the procedure by which we set the parameters of the statistical model.
An additional confusion is that when model parameters are explicitly set by the user, Mallet uses the same interface as for algorithm settings.
The numbers you see are the parameters of a Dirichlet distribution that describes our prior expectation of the mix of topics in a document. You can think of it as having two parts: proportions and magnitude. If you rescale the numbers to add up to 1.0, the resulting proportions would tell you the model's guess at which topics occur most frequently. The actual sum of the numbers (the magnitude) tells you how confident the model is that this is the actual proportion you will see in a document. Smaller values indicate more variability.
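If it helps to see the proportions/magnitude split concretely, here is a minimal sketch; the alphas array is a made-up example rather than output from your model:

import numpy as np

# Hypothetical per-topic values, as read from the topic-keys file.
alphas = np.array([0.30, 0.05, 0.45, 0.08, 0.12])

magnitude = alphas.sum()           # concentration: how confident the prior is
proportions = alphas / magnitude   # expected topic proportions, summing to 1.0

print(magnitude)    # compare with the ~2 vs ~0.77 vs ~0.72 sums you report
print(proportions)  # the model's guess at which topics occur most often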
A possible explanation for the numbers you're seeing (and please treat this as raw speculation) is that the 20-topic model has more flexibility to fit consistent topics, and so it is about three times more confident that there are topics that consistently occur more often in documents. As the number of topics decreases, the specificity of topics drops, so it is more likely that any particular topic could be large in any given document.
I'm looking for the simplest, most effective way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) with respect to a defined conceptual space (here: learning as it relates to work).
The data are title & abstract (mean length = 1300 characters).
Any approach may be used, or even several combined, including supervised machine learning and/or establishing features that give rise to threshold values for inclusion, among others.
Approaches could draw on the key terms that describe the conceptual space, though a simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, ...
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1 = relevant, 0 = irrelevant). Would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!
Several Ideas:
Run LDA and get the document-topic and topic-word distributions (say, 20 topics, depending on how well your dataset covers different topics). Label the top r% of documents with the highest weight on the relevant topic as relevant and the bottom nr% as non-relevant, then train a classifier over those labelled documents (see the sketch after this list).
Just use a bag-of-words representation and retrieve the top r nearest neighbours to your query (your conceptual space) as relevant and the bottom nr percent as non-relevant, then train a classifier over them.
If you had the citations, you could run label propagation over the citation graph after labelling only a few papers.
Don't forget to distinguish title words from abstract words, e.g. by rewriting them as title_word1, so that a classifier can put more weight on them.
Cluster the articles into, say, 100 clusters and then manually label those clusters. Choose the number of clusters based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If the number of relevant documents is far smaller than the number of non-relevant ones, then the best way to go is to find the nearest neighbours to your conceptual space (e.g. using the information retrieval implemented in Lucene), then go down the ranked results manually until you feel the documents are no longer relevant.
Most of these methods are bootstrapping or weakly supervised approaches to text classification, about which you can find more in the literature.
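Sketching the first idea (LDA pseudo-labelling followed by a classifier) with gensim and scikit-learn; the chosen topic index, the 5% cut-offs, and the names texts/tokens are assumptions you would adapt to your corpus:

import numpy as np
from gensim import corpora, models
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# texts: raw title+abstract strings; tokens: the same documents tokenised.
dictionary = corpora.Dictionary(tokens)
bow = [dictionary.doc2bow(doc) for doc in tokens]
lda = models.LdaModel(bow, id2word=dictionary, num_topics=20, passes=5)

relevant_topic = 3                      # the topic you judge closest to the conceptual space
weights = np.array([dict(lda.get_document_topics(d)).get(relevant_topic, 0.0) for d in bow])

order = weights.argsort()
pos = order[-int(0.05 * len(texts)):]   # top 5% of documents -> pseudo-label 1
neg = order[:int(0.05 * len(texts))]    # bottom 5% -> pseudo-label 0

X = TfidfVectorizer(min_df=5).fit_transform(texts)
idx = np.concatenate([pos, neg])
y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])

clf = LogisticRegression(max_iter=1000).fit(X[idx], y)
scores = clf.predict_proba(X)[:, 1]     # relevance scores for all 800k+ articles

The point is only the shape of the pipeline: pick the topic(s) that correspond to your conceptual space, pseudo-label the extremes, and let a supervised classifier generalise to the rest of the corpus.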
Let's say I have a user search query which looks like:
"the happy bunny"
I have already computed tf-idf and have something like this (the following are made-up example values) for each document in which I am searching (of course the idf is always the same):
word    tf      idf    score (tf * idf)
the     0.06     1     0.06 * 1 = 0.06
happy   0.002   20     0.002 * 20 = 0.04
bunny   0.0005  60     0.0005 * 60 = 0.03
I have two questions with what to do next.
Firstly, "the" still has the highest score, even though it is adjusted for rarity by idf, and it's still not exactly important. Do you think I should square the idf values to weight more heavily in favour of rare words, or would this give bad results? Otherwise I'm worried that "the" is getting equal importance to "happy" and "bunny", when it should be obvious that "bunny" is the most important word in the search. As long as rare always means important, weighting by rarity is a good idea, but if that is not always the case then doing so could really mess up the results.
Secondly, and more importantly: what is the best/preferred method for combining the scores for each word into a single score per document that represents how well it matches the entire search query? I was thinking of adding them, but it has become apparent that this would give higher priority to a document containing 10,000 occurrences of "happy" but only 1 of "bunny" than to another document with 500 "happy" and 500 "bunny" (which would be a better match).
First, make sure that you are computing the correct TF-IDF values. As others have pointed out, they do not look right. TF is relative to a specific document, and we often do not need to compute it for queries (since the raw term frequency is almost always 1 in queries). There are different types of TF functions to pick from (check the Wikipedia page on tf-idf; it has good coverage). Log normalisation is common and the most efficient scheme, since it saves an extra disk access to fetch the respective document's maximum term frequency maxF, which is needed for something like double normalisation. When you are dealing with large volumes of documents this can be expensive, especially if you can't bring them into memory. A bit of insight into inverted files can go a long way in understanding some of the underlying complexities. Log normalisation is efficient and is a non-linear function, and therefore better than raw frequency.
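For reference, the log normalisation referred to here is usually written as

wf_{t,d} = 1 + \log(tf_{t,d}) \text{ if } tf_{t,d} > 0, \text{ and } 0 \text{ otherwise,}

so a term occurring 10,000 times receives only a modestly larger weight than one occurring 500 times, rather than 20 times the weight.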
Once you are certain on your weighting scheme, then you may want to consider a stop list to get rid of very common/noisy words. These do not contribute to the rank of documents. It is generally recommended to use a stop list of high frequency, very common words. Do a search and you will find many available, including the one that Lucene uses.
The rest depends on your ranking strategy, and that will depend on your implementation/model. The vector space model (VSM) is simple and readily available in libraries like Lucene, Lemur, etc. VSM computes the dot (scalar) product of the weights of the terms shared by the query and a document. Term weights are normalised via vector length normalisation (which solves your second question), and the result of applying the model is a value between 0 and 1. This is also justified/interpreted as the cosine of the angle between the two vectors, i.e. the dot product divided by the product of the vectors' Euclidean lengths.
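To make the length-normalisation point concrete, here is a minimal sketch of cosine scoring over log-normalised tf-idf weights; the idf values and counts mirror the made-up example in the question:

import math

idf = {'the': 1.0, 'happy': 20.0, 'bunny': 60.0}   # illustrative values only

def weights(counts):
    # log-normalised tf times idf, for the terms that actually occur
    return {t: (1 + math.log(c)) * idf[t] for t, c in counts.items() if c > 0}

def cosine(query_counts, doc_counts):
    q, d = weights(query_counts), weights(doc_counts)
    dot = sum(q[t] * d[t] for t in q if t in d)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

query = {'the': 1, 'happy': 1, 'bunny': 1}
doc_a = {'happy': 10000, 'bunny': 1}      # lopsided document
doc_b = {'happy': 500, 'bunny': 500}      # balanced document
print(cosine(query, doc_a), cosine(query, doc_b))   # doc_b scores higher

Because each vector is divided by its Euclidean length, the document with 10,000 "happy" and a single "bunny" no longer outranks the balanced 500/500 document.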
One of the earliest comprehensive studies on weighting schemes and ranking with VSM is an article by Salton (pdf) and is a good read if you are interested in Information Retrieval. A bit outdated perhaps (notice how log normalisation is not mentioned in the article).
Your best read I believe is the book Introduction to Information Retrieval by Christopher Manning. It will take you through everything that you need to know, from indexing to ranking schemes, etc. A bit lacking on ranking models (does not cover some of the more complex probabilistic approaches).
You should reconsider your TF and IDF values; they do not look correct. The TF value is usually just how often the word occurs, so if the word "the" appeared 20 times, its tf value would be 20. A word like "the" should have a very low IDF value (possibly around 4 decimal places, 0.000...).
You could use stop-word removal if words like "the" are not necessary; they would be removed rather than just given a low score.
A vector space model could be used for this.
Can you compute tf-idf for amalgamated terms? That is, you first generate a sentiment that treats each of its components as equal, and then treat the sentiment as a single term for which you compute the tf-idf?
I'm using the gensim package to implement LSI on a corpus. My goal is to find the most frequently occurring distinct topics that appear in the corpus.
If I don't know the number of topics that are in the corpus (I'd estimate anywhere from 5 to 20), what is the best approach in setting the number of topics that LSI should search for? Is it better to look for a large number of topics (20-30), or a small number of topics (~5)?
From Radim himself:
That's a good question, but unfortunately without a good answer.

It is not true that increasing the number of dimensions always improves retrieval accuracy. In fact, if you use all the dimensions (= full rank of the training matrix), LSI will give you exactly the same documents that you entered in, so LSI would become pointless.

If you're interested in the math side of it, have a look at this issue: https://github.com/piskvorky/gensim/issues/28 Otherwise, just set the dimensions to a few hundred to a thousand, which is the accepted standard. Or try several different choices, measure the accuracy, and select the dimensionality that works best on your problem.

Best, Radim
This is what I do sometimes when I'm unsure. Since you've already narrowed your topics down to the 5-20 range, you can iterate over some of these values and see which fits best.
# Declare a value for N_TOPICS, e.g. one of several candidates in the 5-20 range
N_TOPICS = 10
for count, topic in enumerate(lda.show_topics(topics=N_TOPICS, topn=20, log=False, formatted=True)):
    print("TOPIC {0}: {1}\n".format(count, topic))
I have a set of documents that each consist of N words. The ith word of each document is selected from a common set of words, Wi={wi1, wi2, wi3, wi4}.
For example, the first word in each document might be selected from: {'alpha', 'one', 'first', 'lowest'}. The second word might be selected from: {'beta', 'two', 'second', 'lower'}. And so on.
These words may belong to different topics. For example, one topic might consist of {'alpha', 'beta', 'gamma', etc...}. Another topic might be {'alpha', 'two', 'third', etc...}. Each document has a different topic usage (just like a normal topic model).
To generate a new document, you go through each position 1...N. For the ith word, you select a topic based on the document's topic usage, then select a word from Wi based on the topic's word usage. Therefore, each topic will have N total words - one for each position.
My question is how do I learn the latent parameters in this model? Specifically, I want to know (1) the topic usage of each document, and (2) the word composition of each topic. This looks very similar to a topic model, but I don't know if I can use anything out of the box?
Because I can write out the likelihood of the data given the parameters, I tried implementing an EM algorithm to estimate (1) topic usage, then use this to update (2) word usage (and keep iterating until convergence). However, this was really really slow.
Another thing I have read is if I can write the joint density function, I can try sampling from the posterior density to learn these hidden parameters (using MCMC). Does this sound feasible? I have ~100 documents, each document is ~1000 words long, and at each word position, you can select from 6 words.
If anyone can help or give advice, I'd really appreciate it. Thanks!
Pure EM doesn't work for topic models: you need to use variational EM, as demonstrated in the original Blei et al. paper. The other way to do inference is to use a collapsed Gibbs sampler, as described in Griffiths and Steyvers' "Finding Scientific Topics" paper (and many others too).
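For what it's worth, here is a minimal collapsed Gibbs sketch for the model as I read it (a separate word distribution per topic at every position); docs, V_per_pos and the hyperparameter values are assumptions you would replace with your own:

import numpy as np

def positional_gibbs(docs, V_per_pos, K=5, alpha=0.1, beta=0.1, iters=200, seed=0):
    # docs[d, i] = index (within W_i) of the word at position i of document d
    rng = np.random.default_rng(seed)
    D, N = docs.shape
    Vp = np.asarray(V_per_pos)               # vocabulary size at each position (e.g. 6)
    V = Vp.max()

    z = rng.integers(K, size=(D, N))         # current topic assignment of every token
    ndk = np.zeros((D, K), dtype=int)        # topic counts per document
    nkiw = np.zeros((K, N, V), dtype=int)    # word counts per topic and position
    nki = np.zeros((K, N), dtype=int)        # token counts per topic and position

    for d in range(D):
        for i in range(N):
            k, w = z[d, i], docs[d, i]
            ndk[d, k] += 1; nkiw[k, i, w] += 1; nki[k, i] += 1

    for _ in range(iters):
        for d in range(D):
            for i in range(N):
                k, w = z[d, i], docs[d, i]
                ndk[d, k] -= 1; nkiw[k, i, w] -= 1; nki[k, i] -= 1
                # document factor times position-specific topic factor, both smoothed
                p = (ndk[d] + alpha) * (nkiw[:, i, w] + beta) / (nki[:, i] + Vp[i] * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d, i] = k
                ndk[d, k] += 1; nkiw[k, i, w] += 1; nki[k, i] += 1

    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)    # (1) topic usage per document
    phi = (nkiw + beta) / (nki[:, :, None] + Vp[None, :, None] * beta)      # (2) word usage per topic/position
    return theta, phi

At roughly 100 documents × 1,000 positions this is about 10^5 tokens per sweep, which is small by topic-model standards.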
Can you clarify the generative process? Are you saying you draw from a different topic for each position i (standard LDA)? That you want a different set of topics for position i than position i+1? That knowing the topic assignment for position i means you know the assignment for i+1..n? If you could write it out as a standard generative model then figuring out how to do inference should be pretty straightforward (and I'd be happy to edit my answer).
You don't explicitly state this, but it looks implicit in your 4th paragraph: given the topic, the distribution over which word is chosen for the ith slot is independent of the words that have been (or will be) selected for the other slots.
If this is the case, then your model is (conforms to) a naive Bayes classifier, and that is probably your best bet; this type of model can be fit with one pass through the training data, and should be much better than using the EM algorithm.
If you have correlations between the words, given the class, you may want to look into "Tree-Augmented Bayesian Classifiers".