Topic Modeling with Mallet - topic keys output parameter - topic-modeling

I have a follow-up question to the one asked here: Mallet topic modeling - topic keys output parameter
I hope I can still get a more detailed explanation of this subject because I have trouble understanding these numbers in the output files.
What can the summation of the output numbers tell us? For example, with 20 topics and an optimization value 20 on 2000 iterations, the summation of the output is approximately 2. With the same corpus, but with 15 topics/1000 iterations/optimization 10 the result is 0,77 and with 10 topics/1000 iterations/optimization 10 it's 0,72. What does this mean? Does it even mean anything?
Also, these people are referring to these results as parameters, but for my understanding, the parameter is the optimization interval and not the result in the output. So what is the correct way to refer to the result in the output? Frequency of the topic? Is it a procentage of something? What part did I get wrong?

You're correct that parameter is being used to mean two different things here.
Parameters of the statistical model are values that determine the properties of that model. In this case they determine which topics we expect to occur more often, and how confident we are of that. In some cases these are set by the user, in other cases they are set by the inference algorithm.
Parameters of the inference algorithm are settings that determine the procedure by which we set the parameters of the statistical model.
An additional confusion is that when model parameters are explicitly set by the user, Mallet uses the same interface as for algorithm settings.
The numbers you see are the parameters of a Dirichlet distribution that describes our prior expectation of the mix of topics in a document. You can think of it as having two parts: proportions and magnitude. If you rescale the numbers to add up to 1.0, the resulting proportions would tell you the model's guess at which topics occur most frequently. The actual sum of the numbers (the magnitude) tells you how confident the model is that this is the actual proportion you will see in a document. Smaller values indicate more variability.
A possible explanation for the numbers you're seeing (and please treat this as raw speculation) is that the 20 topic model has more flexibility to fit consistent topics, and so it is about three times more confident that there are topics that consistently occur more often in documents. As the number of topics decreases, the specificity of topics drops, so it is more likely that any particular topic could be large in any given document.

Related

Handling optional data in Logistic regression

I am working with data which contains marks and other features of students and trying to predict whether they will get a high salary or not using scikit-learn in python. I ran into a problem,
since a student does not take all the subject his/her score in a subject is -1 if he has not taken the subject (a student can take multiple subjects).
Below a snapshot taken from the data file:
Snapshot
I am trying to find a way to interpret the -1 in a way that doesn't alter the data much.
My Approach:
Take the percentile marks for each student and then take the average of all percentiles for each student giving a single number for each student which a lot easier to work with but this method may lose some information about the distribution of marks.
Fill the -1 value with the average of marks for all the students in that subject, but this will not work if the data is biased towards one subject
Is there any better way the deal with this kind of data?
Your "-1"'s amount to missing data, so you are asking how to approach a classification task with missing data. See here and here and here, among many others, for discussions on this topic.
A couple important considerations that come to mind:
One option is to "impute" the missing values, which is what you're describing with using "average marks." This approach often requires the assumption that the data is "missing at random" which in your case is unlikely to be true: for example, a bad student is more likely to not take a difficult subject, so missing values tell you something.
Using regression models (like logistic regression) are in general going to require some type of imputation. But there are other models, like decision trees or Random forests, that can handle missing data without imputation.

Mallet topic modeling - topic keys output parameter

In MALLET topic modelling, the --output-topic-keys [FILENAME] option outputs beside each topic a parameter that in the tutorial in the MALLET site called "Dirichlet parameter " of the topic.
I want to know what does this parameter represent? is it β in the LDA model? and if not what is it and what is it's meaning and use.
I noted that when I don't use the parameter optimization option while generating the topic model, this parameter differs in version 2.0.7 than in version 2.0.8. I want to know why this difference happens.
here's version 2.0.7 output
and 2.0.8
I know that the output differs by each run, but I am only concerned with this parameter.
The topic model inference algorithm used in Mallet involves repeatedly sampling new topic assignments for each word holding the assignments of all other words fixed. The factors that control this process are (1) how often the current word type appears in each topic and (2) how many times each topic appears in the current document. The smoothing parameters ensure that these values are never zero for any topic: beta for the first factor, alpha for the second.
You can think of the alpha parameter being displayed here as the number of "imaginary" words in each topic that are added. In the first case, topic 0 has 2.5 imaginary words of weight in every document. The default value for this parameter was initially 50 / numTopics. Larger values encourage models to have more uniform topic distributions in documents, smaller values encourage more sparsity. The general experience was that 50 was too large, and that 5 is a better default. This was changed in 2.0.8.
The default is to make the alpha weight equal for all topics. With hyperparameter optimization on, these values can vary. Usually what you will find is that a topic with a large value will contain "near stopwords" that are frequent in most documents and don't have much content. Topics with very small values are often unusual and distinctive documents. Topics in the middle are often the most interesting.
If I understand it correctly, the parameter is alpha, not beta.
You can use an asymmetric alpha using the flag
--optimize-interval INTEGER
to reestimate the hyperparameters every INTEGER iterations.

Simple Binary Text Classification

I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
Data is: title & abstract (mean=1300 characters)
Any approaches may be used or even combined, including supervised machine learning and/or by establishing features that give rise to some threshold values for inclusion, among other.
Approaches could draw on the key terms that describe the conceptual space, though simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, ..
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1=relevant, 0=irrelevant), would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!
Several Ideas:
Run LDA and get document-topic and topic-word distributions say (20 topics depending on your dataset coverage of different topics). Assign the top r% of the documents with highest relevant topic as relevant and low nr% as non-relevant. Then train a classifier over those labelled documents.
Just use bag of words and retrieve top r nearest negihbours to your query (your conceptual space) as relevant and borrom nr percent as not relevant and train a classifier over them.
If you had the citations you could run label propagation over the network graph by labelling very few papers.
Don't forget to make the title words different from your abstract words by changing the title words to title_word1 so that any classifier can put more weights on them.
Cluster the articles into say 100 clusters and then choose then manually label those clusters. Choose 100 based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If it is the case that the number of relevant documents is way less than non-relevant ones, then the best way to go is to find the nearest neighbours to your conceptual space (e.g. using information retrieval implemented in Lucene). Then you can manually go down in your ranked results until you feel the documents are not relevant anymore.
Most of these methods are Bootstrapping or Weakly Supervised approaches for text classification, about which you can more literature.

Number of Latent Semantic Indexing topics

I'm using gensim's package to implement LSI on a corpus. My goal is to find out the most frequently occurring distinct topics that appear in the corpus.
If I don't know the number of topics that are in the corpus (I'd estimate anywhere from 5 to 20), what is the best approach in setting the number of topics that LSI should search for? Is it better to look for a large number of topics (20-30), or a small number of topics (~5)?
From Radim himself:
that's a good question, but unfortunately without a good answer.
It is not true that increasing the number of dimensions always
improves retrieval accuracy. In fact, if you use all the dimensions
(=full rank of the training matrix), LSI will give you exactly the
same documents that you entered in, so LSI would become pointless.
If you're interested in the math side of it, have a look at this
issue: https://github.com/piskvorky/gensim/issues/28 Otherwise, just
set the dimensions to a few hundred~thousand which is the accepted
standard. Or try several different choices, measure the accuracy and
select dimensionality that works the best on your problem.
Best, Radim
This is what I do sometimes when I'm confused. Since you've already narrowed down to your topics from 5-20, you can iterate b/w some of these values and see which value fits the best.
##Declare values for N_TOPICS
for i in lda.show_topics(topics=-N_TOPICS, topn=20, log=False, formatted=True):
print "TOPIC {0}: {1}\n".format(count, i)

Supervised Learning for User Behavior over Time

I want to use machine learning to identify the signature of a user who converts to a subscriber of a website given their behavior over time.
Let's say my website has 6 different features which can be used before subscribing and users can convert to a subscriber at any time.
For a given user I have stats which represent the intensity on a continuous range of that user's interaction with features 1-6 on a daily basis so:
D1: f1,f2,f3,f4,f5,f6
D2: f1,f2,f3,f4,f5,f6
D3: f1,f2,f3,f4,f5,f6
D4: f1,f2,f3,f4,f5,f6
Let's say on day 5, the user converts.
What machine using algorithms would help me identify which are the most common patterns in feature usage which lead to a conversion?
(I know this is a super basic classification question, but I couldn't find a good example using longitudinal data, where input vectors are ordered by time like I have)
To develop the problem further, let's assume that each feature has 3 intensities at which the user can interact (H, M, L).
We can then represent each user as a string of states of interaction intensity. So, for a user:
LLLLMM LLMMHH LLHHHH
Would mean on day one they only interacted significantly with features 5 and 6, but by the third day they were interacting highly with features 3 through 6.
N-gram Style
I could make these states words and the lifetime of a user a sentence. (Would probably need to add a "conversion" word to the vocabulary as well)
If I ran these "sentences" through an n-gram model, I could get the likely future state of a user given his/her past few state which is somewhat interesting. But, what I really want to know the most common sets of n-grams that lead to the conversion word. Rather than feeding in an n-gram and getting the next predicted word, I want to give the predicted word and get back the 10 most common n-grams (from my data) which would be likely to lead to the word.
Amaç Herdağdelen suggests identifying n-grams to practical n and then counting how many n-gram states each user has. Then correlating with conversion data (I guess no conversion word in this example). My concern is that there would be too many n-grams to make this method practical. (if each state has 729 possibilities, and we're using trigrams, thats a lot of possible trigrams!)
Alternatively, could I just go thru the data logging the n-grams which led to the conversion word and then run some type of clustering on them to see what the common paths are to a conversion?
Survival Style
Suggested by Iterator, I understand the analogy to a survival problem, but the literature here seems to focus on predicting time to death as opposed to the common sequence of events which leads to death. Further, when looking up the Cox Proportional Hazard model, I found that it does not event accommodate variables which change over time (its good for differentiating between static attributes like gender and ethnicity)- so it seems very much geared toward a different question than mine.
Decision Tree Style
This seems promising though I can't completely wrap my mind around how to structure the data. Since the data is not flat, is the tree modeling the chance of moving from one state to another down the line and when it leads to conversion or not? This is very different than the decision tree data literature I've been able to find.
Also, need clarity on how to identify patterns which lead to conversion instead a models predicts likely hood of conversion after a given sequence.
Theoretically, hidden markov models may be a suitable solution to your problem. The features on your site would constitute the alphabet, and you can use the sequence of interactions as positive or negative instances depending on whether a user finally subscribed or not. I don't have a guess about what the number of hidden states should be, but finding a suitable value for that parameter is part of the problem, after all.
As a side note, positive instances are trivial to identify, but the fact that a user has not subscribed so far doesn't necessarily mean s/he won't. You might consider to limit your data to sufficiently old users.
I would also consider converting the data to fixed-length vectors and apply conceptually simpler models that could give you some intuition about what's going on. You could use n-grams (consecutive interaction sequences of length n).
As an example, assuming that the interaction sequence of a given user ise "f1,f3,f5", "f1,f3,f5" would constitute a 3-gram (trigram). Similarly, for the same user and the same interaction sequence you would have "f1,f3" and "f3,f5" as the 2-grams (bigrams). In order to represent each user as a vector, you would identify all n-grams up to a practical n, and count how many times the user employed a given n-gram. Each column in the vector would represent the number of times a given n-gram is observed for a given user.
Then -- probably with the help of some suitable normalization techniques such as pointwise mutual information or tf-idf -- you could look at the correlation between the n-grams and the final outcome to get a sense of what's going on, carry out feature selection to find the most prominent sequences that users are involved in, or apply classification methods such as nearest neighbor, support machine or naive Bayes to build a predictive model.
This is rather like a survival analysis problem: over time the user will convert or will may drop out of the population, or will continue to appear in the data and not (yet) fall into neither camp. For that, you may find the Cox proportional hazards model useful.
If you wish to pursue things from a different angle, namely one more from the graphical models perspective, then a Kalman Filter may be more appealing. It is a generalization of HMMs, suggested by #AmaçHerdağdelen, which work for continuous spaces.
For ease of implementation, I'd recommend the survival approach. It is the easiest to analyze, describe, and improve. After you have a firm handle on the data, feel free to drop in other methods.
Other than Markov chains, I would suggest decision trees or Bayesian networks. Both of these would give you a likely hood of a user converting after a sequence.
I forgot to mention this earlier. You may also want to take a look at the Google PageRank algorithm. It would help you account for the user completely disappearing [not subscribing]. The results of that would help you to encourage certain features to be used. [Because they're more likely to give you a sale]
I think Ngramm is most promising approach, because all sequnce in data mining are treated as elements depndent on few basic steps(HMM, CRF, ACRF, Markov Fields) So I will try to use classifier based on 1-grams and 2 -grams.

Resources