Number of Latent Semantic Indexing topics - topic-modeling

I'm using gensim's package to implement LSI on a corpus. My goal is to find out the most frequently occurring distinct topics that appear in the corpus.
If I don't know the number of topics that are in the corpus (I'd estimate anywhere from 5 to 20), what is the best approach in setting the number of topics that LSI should search for? Is it better to look for a large number of topics (20-30), or a small number of topics (~5)?

From Radim himself:
that's a good question, but unfortunately without a good answer.
It is not true that increasing the number of dimensions always
improves retrieval accuracy. In fact, if you use all the dimensions
(=full rank of the training matrix), LSI will give you exactly the
same documents that you entered in, so LSI would become pointless.
If you're interested in the math side of it, have a look at this
issue: https://github.com/piskvorky/gensim/issues/28 Otherwise, just
set the dimensions to a few hundred~thousand which is the accepted
standard. Or try several different choices, measure the accuracy and
select dimensionality that works the best on your problem.
Best, Radim

This is what I do sometimes when I'm confused. Since you've already narrowed your range down to 5-20 topics, you can iterate over some of these values and see which one fits best.
# Declare a value for N_TOPICS, then inspect the resulting topics
for count, topic in enumerate(lsi.show_topics(num_topics=N_TOPICS, num_words=20, formatted=True)):
    print("TOPIC {0}: {1}\n".format(count, topic))

Related

Topic Modeling with Mallet - topic keys output parameter

I have a follow-up question to the one asked here: Mallet topic modeling - topic keys output parameter
I hope I can still get a more detailed explanation of this subject because I have trouble understanding these numbers in the output files.
What can the summation of the output numbers tell us? For example, with 20 topics and an optimization value of 20 over 2000 iterations, the summation of the output is approximately 2. With the same corpus, but with 15 topics/1000 iterations/optimization 10, the result is 0.77, and with 10 topics/1000 iterations/optimization 10 it's 0.72. What does this mean? Does it even mean anything?
Also, these people are referring to these results as parameters, but to my understanding, the parameter is the optimization interval and not the result in the output. So what is the correct way to refer to the result in the output? Frequency of the topic? Is it a percentage of something? What part did I get wrong?
You're correct that parameter is being used to mean two different things here.
Parameters of the statistical model are values that determine the properties of that model. In this case they determine which topics we expect to occur more often, and how confident we are of that. In some cases these are set by the user, in other cases they are set by the inference algorithm.
Parameters of the inference algorithm are settings that determine the procedure by which we set the parameters of the statistical model.
An additional confusion is that when model parameters are explicitly set by the user, Mallet uses the same interface as for algorithm settings.
The numbers you see are the parameters of a Dirichlet distribution that describes our prior expectation of the mix of topics in a document. You can think of it as having two parts: proportions and magnitude. If you rescale the numbers to add up to 1.0, the resulting proportions would tell you the model's guess at which topics occur most frequently. The actual sum of the numbers (the magnitude) tells you how confident the model is that this is the actual proportion you will see in a document. Smaller values indicate more variability.
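To make that concrete, here is a small sketch (with made-up alpha values, not your actual output) of splitting the Dirichlet parameters into proportions and magnitude:

# Hypothetical per-topic alpha values as printed by Mallet
alphas = [0.31, 0.12, 0.08, 0.05]
magnitude = sum(alphas)                          # the "confidence" part; smaller = more variability
proportions = [a / magnitude for a in alphas]    # the model's guess at which topics occur most often
print(magnitude, proportions)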
A possible explanation for the numbers you're seeing (and please treat this as raw speculation) is that the 20 topic model has more flexibility to fit consistent topics, and so it is about three times more confident that there are topics that consistently occur more often in documents. As the number of topics decreases, the specificity of topics drops, so it is more likely that any particular topic could be large in any given document.

When using word alignment tools like fast_align, does more sentences mean better accuracy?

I am using fast_align https://github.com/clab/fast_align to get word alignments between 1000 German sentences and 1000 English translations of those sentences. So far the quality is not so good.
Would throwing more sentences into the process help fast_align to be more accurate? Say I take some OPUS data with 100k aligned sentence pairs and then add my 1000 sentences in the end of it and feed it to fast_align. Will that help? I can't seem to find any info on whether this would make sense.
[Disclaimer: I know next to nothing about alignment and have not used fast_align.]
Yes.
You can prove this to yourself, and also plot the accuracy/scale curve, by removing data from your dataset to try it at an even lower scale.
That said, 1000 sentence pairs is already extremely low; for these purposes 1000 ≈ 0, and I would not expect it to work.
Better would be to try 10K, 100K, and 1M. For results comparable to others', use a standard corpus, e.g. Wikipedia or data from the research workshops.
Adding data very different from the data that matters to you can have mixed results, but in this case more data can hardly hurt. We could give more helpful suggestions if you mention a specific domain, dataset, or goal.
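As a rough sketch of the concatenation idea (file names and sizes are hypothetical; fast_align expects one "source ||| target" pair per line):

import subprocess

# Append your 1000 pairs to the larger OPUS corpus, align everything,
# then keep only the alignments for your own sentences (the last 1000 lines).
with open("combined.de-en", "w", encoding="utf-8") as out:
    for path in ("opus.de-en", "my_1000.de-en"):
        with open(path, encoding="utf-8") as f:
            out.write(f.read())

with open("combined.align", "w", encoding="utf-8") as align_out:
    subprocess.run(["fast_align", "-i", "combined.de-en", "-d", "-o", "-v"],
                   stdout=align_out, check=True)

with open("combined.align", encoding="utf-8") as f:
    my_alignments = f.read().splitlines()[-1000:]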

Simple Binary Text Classification

I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
Data is: title & abstract (mean=1300 characters)
Any approach may be used, or even several combined, including supervised machine learning and/or establishing features that give rise to threshold values for inclusion, among others.
Approaches could draw on the key terms that describe the conceptual space, though simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, ..
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1=relevant, 0=irrelevant), would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!
Several Ideas:
Run LDA and get the document-topic and topic-word distributions (say 20 topics, depending on how well your dataset covers different topics). Assign the top r% of documents with the highest weight on the relevant topic(s) as relevant and the bottom nr% as non-relevant, then train a classifier over those labelled documents.
Just use bag of words and retrieve the top r% nearest neighbours to your query (your conceptual space) as relevant and the bottom nr% as non-relevant, and train a classifier over them (see the sketch after this list).
If you had the citations you could run label propagation over the network graph by labelling very few papers.
Don't forget to distinguish title words from abstract words, e.g. by rewriting each title token as title_word1, so that a classifier can put more weight on them.
Cluster the articles into, say, 100 clusters and then manually label those clusters. Choose the number of clusters based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If it is the case that the number of relevant documents is way less than non-relevant ones, then the best way to go is to find the nearest neighbours to your conceptual space (e.g. using information retrieval implemented in Lucene). Then you can manually go down in your ranked results until you feel the documents are not relevant anymore.
Most of these methods are bootstrapping or weakly supervised approaches to text classification, about which you can find more literature.
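A minimal sketch of idea 2 (pseudo-labelling by bag-of-words similarity to the conceptual space, then training a classifier), assuming docs is the list of title+abstract strings and concept_query is a short description of the conceptual space (both hypothetical names):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
X = vectorizer.fit_transform(docs)
q = vectorizer.transform([concept_query])

sims = cosine_similarity(X, q).ravel()
order = np.argsort(sims)
r = int(0.05 * len(docs))                       # top/bottom 5% as pseudo-labels
idx = np.concatenate([order[-r:], order[:r]])   # most similar, then least similar
y = np.array([1] * r + [0] * r)

clf = LogisticRegression(max_iter=1000).fit(X[idx], y)
relevance = clf.predict_proba(X)[:, 1]          # scores for all 800k+ articles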

Systematic threshold for cosine similarity with TF-IDF weights

I am running an analysis of several thousand (e.g., 10,000) text documents. I have computed TF-IDF weights and have a matrix with pairwise cosine similarities. I want to treat the documents as a graph to analyze various properties (e.g., the path length separating groups of documents) and to visualize the connections as a network.
The problem is that there are too many similarities. Most are too small to be meaningful. I see many people dealing with this problem by dropping all similarities below a particular threshold, e.g., similarities below 0.5.
However, 0.5 (or 0.6, or 0.7, etc.) is an arbitrary threshold, and I'm looking for techniques that are more objective or systematic to get rid of tiny similarities.
I'm open to many different strategies. For example, is there a different alternative to tf-idf that would make most of the small similarities 0? Other methods to keep only significant similarities?
In short, take the average cosine value of an initial clustering or even all of the initial sentences and accept or reject clusters based on something akin to the following.
One way to look at the problem is to develop a cutoff based on distance from the mean similarity: 1.5 standard deviations (roughly the 86th percentile if the data were normal) tends to mark an outlier, and 3 standard deviations (99.9th percentile) an extreme outlier, so take the high end for good measure. I cannot remember where, but this idea has had traction in other forums and formed the basis for my similarity threshold.
Keep in mind that the data is not likely to be normally distributed.
threshold = average(cosine_similarities) + alpha * standard_deviation(cosine_similarities)
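A minimal sketch of applying such a cutoff to prune the similarity graph, assuming sims is the dense pairwise cosine-similarity matrix and alpha has been chosen as discussed below:

import numpy as np

upper = sims[np.triu_indices_from(sims, k=1)]       # each pair counted once
threshold = upper.mean() + alpha * upper.std()
adjacency = np.where(sims >= threshold, sims, 0.0)  # drop edges below the cutoff
np.fill_diagonal(adjacency, 0.0)                    # no self-loops in the graph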
In order to obtain alpha, you could use the Wu-Palmer score or another score as described by NLTK. Strong Wu-Palmer similarities should lead to a larger range of acceptance, while lower Wu-Palmer scores should lead to stricter acceptance; therefore taking 1 - Wu-Palmer score would be advisable. You can even use this method for LSA or LDA groups. To be even stricter and take things close to 1.5 or more standard deviations, you could even try 1 + Wu-Palmer (the cream of the crop), re-find the best K, find the new score, cluster, and repeat.
Beware though: this would mean finding the Wu-Palmer score of all relevant word pairs, which is quite a large computational problem. Also, 10,000 documents is small compared to most collections in the literature; the smallest I have seen for tweets was 15,000, and the 20 Newsgroups set has about 20,000 documents. I am pretty sure the Alchemy API uses something akin to the 20 Newsgroups set; they definitely use SentiWordNet.
The basic equation is not really mine so feel free to dig around for it.
Another thing to keep in mind is that the calculation is time intensive. It may be a good idea to use a Student's t value for estimating the expected value/mean Wu-Palmer score of SOV pairings, especially if you try to take the entire sentence. Commons Math3 for Java/Scala includes the distribution, as does SciPy for Python, and R has it built in.
mean ± t_(alpha/2) * sample_std / sqrt(sample_size)
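For example, with SciPy the interval could be computed along these lines (wu_palmer_samples is a hypothetical array of sampled Wu-Palmer scores):

import numpy as np
from scipy import stats

scores = np.asarray(wu_palmer_samples)
n = scores.size
t = stats.t.ppf(0.975, df=n - 1)                    # two-sided 95% interval
half_width = t * scores.std(ddof=1) / np.sqrt(n)
interval = (scores.mean() - half_width, scores.mean() + half_width)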
Note: There is another option with this weight. You could use an algorithm that adds or subtracts from this threshold until achieving the best result. This would likely not be related solely to the cosine importance but possibly to an inflection point or gap as with Tibshirani's gap statistic.

Supervised Learning for User Behavior over Time

I want to use machine learning to identify the signature of a user who converts to a subscriber of a website given their behavior over time.
Let's say my website has 6 different features which can be used before subscribing and users can convert to a subscriber at any time.
For a given user I have stats which represent the intensity on a continuous range of that user's interaction with features 1-6 on a daily basis so:
D1: f1,f2,f3,f4,f5,f6
D2: f1,f2,f3,f4,f5,f6
D3: f1,f2,f3,f4,f5,f6
D4: f1,f2,f3,f4,f5,f6
Let's say on day 5, the user converts.
What machine learning algorithms would help me identify the most common patterns in feature usage that lead to a conversion?
(I know this is a super basic classification question, but I couldn't find a good example using longitudinal data, where input vectors are ordered by time like I have)
To develop the problem further, let's assume that each feature has 3 intensities at which the user can interact (H, M, L).
We can then represent each user as a string of states of interaction intensity. So, for a user:
LLLLMM LLMMHH LLHHHH
Would mean on day one they only interacted significantly with features 5 and 6, but by the third day they were interacting highly with features 3 through 6.
N-gram Style
I could make these states words and the lifetime of a user a sentence. (Would probably need to add a "conversion" word to the vocabulary as well)
If I ran these "sentences" through an n-gram model, I could get the likely future state of a user given his/her past few states, which is somewhat interesting. But what I really want to know is the most common sets of n-grams that lead to the conversion word. Rather than feeding in an n-gram and getting the next predicted word, I want to give the predicted word and get back the 10 most common n-grams (from my data) that would be likely to lead to that word.
Amaç Herdağdelen suggests identifying n-grams up to a practical n and then counting how many of each n-gram state each user has, then correlating with the conversion data (I guess there is no conversion word in this example). My concern is that there would be too many n-grams to make this method practical. (If each state has 729 possibilities and we're using trigrams, that's a lot of possible trigrams!)
Alternatively, could I just go through the data logging the n-grams which led to the conversion word and then run some type of clustering on them to see what the common paths to a conversion are?
Survival Style
Suggested by Iterator. I understand the analogy to a survival problem, but the literature here seems to focus on predicting time to death as opposed to the common sequence of events which leads to death. Further, when looking up the Cox proportional hazards model, I found that it does not even accommodate variables which change over time (it's good for differentiating between static attributes like gender and ethnicity), so it seems geared toward a different question than mine.
Decision Tree Style
This seems promising, though I can't completely wrap my mind around how to structure the data. Since the data is not flat, is the tree modeling the chance of moving from one state to another down the line, and whether that leads to conversion or not? This is very different from the decision tree literature I've been able to find.
Also, I need clarity on how to identify the patterns that lead to conversion, rather than a model that predicts the likelihood of conversion after a given sequence.
Theoretically, hidden Markov models may be a suitable solution to your problem. The features on your site would constitute the alphabet, and you can use the sequences of interactions as positive or negative instances depending on whether a user finally subscribed or not. I don't have a guess about what the number of hidden states should be, but finding a suitable value for that parameter is part of the problem, after all.
As a side note, positive instances are trivial to identify, but the fact that a user has not subscribed so far doesn't necessarily mean s/he won't. You might consider limiting your data to sufficiently old users.
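A rough sketch of the two-model HMM idea, assuming hmmlearn's CategoricalHMM is available and using hypothetical lists pos_seqs/neg_seqs of integer-encoded daily states per user:

import numpy as np
from hmmlearn import hmm

def fit_hmm(seqs, n_states=4):
    # hmmlearn expects one long column of symbols plus per-sequence lengths
    X = np.concatenate(seqs).reshape(-1, 1)
    lengths = [len(s) for s in seqs]
    return hmm.CategoricalHMM(n_components=n_states, n_iter=100).fit(X, lengths)

pos_model = fit_hmm(pos_seqs)   # users who converted
neg_model = fit_hmm(neg_seqs)   # users who did not (yet) convert
new = np.asarray(new_user_states).reshape(-1, 1)
likely_converter = pos_model.score(new) > neg_model.score(new)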
I would also consider converting the data to fixed-length vectors and apply conceptually simpler models that could give you some intuition about what's going on. You could use n-grams (consecutive interaction sequences of length n).
As an example, assuming that the interaction sequence of a given user is "f1,f3,f5", "f1,f3,f5" would constitute a 3-gram (trigram). Similarly, for the same user and the same interaction sequence you would have "f1,f3" and "f3,f5" as the 2-grams (bigrams). In order to represent each user as a vector, you would identify all n-grams up to a practical n and count how many times the user employed each n-gram. Each column in the vector would represent the number of times a given n-gram is observed for a given user.
Then -- probably with the help of some suitable normalization technique such as pointwise mutual information or tf-idf -- you could look at the correlation between the n-grams and the final outcome to get a sense of what's going on, carry out feature selection to find the most prominent sequences that users are involved in, or apply classification methods such as nearest neighbours, support vector machines, or naive Bayes to build a predictive model.
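A small sketch of the n-gram representation and a crude correlation check, treating each user's day-states as space-separated "words"; user_sequences and converted are hypothetical names:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# e.g. user_sequences = ["LLLLMM LLMMHH LLHHHH", ...], converted = [1, 0, ...]
vectorizer = CountVectorizer(token_pattern=r"\S+", ngram_range=(1, 3))
X = vectorizer.fit_transform(user_sequences).toarray()
y = np.asarray(converted)

# correlate each n-gram count with the conversion label
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
top = np.argsort(corr)[::-1][:10]                  # 10 n-grams most associated with converting
print(vectorizer.get_feature_names_out()[top])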
This is rather like a survival analysis problem: over time a user will convert, drop out of the population, or continue to appear in the data without (yet) falling into either camp. For that, you may find the Cox proportional hazards model useful.
If you wish to pursue things from a different angle, namely one more from the graphical models perspective, then a Kalman filter may be more appealing. It is a generalization of HMMs, suggested by Amaç Herdağdelen, which works for continuous state spaces.
For ease of implementation, I'd recommend the survival approach. It is the easiest to analyze, describe, and improve. After you have a firm handle on the data, feel free to drop in other methods.
Other than Markov chains, I would suggest decision trees or Bayesian networks. Both of these would give you a likelihood of a user converting after a given sequence.
I forgot to mention this earlier. You may also want to take a look at the Google PageRank algorithm. It would help you account for the user completely disappearing [not subscribing]. The results of that would help you to encourage certain features to be used. [Because they're more likely to give you a sale]
I think the n-gram approach is the most promising, because most sequence models in data mining (HMM, CRF, ACRF, Markov random fields) treat a sequence as elements depending on a few previous steps. So I would try a classifier based on 1-grams and 2-grams.

Resources