Is there a way to train an LDA model in an online-learning fashion, i.e. load a previously trained model and update it with new documents?
Answering my own question: it is not possible as of now.
Actually, Spark has two implementations for LDA model training, and one of them is OnlineLDAOptimizer. This approach is specifically designed to incrementally update the model with mini-batches of documents.
The optimizer implements the online variational Bayes LDA algorithm, which processes a subset of the corpus on each iteration and updates the term-topic distribution adaptively.
Original Online LDA paper: Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation." NIPS, 2010.
Unfortunately, the current mllib API does not allow loading a previously trained LDA model and adding a batch of documents to it.
Some mllib models support an initialModel as a starting point for incremental updates (see KMeans or GMM), but LDA does not currently support that. I filed a JIRA for it: SPARK-20082. Please upvote ;-)
For the record, there's also a JIRA for streaming LDA: SPARK-8696.
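To make the "online" part concrete: the core mini-batch update of online variational Bayes (Hoffman, Blei and Bach, 2010), which OnlineLDAOptimizer implements internally, can be sketched in a few lines of plain numpy. The function name and the way `lambda_hat` is obtained are illustrative assumptions, not Spark API:

```python
import numpy as np

def online_vb_step(lambda_, lambda_hat, t, tau0=1.0, kappa=0.7):
    """Blend current topic-word parameters with a mini-batch estimate.

    lambda_    : current variational parameters, shape (n_topics, vocab_size)
    lambda_hat : estimate computed from the current mini-batch alone
    t          : iteration counter (0, 1, 2, ...)
    tau0, kappa: learning-rate schedule from Hoffman et al. (2010);
                 kappa in (0.5, 1] is required for convergence
    """
    rho = (t + tau0) ** (-kappa)          # step size decays over iterations
    return (1.0 - rho) * lambda_ + rho * lambda_hat
```

Each call blends the running topic-word parameters with the mini-batch estimate; as t grows, rho shrinks, so later batches perturb the model less and less.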
I don't think such a thing exists. LDA is a probabilistic parameter estimation algorithm (a very simplified explanation of the process: LDA explained), and adding even a few documents would change all previously computed probabilities, so the model would literally have to be recomputed.
I don't know your use case, but you could consider updating by batch, if your model converges in a reasonable time, and discarding some of the oldest documents at each re-computation to keep the estimation fast.
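That batch-with-a-sliding-window idea can be sketched like this; `train_lda` is a placeholder for whatever training call you actually use (Spark, gensim, sklearn), and the window size is an assumption to tune:

```python
from collections import deque

WINDOW = 1000  # how many recent documents to keep; a tuning assumption

corpus_window = deque(maxlen=WINDOW)  # oldest documents fall off automatically

def retrain_on_batch(new_docs, train_lda):
    """Add the newest batch, let the deque drop documents past the window,
    and recompute the model from scratch on the current window only."""
    corpus_window.extend(new_docs)
    return train_lda(list(corpus_window))
```

Full retraining keeps LDA's globally estimated probabilities honest, while the bounded window keeps each recomputation fast.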
Related
I have a question around measuring/calculating topic coherence for LDA models built in scikit-learn.
Topic Coherence is a useful metric for measuring the human interpretability of a given LDA topic model. Gensim's CoherenceModel allows Topic Coherence to be calculated for a given LDA model (several variants are included).
I am interested in leveraging scikit-learn's LDA rather than gensim's LDA for ease of use and documentation (note: I would like to avoid using the gensim to scikit-learn wrapper i.e. actually leverage sklearn’s LDA). From my research, there is seemingly no scikit-learn equivalent to Gensim’s CoherenceModel.
Is there a way to either:
1 - Feed scikit-learn’s LDA model into gensim’s CoherenceModel pipeline, either through manually converting the scikit-learn model into gensim format or through a scikit-learn to gensim wrapper (I have seen the wrapper the other way around) to generate Topic Coherence?
Or
2 - Manually calculate topic coherence from scikit-learn’s LDA model and CountVectorizer/Tfidf matrices?
I have done quite a bit of research on implementations for this use case online but haven’t seen any solutions. The only leads I have are the documented equations from scientific literature.
If anyone has any knowledge on any similar implementations, or if you could point me in the right direction for creating a manual method for this, that would be great. Thank you!
*Side note: I understand that perplexity and log-likelihood are available in scikit-learn for performance measurement, but from what I have read they are not as predictive of human interpretability.
Feed scikit-learn’s LDA model into gensim’s CoherenceModel pipeline
As far as I know, there is no "easy way" to do this. You would have to manually reformat the sklearn data structures to be compatible with gensim. I haven't attempted this myself, but this strikes me as an unnecessary step that might take a long time. There is an old Python 2.7 attempt at a gensim-sklearn-wrapper which you might want to look at, but it seems deprecated - maybe you can get some information/inspiration from that.
Manually calculate topic coherence from scikit-learn’s LDA model and CountVectorizer/Tfidf matrices?
The summing-up of vectors you need can easily be done with a loop. You can find code samples for a "manual" coherence calculation for NMF. The calculation depends on the specific measure, of course, but sklearn should hand you the data you need for the analysis fairly easily.
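As an illustration of option 2, here is a minimal UMass coherence computation (Mimno et al., 2011) for a single topic, working directly from the dense document-term matrix a CountVectorizer produces (call `.toarray()` on the sparse output first). The function name is mine; only the formula comes from the literature:

```python
import numpy as np

def umass_coherence(top_word_ids, doc_term_counts, eps=1.0):
    """UMass coherence for one topic.

    top_word_ids    : vocabulary indices of the topic's top-N words
    doc_term_counts : (n_docs, vocab_size) count matrix, e.g. from CountVectorizer
    """
    present = (doc_term_counts > 0)       # does the word occur in the doc at all?
    doc_freq = present.sum(axis=0)        # D(w): documents containing w
    score = 0.0
    for i in range(1, len(top_word_ids)):
        for j in range(i):
            wi, wj = top_word_ids[i], top_word_ids[j]
            # D(wi, wj): documents containing both words
            co = np.logical_and(present[:, wi], present[:, wj]).sum()
            score += np.log((co + eps) / doc_freq[wj])
    return score
```

You would call this once per topic, passing the indices of that topic's top-N words taken from `lda.components_`, and then average the scores over topics.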
Resources
It is unclear to me why you would categorically exclude gensim - the topic coherence pipeline is pretty extensive, and documentation exists.
See, for example, these three tutorials (in Jupyter notebooks).
Demonstration of the topic coherence pipeline in Gensim
Performing Model Selection Using Topic Coherence
Benchmark testing of coherence pipeline on Movies dataset
The formulas for several coherence measures can be found in this paper here.
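For quick reference, the UMass measure for a topic's top-N words is usually written (up to a normalization convention that varies between papers) with D(w) the document frequency and D(w_i, w_j) the co-document frequency:

```latex
C_{\mathrm{UMass}} = \frac{2}{N(N-1)} \sum_{i=2}^{N} \sum_{j=1}^{i-1}
    \log \frac{D(w_i, w_j) + 1}{D(w_j)}
```

and the sliding-window measures (such as gensim's c_npmi and c_v) build on the normalized pointwise mutual information of word pairs:

```latex
\mathrm{NPMI}(w_i, w_j) =
    \frac{\log \dfrac{P(w_i, w_j) + \epsilon}{P(w_i)\,P(w_j)}}
         {-\log \big( P(w_i, w_j) + \epsilon \big)}
```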
I computed document similarities on my corpus using Doc2Vec, and the similarities it outputs are not that good. I was wondering if I could build a topic model from what Doc2Vec gives me, to increase the accuracy of my model and get better similarities?
You should train a new model (like LDA) from the original corpus.
If the native similarities given by the Doc2Vec process aren't very good, maybe you can improve them by tuning your process.
But if that doesn't work, then Doc2Vec hasn't distilled useful info from your data – and downstream calculations built on those (bad) raw numbers aren't likely to get magically better.
In Spark 2.0.1 (pyspark), I want to learn an LDA model with the online optimizer. Does this optimizer make it possible to update the model each day (for example)? I'm not sure I understand the meaning of online here and its implications. Does it mean that:
A) I have to load the entire corpus, and the model will learn by mini-batches (and, because of that, may be faster than its EM counterpart)?
B) I can submit a fraction of the corpus to the learner and get a first model, then subsequently submit another fraction and get an upgraded version of the first model?
Thanks for clarifying
EDIT: to be specific, what I do is:
from pyspark.ml.clustering import LDA
lda = LDA(k=nclusters, seed=1, optimizer="online")
ldaModel = lda.fit(mydf.select([mydf["id"],mydf["features"]]))
With my ldaModel fitted, can I update it with a new df? That should be possible, in my opinion, since the online optimizer essentially does just that: sample the corpus at each iteration and update the model against that subset, doesn't it?
I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to then use the model to find the topic distribution of a new, unseen document.
As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you're going to need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents, something like this:
val newDocuments: RDD[(Long, Vector)] = ...
val topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)
This is going to be less accurate than the EM algorithm that this paper suggests, but it will work. Alternatively, you could just use the new online variational EM training algorithm, which already results in a LocalLDAModel. In addition to being faster, this new algorithm is preferable because, unlike the older EM algorithm for fitting DistributedLDAModels, it optimizes the parameters (alphas) of the Dirichlet prior over the topic mixing weights for the documents. According to Wallach et al., optimizing the alphas is pretty important for obtaining good topics.