Spark LDA model prediction on new documents [duplicate] - apache-spark

I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to then use the model to find the topic distribution of a new, unseen document.

As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you need to do is convert your model to a LocalLDAModel using the toLocal method, and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents. Something like this:
val newDocuments: RDD[(Long, Vector)] = ...
val topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)
This is going to be less accurate than the EM algorithm that this paper suggests, but it will work. Alternatively, you could just use the new online variational EM training algorithm, which already results in a LocalLDAModel. Besides being faster, this new algorithm is also preferable because, unlike the older EM algorithm for fitting DistributedLDAModels, it optimizes the parameters (alphas) of the Dirichlet prior over the topic mixing weights for the documents. According to Wallach et al., optimizing the alphas is quite important for obtaining good topics.

Related

LDA Topic Model Performance - Topic Coherence Implementation for scikit-learn

I have a question around measuring/calculating topic coherence for LDA models built in scikit-learn.
Topic Coherence is a useful metric for measuring the human interpretability of a given LDA topic model. Gensim's CoherenceModel allows Topic Coherence to be calculated for a given LDA model (several variants are included).
I am interested in leveraging scikit-learn's LDA rather than gensim's LDA for ease of use and documentation (note: I would like to avoid using the gensim to scikit-learn wrapper i.e. actually leverage sklearn’s LDA). From my research, there is seemingly no scikit-learn equivalent to Gensim’s CoherenceModel.
Is there a way to either:
1 - Feed scikit-learn’s LDA model into gensim’s CoherenceModel pipeline, either through manually converting the scikit-learn model into gensim format or through a scikit-learn to gensim wrapper (I have seen the wrapper the other way around) to generate Topic Coherence?
Or
2 - Manually calculate topic coherence from scikit-learn’s LDA model and CountVectorizer/Tfidf matrices?
I have done quite a bit of research on implementations for this use case online but haven’t seen any solutions. The only leads I have are the documented equations from scientific literature.
If anyone has any knowledge on any similar implementations, or if you could point me in the right direction for creating a manual method for this, that would be great. Thank you!
*Side note: I understand that perplexity and log-likelihood are available in scikit-learn as performance measures, but from what I have read they do not correlate well with human judgements of topic quality.
Feed scikit-learn’s LDA model into gensim’s CoherenceModel pipeline
As far as I know, there is no "easy way" to do this. You would have to manually reformat the sklearn data structures to be compatible with gensim. I haven't attempted this myself, but this strikes me as an unnecessary step that might take a long time. There is an old Python 2.7 attempt at a gensim-sklearn-wrapper which you might want to look at, but it seems deprecated - maybe you can get some information/inspiration from that.
Manually calculate topic coherence from scikit-learn’s LDA model and CountVectorizer/Tfidf matrices?
The summing-up of vectors you need can be easily achieved with a loop. You can find code samples for a "manual" coherence calculation for NMF. Calculation depends on the specific measure, of course, but sklearn should return you the data you need for the analysis pretty easily.
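As an illustration of the manual route, UMass coherence for a single topic can be computed directly from a CountVectorizer-style document-term matrix. Everything below is invented for the example: the tiny matrix X stands in for the real vectorizer output, and top_words stands in for the top term indices you would extract from lda.components_:

```python
import numpy as np

# Hypothetical inputs: X mimics a CountVectorizer document-term matrix
# (rows = documents, columns = terms); top_words mimics the indices of
# a topic's top-N terms taken from lda.components_.
X = np.array([
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
])
top_words = [0, 1, 3]

def umass_coherence(X, top_words):
    """UMass coherence: sum of log((D(wi, wj) + 1) / D(wj)) over ordered
    pairs of the topic's top words, where D(.) counts documents."""
    X = (X > 0).astype(int)  # binarize: presence/absence per document
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            d_wj = X[:, wj].sum()                  # docs containing wj
            d_wi_wj = (X[:, wi] & X[:, wj]).sum()  # docs containing both
            score += np.log((d_wi_wj + 1) / d_wj)
    return score

score = umass_coherence(X, top_words)
```

The same loop works unchanged on a real (dense) document-term matrix; other coherence measures (e.g. the sliding-window C_v family) need extra co-occurrence statistics that a plain document-term matrix does not carry.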
Resources
It is unclear to me why you would categorically exclude gensim - the topic coherence pipeline is pretty extensive, and documentation exists.
See, for example, these three tutorials (in Jupyter notebooks).
Demonstration of the topic coherence pipeline in Gensim
Performing Model Selection Using Topic Coherence
Benchmark testing of coherence pipeline on Movies dataset
The formulas for several coherence measures can be found in this paper here.

Online learning of LDA model in Spark

Is there a way to train an LDA model in an online-learning fashion, i.e. loading a previously trained model and updating it with new documents?
Answering myself : it is not possible as of now.
Actually, Spark has two implementations for LDA model training, and one of them is OnlineLDAOptimizer. This approach is specifically designed to incrementally update the model with mini-batches of documents.
The optimizer implements the online variational Bayes LDA algorithm, which processes a subset of the corpus on each iteration and updates the term-topic distribution adaptively.
Original Online LDA paper: Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation." NIPS, 2010.
Unfortunately, the current mllib API does not allow loading a previously trained LDA model and adding a batch to it.
Some mllib models support an initialModel as a starting point for incremental updates (see KMeans or GMM), but LDA does not currently support that. I filed a JIRA for it: SPARK-20082. Please upvote ;-)
For the record, there's also a JIRA for streaming LDA SPARK-8696
I don't think that such a thing exists. LDA is a probabilistic parameter estimation algorithm (a very simplified explanation of the process here: LDA explained), and adding even a few documents would change all previously computed probabilities, so you would literally have to recompute the model.
I don't know about your use case, but you could think about doing a batch update if your model converges in a reasonable time, and discard some of the oldest documents at each re-computation to make the estimation faster.
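The batch-update-with-a-sliding-window idea above can be sketched in a few lines; train_lda here is a hypothetical placeholder for the actual Spark training call, and WINDOW_SIZE is an arbitrary example value:

```python
from collections import deque

WINDOW_SIZE = 10_000
window = deque(maxlen=WINDOW_SIZE)  # oldest documents fall off automatically

def train_lda(docs):
    # Placeholder for the real training call,
    # e.g. LDA.train(sc.parallelize(docs), k=...)
    return {"n_docs": len(docs)}

def on_new_batch(new_docs):
    window.extend(new_docs)         # add the latest documents
    return train_lda(list(window))  # full recompute, but only on the window

# Example: after seeing 15,000 documents, only the most
# recent WINDOW_SIZE are used for the retraining run.
model = on_new_batch(range(15_000))
```

The recompute is still a full training run, but bounding the corpus keeps each run's cost roughly constant instead of growing with the document stream.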

OnlineLDA in Spark: can I update the model?

In Spark 2.0.1 (pyspark), I want to learn an LDA model with the online optimizer. Does this version of the optimizer make it possible to update the model each day (for example)? I'm not sure I understand the meaning of online here and its implications. Does it mean that:
A) I have to load the entire corpus, and the model will learn in mini-batches (and because of that, maybe be faster than its EM counterpart).
B) I can submit to the learner a fraction of the corpus and get a first model and subsequently submit another fraction and get an upgraded version of the first model.
Thanks for clarifying
EDIT: to be specific, what I do is:
from pyspark.ml.clustering import LDA
lda = LDA(k=nclusters, seed=1, optimizer="online")
ldaModel = lda.fit(mydf.select([mydf["id"], mydf["features"]]))
With my ldaModel fitted, can I upgrade it with a new df? It should be possible in my opinion, since the online optimizer essentially does that: it samples the corpus at each iteration and updates the model against a subset of it, doesn't it?


How to update Spark MatrixFactorizationModel for ALS

I build a simple recommendation system for the MovieLens DB inspired by https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html.
I also have problems with explicit training like here: Apache Spark ALS collaborative filtering results. They don't make sense
Using implicit training (on both explicit and implicit data) gives me reasonable results, but explicit training doesn't.
While this is OK for me for now, I'm curious about how to update a model. My current solution works like this:
having all user ratings
generate model
get recommendations for user
I want to have a flow like this:
having a base of ratings
generate model once (optional save & load it)
get some ratings by one user on 10 random movies (not in the model!)
get recommendations using the model and the new user ratings
Therefore I must update my model without completely recomputing it. Is there any chance to do so?
While the first way is good for batch processing (like generating recommendations in nightly batches), the second way would be good for near-live generation of recommendations.
Edit: the following worked for me because I had implicit feedback ratings and was only interested in ranking the products for a new user.
More details here
You can actually get predictions for new users using the trained model (without updating it):
To get predictions for a user in the model, you use their latent representation (a vector u of size f, the number of factors), which is multiplied by the product latent factor matrix (a matrix made of the latent representations of all products, i.e. a bunch of vectors of size f) and gives you a score for each product. For new users, the problem is that you don't have access to their latent representation (you only have their full representation of size M, the number of different products). What you can do is compute an approximate latent representation for this new user by multiplying their full representation by the transpose of the product matrix.
i.e. if your user latent matrix is u and your product latent matrix is v, then for user i in the model you get scores by doing: u_i * v
For a new user, you don't have a latent representation, so take the full representation full_u and do: full_u * v^t * v
This approximates the latent factors for the new user and should give reasonable recommendations (if the model already gives reasonable recommendations for existing users).
To answer the question about training: this lets you compute predictions for new users without redoing the heavy model computation, which you now only need to run once in a while. So you have your batch processing at night and can still make predictions for new users during the day.
Note: MLlib gives you access to the matrices u and v.
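The fold-in trick above can be sketched with NumPy, assuming the product factor matrix has been collected locally. All numbers below are invented for the example; V has one row per product, so full_u @ V corresponds to full_u * v^t in the notation above:

```python
import numpy as np

# Hypothetical product latent factor matrix (4 products, f = 2 factors),
# standing in for a collected MatrixFactorizationModel.productFeatures.
V = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

# The new user's full representation: implicit ratings over all 4 products.
full_u = np.array([1.0, 1.0, 0.0, 0.0])

# Fold-in: project the ratings into latent space, then score every product.
u_approx = full_u @ V    # full_u * v^t  -> approximate latent vector, size f
scores = u_approx @ V.T  # (full_u * v^t) * v -> one score per product

ranking = np.argsort(-scores)  # best-scoring products first
```

Since the user liked the first two (action-leaning) products, the fold-in ranks those same products highest; in practice you would filter out items the user has already consumed before recommending.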
It seems like you want to be doing some kind of online learning, i.e. actually updating the model while receiving data. Spark MLlib has limited streaming machine learning options: there's a streaming linear regression and a streaming K-Means.
Many machine learning problems work just fine with batch solutions, perhaps retraining the model every few hours or days. There are a few strategies for working around the gap in the meantime.
One option could be an ensemble model where you combine the results of your ALS with another model that helps make predictions about unseen movies.
If you expect to see a lot of previously unseen movies, though, collaborative filtering probably doesn't do what you want. If those new movies aren't in the model at all, there's no way for the model to know what other people who watched them liked.
A better option might be to take a different strategy and try some kind of latent semantic analysis on the movies, modelling concepts of what a movie is (like genre, themes, etc.). That way, new movies with various properties can fit into the existing model, and ratings affect how strongly those properties interact.
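As a toy illustration of that content-based idea (all movie properties and ratings below are invented), a taste profile built from movie properties can score a brand-new movie that has no ratings at all:

```python
import numpy as np

# Hypothetical content features: one row per known movie,
# columns = [action, comedy, drama].
movie_props = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
user_ratings = np.array([5.0, 4.0, 1.0])  # this user's ratings of those movies

# The user's taste profile: a ratings-weighted average of movie properties.
profile = user_ratings @ movie_props / user_ratings.sum()

# A never-rated movie can now be scored from its properties alone.
new_movie = np.array([1.0, 0.0, 0.0])  # a pure action film
score = profile @ new_movie
```

Unlike pure collaborative filtering, this scores the unseen movie immediately; the trade-off is that the quality now depends on how well the hand-chosen properties describe the catalogue.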
