Spark ALS recommendation score - apache-spark

I'm wondering if I can get somehow the sense if predicted values are good or bad. I get a score for each item but is there a way to find maximum available score? Let's say I want to recommended only those best and prediction returned 20 items. How to choose the good ones?
Thanks!

The standard solution is to simply sort the items by their predicted ratings (scores). There is no a priori bound on these values.
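For reference, a minimal PySpark sketch of that approach (the DataFrame names ratings_df and candidates_df and the column names are assumptions): the top-N lists returned by the model are already sorted by predicted score, and arbitrary (user, item) predictions can simply be ordered by the prediction column.

    from pyspark.ml.recommendation import ALS
    from pyspark.sql import functions as F

    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              coldStartStrategy="drop")
    model = als.fit(ratings_df)                      # ratings_df is assumed to exist

    # Top-5 items per user, already ordered by predicted score (highest first)
    top5 = model.recommendForAllUsers(5)

    # Or score explicit (user, item) candidates and sort them yourself
    preds = model.transform(candidates_df)
    best = preds.orderBy(F.desc("prediction"))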

Related

How to select the best topic model based on coherence score?

I would like to know how to interpret the graph in the picture. I am trying to obtain the best coherence score for my LDA model (alpha = 50/k, beta = 0.01, iterations = 10).
Since I use c_v instead of u_mass, I will select the maximum coherence value, but since the graph is not elbow-shaped I am not sure how to read it. Any hints would be helpful.

How can I evaluate the implicit feedback ALS algorithm for recommendations in Apache Spark?

How can you evaluate the implicit feedback collaborative filtering algorithm of Apache Spark, given that the implicit "ratings" can vary from zero to anything, so a simple MSE or RMSE does not have much meaning?
To answer this question, you'll need to go back to the original paper that defined implicit feedback and the ALS algorithm for it, Collaborative Filtering for Implicit Feedback Datasets
by Yifan Hu, Yehuda Koren and Chris Volinsky.
What is implicit feedback?
In the absence of explicit ratings, recommender systems can infer user preferences from the more abundant implicit feedback, which indirectly reflects opinions through observed user behavior.
Implicit feedback can include purchase history, browsing history, search patterns, or even mouse movements.
Do the same evaluation techniques, such as RMSE and MSE, apply here?
It is important to realize that we do not have reliable feedback regarding which items are disliked. The absence of a click or purchase can have many explanations, and we also can't track user reactions to our recommendations.
Thus, precision-based metrics such as RMSE and MSE are not very appropriate, as they require knowing which items users dislike in order to make sense.
However, purchasing or clicking on an item is an indication of interest in it. I wouldn't say "liking", because a click or a purchase can have a different meaning depending on the context of the recommender.
This makes recall-oriented measures applicable in this case. Under this scenario, several metrics have been introduced, the most important being the Mean Percentage Ranking (MPR), also known as Percentile Ranking.
Lower values of MPR are more desirable. The expected value of MPR for random predictions is 50%, and thus MPR > 50% indicates an algorithm no better than random.
Of course, it's not the only way to evaluate recommender systems with implicit ratings, but it's the most common one used in practice.
For more information about this metric, I advise you to read the paper stated above.
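For illustration, a small NumPy sketch of the metric under the definition in that paper (the dense matrices and their orientation are assumptions; real data would normally be sparse):

    import numpy as np

    def mean_percentage_ranking(r_ui, scores):
        """MPR following Hu, Koren & Volinsky.

        r_ui   : observed implicit feedback in the test period
                 (e.g. watch time, purchase counts), shape (n_users, n_items)
        scores : predicted preference scores from the model, same shape;
                 higher means "recommend earlier".
        """
        n_users, n_items = scores.shape
        order = np.argsort(-scores, axis=1)              # best item first
        # rank_ui: percentile rank of item i in user u's list,
        # 0.0 = recommended first, 1.0 = recommended last
        ranks = np.empty_like(scores, dtype=float)
        percentiles = np.tile(np.arange(n_items) / max(n_items - 1, 1), (n_users, 1))
        np.put_along_axis(ranks, order, percentiles, axis=1)
        return (r_ui * ranks).sum() / r_ui.sum()         # lower is better, ~0.5 is random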
Ok, now we know what we are going to use, but what about Apache Spark?
Apache Spark still doesn't provide an out-of-the-box implementation of this metric, but hopefully not for long. There is a PR waiting to be reviewed, https://github.com/apache/spark/pull/16618, which adds a RankingEvaluator to spark-ml.
The implementation nevertheless isn't complicated. You can refer to the code here if you are interested in getting it sooner.
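In the meantime, the RDD-based RankingMetrics class that already ships with Spark gives you precision@k, MAP and NDCG against a held-out set; a rough sketch, assuming DataFrames rec_df and truth_df with the column names shown:

    from pyspark.mllib.evaluation import RankingMetrics

    # rec_df:   (userId, recommended_items) -- item ids ordered by predicted score
    # truth_df: (userId, relevant_items)    -- items interacted with in the test period
    pairs = (rec_df.join(truth_df, "userId")
                   .rdd
                   .map(lambda r: (r["recommended_items"], r["relevant_items"])))

    metrics = RankingMetrics(pairs)
    print(metrics.precisionAt(10))
    print(metrics.meanAveragePrecision)
    print(metrics.ndcgAt(10))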
I hope this answers your question.
One way of evaluating it is to split the data into a training set and a test set with a time cut. That way you train the model on the training set, then run predictions and check them against the test set.
For evaluation you can then use metrics like precision, recall, and F1.
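As a sketch of that workflow in pandas/plain Python (the column names and the cutoff date are assumptions):

    import pandas as pd

    # events: DataFrame with columns userId, itemId, timestamp
    cutoff = pd.Timestamp("2017-01-01")              # the time cut is arbitrary here
    train = events[events.timestamp < cutoff]
    test = events[events.timestamp >= cutoff]

    # ... train the recommender on `train` and produce top-N lists per user ...

    def precision_recall(recommended, relevant):
        """recommended: user -> ordered list of items; relevant: user -> set of test items."""
        hits = sum(len(set(recs) & relevant.get(u, set()))
                   for u, recs in recommended.items())
        n_rec = sum(len(recs) for recs in recommended.values())
        n_rel = sum(len(items) for items in relevant.values())
        return hits / n_rec, hits / n_rel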

Truncated SVD Collaborative Filtering

I'm trying to implement collaborative filtering using sklearn's TruncatedSVD. However, I get a very high RMSE, because the predicted ratings for every recommendation come out very low.
I perform TruncatedSVD on a sparse matrix, and I was wondering whether these low predictions happen because TruncatedSVD treats non-rated movies as movies rated 0. If not, do you know what might cause the low predictions? Thanks!
It turned out that if your data set's numeric values don't meaningfully start at zero, you cannot apply TruncatedSVD without some adjustments. In the case of movie ratings, which run from 1 to 5, you need to mean-center the data so that zero takes on a meaning (an average rating). Mean-centering the data worked for me, and I started to get reasonable RMSE values.
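A minimal sketch of that mean-centering step with sklearn, assuming a dense user x item matrix R with NaN for missing ratings:

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    # R: user x item rating matrix with np.nan for unrated movies
    user_means = np.nanmean(R, axis=1, keepdims=True)
    R_centered = np.nan_to_num(R - user_means)       # unrated cells become 0 = "average"

    svd = TruncatedSVD(n_components=20, random_state=0)
    U = svd.fit_transform(R_centered)                # user factors scaled by singular values
    R_hat = U @ svd.components_ + user_means         # add the means back to get predictions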

Natural language query preprocessing

I am trying to implement a natural language query preprocessing module which would, given a query formulated in natural language, extract the keywords from that query and submit it to an Information Retrieval (IR) system.
At first, I thought about using some training set to compute tf-idf values of terms and using these values to estimate the importance of single words. But on second thought, this does not make sense in this scenario: I only have a training collection, and I don't have access to index the IR data. Would it be reasonable to use only the idf value for such an estimate? Or maybe another weighting approach?
Could you suggest how to tackle this problem? Usually, the articles about NLP processing that I read address training and test data sets. But what if I only have the query and training data?
tf-idf (it's not capitalized, fyi) is a good choice of weight. Your intuition is correct here. However, you don't compute tf-idf on your training set alone. Why? You need to really understand what tf and idf mean:
tf (term frequency) is a statistic that indicates how often a term appears in the document being evaluated. The simplest way to calculate it is as a boolean value, i.e. 1 if the term is in the document and 0 otherwise.
idf (inverse document frequency), on the other hand, measures how rare a term is, i.e. how unlikely it is to appear in a random document. It's most often calculated as log(N / number of documents containing the term).
Now, tf is calculated for each of the documents your IR system will be indexing over (if you don't have access to do this, then you have a much bigger, insurmountable problem, since an IR system without a source of truth is an oxymoron). Ideally, idf is calculated over your entire data set (i.e. all the documents you are indexing), but if this is prohibitively expensive, you can randomly sample your population to create a smaller data set, or use a reference corpus such as the Brown corpus.
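A small sketch of the resulting weighting, assuming you at least have some collection of tokenized documents (your training set or a reference corpus) to estimate idf from:

    import math
    from collections import Counter

    def idf_table(documents):
        """documents: iterable of token lists from whatever collection you have access to."""
        n_docs, df = 0, Counter()
        for tokens in documents:
            n_docs += 1
            df.update(set(tokens))
        return {term: math.log(n_docs / count) for term, count in df.items()}

    def keyword_weights(query_tokens, idf):
        # tf within a short query is essentially boolean, so each keyword's
        # weight reduces to its idf; unseen terms are treated as maximally rare.
        unseen = max(idf.values())
        return {t: idf.get(t, unseen) for t in set(query_tokens)}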

how to determine the number of topics for LDA?

I am new to LDA and I want to use it in my work. However, some problems have come up.
In order to get the best performance, I want to estimate the best number of topics. After reading "Finding Scientific Topics", I know that I can first calculate log P(w|z) and then use the harmonic mean of a series of P(w|z) values to estimate P(w|T).
My question is what does the "a series of" mean?
Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge, the hierarchical Dirichlet process (HDP) is quite possibly the best way to arrive at the optimal number of topics.
If you are looking for deeper analyses, this paper on HDP reports the advantages of HDP in determining the number of groups.
A reliable way is to compute the topic coherence for different numbers of topics and choose the model that gives the highest coherence. But the highest value may not always fit the bill.
See this topic modeling example.
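A compact gensim sketch of that sweep (the range of k values and the tokenized texts are assumptions):

    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    # texts: list of tokenized documents
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    scores = {}
    for k in range(5, 51, 5):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0)
        cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
        scores[k] = cm.get_coherence()

    best_k = max(scores, key=scores.get)             # but also eyeball the curve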
First, some people use the harmonic mean for finding the optimal number of topics; I tried it as well, but the results were unsatisfactory. So, my suggestion is: if you are using R, the package "ldatuning" will be useful. It provides four metrics for estimating the optimal number of topics. Perplexity and log-likelihood based V-fold cross-validation are also very good options for picking the number of topics, although V-fold cross-validation can be time-consuming for large data sets. You can also look at "A heuristic approach to determine an appropriate number of topics in topic modeling".
Important links:
https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597325/
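Since ldatuning is R-only, here is a rough Python equivalent of the log-likelihood / cross-validation idea with scikit-learn (the grid of topic counts and the document-term matrix X are assumptions; LatentDirichletAllocation.score() returns an approximate log-likelihood bound, which GridSearchCV maximizes by default):

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.model_selection import GridSearchCV

    # X: document-term count matrix, e.g. from CountVectorizer
    search = GridSearchCV(
        LatentDirichletAllocation(learning_method="batch", random_state=0),
        param_grid={"n_components": [5, 10, 20, 40]},
        cv=5,                                        # 5-fold cross-validation
    )
    search.fit(X)
    print(search.best_params_)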
Let k = number of topics
There is no single best way, and I am not even sure there are any standard practices for this.
Method 1:
Try out different values of k and select the one that has the largest likelihood.
Method 2:
Instead of LDA, see if you can use HDP-LDA
Method 3:
If HDP-LDA is infeasible on your corpus (because of the corpus size), then take a uniform sample of your corpus and run HDP-LDA on that, and take the value of k it gives. For a small interval around this k, use Method 1.
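For Methods 2 and 3, gensim's HdpModel can serve as the HDP-LDA step; a rough sketch, where the 1% weight threshold used to count "active" topics is only a heuristic assumption:

    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    # texts: tokenized documents, possibly a uniform sample of the corpus (Method 3)
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    hdp = HdpModel(corpus, dictionary)
    alpha, _beta = hdp.hdp_to_lda()                  # weights of the equivalent LDA topics
    k = int((alpha > 0.01 * alpha.sum()).sum())      # topics with non-negligible weight
    # use k (or a small interval around it, per Method 1) for plain LDA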
Since I am working on that same problem, I just want to add the method proposed by Wang et al. (2019) in their paper "Optimization of Topic Recognition Model for News Texts Based on LDA". Besides giving a good overview, they suggest a new method: first you train a word2vec model (e.g. using the word2vec package), then you apply a clustering algorithm capable of finding density peaks (e.g. from the densityClust package), and then use the number of clusters found as the number of topics in the LDA algorithm.
If time permits, I will try this out. I also wonder if the word2vec model can make the LDA obsolete.
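A rough sketch of that pipeline in Python, with sklearn's MeanShift standing in for the density-peak clustering used in the paper (it also infers the number of clusters on its own, though it can be slow on large vocabularies; gensim >= 4 and the parameter choices are assumptions):

    from gensim.models import Word2Vec
    from sklearn.cluster import MeanShift

    # texts: tokenized documents
    w2v = Word2Vec(sentences=texts, vector_size=100, min_count=5, workers=4)

    labels = MeanShift().fit_predict(w2v.wv.vectors)
    num_topics = len(set(labels))                    # feed this k into the LDA model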
