All-pairs similarity using tfidf vectors in pyspark - apache-spark

I'm trying to find similar documents based on their text in spark. I'm using python with Spark.
So far I've used RowMatrix, IndexedRowMatrix, and CoordinateMatrix to set this up, and then ran columnSimilarities (DIMSUM). The problem with DIMSUM is that it's optimized for many features and few items. http://stanford.edu/~rezab/papers/dimsum.pdf
Our initial approach was to create tf-idf vectors of all words in all documents, then transpose that into a RowMatrix with a row for each word and a column for each item. We then ran columnSimilarities, which gives us a CoordinateMatrix of ((item_i, item_j), similarity). This just doesn't work well when the number of columns is greater than the number of rows.
We need a way to calculate all-pairs similarity with many items and few features: #items=10^7, #features=10^4. At a higher level, we're trying to build an item-based recommender that, given one item, returns a few quality recommendations based only on the text.
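For reference, here is a minimal pyspark sketch of our current setup (the toy corpus, hashed feature size, and DIMSUM threshold below are placeholders, not our real values):

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

sc = SparkContext(appName="tfidf-similarity")

# Toy corpus standing in for the real documents
docs = sc.parallelize([
    "spark makes large scale text processing simple",
    "column similarities in spark use the dimsum algorithm",
    "tf idf vectors describe documents by their terms",
])

# TF-IDF vectors, one sparse vector per document
tf = HashingTF(numFeatures=1 << 14).transform(docs.map(lambda d: d.split()))
tf.cache()
tfidf = IDF().fit(tf).transform(tf)

# Transpose: emit (term, document, weight) entries so that documents
# become the columns of the matrix
entries = tfidf.zipWithIndex().flatMap(
    lambda vec_idx: [MatrixEntry(int(t), vec_idx[1], w)
                     for t, w in zip(vec_idx[0].indices, vec_idx[0].values)])
mat = CoordinateMatrix(entries).toRowMatrix()

# DIMSUM: the threshold trades accuracy for speed via sampling
sims = mat.columnSimilarities(0.1)
print(sims.entries.take(5))   # MatrixEntry(item_i, item_j, similarity)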

I'd write this as a comment instead of an answer, but SO won't let me comment yet.
This would be "trivially" solved by using ElasticSearch's more-like-this query. The docs explain how it works and which factors are taken into account, which should be useful information even if you end up implementing this in Python.
They have also implemented other interesting algorithms such as the significant terms aggregation.

Related

Natural Language Processing in Python

How to find similar issues for a new, unseen issue based on past trained issues (including the summary and description of the issue) using natural language processing in Python
If I understand you correctly you have a new issue (query) and you want to look up other similar issues (documents) in your database. If so, then what you need is a way to find the similarity between your query and existing documents. And once you have them, you can rank them and select the most relevant ones. One such method that allows you to do this is Latent Semantic Indexing (LSI).
To do this you'll have to construct a document-term matrix. You'll use your existing documents to create a term-occurrence matrix: you basically record how many times each word appears in each document (or some weighted measure, such as tf-idf). In other words, this can be done with either a bag-of-words or a TF-IDF representation.
Once you have that, you'll have to process your query so that it is in the same form as your documents. Now that you have your query in usable form, you can calculate the cosine similarity between documents and your query. The one with the highest cosine similarity is the closest match.
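As a rough illustration (scikit-learn is my assumption here, not part of your setup, and TruncatedSVD stands in for the LSI step; dropping it leaves plain TF-IDF retrieval):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "database connection times out under load",
    "ui button misaligned on the settings page",
    "timeout when connecting to the database replica",
]
query = ["app cannot connect to the database, request times out"]

# Document-term matrix (TF-IDF weighted term occurrences)
vectorizer = TfidfVectorizer()
doc_term = vectorizer.fit_transform(documents)

# Project documents and the query into the same latent (LSI) space
lsi = TruncatedSVD(n_components=2)
doc_lsi = lsi.fit_transform(doc_term)
query_lsi = lsi.transform(vectorizer.transform(query))

# Rank documents by cosine similarity to the query
scores = cosine_similarity(query_lsi, doc_lsi)[0]
ranking = scores.argsort()[::-1]
print(ranking, scores[ranking])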
Note: The topic that you may want to read about is Information Retrieval and LSI is just one such method. You should look into other methods as well.

How to do an item based recommendation in spark mllib?

In Mahout, there is support for item-based recommendation via the API method:
ItemBasedRecommender.mostSimilarItems(int productid, int maxResults, Rescorer rescorer)
But in Spark MLlib, it appears that the ALS API can fetch recommended products, but a user id must be provided via:
MatrixFactorizationModel.recommendProducts(int user, int num)
Is there a way to get recommended products based on a similar product without having to provide user id information, similar to how Mahout performs item-based recommendation?
Spark 1.2.x does not provide an "item-similarity based recommender" like the ones present in Mahout.
However, MLlib does support model-based collaborative filtering, where users and products are described by a small set of latent factors. (Understand the use case for implicit feedback, such as views and clicks, versus explicit feedback, such as ratings, when constructing the user-item matrix.)
MLlib uses the alternating least squares (ALS) algorithm (which can be considered similar to SVD) to learn these latent factors.
If you need to construct purely an item-similarity based recommender, I would recommend this:
Represent all items by a feature vector
Construct an item-item similarity matrix by computing a similarity metric (such as cosine) for each pair of items
Use this item similarity matrix to find similar items for users
Since similarity matrices do not scale well (imagine how your similarity matrix would grow if you had 10,000 items instead of 100), this read on DIMSUM might be helpful if you're planning to run it on a large number of items:
https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
Please see my implementation of an item-item recommendation model using Apache Spark here. You can implement this by using the productFeatures matrix that is generated when you run the MLlib ALS algorithm on user-product-ratings data. The ALS algorithm essentially factorizes the ratings matrix into two matrices: userFeatures and productFeatures. You can run cosine similarity on the productFeatures matrix to find item-item similarity.
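Not the linked implementation, but a hedged pyspark sketch of that idea with toy ratings:

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="item-item-als")
ratings = sc.parallelize([
    Rating(1, 10, 4.0), Rating(1, 20, 1.0),
    Rating(2, 10, 5.0), Rating(2, 30, 3.0),
    Rating(3, 20, 2.0), Rating(3, 30, 4.0),
])
model = ALS.train(ratings, rank=5, iterations=10)

# productFeatures: RDD of (productId, latent factor vector)
features = model.productFeatures().mapValues(np.array).cache()

def similar_to(product_id, top_n=3):
    # Cosine similarity between one product's factors and all others
    target = features.lookup(product_id)[0]
    target = target / np.linalg.norm(target)
    sims = features.map(
        lambda pf: (pf[0], float(np.dot(pf[1] / np.linalg.norm(pf[1]), target))))
    return sims.filter(lambda s: s[0] != product_id).top(top_n, key=lambda s: s[1])

print(similar_to(10))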

What are the applications of length normalization?

I found some information about length normalization, but only in the context of search engines. Have people used it for other textual purposes? (Please forgive my ignorance; I've searched for other uses of it, but Google keeps conflating the term "normalization" with "scaling"...)
The link you provide in the question already mentions one reason for using length-normalization: to avoid having high term-frequency counts in document vectors. This affects document ranking considerably. A direct application of this is, of course, query-based document retrieval.
There are other algorithm-specific applications as well. For example, if you want to cluster documents using cosine similarity between their vectors, simple clustering algorithms such as k-means may not converge unless the vectors all lie on a sphere, i.e. all vectors have the same length.
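A small illustration of that point (scikit-learn and a toy corpus assumed): turn off the vectorizer's built-in normalization, then L2-normalize so every document vector lies on the unit sphere before k-means.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

docs = [
    "short text",
    "a much much longer document about the very same topic topic topic",
    "an unrelated note on cooking",
]

X = TfidfVectorizer(norm=None).fit_transform(docs)
X_normed = normalize(X, norm="l2")   # every row now has length 1

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_normed)
print(labels)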

how to determine the number of topics for LDA?

I am new to LDA and want to use it in my work. However, some problems have come up.
In order to get the best performance, I want to estimate the best number of topics. After reading "Finding Scientific Topics", I understand that I can first calculate log P(w|z) and then use the harmonic mean of a series of P(w|z) values to estimate P(w|T).
My question is what does the "a series of" mean?
Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge, the hierarchical Dirichlet process (HDP) is quite possibly the best way to arrive at the optimal number of topics.
If you are looking for deeper analyses, this paper on HDP reports the advantages of HDP in determining the number of groups.
A reliable way is to compute the topic coherence for different numbers of topics and choose the model that gives the highest topic coherence. But the highest coherence does not always fit the bill.
See this topic modeling example.
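For instance, here is a hedged sketch of that selection loop using gensim (the library choice and toy corpus are assumptions, not part of the linked example):

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

texts = [
    ["spark", "cluster", "compute", "parallel"],
    ["topic", "model", "lda", "corpus", "dirichlet"],
    ["lda", "topic", "coherence", "model", "corpus"],
    ["spark", "mllib", "cluster", "parallel"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

best_k, best_score = None, float("-inf")
for k in range(2, 6):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0)
    score = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                           coherence="c_v").get_coherence()
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)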
Some people use the harmonic mean to find the optimal number of topics; I tried it too, but the results were unsatisfactory. If you are using R, the "ldatuning" package will be useful: it provides four metrics for estimating the optimal number of topics. Perplexity and log-likelihood based V-fold cross-validation are also very good options for choosing the topic model, though V-fold cross-validation can be time-consuming for large datasets. You can also see "A heuristic approach to determine an appropriate number of topics in topic modeling".
Important links:
https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597325/
Let k = number of topics
There is no single best way, and I am not even sure there is any standard practice for this.
Method 1:
Try out different values of k, select the one that has the largest likelihood.
Method 2:
Instead of LDA, see if you can use HDP-LDA
Method 3:
If the HDP-LDA is infeasible on your corpus (because of corpus size), then take a uniform sample of your corpus and run HDP-LDA on that, take the value of k as given by HDP-LDA. For a small interval around this k, use Method 1.
Since I am working on the same problem, I just want to add the method proposed by Wang et al. (2019) in their paper "Optimization of Topic Recognition Model for News Texts Based on LDA". Besides giving a good overview, they suggest a new method: first you train a word2vec model (e.g. using the word2vec package), then you apply a clustering algorithm capable of finding density peaks (e.g. from the densityClust package), and then use the number of clusters found as the number of topics in the LDA algorithm.
If time permits, I will try this out. I also wonder if the word2vec model can make the LDA obsolete.
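In the meantime, here is a rough Python sketch of the idea (gensim 4.x is assumed, and scikit-learn's MeanShift stands in for the paper's density-peak clustering purely as an example of a method that estimates the number of clusters on its own):

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import MeanShift

sentences = [
    ["spark", "cluster", "compute", "parallel"],
    ["topic", "model", "lda", "corpus"],
    ["word", "embedding", "vector", "model"],
]

# Train word embeddings on the (toy) corpus
w2v = Word2Vec(sentences, vector_size=50, min_count=1, seed=0)
vectors = np.array([w2v.wv[w] for w in w2v.wv.index_to_key])

# Cluster the word vectors; the number of clusters becomes k for LDA
labels = MeanShift().fit_predict(vectors)
num_topics = len(set(labels))
print(num_topics)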

NLP software for classification of large datasets

Background
For years I've been using my own Bayesian-like methods to categorize new items from external sources based on a large and continually updated training dataset.
There are three types of categorization done for each item:
30 categories, where each item must belong to at least one category and at most two categories.
10 other categories, where each item is only associated with a category if there is a strong match, and each item can belong to as many categories as match.
4 other categories, where each item must belong to only one category, and if there isn't a strong match the item is assigned to a default category.
Each item consists of English text of around 2,000 characters. My training dataset contains about 265,000 items, with roughly 10,000,000 features (unique three-word phrases).
My homebrew methods have been fairly successful, but definitely have room for improvement. I've read the NLTK book's chapter "Learning to Classify Text", which was great and gave me a good overview of NLP classification techniques. I'd like to be able to experiment with different methods and parameters until I get the best classification results possible for my data.
The Question
What off-the-shelf NLP tools are available that can efficiently classify such a large dataset?
Those I've tried so far:
NLTK
TIMBL
I tried to train them with a dataset that consisted of less than 1% of the available training data: 1,700 items, 375,000 features. For NLTK I used a sparse binary format, and a similarly compact format for TIMBL.
Both seemed to rely on doing everything in memory, and quickly consumed all system memory. I can get them to work with tiny datasets, but nothing large. I suspect that if I tried incrementally adding the training data the same problem would occur either then or when doing the actual classification.
I've looked at Google's Prediction API, which seems to do much of what I'm looking for, but not everything. I'd also like to avoid relying on an external service if possible.
About the choice of features: in testing with my homebrew methods over the years, three word phrases produced by far the best results. Although I could reduce the number of features by using words or two word phrases, that would most likely produce inferior results and would still be a large number of features.
Following this post, and based on personal experience, I would recommend Vowpal Wabbit. It is said to have one of the fastest text classification algorithms.
MALLET has a number of classifiers (NB, MaxEnt, CRF, etc.). It was written by Andrew McCallum's group. SVMLib is another good option, but SVM models typically require a bit more tuning than MaxEnt. Alternatively, some sort of online clustering like k-means might not be bad in this case.
SVMLib and MALLET are quite fast (C and Java) once you have your model trained. Model training can take a while, though! Unfortunately it's not always easy to find example code. I have some examples of how to use MALLET programmatically (along with the Stanford Parser, which is slow and probably overkill for your purposes). NLTK is a great learning tool, and it's simple enough that if you can prototype what you are doing there, that's ideal.
NLP is more about features and data quality than which machine learning method you use. 3-grams might be good, but what about character n-grams across those? I.e., all the character n-grams within a 3-gram, to account for spelling variations/stemming/etc. Named entities might also be useful, or some sort of lexicon.
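For example, character n-grams within word boundaries can be generated like this (scikit-learn is assumed here, not part of the original suggestion):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["categorize items quickly", "categorise item quick"]

# char_wb builds character n-grams only inside word boundaries,
# so spelling variants share most of their features
vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vec.fit_transform(docs)
print(len(vec.vocabulary_), "character n-gram features")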
I would recommend Mahout as it is intended for handling very large scale data sets.
The ML algorithms are built on top of Apache Hadoop (MapReduce), so scaling is inherent.
Take a look at the classification section at the link below and see if it helps.
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Have you tried MALLET?
I can't be sure that it will handle your particular dataset but I've found it to be quite robust in previous tests of mine.
However, my focus was on topic modeling rather than classification per se.
Also, be aware that with many NLP solutions you needn't supply the "features" yourself (such as the n-grams, i.e. the three-word and two-word phrases mentioned in the question) but can instead rely on the various NLP functions to produce their own statistical model.
