Document Similarity in ElasticSearch - search

I want to calculate similarity between two documents indexed in elasticsearch. I know it can be done in lucene using term vectors. What is the direct way to do it?
I found that there is a similarity module doing exactly this:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-similarity.html
How do I integrate this into my system? I am using pyelasticsearch to call Elasticsearch commands, but I am open to using the REST API for similarity if needed.

I think the Elasticsearch documentation can easily be misinterpreted.
Here "similarity" is not a comparison of documents or fields but rather a mechanism for scoring matching documents based on matching terms from the query.
The documentation states:
A similarity (scoring / ranking model) defines how matching documents
are scored.
The similarity algorithms that Elasticsearch supports are probabilistic models based on term distribution in the corpus (index).
With regard to term vectors, this can also be misinterpreted.
Here "term vectors" refers to statistics for the terms of a document that can easily be queried. It seems that any similarity measurement across term vectors would then have to be done in your application, post-query. The documentation on term vectors states:
Returns information and statistics on terms in the fields of a
particular document.
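To illustrate what a post-query comparison in application code might look like, here is a minimal sketch that computes cosine similarity between two documents from their term frequencies. It assumes you have already pulled the per-term frequencies out of the term vector responses into plain dictionaries; the function name cosine_similarity and the sample data are hypothetical, and no Elasticsearch call is shown.

```python
import math


def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two sparse term-frequency dicts."""
    # Only terms present in both documents contribute to the dot product.
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical term frequencies for one field of two documents
# (assumed to have been extracted from term vector responses).
doc_a = {"elasticsearch": 2, "similarity": 1, "lucene": 1}
doc_b = {"elasticsearch": 1, "similarity": 3, "scoring": 2}

score = cosine_similarity(doc_a, doc_b)
print(round(score, 3))
```

Raw term frequencies are used here for simplicity; weighting them (for example with TF-IDF) before comparing is a common refinement.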
If you need a performant (fast) similarity metric over a very large corpus you might consider a low-rank embedding of your documents stored in an index for doing approximate nearest neighbor searches. After your KNN lookup, which greatly reduces the candidate set, you can do more costly metric calculations for ranking.
Here is an excellent resource for evaluation of approximate KNN solutions:
https://github.com/erikbern/ann-benchmarks
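The two-stage pattern described above (a cheap nearest neighbor lookup to shrink the candidate set, then a costlier metric for the final ranking) could be sketched roughly as follows. This is only an illustration under several assumptions: the low-dimensional embeddings are taken as given, a brute-force scan stands in for a real approximate nearest neighbor index, and the names two_stage_search, low_dim and full_dim are hypothetical.

```python
import heapq
import math


def squared_l2(u, v):
    """Squared Euclidean distance between two dense vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def two_stage_search(query_id, low_dim, full_dim, n_candidates, top_k):
    # Stage 1: shortlist by distance in the low-dimensional space.
    # Brute force here; an ANN index would replace this scan in practice.
    others = [d for d in low_dim if d != query_id]
    candidates = heapq.nsmallest(
        n_candidates,
        others,
        key=lambda d: squared_l2(low_dim[query_id], low_dim[d]),
    )
    # Stage 2: re-rank only the shortlist with the costlier metric
    # computed on the full-size vectors.
    ranked = sorted(
        candidates,
        key=lambda d: cosine(full_dim[query_id], full_dim[d]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical data: 2-d embeddings and 4-d full vectors per document.
low_dim = {"a": [0.0, 0.0], "b": [0.1, 0.0], "c": [0.2, 0.1], "d": [5.0, 5.0]}
full_dim = {
    "a": [1.0, 0.0, 1.0, 0.0],
    "b": [1.0, 0.0, 0.0, 1.0],
    "c": [1.0, 0.0, 1.0, 0.1],
    "d": [0.0, 1.0, 0.0, 1.0],
}

result = two_stage_search("a", low_dim, full_dim, n_candidates=2, top_k=2)
print(result)
```

Document "d" is dropped in the first stage without its full vector ever being compared, which is where the savings come from on a large corpus.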

Related

Is it always necessary to either stem/lemmatize words when working with TF-IDF?

I'm using TF-IDF along with cosine similarity in order to compute document similarity. I was wondering if it's always necessary to stem/lemmatize the words in the document. Are there times where based on the task, it's better not to stem/lemmatize?

Natural Language Processing in Python

How can I find issues similar to a new, unseen issue, based on past issues (each with a summary and description), using natural language processing in Python?
If I understand you correctly you have a new issue (query) and you want to look up other similar issues (documents) in your database. If so, then what you need is a way to find the similarity between your query and existing documents. And once you have them, you can rank them and select the most relevant ones. One such method that allows you to do this is Latent Semantic Indexing (LSI).
To do this you'll have to construct a document-term matrix. You'll use your existing documents to create a term occurrence matrix across documents. What this means is that you basically record how many times a word appears in each document (or some more complex measure, for example TF-IDF). This can be done either through a bag-of-words representation or a TF-IDF representation.
Once you have that, you'll have to process your query so that it is in the same form as your documents. Now that you have your query in usable form, you can calculate the cosine similarity between documents and your query. The one with the highest cosine similarity is the closest match.
Note: The topic that you may want to read about is Information Retrieval and LSI is just one such method. You should look into other methods as well.
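The matrix-then-cosine steps above can be sketched as follows. Note that this is plain TF-IDF ranking rather than LSI itself: LSI would additionally apply a truncated SVD to the document-term matrix before comparing. The whitespace tokenizer, the smoothed IDF formula, and the sample issues are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer: lowercase and split on whitespace (an assumption).
    return text.lower().split()


def build_idf(tokenized_docs):
    """Inverse document frequency per term, using one common smoothed variant."""
    n = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are dropped."""
    tf = Counter(tokens)
    return {t: f * idf[t] for t, f in tf.items() if t in idf}


def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


# Hypothetical past issues (summary text only).
issues = [
    "login page crashes after password reset",
    "export to csv produces empty file",
    "password reset email never arrives",
]
tokenized = [tokenize(d) for d in issues]
idf = build_idf(tokenized)
doc_vectors = [tfidf_vector(t, idf) for t in tokenized]

# The new issue is processed exactly like the stored documents.
query = "crash on login page"
q_vec = tfidf_vector(tokenize(query), idf)

ranking = sorted(
    range(len(issues)),
    key=lambda i: cosine(q_vec, doc_vectors[i]),
    reverse=True,
)
best = issues[ranking[0]]
print(best)
```

In this toy data "crash" does not match "crashes" because no stemming is applied, so only "login" and "page" contribute to the score.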

How to do an item based recommendation in spark mllib?

In Mahout, there is support for item based recommendation using API method:
ItemBasedRecommender.mostSimilarItems(int productid, int maxResults, Rescorer rescorer)
But in Spark MLlib, it appears that the APIs within ALS can fetch recommended products only if a user ID is provided, via:
MatrixFactorizationModel.recommendProducts(int user, int num)
Is there a way to get recommended products based on a similar product without having to provide user ID information, similar to how Mahout performs item-based recommendation?
Spark 1.2.x versions do not provide an item-similarity based recommender like the ones present in Mahout.
However, MLlib currently supports model-based collaborative filtering, where users and products are described by a small set of latent factors. (When constructing the user-item matrix, make sure you understand whether your use case involves implicit feedback, such as views and clicks, or explicit feedback, such as ratings.)
MLlib uses the alternating least squares (ALS) algorithm (which can be considered similar to the SVD algorithm) to learn these latent factors.
If you need to construct purely an item-similarity based recommender, I would recommend this:
Represent all items by a feature vector
Construct an item-item similarity matrix by computing a similarity metric (such as cosine) for each pair of items
Use this item similarity matrix to find similar items for users
Since similarity matrices do not scale well (imagine how your similarity matrix would grow if you had 10,000 items instead of 100), this read on DIMSUM might be helpful if you're planning to implement it on a large number of items:
https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
Please see my implementation of an item-item recommendation model using Apache Spark here. You can implement this by using the productFeatures matrix that is generated when you run the MLlib ALS algorithm on user-product-ratings data. The ALS algorithm essentially factorizes the ratings matrix into two matrices: one is userFeatures and the other is productFeatures. You can run a cosine similarity on the productFeatures matrix to find item-item similarity.
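As a rough illustration of the item-item approach described in the answers above, the sketch below ranks items by cosine similarity between their feature vectors. It is plain Python rather than Spark code: the product_features dictionary is a hypothetical stand-in for latent factor vectors collected from an ALS model, and most_similar_items is an illustrative name loosely modeled on the Mahout method from the question.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def most_similar_items(item_id, item_vectors, max_results):
    """Return (item, score) pairs for the items closest to item_id."""
    target = item_vectors[item_id]
    scored = [
        (other, cosine(target, vec))
        for other, vec in item_vectors.items()
        if other != item_id  # never recommend the item itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_results]


# Hypothetical latent factor vectors, keyed by product ID.
product_features = {
    101: [0.9, 0.1, 0.0],
    102: [0.8, 0.2, 0.1],
    103: [0.0, 0.1, 0.9],
}

similar = most_similar_items(101, product_features, max_results=1)
print(similar)
```

This compares the target against every other item on each call, which is fine for small catalogs; for many items the full pairwise computation is the part that stops scaling.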

What are the applications of length normalization?

I found some info about Length Normalization. I found it mentioned only in the context of search engines. Have people used it for different textual purposes? (please forgive my ignorance. I've truly searched for other uses of it but google keeps confusing the term "normalization" with "scaling"...).
The link you provide in the question already mentions one reason for using length-normalization: to avoid having high term-frequency counts in document vectors. This affects document ranking considerably. A direct application of this is, of course, query-based document retrieval.
There are other algorithm-specific applications as well. For example, if you want to cluster documents using cosine similarity between the vectors: simple clustering algorithms such as k-means may not converge unless the vectors are all on a sphere, i.e. all vectors have the same length.
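A small sketch of what length normalization does to document vectors: two documents with the same term proportions but very different lengths end up as the same unit vector, so their dot product equals their cosine similarity. The vectors and the helper name l2_normalize are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


# Hypothetical term-count vectors: same proportions, ten times the length.
short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]

a = l2_normalize(short_doc)
b = l2_normalize(long_doc)

# After normalization both vectors lie on the unit sphere, so the plain
# dot product is already the cosine similarity.
dot = sum(x * y for x, y in zip(a, b))
print(a, b, dot)
```

Without the normalization step, the longer document would dominate any dot-product based score purely because of its higher raw counts.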

Apache lucene inverted index

Does Lucene index use tf-idf as weights? Is it possible to define your own statistics and weights for each document, and "plug" them into Lucene?
Yes, the default scoring algorithm incorporates tf-idf, and is fully documented in the TFIDFSimilarity documentation.
There are a number of ways to customize the scoring of documents.
The simplest and most common is to incorporate a boost, either on a field at index time, or on a query term when querying.
Many query types modify the scoring used for that query. Examples include ConstantScoreQuery and DisjunctionMaxQuery.
The Similarity you use defines the scoring algorithm. You could select a different one (ex. BM25Similarity).
You can implement your own Similarity, usually by extending a higher-level implementation such as DefaultSimilarity, TFIDFSimilarity, or SimilarityBase.
Just go through this example. It may help you see how you can make custom changes to the indexing process:
http://lucene.apache.org/core/4_3_1/demo/src-html/org/apache/lucene/demo/IndexFiles.html
