How do you efficiently implement a document similarity search system?

How do you implement a "similar items" system for items described by a
set of tags?
In my database, I have three tables: Article, ArticleTag, and Tag. Each Article is related to a number of Tags via a many-to-many relationship. For each Article I want to find the five most similar articles, to implement an "if you like this article you will like these too" system.
I am familiar with cosine similarity, and using that algorithm works very well. But it is way too slow. For each article, I need to iterate over all articles, calculate the cosine similarity for the article pair, and then select the five articles with the highest similarity rating.
With 200k articles and 30k tags, it takes me half a minute to calculate the similar articles for a single article. So I need another algorithm that produces roughly as good results as cosine similarity but that can be run in real time and does not require me to iterate over the whole document corpus each time.
Maybe someone can suggest an off-the-shelf solution for this? Most of the search engines I looked at do not support document similarity searching.

Some questions:
How is ArticleTag different from Tag? Or is that the M2M mapping table?
Can you sketch out how you've implemented the cosine matching algorithm?
Why don't you store your document tags in an in-memory data structure of some sort, using it only to retrieve document IDs? That way, you only hit the database at retrieval time (see the sketch after these comments).
Depending on the frequency of document additions, this structure can be designed for fast or slow updates.
Initial intuition towards an answer: I'd say an online clustering algorithm (perhaps a Principal Components Analysis on the co-occurrence matrix, which approximates a k-means clustering?). This can be refined once you answer some of the questions above.
Cheers.
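Building on the in-memory suggestion above, here is a minimal sketch of that idea: an inverted index from tag to article IDs, so that cosine similarity is only computed against articles that share at least one tag with the query article (the only ones that can score above zero). All names are illustrative, and article_tags is assumed to already be loaded from the database as a dict of tag-ID sets.

```python
from collections import defaultdict
from math import sqrt

def build_tag_index(article_tags):
    """Map each tag ID to the set of article IDs that carry it."""
    index = defaultdict(set)
    for article_id, tags in article_tags.items():
        for tag in tags:
            index[tag].add(article_id)
    return index

def most_similar(article_id, article_tags, tag_index, k=5):
    """Cosine similarity on binary tag vectors, computed only against
    articles that share at least one tag with the query article."""
    query_tags = article_tags[article_id]
    # Candidate generation: articles with no shared tag have similarity 0,
    # so they never need to be scored.
    candidates = set()
    for tag in query_tags:
        candidates |= tag_index[tag]
    candidates.discard(article_id)

    scored = []
    for other in candidates:
        other_tags = article_tags[other]
        overlap = len(query_tags & other_tags)
        score = overlap / (sqrt(len(query_tags)) * sqrt(len(other_tags)))
        scored.append((score, other))
    scored.sort(reverse=True)
    return scored[:k]
```

With 200k articles and 30k tags, an article with a handful of tags usually pulls in only a small fraction of the corpus as candidates, which is what makes this approach closer to real time than scoring every article pair.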

You can do it with the Lemur toolkit. With its KeyfileIncIndex, you have to re-retrieve the document from its source; the IndriIndex supports retrieving the document from the index.
But anyway, you index your documents, and then you build a query from the document you want to find similar documents to. You can then do a search with that query, and it will score the other documents for similarity. It's pretty fast in my experience. It treats both source documents and basic queries as documents, so finding similarities is really what it does (unless you're using the Indri parser stuff - that's a bit different, and I'm not sure how it works).
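The Lemur-specific calls aside, the "build a query from the document" step amounts to something like the following hedged sketch: keep the document's most heavily weighted terms and issue them as the query. The index layout and weighting here are illustrative placeholders, not Lemur's API.

```python
import math
from collections import Counter

def document_as_query(doc_terms, doc_freqs, num_docs, top_n=25):
    """Turn a document into a query: keep its top_n terms by TF-IDF weight.
    doc_freqs maps term -> number of documents containing it (an assumed
    layout for illustration only)."""
    tf = Counter(doc_terms)
    weighted = []
    for term, freq in tf.items():
        df = doc_freqs.get(term, 0)
        if df == 0:
            continue
        idf = math.log(num_docs / df)
        weighted.append((freq * idf, term))
    weighted.sort(reverse=True)
    return [term for _, term in weighted[:top_n]]
```

The resulting term list is then run as an ordinary disjunctive query against the index, and the engine's scoring ranks the other documents by how well they match it.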

Related

Difference between Elastic Search and Google Search Appliance page ranking

How does the page ranking in Elasticsearch work? Once we create an index, is there an underlying intelligent layer that creates a metadata repository and provides results to queries based on relevance? I have created several indices and I want to know how the results are ordered once a query is provided. And is there a way to influence these results based on relationships between different records?
Do you mean how documents are scored in Elasticsearch? Or are you talking about 'page rank' in Elasticsearch?
Documents are scored based on how well the query matches the document. This approach is based on TF-IDF (term frequency–inverse document frequency). There is, however, no 'page rank' in Elasticsearch. Page rank takes into consideration how many documents point towards a given document: a document with many in-links is weighted higher than others. This is meant to reflect whether a document is authoritative or not.
In Elasticsearch, however, the relations between the documents are not taken into account when it comes to scoring.
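For concreteness, a stripped-down version of that per-document score is sketched below. Real Lucene/Elasticsearch scoring adds normalizations such as field-length norms, but the shape is the same, and there is no link-based component anywhere in it.

```python
import math

def tf_idf_score(query_terms, doc_term_freqs, doc_freqs, num_docs):
    """Sum over query terms of tf(term, doc) * idf(term). Only how well the
    document's own terms match the query contributes; no link analysis
    ('page rank') is involved."""
    score = 0.0
    for term in query_terms:
        tf = doc_term_freqs.get(term, 0)
        df = doc_freqs.get(term, 0)
        if tf and df:
            score += tf * math.log(num_docs / df)
    return score
```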

Finding possibly matching strings in a large dataset

I'm in the middle of a project where I have to process text documents and enhance them with Wikipedia links. Preprocessing a document includes locating all the possible target articles, so I extract all ngrams and do a comparison against a database containing all the article names. The current algorithm is a simple caseless string comparison preceded by simple trimming. However, I'd like it to be more flexible and tolerant to errors or little text modifications like prefixes etc. Besides, the database is pretty huge and I have a feeling that string comparison in such a large database is not the best idea...
What I thought of is a hashing function, which would assign a unique hash (I'd rather avoid collisions) to any article or ngram so that I could compare hashes instead of strings. The difference between two hashes would let me know if the words are similar, so that I could gather all the possible target articles.
Theoretically, I could use cosine similarity to calculate the similarity between words, but this doesn't seem right to me because comparing the characters multiple times sounds like a performance issue.
Is there any recommended way to do it? Is it a good idea at all? Maybe the string comparison with proper indexing isn't that bad and the hashing won't help me here?
I looked around at hashing functions and text similarity algorithms, but I haven't found a solution yet...
Consider using the Apache Lucene API. It provides functionality for searching, stemming, tokenization, indexing, and document similarity scoring. It's an open-source implementation of basic best practices in Information Retrieval.
The functionality that seems most useful to you from Lucene is its MoreLikeThis feature, which picks out a document's most significant terms (weighted by TF-IDF) and runs them as a query to locate similar documents.
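For comparison with the question's own approach, here is a minimal sketch of the candidate-generation step it describes: hash a normalized form of every article title once, then look up each ngram by the same normalized form. The normalization shown is just caseless trimming, as in the question; anything more tolerant (stemming, accent folding) would be an extension of the normalize function, which is a hypothetical helper.

```python
def normalize(text):
    """Normalization used as the lookup key: caseless and trimmed, roughly
    what the question describes. More tolerant matching (stemming, accent
    folding) could be added here."""
    return " ".join(text.lower().split())

def build_title_lookup(article_titles):
    """Hash every normalized article title once; ngrams that normalize to
    the same key are exactly the candidate matches we want."""
    lookup = {}
    for title in article_titles:
        lookup.setdefault(normalize(title), []).append(title)
    return lookup

def candidate_links(tokens, lookup, max_n=4):
    """Extract all ngrams up to max_n tokens and look each one up by its
    normalized form instead of comparing raw strings."""
    hits = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            matches = lookup.get(normalize(ngram))
            if matches:
                hits.append((ngram, matches))
    return hits
```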

Multilingual free-text search in an app with normalized data?

We have enums, free-text, and referenced fields etc. in our DB.
Each enum has its own translation, free-text could be in any language. We'd like to do efficient large-scale free-text searching and enum value based searching.
I know of solutions like Solr which are nice, but that would mean we'd have to index entire de-normalized records with all the text of all the languages in the system. This seems a bit excessive.
What are some recommended approaches for searching multilingual normalized data? Anyone tackle this before?
ETL: Extract, Transform, Load. In other words, get the data out of your existing databases, transform it (which is more than merely denormalizing it), and load it into Solr. The Solr database will be a lot smaller than the existing databases because there is no relational overhead, and Solr search takes most of the load off your existing database servers.
Take a good look at how to configure and use Solr, and learn about Solr cores. You may want to put some languages in separate cores, because that way you can more effectively use the various stemming algorithms in Solr. But even with multilingual data you can still use bigrams (such as are used with Chinese language analysis).
Having multiple cores makes searching a bit more complex, since you can try either a single-language index or an all-languages index. But it is much more effective to group language data and apply language-specific stopwords, protected words, stemming, and language analysis tools.
Normally you would include some key data in the index so that when you find a record via Solr search, you can then reference directly into the source database. Also, you can have normalised and non-normalised data together; for instance, an enum could be recorded in a normalised field in English as well as in a non-normalised field in the same language as the free text. A field can be duplicated in order to apply two different analysis and filtering treatments.
It would be worth your while to trial this with a subset of your data in order to learn how Solr works and how best to configure it.
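A rough sketch of the "load" step of that ETL, assuming a hypothetical per-language core named docs_en and made-up field names; it posts already-denormalized records to Solr's standard JSON update handler.

```python
import requests

# Core name and field names are illustrative only.
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs_en/update"

def load_into_solr(records):
    """Push denormalized, per-language documents to Solr and commit. Each
    record carries the source-database key so search hits can be joined
    back to the normalized tables."""
    docs = [
        {
            "id": r["source_pk"],            # key back into the source DB (assumed field)
            "enum_label_en": r["enum_label"],
            "free_text_en": r["free_text"],
        }
        for r in records
    ]
    resp = requests.post(SOLR_UPDATE_URL, json=docs, params={"commit": "true"})
    resp.raise_for_status()
```

The transform step (flattening joins and deciding which translated fields to populate) happens before this; keeping one core per language keeps each index small and lets language-specific analysis apply cleanly.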

Using Lucene just as an inverted index

Lucene has a great capability for incremental indexing, which is normally a pain when developing an IR system from scratch.
I would like to know if I can use the low-level Lucene APIs to use it only as an inverted index, i.e., storage for inverted lists, position information, term frequencies, IDFs, field storage, etc.
The bottom line is that I want to implement my own weightings and scoring of documents. I'm aware of the Similarity class, but it does not give me the flexibility I want.
You can certainly make your own query class, and your own scorers etc. The only problem you might run into is if you need global data. (E.g. in tf/idf you need to know, well, the term freq and inverse doc freq.) If there is some other cross-document or cross-term metadata you need for your scoring algorithm, you might run into trouble because there isn't a great way that I know of for storing this.
But basically, as long as your algorithm is vaguely tf/idf or works per document only, I think you should be fine.
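This is not Lucene's API, but a toy picture of the data the question wants exposed (postings with positions and term frequencies) and of a scoring hook supplied by the caller rather than by a Similarity subclass.

```python
import math
from collections import defaultdict

class TinyInvertedIndex:
    """A toy inverted index: for each term, a postings map of
    doc_id -> list of token positions. Term frequency is the length of the
    position list; document frequency is the size of the postings map."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, tokens):
        self.doc_count += 1
        for pos, term in enumerate(tokens):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def search(self, query_terms, scorer):
        """Score every document matching any query term with a
        caller-supplied scoring function -- the kind of flexibility the
        question wants beyond Lucene's Similarity class."""
        scores = defaultdict(float)
        for term in query_terms:
            plist = self.postings.get(term, {})
            for doc_id, positions in plist.items():
                scores[doc_id] += scorer(positions, len(plist), self.doc_count)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# One possible scorer: plain TF-IDF, but any per-document weighting can be plugged in.
def tf_idf(positions, doc_freq, num_docs):
    return len(positions) * math.log(num_docs / doc_freq)
```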

Calling search gurus: Numeric range search performance with Lucene?

I'm working on a system that performs matching on large sets of records based on strings, numeric ranges, and date ranges. The string matches are mostly exact matches as far as I can tell, as opposed to the less exact full-text-search results that I understand Lucene is generally designed for. Numeric precision is important, as the data concerns prices.
I noticed that Lucene recently added some support for numeric range searching, but it's not something it was originally designed for.
Currently the system uses procedural SQL to do the matching, and the limits of its scalability are being reached. I'm researching ways to scale the system horizontally, and search engine technology seems like a possibility, given that there are technologies that can scale to very large data sets while returning search results very fast. I'd like to investigate whether it's possible to take a lot of load off the database by doing the matching with the Lucene-generated metadata, without hitting the database for the full records until the matching rules have determined what should be retrieved. I would like to aim eventually for near-real-time results, although we are a long way from that at this point.
My question is as follows: Is it likely that Lucene would perform many times faster and scale to greater data sets more cheaply than an RDBMS for this type of indexing and searching?
Lucene stores its numeric stuff as a trie; a SQL implementation will probably store it as a B-tree or an R-tree. The way Lucene stores its trie and the way SQL uses an R-tree are pretty similar, and I would be surprised if you saw a huge difference (unless you leveraged some of the scalability that comes with Solr).
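A simplified picture of the trie idea (Lucene's actual encoding and range-splitting differ in detail): each value is indexed at several precisions by shifting away low-order bits, and a range query is then covered by a handful of coarse prefixes plus fine-grained terms at the edges, instead of one term per value.

```python
def trie_terms(value, precision_step=4, bits=32):
    """Index time: emit one (shift, prefix) term per precision level by
    dropping low-order bits."""
    return [(shift, value >> shift) for shift in range(0, bits, precision_step)]

def range_terms(lo, hi, precision_step=4):
    """Query time: greedily cover the inclusive range [lo, hi] with aligned
    blocks whose sizes are powers of 2**precision_step, so a wide range
    needs only a few coarse terms plus fine terms at the edges."""
    terms = []
    while lo <= hi:
        shift = 0
        while True:
            size = 1 << (shift + precision_step)
            if lo % size == 0 and lo + size - 1 <= hi:
                shift += precision_step
            else:
                break
        terms.append((shift, lo >> shift))
        lo += 1 << shift
    return terms

# Example: range_terms(3, 700) yields a few dozen terms
# instead of 698 individual values.
```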
As a general question of the performance of Lucene vs. SQL fulltext, a good study I've found is: Jing, Y., C. Zhang, and X. Wang. “An Empirical Study on Performance Comparison of Lucene and Relational Database.” In Communication Software and Networks, 2009. ICCSN'09. International Conference on, 336-340. IEEE, 2009.
First, when executing exact query, the performance of Lucene is much better than that of unindexed-RDB, while is almost same as that of indexed-RDB. Second, when the wildcard query is a prefix query, then the indexed-RDB and Lucene both perform very well still by leveraging the index... Third, for combinational query, Lucene performs smoothly and usually costs little time, while the query time of RDB is related to the combinational search conditions and the number of indexed fields. If some fields in the combinational condition haven't been indexed, search will cost much more time. Fourth, the query time of Lucene and unindexed-RDB has relations with the record complexity, but the indexed-RDB is nearly independent of it.
In short, if you are doing a search like "select * where x = y", it doesn't matter which you use. The more clauses you add in (x = y OR (x = z AND y = x)...) the better Lucene becomes.
They don't really mention this, but a huge advantage of Lucene is all the built-in functionality: stemming, query parsing etc.
I suggest you read Marc Krellenstein's "Full Text Search Engines vs DBMS".
A relatively easy way to start using Lucene is by trying Solr. You can scale Lucene and Solr using replication and sharding.
At its heart, and in its simplest form, Lucene is a word-density search engine. Lucene can scale to handle extremely large data sets and, when indexed correctly, return results at blistering speed. For text-based searching it is possible and very probable that search results will come back more quickly from Lucene than from SQL Server/Oracle/MySQL. That being said, it is unfair to compare Lucene to a traditional RDBMS, as they have completely different usages.
