How to efficiently run multiple Lucene search queries on a MemoryIndex document?

I am using MemoryIndex in the Lucene Java API to index text content in memory and run queries over it. There can be hundreds of such queries running against a single document to find matches. I would like to know the most efficient way to do this.
Currently I am creating multiple Query objects and looping over them to see which ones match my in-memory text.
The text can be a few KB in size.
Queries will be complex, combining boolean clauses and phrases.
A single query might be around 1 KB at most.

This question has been up for quite some time, so I will try to answer it myself.
I implemented this by storing all my parsed Query objects in a list: I build each query with the QueryParser once and keep the resulting object in an in-memory list.
This improves performance because I do not have to rebuild the queries every time a new text comes in.
In my case we had hundreds of complex queries, but they were static and did not change. Hence it made sense to keep the parsed queries in memory rather than build them every time.
I implemented this more than a year ago at my previous company using Apache Lucene and Java.
Note: one major problem I faced was the default stop-word filter in Lucene, which was trimming out some parts of the text; this was not the behaviour I needed.
I no longer have access to the code, so sorry if the answer seems somewhat vague.
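Roughly, the approach looks like the sketch below. This is a reconstruction from memory rather than the original code, so treat it as a sketch: the field name "content" and the empty stop-word set (to avoid the stop-word problem mentioned above) are assumptions.

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class CachedQueryMatcher {
    // Empty stop-word set so no tokens are silently dropped from the text.
    private final Analyzer analyzer = new StandardAnalyzer(CharArraySet.EMPTY_SET);
    private final List<Query> queries = new ArrayList<>();

    // Parse each query string once up front and keep the Query object in memory.
    public void addQuery(String queryString) throws Exception {
        queries.add(new QueryParser("content", analyzer).parse(queryString));
    }

    // Build a fresh MemoryIndex for each incoming text and run every cached query against it.
    public List<Query> match(String text) {
        MemoryIndex index = new MemoryIndex();
        index.addField("content", text, analyzer);
        List<Query> matched = new ArrayList<>();
        for (Query query : queries) {
            if (index.search(query) > 0.0f) {  // a non-zero score means the query matched
                matched.add(query);
            }
        }
        return matched;
    }
}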
Useful classes:
https://lucene.apache.org/core/6_6_2/memory/org/apache/lucene/index/memory/MemoryIndex.html
http://lucene.apache.org/core/6_6_2/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#parse-java.lang.String-

Related

How to speed up a search on large collection of text files (1TB)

I have a collection of text files containing anonymised medical data (age, country, symptoms, diagnosis, etc.). This data goes back at least 30 years, so as you can imagine I have quite a large data set: around 20,000 text files totalling approximately 1 TB.
Periodically I will need to search these files for occurrences of a particular string (not a regex). What is the quickest way to search through this data?
I have tried using grep and recursively searching through the directory as follows:
LC_ALL=C fgrep -r -i "searchTerm" /Folder/Containing/Files
The only problem with doing the above is that it takes hours (sometimes half a day!) to search through this data.
Is there a quicker way to search through this data? At this moment I am open to different approaches such as databases, elasticsearch etc. If I do go down the database route, I will have approx. 1 billion records.
My only requirements are:
1) The search will be happening on my local computer (Dual-Core CPU and 8GB RAM)
2) I will be searching for strings (not regex).
3) I will need to see all occurrences of the search string and the file it was found in.
There are a lot of answers already, I just wanted to add my two cents:
Having this much data (1 TB) with just 8 GB of memory will not be good enough for any approach, be it Lucene, Elasticsearch (which internally uses Lucene) or a grep command, if you want faster search. The reason is simple: all of these systems hold data in the fastest memory available in order to serve requests quickly, and out of 8 GB (of which you should reserve 25% for the OS and at least another 25-50% for other applications) you are left with very few GB of RAM.
Upgrading to an SSD or increasing the RAM on your system will help, but it's quite cumbersome, and if you hit performance issues again it will be difficult to scale your system vertically.
Suggestion
I know you already mentioned that you want to do this on your own system, but as I said it won't give you any real benefit, and you might end up wasting a lot of time on both infrastructure and code (there are many possible approaches, as the various answers show). Hence I would suggest the top-down approach mentioned in another answer of mine for determining the right capacity. It will quickly help you identify the correct capacity for whatever approach you choose.
Implementation-wise, I would suggest doing it with Elasticsearch (ES), as it is very easy to set up and scale. You can even use AWS Elasticsearch, which is available in the free tier as well, and quickly scale it later. Although I am not a big fan of AWS ES, it saves a lot of setup time and you can get started quickly if you are reasonably familiar with ES.
To make search faster, you can split each file into multiple fields (title, body, tags, author, etc.) and index only the important fields, which reduces the size of the inverted index. If you are only looking for exact string matches (no partial or full-text search), you can simply use a keyword field, which is even faster to index and search.
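Since Elasticsearch is built on Lucene, here is a rough Lucene-level sketch of that field-splitting idea; the field names, example values and index path are made up for illustration, and in ES itself you would express the same thing through an index mapping that declares the exact-match field as type keyword.

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class FieldSplitSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/medical-index"));  // hypothetical location
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // Exact-match, non-analyzed field (the "keyword" equivalent): small inverted index, fast lookups.
            doc.add(new StringField("diagnosis", "type 2 diabetes", Field.Store.YES));
            // Analyzed full-text field: only index it if you genuinely need free-text search over it.
            doc.add(new TextField("body", "patient reported fatigue and increased thirst", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}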
I could go on about why Elasticsearch is good and how to optimize it, but that's not the crux. The bottom line is that any search needs a significant amount of memory, CPU and disk, and any one of them becoming a bottleneck will hamper your local searches and your other applications. Hence I advise you to seriously consider doing this on an external system, and Elasticsearch really stands out, as it is meant for distributed systems and is the most popular open-source search system today.
You clearly need an index, as almost every answer has suggested. You could totally improve your hardware but since you have said that it is fixed, I won’t elaborate on that.
I have a few relevant pointers for you:
Index only the fields in which you want to find the search term rather than indexing the entire dataset;
Create a multilevel index (i.e. an index over the index) so that your index searches are quicker. This will be especially relevant if your index grows to more than 8 GB;
I considered recommending caching of your searches as an alternative, but a new, uncached search would still take half a day. So preprocessing your data to build an index is clearly better than processing the data as each query comes in.
Minor Update:
A lot of answers here suggest putting the data in the cloud. Even for anonymized medical data, I'd highly recommend that you confirm with the source (unless you scraped the data from the web) that this is OK to do.
To speed up your searches you need an inverted index. To be able to add new documents without re-indexing all existing files, the index should be incremental.
One of the first open-source projects to introduce incremental indexing was Apache Lucene. It is still the most widely used indexing and search engine, although other tools that extend its functionality are more popular nowadays: Elasticsearch and Solr are both based on Lucene. But as long as you don't need a web frontend, analytical querying, filtering, grouping, support for indexing non-text files, or an infrastructure for a cluster setup over multiple hosts, Lucene is still the best choice.
Apache Lucene is a Java library, but it ships with a fully functional, command-line-based demo application. This basic demo should already provide all the functionality that you need.
With some Java knowledge it would also be easy to adapt the application to your needs. You will be surprised how simple the source code of the demo application is. If Java shouldn't be the language of your choice, its Python wrapper, PyLucene, may also be an alternative. The indexing in the demo application is already reduced nearly to the minimum. By default no advanced functionality is used, like stemming or optimization for complex queries - features you most likely will not need for your use case but which would increase the size of the index and the indexing time.
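If you do end up adapting it rather than using the demo as-is, the core of an index-then-search program is small. A minimal sketch under assumed names (the index path, file path and example content are invented; error handling is omitted):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TinySearch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = FSDirectory.open(Paths.get("/tmp/tiny-index"));  // hypothetical location

        // Indexing is incremental: new files can be added later without re-indexing the old ones.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StringField("path", "/data/file-0001.txt", Field.Store.YES));
            doc.add(new TextField("contents", "patient reported fever and cough", Field.Store.NO));
            writer.addDocument(doc);
        }

        // Searching: print the stored path of every matching file.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("contents", analyzer).parse("\"reported fever\"");
            for (ScoreDoc hit : searcher.search(query, 100).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path"));
            }
        }
    }
}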
I see 3 options for you.
You should really consider upgrading your hardware; an HDD -> SSD upgrade can multiply search speed several times over.
Increase the speed of your search on the spot.
You can refer to this question for various recommendations. The main idea of this method is to optimize CPU load, but you will be limited by your HDD speed. The maximum speed multiplier is the number of cores you have.
You can index your dataset.
Because you're working with text, you will want a full-text search database. Elasticsearch and Postgres are good options.
This method requires more disk space (but usually less than 2x extra, depending on the data structure and the list of fields you want to index).
This method will be orders of magnitude faster (seconds).
If you decide to use this method, select the analyzer configuration carefully so that it matches what is considered to be a single word for your task (here is an example for Elasticsearch).
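To illustrate what "analyzer configuration" means in practice, here is a small Lucene-level sketch (Elasticsearch exposes the same knobs in its index settings; the particular tokenizer and filters below are only an example, not a recommendation for this data set):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class AnalyzerChoice {
    public static void main(String[] args) throws Exception {
        // The tokenizer decides what counts as a single word; the filters normalise the tokens
        // before they are written to the index (and the same chain is applied to queries).
        Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer("standard")       // split on Unicode word boundaries
                .addTokenFilter("lowercase")     // case-insensitive matching
                .addTokenFilter("asciifolding")  // fold accented characters, e.g. é -> e
                .build();
        System.out.println(analyzer);
    }
}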
It is worth covering the topic at two levels: the approach, and the specific software to use.
Approach:
Based on the way you describe the data, it looks like pre-indexing will provide significant help. Pre-indexing performs a one-time scan of the data and builds a compact index that makes it possible to perform quick searches and identify where specific terms appear in the repository.
Depending on the queries, the index will reduce or completely eliminate the need to search through the actual documents, even for complex queries like 'find all documents where AAA and BBB appear together'.
Specific Tool
The hardware that you describe is relatively basic. Running complex searches will benefit from large-memory, multi-core hardware. There are excellent solutions out there - Elasticsearch, Solr and similar tools can do magic given strong hardware to support them.
I believe you want to look into two options, depending on your skills and the data (it would help if the OP could share a sample of the data):
* Build your own index using a lightweight database (SQLite, PostgreSQL), OR
* Use a lightweight search engine.
For the second approach, given the described hardware, I would recommend looking into 'glimpse' (and the supporting agrep utility). Glimpse provides a way to pre-index the data, which makes searches extremely fast. I've used it on big data repositories (a few GB, but never TB).
See: https://github.com/gvelez17/glimpse
Clearly, it is not as modern and feature-rich as Elasticsearch, but it is much easier to set up and it is server-less. The main benefit for the use case described by the OP is the ability to scan existing files without having to load the documents into an extra search-engine repository.
Could you consider ingesting all this data into Elasticsearch, provided it has a consistent structure?
If yes, below are the quick steps:
1. Install filebeat on your local computer
2. Install elasticsearch and kibana as well.
3. Export the data by having Filebeat send it all to Elasticsearch.
4. Start searching it easily from Kibana.
FSCrawler might help you index the data into Elasticsearch. After that, normal Elasticsearch queries can serve as your search engine.
Caching the most recently searched medical data might also help performance-wise; instead of going through the whole 1 TB each time, you could use Redis/Memcached.

full text search in databases

I have two fairly general questions about full-text search in a database. I was looking into Elasticsearch and Solr, and it seems to me that one needs to produce separate documents made up of table entries, which then get searched. So the result of such a search is not actually a database entry? Or did I misunderstand something?
I also looked into Whoosh, which does index table columns, and the results of a Whoosh search are actual table rows.
When using Solr or Elasticsearch, should I put the row id into the document that gets indexed and, once I have my result, use that id to retrieve the relevant rows from the table? Or is there a better solution?
Another question I have: if I have an id like abc/123.64664, which is stored as a string, is there any advantage in searching such a column with FTS? It seems to me there is not much to be gained by indexing it. Or am I wrong?
thanks
Elasticsearch can store the indexed document, and you can retrieve it as part of the query result. Usually people still store the original data in a regular DB; it gives you more reliability and flexibility for re-indexing. Mind that ES indexes non-relational data: you can keep your data stored in relational form and compose denormalized documents for indexing.
As for "abc/123.64664", you can index it as a tokenized string or you can tune the index for prefix search, etc. It's up to you.
(TL;DR) Don't think about how your data is structured in your RDBMS. Think about what you are searching for.
Content storage for good full-text search is quite different from standard relational database storage. So your data, once it goes into the search engine, can end up looking quite different from the way you stored it.
This is all driven by your expected search results. You may increase the granularity of the data or - the opposite - denormalize it so that parent/related record content shows up in the records you actually want returned as part of the search. Text processing (copyField, tokenization, pre-processing, etc.) is also where a lot of content modification happens to make a record findable.
Sometimes, relational databases support full-text search. PostgreSQL is getting better and better at that. But most of the time, relational databases just do not provide enough flexibility to support good relevancy-driven search.
Finally, if the original schema is quite complex, it may make sense to use the search engine only to get the right - relevant - IDs out, and then merge them in the client code with the details from the original database records.
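To make that last pattern concrete, here is a minimal Lucene-level sketch of "store only the row id, search the text, fetch the rows from the database afterwards" (the same idea applies to Solr and Elasticsearch; the field names, index path and SQL comment are invented for illustration):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IdLookupSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = FSDirectory.open(Paths.get("/tmp/row-index"));  // hypothetical location

        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new StoredField("row_id", 42L));                                        // kept, never searched
            doc.add(new TextField("body", "full text content of row 42", Field.Store.NO));  // searched, never kept
            writer.addDocument(doc);
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(new QueryParser("body", analyzer).parse("content"), 10).scoreDocs) {
                long rowId = searcher.doc(hit.doc).getField("row_id").numericValue().longValue();
                // ...then back in the relational database: SELECT * FROM my_table WHERE id = rowId
                System.out.println(rowId);
            }
        }
    }
}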

Fastest way to search a SQL Server table (or indexed view) column with "like '%search%'"?

Suppose there's a table with columns (UserID, FieldID, Value), with half a million records. I want to see if some search term T(N) occurs anywhere in each Value (i.e. Value.Contains( T(N) ) ).
I think I'm just hitting a wall volume-wise; there are just too many values to sift through. I don't think a Full Text index will help, because it's only useful for StartsWith queries that look at individual words, not occurrences anywhere within the string.
Is there a good approach to indexing this kind of data for such a search in SQL Server?
A half-million records is not terribly large, although I don't know the size of the field contents. A couple of ideas - this was too long for a comment or else I would have posted it as one.
You could implement a full-text search engine like Elastic, Solr, etc. and use it as a sidecar. If, when you are doing text searches, you are not otherwise making much use of the other data, this might be easy enough. Note that you could put other data for searching into Elastic or Solr, but I'm not sure you'd want to duplicate all your data, and those tools aren't really great as a transactional data store.
Another option for volumes this small, assuming you only need basic "contains" searching: create two more tables, keywords and keyword_index (or whatever). When saving, tokenize your text content, write out any new keywords to the keywords table, and then add the data to the join table. Index everything, and then run your search off the keywords table, joining back to the master via the intermediate keyword_index table.
This is fairly hackish, and getting your keyword handling really dialed in (for stemming, etc.) may be a pain. It is a reasonable quick & dirty solution for smaller-scale needs, though.
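A rough sketch of what the save-time tokenization could look like from application code (all table names, column names and the tokenization rule are hypothetical, and duplicate links are not de-duplicated):

import java.sql.Connection;
import java.sql.PreparedStatement;

public class KeywordIndexer {
    // Tokenize a row's text and write the keywords plus the join rows to the two extra tables.
    public static void indexRow(Connection conn, long valueRowId, String text) throws Exception {
        try (PreparedStatement upsertKeyword = conn.prepareStatement(
                 "IF NOT EXISTS (SELECT 1 FROM keywords WHERE word = ?) INSERT INTO keywords (word) VALUES (?)");
             PreparedStatement insertLink = conn.prepareStatement(
                 "INSERT INTO keyword_index (word, value_row_id) VALUES (?, ?)")) {
            for (String word : text.toLowerCase().split("\\W+")) {  // naive tokenization
                if (word.isEmpty()) {
                    continue;
                }
                upsertKeyword.setString(1, word);
                upsertKeyword.setString(2, word);
                upsertKeyword.executeUpdate();
                insertLink.setString(1, word);
                insertLink.setLong(2, valueRowId);
                insertLink.executeUpdate();
            }
        }
    }
}

Searching then becomes a plain indexed equality lookup on keyword_index joined back to the master table, instead of a LIKE '%search%' scan.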

Using Lucene to index private data, should I have a separate index for each user or a single index

I am developing an Azure based website and I want to provide search capabilities using Lucene. (structured json objects would be indexed and stored in Lucene and other content such as Word documents, etc. would be indexed in lucene but stored in blob storage) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these 3 objectives. (I am listing them here so if anyone is kind enough to answer, they will have better idea of what I am trying to do)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
You can do even better than what you suggested by using ManifoldCF (Apache product that knows how to handle Solr) to manage security.
And one off topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
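If you do go with a single index, a common way to make sure a user can only ever match their own documents is to wrap every user query in a mandatory filter on their ID. A minimal sketch (the "userId" field name is an assumption; it must be indexed as a non-analyzed field such as a StringField):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class UserScopedQuery {
    // Combine whatever the user typed with a non-scoring filter on their own ID,
    // so documents belonging to other users can never be returned.
    public static Query scopeToUser(Query userQuery, String userId) {
        return new BooleanQuery.Builder()
                .add(userQuery, BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("userId", userId)), BooleanClause.Occur.FILTER)
                .build();
    }
}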
Until you use several machines for indexing, I think it's more convenient to use a single index. The Lucene community has done a lot of work to make the indexing process as efficient as it can be. So unless you intentionally want to implement distributed indexing, I don't recommend splitting indexes.
However, there are several reasons why you might want to split indexes:
If your machine has several IO devices that could be utilized in parallel. In this case, if you are IO-bound, splitting indexes is a good idea.
Splitting document fields between indexes (this is what ParallelReader is intended for). This is a more exotic form of splitting, but it may be a good idea if searches are performed using different groups of fields. Suppose we have two types of search query: the first uses the fields name and type, and the second uses the fields price and discount. If those fields are updated at different rates (I would guess name updates are far rarer than price updates), updating only part of the index would require fewer IO resources. This would give the system more overall throughput.

In CouchDB, are there ways to improve performance of the View index process?

I have some basic views and some map/reduce views with logic. Nothing too complex. Not too many documents. I've tried with 250k, 75k, and 10k documents. Seems like I'm always waiting for view indexing.
Does better, more efficient code in the view help? I'm assuming it's basically processing the view at all levels of aggregation. So there must be some improvement there.
Does emit()-ing less data help? emit(doc.id, doc) vs specifying fewer fields?
Do more or less complex keys impact view indexing?
Or is it all about memory, CPU cores, and processor speed?
There must be some documentation out there, but I can't find anything referencing ways to improve performance.
I would take a deeper look at the reduce function. Try to use the built-in Erlang functions like _sum and _count instead of writing JavaScript.
Complex views can take hours and more, that's normal.
Maybe post one of these not-too-complex map/reduce functions.
And don't forget: indexing all docs is only done once, after changing the view (or pushing a whole bunch of new docs). Subsequent new docs are indexed incrementally.
Use a view with &stale=ok to retrieve the "old" data instantly, so you don't have to wait. (But pay attention: you always have to call the view without stale=ok at least once to trigger the indexing process.) Or better: use stale=update_after.
The code you write in views is more like CREATE INDEX than SELECT. It should be irrelevant how long it takes, as long as the view builds keep up with the document change rate. Building a view is a sunk (one-time) cost.
When you query the view, that is always a binary tree scan, which operates against a static data set in logarithmic time. That is usually the performance people care about more (in production.)
If you are not seeing behavior like I describe, perhaps we could discuss your view functions and your general approach to your problem. CouchDB is very different from relational databases. In the latter, you have highly structured data and free-form queries. In CouchDB, you have free-form data but highly structured index definitions (views). Except during development, changing and rebuilding views should be rare.
Not emitting anything will help, but doing the view creation in smaller batches (there are scripts that do this automagically) helps more than anything other than not emitting anything at all, which sometimes can't be avoided.
