Is Lucene.Net suitable as the search engine for frequently changing content? - search

Is Lucene.Net suitable as the search engine for frequently changing content?
Or more specificically, can anybody give a subjective opinion on how quickly lucene.net indexes can be updated. Any other approaches to searching frequently changing content would be great.
We’re developing a forum. Forum posts will be frequently added to the forum repository. We think we need these posts to be added to lucene index very quickly (<0.5s) to become available to search. There’ll be about 5E6 posts in the repository initially. Assume search engine running on non-exotic server (I know this is very vague!).
Other suggestions with regard to addressing the issue of searching frequently changing content appreciated. The forum posts need to be searchable on a variable number of named tags (tag name and value must match). A SQL based approach (based on Toxi schema) isn’t giving us the performance we’d like.

Out forums (http://episteme.arstechnica.com) use Lucene as the search backend, so it's doable. Posts aren't indexed quite as quickly as you'd like, but we could solve that by beefing up the indexing hardware and using a smarter caching strategy.
The general answer to this question is: it depends what your write/update pattern is. Forums are relatively easy, since most content is new and existing content is updated less frequently.
For a forum, I'd recommend having an "archive" index and a "live" index. The live index might include posts from the last day, week, year, while the archive index will include a large body of posts that probably won't ever be touched again. So when someone creates a new post, it will initially be indexed in the live index. At a later time, some batch job would clear out the live index, and reindex everything into the archive.
Lucene's very good at querying across multiple indexes. You should abuse that ability. :)

Lucene.Net is extremely fast, however there are many things that can slow down queries when used wrong. I strongly recommend reading the Lucene in Action book by Erik Hatcher and Otis Gospodnetić. It contains a very good chapter about performance testing and tuning.

Related

Search documents that contain some text, but keep information about what field matches

I'm learning node.js and mongodb. I'm learning by solving some problems. I want to make site that can search video database. Each video has title, description, author and a subarray of notes (you can think of it as a comments). Each note has a subarray of manual references to tags documents that exists in tags collection.
I need to search for some text in videos collection. For each resulting video I need to know if search criteria matches some of basic fields (author, title, description) or if some of its notes, including names of tags, matches criteria. Or both.
I know that this may not be right task for beginner but I would really like to make this work. I have some ideas about how to do this, but they probably are not good since I don't know much about mongo and it's capabilities.
What do you suggest, what should I use? Should I use text search capabilities + some aggregation? Should I offload some of work to be processed by application rather than mongo?
I probably don't need details, just directions.
Thank you.
Since nobody answers, I decided to share my idea and how I did it. There is probably better solution, that is way I asked this question.
I did two separate queries using regex, that I merged results in application code.
I used ES6 Map to make union of these two sets.

Cocuhdb a fit for a query like this?

I have been looking for an escape from GAE as the datastore does not support a lot of the things I want to do with it.
So I have looked at CouchDB (among others) and I really like the REST interface and the hosting option I found at Cloudant.
But for all my googling and reading any docs I could find, I still am not sure if it is a good fit.
So I come here in the hope that someone might have more insight.
I write web apps and a lot of the projects I want to do will involve a query that looks like this:
Find all entries that are within a user-input-lat/long bounding box and where start-time is less than user-input-time-1 and end-time is greater than user-input-time-2 and has all tags in user-input-list-of-tags.
Thats not even pseudocode, but I hope it makes sense anyway.
I am not just looking for a "You cannot do that in CouchDB". Some kind of explanation and perhaps something like "If you can live without the tags then you can do this:"
I would like to use the Cloudant service so GeoCouch is apparently out of the question, but they do something that should work like lucene, but does that mean the queries are slow?
As you can tell, I am a bit confused here, so just do your best to straighten me out and I'll be greatfull :)
Not mentioning the tags (which in itself is already a problem), what you describe is a multi-dimensional query : you have several "coordinates" (lat, long, start-time, end-time) and provide a range for each of these coordinates.
On its own, CouchDB cannot perform multi-dimensional queries at all — you only get single-dimension queries across one coordinate.
Tags certainly are possible, but it depends on whether you need documents that have at least one tag in the list, or documents that have all tags in the list. The first case is easy (run one query per tag using the bulk API), the second might require excessive amounts of memory (if a document has N tags, it needs to emit 2N-1 tag-sets in order to match all possible tag combinations involving it, so you should place an upper bound on either the number of tags in a document, or the number of tags in a query).
Lucene does allow multi-dimensional and keyword-based queries, though I cannot vouch for their performance.

real time indexing of live data stream

Let us say, I am maintaining an index of a lot of documents. I want to update the index for newly arriving data to make it as real time as possible. What kind of indexing tool do I need to look at ? I have looked at Sphinx and Lucene and from previous posts, they are recommended for real time indexing.
The delta indexing mechanism used in Sphinx looks like a pretty neat idea.
Some questions I have are
1) How quickly can the document be searchable once it arrives ?
2) How efficient is the index merge process ? (merging the delta index and the main Index)
I understand these are very general questions and I wanted to get an idea if using Sphinx would be the right way to go about this problem.
Sphinx have real-time indexes which allow add/update/delete indexes on the fly.
You can look at Apache Solr (NRT) and Elastic Search for real-time implementations using Lucene. You can look at some benchmarks.

Does a B Tree work well for auto suggest/auto complete web forms?

Auto suggest/complete fields are used all over the web. Google has appeared to master it given that as soon as one types in a search query, suggestions are returned almost instantaneously.
I'm assuming the framework for achieving this involves a fast, in-memory data store on the web tier. We're building a Grails app based around retail products, so a user may search for Can which should suggest things like Canon, Cancun, etc, and wondering if a Java B-tree cached in memory would suffice for quick auto completes returned as JSON over AJAX. Outside of the jQuery AutoComplete field, do any frameworks and/or libraries exist to facilitate the development of this solution?
Autocomplete is a text matching, information retrieval problem. Implementing your own B-tree and writing your own logic to match words to other words is something you could do. But then you would have to implement Porter Stemming, a Vector Space Model, and a String-edit distance calculation.
...or you could use Lucene and its derivatives, which do a lot of this stuff already. If you really care about the data structures used to store this stuff, you could dive into its source. But I highly doubt writing your own and doing it all yourself would be more maintainable and efficient in the long run.
One of the more popular Grails ecosystem plugins for this is Searchable, which was mentioned in Ledbrook & Smith's Grails in Action. It uses Lucene under the covers, and make sit pretty easy to add full-text search to your domain classes. (For example, check out chapter 8 in GinA or the searchable docs).
The Grails Richui plugin has an autocomplete that I've used in the past. We had it hooked up to hit the database every keystroke (which I would not suggest but our data changed often enough that real-time data was required). If your list of things is pretty static though then it could probably work well for you.
http://grails.org/plugin/richui#AutoComplete

Using Lucene like a relational database

I am just wondering if we could achieve some RDBMS capabilities in lucene.
Example:
1) I have 10,000 project documents (pdf files) which have to be indexed with their content to make them available for search.
2) Every document is related to a SINGLE PROJECT. The project can contain details like project name, number, start date, end date, location, type etc.
I have to search in the contents of the pdf files for a given keyword, but while displaying the results I want to display the project meta data as mentioned in point (2).
My idea is to associate a field called projectId with each pdf file while indexing. Once we get that, we will fire search again for getting project meta data.
This way we could avoid duplicated data. Also, if we want to update the project meta data we will end up updating at a SINGLE PLACE only. Otherwise if we store this meta data with all the pdf doument indexes, we will end up updating all of the documents, which is not the way I am looking for.
please advise.
If I understand you correctly, you have two questions:
Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
Can I use this project id to search Lucene for project meta data? Yes, you can. I do not know if this is a good idea. It depends on the frequency of your meta data updates and your access pattern. If the meta data is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key to a database table, which could be a better fit.
Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and project metadata at the same time. For example, "documentText:foo OR projectName:bar" . If you have no such requirement, then seems like storing the ID in Lucene which refers to a database row is a fine thing to do.
I am not sure on your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a fulltext search engine like Lucene. The meta data could live in the database, maybe together with the original pdf documents, while the Lucene documents just contain the searchable data.
This is definitely possible. But always be aware of the fact that you're using Lucene for something that it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your system your relational content becomes, the more you'll see a decrease in performance.
In particular, there are a few areas to keep a close eye on:
Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.
If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.
As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as multi-faceted search.
You can use Lucene that way;
Pros:
Full-text search is easy to implement, which is not the case in an RDBMS.
Cons:
Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.

Resources