Lucene has a great capability for incremental indexing, which is normally a pain to implement when developing an IR system from scratch.
I would like to know whether I can use the low-level Lucene APIs to use it purely as an inverted index, i.e., as storage for inverted lists, position information, term frequencies, idfs, stored fields, and so on.
The bottom line is that I want to implement my own weighting and scoring of documents. I'm aware of the Similarity class, but it does not give me the flexibility I want.
You can certainly write your own query class, your own scorers, and so on. The only problem you might run into is if you need global data (e.g. for tf/idf you need to know, well, the term frequency and the inverse document frequency). If there is some other cross-document or cross-term metadata you need for your scoring algorithm, you might run into trouble, because there isn't a great way that I know of to store it.
But basically, as long as your algorithm is roughly tf/idf-like or works per document only, I think you should be fine.
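If you want to go below the Query/Scorer level, the raw postings themselves are also accessible. Here is a minimal sketch, assuming the Lucene 5.x postings API and made-up index path, field and term names ("/path/to/index", "body", "lucene"), that walks a posting list and reads the statistics a custom weighting scheme would need (doc freq, per-document term freq, positions; positions are only available if the field was indexed with them):

import org.apache.lucene.index.*;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import java.nio.file.Paths;

public class PostingsDump {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            for (LeafReaderContext ctx : reader.leaves()) {
                Terms terms = ctx.reader().terms("body");
                if (terms == null) continue;
                TermsEnum te = terms.iterator();
                if (!te.seekExact(new BytesRef("lucene"))) continue;
                // Per-segment statistics: document frequency and total term frequency.
                System.out.println("docFreq=" + te.docFreq() + " totalTermFreq=" + te.totalTermFreq());
                // Walk the posting list: per-document term frequency and positions.
                PostingsEnum pe = te.postings(null, PostingsEnum.ALL);
                for (int doc = pe.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = pe.nextDoc()) {
                    int freq = pe.freq();
                    StringBuilder positions = new StringBuilder();
                    for (int i = 0; i < freq; i++) positions.append(pe.nextPosition()).append(' ');
                    System.out.println("doc=" + (ctx.docBase + doc) + " tf=" + freq + " positions=" + positions);
                }
            }
        }
    }
}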
I have a number of location documents in my CouchDB, each with longitude and latitude fields. How can I find all location documents in the database whose distance from a provided latitude and longitude is less than a provided distance?
There is a way to achieve this using vanilla CouchDB, but it's a bit tricky.
You can use the fact that you can apply two functions during one request: the second map step can be implemented with the list mechanism.
Lists are not very efficient computationally, and they cannot cache results the way views do. But they have one unique feature: you can pass several arguments into a list. Moreover, one of your arguments could even be JS code that is eval-ed inside the list function (risky!).
So the entire scheme looks like this:
Make a view that performs a coarse search.
Make a list that receives custom parameters and refines the data set.
Make a client-side API to ease querying this chain.
I can't provide exact code for your particular case, since many details are unclear, but it seems that the coarse search would have to group results into some kind of linearly enumerated squares, with the list performing the more precise calculations; a rough client-side sketch follows at the end of this answer.
Please note that this scheme might be inefficient for large datasets, since it is computationally hungry.
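The client-side piece (step 3) can be little more than a single HTTP call. Here is a minimal Java sketch, assuming a hypothetical design document "geo" with a coarse view "squares" and a list function "refine" that reads lat/lon/radius from req.query; the database and parameter names are placeholders:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GeoQueryClient {
    // "locations", "geo", "squares" and "refine" are all hypothetical names;
    // adapt them to your own database and design document.
    public static String findNearby(double lat, double lon, double radiusKm) throws Exception {
        String url = String.format(
            "http://localhost:5984/locations/_design/geo/_list/refine/squares?lat=%f&lon=%f&radius=%f",
            lat, lon, radiusKm);
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        // The list function reads lat/lon/radius from req.query and filters the
        // coarse view rows down to documents within the requested distance.
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}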
Vanilla CouchDB isn't really built for geospatial queries.
Your best bet is to either use GeoCouch, CouchDB-Lucene or something similar.
Failing that, you could emit a Geohash from your map function, and do range queries over those.
Caveats apply. Queries around Geohash "fault lines" (the equator, the poles, longitude 180, etc.) can return too many or too few results.
There are multiple JavaScript libraries that can help convert to/from Geohash, as well as help with some of those caveats.
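To make the range-query idea concrete, here is a small sketch (my own illustration, not taken from any library) of building the startkey/endkey pair for a view keyed on Geohash strings; it relies on CouchDB's string collation, where a prefix plus a high sentinel character selects every key beginning with that prefix:

public class GeohashRange {
    // Given a Geohash prefix covering the search area, build the startkey/endkey
    // pair for a CouchDB view whose keys are Geohash strings.
    public static String[] prefixRange(String geohashPrefix) {
        return new String[] {
            "\"" + geohashPrefix + "\"",
            "\"" + geohashPrefix + "\ufff0\""  // high sentinel: matches everything under the prefix
        };
    }

    public static void main(String[] args) {
        String[] range = prefixRange("u0q2");  // "u0q2" is just an example prefix
        // Remember to URL-encode both keys before appending them to the view URL.
        System.out.println("?startkey=" + range[0] + "&endkey=" + range[1]);
    }
}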
CouchDB is not built for dynamic queries, so there is no good/fast way of implementing this in vanilla CouchDB.
If you know beforehand which locations you want to calculate the distance from, you could create a view for each location and call it with the parameters ?startkey=0&endkey=max_distance:
function(doc) {
  // Your implementation of a great-circle distance (e.g. the haversine formula).
  function distance(from, to) { /* your function for calculating distance */ }
  var NY = {lat: 40, lon: 73}; // the fixed reference location baked into this view
  // Key each document by its distance from NY, so a range query on the key
  // (?startkey=0&endkey=max_distance) returns only documents within that distance.
  emit(distance(NY, doc), doc._id);
}
If you do not know the locations beforehand, you could solve it by using a temporary view, but I would strongly advise against that, since temporary views are slow and should only be used for testing.
We have enums, free text, referenced fields, etc. in our DB.
Each enum has its own translations, and the free text could be in any language. We'd like to do efficient, large-scale free-text searching as well as searching on enum values.
I know of solutions like Solr, which are nice, but that would mean we'd have to index entire denormalized records with all the text of all the languages in the system. That seems a bit excessive.
What are some recommended approaches for searching multilingual normalized data? Anyone tackle this before?
ETL. Extract, Transform, Load. In other words, get the data out of your existing databases, transform it (which is more than merely denormalizing it) and load it into SOLR. The SOLR db will be a lot smaller than the existing databases because there is no relational overhead. And SOLR search takes most of the load off of your existing database servers.
Take a good look at how to configure and use SOLR and learn about SOLR cores. You may want to put some languages in separate cores because that way you can more effectively use the various stemming algorithms in SOLR. But even with multilingual data you can still use bigrams (such as are used with Chinese language analysis).
Having multiple cores makes searching a bit more complex since you can try either a single language index, or an all-languages index. But it is much more effective to group language data and apply language specific stopwords, protected words, stemming and language analysis tools.
Normally you would include some key data in the index so that when you find a record via SOLR search, you can then reference directly into the source db. Also, you can have normalised and non-normalised data together, for instance an enum could be recorded in a normalised field in English as well as a non-normalised field in the same language as the free-text. A field can be duplicated in order to apply two different analysis and filtering treatments.
It would be worth your while to trial this with a subset of your data in order to learn how SOLR works and how best to configure it.
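As a very rough sketch of the load step, assuming SolrJ 6+ and purely hypothetical core and field names ("docs_en", "status_enum", "body_txt_en"), the idea of storing a key back into the source DB alongside normalised and non-normalised versions of the same value might look like this:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class EtlLoadSketch {
    public static void main(String[] args) throws Exception {
        // One core per language so each core can use language-specific analysis,
        // stopwords and stemming. "docs_en" is a hypothetical core name.
        try (SolrClient english = new HttpSolrClient.Builder("http://localhost:8983/solr/docs_en").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "record-42");          // primary key referencing back into the source DB
            doc.addField("status_enum", "APPROVED");  // normalised enum value for exact filtering
            doc.addField("status_text", "Approved");  // same enum, analysed like the free text
            doc.addField("body_txt_en", "Free text of the record in English ...");
            english.add(doc);
            english.commit();
        }
    }
}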
Does anybody know whether Riak Search has the ability to generate an excerpt with highlight points in it, similar to what Lucene does?
Riak Search doesn't expose this functionality out of the box, but with a little work you can create a rough approximation.
Riak Search allows you to feed search results into a MapReduce job. If you do this, then your Map or Reduce function will also get a list of token positions in the document that matched the query (this is exposed as keydata, http://www.basho.com/search.php?q=keydata). Using these positions, you can write code to mark up the document or excerpt portions of text.
I think this functionality is unlikely ever to be implemented in Riak, since its philosophy implies that it does not care what exactly is stored in the values, and therefore does not process them in any meaningful way beyond providing some metadata such as indexes.
I'm working on a system that performs matching on large sets of records based on strings, numeric ranges, and date ranges. The string matches are mostly exact matches as far as I can tell, as opposed to the less exact full-text-search results that I understand Lucene is generally designed for. Numeric precision is important, as the data concerns prices.
I noticed that Lucene recently added some support for numeric range searching, but it's not something it was originally designed for.
Currently the system uses procedural SQL to do the matching, and we are reaching the limits of its scalability. I'm researching ways to scale the system horizontally, and using search-engine technology seems like a possibility, given that there are technologies that can scale to very large data sets while returning results very quickly. I'd like to investigate whether it's possible to take a lot of load off the database by doing the matching with the Lucene-generated metadata, without hitting the database for the full records until the matching rules have determined what should be retrieved. I would eventually like to aim for near-real-time results, although we are a long way from that at this point.
My question is as follows: Is it likely that Lucene would perform many times faster and scale to greater data sets more cheaply than an RDBMS for this type of indexing and searching?
Lucene stores its numeric values as a trie; a SQL implementation will probably store them in a B-tree or an R-tree. The way Lucene stores its trie and the way SQL uses an R-tree are pretty similar, and I would be surprised if you saw a huge difference (unless you leveraged some of the scalability that comes with Solr).
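For reference, here is a minimal sketch of indexing and querying a trie-encoded price field, assuming the Lucene 5.x API (later versions replaced these classes with point fields); the field names and values are illustrative only:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.DoubleField;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class PriceRangeDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("sku", "ABC-123", Field.Store.YES));  // exact-match string field
            doc.add(new DoubleField("price", 19.99, Field.Store.YES));    // trie-encoded numeric field
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Range query over the trie: 10.00 <= price <= 25.00 (both ends inclusive).
            NumericRangeQuery<Double> query = NumericRangeQuery.newDoubleRange("price", 10.0, 25.0, true, true);
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("sku"));
            }
        }
    }
}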
As a general question of the performance of Lucene vs. SQL fulltext, a good study I've found is: Jing, Y., C. Zhang, and X. Wang. “An Empirical Study on Performance Comparison of Lucene and Relational Database.” In Communication Software and Networks, 2009. ICCSN'09. International Conference on, 336-340. IEEE, 2009.
"First, when executing exact query, the performance of Lucene is much better than that of unindexed-RDB, while is almost same as that of indexed-RDB. Second, when the wildcard query is a prefix query, then the indexed-RDB and Lucene both perform very well still by leveraging the index... Third, for combinational query, Lucene performs smoothly and usually costs little time, while the query time of RDB is related to the combinational search conditions and the number of indexed fields. If some fields in the combinational condition haven't been indexed, search will cost much more time. Fourth, the query time of Lucene and unindexed-RDB has relations with the record complexity, but the indexed-RDB is nearly independent of it."
In short, if you are doing a search like "select * where x = y", it doesn't matter which you use. The more clauses you add in (x = y OR (x = z AND y = x)...) the better Lucene becomes.
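To show what such a combined query looks like on the Lucene side, here is a small sketch using the BooleanQuery builder (Lucene 5.3+); the fields and terms simply mirror the pseudo-SQL above:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CombinedQueryExample {
    public static Query build() {
        // Inner clause: x = z AND y = x
        BooleanQuery inner = new BooleanQuery.Builder()
            .add(new TermQuery(new Term("x", "z")), BooleanClause.Occur.MUST)
            .add(new TermQuery(new Term("y", "x")), BooleanClause.Occur.MUST)
            .build();
        // Whole query: x = y OR (x = z AND y = x)
        return new BooleanQuery.Builder()
            .add(new TermQuery(new Term("x", "y")), BooleanClause.Occur.SHOULD)
            .add(inner, BooleanClause.Occur.SHOULD)
            .build();
    }
}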
They don't really mention this, but a huge advantage of Lucene is all the built-in functionality: stemming, query parsing etc.
I suggest you read Marc Krellenstein's "Full Text Search Engines vs DBMS".
A relatively easy way to start using Lucene is by trying Solr. You can scale Lucene and Solr using replication and sharding.
At its heart, and in its simplest form, Lucene is a word-density search engine. Lucene can scale to handle extremely large data sets and, when indexed correctly, returns results at blistering speed. For text-based searching it is possible, and very probable, that search results will come back quicker from Lucene than from SQL Server/Oracle/MySQL. That being said, it is unfair to compare Lucene to a traditional RDBMS, as they have completely different usages.
I am just wondering if we could achieve some RDBMS capabilities in Lucene.
Example:
1) I have 10,000 project documents (pdf files) which have to be indexed with their content to make them available for search.
2) Every document is related to a SINGLE PROJECT. The project can contain details like project name, number, start date, end date, location, type etc.
I have to search the contents of the PDF files for a given keyword, but while displaying the results I want to display the project metadata mentioned in point (2).
My idea is to associate a field called projectId with each PDF file while indexing. Once we get the matching projectIds, we will fire a second search to fetch the project metadata.
This way we could avoid duplicating data. Also, if we want to update the project metadata, we end up updating it in a SINGLE PLACE only. Otherwise, if we stored this metadata with all of the PDF document indexes, we would end up updating all of those documents, which is not what I am looking for.
Please advise.
If I understand you correctly, you have two questions:
Can I store a project id in Lucene and use it for further searches? Yes, you can. This is a common practice.
Can I use this project id to search Lucene for project metadata? Yes, you can, but I do not know if this is a good idea. It depends on the frequency of your metadata updates and your access pattern. If the metadata is relatively static, and you only access it by id, Lucene may be a good place to store it. Otherwise, you can use the project id as a primary key into a database table, which could be a better fit.
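A rough sketch of that two-step pattern, assuming the Lucene 5.x API and hypothetical field names ("content", "projectId"): each PDF's extracted text is indexed together with a stored projectId reference, and the metadata is then resolved by id against your database (or a separate metadata index):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import java.util.LinkedHashSet;
import java.util.Set;

public class ProjectSearchSketch {
    // Index one PDF's extracted text together with a reference to its project.
    public static void indexPdf(IndexWriter writer, String pdfText, String projectId) throws Exception {
        Document doc = new Document();
        doc.add(new TextField("content", pdfText, Field.Store.NO));        // analysed, searchable text
        doc.add(new StringField("projectId", projectId, Field.Store.YES)); // exact, stored reference
        writer.addDocument(doc);
    }

    // Step 1: full-text search over the PDF content, collecting the referenced project ids.
    // (In real code you would usually build the query with a QueryParser and the same analyzer
    // used at index time; a raw TermQuery only works for a single, already-normalised token.)
    public static Set<String> findProjectIds(IndexSearcher searcher, String keyword) throws Exception {
        Set<String> projectIds = new LinkedHashSet<>();
        for (ScoreDoc hit : searcher.search(new TermQuery(new Term("content", keyword)), 100).scoreDocs) {
            projectIds.add(searcher.doc(hit.doc).get("projectId"));
        }
        // Step 2 (not shown): resolve each projectId against the database, or a separate
        // metadata index, to fetch the project name, number, dates, location and type.
        return projectIds;
    }
}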
Sounds like a perfectly good thing to do. The only limitation you'll have (by storing a reference to the project in Lucene rather than the project data itself) is that you won't be able to query both the document text and the project metadata at the same time, for example "documentText:foo OR projectName:bar". If you have no such requirement, then storing an ID in Lucene that refers to a database row seems like a fine thing to do.
I am not sure on your overall setup, but maybe Hibernate Search is for you. It would allow you to combine the benefits of a relational database with the power of a fulltext search engine like Lucene. The meta data could live in the database, maybe together with the original pdf documents, while the Lucene documents just contain the searchable data.
This is definitely possible. But always be aware that you're using Lucene for something it was not intended for. In general, Lucene is designed for full-text search, not for mapping relational content. So the more complex your relational content becomes, the more of a decrease in performance you'll see.
In particular, there are a few areas to keep a close eye on:
Storing the value of each field in your index will decrease performance. If you are not overly concerned with sub-second search results, or if your index is relatively small, then this may not be a problem.
Also, be aware that if you are not using the default ranking algorithm, and your custom algorithm requires information about the project in order to calculate the score for each document, this will have a dramatic impact on search performance, as well.
If you need a more powerful index that was designed for relational content, there are hierarchical indexing tools out there (one developed by Apache, called Jackrabbit) that are worth looking into.
As your project continues to grow, you might also check out Solr, also developed by Apache, which provides some added functionality, such as multi-faceted search.
You can use Lucene that way:
Pros:
Full-text search is easy to implement, which is not the case in an RDBMS.
Cons:
Referential integrity: you get it for free in an RDBMS, but in Lucene, you must implement it yourself.
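For instance, here is a minimal sketch (Lucene 5.x API assumed, with a hypothetical "projectId" key field) of what "implement it yourself" can mean: when a referenced project is deleted, you have to remove or rewrite the affected Lucene documents by hand, because nothing cascades for you.

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class ReferentialIntegritySketch {
    // Called from your own application code whenever a project row is deleted.
    public static void onProjectDeleted(IndexWriter writer, String projectId) throws Exception {
        // Remove every indexed document that still references the deleted project;
        // in an RDBMS, a foreign key with ON DELETE CASCADE would have done this for you.
        writer.deleteDocuments(new Term("projectId", projectId));
        writer.commit();
    }
}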