How do I sort Lucene results by field value using a HitCollector?

I'm using the following code to execute a query in Lucene.Net:
var collector = new GroupingHitCollector(searcher.GetIndexReader());
searcher.Search(myQuery, collector);
resultsCount = collector.Hits.Count;
How do I sort these search results based on a field?
Update
Thanks for your answer. I had tried using TopFieldDocCollector, but I got an error saying "value is too small or too large" when I passed 5000 as the numHits argument value. Please suggest a valid value to pass.

The search.Searcher.search method will accept a search.Sort parameter, which can be constructed as simply as:
new Sort("my_sort_field")
However, there are some limitations on which fields can be sorted on: they need to be indexed but not tokenized, and the values must be convertible to Strings, Floats or Integers.
Lucene in Action covers all of the details, as well as sorting by multiple fields and so on.
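To make that concrete, a minimal hedged sketch in Java (the classic 2.x/3.x API the answers here refer to; the field name and hit count are placeholder assumptions):
import org.apache.lucene.search.*;
// "my_sort_field" must be indexed but not tokenized.
Sort sort = new Sort(new SortField("my_sort_field", SortField.STRING));
// Overload search(Query, Filter, int n, Sort) returns TopFieldDocs; null filter, top 100 hits.
TopFieldDocs docs = searcher.search(myQuery, null, 100, sort);
for (ScoreDoc sd : docs.scoreDocs) {
    System.out.println(searcher.doc(sd.doc).get("my_sort_field"));
}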

What you're looking for is probably TopFieldDocCollector. Use it instead of the GroupingHitCollector (what is that?), or inside it.
Comment on this if you need more info. I'll be happy to help.
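For reference, a hedged sketch of how TopFieldDocCollector is typically wired up against the old (pre-3.0) Java API; the sort field here is a placeholder:
import org.apache.lucene.search.*;
// Collects the top numHits documents, already sorted by the given field.
Sort sort = new Sort(new SortField("my_sort_field", SortField.STRING));
TopFieldDocCollector collector =
    new TopFieldDocCollector(searcher.getIndexReader(), sort, 5000);
searcher.search(myQuery, collector);
TopDocs docs = collector.topDocs();          // hits, sorted by the field
int resultsCount = collector.getTotalHits(); // total matches, not just the top 5000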

In the original (Java) version of Lucene, there is no hard restriction on the size of the TopFieldDocCollector results. Any number greater than zero is accepted. Although memory constraints and performance degradation create a practical limit that depends on your environment, 5000 hits is trivial and shouldn't pose a problem outside of a mobile device.
Perhaps in porting Lucene, TopFieldDocCollector was modified to use something other than Lucene's "heap" implementation (called PriorityQueue, extended by FieldSortedHitQueue)—something that imposes an unreasonably small limit on the results size. If so, you might want to look at the source code for TopFieldDocCollector, and implement your own similar hit collector using a better heap implementation.
I have to ask, however: why are you trying to collect 5000 results? No user in an interactive application is going to want to see that many. I figure that users willing to look at 200 results are rare, but double it to 400 as a factor of safety. Depending on the application, limiting the result size can hamper malicious screen scrapers and mitigate denial-of-service attacks too.
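If you do need all 5000, note that Lucene 2.9+ replaced TopFieldDocCollector with TopFieldCollector, which takes numHits directly; a hedged Java sketch with the same placeholder field:
import org.apache.lucene.search.*;
Sort sort = new Sort(new SortField("my_sort_field", SortField.STRING));
// create(sort, numHits, fillFields, trackDocScores, trackMaxScore, docsScoredInOrder)
TopFieldCollector collector =
    TopFieldCollector.create(sort, 5000, true, false, false, true);
searcher.search(myQuery, collector);
TopDocs docs = collector.topDocs();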

The constructor for Sort that accepts only the string field name has been deprecated. Now you have to create a Sort object and pass it as the last parameter of searcher.Search():
// Sort by a field of type long called "size" from greatest -> smallest
// (signified by passing true for the last "reverse" parameter).
// Note: in Lucene.Net 3.x the type constant is SortField.LONG.
var sorter = new Sort(new SortField("size", SortField.LONG, true));
var docs = searcher.Search(myQuery, null, 1000, sorter); // Search(Query, Filter, int n, Sort) returns TopFieldDocs

Related

How do you save a List<> as a column in a table in Room?

I am building an app in which I have a Room entity where one of the columns is supposed to hold a List.
What is the best approach for doing this in an app that uses Flow, Coroutines and Room?
I tried serializing with Jackson (turning the List into a long JSON String and then bringing it back to a List when fetched), but I am not sure if this is the correct approach.
Thank you,
What is the best approach for doing this in an app that uses Flow, Coroutines and Room?
This is very much open to opinion.
From a database perspective, the approach would be to have any list as a table, thus:
- reducing the JSON bloat and thus improving efficiency,
- reducing duplication and thus being more likely to conform to normalisation,
- avoiding potential complexities and even greater inefficiencies (e.g. a LIKE with a wildcard as the first character must do a full table scan).
Perhaps consider this question and answer (matching multiple title in single query using like keyword): if the table-per-list approach were taken, a simple SELECT * FROM task WHERE task_tags IN(:taglist) could do the same.
From a coding point of view, the coding is at first simpler when embedding JSON, as the complex code is within the JSON libraries.
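To make the table-per-list approach concrete, a rough Java sketch (all entity, field and query names here are invented for illustration; a real app would add indices and foreign-key constraints):
import androidx.room.*;
import java.util.List;

@Entity
public class Task {
    @PrimaryKey(autoGenerate = true) public long id;
    public String title;
}

@Entity
public class TaskTag { // one row per tag instead of one JSON blob per task
    @PrimaryKey(autoGenerate = true) public long id;
    public long taskId; // references Task.id
    public String tag;
}

@Dao
public interface TaskDao {
    // The SELECT ... IN(:taglist) idea from above becomes a plain indexed lookup.
    @Query("SELECT Task.* FROM Task JOIN TaskTag ON TaskTag.taskId = Task.id "
         + "WHERE TaskTag.tag IN (:taglist)")
    List<Task> tasksWithAnyTag(List<String> taglist);
}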

an alternative for getAllEntriesByKey?

I want to build a type-ahead function, but I need an alternative to the getAllEntriesByKey method because the initial data collection seems to be too large for acceptable performance.
I would rather use the getEntryByKey method and then collect the next X documents in a View.
Is something like that possible? Just jump to a position in a view (matching a specified query) and collect the next X documents?
So far I have written most of it in SSJS.
You can use a combination of NotesView.GetEntryByKey and NotesView.CreateViewNavFrom. This means, however, that you will access the view twice, so I do not know whether you gain any performance improvement here.
An example (LotusScript) can be found here:
http://lpar.ath0.com/2011/09/19/notesviewentrycollection-vs-notesviewnavigator/
The LotusScript can easily be transformed into SSJS. I have used something similar before; I can write a blog post about it.
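In the meantime, a rough Java-API sketch of the idea (SSJS exposes the same Domino classes; view name, key and page size are placeholders, and NotesException handling is omitted):
import lotus.domino.*;
// db is an open lotus.domino.Database; jump to the first entry matching key,
// then walk the next N entries from there.
View view = db.getView("LookupView");
ViewEntry start = view.getEntryByKey(key, false); // false = allow partial match
if (start != null) {
    ViewNavigator nav = view.createViewNavFrom(start);
    ViewEntry entry = nav.getCurrent();
    for (int i = 0; i < 25 && entry != null; i++) { // N = 25
        System.out.println(entry.getColumnValues());
        ViewEntry next = nav.getNext();
        entry.recycle(); // Domino objects must be recycled manually
        entry = next;
    }
}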

How to insert an auto increment / sequential number value in CouchDB documents?

I'm currently playing with CouchDB a bit and have the following scenario:
I'm implementing an issue tracker. The requirement is that each issue document has (besides its document _id) a unique numerical sequential number, in order to refer to it in a more appropriate way.
My first approach was to have a view which simply returns the count of unique issue documents currently stored. Increment that value on the client side by 1, assign it to my new issue and insert that.
That turned out to be a bad idea when inserting multiple issues with AJAX calls, or when multiple clients add issues at the same time. In the latter case it wouldn't even be possible without communication between the clients.
Ideally I want the sequential number to be generated on Couch, which is, AFAIK, not possible due to conflicting states in distributed systems.
Is there any good pattern one could use (maybe on the client side) to approach this? I feel like this is a standard kind of use case (thinking of invoice numbers, etc).
Thanks in advance!
You could use a separate document which is empty, so that it consists only of its _id and _rev. The _rev prefix is always an integer, so you could use it as your auto-incrementing number.
Just write that document back unchanged (a PUT to its URL, or a POST to the database, with the current _rev); this will increase the rev and return the new one. Then you can use the generated value for your purpose.
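A hedged Java sketch of that flow (URL, database and document names are placeholders; exception handling is omitted, and a 409 conflict just means another client won the race, so retry):
import java.net.URI;
import java.net.http.*;

HttpClient http = HttpClient.newHttpClient();
String base = "http://localhost:5984/issues/counter";
// 1. Fetch the current revision of the counter document.
HttpResponse<String> get = http.send(
    HttpRequest.newBuilder(URI.create(base)).GET().build(),
    HttpResponse.BodyHandlers.ofString());
String rev = get.body().replaceAll(".*\"_rev\"\\s*:\\s*\"([^\"]+)\".*", "$1");
// 2. Write it back unchanged; CouchDB increments the rev prefix.
HttpResponse<String> put = http.send(
    HttpRequest.newBuilder(URI.create(base))
        .header("Content-Type", "application/json")
        .PUT(HttpRequest.BodyPublishers.ofString(
            "{\"_id\":\"counter\",\"_rev\":\"" + rev + "\"}"))
        .build(),
    HttpResponse.BodyHandlers.ofString());
String newRev = put.body().replaceAll(".*\"rev\"\\s*:\\s*\"([^\"]+)\".*", "$1");
long sequence = Long.parseLong(newRev.split("-")[0]); // e.g. "42-abc..." -> 42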
Alternative way:
Create a separate document, consisting of value and lock. Then execute something like: "IF lock == true THEN return ELSE set lock = true AND increase value by 1", then do a GET to retrieve the new value and finally set lock = false.
I agree with you that using a view that gives you a document count is not a great idea, and it is the reason that CouchDB uses UUIDs instead.
I'm not aware of a sequential-id feature in CouchDB, but I think it's quite easy to write. I'd consider either:
An RPC call (e.g. with RabbitMQ) to a single service, to avoid concurrency issues. You can then store the latest number in a dedicated document on a specific non-distributed CouchDB, or somewhere else. This may not scale particularly well, but you'll be writing a heck of an issue-tracking system before this becomes an issue.
If you can allow missing numbers, set the UUID algorithm on your Couch to sequential and you are at least good until the first buffer overflow. See more info at: http://couchdb.readthedocs.org/en/latest/config/misc.html#uuids-configuration
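For reference, per the docs page linked above, the knob lives in CouchDB's local.ini (sequential UUIDs are monotonic per node but can skip ranges):
[uuids]
algorithm = sequential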

Transform MongoDB Data on Find

Is it possible to transform the returned data from a Find query in MongoDB?
As an example, I have a first and last field to store a user's first and last name. In certain queries, I wish to return the first name and last initial only (e.g. 'Joe Smith' returned as 'Joe S'). In MySQL a SUBSTRING() function could be used on the field in the SELECT statement.
Are there data transformations or string functions in Mongo like there are in SQL? If so, can you please provide an example of usage? If not, is there a proposed method of transforming the data aside from looping through the returned object?
It is possible to do just about anything server-side with MongoDB. The reason you will usually hear "no" is that you sacrifice too much speed for it to make sense under ordinary circumstances. One of the main forces behind PyMongo, Mike Dirolf of 10gen, has a good blog post on using server-side JavaScript with PyMongo here: http://dirolf.com/2010/04/05/stored-javascript-in-mongodb-and-pymongo.html. His example stores a JavaScript function that returns the sum of two fields, but you could easily modify it to return the first letter of your user name field. The gist would be something like:
db.system_js.first_letter = "function (x) { return x.charAt(0); }"
Understand first, though, that MongoDB is made to be really good at retrieving your data, not at processing it. The recommendation (see, for example, 50 Tips and Tricks for MongoDB Developers by Kristina Chodorow, O'Reilly) is to do what Andrew tersely alluded to above: make a first-letter column and return that instead. Any processing can be done more efficiently in the application.
But if you feel that even querying for the full name before returning fullname[0] from your 'view' is too much of a security risk, you don't need to do everything the fastest possible way. I'd avoided map-reduce in MongoDB for a while because of all the public concerns about speed. Then I ran my first map-reduce and twiddled my thumbs for 0.1 seconds as it processed 80,000 10KB documents. I realize that in the scheme of things that's tiny, but it illustrates that just because it's bad for a massive website to take a performance hit on some server-side processing doesn't mean it would matter to you. In my case, I imagine it would take me longer to migrate to Hadoop than to just eat that 0.1 seconds every now and then. Good luck with your site.
The question you should ask yourself is why you need that data. If you need it for display purposes, do that in your view code. If you need it for query purposes, then do as Andrew suggested, and store it as an extra field on the object. Mongo doesn't provide server-side transformations (usually, and where it does, you usually don't want to use them); the answer is usually to not treat your data as you would in a relational DB, but to use the more flexible nature of the data store to pre-bake your data into the formats that you're going to be using.
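To illustrate the pre-baking approach, a hedged Java sketch (collection, field names and connection string are placeholders) that stores the initial at write time and projects it at read time:
import com.mongodb.client.*;
import org.bson.Document;

MongoClient client = MongoClients.create("mongodb://localhost:27017");
MongoCollection<Document> users =
    client.getDatabase("app").getCollection("users");
// Write time: pre-bake the initial instead of computing it per query.
String first = "Joe", last = "Smith";
users.insertOne(new Document("first", first)
    .append("last", last)
    .append("last_initial", last.substring(0, 1)));
// Read time: project only what the view needs.
for (Document d : users.find()
        .projection(new Document("first", 1).append("last_initial", 1))) {
    System.out.println(d.getString("first") + " " + d.getString("last_initial"));
}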
If you can provide more information on how this data should be used, then we might be able to answer a little more usefully.

riak search result excerpt

Does anybody know if Riak Search has the ability to generate an excerpt with highlight points in it, similar to what Lucene does?
Riak Search doesn't expose this functionality out of the box, but with a little work you can create a rough approximation.
Riak Search allows you to feed search results into a MapReduce job. If you do this, then your Map or Reduce function will also get a list of token positions in the document that matched the query (this is exposed as keydata, http://www.basho.com/search.php?q=keydata). Using these positions, you can write code to mark up the document or excerpt portions of text.
I think this functionality will hardly ever be implemented in Riak, since its philosophy implies that it doesn't care what exactly is stored in the values, and therefore it does not process them in any meaningful way beyond providing some metadata, like indexes.
