Query for lucene search result - search

I have a store of news items with the following fields: Title, Body, NewsDate.
I need the best query for the following criteria:
1) Title is more important than body, but less important than date.
2) Date should be compared to the current date: the closer a document's date is to the current date, the more valuable it is. (NOTE: this does not mean sorting descending on news date, because there may be results whose title and body are more relevant but which are older.)
This is just another factor in the search, and I think it needs custom sorting.
3) Body is in third place.
Any solution?

As Guillaume said, you need to use boosting.
You can apply it in two places: at index time (boost title and body) and at query time (the date field). The date boost has to be applied at query time because it is dynamic: it depends on the current date.
Index-time boosting would look like:
Field fld = new Field(....);
fld.setBoost(10f); // 10x more important; 1 is the default
The query-time boost would take the date difference (say, in days or minutes) and apply the boost inversely, i.e. the greater the difference, the smaller the boost.
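The inverse relationship described above could be sketched as follows. This is only an illustration in JavaScript, not Lucene API code: `recencyBoost` and the `1 / (1 + ageInDays)` formula are assumptions chosen for simplicity, and the resulting factor would still have to be applied to the query by whatever boosting mechanism your Lucene version provides.

```javascript
// Hypothetical helper (not part of Lucene): maps a document's age to a
// boost factor in (0, 1]. The 1 / (1 + days) shape is an assumption; any
// decreasing function of age would serve the same purpose.
function recencyBoost(docDate, now = new Date()) {
  const msPerDay = 24 * 60 * 60 * 1000;
  // Clamp at 0 so documents dated in the future do not get a boost above 1.
  const ageInDays = Math.max(0, (now - docDate) / msPerDay);
  return 1 / (1 + ageInDays);
}

const now = new Date('2017-01-13T00:00:00Z');
console.log(recencyBoost(new Date('2017-01-13T00:00:00Z'), now)); // 1
console.log(recencyBoost(new Date('2017-01-12T00:00:00Z'), now)); // 0.5
console.log(recencyBoost(new Date('2017-01-03T00:00:00Z'), now)); // about 0.09
```

A steeper or flatter decay changes how strongly recency competes with the title and body matches, so the exact curve is something to tune against real queries.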

You should use Boosting in your schema, instead of very complicated queries.

Related

Can I index EXTRACT(WEEK from startDateTime)? Or, will the query planner use an index directly on 'startDateTime'?

I have a large number of records indexed on some startDateTime field, and want to select aggregates (SUM and COUNT) on all records grouped by WEEKOFYEAR(startDateTime) (i.e., EXTRACT(WEEK FROM startDateTime)). Can I put a secondary index on EXTRACT(WEEK FROM startDateTime)? Or, even better, will the query use an index on startDateTime appropriately to optimize a request grouped by WEEK?
See this similar question about MySQL indices. How would this be handled in the Cloud Spanner world?
Secondary indexes on generated columns (e.g., EXTRACT(WEEK FROM startDateTime)) are not supported yet. If you have a covering index that includes all the columns required for the query (i.e., startDateTime and the other columns required for grouping and aggregation), the planner will use that covering index instead of the base table, but the aggregation is likely to be a hash aggregation. Unless you aggregate over a very long period of time, this should not be a big problem (I admit that it is not ideal, though).
If you want to restrict the aggregated time range, you need to spell it out in terms of startDateTime (i.e., you need to convert the min/max datetime to the same type as startDateTime).
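To make the hash aggregation mentioned above concrete, here is a rough JavaScript sketch of the same idea done client-side: every row is visited once and accumulated into a map keyed by week. The `weekOfYear` helper is hypothetical and assumes weeks start on Sunday, with days before the first Sunday of the year falling in week 0; check that against the semantics of EXTRACT(WEEK FROM ...) before relying on it. The row shape (`startDateTime`, `amount`) is also made up for illustration.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical helper. Assumption: weeks start on Sunday and days before the
// first Sunday of the year belong to week 0.
function weekOfYear(date) {
  const startOfYear = Date.UTC(date.getUTCFullYear(), 0, 1);
  const startOfDay = Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate());
  const dayOfYear = Math.floor((startOfDay - startOfYear) / MS_PER_DAY) + 1; // 1-based
  return Math.floor((dayOfYear + 6 - date.getUTCDay()) / 7);
}

// Hash aggregation: one pass over the rows, one map entry per week.
function aggregateByWeek(rows) {
  const groups = new Map();
  for (const row of rows) {
    const week = weekOfYear(new Date(row.startDateTime));
    const group = groups.get(week) || { sum: 0, count: 0 };
    group.sum += row.amount;
    group.count += 1;
    groups.set(week, group);
  }
  return groups;
}

const rows = [
  { startDateTime: '2018-01-01T09:00:00Z', amount: 5 }, // Monday, week 0
  { startDateTime: '2018-01-07T09:00:00Z', amount: 2 }, // Sunday, week 1
  { startDateTime: '2018-01-08T09:00:00Z', amount: 3 }, // Monday, week 1
];
console.log(aggregateByWeek(rows));
```

Because the map is keyed by week number only, rows from different years land in the same group, which matches grouping by week alone as in the question.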
Hope this helps.

Query Couchdb by date while maintaining sort order

I am new to CouchDB. I have looked at the docs and SO posts, but for some reason this simple query is still eluding me.
SELECT TOP 10 * FROM x WHERE DATE BETWEEN startdate AND enddate ORDER BY score
UPDATE: It cannot be done. This is unfortunate, since to get this type of data you have to pull back potentially millions of records (a few fields each) from Couch and then do the filtering, sorting or limiting yourself to get the desired results. I am now going back to my original solution of using _changes to capture the data I need and store it elsewhere, so that I can run that query on it.
Here is my updated view (thanks to Dominic):
emit([d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(), score], doc.name);
What I need to do is:
Always sort by score descending
Optionally filter by date range (for instance, TODAY only)
Limit by x
Update: Thanks to Dominic I am much closer, but I am still having an issue.
?startkey=[2017,1,13,{}]&endkey=[2017,1,10]&descending=true&limit=10&include_docs=true
This brings back documents between the dates, sorted by score.
However, if I want the top 10 regardless of date, I only get back the top 10 sorted by date (and not by score).
For starters, when using complex keys in CouchDB, you can only sort from left to right. Expecting otherwise is a common misconception; read up on Views Collation for a more in-depth explanation. (While you're at it, read the entire Guide to Views as well, since you're getting started.)
If you want to be able to sort by score, but filter by date only, you can accomplish this by breaking down your timestamp to only show the degree you care about.
function (doc) {
  var d = new Date(doc.date);
  // Date parts come first (used for filtering); the score comes last (used for sorting)
  emit([d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(), doc.score]);
}
You'll end up outputting a more complex key than what you currently have, but you query it like so:
startkey=[2017,1,1]&endkey=[2017,1,1,{}]
This will pick out all the documents from 1-1-2017, already sorted by score! (The order is ascending; to get descending order, swap startkey and endkey and add descending=true. No change to the view is needed.)
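Here is a small JavaScript sketch of how those key ranges might be assembled on the client. `dayRangeParams` is a hypothetical helper, not part of CouchDB or any client library; it only builds the query string and assumes the view emits [year, month, day, score] as above. For descending order it swaps the two keys and sets descending=true.

```javascript
// Hypothetical helper: builds view query parameters for a single UTC day.
function dayRangeParams(year, month, day, { descending = false, limit } = {}) {
  const low = [year, month, day];      // collates before every score on that day
  const high = [year, month, day, {}]; // {} collates after any number
  const params = {
    startkey: JSON.stringify(descending ? high : low),
    endkey: JSON.stringify(descending ? low : high),
    include_docs: 'true',
  };
  if (descending) params.descending = 'true';
  if (limit !== undefined) params.limit = String(limit);
  // URLSearchParams percent-encodes the JSON so it is safe to put in a URL.
  return new URLSearchParams(params).toString();
}

// Top 10 scores for 1 January 2017, highest first.
const query = dayRangeParams(2017, 1, 1, { descending: true, limit: 10 });
console.log(decodeURIComponent(query));
```

The same helper could take fewer key parts (year only, or year and month) if the view were keyed at a coarser precision.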
As an aside, avoid emitting the entire doc as the value in your view. It is likely more efficient to leverage the include_docs=true parameter and leave the value of your emit empty. (Please refer to this SO question for more information.)
With this exact setup, you'd need separate views in order to query by different precisions. For example, to query by month you just use the year/month and so on.
However, if you are willing/able to sort your scores in your application, you can use a single view to get all the date precision you want. For example:
function (doc) {
  var d = new Date(doc.date);
  // One key part per level of date precision, from year down to millisecond
  emit([d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(), d.getUTCHours(), d.getUTCMinutes(), d.getUTCSeconds(), d.getUTCMilliseconds()]);
}
With this view and the group_level parameter, you can get all the scores by year, month, date, hour, etc. As I mentioned, in this case it won't be sorted by score yet, but maybe this opens up other queries to you. (eg: what users participated this month?)

CouchDB view collation sorted by date

I am using a couchDB database.
I can get all documents by category and paginate results with a key like ["category","document_id"] and a query like startkey=["category","document_id"]&endkey=["category",{}]
Now I want to sort those results by date to have latest documents first.
I tried a lot of keys such as ["category","date","document_id"]
but nothing works (or I can't get it working).
I would use something like
startkey=["queried_category","queried_date","queried_document_id"]&endkey=["queried_category"]
but ignore the "queried_date" key part (sort but do not take documents where "document_id" > "queried_document_id")
EDIT:
Example :
With a key like :
startkey=["apple","2012-12-27","ZZZ"]&endkey=["apple",{}]&descending=true
I will have (and it is the normal behavior)
"apple","2012-12-27","ABC"
"apple","2012-05-01","EFG"
...
"apple","2012-02-13","ZZZ"
...
But the result set I want should start with
"apple","2012-02-13","ZZZ"
Emit the category and the timestamp as the key (you don't need the document_id):
emit([category, timestamp], null);
And then filter on the category:
?startkey=[":category"]&endkey=[":category",{}]
You must understand that this is only a sort, so you need the startkey to be before the first row, and the endkey to be after the last row.
Last but not least, don't forget to use a representation of the timestamp that sorts correctly.
The problem with paginating by timestamp instead of doc ID is that timestamps are not unique. That's why you will have problems paging with Aurélien's solution.
I would stay with what you tried, but store the timestamp as a number (standard UNIX milliseconds since 1970). You can reverse the order of a single numeric field just by multiplying it by -1:
emit([category, -timestamp, doc_id], null);
This way, the results in ascending collation order will be ordered according to your needs:
first dates descending,
then document id's ascending.
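The effect of negating the timestamp can be checked with a short JavaScript sketch. `compareKeys` is a simplified stand-in for CouchDB's view collation (the real rules also cover mixed types and locale-aware string comparison), and the sample documents are invented for illustration.

```javascript
// Simplified stand-in for view collation: compares array keys element by
// element, left to right. Assumes elements at the same position share a type.
function compareKeys(a, b) {
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return a.length - b.length; // a shorter key sorts before its extensions
}

// Invented sample documents; two of them share the same timestamp.
const docs = [
  { _id: 'ZZZ', category: 'apple', timestamp: Date.UTC(2012, 1, 13) },
  { _id: 'ABD', category: 'apple', timestamp: Date.UTC(2012, 11, 27) },
  { _id: 'EFG', category: 'apple', timestamp: Date.UTC(2012, 4, 1) },
  { _id: 'ABC', category: 'apple', timestamp: Date.UTC(2012, 11, 27) },
];

// Same key shape as the emit above: [category, -timestamp, doc_id].
const keys = docs.map(d => [d.category, -d.timestamp, d._id]).sort(compareKeys);
const order = keys.map(k => k[2]);
console.log(order); // newest first; ties broken by ascending document id
```

Because the document id is the last key element, every key is unique, which is what makes it usable as a pagination cursor.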

Influencing Solr search results with a field value

I've recently started experimenting with Solr. My data is indexed and searchable. My problem is in the sorting. I have three fields: Author, Title, Sales.
I would like to search against the author & title fields, but have the sales value influence the score so that matches with higher sales move toward the top, even if the initial match score is not the highest.
Simply sorting by sales does not produce valid results: a result with a near-zero score for the search term but a lot of sales in general could end up above a perfect match for the term that has never sold.
I am seeing results that, while great term matches, are not necessarily the product I want showing at the top of the list.
If you're using the dismax handler, you can add a boost function (bf) with the field you want to boost on, e.g.
http://...?q=foo&bf=sales^1.5
...to make the value of the sales figure give a bump. You can, of course, make the function more complex if you want to munge the sales data in some way.
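As a rough intuition for what such a boost does, here is a toy JavaScript model. It is not Solr code and not Solr's actual scoring formula: it assumes the boost is simply added to the text-match score, and the `rankWithBoost` helper, the log damping of sales and the sample numbers are all made up for illustration.

```javascript
// Toy model only: final score = text score + weight * log10(1 + sales).
// The log damping is one way to "munge" raw sales so that huge sellers do
// not drown out the text match entirely.
function rankWithBoost(docs, weight) {
  return docs
    .map(d => ({ ...d, finalScore: d.textScore + weight * Math.log10(1 + d.sales) }))
    .sort((a, b) => b.finalScore - a.finalScore);
}

// Invented sample data.
const candidates = [
  { title: 'perfect match, never sold', textScore: 1.0, sales: 0 },
  { title: 'weak match, bestseller', textScore: 0.05, sales: 100000 },
  { title: 'good match, steady seller', textScore: 0.8, sales: 500 },
];

for (const d of rankWithBoost(candidates, 0.1)) {
  console.log(d.finalScore.toFixed(2), d.title);
}
```

With these numbers the good match with steady sales moves to the top, while the weak match stays last despite its sales; raising the weight shifts the balance further toward sales.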
More info is easily found.
You may also just want to do this at index time since the sales data isn't going to be changing on the fly.
You can also use Index-time boosting.
And here's detailed info on using function queries to influence scoring.

Get Max Date Using CAML Query From a list

How can I get the max date and min date from a list's date column?
The brute force approach is to create two queries that retrieve the list content sorted by date, one ascending and one descending. I know that this sucks, but at least you can move on with your project and refine the query later on.
If only it was possible to retrieve top 1 then it might even work in production.

Resources