CouchDB pagination sorted by date, queried by id - couchdb

I want to create pagination on application level using the CouchDB view API. The pagination uses cursors, so given a cursor, I will query the view for the n+1 documents starting with the given cursor as start key and output the n results as page and provide the n+1 result row as the cursor for the next page.
This works well as long as the view keys are also the keys for my view rows. Now this time all my docs have a date field and I emit them as map keys, because I want to sort via date. However, I can't use my cursors anymore like before.
I thought that was the reason the view API also provides startkey_docid for submitting such a cursor doc id, but this is obviously not the case. It seems this value is only applied when several rows share the same key.
So, in short: I want a date-ordered view, but cursors based on the document ids. How can I do this?
Thanks in advance
Simplified view:
function map(doc)
{
  emit(doc.date, {_id: doc._id});
}
Simplified view result:
{
  "rows":[
    {"id":"123","key":"2010-06-26T01:28:13.555Z", "value":{...}},
    {"id":"234","key":"2010-06-22T12:21:23.123Z", "value":{...}},
    {"id":"987","key":"2010-06-16T13:48:43.321Z", "value":{...}},
    {"id":"103","key":"2010-05-01T17:38:31.123Z", "value":{...}},
    {"id":"645","key":"2009-07-21T21:21:13.345Z", "value":{...}}
  ]
}
Application-level query with cursor 234, page size 3 should return:
234, 987, 103
So how can I map this to a view?

Why do you want cursors based on docid?
Map Reduce creates single dimensional indexes, so any non-key traversal will be expensive. However, I think you can do what you want without requiring traversing 2 indexes at the same time.
See for instance how I paginate through posts with a certain tag:
Sofa's CouchApp tag pagination
aka
http://jchris.couchone.com/sofa/_design/sofa/_list/index/tags?descending=true&reduce=false&limit=10&startkey=[%22life%22%2C{}]&endkey=[%22life%22]
The key in that view looks like ["tag","2008/10/25 04:49:10 +0000"] so you can paginate through by tag and, within tags, by time.
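A minimal sketch of a map function that would produce keys of that shape, together with the key range for one tag. The field names `tags` and `created_at` are assumptions for illustration and are not taken from Sofa's source; `emit` is stubbed only so the snippet runs outside CouchDB.

```javascript
// Stand-in for CouchDB's emit(); inside a real view CouchDB provides it.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Hypothetical map function: one row per tag, keyed by [tag, timestamp].
function map(doc) {
  if (!doc.tags || !doc.created_at) return;
  doc.tags.forEach(function (tag) {
    emit([tag, doc.created_at], null);
  });
}

// Key range covering every row of one tag, newest first.
// With descending=true the range starts at [tag, {}], because objects
// collate after strings, and ends at [tag].
function tagRange(tag) {
  return { descending: true, startkey: [tag, {}], endkey: [tag] };
}

map({ _id: "a", tags: ["life", "code"], created_at: "2008/10/25 04:49:10 +0000" });
console.log(rows.length, JSON.stringify(tagRange("life").startkey));
```

The range object still has to be JSON-encoded into the query string, as in the URL above.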
Edited
Ha! I just realized what you are trying to do. It is so very simple.
Forget all about docids, they should be random anyway and not related to anything so just forget docs even have ids for a second.
You say "Application-level query with cursor 234, page size 3 should return:
234, 987, 103"
Your cursor should not be 234. It should be the key "2010-06-22T12:21:23.123Z".
So in essence, you use the key of the last row of results as the startkey for the next query. For example, query with startkey="2010-06-22T12:21:23.123Z"&limit=3, then for each page you render, link to a query where the new startkey is the last returned key.
Bonus: with what I've just described, you will have the bottom row of page 2 be the top row of page 3. To fix this, add skip=1 to your query.
Bonus bonus: OK, what about when more than 3 docs emitted the same key in the view? Then the last key will always be the same as the first key, so you can't progress in pagination without expanding the limit parameter. Unless... you use startkey_docid (and set it to the id of the last row). That is the only time you should use startkey_docid.
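A minimal sketch of the cursor handling described above. It uses the "fetch n+1 rows" variant from the question rather than skip=1, and it assumes the view is read with descending=true, as the sample result suggests. The function names and the cursor shape { key, id } are hypothetical.

```javascript
// Build the query string for one page. `cursor` is null for the first page,
// otherwise { key, id } taken from the extra row of the previous page.
function buildPageQuery(cursor, pageSize) {
  const params = new URLSearchParams({
    descending: "true",          // assumption: newest first, as in the sample
    limit: String(pageSize + 1), // one extra row tells us where the next page starts
  });
  if (cursor) {
    params.set("startkey", JSON.stringify(cursor.key)); // view keys are JSON values
    params.set("startkey_docid", cursor.id);            // tie-breaker for equal keys
  }
  return params.toString();
}

// Split the returned rows into the visible page and the cursor for the next one.
function splitPage(rows, pageSize) {
  const page = rows.slice(0, pageSize);
  const extra = rows[pageSize];
  const next = extra ? { key: extra.key, id: extra.id } : null;
  return { page: page, next: next };
}

// Rows as they would come back when starting at the row with id 234.
const sampleRows = [
  { id: "234", key: "2010-06-22T12:21:23.123Z" },
  { id: "987", key: "2010-06-16T13:48:43.321Z" },
  { id: "103", key: "2010-05-01T17:38:31.123Z" },
  { id: "645", key: "2009-07-21T21:21:13.345Z" },
];
const result = splitPage(sampleRows, 3);
console.log(result.page.map((r) => r.id), buildPageQuery(result.next, 3));
```

The cursor handed to the client is then the key plus the doc id, not the doc id alone.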

Related

Xpages how to sort preselected large amount data from view

I have a domino database with the following view:
Project_no Realization_date Author
1/2005 2015-01-02 Alex/Acme
3/2015 2015-02-20 John/Acme
33/2015 2016-06-20 Henry/Acme
44/2015 2015-02-13 John/Acme
...
Now I want to get all projects from this view whose number starts with, e.g., "3" (partial match), sort them by Realization_date descending and display the first 1000 of them on an XPage.
View is large - some selection can give me 500.000 documents.
The FT search view option is not acceptable because it returns 5.000 docs only.
Creating an ArrayList or ListMap resulted in a Java out-of-memory exception (the Java Domino objects are recycled). Increasing the memory may help, of course, but we have 30k users, so it may be insufficient.
Do you have any ideas how I can achieve this?
I think the key is going to be what the users want to do with the output, as Frantisek says.
If it's for an export, I'd export the data without sorting, then sort in the spreadsheet.
If it's for display, I would hope there's some paging involved, otherwise it will take a very long time to push the HTML from the server to the browser, so I'd recommend doing an FT Search on Project_no and Realization_date between certain ranges and "chunking" your requests. You'll need a manual pager to load the next set of results, but if you're expecting that many, you won't get a pager that calculates the total number of pages anyway.
Also, if it's an XAgent or displaying everything in one go, set viewState="nostate" on the relevant XPage. Otherwise, every request will get serialized to disk. So the results of your search get serialized to disk / memory, which is probably what's causing the Java memory issues you're seeing.
Remember FT_MAX_SEARCH_RESULTS notes.ini variable can be amended on the server to increase the (default) maximum from 5000.
500,000 is a very large set of results and is probably not going to be very user-friendly for any subsequent actions on them. I'd probably also recommend restricting the search, e.g. forcing a separate entry of the "2015" portion or preventing entry of just one number, so it has to be e.g. "30" instead of just "3". That may also mean amending your view so the Project_no format displays as @Right("0000" + @Left(Project_no,"/"), 4), so users don't get 3, 30, 31, 32....300, 301, 302...., but can search for "003" and find just 30, 31, 32..., 39. It really depends on what the users want to do and may require a bit of thinking outside the box to quickly give them access to the targeted set of documents they want to act on.
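To illustrate the effect of the padding formula above, here is a small JavaScript sketch of the same transformation for values that contain a "/". The function name is hypothetical, and this is not code that would run inside the view column itself.

```javascript
// Same idea as the padding formula: take the part before the first "/",
// prepend zeros, and keep the last 4 characters.
function paddedProjectNo(projectNo) {
  const left = projectNo.split("/")[0];
  return ("0000" + left).slice(-4);
}

const padded = ["3/2015", "30/2015", "31/2015", "301/2015"].map(paddedProjectNo);
// Only the 30..39 range starts with "003" once the numbers are padded.
const matches = padded.filter((p) => p.startsWith("003"));
console.log(padded, matches);
```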
I would optimize the data structure for your view. For example, make an ArrayList<view entry> that represents the minimum information from your view. It mimics the index. The "view entry" is NOT a Notes object (Document, ViewEntry), but a simplified POJO that holds just enough information to sort it (via a comparator) and to show or look up the real data, for example the Subject column to be shown and the UNID to make a link that opens the document.
This structure should fit into few hundred bytes per document. Troublesome part is to populate that structure - even with ViewNavigator it may take minutes to build such list.
Proper recycling should be ok but...
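The shape of that lightweight structure can be sketched as follows. This is written in JavaScript for brevity rather than as the Java POJO described above; the field names (projectNo, realizationDate, unid) and the UNID values are hypothetical placeholders, and populating the list from the view is not shown.

```javascript
// Newest first; ISO dates (YYYY-MM-DD) compare correctly as plain strings.
function byDateDesc(a, b) {
  if (a.realizationDate < b.realizationDate) return 1;
  if (a.realizationDate > b.realizationDate) return -1;
  return 0;
}

// Filter plain entries by project-number prefix, sort by date, keep the top `limit`.
function topByDate(entries, prefix, limit) {
  return entries
    .filter((e) => e.projectNo.startsWith(prefix))
    .sort(byDateDesc)
    .slice(0, limit);
}

// Plain data only, no Domino objects, so nothing here needs recycling.
const entries = [
  { projectNo: "1/2005", realizationDate: "2015-01-02", unid: "UNID-1" },
  { projectNo: "3/2015", realizationDate: "2015-02-20", unid: "UNID-2" },
  { projectNo: "33/2015", realizationDate: "2016-06-20", unid: "UNID-3" },
  { projectNo: "44/2015", realizationDate: "2015-02-13", unid: "UNID-4" },
];
const firstPage = topByDate(entries, "3", 1000);
console.log(firstPage.map((e) => e.projectNo));
```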
You could also "revert" to classic Domino URLs, for example /yourviewname?ReadViewEntries&startkey=3&outputformat=JSON, and render that JSON via a JavaScript UI component of some kind.
If the filtering is based on partial match for the first sorted column, there's a pure Domino based solution. It requires that the Domino server is 8.5.3 or newer (view.resortView was introduced in 8.5.3), and that the realization_date column has click to sort.
Create a filtered collection with getAllEntriesByKey( key, false ) <-- partial match
Call view.resortView( "name_of_realization_date_column" )
Create a collection of all entries, now sorted by realization_date
Intersect the sorted collection with the filtered collection. This gives you the entries you want sorted by realization_date. E.g. sortedCollection.intersect( filteredCollection )
Pseudocode:
..
View view = currentDb.getView( "projectsView" );
view.setAutoUpdate( false );
ViewEntryCollection filteredCollection = view.getAllEntriesByKey( userFilter, false );
// Use index where view is sorted by realization_date
view.resortView( "realization_date" );
// All entries sorted by realization_date
ViewEntryCollection resortedCollection = view.getAllEntries();
resortedCollection.intersect( filteredCollection );
// resortedCollection now contains only the entries in filteredCollection, sorted by realization_date
..
I'm not certain if this would be faster than creating a custom data structure, but I would think it's worth testing :)

Search for documents by key using Domino Data Service

Domino Data Service is a good thing, but is it possible to search for documents by key?
I didn't find anything about it in the API or the URL parameters.
I tried the above and the requests usually fail with a server timeout after 30 seconds. Calls to /api/data/documents do not serve the purpose with parameters like sortcolumn or keysexactmatch, so calls to
/api/data/collections should be used for these.
Also, I don't think that arguments like sortcolumn would work on a document collection, because there isn't a column to be sorted in the first place: columns belong to views, not to documents, so a view collection should be queried instead. That also mimics the behavior of the getDocumentByKey method, which can't be called against a document, only against a view. So instead of:
http://HOSTNAME/DATABASE.nsf/api/data/documents?search=QUERY&searchmaxdocs=N
I would call
http://HOSTNAME/DATABASE.nsf/api/data/collections/name/viewname?search=QUERY&searchmaxdocs=N
and instead of
http://HOSTNAME/DATABASE.nsf/api/data/documents?sortcolumn=COLUMN&sortorder=ascending&keys=ROWVALUE&keysexactmatch=true
I would call:
http://HOSTNAME/DATABASE.nsf/api/data/collections/name/viewname?sortcolumn=COLUMN&sortorder=ascending&keys=ROWVALUE&keysexactmatch=true
where 'viewname' is the name of the view that is searched.
That is much faster, which comes in handy when working with larger databases.
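A small sketch of building the view-collection lookup URL shown above. The function name is hypothetical; the path and the query parameters are the ones used in this answer, and the host, database and view values are the same placeholders.

```javascript
// Build a key lookup against a view collection instead of /api/data/documents.
function viewLookupUrl(host, dbPath, viewName, column, key) {
  const params = new URLSearchParams({
    sortcolumn: column,
    sortorder: "ascending",
    keys: key,
    keysexactmatch: "true",
  });
  return "http://" + host + "/" + dbPath + "/api/data/collections/name/" +
    encodeURIComponent(viewName) + "?" + params.toString();
}

const lookupUrl = viewLookupUrl("HOSTNAME", "DATABASE.nsf", "viewname", "COLUMN", "ROWVALUE");
console.log(lookupUrl);
```

Building the query string with URLSearchParams also takes care of escaping keys that contain spaces or other reserved characters.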
You would do something like the following:
GET http://HOSTNAME/DATABASE.nsf/api/data/documents?search=QUERY&searchmaxdocs=N
N would be the total number of documents to return and QUERY would be your search phrase. The QUERY would be the same as doing a full text search.
For column lookups it should be something like this:
GET http://HOSTNAME/DATABASE.nsf/api/data/documents?sortcolumn=COLUMN&sortorder=ascending&keys=ROWVALUE&keysexactmatch=true
COLUMN would be the column name. ROWVALUE would be the key you are looking for.
There are further options for this. More details here.
http://infolib.lotus.com/resources/domino/8.5.3/doc/designer_up1/en_us/DominoDataService.html#migratingtowebsphereportalversion7.0

Pagination in CouchDB using variable keys

There's a bunch of questions on here related to pagination using CouchDB, but none that quite fit what I'm wondering about.
Basically, I have a result set ranked by number of votes, and I want to page through the set in descending order.
Here's the map for reference.
function(doc) {
emit(doc.votes);
}
Now, the problem. I found out that startkey_docid doesn't work on its own. You have to use it in combination with startkey. The thing is, for the query, I don't use a startkey parameter (I'm not looking to restrict the results, just get them from most to least). I was thinking I could just use startkey={{doc.votes}}&startkey_docid={{doc._id}} instead, but the number of votes for a document could have changed by the time someone clicks the "Next Page" link.
The way to solve this seemed obvious: just set startkey=99999999 so that it will return all documents in the database and I can just use startkey_docid to start at the one where we left off last time. Oddly, when I do that, the startkey_docid stopped working and just allowed all results to be returned again. Apparently startkey needs to exactly equal the key on the document whose _id is used in startkey_docid.
What I'm asking is whether anyone knows a workaround for using startkey_docid to page when the actual startkey could have changed by the time you want to use it? Should my application just lookup the document by _id and immediately use the doc.votes value hoping it hasn't changed in the few milliseconds between requests? Even that doesn't seem very reliable.
EDIT: Ended up switching to Mongo for the speed, so this question turned out to be kinda moot.
I have never done something like this, but I think I have some idea of how to do it. What you can do is take a snapshot of the ratings and refer to it on every page. You probably want your view not to consume too much space, so you should not map separate copies of documents whose votes have not changed since the snapshot was taken. So, you can do the following:
1. Add some history of ratings with timestamps to your document.
2. Map the ratings AND the history.
3. In your app, get the current time: start_time = Date.now() and query all pages.
4. Clean up the history older than the oldest active session.
The problem is that if you emit [votes, date] and try to paginate, you will never know how many documents you have to fetch to get the desired number per page. There can always be some older version which you will have to skip, and then you will have to make another request to the DB. That's why you can consider emitting [date, votes] and always reading the view twice, once for start_time and once for the current time, then merging and sorting the results (like in merge sort).
Ad.1:
{ ...,
votes: 12,
history: [
{date: 1357390271342, votes: 10},
{date: 1357390294682, votes: 11}
]
}
Ad.2:
function (doc) {
  emit([{}, doc.votes], null);
  doc.history && doc.history.forEach(function(h) {
    emit([h.date, h.votes], null);
  });
}
Ad.3:
?startkey=[start_time, votes]&limit=items_per_page_plus1
?startkey=[{}, votes]&limit=items_per_page_plus1
Merge the lists and sort by votes in your app (or in a list function).
If you have problems with using startkey_docid, then you can emit [date, votes, id] and query with the ID explicitly. Even when this particular doc changes its votes, it will still be available in the history.
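The merge step from Ad.3 can be sketched as below. It assumes each of the two result lists has already been reduced in the app to rows of the hypothetical shape { id, votes } and sorted by votes in descending order; like the rest of this answer, it has not been run against a real view.

```javascript
// Merge two lists that are each sorted by votes (highest first),
// the same way the merge phase of merge sort works.
function mergeByVotesDesc(a, b) {
  const out = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    // Take the row with more votes; on a tie the row from list a goes first.
    if (a[i].votes >= b[j].votes) out.push(a[i++]);
    else out.push(b[j++]);
  }
  while (i < a.length) out.push(a[i++]);
  while (j < b.length) out.push(b[j++]);
  return out;
}

const merged = mergeByVotesDesc(
  [{ id: "a", votes: 12 }, { id: "c", votes: 3 }],
  [{ id: "b", votes: 10 }]
);
console.log(merged.map((r) => r.id));
```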
Ad.4:
If you emit [date, votes], then you can just get the outdated history with: ?startkey=[0]&endkey=[oldest_active_session_time]&inclusive_end=false and update the documents with an update handler:
function(doc, req) {
  if (!doc || !doc.history) return [null, 'Error'];
  var history = new Array();
  var oldest = +(req.query.date);
  doc.history.forEach(function(h) {
    if (h.date >= oldest)
      history.push(h);
  });
  doc.history = history;
  return [doc, 'OK'];
}
Note: I have not tested it, so it is expected not to run without modifications :)
As far as I know, CouchDB uses b-tree shadowing to make updates, and in principle it should be possible to access older revisions of the view. I am not familiar with the CouchDB internals, so this is just a guess, and there does not seem to be any (documented) API for this.
I can't figure out any simple solution by now, but there are options:
Replicate not-so-often your sorting list to small dedicated db so it will be much more stale than stale=ok
Modify your schema in a way that you'll be able to sort by some more stable data. Look at the banking/ledger example in CouchDb guide: http://guide.couchdb.org/draft/recipes.html#banking. Try to log every vote and reduce them hourly for example. As a bonus you'll get a history/trends :)
I'm kind of surprised this question has been left unanswered because the functionality of CouchDB Futon basically does this when you are paginating through the results of a map function. I opened up firebug to see what was happening in the javascript console as I paginated and saw that for every set of paginated results it is passing the startkey along with startkey_docid. So although the question is how do I paginate without including startkey, CouchDB specifies that the startkey is required and demonstrates how it can work. The endkey is not specified, so if there is only one result for the specified startkey, the next set of paginated results will also contain the next key of the sorted results that do not match the startkey.
So to clarify a bit, the answer to this problem is that as you are paginating and keeping track of the startkey_docid, you also need to capture the startkey of the same document that will be the start of the next set of results. When you are calling the paginated results use both the captured startkey and startkey_docid as couchdb requires. Leave endkey off so that the results will continue on to the next key of the sorted results.
The use case for wanting to paginate without specifying a key is kind of odd. Let's say that the start docid of the next paginated result changed its key value drastically, from a 9 to a 3. And we are also assuming that there is only one instance of the docid in the map results, even though it could potentially appear multiple times (which I believe is why the startkey needs to be specified). As the user clicks the next button, the user's paginated results would then jump from rank 9 to rank 3. But if you include the startkey in addition to the startkey_docid, the paginated results just start over at the beginning of the rank 9 results, which is a more logical progression than potentially jumping over a large set of results.

ordering records by time

I created a simple view to return the blog title and time
function(doc) {
  if (doc.TITLE) emit(doc.TIME, doc.TITLE);
}
what is a simple way to display newest blog articles first (by default it is the other way around)?
Just apply descending sort order in the view request, e.g.
GET /dbname/_design/titles/_view/by_time?descending=true
The view output is then sorted in reverse, so the newest blog articles come first. Remember that startkey and endkey apply to this reversed order, so startkey must be the higher key and endkey the lower one. More about view query parameters can be found in the CouchDB wiki.
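As a sketch of how the key range flips with the sort order, the helper below builds the query string for a time range in either direction. The function name is hypothetical, and the TIME values are assumed to be strings or numbers that sort chronologically.

```javascript
// Build the query string for rows between `oldest` and `newest`.
// In descending order the range runs from the newest key down to the oldest.
function byTimeQuery(oldest, newest, newestFirst) {
  const params = new URLSearchParams();
  if (newestFirst) {
    params.set("descending", "true");
    params.set("startkey", JSON.stringify(newest));
    params.set("endkey", JSON.stringify(oldest));
  } else {
    params.set("startkey", JSON.stringify(oldest));
    params.set("endkey", JSON.stringify(newest));
  }
  return params.toString();
}

console.log(byTimeQuery(1000, 2000, true));
```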

How can I configure Sitecore search to retrieve custom values from the search index

I am using the AdvancedDatabaseCrawler as a base for my search page. I have configured it so that I can search for what I want, and it is very fast. The problem is that as soon as you want to do anything with the search results that requires accessing field values, performance degrades badly.
The main search results part is fine: even if 1000 results are returned from the search, I am only showing 10 or 20 results per page, which means I only have to retrieve 10 or 20 items. However, in the sidebar I am listing various filtering options with the number of results associated with each filtering option (eBay style). In order to retrieve these filter options, I perform a relationship search based on the search results. Since the search results only contain SkinnyItems, it has to call GetItem() on every single result to get the actual item in order to get the value that I'm filtering by. In other words, it will call Database.GetItem(id) 1000 times! Obviously that is not terribly efficient.
Am I missing something here? Is there any way to configure Sitecore search to retrieve custom values from the search index? If I can search for the values in the index why can't I also retrieve them? If I can't, how else can I process the results without getting each individual item from the database?
Here is an idea of the functionality that I’m after: http://cameras.shop.ebay.com.au/Digital-Cameras-/31388/i.html
Klaus answered on SDN: use faceting with Apache Solr or similar.
http://sdn.sitecore.net/SDN5/Forum/ShowPost.aspx?PostID=35618
I've currently resolved this by defining dynamic fields for every field that I need to filter by or return in the search result collection. That way I can achieve the faceted searching that is required without needing to grab field values from the database. I'm assuming that adding the dynamic fields costs some performance when rebuilding the index, but I can live with that.
In the future we'll probably look at utilizing a product like Apache Solr.
