Building a pagination cursor

I have activities that are stored in a graph database. In some circumstances, multiple activities are grouped and aggregated into a single activity.
A processed activity feed could look like this:
Activity 1
Activity 2
Grouped Activity
Activity 3
Activity 4
Activity 5
Activities have an updated timestamp and a unique id.
The activities are ordered by their updated time; for a grouped activity, the most recent updated time among its child activities is used.
Activities can be inserted anywhere in the list (for example, if we start following someone, their past activities would be inserted into the list).
Activities can be removed from anywhere in the list.
Due to the amount of data, even a timestamp with microsecond precision can still result in conflicts (two items can have the same timestamp).
Cursor identifiers should be unique and stable. Adding and removing feed items should not change the identifier.
I would like to introduce cursor-based paging to allow clients to paginate through the feed, similar to Twitter's. There doesn't seem to be much information on how such cursors are built; I have only found this blog post talking about implementing them, and its approach seems to have a problem if the cursor's identifier happens to point to an item that was removed.
Given the above, how can I produce an identifier that can be used as a cursor? Initially, I considered combining the timestamp with the unique id: 1371813798111111.myuniqueid. However, if the item at 1371813798111111.myuniqueid is deleted, I can still get the items with the 1371813798111111 timestamp, but I would not be able to determine which item with that timestamp I should start with.
Another approach I had was to assign an incrementing number to each feed result. Since the numbers are incrementing and in order, if a number/id is missing I can just choose the next one. However, the problem with this is that the cursor ids will change if I start removing and adding feed items in the middle of the feed. One solution I considered was to leave a huge gap between consecutive numbers, but it is difficult to assign new items to the space between numbers in a deterministic way, and as new items are added and the gaps fill up, we end up with the same problem.
Simply put, if I have a list of items where items can be added and removed from anywhere in the list, what is the best way to generate an id for each list item such that if the item for the id is deleted, I can still determine its position in the list?

You need an additional (or existing) column that increases sequentially for every row added to the target table. Let's call this column seq_id.
When the client requests the first page:
GET /api/v1/items?sort_by={sortingFieldName}&size={count}
where sortingFieldName is the name of the field by which we sort.
What happens under the hood:
SELECT * FROM items
WHERE ... -- apply search params
ORDER BY sortingFieldName, seq_id
LIMIT :count
Response:
{
  "data": [...],
  "cursor": {
    "prev_field_name": "{result[0].sortingFieldName}",
    "prev_id": "{result[0].seq_id}",
    "next_field_name": "{result[count-1].sortingFieldName}",
    "next_id": "{result[count-1].seq_id}",
    "prev_results_link": "/api/v1/items?size={count}&cursor=bw_{prev_field_name}_{prev_id}",
    "next_results_link": "/api/v1/items?size={count}&cursor=fw_{next_field_name}_{next_id}"
  }
}
The next part of the cursor will not be present in the response if we retrieved fewer than count rows.
The prev part of the cursor will not be present in the response if the request carried no cursor or there is no data to return.
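To make the token format concrete, here is a minimal sketch (my own illustration, not part of the original answer) of building and parsing the fw_/bw_ cursor strings used in the links above:

# Helpers for the "fw_{field}_{id}" / "bw_{field}_{id}" token format shown
# above. Purely illustrative; a production cursor would usually be
# base64-encoded so that field values containing '_' cannot break parsing.

def format_cursor(direction: str, field_value, seq_id: int) -> str:
    """Build a cursor token; direction is 'fw' (forward) or 'bw' (backward)."""
    assert direction in ("fw", "bw")
    return f"{direction}_{field_value}_{seq_id}"

def parse_cursor(token: str) -> tuple[str, str, int]:
    """Split a token back into (direction, field_value, seq_id)."""
    direction, rest = token.split("_", 1)
    field_value, seq_id = rest.rsplit("_", 1)
    return direction, field_value, int(seq_id)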
When the client requests the next page, it must pass the cursor. Forward cursor:
GET /api/v1/items?size={count}&cursor=fw_{next_field_name}_{next_id}
What happens under the hood:
SELECT * FROM items
WHERE ... -- apply search params
  AND ((sortingFieldName = :cursor.next_field_name AND seq_id > :cursor.next_id)
       OR sortingFieldName > :cursor.next_field_name)
ORDER BY sortingFieldName, seq_id
LIMIT :count
Or backward cursor:
GET /api/v1/items?size={count}&cursor=bw_{prev_field_name}_{prev_id}
What happens under the hood:
SELECT * FROM items
WHERE ... -- apply search params
  AND ((sortingFieldName = :cursor.prev_field_name AND seq_id < :cursor.prev_id)
       OR sortingFieldName < :cursor.prev_field_name)
ORDER BY sortingFieldName DESC, seq_id DESC
LIMIT :count
The response will be similar to the previous one. Note that the cursor row itself no longer needs to exist: the predicates above only compare against the cursor's values, so a deleted anchor item does not break pagination.
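To show the whole flow end to end, here is a runnable sketch (my own illustration; the table, column, and function names are invented) using Python and SQLite. It also demonstrates the key property for the original question: deleting the anchor row does not break the next page, because the query compares values rather than looking the row up.

import sqlite3

# Illustrative schema: 'updated' is the sort field and 'seq_id' is the
# tiebreaker that grows with every insert (AUTOINCREMENT plays the role
# of the seq_id column described above).
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE items (
    seq_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    updated INTEGER,
    body    TEXT)""")
db.executemany("INSERT INTO items (updated, body) VALUES (?, ?)",
               [(1000, "a"), (1000, "b"), (1001, "c"), (1002, "d"), (1002, "e")])

def page_forward(cursor, count):
    """Fetch 'count' items after cursor=(updated, seq_id); None = first page."""
    if cursor is None:
        return db.execute("SELECT updated, seq_id, body FROM items "
                          "ORDER BY updated, seq_id LIMIT ?", (count,)).fetchall()
    upd, sid = cursor
    return db.execute("SELECT updated, seq_id, body FROM items "
                      "WHERE (updated = ? AND seq_id > ?) OR updated > ? "
                      "ORDER BY updated, seq_id LIMIT ?",
                      (upd, sid, upd, count)).fetchall()

page1 = page_forward(None, 2)              # rows 'a' and 'b'
cur = (page1[-1][0], page1[-1][1])         # cursor = (updated, seq_id) of 'b'

# Delete the anchor row: the next page is still well-defined.
db.execute("DELETE FROM items WHERE seq_id = ?", (cur[1],))
print(page_forward(cur, 2))                # -> rows 'c' and 'd'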

Related

Query Couchdb by date while maintaining sort order

I am new to CouchDB. I have looked at the docs and SO posts, but for some reason this simple query is still eluding me.
SELECT TOP 10 * FROM x WHERE DATE BETWEEN startdate AND enddate ORDER BY score
UPDATE: It cannot be done. This is unfortunate, since to get this type of data you have to pull back potentially millions of records (a few fields) from Couch and then do the filtering, sorting, or limiting yourself to get the desired results. I am now going back to my original solution of using _changes to capture and store elsewhere the data I do need to perform that query on.
Here is my updated view (thanks to Dominic):
emit([d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(), doc.score], doc.name);
What I need to do is:
Always sort by score descending
Optionally filter by date range (for instance, TODAY only)
Limit by x
Update: Thanks to Dominic I am much closer, but still having an issue.
?startkey=[2017,1,13,{}]&endkey=[2017,1,10]&descending=true&limit=10&include_docs=true
This brings back documents between the dates, sorted by score.
However, if I want the top 10 regardless of date, then I only get back the top 10 sorted by date (and not score).
For starters, when using complex keys in CouchDB, you can only sort from left to right. Expecting any other sort order is a common misconception, so read up on Views Collation for a more in-depth explanation. (While you're at it, read the entire Guide to Views as well, since you're getting started.)
If you want to be able to sort by score, but filter by date only, you can accomplish this by breaking down your timestamp to only show the degree you care about.
function (doc) {
  var d = new Date(doc.date)
  emit([ d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(), doc.score ])
}
You'll end up outputting a more complex key than what you currently have, but you query it like so:
startkey=[2017,1,1]&endkey=[2017,1,1,{}]
This will pick out all the documents on 1-1-2017, and it'll be sorted by score already! (in ascending order, simply swap startkey and endkey to get descending order, no change to the view needed)
As an aside, avoid emitting the entire doc as the value in your view. It is likely more efficient to leverage the include_docs=true parameter, and leaving the value of your emit empty. (please refer to this SO question for more information)
With this exact setup, you'd need separate views in order to query by different precisions. For example, to query by month you just use the year/month and so on.
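For illustration, the day query above can be issued over HTTP like this (a sketch in Python; the database and design-document names are invented):

import json
import requests

# Hypothetical names: database 'mydb', design doc 'scores', view 'by_day'.
view = "http://localhost:5984/mydb/_design/scores/_view/by_day"

# All documents from 2017-01-01; keys collate left to right, so the trailing
# score component orders the rows within that day.
params = {
    "startkey": json.dumps([2017, 1, 1]),
    "endkey": json.dumps([2017, 1, 1, {}]),
    "include_docs": "true",
}
for row in requests.get(view, params=params).json()["rows"]:
    print(row["key"], row["doc"])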
However, if you are willing/able to sort your scores in your application, you can use a single view to get all the date precision you want. For example:
function (doc) {
  var d = new Date(doc.date)
  emit([ d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate(), d.getUTCHours(), d.getUTCMinutes(), d.getUTCSeconds(), d.getUTCMilliseconds() ])
}
With this view and the group_level parameter, you can get all the scores by year, month, date, hour, etc. As I mentioned, in this case it won't be sorted by score yet, but maybe this opens up other queries to you. (eg: what users participated this month?)
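For instance, assuming the view also defines a reduce function such as _count (not shown in the map above), a group_level query might look like this sketch, again with invented names:

import requests

# group_level=2 rolls rows up by the first two key components, i.e. [year, month].
view = "http://localhost:5984/mydb/_design/scores/_view/by_time"
for row in requests.get(view, params={"group_level": "2"}).json()["rows"]:
    print(row["key"], row["value"])   # e.g. [2017, 1] -> count of events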

MongoDB API pagination

Imagine a situation where a client has a feed of objects with limit 10.
When the next 10 are required, it sends a request with skip 10 and limit 10.
But what if some new objects were added to (or deleted from) the collection since the first request with offset == 0?
Then on the second request (with offset == 10) the response may return objects in the wrong order.
Sorting on creation time does not work here, because I have some feeds that are sorted by some numeric field.
You can add a time field like created_at or updated_at. It must be updated whenever the document is created or modified, and the field must be unique.
Then query the DB for the range of time using $gte and $lte, along with a sort on this time field.
This ensures that any changes made outside the time window will not get reflected in the pagination, provided that the time field does not have duplicates. Most probably, if you include microtime, duplicates won't happen.
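As a sketch, that query might look like this in pymongo (the database, collection, and field names are invented):

from datetime import datetime, timezone
from pymongo import ASCENDING, MongoClient

coll = MongoClient()["feeds"]["items"]   # hypothetical database/collection

# Page within a fixed time window so inserts/deletes outside the window do
# not shift the results (assumes updated_at values are unique).
window = {"updated_at": {"$gte": datetime(2024, 1, 1, tzinfo=timezone.utc),
                         "$lte": datetime(2024, 1, 2, tzinfo=timezone.utc)}}
for doc in coll.find(window).sort("updated_at", ASCENDING).limit(10):
    print(doc["_id"], doc["updated_at"])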
It really depends on what you want the result to be.
If you want the original objects in their original order regardless of delete and add operations, then you need to make a copy of the list (or at least of the order) and then page through that: copy every id to a new collection that doesn't change once the page has loaded, and paginate through that.
Alternatively, and perhaps more likely, what you want is to see the next 10 after the last one in the current set, including any delete or add operations that have taken place since. For this, you can use the sorted order in which you are viewing them and a filter: $gt whatever the last item was. BUT that doesn't work when there are duplicates in the field on which you are sorting. To get around that, you will need to index on that field PLUS some other field which is unique per record, for example, the _id field. Now you can take the last record in the first set and look for records that are $eq the indexed value and $gt the _id, OR are simply $gt the indexed value.
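A sketch of that last pattern in pymongo, sorting on a numeric score field with _id as the unique tiebreaker (names are invented):

from pymongo import ASCENDING, MongoClient

coll = MongoClient()["feeds"]["items"]   # hypothetical database/collection
# A compound index supports both the sort and the keyset predicate.
coll.create_index([("score", ASCENDING), ("_id", ASCENDING)])

def next_page(last, count=10):
    """Items after 'last' (a previously returned doc); None = first page."""
    query = {}
    if last is not None:
        query = {"$or": [
            {"score": last["score"], "_id": {"$gt": last["_id"]}},
            {"score": {"$gt": last["score"]}},
        ]}
    return list(coll.find(query)
                    .sort([("score", ASCENDING), ("_id", ASCENDING)])
                    .limit(count))

page = next_page(None)
while page:
    page = next_page(page[-1])   # still works if page[-1] was deleted meanwhile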

Create Notes view for duplicate parent documents

We have an XPages application and recently discovered an issue where several Notes documents have duplicates, but the duplicates are PARENT documents too, NOT response documents. Is it possible to create a Notes view that will show duplicates where all the duplicates are parents? I know the formula for showing conflicts is the following, but what about where they are all parents?
SELECT @IsAvailable($Conflict)
Expounding on my comment:
Create a view which is categorized on the first column
In the first column formula, put in criteria that you would use to determine a duplicate. This may be the Document Unique ID, or maybe another field or combination of fields.
Add a second column that contains the number 1. Then enable column totals on this column.
Now look at this view you created. With the view categories collapsed, look for any number greater than 1 to determine which documents are duplicates.
I think what you are asking is not how to identify the duplicates - but how to find out which of them are parent documents. So basically you would create a view as Steve suggests - but instead of putting a constant of 1 into the second column I would suggest putting either @DocChildren (for immediate responses) or @DocDescendants (for all responses and responses to responses).
If I understand your logic then all the ones returning 0 (zero) are child documents and those returning 1 or higher would be parent documents. Of course you could also use an item on the document in your view formula - if it only exists on the parent doc (or its value can tell that it is a parent doc)
View selection formulas act on only one document at a time. They cannot perform lookups. They have no way to compare two documents. There is therefore no possible way for a view to identify duplicates.
A view can, as per the other answers, categorize documents based on common values. If there is a single field that is supposed to be unique across all documents, you can categorize on that field. That will give you a visualization of the duplicates, but it won't filter them in or out.
The only way for a view to filter duplicates - either to show only duplicates, or to exclude duplicates - would be if you run an agent that reads all documents, looks for those that are duplicates, and marks them with a special field value - e.g., IsDuplicate = 1. Once you do that, you can create a view that selects all documents with IsDuplicate = 1, or a view that excludes IsDuplicate = 1.

How to dynamically limit time range?

I have two sourcetypes:
A defines the period of activities:
_time, entity, start_time, end_time, activity, ...
B defines the 2D position of the entities:
_time, entity, x, y, ....
Now I am trying to extract only those rows in B that fall within the periods defined in A. How can I do that? It seems I can't compare times with the 'join' command.
You're right, join won't be much help here. I've found that the Splunk way to match up information in two indexes is to start with both indexes and manipulate the heterogeneous events as if they were a single index.
In this case, one approach uses streamstats to produce events that are denormalized to include the relevant activity fields on each position event. First, make sure each event from index A considers start_time to be the _time field. Then, use streamstats to fill each event's null start_time, end_time, and activity fields (the events coming from index B) with the latest non-null value for that entity (which comes from index A). Finally, filter out any events where _time > activity_end_time, which would be any position event that falls outside an activity window.
index=A OR index=B
| eval _time=coalesce(start_time, _time)
| streamstats latest(start_time) as activity_start_time, latest(end_time) as activity_end_time, latest(activity) as activity by entity
| where _time<=activity_end_time
Keep in mind that this approach assumes that activities are neatly ordered, so that no activity overlaps another. This would be a bit trickier if activities can overlap.
Another method I sometimes use is to use transaction instead of streamstats. This gives much more control over the logic around when one activity starts and ends, and results in a single event per activity with multi-valued fields for the position. You'd want to start with a single "point" field for each position if you took this route.

Model and ordered list in Cassandra

I need to model a list of items which is sorted by the time of last update of the item.
Consider for instance a user task list. Each user has a list of tasks and each tasks has a due date. Tasks can be added to that list, but also the due date of a task can change after it has been added to the list.
That is, a task which is in the 3rd position in the task list of User A may have to be moved to the 1st, as a result of the due date of the task being updated.
What I have right now is the following CF:
CREATE TABLE UserTasks (
  user_id uuid,
  task_id timeuuid,
  new_due_date timestamp,
  PRIMARY KEY (user_id, task_id)
);
I understand that I cannot sort on 'new_due_date' unless it is made part of the key.
But if it's part of the key, then it cannot be updated; the row must instead be deleted and recreated.
My concern in doing so is that if a task exists in the task lists of 100,000 users, then I need to perform 100,000 select/delete/insert sequences, whereas if I could sort on new_due_date it would be 100,000 updates.
Any suggestions would be greatly appreciated.
Well, one option: if you use PlayOrm with Cassandra, you can partition by user_id and query for the UserTasks of a user. If you query where time > 0 and time < MAX, it returns a cursor (reading in batchSize rows at a time) and you can traverse the cursor in reverse or plain order. This solution scales infinitely with the number of users, but only scales to millions of tasks per user, which may be OK, but I don't know your domain well enough.
Dean
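Alternatively, staying with plain CQL, the delete-and-recreate approach from the question can at least be made atomic per user with a batch. A sketch using the DataStax Python driver (the keyspace, table, and function names are my own; clustering on the due date keeps each user's list sorted):

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

session = Cluster(["127.0.0.1"]).connect("myks")   # hypothetical keyspace

# Clustering on (new_due_date, task_id) keeps each user's tasks ordered by
# due date; moving a task means deleting the old row and inserting a new one.
session.execute("""
    CREATE TABLE IF NOT EXISTS user_tasks (
        user_id      uuid,
        new_due_date timestamp,
        task_id      timeuuid,
        PRIMARY KEY (user_id, new_due_date, task_id)
    )""")

delete_stmt = session.prepare(
    "DELETE FROM user_tasks WHERE user_id=? AND new_due_date=? AND task_id=?")
insert_stmt = session.prepare(
    "INSERT INTO user_tasks (user_id, new_due_date, task_id) VALUES (?, ?, ?)")

def move_task(user_id, task_id, old_due, new_due):
    """Reposition one task in one user's list; both writes in one batch."""
    batch = BatchStatement()
    batch.add(delete_stmt, (user_id, old_due, task_id))
    batch.add(insert_stmt, (user_id, new_due, task_id))
    session.execute(batch)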
