Paginating a mongoose mapReduce, for a ranking algorithm - node.js

I'm using a MongoDB mapReduce to code a ranking feed algorithm, it almost works but the latest thing to implement is the pagination. The map reduce supports the results limitation but how could I implement the offset (skipping) based e.g. on the latest viewed _id of the results, knowing that I'm using mongoose?
This is the procedure I wrote:
o = {};
o.map = function() {
//log10(likes+comments) / elapsed hours from the post creation
emit(Math.log(this.likes + this.comments + 1) / Math.LN10 / Math.abs((now - this.createdAt) / 6e7 + 1), this);
};
o.reduce = function(key, values) {
//sort the values, when they have the same score
values.sort(function(a, b) {
a.createdAt - b.createdAt;
});
//serialize the values, because mongoose does not support multiple returned values
return JSON.stringify(values);
};
o.scope = {now: new Date()};
o.limit = 15;
Posts.mapReduce(o, function(err, results) {
if (err) return console.log(err);
console.log(results);
});
Also, if the mapReduce it's not the way to go, do you suggest other on how to implement something like this?

What you need is a page delimiter which is not the id of the latest viewed as you say, but your sorting property. In this case, it seems to be the formula Math.log(this.likes + this.comments + 1) / Math.LN10 / Math.abs((now - this.createdAt) / 6e7 + 1).
So, in your mapReduce query needs to hold a where value of that formula above. Or specifically, 'formula >= . And also it needs to hold the value of createdAt at the last page, since you don't sort by that. (Assuming createdAt is unique). So yourqueryof mapReduce would saywhere: theFormulaExpression, createdAt: { $lt: lastCreatedAt }`
If you do allow multiple identical createdAt values, you have to play a little outside of the database itself.
So you just search by formula.
Ideally, that gives you one element with exactly that value, and the next ones sorted after that. So in reply to the module caller, remove this first element off the array (and make sure you actually ask for more results then you need because of this).
Now, since you allow for multiple similar values, you need another identifying prop, say, object id or created_at. Your consumer (caller of this module) will have to provide both (last value of the score, createdAt of the last object). Say you have a page split exactly in the middle - one or more objects is on the previous page, another set on the next
. You'd have to not simply remove the top value (because that same score is already served on the previous page), but possibly several of them from the top.
Then it goes really crazy, because potentially your whole page was already served - compare the _ids, look for the first one after the one your module caller has provided you with. Or look into the data and determine how many matching values like that are there, try to get at least as many more values from mapReduce then you have on your actual page size.
Aside from that, I would do this with aggregation instead, it should be much more preformant.

Related

How to filter view query results based on reduce value (and not just on key)

Using map/reduce functions only (not Mango),and the following example from the documentation, using the map and reduce functions below One may obtain the number of unique labels:
Documents return by the view
{"total_rows":9,"offset":0,"rows":[
{"id":"3525ab874bc4965fa3cda7c549e92d30","key":"bike","value":null},
{"id":"3525ab874bc4965fa3cda7c549e92d30","key":"couchdb","value":null},
{"id":"53f82b1f0ff49a08ac79a9dff41d7860","key":"couchdb","value":null},
{"id":"da5ea89448a4506925823f4d985aabbd","key":"couchdb","value":null},
{"id":"3525ab874bc4965fa3cda7c549e92d30","key":"drums","value":null},
{"id":"53f82b1f0ff49a08ac79a9dff41d7860","key":"hypertext","value":null},
{"id":"da5ea89448a4506925823f4d985aabbd","key":"music","value":null},
{"id":"da5ea89448a4506925823f4d985aabbd","key":"mustache","value":null},
{"id":"53f82b1f0ff49a08ac79a9dff41d7860","key":"philosophy","value":null}
]}
Map function
function(doc) {
if(doc.name && doc.tags) {
doc.tags.forEach(function(tag) {
emit(tag, 1);
});
}
}
Reduce function
function(keys, values) {
return sum(values);
}
Response with grouping
{"rows":[
{"key":"bike","value":1},
{"key":"couchdb","value":3},
{"key":"drums","value":1},
{"key":"hypertext","value":1},
{"key":"music","value":1},
{"key":"mustache","value":1},
{"key":"philosophy","value":1}
]}
Now my question is, using map/reduce views only (not Mango) how can I query the view to only select rows having a specific value following reduce (for example "3"). It looks like all view parameters focus on filtering based on the key, but I need to filter based on value. Ideally, being able to also use greater than, lesser than for reduce value filtering would also be great.
The ability to filter based on the value is essential for scenarios like the one above, but also for more advanced scenarios involving linked documents. Of course, I am not interested in filtering in memory in the application layer since in real world scenarios, the result set would be much larger than a dozen lines.

Timeseries differencing - ArangoDB (AQL or Python)

I have a collection which holds documents, with each document having a data observation and the time that the data was captured.
e.g.
{
_key:....,
"data":26,
"timecaptured":1643488638.946702
}
where timecaptured for now is a utc timestamp.
What I want to do is get the duration between consecutive observations, with SQL I could do this with LAG for example, but with ArangoDB and AQL I am struggling to see how to do this at the database. So effectively the difference in timestamps between two documents in time order. I have a lot of data and I don't really want to pull it all into pandas.
Any help really appreciated.
Although the solution provided by CodeManX works, I prefer a different one:
FOR d IN docs
SORT d.timecaptured
WINDOW { preceding: 1 } AGGREGATE s = SUM(d.timecaptured), cnt = COUNT(1)
LET timediff = cnt == 1 ? null : d.timecaptured - (s - d.timecaptured)
RETURN timediff
We simply calculate the sum of the previous and the current document, and by subtracting the current document's timecaptured we can therefore calculate the timecaptured of the previous document. So now we can easily calculate the requested difference.
I only use the COUNT to return null for the first document (which has no predecessor). If you are fine with having a difference of zero for the first document, you can simply remove it.
However, neither approach is very straight forward or obvious. I put on my TODO list to add an APPEND aggregate function that could be used in WINDOW and COLLECT operations.
The WINDOW function doesn't give you direct access to the data in the sliding window but here is a rather clever workaround:
FOR doc IN collection
SORT doc.timecaptured
WINDOW { preceding: 1 }
AGGREGATE d = UNIQUE(KEEP(doc, "_key", "timecaptured"))
LET timediff = doc.timecaptured - d[0].timecaptured
RETURN MERGE(doc, {timediff})
The UNIQUE() function is available for window aggregations and can be used to get at the desired data (previous document). Aggregating full documents might be inefficient, so a projection should do, but remember that UNIQUE() will remove duplicate values. A document _key is unique within a collection, so we can add it to the projection to make sure that UNIQUE() doesn't remove anything.
The time difference is calculated by subtracting the previous' documents timecaptured value from the current document's one. In the case of the first record, d[0] is actually equal to the current document and the difference ends up being 0, which I think is sensible. You could also write d[-1].timecaptured - d[0].timecaptured to achieve the same. d[1].timecaptured - d[0].timecaptured on the other hand will give you the inverted timestamp for the first record because d[1] is null (no previous document) and evaluates to 0.
There is one risk: UNIQUE() may alter the order of the documents. You could use a subquery to sort by timecaptured again:
LET timediff = doc.timecaptured - (
FOR dd IN d SORT dd.timecaptured LIMIT 1 RETURN dd.timecaptured
)[0]
But it's not great for performance to use a subquery. Instead, you can use the aggregation variable d to access both documents and calculate the absolute value of the subtraction so that the order doesn't matter:
LET timediff = ABS(d[-1].timecaptured - d[0].timecaptured)

Couchdb - date range + multiple query parameters

I want to be able query the couchdb between dates, I know that this can be done with startkey and endkey (it works fine), but is it possible to do query for example like this:
SELECT *
FROM TABLENAME
WHERE
DateTime >= '2011-04-12T00:00:00.000' AND
DateTime <= '2012-05-25T03:53:04.000'
AND
Status = 'Completed'
AND
Job_category = 'Installation'
Generally-speaking, establishing indexes on multiple fields grows in complexity as the number of fields increases.
My main question is: do Status and Job_category need to be queried dynamically too? If not, your view is simple:
function (doc) {
if (doc.Status === 'Completed' && doc.Job_category === 'Installation') {
emit(doc.DateTime); // this line may change depending on how you break up and emit the datetimes
}
}
Views are fairly cheap, (depending on the size of your database) so don't be afraid to establish several that cover different cases. I would expect something like Status to have predefined list of available options, as oppposed to Job_category which seems like it could be more related to user input.
If you need those fields to be dynamic, you can just add them to the index as well:
function (doc) {
emit([ doc.Status, doc.Job_category, doc.DateTime ]);
}
Then you can use an array as your start_key. For example:
start_key=["Completed", "Installation", ...]
tl;dr: use "static" views where you have a predetermined list of values for a given field. while possible to query "dynamic" views with multiple fields, the complexity grows very quickly.

Couchdb query for values calculated from key input

suppose i have the following data in my database:
[1,2],[2,1],[1,3],[3,1]...
were the numbers represent the a and b values of the formula a*x+b
what i now want is a query that returns the difference to a given point x,y.
for example: the point [2,6] is given. i want my query to return
[1,2] = -2 (1*2+2=4 4-6=-2)
[2,1] = -1 (2*2+1=5 5-6=-1)
[1,3] = -1 (1*2+3=5 4-6=-1)
[3,1] = 1 (3*2+1=7 7-6=-1)
I know how to do this in SQL but the data is already in a couchdb. I'm quite new to the NoSQL world and was wondering if something like this would be possible in couchdb.
what you can do is to use the standard MapReduce functionality of CouchDB.
Map is function you put in a view, which finds your data. You can have various criteria how to locate the docs you need. Next, if you specify so in the query with reduce=true, a reduce function is executed on each document that matched the map condition. You can use JavaScript to perform various operations on the document's values.
In your case, the map can look something like this:
function(doc) {
if(doc.a && doc.b) {
emit(doc._id,[doc.a, doc.b]);
}
}
then, the reduce gets called, like this:
function(keys, values, rereduce) {
var res;
//do something with values...
return res;
}
In your case keys will be list of document ID's and values will be the array of your a & b fields.
When you call the MapReduce (depending what method you use to access the DB), you should specify reduce=true.
Good resources on MapReduce (and on Views, Sorting and List funtions) are:
http://guide.couchdb.org/draft/views.html
http://www.slideshare.net/okurow/couchdb-mapreduce-13321353
Another way to go is to use a list function on the Map result, if you want to output the result in HTML. A good reason to use List function is that you can pass arguments to it with querystring, in your case it may be the point for which you want to calculate distances.
For detailed description on List functions, have a look here:
http://guide.couchdb.org/draft/transforming.html
Hope this helps.

Couchdb: filter and group in a single view

I have a Couchdb database with documents of the form: { Name, Timestamp, Value }
I have a view that shows a summary grouped by name with the sum of the values. This is straight forward reduce function.
Now I want to filter the view to only take into account documents where the timestamp occured in a given range.
AFAIK this means I have to include the timestamp in the emitted key of the map function, eg. emit([doc.Timestamp, doc.Name], doc)
But as soon as I do that the reduce function no longer sees the rows grouped together to calculate the sum. If I put the name first I can group at level 1 only, but how to I filter at level 2?
Is there a way to do this?
I don't think this is possible with only one HTTP fetch and/or without additional logic in your own code.
If you emit([time, name]) you would be able to query startkey=[timeA]&endkey=[timeB]&group_level=2 to get items between timeA and timeB grouped where their timestamp and name were identical. You could then post-process this to add up whenever the names matched, but the initial result set might be larger than you want to handle.
An alternative would be to emit([name,time]). Then you could first query with group_level=1 to get a list of names [if your application doesn't already know what they'll be]. Then for each one of those you would query startkey=[nameN]&endkey=[nameN,{}]&group_level=2 to get the summary for each name.
(Note that in my query examples I've left the JSON start/end keys unencoded, so as to make them more human readable, but you'll need to apply your language's equivalent of JavaScript's encodeURIComponent on them in actual use.)
You can not make a view onto a view. You need to write another map-reduce view that has the filtering and makes the grouping in the end. Something like:
map:
function(doc) {
if (doc.timestamp > start and doc.timestamp < end ) {
emit(doc.name, doc.value);
}
}
reduce:
function(key, values, rereduce) {
return sum(values);
}
I suppose you can not store this view, and have to put it as an ad-hoc query in your application.

Resources