MongoDB fetch rows before and after find result - node.js

I'm using MongoDB (Mongoose on Node.js). I have a very large database of events, and each event has a field seq (sequence) that defines the order of the events.
I want to allow my users to find all the occurrences of a given event.
For example:
The user searches for the event "ButtonClicked", and I should return all the locations where this event happened, in this example say [239, 1992, 5932].
This is easy: I can just search for the requested event and return the seq field.
Now I want to let the user view 20 events before, and 20 events after a specific seq.
It would have been great if I could do something like this:
db.events.find( { id:"ButtonClicked", seq: 1992 } ).before(20).after(20);
How can I do that?
Please note that the field seq might start with any number, and skip numbers, but it is incremental!
For example: [3,4,5,6,7,12,13,15,56,57...]
Also, note that the solution can ignore seq, I mentioned this field because I think that it can help the solution.
Thanks!

You could use comparison query operators, in particular $gte and $lte, using seq as an offset for the comparison.
Try:
var seqOffset = 1992;
db.events.find( { seq: { $gte: seqOffset - 20, $lte: seqOffset + 20 } } );
Note that you may not get exactly 40 surrounding events, since, as you mentioned, seq might skip numbers.
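If you need exactly 20 events on each side even when seq skips values, a minimal sketch (assuming a hypothetical Mongoose model named Event) could run two range queries around the target:

var seqOffset = 1992;

// 20 events strictly before the target, fetched closest-first...
Event.find({ seq: { $lt: seqOffset } }).sort({ seq: -1 }).limit(20).exec(function (err, before) {
  if (err) return console.log(err);
  // ...and 20 events strictly after it.
  Event.find({ seq: { $gt: seqOffset } }).sort({ seq: 1 }).limit(20).exec(function (err, after) {
    if (err) return console.log(err);
    // "before" comes back newest-first; reverse it to read in sequence order.
    console.log(before.reverse(), after);
  });
});

An index on seq keeps both queries cheap.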

Related

Efficiently count Documents with different values for a given field

I am trying to count the number of documents that are in each possible state in a particular Arango collection.
This should be possible in one pass over all of the documents, using a bucket-sort-like strategy: iterate over all documents, and if the value of the state field hasn't been seen before, add a counter with a value of 1 to a list; if you have seen that state before, increment its counter. Once you've reached the end, you'll have a counter for each possible state in the DB indicating how many documents are currently stored with that state.
I can't seem to figure out how to write this type of logic in AQL to submit as a query. Current strategy is like this:
Loop over all documents, filtering only docs of a particular state.
Loop over all documents, filtering only docs of a different particular state.
...
All states have been filtered.
Return size of each set
This works, but I'm sure it's much slower than it should be. This also means that if we add a new state, we have to update the query to loop over all docs an additional time, filtering based on the new state. A bucket-sort like query would be quick, and would need no updating as new states are created as well.
If these were the documents:
{A}
{B}
{B}
{C}
{A}
Then I'd like the result to be
{ A:2, B:2, C:1 }
Where A, B, and C are values for a particular field. My current strategy filters like so:
LET docsA = (
  FOR doc IN collection
    FILTER doc.state == "A"
    RETURN doc
)
Then I manually construct the return object, calling LENGTH on each list of docs.
Any help or additional info would be greatly appreciated
What about using a COLLECT operation? (see the ArangoDB documentation on COLLECT)
FOR doc IN collection
  COLLECT s = doc.state WITH COUNT INTO c
  RETURN { state: s, count: c }
This would return something like:
[
{ state: 'A', count: 23 },
{ state: 'B', count: 2 },
{ state: 'C', count: 45 }
]
Would that accomplish what you are after?
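If you specifically want the single-object shape from the question, { A: 2, B: 2, C: 1 }, a possible variation (a sketch, assuming an ArangoDB version that supports dynamic attribute names and MERGE over an array) is:

RETURN MERGE(
  FOR doc IN collection
    COLLECT s = doc.state WITH COUNT INTO c
    /* one { state: count } object per distinct state, merged into a single object */
    RETURN { [s]: c }
)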

Paginating a mongoose mapReduce, for a ranking algorithm

I'm using a MongoDB mapReduce to code a ranking feed algorithm; it almost works, but the last thing to implement is the pagination. mapReduce supports limiting the results, but how could I implement the offset (skipping), e.g. based on the latest viewed _id of the results, given that I'm using Mongoose?
This is the procedure I wrote:
o = {};
o.map = function() {
  // score = log10(likes + comments + 1), scaled down by the time elapsed since the post creation
  emit(Math.log(this.likes + this.comments + 1) / Math.LN10 / Math.abs((now - this.createdAt) / 6e7 + 1), this);
};
o.reduce = function(key, values) {
  // sort the values when they have the same score
  values.sort(function(a, b) {
    return a.createdAt - b.createdAt; // the comparator must return the difference
  });
  // serialize the values, because mongoose does not support multiple returned values
  return JSON.stringify(values);
};
o.scope = {now: new Date()};
o.limit = 15;
Posts.mapReduce(o, function(err, results) {
  if (err) return console.log(err);
  console.log(results);
});
Also, if mapReduce is not the way to go, can you suggest other approaches to implementing something like this?
What you need is a page delimiter, which is not the _id of the latest viewed result as you say, but your sorting property. In this case, it seems to be the formula Math.log(this.likes + this.comments + 1) / Math.LN10 / Math.abs((now - this.createdAt) / 6e7 + 1).
So your mapReduce query needs to hold a condition on that formula: specifically, formula >= the score of the last item on the previous page. It also needs to hold the value of createdAt at the last page, since you don't sort by that (assuming createdAt is unique). So the query of your mapReduce would say where: theFormulaExpression, createdAt: { $lt: lastCreatedAt }.
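In Mongoose terms that could look like the sketch below; lastScore and lastCreatedAt are hypothetical values carried over from the previous page, and the $where string re-evaluates the same score formula server-side (slow, but it mirrors the suggestion above):

o.query = {
  createdAt: { $lt: lastCreatedAt },
  // recompute the ranking formula per document and keep only scores at or past the last page's
  $where: "Math.log(this.likes + this.comments + 1) / Math.LN10 / " +
          "Math.abs((Date.now() - this.createdAt) / 6e7 + 1) >= " + lastScore
};

Note that Date.now() inside $where runs on the server, so it will differ slightly from the now captured in o.scope.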
If you do allow multiple identical createdAt values, you have to play a little outside of the database itself.
So you just search by formula.
Ideally, that gives you one element with exactly that value, and the next ones sorted after it. So, in the reply to the module caller, remove this first element off the array (and make sure you actually ask for more results than you need because of this).
Now, since you allow for multiple similar values, you need another identifying property, say, the object id or createdAt. Your consumer (the caller of this module) will have to provide both (the last value of the score and the createdAt of the last object). Say you have a page split exactly in the middle: one or more objects with that score are on the previous page, another set on the next. You'd have to not simply remove the top value (because that same score was already served on the previous page), but possibly several of them from the top.
Then it gets really tricky, because potentially your whole page was already served: compare the _ids and look for the first one after the one your module caller has provided you with. Or look into the data, determine how many matching values like that there are, and try to get at least that many more values from mapReduce than your actual page size.
Aside from that, I would do this with aggregation instead; it should be much more performant.
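A rough sketch of that aggregation alternative (assuming MongoDB 3.2+ for the $log10 operator, and reusing the Posts model from the question):

var now = new Date();
Posts.aggregate([
  { $project: {
    doc: "$$ROOT",
    // same score as the map function: log10(likes + comments + 1) / abs((now - createdAt) / 6e7 + 1)
    score: { $divide: [
      { $log10: { $add: ["$likes", "$comments", 1] } },
      { $abs: { $add: [{ $divide: [{ $subtract: [now, "$createdAt"] }, 6e7] }, 1] } }
    ] }
  } },
  // sort by score; to paginate, add a $match on the last seen score instead of using skip
  { $sort: { score: -1, "doc.createdAt": -1 } },
  { $limit: 15 }
], function (err, results) {
  if (err) return console.log(err);
  console.log(results);
});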

Compare values inside same subdocument for findOne() [MongoDB]

I have a database full of objects that look ~exactly like this (simplified for clarity):
{
  "_id": "GIFT100",
  "price": 100,
  "priceHistory": [100, 110],
  "update": 1444183299242
}
What I'm trying to do is create a query document for MongoJS (or MongoDB and I can figure out the rest) that looks for the fact that priceHistory[0] < priceHistory[1].
I would want my query document to return the above record as a result. Alternatively, I could change my document code to compare price < priceHistory[0] but I believe this still leads to the same problem (comparing values inside the same document).
Any help would be appreciated, I've exhausted my Google-foo.
Edit:
I want to return a set of records that indicate a price drop since our last scan (performed daily). Basically a set of "sale" items from a data source I don't control.
You can use the $where clause, but be careful: it's slow, it cannot use your indexes, and it will perform a full collection scan. Pass in whatever JavaScript you want to use for the comparison:
db.collection.findOne({$where: "this.priceHistory[0] < this.priceHistory[1]"})
Additionally, you can skip the $where key if that's the only thing you're querying by:
db.collection.findOne("this.priceHistory[0] < this.priceHistory[1]")

mongodb: another "how to add a random record" thread

I've come across many versions of this same question here on Stack Overflow, none providing a solid solution, so here we go:
I need to pick a random document from around 5 million documents in my MongoDB database in an efficient way.
I've tried getting the .count() and using .skip() to get a random document, but it takes almost three seconds and is very, very inefficient.
I can't make changes to the documents (like adding a "random" entry to each document) or change their _ids.
I've tried the solution of inserting documents with an incremental _id (so I can pick a random _id and bypass .skip()), but this brought more headaches than it solved when I tried to add many documents in a short amount of time.
Adding data in an incremental way, or picking a random document, should not be this hard. I'm either missing some common knowledge, or doing something wrong, or this is just how it is.
Wanted to bring up the topic and get your responses.
Here is a way using the default ObjectId values for _id and a little math and logic.
// Get the "min" and "max" timestamp values from the _id in the collection and the
// diff between.
// 4-bytes from a hex string is 8 characters
var min = parseInt(db.collection.find()
.sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
max = parseInt(db.collection.find()
.sort({ "_id": -1 })limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
diff = max - min;
// Get a random value from diff and divide/multiply be 1000 for The "_id" precision:
var random = Math.floor(Math.floor(Math.random(diff)*diff)/1000)*1000;
// work out a "random" _id value in the range:
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")
// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
.sort({ "_id": 1 }).limit(1).toArray()[0];
That's the general logic in shell representation and easily adaptable.
So in points:
Find the min and max primary key values in the collection
Generate a random number that falls between the timestamps of those documents.
Add the random number to the minimum value and find the first document that is greater than or equal to that value.
This uses "padding" from the timestamp value in "hex" to form a valid ObjectId value since that is what we are looking for. Using integers as the _id value is essentially simplier but the same basic idea in the points.

couchDB sorting complex key

I have a couchDB database which has several different document "types" which all relate to a main "type".
In the common blog / post example, the main type is the blog post, and the others are comments (though there are 3 different types of comments).
All of the types have a date on them. However, I wish to sort blog posts by date but return all of the data from the comments as well. I can write an emit which produces keys like so:
[date, postID, docTypeNumber]
where docTypeNumber is 1 for post and > 1 for the different comment document types.
e.g:
["2013-03-01", 101, 1]
[null, 101, 2]
[null, 101, 2]
[null, 101, 3]
["2013-03-02", 101, 1]
[null, 102, 2]
[null, 102, 3]
Of course, if I emit this, all the nulls get sorted together. Is there a way to ignore the nulls and group them by the second item in the array, but sort them by the first if it is not null?
Or, do I have to get all the documents to record the post date in order for sort to work?
I do not want to use lists, they are way too slow and I'm dealing with a potentially large data set.
You can do this by using conditionals in your map function.
if (date != null) {
  emit([date, postID, docTypeNumber]);
} else {
  emit([postID, docTypeNumber]);
}
I don't know if you want your array length to be variable or not. If not, you could add the sort variable first. The following snippet could work since date and postID presumably never have the same values.
if (date != null) {
  sortValue = date;
} else {
  sortValue = postID;
}
// emit takes a single key (plus an optional value), so the four parts go into one key array
emit([sortValue, date, postID, docTypeNumber]);
Update: I thought about this a little more. In general, I make my views based on queries I want to perform. So I ask myself, what do I need to query? It seems that in your case, you might have two distinct queries here. If so, I suggest having two different views. There is a performance penalty to pay since you would run two views instead of one, but I doubt it is perceivable to the user. And it might take up more disk space. The benefit for you would be clearer and more explicit code.
It seems you want to sort all the data (both the post and the comments) by the post's date. Since in your design the comment documents do not contain the post date (just the comment date), this is difficult with the view collation pattern. I suggest changing the database design to make the blog post ID meaningful and contain the date, e.g. the date concatenated with the author id. This way, if you emit [doc._id, doc.type] from the post and [doc.post, doc.type] from the comment documents, you will have posts and comments grouped and sorted by date.
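A sketch of that map function (assuming posts carry type "post", comments store their parent post's _id in doc.post, and post _ids start with the date, e.g. "2013-03-01-jane" — all of which are assumptions about your schema):

function (doc) {
  if (doc.type === "post") {
    // the _id begins with the date, so posts collate chronologically
    emit([doc._id, doc.type], null);
  } else {
    // comments share their post's _id as the first key element, so they group under it
    emit([doc.post, doc.type], null);
  }
}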
