reduce output must shrink more rapidly, on adding new document - couchdb

I have a couple of documents in CouchDB, each having a cId field, such as:
{
  "_id": "ccf8a36e55913b7cf5b015d6c50009f7",
  "_rev": "8-586130996ad60ccef54775c51599e73f",
  "cId": 1,
  "Status": true
}
I have a simple view that tries to return the max of cId, with map and reduce functions as follows:
Map
function(doc) {
  emit(null, doc.cId);
}
Reduce
function(key, values, rereduce) {
  return Math.max.apply(null, values);
}
This works fine (the output is 1) until I add one more document with cId = 2 to the db. I expect the output to be 2, but instead I get the error "Reduce output must shrink more rapidly". When I delete this document, things are back to normal again. What can be the issue here? Is there an alternative way to achieve this?
Note: There are more views in the db which perform different roles, and a few return JSON as well. They also start failing after this change.

You could simply use the built-in _stats reduce function in order to get the maximum value. It is returned in the "max" field.
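For example, a design document along these lines (the design doc and view names here are made up) exposes the maximum cId:
{
  "_id": "_design/stats",
  "views": {
    "maxCId": {
      "map": "function(doc) { if (typeof doc.cId === 'number') emit(null, doc.cId); }",
      "reduce": "_stats"
    }
  }
}
Querying /db/_design/stats/_view/maxCId then returns a single row whose value looks like {"sum": 3, "count": 2, "min": 1, "max": 2, "sumsqr": 5}, so the maximum cId is value.max.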

Related

How to keep json object with highest value among duplicates with nodejs

I have JSON objects imported from an external system, some of which are duplicates by ID value.
For example:
{
  "ID": "1",
  "name": "Bob",
  "ink": "100"
},
{
  "ID": "2",
  "Name": "George",
  "ink": "100"
},
{
  "ID": "1",
  "name": "Bob",
  "ink": "200"
}
I am manipulating the information for each object, then pushing them into a new JSON array:
var array = {};
array.users = [];
for (let user of users) {
  // ... manipulation code ...
  array.users.push(user);
}
I need to remove all duplicates, keeping only the one with the highest value in the ink key.
I found solutions that do this for the array AFTER it is constructed, but that means I use system resources for nothing: there is no reason to manipulate users that will be removed anyway.
I am looking for a way to check, for each new user, whether a user with that ID already exists in the array.users[] array. If it does, compare the values of the ink key; if the new one is higher, remove the existing entry from the array, then continue with my manipulation code and push the new user into the array.
Any ideas of what would be the most elegant/efficient/shortest way to accomplish this?
I am not really sure if I fully understood your question. If I understand correctly, you don't want to pass through the entire array after it is constructed and check for duplicates?
"If in doubt, throw a hash map at the problem." Use a Map instead of a plain array. The map key stores the ID, and your fields are saved as the value. If a key already exists, you can just check which value is higher.
A code example would look somewhat like this:
let userMap = new Map();
for (let user of users) {
  let existing = userMap.get(user.ID);
  // Keep whichever entry has the higher ink value
  if (!existing || +user.ink > +existing.ink) {
    userMap.set(user.ID, user);
  }
}
// If an array is needed afterwards, extract the map values:
array.users = [...userMap.values()];
EDIT: My solution does require an extra step, though, and is not done directly in the original array. However, I still think that maps are probably one of the most efficient ways to handle this...
var array = {};
array.users = users.filter((user) => {
  for (let userSecond of users) {
    // Discard this user if another entry shares its ID but has more ink
    if (userSecond.ID === user.ID && +userSecond.ink > +user.ink) {
      return false;
    }
  }
  return true;
});
Not the cleanest solution perhaps, but it should do the job. Basically you filter users; within the filter you go through every user again to check whether any of them has the same ID and more ink. If so, the current user is discarded by returning false. If no user with the same ID and more ink is found, the current user stays in the array.

Reduce output must shrink more rapidly -- Reducing to a list of documents

I have a few documents in my CouchDB with JSON as below. The cId will change for each. I have created a view with map/reduce functions to filter out a few documents and return a list of JSON documents.
Document structure:
{
  "_id": "ccf8a36e55913b7cf5b015d6c50009f7",
  "_rev": "8-586130996ad60ccef54775c51599e73f",
  "cId": 1,
  "Status": true
}
Here is the sample map:
function(doc) {
  if (doc.Key && doc.Value && doc.Status == true)
    emit(null, doc);
}
Here is the sample reduce:
function(key, values, rereduce) {
  var kv = [];
  values.forEach(function(value) {
    if (value.cId != <some_val>) {
      kv.push({"k": value.cId, "v": value});
    }
  });
  return kv;
}
If there are two documents and the reduce output is a list containing 1 document, this works fine. But if I add one more document (with cId = 2), it throws the error "reduce output must shrink more rapidly". What causes this? And how can I achieve what I intend to do?
The cause of the error is that the reduce function does not actually reduce anything (it is collecting objects instead). The documentation mentions this:
The way the B-tree storage works means that if you don’t actually
reduce your data in the reduce function, you end up having CouchDB
copy huge amounts of data around that grow linearly, if not faster
with the number of rows in your view.
CouchDB will be able to compute the final result, but only for views
with a few rows. Anything larger will experience a ridiculously slow
view build time. To help with that, CouchDB since version 0.10.0 will
throw an error if your reduce function does not reduce its input
values.
It is unclear to me what you intend to achieve.
Do you want to retrieve a list of docs based on certain criteria? In that case, a view without reduce should suffice.
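For instance, a map-only view like the following (a sketch based on the map function from the question) returns one row per matching document:
function(doc) {
  // no reduce function at all; just emit one row per matching doc
  if (doc.Key && doc.Value && doc.Status === true) {
    emit(doc.cId, null);
  }
}
Querying it with include_docs=true then returns the full documents, with no reduce step to trip over.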
Edit: If the desired result depends on a value stored in a certain document, then CouchDB has a feature called a list function. It is a design function that provides access to all rows of a given view, including the underlying docs if you pass include_docs=true.
A list URL follows this pattern:
/db/_design/foo/_list/list-name/view-name
Like views, lists are defined in a design document:
{
  "_id": "_design/foo",
  "lists": {
    "bar": "function(head, req) {
      var row;
      while (row = getRow()) {
        if (row.doc._id === 'baz') {
          // Do stuff based on a certain doc
        }
      }
    }"
  },
  ... // views and other design functions
}
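Assuming the names above, the list is then requested with something like:
GET /db/_design/foo/_list/bar/view-name?include_docs=true
where view-name is one of the views defined in the same design document.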

Sorting CouchDB result by value

I'm brand new to CouchDB (and NoSQL in general), and am creating a simple Node.js + express + nano app to get a feel for it. It's a simple collection of books with two fields, 'title' and 'author'.
Example document:
{
  "_id": "1223e03eade70ae11c9a3a20790001a9",
  "_rev": "2-2e54b7aa874059a9180ac357c2c78e99",
  "title": "The Art of War",
  "author": "Sun Tzu"
}
Map function:
function(doc) {
  if (doc.title && doc.author) {
    emit(doc.title, doc.author);
  }
}
Since CouchDB sorts by key and supports a 'descending=true' query param, it was easy to implement a filter in the UI to toggle sort order on the title, which is the key in my result set. Here's the UI:
[Screenshot: list of books with links to sort the title ascending or descending]
But I'm at a complete loss on how to do this for the author field.
I've seen this question, which helped a poster sort by a numeric reduce value, and I've read a blog post that uses a list to also sort by a reduce value, but I've not seen any way to do this on a string value without a reduce.
If you want to sort by a particular property, you need to ensure that that property is the key (or, in the case of an array key, the first element in the array).
I would recommend using the sort key as the key, emitting a null value and using include_docs to fetch the full document to allow you to display multiple properties in the UI (this also keeps the deserialized value consistent so you don't need to change how you handle the return value based on sort order).
Your map functions would be as simple as the following.
For sorting by author:
function(doc) {
  if (doc.title && doc.author) {
    emit(doc.author, null);
  }
}
For sorting by title:
function(doc) {
  if (doc.title && doc.author) {
    emit(doc.title, null);
  }
}
Now you just need to change which view you call based on the selected sort order and ensure you use the include_docs=true parameter on your query.
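For example (the database and design document names here are illustrative):
/db/_design/books/_view/by_author?include_docs=true&descending=true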
You could also use a single view for this by emitting both at once...
emit(["by_author", doc.author], null);
emit(["by_title", doc.title], null);
... and then using the composite key for your query.
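With the composite key, the author ordering could then be queried with something like this (names again illustrative, and the key parameters would be URL-encoded in practice):
/db/_design/books/_view/by_sort?startkey=["by_author"]&endkey=["by_author", {}]&include_docs=true
The {} sentinel in the endkey collates after any string, so the range covers every ["by_author", ...] row.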

emit doc twice with different key in couchdb

Say I have a doc to save with CouchDB, and the doc looks like this:
{
  "email": "lorem#gmail.com",
  "name": "lorem",
  "id": "lorem",
  "password": "sha1$bc5c595c$1$d0e9fa434048a5ae1dfd23ea470ef2bb83628ed6"
}
and I want to be able to query the doc either by 'id' or by 'email'. So when I save this as a view, I write:
db.save('_design/users', {
  byId: {
    map: function(doc) {
      if (doc.id && doc.email) {
        emit(doc.id, doc);
        emit(doc.email, doc);
      }
    }
  }
});
And then I could query like this:
db.view('users/byId', {
  key: key
}, function(err, data) {
  if (err || data.length === 0) return def.reject(new Error('not found'));
  data = data[0] || {};
  data = data.value || {};
  self.attrs = _.clone(data);
  delete self.attrs._rev;
  delete self.attrs._id;
  def.resolve(data);
});
And it works just fine. I can load the data either by id or by email. But I'm not sure if I should do so.
I have another solution: saving the same doc in two different views, byId and byEmail. But that way I store the same doc twice, which obviously costs database space.
Not sure which solution is better.
The canonical solution would be to have two views, one by email and one by id. To avoid wasting space on the document, you can just emit null as the value and then use the include_docs=true query parameter when you query the view.
Also, you might want to use _id instead of id. That way, CouchDB ensures the ID will be unique, and you don't have to use a view to look up documents by it.
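A minimal sketch of the two map functions (assuming id stays a separate field, as in the question):
// byId view (sketch)
function(doc) {
  if (doc.id) emit(doc.id, null);
}
// byEmail view (sketch)
function(doc) {
  if (doc.email) emit(doc.email, null);
}
Each view is then queried with include_docs=true to get the full document back without storing it in the index.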
I'd change to the two separate views. That's explicit and clear. When you emit the same doc twice in a single view, once by id and once by e-mail, you're effectively combining the two views into one. You may think of it as a search tree with two root branches. I don't see any reason for doing that, and would suggest leaving the data access and storage optimization job to the database.
The combined view may also yield tricky bugs when, for some reason, you confuse an id and an e-mail.
There is absolutely nothing wrong with emitting the same document multiple times with a different key. It's about what makes most sense for your application.
If id and email are always valid and interchangeable ways to identify a user then a single view is perfect. For example, when id is some sort of unique account reference and users are allowed to use that or their (more memorable) email address to login.
However, if you need to differentiate between the two values, e.g. id is only meant for application administrators, then separate views are probably better. (You could probably use a complex key instead ... but that's another answer.)

Mongoose update multiple geospatial index with no limit

I have some Mongoose Models with geospatial indexes:
var User = new Schema({
  "name": String,
  "location": {
    "id": String,
    "name": String,
    "loc": { type: Array, index: '2d' }
  }
});
I'm trying to update all items that are in an area - for instance:
User.update({ "location.loc": { "$near": [-122.4192, 37.7793], "$maxDistance": 0.4 } }, { "foo": "bar" }, { "multi": true }, function(err) {
  console.log("done!");
});
However, this appears to only update the first 100 records. Looking at the docs, it appears there is a native limit on finds against geospatial indexes that applies when you don't set a limit yourself.
(from docs:
Use limit() to specify a maximum number of points to return (a default limit of 100 applies if unspecified))
This appears to also apply to updates, regardless of the multi flag, which is a giant drag. If I apply an update, it only updates the first 100.
Right now the only way I can think of to get around this is to do something hideous like this:
Model.find({ "location.loc": { "$near": [-122.4192, 37.7793], "$maxDistance": 0.4 } }, { limit: 0 }, function(err, results) {
  var ids = results.map(function(r) { return r._id; });
  Model.update({ "_id": { $in: ids } }, { "foo": "bar" }, { multi: true }, function() {
    console.log("I have enjoyed crippling your server.");
  });
});
While I'm not even entirely sure that would work (and it could be mildly optimized by only selecting the _id), I'd really like to avoid keeping an array of n ids in memory, as that number could get very large.
Edit:
The above hack doesn't even work; it looks like a find with {limit:0} still returns 100 results. So, in an act of sheer desperation and frustration, I have written a recursive method to paginate through ids and then return them so I can update using the above method. I have added the method as an answer below, but not accepted it, in hopes that someone will find a better way.
This is a problem in mongo server core as far as I can tell, so mongoose and node-mongodb-native are not to blame. However, this is really stupid, as geospatial indexes are one of the few reasons to use mongo over some other more robust NoSQL stores.
Is there a way to achieve this? Even in node-mongodb-native, or the mongo shell, I can't seem to find a way to set (or in this case, remove by setting to 0) a limit on an update.
I'd love to see this issue fixed, but I can't figure out a way to set a limit on an update, and after extensive research, it doesn't appear to be possible. In addition, the hack in the question doesn't even work, I still only get 100 records with a find and limit set to 0.
Until this is fixed in mongo, here's how I'm getting around it: (!!WARNING: UGLY HACKS AHEAD:!!)
var getIdsPaginated = function(query, batch, callback) {
  // set a default batch if it isn't passed.
  if (!callback) {
    callback = batch;
    batch = 10000;
  }
  // define our array and a find method we can call recursively.
  var all = [],
      find = function(skip) {
        // skip defaults to 0
        skip = skip || 0;
        this.find(query, ['_id'], { limit: batch, skip: skip }, function(err, items) {
          if (err) {
            // if an error is thrown, call back with it and how far we got in the array.
            callback(err, all);
          } else if (items && items.length) {
            // if we returned any items, grab their ids and put them in the 'all' array
            var ids = items.map(function(i) { return i._id.toString(); });
            all = all.concat(ids);
            // recurse
            find.call(this, skip + batch);
          } else {
            // we have recursed and not returned any ids. This means we have them all.
            callback(err, all);
          }
        }.bind(this));
      };
  // start the recursion
  find.call(this);
};
This method will return a giant array of _ids. Because they are already indexed, it's actually pretty fast, but it's still calling the db many more times than necessary. When this method calls back, you can do an update with the ids, like this:
Model.update({ "_id": { "$in": ids } }, { "foo": "bar" }, { multi: true }, function(err) { console.log('hooray, more than 100 records updated.'); });
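Since the helper relies on this.find, it has to be invoked with the model as its context, for example (a sketch reusing the query from the question):
getIdsPaginated.call(
  Model,
  { "location.loc": { "$near": [-122.4192, 37.7793], "$maxDistance": 0.4 } },
  function(err, ids) {
    if (err) return console.error(err);
    // update every matched document, not just the first 100
    Model.update({ "_id": { "$in": ids } }, { "foo": "bar" }, { multi: true }, function(err) {
      if (!err) console.log("updated all matching records");
    });
  }
);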
This isn't the most elegant way to solve this problem; you can tune its efficiency by setting the batch size based on expected results, but obviously the ability to simply call update (or find, for that matter) on $near queries without a limit would really help.
