Retrieving Hierarchical/Nested Data From CouchDB - nested

I'm pretty new to couchDB and even after reading (latest archive as now deleted) http://wiki.apache.org/couchdb/How_to_store_hierarchical_data (via ‘Store the full path to each node as an attribute in that node's document’) it's still not clicking just yet.
Instead of using the full path pattern as described in the wiki I'm hoping to keep track of children as an array of UUIDs and the parent as a single UUID. I'm leaning towards this pattern so I can maintain the order of children by their positions in the children array.
Here are some sample documents in couch, buckets can contain buckets and items, items can only contain other items. (UUIDs abbreviated for clarity):
{_id: 3944
name: "top level bucket with two items"
type: "bucket",
parent: null
children: [8989, 4839]
}
{_id: 8989
name: "second level item with no sub items"
type: "item"
parent: 3944
}
{
_id: 4839
name: "second level bucket with one item"
type: "bucket",
parent: 3944
children: [5694]
}
{
_id: 5694
name: "third level item (has one sub item)"
type: "item",
parent: 4839,
children: [5390]
}
{
_id: 5390
name: "fourth level item"
type: "item"
parent: 5694
}
Is it possible to look up a document by an embedded document id within a map function?
function(doc) {
if(doc.type == "bucket" || doc.type == "item")
emit(doc, null); // still working on my key value output structure
if(doc.children) {
for(var i in doc.children) {
// can i look up a document here using ids from the children array?
doc.children[i]; // psuedo code
emit(); // the retrieved document would be emitted here
}
}
}
}
In an ideal world final JSON output would look something like.
{"_id":3944,
"name":"top level bucket with two items",
"type":"bucket",
"parent":"",
"children":[
{"_id":8989, "name":"second level item with no sub items", "type":"item", "parent":3944},
{"_id": 4839, "name":"second level bucket with one item", "type":"bucket", "parent":3944, "children":[
{"_id":5694", "name":"third level item (has one sub item)", "type":"item", "parent": 4839, "children":[
{"_id":5390, "name":"fourth level item", "type":"item", "parent":5694}
]}
]}
]
}

You can find a general discussion on the CouchDB wiki.
I have no time to test it right now, however your map function should look something like:
function(doc) {
if (doc.type === "bucket" || doc.type === "item")
emit([ doc._id, -1 ], 1);
if (doc.children) {
for (var i = 0, child_id; child_id = doc.children[i]; ++i) {
emit([ doc._id, i ], { _id: child_id });
}
}
}
}
You should query it with include_docs=true to get the documents, as explained in the CouchDB documentation: if your map function emits an object value which has {'_id': XXX} and you query view with include_docs=true parameter, then CouchDB will fetch the document with id XXX rather than the document which was processed to emit the key/value pair.
Add startkey=["3944"]&endkey["3944",{}] to get only the document with id "3944" with its children.
EDIT: have a look at this question for more details.

Can you output a tree structure from a view? No. CouchDB view queries return a list of values, there is no way to have them output anything other than a list. So, you have to deal with your map returning the list of all descendants of a given bucket.
You can, however, plug a _list post-processing function after the view itself, to turn that list back into a nested structure. This is possible if your values know the _id of their parent — the algorithm is fairly straightforward, just ask another question if it gives you trouble.
Can you grab a document by its id in the map function? No. There's no way to grab a document by its identifier from within CouchDB. The request must come from the application, either in the form of a standard GET on the document identifier, or by adding include_docs=true to a view request.
The technical reason for this is pretty simple: CouchDB only runs the map function when the document changes. If document A was allowed to fetch document B, then the emitted data would become invalid when B changes.
Can you output all descendants without storing the list of parents of every node? No. CouchDB map functions emit a set of key-value-id pairs for every document in the database, so the correspondence between the key and the id must be determined based on a single document.
If you have a four-level tree structure A -> B -> C -> D but only let a node know about its parent and children, then none of the nodes above know that D is a descendant of A, so you will not be able to emit the id of D with a key based on A and thus it will not be visible in the output.
So, you have three choices:
Grab only three levels (this is possible because B knows that C is a descendant of A), and grab additional levels by running the query again.
Somehow store the list of descendants of every node within the node (this is costly).
Store the list of parents of every node within the node.

Related

How can I speed up this MongoDB query in Mongoose?

I have a tree-like schema that specifies a collection of parents, and a collection of children.
The collection of children will likely have millions of documents - each of which contains a small amount of data, and a reference to the parent that it belongs to which is stored as a string (perhaps my first mistake).
The collection of parents is much smaller, but may still be in the tens of thousands, and will slowly grow over time. Generally speaking though, a single parent may have as few as 10 children, or as many as 50,000 (possibly more, although somewhat unlikely).
A single child document might look something like this:
{
_id: ObjectId("507f191e810c19729de860ea"),
info: "Here's some info",
timestamp: 1234567890.0,
colour: "Orange",
sequence: 1000,
parent: "12a4567b909c7654d212e45f"
}
Its corresponding parent record (which lives in a separate collection) might look something like this:
{
_id: ObjectId("12a4567b909c7654d212e45f")
info: "Blah",
timestamp: 1234567890.0
}
My query in mongoose (which contains the parent ID in the request) looks like this:
/* GET all children with the specified parent ID */
module.exports.childrenFromParent = function(req, res) {
parentID = req.params.parentID;
childModel.find({
"parentid": parentID
}).sort({"sequence": "asc"}).exec(
function(err, children) {
if (!children) {
sendJSONResponse(res, 404, {
"message": "no children found"
});
return;
} else if (err) {
sendJSONResponse(res, 404, err);
return;
}
sendJSONResponse(res, 200, children);
}
);
};
So basically what's happening is that the query has to search the entire collection of children for any documents that have a parent which matches the provided ID.
I'm currently saving this parent ID as a string in the children collection schema (childModel in the code above), which is probably a bad idea, however, my API is providing the parent ID as a string in the request.
If anyone has any ideas as to how I can either fix my schema or change the query to improve the performance, it would be much appreciated!
Why are you not using .lean() before your exec? Do you really want all of your documents to be Mongoose documents or just simple JSON docs? With lean() you will not get all the extra getters and setters that come with Mongoose document. This could easily shave off at least a second or two from the response time.
Write up from the comments:
You could help speed up and optimize queries by adding an index on the parent field. You can add an (ascending) index by doing the following:
db.collection.createIndex( { parent: 1 } )
You can analyse the benefit of an index by adding .explain("executionStats") to a query. See the docs for more info.
Adding an index on a large collection may take time, you can check the status by running the following query:
db.currentOp(
{
$or: [
{ op: "query", "query.createIndexes": { $exists: true } },
{ op: "insert", ns: /\.system\.indexes\b/ }
]
}
)
Edit: If you are sorting by sequence, you might want to add a compound index for the parent and the sequence.

Sorting CouchDB result by value

I'm brand new to CouchDB (and NoSQL in general), and am creating a simple Node.js + express + nano app to get a feel for it. It's a simple collection of books with two fields, 'title' and 'author'.
Example document:
{
"_id": "1223e03eade70ae11c9a3a20790001a9",
"_rev": "2-2e54b7aa874059a9180ac357c2c78e99",
"title": "The Art of War",
"author": "Sun Tzu"
}
Reduce function:
function(doc) {
if (doc.title && doc.author) {
emit(doc.title, doc.author);
}
}
Since CouchDB sorts by key and supports a 'descending=true' query param, it was easy to implement a filter in the UI to toggle sort order on the title, which is the key in my results set. Here's the UI:
List of books with link to sort title by ascending or descending
But I'm at a complete loss on how to do this for the author field.
I've seen this question, which helped a poster sort by a numeric reduce value, and I've read a blog post that uses a list to also sort by a reduce value, but I've not seen any way to do this on a string value without a reduce.
If you want to sort by a particular property, you need to ensure that that property is the key (or, in the case of an array key, the first element in the array).
I would recommend using the sort key as the key, emitting a null value and using include_docs to fetch the full document to allow you to display multiple properties in the UI (this also keeps the deserialized value consistent so you don't need to change how you handle the return value based on sort order).
Your map functions would be as simple as the following.
For sorting by author:
function(doc) {
if (doc.title && doc.author) {
emit(doc.author, null);
}
}
For sorting by title:
function(doc) {
if (doc.title && doc.author) {
emit(doc.title, null);
}
}
Now you just need to change which view you call based on the selected sort order and ensure you use the include_docs=true parameter on your query.
You could also use a single view for this by emitting both at once...
emit(["by_author", doc.author], null);
emit(["by_title", doc.title], null);
... and then using the composite key for your query.

MongoDB Relational Data Structures with array of _id's

We have been using MongoDB for some time now and there is one thing I just cant wrap my head around. Lets say I have a a collection of Users that have a Watch List or Favorite Items List like this:
usersCollection = [
{
_id: 1,
name: "Rob",
itemWatchList:[
"111111",
"222222",
"333333"
]
}
];
and a separate Collection of Items
itemsCollection = [
{
_id:"111111",
name: "Laptop",
price:1000.00
},
{
_id:"222222",
name: "Bike",
price:123.00
},
{
_id:"333333",
name: "House",
price:500000.00
}
];
Obviously we would not want to insert the whole item obj inside the itemWatchList array because the items data could change i.e. price.
Lets say we pull that user to the GUI and want to diplay a grid of the user itemWatchList. We cant because all we have is a list of ID's. Is the only option to do a second collection.find([itemWatchList]) and then in the results callback manipulate the user record to display the current items? The problem with that is what if I return an array of multiple Users each with an array of itemWatchList's, that would be a callback nightmare to try and keep the results straight. I know Map Reduce or Aggregation framework cant traverse multiple collections.
What is the best practice here and is there a better data structure that should be used to avoid this issue all together?
You have 3 different options with how to display relational data. None of them are perfect, but the one you've chosen may not be the best option for your use case.
Option 1 - Reference the IDs
This is the option you've chosen. Keep a list of Ids, generally in an array of the objects you want to reference. Later to display them, you do a second round-trip with an $in query.
Option 2 - Subdocuments
This is probably a bad solution for your situation. It means putting the entire array of documents that are stored in the items collection into your user collection as a sub-document. This is great if only one user can own an item at a time. (For example, different shipping and billing addresses.)
Option 3 - A combination
This may be the best option for you, but it'll mean changing your schema. For example, lets say that your items have 20 properties, but you really only care about the name and price for the majority of your screens. You then have a schema like this:
usersCollection = [
{
_id: 1,
name: "Rob",
itemWatchList:[
{
_id:"111111",
name: "Laptop",
price:1000.00
},
{
_id:"222222",
name: "Bike",
price:123.00
},
{
_id:"333333",
name: "House",
price:500000.00
}
]
}
];
itemsCollection = [
{
_id:"111111",
name: "Laptop",
price:1000.00,
otherAttributes: ...
},
{
_id:"222222",
name: "Bike",
price:123.00
otherAttributes: ...
},
{
_id:"333333",
name: "House",
price:500000.00,
otherAttributes: ...
}
];
The difficulty is that you then have to keep these items in sync with each other. (This is what is meant by eventual consistency.) If you have a low-stakes application (not banking, health care etc) this isn't a big deal. You can have the two update queries happen successively, updating the users that have that item to the new price. You'll notice this sort of latency on some websites if you pay attention. Ebay for example often has different prices on the search results pages than the actual price once you open the actual page, even if you return and refresh the search results.
Good luck!

Node.js and MongoDB if document exact match exists, ignore insert

I am maintaining a collection of unique values that has a companion collection that has instances of those values. The reason I have it that way is the companion collection has >10 million records where the unique values collection only add up to 100K and I use those values all over the place and do partial match lookups.
When I upload a csv file it is usually 10k to 500k records at a time that I insert into the companion collection. What is the best way to insert only values that dont already exist into the unique values collection?
Example:
//Insert large quantities of objects into mongo
var bulkInsert = [
{
name: "Some Name",
other: "zxy",
properties: "abc"
},
{
name: "Some Name",
other: "zxy",
properties: "abc"
},
{
name: "Other Name",
other: "zxy",
properties: "abc"
}]
//Need to insert only values that do not already exist in mongo unique values collection
var uniqueValues = [
{
name:"Some Name"
},
{
name:"Other Name"
}
]
EDIT
I tried creating a unique index on the field, but once it finds a duplicate in the Array of documents that I am inserting, it stops the whole process and doesnt proceed to check any values after the break.
Figured it out. If your doing it from the shell, you need to use Bulk() and create insert jobs like this:
var bulk = db.collection.initializeUnorderedBulkOp();
bulk.insert( { name: "1234567890a"} );
bulk.insert( { name: "1234567890b"} );
bulk.insert( { name: "1234567890"} );
bulk.execute();
and in node, the continueOnError flag works on a straight collection.insert()
collection.insert( [{name:"1234567890a"},{name:"1234567890c"}],{continueOnError:true}, function(err, doc){}
Well, I think the solution here is quite simple if I understand correctly your issue.
Since the process is stopped when it finds a duplicated field you should basically check if the value doesn't already exists before to try to add it.
So, for each element in uniqueValues, make a find/findOne query, if it doesn't return any result then add the element, otherwise don't.

How should I model my MongoDB collection for nested documents?

I'm managing a MongoDB database for a building products store. The most immediate collection is products, right?
There are quite several products, however they all belong to one among a set of 5-8 categories and then to one subcatefory among a small set of subcategories.
For example:
-Electrical
*Wires
p1
p2
..
*Tools
p5
pn
..
*Sockets
p11
p23
..
-Plumber
*Pipes
..
*Tools
..
PVC
..
I will use Angular at web site client side to show whole products catalog, I think about AJAX for querying the right subset of products I want.
Then, I wonder whether I should manage one only collection like:
{
MainCategory1: {
SubCategory1: {
{},{},{},{},{},{},{}
}
SubCategory2: {
{},{},{},{},{},{},{}
}
SubCategoryn: {
{},{},{},{},{},{},{}
}
},
MainCategory2: {
SubCategory1: {
{},{},{},{},{},{},{}
}
SubCategory2: {
{},{},{},{},{},{},{}
}
SubCategoryn: {
{},{},{},{},{},{},{}
}
},
MainCategoryn: {
SubCategory1: {
{},{},{},{},{},{},{}
}
SubCategory2: {
{},{},{},{},{},{},{}
}
SubCategoryn: {
{},{},{},{},{},{},{}
}
}
}
Or a single collection per each category. The number of documents might not be higher than 500. However I care about a balance for:
quick DB answer,
easy server side DB querying, and
client-side Angular code for rendering results to html.
I'm using mongodb node.js module, not Mongoose now.
What CRUD operations will I do?
Inserts of products, I'd also like to have a way to obtain autogenerated ids (maybe sequential) per each new register. However, as it might seem natural I wouldn't offer the _id to the user.
Querying the whole documents set of a subcategory. Maybe just obtaining a few attributes at first.
Querying whole or a specific subset of attributes of a document (product) in particular.
Modifying a product's attributes values.
I agree client side should get the easiest result to render. However, to nest categories into products is still a bad idea. The trade off is once you want to change, for example, the name of a category, it will be a disaster. And if you think about the possible usecases, for example:
list all categories
find all subcategories of a certain category
find all products in a certain category
You'll find it hard to do these stuff with your data structure.
I had same situation in my current project. So here's what I do for your reference.
First, categories should be in a separate collection. DON'T nest categories into each other, as it will complicate the procedure to find all subcategories. The traditional way for finding all subcategories is to maintain an idPath property. For example, your categories are divided into 3 levels:
{
_id: 100,
name: "level1 category"
parentId: 0, // means it's the top category
idPath: "0-100"
}
{
_id: 101,
name: "level2 category"
parentId: 100,
idPath: "0-100-101"
}
{
_id: 102,
name: "level3 category"
parentId: 101,
idPath: "0-100-101-102"
}
Note with idPath, parentId is not necessary anymore. It's for you to understand the structure easier.
Once you need to find all subcategories of category 100, simply do the query:
db.collection("category").find({_id: /^0-100-/}, function(err, doc) {
// whatever you want to do
})
With category stored in a separate collection, in your product you'll need to reference them by _id, just like when we use RDBMS. For example:
{
... // other fields of product
categories: [100, 101, 102, ...]
}
Now if you want to find all products in a certain category:
db.collection("category").find({_id: new RegExp("/^" + idPath + "-/"}, function(err, categories) {
var cateIds = _.pluck(categories, "_id"); // I'm using underscore to pluck category ids
db.collection("product").find({categories: { $in: cateIds }}, function(err, products) {
// products are here
}
})
Fortunately, category collection is usually very small, with only hundreds of records inside (or thousands). And it doesn't varies a lot. So you can always store a live copy of categories inside memory, and it can be constructed as nested objects like:
[{
id: 100,
name: "level 1 category",
... // other fields
subcategories: [{
id: 101,
... // other fields
subcategories: [...]
}, {
id: 103,
... // other fields
subcategories: [...]
},
...]
}, {
// another top1 category
}, ...]
You may want to refresh this copy every several hours, so:
setTimeout(3600000, function() {
// refresh your memory copy of categories.
});
That's all I get in mind right now. Hope it helps.
EDIT:
to provide int ID for each user, $inc and findAndModify is very useful. you may have a idSeed collection:
{
_id: ...,
seedValue: 1,
forCollection: "user"
}
When you want to get an unique ID:
db.collection("idSeed").findAndModify({forCollection: "user"}, {}, {$inc: {seedValue: 1}}, {}, function(err, doc) {
var newId = doc.seedValue;
});
The findAndModify is an atomic operator provided by mongodb. It will guarantee thread safety. and the find and modify actually happens in a "transaction".
2nd question is in my answer already.
query subsets of properties is described with mongodb Manual. NodeJS API is almost the same. Read the document of projection parameter.
update subsets is also supported by $set of mongodb operator.

Resources