How to set a field containing unique key - couchdb

I want to save data in CouchDB documents and as I am used to doing it in RDBMS. I want to create a field which can only contain a unique value in the database. If I now save a document and there is already a document with unique key I expect an error from CouchDB.
I guess I can use the document ID and replace the auto generated doc-id by my value, but is there a way to set other field as unique key holder. Any best practice regarding unique keys?

As you said, the generated _id is enforced as unique. That is the only real unique constraint in CouchDB, and some people use it as such for their own applications.
However, this only applies to a single CouchDB instance. Once you start introducing replication and other instances, you can run into conflicts if the same _id is generated on more than 1 node. (depending on how you generate your _ids, this may or may not be a problem)

As Dominic said, the _id is the only parameter that is almost assured to be unique. One thing that is sure is that you have to design your "database" in a different way. Keep in mind that the _id will be database wide. You will be able to have only one document with this _id.
The _id must be a string, which means you can't have an array or a number or anything else.
If you want to make the access public, you'll have to think about how to generate your id in a way that it won't mess with your system.
I came up with ids that looked like that:
"email:email#example.com"
It worked well in my particular case to prevent people from creating multiple auth on the same email. But as Documinic said, if you have multiple masters, you'll have to think about possible conflicts.

Related

How do you count the amount of documents in a MongoDB collection within Node?

Quite an odd situation, I've tried numerous solutions and I can't seem to crack it.
A lot of the resources I've found only include mongo console commands, I'm not sure how you'd write this in Node.js.
The reason I'm trying to work this out is I'm trying to make each documents 'id' iterate from the last, so the attempt is to find the amount of documents in a collection, add one and then use that number as the 'id' for the new document.
To get the number of documents in a collection, use db.collection.find({}).count().
However, what you are trying to do will not work. When you have a lot of parallel accesses to the database, then it is possible that multiple threads do this at the same time, receive the same count and will thus insert a document with the same id. According to the CAP theorem, a distributed database like MongoDB can not provide this kind of consistency.
What you should do instead is rely on MongoDB ObjectId's as unique identifiers for documents. MongoDB generates these automatically for each document when you don't provide an own value for _id. ObjectId's are globally unique (unique enough for any practical purpose), so you won't get any collisions. They also begin with a timestamp, so when you order by _id you get a roughly chronological order (as previously stated, a strict chronological order is impossible to provide by a distributed system).
As a rule of thumb, whenever you would use AUTO_INCREMENT in SQL, you would likely use ObjectId's in MongodDB.

CouchDB - human readable id

Im using CouchDB with node.js. Right now there is one node involved and even in remote future its not planned to changed that. While I can remove most of the cases where a short and auto-incremental-like (it can be sparse but not like random) ID is required there remains one place where the users actually needs to enter the ID of a product. I'd like to keep this ID as short as possible and in a more human readable format than something like '4ab234acde242349b' as it sometimes has to be typed by hand and so on.
However in the database it can be stored with whatever ID pleases CouchDB (using the default auto generated UUID) but it should be possible to give it a number that can be used to identify it as well. What I have thought about is creating a document that consists of an array with all the UUIDs from CouchDB. When in node I create a new product I would run an update handler that updates said document with the new unique ID at the end. To obtain the products ID I'd then query the array and client side using indexOf I could get the index as a short ID.
I dont know if this is feasible. From the performance point of view I can say the following: There are more queries that should do numerical ID -> uuid than uuid -> numerical ID. There will be at max 7000 new entries a year in the database. Also there is no use case where a product can be deleted yet I'd like not to rely on that.
Are there any other applicable ways to genereate a shorter and more human readable ID that can be associated with my document?
/EDIT
From a technical point of view: It seems to be working. I can do both conversions number <-> uuid and it seems go well. I dont now if this works well with replication and stuff but as there is said array i guess it should, right?
You have two choices here:
Set your human readable id as _id field. Basically you can just set in create document calls to DB, and it will accept it. This can be a more lightweight solution, but it comes with some limitations:
It has to be unique. You should also be careful about clients trying to create documents, but instead overwrite existing ones.
It can only contain alphanumeric or a few special characters. In my experience it is asking for trouble to have extra character types.
It cannot be longer than a theoretical string length limit(Couchdb doesn't define any, but you should). Long ids will increase your views(indexes) size really bad. And it might make it s lower.
If these things are no problem with you, then you should go with this solution.
As you said yourself, let the _id be a UUID, and set the human readable id to another field. To reach the document by the human readable id, you can just create a view emitting the human readable id as a key, and then either emit the document as value or get the document via include_docs=true option. Whenever the view is reached Couchdb will update the view incrementally and return you the list. This is really same as you creating a document with an array/object of ids inside it. Except with using a couchdb view, you get more performance.
This might be also slightly slower on querying and inserting. If the ids are inserted sequentially, it's fine, if not, CouchDB will slightly take more time to insert it at the right place. These don't work well with huge amounts of insert coming at the DB.
Querying shouldn't be more than 10% of total query time longer than first option. I think 10% is really a big number. It will be most probably less than 5%, I remember in my CouchDB application, I switched from reading by _id to reading from a view by a key and the slow down was very little that from user end point, when making 100 queries at the same time, it wasn't noticeable.
This is how people, query documents by other fields than id, for example querying a user document with email, when the user is logging in.
If you don't know how couchdb views work, you should read the views chapter of couchdb definite guide book.
Also make sure you stay away from documents with huge arrays inside them. I think CouchDB, has a limit of 4GB per document. I remember having many documents and it had really long querying times because the view had to iterate on each array item. In the end for each array item, instead I created one document. It was way faster.

Using different types/format of _id ok?

I'm using one database for all (users, files and comments).
I was wondering if I can/should use
twitter user id for user doc _id's
md5 hash (of file) for file doc _id's
provided uuid for comment doc _id's
It feels weird to mix those different types of id's.
What speaks agains this scenario? Should I stick to the CouchDB uuid's for consistency?
Use any format of id, or combination of formats, as you see fit. You might wish to add a prefix to ensure there are no overlaps between them, though.
twitter:#rnewson
md5:86f646c11b3bc7d434d06c077aee43d8
And so on.

How do you guarantee the uniqueness of a field in a CouchDB document?

Just what it says on the tin. I have a bunch of (say) Member documents, I want each user to be able to set the Email field on his document to whatever he wants, example an Email that exists on another document. If I just do a check-insert, I'm vulnerable to a race condition. Is there some idiom for "locking" or inserting-then-checking?
The only sure way is to create a doc with the unique value as doc id.
As the other answer notes, the only field guaranteed to be unique in CouchDB is _id.
You might borrow a trick from the replicator here. In order to fast-forward a second replication between the same two hosts, it writes a checkpoint document which records the update sequence it last reached. But how does it find the checkpoint document in the future? Like so;
"_id": md5(source.host + source.port + target.host + target.port)
A general pattern can be extracted where your unique fields form part of the id itself. Running them through md5 guarantees a fixed length identifier.
In your case, you could just use email address as your id.
Changing one of these fields is a two step process, but one that still maintains the uniqueness property.
Copy the document to its new id (http://wiki.apache.org/couchdb/HTTP_Document_API#COPY)
If successful, delete the old one (http://wiki.apache.org/couchdb/HTTP_Document_API#DELETE)
A crash between steps 1 and 2 will leave the old document it place, so you may wish to add a reference to the old document in the new document. You can then create a view of those back references and perform a clean up sweep periodically.
All that said, CouchDB deliberately only supports only one unique field, as opposed to a typical RDBMS which can support elaborate relational constraints, in order for a solution to scale up cleanly in a cluster (c.f, BigCouch). In your case, where email address must be unique, much of what I've said should work (email addresses don't change that often), but obviously this is swimming upstream to a degree.
HTH,
B.

Approaches to generate auto-incrementing numeric ids in CouchDB

Since CouchDB does not have support for SQL alike AUTO_INCREMENT what would be your approach to generate sequential unique numeric ids for your documents?
I am using numeric ids for:
User-friendly IDs (e.g. TASK-123, RQ-001, etc.)
Integration with libraries/systems that require numeric primary key
I am aware of the problems with replication, etc. That's why I am interested in how people try to overcome this issue.
As Dominic Barnes says, auto-increment integers are not scalable, not distributed-friendly or cloud-friendly. It seems every app nowadays needs a mobile version with offline support, and that is not directly compatible with auto-increment integers. We all know this, but it's true: auto-increment integers are necessary for legacy code and arguably other stuff.
In both scenarios, you are responsible for producing the auto-incrementing integer. A view is running emit(the_numeric_id, null). (You could also have a "type" namespace, e.g. by emit([doc.type, the_numeric_id], null). Query for the final row (e.g. with a startkey=MAXINT&descending=true&limit=1, increment the value returned, and that is your next id. The attempt to save is in a loop which can retry if there was a collision.
You can also play tricks if you don't need 100% density of the list of IDs. For example, you can add timestamps to the emit() rows, and estimate the document creation velocity, and increment by that velocity times your computation and transmit time. You could also simply increment by a random integer between 1 and N, so most of the time the first insert works, at a cost of non-homogeneous ID numbers.
About where to store the integer, I think there is the id strategy and the try and check strategy.
The id strategy is simpler and quicker in the short term. Document IDs are an integer (perhaps prefixed with a type to add a namespace). Since Couch guarantees uniqueness on the _id field, you just worry about the auto-incrementing. Do this in a loop: 409 Conflict triggers a retry, 201 Accepted means you're done.
I think the major pain with this trick is, that if and when you get conflicts, you have two completely unrelated documents, and one of them must be copied into a fresh document. If there were relationships with other documents, they must all be corrected. (The CouchDB 0.11 emit(key, {_id: some_foreign_doc_id}) trick comes to mind.)
The try and check strategy uses the default UUID as the doc._id, so every insert will succeed. Ideally, all or most of your inter-document relations are based on the immutable UUID _id, not the integer. That is just used for users and UI. The auto-incrementing integer is simply a field in the document, {"int_id":20}. The view of course does emit(doc.int_id, null). (You can look up a document by integer id with a ?key=23?include_docs=true parameter of the view.
Of course, after a replication, you might have id conflicts (not official CouchDB conflicts, but just documents using the same numeric id). The view which emits by ID would also have a reduce phase: simply _count should be enough. Next you must patrol the DB, querying this view with ?group=true and looking for any row (corresponding to an integer id) which has a count > 1. On the plus side, correcting the numeric id of a document is a minor change because it does not require new document creation.
Those are my ideas. Now that I wrote them down, I feel like you must do relation-shepherding regardless of where the id is stored; so perhaps using _id is better after all. The only other downside I see is that you are permanently married to a fundamentally broken naming model—for some definition of "permanently."
Is there any particular reason you want to use numeric IDs over the UUIDs that CouchDB can generate for you? UUIDs are perfect for the distributed paradigm that CouchDB uses, stick with what is built in.
If you find yourself with any more than 1 CouchDB node in your architecture, you're going to get conflicting document IDs if you rely on something like "auto increment" when it comes time for replication. Even if you're only using 1 node now, that's probably not always going to be the case, especially since CouchDB works so well in a distributed and "offline" architecture.
I have had pretty good luck just using an iso formatted date as my key:
http://wiki.apache.org/couchdb/IsoFormattedDateAsDocId
It's pretty simple to do, human-readable and it basically builds in a few querying options by just existing. :-)
Keeping in mind the issues around replication and conflicts, you can use an update function to generate incrementing IDs that are guaranteed unique in a single master setup.
function(doc, req) {
if (!doc) {
doc = {
_id: req.id,
type: 'idGenerator',
count: 0
};
}
doc.count++;
return [doc, toJSON(doc.count)];
}
Include this function in a design document like so:
{
"_id": "_design/application",
"language": "javascript",
"updates": {
"generateId": "function (doc, req) {\n\t\t\tif (!doc) {\n\t\t\t\tdoc = {\n\t\t\t\t\t_id: req.id,\n\t\t\t\t\ttype: 'idGenerator',\n\t\t\t\t\tcount: 0\n\t\t\t\t};\n\t\t\t}\n\n\t\t\tdoc.count++;\n\t\t\t\n\t\t\treturn [doc, toJSON(doc.count)];\n\t\t}"
}
}
Then call it like so:
curl -XPOST http://localhost:5984/mydb/_design/application/_update/generateId/entityId
Replace entityId with whatever you like to create several independent ID sequences.
Not a perfect solution but something that worked for me. Create an independent service that generates auto-incremented ids. Yes, you probably say "this breaks the offline model of couchdb" but what if you get a pool of N ids that you can then use whenever you need to get a new auto-incremented id. Then every time you're online you get some more ids and if you are running out of ids you tell your users - please go online. If the pool is big enough (say the monthly traffic) this shouldn't happen. Again, not perfect but maybe can be helpful to some people.
Instead of explicitly constructing an increasing integer key, you could use the implicit index couchDB accepts for paging.
The skip parameter accepts an integer that will effectively provide the auto-incrementing index you are used to.
http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
The drawback is that it is not a viable solution for "User-friendly IDs". The index is not tied to the doc, and is subject to change if you are rewriting history.
If your only constraint is "integration with libraries/systems that require numeric primary key", this will bridge the gap without loosing the benefits of couchDB's key structure.

Resources