Generating lexographically ascending unique IDs

Generating lexographically ascending unique IDs - node.js

I want to generate IDs for use with CouchDB. I'd like the IDs to be lexographically ascending by time so that I can sort on id without maintaining a seperate timestamp field. I know that CouchDB will generate ids with this property, but I don't want the performance hit of querying the database, I'd rather just run an algorithm on my servers. I'd go with an implementation of rfc 4112 except that the results aren't lexographically ascending. Is there any good reason I shouldn't just do:
(Date.now()) + 'x' + Math.round(Math.random() *1E18)
(I'm using nodejs). Are there any costs of using a non-standard uuid, or of relying on javascript's built in random function?

You have some choices when it comes to uuids.
The first choice is if you want the _id generated client side(node, browser, etc..), or by couch. It sounds like you want to generate your own uuid on the client side. That is fine. Just stick the result of your function into the _id field of the doc you save to couchdb. Couch will just use that.
You could have couch create the id. Couchdb only generates a _id if you don't choose one for yourself. Couchdb by default uses a 'sequential' uuid generation algorithm. You can change the algorithm to others via futon and config. There is a section called 'uuids' with a key of 'algorithm'. You can see the source for these algorithms here:
https://github.com/apache/couchdb/blob/master/src/couchdb/couch_uuids.erl
With descriptions about them here:
http://wiki.apache.org/couchdb/HttpGetUuids?highlight=%28utc%5C_random%29
As you can see the utc_random function is very similiar to your suggestion. But if you wanted your own,If you were inclined you could add you algorithm on the serverside and recompile couch.
The second part of your question is about the performance of choosing different algorithms. I am going to quote Dave Cottlehuber from a user list post:
CouchDB will have best insert time when your doc ids are
continually increasing, as this minimises rewrites to the b~tree. This
will also help
your view build time for the same reason, and also minimises wasted doc space,
although that would also be recovered during compaction.
So both your algorithm and the utc_random should be fine as they doc ids are continually increasing do to the seemingly helpful one direction of time.

I would recommend sticking with the UUID that CouchDB generates for you, but you can configure the server to use utc_random which will prefix a timestamp which you can sort your records by.
http://wiki.apache.org/couchdb/HttpGetUuids

Related

Get the last documents?

CouchDB has a special _all_docs view, which returns documents sorted on ID. But as ID's are random by default, the sorting makes no sense.
I always need to sort by 'date added'. Now I have two options:
Generating my own ID's and make sure they start with a timestamp
Use standard GUID's, but add a timestamp in json, and sort on
that
Now the second solution is less hackish, but I suspect the first solution to be much more efficient and faster, because all queries will be done on the real row id, which is indexed.
Is it true that both solutions differ in performance? And if it's true, which one is likely to be faster or preferred?

Is it true that both solutions differ in performance?
Your examples given describing the primary and secondary index approach in CouchDB.
_all_docs is the only primary index and is always up-to-date. Secondary indexes (views) as in your second solution getting updated when they are requested.
Thats the reason why from the requesters point-of-view _all_docs might be "faster". In real there isn't a difference in requesting already up-to-date indexes. Two workarounds for potentially outdated views (secondary indexes) are the use of the query param stale=ok (update the view after the response to the request) or so called "view-heaters" (send a simple HTTP Get to the view to trigger the update process).
And if it's true, which one is [...] prefered?
The capabilities to build an useful index and response payload are significant higher on the side of secondary indexes.
When you want to use the primary index you have to "design" your id as you have described. You can imagine that is a huge pre-decision of what can also be done with the doc and the ids.
My recommendation would be to use secondary indexes (views). Only if you need data stored in real-time or high-concurrency scenarios you should include the primary index in the search for the best fit to request data.

Im using CouchDB with node.js. Right now there is one node involved and even in remote future its not planned to changed that. While I can remove most of the cases where a short and auto-incremental-like (it can be sparse but not like random) ID is required there remains one place where the users actually needs to enter the ID of a product. I'd like to keep this ID as short as possible and in a more human readable format than something like '4ab234acde242349b' as it sometimes has to be typed by hand and so on.
However in the database it can be stored with whatever ID pleases CouchDB (using the default auto generated UUID) but it should be possible to give it a number that can be used to identify it as well. What I have thought about is creating a document that consists of an array with all the UUIDs from CouchDB. When in node I create a new product I would run an update handler that updates said document with the new unique ID at the end. To obtain the products ID I'd then query the array and client side using indexOf I could get the index as a short ID.
I dont know if this is feasible. From the performance point of view I can say the following: There are more queries that should do numerical ID -> uuid than uuid -> numerical ID. There will be at max 7000 new entries a year in the database. Also there is no use case where a product can be deleted yet I'd like not to rely on that.
Are there any other applicable ways to genereate a shorter and more human readable ID that can be associated with my document?
/EDIT
From a technical point of view: It seems to be working. I can do both conversions number <-> uuid and it seems go well. I dont now if this works well with replication and stuff but as there is said array i guess it should, right?

You have two choices here:
Set your human readable id as _id field. Basically you can just set in create document calls to DB, and it will accept it. This can be a more lightweight solution, but it comes with some limitations:
It has to be unique. You should also be careful about clients trying to create documents, but instead overwrite existing ones.
It can only contain alphanumeric or a few special characters. In my experience it is asking for trouble to have extra character types.
It cannot be longer than a theoretical string length limit(Couchdb doesn't define any, but you should). Long ids will increase your views(indexes) size really bad. And it might make it s lower.
If these things are no problem with you, then you should go with this solution.
As you said yourself, let the _id be a UUID, and set the human readable id to another field. To reach the document by the human readable id, you can just create a view emitting the human readable id as a key, and then either emit the document as value or get the document via include_docs=true option. Whenever the view is reached Couchdb will update the view incrementally and return you the list. This is really same as you creating a document with an array/object of ids inside it. Except with using a couchdb view, you get more performance.
This might be also slightly slower on querying and inserting. If the ids are inserted sequentially, it's fine, if not, CouchDB will slightly take more time to insert it at the right place. These don't work well with huge amounts of insert coming at the DB.
Querying shouldn't be more than 10% of total query time longer than first option. I think 10% is really a big number. It will be most probably less than 5%, I remember in my CouchDB application, I switched from reading by _id to reading from a view by a key and the slow down was very little that from user end point, when making 100 queries at the same time, it wasn't noticeable.
This is how people, query documents by other fields than id, for example querying a user document with email, when the user is logging in.
If you don't know how couchdb views work, you should read the views chapter of couchdb definite guide book.
Also make sure you stay away from documents with huge arrays inside them. I think CouchDB, has a limit of 4GB per document. I remember having many documents and it had really long querying times because the view had to iterate on each array item. In the end for each array item, instead I created one document. It was way faster.

Which is better - auto-generated id or manual id assignment in couchdb documents?

Should I be generating the id of the documents in a CouchDB or should I depend on CouchDB to generate it? What are the advantages or disadvantages in these approaches? Is there any performance implications on any of these options?

There is no difference as far as CouchDB is concerned. Frederick is right that sequential ids are slightly faster. If you query /_uuids?count=10 you will notice that the UUIDs are sequential (by default).
However, even with random IDs, once you run compaction, they will all be in the "right" order internally in the .couch file and at that point there is no difference. So in the long run, I don't usually worry about it.

The main thing is that you should use mostly sequential ids. As this article and this bit of the couchdb book explain, using random ids results in a much less efficient structure internally, both speed wise and in terms of space used on disc.

Self generated ids are almost impossible to deal with if you have two or more separated instances of your app. Because the synchronisation between the different instances is not instantaneous. A solution for this can be to have one server dedicated to generate (or check the availability of) the ids, for example using a SQL database, and acting as a gate for document creation.
On the other hand, if you have only one server and will never need more, there is one advantage I find interesting to self generated uids: since they have to be unique, you can use them in urls. For instance take the slug of the title of a blog post as the _id.
Performance-wise, the CouchDB's generated ids are pretty long so if your own ids are shorter, you will save significant disk space (assuming you have a looot of documents).

Both answers above tell about PROS of sequential IDs.
Here is a major problem arose by sequential IDs.
Predictability of other IDs in documents using a single ID.
Due to this we can't use sequential IDs in application URLs as identifiers due to other IDs being predictable using one ID, and using as url authentication is also not possible.( As done by file sharing services).

Should I implement auto-incrementing in MongoDB?

I'm making the switch to MongoDB from MySQL. A familiar architecture to me for a very basic users table would have auto-incrementing of the uid. See Mongo's own documentation for this use case.
I'm wondering whether this is the best architectural decision. From a UX standpoint, I like having UIDs as external references, for example in shorter URLs: http://example.com/users/12345
Is there a third way? Someone in IRC Freenode's #mongodb suggested creating a range of IDs and caching them. I'm unsure of how to actually implement that, or whether there's another route I can go. I don't necessarily even need the _id itself to be incremented this way. As long as the users all have a unique numerical uid within the document, I would be happy.

I strongly disagree with author of selected answer that No auto-increment id in MongoDB and there are good reasons. We don't know reasons why 10gen didn't encourage usage of auto-incremented IDs. It's speculation. I think 10gen made this choice because it's just easier to ensure uniqueness of 12-byte IDs in clustered environment. It's default solution that fits most newcomers therefore increases product adoption which is good for 10gen's business.
Now let me tell everyone about my experience with ObjectIds in commercial environment.
I'm building social network. We have roughly 6M users and each user has roughly 20 friends.
Now imagine we have a collection which stores relationship between users (who follows who). It looks like this
_id : ObjectId
user_id : ObjectId
followee_id : ObjectId
on which we have unique composite index {user_id, followee_id}. We can estimate size of this index to be 12*2*6M*20 = 2GB. Now that's index for fast look-up of people I follow. For fast look-up of people that follow me I need reverse index. That's another 2GB.
And this is just the beginning. I have to carry these IDs everywhere. We have activity cluster where we store your News Feed. That's every event you or your friends do. Imagine how much space it takes.
And finally one of our engineers made an unconscious decision and decided to store references as strings that represent ObjectId which doubles its size.
What happens if an index does not fit into RAM? Nothing good, says 10gen:
When an index is too large to fit into RAM, MongoDB must read the index from disk, which is a much slower operation than reading from RAM. Keep in mind an index fits into RAM when your server has RAM available for the index combined with the rest of the working set.
That means reads are slow. Lock contention goes up. Writes gets slower as well. Seeing lock contention in 80%-nish is no longer shock to me.
Before you know it you ended up with 460GB cluster which you have to split to shards and which is quite hard to manipulate.
Facebook uses 64-bit long as user id :) There is a reason for that. You can generate sequential IDs
using 10gen's advice.
using mysql as storage of counters (if you concerned about speed take a look at handlersocket)
using ID generating service you built or using something like Snowflake by Twitter.
So here is my general advice to everyone. Please please make your data as small as possible. When you grow it will save you lots of sleepless nights.

Josh,
No auto-increment id in MongoDB and there are good reasons.
I would say go with ObjectIds which are unique in the cluster.
You can add auto increment by a sequence collection and using findAndModify to get the next id to use. This will definitely add complexities to your application and may also affect the ability to shard your database.
As long as you can guarantee that your generated ids will be unique, you will be fine.
But the headache will be there.
You can look at this post for more info about this question in the dedicated google group for MongoDB:
http://groups.google.com/group/mongodb-user/browse_thread/thread/f57b712b2aae6f0b/b4315285e689b9a7?lnk=gst&q=projapati#b4315285e689b9a7
Hope this helps.
Thanks

So, there's a fundamental problem with "auto-increment" IDs. When you have 10 different servers (shards in MongoDB), who picks the next ID?
If you want a single set of auto-incrementing IDs, you have to have a single authority for picking those IDs. In MySQL, this is generally pretty easy as you just have one server accepting writes. But big deployments of MongoDB are running sharding which doesn't have this "central authority".
MongoDB, uses 12-byte ObjectIds so that each server can create new documents uniquely without relying on a single authority.
So here's the big question: "can you afford to have a single authority"?
If so, then you can use findAndModify to keep track of the "last highest ID" and then you can insert with that.
That's the process described in your link. The obvious weakness here is that you technically have to do two writes for each insert. This may not scale very well, you probably want to avoid it on data with a high insertion rate. It may work for users, it probably won't work for tracking clicks.

There is nothing like an auto-increment in MongoDB but you may store your own counters in a dedicated collection and $inc the related value of counter as needed. Since $inc is an atomic operation you won't see duplicates.

The default Mongo ObjectId -- the one used in the _id field -- is incrementing.
Mongo uses a timestamp ( seconds since the Unix epoch) as the first 4-byte portion of its 4-3-2-3 composition, very similar (if not exactly) the same composition as a Version 1 UUID. And that ObjectId is generated at time of insert (if no other type of _id is provided by the user/client)
Thus the ObjectId is ordinal in nature; further, the default sort is based on this incrementing timestamp.
One might consider it an updated version of the auto-incrementing (index++) ids used in many dbms.

Approaches to generate auto-incrementing numeric ids in CouchDB

Since CouchDB does not have support for SQL alike AUTO_INCREMENT what would be your approach to generate sequential unique numeric ids for your documents?
I am using numeric ids for:
User-friendly IDs (e.g. TASK-123, RQ-001, etc.)
Integration with libraries/systems that require numeric primary key
I am aware of the problems with replication, etc. That's why I am interested in how people try to overcome this issue.

As Dominic Barnes says, auto-increment integers are not scalable, not distributed-friendly or cloud-friendly. It seems every app nowadays needs a mobile version with offline support, and that is not directly compatible with auto-increment integers. We all know this, but it's true: auto-increment integers are necessary for legacy code and arguably other stuff.
In both scenarios, you are responsible for producing the auto-incrementing integer. A view is running emit(the_numeric_id, null). (You could also have a "type" namespace, e.g. by emit([doc.type, the_numeric_id], null). Query for the final row (e.g. with a startkey=MAXINT&descending=true&limit=1, increment the value returned, and that is your next id. The attempt to save is in a loop which can retry if there was a collision.
You can also play tricks if you don't need 100% density of the list of IDs. For example, you can add timestamps to the emit() rows, and estimate the document creation velocity, and increment by that velocity times your computation and transmit time. You could also simply increment by a random integer between 1 and N, so most of the time the first insert works, at a cost of non-homogeneous ID numbers.
About where to store the integer, I think there is the id strategy and the try and check strategy.
The id strategy is simpler and quicker in the short term. Document IDs are an integer (perhaps prefixed with a type to add a namespace). Since Couch guarantees uniqueness on the _id field, you just worry about the auto-incrementing. Do this in a loop: 409 Conflict triggers a retry, 201 Accepted means you're done.
I think the major pain with this trick is, that if and when you get conflicts, you have two completely unrelated documents, and one of them must be copied into a fresh document. If there were relationships with other documents, they must all be corrected. (The CouchDB 0.11 emit(key, {_id: some_foreign_doc_id}) trick comes to mind.)
The try and check strategy uses the default UUID as the doc._id, so every insert will succeed. Ideally, all or most of your inter-document relations are based on the immutable UUID _id, not the integer. That is just used for users and UI. The auto-incrementing integer is simply a field in the document, {"int_id":20}. The view of course does emit(doc.int_id, null). (You can look up a document by integer id with a ?key=23?include_docs=true parameter of the view.
Of course, after a replication, you might have id conflicts (not official CouchDB conflicts, but just documents using the same numeric id). The view which emits by ID would also have a reduce phase: simply _count should be enough. Next you must patrol the DB, querying this view with ?group=true and looking for any row (corresponding to an integer id) which has a count > 1. On the plus side, correcting the numeric id of a document is a minor change because it does not require new document creation.
Those are my ideas. Now that I wrote them down, I feel like you must do relation-shepherding regardless of where the id is stored; so perhaps using _id is better after all. The only other downside I see is that you are permanently married to a fundamentally broken naming model—for some definition of "permanently."

Is there any particular reason you want to use numeric IDs over the UUIDs that CouchDB can generate for you? UUIDs are perfect for the distributed paradigm that CouchDB uses, stick with what is built in.
If you find yourself with any more than 1 CouchDB node in your architecture, you're going to get conflicting document IDs if you rely on something like "auto increment" when it comes time for replication. Even if you're only using 1 node now, that's probably not always going to be the case, especially since CouchDB works so well in a distributed and "offline" architecture.

I have had pretty good luck just using an iso formatted date as my key:
http://wiki.apache.org/couchdb/IsoFormattedDateAsDocId
It's pretty simple to do, human-readable and it basically builds in a few querying options by just existing. :-)

Keeping in mind the issues around replication and conflicts, you can use an update function to generate incrementing IDs that are guaranteed unique in a single master setup.
function(doc, req) {
if (!doc) {
doc = {
_id: req.id,
type: 'idGenerator',
count: 0
};
}
doc.count++;
return [doc, toJSON(doc.count)];
}
Include this function in a design document like so:
{
"_id": "_design/application",
"language": "javascript",
"updates": {
"generateId": "function (doc, req) {\n\t\t\tif (!doc) {\n\t\t\t\tdoc = {\n\t\t\t\t\t_id: req.id,\n\t\t\t\t\ttype: 'idGenerator',\n\t\t\t\t\tcount: 0\n\t\t\t\t};\n\t\t\t}\n\n\t\t\tdoc.count++;\n\t\t\t\n\t\t\treturn [doc, toJSON(doc.count)];\n\t\t}"
}
}
Then call it like so:
curl -XPOST http://localhost:5984/mydb/_design/application/_update/generateId/entityId
Replace entityId with whatever you like to create several independent ID sequences.

Not a perfect solution but something that worked for me. Create an independent service that generates auto-incremented ids. Yes, you probably say "this breaks the offline model of couchdb" but what if you get a pool of N ids that you can then use whenever you need to get a new auto-incremented id. Then every time you're online you get some more ids and if you are running out of ids you tell your users - please go online. If the pool is big enough (say the monthly traffic) this shouldn't happen. Again, not perfect but maybe can be helpful to some people.

Instead of explicitly constructing an increasing integer key, you could use the implicit index couchDB accepts for paging.
The skip parameter accepts an integer that will effectively provide the auto-incrementing index you are used to.
http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
The drawback is that it is not a viable solution for "User-friendly IDs". The index is not tied to the doc, and is subject to change if you are rewriting history.
If your only constraint is "integration with libraries/systems that require numeric primary key", this will bridge the gap without loosing the benefits of couchDB's key structure.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string