CosmosDB sequence generator - Azure

I am migrating a legacy application from Oracle to CosmosDB (MongoDB 3.6 API). At the current stage of the project, and with respect to its current goals, refactoring the IDs from sequence numbering to GUIDs cannot be done. I am aware of the many reasons why this is bad design, but it is what it is - I need sequence generation :(
I am trying to find a solution that will be reliable. What came to my mind is to create a sequences collection and keep documents such as seq_foo, seq_bar and so on inside it. Upon each insert I would first do a findAndModify on the relevant sequence document and then insert with the custom id.
The question I am currently struggling to find an answer to: is findAndModify atomic when used on a single document under eventual consistency?
If it is not, do I need to use multiple DB accounts with different consistency levels, or is there another solution to this problem?
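For reference, the pattern described above would look roughly like this in the mongo shell against the MongoDB 3.6 API (the sequences collection, seq_foo and the target collection foo are just the example names from the question):

// Atomically increment the counter document for the "foo" sequence and return the new value.
// upsert: true creates the counter document on first use; new: true returns the post-increment value.
function getNextSequence(name) {
    var doc = db.sequences.findAndModify({
        query: { _id: name },
        update: { $inc: { value: NumberLong(1) } },
        upsert: true,
        new: true
    });
    return doc.value;
}

// Use the generated value as the custom _id of the inserted document.
db.foo.insert({ _id: getNextSequence("seq_foo"), payload: "..." });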

Related

Atomic update of 2 Couchbase documents

Could you please give me a hint: is there any way to atomically update multiple documents in Couchbase using the Java SDK? I know it's possible to embed the documents, which would guarantee the required atomicity, but unfortunately that doesn't work out for me.
In my case, updating one document means that another document needs to be invalidated (a special flag set to false), and this should be performed atomically.
I appreciate any help or suggestions from your side.
Thank you!
While there is no built-in way to perform atomic changes to multiple documents, you can use two-phase commit (2PC) to achieve the same result. Note that in this case 2PC doesn't provide the other transactional features, like isolation and consistency, only atomicity - which is what you're asking for. There is no reference 2PC implementation in Java for Couchbase, but there are Ruby and PHP implementations in the documentation. I recommend reading the docs on providing transactional logic in Couchbase for an in-depth description of how to implement this. Porting the example code to Java should be fairly straightforward.
Generally speaking, to implement a set of changes on multiple documents atomically, you perform atomic writes to each document in turn, plus a temporary "state" document, in such a way that each step in the process is unique. This way you're able to continue from the same step or roll back your changes if the transaction gets interrupted in the middle for any reason.
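As a rough illustration only (not the reference implementation from the Couchbase docs), the flow can be sketched like this; kvInsert/kvGet/kvReplace are hypothetical helpers standing in for the SDK's CAS-aware insert/get/replace operations, backed here by an in-memory map so the sketch is self-contained:

// Hypothetical CAS-aware key-value helpers (placeholders for SDK calls).
const bucket = new Map();                               // id -> { value, cas }

function kvInsert(id, value) { bucket.set(id, { value: value, cas: 1 }); }
function kvGet(id) { return bucket.get(id); }
function kvReplace(id, value, cas) {
    const cur = bucket.get(id);
    if (cas !== undefined && cur.cas !== cas) throw new Error('CAS mismatch');
    bucket.set(id, { value: value, cas: cur.cas + 1 });
}

// Update two documents with crash recovery via a temporary "state" document.
function updatePair(idA, idB, mutateA, mutateB) {
    const txnId = 'txn::' + Date.now();
    kvInsert(txnId, { state: 'pending', docs: [idA, idB] });      // record intent first

    for (const [id, mutate] of [[idA, mutateA], [idB, mutateB]]) {
        const rec = kvGet(id);
        const updated = mutate(rec.value);
        updated.pendingTxn = txnId;               // tag the doc so an interrupted txn is detectable
        kvReplace(id, updated, rec.cas);          // fails if the doc changed in the meantime
    }

    kvReplace(txnId, { state: 'committed', docs: [idA, idB] });   // point of no return
    for (const id of [idA, idB]) {
        const rec = kvGet(id);
        delete rec.value.pendingTxn;              // clean up the markers
        kvReplace(id, rec.value, rec.cas);
    }
    kvReplace(txnId, { state: 'done', docs: [idA, idB] });
}

A separate recovery job would scan for stale "pending" or "committed" state documents and roll the interrupted transaction back or forward, which is what gives you atomicity (but not isolation).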
Unfortunately, that boils down to a transaction, and Couchbase doesn't offer native transaction support. The level of atomicity you can achieve with Couchbase is at the level of a single document.
I know that some Couchbase users have implemented manual transaction management in their application code, but that's quite a complicated topic and there's no openly available solution that I know of.

RethinkDB Transaction to multiple documents/tables

I need to update 2 tables in one transaction.
In the current version RethinkDB doesn't support transactions out of the box. So, how can I achieve this?
I can perform the update in 2 ways:
Update the 1st table. If it succeeds -> update the second table.
Update both tables asynchronously.
But how can I resolve the case where 1 of the 2 updates completed successfully but the other did not? Yes, I can check the result of an update and revert it if an error occurred. But there can still be a case where something happens to the application (a lost connection to RethinkDB, or simply a crash of the script) after one of the two updates has completed.
So my database will be in an inconsistent state, with no way to resolve this.
So, is it possible to simulate transaction behavior in Node.js for RethinkDB?
The best you can do is two-phase commit. (MongoDB has a good document on how to do this, and the exact same technique should work in RethinkDB: http://docs.mongodb.org/master/tutorial/perform-two-phase-commits/ .)
RethinkDB supports per-key linearizability and compare-and-set (document-level atomicity), and that is known to be enough to implement application-level transactions. Moreover, you have several options to choose from:
If you need the Serializable isolation level, you can follow the same algorithm that Google uses for the Percolator system or that Cockroach Labs uses for CockroachDB. I've blogged about it and created a step-by-step visualization; I hope it will help you understand the main idea behind the algorithm.
If you expect high contention but the Read Committed isolation level is fine for you, then please take a look at RAMP transactions by Peter Bailis.
The third approach is to use compensating transactions, also known as the saga pattern. It was described in the late 80s in the Sagas paper but became more relevant with the rise of distributed systems. Please see the Applying the Saga Pattern talk for inspiration; a rough sketch of the idea follows below.
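For the compensating-transaction idea, a rough sketch with the rethinkdb Node.js driver might look like the following (the accounts table, the balance field and the amount are made up for illustration):

// Compensating-transaction (saga) sketch: apply step 1, and if step 2 fails, undo step 1.
const r = require('rethinkdb');

async function transfer(conn, fromId, toId, amount) {
    // Step 1: debit the source document.
    await r.table('accounts').get(fromId)
        .update({ balance: r.row('balance').sub(amount) }).run(conn);
    try {
        // Step 2: credit the destination document.
        await r.table('accounts').get(toId)
            .update({ balance: r.row('balance').add(amount) }).run(conn);
    } catch (err) {
        // Compensation: revert step 1 so the database is consistent again.
        await r.table('accounts').get(fromId)
            .update({ balance: r.row('balance').add(amount) }).run(conn);
        throw err;
    }
}

Keep in mind that the compensation itself can be lost if the process crashes between the failed step and the compensating update, so a durable journal (as in the two-phase-commit approach above) is still needed for full recovery.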
We had a similar requirement to implement transactional support in RethinkDB, as we wanted transactions extending across MySQL and RethinkDB boundaries. We came up with this micro-library, thinktrans https://github.com/jaladankisuresh/thinktrans, which is a promise-based declarative JavaScript library for RethinkDB supporting atomic transactions. However, it is still in its alpha stages.
If you have a specific requirement, you may want to understand its approach (Implementing Transactions in NoSQL Databases) and implement your own.
Disclaimer: I am the author of this library

Azure DB Sequential numbering

I'm having a few issues with designing a database in Azure at the moment, down to the following:
SQL 2012 auto-increment keys can/will jump by 1000 fairly regularly, due to a new "feature" of SQL 2012, as documented here (link). This has been closed as "By Design" on Connect (link)
The recommendation is to use either a startup flag to avoid this behaviour (not possible with Azure), or to use Sequences to generate incrementing numbers instead.
However, SEQUENCE is unsupported in Azure DB. It in fact alternates between being an active issue and a "won't fix" issue on Connect (link)
So, my question is how to actually go about having a field auto-increment by 1 on insert to an Azure DB table, whilst avoiding large gaps.
I did think about using a trigger instead, and using it to find the existing max value, but that didn't seem clean and I thought it would cause concurrency issues.
I'm happy to have a surrogate key here, but without Sequences I am wondering what the recommended route would be to actually generate the value for the surrogate at insert time.
Any advice appreciated.
Edit: Please note I am using the older "Web/Business" type of DB rather than the new tiers; I don't know if that will make a difference to any answers.
The reason your feedback has been marked as it was is that auto-increment by 1 becomes a very hard problem to solve at the database layer at scale. If you "shard" your database (split it to cope with load), how would each shard know which number to increment to? This should be extracted from the database and built into your application logic.
Without knowing the background requirement for having an incremental number, it's hard to advise on the right solution. Is it for uniqueness or for a sequence? If it's for a sequence, does it really matter that it doesn't increase by 1 each time, as long as the number is (a) unique and (b) increases on each insert? Could you make the setting of this sequence an offline process? That is - use the insert date/time via a batch process to assign a sequence number?
If it's for uniqueness, use a GUID or similar.

Which is better - auto-generated id or manual id assignment in CouchDB documents?

Should I be generating the id of the documents in CouchDB or should I depend on CouchDB to generate it? What are the advantages or disadvantages of these approaches? Are there any performance implications for either of these options?
There is no difference as far as CouchDB is concerned. Frederick is right that sequential ids are slightly faster. If you query /_uuids?count=10 you will notice that the UUIDs are sequential (by default).
However, even with random IDs, once you run compaction, they will all be in the "right" order internally in the .couch file and at that point there is no difference. So in the long run, I don't usually worry about it.
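You can see this for yourself against a local CouchDB (default port assumed), e.g. from Node 18+ or a browser console:

// Fetch ten server-generated UUIDs; with the default "sequential" algorithm they share
// a common prefix and a monotonically increasing suffix.
fetch('http://localhost:5984/_uuids?count=10')
    .then(res => res.json())
    .then(body => console.log(body.uuids));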
The main thing is that you should use mostly sequential ids. As this article and this bit of the CouchDB book explain, using random ids results in a much less efficient structure internally, both speed-wise and in terms of space used on disk.
Self-generated ids are almost impossible to deal with if you have two or more separate instances of your app, because synchronisation between the different instances is not instantaneous. A solution for this can be to have one server dedicated to generating (or checking the availability of) the ids, for example using a SQL database, acting as a gate for document creation.
On the other hand, if you have only one server and will never need more, there is one advantage of self-generated ids that I find interesting: since they have to be unique, you can use them in URLs. For instance, take the slug of a blog post's title as the _id.
Performance-wise, CouchDB's generated ids are pretty long, so if your own ids are shorter you will save significant disk space (assuming you have a looot of documents).
Both answers above describe the pros of sequential IDs.
Here is a major problem caused by sequential IDs:
Predictability of other IDs given a single ID.
Because of this, we can't use sequential IDs as identifiers in application URLs, since the other IDs become predictable from one ID, and so using the ID as a form of URL authentication is also not possible (as done by file-sharing services).

Should I implement auto-incrementing in MongoDB?

I'm making the switch to MongoDB from MySQL. A familiar architecture to me for a very basic users table would have an auto-incrementing uid. See Mongo's own documentation for this use case.
I'm wondering whether this is the best architectural decision. From a UX standpoint, I like having UIDs as external references, for example in shorter URLs: http://example.com/users/12345
Is there a third way? Someone in Freenode's #mongodb IRC channel suggested creating a range of IDs and caching them. I'm unsure how to actually implement that, or whether there's another route I can take. I don't necessarily need the _id itself to be incremented this way; as long as all the users have a unique numerical uid within the document, I'd be happy.
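For what it's worth, the "range of IDs" idea is essentially a hi-lo allocator: reserve a whole block with one atomic increment, then hand ids out locally. A minimal mongo-shell sketch, with example collection and field names:

// Reserve BLOCK_SIZE ids at a time from a counter document, then serve them from memory.
var BLOCK_SIZE = 100;
var nextId = 1, blockEnd = 0;

function getUid() {
    if (nextId > blockEnd) {
        // Reserve the next block; findAndModify with $inc is atomic on the counter document.
        var doc = db.counters.findAndModify({
            query: { _id: "users" },
            update: { $inc: { seq: BLOCK_SIZE } },
            upsert: true,
            new: true
        });
        blockEnd = doc.seq;                    // highest id in the reserved block
        nextId = blockEnd - BLOCK_SIZE + 1;    // first id in the reserved block
    }
    return nextId++;
}

db.users.insert({ uid: getUid(), name: "example" });

The trade-off is that any ids left unused in a reserved block are lost when the process restarts, so you get gaps; if gaps matter, you are back to one counter update per insert.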
I strongly disagree with the author of the selected answer, who says there is no auto-increment id in MongoDB and that there are good reasons for that. We don't know the reasons why 10gen didn't encourage the use of auto-incremented IDs; that's speculation. I think 10gen made this choice because it's simply easier to ensure the uniqueness of 12-byte IDs in a clustered environment. It's a default solution that fits most newcomers and therefore increases product adoption, which is good for 10gen's business.
Now let me tell everyone about my experience with ObjectIds in a commercial environment.
I'm building a social network. We have roughly 6M users and each user has roughly 20 friends.
Now imagine we have a collection that stores the relationships between users (who follows whom). It looks like this
_id : ObjectId
user_id : ObjectId
followee_id : ObjectId
on which we have a unique composite index {user_id, followee_id}. We can estimate the size of this index to be 12 * 2 * 6M * 20 ≈ 2.9GB. That's the index for fast look-up of the people I follow. For fast look-up of the people who follow me I need the reverse index. That's another ~2.9GB.
And this is just the beginning. I have to carry these IDs everywhere. We have an activity cluster where we store your news feed - every event you or your friends do. Imagine how much space that takes.
And finally, one of our engineers unthinkingly decided to store references as strings representing the ObjectId, which doubles their size.
What happens if an index does not fit into RAM? Nothing good, says 10gen:
When an index is too large to fit into RAM, MongoDB must read the index from disk, which is a much slower operation than reading from RAM. Keep in mind an index fits into RAM when your server has RAM available for the index combined with the rest of the working set.
That means reads are slow. Lock contention goes up. Writes get slower as well. Seeing lock contention around 80% is no longer a shock to me.
Before you know it, you end up with a 460GB cluster that you have to split into shards and that is quite hard to manipulate.
Facebook uses a 64-bit long as the user id :) There is a reason for that. You can generate sequential IDs:
using 10gen's advice.
using MySQL as the storage for counters (if you are concerned about speed, take a look at HandlerSocket).
using an ID-generating service you built, or something like Snowflake by Twitter (a toy sketch follows below).
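As a toy illustration of that last option, a Snowflake-style 64-bit generator (41-bit millisecond timestamp, 10-bit worker id, 12-bit sequence) can be sketched in Node.js like this; the custom epoch and worker id are made-up values and clock rollback is not handled:

// Toy Snowflake-style id: 41-bit ms timestamp since a custom epoch, 10-bit worker id,
// 12-bit per-millisecond sequence. Values below are illustrative only.
const CUSTOM_EPOCH = 1577836800000n;   // 2020-01-01T00:00:00Z
const WORKER_ID = 1n;                  // must be unique per generator instance (0-1023)

let lastTs = -1n;
let sequence = 0n;

function nextId() {
    let ts = BigInt(Date.now()) - CUSTOM_EPOCH;
    if (ts === lastTs) {
        sequence = (sequence + 1n) & 0xFFFn;                  // 12-bit sequence within the same ms
        if (sequence === 0n) {                                // sequence exhausted: wait for the next ms
            while (BigInt(Date.now()) - CUSTOM_EPOCH === ts) { /* busy-wait */ }
            ts = BigInt(Date.now()) - CUSTOM_EPOCH;
        }
    } else {
        sequence = 0n;
    }
    lastTs = ts;
    return (ts << 22n) | (WORKER_ID << 12n) | sequence;       // 41 + 10 + 12 = 63 bits used
}

console.log(nextId().toString());   // roughly time-ordered and unique across workers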
So here is my general advice to everyone: please, please make your data as small as possible. When you grow, it will save you lots of sleepless nights.
Josh,
There is no auto-increment id in MongoDB, and there are good reasons for that.
I would say go with ObjectIds, which are unique across the cluster.
You can add auto-increment with a sequence collection, using findAndModify to get the next id to use. This will definitely add complexity to your application and may also affect your ability to shard your database.
As long as you can guarantee that your generated ids will be unique, you will be fine.
But the headache will be there.
You can look at this post for more info about this question in the dedicated Google Group for MongoDB:
http://groups.google.com/group/mongodb-user/browse_thread/thread/f57b712b2aae6f0b/b4315285e689b9a7?lnk=gst&q=projapati#b4315285e689b9a7
Hope this helps.
Thanks
So, there's a fundamental problem with "auto-increment" IDs. When you have 10 different servers (shards in MongoDB), who picks the next ID?
If you want a single set of auto-incrementing IDs, you have to have a single authority for picking those IDs. In MySQL, this is generally pretty easy as you just have one server accepting writes. But big deployments of MongoDB run sharding, which doesn't have this "central authority".
MongoDB uses 12-byte ObjectIds so that each server can create new documents uniquely without relying on a single authority.
So here's the big question: "can you afford to have a single authority"?
If so, then you can use findAndModify to keep track of the "last highest ID" and then you can insert with that.
That's the process described in your link. The obvious weakness here is that you technically have to do two writes for each insert. This may not scale very well, so you probably want to avoid it on data with a high insertion rate. It may work for users; it probably won't work for tracking clicks.
There is nothing like auto-increment in MongoDB, but you can store your own counters in a dedicated collection and $inc the related counter value as needed. Since $inc is an atomic operation, you won't see duplicates.
The default Mongo ObjectId -- the one used in the _id field -- is incrementing.
Mongo uses a timestamp (seconds since the Unix epoch) as the first 4-byte portion of its 4-3-2-3 composition, very similar to (if not exactly the same as) the composition of a Version 1 UUID. And that ObjectId is generated at insert time (if no other type of _id is provided by the user/client).
Thus the ObjectId is ordinal in nature; further, the default sort is based on this incrementing timestamp.
One might consider it an updated version of the auto-incrementing (index++) ids used in many DBMSs.
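You can see this in the mongo shell (getTimestamp() is a standard ObjectId helper; events is just an example collection):

// Every ObjectId embeds its creation time in the leading 4 bytes.
var id = ObjectId();
print(id.getTimestamp());        // ISODate of the moment the id was generated

// Because _ids are roughly time-ordered, sorting by _id approximates insertion order.
db.events.find().sort({ _id: 1 });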
