Could you please give me a hint is there any way to atomically update multiple documents on the Couchbase using Java SDK? I know, it's possible to use Embedding of documents, thus guaranteeing required, but, unfortunately, it doesn't work out for me.
In my case, the fact of document update leads to the fact that it's needed to invalidate (set special flag to false) other document, and it should be performed atomically.
I appreciate any help or suggestions from your side.
Thank you!
While there is no built-in way to perform atomic changes to multiple documents, you can use two-phase commit to achieve the same result. Note that in this case 2PC doesn't provide other transactional features, like isolation and consistency, only atomicity - which is what you're asking for. There is no reference 2PC implementation in Java for Couchbase, but there are two in Ruby and PHP in the documentation. I recommend reading the docs on providing transactional logic in Couchbase for an in-depth description of how to implement this. Porting the example code to Java should be fairly straightforward.
Generally speaking, to implement a set of changes on multiple documents atomically, you perform atomic writes to each document in turn, plus a temporary "state" document, in such a way that each step in the process is unique. This way you're able to continue from the same step or roll back your changes if the transaction gets interrupted in the middle for any reason.
Unfortunately, that boils down to a transaction, and Couchbase doesn't offer native transaction support. The level of atomicity you can achieve with Couchbase is at the level of a single document.
I know that some couchbase users have implemented manual transaction management in their applicative code layer, but that's quite a complicated topic and there's no openly available solution that I know of.
Related
I am migrating legacy application from Oracle to CosmosDB (MongoDB 3.6 API). At current stage of the project and also with respect to current goals refactoring ID from sequence numbering to GUIDs cannot be done. I am aware of many reasons why this is bad design but it is what it is - I need sequence generation :(
I am trying to find a solution that will be reliable. What came to my mind is to create a collection sequences and inside of it keep documents such is seq_foo and seq_bar and so on. Upon each insert I would first do findAndModify and then insert with custom id.
What is the question I currently struggle finding answer: is findAndModify atomic if used on single document with eventual consistency?
In case it is not do I need to use multiple DB Accounts with different consistency levels or is there another solution for this problem?
I need to update 2 tables in one transaction.
In current version RethinkDB doesn't support transactions from the box. So, how can I achieve this?
I can make update in 2 ways:
Update 1st table. If success -> update second table.
Update 2nd tables async.
But how can I resolve case, when 1 of 2 updates was completed well, but another no? Yes, I can check result of update and revert update if error occured. But anyway, there can be case, when something happens with application (lost connection to Rethink, or just crash of script), but one of two updates was completed.
So, my data base will be in inconsistent state. And no way to resolve this.
So, is it possible to simulate transaction behavior in nodejs for RethinkDB?
The best you can do is two-phase commit. (MongoDB has a good document on how to do this, and the exact same technique should work in RethinkDB: http://docs.mongodb.org/master/tutorial/perform-two-phase-commits/ .)
RethinkDB supports per key linearizability and compare-and-set (document level atomicity) and it's known to be enough to implement application level transactions, more over you have several options to choose from:
If you need Serializable isolation level then you can follow the same algorithm which Google use for the Percolator system or Cockroach Labs for CockroachDB. I've blogged about it and create a step-by-step visualization, I hope it will help you to understand the main idea behind the algorithm.
If you expect high contention but it's fine for you to have Read Committed isolation level then please take a look on the RAMP transactions by Peter Bailis.
The third approach is to use compensating transactions also known as the saga pattern. It was described in the late 80s in the Sagas paper but became more actual with the raise of distributed systems. Please see the Applying the Saga Pattern talk for inspiration.
We had a similar requirement to implement transactional support in ReThinkDB, as we wanted to have transactions extending across MySQL and ReThinkDB DB boundaries. We had come-up with this micro library thinktrans https://github.com/jaladankisuresh/thinktrans, which is a promised based declarative javascript library for RethinkDB supporting Atomic transactions. However, It is still in its alpha stages
If you have a specific requirement and you may want to understand its approach Implementing Transactions in NoSQL Databases and implement your own.
Disclaimer: I am the author of this library
Is it possible to "scratch" a couchdb document? By that I mean to delete a document, and make sure that the document and its history is completely removed from the database.
I do not want to perform a database compaction, I just want to fully wipe out a single document. And I am looking for a solution that guarantees that there is no trace of the document in the database, without needing to wait for internal database processes to eventually remove the document.
(a python solution is appreciated)
When you delete a document in CouchDB, generally only the _id, _rev, and a deleted flag are preserved. These are preserved to allow for eventual consistency through replication. Forcing an immediate delete across an entire group of nodes isn't really consistent with the architecture.
The closest thing would be to use purge; once you do that, all traces of the doc will be gone after the next compaction. I realize this isn't exactly what you're asking for, but it's the closest thing off the top of my head.
Here's a nice article explaining the basis behind the various delete methods available.
Deleting anything from file system for sure is difficult, and usually quite expensive problem. Even more with databases in general. Depending of what for sure means to you, you may end up with custom db, custom os and custom hw. So it is kind of saying I want fault tolerant system, yes everyone would like to have one, but only few can afford it, but good news is that most can settle for less. Similar is for deleteing for sure, I assume you are trying to adress some security or privacy issue, so try to see if there is some other way to get what you need. Perhaps encrypting the document or sensitive parts of it.
I need to replicate in CouchDB data from one database to another but in the process I want to alter the documents being replicated over,
mostly stripping out particular fields (but other applications mentioned in comments).
The replication would always be 100% one way (but other applications mentioned in comments could use bi-directional and sync)
I would prefer if this process did not increment their revision ID but that might be asking for too much.
But I don't see any of the design document functions that do what I am trying to do.
As it seems doesn't do this, what plans are there for adding this? And meanwhile, what workarounds are there?
No, there is no out-of-the-box solution, as this would defy the whole purpose and logic of multi-master, MVCC logic.
The only option I can see here is to create your own solution, but I would not call this a replication, but rather ETL (Extract, Transform, Load). And for ETL there are tools available that will let you do the trick, like (mixing open source and commercial here):
Scriptella
CloverETL
Pentaho Data Integration, or to be more specific Kettle
Jespersoft ETL
Talend have some tools as well
There is plenty more of ETL tools on the market.
I believe the best approach here would be to break out the fields you want to filter out into a separate document and then filter out the document during replication.
Of course the best way would be to have built-support for this, but a workaround which occurs to me would be, instead of here using the built-in replication, to code and use a custom replication which will do the additional needed alterations/transformations, still using rather than going beneith, the other built-ins, and with good coding, in many situations (especially if each master can push to its slaves), it feels this could be nearly as efficient.
This requires efficient triggers be put on each source/master to detect any changes, which I believe CouchDB does offer (or at least PouchDB appears to), which would then copy the changes to another location also doing the full alterations.
If the source of the change is unable to push the change to the final destination, this fixed store may to be local to it where the destination can pull from -- which could get pretty expensive especially in multi-master, as each location has to not only store & maintain its own data but also the data (being sent) of everyone it sends to.
This replicate would also place each source document's revision ID in the the document's copy...
...that is ideally, including essential if the copy was to be {updated, aka a master}, too.
...in form of either:
ideally the normal "_rev" property. Indeed this looks quite possible per it ("preserve their revisions ID") already done by the normal replication algorithm using the builtin "Bulk Docs API" which seemingly our varient would use, too
otherwise have a new copy object (with its own _rev) plus another field as "_rev_original" ntelling the original rev. But well that would work?
Clearly such copy could be created no problem.
Probably no big if the destination is just reading the data.
Seems hairy if the destination is also writing the data. As we'd now have to merge with these non-standard revisions. But doable.
Relevant to this (coding an a custom/improved replication (to do this apparently-missing functionality) ideally without altering Pouch and especially Couch source code), as starter/basis material (the standard method), here's the normal Couch replication algorithm which unfortunately doens't clearly say it only uses builtin ops but it looks like it, and also the official overview of what it does; I'm suspecting Pouch implements this, likely in Pouch's replicate.js (latest release as of 2014.07).
Futher implementation particulars? - those who would know, please put it here.
This is a "community wiki" answer so please extend it.
Also please comment links & details of anyone/system already doing or trying to do this or similar.
According to the CAP theory, Cassandra can only have eventually consistency. To make things worse, if we have multiple reads and writes during one request without proper handling, we may even lose the logical consistency. In other words, if we do things fast, we may do it wrong.
Meanwhile the best practice to design the data model for Cassandra is to think about the queries we are going to have, and then add a CF to it. In this way, to add/update one entity means to update many views/CFs in many cases. Without atomic transaction feature, it's hard to do it right. But with it, we lose the A and P parts again.
I don't see this concerns many people, hence I wonder why.
Is this because we can always find a way to design our data model to avoid to do multiple reads and writes in one session?
Is this because we can just ignore the 'right' part?
In real practice, do we always have ACID feature somewhere in the middle? I mean maybe implement in application layer or add a middleware to handle it?
It does concern people, but presumably you are using cassandra because a single database server is unable to meet your needs due to scaling or reliability concerns. Because of this, you are forced to work around the limitations of a distributed system.
In real practice, do we always have ACID feature somewhere in the
middle? I mean maybe implement in application layer or add a
middleware to handle it?
No, you don't usually have acid somewhere else, as presumably that somewhere else must be distributed over multiple machines as well. Instead, you design your application around the limitations of a distributed system.
If you are updating multiple columns to satisfy queries, you can look at the eventually atomic section in this presentation for ideas on how to do that. Basically you write enough info about your update to cassandra before you do your write. That way if the write fails, you can retry it later.
If you can structure your application in such a way, using a co-ordination service like Zookeeper or cages may be useful.