Atomic transactions in key-value stores - CouchDB

Please excuse any mistakes in terminology. In particular, I am using relational database terms.
There are a number of persistent key-value stores, including CouchDB and Cassandra, along with plenty of other projects.
A typical argument against them is that they do not generally permit atomic transactions across multiple rows or tables. I wonder if there's a general approach that would solve this issue.
Take for example the situation of a set of bank accounts. How do we move money from one bank account to another? If each bank account is a row, we want to update two rows as part of the same transaction, reducing the value in one and increasing the value in another.
One obvious approach is to have a separate table which describes transactions. Then, moving money from one bank account to another consists of simply inserting a new row into this table. We do not store the current balances of either of the two bank accounts and instead rely on summing up all the appropriate rows in the transactions table. It is easy to imagine that this would be far too much work, however; a bank may have millions of transactions a day and an individual bank account may quickly have several thousand 'transactions' associated with it.
A number (all?) of key-value stores will 'roll back' an action if the underlying data has changed since you last grabbed it. Perhaps this could be used to simulate atomic transactions, since you could indicate that a particular field is locked. There are some obvious issues with this approach.
Any other ideas? It is entirely possible that my approach is simply incorrect and I have not yet wrapped my brain around the new way of thinking.

If, taking your example, you want to atomically update the value in a single document (row in relational terminology), you can do so in CouchDB. You will get a conflict error when you try to commit the change if another contending client has updated the same document since you read it. You will then have to read the new value, update it and retry the commit. There is an indeterminate (possibly unbounded, if there is a lot of contention) number of times you may have to repeat this process, but you are guaranteed to have a document in the database with an atomically updated balance if your commit ever succeeds.
If you need to update two balances (i.e. a transfer from one account to another), then you need to use a separate transaction document (effectively another table where rows are transactions) that stores the amount and the two accounts (in and out). This is a common bookkeeping practice, by the way. Since CouchDB computes views only as needed, it is actually still very efficient to compute the current amount in an account from the transactions that list that account. In CouchDB, you would use a map function that emits the account number as the key and the transaction amount as the value (positive for incoming, negative for outgoing). Your reduce function would simply sum the values for each key, emitting the same key and the total sum. You could then query the view with group=true to get the account balances, keyed by account number.
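For illustration, here is a minimal sketch of how such a view might be queried from Python over CouchDB's plain HTTP API (the database name "bank" and the design document/view names are made up for this example):

# Sketch: querying a balances view over CouchDB's HTTP API.
# The database "bank", design doc "txns" and view "balances" are hypothetical.
import requests

COUCH = "http://localhost:5984"

def account_balances():
    # group=true runs the reduce per distinct key (account number),
    # so each row comes back as {"key": <account>, "value": <balance>}.
    resp = requests.get(COUCH + "/bank/_design/txns/_view/balances",
                        params={"group": "true"})
    resp.raise_for_status()
    return {row["key"]: row["value"] for row in resp.json()["rows"]}

def balance(account):
    # Restrict the view to a single account; view keys are JSON-encoded.
    resp = requests.get(COUCH + "/bank/_design/txns/_view/balances",
                        params={"group": "true", "key": '"%s"' % account})
    resp.raise_for_status()
    rows = resp.json()["rows"]
    return rows[0]["value"] if rows else 0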

CouchDB isn't suitable for transactional systems because it doesn't support locking and atomic operations.
In order to complete a bank transfer you must do a few things:
Validate the transaction, ensuring there are sufficient funds in the source account, that both accounts are open, not locked, and in good standing, and so on
Decrease the balance of the source account
Increase the balance of the destination account
If the balance or status of either account changes in between any of these steps, the transaction could become invalid after it is submitted, which is a big problem in a system of this kind.
Even if you use the approach suggested above, where you insert a "transfer" record and use a map/reduce view to calculate the final account balance, you have no way of ensuring that you don't overdraw the source account, because there is still a race condition between checking the source account balance and inserting the transaction: two transactions could simultaneously be added after checking the balance.
So ... it's the wrong tool for the job. CouchDB is probably good at a lot of things, but this is something that it really cannot do.
EDIT: It's probably worth noting that actual banks in the real world use eventual consistency. If you overdraw your bank account for long enough you get an overdraft fee. If you were very good you might even be able to withdraw money from two different ATMs at almost the same time and overdraw your account, because there is a race condition between checking the balance, issuing the money, and recording the transaction. When you deposit a check into your account they bump the balance, but actually hold those funds for a period of time "just in case" the source account doesn't really have enough money.

To provide a concrete example (because there is a surprising lack of correct examples online): here's how to implement an "atomic bank balance transfer" in CouchDB (largely copied from my blog post on the same subject: http://blog.codekills.net/2014/03/13/atomic-bank-balance-transfer-with-couchdb/)
First, a brief recap of the problem: how can a banking system which allows
money to be transferred between accounts be designed so that there are no race
conditions which might leave invalid or nonsensical balances?
There are a few parts to this problem:
First: the transaction log. Instead of storing an account's balance in a single
record or document — {"account": "Dave", "balance": 100} — the account's
balance is calculated by summing up all the credits and debits to that account.
These credits and debits are stored in a transaction log, which might look
something like this:
{"from": "Dave", "to": "Alex", "amount": 50}
{"from": "Alex", "to": "Jane", "amount": 25}
And the CouchDB map-reduce functions to calculate the balance could look
something like this:
POST /transactions/balances
{
    "map": function(txn) {
        emit(txn.from, txn.amount * -1);
        emit(txn.to, txn.amount);
    },
    "reduce": function(keys, values) {
        return sum(values);
    }
}
For completeness, here is the list of balances:
GET /transactions/balances
{
    "rows": [
        {"key": "Alex", "value": 25},
        {"key": "Dave", "value": -50},
        {"key": "Jane", "value": 25}
    ],
    ...
}
But this leaves the obvious question: how are errors handled? What happens if
someone tries to make a transfer larger than their balance?
With CouchDB (and similar databases) this sort of business logic and error
handling must be implemented at the application level. Naively, such a function
might look like this:
def transfer(from_acct, to_acct, amount):
    txn_id = db.post("transactions", {"from": from_acct, "to": to_acct, "amount": amount})
    if db.get("transactions/balances", id=from_acct) < 0:
        db.delete("transactions/" + txn_id)
        raise InsufficientFunds()
But notice that if the application crashes between inserting the transaction
and checking the updated balances the database will be left in an inconsistent
state: the sender may be left with a negative balance, and the recipient with
money that didn't previously exist:
// Initial balances: Alex: 25, Jane: 25
db.post("transactions", {"from": "Alex", "To": "Jane", "amount": 50}
// Current balances: Alex: -25, Jane: 75
How can this be fixed?
To make sure the system is never in an inconsistent state, two pieces of
information need to be added to each transaction:
The time the transaction was created (to ensure that there is a strict
total ordering of transactions), and
A status — whether or not the transaction was successful.
There will also need to be two views — one which returns an account's available
balance (ie, the sum of all the "successful" transactions), and another which
returns the oldest "pending" transaction:
POST /transactions/balance-available
{
    "map": function(txn) {
        if (txn.status == "successful") {
            emit(txn.from, txn.amount * -1);
            emit(txn.to, txn.amount);
        }
    },
    "reduce": function(keys, values) {
        return sum(values);
    }
}
POST /transactions/oldest-pending
{
    "map": function(txn) {
        if (txn.status == "pending") {
            emit(txn._id, txn);
        }
    },
    "reduce": function(keys, values) {
        var oldest = values[0];
        values.forEach(function(txn) {
            if (txn.timestamp < oldest.timestamp) {
                oldest = txn;
            }
        });
        return oldest;
    }
}
The list of transfers might now look something like this:
{"from": "Alex", "to": "Dave", "amount": 100, "timestamp": 50, "status": "successful"}
{"from": "Dave", "to": "Jane", "amount": 200, "timestamp": 60, "status": "pending"}
Next, the application will need to have a function which can resolve
transactions by checking each pending transaction in order to verify that it is
valid, then updating its status from "pending" to either "successful" or
"rejected":
def resolve_transactions(target_timestamp):
    """ Resolves all transactions up to and including the transaction
    with timestamp `target_timestamp`. """
    while True:
        # Get the oldest transaction which is still pending
        txn = db.get("transactions/oldest-pending")
        if txn["timestamp"] > target_timestamp:
            # Stop once all of the transactions up until the one we're
            # interested in have been resolved.
            break
        # Then check to see if that transaction is valid
        if db.get("transactions/balance-available", id=txn["from"]) >= txn["amount"]:
            status = "successful"
        else:
            status = "rejected"
        # Then update the status of that transaction. Note that CouchDB
        # will check the "_rev" field, only performing the update if the
        # transaction hasn't already been updated.
        txn["status"] = status
        db.put(txn)
Finally, the application code for correctly performing a transfer:
def transfer(from_acct, to_acct, amount):
    timestamp = time.time()
    txn = db.post("transactions", {
        "from": from_acct,
        "to": to_acct,
        "amount": amount,
        "status": "pending",
        "timestamp": timestamp,
    })
    resolve_transactions(timestamp)
    txn = db.get("transactions/" + txn["_id"])
    if txn["status"] == "rejected":
        raise InsufficientFunds()
A couple of notes:
For the sake of brevity, this specific implementation assumes some amount of
atomicity in CouchDB's map-reduce. Updating the code so it does not rely on
that assumption is left as an exercise to the reader.
Master/master replication or CouchDB's document sync have not been taken into
consideration. Master/master replication and sync make this problem
significantly more difficult.
In a real system, using time() might result in collisions, so using
something with a bit more entropy might be a good idea; maybe "%s-%s"
%(time(), uuid()), or using the document's _id in the ordering.
Including the time is not strictly necessary, but it helps maintain a logical
ordering if multiple requests come in at about the same time.

BerkeleyDB and LMDB are both key-value stores with support for ACID transactions. In BDB transactions are optional, while LMDB only operates transactionally.
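As a rough sketch of what an atomic two-account transfer could look like with the Python lmdb bindings (the database path and key layout here are arbitrary):

# Sketch: an atomic two-account transfer in LMDB via the Python "lmdb"
# package. Everything inside one write transaction commits as a whole,
# or is rolled back if an exception escapes the "with" block.
import lmdb

env = lmdb.open("/tmp/bank.lmdb", map_size=10 * 1024 * 1024)

def transfer(from_acct: bytes, to_acct: bytes, amount: int):
    with env.begin(write=True) as txn:
        src = int(txn.get(from_acct, b"0"))
        dst = int(txn.get(to_acct, b"0"))
        if src < amount:
            # Raising aborts the transaction; neither balance changes.
            raise ValueError("insufficient funds")
        txn.put(from_acct, str(src - amount).encode())
        txn.put(to_acct, str(dst + amount).encode())
    # Leaving the block without an exception commits both writes atomically.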

A typical argument against them is that they do not generally permit atomic transactions across multiple rows or tables. I wonder if there's a general approach that would solve this issue.
A lot of modern data stores don't support atomic multi-key updates (transactions) out of the box but most of them provide primitives which allow you to build ACID client-side transactions.
If a data store supports per-key linearizability and a compare-and-swap (or test-and-set) operation, then that is enough to implement serializable transactions. For example, this approach is used in Google's Percolator and in the CockroachDB database.
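At its core the building block is an optimistic read-modify-write loop. A minimal sketch, assuming a hypothetical client kv that exposes a versioned get and a compare-and-swap (map these onto your store's primitives, e.g. Cassandra lightweight transactions or etcd transactions):

# Sketch: optimistic CAS-based update loop against a hypothetical
# key-value client. kv.get(key) -> (value, version);
# kv.compare_and_swap(key, expected_version, new_value) -> bool.
def cas_update(kv, key, fn, max_retries=50):
    for _ in range(max_retries):
        value, version = kv.get(key)
        new_value = fn(value)
        # Succeeds only if nobody changed the key since we read it.
        if kv.compare_and_swap(key, version, new_value):
            return new_value
    raise RuntimeError("too much contention on key %r" % key)

Full cross-key transactions (Percolator-style) layer a commit protocol on top of this primitive, but the retry loop above is the essential ingredient.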
In my blog I created a step-by-step visualization of serializable cross-shard client-side transactions, described the major use cases and provided links to the variants of the algorithm. I hope it will help you to understand how to implement them for your data store.
Among the data stores which support per key linearizability and CAS are:
Cassandra with lightweight transactions
Riak with consistent buckets
RethinkDB
ZooKeeper
etcd
HBase
DynamoDB
MongoDB
By the way, if you're fine with the Read Committed isolation level then it makes sense to take a look at RAMP transactions by Peter Bailis. They can also be implemented for the same set of data stores.

Related

Idempotency of cronjobs - "workflow" tables in database

I'm currently working on a backend system, and am faced with porting cronjob functionality from a legacy system to the new backend. A bunch of these jobs are not idempotent, and I will want to make them idempotent when porting them.
As I understand it, for a job to be idempotent, its state (whether it has been completed, or possibly whether it is currently being performed) should somehow be represented in the database / entity model. Because then, a single task can always conditionally opt out of running if the data shows that it's already done / being handled.
I can imagine simple scenarios where you can just add an extra field (column) to entities (tables) for certain tasks specifically related to that entity, for example
entity Reservation {
id
user_id
...
reminder_sent(_at) <- whether the "reminder" task has been performed yet
}
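For example, the task could first "claim" the row and only do the work if the claim succeeded; a rough sketch (the DB-API connection and send_reminder here are placeholders):

# Sketch: idempotent reminder job that claims the row first.
# The UPDATE only matches rows where the work has not been done yet,
# so a repeated or concurrent run simply affects zero rows.
def send_reservation_reminder(conn, reservation_id):
    with conn:  # one transaction
        cur = conn.cursor()
        cur.execute(
            """
            UPDATE reservation
               SET reminder_sent_at = NOW()
             WHERE id = %s
               AND reminder_sent_at IS NULL
            """,
            (reservation_id,),
        )
        if cur.rowcount == 0:
            return  # already sent (or being handled), nothing to do
    # Claim first, send after: at-most-once delivery. Swapping the
    # order would give at-least-once instead.
    send_reminder(reservation_id)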
But more generally, I feel like tasks often involve a bunch of different entities, and it would "pollute" the entities if they need to "know about" the tasks that operate on them. Also, the "state" of a job can be more complicated than just "done or not done yet" in more complex cases. Here's some examples from our business:
If a user has more than a certain amount in total unpaid invoices, we send three consecutive email reminders at certain intervals until it is resolved, and if not, end up sending the data to an external party for collection. If the user pays the said invoices, but then acquires new ones, the workflow should restart instead of continuing.
Once every month, certain users get rewarded with vouchers. The description of the voucher will note the details, e.g. "Campaign bla bla, Jul 2022", but that's all we have "in" the data of the voucher to know it relates to this job.
I feel like there must be a general known engineering concept here, but I can't seem to find the right resources on the internet. For the time being, I'm calling these things "workflows", and think it makes sense for them to have their own entity/table, e.g.
entity Workflow_UnpaidInvoicesReminder {
id
# "key" by which the job is idempotent
user_id
invoice_id / invoice_ids
# workflow status/result fields
created_at
paid_at
first_reminder_sent_at
second_reminder_sent_at
third_reminder_sent_at
sent_externally_at
}
entity Workflow_CampaignVouchers {
id
# "key" by which the job is idempotent
user_id
campaign_key
# workflow status/result fields
created_at
voucher_id
}
Can someone maybe help me find the appropriate terms and resources to describe the stuff above? I can't seem to find relevant information on the internet about the general idea of "workflows" like these.

Mongodb: is replacing an array with a new version more efficient than adding elements to it?

I have a single /update-user endpoint on my server that triggers an updateUser query on mongo.
Basically, I retrieve the user id from the cookie, and inject the received form, which can contain any key allowed in the User model, into the mongo query.
It looks like:
const form = {
    friends: [
        {id: "1", name: "paul", thumbnail: "www.imglink.com"},
        {id: "2", name: "joe", thumbnail: "www.imglink2.com"}
    ],
    locale: "en",
    age: 77
}
function updateUser(form, _id){
    return UserDB.findOneAndUpdate( { _id }, { $set: form })
}
So each time, I erase the necessary data and replace it with brand new data. Sometimes this data can be an array of 50 objects (say I've removed two people from the 36-friend array described above).
It is very convenient, because I can abstract all the logic both in the front end and back end with a single update function. However, is this pure heresy from a performance point of view? Should I rather use 10 endpoints to update each part of the form?
The form is dynamic; I never know what is going to be inside, except that it belongs to the User model, which is why I've used this strategy.
From MongoDB's point of view, it doesn't matter much. MongoDB is a journalled database (particularly with the WiredTiger storage engine), and it probably (re)writes a large part of the document on update. It might make a minor difference under very heavy loads when replicating the oplog between primary and replicas, but if you have performance constraints like these, you'll know. If in doubt, benchmark and monitor your production system - don't over-optimize.
Focus on what's best for the business domain. Is your application collaborative? Do multiple users edit the same documents at the same time? What happens when they overwrite one another's changes? Are the JSON payloads that the client sends to the back-end large enough to clog up the network? These are the most important questions you should ask; performance should only be optimized once you have the UX, the interaction model and the concurrency issues nailed down.
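To make the comparison concrete, a small sketch of the two styles with pymongo (the database and collection names are made up; the field names follow the question):

# Sketch (pymongo): replacing a whole field vs. a targeted update.
from pymongo import MongoClient

users = MongoClient()["app"]["users"]

# 1. Replace whatever the form contains wholesale, as in the question.
def update_user(form, _id):
    return users.update_one({"_id": _id}, {"$set": form})

# 2. Targeted alternative: rename one friend in place with the
#    positional operator, without resending the whole friends array.
def rename_friend(_id, friend_id, new_name):
    return users.update_one(
        {"_id": _id, "friends.id": friend_id},
        {"$set": {"friends.$.name": new_name}},
    )

Whether the extra endpoints are worth it is exactly the trade-off discussed above.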

CosmosDB - Querying Across All Partitions

I'm creating a logging system to monitor our (200-ish) main application installations, and Cosmos DB seems like a good fit due to the amount of data we'll collect, and because it allows a varying schema for the log data (particularly the Tags array - see document schema below).
But, never having used Cosmos DB before, I'm slightly unsure of what to use for my partition key.
If I partitioned by CustomerId, there would likely be several Gb of data in each of the 200 partitions, and the data will usually be queried by CustomerId, so this was my first choice for the partition key.
However I was planning to have a 'log stream' view in the logging system, showing logs coming in for all customers.
Would this lead to running a horribly slow / expensive cross partition query?
If so, is there an obvious way to avoid / limit the cost & speed implications of this cross partition querying? (Other than just taking out the log stream view for all customers!)
{
    "CustomerId": "be806507-7cc4-4db4-881b",
    "CustomerName": "Our Customer",
    "SystemArea": 1,
    "SystemAreaName": "ExchangeSync",
    "Message": "Updated OK",
    "Details": "",
    "LogLevel": 2,
    "Timestamp": "2018-11-23T10:59:29.7548888+00:00",
    "Tags": {
        "appointmentId": "109654",
        "appointmentGroupId": "86675",
        "exchangeId": "AAMkA",
        "exchangeAlias": "customer.name#customer.com"
    }
}
(Note - There isn't a defined list of SystemArea types we'll use yet, but it would be a lot fewer than the 200 customers)
Cross-partition queries should be avoided as much as possible. If your queries will usually filter by customer id, then CustomerId is a good logical partition key. However, you have to keep in mind that there is a limit of 10GB of data per logical partition.
A cross-partition query across the whole database will be a very slow and very expensive operation, but if it's not functionally critical and it's just used for infrequent reporting, it's not too much of a problem.
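For reference, a sketch of what the two query shapes look like with the azure-cosmos Python SDK (the account URL, key and database/container names are placeholders):

# Sketch: single-partition vs. cross-partition queries with azure-cosmos.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com", credential="<key>")
container = client.get_database_client("logging").get_container_client("logs")

# Scoped to one customer's partition: cheap and fast.
customer_logs = container.query_items(
    query="SELECT * FROM c WHERE c.CustomerId = @cid",
    parameters=[{"name": "@cid", "value": "be806507-7cc4-4db4-881b"}],
    partition_key="be806507-7cc4-4db4-881b",
)

# The "log stream" across all customers: fans out to every partition,
# so expect a much higher RU charge and latency.
stream = container.query_items(
    query="SELECT TOP 100 * FROM c ORDER BY c.Timestamp DESC",
    enable_cross_partition_query=True,
)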

Are Cassandra user defined data types recommended in view of performance?

I have a Cassandra Customers table which is going to keep a list of customers. Every customer has an address which is a list of standard fields:
{
    CustomerName: "",
    etc...,
    Address: {
        street: "",
        city: "",
        province: "",
        etc...
    }
}
My question is: if I have a million customers in this table and I use a user defined data type Address to keep the address information for each customer in the Customers table, what are the implications of such a model, especially in terms of disk space? Is this going to be very expensive? Should I use the Address user defined data type, flatten the address information, or even use a separate table?
Basically what happens in this case is that Cassandra will serialize instances of Address into a blob, which is stored as a single column as part of your customer table. I don't have any numbers at hand on how much overhead the serialization adds in terms of disk or CPU usage, but it probably will not make a big difference for your use case. You should test both cases to be sure.
Edit: Another aspect I should also have mentioned: handling UDTs as single blobs means that any update replaces the complete UDT. This is less efficient than updating individual columns and is a potential cause of inconsistencies: in the case of concurrent updates, both writes could overwrite each other's changes. See CASSANDRA-7423.
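To make the two options concrete, a sketch of the CQL involved, executed here through the Python driver (the keyspace, table names and example values are made up):

# Sketch: frozen UDT vs. flattened columns in Cassandra (CQL via the
# Python driver). Keyspace "shop" and the example data are hypothetical.
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")
customer_id = uuid.uuid4()

session.execute("CREATE TYPE IF NOT EXISTS address "
                "(street text, city text, province text)")

# Option 1: frozen UDT. The whole address is serialized into one cell,
# so changing any part rewrites the entire address value.
session.execute("CREATE TABLE IF NOT EXISTS customers "
                "(id uuid PRIMARY KEY, name text, address frozen<address>)")
session.execute(
    "UPDATE customers SET address = "
    "{street: '1 Main St', city: 'Springfield', province: 'ON'} WHERE id = %s",
    (customer_id,))

# Option 2: flattened columns. Each part can be updated independently.
session.execute("CREATE TABLE IF NOT EXISTS customers_flat "
                "(id uuid PRIMARY KEY, name text, "
                "address_street text, address_city text, address_province text)")
session.execute(
    "UPDATE customers_flat SET address_city = 'Springfield' WHERE id = %s",
    (customer_id,))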

MongoDB Best way to pair and delete sequential database entries

Okay so let's say I'm making a game of blind war!
Users A & B have x amount of soldiers
There are currently 0 DB docs.
User A sends 50 soldiers making a DB doc
User B sends 62 soldiers after user A!
This creates a new DB doc.
I need the most effective/scalable way to look up user A's doc, compare it to user B's doc, then delete both docs! (After returning the result of course)
Here's the problem! I could potentially have 10,000+ users sending soldiers at relatively the same time! How can I successfully complete the above process without overlapping?
I'm using the MEAN stack for development so I'm not limited to doing this in the database, but obviously the WebApp has to be 100% secure!
If you need any additional info or explanation please let me know and I'll update this question
-Thanks
One thing that comes to mind here is you may not need to do all the work that you think you need to, and your problem can probably be solved with a little help from TTL Indexes and possibly capped collections. Consider the following entries:
{ "_id" : ObjectId("531cf5f3ba53b9dd07756bb7"), "user" : "A", "units" : 50 }
{ "_id" : ObjectId("531cf622ba53b9dd07756bb9"), "user" : "B", "units" : 62 }
So there are two entries, and you got that _id value back when you inserted. At the start, "A" had no-one to play against, but the entry for "B" will play against the one before it.
ObjectIds are monotonic, which means that the "next" one along is always greater in value than the last. So with the inserted data, just do this:
db.moves.find({
    _id: { $lt: ObjectId("531cf622ba53b9dd07756bb9") },
    user: { $ne: "B" }
}).limit(1)
That gives the preceding inserted "move" to the current move that was just made, and does this because anything that was previously inserted will have an _id with less value than the current item. You also make sure that you are not "playing" against the user's own move, and of course you limit the result to one document only.
So the "moves" will be forever moving forward, When the next insert is made by user "C" they get the "move" from user "B", and then user "A" would get the "move" from user "C", and so on.
All that "could" happen here is that "B" make the next "move" in sequence, and you would pick up the same document as in the last request. But that is a point for your "session" design, to store the last "result" and make sure that you didn't get the same thing back, and as such, deal with that however you want to in your design.
That should be enough to "play" with. But let's get to your "deletion" part.
Naturally you "think" you want to delete things, but back to my initial "helpers" this should not be necessary. From above, deletion becomes only a factor of "cleaning-up", so your collection does not grow to massive proportions.
If you applied a TTL index, in much the same way as this tutorial explains, your collection entries will be cleaned up for you, and removed after a certain period of time.
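As a quick sketch of that with pymongo (the database name is made up, and it assumes each "move" document also carries a BSON date field such as "createdAt", since a TTL index needs a date field to work from):

# Sketch: TTL index that expires "moves" automatically after one hour.
from datetime import datetime, timezone
from pymongo import MongoClient

moves = MongoClient()["blindwar"]["moves"]
moves.create_index("createdAt", expireAfterSeconds=3600)

moves.insert_one({
    "user": "A",
    "units": 50,
    "createdAt": datetime.now(timezone.utc),
})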
Also, especially considering that we are using the increasing nature of the _id key and that this is more or less a "queue" in nature, you could possibly make this a capped collection. That way you can set a maximum size for how many "moves" you will keep at any given time.
Combining the two together, you get something that only "grows" to a certain size, and will be automatically cleaned for you, should activity slow down a bit. And that's going to keep all of the operations fast.
Bottom line is that the concurrency of "deletes" that you were worried about has been removed by actually "removing" the need to delete the documents that were just played. The query keeps it simple, and the TTL index and capped collection look after your data management for you.
So there you have what is my take on a very concurrent game of "Blind War".

Resources