I have a single /update-user endpoint on my server that triggers an updateUser query on mongo.
Basically, I retrieve the user id from the cookie and inject the received form, which can contain any key allowed in the User model, into the mongo query.
It looks like:
const form = {
  friends: [
    { id: "1", name: "paul", thumbnail: "www.imglink.com" },
    { id: "2", name: "joe", thumbnail: "www.imglink2.com" }
  ],
  locale: "en",
  age: 77
}
function updateUser(form, _id) {
  // Whatever keys the form contains are applied to the user document
  return UserDB.findOneAndUpdate({ _id }, { $set: form })
}
So each time, I erase the relevant data and replace it with brand new data. Sometimes this data can be an array of 50 objects (let's say I've removed two people from a 36-friend array like the one described above).
It is very convenient, because I can abstract all the logic, both front end and back end, with a single update function. However, is this pure heresy from a performance point of view? Should I rather use 10 endpoints, one to update each part of the form?
The form is dynamic; I never know what is going to be inside, except that it belongs to the User model. This is why I've used this strategy.
From MongoDB's point of view, it doesn't matter much. MongoDB is a journalled database (particularly with the WiredTiger storage engine), and it probably (re)writes a large part of the document on update. It might make a minor difference under very heavy loads when replicating the oplog between primary and replicas, but if you have performance constraints like these, you'll know. If in doubt, benchmark and monitor your production system - don't over-optimize.
Focus on what's best for the business domain. Is your application collaborative? Do multiple users edit the same documents at the same time? What happens when they overwrite one another's changes? Are the JSON payloads the client sends to the back end large enough to clog the network? These are the most important questions to ask; performance should only be optimized once you have the UX, the interaction model, and the concurrency issues nailed down.
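If you do keep the single endpoint, it's worth validating the dynamic form against a whitelist before handing it to $set, so a malicious client can't set arbitrary fields. A minimal sketch (the key list and function name are illustrative, not taken from your code):

function buildSetUpdate(form) {
  // Keys of the User model this endpoint is allowed to touch
  const ALLOWED_KEYS = ["friends", "locale", "age"];
  const update = {};
  for (const key of ALLOWED_KEYS) {
    if (key in form) update[key] = form[key];
  }
  return { $set: update };
}

// usage: UserDB.findOneAndUpdate({ _id }, buildSetUpdate(form))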
In my domain I have Product and Order aggregates. Orders reference Products. This is a 1-to-n relationship, so a Product has many Orders and an Order belongs to a Product. When a Product is discontinued, a ProductDiscontinued event is published and all Orders that belong to that Product must be cancelled. So there's an adapter that receives the ProductDiscontinued event via RabbitMQ. The adapter then delegates cancelling the Orders to an application service.

How can I achieve that a single Order is cancelled in a single transaction? Should the adapter iterate all Orders of the discontinued Product and call the application service for every single Order? Should I just ignore that I modify more than one aggregate in a single transaction and call the application service just once with a list of all affected OrderIds? Is there a better solution?
From the DDD point of view, the Aggregate is the transaction boundary: the transaction should not be larger than the Aggregate. This rule exists to force you to design your Aggregates correctly, so that no operation depends on multiple Aggregates being modified in the same transaction.
However, you already designed your Aggregates having that in mind (from what I can see).
Should the adapter iterate all Orders of the discontinued Product and call the application service for every single Order?
This is the normal way of doing things.
Should I just ignore that I modify more than one aggregate in a single transaction and call the application service just once with a list of all affected OrderIds?
In the context of what I wrote earlier, you may do that if it somehow offers better performance (I don't see how a bigger transaction could give better performance, but it also depends on the code).
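For illustration, a minimal sketch of the iterating adapter (the names orderRepository, orderApplicationService, and the event shape are hypothetical; error handling and retries are omitted):

// RabbitMQ adapter: one application-service call, and therefore one
// transaction, per affected Order aggregate
async function onProductDiscontinued(event) {
  const orderIds = await orderRepository.findIdsByProduct(event.productId);
  for (const orderId of orderIds) {
    await orderApplicationService.cancelOrder(orderId);
  }
}

If a cancellation fails, the message can be nacked and redelivered; since each Order is cancelled in its own transaction, already-cancelled Orders are simply skipped on the retry (assuming cancelOrder is idempotent).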
DocumentDB is truly schemaless: it does not force you to store only one kind of document; the shape of each document is the user's choice.

Given that, consider a partitioned collection (let's say department is the partition key of the Employee document) where the user decides not to pass any partition key. Just as a use case to support my point: a new employee or intern has joined, the department they will work in has not been decided yet, and the document may be updated later with the appropriate department.

Based on that scenario, my question is: in the interim period, to which partition will the new employee document be persisted, given that I don't have a department (partition key) for them?
{
  "eid": "",
  "entryType": "",
  "address": {
    "PIN": "",
    "city": "",
    "street": ""
  },
  "name": "",
  "id": "",
  "age": ""
}
Great question! We had the same question when we started working with partitioned collections.
Based on my understanding, it is certainly possible to create a document in a partitioned collection without specifying the partition key attribute (departmentId in your case), though it is not recommended.
When that happens, Cosmos DB puts such documents in a special partition that is accessible by specifying {} (i.e. an empty JavaScript object) as the partition key in your query.
However, please keep in mind that you can't update the value of the partition key attribute (when an employee is assigned a department, in your example). You must delete and recreate the document with the correct partition key.
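For example, with the JavaScript SDK, something along these lines should read those documents back. This is a sketch under the assumption that the query options accept a partitionKey as described above; the endpoint, key, and database/container names are placeholders:

const { CosmosClient } = require("@azure/cosmos");

async function findEmployeesWithoutDepartment(endpoint, key) {
  const client = new CosmosClient({ endpoint, key });
  const container = client.database("hr").container("employees");
  // {} addresses the special partition that holds documents which were
  // inserted without a partition key value
  const { resources } = await container.items
    .query("SELECT * FROM c", { partitionKey: {} })
    .fetchAll();
  return resources;
}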
If you're using a partitioned collection then you'll be required to provide a value for partitionKey on every insert. The choice of what to use as a partitionKey is based on the needs of your application and the nature of your data. There is no one size fits all answer. From the sounds of things you might want to reconsider using department as a partitionKey.
I have a Cassandra Customers table which is going to keep a list of customers. Every customer has an address which is a list of standard fields:
{
  CustomerName: "",
  etc...,
  Address: {
    street: "",
    city: "",
    province: "",
    etc...
  }
}
My question is: if I have a million customers in this table and I use a user-defined type Address to keep the address information for each customer in the Customers table, what are the implications of such a model, especially in terms of disk space? Is this going to be very expensive? Should I use the Address user-defined type, flatten the address information into regular columns, or even use a separate table?
Basically what happens in this case is that Cassandra will serialize instances of Address into a blob, which is stored as a single column in your customer table. I don't have any numbers at hand on how much overhead the serialization adds in disk space or CPU usage, but it probably will not make a big difference for your use case. You should test both cases to be sure.
Edit: Another aspect I should also have mentioned: handling UDTs as single blobs implies replacing the complete UDT for any update. This is less efficient than updating individual columns and is a potential cause of inconsistencies: in the case of concurrent updates, both writes could overwrite each other's changes. See CASSANDRA-7423.
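To make the trade-off concrete, here's a rough sketch using the DataStax Node.js driver (the keyspace, table, and column names are made up): updating a frozen UDT means sending the whole address again, whereas flattened columns can be updated one at a time.

const cassandra = require("cassandra-driver");

const client = new cassandra.Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "shop",
});

async function updateCity(customerName) {
  // UDT column: the whole frozen address value is rewritten
  await client.execute(
    "UPDATE customers SET address = ? WHERE customername = ?",
    [{ street: "1 Main St", city: "Toronto", province: "ON" }, customerName],
    { prepare: true }
  );

  // Flattened columns: only the changed column is written
  await client.execute(
    "UPDATE customers_flat SET address_city = ? WHERE customername = ?",
    ["Toronto", customerName],
    { prepare: true }
  );
}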
Please excuse any mistakes in terminology. In particular, I am using relational database terms.
There are a number of persistent key-value stores, including CouchDB and Cassandra, along with plenty of other projects.
A typical argument against them is that they do not generally permit atomic transactions across multiple rows or tables. I wonder if there's a general approach that would solve this issue.
Take for example the situation of a set of bank accounts. How do we move money from one bank account to another? If each bank account is a row, we want to update two rows as part of the same transaction, reducing the value in one and increasing the value in another.
One obvious approach is to have a separate table which describes transactions. Then, moving money from one bank account to another consists of simply inserting a new row into this table. We do not store the current balances of either of the two bank accounts and instead rely on summing up all the appropriate rows in the transactions table. It is easy to imagine that this would be far too much work, however; a bank may have millions of transactions a day and an individual bank account may quickly have several thousand 'transactions' associated with it.
A number (all?) of key-value stores will 'roll back' an action if the underlying data has changed since you last grabbed it. Possibly this could be used to simulate atomic transactions, as you could indicate that a particular field is locked. There are some obvious issues with this approach.
Any other ideas? It is entirely possible that my approach is simply incorrect and I have not yet wrapped my brain around the new way of thinking.
If, taking your example, you want to atomically update the value in a single document (row, in relational terminology), you can do so in CouchDB. You will get a conflict error when you try to commit the change if another contending client has updated the same document since you read it. You will then have to read the new value, update it, and retry the commit. You may have to repeat this process an indeterminate number of times (possibly infinite if there is a lot of contention), but you are guaranteed to have a document in the database with an atomically updated balance if your commit ever succeeds.
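A minimal sketch of that read-update-retry loop against CouchDB's HTTP document API (the database and document names are illustrative):

async function atomicUpdate(docUrl, mutate) {
  for (;;) {
    const doc = await (await fetch(docUrl)).json(); // read the doc with its current _rev
    mutate(doc);                                    // apply the change locally
    const res = await fetch(docUrl, {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(doc),                    // the PUT carries the _rev we read
    });
    if (res.ok) return;                             // commit succeeded atomically
    if (res.status !== 409) throw new Error("unexpected status: " + res.status);
    // 409 Conflict: a contending client won the race; re-read and retry
  }
}

// usage:
// await atomicUpdate("http://localhost:5984/bank/account-1",
//                    doc => { doc.balance += 100; });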
If you need to update two balances (i.e. a transfer from one account to another), then you need to use a separate transaction document (effectively another table where rows are transactions) that stores the amount and the two accounts (in and out). This is a common bookkeeping practice, by the way. Since CouchDB computes views only as needed, it is actually still very efficient to compute the current amount in an account from the transactions that list that account. In CouchDB, you would use a map function that emits the account number as key and the amount of the transaction as value (positive for incoming, negative for outgoing). Your reduce function would simply sum the values for each key, emitting the same key and the total sum. You could then query the view with group=true to get the account balances, keyed by account number.
CouchDB isn't suitable for transactional systems because it doesn't support locking and atomic operations.
In order to complete a bank transfer you must do a few things:
Validate the transaction, ensuring there are sufficient funds in the source account, that both accounts are open, not locked, and in good standing, and so on
Decrease the balance of the source account
Increase the balance of the destination account
If the balance or status of the accounts changes in between any of these steps, the transaction could become invalid after it is submitted, which is a big problem in a system of this kind.
Even if you use the approach suggested above, where you insert a "transfer" record and use a map/reduce view to calculate the final account balance, you have no way of ensuring that you don't overdraw the source account, because there is still a race condition between checking the source account balance and inserting the transaction: two transactions could simultaneously be added after checking the balance.
So ... it's the wrong tool for the job. CouchDB is probably good at a lot of things, but this is something that it really can not do.
EDIT: It's probably worth noting that actual banks in the real world use eventual consistency. If you overdraw your bank account for long enough, you get an overdraft fee. If you were very quick you might even be able to withdraw money from two different ATMs at almost the same time and overdraw your account, because there's a race condition between checking the balance, issuing the money, and recording the transaction. When you deposit a check into your account, they bump the balance but actually hold those funds for a period of time "just in case" the source account doesn't really have enough money.
To provide a concrete example (because there is a surprising lack of correct examples online): here's how to implement an "atomic bank balance transfer" in CouchDB (largely copied from my blog post on the same subject: http://blog.codekills.net/2014/03/13/atomic-bank-balance-transfer-with-couchdb/)
First, a brief recap of the problem: how can a banking system which allows money to be transferred between accounts be designed so that there are no race conditions which might leave invalid or nonsensical balances?
There are a few parts to this problem:
First: the transaction log. Instead of storing an account's balance in a single
record or document — {"account": "Dave", "balance": 100} — the account's
balance is calculated by summing up all the credits and debits to that account.
These credits and debits are stored in a transaction log, which might look
something like this:
{"from": "Dave", "to": "Alex", "amount": 50}
{"from": "Alex", "to": "Jane", "amount": 25}
And the CouchDB map-reduce functions to calculate the balance could look
something like this:
POST /transactions/balances
{
  "map": function(txn) {
    emit(txn.from, txn.amount * -1);
    emit(txn.to, txn.amount);
  },
  "reduce": function(keys, values) {
    return sum(values);
  }
}
For completeness, here is the list of balances:
GET /transactions/balances
{
  "rows": [
    {
      "key" : "Alex",
      "value" : 25
    },
    {
      "key" : "Dave",
      "value" : -50
    },
    {
      "key" : "Jane",
      "value" : 25
    }
  ],
  ...
}
But this leaves the obvious question: how are errors handled? What happens if
someone tries to make a transfer larger than their balance?
With CouchDB (and similar databases) this sort of business logic and error
handling must be implemented at the application level. Naively, such a function
might look like this:
def transfer(from_acct, to_acct, amount):
    txn_id = db.post("transactions", {"from": from_acct, "to": to_acct, "amount": amount})
    if db.get("transactions/balances", id=from_acct) < 0:
        db.delete("transactions/" + txn_id)
        raise InsufficientFunds()
But notice that if the application crashes between inserting the transaction and checking the updated balances, the database will be left in an inconsistent state: the sender may be left with a negative balance, and the recipient with money that didn't previously exist:
// Initial balances: Alex: 25, Jane: 25
db.post("transactions", {"from": "Alex", "To": "Jane", "amount": 50}
// Current balances: Alex: -25, Jane: 75
How can this be fixed?
To make sure the system is never in an inconsistent state, two pieces of
information need to be added to each transaction:
The time the transaction was created (to ensure that there is a strict
total ordering of transactions), and
A status — whether or not the transaction was successful.
There will also need to be two views — one which returns an account's available
balance (ie, the sum of all the "successful" transactions), and another which
returns the oldest "pending" transaction:
POST /transactions/balance-available
{
  "map": function(txn) {
    if (txn.status == "successful") {
      emit(txn.from, txn.amount * -1);
      emit(txn.to, txn.amount);
    }
  },
  "reduce": function(keys, values) {
    return sum(values);
  }
}
POST /transactions/oldest-pending
{
  "map": function(txn) {
    if (txn.status == "pending") {
      emit(txn._id, txn);
    }
  },
  "reduce": function(keys, values) {
    var oldest = values[0];
    values.forEach(function(txn) {
      if (txn.timestamp < oldest.timestamp) {
        oldest = txn;
      }
    });
    return oldest;
  }
}
The list of transfers might now look something like this:
{"from": "Alex", "to": "Dave", "amount": 100, "timestamp": 50, "status": "successful"}
{"from": "Dave", "to": "Jane", "amount": 200, "timestamp": 60, "status": "pending"}
Next, the application will need to have a function which can resolve
transactions by checking each pending transaction in order to verify that it is
valid, then updating its status from "pending" to either "successful" or
"rejected":
def resolve_transactions(target_timestamp):
    """ Resolves all transactions up to and including the transaction
    with timestamp `target_timestamp`. """
    while True:
        # Get the oldest transaction which is still pending
        txn = db.get("transactions/oldest-pending")
        if txn["timestamp"] > target_timestamp:
            # Stop once all of the transactions up until the one we're
            # interested in have been resolved.
            break
        # Then check to see if that transaction is valid
        if db.get("transactions/balance-available", id=txn["from"]) >= txn["amount"]:
            status = "successful"
        else:
            status = "rejected"
        # Then update the status of that transaction. Note that CouchDB
        # will check the "_rev" field, only performing the update if the
        # transaction hasn't already been updated.
        txn["status"] = status
        db.put(txn)
Finally, the application code for correctly performing a transfer:
def transfer(from_acct, to_acct, amount):
    timestamp = time.time()
    txn = db.post("transactions", {
        "from": from_acct,
        "to": to_acct,
        "amount": amount,
        "status": "pending",
        "timestamp": timestamp,
    })
    resolve_transactions(timestamp)
    txn = db.get("transactions/" + txn["_id"])
    if txn["status"] == "rejected":
        raise InsufficientFunds()
A couple of notes:
For the sake of brevity, this specific implementation assumes some amount of
atomicity in CouchDB's map-reduce. Updating the code so it does not rely on
that assumption is left as an exercise to the reader.
Master/master replication or CouchDB's document sync have not been taken into
consideration. Master/master replication and sync make this problem
significantly more difficult.
In a real system, using time() might result in collisions, so using
something with a bit more entropy might be a good idea; maybe "%s-%s"
%(time(), uuid()), or using the document's _id in the ordering.
Including the time is not strictly necessary, but it helps maintain a logical ordering if multiple requests come in at about the same time.
BerkeleyDB and LMDB are both key-value stores with support for ACID transactions. In BDB, transactions are optional, while LMDB only operates transactionally.
A typical argument against them is that they do not generally permit atomic transactions across multiple rows or tables. I wonder if there's a general approach that would solve this issue.
A lot of modern data stores don't support atomic multi-key updates (transactions) out of the box but most of them provide primitives which allow you to build ACID client-side transactions.
If a data store supports per-key linearizability and a compare-and-swap or test-and-set operation, then that is enough to implement serializable transactions. For example, this approach is used in Google's Percolator and in the CockroachDB database.
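As a toy illustration of the primitive involved (an in-memory sketch, not any particular store's API), a compare-and-swap keyed on a version number is what lets a client detect that it lost a race and retry:

class CasStore {
  constructor() {
    this.data = new Map(); // key -> { value, version }
  }
  get(key) {
    return this.data.get(key) || { value: undefined, version: 0 };
  }
  // Apply the write only if the key's version is still what the caller saw
  cas(key, expectedVersion, value) {
    const current = this.get(key);
    if (current.version !== expectedVersion) return false; // lost the race
    this.data.set(key, { value, version: current.version + 1 });
    return true;
  }
}

// usage: a lost-update-free increment built from read + CAS + retry
const store = new CasStore();
function increment(key) {
  for (;;) {
    const { value, version } = store.get(key);
    if (store.cas(key, version, (value || 0) + 1)) return;
  }
}

Protocols like Percolator layer a commit record and per-key write intents on top of exactly this primitive to get multi-key atomicity.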
In my blog I created a step-by-step visualization of serializable cross-shard client-side transactions, described the major use cases, and provided links to variants of the algorithm. I hope it helps you understand how to implement them for your data store.
Among the data stores which support per-key linearizability and CAS are:
Cassandra with lightweight transactions
Riak with consistent buckets
RethinkDB
ZooKeeper
etcd
HBase
DynamoDB
MongoDB
By the way, if you're fine with the Read Committed isolation level, then it makes sense to take a look at RAMP transactions by Peter Bailis. They can also be implemented for the same set of data stores.