I am trying to build a highly available, very high-volume shopping cart application. The expected volume is so high that I am considering using Cassandra instead of MySQL for the database.
Now, in a shopping cart system, most database actions have to be 100% consistent, while others do not have to be.
Examples of actions that must be 100% consistent:
Saving the payment confirmation.
Saving the purchased items list.
Examples of actions that do not require 100% consistency:
Saving the address of the customer (If at the time of payment, no address is saved in the database, assume that it was lost and ask the customer again).
Other similar things.
Now, if I am running a server cluster in a single region (Amazon EC2), are there any major roadblocks to performing every operation at the maximum consistency level? Would that provide reliability identical to a MySQL relational database? Remember, we are dealing with financial transactions here.
Is my data generally "safe" in Cassandra? By that I mean: will it survive a complete unexpected power failure, a random disk failure, and so on?
Specific to your questions about availability and EC2: as Theodore wrote, the consistency level in Cassandra dictates how "safe" the data is. The problem you'll face is ensuring that the data reaches Cassandra, fulfills your transaction goals, and is saved appropriately.
There are some good threads about transactions and solving this problem on the Apache Cassandra User's mailing list.
Cassandra on its own is not suitable for transactions:
Cassandra - transaction support
To get around this, you need "something" above the data tier that can use Cassandra as a data store while managing the transactions itself:
how to integrate cassandra with zookeeper to support transactions
Cages: http://code.google.com/p/cages/
'Locking and transactions over Cassandra using Cages': http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/
That post has a section called "Transactions for Cassandra" with more information.
Summary: you cannot guarantee financial transactions with Cassandra alone.
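To make that pattern concrete, here is a minimal sketch of the lock-then-write approach the links above describe. This is an illustration, not Cages itself: `DistributedLock` is a hypothetical stand-in for a ZooKeeper/Cages write lock, and the keyspace, table, and column names are made up; Cassandra access uses the DataStax Node.js driver.

```typescript
// Sketch of the "transactions above the data tier" pattern that Cages
// implements with ZooKeeper. `DistributedLock` is a hypothetical stand-in
// for a ZooKeeper/Cages write lock; names below are illustrative.
import { Client } from "cassandra-driver";

interface DistributedLock {
  acquire(): Promise<void>;
  release(): Promise<void>;
}

const client = new Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "shop",
});

async function debitAccount(lock: DistributedLock, accountId: string, amount: number) {
  await lock.acquire(); // serialize all writers for this account
  try {
    const rs = await client.execute(
      "SELECT balance FROM accounts WHERE id = ?", [accountId], { prepare: true });
    const balance = rs.first()!["balance"] as number;
    if (balance < amount) throw new Error("insufficient funds");
    await client.execute(
      "UPDATE accounts SET balance = ? WHERE id = ?",
      [balance - amount, accountId], { prepare: true });
  } finally {
    await lock.release(); // always release, even on failure
  }
}
```

Note that the lock only protects writers that go through the same lock service; Cassandra itself knows nothing about it, which is why the summary above still holds.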
There are lots of different ways to define consistency. If by "maximal consistent transaction" you mean reading and writing at ConsistencyLevel ALL, then that will provide consistency in the sense that your reads will never return an out-of-date value, and durability in the sense that your writes will be stored on all nodes before returning.
That's not the same as transactions, however. Cassandra does not support transactions. It doesn't provide consistency between different rows, as MySQL does. For example, suppose you add an item to the shopping basket, and update the total cost in the cart. Individually, each operation will be stored consistently and durably. However, there may be a window of time in which you can see one change but not the other. In a relational database, you can group them into a transaction so that you can only see both, or neither.
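Here is a minimal sketch of that window, using the DataStax Node.js driver (table and column names are illustrative):

```typescript
// Both writes use ConsistencyLevel ALL, yet a concurrent reader can still
// observe the new item before the updated total, because the two statements
// are independent: per-write consistency, no cross-row atomicity.
import { Client, types } from "cassandra-driver";

const client = new Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "shop",
});

async function addItem(cartId: string, itemId: string, newTotal: number) {
  const opts = { prepare: true, consistency: types.consistencies.all };

  // Write 1: durable and consistent on its own once it returns.
  await client.execute(
    "INSERT INTO cart_items (cart_id, item_id) VALUES (?, ?)",
    [cartId, itemId], opts);

  // <-- a reader here sees the new item but the stale total

  // Write 2: also durable and consistent on its own.
  await client.execute(
    "UPDATE cart_totals SET total = ? WHERE cart_id = ?",
    [newTotal, cartId], opts);
}
```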
As far as safety goes, Cassandra stores all your writes to disk in a commit log before it does anything else, in the same way that relational databases use transaction logs. So it is just as safe with regard to system crashes. With regards to node failures, if you write at CL.ALL, then you will never lose data as long as one node in each replica set survives. With regard to disk failure, that is a matter for your underlying hardware setup, e.g. RAID.
As of 2022, Cassandra supports lightweight transactions (compare-and-set operations built on Paxos), which cover some of the cases discussed above.
Find out how Best Buy is using it:
https://www.slideshare.net/joelcrabb/cassandra-and-riak-at-bestbuycom
Has anyone had any experience with database partitioning? We already have a lot of data, and queries on it are starting to slow down. Maybe someone has some examples? These are tables related to orders.
Shopware, since version 6.4.12.0, allows the use of database clusters; see the relevant documentation. You will have to set up a number of read-only nodes first. The load of reading data will then be distributed among the read-only nodes, while write operations are restricted to the primary node.
Note that in a cluster setup you should also use a lock storage that complements the setup.
Besides using a DB cluster, you can also try to reduce the load on the DB server.
The first thing you should do is enable the HTTP cache; better still, additionally set up a reverse proxy cache like Varnish. This will greatly decrease the number of requests that hit your web server, and thus your DB server as well.
Besides that, all the measures explained here should improve the overall performance of your shop as well as decrease the load on the DB.
Additionally, you could use Elasticsearch so that costly search requests won't hit the database, use a "real" message queue so that messages are not stored in the database, and use Redis instead of the database for storing performance-critical information, as documented in the articles in this category of the official docs.
The impact of all those measures probably depends on your concrete project setup, so maybe you'll see something in the DB locks that hints at one of the points I mentioned previously; that would be an indicator to start in that direction. E.g. if you see a lot of search-related queries, Elasticsearch would be a great start; but if you see a lot of DB load coming from writing/reading/deleting messages, then the message queue might be a better starting point.
All in all, when you use a DB cluster with a primary and multiple replicas and use the additional services I mentioned here, your shop should be able to scale quite well without the need for partitioning the actual DB.
I noticed QLDB does not support LIMIT or SKIP query parameters required to implement basic pagination.
Is this going to be supported in the future or is there some other way to implement pagination in QLDB?
LIMIT/SKIP is not currently supported. QLDB is purpose built for data ingestion. We recommend doing reporting and analytics in another purpose built database.
Let's consider a banking application with 2 use-cases:
Moving money between accounts
Providing monthly statements
The first is a very good fit for QLDB: indexes are used to read balances, and then a few documents are updated or created. Under OCC, QLDB makes it easy to write these transactions correctly, and performance should be very good. For example, if an account has $50 remaining and two competing transactions try to deduct $50, only one will succeed (the other will fail to commit). Meanwhile, other transactions will continue to succeed. Beyond being simple and performant, you also get integrity via the QLDB hash chain and proof system.
The second is not a good fit. To compute a statement, we would need to lookup transactions for an account. But, what happens if that account changes (maybe somebody just sent you some money!) while we're doing the lookup? Again, under OCC, we will fail the transaction and the statement generation will need to retry. For a small bank, that's probably fine, but I think you can see where this is going. QLDB is purpose built for data ingestion, and the further you stray from what it was built for, the poorer the performance will be.
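For illustration, here is a minimal sketch of that first use case with the amazon-qldb-driver-nodejs driver; the ledger, table, and field names are made up. `executeLambda` re-runs the function when QLDB rejects a commit under OCC, so of two concurrent $50 deductions only one can succeed:

```typescript
// Balance deduction under OCC. A user-thrown error aborts the transaction;
// an OCC conflict at commit time causes the driver to retry the function
// with fresh data. Ledger, table, and field names are illustrative.
import { QldbDriver, TransactionExecutor } from "amazon-qldb-driver-nodejs";

const driver = new QldbDriver("bank-ledger");

async function deduct(accountId: string, amount: number): Promise<void> {
  await driver.executeLambda(async (txn: TransactionExecutor) => {
    const result = await txn.execute(
      "SELECT balance FROM Accounts WHERE accountId = ?", accountId);
    const row = result.getResultList()[0];
    const balance = row.get("balance")!.numberValue()!;
    if (balance < amount) {
      throw new Error("insufficient funds"); // aborts the transaction
    }
    await txn.execute(
      "UPDATE Accounts SET balance = ? WHERE accountId = ?",
      balance - amount, accountId);
  });
}
```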
This raises the question of how to actually run these queries in another database. You can use the S3 Export or Kinesis Data Streaming features to get data out. S3 exports are better suited for bulk operations (which many analytic databases prefer, e.g. Redshift), while streams are better for real-time analytics (e.g. using Elasticsearch).
Conversely, I would not recommend using Redshift or Elasticsearch for the first use case, as you will not get the performance, integrity, or durability that databases designed for OLTP use cases offer (e.g. QLDB, DynamoDB, Aurora).
I am looking to build a realtime pub/sub database backend. RethinkDB is actually a perfect package for what I need, mainly because of its very low-latency changefeeds. But RethinkDB seems to be a DB where you can expect about 10k-20k inserts per second on two machines, whereas I have seen postings claiming 1 million inserts per second on DBs like Cassandra with comparable hardware. Cassandra, however, doesn't have the realtime changefeed feature.
So my question is: is there another DB, or a combination of open-source systems, that can provide the low-latency changefeed functionality of RethinkDB but at a much larger scale? Both the number of inserts per second and the number of users subscribed to changefeeds need to be as high as possible.
RethinkDB might still fit your needs if you can scale out to a robust cluster (lots of nodes). Below is a link to a report they generated with performance metrics scaling up to a 16-node cluster.
https://rethinkdb.com/docs/2-1-5-performance-report/
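For reference, subscribing to a changefeed with the official `rethinkdb` Node.js driver looks roughly like this (host, database, and table names are illustrative); any replacement system would need to deliver an equivalent stream of old/new document pairs to each subscriber:

```typescript
// Minimal changefeed subscription: each change delivers the previous and
// new version of the document as {old_val, new_val}.
import * as r from "rethinkdb";

async function subscribe(): Promise<void> {
  const conn = await r.connect({ host: "localhost", port: 28015 });
  const cursor = await r.db("app").table("events").changes().run(conn);
  cursor.each((err: Error, change: any) => {
    if (err) throw err;
    console.log("old:", change.old_val, "new:", change.new_val);
  });
}
```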
When running a web application in a farm that uses a distributed datastore that's eventually consistent (CouchDB in my case), should I be ensuring that a given user is always directed to the same datastore instance?
It seems to me that the alternate approach, where any web request can use any data store, adds significant complexity to deal with consistency issues (retries, checks, etc). On the other hand, if a user in a given session is always directed to the same couch node, won't my consistency issues revolve mostly around "shared" user data and thus be greatly simplified?
I'm also curious about strategies for directing users but maybe I'll keep that for another question (comments welcome).
According to the CAP theorem, when a partition or datastore instance failure occurs, a distributed system can provide either consistency (all nodes see the same data at the same time) or availability (every request receives a response); you'll have to trade one for the other.
Should I be ensuring that a given user is always directed to the same datastore instance?
Ideally, you should not! What will you do when the given instance fails? A major feature of a distributed datastore is to be available in spite of network or instance failures.
If a user in a given session is always directed to the same couch node, won't my consistency issues revolve mostly around "shared" user data and thus be greatly simplified?
You're right, the architecture would be a lot simpler that way, but again, what would you do if that instance fails? A lot of engineering effort has gone into distributed systems to allow multiple instances to reply to a query. I am not sure about CouchDB, but Cassandra allows you to choose your consistency model; you'll have to trade off availability for a higher degree of consistency. The client is configured to request servers in a round-robin fashion by default, which distributes the load.
I would recommend you read the Dynamo paper. The authors describe a lot of engineering details behind a distributed database.
I've been doing some research but have reached the point where I think MongoDB/Mongoose (on Node.js) is not the right tool for the job. Here is the scenario...
Two documents: Account (money) information and Inventory information
Check if user's account has enough money
If so, check and deduct inventory
Deduct funds from Account Information
It seems like I really need a transaction system to prevent other events from altering the data in between steps.
Am I correct, or can this still be handled in MongoDB/Mongoose? If not, is there a NoSQL db that I should check out, preferably with Node.JS support?
Implementing transactional safety is usually tricky and requires more than just transactions on the database, e.g. if you need to communicate with external parties in a reliable fashion or if the transaction runs over minutes, hours, or even days. But that's leading too far.
Anyhow, on the db side you can do transactions in MongoDB using two-phase commits, but it's not exactly trivial.
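The pattern looks roughly like this (a compressed sketch of the classic two-phase commit recipe from the MongoDB manual, with error handling and the recovery/rollback phases omitted; collection and field names are illustrative):

```typescript
// Two-phase commit in application code: a transaction document tracks the
// state, and each account is tagged with the txn id so a recovery job can
// tell whether an update was already applied.
import { MongoClient, ObjectId } from "mongodb";

async function transfer(client: MongoClient, from: ObjectId, to: ObjectId, amount: number) {
  const db = client.db("shop");
  const txns = db.collection("transactions");
  const accounts = db.collection("accounts");

  // 1. Create a transaction document in "initial" state.
  const { insertedId: txnId } = await txns.insertOne(
    { from, to, amount, state: "initial" });

  // 2. Move to "pending" before touching the accounts.
  await txns.updateOne({ _id: txnId, state: "initial" },
    { $set: { state: "pending" } });

  // 3. Apply both sides; the $ne filter makes each step idempotent.
  await accounts.updateOne({ _id: from, pendingTransactions: { $ne: txnId } },
    { $inc: { balance: -amount }, $push: { pendingTransactions: txnId } });
  await accounts.updateOne({ _id: to, pendingTransactions: { $ne: txnId } },
    { $inc: { balance: amount }, $push: { pendingTransactions: txnId } });

  // 4. Commit, then clean up the markers.
  await txns.updateOne({ _id: txnId, state: "pending" },
    { $set: { state: "applied" } });
  await accounts.updateOne({ _id: from }, { $pull: { pendingTransactions: txnId } });
  await accounts.updateOne({ _id: to }, { $pull: { pendingTransactions: txnId } });
  await txns.updateOne({ _id: txnId, state: "applied" },
    { $set: { state: "done" } });
}
```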
There's a ton of NoSQL databases with transaction support, e.g. Redis, Cassandra (using the Paxos protocol), and FoundationDB.
However, this seems rather random to me because the idea of NoSQL databases is to use one that fits your particular problem. If you just need 'anything' with transactions, an SQL db might do the job, right?
You can always implement your own locking mechanism within your application to lock out other sections of the app while you are making your account and inventory checks and updates. That, combined with findAndModify() (http://docs.mongodb.org/manual/reference/command/findAndModify/#dbcmd.findAndModify), may be enough for your transaction needs while also maintaining the flexibility of a NoSQL solution.
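A minimal sketch of that findAndModify idea with the official MongoDB Node.js driver (v6, where findOneAndUpdate resolves to the updated document or null; collection and field names are made up):

```typescript
// The balance check and the deduction happen in one atomic document
// update, so no other operation can spend the same funds in between.
import { MongoClient, ObjectId } from "mongodb";

async function deductFunds(client: MongoClient, accountId: ObjectId, amount: number) {
  const accounts = client.db("shop").collection("accounts");

  // Matches only if the balance is still sufficient; the $inc is applied
  // atomically with the match, which is the core of the technique.
  const updated = await accounts.findOneAndUpdate(
    { _id: accountId, balance: { $gte: amount } },
    { $inc: { balance: -amount } },
    { returnDocument: "after" },
  );

  if (updated === null) {
    throw new Error("insufficient funds or unknown account");
  }
  return updated;
}
```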
For the distributed lock, I'd look at Warlock (https://www.npmjs.org/package/node-redis-warlock). I've not used it myself, but it's Node.js-based and built on top of Redis, although implementing your own via Redis is not that hard to begin with.
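Based on Warlock's documented callback API, usage could look roughly like this (the key name, TTL, and checkout callback are illustrative, and the library expects a classic node_redis-style client):

```typescript
// Take a distributed lock before the account/inventory check-and-update
// sequence; other app instances fail to acquire the same key until we
// unlock or the TTL expires.
const redis = require("redis"); // classic node_redis client
const Warlock = require("node-redis-warlock"); // no bundled TS types

const warlock = Warlock(redis.createClient());

function checkoutWithLock(userId: string, doCheckout: () => Promise<void>) {
  // Hold the lock for at most 10 seconds.
  warlock.lock(`checkout:${userId}`, 10000, async (err: Error, unlock?: () => void) => {
    if (err) throw err;
    if (typeof unlock !== "function") {
      throw new Error("checkout already in progress for this user");
    }
    try {
      await doCheckout(); // account check, inventory deduction, etc.
    } finally {
      unlock(); // release as soon as the critical section finishes
    }
  });
}
```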