NoSQL databases, Cassandra, and cloud databases use different approaches to guarantee the validity of transactions.
What are some examples of these different approaches?
Cassandra uses eventual consistency and extends the concept with tunable consistency levels. The most common consistency level, and probably what most use cases should use is quorum consistency.
Quorum means a majority of the nodes agree on the data values being written or read. This provides strong read and write consistency and availability in the case of node failure.
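To make "a majority of the nodes" concrete, here is a minimal sketch of how a quorum size follows from the replication factor (the number of copies kept of each piece of data). The function name is my own, not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


# Print the quorum size for a few common replication factors.
for rf in (1, 2, 3, 5):
    print(f"replication factor {rf}: quorum is {quorum(rf)}")
```

Note that with a replication factor of 2 the quorum is 2, so a quorum operation there needs both replicas.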
I could not find anything about this on https://pingcap.com/, but our DBA suggested that I separate reads from writes (read/write splitting).
It is fine to mix read and write workloads in one TiDB cluster, because it is designed to support both OLTP and OLAP scenarios.
For OLTP scenarios, TiDB requires high I/O disks for data access and operation. See more: https://pingcap.com/docs/FAQ/#does-tikv-support-sas-sata-disks-or-mixed-deployment-of-ssd-sas-disks
My application crawls a user's mailbox and saves it to an RDBMS database. I started using Redis as a cache (simple key-value store) for the RDBMS database. But gradually I started storing crawler states and other data in Redis that needs to be persistent. Losing this data means a few hours of downtime. I must ensure airtight consistency for this data. The data should not be lost in node failures or split-brain scenarios. Strong consistency is a must. Sharding is done by my application. One Redis process runs on each of ten EC2 m4.large instances. On each of these instances, I am doing up to 20K IOPS to Redis. I am doing more writes than reads, though I have not determined the actual percentage of each. All my data is completely in memory, not backed by disk.
My only problem is that each of these instances is a SPOF. I cannot use Redis Cluster as it does not guarantee consistency. I have evaluated a few more tools like Aerospike; none gives a 'no data loss' guarantee.
Cassandra looks promising as I can tune the consistency level I want. I plan to use Cassandra with a replication factor of 2, and a write must be written to both replicas before it is considered committed. This gives a 'no data loss' guarantee.
By launching enough Cassandra nodes (SSD backed), can I replace my Redis key-value store and still get similar read/write IOPS and latency? Will open-source Cassandra suffice for my use case? If not, will the DataStax Enterprise In-Memory version solve it?
EDIT 1:
A bit of clarification:
I think I need to use Write consistency level 'ALL' and Read consistency level 'One'. I understand that with this consistency level my cluster will not tolerate any failure. That is OK for me. A few minutes of downtime occasionally is not a problem, as long as my data is consistent. In my present setup, one Redis instance failure causes a few hours of downtime.
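A quick way to sanity-check a combination like this is the usual overlap rule: a read is guaranteed to hit at least one replica that acknowledged the latest write when (replicas required for the write) + (replicas required for the read) > replication factor. The sketch below is only an illustration of that arithmetic; the function names are hypothetical, and only the ONE, QUORUM and ALL levels are modelled.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def read_overlaps_latest_write(write_level: str, read_level: str,
                               replication_factor: int) -> bool:
    """True when every read set must intersect every write set (R + W > RF)."""
    writes = replicas_required(write_level, replication_factor)
    reads = replicas_required(read_level, replication_factor)
    return writes + reads > replication_factor


# The combination from this question: RF = 2, write ALL, read ONE.
print(read_overlaps_latest_write("ALL", "ONE", 2))
# For comparison: write ONE, read ONE on the same keyspace.
print(read_overlaps_latest_write("ONE", "ONE", 2))
```

With a replication factor of 2, writing at ALL and reading at ONE satisfies the rule (2 + 1 > 2), whereas ONE/ONE does not (1 + 1 is not greater than 2).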
I must ensure airtight consistency for this data.
Cassandra deals with failure better when there are more nodes. Assuming your case allows for having more nodes, this is my suggestion.
So, if you have 5 nodes, use a CL of QUORUM for both reads and writes. This means every write must be acknowledged by at least 3 replicas and every read consults 3 replicas. (Strictly speaking, QUORUM is computed from the replication factor rather than the node count: it is 3 when the replication factor is 5.)
This ensures a very high level of consistency.
It also limits downtime: even if a node is down, your writes and reads won't break.
If you use CL ALL, then a single node being down or overloaded is enough to make you take a full downtime.
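To compare the options, here is a rough sketch of how many replicas of a given row can be unavailable before requests at each consistency level start failing. The helper name is made up for illustration, and it assumes the simple single-data-center levels ONE, QUORUM and ALL.

```python
def tolerated_replica_failures(level: str, replication_factor: int) -> int:
    """How many replicas may be down while the level can still be satisfied."""
    required = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }[level]
    return replication_factor - required


for rf in (2, 3, 5):
    for level in ("ONE", "QUORUM", "ALL"):
        down = tolerated_replica_failures(level, rf)
        print(f"RF={rf} {level}: survives {down} replica(s) down")
```

With a replication factor of 5, QUORUM survives two replicas being down while ALL survives none. With a replication factor of 2, QUORUM behaves like ALL, which is one reason a replication factor of at least 3 is usually suggested.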
I hope it helps!
How is Cassandra's eventual consistency model different from HBase's? It seems Facebook moved from Cassandra to HBase because of consistency issues. Which of these NoSQL DBs is ideal for scale and performance with consistency as near as possible to 'immediate'? By what factor does performance degrade when we try to improve consistency?
Here's Facebook's original post on why they chose HBase for Messenger. At the time they decided HBase was "ideal for scale and performance with consistency as near as possible to 'immediate'", however they reached its limits and later developed a new service called Iris that handles the most recent week of messages, while storing the older messages in HBase.
Cassandra's consistency model provides a lot of flexibility. The biggest difference is that Cassandra is a shared-nothing architecture: each server is designed to be able to function independently, which gives it high availability and partition tolerance at the cost of consistency.
With HBase however there is a single source of truth, at the (apparent) cost of availability and partition tolerance. The read process, from the client's perspective, involves finding the location of that data and reading it from that server. Any updates to that data are atomic.
Here's one HBase vs Cassandra benchmark that shows HBase outperforming Cassandra on nearly every test on (mostly) default settings, and here's another benchmark that shows Cassandra outperforming HBase on certain tests. I think the conclusion here is that the answer to your question is highly dependent on your use case.
Here's a good article that sums up the plusses and minuses of each, and can help you decide which one is best for your needs.
I have checked the main features of Cassandra and Infinispan. They seem to have and deliver pretty similar characteristics and functionalities:
NoSQL data store
persistence
decentralized
support replication
scalability
fault tolerant
MapReduce support
Queries
One difference I have found out is that Infinispan does not provide tunable consistency (every node has the same data).
When learning about Infinispan I came across the Cassandra Cache Store (http://infinispan.org/docs/cachestores/cassandra/). It provides persistence of data.
But then why would I still want to use Infinispan and not Cassandra directly?
Do these solutions complement each other, or do they compete on the same level?
Infinispan is mainly used as a distributed cache, like memcached/hazelcast and so on.
By default, data is written to memory, but you can persist it to what they call "cache stores" -- there are many cache stores ready to use (for File/Cassandra/HBase/Mongo), or you can write your own implementation.
One difference I have found out is that Infinispan does not provide tunable consistency (every node has the same data).
Tunable consistency and data distribution are two different things. It's not true that "every node has the same data"; it depends on how you choose to cluster the data. Infinispan, like others, offers both replication (all nodes store the same cache) and distribution (each node is responsible for a range of tokens). Tunable consistency in Cassandra means that you can choose how many nodes must acknowledge your read or write operation before control returns to the client.
You might need to use Infinispan and not Cassandra directly for many reasons. For instance, you may have a huge amount of memory in your application servers and want to keep a bigger or different cache than what you can store inside your Cassandra nodes. Another feature you might need is the infinispan-query module, which lets you perform full-text searches without installing a Solr/Elasticsearch/whatever cluster, or you might want to use its transactional capability.
IMHO these two products do not compare directly; they were born for different use cases and offer different features. You can use either one or both, depending on your application's architecture and needs.
HTH,
Carlo
How do you configure Apache Cassandra to allow for disaster recovery, to allow for one of two data-centres to fail?
The DataStax documentation talks about using a replication strategy that ensures at least one replica is written to each of your two data-centres. But I don't see how that helps once the disaster has actually happened. If you switch to the remaining data-centre, all your writes will fail because those writes will not be able to replicate to the other data-centre.
I guess you would want your software to operate in two modes: normal mode, for which writes must replicate across both data-centres, and disaster mode, for which they need not. But changing replication strategy does not seem possible.
What I really want is two data-centres that are over provisioned, and during normal operations use the resources of both data-centres, but use the resources of only the one remaining data-centre (with reduced performance) when only one data-centre is functioning.
The trick is to vary the consistency setting given through the API for writes, instead of varying the replication factor. Use the LOCAL_QUORUM setting for writes during a disaster, when only one data-centre is available. During normal operation use EACH_QUORUM to ensure both data-centres have a copy of the data. Reads can use LOCAL_QUORUM all the time.
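As a rough sketch of that idea, the application can pick the write consistency per request depending on which mode it is in. How the application learns that the other data-centre is unreachable is left out here; the `remote_dc_reachable` flag and the function name are assumptions for illustration, and the returned strings are just the names of the Cassandra consistency levels, to be mapped onto whatever your client driver expects.

```python
# Reads can stay local in both modes.
READ_CONSISTENCY = "LOCAL_QUORUM"


def write_consistency(remote_dc_reachable: bool) -> str:
    """Pick the write consistency level for the current operating mode."""
    if remote_dc_reachable:
        # Normal mode: a quorum in every data-centre must acknowledge the write.
        return "EACH_QUORUM"
    # Disaster mode: only the surviving data-centre has to acknowledge.
    return "LOCAL_QUORUM"


print("normal mode writes use", write_consistency(remote_dc_reachable=True))
print("disaster mode writes use", write_consistency(remote_dc_reachable=False))
print("reads always use", READ_CONSISTENCY)
```

The replication settings of the keyspace stay the same in both modes; only the per-request consistency level changes.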
Here is a summary of the DataStax documentation for multiple data centers and the older but still conceptually relevant disaster recovery (0.7).
Make a recipe to suit your needs with the two consistency levels LOCAL_QUORUM and EACH_QUORUM.
Here, “local” means local to a single data center, while “each” means consistency is strictly maintained at the same level in each data center.
Suppose you have 2 data centers, one used strictly for disaster recovery. Then you could set the replication factor to 3 for the primary write/read center, and 2 for the failover data center.
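For illustration, per-data-center replication factors like these are declared on the keyspace. The small helper below only builds the CQL statement as a string, assuming NetworkTopologyStrategy; the keyspace name `app_data` is made up, and the data center names DC1 and DC2 must match whatever names your snitch reports.

```python
def create_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement with one replication factor per DC."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    parts += [f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items()]
    options = ", ".join(parts)
    return "CREATE KEYSPACE " + keyspace + " WITH replication = {" + options + "};"


# 3 replicas in the primary data center, 2 in the failover data center.
print(create_keyspace_cql("app_data", {"DC1": 3, "DC2": 2}))
```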
Now depending how critical it is that your data is actually written to the disaster recovery nodes, you can either use EACH_QUORUM or LOCAL_QUORUM. Assuming you are using a replication placement strategy NetworkTopologyStrategy (NTS),
LOCAL_QUORUM on writes only makes the client wait for a quorum of replicas in the local data center (DC1); the write to your recovery node(s) in DC2 happens asynchronously.
EACH_QUORUM will ensure that all data is replicated but will delay writes until both DCs confirm successful operations.
For reads it's likely best to just use LOCAL_QUORUM to avoid inter-data center latency.
There are catches to this approach! If you choose to use EACH_QUORUM on your writes you increase the potential failure points (DC2 is down, DC1-DC2 link is down, DC1 quorum can't be met).
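To see those failure points side by side, here is a small simulation of whether a write can be acknowledged, given how many replicas are currently reachable in each data center. It is a simplified model under my own assumptions (the function names are hypothetical, and it only counts reachable replicas), not how Cassandra is implemented.

```python
def quorum(replication_factor: int) -> int:
    """Majority of the replicas in one data center."""
    return replication_factor // 2 + 1


def write_succeeds(level: str, local_dc: str, replication: dict, live: dict) -> bool:
    """replication maps DC name to RF; live maps DC name to reachable replicas."""
    def dc_has_quorum(dc: str) -> bool:
        return live.get(dc, 0) >= quorum(replication[dc])

    if level == "LOCAL_QUORUM":
        return dc_has_quorum(local_dc)
    if level == "EACH_QUORUM":
        return all(dc_has_quorum(dc) for dc in replication)
    raise ValueError(f"unsupported consistency level: {level}")


replication = {"DC1": 3, "DC2": 2}
scenarios = {
    "everything up": {"DC1": 3, "DC2": 2},
    "DC2 down or link down": {"DC1": 3, "DC2": 0},
    "one DC2 node down": {"DC1": 3, "DC2": 1},
    "DC1 quorum lost": {"DC1": 1, "DC2": 2},
}
for name, live in scenarios.items():
    local = write_succeeds("LOCAL_QUORUM", "DC1", replication, live)
    each = write_succeeds("EACH_QUORUM", "DC1", replication, live)
    print(f"{name}: LOCAL_QUORUM={local} EACH_QUORUM={each}")
```

One thing this makes visible: with a replication factor of 2 in DC2, the quorum there is 2, so a single DC2 node being down is already enough to fail EACH_QUORUM writes.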
The bonus is that once your DC1 goes down, you have a valid DC2 for disaster recovery. Also note that the second link discusses custom snitch settings for routing your IPs properly.