Why is Cassandra considered partition tolerant under the CAP theorem even though we can isolate the coordinator?

Here is the definition of partition tolerance from Gilbert and Lynch:
When a network is partitioned, all messages sent from nodes in one
component of the partition to nodes in another component are lost.
Let's divide the cluster into two partitions: the first contains only the coordinator, the second contains all other nodes. This way the coordinator will not be able to contact any replicas and will respond with an error. Is that allowed for a partition-tolerant system?

More specifically, I think the question is which of the other two CAP attributes Cassandra retains in the face of such a partition.
The answer depends on the configured consistency level. For writes there is the ANY consistency level. At this consistency level, so long as hinted handoff is enabled, the coordinator will record the write and maintain Availability. Clients connected to other coordinators will not be able to see the updated value until the partition is resolved, so reads will not be Consistent. If a stronger consistency level is chosen, then the client is explicitly configuring Consistency over Availability.
So can Cassandra (given that it does not necessarily replicate all data to all nodes) be considered AP when a read coordinator is alone in a partition? If it responds with an error, that sounds like Consistency to me; if it responds with an empty result set because the data is not in its partition, that would be Availability. Since the weakest read consistency level is ONE, requiring at least one replica to respond, Cassandra opts for the former: if the coordinator is not itself one of the replicas owning the requested data, then the read will time out and not be Available. As with writes, any stronger read consistency level explicitly configures Cassandra to behave more Consistently at the expense of Availability.
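To make that concrete, here is a minimal sketch using the DataStax Python driver (cassandra-driver); the contact point, keyspace, and table are hypothetical and only for illustration:
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

# Hypothetical contact point and schema, for illustration only.
cluster = Cluster(["10.0.0.1"])
session = cluster.connect("my_keyspace")

# A write at ANY succeeds as long as the coordinator can record it somewhere
# (e.g. as a hint), favouring Availability over Consistency.
write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ANY)
session.execute(write, (42, "Bill"))

# A read at ONE still needs at least one replica owning the data to answer;
# if the coordinator is cut off from all replicas, the read times out.
read = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE)
row = session.execute(read, (42,)).one()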

So the "coordinator" node isn't a long-lasting or "leader"-like definition. It changes with practically every query. If there was a non-token-aware operation which needed a coordinator node, and that coordinator was suddenly partitioned-off from the rest, then that one query would fail.
The next query (or a retry) would pick a new node as a coordinator. The only issue, would be that some data rows will be short by one replica (data stored on the partitioned node). But as long as you're querying by ONE and have a RF >= 2, the cluster will continue on like nothing happened.
So "yes," Cassandra is definitely partition-tolerant.
Note: This is why it's important to use a token-aware load balancing policy. That way the driver picks one of the nodes containing the required data as the "coordinator." At consistency ONE, the operation is completed locally, and a network hop is taken out of the equation.

Related

Cassandra Spark Connector: requirement failed, contact points contain multiple data centers

I have two Cassandra datacenters, with all servers in the same building, connected by a 10 Gbps network. The RF is 2 in each datacenter.
I need to ensure strong consistency inside my app, so I first planned to use QUORUM consistency (3 replicas of 4 must respond) on both reads and writes. With that configuration, I can also be fault tolerant if a node crashes in a particular datacenter.
So I set multiple contact points from multiple datacenters in my Spark connector, but the following error is immediately returned: requirement failed, contact points contain multiple data centers
So I looked at the documentation. It says:
Connections are never made to data centers other than the data center of spark.cassandra.connection.host [...]. This technique guarantees proper workload isolation so that a huge analytics job won't disturb the realtime part of the system.
Okay. So after reading that, I plan to switch to LOCAL_QUORUM (2 replicas of 2 must respond) on writes and LOCAL_ONE on reads, to still get strong consistency, and to connect by default to datacenter1.
The problem is still consistency, because Spark apps working on the second datacenter (datacenter2) don't have strong consistency on writes, since data is just asynchronously synchronized from datacenter1.
To avoid that, I can set write consistency to EACH_QUORUM (which, with RF 2, equals ALL). But the problem in that case is that if a single node is unresponsive or down, no writes can be processed at all.
So my only option to have both some fault tolerance AND strong consistency is to switch my replication factor from 2 to 3 in each datacenter, then use EACH_QUORUM on writes and LOCAL_QUORUM on reads? Is that correct?
Thank you
This comment indicates there is some misunderstanding on your part:
... because data are just asynchronously synchronized from datacenter1.
so allow me to clarify.
The coordinator of a write request sends each mutation (INSERT, UPDATE, DELETE) to ALL replicas in ALL data centres in real time. It doesn't happen at some later point in time (e.g. 2 seconds, 10 seconds or 1 minute later); it gets sent to all DCs at the same time, without delay, regardless of whether you have a 1 Mbps or 10 Gbps link between DCs.
We also recommend a minimum of 3 replicas in each DC in production, as well as using LOCAL_QUORUM for both reads and writes. There are very limited edge cases where these recommendations do not apply.
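For example, a keyspace following that recommendation would be defined roughly like this (a sketch; the keyspace name is hypothetical and the datacenter names must match what your snitch reports):
CREATE KEYSPACE my_keyspace
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'datacenter1': 3,
    'datacenter2': 3
  };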
The spark-cassandra-connector requires all contact points to belong to the same DC so that:
analytics workloads do not impact the performance of OLTP DCs (as you already pointed out), and
it can achieve data-locality for optimal performance where possible.
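A rough sketch of that setup with PySpark (the host is hypothetical; the spark.cassandra.* property names are the connector's standard settings):
from pyspark.sql import SparkSession

# All contact points must belong to the same DC, e.g. a node in datacenter1.
spark = (SparkSession.builder
         .appName("analytics-on-datacenter1")
         .config("spark.cassandra.connection.host", "10.0.0.1")
         .config("spark.cassandra.input.consistency.level", "LOCAL_QUORUM")
         .config("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")
         .getOrCreate())

# Reads and writes then stay in the local DC at LOCAL_QUORUM.
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")
      .load())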

Apache Cassandra Reading explanation

I am currently managing a Percona XtraDB cluster composed of 5 nodes that handles millions of inserts every day. Write performance is very good, but reading is not so fast, especially when I request a big dataset.
The inserted records are sensor time series.
I would like to try Apache Cassandra to replace the Percona cluster, but I don't understand how data reading works. I am looking for something able to split queries across all the nodes and read in parallel from more than one node.
I know that Cassandra sharding can have shard replicas.
If I have 5 nodes and I set a replication factor of 5, will reads be 5x faster?
Cassandra read path
The read request initiated by a client is sent to a coordinator node, which uses the partitioner to determine which replicas are responsible for the data and whether the consistency level can be met.
The coordinator checks whether it is itself responsible for the data. If yes, it satisfies the request; if not, it sends the request to the fastest-answering replica (determined using the dynamic snitch). A digest request is also sent to the other replicas.
The coordinator compares the returned data digests, and if they all match and the consistency level has been met, the data from the fastest-answering replica is returned. If the digests do not match, the coordinator issues read repair operations.
On each replica a few steps are performed: check the row cache, check the memtables, check the SSTables. More information: How is data read? and ReadPathForUsers.
Load balancing queries
Since your replication factor equals the number of nodes, each node will hold all of your data. So, when a coordinator node receives a read query, it can satisfy the query from itself. In particular, if you use a LOCAL_ONE consistency level, the request will be pretty fast.
The client drivers implement the load balancing policies, which means that on the client you can configure how queries will be spread around the cluster. Some more reading: ClientRequestsRead
If i have 5 nodes and i set a replica factor of 5, does reading will be 5x faster?
No. It means you will have up to 5 copies of the data to ensure that your query can be satisfied when nodes are down. Cassandra does not divide up the work for the read. Instead it tries to force you to design your data in a way that makes the reads efficient and fast.
The best way to read from Cassandra is to make sure that each query you generate hits a single partition, which means providing the partition key as a query parameter: the first column x of a simple PRIMARY KEY (x, y, z), or the first bracketed group (x, y) of a compound PRIMARY KEY ((x, y), z).
This goes back to the Cassandra table design principle of designing your tables around your query needs.
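For the sensor time-series case, a hypothetical sketch of such a design, bucketing by day so partitions stay bounded and every query hits a single partition:
CREATE TABLE sensor_readings (
    sensor_id text,
    day date,
    ts timestamp,
    value double,
    PRIMARY KEY ((sensor_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- One partition per sensor per day, so this read is served by a single partition:
SELECT ts, value FROM sensor_readings
 WHERE sensor_id = 'sensor-42' AND day = '2017-06-01';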
Replication is about copies of data and Partitioning is about distributing data.
https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archPartitionerAbout.html
some references about cassandra modelling,
https://www.datastax.com/dev/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key
https://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
It is recommended to keep partitions to around 100 MB, but this is not compulsory.
You can use the cassandra-stress utility to get a report of how your reads and writes perform.

Is it possible to read data only from a single node in a Cassandra cluster with a replication factor of 3?

I know that Cassandra has different read consistency levels, but I haven't seen a consistency level which allows us to read data by key from only one node. I mean, if we have a cluster with a replication factor of 3, then we will always ask all nodes when we read. Even if we choose a consistency level of ONE, we will ask all nodes but wait for the first response from any node. That is why we load not just one node when we read but 3 (4 with a coordinator node). I think we can't really improve read performance even if we set a bigger replication factor.
Is it possible to read really only from a single node?
Are you using a Token-Aware Load Balancing Policy?
If you are, and you are querying with a consistency of LOCAL_ONE/ONE, a read query should only contact a single node.
Give the article Ideology and Testing of a Resilient Driver a read. In it, you'll notice that using the TokenAwarePolicy has this effect:
"For cases with a single datacenter, the TokenAwarePolicy chooses the primary replica to be the chosen coordinator in hopes of cutting down latency by avoiding the typical coordinator-replica hop."
So here's what happens. Let's say that I have a table for keeping track of Kerbalnauts, and I want to get all data for "Bill." I would use a query like this:
SELECT * FROM kerbalnauts WHERE name='Bill';
The driver hashes my partition key value (name) to the token of 4639906948852899531 (SELECT token(name) FROM kerbalnauts WHERE name='Bill'; returns that value). If I am working with a 6-node cluster, then my primary token ranges will look like this:
node start range end range
1) 9223372036854775808 to -9223372036854775808
2) -9223372036854775807 to -5534023222112865485
3) -5534023222112865484 to -1844674407370955162
4) -1844674407370955161 to 1844674407370955161
5) 1844674407370955162 to 5534023222112865484
6) 5534023222112865485 to 9223372036854775807
As node 5 is responsible for the token range containing the partition key "Bill," my query will be sent to node 5. As I am reading at a consistency of LOCAL_ONE, there will be no need for another node to be contacted, and the result will be returned to the client...having only hit a single node.
Note: Token ranges computed with:
python3 -c 'print([str(((2**64 // 5) * i) - 2**63) for i in range(6)])'
I mean if we have a cluster with a replication factor of 3 then we will always ask all nodes when we read
Wrong. With consistency level ONE, the coordinator picks the fastest replica (the one with the lowest latency) to ask for data.
How does it know which replica is the fastest? By keeping internal latency stats for each node.
With consistency level >= QUORUM, the coordinator will ask for the data from the fastest replica and also ask for digests from the other replicas.
From the client side, if you choose the appropriate load balancing strategy (e.g. TokenAwareStrategy), the client will always contact the primary replica when using consistency level ONE.
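A minimal sketch of that client-side configuration with the DataStax Python driver 3.x (the contact point and datacenter name are hypothetical; kerbalnauts is the table from the example above):
from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

# Token-aware policy wrapping a DC-aware child policy: the driver routes each
# request straight to a replica owning the partition in the local datacenter.
cluster = Cluster(
    ["10.0.0.1"],
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="DC1")))
session = cluster.connect("my_keyspace")

# At LOCAL_ONE the chosen replica answers on its own, so only one node is hit.
query = SimpleStatement(
    "SELECT * FROM kerbalnauts WHERE name = %s",
    consistency_level=ConsistencyLevel.LOCAL_ONE)
row = session.execute(query, ("Bill",)).one()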

When would Cassandra not provide C, A, and P with W/R set to QUORUM?

When both read and write are set to quorum, I can be guaranteed the client will always get the latest value when reading.
I realize this may be a novice question, but I'm not understanding how this setup doesn't provide consistency, availability, and partitioning.
With a quorum, you are unavailable (i.e. won't accept reads or writes) if there aren't enough replicas available. You can choose to relax and read/write at lower consistency levels, granting you availability, but then you won't be consistent.
There's also the case where a quorum on reads and writes guarantees you that the latest "written" data is retrieved. However, if a coordinator doesn't know about required replicas being down (i.e. gossip hasn't propagated after 2 of 3 nodes fail), it will issue a write to 3 replicas (assuming quorum consistency on a replication factor of 3). The one live node will write, and the other 2 won't (they're down). The write times out (it doesn't fail). A write timeout where even one node has written IS NOT a write failure; it's a write "in progress". Let's say the down nodes come up now. If a client next requests that data with quorum consistency, one of two things happens:
Request goes to one of the two downed nodes, and to the "was live" node. Client gets latest data, read repair triggers, all is good.
Request goes to the two nodes that were down. OLD data is returned (assuming read repair hasn't happened). The coordinator gets the digest from the third node, and read repair kicks in. This is when the original write is considered "complete" and subsequent reads will get the fresh data. All is good, but one client will have received the old data while the write was "in progress" but not "complete". This is a very small, rare scenario. One thing to note is that writes to Cassandra are upserts on keys, so usually retries are OK to get around this problem; however, in case nodes genuinely go down, the initial read may be a problem.
Typically you balance your consistency and availability requirements. That's where the term tunable consistency comes from.
That said, the web is full of links that disprove (or at least try to disprove) Brewer's CAP theorem... From the theorem's point of view, the C says that
all nodes see the same data at the same time
which is quite different from the guarantee that a client will always retrieve fresh information. Strictly following the theorem, in your situation the C is not respected.
The DataStax documentation contains a section on Configuring Data Consistency. In looking through all of the available consistency configurations, for QUORUM it states:
Returns the record with the most recent timestamp after a quorum of replicas has responded regardless of data center. Ensures strong consistency if you can tolerate some level of failure.
Note that last part "tolerate some level of failure." Right there it's indicating that by using QUORUM consistency you are sacrificing availability (A).
The document referenced above also further defines the QUORUM level, stating that your replication factor comes into play as well:
If consistency is top priority, you can ensure that a read always
reflects the most recent write by using the following formula:
(nodes_written + nodes_read) > replication_factor
For example, if your application is using the QUORUM consistency level
for both write and read operations and you are using a replication
factor of 3, then this ensures that 2 nodes are always written and 2
nodes are always read. The combination of nodes written and read (4)
being greater than the replication factor (3) ensures strong read
consistency.
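To check the arithmetic behind that formula, here is a small sketch (plain Python, nothing Cassandra-specific):
def quorum(rf):
    # QUORUM contacts a majority of the replicas.
    return rf // 2 + 1

def is_strongly_consistent(nodes_written, nodes_read, rf):
    # Strong consistency requires the read and write replica sets to overlap.
    return nodes_written + nodes_read > rf

rf = 3
print(is_strongly_consistent(quorum(rf), quorum(rf), rf))  # True: 2 + 2 > 3
print(is_strongly_consistent(1, 1, rf))                    # False: a ONE read may miss a ONE write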
In the end, it all depends on your application requirements. If your application needs to be highly-available, ONE is probably your best choice. On the other hand, if you need strong-consistency, then QUORUM (or even ALL) would be the better option.

Is it possible to read from a Cassandra cluster even when a node fails?

I have a Cassandra cluster with 4 nodes. Is it possible to read the data from only the available nodes, excluding the node that is down? Or is there any configurable property to handle this type of scenario?
Thanks
You can do this with replication, yes. There are a few things you need:
Set the replication factor to at least 2. The more replicas, the more failed nodes you can cope with. However, the more replicas you have, the worse your performance is, since more nodes duplicate the work.
Choose an appropriate consistency level. The consistency level (CL) determines how many nodes need to be involved with a read or write operation. CL.ALL means use all replicas so you can't tolerate any failures. CL.ONE means use just one node. CL.QUORUM means a majority of replicas (RF/2+1)
You can read and write data from any node, not just ones containing that data. If you use a client library like Hector, you should tell it about all nodes and it will avoid ones that are down, as well as load balance amongst the available nodes.
