I've been looking at Datastax's Architecture in brief web page (and a few others) but I found it didn't really answer key questions I had. So I went ahead and wrote up an edited copy of the Datastax web page (see http://benslade.com/wordpress/?p=152, all feedback welcome).
I know I can figure things out by actually setting up a Cassandra database, but I don't like to have to figure out "what it does" for the user by having to figure out "how it's implemented" by the developer.
So, I have a few more questions about how things work in Cassandra at an architecture level:
The overview says, "data is distributed among all nodes in the cluster. Each node exchanges information across the cluster every second". Later it says of a cluster, "All writes are automatically partitioned and replicated throughout the cluster". What is the relationship between a cluster and a data center? I.e., is a data center part of an overall cluster? Do all nodes in all data centers exchange info with each other every second? Does a write to any node in a particular data center get propagated to other data centers the same way it gets propagated within its own data center?
The overview says "Once the memory structure (memtable) is full, the data is written to disk in an SSTable data file". Can the same data be in the memtable and the SSTable at the same time? I.e., is the memtable a data cache for the SSTable?
In the future, please try to limit your posts to one question at a time.
What is the relationship between a cluster and a data center?
A cluster can contain one or more logical data centers. Cassandra is data center-aware, which means you can alter your replication strategy on a per-data center basis. Also, Cassandra has the concept of "locality," which means that the snitch can restrict a request to nodes in a particular data center.
Ex: Querying at LOCAL_QUORUM will query data only from nodes in the data center determined to be the "closest" (network-wise), whereas querying at QUORUM will query (RF/2 + 1) replicas regardless of data center (where RF = the sum of the replication factors across all data centers).
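For illustration, here is how those consistency levels might be set per query with the DataStax Python driver (a hedged sketch; the contact point, keyspace, and table names are made up):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # Hypothetical keyspace/table, just to show where the consistency level is set.
    session = Cluster(['10.0.0.1']).connect('stackoverflow')

    # LOCAL_QUORUM: a quorum of the replicas in the local ("closest") DC must respond.
    local_read = SimpleStatement(
        "SELECT * FROM users WHERE user_id = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)

    # QUORUM: a quorum of all replicas, counted across every DC, must respond.
    global_read = SimpleStatement(
        "SELECT * FROM users WHERE user_id = %s",
        consistency_level=ConsistencyLevel.QUORUM)

    rows = session.execute(local_read, (42,))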
Do all nodes in all data centers exchange info with each other every second?
Again, the snitch handles the distribution of replicas and ensures that all nodes are kept current according to the configured replication factor. Of course, as Cassandra embraces the highly available, partition-tolerant side of the CAP theorem, all replicas operate on the concept of "eventual consistency": they will all get updated, but that may or may not happen before the data is requested.
Does a write to any node in a particular data center get propagated to other data centers the same as it gets propagated in the current data center?
Yes, but again it depends on the configured replication factor. Consider the following keyspace definition:
    CREATE KEYSPACE stackoverflow WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'WestCoastDC': '2',
        'EastCoastDC': '3'
    };
With this configuration, the snitch will ensure that a write in any data center is propagated to "WestCoastDC" until it holds two copies of the data; likewise, "EastCoastDC" will hold three copies of the same data. Note that a data center's replication factor must be less than or equal to the number of nodes in that data center.
Can the same data be in the memtable and the SSTable at the same time? I.e., is the memtable a data cache for the SSTable?
I don't believe this can happen. All writes in Cassandra are written to the in-memory memtable and simultaneously persisted on disk via the commit log. Then, once your memtable threshold is reached, the memtable contents are flushed and persisted to SSTables. Of course, if your node experiences a plug-out-of-the-wall event, the commit log is replayed on restart to ensure that its contents make it into the SSTables.
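As a toy model of that flow (purely illustrative Python, not Cassandra's actual code; the names and threshold are invented), the commit log / memtable / SSTable interplay looks roughly like this:

    # Toy model of the write path: append to a commit log for durability, apply to an
    # in-memory memtable, and flush to an immutable "SSTable" once a threshold is hit.
    class ToyNode:
        def __init__(self, memtable_limit=3):
            self.commit_log = []        # on disk in real Cassandra
            self.memtable = {}          # in memory, sorted at flush time
            self.sstables = []          # immutable on-disk files in real Cassandra
            self.memtable_limit = memtable_limit

        def write(self, key, value):
            self.commit_log.append((key, value))   # persisted first, for crash recovery
            self.memtable[key] = value             # then applied in memory
            if len(self.memtable) >= self.memtable_limit:
                self.flush()

        def flush(self):
            # Memtable contents become a new immutable SSTable; the memtable and the
            # corresponding commit log segments are then discarded.
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable.clear()
            self.commit_log.clear()

    node = ToyNode()
    for i in range(4):
        node.write("k%d" % i, i)
    print(node.memtable, node.sstables)   # k3 still in the memtable; k0-k2 flushed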
Here is the definition of partition tolerance by Gilbert and Lynch:
When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost.
Let's divide the cluster into two partitions: the first contains only the coordinator, the second contains all the other nodes. This way the coordinator will not be able to contact any replicas and will respond with an error. Is that allowed for a partition-tolerant system?
More specifically, I think the question is which of the other two CAP attributes Cassandra retains in the face of such a partition.
The answer depends on the configured consistency level. For writes there is the ANY consistency level: at this level, so long as hinted handoffs are enabled, the coordinator will record the write and maintain Availability. Clients connected to other coordinators will not be able to see the updated value until the partition is resolved, so reads will not be Consistent. If a stronger consistency level is chosen, then the client is explicitly configuring Consistency over Availability.
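For example, with the DataStax Python driver (the table here is made up), a write at ANY only needs the coordinator to record the mutation, even if only as a hint:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['10.0.0.1']).connect('stackoverflow')

    # ANY: Availability over Consistency -- the write succeeds even if no replica is
    # reachable, as long as the coordinator can store a hint for later handoff.
    write = SimpleStatement(
        "INSERT INTO users (user_id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ANY)
    session.execute(write, (42, 'alice'))

    # A stronger level (e.g. QUORUM) flips the trade-off: the same write fails if a
    # quorum of replicas cannot acknowledge it, favoring Consistency over Availability.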
So can Cassandra (given that it does not necessarily replicate all data to all nodes) be considered AP when a read coordinator is alone in a partition? If it responds with an error, that sounds like Consistency to me; if it responds with an empty result set because the data is not in its partition, that would be Availability. Since the weakest read consistency level is ONE, requiring at least one replica to respond, Cassandra opts for the former: if the coordinator is not itself one of the replicas owning the requested data, the read will time out and not be Available. As with writes, any stronger read consistency level explicitly configures Cassandra to behave more Consistently at the expense of Availability.
So the "coordinator" node isn't a long-lasting or "leader"-like definition. It changes with practically every query. If there was a non-token-aware operation which needed a coordinator node, and that coordinator was suddenly partitioned-off from the rest, then that one query would fail.
The next query (or a retry) would pick a new node as a coordinator. The only issue, would be that some data rows will be short by one replica (data stored on the partitioned node). But as long as you're querying by ONE and have a RF >= 2, the cluster will continue on like nothing happened.
So "yes," Cassandra is definitely partition-tolerant.
Note: This is why it's important to use a token-aware load balancing policy. That way the driver picks one of the nodes containing the required data as the "coordinator." At consistency ONE, the operation is completed locally, and a network hop is taken out of the equation.
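A hedged sketch of such a policy with the DataStax Python driver (option names as in driver 3.x, where newer releases prefer execution profiles; the contact points and DC name are placeholders):

    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    # TokenAwarePolicy routes each statement to a node that owns the statement's token,
    # so at consistency ONE the coordinator is usually also a replica for the data.
    cluster = Cluster(
        contact_points=['10.0.0.1', '10.0.0.2'],
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc='WestCoastDC')))
    session = cluster.connect()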
I have two Cassandra datacenters, with all servers in the same building, connected with a 10 Gbps network. The RF is 2 in each datacenter.
I need to ensure strong consistency inside my app, so I first planned to use QUORUM consistency (3 replicas of 4 must respond) on both reads and writes. With that configuration, I can also stay fault tolerant if a node crashes in a particular datacenter.
So I set multiple contact points from multiple datacenters for my Spark connector, but the following error is immediately returned: `requirement failed, contact points contain multiple data centers`
So I looked at the documentation. It says:
Connections are never made to data centers other than the data center of spark.cassandra.connection.host [...]. This technique guarantees proper workload isolation so that a huge analytics job won't disturb the realtime part of the system.
Okay. So after reading that, I plan to switch to LOCAL_QUORUM (2 replicas of 2 must respond) on writes, and LOCAL_ONE on reads, to still get strong consistency, and connect by default to datacenter1.
The problem is still consistency, because Spark apps working on the second datacenter, datacenter2, don't have strong consistency on writes, because data is just asynchronously synchronized from datacenter1.
To avoid that, I can set write consistency to EACH_QUORUM (which with RF 2 amounts to ALL). But the problem in that case is that if a single node is unresponsive or down, no writes can be processed at all.
So my only option, to have both some fault tolerance AND strong consistency, is to switch my replication factor from 2 to 3 in each datacenter, then use EACH_QUORUM on writes and LOCAL_QUORUM on reads? Is that correct?
Thank you
This comment indicates there is some misunderstanding on your part:
... because data is just asynchronously synchronized from datacenter1.
so allow me to clarify.
The coordinator of a write request sends each mutation (INSERT, UPDATE, DELETE) to ALL replicas in ALL data centers in real time. It doesn't happen at some later point in time (i.e. 2 seconds later, 10s later or 1 minute later) -- it gets sent to all DCs at the same time, without delay, regardless of whether you have a 1 Mbps or 10 Gbps link between DCs.
We also recommend a minimum of 3 replicas in each DC in production, as well as using LOCAL_QUORUM for both reads and writes. There are very limited edge cases where these recommendations do not apply.
The spark-cassandra-connector requires all contact points to belong to the same DC so that:
analytics workloads do not impact the performance of OLTP DCs (as you already pointed out), and
it can achieve data-locality for optimal performance where possible.
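As a rough example (property names as I recall them from the spark-cassandra-connector docs; the hosts, keyspace, and table are placeholders), the Spark side points at nodes from a single DC and can pick its own consistency levels:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cassandra-analytics")
             # contact points must all belong to one DC, e.g. the analytics DC
             .config("spark.cassandra.connection.host", "10.0.2.1,10.0.2.2")
             # consistency levels used by the connector for reads and writes
             .config("spark.cassandra.input.consistency.level", "LOCAL_QUORUM")
             .config("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")
             .getOrCreate())

    df = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="mykeyspace", table="mytable")
          .load())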
I am currently managing a Percona XtraDB cluster composed of 5 nodes that handles millions of inserts every day. Write performance is very good, but reading is not so fast, especially when I request a big dataset.
The records inserted are sensor time series.
I would like to try Apache Cassandra to replace the Percona cluster, but I don't understand how data reading works. I am looking for something able to split a query across all the nodes and read in parallel from more than one node.
I know that Cassandra sharding can have shard replicas.
If I have 5 nodes and I set a replication factor of 5, will reading be 5x faster?
Cassandra read path
The read request initiated by a client is sent to a coordinator node, which asks the partitioner which replicas are responsible for the data and checks whether the consistency level can be met.
The coordinator will check whether it is itself responsible for the data. If yes, it will satisfy the request. If not, it will send the request to the fastest-answering replica (determined using the dynamic snitch). Also, a digest request is sent to the other replicas.
The coordinator will compare the returned data digests, and if all are the same and the consistency level has been met, the data from the fastest-answering replica is returned. If the digests are not the same, the coordinator will issue read repair operations.
On the replica node a few steps are performed: check the row cache, check the memtables, check the SSTables. More information: How is data read? and ReadPathForUsers.
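As a rough illustration of that flow (a toy Python sketch, not Cassandra internals; every name here is invented):

    import hashlib

    # Toy sketch of the coordinator read path: full data from the "fastest" replica,
    # digests from the others, compare, and trigger a read repair on mismatch.
    def digest(value):
        return hashlib.md5(repr(value).encode()).hexdigest()

    def coordinator_read(key, replicas, consistency=2):
        contacted = replicas[:consistency]             # enough replicas to meet the CL
        fastest, others = contacted[0], contacted[1:]  # dynamic snitch picks "fastest"
        data = fastest.get(key)                        # full data read
        if any(digest(r.get(key)) != digest(data) for r in others):   # digest reads
            print("digest mismatch -> issue read repair")             # reconcile replicas
        return data

    replicas = [{"k1": "v1"}, {"k1": "v1"}, {"k1": "stale"}]
    print(coordinator_read("k1", replicas, consistency=3))   # hits the mismatch branch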
Load balancing queries
Since you have a replication factor equal to the number of nodes, each node will hold all of your data. So when a coordinator node receives a read query, it can satisfy it from itself (in particular, if you use a LOCAL_ONE consistency level, the request will be pretty fast).
The client drivers implement the load balancing policies, which means that on your client you can configure how the queries will be spread around the cluster. Some more reading: ClientRequestsRead
If I have 5 nodes and I set a replication factor of 5, will reading be 5x faster?
No. It means you will have up to 5 copies of the data, to ensure that your query can be satisfied even when nodes are down. Cassandra does not divide up the work for a read; instead, it tries to force you to design your data in a way that makes reads efficient and fast.
The best way to read from Cassandra is to make sure that each query you generate hits a single Cassandra partition, which means the first part of a simple PRIMARY KEY (x, y, z), or the first parenthesized group of a compound PRIMARY KEY ((x, y), z), is provided as a query parameter.
This goes back to the Cassandra table design principle of designing tables around your query needs.
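To make that concrete, here is a hedged sketch (the table, columns, and contact point are invented for illustration) of a sensor time-series table where every query supplies the full partition key:

    from cassandra.cluster import Cluster

    # Hypothetical schema: (sensor_id, day) is the partition key and ts the clustering
    # key, so one day of one sensor's readings lives in a single partition.
    session = Cluster(['10.0.0.1']).connect('sensors')
    session.execute("""
        CREATE TABLE IF NOT EXISTS readings (
            sensor_id text,
            day       text,
            ts        timestamp,
            value     double,
            PRIMARY KEY ((sensor_id, day), ts)
        )""")

    # The whole partition key is bound, so the query is served by the replicas of
    # exactly one partition instead of touching the whole cluster.
    rows = session.execute(
        "SELECT ts, value FROM readings WHERE sensor_id = %s AND day = %s",
        ("sensor-42", "2020-01-15"))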
Replication is about copies of data and Partitioning is about distributing data.
https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archPartitionerAbout.html
Some references on Cassandra data modelling:
https://www.datastax.com/dev/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key
https://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
It is recommended to keep partitions around 100 MB, but this is not compulsory.
You can use the cassandra-stress utility to get a report of how your reads and writes perform.
Do we also need to repair "SYSTEM" keyspaces and "OPSCENTER" keyspaces in Cassandra, along with the keyspaces we created?
The answer is no and maybe respectively. Here's why:
System KS
The SYSTEM keyspace uses the Local replication strategy, so there is no need or sense in repairing it -- remember, repair is an anti-entropy mechanism through which we ensure that multiple replicas on different nodes are holding the same, latest data. Because the Local strategy means there is no replication, there is no need to build Merkle trees and compare them.
OpsC KS
OpsCenter uses regular reads and writes into Cassandra to store information about your cluster health / statistics / etc. These will have multiple replicas, and it is possible that different nodes may get out of sync (say one node is down for some reason and exceeds the max hint window). In this case, you might see stale data if you're reading at CL ONE from that node, and a repair would be beneficial. OpsC tables also have a TTL -- so you could see zombie data if for some reason tombstones don't get propagated across the cluster. But the impact of stale data in your OpsCenter statistics will not make or break your business.
So if you have the system resources to run repairs (hopefully using the OpsC repair service) on the OpsC keyspace, it won't hurt and might keep you from seeing stale data, etc. But turning repairs off for the OpsC keyspace may free up some system resources for your regular workload.
I have a two-machine cluster running Cassandra 1.2.6. I am using a keyspace with a replication factor of 2. But my application demands that I write to both replicas in parallel, and also let Cassandra do the replication, hoping that Cassandra does not duplicate the key/value on the replica nodes.
For example:
I have nodes Node1 and Node2. I have a keyspace with replication factor 2 configured on it, and a column family to push key/value pairs to.
I use a Python client (pycassa) to write to the cluster.
A key, "KeyX", hashes to Node1 and Node2. (I find out which servers a key hashes to through the nodetool command `$ nodetool getendpoints KeyspaceName ColumnFamilyName KeyHexString`.)
I use a client to write (KeyX, Value) concurrently to the nodes Node1 and Node2. (In the connection pool I give only the specific server name)
When writing, I wait for one write to succeed (to the master). (Consistency level ONE)
Now, I monitor through the `$nodetool status` command the amount of disk space that the cluster uses.
I write around 100 keys each having 2MB values.
Ideally this should store around 400MB on disk, with some overhead for storing keys, which should be marginal compared to the value sizes I am using.
Observations:
If I do not write to all the nodes that the key hashes to, Cassandra handles replication internally and the data size is around 400MB (200MB on each node for 100 keys with 2MB values).
If I do write to all the nodes the key hashes to, Cassandra writes more than the expected amount of data to disk; it is as much as 15% more. In our tests Cassandra wrote ~460MB instead of 400MB.
My question is: is this behavior (15% overhead) expected? Is there any configuration we need to tweak so that Cassandra properly handles concurrent writes to all the replicas?
Thanks!
There are two possible causes of the 15% extra space that I can think of.
One is because sometimes a replica will store two copies of a column temporarily. If you write a column twice in Cassandra at slightly different times, the two copies may go into separate memtables so end up in separate SSTables on disk. At some point later, when the SSTables get merged through the compaction process, the older value will be discarded, freeing up the space. In your test you could run nodetool compact to force compaction and see if the space usage goes down.
Another possible cause depends on how you did the test when you didn't write to both nodes. If you did this at consistency level ONE, it is possible some of the writes were dropped by the other replica, so it doesn't have all the keys yet. You can be sure it does by running nodetool repair. So the space used in your first observation may not be for all the keys.
You should be aware that writing to all replicas at consistency level ONE does not guarantee that each replica holds a copy. The node that is receiving the data does not have to store it to return success for the write, even if it is a replica. It may be overloaded (in your workload, this would most likely be due to not enough I/O to write the data out) and drop the write, while succeeding in writing it to a different replica. This would cause less space to be used in your second observation, but probably isn't happening in your test since it is a relatively small amount of data.
If you need to guarantee you have two copies you should write at consistency level ALL and only write it once.
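For example (the question used pycassa, but this sketch uses the newer DataStax Python driver; the table and column names are only illustrative), a single write at ALL requires every replica to acknowledge:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['Node1', 'Node2']).connect('KeyspaceName')
    value = 'x' * (2 * 1024 * 1024)    # ~2MB payload, as in the test

    # ALL: the write only succeeds once every replica (here, both nodes) has
    # acknowledged it, so each key is guaranteed to be stored on Node1 and Node2.
    stmt = SimpleStatement(
        "INSERT INTO values_by_key (key, value) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ALL)
    session.execute(stmt, ('KeyX', value))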