How does the peer-to-peer Cassandra architecture really work? I mean:
When a request hits the cluster, it must hit some machine based on an IP, right?
So which machine does it hit first: one of the nodes, or something in the cluster that is responsible for balancing and redirecting the request to the right node?
Could you describe what that is? And how does this differ from the master/followers architecture?
For the purposes of my answer, I will use the Java driver as an example since it is the most popular.
When you connect to a cluster using one of the drivers, you need to configure it with details of your cluster including (see the connection sketch after this list):
Contact points - the entry point to your cluster which is a comma-separated list of IPs/hostnames for some of the nodes in your cluster.
Login credentials - username and password if authentication is enabled on your cluster.
SSL/TLS certificate and credentials - if encryption is enabled on your cluster.
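For illustration, here is a minimal connection sketch, assuming the Java driver 4.x; the IPs, datacenter name, and credentials below are placeholders, not values from your cluster:

    import com.datastax.oss.driver.api.core.CqlSession;
    import java.net.InetSocketAddress;

    public class Connect {
        public static void main(String[] args) {
            // Contact points are only the entry point; the driver discovers the rest of the cluster.
            try (CqlSession session = CqlSession.builder()
                    .addContactPoint(new InetSocketAddress("10.0.0.1", 9042))  // placeholder node IPs
                    .addContactPoint(new InetSocketAddress("10.0.0.2", 9042))
                    .withLocalDatacenter("dc1")                                // placeholder DC name
                    .withAuthCredentials("app_user", "app_password")           // only if auth is enabled
                    .build()) {
                System.out.println("Connected as session: " + session.getName());
            }
        }
    }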
When your application starts, a control connection is established with the first available node in the list of contact points. The driver uses this control connection for admin tasks (a short sketch follows this list) such as:
get topology information about the cluster including node IPs, rack placement, network/DC information, etc
get schema information such as keyspaces and tables
subscribe to metadata changes including topology and schema updates
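As a hedged sketch (again assuming the Java driver 4.x), this is the kind of metadata the driver keeps up to date over that control connection:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.metadata.Metadata;
    import com.datastax.oss.driver.api.core.metadata.Node;

    static void printClusterMetadata(CqlSession session) {
        Metadata metadata = session.getMetadata();
        // Topology: every node the driver knows about, with its DC, rack and state.
        for (Node node : metadata.getNodes().values()) {
            System.out.printf("%s dc=%s rack=%s state=%s%n",
                    node.getEndPoint(), node.getDatacenter(), node.getRack(), node.getState());
        }
        // Schema: the keyspaces (and, under each, the tables) the driver has learned about.
        metadata.getKeyspaces().values()
                .forEach(ks -> System.out.println("keyspace: " + ks.getName()));
    }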
When you configure the driver with a load-balancing policy (LBP), the policy determines which node the driver picks as the coordinator for every query. By default, the Java driver uses a load-balancing policy which picks nodes in the local datacenter. If you don't specify which DC is local to the app, the driver will set the local DC to the DC of the first contact point.
Each time a driver executes a query, it generates a query plan or a list of nodes to contact. This list of nodes has the following characteristics:
A query plan is different for each query to balance the load across nodes in the cluster.
A query plan only lists available nodes and does not include nodes which are down or temporarily unavailable.
Nodes in the local DC are listed first and if the load-balancing policy allows it, remote nodes are included last.
The driver tries to contact each node in the query plan in the order they are listed. If the first node is available then the driver uses it as the coordinator. If the first node does not respond (for whatever reason), the driver tries the next node in the query plan and so on.
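To make the idea concrete, here is a purely illustrative sketch (not the driver's actual load-balancing code) of what building a query plan amounts to:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Illustrative only: live local nodes first, shuffled so the coordinator role rotates,
    // then remote nodes as a last resort if the policy allows it.
    static List<String> newQueryPlan(List<String> liveLocalNodes,
                                     List<String> liveRemoteNodes,
                                     boolean allowRemoteNodes) {
        List<String> plan = new ArrayList<>(liveLocalNodes);
        Collections.shuffle(plan);            // a different order per query spreads the load
        if (allowRemoteNodes) {
            plan.addAll(liveRemoteNodes);     // remote DC nodes are listed last
        }
        return plan;                          // the driver tries these coordinators in order
    }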
Finally, all nodes are equal in Cassandra. There is no active-passive, no leader-follower, no primary-secondary and this makes Cassandra a truly high availability (HA) cluster with no single point-of-failure. Any node can do the work of any other node and the load is distributed equally to all nodes by design.
If you're new to Cassandra, I recommend having a look at datastax.com/dev which has lots of free hands-on interactive learning resources. In particular, the Cassandra Fundamentals learning series lets you learn the basic concepts quickly.
For what it's worth, you can also use the Stargate.io data platform. It allows you to connect to a Cassandra cluster using APIs you're already familiar with. It is fully open-source so it's free to use. Here are links to the Stargate tutorials on datastax.com/dev: REST API, Document API, GraphQL API, and more recently gRPC API. Cheers!
Working with Cassandra, we have to remember two very important things: data is partitioned (split into chunks) and data is replicated (each chunk is stored on a few different servers). Partitioning is needed for scalability, while replication serves high availability. Given that Cassandra is designed to handle petabytes of data under huge pressure (tens of millions of queries per second), and no single server is able to handle such a load, each cluster server is responsible only for a range of the data, not for the whole dataset. A node storing data you need for a particular query is called a "replica node". Notice that different queries will have different replica nodes.
Together, this brings a few implications:
We have to reach multiple servers during a single query: on reads, to make sure the data is consistent, and on writes, to write the data to all responsible servers.
How do we know which node is right for that particular query? What happens if a query hits a "wrong" node? How do we configure the application so it sends queries to the replica nodes?
Funnily enough, as a developer you have to do one and only one thing: understand partitions and partition keys, and then Cassandra will take care of all the potential issues. Simple as that. When you design a table, you have to declare partition keys, and the data placement will be based on that - automagically. Next, you always have to specify the partition key values when doing your queries. That's it, your job is done, get yourself some coffee!
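For example (a sketch with a hypothetical keyspace and table, using the Java driver; the names are made up):

    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    // Assumes an existing CqlSession 'session' and a java.util.UUID 'someUserId'.
    // The partition key (user_id) decides which replica nodes store each row.
    session.execute(
        "CREATE TABLE IF NOT EXISTS shop.orders_by_user ("
      + "  user_id  uuid,"
      + "  order_id timeuuid,"
      + "  total    decimal,"
      + "  PRIMARY KEY ((user_id), order_id))");

    // Always restrict queries by the partition key so they can be routed to the right replicas.
    session.execute(SimpleStatement.newInstance(
        "SELECT order_id, total FROM shop.orders_by_user WHERE user_id = ?", someUserId));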
Meanwhile, Cassandra starts her job. Cassandra nodes are smart: they know the data placement, they know which servers are responsible for the data you are writing, and they know the partitions - in Cassandra language this is called being token-aware. It does not matter which server receives the query, as literally every server is able to answer it. Any node that gets the request (called the query coordinator, because it coordinates the query operations) will find the replica nodes based on the placement of the partitions. With that, the query coordinator will execute the query, making the proper calls to the replicas - the coordinator knows which nodes to ask because you did your part of the job and specified the partition key value in the query, which is used for the routing.
In short, you can ask any of your cluster nodes to write/read your data, Cassandra is decentralized and you'll get it done. But how do we make it better and get directly to the replica to avoid bothering nodes that don't store our data?
So which machine will it hit first?
The journey of a request starts much earlier than you might think - when your application starts, the Cassandra driver connects to the cluster and reads information about data placement: which partition is stored on which nodes. That means the driver knows which node has to be contacted for different queries. You got it right: the driver is token-aware too!
Token-aware drivers understand data placement and will route a query to a proper replica node. Answering the question: under normal circumstances, your query will first hit one of the replica nodes; this node will get the answers from, or write the data to, the other replica nodes and that's it, we are good. In some rare situations your query may hit a "wrong" non-replica server, but it doesn't really matter as it will also do the job, with just a minor delay - for example, if your replication factor is 3 (you have three replicas) and your query gets to a "wrong" node, it will have to ask all three replicas, while hitting the "right one" still requires 2 network operations. It's not a big deal though, as all the operations are done in parallel.
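Under the hood this is consistent hashing: the partition key is hashed to a token, and the token determines the replica nodes. A toy sketch of the idea (Cassandra really uses the Murmur3 partitioner and a ring of token ranges, not this):

    import java.util.List;

    // Toy illustration of token awareness: hash the partition key, map the token to the
    // set of nodes that own it (e.g. 3 nodes when the replication factor is 3).
    static List<String> replicasFor(String partitionKey, List<List<String>> replicaSets) {
        long token = partitionKey.hashCode();                               // stand-in for Murmur3
        int slot = (int) Math.floorMod(token, (long) replicaSets.size());   // stand-in for the token ring lookup
        return replicaSets.get(slot);
    }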
How does this differ from the master/followers architecture?
With a leader/follower architecture, you can read from any server but you can write only to the leader server, which creates two issues:
Your app needs to know who is the leader (or you need to have a special proxy)
Single Point of Failure (SPoF) - if the leader is down, you can't write to the DB at all
With Cassandra's peer-to-peer architecture you can write to any of the cluster nodes, even if there are thousands of them. Of course, there is no SPoF.
P.S. Cassandra is an extremely powerful technology, but great power comes with great responsibility - it's quite complex too. If you plan to work with it, you'd better invest some time into learning to use it properly. I suggest taking a Developer Path on academy.datastax.com (it's free!), or at least watching the DataStax "Intro to Cassandra" workshop.
It is based on the driver that you used to connect to the Cassandra cluster. Again, all nodes in the datacenter are one and the same. The driver will connect to any of the nodes in the local datacenter that you have provided in the driver configs via the contact points configuration (i.e. datastax-java-driver.basic.contact-points in the Java driver).
For example, the Java driver (and most drivers' logic will be the same) uses system.peers.rpc-address to connect to newly discovered nodes. For special network topologies, an address translation component can be plugged in via advanced.address-translator in the configuration: there is none by default; an EC2-specific translator is also available (for deployments that span multiple regions), or you can write your own.
Each node in the Cassandra cluster is uniquely identified by an IP address that the driver will use to establish connections.
for contact points, these are provided as part of configuring the CqlSession object;
for other nodes, addresses will be discovered dynamically, either by inspecting system.peers on already connected nodes, or via push notifications received on the control connection when new nodes are discovered by gossip.
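You can inspect those advertised addresses yourself; a small sketch (assuming an existing CqlSession named session):

    import com.datastax.oss.driver.api.core.cql.ResultSet;
    import com.datastax.oss.driver.api.core.cql.Row;

    // Peer addresses a node advertises to drivers (Cassandra 3.x; newer versions also expose system.peers_v2).
    ResultSet rs = session.execute(
        "SELECT peer, rpc_address, data_center, rack FROM system.peers");
    for (Row row : rs) {
        System.out.printf("peer=%s rpc_address=%s dc=%s rack=%s%n",
            row.getInetAddress("peer"), row.getInetAddress("rpc_address"),
            row.getString("data_center"), row.getString("rack"));
    }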
More info can be found in the driver documentation.
It seems you are asking how specifically Cassandra selects which node gets hit with data and which ones don't.
There are two sides to this: the client and the servers.
On the client
When a CQL connection is established, the client (if implemented in the client library and configured) usually also retrieves the topology from the cluster. The topology is the information about token ownership inside the ring, as well as information about quorums etc.
So the client itself can already decide, for the next request, which node to contact for a certain piece of data, thanks to the consistent hashing of partition keys in Cassandra. The client is aware of which node would be the right choice to contact.
But the client can still choose not to use this information and just send the request to any node of the ring - the nodes will then forward the request to the appropriate token owners -> see the next section.
In the Cluster
The same applies to the nodes themselves. If a client sends a request to a node, it will simply look up the owner nodes in its topology table and forward the request to exactly the nodes that own this token.
It will always forward it to all of them so the data is consistent across the cluster. Depending on the requested consistency level and the replication factor, it will return a success response to the client once the required number of replicas has acknowledged the operation (e.g. LOCAL_QUORUM with RF=3 will return a success response when 2 nodes acknowledge the receipt while the 3rd node is still pending).
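With the Java driver you choose that consistency level per statement; a small sketch (driver 4.x, hypothetical table and values):

    import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    // Assumes an existing CqlSession 'session' and bound values 'id' and 'value'.
    // With RF = 3, LOCAL_QUORUM means the coordinator reports success once 2 replicas acknowledge.
    SimpleStatement write = SimpleStatement
        .newInstance("INSERT INTO my_ks.my_table (id, value) VALUES (?, ?)", id, value)
        .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_QUORUM);
    session.execute(write);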
If a node is detected as down or can't be reached the Command that would have been sent to the node is saved in the local hints table - a buffer that keeps all the operations that haven't been successfully sent to other nodes.
You can read more on Hints in the Cassandra Docs
Compared to a Leader/Follower architecture the Cassandra model is actually simpler and depends mostly on all involved nodes seeing all the mutation commands happening to the data they "own" via the tokens.
I'm going to build an Apache Cassandra 3.11.x cluster with 44 nodes. Each application server will have one cluster node so that the application can do reads/writes locally.
I have a couple of questions running in my mind; kindly answer if possible.
1. How many server IPs should be mentioned in the seed node parameter?
2. How does HA work when all of the mentioned seed nodes go down?
3. What is the disadvantage of mentioning all the server IPs in the seed node parameter?
4. How does Cassandra scale with respect to data, other than the primary key and tunable consistency? As per my assumption, the replication factor can improve HA but not performance -
so how will performance increase by adding more nodes?
5. Is there any sharding mechanism in Cassandra?
Answers are in order:
It's recommended to point to at least 2 nodes per DC.
Seed/contact nodes are used only for the initial bootstrap - when your program reaches any of the listed nodes, it "learns" the topology of the whole cluster, and then the driver listens for node status changes and adjusts its list of available hosts. So even if the seed node(s) go down after the connection is already established, the driver will still be able to reach the other nodes.
It's usually harder to maintain - you need to keep the configuration parameters for your driver and the list of nodes in sync.
When you have RF > 1, Cassandra may read or write data from/to any replica. The consistency level regulates how many nodes should return an answer for a read or write operation. When you add a new node, the data is redistributed to the new node, and if you have correctly selected the partition key, the new node starts to receive requests in parallel with the old nodes.
The partition key is responsible for the selection of the replica(s) that will hold the data associated with it - you can see it as a shard. But you need to be careful with the selection of the partition key - it's easy to create partitions that are too big, or partitions that are "hot" (receiving most of the operations in the cluster - for example, if you're using the date as the partition key and always writing/reading data for today). See the sketch below for one way to avoid that.
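A hedged sketch of bucketing a time-based partition key so "today" doesn't become one hot partition (hypothetical table and names, executed through the Java driver):

    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    // Assumes an existing CqlSession 'session' and bound values 'deviceId' and 'today'.
    // Bucketing by (device_id, day) spreads today's traffic over many partitions
    // instead of concentrating it on a single "today" partition.
    session.execute(
        "CREATE TABLE IF NOT EXISTS metrics.readings ("
      + "  device_id uuid, day date, ts timestamp, value double,"
      + "  PRIMARY KEY ((device_id, day), ts))");

    // Queries must then supply the full partition key: both device_id and day.
    session.execute(SimpleStatement.newInstance(
        "SELECT ts, value FROM metrics.readings WHERE device_id = ? AND day = ?",
        deviceId, today));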
P.S. I would recommend reading the DataStax Architecture guide - it contains a lot of information about Cassandra as well...
I have checked the main features of Cassandra and Infinispan. They seem to have and deliver pretty similar characteristics and functionalities:
NoSQL data store
persistence
decentralized
support replication
scalability
fault tolerant
MapReduce support
Queries
One difference I have found out is that Infinispan does not provide tunable consistency (every node has the same data).
When learning about Infinispan I came across the Cassandra Cache Store (http://infinispan.org/docs/cachestores/cassandra/). It provides persistence of data.
But then why I would still want to use Infinispan and not Cassandra directly?
Do these solutions complement each other or they are more competing on the same level?
Infinispan is mainly used as a distributed cache, like memcached/hazelcast and so on.
Natively, data are written in memory, but you can persist them into what they call "cache stores" -- there are many cache stores ready (for File/Cassandra/HBase/Mongo) or you can make your own implementation.
One difference I have found out is that Infinispan does not provide
tunable consistency (every node has the same data).
Tunable consistency and data distribution are two different things. It's not true that "every node has the same data"; it depends on how you choose to cluster data. Infinispan, like others, offers both replication (all nodes store the same cache) and distribution (each node is responsible for a range of tokens). Tunable consistency in Cassandra means that you can choose how many nodes should be informed about your r/w operation before returning control to the client.
You might need to use Infinispan and not Cassandra directly for many reasons. For instance, if you have a huge amount of memory in your application servers, you may want to keep a bigger or different cache than what you can store inside your Cassandra nodes. Another feature you might need is to plug in the infinispan-query module in order to perform full-text searches without installing a solr/elasticsearch/whatever cluster, or to use the transactional capability within it.
IMHO these two products do not compare directly; they're born for different use cases and offer different features. You can use either one or both, depending on your application architecture and needs.
HTH,
Carlo
If you are using the Cassandra distributed key-value store, you will have several Cassandra nodes, and thus several computers. The data doesn't just sit there, of course, you also have one or more clients programs that communicate with the Cassandra nodes. Computationally intensive work done by the clients might also be distributed over several computers. Should the clients and the Cassandra nodes be separate computers? Is it OK to use the same computer as a Cassandra node and as a Cassandra client? I expect it would work, in the sense of performing correctly, but would there be unacceptable performance problems?
The Cassandra documentation I've seen talks in terms that suggest Cassandra nodes and clients should be separate computers, but I've not seen an explicit recommendation.
Why do I ask? Why might I want to do that? The application I have in mind does not require that the clients store any data locally, they use Cassandra for all persistent data. Their job is computationally intensive, so the bottleneck is likely to be client CPU processing rather than Cassandra processing. Not also using them as Cassandra nodes seems wasteful.
Also, if each computation (client) node is also a Cassandra node, I can use the Cassandra token of each node (used for distributing Cassandra's data) to distribute the client computations.
This is a valid setup for certain types of deployments. The most common case where people do this is when running Hadoop jobs against Cassandra. The Cassandra Wiki recommends you run one Hadoop TaskTracker on each node in your cluster. That type of deployment is similar to what you are describing.
I have a Cassandra cluster and a Zookeeper server installed. Now I want to support transactions in Cassandra using Zookeeper. How do I do that?
Zookeeper creates znodes to perform read and write operations, and data goes back and forth through znodes in Zookeeper. I want to know how to support rollback and commit features in Cassandra using Zookeeper. Is there any way by which we can specify Cassandra configurations in Zookeeper, or Zookeeper configurations in Cassandra?
I know how data is read and written in Cassandra and Zookeeper individually, but I don't know how to integrate the two of them using Java.
How can we do transactions in Cassandra using Zookeeper?
Thanks.
I have a Cassandra cluster and a Zookeeper server installed. Now I want to support transactions in Cassandra using Zookeeper. How do I do that?
With great difficulty. Cassandra does not work well as a transactional system. Writes to multiple rows are not atomic, there is no way to rollback writes if some writes fail, and there is no way to ensure readers read a consistent view when reading.
I want to know how to support rollback and commit features in Cassandra using Zookeeper.
Zookeeper won't help you with this, especially the commit feature. You may be able to write enough information to zookeeper to roll back in case of failure, but if you are doing that, you might as well store the rollback info in cassandra.
Zookeeper and Cassandra work well together when you use Zookeeper as a locking service. Look at the Cages library. Use zookeeper to co-ordinate read/writes to cassandra.
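As a rough sketch of that locking approach (shown here with Apache Curator rather than Cages; the connection string and lock path are placeholders):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.locks.InterProcessMutex;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    // Serialize a read-modify-write on one logical record across clients with a ZooKeeper lock.
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
            "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();

    InterProcessMutex lock = new InterProcessMutex(zk, "/locks/account-42");  // placeholder lock path
    lock.acquire();
    try {
        // ... read the current row from Cassandra, compute the new value, write it back ...
    } finally {
        lock.release();
    }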
Trying to use cassandra as a transactional system with atomic commits to multiple rows and rollbacks is going to be very frustrating.
There are ways to implement transactions in Cassandra without ZooKeeper.
Cassandra itself has a feature called lightweight transactions which provides per-key linearizability and compare-and-set. With such primitives you can implement serializable transactions at the application level yourself.
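A small sketch via the Java driver (the table and columns are hypothetical):

    import com.datastax.oss.driver.api.core.cql.ResultSet;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    // Assumes an existing CqlSession 'session' and bound values for the account.
    // Compare-and-set on a single partition, executed through Cassandra's Paxos implementation.
    ResultSet rs = session.execute(SimpleStatement.newInstance(
        "UPDATE accounts SET balance = ? WHERE account_id = ? IF balance = ?",
        newBalance, accountId, expectedBalance));

    // wasApplied() reports whether the IF condition held and the write was applied.
    if (!rs.wasApplied()) {
        // another client changed the row first: re-read and retry
    }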
Please see the Visualization of serializable cross-shard client-side transactions post for details and a step-by-step visualization.
The variants of this approach are used in Google's Percolator system and in CockroachDB.
By the way, if you're fine with the Read Committed isolation level, then it makes sense to take a look at the RAMP transactions paper by Peter Bailis.
There is a BATCH feature in Cassandra's CQL3 (Cassandra 1.2 is the version that formally released CQL3), which can atomically apply all the updates in the BATCH as one all-or-nothing unit.
This does not mean you can roll back a successfully executed BATCH as an RDBMS could do; that would have to be done manually.
Depending on the consistency level and options you provide to the BATCH statement, the atomicity guarantees of the updates can be increased or decreased to some degree, e.g. with the UNLOGGED option.
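A sketch with a current Java driver (hypothetical tables); note that a LOGGED batch guarantees the statements eventually all apply or none do, but it is not an isolated, rollback-able transaction:

    import com.datastax.oss.driver.api.core.cql.BatchStatement;
    import com.datastax.oss.driver.api.core.cql.DefaultBatchType;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    // Assumes an existing CqlSession 'session' and bound values for the order.
    // LOGGED batch: "eventually all or nothing" across partitions, at a performance cost.
    // Switching to DefaultBatchType.UNLOGGED drops that guarantee.
    BatchStatement batch = BatchStatement.builder(DefaultBatchType.LOGGED)
        .addStatement(SimpleStatement.newInstance(
            "INSERT INTO shop.orders_by_user (user_id, order_id, total) VALUES (?, ?, ?)",
            userId, orderId, total))
        .addStatement(SimpleStatement.newInstance(
            "INSERT INTO shop.orders_by_id (order_id, user_id, total) VALUES (?, ?, ?)",
            orderId, userId, total))
        .build();
    session.execute(batch);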
http://www.datastax.com/docs/1.2/cql_cli/cql/BATCH
Well, I'm not an expert at this (far from it actually), but the way I see it, either you deploy some middleware made by yourself in order to guarantee the specific properties you are looking for, or you can just have Cassandra write data to auxiliary files and then copy them through the file system, since the copy function in Java works as an atomic operation.
I don't know anything about the size of the data files you are considering, so I don't really know if it is doable; however, there might be a way to use this property with smaller bits of information and then combine them as a whole.
Just my 2 cents...