How to make Cassandra nodes have the same data?

How to make Cassandra nodes have the same data? - cassandra

I have two computers each one being a Cassandra node and they communicate well with each other.
From what I understand Cassandra will replicate the data to each other but will always query certain portions from one of them.
I would like though to have the data being copied to each other so they have the same data but they only use data from the local node. Is that possible?
The background reason is that the application in each node keeps generating and downloading a lot of data and at the same time both are doing some CPU super intensive tasks. What happens is that one node saves the data and suddenly can't find it anymore because it has been saved in the other node which is busy enough to reply with that data.

Technically, you just need to change replication factor to a number of nodes, and set your application to always read from a local node using whitelist load balancing mode. But it may not help you because if your nodes are very busy, replication of data from another node may also not happen, so the query will fail as well. Or replication will add an additional overhead making the situation much worse.
You need to rethink your approach - typically, you need to separate application nodes from database nodes, so application processes doesn't affect database processes.

Related

How does peer to peer architecture work in Cassandra?

How the peer-to-peer Cassandra architecture really works ? I mean :
When the request hits the Cluster, it must hit some machine based on an IP, right ?
So which machine it will hit first ? : one of the nodes, or something in the Cluster who is responsible to balance and redirect the request to the right node ?
Could you describe what it is ? And how this differ from the Master/Folowers architecture ?

For the purposes of my answer, I will use the Java driver as an example since it is the most popular.
When you connect to a cluster using one of the driver, you need to configure it with details of your cluster including:
Contact points - the entry point to your cluster which is a comma-separated list of IPs/hostnames for some of the nodes in your cluster.
Login credentials - username and password if authentication is enabled on your cluster.
SSL/TLS certificate and credentials - if encryption is enabled on your cluster.
When your application starts, a control connection is established with the first available node in the list of contact points. The driver uses this control connection for admin tasks such as:
get topology information about the cluster including node IPs, rack placement, network/DC information, etc
get schema information such as keyspaces and tables
subscribe to metadata changes including topology and schema updates
When you configure the driver with a load-balancing policy (LBP), the policy will determine which node the driver will pick as the coordinator for each and every single query. By default, the Java driver uses a load balancing policy which picks nodes in the local datacenter. If you don't specify which DC is local to the app, the driver will set the local DC to the DC of the first contact point.
Each time a driver executes a query, it generates a query plan or a list of nodes to contact. This list of nodes has the following characteristics:
A query plan is different for each query to balance the load across nodes in the cluster.
A query plan only lists available nodes and does not include nodes which are down or temporarily unavailable.
Nodes in the local DC are listed first and if the load-balancing policy allows it, remote nodes are included last.
The driver tries to contact each node in the query plan in the order they are listed. If the first node is available then the driver uses it as the coordinator. If the first node does not respond (for whatever reason), the driver tries the next node in the query plan and so on.
Finally, all nodes are equal in Cassandra. There is no active-passive, no leader-follower, no primary-secondary and this makes Cassandra a truly high availability (HA) cluster with no single point-of-failure. Any node can do the work of any other node and the load is distributed equally to all nodes by design.
If you're new to Cassandra, I recommend having a look at datastax.com/dev which has lots of free hands-on interactive learning resources. In particular, the Cassandra Fundamentals learning series lets you learn the basic concepts quickly.
For what it's worth, you can also use the Stargate.io data platform. It allows you to connect to a Cassandra cluster using APIs you're already familiar with. It is fully open-source so it's free to use. Here are links to the Stargate tutorials on datastax.com/dev: REST API, Document API, GraphQL API, and more recently gRPC API. Cheers!

Working with Cassandra, we have to remember two very important things: data is partitioned (split into chunks) and data is replicated (each chunk is stored on a few different servers). Partitioning is needed for scalability purposes while Replication serves High Availability. Given that Cassandra is designed to handle petabytes of data under huge pressure (dozens of millions of queries per second), and there is no single server able to handle such the load, each cluster server is responsible only for a range of data, not for the whole dataset. A node storing data you need for a particular query is called a "replica node". Notice that the different queries there will have different replica nodes.
Together, it brings a few implications:
We have to reach multiple servers during a single query to assure the data is consistent (read) / write data to all responsible servers (write).
How do we know which node is right for that particular query? What happens if a query hits a "wrong" node? How do we configure the application so it sends queries to the replica nodes?
Funny enough, as a developer you have to do one and only one thing: understand partitions and partition keys, and then Cassandra will take care of all the potential issues. Simple as that. When you design a table, you have to declare partition keys and the data placement will be based on that - automagically. Next thing, you have to always specify partition keys while doing your queries. That's it, your job is done, get yourself some coffee!
Meanwhile, Cassandra starts her job. Cassandra nodes are smart, they know data placement, they know what servers are responsible for the data you are writing, and they know the partitions - in Cassandra language it's called token-aware. That does not matter which server will receive the query, as literally every server is able to answer it. Any node that got the request (it's called query coordinator because it coordinates the query operations) will find replica nodes based on the placement of the partitions. With that, the query coordinator will execute the query, making proper calls to the replicas - the coordinator knows which nodes to ask because you did your part of the job and specified partition key value in the query, which is used for the routing.
In short, you can ask any of your cluster nodes to write/read your data, Cassandra is decentralized and you'll get it done. But how do we make it better and get directly to the replica to avoid bothering nodes that don't store our data?
So which machine it will hit first ?
The travel of a request starts much earlier than we could think of - when your application starts, a Cassandra driver connects to a cluster and reads information about data placement: which partition is stored on which nodes, It means that driver knows which node has to be contacted for different queries. You got it right, a driver is token-aware too!
Token-aware drivers understand data placement and will route a query to a proper replica node. Answering the question: under normal circumstances, your query will first hit one of the replica nodes, this node will get answers or write data to the other replica nodes and that's it, we are good. In some rare situations, your query may hit a "wrong" non-replica server, but it doesn't really matter as it also will do the job, with just a minor delay - for example, if your Replication Factor = 3 (you have three replicas), and your query got to a "wrong" node, it will have to ask all three replicas while hitting the "right one" still require 2 network operations. It's not a big deal though as all the operations are done in parallel.
how this differ from the Master/Folowers architecture
With leader/follower architecture, you can read from any server but you can write only to a leader server, which gives two issues:
Your app needs to know who is the leader (or you need to have a special proxy)
Single Point of Failure (SPoF) - if the leader is down, you can't write to the DB at all
With Cassandra's peer-to-peer architecture you can write to any of the cluster nodes, even if there are thousands of them. Of course, there is no SPoF.
P.S. Cassandra is an extremely powerful technology, but great power comes with great responsibility, it's quite complex too. If you plan to work with it, you better invest some time into learning to use it properly. I do suggest taking a Developer Path on the academy.datastax.com (it's free!) or at least watch DataStax "Intro to Cassandra" workshop

It is based on the driver that you used to connect to the Cassandra™ cluster. Again, all nodes in the datacenter are one and same. It would connect to any of the nodes the localdatacenter that you have provided in driver configs based on the contact points configuration (i.e. datastax-java-driver.basic.contact-points in Java Driver).
For example, the Java driver (& most drivers logic will be the same) uses system.peers.rpc-address to connect to newly discovered nodes. For special network topologies, an address translation component can be plugged in.
advanced.address-translator in the configuration.
none by default. Also available: EC2-specific (for deployments that span multiple regions), or write your own.
Each node in the Cassandra cluster is uniquely identified by an IP address that the driver will use to establish connections.
for contact points, these are provided as part of configuring the CqlSession object;
for other nodes, addresses will be discovered dynamically, either by inspecting system.peers on already connected nodes, or via push notifications received on the control connection when new nodes are discovered by gossip.
More info can be found here.

It seems you are asking how specifically Cassandra selects which Node gets hit with data and which ones doesn't.
There are two sides to this: the client and the servers
On the client
When a CQL Connection is established the client (if implemented in the client library and configured) usually also retrieves the Topology from the Cluster. A topology is the information about the token ownership inside the ring as well as information about quorums etc..
So the client itself can already make a decision on the next request what Node to contact for a certain amount of information due to Consistent Hashing of the primary keys in Cassandra. The client is aware who would be the right choice of Node to contact.
But still the client can choose not to use this information and just send the information to any node of the ring - the nodes will then forward the requests to the appropriate token owners -> See the next section.
In the Cluster
The same applies to the nodes themselves. If a client sends a request to a node it will simply look up the owner nodes in it's topology table and forward the request to exactly the nodes that do own this token.
It will always forward it to all of them so the data is consistent across the cluster. Depending on the replication factor it will return a success response to the client if the required replication is acknowledged by the cluster (eg. LOCAL_QUORUM with RF=3 will return a success response when 2 nodes acknowledge the receipt while the 3rd node is still pending).
If a node is detected as down or can't be reached the Command that would have been sent to the node is saved in the local hints table - a buffer that keeps all the operations that haven't been successfully sent to other nodes.
You can read more on Hints in the Cassandra Docs
Compared to a Leader/Follower architecture the Cassandra model is actually simpler and depends mostly on all involved nodes seeing all the mutation commands happening to the data they "own" via the tokens.

Hazelcast Embedded Topology | Latency increases with number of nodes in cluster

We are running a 5 node cluster of Hazelcast in Embedded mode.
We are running a simple use case of locking using Hazelcast IMap APi.
However, the latency of request flow increases linearly
with addition of nodes.Is this expected?
Thanks.

It depends on the data structure, but in general "yes".
For IMap the data is spread across the available nodes.
If you have a 3 node cluster, you have the primary copy of 1/3 of the data locally. If you are accessing randomly, then you'll find 66.66% of the calls need to go to other nodes, so will see the impact of the network.
If you expand this to a 5 node cluster, then you have primary copy of 1/5 of the data locally. For the same random access, now it's 80% of the calls involve the network.
As the number of nodes goes up, the benefits of data locality in embedded mode reduce.
Note also this is for random access, if you frequently access the same key you could be lucky and it's local or unlucky and it's remote.

Janusgraph components that scale

If I understand right multiple gremlin servers don't communicate with each other. The scale is in the cassandra/ES only.
If that is true how many vertexes can each gremlin server support?
When the graph is updated by one gremlin server when will the other gremlin servers see that change?
Thanks!

The number of vertices supported is 500 trillion (2^59)
The storage backend is the sole source of state between multiple Gremlin servers. The number of vertices will not be increased by adding additional Gremlin servers.
The limitations on the number of vertices is outlined in the Technical Limitations Page in the JanusGraph Manual.
When one Gremlin Server sees changes made by another is determined by the storage backend choice, but it's still tricky to answer
As far as when the other Gremlin servers will see changes, that is a bit tricky to answer. If you are using a consistent data backend, the answer will generally be as soon as Gremlin finishes its transaction.
But Cassandra is a different beast.
Using an eventually consistent storage backend
Cassandra is what's known as an eventually-consistent database. This means that it trades transactional consistency for availability and partition tolerance; even if you started to lose nodes in the cluster, it will continue to function and serve requests.
The downside to this is that mutations in Cassandra do not instantly become available to consumers; you can even have the case where a client writes a change to Cassandra and that very same client doesn't see the change if they immediately try to read that data.
Chapter 31 in the JanusGraph Manual covers dealing with an eventually consistent storage backend like Cassandra.
Realistically, the amount of time between a mutation and all clients being able to see the mutation in Cassandra depends entirely upon the data load, the nature of the write, and the read/write consistency levels that JanusGraph is configured to read and write to Cassandra with.

Can a Cassandra cluster serve as a replacement for an in-memory Redis key-value store?

My application crawls user's mailbox and saves it to an RDBMS database. I started using Redis as a cache (simple key-value store) for RDBMS database. But gradually I started storing crawler states and other data in Redis that needs to be persistent. Loosing this data means a few hours of downtime. I must ensure airtight consistency for this data. The data should not be lost in node failures or split brain scenarios. Strong consistency is a must. Sharding is done by my application. One Redis process runs on each of ten EC2 m4.large instances. On each of these instances. I am doing up to 20K IOPS to Redis. I am doing more writes than reads, though I have not determined the actual percentage of both. All my data is completely in memory, not backed by disk.
My only problem is each of these instances are SPOF. I cannot use Redis cluster as it does not guarantee consistency. I have evaluated a few more tools like Aerospike, none gives 'No data loss guarantee'.
Cassandra looks promising as I can tune the consistency level I want. I plan to use Cassandra with a replication factor 2, and a write must be written to both the replicas before considered committed. This gives 'No data loss guarantee.
By launching enough cassandra nodes (ssd backed) can I replace my Redis key-value store and still get similar read/write IOPS and
latency? Will opensource cassandra suffice my use case? If not, will the Datastax enterprise In-Memory version solve it?
EDIT 1:
A bit of clarification:
I think I need to use Write consistency level 'ALL' and Read consistency level 'One'. I understand that with this consistency level my cluster will not tolerate any failure. That is OK for me. A few minutes of downtime occasionally is not a problem, as long as my data is consistent. In my present setup, one Redis instance failure causes a few hours of downtime.

I must ensure airtight consistency for this data.
Cassandra deals with failure better when there are more nodes. Assuming your case allows for having more nodes, this is my suggestion.
So, if you have 5 nodes, use CL of QUORUM for both READ and WRITE. What it means is that you always write to at least 3 nodes and read from 3 nodes.(for 5 nodes , QUORUM is 3).
This ensures a very high level consistency
Also ensures limited downtime. Even if a node is down your writes and reads won't break.
If you use CL ALL, then even if one node is down or overloaded, you will have to take a full downtime.
I hope it helps!

How can Cassandra have no single point of failure when it has no master and data is not replicated but distributed?

Perhaps I am misunderstanding, but the Apache Cassandra Wikipedia article says:
"Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node contains different data), but there is no master as every node can service any request."
How can each node contain different data, but there be no single point of failure? For instance, I would imagine that in this senario, if a node when down containing the record I was querying, then a different node would pickup that request, however, it would not have the data to satisfy it..since that data was on the node that went down..
Can someone clear this up for me?
Thanks!

Cassandra clusters do replicate data across the nodes. The specific number of replicas is configurable, but generally production clusters will use a replication factor of 3. This means that a given row will be stored on three different machines in the cluster. See the reference documentation on replication for more details.
In terms of servicing requests, if a node receives a request for data that it does not have it will forward that request to the nodes that do own the data.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string