Practicality of having a single-node Cassandra multisite cluster (3-way) - cassandra

Is it possible to build a Cassandra cluster from a single-node DC plus two remote DCs that each also have a single node, assuming the replication factor needs to be 3 in this case? The remote DCs are in the same geographical area, but not the same building, for HA. Or is there a hard rule that high availability and consistency require a local quorum of nodes?
Our setup is small compared to typical big data deployments and is mostly used to store time series data, at approximately 2000-3000 samples per second (on different keys).
Are there other implications, besides reads/writes possibly being slow due to the communication delay?

Disclaimer: I am new to Cassandra.
It turns out I want to deploy a similar setup: 3 nodes on AWS, each in its own AZ (but all in the same region). From what I have read, this setup is just a single DC with 3 nodes.
You need to use Ec2Snitch to reduce the latency between your clients and the nodes.
Using RF=3 provides you with the HA that you need, since every node has all the data.
Inter-AZ communication should be fairly fast. Refer to this: http://highscalability.com/blog/2016/8/1/how-to-setup-a-highly-available-multi-az-cassandra-cluster-o.html
Because you'll be running in a single DC, LOCAL_QUORUM == QUORUM. So as long as you write at QUORUM (which requires 2 of the 3 nodes/AZs to be up), you'll be strongly consistent and HA.
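For illustration, here is a minimal sketch of that setup with the DataStax Java driver 4.x. The keyspace, table, and datacenter name are assumptions, not part of the question above; the DC name must match whatever your snitch reports (e.g. the region name when using Ec2Snitch).

```java
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class QuorumWriteSketch {
    public static void main(String[] args) {
        // Connects to localhost:9042 by default; point it at one of your nodes in practice.
        try (CqlSession session = CqlSession.builder().build()) {
            // RF=3 in a single DC: every node holds a full copy of the data.
            // 'datacenter1' is the default DC name; with Ec2Snitch it is the region name.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS sensors WITH replication = "
              + "{'class': 'NetworkTopologyStrategy', 'datacenter1': 3}");
            session.execute(
                "CREATE TABLE IF NOT EXISTS sensors.readings ("
              + " sensor_id text, ts timestamp, value double,"
              + " PRIMARY KEY (sensor_id, ts))");

            // QUORUM (== LOCAL_QUORUM in a single DC) needs 2 of the 3 replicas,
            // so this write still succeeds if one node/AZ is down.
            SimpleStatement insert = SimpleStatement.builder(
                    "INSERT INTO sensors.readings (sensor_id, ts, value) "
                  + "VALUES ('sensor-1', toTimestamp(now()), 42.0)")
                .setConsistencyLevel(ConsistencyLevel.QUORUM)
                .build();
            session.execute(insert);
        }
    }
}
```

Reading at QUORUM as well gives you read-your-writes consistency, since any two replicas out of three overlap in at least one node.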

Related

Hazelcast Embedded Topology | Latency increases with number of nodes in cluster

We are running a 5 node cluster of Hazelcast in Embedded mode.
We are running a simple use case of locking using Hazelcast IMap APi.
However, the latency of the request flow increases linearly with the addition of nodes. Is this expected?
Thanks.
It depends on the data structure, but in general "yes".
For IMap the data is spread across the available nodes.
If you have a 3 node cluster, you have the primary copy of 1/3 of the data locally. If you are accessing keys randomly, then 66.66% of the calls need to go to other nodes, so they will see the impact of the network.
If you expand this to a 5 node cluster, then you have the primary copy of 1/5 of the data locally. For the same random access, 80% of the calls now involve the network.
As the number of nodes goes up, the benefits of data locality in embedded mode reduce.
Note also that this is for random access; if you frequently access the same key, you could be lucky and it's local, or unlucky and it's remote.
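To make this concrete, here is a minimal sketch of the locking pattern described in the question in embedded mode (assuming Hazelcast 5.x; the map and key names are made up). The timing line only illustrates where the remote hop shows up.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class LockLatencySketch {
    public static void main(String[] args) {
        // Each cluster member runs an embedded instance like this in its own JVM.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> locks = hz.getMap("resource-locks");

        String key = "order-42";
        long start = System.nanoTime();
        locks.lock(key);              // routed to the member that owns this key's partition
        try {
            // ... critical section ...
        } finally {
            locks.unlock(key);
        }
        long micros = (System.nanoTime() - start) / 1_000;
        // With N members, roughly (N-1)/N of random keys live on a remote member,
        // so the average lock latency grows as the cluster grows.
        System.out.println("lock/unlock of " + key + " took " + micros + " µs");

        hz.shutdown();
    }
}
```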

How does Cassandra improve performance by adding nodes?

I'm going to build an Apache Cassandra 3.11.x cluster with 44 nodes. Each application server will have one cluster node so that the application does reads/writes locally.
I have a couple of questions in my mind; kindly answer them if possible.
1. How many server IPs should be mentioned in the seed node parameter?
2. How does HA work when all the mentioned seed nodes go down?
3. What is the disadvantage of mentioning all the server IPs in the seed node parameter?
4. How does Cassandra scale with respect to data, other than via the primary key and tunable consistency? My assumption is that the replication factor can improve HA but not performance, so how will performance increase by adding more nodes?
5. Is there any sharding mechanism in Cassandra?
Answers are in order:
It's recommended to point to at least 2 nodes per DC.
The seed/contact node is used only for the initial bootstrap: when your program reaches any of the listed nodes, it "learns" the topology of the whole cluster, and the driver then listens for node status changes and adjusts its list of available hosts. So even if the seed node(s) go down after the connection is established, the driver will still be able to reach the other nodes.
It's usually harder to maintain - you need to keep the configuration parameters for your driver and the list of nodes in sync.
When you have RF > 1, Cassandra may read or write data from/to any replica. The consistency level regulates how many nodes should return an answer for a read or write operation. When you add a new node, data is redistributed to it, and if you have chosen your partition key well, the new node starts to receive requests in parallel with the old nodes.
The partition key is responsible for selecting the replica(s) that will hold the data associated with it - you can see it as a shard. But you need to be careful with the choice of partition key: it's easy to create partitions that are too big, or partitions that are "hot" (receiving most of the operations in the cluster - for example, if you use the date as the partition key and are always writing and reading today's data). See the sketch after this answer.
P.S. I would recommend reading the DataStax Architecture guide - it contains a lot of information about Cassandra as well.
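As a sketch of points 1-2 and 5 (contact points and partition key choice) using the DataStax Java driver 4.x; the IP addresses, DC name, and table are hypothetical, and the keyspace is assumed to already exist:

```java
import java.net.InetSocketAddress;
import com.datastax.oss.driver.api.core.CqlSession;

public class ContactPointsAndPartitioningSketch {
    public static void main(String[] args) {
        // List 2-3 contact points per DC; they are only used for the initial topology
        // discovery, after which the driver tracks node up/down events on its own.
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("10.0.0.1", 9042))
                .addContactPoint(new InetSocketAddress("10.0.0.2", 9042))
                .withLocalDatacenter("dc1") // must match the DC name reported by your snitch
                .build()) {

            // Bucketing the partition key by sensor AND day spreads a time series over
            // many partitions, avoiding a single oversized or "hot" partition.
            session.execute(
                "CREATE TABLE IF NOT EXISTS metrics.samples ("
              + "  sensor_id text,"
              + "  day date,"
              + "  ts timestamp,"
              + "  value double,"
              + "  PRIMARY KEY ((sensor_id, day), ts))");
        }
    }
}
```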

Repartitioning of data in Cassandra when adding new servers

Let's imagine I have a Cassandra cluster with 3 nodes, each having 100GB of available hard disk space. Replication Factor for this cluster is set to 3 and R/W CLs are set to 2, meaning I can tolerate one of my nodes going down without sacrificing consistency or availability.
Now imagine my servers have started to fill up (80GB as an example) and I would like to add another 3 servers of the same specification to my cluster, maintaining the same CLs and RFs.
My question is: after I've added the new nodes to my cluster and run the node repair tool, is it fair to assume that each of my nodes should contain roughly 40GB of data (give or take a few GB)?
If not, how can I add new nodes without having the fear of running out of hard disk space?
A little background on why I'm asking this question: I am developing an app that connects to a server that runs Cassandra for its data storage. As this is developed only by me, and I have limited money to buy servers, I've decided to buy small, cheap "servers" instead of the more expensive rack options, but I'm really worried about the nodes running out of space if the disk allocation is not (at least partially) homogeneous.
Many thanks for your help.
My question is: after I've added the new nodes to my cluster and run the node repair tool, is it fair to assume that each of my nodes should contain roughly 40GB of data (more or less a few GB)?
After also running nodetool cleanup you should see roughly 40GB of data on each node. Cleanup removes data which the node is no longer responsible for. If you don't run this command the old data will remain on the machine.

Configure cassandra to use different network interfaces for data streaming and client connection?

I have a Cassandra cluster deployed with 3 nodes and a replication factor of 3. A lot of data is written to Cassandra on a daily basis (10-15GB). I have provisioned these Cassandra nodes on commodity hardware, as suggested by the "big data community", and I expect the nodes to go down frequently, which is handled using the redundancy Cassandra provides.
My problem is that I have observed Cassandra slowing down on writes when a new node is provisioned and data is being streamed to it while bootstrapping. So, to overcome this hurdle, we have decided to have a separate network interface for inter-node communication and one for the client application to write data to Cassandra. My question is: how can this be configured, if it is possible at all?
Any help is appreciated.
I think you are chasing the wrong solution.
I am confused by the fact that you only have 3 nodes, yet your concern is around slow writes while bootstrapping. Why? Are you planning to grow your cluster regularly? What is your consistency level on write, as this has a big impact on performance? Obviously if you only have 2 or 3 nodes and you're trying to bootstrap, you will see a slowdown, because you're tying up a significant percentage of your cluster to do the streaming.
Note that "commodity hardware" doesn't mean cheap, low-performance hardware. It just means you don't need the super high-end database-class machines used for databases like Oracle. You should still use really good commodity hardware. You may also need more nodes, as setting RF equal to cluster size is not typically a great idea.
Having said that, you can set your listen_address to the inter-node interface and rpc_address to the client address if you feel that will help.

Cassandra consistency model performance evaluation

Hi, I am a student and am trying to evaluate the latency (insert, read and upsert) of Cassandra for different consistency models and for different replication factors.
I am using VirtualBox on my host system and have 10 Ubuntu VMs forming a cluster.
When I run the tests, sometimes the average latency comes out lower for a stronger consistency model.
Also, in some cases the latency does not increase as I increase the replication factor, which is also not an expected result.
I wanted to know what the possible reasons for such behavior could be.
There are a few things:
Performance benchmarks using VirtualBox on a single system will give you very different results from a live cluster. For instance, network latencies would be considerably reduced. A real cluster would have dedicated resources per node, whereas the VirtualBox instances are sharing the same resources. Even on a cloud platform, you'd see different numbers.
When a write request comes in, the coordinator sends a write request to all required replicas in parallel. They all process the write and respond. If your lower-consistency write went to a busy node, and the higher-consistency write went to enough "faster / available" nodes to make a quorum, then the latter will have lower latency. Also, increasing the replication factor means the data is available on more nodes, so reads can be faster (depending on consistency levels).
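If it helps, here is a rough sketch of how one might compare insert latency across consistency levels from a single client (DataStax Java driver 4.x assumed; the bench keyspace and table are made up). Keep in mind that on VirtualBox the client competes with the VMs for the same host resources, which alone can invert the expected ordering.

```java
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ConsistencyLatencySketch {

    // Average insert latency in microseconds at the given consistency level.
    static long avgInsertMicros(CqlSession session, ConsistencyLevel cl, int iterations) {
        SimpleStatement insert = SimpleStatement.builder(
                "INSERT INTO bench.kv (id, value) VALUES (uuid(), 'x')")
            .setConsistencyLevel(cl)
            .build();
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            session.execute(insert);
        }
        return (System.nanoTime() - start) / iterations / 1_000;
    }

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS bench WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
            session.execute("CREATE TABLE IF NOT EXISTS bench.kv (id uuid PRIMARY KEY, value text)");

            // Warm up so JIT compilation and connection setup don't skew the first measurement.
            avgInsertMicros(session, ConsistencyLevel.ONE, 1_000);

            for (ConsistencyLevel cl : new ConsistencyLevel[] {
                    ConsistencyLevel.ONE, ConsistencyLevel.QUORUM, ConsistencyLevel.ALL}) {
                System.out.printf("CL=%s avg insert latency: %d µs%n",
                    cl, avgInsertMicros(session, cl, 10_000));
            }
        }
    }
}
```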
