Can a Cassandra cluster serve as a replacement for an in-memory Redis key-value store? - cassandra

My application crawls user's mailbox and saves it to an RDBMS database. I started using Redis as a cache (simple key-value store) for RDBMS database. But gradually I started storing crawler states and other data in Redis that needs to be persistent. Loosing this data means a few hours of downtime. I must ensure airtight consistency for this data. The data should not be lost in node failures or split brain scenarios. Strong consistency is a must. Sharding is done by my application. One Redis process runs on each of ten EC2 m4.large instances. On each of these instances. I am doing up to 20K IOPS to Redis. I am doing more writes than reads, though I have not determined the actual percentage of both. All my data is completely in memory, not backed by disk.
My only problem is each of these instances are SPOF. I cannot use Redis cluster as it does not guarantee consistency. I have evaluated a few more tools like Aerospike, none gives 'No data loss guarantee'.
Cassandra looks promising as I can tune the consistency level I want. I plan to use Cassandra with a replication factor 2, and a write must be written to both the replicas before considered committed. This gives 'No data loss guarantee.
By launching enough cassandra nodes (ssd backed) can I replace my Redis key-value store and still get similar read/write IOPS and
latency? Will opensource cassandra suffice my use case? If not, will the Datastax enterprise In-Memory version solve it?
EDIT 1:
A bit of clarification:
I think I need to use Write consistency level 'ALL' and Read consistency level 'One'. I understand that with this consistency level my cluster will not tolerate any failure. That is OK for me. A few minutes of downtime occasionally is not a problem, as long as my data is consistent. In my present setup, one Redis instance failure causes a few hours of downtime.

I must ensure airtight consistency for this data.
Cassandra deals with failure better when there are more nodes. Assuming your case allows for having more nodes, this is my suggestion.
So, if you have 5 nodes, use CL of QUORUM for both READ and WRITE. What it means is that you always write to at least 3 nodes and read from 3 nodes.(for 5 nodes , QUORUM is 3).
This ensures a very high level consistency
Also ensures limited downtime. Even if a node is down your writes and reads won't break.
If you use CL ALL, then even if one node is down or overloaded, you will have to take a full downtime.
I hope it helps!

Related

How cassandra improve performance by adding nodes?

I'm going build apache cassandra 3.11.X cluster with 44 nodes. Each application server will have one cluster node so that application do r/w locally.
I have couple of questions running in my mind kindly answer if possible.
1.How many server Ip's should mention in seednode parameter?
2.How HA works when all the mentioned seed node goes down?
3.What is the dis-advantage to mention all the serverIP's in seednode parameter?
4.How cassandra scales with respect to data other than(Primary key and Tunable consistency). As per my assumption replication factor can improve HA chances but not performances.
then how performance will increase by adding more nodes?
5.Is there any sharding mechanism in Cassandra.
Answers are in order:
It's recommended to point to at least to 2 nodes per DC
Seed/contact node is used only for initial bootstrap - when your program reaches any of listed nodes, it "learns" the topology of whole cluster, and then driver listens for nodes status change, and adjust a list of available hosts. So even if seed node(s) goes down after connection is already established, driver will able to reach other nodes
it's harder to maintain usually - you need to keep a configuration parameters for your driver & list of nodes in sync.
When you have RF > 1, Cassandra may read or write data from/to any replica. Consistency level regulates how many nodes should return answer for read or write operation. When you add the new node, the data is redistributed to new node, and if you have correctly selected partition key, then new node start to receive requests in parallel to old nodes
Partition key is responsible for selection of replica(s) that will hold data associated with it - you can see it as a shard. But you need to be careful with selection of partition key - it's easy to create too big partitions, or partitions that will be "hot" (receiving most of operations in cluster - for example, if you're using the date as partition key, and always writing reading data for today).
P.S. I would recommend to read DataStax Architecture guide - it contains a lot of information about Cassandra as well...

practicallity of having a single node cassandra multisitecluster(3 way)

Is it possible to put a Cassandra cluster with single node DC with 2 remote DC which is also having a single node assuming the replication factor is required to be 3 in this case? The remote cluster is in the same geographical area but not same building for HA. Or is there any hard rules that for high availability and consistency for a need for a local quorum node to achieve that?
Our setup may be smaller compared to big data and usually used to store time series data with approximately 2000/3000 (on different key) sampling per second.
Is there other implications other than read/write may be slow due to the comms delay?
disclaimer: I am new to cassandra.
Turns out I want to deploy a similar setup: 3 nodes on aws, each in its own AZ (But all in the same region). from what I read, this setup is just a single DC, with 3 nodes.
You need to use Ec2Snitch to reduce the latency between your clients and the nodes.
Using RF=3 provides you with the HA that you need, since every node has all the data
Inter-AZ communication should be fairly fast. refer to this: http://highscalability.com/blog/2016/8/1/how-to-setup-a-highly-available-multi-az-cassandra-cluster-o.html
becuase you'll be running in a single DC, local-quorum == quorum. so as long as you'll be writing to QUROUM (which requires 2/3 nodes (AZs) to be up), you'll be strongly consistent and HA.

when does Cassandra node fail?

How does cassandra guarantee no failure of node at any given point of time,i know data is replicated so there might not be issues of losing the data
Cassandra nodes can fail due to alot of reasons like, very heavy write, out of memory error, hardware failure, tombstone limit 100k error, compaction failures, network errors, and so on.
Cassandra cannot guarantee no failure of node, because it just like any other software is vulnerable to dependent component and hardware.
What it does guarantee is that you won't have data loss, until you have minimum number of required nodes up and running, based on replication factor.
Cassandra could not guarantee no failure of nodes like any other systems, but with a correct setup of cassandra cluster, with enough number of nodes and replicas configured, even some of the nodes down, the entire cluster will still be available and no data lost, which could be transparent to clients. Clients will not realize it.

Configure cassandra to use different network interfaces for data streaming and client connection?

I have a cassandra cluster deployed with 3 cassandra nodes with replication factor of 3. I have a lot of data being written to cassandra on daily basis (10-15GB). I have provisioned these cassandra on commodity hardware as suggested by "Big data community" and I am expecting the nodes to go down frequently which is handled using redundancy provided by cassandra.
My problem is, I have observed cassandra to slow down with writes when a new node is provisioned and the data is being streamed while bootstrapping. So, to overcome this hurdle, We have decided to have a separate network interface for inter-node communication and for client application to write data to cassandra. My question is how can this be configured, if at all this is possible ?
Any help is appreciated.
I think you are chasing the wrong solution.
I am confused by the fact that you only have 3 nodes, yet your concern is around slow writes while bootstrapping. Why? Are you planning to grow your cluster regularly? What is your consistency level on write, as this has a big impact on performance? Obviously if you only have 2 or 3 nodes and you're trying to bootstrap, you will see a slowdown, because you're tying up a significant percentage of your cluster to do the streaming.
Note that "commodity hardware" doesn't mean cheap, low-performance hardware. It just means you don't need the super high-end database-class machines used for databases like Oracle. You should still use really good commodity hardware. You may also need more nodes, as setting RF equal to cluster size is not typically a great idea.
Having said that, you can set your listen_address to the inter-node interface and rpc_address to the client address if you feel that will help.

Cassandra consistency model performance evaluation

Hi I am a student and am trying to evaluate the latency(Insert, read and Upsert) of cassandra for different consistency models and for different replication factors.
I am using Virtual box on my host system and have 10 ubuntu VMs to form a cluster.
When I run the tests, sometimes the average latency comes out lesser for a stronger consistency model.
Also the latency does not increase as I increase the replication factor in some cases which is also not an expected result.
I wanted to know what all could be the possible reasons for such behavior?
There are a few things:
Performance benchmarks using virtual box on a single system will give you very different resutls from a live cluster. For instance, network latencies would be considerably reduced. A real cluster would have different resources available whereas vbox instances are sharing the same resources. Even on a cloud platform, you'd see different numbers.
When a write request comes in, the coordinator sends to all required replicas a write request in parallel. They all process the write and respond. If your lower consistency write went to a busy node, and the higher consistency write went to enough "faster / available" nodes to make a quorum, then the latter will have lower latency. Also, increasing the replication factor means the data is available in more nodes. So reads can be faster (depending on consistency levels).

Resources