How does Apache Cassandra mash with Infinispan? - cassandra

I have checked the main features of Cassandra and Infinispan. They seem to have and deliver pretty similar characteristics and functionalities:
NoSQL data store
persistance
decentralized
support replication
scalability
fault tolerant
MapReduce support
Queries
One difference I have found out is that Infinispan does not provide tunable consistency (every node has the same data).
When learning about the Infinispan I came across Cassandra Cache Store (http://infinispan.org/docs/cachestores/cassandra/). It provides persistance of data.
But then why I would still want to use Infinispan and not Cassandra directly?
Do these solutions complement each other or they are more competing on the same level?

Infinispan is mainly used as a distributed cache, like memcached/hazelcast and so on.
Natively data are written in memory but you can persist them into what they call "cache stores" -- there are many cache-stores ready (for File/Cassandra/Hbase/Mongo) or you can make your own implementation.
One difference I have found out is that Infinispan does not provide
tunable consistency (every node has the same data).
Tunable consistency and data distribution are two different things. It's not true that "every node has the same data", it depends on how you choose to cluster data. Infinispan, like others, offers both replication (all nodes stores same cache) and distribution (each node will be responsible for a range of tokens). Tunable consistency in Cassandra means that you can choose how many nodes should be informed about your r/w operation before returning the control to the client.
You might need to use Infinispan and not Cassandra directly for many reasons. If for instance you have huge amount of memory in your application servers and you want keep a bigger/different cache than what you can store inside your Cassandra nodes. Other feature you might need is to plug the infinispan-query module in order to perform full-text searches without installing a solr/elasticsearch/whatever cluster or use the transactional capability within is.
IMHO these two products does not compare directly, they're born for different use cases and offers different features. You can use any, one or both, depend on what's your application architecture and needs.
HTH,
Carlo

Related

Can a Cassandra cluster serve as a replacement for an in-memory Redis key-value store?

My application crawls user's mailbox and saves it to an RDBMS database. I started using Redis as a cache (simple key-value store) for RDBMS database. But gradually I started storing crawler states and other data in Redis that needs to be persistent. Loosing this data means a few hours of downtime. I must ensure airtight consistency for this data. The data should not be lost in node failures or split brain scenarios. Strong consistency is a must. Sharding is done by my application. One Redis process runs on each of ten EC2 m4.large instances. On each of these instances. I am doing up to 20K IOPS to Redis. I am doing more writes than reads, though I have not determined the actual percentage of both. All my data is completely in memory, not backed by disk.
My only problem is each of these instances are SPOF. I cannot use Redis cluster as it does not guarantee consistency. I have evaluated a few more tools like Aerospike, none gives 'No data loss guarantee'.
Cassandra looks promising as I can tune the consistency level I want. I plan to use Cassandra with a replication factor 2, and a write must be written to both the replicas before considered committed. This gives 'No data loss guarantee.
By launching enough cassandra nodes (ssd backed) can I replace my Redis key-value store and still get similar read/write IOPS and
latency? Will opensource cassandra suffice my use case? If not, will the Datastax enterprise In-Memory version solve it?
EDIT 1:
A bit of clarification:
I think I need to use Write consistency level 'ALL' and Read consistency level 'One'. I understand that with this consistency level my cluster will not tolerate any failure. That is OK for me. A few minutes of downtime occasionally is not a problem, as long as my data is consistent. In my present setup, one Redis instance failure causes a few hours of downtime.
I must ensure airtight consistency for this data.
Cassandra deals with failure better when there are more nodes. Assuming your case allows for having more nodes, this is my suggestion.
So, if you have 5 nodes, use CL of QUORUM for both READ and WRITE. What it means is that you always write to at least 3 nodes and read from 3 nodes.(for 5 nodes , QUORUM is 3).
This ensures a very high level consistency
Also ensures limited downtime. Even if a node is down your writes and reads won't break.
If you use CL ALL, then even if one node is down or overloaded, you will have to take a full downtime.
I hope it helps!

Cassandra vs HBase Consistency Model

How is Cassandra's eventual consistency model different from HBase? It seems Facebook moved from Cassandra to HBase because consistency issues. Which of these NoSQL DBs are ideal for scale and performance with consistency as near as possible to 'immediate'. What is the factor by which performance degrades when we try to improve upon consistency?
Here's Facebook's original post on why they chose HBase for Messenger. At the time they decided HBase was "ideal for scale and performance with consistency as near as possible to 'immediate'", however they reached its limits and later developed a new service called Iris that handles the most recent week of messages, while storing the older messages in HBase.
Cassandra's consistency model provides a lot of flexibility. The biggest difference is that Cassandra is a shared-nothing architecture: each server is designed to be able to function independently, thus high availability and partition tolerance at the cost of consistency.
With HBase however there is a single source of truth, at the (apparent) cost of availability and partition tolerance. The read process, from the client's perspective, involves finding the location of that data and reading it from that server. Any updates to that data are atomic.
Here's one HBase vs Cassandra benchmark that shows HBase outperforming Cassandra on nearly every test on (mostly) default settings, and here's another benchmark that shows Cassandra outperforming HBase on certain tests. I think the conclusion here is that the answer to your question is highly dependent on your use case.
Here's a good article that sums up the plusses and minuses of each, and can help you decide which one is best for your needs.

Apache Cassandra : Key auto increment

I started with Cassandra. I use cql 2.0 and I would like to create table with primary key auto_increment. I use cassandra on one node.
Cassandra doesn't have any type of key auto increment feature that you would normally find in an RDBMS. The coordination cost across nodes is too high to make it a worthwhile feature.
Generally you should be using UUIDs whenever you would have used an auto incrementing sequence in an RDBMS. Clients can create these independently of each other with a guarantee of uniqueness (if you are using them correctly). You can use TimeUUIDs if you want to be able to order your keys by creation time (assuming that your clients have synchronized clocks).
You said you are only using a 1 node cluster. If you don't ever plan on growing your cluster to be larger than 1 node then I would suggest using a different database. Cassandra sacrifices many of the traditional database features to make it work really well distributed across a ring of machines. When you only run a single node cluster you lose all of the nice features from an RDMBS without gaining any of the benefits of running a multinode Cassandra cluster.

Using the same computer as a Cassandra node and a Cassandra client

If you are using the Cassandra distributed key-value store, you will have several Cassandra nodes, and thus several computers. The data doesn't just sit there, of course, you also have one or more clients programs that communicate with the Cassandra nodes. Computationally intensive work done by the clients might also be distributed over several computers. Should the clients and the Cassandra nodes be separate computers? Is it OK to use the same computer as a Cassandra node and as a Cassandra client? I expect it would work, in the sense of performing correctly, but would there be unacceptable performance problems?
The Cassandra documentation I've seen talks in terms that suggest Cassandra nodes and clients should be separate computers, but I've not seen an explicit recommendation.
Why do I ask? Why might I want to do that? The application I have in mind does not require that the clients store any data locally, they use Cassandra for all persistent data. Their job is computationally intensive, so the bottleneck is likely to be client CPU processing rather than Cassandra processing. Not also using them as Cassandra nodes seems wasteful.
Also, if each computation (client) node is also a Cassandra node, I can use the Cassandra token of each node (used for distributing Cassandra's data) to distribute the client computations.
This is a valid setup for certain types of deployments. The most common case where people do this is when running Hadoop jobs against Cassandra. The Cassandra Wiki recommends you run one Hadoop TaskTracker on each node in your cluster. That type of deployment is similar to what you are describing.

how to integrate cassandra with zookeeper to support transactions

I have a Cassandra cluster and Zookeeper server installed. Now I want to support transactions in cassandra using zookeeper. How do i do that.
Zookeeper creates znodes to perform read and write operations and data to and fro goes through znodes in Zookeeper. I want to know that how to support rollback and commit feature in cassandra using Zookeeper. Is there any way by which we can specify cassandra configurations in zookeeper or zookeeper configurations in cassandra.
I know cassandra and zookeeper individually how data is read and written but I dont know how to integrate both of them using Java.
how can we do transactions in Cassandra using Zookeeper.
Thanks.
I have a Cassandra cluster and Zookeeper server installed. Now I want to support transactions in cassandra using zookeeper. How do i do that.
With great difficulty. Cassandra does not work well as a transactional system. Writes to multiple rows are not atomic, there is no way to rollback writes if some writes fail, and there is no way to ensure readers read a consistent view when reading.
I want to know that how to support rollback and commit feature in cassandra using Zookeeper.
Zookeeper won't help you with this, especially the commit feature. You may be able to write enough information to zookeeper to roll back in case of failure, but if you are doing that, you might as well store the rollback info in cassandra.
Zookeeper and Cassandra work well together when you use Zookeeper as a locking service. Look at the Cages library. Use zookeeper to co-ordinate read/writes to cassandra.
Trying to use cassandra as a transactional system with atomic commits to multiple rows and rollbacks is going to be very frustrating.
There are ways you can use to implement transactions in Cassandra without ZooKeeper.
Cassandra itself has a feature called Lightweight transactions which provides per key linearizability and compare-and-set. With such primitives you can implement serializable transactions on the application level by youself.
Please see the Visualization of serializable cross shard client-side transactions post for for details and step-by-step visualization.
The variants of this approach are used in Google's Percolator system and in CockroachDB.
By the way, if you're fine with Read Committed isolation level then it makes sense to take a look on the RAMP transactions paper by Peter Bailis.
There is a BATCH feature for Cassandra's CQL3 (Cassandra 1.2 is the formal version that released CQL3), which allegedly can atomically apply all the updates in the BATCH as one unit all-or-nothing.
This does not mean you can rollback a successfully executed BATCH as an RDBMS could do, that would have to be manually done.
Depending on the consistency and preferences you provide to the BATCH statement, guarantees of atomicity of the updates can be increased or decreased to some degree with the UNLOGGED option.
http://www.datastax.com/docs/1.2/cql_cli/cql/BATCH
Well, I'm not an exepert at this (far from it actually) but the way I see it, either you deploy some middleware made by yourself, in order to guarantee the specific properties you are looking for or you can just have Cassandra write data to auxilliary files and then copy them through the file system, since the copy function in Java works as an atomic operation.
I don't know anything about the size of the data files you are considering so I don't really know if it is doable, however there might be a way to use this property through smaller bits of information and then combining them as a whole.
Just my 2 cents...

Resources