Using the same computer as a Cassandra node and a Cassandra client

If you are using the Cassandra distributed key-value store, you will have several Cassandra nodes, and thus several computers. The data doesn't just sit there, of course; you also have one or more client programs that communicate with the Cassandra nodes. Computationally intensive work done by the clients might also be distributed over several computers. Should the clients and the Cassandra nodes be separate computers? Is it OK to use the same computer as a Cassandra node and as a Cassandra client? I expect it would work, in the sense of performing correctly, but would there be unacceptable performance problems?
The Cassandra documentation I've seen talks in terms that suggest Cassandra nodes and clients should be separate computers, but I've not seen an explicit recommendation.
Why do I ask? Why might I want to do that? The application I have in mind does not require that the clients store any data locally; they use Cassandra for all persistent data. Their job is computationally intensive, so the bottleneck is likely to be client CPU processing rather than Cassandra processing. Not also using them as Cassandra nodes seems wasteful.
Also, if each computation (client) node is also a Cassandra node, I can use the Cassandra token of each node (used for distributing Cassandra's data) to distribute the client computations.

This is a valid setup for certain types of deployments. The most common case where people do this is when running Hadoop jobs against Cassandra. The Cassandra Wiki recommends you run one Hadoop TaskTracker on each node in your cluster. That type of deployment is similar to what you are describing.
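The question's idea of using each node's token to decide which node runs which computation can be sketched as below. This is only an illustration under stated assumptions: the node names, the evenly spaced tokens, and the `token_for` hash (MD5 truncated to 64 bits) are hypothetical stand-ins, not Cassandra's actual partitioner.

```python
import bisect
import hashlib


def token_for(key):
    """Map a key to a ring position (illustrative hash, not Cassandra's partitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Toy token ring: each node owns the keys whose token is <= its own token."""

    def __init__(self, node_tokens):
        # Sort (token, node) pairs so lookups can use binary search.
        self._ring = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._ring]

    def owner(self, key):
        # First node whose token is >= the key's token; wrap around past the end.
        idx = bisect.bisect_left(self._tokens, token_for(key))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical three-node cluster with evenly spaced tokens in a 64-bit space.
ring = TokenRing({
    "node-a": 0,
    "node-b": 2**64 // 3,
    "node-c": 2 * (2**64 // 3),
})

# Assign each unit of work to the node that would own its key.
work_items = [f"job-{i}" for i in range(9)]
assignment = {item: ring.owner(item) for item in work_items}
for item, node in assignment.items():
    print(item, "->", node)
```

Because the mapping depends only on the key and the tokens, every client computes the same assignment without any coordination.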

Related

What is the proper setup for spark with cassandra

After using and playing around with the spark connector, I want to utilize it in the most efficient way, for our batch processes.
Is the proper approach to set up a Spark worker on the same host as the Cassandra node? Does the Spark connector ensure data locality?
I am a bit concerned that a memory-intensive Spark worker will cause the entire machine to stop, and then I will lose a Cassandra node. So I'm unsure whether I should place the workers on the Cassandra nodes or on separate machines (which means no data locality). What is the common way, and why?
This depends on your particular use case. Some things to be aware of:
1) CPU sharing. While memory will not be shared between Spark and Cassandra (the heaps are separate), there is nothing stopping Spark executors from stealing time on C* CPU cores. This can lead to load and slowdowns in C* if the Spark process is very CPU intensive. If it isn't, then this isn't much of a problem.
2) Network speed. If your network is very fast, then there is much less value in locality than if you are on a slower network.
So you have to ask yourself: do you want a simpler setup (everything in one place), or a more complicated but more isolated one?
For instance, DataStax (the company I work for) ships Spark colocated with Cassandra by default, but we also offer the option of running it separately. Most of our users colocate, possibly because of this default; those who don't usually separate the two for easier scaling.

What is meant by a node in cassandra?

I am new to Cassandra and I want to install it. So far I've read a small article on it.
But there is one thing that I do not understand, and that is the meaning of 'node'.
Can anyone tell me what a 'node' is, what it is for, and how many nodes we can have in one cluster ?
A node is the storage layer within a server.
Newer versions of Cassandra use virtual nodes, or vnodes. There are 256 vnodes per server by default.
A vnode is essentially the storage layer.
machine: a physical server, EC2 instance, etc.
server: an installation of Cassandra. Each machine has one installation of Cassandra. The Cassandra server runs core processes such as the snitch, the partitioner, etc.
vnode: The storage layer in a Cassandra server. There are 256 vnodes per server by default.
Helpful tip:
Where you will get confused is that Cassandra terminology (in older blog posts, YouTube videos, and so on) has been used inconsistently. In older versions of Cassandra, each machine had one Cassandra server installed, and each server contained one node. Because of this 1-to-1-to-1 relationship between machine, server, and node, people used the three terms interchangeably.
Cassandra is a distributed database management system designed to handle large amounts of data across many commodity servers. Like all other distributed database systems, it provides high availability with no single point of failure.
You may have gotten some idea from the paragraph above. Generally, when we talk about Cassandra, we mean a Cassandra cluster, not a single PC. A node in a cluster is just a fully functional machine that is connected to the other nodes in the cluster through a high-speed internal network. All nodes work together to make sure that, even if one of them fails due to an unexpected error, the cluster as a whole can still provide service.
All nodes in a Cassandra cluster are the same. There is no concept of a master node or slave nodes. There are multiple reasons for this design, and you can Google it for more details if you want.
Theoretically, you can have as many nodes as you want in a Cassandra cluster. For example, Apple reported running 75,000 nodes at the Cassandra Summit in 2014.
Of course, you can try Cassandra on one machine. It still works with just one node in the cluster.
A Cassandra node is a place where data is stored.
A data center is a collection of related nodes.
A cluster is a component which contains one or more data centers.
In other words, it is a collection of multiple Cassandra nodes that communicate with each other to perform a set of operations.
In Cassandra, each node is independent and at the same time interconnected to other nodes.
All the nodes in a cluster play the same role.
Every node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
In the case of failure of one node, Read/Write requests can be served from other nodes in the network.
If you're looking to understand Cassandra terminology, then the following post is a good reference:
http://exponential.io/blog/2015/01/08/cassandra-terminology/

Configure cassandra to use different network interfaces for data streaming and client connection?

I have a Cassandra cluster deployed with 3 nodes and a replication factor of 3. A lot of data is written to Cassandra on a daily basis (10-15GB). I have provisioned these nodes on commodity hardware, as suggested by the "Big data community", and I expect the nodes to go down frequently, which is handled by the redundancy Cassandra provides.
My problem is that I have observed Cassandra writes slowing down when a new node is provisioned and data is being streamed to it during bootstrapping. To overcome this hurdle, we have decided to use one network interface for inter-node communication and a separate one for client applications writing data to Cassandra. My question is: how can this be configured, if it is possible at all?
Any help is appreciated.
I think you are chasing the wrong solution.
I am confused by the fact that you only have 3 nodes, yet your concern is around slow writes while bootstrapping. Why? Are you planning to grow your cluster regularly? What is your consistency level on write, as this has a big impact on performance? Obviously if you only have 2 or 3 nodes and you're trying to bootstrap, you will see a slowdown, because you're tying up a significant percentage of your cluster to do the streaming.
Note that "commodity hardware" doesn't mean cheap, low-performance hardware. It just means you don't need the super high-end database-class machines used for databases like Oracle. You should still use really good commodity hardware. You may also need more nodes, as setting RF equal to cluster size is not typically a great idea.
Having said that, you can set your listen_address to the inter-node interface and rpc_address to the client address if you feel that will help.
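As a rough sketch of that last suggestion, the helper below renders the two `cassandra.yaml` settings named above for one node. The function name and the IP addresses are hypothetical; only `listen_address` and `rpc_address` come from the answer.

```python
def render_interface_settings(internode_ip, client_ip):
    """Render the two cassandra.yaml settings that split inter-node and client traffic."""
    lines = [
        "# Interface used for communication with other Cassandra nodes",
        f"listen_address: {internode_ip}",
        "# Interface on which this node accepts client connections",
        f"rpc_address: {client_ip}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical addresses: one on the inter-node network, one on the client network.
print(render_interface_settings("10.0.1.11", "10.0.2.11"))
```

Each node would need its own pair of addresses, one per interface.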

How does Apache Cassandra mash with Infinispan?

I have checked the main features of Cassandra and Infinispan. They seem to have and deliver pretty similar characteristics and functionalities:
NoSQL data store
persistence
decentralized
support replication
scalability
fault tolerant
MapReduce support
Queries
One difference I have found out is that Infinispan does not provide tunable consistency (every node has the same data).
When learning about Infinispan I came across the Cassandra Cache Store (http://infinispan.org/docs/cachestores/cassandra/). It provides persistence of data.
But then why would I still want to use Infinispan and not Cassandra directly?
Do these solutions complement each other, or do they compete on the same level?
Infinispan is mainly used as a distributed cache, like memcached/hazelcast and so on.
Natively data are written in memory but you can persist them into what they call "cache stores" -- there are many cache-stores ready (for File/Cassandra/Hbase/Mongo) or you can make your own implementation.
"One difference I have found out is that Infinispan does not provide tunable consistency (every node has the same data)."
Tunable consistency and data distribution are two different things. It's not true that "every node has the same data"; it depends on how you choose to cluster the data. Infinispan, like others, offers both replication (all nodes store the same cache) and distribution (each node is responsible for a range of tokens). Tunable consistency in Cassandra means that you can choose how many nodes must acknowledge your read or write operation before control returns to the client.
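To make the "how many nodes" choice concrete, here is a small sketch of the usual quorum arithmetic. The function names are hypothetical; the rule it encodes is that a read is guaranteed to overlap the latest write when the number of replicas read plus the number of replicas written exceeds the replication factor.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_overlaps_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print("quorum for RF=3:", q)
print("QUORUM reads + QUORUM writes overlap:", read_overlaps_latest_write(q, q, rf))
print("ONE read + ONE write overlap:", read_overlaps_latest_write(1, 1, rf))
```

Waiting on fewer replicas lowers latency at the cost of possibly reading stale data, which is the trade-off the tunable setting exposes.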
You might need to use Infinispan rather than Cassandra directly for many reasons. For instance, you may have a huge amount of memory in your application servers and want to keep a bigger or different cache than what you can store inside your Cassandra nodes. Another feature you might need is the infinispan-query module, which lets you perform full-text searches without installing a solr/elasticsearch/whatever cluster, or the transactional capability within it.
IMHO these two products do not compare directly; they were born for different use cases and offer different features. You can use either one or both, depending on your application architecture and needs.
HTH,
Carlo

Apache Cassandra : Key auto increment

I have just started with Cassandra. I use CQL 2.0 and I would like to create a table with an auto-increment primary key. I run Cassandra on one node.
Cassandra doesn't have any type of key auto increment feature that you would normally find in an RDBMS. The coordination cost across nodes is too high to make it a worthwhile feature.
Generally you should be using UUIDs whenever you would have used an auto incrementing sequence in an RDBMS. Clients can create these independently of each other with a guarantee of uniqueness (if you are using them correctly). You can use TimeUUIDs if you want to be able to order your keys by creation time (assuming that your clients have synchronized clocks).
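A minimal sketch of client-side key generation with Python's standard `uuid` module follows. The helper names are hypothetical; `uuid4` yields a random UUID, and `uuid1` yields a time-based (version 1) UUID, the same UUID version Cassandra's `timeuuid` type stores.

```python
import uuid


def new_random_key():
    """Random (version 4) UUID: unique without any coordination between clients."""
    return uuid.uuid4()


def new_time_based_key():
    """Time-based (version 1) UUID: embeds the client's clock at creation time."""
    return uuid.uuid1()


keys = [new_time_based_key() for _ in range(3)]

# UUID.time exposes the embedded 60-bit timestamp, so keys can be ordered by
# creation time. Ordering across clients is only as good as their clock sync.
for key in sorted(keys, key=lambda k: k.time):
    print(key, key.time)

print(new_random_key())
```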
You said you are only using a 1 node cluster. If you don't ever plan on growing your cluster beyond 1 node, then I would suggest using a different database. Cassandra sacrifices many of the traditional database features to make it work really well distributed across a ring of machines. When you only run a single node cluster, you lose all of the nice features of an RDBMS without gaining any of the benefits of running a multinode Cassandra cluster.

Resources