Local Persistent Volume for Cassandra Hosted in Kubernetes

We are trying to deploy Cassandra within Kubernetes. To get the fastest storage at each data center without the expense of implementing network-attached storage, it seems reasonable to use a Local Persistent Volume at each data center and leverage Cassandra to handle the cross-datacenter replication.
Am I thinking about this problem correctly? Is there a better way to implement Cassandra in each of our data centers so that our application runs its fastest by connecting to a more local data center?

@Simon Fontana Oscarsson is right.
I just want to add a few more details about that feature for people who find this question, because it is a common case.
Local Persistent Volumes have been available since Kubernetes 1.7 in alpha and since 1.10 in beta.
They require pre-provisioned local storage on the nodes (for example, LVM volumes or mounted disks), which must be set up before you can use them.
You can find configuration examples here.
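A minimal sketch of the objects involved, assuming the beta (Kubernetes 1.10+) API; the node name, path, and capacity here are placeholder values:

```yaml
# StorageClass that delays binding until a pod is scheduled,
# so the PV on the right node is chosen
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
# A local PersistentVolume pinned to one node (placeholder values)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: cassandra-data-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1
```

A StatefulSet's volumeClaimTemplates can then claim these volumes through the `local-storage` class.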

Related

Migrate Data from one Riak cluster to another

I have a situation where we need to migrate data from one Riak cluster to another and then remove the old cluster. The ring size will be the same, and even the region will be the same. We need to do this to upgrade the instances to AL2. Is there a clean approach to do so in prod, without realtime data loss?
The answer to this may be tied to your version of Riak KV. If you have the open source version of Riak KV 2.2.3 or earlier, this will require an in-situ upgrade to Riak KV 2.2.6 before progressing. See https://www.tiot.jp/riak-docs/riak/kv/2.2.6/setup/upgrading/version/ with packages at https://files.tiot.jp/riak/kv/2.2/2.2.6/
For Enterprise Editions of Riak KV 2.2.3 and earlier, or the open source edition of Riak KV 2.2.6 or higher, you can use multi-data centre replication (MDC).
Use both of these at the same time for proper replication and to prevent data loss:
fullsync replication will copy across all stored data on its first run and then any missing data on subsequent runs.
realtime replication will replicate all transactions in almost realtime.
If you then set this up as bidirectional replication (get each cluster to replicate to the other for both fullsync and realtime), you will be able to seamlessly switch your production environment from one cluster to the other without any issues. Once you are happy everything is working as expected, you can kill the old cluster.
Please see the documentation for replication at https://www.tiot.jp/riak-docs/riak/kv/2.2.6/using/cluster-operations/v3-multi-datacenter/
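For reference, the v3 replication setup described above looks roughly like this from the command line; the cluster names and address are placeholders, and the same commands would be mirrored on the other cluster to make replication bidirectional:

```shell
# On the new cluster (placeholder names/address); mirror on the old cluster
riak-repl clustername new_cluster
riak-repl connect 10.0.0.1:9080          # address of the old cluster
riak-repl realtime enable old_cluster    # replicate live transactions
riak-repl realtime start old_cluster
riak-repl fullsync enable old_cluster    # copy existing/missing data
riak-repl fullsync start old_cluster
```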

What is meant by a node in cassandra?

I am new to Cassandra and I want to install it. So far I've read a small article on it.
But there is one thing that I do not understand: the meaning of 'node'.
Can anyone tell me what a 'node' is, what it is for, and how many nodes we can have in one cluster?
A node is the storage layer within a server.
Newer versions of Cassandra use virtual nodes, or vnodes; there are 256 vnodes per server by default.
Each vnode is essentially a slice of that storage layer.
machine: a physical server, EC2 instance, etc.
server: an installation of Cassandra. Each machine has one installation of Cassandra. The Cassandra server runs core processes such as the snitch, the partitioner, etc.
vnode: The storage layer in a Cassandra server. There are 256 vnodes per server by default.
Helpful tip:
Where you will get confused is that Cassandra terminology (in older blog posts, YouTube videos, and so on) has been used inconsistently. In older versions of Cassandra, each machine had one Cassandra server installed, and each server contained one node. Because of that 1-to-1-to-1 relationship between machine, server, and node, people used the three terms interchangeably.
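The vnode idea above can be sketched as a toy token ring. This is a simplification, not Cassandra's actual implementation (real Cassandra uses the Murmur3 partitioner, and all names here are made up):

```python
import bisect
import hashlib

class ToyTokenRing:
    """Toy model of Cassandra-style vnodes: each server claims many small
    token ranges on the ring instead of one large contiguous range."""

    def __init__(self, servers, vnodes_per_server=8):
        self.ring = []  # sorted list of (token, server) pairs
        for server in servers:
            for i in range(vnodes_per_server):
                token = self._token(f"{server}-vnode-{i}")
                self.ring.append((token, server))
        self.ring.sort()
        self.tokens = [token for token, _ in self.ring]

    @staticmethod
    def _token(value):
        # Deterministic hash standing in for Cassandra's partitioner
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def owner(self, row_key):
        """A row belongs to the server owning the next token clockwise."""
        idx = bisect.bisect_right(self.tokens, self._token(row_key))
        return self.ring[idx % len(self.ring)][1]

ring = ToyTokenRing(["server-a", "server-b", "server-c"])
ring.owner("row-42")  # always maps a given key to the same server
```

Because each server holds many small ranges, adding or removing a server moves only a fraction of the data, which is the motivation for vnodes.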
Cassandra is a distributed database management system designed to handle large amounts of data across many commodity servers. Like all other distributed database systems, it provides high availability with no single point of failure.
You may get some idea from the paragraph above. Generally, when we talk about Cassandra, we mean a Cassandra cluster, not a single machine. A node in a cluster is just a fully functional machine connected to the other nodes in the cluster through a fast internal network. All nodes work together so that even if one of them fails due to an unexpected error, the cluster as a whole can still provide service.
All nodes in a Cassandra cluster are the same. There is no concept of master or slave nodes. There are multiple reasons for this design, and you can Google them for more details if you want.
Theoretically, you can have as many nodes as you want in a Cassandra cluster. For example, Apple reported running 75,000 nodes at the Cassandra Summit in 2014.
Of course you can try Cassandra with one machine. It still works with just one node in the cluster.
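For a quick single-machine try-out (assuming a tarball install; exact paths depend on your setup):

```shell
bin/cassandra -f      # start a one-node cluster in the foreground
bin/nodetool status   # in another terminal; the single node should show as UN (Up/Normal)
```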
A Cassandra node is a place where data is stored.
A data center is a collection of related nodes.
A cluster is a component that contains one or more data centers.
In other words, a cluster is a collection of Cassandra nodes that communicate with each other to perform a set of operations.
In Cassandra, each node is independent and at the same time interconnected to other nodes.
All the nodes in a cluster play the same role.
Every node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
In the case of failure of one node, Read/Write requests can be served from other nodes in the network.
If you're looking to understand Cassandra terminology, then the following post is a good reference:
http://exponential.io/blog/2015/01/08/cassandra-terminology/

How to run Couchbase on multiple server or multiple AWS instances?

I am trying to evaluate Couchbase's performance on multiple nodes. I have a client that generates data for me based on a schema (currently for one local node). But I want to know how I can horizontally scale Couchbase and how it works. If I have multiple machines, AWS instances, or Windows Azure instances, how can I configure Couchbase to shard the data so that I can then evaluate its performance across multiple nodes? Any suggestions and details as to how I can do this?
I am not (yet) familiar with Azure but you can find a very good white paper about Couchbase on AWS:
Running Couchbase on AWS
As for the cluster itself, you just need to:
install Couchbase on multiple nodes
create a "cluster" on one of them
add the other nodes to the cluster and rebalance.
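The steps above can be sketched with couchbase-cli; the hosts and credentials below are placeholders:

```shell
# 1. Initialize the cluster on the first node (placeholder host/credentials)
couchbase-cli cluster-init -c 10.0.0.1:8091 \
  --cluster-username Administrator --cluster-password password \
  --cluster-ramsize 1024

# 2. Add another node to the cluster
couchbase-cli server-add -c 10.0.0.1:8091 \
  -u Administrator -p password \
  --server-add 10.0.0.2:8091 \
  --server-add-username Administrator --server-add-password password

# 3. Rebalance so data (vBuckets) is sharded across all nodes
couchbase-cli rebalance -c 10.0.0.1:8091 -u Administrator -p password
```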
I have created an Ansible script that uses exactly these steps to create a cluster from the command line; see
Create a Couchbase cluster with Ansible
Once you have done that your application will leverage all the nodes automatically, and you can add/remove nodes as you need.
Finally, if you want to learn more about Couchbase architecture and how sharding, failover, data consistency, and indexing work, I invite you to look at this white paper:
Couchbase Server: An Architectural Overview

Representing Network Drive as Cassandra Data Directory

I am new to Cassandra. In Cassandra, in order to store data we specify the local directory of the machine where Cassandra is installed using the property data_file_directories in the cassandra.yaml configuration file. My need is to define data_file_directories as a network directory (something like 192.x.x.x/data/files/). I am using only a single-node cluster for rapid data writes (for logging activities). As I don't rely on replication, my replication factor is 1. Can anyone help with defining a network directory as the Cassandra data directory?
Thanks in advance.
1) I have stored Cassandra's data on an Amazon EBS volume (a network volume), but in the EC2 case it is simple, as we can mount EBS volumes on a machine as if they were local.
2) In other cases you will have to use NFS to configure the network directory. I have never done this, but it looks straightforward.
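If you do go the NFS route, the idea is to mount the export and then point cassandra.yaml at the mount; the server address and paths below are placeholders:

```shell
# Mount the NFS export at a local path (placeholder address/paths)
mount -t nfs 192.168.1.10:/data/files /mnt/cassandra-data
```

Then in cassandra.yaml:

```yaml
data_file_directories:
    - /mnt/cassandra-data
```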
Cassandra is firmly designed around using local storage instead of EBS or other network-mounted data. This gives you better performance, better reliability, and better cost-effectiveness.

Dynamically adding new nodes in Cassandra

Is it possible to add new hosts to a Cassandra cluster dynamically?
What I'm trying to do is set up a program that can:
Set up a local version of the database for each user
Each user's machine will become part of the cluster (the machines will be hosts)
Data will be replicated across all the clusters
Building a cluster of multiple hosts usually entails configuring the cassandra.yaml to store the seeds, listen_address and rpc_address of each host.
My idea is to edit these files through Java and insert the new host addresses as required, but making sure that the data is accurate across each user's cassandra.yaml file would be challenging.
I'm wondering if someone has done something similar or has any advice on a better way to achieve this.
Yes, it is possible. Look at Netflix's Priam for a complete example of dynamic Cassandra cluster management (though it is designed to work with Amazon EC2).
For rpc_address and listen_address, you can set up a startup script that configures cassandra.yaml if needed.
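The cassandra.yaml settings such a startup script would manage look like this; the addresses are placeholders:

```yaml
# Seed nodes that new hosts contact to discover the cluster
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.1,10.0.0.2"
# This node's own addresses (placeholder values)
listen_address: 10.0.0.5
rpc_address: 10.0.0.5
```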
For seeds you can configure a custom seed provider. Look at the seed provider used by Netflix's Priam for some ideas on how to implement it.
The most difficult part will be managing the tokens assigned to each node in an efficient way. Cassandra 1.2 is around the corner and will include a feature called virtual nodes that, IMO, will work well in your case. See the Acunu presentation about it.
