Warnings : Your replication factor 3 for keyspace books is higher than the number of nodes 1

Warnings : Your replication factor 3 for keyspace books is higher than the number of nodes 1 - cassandra

When I tried:
CREATE KEYSPACE IF NOT EXISTS books WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 3 };
I got:
Warnings :
Your replication factor 3 for keyspace books is higher than the number of nodes 1
I have installed Cassandra on my local Ubuntu 22.04 machine and don't remember if I specified how many nodes it should have if it was ever possible doing that. I am just wondering to know is it possible to have multi nodes on a local machine? How should I check the number of Cassandra nodes and how can I change them?

A "node" in Cassandra is an individual machine instance. If you're deploying in the cloud or on Kubernetes, increasing the number of nodes is a trivial thing. However, if you're just testing on your own machine, and you've installed Cassandra, then you have one node.
How should I check the number of Cassandra nodes?
You can check this by querying the local node's system.peers table. That table holds data on each of the other nodes in the cluster.
SELECT * FROM system.peers;
So, if you get two rows back, then you have a three node cluster. If nothing comes back, then you have a one node cluster.
is it possible to have multi nodes on a local machine?
Yes. It's not easy, though. Cassandra makes exclusive use of a few ports on a machine instance, so you'd need multiple Cassandra directories, and each would need its own offsets of ports 7000, 7001, 7199, and 9042.
The easier way, is to use MiniKube to run multiple, small Cassandra nodes inside a Kubernetes cluster. I've build a repo to help folks with doing this on Windows: https://github.com/aploetz/cassandra_minikube
There's also a companion YouTube video for it: https://www.youtube.com/watch?v=eMKXiItZ0Q4

Related

Three node Cassandra cluster all nodes configured different rack in same dc

I am a newbie in Cassandra.
In our production environment three node Cassandra clusters are running and serving production traffic but I have below mentioned doubts:-
1) All nodes are configured in different racks i.e rack 1, rack 2 and rack 3 in the same dc. Is it fine or does this configuration have some drawbacks?
2) We are using rf2 and network topology for all the keyspaces except system tables and these system tables are configured with rf2 and simplestrategy ..is it fine or does this need to be changed? should we increase the replication factor of system_auth? ..please let me know..
3) Now I want to add another node in the same dc, what will be the best procedure to do the same without impacting the live traffic?
Cassandra version is Apache cassandra 3.11.
Thanks in advance..

Ans 1) It seems good to have Cassandra nodes in different racks for availability and fault tolerance .
Ans 2) You must increase RF on system_auth so that you can avoid cqlsh login issue from other nodes.
Ans 3) You can add new node without affecting the live traffic on existing cluster. please follow below procedure.
http://cassandra.apache.org/doc/latest/operating/topo_changes.html

Cassandra is designed as a distributed system. Cassandra’s distributed architecture is specifically tailored for multiple-data center deployment. These features are robust and flexible enough that you can configure the cluster for optimal geographical distribution, for redundancy for fail-over and disaster recovery.
Multiple data center deployments are excellent for global solutions where in some applications are operational in one region and other applications in another region and yet using a single cluster of Cassandra which is working in multiple data centers across regions.
For single region applications, still having multiple data-centers is preferred option because it provides disaster recovery even in case one region goes down.
Ans 1) For a single DC Cassandra cluster , recommendation is to have 4 nodes with RF3. Rack 1 with 2 nodes and Rack 2 with 2 nodes. Remember that nodes in the same rack have faster network than nodes in different racks. With two nodes on the same Rack, queries with LOCAL_QUORUM will be faster as compared to queries on a cluster with all nodes on different racks.
If you are not concerned with the query latency , all nodes in different racks (3 racks) will give better disaster recovery as compared with two RACK deployment. Having said that, it's always recommended to use multi DC deployments for production cluster.
Ans 2) It’s always recommended to increase the replication factor of System_auth keyspace and change the replication class to NetworkTopologyStrategy. Please follow this documentation for more details https://docs.datastax.com/en/security/6.0/security/secSystemKeyspace.html
Ans 3) Yes, You can add a new node to existing cluster with ease without impacting the traffic. Please follow this documentation for more details: https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsAddNodeToCluster.html

Creating Cassandra sub-clusters

I need to create K overlapping Cassandra clusters on N machines (K>>N). Each cluster can have between 1 to N nodes. I know that one way of doing so is to create a separate process (or docker container) for each cluster a node is a member of.
My question however is that can I change Cassandra to allow the creation of sub-clusters? meaning that there would be only 1 Cassandra instance running on each node, but I would be able to take control of data replication and data placement so that within a sub-cluster for example, I would be able to do a Quorum write for example.

No, it's not possible to define the sub-cluster as you describe - there is always a single Cassandra cluster per process.
But Cassandra has a notion of the Datacenter that defines where machine resides, and the keyspace that defines how the data is replicated between datacenters and nodes. And consistency level, like, QUORUM depends on the keyspace configuration.
In your case I would think in that direction - define datacenters, create necessary keyspaces, and setup correct replication factors for that keyspaces.

Way to determine healthy Cassandra cluster?

I've been tasked with re-writing some sub-par Ansible playbooks to stand up a Cassandra cluster in CentOS. Quite frankly, there doesn't seem to be much information on Cassandra out there.
I've managed to get the service running on all three nodes at the same time, using the following configuration file, info scrubbed.
HOSTIP=10.0.0.1
MSIP=10.10.10.10
ADMIN_EMAIL=my#email.com
LICENSE_FILE=/tmp/license.conf
USE_LDAP_REMOTE_HOST=n
ENABLE_AX=y
MP_POD=gateway
REGION=test-1
USE_ZK_CLUSTER=y
ZK_HOSTS="10.0.0.1 10.0.0.2 10.0.0.3"
ZK_CLIENT_HOSTS="10.0.0.1 10.0.0.2 10.0.0.3"
USE_CASS_CLUSTER=y
CASS_HOSTS="10.0.0.1:1,1 10.0.0.2:1,1 10.0.0.3:1,1"
CASS_USERNAME=test
CASS_PASSWORD=test
The HOSTIP changes depending on which node the configuration file is on.
The problem is, when I run nodetool ring, each node says there's only two nodes in the cluster: itself and one other, seemingly random from the other two.
What are some basic sanity checks to determine a "healthy" Cassandra cluster? Why is nodetool saying each one thinks there's a different node missing from the cluster?

nodetool status - overview of the cluster (load, state, ownership)
nodetool info - more granular details at the node-level
As for the node mismatch I would check the following:
cassandra-topology.properties - identical across the cluster (all 3 IPs listed)
cassandra.yaml - I typically keep this file the same across all nodes. The parameters that MUST stay the same across the cluster are: cluster_name, seeds, partitioner, snitch).
verify all nodes can reach each other (ping, telnet, etc)
DataStax (Cassandra Vendor) has some good documentation. Please note that some features are only available on DataStax Enterprise -
http://docs.datastax.com/en/landing_page/doc/landing_page/current.html
Also check out the Apache Cassandra site -
http://cassandra.apache.org/community/
As well as the user forums -
https://www.mail-archive.com/user#cassandra.apache.org/

Actually, the thing you really want to check is if all the nodes "AGREE" on schema_id. nodetool status shows if nodes or up, down, joining, yet it does not really mean 'healthy' enough to make schema changes or do other changes.
The simplest way is:
nodetool describecluster
Cluster Information:
Name: FooBarCluster
Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
DynamicEndPointSnitch: enabled
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
43fe9177-382c-327e-904a-c8353a9df590: [10.136.2.1, 10.136.2.2, 10.136.2.3]
If schema IDs do not match, you need to wait for schema to settle, or run repairs, say for example like this:
43fe9177-382c-327e-904a-c8353a9df590: [10.136.2.1, 10.136.2.2]
43fe9177-382c-327e-904a-c8353a9dxxxx: [10.136.2.3]
However, running nodetool is 'heavy' and hard to parse.
The information is inside the database, you can check here:
'SELECT schema_version, release_version FROM system.local' and
'SELECT peer, schema_version, release_version FROM system.peers'
Then you compare schema_version across all nodes... if they match, the cluster is very likely healthy. You should ALWAYS check this before making any changes to schema.
Now, during a rolling upgrade, when changing engine versions, the release_version is different, so to support automatic rolling upgrades, you need to check schema_id matching within release_versions separately.

I'm not sure all of the problems you might be having, but...
Check the cassandra.yaml file. You need minimum 3 things to be the same - seeds: list (but do not list all nodes as seeds!), cluster_name, and snitch. Make sure your listen_address is correct.
If you are using gossipingPropertyFileSnitch then check cassandra-topology.properties and/or cassandra-rackdc.properties files for accuracy.
Don't start all the nodes at the same time. Start the seed nodes 1st - the other nodes will "gossip" with the seed node to learn cluster topology. Shutdown the seed nodes last.
Don't use shared storage. That defeats the purpose of distributed data and is considered a cassandra anti-pattern.
If you're in AWS, don't use auto-scaling groups unless you know what you're doing.
Once you've done all that, use nodetool status | ring | info or jmx to see what the cluster is doing.
Datastax does have decent documentation for cassandra.

Cassandra Cluster Vs Nodes

I am beginner in Cassandra, I am trying to understand few basic things.
1) Cassandra Cluster : Does it mean a physical server? Is it possible to run multiple clusters on a single physical machine?
2) Cassandra Nodes : By definition it looks that one cluster can have multiple nodes. Can we have multiple nodes on a single physical machine? or one node means one single machine?
3) I have two physical machines and I just installed Cassandra server on both machines and configured syncing between the both Cassandra servers, so if I create any keyspace with NetworkTopologyStrategy I am able to see on both servers. Does it mean that I created two clusters or two nodes?
Need help on above questions.

Let us work with JVM as a unit.
Cassandra Node: It is a single JVM instance to run Cassandra. It can be run on a single physical machine or on a VM or docker container.
Cassandra Cluster: One or more group of Cassandra nodes form a Cassandra cluster.
So if you have 2 physical machines you can always run more than 2 nodes depending on the capacity of your machine. You can also run multiple clusters. i.e eg: you can create 6 VM to prepare 6 nodes and group them into two clusters with 3 each. This is controlled by cassandra.yaml.
Does it mean that I created two clusters or two nodes?
No, it means you created two nodes and grouped them into one cluster.

What is meant by a node in cassandra?

I am new to Cassandra and I want to install it. So far I've read a small article on it.
But there one thing that I do not understand and it is the meaning of 'node'.
Can anyone tell me what a 'node' is, what it is for, and how many nodes we can have in one cluster ?

A node is the storage layer within a server.
Newer versions of Cassandra use virtual nodes, or vnodes. There are 256 vnodes per server by default.
A vnode is essentially the storage layer.
machine: a physical server, EC2 instance, etc.
server: an installation of Cassandra. Each machine has one installation of Cassandra. The Cassandra server runs core processes such as the snitch, the partitioner, etc.
vnode: The storage layer in a Cassandra server. There are 256 vnodes per server by default.
Helpful tip:
Where you will get confused is that Cassandra terminology (in older blog posts, YouTube videos, and so on) had been used inconsistently. In older versions of Cassandra, each machine had one Cassandra server installed, and each server contained one node. Due to the 1-to-1-to-1 relationship between machine-server-node in old versions of Cassandra people previously used the terms machine, server and node interchangeably.

Cassandra is a distributed database management system designed to handle large amounts of data across many commodity servers. Like all other distributed database systems, it provides high availability with no single point of failure.
You may got some ideas from the description of above paragraph. Generally, when we talk Cassandra, we mean a Cassandra cluster, not a single PC. A node in a cluster is just a fully functional machine that is connected with other nodes in the cluster through high internal network. All nodes work together to make sure that even if one of them failed due to unexpected error, they as a whole cluster can provide service.
All nodes in a Cassandra cluster are same. There is no concept of Master node or slave nodes. There are multiple reason to design like this, and you can Google it for more details if you want.
Theoretically, you can have as many nodes as you want in a Cassandra cluster. For example, Apple used 75,000 nodes served Cassandra summit in 2014.
Of course you can try Cassandra with one machine. It still work while just one node in this cluster.

What is meant by a node in cassandra?
Cassandra Node is a place where data is stored.
Data center is a collection of related nodes.
A cluster is a component which contains one or more data centers.
In other words collection of multiple Cassandra nodes which communicates with each other to perform set of operation.
In Cassandra, each node is independent and at the same time interconnected to other nodes.
All the nodes in a cluster play the same role.
Every node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
In the case of failure of one node, Read/Write requests can be served from other nodes in the network.

If you're looking to understand Cassandra terminology, then the following post is a good reference:
http://exponential.io/blog/2015/01/08/cassandra-terminology/

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string