Adding nodes to a dispersed gluster cluster - glusterfs

I was reading the gluster documentation and am having difficulty figuring out exactly how my cluster ought to be configured.
Suppose I decided to set up a dispersed distributed cluster with 3 bricks and redundancy = 1.
If I did this do I have to add bricks in groups of 3, or can I add 1 or 2 bricks if desired?
If I add 3 bricks to the cluster, does the redundancy number change? I looked at this: https://lists.gluster.org/pipermail/gluster-users/2018-July/034491.html and it said that the redundancy number is constant throughout the life of the cluster, which I find odd. If I start out tiny with 3 nodes and then hit the jackpot and ramp up to 60 nodes, a redundancy number of 1 is probably not appropriate, whereas it is appropriate for 3 nodes. With this in mind, if the redundancy number is constant (per the thread quoted), how does one scale a gluster cluster up by an order of magnitude?

Yes, you need to add bricks in groups of 3.
When you add more nodes (in multiples of 3) to expand the volume, you are increasing the distribute count and thereby the volume capacity. The redundancy number applies to each disperse 'sub volume' of the cluster; it is not something like 1 node of redundancy for every 60 nodes. So your volume scales from a 1x(2+1) to a 30x(2+1), and each of those 30 disperse sub volumes has a redundancy factor of 1.
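For instance, a minimal sketch of that expansion with the gluster CLI (hostnames server1..server6 and the brick path /data/brick1 are placeholders, not from the question):

# Create a 1x(2+1) dispersed volume: 3 bricks, redundancy 1
gluster volume create myvol disperse 3 redundancy 1 \
  server1:/data/brick1 server2:/data/brick1 server3:/data/brick1
# Expand by adding bricks in a multiple of 3; the volume becomes 2x(2+1)
gluster volume add-brick myvol \
  server4:/data/brick1 server5:/data/brick1 server6:/data/brick1
# Rebalance so existing files spread onto the new disperse subvolume
gluster volume rebalance myvol start

Each add-brick of 3 more bricks adds another (2+1) disperse subvolume; the redundancy of the existing subvolumes never changes.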

Related

Cassandra - Replication Factor vs Nodes Count Relation

One of my C* cluster designs expects nodes to hold between 1 and 2 TB of data each, and I expect a huge amount of data in a few months. Assuming I can get to 1 PB of data and that each node will hold 1 TB, I should plan for roughly 1000x growth over time: starting from a "misere" N=3 nodes with RF=3 for 1 TB of data, I would keep adding nodes up to N=3000.
The high number of nodes involved puts pressure on how to deal with disk/server failures, how to keep the cluster healthy, and how to perform backups.
Healthy Cluster
Assuming you don't want any data loss and perform reads/writes at LOCAL_QUORUM consistency level, using RF=3 when you have N<10 nodes is very reasonable. However, as N goes up, the cluster-wide MTBF goes down accordingly, so keeping RF=3 is going to cause trouble and you may want to "upgrade" to RF=5 or more.
Q1: What's a good RF that would fight against the decreased MTBF and keep the cluster healthy (and you sleeping peacefully) with say 100 nodes? and 500? and 1000?
Backup
Making backups of all the nodes does not seem viable, for the following reasons:
It instantly doubles the cost of the solution.
I would be backing up redundant data, because of the cluster's RF.
I see no way to remove the redundancy introduced by the RF and back up only the data, except adding another DC to C* with RF=2 (I could go for RF=1, but if I lose one node the whole backup cluster is down). That would mean adding 2/RF of the cost of the cluster for backup purposes, which seems to me a good alternative (sketched after Q2 below).
Q2: Are there any other methods to perform this task without increasing the cost of the solution too much?
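A sketch of the backup-DC idea from that last bullet, expressed as cqlsh/nodetool commands (the keyspace name mydata and the DC names DC_live/DC_backup are invented for illustration):

# Add a second, smaller DC that holds fewer copies than the live one
cqlsh -e "ALTER KEYSPACE mydata WITH replication = {'class': 'NetworkTopologyStrategy', 'DC_live': 3, 'DC_backup': 2};"
# On each new backup node, stream the existing data from the live DC
nodetool rebuild DC_live

This keeps 2 extra copies for backup rather than mirroring the live DC's full RF.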

Brand new to Cassandra, having trouble understanding replication topology

So I'm taking over our Cassandra cluster after the previous admin left, and I'm busy trying to learn as much as I can about it. I'm going through all the documentation on Datastax's site, as we're using their product.
That said, I'm having a bit of trouble understanding why I wouldn't set the replication factor to the number of nodes I have. I currently have four nodes and one datacenter, with all nodes in the same physical location.
What, if any, benefit would there be to having a replication factor of less than 4?
I'm just thinking it would be beneficial from a fault-tolerance standpoint if each node had its own copy/replica of the data; I'm not sure why I would want fewer replicas than the number of nodes I have. Are there performance tradeoffs or other reasons? Am I COMPLETELY missing the concept here (entirely possible)?
There are a few reasons why you might not want to increase your RF from 3 to 4:
Increasing your RF effectively multiplies your original data volume by that amount. Depending on your data volume and data density, you may not want to incur the additional storage hit. Keeping RF < the number of nodes is what lets you scale beyond one node's capacity.
Depending on your consistency level you could take a performance hit. For example, when writing at QUORUM consistency level (CL) with an RF of 3, you wait for 2 nodes to respond before confirming the write to the client; with an RF of 4 you would wait for 3 nodes (see the arithmetic sketch after this answer).
Regardless of the CL, every write will eventually go to every replica; with RF equal to the node count that means every node. This is more activity on your cluster and may not perform well if your nodes aren't scaled for that workload.
You mentioned fault tolerance. With an RF of 4 and reads at CL ONE, you can absorb up to 3 of your servers being down simultaneously and your app will still be up. From a fault tolerance perspective this is pretty impressive, but also unlikely to matter: if you have 3 nodes down at the same time in the same DC, the 4th is probably also down (natural disaster, flood, who knows...).
At the end of the day it all depends on your needs, and C* is nothing if not configurable. An RF of 3 is very common among Cassandra implementations.
Check out this deck by Joe Chu
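A quick shell sketch of the quorum arithmetic referenced above (nothing Cassandra-specific, just the formula floor(RF/2) + 1):

for rf in 3 4 5; do
  echo "RF=$rf -> QUORUM waits for $(( rf / 2 + 1 )) replicas"
done
# RF=3 -> QUORUM waits for 2 replicas
# RF=4 -> QUORUM waits for 3 replicas
# RF=5 -> QUORUM waits for 3 replicas

Going from RF=3 to RF=4 raises the quorum from 2 to 3 without letting you tolerate any more replica failures at QUORUM, which is part of why RF=4 is an awkward choice.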
The reason why your RF is often less than the number of nodes in the cluster is explained in the post: Cassandra column family bigger than nodes drive space. This post provides insight into this interesting aspect of Cassandra replication. Here's a summary of the post:
QUESTION: ... every node has 2 TB drive space and the column family is replicated on every node, so every node contains a full copy of it ... after some years that column family will exceed 2 TB ...
Answer: RF can be less than the number of nodes and does not need to scale if you add more nodes.
For example, if you today had 3 nodes with RF 3, each node will contain a copy of all the data, as you say. But then if you add 3 more nodes and keep RF at 3, each node will have half the data. You can keep adding more nodes so each node contains a smaller and smaller proportion of the data ... there is no limit in principle to how big your data can be.
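The fraction each node holds is roughly RF/N, which a throwaway shell loop makes concrete (node counts are illustrative):

for n in 3 6 12 3000; do
  awk -v n="$n" 'BEGIN { printf "N=%d, RF=3 -> each node holds ~%.1f%% of the data\n", n, 100 * 3 / n }'
done

At N=3 every node holds 100%; at N=3000 each holds about 0.1%, which is why RF does not need to grow as the cluster does.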

Cassandra rack concept and database structure

I am new to Cassandra and I would like to learn more about Cassandra's racks and structure.
Suppose I have around 70 column families in Cassandra and two AWS instances.
How many data centers will be used?
How many nodes will each rack have?
Is it possible to divide a column family in multiple keyspaces?
The intent of making Cassandra aware of logical racks and data centers is to provide additional levels of fault tolerance. The idea (as described in this document, under the "Network Topology Strategy") is that the application should still be able to function if one rack or data center goes dark. Essentially, Cassandra...
places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues.
In this way, you can also query your data at LOCAL_QUORUM, in which QUORUM ((replication_factor / 2) + 1) is computed only from the nodes in the same data center as the coordinator node. This reduces the effects of inter-data-center latency.
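For concreteness, a hedged cqlsh sketch of that setup (the keyspace app, table users, and DC names are invented for illustration):

# Keyspace with 3 replicas in each of two data centers
cqlsh -e "CREATE KEYSPACE app WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};"
# A LOCAL_QUORUM read only waits on replicas in the coordinator's own DC
cqlsh -e "CONSISTENCY LOCAL_QUORUM; SELECT * FROM app.users LIMIT 1;"

The read above needs 2 of the 3 replicas in the local data center to respond, regardless of what the remote DC is doing.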
As for your questions:
1. How many data centers are used is entirely up to you. If you only have two AWS instances, putting them in different logical data centers is possible, but it only makes sense if you are planning to use consistency level ONE. As in: if one instance goes down, your application only needs to find one other replica. But even then, the snitch can only find data on one instance or the other.
2. Again, you can define the number of nodes that you wish to have for each rack. But as I indicated in #1, if you only have two instances, there isn't much to be gained by splitting them into different data centers or racks.
3. I do not believe it is possible to divide a column family over multiple keyspaces. But I think I know what you're getting at. Each keyspace will be created on each instance. As you have 2 instances, you can specify a replication factor of 1 or 2. If you had 3 instances, you could set a replication factor of 2, and then if you lost 1 instance you would still have access to all the data. As you only have 2 instances, you need to be able to handle one going dark, so you will want to make sure both instances have a copy of every row (replication factor of 2).
Really, the logical datacenter/rack structure becomes more useful as the number of nodes in your cluster increases. With only two, there is little to be gained by splitting them with additional logical barriers. For more information, read through the two docs I linked above (a short snitch configuration sketch follows them):
Apache Cassandra 2.0: Data Replication
Apache Cassandra 2.0: Snitches
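As a related sketch, this is how a node typically declares its DC and rack when using GossipingPropertyFileSnitch (a common snitch choice; the file path is the usual package-install location and the values are placeholders):

cat > /etc/cassandra/cassandra-rackdc.properties <<'EOF'
dc=DC1
rack=rack-a
EOF

Each node gossips these values to the cluster, and NetworkTopologyStrategy consults them when placing replicas on distinct racks and DCs.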

Cassandra Partial Replication

This is my configuration for 4 Data Centers of Cassandra:
create KEYSPACE mySpace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 1, 'DC3' : 1, 'DC4' : 1};
In this configuration (Murmur3Partitioner + 256 tokens), each DC is storing roughly 25% of the key space, and this 25% is replicated 3 times, once on each other DC. Meaning that every single row has 4 copies overall.
For instance, if my database is too big to keep 4 complete copies of it, how can I configure Cassandra so that each DC's data is replicated only once or twice, instead of to the total number of other DCs (x3)?
For example: the 25% of the key space that is stored on DC1 I want to replicate once, on DC2 only. I am not looking to select any particular DC for replication, nor do I care if the 25% from DC1 gets split over multiple DCs. I just want to use NetworkTopologyStrategy but reduce storage costs.
Is it possible?
Thank you, best regards.
Your keyspace command shows that each of the DCs holds 1 copy of the data. This means that if you have 1 node in each DC, then each node will have 100% of your data. So I am not sure how you concluded that each of your DCs stores only 25% of the keys, when they are in fact storing 100%. Chances are that when you run nodetool you are not specifying the keyspace, so the command shows you a load based on the token range assigned to each node, which is misleading for a NetworkTopology setup. Try running it with your keyspace name and see if you notice the difference.
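Concretely, using the keyspace from the question (only the keyspace name is taken from the question; everything else is stock nodetool):

nodetool status mySpace   # "Owns" reflects the keyspace's replication settings
nodetool status           # "Owns" reflects raw token ranges only, which misleads here

With the keyspace argument, nodetool reports effective ownership, which for this keyspace should show each DC collectively owning 100% of the data rather than 25%.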
I don't think there is a way to shift data around DCs the way you want using any of the existing snitches. If you really wanted an even distribution, and you had an equal number of nodes in each DC with initial tokens spaced evenly, you could have used SimpleSnitch to achieve what you want. You can change the snitch to SimpleSnitch and run nodetool cleanup/repair on each node. Bear in mind that during this process you will have some outage, because after the snitch change previously written keys may not be available on some nodes until the repair job is done.
The way NetworkTopology works is that if you say you have DC1:1 and you have, for example, 2 nodes in DC1, it will evenly distribute keys across the 2 nodes, leading to 50% effective load on each node. With that in mind, I think what you really want is to keep 3 copies of your data, 1 in each DC, so you can discard one DC and save money. I am saying this because I suspect these DCs are virtual in your NetworkTopology setup and not real physical DCs, since no one would want only 25% of the data in one DC; that would not be a highly available setup. So, if your nodes are grouped into virtual DCs, I recommend you group them into 4 racks instead and maintain 1 DC:
DC1:
nd1-ra_1 rack-a
nd1-rb_1 rack-b
nd1-rc_1 rack-c
nd2-ra_2 rack-a
nd2-rb_2 rack-b
nd2-rc_2 rack-c
nd3-ra_3 rack-a
nd3-rb_3 rack-b
nd3-rc_3 rack-c
nd4-ra_4 rack-a
nd4-rb_4 rack-b
nd4-rc_4 rack-c
In this case, if you set your replication option to DC1:3, each of the racks a, b, and c will have 100% of your data (each node within a rack holding 25%).
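A sketch of that final layout applied to the keyspace from the question (the keyspace name mySpace comes from the question itself):

# One DC, three copies, spread across racks a, b, and c by the snitch
cqlsh -e "ALTER KEYSPACE mySpace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};"
# Then repair each node so the data matches the new replica placement
nodetool repair mySpace

Repairing after a replication change is the standard follow-up so every node actually holds the replicas the new layout assigns to it.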

Cassandra token for three replicas

I'm trying to build two 3-node Cassandra clusters in separate data centers. I want to have NetworkTopologyStrategy replication between them, with a replication factor of 3 in each. Thus, I want each node in each data center to have the same records.
Question: what should my token assignment look like for each node? (Since I'm not actually partitioning, just replicating.)
Thank you!
If you're using Cassandra 1.2, use virtual nodes with automatic assignment.
If you're using 1.1 or earlier, use evenly distributed tokens for one DC:
0
56713727820156410577229101238628035242
113427455640312821154458202477256070484
(0, 1 and 2 times 2**127/3)
For the other DC, you can choose anything as long as it is also evenly distributed. Offsetting by 1 works:
1
56713727820156410577229101238628035243
113427455640312821154458202477256070485
Although for now the tokens don't matter since all nodes hold the same data, if you want to scale in the future it will help to have them already balanced.
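If you want to double-check that arithmetic, the tokens above can be regenerated from the shell (this just reproduces the numbers in the answer):

python3 -c 'for i in range(3): print(i * (2**127 // 3))'        # first DC
python3 -c 'for i in range(3): print(i * (2**127 // 3) + 1)'    # second DC, offset by 1

The offset-by-1 trick keeps both rings evenly spaced while ensuring no two nodes share the same token.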
