Can I use num_tokens as load factor in Cassandra?

If I run Cassandra on machines with different hard disk sizes (e.g. one with 1 TB, another with 2 TB), can I use num_tokens as a load factor?
I want to reduce the risk of one node running out of disk space and to balance disk usage across the machines.
I know that the more data a node collects, the more likely it is to become a hotspot. Apart from that, which other considerations do I need to take care of? Which limits or practical restrictions exist for the number of tokens?
Can I change the number of tokens later without trouble if the disk situation changes?
I would appreciate some advice on this topic, because I have not found much information about it on Google or on the Cassandra website.
EDIT: numnodes replaced by num_tokens

Are you referring to the num_tokens setting? Yes, you can use a different number of tokens based on each node's hardware resources. Nodes with a larger number of tokens will see higher load and disk usage. Once set, num_tokens cannot be changed at a later point without decommissioning the node.
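A minimal sketch of what that could look like (the values are illustrative, not a recommendation for your hardware):

    # In cassandra.yaml on the 1 TB node:
    #   num_tokens: 128
    # In cassandra.yaml on the 2 TB node:
    #   num_tokens: 256
    # After both nodes have joined, check the "Owns" column to verify the split:
    nodetool status

With proportional token counts like these, the 2 TB node should end up owning roughly twice as much of the ring as the 1 TB node.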

Related

The connection between replication factor and resource usage

I am a Cassandra user in China. We recently decided to use Cassandra in our production environment, but I don't know how the replication factor affects resource consumption.
My stress tests show that a replication factor of 3 uses three times more resources than a replication factor of 1, but I'm not sure that's right.
So I would like to ask: is there a formula relating replication factor to resource consumption? Or has anyone tested it?
I would be very grateful if anyone could reply.
First of all, RF=3 means you need at least three servers (obviously). But really, it depends on what you mean by "resources." If that mainly refers to disk space, then yes, setting RF=3 will use 3x the disk space that a single copy (RF=1) would.
So why would you want that? Because supporting data loads in highly-available (HA) scenarios is what Cassandra does really well. This means that Cassandra needs to be able to continue to serve requests if a node should fail. Achieving that means setting RF>1.
As for the remaining resources, if you're referring to network, CPU & RAM as well, then the answer is "it depends." An application can choose to query at different consistency levels, such as ONE, QUORUM, or ALL (and others). For ONE, it does just what it says: an operation (read or write) waits for acknowledgement from a single node.
So if an app is querying at a consistency of ONE, the answer is "no," it won't use three times the resources if RF=3.
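To make the distinction concrete, here is a hedged sketch (the keyspace "demo", table "users", and datacenter name "dc1" are made up for illustration):

    # three replicas of every partition in datacenter "dc1":
    cqlsh -e "CREATE KEYSPACE demo WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};"
    # a read at consistency ONE waits for a single replica to answer, regardless of RF:
    cqlsh -e "CONSISTENCY ONE; SELECT * FROM demo.users WHERE id = 42;"

So RF controls how many copies exist on disk, while the consistency level controls how many of those copies must participate in each individual request.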
Cassandra is a distributed database, so it stores data based on the partition key and a hashing algorithm. We can configure the number of replicas of our data based on the requirements and nature of the application. A cluster of at least 3 nodes is recommended for production, but you can configure the replication factor (the number of copies of the data) however you wish.
If you use a 3-node cluster with RF=3, then every node holds a full copy of the data (with RF=1, each node would hold roughly 1/3 of it). For good performance the resources of all 3 nodes (disk, CPU, memory, I/O, etc.) should be roughly equal. You can also tune several things inside Cassandra (consistency, compaction, network, OS) to improve performance and resource efficiency. Three copies of the data will use more memory and disk than one copy, but if you care about availability and performance you should keep at least 2 copies. You can refer to the link below for more details on calculating RF and related sizing:
https://www.ecyrd.com/cassandracalculator/
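As a back-of-the-envelope check of the disk side (the numbers are invented):

    # cluster-wide disk footprint is roughly unique data size x RF (before compaction overhead and snapshots)
    echo "$((200 * 3)) GB"   # e.g. 200 GB of unique data at RF=3 -> 600 GB across the cluster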

Lowering the number of vnodes in Cassandra

We need to reduce the number of vnodes on a Cassandra cluster (2 nodes) to compensate for the two machines having different storage capacities. The replication factor is currently 1, so no data is being replicated.
Can't we simply reduce num_tokens from the current 256 to 32 and restart the server? What will happen if we try this? Will it stream the extra token ranges, or will we lose data?
We read about decommissioning the node so that all of its data is copied to the bigger one, reconfiguring it with fewer vnodes, deleting its local Cassandra data, and having it rejoin the cluster. We are just wondering what happens if we try to reduce num_tokens before decommissioning it.
Thanks!
You can't do balancing with vnodes. By virtue of statistics you should have a pretty even distribution of data across your nodes even with 32 vnodes. And fewer vnodes will give you better search performance.
Also keep an eye on CASSANDRA-7032; it should let us go to even lower num_tokens values without sacrificing data distribution.
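For reference, the decommission-and-rejoin path mentioned in the question would look roughly like this (the service name and data paths assume a typical package install and are only illustrative):

    nodetool decommission                 # streams this node's data to the remaining node(s)
    sudo systemctl stop cassandra         # assumes a systemd service named "cassandra"
    sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
    # edit cassandra.yaml: set num_tokens to the new value (e.g. 32)
    sudo systemctl start cassandra        # the node bootstraps back in with the new token count

Simply lowering num_tokens in cassandra.yaml on a node that already holds data and restarting is not supported; the node refuses to start with a "cannot change the number of tokens" error, which is why the decommission route is the one to take.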

Cassandra with mixed disk sizes?

Can we use Cassandra on nodes with different disk sizes? If so, how does Cassandra balance the nodes, and do we have any control over it?
I've found this thread, but it's quite old: http://grokbase.com/t/cassandra/user/113nvs23r4/cassandra-nodes-with-mixed-hard-disk-sizes
It is highly recommended not to introduce an imbalance between nodes in a cluster (at least within the same DC) in terms of hard disk, CPU, or memory. All nodes in the cluster are treated as equals, and there is no intelligence that takes the disk capacity of each node into account.
Unless you are willing to take on the pain of manually distributing tokens instead of using vnodes, this is not advisable. With manual distribution you control which nodes are assigned more of the token range and which less, while hoping and praying that the data distribution is uniform, so that the node with fewer tokens really does receive less data (see the sketch below).
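To give a flavour of what manual token assignment involves, here is a small sketch that prints evenly spaced initial_token values for the Murmur3Partitioner; skewing the split toward a bigger node is then a matter of shifting these boundaries by hand (the node count is illustrative):

    # print evenly spaced Murmur3 initial_token values for a 3-node ring;
    # a bigger node would simply be handed a wider slice of this range
    python3 -c "n=3; print(*[i*(2**64)//n - 2**63 for i in range(n)], sep='\n')"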

How to add new nodes in Cassandra with different machine configuration

I have 6 Cassandra nodes across two datacenters, each with 16 GB of memory and a 1 TB hard drive.
Now I am adding 3 more nodes with 32 GB of memory. Will these machines cause overhead for the existing machines (perhaps in token distribution)? If so, please suggest how to configure these machines to avoid those problems.
Thanks in advance.
The "balance" between nodes is best regulated using vnodes. If you recall (and if you don't, you should read about it), the ring that Cassandra nodes form actually consists of virtual nodes (vnodes). Each node in the ring owns a certain number of vnodes, which is set in the Cassandra configuration on each node. Based on that number of vnodes, or rather the proportion between them, the amount of data going to each node is determined. The configuration option you are looking for is num_tokens. If your machines are similarly powerful, an equal vnode count is appropriate; the default is 256.
When adding a new, more powerful machine, you should assign a greater number of vnodes to it. How many more? That is hard to say. It is unwise to give it twice as many just by looking at the RAM, since those nodes would then hold twice as much data as the others. You could then expect more I/O operations on them (remember, you still have the same HDD) and higher CPU utilization (and the same CPU). You might want to take a look at this answer as well.
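Whatever vnode counts you settle on, two commands are worth running once the new nodes have finished bootstrapping (a general sketch, not specific to this cluster):

    # check how ring ownership shifted after the new nodes joined:
    nodetool status
    # on each pre-existing node, reclaim disk space for the ranges it no longer owns:
    nodetool cleanup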

What does it mean when we say Cassandra is scalable?

I have created a two-node Cassandra cluster and tried to perform a load test. I find that one node or two nodes does not make much difference in throughput. I assumed that if 1 node can give me 2000 TPS for inserts, then two nodes should double that amount. Does it work like that?
If not, what does scaling actually mean, and how do I relate it to latency or throughput?
Cassandra is scalable. It is just that your case is a bit simplified, since two nodes is not really a high-scalability scenario. You should be aware of the token partitioning algorithm used by Cassandra; as soon as you understand it, there should not be any questions left. There are plenty of presentations about it, e.g. this one: http://www.datastax.com/resources/tutorials/partitioning-and-replication
With a replication factor of 1 everything is simple:
Each key-value pair you write to or read from Cassandra is a query to one of the nodes in the cluster. Data is evenly distributed among the nodes (see the details of the partitioning algorithm), so the total load is always evenly distributed across all nodes: the more nodes you have, the more load they can carry, and it scales linearly. The system should of course be configured correctly to avoid various kinds of network bottlenecks.
With a replication factor greater than 1 the situation is a bit more complicated, but the principle is the same.
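If you want to see this routing for yourself, a quick sketch (the keyspace "demo", table "users", and key "42" are placeholders for your own schema):

    # ask Cassandra which node(s) own the partition for a given key:
    nodetool getendpoints demo users 42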
There are a lot of factors that contribute to this result.
A) Check your replication factor. Although not desirable in general, in your test case you can set it to 1.
B) Look at the partition key in your primary key. If you are not varying it in your tests, then you are loading skewed data and the table is not scaling out across the 2 nodes (see the sketch below).
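As an illustration of point B (the keyspace and table names are invented): a table keyed on something that varies per row spreads writes across the cluster, while a constant partition key would pin every write to the same replica set.

    cqlsh -e "CREATE TABLE demo.events_by_user (
        user_id uuid,      -- partition key: varies per user, so partitions spread across nodes
        ts      timeuuid,
        payload text,
        PRIMARY KEY (user_id, ts));"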
What does it mean when we say Cassandra is scalable?
There are basically two ways to scale a database.
Vertical scaling: Increasing the resources of the existing nodes in your cluster (more RAM, faster HDDs, more cores).
Horizontal scaling: Adding additional nodes to your cluster.
Vertical scaling tends to be more of a "band-aid" or temporary solution, because it has very finite limits. Your machines will only support so much RAM or so many cores, and once you max that out you really don't have anywhere to go.
Cassandra is "scalable" because it simplifies horizontal scaling. If you find that your existing nodes are maxing out their available resources, you can simply add one or more nodes, adjust your replication factor, and run a nodetool repair. If you have had to do this with other database products, you will appreciate how (relatively) easy Cassandra makes it.
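A rough sketch of those steps (the keyspace "demo" and datacenter "dc1" are placeholders):

    # 1. bring up the new node with the same cluster_name and seed list in cassandra.yaml, then confirm it joined:
    nodetool status
    # 2. raise the replication factor if desired:
    cqlsh -e "ALTER KEYSPACE demo WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};"
    # 3. stream data so the changed replica placement is fully populated:
    nodetool repair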
In your case, it's hard to know what exactly is going on without (a lot) more detail. But if your load tests are being adequately handled by your first node, then I can see why you wouldn't notice much of a difference by adding another.
If you haven't already, check out the Cassandra Stress Tool.
Additionally, be sure to check your current methods against this article, which is appropriately titled: How not to benchmark Cassandra
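For example, a basic cassandra-stress write run might look like this (the node addresses, thread count, and operation count are illustrative):

    cassandra-stress write n=1000000 -rate threads=50 -node 10.0.0.1,10.0.0.2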

Resources