Can we use Cassandra on nodes with different disk sizes? If so, how does Cassandra balances nodes and do we have any control over it?
I've found this thread but it's quite old http://grokbase.com/t/cassandra/user/113nvs23r4/cassandra-nodes-with-mixed-hard-disk-sizes
Its highly recommended not to introduce imbalance of nodes in a cluster (at least within the same DC) in terms of hard disk, CPU, Memory. All nodes in the cluster are treated equal and there is no intelligence behind the disk capacity on each node.
Unless you can take the pain of manually distributing tokens instead of using vnodes, this is not advisable. In case of manual distribution, one has control over which node to assign more tokens and where less. Again hoping and praying that the data distribution is uniform and hence the node with less tokens will get less data.
Related
We have the need to reduce vtokens on a Cassandra cluster (2 nodes) to compensate for both machines having different storage capabilities. The replication factor is currently 1 so there is no replication of data happening.
Can't we simple reduce vtokens to 32 instead of current 256 and restart server? What'll happen if we try this? Will it stream the extra tokens or will we loose data?
We read about decommissioning the node to copy all data to the bigger one, reconfigure it to have less vtokens, delete cassandra data locally and make it rejoin the cluster, just wondering what happens if we try to reduce vtokens before decommissioning it?
Thanks!
You can't do balancing with vnodes. By virtue of statistics you should have a pretty even distribution of data across your nodes even with 32 vnodes. And fewer vnodes will give you better search performance.
Also keep an eye on CASSANDRA-7032, this should let us go to even lower num_tokens without sacrificing data distribution.
I have 6 cassandra nodes in two datacenters with 16GB of memory and 1TB HD drive.
Now I am adding 3 more nodes with 32GB of memory. will these machines will cause overhead for existing machines ( May be in token distribution )? if so please suggest how to configure these machine to avoid those problems.
Thanks in advance.
The "balance" between nodes is best regulated using vnodes. If you recall (if you don't, you should read about it), the ring that Cassandra nodes form is actually consisted out of virtual nodes (vnodes). Each node in the ring has a certain portion of vnodes, which is set up in the Cassandra configuration on each node. Based on that number of vnodes, or rather the proportion between them, the amount of data going to those nodes is calculated. The configuration you are looking for is num_tokens. If you have similarly powerful machines, than an equal vnode number is available. The default is 256.
When adding a new, more powerful machine, you should assign a greater number of vnodes to it. How much? I think it's hard to tell. It's unwise to give it twice more, only be looking at the RAM, since those nodes will have twice as many data than the others. Than you might expect more IO operations on them (remember, you still have the same HDD) and CPU utilization (and the same CPU). You might want to take a look at this answer also.
If you run cassandra on machines with different sizes of hard disk (e.g. one with one TB, another with 2TB), can I use num_tokens as load factor?
I want to reduce the risk of one node running out of disk space and balance the usage of disk on different machines.
I know, the more data one node collects, the more probable it might become a hotspot. Apart from that, which other considerations do I need to take care of? Which limits or practical restrictions exist for the number of nodes?
Can I change the number of nodes later if disk space changes without trouble?
I would appreciate some advice on that topic because I have not found much information about that in google or at the website of cassandra.
EDIT: numnodes replaced by num_tokens
Are you referring to num_tokens settings? Yes, you can use a different number of tokens based on the hardware resources. Nodes with a larger number of tokens will see higher load and disk usage. Once set, the num_tokens setting cannot be changed at a later point without decommissioning the node.
I am considering the design of a Cassandra cluster.
The use case would be storing large rows of tiny samples for time series data (using KairosDB), data will be almost immutable (very rare delete, no updates). That part is working very well.
However, after several years the data will be quite large (it wil reach a maximum size of several hundreds of terabytes - over one petabyte considering the replication factor).
I am aware of advice not to use more than 5TB of data per Cassandra node because of high I/O loads during compactions and repairs (which is apparently already quite high for spinning disks).
Since we don't want to build an entire datacenter with hundreds of nodes for this use case, I am investigating if this would be workable to have high density servers on spinning disks (e.g. at least 10TB or 20TB per node using spinning disks in RAID10 or JBOD, servers would have good CPU and RAM so the system will be I/O bound).
The amount of read/write in Cassandra per second will be manageable by a small cluster without any stress. I can also mention that this is not a high performance transactional system but a datastore for storage, retrievals and some analysis, and data will be almost immutable - so even if a compaction or a repair/reconstruction that take several days of several servers at the same time it's probably not going to be an issue at all.
I am wondering if some people have an experience feedback for high server density using spinning disks and what configuration you are using (Cassandra version, data size per node, disk size per node, disk config: JBOD/RAID, type of hardware).
Thanks in advance for your feedback.
Best regards.
The risk of super dense nodes isn't necessarily maxing IO during repair and compaction - it's the inability to reliably resolve a total node failure. In your reply to Jim Meyer, you note that RAID5 is discouraged because the probability of failure during rebuild is too high - that same potential failure is the primary argument against super dense nodes.
In the days pre-vnodes, if you had a 20T node that died, and you had to restore it, you'd have to stream 20T from the neighboring (2-4) nodes, which would max out all of those nodes, increase their likelihood of failure, and it would take (hours/days) to restore the down node. In that time, you're running with reduced redundancy, which is a likely risk if you value your data.
One of the reasons vnodes were appreciated by many people is that it distributes load across more neighbors - now, streaming operations to bootstrap your replacement node come from dozens of machines, spreading the load. However, you still have the fundamental problem: you have to get 20T of data onto the node without bootstrap failing. Streaming has long been more fragile than desired, and the odds of streaming 20T without failure on cloud networks are not fantastic (though again, it's getting better and better).
Can you run 20T nodes? Sure. But what's the point? Why not run 5 4T nodes - you get more redundancy, you can scale down the CPU/memory accordingly, and you don't have to worry about re-bootstrapping 20T all at once.
Our "dense" nodes are 4T GP2 EBS volumes with Cassandra 2.1.x (x >= 7 to avoid the OOMs in 2.1.5/6). We use a single volume, because while you suggest "cassandra now supports JBOD quite well", our experience is that relying on Cassandra's balancing algorithms is unlikely to give you quite what you think it will - IO will thundering herd between devices (overwhelm one, then overwhelm the next, and so on), they'll fill asymmetrically. That, to me, is a great argument against lots of small volumes - I'd rather just see consistent usage on a single volume.
I haven't used KairosDB, but if it gives you some control over how Cassandra is used, you could look into a few things:
See if you can use incremental repairs instead of full repairs. Since your data is an immutable time series, you won't often need to repair old SSTables, so incremental repairs would just repair recent data.
Archive old data in a different keyspace, and only repair that keyspace infrequently such as when there is a topology change. For routine repairs, only repair the "hot" keyspace you use for recent data.
Experiment with using a different compaction strategy, perhaps DateTiered. This might reduce the amount of time spent on compaction since it would spend less time compacting old data.
There are other repair options that might help, for example I've found the the -local option speeds up repairs significantly if you are running multiple data centers. Or perhaps you could run limited repairs more frequently rather than performance killing full repairs on everything.
I have some Cassandra clusters that use RAID5. This has worked fine so far, but if two disks in the array fail then the node becomes unusable since writes to the array are disabled. Then someone must manually intervene to fix the failed disks or remove the node from the cluster. If you have a lot of nodes, then disk failures will be a fairly common occurrence.
If no one gives you an answer about running 20 TB nodes, I'd suggest running some experiments on your own dataset. Set up a single 20 TB node and fill it with your data. As you fill it, monitor the write throughput and see if there are intolerable drops in throughput when compactions happen, and at how many TB it becomes intolerable. Then have an empty 20 TB node join the cluster and run a full repair on the new node and see how long it takes to migrate its half of the dataset to it. This would give you an idea of how long it would take to replace a failed node in your cluster.
Hope that helps.
I would recommend to think about the data model of your application and how to partition your data. For time series data it would probably make sense to use a composite key [1] which consists of a partition key + one or more columns. Partitions are distributed across multiple servers according to the hash of the partition key (depending on the Cassandra Partitioner that you use, see cassandra.yaml).
For example, you could partition your server by device that generates the data (Pattern 1 in [2]) or by a period of time (e.g., per day) as shown in Pattern 2 in [2].
You should also be aware that the max number of values per partition is limited to 2 billion [3]. So, partitioning is highly recommended. Don't store your entire time series on a single Cassandra node in a single partition.
[1] http://www.planetcassandra.org/blog/composite-keys-in-apache-cassandra/
[2] https://academy.datastax.com/demos/getting-started-time-series-data-modeling
[3] http://wiki.apache.org/cassandra/CassandraLimitations
I have created two node Cassandra cluster and try to perform load test. I find that one node or two node not making much difference in the through put I have supposed if 1 node can provide me 2000 tps for insert the two node should double the amount. Is it work like that?
if it is not then what actually Scaling means and how can I relate with it latency or throughput.
Cassandra is scalable. Just your case is a bit simplified since two nodes is not really the case of high scalability. You should be aware or the token partitioning algorithm used by Cassandra. As soon as you understand it, there should not be any quesitons. There is plenty of presentations about that. E.g. this one: http://www.datastax.com/resources/tutorials/partitioning-and-replication
In case of replication factor 1 everything is simple:
Each key-value pair you save/read from/to Cassandra is a query to one of Cassandra nodes in the cluster. Data is evenly distributed among nodes (see details of partitioning algorithm). So you always have total load evenly distributed among all nodes -> more nodes you have more load they can carry (and it is linear). In this case the system should of course be configured in a right way to avoid different kinds of network bottlenecks.
In case of replication factor more than 1 the situation is a bit more complicated, however the principle is the same.
There are lot of factors that contribute to this result.
A) check your replication factor. Although not desirable, in your case you can set it to 1
B) look into the shard in your primary key. If in your tests you are not changing it, then you are loading the data skewed and that the table is not scaling out to 2 nodes.
What does it mean when we say Casssandra is scalable?
There are basically two ways to scale a database.
Vertical scaling: Increasing the resources of the existing nodes in your cluster (more RAM, faster HDDs, more cores).
Horizontal scaling: Adding additional nodes to your cluster.
Vertical scaling tends to be more of a "band-aid" or temporary solution, because it has very finite limits. Your machines will only support so much RAM or so many cores, and once you max that out you really don't have anywhere to go.
Cassandra is "scalable" because it simplifies horizontal scaling. If you find that your existing nodes are maxing-out their available resources, you can simply add another node(s), adjust your replication factor, and run a nodetool repair. If you have had to do this with other database products, you will appreciate how (relatively) easy Cassandra makes it.
In your case, it's hard to know what exactly is going on without (a lot) more detail. But if your load tests are being adequately handled by your first node, then I can see why you wouldn't notice much of a difference by adding another.
If you haven't already, check out the Cassandra Stress Tool.
Additionally, be sure to check your current methods against this article, which is appropriately titled: How not to benchmark Cassandra