How does Yugabyte handle single disk failure? - yugabytedb

I want to know what is the behavior of Yugabyte in case of single disk failures. Specifically:
What procedure should the sysadmin follow while dealing with a single disk failure? e.g. does stopping the TServer momentarily, changing the disk and restarting it automatically copy that disk's tablets from other TServers?
Does a TServer automatically detect a disk failure and spread the data on that disk to other disks if the failed disk is not replaced within a stipulated time frame?
Thank you.

What procedure should the sysadmin follow while dealing with a single disk failure? e.g. does stopping the TServer momentarily, changing the disk and restarting it automatically copy that disk's tablets from other TServers?
Yes, the following procedure works, YugabyteDB would recognize the data was missing and rebuild the data:
Stop the tservers
Replace the disks
Restart the tservers
You can try this in a sandbox-like environment on your laptop using yb-ctl in case you want to understand this behavior. The setup/procedure is the same as this fault-tolerance experiment. Instead of removing nodes, simply stop a node, delete one or more "disks" (which are really just directories) and start the node back. I believe there are two disks created by default.
Here is the manual for yb-ctl.
PS: If you're doing this exercise, please consider contributing to the fault-tolerance section on docs, would be happy to help you if you're interested.
Does a TServer automatically detect a disk failure and spread the data on that disk to other disks if the failed disk is not replaced within a stipulated time frame?
Yes, this is done automatically if there are enough other nodes to replicate data on. For example:
In a single-zone setup with replication factor 3: if you started with 4 nodes or more, then there would be at least three nodes left after one failed. In this case, re-replication is automatically started if a tserver is down for 10 minutes.
In a multi-zone setup with replication-factor 3: YugabyteDB will try to keep one copy of data per zone. In this case, for automatic re-replication of data, a zone needs to have at least two tservers so that if one fails, its data can be re-replicated to the other. Thus, this would mean at least a 6 node cluster.

Related

High CPU usage and traffic on some Cassandra nodes

As stated in the title, we are having a problem with our Cassandra cluster. There are 9 nodes with a replication factor of 3 using NetworkTopologyStrategy. All in the same DC and Rack. Cassandra version is 3.11.4 (planning to move on 3.11.10). Instances have 4 CPU and 32 GB RAM. (planning to move on 8 CPU)
Whenever we try to run repair on our cluster (using Cassandra Reaper on one of our nodes), we lose one node somewhere in the process. We quickly stop the repair, restart Cassandra service on the node and wait for it to join the ring. Therefore we are never able to run repair these days.
I observed the problem and realized that this problem is caused by high CPU usage on some of our nodes (exactly 3). You may see the 1 week interval graph in below. Ups and downs are caused by the usage of the app. In the mornings, it's very low.
I compared the running processes on each node and there is nothing extra on the high CPU nodes. I compared the configurations. They are identical. Couldn't find any difference.
I also realized that these nodes are the ones that take most of the traffic. See the 1 week interval graph in below. Both sent & received bytes.
I made some research. I found this thread and at the end it is recommended to set dynamic_snitch: false in Cassandra configuration. I looked at our snitch strategy which is GossipingPropertyFileSnitch. In practice, this strategy should work properly but I guess it doesn't.
The job of a snitch is to provide information about your network topology so that Cassandra can efficiently route requests.
My only observation that could be cause of this issue is there is a file called cassandra-topology.properties which is specifically told to be removed if using GossipingPropertyFileSnitch
The rack and datacenter for the local node are defined in cassandra-rackdc.properties and propagated to other nodes via gossip. If cassandra-topology.properties exists, it is used as a fallback, allowing migration from the PropertyFileSnitch.
I did not remove this file as I couldn't find any hard proof that this is causing the issue. If you have any knowledge on this or see any other reason to my problem, I would appreciate your help.
These two sentences tell me some important things about your cluster:
high CPU usage on some of our nodes (exactly 3).
I also realized that these nodes are the ones that take most of the traffic.
The obvious point, is that your replication factor (RF) is 3 (most common). The not-so-obvious, is that your data model is likely keyed on date or some other natural key which results in the same (3?) nodes serving all of the traffic for long periods of time. Running repair during those high-traffic periods will likely lead to issues.
Some things to try:
Have a look at the data model, and see if there's a better way to partition the data to distribute traffic over the rest of the cluster. This is often done with a modeling technique known as "bucketing" (adding another component...usually time based...to the partition key).
Are the partitions large? (Check with nodetool tablehistograms) And by "large," like > 10MB? It could also be that the large partitions are causing the repair operations to fail. If so, hopefully lowering resource consumption (below) will help.
Does your cluster sustain high amounts of write throughput? If so, it may also be dealing with compactions (nodetool compactionstats). You could try lowering compaction throughput (nodetool setcompactionthroughput) to free up some resources. Repair operations can also invoke compactions.
Likewise, you can also lower streaming throughput (nodetool setstreamthroughput) during repairs. Repairs will take longer to stream data, but if that's what is really tipping-over the node(s), it might be necessary.
In case you're not already, set up another instance and use Cassandra Reaper for repairs. It is so much better than triggering from cron. Plus, the UI allows for some finely-tuned config which might be necessary here. It also lets you pause and resume repairs, to pick-up where it leaves off.

Disk space of Cassandra node is over 80%

I'm running 12 nodes of Cassandra in AWS EC2 instance, 4 of them are using almost 80% of the disk space, so compaction failed on these nodes, since the type of the server is EC2 instance, I can't add mode disk space to the existing data volume on the fly, I can't ask IT team to add more nodes to scale and spread the clustre as disk space of other nodes is less than 40%, before fixing the unbalanced cluster issue, is there any way to free up some disk space?
My question is how can I find unused sstables and move them to another partition to run compaction and free up some space?
Any other suggestion to free up some disk space.
PS: I already dropped all the snapshots and backups.
If you are using vnodes then data sizes difference should not be that much. Before coming to solution we must find the reason for big difference in data sizes on different nodes.
You must look into logs to see if there is corruption of some big SStable which resulted in compaction failures and increase in data sizes. Or you can find something in your logs which points to the reason in increasing of disk sizes.
We faced an issue in Cassandra 2.1.16 due to some bug it happened that even after compaction old sstable files were not removed. We read the logs and identified the files which can be removed. This is an example where we found the reason of increased data size after reading the logs.
So your must identify the reason before solution. If it is a dire state you can identify keyspaces/tables which are not used during your main traffic and move those sstables in backup and remove those sstables. Once your compaction process is over you can bring them back.
Warning :Test any procedure before trying on production.

Cassandra cluster - data density (data size per node) - looking for feedback and advises

I am considering the design of a Cassandra cluster.
The use case would be storing large rows of tiny samples for time series data (using KairosDB), data will be almost immutable (very rare delete, no updates). That part is working very well.
However, after several years the data will be quite large (it wil reach a maximum size of several hundreds of terabytes - over one petabyte considering the replication factor).
I am aware of advice not to use more than 5TB of data per Cassandra node because of high I/O loads during compactions and repairs (which is apparently already quite high for spinning disks).
Since we don't want to build an entire datacenter with hundreds of nodes for this use case, I am investigating if this would be workable to have high density servers on spinning disks (e.g. at least 10TB or 20TB per node using spinning disks in RAID10 or JBOD, servers would have good CPU and RAM so the system will be I/O bound).
The amount of read/write in Cassandra per second will be manageable by a small cluster without any stress. I can also mention that this is not a high performance transactional system but a datastore for storage, retrievals and some analysis, and data will be almost immutable - so even if a compaction or a repair/reconstruction that take several days of several servers at the same time it's probably not going to be an issue at all.
I am wondering if some people have an experience feedback for high server density using spinning disks and what configuration you are using (Cassandra version, data size per node, disk size per node, disk config: JBOD/RAID, type of hardware).
Thanks in advance for your feedback.
Best regards.
The risk of super dense nodes isn't necessarily maxing IO during repair and compaction - it's the inability to reliably resolve a total node failure. In your reply to Jim Meyer, you note that RAID5 is discouraged because the probability of failure during rebuild is too high - that same potential failure is the primary argument against super dense nodes.
In the days pre-vnodes, if you had a 20T node that died, and you had to restore it, you'd have to stream 20T from the neighboring (2-4) nodes, which would max out all of those nodes, increase their likelihood of failure, and it would take (hours/days) to restore the down node. In that time, you're running with reduced redundancy, which is a likely risk if you value your data.
One of the reasons vnodes were appreciated by many people is that it distributes load across more neighbors - now, streaming operations to bootstrap your replacement node come from dozens of machines, spreading the load. However, you still have the fundamental problem: you have to get 20T of data onto the node without bootstrap failing. Streaming has long been more fragile than desired, and the odds of streaming 20T without failure on cloud networks are not fantastic (though again, it's getting better and better).
Can you run 20T nodes? Sure. But what's the point? Why not run 5 4T nodes - you get more redundancy, you can scale down the CPU/memory accordingly, and you don't have to worry about re-bootstrapping 20T all at once.
Our "dense" nodes are 4T GP2 EBS volumes with Cassandra 2.1.x (x >= 7 to avoid the OOMs in 2.1.5/6). We use a single volume, because while you suggest "cassandra now supports JBOD quite well", our experience is that relying on Cassandra's balancing algorithms is unlikely to give you quite what you think it will - IO will thundering herd between devices (overwhelm one, then overwhelm the next, and so on), they'll fill asymmetrically. That, to me, is a great argument against lots of small volumes - I'd rather just see consistent usage on a single volume.
I haven't used KairosDB, but if it gives you some control over how Cassandra is used, you could look into a few things:
See if you can use incremental repairs instead of full repairs. Since your data is an immutable time series, you won't often need to repair old SSTables, so incremental repairs would just repair recent data.
Archive old data in a different keyspace, and only repair that keyspace infrequently such as when there is a topology change. For routine repairs, only repair the "hot" keyspace you use for recent data.
Experiment with using a different compaction strategy, perhaps DateTiered. This might reduce the amount of time spent on compaction since it would spend less time compacting old data.
There are other repair options that might help, for example I've found the the -local option speeds up repairs significantly if you are running multiple data centers. Or perhaps you could run limited repairs more frequently rather than performance killing full repairs on everything.
I have some Cassandra clusters that use RAID5. This has worked fine so far, but if two disks in the array fail then the node becomes unusable since writes to the array are disabled. Then someone must manually intervene to fix the failed disks or remove the node from the cluster. If you have a lot of nodes, then disk failures will be a fairly common occurrence.
If no one gives you an answer about running 20 TB nodes, I'd suggest running some experiments on your own dataset. Set up a single 20 TB node and fill it with your data. As you fill it, monitor the write throughput and see if there are intolerable drops in throughput when compactions happen, and at how many TB it becomes intolerable. Then have an empty 20 TB node join the cluster and run a full repair on the new node and see how long it takes to migrate its half of the dataset to it. This would give you an idea of how long it would take to replace a failed node in your cluster.
Hope that helps.
I would recommend to think about the data model of your application and how to partition your data. For time series data it would probably make sense to use a composite key [1] which consists of a partition key + one or more columns. Partitions are distributed across multiple servers according to the hash of the partition key (depending on the Cassandra Partitioner that you use, see cassandra.yaml).
For example, you could partition your server by device that generates the data (Pattern 1 in [2]) or by a period of time (e.g., per day) as shown in Pattern 2 in [2].
You should also be aware that the max number of values per partition is limited to 2 billion [3]. So, partitioning is highly recommended. Don't store your entire time series on a single Cassandra node in a single partition.
[1] http://www.planetcassandra.org/blog/composite-keys-in-apache-cassandra/
[2] https://academy.datastax.com/demos/getting-started-time-series-data-modeling
[3] http://wiki.apache.org/cassandra/CassandraLimitations

moving Cassandra snapshots to a different disk/server/datacenter

I have Cassandra 1.2.6 cluster running on datacenter A, each node has a solid state drive with somewhat limited space (aprox 50% of disk space is free).
Now I need to implement somehow a way of having automatic backups of each node. Ideally I want to have a way of moving all of the cluster's datafiles to a different disk (standard cheaper disks), or even to a different server in the same datacenter A and possibly moving all the data once in a while to a datacenter B in a different location.
From what I've read I can use snapshots on each node to get the files to copy using whatever tool I want and in this case I have the option to move the data to a different disk/server/datacenter.
My question is, since each of my nodes is about 50% full, taking a snapshot will consume all that space? or the hard links will consume way less space than I anticipate?, if so, is there a better way of doing this, maybe with an already made tool, or everything should be custom made when it comes to this type of backups in Cassandra?
Thanks in advance!
A hard link just creates a new directory entry for the same file (http://en.wikipedia.org/wiki/Hard_link). So a snapshot takes up effectively zero space, but you'll want to clean it up after you're done copying it off to whatever your archive is, because when the "original" sstable is deleted (typically post-compaction), space won't be reclaimed as long as the snapshot reference is still there.
My impression is that tablesnap is the most popular tool for automating backups to s3. It also supports Cassandra incremental backups. If you want more control over where you're backing up to, DataStax OpsCenter supports running a custom script when it takes snapshots.

Cassandra Compaction takes all the resources and leads to node failure

I met very strange problem during testing cassandra. I have a very simple column family that stores video data (keys point to time period and there is only one column containing ~2MB video for this period).
Use Case
I start to load data using Hector API (round-robin) to 6 empty nodes (8GB RAM for Cassandra)- load is run in 4 threads adding 4 rows in second for each thread.
After a while (running load for hour or so) near 100-200 GB are added to the node (depending on the replication factor) and then one or several nodes become unreachable. (no pinging just reboot helps)
Why Compaction
I do use tiered-level compaction and monitoring the system(Debian) i can see that it actually not writes but compaction that takes almost all resources (disk, memory) and cause server to refuse writes and than fail.
After like 30-40 minutes of test compaction tasks just cannot be handled and get queued. Interesting thing is that there are no deletes and updates - so compaction just reads/writes data again and again without bringing actual value to me (like it can be compacted once in the evening).
When i slow down the pace - i.e running 2 threads with 1 second delay things go better but whether it still be working when i have 20TB not 100 GB on a node.
Is Cassandra optimized for such type of workload? How the resources are normally distributed between compaction and reads/writes?
Update
Update of network driver solved problem with unreachable cluster
Thanks,
Sergey.
Cassandra will use up to in_memory_compaction_limit_in_mb memory for a compaction. It is routine to have compaction running while reads and writes are served simultaneously. It is also normal that compaction can fall behind if you continue to throw writes at it as fast as possible; if your read workload requires that compaction be up to date or close to it at all times, then you'll need a larger cluster to spread the load around more machines.
Recommended amount of disk per node for online queries is up to 500GB, maybe 1TB if you're pushing it. Remember that this amount of data will have to be rebuilt if a node fails. Typical Cassandra workloads are CPU-bound or iops-bound, not disk-space bound, so you won't be able to make good use of that space anyway.
(It's also possible to do batch analytics against Cassandra, which we do with the Cassandra Filesystem, in which case higher disk:cpu ratios are desirable, but we use a custom compaction strategy for that as well.)
It's not clear from your report why a server would become unreachable. This is really an OS-level problem. (Are you swapping? Disabling swap would be a good first step.)

Resources