Want to spread current cluster data over more and smallere sstables - cassandra

Our cassandra 2.1.15 application' KS (using STCS) are leveling in less than 100 sstables/node of which some data sstables are now getting into the +1TB size. This means heavy/longer compactions plus longer time before tombstones and their evicted data gets in the same compaction view (application do both create/read/delete of data), thus longer before real disk space gets reclaimed, this sucks :(
Our Application Vendor later revealed to us, that they normally recommend hashing the data over 10-20 CFs in the application KS rather than our currently created 3 CFs, guessing as an way to keep ratio of sstables vs sizes in a 'workable' range. Only the application can't have this changed now we have begun hashing data out in our 3 CFs.
Currently we got 14x linux node cluster, nodes of same HW and size (running w/equal amount of vnodes), originally constructed with two data_file_directories in two xfs FS on each their logical volumes - LVs backed each by a PV (6+1 raid5). Then as some nodes began to compact data skewed in these data dirs/LVs when growning sstable sizes, we merged both data dirs onto one LV and expanded this LV with the thus released PV. So we now got 7x nodes with two data dirs in one LV backed by two PVs and 7x nodes with two data dirs in two LVs on each their PV.
1) Now as sstable sizes keeps growning due to more data and using STCS (as recommend by App Vendor) we're thinking we might be able spread data over more and smallere sstables by simply adding more data dirs in our LVs as compensation for having less CFs rather than adding more HW nodes :) Wouldn't this work to spread data over more and smallere sstables or is the a catch in using multiple data dir compared with fewer?
1) Follow-up: must have had a brain fa.. that day, off course it won't :) The Compaction Strategy doesn't bother with over how many data dirs a CF' sstables are scattered only bothers with the sstables them selves according to the strategy. So only way to spread over more and smallere sstables is to hash data over more CFs. Too bad Vendor did the time-space trade off not to record in which CF a partition key is hashed a long with the key it self, then hashing might have been reseeded to a larger number of CFs. Now only way is to built a new cluster w/more CFs and migrate data there.
2) We could then possibly use either sstablesplit on the largest sstables or removing/rejoining with more than two data dirs node by node to get rit of the currently real big sstables. Would either approach work to get sstable sizes scaled down and which way is most recommendable?
2) Follow-up: well if one node is decommissioned is token range will be scatter to other nodes, specially when using multiple vnodes/node and thus one big sstables would be scatter over more nodes and left to the mercy of the compaction strategy at other nodes. But generally if 1 out of 14 nodes, each with 256 vnodes, would be scattered to the 13 other nodes for sure, right?
Thus only increasing other nodes' amount of data by roughly 1/13 of decommissioned node' content. But rejoining such a node again would properly only send roughly same amount of data back eventually getting compacted into similar sized sstables, meaning we've done a lot IO+streaming for nothing... Unless tombstones were among the original data but just to far apart to be lucky enough to enter same compaction views (small sstable vs large sstable), such an exercise may possible get data shuffled around giving better/other chance to get some tombstone+their data evicted through the scatter+rejoining faster than waiting to strategy to get TS+data in same compaction view, dunno... any thoughs on the value of possible doing this?

Huh that was a huge thought dump.
I'll try to get straight to the point. Using ANY type of raid (except stripe) is a deathtrap. If your nodes don't have sufficient space then you either add disks as JBODs to your nodes or scale out. Second thing is your application creating, deleting, updating and reading data and you are using STCS? And with all that you have 1TB+ per node? I don't even want to get into questioning the performance of that setup.
My suggestion would be to rethink the setup having data size, access patterns, read/write/delete/update ratios and data retention plans in mind. 14 nodes with 1TB+ of data each is not catastrophic (even thou the docu states that going past 600-800GB is bad, its not) but you need to change the approach. LCS works wonders for scenarios like yours and with proper planning you can have that cluster running a long time before having to scale out (or TTL your data) with decent performance.

Related

Cassandra CPU imbalance in Azure

We have a 30+ node Cassandra cluster (3.11.2) in 4 data centers. One of the centers consists of 8 nodes in Azure running on Standard DS12 v2 (4cpu, 28gb) nodes with a 500GB premium SSD drive. All in the same data center (central US).
We are seeing a dramatic CPU imbalance in the node activity when pushed to the max. We have a keyspace with about 200 million records, and we're running a process to check and refresh the records if necessary from another data stream.
What's happening, is we have 4 nodes that are running at 70-90% CPU compared to 15-25% of the other 4. The measurement of the CPU is being done in the nodes themselves, because Azure's own metrics is broken and never represents what is actually happening.
Digging into a pair of nodes (one low CPU and one high) the difference is the iowait% of the two. The data in the keyspace is balanced (within reason - they are all within 5% of another in record count and size). It looks like the number of reads is balanced, and even the read latency as reported by Cassandra is similar.
When I do an iostat compare of the nodes, the high CPU node is reporting a much higher (by 50 to 100%) rKB/s numbers... which is likely leading to the difference in iowait% time.
These nodes are 100% configured the same, running the same version of everything (OS, libraries, everything) that I can think to look. I cannot figure out why some nodes are deciding to do more disk reads that the others, resulting in the cluster as a whole slowing down.
Anybody have any suggestions on where I can look for differences?
The only thing that is a pattern, is the nodes that are slower are the 4 nodes that were added later in our expansion. We started with 4 nodes for a while and added 4 more when we needed space. All the appropriate repairs and other tasks required with node additions were done - the fact that the records and physical size of the data files on disk being equal should attest to that.
When we shut down our refresh process, the all the nodes settle down to a even 5% or less CPU across the board. No compaction or any other maintenance is happening that would indicate something different.
plz help... :)
Our final solution for this - to fix ONLY the unbalanced problem was to cleanup, full repair and compact. At that point the nodes are relatively equally used. We suspect expanding the cluster (adding nodes) may have left elements of data on the older nodes that were not compacted out based on regular compaction events.
We are still working to try to solve the load issue; but now at least all the nodes are feeling the same CPU crunch.

Cleanup space in almost full Cassandra Node

I have a Cassandra Cluster (2 DC) with 6 nodes each and RF 2. 4 of the nodes (in each DC) getting full so I need to cleanup space very soon.
I tried to run a full repair but ended up as a bad idea since the space start increased even more and the repair eventually hanged. As a last solution I am thinking to start repairing and then cleanup specific columns starting from the smallest to the biggest.
i.e
nodetool repair -full foo_keyspace bar_columnfamily
nodetool cleanup foo_keyspace bar_columnfamily
Do you think that this procedure will be safe for the data?
Thank you
The commands that you presented in your question make several incorrect assumptions. First, "repair" is not supposed to, and will not, save any space. All repair does is to find inconsistencies between different replicas and repair them. It will either do nothing (if there's no inconsistencies), or add data, not remove data.
Second, "cleanup" is something you need to do after adding new nodes to the cluster - after each node sent some of its data to the new node, a "cleanup" removes the data from the old nodes. But cleanup is not relevant when not adding node.
The command you may be looking for is "compact". This can save space, but only when you know you had a lot of overwrites (rewriting existing rows), deletions or data expirations (TTL). What compaction strategy are you using? If it's the default, size-tiered compaction strategy (STCS) you can start major compaction (nodetool compact) but should be aware of a big risk involved:
Major compaction merges all the data into one sstable (Cassandra's on-disk file format), dropping deleted, expired or overwritten data. However, during this compaction process, you have both input and output files, and at worst case this may double your disk usage, and may fail if the disk is more than 50% full. This is why a lot of Cassandra best-practice guides suggest never to fill more than 50% of the disk. But this is just the worst case. You can get along with less free space if you know that the output file will be much smaller than the input (because most of the data has been deleted). Perhaps more usefully, if you have many separate tables (column family), you can compact each one separately (as you suggested, from smallest to biggest) and the maximum amount of disk space needed temporarily during the compaction can be much less than 50% of the disk.
Scylla, a C++ reimplementation of Cassandra, is developing something known as "hybrid compaction" (see https://www.slideshare.net/ScyllaDB/scylla-summit-2017-how-to-ruin-your-performance-by-choosing-the-wrong-compaction-strategy) which is like Cassandra's size-tiered compaction but does compaction in small pieces instead of generating one huge file, to avoid the huge temporary disk usage during compaction. Unfortunately, Cassandra doesn't have this feature yet.
Good idea is first start repair on smallest table on smallest keyspace one by one and complete repair. It will take time but safer way and no chance to hang and traffic loss.
Once repair completed start cleanup in the same way as repair. This way no impact on node and cluster as well.
You shouldn't fill more than about 50-60 % of your disks to make room for compaction. If you're above that amount of disk usage you need to consider getting bigger disks or add more nodes.
Datastax recommendations are usually good to follow: https://docs.datastax.com/en/dse-planning/doc/planning/planPlanningDiskCapacity.html

Avoid sstable sizes to grow into the sky w/STCS

Got a DSC 2.1.15 14-node cluster, which is using STCS and it seems to be hovering around what seems a stable number of sstables even as we insert more and more data, so currently starting to see sstables data files in the excess of +1TB. See graphs:
Reading this we fear that having too large file sizes, might postpone compacting tombstones to finally release space as we'll have to wait for at least 4 similar sized sstables to get created.
Every node currently have two data directories each, we were hoping cassandra would spread data across those dirs using space relative equally, but as sstables are growing due to compaction, We fear ending with larger and larger sstables and maybe in one data dir primarily.
Howto possible control this better maybe, LCS or...?
Howto determine a sweet spot for number of sstables vs their sizes?
What affects the number of sstables vs their sizes vs in what data dir they get placed?
Currently few nodes are beginning to look skewed:
/dev/mapper/vg--blob1-lv--blob1 6.4T 3.3T 3.1T 52% /blob/1
/dev/mapper/vg--blob2-lv--blob2 6.6T 545G 6.1T 9% /blob/2
Could we stop a node, merge all keyspace's sstables (they seem uniquely named with an id/seq.# even though spread in two data dirs) into one data dir and expand the underlying volume and restart the node again and thus avoid running out of 'space' when only one data dir FS gets filled?

Cassandra cluster - data density (data size per node) - looking for feedback and advises

I am considering the design of a Cassandra cluster.
The use case would be storing large rows of tiny samples for time series data (using KairosDB), data will be almost immutable (very rare delete, no updates). That part is working very well.
However, after several years the data will be quite large (it wil reach a maximum size of several hundreds of terabytes - over one petabyte considering the replication factor).
I am aware of advice not to use more than 5TB of data per Cassandra node because of high I/O loads during compactions and repairs (which is apparently already quite high for spinning disks).
Since we don't want to build an entire datacenter with hundreds of nodes for this use case, I am investigating if this would be workable to have high density servers on spinning disks (e.g. at least 10TB or 20TB per node using spinning disks in RAID10 or JBOD, servers would have good CPU and RAM so the system will be I/O bound).
The amount of read/write in Cassandra per second will be manageable by a small cluster without any stress. I can also mention that this is not a high performance transactional system but a datastore for storage, retrievals and some analysis, and data will be almost immutable - so even if a compaction or a repair/reconstruction that take several days of several servers at the same time it's probably not going to be an issue at all.
I am wondering if some people have an experience feedback for high server density using spinning disks and what configuration you are using (Cassandra version, data size per node, disk size per node, disk config: JBOD/RAID, type of hardware).
Thanks in advance for your feedback.
Best regards.
The risk of super dense nodes isn't necessarily maxing IO during repair and compaction - it's the inability to reliably resolve a total node failure. In your reply to Jim Meyer, you note that RAID5 is discouraged because the probability of failure during rebuild is too high - that same potential failure is the primary argument against super dense nodes.
In the days pre-vnodes, if you had a 20T node that died, and you had to restore it, you'd have to stream 20T from the neighboring (2-4) nodes, which would max out all of those nodes, increase their likelihood of failure, and it would take (hours/days) to restore the down node. In that time, you're running with reduced redundancy, which is a likely risk if you value your data.
One of the reasons vnodes were appreciated by many people is that it distributes load across more neighbors - now, streaming operations to bootstrap your replacement node come from dozens of machines, spreading the load. However, you still have the fundamental problem: you have to get 20T of data onto the node without bootstrap failing. Streaming has long been more fragile than desired, and the odds of streaming 20T without failure on cloud networks are not fantastic (though again, it's getting better and better).
Can you run 20T nodes? Sure. But what's the point? Why not run 5 4T nodes - you get more redundancy, you can scale down the CPU/memory accordingly, and you don't have to worry about re-bootstrapping 20T all at once.
Our "dense" nodes are 4T GP2 EBS volumes with Cassandra 2.1.x (x >= 7 to avoid the OOMs in 2.1.5/6). We use a single volume, because while you suggest "cassandra now supports JBOD quite well", our experience is that relying on Cassandra's balancing algorithms is unlikely to give you quite what you think it will - IO will thundering herd between devices (overwhelm one, then overwhelm the next, and so on), they'll fill asymmetrically. That, to me, is a great argument against lots of small volumes - I'd rather just see consistent usage on a single volume.
I haven't used KairosDB, but if it gives you some control over how Cassandra is used, you could look into a few things:
See if you can use incremental repairs instead of full repairs. Since your data is an immutable time series, you won't often need to repair old SSTables, so incremental repairs would just repair recent data.
Archive old data in a different keyspace, and only repair that keyspace infrequently such as when there is a topology change. For routine repairs, only repair the "hot" keyspace you use for recent data.
Experiment with using a different compaction strategy, perhaps DateTiered. This might reduce the amount of time spent on compaction since it would spend less time compacting old data.
There are other repair options that might help, for example I've found the the -local option speeds up repairs significantly if you are running multiple data centers. Or perhaps you could run limited repairs more frequently rather than performance killing full repairs on everything.
I have some Cassandra clusters that use RAID5. This has worked fine so far, but if two disks in the array fail then the node becomes unusable since writes to the array are disabled. Then someone must manually intervene to fix the failed disks or remove the node from the cluster. If you have a lot of nodes, then disk failures will be a fairly common occurrence.
If no one gives you an answer about running 20 TB nodes, I'd suggest running some experiments on your own dataset. Set up a single 20 TB node and fill it with your data. As you fill it, monitor the write throughput and see if there are intolerable drops in throughput when compactions happen, and at how many TB it becomes intolerable. Then have an empty 20 TB node join the cluster and run a full repair on the new node and see how long it takes to migrate its half of the dataset to it. This would give you an idea of how long it would take to replace a failed node in your cluster.
Hope that helps.
I would recommend to think about the data model of your application and how to partition your data. For time series data it would probably make sense to use a composite key [1] which consists of a partition key + one or more columns. Partitions are distributed across multiple servers according to the hash of the partition key (depending on the Cassandra Partitioner that you use, see cassandra.yaml).
For example, you could partition your server by device that generates the data (Pattern 1 in [2]) or by a period of time (e.g., per day) as shown in Pattern 2 in [2].
You should also be aware that the max number of values per partition is limited to 2 billion [3]. So, partitioning is highly recommended. Don't store your entire time series on a single Cassandra node in a single partition.
[1] http://www.planetcassandra.org/blog/composite-keys-in-apache-cassandra/
[2] https://academy.datastax.com/demos/getting-started-time-series-data-modeling
[3] http://wiki.apache.org/cassandra/CassandraLimitations

Concurrent writes to cassandra replicas - Is duplication possible?

I have a two machine cluster which is running Cassandra 1.2.6. I am using a keyspace which has a replication factor of 2. But my application demands me to write to both the replicas in parallel and also let the Cassandra do the replication and hoping that Cassandra does not duplicate the key/value on the replica nodes.
For example:
I have nodes Node1 and Node2. I have a keyspace which has replication factor 2 configured on it and a column family to push key/value pairs
I use a python client (pycassa) to write to the cluster.
A key, "KeyX", hashes to Node1 and Node2. (I find out which key hashes to which servers through the node tool command. (`$nodetool getendpoints KeyspaceName ColumnFamilyName KeyHexString`)
I use a client to write (KeyX, Value) concurrently to the nodes Node1 and Node2. (In the connection pool I give only the specific server name)
When writing, I wait for one write to succeed (to the master). (Consistency level ONE)
Now, I monitor through the `$nodetool status` command the amount of disk space that the cluster uses.
I write around 100 keys each having 2MB values.
Ideally this should store around 400MB on disk with some overhead for storing keys which should be marginal compared to the value sizes that I using.
Observations:
If I do not write to all the nodes that the key hashes to, Cassandra internally handles replication and the data size is around 400MB. (200MB on each node for 100 keys with 2MB value)
If I do write to all the nodes the key hashes to, Cassandra is writing more than the expected amount of data to the disk. It is as high as 15% more. In our tests Cassandra write ~460MB instead of 400MB.
My question is, is the behavior (15% overhead) expected? Is there any configuration that we need to tweak so that Cassandra properly handles concurrent writes to all the replicas.
Thanks!
There are two possible causes of the 15% extra space that I can think of.
One is because sometimes a replica will store two copies of a column temporarily. If you write a column twice in Cassandra at slightly different times, the two copies may go into separate memtables so end up in separate SSTables on disk. At some point later, when the SSTables get merged through the compaction process, the older value will be discarded, freeing up the space. In your test you could run nodetool compact to force compaction and see if the space usage goes down.
Another possible cause depends on how you did the test when you didn't write to both nodes. If you did this at consistency level ONE, it is possible some of the writes were dropped by the other replica, so it doesn't have all the keys yet. You can be sure it does by running nodetool repair. So the space used in your first observation may not be for all the keys.
You should be aware that writing to all replicas at consistency level ONE does not guarantee that each replica holds a copy. The node that is receiving the data does not have to store it to return success for the write, even if it is a replica. It may be overloaded (in your workload, this would most likely be due to not enough I/O to write the data out) and drop the write, while succeeding in writing it to a different replica. This would cause less space to be used in your second observation, but probably isn't happening in your test since it is a relatively small amount of data.
If you need to guarantee you have two copies you should write at consistency level ALL and only write it once.

Resources