Why is it so bad to have large partitions in Cassandra?

I have seen this warning everywhere but cannot find any detailed explanation on this topic.

For starters:
The maximum number of cells (rows x columns) in a single partition is 2 billion.
If you allow a partition to grow unbounded, you will eventually hit this limitation.
Beyond that theoretical limit, there are practical limitations tied to the impact large partitions have on the JVM and on read times. These practical limits keep increasing from version to version, and they are not fixed: they vary with data model, query patterns, heap size, and configuration, which makes it hard to give a straight answer on what is too large.
As of 2.1 and the early 3.0 releases, the primary cost on reads and compactions comes from deserializing the index, which marks a row every column_index_size_in_kb. You can increase key_cache_size_in_mb to avoid unnecessary deserialization on reads, but that reduces heap space and fills old gen. You can increase the column index size, but that increases the worst-case IO cost on reads. There are also many different CMS and G1 settings for tuning the impact of the huge spike in object allocations that occurs when reading these big partitions. There are active efforts to improve this, so in the future it might no longer be the bottleneck.
Repairs also only go down to (in the best-case scenario) the partition level. So if, say, you are constantly appending to a partition, and the hash of that partition on 2 nodes is compared at not exactly the same time (a distributed system essentially guarantees this), the entire partition must be streamed over to ensure consistency. Incremental repairs can reduce the impact of this, but you are still streaming massive amounts of data and churning the disk significantly, which will then need to be compacted together unnecessarily.
You can probably keep adding corner cases and scenarios that have issues to this list. Many times large partitions are possible to read, but the tuning and corner cases involved are not really worth it; it is better to just design the data model to be friendly with how Cassandra expects it. I would recommend targeting 100MB, but you can go well beyond that comfortably. Once you get into the GBs you will need to start considering tuning for it (depending on data model, use case, etc.).
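As an illustration of keeping partitions bounded, here is a minimal CQL sketch of time-bucketing the partition key; the table and column names (sensor_readings, sensor_id, day, ...) are hypothetical, and the bucket granularity would depend on your write rate:

    -- Without the day bucket, all readings for a sensor would accumulate in a
    -- single, ever-growing partition.
    CREATE TABLE sensor_readings (
        sensor_id text,
        day       date,       -- bucket column that bounds partition size
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC);

    -- Reads always name the bucket, so each partition stays small:
    SELECT ts, value FROM sensor_readings
    WHERE sensor_id = 'sensor-42' AND day = '2017-06-01';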

Related

What is the difference between scylla read path and cassandra read path?

What is the difference between the Scylla read path and the Cassandra read path? When I stress test Cassandra and Scylla, Scylla's read performance is 5 times poorer than Cassandra's, using 16 cores and a normal HDD.
I expected better read performance on Scylla compared to Cassandra on a normal HDD, because my company doesn't provide SSDs.
Can someone please confirm whether it is possible to achieve better read performance using a normal HDD or not?
If yes, what changes are required in the Scylla config? Please guide me!
Some other responses focused on write performance, but this isn't what you asked about - you asked about reads.
Uncached read performance on HDDs is bound to be poor in both Cassandra and Scylla, because each read from disk requires several seeks on the HDD, and even the best HDD cannot do more than, say, 200 of those seeks per second. Even with a RAID of several of these disks, you will rarely be able to do more than, say, 1000 requests per second. Since a modern multi-core machine can do orders of magnitude more CPU work than is needed for 1000 requests per second, in both the Scylla and Cassandra cases you'll likely see free CPU. So Scylla's main benefit, using much less CPU per request, will not even matter when the disk is the performance bottleneck. In such cases I would expect Scylla's and Cassandra's performance (I am assuming that you're measuring throughput when you talk about performance?) to be roughly the same.
If you're still seeing better throughput from Cassandra than from Scylla, several details may explain why, beyond the general client misconfiguration issues raised in other responses:
If you have a small amount of data that fits in memory, Cassandra's caching policy may be better for your workload. Cassandra uses the OS's page cache, which reads whole disk pages and may cache multiple items, as well as multiple index entries, in one read. Scylla works differently and has a row cache, caching only the specific data read. Scylla's caching is better for large volumes of data that do not fit in memory, but much worse when the data can fit in memory, until the entire data set has been cached (after everything is cached, it becomes very efficient again).
On HDDs, the details of compaction are very important for read performance: if in one setup you have more sstables to read, it increases the number of disk reads and lowers performance. This can change depending on your compaction configuration, or even randomly (depending on when compaction last ran). You can check whether this explains your performance issues by doing a major compaction ("nodetool compact") on both systems and checking the read performance afterwards. You can also switch the compaction strategy to LCS to ensure that random-access read performance is better, at the cost of more write work (on HDDs, this can be a worthwhile compromise).
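If you want to try the LCS suggestion, the strategy is set per table on both Cassandra and Scylla with CQL along these lines; this is only a sketch, and the keyspace/table name ks.mytable is hypothetical:

    -- Switch one table to LeveledCompactionStrategy (sketch):
    ALTER TABLE ks.mytable
    WITH compaction = {'class': 'LeveledCompactionStrategy'};

After changing the strategy (or before comparing the two systems), a major compaction with "nodetool compact" puts both clusters in a comparable on-disk state.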
If you are measuring scan performance (reading an entire table) instead of reading individual rows, other issues become relevant. As you may have heard, Scylla subdivides each node into shards (each shard is a single CPU). This is fantastic for CPU-bound work, but could be worse for scanning tables which aren't huge, because each sstable is now smaller and the amount of contiguous data you can read before needing to seek again is lower.
I don't know which of these differences, or something else, is causing the performance of your use case to be lower in Scylla, but please keep in mind that whatever you fix, your performance is always going to be bad with HDDs. With SSDs, we've measured in the past more than a million random-access read requests per second on a single node. HDDs cannot come anywhere close. If you really need optimum performance or performance per dollar, SSDs are really the way to go.
There can be various reasons why you are not getting the most out of your Scylla Cluster.
The number of concurrent connections from your clients/loaders may not be high enough, or you may not be using a sufficient number of loaders. In that case, some shards will be doing all the work while others are mostly idle. You want to keep your parallelism high.
Scylla likes to have a minimum of 2 connections per shard (you can see the number of shards in /etc/scylla.d/cpuset.conf).
What's the size of your dataset? Are you reading a large number of partitions or just a few? You might be hitting a hot-partition situation.
I strongly recommend reading the following docs that will provide you more insights:
https://www.scylladb.com/2019/03/27/best-practices-for-scylla-applications/
https://docs.scylladb.com/operating-scylla/benchmarking-scylla/
@Sateesh, I want to add to the answer by @TomerSan that both Cassandra and ScyllaDB utilize the same disk storage architecture (LSM). That means they have roughly the same disk access patterns, because the algorithms are largely the same. LSM trees were built with the idea that instant in-place updates are not necessary; they consist of immutable data buckets that are large contiguous pieces of data on disk. That means less random IO and more sequential IO, for which HDDs work great (not counting the parallelism utilized by modern database implementations).
All the above means that the difference you see is not caused by how those databases use the disk. It must be related to configuration differences and to what happens underneath. Maybe ScyllaDB tries to use more parallelism or compacts more aggressively. It depends.
In order to be able to say anything specific, please share your tests, envs, and configurations.
Both databases use an LSM tree, but Scylla has a thread-per-core architecture on top, and we use O_DIRECT while C* uses the page cache. Scylla also has a sophisticated IO scheduler that makes sure not to overload the disk, and scylla_setup therefore runs a benchmark automatically to tune it. Check its output in io.conf.
There are far more things to review; it is better to send your data to the mailing list. In general, Scylla should perform better in this case as well, but your disk is likely to be the bottleneck in both cases.
As a summary, I would say ScyllaDB and Cassandra have the same read/write path:
memtable, commitlog, sstable.
However, the implementation is very different:
- Cassandra relies on the OS for low-level IO and network (as most DBMSs do)
- ScyllaDB relies on its own library (Seastar) to handle IO and network at a low level, independently of the OS page cache etc. This is why they can provide features such as workload scheduling within the same cluster, which would be very hard to implement in Cassandra.

LeveledCompactionStrategy: what is the impact of tuning sstable_size_in_mb?

To enhance read performance, I try to have fewer underlying SSTables with LCS, so I set sstable_size_in_mb to 1280MB as suggested by some articles, which pointed out that the 160MB default value was picked by the Cassandra core team a long time ago, based on what is by now a pretty old server with only 2GB of RAM. However, my concern is about the implications of a higher value of sstable_size_in_mb.
What I understand is that LCS regularly compacts all the SSTables in L0 together with all the SSTables in L1, replacing the entire content of L1. So each time L1 is replaced, the hardware requirements (CPU/RAM) and write amplification may be higher with a higher value of sstable_size_in_mb: if sstable_size_in_mb = 1280MB, then 10 tables of 1280MB in L1 have to be merged with all L0 tables each time. And maybe there are also implications at higher levels, even if the number of SSTables to replace seems lower (one L1 SSTable is merged with 10 L2 SSTables, and those 10 L2 SSTables are then replaced).
Questions:
Having a higher value of sstable_size_in_mb may increase read performance by lowering the number of SSTables involved in a CQL table. However, what are the other implications of such a higher value (like 1280MB) for sstable_size_in_mb?
With a higher value, is there any corresponding configuration to tune (garbage collector, chunk cache, ...) to allow better performance when compacting those larger SSTables, and to get less GC activity?
A more subjective question: what is the typical value of sstable_size_in_mb you use in your deployments?
To answer your first question, I'd like to quote some original text from Jonathan Ellis in CASSANDRA-5727, from when the community initially looked into sstable_size_in_mb (and subsequently decided on the 160 number).
"larger files mean that each level contains more data, so reads will
have to touch less sstables, but we're also compacting less unchanged
data when we merge forward." (Note: I suspect there was a typo and he meant "we're also compacting more unchanged data when we merge forward", which aligns with what you stated in your second paragraph, and what he meant by larger file impacting "compaction efficiency".)
As for any other implication: it might push the envelope on the LCS node density upper bound, as it would allow much higher density for the same number of SSTables per node.
To answer your second question, compaction does create a lot of churn in the heap, as it creates many short-lived objects from SSTables. Given the much bigger SSTables involved in compaction when you use the 1280MB size, you should pay attention to your gc.log and watch out for "Humongous Allocation" messages (if you use G1GC). If they turn out to happen a lot, you can increase the region size to avoid costly collections of humongous objects by using the -XX:G1HeapRegionSize option.
For your third question, as far as I know, many have used the 160MB default value for a long time, as we don't yet have a comprehensive published analysis of the impact/benefit of larger SSTable sizes benchmarked on modern hardware (I attempted to run some quick tests, but got busy with other things and didn't finish that effort, sorry). However, I do think that if people are interested in achieving higher node density with LCS, this SSTable size is a parameter worth exploring.
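For reference, the setting discussed above is applied per table as a compaction sub-option; a minimal sketch, with the keyspace/table name ks.events being hypothetical:

    -- Raise the target SSTable size for LCS on one table (sketch, not a recommendation):
    ALTER TABLE ks.events
    WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 1280};

    -- If gc.log then shows frequent "Humongous Allocation" messages under G1GC,
    -- a larger -XX:G1HeapRegionSize in jvm.options is the knob mentioned above.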

What is the Impact of ALLOW FILTERING on Cassandra?

According to the official Cassandra blog, ALLOW FILTERING is highly inefficient. But if for some reason one has to use such a query, what would be the impact on other applications that use Cassandra to get data? Would only the thread(s) that are busy fetching rows for my query be slow, or would the whole Cassandra node be slow, and consequently all other applications that get data from Cassandra see slow responses?
It will likely affect the whole node. One problem is that your one query with a limit of 10 will not just read 10 records and return, but (possibly) a LOT of data. It is possible to make efficient ALLOW FILTERING queries, which things like the Spark driver do (token-limited queries per token range, or queries within a partition). I would very strongly recommend not even attempting it, though. It might work at first, but your poor operations team will curse your name.
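To make the difference concrete, here is a minimal sketch; the table ks.readings and its columns are hypothetical:

    -- Unrestricted: may scan the entire table across the whole cluster even
    -- though only 10 rows come back.
    SELECT * FROM ks.readings WHERE temp > 30 LIMIT 10 ALLOW FILTERING;

    -- Restricted to one partition: filtering only has to read that partition's data.
    SELECT * FROM ks.readings
    WHERE device_id = 'd-17' AND temp > 30 LIMIT 10 ALLOW FILTERING;

    -- Token-range restricted (the Spark-driver pattern): each sub-query touches
    -- only one slice of the ring and can be issued with bounded concurrency.
    SELECT * FROM ks.readings
    WHERE token(device_id) > ? AND token(device_id) <= ? AND temp > 30
    ALLOW FILTERING;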
With faster disks, the object allocations (since this is unthrottled) will cause serious GC overhead. This is very similar to the issue seen when using queues or a lot of tombstones: the JVM building and throwing away rows exceeds the allocation rate the garbage collector can keep up with without longer pauses (early promotions, fragmentation in CMS, allocation spikes messing up G1 young-gen ratios).
If the query crosses partitions, as with normal range queries, the coordinator will attempt to estimate the ranges it needs to read and the replicas for them, and fan out with some limited concurrency. It's a rough estimate, because the coordinator only has its own data to extrapolate from, and when the data is further filtered rather than just counted as "number of partitions within range", the estimate is likely to be wrong and to underestimate. Most likely the coordinator will query one range at a time, querying the next replica set's range if the limit isn't met. With vnodes this can be a very long list, and sequentially walking it will likely not complete within the timeout. Luckily this will mostly impact just the one query, but it is still essentially reading the entire dataset off disk from every replica set in the cluster for a single query. If you issue 100 of these per second, the cluster will probably be hosed.

Dedicated commitlog storage vs Read/Write ratio?

We are using SSD disks to provide storage for our cluster, on servers with 30 GB of memory.
There is an argument about the commitlog directory: whether to dedicate an individual disk to it or to keep it on the same disk as the data.
Since we are already using SSD disks, performance should be fine with both the commitlog and the data on the same disk, as there is no mechanical head moving for writes.
However, there is another factor, namely the read/write ratio. How would such a ratio affect write or read performance when we have both the commitlog and the data on the same disk?
With SSDs, when would it become important to dedicate a high-performance disk to the commitlog directory?
A dedicated commitlog device usually makes a lot of sense when you have HDDs, but is less obvious if you're using SSDs.
Even though you asked only whether it makes sense with SSD setups, I will try to give some general hints on the subject, primarily based on my understanding and my own experience. I admit the focus is probably too much on HDDs, but HDDs give a deep insight into how Cassandra works and why backing a commitlog/data directory with an SSD can be a life saver.
Background: IOPS and OPS are not the same thing.
I will start from a (very) distant point: device performance. Here's a starting-point read about storage device performance in general. Even if the article's neutrality is under discussion, it can provide some insight into the general metrics and performance you can expect from some systems. Of course, your mileage may vary, depending on the device (type/brand/model etc...) and how much stress (intended as type of workload) you put on it, but I think it is a good starting point for our discussion here.
The reason I prefer to start from IOPS is that it is the very starting point for understanding storage performance. The C* literature speaks about OPS, Operations Per Second, because people usually don't think in terms of IOPS, especially when looking at stats. This hides a lot of details: the operation size, for starters.
A Cassandra operation usually consists of multiple IO operations. The Cassandra documentation usually refers to spinning disks (even if SSDs are referenced too) and clearly states what happens when performing reads/writes, yet people tend to ignore the fact that when their software stack (which spans from the application down to Cassandra and its data files on the storage) hits the disks, performance decreases by a huge amount simply because they have failed to recognize a random workload, even though "Cassandra is a high-performance etc... etc... etc...".
As an example, looking at the picture in the read path documentation, you can clearly see what data structures are in memory/on disk, and how the SSTable data is accessed. Further, the row cache paragraph says:
... If row cache is enabled, desired partition data is read from the row cache, potentially saving two seeks to disk for the data...
And here's where the catch starts: these two seeks are potentially saved from Cassandra's point of view. This simply means that Cassandra won't make two requests to the storage system: it will avoid requesting the partition index and the data, because everything is already in RAM. But it doesn't really translate to "the storage system will save two IO operations". Indeed, how (generic) data is retrieved from the storage device is a very different thing, and of course depends on how the files are laid out on the disk itself: are you using EXT4, XFS, or something else? Assuming no cache is available (eg for very big data set sizes you can't really cache everything...), looking for a file is IOPS consuming, and this tends to amplify the value of the potentially saved seeks when you have data in RAM, and to amplify the penalty you perceive when it is not.
You can't escape physics: HDDs pay taxes that SSDs don't.
As you already know, the main "problem" (performance-wise) of HDDs is the average seek time, that is, the time the HDD needs to wait on average in order to get a target sector under the heads. Once the sector is under the heads, if the system has to read a bunch of sequential bits, everything is smooth and the throughput is proportional to the rotational speed of the HDD (to be precise, to the tangential speed of the platters under the head, which also depends on the track etc...).
In other terms, HDDs have an average fixed performance tax (the average seek time), and everything after it is almost "free". If an application requests a bunch of sectors that are not "contiguous" (from the disk's point of view; eg a fragmented file is split across multiple sectors, but an application can't really know this), the disk will have to pay the average seek time multiple times, and this fixed tax limits its maximum throughput.
The strongest argument about storage is: every device has its own maximum magic average IOPS number. This number expresses the number of random IOPS the device can perform. You can't force an HDD to do more IOPS on average; it's a physical problem. The OS is usually smart enough to "enqueue" sector requests in an attempt to reduce the seek times, eg ordering by ascending requested sector number (trying to exploit some sequential operations), but nothing will save the performance of a random IO workload. You have X allotted available IOPS and must face your problems with that. No matter what.
You need to take advantage of the allotted IOPS of your device, and you must be wise about how you use them.
Suppose you have an HDD that maxes out at 100 IOPS on average. If your application performs a bunch of small (say 4KB) file reads, you have an application that performs 100 * 4KB reads every second: the throughput will be around 400KB/s (unless some caching is involved, in which case the cache saved you precious IOPS). Astonishing. This is simply because you keep paying the seek time multiple times. If you change your access pattern to something that reads 16MB (contiguous) files, you get a higher throughput because you don't pay the seek time as often; you are exploiting a sequential pattern. What changes under the hood is the request size of each operation.
Now an interesting question is: how are "IOPS" and "request size" related? Can one request of 16MB be counted as one IO operation? And what about a 128MB request? This is indeed a good question. At a lower level, the request size spans from 512 bytes (the minimum sector size) to 128KB (32 * 4K sectors in one request). If the operation is small, its transfer time, the time the disk needs to fetch the data, is also small. Larger request sizes obviously have larger transfer times. However, if you are able to perform 100 IOPS at 4KB, you will probably be able to perform around 80 IOPS at 8KB. The relation can't be linear, because the transfer time depends only on the rotational speed of the disk (and is negligible compared to the seek time), and since you are actually reading from two adjacent sectors, you hit the seek time penalty only once per request. This translates to a throughput of around 400KB/s for 4K requests and around 640KB/s for 8K requests. And so on... The larger the request size, the longer it takes to transfer the data, the fewer IOPS you get, and the higher the throughput. (These are random numbers, pun intended; no measurements done! Just to help you understand. I do think they are in the ballpark, though.)
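Written as a rough formula, the relationship described above (using the same illustrative, unmeasured numbers) is simply:

    \text{throughput} \approx \text{IOPS} \times \text{request size}

    100 \times 4\,\mathrm{KB} = 400\,\mathrm{KB/s}, \qquad 80 \times 8\,\mathrm{KB} \approx 640\,\mathrm{KB/s}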
SSDs don't suffer mechanical penalties, and that's why they are capable of performing much better than HDDs. They have many more IOPS, and their limits come from the onboard electronics, bus connection, etc. Having a higher-IOPS device is a big plus: the extra IOPS can be consumed by applications that are not IOPS friendly, and the user won't notice how badly those applications behave. However, with SSDs too, the request size linearly influences the number of IOPS you can perform. When you see a device rated at 100k IOPS, that figure usually refers to 4K requests; you'll be able to perform only about 6.2k requests per second with 64K requests.
Why does Cassandra have such good read performance even with HDDs, then?
Speaking from a single-node point of view (because, given the performance of a single node, Cassandra scales linearly with the number of nodes in the cluster), the answer lies in the question itself. You only get this good read performance if you model your data in this particular way:
You must fetch all your data with one query only.
Your data must be ordered.
If you can't fetch your data with one query, denormalize in order to retrieve it with one query only.
You fetch a relatively large amount of data on every read.
These are well-known Cassandra modeling rules, but the key point is that these rules really do have a reason to be applied, IOPS-wise. Indeed, these rules allow Cassandra to:
Be a super fast database, because it will just require the partition index and the SSTable offset index of the data: two IO operations in the best case, many more in the worst case.
Be a super fast database, because it will exploit the sequential capabilities of the HDDs and will not stress the IO subsystem by issuing other (random) IO seeks.
Be a super fast database, because, just like point number 1, it only needs a couple of IO operations to retrieve the (denormalized) data.
Be a super fast database, because it will exploit the sequential capabilities of the HDDs for longer.
In other terms, following these basic data modeling rules allows Cassandra to be IOPS friendly when reading data back.
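As a sketch of what following those rules looks like in CQL (the table and column names here are hypothetical): the partition key groups everything one query needs, and the clustering order keeps it sorted on disk, so a read is roughly one seek followed by a sequential scan.

    -- Denormalized per-user timeline: one query, ordered data, a good chunk per read.
    CREATE TABLE user_timeline (
        user_id   text,
        posted_at timestamp,
        author    text,      -- duplicated (denormalized) so no second query is needed
        body      text,
        PRIMARY KEY (user_id, posted_at)
    ) WITH CLUSTERING ORDER BY (posted_at DESC);

    -- One partition, already ordered: ideally the partition index lookup plus one
    -- mostly sequential read from the SSTable.
    SELECT posted_at, author, body FROM user_timeline
    WHERE user_id = 'alice' LIMIT 50;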
What happens if you screw up your data model? Cassandra won't be IOPS friendly, and as a consequence the performance will be predictably horrible. Unless you use an SSD, which has far more IOPS, in which case you won't notice the slowness as much.
What happens if you read/write a small amount of data at a time (eg due to misconfigured flush sizes, a small commit log etc...)? Cassandra won't be IOPS friendly, and as a consequence the performance will be predictably horrible. Unless you use an SSD, which has far more IOPS, in which case you won't notice the slowness as much.
How can a read/write ratio pattern influence performance on a Cassandra node?
Cassandra is a complex system, with different components that interact with each other. I will try to explain, from my point of view, the main points when you put everything on one device only.
Writes/Deletes/Updates in Cassandra are fast because they are simply append-only writes to the CommitLog device. Reads, on the contrary, can be very IOPS consuming. When both CommitLog and Data are on the same physical disk (either HDD or SSD), the read/write paths interact, and they both consume IOPS.
Two important questions are:
How many IO operations does a read (following the read path) consume?
How many IO operations does a write consume?
These are important questions, because you have to remember that your device can perform at most X IOPS, and your system will have to split those X IOPS among these operations.
It is quite difficult to answer the "read" question because, when you request some data, Cassandra needs to locate all the SSTables needed to satisfy the request. Assuming a very big dataset size, where caching is not effective, this implies that the Cassandra read path can be very IOPS hungry. Indeed, if your data is spread across 3 different SSTables, Cassandra will have to locate all of them, and for each SSTable it will follow the read path: read the partition index, and then read the data in the SSTable. That is at least two IO operations per SSTable, because if your filesystem is not "collaborative" enough, locating a file and/or seeking to a file offset could require a few more. In the end, in this example Cassandra is consuming at least six IO operations per read.
Answering the "write" question is also tricky, because compactions and flushes can be triggered, and they will consume a lot of IOPS. Flushes are easy to understand: they write data from memtables to disk with a sequential pattern. Compactions, instead, read data back from different SSTables on disk, and while reading the tables they flush the result out to a new disk file. This is a mixed read/write pattern, and on HDDs it is very disruptive, because it will force the disk to perform multiple seeks.
Mixing percentages: TL;DR
If you have a R/W ratio of 95% reads and 5% writes, a separate CommitLog device can be a waste of resources, because writes will hardly impact your read performance, and you write so rarely that write performance may be considered non-critical.
If you have a R/W ratio of 5% reads and 95% writes, a separate CommitLog device can again be a waste of resources, because reads will hardly impact your write performance, and your read performance will hardly suffer from a bunch of sequential appends on the commitlog.
And finally, if you have a R/W ratio of around 50% reads and 50% writes, a separate CommitLog device is NOT a waste of resources, because every write performed on a dedicated CommitLog device avoids at least two IO operations on the data drive (one to write the commitlog, and one to seek back to continue reading).
Please note that I didn't mention compactions, because independently of your workload, when compaction kicks in, your workload will be disrupted by mixed read/write background operations on different files (consuming disk IOPS all the way), and you will suffer on both reads and writes.
All this should be clear enough for HDDs, because you run out of IOPS very fast, and when you do, you notice it immediately. On SSDs, however, you don't run out of IOPS that fast, but you could if your data consists of a lot of small rows.
The reality is that running out of IOPS on an SSD is very hard, because you'll run out of CPU resources well before (by a wide margin). But once you do, you will see your performance slowly decrease, though the effect won't be as dramatic as in the HDD case. As an example, if you have a 100 IOPS HDD and you run out of IOPS by trying to issue 500 random IO operations, you clearly get a penalty. Calling this penalty P, if you have an SSD with 100k IOPS, to get the same penalty P you would have to issue 500k IOPS, which can be very difficult to do without exhausting CPU or RAM.
In general, when you run out of some type of resource in your system, you need to increase its quantity. The most important thing (to me) is not to run out of IOPS in the "data" part of your Cassandra cluster. In the case of SSD IOPS, it's rare enough that you hit the limit; you'll burn your CPU well before that, I think. But you will hit it if you don't tune your system, or if your workload puts too much stress on the disk subsystem (eg Leveled Compaction). I'd suggest putting an ordinary HDD instead of a high-performance SSD for the commitlog, to save money. But if you have a lot of very small commitlog flushes, an SSD is a complete life saver, because your writers won't suffer the latency of HDDs.
Finally, in my opinion, you should go into pre-production with some sort of real data and check your IOPS requirements. If you have enough room to put the SSD there, don't worry: go and save money. If your system gets too much pressure due to compaction, then having a separate device is suggested. Analyze your commitlog pattern and, if it's not IOPS demanding, put it on a separate disk. Moreover, if you have a virtual environment, you can provision a relatively small commitlog device regardless of other factors. It won't raise the cost of your solution too much.
The actual numbers will depend heavily on the type of workload you have, the configuration you have, etc. You can have a look at Netflix tech blog posts for ballpark numbers, e.g. #1, #2.
Dedicating a disk to the commitlog directory is a sort of scale-up strategy. Cassandra works well with a scale-out approach: you just add more nodes to the cluster to spread the load. The second of the linked articles has a nice graph showing near-linear scalability.

Does having 1000s of CFs lead to OOM in Cassandra?

I have a cluster with many CFs (around 1000, maybe more), and I get OOM errors from time to time on different nodes. We have three Cassandra nodes. Is this expected behavior in Cassandra?
Each table (columnfamily) requires a minimum of 1MB of heap memory, so it's quite possible this is causing some pressure for you.
The best solution is to redesign your application to use less tables; most of the time I've seen this it's because someone designed it to have "one table per X" where X is a customer or a data source or even a time period. Instead, combine tables with a common schema and add a column to the primary key with the distinguishing element.
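A minimal sketch of that approach (the table and column names here are hypothetical): instead of one table per customer, fold the customer identifier into the partition key of a single shared table.

    -- Instead of events_customer_a, events_customer_b, ... (one table per customer):
    CREATE TABLE events (
        customer_id text,      -- the distinguishing element, now part of the primary key
        event_id    timeuuid,
        payload     text,
        PRIMARY KEY (customer_id, event_id)
    );

    -- Per-customer queries still read a single partition:
    SELECT payload FROM events WHERE customer_id = 'customer-a';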
In the short term, you probably need to increase your heap size.
