Cassandra - Limiting Table Size

Is it possible to limit the size of a table in Cassandra, by 'fair means or foul'? So either through the actual expected usage of Cassandra (I can't find anything in the docs) or by something a bit more hacky, like setting disk quotas for the locations storing SSTables or similar.

No, you can't.
By design, C* will vary the amount of disk space used, e.g. during compaction, when saving key/row caches to disk, and for index files, bloom filters, snapshots etc. (all configuration dependent), so it may not just be the data you've inserted that you need to account for. What should be included in or excluded from this hard limit?
There's no C* feature to do what you need, so it probably isn't a good fit for your use case.
As for the disk quotas - try it. See what happens when the limit is reached. I expect (although I don't know for sure) that C* will throw an exception and shut down. Your nodes will probably fall like dominoes as each one reaches its quota, or your client will choke when read/write consistency can't be met.

Related

What is the difference between scylla read path and cassandra read path?

What is the difference between the Scylla read path and the Cassandra read path? When I stress-test Cassandra and Scylla, Scylla's read performance is about 5 times worse than Cassandra's, using 16 cores and a normal HDD.
I expected better read performance from Scylla than from Cassandra on a normal HDD; my company doesn't provide SSDs.
Can someone please confirm whether it is possible to achieve better read performance using a normal HDD?
If yes, what changes are required in the Scylla config? Please guide me!
Some other responses focused on write performance, but this isn't what you asked about - you asked about reads.
Uncached read performance on HDDs is bound to be poor in both Cassandra and Scylla, because each read from disk requires several seeks on the HDD, and even the best HDD cannot do more than, say, 200 of those seeks per second. Even with a RAID of several of these disks, you will rarely be able to do more than, say, 1000 requests per second. Since a modern multi-core machine can do orders of magnitude more CPU work than 1000 requests per second, in both the Scylla and Cassandra cases you'll likely see free CPU. So Scylla's main benefit, using much less CPU per request, will not even matter when the disk is the performance bottleneck. In such cases I would expect Scylla's and Cassandra's performance (I am assuming that you're measuring throughput when you talk about performance?) to be roughly the same.
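To put rough numbers on the seek argument above, here is a small back-of-the-envelope sketch. The function name, the figure of 3 seeks per read and the 8-disk RAID are illustrative assumptions of mine; only the "about 200 seeks per second per disk" figure comes from the answer.

```python
def max_reads_per_second(seeks_per_second_per_disk: float,
                         seeks_per_read: float,
                         disks: int = 1) -> float:
    """Upper bound on uncached reads/s when every read pays for disk seeks.

    Assumes seeks spread perfectly across disks, which real RAID setups
    rarely achieve, so treat the result as an optimistic ceiling.
    """
    return disks * seeks_per_second_per_disk / seeks_per_read


# Assumed values: 3 seeks per read, 8 disks in the array.
single_disk = max_reads_per_second(200, seeks_per_read=3)
raid_array = max_reads_per_second(200, seeks_per_read=3, disks=8)
print(f"single HDD: ~{single_disk:.0f} reads/s, 8-disk RAID: ~{raid_array:.0f} reads/s")
```

Either way the ceiling is set by the disks rather than by the CPU, which is why the CPU-efficiency gap between the two databases does not show up here.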
If, still, you're seeing better throughput from Cassandra than Scylla, there are several details that may explain why, beyond the general client mis-configuration issues raised in other responses:
If you have a small amount of data that can fit in memory, Cassandra's caching policy is better for your workload. Cassandra uses the OS's page cache, which reads whole disk pages and may cache multiple items in one read, as well as multiple index entries. Scylla works differently: it has a row cache that caches only the specific data read. Scylla's caching is better for large volumes of data that do not fit in memory, but much worse when the data can fit in memory, until the entire data set has been cached (after everything is cached, it becomes very efficient again).
On HDDs, the details of compaction are very important for read performance - if in one setup you have more sstables to read, it can increase the number of reads and lower the performance. This can change depending on your compaction configuration, or even randomly (depending on when compaction last ran). You can check whether this explains your performance issues by doing a major compaction ("nodetool compact") on both systems and checking the read performance afterwards. You can also switch the compaction strategy to LCS to improve random-access read performance, at the cost of more write work (on HDDs, this can be a worthwhile compromise).
If you are measuring scan performance (reading an entire table) instead of reading individual rows, other issues become relevant: as you may have heard, Scylla subdivides each node into shards (each shard is a single CPU). This is fantastic for CPU-bound work, but could be worse for scanning tables which aren't huge, because each sstable is now smaller and the amount of contiguous data you can read before needing to seek again is lower.
I don't know which of these differences - or something else - is causing the performance of your use case to be lower in Scylla, but please keep in mind that whatever you fix, your performance is always going to be bad with HDDs. With SSDs, we've measured in the past more than a million random-access read requests per second on a single node. HDDs cannot come anywhere close. If you really need optimum performance or performance per dollar, SSDs are really the way to go.
There can be various reasons why you are not getting the most out of your Scylla Cluster.
The number of concurrent connections from your clients/loaders is not high enough, or you're not using a sufficient number of loaders. In that case, some shards will be doing all the work while others will be mostly idle. You want to keep your parallelism high.
Scylla likes to have a minimum of 2 connections per shard (you can see the number of shards in /etc/scylla.d/cpuset.conf).
What's the size of your dataset? Are you reading a large number of partitions or just a few? You might be hitting a hot partition situation.
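As a quick sanity check on the "2 connections per shard" guideline, the arithmetic can be sketched like this. The function name and the 3-node cluster are hypothetical; the 16 shards assume one shard per core on the 16-core machine from the question.

```python
def min_client_connections(shards_per_node: int, nodes: int,
                           connections_per_shard: int = 2) -> int:
    """Total connections the loaders should open across the whole cluster."""
    return shards_per_node * nodes * connections_per_shard


# Assumed setup: 16 shards per node (one per core), 3 nodes.
print(min_client_connections(shards_per_node=16, nodes=3))
```

If the stress tool opens fewer connections than this, some shards can sit idle regardless of how the disk behaves.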
I strongly recommend reading the following docs that will provide you more insights:
https://www.scylladb.com/2019/03/27/best-practices-for-scylla-applications/
https://docs.scylladb.com/operating-scylla/benchmarking-scylla/
@Sateesh, I want to add to the answer by @TomerSan that both Cassandra and ScyllaDB use the same disk storage architecture (LSM). That means they have broadly the same disk access patterns, because the algorithms are largely the same. LSM trees were built with the idea that it is not necessary to do instant in-place updates. An LSM tree consists of immutable data buckets that are large contiguous pieces of data on disk. That means less random IO and more sequential IO, for which HDDs work well (not counting the parallelism used by modern database implementations).
All of the above means that the difference you see is not caused by a difference in how those databases use the disk. It must be related to configuration differences and what happens underneath. Maybe ScyllaDB tries to use more parallelism or compacts more aggressively. It depends.
In order to be able to say anything specific, please share your tests, envs, and configurations.
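For readers unfamiliar with LSM storage, here is a deliberately tiny toy model of the idea (writes go to an in-memory table, flushes produce immutable sorted tables, reads check the newest data first). It is a sketch for illustration only, not how either database is implemented: the class name and memtable_limit are made up, and real systems add commit logs, bloom filters and indexes so that most sstables are skipped without a disk read.

```python
class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}            # mutable, in memory
        self.sstables = []            # immutable flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one sorted, never-modified table.
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        """Return (value, number_of_sstables_probed)."""
        if key in self.memtable:
            return self.memtable[key], 0
        probed = 0
        for table in reversed(self.sstables):   # newest first
            probed += 1
            if key in table:
                return table[key], probed
        return None, probed

    def compact(self):
        # Merge all tables into one; newer values overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))] if merged else []


db = ToyLSM()
for key, value in [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("a", 10), ("e", 5)]:
    db.put(key, value)
print(db.get("b"))   # has to probe several sstables
db.compact()
print(db.get("b"))   # one sstable left to probe
```

The number of tables probed per read is the part that matters on HDDs, since each probe can cost a seek; that is why the state of compaction affects read throughput so much.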
Both databases use LSM trees, but Scylla has a thread-per-core architecture on top, and we use O_DIRECT while C* uses the page cache. Scylla also has a sophisticated IO scheduler that makes sure not to overload the disk, which is why scylla_setup automatically runs a benchmark to tune it. Check the output of that benchmark in io.conf.
There are far more things to review, so it is better to send your data to the mailing list. In general, Scylla should perform better in this case as well, but your disk is likely to be the bottleneck in both cases.
As a summary, I would say ScyllaDB and Cassandra have the same read/write path:
memtable, commitlog, sstable.
However, the implementation is very different:
- Cassandra relies on the OS for low-level IO and networking (most DBMSs do)
- ScyllaDB relies on its own library (Seastar) to handle IO and networking at a low level, independently of the OS page cache etc. This is why it can provide features such as workload scheduling within the same cluster, which would be very hard to implement in Cassandra.

Where does the idea of a 10MB partition size come from?

I'm doing some data modelling for time series data in Cassandra, and I've decided to implement buckets to regulate my partition sizes and maintain reasonable distribution on my cluster.
I decided to bucketise such that my partitions would not exceed a size of 10MB, as I've seen numerous sources that state this as an ideal partition size, but I can't find any information on why 10MB was chosen. On top of this I can't find anything from DataStax or Apache that mentions this soft 10MB limit at all.
Our data can be requested for large periods of time, meaning lots of partitions will be required to service 1 request if the partition sizes remain at 10MB. I'd rather increase the size of the partitions, and have fewer partitions required to service these requests.
Where does this idea of a 10MB partition size come from? Is it still relevant? What would be so bad if my partitions were 20MB in size? Or even 50MB?
With 10MB referenced in so many places, I feel like there must be something to it. Any information would be appreciated. Cheers.
I think that much of this advice comes from older times, when support for wide partitions wasn't very good - reading data put a lot of pressure on the heap, etc. Since Cassandra 3.0 the situation has improved a lot, but it's still recommended to keep the size on disk under 100 MB.
For example, DataStax planning guide says in section "Estimating partition size":
a good rule of thumb is to keep the maximum number of rows below 100,000 items and the disk size under 100 MB
In recent versions of Cassandra we can go beyond this recommendation, but it is still not advised, although it heavily depends on the access patterns. You can find more information in the following blog post, and this video.
I have seen users with 60+ GB partitions - the system still works, but the data distribution is not ideal, so nodes become "hot" and performance may suffer.
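To turn the rule of thumb quoted above (under 100,000 rows and under 100 MB per partition) into a bucket width for time series, a rough estimate could look like the sketch below. The function name and the example write rates are assumptions for illustration, and avg_row_bytes is a guess at the on-disk size per row; compression and per-cell overhead will shift the real numbers.

```python
def max_bucket_seconds(rows_per_second: float, avg_row_bytes: float,
                       max_rows: int = 100_000,
                       max_bytes: int = 100 * 1024 * 1024) -> float:
    """Longest time bucket that stays under both the row and size limits."""
    by_rows = max_rows / rows_per_second
    by_bytes = max_bytes / (rows_per_second * avg_row_bytes)
    return min(by_rows, by_bytes)


# Assumed workload: one 200-byte reading per second per series.
seconds = max_bucket_seconds(rows_per_second=1, avg_row_bytes=200)
print(f"bucket width: up to ~{seconds / 3600:.1f} hours")
```

With small rows the row-count limit is hit long before the size limit, so for narrow time-series rows a 20 MB or 50 MB partition is mostly a question of how many rows a single query ends up reading.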

memtable_flush_writer significance and uses

Can anyone explain the use case and significance of memtable_flush_writers? In what situations should we tune it away from the default value? I have already read the DataStax docs, but the actual uses and benefits are still not clear to me.
By default, memtable_cleanup_threshold is computed as: 1 / ( memtable_flush_writers + 1)
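The formula above is easy to tabulate. This is just the stated default computed for a few writer counts; the function name is mine and the chosen counts are arbitrary.

```python
def default_memtable_cleanup_threshold(memtable_flush_writers: int) -> float:
    """Default threshold as given above: 1 / (memtable_flush_writers + 1)."""
    return 1 / (memtable_flush_writers + 1)


for writers in (1, 2, 4, 8):
    threshold = default_memtable_cleanup_threshold(writers)
    print(f"memtable_flush_writers={writers} -> cleanup threshold {threshold:.3f}")
```

More flush writers means a lower default threshold, so flushes are triggered when a smaller share of memtable memory is in use.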
There is some guidance in the YAML about how to set this value, as Mehul pointed out. Contrary to that, I would never set it to the number of cores, regardless of whether or not you're using SSDs.
The problems come when memtable_flush_writers is set too high: your node can become overwhelmed with small flushes that trigger compaction. This has the unfortunate side effect of causing your commitlog to fill up, and eventually the node gets to a point where it cannot keep up with the flush frequency.
If that happens, you can force a flush manually using nodetool flush. But if you see your commitlog filling your disk, lowering your memtable_flush_writers is a good thing to try.
Note: As with all tuning-like changes in Cassandra, I'd make incremental changes over time, as opposed to a drastic change. Just to be on the safe side.
memtable_cleanup_threshold : When the total amount of memory used by all non-flushing memtables exceeds this ratio, Cassandra flushes the largest memtable to disk.
memtable_flush_writers : This defines the number of memtable flush writer threads. The threads write to disk (sstables) in parallel. Changing this parameter is suggested when solid-state drives (SSDs) are used.
Note : If your data directories are backed by SSDs, increase this setting to the number of cores.
I hope this solves your query.

Cassandra, removing old, not needed data

I have a two-node Cassandra cluster, with RF of 2. So both nodes contain 100% of data.
Now, I am running short on disk space. I can remove some old data, since they were aggregated and processed before, and I don't need them anymore.
I tried running a delete query from cqlsh, but I get a timeout. I tried increasing timeouts, but it seems that running a query from cqlsh will take much more time.
How can I disable this timeout for a single query or connection? Is there any other way, besides increasing timeout, to remove some data from a node?
My Cassandra version is 3.11.0.
PS. I increased write_request_timeout_in_ms in cassandra.yaml. Is this the correct one for delete queries?
Deletes really shouldn't time out unless there is a problem related to something else. A delete just inserts a tombstone, with no reads or anything, and should be fast/cheap regardless of what exists already. Reads, on the other hand, can be impacted a lot. I would guess GC problems related to reads. You could check the GC logs and maybe increase the heap and reduce CMSInitiatingOccupancyFraction (if using CMS and not G1).
So check the GC and normal logs for issues (look for WARN and ERROR in the system log) and for pause times >1 second in the GC logs; there should be none.
After issuing the delete you could try to force a compaction (nodetool compact keyspace table) to see if it helps with disk space. The delete by itself will not reduce disk space until the data has been compacted with the tombstone.
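The point about disk space is easier to see in a toy model: a delete adds a tombstone marker to a new sstable, and only a compaction that merges it with the old data actually drops anything. This sketch uses made-up names and plain dicts as "sstables", and it ignores that real Cassandra also keeps tombstones around until gc_grace_seconds has passed.

```python
TOMBSTONE = object()   # marker written by a delete


def cells_on_disk(sstables):
    """Count stored entries, tombstones included."""
    return sum(len(table) for table in sstables)


def compact(sstables):
    """Merge sstables (oldest first) and drop keys whose newest entry is a tombstone."""
    merged = {}
    for table in sstables:
        merged.update(table)
    return {key: value for key, value in merged.items() if value is not TOMBSTONE}


old_data = {"row1": "x", "row2": "y"}
after_delete = [old_data, {"row1": TOMBSTONE}]   # the delete only adds data

print(cells_on_disk(after_delete))               # more entries than before the delete
print(cells_on_disk([compact(after_delete)]))    # space is reclaimed only now
```

So right after a large delete, disk usage goes up slightly rather than down, which is worth knowing when the node is already short on space.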
write_request_timeout_in_ms is the right setting, but if you're hitting it, something is wrong and you're just masking it. A delete should really take less than 1 millisecond in normal use.
Side note: RF=2 on a 2-node cluster is not how C* is designed to run. You have no availability on a database that sacrificed consistency for high availability.

Cassandra: Storing and retrieving large sized values (50MB to 100 MB)

I want to store and retrieve values in Cassandra that range from 50MB to 100MB in size.
As per documentation, Cassandra works well when the column value size is less than 10MB. Refer here
My table is as below. Is there a different approach to this ?
CREATE TABLE analysis (
    prod_id text,
    analyzed_time timestamp,
    analysis text,
    PRIMARY KEY (prod_id, analyzed_time)
) WITH CLUSTERING ORDER BY (analyzed_time DESC);
From my own experience: although in theory Cassandra can handle large blobs, in practice it can be really painful. In one of my past projects we stored protobuf blobs in C* ranging from 3 KB to 100 KB, but a few of them (~0.001%) were up to 150 MB in size. This caused problems:
Write timeouts. By default C* has a write timeout of only a few seconds (write_request_timeout_in_ms defaults to 2000 ms), which is really not enough for large blobs.
Read timeouts. The same issue with read timeout, read repair, hinted handoff timeouts and so on. You have to debug all these possible failures and raise all these timeouts. C* has to read the whole heavy row to RAM from disk which is slow.
I personally suggest not to use C* for large blobs as it's not very effective. There are alternatives:
Distributed filesystems like HDFS. Store a URL of the file in C* and the file contents in HDFS.
DSE (a commercial C* distro) has its own distributed FS called CFS on top of C*, which can handle large files well.
Rethink your schema so that you have much lighter rows. But this really depends on your current task (and there's not enough information about it in the original question).
Large values can be problematic, as the coordinator needs to buffer each row on the heap before returning it to the client to answer a query. There's no way to stream the analysis value.
Internally Cassandra is also not optimized to handle such use case very well and you'll have to tweak a lot of settings to avoid problems such as described by shutty.
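One common way to get the "much lighter rows" mentioned above is to split each large value into fixed-size chunks on the client and store one chunk per row, for example with a chunk index as a clustering column. That schema, the helper names and the 1 MiB chunk size are my assumptions, not something the answers prescribe; the sketch only shows the client-side split and reassembly, with no driver calls.

```python
def split_into_chunks(blob: bytes, chunk_size: int = 1024 * 1024):
    """Return (chunk_index, chunk_bytes) pairs, each small enough for one row."""
    return [(offset // chunk_size, blob[offset:offset + chunk_size])
            for offset in range(0, len(blob), chunk_size)]


def reassemble(chunks):
    """Rebuild the original value from chunks read back in any order."""
    return b"".join(data for _, data in sorted(chunks))


payload = bytes(range(256)) * 10              # stand-in for a large analysis value
chunks = split_into_chunks(payload, chunk_size=1000)
assert reassemble(reversed(chunks)) == payload
print(f"{len(payload)} bytes stored as {len(chunks)} rows")
```

Each row then stays far below the sizes that cause timeouts, at the cost of the client having to read and stitch several rows per value.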

Resources