What is the difference between the row cache and the partition key cache? Do I need to use both from a performance perspective?
I have already read the basic definitions from the DataStax website:
The partition key cache is a cache of the partition index for a
Cassandra table. Using the key cache instead of relying on the OS page
cache saves CPU time and memory. However, enabling just the key cache
results in disk (or OS page cache) activity to actually read the
requested data rows.
The row cache is similar to a traditional cache like memcached. When a
row is accessed, the entire row is pulled into memory, merging from
multiple SSTables if necessary, and cached, so that further reads
against that row can be satisfied without hitting disk at all.
Can anyone elaborate on their areas of use? Do I need to implement both?
TL;DR: You want to use the key cache, and you most likely do NOT want the row cache.
The key cache helps C* know where a particular partition begins in the SSTables. This means that C* does not have to read anything to determine the right place to seek to in the file to begin reading the row. This is good for almost all use cases, because it speeds up reads considerably by potentially removing the need for an IOP in the read path.
The row cache has a much more limited use case. The row cache pulls entire partitions into memory. If any part of that partition has been modified, the cached entry for that partition is invalidated. For large partitions, this means the cache may be frequently caching and invalidating big pieces of memory. Because you really need mostly static partitions for this to be useful, the row cache is not recommended for most use cases.
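To make the per-table knob concrete, here is a minimal CQL sketch (keyspace and table names are hypothetical). The key cache is on by default; the row cache stays off until you both opt the table in and give the cache capacity via row_cache_size_in_mb in cassandra.yaml:
-- Hypothetical table; by default caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'},
-- i.e. key cache only.
CREATE TABLE my_ks.user_profiles (
    user_id uuid PRIMARY KEY,
    name    text,
    email   text
);
-- Opt a mostly-static, read-heavy table into the row cache.
-- This has no effect unless row_cache_size_in_mb > 0 in cassandra.yaml.
ALTER TABLE my_ks.user_profiles
    WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'};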
In this configuration:
64 GB RAM, 16 cores, Linux CentOS, Cassandra 3.1
row_cache_size_in_mb is currently set to zero (cassandra.yaml).
It seems to work well, since the OS page cache is used for caching reads.
So are there any benefits/risks (JVM heap) to increasing this number
versus using the Linux page cache?
The row cache is used only for tables that explicitly enable caching of row data; it is not used by default. The row cache is usually reserved for heavily read data that doesn't change very often; otherwise, changes to the data will incur additional performance overhead from invalidating cached entries and re-populating them from disk. You can read more in the "best practices" series published by DataStax.
Regarding the relationship between the row cache and Linux's buffer cache: the main distinction is that the row cache keeps full rows, potentially assembled from multiple SSTables, while the buffer cache keeps chunks of the SSTable files, which are often compressed, so Cassandra needs to decompress them again and again. Also, if a partition is scattered over multiple SSTables, Cassandra will need to check each of them when reading the row.
It's all about the workload and the application's query pattern.
If your application frequently reads a small (hot) subset of rows, and reads each row in its entirety, enabling the row cache can bring a significant performance benefit by avoiding disk reads. There are row cache hit rate metrics available over JMX that can show how performance varies between row and key cache sizes under your application's load.
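As a side note, if you are on Cassandra 4.0 or later, the cache statistics are also exposed as a virtual table, so you can check hit rates from cqlsh without a JMX client (a sketch; the exact columns vary by version):
SELECT * FROM system_views.caches;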
If you haven't manually configured the row cache, a table description will show the default:
caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
If enabled, the cache size should be proportional to the in-memory size of the row data and its column values over the hot subset. For a rough estimate, use nodetool cfstats: multiply the "Row cache size" (the number of rows in the cache) by the "Compacted row mean size", and sum the results across tables.
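For example (illustrative numbers only): if nodetool cfstats reports a Row cache size of 100,000 rows for a table and a Compacted row mean size of 2 KB, that table alone needs roughly 100,000 × 2 KB ≈ 200 MB of row cache capacity.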
As with any memory allocation, it has an impact on garbage collection, though there are partially or fully off-heap implementation classes available. From the DataStax docs:
row_cache_class_name
Default: disabled. The classname of the row cache provider to use. Valid values: OHCProvider (fully off-heap) or SerializingCacheProvider (partially off-heap).
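As a sketch, the relevant cassandra.yaml lines might look like the following (the size is an illustrative value; the class name is the fully qualified OHC provider that ships with Cassandra):
row_cache_size_in_mb: 200
row_cache_class_name: org.apache.cassandra.cache.OHCProvider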
Since the entire row is cached, it can be expensive. One thing to note: if rows are frequently evicted from the row cache (the size is set too low, or the row data changes frequently), the garbage collector will have more to do.
Bottom line: for ideal row cache use, a small set of rows must be hot, and the row cache provides the most benefit when the entire row is accessed at once. If an off-heap implementation is used, it poses little risk to the heap. In the end, do some load testing and capture latency metrics to determine the cache size that best fits your needs.
I recently went through a tutorial about the key cache and the row cache. Can anyone give me some real-world examples of where these caches have an impact? And what is the impact of increasing these values in the config file?
Using desc table I found this:
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
Your main concern is the memory profile of your application.
The key cache optimises the read path by letting us skip the partition summary and the partition index and go straight to the compression offsets. As for the row cache, if you get a hit, you have your answer and don't need to go down the read path at all.
Key cache - The key cache is on by default, as it only keeps the keys of rows. Keys are typically small relative to the rest of the row, so this cache can hold many entries before it's exhausted.
Row cache - The row cache holds an entire row and is useful when you have a fairly static query pattern. The argument for the row cache is that if you read the same rows over and over, you can keep them in memory rather than going down to the SSTable (storage) level, thus bypassing an expensive seek in the read path. In practice, the memory slowdowns caused by using the row cache in non-optimal use cases have made it an unpopular feature.
So what happens if you fill up the cache? There's an eviction policy, but if you're constantly evicting entries from either cache to make room for new ones, the caches won't be very useful, as the GC-related performance degradation will hurt overall performance.
What about very high cache values? This is where there are better alternatives; more on this below. Making the row cache huge would just lead to GC issues, which, depending on what exactly you're doing, typically cause an overall net loss in performance.
One idea I've seen used relatively well is a caching layer on top of Cassandra, such as Apache Ignite or Memcached. You load hot data into the caching layer for fast reads, and your application writes to the cache layer and then to C* for persistence. These architectures come with plenty of their own headaches, but if you want to cache data for lower query latencies, the C* row cache isn't the best tool for the job.
I'm new to Cassandra and trying to get a better understanding of how the row cache can be tuned to optimize performance.
I came across this article: https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsConfiguringCaches.html
It suggests not touching the row cache unless the read workload is more than 95% of the total, and relying mostly on the operating system's own caching instead.
The default row cache size is 0 in the cassandra.yaml file, so the row cache isn't used at all.
Therefore, I'm wondering how exactly to decide whether to tweak the row cache if needed. Are there any good pointers on this?
What they are saying in this article is that the OS cache is better than the row cache.
Cassandra's row cache is known to be inefficient in most cases. The only case where I'd even start trying is when 95% of your workload is reads and you have a relatively small set of hot rows that are not updated frequently.
I have been reading about Cassandra's row cache, and came across this post: Difference between Cassandra Row caching and Partition key caching
In the newer implementation of the row cache, the whole partition doesn't need to be saved; rather, you can specify the number of rows to save per partition when creating the table. However, what happens when a write request comes in? Does it still invalidate the whole partition, even if only one row in that partition is modified?
The row cache is not recommended for most cases.
And yes, it still invalidates the whole partition.
Tip: Enable a row cache only when the number of reads is much bigger
(rule of thumb is 95%) than the number of writes. Consider using the
operating system page cache instead of the row cache, because writes
to a partition invalidate the whole partition in the cache.
Source:
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsConfiguringCaches.html
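To make the rows_per_partition knob concrete, here is a minimal CQL sketch (keyspace and table names are hypothetical). You can cap the cache at the first N rows of each partition, in clustering order, which suits time-ordered data where the newest rows are the hot ones:
-- Cache at most the first 10 rows (in clustering order) of each partition.
-- A write to the partition still invalidates that partition's cached entry.
ALTER TABLE my_ks.sensor_readings
    WITH caching = {'keys': 'ALL', 'rows_per_partition': '10'};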
My use case expects a heavy read load. There are two possible model design strategies:
1. Tiny rows with the row cache: in this case a row is small enough to fit into RAM, and all of its columns are cached. Read access should be fast.
2. Wide rows with the key cache: wide rows with a large number of columns are too big for the row cache. Accessing a subset of columns requires an HDD seek.
As I understand it, using wide rows is a good design pattern, but then we would need to disable the row cache. So what is the benefit of such wide rows (at least for read access)?
Which approach is better, 1 or 2?
The row cache does not necessarily increase read performance.
With the row cache disabled and the key cache enabled, Cassandra reads data directly from the HDD, jumping straight to the right offset (based on the key cache). In this case, the operating system caches the HDD access.
Cassandra memory-maps its data files, so a file is handled as "read from memory": in reality, the first read goes to the HDD and subsequent reads are served from RAM. Only the already-accessed parts of a file are loaded into RAM (plus 128 KB of read-ahead).
My load tests (3 servers with 8-core Xeon CPUs, 24 GB RAM, and 60 GB of data in Cassandra) showed that the row cache and the file system cache have similar performance, with the OS cache causing lower CPU load.