Cassandra's key cache is a map structure where the key is {sstable_file_descriptor + partition_key} and the value is the partition offset. Why, then, does Cassandra check all sstables during a read (using bloom filters) to see whether the data might be present in each one? Why can't the key cache be structured as partition_key = sstable_file_descriptor + offset?
It's actually (tableid, indexName, descriptor, partition_key) (KeyCacheKey extends CacheKey). The same partition key can exist in multiple tables, and in multiple sstables within them. To key by the partition key alone you would need an additional layer of structure, which would mean quite a bit more coordination and contention.
The key cache does not store everything either, only entries considered likely to get a hit based on the Window TinyLFU algorithm. There are potentially billions of keys in a single table, so it cannot store them all. Absence from the key cache therefore does not mean the key does not exist, so the bloom filter must be checked anyway. Note too that the bloom filter check is in-memory and very fast; if it passes, the cache is checked next. Before any of this, Cassandra also filters candidate sstables based on the range of columns and tokens within each one, and skips sstables whose data would be tombstoned according to their min/max timestamps (see queryMemtableAndDiskInternal).
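For intuition, here is a minimal sketch of that composite key, using made-up class, field, and descriptor names rather than Cassandra's actual KeyCacheKey (which also carries the index name and more state). It shows why keying by the partition key alone could not work, and why a cache miss proves nothing about existence:

    import java.util.HashMap;
    import java.util.Map;

    class KeyCacheSketch {
        // Hypothetical stand-in for Cassandra's KeyCacheKey: the partition key alone
        // is not unique, so the cache is keyed per table and per sstable descriptor.
        record CacheKey(String tableId, String sstableDescriptor, String partitionKey) {}

        public static void main(String[] args) {
            // value = offset of the partition within that particular sstable file
            Map<CacheKey, Long> keyCache = new HashMap<>();

            // The same partition key lives in two sstables of the same table, at
            // different offsets, so a plain partition_key -> offset map could not
            // represent both entries without an extra layer of structure.
            keyCache.put(new CacheKey("users", "nb-12-big", "user42"), 1024L);
            keyCache.put(new CacheKey("users", "nb-19-big", "user42"), 4096L);

            // A miss for a third sstable says nothing about whether the key exists
            // there; the cheap in-memory bloom filter check is still required.
            Long offset = keyCache.get(new CacheKey("users", "nb-27-big", "user42"));
            System.out.println(offset); // null -> fall back to bloom filter / index
        }
    }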
Related
Implementation-wise, how exactly does the memtable (in Cassandra, RocksDB, LevelDB, or any LSM-tree) flush to an SSTable?
I get that a memtable is some sorted data structure, like a red-black tree, but how do we turn that into a file of sorted key/value pairs? Do we iterate through the tree from the smallest key to the largest in a for-loop and insert the data one by one into a memory buffer (in SSTable format), and then write that to disk? Do we use some sort of tree-serialization method (if so, how is that still in SSTable format)? Could we just use a min-heap for the memtable and, when flushing, keep extracting the min element and appending it to the array we flush?
I'm trying to understand the super specific details. I was looking at this file but was having a hard time understanding it: https://github.com/facebook/rocksdb/blob/fbfcf5cbcd3b09b6de0924d3c52a744a626135c0/db/flush_job.cc
You are correct.
The memtable is looped over from smallest key to largest and written out to the file.
In practice other things are written to the file as well, such as bloom filters, sparse seek indices, and other metadata like the entry count and the min/max keys, but the foundation of the file is the section containing all the keys that were previously in the memtable.
You don't need a min-heap, as the data is already sorted in the skiplist.
RocksDB's default memtable is implemented as a skiplist, a multi-level linked list that supports binary-search-style lookups (filling a role similar to a B+ tree). When writing out an SST file, it iterates over all the keys in sorted order.
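To make the flush loop concrete, here is a toy sketch assuming a skiplist-backed memtable and a trivially simple text output format; a real SSTable writer also emits the bloom filter, sparse seek index, and metadata described above, and the class/file names here are invented:

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;

    class MemtableFlushSketch {
        // The memtable: a concurrent skiplist keeps entries sorted by key on insert.
        static final ConcurrentSkipListMap<String, String> memtable = new ConcurrentSkipListMap<>();

        // Flush: iterate in ascending key order and append each entry to the file.
        // A real flush would also write the bloom filter, a sparse seek index, and
        // metadata such as entry count and min/max key, then treat the file as immutable.
        static void flush(Path sstablePath) throws IOException {
            try (BufferedWriter out = Files.newBufferedWriter(sstablePath)) {
                for (Map.Entry<String, String> e : memtable.entrySet()) {
                    out.write(e.getKey() + "\t" + e.getValue());
                    out.newLine();
                }
            }
        }

        public static void main(String[] args) throws IOException {
            memtable.put("banana", "2");
            memtable.put("apple", "1");   // inserted out of order, stored sorted
            memtable.put("cherry", "3");
            flush(Path.of("sstable-demo.txt")); // file contents come out key-sorted
        }
    }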
We use a very simple key-value data model in Cassandra, and our partition key is present in 17 SSTables. I would like to understand how a read works in our concrete case.
If I understand correctly, a general Cassandra read has to search for the newest version of each column in the memtable and in the different SSTables, until it has retrieved all the columns and merged them.
Since SSTables are sorted by time, and our data model is single-column, ideally our read operations should hit only the newest SSTable containing our partition key, since that one will contain all the data.
Will our read operations hit all 17 SSTables, or just the newest one containing the searched partition key?
Cassandra will search all of them, as it isn't sure which columns exist where (DML occurs at the cell level, so different versions of a cell can exist and have to be reconciled). Reads are done at the partition level. However, Cassandra can filter out sstables if it knows the partition key doesn't exist in certain ones. That's why compaction is important for optimal reads: it removes the unnecessary cells.
Will our read operations hit all 17 SSTables, or just the newest one containing the searched partition key?
To add to Jim's answer, Cassandra has something called a bloom filter for this. Essentially, it's a probabilistic structure that can tell you one of two things:
The SSTable might contain the data requested.
OR
The SSTable definitely does not contain the data requested.
This should prevent Cassandra from having to scan all 17 SSTables. My advice would be to run a query with TRACING ON in cqlsh, and it'll tell you just how many SSTables it needed to look through.
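As a rough illustration of how a per-sstable bloom filter behaves, here is a sketch that uses Guava's BloomFilter purely for convenience (Cassandra ships its own implementation, and the key names are invented):

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    class SSTableBloomFilterSketch {
        public static void main(String[] args) {
            // One filter per sstable, sized for the expected number of partitions
            // and the acceptable false-positive rate.
            BloomFilter<String> sstable1 = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);
            BloomFilter<String> sstable2 = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

            sstable1.put("session:abc");   // this partition only ever lands in sstable 1
            sstable2.put("session:xyz");

            // "false" is definitive: that sstable can be skipped entirely.
            // "true" only means "maybe", so the index/data file is still consulted.
            System.out.println(sstable1.mightContain("session:abc")); // true (maybe present)
            System.out.println(sstable2.mightContain("session:abc")); // false (definitely not)
        }
    }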
One of the benefits of Cassandra (or Scylla) is that:
When a table has multiple clustering columns, the data is stored in nested sort order.
https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/whereClustering.html
Because of this I think reading the data back in that same sorted order should be very fast.
If data is written in a different order than the clustering columns specify, when does Cassandra (or Scylla) actually re-order the data?
Is it when the memtables are flushed to SSTables?
What if a memtable has already been flushed, and I add a new record that should be before records in an existing SSTable?
Does it keep the data out of order on disk for a while and re-order it during compaction?
If so, what steps does it take to make sure reads are in the correct order?
Data is always sorted in any given sstable.
When a memtable is flushed to disk, that will create a new sstable, which is sorted within itself. This happens naturally since memtables store data in sorted order, so no extra sorting is needed at that point. Sorting happens on insertion into the memtable.
A read that uses the natural ordering will have to read from all sstables relevant to that read, merging multiple sorted results into one sorted result. This merging happens in memory, on the fly.
Compaction, when it kicks in, will replace multiple sstables with one, creating a merged stream much like a regular read would do.
This technique of storing data is known as a log-structured merge tree.
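A minimal sketch of that merge step, assuming each source (the memtable or an sstable scan) is exposed as an already-sorted iterator; compaction uses the same k-way merge shape but writes the merged stream to a new sstable instead of returning it to the client. Class names here are made up:

    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    class SortedMergeSketch {
        // Wraps one sorted source (a memtable or an sstable scan) and its current head element.
        static final class Head {
            final Iterator<String> it;
            String value;
            Head(Iterator<String> it) { this.it = it; this.value = it.next(); }
            boolean advance() { value = it.hasNext() ? it.next() : null; return value != null; }
        }

        // Merge any number of individually sorted iterators into one sorted stream, on the fly.
        static void mergePrint(List<Iterator<String>> sources) {
            PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> a.value.compareTo(b.value));
            for (Iterator<String> it : sources) if (it.hasNext()) heap.add(new Head(it));
            while (!heap.isEmpty()) {
                Head h = heap.poll();
                System.out.println(h.value);   // emit the next key in global sorted order
                if (h.advance()) heap.add(h);  // refill from the same source
            }
        }

        public static void main(String[] args) {
            // Each list stands in for one sorted sstable.
            mergePrint(Arrays.asList(
                    Arrays.asList("a", "d", "g").iterator(),
                    Arrays.asList("b", "e").iterator(),
                    Arrays.asList("c", "f", "h").iterator()));
            // prints a..h in order
        }
    }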
The data is reordered during compaction.
Basically, any write is just an append, which keeps writes very fast; there are no reads or seeks involved.
When reading data, Cassandra reads from the active memtable and from one or more SSTables; the data is aggregated and the query is satisfied.
Since the data distribution might require accessing a growing number of SSTables, compaction's role is to reorganize the data on disk and eliminate the potential overhead of reading from multiple SSTables. It is worth mentioning that SSTables are immutable: compaction creates new SSTables and the old ones are discarded.
The process is similar in both Scylla and Cassandra.
I have a table in Cassandra where I populate some rows with thousands of entries (each row has 10,000+ columns). The entries in the rows are updated very frequently; basically just one field (an integer) is updated with different values, and all other column values remain unmodified. My question is: will the updates be done in place? How good is Cassandra for frequent updates of entries?
First of all, every update is also a sequential write for Cassandra, so as far as Cassandra is concerned it makes no difference whether you update or insert.
The real question is how quickly those writes need to become available for reading. As @john suggested, all writes first go to a mutable memtable that resides in memory, so every update is essentially appended as a new sequential entry to the memtable of a particular CQL table. The data is also written to the commit log for durability, which by default is synced to disk periodically (every 10 seconds).
When the memtable is full, or the total commit log size threshold is reached, Cassandra flushes all the data to an immutable Sorted String Table (SSTable). Later, compaction keeps only the newest column value for each partition key and removes all the previous (pre-update) values.
Frequent flushing brings the overhead of frequent sequential writes to disk plus compaction, which can consume a lot of I/O and have a serious impact on Cassandra's performance.
As far as reads go, Cassandra will first try the row cache (if it is enabled) and the memtable. Failing that, it consults the bloom filter, key cache, partition summary, partition index, and finally the SSTable, in that order. Once the data for all column values has been collected, it is aggregated in memory, the column values with the latest timestamps are returned to the client, and an entry is made in the row cache for that partition key.
So yes, when you query a partition key, Cassandra will scan all the SSTables for that particular CQL table, plus the memtable for any column values that have not yet been flushed to disk.
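A rough sketch of that timestamp reconciliation, under simplified assumptions (a made-up Cell type, one version per source, no tombstones): the version carrying the newest write timestamp is the one returned to the client.

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    class LastWriteWinsSketch {
        // A cell as it might come back from the memtable or from one sstable.
        record Cell(String column, String value, long writeTimestampMicros) {}

        // Reconcile versions of the same column read from several sources:
        // the cell with the newest timestamp wins.
        static Optional<Cell> reconcile(List<Cell> versions) {
            return versions.stream()
                    .max(Comparator.comparingLong(Cell::writeTimestampMicros));
        }

        public static void main(String[] args) {
            List<Cell> versions = Arrays.asList(
                    new Cell("counter_field", "7",  1_700_000_000_000_000L), // old sstable
                    new Cell("counter_field", "9",  1_700_000_300_000_000L), // newer sstable
                    new Cell("counter_field", "11", 1_700_000_600_000_000L)  // memtable
            );
            System.out.println(reconcile(versions)); // the memtable version, value "11"
        }
    }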
Initially these updates are stored in an in-memory data structure called Memtable. Memtables are flushed to immutable SSTables at regular intervals.
So a single wide row may be read from several SSTables. It is during a process called 'compaction' that the different SSTables are merged into a bigger SSTable on disk.
Increasing the memtable flush thresholds is one possible optimization. If updates arrive very quickly, before the memtable is flushed to disk, I think the update is applied in place in memory, though I'm not sure.
Also, each read operation checks the memtables first; if the data is still there, it is simply returned - this is the fastest possible access.
Cassandra read path:
When a read request for a row comes in to a node, the row must be combined from all SSTables on that node that contain columns from the row in question
Cassandra write path: no, in-place updates are not possible. Writes are only ever appended (to the commit log and the memtable) and later flushed to new SSTables.
As @john suggested, if you have frequent writes then you should delay the flush process. During the flush, the multiple writes to the same partition that are stored in the memtable will be written as a single partition in the newly created SSTable.
C* is fine for heavy writes. However, you'll need to monitor the number of SSTables accessed per read. If that number is too high, you'll need to review your compaction strategy.
I would like to use Cassandra to store session-related information. I do not have real HTTP sessions - it's a different protocol, but the same concept.
Memcached would be fine, but I would like to additionally persist data.
Cassandra setup:
non replicated Key Space
single Column Family, where the key is the session ID and each column within the row stores a single key/value pair - (Map<String,Set<String,String>>)
column TTL = 10 minutes
write CL = ONE
read CL = ONE
2.000 writes/s
5.000 reads/s
Data example:
session1:{ // CF row key
{prop1:val1, TTL:10 min},
{prop2:val2, TTL:10 min},
.....
{propXXX:val3, TTL:10 min}
},
session2:{ // CF row key
{prop1:val1, TTL:10 min},
{prop2:val2, TTL:10 min},
},
......
sessionXXXX:{ // CF row key
{prop1:val1, TTL:10 min},
{prop2:val2, TTL:10 min},
}
In this case consistency is not a problem, but performance could be, especially disk I/O.
Since my session data lives for only a short time, I would like to avoid storing it on the hard drive - except for the commit log.
I have some questions:
If a column expires in the memtable before it is flushed to an SSTable, will Cassandra still store that column in the SSTable (flush it to HDD)?
Replication is disabled for my keyspace, so in this case storing such an expired column in the SSTable would not be necessary, right?
Each CF has at most 10 columns. In that case I would enable the row cache and disable the key cache. But since I expect my data to still be available in the memtable, could I disable the cache entirely?
Any Cassandra configuration hints for such a session-store use case would be really appreciated :)
Thank you,
Maciej
Here is what I did - and it works fine:
Set replication_factor to 1 - this disables replication.
Set gc_grace to 0 - this deletes expired columns on the first compaction, which is fine since the data is not replicated.
Increase the memtable size and decrease the cache size. We want to read data from the memtable and skip the cache, instead of flushing data to HDD and reading it back from HDD into the cache.
Additionally, the commit log can be disabled - durable_writes=false.
In this setup, data will be read from the memtable and the cache will not be used. The memtable can be given enough heap to keep my data until it expires, or even longer.
After the data is flushed to an SSTable, compaction will immediately remove the expired rows, since gc_grace=0.
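For reference, here is one way this setup could be expressed in CQL through the DataStax Java driver; the keyspace, table, and column names are made up for illustration, connection details are left at driver defaults, and the options simply mirror the settings above (replication_factor 1, durable_writes=false, gc_grace_seconds=0, a 10-minute TTL):

    import com.datastax.oss.driver.api.core.CqlSession;

    class SessionStoreSetupSketch {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                // Non-replicated keyspace with the commit log disabled for its tables.
                session.execute(
                    "CREATE KEYSPACE IF NOT EXISTS sessions "
                  + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1} "
                  + "AND durable_writes = false");

                // gc_grace_seconds = 0: expired/deleted data is dropped on the first compaction,
                // which is only safe here because nothing is replicated.
                // default_time_to_live = 600: every column expires after 10 minutes.
                session.execute(
                    "CREATE TABLE IF NOT EXISTS sessions.session_store ("
                  + "  session_id text, prop text, val text, "
                  + "  PRIMARY KEY (session_id, prop)) "
                  + "WITH gc_grace_seconds = 0 AND default_time_to_live = 600");

                // Individual writes can also set or override the TTL per statement.
                session.execute(
                    "INSERT INTO sessions.session_store (session_id, prop, val) "
                  + "VALUES ('session1', 'prop1', 'val1') USING TTL 600");
            }
        }
    }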
Considering your use case, if I'm not wrong you wish to keep all your key-value [sessionID => sessionData] pairs in memory, with the values expiring after 10 minutes [meaning you don't want persistence].
Then why not try something like Redis, which is an in-memory store?
From Doc:
Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
Since you don't need replication, Redis's master-slave architecture might not even matter for you.
Redis supports TTLs as well.
AFAIK Cassandra is good for wide, fat rows [more columns, fewer rows] rather than skinny rows [the opposite]. Your use case doesn't seem to fit that.
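For example, a short sketch with the Jedis client (just one of several Redis clients; the key name and payload are illustrative) showing a session entry that expires after 10 minutes:

    import redis.clients.jedis.Jedis;

    class RedisSessionSketch {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // Store the session payload with a 10-minute TTL (600 seconds).
                jedis.setex("session1", 600, "prop1=val1;prop2=val2");

                System.out.println(jedis.get("session1")); // the stored payload
                System.out.println(jedis.ttl("session1")); // seconds left before expiry
            }
        }
    }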
Regards,
Tamil