Cassandra as session store under heavy load

I would like to use Cassandra to store session-related information. I do not have real HTTP sessions - it's a different protocol, but the same concept.
Memcached would be fine, but I would additionally like to persist the data.
Cassandra setup:
non-replicated keyspace
single column family, where the key is the session ID and each column within a row stores a single key/value pair - (Map<String,Set<String,String>>)
column TTL = 10 minutes
write CL = ONE
read CL = ONE
2,000 writes/s
5,000 reads/s
Data example:
session1:{ // CF row key
{prop1:val1, TTL:10 min},
{prop2:val2, TTL:10 min},
.....
{propXXX:val3, TTL:10 min}
},
session2:{ // CF row key
{prop1:val1, TTL:10 min},
{prop2:val2, TTL:10 min},
},
......
sessionXXXX:{ // CF row key
{prop1:val1, TTL:10 min},
{prop2:val2, TTL:10 min},
}
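In modern CQL terms, a minimal sketch of such a model might look like the following (the table and column names here are placeholders, not taken from the question); each property cell is written with its own 10-minute TTL:

-- hypothetical CQL sketch of the session store described above
CREATE TABLE sessions (
    session_id text,
    prop       text,
    val        text,
    PRIMARY KEY (session_id, prop)
);

-- every property/value cell carries a 10-minute TTL
INSERT INTO sessions (session_id, prop, val)
VALUES ('session1', 'prop1', 'val1')
USING TTL 600;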
In this case consistency is not a problem, but performance could be, especially disk I/O.
Since the data in my sessions lives for a short time, I would like to avoid storing it on the hard drive - except for the commit log.
I have some questions:
If a column expires in the Memtable before it is flushed to an SSTable, will Cassandra store such a column in the SSTable anyway (flush it to HDD)?
Replication is disabled for my keyspace; in this case storing such an expired column in an SSTable would not be necessary, right?
Each CF has at most 10 columns. In such a case I would enable the row cache and disable the key cache. But I expect my data to still be available in the Memtable, in which case I could disable the whole cache, right?
Any Cassandra configuration hints for such session-store use case would be really appreciated :)
Thank you,
Maciej

Here is what I did - and it works fine:
Set replication_factor to 1 - this disables replication.
Set gc_grace to 0 - this deletes columns on the first compaction. This is fine, since the data is not replicated.
Increase the memtable size and decrease the cache size. We want to read data from the memtable and skip the cache, avoiding flushing data to HDD and then reading it back from HDD into the cache.
Additionally the commit log can be disabled - durable_writes=false.
In this setup, data will be read from the memtable and the cache will not be used. The memtable can allocate enough heap to keep my data until it expires, or even longer.
After flushing data to an SSTable, compaction will immediately remove expired rows, since gc_grace=0.
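A rough CQL sketch of those settings (keyspace and table names are placeholders, and the caching syntax shown is the map form used in Cassandra 2.1 and later):

-- RF = 1 (no replication) and no commit log for this keyspace
CREATE KEYSPACE session_store
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
AND durable_writes = false;

-- drop tombstones on the first compaction and skip both caches
ALTER TABLE session_store.sessions
WITH gc_grace_seconds = 0
AND caching = {'keys': 'NONE', 'rows_per_partition': 'NONE'};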

Considering your use case, if I'm not wrong, you wish to have all your key/value [sessionID => sessionData] pairs in memory, and those values expire every 10 minutes [meaning you don't want persistence].
Then why not try something like Redis, which is an in-memory store?
From the docs:
Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.
Since you don't need replication, Redis's master-slave architecture might not even matter to you.
Redis supports TTL as well.
AFAIK, Cassandra is good for wide, fat rows [more columns, fewer rows] rather than skinny rows [the opposite]. Your use case doesn't seem to fit that.
Regards,
Tamil

Related

Cassandra Key Cache

Cassandra's key cache is a map structure where the key is {sstable_file_descriptor + partition_key} and the value is the partition offset. Why does Cassandra, during a read, check all SSTables (using bloom filters) to see whether the data may be present in each SSTable? Why can't the key cache be keyed as partition_key => sstable_file_descriptor + offset?
It's actually (tableid, indexName, descriptor, partition_key) (KeyCacheKey extends CacheKey). The same partition key can exist in multiple tables, and in multiple SSTables within them. In order to key by just the partition key you would need additional structure, which would mean quite a bit more coordination and contention.
The key cache does not store all keys either, only entries considered likely to get a hit based on the Window TinyLFU algorithm. There are potentially billions of keys in a single table, so it cannot store them all. Absence from the key cache does not guarantee that the data does not exist, so the bloom filter must be checked anyway. Something to note, too: the BF check is in memory and very fast. If the BF passes, it checks the cache next. Before any of this it actually also filters based on the range of columns and tokens within an SSTable, and skips those whose data would be tombstoned based on the min/max timestamps (see queryMemtableAndDiskInternal).

Check table size in cassandra historically

I have a Cassandra table (Cassandra version 2.0) with terabytes of data; here is what the schema looks like:
CREATE TABLE my_table (
    key ascii,
    timestamp bigint,
    value blob,
    PRIMARY KEY ((key), timestamp)
)
I'd like to delete some data, but first I want to estimate how much disk space it will reclaim.
Unfortunately, stats from JMX metrics are only available for the last two weeks, so that's not very useful.
Is there any way to check how much space is used by a certain set of data (for example, where timestamp < 1000)?
I was also wondering if there is a way to check the size of a query result set, so that I can do something like select * from my_table where timestamp < 1000 and see how many bytes the result occupies.
There is no mechanism to see the on-disk size of a given slice of data; it can be pretty far removed from the coordinator of the request, and there are layers that affect it, like compression and multiple SSTables, which make this difficult.
Also be aware that issuing a delete will not immediately reduce disk space. C* does not delete data in place; the SSTables are immutable and cannot be changed. Instead it writes a tombstone entry that will disappear after gc_grace_seconds. When SSTables are merged, the tombstone + data combine into just the tombstone. Once it is past gc_grace_seconds, the tombstone will no longer be copied during compaction.
The gc_grace is there to prevent losing deletes in a distributed system, since until there is a repair (which should be scheduled roughly weekly) there is no absolute guarantee that the delete has been seen by all replicas. If a replica has not seen the delete and you remove the tombstone, the data can come back.
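For illustration, a small sketch (table name from the question, key values made up): the DELETE only writes a tombstone, and the space is reclaimed only after a compaction once gc_grace_seconds has passed.

-- writes a tombstone; disk space is reclaimed only after compaction past gc_grace_seconds
DELETE FROM my_table WHERE key = 'some_key' AND timestamp = 999;

-- gc_grace_seconds is a per-table setting (default 864000 seconds = 10 days)
ALTER TABLE my_table WITH gc_grace_seconds = 864000;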
No, not really.
Using sstablemetadata you can find tombstone drop times, minimum timestamp and maximum timestamp in the mc-####-big-data.db files.
Additionally if you're low on HDD space consider nodetool cleanup, nodetool clearsnapshot and then finally nodetool repair.

How do we track the impact expired entries have on a time series table?

We process the oldest data as it comes into the time-series table, and I take care to make sure that the oldest entries expire as soon as they are processed. The expectation is to have all the deletes at the bottom of the TimeUUID clustering column, so a query will always read a time slot without any deleted entries.
Will this scheme work? Are there any impacts of the expired columns that I should be aware of?
Keeping the TimeUUID as part of the clustering key guarantees the sort order, so the most recent data is returned first.
If Cassandra 3.1 (DSE 5.x) and above:
Now regarding the deletes - avoid manual deletes and use TWCS. Here is how.
Let's say your job processes the data every X minutes, say X = 5 minutes (hopefully less than 24 hours). Set the compaction strategy to TWCS (Time Window Compaction Strategy), and let's assume a TTL of 24 hours:
WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': '1'
};
Now there are 24 buckets created in a day, each with one hour of data. These 24 buckets simply correspond to 24 SSTables (after compaction) in your Cassandra data directory. During the 25th hour, the entire first bucket/SSTable is automatically dropped by TTL. So instead of coding for deletes, let Cassandra take care of the cleanup. The beauty of TWCS is that it expires all the data within an SSTable at once.
Now the reads from your application always go to the most recent bucket, the 24th SSTable in this case, so the reads never have to scan through the tombstones (caused by TTL).
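Putting that together, a sketch of such a table could look like this (table and column names are illustrative, not from the question):

CREATE TABLE sensor_data (
    sensor_id text,
    event_time timeuuid,
    value blob,
    PRIMARY KEY (sensor_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
AND default_time_to_live = 86400   -- 24-hour TTL on every write
AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': '1'
};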
If Cassandra 2.x or DSE 4.x, where TWCS isn't available yet:
A way out until you upgrade to Cassandra 3.1 or above is to use artificial buckets. Say you introduce a time-bucket variable as part of the partition key and set the bucket value to the date and hour. This way each partition is different, and you can adjust the bucket size to match the job processing interval.
When you delete, only the processed partition gets deleted and it will not get in the way while reading unprocessed ones, so scanning tombstones can be avoided.
It's additional effort on the application side to write to the correct partition based on the current date/time bucket, but it's worth it in a production scenario to avoid tombstone scans.
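A sketch of such a bucketed table (names are illustrative), where the application writes the current date/hour into the bucket column:

CREATE TABLE sensor_data_bucketed (
    sensor_id text,
    bucket text,               -- e.g. '2017-06-01-13' (date plus hour), set by the application
    event_time timeuuid,
    value blob,
    PRIMARY KEY ((sensor_id, bucket), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- a processed bucket can then be removed with a single partition-level delete
DELETE FROM sensor_data_bucketed WHERE sensor_id = 'sensor-1' AND bucket = '2017-06-01-12';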
You can use TWCS to easily manage expired data, and filter by some timestamp column at query time to ensure that your query always gets the latest results.
How do you "take care" of the oldest entries' expiry? Cassandra will not show records with an expired TTL, but they will persist in SSTables until the next compaction of that SSTable. If you are deleting the rows yourself, you can't guarantee that your query will always read the latest records, since Cassandra is eventually consistent; theoretically there can be moments when you read stale data (or many such moments, depending on your consistency settings).

Is update in place possible in Cassandra?

I have a table in Cassandra where I populate some rows with thousands of entries (each row has 10,000+ columns). The entries in the rows are updated very frequently; basically just one field (an integer) is updated with different values. All other column values remain unmodified. My question is: will the updates be done in place? How good is Cassandra at frequent updates of entries?
First of all, every update is also a sequential write for Cassandra, so as far as Cassandra is concerned it makes no difference whether you update or write.
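For example (a sketch with made-up table and column names), both of the following statements are upserts: each simply appends a new cell with a newer timestamp rather than modifying existing data on disk.

-- hypothetical table: PRIMARY KEY ((user_id), item_id), score int
UPDATE scores SET score = 42 WHERE user_id = 'u1' AND item_id = 'i1';
INSERT INTO scores (user_id, item_id, score) VALUES ('u1', 'i1', 42);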
The real question is how quickly you need those writes to be available for reading. As @john suggested, all writes first go to a mutable memtable, which resides in memory. So every update is essentially appended as a new sequential entry to the memtable for a particular CQL table. It is also periodically written to the commitlog (every 10 seconds) for durability.
When the memtable is full or the total commitlog size threshold is reached, Cassandra flushes all the data to an immutable Sorted String Table (SSTable). After the flush, compaction is the process where, for each partition key, the newest column values are kept and all the previous values (before the update) are removed.
Frequent flushing brings the overhead of frequent sequential writes to disk and of compaction, which can take a lot of I/O and have a serious impact on Cassandra performance.
As far as reads go, Cassandra will first try to read from the row cache (if it is enabled) or from the memtable. If it fails there, it will go through the bloom filter, key cache, partition summary, partition index and finally the SSTables, in that order. When the data has been collected for all the column values, it is aggregated in memory, the column values with the latest timestamps are returned to the client, and an entry is made in the row cache for that partition key.
So yes, when you query a partition key, it will scan across all the SSTables for that particular CQL table, plus the memtable for the column values that have not been flushed to disk yet.
Initially these updates are stored in an in-memory data structure called the memtable. Memtables are flushed to immutable SSTables at regular intervals.
So a single wide row may end up spread across several SSTables. It is during a process called 'compaction' that the different SSTables are merged into a bigger SSTable on disk.
Increasing the thresholds for flushing memtables is one way to optimize. If updates come in very fast, before the memtable is flushed to disk, I think the update is effectively applied in place in memory, though I am not sure.
Also, each read operation checks the memtables first; if the data is still there, it is simply returned - this is the fastest possible access.
Cassandra read path: when a read request for a row comes in to a node, the row must be combined from all SSTables on that node that contain columns from the row in question.
No, in-place updates are not possible.
As @john suggested, if you have frequent writes then you should delay the flush process. During a flush, the multiple writes to the same partition that are stored in the memtable are written as a single partition in the newly created SSTable.
C* is fine for heavy writes. However, you'll need to monitor the number of SSTables accessed per read. If that number is too high, you'll need to review your compaction strategy.

Difference between Cassandra Row caching and Partition key caching

What is the difference between the row cache and the partition key cache? Do I need to use both from a performance perspective?
I have already read the basic definitions from the DataStax website:
The partition key cache is a cache of the partition index for a
Cassandra table. Using the key cache instead of relying on the OS page
cache saves CPU time and memory. However, enabling just the key cache
results in disk (or OS page cache) activity to actually read the
requested data rows.
The row cache is similar to a traditional cache like memcached. When a
row is accessed, the entire row is pulled into memory, merging from
multiple SSTables if necessary, and cached, so that further reads
against that row can be satisfied without hitting disk at all.
Can anyone elaborate on the use cases? Do I need to implement both?
TL;DR : You want to use Key Cache and most likely do NOT want row cache.
Key cache helps C* know where a particular partition begins in the SSTables. This means that C* does not have to read anything to determine the right place to seek to in the file to begin reading the row. This is good for almost all use cases because it speeds up reads considerably by potentially removing the need for an IOP in the read path.
Row Cache has a much more limited use case. Row cache pulls entire partitions into memory. If any part of that partition has been modified, the entire cache for that row is invalidated. For large partitions this means the cache can be frequently caching and invalidating big pieces of memory. Because you really need mostly static partitions for this to be useful, for most use cases it is recommended that you do not use Row Cache.
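Per table, this maps to the caching option in CQL; a sketch (table name is a placeholder, syntax as in Cassandra 2.1 and later) that keeps the key cache on and the row cache off:

-- cache all partition keys, but do not cache row data
ALTER TABLE my_keyspace.my_table
WITH caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'};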
