Expiration of persistent data in GemFire

When a persistent entry in GemFire expires, is the disk space used by the expired entry immediately available for reuse?

Not exactly. The entry is immediately marked as garbage. GemFire always appends updates to the most recent oplog files; old oplog files are compacted once they are 50% garbage (by default). So your expired entry will be counted as garbage in its old oplog, and that space will be reclaimed when that oplog reaches 50% garbage and gets compacted.
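If you need the space back sooner, you can lower the disk store's compaction threshold so compaction kicks in earlier. A minimal gfsh sketch, assuming a disk store of your own (the name, directory, and threshold below are placeholders, not values from the question):

create disk-store --name=persistentStore --dir=/data/gemfire/store --compaction-threshold=40 --auto-compact=true

Lowering --compaction-threshold trades more frequent compaction I/O for earlier reclamation of garbage such as expired entries.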

Related

How to flush Cassandra CDC changes periodically to disk?

Desired behaviour
I'm trying to configure Cassandra CDC so that the commit log segments are flushed periodically to the cdc_raw directory (let's say every 10 seconds).
Based on the documentation at http://abiasforaction.net/apache-cassandra-memtable-flush/ and https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configCDCLogging.html I found:
memtable_flush_period_in_ms – This is a CQL table property that specifies the number of milliseconds after which a memtable should be flushed. This property is specified on table creation.
and
Upon flushing the memtable to disk, CommitLogSegments containing data for CDC-enabled tables are moved to the configured cdc_raw directory.
Putting those together, I would think that by setting memtable_flush_period_in_ms to 10000, Cassandra flushes its CDC changes to disk every 10 seconds, which is what I want to accomplish.
My configuration
Based on the aforementioned documentation and my configuration, I would expect the memtable to be flushed to the cdc_raw directory every 10 seconds. I'm using the following configuration:
cassandra.yaml:
cdc_enabled: true
commitlog_segment_size_in_mb: 1
commitlog_total_space_in_mb: 2
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
table configuration:
memtable_flush_period_in_ms = 10000
cdc = true
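For reference, those table properties would be set at table creation with CQL along these lines (the keyspace, table, and columns are placeholders, not taken from the question):

CREATE TABLE my_ks.my_table (
    id uuid PRIMARY KEY,
    payload text
) WITH cdc = true
  AND memtable_flush_period_in_ms = 10000;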
Problem
The memtable is not flushed periodically to the cdc_raw directory, but instead gets flushed to the commitlog directory when a certain size threshold is reached.
In detail, the following happens:
When a commit log segment reaches 1 MB, it is flushed to the commitlog directory. There is a maximum of 2 commit logs in the commitlog directory (see the setting commitlog_total_space_in_mb: 2). When this threshold is reached, the oldest commit log file in the commitlog directory is moved to the cdc_raw directory.
Question
How to flush Cassandra CDC changes periodically to disk?
Apache Cassandra's CDC is tricky in its current version.
The commit log is 'global', meaning changes to any table go to the same commit log.
Your commit log segments can (and will) contain logs from tables other than the ones with CDC enabled, including system tables.
A commit log segment is deleted, or moved to the cdc_raw directory, only after every log in that segment has been flushed.
So even if you configure your CDC-enabled table to flush every 10 seconds, there are logs from other tables still in the same commit log segment, which prevents the segment from being moved to the CDC directory.
There is no way to change this behavior other than trying to speed up the process by reducing commitlog_segment_size_in_mb (but be careful not to reduce it below the size of a single write request).
This behavior is improved in the upcoming major version, 4.0: you will be able to read your CDC data as fast as the commit log is synced to disk (so when you are using periodic commit log sync, you can read your changes every commitlog_sync_period_in_ms milliseconds).
See CASSANDRA-12148 for details.
By the way, you set commitlog_total_space_in_mb to 2, which I definitely do not recommend. What you are seeing right now is Cassandra flushing every table whenever your commit log size exceeds this value, to make more space. If Cassandra cannot reclaim commit log space, it will start throwing errors and rejecting writes.
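By way of illustration only, a more conservative starting point could look like the cassandra.yaml fragment below (the values are assumptions to adapt, not recommendations from the original answer; 32 MB segments and 8192 MB of total space are the usual defaults):

cdc_enabled: true
commitlog_segment_size_in_mb: 8        # smaller than the 32 MB default, so segments fill and move to cdc_raw sooner
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_total_space_in_mb: 8192      # keep near the default; a tiny value forces constant flushing

Keep in mind that the maximum mutation size defaults to half the segment size, so very small segments can start rejecting large writes.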

Does Cassandra's commit log have a write amplification problem when placed on SSDs?

AFAIK, when Cassandra does a write, it writes to the memtable as well as appending to the commit log on disk. If the commit log record is very small, too small to fill up an SSD page, won't this cause some fragmentation and write amplification in the long run (after the disk fills up)?
This DataStax article may answer your question:
https://www.datastax.com/dev/blog/updates-to-cassandras-commit-log-in-2-2
Specifically:
Since version 1.1 a feature of the commit log infrastructure in Cassandra has been the ability to reuse segments. This is done in order to reduce fragmentation on the logging drive -- a number of commitlog segments will be kept reserved by the database for overwriting after the data they contain has been flushed, which means that most of the time the commit log will not need to allocate new space in order to write. This does not eliminate all fragmentation that can be caused by the log, as it will continue writing after its space quota has been reached while memtable flushes are in progress, and afterwards it will release the overallocated space. Still, since less space is allocated and freed, there is a lower chance of introducing fragmentation on the drive.

Can key cache objects be moved on off-heap memory?

According to page 11 of the slides, with memtable_allocation_type Cassandra allows keeping memtables and key cache objects in native memory instead of on the JVM heap. But I found no other evidence that memtable_allocation_type can change where the key cache lives.
I'm using apache-cassandra 3.11.3 and am suffering from a low key cache hit rate. Since increasing the key cache size would lead to long GC pauses, is there any way to move the key cache to off-heap memory?
No, right now the key cache is still on the heap.
I wouldn't say that increasing it from the default of 1/20th of the heap (capped at 100 MB) to something higher, like 200-300 MB, will dramatically increase garbage collection times...
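For reference, the on-heap key cache size is controlled in cassandra.yaml; a minimal sketch (the figure is only an example, not a recommendation):

key_cache_size_in_mb: 200      # default is the smaller of 5% of the heap or 100 MB
key_cache_save_period: 14400   # how often (in seconds) the cache is saved to disk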

How are Cassandra Tombstones deleted in old SSTables?

If I have compaction enabled, like SizeTieredCompaction, my SSTables get compacted until a certain size level is reached. When I "delete" an old entry that sits in an SSTable partition which is quite old and won't be compacted again in the near future, when does the deletion actually take place?
Imagine you delete 100 entries and all are part of a really old SSTable that was compacted several times, has no hot data and is already quite big. It will take ages until it is compacted again and the tombstones are removed, right?
When the tombstone is merged with the data in a compaction, the data will be deleted from disk. When that happens depends on the rate at which new data is being added and on your compaction strategy. Tombstones are not purged until after gc_grace_seconds, to prevent data resurrection (make sure repairs complete within this period of time).
If you overwrite or delete data a lot and are not OK with a lot of obsolete data on disk, you should probably use LeveledCompactionStrategy instead (I would recommend defaulting to LCS whenever you are on SSDs). With STCS it can take a long time for the largest SSTables to get compacted; STCS is better suited to constantly appended data (like logs or events). If your entries expire over time and you rely heavily on TTLs, you will probably want TimeWindowCompactionStrategy instead.
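As an illustration of that switch (the keyspace and table names are placeholders, and 864000 seconds is simply the common 10-day default for gc_grace_seconds):

ALTER TABLE my_ks.my_table
  WITH compaction = {'class': 'LeveledCompactionStrategy'}
  AND gc_grace_seconds = 864000;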

physical disk space management of cassandra

Recently I have been looking into Cassandra from our new project's perspective and have learned a lot from this community and its wiki too. But I have not found anything about how updates are managed in terms of physical disk space, though it seems to be very similar to how deleted records are managed using compaction.
Suppose there are 100 records with 5 column values each. When all changes are flushed to disk, all records will be written adjacently. When a delete operation is done, it is first marked in the memtable, and the record is physically removed some time later, as set in the configuration or when the memtable is full, and the compaction process reclaims the space.
Now the question: on one hand, being schemaless, there is no fixed number of columns at the beginning; on the other hand, when the compaction process takes place, does it put records adjacently on disk like a traditional RDBMS to speed up reads? For an RDBMS this is easy, because it can allocate a fixed amount of space according to the declared column data types.
But how exactly does Cassandra place records on disk during compaction (for both updates and deletes) to speed up reads?
One more compaction-related question: when there are no delete queries, but an update query updates an existing record with some variable-length data, or inserts an altogether new column, how does compaction make space available on disk between the already existing rows?
Rows and columns are stored in sorted order in an SSTable. This allows a compaction of multiple SSTables to output a new, sorted SSTable with only sequential disk IO. The new SSTable is written to a new file, in free space on the disks. This process doesn't depend on the number of rows or columns, only on their being stored in sorted order. So yes, in all SSTables (even those resulting from compactions) rows and columns are arranged in sorted order on disk.
What's more, as you hint at in your question, updates are no different from inserts: they do not overwrite the value on disk, but instead get buffered in a memtable and then flushed into a new SSTable. When the new SSTable eventually gets compacted with the SSTable containing the original value, the newer value annihilates the old one, i.e. the old value is not output from the compaction. Timestamps are used to decide which value is newest.
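A tiny CQL illustration of that timestamp rule (the keyspace, table, and columns are made up for the example):

INSERT INTO my_ks.users (id, name) VALUES (1, 'alice');
UPDATE my_ks.users SET name = 'alice-updated' WHERE id = 1;
-- Until compaction merges the underlying SSTables, both versions may sit on disk;
-- reads and compactions keep the cell with the highest write timestamp.
SELECT name, writetime(name) FROM my_ks.users WHERE id = 1;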
Deletes are handled in the same fashion, effectively inserting an "anti-value", or tombstone. The limitation of this process is that it can require significant space overhead. Deletes are effectively 'lazy', so the space doesn't get freed until some time later. Also, while the output of a compaction can be the same size as its input, the old SSTables cannot be deleted until the new one is complete, so this can reduce disk utilisation to 50%.
In the system described above, a new value for an existing key can be a different size from the existing value, without any padding to a pre-determined length, because the new value is not written over the old value on update, but to a new SSTable.

Resources