Does the SSTable change when I update a column?

I understand that all insert operations go to the commitlog first and the Memtable second; after that, the Memtable is flushed to an SSTable.
After the flush to an SSTable, I perform an update operation on an existing column.
Must the SSTable be changed to update the value of that column?

http://wiki.apache.org/cassandra/MemtableSSTable says:
Once flushed, SSTable files are immutable; no further writes may be done
Your updated value will be written to commitlog, and a memtable, and will eventually be flushed to a new SSTable on-disk. Later, this SSTable may be merged with other SSTables to form a new larger SSTable (and the old ones will be discarded) - this is the compaction stage.

Answering your follow-up question: yes and no. Your values may end up in different SSTables until they are compacted, but you don't need to wait for a compaction to get the 'true' value, because the various versions of your value are timestamped, so Cassandra can return the most recent one.
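You can see these timestamped versions from cqlsh. A minimal sketch (keyspace and table names here are hypothetical, assuming keyspace ks already exists):

    -- an insert followed by an update: both are just new timestamped writes
    CREATE TABLE ks.users (id int PRIMARY KEY, name text);
    INSERT INTO ks.users (id, name) VALUES (1, 'alice');
    UPDATE ks.users SET name = 'bob' WHERE id = 1;

    -- WRITETIME exposes the timestamp Cassandra uses to pick the most
    -- recent version when merging values from Memtables and SSTables
    SELECT name, WRITETIME(name) FROM ks.users WHERE id = 1;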

Related

How are Cassandra Tombstones deleted in old SSTables?

If I have compaction enabled, like SizeTieredCompaction, my SSTables get compacted until a certain size level is reached. When I "delete" an old entry which is in an SSTable partition that is quite old and won't be compacted again in the near future, when does the deletion take place?
Imagine you delete 100 entries, all part of a really old SSTable that was compacted several times, has no hot data, and is already quite big. It will take ages until it's compacted again and the tombstones are removed, right?
When the tombstone is merged with the data in a compaction, the data will be deleted from disk. When that happens depends on the rate at which new data is being added and on your compaction strategy. The tombstones are not purged until after gc_grace_seconds, to prevent data resurrection (make sure repairs complete within this period of time).
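If your repairs reliably finish sooner, the grace period can be tuned per table. A sketch (table name hypothetical; the default is 864000 seconds, i.e. 10 days):

    -- lower the tombstone grace period to 1 day; only safe if
    -- repairs always complete within that window
    ALTER TABLE my_ks.events WITH gc_grace_seconds = 86400;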
If you overwrite or delete data a lot and are not OK with a lot of obsolete data on disk, you should probably use LeveledCompactionStrategy instead (I would recommend always defaulting to LCS when using SSDs). With STCS it can take a long time for the largest SSTables to get compacted; STCS is better suited to constantly appended data (like logs or events). If the entries expire over time and you rely heavily on TTLs, you will probably want the time-window strategy (TimeWindowCompactionStrategy).
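Changing the strategy is a per-table ALTER. A sketch (keyspace and table names hypothetical; TimeWindowCompactionStrategy requires a Cassandra version that ships it):

    -- overwrite/delete-heavy tables on SSDs
    ALTER TABLE my_ks.users
        WITH compaction = { 'class' : 'LeveledCompactionStrategy' };

    -- TTL-heavy time-series tables
    ALTER TABLE my_ks.metrics
        WITH compaction = { 'class' : 'TimeWindowCompactionStrategy' };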

Is update in place possible in Cassandra?

I have a table in Cassandra where I populate some rows with thousands of entries (each row has 10,000+ columns). The entries in the rows are updated very frequently; basically just one field (an integer) is updated with different values. All other column values remain unmodified. My question is: will the updates be done in place? How good is Cassandra for frequent updates of entries?
First of all, every update is also a sequential write in Cassandra, so as far as Cassandra is concerned it makes no difference whether you update or insert.
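You can see this in CQL itself: INSERT and UPDATE are both upserts, and each simply appends a new timestamped cell. A sketch with a hypothetical table:

    -- if row 42 already exists, these two statements are interchangeable;
    -- neither modifies old data on disk, each writes a new timestamped value
    INSERT INTO my_ks.scores (id, score) VALUES (42, 10);
    UPDATE my_ks.scores SET score = 11 WHERE id = 42;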
The real question is: how soon do you need those writes to be available for reading? As @john suggested, all writes first go to a mutable CQL Memtable, which resides in memory. So every update is essentially appended as a new sequential entry to the Memtable for a particular CQL table. It is also written to the commitlog for durability (with the default periodic sync, the commitlog is flushed to disk every 10 seconds).
When the Memtable is full or the total commitlog size limit is reached, Cassandra flushes all the data to an immutable Sorted String Table (SSTable). After the flush, compaction is the procedure in which, for each partition key, the entries with the new column values are kept and all the previous (pre-update) values are removed.
With frequent flushing comes the overhead of frequent sequential writes to disk plus compaction, which can take a lot of I/O and have a serious impact on Cassandra's performance.
As far as reads go, Cassandra will first try the row cache (if it's enabled) and the Memtable. If it misses there, it will go to the bloom filter, key cache, partition summary, partition index, and finally the SSTable, in that order. When the data has been collected for all the column values, it is aggregated in memory, the column values with the latest timestamps are returned to the client, and an entry is made in the row cache for that partition key.
So yes, when you query a partition key, Cassandra will scan across all the SSTables for that particular CQL table, plus the Memtable for all the column values that have not yet been flushed to disk.
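You can watch this read path yourself with request tracing in cqlsh; a sketch (table name hypothetical, trace wording varies by version):

    TRACING ON;
    SELECT * FROM my_ks.scores WHERE id = 42;
    -- the trace lists the steps taken, e.g. bloom filter checks and a line
    -- like "Merging data from memtables and N sstables"
    TRACING OFF;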
Initially these updates are stored in an in-memory data structure called Memtable. Memtables are flushed to immutable SSTables at regular intervals.
So a single wide row may be read from various SSTables. It is during a process called 'compaction' that the different SSTables are merged into a bigger SSTable on disk.
Increasing the thresholds for flushing Memtables is one way of optimizing. If updates arrive very fast, before the Memtable is flushed to disk, I think the update effectively happens in place in memory (not sure, though).
Also, each read operation checks the Memtables first; if the data is still there, it is simply returned. This is the fastest possible access.
Cassandra read path:
When a read request for a row comes in to a node, the row must be combined from all SSTables on that node that contain columns from the row in question.
Cassandra write path:
(The original answer illustrated the write path with a diagram: writes go to the commitlog and Memtable first and are later flushed to SSTables, as described above.)
No, in place updates are not possible.
As @john suggested, if you have frequent writes, you should delay the flush process. During the flush, the multiple writes to the same partition that are stored in the Memtable will be written as a single partition in the newly created SSTable.
C* is fine for heavy writes. However, you'll need to monitor the number of SSTables accessed per read. If that number is too high, you'll need to review your compaction strategy.
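One way to monitor that number is the per-table histograms in nodetool; a sketch (the command is cfhistograms in Cassandra 2.x and tablehistograms in later versions; keyspace and table names hypothetical):

    # shows read/write latency percentiles plus an "SSTables per read" column
    nodetool cfhistograms my_keyspace my_table

    # on newer versions:
    nodetool tablehistograms my_keyspace my_table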

When does Cassandra remove data from an SSTable

In Cassandra 2.x, when I delete one or multiple columns, they receive a tombstone in the Memtable, but the data is not removed. At some point, the Memtable is flushed to an SSTable, including the deleted data and the tombstone. When compaction runs, it will retain the tombstone for the specified grace period. What happens to the data? I deleted a bunch of columns last week, less than gc_grace_seconds ago. I am not sure compaction has run yet, and I haven't seen any change in the disk size used, so I was wondering: at which point is the data physically removed from disk?
In Cassandra 2.x, when I delete one or multiple columns, they receive a tombstone in the Memtable, but the data is not removed. At some point, the Memtable is flushed to an SSTable, including the deleted data and the tombstone. When compaction runs, it will retain the tombstone for the specified grace period.
True.
What happens to the data?
The data will remain on disk for at least gc_grace_seconds. The next minor compaction after gc_grace_seconds may remove it, but the real timing depends mostly on your dataset and workload type.
I deleted a bunch of columns last week, less than gc_grace_seconds ago. I am not sure compaction has run yet, and I haven't seen any change in the disk size used, so I was wondering: at which point is the data physically removed from disk?
If you want to free some disk space, you can:
wait for gc_grace_seconds and the normal minor compactions, or
run nodetool compact, which triggers a major compaction on the current node, freeing disk space right away. For example:
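A sketch (keyspace and table names hypothetical). Note that with SizeTieredCompactionStrategy a major compaction leaves one very large SSTable, which may itself take a long time to be compacted again:

    # major compaction of one table on this node
    nodetool compact my_keyspace my_table

    # or the whole keyspace
    nodetool compact my_keyspace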

Retrieving "tombstoned" records in Cassandra

My question is very simple. Is it in any way possible to retrieve columns that have been marked with a tombstone before the GCGraceSeconds period expires (default 10 days)? If yes, what would be the exact CQL query for that?
If I understand the deletion process correctly, the tombstones are marked in the Memtables, and the SSTable, being immutable, still holds the deleted data while waiting for compaction. So before compaction occurs, is there any way to read the tombstoned data from either the Memtable or the SSTable?
Using CQL 3.0 on CQLSH command prompt & Cassandra 2.0.
You are right, when a tombstone is inserted it usually doesn't immediately delete the underlying data (unless all your data is in a memtable). However, you can't control when it does. If you don't have much data and compaction happens quickly, the underlying data may be deleted very quickly, much sooner than 10 days.
There is no query to read deleted data, but you can inspect all your SSTables with sstable2json to see if they contain the deleted data.
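A sketch of such an inspection (paths and file name are illustrative; sstable2json ships with Cassandra 2.x and was replaced by sstabledump in 3.x). You may need to run nodetool flush first so recent writes and tombstones reach an SSTable:

    # flush Memtables so recent writes/tombstones land in SSTables
    nodetool flush my_ks my_table

    # dump the SSTable as JSON; deleted cells appear with a "d" marker
    # and a deletion timestamp instead of a live value
    sstable2json /var/lib/cassandra/data/my_ks/my_table/my_ks-my_table-jb-1-Data.db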
Just to add on to the previous answer: keep a low value of gc_grace_seconds for the column families that have frequent deletions. It will take some time for GC, but the tombstones are expected to get cleared.

What does cassandra do during compaction?

I know that Cassandra merges SSTables and row keys, removes tombstones, and so on.
But I am really interested in how it performs the compaction.
As SSTables are immutable, does it copy all the relevant data to a new file, and while writing to this new file, discard the tombstone-marked data?
I know what compaction does, but I want to know how it makes this happen.
I hope this thread helps, provided you follow all the posts and comments in it:
http://comments.gmane.org/gmane.comp.db.cassandra.user/10577
AFAIK:
Whenever a Memtable is flushed from memory to disk, its contents are just appended (not updated) to a newly created SSTable, sorted by row key.
SSTable merging (updating) takes place only during compaction.
Until then, the read path reads from every SSTable containing the key you look up, and the results are merged to build the reply.
There are two types: minor and major.
Minor compaction is triggered automatically whenever a new SSTable is created.
May remove all tombstones.
Compacts SSTables of similar size (initially the Memtable flush size) into one when the minor compaction threshold is reached (4 by default).
Major compaction is triggered manually using nodetool.
Can be applied to one column family at a time.
Compacts all the SSTables of a CF into one.
Compacts the SSTables and marks the old, unneeded SSTables as deleted; GC takes care of freeing up that space.
There are two ways to run compaction:
A - Minor compaction. Runs automatically.
B - Major compaction. Run manually.
In both cases compaction takes x files (per CF) and processes them. During this process it marks rows with an expired TTL as tombstones and deletes the existing tombstones, generating a new file. The tombstones generated in this compaction will be deleted in the next compaction (once the grace period, gc_grace, has passed).
The difference between A and B is the quantity of files taken and the resulting file:
A takes a few similar files (of similar size) and generates a new file.
B takes ALL the files and generates only one big file.
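A way to observe the difference on a node (paths are illustrative and SSTable file naming varies by version):

    # list the SSTable data files for a table before compaction
    ls /var/lib/cassandra/data/my_ks/my_table/*Data.db

    # trigger a major compaction (B), then list again: with SizeTiered
    # compaction, all the table's SSTables are rewritten into one big file
    nodetool compact my_ks my_table
    nodetool compactionstats
    ls /var/lib/cassandra/data/my_ks/my_table/*Data.db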
