Recently I have been looking into Cassandra for our new project and have learned a lot from this community and its wiki. But I have not found anything about how updates are managed in Cassandra in terms of physical disk space, though it seems to be very similar to how record deletes are managed through compaction.
Suppose there are 100 records with 5 column values each. When all changes are flushed to disk, the records are written adjacently. When a delete operation is performed, it is first marked in the memtable, and the record is physically deleted some time later, as set in the configuration or when the memtable is full; the compaction process then reclaims the space.
Now the question is: on one hand, being schemaless, there is no fixed number of columns at the beginning; on the other hand, when the compaction process takes place, does it place records adjacently on disk like a traditional RDBMS to speed up reads? For an RDBMS this is easy, because it allocates a fixed amount of space per the declared column datatypes.
But how exactly does Cassandra place records on disk during compaction (for both updates and deletes) to speed up reads?
One more question related to compaction: when there are no delete queries, but an update query modifies an existing record with some variable-length data, or inserts an altogether new column, how does compaction make space available on disk between the already existing data rows?
Rows and columns are stored in sorted order in an SSTable. This allows a compaction of multiple SSTables to output a new, sorted SSTable with only sequential disk IO. The new SSTable is written to a new file, in free space on the disks. This process doesn't depend on the number of rows or columns, only on their being stored in sorted order. So yes, in all SSTables (even those resulting from compactions) rows and columns are arranged in sorted order on disk.
What's more, as you hint at in your question, updates are no different from inserts: they do not overwrite the value on disk, but instead get buffered in a memtable and then flushed into a new SSTable. When the new SSTable eventually gets compacted with the SSTable containing the original value, the newer value annihilates the old one, i.e. the old value is not output by the compaction. Timestamps are used to decide which value is newest.
Deletes are handled in the same fashion, effectively inserting an "anti-value", or tombstone. The limitation of this process is that it can require significant space overhead. Deletes are effectively 'lazy', so the space doesn't get freed until some time later. Also, while the output of the compaction can be the same size as the input, the old SSTables cannot be deleted until the new one is complete, so this can reduce disk utilisation to 50%.
In the system described above, a new value for an existing key can be a different size from the existing value, without padding to some pre-determined length, because the new value is not written over the old value on update but to a new SSTable.
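The merge described above can be sketched in a few lines of Python. This is purely illustrative (real SSTables are binary files with indexes, bloom filters, and gc_grace handling); here each SSTable is just a sorted list of `(key, timestamp, value)` tuples, the newest timestamp wins, and a tombstone suppresses older values:

```python
import heapq

TOMBSTONE = object()  # sentinel standing in for Cassandra's delete marker

def compact(*sstables):
    """Merge several SSTables (each a sorted list of (key, ts, value))
    into one sorted SSTable, keeping only the newest value per key."""
    merged = {}
    for key, ts, value in heapq.merge(*sstables):
        # later timestamp wins; versions of a key arrive adjacent since input is sorted
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    # drop keys whose newest value is a tombstone
    # (real Cassandra only does this once gc_grace_seconds has passed)
    return [(k, ts, v) for k, (ts, v) in sorted(merged.items()) if v is not TOMBSTONE]

old = [("a", 1, "x"), ("b", 1, "y")]
new = [("a", 2, "x2"), ("b", 2, TOMBSTONE)]
print(compact(old, new))  # [('a', 2, 'x2')]
```

Because both inputs are sorted, `heapq.merge` streams them with sequential access only, which is exactly why compaction needs no random disk IO.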
Related
We are using TWCS for time series data with a default TTL of 30 days and a compaction window size of 1 day.
Unfortunately, there are cases when the incoming data rate gets higher and there is not much disk space left to write it. At the same time, due to budget constraints, adding new nodes to the cluster is not an option. Currently we resort to manually deleting old SSTables, but that is error prone.
What is the best way, in the TWCS case, to make Cassandra delete all records older than a certain date? I mean not creating tombstones in a new SSTable, but actually deleting old records from disk to free up space.
Of course, I can reduce the TTL, but it will affect only new records (so it will help in the long run, but not immediately), and when there is little incoming data, records will be stored for a shorter period than they could be.
Basically, that's the intent of TTLs: to automatically remove the old data. An explicit deletion always creates a tombstone, and that won't work well with TWCS. So right now the solution would be: stop the node, remove old files to free space, start the node, and repeat on all nodes. But you're doing that already.
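One way to make the manual cleanup less error prone is to only remove SSTables whose newest cell is already past the TTL, since those contain nothing but expired data. A minimal sketch of that check, assuming you obtain each SSTable's maximum write timestamp (e.g. from the sstablemetadata tool; the function name here is my own):

```python
from datetime import datetime, timedelta

def safe_to_drop(max_timestamp, ttl=timedelta(days=30), now=None):
    """An SSTable whose newest cell is older than now - TTL contains
    only expired data and can be removed whole without losing live rows."""
    if now is None:
        now = datetime.utcnow()
    return max_timestamp + ttl < now

now = datetime(2023, 6, 1)
print(safe_to_drop(datetime(2023, 4, 1), now=now))   # True: everything expired
print(safe_to_drop(datetime(2023, 5, 20), now=now))  # False: some data still live
```

With 1-day TWCS windows and a 30-day TTL, this effectively means only windows older than about 31 days are candidates for removal.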
One of the benefits of Cassandra (or Scylla) is that:
When a table has multiple clustering columns, the data is stored in nested sort order.
https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/whereClustering.html
Because of this I think reading the data back in that same sorted order should be very fast.
If data is written in a different order than the clustering columns specify, when does Cassandra (or Scylla) actually re-order the data?
Is it when the memtables are flushed to SSTables?
What if a memtable has already been flushed, and I add a new record that should be before records in an existing SSTable?
Does it keep the data out of order on disk for a while and re-order it during compaction?
If so, what steps does it take to make sure reads are in the correct order?
Data is always sorted in any given sstable.
When a memtable is flushed to disk, that will create a new sstable, which is sorted within itself. This happens naturally since memtables store data in sorted order, so no extra sorting is needed at that point. Sorting happens on insertion into the memtable.
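The "sorted on insertion" behaviour can be pictured with a toy memtable built on Python's bisect module (the real memtable is a concurrent skip-list-like structure, but the invariant is the same):

```python
import bisect

class Memtable:
    """Toy memtable: keeps (key, value) pairs sorted as they are inserted."""
    def __init__(self):
        self.entries = []

    def insert(self, key, value):
        # find the insertion point that keeps keys sorted, then splice in
        i = bisect.bisect_left(self.entries, (key,))
        if i < len(self.entries) and self.entries[i][0] == key:
            self.entries[i] = (key, value)  # same key pre-flush: overwrite in memory
        else:
            self.entries.insert(i, (key, value))

    def flush(self):
        # an SSTable is just the sorted entries written out; no extra sort needed
        return list(self.entries)

m = Memtable()
for k in ["c", "a", "b"]:
    m.insert(k, k.upper())
print(m.flush())  # [('a', 'A'), ('b', 'B'), ('c', 'C')]
```

Note that flushing is a plain sequential copy of already-sorted entries, which is why the resulting SSTable is sorted "for free".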
A read, which is using natural ordering, will have to read from all sstables which are relevant for the read, merging multiple sorted results into one sorted result. This merging happens in memory on-the-fly.
Compaction, when it kicks in, will replace multiple sstables with one, creating a merged stream much like a regular read would do.
This technique of storing data is known as a log-structured merge tree.
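The read-time merge can be sketched as a streaming generator over several sorted SSTables: `heapq.merge` yields the combined sorted order without materialising everything, and for duplicate keys the latest timestamp wins. This is a simplification (names are mine, and the real merge works column by column), but it shows the on-the-fly nature of the merge:

```python
import heapq

def read_sorted(*sstables):
    """Stream rows from several sorted SSTables in one global sorted order,
    yielding only the newest version of each key (last-write-wins)."""
    current = None  # newest (key, ts, value) seen so far for the key in progress
    for key, ts, value in heapq.merge(*sstables):
        if current and current[0] != key:
            yield current
            current = None
        if current is None or ts > current[1]:
            current = (key, ts, value)
    if current:
        yield current

s1 = [("a", 1, "old"), ("c", 1, "C")]
s2 = [("a", 2, "new"), ("b", 1, "B")]
print(list(read_sorted(s1, s2)))  # [('a', 2, 'new'), ('b', 1, 'B'), ('c', 1, 'C')]
```

Compaction runs essentially this same generator, except it writes the merged stream out as a new SSTable instead of returning it to a client.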
The data is reordered during the compaction.
Basically, any write is just an append, in order to be very fast. There are no reads or seeks involved.
When reading data, Cassandra is reading from the active memtable and from one or more SSTables. Data is aggregated and the query is satisfied.
Since data distribution might require accessing a growing number of SSTables, compaction has the role to reorganize the data on disk so it will eliminate the potential overhead of reading data from multiple SSTables. It is worth mentioning that SSTables are immutable and new SSTables are created. The old ones are discarded.
The process is similar in both Scylla and Cassandra.
I'm trying to understand how quickly space is reclaimed in Cassandra after deletes. I've found a number of articles that describe tombstoning and the problems this can create when you are doing range queries and Cassandra has to scan through lots of tombstoned rows to find the much more scarce live ones. And I get that you can't set gc_grace_seconds too low or you will have zombie records that can pop up if a node goes offline and comes back after the tombstones disappeared off the remaining machines. That all makes sense.
However, if the tombstone is placed on the key, then it should be possible for the space from the rest of the row data to be reclaimed.
So my question is, for this table:
create table somedata (
    category text,
    id timeuuid,
    data blob,
    primary key ((category), id)
);
If I insert and then remove a number of records in this table and take care not to run into the tombstone+range issues described above and at length elsewhere, when will the space for those blobs be reclaimed?
In my case, the blobs may be larger than the recommended size (1mb I believe) but they should not be larger than ~15mb, which I think is still workable. But it makes a big space difference if all of those blobs stick around for 10 days (default gc_grace_seconds value) vs if only the keys stick around for 10 days.
When I looked I couldn't find this particular aspect described anywhere.
The space will be reclaimed only after gc_grace_seconds has elapsed and a compaction has processed the tombstones; until then, both the keys and the blobs stick around. You'll also need to consider that disk usage may increase further if you have updates (which create different versions of the same record, distinguished by the timestamp of when each was written), and it scales with the replication factor (the number of copies of each record distributed across the nodes).
You will always have trade-offs between fault resilience and disk usage; how you tune your settings (gc_grace_seconds, TTL, replication factor, consistency level) will depend on your use case and the SLAs you need to fulfill.
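The timing can be pictured with a small sketch (function name is my own; times are in seconds):

```python
GC_GRACE_SECONDS = 864_000  # default gc_grace_seconds: 10 days

def purgeable(tombstone_ts, now, gc_grace=GC_GRACE_SECONDS):
    """A tombstone (and the data it shadows, once a compaction brings them
    together) may be dropped only after gc_grace_seconds has elapsed."""
    return now >= tombstone_ts + gc_grace

delete_time = 1_000_000
print(purgeable(delete_time, delete_time + 5 * 86_400))   # False: only 5 days in
print(purgeable(delete_time, delete_time + 11 * 86_400))  # True: past 10 days
```

Even once a tombstone is purgeable, the space is only actually freed when a compaction happens to include the SSTables holding it and the shadowed data, so in practice reclamation can lag gc_grace_seconds by some margin.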
I have a table whose rows get overwritten frequently using the regular INSERT statements. This table holds ~50GB data, and the majority of it is overwritten daily.
However, according to OpsCenter, disk usage keeps going up and is not freed.
I have validated that rows are being overwritten and not simply being appended to the table. But they're apparently still taking up space on disk.
How can I free disk space?
Under the covers, what Cassandra does during these writes is append a new row with a newer timestamp. When you perform a read, the newest row (based on timestamp) is returned to you. However, this also means you are using twice the disk space. It is not until Cassandra runs a compaction operation that the older rows are removed and the disk space recovered. Here is some information on how Cassandra writes to disk, which explains the process:
http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_write_path_c.html?scroll=concept_ds_wt3_32w_zj__dml-compaction
A compaction is done on a node-by-node basis and is a very disk-intensive operation which may affect the performance of your cluster while it is running. You can run a manual compaction using the nodetool compact command:
https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsCompact.html
As Aaron mentioned in his comment above overwriting all the data in your cluster daily is not really the best use case for Cassandra because of issues such as this one.
I have a table in Cassandra where I populate some rows with thousands of entries (each row has 10,000+ columns). The entries in the rows are updated very frequently: basically just one field (an integer) is updated with different values, while all other column values remain unmodified. My question is: will the updates be done in place? How good is Cassandra for frequent updates of entries?
First of all, every update is also a sequential write for Cassandra, so as far as Cassandra is concerned it makes no difference whether you update or write.
The real question is: how soon do you need those writes to be available for reading? As #john suggested, all writes first go to a mutable memtable, which resides in memory. So every update is essentially appended as a new sequential entry to the memtable for a particular CQL table. Concurrently, it is also periodically written to the `commitlog` (every 10 seconds) for durability.
When the memtable is full, or the total size limit for the commitlog is reached, Cassandra flushes all the data to an immutable Sorted String Table (SSTable). After the flush, compaction is the procedure in which the entries carrying the new column values are kept for each primary key and all the previous values (from before the update) are removed.
Frequent flushing brings the overhead of frequent sequential writes to disk plus compaction, which can consume a lot of I/O and have a serious impact on Cassandra's performance.
As far as reads go, Cassandra will first try to read from the row cache (if it's enabled) or from the memtable. Failing that, it goes to the bloom filter, key cache, partition summary, partition index, and finally the SSTable, in that order. When the data for all the column values has been collected, it is aggregated in memory, the column values with the latest timestamps are returned to the client, and an entry is made in the row cache for that partition key.
So yes, when you query a partition key, Cassandra will scan across all the SSTables for that particular CQL table, plus the memtable for any column values that have not yet been flushed to disk.
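That per-column aggregation can be pictured like this (a toy sketch; the sources stand in for the memtable and each relevant SSTable, and the function name is my own):

```python
def read_row(key, *sources):
    """Merge one row's columns from several sources (memtable first, then
    SSTables); for each column, the value with the latest timestamp wins."""
    row = {}
    for source in sources:
        for col, (ts, val) in source.get(key, {}).items():
            if col not in row or ts > row[col][0]:
                row[col] = (ts, val)
    return {col: val for col, (ts, val) in row.items()}

# an updated counter in the memtable shadows the older value in the SSTable,
# while the untouched 'name' column still comes from the SSTable
memtable = {"pk1": {"count": (3, 42)}}
sstable1 = {"pk1": {"count": (1, 7), "name": (1, "x")}}
print(read_row("pk1", memtable, sstable1))  # {'count': 42, 'name': 'x'}
```

This is why updating a single integer column per row is cheap to write but makes read cost depend on how many SSTables hold fragments of the row.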
Initially these updates are stored in an in-memory data structure called Memtable. Memtables are flushed to immutable SSTables at regular intervals.
So a single wide row will be read from various SSTables. It is during a process called 'compaction' that the different SSTables are merged into a bigger SSTable on disk.
Increasing the thresholds for flushing memtables is one way to optimize. If updates arrive very quickly, before the memtable is flushed to disk, I think the update happens in place in memory, though I'm not sure.
Also each read operation checks Memtables first, if data is still there, it will be simply returned – this is the fastest possible access.
Cassandra read path:
When a read request for a row comes in to a node, the row must be combined from all SSTables on that node that contain columns from the row in question
No, in place updates are not possible.
As #john suggested, if you have frequent writes then you should delay the flush process. During the flush, the multiple writes to the same partition that are stored in the MemTable will be written as a single partition in the newly created SSTable.
C* is fine for heavy writes. However, you'll need to monitor the number of SSTables accessed per read. If the # is too high, then you'll need to review your compaction strategy.